Graph Conversion

TreeToGraphConverter converts an ete3.Tree with numeric node attributes into a PyTorch Geometric Data object.

Basic workflow

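from phylognn import TreeToGraphConverter

# `engineer` is the feature-engineering object from the previous step
# (typically a TreeFeatureEngineer) and `tree` is the ete3.Tree it annotated.
converter = TreeToGraphConverter(feature_names=engineer.feature_names)
data = converter.convert(tree, graph_attrs={"tree_id": "example"})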

The converter expects every node to have every requested feature and every feature value to be numeric.
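
A pre-flight check along these lines can surface both problems before conversion. This is a minimal sketch, not the converter's own validation: find_feature_problems is a hypothetical helper, the feature names depth and time_bin are only examples, and treating numbers.Real as the definition of "numeric" is an assumption. It relies only on features being readable as node attributes, which is how ete3 nodes expose them.

```python
from numbers import Real
from types import SimpleNamespace


def find_feature_problems(nodes, feature_names):
    """Return (node_index, feature_name, reason) for every violation found.

    Hypothetical helper, not part of phylognn. `nodes` is any iterable of
    objects that expose features as attributes, for example the nodes
    yielded by an ete3 tree.traverse("preorder") call.
    """
    problems = []
    for index, node in enumerate(nodes):
        for name in feature_names:
            if not hasattr(node, name):
                problems.append((index, name, "missing"))
                continue
            # Assumption: "numeric" means any real number type.
            if not isinstance(getattr(node, name), Real):
                problems.append((index, name, "non-numeric"))
    return problems


# Stand-in nodes so the sketch runs without ete3 installed.
nodes = [
    SimpleNamespace(depth=0.0, time_bin=0),
    SimpleNamespace(depth=1.5),                 # time_bin is missing
    SimpleNamespace(depth="deep", time_bin=1),  # depth is a string
]
problems = find_feature_problems(nodes, ["depth", "time_bin"])
print(problems)
```

An empty result means every node carries every requested feature as a real number.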

When to use it

Use this step after feature engineering and before any model or training code. Keep feature_names explicit when comparing experiments so data.x columns remain stable.

Output fields

data.x

Floating-point node feature matrix with shape [num_nodes, num_features]. Column order is defined by feature_names, commonly TreeFeatureEngineer.feature_names.

data.edge_index

torch.long tensor with shape [2, num_edges]. Tree parent-child relations are included, and bidirectional conversion adds reverse edges.

data.edge_type

torch.long tensor aligned with data.edge_index. Values are 0 for tree edges, 1 for virtual-to-real edges, and 2 for virtual-chain edges.
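
The layout of data.edge_index and data.edge_type for plain tree edges can be sketched with Python lists. This is an illustration under stated assumptions, not the converter's code: build_tree_edges is a hypothetical helper, edges are assumed to point from parent to child, each reverse edge is assumed to sit directly after its forward edge, and reverse edges are assumed to keep edge type 0.

```python
def build_tree_edges(parent_child_pairs, bidirectional=True):
    """Build edge_index-style rows and an aligned edge_type list.

    Hypothetical helper. Each pair is (parent_index, child_index), where the
    indices refer to positions in the graph node order.
    """
    sources, targets = [], []
    for parent, child in parent_child_pairs:
        sources.append(parent)
        targets.append(child)
        if bidirectional:
            # Assumption: the reverse edge directly follows the forward edge.
            sources.append(child)
            targets.append(parent)
    edge_index = [sources, targets]   # two rows, one column per edge
    edge_type = [0] * len(sources)    # 0 marks a tree edge
    return edge_index, edge_type


pairs = [(0, 1), (0, 2), (2, 3)]
edge_index, edge_type = build_tree_edges(pairs)
print(edge_index)  # [[0, 1, 0, 2, 2, 3], [1, 0, 2, 0, 3, 2]]
print(edge_type)   # [0, 0, 0, 0, 0, 0]
```

Passing the nested list to torch.tensor(edge_index, dtype=torch.long) would give a tensor of shape [2, num_edges], here [2, 6].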

data.node_names

Optional list aligned with graph node order when preserve_node_names=True. Original nodes use ETE names, unnamed nodes use an empty string, and virtual nodes use generated names.

data.original_num_nodes

Count of nodes from the original tree before virtual nodes are appended.

Converted data also includes data.virtual_node_mask and data.node_type. User-provided graph_attrs are attached as graph-level attributes. Keys that collide with generated field names such as time_bin are reserved and are rejected with a validation error.

When feature_names includes time_bin, the converter also attaches data.time_bin as a one-dimensional torch.long tensor with one label per final graph node. The labels follow the same row order as data.x. When feature_names does not include time_bin, the converter does not infer or attach data.time_bin.

Virtual nodes

Set add_virtual_nodes=True to add one virtual node per time bin. In this mode, feature_names must include time_bin. Virtual-to-real edges have edge_type=1; virtual-chain edges have edge_type=2. If append_is_virtual_feature=True, the final feature column identifies virtual nodes. If num_time_bins is configured, one virtual node is created for every configured bin, including empty bins. Generated data.time_bin labels for virtual nodes are appended in ascending bin order.
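
The bookkeeping described above can be sketched in plain Python. This is a rough illustration, not the converter's implementation: add_virtual_nodes is a hypothetical helper, the edge direction (virtual node to real node) is assumed, the chain is assumed to link each bin's virtual node to the next bin's in one direction, and bin labels are assumed to be integers inside range(num_time_bins) when that setting is given.

```python
def add_virtual_nodes(time_bins, num_time_bins=None):
    """Sketch virtual-node indices, edges, and labels.

    Hypothetical helper: time_bins[i] is the integer bin of real node i.
    """
    num_real = len(time_bins)
    if num_time_bins is None:
        bins = sorted(set(time_bins))      # only bins that actually occur
    else:
        bins = list(range(num_time_bins))  # every configured bin, even empty ones
    # Virtual nodes are appended after the real nodes, in ascending bin order.
    virtual_index = {b: num_real + offset for offset, b in enumerate(bins)}

    edges, edge_type = [], []
    # Type 1: the virtual node of a bin connects to each real node in that bin.
    for node, b in enumerate(time_bins):
        edges.append((virtual_index[b], node))
        edge_type.append(1)
    # Type 2: virtual nodes of consecutive bins are chained (assumed layout).
    for earlier, later in zip(bins, bins[1:]):
        edges.append((virtual_index[earlier], virtual_index[later]))
        edge_type.append(2)

    # Labels for the final node order: real nodes first, then virtual nodes.
    all_time_bins = list(time_bins) + bins
    return edges, edge_type, all_time_bins


edges, edge_type, all_time_bins = add_virtual_nodes([0, 0, 2, 2], num_time_bins=3)
print(edges)          # [(4, 0), (4, 1), (6, 2), (6, 3), (4, 5), (5, 6)]
print(edge_type)      # [1, 1, 1, 1, 2, 2]
print(all_time_bins)  # [0, 0, 2, 2, 0, 1, 2]
```

Bin 1 has no real nodes here, yet it still gets virtual node 5 because num_time_bins=3 covers it.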

Deterministic ordering

Feature order is deterministic when callers pass an ordered sequence such as TreeFeatureEngineer.feature_names. Node order follows the converter traversal strategy, with preorder as the default. Metadata aligned to nodes, including node_names, follows that same order.
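
How one traversal fixes both row order and aligned metadata can be shown on a toy tree. This is a minimal sketch using nested dictionaries instead of ete3 nodes; the preorder function, the tree contents, and the depth and time_bin features are all illustrative.

```python
def preorder(node):
    """Yield a node, then each child's subtree from left to right."""
    yield node
    for child in node["children"]:
        yield from preorder(child)


# Toy tree: root has an unnamed internal child (with leaves A and B) and leaf C.
tree = {"name": "root", "depth": 0.0, "time_bin": 0, "children": [
    {"name": "", "depth": 1.0, "time_bin": 1, "children": [
        {"name": "A", "depth": 2.0, "time_bin": 2, "children": []},
        {"name": "B", "depth": 2.5, "time_bin": 2, "children": []},
    ]},
    {"name": "C", "depth": 1.2, "time_bin": 1, "children": []},
]}

feature_names = ["depth", "time_bin"]  # an ordered sequence fixes the columns
ordered = list(preorder(tree))

# Rows and names come from the same traversal, so they stay aligned.
node_names = [node["name"] for node in ordered]
x_rows = [[float(node[name]) for name in feature_names] for node in ordered]

print(node_names)  # ['root', '', 'A', 'B', 'C']
print(x_rows[3])   # [2.5, 2.0]
```

Passing the names in a different order, for example ["time_bin", "depth"], swaps the columns of every row, which is why an explicit ordered sequence matters when comparing experiments.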

Common validation errors

The converter raises clear errors for missing requested node attributes, non-numeric feature values, duplicate feature names, unsupported traversal strategies, invalid virtual-node settings, and graph attribute names that collide with generated fields.
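
Two of these checks, duplicate feature names and graph attribute collisions, are simple enough to sketch. Everything here is hypothetical: check_names is not a phylognn function, RESERVED_FIELDS is an illustrative guess built from field names mentioned on this page, and ValueError is an assumed exception type.

```python
from collections import Counter

# Illustrative only; the converter's real list of reserved names may differ.
RESERVED_FIELDS = {"x", "edge_index", "edge_type", "time_bin"}


def check_names(feature_names, graph_attrs):
    """Raise ValueError on duplicate features or reserved graph attribute keys."""
    counts = Counter(feature_names)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate feature names: {duplicates}")
    collisions = sorted(set(graph_attrs) & RESERVED_FIELDS)
    if collisions:
        raise ValueError(f"graph_attrs collide with generated fields: {collisions}")


# Valid input passes silently.
check_names(["depth", "time_bin"], {"tree_id": "example"})

try:
    check_names(["depth", "depth"], {})
except ValueError as error:
    duplicate_message = str(error)

try:
    check_names(["depth"], {"time_bin": 3})
except ValueError as error:
    collision_message = str(error)

print(duplicate_message)
print(collision_message)
```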

Saving and loading

Use convert_and_save() for preprocessing pipelines, save_data() to store a PyTorch Geometric Data object, and load_data() to restore it with torch.load. Saved graph files are complete pickled PyTorch objects rather than plain tensors, and unpickling can execute arbitrary code, so load them only from trusted PhyloGNN project outputs.