Datasets And Splits

PhyloGNN training utilities work with PyTorch Geometric Data samples. Build graphs first, attach each graph target as data.y, then choose an in-memory or disk-backed dataset depending on how many samples you need to manage.

Dataset choices

| Situation | Use | Notes |
| --- | --- | --- |
| Small or generated datasets | SplitPhyloDataset | Keeps Data objects and labels in memory. |
| Preprocessed graph files | SplitPhyloDiskDataset | Reads .pt graphs, with optional mirrored label files. |
| Existing train/validation/test IDs | DatasetSplit.from_manifest_dir() | Loads train.txt, val.txt, and test.txt. |
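
For orientation, a manifest directory of the kind loaded by DatasetSplit.from_manifest_dir() can be pictured as three small text files. The sketch below reads such a directory with the standard library only. It assumes one sample ID per line, which this page does not confirm, and read_manifest_dir is a hypothetical helper written for illustration, not part of PhyloGNN.

```python
from pathlib import Path
import tempfile


def read_manifest_dir(manifest_dir):
    """Hypothetical helper: read train/val/test ID lists (assumed: one ID per line)."""
    splits = {}
    for name in ("train", "val", "test"):
        path = Path(manifest_dir) / f"{name}.txt"
        lines = path.read_text(encoding="utf-8").splitlines()
        # Strip whitespace and skip blank lines.
        splits[name] = [line.strip() for line in lines if line.strip()]
    return splits


# Demo with a throwaway directory and made-up sample IDs.
demo_ids = {
    "train": ["tree_001", "tree_002"],
    "val": ["tree_003"],
    "test": ["tree_004"],
}
with tempfile.TemporaryDirectory() as tmp:
    for name, ids in demo_ids.items():
        (Path(tmp) / f"{name}.txt").write_text("\n".join(ids) + "\n", encoding="utf-8")
    manifest = read_manifest_dir(tmp)

print(manifest["train"])
```

Keeping the IDs in plain text files makes a split easy to review and to share between experiments.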

Split definitions

DatasetSplit.from_ratios() creates deterministic train, validation, and test IDs from sample IDs, ratios, and a seed. Use this when you need a repeatable local split without maintaining manifest files.
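
To make "deterministic" concrete, here is a minimal standard-library sketch of a seeded ratio split. It is not PhyloGNN's implementation: the function name split_by_ratios, the default ratios, and the rounding rule (remainder goes to the test split) are all assumptions chosen for illustration.

```python
import random


def split_by_ratios(sample_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Illustrative seeded split; not the library's own algorithm."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    # Sort first so the result does not depend on the input order.
    ids = sorted(sample_ids)
    # A private Random instance leaves the global RNG state untouched.
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder goes to the test split
    }


sample_ids = [f"tree_{i:03d}" for i in range(10)]
first = split_by_ratios(sample_ids, seed=42)
second = split_by_ratios(reversed(sample_ids), seed=42)
print(first == second)
```

Sorting before shuffling is what makes the result depend only on the set of IDs and the seed, rather than on the order in which samples were discovered on disk.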

DatasetSplit.from_dict() accepts explicit sample IDs for each split. Use this when splits come from previous experiments or external metadata.
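
When split IDs come from external metadata, it can help to check them before building a split. The sketch below is a plain-Python sanity check that no sample ID is assigned twice. The key names "train", "val", and "test" and the helper check_explicit_splits are assumptions for illustration; consult the DatasetSplit.from_dict() reference for the exact mapping it expects.

```python
def check_explicit_splits(splits):
    """Hypothetical check: raise if a sample ID is assigned more than once."""
    seen = {}
    for split_name, ids in splits.items():
        for sample_id in ids:
            if sample_id in seen:
                raise ValueError(
                    f"{sample_id!r} appears in {seen[sample_id]!r} "
                    f"and again in {split_name!r}"
                )
            seen[sample_id] = split_name
    return len(seen)


explicit_splits = {
    "train": ["tree_001", "tree_002", "tree_003"],
    "val": ["tree_004"],
    "test": ["tree_005"],
}
n_samples = check_explicit_splits(explicit_splits)
print(n_samples)
```

An ID shared between the training and test lists leaks test data into training, so failing early here is cheaper than discovering the overlap after a run.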

SplitPhyloDataset.build_subsets(split) returns split-specific views that can be passed directly to Trainer.fit() and Trainer.predict().
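
The idea of a split-specific view can be illustrated without the library: each view holds a list of IDs and looks samples up in one shared store, so no graph is copied. The SubsetView class below is a hypothetical stand-in written for this sketch; it does not describe how build_subsets() is implemented.

```python
class SubsetView:
    """Illustrative read-only view over a shared sample store."""

    def __init__(self, samples_by_id, ids):
        self._samples = samples_by_id  # shared mapping, not copied
        self._ids = list(ids)

    def __len__(self):
        return len(self._ids)

    def __getitem__(self, index):
        return self._samples[self._ids[index]]


# Strings stand in for graph objects in this toy example.
samples_by_id = {"tree_001": "graph A", "tree_002": "graph B", "tree_003": "graph C"}
split = {"train": ["tree_001", "tree_003"], "val": ["tree_002"], "test": []}
views = {name: SubsetView(samples_by_id, ids) for name, ids in split.items()}

print(len(views["train"]), views["train"][1])
```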

Target labels

Each supervised sample must expose data.y. Keep label dtype and shape aligned with the selected loss and model output. For graph-level regression, a common shape is one float target per graph.

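Disk-backed .pt graph and label files are loaded as complete PyTorch objects. Loading a full serialized object relies on Python pickling, which can run arbitrary code embedded in a malicious file, so use this path only for trusted artifacts produced by your PhyloGNN preprocessing workflow.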

Determinism

Keep sample IDs stable across preprocessing runs. Use explicit feature order during graph conversion so every split sees the same column layout. Use a fixed split seed when ratio-based splits are acceptable.
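
One way to enforce an explicit feature order is to name the columns once and build every feature row from that list. The sketch below uses plain Python. The feature names and the helper to_feature_row are hypothetical and only illustrate the idea.

```python
# Hypothetical feature names; the order here defines the column layout.
FEATURE_ORDER = ("branch_length", "depth", "is_leaf")


def to_feature_row(node_attrs, feature_order=FEATURE_ORDER):
    """Return node attributes as a list in a fixed column order."""
    missing = [name for name in feature_order if name not in node_attrs]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(node_attrs[name]) for name in feature_order]


# The two dicts list their keys in different orders on purpose.
node_a = {"depth": 2, "is_leaf": 1, "branch_length": 0.13}
node_b = {"is_leaf": 0, "branch_length": 0.4, "depth": 1}
rows = [to_feature_row(node_a), to_feature_row(node_b)]
print(rows)
```

Relying on dictionary or DataFrame column order instead would let the layout drift whenever the upstream preprocessing changes, and a model cannot detect that its input columns were swapped.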