Datasets And Splits
PhyloGNN training utilities work with PyTorch Geometric Data samples. Build
graphs first, attach each graph target as data.y, then choose an in-memory or
disk-backed dataset depending on how many samples you need to manage.
Dataset choices
| Situation | Use | Notes |
|---|---|---|
| Small or generated datasets | | Keeps |
| Preprocessed graph files | | Reads |
| Existing train/validation/test IDs | | Loads |
Split definitions
DatasetSplit.from_ratios() creates deterministic train, validation, and test
IDs from sample IDs, ratios, and a seed. Use this when you need a repeatable
local split without maintaining manifest files.
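The idea behind a seeded ratio split can be sketched in plain Python. This is not the implementation of DatasetSplit.from_ratios(); the helper name `split_ids_by_ratio`, its parameters, and the returned dictionary keys are assumptions made for illustration only. It shows why the same IDs, ratios, and seed give the same split every time.

```python
import random


def split_ids_by_ratio(sample_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Illustrative helper (not part of PhyloGNN): seeded three-way split."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    # Sort first so the result does not depend on the order IDs arrive in.
    ids = sorted(sample_ids)
    # A private Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        # The test split takes whatever remains after rounding down.
        "test": ids[n_train + n_val:],
    }


# Hypothetical sample IDs, supplied in two different orders.
sample_ids = [f"tree_{i:03d}" for i in range(20)]
first = split_ids_by_ratio(sample_ids, seed=42)
second = split_ids_by_ratio(reversed(sample_ids), seed=42)
```

Here `first` and `second` hold identical splits because the IDs are sorted before the seeded shuffle. Changing the seed or the set of IDs changes the assignment, which is one reason to keep sample IDs stable.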
DatasetSplit.from_dict() accepts explicit sample IDs for each split. Use this
when splits come from previous experiments or external metadata.
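When split IDs come from external sources, it can help to check them for leakage before building a split. The sketch below is independent of PhyloGNN: `find_split_overlaps`, the split names, and the sample IDs are hypothetical, and the dictionary layout is only an assumption about what explicit split definitions might look like.

```python
def find_split_overlaps(split_ids):
    """Illustrative helper (not part of PhyloGNN): report IDs shared by splits."""
    overlaps = {}
    names = list(split_ids)
    # Compare each pair of splits exactly once.
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            shared = set(split_ids[left]) & set(split_ids[right])
            if shared:
                overlaps[(left, right)] = sorted(shared)
    return overlaps


# Hypothetical explicit split definition with one leaked sample.
explicit_ids = {
    "train": ["tree_001", "tree_002", "tree_003"],
    "validation": ["tree_004"],
    "test": ["tree_005", "tree_003"],  # tree_003 also appears in train
}
overlaps = find_split_overlaps(explicit_ids)
```

An empty result means no sample ID appears in more than one split. In this example the check reports `tree_003` as shared between the train and test splits.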
SplitPhyloDataset.build_subsets(split) returns split-specific views that can
be passed directly to Trainer.fit() and Trainer.predict().
Target labels
Each supervised sample must expose data.y. Keep label dtype and shape aligned
with the selected loss and model output. For graph-level regression, a common
shape is one float target per graph.
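Disk-backed .pt graph and label files are loaded as complete PyTorch
objects. Loading full objects typically relies on Python pickle, which can
execute code embedded in the file, so use this path only for trusted
artifacts produced by your PhyloGNN preprocessing workflow.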
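One way to act on the trust requirement is to record a checksum when preprocessing writes a file and compare it before loading. The sketch below uses only the standard library and is not a PhyloGNN feature; the helper names, the file name, and the idea of storing digests alongside the artifacts are all assumptions. A matching digest shows only that the file is unchanged since it was recorded. It does not make a file from an untrusted source safe.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, expected_digest):
    """Illustrative check (hypothetical helper): compare against a stored digest."""
    return sha256_of(path) == expected_digest


# Demo with a stand-in file; the bytes are not a real graph artifact.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "graph_0001.pt"
    artifact.write_bytes(b"stand-in bytes written by preprocessing")
    recorded = sha256_of(artifact)  # would be stored at preprocessing time
    unchanged_ok = is_unchanged(artifact, recorded)
    artifact.write_bytes(b"modified after preprocessing")
    modified_ok = is_unchanged(artifact, recorded)
```

In this demo `unchanged_ok` is true and `modified_ok` is false, so a loader could refuse the second file before any PyTorch object is deserialized.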
Determinism
Keep sample IDs stable across preprocessing runs. Use explicit feature order during graph conversion so every split sees the same column layout. Use a fixed split seed when ratio-based splits are acceptable.
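The feature-order advice can be illustrated with a small sketch that builds feature rows from an explicit column list instead of relying on dictionary order. The feature names, node IDs, and the `to_feature_rows` helper are hypothetical and not part of PhyloGNN. The resulting rows are plain lists that a graph conversion step could turn into a tensor.

```python
# Hypothetical feature names; the tuple fixes the column layout once.
FEATURE_ORDER = ("branch_length", "depth", "n_children")


def to_feature_rows(node_features, feature_order=FEATURE_ORDER):
    """Illustrative helper: map node_id -> feature dict into ordered rows."""
    rows = []
    # Sorted node IDs give a stable row order across runs.
    for node_id in sorted(node_features):
        features = node_features[node_id]
        missing = [name for name in feature_order if name not in features]
        if missing:
            raise KeyError(f"node {node_id!r} is missing features: {missing}")
        rows.append([float(features[name]) for name in feature_order])
    return rows


# The same nodes as produced by two runs with different key and node order.
run_a = {
    "n1": {"depth": 1, "branch_length": 0.5, "n_children": 2},
    "n0": {"branch_length": 0.0, "depth": 0, "n_children": 1},
}
run_b = {
    "n0": {"n_children": 1, "depth": 0, "branch_length": 0.0},
    "n1": {"n_children": 2, "branch_length": 0.5, "depth": 1},
}
rows_a = to_feature_rows(run_a)
rows_b = to_feature_rows(run_b)
```

Both runs yield the same rows in the same column order, so every split sees an identical layout regardless of how the preprocessing step happened to order its keys.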