Datasets and Splits
===================

PhyloGNN training utilities work with PyTorch Geometric `Data` samples. Build
graphs first, attach each graph target as `data.y`, then choose an in-memory or
disk-backed dataset depending on how many samples you need to manage.

Dataset choices
---------------

.. list-table::
   :header-rows: 1

   * - Situation
     - Use
     - Notes
   * - Small or generated datasets
     - `SplitPhyloDataset`
     - Keeps `Data` objects and labels in memory.
   * - Preprocessed graph files
     - `SplitPhyloDiskDataset`
     - Reads `.pt` graphs, with optional mirrored label files.
   * - Existing train/validation/test IDs
     - `DatasetSplit.from_manifest_dir()`
     - Loads `train.txt`, `val.txt`, and `test.txt`.
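
The manifest file format is not specified on this page. As a rough illustration
only, the sketch below assumes one sample ID per line and uses a hypothetical
standard-library helper, `read_manifest_dir`, in place of the PhyloGNN loader.
The `tree_...` IDs are made up for the demonstration.

.. code-block:: python

   import tempfile
   from pathlib import Path


   def read_manifest_dir(manifest_dir):
       """Read train/val/test ID lists from a directory of text files.

       Hypothetical helper: assumes one sample ID per line.
       """
       splits = {}
       for name in ("train", "val", "test"):
           text = (Path(manifest_dir) / f"{name}.txt").read_text(encoding="utf-8")
           # Strip whitespace and skip blank lines.
           splits[name] = [line.strip() for line in text.splitlines() if line.strip()]
       return splits


   # Demonstration with a throwaway directory and made-up IDs.
   example_ids = {
       "train": ["tree_001", "tree_002"],
       "val": ["tree_003"],
       "test": ["tree_004"],
   }
   with tempfile.TemporaryDirectory() as tmp:
       for name, ids in example_ids.items():
           (Path(tmp) / f"{name}.txt").write_text("\n".join(ids) + "\n", encoding="utf-8")
       manifest = read_manifest_dir(tmp)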

Split definitions
-----------------

`DatasetSplit.from_ratios()` creates deterministic train, validation, and test
IDs from sample IDs, ratios, and a seed. Use this when you need a repeatable
local split without maintaining manifest files.
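
To make the idea of a seeded, repeatable split concrete, here is a minimal
standard-library sketch. It is not the algorithm `DatasetSplit.from_ratios()`
uses, which this page does not describe. The function name `split_by_ratios`
and its parameters are assumptions chosen for illustration.

.. code-block:: python

   import random


   def split_by_ratios(sample_ids, ratios=(0.8, 0.1, 0.1), seed=0):
       """Return train/val/test ID lists that depend only on the IDs and the seed."""
       if abs(sum(ratios) - 1.0) > 1e-9:
           raise ValueError("ratios must sum to 1")
       # Sort first so the result does not depend on the input order.
       ids = sorted(sample_ids)
       # A private Random instance avoids touching global random state.
       random.Random(seed).shuffle(ids)
       n_train = int(len(ids) * ratios[0])
       n_val = int(len(ids) * ratios[1])
       return {
           "train": ids[:n_train],
           "val": ids[n_train:n_train + n_val],
           # The test split takes whatever remains after rounding down.
           "test": ids[n_train + n_val:],
       }


   ids = [f"tree_{i:03d}" for i in range(10)]
   first = split_by_ratios(ids, seed=42)
   second = split_by_ratios(reversed(ids), seed=42)  # same IDs, different order

Sorting before shuffling is the design choice that matters here: without it,
the same seed could produce different splits whenever the IDs arrive in a
different order.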

`DatasetSplit.from_dict()` accepts explicit sample IDs for each split. Use this
when splits come from previous experiments or external metadata.
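
When IDs come from external metadata, it can help to confirm that no sample
appears in more than one split before building a split object. The check below
is a sketch under assumptions: `check_disjoint` is a hypothetical helper, the
dictionary keys `train`, `val`, and `test` are assumed rather than taken from
the `DatasetSplit.from_dict()` signature, and the IDs are invented.

.. code-block:: python

   from itertools import combinations


   def check_disjoint(splits):
       """Raise ValueError if any sample ID appears in two splits."""
       for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
           shared = set(ids_a) & set(ids_b)
           if shared:
               raise ValueError(f"{name_a} and {name_b} share IDs: {sorted(shared)}")


   # Made-up IDs; the key names are an assumption.
   explicit_splits = {
       "train": ["tree_001", "tree_002", "tree_003"],
       "val": ["tree_004"],
       "test": ["tree_005"],
   }
   check_disjoint(explicit_splits)  # passes silently when splits do not overlap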

`SplitPhyloDataset.build_subsets(split)` returns split-specific views that can
be passed directly to `Trainer.fit()` and `Trainer.predict()`.

Target labels
-------------

Each supervised sample must expose `data.y`. Keep the label dtype and shape
aligned with the selected loss and model output. For graph-level regression, a
common choice is one float target per graph, for example a tensor of shape
`[1]`.
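
As an illustration of a graph-level regression label, the sketch below builds
two small PyTorch Geometric graphs and batches them. The node features, edges,
and target values are placeholders, and `make_graph` is a hypothetical helper,
not part of PhyloGNN.

.. code-block:: python

   import torch
   from torch_geometric.data import Batch, Data


   def make_graph(target):
       """Build a toy 3-node graph with a single float target."""
       x = torch.randn(3, 4)  # 3 nodes, 4 placeholder features each
       edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)
       # One float32 target per graph, stored as a tensor of shape [1].
       y = torch.tensor([target], dtype=torch.float32)
       return Data(x=x, edge_index=edge_index, y=y)


   batch = Batch.from_data_list([make_graph(0.25), make_graph(1.5)])
   # Per-graph targets are concatenated, so batch.y has shape [2].

With this layout a model that emits one value per graph may need its output
flattened, or the target reshaped, so that both sides of the loss have the same
shape.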

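Disk-backed `.pt` graph and label files are loaded as complete PyTorch
objects rather than as bare tensors. This kind of loading relies on Python
pickling, so a malicious file can execute code when it is read. Use this path
only for trusted artifacts produced by your PhyloGNN preprocessing workflow.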
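
One possible way to check that an artifact is the file your preprocessing run
wrote is to record a digest at write time and compare it before loading. This
is a sketch, not a PhyloGNN feature: `sha256_of` and `is_trusted` are
hypothetical helpers, and the file contents are placeholder bytes rather than a
real graph.

.. code-block:: python

   import hashlib
   import tempfile
   from pathlib import Path


   def sha256_of(path, chunk_size=1 << 20):
       """Return the hex SHA-256 digest of a file, read in chunks."""
       digest = hashlib.sha256()
       with open(path, "rb") as handle:
           for chunk in iter(lambda: handle.read(chunk_size), b""):
               digest.update(chunk)
       return digest.hexdigest()


   def is_trusted(path, expected_digest):
       """True if the file still matches the digest recorded earlier."""
       return sha256_of(path) == expected_digest


   with tempfile.TemporaryDirectory() as tmp:
       artifact = Path(tmp) / "tree_001.pt"
       artifact.write_bytes(b"placeholder bytes, not a real graph")
       recorded = sha256_of(artifact)  # stored at preprocessing time
       unchanged_ok = is_trusted(artifact, recorded)
       artifact.write_bytes(b"modified after preprocessing")
       modified_ok = is_trusted(artifact, recorded)

A digest check only detects changes made after the digest was recorded. It does
not make a file from an unknown source safe to load.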

Determinism
-----------

Keep sample IDs stable across preprocessing runs. Use explicit feature order
during graph conversion so every split sees the same column layout. When
ratio-based splits are acceptable, fix the split seed so each ID lands in the
same split on every run.
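
The point about explicit feature order can be sketched as follows. The feature
names in `FEATURE_ORDER` and the helper `to_row` are hypothetical; the idea is
only that column positions come from one declared sequence, never from the
iteration order of whatever mapping holds the raw values.

.. code-block:: python

   # Hypothetical feature names, declared once and reused everywhere.
   FEATURE_ORDER = ("branch_length", "depth", "is_leaf")


   def to_row(features):
       """Convert a name-to-value mapping into a row with a fixed column layout."""
       missing = [name for name in FEATURE_ORDER if name not in features]
       if missing:
           raise KeyError(f"missing features: {missing}")
       return [float(features[name]) for name in FEATURE_ORDER]


   # The same values in a different key order yield the same row.
   row_a = to_row({"depth": 2, "is_leaf": 1, "branch_length": 0.5})
   row_b = to_row({"is_leaf": 1, "branch_length": 0.5, "depth": 2})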

Related pages
-------------

See :doc:`graph_conversion` for graph field contracts,
:doc:`training` for trainer usage, and :doc:`../reference/training` for API
lookup.