Data Contract¶

xdflow works on xarray.DataArray objects. DataContainer is a thin framework wrapper around a DataArray, not a separate data model. The wrapped xarray object remains the source of truth for values, dimensions, coordinates, and attrs.

The wrapper gives transforms, predictors, validators, and tuners a consistent object to pass around. It initializes data.attrs["data_history"], rewraps the result of xarray methods such as sel and mean when they return a new DataArray, and exposes the underlying array through .data.
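To make the wrapping behavior concrete, here is a minimal sketch of a thin wrapper with the three behaviors described above. ArrayWrapper is a hypothetical class written for this page, not xdflow's DataContainer, and the real implementation may differ in every detail.

```python
import numpy as np
import xarray as xr


class ArrayWrapper:
    """Hypothetical thin wrapper; not xdflow's DataContainer implementation."""

    def __init__(self, data: xr.DataArray):
        self.data = data
        # Make sure a history list exists without overwriting an existing one.
        self.data.attrs.setdefault("data_history", [])

    def __getattr__(self, name):
        # Only called for attributes that are not found on the wrapper itself.
        if name == "data":
            raise AttributeError(name)  # avoid recursion before __init__ has run
        attr = getattr(self.data, name)
        if not callable(attr):
            return attr  # plain attributes such as dims or sizes pass through

        def method(*args, **kwargs):
            result = attr(*args, **kwargs)
            # Rewrap only when the xarray method returns a new DataArray.
            return ArrayWrapper(result) if isinstance(result, xr.DataArray) else result

        return method


wrapped = ArrayWrapper(
    xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("trial", "time"))
)
averaged = wrapped.mean(dim="time")  # still a wrapper, now without the time dim
```

The wrapped DataArray stays the single source of truth here: the wrapper stores no dims, coords, or values of its own.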

The library does not require a fixed schema beyond a few conventions, but it assumes your data is labeled consistently enough for transforms to use dimensions by name.

The data contract is also what validators and tuners use at runtime. Dimensions and coordinates tell the framework what can be selected, split, transformed, scored, cached, and aligned without moving metadata into separate side arrays.

Required structure¶

supervised workflows are organized around a trial dimension
target labels typically live in coordinates attached to trial
transforms may require additional dimensions such as channel, time, feature, or freq_band

Each transform advertises dimension expectations through input_dims and output_dims, or computes them dynamically through get_expected_output_dims.
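As a minimal sketch of the structure described above, the following builds a labeled array with plain xarray. The shapes, channel names, and stimulus values are made up for illustration, and wrapping the result in DataContainer is left out because its constructor signature is not shown on this page.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# 6 trials x 2 channels x 5 time points of synthetic values (illustrative only).
data = xr.DataArray(
    rng.normal(size=(6, 2, 5)),
    dims=("trial", "channel", "time"),
    coords={
        "trial": np.arange(6),
        "channel": ["ch0", "ch1"],
        "time": np.linspace(0.0, 0.4, 5),
        # Target labels ride along the trial dimension as a non-index coordinate.
        "stimulus": ("trial", ["a", "b", "a", "b", "a", "b"]),
    },
)

# Because the labels are attached to trial, selecting trials keeps them in sync.
subset = data.isel(trial=[0, 2, 4])
labels = subset["stimulus"].values.tolist()
```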

Dimensions¶

dimension names are part of the contract, not incidental metadata
transforms should address dimensions by name, for example data.mean(dim="time")
positional-axis logic should be avoided unless a transform is explicitly reshaping into a new labeled representation

Common dimension names in this repo include:

trial
channel
time
feature
prediction
freq_band

Domain-specific labels are fine as long as they remain internally consistent.
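The reason for preferring names over positions can be shown with plain xarray. In this sketch the array contents and shapes are arbitrary; the point is that a name-based reduction survives a change in axis order while a positional one quietly reduces a different dimension.

```python
import numpy as np
import xarray as xr

values = np.arange(24, dtype=float).reshape(2, 3, 4)
data = xr.DataArray(values, dims=("trial", "channel", "time"))

# Same data with a different axis order, as might come out of an upstream step.
reordered = data.transpose("time", "trial", "channel")

# Name-based reduction gives the same result regardless of axis order.
by_name_a = data.mean(dim="time")
by_name_b = reordered.mean(dim="time")

# Positional reduction averages over a different dimension after the transpose.
by_pos_a = data.values.mean(axis=2)       # averages over time
by_pos_b = reordered.values.mean(axis=2)  # averages over channel
```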

Runtime use¶

xdflow uses the data contract together with transform state to run pipelines:

sample_dim identifies the independent sample axis for predictors
target coordinates such as stimulus stay attached to that sample axis
group coordinates such as session, subject, or animal can define split boundaries
transform input_dims and output_dims catch invalid handoffs between steps
is_stateful tells validators which steps must be cloned and refit per fold
fold-invariant stateless steps can be reused across folds or tuning trials

Do not mark a transform stateless just because it has no fitted Python object. If it computes cross-sample statistics or otherwise depends on the validation split, model it as stateful or keep it after the split boundary.
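The warning about cross-sample statistics is easiest to see with a standardization step. The sketch below uses plain xarray on synthetic data and a hand-picked train/test split; it does not use xdflow's validators, and the variable names are illustrative.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
data = xr.DataArray(
    rng.normal(loc=5.0, scale=2.0, size=(10, 3)),
    dims=("trial", "feature"),
    coords={"trial": np.arange(10)},
)

# Label-based slices are inclusive of both endpoints.
train = data.sel(trial=slice(0, 7))
test = data.sel(trial=slice(8, 9))

# Statistics over trials are learned state: fit them on the training trials only.
train_mean = train.mean(dim="trial")
train_std = train.std(dim="trial")

# Apply the training statistics unchanged to the held-out trials.
test_scaled = (test - train_mean) / train_std

# Computing the mean over all trials would let held-out samples leak into the fit.
leaky_mean = data.mean(dim="trial")
```

Even though no estimator object is involved, train_mean and train_std depend on which trials are in the training fold, which is why a step like this belongs on the stateful side of the contract.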

Transform history¶

DataContainer initializes a data_history list in the wrapped array attrs. The base Transform appends each completed transform with its class name and public parameters, so transformed outputs carry a lightweight provenance trail.

This history is useful for inspection and debugging. It is not a replacement for experiment tracking, model cards, or persisted pipeline configuration.
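A rough sketch of how such a provenance trail can be recorded is shown below. The Scale class, the record_step helper, and the shape of each history entry are assumptions made for illustration; the actual entry format written by the base Transform is not specified on this page.

```python
import copy


class Scale:
    """Hypothetical transform used only to illustrate history recording."""

    def __init__(self, factor=2.0):
        self.factor = factor  # public hyperparameter, recorded in the history
        self._cache = None    # private attribute, left out of the history


def record_step(attrs, transform):
    """Return a copy of attrs with one more history entry (assumed entry format)."""
    new_attrs = copy.deepcopy(attrs)  # never mutate the incoming attrs
    public_params = {
        key: value for key, value in vars(transform).items() if not key.startswith("_")
    }
    new_attrs.setdefault("data_history", []).append(
        {"transform": type(transform).__name__, "params": public_params}
    )
    return new_attrs


attrs_in = {"data_history": []}
attrs_out = record_step(attrs_in, Scale(factor=0.5))
```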

Automatic validation¶

xdflow uses transform dimension declarations as lightweight contract checks:

adjacent pipeline steps are checked at construction time when one step declares concrete output_dims and the next declares concrete input_dims
transforms with dynamic output shapes expose get_expected_output_dims(input_dims)
Pipeline(expected_input_dims=...) validates the expected input dims for every step and checks handoffs through get_expected_output_dims
during fit_transform and transform, expected_input_dims checks the actual dims before each step runs
pipeline.get_expected_output_dims(input_dims, print_steps=True) shows expected dimension flow without running data
write-back for transform_sel and transform_drop_sel checks dims, sizes, and dimension coordinates before replacing a selected subset

These checks catch many invalid reshapes and step handoffs early, but they are not a full semantic schema validator. Transform authors should still validate required coordinates, attrs, units, size constraints, and domain-specific assumptions inside the transform.
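The construction-time handoff check can be pictured with a small stand-alone sketch. Step and check_handoffs are hypothetical names, and comparing dimension names as sets is an assumption; xdflow's own check may treat order or dynamic dims differently.

```python
class Step:
    """Hypothetical step that only declares dimension names."""

    def __init__(self, name, input_dims, output_dims):
        self.name = name
        self.input_dims = tuple(input_dims)
        self.output_dims = tuple(output_dims)


def check_handoffs(steps):
    """Raise ValueError at the first pair of adjacent steps whose dims disagree."""
    for prev, nxt in zip(steps, steps[1:]):
        # Assumption: dimension order is irrelevant because dims are used by name.
        if set(prev.output_dims) != set(nxt.input_dims):
            raise ValueError(
                f"{prev.name} outputs {prev.output_dims} "
                f"but {nxt.name} expects {nxt.input_dims}"
            )


good = [
    Step("bandpower", ("trial", "channel", "time"), ("trial", "channel", "freq_band")),
    Step("flatten", ("trial", "channel", "freq_band"), ("trial", "feature")),
]
# The second step here expects features, but bandpower still emits channel and freq_band.
bad = [good[0], Step("classifier", ("trial", "feature"), ("trial", "prediction"))]

check_handoffs(good)  # passes silently
try:
    check_handoffs(bad)
    error = None
except ValueError as exc:
    error = str(exc)
```

A check like this only compares names. It says nothing about sizes, units, or whether required coordinates exist, which is why those checks stay inside each transform.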

See Writing Custom Transforms for the authoring workflow and examples.

Coordinates and attrs¶

Coordinates are the main place for labels and grouping metadata:

class labels, sessions, animals, subjects, or conditions should be stored as coordinates
timestamps and channel labels should remain attached to the relevant dimension
additional metadata can live in attrs

Predictors and splitters often depend on coordinates such as stimulus, session, or animal, so those names need to exist on the data used by the relevant workflow.
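As an illustration of why grouping metadata belongs in coordinates, the sketch below holds out one whole session using only plain xarray. The coordinate values and the sampling_rate_hz attr are invented for the example, and xdflow's own splitters are not used.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.zeros((6, 2)),
    dims=("trial", "channel"),
    coords={
        "trial": np.arange(6),
        "stimulus": ("trial", ["a", "b", "a", "b", "a", "b"]),
        "session": ("trial", ["s1", "s1", "s2", "s2", "s3", "s3"]),
    },
    attrs={"sampling_rate_hz": 1000},  # array-level metadata lives in attrs
)

# Hold out one whole session so no session contributes to both sides of the split.
held_out = "s3"
is_test = (data["session"] == held_out).values
train = data.isel(trial=np.flatnonzero(~is_test))
test = data.isel(trial=np.flatnonzero(is_test))

# Labels and group ids travel with the selected trials.
test_labels = test["stimulus"].values.tolist()
```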

Immutability¶

Transforms are expected to behave functionally:

transform() should return a new DataContainer
the incoming container should not be mutated in place
selective transforms that write results back into a larger array must do so on a copied container

This contract is exercised by the test suite, especially the transform immutability tests.
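An immutability check of this kind can be sketched with plain xarray. The demean_time function is a hypothetical stand-in for a transform, and the repository's actual tests may be structured differently.

```python
import numpy as np
import xarray as xr


def demean_time(data: xr.DataArray) -> xr.DataArray:
    """Hypothetical functional transform: returns a new array, leaves the input alone."""
    return data - data.mean(dim="time")


original = xr.DataArray(
    np.arange(12, dtype=float).reshape(3, 4),
    dims=("trial", "time"),
    attrs={"data_history": []},
)
snapshot = original.copy(deep=True)  # taken before the transform runs

result = demean_time(original)

# identical() compares values, dims, coords, name, and attrs.
input_unchanged = original.identical(snapshot)
is_new_object = result is not original
```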

Selection semantics¶

All transforms support optional selection arguments:

sel: apply an xarray .sel(...) selection before the transform
drop_sel: drop labels before the transform
transform_sel: transform only a selected subset, then write it back into the original structure
transform_drop_sel: inverse form of transform_sel

Selective write-back through transform_sel or transform_drop_sel is only valid for transforms that declare support for it via _supports_transform_sel.
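The four selection modes map onto ordinary xarray operations. The sketch below shows the underlying idea with DataArray.sel, DataArray.drop_sel, and a label-based write-back into a copy; it does not call xdflow's transform arguments, and multiplying by 10 stands in for an arbitrary transform.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.ones((2, 3, 4)),
    dims=("trial", "channel", "time"),
    coords={"channel": ["ch0", "ch1", "ch2"]},
)

# sel-style: keep only the selected labels, then transform.
kept = data.sel(channel=["ch0", "ch1"])

# drop_sel-style: remove labels, then transform what is left.
remaining = data.drop_sel(channel=["ch2"])

# transform_sel-style: transform a subset and write it back into a copy,
# so the result keeps the original structure and the input is not mutated.
subset = data.sel(channel=["ch0"])
transformed = subset * 10.0
out = data.copy(deep=True)
out.loc[dict(channel=["ch0"])] = transformed
```

The write-back only makes sense when the transformed subset has the same dims, sizes, and dimension coordinates as the selection it replaces, which is why a transform that changes shape cannot support it.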

Fitted state¶

Stateful transforms must keep learned parameters out of the constructor so cloning stays safe during cross-validation.

Good pattern:

constructor arguments are pure hyperparameters
fitted artifacts are stored on private attributes such as _estimator, _encoder, or _stats

Bad pattern:

populating constructor-declared attributes with learned values during fit
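The reason for this split becomes clearer with a sketch of constructor-based cloning. MeanCenter and clone are hypothetical, and rebuilding a transform from its constructor arguments is an assumed cloning strategy, similar in spirit to scikit-learn's clone; xdflow's actual mechanism may differ.

```python
import inspect


class MeanCenter:
    """Hypothetical stateful transform following the good pattern."""

    def __init__(self, dim="trial"):
        self.dim = dim      # hyperparameter, mirrored as a public attribute
        self._stats = None  # fitted artifact, private and not a constructor argument

    def fit(self, values):
        self._stats = sum(values) / len(values)
        return self


def clone(transform):
    """Rebuild a transform from its constructor arguments only (assumed strategy)."""
    signature = inspect.signature(type(transform).__init__)
    names = [name for name in signature.parameters if name != "self"]
    return type(transform)(**{name: getattr(transform, name) for name in names})


fitted = MeanCenter(dim="trial").fit([1.0, 2.0, 3.0])
fresh = clone(fitted)  # same hyperparameters, no learned state
```

If fit had overwritten self.dim with a learned value, the clone would have silently carried that value into the next fold, which is the failure the bad pattern describes.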

Guidance for authoring new transforms¶

Define explicit constructor parameters and assign them to matching public attributes.
Validate dimensions and coordinate assumptions early.
Preserve or intentionally recompute coordinates when reshaping data.
Keep transform logic label-aware.
Add focused unit tests, including immutability coverage where applicable.

For examples and test patterns, see Writing Custom Transforms.