Data Contract¶
xdflow works on xarray.DataArray objects. DataContainer is a thin framework wrapper around a DataArray, not a separate data model. The wrapped xarray object remains the source of truth for values, dimensions, coordinates, and attrs.
The wrapper gives transforms, predictors, validators, and tuners a consistent object to pass around. It initializes data.attrs["data_history"], rewraps xarray methods such as sel and mean when they return a new DataArray, and exposes the underlying array through .data.
The library does not require a fixed schema beyond a few conventions, but it assumes your data is labeled consistently enough for transforms to use dimensions by name.
The data contract is also what validators and tuners use at runtime. Dimensions and coordinates tell the framework what can be selected, split, transformed, scored, cached, and aligned without moving metadata into separate side arrays.
Required structure¶
- supervised workflows are organized around a trial dimension
- target labels typically live in coordinates attached to trial
- transforms may require additional dimensions such as channel, time, feature, or freq_band
Each transform advertises dimension expectations through input_dims and output_dims, or computes them dynamically through get_expected_output_dims.
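As a minimal sketch of data that follows these conventions, the array below is built with plain xarray. The shapes, the stimulus values, and the channel names are illustrative assumptions. Wrapping the result in DataContainer is left out because the constructor signature is not shown on this page.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Illustrative shapes: 6 trials, 2 channels, 5 time samples.
data = xr.DataArray(
    rng.normal(size=(6, 2, 5)),
    dims=("trial", "channel", "time"),
    coords={
        "trial": np.arange(6),
        # Target labels are a coordinate attached to the trial dimension.
        "stimulus": ("trial", ["a", "b", "a", "b", "a", "b"]),
        "channel": ["ch0", "ch1"],
        "time": np.linspace(0.0, 0.4, 5),
    },
)
```

Because stimulus is declared on trial, any selection along trial keeps the labels aligned with the values.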
Dimensions¶
- dimension names are part of the contract, not incidental metadata
- transforms should use dimension names, as in data.mean(dim="time")
- positional-axis logic should be avoided unless a transform is explicitly reshaping into a new labeled representation
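The difference between name-based and positional logic can be illustrated with plain xarray and NumPy. This is a sketch with made-up shapes, not code from the library.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(24, dtype=float).reshape(2, 3, 4),
    dims=("trial", "channel", "time"),
)

# Name-based reduction: independent of axis order.
by_name = arr.mean(dim="time")

# The same call gives the same result after the axes are reordered.
reordered = arr.transpose("time", "trial", "channel")
by_name_reordered = reordered.mean(dim="time")

# Positional reduction on the raw values: axis 2 used to be time,
# but after the transpose it is channel, so a different axis is averaged.
by_position = reordered.values.mean(axis=2)
```

The positional version raises no error. It returns an array of a different shape, which is why label-aware logic is the safer default.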
Common dimension names in this repo include:
- trial
- channel
- time
- feature
- prediction
- freq_band
Domain-specific labels are fine as long as they remain internally consistent.
Runtime use¶
xdflow uses the data contract together with transform state to run pipelines:
- sample_dim identifies the independent sample axis for predictors
- target coordinates such as stimulus stay attached to that sample axis
- group coordinates such as session, subject, or animal can define split boundaries
- transform input_dims and output_dims catch invalid handoffs between steps
- is_stateful tells validators which steps must be cloned and refit per fold
- fold-invariant stateless steps can be reused across folds or tuning trials
Do not mark a transform stateless just because it has no fitted Python object. If it computes cross-sample statistics or otherwise depends on the validation split, model it as stateful or keep it after the split boundary.
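A small sketch of why cross-sample statistics make a step depend on the split. It uses only the standard library, and the numbers and the train/held-out partition are made up for illustration.

```python
from statistics import fmean

values = [1.0, 2.0, 3.0, 100.0]
train, held_out = values[:3], values[3:]

# Leaky: the centering statistic is computed over every sample,
# so the held-out sample influences what the training data looks like.
leaky_mean = fmean(values)

# Fold-safe: the statistic is learned from the training samples only
# and then applied to the held-out sample.
train_mean = fmean(train)
held_out_centered = [v - train_mean for v in held_out]
```

A step like this holds no fitted estimator object, but its output changes with the split, so it belongs with the stateful steps.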
Transform history¶
DataContainer initializes a data_history list in the wrapped array attrs. The base Transform appends each completed transform with its class name and public parameters, so transformed outputs carry a lightweight provenance trail.
This history is useful for inspection and debugging. It is not a replacement for experiment tracking, model cards, or persisted pipeline configuration.
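The bookkeeping described above can be sketched in a few lines of plain Python. The entry layout (a dict with transform and params keys), the append_history helper, and the Smooth class are assumptions made for illustration. The format the library actually stores may differ.

```python
def public_params(obj):
    """Collect instance attributes whose names do not start with an underscore."""
    return {k: v for k, v in vars(obj).items() if not k.startswith("_")}


def append_history(attrs, transform):
    """Return a copy of attrs with one more history entry (hypothetical format)."""
    entry = {
        "transform": type(transform).__name__,
        "params": public_params(transform),
    }
    new_attrs = dict(attrs)
    new_attrs["data_history"] = list(attrs.get("data_history", [])) + [entry]
    return new_attrs


class Smooth:  # hypothetical transform used only for this example
    def __init__(self, window=5):
        self.window = window   # public hyperparameter, recorded
        self._kernel = None    # private state, not recorded


attrs = {"data_history": []}
attrs_after = append_history(attrs, Smooth(window=3))
```

The helper builds a new list instead of appending in place, so the incoming attrs stay untouched.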
Automatic validation¶
xdflow uses transform dimension declarations as lightweight contract checks:
- adjacent pipeline steps are checked at construction time when one step declares concrete output_dims and the next declares concrete input_dims
- transforms with dynamic output shapes expose get_expected_output_dims(input_dims)
- Pipeline(expected_input_dims=...) validates the expected input dims for every step and checks handoffs through get_expected_output_dims
- during fit_transform and transform, expected_input_dims checks the actual dims before each step runs
- pipeline.get_expected_output_dims(input_dims, print_steps=True) shows expected dimension flow without running data
- transform_sel and transform_drop_sel write-back checks dims, sizes, and dimension coordinates before replacing a selected subset
These checks catch many invalid reshapes and step handoffs early, but they are not a full semantic schema validator. Transform authors should still validate required coordinates, attrs, units, size constraints, and domain-specific assumptions inside the transform.
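The idea behind the handoff checks can be sketched without the library. StepSpec, expected_flow, and the step names below are hypothetical, and exact tuple equality is an assumed matching rule. The real checks may compare dims differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSpec:
    """Hypothetical stand-in for a transform's dimension declarations."""
    name: str
    input_dims: tuple
    output_dims: tuple


def expected_flow(steps, input_dims):
    """Trace dims through the steps without touching any data.

    Raises ValueError at the first step whose declared input dims differ
    from what the previous step hands over.
    """
    flow = [tuple(input_dims)]
    for step in steps:
        if flow[-1] != step.input_dims:
            raise ValueError(
                f"{step.name} expects {step.input_dims}, got {flow[-1]}"
            )
        flow.append(step.output_dims)
    return flow


steps = [
    StepSpec("AverageTime", ("trial", "channel", "time"), ("trial", "channel")),
    StepSpec("Flatten", ("trial", "channel"), ("trial", "feature")),
]
flow = expected_flow(steps, ("trial", "channel", "time"))
```

Swapping the two steps makes the first handoff fail before any data is loaded. A structural check of this kind can only go that far: it cannot tell whether the values, units, or coordinates make sense.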
See Writing Custom Transforms for the authoring workflow and examples.
Coordinates and attrs¶
Coordinates are the main place for labels and grouping metadata:
- class labels, sessions, animals, subjects, or conditions should be stored as coordinates
- timestamps and channel labels should remain attached to the relevant dimension
- additional metadata can live in attrs
Predictors and splitters often depend on coordinates such as stimulus, session, or animal, so those names need to exist on the data used by the relevant workflow.
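As an illustration of why those coordinate names matter, the sketch below uses plain xarray to hold out one session. The session and stimulus values are made up, and the manual mask stands in for whatever splitter the workflow actually uses.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.zeros((6, 2)),
    dims=("trial", "channel"),
    coords={
        "stimulus": ("trial", ["a", "b", "a", "b", "a", "b"]),
        "session": ("trial", ["s1", "s1", "s2", "s2", "s3", "s3"]),
    },
)

# Boolean mask over trials, derived from the group coordinate.
is_held_out = (data["session"] == "s3").values

train = data.isel(trial=np.flatnonzero(~is_held_out))
test = data.isel(trial=np.flatnonzero(is_held_out))
```

Both subsets keep their stimulus and session coordinates, so labels and group membership never need to be tracked in separate arrays.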
Immutability¶
Transforms are expected to behave functionally:
- transform() should return a new DataContainer
- the incoming container should not be mutated in place
- selective transforms that write results back into a larger array must do so on a copied container
This contract is exercised by the test suite, especially the transform immutability tests.
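An immutability check of this kind could look like the sketch below. It works on a bare xarray.DataArray, and demean_time is a hypothetical function, not one of the library's transforms or tests.

```python
import numpy as np
import xarray as xr


def demean_time(arr: xr.DataArray) -> xr.DataArray:
    """Hypothetical functional transform: builds and returns a new array."""
    return arr - arr.mean(dim="time")


original = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("trial", "time"))
snapshot = original.copy(deep=True)

result = demean_time(original)

# The input must be unchanged, and the output must be a different object.
assert original.identical(snapshot)
assert result is not original
```

Taking a deep copy before the call matters: comparing the input against itself afterwards would pass even if it had been mutated.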
Selection semantics¶
All transforms support optional selection arguments:
- sel: apply an xarray.sel(...) selection before the transform
- drop_sel: drop labels before the transform
- transform_sel: transform only a selected subset, then write it back into the original structure
- transform_drop_sel: inverse form of transform_sel
Selective in-place replacement is only valid for transforms that declare support for it via _supports_transform_sel.
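The four modes map onto plain xarray operations roughly as sketched below. This illustrates the semantics and is not the library's implementation. The multiplication by 10 stands in for an arbitrary transform, and the pre-write checks are a simplified assumption.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.ones((2, 3)),
    dims=("trial", "channel"),
    coords={"trial": [0, 1], "channel": ["c0", "c1", "c2"]},
)

# sel-style: keep only the selected labels, then transform.
kept = data.sel(channel=["c0", "c1"])

# drop_sel-style: remove labels, then transform.
dropped = data.drop_sel(channel=["c2"])

# transform_sel-style: transform a subset, then write it back into a copy.
selection = {"channel": ["c0"]}
target = data.sel(selection)
subset = target * 10.0  # stand-in for an arbitrary transform

# Simplified write-back checks: dims, sizes, and dimension coordinates must match.
assert subset.dims == target.dims and subset.shape == target.shape
assert all(subset.indexes[d].equals(target.indexes[d]) for d in subset.dims)

written = data.copy(deep=True)  # never write into the incoming array
written.loc[selection] = subset.values
```

A transform_drop_sel-style call would do the same with the complement of the selection, here channels c1 and c2.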
Fitted state¶
Stateful transforms must keep learned parameters out of the constructor so cloning stays safe during cross-validation.
Good pattern:
- constructor arguments are pure hyperparameters
- fitted artifacts are stored on private attributes such as _estimator, _encoder, or _stats
Bad pattern:
- populating constructor-declared attributes with learned values during fit
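A minimal sketch of the good pattern in plain Python. Center, get_params, and clone are hypothetical names chosen for illustration. The library's base class and cloning mechanism may look different.

```python
class Center:  # hypothetical transform, not part of the library
    def __init__(self, scale=1.0):
        self.scale = scale   # hyperparameter, mirrored on a public attribute
        self._stats = None   # fitted artifact, private, set only by fit

    def get_params(self):
        """Return constructor arguments only."""
        return {"scale": self.scale}

    def clone(self):
        """Rebuild from hyperparameters; fitted state is deliberately dropped."""
        return type(self)(**self.get_params())

    def fit(self, values):
        self._stats = sum(values) / len(values)
        return self

    def transform(self, values):
        if self._stats is None:
            raise RuntimeError("Center must be fitted before transform")
        return [(v - self._stats) * self.scale for v in values]


fitted = Center(scale=2.0).fit([1.0, 2.0, 3.0])
fresh = fitted.clone()
```

If fit overwrote scale with a learned value, every clone would silently start from the previous fold's result. That is the failure the bad pattern describes.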
Guidance for authoring new transforms¶
- Define explicit constructor parameters and assign them to matching public attributes.
- Validate dimensions and coordinate assumptions early.
- Preserve or intentionally recompute coordinates when reshaping data.
- Keep transform logic label-aware.
- Add focused unit tests, including immutability coverage where applicable.
For examples and test patterns, see Writing Custom Transforms.