HDF5#

This is a sandbox file that should eventually be split out into its own pydantic-hdf5 package; for now we are just experimenting here to get our bearings.

Notes

Rather than the set of recursive build steps used elsewhere in the package, we flatten the hdf5 file and build each node from a queue, since some models need to be instantiated before the models that refer to them.
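The flatten-then-queue idea can be sketched with plain Python objects. This is only an illustration under stated assumptions, not the package's implementation: nested dicts stand in for HDF5 groups, strings prefixed with `ref:` are a hypothetical marker for references, and `flatten` and `build_all` are made-up names.

```python
from collections import deque

# Stand-in for an HDF5 file: dicts are groups, everything else is a dataset.
# "ref:<path>" is a hypothetical marker for a reference to another node.
FILE = {
    "acquisition": {"series": {"data": [1, 2, 3], "electrodes": "ref:/general/electrodes"}},
    "general": {"electrodes": {"id": [0, 1]}},
}


def flatten(node, prefix=""):
    """Return {absolute_path: node} for every group and dataset."""
    flat = {}
    for name, child in node.items():
        path = f"{prefix}/{name}"
        flat[path] = child
        if isinstance(child, dict):
            flat.update(flatten(child, path))
    return flat


def build_all(flat):
    """Build every node from a queue, deferring nodes whose dependencies are not built yet."""
    built, order = {}, []
    queue = deque(flat)
    stalled = 0  # consecutive deferrals without progress
    while queue:
        path = queue.popleft()
        node = flat[path]
        if isinstance(node, dict):  # a group depends on its children
            deps = [f"{path}/{name}" for name in node]
        elif isinstance(node, str) and node.startswith("ref:"):  # a reference depends on its target
            deps = [node[4:]]
        else:
            deps = []
        if any(dep not in built for dep in deps):
            queue.append(path)  # try again once more nodes exist
            stalled += 1
            if stalled > len(queue):
                raise ValueError("unresolvable references")
            continue
        stalled = 0
        if isinstance(node, dict):
            built[path] = {name: built[f"{path}/{name}"] for name in node}
        elif deps:
            built[path] = built[deps[0]]  # reuse the already-built target
        else:
            built[path] = node
        order.append(path)
    return built, order


built, order = build_all(flatten(FILE))
```

Because the reference is resolved to the already-built target, `built["/acquisition/series"]["electrodes"]` is the same object as `built["/general/electrodes"]` rather than a copy.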

Mapping operations (mostly TODO atm)

Create new models from DynamicTables
Handle softlinks as object references and vice versa by adding a path attr

Other TODO:

Read metadata only, don’t read all arrays
Write, obvi lol.

class HDF5IO(path: Path)#

read(path: None) → NWBFile#

read(path: str) → BaseModel | Dict[str, BaseModel]

Read data into models from an NWB File.

The read process is in several stages:

Use make_provider() to generate any needed LinkML Schema or Pydantic Classes using a SchemaProvider
Use flatten_hdf() to flatten the file into a ReadQueue of nodes.
Apply the queue’s ReadPhases:
- plan - trim any blank nodes, sort nodes to read, etc.
- read - load the actual data into temporary holding objects
- construct - cast the read data into models.

Read is split into stages like this to handle references between objects, where the read result of one node might depend on another having already been completed. It also allows us to parallelize the operations since each mapping operation is independent of the results of all the others in that pass.
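A rough sketch of why splitting the read into passes helps, assuming nothing about the real ReadQueue or ReadPhases classes: `Holding`, `Model`, `plan`, `run_pass`, and `read_file` are hypothetical names, and `ref:` strings are again a made-up stand-in for references. Within a pass every operation only looks at results from earlier passes, so the pass can be mapped over a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any

# Hypothetical flattened file: path -> raw value. None marks a blank node.
NODES = {"/b/link": "ref:/a/data", "/a/data": [1, 2, 3], "/empty": None}


@dataclass
class Holding:
    """Temporary holding object produced by the read pass."""
    path: str
    value: Any


@dataclass
class Model:
    """Stand-in for a constructed model."""
    path: str
    value: Any


def plan(nodes):
    """Plan pass: trim blank nodes and sort the rest by path."""
    return {path: nodes[path] for path in sorted(nodes) if nodes[path] is not None}


def run_pass(operation, items):
    """Apply one operation to every item of a pass in parallel, preserving order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(operation, items))


def read_file(nodes):
    planned = plan(nodes)
    # Read pass: load raw data into holding objects.
    held = {h.path: h for h in run_pass(lambda item: Holding(*item), planned.items())}

    def construct(holding):
        # Construct pass: references resolve against the *completed* read pass,
        # never against another result of this same pass.
        value = holding.value
        if isinstance(value, str) and value.startswith("ref:"):
            value = held[value[4:]].value
        return Model(holding.path, value)

    return {m.path: m for m in run_pass(construct, held.values())}


models = read_file(NODES)
```

Here `/b/link` can be constructed without caring whether `/a/data` was constructed first, since everything it needs was already produced by the read pass.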

Todo

Implement reading while skipping arrays. They are fast to read with the ArrayProxy class and dask, but there are times when we might want to leave them out of the read entirely. This might be better implemented as a filter on model_dump, but we need to investigate further how best to support reading just metadata, or even some specific field value, or whether we should leave that to other implementations, e.g. wait until after we do SQL export rather than rig up a whole query system ourselves.

Parameters: path (Optional[str]) – If None (default), read the whole file. Otherwise, read from the specific (hdf5) path and its children.
Returns: NWBFile if path is None, otherwise whatever Model or dictionary of models applies to the requested path.
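The two read signatures above can be expressed with `typing.overload`. The sketch below only illustrates that typing pattern: `SketchIO` and its in-memory `models` dict are hypothetical, `BaseModel` and `NWBFile` are empty stand-ins for the real classes, and the rule for when a dictionary is returned (the path has no model of its own, so its children are returned) is a guess rather than documented behavior.

```python
from typing import Dict, Optional, Union, overload


class BaseModel:
    """Stand-in for pydantic.BaseModel so the sketch has no dependencies."""


class NWBFile(BaseModel):
    """Stand-in for the generated NWBFile model."""


class SketchIO:
    def __init__(self, models: Dict[str, BaseModel]):
        self._models = models  # keyed by hdf5 path; "/" is the file root

    @overload
    def read(self, path: None = None) -> NWBFile: ...

    @overload
    def read(self, path: str) -> Union[BaseModel, Dict[str, BaseModel]]: ...

    def read(self, path: Optional[str] = None):
        if path is None:
            return self._models["/"]  # the whole file
        if path in self._models:
            return self._models[path]  # a single model lives at this path
        # Assumed rule: otherwise return every model below the path.
        prefix = path.rstrip("/") + "/"
        return {p: m for p, m in self._models.items() if p.startswith(prefix)}


io = SketchIO({"/": NWBFile(), "/general/a": BaseModel(), "/general/b": BaseModel()})
whole_file = io.read()
children = io.read("/general")
```

With the overloads in place, a type checker infers `NWBFile` for `io.read()` and the union type for `io.read("/general")`.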

write(path: Path)#

Write to an NWB file.

Todo

Implement HDF5 writing.

Need to create inverse mappings that can take pydantic models to hdf5 groups and datasets. If more metadata about the generation process needs to be preserved (e.g. explicitly notating that something is an attribute, dataset, or group), then we can make use of the LinkML_Meta model. If the model to edit has been loaded from an HDF5 file (rather than freshly created), then the hdf5_path should be populated, making mapping straightforward. We probably want to generalize that to deterministically get hdf5_path from position in the NWBFile object, though. I think that might require us to explicitly annotate when something is supposed to be a reference vs. the original in the model representation, or else it’s ambiguous.

Otherwise, it should be a matter of detecting changes from the file if it already exists, and then writing them.
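Since writing is not implemented, the following is only a sketch of what change detection could look like, assuming both the file on disk and the in-memory models can be reduced to a flat mapping from hdf5_path to value. `detect_changes` and both example mappings are hypothetical.

```python
_MISSING = object()  # sentinel so that None remains a legal stored value


def detect_changes(on_disk, in_memory):
    """Classify every hdf5 path as added, removed, or changed; unchanged paths are omitted."""
    changes = {}
    for path in sorted(on_disk.keys() | in_memory.keys()):
        old = on_disk.get(path, _MISSING)
        new = in_memory.get(path, _MISSING)
        if old is _MISSING:
            changes[path] = "added"
        elif new is _MISSING:
            changes[path] = "removed"
        elif old != new:
            changes[path] = "changed"
    return changes


on_disk = {"/a": 1, "/b": [1, 2], "/c": "x"}
in_memory = {"/a": 1, "/b": [1, 2, 3], "/d": 0}
changes = detect_changes(on_disk, in_memory)
```

Only the paths listed in `changes` would need to be touched by a write, which is the point of keeping hdf5_path on loaded models.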

make_provider() → SchemaProvider#

Create a SchemaProvider by reading specifications from the NWBFile /specifications group, translating them to LinkML, and generating pydantic models.

Returns:

SchemaProvider with the correct versions specified as defaults

Return type:

SchemaProvider

read_specs_as_dicts(group: Group) → dict#

Utility function to iterate through the /specifications group and load the schemas from it.

Parameters: group (h5py.Group) – the /specifications group!
Returns: dict of schemas.
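As a sketch of the iteration, assuming the cached specifications are stored as JSON strings in datasets nested under the group: the nested dict below stands in for an h5py.Group so the example runs without h5py, `load_specs` is a hypothetical name, and the namespace, version, and schema names are illustrative only. The shape of the dict returned by the real function may differ.

```python
import json

# Stand-in for the /specifications group: dicts are groups, leaves are JSON text.
# h5py can hand strings back as bytes, so one leaf is bytes on purpose.
SPECS = {
    "core": {
        "2.6.0": {
            "namespace": b'{"namespaces": [{"name": "core"}]}',
            "nwb.base": '{"groups": []}',
        }
    }
}


def load_specs(group):
    """Walk the group recursively and parse every leaf as JSON."""
    loaded = {}
    for name, child in group.items():
        if isinstance(child, dict):  # a subgroup: recurse
            loaded[name] = load_specs(child)
        else:  # a dataset holding JSON text
            if isinstance(child, bytes):
                child = child.decode("utf-8")
            loaded[name] = json.loads(child)
    return loaded


specs = load_specs(SPECS)
```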

find_references(h5f: File, path: str) → List[str]#

Find all objects that make a reference to a given object in:

Attributes
Dataset-level dtype (a dataset of references)
Compound datasets (a dataset with one “column” of references)

Notes

This is extremely slow because we collect all references first, rather than checking them as we go and quitting early. PR if you want to make this faster!

Todo

Test find_references()!

Parameters:

h5f (h5py.File) – Open hdf5 file
path (str) – Path to search for references to

Returns:

List of paths that reference the given path

Return type:

list[str]
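One way to check references as we go, rather than collecting them all first, is a generator, so a caller that only needs the first hit can stop early. This is a sketch over stand-in objects, not h5py code: `Ref`, `Node`, and `iter_references` are hypothetical, and a tuple row stands in for a row of a compound dataset.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an HDF5 object reference."""
    target: str


@dataclass
class Node:
    """Hypothetical stand-in for a group or dataset."""
    attrs: Dict[str, Any] = field(default_factory=dict)
    data: List[Any] = field(default_factory=list)  # rows; a tuple row is a compound row


def iter_references(nodes: Dict[str, Node], path: str) -> Iterator[str]:
    """Lazily yield the path of each node that references the given path."""
    wanted = Ref(path)
    for node_path, node in nodes.items():
        in_attrs = wanted in node.attrs.values()
        in_rows = any(
            row == wanted or (isinstance(row, tuple) and wanted in row)
            for row in node.data
        )
        if in_attrs or in_rows:
            yield node_path


nodes = {
    "/a": Node(attrs={"link": Ref("/t")}),            # reference in an attribute
    "/b": Node(data=[Ref("/t"), Ref("/u")]),          # dataset of references
    "/c": Node(data=[(0, Ref("/t")), (1, Ref("/u"))]),  # compound dataset column
    "/d": Node(data=[1, 2, 3]),
    "/t": Node(),
}
all_hits = list(iter_references(nodes, "/t"))
first_hit = next(iter_references(nodes, "/u"))
```

`list(...)` still gives the full result, while `next(...)` stops scanning at the first referencing node.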

truncate_file(source: Path, target: Path | None = None, n: int = 10) → Path#

Create a truncated HDF5 file where only the first few samples are kept.

Used primarily to create testing data from real data without it being so damn big.

Parameters:

source (pathlib.Path) – Source hdf5 file
target (pathlib.Path) – Optional - target hdf5 file to write to. If None, use {source}_truncated.hdf5
n (int) – The number of items from datasets (samples along the 0th dimension of a dataset) to include

Returns:

pathlib.Path – path of the truncated file
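The two behaviors described above, the default target name and keeping the first n items along the 0th dimension, can be sketched without touching a real file. `default_target` and `truncate` are hypothetical helpers, nested lists stand in for datasets, and reading `{source}` as the source file's stem (its name without the extension) is an assumption.

```python
from pathlib import Path
from typing import Optional


def default_target(source: Path, target: Optional[Path] = None) -> Path:
    """Use the given target, else {source}_truncated.hdf5 next to the source."""
    if target is not None:
        return target
    # Assumption: {source} means the file stem, so the old extension is dropped.
    return source.parent / f"{source.stem}_truncated.hdf5"


def truncate(node, n: int = 10):
    """Keep the first n items along the 0th dimension of every dataset."""
    if isinstance(node, dict):  # a group: recurse into children
        return {name: truncate(child, n) for name, child in node.items()}
    if isinstance(node, list):  # a dataset: slice the 0th dimension only
        return node[:n]
    return node  # scalars and strings are kept as they are


target = default_target(Path("data/session.nwb"))
small = truncate({"g": {"d": list(range(100)), "m": [[1, 2], [3, 4], [5, 6]], "s": "x"}}, n=2)
```

Note that only the outer dimension is cut: each kept row of `m` still has its full length.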