HDF5

This is a sandbox file that should eventually be split out into its own pydantic-hdf5 package; for now we are just experimenting here to get our bearings.

Notes

  • Elsewhere in the package, models are built with a set of recursive build steps. Here, some models must be instantiated before others that refer to them, so we flatten the hdf5 file and build each node from a queue.

Mapping operations (mostly TODO atm)

  • Create new models from DynamicTables

  • Handle softlinks as object references and vice versa by adding a path attr

Other TODO:

  • Read metadata only, don’t read all arrays

  • Write, obvi lol.

SKIP_PATTERN = re.compile('(^/specifications.*)|(\\.specloc)')

Nodes to always skip in reading e.g. because they are handled elsewhere
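
As a rough illustration of how such a pattern can be used to filter nodes, here is a minimal sketch. The node paths are hypothetical, and the use of search (rather than match) is an assumption made so that the .specloc alternative can match anywhere in a path; the module may apply the pattern differently.

```python
import re

# Pattern copied from the module-level constant documented above.
SKIP_PATTERN = re.compile(r"(^/specifications.*)|(\.specloc)")

# Hypothetical node paths, for illustration only.
paths = [
    "/acquisition/timeseries",
    "/specifications/core/2.6.0/nwb.base",
    "/.specloc",
]

# Assumption: the pattern is applied with ``search`` so that ``.specloc``
# can match anywhere in the path, not only at the start.
kept = [p for p in paths if not SKIP_PATTERN.search(p)]
print(kept)
```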

hdf_dependency_graph(h5f: Path | File | Group) DiGraph

Directed dependency graph of dataset and group nodes in an NWBFile such that each node n_i is connected to node n_j if

  • n_j is n_i’s child

  • n_i contains a reference to n_j

Resolve references in

  • Attributes

  • Dataset columns

  • Compound dtypes

Edges are labeled with reference or child depending on the type of edge it is, and attributes from the hdf5 file are added as node attributes.

Parameters:

h5f (pathlib.Path | h5py.File | h5py.Group) – NWB file to graph

Returns:

networkx.DiGraph
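
The two edge types described above can be sketched without h5py or networkx. This is only a toy model under stated assumptions: the references mapping and build_edges helper are hypothetical stand-ins for a real file and for the graph construction, and the "child"/"reference" labels follow the description above.

```python
from typing import Dict, List, Tuple

# Hypothetical flattened file: node path -> paths it holds references to.
references: Dict[str, List[str]] = {
    "/": [],
    "/units": [],
    "/units/electrodes": ["/electrodes"],
    "/electrodes": [],
}

def build_edges(refs: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Return (source, target, label) triples for child and reference edges."""
    edges = []
    for path, targets in refs.items():
        if path != "/":
            # Child edge: the parent group points at this node.
            parent = path.rsplit("/", 1)[0] or "/"
            edges.append((parent, path, "child"))
        for target in targets:
            # Reference edge: this node points at the object it references.
            edges.append((path, target, "reference"))
    return edges

edges = build_edges(references)
for source, target, label in edges:
    print(f"{source} -> {target} ({label})")
```

With networkx, triples like these could be passed to DiGraph.add_edge with the label as a keyword attribute; the attribute name used by the actual function is not stated here.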

filter_dependency_graph(g: DiGraph) DiGraph

Remove nodes from a dependency graph if they

  • have no neurodata type AND

  • have no outbound edges

OR

  • are a VectorIndex (which are handled by the dynamictable mixins)
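
The removal rule reads as (no neurodata type AND no outbound edges) OR VectorIndex. A minimal sketch of that predicate on a toy graph follows; the node names, the filter_nodes helper, and the plain dict/set representation are all hypothetical stand-ins for the DiGraph. It makes a single pass, and whether the real function re-checks nodes that become leaves after a removal is not stated above.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical nodes: path -> neurodata type (None when the node has none).
nodes: Dict[str, Optional[str]] = {
    "/units": "Units",
    "/units/spike_times": "VectorData",
    "/units/spike_times_index": "VectorIndex",
    "/scratch": None,        # untyped, but has an outbound edge
    "/scratch/notes": None,  # untyped and no outbound edges
}
edges: Set[Tuple[str, str]] = {
    ("/units", "/units/spike_times"),
    ("/units", "/units/spike_times_index"),
    ("/units/spike_times_index", "/units/spike_times"),
    ("/scratch", "/scratch/notes"),
}

def filter_nodes(
    nodes: Dict[str, Optional[str]], edges: Set[Tuple[str, str]]
) -> Dict[str, Optional[str]]:
    """Keep nodes that do not meet either removal condition (single pass)."""
    has_outbound = {source for source, _ in edges}
    kept = {}
    for path, ntype in nodes.items():
        untyped_leaf = ntype is None and path not in has_outbound
        if untyped_leaf or ntype == "VectorIndex":
            continue
        kept[path] = ntype
    return kept

kept = filter_nodes(nodes, edges)
print(sorted(kept))
```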

class HDF5IO(path: Path)

Read (and eventually write) from an NWB HDF5 file.

read(path: None) NWBFile
read(path: str) BaseModel | Dict[str, BaseModel]

Read data into models from an NWB File.

The read process is in several stages:

  • Use make_provider() to generate any needed LinkML Schema or Pydantic Classes using a SchemaProvider

  • Use flatten_hdf() to flatten the file into a ReadQueue of nodes.

  • Apply the queue’s ReadPhases:

    • plan - trim any blank nodes, sort nodes to read, etc.

    • read - load the actual data into temporary holding objects

    • construct - cast the read data into models.

Read is split into stages like this to handle references between objects, where the read result of one node might depend on another having already been completed. It also allows us to parallelize the operations since each mapping operation is independent of the results of all the others in that pass.
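
To make the staging concrete, here is a toy sketch of a plan / read / construct pipeline. It is not the ReadQueue implementation: the queue and stored dictionaries, the three helper functions, and the use of graphlib for ordering are all assumptions chosen for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical flattened queue: node path -> paths it depends on.
queue = {
    "/electrodes": [],
    "/units": ["/electrodes"],
    "/empty": [],
}
# Hypothetical stored values; None marks a blank node.
stored = {"/electrodes": [1, 2, 3], "/units": {"n": 2}, "/empty": None}

def plan(queue, stored):
    """Drop blank nodes, then order nodes so dependencies come first."""
    live = {
        path: [dep for dep in deps if stored[dep] is not None]
        for path, deps in queue.items()
        if stored[path] is not None
    }
    return list(TopologicalSorter(live).static_order())

def read(order, stored):
    """Load raw values into temporary holding objects (plain dicts here)."""
    return {path: {"path": path, "raw": stored[path]} for path in order}

def construct(order, held, queue):
    """Build final objects, resolving references to already-built ones."""
    built = {}
    for path in order:
        refs = {dep: built[dep] for dep in queue[path] if dep in built}
        built[path] = {"path": path, "value": held[path]["raw"], "refs": refs}
    return built

order = plan(queue, stored)
held = read(order, stored)
models = construct(order, held, queue)
print(order)
```

In this sketch the read step touches each node independently, which is the property that would allow that phase to run in parallel; only construct relies on the ordering produced by plan.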

Todo

Implement reading that skips arrays. Arrays are fast to read with the ArrayProxy class and dask, but there are times when we might want to leave them out of the read entirely. This might be better implemented as a filter on model_dump, but we need to investigate further how best to support reading just metadata, or even a specific field value, or whether we should leave that to other implementations, e.g. wait until SQL export exists rather than rig up a whole query system ourselves.

Parameters:

path (Optional[str]) – If None (default), read whole file. Otherwise, read from specific (hdf5) path and its children

Returns:

NWBFile if path is None, otherwise whatever Model or dictionary of models applies to the requested path

write(path: Path) Never

Write to NWB file

Todo

Implement HDF5 writing.

Need to create inverse mappings that can take pydantic models to hdf5 groups and datasets. If more metadata about the generation process needs to be preserved (e.g. explicitly notating that something is an attribute, dataset, or group), then we can make use of the LinkML_Meta model. If the model to edit has been loaded from an HDF5 file (rather than freshly created), then the hdf5_path should be populated, making mapping straightforward, but we probably want to generalize that to deterministically get hdf5_path from position in the NWBFile object – I think that might require us to explicitly annotate when something is supposed to be a reference vs. the original in the model representation, or else it’s ambiguous.

Otherwise, it should be a matter of detecting changes from the file, if it already exists, and then writing them.

make_provider() SchemaProvider

Create a SchemaProvider by reading specifications from the NWBFile /specifications group, translating them to LinkML, and generating pydantic models.

Returns:

SchemaProvider with the correct versions specified as defaults

Return type:

SchemaProvider

read_specs_as_dicts(group: Group) dict

Utility function to iterate through the /specifications group and load the schemas from it.

Parameters:

group (h5py.Group) – the /specifications group!

Returns:

dict of schema.
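
The kind of traversal this describes can be sketched with nested dictionaries standing in for h5py groups. Everything here is an assumption for illustration: the namespace / version / schema-name layout, the example JSON, and the load_specs helper. The shape of the dict returned by the real function is not stated above.

```python
import json
from typing import Any, Dict

# Stand-in for the /specifications group. Nested dicts play the role of
# groups and strings play the role of datasets holding JSON text.
# Assumption: this layout and these names are illustrative only.
spec_group: Dict[str, Any] = {
    "core": {
        "2.6.0": {
            "namespace": '{"namespaces": [{"name": "core"}]}',
            "nwb.base": '{"groups": []}',
        }
    }
}

def load_specs(group: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively walk nested groups and decode JSON leaves."""
    out: Dict[str, Any] = {}
    for name, item in group.items():
        if isinstance(item, dict):
            out[name] = load_specs(item)  # descend into a sub-group
        else:
            out[name] = json.loads(item)  # decode a schema dataset
    return out

specs = load_specs(spec_group)
print(list(specs["core"]["2.6.0"]))
```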

find_references(h5f: File, path: str) List[str]

Find all objects that make a reference to a given object in

  • Attributes

  • Dataset-level dtype (a dataset of references)

  • Compound datasets (a dataset with one “column” of references)

Notes

This is extremely slow because we collect all references first, rather than checking them as we go and quitting early. PR if you want to make this faster!

Todo

Test find_references()!

Parameters:
  • h5f (h5py.File) – Open hdf5 file

  • path (str) – Path to search for references to

Returns:

List of paths that reference the given path

Return type:

list[str]
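
The note about quitting early can be illustrated with a lazy generator. This is a sketch on a toy index, not the function's implementation: refs_by_object, iter_referrers, find_all, and is_referenced are hypothetical names, and a real version would need to resolve h5py references rather than compare strings.

```python
from typing import Dict, Iterator, List

# Hypothetical index: object path -> paths of the objects it references.
refs_by_object: Dict[str, List[str]] = {
    "/units": ["/electrodes"],
    "/acquisition/ts": ["/electrodes", "/devices/probe"],
    "/devices/probe": [],
}

def iter_referrers(index: Dict[str, List[str]], target: str) -> Iterator[str]:
    """Lazily yield each path whose references include ``target``."""
    for path, targets in index.items():
        if target in targets:
            yield path

def find_all(index: Dict[str, List[str]], target: str) -> List[str]:
    """Collect every referrer (scans the whole index)."""
    return list(iter_referrers(index, target))

def is_referenced(index: Dict[str, List[str]], target: str) -> bool:
    """Stop at the first referrer found instead of scanning everything."""
    return next(iter_referrers(index, target), None) is not None

all_refs = find_all(refs_by_object, "/electrodes")
print(all_refs)
```

A caller that only needs to know whether any reference exists can use the early-exit form, while one that needs the full list still pays for the complete scan.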

truncate_file(source: Path, target: Path | None = None, n: int = 10) Path | None

Create a truncated HDF5 file where only the first few samples are kept.

Used primarily to create testing data from real data without it being so damn big.

Parameters:
  • source (pathlib.Path) – Source hdf5 file

  • target (pathlib.Path) – Optional - target hdf5 file to write to. If None, use {source}_truncated.hdf5

  • n (int) – The number of items from datasets (samples along the 0th dimension of a dataset) to include

Returns:

pathlib.Path path of the truncated file
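
The two behaviours described in the parameters (the default target name and keeping the first n items along the 0th dimension) can be sketched as follows. Lists stand in for HDF5 datasets, and default_target and truncate are hypothetical helpers. Reading "{source}" as the file stem without its extension is an assumption; the real function may build the name differently.

```python
from pathlib import Path
from typing import Dict, List

def default_target(source: Path) -> Path:
    # Assumption: "{source}" refers to the file stem, without its extension.
    return source.with_name(f"{source.stem}_truncated.hdf5")

def truncate(datasets: Dict[str, List], n: int = 10) -> Dict[str, List]:
    """Keep only the first ``n`` items along the 0th dimension."""
    return {name: data[:n] for name, data in datasets.items()}

# Hypothetical datasets: one long 1-D series and one short 2-D array.
datasets = {
    "timestamps": list(range(1000)),
    "positions": [[0.0, 1.0], [2.0, 3.0]],
}

small = truncate(datasets, n=10)
target = default_target(Path("data/session.nwb"))
print(target.name, {name: len(data) for name, data in small.items()})
```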