HDF5

This is a sandbox file that should eventually be split out into its own pydantic-hdf5 package; for now we are just experimenting here to get our bearings.

Notes

  • Elsewhere in the package we use a set of recursive build steps. Here, some models have to be instantiated before the models that refer to them, so instead we flatten the hdf5 file and build each node from a queue.
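The queue idea in the note above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code used by this module: `build_in_dependency_order`, its arguments, and the example paths are all hypothetical, and "building" a node is reduced to recording its path.

```python
from collections import deque


def build_in_dependency_order(nodes, depends_on):
    """Order flattened paths so each comes after everything it depends on.

    nodes: iterable of flattened hdf5-style paths (hypothetical input)
    depends_on: dict mapping a path to the set of paths it refers to
    """
    queue = deque(nodes)
    built = []      # paths in the order they were "built"
    done = set()    # same paths, for fast membership checks
    stalled = 0     # consecutive deferrals, used to detect cycles
    while queue:
        path = queue.popleft()
        pending = depends_on.get(path, set()) - done
        if pending:
            # Something this node refers to is not built yet: defer it.
            queue.append(path)
            stalled += 1
            if stalled > len(queue):
                raise ValueError(f"unresolvable dependencies: {sorted(queue)}")
            continue
        built.append(path)
        done.add(path)
        stalled = 0
    return built


# Hypothetical example: a group waits for its child, which waits for a reference.
example_nodes = ["/acquisition/ts", "/general/electrodes", "/acquisition"]
example_deps = {
    "/acquisition/ts": {"/general/electrodes"},
    "/acquisition": {"/acquisition/ts"},
}
example_order = build_in_dependency_order(example_nodes, example_deps)
```

Deferring a node by pushing it back onto the queue keeps the loop flat: no recursion is needed, and a cycle or a missing dependency shows up as a queue that stops shrinking.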

Mapping operations (mostly TODO atm)

  • Create new models from DynamicTables

  • Handle softlinks as object references and vice versa by adding a path attr

Other TODO:

  • Read metadata only, don’t read all arrays

  • Write, obvi lol.

SKIP_PATTERN = re.compile('(^/specifications.*)|(\\.specloc)')

Nodes to always skip when reading, e.g. because they are handled elsewhere
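As a quick illustration of what the pattern matches, here is a small sketch. The `should_skip` helper and the example paths are hypothetical, and it assumes the pattern is applied with `search()` rather than `match()`, which is not confirmed by the text above.

```python
import re

# The documented pattern, written as a raw string.
SKIP_PATTERN = re.compile(r"(^/specifications.*)|(\.specloc)")


def should_skip(path: str) -> bool:
    """Hypothetical helper: True if SKIP_PATTERN matches anywhere in the path."""
    # Assumption: search() is used so the unanchored `.specloc` alternative
    # can match in the middle or at the end of a path.
    return SKIP_PATTERN.search(path) is not None


# Illustrative paths only; they are not taken from a real file.
example_paths = [
    "/specifications/core/2.6.0/nwb.base",
    "/.specloc",
    "/acquisition/ts/data",
]
example_kept = [p for p in example_paths if not should_skip(p)]
```

Only the first alternative is anchored with `^`, so `/specifications` is skipped only at the root of the file, while `.specloc` is skipped wherever it appears.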

class HDF5IO(path: Path)

Read (and eventually write) from an NWB HDF5 file.

read(path: None) NWBFile
read(path: str) BaseModel | Dict[str, BaseModel]

Read data into models from an NWB File.

Todo

Document this!

Parameters:

path (Optional[str]) – If None (default), read the whole file. Otherwise, read from a specific (hdf5) path and its children

Returns:

NWBFile if path is None, otherwise whatever Model or dictionary of models applies to the requested path

write(path: Path) Never

Write to NWB file

Todo

Implement HDF5 writing.

Need to create inverse mappings that can take pydantic models to hdf5 groups and datasets. If more metadata about the generation process needs to be preserved (e.g. explicitly noting that something is an attribute, dataset, or group), then we can make use of the LinkML_Meta model. If the model to edit has been loaded from an HDF5 file (rather than freshly created), then the hdf5_path should be populated, making the mapping straightforward, but we probably want to generalize that to deterministically get hdf5_path from position in the NWBFile object – I think that might require us to explicitly annotate when something is supposed to be a reference vs. the original in the model representation, or else it’s ambiguous.

Otherwise, it should be a matter of detecting changes relative to the file, if it already exists, and then writing them.

make_provider() SchemaProvider

Create a SchemaProvider by reading specifications from the NWBFile /specifications group, translating them to LinkML, and generating pydantic models

Returns:

Schema Provider with correct versions specified as defaults

Return type:

SchemaProvider

hdf_dependency_graph(h5f: Path | File | Group) DiGraph

Directed dependency graph of dataset and group nodes in an NWBFile such that each node n_i is connected to node n_j if

  • n_j is n_i’s child

  • n_i contains a reference to n_j

Resolve references in

  • Attributes

  • Dataset columns

  • Compound dtypes

Edges are labeled with reference or child depending on the type of edge it is, and attributes from the hdf5 file are added as node attributes.

Parameters:

h5f (pathlib.Path | h5py.File | h5py.Group) – NWB file to graph

Returns:

networkx.DiGraph

filter_dependency_graph(g: DiGraph) DiGraph

Remove nodes from a dependency graph if they

  • have no neurodata type AND

  • have no outbound edges

OR

  • match SKIP_PATTERN
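The removal rule above can be sketched with plain dictionaries standing in for the networkx graph. This is a stand-in illustration, not the function's implementation. It assumes the rule reads as (no neurodata type AND no outbound edges) OR (matches SKIP_PATTERN), that the type is stored under a `neurodata_type` attribute key, and that filtering is a single pass. The `filter_nodes` name and the example data are hypothetical.

```python
import re

SKIP_PATTERN = re.compile(r"(^/specifications.*)|(\.specloc)")


def filter_nodes(node_attrs, edges):
    """Return the set of node paths that survive the filter.

    node_attrs: dict mapping a path to its attribute dict
    edges: iterable of (source, target) path pairs
    """
    has_outbound = {src for src, _ in edges}
    keep = set()
    for path, attrs in node_attrs.items():
        # Assumed attribute key for the neurodata type.
        untyped_leaf = "neurodata_type" not in attrs and path not in has_outbound
        skipped = SKIP_PATTERN.search(path) is not None
        if not (untyped_leaf or skipped):
            keep.add(path)
    return keep


# Hypothetical graph: an untyped group, a typed child, and an untyped leaf.
example_attrs = {
    "/acquisition": {},
    "/acquisition/ts": {"neurodata_type": "TimeSeries"},
    "/acquisition/ts/data": {},
    "/specifications/core": {},
    "/specifications/core/2.6.0": {},
}
example_edges = [
    ("/acquisition", "/acquisition/ts"),
    ("/acquisition/ts", "/acquisition/ts/data"),
    ("/specifications/core", "/specifications/core/2.6.0"),
]
example_kept_nodes = filter_nodes(example_attrs, example_edges)
```

In this reading, an untyped group such as `/acquisition` survives because it still has outbound edges, while an untyped leaf such as `/acquisition/ts/data` is dropped.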

read_specs_as_dicts(group: Group) dict

Utility function to iterate through the /specifications group and load the schemas from it.

Parameters:

group (h5py.Group) – the /specifications group!

Returns:

dict of schema.

find_references(h5f: File, path: str) List[str]

Find all objects that make a reference to a given object in

  • Attributes

  • Dataset-level dtype (a dataset of references)

  • Compound datasets (a dataset with one “column” of references)

Notes

This is extremely slow because we collect all references first, rather than checking them as we go and quitting early. PR if you want to make this faster!

Todo

Test find_references() !

Parameters:
  • h5f (h5py.File) – Open hdf5 file

  • path (str) – Path to search for references to

Returns:

List of paths that reference the given path

Return type:

list[str]

get_attr_references(obj: Dataset | Group) dict[str, str]

Get any references in object attributes

get_dataset_references(obj: Dataset | Group) list[str] | dict[str, str]

Get references in datasets

get_references(obj: Dataset | Group) List[str]

Find all hdf5 object references in a dataset or group

Locate references in

  • Attrs

  • Scalar datasets

  • Single-column datasets

  • Multi-column datasets

Distinct from find_references(), which finds references to an object.

Parameters:

obj (h5py.Dataset | h5py.Group) – Object to evaluate

Returns:

List of paths that are referenced within this object

Return type:

List[str]

Unhelpfully, hardlinks are pretty challenging to detect with h5py, so we have to do extra work to check if an item is “real” or a hardlink to another item.

In particular, an item will be excluded from the visititems method used by flatten_hdf() if it is a hardlink rather than an “original” dataset, meaning that we don’t even have it in our sources list when we start reading.

We basically dereference the object and return that path instead of the path given by the object’s name

truncate_file(source: Path, target: Path | None = None, n: int = 10) Path | None

Create a truncated HDF5 file where only the first few samples are kept.

Used primarily to create testing data from real data without it being so damn big

Parameters:
  • source (pathlib.Path) – Source hdf5 file

  • target (pathlib.Path) – Optional - target hdf5 file to write to. If None, use {source}_truncated.hdf5

  • n (int) – The number of items from datasets (samples along the 0th dimension of a dataset) to include

Returns:

pathlib.Path path of the truncated file