HDF5
This is a sandbox file that should eventually be split out into its own pydantic-hdf5 package; for now we are just experimenting here to get our bearings.
Notes
Unlike the recursive build steps used elsewhere in the package, here some models must be instantiated before the models that refer to them, so we flatten the hdf5 file and build each item from a queue.
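As a rough illustration of the queue idea in the note above, here is a minimal sketch using only the standard library. The `flat` mapping, the `build_from_queue` name, and the dict that stands in for a model are all hypothetical; this is not the package's actual implementation, only one way an item can be deferred until everything it refers to has been built.

```python
from collections import deque

# Hypothetical flattened file: each path maps to the paths it refers to.
flat = {
    "/acquisition/ts": ["/general/devices/probe"],
    "/general/devices/probe": [],
    "/processing/units": ["/acquisition/ts"],
}


def build_from_queue(flat):
    """Build every item only after all items it refers to have been built."""
    queue = deque(flat)
    built = {}
    stalled = 0  # consecutive items requeued without any progress
    while queue:
        path = queue.popleft()
        deps = flat[path]
        if all(dep in built for dep in deps):
            # Stand-in for instantiating a pydantic model.
            built[path] = {"path": path, "refs": [built[dep] for dep in deps]}
            stalled = 0
        else:
            # Not ready yet: push to the back and try again later.
            queue.append(path)
            stalled += 1
            if stalled > len(queue):
                raise ValueError("unresolvable (cyclic or missing) references")
    return built


order = list(build_from_queue(flat))
```

The stall counter is a design choice of this sketch: without it, a cyclic or dangling reference would make the loop spin forever.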
Mapping operations (mostly TODO atm)
Create new models from DynamicTables
Handle softlinks as object references and vice versa by adding a path attr
Other TODO:
Read metadata only, don’t read all arrays
Write, obvi lol.
- SKIP_PATTERN = re.compile('(^/specifications.*)|(\\.specloc)')
Nodes to always skip in reading e.g. because they are handled elsewhere
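To make the pattern concrete, the sketch below applies it to a few made-up paths. The example paths are hypothetical, and using `search` (rather than `match`) is an assumption about how the pattern is applied, not something the documentation states.

```python
import re

# Same pattern as documented above, written as a raw string.
SKIP_PATTERN = re.compile(r"(^/specifications.*)|(\.specloc)")

# Hypothetical paths, for illustration only.
paths = [
    "/specifications/core/2.6.0/nwb.base",
    "/acquisition/ts/data",
    "/.specloc",
    "/general/specifications_note",
]

# Assumption: the pattern is applied with search(), so ".specloc" may
# appear anywhere, while "/specifications" must start the path.
skipped = [p for p in paths if SKIP_PATTERN.search(p)]
```

Note that the first alternative is anchored with `^`, so a path that merely contains the word "specifications" further down is not skipped.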
- class HDF5IO(path: Path)
Read (and eventually write) from an NWB HDF5 file.
- read(path: None) → NWBFile
- read(path: str) → BaseModel | Dict[str, BaseModel]
Read data into models from an NWB File.
Todo
Document this!
- Parameters:
path (Optional[str]) – If None (default), read whole file. Otherwise, read from specific (hdf5) path and its children
- Returns:
NWBFile if path is None, otherwise whatever Model or dictionary of models applies to the requested path
- write(path: Path) → Never
Write to NWB file
Todo
Implement HDF5 writing.
Need to create inverse mappings that can take pydantic models to hdf5 groups and datasets. If more metadata about the generation process needs to be preserved (e.g. explicitly noting that something is an attribute, dataset, or group), then we can make use of the LinkML_Metamodel.
If the model to edit has been loaded from an HDF5 file (rather than freshly created), then the hdf5_path should be populated, making mapping straightforward. We probably want to generalize that to deterministically get hdf5_path from position in the NWBFile object. I think that might require us to explicitly annotate when something is supposed to be a reference vs. the original in the model representation, or else it’s ambiguous.
Otherwise, it should be a matter of detecting changes from the file if it exists already, and then writing them.
- make_provider() → SchemaProvider
Create a SchemaProvider by reading specifications from the NWBFile /specifications group, translating them to LinkML, and generating pydantic models
- Returns:
Schema Provider with correct versions specified as defaults
- Return type:
SchemaProvider
- hdf_dependency_graph(h5f: Path | File | Group) → DiGraph
Directed dependency graph of dataset and group nodes in an NWBFile such that each node n_i is connected to node n_j if
n_j is n_i’s child
n_i contains a reference to n_j
Resolve references in
Attributes
Dataset columns
Compound dtypes
Edges are labeled with reference or child depending on the type of edge it is, and attributes from the hdf5 file are added as node attributes.
- Parameters:
h5f (pathlib.Path | h5py.File) – NWB file to graph
- Returns:
networkx.DiGraph
- filter_dependency_graph(g: DiGraph) → DiGraph
Remove nodes from a dependency graph if they
have no neurodata type AND
have no outbound edges
OR
match the SKIP_PATTERN
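The precedence of the AND/OR in the rule above is easy to misread, so here is a minimal standard-library sketch of it. It uses plain dicts and tuples in place of a networkx.DiGraph, and the `filter_nodes` name, the example nodes, the `neurodata_type` attribute check, and the use of `search` are all assumptions for illustration rather than the package's actual code.

```python
import re

SKIP_PATTERN = re.compile(r"(^/specifications.*)|(\.specloc)")


def filter_nodes(nodes, edges):
    """nodes: path -> attribute dict; edges: (source, target, label) tuples."""
    has_outbound = {source for source, _, _ in edges}

    def should_drop(path):
        # (no neurodata type AND no outbound edges) OR matches SKIP_PATTERN
        untyped_leaf = (
            "neurodata_type" not in nodes[path] and path not in has_outbound
        )
        return untyped_leaf or bool(SKIP_PATTERN.search(path))

    kept = {path: attrs for path, attrs in nodes.items() if not should_drop(path)}
    # Keep only edges whose endpoints both survived.
    kept_edges = [e for e in edges if e[0] in kept and e[1] in kept]
    return kept, kept_edges


# Hypothetical graph, for illustration only.
nodes = {
    "/acquisition": {},                                   # untyped, but has a child
    "/acquisition/ts": {"neurodata_type": "TimeSeries"},  # typed leaf
    "/scratch": {},                                       # untyped leaf
    "/specifications": {},                                # has a child, but skipped
    "/specifications/core": {},
}
edges = [
    ("/acquisition", "/acquisition/ts", "child"),
    ("/specifications", "/specifications/core", "child"),
]

kept, kept_edges = filter_nodes(nodes, edges)
```

This sketch makes a single pass, so a node whose only outbound edges pointed at removed nodes is not re-evaluated; whether the real function cascades is not stated in the documentation.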
- read_specs_as_dicts(group: Group) → dict
Utility function to iterate through the /specifications group and load the schemas from it.
- Parameters:
group (h5py.Group) – the /specifications group!
- Returns:
dict of schema.
- find_references(h5f: File, path: str) → List[str]
Find all objects that make a reference to a given object in
Attributes
Dataset-level dtype (a dataset of references)
Compound datasets (a dataset with one “column” of references)
Notes
This is extremely slow because we collect all references first, rather than checking them as we go and quitting early. PR if you want to make this faster!
Todo
Test find_references()!
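The note above mentions checking references as we go and quitting early. One possible shape for that, sketched with the standard library only, is to expose each object's references as a lazy iterator and stop at the first match. The `find_referrers` and `iter_refs` names and the `REFS` table are hypothetical stand-ins; nothing here touches a real HDF5 file.

```python
from typing import Callable, Iterable, Iterator, List


def find_referrers(
    objects: Iterable[str],
    iter_refs: Callable[[str], Iterator[str]],
    target: str,
) -> List[str]:
    """Return every object whose references include target."""
    referrers = []
    for path in objects:
        # any() stops consuming the iterator at the first match, so the
        # remaining references of this object are never resolved.
        if any(ref == target for ref in iter_refs(path)):
            referrers.append(path)
    return referrers


# Hypothetical reference table standing in for an HDF5 file.
REFS = {"/a": ["/x", "/target", "/y"], "/b": ["/y"], "/c": ["/target"]}
resolved = []


def iter_refs(path):
    for ref in REFS[path]:
        resolved.append(ref)  # record each reference actually resolved
        yield ref


result = find_referrers(REFS, iter_refs, "/target")
```

In this toy table, four of the five references are resolved, because the scan of `/a` stops once `/target` is seen.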
- get_references(obj: Dataset | Group) → List[str]
Find all hdf5 object references in a dataset or group
Locate references in
Attrs
Scalar datasets
Single-column datasets
Multi-column datasets
Distinct from find_references(), which finds references to an object.
- Parameters:
obj (h5py.Dataset | h5py.Group) – Object to evaluate
- Returns:
List of paths that are referenced within this object
- Return type:
List[str]
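For orientation, the sketch below shows how object references in attributes and in a dataset with a reference dtype can be read back with h5py. It is a simplified illustration, not the function documented above: the `referenced_paths` name and the in-memory demo file are hypothetical, and compound dtypes and array-valued attributes are deliberately left out.

```python
import h5py


def referenced_paths(obj):
    """Paths referenced from obj's scalar attributes and, for datasets, its values."""
    f = obj.file
    # References stored as scalar attributes.
    refs = [v for v in obj.attrs.values() if isinstance(v, h5py.Reference)]
    # References stored as dataset values (dataset-level reference dtype).
    if isinstance(obj, h5py.Dataset) and h5py.check_ref_dtype(obj.dtype) is not None:
        data = obj[()]
        refs.extend([data] if obj.shape == () else data.ravel())
    # Null references are falsy; dereference the rest to get their paths.
    return [f[ref].name for ref in refs if ref]


# In-memory demo file; nothing is written to disk.
f = h5py.File("refs_sketch.h5", "w", driver="core", backing_store=False)
a = f.create_group("a")
b = f.create_dataset("b", data=[1, 2, 3])
holder = f.create_group("holder")
holder.attrs["target"] = a.ref
refs = f.create_dataset("refs", (2,), dtype=h5py.ref_dtype)
refs[0] = a.ref
refs[1] = b.ref

from_attrs = referenced_paths(holder)
from_dataset = referenced_paths(refs)
```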
- resolve_hardlink(obj: Group | Dataset) → str
Unhelpfully, hardlinks are pretty challenging to detect with h5py, so we have to do extra work to check if an item is “real” or a hardlink to another item.
Particularly, an item will be excluded from the visititems method used by flatten_hdf() if it is a hardlink rather than an “original” dataset, meaning that we don’t even have them in our sources list when we start reading.
We basically dereference the object and return that path instead of the path given by the object’s name
- truncate_file(source: Path, target: Path | None = None, n: int = 10) → Path | None
Create a truncated HDF5 file where only the first few samples are kept.
Used primarily to create testing data from real data without it being so damn big
- Parameters:
source (pathlib.Path) – Source hdf5 file
target (pathlib.Path) – Optional - target hdf5 file to write to. If None, use {source}_truncated.hdf5
n (int) – The number of items from datasets (samples along the 0th dimension of a dataset) to include
- Returns:
pathlib.Path path of the truncated file