HDF5
This is a sandbox file that should be split out into its own pydantic-hdf5 package; for now we're just experimenting here to get our bearings.
Notes
Rather than the set of recursive build steps used elsewhere in the package, we flatten the hdf5 file and build each node from a queue, since some models need to be instantiated before others that refer to them.
Mapping operations (mostly TODO atm)
- Create new models from DynamicTables
- Handle softlinks as object references and vice versa by adding a path attr
Other TODO:
- Read metadata only, don’t read all arrays
- Write, obvi lol.
- class HDF5IO(path: Path)
- read(path: None) → NWBFile
- read(path: str) → BaseModel | Dict[str, BaseModel]
Read data into models from an NWB File.
The read process is in several stages:

1. Use make_provider() to generate any needed LinkML Schema or Pydantic Classes using a SchemaProvider
2. flatten_hdf() the file into a ReadQueue of nodes
3. Apply the queue’s ReadPhases:
   - plan - trim any blank nodes, sort nodes to read, etc.
   - read - load the actual data into temporary holding objects
   - construct - cast the read data into models.
Read is split into stages like this to handle references between objects, where the read result of one node might depend on another having already been completed. It also allows us to parallelize the operations since each mapping operation is independent of the results of all the others in that pass.
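For orientation, usage might look like the following sketch - HDF5IO and read() are as documented here, but the file path is hypothetical:

    from pathlib import Path

    # Hypothetical NWB file; HDF5IO is the class documented above.
    io = HDF5IO(path=Path("data/session.nwb"))

    # path=None (the default) reads the whole file into an NWBFile model...
    nwbfile = io.read()

    # ...while a specific hdf5 path reads just that node and its children.
    acquisition = io.read(path="/acquisition")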
Todo
Implement reading, skipping arrays - they are fast to read with the ArrayProxy class and dask, but there are times when we might want to leave them out of the read entirely. This might be better implemented as a filter on model_dump, but we need to investigate further how best to support reading just metadata, or even some specific field value - or whether we should leave that to other implementations (e.g. after we do SQL export, rather than rigging up a whole query system ourselves).

- Parameters:
  path (Optional[str]) – If None (default), read the whole file. Otherwise, read from the specific (hdf5) path and its children
- Returns:
  NWBFile if path is None, otherwise whatever Model or dictionary of models applies to the requested path
- write(path: Path)
Write to NWB file
Todo
Implement HDF5 writing.
Need to create inverse mappings that can take pydantic models to hdf5 groups and datasets. If more metadata about the generation process needs to be preserved (eg. explicitly notating that something is an attribute, dataset, or group), then we can make use of the LinkML_Meta model. If the model to edit has been loaded from an HDF5 file (rather than freshly created), then the hdf5_path should be populated, making mapping straightforward, but we probably want to generalize that to deterministically get hdf5_path from position in the NWBFile object – I think that might require us to explicitly annotate when something is supposed to be a reference vs. the original in the model representation, or else it’s ambiguous.

Otherwise, it should be a matter of detecting changes from the file if it exists already, and then writing them.
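As a hypothetical sketch of the inverse-mapping idea (write_node_sketch and its scalars-as-attributes heuristic are illustrative only, assuming pydantic v2's model_dump and an hdf5_path field populated at read time):

    import h5py

    def write_node_sketch(h5f: h5py.File, model) -> None:
        # If the model was loaded from a file, hdf5_path says where it lives.
        path = getattr(model, "hdf5_path", None)
        if path is None:
            # Deriving hdf5_path from position in the NWBFile object is the
            # open question described above.
            raise NotImplementedError("can't derive hdf5_path for fresh models yet")
        group = h5f.require_group(path)
        for key, value in model.model_dump(exclude={"hdf5_path"}).items():
            # Whether a field is an attribute, dataset, or group would come
            # from LinkML_Meta in practice; this heuristic is a stand-in.
            if isinstance(value, (str, int, float, bool)):
                group.attrs[key] = value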
- make_provider() → SchemaProvider
Create a SchemaProvider by reading specifications from the NWBFile /specifications group, translating them to LinkML, and generating pydantic models.

- Returns:
  Schema Provider with correct versions specified as defaults
- Return type:
  SchemaProvider
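Usage might look like this sketch, where get_class() is a hypothetical accessor standing in for however SchemaProvider actually exposes its generated classes:

    # Build a provider from the specs cached inside the file...
    provider = io.make_provider()

    # ...then resolve generated pydantic classes through it (hypothetical call).
    NWBFile = provider.get_class("core", "NWBFile")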
- read_specs_as_dicts(group: Group) → dict
Utility function to iterate through the /specifications group and load the schemas from it.
- Parameters:
  group (h5py.Group) – the /specifications group!
- Returns:
  dict of schema.
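A minimal sketch of what this iteration might look like, assuming the NWB convention of caching each schema as a JSON string in /specifications/<namespace>/<version>/<source> datasets:

    import json

    import h5py

    def read_specs_sketch(group: h5py.Group) -> dict:
        # Walk /specifications/<namespace>/<version>/<source>, decoding each
        # dataset's JSON string into a dict.
        specs: dict = {}
        for ns_name, ns_group in group.items():
            for version, version_group in ns_group.items():
                for source, dataset in version_group.items():
                    raw = dataset[()]
                    if isinstance(raw, bytes):
                        raw = raw.decode("utf-8")
                    specs.setdefault(ns_name, {}).setdefault(version, {})[source] = json.loads(raw)
        return specs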
- find_references(h5f: File, path: str) → List[str]
Find all objects that make a reference to a given object in:
- Attributes
- Dataset-level dtype (a dataset of references)
- Compound datasets (a dataset with one “column” of references)
Notes
This is extremely slow because we collect all references first, rather than checking them as we go and quitting early. PR if you want to make this faster!
Todo
Test find_references()!
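One possible shape of the brute-force strategy described in the Notes, using h5py's check_dtype(ref=...) introspection - a sketch, not the package's implementation:

    from typing import List

    import h5py
    import numpy as np

    def find_references_sketch(h5f: h5py.File, path: str) -> List[str]:
        # Collect every referring object before returning - exactly the
        # slow strategy the Notes warn about.
        target_name = h5f[path].name
        referrers: List[str] = []

        def _refers(value) -> bool:
            # Dereference and compare canonical names; skip null references.
            return (
                isinstance(value, h5py.Reference)
                and bool(value)
                and h5f[value].name == target_name
            )

        def _visit(name: str, obj) -> None:
            # 1) References stored in attributes
            if any(_refers(v) for v in obj.attrs.values()):
                referrers.append(name)
            elif isinstance(obj, h5py.Dataset):
                # 2) Dataset whose dtype is itself a reference dtype
                if h5py.check_dtype(ref=obj.dtype) is h5py.Reference:
                    if any(_refers(v) for v in np.atleast_1d(obj[()]).flat):
                        referrers.append(name)
                # 3) Compound dataset with a reference "column"
                elif obj.dtype.fields:
                    for field in obj.dtype.fields:
                        sub_dtype = obj.dtype.fields[field][0]
                        if h5py.check_dtype(ref=sub_dtype) is h5py.Reference:
                            if any(_refers(v) for v in np.atleast_1d(obj[field]).flat):
                                referrers.append(name)

        h5f.visititems(_visit)
        return list(dict.fromkeys(referrers))  # de-duplicate, keep order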
- truncate_file(source: Path, target: Path | None = None, n: int = 10) → Path
Create a truncated HDF5 file where only the first few samples are kept.
Used primarily to create testing data from real data without it being so damn big.
- Parameters:
  source (pathlib.Path) – Source hdf5 file
  target (pathlib.Path) – Optional - target hdf5 file to write to. If None, use {source}_truncated.hdf5
  n (int) – The number of items from datasets (samples along the 0th dimension of a dataset) to include
- Returns:
  pathlib.Path path of the truncated file
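One way truncate_file could be implemented - a sketch under the assumption that truncation means recreating each dataset with only its first n samples along axis 0 (object references, links, and compression settings are not handled here):

    from pathlib import Path
    from typing import Optional

    import h5py

    def truncate_file_sketch(source: Path, target: Optional[Path] = None, n: int = 10) -> Path:
        if target is None:
            target = source.parent / f"{source.stem}_truncated.hdf5"

        with h5py.File(source, "r") as src, h5py.File(target, "w") as dst:
            dst.attrs.update(src.attrs)

            def _copy(name: str, obj) -> None:
                if isinstance(obj, h5py.Group):
                    dst.require_group(name).attrs.update(obj.attrs)
                elif isinstance(obj, h5py.Dataset):
                    # Keep only the first n samples along the 0th dimension;
                    # scalar datasets are copied whole.
                    data = obj[:n] if obj.ndim > 0 else obj[()]
                    dst.create_dataset(name, data=data).attrs.update(obj.attrs)

            src.visititems(_copy)

        return target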