ftag.hdf5.h5writer#
Classes#
H5Writer – Write jet-based data to an HDF5 file.
Module Contents#
- class ftag.hdf5.h5writer.H5Writer#
Write jet-based data to an HDF5 file.
This class creates one dataset per entry in dtypes/shapes and supports both fixed-size and dynamically growing output files. Floating-point fields can optionally be downcast before writing, selected metadata groups can be copied from an existing file, and several HDF5 compression backends are supported.
- dst#
Path to the output file.
- Type:
Path | str
- dtypes#
Mapping from dataset name to output dtype.
- Type:
dict[str, np.dtype]
- shapes#
Mapping from dataset name to output shape. All datasets must agree in their first dimension unless
num_jets is explicitly given.
- Type:
dict[str, tuple[int, …]]
- jets_name#
Name of the jet dataset. This dataset is used to determine batch sizes during writing. Default is
"jets".
- Type:
str, optional
- add_flavour_label#
If True, append a "flavour_label" field of type i4 to the jet dataset if it is not already present. Default is False.
- Type:
bool, optional
- compression#
Compression algorithm to use. Supported values are None, "none", "gzip", "lzf", "lz4", and "zstd". Default is "lz4".
- Type:
str | None, optional
- compression_opts#
Optional compression level or backend-specific compression setting. For "gzip", this is passed as compression_opts to HDF5. For plugin-based compressors such as "lz4" and "zstd", this is interpreted as the plugin compression level and folded into the filter object. Ignored for compressors that do not support an explicit level. Default is None.
- Type:
int | None, optional
- precision#
Floating-point storage precision for output fields. Supported values are:
- "full": cast floating-point fields to np.float32
- "half": cast floating-point fields to np.float16
- None: keep original floating-point dtypes
Default is "full".
- Type:
str | None, optional
- full_precision_vars#
Variables that should keep their original dtype even when precision requests downcasting. Default is None.
- Type:
list[str] | None, optional
- shuffle#
If True, shuffle each batch before writing. Default is True.
- Type:
bool, optional
- num_jets#
Expected total number of jets to write. If given, datasets are created in fixed-size mode. If None, datasets are created in dynamic mode and resized during writing. Default is None.
- Type:
int | None, optional
- groups#
Mapping of metadata group names to extracted group contents to be copied into the output file. Default is None.
- Type:
dict[str, h5py.Group] | None, optional
- Raises:
ValueError – If an unsupported precision or compression setting is provided.
AssertionError – If dataset shapes disagree in their first dimension when num_jets is not explicitly specified.
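The fixed-size vs. dynamic behaviour described above maps directly onto how HDF5 datasets are created: a fixed-size file declares its final length up front, while a dynamic file uses a resizable first dimension. A minimal sketch in plain h5py (this illustrates the underlying mechanism, not the H5Writer API itself):

```python
import h5py
import numpy as np

# In-memory HDF5 file: nothing is written to disk.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    # Fixed-size mode: the final length is declared up front (num_jets given).
    fixed = f.create_dataset("jets_fixed", shape=(1000,), dtype="f4")

    # Dynamic mode: start empty with an unbounded first dimension,
    # then resize as each batch arrives.
    dyn = f.create_dataset("jets_dyn", shape=(0,), maxshape=(None,), dtype="f4")
    batch = np.ones(100, dtype="f4")
    dyn.resize(dyn.shape[0] + len(batch), axis=0)
    dyn[-len(batch):] = batch

    fixed_shape = fixed.shape
    dyn_shape = dyn.shape
```

Fixed-size creation lets HDF5 allocate and chunk the dataset once; dynamic mode trades a small resize cost for not needing to know the total length in advance.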
- dst: pathlib.Path | str#
- dtypes: dict[str, numpy.dtype]#
- shapes: dict[str, tuple[int, ...]]#
- jets_name: str = 'jets'#
- add_flavour_label: bool = False#
- compression: str | None = 'lz4'#
- compression_opts: int | None = None#
- precision: str | None = 'full'#
- full_precision_vars: list[str] | None = None#
- shuffle: bool = True#
- num_jets: int | None = None#
- groups: dict[str, h5py.Group] | None = None#
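The interaction between precision and full_precision_vars can be illustrated on a structured numpy dtype. The helper below is a hypothetical sketch of the documented behaviour, not the method the class uses internally:

```python
import numpy as np

def downcast_dtype(dtype, precision, full_precision_vars=None):
    """Downcast floating-point fields of a structured dtype (sketch).

    Documented behaviour: "full" -> float32, "half" -> float16,
    None -> unchanged; names in full_precision_vars keep their dtype.
    """
    target = {"full": np.float32, "half": np.float16, None: None}[precision]
    keep = set(full_precision_vars or [])
    fields = []
    for name in dtype.names:
        sub = dtype[name]
        if target is not None and name not in keep and np.issubdtype(sub, np.floating):
            sub = np.dtype(target)
        fields.append((name, sub))
    return np.dtype(fields)

# Hypothetical jet fields: "pt" is protected, "eta" is downcast,
# and the integer field is left untouched.
src = np.dtype([("pt", np.float64), ("eta", np.float64), ("n_tracks", np.int32)])
out = downcast_dtype(src, "half", full_precision_vars=["pt"])
```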
- __post_init__() None#
- _resolve_compression(compression: str | None) tuple[str | object | None, int | None]#
Resolve a user-facing compression setting into HDF5 write arguments.
This method converts a human-readable compression identifier into the values used when calling h5py.File.create_dataset(). Built-in HDF5 filters such as "gzip" and "lzf" are returned as strings, optionally together with an HDF5 compression_opts value. Plugin-based filters such as "lz4" and "zstd" are converted into the corresponding filter objects provided by hdf5plugin. In that case, any user-provided compression_opts value is absorbed into the plugin object and the returned HDF5 compression_opts is None.
- Parameters:
compression (str | None) –
Compression algorithm identifier. Supported values are:
- None or "none": disable compression
- "gzip": gzip/deflate compression
- "lzf": built-in fast compression
- "lz4": LZ4 compression via hdf5plugin
- "zstd": Zstandard compression via hdf5plugin
- Returns:
Two-element tuple (compression, compression_opts) suitable for passing to h5py.File.create_dataset(). For no compression, returns (None, None). For built-in HDF5 filters, returns the filter name and optional HDF5 compression_opts. For plugin filters, returns the instantiated plugin object and None.
- Return type:
tuple[str | object | None, int | None]
- Raises:
ValueError – If the provided compression identifier is not supported.
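The resolution logic for the built-in filters can be sketched as a plain mapping; the plugin branch is shown only as comments, since in the real method it would return an hdf5plugin filter object rather than a string. A hypothetical sketch:

```python
def resolve_compression(compression, level=None):
    """Sketch of the documented mapping (built-in filters only)."""
    if compression in (None, "none"):
        return None, None
    if compression == "gzip":
        # Built-in HDF5 filter: the level is passed through as
        # compression_opts.
        return "gzip", level
    if compression == "lzf":
        # lzf takes no level; an explicit one is simply ignored.
        return "lzf", None
    if compression in ("lz4", "zstd"):
        # The real method would return e.g. an hdf5plugin filter object
        # here, absorbing the level into the object and returning None
        # as the HDF5 compression_opts.
        raise NotImplementedError("plugin filters need hdf5plugin")
    raise ValueError(f"unsupported compression: {compression!r}")
```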
- classmethod from_file(source: pathlib.Path | str, num_jets: int | None = 0, variables: dict[str, list[str] | None] | None = None, copy_groups: bool = True, **kwargs: Any) H5Writer#
Construct a writer from the structure of an existing HDF5 file.
This class method inspects an input file and derives output dataset dtypes, shapes, compression, and optionally metadata groups from it. It can be used to create a writer that mirrors the input file layout, optionally restricted to a subset of variables and/or a different number of output jets.
- Parameters:
source (Path | str) – Source HDF5 file from which to infer the output structure.
num_jets (int | None, optional) – If non-zero, override the first dimension of all dataset shapes with this value. If 0, keep the original dataset lengths. Default is 0.
variables (dict[str, list[str] | None] | None, optional) – Optional mapping from dataset name to a list of variables to keep. If provided, output dtypes are reduced accordingly. Default is None.
copy_groups (bool, optional) – If True, copy non-dataset groups from the source file into the created writer. Default is True.
**kwargs (Any) – Additional keyword arguments forwarded to the class constructor. This can be used, for example, to override compression, compression_opts, precision, or full_precision_vars.
- Returns:
Writer initialized from the source file structure.
- Return type:
H5Writer
- Raises:
TypeError – If an object in the source file is neither an HDF5 dataset nor group.
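The variables argument reduces each output dtype to a subset of fields, which in numpy terms is a structured-dtype projection. A hypothetical sketch of that reduction (not the code from_file itself runs):

```python
import numpy as np

def reduce_dtype(dtype, keep):
    """Keep only the named fields of a structured dtype (sketch)."""
    wanted = set(keep)
    return np.dtype([(n, dtype[n]) for n in dtype.names if n in wanted])

# Hypothetical source dtype inferred from an input file.
jets = np.dtype([("pt", "f4"), ("eta", "f4"), ("phi", "f4")])
reduced = reduce_dtype(jets, ["pt", "eta"])
```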
- save_groups(groups: dict[str, dict]) None#
Write extracted metadata groups into the output file.
- Parameters:
groups (dict[str, dict]) – Mapping from group name to extracted group contents.
- create_ds(name: str, dtype: numpy.dtype) None#
Create one output dataset.
- Parameters:
name (str) – Dataset name.
dtype (np.dtype) – Input dtype definition for the dataset.
- close() None#
Close the output file.
- Raises:
ValueError – If the writer is closed before the expected number of jets has been written in fixed-size mode.
- get_attr(name: str, group: str | None = None) Any#
Return an attribute from the output file or one of its groups.
- Parameters:
name (str) – Name of the attribute to retrieve.
group (str | None, optional) – Name of the group from which the attribute should be read. If None, the attribute is read from the root of the HDF5 file.
- Returns:
Value of the requested attribute.
- Return type:
Any
- add_attr(name: str, data: Any, group: str | None = None) None#
Add an attribute to the output file or one of its groups.
- Parameters:
name (str) – Name of the attribute to create.
data (Any) – Attribute value to store. The value must be compatible with HDF5 attribute storage.
group (str | None, optional) – Name of the group to which the attribute should be added. If None, the attribute is written to the root of the HDF5 file.
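These attribute helpers correspond to plain h5py attribute access. A minimal h5py-only sketch of writing and reading attributes at the file root and on a group (group and attribute names here are illustrative):

```python
import h5py

# In-memory HDF5 file: nothing is written to disk.
with h5py.File("attrs.h5", "w", driver="core", backing_store=False) as f:
    f.create_group("jets")
    f.attrs["config"] = "some-config"       # root-level attribute
    f["jets"].attrs["num_jets"] = 100       # group-level attribute

    root_val = f.attrs["config"]
    group_val = f["jets"].attrs["num_jets"]
```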
- copy_attrs(fname: pathlib.Path) None#
Copy file- and dataset-level attributes from another HDF5 file.
- Parameters:
fname (Path) – Path to the source HDF5 file.
- write(data: dict[str, numpy.ndarray]) None#
Write one batch of data to the output file.
- Parameters:
data (dict[str, np.ndarray]) – Mapping from dataset name to batch array.
- Raises:
ValueError – If writing this batch would exceed num_jets in fixed-size mode.
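The fixed-size bookkeeping that write() and close() perform can be sketched as a small counter. This is a hypothetical stand-in for the real class, with no actual I/O:

```python
class FixedSizeGuard:
    """Tracks jets written against an expected total (sketch only)."""

    def __init__(self, num_jets):
        self.num_jets = num_jets
        self.num_written = 0

    def write(self, batch_len):
        # Mirrors the documented ValueError from write().
        if self.num_written + batch_len > self.num_jets:
            raise ValueError("batch would exceed num_jets in fixed-size mode")
        self.num_written += batch_len

    def close(self):
        # Mirrors the documented ValueError from close().
        if self.num_written != self.num_jets:
            raise ValueError(
                f"expected {self.num_jets} jets, wrote {self.num_written}"
            )
```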