ftag.hdf5#

HDF5 package of the atlas-ftag-tools.

Submodules#

Classes#

H5Reader

Reads data from multiple HDF5 files.

H5Writer

Write jet-based data to an HDF5 file.

Functions#

h5_add_column(→ None)

Appends one or more columns to one or more groups in an h5 file.

cast_dtype(→ numpy.dtype)

Cast float type to half or full precision.

get_dtype(→ numpy.dtype)

Return a dtype based on an existing dataset and requested variables.

join_structured_arrays(→ numpy.ndarray)

Join a list of structured numpy arrays.

structured_from_dict(→ numpy.ndarray)

Convert a dict to a structured array.

Package Contents#

ftag.hdf5.h5_add_column(input_file: str | pathlib.Path, output_file: str | pathlib.Path, append_function: collections.abc.Callable | list[collections.abc.Callable], num_jets: int = -1, input_groups: list[str] | None = None, output_groups: list[str] | None = None, reader_kwargs: dict | None = None, writer_kwargs: dict | None = None, overwrite: bool = False) None#

Appends one or more columns to one or more groups in an h5 file.

Parameters:
  • input_file (str | Path) – Input h5 file to read from.

  • output_file (str | Path) – Output h5 file to write to.

  • append_function (Callable | list[Callable]) –

    A function, or list of functions, each of which takes a batch from H5Reader and returns a dictionary of the form:

    {
        "group1": {
            "new_column1": data,
            "new_column2": data,
        },
        "group2": {
            "new_column3": data,
            "new_column4": data,
        },
    }

  • num_jets (int, optional) – Number of jets to read from the input file. If -1, reads all jets. By default -1.

  • input_groups (list[str] | None, optional) – List of groups to read from the input file. If None, reads all groups. By default None.

  • output_groups (list[str] | None, optional) – List of groups to write to the output file. If None, writes all groups. By default None. Note that this is a subset of the input groups, and must include all groups that the append functions wish to write to.

  • reader_kwargs (dict | None, optional) – Additional arguments to pass to the H5Reader. By default None.

  • writer_kwargs (dict | None, optional) – Additional arguments to pass to the H5Writer. By default None.

  • overwrite (bool, optional) – If True, overwrite the output file if it already exists. If False, raise a FileExistsError if the output file exists. By default False.

Raises:
  • FileNotFoundError – If the input file does not exist.

  • FileExistsError – If the output file exists and overwrite is False.

  • ValueError – If the new variable already exists, shape is incorrect, or the output group is not in the input groups.
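
Given the dictionary contract above, a minimal append function might look like the following sketch. The column names and the pt-to-GeV conversion are purely illustrative and not part of the library:

```python
import numpy as np

def add_pt_gev(batch: dict) -> dict:
    """Illustrative append function for h5_add_column.

    Receives a batch as yielded by H5Reader (a dict of structured arrays
    keyed by group name) and returns new columns in the nested-dict form
    described above.
    """
    jets = batch["jets"]
    return {
        "jets": {
            # new column, same length as the batch
            "pt_gev": jets["pt"] / 1000.0,
        }
    }

# A stand-in batch in place of a real H5Reader batch:
batch = {"jets": np.array([(500_000.0,), (750_000.0,)], dtype=[("pt", "f4")])}
new_cols = add_pt_gev(batch)
```

The returned group name must be present in output_groups, and each new array must match the batch length.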

class ftag.hdf5.H5Reader#

Reads data from multiple HDF5 files.

fname#

Path to the HDF5 file or list of paths

Type:

Path | str | list[Path | str]

batch_size#

Number of jets to read at a time, by default 100_000

Type:

int, optional

jets_name#

Name of the jets dataset, by default “jets”

Type:

str, optional

precision#

Cast floats to given precision, by default None

Type:

str | None, optional

shuffle#

Read batches in a shuffled order, by default True

Type:

bool, optional

weights#

Weights for different input datasets, by default None

Type:

list[float] | None, optional

do_remove_inf#

Remove jets with inf values, by default False

Type:

bool, optional

transform#

Transform to apply to data, by default None

Type:

Transform | None, optional

equal_jets#

If True, take the same number of jets (weighted) from each sample, by default False. This is useful when you specify a list of DSIDs for the sample and they are qualitatively different, and you want to ensure that batches always contain jets from all DSIDs; this is used, for example, in the QCD resampling for Xbb. If False, use all jets in each sample, allowing the full available statistics to be used, which is useful, for example, when you have multiple ttbar samples and want to use all available jets from each.

Type:

bool, optional

vds_dir#

Directory where virtual datasets will be stored if wildcard is used, by default None. If None, the virtual files will be created in the same directory as the input files.

Type:

Path | str | None, optional

fname: pathlib.Path | str | list[pathlib.Path | str]#
batch_size: int = 100000#
jets_name: str = 'jets'#
precision: str | None = None#
shuffle: bool = True#
weights: list[float] | None = None#
do_remove_inf: bool = False#
transform: ftag.transform.Transform | None = None#
equal_jets: bool = False#
vds_dir: pathlib.Path | str | None = None#
__post_init__() None#
property num_jets: int#
property files: list[pathlib.Path]#
dtypes(variables: dict[str, list[str]] | None = None) dict[str, numpy.dtype]#
shapes(num_jets: int, groups: list[str] | None = None) dict[str, tuple[int, Ellipsis]]#
stream(variables: dict | None = None, num_jets: int | None = None, cuts: ftag.cuts.Cuts | None = None, start: int = 0, skip_batches: int = 0) collections.abc.Generator#

Generate batches of selected jets.

Parameters:
  • variables (dict | None, optional) – Dictionary of variables to load for each group, by default use all jet variables.

  • num_jets (int | None, optional) – Total number of selected jets to generate, by default all.

  • cuts (Cuts | None, optional) – Selection cuts to apply, by default None

  • start (int, optional) – Starting index of the first jet to read, by default 0

  • skip_batches (int, optional) – Number of batches to skip, by default 0

Yields:

Generator – Generator of batches of selected jets.
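
The batching behaviour can be pictured with a plain numpy generator. This is only an illustration of the access pattern (consecutive slices, optionally visited in shuffled order), not the reader's actual implementation:

```python
import numpy as np

def stream_sketch(data: np.ndarray, batch_size: int, shuffle: bool = True, seed: int = 0):
    # Yield consecutive slices of `data`; with shuffle=True the batch
    # start offsets are visited in a shuffled order, mirroring the
    # shuffle attribute of H5Reader.
    starts = np.arange(0, len(data), batch_size)
    if shuffle:
        np.random.default_rng(seed).shuffle(starts)
    for start in starts:
        yield data[start : start + batch_size]

batches = list(stream_sketch(np.arange(10), batch_size=4, shuffle=False))
# three batches, of sizes 4, 4 and 2
```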

get_batch_reader(variables: dict | None = None, cuts: ftag.cuts.Cuts | None = None, shuffle: bool = True) collections.abc.Callable#

Get a function to read batches of selected jets.

Parameters:
  • variables (dict | None, optional) – Dictionary of variables to load for each group, by default use all jet variables.

  • cuts (Cuts | None, optional) – Selection cuts to apply, by default None

  • shuffle (bool, optional) – Read batches in a shuffled order, by default True

Returns:

Function that takes an index and returns a batch of selected jets.

Return type:

Callable

load(variables: dict | None = None, num_jets: int | None = None, cuts: ftag.cuts.Cuts | None = None) dict#

Load multiple batches of selected jets into memory.

Parameters:
  • variables (dict | None, optional) – Dictionary of variables to load for each group, by default use all jet variables.

  • num_jets (int | None, optional) – Total number of selected jets to load, by default all.

  • cuts (Cuts | None, optional) – Selection cuts to apply, by default None

Returns:

Dictionary of arrays for each group.

Return type:

dict

estimate_available_jets(cuts: ftag.cuts.Cuts, num: int = 1000000) int#

Estimate the number of jets available after selection cuts.

Parameters:
  • cuts (Cuts) – Selection cuts to apply.

  • num (int, optional) – Number of jets to use for the estimation, by default 1_000_000.

Returns:

Estimated number of jets available after selection cuts, rounded down.

Return type:

int
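
The estimation itself is simple scaling arithmetic: apply the cuts to a sample of jets and scale the pass fraction up to the full file. A hedged sketch, with a hypothetical pt threshold standing in for Cuts:

```python
import numpy as np

def estimate_sketch(pt: np.ndarray, total_jets: int, num: int = 1000) -> int:
    # Apply the selection to the first `num` jets, then scale the pass
    # count up to the total number of jets, rounding down.
    sample = pt[:num]
    passed = np.count_nonzero(sample > 20_000.0)
    return passed * total_jets // len(sample)

pt = np.concatenate([np.full(600, 50_000.0), np.full(400, 10_000.0)])
estimate = estimate_sketch(pt, total_jets=1_000_000)  # 60% pass -> 600000
```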

ftag.hdf5.cast_dtype(typestr: str, precision: str) numpy.dtype#

Cast float type to half or full precision.

Parameters:
  • typestr (str) – Input type string

  • precision (str) – Precision to cast to, “half” or “full”

Returns:

Output dtype

Return type:

np.dtype

Raises:

ValueError – If precision is not “half” or “full”
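
A hedged re-implementation of the behaviour described above, assuming non-float types pass through unchanged:

```python
import numpy as np

def cast_dtype_sketch(typestr: str, precision: str) -> np.dtype:
    # Map float types to float16 ("half") or float32 ("full");
    # non-float types are assumed to pass through unchanged.
    dtype = np.dtype(typestr)
    if dtype.kind != "f":
        return dtype
    if precision == "half":
        return np.dtype(np.float16)
    if precision == "full":
        return np.dtype(np.float32)
    raise ValueError(f"precision must be 'half' or 'full', got {precision!r}")
```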

ftag.hdf5.get_dtype(ds: h5py.Dataset, variables: list[str] | None = None, precision: str | None = None, transform: ftag.transform.Transform | None = None, full_precision_vars: list[str] | None = None) numpy.dtype#

Return a dtype based on an existing dataset and requested variables.

Parameters:
  • ds (h5py.Dataset) – Input h5 dataset

  • variables (list[str] | None, optional) – List of variables to include in dtype, by default None

  • precision (str | None, optional) – Precision to cast floats to, “half” or “full”, by default None

  • transform (Transform | None, optional) – Transform to apply to variables names, by default None

  • full_precision_vars (list[str] | None, optional) – List of variables to keep in full precision, by default None

Returns:

Output dtype

Return type:

np.dtype

Raises:

ValueError – If variables are not found in dataset
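
Ignoring the transform and full_precision_vars handling, the field selection and float casting can be sketched in plain numpy:

```python
import numpy as np

def get_dtype_sketch(dtype: np.dtype, variables=None, precision=None) -> np.dtype:
    # Select a subset of fields from a structured dtype and optionally
    # downcast float fields ("half" -> float16, "full" -> float32).
    # Transform and full_precision_vars handling are omitted here.
    names = list(variables) if variables is not None else list(dtype.names)
    if missing := set(names) - set(dtype.names):
        raise ValueError(f"variables not found in dataset: {sorted(missing)}")

    def cast(t: np.dtype) -> np.dtype:
        if precision is not None and t.kind == "f":
            return np.dtype(np.float16 if precision == "half" else np.float32)
        return t

    return np.dtype([(n, cast(dtype[n])) for n in names])

ds_dtype = np.dtype([("pt", "f4"), ("eta", "f4"), ("n_tracks", "i4")])
out = get_dtype_sketch(ds_dtype, variables=["pt", "n_tracks"], precision="half")
```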

ftag.hdf5.join_structured_arrays(arrays: list) numpy.ndarray#

Join a list of structured numpy arrays.

See numpy/numpy#7811

Parameters:

arrays (list) – List of structured numpy arrays to join

Returns:

A merged structured array

Return type:

np.ndarray
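
The field-copy technique referenced above (numpy/numpy#7811) can be illustrated directly in numpy. This is an equivalent sketch, not the library code:

```python
import numpy as np

def join_sketch(arrays: list) -> np.ndarray:
    # Build a combined dtype from all input field descriptions, then copy
    # each field into the merged array (the numpy/numpy#7811 approach).
    descr = [field for a in arrays for field in a.dtype.descr]
    out = np.empty(arrays[0].shape, dtype=descr)
    for a in arrays:
        for name in a.dtype.names:
            out[name] = a[name]
    return out

a = np.array([(1,), (2,)], dtype=[("x", "i4")])
b = np.array([(0.5,), (1.5,)], dtype=[("y", "f4")])
joined = join_sketch([a, b])
```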

ftag.hdf5.structured_from_dict(d: dict[str, numpy.ndarray]) numpy.ndarray#

Convert a dict to a structured array.

Parameters:

d (dict[str, np.ndarray]) – Input dict of numpy arrays

Returns:

Structured array

Return type:

np.ndarray
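
The conversion can be sketched as building a combined dtype from the dict keys and copying each array into its field. All arrays are assumed to share the same length:

```python
import numpy as np

def structured_from_dict_sketch(d: dict) -> np.ndarray:
    # Build a structured dtype from the dict keys and copy each array
    # into the corresponding field.
    dtype = np.dtype([(name, arr.dtype) for name, arr in d.items()])
    out = np.empty(len(next(iter(d.values()))), dtype=dtype)
    for name, arr in d.items():
        out[name] = arr
    return out

arr = structured_from_dict_sketch({"pt": np.array([1.0, 2.0]), "flav": np.array([5, 0])})
```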

class ftag.hdf5.H5Writer#

Write jet-based data to an HDF5 file.

This class creates one dataset per entry in dtypes/shapes and supports both fixed-size and dynamically growing output files. Floating-point fields can optionally be downcast before writing, selected metadata groups can be copied from an existing file, and several HDF5 compression backends are supported.
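
A hypothetical dtypes/shapes configuration for a file with 1000 jets and up to 40 tracks per jet might look as follows (the field names and sizes are illustrative, not taken from the library):

```python
import numpy as np

# Illustrative configuration only. Note that all shapes share the same
# first dimension, as required when num_jets is not given explicitly.
dtypes = {
    "jets": np.dtype([("pt", "f4"), ("eta", "f4")]),
    "tracks": np.dtype([("d0", "f4"), ("valid", "?")]),
}
shapes = {
    "jets": (1000,),
    "tracks": (1000, 40),
}
```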

dst#

Path to the output file.

Type:

Path | str

dtypes#

Mapping from dataset name to output dtype.

Type:

dict[str, np.dtype]

shapes#

Mapping from dataset name to output shape. All datasets must agree in their first dimension unless num_jets is explicitly given.

Type:

dict[str, tuple[int, …]]

jets_name#

Name of the jet dataset. This dataset is used to determine batch sizes during writing. Default is "jets".

Type:

str, optional

add_flavour_label#

If True, append a "flavour_label" field of type i4 to the jet dataset if it is not already present. Default is False.

Type:

bool, optional

compression#

Compression algorithm to use. Supported values are None, "none", "gzip", "lzf", "lz4", and "zstd". Default is "lz4".

Type:

str | None, optional

compression_opts#

Optional compression level or backend-specific compression setting. For "gzip", this is passed as compression_opts to HDF5. For plugin-based compressors such as "lz4" and "zstd", this is interpreted as the plugin compression level and folded into the filter object. Ignored for compressors that do not support an explicit level. Default is None.

Type:

int | None, optional

precision#

Floating-point storage precision for output fields. Supported values are

  • "full": cast floating-point fields to np.float32

  • "half": cast floating-point fields to np.float16

  • None: keep original floating-point dtypes

Default is "full".

Type:

str | None, optional

full_precision_vars#

Variables that should keep their original dtype even when precision requests downcasting. Default is None.

Type:

list[str] | None, optional

shuffle#

If True, shuffle each batch before writing. Default is True.

Type:

bool, optional

num_jets#

Expected total number of jets to write. If given, datasets are created in fixed-size mode. If None, datasets are created in dynamic mode and resized during writing. Default is None.

Type:

int | None, optional

groups#

Mapping of metadata group names to extracted group contents to be copied into the output file. Default is None.

Type:

dict[str, h5py.Group] | None, optional

Raises:
  • ValueError – If an unsupported precision or compression setting is provided.

  • AssertionError – If dataset shapes disagree in their first dimension when num_jets is not explicitly specified.

dst: pathlib.Path | str#
dtypes: dict[str, numpy.dtype]#
shapes: dict[str, tuple[int, Ellipsis]]#
jets_name: str = 'jets'#
add_flavour_label: bool = False#
compression: str | None = 'lz4'#
compression_opts: int | None = None#
precision: str | None = 'full'#
full_precision_vars: list[str] | None = None#
shuffle: bool = True#
num_jets: int | None = None#
groups: dict[str, h5py.Group] | None = None#
__post_init__() None#
_resolve_compression(compression: str | None) tuple[str | object | None, int | None]#

Resolve a user-facing compression setting into HDF5 write arguments.

This method converts a human-readable compression identifier into the values used when calling h5py.File.create_dataset(). Built-in HDF5 filters such as "gzip" and "lzf" are returned as strings, optionally together with an HDF5 compression_opts value. Plugin-based filters such as "lz4" and "zstd" are converted into the corresponding filter objects provided by hdf5plugin. In that case, any user-provided compression_opts value is absorbed into the plugin object and the returned HDF5 compression_opts is None.

Parameters:

compression (str | None) –

Compression algorithm identifier. Supported values are

  • None or "none": disable compression

  • "gzip": gzip/deflate compression

  • "lzf": built-in fast compression

  • "lz4": LZ4 compression via hdf5plugin

  • "zstd": Zstandard compression via hdf5plugin

Returns:

Two-element tuple (compression, compression_opts) suitable for passing to h5py.File.create_dataset().

  • For no compression, returns (None, None).

  • For built-in HDF5 filters, returns the filter name and optional HDF5 compression_opts.

  • For plugin filters, returns the instantiated plugin object and None.

Return type:

tuple[str | object | None, int | None]

Raises:

ValueError – If the provided compression identifier is not supported.
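
The resolution rules described above can be sketched as a plain mapping. Plugin-backed filters are stood in for by placeholder strings here; the real method returns hdf5plugin filter objects instead, with any compression level folded in:

```python
def resolve_compression_sketch(compression, opts=None):
    # Hedged sketch of the documented resolution rules; not the library
    # implementation.
    if compression is None or compression == "none":
        return None, None
    if compression == "gzip":
        return "gzip", opts  # built-in filter, level passed through
    if compression == "lzf":
        return "lzf", None  # built-in filter, no level supported
    if compression in ("lz4", "zstd"):
        # real code: an hdf5plugin filter object absorbing `opts`
        return f"<hdf5plugin.{compression} level={opts}>", None
    raise ValueError(f"unsupported compression: {compression!r}")
```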

classmethod from_file(source: pathlib.Path | str, num_jets: int | None = 0, variables: dict[str, list[str] | None] | None = None, copy_groups: bool = True, **kwargs: Any) H5Writer#

Construct a writer from the structure of an existing HDF5 file.

This class method inspects an input file and derives output dataset dtypes, shapes, compression, and optionally metadata groups from it. It can be used to create a writer that mirrors the input file layout, optionally restricted to a subset of variables and/or a different number of output jets.

Parameters:
  • source (Path | str) – Source HDF5 file from which to infer the output structure.

  • num_jets (int | None, optional) – If non-zero, override the first dimension of all dataset shapes with this value. If 0, keep the original dataset lengths. Default is 0.

  • variables (dict[str, list[str] | None] | None, optional) – Optional mapping from dataset name to a list of variables to keep. If provided, output dtypes are reduced accordingly. Default is None.

  • copy_groups (bool, optional) – If True, copy non-dataset groups from the source file into the created writer. Default is True.

  • **kwargs (Any) – Additional keyword arguments forwarded to the class constructor. This can be used, for example, to override compression, compression_opts, precision, or full_precision_vars.

Returns:

Writer initialized from the source file structure.

Return type:

H5Writer

Raises:

TypeError – If an object in the source file is neither an HDF5 dataset nor group.

save_groups(groups: dict[str, dict]) None#

Write extracted metadata groups into the output file.

Parameters:

groups (dict[str, dict]) – Mapping from group name to extracted group contents.

create_ds(name: str, dtype: numpy.dtype) None#

Create one output dataset.

Parameters:
  • name (str) – Dataset name.

  • dtype (np.dtype) – Input dtype definition for the dataset.

close() None#

Close the output file.

Raises:

ValueError – If the writer is closed before the expected number of jets has been written in fixed-size mode.

get_attr(name: str, group: str | None = None) Any#

Return an attribute from the output file or one of its groups.

Parameters:
  • name (str) – Name of the attribute to retrieve.

  • group (str | None, optional) – Name of the group from which the attribute should be read. If None, the attribute is read from the root of the HDF5 file.

Returns:

Value of the requested attribute.

Return type:

Any

add_attr(name: str, data: Any, group: str | None = None) None#

Add an attribute to the output file or one of its groups.

Parameters:
  • name (str) – Name of the attribute to create.

  • data (Any) – Attribute value to store. The value must be compatible with HDF5 attribute storage.

  • group (str | None, optional) – Name of the group to which the attribute should be added. If None, the attribute is written to the root of the HDF5 file.

copy_attrs(fname: pathlib.Path) None#

Copy file- and dataset-level attributes from another HDF5 file.

Parameters:

fname (Path) – Path to the source HDF5 file.

write(data: dict[str, numpy.ndarray]) None#

Write one batch of data to the output file.

Parameters:

data (dict[str, np.ndarray]) – Mapping from dataset name to batch array.

Raises:

ValueError – If writing this batch would exceed num_jets in fixed-size mode.