ftag.hdf5#

HDF5 package of the atlas-ftag-tools.

Submodules#

Classes#

H5Reader

Reads data from multiple HDF5 files.

H5Writer

Write jet-based data to an HDF5 file.

Functions#

h5_add_column(→ None)

Appends one or more columns to one or more groups in an h5 file.

cast_dtype(→ numpy.dtype)

Cast float type to half or full precision.

get_dtype(→ numpy.dtype)

Return a dtype based on an existing dataset and requested variables.

join_structured_arrays(→ numpy.ndarray)

Join a list of structured numpy arrays.

structured_from_dict(→ numpy.ndarray)

Convert a dict to a structured array.

Package Contents#

ftag.hdf5.h5_add_column(input_file: str | pathlib.Path, output_file: str | pathlib.Path, append_function: collections.abc.Callable | list[collections.abc.Callable], num_jets: int = -1, input_groups: list[str] | None = None, output_groups: list[str] | None = None, reader_kwargs: dict | None = None, writer_kwargs: dict | None = None, overwrite: bool = False) None#

Appends one or more columns to one or more groups in an h5 file.

Parameters:
  • input_file (str | Path) – Input h5 file to read from.

  • output_file (str | Path) – Output h5 file to write to.

  • append_function (Callable | list[Callable]) –

    A function, or list of functions, each of which takes a batch from H5Reader and returns a dictionary of the form:

    {
        "group1": {
            "new_column1": data,
            "new_column2": data,
        },
        "group2": {
            "new_column3": data,
            "new_column4": data,
        },
    }

  • num_jets (int, optional) – Number of jets to read from the input file. If -1, reads all jets. By default -1.

  • input_groups (list[str] | None, optional) – List of groups to read from the input file. If None, reads all groups. By default None.

  • output_groups (list[str] | None, optional) – List of groups to write to the output file. If None, writes all groups. By default None. Note that this is a subset of the input groups, and must include all groups that the append functions wish to write to.

  • reader_kwargs (dict | None, optional) – Additional arguments to pass to the H5Reader. By default None.

  • writer_kwargs (dict | None, optional) – Additional arguments to pass to the H5Writer. By default None.

  • overwrite (bool, optional) – If True, overwrite the output file if it already exists. If False, raise a FileExistsError if the output file exists. By default False.

Raises:
  • FileNotFoundError – If the input file does not exist.

  • FileExistsError – If the output file exists and overwrite is False.

  • ValueError – If the new variable already exists, shape is incorrect, or the output group is not in the input groups.
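
Given the dictionary contract above, a minimal append function might look like the following sketch. The column names and the pt-to-GeV conversion are purely illustrative and not part of the library:

```python
import numpy as np

def add_pt_gev(batch: dict) -> dict:
    """Illustrative append function for h5_add_column.

    Receives a batch as yielded by H5Reader (a dict of structured arrays
    keyed by group name) and returns new columns in the nested-dict form
    described above.
    """
    jets = batch["jets"]
    return {
        "jets": {
            # new column, same length as the batch
            "pt_gev": jets["pt"] / 1000.0,
        }
    }

# A stand-in batch in place of a real H5Reader batch:
batch = {"jets": np.array([(500_000.0,), (750_000.0,)], dtype=[("pt", "f4")])}
new_cols = add_pt_gev(batch)
```

The returned group name must be present in output_groups, and each new array must match the batch length.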

class ftag.hdf5.H5Reader#

Reads data from multiple HDF5 files.

fname#

Path to the HDF5 file or list of paths

Type:

Path | str | list[Path | str]

batch_size#

Number of jets to read at a time, by default 100_000

Type:

int, optional

jets_name#

Name of the jets dataset, by default “jets”

Type:

str, optional

precision#

Cast floats to given precision, by default None

Type:

str | None, optional

shuffle#

Read batches in a shuffled order, by default True

Type:

bool, optional

weights#

Weights for different input datasets, by default None

Type:

list[float] | None, optional

do_remove_inf#

Remove jets with inf values, by default False

Type:

bool, optional

transform#

Transform to apply to data, by default None

Type:

Transform | None, optional

equal_jets#

If True, take the same number of jets (weighted) from each sample, by default False. This is useful when you specify a list of DSIDs for the sample and they are qualitatively different, and you want to ensure that batches always contain jets from all DSIDs; this is used, for example, in the QCD resampling for Xbb. If False, use all jets in each sample, allowing the full available statistics to be used, which is useful, for example, when you have multiple ttbar samples and want to use all available jets from each.

Type:

bool, optional

vds_dir#

Directory where virtual datasets will be stored if wildcard is used, by default None. If None, the virtual files will be created in the same directory as the input files.

Type:

Path | str | None, optional

fname: pathlib.Path | str | list[pathlib.Path | str]#
batch_size: int = 100000#
jets_name: str = 'jets'#
precision: str | None = None#
shuffle: bool = True#
weights: list[float] | None = None#
do_remove_inf: bool = False#
transform: ftag.transform.Transform | None = None#
equal_jets: bool = False#
vds_dir: pathlib.Path | str | None = None#
__post_init__() None#
property num_jets: int#
property files: list[pathlib.Path]#
dtypes(variables: dict[str, list[str]] | None = None) dict[str, numpy.dtype]#
shapes(num_jets: int, groups: list[str] | None = None) dict[str, tuple[int, Ellipsis]]#
stream(variables: dict | None = None, num_jets: int | None = None, cuts: ftag.cuts.Cuts | None = None, start: int = 0, skip_batches: int = 0) collections.abc.Generator#

Generate batches of selected jets.

Parameters:
  • variables (dict | None, optional) – Dictionary of variables to load for each group, by default use all jet variables.

  • num_jets (int | None, optional) – Total number of selected jets to generate, by default all.

  • cuts (Cuts | None, optional) – Selection cuts to apply, by default None

  • start (int, optional) – Starting index of the first jet to read, by default 0

  • skip_batches (int, optional) – Number of batches to skip, by default 0

Yields:

Generator – Generator of batches of selected jets.
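
The batching behaviour can be pictured with a plain numpy generator. This is only an illustration of the access pattern (consecutive slices, optionally visited in shuffled order), not the reader's actual implementation:

```python
import numpy as np

def stream_sketch(data: np.ndarray, batch_size: int, shuffle: bool = True, seed: int = 0):
    # Yield consecutive slices of `data`; with shuffle=True the batch
    # start offsets are visited in a shuffled order, mirroring the
    # shuffle attribute of H5Reader.
    starts = np.arange(0, len(data), batch_size)
    if shuffle:
        np.random.default_rng(seed).shuffle(starts)
    for start in starts:
        yield data[start : start + batch_size]

batches = list(stream_sketch(np.arange(10), batch_size=4, shuffle=False))
# three batches, of sizes 4, 4 and 2
```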

get_batch_reader(variables: dict | None = None, cuts: ftag.cuts.Cuts | None = None, shuffle: bool = True) collections.abc.Callable#

Get a function to read batches of selected jets.

Parameters:
  • variables (dict | None, optional) – Dictionary of variables to load for each group, by default use all jet variables.

  • cuts (Cuts | None, optional) – Selection cuts to apply, by default None

  • shuffle (bool, optional) – Read batches in a shuffled order, by default True

Returns:

Function that takes an index and returns a batch of selected jets.

Return type:

Callable

load(variables: dict | None = None, num_jets: int | None = None, cuts: ftag.cuts.Cuts | None = None) dict#

Load multiple batches of selected jets into memory.

Parameters:
  • variables (dict | None, optional) – Dictionary of variables to load for each group, by default use all jet variables.

  • num_jets (int | None, optional) – Total number of selected jets to load, by default all.

  • cuts (Cuts | None, optional) – Selection cuts to apply, by default None

Returns:

Dictionary of arrays for each group.

Return type:

dict

estimate_available_jets(cuts: ftag.cuts.Cuts, num: int = 1000000) int#

Estimate the number of jets available after selection cuts.

Parameters:
  • cuts (Cuts) – Selection cuts to apply.

  • num (int, optional) – Number of jets to use for the estimation, by default 1_000_000.

Returns:

Estimated number of jets available after selection cuts, rounded down.

Return type:

int
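
The estimation itself is simple scaling arithmetic: apply the cuts to a sample of jets and scale the pass fraction up to the full file. A hedged sketch, with a hypothetical pt threshold standing in for Cuts:

```python
import numpy as np

def estimate_sketch(pt: np.ndarray, total_jets: int, num: int = 1000) -> int:
    # Apply the selection to the first `num` jets, then scale the pass
    # count up to the total number of jets, rounding down.
    sample = pt[:num]
    passed = np.count_nonzero(sample > 20_000.0)
    return passed * total_jets // len(sample)

pt = np.concatenate([np.full(600, 50_000.0), np.full(400, 10_000.0)])
estimate = estimate_sketch(pt, total_jets=1_000_000)  # 60% pass -> 600000
```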

ftag.hdf5.cast_dtype(typestr: str, precision: str) numpy.dtype#

Cast float type to half or full precision.

Parameters:
  • typestr (str) – Input type string

  • precision (str) – Precision to cast to, “half” or “full”

Returns:

Output dtype

Return type:

np.dtype

Raises:

ValueError – If precision is not “half” or “full”
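
A hedged re-implementation of the behaviour described above, assuming non-float types pass through unchanged:

```python
import numpy as np

def cast_dtype_sketch(typestr: str, precision: str) -> np.dtype:
    # Map float types to float16 ("half") or float32 ("full");
    # non-float types are assumed to pass through unchanged.
    dtype = np.dtype(typestr)
    if dtype.kind != "f":
        return dtype
    if precision == "half":
        return np.dtype(np.float16)
    if precision == "full":
        return np.dtype(np.float32)
    raise ValueError(f"precision must be 'half' or 'full', got {precision!r}")
```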

ftag.hdf5.get_dtype(ds: h5py.Dataset, variables: list[str] | None = None, precision: str | None = None, transform: ftag.transform.Transform | None = None, full_precision_vars: list[str] | None = None) numpy.dtype#

Return a dtype based on an existing dataset and requested variables.

Parameters:
  • ds (h5py.Dataset) – Input h5 dataset

  • variables (list[str] | None, optional) – List of variables to include in dtype, by default None

  • precision (str | None, optional) – Precision to cast floats to, “half” or “full”, by default None

  • transform (Transform | None, optional) – Transform to apply to variables names, by default None

  • full_precision_vars (list[str] | None, optional) – List of variables to keep in full precision, by default None

Returns:

Output dtype

Return type:

np.dtype

Raises:

ValueError – If variables are not found in dataset
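
Ignoring the transform and full_precision_vars handling, the field selection and float casting can be sketched in plain numpy:

```python
import numpy as np

def get_dtype_sketch(dtype: np.dtype, variables=None, precision=None) -> np.dtype:
    # Select a subset of fields from a structured dtype and optionally
    # downcast float fields ("half" -> float16, "full" -> float32).
    # Transform and full_precision_vars handling are omitted here.
    names = list(variables) if variables is not None else list(dtype.names)
    if missing := set(names) - set(dtype.names):
        raise ValueError(f"variables not found in dataset: {sorted(missing)}")

    def cast(t: np.dtype) -> np.dtype:
        if precision is not None and t.kind == "f":
            return np.dtype(np.float16 if precision == "half" else np.float32)
        return t

    return np.dtype([(n, cast(dtype[n])) for n in names])

ds_dtype = np.dtype([("pt", "f4"), ("eta", "f4"), ("n_tracks", "i4")])
out = get_dtype_sketch(ds_dtype, variables=["pt", "n_tracks"], precision="half")
```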

ftag.hdf5.join_structured_arrays(arrays: list) numpy.ndarray#

Join a list of structured numpy arrays.

See numpy/numpy#7811

Parameters:

arrays (list) – List of structured numpy arrays to join

Returns:

A merged structured array

Return type:

np.ndarray
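
The field-copy technique referenced above (numpy/numpy#7811) can be illustrated directly in numpy. This is an equivalent sketch, not the library code:

```python
import numpy as np

def join_sketch(arrays: list) -> np.ndarray:
    # Build a combined dtype from all input field descriptions, then copy
    # each field into the merged array (the numpy/numpy#7811 approach).
    descr = [field for a in arrays for field in a.dtype.descr]
    out = np.empty(arrays[0].shape, dtype=descr)
    for a in arrays:
        for name in a.dtype.names:
            out[name] = a[name]
    return out

a = np.array([(1,), (2,)], dtype=[("x", "i4")])
b = np.array([(0.5,), (1.5,)], dtype=[("y", "f4")])
joined = join_sketch([a, b])
```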

ftag.hdf5.structured_from_dict(d: dict[str, numpy.ndarray]) numpy.ndarray#

Convert a dict to a structured array.

Parameters:

d (dict[str, np.ndarray]) – Input dict of numpy arrays

Returns:

Structured array

Return type:

np.ndarray
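
The conversion can be sketched as building a combined dtype from the dict keys and copying each array into its field. All arrays are assumed to share the same length:

```python
import numpy as np

def structured_from_dict_sketch(d: dict) -> np.ndarray:
    # Build a structured dtype from the dict keys and copy each array
    # into the corresponding field.
    dtype = np.dtype([(name, arr.dtype) for name, arr in d.items()])
    out = np.empty(len(next(iter(d.values()))), dtype=dtype)
    for name, arr in d.items():
        out[name] = arr
    return out

arr = structured_from_dict_sketch({"pt": np.array([1.0, 2.0]), "flav": np.array([5, 0])})
```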

class ftag.hdf5.H5Writer#

Write jet-based data to an HDF5 file.

This class creates one dataset per entry in dtypes/shapes and supports both fixed-size and dynamically growing output files. Floating-point fields can optionally be downcast before writing, selected metadata groups can be copied from an existing file, and several HDF5 compression backends are supported.
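
A hypothetical dtypes/shapes configuration for a file with 1000 jets and up to 40 tracks per jet might look as follows (the field names and sizes are illustrative, not taken from the library):

```python
import numpy as np

# Illustrative configuration only. Note that all shapes share the same
# first dimension, as required when num_jets is not given explicitly.
dtypes = {
    "jets": np.dtype([("pt", "f4"), ("eta", "f4")]),
    "tracks": np.dtype([("d0", "f4"), ("valid", "?")]),
}
shapes = {
    "jets": (1000,),
    "tracks": (1000, 40),
}
```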

dst#

Path to the output file.

Type:

Path | str

dtypes#

Mapping from dataset name to output dtype.

Type:

dict[str, np.dtype]

shapes#

Mapping from dataset name to output shape. All datasets must agree in their first dimension unless num_jets is explicitly given.

Type:

dict[str, tuple[int, …]]

jets_name#

Name of the jet dataset. This dataset is used to determine batch sizes during writing. Default is "jets".

Type:

str, optional

add_flavour_label#

If True, append a "flavour_label" field of type i4 to the jet dataset if it is not already present. Default is False.

Type:

bool, optional

compression#

Compression algorithm to use. Supported values are None, "none", "gzip", "lzf", "lz4", and "zstd". Default is "lz4".

Type:

str | None, optional

compression_opts#

Optional compression level or backend-specific compression setting. For "gzip", this is passed as compression_opts to HDF5. For plugin-based compressors such as "lz4" and "zstd", this is interpreted as the plugin compression level and folded into the filter object. Ignored for compressors that do not support an explicit level. Default is None.

Type:

int | None, optional

precision#

Floating-point storage precision for output fields. Supported values are

  • "full": cast floating-point fields to np.float32

  • "half": cast floating-point fields to np.float16

  • None: keep original floating-point dtypes

Default is "full".

Type:

str | None, optional

full_precision_vars#

Variables that should keep their original dtype even when precision requests downcasting. Default is None.

Type:

list[str] | None, optional

shuffle#

If True, shuffle each batch before writing. Default is True.

Type:

bool, optional

num_jets#

Expected total number of jets to write. If given, datasets are created in fixed-size mode. If None, datasets are created in dynamic mode and resized during writing. Default is None.

Type:

int | None, optional

groups#

Mapping of metadata group names to extracted group contents to be copied into the output file. Default is None.

Type:

dict[str, h5py.Group] | None, optional

Raises:
  • ValueError – If an unsupported precision or compression setting is provided.

  • AssertionError – If dataset shapes disagree in their first dimension when num_jets is not explicitly specified.

dst: pathlib.Path | str#
dtypes: dict[str, numpy.dtype]#
shapes: dict[str, tuple[int, Ellipsis]]#
jets_name: str = 'jets'#
add_flavour_label: bool = False#
compression: str | None = 'lz4'#
compression_opts: int | None = None#
precision: str | None = 'full'#
full_precision_vars: list[str] | None = None#
shuffle: bool = True#
num_jets: int | None = None#
groups: dict[str, h5py.Group] | None = None#
__post_init__() None#
_resolve_compression(compression: str | None) tuple[str | object | None, int | None]#

Resolve a user-facing compression setting into HDF5 write arguments.

This method converts a human-readable compression identifier into the values used when calling h5py.File.create_dataset(). Built-in HDF5 filters such as "gzip" and "lzf" are returned as strings, optionally together with an HDF5 compression_opts value. Plugin-based filters such as "lz4" and "zstd" are converted into the corresponding filter objects provided by hdf5plugin. In that case, any user-provided compression_opts value is absorbed into the plugin object and the returned HDF5 compression_opts is None.

Parameters:

compression (str | None) –

Compression algorithm identifier. Supported values are

  • None or "none": disable compression

  • "gzip": gzip/deflate compression

  • "lzf": built-in fast compression

  • "lz4": LZ4 compression via hdf5plugin

  • "zstd": Zstandard compression via hdf5plugin

Returns:

Two-element tuple (compression, compression_opts) suitable for passing to h5py.File.create_dataset().

  • For no compression, returns (None, None).

  • For built-in HDF5 filters, returns the filter name and optional HDF5 compression_opts.

  • For plugin filters, returns the instantiated plugin object and None.

Return type:

tuple[str | object | None, int | None]

Raises:

ValueError – If the provided compression identifier is not supported.
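
The resolution rules described above can be sketched as a plain mapping. Plugin-backed filters are stood in for by placeholder strings here; the real method returns hdf5plugin filter objects instead, with any compression level folded in:

```python
def resolve_compression_sketch(compression, opts=None):
    # Hedged sketch of the documented resolution rules; not the library
    # implementation.
    if compression is None or compression == "none":
        return None, None
    if compression == "gzip":
        return "gzip", opts  # built-in filter, level passed through
    if compression == "lzf":
        return "lzf", None  # built-in filter, no level supported
    if compression in ("lz4", "zstd"):
        # real code: an hdf5plugin filter object absorbing `opts`
        return f"<hdf5plugin.{compression} level={opts}>", None
    raise ValueError(f"unsupported compression: {compression!r}")
```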

classmethod from_file(source: pathlib.Path | str, num_jets: int | None = 0, variables: dict[str, list[str] | None] | None = None, copy_groups: bool = True, **kwargs: Any) H5Writer#

Construct a writer from the structure of an existing HDF5 file.

This class method inspects an input file and derives output dataset dtypes, shapes, compression, and optionally metadata groups from it. It can be used to create a writer that mirrors the input file layout, optionally restricted to a subset of variables and/or a different number of output jets.

Parameters:
  • source (Path | str) – Source HDF5 file from which to infer the output structure.

  • num_jets (int | None, optional) – If non-zero, override the first dimension of all dataset shapes with this value. If 0, keep the original dataset lengths. Default is 0.

  • variables (dict[str, list[str] | None] | None, optional) – Optional mapping from dataset name to a list of variables to keep. If provided, output dtypes are reduced accordingly. Default is None.

  • copy_groups (bool, optional) – If True, copy non-dataset groups from the source file into the created writer. Default is True.

  • **kwargs (Any) – Additional keyword arguments forwarded to the class constructor. This can be used, for example, to override compression, compression_opts, precision, or full_precision_vars.

Returns:

Writer initialized from the source file structure.

Return type:

H5Writer

Raises:

TypeError – If an object in the source file is neither an HDF5 dataset nor group.

save_groups(groups: dict[str, dict]) None#

Write extracted metadata groups into the output file.

Parameters:

groups (dict[str, dict]) – Mapping from group name to extracted group contents.

create_ds(name: str, dtype: numpy.dtype) None#

Create one output dataset.

Parameters:
  • name (str) – Dataset name.

  • dtype (np.dtype) – Input dtype definition for the dataset.

close() None#

Close the output file.

Raises:

ValueError – If the writer is closed before the expected number of jets has been written in fixed-size mode.

get_attr(name: str, group: str | None = None) Any#

Return an attribute from the output file or one of its groups.

Parameters:
  • name (str) – Name of the attribute to retrieve.

  • group (str | None, optional) – Name of the group from which the attribute should be read. If None, the attribute is read from the root of the HDF5 file.

Returns:

Value of the requested attribute.

Return type:

Any

add_attr(name: str, data: Any, group: str | None = None) None#

Add an attribute to the output file or one of its groups.

Parameters:
  • name (str) – Name of the attribute to create.

  • data (Any) – Attribute value to store. The value must be compatible with HDF5 attribute storage.

  • group (str | None, optional) – Name of the group to which the attribute should be added. If None, the attribute is written to the root of the HDF5 file.

copy_attrs(fname: pathlib.Path) None#

Copy file- and dataset-level attributes from another HDF5 file.

Parameters:

fname (Path) – Path to the source HDF5 file.

write(data: dict[str, numpy.ndarray]) None#

Write one batch of data to the output file.

Parameters:

data (dict[str, np.ndarray]) – Mapping from dataset name to batch array.

Raises:

ValueError – If writing this batch would exceed num_jets in fixed-size mode.