ftag.hdf5.h5reader#

Classes#

H5SingleReader

H5Reader

Reads data from multiple HDF5 files.

Module Contents#

class ftag.hdf5.h5reader.H5SingleReader#
fname: pathlib.Path | str#
batch_size: int = 100000#
jets_name: str = 'jets'#
precision: str | None = None#
shuffle: bool = True#
do_remove_inf: bool = False#
transform: ftag.transform.Transform | None = None#
__post_init__() None#
property num_jets: int#
get_attr(name, group=None)#
empty(ds: h5py.Dataset, variables: list[str]) numpy.ndarray#
read_chunk(ds: h5py.Dataset, array: numpy.ndarray, low: int) numpy.ndarray#
remove_inf(data: dict) dict#
_process_batch(data: dict, cuts: ftag.cuts.Cuts | None = None) dict#

Apply cuts and transformations to the batch.

Parameters:
  • data (dict) – Dictionary of arrays for each group.

  • cuts (Cuts | None, optional) – Selection cuts to apply, by default None

Returns:

Processed data dictionary with arrays for each group, after applying cuts, optionally removing infs, and optionally applying the transformation.

Return type:

dict
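
The order of operations described above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the real method works on arrays and a Cuts object, whereas here `process_batch`, the `keep` predicate, and the `transform` callable are hypothetical stand-ins, each group is a plain list with one entry per jet, and only the jets group is checked for infs.

```python
import math


def process_batch(data, keep=None, do_remove_inf=False, transform=None):
    """Apply a selection, optional inf removal, and an optional transform.

    `data` maps a group name to a list with one entry per jet. `keep` is a
    hypothetical stand-in for a Cuts object: a predicate on one jet record.
    """
    n = len(data["jets"])
    # 1. Selection: decide per jet, then apply the same indices to every group.
    idx = [i for i in range(n) if keep is None or keep(data["jets"][i])]
    data = {group: [rows[i] for i in idx] for group, rows in data.items()}

    # 2. Optional removal of jets that contain an infinite value
    #    (simplified: only the jets group is inspected here).
    if do_remove_inf:
        ok = [
            i
            for i, jet in enumerate(data["jets"])
            if not any(math.isinf(v) for v in jet.values())
        ]
        data = {group: [rows[i] for i in ok] for group, rows in data.items()}

    # 3. Optional transformation of the surviving batch.
    if transform is not None:
        data = transform(data)
    return data


batch = {
    "jets": [
        {"pt": 50.0, "eta": 0.1},
        {"pt": 10.0, "eta": 0.3},
        {"pt": 80.0, "eta": math.inf},
    ],
    "tracks": [["t0"], ["t1"], ["t2"]],
}
out = process_batch(batch, keep=lambda jet: jet["pt"] > 20.0, do_remove_inf=True)
```

In this toy batch the second jet fails the selection and the third is dropped for its infinite `eta`, so only the first jet and its tracks remain.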

stream(variables: dict | None = None, num_jets: int | None = None, cuts: ftag.cuts.Cuts | None = None, start: int = 0, skip_batches: int = 0) collections.abc.Generator#
get_batch_reader(variables: dict | None = None, cuts: ftag.cuts.Cuts | None = None)#

Get a function to read batches of selected jets.

Parameters:
  • variables (dict | None, optional) – Dictionary of variables to read for each group, by default use all jet variables.

  • cuts (Cuts | None, optional) – Selection cuts to apply, by default None

Returns:

Function that takes an index and returns a batch of selected jets.

Return type:

function
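
The idea of returning a function that maps an index to a batch can be sketched with a closure. This is an illustration only: `make_batch_reader` is a hypothetical name, the data is a plain list rather than an HDF5 dataset, and treating the index as a batch number (rather than a jet offset) is an assumption.

```python
def make_batch_reader(rows, batch_size):
    """Return a function mapping a batch index to a slice of `rows`."""

    def read_batch(idx):
        low = idx * batch_size
        # Slicing past the end yields a short or empty batch.
        return rows[low : low + batch_size]

    return read_batch


read_batch = make_batch_reader(list(range(10)), batch_size=4)
first = read_batch(0)
last = read_batch(2)
```

The closure captures the data source and batch size once, so callers only need to supply an index.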

class ftag.hdf5.h5reader.H5Reader#

Reads data from multiple HDF5 files.

fname#

Path to the HDF5 file or list of paths

Type:

Path | str | list[Path | str]

batch_size#

Number of jets to read at a time, by default 100_000

Type:

int, optional

jets_name#

Name of the jets dataset, by default “jets”

Type:

str, optional

precision#

Cast floats to given precision, by default None

Type:

str | None, optional

shuffle#

Read batches in a shuffled order, by default True

Type:

bool, optional

weights#

Weights for different input datasets, by default None

Type:

list[float] | None, optional

do_remove_inf#

Remove jets with inf values, by default False

Type:

bool, optional

transform#

Transform to apply to data, by default None

Type:

Transform | None, optional

equal_jets#

Take the same number of jets (weighted) from each sample, by default False. Setting this to True is useful when the sample is specified as a list of qualitatively different DSIDs and every batch should contain jets from all of them, as in the QCD resampling for Xbb. If False, all jets in each sample are used, so the full available statistics are accessible; this is useful, for example, when combining multiple ttbar samples and all available jets from each should be used.

Type:

bool, optional
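
One way to picture the two modes is the following sketch. It is an interpretation of the description above, not the library's code: `jets_per_sample` and its arguments are hypothetical, and the rule used for the balanced case (cap the total so that every sample can still supply its weighted share) is an assumption.

```python
import math


def jets_per_sample(available, weights, equal_jets):
    """Number of jets to take from each sample (illustrative only)."""
    if not equal_jets:
        # Use the full statistics of every sample.
        return list(available)
    # Largest total for which every sample can still supply its weighted share.
    total = min(n / w for n, w in zip(available, weights))
    return [math.floor(total * w) for w in weights]


available = [1000, 200]
weights = [0.5, 0.5]
balanced = jets_per_sample(available, weights, equal_jets=True)
everything = jets_per_sample(available, weights, equal_jets=False)
```

With equal weights, the balanced mode is limited by the smallest sample, while the other mode simply exposes every jet that is available.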

fname: pathlib.Path | str | list[pathlib.Path | str]#
batch_size: int = 100000#
jets_name: str = 'jets'#
precision: str | None = None#
shuffle: bool = True#
weights: list[float] | None = None#
do_remove_inf: bool = False#
transform: ftag.transform.Transform | None = None#
equal_jets: bool = False#
__post_init__() None#
property num_jets: int#
property files: list[pathlib.Path]#
dtypes(variables: dict[str, list[str]] | None = None) dict[str, numpy.dtype]#
shapes(num_jets: int, groups: list[str] | None = None) dict[str, tuple[int, Ellipsis]]#
stream(variables: dict | None = None, num_jets: int | None = None, cuts: ftag.cuts.Cuts | None = None, start: int = 0, skip_batches: int = 0) collections.abc.Generator#

Generate batches of selected jets.

Parameters:
  • variables (dict | None, optional) – Dictionary of variables to read for each group, by default use all jet variables.

  • num_jets (int | None, optional) – Total number of selected jets to generate, by default all.

  • cuts (Cuts | None, optional) – Selection cuts to apply, by default None

  • start (int, optional) – Starting index of the first jet to read, by default 0

  • skip_batches (int, optional) – Number of batches to skip, by default 0

Yields:

Generator – Generator of batches of selected jets.
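
A typical consumer loops over the generator and handles one batch at a time. The sketch below only relies on the `stream` signature shown above; it assumes each batch is a dictionary keyed by group name. `count_streamed_jets` and `FakeReader` are hypothetical helpers, the latter standing in for a real reader so the example runs without any HDF5 file.

```python
def count_streamed_jets(reader, variables, num_jets=None, jets_name="jets"):
    """Iterate over reader.stream(...) and record the size of each batch."""
    sizes = []
    for batch in reader.stream(variables=variables, num_jets=num_jets):
        # Each batch is assumed to be a dict of per-group arrays.
        sizes.append(len(batch[jets_name]))
    return sizes


class FakeReader:
    """Hypothetical stand-in for a reader; cuts and skip_batches are ignored."""

    def __init__(self, n_total, batch_size):
        self.n_total = n_total
        self.batch_size = batch_size

    def stream(self, variables=None, num_jets=None, cuts=None, start=0, skip_batches=0):
        total = self.n_total if num_jets is None else min(num_jets, self.n_total)
        for low in range(start, total, self.batch_size):
            high = min(low + self.batch_size, total)
            yield {"jets": list(range(low, high))}


sizes = count_streamed_jets(
    FakeReader(n_total=10, batch_size=4), {"jets": ["pt"]}, num_jets=9
)
```

In practice `reader` would be an H5Reader instance; because batches are produced lazily, only one batch needs to be held in memory at a time.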

get_batch_reader(variables: dict | None = None, cuts: ftag.cuts.Cuts | None = None, shuffle: bool = True)#

Get a function to read batches of selected jets.

Parameters:
  • variables (dict | None, optional) – Dictionary of variables to read for each group, by default use all jet variables.

  • cuts (Cuts | None, optional) – Selection cuts to apply, by default None

  • shuffle (bool, optional) – Read batches in a shuffled order, by default True

Returns:

Function that takes an index and returns a batch of selected jets.

Return type:

function
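
The effect of the shuffle option on such a function can be sketched as a permutation of batch positions. This is a toy model on a plain list, not the library's implementation; `make_shuffled_batch_reader` and its `seed` argument are hypothetical, and whether jets are also shuffled within a batch is not addressed here.

```python
import random


def make_shuffled_batch_reader(rows, batch_size, shuffle=True, seed=0):
    """Return a function mapping an index to a batch, optionally in shuffled order."""
    n_batches = -(-len(rows) // batch_size)  # ceiling division
    order = list(range(n_batches))
    if shuffle:
        # Permute the order in which batches are visited, not the rows themselves.
        random.Random(seed).shuffle(order)

    def read_batch(idx):
        low = order[idx] * batch_size
        return rows[low : low + batch_size]

    return read_batch


rows = list(range(10))
shuffled = make_shuffled_batch_reader(rows, batch_size=4, shuffle=True, seed=1)
seen = [x for i in range(3) for x in shuffled(i)]
ordered = make_shuffled_batch_reader(rows, batch_size=4, shuffle=False)
```

Every row is still visited exactly once; only the order in which the batches come back changes.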

load(variables: dict | None = None, num_jets: int | None = None, cuts: ftag.cuts.Cuts | None = None) dict#

Load multiple batches of selected jets into memory.

Parameters:
  • variables (dict | None, optional) – Dictionary of variables to read for each group, by default use all jet variables.

  • num_jets (int | None, optional) – Total number of selected jets to load, by default all.

  • cuts (Cuts | None, optional) – Selection cuts to apply, by default None

Returns:

Dictionary of arrays for each group.

Return type:

dict
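
Conceptually, loading into memory amounts to concatenating batches group by group until enough jets have been collected. The sketch below shows that idea on plain lists; `load_all` is a hypothetical helper, and the assumption that the method behaves like a concatenation of streamed batches is not confirmed by the text above.

```python
def load_all(batches, num_jets=None, jets_name="jets"):
    """Concatenate per-group batches into one dictionary, up to num_jets jets."""
    merged = {}
    for batch in batches:
        for group, rows in batch.items():
            merged.setdefault(group, []).extend(rows)
        # Stop reading as soon as enough jets have been collected.
        if num_jets is not None and len(merged[jets_name]) >= num_jets:
            break
    if num_jets is not None:
        # Trim every group to the same number of jets.
        merged = {group: rows[:num_jets] for group, rows in merged.items()}
    return merged


batches = [
    {"jets": [1, 2, 3], "tracks": ["a", "b", "c"]},
    {"jets": [4, 5, 6], "tracks": ["d", "e", "f"]},
]
data = load_all(iter(batches), num_jets=4)
```

Unlike streaming, this holds all requested jets in memory at once, so `num_jets` is the natural way to bound memory use.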

estimate_available_jets(cuts: ftag.cuts.Cuts, num: int = 1000000) int#

Estimate the number of jets available after selection cuts.

Parameters:
  • cuts (Cuts) – Selection cuts to apply.

  • num (int, optional) – Number of jets to use for the estimation, by default 1_000_000.

Returns:

Estimated number of jets available after selection cuts, rounded down.

Return type:

int
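
The estimate can be thought of as a pass fraction measured on a subsample and scaled to the full dataset. The following sketch shows only that arithmetic, with rounding down as stated above; `estimate_available` and its arguments are hypothetical, and any additional safety margin the library may apply is not modelled.

```python
def estimate_available(n_total, n_sampled, n_passed):
    """Scale the pass fraction measured on a subsample up to the full dataset."""
    if n_sampled == 0:
        return 0
    # Integer arithmetic keeps the result exact and rounds down.
    return n_total * n_passed // n_sampled


large = estimate_available(n_total=5_000_000, n_sampled=1_000_000, n_passed=123_457)
small = estimate_available(n_total=10, n_sampled=3, n_passed=2)
```

Rounding down keeps the estimate conservative, which matters when the result is used to decide how many jets can safely be requested.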