Examples#
This page gives examples of how the code can be used. The first part covers scripts and functions that can be used directly from the terminal, while the second part takes a more detailed look at the different functionalities.
Calculate WPs#
This package contains a script to calculate tagger working points (WPs).
The script is working_points.py and can be run after installing this package with
wps \
--ttbar "path/to/ttbar/*.h5" \
--tagger GN2v01 \
--fc 0.1
Both the --tagger and --fc options accept a list if you want to get the WPs for multiple taggers.
If you are doing c-tagging or xbb-tagging, dedicated fx arguments are available (you can find them all with -h).
If you want to use the ttbar WPs to get the efficiencies and rejections for the zprime sample, you can add --zprime "path/to/zprime/*.h5" to the command.
Note that a default selection of $p_T > 250$ GeV is applied to jets in the zprime sample.
If instead of defining the working points for a series of signal efficiencies, you wish to calculate a WP corresponding to a specific background rejection, the --rejection option can be given along with the desired background.
By default the working points are printed to the terminal, but you can save the results to a YAML file with the --outfile option.
See wps --help for more options and information.
Calculate efficiency at discriminant cut#
The same script can be used to calculate the efficiency and rejection values at a given discriminant cut value.
The script working_points.py can be run after installing this package as follows
wps \
--ttbar "path/to/ttbar/*.h5" \
--tagger GN2v01 \
--fx 0.1 \
--disc_cuts 1.0 1.5
The --tagger, --fx, and --outfile options behave the same as in the ‘Calculate WPs’ script described above.
H5 Utils#
Create virtual file#
This package contains a script to easily merge a set of H5 files. A virtual file is a fast and lightweight way to wrap a set of files. See the h5py documentation for more information on virtual datasets.
The script is vds.py and can be run after installing this package with
vds <pattern> <output path>
The <pattern> argument should be a glob pattern enclosed in quotes, for example "dsid/path/*.h5".
If you don’t want to glob an entire path, or if you want to merge files that do not contain a common string, you can instead use the regex option as follows:
vds <regex_pattern> <output path> --use-regex
where <regex_pattern> can contain a list of files, for example "dsid/path/(myfile_id1|myfile_id5|other_file).h5".
You can also regex-match files located in a different folder by specifying the --regex_path option.
See vds --help for more options and information.
h5move#
A script to move/rename datasets inside an h5file. Useful for correcting discrepancies between group names. See h5move.py for more info.
h5split#
A script to split a large h5 file into several smaller files. Useful if output files are too large for EOS/grid storage. See h5split.py for more info.
Extensive Examples#
The content below is generated automatically from ftag/example.ipynb.
from __future__ import annotations
%load_ext autoreload
%autoreload 2
Example Usage#
This notebook demonstrates how to use this package. The main features are:
- A Cuts class that can be used to select jets.
- A set of Flavours defining common jet flavours.
- An H5Reader class allowing for batched reading of jets across multiple files.
- An H5Writer class allowing for batched writing of jets.
We can start by getting some dummy data to work with.
from ftag import get_mock_file
fname, f = get_mock_file()
jets = f["jets"]
Cuts#
The Cuts class provides an interface for applying selections to structured numpy arrays loaded from HDF5 files.
To take a look, first import the Cuts:
from ftag import Cuts
Instances of Cuts can be defined from lists of strings or tuples of strings and values. For example
kinematic_cuts = Cuts.from_list(["pt > 20e3", "abs_eta < 2.5"])
flavour_cuts = Cuts.from_list([("HadronConeExclTruthLabelID", "==", 5)])
It’s easy to combine cuts
combined_cuts = kinematic_cuts + flavour_cuts
And then apply them to a structured array with
idx, selected_jets = combined_cuts(jets)
Both the selected indices and the selected jets are returned. The indices can be used to reapply the same selection to another array (e.g. tracks). The return values idx and values can also be accessed by name:
idx = combined_cuts(jets).idx
selected_jets = combined_cuts(jets).values
Flavours#
A list of flavours is provided.
from ftag import Flavours
Flavours.bjets
Label(name='bjets', label='$b$-jets', cuts=['HadronConeExclTruthLabelID == 5'], colour='tab:blue', category='single-btag', _px=None)
dict like access is also supported:
Flavours["qcd"]
Label(name='qcd', label='QCD', cuts=['R10TruthLabel_R22v1 == 10'], colour='#38761D', category='xbb', _px=None)
As you can see from the output, each flavour has a name, a label and colour (used for plotting), and a Cuts instance, which can be used to select jets of the given flavour.
For example:
bjets = Flavours.bjets.cuts(jets).values
Each flavour is also assigned to a category, which can be used to group flavours together. For example:
Flavours.categories
['single-btag',
'single-btag-extended',
'single-btag-ghost',
'xbb',
'xbb-extended',
'partonic',
'lepton-decay',
'PDGID',
'isolation',
'trigger-xbb']
Flavours.by_category("xbb")
LabelContainer(hbb, hcc, top, qcd, qcdbb, qcdnonbb, qcdbx, qcdcx, qcdll, htauel, htaumu, htauhad)
Probability names are also accessible using .px:
[f.px for f in Flavours.by_category("single-btag")]
['pb', 'pc', 'pu', 'ptau']
And you can also access a flavour from its definition:
Flavours.from_cuts(["HadronConeExclTruthLabelID == 5"])
Label(name='bjets', label='$b$-jets', cuts=['HadronConeExclTruthLabelID == 5'], colour='tab:blue', category='single-btag', _px=None)
H5Reader#
The H5Reader class allows you to read batches of jets from one or more HDF5 files.
- Variables are specified as dict[str, list[str]].
- By default the reader will randomly access chunks in the file, giving you a weakly shuffled set of jets.
- If variables or num_jets are not specified, the reader will load all variables and all available jets, respectively.
For example to load 300 jets using three batches of size 100:
from ftag.hdf5 import H5Reader
reader = H5Reader(fname, batch_size=100)
data = reader.load({"jets": ["pt", "eta"]}, num_jets=300)
len(data["jets"])
300
To transparently load jets across several files, fname can also be a pattern including wildcards (*).
Behind the scenes files are globbed and merged into a virtual dataset.
So the following also works:
from pathlib import Path
reader = H5Reader(Path(fname).parent / "*.h5", batch_size=100)
If you have globbed several files, you can easily get the total number of jets across all files with
reader.num_jets
1000
You can also load tracks alongside jets (or by themselves) by specifying an additional entry in the variables dict:
data = reader.load({"jets": ["pt", "eta"], "tracks": ["deta", "dphi"]}, num_jets=100)
data["tracks"].dtype
dtype([('dphi', '<f4'), ('deta', '<f4')])
You can apply cuts to the jets as they are loaded. For example, to load 100 jets which satisfy $p_T > 20$ GeV:
data = reader.load({"jets": ["pt"]}, num_jets=100, cuts=Cuts.from_list(["pt > 20e3"]))
assert data["jets"]["pt"].min() > 20e3
Rather than return a single dict of arrays, the reader can also return a generator of batches.
This is useful when you want to work with a large number of jets, but don’t want to load them all into memory at once.
reader = H5Reader(fname, batch_size=100)
stream = reader.stream({"jets": ["pt", "eta"]}, num_jets=300)
for batch in stream:
    jets = batch["jets"]
    # do processing on batch...
H5Writer#
The H5Writer class complements the reader class by allowing you to easily write batches of jets to a target file.
from tempfile import NamedTemporaryFile
from ftag.hdf5 import H5Writer
reader = H5Reader(fname, batch_size=100, shuffle=False)
variables = {"jets": None} # "None" means "all variables"
out_fname = NamedTemporaryFile(suffix=".h5").name
writer = H5Writer(
    dst=out_fname,
    dtypes=reader.dtypes(variables),
    shapes=reader.shapes(1000),
    shuffle=False,
)
To write jets in batches to the output file, you can use the write method:
for batch in reader.stream(variables=variables, num_jets=1000):
    writer.write(batch)
writer.close()
When you are finished you need to manually close the file using H5Writer.close().
The two files will now have the same contents (since we disabled shuffling):
import h5py
assert (h5py.File(fname)["jets"][:] == h5py.File(out_fname)["jets"][:]).all()
Data Transform#
The transform.py module allows for data to be transformed as it is loaded.
Three operations are supported:
- Renaming variables
- Mapping integer values
- Functional transforms on floating point values
See below for an example. First, we can make a transform config (which in this case just log scales the jet pt).
from ftag.transform import Transform
transform_config = {
    "floats_map": {
        "jets": {
            "pt": "log",
        },
    },
}
transform = Transform(**transform_config)
This object can be passed to the H5Reader constructor.
The resulting object will return transformed values.
reader = H5Reader(fname, batch_size=100, transform=transform)
data = reader.load({"jets": ["pt", "eta"]}, num_jets=300)
data["jets"]["pt"].min(), data["jets"]["pt"].max()
(np.float32(8.156475), np.float32(12.896797))
Adding columns to an h5 file.#
Quite often we wish to make minor changes to h5 files, e.g. adding a single new jet or track variable. A helper function, h5_add_column, is available to help here.
The function requires an input file, an output file, and a function (or list of functions) which returns a dictionary detailing what needs to be appended. An example is included below.
import tempfile
from ftag.hdf5 import h5_add_column
def add_number_of_tracks(batch: dict) -> dict:
    """Add the number of tracks to the jets dict.

    Parameters
    ----------
    batch : dict
        Loaded batch of jets

    Returns
    -------
    dict
        dict with the number of tracks in the jets dict
    """
    num_tracks = batch["tracks"]["valid"].sum(axis=1)
    return {
        "jets": {  # Add to the 'jets' group
            "number_of_tracks": num_tracks,  # ... a new variable called 'number_of_tracks'
        }
    }
dummy_file = h5py.File(fname, "r")

# Should be False
print("Input has 'number_of_tracks'? ", "number_of_tracks" in dummy_file["jets"].dtype.names)

# Create a temporary file path for output
with tempfile.TemporaryDirectory() as tmpdir:
    output = Path(tmpdir) / "output.h5"
    h5_add_column(
        fname,
        output,
        add_number_of_tracks,
        num_jets=1000,
    )

    # Check if the new column was added
    with h5py.File(output, "r") as f:
        # Should be True
        print("Output has 'number_of_tracks'? ", "number_of_tracks" in f["jets"].dtype.names)
Input has 'number_of_tracks'?  False
Output has 'number_of_tracks'?  True
You can also use the above as a script from the terminal:
h5addcol --input [input file] --append_function /a/path/to/a/python/script.py:a_function1 /a/differentpath/to/a/python/script.py:a_function2
This will run a_function1 from /a/path/to/a/python/script.py and a_function2 from /a/differentpath/to/a/python/script.py.
Optimization of Fraction Values#
In flavour tagging, we don’t use the output probabilities of our networks directly. To make them easier to use, the multi-dimensional output is condensed into a one-dimensional flavour-tagging discriminant, $D_b$. The subscript indicates the type of tagging the discriminant is used for; in the example here, we discuss $b$-tagging. The full $b$-tagging discriminant for “single-btag”, meaning four classes in total ($b$, $c$, light, $\tau$), is defined as:
$$ D_b = \ln \left(\frac{p_b}{f_c \cdot p_c + f_\text{light} \cdot p_u + f_\tau \cdot p_\tau} \right) \qquad \text{with } 1 = f_c + f_\text{light} + f_\tau $$
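As a toy numerical check of this formula (the probabilities and fraction values below are made up for illustration, not taken from any real tagger):

```python
import numpy as np

# assumed output probabilities for a single jet (hypothetical values)
pb, pc, pu, ptau = 0.7, 0.2, 0.08, 0.02
# example fraction values; by construction they must sum to 1
fc, flight, ftau = 0.2, 0.7, 0.1

# evaluate the b-tagging discriminant defined above
disc = np.log(pb / (fc * pc + flight * pu + ftau * ptau))
```

A jet with a large $p_b$ relative to the weighted background probabilities gets a large positive discriminant value.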
While in the case of three total classes (one signal, two background) the optimization can still be done using 2D plots and trial and error, the optimization for four or more total classes suffers from the high dimensionality of the problem. To counteract this and still be able to optimize the fraction values, the atlas-ftag-tools package provides a callable script/function called fraction_optimization.
fraction_optimization uses the jets provided to find the optimal fraction values using a minimization technique. The following steps are run during this optimization:
1. For the given tagger/WP, calculate the maximally reachable rejection per flavour, which is used as a normalization in the final optimization.
2. Calculate the normalized sum of all rejections, using the previously extracted maximally reachable rejection per flavour.
3. Multiply the total normalized sum of all rejections by -1 and hand this function to a minimizer, which optimizes the fraction values based on the negative normalized sum of all rejections.

The optimizer then returns a dict with the optimal fraction values.
To manually balance the classes against each other, you can provide rejection_weights, which let you fine-tune the optimization process.
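As a sketch, such weights might be passed as a dict keyed by flavour name; the exact expected format is an assumption here, so check fraction_optimization -h and the function signature before relying on it:

```python
# hypothetical rejection weights, up-weighting c-jet rejection
# relative to light and tau rejection in the optimization
rejection_weights = {"cjets": 2.0, "ujets": 1.0, "taujets": 1.0}
```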
Terminal Usage#
To use the script directly from your terminal, you can run the following command:
fraction_optimization -i <path-to-h5> -t <tagger-name> -s <signal-flavour> -w <working-point>
This will calculate the optimal fraction values and print them to your terminal. Please make sure to run
fraction_optimization -h
once to check which other options are available and which defaults are set for them.
Usage in a Python Script#
In addition to the functionality as a callable script, you can also import the function to optimize your fraction values directly into your own script. In the example here, we assume that you have loaded all the jets and their variables as data and that the jet variables are in the key jets. The example will use the MockTagger tagger and the default single-btag flavours with bjets as signal. As working point, we will use the 70% working point.
from ftag.fraction_optimization import calculate_best_fraction_values
# Get a test file and load the jets from it
_, f = get_mock_file()
jets = f["jets"]
# Get the optimal fraction dict
optimal_fraction_dict = calculate_best_fraction_values(
    jets=jets,
    tagger="MockTagger",
    signal=Flavours["bjets"],
    flavours=Flavours.by_category("single-btag"),
    working_point=0.7,
    rejection_weights=None,
    optimizer_method="Powell",
)
In this example, we wrote out all possible options. The code returns a dict with the optimal fraction values, which you can then use directly in your own code.