Examples#
This page gives examples of how the code can be used. The first part covers scripts and functions that can be used directly from the terminal, while the second part takes a more detailed look at the different functionalities.
Calculate WPs#
This package contains a script to calculate tagger working points (WPs).
The script is working_points.py and can be run after installing this package with
wps \
--ttbar "path/to/ttbar/*.h5" \
--tagger GN2v01 \
--fc 0.1
Both the --tagger and --fc options accept a list if you want to get the WPs for multiple taggers.
If you are doing c-tagging or xbb-tagging, dedicated fx arguments are available (you can find them all with -h).
If you want to use the ttbar WPs to get the efficiencies and rejections for the zprime sample, you can add --zprime "path/to/zprime/*.h5" to the command.
Note that a default selection of $p_T > 250$ GeV is applied to jets in the zprime sample.
If instead of defining the working points for a series of signal efficiencies, you wish to calculate a WP corresponding to a specific background rejection, the --rejection option can be given along with the desired background.
By default the working points are printed to the terminal, but you can save the results to a YAML file with the --outfile option.
See wps --help for more options and information.
Calculate efficiency at discriminant cut#
The same script can be used to calculate the efficiency and rejection values at a given discriminant cut value.
The script working_points.py can be run after installing this package as follows
wps \
--ttbar "path/to/ttbar/*.h5" \
--tagger GN2v01 \
--fx 0.1 \
--disc_cuts 1.0 1.5
The --tagger, --fx, and --outfile options behave the same as in the ‘Calculate WPs’ script described above.
H5 Utils#
Create virtual file#
This package contains a script to easily merge a set of H5 files. A virtual file is a fast and lightweight way to wrap a set of files. See the h5py documentation for more information on virtual datasets.
The script is vds.py and can be run after installing this package with
vds <pattern> <output path>
The <pattern> argument should be a glob pattern enclosed in quotes, for example "dsid/path/*.h5".
If you don’t want to glob an entire path, or if you want to merge files that do not contain a common string, you can instead use the regex option as follows:
vds <regex_pattern> <output path> --use-regex
where <regex_pattern> can contain a list of files, for example "dsid/path/(myfile_id1|myfile_id5|other_file).h5".
You can also regex-match files located in a different folder by specifying the --regex_path option.
See vds --help for more options and information.
h5move#
A script to move/rename datasets inside an h5file. Useful for correcting discrepancies between group names. See h5move.py for more info.
h5split#
A script to split a large h5 file into several smaller files. Useful if output files are too large for EOS/grid storage. See h5split.py for more info.
Extensive Examples#
The content below is generated automatically from ftag/example.ipynb.
from __future__ import annotations
%load_ext autoreload
%autoreload 2
Example Usage#
This notebook demonstrates how to use this package. The main features are:
- A Cuts class that can be used to select jets.
- A set of Flavours defining common jet flavours.
- An H5Reader class allowing for batched reading of jets across multiple files.
- An H5Writer class allowing for batched writing of jets.
We can start by getting some dummy data to work with.
from ftag import get_mock_file
fname, f = get_mock_file()
jets = f["jets"]
Cuts#
The Cuts class provides an interface for applying selections to structured numpy arrays loaded from HDF5 files.
To take a look, first import the Cuts:
from ftag import Cuts
Instances of Cuts can be defined from lists of strings or tuples of strings and values. For example
kinematic_cuts = Cuts.from_list(["pt > 20e3", "abs_eta < 2.5"])
flavour_cuts = Cuts.from_list([("HadronConeExclTruthLabelID", "==", 5)])
It’s easy to combine cuts
combined_cuts = kinematic_cuts + flavour_cuts
And then apply them to a structured array with
idx, selected_jets = combined_cuts(jets)
Both the selected indices and the selected jets are returned. The indices can be used to reapply the same selection to another array (e.g. tracks). The return values idx and values can also be accessed by name:
idx = combined_cuts(jets).idx
selected_jets = combined_cuts(jets).values
Flavours#
A list of flavours is provided.
from ftag import Flavours
Flavours.bjets
Label(name='bjets', label='$b$-jets', cuts=['HadronConeExclTruthLabelID == 5'], colour='tab:blue', category='single-btag', _px=None)
dict like access is also supported:
Flavours["qcd"]
Label(name='qcd', label='QCD', cuts=['R10TruthLabel_R22v1 == 10'], colour='#38761D', category='xbb', _px=None)
As you can see from the output, each flavour has a name, a label and colour (used for plotting), and a Cuts instance, which can be used to select jets of the given flavour.
For example:
bjets = Flavours.bjets.cuts(jets).values
Each flavour is also assigned to a category, which can be used to group flavours together. For example:
Flavours.categories
['single-btag',
'single-btag-extended',
'single-btag-ghost',
'xbb',
'xbb-extended',
'partonic',
'lepton-decay',
'PDGID',
'isolation',
'trigger-xbb']
Flavours.by_category("xbb")
LabelContainer(hbb, hcc, top, qcd, qcdbb, qcdnonbb, qcdbx, qcdcx, qcdll, htauel, htaumu, htauhad)
Probability names are also accessible using .px:
[f.px for f in Flavours.by_category("single-btag")]
['pb', 'pc', 'pu', 'ptau']
And you can also access a flavour from its definition:
Flavours.from_cuts(["HadronConeExclTruthLabelID == 5"])
Label(name='bjets', label='$b$-jets', cuts=['HadronConeExclTruthLabelID == 5'], colour='tab:blue', category='single-btag', _px=None)
H5Reader#
The H5Reader class allows you to read batches of jets from one or more HDF5 files.
- Variables are specified as dict[str, list[str]].
- By default the reader will randomly access chunks in the file, giving you a weakly shuffled set of jets.
- If variables or num_jets are not specified, the reader will load all variables and all available jets, respectively.
For example to load 300 jets using three batches of size 100:
from ftag.hdf5 import H5Reader
reader = H5Reader(fname, batch_size=100)
data = reader.load({"jets": ["pt", "eta"]}, num_jets=300)
len(data["jets"])
300
To transparently load jets across several files, fname can also be a pattern including wildcards (*).
Behind the scenes files are globbed and merged into a virtual dataset.
So the following also works:
from pathlib import Path
reader = H5Reader(Path(fname).parent / "*.h5", batch_size=100)
If you have globbed several files, you can easily get the total number of jets across all files with
reader.num_jets
1000
You can also load tracks alongside jets (or by themselves) by specifying an additional entry in the variables dict:
data = reader.load({"jets": ["pt", "eta"], "tracks": ["deta", "dphi"]}, num_jets=100)
data["tracks"].dtype
dtype([('dphi', '<f4'), ('deta', '<f4')])
You can apply cuts to the jets as they are loaded. For example, to load 100 jets which satisfy $p_T > 20$ GeV:
data = reader.load({"jets": ["pt"]}, num_jets=100, cuts=Cuts.from_list(["pt > 20e3"]))
assert data["jets"]["pt"].min() > 20e3
Rather than return a single dict of arrays, the reader can also return a generator of batches.
This is useful when you want to work with a large number of jets, but don’t want to load them all into memory at once.
reader = H5Reader(fname, batch_size=100)
stream = reader.stream({"jets": ["pt", "eta"]}, num_jets=300)
for batch in stream:
    jets = batch["jets"]
    # do processing on batch...
H5Writer#
The H5Writer class complements the reader class by allowing you to easily write batches of jets to a target file.
from tempfile import NamedTemporaryFile
from ftag.hdf5 import H5Writer
reader = H5Reader(fname, batch_size=100, shuffle=False)
variables = {"jets": None} # "None" means "all variables"
out_fname = NamedTemporaryFile(suffix=".h5").name
writer = H5Writer(
    dst=out_fname,
    dtypes=reader.dtypes(variables),
    shapes=reader.shapes(1000),
    shuffle=False,
)
To write jets in batches to the output file, you can use the write method:
for batch in reader.stream(variables=variables, num_jets=1000):
    writer.write(batch)
writer.close()
When you are finished you need to manually close the file using H5Writer.close().
The two files will now have the same contents (since we disabled shuffling):
import h5py
assert (h5py.File(fname)["jets"][:] == h5py.File(out_fname)["jets"][:]).all()
Data Transform#
The transform.py module allows for data to be transformed as it is loaded.
Three operations are supported:
- Renaming variables
- Mapping integer values
- Functional transforms on floating point values
See below for an example. First, we can make a transform config (which in this case just log scales the jet pt).
from ftag.transform import Transform
transform_config = {
    "floats_map": {
        "jets": {
            "pt": "log",
        },
    },
}
transform = Transform(**transform_config)
This object can be passed to the H5Reader constructor.
The resulting object will return transformed values.
reader = H5Reader(fname, batch_size=100, transform=transform)
data = reader.load({"jets": ["pt", "eta"]}, num_jets=300)
data["jets"]["pt"].min(), data["jets"]["pt"].max()
(np.float32(8.156475), np.float32(12.896797))
Adding columns to an h5 file.#
Quite often we wish to make minor changes to h5 files, e.g. adding a single new jet or track variable. A helper function, h5_add_column, is available to help here.
The function requires an input file, an output file, and a function (or list of functions) which returns a dictionary detailing what needs to be appended. An example is included below.
import tempfile
from ftag.hdf5 import h5_add_column
def add_number_of_tracks(batch: dict) -> dict:
    """Add the number of tracks to the jets dict.

    Parameters
    ----------
    batch : dict
        Loaded batch of jets

    Returns
    -------
    dict
        dict with the number of tracks in the jets dict
    """
    num_tracks = batch["tracks"]["valid"].sum(axis=1)
    return {
        "jets": {  # Add to the 'jets' group
            "number_of_tracks": num_tracks,  # ... a new variable called 'number_of_tracks'
        }
    }
dummy_file = h5py.File(fname, "r")

# Should be False
print("Input has 'number_of_tracks'? ", "number_of_tracks" in dummy_file["jets"].dtype.names)

# Create a temporary file path for output
with tempfile.TemporaryDirectory() as tmpdir:
    output = Path(tmpdir) / "output.h5"
    h5_add_column(
        fname,
        output,
        add_number_of_tracks,
        num_jets=1000,
    )

    # Check if the new column was added
    with h5py.File(output, "r") as f:
        # Should be True
        print("Output has 'number_of_tracks'? ", "number_of_tracks" in f["jets"].dtype.names)
Input has 'number_of_tracks'?  False
Output has 'number_of_tracks'?  True
You can also use the above as a script from the terminal:
h5addcol --input [input file] --append_function /a/path/to/a/python/script.py:a_function1 /a/differentpath/to/a/python/script.py:a_function2
This will run a_function1 from /a/path/to/a/python/script.py and a_function2 from /a/differentpath/to/a/python/script.py.
Optimization of Fraction Values#
In flavour tagging, we don’t use the output probabilities of our networks directly. To make them easier to use, the multi-dimensional output is condensed into a one-dimensional flavour-tagging discriminant, $D_b$. The subscript indicates the type of tagging the discriminant is used for; in the example here, we discuss $b$-tagging. The full $b$-tagging discriminant for “single-btag”, meaning four classes in total ($b$, $c$, light, $\tau$), is defined as:
$$ D_b = \ln \left(\frac{p_b}{f_c \cdot p_c + f_\text{light} \cdot p_u + f_\tau \cdot p_\tau} \right) \qquad \text{with } 1 = f_c + f_\text{light} + f_\tau $$
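As a toy numerical check of this formula (the probabilities and fraction values below are made up for illustration, not taken from any real tagger):

```python
import numpy as np

# assumed output probabilities for a single jet (hypothetical values)
pb, pc, pu, ptau = 0.7, 0.2, 0.08, 0.02
# example fraction values; by construction they must sum to 1
fc, flight, ftau = 0.2, 0.7, 0.1

# evaluate the b-tagging discriminant defined above
disc = np.log(pb / (fc * pc + flight * pu + ftau * ptau))
```

A jet with a large $p_b$ relative to the weighted background probabilities gets a large positive discriminant value.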
While in the case of three total classes (one signal, two background) the optimization can still be done using 2D plots and trial and error, the optimization for four or more total classes suffers from the high dimensionality of the problem. To counteract this and still be able to optimize the fraction values, the atlas-ftag-tools package provides a callable script/function called fraction_optimization.
fraction_optimization uses the jets provided to find the optimal fraction values using a minimization technique. The following steps are run during this optimization:
1. For the given tagger/WP, calculate the maximally reachable rejection per flavour, which is used as a normalization in the final optimization.
2. Calculate the normalized sum of all rejections, using the previously extracted maximally reachable rejection per flavour.
3. Multiply the total normalized sum of all rejections by -1 and hand this function to a minimizer, which optimizes the fraction values based on the negative normalized sum of all rejections.

The optimizer then returns a dict with the optimal fraction values.
To manually balance the classes against each other, you can provide rejection_weights, which let you fine-tune the optimization process.
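As a sketch, such weights might be passed as a dict keyed by flavour name; the exact expected format is an assumption here, so check fraction_optimization -h and the function signature before relying on it:

```python
# hypothetical rejection weights, up-weighting c-jet rejection
# relative to light and tau rejection in the optimization
rejection_weights = {"cjets": 2.0, "ujets": 1.0, "taujets": 1.0}
```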
Terminal Usage#
To use the script directly from your terminal, you can run the following command:
fraction_optimization -i <path-to-h5> -t <tagger-name> -s <signal-flavour> -w <working-point>
This will calculate the optimal fraction values and print them to your terminal. Please make sure to run
fraction_optimization -h
once to check which other options are available and which defaults are set for them.
Usage in a Python Script#
In addition to the functionality as a callable script, you can also import the function to optimize your fraction values directly into your own script. In the example here, we assume that you have loaded all the jets and their variables as data and that the jet variables are in the key jets. The example will use the MockTagger tagger and the default single-btag flavours with bjets as signal. As working point, we will use the 70% working point.
from ftag.fraction_optimization import calculate_best_fraction_values
# Get a test file and load the jets from it
_, f = get_mock_file()
jets = f["jets"]
# Get the optimal fraction dict
optimal_fraction_dict = calculate_best_fraction_values(
    jets=jets,
    tagger="MockTagger",
    signal=Flavours["bjets"],
    flavours=Flavours.by_category("single-btag"),
    working_point=0.7,
    rejection_weights=None,
    optimizer_method="Powell",
)
In this example, we wrote out all possible options. The code returns a dict with the optimal fraction values, which you can then use directly in your own code.