Examples#
This page gives examples of how the code can be used. The first part covers scripts and functions that can be used directly from the terminal, while the second part gives a more detailed look at the different functionalities.
Calculate WPs#
This package contains a script to calculate tagger working points (WPs).
The script is `working_points.py` and can be run after installing this package with

```bash
wps \
    --ttbar "path/to/ttbar/*.h5" \
    --tagger GN2v01 \
    --fc 0.1
```
Both the `--tagger` and `--fc` options accept a list if you want to get the WPs for multiple taggers.
If you are doing c-tagging or xbb-tagging, dedicated `fx` arguments are available (you can find them all with `-h`).
If you want to use the `ttbar` WPs to get the efficiencies and rejections for the `zprime` sample, you can add `--zprime "path/to/zprime/*.h5"` to the command.
Note that a default selection of $p_T > 250$ GeV is applied to jets in the `zprime` sample.
If, instead of defining the working points for a series of signal efficiencies, you wish to calculate a WP corresponding to a specific background rejection, the `--rejection` option can be given along with the desired background.
By default the working points are printed to the terminal, but you can save the results to a YAML file with the `--outfile` option.
See `wps --help` for more options and information.
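For intuition, a WP defined at a given signal efficiency is essentially a quantile of the signal discriminant distribution. Below is a minimal numpy-only sketch using toy Gaussian discriminants; it is not the script's implementation, just the underlying idea:

```python
import numpy as np

rng = np.random.default_rng(0)
disc_sig = rng.normal(2.0, 1.0, 10_000)   # toy signal (b-jet) discriminant values
disc_bkg = rng.normal(-1.0, 1.0, 10_000)  # toy background discriminant values

# 70% WP: the discriminant cut that keeps 70% of signal jets
cut = np.percentile(disc_sig, 100 - 70)

eff = (disc_sig > cut).mean()      # signal efficiency, ~0.70 by construction
rej = 1 / (disc_bkg > cut).mean()  # background rejection at the same cut
```

The script additionally handles multiple taggers, flavour fractions, and sample-dependent selections, but the cut/efficiency/rejection relationship is the same.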
Calculate efficiency at discriminant cut#
The same script can be used to calculate the efficiency and rejection values at a given discriminant cut value.
The script `working_points.py` can be run after installing this package as follows

```bash
wps \
    --ttbar "path/to/ttbar/*.h5" \
    --tagger GN2v01 \
    --fx 0.1 \
    --disc_cuts 1.0 1.5
```

The `--tagger`, `--fx`, and `--outfile` options work in the same way as in the ‘Calculate WPs’ script described above.
H5 Utils#
Create virtual file#
This package contains a script to easily merge a set of H5 files. A virtual file is a fast and lightweight way to wrap a set of files. See the h5py documentation for more information on virtual datasets.
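For a sense of what a virtual dataset is, here is a minimal sketch built directly with h5py; the file names, shapes, and dataset names are illustrative, not the defaults used by `vds.py`:

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

tmpdir = Path(tempfile.mkdtemp())

# create two small source files, each with a "jets" dataset
for i in range(2):
    with h5py.File(tmpdir / f"part{i}.h5", "w") as f:
        f.create_dataset("jets", data=np.arange(5, dtype=np.int64) + 10 * i)

# map both sources into one contiguous virtual layout
layout = h5py.VirtualLayout(shape=(10,), dtype="int64")
for i in range(2):
    vsource = h5py.VirtualSource(str(tmpdir / f"part{i}.h5"), "jets", shape=(5,))
    layout[i * 5 : (i + 1) * 5] = vsource

# the merged file is tiny: it only stores references to the sources
with h5py.File(tmpdir / "merged.h5", "w") as f:
    f.create_virtual_dataset("jets", layout)

with h5py.File(tmpdir / "merged.h5", "r") as f:
    print(f["jets"][:])  # [ 0  1  2  3  4 10 11 12 13 14]
```

Reads from the merged file are forwarded to the source files, which is why the wrapper is fast and lightweight.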
The script is `vds.py` and can be run after installing this package with

```bash
vds <pattern> <output path>
```

The `<pattern>` argument should be a glob pattern enclosed in quotes, for example `"dsid/path/*.h5"`.
If you don’t want to glob an entire path, or if you want to merge files that do not contain a common string, you can instead use the regex option as follows:

```bash
vds <regex_pattern> <output path> --use-regex
```

where `<regex_pattern>` can contain a list of files, for example `"dsid/path/(myfile_id1|myfile_id5|other_file).h5"`.
You can also regex match files located in a different folder by specifying the option `--regex_path`.
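The regex matching itself is ordinary Python `re` behaviour; a small sketch with hypothetical filenames mirroring the alternation pattern above:

```python
import re

# hypothetical filenames; the pattern mirrors the example above
pattern = re.compile(r"(myfile_id1|myfile_id5|other_file)\.h5")
files = ["myfile_id1.h5", "myfile_id2.h5", "myfile_id5.h5", "other_file.h5"]
matched = [f for f in files if pattern.fullmatch(f)]
print(matched)  # ['myfile_id1.h5', 'myfile_id5.h5', 'other_file.h5']
```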
See `vds --help` for more options and information.
h5move#
A script to move/rename datasets inside an H5 file. Useful for correcting discrepancies between group names. See h5move.py for more info.
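In plain h5py, the underlying operation is `Group.move`; a minimal sketch of the idea (illustrative names, not h5move.py itself):

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

fpath = Path(tempfile.mkdtemp()) / "demo.h5"
with h5py.File(fpath, "w") as f:
    f.create_dataset("jets_old", data=np.arange(3))
    f.move("jets_old", "jets")  # rename/move a dataset within the file
    print(list(f.keys()))  # ['jets']
```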
h5split#
A script to split a large h5 file into several smaller files. Useful if output files are too large for EOS/grid storage. See h5split.py for more info.
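The idea can be sketched with plain h5py, writing fixed-size slices to separate files; this is illustrative only and does not reflect h5split.py's actual options:

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

tmpdir = Path(tempfile.mkdtemp())
src = tmpdir / "big.h5"
with h5py.File(src, "w") as f:
    f.create_dataset("jets", data=np.arange(10))

# write fixed-size slices of the large dataset to separate files
jets_per_file = 4
with h5py.File(src, "r") as f:
    jets = f["jets"]
    for i, start in enumerate(range(0, len(jets), jets_per_file)):
        with h5py.File(tmpdir / f"split_{i}.h5", "w") as out:
            out.create_dataset("jets", data=jets[start : start + jets_per_file])

print(sorted(p.name for p in tmpdir.glob("split_*.h5")))
# ['split_0.h5', 'split_1.h5', 'split_2.h5']
```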
Extensive Examples#
The content below is generated automatically from `ftag/example.ipynb`.
```python
from __future__ import annotations

%load_ext autoreload
%autoreload 2
```
Example Usage#
This notebook demonstrates how to use this package. The main features are:

- A `Cuts` class that can be used to select jets.
- A set of `Flavours` defining common jet flavours.
- An `H5Reader` class allowing for batched reading of jets across multiple files.
- An `H5Writer` class allowing for batched writing of jets.
We can start by getting some dummy data to work with.
```python
from ftag import get_mock_file

fname, f = get_mock_file()
jets = f["jets"]
```
Cuts#
The `Cuts` class provides an interface for applying selections to structured numpy arrays loaded from HDF5 files.
To take a look, first import the `Cuts` class:
```python
from ftag import Cuts
```
Instances of `Cuts` can be defined from lists of strings or tuples of strings and values. For example

```python
kinematic_cuts = Cuts.from_list(["pt > 20e3", "abs_eta < 2.5"])
flavour_cuts = Cuts.from_list([("HadronConeExclTruthLabelID", "==", 5)])
```
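For intuition, these cuts reduce to boolean masks on a structured numpy array. A numpy-only sketch of the same selections, independent of the `Cuts` class (toy data, illustrative field values):

```python
import numpy as np

# toy structured array standing in for jets loaded from HDF5
jets_toy = np.array(
    [(15e3, 1.0, 5), (30e3, 3.1, 5), (50e3, -0.5, 5), (60e3, 0.2, 0)],
    dtype=[("pt", "f4"), ("eta", "f4"), ("HadronConeExclTruthLabelID", "i4")],
)
tracks_toy = np.arange(4)  # one entry per jet, e.g. an aligned track array

# "pt > 20e3", "abs_eta < 2.5", and the flavour cut as one combined mask
mask = (
    (jets_toy["pt"] > 20e3)
    & (np.abs(jets_toy["eta"]) < 2.5)
    & (jets_toy["HadronConeExclTruthLabelID"] == 5)
)
idx = np.flatnonzero(mask)

selected_jets = jets_toy[idx]
selected_tracks = tracks_toy[idx]  # the same indices select the matching tracks
print(selected_jets["pt"])  # [50000.]
```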
It’s easy to combine cuts

```python
combined_cuts = kinematic_cuts + flavour_cuts
```

And then apply them to a structured array with

```python
idx, selected_jets = combined_cuts(jets)
```
Both the selected indices and the selected jets are returned. The indices can be used to reapply the same selection to another array (e.g. tracks). The return values `idx` and `values` can also be accessed by name:

```python
idx = combined_cuts(jets).idx
selected_jets = combined_cuts(jets).values
```
Flavours#
A list of flavours is provided.

```python
from ftag import Flavours

Flavours.bjets
```

```
Label(name='bjets', label='$b$-jets', cuts=['HadronConeExclTruthLabelID == 5'], colour='tab:blue', category='single-btag', _px=None)
```

`dict`-like access is also supported:

```python
Flavours["qcd"]
```

```
Label(name='qcd', label='QCD', cuts=['R10TruthLabel_R22v1 == 10'], colour='#38761D', category='xbb', _px=None)
```
As you can see from the output, each flavour has a `name`, a `label` and a `colour` (used for plotting), and a `Cuts` instance, which can be used to select jets of the given flavour.
For example:
```python
bjets = Flavours.bjets.cuts(jets).values
```
Each flavour is also assigned to a category, which can be used to group flavours together. For example:
```python
Flavours.categories
```

```
['single-btag',
 'single-btag-extended',
 'single-btag-ghost',
 'xbb',
 'xbb-extended',
 'partonic',
 'lepton-decay',
 'PDGID',
 'isolation',
 'trigger-xbb']
```

```python
Flavours.by_category("xbb")
```

```
LabelContainer(hbb, hcc, top, qcd, qcdbb, qcdnonbb, qcdbx, qcdcx, qcdll, htauel, htaumu, htauhad)
```
Probability names are also accessible using `.px`:

```python
[f.px for f in Flavours.by_category("single-btag")]
```

```
['pb', 'pc', 'pu', 'ptau']
```
And you can also access a flavour from its definition:

```python
Flavours.from_cuts(["HadronConeExclTruthLabelID == 5"])
```

```
Label(name='bjets', label='$b$-jets', cuts=['HadronConeExclTruthLabelID == 5'], colour='tab:blue', category='single-btag', _px=None)
```
H5Reader#
The `H5Reader` class allows you to read batches of jets from one or more HDF5 files.

- Variables are specified as `dict[str, list[str]]`.
- By default the reader will randomly access chunks in the file, giving you a weakly shuffled set of jets.
- By default the reader will load all variables for all available jets if `variables` and `num_jets` are not specified, respectively.
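What "weakly shuffled" means can be sketched with plain numpy: chunks are visited in a random order, but entries within a chunk keep their on-disk order. This is illustrative only, not the `H5Reader` internals:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(20)  # stands in for jets stored on disk

# visit fixed-size chunks in a random order
chunk_size = 5
starts = np.arange(0, len(data), chunk_size)
rng.shuffle(starts)

# jets within each chunk keep their on-disk order
weakly_shuffled = np.concatenate([data[s : s + chunk_size] for s in starts])
```

This trades shuffling quality for speed: random chunk access is much cheaper than fully random per-jet access.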
For example to load 300 jets using three batches of size 100:
```python
from ftag.hdf5 import H5Reader

reader = H5Reader(fname, batch_size=100)
data = reader.load({"jets": ["pt", "eta"]}, num_jets=300)
len(data["jets"])
```

```
300
```
To transparently load jets across several files, `fname` can also be a pattern including wildcards (`*`).
Behind the scenes, files are globbed and merged into a virtual dataset.
So the following also works:

```python
from pathlib import Path

reader = H5Reader(Path(fname).parent / "*.h5", batch_size=100)
```

If you have globbed several files, you can easily get the total number of jets across all files with

```python
reader.num_jets
```

```
1000
```
You can also load tracks alongside jets (or by themselves) by specifying an additional entry in the `variables` dict:

```python
data = reader.load({"jets": ["pt", "eta"], "tracks": ["deta", "dphi"]}, num_jets=100)
data["tracks"].dtype
```

```
dtype([('dphi', '<f4'), ('deta', '<f4')])
```
You can apply cuts to the jets as they are loaded. For example, to load 100 jets which satisfy $p_T > 20$ GeV:

```python
data = reader.load({"jets": ["pt"]}, num_jets=100, cuts=Cuts.from_list(["pt > 20e3"]))
assert data["jets"]["pt"].min() > 20e3
```
Rather than return a single `dict` of arrays, the reader can also return a generator of batches.
This is useful when you want to work with a large number of jets, but don’t want to load them all into memory at once.

```python
reader = H5Reader(fname, batch_size=100)
stream = reader.stream({"jets": ["pt", "eta"]}, num_jets=300)
for batch in stream:
    jets = batch["jets"]
    # do processing on batch...
```
H5Writer#
The `H5Writer` class complements the reader class by allowing you to easily write batches of jets to a target file.
```python
from tempfile import NamedTemporaryFile

from ftag.hdf5 import H5Writer

reader = H5Reader(fname, batch_size=100, shuffle=False)
variables = {"jets": None}  # "None" means "all variables"
out_fname = NamedTemporaryFile(suffix=".h5").name
writer = H5Writer(
    dst=out_fname,
    dtypes=reader.dtypes(variables),
    shapes=reader.shapes(1000),
    shuffle=False,
)
```
To write jets in batches to the output file, you can use the `write` method:

```python
for batch in reader.stream(variables=variables, num_jets=1000):
    writer.write(batch)
writer.close()
```

When you are finished, you need to manually close the file using `H5Writer.close()`.
The two files will now have the same contents (since we disabled shuffling):
```python
import h5py

assert (h5py.File(fname)["jets"][:] == h5py.File(out_fname)["jets"][:]).all()
```
Data Transform#
The `transform.py` module allows for data to be transformed as it is loaded.
Three operations are supported:

- Renaming variables
- Mapping integer values
- Functional transforms on floating point values
See below for an example. First, we can make a transform config (which in this case just log scales the jet pt).
```python
from ftag.transform import Transform

transform_config = {
    "floats_map": {
        "jets": {
            "pt": "log",
        },
    },
}

transform = Transform(**transform_config)
```
This object can be passed to the `H5Reader` constructor.
The resulting reader will return transformed values.

```python
reader = H5Reader(fname, batch_size=100, transform=transform)
data = reader.load({"jets": ["pt", "eta"]}, num_jets=300)
data["jets"]["pt"].min(), data["jets"]["pt"].max()
```

```
(np.float32(8.156475), np.float32(12.896797))
```
Adding columns to an h5 file.#
Quite often we wish to make minor changes to H5 files, e.g. adding a single new jet or track variable. A helper function, `h5_add_column`, is available to help here.
The function requires an input file, an output file, and a function (or list of functions) that returns a dictionary detailing what needs to be appended. An example is included below.
```python
import tempfile

from ftag.hdf5 import h5_add_column


def add_number_of_tracks(batch):
    num_tracks = batch["tracks"]["valid"].sum(axis=1)
    return {
        "jets": {  # Add to the 'jets' group
            "number_of_tracks": num_tracks,  # ... a new variable called 'number_of_tracks'
        }
    }


dummy_file = h5py.File(fname, "r")
# Should be False
print("Input has 'number_of_tracks'? ", "number_of_tracks" in dummy_file["jets"].dtype.names)

# Create a temporary file path for output
with tempfile.TemporaryDirectory() as tmpdir:
    output = Path(tmpdir) / "output.h5"
    h5_add_column(
        fname,
        output,
        add_number_of_tracks,
        num_jets=1000,
    )

    # Check if the new column was added
    with h5py.File(output, "r") as f:
        # Should be True
        print("Output has 'number_of_tracks'? ", "number_of_tracks" in f["jets"].dtype.names)
```

```
Input has 'number_of_tracks'?  False
Output has 'number_of_tracks'?  True
```
You can also use the above as a script from the terminal:

```bash
h5addcol --input [input file] --append_function /a/path/to/a/python/script.py:a_function1 /a/differentpath/to/a/python/script.py:a_function2
```

which will then run `a_function1` from `/a/path/to/a/python/script.py` and `a_function2` from `/a/differentpath/to/a/python/script.py`.