The preprocessing is configured with a .yaml file, which steers the whole pipeline.
Example config files for UPP can be found in upp/configs.
Each aspect of the configuration is described in detail below.
Input H5 Samples#
Here we define the input h5 samples which are to be preprocessed.
Each sample is defined using one or more DSIDs, which generally come from the training-dataset-dumper.
If a list of DSIDs is provided, jets from each DSID will be merged according to the equal_jets flag (see below).
The samples are used later in the config to define components, so they should be defined with YAML anchors.
Below is an example and a table explaining each setting.
# Single pattern
ttbar: &ttbar
  name: ttbar
  pattern: name1.*.410470.*/*.h5

# List of patterns
ttbar: &ttbar
  name: ttbar
  equal_jets: False
  pattern:
    - name1.*.410470.*/*.h5
    - name2.*.410470.*/*.h5
Setting | Type | Explanation | Default |
name | str | The name of the sample, used in output filenames. | Required |
pattern | str or list[str] | A single pattern or a list of patterns that match h5 files in a downloaded dataset. H5 files matching each pattern will be transparently merged using virtual datasets. | Required |
equal_jets | bool | Only relevant when providing a list of patterns. If True, the same number of jets is selected from each DSID. This is required, for example, for Xbb QCD samples, where each DSID belongs to a different slice and the resampling would break if one or more slices were missing. If False, this is not enforced, allowing for larger numbers of available jets. | True |
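The effect of equal_jets on the number of usable jets can be illustrated with a small sketch. The per-DSID jet counts and the helper name available_jets are hypothetical, and the arithmetic only mirrors the description in the table; it is not UPP's actual implementation.

```python
def available_jets(jets_per_dsid: dict[str, int], equal_jets: bool = True) -> int:
    """Return how many jets can be drawn from a merged sample (illustrative only)."""
    counts = list(jets_per_dsid.values())
    if equal_jets:
        # Every DSID contributes the same number of jets,
        # so the smallest DSID limits the total.
        return min(counts) * len(counts)
    # Otherwise all jets from every DSID are usable.
    return sum(counts)


# Hypothetical jet counts for two DSIDs.
counts = {"dsid_a": 1_000, "dsid_b": 4_000}
equal = available_jets(counts, equal_jets=True)
unequal = available_jets(counts, equal_jets=False)
```

With these toy numbers, enforcing equal contributions leaves fewer jets available than merging everything, which is the trade-off described for equal_jets.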
Global Cuts#
Selections that should be applied to all the data are listed under common:
For example, these could be outlier removal cuts or a global kinematic selection.
Each cut is given as a variable name (str), a comparison operator (str) and a value to compare to (int, float or list).
Possible operators are:
- the standard comparison operators, such as "<" and ">", which work the same as in Python.
- "in", to check whether the value is in the list.
- "%{i}==" and related modulo operators, such as "%10<=", which compare the value of an integer modulo i.
Along with the common selection cuts, you should also specify the cuts that separate the train, val and test splits using the modulo of eventNumber.
For example:
global_cuts:
  common:
    - [JetFitterSecondaryVertex_mass, "<", 25000]
    - [JetFitter_deltaR, "<", 0.6]
  train:
    - [eventNumber, "%10<=", 7]
  val:
    - [eventNumber, "%10==", 8]
  test:
    - [eventNumber, "%10==", 9]
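To make the cut semantics concrete, here is a minimal Python sketch of how a [variable, operator, value] triple could be evaluated, including the modulo form. It is not the Cuts class from atlas-ftag-tools; the helper names passes and split_of are hypothetical and exist only for this illustration.

```python
import operator
import re

# Plain comparison operators, mapped to Python's own implementations.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def passes(value, op: str, ref) -> bool:
    """Evaluate a single cut on one value (illustrative helper)."""
    if op == "in":
        return value in ref
    # Modulo operators look like "%10<=": take value % 10, then compare.
    match = re.fullmatch(r"%(\d+)(==|!=|<=|>=|<|>)", op)
    if match:
        return OPS[match.group(2)](value % int(match.group(1)), ref)
    return OPS[op](value, ref)


def split_of(event_number: int) -> str:
    """Assign an event to train/val/test using the cuts from the example."""
    cuts = {"train": ("%10<=", 7), "val": ("%10==", 8), "test": ("%10==", 9)}
    for name, (op, ref) in cuts.items():
        if passes(event_number, op, ref):
            return name
    raise ValueError("event matched no split")


splits = [split_of(n) for n in range(100)]
fractions = {s: splits.count(s) / len(splits) for s in ("train", "val", "test")}
```

For uniformly distributed event numbers, the modulo cuts in the example therefore give an 80/10/10 train/val/test split.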
More info about cuts
The Cuts class is defined in the atlas-ftag-tools package.
k-fold training selection
If you are training a model that will be used in production, you may need to worry about overtraining.
A variable jetFoldHash is included in newer h5 dumps, which allows you to train independent models on different folds of the data.
If you are just performing studies, you do not need to apply any selections on jetFoldHash, since the train/val/test split will suffice.
Resampling Regions#
Next we define any kinematic regions which need to be resampled separately, again using anchors as these will also be used in the definition of our components. For each region you need to provide a name and a list of cuts (see above). Here is an example:
lowpt: &lowpt
  name: lowpt
  cuts:
    - [pt_btagJes, ">", 20_000]
    - [pt_btagJes, "<", 250_000]

highpt: &highpt
  name: highpt
  cuts:
    - [pt_btagJes, ">", 250_000]
    - [pt_btagJes, "<", 6_000_000]
Again, aliasing these just helps to reduce duplication of information when defining the components as can be seen below.
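As a quick check of what the two example regions select, the following sketch assigns a jet to a region from its pt_btagJes value. The REGIONS dictionary and the region_of helper are hypothetical; only the bounds are copied from the example. Because both regions use strict inequalities, a jet sitting exactly on the shared boundary belongs to neither region.

```python
# Region bounds copied from the example: name -> (lower, upper).
# Both bounds are exclusive because the cuts use ">" and "<".
REGIONS = {
    "lowpt": (20_000, 250_000),
    "highpt": (250_000, 6_000_000),
}


def region_of(pt: float):
    """Return the name of the region containing pt, or None if no region matches."""
    for name, (low, high) in REGIONS.items():
        if low < pt < high:
            return name
    return None


assigned = {pt: region_of(pt) for pt in (50_000, 250_000, 300_000)}
```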
The components section is where all the configuration comes together.
A component is a combination of a region, a sample and a flavour.
Components allow for full flexibility when defining different preprocessing pipelines (e.g. single-b versus Xbb).
An example components block is provided below.
components:
  - region:
      <<: *lowpt
    sample:
      <<: *ttbar
    flavours: [bjets, cjets, ujets]
    num_jets: 10_000_000

  - region:
      <<: *highpt
    sample:
      <<: *zprime
    flavours: [bjets, cjets, ujets]
    num_jets: 5_000_000
Notice that we use the YAML merge syntax, <<: followed by an alias such as *lowpt, to insert the already defined regions and samples.
Setting | Type | Explanation |
region | anchor | The pre-defined kinematic region anchor, e.g. lowpt or highpt, or inclusive if not splitting in pT. |
sample | anchor | The pre-defined sample anchor, e.g. ttbar or zprime. |
flavours | list[str] | One or more jet flavours, e.g. [bjets] or [ujets]. The list syntax is pure syntactic sugar. If more than one is provided, separate components are created for each flavour. |
num_jets | int | The number of jets to be sampled from this component in the training split. |
num_jets_val | int | Optional (default: num_jets//10). Number of jets of this component in the validation set. |
num_jets_test | int | Optional (default: num_jets//10). Number of jets of this component in the test set. |
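The expansion of a flavours list into separate components, together with the documented defaults for num_jets_val and num_jets_test, can be sketched as follows. The Component dataclass and the expand helper are hypothetical and are not UPP's own classes; the sketch also assumes that every flavour's component receives the same num_jets.

```python
from dataclasses import dataclass


@dataclass
class Component:
    region: str
    sample: str
    flavour: str
    num_jets: int
    num_jets_val: int
    num_jets_test: int


def expand(region, sample, flavours, num_jets, num_jets_val=None, num_jets_test=None):
    """Create one Component per flavour, filling in the documented defaults."""
    return [
        Component(
            region=region,
            sample=sample,
            flavour=flavour,
            # Assumption: each flavour's component gets the same num_jets.
            num_jets=num_jets,
            # Validation and test sizes default to a tenth of the training size.
            num_jets_val=num_jets // 10 if num_jets_val is None else num_jets_val,
            num_jets_test=num_jets // 10 if num_jets_test is None else num_jets_test,
        )
        for flavour in flavours
    ]


components = expand("lowpt", "ttbar", ["bjets", "cjets", "ujets"], 10_000_000)
```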
Next, provide the variables that are read from the TDD files and written to the resampled dataset. Selecting only a subset of variables keeps the output files lightweight and ensures that data loading does not become a bottleneck during training.
One can simply define them under variables:
variables:
  jets:
    inputs:
      - pt_btagJes
      - absEta_btagJes
    labels:
      - HadronConeExclTruthLabelID
      - pt
      - eta

  tracks:
    inputs:
      - dphi
      - deta
      - qOverP
      - IP3D_signed_d0_significance
      - IP3D_signed_z0_significance
    labels:
      - ftagTruthOriginLabel
      - ftagTruthVertexIndex
Each top-level block under variables corresponds to a dataset name in the TDD h5 file (e.g. jets, tracks, hits).
The combined set of variables in inputs and labels is carried over to the output files, into a dataset with the same name as the input dataset.
Internally, UPP computes normalisation parameters for the variables in inputs, and class weightings (for categorical labels) for the variables in labels.
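The exact definitions UPP uses for these quantities are not given here. As a rough illustration only, the sketch below assumes mean and standard-deviation normalisation for an input variable and inverse-frequency weights for a categorical label; both choices, and the helper names norm_params and class_weights, are assumptions rather than UPP's implementation.

```python
import statistics
from collections import Counter


def norm_params(values):
    """Mean and standard deviation used to shift and scale one input variable."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def class_weights(labels):
    """Inverse-frequency weights, scaled so that balanced classes get weight 1."""
    counts = Counter(labels)
    n_classes = len(counts)
    return {label: len(labels) / (n_classes * n) for label, n in counts.items()}


# Toy values standing in for one input variable and one categorical label.
params = norm_params([1.0, 2.0, 3.0, 4.0])
weights = class_weights([5, 5, 5, 0])
```

In this toy case the rarer label value receives the larger weight, which is the general purpose of class weighting.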
Alternatively, include the variables from your own variable config by providing the full path to the file after an !include statement.
The file you provide should have the same structure as shown above, but without the top-level variables: key.
For example:
variables: !include /<full path to your file>.yaml
You can also include the variable configs already provided with this package in upp/configs/ by giving just the yaml file name, e.g.:
variables: !include xbb-variables.yaml
You can choose later which variables in your output files are used for training
When it comes to defining your training config, you will be required to define the variables used for training. So it's okay to include here input variables you are not sure whether you will need, for example when testing the importance of different inputs. This is straightforward since we always store data using structured arrays (in the same format as the TDD outputs).
Track selections#
You can apply on-the-fly selections to tracks in the preprocessing stage (specifically in the merging step).
To do this, include a selection key in the variable config block under tracks, for example:
tracks:
  inputs:
    - d0
  labels:
    - ftagTruthOriginLabel
  selection:
    - [d0, ">", 0.1]
There are currently two resampling methods implemented in the package, pdf and countup, and they share most of their settings.
Below is an example of setting up the pdf resampling method, followed by a table describing all the parameters.
To run UPP without any kinematic resampling, set method: none.
Note that you will still need to run the resampling stage of the preprocessing pipeline.
resampling:
  target: cjets
  method: pdf
  upscale_pdf: 2
  sampling_fraction: auto
  variables:
    pt_btagJes:
      bins: [[20_000, 250_000, 50], [250_000, 1_000_000, 50], [1_000_000, 6_000_000, 50]]
    absEta_btagJes:
      bins: [[0, 2.5, 20]]
Setting | Type | Explanation |
target | str | The resampling is done such that the distribution of the kinematic variables matches that of the flavour given here. Usually this is the least populated flavour, since this flavour is not resampled; instead, all jets of this flavour are taken. |
method | str | Either pdf, countup or none, depending on the method you would like to use. |
upscale_pdf | int | Optional, only available for the pdf method. The coarse histogram-based approximation of the pdf is interpolated to bins that are upscale_pdf**dimensions times smaller than the original ones. |
sampling_fraction | None, float or auto | The number of jets sampled from each batch is equal to the sampling fraction times the number of jets in the input batch (after the cuts and flavour selection). The larger this value, the more jets are upsampled, i.e. repeated, so smaller values are preferred. On the other hand, smaller sampling fractions lead to longer preprocessing times. The auto option gives the smallest sampling fraction for each component, depending on the number of available jets and the number of jets requested, but caps it from below at 0.1 to prevent long preprocessing times when enough statistics are available. |
variables | dict | The jets are resampled according to the distribution of the kinematic variables you provide here. The variable names must correspond to the ones in the TDD. For each variable, provide a bins setting containing a list of lists, each made up of two floats and an integer. Each sub-list represents a binning region and is described by its lower bound, its upper bound and the number of equal-width bins in that region. The bins from each region are combined to provide one binning (with heterogeneous widths). When upscaling the pdf, each binning region is upscaled separately. It is therefore not necessary, but advisable, to split the binning at the same place as the cut between regions, to better handle the discontinuities. |
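The way a bins setting is turned into a single set of bin edges can be sketched in a few lines of Python. The bin_edges helper is hypothetical and only follows the description in the table: each [lower, upper, n_bins] region is split into equal-width bins, and the regions are then concatenated.

```python
def bin_edges(regions):
    """Combine [low, high, n_bins] regions into one list of bin edges."""
    edges = []
    for low, high, n_bins in regions:
        width = (high - low) / n_bins
        region_edges = [low + i * width for i in range(n_bins)] + [high]
        # Skip the first edge if it duplicates the last edge of the previous region.
        if edges and edges[-1] == region_edges[0]:
            region_edges = region_edges[1:]
        edges.extend(region_edges)
    return edges


# Binnings taken from the example resampling block.
pt_edges = bin_edges(
    [[20_000, 250_000, 50], [250_000, 1_000_000, 50], [1_000_000, 6_000_000, 50]]
)
eta_edges = bin_edges([[0, 2.5, 20]])
```

The three pt regions of 50 bins each yield 150 bins in total, with narrow bins at low pt and progressively wider bins at high pt.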
Global Config#
Global options for the preprocessing.
These options are specified in the config file under the global: key. They are passed as kwargs to PreprocessingConfig.
The config file is also copied to the output directory.
For example:
global:
  jets_name: jets
  batch_size: 1_000_000
  num_jets_estimate: 5_000_000
  base_dir: /my/stuff/
  ntuple_dir: h5-inputs # resolved path: /my/stuff/h5-inputs/
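The path resolution hinted at by the comment in the example can be reproduced with pathlib. This is only a sketch of the behaviour, not UPP's code: the resolve helper is hypothetical, and it assumes that a path which is already absolute is left untouched.

```python
from pathlib import PurePosixPath


def resolve(base_dir: str, path: str) -> PurePosixPath:
    """Interpret path relative to base_dir unless it is already absolute."""
    # Joining with an absolute right-hand side discards base_dir,
    # so absolute paths pass through unchanged.
    return PurePosixPath(base_dir) / path


relative = resolve("/my/stuff/", "h5-inputs")
absolute = resolve("/my/stuff/", "/other/disk/h5-inputs")
```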
Name | Type | Description | Default |
base_dir | | Base directory for all other paths. | required |
ntuple_dir | | Directory containing the input h5 ntuples. If a relative path is given, it is interpreted as relative to base_dir. | |
components_dir | | Directory for intermediate component files. If a relative path is given, it is interpreted as relative to base_dir. | |
out_dir | | Directory for output files. If a relative path is given, it is interpreted as relative to base_dir. | |
out_fname | | Filename stem for the output files. | |
batch_size | | Batch size for the preprocessing. For each batch select | |
num_jets_estimate | | Any of the three num_jets_estimate_* arguments below that are not specified will default to this value. | 1_000_000 |
num_jets_estimate_available | int or None | Size of the subsample taken from the whole sample to estimate the number of jets after the cuts. Keep this number high to avoid a Poisson error of more than 5%. If time allows, you can use -1 to get the precise number of jets rather than an estimate, although this will be slow for large datasets. | num_jets_estimate |
num_jets_estimate_hist | | Number of jets of each flavour used to construct the histograms for probability density function estimation. Larger numbers give a better estimate of the pdfs. | num_jets_estimate |
num_jets_estimate_norm | | Number of jets of each flavour used to estimate the shifting and scaling in the normalisation step. Larger numbers give better estimates. | num_jets_estimate |
jets_name | | Name of the jets dataset in the input file. | |
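The 5% figure quoted for num_jets_estimate_available can be put into numbers. The sketch below assumes that the error in question is the relative Poisson uncertainty 1/sqrt(N) on the N subsampled jets that pass the cuts, and that the total is obtained by scaling the subsample pass rate to the full sample. Both the interpretation and the helper names are assumptions, and the jet counts are made up.

```python
import math


def poisson_relative_error(n_passing: int) -> float:
    """Relative statistical uncertainty on a count of n_passing jets."""
    return 1.0 / math.sqrt(n_passing)


def estimate_total(n_passing: int, n_subsample: int, n_total: int) -> float:
    """Scale the pass rate seen in the subsample up to the full sample."""
    return n_passing * n_total / n_subsample


# Hypothetical numbers: 400 of 5_000_000 subsampled jets pass the cuts,
# and the full sample contains 50_000_000 jets.
rel_error = poisson_relative_error(400)
estimate = estimate_total(400, 5_000_000, 50_000_000)
```

Under these assumptions, at least 400 jets of a component need to survive the cuts in the subsample for the estimate to stay within 5%.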
Created: September 5, 2023