Run#
Before running UPP, make sure you have modified the configuration file according to the configuration instructions
Basic Usage#
To run all preprocessing stages for the train
split use:
preprocess --config configs/test.yaml
For a comprehensive list of available flags, refer to preprocess --help
.
If you are running on lxplus you may need to use python3 upp/main.py
instead of preprocess
Splits#
The data is divided into three splits: training (train
), validation (val
), and testing (test
).
These splits are defined in configuration files, typically based on the eventNumber
variable.
By default, the train split contains 80% of the jets, while val and test contain 10% each.
If you want to preprocess the val
or test
split, use the --split
argument:
preprocess --config configs/config.yaml --split val
You can also process train
, val
, and test
with a single command using --split=all
.
Stages#
The preprocessing is broken up into several stages.
To run with only specific stages enabled, include the flag for the required stages:
preprocess --config configs/config.yaml --prep --resample
To run the whole chain excluding certain stages, include the corresponding negative flag (--no-*
).
For example to run without plotting
preprocess --config configs/config.yaml --no-plot
The stages are described below.
1. Prepare#
The prepare stage (--prep
) reads a specified number of jets (num_jets_estimate_hist
) for each flavor and constructs histograms of the resampling variables.
These histograms are stored in <base_dir>/hists
.
2. Resample#
The resample stage (--resample
) resamples jets to achieve similar p_T and \eta distributions across flavours.
After execution, resampled samples for each flavor, sample, and split are saved separately in <base_dir>/components/<split>/
.
You need to run the resampling stage even if you don't apply any resampling (e.g. you configured with method: none
).
3. Merge#
The merge stage (--merge
) combines the resampled samples into a single file named <tbase_dir>/<out_dir>/pp_output_<split>.h5
.
It also handles shuffling.
4. Normalise#
The normalise stage (--norm
) calculates scaling and shifting values for all variables intended for training based on (num_jets_estimate_norm
). The results are stored in<tbase_dir>/<out_dir>/norm_dict.yaml
.
5. Plotting#
The plotting stage (--plot
) produces histograms of resampled variables to verify the resampling quality.
You can find these plots in <tbase_dir>/plots/
.
Created: April 9, 2024