This repository contains code for the calibration of the transverse momentum pT of jets produced in proton-proton collisions in the CMS experiment at the Large Hadron Collider (LHC). The calibration is performed with deep neural networks, which are trained to predict the true value of pT starting from the properties of a reconstructed jet. The goal is to improve the pT resolution and reduce the dependence on the jet flavour compared to the standard calibration procedure.
Technically, the DNN is trained to predict the logarithm of the ratio between the true pT, known from simulation, and the uncalibrated pT of the corresponding reconstructed jet. It uses as inputs event- and jet-level features as well as properties of individual jet constituents. A snapshot of an old version of the code, which used a simple model that did not exploit jet constituents, is available in this tag.
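As an illustration of this target definition (not the repository's actual code; the array names and values are made up), the target and the resulting correction factor can be written as:

```python
import numpy as np

# Hypothetical arrays of generator-level and reconstructed jet pT
pt_true = np.array([105.2, 48.7, 310.5])   # "true" pT known from simulation
pt_reco = np.array([ 98.4, 52.1, 295.0])   # uncalibrated reconstructed pT

# Regression target: logarithm of the true-to-reconstructed pT ratio
target = np.log(pt_true / pt_reco)

# At inference time the DNN prediction is turned back into a correction factor
pt_corrected = pt_reco * np.exp(target)    # here target stands in for the DNN output
```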
The input dataset, containing some 10⁷ examples, is produced using code from a dedicated repository. This includes a non-uniform downsampling from an initial set of 10¹⁰ examples, which prunes overpopulated regions of the parameter space while preserving less populated but physically more relevant ones.
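The actual procedure lives in the dedicated repository; the following is only a schematic sketch of such a non-uniform downsampling, with the binning variable, the number of bins, and the target population per bin made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def downsample(pt, target_per_bin=1000, bins=50):
    """Keep each jet with a probability inversely proportional to the
    population of its pT bin, so that overpopulated regions are pruned
    while sparse but relevant regions are kept intact."""
    counts, edges = np.histogram(pt, bins=bins)
    bin_index = np.clip(np.digitize(pt, edges) - 1, 0, bins - 1)
    keep_prob = np.minimum(1.0, target_per_bin / np.maximum(counts[bin_index], 1))
    return rng.random(len(pt)) < keep_prob

# Usage (schematic): mask = downsample(pt_array); selected = examples[mask]
```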
The dataset consists of files in the native ROOT format, which are read with the help of `uproot` (a minimal reading example is sketched after the directory layout below). Several example files with just a handful of jets are provided for tests in the directory `tests/data`. It is assumed that the data files are placed in a directory (local or in a Google Cloud Storage bucket) with the following structure:
```
data
├── data.yaml
├── transform.yaml
└── shards
    ├── 001.root
    ├── 002.root
    └── ...
```
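For instance, one shard could be read as follows; the tree and branch names here are assumptions made for illustration and need not match the actual files:

```python
import uproot

# Open one shard and read a few branches into NumPy arrays
with uproot.open("data/shards/001.root") as f:
    tree = f["Jets"]  # hypothetical tree name
    arrays = tree.arrays(["pt", "eta", "phi", "pt_gen"], library="np")

print(arrays["pt"][:5])
```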
The file `data.yaml` lists the ROOT files and also specifies the number of jets in each of them (typically, 10⁵). Consult the example to see its structure.
The file `transform.yaml` above defines preprocessing transformations to be applied to individual features in the dataset before the start of the training. This file is created by the script `build_transform.py`. Non-linear transformations are specified manually; they are normally followed by a linear rescaling whose parameters are determined by the script.
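The exact transformations are listed in `transform.yaml`; the following is only a minimal sketch of the idea, where the choice of a logarithm for the jet pT and a simple standardisation as the linear rescaling are assumptions for illustration:

```python
import numpy as np

def fit_linear_rescaling(x):
    """Determine the parameters of the linear rescaling (here a plain
    standardisation) applied after the manually chosen non-linear transform."""
    return float(np.mean(x)), float(np.std(x))

# Hypothetical feature: jet pT, compressed with a logarithm before rescaling
pt = np.array([30.0, 55.0, 120.0, 480.0])
x = np.log(pt)                       # manually specified non-linear transformation
mean, std = fit_linear_rescaling(x)  # parameters determined from the data
x_scaled = (x - mean) / std          # linear rescaling applied during training
```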
Features are not only read from the ROOT files but also constructed on the fly; examples are the relative pT, Δη, and Δφ of jet constituents with respect to the jet.
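As a sketch of how such derived features can be computed (array names are assumptions), using NumPy:

```python
import numpy as np

def constituent_features(pt_c, eta_c, phi_c, pt_jet, eta_jet, phi_jet):
    """Features of jet constituents computed relative to the parent jet."""
    rel_pt = pt_c / pt_jet
    deta = eta_c - eta_jet
    # Wrap the azimuthal difference into [-pi, pi)
    dphi = np.mod(phi_c - phi_jet + np.pi, 2 * np.pi) - np.pi
    return rel_pt, deta, dphi
```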
The DNN consists of multiple blocks; an example architecture is shown in the figure below. A jet can contain a variable number of constituents, and their order is not relevant. To reflect this, each type of jet constituent is processed with a block based on Deep Sets. This approach has already been used for jet classification in Particle Flow Networks. Within each block, an MLP with shared weights is applied to every jet constituent of a given type, and the output of the MLP is then summed over all constituents of that type in the jet. The resulting outputs for the different types of constituents, as well as the jet-level features, are concatenated and processed by another MLP. This jet-level MLP has a single output unit, whose value gives (the logarithm of) the correction factor for the jet pT.
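The sketch below is not the repository's model but a minimal Keras illustration of a single Deep Sets block for one constituent type, concatenated with jet-level features; the shapes and layer sizes are assumptions, and masking of the zero-padded constituents is omitted for brevity:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Shapes and layer sizes are assumptions chosen for illustration only
n_const, n_const_feat, n_jet_feat = 64, 6, 10

constituents = tf.keras.Input(shape=(n_const, n_const_feat))  # zero-padded constituents
jet_features = tf.keras.Input(shape=(n_jet_feat,))

# Per-constituent MLP with shared weights (Dense acts on the last axis)
x = layers.Dense(64, activation="relu")(constituents)
x = layers.Dense(64, activation="relu")(x)

# Permutation-invariant aggregation: sum over the constituents of the jet
x = layers.Lambda(lambda t: tf.reduce_sum(t, axis=1))(x)

# Concatenate with jet-level features and apply the jet-level MLP
x = layers.Concatenate()([x, jet_features])
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(128, activation="relu")(x)
output = layers.Dense(1)(x)  # (logarithm of) the pT correction factor

model = tf.keras.Model(inputs=[constituents, jet_features], outputs=output)
```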
The input features to use and the parameters of the DNN are described in the master configuration file (example). In its section `model`, it specifies the dimensionality of the embeddings for categorical features, the number of units in each layer of each MLP block, and the type of the MLP blocks. The supported types are a vanilla MLP and ResNet.
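The actual block definitions live in the code; the following is only a sketch of how the two supported types might differ, with the function name, argument names, and layer choices assumed for illustration:

```python
from tensorflow.keras import layers

def mlp_block(x, units, block_type="plain"):
    """Stack of dense layers; with block_type="resnet" a shortcut connection
    is added whenever the input and output dimensionalities match."""
    for n in units:
        if block_type == "resnet" and x.shape[-1] == n:
            shortcut = x
            y = layers.Dense(n, activation="relu")(x)
            y = layers.Dense(n)(y)
            x = layers.Activation("relu")(layers.Add()([y, shortcut]))
        else:
            x = layers.Dense(n, activation="relu")(x)
    return x
```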
To use this package, its location should be added to `PYTHONPATH`. This can be done by executing

```
. ./env.sh
```

The version of Python used is 3.7. Python dependencies are listed in `requirements.txt`. To read files from Google Cloud Storage buckets, the `gsutil` program should also be installed and configured.
Run tests with

```
cd tests
pytest
```
The full training pipeline can be tried with

```
train.py test_config.yaml -o test_output
```

(from the same directory). Note that since the test files are tiny, the results of this training are not meaningful, and the runtime is completely dominated by various overheads.
The training is done with the script `train.py`, as in the example above. Its behaviour is controlled by the master configuration file provided as an argument. The script `steer.py` performs the training for multiple configuration files consecutively and copies the logs and all outputs produced in each task to a given destination (normally, a Google Cloud Storage bucket).
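Schematically, and with all configuration names and paths assumed purely for illustration (this is not the actual `steer.py`), such a steering loop could look like this:

```python
import subprocess

# Hypothetical list of configurations and a destination bucket
configs = ["config_baseline.yaml", "config_deep_sets.yaml"]
destination = "gs://my-bucket/jet-calibration"

for config in configs:
    task = config.replace(".yaml", "")
    # Run one training task, then copy its outputs to the destination
    subprocess.run(["train.py", config, "-o", task], check=True)
    subprocess.run(["gsutil", "-m", "cp", "-r", task, f"{destination}/{task}"], check=True)
```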