DAS
3.0
Das Analysis System
Make physics from the shell, factorise the analysis into atomic steps, re-use existing tools as much as possible, mutualise effort, keep analysis at event level as late as possible, and make it easy to add your own step for your specific analysis.
This framework is designed to account for the significant amount of time spent developing and debugging the analysis software (typically far more than the time needed to run the whole analysis from A to Z). It is intended to ease development by factorising out each step of the analysis and sticking to a few healthy principles, such as not over-engineering and testing while developing. Although the total running time of the program itself is not the primary optimisation target, results are obtained faster by reducing the number of iterations through the code.
In practice, the user handles data files by running commands in the shell. Darwin is opinionated about the form commands should take, leading to a consistent user experience. The data files are modified step by step, incorporating more and more physics corrections. The user can then consistently compare the results at all steps, e.g. investigate the impact of each calibration on a particular distribution.
Darwin is interfaced with Git to make the results reproducible. A full data record can be produced from the data files at any step of the analysis. Such a data record may be used to rerun the full analysis.
A typical analysis is performed by running a series of simple commands:
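A minimal sketch of such a chain, with hypothetical command names (the actual executables depend on the analysis):

```shell
# Hypothetical command names: each step reads a ROOT file and writes
# an enriched copy with a very similar structure.
getPhysicsObjects raw/     in.root           # skim the raw input
applyCalibration  in.root  out.root   -f     # apply a correction, write events
plotDistributions out.root plots.root        # inspect the effect of the step
```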
where in.root and out.root share a very similar structure. Typically, out.root will be a copy of in.root with minor modifications, such as the application of a correction or of a normalisation. The steps gradually construct files with all required corrections and uncertainties, so that they can be used to create histograms in the final processing step. Since the data share the same structure, the results can easily be compared with one another by running a common plotter after each step, to visualise the effect of each correction independently.
By default, the modified events are not written to disk, which allows faster runs through the entire input file for testing purposes. Add -f to write the modified events to the output file. To write only the modified branches, use -F (this uses TTree friends in the back-end; see below). This latter approach reduces the necessary disk space at the cost of making events slightly slower to open, as they are split over several files.
The input and output ROOT files must always be passed explicitly on the command line, whereas all other parameters may be provided either directly on the command line or via a configuration file:
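As an illustration (the command and option names are hypothetical), the two styles could look like:

```shell
# All parameters given directly on the command line:
applyCalibration in.root out.root --correction JES --year 2018

# The same parameters read from a configuration file:
applyCalibration in.root out.root -c calibration.info
```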
The command line approach is more concrete, which is good for teaching purposes or when the same command should be tried with different values of an argument. As time goes on, it is easier to store the "best" settings in a configuration file for reproducibility (see the Reproducibility section below). The command line options are implemented in Darwin::Tools::Options, which mostly relies on Boost Program Options (and, to a lesser extent, Property Trees).
Additionally, so-called prefix commands may be used to handle heavy files:
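The general shape is sketched below; the prefix command name is purely illustrative:

```shell
# A prefix command is prepended to a regular Darwin command, e.g. to
# handle a heavy input in slices (all names here are hypothetical):
somePrefix applyCalibration in.root out.root -f
```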
Input directories may also be given instead of input ROOT files. In that case, a TChain is constructed on the fly from all ROOT files found in the directory. Full workflows may be synthesised in the form of a shell script:
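For instance, a workflow script could look as follows (all step names are hypothetical):

```shell
#!/bin/sh
# Hypothetical end-to-end workflow: each command reads the output of the
# previous one, so every intermediate result can be inspected separately.
set -e  # abort on the first failing step
getPhysicsObjects raw/            objects.root
applyCalibration  objects.root    calibrated.root -f
makeHistograms    calibrated.root histograms.root
```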
Such scripts may even be submitted to HTCondor (if available) with transcribe, which will create a job array for each command.
Along with the processing, the settings of the data analysis are archived in the ROOT files. This includes the values of the parameters, the command history, a few general flags, some Git information (obtained with libgit2), etc. This information can be used to generate a configuration file to reproduce or compare ROOT files:
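For example (the command name is illustrative, not necessarily an actual Darwin executable):

```shell
# Extract the archived settings of an existing output file into a
# configuration file, then rerun the same step from that file:
dumpMetaInfo out.root > step.info
applyCalibration in.root out2.root -c step.info
```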
This so-called metainfo is handled in the back-end by Darwin::Tools::MetaInfo, which uses Boost Property Trees.
A light standard format is strongly suggested to handle the systematic variations (weights and scale corrections) of the different objects, implemented via the classes in the Darwin::Physics namespace. This namespace provides a minimal set of physics objects, e.g. GenJet, RecJet, GenMuon, RecMuon, etc. The use of these objects is optional but encouraged, as they integrate well with the rest of the framework.
Each object carries its own set of systematic uncertainties, both for efficiency corrections ("weights") and momentum calibration ("scales"). They are collectively referred to as "variations".
Weights are defined separately at the generator and detector levels, and separately for the event as a whole and for the individual objects. In simulated data, the generator-level weights typically correspond to the normalisation, whereas the detector-level weights correspond to efficiency corrections. In real data, detector-level weights typically contain the normalisation to the recorded luminosity. The object weights can be used, for instance, to apply efficiency corrections or to effectively remove objects.
Composite objects may be constructed and stored in the course of an analysis, such as dijet or dimuon systems:
All variations are propagated in a transparent way. For instance, the weight of a dijet system will be the product of the respective weights for a given variation.
These objects are generic and most analyses will want to specialise them. This is achieved by deriving new classes from the ones Darwin provides.
Variations may be handled in an abstract way with the help of the Darwin::Physics::Variation and Darwin::Physics::Weight classes. These are already implemented in the high-level physics objects described in the previous section:
Following these principles, an event loop may then look as follows:
The Darwin philosophy is to perform physics analyses in small, self-contained increments. Each step produces a new version of its input files with a few variables changed. The disk usage of the files can quickly add up, filling up one's available quota. Darwin provides a tool to greatly reduce the disk usage by removing duplication between steps. This is based on a feature of standard ROOT called "friend trees".
In Darwin, the use of friend trees is enabled by passing the -F option to commands that support it. With this option, the output tree no longer contains the full event information. Instead, only the modified branches are stored, along with references to the input files from which the rest can be retrieved. This makes for a much lighter output, at the expense of depending on other files: when using this feature, it is no longer possible to delete intermediate results.
As a user, reading trees produced with -F is no different from using standard, fat trees with the full event information. In particular, the relationship with the input file is restored automatically (something that cannot be done in plain ROOT with the required level of flexibility). For the curious, this full transparency is implemented by two classes: ChainSlice is a TChain restricted to a subset of the original entries, and SlicedFriendElement allows recording such a subset of a chain as a friend tree.
The libraries and tools provided by Darwin try to leave as much freedom as possible for scientific innovation. We expect users to be developers and vice versa. Generic developments that are not specific to a given analysis should be proposed for integration. Specialisations should be developed in dedicated repositories, such as DAS for the analysis of CMS data. The framework is only loosely constraining: it is always possible to break the rules, as long as this does not break the existing tests (although breaking the rules should only be done for a valid reason).