DAS
3.0
Das Analysis System
Make physics from the shell, factorise the analysis into atomic steps, re-use existing tools as much as possible, mutualise effort, keep analysis at event level as late as possible, and make it easy to add your own step for your specific analysis.
This framework is designed to account for the significant amount of time spent developing and debugging the analysis software (typically far more than the time needed to run the whole analysis from A to Z). It is intended to ease development by factorising out each step of the analysis and sticking to a few healthy principles, such as not over-engineering and testing while developing. Although the total running time of the program itself is not the primary optimisation target, results are obtained faster by reducing the number of iterations through the code.
In practice, the user handles data files by running commands in the shell. Darwin is opinionated about the form commands should take, leading to a consistent user experience. The data files are modified step by step, incorporating more and more physics corrections. The user can then consistently compare the results at all steps, e.g. investigate the impact of each calibration on a particular distribution.
Darwin is interfaced with Git to make the results reproducible. A full data record can be produced from the data files at any step of the analysis. Such a data record may be used to rerun the full analysis.
A typical analysis is performed by running a series of simple commands:
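A minimal sketch of such a chain, with hypothetical command names (the actual executables depend on the analysis):

```shell
# Hypothetical command names: each step reads a ROOT file and writes
# an enriched copy with a very similar structure.
getPhysicsObjects raw/     in.root           # skim the raw input
applyCalibration  in.root  out.root   -f     # apply a correction, write events
plotDistributions out.root plots.root        # inspect the effect of the step
```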
where in.root and out.root share a very similar structure. Typically, out.root will be a copy of in.root with minor modifications, such as the application of a correction or of a normalisation. The steps gradually construct files with all required corrections and uncertainties, so that they can be used to create histograms in the final processing step. Since the data share the same structure, the results can easily be compared with one another by running a common plotter after each step, to visualise the effect of each correction independently.
By default, the modified events are not written to disk, which allows faster runs through the entire input file for testing purposes. Add -f to write the modified events to the output file. To write only the modified branches, use -F (this uses TTree friends in the back-end; see below). This latter approach reduces the necessary disk space at the cost of making events slightly slower to open, as they are split over several files.
The input and output ROOT files must always be passed explicitly on the command line, whereas all other parameters may be provided either directly on the command line or via a configuration file:
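As an illustration (the command and option names are hypothetical), the two styles could look like:

```shell
# All parameters given directly on the command line:
applyCalibration in.root out.root --correction JES --year 2018

# The same parameters read from a configuration file:
applyCalibration in.root out.root -c calibration.info
```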
The command line approach is more concrete, which is good for teaching purposes or when the same command should be tried with different values of an argument. As time goes on, it is easier to store the "best" settings in a configuration file for reproducibility (see the Reproducibility section below). The command line options are implemented in Darwin::Tools::Options, which mostly relies on Boost Program Options (and, to a lesser extent, Property Trees).
Additionally, so-called prefix commands may be used to handle heavy files:
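The general shape is sketched below; the prefix command name is purely illustrative:

```shell
# A prefix command is prepended to a regular Darwin command, e.g. to
# handle a heavy input in slices (all names here are hypothetical):
somePrefix applyCalibration in.root out.root -f
```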
Input directories may also be given instead of input ROOT files. In that case, a TChain is constructed on the fly from all ROOT files found in the directory. Full workflows may be synthesised in the form of a shell script:
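For instance, a workflow script could look as follows (all step names are hypothetical):

```shell
#!/bin/sh
# Hypothetical end-to-end workflow: each command reads the output of the
# previous one, so every intermediate result can be inspected separately.
set -e  # abort on the first failing step
getPhysicsObjects raw/            objects.root
applyCalibration  objects.root    calibrated.root -f
makeHistograms    calibrated.root histograms.root
```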
Such scripts may even be submitted to HTCondor (if available) with transcribe, which will create a job array for each command.
Along with the processing, the settings of the data analysis are archived in the ROOT files. This includes the values of the parameters, the command history, a few general flags, some Git information (obtained with libgit2), etc. This information can be used to generate a configuration file to reproduce or compare ROOT files:
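For example (the command name is illustrative, not necessarily an actual Darwin executable):

```shell
# Extract the archived settings of an existing output file into a
# configuration file, then rerun the same step from that file:
dumpMetaInfo out.root > step.info
applyCalibration in.root out2.root -c step.info
```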
This so-called metainfo is handled in the back-end by Darwin::Tools::MetaInfo, which uses Boost Property Trees.
A light standard format is strongly suggested to handle the systematic variations (weights and scale corrections) of the different objects, implemented via the classes in the Darwin::Physics namespace. This namespace provides a minimal set of physics objects, e.g. GenJet, RecJet, GenMuon, RecMuon, etc. The use of these objects is optional but encouraged, as they integrate well with the rest of the framework.
Each object carries its own set of systematic uncertainties, both for efficiency corrections ("weights") and momentum calibration ("scales"). They are collectively referred to as "variations".
Weights are defined separately at the generator and detector levels, and separately for the event as a whole and for the individual objects. In simulated data, the generator-level weights typically correspond to the normalisation, whereas the detector-level weights correspond to efficiency corrections. In real data, detector-level weights typically contain the normalisation to the recorded luminosity. The object weights can be used, for instance, to apply efficiency corrections or to effectively remove objects.
Composite objects may be constructed and stored in the course of an analysis, such as dijet or dimuon systems:
All variations are propagated in a transparent way. For instance, the weight of a dijet system will be the product of the respective weights for a given variation.
These objects are generic and most analyses will want to specialise them. This is achieved by deriving new classes from the ones Darwin provides.
Variations may be handled in an abstract way with the help of the Darwin::Physics::Variation and Darwin::Physics::Weight classes. These are already implemented in the high-level physics objects described in the previous section:
Following these principles, an event loop may then look as follows:
The Darwin philosophy is to perform physics analyses in small, self-contained increments. Each step produces a new version of its input files with a few variables changed. The disk usage of the files can quickly add up, filling up one's available quota. Darwin provides a tool to greatly reduce the disk usage by removing duplication between steps. This is based on a feature of standard ROOT called "friend trees".
In Darwin, the use of friend trees is enabled by passing the -F option to commands that support it. With this option, the output tree no longer contains the full event information. Instead, only the modified branches are stored, along with references to the input files from which the rest can be retrieved. This makes for a much lighter output, at the expense of depending on other files: when using this feature, it is no longer possible to delete intermediate results.
As a user, reading trees produced with -F is no different from using standard, fat trees with the full event information. In particular, the relationship with the input file is restored automatically (something that cannot be done in plain ROOT with the required level of flexibility). For the curious, this full transparency is implemented by two classes: ChainSlice is a TChain restricted to a subset of the original entries, and SlicedFriendElement allows recording such a subset of a chain as a friend tree.
The libraries and tools provided by Darwin try to leave as much freedom as possible for scientific innovation. We expect users to be developers and vice versa. Generic developments that are not specific to a given analysis should be proposed for integration. Specialisations should be developed in dedicated repositories, such as DAS for the analysis of CMS data. The framework is only loosely constraining: it is always possible to break the rules, as long as this does not break the existing tests (although breaking the rules should only be done for a valid reason).