Das Analysis System (DAS) 3.0
This package contains the generic configuration and commands to generate n-tuples. The n-tuple format is described in [Objects](../Objects/README.md). To run the n-tupliser, the input datasets must be stored on Tier2 servers (`T2_*`) in MiniAOD format.
We first provide technical details on the datasets, then on the concept of a campaign, and finally on running the n-tupliser itself.
As soon as you want to deal with CMS datasets, you need a valid grid certificate; check with the computing admins of your institute how to obtain one.
Then use the following command to activate your grid certificate:
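Presumably the standard VOMS command for the CMS VO (a sketch):
```bash
voms-proxy-init -voms cms
```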
By default, your certificate will be valid for 24h; to run longer, use `-rfc -valid 192:00`. To check if any certificate has already been activated:
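For instance:
```bash
voms-proxy-info --all
```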
Note: activating your grid certificate is only necessary to run commands that deal with CMS datasets; once the n-tuples are being or have been produced, you don't need it anymore. In other words, there is no point in activating it each time you source the environment for daily analysis.
A nice tool to find Monte Carlo datasets is the Grasp website. It has an intuitive search interface that lets you filter by data-taking era, dataset size, or MiniAOD version. The full path of a dataset can be found by clicking on the "DAS" link in the results page (note that in this context, DAS means "Data Aggregation System").
To work from the shell, one standard command is `dasgoclient`, reachable in any CMSSW environment (if you are running from an EL9 node, you need to prefix all upcoming commands with `das-cmssw el8` or similar). The most basic type of call goes as follows:
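For example, with a wildcard query (a sketch):
```bash
dasgoclient -query="dataset=/JetHT/*/MINIAOD"
```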
to see all existing JetHT datasets in MiniAOD (beware: the list may be quite long). More advanced commands may be run, for example to investigate the location of datasets; for instance:
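For example, a sketch of a site query (the dataset name, a 2017 JetHT dataset also used in the examples below, is purely illustrative):
```bash
dasgoclient -query="site dataset=/JetHT/Run2017B-31Mar2018-v1/MINIAOD"
```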
or
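a sketch of a file query for the same illustrative dataset:
```bash
dasgoclient -query="file dataset=/JetHT/Run2017B-31Mar2018-v1/MINIAOD"
```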
Run `dasgoclient -h` to get some help, and `dasgoclient -examples` to see an extensive list of examples for this command.
An alternative way to get the location of a file is to use `edmFileUtil`, also reachable within any CMSSW environment:
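For instance, to resolve a logical file name (LFN) into a physical file name (the LFN below is illustrative, with `[file]` a placeholder):
```bash
edmFileUtil -d /store/data/Run2017B/JetHT/MINIAOD/31Mar2018-v1/00000/[file].root
```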
or
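to open the file directly through the global XRootD redirector (same illustrative LFN):
```bash
edmFileUtil root://cms-xrd-global.cern.ch//store/data/Run2017B/JetHT/MINIAOD/31Mar2018-v1/00000/[file].root
```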
To investigate the content of a file, use `edmDumpEventContent`:
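For example (again with an illustrative file, `[file]` being a placeholder):
```bash
edmDumpEventContent root://cms-xrd-global.cern.ch//store/data/Run2017B/JetHT/MINIAOD/31Mar2018-v1/00000/[file].root
```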
(output may be quite long... just use grep to focus on what you're looking for).
If the datasets are not available on a Tier2 server, but exist on a Tier1 server or on tape, you may need to make a Rucio request.
A few commands have to be run. First, you need to set up your Rucio account:
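typically by exporting the standard Rucio environment variable:
```bash
export RUCIO_ACCOUNT=$USER
```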
where `$USER` is your CERN username. Note that DAS has already tried to set up this variable for you, assuming that your username on the running machine is identical to your CERN username. Then you can "submit rules", e.g.:
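A sketch (the destination site and the approval option are given as examples):
```bash
rucio add-rule --ask-approval cms:/*/*/* 1 T2_DE_DESY
```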
where `/*/*/*` is to be replaced with the full name of the dataset.

The use of campaigns is not mandatory to execute any of the commands to produce the n-tuples, but is recommended to ensure reproducibility and documentation of an analysis.
In general, to avoid reinventing the wheel, we resort to CRAB commands whenever possible. Only the submission is partly re-written.
At DESY, the n-tuples will be stored on `pnfs`. This LFS disk may only be used to store the n-tuples produced with CRAB. See here for more information.
The steering files in CMSSW are mostly written in Python. In particular, the steering file of the n-tupliser is `python/Ntupliser_cfg.py`. It contains options like jet size, triggers, datasets, etc., which may vary from time to time. Most of them are guessed automatically from the year or from the type of data (simulated vs real); other options are provided with the help of a configuration file in JSON format (see below).
To test the n-tupliser directly, you can use the standard CMSSW command:
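A hypothetical sketch (the `config` option name and the path to the JSON file are assumptions about the interface, to be checked against `python/Ntupliser_cfg.py`):
```bash
# hypothetical invocation; the way the JSON config is passed may differ
das-cmssw el8 cmsRun python/Ntupliser_cfg.py config=test/example.json
```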
where the prefix `das-cmssw el8` is expected if you are running from EL9 (this will be implicit in the following). Alternatively, the n-tupliser may be run directly on a dataset file, where `[path]` corresponds to the path to such a file (which you can obtain with `dasgoclient` from a generic dataset name, as explained [here](../README.md)). In general, CMSSW provides a command-line parser for `cmsRun`; in this framework, we only use it to provide a config file in JSON format (see `../test/example.json`).
If the datasets are reachable locally, it is technically possible to run the n-tupliser locally too, using the local farm, but for better reproducibility and greater generality, it is preferable to rely on CRAB. The essential commands to run CRAB are explained [above](../README.md); in practice, a custom command is available to submit several datasets at a time. All other actions rely on standard CRAB commands (e.g. resubmit, kill, etc.).
Documentation:
The source script may be found in `scripts`; after the installation of the software, it is available directly as a command in the prompt. It should be run from the work area or any large file storage (LFS) area. It takes a JSON config file as input; examples of config files may be found in `test`.
The script may be run from both EL8 and EL9, but its behaviour differs slightly; for instance, the local mode only works from an EL9 node. Use `mkNtuples -h` to find out more.
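For instance, a hypothetical call (the actual interface and options should be checked with `mkNtuples -h`):
```bash
# hypothetical invocation with one of the example configs; the syntax may differ
mkNtuples test/example.json
```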
Here, we only present the essential commands. Many TWikis provide additional information:
To initialise CRAB, you may have to use the following (although in principle, DAS sources it automatically for you):
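presumably the standard CRAB setup script on CVMFS, e.g.:
```bash
source /cvmfs/cms.cern.ch/common/crab-setup.sh
```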
This is done in the default initialisation script of the framework, in case you use it. All CRAB commands start with `crab` (assuming that you run from EL8); to get a list of the available commands, enter `crab help`.
To submit, in principle, one may use `crab submit config.py`, where a CRAB configuration in Python format must be provided; in practice, we rather rely here on a custom command, reachable anywhere in the shell after sourcing CMSSW, which makes use of the CRAB API. One submission per dataset (`/*/*/*`) is required (this is transparent when using the custom command). A new directory with local details is created for each submission: think twice before deleting it.
Check the status:
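For instance, with `CRAB/[dir]` the directory created at submission:
```bash
crab status -d CRAB/[dir]
```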
This will show you if the submission has succeeded and if it is still running. You may have to resubmit failed jobs:
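For instance:
```bash
crab resubmit -d CRAB/[dir]
```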
If the jobs keep failing:
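one option (the values below are purely illustrative) is to relax the resource limits at resubmission:
```bash
crab resubmit -d CRAB/[dir] --maxmemory 4000 --maxjobruntime 2750
```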
For real data, it is absolutely essential to process 100% of the dataset. For MC, instead, it is acceptable to reach only 95% or 99% (the events are distributed uniformly, and what matters is that the whole phase space covered by real data is also covered by MC with decent statistics).
If you are running from EL9, add `das-cmssw el8` in front of each command.
The output of the CRAB jobs may be needed, for instance, to calculate the luminosity:
First produce the lumi files from the CRAB jobs:
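presumably with the standard CRAB report command:
```bash
crab report -d CRAB/[dir]
```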
It will produce the JSON files in `CRAB/[dir]/results`.
Then take inspiration from the following script:
```bash
# gather the lumi JSON files produced for all 2017 JetHT eras
files=CRAB/darwin_*_JetHT_Run2017*
jsonFiles=
for f in $files
do
    json=`echo $f/results/crab*.json`
    #echo $json
    jsonFiles="$jsonFiles $json"
done
# turn the paths into remote locations (e.g. for scp to a local `files` directory);
# double quotes are needed so that $USER is expanded
allFiles=`echo $jsonFiles | sed "s|/afs|$USER@naf-cms.desy.de:/afs|g"`

# run brilcalc on each JSON file to extract the recorded luminosity in /fb
for f in files/*.json
do
    echo $f
    brilcalc lumi --normtag /cvmfs/cms-bril.cern.ch/cms-lumi-pog/Normtags/normtag_PHYSICS.json -u /fb -i $f | grep -A2 totrecorded | tail -1 | awk '{print $12}'
done
```
The output looks as follows:
```
files/crab_das1_JetHT_Run2017B-31Mar2018-v1.json 4.794
files/crab_das1_JetHT_Run2017C-31Mar2018-v1.json 9.617
files/crab_das1_JetHT_Run2017D-31Mar2018-v1.json 4.248
files/crab_das1_JetHT_Run2017E-31Mar2018-v1.json 9.314
files/crab_das1_JetHT_Run2017F-31Mar2018-v1.json 13.535
```