Executables running over a large number of events may require some parallelisation. The input samples may be split into subsamples, and a distinct process may be started independently for each subsample.
This may be achieved by hand with the -j (number of slices) and -k (index of the slice) options. For instance, a command along the following lines would run over the first 10% of the input data set:
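(The executable name and file paths below are placeholders; only the -j/-k pattern matters.)

```sh
# Split the input into 10 slices and process only the first one,
# i.e. the first 10% of the events (slice indices assumed to start at 0).
myExec -j 10 -k 0 /path/to/input.root output.root
```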
where exec represents any executable that returns exactly one output (in most cases, such commands also take exactly one input, although this is not strictly required). The -k and -j options may be useful for two reasons:
- to run a quick test over a small fraction of the events, by choosing a large enough value of the -j option;
- to process the whole input in parallel, by starting one process per value of -k.

A trivial implementation of the parallelisation would look like the sketch shown below. The following prefix commands implement these two possibilities in a safer and more sophisticated way; their use will be further illustrated in the dedicated subsections. The help message of the exec command is modified if it is prefixed with any prefix command; it then also includes additional options, such as dry-run and background modes.
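As mentioned above, a trivial implementation amounts to a plain shell loop (a sketch only, with placeholder names; the prefix commands described below handle this in a safer and more sophisticated way):

```sh
# Naive parallelisation: one background process per slice.
nslices=10
for k in $(seq 0 $((nslices - 1))); do
    myExec -j "$nslices" -k "$k" /path/to/input.root "output_${k}.root" &
done
wait  # wait for all background processes to finish
```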
The try prefix command determines the number of necessary slices to run over exactly 100 events and runs over the first available slice:
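(Executable name and paths are placeholders.)

```sh
# Prefix the usual command with `try` to run a quick test over the first ~100 events.
try myExec /path/to/input.root /path/to/output/
```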
If a directory is given as output, then the file name will be automatically set to 0.root.
It may be run several times in a row.
These two prefix commands, submit and parallel, work in a similar way:
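(The sketches below use placeholder executable names and paths.)

```sh
submit myExec /path/to/input.root /path/to/output/
```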
which submits a job to HTCondor; or
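```sh
# local counterpart of the command above
parallel myExec /path/to/input.root /path/to/output/
```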
which runs locally on the present machine. The former should be preferred for any task taking longer than a few minutes; the latter should only be used if you are sure that you are not disturbing anyone else working on the same machine. By default, the value of -j corresponds to the number of cores on the current machine; the value may be overridden:
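```sh
# Split the work into 8 slices instead of the default (one per core);
# the exact placement of the -j option is an assumption.
parallel myExec -j 8 /path/to/input.root /path/to/output/
```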
Going beyond the number of cores (nproc) on a machine is likely useless; too large a number of jobs on HTCondor may also be counterproductive.
Both commands copy the necessary executables and libraries to the output directory (following a so-called fire-and-forget strategy). This avoids job failures when the code is modified (and recompiled) while jobs are still running.
The standard output of each job is also stored (but hidden) in the output directory; use ls -a to see the files. If a single job has failed, the command to rerun it is given on the first line of its standard output, and running it locally is often sufficient (rather than resubmitting all jobs).
The try, submit, and parallel commands are focused on running a single executable. One would then typically come up with a script that runs them in a row, along the lines of the following sketch (placeholder names; each step is assumed to block until its jobs have finished):
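```sh
#!/bin/sh
# Three hypothetical analysis steps chained in a row;
# each step consumes the output directory of the previous one.
submit myExec1 /path/to/input.root    /path/to/output/step1/
submit myExec2 /path/to/output/step1/ /path/to/output/step2/
submit myExec3 /path/to/output/step2/ /path/to/output/step3/
```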
or similarly with parallel. In all cases, a shell must be left open while the three commands run in a row. This is not necessarily a problem, as they could be started in a screen session, but one may still wish to rely on more advanced functionality of HTCondor.
Furthermore, two chains of commands may need to merge:
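For instance (a purely illustrative sketch with placeholder names), a final step may need the outputs of two otherwise independent chains:

```sh
submit myExecA /path/to/inputA.root /path/to/output/chainA/   # first chain
submit myExecB /path/to/inputB.root /path/to/output/chainB/   # second chain
# Final step taking both chains as input (multiple inputs assumed possible):
submit myMerge /path/to/output/chainA/ /path/to/output/chainB/ /path/to/output/final/
```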
In this section, we propose solutions to ease the submission of large workflows.
These two commands, job and transcribe, are used together to build HTCondor DAG jobs and submit them. The transcribe command executes a shell script that in turn calls job once per job in the HTCondor DAG. The dependencies between jobs are figured out automatically, and the corresponding graph is generated.
The job prefix command works similarly to submit, but rather than submitting a job, it appends a section to a DAGMan file (which must be provided using the -i option). It also does not copy all executables. It finds out its own dependencies by scanning the other jobs already present in the DAGMan. Note that job may work standalone and does not necessarily need to be called via transcribe.
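A hypothetical standalone use, assuming the -i option is given right after the prefix (the exact syntax may differ):

```sh
# Append one node per step to workflow.dag instead of submitting right away.
job -i workflow.dag myExec1 /path/to/input.root    /path/to/output/step1/
job -i workflow.dag myExec2 /path/to/output/step1/ /path/to/output/step2/
```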
This DAGMan may be submitted using condor_submit_dag by hand. The full power of HTCondor may then be used to recover failing jobs.
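For instance, assuming the DAGMan file from the sketch above is called workflow.dag:

```sh
# Submit the whole graph; HTCondor then takes care of ordering and retrying jobs.
condor_submit_dag workflow.dag
```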
The transcribe command is a slightly different prefix command: unlike all the commands exposed previously, it does not prefix the executables directly but a shell script, in which it aliases the submit command to replace it effectively with job. It also copies the necessary executables and libraries just once (to avoid too many copies, which would happen in case of loops in the script) into ./.dag/[name], where name corresponds to the input script. This makes it possible to take a single script, possibly tested with try or used in the past in an interactive shell, and to run it as it is as a DAG job. This however imposes a few (hopefully mild) constraints on the script (e.g. a hadd will fail, since the ROOT files won't be ready at the time of its execution).
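A hypothetical invocation, reusing a chaining script such as the one sketched earlier (here called chain.sh):

```sh
# Run the existing script as a DAG job: its submit calls effectively become
# job calls, and executables and libraries are copied once into ./.dag/chain.sh.
transcribe chain.sh
```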
A test of these commands is run when installing the software. It is split between two scripts, test/transcribe and test/job. Their output may be found in the build/test directory.
A series of commands may also be run directly from the command line, replacing the first script given above. The additional table in CSV format would look as follows:
| name | source | exec2 |
|---|---|---|
| data1 | /path/to/first/input1 | 42 |
| data2 | /path/to/first/input2 | 4.2 |
| data3 | /path/to/first/input3 | 0.42 |
| data4 | /path/to/first/input4 | 0.042 |
| ... | ... | ... |
where the column exec2 specifies arguments that are specific to a data set and cannot be provided in the meta.info.
TODO
This approach is based on a global config file in INI format that includes the inputs, the commands, and the interdependencies.
TODO