DAS 3.0 (Das Analysis System)
Executables running over a large number of events may need some parallelisation. The input samples may be split into subsamples, and a distinct process may be started independently for each subsample. This may be achieved by hand with the -j (number of slices) and -k (index of the slice) options. For instance, one could run over only the first 10% of the input data set.
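A command along the following lines could be used (a hedged sketch: the executable name exec, the zero-based slice indexing, and the bracketed input/output arguments are placeholders rather than a fixed interface):

```sh
# Split the input into 10 slices (-j 10) and process only the first one (-k 0),
# i.e. the first 10% of the data set; exec and the bracketed arguments are placeholders.
exec -j 10 -k 0 [input] [output]
```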
where exec represents any executable that returns exactly one output (in most cases, such commands also take one input, although this is not strictly required). The -k and -j options may be useful for two reasons:
- to run a quick test over a small fraction of the events, using only the -j option;
- to parallelise the processing of the full data set, starting one process per value of -k.

A trivial implementation of the parallelisation is sketched after this paragraph. The following prefix commands implement these two possibilities in a safer and more sophisticated way; their use is further illustrated in the dedicated subsections. The help message of the exec command is modified if it is prefixed with any prefix command; it then also includes additional options such as dry-run and background modes.
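The trivial, by-hand implementation mentioned above might look like the following sketch (exec, the number of slices, and the bracketed input/output arguments are placeholders):

```sh
# Naive parallelisation: start one background process per slice, then wait for all of them.
NSLICES=10
for k in $(seq 0 $((NSLICES - 1))); do
    exec -j "$NSLICES" -k "$k" [input] [output_$k] &
done
wait
```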
The try prefix command determines the number of slices necessary to run over exactly 100 events and runs over the first available slice.
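A hypothetical invocation might look as follows (the executable name and the bracketed arguments are placeholders):

```sh
# Prefix the usual command with try: it determines suitable -j and -k values itself.
try exec [input] [output_directory]
```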
If a directory is given as output, then the file name will be automatically set to 0.root.
The try command may be run several times in a row.
These two prefix commands work in a similar way: submit, which submits a job to HTCondor, and parallel, which runs locally on the present machine. The former should be preferred for any task taking longer than a few minutes; the latter should only be used if you are sure that you are not disturbing anyone else working on the same machine. By default, the value of -j corresponds to the number of cores on the current machine; the value may be overwritten.
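A hedged sketch of both commands (the executable name, the bracketed arguments, and the exact placement of the -j option are assumptions):

```sh
# Split into 100 slices and submit one HTCondor job per slice:
submit exec -j 100 [input] [output]
# Run locally instead, overriding the default number of slices (nproc) with 8:
parallel exec -j 8 [input] [output]
```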
Going beyond the number of cores (nproc) on a machine is likely useless; too large a number of jobs on HTCondor may also be counterproductive.
Both commands copy the necessary executables and libraries to the output directory (following the so-called fire-and-forget strategy). This avoids job failures caused by working on (and recompiling) the code while jobs are still running. The standard output of each job is also stored (but hidden) in the output directory; use ls -a to see the files. If a single job has failed, the command to rerun it is given on the first line of its standard output, and running it locally is often sufficient (rather than resubmitting all jobs).
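For instance, the hidden logs might be inspected as follows (the output path and the log file name are placeholders, since the actual naming scheme is not specified here):

```sh
# List the hidden per-job standard-output files stored next to the results:
ls -a [output_directory]
# The first line of a failed job's standard output gives the command to rerun locally:
head -n 1 [output_directory]/[hidden_log_file]
```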
The try, submit, and parallel commands are focused on running one single executable. One would then typically come up with a script that runs them in a row.
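Such a script might look like the following sketch (the executable names and the bracketed inputs and outputs are placeholders):

```sh
#!/bin/sh
# Three processing steps submitted one after the other,
# each consuming the output of the previous one.
submit exec1 [input]   [output1]
submit exec2 [output1] [output2]
submit exec3 [output2] [output3]
```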
or similar with parallel. In all cases, a shell must be left open to run all three commands in a row. This is not necessarily a problem, as the script could be started in a screen session, but one may still wish to rely on more advanced functionalities of HTCondor.
Furthermore, two chains of commands may need to merge at some point.
In this section, we propose solutions to ease the submission of large workflows.
These commands are used together to build HTCondor DAG jobs and submit them. The transcribe
command is used to execute a shell script that in turn calls job
once per job in the HTCondor DAG. transcribe automatically figures out dependencies between jobs and generates the corresponding graph.
The job prefix command works similarly to submit, but rather than submitting a job, it appends a section to a DAGMan file (which must be provided using the -i option). It also does not copy all the executables. It finds out its own dependencies by scanning the other jobs already provided in the DAGMan. Note that job may work standalone and does not necessarily need to be called by transcribe.
This DAGMan may be submitted by hand using condor_submit_dag. The full power of HTCondor may then be used to recover failing jobs.
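As a hedged sketch, a standalone use of job might look like this (the executable names, the bracketed arguments, and the DAG file name are placeholders):

```sh
# Append two jobs to the same DAGMan file; the second depends on the first,
# since it consumes its output.
job -i workflow.dag exec1 [input]   [output1]
job -i workflow.dag exec2 [output1] [output2]
# Submit the resulting DAG by hand:
condor_submit_dag workflow.dag
```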
The transcribe command is a slightly different prefix command: unlike all the commands described so far, it does not prefix an executable directly but a shell script, in which it aliases the submit command so as to effectively replace it with job. It also copies the necessary executables and libraries just once (to avoid the many copies that would otherwise happen in case of loops in the script) to ./.dag/[name], where name corresponds to the input script. This makes it possible to keep one single script, possibly tested with try or used in the past in an interactive shell, and to run it as is as a DAG job. This however imposes a few (hopefully mild) constraints on the script (e.g. a hadd will fail, since the ROOT files won't be ready at the time of its execution).
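For illustration, the workflow might look like this (the script name and its contents are placeholders):

```sh
# analysis.sh: a plain script originally written for interactive use, e.g.
#   submit exec1 [input]   [output1]
#   submit exec2 [output1] [output2]
# Turn the same script into a DAG job; submit is aliased to job internally:
transcribe analysis.sh
```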
A test of these commands is run when installing the software. It is split between two scripts, test/transcribe and test/job. Their output may be found in the build/test directory.
A series of commands may also be run directly from the command line, replacing the first script given above. The additional table, in CSV format, would look as follows:
| name  | source                | exec2 |
|-------|-----------------------|-------|
| data1 | /path/to/first/input1 | 42    |
| data2 | /path/to/first/input2 | 4.2   |
| data3 | /path/to/first/input3 | 0.42  |
| data4 | /path/to/first/input4 | 0.042 |
| ...   |                       |       |
where the exec2 column specifies arguments that are specific to a data set and cannot be provided in the meta.info.
TODO
This approach is based on a global config file in INI format that includes the inputs, the commands, and the interdependencies.
TODO