arrayrun
Folders and files
Name | Name | Last commit date | ||
---|---|---|---|---|
parent directory.. | ||||
-*- text -*- $Id: README.arrayrun,v 1.2 2011/06/28 11:21:27 bhm Exp $ Overview ======== Arrayrun is an attempt to simulate arrayjobs as found in SGE and PBS. It works very similarly to mpirun: arrayrun [-r] taskids [sbatch arguments] YourCommand [arguments] In principle, arrayrun does TASK_ID=id sbatch [sbatch arguments] YourCommand [arguments] for each id in the 'taskids' specification. 'taskids' is a comma separated list of integers, ranges of integers (first-last) or ranges with step size (first-last:step). If -r is specified, arrayrun will restart a job that has failed. To avoid endless loops, a job is only restarted once, and a maximum of 10 (configurable) jobs will be restarted. The idea is to submit a master job that calls arrayrun to start the jobs, for instance $ cat workerScript #!/bin/sh #SBATCH --account=YourProject #SBATCH --time=1:0:0 #SBATCH --mem-per-cpu=1G DATASET=dataset.$TASK_ID OUTFILE=result.$TASK_ID cd $SCRATCH YourProgram $DATASET > $OUTFILE # end of workerScript $ cat submitScript #!/bin/sh #SBATCH --account=YourProject #SBATCH --time=50:0:0 #SBATCH --mem-per-cpu=100M arrayrun 1-200 workerScript # end of submitScript $ sbatch submitScript The --time specification in the master script must be long enough for all jobs to finish. Alternatively, arrayrun can be run on the command line of a login or master node. If the master job is cancelled, or the arrayrun process is killed, it tries to scancel all running or pending jobs before it exits. Arrayrun tries not to flood the queue with jobs. It works by submitting a limited number of jobs, sleeping a while, checking the status of its jobs, and iterating, until all jobs have finished. All limits and times are configurable (see below). It also tries to handle all errors in a graceful manner. Installation and configuration ============================== There are two files, arrayrun (to be called by users) and arrayrun_worker (exec'ed or srun'ed by arrayrun, to make scancel work). arrayrun should be placed somewhere on the $PATH. arrayrun_worker can be place anywhere. Both files should be accessible from all nodes. There are quite a few configuration variables, so arrayrun can be tuned to work under different policies and work loads. Configuration variables in arrayrun: - WORKER: the location of arrayrun_worker Configuration variables in arrayrun_worker: - $maxJobs: The maximal number of jobs arrayrun will allow in the queue at any time - $maxIdleJobs: The maximal number of _pending_ jobs arrayrun will allow in the queue at any time - $maxBurst: The maximal number of jobs submitted at a time - $pollSeconds: How many seconds to sleep between each iteration - $maxFails: The maximal number of errors to accept when submitting a job - $retrySleep: The number of seconds to sleep between each retry when submitting a job - $doubleCheckSleep: The number of seconds to sleep after a failed sbatch before runnung squeue to double check whether the job was submitted or not. - $maxRestarts: The maximal number of restarts all in all - $sbatch: The full path of the sbatch command to use Notes and caveats ================= Arrayrun is an attempt to simulate array jobs. As such, it is not perfect or foolproof. Here are a couple of caveats. - Sometimes, arrayrun fails to scancel all jobs when it is itself cancelled - When arrayrun is run as a master job, it consumes one CPU for the whole duration of the job. Also, the --time limit must be long enough. This can be avoided by running arrayrun interactively on a master/login node (in which case running it under screen is probably a good idea). - Arrayrun does (currently) not checkpoint, so if an arrayrun is restarted, it starts from scratch with the first taskid. We welcome any suggestions for improvements or additional functionality! Copyright ========= Copyright 2009,2010,2011 Bjørn-Helge Mevik <b.h.mevik@usit.uio.no> This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License version 2 for more details. A copy of the GPL v. 2 text is available here: http://www.gnu.org/licenses/old-licenses/gpl-2.0.txt