
Serial batch jobs

This section gives examples of how to submit serial batch jobs, together with sample serial batch job scripts.

IMPORTANT: All files needed by a job, for example the program and its input and output files, must be copied to $WRKDIR. Remember to give the necessary module load modulefilename commands, and include them in the batch job script as well if the job needs them.

See sections Running programs (batch and interactive) and Parallel jobs for current examples of serial and parallel jobs (both interactive and real batch jobs).


Array jobs


In many cases a computational analysis job consists of a number of similar, independent subtasks. The user may have several datasets that will be analyzed in the same way, or one dataset that will be analyzed with a number of different analysis parameters. These kinds of analysis tasks are often called “embarrassingly parallel” jobs, as the work can in principle be distributed to as many processors as there are subtasks to be run. In Murska these kinds of tasks can be run effectively by using the array job function of the LSF batch job system.


Defining an array job


In LSF an array job is defined with the option -J. Normally this option defines the name of your job:

   #BSUB -J my_job

In an array job, an index list definition is added to the job name:

   #BSUB -J my_array_job[1-100]

The definition above will launch not just one batch job but 100 batch jobs, where the subjob-specific environment variable $LSB_JOBINDEX gets values from 1 to 100. This variable can then be used in the actual job launching commands so that each subtask gets processed. All the subjobs are submitted to the batch job system at once, and they will be executed using as many processors as are available.


Simple array job


As a first array job example, let's assume that we have 50 datasets (data_1.inp, data_2.inp … data_50.inp) that we would like to analyze using the program my_prog, which uses the syntax:

   my_prog inputfile outputfile

Each of the subtasks requires about 2 hours and less than 1 GB of memory. We can perform all 50 analysis tasks with the following batch job script:

#!/bin/csh 
#BSUB -L /bin/csh
#BSUB -J my_job[1-50]
#BSUB -o job_out
#BSUB -e job_err
#BSUB -M 1048576
#BSUB -W 3:00
#BSUB -n 1

# move to the directory where the data files are located
cd data_dir

# run the analysis command
my_prog data_"$LSB_JOBINDEX".inp data_"$LSB_JOBINDEX".out

At the beginning of the batch job script, the line #BSUB -J my_job[1-50] defines that 50 subjobs will be submitted. The names of the subjobs will range from my_job[1] to my_job[50]. The other #BSUB lines apply to each individual subjob. In this case each subjob uses one processor (-n 1) and at most 1 GB of memory (-M 1048576), and can run for at most 3 hours (-W 3:00). However, the total wall clock time needed to process all 50 tasks is not limited in any way.


Note that adding a #BSUB -N definition to the batch job script would make every subjob send an e-mail when it finishes. This is probably not a good idea when dozens of subjobs are to be processed.


In the job execution commands, the script uses the $LSB_JOBINDEX variable in the names of the input and output files, so that the first subjob will run the command:


   my_prog data_1.inp data_1.out


the second will run the command:

   my_prog data_2.inp data_2.out


and so on…
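Before submitting, the command lines that the subjobs will run can be previewed on the command line. The sketch below is written in POSIX sh (not the csh used in the job script) so that it can be tried in any shell; it does not call LSF at all and only imitates the substitution of $LSB_JOBINDEX. The helper name subjob_command is hypothetical and is not part of LSF.

```shell
#!/bin/sh
# Sketch only: imitates the $LSB_JOBINDEX substitution of the job script above.
# subjob_command is a hypothetical helper, not an LSF command.
subjob_command() {
    # $1 plays the role of $LSB_JOBINDEX inside one subjob
    echo "my_prog data_$1.inp data_$1.out"
}

# Preview the command lines of the first three subjobs
i=1
while [ "$i" -le 3 ]; do
    subjob_command "$i"
    i=$((i + 1))
done
```

The loop only prints the commands; nothing is executed or submitted.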

The job can now be launched with the command:
bsub < job_script.lsf

If you give the command bjobs after submitting your array job, you will see that you have 50 jobs in the batch job system. All these jobs have the same jobid (in the first column) but a different name (7th column). Typically not all subjobs start executing at once, but after a while a large number of them may be running at the same time. When the array job has finished, the data_dir directory contains 50 output files.


Directing the output of each subjob to a separate file is recommended, as the file system of Murska may fail if several dozen processes try to write to the same file at the same time. If the output files need to be merged into one file, this can often be done easily after the array job has finished. For example, in the case above we could collect the results into one file with the command:

cat data_*.out > all_data.out
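Note that the shell expands data_*.out in alphabetical rather than numeric order, so data_10.out is concatenated before data_2.out. If the order of the results matters, a small loop can merge the files by index instead. The following is a minimal POSIX sh sketch; the helper name merge_in_order is hypothetical, and the file naming follows the example above.

```shell
#!/bin/sh
# merge_in_order FIRST LAST OUTFILE
# Hypothetical helper: concatenates data_FIRST.out .. data_LAST.out
# in numeric order and warns about missing result files.
merge_in_order() {
    first=$1
    last=$2
    outfile=$3
    : > "$outfile"                  # start from an empty result file
    i=$first
    while [ "$i" -le "$last" ]; do
        if [ -f "data_$i.out" ]; then
            cat "data_$i.out" >> "$outfile"
        else
            echo "warning: data_$i.out is missing" >&2
        fi
        i=$((i + 1))
    done
}
```

For the 50-task example this would be called in data_dir as merge_in_order 1 50 all_data.out.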

Picking filenames from a list


In the example above, we were able to use $LSB_JOBINDEX to refer to the running numbers in the input file names. If this type of approach is not possible, a list of files or commands, created before the batch job is submitted, can be used instead. Let’s assume that we have a similar task as defined above, but the file names don’t contain numbers and are instead in the format data_aa.inp, data_ab.inp, data_ac.inp… and so on. Now we first need to make a list of the files to be analyzed. In this case we could collect the file names into the file “namelist” with the command:

   ls data_*.inp > namelist
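The number of lines in the name list determines the index range the array job needs. As a minimal sketch (POSIX sh, run on the command line before submission), the matching -J line can be derived from the list; the helper name array_directive is hypothetical, and the job name my_job is taken from the examples in this section.

```shell
#!/bin/sh
# array_directive FILE
# Hypothetical helper: prints a "#BSUB -J" line whose index range covers
# every line of the given name list.
array_directive() {
    n=$(wc -l < "$1" | tr -d ' ')   # line count, with any padding removed
    echo "#BSUB -J my_job[1-$n]"
}
```

Running array_directive namelist in data_dir prints the line to paste into the batch job script.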

In the example below we will use the command:

    sed -n "row_number"p inputfile

to read a given line from the name list file.
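The behaviour of this sed command can be tried out without LSF. The sketch below, in POSIX sh, builds a throwaway three-line list and sets LSB_JOBINDEX by hand; in a real subjob LSF sets the variable. The helper name nth_line and the temporary file are assumptions made for illustration.

```shell
#!/bin/sh
# nth_line N FILE
# Hypothetical helper: prints line N of FILE using sed -n 'Np'.
nth_line() {
    sed -n "${1}p" "$2"
}

# Throwaway name list for the demonstration
demo_list=$(mktemp)
printf '%s\n' data_aa.inp data_ab.inp data_ac.inp > "$demo_list"

LSB_JOBINDEX=2      # set by LSF in a real subjob; set by hand here
nth_line "$LSB_JOBINDEX" "$demo_list"    # prints data_ab.inp
```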

In this case the actual batch job script could be as follows:

#!/bin/csh 
#BSUB -L /bin/csh
#BSUB -J my_job[1-50]
#BSUB -o job_out
#BSUB -e job_err
#BSUB -M 1048576
#BSUB -W 3:00
#BSUB -n 1

# move to the directory where the data files are located
cd data_dir

# set input file to be processed
set name=(`sed -n "$LSB_JOBINDEX"p namelist`)

# run the analysis command
my_prog $name $name.out

This example is otherwise similar to the first one, but it reads the name of the file to be analysed from a file called “namelist”. The name is stored in the variable $name, which is then used in the job execution command. As the row number to be read is defined by $LSB_JOBINDEX, each data file listed in “namelist” gets processed in a different subjob. Note that as we now use $name in the output definition too, the output file names will be in the format data_aa.inp.out, data_ab.inp.out, data_ac.inp.out… and so on.
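If output names such as data_aa.out are preferred over data_aa.inp.out, the .inp suffix can be removed before .out is appended. The sketch below shows the idea in POSIX sh using standard parameter expansion; the helper name out_name is hypothetical. csh has its own mechanism for this (the :r variable modifier), which is not shown here.

```shell
#!/bin/sh
# out_name INPUT
# Hypothetical helper: maps an input name such as data_aa.inp to data_aa.out.
out_name() {
    echo "${1%.inp}.out"    # strip a trailing ".inp", then append ".out"
}

out_name data_aa.inp        # prints data_aa.out
```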


In some cases it may be reasonable to run each analysis task in a separate directory. For example, your analysis program may create temporary files with fixed names, so that several processes cannot be executed simultaneously in the same directory. In the example below the previous analysis is run so that each subtask is executed in a separate temporary directory. The #BSUB -o and #BSUB -e definitions are modified too. By using the %I notation, the standard output and the standard error are written to subjob-specific files instead of one common file.

#!/bin/csh 
#BSUB -L /bin/csh
#BSUB -J my_job[1-50]
#BSUB -o job_out.%I
#BSUB -e job_err.%I
#BSUB -M 1048576
#BSUB -W 3:00
#BSUB -n 1

# move to the directory where the data files are located
cd data_dir

# set input file to be processed
set name=(`sed -n "$LSB_JOBINDEX"p namelist`)

# create temporary directory
mkdir -p tmp_$LSB_JOBINDEX

# copy inputfile
cp $name tmp_$LSB_JOBINDEX

# move to the temporary directory
cd tmp_$LSB_JOBINDEX

# run the analysis command
my_prog $name $name.out

# copy the result file
cp $name.out ../

# move to the main directory
cd ..

# remove the temporary directory
rm -rf tmp_$LSB_JOBINDEX

Running large array jobs


If you have several thousand tasks to be processed, we recommend that you do not submit all of them at once, because the LSF batch job system has difficulties in managing very large numbers of subjobs. In these cases you can first submit just 1000 subjobs using the definition:

   #BSUB -J my_job[1-1000]

When the first array job has finished, you only need to change the -J definition in the batch job file into the form:

   #BSUB -J my_job[1001-2000]

to run the next 1000 jobs.
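For a large task count, the successive -J lines can be listed in advance. The following POSIX sh sketch prints one definition per batch; the helper name chunk_ranges is hypothetical, and the total of 2500 tasks is only an illustrative figure.

```shell
#!/bin/sh
# chunk_ranges TOTAL SIZE
# Hypothetical helper: prints one "#BSUB -J" line per batch of at most
# SIZE subjobs, covering indices 1..TOTAL.
chunk_ranges() {
    total=$1
    size=$2
    start=1
    while [ "$start" -le "$total" ]; do
        end=$((start + size - 1))
        if [ "$end" -gt "$total" ]; then
            end=$total              # the last batch may be shorter
        fi
        echo "#BSUB -J my_job[$start-$end]"
        start=$((end + 1))
    done
}

# Example: 2500 tasks in batches of 1000
chunk_ranges 2500 1000
```

Each printed line is used in turn, and the next batch is submitted only after the previous array job has finished.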

Furthermore, it is often reasonable and polite to limit the number of subjobs that can be executed simultaneously. This is done with the definition:

   #BSUB -J my_job[index_definition]%max_run_number

For example, if the subjobs read and/or write data frequently, it may be more effective to limit the number of simultaneous processes in order to avoid overloading the file system of Murska. Licensed software tools, like Gold or Dmol3, are another case where limiting is needed. In these cases limiting is used to ensure that capacity of the license is not exceeded as this would cause most of the subjobs to fail.


The following line would submit 1000 subjobs but allow only 128 of them to be executing at the same time.

   #BSUB -J my_job[1-1000]%128


More examples of array jobs are given in Running Materials Studio array jobs in Murska, Running Gromacs in Murska, Running Gromacs jobs as an array job or in the grid and, in Finnish, in Murskan eräajoista ja päivityksestä.


Deleting array jobs


Deleting array jobs is done with the bkill command.

   bkill job_id_number


The command above deletes all the subjobs of one array job. You can also delete just part of your job. If we have an array job with jobid 3434 from which we would like to delete subjobs 100-150, we can do so with the command:

   bkill "3434[100-150]"

Note the obligatory quotation marks in the job definition. If we would like to remove just subjobs 100, 113 and 213, we could do so with command:

   bkill "3434[100,113,213]"

Resubmitting these three subtasks could be done by changing the -J definition in the batch job file into the following form (note the quotation marks again):

  #BSUB -J "my_job[100, 113, 213]"
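When more than a few subjobs need to be rerun, the index list can be collected automatically. The POSIX sh sketch below assumes the file naming of the first example (data_N.out) and also assumes that a missing or empty output file means the subjob failed, which may not hold for every program. The helper name missing_indices is hypothetical.

```shell
#!/bin/sh
# missing_indices FIRST LAST
# Hypothetical helper: prints a comma-separated list of the indices whose
# data_N.out file is absent or empty (assumed here to indicate a failed subjob).
missing_indices() {
    list=""
    i=$1
    while [ "$i" -le "$2" ]; do
        if [ ! -s "data_$i.out" ]; then
            list="${list:+$list,}$i"    # append, with a comma between entries
        fi
        i=$((i + 1))
    done
    echo "$list"
}
```

Run in data_dir, missing_indices 1 50 prints a list such as 100,113,213 in the example above, which can be placed inside the brackets of the -J definition.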


By using array jobs, the overall job throughput time may be notably reduced while system resources are still used in a polite manner. Users are encouraged to try out array jobs to find the best usage patterns for their own use cases.