Louhi User's Guide, the 2nd Edition > Batch jobs and the batch system > Parallel batch jobs
Tehdyt toimenpiteet

Parallel batch jobs

This section describes parallel batch jobs, giving examples of batch job scripts, and interactive PBS sessions.

See section Submitting jobs: qsub for information about the qsub command and section Job launching command: aprun for information about the aprun command. Their arguments and usage are described on these pages. See also section Commands for other PBS commands.

The batch job script named ad job.sh can be submitted by the command

qsub job.sh

You can use also the options of qsub in the command line or in the batch job script. Examples here are in the latter form.

Job script for quad- and eight-core mode

The normal and recommended mode is to use all four cores on each node (XT4 node queues) or all eight cores on each node (XT5 node queues), that is 4 PEs is default for XT4 nodes and 8 PEs for XT5 nodes. The following example is valid both for XT4 and XT5 nodes:

#!/bin/sh
#PBS -N my_job
#PBS -j oe
#PBS -l walltime=1:00:00
#PBS -l mppwidth=256
#PBS -m e

#PBS -M user1@univ2.fi
#PBS -r n


cd $PBS_O_WORKDIR
aprun -n 256 ./program

Suppose that the name of this batch job script is job.sh.  It is submitted by the command

qsub job.sh

However, you don't know in advance is your job going to XT4 or XT5 queues. It depends on how the routing and executing queues are configured (qstat -Q -f shows that). The jobs requesting small number of cores (say 32-526 cores) tend to favour XT5 queues and those requesting bigger number of cores (say 527 up to 2048) tend to favour XT4 queues. Only secure way to get XT5 nodes is at present to use qsub option  -l mppnppn=8 and the corresponding aprun option -N 8. Even this may change in future. There is no similar way for XT4 queues.

The name "my_job" is given to the job (-N my_job),  standard error is merged to standard output (-j oe), one hour, zero minutes and zero seconds are requested for total run time (wall time) of the job (-l walltime=1:00:00), 256 processing elements (PEs) and practically the same number of cores  are requested for the job (-l mppwidth=256), and the batch queuing system is requested to send an email to the user when the job is finished (-m e) to the address user1@univ2.fi (-M user1@univ2.fi). Replace this address here with your own e-mail address. Other sub-options for -m are: a mail is sent when the job is aborted by PBS, b mail is sent when the job begins execution. You can also use any combination of a, b and  e. The job is not rerun after unavailibility of the execution host (-r n).

The quad-core mode (XT4 nodes and queues) or the eight-core mode (XT5 nodes and queues) is on by default on Louhi, so the aprun command implicitly includes the option -N 4 (4 cores per node, XT4) or -N 8 (8 cores per node, XT5) .  Therefore by requesting 256 PEs (and cores) means that 64 XT4 compute nodes in quad-core mode or 32 XT5 compute nodes in eight-core mode is requested to perform the job. The following are partially equivalent to the above aprun commands for the present configuration of Louhi.  PE numbers/node are not, however, given explicitly in normal MPI run (except we want XT5 nodes), because it is not known in advance will the job get XT4 or XT5 nodes:

XT4 (depends on the core size, this probably goes to XT5 nodes, but then 4 cores are idle on them):

aprun -n 256 -N 4 ./program

XT5 (this is sure, if the same is given for PBS):

aprun -n 256 -N 8 ./program

You should use also the option -l mppmem=size of PBS Pro and the option -m size of aprun for specifying the memory needed by each PE as accurately as possible. You should always use both of these options if you use one of them. Preferably, you should request the same amount of memory by both options.  The 2000 MB (2 GB) is maximum request per PE you can do with qsub. See below how this restriction can be bypassed.

...
#PBS -l mppmem=1000M
...
aprun -n 256 -N 8 -m 1000M ./program

Job script for single-core mode

Using only a single core per node can be necessary for MPI processes requiring 3-16 GB of memory. Use the single-core mode only of it is absolutely necessary! Please note that you will also be charged for the CPU time of both cores on your quota. Suppose we need almost 16 GB of memory for each MPI task (per PE), but we can request with qsub only 2 GB/PE. We must cheat the PBS system as follows:

#!/bin/sh
#PBS -N my_job
#PBS -j oe
#PBS -l walltime=1:00:00
#PBS -l mppwidth=256
#PBS -l mppnppn=8
#PBS -l mppmem=2000M
#PBS -r n
  
cd $PBS_O_WORKDIR
aprun -n 32 -N 1 ./program

We request here 2 GB/PE and 8 PEs/node which means that we request 16 GB/node. This way we will get surely the biggest memory XT5 nodes (16 GB/node). We request here also 256 PEs (cores) which means that we request 32 nodes (256 PE/(8 PE/node) = 32 node). We are not however going to use 8 PEs or 8 cores in these nodes by only one in each so that we will get the 16 GB memory for a single PE. We are thus cheating PBS. We specify only 32 PEs (cores) for aprun  and only one PE/node with the option -N 1. This means that we are requesting 32 nodes also from aprun. We are not specifying the memory reaquest for aprun, because we believe that we will get all 16 GB for each PE. We can however specify memory for aprun up to 16000 MB with  -m 16000M, because -N-m ( = 1 x 16000) must be equal or less than mppnppn x mppmem ( = 8 x 2000), otherwise we will get an error.

Job script for a chain of sequentially dependent jobs

In order to begin execution of a new job after the specified job have successfully executed, -W depend option can be used. In the example below a chain of dependent jobs is submitted, with previous job having successfully executed as an initial condition. This means that the job itself is going put on hold until the previous job has completed with no errors. Only then the subsequent job is going to be released.

#!/bin/bash
basename="mytest"
noproc="24"
walltime="00:10:00"
queuename="test"
for count in {0..2}
do
    (( prev = count - 1 ))  # sets the previous file to restart
    # Create submit file
    echo "#PBS -N t_${basename}_${count}" > ${basename}_${count}.pbs
    echo "#PBS -o t_${basename}_${count}.log" >> ${basename}_${count}.pbs
    echo "#PBS -j oe" >> ${basename}_${count}.pbs
    echo "#PBS -l mppwidth=${noproc}" >> ${basename}_${count}.pbs
    echo "#PBS -l walltime=${walltime}" >> ${basename}_${count}.pbs
    if [ $count -gt 0 ];
    then
    echo "#PBS -W depend=afterok:'$jobid'" >> ${basename}_${count}.pbs
    fi
    echo "#PBS -q ${queuename}" >> ${basename}_${count}.pbs
    echo "cat \$PBS_NODEFILE" >> ${basename}_${count}.pbs
    echo "cd \$PBS_O_WORKDIR" >> ${basename}_${count}.pbs
    echo "echo \$PBS_JOBID" >> ${basename}_${count}.pbs
    if [ $count -gt 0 ];
    then
    echo "aprun -n ${noproc} echo \"executed after job\" '$jobid'" >> ${basename}_${count}.pbs
    else
    echo "aprun -n ${noproc} echo \"primary job\" " >> ${basename}_${count}.pbs
    fi
    # submit the job
    jobid=`qsub ${basename}_${count}.pbs`
    echo ${jobid}
done

In this example PBS will wait for the first job to finish before it will release the second job. Then PBS will wait for the second job to finish, before the third job gets released. All jobs are submitted at the same time from a single shell script.

Job Script for OpenMP program

OpemMP programs can be run in XT4 nodes using 1-4 threads and in XT5 nodes using 1-8 threads. In fact, you can use any number of threads, but if it is more than 4 in XT4 nodes or more than 8 in XT5 nodes, performance may reduce, because then more than one thread is running in one or more cores. The optimal number of threads is almost always equal to the number of employed cores.

The following batch job script is an example for running an OpenMP program in a single XT5 node using 8 threads.

#!/bin/sh
#PBS -l walltime=00:15:00
#PBS -l mppwidth=1

#PBS -l mppmem=2000M
#PBS -l mppdepth=8
#PBS -l mppnppn=1
#PBS -r n
#PBS -q test

     
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
aprun -n 1 -N 1 -d 8 ./openmp_mpi.exe

In principle,  only one PE should be reserved for a pure OpenMP job with -l mppwidth and one PE per node with -l mppnppn and with the corresponding aprun options -n  and -N, which means that for a pure OpenMP job only one node is reserved. One PE can be reserved only if the job is submitted to the queue test as is done here and also the maximum wall time, 15 min, for this queue is requested. The number of threads used by the PE, that is the program, is determined with the PBS option -l mppdepth and the aprun option  -d. This number corresponds the number of cores used and reserved if this  number is 4 or less for XT 4 nodes and 8 or less for XT5 nodes.

The environment variable determining the thread number OPM_NUM_THREAD is set to the same value as options of qsub and aprun.

If more wall time than 15 minutes is needed for an OpenMP job then PBS must be cheated to believe that the amount of the PEs needed is the same as the minimum PE number in a suitable queue, even though it is only one PE for a pure OpenMP job.  In many queues it is 8 (earlier it was 32, see qstat -Q -f | less). Then the following kind batch job script would do the the job:

#!/bin/sh
#PBS -l walltime=02:00:00
#PBS -l mppwidth=8
#PBS -l mppnppn=8


#PBS -l mppmem=2000M


#PBS -r n

     
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
aprun -n 1 -N 1 -d 8 ./openmp_mpi.exe

Here we reserved with PBS two hours of wall time, 8 PEs without saying the depth. For aprun the true values are given.

See also Chapter OpenMP parallelization.

Job Script for mixed or hybrid OpenMP/MPI program

In some cases it is beneficial to combine MPI and OpenMP parallelization. More precisely, the inter-node communication is handled with MPI and within the nodes OpenMP and shared memory of the node are used. In OpenMP/MPI applications, MPI calls can be made from master or sequential regions but not from parallel regions.

For example, consider an 32-node job in which there is one MPI task (PE) per node and each MPI task has 8 OpenMP threads, resulting in a total core (and thread) count of 256.

The following batch job script could be used


#!/bin/sh
#PBS -l mppwidth=32
#PBS -l mppmem=2000M
#PBS -l mppdepth=8
#PBS -l mppnppn=1
#PBS -r n

     
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
aprun -n 32 -N 1 -d 8 ./openmp_mpi.exe

The environment variable determining the thread number OPM_NUM_THREAD is set to the same value as options of qsub and aprun (tcsh example):

qsub openmp_mpi.job

if the script name is openmp_mpi.job. This job goes to XT5 nodes. See also Chapter OpenMP parallelization.

To disable OpenMP parallelization in an OpenMP or OpenMP/MPI job one only needs to set

setenv OMP_NUM_THREADS 1

Submit a job in a hold state and release it later for execution, but delete before it

louhi-login2 /wrk/user3> qsub -h job.sh
54687.sdb
louhi-login2 /wrk/user3> qstat -a 54687
sdb:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
54687.sdb       user3 parallel job.sh     --    1   1    --  04:00 H   --
louhi-login2 /wrk/user3> qrls 54687
louhi-login2 /wrk/user3> qstat -a 54687
sdb:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
54687.sdb       user3 large-xt job.sh --    1   1    --  04:00 Q   --
louhi-login2 /wrk/user3> qdel 54687
louhi-login2 /wrk/user3> qstat -a 54687
qstat: Unknown Job Id 54687.sdb

Interactive PBS session

louhi-login1 /lus/nid00131/user2> qsub -I -l mppwidth=4 -V
qsub: waiting for job 3217.nid00003 to start
qsub: job 3217.nid00003 ready

tcsh: using dumb terminal settings.
Directory: /home/u1/user2
Wed Feb 21 04:49:13 EET 2007
louhi-login3 ~> pwd
/home/u1/user2
louhi-login3 ~> cd $PBS_O_WORKDIR
Directory: /lus/nid00131/user2
louhi-login3 /lus/nid00131/user2> aprun -n 4 program1
(... program prints its output ...)
louhi-login3 /lus/nid00131/user2> exit
logout

qsub: job 3217.nid00003 completed
louhi-login1 /lus/nid00131/user2>

Here we started interactive PBS session and run the program program1 in quad-core mode using aprun command. We requested 4 cores (PEs) from aprun, that is 1 node (either XT or XT5).

When one of the resources is consumed, session stops giving information like this (here walltime exceeded 3600 seconds, which is the default upper limit now).

louhi-login1 ~>=>> PBS: job killed: walltime 3609 exceeded limit 3600
( ... it may take some time, before the ouput below is displayed ...)

qsub: job 3020.nid00003 completed
louhi-login4 ~>

It may be necessary to know the precise login node you are logged in after qsub command. If your prompt does not show the current login node, here are a few alternative ways how to find out where you are: Show the login node.