An introduction to running batch and interactive jobs on Louhi is given in the section Running Programs. Example batch script files can be found in the subsection Parallel batch jobs.
PBS (Portable Batch System)
PBS Pro is a networked subsystem for submitting, monitoring and controlling a workload of batch jobs. PBS Pro is provided by Altair Engineering (http://www.altair.com).
Batch job
Batch means that the job will be scheduled or submitted (qsub) for execution at a time chosen by the subsystem according to a defined policy and the availability of resources. In a real batch job the standard output and standard error of the job are returned in files to the user when the job is complete. There is also an interactive batch mode (qsub -I ...) where the input and output are connected to the user's terminal, but the job is still under control of PBS.
A job is normally a batch shell script (batch job script). In addition to normal script contents, it contains control information and resource reservations (called attributes, options or flags) for the job and one or more launch commands (aprun) for the computing part (application, executable program) of the job. Options for the job may also be specified on the command line when the job is submitted.
Examples of batch job scripts can be found in the subsection Parallel batch jobs. Batch job scripts are written on a login node, where batch jobs are also submitted, monitored, managed and deleted, and where aprun itself runs. However, the executable program (its MPI tasks) launched by aprun runs on the compute (CNL) nodes.
Submitting, monitoring, deleting and managing a batch job
A batch job is submitted on a login node by using the command qsub. Its syntax is as follows:
qsub [option ...] [script]
The options for a PBS batch job can be given on the command line as shown here, but most often they are written in the batch job script. The most important option is -l mppwidth=pes (#PBS -l mppwidth=pes in the batch job script), where pes is the number of processing elements (PEs) needed for an application. A PE is an instance of an ALPS-launched executable, such as an MPI task of an MPI job or a single instance of a sequential executable. For a pure MPI job, pes is the same as the number of processor cores, e.g., 128 cores:
qsub -l mppwidth=128 job.sh
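As a minimal sketch of the script form of the same option, the snippet below writes a small batch job script containing the #PBS directive and submits it only if qsub is available. The file name job.sh and the executable ./progname are placeholders, and a real script normally carries further options (see the subsection Parallel batch jobs).

```shell
#!/bin/sh
# Write a minimal batch job script; ./progname is a placeholder executable.
cat > job.sh <<'EOF'
#!/bin/sh
#PBS -l mppwidth=128
# Launch 128 PEs; the value matches mppwidth above.
aprun -n 128 ./progname
EOF

# Submit only where PBS is installed (for example on a login node).
if command -v qsub >/dev/null 2>&1; then
    qsub job.sh
else
    echo "qsub not found: job.sh written but not submitted"
fi
```

If the reservation is instead given at submission time, as in qsub -l mppwidth=128 job.sh, the #PBS line can be left out of the script.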
The number of PEs specified for the option -l mppwidth of the PBS Pro V. 9.2 submission command qsub and for the option -n of the ALPS executable launching command aprun is normally the same, both in single-core and other modes. See below.
Louhi has quad-core processors: there are 4 cores per node in the XT4 nodes and 8 cores per node in the XT5 nodes. At present, all cores (4 or 8) of a node are used by default, one PE per core, if only XT4 or only XT5 nodes are used in a job. Therefore there are separate queues for XT4 and XT5 nodes (use the commands qstat -Q, qstat -Q -f and qstat -B -f to see the available queues and their properties). So in this example the number of nodes needed to run the job is a quarter or one eighth of the number of cores needed, that is, 32 or 16 nodes.
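Because the two values normally agree, a quick check before submission can catch a mistyped number. The following is only a sketch: check_pes is a hypothetical helper, not part of PBS Pro or ALPS, it looks only at the first mppwidth directive and the first aprun line, and a deliberate difference between the two values would also be reported as a mismatch.

```shell
#!/bin/sh
# check_pes SCRIPT: compare "#PBS -l mppwidth=N" with "aprun -n M" in SCRIPT.
# Hypothetical helper; it reads only the first occurrence of each.
check_pes() {
    width=$(sed -n 's/^#PBS[[:space:]]*-l[[:space:]]*mppwidth=\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
    pes=$(sed -n 's/^[[:space:]]*aprun[[:space:]].*-n[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
    if [ -n "$width" ] && [ "$width" = "$pes" ]; then
        echo "ok: mppwidth=$width matches aprun -n $pes"
    else
        echo "mismatch: mppwidth=${width:-?} aprun -n ${pes:-?}"
        return 1
    fi
}

# Try it on a two-line sample script (./progname is a placeholder).
printf '#PBS -l mppwidth=128\naprun -n 128 ./progname\n' > sample_job.sh
check_pes sample_job.sh
```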
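The node arithmetic above can be sketched as a small shell function. nodes_needed is a hypothetical helper name, and rounding up for PE counts that do not fill the last node is an assumption that the text does not state.

```shell
#!/bin/sh
# nodes_needed PES CORES_PER_NODE: nodes required when each core runs one PE.
# Rounds up, assuming a partly used node is still allocated as a whole.
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_needed 128 4   # XT4 nodes, 4 cores each
nodes_needed 128 8   # XT5 nodes, 8 cores each
```

For 128 PEs this prints 32 and 16, the node counts given in the example.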
The fundamental concepts of how PBS works and allocates resources are explained on the page Further information on resource allocation.
The status of batch jobs, queues and the server can be monitored with the qstat command:
qstat [option ...] [job_identifier ... | destination ... | server_name ...]
You can also delete a job before its normal end by the qdel command:
qdel [-W delay|force] job_identifier ...
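A short sketch of how the two commands might be used on one job follows. The job identifier 12345.sdb is made up for illustration (qsub prints the real identifier when a job is submitted), and the run wrapper is a hypothetical helper that only prints a command when the program is not installed, so the snippet can be tried on a machine without PBS.

```shell
#!/bin/sh
# run CMD...: execute CMD if its program exists, otherwise just print it.
run() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "would run: $*"
    fi
}

jobid="12345.sdb"     # hypothetical job identifier, as printed by qsub
run qstat "$jobid"    # show the status of this job
run qdel "$jobid"     # delete the job before its normal end
```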
More details about these and several other PBS commands to manage your jobs are given in the subsection Commands.
ALPS (Application Level Placement Scheduler)
ALPS is the Cray-supported mechanism for placing and launching applications on Cray XT series or Cray X2 compute nodes. ALPS provides application placement, launch, and management functionality and cooperates closely with the PBS Pro batch system on Louhi for application scheduling across the compute nodes. PBS Pro makes policy and scheduling decisions, while ALPS provides the mechanism to place and launch the applications contained within batch jobs. ALPS also supports interactive application placement and launch. See the intro_alps(1) manual page.
Computing job launcher aprun
Users launch applications for execution on compute nodes using the aprun utility of ALPS. It is similar to mpirun in certain Unix/Linux implementations of MPI and to poe on AIX.
The aprun utility executes on a login node to load compute node applications. The aprun command can be given in an interactive session (real or batch) or in a batch job script:
aprun [option ... ] progname [progarg ...]
The most important option is -n pes, which specifies the number of PEs. For a pure MPI executable this is equivalent to the number of processor cores needed, as explained above for the corresponding option of qsub. For example,
aprun -n 128 progname
runs the program on 128 cores. More detailed descriptions and examples of the aprun command and its options are given in the subsections Commands and Parallel batch jobs.
apstat
Displays information about pending and launched applications, node availability, and resource reservations.
apkill
Sends a signal to a specified launched application.