Options of qsub
The following table shows the most important options of the qsub command:
| Option | Description |
|---|---|
| -I | Declares that the job is to be run "interactively". Default: Run in a batch job, if the option is not specified. |
| -j {oe,eo,n} | oe merges standard error to standard output; eo merges standard output to standard error. Default: n, which is also used if the option is not specified; standard output and standard error are then two separate files. |
| -o path | The name of the file for standard output. Default: job_name.osequence_number if the option is not specified. |
| -e path | The name of the file for standard error. Default: job_name.esequence_number if the option is not specified. |
| -l resource_list | Defines the resources that are required by the job and establishes a limit to the amount of resource that can be consumed. See table below. |
| -m {a,b,e,n} | Mail is sent: a when the job is aborted by PBS, b when the job begins execution, e when the job terminates. Default: n, which is also used if the option is not specified; no mail is sent. |
| -M user_list | List of users to whom mail about the job is sent. Give the mail address(es) here. Default: the job owner if unset, but that mail is not delivered anywhere. |
| -N job_name | The job name. Default: based on the name of the job script if the option is not specified, or STDIN if no script was given and the job was read from standard input. |
| -r {y,n} | Declares whether the job is rerunnable. Default: y. Please use -r n if you don't want your job to be rerun after a crash and reboot of the execution host. |
| -S path_list | Shell that interprets the job script. Do not use! It does not work properly. |
| -q destination | Destination of the job for special queues. destination names a queue, a server or a queue at a server. Mostly used with the test queue: -q test |
| -v variable_list | Environment variables exported to the job. variable_list is a comma separated list of the form variable or variable=value |
| -V | All environment variables in the qsub command's environment are to be exported to the batch job. |
Please do not request a queue with the option -q, because most XT4 and XT5 queues do not accept it. It can be used only with the test queue (queue name test) and other special queues. Requesting only the memory and the number of PEs (cores), or making no resource request at all, directs the job to a proper queue, either on XT4 nodes or on XT5 nodes but not on both, because it is not appropriate to use XT4 and XT5 nodes simultaneously in the same MPI job (in that case only 4 cores are used on the XT5 nodes).
The sequence_number in this table is the sequence number, also called the job number, assigned for the job when it is submitted. It is the first part of the job identifier (see below).
If you want PBS to send you mail, both options -m and -M must be given. Use your own mail address(es) with -M. Users' old CSC addresses no longer work.
Please note that jobs are rerun, that is, run again from the beginning, if the execution host first becomes unavailable and then becomes available again. This may waste CPU time, because everything is calculated again. If you don't want your job to be rerun, specify -r n.
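The mail and rerun options discussed above can be collected into the directive header of a job script. The following is only a minimal sketch: the function name pbs_header and the address user@example.org are placeholders invented for illustration, and the choice of -j oe and -m abe is an assumption about what a typical job might want.

```shell
# Sketch: print a PBS directive header that asks for mail and disables reruns.
# pbs_header is a hypothetical helper; the address is given by the caller.
pbs_header() {
    mail_address=$1
    printf '%s\n' '#!/bin/sh'
    printf '%s\n' '#PBS -j oe'              # merge standard error to standard output
    printf '%s\n' '#PBS -m abe'             # mail on abort, begin and end
    printf '%s\n' "#PBS -M $mail_address"   # -M is needed, otherwise no mail arrives
    printf '%s\n' '#PBS -r n'               # do not rerun after a crash of the host
}

pbs_header user@example.org
```

The printed lines would go at the top of a batch job script, before the first executable command.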
Don't use the option -S. It is not needed in batch job scripts, because the first line of the script is enough to set the working shell to something other than the user's login shell (/bin/tcsh is the default now). The first line is of the form:
#!/bin/sh
#PBS ....
There are two types of interactive jobs:
- Using the PBS system with the option -I of the qsub command and the ALPS command aprun. This can be done on the command line: qsub -I -l mppwidth=128, or by "submitting" a batch job script which contains, for example, the lines
#!/bin/sh
#PBS -I
#PBS -l mppwidth=128
If the requested resources are not yet available, you may have to wait a long time. However, when you get the resources, you are logged on to another login node, where you give the normal aprun commands. See an example in Chapter Parallel batch jobs.
- Using only the aprun command from the command line: aprun -n 32 ./myprogram . Certain interactive XT4 and XT5 nodes are reserved for that purpose. The command xtprocadmin | grep interactive | grep compute shows which nodes are available for purely interactive work. More information: Chapter Job launching command: aprun.
PBS resources
Resources are allocated to jobs both by explicitly requesting them and by applying specified defaults. For certain resources there are specified or physical lower (min), upper (max) or both limits. If you request less or more than these limits allow, your job is rejected. Jobs explicitly request resources either at the host level in chunks defined in a selection statement, or in job-wide resource requests. In Louhi the job-wide resource request method is used. Certain PBS Pro commands may show resource requests in chunks after the reservations.
The following table shows the most important PBS Pro version 9 (or higher) resources for the option -l (this is lowercase L) of qsub. See the manual page pbs_resources(7B) for all resources.
| Resource | Description |
|---|---|
| mppwidth=number_of_PEs | Number of processing elements (PEs). |
| mppdepth=depth | Depth of each processor (number of threads). Default is 1. Specifies the number of processors each processing element will use. |
| mppnppn=PEs_per_node | Number of processing elements (PEs) per node |
| walltime=time | Maximum amount of real time during which the job can be in the running state. |
| mppmem=size | The per processing element (PE) maximum Resident Set Size memory limit in megabytes. K, M and G suffixes are supported (16 = 16M = 16 megabytes). Any truncated or full spelling of unlimited is recognized. |
| file=size | The largest size of any single file that may be created by the job. |
Normally the qsub resource specification options are, and should be, the same as the aprun options. You should give both the qsub and the aprun options, because the defaults may sometimes be unpredictable, or at least unknown to you. The correspondence is shown in the table in Chapter Job launching command: aprun.
Resource units
time is expressed in seconds as an integer, or in the form:
[[hours:]minutes:]seconds[.milliseconds]
e.g., -l walltime=2:00:00 or -l walltime=7200
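To check that two walltime requests mean the same thing, the colon form can be converted to seconds. This is a small sketch, not part of PBS: walltime_to_seconds is a hypothetical helper, and it simply drops any milliseconds part.

```shell
# Convert [[hours:]minutes:]seconds[.milliseconds] to whole seconds.
# walltime_to_seconds is a hypothetical helper name; milliseconds are dropped.
walltime_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{
        s = 0
        # Each field to the left is worth 60 times the one to its right.
        for (i = 1; i <= NF; i++) s = s * 60 + int($i)
        printf "%d\n", s
    }'
}

walltime_to_seconds 2:00:00   # prints 7200
```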
size is expressed in the form integer[suffix]. The suffix is expressed in terms of bytes or words.
| Suffix | Meaning |
|---|---|
| b | bytes |
| kb or K | Kilo (1024) bytes |
| mb or M | Mega (1,048,576) bytes |
| gb or G | Giga (1,073,741,824) bytes |
e.g., -l mppmem=1000mb,file=50gb
The size parameter must be an integer and not a decimal number. You must say -l mppmem=1500mb and not, for example, "-l mppmem=1.465gb" (1500 MB/1024 = 1.465 GB).
The resource mppmem also supports K|M|G suffixes (see table above). Lower case symbols k|m|g are also accepted. The reservation above can also be written as:
-l mppmem=1000M or -l mppmem=1000m
Please note that the symbol for the size unit must follow immediately after the size number without a space. Otherwise qsub tries to interpret it as an option or as a file name.
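The rules for size values (an integer, followed immediately by an optional suffix) can be checked before a job is submitted. The sketch below assumes that only the byte suffixes listed in this section are wanted; is_valid_size is a hypothetical helper, not a PBS command, and it does not cover word suffixes.

```shell
# Return success if the argument looks like integer[suffix] with no space
# and no decimal point. Only the byte suffixes listed above are accepted.
# is_valid_size is a hypothetical helper name.
is_valid_size() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+(b|kb|mb|gb|K|M|G|k|m|g)?$'
}

for size in 1500mb 1000M 1.465gb "1000 mb"; do
    if is_valid_size "$size"; then
        echo "ok:       $size"
    else
        echo "rejected: $size"
    fi
done
```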
Available memory
There are XT4 and XT5 nodes with either 1 GB or 2 GB of memory per core. Thus there are 4 GB and 8 GB XT4 nodes, and 8 GB and 16 GB XT5 nodes. See the table in chapter Cray XT system.
CNL uses approximately 250 MB of memory. The remaining memory is available for the user program executables; user data arrays; the stacks, libraries and buffers; and SHMEM symmetric stack heap. The default stack size is 16 MB. The memory used for the MPI libraries is approximately 72 MB.
Using only a single core per node may be necessary for MPI and SHMEM processes requiring more than 1 GB or 2 GB of memory. Each task then has access to all the remaining memory of the node (about 3.75, 7.75 or 15.75 GB), that is, what is left when the memory used by CNL (about 250 MB) is subtracted. The single-core mode also gives each task (PE) full access to the system interconnection network. This may give your program the fastest run time. In this mode the other cores of each reserved node are allocated to your job but stay idle. Therefore, you will be charged on your quota for the CPU time of all reserved cores.
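The figures for single-core mode follow from subtracting the approximate CNL usage from the node memory. The sketch below redoes that arithmetic, assuming 1 GB = 1024 MB and the rough 250 MB figure quoted above; usable_mb is a hypothetical helper name.

```shell
# usable_mb NODE_GB: node memory in MB minus the roughly 250 MB used by CNL.
# usable_mb is a hypothetical helper; 250 MB is the approximate figure above.
usable_mb() {
    echo $(( $1 * 1024 - 250 ))
}

for node_gb in 4 8 16; do
    echo "$node_gb GB node: about $(usable_mb "$node_gb") MB left for one task"
done
```

For a 4 GB node this gives 3846 MB, which is the "about 3.75 GB" mentioned in the text.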
With qsub you can request at most 2000 MB (or 2 GB) of memory per PE. So you should request 1000mb (which is the default) or 2000mb per PE. See the examples in Chapter Parallel batch jobs on how to run jobs that need more memory than normal. Then you must cheat the PBS system. In that case mppnppn x mppmem must be greater than or equal to -N x -m (the values given to aprun), and mppnppn must be greater than the -N value.
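The two conditions for large-memory jobs can be tested with a few lines of shell before submitting. This is a sketch under assumptions: request_covers is a hypothetical helper, all memory values are plain integers in MB, and -N and -m are taken to be the aprun values for PEs per node and memory per PE.

```shell
# request_covers MPPNPPN MPPMEM APRUN_N APRUN_M  (memory values in MB)
# Succeeds when mppnppn x mppmem >= N x m and mppnppn > N.
# request_covers is a hypothetical helper name.
request_covers() {
    mppnppn=$1 mppmem=$2 aprun_N=$3 aprun_m=$4
    [ $(( mppnppn * mppmem )) -ge $(( aprun_N * aprun_m )) ] &&
        [ "$mppnppn" -gt "$aprun_N" ]
}

# Example: reserve 4 x 1000 MB per node, run 1 PE per node with 3500 MB.
if request_covers 4 1000 1 3500; then
    echo "the qsub request covers the aprun request"
fi
```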
PBS batch queues
The PBS batch queuing system is still being configured and tested, because no one really knows yet how it behaves under different circumstances, such as different kinds of loads and different sized jobs. Only CSC has the mixed XT4 and XT5 environment. Therefore the queues and their properties may change without notice until a stable and load-balanced configuration is found.
You can find out what queues are available and what their status is:
qstat -Q
The next command shows the properties and resources of all queues, such as the min, max or default walltime, mppwidth and mppnppn (see above):
qstat -Q -f
This command shows the properties of the specified queue:
qstat -Q -f queue_name
Some of the properties are defined for all queues. They are valid unless queue specifications or your requests (if allowed) change them. You can display these properties with the command:
qstat -B -f
See Chapter Monitoring and displaying job and queue status.
Even though the queue name is not specified (except for test queues) when jobs are sent to execution, the queuing system selects free nodes either from XT4 nodes or from XT5 nodes but not from both. There are mainly two reasons for this:
- XT5 nodes are about 15 % slower in MPI jobs when all 8 cores are used, compared to XT4 nodes using all 4 cores. If both types of nodes are used in the same MPI job, there may be some imbalance in the performance of the different types of nodes. It depends, however, on the application.
- The more serious one is that aprun is not at present able to use all 8 cores of XT5 nodes if both XT4 and XT5 nodes are used in the same MPI job. In that case only 4 cores are used (the cores on NUMA node 0) and the other 4 cores are idle (the cores of NUMA node 1) if the normal aprun command is used. In the case of interactive jobs this can be solved by giving an MPMD (Multiple Program Multiple Data) type aprun command. See Chapter Open issues for more detailed information with examples.
Example of resource requests
Usage policy and general queue arrangements are or will be described in section Usage policy and obtaining a user id. These may change, and you can find the current configuration of the queuing system as described in Section Monitoring and displaying job status.
At present, it is not obligatory to request resources other than mppwidth for qsub if the default walltime of one hour (1:00:00 or 3600 s) and 1 GB of memory per core are enough for you. You should, however, estimate as accurately as possible the run time (wall time) needed by the whole job and the memory needed by each PE. In the next example 256 PEs, a walltime of 4 h, 1 GB of memory per PE and 4 PEs/node are requested:
qsub -l mppwidth=256 -l walltime=4:00:00 -l mppmem=1000M -l mppnppn=4 ...
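A request like this one implies a certain number of nodes: the number of PEs divided by the PEs per node, rounded up. The sketch below does that calculation; nodes_needed is a hypothetical helper, and the rounding up is an assumption about how a partly filled last node is counted.

```shell
# nodes_needed MPPWIDTH MPPNPPN: PEs divided by PEs per node, rounded up.
# nodes_needed is a hypothetical helper name.
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_needed 256 4   # prints 64
```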
Environment variables and working directory
The values of certain variables are taken from the environment of the qsub command and assigned to new names, formed by prefixing the current name with the string PBS_O_, e.g., HOME -> PBS_O_HOME, which the job can then use.
In addition to these variables, certain other variables will be available to the batch job (these are defined by the PBS system). Here are two of them:
- PBS_O_WORKDIR: This is the absolute path of the directory where you launched the qsub command.
- PBS_JOBID: This is the job identifier assigned to the job by the batch system.
The initial working directory for a job is always the user's home directory on the execution machine (a login node). Because the home directory is not visible on compute nodes, you must submit the batch job script with qsub from a Lustre-mounted directory, such as $WRKDIR or its subdirectories. In addition, before the aprun command in your batch job script you must change the current working directory to one under the Lustre file system. In practice, if you have launched your batch job from $WRKDIR, you can use cd $PBS_O_WORKDIR instead of cd $WRKDIR. Output files also go to the current working directory, except that the standard error and standard output files always go to the directory where the qsub command was given. If you use a non-Lustre directory (such as your home directory) for I/O, your code's performance will suffer.
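The directory change described above might look as follows in a job script. This is a sketch only: make_job_script is a hypothetical helper that prints such a script, and the job name my_job, the 32 PEs and the one hour walltime are placeholders chosen for illustration.

```shell
# Print a minimal batch job script on standard output.
# make_job_script, my_job and ./myprogram are placeholder names.
make_job_script() {
    cat <<'EOF'
#!/bin/sh
#PBS -N my_job
#PBS -l mppwidth=32
#PBS -l walltime=1:00:00
# Leave the home directory: go to the Lustre directory where qsub was given.
cd "$PBS_O_WORKDIR"
aprun -n 32 ./myprogram
EOF
}

# Usage, from a directory under $WRKDIR:
#   make_job_script > job.sh && qsub job.sh
make_job_script
```

The quoted 'EOF' keeps $PBS_O_WORKDIR unexpanded, so the variable is evaluated only when the job runs.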
For more information about variables see the manual page qsub(1B).
Batch job scripts
A job script may consist of PBS directives, comments and executable statements (an important one being the aprun command). A PBS directive is a line starting with #PBS, and provides a way of specifying job attributes in addition to the command line options. For example:
:
#PBS -N Job_name
#PBS -l walltime=10:30,mem=320kb
#PBS -m be
#
step1 arg1 arg2
step2 arg3 arg4
The qsub command scans the script file for directives. An initial line in the script that begins with the characters "#!" or the character ":" will be ignored. Scanning will continue until the first executable line, i.e., a line that is not blank, not a directive line, nor a line whose first non-white-space character is "#". If directives occur on subsequent lines, they will be ignored.
To be more precise, a line in the script file is processed as a directive to qsub if the string of characters starting with the first non-white-space character matches the directive prefix (by default #PBS).
The remainder of the directive line consists of the options to qsub in the same syntax as on the command line. The option character is to be preceded with the "-" character. A table of the directive options is given above.
If an option is present in both a directive and on the command line, the command line takes precedence. The option and its argument in the directive will be ignored.
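The scanning rules can be imitated in a few lines of shell, which is handy for checking which directives in a script would actually be seen. This is only an approximation written from the description above, not the code qsub uses; scan_directives is a hypothetical helper and it assumes the default directive prefix #PBS.

```shell
# Print the directive lines that the scanning rules above would accept.
# Reads a job script on standard input. scan_directives is a hypothetical name.
scan_directives() {
    first=yes
    while IFS= read -r line || [ -n "$line" ]; do
        # Drop leading white space before looking at the line.
        trimmed=${line#"${line%%[![:space:]]*}"}
        if [ "$first" = yes ]; then
            first=no
            case $trimmed in
                '#!'*|':'*) continue ;;   # an initial "#!" or ":" line is ignored
            esac
        fi
        case $trimmed in
            '#PBS'*) printf '%s\n' "$trimmed" ;;   # a directive
            ''|'#'*) ;;                            # blank line or comment: keep scanning
            *) break ;;                            # first executable line ends the scan
        esac
    done
}

printf '%s\n' ':' '#PBS -N Job_name' '#PBS -m be' '#' 'step1 arg1 arg2' \
    '#PBS -l walltime=10:30' | scan_directives
```

The last line of the example input comes after an executable line, so it is not printed.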
Usage of qsub
The resources are specified by including them in the -l (lower case L) option argument on the qsub or qalter command or in the PBS Pro job script. They may be given as a comma separated list after one -l, or each may be preceded by its own -l and separated by spaces:
qsub -l mppwidth=64,walltime=1:30:00 job_script
qsub -l mppwidth=64 -l walltime=1:30:00 job_script
or in a qsub job_script as a directive:
#PBS -l mppwidth=64,walltime=1:30:00
or separate directives:
#PBS -l mppwidth=64
#PBS -l walltime=1:30:00
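The comma separated form and the separate directives carry the same information, so one can be turned into the other mechanically. A small sketch follows; resources_to_directives is a hypothetical helper name.

```shell
# Turn a comma separated resource list into one "#PBS -l" directive per resource.
# resources_to_directives is a hypothetical helper name.
resources_to_directives() {
    printf '%s\n' "$1" | tr ',' '\n' | sed 's/^/#PBS -l /'
}

resources_to_directives mppwidth=64,walltime=1:30:00
```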
The simplest way to use the qsub command is as follows:
louhi ~> qsub job.sh
2729.nid00003
The job script here is job.sh which contains the job attributes (directives), shell commands and the required aprun command(s). The usage and options of aprun commands are described in Chapter Job launching command: aprun.
PBS assigns a unique job identifier to the job when it is submitted. Here it is 2729.nid00003. The first part of this is the sequence number (the job number) 2729, and the second part is one of the internal alias names of the PBS Server (one of the service nodes). This name is formed from nid and a five-digit number whose rightmost digits are taken from the Node ID (NID) of the server and whose remaining leftmost digits are filled with zeros (0).
You can use the sequence number in other PBS commands to identify the job you want to manage (qdel, qalter, qrls and others). If there is more than one PBS server in the future (now there is only nid00003), you must use the whole job identifier.
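In a script, the two parts of a job identifier can be separated with ordinary shell parameter expansion, and the server alias can be rebuilt from a node ID with printf. The helper names below are invented for this sketch, and node ID 3 is only inferred from the name nid00003.

```shell
# Split a job identifier such as 2729.nid00003 into its two parts.
# job_sequence, job_server and server_alias are hypothetical helper names.
job_sequence() { echo "${1%%.*}"; }   # everything before the first dot
job_server()   { echo "${1#*.}"; }    # everything after the first dot

# Build the alias from a decimal node ID: nid plus five zero-padded digits.
server_alias() { printf 'nid%05d\n' "$1"; }

jobid=2729.nid00003
echo "sequence number: $(job_sequence "$jobid")"
echo "server:          $(job_server "$jobid")"
echo "alias for NID 3: $(server_alias 3)"
```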
More examples of the usage of the command qsub are given in Chapter Parallel batch jobs. Also the behaviour of the job script execution is described there (where it runs, for example).