
Job launching command: aprun

This subsection describes the aprun utility of ALPS which launches the executable from a service node to compute nodes.

The aprun utility runs on a service node and loads applications onto the compute nodes. It reads the executable and obtains the compute node processors on which the executable will run.

NB! The aprun command works only in a Lustre-mounted directory. $WRKDIR and the directories under it are Lustre-mounted. Therefore, in both interactive sessions and batch job scripts, you must change the working directory to $WRKDIR or one of its subdirectories before giving the aprun command.
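A job script or an interactive session can guard against launching from the wrong directory. The following is a minimal sketch, not part of the Louhi tools: the function name in_wrkdir is hypothetical, and it only compares path prefixes against $WRKDIR instead of checking the file system type.

```shell
# Hypothetical helper: succeeds if the given path is $WRKDIR or lies below it.
in_wrkdir() {
    # Fail if WRKDIR is not set at all.
    [ -n "${WRKDIR:-}" ] || return 1
    # Append "/" so that /wrk/user2 does not match WRKDIR=/wrk/user.
    case "$1/" in
        "$WRKDIR"/*) return 0 ;;
        *)           return 1 ;;
    esac
}

if in_wrkdir "$PWD"; then
    echo "OK: $PWD is under \$WRKDIR, aprun can be used here"
else
    echo "Change to \$WRKDIR (or a directory under it) before calling aprun" >&2
fi
```

In a batch job script the same effect is usually achieved simply by a cd $WRKDIR line before the aprun command.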

Options of aprun

The following table lists the most important options for aprun. For a complete description of options see the manual page aprun(1).

Option Description
-n number_of_PEs The number of processing elements (PEs) needed for the application. The default is 1.
-m size The required Resident Set Size memory per PE in megabytes. The K, M, and G suffixes are supported (16M = 16 megabytes, for example). Any truncated or full spelling of unlimited is recognized.
-N PEs_per_node The number of PEs per node. The default is 4 on XT4 nodes and 8 on XT5 nodes.
-d depth The number of threads per PE. The default is 1. Compute nodes must have at least depth cores (currently 4 or 8 cores per node).
-L node_list The candidate nodes to which application placement is constrained. The syntax allows a comma-separated list of node IDs (node,node,...), a range of nodes (node_x-node_y), and a combination of both formats.

See the special options at the end of this page (the memory and CPU affinity options and the placement options), which are mainly for XT5 nodes.

Normally the qsub resource specification options are, and should be, the same as the aprun options. Give both the qsub and the aprun options explicitly, because the defaults can be unpredictable or simply unknown to you. The correspondence is shown in the following table:

aprun option qsub -l option Description
-n 128 -l mppwidth=128 Number of PEs (often the same as the number of cores)
-m 1000mb -l mppmem=1000mb Memory per PE
-N 8 -l mppnppn=8 Number of PEs per node (8 is for XT5 nodes and 4 for XT4 nodes with one OpenMP thread or without threads)
-d 4 -l mppdepth=4 Number of OpenMP threads; the environment variable OMP_NUM_THREADS should be set to the same value (4 is for XT4 nodes with -N 1, and 8 for XT5 nodes with -N 1)
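One way to keep the qsub and aprun values in step is to generate both from the same three numbers. The sketch below is only an illustration: make_job_script is a hypothetical helper, and the #PBS -l directive lines assume the usual PBS job script syntax. See Chapter Parallel batch jobs for the job scripts actually recommended on Louhi.

```shell
# Hypothetical helper: print a job script in which the qsub resource
# requests and the aprun options are derived from the same values.
#   $1 = number of PEs, $2 = PEs per node, $3 = threads per PE, $4 = program
make_job_script() {
    n=$1; N=$2; d=$3; prog=$4
    cat <<EOF
#!/bin/sh
#PBS -l mppwidth=$n
#PBS -l mppnppn=$N
#PBS -l mppdepth=$d
export OMP_NUM_THREADS=$d
cd \$WRKDIR
aprun -n $n -N $N -d $d $prog
EOF
}

# Example: 128 PEs, 8 PEs per node (XT5), one thread per PE.
make_job_script 128 8 1 ./program
```

The output could be redirected to a file and submitted with qsub; the memory options (-m and -l mppmem) are left out here to keep the sketch short.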

Please note that on Louhi the aprun default for -N is 4 on XT4 nodes and 8 on XT5 nodes. In other words, Louhi runs jobs in quad-core or eight-core mode by default, with one PE on each core. It is nevertheless better to set this option explicitly. If only the number of PEs, and possibly the memory, is given, the job goes to XT4 or XT5 nodes depending partly on the number of cores: smaller jobs seem to go to the XT5 queues and bigger jobs to the XT4 queues. This may change if the configuration changes. The most reliable way to get XT5 nodes is to specify the value 8 for both -N and -l mppnppn; at present there is no similar way to request XT4 nodes.

Using only a single core per node may be necessary for MPI and SHMEM processes that require more than 1 GB (on nodes with 1 GB/core, that is, 4 or 8 GB/node) or more than 2 GB (on nodes with 2 GB/core, that is, 8 or 16 GB/node) of memory. Each task then has access to all the memory of the node that remains after the memory used by CNL (about 250 MB) is subtracted, that is, about 3.75 GB, 7.75 GB or 15.75 GB. The single-core mode also gives each task (PE) full access to the system interconnection network, which may give your program the fastest run time. In this mode the other cores of each reserved node are allocated to your job but stay idle, so your quota is charged for the CPU time of all cores of the node. See the examples in Chapter Parallel batch jobs on how to run jobs that need more memory than normal. With qsub you can request at most 2000 MB (2 GB) of memory, so for larger requests you must cheat the PBS system: mppnppn x mppmem must be greater than or equal to -N x -m, and mppnppn must be greater than the -N value.
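The condition at the end of the paragraph above is plain integer arithmetic and can be checked before submitting. This is a sketch only: pbs_request_covers is a hypothetical name, all memory values are taken to be in megabytes, and the numbers in the example are illustrative rather than recommended settings.

```shell
# Hypothetical helper: succeeds if the PBS request is large enough for the
# intended aprun options, following the rule
#   mppnppn x mppmem >= N x m   and   mppnppn > N
#   $1 = aprun -N, $2 = aprun -m (MB), $3 = mppnppn, $4 = mppmem (MB)
pbs_request_covers() {
    aprun_N=$1; aprun_m=$2; mppnppn=$3; mppmem=$4
    [ $(( mppnppn * mppmem )) -ge $(( aprun_N * aprun_m )) ] &&
        [ "$mppnppn" -gt "$aprun_N" ]
}

# Illustrative values: one PE per node needing 3500 MB,
# requested from PBS as 2 PEs per node with 2000 MB each.
if pbs_request_covers 1 3500 2 2000; then
    echo "PBS request covers the aprun options"
else
    echo "PBS request is too small" >&2
fi
```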

Example for quad-core mode in XT4 and eight-core mode in XT5 nodes:

aprun -n 256 -N 4 program
aprun -n 256 -N 8 program

The corresponding qsub commands would be qsub -l mppwidth=256 -l mppnppn=4 (XT4) and qsub -l mppwidth=256 -l mppnppn=8 (XT5).

This loads the application onto all 4 cores (cores 0-3, XT4) or all 8 cores (cores 0-7, XT5) of all reserved processors, depending on the node type. Only 64 (XT4) or 32 (XT5) nodes (or processors) are needed, and four or eight MPI tasks run on each of them. Each task can get only one quarter (XT4) or one eighth (XT5) of the memory of the node.
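The node counts quoted above follow from dividing the number of PEs by the number of PEs per node and rounding up. A small sketch of that arithmetic (nodes_needed is a hypothetical helper name, not an ALPS command):

```shell
# Hypothetical helper: number of nodes reserved for n PEs at N PEs per node.
#   $1 = value of -n, $2 = value of -N
nodes_needed() {
    # Integer ceiling of n / N.
    echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_needed 256 4   # XT4, quad-core mode:  prints 64
nodes_needed 256 8   # XT5, eight-core mode: prints 32
nodes_needed 256 1   # single-core mode:     prints 256
```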

A job on Louhi uses either only XT4 nodes or only XT5 nodes. At present aprun cannot use all 8 cores of the XT5 nodes if both XT4 and XT5 nodes are used in the same MPI job. In that case, with a normal aprun command, only 4 cores are used (the cores of NUMA node 0) and the other 4 cores (the cores of NUMA node 1) are idle. For interactive jobs this can be solved by giving an MPMD (Multiple Program Multiple Data) type aprun command and specifying the XT4 and XT5 nodes with the -L option. See Chapter Open issues for more detailed information with examples.

MPI and SHMEM task distribution in dual-core mode (this will be updated to quad-core and eight-core mode when appropriate Cray manuals are available)

By default, MPI or SHMEM tasks are placed in packed rank-sequential order. For example, a 256-core, 128-node job in dual-core mode would be distributed as follows:

        Node 1   Node 2   Node 3   ...   Node 128
Core    0   1    0   1    0   1    ...   0     1
Rank    0   1    2   3    4   5    ...   254   255
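The packed rank-sequential order in the table amounts to an integer division and a remainder. The following sketch reproduces the table entries; rank_to_node_core is a hypothetical helper, and nodes are numbered from 1 only to match the table.

```shell
# Hypothetical helper: position of an MPI rank under packed placement.
#   $1 = rank, $2 = PEs per node (2 in the dual-core table above)
rank_to_node_core() {
    node=$(( $1 / $2 + 1 ))   # nodes counted from 1, as in the table
    core=$(( $1 % $2 ))       # cores counted from 0
    echo "node $node core $core"
}

rank_to_node_core 0 2     # prints: node 1 core 0
rank_to_node_core 5 2     # prints: node 3 core 1
rank_to_node_core 255 2   # prints: node 128 core 1
```

The same function with 4 or 8 as the second argument gives the corresponding placement for quad-core or eight-core mode, assuming the packed order is unchanged there.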

You can change the default MPI rank placement scheme by setting the environment variable MPICH_RANK_REORDER_METHOD to a suitable value (see the manual page intro_mpi(1) or mpi(1)). If this variable is not set or if its value is 1 (the default), the default launcher placement policy shown in the table above is used. To display the MPI rank placement and launching information, set PMI_DEBUG to 1.

Example for single-core mode in XT4 and XT5 nodes:

aprun -n 256 -N 1 program

The corresponding qsub command would be: qsub -l mppwidth=256 -l mppnppn=1.

This loads the application onto a single core, namely core 0, of all reserved processors. The cores 1-3 (XT4 node) or 1-7 (XT5 node) on these processors are idle, but every MPI task of the application can get the whole memory of the node.

MPI and SHMEM task distribution in single-core mode:

The rank 0 MPI task is loaded to the core 0 of the processor 1, the rank 1 MPI task is loaded to the core 0 of the processor 2, ..., and the rank 255 MPI task is loaded to the core 0 of the processor 256. All other cores of all processors are empty and idle.

Example for MPMD job:

The aprun command supports Multiple Program Multiple Data (MPMD) jobs. You can launch several executables in the same MPI_COMM_WORLD:
aprun -n 32 -N 4 ./a.out : -n 48 -N 4 ./b.out     

launches 32 MPI tasks of the program a.out and 48 MPI tasks of the program b.out, both in quad-core mode (-N 4 is the default on XT4 nodes, so it is not strictly necessary there). The MPI tasks of both programs can communicate with each other using the same communicators.

The command cnselect

The aprun utility supports manual and automatic node selection. For manual node selection, first use the cnselect command to get a list of compute nodes that meet the criteria you specify. You can also use this command just to list the compute nodes that meet the specified criteria. Without options (see man cnselect for options and usage) the command lists the node ID (NID) numbers of all compute nodes:

louhi-login2 ~> cnselect
24-95,148-223,256-351,384-479,512-607,640-735,768-863,896-991,1024-1119,1152-1247,1280-1375,
1408-1503,1536-1631,1664-1759,1792-1887,1920-2015,2048-2143,2176-2271

cnselect shows the NIDs grouped by cabinet. It shows only the compute node IDs.
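Because cnselect prints ranges rather than individual NIDs, it is not obvious how many nodes a list contains. The sketch below counts them; count_nids is a hypothetical helper that expects a single comma-separated string of NIDs and first-last ranges, in the form shown in the listing above.

```shell
# Hypothetical helper: count the nodes in a list such as "24-95,148-223".
count_nids() {
    total=0
    # Turn commas into spaces so that the shell splits the list into items.
    for item in $(printf '%s' "$1" | tr ',' ' '); do
        case $item in
            *-*)  # a range "first-last", both ends included
                first=${item%-*}
                last=${item#*-}
                total=$(( total + last - first + 1 ))
                ;;
            *)    # a single NID
                total=$(( total + 1 ))
                ;;
        esac
    done
    echo "$total"
}

count_nids "24-95,148-223"   # prints 148
count_nids "1280-1375"       # prints 96
```

On a login node the argument could be the output of a cnselect command, for example count_nids "$(cnselect)", provided that output arrives as one comma-separated string.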

You can list the nodes with 4 GB, 8 GB and 16 GB of memory in the following way:

louhi-login2 ~> cnselect availmem.eq.4000
24-95,148-223,256-351,512-607,1024-1119,1152-1247,1536-1631,1664-1759,2048-2143,2176-2271
louhi-login2 ~> cnselect availmem.eq.8000
384-479,640-735,768-863,896-991,1408-1503,1792-1887,1920-2015
louhi-login2 ~> cnselect availmem.eq.16000
1280-1375

XT4 and XT5 nodes can be specified according to memory (availmem: 4000 = 4 GB, 8000 = 8 GB and 16000 = 16 GB/node) and according to coremask (XT4 nodes: 15 or 0xf in hexadecimal; XT5 nodes: 255 or 0xff in hexadecimal):

XT4:

louhi-login8 ~> cnselect availmem.eq.4000.and.coremask.eq.0xf
24-95,148-223,256-351,512-607,1024-1119,1152-1247,1536-1631,1664-1759,2048-2143,2176-2271
louhi-login8 ~> cnselect availmem.eq.4000.and.coremask.eq.15
24-95,148-223,256-351,512-607,1024-1119,1152-1247,1536-1631,1664-1759,2048-2143,2176-2271
louhi-login8 ~> cnselect availmem.eq.8000.and.coremask.eq.0xf
640-735

XT5:

louhi-login8 ~> cnselect availmem.eq.8000.and.coremask.eq.0xff
384-479,768-863,896-991,1408-1503,1792-1887,1920-2015
louhi-login8 ~> cnselect availmem.eq.8000.and.coremask.eq.255
384-479,768-863,896-991,1408-1503,1792-1887,1920-2015
louhi-login8 ~> cnselect availmem.eq.16000.and.coremask.eq.0xff
1280-1375
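The coremask values in these selections appear to be bit masks with one bit set per core, which would explain why 15 (0xf) selects the 4-core XT4 nodes and 255 (0xff) the 8-core XT5 nodes. Under that assumption, the number of cores is the number of set bits; coremask_cores is a hypothetical helper illustrating this.

```shell
# Hypothetical helper: count the set bits of a coremask.
# Accepts decimal (15) or hexadecimal (0xf) notation.
coremask_cores() {
    mask=$(( $1 ))
    count=0
    while [ "$mask" -gt 0 ]; do
        count=$(( count + (mask & 1) ))   # add the lowest bit
        mask=$(( mask >> 1 ))             # move to the next bit
    done
    echo "$count"
}

coremask_cores 0xf    # XT4: prints 4
coremask_cores 255    # XT5: prints 8
```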

The 18 cabinets are physically on the floor of the computer hall in two rows, 9 cabinets in each row. The cabinet name, like c2-1, is the first part of the node name. The second number in the cabinet name is the row number (0 or 1) and the first number is the cabinet number (0-8) in that row.

Other useful commands for checking node information

You can check where each node is physically located in the cabinets of Louhi even if you know only the NID numbers of the nodes. First, the command xtprocadmin | less shows the NID numbers and the corresponding cabinet numbering. Here is one such line:

louhi-login3 ~> xtprocadmin | grep " 151 "
   151     0x97  c1-0c0s5n3  compute        up       batch      4    4

The first c indicates the cabinet (c1-0), the second c indicates the cage or chassis (c0), s indicates the slot number (s5), and n indicates the node (n3). You can then see where this node is from the graphical display of the command xtnodestat (which replaces the older xtshowcab and xtshowmesh). In this display the cabinet is indicated by the capital C. The other symbols c, n and s have the same meaning as above.
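The structure of a node name can be taken apart mechanically. The following sketch splits a name such as c1-0c0s5n3 into the fields described above; parse_cname is a hypothetical helper, and it assumes that every node name follows exactly this pattern.

```shell
# Hypothetical helper: split a node name of the form cX-YcAsBnC into
# cabinet number (X), row (Y), cage (A), slot (B) and node (C).
parse_cname() {
    printf '%s\n' "$1" | sed -n \
        's/^c\([0-9][0-9]*\)-\([0-9][0-9]*\)c\([0-9][0-9]*\)s\([0-9][0-9]*\)n\([0-9][0-9]*\)$/cabinet=\1 row=\2 cage=\3 slot=\4 node=\5/p'
}

parse_cname c1-0c0s5n3   # prints: cabinet=1 row=0 cage=0 slot=5 node=3
```

A name that does not match the pattern produces no output, which makes malformed input easy to detect.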

The command

xtnodestat

displays the current node (process) allocation and information on running jobs. It also shows the service nodes, the empty locations in the cabinets, and the nodes that are down and thus unavailable.

The output is a textual two-dimensional grid where each node is represented by its status or currently assigned job. See also Resource monitoring (system and user).

There are several commands for translation of the node names to NIDs and vice versa:

cnselect
xtprocadmin
xtnodestat
apstat
rca-helper

Many of these commands also show, when suitable options are used, which nodes are XT4 and which are XT5 (according to the number of cores and the amount of memory), which nodes are up or down, and so on. See the man pages of these commands.

The following forms, among others, may be useful:

xtprocadmin -A
xtprocadmin -a cores
xtprocadmin -a coremask
xtprocadmin -a availmem

The options -y, -x, -g, -s and -c may also be useful.

apstat -nv
apstat -an
apstat -anv
apstat -avvv

The last one shows how the PEs of jobs are placed on the cores of the nodes.

The simplest way to convert NIDs to node names and vice versa is the command rca-helper (see man rca-helper for examples), which replaced xtnidname in CLE 2.2. You can convert a node name to a NID (-a name) or a NID to a node name (-x nid). For example, for the node c0-0c0s6n1, or nid00025:

louhi-login2 > rca-helper -a c0-0c0s6n1
25
louhi-login2 > rca-helper -x 25
c0-0c0s6n1

This same information can be found from the /etc/hosts file of the login node.
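The nid00025 form mentioned above appears to be the NID padded with zeros to five digits. Assuming that convention holds for all nodes, the name can be produced with printf; nid_hostname is a hypothetical helper and expects a plain decimal NID without leading zeros.

```shell
# Hypothetical helper: format a decimal NID as a nidNNNNN name.
nid_hostname() {
    printf 'nid%05d\n' "$1"
}

nid_hostname 25     # prints nid00025
nid_hostname 1280   # prints nid01280
```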

The command also gives the node ID (-i) or "cname" (-I) of the node on which it is run:

louhi-login2 /wrk/ruusvuor> aprun -n 1 rca-helper -i
25
Application 954065 resources: utime 0, stime 0
louhi-login2 /wrk/ruusvuor> aprun -n 1 rca-helper -I
c0-0c0s6n1
Application 954068 resources: utime 0, stime 0

Memory and cpu affinity and placement options of aprun

The new memory and CPU affinity options of aprun are explained in Chapters 13.2. Memory Affinity Optimizations (Deferred implementation) and 13.3. CPU Affinity Optimizations (Deferred implementation). The manual page of aprun also contains this information.

Option Description

CPU (core) affinity:
-cc cpu_list | keyword
-cp cpu_placement_file_name

Memory affinity:
-S pes_per_numa_node
-sl list_of_numa_nodes
-sn numa_nodes_per_node
-ss

Application placement:
-sl and -sn above

See man aprun. Some examples:

Only one core in each of the two processors (NUMA domains) of all 4 nodes is used (the tasks are not bound to any specific core inside the NUMA node):

$ aprun -S 1 -n 8 -L 386-389 ./hello

All 8 cores in the node 386 are used (MPI tasks 0-3 are bound to cores 0-3 and tasks 4-7 to cores 4-7, but not to any specific core inside the NUMA node or the processor):

$ aprun -S 4 -n 8 -L 386 ./hello
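The node counts in the two -S examples above (4 nodes for -S 1, 1 node for -S 4) can be derived from the number of PEs per NUMA node. The sketch below assumes two NUMA nodes per XT5 node, as in the examples; nodes_for_S is a hypothetical helper, not an aprun feature.

```shell
# Hypothetical helper: nodes needed when -S limits the PEs per NUMA node.
#   $1 = value of -n, $2 = value of -S, $3 = NUMA nodes per compute node
nodes_for_S() {
    pes_per_node=$(( $2 * $3 ))
    # Integer ceiling of n / pes_per_node.
    echo $(( ($1 + pes_per_node - 1) / pes_per_node ))
}

nodes_for_S 8 1 2   # aprun -S 1 -n 8: prints 4 (nodes 386-389)
nodes_for_S 8 4 2   # aprun -S 4 -n 8: prints 1 (node 386)
```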

All 4 cores of NUMA node 0 of both nodes are used (the cores of NUMA node 1 of both nodes are unused):

$ aprun -sn 1 -n 8 -L 386-387 ./hello

All cores of both NUMA nodes in the node 386 (that is all cores of this node) are used. The node 387 is not used at all:

$ aprun -sl 0,1 -n 8 -L 386-387 ./hello

All 4 cores of NUMA node 0 of both nodes are used, and only the memory of NUMA node 0 is used in both nodes (NUMA node 1 of both nodes and its local memory are not used):

$ aprun -ss -sl 0 -n 8 -L 386-387 ./hello

The rank 0 MPI task is bound to core 0 of NUMA node 0 of the node 386, the rank 1 MPI task is bound to core 1 of NUMA node 0 of the node 386, and so on; the rank 4 MPI task is bound to core 4 of the node 386 (core 0 of NUMA node 1), and so on. The whole node 387 is unused:

$ aprun -cc cpu -n 8 -L 386-387 ./hello