Shared memory parallelization

A short introduction to OpenMP and MPI/OpenMP hybrid programming on Vuori.

Each node on Vuori contains two hexa-core processors, so shared memory parallel (OpenMP) programs can run efficiently within a node using at most twelve threads.

Compiling for OpenMP

All the compilers on Vuori (PGI, PathScale, and GCC from version 4.2 onwards) support OpenMP. Use the following compiler flags to enable OpenMP:
Compiler     Flag
PGI          -mp=nonuma
PathScale    -mp
GCC          -fopenmp
Intel        -openmp

Here are examples for OpenMP and mixed OpenMP/MPI using the PGI compilers:

pgf90 -mp=nonuma -fast -o my_openmp.exe my_openmp.f90
mpif90 -mp=nonuma -fast -o my_hybrid.exe my_hybrid.f90

See the OpenMP web pages for more information, including standards and tutorials.


Running OpenMP programs

The number of OpenMP threads is specified with an environment variable OMP_NUM_THREADS.

Running a shared memory program typically requires requesting a whole node. A twelve-thread OpenMP job can thus be run interactively as follows:

setenv OMP_NUM_THREADS 12
setenv MP_BIND yes
salloc --nodes=1 --ntasks=1 --cpus-per-task=12 --mem-per-cpu=1000 -t 01:00:00
srun ./my_openmp.exe
exit

The program my_openmp.exe in this example is compiled with a PGI compiler, which is why the PGI-specific variable MP_BIND is set. The corresponding batch queue script would be:

#!/bin/tcsh
#SBATCH -J my_openmp
#SBATCH -e my_output_err_%j
#SBATCH -o my_output_%j
#SBATCH --mem-per-cpu=1000
#SBATCH -t 01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12

# One OpenMP thread for each of the twelve reserved cores
setenv OMP_NUM_THREADS 12
# Bind threads to cores (PGI-compiled programs)
setenv MP_BIND yes
srun ./my_openmp.exe

Hybrid parallelization

In many cases it is beneficial to combine MPI and OpenMP parallelization: MPI handles the communication between nodes, and OpenMP is used for communication within each node.

For example, consider an eight-node job in which there is one MPI task per node and each MPI task has twelve OpenMP threads, resulting in a total core (and thread) count of 96.

Running a hybrid job can be done interactively as above, except that more nodes are specified and one MPI task is requested for each node. Because the job uses more than one node, the parallel partition must be requested. That is, for an 8 x 12 job the following commands are used:

setenv OMP_NUM_THREADS 12
setenv MP_BIND yes
salloc -p parallel --nodes=8 --ntasks=8 --cpus-per-task=12 --mem-per-cpu=1000 -t 02:00:00

srun ./my_hybrid.exe
exit

The program my_hybrid.exe in this example is also compiled with a PGI compiler (using one of the mpi* commands). The corresponding batch queue script would be:

#!/bin/tcsh
#SBATCH -J my_hybrid
#SBATCH -e my_output_err_%j
#SBATCH -o my_output_%j
#SBATCH --mem-per-cpu=1000
#SBATCH -t 02:00:00
#SBATCH --nodes=8
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=12
#SBATCH -p parallel

# Twelve OpenMP threads for each of the eight MPI tasks (one task per node)
setenv OMP_NUM_THREADS 12
# Bind threads to cores (PGI-compiled programs)
setenv MP_BIND yes
srun ./my_hybrid.exe


Binding threads to cores

The compilers on Vuori support thread/core affinity which binds threads to cores for better performance. This is enabled with compiler-specific environment variables as follows:

PGI

setenv MP_BIND yes

MP_BIND must be set to yes; otherwise all threads on a node run on a single core. The default value of MP_BIND is no.

PathScale

setenv PSC_OMP_AFFINITY TRUE
setenv PSC_OMP_AFFINITY_GLOBAL TRUE

The default value of PSC_OMP_AFFINITY is TRUE, so it does not need to be set explicitly.

GCC

setenv GOMP_CPU_AFFINITY "0-11"

This variable must be set; otherwise all threads on a node run on a single core.