We consider the following topics in performance optimization: single-node, parallel (or MPI) and I/O optimization.
The aim of single-node or scalar optimization is to make each MPI task run as fast as possible. As computing should dominate the total run time, this is the most important aspect even for parallel codes.
Parallel or MPI optimization deals with reducing the message passing overhead and improving load balancing, among other things.
Finally, choosing the best strategies for writing and reading files is considered in I/O optimization.
Single-node Optimization
The key to single-CPU performance is vectorization, which means exploiting the SIMD floating-point registers and the SSE/SSE2 instructions available on the Opteron processor. With the PGI compilers, vectorization is enabled by the option
-fast
For a more detailed discussion on PGI compiler options, see General usage of PGI compilers.
For GNU and PathScale compiler optimization flags, see GNU Compiler Collection and PathScale Compiler Suite.
With the PGI flags
-Minfo
-Mneginfo
the compiler reports which optimizations it performed and why others were not possible, respectively. Check these messages to make sure that the most time-consuming loops have actually been optimized by the compiler.
Loops that contain
- if statements
- subroutine calls
- indirect addressing
- non-unit stride addressing
generally do not vectorize, so avoid these constructs in performance-critical loops.
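As a minimal sketch of removing one of these obstacles, the C fragment below computes the column sums of a row-major matrix in two ways. The function names are hypothetical and are not taken from the PGI or Cray documentation. The first version reads the matrix with a stride of cols elements in its inner loop; the second interchanges the loops so that the inner loop uses unit stride and contains no branches or calls.

```c
#include <assert.h>
#include <stddef.h>

/* Inner loop walks down one column of a row-major matrix:
   consecutive accesses are cols elements apart (non-unit stride). */
void column_sums_strided(const double *m, size_t rows, size_t cols,
                         double *out)
{
    assert(m != NULL && out != NULL);
    for (size_t j = 0; j < cols; j++) {
        double s = 0.0;
        for (size_t i = 0; i < rows; i++)
            s += m[i * cols + j];
        out[j] = s;
    }
}

/* Loops interchanged: the inner loop now reads m and updates out with
   unit stride. restrict promises the compiler that m and out do not
   overlap. */
void column_sums_unit_stride(const double *restrict m, size_t rows,
                             size_t cols, double *restrict out)
{
    assert(m != NULL && out != NULL);
    for (size_t j = 0; j < cols; j++)
        out[j] = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            out[j] += m[i * cols + j];
}
```

Both versions produce the same sums. Whether the second one is actually vectorized depends on the compiler and its flags, so the -Minfo and -Mneginfo messages remain the way to confirm it.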
Data dependencies can also hinder vectorization. These include
- data overwritten (store before load)
- result overwritten (dependency between two stores)
- result not ready (load before store)
- recurrence
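To illustrate the last item, the following C sketch (function names are hypothetical) contrasts a loop with a recurrence against a loop whose iterations are independent. In the first loop each iteration needs the result of the previous one, so the iterations cannot simply be executed side by side in SIMD lanes. In the second loop no iteration depends on another, which makes it a typical candidate for vectorization.

```c
#include <assert.h>
#include <stddef.h>

/* Recurrence: iteration i reads a[i - 1], which was written by
   iteration i - 1. */
void running_sum(double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* Independent iterations: a[i] depends only on b[i] and c[i].
   restrict promises the compiler that the three arrays do not overlap,
   which rules out hidden dependencies through aliasing. */
void elementwise_sum(double *restrict a, const double *restrict b,
                     const double *restrict c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

Without restrict, a C compiler may have to assume that a overlaps b or c and treat the second loop as if it carried a dependency as well.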
For more information, see the PGI User's Guide and Chapter 8 of Workload Management and Application Placement for the Cray Linux Environment, both available at http://docs.cray.com.
MPI optimization
- Always strive to post receives before sends. This enables MPI to store received data directly in the receive buffer, instead of buffering it.
- The Cray XT series computers are designed to perform well when messages are long. With short messages, latency makes up a substantial part of the total transfer time. Instead of sending many short messages, aggregate them into larger ones.
- Do not aggregate too much. MPI switches from a short (eager) protocol to a long-message protocol that uses a receiver-pull method once the message is larger than the eager limit. This limit is 128000 bytes by default, but it can be changed with the MPICH_MAX_SHORT_MSG_SIZE environment variable. The optimal message size is usually slightly below the eager limit.
- Avoid using MPI_Iprobe and MPI_Probe. They eliminate many of the advantages of the Portals network stack.
- Overlap computing with communication. For this, use non-blocking routines and check completion after computation.
- Compute instead of communicating. Extra computing can sometimes be used to reduce communication needs.
- Good load balance is imperative for scaling.
- Do not synchronize unless it is absolutely necessary.
- Do not use MPI-2 RMA (one-sided communication). It's designed for functionality and not performance on the Cray XT architecture.
- When using derived datatypes one should note that sending non-contiguous data is not as efficient as sending contiguous data. When sending non-contiguous data there is additional overhead from MPI first packing the data on the sender side, and thereafter unpacking it on the receiver side.
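The trade-off between aggregating messages and staying below the eager limit can be explored with a rough cost model. The C sketch below is an illustration only: the helper names are hypothetical, the model assumes a fixed latency per message plus a fixed bandwidth, and it ignores the extra cost of the long-message protocol. The only value taken from the text is the default eager limit of 128000 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Default eager limit quoted in the text, in bytes. On a real system
   it follows MPICH_MAX_SHORT_MSG_SIZE. */
#define EAGER_LIMIT_BYTES 128000

/* Number of messages needed to move total_bytes in messages of at
   most msg_bytes each. */
size_t messages_needed(size_t total_bytes, size_t msg_bytes)
{
    assert(msg_bytes > 0);
    return (total_bytes + msg_bytes - 1) / msg_bytes;
}

/* Assumed model: every message pays latency_us once, and the payload
   moves at bytes_per_us. Both parameters are placeholders that would
   have to be measured on the target machine. */
double transfer_time_us(size_t total_bytes, size_t msg_bytes,
                        double latency_us, double bytes_per_us)
{
    return (double)messages_needed(total_bytes, msg_bytes) * latency_us
           + (double)total_bytes / bytes_per_us;
}

/* Largest message size that is a whole number of elements and stays
   strictly below the eager limit. */
size_t chunk_below_eager_limit(size_t elem_bytes)
{
    assert(elem_bytes > 0 && elem_bytes < EAGER_LIMIT_BYTES);
    return ((EAGER_LIMIT_BYTES - 1) / elem_bytes) * elem_bytes;
}
```

With made-up values of 5 microseconds latency and 1000 bytes per microsecond, sending 1000000 bytes as 1000-byte messages costs 6000 microseconds in this model, while sending it as eight 125000-byte messages costs 1040 microseconds. The actual numbers on a Cray XT will differ; the point is only that the per-message latency dominates when messages are short.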
Rank placement
In some cases, changing how the processes are laid out on the machine may affect performance. The default is currently SMP-style placement. This means that for a multi-core node, sequential MPI ranks are placed on the same node.
For example, an 8-process job launched on an XT5 node with 2 quad-core processors would be placed as:
PROCESSOR   0         1
RANK        0,1,2,3   4,5,6,7
The default can be changed using the following environment variable:
MPICH_RANK_REORDER_METHOD
These are the different values that you can set it to:
- 0: Round-robin placement. Sequential MPI ranks are placed on the next node in the list.
- 1: SMP-style placement. Sequential MPI ranks fill all cores of one node before moving on to the next node.
- 2: Folded rank placement. Similar to default ordering except that the tasks N+1 ... 2N are mapped to slave cores of nodes N ... 1.
- 3: Custom ordering. The ordering is specified in a file named MPICH_RANK_ORDER.
One can also use the CrayPat performance measurement tools to generate a suggested custom ordering. Please look at the CrayPat documentation, especially the mpi_rank_order and mpi_sm_rank_order options of pat_report.
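The first three placement methods can be sketched as a small C function that maps an MPI rank to a node index. This is an illustration of the descriptions above, not the MPI library's implementation: node_of_rank is a hypothetical helper, nodes are numbered from 0, and the folded case assumes that the direction simply reverses after every pass over the node list.

```c
#include <assert.h>

/* Values of MPICH_RANK_REORDER_METHOD as listed in the text. */
enum { ROUND_ROBIN = 0, SMP_STYLE = 1, FOLDED = 2 };

/* Hypothetical helper: index of the node that hosts the given rank.
   Returns -1 for methods this sketch does not model, such as
   3 (custom ordering). */
int node_of_rank(int method, int rank, int n_nodes, int cores_per_node)
{
    assert(rank >= 0 && n_nodes > 0 && cores_per_node > 0);
    switch (method) {
    case ROUND_ROBIN:
        /* Each successive rank goes to the next node in the list. */
        return rank % n_nodes;
    case SMP_STYLE:
        /* Fill all cores of a node before moving to the next node. */
        return rank / cores_per_node;
    case FOLDED: {
        /* Walk the node list forwards, then backwards (assumed). */
        int pass = rank / n_nodes;
        int pos = rank % n_nodes;
        return (pass % 2 == 0) ? pos : n_nodes - 1 - pos;
    }
    default:
        return -1;
    }
}
```

For 8 ranks on 4 dual-core nodes, this sketch places ranks 0 to 7 on nodes 0,1,2,3,0,1,2,3 with method 0, on nodes 0,0,1,1,2,2,3,3 with method 1, and on nodes 0,1,2,3,3,2,1,0 with method 2.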