The most important goal of an application developer is to write code that produces correct results. Only after this has been achieved should attention turn to performance. If the program is expected to be used intensively and demands large computational resources, the developer should pay attention to application performance.
Three easy steps to decent application performance
- Employ tuned libraries
Check that your application does not carry out basic operations and algorithms, such as matrix multiplication, matrix transpose, matrix inversion, fast Fourier transform, or the solution of a set of linear equations, with its own implementation of the algorithm or with some intrinsic Fortran function. Almost all imaginable basic operations and off-the-shelf algorithms are implemented in tuned libraries available on Vuori.
- Use compiler optimization
The choice of compiler, and especially what you ask it to do, really matters. It is useful to quickly evaluate all the compilers available in the Vuori programming environment, since the best choice varies from one application to another. Selecting good compiler flags is addressed in section 4.1.
- Mind the I/O
Writing to and reading from disk needs much more attention on an HPC system than on a workstation, even when the application's I/O demands are modest. I/O is always a challenge in parallel computing, but its performance can often be improved without modifying the source code (although source changes are quite likely needed for the best performance). First, assess what you actually ask of your application, as this may have dramatic effects on performance. Ask yourself, for example: Do you really need restart data from every iteration? Checkpointing is very communication- and I/O-intensive and may well be the bottleneck of your application. Is some of the output superfluous, that is, could you get the insight you are after with less output? Keeping the output to a minimum means more throughput, that is, more simulation time obtained from a run.
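As an illustration of the restart-data question above, the following minimal sketch writes a checkpoint only every few iterations instead of after each one. It is written in Python purely for brevity; the function name `run_simulation`, the parameter `checkpoint_interval`, the file name, and the toy "computation" are all hypothetical and not part of any Vuori tool or library.

```python
import os
import pickle
import tempfile


def run_simulation(n_steps, checkpoint_interval, checkpoint_path):
    """Toy time-stepping loop (hypothetical example).

    Restart data is written only every `checkpoint_interval` steps and
    once more at the final step, not after every iteration.
    Returns the final state and the number of checkpoint writes.
    """
    state = {"step": 0, "value": 0.0}
    writes = 0
    for step in range(1, n_steps + 1):
        state["step"] = step
        state["value"] += 0.5 * step  # stand-in for the real computation
        if step % checkpoint_interval == 0 or step == n_steps:
            # Write to a temporary file and then rename it, so that an
            # interrupted write cannot corrupt the previous checkpoint.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp_path, checkpoint_path)
            writes += 1
    return state, writes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "restart.pkl")
        _, every_step = run_simulation(1000, 1, path)
        _, every_100 = run_simulation(1000, 100, path)
        print("checkpoint writes:", every_step, "vs", every_100)
```

In this toy setting the longer interval cuts the number of checkpoint writes by a factor of 100. A suitable interval for a real application depends on how much recomputation is acceptable after a failure.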
The first step in more hands-on optimization of an application is to profile the code to find the hot spots, that is, the routines in which the majority of the computing time is spent. This can be done with the tools introduced in the section Program Development Tools or by timing different parts of the code with system calls. Of interest are, e.g.,
- Time required for communication (MPI routines) and load balance across the ranks
  - Bad load balance can be alleviated by revisiting data structures and/or parallelization schemes, or for instance by mixing OpenMP with MPI
- Single-core performance - hardware performance counters provide accurate information on different performance aspects
  - Cache hit ratios
  - Computational intensity
  - Packed SSE instructions
- I/O statistics
  - If the code does a lot of I/O from only one task, the I/O should be parallelized, for example with MPI-I/O.
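The quantities listed above can be made concrete with a small sketch. The Python snippet below is only an illustration under stated assumptions: the helper names (`timed`, `load_imbalance`, `cache_hit_ratio`, `computational_intensity`) are hypothetical, the imbalance measure max/mean - 1 is one common convention rather than the definition used by any particular profiler, and all numbers in the usage part are made up for the example. In a real code the per-rank times would come from timers around the MPI sections and the counter values from a hardware performance counter tool.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named code region.
region_times = defaultdict(float)


@contextmanager
def timed(region):
    """Add the wall-clock time spent inside the `with` block to `region`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        region_times[region] += time.perf_counter() - start


def load_imbalance(per_rank_seconds):
    """Return max/mean - 1 over the ranks; 0.0 means perfect balance."""
    mean = sum(per_rank_seconds) / len(per_rank_seconds)
    return max(per_rank_seconds) / mean - 1.0


def cache_hit_ratio(hits, misses):
    """Fraction of cache accesses that were served from the cache."""
    return hits / (hits + misses)


def computational_intensity(flops, bytes_moved):
    """Floating-point operations per byte moved to or from memory."""
    return flops / bytes_moved


if __name__ == "__main__":
    with timed("compute"):
        total = sum(i * i for i in range(100_000))
    print("seconds in compute:", region_times["compute"])

    # Made-up per-rank times: the last rank does twice the work.
    print("load imbalance:", load_imbalance([8.0, 8.0, 8.0, 16.0]))

    # Made-up counter values.
    print("cache hit ratio:", cache_hit_ratio(950, 50))

    # y = a*x + y on n doubles: 2n flops, 3 * 8 * n bytes of traffic.
    n = 1_000_000
    print("intensity:", computational_intensity(2 * n, 24 * n))
```

A low computational intensity, as in the last case, suggests that a loop is limited by memory bandwidth rather than by arithmetic, which is where cache behaviour tends to matter most.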