General
The most important goal of an application developer is to write code that produces correct results. Only after this has been achieved should attention turn to performance. If the program is expected to be used a lot and requires a lot of computer resources, the developer should make sure that the code performance is sufficient.
The first step in optimization is to profile the code to find the possible hot spots, that is, the routines in which the majority of the computing time is spent. This can be done with the tools introduced in the section Program Development Tools or with user-written timing code. If possible, the floating point performance of these key sections should be measured with hardware performance counters. If the results are not satisfactory, various optimization approaches can be tried.
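As an illustration of user-written timing, the following is a minimal sketch in C. It assumes a POSIX system that provides clock_gettime. The kernel daxpy and the helpers wall_time, time_daxpy and mflops are hypothetical names chosen for this example, not part of any tool described here.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds, read from the POSIX monotonic clock. */
double wall_time(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Hypothetical hot spot: y = a*x + y, i.e. 2 floating point
   operations per element. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Run the kernel `repeats` times and return the elapsed seconds. */
double time_daxpy(size_t n, int repeats, double a,
                  const double *x, double *y)
{
    double t0 = wall_time();
    for (int r = 0; r < repeats; r++)
        daxpy(n, a, x, y);
    return wall_time() - t0;
}

/* Convert a flop count and an elapsed time to MFLOP/s.
   Returns 0 if the measured time is not positive. */
double mflops(double flop_count, double seconds)
{
    return seconds > 0.0 ? flop_count / seconds / 1.0e6 : 0.0;
}
```

For this kernel the flop count passed to mflops would be 2 * n * repeats. The repeat count should be large enough that the measured time clearly exceeds the timer resolution.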
PGI Compilers
The first guess
Usually a good starting point is to compile with
pgf95 -fastsse -Mipa=fast prog.f
pgcc -fastsse -Mipa=fast prog.c
pgcpp -fastsse -Mipa=fast -Minline=levels:10 --no_exceptions prog.cc
Note that the --no_exceptions flag should not be set if the C++ code in question uses exception handling.
PGI Optimization Options
The first and easiest way to improve performance is to try out a number of compiler options. The table below lists the most important optimization-related flags for the Portland Group compilers.
| Option | Description |
|---|---|
| -O{0|1|2|3|4} | Optimization level (none to max, default 1) |
| -fastsse | Set of optimizations, includes SSE/SSE2 instructions |
| -Mipa=fast | Interprocedural analysis |
| -Munroll | Loop unrolling |
| -Minline | Function inlining |
| -Mvect | Loop vectorization (included in -fastsse) |
| -Mpfi, -Mpfo | Profile-feedback optimization |
| -Minfo, -Mneginfo | Optimization reports |
| -help | Information about compiler options |
| -tp=barcelona-64 | Choose the target processor type of Hippu 1 and 2: Barcelona |
| -tp=nehalem-64 | Choose the target processor type of Hippu 3 and 4: Nehalem |
Note that for some of the compiler flags there are corresponding directives (Fortran) and pragmas (C/C++).
In the following we briefly discuss some of the optimization related options for the Portland Group compilers. Most of the options have sub-options with which the programmer can control the optimizations in a detailed way.
For more information, see compiler man pages and PGI User's Guide. Chapter 3 discusses optimization in general, Chapter 2 describes the compiler options in detail, Chapter 4 deals with inlining and, finally, Chapter 7 considers compiler directives and pragmas.
-O
Optimization level. Level 0 specifies no optimization, which produces slow code but is useful when debugging. The default optimization level is 1 unless -g is given (which implies -O0). Note that -gopt can be used for producing debugging information with optimized code. -O alone is equivalent to -O2. According to the man pages, -O3 is equivalent to -O4. When increasing the optimization level, make sure that the code continues to produce correct results.
-fastsse
This is a composite option:
> pgf90 -fastsse -help
Reading rcfile /opt/pgi/linux86-64/6.2/bin/.pgf90rc
-fastsse -Mdaz is also set for Intel EM64T in 64-bit mode
== -fast -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
-fast Common optimizations; includes -O2 -Munroll=c:1 -Mnoframe -Mlre
Most importantly, it vectorizes loops and uses Streaming SIMD Extensions (SSE/SSE2) which utilize Opteron's eight 128-bit registers and usually produces faster code.
-Mipa=fast
Inter-procedural analysis (IPA) is a technique which allows optimization across all the functions and subroutines of a program. Compiling and linking with IPA involves multiple phases, i.e. precompiling and linking, gathering and passing optimization information around and recompiling and linking the code. With the IPA flag the compiler performs all these stages automatically.
-Munroll (-Mnounroll)
Loop unrolling is a basic optimization technique which reduces loop overhead and facilitates better instruction scheduling. With the Fortran directives
!pgi$ unroll
!pgi$ nounroll
or the C/C++ pragmas
#pragma unroll
#pragma nounroll
loop unrolling can be controlled for individual loops. The Fortran directives are valid only if the code is compiled with -Munroll. Compile with -Minfo to see which loops are unrolled.
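To make the effect of unrolling concrete, here is a hand-written sketch in C of a dot product unrolled by a factor of four. It only illustrates the transformation. It is not what the PGI compiler emits, and the function names dot_plain and dot_unrolled4 are hypothetical.

```c
#include <stddef.h>

/* Reference version: one multiply-add and one loop test per element. */
double dot_plain(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Unrolled by 4: one loop test per four elements, and four
   independent partial sums that can be scheduled in parallel. */
double dot_unrolled4(size_t n, const double *x, const double *y)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 3 < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }

    double sum = (s0 + s1) + (s2 + s3);

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Because the unrolled version adds the terms in a different order, its result can differ from the plain loop in the last bits for general floating point data.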
-Minline
Inlining replaces a subroutine call with the body of the called subroutine, which reduces call overhead but increases code size. Inlining is most beneficial for small subroutines that are frequently called.
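The following C sketch shows the idea on a small helper that is called once per loop iteration. The second loop is written by hand to resemble what the first looks like after inlining. All names (squared_distance, sum_sq_dist_call, sum_sq_dist_inlined) are hypothetical.

```c
#include <stddef.h>

/* Small, frequently called function: a typical inlining candidate. */
static inline double squared_distance(double ax, double ay,
                                      double bx, double by)
{
    double dx = ax - bx;
    double dy = ay - by;
    return dx * dx + dy * dy;
}

/* Original form: one function call per point. */
double sum_sq_dist_call(size_t n, const double *x, const double *y,
                        double cx, double cy)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += squared_distance(x[i], y[i], cx, cy);
    return sum;
}

/* Hand-inlined form: the call is replaced by the function body. */
double sum_sq_dist_inlined(size_t n, const double *x, const double *y,
                           double cx, double cy)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = x[i] - cx;
        double dy = y[i] - cy;
        sum += dx * dx + dy * dy;
    }
    return sum;
}
```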
-Mvect
Enabled with -fastsse. Invokes loop transformations such as loop distribution (splitting, fission), loop interchange and cache tiling (blocking). These typically increase the performance of compute-intensive codes, but it is still recommended that the programmer carefully checks whether the run time actually decreases when vectorization is enabled. Again, compiling with -Minfo provides useful information about the performed optimizations.
This flag has various sub-options for fine tuning the behavior of the compiler as well as corresponding Fortran directives and C/C++ pragmas.
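As an example of one of these transformations, the C sketch below shows loop interchange done by hand on a matrix stored in row-major order. It is an illustration only, with hypothetical function names, and does not reproduce the compiler's actual output.

```c
#include <stddef.h>

/* Before interchange: the inner loop walks down a column, so
   consecutive accesses are ncols doubles apart in memory. */
double sum_column_order(size_t nrows, size_t ncols, const double *a)
{
    double sum = 0.0;
    for (size_t j = 0; j < ncols; j++)
        for (size_t i = 0; i < nrows; i++)
            sum += a[i * ncols + j];
    return sum;
}

/* After interchange: the inner loop walks along a row, so
   accesses are contiguous (stride 1) and cache friendly. */
double sum_row_order(size_t nrows, size_t ncols, const double *a)
{
    double sum = 0.0;
    for (size_t i = 0; i < nrows; i++)
        for (size_t j = 0; j < ncols; j++)
            sum += a[i * ncols + j];
    return sum;
}
```

In Fortran, arrays are stored in column-major order, so the roles of the two loop orders are reversed there.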
-Mpfi, -Mpfo
In profile-feedback optimization an instrumented version of the code is run to produce profile information which can then be utilized in a subsequent optimizing compilation.
The first step is to compile and link with -Mpfi and to run the code to produce a log file (pgfi.out). Note that the instrumented program runs slower due to the added code and data collection. For the optimized production version of the program, compile with -Mpfo (this requires that the log file is in the current directory).
-Minfo, -Mneginfo
Print information to stdout about the optimizations that were performed (-Minfo) and those that could not be performed (-Mneginfo). Use these to make sure that the most important parts of the code have been successfully optimized.
-help
Use this option to find out the details of other options.
PathScale Compilers
With the PathScale compilers the recommended first optimization flags to try are as follows:
- -O or, equivalently, -O2 (this is the default)
- -O3
- -O3 -OPT:Ofast
- -Ofast
PathScale Optimization Options
In the following table the most important optimization options are listed.
| Option | Description |
|---|---|
| -O{0|1|2|3} | Optimization level, none to max, default is 2 |
| -ipa | Inter-procedural analysis |
| -IPA | Inter-procedural analysis option group |
| -LNO | Loop nest optimization option group |
| -fb-create, -fb-opt | Feedback Directed Optimization (FDO) |
| -march=barcelona | Choose the target processor type of Hippu 1 and 2: Barcelona |
| -march=nehalem | Choose the target processor type of Hippu 3 and 4: Nehalem |
The general optimization level is specified with -O; -O0 disables all optimizations and -O2 is the default. At -O3 more aggressive optimization techniques are used which in some cases can lead to worse performance.
If -O3 is slower than -O2 one can try -O3 -LNO:prefetch=0.
Inter-procedural analysis (IPA) allows the compiler to optimize a program as a whole as opposed to separately compiled and optimized units. In this way, inlining can be performed across the code, for example. Inter-procedural analysis involves a special compilation and linking sequence and the programmer must specify -ipa for both compilation and linking phases. IPA is implicitly invoked by -Ofast. Detailed control of the IPA can be achieved with -IPA and its sub-options. Please see the documentation for further details.
With the -LNO option the exact behavior of the loop nest optimizer can be specified. Optimizations in this group include loop fusion and fission, cache blocking (or tiling), Translation Lookaside Buffer (TLB) optimizations, vectorization and prefetching. Please see the documentation for further details.
Feedback Directed Optimization (FDO) is based on running a program and producing a log file which is then used in optimizing the code. Data is collected for e.g. branches. FDO is best suited for programs that are run with the same or similar input data; otherwise a performance degradation may be observed.
For FDO the code is first compiled with -fb-create:
f90 -O3 -ipa -fb-create fblog my_prog.f90
The log file is prefixed with the given file name, here fblog. After the code is run, the logged data is used in the subsequent compilation, which should produce a faster binary:
f90 -O3 -ipa -fb-opt fblog my_prog.f90
Note that the instrumented binary produced by the first compilation may run slower than the non-FDO code.
Feedback from the Compiler
Generally, the compiler is quite good at optimizing a wide variety of codes. There are, however, circumstances in which the compiler cannot perform loop modifications, vectorization and so on. Sometimes only a minor modification of the code by the programmer is enough to allow the compiler to produce a well-performing executable. Thus it is important to know what the compiler has done to the most compute-intensive parts of the code. This can be examined in three ways.
Read the Assembly
Compiling with -S preserves the assembly file (suffix .s). Even though the assembly itself may be difficult to read, it is annotated with loop transformation and scheduling information, which can be very useful. For example:
#<loop> Loop body line 127, nesting depth: 2, estimated iterations: 250
#<loop> unrolled 4 times
#<sched>
#<sched> Loop schedule length: 35 cycles (ignoring nested loops)
#<sched>
#<sched> 13 flops ( 18% of peak)
#<sched> 18 mem refs ( 25% of peak)
#<sched> 4 integer ops ( 5% of peak)
#<sched> 35 instructions ( 25% of peak)
Here a loop has been unrolled 4 times and it should run at about 18 % of the peak performance (as there are 13 flops in 35 cycles during which max 70 flops could be achieved, and 13/70 is about 0.186). Make sure that the most important loops are efficiently scheduled.
Read the Intermediate Code
Compile Fortran codes with -FLIST:=on or -flist and C codes with -CLIST:=on to produce an intermediate code listing (with the suffix .w2c.c or .w2f.f for C and Fortran, respectively). From this listing loop transformations, vectorized math routines and such can be seen.
Direct Compiler Feedback
With the verbose flag -LNO:simd_verbose set the compiler prints out information about loop transformations, e.g.
(jacobi.f:126) LOOP WAS VECTORIZED.
(jacobi.f:169) Loop has no aligned loads/stores. Loop was not vectorized.
(jacobi.f:169) Loop has no aligned loads/stores. Loop was not vectorized.
(jacobi.f:174) LOOP WAS VECTORIZED.
Similarly, -LNO:vintr_verbose produces feedback on vector intrinsics:
(jacobi.f:174) LOOP WAS VECTORIZED FOR VECTOR INTRINSIC ROUTINE(S).
Again, one should check that the most important loops are vectorized and in the negative case, optimization inhibiting features should be removed by appropriately modifying the code.
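One common optimization-inhibiting feature in C code is possible pointer aliasing. The sketch below is a generic C99 illustration with hypothetical function names. It is an assumption that this inhibitor applies to a given loop, and it is not tied to the specific messages shown above. The restrict qualifier promises the compiler that the arrays do not overlap, which removes one obstacle to vectorization.

```c
#include <stddef.h>

/* Without restrict the compiler must assume that out may overlap
   a or b, which can prevent it from vectorizing the loop. */
void add_arrays(size_t n, const double *a, const double *b, double *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* With restrict (C99) the programmer guarantees that the three
   arrays do not overlap, so the iterations are independent. */
void add_arrays_restrict(size_t n, const double *restrict a,
                         const double *restrict b,
                         double *restrict out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Calling the restrict version with overlapping arrays is undefined behavior, so the qualifier should only be added when the guarantee really holds. After such a change, the compiler feedback should be checked again to see whether the loop is now vectorized.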
Further Information
For more details on PathScale optimization options, see PathScale User Guide, PathOpt utility guide, man pathf95, man pathcc, man pathCC and man eko or man ekopath.