
Optimization

General hints and guidelines, and an overview of the compiler optimization flags.

General


The most important goal of an application developer is to write code that produces correct results. After this has been achieved, attention can be paid to performance. If the program is expected to be used a lot and requires a lot of computer resources, the developer should make sure that the code performance is sufficient.

The first step in optimization is to profile the code to find the hot spots, that is, the routines in which the majority of the computing time is spent. This can be done with the tools introduced in the section Program Development Tools or with user-written timing calls. If possible, the floating point performance of these key sections should be measured with hardware performance counters. If the results are not satisfactory, various optimization approaches can be tried.
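As an illustration of user-written timing, the following is a minimal sketch in C. It assumes a hypothetical hot spot, dot(), and uses the standard C clock() function, which measures CPU time rather than wall-clock time. The names cpu_seconds, dot and time_dot are made up for this example and do not come from any Hippu tool.

```c
#include <stddef.h>
#include <time.h>

/* CPU time in seconds used by the process so far (standard C clock()). */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Hypothetical hot spot: dot product of two vectors of length n. */
static double dot(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Call dot() `repeats` times and return the elapsed CPU seconds.
   The last result is written to *result so that the compiler
   cannot discard the computation entirely. */
static double time_dot(const double *x, const double *y, size_t n,
                       int repeats, double *result)
{
    double t0 = cpu_seconds();
    double r = 0.0;
    for (int k = 0; k < repeats; k++)
        r = dot(x, y, n);
    *result = r;
    return cpu_seconds() - t0;
}
```

Each call of dot() performs 2*n floating point operations, so a rough flop rate is 2*n*repeats divided by the returned time. Choose repeats large enough that the measured time is well above the clock resolution.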


PGI Compilers


The first guess


Usually a good starting point is to compile with
pgf95 -fastsse -Mipa=fast prog.f
pgcc -fastsse -Mipa=fast prog.c
pgcpp -fastsse -Mipa=fast -Minline=levels:10 --no_exceptions prog.cc
Note that the --no_exceptions flag should not be set if the C++ code in question uses exception handling.

PGI Optimization Options


The first and easiest way to improve performance is to try out a number of compiler options. The table below lists the most important optimization-related flags for the Portland Group compilers.

Option               Description
-O{0|1|2|3|4}        Optimization level (none to max, default 1)
-fastsse             Set of optimizations, includes SSE/SSE2 instructions
-Mipa=fast           Interprocedural analysis
-Munroll             Loop unrolling
-Minline             Function inlining
-Mvect               Loop vectorization (included in -fastsse)
-Mpfi, -Mpfo         Profile-feedback optimization
-Minfo, -Mneginfo    Optimization reports
-help                Information about compiler options
-tp=barcelona-64     Target processor type of Hippu 1 and 2: Barcelona
-tp=nehalem-64       Target processor type of Hippu 3 and 4: Nehalem

Note that for some of the compiler flags there are corresponding directives (Fortran) and pragmas (C/C++).

In the following we briefly discuss some of the optimization related options for the Portland Group compilers. Most of the options have sub-options with which the programmer can control the optimizations in a detailed way.

For more information, see the compiler man pages and the PGI User's Guide. Chapter 2 describes the compiler options in detail, Chapter 3 discusses optimization in general, Chapter 4 deals with inlining and, finally, Chapter 7 considers compiler directives and pragmas.

-O

Optimization level. Level 0 specifies no optimization, which produces slow code but is useful when debugging. The default optimization level is 1 unless -g is given (which implies -O0). Note that -gopt can be used for producing debugging information with optimized code. -O alone is equivalent to -O2. According to the man pages, -O3 is equivalent to -O4. When increasing the optimization level, make sure that the code continues to produce correct results.

-fastsse

This is a composite option:
> pgf90 -fastsse -help
Reading rcfile /opt/pgi/linux86-64/6.2/bin/.pgf90rc
-fastsse  -Mdaz is also set for Intel EM64T in 64-bit mode
== -fast -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
-fast      Common optimizations; includes -O2 -Munroll=c:1 -Mnoframe -Mlre
Most importantly, it vectorizes loops and uses Streaming SIMD Extensions (SSE/SSE2) instructions, which operate on the processor's 128-bit XMM registers (sixteen of them in 64-bit mode). This usually produces faster code.

-Mipa=fast

Inter-procedural analysis (IPA) is a technique which allows optimization across all the functions and subroutines of a program. Compiling and linking with IPA involves multiple phases, i.e. precompiling and linking, gathering and passing optimization information around and recompiling and linking the code. With the IPA flag the compiler performs all these stages automatically.

-Munroll (-Mnounroll)

Loop unrolling is a basic optimization technique which reduces loop overhead and facilitates better instruction scheduling. With the Fortran directives
!pgi$ unroll
!pgi$ nounroll
or the C/C++ pragmas
#pragma unroll
#pragma nounroll
loop unrolling can be controlled for individual loops. The Fortran directives are valid only if the code is compiled with -Munroll. Compile with -Minfo to see which loops are unrolled.
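To show what unrolling does to a loop, here is a hand-written sketch in C. It is only an illustration of the transformation, not the code the PGI compiler actually generates. The function names are made up for this example.

```c
#include <stddef.h>

/* Reference version: one addition and one loop test per element. */
double sum_plain(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: one loop test per four elements, and four
   independent partial sums that can be scheduled in parallel. */
double sum_unrolled4(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

The unrolled version adds the elements in a different order, so for general floating point data the result can differ from the plain loop in the last digits.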

-Minline

Inlining replaces a subroutine call with the body of the subroutine, which reduces call overhead but increases code size. Inlining is most beneficial for small subroutines that are called frequently.
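The effect of inlining can be pictured with the following C sketch, in which a small function called inside a loop is compared with the same loop after the call has been replaced by the function body. The function names are hypothetical, and the second version only illustrates what the compiler would do automatically.

```c
#include <stddef.h>

/* Small, frequently called function: a typical inlining candidate. */
static inline double axpy_elem(double alpha, double x, double y)
{
    return alpha * x + y;
}

/* Version with a function call in the loop body. */
void axpy_call(double alpha, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = axpy_elem(alpha, x[i], y[i]);
}

/* Same loop after inlining by hand: the call is replaced
   by the body of axpy_elem(). */
void axpy_inlined(double alpha, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Once the call is gone, the loop body is a simple arithmetic expression, which also makes the loop easier for the compiler to unroll and vectorize.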

-Mvect

Enabled with -fastsse. Invokes loop transformations such as loop distribution (splitting, fission), loop interchange and cache tiling (blocking). These typically improve the performance of compute-intensive codes, but it is still recommended to check that the run time actually decreases when vectorization is enabled. Again, compiling with -Minfo provides useful information about the optimizations performed.

This flag has various sub-options for fine tuning the behavior of the compiler as well as corresponding Fortran directives and C/C++ pragmas.
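Of the loop transformations mentioned above, loop interchange is the easiest to show in a few lines. The C sketch below is written by hand, assumes a hypothetical matrix size N, and is not actual compiler output. C stores arrays row by row, so the second version accesses memory contiguously in its inner loop.

```c
#include <stddef.h>

#define N 64  /* hypothetical matrix dimension for this example */

/* Before: the inner loop runs over the first index, so consecutive
   iterations touch elements that are N doubles apart in memory. */
void scale_strided(double a[N][N], double f)
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] *= f;
}

/* After loop interchange: the inner loop runs over the last index,
   so memory is accessed with stride one. */
void scale_interchanged(double a[N][N], double f)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] *= f;
}
```

Both versions compute the same result. In Fortran the storage order is column-major, so there the first index should vary fastest in the inner loop.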

-Mpfi, -Mpfo

In profile-feedback optimization an instrumented version of the code is run to produce profile information which can then be utilized in a subsequent optimizing compilation.

The first step is to compile and link with -Mpfi and to run the code to produce a log file (pgfi.out). Note that the instrumented program runs slower because of the added code and data collection. For the optimized production version of the program, recompile with -Mpfo; this requires that the log file pgfi.out is in the current directory.

-Minfo, -Mneginfo

Print information about the optimizations that were performed (-Minfo) and those that could not be performed (-Mneginfo). Use these to make sure that the most important parts of the code have been successfully optimized.

-help

Use this option to find out the details of other options.

PathScale Compilers


With the PathScale compilers the recommended first optimization flags to try are as follows:
  1. -O or, equivalently, -O2 (this is the default)
  2. -O3
  3. -O3 -OPT:Ofast
  4. -Ofast
Here the flags are listed in order of increasing optimization level. As mentioned before, it is good practice to check the results for correctness whenever more aggressive optimization is used.


PathScale Optimization Options


In the following table the most important optimization options are listed.

Option                 Description
-O{0|1|2|3}            Optimization level, none to max, default is 2
-ipa                   Inter-procedural analysis
-IPA                   Inter-procedural analysis option group
-LNO                   Loop nest optimization option group
-fb-create, -fb-opt    Feedback Directed Optimization (FDO)
-march=barcelona       Target processor type of Hippu 1 and 2: Barcelona
-march=nehalem         Target processor type of Hippu 3 and 4: Nehalem


The general optimization level is specified with -O; -O0 disables all optimizations and -O2 is the default. At -O3 more aggressive optimization techniques are used which in some cases can lead to worse performance.
If -O3 is slower than -O2 one can try -O3 -LNO:prefetch=0.

Inter-procedural analysis (IPA) allows the compiler to optimize a program as a whole as opposed to separately compiled and optimized units. In this way, inlining can be performed across the code, for example. Inter-procedural analysis involves a special compilation and linking sequence and the programmer must specify -ipa for both compilation and linking phases. IPA is implicitly invoked by -Ofast. Detailed control of the IPA can be achieved with -IPA and its sub-options. Please see the documentation for further details.

With the -LNO option the exact behavior of the loop nest optimizer can be specified. Optimizations in this group include loop fusion and fission, cache blocking (or tiling), Translation Lookaside Buffer (TLB) optimizations, vectorization and prefetching. Please see the documentation for further details.

Feedback Directed Optimization (FDO) is based on running a program and producing a log file which is then used in optimizing the code. Data is collected on, for example, branches. FDO is best suited to programs that are run with the same input data; otherwise, a performance degradation may be observed.

For FDO the code is first compiled with -fb-create:
f90 -O3 -ipa -fb-create fblog my_prog.f90
The log file is prefixed with the given file name, here fblog. After the code has been run, the logged data is used in the subsequent compilation, which should produce a faster binary:
f90 -O3 -ipa -fb-opt fblog my_prog.f90
Note that the instrumented binary produced by the first compilation may run slower than the non-FDO code.

Feedback from the Compiler


Generally, the compiler is quite good at optimizing a wide variety of codes. There are, however, circumstances in which the compiler cannot perform loop modifications, vectorization and so on. Sometimes a minor modification of the code by the programmer is enough to allow the compiler to produce a well-performing executable. Thus it is important to know what the compiler has done to the most compute-intensive parts of the code. This can be examined in three ways.

Read the Assembly

Compiling with -S preserves the assembly file (suffix .s). Even though the assembly itself may be difficult to read, it is annotated with loop transformation and scheduling information, which can be very useful. For example:
#<loop> Loop body line 127, nesting depth: 2, estimated iterations: 250
#<loop> unrolled 4 times
#<sched>
#<sched> Loop schedule length: 35 cycles (ignoring nested loops)
#<sched>
#<sched>    13 flops        ( 18% of peak)
#<sched>    18 mem refs     ( 25% of peak)
#<sched>     4 integer ops  (  5% of peak)
#<sched>    35 instructions ( 25% of peak)
Here a loop has been unrolled 4 times and it should run at about 18 % of the peak performance: the schedule takes 35 cycles, during which at most 70 flops could be executed, and 13/70 is about 0.186. Make sure that the most important loops are efficiently scheduled.
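The arithmetic behind the flops line can be written out as follows. The peak rate of 2 flops per cycle is not stated in the listing; it is inferred here from the figure of 70 flops in 35 cycles.

```latex
\[
\text{peak flops} = 35~\text{cycles} \times 2~\frac{\text{flops}}{\text{cycle}} = 70,
\qquad
\frac{13~\text{flops}}{70~\text{flops}} \approx 0.186 .
\]
```

The listing shows this value as 18 %, which suggests that the percentages are truncated rather than rounded.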

Read the Intermediate Code

Compile Fortran codes with -FLIST:=on or -flist and C codes with -CLIST:=on to produce an intermediate code listing (with the suffix .w2c.c or .w2f.f for C and Fortran, respectively). From this listing loop transformations, vectorized math routines and such can be seen.

Direct Compiler Feedback

With the verbose flag -LNO:simd_verbose set the compiler prints out information about loop transformations, e.g.
(jacobi.f:126) LOOP WAS VECTORIZED.
(jacobi.f:169) Loop has no aligned loads/stores. Loop was not vectorized.
(jacobi.f:169) Loop has no aligned loads/stores. Loop was not vectorized.
(jacobi.f:174) LOOP WAS VECTORIZED.
Similarly, -LNO:vintr_verbose produces feedback on vector intrinsics:
(jacobi.f:174) LOOP WAS VECTORIZED FOR VECTOR INTRINSIC ROUTINE(S).
Again, one should check that the most important loops are vectorized. If they are not, the features that inhibit optimization should be removed by modifying the code appropriately.
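One modification that may be worth trying when the compiler reports unaligned loads and stores is to allocate the arrays on a 16-byte boundary, which matches the width of the SSE registers. The sketch below is an assumption on our part rather than a documented PathScale recommendation. It uses the C11 function aligned_alloc; on older systems the POSIX function posix_memalign serves the same purpose. The helper names are made up for this example, and whether the message actually disappears depends on the compiler and on the loop in question.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate room for n doubles on a 16-byte boundary.
   Returns NULL on failure; release the memory with free(). */
double *alloc_aligned16(size_t n)
{
    size_t bytes = n * sizeof(double);
    /* aligned_alloc expects the size to be a multiple of the
       alignment, so round up to the next multiple of 16. */
    bytes = (bytes + 15u) & ~(size_t)15u;
    if (bytes == 0)
        bytes = 16;
    return aligned_alloc(16, bytes);
}

/* Return 1 if the pointer is aligned on a 16-byte boundary, else 0. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

After changing the allocation, recompile with -LNO:simd_verbose and check whether the loop is now reported as vectorized.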

Further Information


For more details on PathScale optimization options, see the PathScale User Guide, the PathOpt utility guide, man pathf95, man pathcc, man pathCC and man eko or man ekopath.