Code modernization: 4 successful cases and lessons learned

When it comes to parallel computing, we all know that the quality of the code plays an important role in how well the available resources are utilized. Many of the scientific codes used today do not work effectively in modern parallel computing environments. However, code modernization requires effort and skill, and it is not easy to do without professional help.

We at CSC – IT Center for Science hosted an Intel Parallel Computing Centre (Intel PCC) project during 2014–2016, with two successful years spent modernizing scientific software for the Xeon and Xeon Phi platforms. Intel Xeon Phi is a novel microprocessor designed for scientific computing, featuring a large number of cores that are slower than those of ordinary Xeons in exchange for improved energy efficiency.

 

Many of the scientific codes used today do not work effectively in modern parallel computing environments

 

The Intel PCC project allowed us to dive deep into four applications that are important for the Finnish scientific community and make them ready for future computing platforms.

The future challenges for scientific software include, for example, omnipresent parallelism on multiple levels, together with increasingly complex memory hierarchies. In case you are interested in what the future may bring for supercomputing, we have covered the trends in our blog earlier (see HPC Predictions for 2016 - Part 1 and HPC Predictions for 2016 - Part 2).

In close collaboration with research communities in Earth and materials sciences we went into the details of the selected codes:

  • the multi-physics code Elmer
  • the materials science (density-functional theory) code GPAW
  • the pollution transport model SILAM, and
  • the space-plasma simulation code Vlasiator.
     

We worked with the developers and Intel experts to optimize the performance and applicability of these codes for present and upcoming Xeon processor family features, such as wide (256- and 512-bit) vector units and deep cache hierarchies.

Our work produced three major outputs:

  1. A number of modernized and optimized codes
  2. Deepening of our own expertise, and
  3. Raised awareness among researchers of the need to adapt to changes in computing platforms, which is necessary for getting the most out of current and future systems.
     

Case 1: Code modernization with Elmer – even hundred-fold speedups

Elmer is an open-source multi-physical simulation software mainly developed by CSC. Elmer includes physical models of e.g. fluid dynamics, structural mechanics, electromagnetics, heat transfer and acoustics. These are described by partial differential equations which Elmer solves by the finite-element method (FEM).

 

The aim with Elmer was to improve the multithreading and vectorization capabilities of the Elmer core.

 

Firstly, we changed Elmer's build system from an old GNU Automake-based approach to CMake. At the same time, a code cleanup was performed. This consisted of removing several deprecated legacy parts of the code and modernizing the Fortran/C interfaces within the codebase to use the C binding interfaces of the Fortran 2003 standard.

Secondly, to enable multithreading of the models, we improved the thread-safety of several key routines within the Elmer core.

It became apparent early on that the benefits from improving vectorization without modifying the data layout would be limited. Thus, a new data layout and a new application programming interface with properties more amenable to vectorization were devised. This included rewriting the basis functions in a more SIMD-friendly fashion using the OpenMP 4.0 SIMD function semantics.
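
Elmer itself is written in Fortran, so these directives appear there in their Fortran form; the C++ sketch below, with made-up function names and a toy quadratic basis, only illustrates the OpenMP 4.0 "declare simd" mechanism: the compiler generates a vector version of the basis function, so the loop over Gauss points can be vectorized as a whole.

    // Toy basis-function evaluation vectorized with OpenMP 4.0 SIMD function
    // semantics. Names and the quadratic basis are illustrative, not Elmer's.
    #pragma omp declare simd uniform(node) notinbranch
    double basis_quadratic(double u, int node)
    {
        // 1-D quadratic Lagrange basis on the reference element [-1, 1].
        switch (node) {
        case 0:  return 0.5 * u * (u - 1.0);
        case 1:  return 1.0 - u * u;
        default: return 0.5 * u * (u + 1.0);
        }
    }

    // Evaluate one basis function at all Gauss points in a single SIMD loop;
    // the compiler calls the vector variant of basis_quadratic here.
    void evaluate_basis(int ngauss, const double *upts, int node, double *values)
    {
        #pragma omp simd
        for (int g = 0; g < ngauss; ++g)
            values[g] = basis_quadratic(upts[g], node);
    }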

The new API and data layout make it possible to use Basic Linear Algebra Subprograms (BLAS) level 2 and 3 routines for the matrix-vector and matrix-matrix multiplications needed in evaluating some of the fundamental operations arising in the finite element assembly process. These routines are also very efficient compared to the preceding hand-written loops.
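
As a rough illustration of the idea, the sketch below replaces a hand-written triple loop with a single BLAS level 3 call (cblas_dgemm); the matrix names and sizes are made up and do not correspond to Elmer's actual assembly routines.

    #include <cblas.h>

    // Toy element-assembly step: combine an (ngauss x nbasis) matrix of basis
    // values B with an (nbasis x nbasis) block of local coefficients in one
    // dgemm call instead of hand-written nested loops. Link against a BLAS
    // implementation (e.g. -lopenblas or MKL).
    void apply_basis(int ngauss, int nbasis,
                     const double *B,       // ngauss x nbasis, row-major
                     const double *coeffs,  // nbasis x nbasis, row-major
                     double *result)        // ngauss x nbasis, row-major
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    ngauss, nbasis, nbasis,
                    1.0, B, nbasis,
                    coeffs, nbasis,
                    0.0, result, nbasis);
    }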

In certain cases, over hundred-fold speedups were measured compared to the old approach.

The SIMD improvements are publicly available in the so-called mesh color branch of Elmer, but they are not yet integrated into the main Elmer trunk.

Case 2: GPAW – materials science modeling with Xeon Phi co-processors, using Python

GPAW is a software package for ab initio electronic-structure calculations using the projector augmented wave (PAW) method. GPAW supports different basis sets, such as plane waves, atomic orbitals, and uniform real-space grids, for discretizing the equations of density-functional theory.

GPAW is written mostly in Python, but it also includes computational kernels written in C and leverages external libraries such as NumPy, BLAS and ScaLAPACK. Parallelization is based on message passing using MPI.

During the project, support for offloading computation to Xeon Phi co-processors was added to GPAW.

This is accomplished using the PyMIC library, which was co-developed by CSC and Intel. The offload work with GPAW is documented in more detail in Ref. [1].

 

The PyMIC library now gives the GPAW user full control over data transfers and offloading through a clear and easy-to-use interface at the Python level.

 

PyMIC is designed to give the user full control over data transfers and offloading through a clear and easy-to-use interface at the Python level, which was not possible previously. The PyMIC interface is built to fully integrate with the efficient numerical arrays implemented in NumPy, and it uses them as the basis for buffer management and data transfer.

PyMIC uses functions as the main granularity for offloading, since most computationally intensive routines in NumPy are implemented as function calls. The current version of PyMIC requires the user to write the offload kernels in C, C++, or Fortran and to compile them for co-processor execution as shared libraries.
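
As a sketch of what such a kernel can look like (the kernel name, argument convention, and build flags below are assumptions for illustration, not the actual GPAW or PyMIC code), a C++ function with C linkage is compiled into a shared library, which the Python side can then load and invoke through the offload framework.

    #include <cstdint>

    // Build as a shared library, e.g. (compiler and flags are an assumption):
    //   icpc -fPIC -shared -O3 -o liboffload_kernels.so offload_kernels.cpp
    //
    // y <- alpha * x + y, the kind of dense array update that dominates many
    // electronic-structure kernels. Arguments are passed as pointers so the
    // Python side can hand over NumPy buffers.
    extern "C" void daxpy_kernel(const int64_t *n,
                                 const double  *alpha,
                                 const double  *x,
                                 double        *y)
    {
        for (int64_t i = 0; i < *n; ++i)
            y[i] += (*alpha) * x[i];
    }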

In small to medium-size calculations (e.g. high-throughput calculations), the plane-wave basis together with the Davidson method as the iterative eigensolver is typically the most efficient approach when using a modest number of processor cores. With a larger number of cores, however, the limited parallelization prospects in the current implementation of the Davidson solver become a bottleneck.

Thus, we added a new parallelization level to the Davidson eigensolver of GPAW. This work targets the host-only CPU version of the code, but it could in the future also be utilized in the offload version.
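
GPAW itself is implemented in Python on top of MPI, but the general pattern behind adding a parallelization level is splitting the existing communicator into groups so that collectives of the new level run over a smaller communicator; the C++ sketch below (group count and all names are illustrative, not GPAW's) shows only that generic pattern.

    #include <mpi.h>

    // Minimal sketch of introducing an extra parallelization level by
    // splitting the world communicator into groups.
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        const int ngroups = 4;                    // assumed number of groups
        const int color   = world_rank % ngroups; // which group this rank joins

        MPI_Comm group_comm;
        MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &group_comm);

        // ... collective operations of the new level would use group_comm ...

        MPI_Comm_free(&group_comm);
        MPI_Finalize();
        return 0;
    }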

Case 3: Parallelizing the SILAM air quality model

SILAM is a global-to-meso-scale dispersion model developed for atmospheric composition, air quality, and emergency decision support applications. It is developed by the Finnish Meteorological Institute and used by several research groups in Europe, and it is one of the modeling codes used for air quality forecasts in Europe. Before the Intel PCC project, SILAM had already been parallelized using OpenMP, with the computationally intensive parts, mostly operations related to chemical reactions, scaling over one typical supercomputer node.

Because SILAM is a very flexible tool that can be used for many different types of analysis, its performance profile can vary a lot. In most cases the computation of chemical reactions is the most time-consuming part, but sometimes I/O becomes the bottleneck. The OpenMP parallelization limited SILAM runs to a single node, which made long global-scale simulations very time-consuming and limited the resolution of the simulations.

Therefore, hybrid MPI-OpenMP parallelization was set as the target of the code development.
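
SILAM is a Fortran code, so the snippet below does not reflect its actual routines; it is only a minimal C++ sketch of the hybrid pattern that was targeted: every MPI process owns part of the domain, and OpenMP threads share the loops inside it.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char **argv)
    {
        int provided;
        // FUNNELED is sufficient when only the master thread makes MPI calls.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        // This process's share of the (illustrative) global grid.
        std::vector<double> cells(100000, 1.0);

        // OpenMP threads share the loop over the local cells.
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(cells.size()); ++i)
            cells[i] *= 0.5;   // stand-in for chemistry/advection work

        // A process-level reduction stands in for halo exchanges and global sums.
        double local = cells[0], global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }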

 

The new version of SILAM can use different numbers of threads and processes for optimal performance.

 

The MPI implementation was done simultaneously with the main development of SILAM. This caused some workflow-related problems when SILAM upgraded its advection scheme: certain problems of the new algorithm were noticed only in the MPI version, and it took long debugging and code inspection before the problem was finally solved.

The result was a version of SILAM that can use different numbers of threads and processes for optimal performance.

The I/O of the MPI version is often significantly faster than that of the OpenMP version and the hybrid version scales to system sizes well beyond the original SILAM.

Case 4: Space plasma simulation code Vlasiator now ready to harness the next-generation supercomputers

Vlasiator is a plasma simulation code being developed at the Finnish Meteorological Institute. Its main scientific objective is simulating Earth's magnetosphere using a hybrid-Vlasov model, where protons are described by a huge 6-dimensional distribution function (three dimensions in real space and three in velocity space).

The code is parallelized with MPI and OpenMP, and the core solvers are vectorized. Vlasiator was brought into the Intel PCC project as a new code in the middle of the project's second year.

The target for Vlasiator was to improve the performance of the core Vlasov solvers and enable them to work efficiently on the upcoming Xeon Phi Knights Landing architecture.

First, we extracted one of the two core Vlasov propagators and implemented it as an open-source library, lib-slice3d, available under the LGPL3 license. This operation is vectorized, and in Vlasiator it is threaded over all spatial cells handled by one process. The library also includes a small benchmark with which the performance can be investigated.

 

Performance was increased by 16% when running the benchmark on a Haswell processor with 12 threads.

 

The threading of the acceleration kernel in Vlasiator is limited by the number of spatial cells, which can be very small due to dynamic load balancing. We wrote a new version of the solver which is threaded over two of the three velocity dimensions. This increases the number of work units for threading to hundreds per spatial cell.
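
A hedged sketch of this threading strategy follows; the variable names are illustrative, not Vlasiator's. A collapse(2) clause merges the loops over two velocity dimensions into one pool of independent work units, so the thread count is no longer tied to the number of spatial cells.

    // Thread the acceleration-style update over two of the three velocity
    // dimensions; collapse(2) gives nvy * nvz work units per spatial cell
    // instead of one.
    void accelerate_cell(int nvx, int nvy, int nvz, double factor, double *f)
    {
        #pragma omp parallel for collapse(2) schedule(static)
        for (int j = 0; j < nvy; ++j) {
            for (int k = 0; k < nvz; ++k) {
                // Each (j, k) pair is an independent work unit for a thread.
                for (int i = 0; i < nvx; ++i)
                    f[(static_cast<long>(j) * nvz + k) * nvx + i] *= factor;
            }
        }
    }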

Due to the sparse nature of the distribution function, the kernel needs an additional temporary copy of the distribution function, which made this version 7% slower when running on a Haswell processor with 12 threads. The plan is to use it in Vlasiator for those cases where the original threading approach does not scale far enough, and to switch to the original one elsewhere.

To improve single-core performance, we added support for AVX512 to our vectorization backend, which is based on Agner Fog's vectorclass library. During the project we did not have access to hardware with AVX512 support, but the functionality was verified in the KNL simulator.
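
A hedged sketch of what such a backend switch can look like: Vec4d, Vec8d and the INSTRSET macro come from the vectorclass library, while the WideVec alias and the kernel below are made up for illustration and are not Vlasiator's actual backend.

    // Select the vector width at compile time; kernels are written against
    // the chosen type only. Requires Agner Fog's vectorclass headers.
    #define MAX_VECTOR_SIZE 512   // allow 512-bit types in older vectorclass versions
    #include "vectorclass.h"

    #if INSTRSET >= 9             // AVX-512 instruction set available
    using WideVec = Vec8d;        // 8 doubles per register
    #else
    using WideVec = Vec4d;        // 4 doubles per register (AVX/AVX2)
    #endif

    // Scale an array; n is assumed to be a multiple of the vector width.
    void scale(double *data, int n, double a)
    {
        for (int i = 0; i < n; i += WideVec::size()) {
            WideVec v;
            v.load(data + i);
            v *= a;
            v.store(data + i);
        }
    }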

Additionally, we investigated the performance of the kernel in detail using Intel VTune and noticed that the dynamic data structure, based on blocks of 64 elements, was frequently stalling execution while waiting for memory. By using our knowledge of the memory accesses we were able to insert manual prefetches.

These, together with improved inlining, improved alignment of a few data structures, and other miscellaneous optimizations, further increased performance by 16% when running the benchmark on a Haswell processor with 12 threads.
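
The idea behind the manual prefetches, roughly, is shown below; the block layout and names are illustrative, not Vlasiator's actual data structure. While one 64-element block is being processed, the next block is requested into cache so that its loads do not stall.

    #include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

    constexpr int BLOCK = 64;   // elements per velocity block

    void process_blocks(double *const *blocks, int nblocks)
    {
        for (int b = 0; b < nblocks; ++b) {
            // Start fetching the next block into L1 while working on this one.
            if (b + 1 < nblocks)
                _mm_prefetch(reinterpret_cast<const char *>(blocks[b + 1]),
                             _MM_HINT_T0);

            for (int i = 0; i < BLOCK; ++i)
                blocks[b][i] *= 2.0;   // placeholder work on the current block
        }
    }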

Vlasiator's performance is tracked using the open-source Phiprof library, which enables one to create a timing report based on regions that the user adds into the code. The report includes a hierarchical view of the regions, showing for each region the average time, MPI imbalance, and optionally work units per second. During the project, support for OpenMP was added to the library: Phiprof is now able to report the performance of all threads as well as the thread load imbalance.

Code modernization skills transfer and community engagement

The Intel PCC project also arranged several training events targeted at researchers and other CSC end-users, since it is crucial that the user communities are aware of the need for code modernization and have the necessary skills.

  • Intel Software Development Tools Workshop
    A two-day workshop introduced various Intel tools for developing and optimizing high-performance computing software. The workshop discussed optimization and development for the Intel Xeon and Xeon Phi architectures and contained both lectures and hands-on exercises. The learning outcome was the ability to use Intel tools for developing and optimizing HPC software.

    The lecturers were Heinrich Bockhorst and Mikko Byckling from Intel. The workshop gathered 20 attendees, and received good feedback from the participants.
     
  • Advanced Threading and Optimization
    This three-day workshop was held twice during the Intel PCC project, jointly with the PRACE Advanced Training Centre at CSC. It covered topics on code optimization for Intel Xeon and Xeon Phi, and efficient code parallelization using OpenMP threading. Advanced aspects of threading and optimization, such as new features of OpenMP 4.0, were covered during the course. Some performance aspects of MPI were also discussed.

    The learning outcomes included awareness of modern features of Intel CPUs, how to vectorize computations, using advanced features of OpenMP, and ability to improve code performance using threading and x86 optimization.

    The lecturers featured CSC specialists as well as Mikko Byckling and Michael Klemm from Intel. Both workshops had some 15 attendees.
     
  • Intel Parallel Computing Workshop
    The workshop covered all the most important topics in code modernization. The lecturer was Heinz Bast from Intel. The workshop lasted one day and had 25 attendees.

The work on code modernization and evaluation of the new Xeon and Xeon Phi platforms of course continues at CSC.

 

The author was the co-PI of the Intel PCC project.

 

[1] J. Enkovaara, M. Klemm and F. Witherden, High Performance Python Offloading, in High Performance Parallelism Pearls, J. Reinders and J. Jeffers (eds), Morgan Kaufmann (2015).