At CSC we constantly track trends and technologies related to High Performance Computing. Now that preparations are starting for our next procurement, this work has especially high priority, to ensure that we get the best possible value for our investment. This is demonstrated by our recent report on future computing and data handling needs, which is well worth a read for anyone interested in scientific computing.
The new systems will be rolling into our Kajaani datacenter in 2017 at the earliest. The constant, rapid evolution of supercomputers ensures that there will be some interesting developments between now and when the new systems arrive. Things also often move in unpredictable ways. However, we decided to risk it and make some predictions for what will happen in HPC in the next year, and possibly a bit beyond.
Turns out that there are quite a lot of predictions, so we decided to split them up into a couple of parts. Here’s the first part, with the sequel coming up before the year’s end.
Olli-Pekka Lehto: Interconnect wars heat up
Intel’s Omni-Path interconnect will challenge the dominance of Mellanox InfiniBand in the mid-range HPC space. At SC15 both vendors were strongly asserting their relative superiority, but still based mostly on synthetic benchmarks and projections. With major deployments of both the latest-generation EDR InfiniBand and Omni-Path at the beginning of the year, it will be very interesting to see how the performance stacks up in real-life scenarios.
The French supercomputer manufacturer Bull (Atos) is also deploying the first system using their BXI interconnect in the second half of 2016. This is a very promising design, especially at the high end of the market, and provides competition for Cray's Aries interconnect, which has dominated that segment for a couple of years.
This new competition is very welcome and I predict it will provide more alternatives and drive the introduction of innovative features while keeping prices down. However, having a multi-system environment with a unified fabric for storage, like we have at CSC today, may become more challenging to architect: the different high-speed interconnect fabrics may be incompatible and require gateway devices.
While ubiquitous, Ethernet has long been confined to the low-end and “embarrassingly parallel” clusters, with relatively little take-up on the mid- and high-end general purpose clusters. However, the hyperscale industry is deploying ever larger and more scalable Ethernet fabrics, with the financial community in particular pushing for low latency solutions. Thus I predict that the first very large and fairly capable Ethernet-based HPC cluster for scientific workloads will be deployed in 2016. I’ve predicted this for many years and it has yet to emerge, so I’m not betting any big money on it :) If such a system did emerge and proved successful, it would set an important precedent and potentially open the floodgates for Ethernet-based solutions in the mid-range.
Sebastian von Alfthan: Many-core processors making an impact
During the last twenty to thirty years the performance of processors has increased exponentially, first by increasing the clock frequency and the number of instructions per clock (IPC), and in the last ten years through increased parallelism. This trend has led to multicore processors with more than ten cores, each able to execute up to 16 double precision floating point operations per cycle, and GPUs with thousands of very lightweight cores able to run tens of thousands of simultaneous threads.
A major new architecture will be introduced by Intel in 2016: the latest generation many integrated core (MIC) Xeon Phi processor, Knights Landing (KNL). This processor is not an accelerator, but an x86 CPU that is fully compatible with normal x86 processors. It is different in that it is a thoroughbred HPC processor giving in total 3 Tflops of performance per socket. It achieves this high performance through a number of new technologies, which in the future will also become commonplace in normal processors:
- A very high core count. KNL contains 72 compute cores connected with a 2D mesh interconnect. This is a true “cluster on a chip”.
- New AVX-512 vector instructions that are able to operate on 8 double precision numbers, enabling each core to execute 32 double precision floating point operations per cycle.
- A new level in the memory hierarchy in the form of a high bandwidth memory sitting on the socket. This memory is 16 GB in size, and will have 5x more bandwidth than normal DDR4 main memory, while latency is comparable to normal main memory.
- Integrated on-socket network interface for Omni-Path.
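The figures in the list above can be cross-checked with a little arithmetic. The sketch below takes the 72 cores, 32 flops per cycle per core and 3 Tflops per socket from the text and works out the clock frequency they imply. The breakdown of 32 into 8 lanes, 2 flops per fused multiply-add and 2 vector units per core is an assumption that is not stated above, and the function names are made up for this illustration.

```python
# Figures quoted in the text above.
CORES = 72                      # compute cores per KNL socket
FLOPS_PER_CYCLE_PER_CORE = 32   # double precision flops per cycle per core
PEAK_FLOPS = 3e12               # "3 Tflops" per socket

# ASSUMPTION (not from the text): how the 32 flops per cycle break down.
LANES = 8           # double precision numbers per AVX-512 register
FLOPS_PER_FMA = 2   # a fused multiply-add counts as two operations
VECTOR_UNITS = 2    # assumed number of vector units per core


def flops_per_cycle_per_core(lanes, flops_per_fma, vector_units):
    """Peak double precision flops one core can issue per cycle."""
    return lanes * flops_per_fma * vector_units


def implied_clock_hz(peak_flops, cores, flops_per_cycle):
    """Clock frequency needed to reach peak_flops with the given core count."""
    return peak_flops / (cores * flops_per_cycle)


if __name__ == "__main__":
    per_core = flops_per_cycle_per_core(LANES, FLOPS_PER_FMA, VECTOR_UNITS)
    clock = implied_clock_hz(PEAK_FLOPS, CORES, FLOPS_PER_CYCLE_PER_CORE)
    print(f"flops per cycle per core: {per_core}")
    print(f"implied clock frequency: {clock / 1e9:.2f} GHz")
```

Under these assumptions the quoted numbers are consistent with a clock of roughly 1.3 GHz, which is noticeably lower than typical multicore CPUs: the performance comes from width and core count rather than frequency.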
I predict that KNL in itself will become a successful processor architecture for HPC. I also predict that one should look at KNL as a proxy for what the future will bring. Tuning applications now for KNL, by implementing a hybrid MPI + OpenMP parallelization that scales to tens of threads per MPI process and by enabling the core loops to vectorize well, will pay off on any architecture. The new memory level is also something that one should be able to exploit in the algorithms: the new flop-monsters need to be fed, and optimizing memory traffic will become ever more important.
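Why memory traffic matters so much can be illustrated with a simple roofline-style estimate: a kernel can run no faster than the peak flop rate, and no faster than the memory bandwidth multiplied by the number of flops it performs per byte moved. The sketch below uses the 3 Tflops peak and the 5x bandwidth ratio from the text. The DDR4 bandwidth of 90 GB/s is purely an assumed placeholder, not a measured or quoted figure, and the function names are hypothetical.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline estimate: performance is capped by compute or by memory traffic."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


def break_even_intensity(peak_flops, bandwidth_bytes_per_s):
    """Flops per byte a kernel needs before it stops being memory bound."""
    return peak_flops / bandwidth_bytes_per_s


PEAK = 3e12            # per-socket peak from the text
DDR4_BW = 90e9         # ASSUMPTION: placeholder DDR4 bandwidth in bytes/s
HBM_BW = 5 * DDR4_BW   # "5x more bandwidth" from the text

# A triad-like loop a[i] = b[i] + s * c[i] on doubles:
# 2 flops per element, 24 bytes moved (two loads and one store).
TRIAD_INTENSITY = 2 / 24

if __name__ == "__main__":
    for name, bw in (("DDR4", DDR4_BW), ("on-socket memory", HBM_BW)):
        rate = attainable_flops(PEAK, bw, TRIAD_INTENSITY)
        need = break_even_intensity(PEAK, bw)
        print(f"{name}: triad reaches about {rate / 1e9:.1f} Gflops, "
              f"break-even at {need:.1f} flops/byte")
```

With these placeholder numbers a streaming loop reaches only a small fraction of the peak from either memory level, which is why restructuring algorithms to reuse data while it sits in fast memory pays off.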
Pekka Manninen: First steps towards practical quantum computing
Rather than storing information as 0’s and 1’s as conventional computers (from tablets to supercomputers) do, a quantum computer uses qubits, which can be 0 and 1 at the same time. This quantum superposition, along with the quantum mechanical phenomena of entanglement and tunneling, enables quantum computers to consider and manipulate all combinations of bits simultaneously.
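One way to get a feel for "all combinations of bits" is to write down, on a conventional computer, the list of amplitudes that describes a register of qubits. The sketch below is a classical bookkeeping exercise rather than a quantum algorithm: it builds an equal superposition over every n-bit string and shows that the description grows as 2^n. The function names are made up for this illustration.

```python
import math


def uniform_superposition(n_qubits):
    """Amplitudes of a state with equal weight on every n-bit string."""
    dim = 2 ** n_qubits
    amplitude = 1.0 / math.sqrt(dim)
    return [amplitude] * dim


def measurement_probabilities(state):
    """Probability of reading out each bit string when the register is measured."""
    return [abs(a) ** 2 for a in state]


if __name__ == "__main__":
    for n in (1, 2, 10, 20):
        state = uniform_superposition(n)
        total = sum(measurement_probabilities(state))
        print(f"{n} qubits: {len(state)} amplitudes, total probability {total:.6f}")
```

Every extra qubit doubles the number of amplitudes a conventional computer has to store, which gives a rough sense of why simulating even modest quantum registers classically becomes expensive.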
The first commercial realization of a quantum computer, by the Canadian startup D-Wave Systems, has gained interest this year, with announced installations at e.g. Lockheed Martin, a joint procurement by Google and NASA, and a joint procurement by Los Alamos and Sandia National Laboratories. Their solution is a quantum annealing computer consisting of 1024 (in fact a bit more) qubits based on superconducting loops. It is capable of finding an energy minimum of a more or less arbitrarily complex optimization problem in constant time (exactly one operation). Early benchmarks by Google show a 10^8-fold speedup in quantum system simulation compared to a Quantum Monte Carlo calculation on a conventional computer.
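To make "finding an energy minimum of an optimization problem" concrete, the sketch below writes a tiny problem as an Ising-style energy over spins that are each -1 or +1, and finds the minimum by trying every combination. The Ising formulation, the three-spin example and all names are assumptions chosen for illustration. The brute-force search is a classical stand-in that needs 2^n evaluations; it shows the kind of problem being solved, not how an annealer works internally.

```python
from itertools import product


def ising_energy(spins, h, J):
    """Energy = sum_i h[i]*s_i + sum_(i,j) J[(i,j)]*s_i*s_j, with s_i in {-1, +1}."""
    energy = sum(h[i] * s for i, s in enumerate(spins))
    energy += sum(c * spins[i] * spins[j] for (i, j), c in J.items())
    return energy


def brute_force_ground_state(h, J):
    """Try all 2^n spin configurations and return one with the lowest energy."""
    n = len(h)
    return min(product((-1, 1), repeat=n), key=lambda s: ising_energy(s, h, J))


# Hypothetical 3-spin problem: local fields h and pairwise couplings J.
H_FIELDS = [0.5, 0.0, 0.5]
COUPLINGS = {(0, 1): 1.0, (1, 2): 1.0}

if __name__ == "__main__":
    best = brute_force_ground_state(H_FIELDS, COUPLINGS)
    print("lowest-energy spins:", best)
    print("energy:", ising_energy(best, H_FIELDS, COUPLINGS))
```

For three spins the search covers only eight configurations, but the count doubles with every added spin, which is what makes large instances of such problems hard for conventional computers.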
My related prediction for the year 2016 is that D-Wave Systems will have their order book full, announcing more and more installations by national laboratories and leading computing centres across the globe. In addition, we are going to see the first computational challenges (intractable by conventional computers) solved by quantum computers and published in scientific journals. Even if it is unlikely that CSC will be installing a quantum computer in the near future, some introductory activity on this topic may be coming up. Stay tuned!