U.S. Department of Commerce | National Oceanic & Atmospheric Administration | NOAA Research

Massively Parallel Fine-Grain Computing

A new generation of high-performance computing has emerged that we call Massively Parallel Fine-Grain (MPFG). MPFG computing falls into two general categories: Graphics Processing Units (GPUs) from NVIDIA and AMD, and Many Integrated Core (MIC) processors from Intel. These chips are up to 10-20 times more powerful (by the peak-performance measure) than traditional CPUs. Rather than the up to 24 powerful cores found in a typical CPU, they rely on hundreds to thousands of simple compute cores that execute calculations simultaneously. Almost all of the fastest supercomputers in the world are either GPU- or MIC-based systems. Two of the largest such systems, Oak Ridge National Laboratory's (ORNL) Titan, with 18,688 GPUs, and the Texas Advanced Computing Center's (TACC) Stampede, with 6,880 MICs, are being used by NOAA.

Figure 1: NVIDIA Kepler GPU (2012), 2880 cores, 1.3 TeraFlops Peak

Figure 2: Intel MIC (Xeon-Phi) (2012), 61 cores, 1.2 TeraFlops Peak

Typical MPFG-based System Configuration

Figure 3: Two compute nodes, each containing one dual-socket Intel Sandy Bridge CPU (16 cores) and two attached accelerators, connected via the system interconnect

Currently, GPUs and MICs must be attached to a host device, typically a conventional CPU. One or more accelerators can be attached to a single compute node, as shown in Figure 3. Typical node configurations on systems NOAA uses include:

  • TACC Stampede: Intel Sandy Bridge (16 cores, 2 sockets) + 1 MIC/node (some nodes have 2 MICs/node)
  • TACC Maverick: Intel Ivy Bridge (20 cores, 2 sockets) + 2 GPUs/node
  • ORNL Titan: AMD Opteron (16 cores, 1 socket) + 1 GPU/node

Large increases in compute power (peak performance) do not map directly into real application performance, however. Performance gains can be achieved only if fine-grain, or loop-level, parallelism can be found and exploited in the application. Fortunately, weather and climate codes generally contain a high degree of such parallelism, though minor to substantial code modifications may be required to expose it. Most researchers have found that changes made to improve application performance on MPFG hardware also benefit performance on CPUs.

HIWPP MPFG Activities

The High Impact Weather Prediction Project (HIWPP) is using Hurricane Sandy Supplemental Funds to develop, test, and run 3.5 km global models on MPFG computers. Activities include:

  • parallelization of WRF physics (used in the NIM) for GPU and MIC (in progress);
  • parallelization of NCAR's Model for Prediction Across Scales (MPAS) for MPFG;
  • parallelization of GFS physics for MPFG; and
  • advancement of one or more global non-hydrostatic, MPFG-ready, research models supported by HIWPP into operationally-ready models under the NWS Next Generation Global Prediction System (NGGPS) project.

These HIWPP activities build on successful R&D work on MPFG at NOAA's Earth System Research Laboratory.

Additional MPFG work is being done at NCEP and GFDL including:

  • parallelization of NMM-B physics for Intel-MIC
  • parallelization of the GFDL HIRAM model for GPU and MIC

I. MPFG Research and Development at ESRL

An illustration of the icosahedral grid used by the Non-hydrostatic Icosahedral Model (NIM) and the Flow-following finite-volume Icosahedral Model (FIM), both developed at NOAA ESRL.

ESRL's Advanced Computing Section began exploring GPUs as next-generation supercomputers in 2008 as a means to lower the cost of computing while speeding up the calculations. GPUs were originally developed for the video-gaming industry as low-cost, high-performance graphics cards (Figure 1, above) that attach as co-processors to a traditional CPU node. ESRL began exploring the Intel MIC processor (Figure 2, above) when it became available in 2012.

MPFG computing technologies influenced the design of a new global numerical weather prediction model called the Non-hydrostatic Icosahedral Model (NIM) in 2008. NIM uses an icosahedral grid structure (all hexagons except for 12 pentagons, illustrated above) to represent flow in the model across the spherical surface. The NIM was designed to run at cloud-permitting scales of 3.5 km (42 million grid cells) and is configured to run with 96 vertical levels.

GSD has also parallelized the predecessor to the NIM, called the Flow-following finite-volume Icosahedral Model (FIM), for MPFG, and portions of the Weather Research and Forecasting (WRF) model for both GPU and MIC processors.

GSD developed the F2C-ACC compiler to run the NIM, FIM, and WRF models on the GPU. Commercial Fortran GPU compilers, now available from Cray and PGI, are being evaluated regularly and will be used once they are sufficiently mature.

The notable successes at GSD on MPFG include:

  • NIM dynamics are running efficiently on CPU, GPU, and MIC with a single code;
  • GPU/MIC performance is 2-3 times faster than the CPU; and
  • Good parallel performance of the NIM has been demonstrated on up to 10,000 GPUs of ORNL's Titan system, and on up to 600 MIC nodes of TACC's Stampede system.

Further details about NIM performance, including comparisons between CPU, GPU and MIC architectures are available.

II. MPFG Research and Development at NCEP

The Environmental Modeling Center (EMC) at NCEP is responsible for developing and maintaining NWP modeling systems for operational forecasting, including the Global Forecast System (GFS), the Non-hydrostatic Multi-Scale Model (NMM-B), the Hurricane-WRF (HWRF), the High Resolution Rapid Refresh model (HRRR), WAVEWATCH III, and models that run as members of ensemble forecasts. In collaboration with other national forecast centers, EMC is working on forecast application readiness for next-generation HPC systems that will include increased thread concurrency – large numbers of OpenMP threads per MPI task – and fine-grained parallelism in the form of AVX (vector) instructions on the Intel Xeon Phi (Knights Corner and Knights Landing) processors as well as on "conventional" multi-core Xeon architectures.

Detailed analysis of column-physics components such as RRTMG radiation and other model components shows that threads executing these packages are too state-heavy for the cache memory on conventional CPU and MIC cores or the shared memory on GPUs. In spite of these constraints, various code and data restructuring techniques have yielded performance gains for RRTMG on MIC processors and have also improved performance on the host Xeon Sandy Bridge multi-core processor.
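One widely used restructuring technique of this kind is changing the data layout so that vector lanes read contiguous memory. The following is a generic C illustration of the idea, not the actual RRTMG code; all names and sizes are hypothetical. In a column-wise layout, a vector loop over columns at a fixed layer strides through memory; transposing to a layer-major layout makes that loop unit-stride and vectorizable:

```c
#include <stdio.h>

#define NCOL 256   /* columns in a block (illustrative sizes) */
#define NLAY 64    /* radiation layers */

/* Column-wise layout: one column's layers are contiguous, so a
 * vector loop over columns at a fixed layer is strided. */
static float t_bycol[NCOL][NLAY];

/* Restructured layout: all columns at one layer are contiguous,
 * so the innermost loop over columns is unit-stride. */
static float t_bylay[NLAY][NCOL];

void transpose_layout(void) {
    for (int c = 0; c < NCOL; c++)
        for (int k = 0; k < NLAY; k++)
            t_bylay[k][c] = t_bycol[c][k];
}

/* After restructuring, a per-layer kernel vectorizes cleanly */
void scale_layer(int k, float s) {
    #pragma omp simd
    for (int c = 0; c < NCOL; c++)
        t_bylay[k][c] *= s;
}
```

Layout changes like this also shrink each thread's working set per inner loop, which is one way such restructuring eases the cache and shared-memory pressure described above while benefiting the host CPU as well.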