High Impact Weather Prediction Project
Funded by Hurricane Sandy Disaster Relief Supplemental Appropriations
A new generation of high-performance computing has emerged that we call Massively Parallel Fine Grain (MPFG). MPFG computing falls into two general categories: Graphics Processing Units (GPUs) from NVIDIA and AMD, and Many Integrated Core (MIC) processors from Intel. These chips are up to 10-20 times more powerful, by peak performance, than traditional CPUs. Rather than the up to 24 powerful cores found in a typical CPU, they rely on hundreds to thousands of simple compute cores that execute calculations simultaneously. Almost all of the fastest supercomputers in the world are either GPU- or MIC-based systems. Two of the largest such systems are being used by NOAA: Oak Ridge National Laboratory's (ORNL) Titan, with 18,688 GPUs, and the Texas Advanced Computing Center's (TACC) Stampede, with 6,880 MICs.
Typical MPFG-based System Configuration
Currently, GPUs and MICs must be attached to a host device, typically a conventional CPU. One or more accelerators can be attached to a single compute node, as shown in Figure 3. Typical node configurations on systems NOAA uses include:
Large increases in compute power (peak performance) do not map directly into real application performance, however. Performance gains can be achieved only if fine-grain, loop-level parallelism can be found and exploited in the application. Fortunately, weather and climate codes generally contain a high degree of parallelism, though minor to substantial code modifications may be required to expose it. Most researchers have found that changes made to improve application performance on MPFG systems also benefit performance on CPUs.
Hurricane Sandy Supplemental Funds allow the High Impact Weather Prediction Project (HIWPP) to develop, test, and run 3.5 km global models on MPFG computers. This work includes:
These HIWPP activities build on successful R&D work on MPFG at NOAA's Earth System Research Laboratory.
Additional MPFG work is being done at NCEP and GFDL, including:
ESRL's Advanced Computing Section began exploring GPUs as next-generation supercomputers in 2008 as a means to lower the cost of computing while speeding up the calculations. GPUs were originally developed for the video-gaming industry as low-cost, high-performance graphics cards (Figure 1, above) that are attached as co-processors to a traditional CPU node. ESRL began exploring the Intel MIC processor (Figure 2, above) when it became available in 2012.
MPFG computing technologies influenced the design of a new global numerical weather prediction model called the Non-hydrostatic Icosahedral Model (NIM) in 2008. NIM uses an icosahedral grid structure (all hexagons except for 12 pentagons, pictured at right) to represent flow in the model across the spherical surface. The NIM was designed to run at cloud-permitting scales of 3.5 km (42 million grid cells) and is configured to run with 96 vertical levels.
GSD has also parallelized the predecessor to the NIM, called the Flow-following finite-volume Icosahedral Model (FIM), for MPFG, and portions of the Weather Research and Forecasting (WRF) model for both GPU and MIC processors.
GSD developed the F2C-ACC compiler to run the NIM, FIM, and WRF models on GPUs. Commercial Fortran GPU compilers, now available from Cray and PGI, are being evaluated regularly and will be adopted once they are sufficiently mature.
The notable successes at GSD on MPFG include:
The Environmental Modeling Center (EMC) at NCEP is responsible for developing and maintaining NWP modeling systems for operational forecasting, including the Global Forecast System (GFS), the Non-hydrostatic Multi-Scale Model (NMM-B), the Hurricane-WRF (HWRF), the High Resolution Rapid Refresh model (HRRR), Wave Watch III, and models that run as members of ensemble forecasts. In collaboration with other national forecast centers, EMC is working on forecast application readiness for next-generation HPC systems. These systems will feature increased thread concurrency (large numbers of OpenMP threads per MPI task) and fine-grained parallelism in the form of AVX vector instructions on the Intel Xeon Phi (Knights Corner and Knights Landing) processors, as well as on "conventional" multi-core Xeon architectures.
Detailed analysis of column-physics components such as RRTMG radiation shows that the threads executing these packages carry too much state for the cache memory of conventional CPU and MIC cores, or for shared memory on GPUs. In spite of these constraints, code and data restructuring techniques have yielded performance gains for RRTMG on MIC processors and have also improved performance on the host Xeon Sandy Bridge multi-core processor.