
Full Record Details

Persistent URL http://purl.org/net/epubs/work/12153206
Record Status Checked
Record Id 12153206
Title Hybrid strategies for the NEMO ocean model on multi-core processors
Abstract We present the results of an investigation into strategies for introducing OpenMP to the NEMO oceanographic code. At present NEMO is parallelised using only the Message Passing Interface (MPI), with a two-dimensional domain decomposition in latitude/longitude. This version performs well on core counts of up to a few thousand. However, the number of cores per CPU, and consequently per supercomputer node, continues to increase, a trend exaggerated by the introduction of many-core accelerators such as GPUs alongside traditional CPUs. Hybrid, or mixed-mode, parallelisation is therefore an attractive option for improving application scaling at large core counts. In this approach, typically just one or two MPI processes are placed on each compute node, significantly reducing the number of off-node message exchanges. Each MPI process is then itself parallelised using OpenMP to make effective use of the multiple cores on the node. As of version 4.0, the OpenMP standard will provide support for many-core accelerators, so this hybrid approach should also be applicable to the increasing number of machines with GPU or Intel Xeon Phi co-processors. We have implemented OpenMP parallelism using both loop-level and tiling/coarse-grained approaches and report on their effectiveness. Of these, the loop-level approach is the simplest: each doubly- or triply-nested loop is individually parallelised using OMP DO...OMP END DO directives, all enclosed within a single PARALLEL region. An extension of this approach is to flatten each loop nest into a single, large loop, with the previously rank-three arrays converted to rank-one vectors. This exposes finer-grained parallelism, possibly at the expense of the SIMD auto-vectorisation performed on the loop by the compiler.
In the tiling approach, the sub-domain assigned to each MPI process is further sub-divided into overlapping ‘tiles’, and it is the loop over tiles that is parallelised over the OpenMP threads. The size and shape of the tiles are key factors in exploiting the processor cache hierarchy and hence strongly influence overall performance. We have also examined the effect of array-index ordering. In NEMO, the 3D arrays have the level/depth index outermost, so the outer loop of the vast majority of loop nests runs over this index, and it is this loop that is parallelised in the loop-level approach to using OpenMP. We investigate the implications of this choice by applying the various OpenMP approaches to a version of NEMO adapted to have the level index innermost; the loop over this index is then normally the best candidate for auto-vectorisation by the compiler. The proposed approaches have been applied to two different forms of the tracer advection kernel (MUSCL and TVD) from NEMO and evaluated on an IBM Power6 cluster at CMCC, Italy, an IBM iDataPlex with Intel Westmere CPUs at CINECA, Italy, a dual-socket Intel Sandy Bridge system at STFC Daresbury, UK, and a Cray XE6 (HECToR), UK.
Organisation STFC, HC
Language English (EN)
Type Presentation
Details Presented at Exascale Applications and Software Conference 2013 (EASC 2013), Edinburgh, Scotland, 9-11 Apr 2013.
Year 2013