open access publication

Preprint, 2024

Refactoring the EVP solver for improved performance – a case study based on CICE v6.5

EGUsphere, Volume 2024, Pages 1-23, 10.5194/gmd-2024-40

Contributors

Rasmussen, Till Andreas Soya 0000-0003-2931-5180 [1]; Poulsen, Jacob W [2]; Ribergaard, Mads Hvid [1]; Sasanka, Ruchira [2]; Craig, Anthony P [3]; Hunke, Elizabeth Clare 0000-0002-7033-6031 [4]; Rethmeier, Stefan [1]

Affiliations

  [1] Danish Meteorological Institute
      [NORA names: DMI Danish Meteorological Institute; Governmental Institutions; Denmark; Europe, EU; Nordic; OECD]
  [2] Intel (United States)
      [NORA names: United States; America, North; OECD]
  [3] Contractor to Science and Technology Corporation, Seattle, WA
      [NORA names: United States; America, North; OECD]
  [4] Los Alamos National Laboratory
      [NORA names: United States; America, North; OECD]

Abstract

This study focuses on the performance of CICE and its Elastic-Viscous-Plastic (EVP) dynamical solver. The study was conducted in two steps. First, the standard EVP solver was extracted from CICE for experiments with refactored versions of it. Second, one refactored version was integrated and tested as part of the full model. Two dominant bottlenecks were revealed. The first is the number of MPI and OpenMP synchronization points required for halo exchanges during each time step, combined with the irregular domain of active sea ice points. The second is the lack of Single Instruction Multiple Data (SIMD) code generation. The study refactors the standard EVP solver based on two generic patterns. The first pattern shows how general finite differences on masked multi-dimensional arrays can be expressed to produce significantly better code generation. The primary change is that the memory access pattern is changed from random access to direct access. The second pattern shows an alternative approach to handling static grid properties. The measured single-core performance improves by more than a factor of five compared to the standard implementation. The refactored implementation scales strongly on the Intel® Xeon® Scalable Processor Series node until the available bandwidth of the node is used. For the Intel® Xeon® CPU Max Series, there is sufficient bandwidth to allow the strong scaling to continue across all the cores on the node, resulting in a single-node improvement factor of 35 over the standard implementation. This study also shows improved performance on GPUs.
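
The change from random to direct access described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the data layout used in CICE, so everything here is an assumption made for illustration: the function names (`build_active`, `diff_indirect`, `compress`, `diff_direct`), the east-west difference as the example stencil, and the choice to let a missing neighbour point back at the cell itself. It is not the authors' implementation. The sketch only shows that a masked 2-D finite difference addressed through an index list can be rewritten as a loop over packed 1-D data with a neighbour table built once.

```python
# Illustrative sketch only: all names are hypothetical, not CICE identifiers.

def build_active(mask):
    """List the (i, j) indices of active cells in row-major order."""
    return [(i, j) for i, row in enumerate(mask) for j, m in enumerate(row) if m]


def diff_indirect(field, mask):
    """East-west difference on 2-D arrays addressed through an index list."""
    out = [[0.0] * len(row) for row in field]
    for i, j in build_active(mask):
        # Each access goes through (i, j) values known only at run time.
        if j + 1 < len(mask[i]) and mask[i][j + 1]:
            out[i][j] = field[i][j + 1] - field[i][j]
    return out


def compress(field, mask):
    """Pack active cells into a 1-D list and precompute each east neighbour."""
    active = build_active(mask)
    pos = {ij: k for k, ij in enumerate(active)}
    values = [field[i][j] for i, j in active]
    # Assumption: a missing east neighbour points at the cell itself,
    # so the difference there evaluates to zero without a branch.
    east = [pos.get((i, j + 1), k) for k, (i, j) in enumerate(active)]
    return active, values, east


def diff_direct(values, east):
    """Same difference on packed data: the loop index k addresses the
    centre value and the output directly, with no mask test in the loop."""
    return [values[east[k]] - values[k] for k in range(len(values))]
```

In this sketch the neighbour table `east` depends only on the mask, so it would be built once and reused while the mask is unchanged. Pure Python will not show the speed-up reported in the abstract. The point is the loop shape: in a compiled language, a branch-free loop over contiguous data is the form a vectorising compiler would be expected to handle better.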

Keywords

CICE, GPU, GPU processors, Instruction Multiple Data, Intel, MPI, Multiple Data, OpenMP, Single Instruction Multiple Data, access, access patterns, alternative approach, approach, array, bandwidth, bottleneck, case study, cases, changes, code, code generation, core, core improvement, data, domain, dominant bottleneck, dynamic solver, elastic-viscous-plastic, exchange, experiments, factors, finite-difference, full model, generation, grid properties, halo, halo exchange, ice point, implementation, improved performance, improvement, irregular domains, lack, memory, memory access patterns, model, multi-dimensional arrays, nodes, patterns, performance, point, primary changes, processor, properties, random access, refactored version, refactoring, scale, series, series series, solver, standard implementation, strong scaling, study, synchronization points, version

Data Provider: Digital Science