Large-data memory layout for equation systems

Data layout requirements and design
Possibilities
Discussion
Assessment of the Blaze library that offers a similar choice
Assessment of the Kokkos library that offers a similar choice
Requirements
Data layout implementation and assembly
Data layout benchmark
Conclusion

This page discusses the requirements, design, implementation, and verification of a data container with zero-cost compile-time-configurable memory layout for storing a 3-dimensional array of real values. The three dimensions are

Equation index, as required for a coupled multi-system equation solver,
Component index, a specific scalar component of a system of equations, and
Unknown index, as required by some discretization of a numerical solution, representing field variables at discrete locations, e.g., particles or a mesh entity such as cell, face, node, etc.

Data layout requirements and design

How should field values, storing particle properties (or solution unknowns in mesh cells or nodes) should be stored in memory? How should mesh field data (associated to nodes, elements, faces, etc., in Inciter (Compressible flow solver)) should be stored in memory? These are the single largest chunks of data a particle-, and/or mesh-based code operates on. The data layout, i.e., how the data is stored and organized in memory, determines how the data is accessed and potentially has a first-degree effect on overall performance.

Possibilities

Unknown-major, in which various physical properties, e.g., position, velocity, energy, etc., i.e., the unknowns in a solver, of a single particle or mesh entity are close to each other in memory. For example:
where the are governed by one equation (e.g., position), the are governed by another equation (e.g., velocity), and the are governed by a third equation (e.g., energy), etc. Here the first letter denotes a physical quantity, while the second is the particle number or mesh entity index. If the algorithm advances the properties at the unknowns one equation at a time, data in memory is accessed by having to jump a distance that corresponds to the number of scalar physical variables per particle or mesh entity. In the example, the update will have to jump as are updated.
Property-major, in which the same type of physical properties are close to each other in memory. For example,
The legend here is the same as in unknown-major: the first letter denotes a physical quantity, while the second is the particle or mesh entity index. If the algorithm advances the properties at the unknowns one equation at a time, data in memory is accessed contiguously in memory as the properties are contiguously stored.

Discussion

A property-major storage, case 2 above, seems to be the most efficient at first sight, as it stores data, as it is read and written by the equation algorithms, contiguously. However, data access is contiguous only if the particle properties (or data stored at mesh entities) are independent, i.e., if there is no coupling among the equations. Unfortunately, this is rarely the case, at least not for fluid dynamics or any computational physics problem that is interesting enough. For example in a Lagrangian particle code, velocity is used by the position update, and velocity is required by the energy update. The same is true for a mesh-based solver, where the physical variables are coupled, i.e., their update needs other physical variables at the same mesh entity (and usually that of the neighbors). Depending on the physical approximation, density (or mass) may be required for all equations. The stronger the equations are coupled, the more far-reads are required for a given update with a property-major data layout. These far-reads are practically always cache misses, as the property-major storage stores the physical variables for the same particle or mesh entity very far in memory, e.g., the distance between and is the number of particles or mesh entities. While the unknown-major storage, case 1 above, inherently stores data non-contiguously, the distance between properties of a single particle (or a single mesh entity) is relatively small, i.e., the number of properties, which may incur less cache misses as several particles (or nodes) but all of their physical properties at the same unknown could fit into cache.

Assuming strong coupling among the variables, the unknown-major storage will be favored, but it would be nice if the design allowed for both layouts, so depending on the type of equations the most appropriate layout could be selected. If such a design is maintainable, there is still a question whether the data layout selection should be done at compile-, or run-time.

Assessment of the Blaze library that offers a similar choice

Have looked at https://bitbucket.org/blaze-lib/blaze which implements row-, and column-major matrix classes based on a template argument. See, e.g., <blaze-1.5>/blaze/math/dense/StaticMatrix.h, which reveals that the template argument (bool) SO selects between row-, or column-major internal storage. Then SO is used at both compile-time (e.g., by the class-user, when instantiating the type of the matrix), as well as run-time (e.g., the implementation of isDefault()). Both compile-time and run-time usage of the SO template arguments are problematic:

The compile-time usage duplicates a lot of code by having to provide similar implementations for the element-access operator() of StaticMatrix specialized to column-major. There is a generic implementation for SO for everything that is agnostic of SO, and there is a specialization when SO is column-major.
The run-time usage also duplicates code by doing an if-test on SO in, e.g., isDefault(). Is there a better way of doing this? If there are only two types of data layout (unknown-, and property-major), code duplication should not be too much of an issue. However, the implementation of unknown-property data access must be absolutely zero run-time cost. This means the selection must be at compile-time and the element access must be absolutely invisible to client code. In other words, there must be no re-implementation of a time-integrator for an equation just because the data access is different.

Assessment of the Kokkos library that offers a similar choice

Since the first implementation of the configurable data layout, described below, the Kokkos library, https://github.com/kokkos/kokkos , from Sandia National Labs, has been released. Kokkos appears to have all the requirements described here, and many more other features. It does have compile-time configurable data layouts, views, for the purpose of optimal data access on various compute devices, such as multi-core CPUs, many-core accelerators, and Graphics Processing Units. Kokkos appears to provide an excellent abstraction for data layout abstraction. However, Kokkos provides a lot more functionality than just data layout abstraction, e.g., it can generate low level code for various devices using its configurable views. The currently available Kokkos back-ends are OpenMP, pthreads, and CUDA. Thus Kokkos provides abstractions for shared-memory parallelism. While Kokkos has been successfully used in Charm++ code, at this point (Jan 2016) we opt for not adopting Kokkos' views and keep our original data layout abstractions, discussed below. The reasons:

At this point we are too early in the development of Quinoa and its tools to be able to say that the current abstractions Charm++ provides are sufficient or not to strike the correct balance between performance and productivity. In particular, it could be possible that Charm++'s abstractions (using overdecomposition), which we use already, are enough. In that case, using a single abstraction for parallelism is preferable to two.
At this point we would really only need the compile-time configurable data layout abstractions from Kokkos and not the other rich features, such as abstractions for loop-level shared-memory parallelism.

As far I can tell, the views of an array in Kokkos provide a very similar abstraction for data layout and thus memory access what is described below.

Requirements

Is it possible to implement a compile-time configurable data-access policy via a thin data-access interface with zero run-time cost, no code-duplication, and in a way that is invisible to client code? Yes. See below.

Zero-cost is achieved via type-based compile-time polymorphism. This is controlled via a cmake variable.

The particle (or mesh entity) data is a logically 3-dimensional array that stores the particle properties or unknowns at mesh entities, e.g., faces, cell centers, nodes, etc.

For clarity, the discussion below will use the expression "particle" for a particle-code and will not specifically mention mesh-based unknowns, stored at the nodes, elements, faces, etc., of a computational mesh. The former is for a particle-code, the latter is for a mesh-code. The data layout discussion is independent of whether particles or mesh entities (nodes, cells, etc.) are used.

In principle there are a total of 6 permutations:

1. ParEqComp: [ particle ] [ equation ] [ component ]
2. ParCompEq: [ particle ] [ component ] [ equation ]
3. EqCompPar: [ equation ] [ component ] [ particle ]
4. EqParComp: [ equation ] [ particle ] [ component ]
5. CompEqPar: [ component ] [ equation ] [ particle ]
6. CompParEq: [ component ] [ particle ] [ equation ]

Of these 6 we only consider those where component follows equation. (For those layouts where equation follows component the access would be unnecessarily complicated by the potentially unequal number of components for different equations which is not known at compile-time and thus does not allow some optimizations.) This decision leaves us with the following choices:

1. ParEqComp: [ particle ] [ equation ] [ component ]
3. EqCompPar: [ equation ] [ component ] [ particle ]
4. EqParComp: [ equation ] [ particle ] [ component ]

Access is based on the 2 coordinates: particle (or unknown), and component. Particle is the particle index, component denotes the given component of a vector equation, e.g., velocity has 3 components, a multi-material turbulent mix model governed by the Dirichlet SDE has scalars (components). Using these 2 coordinates the index calculations for the above 3 cases are:

1. ParEqComp: [ particle ] [ equation ] [ component ]

     baseptr + particle*nprop + component

where nprop is the total number of particle properties, e.g., 3 positions, 3 velocities, 5 scalars nprop = 11.

3. EqCompPar: [ equation ] [ component ] [ particle ]

     baseptr + component*npar + particle

where npar is the total number of particles.

4. EqParComp: [ equation ] [ particle ] [ component ]

     baseptr + npar + nce*particle + component

where nce is the number of components for the given equation. Since this would require another function argument (besides particle and component), and it costs an integer-multiply more than the other two layouts, we dismiss this layout, and only implement the following two:

1. ParEqComp - Particle-major
3. EqCompPar - Equation-major

These options are exposed via a cmake variable and can be switched before a build.

Data layout implementation and assembly

This section documents the implementation and the assembly code, produced by the compilers, of the compile-time configurable data-access policy discussed above. The implementation is via a thin data-access interface with zero run-time cost, no code-duplication, and in a way that is invisible to client code.

Zero-runtime-cost data-layout wrappers with type-based compile-time dispatch

Tags for selecting particle-, or property-major data layout policies:

const bool ParticleMajor = true;
const bool PropertyMajor = false;

Essentially, the implementation is as follows:

template< bool Major >
class Data {

  private:
   // Transform a compile-time bool into a type
   template< bool m >
   struct int2type {
     enum { value = m };
   };

   // Overloads for particle-, and property-major accesses
   inline
   tk::real& access( int particle, int property, int2type<ParticleMajor> ) {
     return *(m_ptr + particle*m_nprop + property);
   }
   inline
   tk::real& access( int particle, int property, int2type<PropertyMajor> ) {
     // This is the same for now, not called, irrelevant in zero-cost-test
     return *(m_ptr + particle*m_nprop + property);
   }

   tk::real* const m_ptr;
   const int m_nprop;

  public:
    // Constructor
    Data( tk::real* const ptr, int nprop ) :
      m_ptr(ptr), m_nprop(nprop) {}

    // Access dispatch
    inline tk::real& operator()( int particle, int property ) {
      return access( particle, property, int2type<Major>() );
    }
};

Test of zero-cost

Test by adding to Dirichlet (derived equation class) constructor (client code):

Data< Layout > d( particles, m_nprop );
Model::aa = d( 34, 3 );
Model::bb = *(m_particles + 34*m_nprop + 3);

Add to Model:

Model {
  ...
  public:
    tk::real aa;
    tk::real bb;
  ...
}

Add to Physics constructor:

std::cout << m_mix->aa << m_mix->bb;

This is so the optimizing compiler cannot entirely optimize the assignments of aa and bb away.

Debug assembly

Generated assembly code with CMAKE_BUILD_TYPE=DEBUG (i.e., without optimization) of the assignments of aa (line 42) and bb (line 43) in Dirichlet's constructor, with clang -g -S -mllvm --x86-asm-syntax=intel, gnu and intel generate very similar code:

     .loc    143 42 20  ; <quinoa>/src/DiffEq/Dirichlet.h:42:20
.Ltmp27038:
     lea     RDI, QWORD PTR [RBP - 56]
     mov     ESI, 34
     mov     EDX, 3
     call    _ZN6quinoa4DataILb1EEclEii
.Ltmp27039:
     mov     QWORD PTR [RBP - 176], RAX # 8-byte Spill
     jmp     .LBB2550_7
.LBB2550_7:
     mov     RAX, QWORD PTR [RBP - 176] # 8-byte Reload
     movsd   XMM0, QWORD PTR [RAX]
     mov     RCX, QWORD PTR [RBP - 64] # 8-byte Reload
     movsd   QWORD PTR [RCX + 8], XMM0
     .loc    143 43 0    ; <quinoa>/src/DiffEq/Dirichlet.h:43:0
     mov     RDX, QWORD PTR [RCX + 32]
     imul    ESI, DWORD PTR [RCX + 48], 34
     movsxd  RDI, ESI
     shl     RDI, 3
     add     RDX, RDI
     movsxd  RDI, DWORD PTR [RCX + 52]
     movsd   XMM0, QWORD PTR [RDX + 8*RDI + 24]
     movsd   QWORD PTR [RCX + 16], XMM0

Line 42 translates to register loads and a function call into tk::Data, while line 43 translates to some integer arithmetic of the address and loads.

Excerpt from the Intel® 64 and IA-32 Architectures Software Developer Manual

The LEA (load effective address) instruction computes the effective address in memory (offset within a segment) of a source operand and places it in a general-purpose register. This instruction can interpret any of the processor’s addressing modes and can perform any indexing or scaling that may be needed. It is especially useful for initializing the ESI or EDI registers before the execution of string instructions or for initializing the EBX register before an XLAT instruction.
The MOVSXD instruction operates on 64-bit data. It sign-extends a 32-bit value to 64 bits. This instruction is not encodable in non-64-bit modes.
A common type of operation on packed integers is the conversion by zero- or sign-extension of packed integers into wider data types. SSE4.1 adds 12 instructions that convert from a smaller packed integer type to a larger integer type (PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXWD, PMOVZXWD, PMOVSXBQ, PMOVZXBQ, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ). The source operand is from either an XMM register or memory; the destination is an XMM register.
IMUL Signed multiply. The IMUL instruction multiplies two signed integer operands. The result is computed to twice the size of the source operands; however, in some cases the result is truncated to the size of the source operands.
SAL/SHL Shift arithmetic left/Shift logical left. The SAL (shift arithmetic left), SHL (shift logical left), SAR (shift arithmetic right), SHR (shift logical right) instructions perform an arithmetic or logical shift of the bits in a byte, word, or doubleword. The SAL and SHL instructions perform the same operation. They shift the source operand left by from 1 to 31 bit positions. Empty bit positions are cleared. The CF flag is loaded with the last bit shifted out of the operand.

Optimized assembly

Generated assembly code with CMAKE_BUILD_TYPE=RELWITHDEBINFO (i.e., with optimization) of the assignments of aa (line 42) and bb (line 43) in Dirichlet's constructor, with clang -O2 -g DNDEBUG -S -mllvm --x86-asm-syntax=intel, gnu and intel generate very similar optimized code:

.loc    144 42 20  ; <quinoa>/src/DiffEq/Dirichlet.h:42:20
movsd   XMM0, QWORD PTR [R14 + 8*RAX + 24]
movsd   QWORD PTR [R13 + 8], XMM0
.loc    144 43 0    ; <quinoa>/src/DiffEq/Dirichlet.h:43:0
mov     RCX, QWORD PTR [R13 + 32]
movsd   XMM0, QWORD PTR [RCX + 8*RAX + 24]
movsd   QWORD PTR [R13 + 16], XMM0

Both lines 42 and 43 translate to very similar SSE loads with pointer arithmetic, i.e., line 42 costs the same as line 43.

Excerpt from the Intel® 64 and IA-32 Architectures Software Developer Manual

The MOVS instruction moves the string element addressed by the ESI register to the location addressed by the EDI register. The assembler recognizes three “short forms” of this instruction, which specify the size of the string to be moved: MOVSB (move byte string), MOVSW (move word string), and MOVSD (move doubleword string).
The MOVSD (move scalar double-precision floating-point) instruction transfers a 64-bit double-precision floating- point operand from memory to the low quadword of an XMM register or vice versa, or between XMM registers. Alignment of the memory address is not required, unless alignment checking is enabled.

Data layout benchmark

This section documents the benchmark of the implementation of the compile-time configurable data-access policy discussed above. The implementation is via a thin data-access interface with zero run-time cost, no code-duplication, and in a way that is invisible to client code.

Quinoa::Walker input file used for the benchmark

Note that walker is no longer part of Quinoa. Walker's code can be pulled from Quinoa's git history.

We will integrate for the duration of a 100,000 time steps a system of 100 coupled non-linear stochastic differential equations (SDEs) whose statistically stationary solution converge to the Dirichlet distribution and measure the wall-clock time. For more on the Dirichlet SDE, see src/DiffEq/Dirichlet.h.

walker

  nstep 100000  # Max number of time steps
  term  140.0   # Max time
  dt    0.05    # Time step size
  npar  40000   # Number of particles

  ttyi  100     # TTY output interval

  rngs
    mkl_mrg32k3a seed 0 end
  end

  dirichlet
    ncomp 100  # = K = N-1
    b     0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5
          0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5
          0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5
          0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5
          0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5
          0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5
          0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5
          0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5
          0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5
          0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5
    end
    S     0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4
          0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4
          0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4
          0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4
          0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4
          0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4
          0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4
          0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4
          0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4
          0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4
    end
    kappa 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3
          0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3
          0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3
          0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3
          0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3
          0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3
          0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3
          0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3
          0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3
          0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3
    end
    rng mkl_mrg32k3a
  end

  statistics    # Estimate statistics
    <Y1>        # mean of Y1
    <Y2>
    <y1y1>      # variance of Y1 = <(Y1-<Y1>)^2> = <y1^2>
    <y2y2>
    <y1y2>
  end

end

Ptr - Working with raw pointers

This algorithm gets the starting raw pointer from which the given particle data is (contiguously) accessible in memory and simply adds integers to the address to access and update the 100 components specified above. The algorithm assumes a particular data layout – it only works with the particle-major storage – a logically 3-dimensional array with [ particle ] [ equation ] [ component ] layout.

Layout-dependent algorithm:

//! Advance particles
void advance(int p, int tid, tk::real dt) override {
  // Get access to particle scalars
  tk::real* y = m_particles.ptr() + p*m_nprop;
  
  // Compute Nth scalar
  tk::real yn = 1.0 - y[0];
  for (int i=1; i<m_ncomp; ++i) yn -= y[i];
  
  // Generate Gaussian random numbers with zero mean and unit variance
  tk::real dW[m_ncomp];
  m_rng->gaussian( tid, m_ncomp, dW );
  
  // Advance first m_ncomp (K=N-1) scalars
  for (int i=0; i<m_ncomp; ++i) {
    tk::real d = m_k[i]*y[i]*yn*dt;
    if (d > 0.0) d = sqrt(d); else d = 0.0;
    y[i] += 0.5*m_b[i]*(m_S[i]*yn - (1.0-m_S[i])*y[i])*dt + d*dW[i];
  }
}

Par - Access via particle-major layout policy

This algorithm accesses particle data via the wrapper class, tk::Data, in a data-layout-agnostic fashion. Access itself via this class is demonstrably "zero-cost", i.e., an optimizing compiler completely optimizes the abstraction away: see Data layout implementation and assembly for the assembly generated by 3 compilers. However, writing an SDE-advance algorithm in a data-layout-agnostic manner, requires index calculations at every particle-access compared to working with raw pointers, as described above. Thus the following tests are designed to measure only the additional index calculations that the layout-agnostic access entails.

Layout-independent algorithm:

//! Advance particles
void advance(int p, int tid, tk::real dt) override {
  // Compute Nth scalar
  tk::real yn = 1.0 - m_particles(p, 0);
  for (int i=1; i<m_ncomp; ++i) yn -= m_particles(p, i);

  // Generate Gaussian random numbers with zero mean and unit variance
  tk::real dW[m_ncomp];
  m_rng->gaussian( tid, m_ncomp, dW );

  // Advance first m_ncomp (K=N-1) scalars
  for (int i=0; i<m_ncomp; ++i) {
    tk::real d = m_k[i] * m_particles(p, i) * yn * dt;
    if (d > 0.0) d = sqrt(d); else d = 0.0;
    m_particles(p, i) +=
      0.5*m_b[i]*(m_S[i]*yn - (1.0-m_S[i]) * m_particles(p, i) )*dt
      + d*dW[i];
  }
}

Comparison of the algorithms

DEBUG mode uses -O0 and does not optimize function calls away for all of three compilers tested. RELEASE mode uses -O3 and the abstraction is completely optimized away. However, index calculations still remain compared to a layout-dependent advance algorithm.

Total time measured in micro-seconds, run on a Lenovo laptop with Intel Core i7, 8 compute cores:

Run	Ptr	Par	Par/Ptr
clang/DEBUG	150350236	338851735	2.2537 x slowdown
clang/RELEASE	98157742	104077139	1.0603 x slowdown
DEBUG/RELEASE	1.5317 x speedup	3.2558 x speedup	n/a

Run	Ptr	Par	Par/Ptr
gnu/DEBUG	161603164	386646353	2.3926 x slowdown
gnu/RELEASE	94747953	98187568	1.0363 x slowdown
DEBUG/RELEASE	1.7056 x speedup	3.9378 x speedup	n/a

Run	Ptr	Par	Par/Ptr
intel/DEBUG	171691440	608407412	3.5436 x slowdown
intel/RELEASE	90059133	89892665	0.99815 x speedup
DEBUG/RELEASE	1.9064 x speedup	6.7682 x speedup	n/a

Data layout benchmark discussion

As expected, inlining has a significant effect on performance: going from DEBUG to RELEASE mode yields a significant speedup with all three compilers, see last, DEBUGcode{.cmake}RELEASE, rows.
As expected, the additional index calculations required by layout-agnostic access do take a performance hit: though only 6% with clang, and 3% with gnu, see last, Par/Ptr, columns.
Surprisingly, the layout-agnostic access is even a tiny bit faster than the layout-dependent algorithm with the Intel compiler with -O3.

Conclusion

As the implementation is not a significant performance hit, the equation advancement algorithms and particle-, and mesh-property data access are used only via the data-layout-independent interface. The data layout can be changed at compile time. Data access is abstracted (and optimized) away.

Note that the above discussion is independent of whether a particle-code or a mesh-code is used. From the data layout viewpoint the particle-major and mesh-field-major are equivalent and in the code, and in tk::Data, this is called the unknown-major. See src/Base/Data.h and its specializations in src/Base/Fields.h and src/Base/Particles.h. The data layouts for mesh-fields and particles can be configured independently at compile time via cmake variables PARTICLE_DATA_LAYOUT and FIELD_DATA_LAYOUT, see also cmake/ConfigureDataLayout.cmake.

For the API, see tk::Data, and for the full (and current) implementation, see src/Base/Data.h.

Large-data memory layout for equation systems

Contents