Large-data memory layout for equation systems
Contents
This page discusses the requirements, design, implementation, and verification of a data container with zero-cost compile-time-configurable memory layout for storing a 3-dimensional array of real values. The three dimensions are
- Equation index, as required for a coupled multi-system equation solver,
- Component index, a specific scalar component of a system of equations, and
- Unknown index, as required by some discretization of a numerical solution, representing field variables at discrete locations, e.g., particles or a mesh entity such as cell, face, node, etc.
Data layout requirements and design
How should field values, storing particle properties (in Walker) should be stored in memory? How should mesh field data (associated to nodes, elements, faces, etc., in Inciter) should be stored in memory? These are the single largest chunks of data a particle-, and/or mesh-based code operates on. The data layout, i.e., how the data is stored and organized in memory, determines how the data is accessed and potentially has a first-degree effect on overall performance.
Possibilities
Unknown-major, in which various physical properties, e.g., position, velocity, energy, etc., i.e., the unknowns in a solver, of a single particle or mesh entity are close to each other in memory. For example:
where the are governed by one equation (e.g., position), the are governed by another equation (e.g., velocity), and the are governed by a third equation (e.g., energy), etc. Here the first letter denotes a physical quantity, while the second is the particle number or mesh entity index. If the algorithm advances the properties at the unknowns one equation at a time, data in memory is accessed by having to jump a distance that corresponds to the number of scalar physical variables per particle or mesh entity. In the example, the update will have to jump as are updated.
Property-major, in which the same type of physical properties are close to each other in memory. For example,
The legend here is the same as in unknown-major: the first letter denotes a physical quantity, while the second is the particle or mesh entity index. If the algorithm advances the properties at the unknowns one equation at a time, data in memory is accessed contiguously in memory as the properties are contiguously stored.
Discussion
A property-major storage, case 2 above, seems to be the most efficient at first sight, as it stores data, as it is read and written by the equation algorithms, contiguously. However, data access is contiguous only if the particle properties (or data stored at mesh entities) are independent, i.e., if there is no coupling among the equations. Unfortunately, this is rarely the case, at least not for fluid dynamics or any computational physics problem that is interesting enough. For example in a Lagrangian particle code, velocity is used by the position update, and velocity is required by the energy update. The same is true for a mesh-based solver, where the physical variables are coupled, i.e., their update needs other physical variables at the same mesh entity (and usually that of the neighbors). Depending on the physical approximation, density (or mass) may be required for all equations. The stronger the equations are coupled, the more far-reads are required for a given update with a property-major data layout. These far-reads are practically always cache misses, as the property-major storage stores the physical variables for the same particle or mesh entity very far in memory, e.g., the distance between and is the number of particles or mesh entities. While the unknown-major storage, case 1 above, inherently stores data non-contiguously, the distance between properties of a single particle (or a single mesh entity) is relatively small, i.e., the number of properties, which may incur less cache misses as several particles (or nodes) but all of their physical properties at the same unknown could fit into cache.
Assuming strong coupling among the variables, the unknown-major storage will be favored, but it would be nice if the design allowed for both layouts, so depending on the type of equations the most appropriate layout could be selected. If such a design is maintainable, there is still a question whether the data layout selection should be done at compile-, or run-time.
Assessment of the Blaze library that offers a similar choice
Have looked at https:/<blaze-1.5>/blaze/math/dense/StaticMatrix.h
, which reveals that the template argument (bool) SO
selects between row-, or column-major internal storage. Then SO
is used at both compile-time (e.g., by the class-user, when instantiating the type of the matrix), as well as run-time (e.g., the implementation of isDefault()
). Both compile-time and run-time usage of the SO template arguments are problematic:
- The compile-time usage duplicates a lot of code by having to provide similar implementations for the element-access
operator()
ofStaticMatrix
specialized to column-major. There is a generic implementation forSO
for everything that is agnostic ofSO
, and there is a specialization whenSO
is column-major. - The run-time usage also duplicates code by doing an if-test on
SO
in, e.g.,isDefault()
. Is there a better way of doing this? If there are only two types of data layout (unknown-, and property-major), code duplication should not be too much of an issue. However, the implementation of unknown-property data access must be absolutely zero run-time cost. This means the selection must be at compile-time and the element access must be absolutely invisible to client code. In other words, there must be no re-implementation of a time-integrator for an equation just because the data access is different.
Assessment of the Kokkos library that offers a similar choice
Since the first implementation of the configurable data layout, described below, the Kokkos library, https:/views
, for the purpose of optimal data access on various compute devices, such as multi-core CPUs, many-core accelerators, and Graphics Processing Units. Kokkos appears to provide an excellent abstraction for data layout abstraction. However, Kokkos provides a lot more functionality than just data layout abstraction, e.g., it can generate low level code for various devices using its configurable views. The currently available Kokkos back-ends are OpenMP, pthreads, and CUDA. Thus Kokkos provides abstractions for shared-memory parallelism. While Kokkos has been successfully used in Charm++ code, at this point (Jan 2016) we opt for not adopting Kokkos' views and keep our original data layout abstractions, discussed below. The reasons:
- At this point we are too early in the development of Quinoa and its tools to be able to say that the current abstractions Charm++ provides are sufficient or not to strike the correct balance between performance and productivity. In particular, it could be possible that Charm++'s abstractions (using overdecomposition), which we use already, are enough. In that case, using a single abstraction for parallelism is preferable to two.
- At this point we would really only need the compile-time configurable data layout abstractions from Kokkos and not the other rich features, such as abstractions for loop-level shared-memory parallelism.
As far I can tell, the views of an array in Kokkos provide a very similar abstraction for data layout and thus memory access what is described below.
Requirements
Zero-cost is achieved via type-based compile-time polymorphism. This is controlled via a cmake variable.
The particle (or mesh entity) data is a logically 3-dimensional array that stores the particle properties or unknowns at mesh entities, e.g., faces, cell centers, nodes, etc.
For clarity, the discussion below will use the expression "particle" for a particle-code and will not specifically mention mesh-based unknowns, stored at the nodes, elements, faces, etc., of a computational mesh. The former is for a particle-code, the latter is for a mesh-code. The data layout discussion is independent of whether particles or mesh entities (nodes, cells, etc.) are used, and accordingly both Walker and Inciter uses the same implementation.
In principle there are a total of 6 permutations:
1. ParEqComp: [ particle ] [ equation ] [ component ] 2. ParCompEq: [ particle ] [ component ] [ equation ] 3. EqCompPar: [ equation ] [ component ] [ particle ] 4. EqParComp: [ equation ] [ particle ] [ component ] 5. CompEqPar: [ component ] [ equation ] [ particle ] 6. CompParEq: [ component ] [ particle ] [ equation ]
Of these 6 we only consider those where component follows equation. (For those layouts where equation follows component the access would be unnecessarily complicated by the potentially unequal number of components for different equations which is not known at compile-time and thus does not allow some optimizations.) This decision leaves us with the following choices:
1. ParEqComp: [ particle ] [ equation ] [ component ] 3. EqCompPar: [ equation ] [ component ] [ particle ] 4. EqParComp: [ equation ] [ particle ] [ component ]
Access is based on the 3 coordinates: particle (or unknown), component, and offset. Particle is the particle index, component denotes the given component of a vector equation, e.g., velocity has 3 components, a multi-material turbulent mix model governed by the Dirichlet SDE has scalars (components), and offset is determined by the relative position of the given equation compared to the other equations. Using these 3 coordinates the index calculations for the above 3 cases are:
1. ParEqComp: [ particle ] [ equation ] [ component ] baseptr + particle*nprop + offset + component
where nprop is the total number of particle properties, e.g., 3 positions, 3 velocities, 5 scalars nprop = 11.
3. EqCompPar: [ equation ] [ component ] [ particle ] baseptr + (offset+component)*npar + particle
where npar is the total number of particles.
4. EqParComp: [ equation ] [ particle ] [ component ] baseptr + offset*npar + nce*particle + component
where nce is the number of components for the given equation. Since this would require another function argument (besides particle, component, and offset), and it costs an integer-multiply more than the other two layouts, we dismiss this layout, and only implement the following two:
1. ParEqComp - Particle-major 3. EqCompPar - Equation-major
These options are exposed via a cmake variable and can be switched before a build.
Data layout implementation and assembly
This section documents the implementation and the assembly code, produced by the compilers, of the compile-time configurable data-access policy discussed above. The implementation is via a thin data-access interface with zero run-time cost, no code-duplication, and in a way that is invisible to client code.
Zero-runtime-cost data-layout wrappers with type-based compile-time dispatch
Tags for selecting particle-, or property-major data layout policies:
const bool ParticleMajor = true; const bool PropertyMajor = false;
Essentially, the implementation is as follows:
template< bool Major > class Data { private: // Transform a compile-time bool into a type template< bool m > struct int2type { enum { value = m }; }; // Overloads for particle-, and property-major accesses inline tk::real& access( int particle, int property, int2type<ParticleMajor> ) { return *(m_ptr + particle*m_nprop + m_offset + property); } inline tk::real& access( int particle, int property, int2type<PropertyMajor> ) { // This is the same for now, not called, irrelevant in zero-cost-test return *(m_ptr + particle*m_nprop + m_offset + property); } tk::real* const m_ptr; const int m_nprop; const int m_offset; public: // Constructor Data( tk::real* const ptr, int nprop, int offset ) : m_ptr(ptr), m_nprop(nprop), m_offset(offset) {} // Access dispatch inline tk::real& operator()( int particle, int property ) { return access( particle, property, int2type<Major>() ); } };
Test of zero-cost
Test by adding to Dirichlet
(derived equation class) constructor (client code):
Data< Layout > d( particles, m_nprop, m_offset ); Model::aa = d( 34, 3 ); Model::bb = *(m_particles + 34*m_nprop + m_offset + 3);
Add to Model
:
Model { ... public: tk::real aa; tk::real bb; ... }
Add to Physics
constructor:
std::cout << m_mix->aa << m_mix->bb;
This is so the optimizing compiler cannot entirely optimize the assignments of aa
and bb
away.
Debug assembly
Generated assembly code with CMAKE_BUILD_TYPE=DEBUG
(i.e., without optimization) of the assignments of aa
(line 42) and bb
(line 43) in Dirichlet
's constructor, with clang -g -S -mllvm --x86-asm-syntax=intel
, gnu and intel generate very similar code:
.loc 143 42 20 ; <quinoa>/src/DiffEq/Dirichlet.h:42:20 .Ltmp27038: lea RDI, QWORD PTR [RBP - 56] mov ESI, 34 mov EDX, 3 call _ZN6quinoa4DataILb1EEclEii .Ltmp27039: mov QWORD PTR [RBP - 176], RAX # 8-byte Spill jmp .LBB2550_7 .LBB2550_7: mov RAX, QWORD PTR [RBP - 176] # 8-byte Reload movsd XMM0, QWORD PTR [RAX] mov RCX, QWORD PTR [RBP - 64] # 8-byte Reload movsd QWORD PTR [RCX + 8], XMM0 .loc 143 43 0 ; <quinoa>/src/DiffEq/Dirichlet.h:43:0 mov RDX, QWORD PTR [RCX + 32] imul ESI, DWORD PTR [RCX + 48], 34 movsxd RDI, ESI shl RDI, 3 add RDX, RDI movsxd RDI, DWORD PTR [RCX + 52] movsd XMM0, QWORD PTR [RDX + 8*RDI + 24] movsd QWORD PTR [RCX + 16], XMM0
Line 42 translates to register loads and a function call into tk::Data
, while line 43 translates to some integer arithmetic of the address and loads.
Excerpt from the Intel® 64 and IA-32 Architectures Software Developer Manual
The LEA (load effective address) instruction computes the effective address in memory (offset within a segment) of a source operand and places it in a general-purpose register. This instruction can interpret any of the processor’s addressing modes and can perform any indexing or scaling that may be needed. It is especially useful for initializing the ESI or EDI registers before the execution of string instructions or for initializing the EBX register before an XLAT instruction.
The MOVSXD instruction operates on 64-bit data. It sign-extends a 32-bit value to 64 bits. This instruction is not encodable in non-64-bit modes.
A common type of operation on packed integers is the conversion by zero- or sign-extension of packed integers into wider data types. SSE4.1 adds 12 instructions that convert from a smaller packed integer type to a larger integer type (PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXWD, PMOVZXWD, PMOVSXBQ, PMOVZXBQ, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ). The source operand is from either an XMM register or memory; the destination is an XMM register.
IMUL Signed multiply. The IMUL instruction multiplies two signed integer operands. The result is computed to twice the size of the source operands; however, in some cases the result is truncated to the size of the source operands.
SAL/SHL Shift arithmetic left/Shift logical left. The SAL (shift arithmetic left), SHL (shift logical left), SAR (shift arithmetic right), SHR (shift logical right) instructions perform an arithmetic or logical shift of the bits in a byte, word, or doubleword. The SAL and SHL instructions perform the same operation. They shift the source operand left by from 1 to 31 bit positions. Empty bit positions are cleared. The CF flag is loaded with the last bit shifted out of the operand.
Optimized assembly
Generated assembly code with CMAKE_BUILD_TYPE=RELWITHDEBINFO
(i.e., with optimization) of the assignments of aa
(line 42) and bb
(line 43) in Dirichlet
's constructor, with clang -O2 -g DNDEBUG -S -mllvm --x86-asm-syntax=intel
, gnu and intel generate very similar optimized code:
.loc 144 42 20 ; <quinoa>/src/DiffEq/Dirichlet.h:42:20 movsd XMM0, QWORD PTR [R14 + 8*RAX + 24] movsd QWORD PTR [R13 + 8], XMM0 .loc 144 43 0 ; <quinoa>/src/DiffEq/Dirichlet.h:43:0 mov RCX, QWORD PTR [R13 + 32] movsd XMM0, QWORD PTR [RCX + 8*RAX + 24] movsd QWORD PTR [R13 + 16], XMM0
Both lines 42 and 43 translate to very similar SSE loads with pointer arithmetic, i.e., line 42 costs the same as line 43.
Excerpt from the Intel® 64 and IA-32 Architectures Software Developer Manual
The MOVS instruction moves the string element addressed by the ESI register to the location addressed by the EDI register. The assembler recognizes three “short forms” of this instruction, which specify the size of the string to be moved: MOVSB (move byte string), MOVSW (move word string), and MOVSD (move doubleword string).
The MOVSD (move scalar double-precision floating-point) instruction transfers a 64-bit double-precision floating- point operand from memory to the low quadword of an XMM register or vice versa, or between XMM registers. Alignment of the memory address is not required, unless alignment checking is enabled.
Data layout benchmark
This section documents the benchmark of the implementation of the compile-time configurable data-access policy discussed above. The implementation is via a thin data-access interface with zero run-time cost, no code-duplication, and in a way that is invisible to client code.
Quinoa::Walker input file used for the benchmark
We will integrate for the duration of a 100,000 time steps a system of 100 coupled non-linear stochastic differential equations (SDEs) whose statistically stationary solution converge to the Dirichlet distribution and measure the wall-clock time. For more on the Dirichlet SDE, see src/
walker nstep 100000 # Max number of time steps term 140.0 # Max time dt 0.05 # Time step size npar 40000 # Number of particles ttyi 100 # TTY output interval rngs mkl_mrg32k3a seed 0 end end dirichlet ncomp 100 # = K = N-1 b 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 0.1 1.5 end S 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 0.625 0.4 end kappa 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 0.0125 0.3 end rng mkl_mrg32k3a end statistics # Estimate statistics <Y1> # mean of Y1 <Y2> <y1y1> # variance of Y1 = <(Y1-<Y1>)^2> = <y1^2> <y2y2> <y1y2> end end
Ptr - Working with raw pointers
This algorithm gets the starting raw pointer from which the given particle data is (contiguously) accessible in memory and simply adds integers to the address to access and update the 100 components specified above. The algorithm assumes a particular data layout – it only works with the particle-major storage – a logically 3-dimensional array with [ particle ] [ equation ] [ component ] layout.
Layout-dependent algorithm:
//! Advance particles void advance(int p, int tid, tk::real dt) override { // Get access to particle scalars tk::real* y = m_particles.ptr() + p*m_nprop + m_offset; // Compute Nth scalar tk::real yn = 1.0 - y[0]; for (int i=1; i<m_ncomp; ++i) yn -= y[i]; // Generate Gaussian random numbers with zero mean and unit variance tk::real dW[m_ncomp]; m_rng->gaussian( tid, m_ncomp, dW ); // Advance first m_ncomp (K=N-1) scalars for (int i=0; i<m_ncomp; ++i) { tk::real d = m_k[i]*y[i]*yn*dt; if (d > 0.0) d = sqrt(d); else d = 0.0; y[i] += 0.5*m_b[i]*(m_S[i]*yn - (1.0-m_S[i])*y[i])*dt + d*dW[i]; } }
Par - Access via particle-major layout policy
This algorithm accesses particle data via the wrapper class, tk::Data
, in a data-layout-agnostic fashion. Access itself via this class is demonstrably "zero-cost", i.e., an optimizing compiler completely optimizes the abstraction away: see Data layout implementation and assembly for the assembly generated by 3 compilers. However, writing an SDE-advance algorithm in a data-layout-agnostic manner, requires index calculations at every particle-access compared to working with raw pointers, as described above. Thus the following tests are designed to measure only the additional index calculations that the layout-agnostic access entails.
Layout-independent algorithm:
//! Advance particles void advance(int p, int tid, tk::real dt) override { // Compute Nth scalar tk::real yn = 1.0 - m_particles(p, 0, m_offset); for (int i=1; i<m_ncomp; ++i) yn -= m_particles(p, i, m_offset); // Generate Gaussian random numbers with zero mean and unit variance tk::real dW[m_ncomp]; m_rng->gaussian( tid, m_ncomp, dW ); // Advance first m_ncomp (K=N-1) scalars for (int i=0; i<m_ncomp; ++i) { tk::real d = m_k[i] * m_particles(p, i, m_offset) * yn * dt; if (d > 0.0) d = sqrt(d); else d = 0.0; m_particles(p, i, m_offset) += 0.5*m_b[i]*(m_S[i]*yn - (1.0-m_S[i]) * m_particles(p, i, m_offset) )*dt + d*dW[i]; } }
Comparison of the algorithms
DEBUG
mode uses -O0
and does not optimize function calls away for all of three compilers tested. RELEASE
mode uses -O3
and the abstraction is completely optimized away. However, index calculations still remain compared to a layout-dependent advance algorithm.
Total time measured in micro-seconds, run on a Lenovo laptop with Intel Core i7, 8 compute cores:
Run | Ptr | Par | Par/Ptr |
---|---|---|---|
clang/DEBUG | 150350236 | 338851735 | 2.2537 x slowdown |
clang/RELEASE | 98157742 | 104077139 | 1.0603 x slowdown |
DEBUG/RELEASE | 1.5317 x speedup | 3.2558 x speedup | n/a |
Run | Ptr | Par | Par/Ptr |
---|---|---|---|
gnu/DEBUG | 161603164 | 386646353 | 2.3926 x slowdown |
gnu/RELEASE | 94747953 | 98187568 | 1.0363 x slowdown |
DEBUG/RELEASE | 1.7056 x speedup | 3.9378 x speedup | n/a |
Run | Ptr | Par | Par/Ptr |
---|---|---|---|
intel/DEBUG | 171691440 | 608407412 | 3.5436 x slowdown |
intel/RELEASE | 90059133 | 89892665 | 0.99815 x speedup |
DEBUG/RELEASE | 1.9064 x speedup | 6.7682 x speedup | n/a |
Data layout benchmark discussion
- As expected, inlining has a significant effect on performance: going from
DEBUG
toRELEASE
mode yields a significant speedup with all three compilers, see last,DEBUGcode{.cmake}RELEASE
, rows. - As expected, the additional index calculations required by layout-agnostic access do take a performance hit: though only 6% with clang, and 3% with gnu, see last, Par/Ptr, columns.
- Surprisingly, the layout-agnostic access is even a tiny bit faster than the layout-dependent algorithm with the Intel compiler with
-O3
.
Conclusion
As the implementation is not a significant performance hit, the equation advancement algorithms and particle-, and mesh-property data access are used only via the data-layout-independent interface. The data layout can be changed at compile time. Data access is abstracted (and optimized) away.
Note that the above discussion is independent of whether a particle-code or a mesh-code is used. From the data layout viewpoint the particle-major and mesh-field-major are equivalent and in the code, and in tk::PARTICLE_DATA_LAYOUT
and FIELD_DATA_LAYOUT
, see also cmake/
For the API, see tk::