Inciter software design

This page discusses the high-level software design and some implementation aspects of Inciter (the compressible flow solver). The discussion roughly follows the order of execution in time, from program start to finish. For any of the Charm++ concepts mentioned below, consult the Charm++ manual.

1. Startup and migration of read-only global-scope data

Startup

As all other executables in Quinoa, Inciter uses the Charm++ runtime system. Runtime execution starts in the Charm++ mainmodule constructor, defined in Main/Inciter.C as Main. Starting down the constructor's initializer list, the command line is parsed, followed by creating a driver, InciterDriver, which parses the input file.

Command line and input file parsing, global-scope data

After the main chare constructor has finished, the runtime system initializes global-scope data and migrates it to all other processing elements (PEs), after which it is considered read-only.

These global-scope data are defined at the beginning of Main/Inciter.C in the same order as they appear in the companion Charm++ interface file Main/inciter.ci, which is also the order in which they are initialized and migrated. Global-scope data is limited to such read-only data: it stores data initialized during parsing of the command line and the input (control) file. The command line is parsed by CmdLineParser's constructor, called during the main chare's constructor, while the input file is parsed by InputDeckParser's constructor, called during InciterDriver's constructor.

Global-scope data is always prefixed by g_. Inciter's global-scope data contains two input decks: the defaults (g_inputdeck_defaults) and g_inputdeck, which contains all data parsed from the command line and the input file.
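
Such read-only globals follow the standard Charm++ readonly mechanism. The following is a minimal sketch, assuming an input deck type named InputDeck (the exact types in Main/Inciter.C differ):

    // Main/inciter.ci (sketch): declare the read-only globals
    readonly InputDeck g_inputdeck_defaults;
    readonly InputDeck g_inputdeck;

    // Main/Inciter.C (sketch): define them at global scope; they are
    // filled during command line and input file parsing in the main
    // chare's constructor, then migrated once by the runtime system to
    // all other PEs, after which they must be treated as read-only
    InputDeck g_inputdeck_defaults;
    InputDeck g_inputdeck;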

Global scope data for runtime polymorphism

Two global-scope vectors in Inciter are g_cgpde and g_dgpde. These two vectors hold base classes for partial differential equations (PDEs) that can be specialized to different types of derived PDEs using continuous Galerkin finite element (FE) discretizations (CGPDE) and discontinuous Galerkin FE discretizations (DGPDE). These two vectors are global scope because they hold pointers to derived classes that enable runtime polymorphism. Runtime polymorphism enables code reuse and helps client code stay uniform and modular. Such polymorphic use of pointers (std::unique_ptr) is one of the rare uses of pointers throughout the code.
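
In outline, the pattern is the familiar vector of base-class pointers. The sketch below is illustrative only: the class and member names (CGPDE, CompFlow, rhs) stand in for the real interfaces:

    #include <memory>
    #include <vector>

    struct CGPDE {                       // base: interface of all CG PDEs
      virtual void rhs( /* mesh, solution, ... */ ) = 0;
      virtual ~CGPDE() = default;
    };

    struct CompFlow : CGPDE {            // a derived PDE type
      void rhs( /* ... */ ) override { /* compute right-hand side */ }
    };

    // global-scope vector enabling runtime polymorphism: client code
    // loops over g_cgpde and calls the same interface on every PDE,
    // regardless of the concrete type held
    std::vector< std::unique_ptr< CGPDE > > g_cgpde;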

Create inciter's driver, single Charm++ chare Transporter

Up to this point execution is serial, since there is only a single main Charm++ chare. InciterDriver's constructor then fires up a single Charm++ chare instance of Transporter, called from Main::execute() after the runtime system has finished migrating all global-scope data. Note that InciterDriver is not a Charm++ chare, only an ordinary C++ class. Transporter, defined in Inciter/Transporter.C, is the main driver class of Inciter: it is a Charm++ chare from which all execution happens, e.g., via broadcasts, and in which all execution eventually ends, leading to Transporter::finish(), which calls back to Main::finalize(), which in turn calls CkExit(), signaling the runtime system to exit.
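
The last links of this chain, as a sketch (mainProxy, a proxy to the main chare, is an assumed name):

    // Transporter.C (sketch)
    void Transporter::finish() {
      // ... final bookkeeping ...
      mainProxy.finalize();     // call back to the main chare
    }

    // Main/Inciter.C (sketch)
    void Main::finalize() {
      // ... finalization ...
      CkExit();                 // signal the runtime system to exit
    }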

2. Important classes

Here are the important classes that interoperate within Inciter:

  • Transporter (single chare, driver)
  • Partitioner (chare nodegroup, mesh partitioner)
  • tk::MeshWriter (chare group, mesh writer, performing file output in parallel)
  • Refiner (chare array, mesh refiner)
  • Sorter (chare array, performs mesh reordering)
  • Discretization (chare array, generic PDE solver base class)
  • DiagCG (chare array, PDE solver child class, specialized to a continuous Galerkin finite element discretization scheme with a lumped-mass left-hand side and flux-corrected transport, combined with a Lax-Wendroff-like explicit time stepping scheme that is second-order accurate in space and time),
    • DG (chare array, PDE solver child class, specialized to a discontinuous Galerkin finite element discretization scheme with a diagonal left-hand side with explicit Runge-Kutta time stepping),
    • ... (more hydro schemes to be added in the future)
  • DistFCT (chare array, performs distributed-memory flux-corrected transport, if used by a specific discretization scheme, e.g., DiagCG)

Chare above means a single Charm++ chare, i.e., there is a single instance of the class. By design, Transporter is a single chare used as a driver that creates objects and as a target of global parallel reductions, which thus serve as global synchronization points. Transporter also does most of the printouts to screen, collects statistics, and is the end point of execution in Transporter::finish().

Group above means a Charm++ chare group. A group is a processor-aware chare collection, which means that there is guaranteed to be a single instance of a group per PE, which does not migrate. MeshWriter is a group because it calls the MPI-only library ExodusII for outputting mesh and mesh-based solution field data to files. Note that while MeshWriter is a group, the way we use it makes it similar to a nodegroup but with group semantics: we call its Charm++ entry method MeshWriter::write() from multiple array elements (from Discretization), targeting only the first PE of each logical node. This effectively serializes the calls on each node, as required by the underlying non-thread-safe NetCDF/HDF5 library calls made by ExodusII. This way writing large solution data works correctly and efficiently in both non-SMP and SMP mode, since the parallel output load is configurable separately from the number of work units used for computation.
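
A sketch of that call pattern, assuming m_meshwriter is the MeshWriter group proxy held by a Discretization array element (CkMyNode() and CkNodeFirst() are Charm++ runtime calls):

    // from a Discretization array element: target the group element on
    // the first PE of our logical node, so that all writes on a node
    // are funneled through (and serialized by) a single group instance
    int pe = CkNodeFirst( CkMyNode() );
    m_meshwriter[ pe ].write( /* mesh, solution field data, ... */ );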

Nodegroup above means a Charm++ chare nodegroup. A nodegroup is a processor-aware chare collection, which means that there is guaranteed to be a single instance of a nodegroup per logical (e.g., compute) node which does not migrate. Partitioner is a nodegroup because it calls MPI-only libraries.

Array above means a Charm++ chare array, whose elements can migrate (if enabled) and thus actively participate in automatic load balancing. With nonzero overdecomposition, there may be more array elements (workers) than available PEs. Arrays do the bulk of the heavy lifting in a calculation, i.e., computing right-hand sides for PDE operators, and hold the bulk of the unknown/solution arrays. The degree of overdecomposition is specified by the -u command line argument, which accepts a real value between 0.0 and 1.0, inclusive. 0.0 means no overdecomposition: the mesh is partitioned into as many pieces as there are PEs available. (-u 0.0, the default, yields an execution style most similar to how MPI codes are traditionally used.) Nonzero overdecomposition yields a larger number of mesh partitions than available PEs. The extreme of 1.0 represents the largest degree of overdecomposition, which also results in the smallest work units. For a discussion of the effects of overdecomposition, see the page on Inciter performance.

Bound arrays: Discretization and its specialized children (DiagCG, DG, etc.), Refiner, and, if used/created, DistFCT are bound arrays. This means that the runtime system migrates their corresponding array elements together. Bound arrays facilitate modularization among workers that migrate. Since array elements that are bound always appear together on a given PE, even after migration, they can also be thought of as parts of the same class, because they can access data from each other (while still respecting the C++ rules of public, private, etc.). However, dividing functionality into classes, as always, helps readability and makes reasoning about the code easier.
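
Binding is requested at array creation via Charm++'s CkArrayOptions. A minimal sketch (proxy names illustrative; in Inciter the bound arrays are created empty and populated later by dynamic insertion):

    // create the base array first, then bind the others to it so the
    // runtime system always migrates corresponding elements together
    CProxy_Discretization disc = CProxy_Discretization::ckNew();
    CkArrayOptions bound;
    bound.bindTo( disc );
    CProxy_Refiner refiner = CProxy_Refiner::ckNew( bound );
    CProxy_DistFCT fct = CProxy_DistFCT::ckNew( bound );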

3. Setup

Transporter constructor

Inciter's setup starts with Transporter's constructor. After some printouts on the configuration, some of the important classes introduced above are created. In some cases, we simply call Charm++'s ckNew() without arguments. This means an empty Charm++ chare array is created: we get hold of its proxy, but no constructors are run yet. An example of this is Sorter. We do empty array creation for two main reasons: (1) we don't yet have all the data needed to create the given class (more preparation is required, but we already need its proxy), and/or (2) we need to pass specific data to each array (or group) element's constructor, which can only be done via dynamic insertion. (Passing it via Charm++'s ckNew() would pass the same data to all elements to be created, in a broadcast fashion, which is not what we want most of the time.) For example, Sorter needs part of the mesh connectivity, coordinates, etc., for which we need to have the mesh partitioned first. This approach also adheres nicely to the RAII idiom of the object-oriented paradigm, which helps writing correct code.
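
In code, this two-phase pattern looks like the following sketch (argument lists elided; the index used for insertion is illustrative):

    // phase 1 (Transporter constructor): create an empty chare array,
    // run no constructors yet, but obtain its proxy
    CProxy_Sorter sorter = CProxy_Sorter::ckNew();

    // phase 2 (later, e.g., once the mesh is partitioned): dynamic
    // insertion, passing element-specific data to each constructor
    sorter[ chareId ].insert( /* mesh chunk, coordinates, callbacks, ... */ );
    // ... insert the remaining elements ...
    sorter.doneInserting();   // tell the runtime all elements are created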

Transporter thus creates tk::MeshWriter, Sorter, Refiner, and Partitioner. These classes play an important role during setup, but some are also (re-)used during time stepping.

In many cases, we pass a number of callbacks to chare arrays and groups when their Charm++ proxy is created with ckNew(). These callbacks are of type CkCallback. This is one technique we use as a kind of type erasure, so that these classes can interoperate with each other as well as with their host, Transporter. This helps resolve cyclic include dependencies. These callbacks mostly denote reduction targets in Transporter, which serve as necessary synchronization points during setup. A CkCallback is similar to std::function or a C-style function pointer, but can also store a callback to a Charm++ chare object that happens to reside across the network, i.e., an entry method call via a proxy.
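
A typical example of such a callback is the reduction performed at the end of Partitioner's constructor, discussed below. A sketch, using the CkReductionTarget macro (the target must be declared as a reduction target in the .ci file):

    // in Transporter: create a callback denoting a reduction target,
    // the entry method Transporter::load() on this single chare
    CkCallback cbLoad( CkReductionTarget( Transporter, load ), thisProxy );

    // in Partitioner, once its mesh chunk is read: contribute the local
    // element and node counts; the runtime system sums them across all
    // contributors and delivers the result to Transporter::load()
    std::vector< unsigned long > counts{ nelem, npoin };
    contribute( counts.size() * sizeof(unsigned long), counts.data(),
                CkReduction::sum_ulong, cbLoad );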

Partitioner constructor

While Transporter's constructor creates a number of Charm++ chare arrays, most of them start out empty. An exception is Partitioner, whose constructor starts by reading the mesh in blocks as they appear in the file: each Partitioner object reads a different chunk of the mesh, together with the node coordinates of all cells in that chunk. Partitioner's constructor also reads/computes the triangle element connectivity associated to side sets as well as the node lists associated to side sets. Note that each nodegroup reads in only the portion of this data that belongs to its chunk of the mesh. At this point the result is a simple partitioning that is almost certainly not ideal for computing equations later, because these partitions of the computational domain have a large surface-area-to-volume ratio: we cannot assume that the ordering in the mesh file corresponds to physical-space proximity. Partitioner's constructor finishes with a global reduction to Transporter::load(), which sums the number of elements and the number of nodes (points) of the complete problem, read in a distributed fashion.

Compute total load and overdecomposition

After reading the mesh in a distributed fashion, we now know the total number of cells and points, and Transporter::load() computes the total number of worker chares into which the problem will be partitioned. If nonzero overdecomposition is configured by the command line argument -u, the number of chares (partitions) may exceed the number of available PEs. This is computed by tk::linearLoadDistributor(). We then start mesh partitioning by issuing a broadcast to Partitioner::partition().
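
The essence of this computation, as a hypothetical sketch only (the real tk::linearLoadDistributor() signature and rounding rules differ):

    #include <cstddef>

    // map the virtualization parameter u in [0.0, 1.0] to a number of
    // chares between npe (u = 0.0: no overdecomposition, one partition
    // per PE) and load (u = 1.0: the largest overdecomposition and the
    // smallest work units)
    std::size_t nchare( double u, std::size_t load, std::size_t npe ) {
      return npe + static_cast< std::size_t >( u * ( load - npe ) );
    }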

Mesh partitioning

Partitioner::partition() sets up the necessary data for calling an external mesh partitioner. Currently, various coordinate-based partitioners are hooked up from Zoltan2. This works in a distributed-memory parallel fashion and calls MPI under the hood (inside the library call). Note that the number of desired mesh partitions equals the number of Charm++ (worker) chares we want, which can be larger than the number of PEs (or compute nodes). The output of mesh partitioning is a map that assigns a chare id to each mesh cell. Next, all cells (their connectivity and node coordinates) that the partitioner assigned to a given chare are categorized (grouped) together and sent to their owner chare. This is initiated in Partitioner::distribute() and communicated, in a point-to-point fashion, to other Partitioner chares by calling the Partitioner::addMesh() entry method. When every Partitioner object has received its assigned mesh, they issue a global contribute call to Transporter::distributed(). When all contribute calls have arrived, communication of the different parts of the mesh has finished. This is followed by optional initial mesh refinement.
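
In outline, the categorization step looks like this sketch (container and variable names illustrative):

    #include <cstddef>
    #include <unordered_map>
    #include <vector>

    // group local cell ids by the chare id the partitioner assigned to
    // them, so that each group can then be sent, with its connectivity
    // and coordinates, to its owner chare via Partitioner::addMesh()
    std::unordered_map< int, std::vector< std::size_t > >
    categorize( const std::vector< int >& che ) {  // che[e]: chare of cell e
      std::unordered_map< int, std::vector< std::size_t > > chcell;
      for (std::size_t e = 0; e < che.size(); ++e)
        chcell[ che[e] ].push_back( e );
      return chcell;
    }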

Optional initial mesh refinement

Transporter::distributed() issues a broadcast to Partitioner::refine(), which uses dynamic array insertion to create all Refiner chare array elements. Note that there may be multiple Refiner objects created on each Partitioner PE (or compute node in SMP mode). Each Refiner constructor gets a chunk of the mesh (chunks that are now smaller than those originally read from file), corresponding to the total number of chares. Not only the mesh connectivity and node coordinates, but also the boundary face connectivity and the boundary node lists associated to multiple side sets are passed to Refiner, restricted to those portions of these data structures that belong to the particular Refiner chare. (Boundary face connectivity is used by cell-based discretization schemes, such as DG, while boundary node lists are used by nodal discretizations, such as DiagCG, to set boundary conditions.)

Refiner's constructor evaluates the user configuration and decides whether initial mesh refinement is to be performed. If not, execution simply skips ahead to Refiner::endt0ref(). If initial refinement is configured, we descend into Refiner::t0ref().

There are two ways Refiner is used:

  1. Mesh refinement before time stepping (t<0), t0ref, and
  2. Mesh refinement during time stepping (t>0), dtref.

Initial mesh refinement may consist of multiple steps of different refinement types:

  • uniform: uniform refinement, i.e., split each tetrahedron into 8 new ones,
  • ic: initial-conditions-based, i.e., non-uniform refinement based on estimating the error in the initial conditions on a given initial mesh, and
  • coordref: coordinate-based refinement, which allows specifying planes and simple extents in 3-dimensional space and tagging edges for refinement between two extremes.

Additionally, the user can also tag edges manually, by creating a list of pairs of global node ids.

Refiner::eval() evaluates, at the end of each initial refinement step, whether another initial refinement step is to be performed. After the last step, execution continues to Refiner::endt0ref(), which creates Sorter.

Refiner::endt0ref() not only creates the Sorter chare array but also reports back to Transporter::refined(), signaling the end of initial mesh refinement. In this reduction target the final numbers of mesh cells and nodes are also aggregated across the whole problem.

Optional PE-locality mesh node reordering

Sorter takes a chunk of the mesh, together with its physical boundary face and node data structures, and sets up a symmetric node communication map, Sorter::m_msum, which stores the global node IDs of chare boundaries associated to chare IDs on each chare. This map is symmetric in the sense that if two chares share a node after partitioning, each chare stores the same node ID, associated with the other chare's ID. This map is used for point-to-point communication during time stepping: it is used directly by node-centered (CG) schemes, and also serves as a starting point for setting up the face and ghost communication data structures required by cell-centered (DG) schemes.
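
The shape of this data structure, as a sketch (the container choice is illustrative):

    #include <cstddef>
    #include <map>
    #include <unordered_set>

    struct Sorter {
      // for each neighbor chare id, the global ids of the mesh nodes
      // shared with that chare along the chare boundary; symmetry: if
      // chares c and d share global node g, then on chare c m_msum[d]
      // contains g and on chare d m_msum[c] contains g
      std::map< int, std::unordered_set< std::size_t > > m_msum;
    };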

Sorter also performs optional PE-locality mesh node reordering in parallel. This reordering is off by default. The reordering assigns new global mesh node IDs that roughly increase with chare ID.

Whether node reordering is to be performed is evaluated in Sorter::start(). If reordering is done, it starts in Sorter::mask(). If not, we skip to Sorter::createDiscWorkers(), which starts creating the worker chares that will eventually perform the heavy lifting during time stepping: computing the PDE operators during time integration.

Creating workers

The worker Charm++ chare array elements, which store the mesh and associated data structures and compute PDE operators during time stepping, are organized into a single-level base-child relationship for code reuse and runtime polymorphism. However, both the base-child relationship and the runtime polymorphism are implemented slightly differently from what would be familiar from inheritance in standard object-oriented programming (OOP).

There is a single base class, Discretization, that encapsulates data and member functions that are generic to all mesh-based discretization schemes. It stores the mesh, connectivity, node coordinates, etc. As Discretization is a chare array, its elements are distributed across the whole problem on all available compute nodes and PEs, and can also migrate for load balancing. In the OOP sense, derived (or child) classes are the classes that implement a particular discretization scheme, e.g., DiagCG or DG.
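
Because the child arrays are bound to Discretization, a child element can safely obtain a plain pointer to its base element residing on the same PE. A sketch (member and accessor names illustrative):

    // inside a child, e.g., DiagCG, array element: m_disc is the
    // Discretization proxy; since the arrays are bound, the element
    // with the same index is guaranteed to reside on the same PE, so
    // ckLocal() always returns a valid pointer here (never nullptr)
    Discretization* d = m_disc[ thisIndex ].ckLocal();
    const auto& coord = d->Coord();   // access base-class data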

Sorter::createDiscWorkers() starts creating the Discretization chares. This is done via the usual Charm++ dynamic array insertion passing the mesh chunk and associated data structures to their constructors. Eventually, this is followed by creating the child workers in Sorter::createWorkers().

Execution continues from the Discretization constructor to Transporter::disccreated(), which fires up Discretization::vol(). This computes nodal volumes as well as the total volume of the complete problem, which ends up in Transporter::totalvol(). The volume calculations are followed by computing various statistics of the mesh cell sizes, including min/max edge lengths and histograms, which are useful diagnostics for estimating load imbalance.

The various mesh statistics are aggregated and arrive independently at the reduction target member functions of Transporter: minstat(), maxstat(), sumstat(), and pdfstat(). When all of these are complete (see src/Inciter/transporter.ci for how this condition is communicated to the runtime system using a DAG), Transporter::stat() is called, in which execution continues by creating the child (or derived) workers in Sorter::createWorkers(). As usual, the children are also created by dynamic array insertion.
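
In the interface file such a condition is expressed with Charm++'s structured DAG (SDAG) constructs. A sketch of what the wait might look like (trigger names illustrative; each trigger is an entry method signaled when the corresponding reduction completes, cf. src/Inciter/transporter.ci):

    // transporter.ci (sketch): stat() runs only after all four
    // statistics reductions have arrived, in any order
    entry void wait4stat() {
      when minstatDone(), maxstatDone(), sumstatDone(), pdfstatDone()
        serial { stat(); }
    };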

After the child constructors have run, one way or another, they are expected to call Transporter::comfinal(), which signals the end of setting up any (additional) communication maps needed by the specific discretization schemes. After this, the child classes are called again via their member function setup(). All specific discretization schemes, "deriving from" Discretization, are expected to define certain member functions, such as setup(), which performs scheme-specific setup and is eventually expected to call dt(), which starts time stepping by computing the size of the next time step. dt() is also the member function that is called again and again, starting each new time step during time stepping.

4. Time stepping

After all the above, time stepping starts. Time stepping may be done very differently by the different types of discretizations, but they are all expected to define a few common member functions so they can interoperate with their host, Transporter. These are Charm++ entry methods and must also have the same function signatures (as in a derived class in OOP). Since this set of functions may change, the best way to find out what is required is to compare the Charm++ interface (.ci) files of the specific discretization schemes, e.g., DiagCG, DG, etc., and look for entry methods that are defined by all child schemes.
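
As an illustration only, the common surface includes entry methods like the following excerpt-style sketch (based on the functions mentioned on this page; the authoritative list is in the .ci files):

    // sketch: entry methods every discretization scheme (DiagCG, DG,
    // ...) is expected to provide so Transporter can drive it generically
    entry void setup();   // scheme-specific setup, eventually calls dt()
    entry void dt();      // compute next time step size, start a step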