Inciter software design
- 1. Startup and migration of read-only global-scope data
- 2. Important classes
- 3. Setup
- 4. Time stepping
This page discusses the high-level software design and some implementation aspects of Inciter. The discussion roughly follows as execution happens in time, from program start to finish. On any of the Charm++ concepts below, consult the Charm++ manual.
As all other executables in Quinoa, Inciter uses the Charm++ runtime system. Runtime execution starts in the Charm++
mainmodule constructor, defined in Main/
Main. Starting down the constructor's initializer list, the command line is parsed, followed by creating a driver, InciterDriver, which parses the input file.
After the main chare constructor has finished, the runtime system initializes global scope data and migrates it to all other processing elements (PEs), which from that point is considered read-only.
This global-scope data are defined at the beginning of Main/
Global_scope data is always prefixed by
g_. Inciter's global-scope data contains two input decks: the defaults (g_inputdeck_defaults) and g_inputdeck which contains all data parsed during command line and input file parsing.
Two other global scope vectors are
g_dgpde. These two vectors hold base classes for partial differential equations (PDE) that can be specialized to different types of derived PDEs that use continuous Galerkin finite element (FE) discretizations (CGPDE) and discontinuous Galerkin FE discretizations (DGPDE). These two vectors are global scope because they hold pointers to derived classes that enable runtime polymorphism. Runtime polymorphism enables code reuse and helps client code stay uniform and modular. Such polymorphic use of pointers (
std::unique_ptr) are one of the rare uses of pointers throughout the code.
Up to this point execution is serial, since there is only a single main Charm++ chare.
InciterDriver's constructor then fires up a single Charm++ chare instance of
Transporter, called from Main::
Transporter, defined in Inciter/
Transporter::, which eventually calls back to
CkExit(), signaling the runtime system to exit.
Here are the important classes that interoperate within inciter:
Transporter(single chare, driver)
tk::(chare group, linear system solver)
Partitioner(chare group, mesh partitioner)
Refiner(chare array, mesh refiner)
Sorter(chare array, performs mesh reordering)
Discretization(chare array, generic PDE solver base class)
DiagCG(chare array, PDE solver child class, specialized to continuous Galerkin finite element discretization scheme with a lumped-mass left-hand side and flux-corrected transport combined with Lax-Wendroff time stepping),
MatCG(chare array, PDE solver child class, specialized to continuous Galerkin finite element discretization in space with a consistent-mass left-hand side and a linear solver with flux-corrected transport combined with Lax-Wendroff time stepping),
DG(chare array, PDE solver child class, specialized to discontinuous Galerkin finite element discretization scheme with a diagonal left-hand side with Runge-Kutta time stepping),
- ... (more hydro schemes to be added in the future)
DistFCT(chare array, performs distributed-memory flux-corrected transport, if used by a specific discretization scheme)
Chare above means a single Charm++ chare. There is a single instance of this class. By design,
Transporter is a single chare and is used as a driver that creates objects, used as a target of global parallel reductions, and thus global synchronization points.
Transporter also does most of the printouts to screen, collects statistics, and is the end-point of execution in
Group above means a Charm++ chare group. A group is a processor-aware chare collection, which means that there is guaranteed to be a single instance of a group per PE which does not migrate. Examples in inciter are
Partitioner both of which are groups because they call MPI-only libraries, which require calls from PEs.
Array above means a Charm++ chare array, whose elements can migrate (if enabled) and thus they actively participate in automatic load balancing. With nonzero overdecomposition, configured with the
-u command line argument, there may be more array elements (workers) than the number of available PEs. Arrays do the bulk of the heavy lifting in a calculation, i.e., computing right-hand sides for PDE operators, and hold the bulk of unknown/solution arrays. The degree of overdecomposition can be specified by the
-u command line argument. This argument accepts a real value between 0.0 and 1.0, inclusive. 0.0 means no overdecomposition, which corresponds to partitioning the mesh into a number of pieces equalling the number of PEs available. (
-u 0.0 yields an execution style that is most similar to how MPI codes are traditionally used, which is the default.) Nonzero overdecomposition yields larger number of mesh partitions than the available PEs. The extreme of
1.0 represents the largest degree of overdecomposition, which also results in the smallest work units. For a discussion on the effects of overdecomposition, see the page on Inciter performance.
Discretization and its specialized children,
Refiner, and, if used/created,
DistFCT are bound arrays. This means that the runtime system migrates its corresponding array elements together. Bound arrays facilitate modularization among workers that migrate. Since array elements that are bound always appear together on a given PE, even after migration, they can also be thought of as part of the same class, because they can access data from each other (but still respecting the C++ rules of
private, etc). However, dividing functionality into classes, as always, helps readibility and makes reasoning about code easier.
Inciter's setup starts with
Transporter's constructor. After some printouts on configuration, some of the important classes are created, introduced above. In some cases, we simply call Charm++'s
ckNew() without arguments. This means an empty Charm++ chare array is created, we get hold of its proxy, but no constructors are run yet. An example for this is
Sorter. We do empty array creation for two main reasons: (1) we don't yet have all data that is needed to create the given class (more preparation is required but we need its proxy already), and/or (2) we need to pass specific data to each array (or group) element's constructor, which can only be done via dynamic insertion. (Passing it via Charm++'s
ckNew() would pass the same data to all elements to be created in a broadcast fashion, and this is not what we want most of the time.) For example,
Sorter needs part of the mesh connectivity, coordinates, etc., for which we need to have the mesh partitioned first. This also allows nicely adhering to the RAII idiom of the object oriented paradigm, which helps writing correct code.
In many cases, we pass a number of callbacks to chare arrays and groups when their Charm++ proxy is created with
ckNew(). These callbacks are of type
CkCallback. This is one technique that we use as a kind of type erasure so that these classes can interoperate with each other as well as with their host, Transporter. This helps resolving cyclic include dependencies. These callbacks mostly denote reduction targets to
Transporter, that are necessary synchronization points during setup.
Transporter's constructor we create a number of Charm++ chare arrays and groups, most of them are empty to start with, except
Partitioner, whose constructor starts by reading a different chunk of the mesh in blocks as they appear in the file. We read a chunk of the mesh and associated node coordinates for all mesh cells that are in the chunk read.
Partitioner's constructor also reads/computes the triangle element connectivity associated to side sets as well as the node lists associated to side sets. Note that only the portion of all this data is read in on a group that belongs to a given chunk of the mesh. At this point this results in a simple partitioning which almost certainly not ideal for computing equations later, because of a large surface-area-to-volume ratio of these partitions, since we cannot assume that the ordering in the mesh file also corresponds to close physical-space proximity.
Partitioner's constructor finishes with a global reduction to
Transporter::, which sums the number of elements and the number of nodes (points) in the complete problem (read in in a distributed fashion).
After reading the mesh in a distributed fashion, we now know the total number of cells and points and
Transporter:: computes the total number of worker chares that will be used to partition the problem into. If there is a nonzero overdecomposition configured by the command line argument
-u, the number of chares (partitions) may be more than than the available PEs. This is computed by tk::
Partitioner:: sets up the necessary data for calling an external mesh partitioner. Currently, various coordinate-based partitioners are hooked up from Zoltan2. This works in distributed-memory parallel fashion, and calls MPI under the hood (inside the library call). Note that the number of desired mesh partitions equals the number of Charm++ (worker) chares we want, which can be larger than the number of PEs (or compute nodes). The output of mesh partitioning is a map that assigns a chare id to each mesh cell. Next is to categorize (or group) all cells (their connectivity and node coordinates) together that are assigned to each chare by the partitioner and send them to their (owner) chare. This is started out in
Partitioner::, and communicated, in a point-to-point fashion, to other Partitioner chares by calling the Partitioner::
contribute call to
Transporter::. When all contribute calls have arrived, communication of the different parts of the mesh has finished. This is followed by optional initial mesh refinement.
Transporter:: issues a broadcast to
Partitioner::, which uses dynamic array insertion to create all
Refiner chare array elements. Note that there may be multiple
Refiner objects created on each
Partitioner PE (or compute node in SMP mode). Each
Refiner constructor gets a chunk of the mesh (which now are smaller chunks than the chunks that were originally read from file), corresponding to the total number of chares. Not only the mesh connectivity and node coordinates, but also the boundary face connectivity and boundary node lists associated to multiple side sets are also passed to Refiner, and only those portions of these data structures that belong to the particular
Refiner chare. (Boundary face connectivity is used for cell-based discretization schemes, such as
DG, while boundary node lists are used for nodal discretizations, such as
CG to set boundary conditions.)
Refiner's constructor evaluates user configuration and decides if initial mesh refinement is to be performed or not. If not, execution is simply skipped ahead to Refiner::
There are two ways Refiner is used:
- Mesh refinement before time stepping (t<0),
- Mesh refinement during time stepping (t>0),
Initial mesh refinement consists of potentially multiple steps of different refinement types:
coordref, which respectively stand for uniform refinement, i.e., split each tetrahedron into 8 new ones, initial-conditions-based, i.e., non-uniform refinement based on estimating the error in the initial conditions on a given initial mesh, and coordinate-based, which allows specifying planes and simple extents in 3 dimensional space and allows tagging edges for refinement between two extremes. Additionally the user can also tag edges manually, using the
initref keyword and creating a list of pairs of global node ids.
Sorter array chares but also reports back to Transporter::
Sorter takes a chunk of the mesh, together with its physical boundary face and node data structures, and sets up a symmetric node communication map, Sorter::
Sorter performs optional PE-locality mesh node reordering in parallel. This reordering is optional and is off by default unless
MatCG is configured by the user. The reordering assigns new global mesh node IDs that roughly increase with chare ID. This enables optimal communication for
MatCG when contributions to the linear system are communicated and assembled for a linear solution in tk::
Whether node reordering is to be performed or not is evaluated in Sorter::
The worker Charm++ chare array elements, that store the mesh and associated data structures and compute PDE operators during time stepping, are organized into a single-level base-child relationship for code reuse and runtime polymorphism. However, both the base-child relationship and runtime polymorphism are slightly differently implemented compared to what would be familiar with inheritance in standard object-oriented programming (OOP).
There is a single base class,
Discretization, that encapsulates data and member functions that are generic to all mesh-based discretization schemes. It stores the mesh, connectivity, node coordinates, etc. As
Discretization is a chare array, its elements are distributed across the whole problem on all available compute nodes and PEs, and can also migrate for load balancing. In the OOP sense, derived (or child) classes are the classes that implement a particular discretization scheme, e.g., DiagCG or DG.
Discretization chares in Sorter::
Execution from the
Discretization constructor follows to Transporter::
Discretization::, which computes nodal volumes as well as the total volume of the complete problem, which ends up in
Transporter::. Volume calculations are followed by computing various statistics on the mesh cell sizes, including min/max edge lengths, and histograms, which are useful diagnostics to estimate load imbalance.
The various mesh statistics aggregate and arrive independently in reduction target member functions of
pdfstat(). When all of these are complete (see src/
After the child constructors have started, one way or another, they are expected to call Transporter::
setup(). All specific discretization schemes, "deriving from"
Discretization, are expected to define certain member functions, such as
setup(), which do scheme-specific setup and is eventually expected to call
dt(), which starts time stepping (by computing the size of the next time step).
dt() is also the member function that is called again and again starting a new time step during time stepping.
After all the above, time stepping starts. Time stepping is and may be done very differently by the different types of discretizations but they are all expected to define a few member functions that are common so they can interoperate with their host,
Transporter. These are Charm++ entry methods and must also have the same function signature (as in a derived class in OOP). Since the set of these function may change, the best way to find out what is required is to compare the Charm++ interface (
.ci) files for the specific discretization schemes, e.g., DiagCG, DG, etc., and look for entry methods that are defined by all child schemes.