On the parallel I/O optimality of linear algebra kernels: near-optimal matrix factorizations

Matrix factorizations are among the most important building blocks of scientific computing. However, state-of-the-art libraries are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedules, both communicating N^3/(P*sqrt(M)) elements per processor, where P is the number of processors and M is the local memory size. The empirical results match our theoretical analysis: our implementations communicate significantly less than Intel MKL, SLATE, and the asymptotically communication-optimal CANDMC and CAPITAL libraries. Our code outperforms these state-of-the-art libraries in almost all tested scenarios, with matrix sizes ranging from 2,048 to 524,288 on up to 512 CPU nodes of the Piz Daint supercomputer, decreasing the time-to-solution by up to three times. Our code is ScaLAPACK-compatible and available as an open-source library.


INTRODUCTION
Matrix factorizations, such as LU and Cholesky decompositions, play a crucial role in many scientific computations [42,49,65], and their performance can dominate the overall runtime of entire applications [19]. Therefore, accelerating these routines is of great significance for numerous domains [18,44]. The ubiquity and importance of LU factorization is even reflected by the fact that it is used to rank the top supercomputers worldwide [25].
Since the arithmetic complexity of matrix factorizations is O(N^3) while the input size is O(N^2), these kernels are traditionally considered compute-bound. However, the end of Dennard scaling [22] puts increasing pressure on data movement minimization, as the cost of moving data far exceeds its computation cost, both in terms of power and time [40,63]. Thus, deriving algorithmic I/O lower bounds is a subject of both theoretical analysis [15,36,37] and practical value for developing I/O-efficient schedules [33,59,60].
While asymptotically optimal matrix factorizations were proposed, among others, by Ballard et al. [7] and Solomonik et al. [33,61], we observe two major challenges with the existing approaches. First, the presented algorithms are only asymptotically optimal: the I/O cost of these proposed parallel algorithms can be as high as 7 times the lower bound for LU [61] and up to 16 times for Cholesky [33]. This means that they communicate less than "standard" 2D algorithms like ScaLAPACK [14] only for almost prohibitively large numbers of processors: e.g., according to the LU cost model [61], it requires more than 15,000 processors to communicate less than an optimized 2D algorithm. Second, their time-to-solution performance can be worse than highly-optimized, existing 2D-parallel libraries [33].

[Figure 1. Left: speedup over the fastest state-of-the-art library (S=SLATE [28], C=CANDMC [57], M=MKL [34]) as a function of the number of nodes. Right: COnfLUX's achieved % of machine peak performance.]
To tackle these challenges, we first provide a general method for deriving precise I/O lower bounds of Disjoint Array Access Programs (DAAP): a broad range of programs composed of a sequence of statements enclosed in an arbitrary number of nested loops. We then illustrate the applicability of our framework by deriving parallel I/O lower bounds of Cholesky and LU factorizations: N^3/(3P*sqrt(M)) and 2N^3/(3P*sqrt(M)) elements, respectively, where N is the matrix size, P is the number of processors, and M is the local memory size.
Moreover, we use the insights from deriving the above lower bounds to develop COnfLUX and COnfCHOX, near-communication-optimal parallel LU and Cholesky factorization algorithms that minimize data movement across the 2.5D processor decomposition. For LU factorization, to further reduce the latency and bandwidth cost, we use a row-masking tournament pivoting strategy resulting in a communication requirement of N^3/(P*sqrt(M)) + O(N^2/P) elements per processor, where the leading-order term is only 1.5 times the lower bound. Furthermore, to secure high performance, we carefully tune block sizes and communication routines to maximize the efficiency of local computations such as trsm (triangular solve) and gemm (matrix multiplication).
We measure both communication volume and achieved performance of COnfLUX and COnfCHOX and compare them to state-of-the-art libraries: the vendor-optimized Intel MKL [34], SLATE [28] (a recent library targeting exascale systems), as well as CANDMC [57,58] and CAPITAL [32,33] (codes based on the asymptotically optimal 2.5D decomposition). In our experiments on the Piz Daint supercomputer, we measure up to 1.6x communication reduction compared to the second-best implementation. Furthermore, our 2.5D decomposition communicates asymptotically less than SLATE and MKL, with even greater expected benefits on exascale machines. Compared to the communication-avoiding CANDMC library with an I/O cost of 5N^3/(P*sqrt(M)) elements [61], COnfLUX communicates five times less. Most importantly, our implementations outperform all compared libraries in almost all scenarios, both for strong and weak scaling, reducing the time-to-solution by up to three times compared to the second-best performing library (Figure 1). In this work, we make the following contributions:
• A general method for deriving parallel I/O lower bounds of a broad range of linear algebra kernels.
• COnfLUX and COnfCHOX, provably near-I/O-optimal parallel algorithms for LU and Cholesky factorizations, with their full communication volume analysis.
• Open-source and fully ScaLAPACK-compatible implementations of our algorithms that outperform existing state-of-the-art libraries in almost all scenarios.

A bird's eye view of our work is presented in Figure 2.

BACKGROUND
We now establish the background for our theoretical model (Sections 3 to 5). We use it to derive parallel I/O lower bounds for Cholesky and LU factorizations (Section 6) that will guide the design of our communication-minimizing implementations (Section 7).

Machine Model
To model algorithmic I/O complexity, we start with a model of a sequential machine equipped with a two-level memory hierarchy. We then outline the parallel machine model.

Sequential machine.
A computation is performed on a sequential machine with a fast memory of limited size and unlimited slow memory. The fast memory can hold up to M elements at any given time. To perform any computation, all input elements must reside in fast memory, and the result is stored in fast memory.
Parallel machine. The sequential model is extended to a machine with P processors, each equipped with a private fast memory of size M. There is no global memory of unlimited size; instead, elements are transferred between processors' fast memories.

Input Programs
We consider a general class of programs that operate on multidimensional arrays. Array elements can be loaded from slow to fast memory, stored from fast to slow memory, and computed inside fast memory. These elements have versions that are incremented every time they are updated. We model the program execution as a computational directed acyclic graph (cDAG, details in Section 2.3), where (cf. Figure 3 for a summary) for each innermost loop iteration, statement S is an evaluation of some function f on t inputs; every input is an element of some array A_j, j = 1, ..., t, and the result of f is stored to the output array A_0. Each loop has an associated iteration variable ψ_l that iterates over its domain D_l. All ℓ iteration variables form the iteration vector ψ = [ψ_1, ..., ψ_ℓ]. Example: consider statement S1 of LU factorization (Figure 3): A[i,k] = A[i,k] / A[k,k]. The loop nest depth is ℓ = 2, with two iteration variables ψ_1 = k and ψ_2 = i forming the iteration vector ψ = [k, i]. For the access A[k,k], the access function vector φ = [k, k] is a function of only one iteration variable, k. Therefore, len(φ) = 2, but dim(φ(ψ)) = 1.
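To make the DAAP form concrete, the two LU statements discussed in this section can be written as an explicit loop nest. A minimal Python sketch (no pivoting, in-place; our own illustrative rendering, not the paper's code), with S1 the column update and S2 the trailing matrix update:

```python
import numpy as np

def lu_no_pivoting(A):
    """In-place LU factorization (no pivoting), written as the two
    DAAP statements S1 and S2 discussed in the text."""
    A = A.astype(float).copy()
    N = A.shape[0]
    for k in range(N):
        # S1: column update -- A[i,k] = A[i,k] / A[k,k]
        for i in range(k + 1, N):
            A[i, k] = A[i, k] / A[k, k]
        # S2: trailing matrix (Schur complement) update
        for i in range(k + 1, N):
            for j in range(k + 1, N):
                A[i, j] = A[i, j] - A[i, k] * A[k, j]
    return A  # strictly lower part holds L (unit diagonal), upper part holds U
```

Both statements access `A[i,k]`: S1 produces the value that S2 consumes, which is exactly the output overlap analyzed later in Section 4.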

I/O Complexity and Pebble Games
We now establish the relationship between DAAP and the red-blue pebble game, a powerful abstraction for deriving lower bounds and optimal schedules of cDAG evaluation.

cDAG and red-blue pebble game.
We base our computation model on the red-blue pebble game, played on the computational directed acyclic graph G = (V, E), as introduced by Hong and Kung [37]. Every vertex v ∈ V represents the result of a unique computation stored in some memory, and a directed edge (u, v) ∈ E represents a data dependency. Vertices without any incoming (outgoing) edges are called inputs (outputs). To perform a computation, i.e., to evaluate the value corresponding to vertex v, all vertices that are direct predecessors of v must be loaded into fast memory. The vertices that are currently in fast memory are marked by a red pebble on the corresponding vertex of the cDAG. Since the size of fast memory is limited, we can never have more than M red pebbles on the cDAG at any moment. Analogously, the contents of the slow memory (of unlimited size) are represented by an unlimited number of blue pebbles [37]. For any subset of vertices H ⊂ V, a dominator set Dom(H) is a set such that every path in the cDAG from an input vertex to any vertex in H must contain at least one vertex in Dom(H). In general, for a given H, its Dom(H) is not uniquely defined. The minimum set Min(H) is the set of all vertices in H that do not have any immediate successors in H. In this work, to avoid the ambiguity of the non-uniqueness of dominator set size (in principle, for any subset, a valid dominator set is always the whole V), we refer to Dom_min(H) as a minimum dominator set, i.e., a dominator set with the smallest size.

Dominator and Minimum Sets
Intuition. One can think of H's dominator set as the set of inputs required to execute subcomputation H, and of H's minimum set as the output of H. We use the notions of Dom_min(H) and Min(H) when proving I/O lower bounds. Intuitively, we bound the computation "volume" (the number of vertices in H) by its communication "surface", comprised of its inputs (vertices in Dom_min(H)) and outputs (vertices in Min(H)).
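Both definitions can be checked mechanically on a toy cDAG. A small Python sketch (the predecessor-map representation and the example vertex names are our own illustrative assumptions, not from the paper):

```python
def is_dominator(preds, H, D):
    """Check whether D is a dominator set of H: every path from a graph
    input to a vertex in H must pass through some vertex of D.
    preds maps each vertex to the list of its direct predecessors."""
    # Search backwards from H while avoiding D; reaching a graph input
    # without touching D exhibits a path that D fails to cut.
    stack = [v for v in H if v not in D]
    seen = set(stack)
    while stack:
        v = stack.pop()
        if not preds.get(v):   # reached a graph input, D is not a dominator
            return False
        for p in preds[v]:
            if p not in D and p not in seen:
                seen.add(p)
                stack.append(p)
    return True

def minimum_set(preds, H):
    """Min(H): vertices of H with no immediate successor inside H."""
    H = set(H)
    has_succ_in_H = {u for v in H for u in preds.get(v, []) if u in H}
    return H - has_succ_in_H
```

For the chain a→c, b→c, c→d with H = {c, d}, both {c} and {a, b} are dominator sets ({c} being minimum), and Min(H) = {d}.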
For a given cDAG and for any given X > M, let Π(X) denote the set of all its valid X-partitions, P(X) ∈ Π(X). Kwasniewski et al. prove that an I/O optimal schedule of G that performs Q load and store operations has an associated X-partition.

Deriving lower bounds.
To bound the I/O cost, we further need to introduce the computational intensity ρ. For each subcomputation H_i, ρ_i is defined as the ratio of the number of computations (vertices) in H_i to the number of I/O operations required to pebble H_i, where the latter is bounded by the size of the dominator set Dom(H_i) [45]. Then, the following lemma bounds the number of I/O operations required to pebble a given cDAG:

GENERAL SEQUENTIAL I/O LOWER BOUNDS
We now present our method for deriving the I/O lower bounds of a sequential execution of programs in the form defined in Section 2.2. Specifically, in Section 3.2 we derive I/O bounds for programs that contain only a single statement. In Section 4 we extend our analysis to capture interactions and reuse between multiple statements.
In this paper, we present only the key lemmas required to establish the lower bounds of parallel Cholesky and LU factorizations. However, the method covers a much wider spectrum of algorithms. For curious readers, we present all proofs of the provided lemmas in the appendix.
We start by stating our key lemma. Intuition: χ(X) expresses the computation "volume", while X is its input "surface". The term X − M bounds the required communication; it comes from the fact that not all inputs have to be loaded (at most M of them can already reside in fast memory and be reused). X_0 corresponds to the situation where the ratio of this "volume" to the required communication is minimized (corresponding to the highest lower bound).
Proof. Note that Lemma 1 is valid for any X (i.e., for any X, it gives a valid lower bound). Yet, these bounds are not necessarily tight. As we want to find tight I/O lower bounds, we need to maximize the lower bound. X_0 by definition minimizes ρ, as the solution of dρ/dX = 0; thus, it maximizes the bound. The key limitation is that it is not always possible to find χ, that is, to express |H| solely as a function of X. However, for many linear algebra kernels χ(X) exists. Furthermore, one can relax this problem while preserving the correctness of the lower bound, that is, by finding a function χ' such that ∀X: χ'(X) ≥ χ(X).
To find χ(X), we take advantage of the DAAP structure. Observe that every computation (and therefore, every compute vertex v ∈ V in the cDAG G = (V, E)) is executed in a different iteration of the loop nest, and thus there is a one-to-one mapping from a value of the iteration vector ψ to the compute vertex v. Moreover, each vertex accessed from any of the input arrays A_j is also associated with some iteration vector value; however, if dim(φ_j) < ℓ, this is a one-to-many relation, as the same input vertex may be used to evaluate multiple compute vertices v. This is, in fact, the source of data reuse, and exploiting this relation is key to minimizing the I/O cost. If for all input arrays A_j we have dim(φ_j) = ℓ, then for each compute vertex v, t different, unique input vertices are required; there is no data reuse, and this implies a trivial computational intensity ρ = 1/t. The high-level idea of our method is to count how many different iteration vector values ψ can be formed if we know how many different values each iteration variable ψ_1, ..., ψ_ℓ takes. We now formalize this in Lemmas 3-8.

Iteration vector, iteration domain, access set
Proof. Inequality 1 follows from a combinatorial argument: each computation in H is uniquely defined by its iteration vector ψ, where D_l(H) ∋ ψ_l is the set over which iteration variable ψ_l iterates during H.
Proof. We use the same combinatorial argument as in Lemma 4.

Knowing the number of different values each ψ_l takes, we bound the number of different access vectors φ_j(H). □ Example: Consider once more statement S1 from LU factorization in Figure 3. For the access φ_2 = (k, k), the iteration variable k is used twice. Recall that the access dimension is the minimum number of different iteration variables that uniquely address it (Section 2.2), so its dimension is dim(φ_2) = 1 and the only iteration variable needed to uniquely determine φ_2 is k. Therefore, |φ_2(H)| ≤ |D_k(H)|.
Dominator set. The input vertices of arrays A_1, ..., A_t form a dominator set of the compute vertices of A_0, because any path from graph inputs to any vertex in A_0 must include at least one vertex from A_1, ..., A_t. This is also the minimum dominator set because of the disjoint access property (Section 2.2): any path from graph inputs to any vertex in A_0 can include at most one vertex from A_1, ..., A_t.
Proof of Lemma 3. For subcomputation H, we have |⋃_{j=1..t} φ_j(H)| ≤ X (by the definition of an X-partition). Again, by the disjoint access property, the access sets are pairwise disjoint. We now want to maximize |H|, that is, to find χ and obtain the computational intensity ρ (Lemma 2). We now prove that to maximize |H|, inequalities 1 and 2 must be tight (become equalities). From the proof of Lemma 4 it follows that |H| is maximized when the iteration vector ψ takes all possible combinations of the iteration variables ψ_l ∈ D_l(H) during H. But, as we visit each combination of all ℓ iteration variables, for each access φ_j every combination of its iteration variables is visited as well, yielding |φ_j(H)| = ∏_l |D_l(H)|. This result is more general than, e.g., polyhedral techniques [11,15,51], as it does not require loop nests to be affine. Instead, it relies solely on set algebra and combinatorial methods.

Finding the I/O Lower Bound
Denoting H_max = argmax_{H ∈ P(X)} |H| as the largest subcomputation among all valid X-partitions, we use Lemma 3 and combine it with the dominator set constraint from Section 2.3.3. Note that all access set sizes are strictly positive integers: if any of the sets were empty, no computation could be performed. However, as we only want to find a bound on the number of I/O operations, we relax the integer constraints and replace them with |φ_j(H)| ≥ 1. Then, we formulate finding χ(X) (Lemma 2) as an optimization problem and find |H_max| = χ(X) as a function of X using the Karush-Kuhn-Tucker (KKT) conditions [43]. Denoting X_0 as the solution to Equation (4), we finally obtain the bound. Computational intensity and out-degree-one vertices. There exist cDAGs where every non-input vertex has at least g ≥ 0 direct predecessors that are input vertices with out-degree 1. We can use this fact to put an additional bound on the computational intensity.
Lemma 6. If in a cDAG G = (V, E) every non-input vertex has at least g direct predecessors with out-degree one that are graph inputs, then the maximum computational intensity ρ of this cDAG is bounded by ρ ≤ 1/g. Proof. By the definition of the red-blue pebble game, all inputs start in slow memory and therefore have to be loaded. By the assumption on the cDAG, to compute any non-input vertex v ∈ V, at least g input vertices need to have red pebbles placed on them using a load operation. Because these vertices do not have any other direct successors (their out-degree is 1), they cannot be used to compute any other non-input vertex u. Therefore, each computation of a non-input vertex requires at least g unique input vertices to be loaded. □ Example: Consider Figure 5. In a), each compute vertex y[i,j] has two input vertices: A[i,j] with out-degree 1, and x[j] with out-degree N; thus g = 1. As both array A and vector x start in slow memory (having blue pebbles on each vertex), for each computed vertex of y, at least one vertex from A has to be loaded; therefore ρ ≤ 1. In b), each computation needs two out-degree-one vertices, one from vector x and one from vector z, resulting in g = 2. Thus, ρ ≤ 1/2.
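Lemma 6's counting argument is easy to verify mechanically. A small Python sketch (the two example graphs and the names A, x, y, z follow our reading of Figure 5 and are illustrative assumptions):

```python
def outdeg_one_bound(preds):
    """Lemma 6 sketch: compute g, the minimum over all non-input vertices
    of the number of direct predecessors that are graph inputs with
    out-degree one; the computational intensity is then rho <= 1/g."""
    outdeg = {}
    for ps in preds.values():
        for p in ps:
            outdeg[p] = outdeg.get(p, 0) + 1
    inputs = {v for v, ps in preds.items() if not ps}
    return min(
        sum(1 for p in ps if p in inputs and outdeg.get(p) == 1)
        for ps in preds.values() if ps
    )

N = 3
# Case a): each compute vertex t[i,j] reads A[i,j] (out-degree 1)
# and x[j] (out-degree N, reused across i) -> g = 1, rho <= 1.
case_a = {("A", i, j): [] for i in range(N) for j in range(N)}
case_a.update({("x", j): [] for j in range(N)})
case_a.update({("t", i, j): [("A", i, j), ("x", j)]
               for i in range(N) for j in range(N)})

# Case b): elementwise y[i] = f(x[i], z[i]) -> g = 2, rho <= 1/2.
case_b = {("x", i): [] for i in range(N)}
case_b.update({("z", i): [] for i in range(N)})
case_b.update({("y", i): [("x", i), ("z", i)] for i in range(N)})
```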

DATA REUSE ACROSS MULTIPLE STATEMENTS
Until now, we have analyzed each statement separately. However, almost all computational kernels contain multiple statements connected by data dependencies, e.g., a column update (S1) and a trailing matrix update (S2) in LU factorization (Figure 3). The challenge here is that, in general, I/O cost is not composable: due to data reuse, the total I/O cost of the program may be smaller than the sum of the I/O costs of its constituent kernels. In this section we examine how these dependencies influence the total I/O cost of a program. We derive I/O lower bounds for programs with k statements S_1, ..., S_k in two steps. First, we analyze each statement S_i separately, as in Section 3. Then, we derive how many loads could at most be avoided during one statement if another statement owned shared data. There are two cases where data reuse can occur: I) input overlap, where shared arrays are inputs of multiple statements, and II) output overlap, where the output array of one statement is the input array of another.
Case I). Assume there are k statements in the program, and there are r arrays A_1, ..., A_r that are shared between at least two statements. We still evaluate each statement separately, but we subtract the upper bound on shared loads: Q ≥ ∑_{i=1..k} Q_i − ∑_{j=1..r} |R(A_j)|, where |R(A_j)| is the reuse bound on array A_j (Section 4.1). Case II). Consider each pair of "producer-consumer" statements S_a and S_b, that is, the output of S_a is the input of S_b. The I/O lower bound Q_a of statement S_a does not change due to the reuse, as the same number of loads has to be performed to evaluate S_a. On the other hand, it may invalidate Q_b, as the dominator set of S_b formulated in Section 3.1 may not be minimal: the inputs of a statement may not be graph inputs anymore. For each "consumer" statement S_b, we reevaluate Q'_b ≤ Q_b using Lemma 8. Finally, for a program consisting of k statements connected by output overlap, we have Q ≥ ∑_{i=1..k} Q'_i. Note that for each "producer" statement S_a, Q'_a = Q_a, i.e., output overlap does not change its I/O lower bound. We then ask whether any loads can be avoided in the common schedule if we add statement S_b, denoting the joint cDAG G_{a+b}.
Consider some subset of vertices in A_j which potentially could be reused, and denote it Θ_j. Now denote all vertices in A_0 (statement S_a) which depend on any vertex from Θ_j as Θ_a, and, analogously, the set Θ_b for statement S_b. Now consider these two subsets Θ_a and Θ_b separately. If Θ_a is computed before Θ_b, then it had to load all vertices from Θ_j, avoiding no loads compared to the schedule of S_a only. The computation of Θ_b may then benefit from some vertices of Θ_j which still reside in fast memory, avoiding up to |Θ_j| loads. This is an overapproximation of the actual reuse. Since finding the optimal schedule is PSPACE-complete [46], we conservatively assume that only the minimum number of loads from A_j is performed. Thus, Lemma 7 generalizes to any number of statements S_1, ..., S_k sharing array A_j: the total number of loads from A_j is lower-bounded by the maximum number of loads from A_j among S_1, ..., S_k, i.e., max_{i=1..k} |φ_j(H_i)|.

Case II: Output Reuse and Access Sizes
Consider the case where the output A_0 of statement S_a is also the input A_j of statement S_b. Furthermore, consider subcomputation H of statement S_b (and its associated iteration domain D(H)). Any path from the graph inputs to vertices in A_0(H) must pass through vertices in A_j(H). The following question arises: is there a smaller set of vertices A'_j(H), with |A'_j(H)| < |A_j(H)|, that every path from graph inputs to A_j(H) must pass through?
Let ρ_a denote the computational intensity of statement S_a. With that, we can state the following lemma. Similarly to Case I, this result also generalizes to multiple "consumer" statements that reuse the same output array of a "producer" statement, as well as to any combination of input and output reuse for multiple arrays and statements. Since the actual I/O optimal schedule is unknown, the general strategy to ensure the correctness of our lower bound is to consider each pair of interacting statements separately as one of these two cases. Since both Lemmas 7 and 8 overapproximate the reuse, the final bound may not be tight: the more inter-statement reuse, the more overapproximation is needed. Still, this method can be successfully applied to derive tight I/O lower bounds for many linear algebra kernels, such as matrix factorizations, tensor products, or solvers.

GENERAL PARALLEL I/O LOWER BOUNDS
We now establish how our method applies to a parallel machine with P processors (Section 2.1). Since we target large-scale distributed systems, our parallel pebbling model differs from the one introduced, e.g., by Alwen and Serbinenko [5], which is inspired by shared-memory models like PRAM [39]. Instead, we disallow sharing memory (pebbles) between the processors and enforce explicit communication, analogous to the load/store operations, using red and blue pebbles. This allows us to better match the behavior of real, distributed applications that use, e.g., MPI.
Each processor p_i owns a private fast memory that can hold up to M words, represented in the cDAG by at most M pebbles of p_i's color. Pebbles of different colors (belonging to different processors) cannot be shared between these processors, but any number of different pebbles may be placed on one vertex.
All the standard red-blue pebble game rules apply, with the following modifications: (1) compute: if all direct predecessors of vertex v have pebbles of p_i's color placed on them, one can place a pebble of p_i's color on v (no sharing of pebbles between processors); (2) communication: if a vertex v has any pebble placed on it, a pebble of any other color may be placed on this vertex.
From this game definition it follows that, from the perspective of a single processor p_i, any data is either local (the corresponding vertex has p_i's pebble placed on it) or remote, without a distinction on the remote location (remote access cost is uniform). Proof. Following the analysis of Section 3 and the parallel machine model (Section 5), the computational intensity ρ is independent of the number of parallel processors: it is solely a property of the cDAG and the private fast memory size M. Therefore, the result follows from Lemma 1.

I/O LOWER BOUNDS OF PARALLEL FACTORIZATION ALGORITHMS
We gather all the insights from Sections 2 to 5 and use them to obtain the parallel I/O lower bounds of the LU and Cholesky factorization algorithms, which we use to develop our communication-avoiding implementations.
Memory size. Clearly, M ≥ N^2/P, as otherwise the input cannot fit into the collective memory of all processors. Furthermore, in the following sections, we analyze the memory-dependent communication cost [15]. That is, following Solomonik et al. [61], we assume M ≤ N^2/P^{2/3}. This is an upper bound on the amount of memory per processor that can be efficiently utilized under the assumptions that 1) initially, the input is not replicated (every element of the input matrix A resides in exactly one location on one of the processors); and 2) every processor performs Θ(N^3/P) elementary computations.
For larger M, the communication cost transitions to the memory-independent regime [15]. All presented memory-dependent lower bounds and algorithmic costs can easily be translated to their memory-independent versions by plugging in the upper bound on the size of the usable memory.
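As a quick numeric illustration of the two regimes, the sketch below (our own helper, not part of the paper's code) clamps M to the usable range [N^2/P, N^2/P^{2/3}] and evaluates the leading communication term N^3/(P*sqrt(M)) together with the 2.5D replication factor c = PM/N^2:

```python
def lu_comm_volume(N, P, M):
    """Hedged sketch: per-processor communication volume of a 2.5D LU
    schedule, with leading term N^3 / (P * sqrt(M)) as in the text.
    M is clamped to the usable range [N^2/P, N^2/P^(2/3)]; beyond the
    upper end the cost is memory-independent."""
    M_min = N * N / P                # input must fit in aggregate memory
    M_max = N * N / P ** (2 / 3)    # memory-independent regime beyond this
    M_eff = max(M_min, min(M, M_max))
    c = P * M_eff / (N * N)         # replication factor of the 2.5D layout
    return N ** 3 / (P * M_eff ** 0.5), c
```

Quadrupling the usable memory doubles the replication factor squared (c goes from 1 to 4) and halves the leading communication term, until the clamp at M = N^2/P^{2/3} is reached.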

LU Factorization
In our I/O lower bound analysis we omit row pivoting, since swapping rows can increase the I/O cost by at most N^2, which is the cost of permuting the entire matrix, while the total I/O cost of the LU factorization is O(N^3/(P*sqrt(M))) [61]. LU factorization (without pivoting) contains two statements (Figure 3). Observe that we can use Lemma 6 (out-degree-one vertices) for statement S1. The loop nest depth is ℓ_1 = 2, with iteration variables ψ_1 = k and ψ_2 = i. The dimension of the access function vector (k,k) is 1, revealing potential for data reuse: every input vertex A[k,k] is accessed N − k − 1 times and used to compute the vertices A[i,k], k + 1 ≤ i < N. However, the access function vector (i,k) has dimension 2; therefore, every compute vertex has one direct predecessor with out-degree one, which is the previous version of element A[i,k]. Using Lemma 6, we therefore have ρ_1 ≤ 1.
We now proceed to statement S2: A[i,j] = A[i,j] − A[i,k] · A[k,j]. Observe that there is an output reuse (Section 4.2 and Figure 3, red arrow) of A[i,k] between statements S1 and S2; we therefore bound the corresponding access size in statement S2 using Lemma 8 (cf. Lemma 7). Note that in this case, where the computational intensity is ρ_1 ≤ 1, the output reuse does not change the access size |φ_2(H_2)| of statement S2. This follows the intuition that it is not beneficial to recompute vertices if the recomputation cost is higher than loading them from memory. Denoting H_2 as the maximal subcomputation for statement S2 over the subcomputation domain D, we apply Lemma 3 and solve the optimization problem from Section 3.2, whose optimum is attained for |D_k(H)| = |D_i(H)| = |D_j(H)| = sqrt(X/3), yielding χ(X) = (X/3)^{3/2}.
Then, we find X_0 that minimizes the intensity ρ_2(X) = (X/3)^{3/2}/(X − M), yielding X_0 = 3M. Plugging it into ρ_2(X), we conclude that the maximum computational intensity of S2 is bounded by ρ_2 ≤ sqrt(M)/2. We have thus bounded the maximum computational intensities ρ_1 and ρ_2, that is, the minimum number of I/O operations per computed vertex of statements S1 and S2. As the last step, we find the total number of compute vertices for each statement: |V_1| = N(N−1)/2 and |V_2| = N^3/3 + O(N^2). Previously, Solomonik et al. [61] established the asymptotic I/O bound for sequential execution, Q = O(N^3/sqrt(M)). Recently, Olivry et al. [51] derived a tight leading-term constant, Q ≥ 2N^3/(3*sqrt(M)). To the best of our knowledge, our result is the first non-asymptotic bound for parallel execution. The generalization from the sequential to the parallel bound is straightforward. Note, however, that this is only the case due to our pebble-based execution model, and it may thus not apply to other parallel machine models.
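The optimization of ρ_2 can be checked numerically. Assuming, as in the derivation above, χ(X) = (X/3)^{3/2}, a brute-force scan recovers the minimizer X_0 = 3M and the intensity bound sqrt(M)/2:

```python
import math

M = 1024

def rho2(X):
    """rho_2(X) = chi(X) / (X - M), with chi(X) = (X/3)^(3/2) compute
    vertices per X distinct elements touched (assumed from the text)."""
    return (X / 3) ** 1.5 / (X - M)

# Brute-force scan over integer X > M; the text derives X0 = 3M analytically.
X0 = min(range(M + 1, 20 * M), key=rho2)
bound = rho2(X0)   # the intensity bound, sqrt(M)/2
```

With |V_2| ≈ N^3/3 vertices in the trailing update, Q ≥ |V_2|/ρ_2 reproduces the 2N^3/(3*sqrt(M)) leading term quoted above.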

Cholesky Factorization
We proceed analogously to our derivation of the LU I/O bound; here we just briefly outline the steps. The algorithm contains three statements (Listing 1). For statements S1 and S2, we can again use Lemma 6 (out-degree-one vertices). For S1: L(k,k) = sqrt(L(k,k)), the loop nest depth is ℓ_1 = 1, we have a single iteration variable ψ_1 = k, and a single input array A_1 = L with the access function φ_1(ψ) = (k,k). Since there is only one iteration variable present in φ_1, we have dim(φ_1) = 1 = ℓ_1. Therefore, every compute vertex v has one direct predecessor, which is the previous version of element L(k,k). We conclude that ρ_1 ≤ 1 and |V_1| = N. For statement S2: L(i,k) = L(i,k) / L(k,k), we also have an output reuse of L(k,k) between statements S1 and S2. However, as with the output reuse considered in the LU analysis, the computational intensity is ρ_1 ≤ 1; therefore, it does not change the dominator set size of S2. We then use the same reasoning as for the corresponding statement S1 of LU factorization, yielding ρ_2 ≤ 1.
For statement S3, we derive its bound similarly to S2 of LU, with ρ_3 ≤ sqrt(M)/2 and |V_3| = N(N−1)(N−2)/6 ≈ N^3/6. Note that, compared to LU, the only significant difference is the iteration domain size |V_3|. Even though Cholesky has one statement more (the diagonal element update L(k,k)), its impact on the final I/O bound is negligible for large N.
Again, using Lemmas 1 and 9, we establish the Cholesky factorization's parallel I/O lower bound of N^3/(3P*sqrt(M)) elements. The derived I/O lower bound for a sequential machine (P = 1) improves the previous bound Q_chol ≥ N^3/(6*sqrt(M)) derived by Olivry et al. [51]. Furthermore, to the best of our knowledge, this is the first parallel bound for this kernel.
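For concreteness, the three Cholesky statements can be written as an explicit loop nest; a minimal Python sketch of the unblocked algorithm (our own rendering of Listing 1, which is not reproduced here):

```python
import numpy as np

def cholesky_daap(A):
    """Unblocked Cholesky written as the three DAAP statements discussed
    in the text (column-oriented, in-place on the lower triangle L)."""
    L = np.tril(A.astype(float))
    N = L.shape[0]
    for k in range(N):
        # S1: diagonal update
        L[k, k] = np.sqrt(L[k, k])
        # S2: column scaling
        for i in range(k + 1, N):
            L[i, k] = L[i, k] / L[k, k]
        # S3: trailing (Schur complement) update, ~N^3/6 vertices in total
        for i in range(k + 1, N):
            for j in range(k + 1, i + 1):
                L[i, j] = L[i, j] - L[i, k] * L[j, k]
    return L
```

Counting the S3 iterations reproduces |V_3| = N(N−1)(N−2)/6: the trailing update only touches the lower triangle, which is why Cholesky performs roughly half the work of the corresponding LU update.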

NEAR-I/O OPTIMAL PARALLEL MATRIX FACTORIZATION ALGORITHMS
We now present our parallel LU and Cholesky factorization algorithms. We start with the former, more complex algorithm, i.e., LU factorization. Pivoting in LU poses several performance challenges. First, since pivots are not known upfront, additional communication and synchronization are required to determine them in each step. Second, the nondeterministic pivot distribution between the ranks may introduce load imbalance in the computation routines. Third, to minimize communication, a 2.5D parallel decomposition must be used, i.e., parallelization along the reduction dimension. We address all these challenges with COnfLUX: a near Communication Optimal LU factorization algorithm using X-Partitioning.

LU Dependencies and Parallelization
Due to the dependency structure of LU, the input matrix is often divided recursively into four submatrices A00, A10, A01, and A11 [24,61]. Arithmetic operations performed in LU create noncommutative dependencies (Figure 6) between vertices in A00 (LU factorization of the top-left corner of the matrix) and A10 and A01 (triangular solves of the left and top panels of the matrix). Only A11 (the Schur complement update) has no such dependencies, and may therefore be efficiently parallelized in the reduction dimension. A high-level summary is presented in Algorithm 1.
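A single step of this recursive decomposition can be sketched in a few lines of numpy (sequential, no pivoting; the helper names are ours, not the paper's routines):

```python
import numpy as np

def unblocked_lu(A):
    """Sequential LU without pivoting; returns (unit-lower L, upper U)."""
    F = A.astype(float).copy()
    n = F.shape[0]
    for k in range(n):
        F[k + 1:, k] /= F[k, k]
        F[k + 1:, k + 1:] -= np.outer(F[k + 1:, k], F[k, k + 1:])
    return np.tril(F, -1) + np.eye(n), np.triu(F)

def lu_block_step(A, v):
    """One step of the blocked scheme from the text: factor A00,
    triangular-solve the panels A10 and A01, then form the Schur
    complement from A11 -- the only part free of ordering dependencies,
    hence the one parallelized along the reduction dimension of the
    2.5D decomposition."""
    A00, A01 = A[:v, :v], A[:v, v:]
    A10, A11 = A[v:, :v], A[v:, v:]
    L00, U00 = unblocked_lu(A00)
    L10 = np.linalg.solve(U00.T, A10.T).T   # A10 = L10 @ U00
    U01 = np.linalg.solve(L00, A01)         # A01 = L00 @ U01
    S = A11 - L10 @ U01                     # Schur complement
    return L00, U00, L10, U01, S
```

Recursing on the Schur complement S completes the factorization; in the distributed algorithm only the `S` update runs across all processor layers.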

LU Computation Routines
The computation is performed in N/v steps, where v is a tunable block size. In each step t, only the submatrix A_t of the input matrix A is updated. Initially, A_0 is set to A. A_t can be further viewed as composed of four submatrices A00, A10, A01, and A11 (see Figure 7). These submatrices are distributed and updated by the routines TournPivot, Factorize10, Factorize01, and Factorize11:
• A00. This v × v submatrix contains the first v elements of the current v pivot rows. It is computed during TournPivot and, as it is required to compute A10 and A01, it is redundantly copied to all processors.
• A10 and A01. Submatrices A10 and A01, of sizes (N − t·v) × v and v × (N − t·v), are distributed using a 1D decomposition among all processors. They are updated using a triangular solve. The 1D decomposition guarantees that there are no dependencies between processors, so no communication or synchronization is performed during this computation, as A00 is already owned by every processor.
• A11. This submatrix is distributed using a 2.5D, block-cyclic distribution (Figure 7). First, the updated submatrices A10 and A01 are broadcast among the processors. Then, A11 (the Schur complement) is updated. Finally, the first block column and the v chosen pivot rows are reduced, which will form A10 and A01 in the next iteration.
Block size v. The minimum size of each block is the number of processor layers c in the reduction dimension (v ≥ c = PM/N^2). However, to secure high performance, this value should be adjusted to the hardware parameters of a given machine (e.g., vector length, prefetch distance of a CPU, or warp size of a GPU). Throughout the analysis, we assume v = a · PM/N^2 for some small constant a.

Pivoting
Our pivoting strategy differs from state-of-the-art block [6], tile [4], or recursive [24] pivoting approaches in two aspects:
• To minimize I/O cost, we do not swap pivot rows. Instead, we keep track of which rows were chosen as pivots and use masks to update the remaining rows.
• To reduce latency, we take advantage of our derived block decomposition and use tournament pivoting [29].
Tournament pivoting. This procedure finds v pivot rows in each step, which are then used to mask which rows will form the new A01 and to filter the non-processed rows in the next step. It is shown to be almost as stable as partial pivoting [29]; stability might be an issue for, e.g., incremental pivoting [54]. On the other hand, it reduces the O(N) latency cost of partial pivoting, which requires step-by-step column reduction to find consecutive pivots, to O(N/v), where v is the tunable block size parameter.
Row swapping vs. row masking. To achieve close-to-optimal I/O cost, we use a 2.5D decomposition. This, however, implies that in the presence of extra memory, the matrix data is replicated c = P/P₁² times. This increases the cost of row swapping to the point where it asymptotically matches the I/O lower bound of the entire factorization; performing row swapping would therefore increase the constant of the leading-order term of the algorithm's I/O cost.
Pivoting in COnfLUX. In each step t of the outer loop (line 1 in Algorithm 1), √(P/c) processors perform a tournament pivoting routine using a butterfly communication pattern [55]. Each processor owns (N − tv)/√(P/c) rows, among which it chooses v local candidate pivots. Then, the final pivots are chosen in log(√(P/c)) "playoff-like" tournament rounds, after which all √(P/c) processors own both the v pivot row indices and the already factored, new A00. This result is distributed to all remaining processors (line 2). The pivot row indices are then used to determine which processors participate in the reduction of the current A01 (line 4). Then, the new A_t is formed by masking the currently chosen rows, A_t ← A_t[mask, tv : N] (line 12).
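The playoff structure of tournament pivoting can be sketched sequentially as follows; the chunking, the local candidate selection via partially pivoted elimination, and the pairwise merge schedule are illustrative assumptions rather than the library's actual butterfly implementation:

```python
import numpy as np

def select_candidates(block, idx, v):
    """Pick v candidate pivot rows from `block` (a rows x v panel) via
    Gaussian elimination with partial pivoting; returns global indices."""
    B = block.astype(float).copy()
    chosen, rows = [], list(range(B.shape[0]))
    for k in range(v):
        r = max(rows, key=lambda i: abs(B[i, k]))   # local partial pivot
        chosen.append(idx[r])
        rows.remove(r)
        for i in rows:                               # eliminate column k
            if B[r, k] != 0:
                B[i, k:] -= (B[i, k] / B[r, k]) * B[r, k:]
    return chosen

def tournament_pivot(panel, v):
    """Playoff-style tournament over row chunks of a tall panel: each
    'processor' proposes v candidates, winners are merged pairwise until
    v global pivots remain. Sequential, illustrative sketch."""
    chunks = [(panel[i:i+v], list(range(i, min(i+v, len(panel)))))
              for i in range(0, len(panel), v)]
    cands = [select_candidates(b, ix, min(v, len(ix))) for b, ix in chunks]
    while len(cands) > 1:                            # ~log2(#chunks) rounds
        merged = []
        for a, b in zip(cands[::2], cands[1::2]):
            ix = a + b
            merged.append(select_candidates(panel[ix], ix, v))
        if len(cands) % 2:
            merged.append(cands[-1])
        cands = merged
    return cands[0]
```

Each round halves the number of candidate sets, which is what replaces the O(N) step-by-step pivot search of partial pivoting with O(N/v) latency.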

I/O Cost of COnfLUX
We now prove the I/O cost of COnfLUX, which is only a factor of 1/3 higher than the obtained lower bound for large N.
Proof. We assume that the input matrix A is already distributed in the block-cyclic layout imposed by the algorithm. Otherwise, data reshuffling imposes only an O(N²/P) cost, which does not contribute to the leading-order term. We first derive the cost Q(t) of a single iteration t of the main loop of the algorithm, and then sum over all N/v iterations to obtain the total cost. Note that this cost is a factor of 1/3 over the lower bound established in Section 6.1. This is due to the fact that any processor can only maximally utilize its local memory in the first iteration of the outer loop. In this first iteration, a processor updates a total of √M × √M elements of A. In subsequent iterations, however, the local domain shrinks as fewer rows and columns are updated, which leads to an underutilization of the resources. Since the shape of the iteration space is determined by the algorithm, this behavior is unavoidable for P ≥ N²/M. Note that the bound is attainable by a sequential machine, however.

Cholesky Factorization
From a data-flow perspective, Cholesky factorization can be viewed as a special case of LU factorization without pivoting for symmetric, positive definite matrices. Therefore, our Cholesky algorithm, COnfCHOX, builds heavily on COnfLUX, using the same 2.5D parallel decomposition, block-cyclic data distribution, and analogous computation routines.
For both algorithms, the dominant cost, both in terms of computation and communication, is the A11 update. Due to the Cholesky factorization's iteration domain, which exploits the symmetry of the input matrix, the compute cost is halved, as only one half of the matrix needs to be updated. However, the input size required to perform this update is the same; therefore, the communication cost imposed by A11 is similar. We list the key differences between the two factorization algorithms in Table 1.
Parallel decomposition. Our experiments show that parallelization in the reduction dimension, while reducing communication volume, does incur performance overheads. This is mainly due to the increased communication latency, as well as the smaller buffer sizes used for local BLAS calls. Since formal modeling of the tradeoff between communication volume and performance is outside the scope of this paper, we keep the depth of parallelization in the third dimension as a tunable parameter, while providing heuristics-based default values.
Data distribution. COnfLUX and COnfCHOX provide ScaLAPACK wrappers by using the highly optimized COSTA algorithm [38] to transform the matrices between different layouts. In addition, they support the COSTA API for matrix descriptors, which is more general than ScaLAPACK's layout, as it supports matrices distributed in arbitrary grid-like layouts, processor assignments, and local block orderings.
Our benchmarks reflect real-world problems in scientific computing. The High-Performance Linpack benchmark uses a maximal size of N = 16,473,600 [62]. In quantum physics, matrix size scales with 2^(#qubits). In physical chemistry or density functional theory (DFT), simulations require factorizing matrices of atom interactions, yielding sizes ranging from N = 1,024 up to N = 131,072 [18,66]. In machine learning, matrix factorizations are used for inverting Kronecker factors [52], whose sizes are usually around N = 4,096. This motivates us to focus not only on exascale problems, but also on improving performance for relatively small matrices (N ≤ 100,000).
Communication models. Together with empirical measurements, we put significant effort into understanding the underlying communication patterns of the compared LU factorization implementations. Both MKL and SLATE are based on the standard partial pivoting algorithm using the 2D decomposition [10]. For CANDMC and CAPITAL, the models provided by the authors [33,61] are used. For COnfLUX and COnfCHOX, we use the results from Section 7. These models are summarized in Table 2.
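The "half the flops, same inputs" observation behind the gemm-vs-gemmt distinction can be sketched as follows (illustrative NumPy stand-ins, not the BLAS calls themselves):

```python
import numpy as np

def schur_update_full(A11, A10, A01):
    # LU-style update (gemm-like): the whole trailing matrix is touched.
    return A11 - A10 @ A01

def schur_update_symmetric(A11, A10):
    """Cholesky-style update (gemmt-like, illustrative): only the lower
    triangle is computed, halving the flops, yet the full panel A10 is
    still needed as input -- so the communicated volume stays the same."""
    out = A11.copy()
    for i in range(A11.shape[0]):
        # update only entries j <= i (lower triangle of the Schur complement)
        out[i, :i+1] -= A10[i] @ A10[:i+1].T
    return out
```

Both variants consume the same panel data; only the number of arithmetic operations differs, which is why the two factorizations share their communication cost.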

RESULTS
Our experiments confirm the advantages of COnfLUX and COnfCHOX in terms of both communication volume and time-to-solution over all other implementations tested. A significant communication reduction can be observed (up to 1.42 times for COnfLUX compared with the second-best implementation for P = 1,024). Moreover, the performance models predict even greater benefits for larger runs (an expected 2.1 times communication reduction for a full-machine run on the Summit supercomputer; Figure 8c). Most importantly, our implementations consistently outperform existing implementations (up to three times; Figures 1 and 9).
Communication volume. Fig. 8a presents the measured communication volume per node, as well as our derived cost models.
(Figure 1 legend: for each scenario, the fastest state-of-the-art library is marked S=SLATE [28], C=CAPITAL [33], or M=MKL [34]; right panel: COnfCHOX's achieved % of machine peak performance.)

Since both MKL and SLATE use similar 2D decompositions, their communication volumes are mostly equal, with a slight advantage for SLATE. In Fig. 8b, we show the weak-scaling characteristics of the analyzed implementations. Observe that for a fixed amount of work per node, the 2D algorithms, MKL and SLATE, scale suboptimally. Figure 8c summarizes the communication volume reduction of COnfLUX compared with the second-best implementation, both for measurements and theoretical predictions. It can be seen that for all combinations of N and P, COnfLUX always communicates the least. For all measured data points, the asymptotically optimal CANDMC performed worse than MKL or SLATE. The figure also presents the predicted communication cost of all considered implementations for up to N = 262,144 based on our theoretical models.
Performance. Our measurements show that both COnfLUX and COnfCHOX outperform all considered state-of-the-art libraries in almost all scenarios (Figures 1 and 11). Thanks to the optimized block data decomposition and efficient overlap of computation and communication, our implementations achieve high performance already on relatively small matrices (approx. 40% of hardware peak for cases where N²/P > 2^27). In cases where the local domain per processor becomes very small (N²/P < 2^27), our block decomposition does not add as much benefit, since the performance is mostly latency-bound rather than bandwidth-bound. This is visible not only in strong scaling (Figures 9 and 10, panels a and b), but also in weak scaling (panel c), where the input size per processor N²/P is constant. This is again caused by the latency overheads of scattering data between the 1D and 2.5D layouts.
However, as the local domains become larger and may be more efficiently pipelined and overlapped using asynchronous MPI routines and intra-node OpenMP parallelism, the advantage becomes significant (Figures 9 and 10). COnfLUX outperforms existing libraries by up to three times (for P = 4, N = 4,096; the second-best library is SLATE; Figure 1), and COnfCHOX achieves up to a 1.8 times speedup (e.g., P = 4, N = 4,096; second-best is again SLATE).
Implications for exascale. Both the communication models' predictions (Figure 8c) and the measured speedups (Figures 1 and 11) allow us to predict that, when running our implementations on exascale machines, we can expect further performance improvements over state-of-the-art libraries. Furthermore, throughput-oriented hardware, such as GPUs and FPGAs, may benefit even more from the communication reduction of our schedules. Thus, COnfLUX and COnfCHOX not only outperform the state-of-the-art libraries at relatively small scales, which are the most common use cases in practice [18,52,66], but also promise speedups on full-scale performance runs on modern supercomputers.

RELATED WORK
Previous work on I/O analysis can be categorized into three classes (see Table 3): work based on direct pebbling or variants of it, such as Vitter's block-based model [64]; works using geometric arguments based on projections and the Loomis-Whitney inequality [47]; and works applying optimizations limited to specific structural properties, such as affine loops [27] and, more generally, the polyhedral model program representation [9,48,51]. Although the scopes of those approaches significantly overlap (for example, kernels like matrix multiplication can be captured by most of the models), there are important differences both in methodology and in the end results they provide, as summarized in Table 3.
Dense linear algebra operators are among the standard core kernels in scientific applications. Ballard et al. [8] present a comprehensive overview of their asymptotic I/O lower bounds and I/O-minimizing schedules, both for sparse and dense matrices.

CONCLUSIONS
In this work, we present a method for analyzing the I/O cost of DAAP, a general class of programs that covers many fundamental computational motifs. We show, both theoretically and in practice, that our pebbling-based approach for deriving I/O lower bounds is more general: programs with disjoint array accesses cover a wide variety of applications; more powerful: it can explicitly capture inter-statement dependencies; more precise: it derives tighter I/O bounds; and more constructive: the X-partition provides powerful hints for obtaining parallel schedules.
When applying the approach to LU and Cholesky factorizations, we derive new lower bounds as well as new, communication-avoiding schedules. Not only do they communicate less than state-of-the-art 2D and 3D decompositions (by a factor of up to 1.6×), but, most importantly, they outperform existing commercial libraries over a wide range of problem parameters (up to 3× for LU, up to 1.8× for Cholesky). Finally, our code is openly available, offering full ScaLAPACK layout compatibility.

Figure 2 :
Figure 2: From the input program through the I/O lower bounds to communication-minimizing parallel schedules and high-performing implementations. In this paper, we mainly focus on the Cholesky and LU factorizations. The proofs of the lemmas presented in this work can be found in the AD/AE appendix.

Figure 3 :
Figure 3: In-place LU factorization (for simplicity, no pivoting is performed). The algorithm contains two statements (S1 and S2), for which we provide key components of our program representation, together with the corresponding cDAG for N = 4. For statement S2, we also provide a graphical visualization of a single subcomputation H in its X-partition.

Lemma 1 .
(Lemma 4 in [45]) For any constant X, the number of I/O operations Q required to pebble a cDAG G = (V, E) with |V| = n vertices using S red pebbles is bounded by Q ≥ n/ρ, where ρ = |V_max|/(X − S) is the maximal computational intensity and V_max = arg max_{V_i ∈ P(X)} |V_i| is the largest subcomputation among all valid X-partitions.

Lemma 2 .
If |V_max| can be expressed as a closed-form function of X, that is, if there exists some function f such that |V_max| = f(X), then the lower bound on Q can be expressed as Q ≥ n(X₀ − S)/f(X₀), where X₀ = arg min_X ρ = arg min_X f(X)/(X − S).
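As a worked illustration of Lemma 2, consider a hypothetical cDAG whose largest subcomputation satisfies f(X) = X^{3/2} (this f is our assumption for the example, not one derived in the paper):

```latex
% Minimize the computational intensity rho(X) = f(X) / (X - S):
\rho(X) = \frac{X^{3/2}}{X - S},
\qquad
\frac{d\rho}{dX} = 0
\;\Longrightarrow\;
\tfrac{3}{2}X^{1/2}(X - S) - X^{3/2} = 0
\;\Longrightarrow\;
X_0 = 3S.
% Substituting X_0 back into the bound of Lemma 2:
Q \;\ge\; \frac{n(X_0 - S)}{f(X_0)}
  \;=\; \frac{2nS}{(3S)^{3/2}}
  \;=\; \frac{2n}{3\sqrt{3S}}.
```

The mechanical recipe is always the same: differentiate ρ, solve for the optimal partition parameter X₀, then substitute back to obtain a closed-form lower bound in terms of the fast memory size S.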

Figure 4 :
Figure 4: Lemma 3 bounds the set sizes (both the subcomputation's |H| and the input access sets' |A_j(H)|) by the number of values |D_t| each iteration variable φ_t takes during the subcomputation.

Lemma 8. Corollary 1.
Any dominator set of the access set A_j(D) must be of size at least |Dom(A_j(D))| ≥ |A_j(D)|/ρ.
Proof. By Lemma 1, for one loaded vertex, we may compute at most ρ vertices of V₀. These are also vertices of V. Thus, to compute |A_j(D)| vertices, at least |A_j(D)|/ρ loads must be performed. We only need to show that at least that many vertices have to be in any dominator set Dom(A_j(D)). Consider the converse: there is a vertex set W = Dom(A_j(D)) such that |W| < |A_j(D)|/ρ. But that would mean that we could compute all |A_j(D)| vertices by loading only |W| vertices, violating Lemma 1. □
Combining Lemmas 8 and 3, the data access size |A_j(D)| during subcomputation D is bounded accordingly.

Lemma 9 .
The minimum number of I/O operations in a parallel pebble game, played on a cDAG with |V| vertices by P processors, each equipped with S pebbles, is Q ≥ |V|/(P·ρ), where ρ is the maximum computational intensity, which is independent of P (Lemma 1).

Figure 6 :
Figure 6: LU factorization cDAG for N = 4 with the logical decomposition into A00, A10, A01, and A11. Dashed arrows represent commutative dependencies (reduction of a value). Solid arrows represent noncommutative operations, meaning that any parallel pebbling has to respect the induced order (e.g., no vertex in A11 can be pebbled before A00 is pebbled).

To keep the I/O cost of our algorithm as low as possible, instead of performing row swapping, we only propagate pivot row indices. When tournament pivoting finds the v pivot rows, they are broadcast to all processors with only O(v) cost per step.

Lemma 10. The total I/O cost of COnfLUX, presented in Algorithm 1, is Q_COnfLUX = N³/(P√M) + O(N²/P), which is only a factor of 1/3 higher than the obtained lower bound for large N.

Figure 8 :
Figure 8: Communication volume measurements across different scenarios for MKL, SLATE, CANDMC, and COnfLUX. In all considered scenarios, enough memory (M ≥ N²/P^(2/3)) was present to allow for the maximum number of replications c = P^(1/3).

Figure 9 :
Figure 9: Achieved % of peak performance for LU factorization. We show medians and 95% confidence intervals.

ways to uniquely choose the iteration vector in H. □
Now, given H, we want to assess how many different vertices are accessed for each input array A_j. Recall that this number is denoted as the access size |A_j(H)|. We apply the same combinatorial reasoning to A_j(H): each of the t = 1, ..., dim(A_j) iteration variables loops over a set D_{φ_t} during subcomputation H. We can thus bound the size of A_j(H) similarly to Lemma 4.
Lemma 5. The access size |A_j(H)| of subcomputation H (the number of vertices from the array A_j required to compute H) is bounded by the sizes of the dim(A_j) iteration variables' sets D_{φ_t}, t = 1, ..., dim(A_j): |A_j(H)| ≤ ∏_{t=1}^{dim(A_j)} |D_{φ_t}|.
Lemma 3 states that if each iteration variable φ_t, t = 1, ..., ℓ, takes |D_t| different values, then there are at most ∏_{t=1}^{ℓ} |D_t| different iteration vector values that can be formed in H. Thus, to maximize |H|, all combinations of values of the φ_t should be evaluated. On the other hand, this also implies the maximization of all access sizes |A_j(H)|, as every combination of the dim(A_j) iteration variables is also visited. Therefore, for every j, each access size |A_j(D)| is maximized (Lemma 5): as access functions are injective, for each combination of the dim(A_j) iteration variables there is one access to A_j.
4.1 Case I: Input Reuse and Reuse Size. Consider two statements S1 and S2 that share one input array A_j. Let |A_j(S1)| denote the total number of accesses to A_j during the I/O-optimal execution of a program that contains only statement S1. Naturally, |A_j(S2)| denotes the same for a program containing only S2. Define the reuse size R(A_j) = min{|A_j(S1)|, |A_j(S2)|}. We then have:
Lemma 7.
The I/O cost of a program containing statements S1 and S2 that share the input array A_j is bounded by Q_tot ≥ Q1 + Q2 − R(A_j), where Q1 and Q2 are the I/O costs of a program containing only statement S1 or S2, respectively.
Proof. Consider an optimal sequential schedule of a cDAG G1 containing statement S1 only. For any subcomputation H and its associated iteration domain, its minimum dominator set is Dom(H) = ⋃_j A_j(H). To compute H, at least Σ_j |A_j(H)| − R vertices have to be loaded, as only R vertices can be reused from previous subcomputations.
The total number of avoided loads is bounded by the number of loads from A_j which are shared by both S1 and S2. Because statement S1 loads at most |A_j(S1)| vertices from A_j during an optimal schedule of G1, and S2 loads at most |A_j(S2)| of them for G2, the upper bound on shared, and possibly avoided, loads is R(A_j) = min{|A_j(S1)|, |A_j(S2)|}. Now, how do we find |A_j(S1)| and |A_j(S2)|? Observe that |A_j(S1)| is a property of G1, that is, the cDAG containing statement S1 only. Denote the I/O-optimal schedule parameters of G1 by X₀ and |A_j(H(X₀))| (Section 3.2); similarly for G2. We then derive: 1) at least how many subcomputations the optimal schedule has: s ≥ |V|/|H(X₀)|; and 2) at least how many accesses to A_j are performed per optimal subcomputation, |A_j(H(X₀))|. □
The P processors are decomposed into the 3D grid [√(P/c), √(P/c), c]. We refer to all processors that share the same second and third coordinates as [:, j, k]. We now examine each of the 11 steps in Algorithm 1.
Step 2. Processors with coordinates [:, t mod √(P/c), t mod c] perform the tournament pivoting. Every processor owns the first v elements of (N − (t − 1)v)/√(P/c) rows, among which they choose the next v pivots. First, they locally perform an LUP decomposition to choose the v local candidate rows. Then, in ⌈log₂(√(P/c))⌉ rounds, they exchange v × v blocks to decide on the final pivots. After the exchange, these processors also hold the factorized submatrix A00. I/O cost per processor: v²⌈log₂(√(P/c))⌉. The factorized A00 and the v pivot row indices are then broadcast, and the first v columns and v pivot rows are scattered to all P processors. I/O cost per processor: v² + v + 2(N − tv)v/P.
Steps 1 and 5. The first v columns and v pivot rows are reduced. With high probability, pivots are evenly distributed among all processors. There are c layers to reduce, each of size (N − tv)v. The updates Factorize10, Factorize01, and Factorize11 are local and incur no additional I/O cost.
Steps 8 and 10. The factorized A10 and A01 are scattered among all processors; each processor requires the panel elements needed for its local Schur complement update.
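As a numerical sanity check of the summation argument, a simplified per-iteration model recovers the N³/(P√M) leading term; the 2(N − tv)v/√(Pc) per-iteration cost and the choice of maximum replication c = PM/N² are our modeling assumptions for illustration:

```python
import math

def total_io(N, v, P, c):
    """Sum a simplified per-iteration communication model: assume the
    dominant cost of iteration t is broadcasting the A10/A01 panels,
    roughly 2*(N - t*v)*v / sqrt(P*c) elements per processor.
    (Illustrative model; constants are assumptions, not the paper's proof.)"""
    return sum(2 * (N - t * v) * v / math.sqrt(P * c)
               for t in range(N // v))

# With maximum replication c = P*M/N^2, the sum collapses to ~N^3/(P*sqrt(M)).
N, v, P = 4096, 32, 64
M = 2 ** 20                  # local memory size in elements (assumed)
c = P * M // (N * N)         # = 4 for these parameters
predicted = N ** 3 / (P * math.sqrt(M))
```

The shrinking factor (N − tv) in the summand is exactly the shrinking local domain discussed in the proof, and summing it over N/v iterations is what produces the 1/3 gap to the lower bound.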

Table 1 :
Table 1: Comparison of the implemented LU and Cholesky factorizations. Even though Cholesky performs half as many computations (the use of gemmt instead of gemm in the A11 update), it communicates the same amount of data, since the number of elements needed to perform gemm and gemmt is the same.
Our algorithms are implemented in C++, using MPI for inter-node communication. For static communication patterns (e.g., column reductions), we use dedicated, asynchronous MPI collectives. For runtime-dependent communication (e.g., pivot index distribution), we use MPI one-sided communication [31]. For intra-node tasks, we use OpenMP and local BLAS calls (provided by Intel MKL [34]) for computations. Our code is available as an open-source git repository¹.

Table 2 .
Problem sizes. We evaluate the algorithms starting from 2 compute nodes (4 MPI ranks) up to 512 nodes (1,024 ranks). For each node count, matrix sizes range from N = 2,048 to N = 2¹⁹ = 524,288, provided they fit into the allocated memory (e.g., LU or Cholesky factorization of a double-precision input matrix of dimension 262,144 × 262,144 cannot be run on fewer than 32 nodes). Runs in which none of the libraries achieved more than 3% of the hardware peak are discarded, since by adding more nodes the performance starts to deteriorate.

[Table 3 rows, scope: ranging from general cDAGs and programs, to approaches exploiting the geometric structure of the iteration space, to methods individually tailored for a given problem.]

Table 3 :
Overview of different approaches to modeling data movement.