Communication-Optimal Convolutional Neural Nets

James Demmel, Grace Dinh

Introduction

Convolutional neural networks are a bottleneck in many machine learning applications, and as such must be efficiently implemented on modern architectures. To do so, it is important to understand where most of the time (and energy) goes when executing a program on a current architecture. There are two costs to consider: arithmetic and communication, i.e. moving data, either between levels of a memory hierarchy or between processors over a network. The cost to move one word of data can be orders of magnitude larger than the cost to perform one arithmetic operation, and this difference in cost is growing over time, following technological trends [GSP05, FM11]. Avoiding communication has long motivated optimization efforts (many of which have in fact managed to attain known communication lower bounds) in numerical linear algebra, resulting in tuned libraries (e.g. the BLAS and LAPACK) that attain a high fraction of machine peak. We seek to extend this optimization approach to CNNs.

In this paper, we consider one phase of CNNs, which can be written most simply as seven nested loops combining a four-dimensional $Image$ array and a four-dimensional $Filter$ array to compute a four-dimensional $Out$ array, which may be most simply stated as followsWe ignore boundary conditions in this paper and focus solely on asymptotic optimization.:

We consider all possible ways to reorganize this computation, performing the same arithmetic operations in an arbitrary order, and ask which order minimizes communication. We use a simple sequential architectural model where there is one memory large enough to hold all the input and output data, and a smaller cache of size $M$ where data needs to reside to perform arithmetic; we wish to minimize the number of words moved between the large memory and cache (we also show how to generalize this simple architectural model to more complicated and realistic ones).

Our first contribution is to prove new communication lower bounds that hold for all possible loop bounds on the seven nested loops, all strides, and all cache sizes $M$ . Our second contribution is to show how to attain these lower bounds in all cases, using appropriate loop reorganizations and tilings.

Both the lower bounds and optimal tilings are more complicated than, say, those for matrix multiplication because the seven loop bounds and two stride values lead to many more possible situations than the three loop bounds of matrix multiplication (which we will briefly review for contrast). Not all loop bounds and strides may be used in practice, but we consider them all for completeness.

Fortunately, the lower bound can be tersely written as the maximum of five simple algebraic expressions in the seven loop bounds and cache size $M$ :

(see Theorem 1 in section 4). Any one of these five expressions may be much larger than the others, depending on the loop bounds, the stride values, and $M$ . Notice that in many common cases (small filter, arrays too big to fit entirely in cache), the fifth term in the above expression is the maximum. If we were only able to achieve as much data reuse as in conventional $O(n^{3})$ matrix multiplication (or the many other dense linear algebra operations for which matrix multiply is a bottleneck), this term would be equal to $KCHWBRS/M^{1/2}$ . In contrast, we see our bound is smaller (better) than this by a factor $\min(M^{1/2},(RS/\sigma_{h}\sigma_{w})^{1/2})$ . The proof (which can be skipped on a first reading) uses techniques from functional analysis, group theory and lattice theory.

The optimal tilings that attain these lower bounds lead to a large number of cases. We determine these cases by formulating the problem of finding an optimal tiling as an optimization problem, where we want to choose loop tile sizes that both fit in fast memory and maximize the amount of work that can be done. By taking logarithms, this becomes a linear program over the tiling parameters, with constraints restricting the tiling parameters to feasible tilings that fit inside memory. We show in section 5 that for all possible values of the loop bounds and strides, there is a feasible solution of this linear program that attains the corresponding lower bound. A practical implementation of our result might be done either by formulating and solving the linear program, or by using the cases precomputed using the method described in Section 5. For certain sets of parameters found in real-world neural networks such as AlexNet, our algorithms can produce an integer-factor reduction in the communication cost over a matrix multiply-based approach for sufficiently small (L1-L2 cache) values of $M$ .

The remainder of this paper is organized as follows. Section 2 describes the seven-nested loop version of a CNN in more detail, and the simplifications we make for the purpose of our analysis. Section 3 briefly reviews lower bounds and optimal algorithms for matrix multiplication, to set the stage for our more complicated analysis of CNNs. Section 4 presents our new lower bounds, and Section 5 presents the matching upper bounds, i.e. optimal algorithms. We extend our analysis to pooling in Section (6).

CNN Model

As stated in the Introduction, we consider the following CNN computation:

where $Image$ has dimensions $(\sigma_{w}W+R)\times(\sigma_{h}H+S)\times C\times B$ , $Out$ has dimensions $K\times H\times W\times B$ , and $Filter$ has dimensions $K\times R\times S\times C$ . $B$ is the number of images, $C$ is the number of input channels, $K$ is the number of output channels, $W$ and $H$ are the width and height of the output image, $R$ and $S$ are the sizes of one convolution, $\sigma_{w}$ is the stride size in the $w$ dimension, and $\sigma_{h}$ is the stride size in the $h$ dimension. We assume that the filter size is smaller than the input image size, i.e. $R\leq\sigma_{w}W$ and $S\leq\sigma_{h}H$ , and typically, they are much smaller, though our analysis does not require this. We also assume that $\sigma_{w}\leq R$ and $\sigma_{h}\leq S$ (i.e. all elements of $Image$ are used in the computation); we can reduce any problem onto one where this assumption holds using no more communication than is necessary to read the useful elements in Image. Thus $Filter$ has total size $KCRS$ , $Out$ has total size $KHWB$ , and $Image$ has total size $C(\sigma_{h}H+S)(\sigma_{w}W+R)B$ , which is usually close to, and at most four times, $CHWB\sigma_{h}\sigma_{w}$ . These three array sizes will appear in our communication bounds. We will use the expression $CHWB\sigma_{h}\sigma_{w}$ to simplify our algebra later, since we are only interested in the asymptotics.

Review of Matrix Multiplication

In this section we review the well-known case of matrix multiplication (matmul), both to compare to our analogous but more complicated result for CNNs, and because the lower bound for matmul is used in the CNN lower bound proof. We consider only “classical” matmul, i.e. the algorithm that does $mnk$ multiplies and adds to multiply an $m$ -by- $n$ matrix times an $n$ -by- $k$ matrix (we discuss Strassen-like algorithms briefly below). To keep it simple, we consider $n$ -by- $n$ times $n$ -by- $k$ matmul $C=A*B$ , where the matrix $A$ and $B$ originally reside in main memory, the result $C$ is stored in main memory at the end of execution, $1\leq k\leq n$ , and the cache has size $M$ . In this case, the attainable lower bound on the number $W_{MM}$ of words moved between main memory and cache is

The $n^{2}$ term arises because it is obviously necessary to read the input matrices from main memory to cache at least once, and write the output matrix to main memory at least once. The $n^{2}k/M^{1/2}$ term is the interesting one, dominating $n^{2}$ for large enough $k$ , and a decreasing function of $M$ . In the square $n=k$ case, it is attained by the well-known tiling approach, i.e. breaking matrices $A$ , $B$ and $C$ into square submatrices of dimension $(M/3)^{1/2}$ , so one submatrix of $A$ , $B$ and $C$ can all fit in cache simultaneously. The tiled algorithm then loops over tiles, multiplying two tiles of $A$ and $B$ and updating one tile of $C$ . As $k$ decreases, the same tiling approach works until $k=(M/3)^{1/2}$ , at which point one tile just fits in the $n$ -by- $k$ matrices $B$ and $C$ , and the two terms $n^{2}$ and $n^{2}k/M^{1/2}$ in $W_{MM}$ become equal (to within a modest constant factor). As $k$ decreases further, the $W_{MM}$ remains equal to $n^{2}$ , and this is attained by continuing to use an $(M/3)^{1/2}$ -by- $(M/3)^{1/2}$ tile for $A$ , but $(M/3)^{1/2}$ -by- $k$ tiles for $B$ and $C$ . We will see an analogous, but more complicated transitioning of optimal tilings for CNNs.

The lower bound $n^{2}k/M^{1/2}$ was first derived for sequential classical matmul in [HK81], generalized to parallel implementations in [ITT04], and to classical linear algebra more generally in [BDHS11]. We will apply a further generalization of these bounds [CDK+13a, CDK+15, Kni15] that apply to more general nested loops accessing arrays to CNNs in section 4. The case of Strassen-like matmul was addressed in [BDHS12, Sco15].

Communication Lower Bounds

In this section, we state and prove our communication lower bound for the CNN in (1). We first state the bound for the following basic memory model, and then show how to generalize it to other models, following [BDHS11]. We assume the input data initially resides in a large main memory, and that at the end of the computation the answer also resides in the main memory. The main memory is attached to a cache of smaller size $M$ , to which data can be loaded, and from which data can be stored back to main memory. Arithmetic can only be performed on data in cache, with the operands and result of an operation needing to fit in cache. Our goal is to find a lower bound the number of loads and stores needed to complete the algorithm.

In the simplest case, when $M$ is large enough to hold all the inputs and outputs, an optimal algorithm would move all the inputs from main memory to cache, perform the algorithm without any more loads or stores, and store the answer back to main memory at the end. This would attain the trivial lower bound on the number of loads and stores, equal to the size of all the inputs plus the size of all the outputs. The interesting case is when $M$ is not large enough to do this.

In this case, following the approach of [CDK+13b], we will proceed as follows: consider the algorithm as a sequence of instructions, including load instructions that transfer data from slow to fast memory, store instructions that transfer data from fast to slow memory, and $F$ arithmetic operations. Break the sequence into $R$ rounds of consecutive instructions, with each round containing exactly $M$ load and store operations (with the possible exception of the last); this provides a $2M$ upper bound on the amount of data that can be used as input by operations in a single bound - the $M$ words already available at the beginning, and the at most $M$ words loaded into fast memory at the beginning of a round. If we can show that at most $G$ operations can be computed with $2M$ inputs and outputs (i.e. in one round), then the number of rounds must be

and the number of words transferred in the execution must be

This approach can be used for more general memory models. For example, we can bound the communication between two consecutive levels of a multilevel memory hierarchy by treating the smaller and faster of the two levels as the “fast memory” and everything slower and larger than it as “slow memory”. For distributed memory parallel computations, following the approach of [ITT04, BDHS11, CDK+13b, Kni15], we can bound the memory traffic in and out of any node by treating the memory on that node as the “fast memory”, and the memory on all the other processors as ”slow memory”.

Any execution order of (1) moves $W_{CNN}$ words between main memory and a cache of size $M$ , where

To provide some intuition for this, note that the first three terms in the $\max()$ are the sizes of the output and inputs $Out$ , $Image$ and $Filter$ respectively. The fourth term uses the results in [CDK+13a, CDK+15], which apply to general loop nests accessing arrays, though we will see that more work is required to apply these results concretely. This lower bound has the same power of $M$ in the denominator as the direct n-body problem. The fifth term is new, and is larger than the fourth term if and only if $RS<M\sigma_{W}\sigma_{H}$ , which is a common case.

2 Proof of the lower bound BCKWHRS/M𝐵𝐶𝐾𝑊𝐻𝑅𝑆𝑀BCKWHRS/M

is the bound we seek. A set $(s_{1},s_{2},s_{3})$ satisfies (5) for all $V$ if and only if they satisfy the linear inequalities

It will turn out that the minimal value of $\sum_{i=1}^{3}s_{i}$ is 2, leading to $G=M^{2}$ , and using (3) the desired lower bound of

Suppose $A=\{A_{1},...,A_{n}\}$ and $B=\{B_{1},...,B_{m}\}$ are finite sets of subgroups of an abelian group. We will call them independent if

Let ${\rm lattice}(A)$ denote the lattice generated by the subgroups in $A$ , and similarly for ${\rm lattice}(B)$ ; recall that we will add $\{0\}$ to these lattices if they do not already contain it. Then from the definition of a lattice it is easy to see that ${\rm lattice}(A)$ and ${\rm lattice}(B)$ are independent if $A$ and $B$ are independent, in which case we have

Suppose $A$ and $B$ are independent. Then

Proof: $A\cup B\subset{\rm lattice}(A)+{\rm lattice}(B)$ since both lattices include $\{0\}$ . It suffices to show that if $A_{1}+B_{1}$ and $A_{2}+B_{2}$ are both in ${\rm lattice}(A)+{\rm lattice}(B)\subseteq{\rm lattice}(A\cup B)$ , then so are their sum and intersection. The sum $(A_{1}+B_{1})+(A_{2}+B_{2})=(A_{1}+A_{2})+(B_{1}+B_{2})\in{\rm lattice}(A)+{\rm lattice}(B)$ follows from being abelian, and a lattice being closed under addition. The intersection $(A_{1}+B_{1})\cap(A_{2}+B_{2})=(A_{1}\cap A_{2})+(B_{1}\cap B_{2})$ follows from independence, and a lattice being closed under intersection. $\Box$

The advantage of this is that if we have simple descriptions of ${\rm lattice}(A)$ and ${\rm lattice}(B)$ (say finite lists of each), then it is easy to describe ${\rm lattice}(A\cup B)$ .

More generally, let $C_{i}=\{C_{i,1},...,C_{i,n(i)}\}$ be a set of $n(i)$ subgroups, for $i=1,...,m$ . We call them independent if for all $i$

This is a natural generalization of the case where each $C_{i,j}$ is a vector space. Then, as above the ${\rm lattice}(C_{i})$ are independent, and

and if we have finite lists of the members of each ${\rm lattice}(C_{i})$ , we can also list the members of ${\rm lattice}(\cup_{i}C_{i})$ .

Now suppose we start with some sets of the form

Each of the above sums may be over a subset of the possible values of $i$ . In other words, each $D_{j}$ is gotten by choosing at most one member of each $C_{i}$ , and adding them. Later we will choose these $D_{j}$ to be the kernels of the projections $\phi_{i}$ in the CNN. Then for any such $D_{1},\cdots,D_{k}$ , we have

Now we tie this to code for CNNs. The projections $\phi_{i}$ and their kernels are

Our goal is a finite list of subgroups $H$ that is a superset of ${\rm lattice}(K)$ , where

and where for each such $H$ we can write down the inequality (7). Then by [CDK+15, Val10], solving the linear program that minimizes $\sum_{i=1}^{3}s_{i}$ subject to these constraints will give us our desired bound $G=M^{\sum_{i=1}^{3}s_{i}}$ .

which we identify with $D_{1}$ , $D_{2}$ and $D_{3}$ in (9) above. So by (10), all we need are finite lists of members of each ${\rm lattice}(C_{i})$ to write down a finite list of subgroups $H$ containing ${\rm lattice}(K)$ . Since each $C_{i}$ is small, it is easy to confirm the following facts:

where we have added $\{0\}$ to ${\rm lattice}(C_{1})$ , ${\rm lattice}(C_{4})$ and ${\rm lattice}(C_{5})$ . Since the cardinalities of these five lattices are 2, 5, 5, 2 and 2, respectively, the number of possibly different subgroups in $\sum_{i}{\rm lattice}(C_{i})$ in (10) is at most $2\cdot 5\cdot 5\cdot 2\cdot 2=200$ .

It turns out that we only need $1+4+4+1+1=11$ subgroups, or more generally $\sum_{i}(|C_{i}|-1)$ , not $\prod_{i}|C_{i}|$ . This simplification is a generalization of the “Product Case” in section 6.3 of [CDK+13a]. The idea is that if $H\in\sum_{i}{\rm lattice}(C_{i})$ , so that $H=\sum_{i}C_{i,j(i)}$ where $C_{i,j(i)}\in{\rm lattice}(C_{i})$ , then by independence of the $C_{i}$ we get ${\rm rank}(H)=\sum_{i}{\rm rank}(C_{i,j(i)})$ , and we also get ${\rm rank}(\phi_{k}(H))=\sum_{i}{\rm rank}(\phi_{k}(C_{i,j(i)}))$ by the construction of the $C_{i}$ from the $\phi_{k}$ . Thus (7) follows from adding all the inequalities

There are only 11 such inequalities, because using $C_{i,j(i)}=\{0\}$ only yields the trivial inequality $0\leq 0$ .

The table below has one row for each $C_{i,j(i)}$ , one column for ${\rm rank}(C_{i,j(i)})$ , 3 columns for each ${\rm rank}(\phi_{k}(C_{i,j(i)}))$ , and the rightmost column for the resulting inequality (11).

Removing redundant inequalities, we get just the following four:

We see that minimizing $\sum_{i=1}^{3}s_{i}$ subject to these inequalities yields the desired value of 2, say by choosing $s_{1}=s_{2}=s_{3}=2/3$ . The solution $(s_{1},s_{2},s_{3})$ is not unique, but their sum is.

Using the notation introduced above, we still want to bound the number of lattice points $|V|$ in any set $V$ of 7-tuples $(b,c,k,w,h,r,s)$ of integers, given the bounds $|\phi_{i}(V)|\leq M$ for $i\in\{1,2,3\}$ . However, we want a bound that is tighter than $M^{2}$ when $RS$ is small.

We will begin by rewriting the loop indices as $r=\sigma_{w}r^{\prime}+r^{\prime\prime}$ and $s=\sigma_{h}s^{\prime}+s^{\prime\prime}$ , where $r^{\prime\prime}\in[0,\sigma_{w}-1]$ and $s^{\prime\prime}\in[0,\sigma_{h}-1]$ . Replace the loop over $r$ with loops over $r^{\prime}$ (from to $R/\sigma_{w}-1)$ and $r^{\prime\prime}$ (from to $\sigma_{w}-1$ ), and replace the loop over $s$ similarly. Since each $r,s$ maps uniquely onto a single $r^{\prime},r^{\prime\prime},s^{\prime},s^{\prime\prime}$ , we can lift both $Image$ and $Filter$ onto a higher dimension.

There is a one-to-one correspondence between every point in the original seven-dimensional lattice and every point in the lifted nine-dimensional lattice. Therefore, it suffices to bound the number of lattice points $|V|$ of any set $V$ of $9$ -tuples $(b,c,k,w,h,r^{\prime},r^{\prime\prime},s^{\prime},s^{\prime\prime})$ such that $|\phi^{\prime}_{i}(V)|\leq M$ , with $\phi^{\prime}_{i}$ defined as follows:

Let $V$ be any set of 9-tuples of integers $(b,c,k,w,h,r^{\prime},r^{\prime\prime},s^{\prime},s^{\prime\prime})$ where $|\phi^{\prime}_{i}(V)|\leq M$ for $i\in\{1,2,3\}$ , and $1\leq r^{\prime}\leq R/\sigma_{W}$ and $1\leq s\leq S/\sigma_{H}$ . Then

This bound is obviously tighter than $M^{2}$ precisely when $RS<M\sigma_{W}\sigma_{H}$ . Plugging $G=(RS)^{1/2}M^{3/2}(\sigma_{W}\sigma_{H})^{-1/2}$ into (3) immediately yields:

The number of reads and writes to execute a CNN with a fast memory of size $M$ is at least $BCKWH(RS\sigma_{W}\sigma_{H}/M)^{1/2}$ .

Proof of Lemma 4.2: Let $V(r^{\prime},s^{\prime})$ denote the restriction of $V$ to a given value of $r^{\prime},s^{\prime}$ , so that $V=\cup_{r^{\prime},s^{\prime}}V(r^{\prime},s^{\prime})$ is a disjoint union of sets and $|V|=\sum_{r^{\prime},s^{\prime}}|V(r^{\prime},s^{\prime})|$ . Note that $\phi^{\prime}_{3}(V)=\cup_{r^{\prime},s^{\prime}}\phi^{\prime}_{3}(V(r^{\prime},s^{\prime}))$ is also a disjoint union of sets, so

Also $|\phi^{\prime}_{1}(V(r^{\prime},s^{\prime}))|\leq|\phi^{\prime}_{1}(V)|\leq M$ and $|\phi^{\prime}_{2}(V(r^{\prime},s^{\prime}))|\leq|\phi^{\prime}_{2}(V)|\leq M$ . We want to bound $|V(r^{\prime},s^{\prime})|$ in terms of these bounds on $|\phi^{\prime}_{1}(V(r^{\prime},s^{\prime}))|$ , $|\phi^{\prime}_{2}(V(r^{\prime},s^{\prime}))|$ , and $|\phi^{\prime}_{3}(V(r^{\prime},s^{\prime}))|$ .

This is another application of the HBL inequalities discussed in section 4.2. Since each of the loop indices appear in exactly two of the three $\phi^{\prime}_{i}$ , this is a special case of a tensor contraction, for which the optimal exponents are $s_{1}=s_{2}=s_{3}=1/2$ as shown in Section 6.3 of [CDK+13a]. This yields:

We want to maximize this subject to (12). Since we are summing $RS/(\sigma_{w}\sigma_{h})$ terms, a simple application of Lagrange multipliers tells us that this maximum is attained when all $|\phi^{\prime}_{3}(V(r^{\prime},s^{\prime}))|=M/(RS/(\sigma_{w}\sigma_{h}))$ , yielding the desired

Communication Optimal Algorithms

In this section we show that the lower bounds in Theorem 1 are always attainable by an appropriate tiling, analogous to the one for matrix multiplication in section 3. We will prove this by constructing a tiling for any possible set of array bounds, and verifying that the tiling attains one of lower bounds from (4). The tile sizes we construct may also be useful starting points for optimization in practice, although the exact tile sizes may not necessarily give the best performance (due to constant factors omitted from our analysis).

Following the approach of [RD16], we will achieve this tiling by blocking each variable into contiguous blocks, except for $r$ and $s$ , which we will rewrite as $r=\sigma_{w}r^{\prime}+r^{\prime\prime}$ and $s=\sigma_{h}s^{\prime}+s^{\prime\prime}$ (with $r^{\prime\prime}\in[0,\sigma_{w}-1]$ and $s^{\prime\prime}\in[0,\sigma_{h}-1]$ ) respectively. That is, we will rewrite our loop as follows, where we use ${\rm for}\,i=\alpha:\beta:\gamma$ to denote iterating from $\alpha$ to $\gamma$ with a step size of $\beta$ :

To minimize the communication cost, it suffices to maximize the size of each block (that is, $b_{b}b_{c}b_{k}b_{w}b_{h}b_{r^{\prime}}b_{r^{\prime\prime}}b_{s^{\prime}}b_{s^{\prime\prime}}$ ) subject to the following constraints:

The block size in each dimension is smaller than the loop bound for that dimension. For the first five indices, we have:

The loop bounds on $r^{\prime\prime}$ and $s^{\prime\prime}$ (given by their definitions) give the following two constraints:

To ensure that the blocks for $r^{\prime}$ and $s^{\prime}$ are of appropriate size, recall that $r=\sigma_{w}r^{\prime}+r^{\prime\prime}$ and $s=\sigma_{h}s^{\prime}+s^{\prime\prime}$ , which gives

Since $b_{r^{\prime\prime}}\leq\sigma_{w}$ and $b_{s^{\prime\prime}}\leq\sigma_{h}$ , we can safely omit those from the inequality (since their effect on left-hand side is at most equivalent to adding $1$ to $b_{r^{\prime}}$ and $b_{s^{\prime}}$ , and we are only interested in asymptotics) to get

The size of each block does not exceed the size of fast memory $M$ . This is straightforward for $Out$ :

For $Image$ , notice that if a block for $r^{\prime}$ is $[r^{\prime}_{start},r^{\prime}_{end}]$ , a block of $r^{\prime\prime}$ is $[r^{\prime\prime}_{start},r^{\prime\prime}_{end}]$ and a block for $w$ is $[w_{start},w_{end}]$ , then the indices of $Image$ in the $w$ -dimension accessed will be of the form $i+\sigma_{w}j$ , where $i\in[r^{\prime\prime}_{start},r^{\prime\prime}_{end}]$ and $j\in[w_{start}+r^{\prime}_{start},w_{end}+r^{\prime}_{end}]$ . As a result, the number of indices in the $w$ -dimension accessed is $(b_{w}+b_{r^{\prime}})b_{r^{\prime\prime}}$ ; similarly for the $h$ -dimension. Therefore, the total number of elements accessed from $Image$ must be:

As we will see shortly, it is convenient to recast our maximization problem as a linear program by taking logs; for this to happen, we only want products in the inequality. Multiplying out the left-hand side of the above gives a sum of four terms; bounding each of them by $M$ is sufficient for an asymptotic analysis. Therefore, we get:

Taking the log base $M$ of the objective and all the constraints, we get the following linear program, with $l_{\{b,...,s^{\prime\prime}\}}=\log_{M}b_{\{b,...,s^{\prime\prime}\}}$ :

We will first determine a closed-form solution for this linear program; that is, we will partition the space of possible input parameters (i.e. the array bounds and strides) into convex polytopes, within each of which the optimal tiling and communication cost it achieves are described by a single linear function of the input parameters. We will then use this closed-form solution to verify optimality of the tiling (and attainability of the lower bound) by showing that the communication cost of the optimal tiling is always equal to a communication lower bound for every point in the parameter space.

Although this construction and verification can in theory be performed by hand (see Appendix A for a hand analysis of the case where $\sigma_{w}=\sigma_{h}=1$ ), the size of the result - the partition we find is a set of $200$ regions - makes it far more expedient to automate the analysis, which will also allow us to more easily extend this approach to other problems. An implementation of the algorithm, as well as a table of partitions, optimal tilings, and optimal communication costs in each of these partitions, may be found at https://people.eecs.berkeley.edu/~dinh/papers/DD18/partitioning.nb.

Algorithms for determining a closed-form solution to the parameterized linear programs have been studied extensively in the context of control theory [GN72, BBM03, STJ05, JBM07]. We used the geometric algorithm from [BBM03], which we briefly describe in this section and fully specify in Figure 1.; see the original paper for a proof of correctness.

For convenience, represent the LP (13) as

where $c=[-1,...,-1]^{T}$ , $x=[l_{b},l_{c},...,l_{s^{\prime\prime}}]$ , $G$ is the coefficient matrix for the left-hand side of the inequalities, $\theta=[\log_{M}B,\log_{M}C,...,\log_{M}\sigma_{h}]^{T}$ , and $F$ and $w$ are the coefficient matrix and vector, respectively, for the right-hand side of the inequalities.

The intuition for the algorithm is as follows: start with a (possibly open) polytope in parameter space; during the first iteration of the algorithm, this is the set of all possible valid loop bounds and strides (i.e. nonnegative parameters, filter fitting inside the input). Pick a random point (not necessarily uniformly) inside that region, setting the parameters to its coordinates. Solve the linear program at that point using a method, such as simplex, that guarantees that the solution produced will be a vertex of the polytope, and note which constraints are made tight.

The number of tight constraints should be at least to the number of variables in the linear program, which is nine in this case, since our solution is a vertex of the polytope. If there are more than nine constraints, there are two possibilities: either (a) when the constraints are set to equality (to ensure tightness), there are redundant constraints, and the number of non-redundant constraints is nine, or (b) the point we selected lies on the border of two partitions in parameter space. Since the borders of partitions are of lower dimension than the parameter space itself (and since we selected our initial point randomly), the probability that we encounter case (b) is zeroBecause of the discreteness of random number generators in practice, as well as the possibility of using a nonuniform sampler for performance, the probability may not be exactly zero. Nevertheless, randomly resampling a point or perturbing our initial point if we see case (b) is sufficient..

If there are nine tight constraints at our solution, the optimizer $x^{*}(\theta)$ at this point is the solution to $G_{t}x=w_{t}+F_{t}\theta$ , where $G_{t}$ , $w_{t}$ , and $F_{t}$ correspond to $G$ , $w$ , and $F$ restricted to the tight constraints. The polytope (in parameter space) within which $x^{*}(\theta)$ is the optimizer is given by $G_{s}x^{*}(\theta)\leq w_{s}+F_{s}\theta$ , where $G_{s}$ , $w_{s}$ , and $F_{s}$ are the restrictions of $G$ , $w$ , and $F$ to slack constraints. Similarly, if there are more than nine tight constraints, we set the tight constraints to equality and solve to get the optimizer; the polytope is defined as the region where the slack constraints remain slack at the optimizer (see Figure 1 for details).

Once we have obtained this polytope, we partition the remainder of our initial polytope into convex polytopes and recursively partition each one using this algorithm. We terminate when the remainder is either empty or is of lower dimension than parameter space.

2 Verifying optimality of the tiling

In order to verify the optimality of the tiling and the attainability of the lower bound, we must ensure that the communication cost attained by the tiling equals the maximum of the five lower bounds for every possible element.

Define the function $\mathscr{C}_{p}(\theta)$ as the log of the communication cost at point $\theta$ using the tiling provided by the algorithm above for partition $p$ , that is:

where $\hat{x}_{b}^{p},...,\hat{x}_{s^{\prime\prime}}^{p}$ represent the closed-form optimizers in region $p$ as a function of $\theta$ , and let $\mathscr{L}_{i}(\theta)$ be the $i$ th lower bound from (4) as a function of $\theta$ .

It suffices to show that for every partition $p$ and parameter set $\theta\in p$ , $\mathscr{C}_{p}(\theta)$ is equal to the maximum of the five lower bounds; in other words, that the quantity

is zero for all $i\in\{1,...,5\}$ and for all partitions $p$ . This is not precisely the case - in the cases where all the inputs and outputs fit inside fast memory, the result may be nonzero since the computation in Equation 5.2 assumes that $M$ words are transmitted in each round (which is obviously more than the number of words transmitted if everything fits in cache); as a result, we exclude regions where the communication lower bound is less than $M$ . The LP solutions can easily be checked to be correct using an LP solver; for our code and results, see https://people.eecs.berkeley.edu/~dinh/papers/DD18/partitioning.nb.

As a sanity-check, we can also ensure that no tiling generated by our LP breaks any of our lower bounds, i.e.

should be nonnegative for all $p$ , $i$ . This is confirmed by our code as well.

3 Observations

A plot of our communication costs for one set of CNN parameters for a real world neural net is shown in Figure (2). Notice that different communication bounds (and different tilings) apply depending on the size of the fast memory being optimized for. As a result, a more sophisticated analysis would be required to optimize communication for multi-level memory model with different layers of cache; we leave this to future work.

It can also be observed in this example that when the memory size is larger than the size of the filter, the optimal tiling requires asymptotically no more memory than the size of the output array. In fact, an examination of the solutions for the LP (13) shows that in cases where any of $Image$ , $Output$ , and $Filter$ fits entirely in cache, there always exists a tiling that will ensure a communication cost no higher than the max of their sizes.

Application to Pooling

Our techniques can also be extended to programs with similar loop structures, such as pooling:

where $a\oplus b$ can either mean $a+b/(rs)$ (“average-pooling”) or $\max(a,b)$ (“max-pooling”). The following communication lower bound holds for pooling:

The first two terms correspond to the size of $Out$ and $Image$ respectively, while the third can be found using the approach from Section 4.2. In fact, since the only difference between pooling and the convolution (1) is the absence of $Filter$ , it suffices to remove $\phi_{3}$ from our calculations. In particular, the sets of subgroups we consider

Solving the resulting LP gives us the third lower bound, as desired.

We can also follow the approach from Section 5 to verify that this lower bound is tight and always attainable with the tiling given by the solution to the LP (13) with the filter constraint ( $l_{c}+l_{k}+l_{r^{\prime}}+l_{r^{\prime\prime}}+l_{s^{\prime}}+l_{s^{\prime\prime}}\leq 1$ ) removed.

Conclusions and Future Work

We have found an asymptotic lower bound on communication required to evaluate a convolutional neural net, and provided an optimal reordering of the nested loops to attain the lower bound in every case by taking advantage of significantly more ( $2.75$ times in the example from Figure 2) data reuse than is possible with matrix multiply or many other dense linear algebra operations. We describe a few avenues for future research below:

Parallel Algorithm: Suppose we have $p$ processors which do not communicate with each other except through a single shared memory of size M, and we wish to divide work between them while maintaining communication efficiency. For simplicity, let us only consider dividing work by evenly partitioning the output dimensions (i.e. assigning each processor a subset of $b$ , $k$ , $h$ , and $w$ ) so as to avoid possible race conditions caused by assigning the same coordinate in the output to multiple processors. Suppose we assign each processor a single block of size $B\text{\textquoteright}=B/a_{B}$ , $K\text{\textquoteright}=K/a_{K}$ , $H\text{\textquoteright}=H/a_{H}$ , and $W\text{\textquoteright}=W/a_{W}$ , such that $a_{B}a_{K}a_{H}a_{W}\leq p$ .

Since each of the processors handles a subproblem of the same size, the communication lower bound (total memory traffic between the shared memory and all the processors) under these assumptions is $p$ multiplied by the per-processor expression (4), with $B$ , $K$ , $H$ , $W$ replaced by $B\text{\textquoteright}$ , $K\text{\textquoteright}$ , $W\text{\textquoteright}$ , $H\text{\textquoteright}$ . The first (output size), fourth, and fifth terms in this expression remain the same; the second and third (output and filter, respectively) increase by a factor of $a_{K}$ and $a_{B}a_{W}a_{H}$ respectively; once these parameters are fixed, the communication lower bound is

Since the bound is for fixed values of $a$ , we should choose those values in order to minimize the above quantity.

Optimizing the tiling (under these assumptions) can be done by solving a modified version of LP (13), with the $B\text{\textquoteright}$ , $K\text{\textquoteright}$ , $W\text{\textquoteright}$ , $H\text{\textquoteright}$ replacing $B$ , $K$ , $H$ , $W$ and with additional constraints encoding $a_{B}a_{K}a_{H}a_{W}\leq p$ , $pa_{B}\leq B$ , etc.

A more sophisticated analysis (examining tiling schemes that may cause race conditions and distributed models where processors can directly communicate with each other) is left to future work.

Implementation and benchmarking: The tilings we generate are asymptotically optimal in terms of communication. However, many factors, such as cache locality and processor architecture (tile sizes that are multiples of a processor’s vector width are likely to be more efficient), can provide significant constant-factor changes to the real-world performance (both in terms of time and energy). Since our algorithm is simply a rearrangement of the same arithmetic operations performed during a CNN, we believe our algorithms can provide a significant advantage over current implementations of convolutions that do no tiling (e.g. that used in Torchhttps://github.com/torch/torch7/blob/master/lib/TH/generic/THTensorConv.c ) or those that tile only based on machine parameters without taking into account the dimensions of the problem (e.g. CuTorchhttps://github.com/torch/cutorch/blob/master/lib/THC/THCTensorConv.cu ); we leave the validation of this intuition through an implementation and benchmark to future work.

Generalizing to arbitrary nested loops: Our lower bound in the “small filter” case rests on a problem-specific lifting of the HBL LP to a higher dimension; similarly, our approach for finding a closed form for optimal tilings (which is necessary for checking if a lower bound is always attainable) relies on the creation of a linear program tailored to this specific problem. Generalizing this to arbitrary nested loops would move us closer to being able to automatically optimize arbitrary loop nests for communication, e.g. at a compiler level.

References

Appendix A: Manual Exploration of Tilings

In this section we show, by hand, that the lower bounds in Theorem 1 are attainable in the case where $\sigma_{w}=\sigma_{h}=1$ . The advantage of this approach appears is that it seems to be possible to attain far more compact representations of the optimal tiling function than with an automated exploration, which produced well over twice as many cases (for the one-stride case) as the hand exploration in this section.

There are a number of cases, depending on which of the five terms in Theorem 1 is largest, and other inequalities. To simplify the presentation, we present the overall result as a decision tree, where each leaf of the tree represents a disjoint subset of the possible lower bounds and optimal algorithms. After stating the result, we will give some intuition for the decision tree, as arising from a linear program, and then prove the theorem by providing an algorithm and communication cost analysis, with a separate lemma for each leaf.

The following cases describe which of the lower bound expressions in Theorem 1 are attainable. The abbreviation ALB stands for Attainable Lower Bound:

Here is some intuition for why the cases in Theorem 2 arise, and some notation we will use later in the proof. Suppose we tile the $b$ loop with block size $bB$ , the $k$ loop with $bK$ , the $h$ loop with $bH$ , and so on. Then the algorithm becomes (recall that in this case we are assuming $\sigma_{w}=\sigma_{h}=1$ ):

The argument list in the name is meant to indicate both the order of the nested loops, via the notation $b|$ , $k|$ , $h|$ , etc., and the block sizes. A natural way to try to optimize this code is to pick the block sizes so that all the data accessed in the 7 innermost loops fits in fast memory of size $M$ , and to maximize the number of loop iterations that can be performed by these loops, namely $bB\cdot bK\cdot bH\cdot bW\cdot bR\cdot bS\cdot bC$ . It is easy to see that the submatrix of $Out$ accessed by these loops is of size $bK\cdot bH\cdot bW\cdot bB$ , the submatrix of $Filter$ accessed is of size $bK\cdot bR\cdot bS\cdot bC$ , and the submatrix of $Image$ accessed is of size $bC\cdot(bS+bH)\cdot(bR+bW)\cdot bB$ . Since $R\leq W$ , we will assume $bR\leq bW$ , and similarly $bS\leq bH$ , which means the submatrix of $C$ accessed is of size at most $4\cdot bC\cdot bH\cdot bW\cdot bB$ ; we will ignore the constant 4 for simplicity, since it does not change our Big-O analysis below. This yields the following optimization problem:

We call a tiling admissible if it satisfies the constraints above. (The reader may wonder why the first 3 constraints do not use the upper bound $M/3$ instead of $M$ , to be sure all three submatrices fit in fast memory simultaneously; this would only change the value of $G$ by a constant factor, which would not change the following Big-O analysis.) Then assuming the 3 submatrices are each read or written just once in the innermost 7 loops, the total number of reads and writes is $O(BKHWCRSM/G)$ , since $BKHWCRS$ is the total number of loop iterations, $BKHWCRS/G$ is the number of iterations of the outermost 7 loops, and there are $O(M)$ reads/writes per iteration of the outermost 7 loops. This last statement about $O(M)$ read/writes per iteration may depend on the cache replacement policy in the hardware, but again it will not change the Big-O analysis.

If we now replace each quantity by its logarithm base $M$ , so $B$ by $lB=\log_{M}B$ , $bB$ by $lbB=\log_{M}bB$ and so on, we get the following linear program:

Exploring the finite number of corners of the polytope defined by this linear program lead to the cases in Theorem 2. While the proof of Theorem 2 requires this exploration by hand, and confirming that the lower bound of Theorem 1 is attained, in practice the linear program could be used to determine the optimal block sizes. We note that this linear program has 7 variables and 12 constraints (besides nonnegativity), so there are as many as $\binom{12}{7}=792$ corners in the feasible polytope to explore; fortunately only a few turn out to be important.

To keep the proofs short, we will use the notation BlockCNN above, and the cost expression $O(BKHWCRSM/G)$ , to describe the optimal algorithm in each case. We will use the expression

to denote the lower bound from Theorem 1. To capture (some of) the non-uniqueness of the optimal solutions, we will use the following two functions to help solve linear programs: Function $(x,y)=f_{2}(\bar{x},\bar{y},s)$ takes 3 nonnegative arguments satisfying $\bar{x}+\bar{y}\geq s$ , and returns some $0\leq x\leq\bar{x}$ and $0\leq y\leq\bar{y}$ satisfying $x+y=s$ . Function $(x,y,z)=f_{3}(\bar{x},\bar{y},\bar{z},s)$ similarly takes 4 nonnegative arguments satisfying $\bar{x}+\bar{y}+\bar{z}\geq s$ , and returns some $0\leq x\leq\bar{x}$ , $0\leq y\leq\bar{y}$ and $0\leq z\leq\bar{z}$ satisfying $x+y+z=s$ .

Upper Bound Case 1: Suppose $\min(CHWB,KCRS,KHWB)\leq M$ , i.e. at least one of the 3 arrays $Image$ , $Filter$ and $Out$ fits in fast memory. Then the attainable communication lower bound is $O(\max(CHWB,KCRS,KHWB))$ .

Proof: Case 1 in turn breaks down into a number of subcases (again, we ignore constant factors):

Use BlockCNN( $b|B,k|K,h|H,w|W,r|R,s|S,c|C$ ), i.e. the original unblocked algorithm,

$LB=\max(CHWB,KHWB,KCRS)$ because $KCRS\leq M$ implies $RS\leq M$ implies

Use BlockCNN( $b|1,k|K,h|H,w|W,r|R,s|S,c|(M/(HW)$ ). It is straightforward to confirm that this tiling is admissible. Then $G=KHWRS(M/(HW))=KRSM$ , and so the number of read/writes is $O(CHWB)$ . The same inequalities as in Case 1.1 show

Swap the roles of $C$ and $K$ in Case 1.2.

Let $(lbR,lbS)=f_{2}(lR,lS,1-lK-lC)$ , and then $bR=M^{lbR}$ and $bS=M^{lbS}$ . Then use BlockCNN( $b|B,k|K,h|H,w|W,r|bR,s|bS,c|C$ ). Admissibility follows from the definition of $f_{2}()$ . Then $G=KHWB(M/KC)C=HWMB$ , and the number of reads/writes is $O(KCRS)$ .

To show $LB=\max(CHWB,KHWB,KCRS)$ we first note $CHWB\leq M$ implies $HWB\leq M$ implies $KCHWBRS/M\leq KCRS$ . Multiplying the first 3 inequalities defining Case 1.4 yields $KCRS\cdot M\cdot M\geq M\cdot CHWB\cdot KHWB$ , or $RS\geq(RS/M)^{1/2}HWB$ , and thus $KCRS\geq KCHWB(RS/M)^{1/2}$ .

$RS\leq HW\leq M$ , so $lR+lS\leq 1$ . Let $(lbK,lbC)=f_{2}(lK,lC,1-lR-lS)$ , and then $bK=M^{lbK}$ and $bC=M^{lbC}$ . Then use BlockCNN( $b|B,k|bK,h|H,w|W,r|R,s|S,c|bC$ ). Admissibility follows from the definition of $f_{2}()$ . Then $G=(M/(RS))HWRSB=HWMB$ , and the number of reads/writes is $O(KCRS)$ .

Showing $LB=\max(CHWB,KHWB,KCRS)$ is identical to Case 1.4.

$KC\leq M$ so $\max(K,C)\leq M$ . Let $(lbB,\delta_{h},\delta_{w})=f_{3}(lB,lH-lS,lW-lR,1-\max(lC,lK)-lR-lS)$ . This is well-defined because $lH-lS\geq 0$ is equivalent to $H\geq S$ , $lW-lR\geq 0$ is equivalent to $W\geq R$ , $1-\max(lC,lK)-lR-lS\geq 0$ is equivalent to $M\geq\max(KRS,CRS)$ , which is implied by $KCRS\leq M$ , and $lB+lH-lS+lW-lR\geq 1-\max(lC,lK)-lR-lS$ is equivalent to $\max(KHWB,CHWB)\geq M$ . Now let $lbH=lS+\delta_{h}\leq lH$ , $lbW=lR+\delta_{w}\leq lW$ , $bB=M^{lbB}$ , $bH=M^{lbH}$ and $bW=M^{lbW}$ . Thus $S\leq bH\leq H$ and $R\leq bW\leq W$ . Then use BlockCNN( $b|bB,k|K,h|bH,w|bW,r|R,s|S,c|C$ ). Admissibility follows from $KCRS\leq M$ and $lbB+lbH+lbW=1-\max(lC,lK)$ , so $bB\cdot bH\cdot bW=M/\max(K,C)$ , and $\max(K\cdot bH\cdot bW\cdot bB,C\cdot bH\cdot bW\cdot bB)=M$ . Then $G=KC(M/\max(K,C))RS=\min(K,C)MRS$ , and the number of reads/writes is $O(\max(KHWB,CHWB))$ .

To show $LB=\max(CHWB,KHWB,KCRS)$ we note $KCHWBRS/M\leq HWB\leq\max(CHWB,KHWB)$ and $KCHWB(RS/M)^{1/2}\leq KCHWB(1/(KC))^{1/2}=(KC)^{1/2}HWB\leq\max(K,C)HWB$ .

Use BlockCNN( $b|B,k|K,h|H,w|W,r|1,s|1,c|(M/K)$ ). Admissibility follows from $(M/K)HWB\leq M$ . Then $G=MHWB$ and the number of reads/writes is $O(KCRS)$ .

Note that $KCRS\geq HWBCRS\geq CHWB\geq KHWB$ and $KCHWBRS/M\leq KCRS$ . $KHWB\leq M$ and $K\geq HWB$ together imply $HWB\leq M^{1/2}\leq(MRS)^{1/2}$ , and hence $KCHWB(RS/M)^{1/2}\leq KCRS$ . So altogether $LB=\max(KCRS,CHWB,KHWB)$ .

Let $(lbR,lbS)=f_{2}(lR,lS,min(lR+lS,lH+lW+lB-lK))$ , and then $bR=M^{lbR}$ and $bS=M^{lbS}$ . Note that $lH+lW+lB-lK\geq 0$ because $K\leq HWB$ .

Use BlockCNN( $b|B,k|K,h|H,w|W,r|bR,s|bS,c|(M/(HWB))$ . Admissibility follows since $K\cdot(M/(HWB))\cdot bR\cdot bS\leq K\cdot(M/HWB)\cdot(HWB/K)=M$ . Then $G=BKHW(M/HWB)\min(RS,HWB/K)=\min(KMRS,MHWB)$ and the number of reads/writes is $O(KCRSHWBM/\min(KMRS,MHWB))=O(\max(CHWB,KCRS))$ .

$KHWB\leq M$ implies $HWB\leq M$ implies $KCRSHWB/M\leq KCRS$ . If $KRS\leq HWB$ , then $\frac{K(RS)^{1/2}}{M^{1/2}}\leq\frac{K(HWB/K)^{1/2}}{M^{1/2}}=\frac{(KHWB)^{1/2}}{M^{1/2}}\leq 1$ , implying $CHWB\geq KCHWB(RS/M)^{1/2}$ . Alternatively, if $KRS\geq HWB$ , then $M\geq KHWB\geq(HWB)^{2}/RS$ , so $RS(HWB)^{2}/M\leq(RS)^{2}$ , and $KCRS\geq KCHWB(RS/M)^{1/2}$ . So altogether $LB=\max(KCRS,CHWB,KHWB)$ .

Swap the roles of $C$ and $K$ in Case 1.7.

Swap the roles of $C$ and $K$ in Case 1.8.

Upper Bound Case 2.1: Suppose $RS\geq M$ . Then the attainable communication lower bound is $O(KCHWRSB/M)$ .

Proof: Note that $HW\geq RS\geq M$ implies $\min(CHWB,KHWB,KCRS)\geq M$ . Let $(lbR,lbS)=f_{2}(lR,lS,1)$ , $lbW=lbR$ and $lbH=lbS$ , and then $bR=M^{lbR}$ , $bS=M^{lbS}$ , $bW=bR$ and $bH=bS$ . Use BlockCNN( $b|1,k|1,c|1,h|bH,w|bW,r|bR,s|bS$ ). Admissibility follows from $bR\cdot bS=bH\cdot bW=M$ . Then $G=M^{2}$ , so the number of reads/writes is $O(KCHWRSMB/G)=O(KCHWRSB/M)$ .

$HW\geq RS\geq M$ implies $KCHWRSB/M\geq\max(CHWB,KHWB,KCRS)$ . $RS\geq M$ also implies $KCHWBRS/M\geq KCHWB(RS/M)^{1/2}$ . So $LB=KCHWRSB/M$ . $\Box$

Upper Bound Case 2.2.1: Suppose $\min(CHWB,KCRS,KHWB)\geq M$ , $RS\leq M$ and $MRS\geq(HWB)^{2}$ . Then the attainable communication lower bound is $O(KCRS)$ .

Proof: Note that $MHW\geq MRS\geq(HWB)^{2}$ , so $M\geq B^{2}HW$ . Let $bC=M/(HWB)\geq 1$ , and so $bC\leq CHWB/(HWB)=C$ . Also $KHWB\geq M\geq(HWB)^{2}/(RS)$ so $K\geq HWB/(RS)$ . Let $bK=HWB/(RS)\geq 1$ , and so $bK\leq K$ . Use BlockCNN( $b|B,k|bK,h|H,w|W,r|R,s|S,c|bC$ ). Admissibility follows from $bC\cdot HWB=M$ , $bK\cdot HWB=(HWB)^{2}/(RS)\leq M$ , and $bC\cdot bK\cdot RS=M/(HWB)\cdot HWB/(RS)\cdot RS=M$ . Then $G=bK\cdot bC\cdot HWRSB=HWMB$ , so the number of reads/writes is $O(KCHWRSBM/(HWMB))=O(KCRS)$ .

$(HWB)^{2}\leq MRS$ implies $HWB(RS/M)^{1/2}\leq RS$ implies $KCHWB(RS/M)^{1/2}\leq KCRS$ . $RS\leq M$ implies $KCHWBRS/M\leq KCHWB(RS/M)^{1/2}\leq KCRS$ . $(HWB)^{2}/RS\leq M\leq KHWB$ implies $HWB\leq KRS$ implies $CHWB\leq KCRS$ . Similarly, $(HWB)^{2}/RS\leq M\leq CHWB$ implies $HWB\leq CRS$ implies $KHWB\leq KCRS$ . Thus $LB=KCRS$ . $\Box$

Upper Bound Case 2.2.2.1: Suppose $\min(CHWB,KCRS,KHWB)\geq M$ , $RS\leq M$ , $MRS\leq(HWB)^{2}$ and $\min(C,K)\geq(M/(RS))^{1/2}$ . Then the attainable communication lower bound is $O(KCHWB(RS/M)^{1/2})$ .

Proof: Let $bK=bC=(M/(RS))^{1/2}\leq\min(C,K)$ , and $bR=R$ and $bS=S$ . Let $(lbB,\delta_{h},\delta_{w})=f_{3}(lB,lH-lS,lW-lR,(1-lR-lS)/2)$ . This is well-defined because $lH-lS\geq 0$ is equivalent to $H\geq S$ , $lW-lR\geq 0$ is equivalent to $W\geq R$ , $(1-lR-lS)/2\geq 0$ is equivalent to $(M/RS)^{1/2}\geq 1$ or $M\geq RS$ , and $lB+lH-lS+lW-lR\geq(1-lR-lS)/2$ is equivalent to $HWB/(RS)\geq(M/RS)^{1/2}$ , or $(HWB)^{2}\geq MRS$ . Now let $lbH=lS+\delta_{h}\leq lH$ , $lbW=lR+\delta_{w}\leq lW$ , $bH=M^{lbH}$ , $bW=M^{lbW}$ and $bB=M^{lbB}$ . Then $bS\leq bH\leq H$ , $bR\leq bW\leq W$ , and $bH\cdot bW\cdot bB=(MRS)^{1/2}$ . Use BlockCNN( $b|bB,k|bK,h|bH,w|bW,r|bR,s|bS,c|bC$ ). Admissibility follows from $bK\cdot bH\cdot bW\cdot cB=bC\cdot bH\cdot bW\cdot bB=bK\cdot bC\cdot bR\cdot bS=M$ . Then $G=bB\cdot bK\cdot bC\cdot bH\cdot bW\cdot RS=M^{3/2}(RS)^{1/2}$ , and the number of reads/writes is $O(KCHWRSBM/(M^{3/2}(RS)^{1/2}))=O(KCHWB(RS/M)^{1/2})$ .

$K\geq(M/(RS))^{1/2}$ implies $KCHWB(RS/M)^{1/2}\geq CHWB$ . $C\geq(M/(RS))^{1/2}$ implies $KCHWB(RS/M)^{1/2}\geq KHWB$ . $HWB\geq(MRS)^{1/2}$ implies $KCHWB(RS/M)^{1/2}\geq KCRS$ . $RS\leq M$ implies $KCHWB(RS/M)^{1/2}\geq KCHWBRS/M$ . Thus $LB=KCHWB(RS/M)^{1/2}$ . $\Box$

Upper Bound Case 2.2.2.2: Suppose $\min(CHWB,KCRS,KHWB)\geq M$ , $RS\leq M$ , $MRS\leq(HWB)^{2}$ and $\min(C,K)\leq(M/(RS))^{1/2}$ . Then the attainable communication lower bound is $O(\max(KHWB,CHWB))$ .

Proof: Suppose w.l.o.g that $C\leq K$ , so $C\leq(M/(RS))^{1/2}$ and $KCRS\geq M$ implies $K\geq M/(CRS)\geq(M/(RS))^{1/2}$ . Let $bC=C$ , $bK=C$ , $bR=R$ and $bS=S$ . Let $(lbB,\delta_{h},\delta_{w})=f_{3}(lB,lH-lS,lW-lR,1-lC-lR-lS)$ . This is well-defined because $lH-lS\geq 0$ is equivalent to $H\geq S$ , $lW-lR\geq 0$ is equivalent to $W\geq R$ , $1-lC-lR-lS\geq 0$ is equivalent to $M/(CRS)\geq 1$ , which is implied by $C\leq(M/RS)^{1/2}\leq M/RS$ , and $lB+lH-lS+lW-lR\geq 1-lC-lR-lS$ being equivalent to $lC+lH+lW+lB\geq 1$ or $CHWB\geq M$ . Now let $lbH=lS+\delta_{h}\leq lH$ , $lbW=lR+\delta_{w}\leq lW$ , $bH=M^{lbH}$ , $bW=M^{lbW}$ and $bB=M^{lbB}$ . Then $bS\leq bH\leq H$ , $bR\leq bW\leq W$ , and $bH\cdot bW\cdot bB=M/C$ . Use BlockCNN( $b|bB,k|bK,h|bH,w|bW,r|bR,s|bS,c|bC$ ). Admissibility follows from $bK\cdot bH\cdot bW\cdot bB=bC\cdot bH\cdot bW\cdot bB=C\cdot M/C=M$ , and $bK\cdot bC\cdot bR\cdot bS=C^{2}RS\leq M$ . Then $G=bB\cdot bK\cdot bC\cdot bH\cdot bW\cdot bR\cdot bS=C\cdot C\cdot M/C\cdot R\cdot S=CRSM$ , and the number of reads/writes is $O(KCHWRSBM/(CRSM))=O(KWHB)$ .

$KHWB\geq CHWB$ by assumption. $HWB\geq(MRS)^{1/2}\geq CRS$ implies $KHWB\geq KCRS$ . $C\leq(M/(RS))^{1/2}\leq M/(RS)$ implies $KHWB\geq KCHWB(RS/M)^{1/2}\geq KCHWBRS/M$ . Thus $LB=KHWB$ . $\Box$