Communication-Optimal Convolutional Neural Nets

James Demmel, Grace Dinh

Introduction

Convolutional neural networks are a bottleneck in many machine learning applications, and as such must be efficiently implemented on modern architectures. To do so, it is important to understand where most of the time (and energy) goes when executing a program on a current architecture. There are two costs to consider: arithmetic and communication, i.e. moving data, either between levels of a memory hierarchy or between processors over a network. The cost to move one word of data can be orders of magnitude larger than the cost to perform one arithmetic operation, and this difference in cost is growing over time, following technological trends [GSP05, FM11]. Avoiding communication has long motivated optimization efforts (many of which have in fact managed to attain known communication lower bounds) in numerical linear algebra, resulting in tuned libraries (e.g. the BLAS and LAPACK) that attain a high fraction of machine peak. We seek to extend this optimization approach to CNNs.

In this paper, we consider one phase of CNNs, which can be written most simply as seven nested loops combining a four-dimensional ImageImage array and a four-dimensional FilterFilter array to compute a four-dimensional OutOut array, which may be most simply stated as followsWe ignore boundary conditions in this paper and focus solely on asymptotic optimization.:

We consider all possible ways to reorganize this computation, performing the same arithmetic operations in an arbitrary order, and ask which order minimizes communication. We use a simple sequential architectural model where there is one memory large enough to hold all the input and output data, and a smaller cache of size MM where data needs to reside to perform arithmetic; we wish to minimize the number of words moved between the large memory and cache (we also show how to generalize this simple architectural model to more complicated and realistic ones).

Our first contribution is to prove new communication lower bounds that hold for all possible loop bounds on the seven nested loops, all strides, and all cache sizes MM. Our second contribution is to show how to attain these lower bounds in all cases, using appropriate loop reorganizations and tilings.

Both the lower bounds and optimal tilings are more complicated than, say, those for matrix multiplication because the seven loop bounds and two stride values lead to many more possible situations than the three loop bounds of matrix multiplication (which we will briefly review for contrast). Not all loop bounds and strides may be used in practice, but we consider them all for completeness.

Fortunately, the lower bound can be tersely written as the maximum of five simple algebraic expressions in the seven loop bounds and cache size MM:

(see Theorem 1 in section 4). Any one of these five expressions may be much larger than the others, depending on the loop bounds, the stride values, and MM. Notice that in many common cases (small filter, arrays too big to fit entirely in cache), the fifth term in the above expression is the maximum. If we were only able to achieve as much data reuse as in conventional O(n3)O(n^{3}) matrix multiplication (or the many other dense linear algebra operations for which matrix multiply is a bottleneck), this term would be equal to KCHWBRS/M1/2KCHWBRS/M^{1/2}. In contrast, we see our bound is smaller (better) than this by a factor min(M1/2,(RS/σhσw)1/2)\min(M^{1/2},(RS/\sigma_{h}\sigma_{w})^{1/2}). The proof (which can be skipped on a first reading) uses techniques from functional analysis, group theory and lattice theory.

The optimal tilings that attain these lower bounds lead to a large number of cases. We determine these cases by formulating the problem of finding an optimal tiling as an optimization problem, where we want to choose loop tile sizes that both fit in fast memory and maximize the amount of work that can be done. By taking logarithms, this becomes a linear program over the tiling parameters, with constraints restricting the tiling parameters to feasible tilings that fit inside memory. We show in section 5 that for all possible values of the loop bounds and strides, there is a feasible solution of this linear program that attains the corresponding lower bound. A practical implementation of our result might be done either by formulating and solving the linear program, or by using the cases precomputed using the method described in Section 5. For certain sets of parameters found in real-world neural networks such as AlexNet, our algorithms can produce an integer-factor reduction in the communication cost over a matrix multiply-based approach for sufficiently small (L1-L2 cache) values of MM.

The remainder of this paper is organized as follows. Section 2 describes the seven-nested loop version of a CNN in more detail, and the simplifications we make for the purpose of our analysis. Section 3 briefly reviews lower bounds and optimal algorithms for matrix multiplication, to set the stage for our more complicated analysis of CNNs. Section 4 presents our new lower bounds, and Section 5 presents the matching upper bounds, i.e. optimal algorithms. We extend our analysis to pooling in Section (6).

CNN Model

As stated in the Introduction, we consider the following CNN computation:

where ImageImage has dimensions (σwW+R)×(σhH+S)×C×B(\sigma_{w}W+R)\times(\sigma_{h}H+S)\times C\times B, OutOut has dimensions K×H×W×BK\times H\times W\times B, and FilterFilter has dimensions K×R×S×CK\times R\times S\times C. BB is the number of images, CC is the number of input channels, KK is the number of output channels, WW and HH are the width and height of the output image, RR and SS are the sizes of one convolution, σw\sigma_{w} is the stride size in the ww dimension, and σh\sigma_{h} is the stride size in the hh dimension. We assume that the filter size is smaller than the input image size, i.e. RσwWR\leq\sigma_{w}W and SσhHS\leq\sigma_{h}H, and typically, they are much smaller, though our analysis does not require this. We also assume that σwR\sigma_{w}\leq R and σhS\sigma_{h}\leq S (i.e. all elements of ImageImage are used in the computation); we can reduce any problem onto one where this assumption holds using no more communication than is necessary to read the useful elements in Image. Thus FilterFilter has total size KCRSKCRS, OutOut has total size KHWBKHWB, and ImageImage has total size C(σhH+S)(σwW+R)BC(\sigma_{h}H+S)(\sigma_{w}W+R)B, which is usually close to, and at most four times, CHWBσhσwCHWB\sigma_{h}\sigma_{w}. These three array sizes will appear in our communication bounds. We will use the expression CHWBσhσwCHWB\sigma_{h}\sigma_{w} to simplify our algebra later, since we are only interested in the asymptotics.

Review of Matrix Multiplication

In this section we review the well-known case of matrix multiplication (matmul), both to compare to our analogous but more complicated result for CNNs, and because the lower bound for matmul is used in the CNN lower bound proof. We consider only “classical” matmul, i.e. the algorithm that does mnkmnk multiplies and adds to multiply an mm-by-nn matrix times an nn-by-kk matrix (we discuss Strassen-like algorithms briefly below). To keep it simple, we consider nn-by-nn times nn-by-kk matmul C=ABC=A*B, where the matrix AA and BB originally reside in main memory, the result CC is stored in main memory at the end of execution, 1kn1\leq k\leq n, and the cache has size MM. In this case, the attainable lower bound on the number WMMW_{MM} of words moved between main memory and cache is

The n2n^{2} term arises because it is obviously necessary to read the input matrices from main memory to cache at least once, and write the output matrix to main memory at least once. The n2k/M1/2n^{2}k/M^{1/2} term is the interesting one, dominating n2n^{2} for large enough kk, and a decreasing function of MM. In the square n=kn=k case, it is attained by the well-known tiling approach, i.e. breaking matrices AA, BB and CC into square submatrices of dimension (M/3)1/2(M/3)^{1/2}, so one submatrix of AA, BB and CC can all fit in cache simultaneously. The tiled algorithm then loops over tiles, multiplying two tiles of AA and BB and updating one tile of CC. As kk decreases, the same tiling approach works until k=(M/3)1/2k=(M/3)^{1/2}, at which point one tile just fits in the nn-by-kk matrices BB and CC, and the two terms n2n^{2} and n2k/M1/2n^{2}k/M^{1/2} in WMMW_{MM} become equal (to within a modest constant factor). As kk decreases further, the WMMW_{MM} remains equal to n2n^{2}, and this is attained by continuing to use an (M/3)1/2(M/3)^{1/2}-by-(M/3)1/2(M/3)^{1/2} tile for AA, but (M/3)1/2(M/3)^{1/2}-by-kk tiles for BB and CC. We will see an analogous, but more complicated transitioning of optimal tilings for CNNs.

The lower bound n2k/M1/2n^{2}k/M^{1/2} was first derived for sequential classical matmul in [HK81], generalized to parallel implementations in [ITT04], and to classical linear algebra more generally in [BDHS11]. We will apply a further generalization of these bounds [CDK+13a, CDK+15, Kni15] that apply to more general nested loops accessing arrays to CNNs in section 4. The case of Strassen-like matmul was addressed in [BDHS12, Sco15].

Communication Lower Bounds

In this section, we state and prove our communication lower bound for the CNN in (1). We first state the bound for the following basic memory model, and then show how to generalize it to other models, following [BDHS11]. We assume the input data initially resides in a large main memory, and that at the end of the computation the answer also resides in the main memory. The main memory is attached to a cache of smaller size MM, to which data can be loaded, and from which data can be stored back to main memory. Arithmetic can only be performed on data in cache, with the operands and result of an operation needing to fit in cache. Our goal is to find a lower bound the number of loads and stores needed to complete the algorithm.

In the simplest case, when MM is large enough to hold all the inputs and outputs, an optimal algorithm would move all the inputs from main memory to cache, perform the algorithm without any more loads or stores, and store the answer back to main memory at the end. This would attain the trivial lower bound on the number of loads and stores, equal to the size of all the inputs plus the size of all the outputs. The interesting case is when MM is not large enough to do this.

In this case, following the approach of [CDK+13b], we will proceed as follows: consider the algorithm as a sequence of instructions, including load instructions that transfer data from slow to fast memory, store instructions that transfer data from fast to slow memory, and FF arithmetic operations. Break the sequence into RR rounds of consecutive instructions, with each round containing exactly MM load and store operations (with the possible exception of the last); this provides a 2M2M upper bound on the amount of data that can be used as input by operations in a single bound - the MM words already available at the beginning, and the at most MM words loaded into fast memory at the beginning of a round. If we can show that at most GG operations can be computed with 2M2M inputs and outputs (i.e. in one round), then the number of rounds must be

and the number of words transferred in the execution must be

This approach can be used for more general memory models. For example, we can bound the communication between two consecutive levels of a multilevel memory hierarchy by treating the smaller and faster of the two levels as the “fast memory” and everything slower and larger than it as “slow memory”. For distributed memory parallel computations, following the approach of [ITT04, BDHS11, CDK+13b, Kni15], we can bound the memory traffic in and out of any node by treating the memory on that node as the “fast memory”, and the memory on all the other processors as ”slow memory”.

Any execution order of (1) moves WCNNW_{CNN} words between main memory and a cache of size MM, where

To provide some intuition for this, note that the first three terms in the max()\max() are the sizes of the output and inputs OutOut, ImageImage and FilterFilter respectively. The fourth term uses the results in [CDK+13a, CDK+15], which apply to general loop nests accessing arrays, though we will see that more work is required to apply these results concretely. This lower bound has the same power of MM in the denominator as the direct n-body problem. The fifth term is new, and is larger than the fourth term if and only if RS<MσWσHRS<M\sigma_{W}\sigma_{H}, which is a common case.

2 Proof of the lower bound B​C​K​W​H​R​S/M𝐵𝐶𝐾𝑊𝐻𝑅𝑆𝑀BCKWHRS/M

is the bound we seek. A set (s1,s2,s3)(s_{1},s_{2},s_{3}) satisfies (5) for all VV if and only if they satisfy the linear inequalities

It will turn out that the minimal value of i=13si\sum_{i=1}^{3}s_{i} is 2, leading to G=M2G=M^{2}, and using (3) the desired lower bound of

Suppose A={A1,...,An}A=\{A_{1},...,A_{n}\} and B={B1,...,Bm}B=\{B_{1},...,B_{m}\} are finite sets of subgroups of an abelian group. We will call them independent if

Let lattice(A){\rm lattice}(A) denote the lattice generated by the subgroups in AA, and similarly for lattice(B){\rm lattice}(B); recall that we will add {0}\{0\} to these lattices if they do not already contain it. Then from the definition of a lattice it is easy to see that lattice(A){\rm lattice}(A) and lattice(B){\rm lattice}(B) are independent if AA and BB are independent, in which case we have

Suppose AA and BB are independent. Then

Proof: ABlattice(A)+lattice(B)A\cup B\subset{\rm lattice}(A)+{\rm lattice}(B) since both lattices include {0}\{0\}. It suffices to show that if A1+B1A_{1}+B_{1} and A2+B2A_{2}+B_{2} are both in lattice(A)+lattice(B)lattice(AB){\rm lattice}(A)+{\rm lattice}(B)\subseteq{\rm lattice}(A\cup B), then so are their sum and intersection. The sum (A1+B1)+(A2+B2)=(A1+A2)+(B1+B2)lattice(A)+lattice(B)(A_{1}+B_{1})+(A_{2}+B_{2})=(A_{1}+A_{2})+(B_{1}+B_{2})\in{\rm lattice}(A)+{\rm lattice}(B) follows from being abelian, and a lattice being closed under addition. The intersection (A1+B1)(A2+B2)=(A1A2)+(B1B2)(A_{1}+B_{1})\cap(A_{2}+B_{2})=(A_{1}\cap A_{2})+(B_{1}\cap B_{2}) follows from independence, and a lattice being closed under intersection. \Box

The advantage of this is that if we have simple descriptions of lattice(A){\rm lattice}(A) and lattice(B){\rm lattice}(B) (say finite lists of each), then it is easy to describe lattice(AB){\rm lattice}(A\cup B).

More generally, let Ci={Ci,1,...,Ci,n(i)}C_{i}=\{C_{i,1},...,C_{i,n(i)}\} be a set of n(i)n(i) subgroups, for i=1,...,mi=1,...,m. We call them independent if for all ii

This is a natural generalization of the case where each Ci,jC_{i,j} is a vector space. Then, as above the lattice(Ci){\rm lattice}(C_{i}) are independent, and

and if we have finite lists of the members of each lattice(Ci){\rm lattice}(C_{i}), we can also list the members of lattice(iCi){\rm lattice}(\cup_{i}C_{i}).

Now suppose we start with some sets of the form

Each of the above sums may be over a subset of the possible values of ii. In other words, each DjD_{j} is gotten by choosing at most one member of each CiC_{i}, and adding them. Later we will choose these DjD_{j} to be the kernels of the projections ϕi\phi_{i} in the CNN. Then for any such D1,,DkD_{1},\cdots,D_{k}, we have

Now we tie this to code for CNNs. The projections ϕi\phi_{i} and their kernels are

Our goal is a finite list of subgroups HH that is a superset of lattice(K){\rm lattice}(K), where

and where for each such HH we can write down the inequality (7). Then by [CDK+15, Val10], solving the linear program that minimizes i=13si\sum_{i=1}^{3}s_{i} subject to these constraints will give us our desired bound G=Mi=13siG=M^{\sum_{i=1}^{3}s_{i}}.

which we identify with D1D_{1}, D2D_{2} and D3D_{3} in (9) above. So by (10), all we need are finite lists of members of each lattice(Ci){\rm lattice}(C_{i}) to write down a finite list of subgroups HH containing lattice(K){\rm lattice}(K). Since each CiC_{i} is small, it is easy to confirm the following facts:

where we have added {0}\{0\} to lattice(C1){\rm lattice}(C_{1}), lattice(C4){\rm lattice}(C_{4}) and lattice(C5){\rm lattice}(C_{5}). Since the cardinalities of these five lattices are 2, 5, 5, 2 and 2, respectively, the number of possibly different subgroups in ilattice(Ci)\sum_{i}{\rm lattice}(C_{i}) in (10) is at most 25522=2002\cdot 5\cdot 5\cdot 2\cdot 2=200.

It turns out that we only need 1+4+4+1+1=111+4+4+1+1=11 subgroups, or more generally i(Ci1)\sum_{i}(|C_{i}|-1), not iCi\prod_{i}|C_{i}|. This simplification is a generalization of the “Product Case” in section 6.3 of [CDK+13a]. The idea is that if Hilattice(Ci)H\in\sum_{i}{\rm lattice}(C_{i}), so that H=iCi,j(i)H=\sum_{i}C_{i,j(i)} where Ci,j(i)lattice(Ci)C_{i,j(i)}\in{\rm lattice}(C_{i}), then by independence of the CiC_{i} we get rank(H)=irank(Ci,j(i)){\rm rank}(H)=\sum_{i}{\rm rank}(C_{i,j(i)}), and we also get rank(ϕk(H))=irank(ϕk(Ci,j(i))){\rm rank}(\phi_{k}(H))=\sum_{i}{\rm rank}(\phi_{k}(C_{i,j(i)})) by the construction of the CiC_{i} from the ϕk\phi_{k}. Thus (7) follows from adding all the inequalities

There are only 11 such inequalities, because using Ci,j(i)={0}C_{i,j(i)}=\{0\} only yields the trivial inequality 000\leq 0.

The table below has one row for each Ci,j(i)C_{i,j(i)}, one column for rank(Ci,j(i)){\rm rank}(C_{i,j(i)}), 3 columns for each rank(ϕk(Ci,j(i))){\rm rank}(\phi_{k}(C_{i,j(i)})), and the rightmost column for the resulting inequality (11).

Removing redundant inequalities, we get just the following four:

We see that minimizing i=13si\sum_{i=1}^{3}s_{i} subject to these inequalities yields the desired value of 2, say by choosing s1=s2=s3=2/3s_{1}=s_{2}=s_{3}=2/3. The solution (s1,s2,s3)(s_{1},s_{2},s_{3}) is not unique, but their sum is.

Using the notation introduced above, we still want to bound the number of lattice points V|V| in any set VV of 7-tuples (b,c,k,w,h,r,s)(b,c,k,w,h,r,s) of integers, given the bounds ϕi(V)M|\phi_{i}(V)|\leq M for i{1,2,3}i\in\{1,2,3\}. However, we want a bound that is tighter than M2M^{2} when RSRS is small.

We will begin by rewriting the loop indices as r=σwr+rr=\sigma_{w}r^{\prime}+r^{\prime\prime} and s=σhs+ss=\sigma_{h}s^{\prime}+s^{\prime\prime}, where r[0,σw1]r^{\prime\prime}\in[0,\sigma_{w}-1] and s[0,σh1]s^{\prime\prime}\in[0,\sigma_{h}-1]. Replace the loop over rr with loops over rr^{\prime} (from to R/σw1)R/\sigma_{w}-1) and rr^{\prime\prime} (from to σw1\sigma_{w}-1), and replace the loop over ss similarly. Since each r,sr,s maps uniquely onto a single r,r,s,sr^{\prime},r^{\prime\prime},s^{\prime},s^{\prime\prime}, we can lift both ImageImage and FilterFilter onto a higher dimension.

There is a one-to-one correspondence between every point in the original seven-dimensional lattice and every point in the lifted nine-dimensional lattice. Therefore, it suffices to bound the number of lattice points V|V| of any set VV of 99-tuples (b,c,k,w,h,r,r,s,s)(b,c,k,w,h,r^{\prime},r^{\prime\prime},s^{\prime},s^{\prime\prime}) such that ϕi(V)M|\phi^{\prime}_{i}(V)|\leq M, with ϕi\phi^{\prime}_{i} defined as follows:

Let VV be any set of 9-tuples of integers (b,c,k,w,h,r,r,s,s)(b,c,k,w,h,r^{\prime},r^{\prime\prime},s^{\prime},s^{\prime\prime}) where ϕi(V)M|\phi^{\prime}_{i}(V)|\leq M for i{1,2,3}i\in\{1,2,3\}, and 1rR/σW1\leq r^{\prime}\leq R/\sigma_{W} and 1sS/σH1\leq s\leq S/\sigma_{H}. Then

This bound is obviously tighter than M2M^{2} precisely when RS<MσWσHRS<M\sigma_{W}\sigma_{H}. Plugging G=(RS)1/2M3/2(σWσH)1/2G=(RS)^{1/2}M^{3/2}(\sigma_{W}\sigma_{H})^{-1/2} into (3) immediately yields:

The number of reads and writes to execute a CNN with a fast memory of size MM is at least BCKWH(RSσWσH/M)1/2BCKWH(RS\sigma_{W}\sigma_{H}/M)^{1/2}.

Proof of Lemma 4.2: Let V(r,s)V(r^{\prime},s^{\prime}) denote the restriction of VV to a given value of r,sr^{\prime},s^{\prime}, so that V=r,sV(r,s)V=\cup_{r^{\prime},s^{\prime}}V(r^{\prime},s^{\prime}) is a disjoint union of sets and V=r,sV(r,s)|V|=\sum_{r^{\prime},s^{\prime}}|V(r^{\prime},s^{\prime})|. Note that ϕ3(V)=r,sϕ3(V(r,s))\phi^{\prime}_{3}(V)=\cup_{r^{\prime},s^{\prime}}\phi^{\prime}_{3}(V(r^{\prime},s^{\prime})) is also a disjoint union of sets, so

Also ϕ1(V(r,s))ϕ1(V)M|\phi^{\prime}_{1}(V(r^{\prime},s^{\prime}))|\leq|\phi^{\prime}_{1}(V)|\leq M and ϕ2(V(r,s))ϕ2(V)M|\phi^{\prime}_{2}(V(r^{\prime},s^{\prime}))|\leq|\phi^{\prime}_{2}(V)|\leq M. We want to bound V(r,s)|V(r^{\prime},s^{\prime})| in terms of these bounds on ϕ1(V(r,s))|\phi^{\prime}_{1}(V(r^{\prime},s^{\prime}))|, ϕ2(V(r,s))|\phi^{\prime}_{2}(V(r^{\prime},s^{\prime}))|, and ϕ3(V(r,s))|\phi^{\prime}_{3}(V(r^{\prime},s^{\prime}))|.

This is another application of the HBL inequalities discussed in section 4.2. Since each of the loop indices appear in exactly two of the three ϕi\phi^{\prime}_{i}, this is a special case of a tensor contraction, for which the optimal exponents are s1=s2=s3=1/2s_{1}=s_{2}=s_{3}=1/2 as shown in Section 6.3 of [CDK+13a]. This yields:

We want to maximize this subject to (12). Since we are summing RS/(σwσh)RS/(\sigma_{w}\sigma_{h}) terms, a simple application of Lagrange multipliers tells us that this maximum is attained when all ϕ3(V(r,s))=M/(RS/(σwσh))|\phi^{\prime}_{3}(V(r^{\prime},s^{\prime}))|=M/(RS/(\sigma_{w}\sigma_{h})), yielding the desired

Communication Optimal Algorithms

In this section we show that the lower bounds in Theorem 1 are always attainable by an appropriate tiling, analogous to the one for matrix multiplication in section 3. We will prove this by constructing a tiling for any possible set of array bounds, and verifying that the tiling attains one of lower bounds from (4). The tile sizes we construct may also be useful starting points for optimization in practice, although the exact tile sizes may not necessarily give the best performance (due to constant factors omitted from our analysis).

Following the approach of [RD16], we will achieve this tiling by blocking each variable into contiguous blocks, except for rr and ss, which we will rewrite as r=σwr+rr=\sigma_{w}r^{\prime}+r^{\prime\prime} and s=σhs+ss=\sigma_{h}s^{\prime}+s^{\prime\prime} (with r[0,σw1]r^{\prime\prime}\in[0,\sigma_{w}-1] and s[0,σh1]s^{\prime\prime}\in[0,\sigma_{h}-1]) respectively. That is, we will rewrite our loop as follows, where we use fori=α:β:γ{\rm for}\,i=\alpha:\beta:\gamma to denote iterating from α\alpha to γ\gamma with a step size of β\beta:

To minimize the communication cost, it suffices to maximize the size of each block (that is, bbbcbkbwbhbrbrbsbsb_{b}b_{c}b_{k}b_{w}b_{h}b_{r^{\prime}}b_{r^{\prime\prime}}b_{s^{\prime}}b_{s^{\prime\prime}}) subject to the following constraints:

The block size in each dimension is smaller than the loop bound for that dimension. For the first five indices, we have:

The loop bounds on rr^{\prime\prime} and ss^{\prime\prime} (given by their definitions) give the following two constraints:

To ensure that the blocks for rr^{\prime} and ss^{\prime} are of appropriate size, recall that r=σwr+rr=\sigma_{w}r^{\prime}+r^{\prime\prime} and s=σhs+ss=\sigma_{h}s^{\prime}+s^{\prime\prime}, which gives

Since brσwb_{r^{\prime\prime}}\leq\sigma_{w} and bsσhb_{s^{\prime\prime}}\leq\sigma_{h}, we can safely omit those from the inequality (since their effect on left-hand side is at most equivalent to adding 11 to brb_{r^{\prime}} and bsb_{s^{\prime}}, and we are only interested in asymptotics) to get

The size of each block does not exceed the size of fast memory MM. This is straightforward for OutOut:

For ImageImage, notice that if a block for rr^{\prime} is [rstart,rend][r^{\prime}_{start},r^{\prime}_{end}], a block of rr^{\prime\prime} is [rstart,rend][r^{\prime\prime}_{start},r^{\prime\prime}_{end}] and a block for ww is [wstart,wend][w_{start},w_{end}], then the indices of ImageImage in the ww-dimension accessed will be of the form i+σwji+\sigma_{w}j, where i[rstart,rend]i\in[r^{\prime\prime}_{start},r^{\prime\prime}_{end}] and j[wstart+rstart,wend+rend]j\in[w_{start}+r^{\prime}_{start},w_{end}+r^{\prime}_{end}]. As a result, the number of indices in the ww-dimension accessed is (bw+br)br(b_{w}+b_{r^{\prime}})b_{r^{\prime\prime}}; similarly for the hh-dimension. Therefore, the total number of elements accessed from ImageImage must be:

As we will see shortly, it is convenient to recast our maximization problem as a linear program by taking logs; for this to happen, we only want products in the inequality. Multiplying out the left-hand side of the above gives a sum of four terms; bounding each of them by MM is sufficient for an asymptotic analysis. Therefore, we get:

Taking the log base MM of the objective and all the constraints, we get the following linear program, with l{b,...,s}=logMb{b,...,s}l_{\{b,...,s^{\prime\prime}\}}=\log_{M}b_{\{b,...,s^{\prime\prime}\}}:

We will first determine a closed-form solution for this linear program; that is, we will partition the space of possible input parameters (i.e. the array bounds and strides) into convex polytopes, within each of which the optimal tiling and communication cost it achieves are described by a single linear function of the input parameters. We will then use this closed-form solution to verify optimality of the tiling (and attainability of the lower bound) by showing that the communication cost of the optimal tiling is always equal to a communication lower bound for every point in the parameter space.

Although this construction and verification can in theory be performed by hand (see Appendix A for a hand analysis of the case where σw=σh=1\sigma_{w}=\sigma_{h}=1), the size of the result - the partition we find is a set of 200200 regions - makes it far more expedient to automate the analysis, which will also allow us to more easily extend this approach to other problems. An implementation of the algorithm, as well as a table of partitions, optimal tilings, and optimal communication costs in each of these partitions, may be found at https://people.eecs.berkeley.edu/~dinh/papers/DD18/partitioning.nb.

Algorithms for determining a closed-form solution to the parameterized linear programs have been studied extensively in the context of control theory [GN72, BBM03, STJ05, JBM07]. We used the geometric algorithm from [BBM03], which we briefly describe in this section and fully specify in Figure 1.; see the original paper for a proof of correctness.

For convenience, represent the LP (13) as

where c=[1,...,1]Tc=[-1,...,-1]^{T}, x=[lb,lc,...,ls]x=[l_{b},l_{c},...,l_{s^{\prime\prime}}], GG is the coefficient matrix for the left-hand side of the inequalities, θ=[logMB,logMC,...,logMσh]T\theta=[\log_{M}B,\log_{M}C,...,\log_{M}\sigma_{h}]^{T}, and FF and ww are the coefficient matrix and vector, respectively, for the right-hand side of the inequalities.

The intuition for the algorithm is as follows: start with a (possibly open) polytope in parameter space; during the first iteration of the algorithm, this is the set of all possible valid loop bounds and strides (i.e. nonnegative parameters, filter fitting inside the input). Pick a random point (not necessarily uniformly) inside that region, setting the parameters to its coordinates. Solve the linear program at that point using a method, such as simplex, that guarantees that the solution produced will be a vertex of the polytope, and note which constraints are made tight.

The number of tight constraints should be at least to the number of variables in the linear program, which is nine in this case, since our solution is a vertex of the polytope. If there are more than nine constraints, there are two possibilities: either (a) when the constraints are set to equality (to ensure tightness), there are redundant constraints, and the number of non-redundant constraints is nine, or (b) the point we selected lies on the border of two partitions in parameter space. Since the borders of partitions are of lower dimension than the parameter space itself (and since we selected our initial point randomly), the probability that we encounter case (b) is zeroBecause of the discreteness of random number generators in practice, as well as the possibility of using a nonuniform sampler for performance, the probability may not be exactly zero. Nevertheless, randomly resampling a point or perturbing our initial point if we see case (b) is sufficient..

If there are nine tight constraints at our solution, the optimizer x(θ)x^{*}(\theta) at this point is the solution to Gtx=wt+FtθG_{t}x=w_{t}+F_{t}\theta, where GtG_{t}, wtw_{t}, and FtF_{t} correspond to GG, ww, and FF restricted to the tight constraints. The polytope (in parameter space) within which x(θ)x^{*}(\theta) is the optimizer is given by Gsx(θ)ws+FsθG_{s}x^{*}(\theta)\leq w_{s}+F_{s}\theta, where GsG_{s}, wsw_{s}, and FsF_{s} are the restrictions of GG, ww, and FF to slack constraints. Similarly, if there are more than nine tight constraints, we set the tight constraints to equality and solve to get the optimizer; the polytope is defined as the region where the slack constraints remain slack at the optimizer (see Figure 1 for details).

Once we have obtained this polytope, we partition the remainder of our initial polytope into convex polytopes and recursively partition each one using this algorithm. We terminate when the remainder is either empty or is of lower dimension than parameter space.

2 Verifying optimality of the tiling

In order to verify the optimality of the tiling and the attainability of the lower bound, we must ensure that the communication cost attained by the tiling equals the maximum of the five lower bounds for every possible element.

Define the function Cp(θ)\mathscr{C}_{p}(\theta) as the log of the communication cost at point θ\theta using the tiling provided by the algorithm above for partition pp, that is:

where x^bp,...,x^sp\hat{x}_{b}^{p},...,\hat{x}_{s^{\prime\prime}}^{p} represent the closed-form optimizers in region pp as a function of θ\theta, and let Li(θ)\mathscr{L}_{i}(\theta) be the iith lower bound from (4) as a function of θ\theta.

It suffices to show that for every partition pp and parameter set θp\theta\in p, Cp(θ)\mathscr{C}_{p}(\theta) is equal to the maximum of the five lower bounds; in other words, that the quantity

is zero for all i{1,...,5}i\in\{1,...,5\} and for all partitions pp. This is not precisely the case - in the cases where all the inputs and outputs fit inside fast memory, the result may be nonzero since the computation in Equation 5.2 assumes that MM words are transmitted in each round (which is obviously more than the number of words transmitted if everything fits in cache); as a result, we exclude regions where the communication lower bound is less than MM. The LP solutions can easily be checked to be correct using an LP solver; for our code and results, see https://people.eecs.berkeley.edu/~dinh/papers/DD18/partitioning.nb.

As a sanity-check, we can also ensure that no tiling generated by our LP breaks any of our lower bounds, i.e.

should be nonnegative for all pp, ii. This is confirmed by our code as well.

3 Observations

A plot of our communication costs for one set of CNN parameters for a real world neural net is shown in Figure (2). Notice that different communication bounds (and different tilings) apply depending on the size of the fast memory being optimized for. As a result, a more sophisticated analysis would be required to optimize communication for multi-level memory model with different layers of cache; we leave this to future work.

It can also be observed in this example that when the memory size is larger than the size of the filter, the optimal tiling requires asymptotically no more memory than the size of the output array. In fact, an examination of the solutions for the LP (13) shows that in cases where any of ImageImage, OutputOutput, and FilterFilter fits entirely in cache, there always exists a tiling that will ensure a communication cost no higher than the max of their sizes.

Application to Pooling

Our techniques can also be extended to programs with similar loop structures, such as pooling:

where aba\oplus b can either mean a+b/(rs)a+b/(rs) (“average-pooling”) or max(a,b)\max(a,b) (“max-pooling”). The following communication lower bound holds for pooling:

The first two terms correspond to the size of OutOut and ImageImage respectively, while the third can be found using the approach from Section 4.2. In fact, since the only difference between pooling and the convolution (1) is the absence of FilterFilter, it suffices to remove ϕ3\phi_{3} from our calculations. In particular, the sets of subgroups we consider

Solving the resulting LP gives us the third lower bound, as desired.

We can also follow the approach from Section 5 to verify that this lower bound is tight and always attainable with the tiling given by the solution to the LP (13) with the filter constraint (lc+lk+lr+lr+ls+ls1l_{c}+l_{k}+l_{r^{\prime}}+l_{r^{\prime\prime}}+l_{s^{\prime}}+l_{s^{\prime\prime}}\leq 1) removed.

Conclusions and Future Work

We have found an asymptotic lower bound on communication required to evaluate a convolutional neural net, and provided an optimal reordering of the nested loops to attain the lower bound in every case by taking advantage of significantly more (2.752.75 times in the example from Figure 2) data reuse than is possible with matrix multiply or many other dense linear algebra operations. We describe a few avenues for future research below:

Parallel Algorithm: Suppose we have pp processors which do not communicate with each other except through a single shared memory of size M, and we wish to divide work between them while maintaining communication efficiency. For simplicity, let us only consider dividing work by evenly partitioning the output dimensions (i.e. assigning each processor a subset of bb, kk, hh, and ww) so as to avoid possible race conditions caused by assigning the same coordinate in the output to multiple processors. Suppose we assign each processor a single block of size B=B/aBB\text{\textquoteright}=B/a_{B}, K=K/aKK\text{\textquoteright}=K/a_{K}, H=H/aHH\text{\textquoteright}=H/a_{H}, and W=W/aWW\text{\textquoteright}=W/a_{W}, such that aBaKaHaWpa_{B}a_{K}a_{H}a_{W}\leq p.

Since each of the processors handles a subproblem of the same size, the communication lower bound (total memory traffic between the shared memory and all the processors) under these assumptions is pp multiplied by the per-processor expression (4), with BB, KK, HH, WW replaced by BB\text{\textquoteright}, KK\text{\textquoteright}, WW\text{\textquoteright}, HH\text{\textquoteright}. The first (output size), fourth, and fifth terms in this expression remain the same; the second and third (output and filter, respectively) increase by a factor of aKa_{K} and aBaWaHa_{B}a_{W}a_{H} respectively; once these parameters are fixed, the communication lower bound is

Since the bound is for fixed values of aa, we should choose those values in order to minimize the above quantity.

Optimizing the tiling (under these assumptions) can be done by solving a modified version of LP (13), with the BB\text{\textquoteright}, KK\text{\textquoteright}, WW\text{\textquoteright}, HH\text{\textquoteright} replacing BB, KK, HH, WW and with additional constraints encoding aBaKaHaWpa_{B}a_{K}a_{H}a_{W}\leq p, paBBpa_{B}\leq B, etc.

A more sophisticated analysis (examining tiling schemes that may cause race conditions and distributed models where processors can directly communicate with each other) is left to future work.

Implementation and benchmarking: The tilings we generate are asymptotically optimal in terms of communication. However, many factors, such as cache locality and processor architecture (tile sizes that are multiples of a processor’s vector width are likely to be more efficient), can provide significant constant-factor changes to the real-world performance (both in terms of time and energy). Since our algorithm is simply a rearrangement of the same arithmetic operations performed during a CNN, we believe our algorithms can provide a significant advantage over current implementations of convolutions that do no tiling (e.g. that used in Torchhttps://github.com/torch/torch7/blob/master/lib/TH/generic/THTensorConv.c ) or those that tile only based on machine parameters without taking into account the dimensions of the problem (e.g. CuTorchhttps://github.com/torch/cutorch/blob/master/lib/THC/THCTensorConv.cu ); we leave the validation of this intuition through an implementation and benchmark to future work.

Generalizing to arbitrary nested loops: Our lower bound in the “small filter” case rests on a problem-specific lifting of the HBL LP to a higher dimension; similarly, our approach for finding a closed form for optimal tilings (which is necessary for checking if a lower bound is always attainable) relies on the creation of a linear program tailored to this specific problem. Generalizing this to arbitrary nested loops would move us closer to being able to automatically optimize arbitrary loop nests for communication, e.g. at a compiler level.

References

Appendix A: Manual Exploration of Tilings

In this section we show, by hand, that the lower bounds in Theorem 1 are attainable in the case where σw=σh=1\sigma_{w}=\sigma_{h}=1 . The advantage of this approach appears is that it seems to be possible to attain far more compact representations of the optimal tiling function than with an automated exploration, which produced well over twice as many cases (for the one-stride case) as the hand exploration in this section.

There are a number of cases, depending on which of the five terms in Theorem 1 is largest, and other inequalities. To simplify the presentation, we present the overall result as a decision tree, where each leaf of the tree represents a disjoint subset of the possible lower bounds and optimal algorithms. After stating the result, we will give some intuition for the decision tree, as arising from a linear program, and then prove the theorem by providing an algorithm and communication cost analysis, with a separate lemma for each leaf.

The following cases describe which of the lower bound expressions in Theorem 1 are attainable. The abbreviation ALB stands for Attainable Lower Bound:

Here is some intuition for why the cases in Theorem 2 arise, and some notation we will use later in the proof. Suppose we tile the bb loop with block size bBbB, the kk loop with bKbK, the hh loop with bHbH, and so on. Then the algorithm becomes (recall that in this case we are assuming σw=σh=1\sigma_{w}=\sigma_{h}=1):

The argument list in the name is meant to indicate both the order of the nested loops, via the notation bb|, kk|, hh|, etc., and the block sizes. A natural way to try to optimize this code is to pick the block sizes so that all the data accessed in the 7 innermost loops fits in fast memory of size MM, and to maximize the number of loop iterations that can be performed by these loops, namely bBbKbHbWbRbSbCbB\cdot bK\cdot bH\cdot bW\cdot bR\cdot bS\cdot bC. It is easy to see that the submatrix of OutOut accessed by these loops is of size bKbHbWbBbK\cdot bH\cdot bW\cdot bB, the submatrix of FilterFilter accessed is of size bKbRbSbCbK\cdot bR\cdot bS\cdot bC, and the submatrix of ImageImage accessed is of size bC(bS+bH)(bR+bW)bBbC\cdot(bS+bH)\cdot(bR+bW)\cdot bB. Since RWR\leq W, we will assume bRbWbR\leq bW, and similarly bSbHbS\leq bH, which means the submatrix of CC accessed is of size at most 4bCbHbWbB4\cdot bC\cdot bH\cdot bW\cdot bB; we will ignore the constant 4 for simplicity, since it does not change our Big-O analysis below. This yields the following optimization problem:

We call a tiling admissible if it satisfies the constraints above. (The reader may wonder why the first 3 constraints do not use the upper bound M/3M/3 instead of MM, to be sure all three submatrices fit in fast memory simultaneously; this would only change the value of GG by a constant factor, which would not change the following Big-O analysis.) Then assuming the 3 submatrices are each read or written just once in the innermost 7 loops, the total number of reads and writes is O(BKHWCRSM/G)O(BKHWCRSM/G), since BKHWCRSBKHWCRS is the total number of loop iterations, BKHWCRS/GBKHWCRS/G is the number of iterations of the outermost 7 loops, and there are O(M)O(M) reads/writes per iteration of the outermost 7 loops. This last statement about O(M)O(M) read/writes per iteration may depend on the cache replacement policy in the hardware, but again it will not change the Big-O analysis.

If we now replace each quantity by its logarithm base MM, so BB by lB=logMBlB=\log_{M}B, bBbB by lbB=logMbBlbB=\log_{M}bB and so on, we get the following linear program:

Exploring the finite number of corners of the polytope defined by this linear program lead to the cases in Theorem 2. While the proof of Theorem 2 requires this exploration by hand, and confirming that the lower bound of Theorem 1 is attained, in practice the linear program could be used to determine the optimal block sizes. We note that this linear program has 7 variables and 12 constraints (besides nonnegativity), so there are as many as (127)=792\binom{12}{7}=792 corners in the feasible polytope to explore; fortunately only a few turn out to be important.

To keep the proofs short, we will use the notation BlockCNN above, and the cost expression O(BKHWCRSM/G)O(BKHWCRSM/G), to describe the optimal algorithm in each case. We will use the expression

to denote the lower bound from Theorem 1. To capture (some of) the non-uniqueness of the optimal solutions, we will use the following two functions to help solve linear programs: Function (x,y)=f2(xˉ,yˉ,s)(x,y)=f_{2}(\bar{x},\bar{y},s) takes 3 nonnegative arguments satisfying xˉ+yˉs\bar{x}+\bar{y}\geq s, and returns some 0xxˉ0\leq x\leq\bar{x} and 0yyˉ0\leq y\leq\bar{y} satisfying x+y=sx+y=s. Function (x,y,z)=f3(xˉ,yˉ,zˉ,s)(x,y,z)=f_{3}(\bar{x},\bar{y},\bar{z},s) similarly takes 4 nonnegative arguments satisfying xˉ+yˉ+zˉs\bar{x}+\bar{y}+\bar{z}\geq s, and returns some 0xxˉ0\leq x\leq\bar{x}, 0yyˉ0\leq y\leq\bar{y} and 0zzˉ0\leq z\leq\bar{z} satisfying x+y+z=sx+y+z=s.

Upper Bound Case 1: Suppose min(CHWB,KCRS,KHWB)M\min(CHWB,KCRS,KHWB)\leq M, i.e. at least one of the 3 arrays ImageImage, FilterFilter and OutOut fits in fast memory. Then the attainable communication lower bound is O(max(CHWB,KCRS,KHWB))O(\max(CHWB,KCRS,KHWB)).

Proof: Case 1 in turn breaks down into a number of subcases (again, we ignore constant factors):

Use BlockCNN(bB,kK,hH,wW,rR,sS,cCb|B,k|K,h|H,w|W,r|R,s|S,c|C), i.e. the original unblocked algorithm,

LB=max(CHWB,KHWB,KCRS)LB=\max(CHWB,KHWB,KCRS) because KCRSMKCRS\leq M implies RSMRS\leq M implies

Use BlockCNN(b1,kK,hH,wW,rR,sS,c(M/(HW)b|1,k|K,h|H,w|W,r|R,s|S,c|(M/(HW)). It is straightforward to confirm that this tiling is admissible. Then G=KHWRS(M/(HW))=KRSMG=KHWRS(M/(HW))=KRSM, and so the number of read/writes is O(CHWB)O(CHWB). The same inequalities as in Case 1.1 show

Swap the roles of CC and KK in Case 1.2.

Let (lbR,lbS)=f2(lR,lS,1lKlC)(lbR,lbS)=f_{2}(lR,lS,1-lK-lC), and then bR=MlbRbR=M^{lbR} and bS=MlbSbS=M^{lbS}. Then use BlockCNN(bB,kK,hH,wW,rbR,sbS,cCb|B,k|K,h|H,w|W,r|bR,s|bS,c|C). Admissibility follows from the definition of f2()f_{2}(). Then G=KHWB(M/KC)C=HWMBG=KHWB(M/KC)C=HWMB, and the number of reads/writes is O(KCRS)O(KCRS).

To show LB=max(CHWB,KHWB,KCRS)LB=\max(CHWB,KHWB,KCRS) we first note CHWBMCHWB\leq M implies HWBMHWB\leq M implies KCHWBRS/MKCRSKCHWBRS/M\leq KCRS. Multiplying the first 3 inequalities defining Case 1.4 yields KCRSMMMCHWBKHWBKCRS\cdot M\cdot M\geq M\cdot CHWB\cdot KHWB, or RS(RS/M)1/2HWBRS\geq(RS/M)^{1/2}HWB, and thus KCRSKCHWB(RS/M)1/2KCRS\geq KCHWB(RS/M)^{1/2}.

RSHWMRS\leq HW\leq M, so lR+lS1lR+lS\leq 1. Let (lbK,lbC)=f2(lK,lC,1lRlS)(lbK,lbC)=f_{2}(lK,lC,1-lR-lS), and then bK=MlbKbK=M^{lbK} and bC=MlbCbC=M^{lbC}. Then use BlockCNN(bB,kbK,hH,wW,rR,sS,cbCb|B,k|bK,h|H,w|W,r|R,s|S,c|bC). Admissibility follows from the definition of f2()f_{2}(). Then G=(M/(RS))HWRSB=HWMBG=(M/(RS))HWRSB=HWMB, and the number of reads/writes is O(KCRS)O(KCRS).

Showing LB=max(CHWB,KHWB,KCRS)LB=\max(CHWB,KHWB,KCRS) is identical to Case 1.4.

KCMKC\leq M so max(K,C)M\max(K,C)\leq M. Let (lbB,δh,δw)=f3(lB,lHlS,lWlR,1max(lC,lK)lRlS)(lbB,\delta_{h},\delta_{w})=f_{3}(lB,lH-lS,lW-lR,1-\max(lC,lK)-lR-lS). This is well-defined because lHlS0lH-lS\geq 0 is equivalent to HSH\geq S, lWlR0lW-lR\geq 0 is equivalent to WRW\geq R, 1max(lC,lK)lRlS01-\max(lC,lK)-lR-lS\geq 0 is equivalent to Mmax(KRS,CRS)M\geq\max(KRS,CRS), which is implied by KCRSMKCRS\leq M, and lB+lHlS+lWlR1max(lC,lK)lRlSlB+lH-lS+lW-lR\geq 1-\max(lC,lK)-lR-lS is equivalent to max(KHWB,CHWB)M\max(KHWB,CHWB)\geq M. Now let lbH=lS+δhlHlbH=lS+\delta_{h}\leq lH, lbW=lR+δwlWlbW=lR+\delta_{w}\leq lW, bB=MlbBbB=M^{lbB}, bH=MlbHbH=M^{lbH} and bW=MlbWbW=M^{lbW}. Thus SbHHS\leq bH\leq H and RbWWR\leq bW\leq W. Then use BlockCNN(bbB,kK,hbH,wbW,rR,sS,cCb|bB,k|K,h|bH,w|bW,r|R,s|S,c|C). Admissibility follows from KCRSMKCRS\leq M and lbB+lbH+lbW=1max(lC,lK)lbB+lbH+lbW=1-\max(lC,lK), so bBbHbW=M/max(K,C)bB\cdot bH\cdot bW=M/\max(K,C), and max(KbHbWbB,CbHbWbB)=M\max(K\cdot bH\cdot bW\cdot bB,C\cdot bH\cdot bW\cdot bB)=M. Then G=KC(M/max(K,C))RS=min(K,C)MRSG=KC(M/\max(K,C))RS=\min(K,C)MRS, and the number of reads/writes is O(max(KHWB,CHWB))O(\max(KHWB,CHWB)).

To show LB=max(CHWB,KHWB,KCRS)LB=\max(CHWB,KHWB,KCRS) we note KCHWBRS/MHWBmax(CHWB,KHWB)KCHWBRS/M\leq HWB\leq\max(CHWB,KHWB) and KCHWB(RS/M)1/2KCHWB(1/(KC))1/2=(KC)1/2HWBmax(K,C)HWBKCHWB(RS/M)^{1/2}\leq KCHWB(1/(KC))^{1/2}=(KC)^{1/2}HWB\leq\max(K,C)HWB.

Use BlockCNN(bB,kK,hH,wW,r1,s1,c(M/K)b|B,k|K,h|H,w|W,r|1,s|1,c|(M/K)). Admissibility follows from (M/K)HWBM(M/K)HWB\leq M. Then G=MHWBG=MHWB and the number of reads/writes is O(KCRS)O(KCRS).

Note that KCRSHWBCRSCHWBKHWBKCRS\geq HWBCRS\geq CHWB\geq KHWB and KCHWBRS/MKCRSKCHWBRS/M\leq KCRS. KHWBMKHWB\leq M and KHWBK\geq HWB together imply HWBM1/2(MRS)1/2HWB\leq M^{1/2}\leq(MRS)^{1/2}, and hence KCHWB(RS/M)1/2KCRSKCHWB(RS/M)^{1/2}\leq KCRS. So altogether LB=max(KCRS,CHWB,KHWB)LB=\max(KCRS,CHWB,KHWB).

Let (lbR,lbS)=f2(lR,lS,min(lR+lS,lH+lW+lBlK))(lbR,lbS)=f_{2}(lR,lS,min(lR+lS,lH+lW+lB-lK)), and then bR=MlbRbR=M^{lbR} and bS=MlbSbS=M^{lbS}. Note that lH+lW+lBlK0lH+lW+lB-lK\geq 0 because KHWBK\leq HWB.

Use BlockCNN(bB,kK,hH,wW,rbR,sbS,c(M/(HWB))b|B,k|K,h|H,w|W,r|bR,s|bS,c|(M/(HWB)). Admissibility follows since K(M/(HWB))bRbSK(M/HWB)(HWB/K)=MK\cdot(M/(HWB))\cdot bR\cdot bS\leq K\cdot(M/HWB)\cdot(HWB/K)=M. Then G=BKHW(M/HWB)min(RS,HWB/K)=min(KMRS,MHWB)G=BKHW(M/HWB)\min(RS,HWB/K)=\min(KMRS,MHWB) and the number of reads/writes is O(KCRSHWBM/min(KMRS,MHWB))=O(max(CHWB,KCRS))O(KCRSHWBM/\min(KMRS,MHWB))=O(\max(CHWB,KCRS)).

KHWBMKHWB\leq M implies HWBMHWB\leq M implies KCRSHWB/MKCRSKCRSHWB/M\leq KCRS. If KRSHWBKRS\leq HWB, then K(RS)1/2M1/2K(HWB/K)1/2M1/2=(KHWB)1/2M1/21\frac{K(RS)^{1/2}}{M^{1/2}}\leq\frac{K(HWB/K)^{1/2}}{M^{1/2}}=\frac{(KHWB)^{1/2}}{M^{1/2}}\leq 1, implying CHWBKCHWB(RS/M)1/2CHWB\geq KCHWB(RS/M)^{1/2}. Alternatively, if KRSHWBKRS\geq HWB, then MKHWB(HWB)2/RSM\geq KHWB\geq(HWB)^{2}/RS, so RS(HWB)2/M(RS)2RS(HWB)^{2}/M\leq(RS)^{2}, and KCRSKCHWB(RS/M)1/2KCRS\geq KCHWB(RS/M)^{1/2}. So altogether LB=max(KCRS,CHWB,KHWB)LB=\max(KCRS,CHWB,KHWB).

Swap the roles of CC and KK in Case 1.7.

Swap the roles of CC and KK in Case 1.8.

Upper Bound Case 2.1: Suppose RSMRS\geq M. Then the attainable communication lower bound is O(KCHWRSB/M)O(KCHWRSB/M).

Proof: Note that HWRSMHW\geq RS\geq M implies min(CHWB,KHWB,KCRS)M\min(CHWB,KHWB,KCRS)\geq M. Let (lbR,lbS)=f2(lR,lS,1)(lbR,lbS)=f_{2}(lR,lS,1), lbW=lbRlbW=lbR and lbH=lbSlbH=lbS, and then bR=MlbRbR=M^{lbR}, bS=MlbSbS=M^{lbS}, bW=bRbW=bR and bH=bSbH=bS. Use BlockCNN(b1,k1,c1,hbH,wbW,rbR,sbSb|1,k|1,c|1,h|bH,w|bW,r|bR,s|bS). Admissibility follows from bRbS=bHbW=MbR\cdot bS=bH\cdot bW=M. Then G=M2G=M^{2}, so the number of reads/writes is O(KCHWRSMB/G)=O(KCHWRSB/M)O(KCHWRSMB/G)=O(KCHWRSB/M).

HWRSMHW\geq RS\geq M implies KCHWRSB/Mmax(CHWB,KHWB,KCRS)KCHWRSB/M\geq\max(CHWB,KHWB,KCRS). RSMRS\geq M also implies KCHWBRS/MKCHWB(RS/M)1/2KCHWBRS/M\geq KCHWB(RS/M)^{1/2}. So LB=KCHWRSB/MLB=KCHWRSB/M. \Box

Upper Bound Case 2.2.1: Suppose min(CHWB,KCRS,KHWB)M\min(CHWB,KCRS,KHWB)\geq M, RSMRS\leq M and MRS(HWB)2MRS\geq(HWB)^{2}. Then the attainable communication lower bound is O(KCRS)O(KCRS).

Proof: Note that MHWMRS(HWB)2MHW\geq MRS\geq(HWB)^{2}, so MB2HWM\geq B^{2}HW. Let bC=M/(HWB)1bC=M/(HWB)\geq 1, and so bCCHWB/(HWB)=CbC\leq CHWB/(HWB)=C. Also KHWBM(HWB)2/(RS)KHWB\geq M\geq(HWB)^{2}/(RS) so KHWB/(RS)K\geq HWB/(RS). Let bK=HWB/(RS)1bK=HWB/(RS)\geq 1, and so bKKbK\leq K. Use BlockCNN(bB,kbK,hH,wW,rR,sS,cbCb|B,k|bK,h|H,w|W,r|R,s|S,c|bC). Admissibility follows from bCHWB=MbC\cdot HWB=M, bKHWB=(HWB)2/(RS)MbK\cdot HWB=(HWB)^{2}/(RS)\leq M, and bCbKRS=M/(HWB)HWB/(RS)RS=MbC\cdot bK\cdot RS=M/(HWB)\cdot HWB/(RS)\cdot RS=M. Then G=bKbCHWRSB=HWMBG=bK\cdot bC\cdot HWRSB=HWMB, so the number of reads/writes is O(KCHWRSBM/(HWMB))=O(KCRS)O(KCHWRSBM/(HWMB))=O(KCRS).

(HWB)2MRS(HWB)^{2}\leq MRS implies HWB(RS/M)1/2RSHWB(RS/M)^{1/2}\leq RS implies KCHWB(RS/M)1/2KCRSKCHWB(RS/M)^{1/2}\leq KCRS. RSMRS\leq M implies KCHWBRS/MKCHWB(RS/M)1/2KCRSKCHWBRS/M\leq KCHWB(RS/M)^{1/2}\leq KCRS. (HWB)2/RSMKHWB(HWB)^{2}/RS\leq M\leq KHWB implies HWBKRSHWB\leq KRS implies CHWBKCRSCHWB\leq KCRS. Similarly, (HWB)2/RSMCHWB(HWB)^{2}/RS\leq M\leq CHWB implies HWBCRSHWB\leq CRS implies KHWBKCRSKHWB\leq KCRS. Thus LB=KCRSLB=KCRS. \Box

Upper Bound Case 2.2.2.1: Suppose min(CHWB,KCRS,KHWB)M\min(CHWB,KCRS,KHWB)\geq M, RSMRS\leq M, MRS(HWB)2MRS\leq(HWB)^{2} and min(C,K)(M/(RS))1/2\min(C,K)\geq(M/(RS))^{1/2}. Then the attainable communication lower bound is O(KCHWB(RS/M)1/2)O(KCHWB(RS/M)^{1/2}).

Proof: Let bK=bC=(M/(RS))1/2min(C,K)bK=bC=(M/(RS))^{1/2}\leq\min(C,K), and bR=RbR=R and bS=SbS=S. Let (lbB,δh,δw)=f3(lB,lHlS,lWlR,(1lRlS)/2)(lbB,\delta_{h},\delta_{w})=f_{3}(lB,lH-lS,lW-lR,(1-lR-lS)/2). This is well-defined because lHlS0lH-lS\geq 0 is equivalent to HSH\geq S, lWlR0lW-lR\geq 0 is equivalent to WRW\geq R, (1lRlS)/20(1-lR-lS)/2\geq 0 is equivalent to (M/RS)1/21(M/RS)^{1/2}\geq 1 or MRSM\geq RS, and lB+lHlS+lWlR(1lRlS)/2lB+lH-lS+lW-lR\geq(1-lR-lS)/2 is equivalent to HWB/(RS)(M/RS)1/2HWB/(RS)\geq(M/RS)^{1/2}, or (HWB)2MRS(HWB)^{2}\geq MRS. Now let lbH=lS+δhlHlbH=lS+\delta_{h}\leq lH, lbW=lR+δwlWlbW=lR+\delta_{w}\leq lW, bH=MlbHbH=M^{lbH}, bW=MlbWbW=M^{lbW} and bB=MlbBbB=M^{lbB}. Then bSbHHbS\leq bH\leq H, bRbWWbR\leq bW\leq W, and bHbWbB=(MRS)1/2bH\cdot bW\cdot bB=(MRS)^{1/2}. Use BlockCNN(bbB,kbK,hbH,wbW,rbR,sbS,cbCb|bB,k|bK,h|bH,w|bW,r|bR,s|bS,c|bC). Admissibility follows from bKbHbWcB=bCbHbWbB=bKbCbRbS=MbK\cdot bH\cdot bW\cdot cB=bC\cdot bH\cdot bW\cdot bB=bK\cdot bC\cdot bR\cdot bS=M. Then G=bBbKbCbHbWRS=M3/2(RS)1/2G=bB\cdot bK\cdot bC\cdot bH\cdot bW\cdot RS=M^{3/2}(RS)^{1/2}, and the number of reads/writes is O(KCHWRSBM/(M3/2(RS)1/2))=O(KCHWB(RS/M)1/2)O(KCHWRSBM/(M^{3/2}(RS)^{1/2}))=O(KCHWB(RS/M)^{1/2}).

K(M/(RS))1/2K\geq(M/(RS))^{1/2} implies KCHWB(RS/M)1/2CHWBKCHWB(RS/M)^{1/2}\geq CHWB. C(M/(RS))1/2C\geq(M/(RS))^{1/2} implies KCHWB(RS/M)1/2KHWBKCHWB(RS/M)^{1/2}\geq KHWB. HWB(MRS)1/2HWB\geq(MRS)^{1/2} implies KCHWB(RS/M)1/2KCRSKCHWB(RS/M)^{1/2}\geq KCRS. RSMRS\leq M implies KCHWB(RS/M)1/2KCHWBRS/MKCHWB(RS/M)^{1/2}\geq KCHWBRS/M. Thus LB=KCHWB(RS/M)1/2LB=KCHWB(RS/M)^{1/2}. \Box

Upper Bound Case 2.2.2.2: Suppose min(CHWB,KCRS,KHWB)M\min(CHWB,KCRS,KHWB)\geq M, RSMRS\leq M, MRS(HWB)2MRS\leq(HWB)^{2} and min(C,K)(M/(RS))1/2\min(C,K)\leq(M/(RS))^{1/2}. Then the attainable communication lower bound is O(max(KHWB,CHWB))O(\max(KHWB,CHWB)).

Proof: Suppose w.l.o.g that CKC\leq K, so C(M/(RS))1/2C\leq(M/(RS))^{1/2} and KCRSMKCRS\geq M implies KM/(CRS)(M/(RS))1/2K\geq M/(CRS)\geq(M/(RS))^{1/2}. Let bC=CbC=C, bK=CbK=C, bR=RbR=R and bS=SbS=S. Let (lbB,δh,δw)=f3(lB,lHlS,lWlR,1lClRlS)(lbB,\delta_{h},\delta_{w})=f_{3}(lB,lH-lS,lW-lR,1-lC-lR-lS). This is well-defined because lHlS0lH-lS\geq 0 is equivalent to HSH\geq S, lWlR0lW-lR\geq 0 is equivalent to WRW\geq R, 1lClRlS01-lC-lR-lS\geq 0 is equivalent to M/(CRS)1M/(CRS)\geq 1, which is implied by C(M/RS)1/2M/RSC\leq(M/RS)^{1/2}\leq M/RS, and lB+lHlS+lWlR1lClRlSlB+lH-lS+lW-lR\geq 1-lC-lR-lS being equivalent to lC+lH+lW+lB1lC+lH+lW+lB\geq 1 or CHWBMCHWB\geq M. Now let lbH=lS+δhlHlbH=lS+\delta_{h}\leq lH, lbW=lR+δwlWlbW=lR+\delta_{w}\leq lW, bH=MlbHbH=M^{lbH}, bW=MlbWbW=M^{lbW} and bB=MlbBbB=M^{lbB}. Then bSbHHbS\leq bH\leq H, bRbWWbR\leq bW\leq W, and bHbWbB=M/CbH\cdot bW\cdot bB=M/C. Use BlockCNN(bbB,kbK,hbH,wbW,rbR,sbS,cbCb|bB,k|bK,h|bH,w|bW,r|bR,s|bS,c|bC). Admissibility follows from bKbHbWbB=bCbHbWbB=CM/C=MbK\cdot bH\cdot bW\cdot bB=bC\cdot bH\cdot bW\cdot bB=C\cdot M/C=M, and bKbCbRbS=C2RSMbK\cdot bC\cdot bR\cdot bS=C^{2}RS\leq M. Then G=bBbKbCbHbWbRbS=CCM/CRS=CRSMG=bB\cdot bK\cdot bC\cdot bH\cdot bW\cdot bR\cdot bS=C\cdot C\cdot M/C\cdot R\cdot S=CRSM, and the number of reads/writes is O(KCHWRSBM/(CRSM))=O(KWHB)O(KCHWRSBM/(CRSM))=O(KWHB).

KHWBCHWBKHWB\geq CHWB by assumption. HWB(MRS)1/2CRSHWB\geq(MRS)^{1/2}\geq CRS implies KHWBKCRSKHWB\geq KCRS. C(M/(RS))1/2M/(RS)C\leq(M/(RS))^{1/2}\leq M/(RS) implies KHWBKCHWB(RS/M)1/2KCHWBRS/MKHWB\geq KCHWB(RS/M)^{1/2}\geq KCHWBRS/M. Thus LB=KHWBLB=KHWB. \Box