Principal Component Analysis and Higher Correlations for Distributed Data

Ravindran Kannan, Santosh Vempala, David Woodruff

Introduction

In modern large-scale machine learning problems the input data is often distributed among many servers, while the communication as well as time and space resources per server are limited. We consider two well-studied problems: (1) Principal Component Analysis (PCA), and (2) Generalized Higher-order correlations. Both problems study correlations between vectors. For the first problem, the vectors correspond to the rows of a matrix and we are interested in second-order correlations, while in the second problem we are interested in higher-order correlations among the vectors.

PCA is a central tool in many learning algorithms. The goal of PCA is to find a low-dimensional subspace that captures as much of the variance of a dataset as possible. By projecting the rows of a matrix onto this lower-dimensional subspace, one preserves important properties of the input matrix, but can now run subsequent algorithms in the lower-dimensional space, resulting in significant computational and storage savings. In a distributed setting, by having each server first locally project his/her own data onto a low-dimensional subspace, this can also result in savings in communication. PCA is useful for a variety of downstream tasks, e.g., for clustering or shape-fitting problems () and latent semantic analysis.

The second problem we consider is the Generalized Higher Order Correlation Problem. For this problem we assume server tt has an nn-dimensional vector ata_{t} with non-negative entries. Note that for PCA, it is useful and more general to allow the entries to be positive, negative, or zero. On the other hand, the non-negativity assumption for Generalized Higher Order Correlations is justified both by the applications we give , as well as the fact that it is impossible to achieve low communication without this assumption, as described in more detail below.

A special case of this problem is the well-studied frequency moment problem. That is, if server tt holds the vector ata_{t}, with coordinates at1,at2,,atna_{t1},a_{t2},\ldots,a_{tn}, then the kk-th frequency moment of t=1sat\sum_{t=1}^{s}a_{t} is i=1n(t=1sati)k\sum_{i=1}^{n}(\sum_{t=1}^{s}a_{ti})^{k}, where, kk is a positive integer. This problem has been extensively studied in the data stream literature, starting with the work of . Known lower bounds for this problem from that literature rule out low communication algorithms when k>2k>2 in the distributed setting when the number of servers grows as a power of nn (), or when there are only two servers and the entries are allowed to be negative . Here we overcome these lower bounds for smaller ss and indeed will develop algorithms and lower bounds for estimating i=1nf(t=1sati)\sum_{i=1}^{n}f(\sum_{t=1}^{s}a_{ti}), for a general class of functions f:R+R+f:{\bf R}_{+}\rightarrow{\bf R}_{+}.

We then extend these results to the following more general problem: there is a collection of vectors that is partitioned into ss parts - W1,W2,,WsW_{1},W_{2},\ldots,W_{s} - and server tt holds WtW_{t}. For each tt and each iWti\in W_{t}, there is an nn-dimensional vector vi=(vi1,vi2,,vin)v_{i}=(v_{i1},v_{i2},\ldots,v_{in}) wholly residing on server tt. Let f:R+R+f:{\bf R}_{+}\rightarrow{\bf R}_{+} and g:R+kR+g:{\bf R}_{+}^{k}\rightarrow{\bf R}_{+} be functions. For a natural number kk, define the kk-th generalized moment M(f,g,k)M(f,g,k) as

There are many applications of higher-order correlations, and we only mention several here. For a document collection, we seek statistics (second, third and higher moments) of the number of documents in which each trigram (triples of terms) occurs. For a bipartite graph G(V1,V2,E)G(V_{1},V_{2},E) and constants (r,u)(r,u), we want to estimate the number of Kr,uK_{r,u} (complete bipartite graph) subgraphs. For a time series of many events, we want to estimate the number of tuples (E1,E2,,Er;t1,t2,,tu)(E_{1},E_{2},\ldots,E_{r};t_{1},t_{2},\ldots,t_{u}) for which each of the events E1,E2,,ErE_{1},E_{2},\ldots,E_{r} occurs at each of the times t1,t2,,tut_{1},t_{2},\ldots,t_{u}.

Conceptually, for each ii, we can think of a vector aia_{i} with (nk){n\choose k} components - one for each distinct tuple (j1,j2,,jk)(j_{1},j_{2},\ldots,j_{k}). Suppose ai;j1,j2,,jk=g(vi,j1,vi,j2,,vi,jk),a_{i;j_{1},j_{2},\ldots,j_{k}}=g(v_{i,j_{1}},v_{i,j_{2}},\ldots,v_{i,j_{k}}), and let at=iWtai.a_{t}=\sum_{i\in W_{t}}a_{i}. Our first theorem describes a way of estimating M(f,g,k)M(f,g,k) up to a (1+ε)(1+\varepsilon)-factor, where, each server uses polynomial time and polynomial space, but we try to optimize total communication while keeping the number of rounds constant. For this algorithm, server tt explicitly constructs the vector ata_{t} first, so it uses O(nkWt)O(n^{k}|W_{t}|) space. Thereafter the space is linear in the total size of all the ata_{t}. Our second theorem shows how to reduce space to linear in nn. This algorithm does not construct ata_{t} explicitly, but instead performs a rejection sampling procedure.

Before stating our theorems, we need some notation. Let cf,sc_{f,s} be the least positive real number such that

Note that for f(x)=xkf(x)=x^{k} (as in the kk-th frequency moment), cf,s=sk1c_{f,s}=s^{k-1}, since for any non-negative real numbers b1,b2,,bsb_{1},b_{2},\ldots,b_{s}, we have (b1+b2++bs)ksk1(b1k+b2k++bsk),(b_{1}+b_{2}+\cdots+b_{s})^{k}\leq s^{k-1}(b_{1}^{k}+b_{2}^{k}+\cdots+b_{s}^{k}), and taking bt=1b_{t}=1, we see that the factor sk1s^{k-1} cannot be improved.

The communication and computations are not assumed to be synchronous. We arbitrarily denote one of the ss servers as the Central Processor (CP). A round consists of the CP sending a message to each server and each server sending an arbitrary length message to the CP. A round is complete when the CP has received messages from all servers from which it is expecting a message in that round. All servers communicate only with the CP, which, up to a factor of two, is equivalent to the servers communicating directly with each other (provided they indicate in their message who the message is being sent to). For formal details of this model, we refer the reader to Section 3 of . Our algorithms take polynomial time, linear space and O(1)O(1) rounds of communication.

Our Results

Low-rank matrix approximation and approximate PCA.

Our first set of results is for low-rank approximation: given an n×dn\times d matrix AA, a positive integer kk and ε>0\varepsilon>0, find an n×dn\times d matrix BB of rank at most kk such that

Here, for a matrix AA, the Frobenius norm AF2||A||_{F}^{2} is the sum of squares of the entries of AA. A basis for the rowspace of BB provides an approximate kk-dimensional subspace to project the rows of AA onto, and so is a form of approximate PCA. We focus on the frequently occurring case when AA is rectangular, that is, ndn\gg d.

Consider the arbitrary partition model where an n×dn\times d matrix AtA_{t} resides in server tt and the data matrix A=A1+A2++AsA=A^{1}+A^{2}+\cdots+A^{s}. For any 1ε>01\geq\varepsilon>0, there is an algorithm that, on termination, leaves a n×dn\times d matrix CtC^{t} in server tt such that the matrix C=C1+C2++CsC=C^{1}+C^{2}+\cdots+C^{s} is of rank kk and with arbitrarily large constant probability achieves ACF(1+ε)minX:rank(X)kAXF,\|A-C\|_{F}\leq(1+\varepsilon)\min_{X:\text{rank}(X)\leq k}||A-X||_{F}, using linear space, polynomial time and with total communication complexity O(sdk/ε+sk2/ε4)O(sdk/\varepsilon+sk^{2}/\varepsilon^{4}) real numbers. Moreover, if the entries of each AtA_{t} are bb bits each, then the total communication is O(sdk/ε+sk2/ε4)O(sdk/\varepsilon+sk^{2}/\varepsilon^{4}) words each consisting of O(b+log(nd))O(b+\log(nd)) bits.

In contrast to the guarantees in Theorem 1.1, in the streaming model even with multiple passes, a simple encoding argument formalized in Theorem 4.14 of shows the problem requires Ω(n+d)\Omega(n+d) communication. We bypass this problem by allowing the ss different servers to locally output a matrix CtC_{t} so that tCt\sum_{t}C_{t} is a (1+ε)(1+\varepsilon)-approximation to the best rank-kk approximation. We are not aware of any previous algorithms with less than nn communication in the arbitrary partition model.

In the row-partition model, in which each row of AA is held by a unique server, there is an O(sdk/ε)O(sdk/\varepsilon) word upper bound due to . This is also achievable by the algorithms of . As the row-partition model is a special case of our model in which for each row of AA, there is a unique server with a non-zero vector on that row, our result implies their result up to the low order O(sk2/ε4)O(sk^{2}/\varepsilon^{4}) term, but in a stronger model. For example, consider the case in which a customer corresponds to a row of AA, and a column to his/her purchases of a specific item. These purchases could be distributed across multiple servers corresponding to different vendors. Or in the case of search data, each column could correspond to a search term of a user, and the searches may be distributed across multiple servers for storage and processing considerations. These examples are captured by the arbitrary partition model but not by the row partition model.

The technique for our upper bound is based on a two-stage adaptive sketching process, and has played an important role in several followup works, including CUR Matrix Factorizations of and subspace embeddings for the polynomial kernel by .

Frequency moments and higher-order correlations.

Our next set of results are for estimating higher moments and higher-order correlations of distributed data.

Let f:R+R+f:{\bf R}_{+}\rightarrow{\bf R}_{+} and cf,sc_{f,s} be as in (1). There are ss polynomial time, linear space bounded servers, where server tt holds a non-negative nn-vector at=(at1,at2,,atn)a_{t}=(a_{t1},a_{t2},\ldots,a_{tn}). We can estimate i=1nf(t=1sati)\sum_{i=1}^{n}f\left(\sum_{t=1}^{s}a_{ti}\right) up to a (1+ε)(1+\varepsilon) factor by an algorithm using O(s2cf,s/ε2)O(s^{2}c_{f,s}/\varepsilon^{2}) total words of communication (from all servers) in O(1)O(1) rounds. Moreover, any estimation up to a (1+ε)(1+\varepsilon) factor needs in the worst case Ω(cf,s/ε)\Omega(c_{f,s}/\varepsilon) bits of communication.

We remark that the lower bound applies to any function ff with parameter cf,sc_{f,s}, not a specific family of such functions.

Let f:R+R+f:{\bf R}_{+}\rightarrow{\bf R}_{+}, g:R+kR+g:{\bf R}_{+}^{k}\rightarrow{\bf R}_{+} be monotone functions with cf,sc_{f,s} as in (1). kO(1)k\in O(1) is a natural number and let M(f,g,k)M(f,g,k) be the generalized moment. We can approximate M(f,g,k)M(f,g,k) to relative error ε\varepsilon by an algorithm with communication at most O(s3cf,s/ε2)O(s^{3}c_{f,s}/\varepsilon^{2}) words in O(1)O(1) rounds. Further, we use polynomial time and linear space.

A key feature of this algorithm, and our following ones, is worth noting: they involve no dependence on nn or lnn\ln n, so they can be used when ata_{t} are implicitly specified and nn itself is very large, possibly infinite (provided, we can communicate each index ii). In the theorem below Ω\Omega is the set of coordinates of each vector. It is analogous to [n][n]. We use xΩ\sum_{x\in\Omega}, which when Ω\Omega is infinite and the probabilities are densities, should be replaced with an integral; our theorem is also valid for the case when we have integrals.

Let f:R+R+f:{\bf R}_{+}\rightarrow{\bf R}_{+}, g:R+kR+g:{\bf R}_{+}^{k}\rightarrow{\bf R}_{+} be monotone functions with cf,sc_{f,s} as in (1). Server tt is able to draw (in unit time) a sample xΩx\in\Omega according to a probability distribution hth_{t} on Ω\Omega. Also, server tt can estimate xΩf(ht(x))\sum_{x\in\Omega}f(h_{t}(x)). Then with O(s3cf,s/ε2)O(s^{3}c_{f,s}/\varepsilon^{2}) words of communication, CP can estimate xΩf(t=1sht(x))\sum_{x\in\Omega}f\left(\sum_{t=1}^{s}h_{t}(x)\right) to within relative error ε\varepsilon.

As a special case we consider the well-studied case of frequency moments. The best previous upper bound for the kk-th frequency moment problem in the distributed setting is by who gave an algorithm that achieves sk1(Clognε)O(k)s^{k-1}\left(\frac{C\log n}{\varepsilon}\right)^{O(k)} communication, so the complexity still depends, albeit mildly, on nn. Theorem 1.3 implies an algorithm with O(sk+1/ε2)O(s^{k+1}/\varepsilon^{2}) words of communication. We further improve this:

There are ss servers, with server tt holding a non-negative vector at=(at1,at2,,atn)a_{t}=(a_{t1},a_{t2},\ldots,a_{tn}).The vector ata_{t} need not be written down explicitly in server tt. it just has to have the ability to (i) find i=1natik\sum_{i=1}^{n}a_{ti}^{k} to relative error ε\varepsilon and draw a sample according to {atik/jatjk}\{a_{ti}^{k}/\sum_{j}a_{tj}^{k}\}. Then, to estimate A=i=1n(t=1sati)kA=\sum_{i=1}^{n}\left(\sum_{t=1}^{s}a_{ti}\right)^{k} to within relative error ε\varepsilon, there is an algorithm that communicates O((sk1+s3)(lns/ε)3)O((s^{k-1}+s^{3})(\ln s/\varepsilon)^{3}) words Each communicated word is either an index ii or a value atia_{ti}. in O(1)O(1) rounds.

Low-rank Approximation

For a matrix AA, define fk(A)f_{k}(A) as: fk(A)=minX:rank(X)kAXF.f_{k}(A)=\min_{X:\text{rank}(X)\leq k}||A-X||_{F}. Recall that the rank-kk approximation problem is the following: Given an n×dn\times d matrix AA, and ε>0\varepsilon>0, find an n×dn\times d matrix BB of rank at most kk such that ABF(1+ε)fk(A)||A-B||_{F}\leq(1+\varepsilon)\cdot f_{k}(A).

We are mainly concerned with communication, so we omit the tradeoffs of different compositions and just use a composition for which m=O(d/ε2)m=O(d/\varepsilon^{2}), PAPA is an m×dm\times d matrix of words each consisting of O(b+log(nd))O(b+\log(nd)) bits, and PP can be specified using O(dlogn)O(d\log n) bits (using a dd-wise independent hash function, as first shown in ), see Theorem 2.1 below. Since we will assume that bb is at least logn\log n, the O(dlogn)O(d\log n) bits to specify PP will be negligible, though we remark that the number of bits to specify PP can be further reduced using results of .

We will prove the following property about the top kk right singular vectors of PAPA for a subspace embedding PP.

We will combine this property with the following known property.

We can now state the algorithm, which we call AdaptiveCompress.

In AdaptiveCompress, the matrix PP is of size O(k/ε3)×nO(k/\varepsilon^{3})\times n.

(of Theorem 2.1.) Suppose PP is a subspace embedding for the column space of AA. Form an orthonormal basis of Rd{\bf R}^{d} using the right singular vectors of PAPA. Let v1,v2,,vdv_{1},v_{2},\ldots,v_{d} be the basis.

Also, suppose now u1,u2,,udu_{1},u_{2},\ldots,u_{d} is an orthonormal basis consisting of the singular vectors of AA. Then, we have

Thus,AAi=1kviviTF2(1+ε)4fk(A)2,\|A-A\sum_{i=1}^{k}v_{i}v_{i}^{T}\|_{F}^{2}\leq(1+\varepsilon)^{4}f_{k}(A)^{2}, as desired.

For the second part of the theorem, regarding the choice of PP, fix attention on one particular xRdx\in{\bf R}^{d}. We apply Theorem 2.2 of with A,BA,B of that theorem both set to AxAx of the current theorem and m=O(d/ε2)m=O(d/\varepsilon^{2}) in the notation of that theorem. This states that for m=O(d/ε2)m=O(d/\varepsilon^{2}), if PP is an m×nm\times n matrix with O(d)O(d)-wise independent entries uniform in {1/m,+1/m}\{-1/\sqrt{m},+1/\sqrt{m}\}, then for any fixed vector xx, PAx2=(1±ε)Ax2\|PAx\|_{2}=(1\pm\varepsilon)\|Ax\|_{2} with probability 1exp(d)1-\exp(-d). We combine this with Lemma 4 in Appendix A of , based on , to conclude that for all vectors xx, PAx2=(1±ε)Ax2\|PAx\|_{2}=(1\pm\varepsilon)\|Ax\|_{2} with probability 1exp(d)1-\exp(-d) (for a different constant in the exp()\exp() function).

Proof of Theorem 1.1: By definition of the AdaptiveCompress protocol, we have AC=AAUVVTUT,\|A-C\|=\|A-AUVV^{T}U^{T}\|, where all norms in this proof are the Frobenius norm.

Notice that UUTUU^{T} and IdUUTI_{d}-UU^{T} are projections onto orthogonal subspaces, where IdI_{d} is the d×dd\times d identity matrix. It follows by the Pythagorean theorem applied to each row that

where the second equality uses that UTU=IcU^{T}U=I_{c}, where cc is the number of columns of UU.

Observe that the row spaces of AUVVTUTAUVV^{T}U^{T} and AUUTAUU^{T} are both in the row space of UTU^{T}, and therefore in the column space of UU. It follows that since UU has orthonormal columns, AUVVTUTAUUT=(AUVVTUTAUUT)U,\|AUVV^{T}U^{T}-AUU^{T}\|=\|(AUVV^{T}U^{T}-AUU^{T})U\|, and therefore

where the second equality uses that UTU=IcU^{T}U=I_{c}. Let (AU)k(AU)_{k} be the best rank-kk approximation to the matrix AUAU. By Theorem 2.1, with probability 1o(1)1-o(1), AUVVTAU2(1+O(ε))(AU)kAU2,\|AUVV^{T}-AU\|^{2}\leq(1+O(\varepsilon))\|(AU)_{k}-AU\|^{2}, and so

Notice that the row space of (AU)k(AU)_{k} is spanned by the top kk right singular vectors of AUAU, which are in the row space of UU. Let us write (AU)k=BU(AU)_{k}=B\cdot U, where BB is a rank-kk matrix.

We apply the Pythagorean theorem to each row in the expression in (5), noting that the vectors (BiAi)UUT(B_{i}-A_{i})UU^{T} and AiUUTAiA_{i}UU^{T}-A_{i} are orthogonal, where BiB_{i} and AiA_{i} are the ii-th rows of BB and AA, respectively. Hence,

and the last equality uses the definition of BB. By Theorem 2.2, with constant probability arbitrarily close to 11, we have

It follows by combining (2), (3), (4), (5), (6), (7), that AUVVTUTA2(1+O(ε))AkA2\|AUVV^{T}U^{T}-A\|^{2}\leq(1+O(\varepsilon))\|A_{k}-A\|^{2}, which shows the correctness property of AdaptiveCompress.

We now bound the communication. In the first step, by Theorem 2.2, mm can be set to O(k/ε)O(k/\varepsilon) and the matrix SS can be described using a random seed that is O(k)O(k)-wise independent. The communication of steps 1-3 is thus O(sdk/ε)O(sdk/\varepsilon) words. By Theorem 2.1, the remaining steps take O(s(k/ε)2/ε2)=O(sk2/ε4)O(s(k/\varepsilon)^{2}/\varepsilon^{2})=O(sk^{2}/\varepsilon^{4}) words of communication.

To obtain communication with O(b+log(nd))O(b+\log(nd))-bit words if the entries of the matrices AtA_{t} are specified by bb bits, Server 1 can instead send SASA to each of the servers. The tt-th server then computes PAt(SA)TPA_{t}(SA)^{T} and sends this to Server 1. Let SA=RUTSA=RU^{T}, where UTU^{T} is an orthonormal basis for the row space of SASA, and RR is an O(k/ε)×O(k/ε)O(k/\varepsilon)\times O(k/\varepsilon) change of basis matrix. Server 1 computes tPAt(SA)T=PA(SA)T\sum_{t}PA_{t}(SA)^{T}=PA(SA)^{T} and sends this to each of the servers. Then, since each of the servers knows RR, it can compute PA(SA)T(RT)1=PAUPA(SA)^{T}(R^{T})^{-1}=PAU. It can then compute the SVD of this matrix, from which it obtains VVTVV^{T}, the projection onto its top kk right singular vectors. Then, since Server tt knows AtA_{t} and UU, it can compute AtU(VVT)UTA_{t}U(VV^{T})U^{T}, as desired. Notice that in this variant of the algorithm what is sent is SAtSA_{t} and PAt(SA)TPA_{t}(SA)^{T}, which each can be specified with O(b+log(nd))O(b+\log(nd))-bit words if the entries of the AtA_{t} are specified by bb bits.

2 Lower bound for low-rank approximation

Our reduction is from the multiplayer SUM problem.

() Suppose each of ss players has a binary vector aia^{i} with nn bits and the first player wants to compute i=1sai\sum_{i=1}^{s}a^{i} mod 22 with constant probability. Then the total communication needed is Ω(sn)\Omega(sn) bits.

(of Theorem 1.2.) We reduce from the s2s-2 player SUM problem, in which each player has a k×dk\times d binary matrix AiA^{i} and the first player wants to learn their sum. By Theorem 2.3, this problem needs Ω(skd)\Omega(skd) communication, since even the mod 22 version of the problem requires this amount of communication. Now consider the ss-player problem ss-RESTRICT-SUM in which the first player has IdI_{d}, the second player has Id-I_{d}, the remaining s2s-2 players have a k×dk\times d binary matrix AiA^{i} and the first player wants to learn the sum of all inputs. This also requires Ω(skd)\Omega(skd) communication. This follows since if this problem could be solved with o(skd)o(skd) communication, then SUM with s2s-2 players would have o(skd)o(skd) communication by a simulation in which the first player of the (s2)(s-2)-SUM problem simulates the first three players of the ss-RESTRICT-SUM problem.

In our ss-player low-rank approximation problem, we give the first player IdI_{d}, the second player Id-I_{d}, and the remaining s2s-2 players each has a random k×dk\times d binary matrix AiA^{i}. Note that there is a unique rank-kk approximation to the sum of the kk player inputs, namely, it is the matrix i=3sAi\sum_{i=3}^{s}A^{i}. It follows that any algorithm which outputs a projection matrix VVTVV^{T} for which AAVVTF2(1+ε)minX: rank(X)k\|A-AVV^{T}\|_{F}^{2}\leq(1+\varepsilon)\min_{X:\textrm{ rank}(X)\leq k}, for any ε0\varepsilon\geq 0, must be such that VVTVV^{T} is a projection onto the row space of i=3sAi\sum_{i=3}^{s}A^{i}. This follows because (1+ε)minX: rank(X)kAXF=(1+ε)0=0(1+\varepsilon)\min_{X:\textrm{ rank}(X)\leq k}\|A-X\|_{F}=(1+\varepsilon)\cdot 0=0.

Now, since the first player has IdI_{d}, his output is IdVVTI_{d}VV^{T}, where the row space of VTV^{T} equals the row space of i=3sAi\sum_{i=3}^{s}A^{i}. Suppose the total communication of our problem is CC.

We use this to build a protocol for ss-RESTRICT-SUM, which has the same inputs as in our ss-player low rank approximation problem. Notice that A=i=3sAiA=\sum_{i=3}^{s}A^{i} is a k×dk\times d matrix with rows in {0,1,2,...,s2}d\{0,1,2,...,s-2\}^{d}.

Claim. The span of the rows of AA can intersect {0,1,,s2}d\{0,1,\ldots,s-2\}^{d} in at most (2s)k(2s)^{k} distinct points.

Let \mboxrowspace(A){\mbox{r}owspace}(A) denote the row space of AA. We will bound the size of \mboxrowspace(A)GF(p)d{\mbox{r}owspace}(A)\cap GF(p)^{d} for prime pp with s2<p<2(s2)s-2<p<2(s-2), where GF(p)GF(p) is the finite field with elements {0,1,2,,p1}\{0,1,2,\ldots,p-1\}, and GF(p)dGF(p)^{d} is the vector space over GF(p)GF(p). This will be an upper bound on the size of \mboxrowspace(A){0,1,,s2}d{\mbox{r}owspace}(A)\cap\{0,1,\ldots,s-2\}^{d}. Since \mboxrowspace(A){\mbox{r}owspace}(A) is kk-dimensional, so is \mboxrowspace(A)GF(p)d{\mbox{r}owspace}(A)\cap GF(p)^{d}. Hence the intersection has at most kk linearly independent points. These kk linearly independent points can be used to generate the remaining points in \mboxrowspace(A)GF(p)d{\mbox{r}owspace}(A)\cap GF(p)^{d}. The number of distinct combinations of these points is at most pk<(2s)kp^{k}<(2s)^{k}, bounding the intersection size.

Next, players 3,,s3,\ldots,s agree on random {+1,1}d\{+1,-1\}^{d} vectors u1,uku^{1},\ldots u^{k^{\prime}} where k=klog4sk^{\prime}=k\log 4s via a public coin. The entries of u1,...,uku^{1},...,u^{k^{\prime}} need only be O(klogs)O(k\log s)-wise independent, and as such can be agreed upon by all the players using only O(sklogs)O(sk\log s) bits of communication. Each player ii then computes the inner products Ajiu1A^{i}_{j}\cdot u^{1}, …, AjiukA^{i}_{j}\cdot u^{k^{\prime}} for each j{1,,k}j\in\{1,\ldots,k\}. Here AjiA^{i}_{j} denotes the jj’th row of a the ii’th player’s matrix AiA^{i}.

Frequency Moments and Higher Order Correlations

In this section, we prove Theorems (1.3), (1.4) and (1.6).

We begin with some common notation. For t[s],i[n]t\in[s],i\in[n]:

Let B=iBi=tCt  ;  A=iAiB=\sum_{i}B_{i}=\sum_{t}C_{t}\;;\;A=\sum_{i}A_{i}. The task is to estimate AA. We analyze the following algorithm. Let l=100scf,sε2l=100\frac{s\cdot c_{f,s}}{\varepsilon^{2}}. The parameters in the algorithm will be specified presently.

To analyze the algorithm, we think of it differently: suppose CP picks tt for the first of its ll trials and asks that tt to pick ii according to its {f(ati/Ct}\{f(a_{ti}/C_{t}\}. Let XX be the random variable BAi/BiBA_{i}/B_{i} for that ii. Clearly the estimate made by the algorithm can be viewed as the average of ll i.i.d. copies of XX. So it will suffice to show that (i) XX is unbiased : I.e., E(X)=AE(X)=A and (ii) Var(X)cf,ssA2(X)\leq c_{f,s}sA^{2} (whence, the variance of the average of ll i.i.d. copies of XX would have variance at most ε2A2\varepsilon^{2}A^{2} giving us the relative error bound.)

The first part is easy: Let pip_{i} be the probability that we pick ii by this process. Clearly, p_{i}=\sum_{t=1}^{s}{\sf Prob}(\text{ CP pickedt}){\sf Prob}(t\text{ picks }i)=\sum_{t}\frac{C_{t}}{B}\frac{f(a_{ti})}{C_{t}}=\frac{B_{i}}{B}. So, E(X)=i=1nBAiBiBiB=A,E(X)=\sum_{i=1}^{n}B\frac{A_{i}}{B_{i}}\frac{B_{i}}{B}=A, proving (i). For (ii), we have E(X2)=B2ipiAi2Bi2=BiAi2BiABcf,scf,ssA2,E(X^{2})=B^{2}\sum_{i}p_{i}\frac{A_{i}^{2}}{B_{i}^{2}}=B\sum_{i}\frac{A_{i}^{2}}{B_{i}}\leq ABc_{f,s}\leq c_{f,s}sA^{2}, since, Ai=f(tati)cf,st=1sf(ati)A_{i}=f(\sum_{t}a_{ti})\leq c_{f,s}\sum_{t=1}^{s}f(a_{ti}) by the definition of cf,sc_{f,s} and by monotonicity of ff, we have Bi=tf(ati)sf(tati)B_{i}=\sum_{t}f(a_{ti})\leq sf(\sum_{t}a_{ti}).

To prove the claimed resource bounds, note that polynomial time and linear space bounds are obvious, since, all that each server has to do is to compute all f(ati)f(a_{ti}), sum them up and sample at most ll times. The communication is dominated by each of ss servers sending {ati,iS}\{a_{ti},i\in S\} to CP which is scf,s/ε2sc_{f,s}/\varepsilon^{2} words per server giving us a total of O(s2cf,s/ε2)O(s^{2}c_{f,s}/\varepsilon^{2}).

Now for the lower bound, we use (rather unsurprisingly) the set-disjointness problem. It is known ( ) that the following problem needs Ω(n)\Omega(n) bits of communication even for a randomized algorithm: we distinguish between two situations: (a) Each of ss servers holds a subset of [n][n] and the subsets are pairwise disjoint and (b) There is exactly one element common to all ss sets. We reduce this problem to ours. Let StS_{t} be the subset held by server tt. By definition of cf,sc_{f,s}, there exist x1,x2,,xsR+x_{1},x_{2},\ldots,x_{s}\in{\bf R}_{+} such that f(x1+x2++xs)=cf,s(f(x1)+f(x2)++f(xs)).f(x_{1}+x_{2}+\cdots+x_{s})=c_{f,s}(f(x_{1})+f(x_{2})+\cdots+f(x_{s})). Let n=cf,s1ε.n=\frac{c_{f,s}-1}{\varepsilon}. Let atia_{ti} be defined by: ati=xt if iSta_{ti}=x_{t}\text{ if }i\in S_{t} and ati=0a_{ti}=0 otherwise. If the sets are disjoint, then i=1nf(t=1sati)=t=1sStf(xt).\sum_{i=1}^{n}f\left(\sum_{t=1}^{s}a_{ti}\right)=\sum_{t=1}^{s}|S_{t}|f(x_{t}). In the case (b) when the sets all share one element in common, i=1nf(t=1sati)=t=1s(St1)f(xt)+f(x1+x2++xs)=t=1sStf(xt)+(cf,s1)tf(xt)=t=1sStf(xt)+εntf(xt).\sum_{i=1}^{n}f\left(\sum_{t=1}^{s}a_{ti}\right)=\sum_{t=1}^{s}(|S_{t}|-1)f(x_{t})+f(x_{1}+x_{2}+\cdots+x_{s})=\sum_{t=1}^{s}|S_{t}|f(x_{t})+(c_{f,s}-1)\sum_{t}f(x_{t})=\sum_{t=1}^{s}|S_{t}|f(x_{t})+\varepsilon n\sum_{t}f(x_{t}). Since Stn|S_{t}|\leq n, it follows that if we can estimate if(tati)\sum_{i}f(\sum_{t}a_{ti}) to relative error ε\varepsilon, then we can distinguish the two cases. But it is known that this requires Ω(n)\Omega(n) bits of communication which is Ω(cf,s/ε)\Omega(c_{f,s}/\varepsilon) proving the lower bound.

(of Theorem (1.4): The only change is in the sampling algorithm:

Order the j=(j1,j2,,jk)j=(j_{1},j_{2},\ldots,j_{k}) lexicographically. Start with the first jj as the sample and compute atja_{tj} by making a pass through the entire data: For each iWti\in W_{t}, after vi,j1,vi,j2,,vi,jkv_{i,j_{1}},v_{i,j_{2}},\ldots,v_{i,j_{k}} are read, compute g(vi,j1,vi,j2,,vi,jk)g(v_{i,j_{1}},v_{i,j_{2}},\ldots,v_{i,j_{k}}) and sum over all iWti\in W_{t}.

Process the next jj similarly. After processing a jj, say, j=j0j=j_{0}, compute f(atj0)f(a_{tj_{0}}) and keep a running total of f(atj)f(a_{tj}) for all jj seen so far. Reject the old sample and replace it by the current j0j_{0} with probability f(atj0) Total of all f(atj) including j0.\frac{f(a_{tj_{0}})}{\text{ Total of all }f(a_{tj})\text{ including }j_{0}}.

If the old sampled jj is not rejected, just keep it as the sample and go to next jj.

The proof of correctness and linear space bound follow straightforwardly by plugging in this sampling algorithm into Theorem (1.3).

We next turn to a more refined algorithm for estimating frequency moments with near-optimal communication, using the specific function f(x)=xkf(x)=x^{k}. Here is the algorithm.

(of Theorem (1.6): Let Bi=t=1satikB_{i}=\sum_{t=1}^{s}a_{ti}^{k} and Ai=(t=1sati)kA_{i}=\left(\sum_{t=1}^{s}a_{ti}\right)^{k} and ρi=Ai/Bi\rho_{i}=A_{i}/B_{i}. Note: 1ρisk11\leq\rho_{i}\leq s^{k-1}. CP can arrange to pick i[n]i\in[n] with probabilities {Bi/B}\{B_{i}/B\}, where, B=iBiB=\sum_{i}B_{i} as we already saw. First pick m=sk2/ε3m=s^{k-2}/\varepsilon^{3} sample i[n]i\in[n] according to {Bi/B}\{B_{i}/B\}. Then, CP tells all servers all these ii and collects all atia_{ti} and thence all Ai,BiA_{i},B_{i}. Total communication is at most mssk1/ε3ms\leq s^{k-1}/\varepsilon^{3}.

Let ρi=Ai/Bi\rho_{i}=A_{i}/B_{i}. Let β\beta range over {sk1,eεsk1,e2εsk1,1}\{s^{k-1},e^{-\varepsilon}s^{k-1},e^{-2\varepsilon}s^{k-1},\ldots 1\}, a total of O(lns)O(\ln s) values and let Sβ={iS:ρi[βeε,β)}.S_{\beta}=\{i\in S:\rho_{i}\in[\beta e^{-\varepsilon},\beta)\}. Then, iSρiβSββ.\sum_{i\in S}\rho_{i}\approx\sum_{\beta}|S_{\beta}|\beta. Since each ρi1\rho_{i}\geq 1, we have iSρiS\sum_{i\in S}\rho_{i}\geq|S|. So we need only accurate estimates of those Sβ|S_{\beta}| with SβεS/βlns|S_{\beta}|\geq\varepsilon|S|/\beta\ln s. For each SβS_{\beta}, if we pick u.a.r. a subset TT of SS of cardinality Ω(β(lns)2/ε3)\Omega(\beta(\ln s)^{2}/\varepsilon^{3}), then STTSβ\frac{|S|}{|T|}|T\cap S_{\beta}| estimates Sβ|S_{\beta}| to within (1±ε)(1\pm\varepsilon) for every β\beta satisfying SβεS/βlns|S_{\beta}|\geq\varepsilon|S|/\beta\ln s.

For each β\beta, pick such a random subset TT from SS. We have to recognize for each iTi\in T, whether it is in SβS_{\beta}. First, for each iTi\in T, we estimate AiA_{i} as follows: We pick l=sk1/βl=s^{k-1}/\beta servers t1,t2,,tlt_{1},t_{2},\ldots,t_{l} u.a.r. and take Zi=sl(at1,i+at2,i+atl,i)Z_{i}=\frac{s}{l}(a_{t_{1},i}+a_{t_{2},i}+\cdots a_{t_{l},i}) as our estimate of Ai1/k=t=1satiA_{i}^{1/k}=\sum_{t=1}^{s}a_{ti}. If YY is the r.v. based on just one random server (namely Y=satiY=sa_{ti} for a u.a.r tt), then EY=Ai1/k{\sf E}Y=A_{i}^{1/k} and

From this it follows by averaging over ll samples that

We do Ω(klns+ln(1/ε))\Omega(k\ln s+\ln(1/\varepsilon)) such experiments and take the median of all of these to drive down the failure probability for a single ii to less than ε2/sk\varepsilon^{2}/s^{k}, whence, it is small by union bound for the failure of any of the at most sk1/ε2s^{k-1}/\varepsilon^{2} indices ii ’s. Thus all the Ai1/k,iTA_{i}^{1/k},i\in T are estimated to a factor of (1+ϵ)(1+\epsilon) by this process whp. Since kk is a fixed constant, this also means (1+O(ε))(1+O(\varepsilon)) relative error in the estimate of AiA_{i}.

The subset that intersects TT is then {iT:ρiβ/s}20TSs2lnsSβ=O(s2ln3s/ε3)\{i\in T:\rho_{i}\geq\beta/s\}|\leq 20\frac{|T|}{|S|}\frac{s^{2}\ln s|S|}{\beta}=O(s^{2}\ln^{3}s/\varepsilon^{3}). Now for these ii ’s in TT, we collect all atia_{ti} and find Ai,BiA_{i},B_{i} exactly. This costs us O(s3(lns)3/ε2)O(s^{3}(\ln s)^{3}/\varepsilon^{2}) communication. Thus the overall communication is bounded by

We now extend the above theorem and proof for a wide class of functions satisfying a weak Lipschitz condition (and generalizing the case of moments).

For a monotone function f:++f:\Re_{+}\rightarrow\Re_{+}, define

Alternatively, LfL_{f} is the Lipschitz constant of ff wrt the “distance” d(x,y)=log(x)log(y)d(x,y)=\log(x)-\log(y), i.e.,

For the function f(x)=xkf(x)=x^{k}, we see that Lf=kL_{f}=k.

For any function f:+R+f:\Re_{+}\rightarrow R_{+} with L=LfL=L_{f},

Let ff be any nonnegative, superlinear real function with L=Lf4L=L_{f}\geq 4. Suppose there are ss servers, with server tt holding a non-negative vector at=(at1,at2,,atn)a_{t}=(a_{t1},a_{t2},\ldots,a_{tn}) Then, to estimate A=i=1nf(t=1sati)A=\sum_{i=1}^{n}f\left(\sum_{t=1}^{s}a_{ti}\right) to relative error ε\varepsilon, there is an algorithm that communicates O(sL1(lns)3/ε3)O(s^{L-1}(\ln s)^{3}/\varepsilon^{3}) words in O(1)O(1) rounds.

The algorithm is the following, essentially the same as in the moments case, with parameters defined in terms of LfL_{f} for a general function f(.)f(.) in place of xkx^{k}.

Pick an i.i.d. sample S0S_{0} of m=sL2/ε3m=s^{L-2}/\varepsilon^{3} indices ii, where each ii is picked according to {Bi/B}\{B_{i}/B\}.

CP now gets an i.i.d. sample SS of O(sL1ln2s/ε3)O(s^{L-1}\ln^{2}s/\varepsilon^{3}) ii ’s, each according to {Bi/B}\{B_{i}/B\}.

For each β{sL1,eεsL1,e2εsL1,,1}\beta\in\{s^{L-1},e^{-\varepsilon}s^{L-1},e^{-2\varepsilon}s^{L-1},\ldots,1\}, CP does the following:

Pick a subset TT of SS of cardinality Ω(β(lns)2/ε3)\Omega(\beta(\ln s)^{2}/\varepsilon^{3}) u.a.r.

(of Thm. 3.3.) We point out the changes in the analysis from the special case of moments. Now we have Ai=f(t=1sati),Bi=t=1sf(ati)A_{i}=f(\sum_{t=1}^{s}a_{ti}),B_{i}=\sum_{t=1}^{s}f(a_{ti}) and A=i=1nAi,B=i=1nBiA=\sum_{i=1}^{n}A_{i},B=\sum_{i=1}^{n}B_{i}. Also ρi=Ai/Bi\rho_{i}=A_{i}/B_{i}. The reader will have noticed that kk has been replaced by LL in the above algorithm. The first phase remains the same, and at the end we either have a good approximation for AA or we know that AsBA\leq sB.

In the next phase we estimate f(tati)1/Lf(\sum_{t}a_{ti})^{1/L}. To do this, we first estimate t=1sati\sum_{t=1}^{s}a_{ti}, then apply ff to this estimate. We need to analyze the error of both parts. For the first part, let Y=satiY=sa_{ti} as before. Then E(Y)=t=1sati{\sf E}(Y)=\sum_{t=1}^{s}a_{ti} and since the server used to define YY is chosen uniformly at random, we have

The last phase for estimating BiB_{i} and putting together the estimates for all the ρi\rho_{i} is again the same as in the case of moments.

Acknowledgements. We are grateful to Dick Karp and Jelani Nelson for helpful discussions, as well as the Simons Institute at Berkeley. Santosh Vempala was supported in part by NSF award CCF-1217793 and David Woodruff by the XDATA program of the Defense Advanced Research Projects Agency (DARPA), administered through Air Force Research Laboratory contract FA8750-12-C0323.

References