Computational barriers in minimax submatrix detection

Zongming Ma, Yihong Wu

Introduction

Statistical inference of structured large matrices lies at the heart of many applications involving massive datasets, such as matrix completion, functional genomics, community detection and clustering; see, for instance, and the references therein. Many of these detection and estimation problems have been investigated from a decision theoretic viewpoint, where one first establishes a minimax lower bound for any test or estimator and then constructs a specific procedure which attains the lower bound within a constant or logarithmic factor.

An important element absent from the foregoing decision theoreticparadigm is computational complexity. This aspect is especially relevant in the context of high-dimensional statistical inference, where computationally efficient procedures (e.g., convex programming, iterative algorithms, etc.) are highly desirable. However, it has been empirically observed in several basic detection and estimation problems that popular low-complexity algorithms fail to attain the minimax rates; see, for example, . This invites the following question: how much do we need to back off from the statistical optimality due to computational complexity constraints? In this paper, we revisit the sparse submatrix detection problem that has been studied in , where the goal is to detect a small submatrix with elevated mean in a large noisy matrix. Motivations for this detection problem include biclustering for analyzing microarray data and community detection in social networks , etc.

In this problem, the key parameters are the matrix dimension pp, the block size kk and the signal magnitude λ\lambda. Clearly, it is easier to detect the submatrix if either kk or λ\lambda increases. Throughout the paper, we focus on the asymptotic setting where pp tends to infinity and both k=k(p)k=k(p) and λ=λ(p)\lambda=\lambda(p) are functions of pp, though we typically drop the explicit dependence on pp for conciseness.

The optimal total probability of error is denoted by

the necessary and sufficient condition for reliably detecting the submatrix has been characterized by Butucea and Ingster (, Theorems 2.1 and 2.2): E0{\mathcal{E}}^{*}\to 0 if

and, conversely, E1{\mathcal{E}}^{*}\to 1 if

From this point forward, we say reliable detection is statistically possible if E0{\mathcal{E}}^{*}\to 0 and a sequence of tests {ϕp}\{\phi_{p}\} reliably detects the submatrix if E(ϕp)0{\mathcal{E}}(\phi_{p})\to 0.

To reliably detect the submatrix under condition (6), Butucea and Ingster proposed a test involving enumerating all k×kk\times k submatrices of XX, which is asymptotically optimal but computationally intensive. It is unclear from first principles whether statistically optimal detection can be achieved using computationally efficient procedures. Thus an intriguing question is in order: under the optimal condition (6) so that E0{\mathcal{E}}^{*}\to 0, is there a sequence of computationally efficient tests {ϕp}\{\phi_{p}\} such that E(ϕp)0{\mathcal{E}}(\phi_{p})\to 0?

2 The penalty incurred by complexity constraints

To approach the computational hardness of the submatrix detection problem rigorously, we need to investigate the computational cost of testing procedures in a complexity theoretic sense. However, an immediate hurdle for the Gaussian model (1) is that computational complexity is not well defined for all tests dealing with nondiscrete distributions since the observation cannot be represented by finitely many bits almost surely. To propose a paradigm for complexity-constrained hypotheses testing, we consider a sequence of discretized Gaussian models which is asymptotically equivalent to the original model in the sense of Le Cam and hence preserves the statistical difficulty of the problem. More importantly, the computational complexity of tests on the discretized model can be appropriately defined. See Section 3 for details.

Next, we take the standard reduction approach in complexity theory: we show that if the signal magnitude is smaller than a certain threshold, detecting the submatrix is computationally no easier than certain well-known intractable problems. In other words, if an efficient method existed for submatrix detection, it would lead to an efficient solution to this problem. In this paper, we use the planted clique problem as the benchmark, which deals with detecting whether a given instance of an Erdős–Rényi random graph of size NN contains a planted clique of size κ\kappa. It is widely believed that the detection problem cannot be solved in randomized polynomial time when κ=o(N)\kappa=o(\sqrt{N}), which we shall refer to as the planted clique hypothesis. For the precise statement and further discussions, see Definition 1 and Hypothesis 1 in Section 4.

Assuming the planted clique hypothesis, our main finding (Theorem 2 in Section 4) characterizes when it is possible to achieve reliable detection using computationally efficient procedures and when it is impossible. The core of the arguments lies in a randomized polynomial-time reduction scheme which maps the N×NN\times N adjacency matrix of the random graph in the planted clique problem to a p×pp\times p random matrix in polynomial time. It is worth noting that when kpαk\geq p^{\alpha} for some α12\alpha\geq\frac{1}{2}, the cardinality of the graph NN is not equal to the size of the matrix pp but rather chosen to be p1+δp^{1+\delta} (omitting log factor), where δ>0\delta>0 depends on α\alpha. On the other hand, κ\kappa can always be chosen as a constant multiple of kk.

Our main result can be illustrated by focusing on the following asymptotic regime, where the submatrix size grows according to k=Θ(pα)k=\Theta(p^{\alpha}), and the signal magnitude decays as λ=Θ(pβ)\lambda=\Theta(p^{-\beta}) for fixed constants α(0,1)\alpha\in(0,1) and β\beta\inThe regime of β>1\beta>1 is not interesting since the hypotheses are indistinguishable even if the submatrix becomes the whole matrix (k=pk=p). as pp\to\infty. For any two numbers aa and bb, let ab=min(a,b)a\wedge b=\min(a,b) and ab=max(a,b)a\vee b=\max(a,b). Define

The statistical and computational feasibility of the submatrix detection problem is demonstrated in Figure 1, where the (α,β)(\alpha,\beta)-plane is divided into three regions: {longlist}[(2)]

β>β\beta>\beta^{*} (top region): reliable detection of the submatrix is statistically impossible because the signal is too weak.

β<β\beta<\beta^{\sharp} (right triangular region): reliable submatrix detection is achievable by computationally efficient tests.

β<β<β\beta^{\sharp}<\beta<\beta^{*} (lower left triangular region): reliable detection is statistically possible but computationally intractable, in the sense that it is at least as hard as solving the planted clique problem of a particular configuration, which is intractable under the planted clique hypothesis. Therefore, the tractability of the submatrix detection problem undergoes a sharp transition: in the moderately sparse regime where α(2/3,1)\alpha\in({2}/{3},1), computational constraints incur no penalty on the statistical performance. In contrast, in the highly sparse regime where α(0,2/3)\alpha\in(0,{2}/{3}), achieving the statistical optimal boundary requires computational resources that are powerful enough to solve the planted clique problem, and, consequently, computationally efficient procedures require significantly higher signal-to-noise ratio to detect the submatrix.

The complexity-theoretic limits for the submatrix detection problem also lead to interesting findings for the related support recovery problem when the signal submatrix is present . Moreover, it also sheds light on the statistical and computational tradeoff in the problem of estimating block sparse matrices . In particular, we show the surprising result that the hardness of minimax estimation can crucially depend on the loss function, in the sense that attaining the minimax estimation rate can be computationally easy for one type of loss functions but hard for the other.

3 Related works

Despite the vast body of literature on developing computationally efficient procedures with optimal statistical performance for problems such as compressed sensing, rigorous results on inferential limits of statistical problems under computational complexity constraints are comparatively limited. A representative work is the investigation of the complexity of detecting sparse principal components by Berthet and Rigollet , which is one of the motivations of the present paper. Sparse principal component detection refers to the problem of testing N(0,Ip)N(0,I_{p}) against N(0,Ip+avv)N(0,I_{p}+avv^{\prime}) for kk-sparse unit vector vv based on nn i.i.d. observations . Since the model is not discrete, as previously mentioned, the difficulty of ill-defined computational complexity is also present. In , the authors relaxed the Gaussian detection problem to a composite testing problem that includes discrete distributions, where the empirical projection variances of the null and alternative hypotheses satisfy respective uniform χ2\chi^{2}-tail type concentration inequalities. In the regime of pδkpαp^{\delta}\leq k\leq p^{\alpha} for some absolute constants 0<δα<120<\delta\leq\alpha<\frac{1}{2} and npn\leq p, they showed that the computable detection rate for the deviation of the largest principal component from the rest of the spectrum is no smaller than kbn\sqrt{\frac{k^{b}}{n}} for any b<2b<2, which far exceeds the minimax detection rate of knlogp\sqrt{\frac{k}{n}\log p}.

Although both the authors of current paper and Berthet and Rigollet use the planted clique hypothesis for studying complexity theoretic lower bounds, there are a few important differences. First, Berthet and Rigollet extend the original “simple vs. composite” Gaussian sparse principal component detection problem into a “composite versus composite” testing problem, and the data no longer needs to be Gaussian. As a consequence, more distributions are included in both the null and alternative hypotheses, and thus constructing the reduction scheme becomes easier than for the original Gaussian hypotheses. In contrast, the current paper considers an asymptotically equivalent discretized model which is faithful to the original Gaussian submatrix detection problem in . Second, the computational lower bounds in are established only when the sparsity level satisfies pδkpαp^{\delta}\leq k\leq p^{\alpha} for 0<δα<120<\delta\leq\alpha<\frac{1}{2}. In comparison, due to a new reduction scheme, the current paper provides a more complete characterization of the computational limits for all kpδk\geq p^{\delta} and any δ>0\delta>0. Last but not least, we propose an asymptotic equivalence framework in the sense of Le Cam, which preserves the statistical nature of the problem and, at the same time, allows rigorous statements of computational complexity of testing procedures. The approach via asymptotically equivalent discretized experiments is potentially useful in future works dealing with nondiscrete distributions.

In addition, some researchers have studied the minimax sub-optimality of certain computationally efficient methodologies, such as those based on convex relaxations, in an array of problems including estimating sparse eigenvectors , support recovery for sparse submatrices , combinatorial testing , community detection , etc. In some of the papers, the authors also conjecture that the minimax rate optimality cannot be achieved by any computationally efficient algorithms. From a different viewpoint, Chandrasekaran and Jordan consider the tradeoff between computation and statistical performance within a specific family of algorithms parameterized by the level of convex relaxations in the classical normal mean estimation problem. In contrast, the goal of the present paper is to investigate the impact of complexity constraint on any statistical procedure for the submatrix detection problem.

4 Organization of the paper

The rest of the paper is organized as follows. In Section 2, we study test statistics for submatrix detection under Gaussian models. To incorporate computational complexity into the decision theoretic problem, we introduce in Section 3 a sequence of asymptotically equivalent discretized models and show that the minimax detection results (6)–(7) remain unchanged under these models. In Section 4, we state our main result in Theorem 2 under the planted clique hypothesis and present its proof with a concrete randomized polynomial-time procedure that reduces the planted clique problem to a Bayesian version of the submatrix detection problem. We discuss some related problems in Section 5. Section 6 presents additional proofs for results in earlier sections.

5 Notation

Test statistics for submatrix detection

To prepare for later investigation, we first study three test statistics for the submatrix detection problem (1)–(1.1). The first two are the linear and the scan test statistics proposed in ,

In addition, we also consider the maximum test statistic

The following lemma gives nonasymptotic bounds on the Type-I+{}+{}II error probabilities on tests based on these statistics. Recall the definition of M(p,k,λ){\mathcal{M}}(p,k,\lambda) in (1.1).

For TmaxT_{\max} in (9), set τ=(4+c)logp\tau^{\prime\prime}=\sqrt{(4+c)\log p}. Then

Asymptotically equivalent discretized model

Gaussian distributions serve as good statistical models for many real-world datasets. However, as an idealized approximation, Gaussian experiment does not capture the finite-precision nature of statistical computing systems in reality. As mentioned in Section 1, it is an ill-defined problem to investigate the computational complexity of testing the Gaussian hypothesis (1) since the data do not admit any representation using finite bits. Therefore a new paradigm is needed in order to make sense of hypothesis testing with complexity constraints in general. There are two goals of the paradigm: {longlist}[(a)]

to provide a rigorous framework for quantifying the complexity of statistical inference involving continuous, for example, Gaussian, distributions and

to preserve the statistical difficulty of the original problem in the sense of Le Cam’s asymptotic equivalence. In this section, we propose such a paradigm based on discretizing the original Gaussian experiment, which achieves both of the above goals.

The function []t[\cdot]_{t} naturally extends to matrices componentwise: for A=(Aij)A=(A_{ij}), [A]t=([Aij]t)[A]_{t}=([A_{ij}]_{t}).

Recall the submatrix detection problem (1). To model statistical inference with finite precision and complexity constraints, let us consider the same testing problem based on the discretized data [X]t[X]_{t}. In other words, the hypotheses are

Asymptotic equivalence

where the infimum is over all probability transition kernels from (X,F)({\mathsf{X}},{\mathcal{F}}) to (Y,G)({\mathsf{Y}},{\mathcal{G}}) , Theorem 1.7, page 29. The Le Cam distance between P{\mathcal{P}} and Q{\mathcal{Q}} is

For the proof of Theorem 1, see the supplement . An immediate consequence of Theorem 1 on submatrix detection is the following: Since the difference between the optimal Type-I+{}+{}II error probabilities for the Gaussian hypotheses and the discretized hypotheses is upper bounded by their Le Cam distance , Theorem 2.2, which vanishes as pp\to\infty, we conclude that testing on discretized data has no impact on the statistical performance asymptotically in the high-dimensional setting. In particular, conclusion (6)–(7) continues to hold for testing (14).

Complexity theoretic limits

Let A{0,1}N×NA\in\{0,1\}^{N\times N} be the adjacency matrix of a random graph drawn from either G(N,1/2){\mathcal{G}}(N,1/2) or G(N,1/2,κ){\mathcal{G}}(N,1/2,\kappa). The planted clique problem of parameters (N,κ)(N,\kappa), denoted by PC(N,κ)\mathsf{PC}(N,\kappa), refers to the hypothesis testing problem of

The planted clique problem has a long history in the theoretical computer science literature. It is known that finding the clique is statistically impossible when κ=o(logN)\kappa=o(\log N). Moreover, a greedy algorithm succeeds if κcNlogN\kappa\geq c\sqrt{N\log N} for some constant c>0c>0 . Using spectral methods, Alon, Krivelevich and Sudakov provided the first polynomial time detection algorithm when κ=cN\kappa=c\sqrt{N}, with later improvements obtained in, for example, . However, it is widely believed that the detection problem cannot be solved in randomized polynomial time when κ=o(N)\kappa=o(\sqrt{N}), which can be summarized as the following planted clique hypothesis. This version is similar to , Conjecture 4.13, and , Hypothesis BPC\mathsf{B}_{\mathsf{PC}}.

For any sequence {κN}\{\kappa_{N}\} such that lim supNlogκNlogN<1/2\limsup_{N\to\infty}\frac{\log\kappa_{N}}{\log N}<1/2 and any sequence of randomized polynomial-time testsFormally, randomized polynomial-time tests belong to the BPP complexity class. Interested readers are referred to standard textbooks on computational complexity theory (e.g., , Chapter 7) for the formal definitions and discussions. Intuitively speaking, randomized polynomial-time tests refer to algorithms with output space {0,1}\{0,1\}, which have access to external random numbers and whose running time is bounded by a polynomial of the input length regardless of the random numbers. {ψN}\{\psi_{N}\},

Various hardness results in theoretical computer science have been established based on the planted clique hypothesis, for example, approximating the Nash equilibrium , independence testing , certifying the restricted isometry property for compressed sensing measurement matrices , etc. Also, several cryptographic schemes have been proposed assuming the intractability of finding planted cliques or bicliques . Recently, the average-case hardness of planted clique has been established under certain computation models; see, for example, .

The main result of the current paper is the following.

Assume that Hypothesis 1 holds. Consider testing the discrete hypotheses (14) with t=t(p)=4log2pt=t(p)=4{\lceil{\log_{2}p}\rceil} in the asymptotic regime (5). If, for some absolute constant δ>0\delta>0,

there exists no sequence of randomized polynomial-time tests {ϕp}\{\phi_{p}\} such that lim infpE(ϕp)<2/3\liminf_{p\to\infty}{\mathcal{E}}(\phi_{p})<{2}/{3} for testing (14). Conversely, if

there is a sequence of linear-time tests {ϕp}\{\phi_{p}\} such that E(ϕp)0{\mathcal{E}}(\phi_{p})\to 0.

It should be noted that the sub-polynomial factor difference, that is, p/k2+δ{p}/{k^{2+\delta}} versus p/k2{p}/{k^{2}}, in the first part of (16) and (17) is a direct consequence of Hypothesis 1. In contrast, the logarithmic factor difference in the second part of (16) and (17) can potentially be closed by employing better reduction argument and/or more sophisticated testing procedures such as those based on spectral methods, which we leave as a future direction.

The remainder of this section is devoted to proving Theorem 2, with auxiliary lemmas proved in Section 6. First, in Section 4.1 we provide some intuition on how the planted clique problem is related to the submatrix detection problem (1) under the Gaussian model. Next, in Section 4.2 we prove that under the asymptotically equivalent discretized model, every randomized polynomial time submatrix detector for (14) leads to a randomized polynomial time solver for the planted clique problem of appropriate parameters with almost identical performance. Finally, a proof of Theorem 2 is presented in Section 4.3.

We first explain how the submatrix detection problem can be reduced from the planted clique problem under the original Gaussian model. These results are presented as the precursor of the randomized polynomial time reduction for the discretized model in Section 4.2, as well as to provide insights into the hardness of the submatrix detection problem. A connection between the two problems has also been previously hinted at in .

An important step in the following reduction scheme is to map any random edge to an N(0,1)N(0,1) random variable and any edge in the clique to an N(μ,1)N(\mu,1) random variable with some positive mean value μ\mu. Although this goal might not be achievable exactly, we describe below a strategy to achieve it approximately.

To this end, for any M3M\geq 3 and 0<μ12M0<\mu\leq\frac{1}{2M}, let c0=(12Φ(M))1c_{0}=(1-2\overline{\Phi}(M))^{-1} and c1=[1Φ(Mμ)Φ(M+μ)]1c_{1}=[1-\overline{\Phi}(M-\mu)-\overline{\Phi}(M+\mu)]^{-1}. We define two distributions F1{\mathcal{F}}_{1} and F0{\mathcal{F}}_{0} with the respective density functions

Here both f0f_{0} and f1f_{1} are well-defined probability density functions. In particular, the conditions M3M\geq 3 and 0μ12M0\leq\mu\leq\frac{1}{2M} ensure that f00f_{0}\geq 0. In what follows, both MM and μ\mu, and thereby F0{\mathcal{F}}_{0} and F1{\mathcal{F}}_{1}, depend on NN, though we suppress the dependence for notational convenience.

In other words, L(Bij(A0)ij=0)=F0{\mathcal{L}}(B_{ij}|(A_{0})_{ij}=0)={\mathcal{F}}_{0} and L(Bij(A0)ij=1)=F1{\mathcal{L}}(B_{ij}|(A_{0})_{ij}=1)={\mathcal{F}}_{1}.

Therefore, (20), (21) and (22) collectively define a deterministic function

which can be computed in O(N2)O(N^{2}) number of flops. The reason that we call the first step “Gaussianization” is due to the following lemma, which ensures that for appropriately chosen MM and μ\mu, the marginal distribution of BijB_{ij} is close to the Gaussian distribution of unit variance and mean zero (resp., μ\mu) if (A0)ij(A_{0})_{ij} corresponds to a random edge (resp., an edge in the clique).

Let N6N\geq 6. Let ξ\xi be a Bernoulli random variable. Let WW be a random variable such that for i{0,1}i\in\{0,1\}, the conditional distribution of Wξ=iW|\xi=i follows fif_{i} in (4.1) where M3M\geq 3 and μ12M\mu\leq\frac{1}{2M}. Then:

The following two lemmas characterize the law of X=g(A,B0,B1)X=g(A,B_{0},B_{1}) when either H0GH_{0}^{G} or H1GH_{1}^{G} in (15) holds.

Suppose H0GH_{0}^{G} holds and N2p6N\geq 2p\geq 6. Let M6logNM\geq\sqrt{6\log N}. Then

A careful examination of the proof of Lemma 4 in Section 6.4 reveals that the prior π\pi is in fact supported on a subset M~(p,k,λ)M(p,k,λ)\widetilde{\mathcal{M}}(p,k,\lambda)\subset{\mathcal{M}}(p,k,\lambda) where

and λ=2μpN\lambda=\frac{2\mu p}{N}. In other words, any matrix in M~(p,k,λ)\widetilde{\mathcal{M}}(p,k,\lambda) contains a nonzero rectangular submatrix whose row and column support sizes are between kk and 20k20k. This observation will be useful for studying the hardness of estimation in Section 5.2.

Combining Lemmas 3 and 4, the following theorem shows that any submatrix detector leads to a test with almost identical error probability for a planted clique problem, whose parameters (N,κ)(N,\kappa) depend on the parameters (p,k,λ)(p,k,\lambda) of the submatrix detection problem.

By the definition of the total variation distance, Lemma 3 implies that under H0GH_{0}^{G},

Since β=β0+β1\beta=\beta_{0}+\beta_{1}, the desired error bound (28) follows from

where the last inequality is due to assumption (27) on ϕ\phi.

Although the reduction scheme gg can be implemented in O(N2)O(N^{2}) flops, its computational complexity is ill defined since it involves computing sums of continuous random variables and processing infinitely many bits. This issue will be addressed by a quantization argument in the next subsection when we deal with the discretized models.

2 Randomized polynomial-time reduction for discretized models

In this section, we show that with slight modifications, the scheme introduced in Section 4.1 can be made into a randomized polynomial-time reduction from the planted clique problem to the submatrix detection problem under discretized models in rigorous complexity-theoretic sense.

For the discretized model P(p,t){\mathcal{P}}^{(p,t)} introduced in Section 3, the reduction scheme from the planted clique model follows the same steps in Section 4.1, except that both the input (B0,B1)(B_{0},B_{1}) and the output XX are now discretized.

To this end, we first define discrete approximations, denoted by Q0Q_{0} and Q1Q_{1}, to the densities f0f_{0} and f1f_{1} defined in (4.1). Let w,Tw,T be integers to be chosen based on t,Mt,M and NN. Recall the quantization operator defined in (13) and that B0B_{0} and B1B_{1} consist of i.i.d. entries drawn from densities f0f_{0} and f1f_{1}, respectively, which are supported on [M,M][-M,M] by definition. Note that each [(B0)ij]w[(B_{0})_{ij}]_{w} is drawn from a distribution with atoms xix_{i} and probability mass function (p.m.f.) pip_{i} for i[M2w+1]i\in[M2^{w+1}]. To find a dyadic approximation for the p.m.f., let qi=pi2T2Tq_{i}=\lfloor p_{i}2^{T}\rfloor 2^{-T} for i=2,,M2wi=2,\ldots,M2^{w} and q1=p12T2T+1i2qiq_{1}=\lfloor p_{1}2^{T}\rfloor 2^{-T}+1-\sum_{i\geq 2}q_{i}, where

Denote by Q0Q_{0} the discrete distribution with atoms xix_{i} and probability masses qiq_{i}. Similarly, define Q1Q_{1} as the dyadic approximation for the distribution of [(B1)ij]w[(B_{1})_{ij}]_{w}.

The reduction scheme operates as follows: first, generate Bi˘\breve{B_{i}} consisting of i.i.d. entries drawn from QiQ_{i} for i=0,1i=0,1. Next, replace the matrices B0B_{0} and B1B_{1} in (20) by their discretized version B˘0\breve{B}_{0} and B˘1\breve{B}_{1}, and denote the resulting matrix by B˘\breve{B}. Applying (21)–(22) to B˘\breve{B}, we obtain X˘\breve{X} and output its quantized version [X˘]t[\breve{X}]_{t}. Implementing the above steps yields a deterministic function

First we discuss the complexity for generating the auxiliary random variables used in the reduction scheme. Note that each (B˘0)ij(\breve{B}_{0})_{ij} is drawn from Q0Q_{0} whose atoms xix_{i} can be represented by log2M+w{\lceil{\log_{2}M}\rceil}+w bits and the p.m.f. qiq_{i} is a dyadic rational with TT bits. Therefore sampling from the distribution Q0Q_{0} can be done using the inverse CDFMore sophisticated random number generators for discrete distributions (such as Walker’s alias method which requires linear time for preprocessing and constant time per sample) can be found in , Section 3.4.1. by outputting xJx_{J}, where J=min{j\dvtxi=1jqiU2T}J=\min\{j\dvtx\sum_{i=1}^{j}q_{i}\leq U2^{-T}\} and UU is a random integer uniformly distributed on [2T][2^{T}]. Consequently, sampling from Q0Q_{0} requires O(M2wT)O(M2^{w}T) preprocessing time to compute the CDF, and TT fair coin flips and O(logM+w)O(\log M+w) time per sample (via binary search). Furthermore, discretizing each entry X˘ij\breve{X}_{ij} to [X˘ij]t[\breve{X}_{ij}]_{t} involves keeping the first tt bits after the binary point, which can be computed in O(t)O(t) time. Therefore we conclude that g˘\breve{g} can be computed using O((log2M+w+t)N2)O(({\lceil{\log_{2}M}\rceil}+w+t)N^{2}) number of binary operations.

To summarize, the randomized reduction scheme requires O(N2T)O(N^{2}T) random bits and O(M2wT+N2(logM+w+t))O(M2^{w}T+N^{2}(\log M+w+t)) computation, where TT is defined in (31). As we will show in Section 4.3, for all cases of interest in this paper, we can set N=O(p2)N=O(p^{2}) and M,w,t=O(log2p)=O(log2N)M,w,t=O(\log_{2}{p})=O(\log_{2}{N}). Therefore, our reduction for discretized models Ag˘(A,B˘0,B˘1)A\mapsto\breve{g}(A,\breve{B}_{0},\breve{B}_{1}) is a randomized polynomial-time reduction.

Combining the two lemmas, we obtain the following result analogously to Theorem 3.

3 Proof of Theorem \texorpdfstring22

We start with the lower bound. Without loss of generality, we can assume that λ1/p\lambda\geq{1}/{p} since when λ<1/p\lambda<{1}/{p}, both conditions in (7) hold in the asymptotic regime (5), and the problem is statistically impossible. Let the sequence {(k(p),λ(p))}\{(k(p),\lambda(p))\} satisfy (5) and (16). Let {ϕp}\{\phi_{p}\} be a sequence of randomized polynomial time tests. For conciseness we drop the indices in k(p),λ(p)k(p),\lambda(p) and ϕp\phi_{p}. Suppose for the sake of contradiction that

On the other hand, we have Np/λp2N\leq{p}/{\lambda}\leq p^{2}, where the last inequality holds since we have assumed λ1/p\lambda\geq{1}/{p}. Applying Theorem 4 with w=16log2pt+12log2pt+6log2Nw=16{\lceil{\log_{2}p}\rceil}\geq t+12\log_{2}p\geq t+6\log_{2}N, we conclude from (38) that the randomized test ψ()=ϕ(g˘(,B˘0,B˘1))\psi(\cdot)=\phi(\breve{g}(\cdot,\breve{B}_{0},\breve{B}_{1})) satisfies

In view of Remark 5, Ag˘(A,B˘0,B˘1)A\mapsto\breve{g}(A,\breve{B}_{0},\breve{B}_{1}) is a randomized polynomial-time reduction. By the assumption on ϕ\phi, ψ\psi as a composition of ϕ\phi and g˘\breve{g} is a randomized polynomial-time test for PC(N,κ)\mathsf{PC}(N,\kappa). Therefore, (40) contradicts Hypothesis 1 in view of (39).

Discussion

In this paper, assuming the planted clique hypothesis, we have demonstrated a phase transition phenomenon on gaps between the optimal statistical performance with and without computational complexity constraints for the submatrix detection problem. The hardness result in Theorem 2 has important consequences on the hardness of two related problems, namely, support recovery and matrix estimation under submatrix sparsity, both of which are more difficult than detection and require stronger signal level. To discuss computational complexity of statistical procedures, we focus on the discretized models introduced in Section 3 throughout the current section.

As previously studied in , the goal of support recovery is to identify the minimum λ\lambda such that, under the alternative hypothesis H1H_{1} in (1), the submatrix can be consistently located. According to Theorems 1 and 2 of , for all kp/2k\leq p/2, one needs λ=Ω(log(p)/k)\lambda=\Omega(\sqrt{\log(p)/k}) to recover the support consistently under the parameter space (1.1). Compared with (6)–(7), this coincides with the minimum signal strength required for detecting the submatrix when k=O(pα)k=O(p^{\alpha}) for α<2/3\alpha<{2/3}, but is much larger when k=Ω(pα)k=\Omega(p^{\alpha}) for α>2/3\alpha>{2/3}.

2 Hardness of estimation depends on norm

We now consider the computational aspect of the related problem of estimating the mean matrix with submatrix sparsity under squared norm losses. Denote the set of k×kk\times k-sparse matrices by

which includes both the zero matrix and the set M~(p,k/20,λ)\widetilde{\mathcal{M}}(p,{\lfloor{k/20}\rfloor},\lambda) [defined in (3)] for any λ>0\lambda>0.

Given the noisy observation X=θ+ZX=\theta+Z, where ZZ consists of standard Gaussian entries, the minimax risk

has been obtained in , Section 4, within universal constant factors using convex geometry and information-theoretic arguments for all unitarily invariant norms,To be precise, note that the minimax rates in are obtained for the Gaussian model. Since the loss function is unbounded, one cannot directly conclude from asymptotic equivalence that the same rate applies to the discretized model. Nevertheless, it is straightforward to extend the arguments in , Section 4.1, to show that the rate of Ψ(p,k)\Psi_{\|\cdot\|}(p,k) applies to the discretized model in Section 3 as long as t=Ω(logp)t=\Omega(\sqrt{\log p}), independent of the unitarily invariant norm. In particular, the lower bound in , Section 4.1.1, applies verbatim due to the data processing inequality of the KL divergence, which is attained by the same estimator defined in , Section 4.1.2, if 2t1/p2^{-t}\leq 1/\sqrt{p}. in particular, satisfies

which is within a logarithmic term of k2/q+1k^{{2}/{q}+1}. Next we discuss the computational cost of estimation by focusing on the asymptotic regime where k=Θ(pα)k=\Theta(p^{\alpha}) for some α(0,1)\alpha\in(0,1). In view of the relationship between testing and estimation, we can use the construction in Lemmas 3–4 for the detection problem

since the risk is clearly nondecreasing in kk.

On the constructive side, an estimation error of

Comparing the minimax rate (43) with the computationally lower and upper bounds (44)–(45), we obtain the following result, assuming Hypothesis 1:

For qq\in, using the entrywise thresholding estimator defined above, the minimax rate is attained within a logarithmic factor simultaneously for all kk;

More generally, one can show that for all quadratic norms (see , page 95), entrywise thresholding is near optimal (within a sub-polynomial factor) among all computationally efficient estimators. This extends the above result since Schatten-qq norm is quadratic if and only if q[2,]q\in[2,\infty].

Proofs

We present below the proofs of Lemmas 1–4. The proofs of Theorem 1 and Lemmas 5 and 6 are deferred to the supplement .

2 Proof of Lemma \texorpdfstring22

For the first claim, the marginal density function of WW is f1f_{1} in (4.1). So by definition,

were the last inequality is due to the fact that for any 0<μ12M0<\mu\leq\frac{1}{2M}, (Mμ)2(M12M)2M21(M-\mu)^{2}\geq(M-\frac{1}{2M})^{2}\geq M^{2}-1. For the second claim, the marginal density function of WW is f=12(f0+f1)=c0φ(x)1{xM}f=\frac{1}{2}(f_{0}+f_{1})=c_{0}\varphi(x){\mathbf{1}_{\{{|x|\leq M}\}}}. Thus

3 Proof of Lemma \texorpdfstring33

We need the following result on the total variation between product distributions.

TV(i=1nPi,i=1nQi)i=1nTV(Pi,Qi)\mathsf{TV}(\prod_{i=1}^{n}P_{i},\prod_{i=1}^{n}Q_{i})\leq\sum_{i=1}^{n}\mathsf{TV}(P_{i},Q_{i}).

Recall the dual representation of the total variation ,

Here, the first inequality is due to the data processing inequality for the total variation , the second last inequality is due to Lemma 7 and the last inequality is due to Lemma 2.

4 Proof of Lemma \texorpdfstring44

Recall that NN is even with N2=N/2N_{2}=N/2. When AG(N,1/2,κ)A\sim{\mathcal{G}}(N,1/2,\kappa), let V[N]V\subset[N] denote the vertex subset of size κ\kappa on which the planted clique in AA is supported. For any subset S{N2+1,,N}S\subset\{N_{2}+1,\dots,N\}, we have SN2[N2]S-N_{2}\subset[N_{2}]. Further define

By the definition of XX, for each a,b[p]a,b\in[p], we can define sets

occurs with high probability. To this end, first note that

Note that conditioning on the size V1=κ1|V_{1}|=\kappa_{1}, the set V1V_{1} is chosen uniformly at random among all κ1\kappa_{1} subsets of [N2][N_{2}]. Thus, for any κ1[κ/4,3κ/4]\kappa_{1}\in[\kappa/4,3\kappa/4] and c0=1/20c_{0}=1/20,

Then we apply (21) and (22) to B~\widetilde{B} instead of BB to obtain a p×pp\times p random matrix X~\widetilde{X}. The intuition is that B~{\widetilde{B}} and X~{\widetilde{X}} correspond to the ideal input and output of the reduction scheme, in the sense that the L(X~){\mathcal{L}}({\widetilde{X}}) is, as we show next, close to a desired mixture on the alternative hypotheses. Our choice of the distribution F0{\mathcal{F}}_{0} and F1{\mathcal{F}}_{1} ensures that BB is close to the ideal case B~{\widetilde{B}} in total variation, and the data processing inequality guarantees that the output XX is also close to X~{\widetilde{X}}.

To this end, note that conditioned on VV, for each a,b[p]a,b\in[p], we have

where the last inequality is due to (6.4). In view of (55), this completes the proof.

[id=suppA] \stitleSupplement to “Computational barriers in minimax submatrix detection” \slink[doi]10.1214/14-AOS1300SUPP \sdatatype.pdf \sfilenameaos1300_supp.pdf \sdescriptionWe provide proofs of Theorem 1 and Lemmas 5 and 6.

References