Sparse recovery under weak moment assumptions

Guillaume Lecué, Shahar Mendelson

Introduction and main results

Data acquisition is an important task in diverse fields such as mobile communications, medical imaging, radar detection and others, making the design of efficient data acquisition processes a problem of obvious significance.

The core issue in data acquisition is retaining all the valuable information at one’s disposal, while keeping the ‘acquisition cost’ as low as possible. And while there are several ways of defining that cost, depending on the problem (storage, time, financial cost, etc.), the common denominator of being ‘cost effective’ is ensuring the quality of the data while keeping the number of measurements as small as possible.

The rapidly growing area of Compressed Sensing studies ‘economical’ data acquisition processes. We refer the reader to and to the book for more information on the origins of Compressed Sensing and a survey of the progress that has been made in the area in recent years.

At the heart of Compressed Sensing is a simple idea that has been a recurring theme in Mathematics and Statistics: while complex objects (in this case, data), live in high-dimensional spaces, they can be described effectively using low-dimensional, approximating structures; moreover, randomness may be used to expose these low-dimensional structures. Of course, unlike more theoretical applications of this idea, identifying the low-dimensional structures in the context of Compressed Sensing must be robust and efficient, otherwise, such procedures will be of little practical use.

Because the resulting system of equations is under-determined, there is no hope, in general, of identifying x0x_{0}. However, if x0x_{0} is believed to be well approximated by a low-dimensional structure, for example, if x0x_{0} is supported on at most ss coordinates for some sNs\leq N, the problem becomes more feasible.

Given the measurement matrix Γ\Gamma and the measurements \Gamma x_{0}=(\bigl{<}X_{i},x_{0}\bigr{>})_{i=1}^{N}, Basis Pursuit returns a vector x^\hat{x} that satisfies

Since one may solve this minimization problem effectively, the focus may be shifted to the quality of the solution: whether one can identify measurement vectors X1,....,XNX_{1},....,X_{N} for which (1.1) has a unique solution, which is x0x_{0} itself, for any x0x_{0} that is ss-sparse (i.e. supported on at most ss coordinates).

It follows from Proposition 2.2.18 in that if Γ\Gamma satisfies ER(ss) then necessarily the number of measurements (rows) is at least N\geq c_{0}s\log\big{(}en/s\big{)}, where c0c_{0} is a suitable absolute constant. On the other hand, there are constructions of (random) matrices Γ\Gamma that satisfy ER(ss) with NN proportional to s\log\big{(}en/s\big{)}. From here on and with a minor abuse of notation, we will refer to slog(en/s)s\log(en/s) as the optimal number of measurements and ignore the exact dependence on the constant c0c_{0}.

Unfortunately, the only matrices that are known to satisfy the reconstruction property with an optimal number of measurements are random – which is not surprising, as randomness is one of the most effective tools in exposing low-dimensional, approximating structures. A typical example of an ‘optimal matrix’ is the Gaussian matrix, which has independent standard normal random variables as entries. Other examples of optimal measurement matrices are \Gamma=N^{-1/2}\sum_{i=1}^{N}\bigl{<}X_{i},\cdot\bigr{>}f_{i} where X1,...,XNX_{1},...,X_{N} are independent, isotropic and LL-subgaussian random vectors:

The optimal behaviour of isotropic, LL-subgaussian matrix ensembles and other ensembles like it, occurs because a typical matrix acts on Σs\Sigma_{s} in an isomorphic way when Nc1slog(en/s)N\geq c_{1}s\log(en/s), and in the LL-subgaussian case, c1c_{1} is a constant that depends only on LL. In Compressed Sensing literature, this isomorphic behaviour is called the Restricted Isometry property (RIP) (see, for example ): A matrix Γ\Gamma satisfies the RIP in Σs\Sigma_{s} with constant 0<δ<10<\delta<1, if for every tΣst\in\Sigma_{s},

It is straightforward to show that if Γ\Gamma satisfies the RIP in Σ2s\Sigma_{2s} for a sufficiently small constant δ\delta, then it has the exact reconstruction property of order ss (see, e.g. ).

Proving the RIP for subexponential ensembles is a much harder task than for subgaussian ensembles (see, e.g. ). Moreover, the RIP does not exhibit the same optimal quantitative behaviour as in the Gaussian case: it holds with high probability only when Nc2(L)slog2(en/s)N\geq c_{2}(L)s\log^{2}(en/s), and this estimate cannot be improved, as can be seen when XX has independent, symmetric exponential random variables as coordinates .

Although the RIP need not be true for isotropic LL-subexponential ensemble using the optimal number of measurements, results in (see Theorem 7.3 there) and in show that exact reconstruction can still be achieved by such an ensemble and with the optimal number of measurements. This opens the door to an intriguing question: whether considerably weaker assumptions on the measurement vector may still lead to Exact Reconstruction even when the RIP fails.

The main result presented here does just that, using the small-ball method introduced in .

A random vector XX satisfies the small-ball condition in the set Σs\Sigma_{s} with constants u,β>0u,\beta>0 if for every tΣst\in\Sigma_{s},

The small-ball condition is a rather minimal assumption on the measurement vector and is satisfied in fairly general situations for values of uu and β\beta that are suitable constants, independent of the dimension nn.

Under some normalization (like isotropicity), a small-ball condition is an immediate outcome of the Paley-Zygmund inequality (see, e.g. ) and moment equivalence. For example, in the following cases a small-ball condition holds with constants that depend only on κ0\kappa_{0} (and on ε\varepsilon for the first case); the straightforward proof may be found in .

\bullet XX is isotropic and for every tΣst\in\Sigma_{s}, \left\|\bigl{<}X,t\bigr{>}\right\|_{L_{2+\varepsilon}}\leq\kappa_{0}\left\|\bigl{<}X,t\bigr{>}\right\|_{L_{2}} for some ε>0\varepsilon>0;

\bullet XX is isotropic and for every tΣst\in\Sigma_{s}, \left\|\bigl{<}X,t\bigr{>}\right\|_{L_{2}}\leq\kappa_{0}\left\|\bigl{<}X,t\bigr{>}\right\|_{L_{1}}.

Our first result shows that a combination of the small-ball condition and a weak moment assumption suffices to ensure the exact reconstruction property with the optimal number of measurements.

there are κ1,κ2,w>1\kappa_{1},\kappa_{2},w>1 that satisfy that for every 1jn1\leq j\leq n, xjL2=1\|x_{j}\|_{L_{2}}=1 and, for every 4p2κ2log(wn)4\leq p\leq 2\kappa_{2}\log(wn), xjLpκ1pα\|x_{j}\|_{L_{p}}\leq\kappa_{1}p^{\alpha}.

XX satisfies the small ball condition in Σs\Sigma_{s} with constants uu and β\beta.

and X1,...,XNX_{1},...,X_{N} are independent copies of XX, then, with probability at least

\Gamma=N^{-1/2}\sum_{i=1}^{N}\bigl{<}X_{i},\cdot\bigr{>}f_{i} satisfies the exact reconstruction property in Σs1\Sigma_{s_{1}} for s1=c2u2βss_{1}=c_{2}u^{2}\beta s.

An immediate outcome of Theorem A is the following:

\bullet Let xx be a centered random variable that has variance 11 and for which xLpcp\|x\|_{L_{p}}\leq c\sqrt{p} for 1p2logn1\leq p\leq 2\log n. If XX has independent coordinates distributed as xx, then the corresponding matrix Γ\Gamma with Nc1slog(en/s)N\geq c_{1}s\log(en/s) rows can be used as a measurement matrix and recover any ss-sparse vector with large probability.

It is relatively straightforward to derive many other results of a similar flavour, leading to random ensembles that satisfy the exact reconstruction property with the optimal number of measurements.

Our focus is on measurement matrices with independent rows, that satisfy conditions of a stochastic nature – they have i.i.d. rows. Other types of measurement matrices that have some structure have also been used in Compressed Sensing. One notable example is a random Fourier measurement matrix, obtained by randomly selecting rows from the discrete Fourier matrix (see, e.g. , or Chapter 12 in ).

One may wonder if the small-ball condition is satisfied for more structured matrices, as the argument we use here does not extend immediately to such cases. And, indeed, for structured ensembles one may encounter a different situation: a small-ball condition that is not uniform, in the sense that the constants uu and β\beta from Definition 1.4 are direction-dependent. Moreover, in some cases, the known estimates on these constants are far from what is expected.

Results of the same flavour of Theorem A may follow from a ‘good enough’ small-ball condition, even if it is not uniform, by slightly modifying the argument we use here. However, obtaining a satisfactory ‘non-uniform’ small-ball condition is a different story. For example, in the Fourier case, such an estimate is likely to require quantitative extensions of the Littlewood-Paley theory – a worthy challenge in its own right, and one which goes far beyond the goals of this article.

Just as noted for subexponential ensembles, Theorem A cannot be proved using an RIP-based argument. A key ingredient in the proof is the following observation:

for every xΣsx\in\Sigma_{s}, Γx2c0x2\left\|\Gamma x\right\|_{2}\geq c_{0}\left\|x\right\|_{2}, and

for every j{1,,n}j\in\{1,\ldots,n\}, Γej2c1\left\|\Gamma e_{j}\right\|_{2}\leq c_{1}.

Setting s_{1}=\big{\lfloor}(c_{0}^{2}(s-1))/(4c_{1}^{2})\big{\rfloor}-1, Γ\Gamma satisfies the exact reconstruction property in Σs1\Sigma_{s_{1}}.

Compared with the RIP, conditions a) and b) in Theorem B are weaker, as it suffices to verify the right-hand side of (1.2) for 11-sparse vectors rather than for every ss-sparse vector. This happens to be a substantial difference: the assumption that for every tΣst\in\Sigma_{s}, Γt2(1+δ)t2\left\|\Gamma t\right\|_{2}\leq(1+\delta)\left\|t\right\|_{2} is a costly one, and happens to be the reason for the gap between the RIP and the exact reconstruction property. Indeed, while the lower bound in the RIP holds for rather general ensembles (see and the next section for more details), and is guaranteed solely by the small-ball condition, the upper bound is almost equivalent to having the coordinates of XX exhibit a subgaussian behaviour of moments, at least up to some level. Even the fact that one has to verify the upper bound for 11-sparse vectors comes at a cost, namely, the moment assumption (1) in Theorem A.

The second goal of this note is to illustrate that while Exact Reconstruction is ‘cheaper’ than the RIP, it still comes at a cost – namely, that the moment condition (1) in Theorem A is truly needed.

A random matrix Γ\Gamma is generated by the random variable xx if \Gamma=N^{-1/2}\sum_{i=1}^{N}\bigl{<}X_{i},\cdot\bigr{>}f_{i} and X1,...,XNX_{1},...,X_{N} are independent copies of the random vector X=(x1,...,xn)X=(x_{1},...,x_{n})^{\top} whose coordinates are independent copies of xx.

Theorem C. There exist absolute constants c0,c1,c2c_{0},c_{1},c_{2} and c3c_{3} for which the following holds. Given nc0n\geq c_{0} and NlogNc1nN\log N\leq c_{1}n, there exists a mean-zero, variance one random variable xx with the following properties:

\bullet xLpc2p\left\|x\right\|_{L_{p}}\leq c_{2}\sqrt{p} for 2<pc3(logn)/(logN)2<p\leq c_{3}(\log n)/(\log N).

\bullet If (xj)j=1n(x_{j})_{j=1}^{n} are independent copies of xx then X=(x1,...,xn)X=(x_{1},...,x_{n})^{\top} satisfies the small-ball condition with constants uu and β\beta that depend only on c2c_{2}.

\bullet Denote by Γ\Gamma the N×nN\times n matrix generated by xx. For every k{1,,n}k\in\{1,\ldots,n\}, with probability larger than 1/21/2, \operatorname*{argmin}\big{(}\left\|t\right\|_{1}:\Gamma t=\Gamma e_{k}\big{)}\neq\{e_{k}\}; therefore, eke_{k} is not exactly reconstructed by Basis Pursuit and so Γ\Gamma does not satisfy the exact reconstruction property of order 11.

To put Theorem C in some perspective, note that if Γ\Gamma is generated by xx for which xL2=1\left\|x\right\|_{L_{2}}=1 and xLpc4p\left\|x\right\|_{L_{p}}\leq c_{4}\sqrt{p} for 2<pc5logn2<p\leq c_{5}\log n, then X=(xi)i=1nX=(x_{i})_{i=1}^{n} satisfies the small-ball condition with constants that depend only on c4c_{4}, and by Theorem A, if Nc6lognN\geq c_{6}\log n, Γ\Gamma satisfies ER(11) with high probability. On the other hand, the random ensemble from Theorem C is generated by xx that has almost identical properties – with one exception: its LpL_{p} norm is well behaved only for pc7(logn)/loglognp\leq c_{7}(\log n)/\log\log n. This small gap in the number of moments has a significant impact: with probability at least 1/21/2, Γ\Gamma does not satisfy ER(11) when NN is of the order of logn\log n.

The fact that the coordinates of XX do not have enough well behaved moments is the key feature that allows one to generate many ‘spiky’ columns in a typical Γ\Gamma.

An alternative formulation of Theorem C is the following: Theorem C′. There are absolute constants c0,c1,c2c_{0},c_{1},c_{2} and κ\kappa for which the following holds. If nc0n\geq c_{0} and 2<p<c1logn2<p<c_{1}\log n, there exists a mean-zero and variance 11 random variable xx, for which xLqκq\|x\|_{L_{q}}\leq\kappa\sqrt{q} for 2<qp2<q\leq p, and if Nc2p(n/logn)1/pN\leq c_{2}\sqrt{p}(n/\log n)^{1/p} and Γ\Gamma is the N×nN\times n matrix generated by xx, then with probability at least 1/21/2, Γ\Gamma does not satisfy the exact reconstruction property of order 1.

XX satisfies a weak small-ball condition in Σs\Sigma_{s} with constant β\beta if for every tΣst\in\Sigma_{s},

We end this introduction with a word about notation and the organization of the article. The proofs of Theorem A, Theorem B and Theorem D are presented in the next section, while the proofs of Theorem C and Theorem C′ may be found in Section 3. The final section is devoted to results in a natural ‘noisy’ extension of Compressed Sensing. In particular, we prove that both the Compatibility Condition and the Restricted Eigenvalue Condition hold under weak moment assumptions; we also study related properties of random polytopes.

As for notation, throughout, absolute constants or constants that depend on other parameters are denoted by cc, CC, c1c_{1}, c2c_{2}, etc., (and, of course, we will specify when a constant is absolute and when it depends on other parameters). The values of these constants may change from line to line. The notation xyx\sim y (resp. xyx\lesssim y) means that there exist absolute constants 0<c<C0<c<C for which cyxCycy\leq x\leq Cy (resp. xCyx\leq Cy). If b>0b>0 is a parameter then xbyx\lesssim_{b}y means that xC(b)yx\leq C(b)y for some constant C(b)C(b) that depends only on bb.

Proof of Theorem A, B and D

The proof of Theorem A has several components, and although the first of which is rather standard, we present it for the sake of completeness.

Proof. Observe that if xB1nx\in B_{1}^{n} and x2r\|x\|_{2}\geq r then y=rx/x2B1nrSn1y=rx/\|x\|_{2}\in B_{1}^{n}\cap rS^{n-1}. Therefore, if y∉ker(Γ)y\not\in{\rm ker}(\Gamma), the same holds for xx; thus

Let s=(2r)2s=\lfloor(2r)^{-2}\rfloor, fix x0Σsx_{0}\in\Sigma_{s} and put II to be the set indices of coordinates on which x0x_{0} is supported. Given a nonzero hker(Γ)h\in{\rm ker}(\Gamma), let h=hI+hIch=h_{I}+h_{I^{c}} – the decomposition of hh to coordinates in II and in IcI^{c}. Since h/h1B1nker(Γ)h/\|h\|_{1}\in B_{1}^{n}\cap{\rm ker}(\Gamma), it follows that h2<rh1\|h\|_{2}<r\|h\|_{1}, and by the choice of ss, 2sh2<h12\sqrt{s}\|h\|_{2}<\|h\|_{1}. Therefore,

Hence, x0+h1>x01\|x_{0}+h\|_{1}>\|x_{0}\|_{1} and x0x_{0} is the unique minimizer of the basis pursuit algorithm.

The main ingredient in the proof of Theorem A is Lemma 2.3 below, which is based on the small-ball method introduced in . To formulate the lemma, one requires the notion of a VC class of sets.

Let G{\cal G} be a class of {0,1}\{0,1\}-valued functions defined on a set X{\cal X}. The set G{\cal G} is a VC-class if there exists an integer VV for which, given any x1,...,xV+1Xx_{1},...,x_{V+1}\in{\cal X},

The VC-dimension of G{\cal G}, denoted by VC(G)VC({\cal G}), is the smallest integer VV for which (2.1) holds.

The VC dimension is a combinatorial complexity measure that may be used to control the L2(μ)L_{2}(\mu)-covering numbers of the class; indeed, set N(G,ε,L2(μ))N({\cal G},\varepsilon,L_{2}(\mu)) to be the smallest number of open balls of radius ε\varepsilon relative to the L2(μ)L_{2}(\mu) norm that are needed to cover G{\cal G}. A well known result due to Dudley is that if VC(G)=VVC({\cal G})=V and μ\mu is a probability measure on X{\cal X} then for every 0<ε<10<\varepsilon<1,

where c1c_{1} and c2c_{2} are absolute constants.

There exist absolute constants c1c_{1} and c2c_{2} for which the following holds. Let F{\cal F} be a class of functions and assume that there are β>0\beta>0 and u0u\geq 0 for which

Note that u=0u=0 is a ‘legal choice’ in Lemma 2.3, a fact that will be used in the proof of Theorem D.

Standard empirical processes arguments (symmetrization, the fact that Bernoulli processes are subgaussian and the entropy estimate (2.2) – see, for example, Chapters 2.2, 2.3 and 2.6 in ), show that since VC(G)dVC({\cal G})\leq d,

provided that Nd/β2N\gtrsim d/\beta^{2}. Therefore, taking t=Nβ2/16c12t=N\beta^{2}/16c_{1}^{2}, it follows that with probability at least 1exp(c3β2N)1-\exp(-c_{3}\beta^{2}N), for every fFf\in{\cal F},

Therefore, on that event, {i:f(Xi)>u}βN/2|\{i:|f(X_{i})|>u\}|\geq\beta N/2 for every fFf\in{\cal F}.

1. If there are 0<β10<\beta\leq 1 and u0u\geq 0 for which P\big{(}|\bigl{<}t,X\bigr{>}|>u\big{)}\geq\beta for every tSn1t\in S^{n-1} and if Nc1n/β2N\geq c_{1}n/\beta^{2}, then with probability at least 1exp(c2Nβ2)1-\exp(-c_{2}N\beta^{2}),

2. If there are 0<β10<\beta\leq 1 and u0u\geq 0 for which P\big{(}|\bigl{<}t,X\bigr{>}|>u\big{)}\geq\beta for every tΣsSn1t\in\Sigma_{s}\cap S^{n-1} and if Nc1slog(en/s)/β2N\geq c_{1}s\log(en/s)/\beta^{2}, then with probability at least 1exp(c2Nβ2)1-\exp(-c_{2}N\beta^{2}),

Note that the first part of Corollary 2.5 gives an estimate on the smallest singular value of the random matrix \Gamma=N^{-1/2}\sum_{i=1}^{N}\bigl{<}X_{i},\cdot\bigr{>}f_{i}. The proof follows the same path as in , but unlike the latter, no assumption on the covariance structure of XX, used both in and in , is required. In fact, Corollary 2.5 may be applied even if the covariance matrix does not exist. Thus, under a small-ball condition, the smallest singular value of Γ\Gamma is larger than c(β,u)c(\beta,u) with high (exponential) probability.

is at most c1nc_{1}n for a suitable absolute constant c1c_{1} (see, e.g., Chapter 2.6 in ). The claim now follows immediately from Lemma 2.3 because

Turning to the second part, note that ΣsSn1\Sigma_{s}\cap S^{n-1} is a union of (ns)\binom{n}{s} spheres of dimension ss. Applying the first part to each one of those spheres, combined with the union bound, it follows that for Nc2β2slog(en/s)N\geq c_{2}\beta^{-2}s\log(en/s), with probability at least 1exp(c3Nβ2)1-\exp(-c_{3}N\beta^{2}),

Corollary 2.5 shows that the small-ball condition for linear functionals implies that Γ\Gamma ‘acts well’ on ss-sparse vectors. However, according to Lemma 2.1, exact recovery is possible if Γ\Gamma is well behaved on the set

for a well-chosen constant κ0\kappa_{0}. In the standard (RIP-based) argument, one proves exact reconstruction by first showing that the RIP holds in Σs\Sigma_{s}, and then the fact that each vector in κ0sB1nSn1\sqrt{\kappa_{0}s}B_{1}^{n}\cap S^{n-1} is well approximated by vectors from Σs\Sigma_{s} (see, for instance, ) allows one to extend the RIP from Σs\Sigma_{s} to κ0sB1nSn1\sqrt{\kappa_{0}s}B_{1}^{n}\cap S^{n-1}. Unfortunately, this extension requires both upper and lower estimates in the RIP.

Since the upper part of the RIP in Σs\Sigma_{s} forces severe restrictions on the random vector XX, one has to resort to a totally different argument if one wishes to extend the lower bound from Σs\Sigma_{s} (which only requires the small-ball condition) to κ0sB1nSn1\sqrt{\kappa_{0}s}B_{1}^{n}\cap S^{n-1}.

The method presented below is based on Maurey’s empirical method and has been recently used in .

Let Y1,...,YsY_{1},...,Y_{s} be independent copies of YY and set Z=s1k=1sYkZ=s^{-1}\sum_{k=1}^{s}Y_{k}. Note that ZΣsZ\in\Sigma_{s} for every realization of Y1,...,YsY_{1},...,Y_{s}; thus ΓZ22λ2Z22\|\Gamma Z\|_{2}^{2}\geq\lambda^{2}\|Z\|_{2}^{2} and

Therefore, setting μj=yj/y1\mu_{j}=|y_{j}|/\|y\|_{1} and W=j=1nΓej22μjW=\sum_{j=1}^{n}\left\|\Gamma e_{j}\right\|_{2}^{2}\mu_{j},

and using the same argument one may show that

Combining these two estimates with (2.4),

Proof of Theorem B: Assume that for every xΣsx\in\Sigma_{s}, Γx2c0x2\|\Gamma x\|_{2}\geq c_{0}\|x\|_{2} and that for every 1in1\leq i\leq n, Γei2c1\|\Gamma e_{i}\|_{2}\leq c_{1}. It follows from Lemma 2.7 that if s1>c12/(c02r2)s-1>c_{1}^{2}/(c_{0}^{2}r^{2}), then for every yB1nrSn1y\in B_{1}^{n}\cap rS^{n-1},

which is an average of NN iid random variables (though Γe12,\left\|\Gamma e_{1}\right\|_{2}, ,Γen2\ldots,\left\|\Gamma e_{n}\right\|_{2} need not be independent).

Thanks to Theorem B and Corollary 2.5, the final component needed for the proof of Theorem A is information on the sum of iid random variables, which will be used to bound max1jnΓej22\max_{1\leq j\leq n}\left\|\Gamma e_{j}\right\|_{2}^{2} from above.

There exists an absolute constant c0c_{0} for which the following holds. Let zz be a mean-zero random variable and put z1,,zNz_{1},\ldots,z_{N} to be NN independent copies of zz. Let p02p_{0}\geq 2 and assume that there exists κ1>0\kappa_{1}>0 and α1/2\alpha\geq 1/2 for which zLpκ1pα\|z\|_{L_{p}}\leq\kappa_{1}p^{\alpha} for every 2pp02\leq p\leq p_{0}. If Np0max{2α1,1}N\geq p_{0}^{\max\{2\alpha-1,1\}} then for every 2pp02\leq p\leq p_{0},

where c1(α)=c0exp((2α1))c_{1}(\alpha)=c_{0}\exp((2\alpha-1)).

Lemma 2.8 shows that even under a weak moment assumption, namely that zLppα\|z\|_{L_{p}}\lesssim p^{\alpha} for pp0p\leq p_{0} and α1/2\alpha\geq 1/2 that can be large, a normalized sum of NN independent copies of zz exhibits a ‘subgaussian’ moment growth up to the same p0p_{0}, as long as NN is sufficiently large.

The proof of Proposition 2.8 is based on the following fact due to Latała.

If zz is a mean-zero random variable and z1,...,zNz_{1},...,z_{N} are independent copies of zz, then for any p2p\geq 2,

Proof of Lemma 2.8. Let 2pp02\leq p\leq p_{0} and NpN\geq p. Since zLsκ1sα\|z\|_{L_{s}}\leq\kappa_{1}s^{\alpha} for any 2sp2\leq s\leq p, it follows from Theorem 2.9 that

It is straightforward to verify that the function h(s)=(N/p)1/ss1+αh(s)=(N/p)^{1/s}s^{-1+\alpha} is non-increasing when α1\alpha\leq 1 and attains its maximum in s=max{2,p/N}=2s=\max\{2,p/N\}=2 or in s=ps=p when α>1\alpha>1. Therefore, when NpN\geq p,

Finally, if Np2α1N\geq p^{2\alpha-1} then e2α1NpN1/ppαe^{2\alpha-1}\sqrt{Np}\geq N^{1/p}p^{\alpha}, which completes the proof.

Proof of Theorem A. Consider Nc1slog(en/s)/β2N\geq c_{1}s\log(en/s)/\beta^{2}. By Corollary 2.5, with probability at least 1exp(c2Nβ2)1-\exp(-c_{2}N\beta^{2}),

Set (Xi)i=1N(X_{i})_{i=1}^{N} for which (2.5) holds and let \Gamma=N^{-1/2}\sum_{i=1}^{N}\bigl{<}X_{i},\cdot\bigr{>}f_{i}. By Lemma 2.7 for λ2=u2β/2\lambda^{2}=u^{2}\beta/2, it follows that when r1r\geq 1,

Next, one has to obtain a high probability upper estimate on max1jnΓej22\max_{1\leq j\leq n}\|\Gamma e_{j}\|_{2}^{2}. To that end, fix w1w\geq 1 and consider z=xj21z=x_{j}^{2}-1 - where xjx_{j} is the jj-th coordinate of XX. Observe that zz is a centered random variable and that zLq4ακ12q2α\left\|z\right\|_{L_{q}}\lesssim 4^{\alpha}\kappa_{1}^{2}q^{2\alpha} for every 1qκ2log(wn)1\leq q\leq\kappa_{2}\log(wn). Thus, by Lemma 2.8 for p=κ2log(wn)p=\kappa_{2}\log(wn) and c3(α)4αexp((4α1))c_{3}(\alpha)\sim 4^{\alpha}\exp((4\alpha-1)),

provided that Npmax{4α1,1}=(κ2log(wn))max{4α1,1}N\geq p^{\max\{4\alpha-1,1\}}=(\kappa_{2}\log(wn))^{\max\{4\alpha-1,1\}}. Hence, if N(c3(α)κ12)2(κ2log(wn))max{4α1,1}N\geq(c_{3}(\alpha)\kappa_{1}^{2})^{2}(\kappa_{2}\log(wn))^{\max\{4\alpha-1,1\}}, and setting Vj=Γej22V_{j}=\left\|\Gamma e_{j}\right\|_{2}^{2}, one has

and rsλ2/8e=su2β/16er\leq s\lambda^{2}/8e=su^{2}\beta/16e, then with probability at least 1exp(c2Nβ2)1/(wκ2nκ21)1-\exp(-c_{2}N\beta^{2})-1/(w^{\kappa_{2}}n^{\kappa_{2}-1}),

Therefore, by Lemma 2.1, Γ\Gamma satisfies the exact reconstruction property for vectors that are c4u2βsc_{4}u^{2}\beta s-sparse, as claimed.

Proof of Theorem C and Theorem C′

Proof. Let wB1Jcw\in B_{1}^{J^{c}} for which Γv=Γw\Gamma v=\Gamma w and observe that vwv\not=w (otherwise, vB1JB1Jcv\in B_{1}^{J}\cap B_{1}^{J^{c}}, implying that v=0v=0, which is impossible because v1=1\left\|v\right\|_{1}=1).

Set x1,,xnx_{\cdot 1},\cdots,x_{\cdot n} to be the columns of Γ\Gamma. It immediately follows from Lemma 3.1 that if one wishes to prove that Γ\Gamma does not satisfy ER(11), it suffices to show that, for instance the first basis vector e1e_{1} cannot be exactly reconstruct. This follows from

where absconv(S){\rm absconv}(S) is the convex hull of SSS\cup-S. Therefore, if

for some absolute constant c0c_{0}, then Γ\Gamma does not satisfy ER(11).

The proofs of Theorem C and of Theorem C′ follow from the construction of a random matrix ensemble for which (3.1) holds with probability larger than 1/21/2. We now turn on to such a construction.

Let η\eta be a selector (a {0,1}\{0,1\}-valued random variable) with mean δ\delta to be named later, and let ε\varepsilon be a symmetric {1,1}\{-1,1\}-valued random variable that is independent of η\eta. Fix R>0R>0 and set

Observe that if p2p\geq 2 and R1R\geq 1 then

and the last equivalence holds when R2δ1R^{2}\delta\lesssim 1 and Rpδ1R^{p}\delta\gtrsim 1. Fix 2<p2log(1/δ)2<p\leq 2\log(1/\delta) which will be specified later and set R=p(1/δ)1/pR=\sqrt{p}(1/\delta)^{1/p}. Since the function qq/δ1/qq\to\sqrt{q}/\delta^{1/q} is decreasing for 2q2log(1/δ)2\leq q\leq 2\log(1/\delta) one has that for 2qp2\leq q\leq p and for δ\delta that is small enough,

Note that x=z/zL2x=z/\left\|z\right\|_{L_{2}} is a mean-zero, variance one random variable that exhibits a ‘subgaussian’ moment behaviour only up to pp. Indeed, if 2qp2\leq q\leq p, zLqqzL2\|z\|_{L_{q}}\lesssim\sqrt{q}\|z\|_{L_{2}}, and if q>pq>p, zLqpδ1/q1/pzL2\left\|z\right\|_{L_{q}}\sim\sqrt{p}\delta^{1/q-1/p}\left\|z\right\|_{L_{2}}, which may be far larger than qzL2\sqrt{q}\left\|z\right\|_{L_{2}} if δ\delta is sufficiently small.

Let X=(x1,,xn)X=(x_{1},\ldots,x_{n}) be a vector whose coordinates are independent, distributed as xx and let Γ\Gamma be the measurement matrix generated by xx. Note that up to the normalization factor of zL2\left\|z\right\|_{L_{2}}, which is of the order of a constant when R2δ1R^{2}\delta\lesssim 1, NΓ\sqrt{N}\Gamma is a perturbation of a Rademacher matrix by a sparse matrix with few random spikes that are either RR or R-R.

the convex hull of (±vj)j=2n(\pm v_{j})_{j=2}^{n}.

We will show that with probability at least 1/21/2, NB2NV\sqrt{N}B_{2}^{N}\subset V and v12N\|v_{1}\|_{2}\leq\sqrt{N}, in three steps:

With probability at least 3/43/4, for every 1iN1\leq i\leq N there is yiBNy_{i}\in B_{\infty}^{N} for which yi+RfiVy_{i}+Rf_{i}\in V.

Hence, the claim follows by the union bound and integration with respect to the (εij)(\varepsilon_{ij}).

Next, it is straightforward to verify that when VV contains such a perturbation of RB1NRB_{1}^{N} (by vectors in BNB_{\infty}^{N}), it must also contain a large Euclidean ball, assuming that RR is large enough.

Let R>NR>N, and for every 1iN1\leq i\leq N, set yiBNy_{i}\in B_{\infty}^{N} and put vi=Rfi+yiv_{i}=Rf_{i}+y_{i}. If VV is a convex, centrally symmetric set, and if viVv_{i}\in V for every 1iN1\leq i\leq N then \big{(}R/\sqrt{N}-\sqrt{N}\big{)}B_{2}^{N}\subset V.

Proof. A separation argument shows that if \sup_{v\in V}|\bigl{<}v,w\bigr{>}|\geq\rho for every wSN1w\in S^{N-1}, then ρB2NV\rho B_{2}^{N}\subset V (indeed, otherwise there would be some xρB2N\Vx\in\rho B_{2}^{N}\backslash V; but it is impossible to separate xx and the convex and centrally symmetric VV using any norm-one functional).

To complete the proof, observe that for every wSN1w\in S^{N-1},

Applying Lemma 3.3, it follows that if R2NR\geq 2N then with probability at least 3/43/4, NB2NV\sqrt{N}B_{2}^{N}\subset V. Finally, if δ1/N\delta\lesssim 1/N then

and the same assertion holds for the normalized matrix Γ\Gamma, showing that it does not satisfy ER(11).

Of course, this assertion holds under several conditions on the parameters involved: namely, that R=p(1/δ)1/p2NR=\sqrt{p}(1/\delta)^{1/p}\geq 2N; that (\log N)/n\lesssim\delta\lesssim\log\big{(}en/N\big{)}/N; that R4δ1R^{4}\delta\lesssim 1; that p2log(1/δ)p\leq 2\log(1/\delta) and that δ1/N\delta\lesssim 1/N.

For instance, one may select δ(logN)/n\delta\sim(\log N)/n and p(logn)/logNp\sim(\log n)/\log N, in which case all these conditions are met; hence, with probability at least 1/21/2, Γ\Gamma does not satisfy ER(11), proving Theorem C. A similar calculation leads to the proof of Theorem C′.

Note that the construction leads to a stronger, non-uniform result, namely, that for every basis vector eke_{k}, with probability at least 1/21/2, eke_{k} is not the unique solution of min(t1: Γt=Γek)\min(\|t\|_{1}:\ \Gamma t=\Gamma e_{k}). In particular, uniformity over all supports of size 11 in the definition of ER(11) is not the reason why the moment assumption in Theorem A is required.

Results in the noisy measurements setup

In previous sections, we considered the idealized scenario, in which the data was noiseless. Here, we will study the noisy setup: one observes NN couples (zi,Xi)i=1N(z_{i},X_{i})_{i=1}^{N}, and each ziz_{i} is a noisy observation of \bigl{<}X_{i},x_{0}\bigr{>}:

The goal is to obtain as much information as possible on the unknown vector x0x_{0} with only the data (zi,Xi)i=1N(z_{i},X_{i})_{i=1}^{N} at one’s disposal, and for the sake of simplicity, we will assume that the gig_{i}’s are independent Gaussian random variables N(0,σ2){\cal N}(0,\sigma^{2}) that are also independent of the XiX_{i}’s.

Unlike the noiseless case, there is no hope of reconstructing x0x_{0} from the given data, and instead of exact reconstruction, there are three natural questions that one may consider:

These three problems are central in modern Statistics, and are featured in numerous statistical monographs, particularly in the context of the Gaussian regression model (Equation (4.1)).

Recently, all three problems have been recast in a ‘high-dimensional’ scenario, in which the number of observations NN may be much smaller than the ambient dimension nn. Unfortunately, such problems are often impossible to solve without additional assumptions, and just as in the noiseless case, the situation improves dramatically if x0x_{0} has some low-dimensional structure, for example, if it is ss-sparse. The aim is therefore to design a procedure that performs as if the true dimension of the problem is ss rather than nn, despite the noisy data.

Both procedures may be implemented effectively, and their estimation and de-noising properties have been obtained under some assumptions on the measurement matrix (see, e.g. or Chapters 7 and 8 in ).

In this section, we shall focus on two such conditions on the measurement matrix. The first, called the Compatibility Condition, was introduced in (see also Definition 2.1 in ); the second, the Restricted Eigenvalue Condition, was introduced in .

Let Γ\Gamma be an N×n{N\times n} matrix. For L>0L>0 and a set S{1,,n}S\subset\{1,\ldots,n\}, the compatibility constant associated with LL and SS is

where ζS\zeta_{S} (resp. ζSc\zeta_{S^{c}}) denotes a vector that is supported in SS (resp. ScS^{c}).

Γ\Gamma satisfies the Compatibility Condition for the set S0S_{0} with constants L>1L>1 and c0c_{0} if ϕ(L,S0)c0\phi(L,S_{0})\geq c_{0}; it satisfies the uniform Compatibility Condition (CC) of order ss if minSsϕ(L,S)c0\min_{|S|\leq s}\phi(L,S)\geq c_{0}.

A typical result for the LASSO in the Gaussian model (4.1) and when Γ\Gamma satisfies the Compatibility Condition, is Theorem 6.1 in :

Even though the Compatibility Condition in S0S_{0} suffices to show that the LASSO is an effective procedure, the fact remains that S0S_{0} is not known. And while a non-uniform approach is still possible (e.g., if Γ\Gamma is a random matrix, one may try showing that with high probability it satisfies the Compatibility Condition for the fixed, but unknown S0S_{0}), the uniform Compatibility Condition is a safer requirement – and the one we shall explore below.

Let Γ\Gamma be an N×n{N\times n} matrix. Given c01c_{0}\geq 1 and an integer 1smn1\leq s\leq m\leq n for which m+snm+s\leq n, the restricted eigenvalue constant is

The matrix Γ\Gamma satisfies the Restricted Eigenvalue Condition (REC) of order ss with a constant cc if κ(s,s,3)c\kappa(s,s,3)\geq c.

Estimation and de-noising results follow from Theorem 6.1 (for the Dantzig selector) and Theorem 6.2 (for the LASSO) in , when the measurement matrix Γ\Gamma, normalized by having the diagonal elements of ΓΓ\Gamma^{\top}\Gamma equal 11, satisfies the REC of an appropriate order and with a constant that is independent of the dimension. We also refer to Lemma 6.10 in for similar results that do not require normalization.

Because the two lead to bounds on the performance of the LASSO and the Dantzig selector, a question that comes to mind is whether there are matrices that satisfy the CC or the REC. And, as in Compressed Sensing, the only matrices that are known to satisfy those conditions for the optimal number of measurements (rows) are well-behaved random matrices (see for some examples).

Our aim in this final section is to extend our results to the noisy setup, by identifying almost necessary and sufficient moment assumptions for the CC and the REC. This turns out to be straightforward: on one hand, the proof of Theorem A actually provides a stronger quantitative version of the exact reconstruction property; on the other, the uniform compatibility condition can be viewed as a quantitative version of a geometric condition on the polytope ΓB1n\Gamma B_{1}^{n} that characterizes Exact Reconstruction. A similar observation is true for the REC: it can be viewed as a quantitative version of the null space property (see and below) which is also equivalent to the exact reconstruction property.

It is well known that Γ\Gamma satisfies ER(ss) if and only if ΓB1n\Gamma B_{1}^{n} has 2n2n vertices and ΓB1n\Gamma B_{1}^{n} is a centrally symmetric ss-neighbourly polytope. It turns out that this property is characterized by the uniform CC.

Let Γ\Gamma be an N×n{N\times n} matrix. The following are equivalent:

ΓB1n\Gamma B_{1}^{n} has 2n2n vertices and is ss-neighbourly,

\min\big{(}\phi(1,S):S\subset\{1,\ldots,n\},|S|\leq s\big{)}>0.

In particular, minSsϕ(L,S)\min_{|S|\leq s}\phi(L,S) for some L1L\geq 1 is a quantitative measure of the ss-neighbourly property of ΓB1n\Gamma B_{1}^{n}: if ΓB1n\Gamma B_{1}^{n} is ss-neighbourly and has 2n2n vertices then the two sets

are disjoint for every Ss|S|\leq s. However, minSsϕ(1,S)\min_{|S|\leq s}\phi(1,S) measures how far the two sets are from one another, uniformly over all subsets S{1,,n}S\subset\{1,\ldots,n\} of cardinality at most ss. Proof. Let C1,,CnC_{1},\ldots,C_{n} be the nn columns of Γ\Gamma. It follows from Proposition 2.2.13 and Proposition 2.2.16 in that ΓB1n\Gamma B_{1}^{n} has 2n2n vertices and is a centrally symmetric ss-neighbourly polytope if and only if for every S{1,,n}S\subset\{1,\ldots,n\} of cardinality Ss|S|\leq s and every choice of signs (εi){1,1}S(\varepsilon_{i})\in\{-1,1\}^{S},

As a consequence, (4.5) holds for every S{1,,n}S\subset\{1,\ldots,n\} of cardinality at most ss if and only if \min\big{(}\phi(1,S):S\subset\{1,\ldots,n\},|S|\leq s\big{)}>0.

An observation of a similar nature is true for the REC: it can be viewed as a quantitative measure of the null space property.

Let Γ\Gamma be an N×n{N\times n} matrix. Γ\Gamma satisfies the null space property of order ss if it is invertible in the cone

In , the authors prove that Γ\Gamma satisfies ER(ss) if and only if it has the null space property of order ss.

A natural way of quantifying the invertibility of Γ\Gamma in the cone (4.6) is to consider its smallest singular value, restricted to this cone, which is simply the REC κ(s,ns,1)\kappa(s,n-s,1). Unfortunately, statistical properties of the LASSO and of the Dantzig selector are not known under the assumption that κ(s,ns,1)\kappa(s,n-s,1) is an absolute constant (though if κ(s,s,3)\kappa(s,s,3) is an absolute constant, LASSO is known to be optimal ).

The main result of this section is the following:

Theorem E. Let L>0L>0, 1sn1\leq s\leq n and c0>0c_{0}>0. Under the same assumptions as in Theorem A and with the same probability estimate, \Gamma=N^{-1/2}\sum_{i=1}^{N}\bigl{<}X_{i},\cdot\bigr{>}f_{i} satisfies:

A uniform compatibility condition of order c1sc_{1}s, namely that

A restricted eigenvalue condition of order c2sc_{2}s, with

for any 1mn1\leq m\leq n, as long as (1+c0)2c2u2β/(16e)(1+c_{0})^{2}c_{2}\leq u^{2}\beta/(16e).

On the other hand, if Γ\Gamma is the matrix considered in Theorem C, then with probability at least 1/21/2, ϕ(1,{e1})=0\phi(1,\{e_{1}\})=0 and κ(1,m,1)=0\kappa(1,m,1)=0 for any 1mn1\leq m\leq n.

Just like Theorem A and Theorem C, Theorem E shows that the requirement that the coordinates of the measurement vector have logn\log n moments is almost a necessary and sufficient condition for the uniform Compatibility Condition and the Restricted Eigenvalue Condition to hold. Moreover, it shows the significance of the small-ball condition, even in the noisy setup.

It also follows from Theorem E that if XX satisfies the small-ball condition and its coordinates have logn\log n well-behaved moments as in Theorem A, then ΓB1n\Gamma B_{1}^{n} has 2n2n vertices and is ss-neighbourly with high probability for Nslog(en/s)N\sim s\log(en/s). In particular, this improves Theorem 4.3 in by a logarithmic factor for matrices generated by subexponential variables.

Consider γ=(ζSζSc)/ζSζSc2\gamma=(\zeta_{S}-\zeta_{S^{c}})/\left\|\zeta_{S}-\zeta_{S^{c}}\right\|_{2}. Since

it follows that \gamma\in\big{(}(1+L)\sqrt{|S|}\big{)}B_{1}^{n}\cap S^{n-1}.

Recall that by (2.7), if r=(1+L)2c1ssu2β/(16e)r=(1+L)^{2}c_{1}s\leq su^{2}\beta/(16e), then Γγ2(u2β)/4\left\|\Gamma\gamma\right\|_{2}\geq(u^{2}\beta)/4. Therefore,

and thus minSc1sϕ(L,S)u2β/4\min_{|S|\leq c_{1}s}\phi(L,S)\geq u^{2}\beta/4 for c_{1}=u^{2}\beta/\big{(}16e(1+L)^{2}\big{)}.

Turning to the REC, fix a constant c2c_{2} to be named later. Consider xx in the cone and let S0{1,,n}S_{0}\subset\{1,\ldots,n\} of cardinality S0c2s|S_{0}|\leq c_{2}s for which xS0c1c0xS01\left\|x_{S_{0}^{c}}\right\|_{1}\leq c_{0}\left\|x_{S_{0}}\right\|_{1}. Let S1{1,,n}S_{1}\subset\{1,\ldots,n\} be the set of indices of the mm largest coordinates of (xi)i=1n(|x_{i}|)_{i=1}^{n} that are outside S0S_{0} and put S01=S0S1S_{01}=S_{0}\cup S_{1}.

Observe that x1(1+c0)xS01(1+c0)S0x2\left\|x\right\|_{1}\leq(1+c_{0})\left\|x_{S_{0}}\right\|_{1}\leq(1+c_{0})\sqrt{|S_{0}|}\left\|x\right\|_{2}; hence x/\left\|x\right\|_{2}\in\big{(}(1+c_{0})\sqrt{|S_{0}|}\big{)}B_{1}^{n}\cap S^{n-1}. Applying (2.7) again, if (1+c0)2c2ssu2β/(16e)(1+c_{0})^{2}c_{2}s\leq su^{2}\beta/(16e), then \left\|\Gamma x\right\|_{2}\geq\big{(}(u^{2}\beta)/4\big{)}\left\|x\right\|_{2}. Thus,

and κ(c2s,m,c0)u2β/4\kappa(c_{2}s,m,c_{0})\geq u^{2}\beta/4 for any 1mn1\leq m\leq n, as long as (1+c0)2c2u2β/(16e)(1+c_{0})^{2}c_{2}\leq u^{2}\beta/(16e).

The proof of the second part of Theorem E is an immediate corollary of the construction used in Theorem C. Recall that with probability at least 1/21/2, Γe1absconv(Γej:j{2,...,n})\Gamma e_{1}\in{\rm absconv}(\Gamma e_{j}:j\in\{2,...,n\}). Setting J={e2,...,en}J=\{e_{2},...,e_{n}\}, there is ζB1J\zeta\in B_{1}^{J} for which Γe1Γζ2=0\left\|\Gamma e_{1}-\Gamma\zeta\right\|_{2}=0. Therefore, ϕ(1,{e1})=0\phi(1,\{e_{1}\})=0 and κ(1,m,1)=0\kappa(1,m,1)=0 for any 1mn1\leq m\leq n, as claimed.

The results obtained in Theorem A and in parts (1) and (2) of Theorem E are also valid for the normalized (columns wise) measurement matrix:

The proof is almost identical to the one used for Γ\Gamma itself, even though Γ1\Gamma_{1} does not have independent rows vectors, due to the normalization. For the sake of brevity, we will not present the straightforward proof of this observation.

Finally, the counterexample constructed in the proof of Theorem C and in which a typical Γ\Gamma does not satisfy ER(11), does not necessarily generate ΓB1n\Gamma B_{1}^{n} that is not ss-neighbourly. Indeed, an inspection of the construction shows that the reason ER(11) fails is that ΓB1n\Gamma B_{1}^{n} has less than 2n22n-2 vertices, rather than that ΓB1n\Gamma B_{1}^{n} is not ss-neighbourly. Thus, the question of whether a moment condition is necessary for the random polytope ΓB1n\Gamma B_{1}^{n} to be ss-neighbourly with probability at least 1/21/2 is still unresolved.

References