Simple, Efficient, and Neural Algorithms for Sparse Coding

Sanjeev Arora, Rong Ge, Tengyu Ma, Ankur Moitra

Introduction

Sparse coding or dictionary learning consists of learning to express (i.e., code) a set of input vectors, say image patches, as linear combinations of a small number of vectors chosen from a large dictionary. It is a basic task in many fields. In signal processing, a wide variety of signals turn out to be sparse in an appropriately chosen basis (see references in Mallat (1998)). In neuroscience, sparse representations are believed to improve energy efficiency of the brain by allowing most neurons to be inactive at any given time. In machine learning, imposing sparsity as a constraint on the representation is a useful way to avoid over-fitting. Additionally, methods for sparse coding can be thought of as a tool for feature extraction and are the basis for a number of important tasks in image processing such as segmentation, retrieval, de-noising and super-resolution (see references in Elad (2010)), as well as a building block for some deep learning architectures Ranzato et al. (2007). It is also a basic problem in linear algebra itself since it involves finding a better basis.

The notion was introduced by neuroscientists Olshausen and Field (1997a) who formalized it as follows: Given a dataset y(1),y(2),,y(p)ny^{(1)},y^{(2)},\ldots,y^{(p)}\in\Re^{n}, our goal is to find a set of basis vectors A1,A2,,AmnA_{1},A_{2},\ldots,A_{m}\in\Re^{n} and sparse coefficient vectors x(1),x(2),,x(p)mx^{(1)},x^{(2)},\ldots,x^{(p)}\in\Re^{m} that minimize the reconstruction error

where AA is the n×mn\times m coding matrix whose jjth column is AjA_{j} and S()S(\cdot) is a nonlinear penalty function that is used to encourage sparsity. This function is nonconvex because both AA and the x(i)x^{(i)}’s are unknown. Their paper as well as subsequent work chooses mm to be larger than nn (so-called overcomplete case) because this allows greater flexibility in adapting the representation to the data. We remark that sparse coding should not be confused with the related — and usually easier — problem of finding the sparse representations of the y(i)y^{(i)}’s given the coding matrix AA, variously called compressed sensing or sparse recovery Candes et al. (2006); Candes and Tao (2005).

Olshausen and Field also gave a local search/gradient descent heuristic for trying to minimize the nonconvex energy function (1). They gave experimental evidence that it produces coding matrices for image patches that resemble known features (such as Gabor filters) in V1V1 portion of the visual cortex. A related paper of the same authors Olshausen and Field (1997b) (and also Lewicki and Sejnowski (2000)) places sparse coding in a more familiar generative model setting whereby the data points y(i)y^{(i)}’s are assumed to be probabilistically generated according to a model y(i)=Ax(i)+\mboxnoisey^{(i)}=A^{*}\cdot{x^{*(i)}}+\mbox{noise} where x(1),x(2),,x(p){x^{*(1)}},{x^{*(2)}},\ldots,{x^{*(p)}} are samples from some appropriate distribution and AA^{*} is an unknown code. Then one can define the maximum likelihood estimate, and this leads to a different and usually more complicated energy function — and associated heuristics — compared to (1).

Surprisingly, maximum likelihood-based approaches seem unnecessary in practice and local search/gradient descent on the energy function (1) with hard constraints works well, as do related algorithms such as MOD Aharon et al. (2006) and kk-SVD Engan et al. (1999). In fact these methods are so effective that sparse coding is considered in practice to be a solved problem, even though it has no polynomial time algorithm per se.

Recently, there has been rapid progress on designing polynomial time algorithms for sparse coding with provable guarantees (the relevant papers are discussed below). All of these adopt the generative model viewpoint sketched above. But the surprising success of the simple descent heuristics has remained largely unexplained. Empirically, these heuristics far out perform — in running time, sample complexity, and solution quality — the new algorithms, and this (startling) observation was in fact the starting point for the current work.

Of course, the famous example of simplex vs ellipsoid for linear programming reminds us that it can be much more challenging to analyze the behavior of an empirically successful algorithm than it is to to design a new polynomial time algorithm from scratch! But for sparse coding the simple intuitive heuristics are important for another reason beyond just their algorithmic efficiency: they appear to be implementable in neural architectures. (Roughly speaking, this means that the algorithm stores the code matrix AA as synapse weights in a neural network and updates the entries using differences in potentials of the synapse’s endpoints.) Since neural computation — and also deep learning — have proven to be difficult to analyze in general, analyzing sparse coding thoroughly seems to be a natural first step for theory. Our algorithm is a close relative of the Olshausen-Field algorithm and thus inherits its neural implementability; see Appendix E for further discussion.

Here we present a rigorous analysis of the simple energy minimization heuristic, and as a side benefit this yields bounds on running time and sample complexity for sparse coding that are better (in some cases, dramatically so) than the algorithms in recent papers. This adds to the recent literature on analyzing alternating minimization Jain et al. (2013); Hardt (2013); Netrapalli et al. (2013, 2014) but these work in a setting where there is a convex program that is known to work too, and in our setting, the only known convex program runs in time exponential in a natural parameter Barak et al. (2014).

1 Recent Work

A common thread in recent work on sparse coding is to assume a generative model; the precise details vary, but each has the property that given enough samples the solution is essentially unique. Spielman et al. (2012) gave an algorithm that succeeds when AA^{*} has full column rank (in particular mnm\leq n) which works up to sparsity roughly n\sqrt{n}. However this algorithm is not applicable in the more prevalent overcomplete setting. Arora et al. (2014) and Agarwal et al. (2013, 2014) independently gave algorithms in the overcomplete case assuming that AA^{*} is μ\mu-incoherent (which we define in the next section). The former gave an algorithm that works up to sparsity n1/2γ/μn^{1/2-\gamma}/\mu for any γ>0\gamma>0 but the running time is nΘ(1/γ)n^{\Theta(1/\gamma)}; Agarwal et al. (2013, 2014) gave an algorithm that works up to sparsity either n1/4/μn^{1/4}/\mu or n1/6/μn^{1/6}/\mu depending on the particular assumptions on the model. These works also analyze alternating minimization but assume that it starts from an estimate AA that is column-wise 1/\mboxpoly(n)1/\mbox{poly}(n)-close to AA^{*}, in which case the objective function is essentially convex.

Barak et al. (2014) gave a new approach based on the sum-of-squares hierarchy that works for sparsity up to n1γn^{1-\gamma} for any γ>0\gamma>0. But in order to output an estimate that is column-wise ϵ\epsilon-close to AA^{*} the running time of the algorithm is n1/ϵO(1)n^{1/\epsilon^{O(1)}}. In most applications, one needs to set (say) ϵ=1/k\epsilon=1/k in order to get a useful estimate. However in this case their algorithm runs in exponential time. The sample complexity of the above algorithms is also rather large, and is at least Ω(m2)\Omega(m^{2}) if not much larger. Here we will give simple and more efficient algorithms based on alternating minimization whose column-wise error decreases geometrically, and that work for sparsity up to n1/2/μlognn^{1/2}/\mu\log n. We remark that even empirically, alternating minimization does not appear to work much beyond this bound.

2 Model, Notation and Results

We will work with the following family of generative models (similar to those in earlier papers)The casual reader should just think of xx^{*} as being drawn from some distribution that has independent coordinates. Even in this simpler setting —which has polynomial time algorithms using Independent Component Analysis—we do not know of any rigorous analysis of heuristics like Olshausen-Field. The earlier papers were only interested in polynomial-time algorithms, so did not wish to assume independence.:

Each sample is generated as y=Ax+\mboxnoisey=A^{*}x^{*}+\mbox{noise} where AA^{*} is a ground truth dictionary and xx^{*} is drawn from an unknown distribution D\mathcal{D} where

the support S=\mboxsupp(x)S=\mbox{supp}(x^{*}) is of size at most kk, Pr\/[iS]=Θ(k/m)\mathop{\bf Pr\/}[i\in S]=\Theta(k/m) and Pr\/[i,jS]=Θ(k2/m2)\mathop{\bf Pr\/}[i,j\in S]=\Theta(k^{2}/m^{2})

the distribution is normalized so that E\/[xixj0]=0\mathop{\bf E\/}[x^{*}_{i}|x^{*}_{j}\neq 0]=0; E\/[xi2xi0]=1\mathop{\bf E\/}[{x^{*}_{i}}^{2}|x^{*}_{i}\neq 0]=1 and when xi0x^{*}_{i}\neq 0, xiC|x^{*}_{i}|\geq C for some constant C1C\leq 1 and

the non-zero entries are pairwise independent and subgaussian, conditioned on the support.

The noise is Gaussian and independent across coordinates.

Such models are natural since the original motivation behind sparse coding was to discover a code whose representations have the property that the coordinates are almost independent. We can relax most of the requirements above, at the expense of further restricting the sparsity, but will not detail such tradeoffs.

The rest of the paper ignores the iid noise: it has little effect on our basic steps like computing inner products of samples or taking singular vectors, and easily tolerated so long as it stays smaller than the “signal.”

We assume AA^{*} is an incoherent dictionary, since these are widespread in signal processing Elad (2010) and statistics Donoho and Huo (1999), and include various families of wavelets, Gabor filters as well as randomly generated dictionaries.

An n×mn\times m matrix AA whose columns are unit vectors is μ\mu-incoherent if for all iji\neq j we have Ai,Ajμ/n.\langle A_{i},A_{j}\rangle\leq\mu/\sqrt{n}.

We also require that A=O(m/n)\|A^{*}\|=O(\sqrt{m/n}). However this can be relaxed within polylogarithmic factors by tightening the bound on the sparsity by the same factor. Throughout this paper we will say that AsA^{s} is (δ,κ)(\delta,\kappa)-near to AA^{*} if after a permutation and sign flips its columns are within distance δ\delta and we have AsAκA\|A^{s}-A^{*}\|\leq\kappa\|A^{*}\|. See also Definition 3.1. We will use this notion to measure the progress of our algorithms. Moreover we will use g(n)=O(f(n))g(n)=O^{*}(f(n)) to signify that g(n)g(n) is upper bounded by Cf(n)Cf(n) for some small enough constant CC. Finally, throughout this paper we will assume that kO(n/μlogn)k\leq O^{*}(\sqrt{n}/\mu\log n) and m=O(n)m=O(n). Again, mm can be allowed to be higher by lowering the sparsity. We assume all these conditions in our main theorems.

In Section 2 we give a general framework for analyzing alternating minimization. Instead of thinking of the algorithm as trying to minimize a known non-convex function, we view it as trying to minimize an unknown convex function. Various update rules are shown to provide good approximations to the gradient of the unknown function. See Lemma 3.4, Lemma B.1 and Lemma B.7 for examples. We then leverage our framework to analyze existing heuristics and to design new ones also with provable guarantees. In Section 3, we prove:

There is a neurally plausible algorithm which when initialized with an estimate A0A^{0} that is (δ,2)(\delta,2)-near to AA^{*} for δ=O(1/logn)\delta=O^{*}(1/\log n), converges at a geometric rate to AA^{*} until the column-wise error is O(k/n)O(\sqrt{k/n}). Furthermore the running time is O(mnp)O(mnp) and the sample complexity is p=O~(mk)p=\widetilde{O}(mk) for each step.

Additionally we give a neural architecture implementing our algorithm in Appendix E. To the best of our knowledge, this is the first neurally plausible algorithm for sparse coding with provable convergence.

Having set up our general framework and analysis technique we can use it on other variants of alternating minimization. Section 3.1.2 gives a new update rule whose bias (i.e., error) is negligible:

There is an algorithm which when initialized with an estimate A0A^{0} that is (δ,2)(\delta,2)-near to AA^{*} for δ=O(1/logn)\delta=O^{*}(1/\log n), converges at a geometric rate to AA^{*} until the column-wise error is O(nω(1))O(n^{-\omega(1)}). Furthermore each step runs in time O(mnp)O(mnp) and the sample complexity pp is polynomial.

This algorithm is based on a modification where we carefully project out components along the column currently being updated. We complement the above theorems by revisiting the Olshausen-Field rule and analyzing a variant of it in Section 3.1.1 (Theorem 3.6). However its analysis is more complex because we need to bound some quadratic error terms. It uses convex programming.

What remains is to give a method to initialize these iterative algorithms. We give a new approach based on pair-wise reweighting and we prove that it returns an estimate A0A^{0} that is (δ,2)(\delta,2)-near to AA^{*} for δ=O(1/logn)\delta=O^{*}(1/\log n) with high probability. As an additional benefit, this algorithm can be used even in settings where mm is not known and this could help solve another problem in practice — that of model selection. In Section 5 we prove:

There is an algorithm which returns an estimate A0A^{0} that is (δ,2)(\delta,2)-near to AA^{*} for δ=O(1/logn)\delta=O^{*}(1/\log n). Furthermore the running time is O~(mn2p)\widetilde{O}(mn^{2}p) and the sample complexity p=O~(mk)p=\widetilde{O}(mk).

This algorithm also admits a neural implementation, which is sketched in Appendix E. The proof currently requires a projection step that increases the run time though we suspect it is not needed.

We remark that these algorithms work up to sparsity O(n/μlogn)O^{*}(\sqrt{n}/\mu\log n) which is within a logarithmic factor of the information theoretic threshold for sparse recovery on incoherent dictionaries Donoho and Huo (1999); Gribonval and Nielsen (2003). All previous known algorithms that approach Arora et al. (2014) or surpass this sparsity Barak et al. (2014) run in time exponential in some natural parameter. Moreover, our algorithms are simple to describe and implement, and involve only basic operations. We believe that our framework will have applications beyond sparse coding, and could be used to show that simple, iterative algorithms can be powerful in other contexts as well by suggesting new ways to analyze them.

Our Framework, and an Overview

Here we describe our framework for analyzing alternating minimization. The generic scheme we will be interested in is given in Algorithm 1 and it alternates between updating the estimates AA and XX. It is a heuristic for minimizing the non-convex function in (1) where the penalty function is a hard constraint. The crucial step is if we fix XX and compute the gradient of (1) with respect to AA, we get:

We then take a step in the opposite direction to update AA. Here and throughout the paper η\eta is the learning rate, and needs to be set appropriately. The challenge in analyzing this general algorithm is to identify a suitable “measure of progress”— called a Lyapunov function in dynamical systems and control theory — and show that it improves at each step (with high probability over the samples). We will measure the progress of our algorithms by the maximum column-wise difference between AA and AA^{*}.

In the next subsection, we identify sufficient conditions that guarantee progress. They are inspired by proofs in convex optimization. We view Algorithm 1 as trying to minimize an unknown convex function, specifically f(A)=E(A,X)f(A)=\mathcal{E}(A,X^{*}), which is strictly convex and hence has a unique optimum that can be reached via gradient descent. This function is unknown since the algorithm does not know XX^{*}. The analysis will show that the direction of movement is correlated with AAsA^{*}-A^{s}, which in turn is the gradient of the above function. An independent paper of Balakrishnan et al. (2014) proposes a similar framework for analysing EM algorithms for hidden variable models. The difference is that their condition is really about the geometry of the objective function, though ours is about the property of the direction of movement. Therefore we have the flexibility to choose different decoding procedures. This flexibility allows us to have a closed form of XsX^{s} and obtain a useful functional form of gsg^{s}. The setup is reminiscent of stochastic gradient descent, which moves in a direction whose expectation is the gradient of a known convex function. By contrast, here the function f()f() is unknown, and furthermore the expectation of gsg^{s} is not the true gradient and has bias. Due to the bias, we will only be able to prove that our algorithms reach an approximate optimum up to some error whose magnitude is determined by the bias. We can make the bias negligible using more complicated algorithms.

Consider a general iterative algorithm that is trying to get to a desired solution zz^{*} (in our case z=Aiz^{*}=A^{*}_{i} for some ii). At step ss it starts with a guess zsz^{s}, computes some direction gsg^{s}, and updates its estimate as: zs+1=zsηgsz^{s+1}=z^{s}-\eta g^{s}. The natural progress measure is zzs2\|z^{*}-z^{s}\|^{2}, and below we will identify a sufficient condition for it to decrease in each step:

A vector gsg^{s} is (α,β,ϵs)(\alpha,\beta,\epsilon_{s})-correlated with zz^{*} if

Remark: The traditional analysis of convex optimization corresponds to the setting where zz^{*} is the global optimum of some convex function ff, and ϵs=0\epsilon_{s}=0. Specifically, if f()f(\cdot) is 2α2\alpha-strongly convex and 1/(2β)1/(2\beta)-smooth, then gs=f(zs)g^{s}=\nabla f(z^{s}) (α,β,0)(\alpha,\beta,0)-correlated with zz^{*}. Also we will refer to ϵs\epsilon_{s} as the bias.

Suppose gsg^{s} satisfies Definition 2.1 for s=1,2,,Ts=1,2,\dots,T, and η\eta satisfies 0<η2β0<\eta\leq 2\beta and ϵ=maxs=1Tϵs\epsilon=\max_{s=1}^{T}\epsilon_{s}. Then for any s=1,,Ts=1,\dots,T,

In particular, the update rule above converges to zz^{*} geometrically with systematic error ϵ/α\epsilon/\alpha in the sense that

Furthermore, if ϵs<α2zsz2\epsilon_{s}<\frac{\alpha}{2}\|z^{s}-z^{*}\|^{2} for s=1,,Ts=1,\dots,T, then

The proof closely follows existing proofs in convex optimization:

Then solving this recurrence we have zs+1z2(12αη)s+1R2+ϵα\|z^{s+1}-z^{*}\|^{2}\leq(1-2\alpha\eta)^{s+1}R^{2}+\frac{\epsilon}{\alpha} where R=z0zR=\|z^{0}-z^{*}\|. And furthermore if ϵs<α2zsz2\epsilon_{s}<\frac{\alpha}{2}\|z^{s}-z^{*}\|^{2} we have instead

and this yields the second part of the theorem too.

In fact, we can extend the analysis above to obtain identical results for the case of constrained optimization. Suppose we are interested in optimizing a convex function f(z)f(z) over a convex set B\mathcal{B}. The standard approach is to take a step in the direction of the gradient (or gsg^{s} in our case) and then project into B\mathcal{B} after each iteration, namely, replace zs+1z^{s+1} by \mboxProjBzs+1\mbox{Proj}_{\mathcal{B}}\,z^{s+1} which is the closest point in B\mathcal{B} to zs+1z^{s+1} in Euclidean distance. It is well-known that if zBz^{*}\in\mathcal{B}, then \mboxProjB  zzzz\|\mbox{Proj}_{\mathcal{B}}\;z-z^{*}\|\leq\|z-z^{*}\|. Therefore we obtain the following as an immediate corollary to the above analysis:

Suppose gsg^{s} satisfies Definition 2.1 for s=1,2,,Ts=1,2,\dots,T and set 0<η2β0<\eta\leq 2\beta and ϵ=maxs=1Tϵs\epsilon=\max_{s=1}^{T}\epsilon_{s}. Further suppose that zz^{*} lies in a convex set B\mathcal{B}. Then the update rule zs+1=\mboxProjB(zsηgs)z^{s+1}=\mbox{Proj}_{\mathcal{B}}(z^{s}-\eta g^{s}) satisfies that for any s=1,,Ts=1,\dots,T,

In particular, zsz^{s} converges to zz^{*} geometrically with systematic error ϵ/α\epsilon/\alpha. Additionally if ϵs<α2zsz2\epsilon_{s}<\frac{\alpha}{2}\|z^{s}-z^{*}\|^{2} for s=1,,Ts=1,\dots,T, then

What remains is to derive a functional form for various update rules and show that these rules move in a direction gsg^{s} that approximately points in the direction of the desired solution zz^{*} (under the assumption that our data is generated from a stochastic model that meets certain conditions).

An Overview of Applying Our Framework

Our framework clarifies that any improvement step meeting Definition 2.1 will also converge to an approximate optimum, which enables us to engineer other update rules that turn out to be easier to analyze. Indeed we first analyze a simpler update rule with gs=E\/[(yAsx)\mboxsgn(xT)]g^{s}=\mathop{\bf E\/}[(y-A^{s}x)\mbox{sgn}(x^{T})] in Section 3. Here \mboxsgn()\mbox{sgn}(\cdot) is the coordinate-wise sign function. We then return to the Olshausen-Field update rule and analyze a variant of it in Section 3.1.1 using approximate projected gradient descent. Finally, we design a new update rule in Section 3.1.2 where we carefully project out components along the column currently being updated. This has the effect of replacing one error term with another and results in an update rule with negligible bias. The main steps in showing that these update rules fit into our framework are given in Lemma 3.4, Lemma B.1 and Lemma B.7.

The rest of the analysis can be described as follows: If the signs and support of xx are recovered correctly, then alternating minimization makes progress in each step. In fact this holds each for much larger values of kk than we consider; as high as n/(logn)O(1)n/(\log n)^{O(1)}. (However, the explicit decoding rule fails for k>n/μlognk>\sqrt{n}/\mu\log n.) Thus it only remains to properly initialize A0A^{0} so that it is close enough to AA^{*} to let the above decoding rule succeed. In Section 5 we give a new initialization procedure based on pair-wise reweighting that we prove works with high probability. This section may be of independent interest, since this algorithm can be used even in settings where mm is not known and could help solve another problem in practice — that of model selection. See Lemma 5.2.

A Neurally Plausible Algorithm with Provable Guarantees

Here we will design and analyze a neurally plausible algorithm for sparse coding which is given in Algorithm 2, and we give a neural architecture implementing our algorithm in Appendix E. The fact that such a simple algorithm provably works sheds new light on how sparse coding might be accomplished in nature. Here and throughout this paper we will work with the following measure of closeness:

AA is δ\delta-close to AA^{*} if there is a permutation π:[m][m]\pi:[m]\rightarrow[m] and a choice of signs σ:[m]{±1}\sigma:[m]\rightarrow\{\pm 1\} such that \|\sigma(i)A_{\pi(i)}-A^{*}_{i}\|\leq\delta\mbox{ for alli} We say AA is (δ,κ)(\delta,\kappa)-near to AA^{*} if in addition AAκA\|A-A^{*}\|\leq\kappa\|A^{*}\| too.

This is a natural measure to use, since we can only hope to learn the columns of AA^{*} up to relabeling and sign-flips. In our analysis, we will assume throughout that π()\pi(\cdot) is the identity permutation and σ()+1\sigma(\cdot)\equiv+1 because our family of generative models is invariant under this relabeling and it will simplify our notation.

Let \mboxsgn()\mbox{sgn}(\cdot) denote the coordinate-wise sign function and recall that η\eta is the learning rate, which we will soon set. Also we fix both δ,δ0=O(1/logn)\delta,\delta_{0}=O^{*}(1/\log n). We will also assume that in each iteration, our algorithm is given a fresh set of pp samples. Our main theorem is:

Suppose that A0A^{0} is (2δ,2)(2\delta,2)-near to AA^{*} and that η=Θ(m/k)\eta=\Theta(m/k). Then if each update step in Algorithm 2 uses p=Ω~(mk)p=\widetilde{\Omega}(mk) fresh samples, we have

for some 0<τ<1/20<\tau<1/2 and for any s=1,2,...,Ts=1,2,...,T. In particular it converges to AA^{*} geometrically, until the column-wise error is O(k/n)O(\sqrt{k/n}).

Our strategy is to prove that g^s\widehat{g}^{s} is (α,β,ϵ)(\alpha,\beta,\epsilon)-correlated (see Definition 2.1) with the desired solution AA^{*}, and then to prove that A\|A\| never gets too large. We will first prove that if AA is somewhat close to AA^{*} then the estimate xx for the representation almost always has the correct support. Here and elsewhere in the paper, we use “very high probability” to mean that an event happens with probability at least 11/nω(1)1-1/n^{\omega(1)}.

Suppose that AsA^{s} is δ\delta-close to AA^{*}. Then with very high probability over the choice of the random sample y=Axy=A^{*}x^{*}:

We prove a more general version of this lemma (Lemma A.1) in Appendix A; it is an ingredient in analyzing all of the update rules we consider in this paper. However this is just one step on the way towards proving that g^s\widehat{g}^{s} is correlated with the true solution.

The next step in our proof is to use the properties of the generative model to derive a new formula for g^s\widehat{g}^{s} that is more amenable to analysis. We define gsg^{s} to be the expectation of g^s\widehat{g}^{s}

where x:=thresholdC/2((As)Ty)x:=\textrm{threshold}_{C/2}((A^{s})^{T}y) is the decoding of yy. Let qi=Pr\/[xi0]q_{i}=\mathop{\bf Pr\/}[x^{*}_{i}\neq 0] and qi,j=Pr\/[xixj0]q_{i,j}=\mathop{\bf Pr\/}[x^{*}_{i}x^{*}_{j}\neq 0], and define pi=E\/[xi\mboxsgn(xi)xi0]p_{i}=\mathop{\bf E\/}[x^{*}_{i}\mbox{sgn}(x^{*}_{i})|x^{*}_{i}\neq 0].

Here and in the rest of the paper, we will let γ\gamma denote any vector whose norm is negligible (i.e. smaller than 1/nC1/n^{C} for any large constant C>1C>1). This will simplify our calculations. Also let AiA^{*}_{-i} denote the matrix obtained from deleting the iith column of AA^{*}. The following lemma is the main step in our analysis.

Suppose that AsA^{s} is (2δ,2)(2\delta,2)-near to AA^{*}. Then the update step in Algorithm 2 takes the form E\displaylimits[Ais+1]=Aisηgis\mathop{\mathbf{E}}\displaylimits[A_{i}^{s+1}]=A_{i}^{s}-\eta g_{i}^{s} where gis=piqi(λisAisAi+ϵis±γ)g_{i}^{s}=p_{i}q_{i}\left(\lambda^{s}_{i}A_{i}^{s}-A^{*}_{i}+\epsilon^{s}_{i}\pm\gamma\right), and λis=Ais,Ai\lambda^{s}_{i}=\langle A_{i}^{s},A^{*}_{i}\rangle and

Moreover the norm of ϵis\epsilon^{s}_{i} can be bounded as ϵisO(k/n)\|\epsilon^{s}_{i}\|\leq O(k/n).

Note that piqip_{i}q_{i} is a scaling constant and λi1\lambda_{i}\approx 1; hence from the above formula we should expect that gisg_{i}^{s} is well-correlated with AisAiA_{i}^{s}-A^{*}_{i}.

Since AsA^{s} is (2δ,2)(2\delta,2)-near to AA^{*}, AsA^{s} is 2δ2\delta-close to AA^{*}. We can now invoke Lemma 3.3 and conclude that with high probability, \mboxsgn(x)=\mboxsgn(x)\mbox{sgn}(x^{*})=\mbox{sgn}(x). Let Fx\mathcal{F}_{x^{*}} be the event that \mboxsgn(x)=\mboxsgn(x)\mbox{sgn}(x^{*})=\mbox{sgn}(x), and let 1Fx\mathbf{1}_{\mathcal{F}_{x^{*}}} be the indicator function of this event.

The key is that this allows us to essentially replace \mboxsgn(x)\mbox{sgn}(x) with \mboxsgn(x)\mbox{sgn}(x^{*}). Moreover, let S=\mboxsupp(x)S=\mbox{supp}(x^{*}). Note that when Fx\mathcal{F}_{x^{*}} happens SS is also the support of xx. Recall that according to the decoding rule (where we have replaced AsA^{s} by BB for notational simplicity) x=thresholdC/2(BTy)x=\textrm{threshold}_{C/2}(B^{T}y). Therefore, xS=(BTy)S=BSTy=BSTAxx_{S}=(B^{T}y)_{S}=B_{S}^{T}y=B_{S}^{T}A^{*}x^{*}. Using the fact that the support of xx is SS again, we have Bx=BSTBSAxBx=B_{S}^{T}B_{S}A^{*}x^{*}. Plugging it into equation (3):

where again we have used the fact that Fx\mathcal{F}_{x^{*}} happens with very high probability. Now we rewrite the expectation above using subconditioning where we first choose the support SS of xx^{*}, and then we choose the nonzero values xSx^{*}_{S}.

where we use the fact that E\/[xi\mboxsgn(xi)S]=pi\mathop{\bf E\/}[x^{*}_{i}\cdot\mbox{sgn}(x^{*}_{i})|S]=p_{i}. Let R=S{i}R=S-\{i\}. Using the fact that BSBST=BiBiT+BRBRTB_{S}B_{S}^{T}=B_{i}B_{i}^{T}+B_{R}B_{R}^{T}, we can split the quantity above into two parts,

where \mboxdiag(qi,j)\mbox{diag}(q_{i,j}) is a m×mm\times m diagonal matrix whose (j,j)(j,j)-th entry is equal to qi,jq_{i,j}, and BiB_{-i} is the matrix obtained by zeroing out the iith column of BB. Here we used the fact that Pr\/[iS]=qi\mathop{\bf Pr\/}[i\in S]=q_{i} and Pr\/[i,jS]=qij\mathop{\bf Pr\/}[i,j\in S]=q_{ij}.

Now we set B=AsB=A^{s}, and rearranging the terms, we have gis=piqi(Ais,AiAisAi+ϵis±γ)g_{i}^{s}=p_{i}q_{i}\left(\langle A_{i}^{s},A^{*}_{i}\rangle A_{i}^{s}-A^{*}_{i}+\epsilon^{s}_{i}\pm\gamma\right) where \epsilon_{i}^{s}=\Big{(}A^{s}_{-i}\mbox{diag}(q_{i,j})\left(A_{-i}^{s}\right)^{T}\Big{)}A^{*}_{i}/q_{i}, which can be bounded as follows

where the last step used the fact that maxijqi,jminqiO(k/m)\frac{\max_{i\neq j}q_{i,j}}{\min q_{i}}\leq O(k/m), which is an assumption of our generative model.

We will be able to reuse much of this analysis in Lemma B.1 and Lemma B.7 because we have derived to for a general decoding matrix BB.

In Section 4 we complete the analysis of Algorithm 2 in the infinite sample setting. In particular, in Section 4.1, we prove that if AsA^{s} is (2δ,2)(2\delta,2)-near to AA^{*} then gisg_{i}^{s} is indeed (α,β,ϵ)(\alpha,\beta,\epsilon)-correlated with AiA_{i} (Lemma 4.3). Finally we prove that if AsA^{s} is (2δ,2)(2\delta,2)-near to AA^{*} then As+1A2A\|A^{s+1}-A^{*}\|\leq 2\|A^{*}\| (Lemma 4.6). These lemmas together with Theorem 2.2 imply Theorem 4.1, the simplified version of Theorem 3.2 where the number of samples pp is assumed to be infinite (i.e. we have access to the true expectation gsg^{s}). In Appendix D we prove the sample complexity bounds we need and this completes the proof of Theorem 3.2.

Here we apply our framework to design and analyze further variants of alternating minimization.

In this subsection we analyze a variant of the Olshausen-Field update rule. However there are quadratic error terms that arise in the expressions we derive for gsg^{s} and bounding them is more challenging. We will also need to make (slightly) stronger assumptions on the distributional model that for distinct i1,i2,i3i_{1},i_{2},i_{3} we have qi1,i2,i3=O(k3/m3)q_{i_{1},i_{2},i_{3}}=O(k^{3}/m^{3}) where qi1,i2,i3=Pr\/[i1,i2,i3S]q_{i_{1},i_{2},i_{3}}=\mathop{\bf Pr\/}[i_{1},i_{2},i_{3}\in S].

Suppose that A0A^{0} is (2δ,2)(2\delta,2)-near to AA^{*} and that η=Θ(m/k)\eta=\Theta(m/k). There is a variant of Olshausen-Field (given in Algorithm 4 in Appendix B.1) for which at each step ss we have

for some 0<τ<1/20<\tau<1/2 and for any s=1,2,...,Ts=1,2,...,T. In particular it converges to AA^{*} geometrically until the error in Frobenius norm is O(mk/n)O(\sqrt{m}k/n).

We defer the proof of the main theorem to Appendix B.1. Currently it uses a projection step (using convex programming) that may not be needed but the proof requires it.

1.2 Removing the Systemic Error

In this subsection, we design and analyze a new update rule that converges geometrically until the column-wise error is nω(1)n^{-\omega(1)}. The basic idea is to engineer a new decoding matrix that projects out the components along the column currently being updated. This has the effect of replacing a certain error term in Lemma 3.4 with another term that goes to zero as AA gets closer to AA^{*} (the earlier rules we have analyzed do not have this property).

We will use B(s,i)B^{(s,i)} to denote the decoding matrix used when updating the iith column in the ssth step. Then we set B^{(s,i)}_{i}=A_{i}\mbox{ and }B^{(s,i)}_{j}=\mbox{Proj}_{A_{i}^{\perp}}A_{j}\mbox{ forj\neq i.} Note that Bi(s,i)B^{(s,i)}_{-i} (i.e. B(s,i)B^{(s,i)} with the iith column removed) is now orthogonal to AiA_{i}. We will rely on this fact when we bound the error. We defer the proof of the main theorem to Appendix B.2.

Suppose that A0A^{0} is (2δ,2)(2\delta,2)-near to AA^{*} and that η=Θ(m/k)\eta=\Theta(m/k). There is an algorithm (given in Algorithm 5 given in Appendix B.2) for which at each step ss, we have

for some 0<τ<1/20<\tau<1/2 and for any s=1,2,...,Ts=1,2,...,T. In particular it converges to AA^{*} geometrically until the column-wise error is nω(1)n^{-\omega(1)}.

Analysis of the Neural Algorithm

In Lemma 3.4 we gave a new (and more useful) expression that describes the update direction under the assumptions of our generative model. Here we will make crucial use of Lemma 3.4 in order to prove that gisg^{s}_{i} is (α,β,ϵ)(\alpha,\beta,\epsilon)-correlated with AiA_{i} (Lemma 4.2). Moreover we use Lemma 3.4 again to show that As+1A2A\|A^{s+1}-A^{*}\|\leq 2\|A^{*}\| (Lemma 4.6). Together, these auxiliary lemmas imply that the column-wise error decreases in the next step and moreover the errors across columns are uncorrelated.

We assume that each iteration of Algorithm 2 takes infinite number of samples, and prove the corresponding simplified version of Theorem 3.2. The proof of this Theorem highlights the essential ideas of behind the proof of the Theorem 3.2, which can be found at Appendix D.

Suppose that A0A^{0} is (2δ,2)(2\delta,2)-near to AA^{*} and that η=Θ(m/k)\eta=\Theta(m/k). Then if each update step in Algorithm 2 uses infinite number of samples at each iteration, we have

for some 0<τ<1/20<\tau<1/2 and for any s=1,2,...,Ts=1,2,...,T. In particular it converges to AA^{*} geometrically until the column-wise error is O(k/n)O(k/n).

The proof is deferred to the end of this section.

In Lemma 3.4 we showed that gis=piqi(λiAisAi+ϵis+γ)g^{s}_{i}=p_{i}q_{i}(\lambda_{i}A_{i}^{s}-A^{*}_{i}+\epsilon^{s}_{i}+\gamma) where λi=Ai,Ai\lambda_{i}=\langle A_{i},A^{*}_{i}\rangle. Here we will prove that gisg_{i}^{s} is (α,β,ϵ)(\alpha,\beta,\epsilon)-correlated with AiA^{*}_{i}. Recall that we fixed δ=O(1/logn)\delta=O^{*}(1/\log n). The main intuition is that gisg^{s}_{i} is mostly equal to piqi(AisAi)p_{i}q_{i}(A_{i}^{s}-A^{*}_{i}) with a small error term.

If a vector gisg^{s}_{i} is equal to 4α(AisAi)+v4\alpha(A_{i}^{s}-A^{*}_{i})+v where vαAisAi+ζ\|v\|\leq\alpha\|A_{i}^{s}-A^{*}_{i}\|+\zeta, then gisg^{s}_{i} is (α,1/100α,ζ2/α)(\alpha,1/100\alpha,\zeta^{2}/\alpha)-correlated with AiA^{*}_{i}, more specifically,

In particular, gisg_{i}^{s} is (α,β,ϵ)(\alpha,\beta,\epsilon)-correlated with AiA^{*}_{i}, where α=Ω(k/m)\alpha=\Omega(k/m), βΩ(m/k)\beta\geq\Omega(m/k) and ϵ=O(k3/mn2)\epsilon=O(k^{3}/mn^{2}). We can now apply Theorem 2.2 and conclude that the column-wise error gets smaller in the next step:

If AsA^{s} is (2δ,2)(2\delta,2)-near to AA^{*} and ηmini(piqi(1δ))=O(m/k)\eta\leq\min_{i}(p_{i}q_{i}(1-\delta))=O(m/k), then gis=piqi(λiAisAi+ϵis+γ)g_{i}^{s}=p_{i}q_{i}(\lambda_{i}A_{i}^{s}-A^{*}_{i}+\epsilon^{s}_{i}+\gamma) is (Ω(k/m),Ω(m/k),O(k3/mn2))(\Omega(k/m),\Omega(m/k),O(k^{3}/mn^{2}))-correlated with AiA^{*}_{i}, and further

Using these bounds, we will show gi,AiAiαAiAi21100αgi2+ζ2/α0\langle g_{i},A_{i}-A^{*}_{i}\rangle-\alpha\|A_{i}-A^{*}_{i}\|^{2}-\frac{1}{100\alpha}\|g_{i}\|^{2}+\zeta^{2}/\alpha\geq 0. Indeed we have

We use the form in Lemma 3.4, gis=piqi(λiAisAi+ϵis+γ)g^{s}_{i}=p_{i}q_{i}(\lambda_{i}A_{i}^{s}-A^{*}_{i}+\epsilon^{s}_{i}+\gamma) where λi=Ai,Ai\lambda_{i}=\langle A_{i},A^{*}_{i}\rangle. We can write gis=piqi(AisAi)+piqi((1λi)Ais+ϵis+γ)g^{s}_{i}=p_{i}q_{i}(A_{i}^{s}-A^{*}_{i})+p_{i}q_{i}((1-\lambda_{i})A_{i}^{s}+\epsilon^{s}_{i}+\gamma), so when applying Lemma 4.2 we can use 4α=piqi=Θ(k/m)4\alpha=p_{i}q_{i}=\Theta(k/m) and v=piqi((1λi)Ais+ϵis+γ)v=p_{i}q_{i}((1-\lambda_{i})A_{i}^{s}+\epsilon^{s}_{i}+\gamma). The norm of vv can be bounded in two terms, the first term piqi(1λi)Aisp_{i}q_{i}(1-\lambda_{i})A_{i}^{s} has norm piqi(1λi)p_{i}q_{i}(1-\lambda_{i}) which is smaller than piqiAisAip_{i}q_{i}\|A_{i}^{s}-A^{*}_{i}\|, and the second term has norm bounded by ζ=O(k2/mn)\zeta=O(k^{2}/mn).

By Lemma 4.2 we know the vector gsig^{i}_{s} is (Ω(k/m),Ω(m/k),O(k3/mn2))(\Omega(k/m),\Omega(m/k),O(k^{3}/mn^{2}))-correlated with AsA^{s}. Then by Theorem 2.2 we have the last part of the corollary.

2 Maintaining Nearness

Suppose that AsA^{s} is (2δ,2)(2\delta,2)-near to AA^{*}. Then As+1A2A\|A^{s+1}-A^{*}\|\leq 2\|A^{*}\| in Algorithm 2.

As in the proof of the previous lemma, we will make crucial use of Lemma 3.4. Substituting and rearranging terms we have:

Our first goal is to write this equation in a more convenient form. In particular let UU and VV be matrices such that Ui=piqi(1λis)AisU_{i}=p_{i}q_{i}(1-\lambda_{i}^{s})A_{i}^{s} and V_{i}=p_{i}\Big{(}A^{s}_{-i}\mbox{diag}(q_{i,j})\left(A_{-i}^{s}\right)^{T}\Big{)}A^{*}_{i}. Then we can re-write the above equation as:

where \mboxdiag(1ηpiqi)\mbox{diag}(1-\eta p_{i}q_{i}) is the m×mm\times m diagonal matrix whose entries along the diagonal are 1ηpiqi1-\eta p_{i}q_{i}.

We will bound the spectral norm of As+1AA^{s+1}-A by bounding the spectral norm of each of the matrices of right hand side. The first two terms are straightforward to bound:

where the last inequality uses the assumption that pi=Θ(1)p_{i}=\Theta(1) and qiO(k/m)q_{i}\leq O(k/m), and the assumption that AsA2A\|A^{s}-A^{*}\|\leq 2\|A^{*}\|.

From the definition of UU it follows that U=As\mboxdiag(piqi(1λis))U=A^{s}\mbox{diag}(p_{i}q_{i}(1-\lambda_{i}^{s})), and therefore

where we have used the fact that λis1δ\lambda_{i}^{s}\geq 1-\delta and δ=o(1)\delta=o(1), and AsAsA+A=O(A)\|A^{s}\|\leq\|A^{s}-A^{*}\|+\|A^{*}\|=O(\|A^{*}\|).

What remains is to bound the third term, and let us first introduce an auxiliary matrix QQ which we define as follows: Qii=0Q_{ii}=0 and Qi,j=qi,jAis,AiQ_{i,j}=q_{i,j}\langle A^{s}_{i},A^{*}_{i}\rangle for iji\neq j. It is easy to verify that the following claim:

The iith column of AsQA^{s}Q is equal to \Big{(}A^{s}_{-i}\mbox{diag}(q_{i,j})\left(A_{-i}^{s}\right)^{T}\Big{)}A^{*}_{i}

Therefore we can write V=AsQ\mboxdiag(pi)V=A^{s}Q\mbox{diag}(p_{i}). We will bound the spectral norm of QQ by bounding its Frobenus norm instead. Then from the definition of AA, we have that:

Moreover since ATAs{A^{*}}^{T}A^{s} is an m×mm\times m matrix, its Frobenius norm can be at most a m\sqrt{m} factor larger than its spectral norm. Hence we have

where the last inequality uses the fact that k=O(n/logn)k=O(\sqrt{n}/\log n) and AsO(A)\|A^{s}\|\leq O(\|A^{*}\|).

Therefore, putting the pieces together we have:

and this completes the proof of the lemma.

3 Proof of Theorem 4.1

We prove by induction on ss. Our induction hypothesis is that the theorem is true at each step ss and AsA^{s} is (2δ,2)(2\delta,2)-near to AA^{*}. The hypothesis is trivially true for s=0s=0. Now assuming the inductive hypothesis is true. Recall that Corollary 4.3 of Section 4.1 says that if AsA^{s} is (2δ,2)(2\delta,2)-near to AA^{*}, which is guaranteed by the inductive hypothesis, by then gisg_{i}^{s} is indeed (Ω(k/m),Ω(m/k),O(k3/mn2))(\Omega(k/m),\Omega(m/k),O(k^{3}/mn^{2}))-correlated with AiA^{*}_{i}. Invoking our framework of analysis (Theorem 2.2), we have that

Therefore it also follows that As+1A^{s+1} is 2δ2\delta-close to AA^{*}. Then we invoke Lemma 4.6 to prove As+1A^{s+1} has not too large spectral norm As+1A2A\|A^{s+1}-A^{*}\|\leq 2\|A^{*}\|, which completes the induction.

Initialization

There is a large gap between theory and practice in terms of how to initialize alternating minimization. The usual approach is to set AA randomly or to populate its columns with samples y(i)y^{(i)}. These often work but we do not know how to analyse them. Here we give a novel method for initialization which we show succeeds with very high probability. Our algorithm works by pairwise reweighting. Let u=Aαu=A^{*}\alpha and v=Aαv=A^{*}\alpha^{\prime} be two samples from our model whose supports are UU and VV respectively. The main idea is that if we reweight fresh samples yy with a factor y,uy,v\langle y,u\rangle\langle y,v\rangle and compute

then the top singular vectors will correspond to columns AjA^{*}_{j} where jUVj\in U\cap V. (This is reminiscent of ideas in recent papers on dictionary learning, but more sample efficient.)

Throughout this section we will assume that the algorithm is given two sets of samples of size p1p_{1} and p2p_{2} respectively. Let p=p1+p2p=p_{1}+p_{2}. We use the first set of samples for the pairs u,vu,v that are used in reweighting and we use the second set to compute M^u,v\widehat{M}_{u,v} (that is, the same set of p2p_{2} samples is used for each u,vu,v throughout the execution of the algorithm). Our main theorem is:

Suppose that Algorithm 3 is given p1=Ω~(m)p_{1}=\widetilde{\Omega}(m) and p2=Ω~(mk)p_{2}=\widetilde{\Omega}(mk) fresh samples and moreover (a)(a) AA^{*} is μ\mu-incoherent with μ=O(nklog3n)\mu=O^{*}(\frac{\sqrt{n}}{k\log^{3}n}), (b)(b) m=O(n)m=O(n) and (c)(c) AO(mn)\|A^{*}\|\leq O(\sqrt{\frac{m}{n}}). Then with high probability AA is (δ,2)(\delta,2)-near to AA^{*} where δ=O(1/logn)\delta=O^{*}(1/\log n).

We will defer the proof of this theorem to Appendix C. The key idea is the following main lemma. We will invoke this lemma several times in order to analyze Algorithm 3 to verify whether or not the supports of uu and vv share a common element, and again to show that if they do we can approximately recover the corresponding column of AA^{*} from the top singular vector of Mu,vM_{u,v}.

Suppose u=Aαu=A^{*}\alpha and v=Aαv=A^{*}\alpha^{\prime} are two random samples with supports UU, VV respectively. Let β=ATu\beta={A^{*}}^{T}u and β=ATv\beta^{\prime}={A^{*}}^{T}v. Let y=Axy=A^{*}x^{*} be random sample that is independent of uu,vv, then

where qi=Pr\/[iS]q_{i}=\mathop{\bf Pr\/}[i\in S], ci=E\/[xi4iS]c_{i}=\mathop{\bf E\/}[{x^{*}_{i}}^{4}|i\neq S], and the error terms are:

Moreover the error terms E1+E2+E3E_{1}+E_{2}+E_{3} has spectral norm bounded by O(k/mlogn)O^{*}(k/m\log n), βiΩ(1)|\beta_{i}|\geq\Omega(1) for all i\mbox\emsupp(α)i\in\mbox{{\em supp}}(\alpha) and βiΩ(1)|\beta^{\prime}_{i}|\geq\Omega(1) for all i\mbox\emsupp(α)i\in\mbox{{\em supp}}(\alpha^{\prime}).

We will prove this lemma in three parts. First we compute the expectation and show it has the desired form. Recall that y=Axy=A^{*}x^{*}, and so:

where the second-to-last line follows because the entries in xSx^{*}_{S} are independent and have mean zero, and the only non-zero terms come from xi4{x^{*}_{i}}^{4} (whose expectation is cic_{i}) and xi2xj2{x^{*}_{i}}^{2}{x^{*}_{j}}^{2} (whose expectation is one). Equation (4) now follows by rearranging the terms in the last line. What remains is to bound the spectral norm of E1,E2E_{1},E_{2} and E3E_{3}.

Next we establish some useful properties of β\beta and β\beta^{\prime}:

With high probability it holds that (a)(a) for each ii we have βiαiμklogmn|\beta_{i}-\alpha_{i}|\leq\frac{\mu k\log m}{\sqrt{n}} and (b)(b) βO(mk/n)\|\beta\|\leq O(\sqrt{mk/n}).

In particular since the difference between βi\beta_{i} an αi\alpha_{i} is o(1)o(1) for our setting of parameters, we conclude that if αi0\alpha_{i}\neq 0 then Co(1)βiO(logm)C-o(1)\leq|\beta_{i}|\leq O(\log m) and if αi=0\alpha_{i}=0 then βiμklogmn|\beta_{i}|\leq\frac{\mu k\log m}{\sqrt{n}}.

Recall that UU is the support of α\alpha and let R=U\{i}R=U\backslash\{i\}. Then:

and since AA^{*} is incoherent we have that AiTARμk/n\|{A^{*}_{i}}^{T}A^{*}_{R}\|\leq\mu\sqrt{k/n}. Moreover the entries in αR\alpha_{R} are independent and subgaussian random variables, and so with high probability AiTAR,αRμklogmn|\langle{A^{*}_{i}}^{T}A^{*}_{R},\alpha_{R}\rangle|\leq\frac{\mu k\log m}{\sqrt{n}} and this implies the first part of the claim.

For the second part, we can bound βAAUα\|\beta\|\leq\|A^{*}\|\|A^{*}_{U}\|\|\alpha\|. Since α\alpha is a kk-sparse vector with independent and subgaussian entries, with high probability αO(k)\|\alpha\|\leq O(\sqrt{k}) in which case it follows that βO(mk/n)\|\beta\|\leq O(\sqrt{mk/n}).

Now we are ready to bound the error terms.

With high probability each of the error terms E1,E2E_{1},E_{2} and E3E_{3} in (4) has spectral norm bounded at most O(k/mlogm)O^{*}(k/m\log m).

Let S=[m]\(UV)S=[m]\backslash(U\cap V), then E1=ASD1ASTE_{1}=A^{*}_{S}D_{1}{A^{*}_{S}}^{T} where D1D_{1} is a diagonal matrix whose entries are qiciβiβiq_{i}c_{i}\beta_{i}\beta^{\prime}_{i}. We first bound D1\|D_{1}\|. To this end, we can invoke the first part of Claim 2 to conclude that βiβiμ2k2log2mn|\beta_{i}\beta^{\prime}_{i}|\leq\frac{\mu^{2}k^{2}\log^{2}m}{n}. Also qici=Θ(k/m)q_{i}c_{i}=\Theta(k/m) and so

Finally ASAO(1)\|A_{S}\|\leq\|A\|\leq O(1) (where we have used the assumption that m=O(n)m=O(n)), and this yields the desired bound on E1\|E_{1}\|.

The second term E2E_{2} is a sum of positive semidefinite matrices and we will make crucial use of this fact below:

Here the first inequality follows by using bounds on qi,jq_{i,j} and then completing the square. The second inequality uses Cauchy-Schwartz. We can now invoke the second part of Claim 2 and conclude that E2O(k2/m2)ββA2O(k/mlogm)\|E_{2}\|\leq O(k^{2}/m^{2})\|\beta\|\|\beta^{\prime}\|\|A^{*}\|^{2}\leq O^{*}(k/m\log m) (where we have used the assumption that m=O(n)m=O(n)).

For the third error term E3E_{3}, by symmetry we need only consider terms of the form qi,jβiβjAiAjTq_{i,j}\beta_{i}\beta_{j}^{\prime}A^{*}_{i}{A^{*}_{j}}^{T}. We can collect these terms and write them as AQATA^{*}Q{A^{*}}^{T}, where Qi,j=0Q_{i,j}=0 if i=ji=j and Qi,j=qi,jβiβjQ_{i,j}=q_{i,j}\beta_{i}\beta_{j}^{\prime} if iji\neq j. First, we bound the Frobenius norm of QQ:

Finally E32A2QO(m/nk2/m2)ββO(k/mlogm)\|E_{3}\|\leq 2\|A^{*}\|^{2}\|Q\|\leq O(m/n\cdot k^{2}/m^{2})\|\beta\|\|\beta^{\prime}\|\leq O^{*}(k/m\log m), and this completes the proof of the claim.

The proof of the main lemma is now complete.

Conclusions

Going beyond n\sqrt{n} sparsity requires new ideas as alternating minimization appears to break down. Mysterious properties of alternating minimization are also left to explore, such as why a random initialization works. Are these heuristics information theoretically optimal in terms of their sample complexity? Finally, can we analyse energy minimization in other contexts as well?

References

Appendix A Threshold Decoding

Here we show that a simple thresholding method recovers the support of each sample with high probability (over the randomness of xx^{*}). This corresponds to the fact that sparse recovery for incoherent dictionaries is much easier when the non-zero coefficients do not take on a wide range of values; in particular, one does not need iterative pursuit algorithms in this case. As usual let y=Axy=A^{*}x^{*} be a sample from the model, and let SS be the support of xx^{*}. Moreover suppose that AA^{*} is μ\mu-incoherent and let AA be column-wise δ\delta-close to AA. Then

If μn12k\frac{\mu}{\sqrt{n}}\leq\frac{1}{2k} and k=Ω(logm)k=\Omega^{*}(\log m) and δ=O(1/logm)\delta=O^{*}(1/\sqrt{\log m}), then with high probability (over the choice of xx^{*}) we have S={i \mbox: Ai,y>C/2}S=\{i~{}\mbox{:}~{}|\langle A_{i},y\rangle|>C/2\}. Also for all iSi\in S \mboxsgn(Ai,y)=\mboxsgn(xi)\mbox{sgn}(\langle A_{i},y\rangle)=\mbox{sgn}(x^{*}_{i}).

Consider Ai,y=Ai,Aixi+Zi\langle A_{i},y\rangle=\langle A_{i},A^{*}_{i}\rangle x^{*}_{i}+Z_{i} where Zi=jiAi,AjxjZ_{i}=\sum_{j\neq i}\langle A_{i},A^{*}_{j}\rangle x^{*}_{j} is a mean zero random variable which measures the contribution of the cross-terms. Note that Ai,Ai(1δ2/2)|\langle A_{i},A^{*}_{i}\rangle|\geq(1-\delta^{2}/2), so Ai,Aixi|\langle A_{i},A^{*}_{i}\rangle x^{*}_{i}| is either larger than (1δ2/2)C(1-\delta^{2}/2)C or equal to zero depending on whether or not iSi\in S. Our main goal is to show that the variable ZiZ_{i} is much smaller than CC with high probability, and this follows by standard concentration bounds.

Intuitively, ZiZ_{i} has two source of randomness: the support SS of xx^{*}, and the random values of xx^{*} conditioned on the support. We prove a stronger statement that only requires second source of randomness. Namely, even conditioned on the support SS, with high probability S={i \mbox: Ai,y>C/2}S=\{i~{}\mbox{:}~{}|\langle A_{i},y\rangle|>C/2\}.

We remark that ZiZ_{i} is a sum of independent subgaussian random variables and the variance of ZiZ_{i} is equal to jS\{i}Ai,Aj2\sum_{j\in S\backslash\{i\}}\langle A_{i},A^{*}_{j}\rangle^{2}. Next we bound each term in the sum as

On the other hand, we know AS\{i}2\|A^{*}_{S\backslash\{i\}}\|\leq 2 by Gershgorin’s Disk Theorem. Therefore the second term can be bounded as jS\{i}AiAi,Aj2=AS\{i}T(AiAi)2O(1/logm)\sum_{j\in S\backslash\{i\}}\langle A_{i}-A^{*}_{i},A^{*}_{j}\rangle^{2}=\|{A^{*}_{S\backslash\{i\}}}^{T}(A_{i}-A^{*}_{i})\|^{2}\leq O^{*}(1/\log m). Using this bound, we know the variance is at most O(1/logm)O^{*}(1/\log m):

Hence we have that ZiZ_{i} is a subgaussian random variable with variance at most O(1/logm)O^{*}(1/\log m) and so we conclude that ZiC/4Z_{i}\leq C/4 with high probability. Finally we can take a union bound over all indices i[m]i\in[m] and this completes the proof of the lemma.

In fact, even if kk is much larger than n\sqrt{n}, as long as the spectral norm of AA^{*} is small and the support of xx^{*} is random enough, the support recovery is still correct.

If k=O(n/logn)k=O(n/\log n), μ/n<1/log2n\mu/\sqrt{n}<1/\log^{2}n and δ<O(1/logm)\delta<O^{*}(1/\sqrt{\log m}), the support of xx^{*} is a uniformly kk-sparse set, then with high probability (over the choice of xx) we have S={i \mbox: Ai,y>C/2}S=\{i~{}\mbox{:}~{}|\langle A_{i},y\rangle|>C/2\}. Also for all iSi\in S \mboxsgn(Ai,y)=\mboxsgn(xi)\mbox{sgn}(\langle A_{i},y\rangle)=\mbox{sgn}(x^{*}_{i}).

The proof of this lemma is very similar to the previous one. However, in the previous case we only used the randomness after conditioning on the support, but to prove this stronger lemma we need to use the randomness of the support.

First we will need the following elementary claim:

ATAiO(m/n)\|{A^{*}}^{T}A_{i}\|\leq O(\sqrt{m/n}) and AjTAiO(1/logm)|{A^{*}}^{T}_{j}A_{i}|\leq O^{*}(1/\sqrt{\log m}) for all jij\neq i.

The first part follows immediately from the assumption that AA^{*} and AA are column-wise close and that A=O(m/n)\|A^{*}\|=O(\sqrt{m/n}). The second part follows because AjTAiAjTAi+AjT(AiAi)O(1/logm)|{A^{*}}^{T}_{j}A_{i}|\leq|{A^{*}}^{T}_{j}A^{*}_{i}|+|{A^{*}}^{T}_{j}(A^{*}_{i}-A_{i})|\leq O^{*}(1/\sqrt{\log m}).

Let R=S\{i}R=S\backslash\{i\}. Recall that conditioned on choice of SS, we have \mboxvar(Zi)=jRAi,Aj2.\mbox{var}(Z_{i})=\sum_{j\in R}\langle A_{i},A^{*}_{j}\rangle^{2}. We will bound this term with high probability over the choice of RR. First we bound its expectation:

E\/R[jRAi,Aj2]O(k/n)\mathop{\bf E\/}_{R}[\sum_{j\in R}\langle A_{i},A^{*}_{j}\rangle^{2}]\leq O(k/n)

By assumption RR is a uniformly random subset of [m]{i}[m]\setminus\{i\} of size R|R| (this is either kk or k1k-1). Then

However bounding the expected variance of ZiZ_{i} is not enough; we need a bound that holds with high probability over the choice of the support. Intuitively, we should expect to get bounds on the variance that hold with high probability because each term in the sum above (that bounds \mboxvar(Zi)\mbox{var}(Z_{i})) is itself at most O(1/logm)O^{*}(1/\log m), which easily implies Theorem A.1.

jRAi,Aj2O(1/logm)\sum_{j\in R}\langle A_{i},A^{*}_{j}\rangle^{2}\leq O^{*}(1/\log m) with high probability over the choice of RR.

Let aj=Ai,Aj2a_{j}=\langle A_{i},A^{*}_{j}\rangle^{2}, then aj=O(1/logm)a_{j}=O^{*}(1/\log m) and moreover jiaj=O(m/n)\sum_{j\neq i}a_{j}=O(m/n) using the same idea as in the proof of Lemma A.5. Hence we can apply Chernoff bounds and conclude that with high probability jiajXj=jRAi,Aj2O(1/logm)\sum_{j\neq i}a_{j}X_{j}=\sum_{j\in R}\langle A_{i},A^{*}_{j}\rangle^{2}\leq O^{*}(1/\log m) where XjX_{j} is an indicator variable for whether or not jRj\in R.

Using Lemma A.7 we have that with high probability over the choice of RR, var(Zi)O1/logmvar(Z_{i})\leq O^{*}{1/\log m}. In particular, conditioned on the support RR, ZiZ_{i} is the sum of independent subgaussian variables and so with high probability (using Theorem LABEL:thm:subgaussian)

Also as we saw before that Ai,Aixi>(1δ2/2)C|\langle A_{i},A^{*}_{i}\rangle x_{i}|>(1-\delta^{2}/2)C if iSi\in S and is zero otherwise. So we conclude that Ai,y>C/2|\langle A_{i},y\rangle|>C/2 if and only if iSi\in S which completes the proof.

In the above lemma we only needs the support of xx satisfy concentration inequality in Lemma A.7. This does not really require SS to be uniformly random.

Appendix B More Alternating Minimization

Here we prove Theorem 3.6 and Theorem 3.7. Note that in Algorithm 4 and Algorithm 5, we use the expectation of the gradient over the samples instead of the empirical average. We can show that these algorithms would maintain the same guarantees if we used p=Ω~(mk)p=\widetilde{\Omega}(mk) to estimate gsg^{s} as we did in Algorithm 2. However these proofs would require repeating very similar calculations to those that we performed in Appendix D, and so we only claim that these algorithms maintain their guarantees if they use a polynomial number of samples to approximate the expectation.

We give a variant of the Olshausen-Field update rule in Algorithm 4. Our first goal is to prove that each column of gsg^{s} is (α,β,ϵ)(\alpha,\beta,\epsilon)-correlated with AiA^{*}_{i}. The main step is to prove an analogue of Lemma 3.4 that holds for the new update rule.

Suppose that AsA^{s} is (2δ,5)(2\delta,5)-near to AA^{*} Then each column of gsg^{s} in Algorithm 4 takes the form

where λi=Ai,Ai\lambda_{i}=\langle A_{i},A^{*}_{i}\rangle. Moreover the norm of ϵis\epsilon_{i}^{s} can be bounded as ϵisO(k2/mn)\|\epsilon_{i}^{s}\|\leq O(k^{2}/mn).

We remark that unlike the statement of Lemma 4.2, here we will not explicitly state the functional form of ϵis\epsilon_{i}^{s} because we will not need it.

The proof parallels that of Lemma 3.4, although we will use slightly different conditioning arguments as needed. Again, we define Fx\mathcal{F}_{x^{*}} as the event that \mboxsgn(x)=\mboxsgn(x)\mbox{sgn}(x^{*})=\mbox{sgn}(x), and let 1Fx\mathbf{1}_{\mathcal{F}_{x^{*}}} be the indicator function of this event. We can invoke Lemma A.1 and conclude that this event happens with high probability. Moreover let Fi\mathcal{F}_{i} be the event that ii is in the set S=\mboxsupp(x)S=\mbox{supp}(x^{*}) and let 1Fi\mathbf{1}_{\mathcal{F}_{i}} be its indicator function.

Once again our strategy is to rewrite the expectation above using subconditioning where we first choose the support SS of xx^{*}, and then we choose the nonzero values xSx^{*}_{S}.

Next we will compute the expectation of each of the terms on the right hand side. This part of the proof will be somewhat more involved than the proof of Lemma 3.4, because the terms above are quadratic instead of linear. The leading term is equal to qi(λiAiλi2Ai)q_{i}(\lambda_{i}A^{*}_{i}-\lambda_{i}^{2}A_{i}) and the remaining terms contribute to ϵi\epsilon_{i}. The second term is equal to (IAiAiT)Ai\mboxdiag(qi,j)AiTAi(I-A_{i}A_{i}^{T})A^{*}_{-i}\mbox{diag}(q_{i,j}){A^{*}_{-i}}^{T}A_{i} which has spectral norm bounded by O(k2/mn)O(k^{2}/mn). The third term is equal to λiAi\mboxdiag(qi,j)AiTAi\lambda_{i}A_{-i}\mbox{diag}(q_{i,j}){A^{*}_{-i}}^{T}A^{*}_{i} which again has spectral norm bounded by O(k2/mn)O(k^{2}/mn). The final term is equal to

where vv is a vector whose j2j_{2}-th component is equal to j2iqi,j1,j2Aj2,AiAj2,Aj1\sum_{j_{2}\neq i}q_{i,j_{1},j_{2}}\langle A^{*}_{j_{2}},A_{i}\rangle\langle A^{*}_{j_{2}},A_{j_{1}}\rangle. The absolute value of vj2v_{j_{2}} is bounded by

The first inequality uses bounds for qq’s and the AM-GM inequality, the second inequality uses the spectral norm of AA^{*}. We can now bound the norm of vv as follows

and this implies that the last term satisfies AivO(k2/mn)\|A_{-i}\|\|v\|\leq O(k^{2}/mn). Combining all these bounds completes the proof of the lemma.

We are now ready to prove that the update rule satisfies Definition 2.1. This again uses Lemma 4.2, except that we invoke Lemma B.1 instead. Combining these lemmas we obtain:

Suppose that AsA^{s} is (2δ,5)(2\delta,5)-near to AA^{*}. Then for each ii, gisg^{s}_{i} as defined in Algorithm 4 is (α,β,ϵ)(\alpha,\beta,\epsilon)-correlated with AiA^{*}_{i}, where α=Ω(k/m)\alpha=\Omega(k/m), βΩ(m/k)\beta\geq\Omega(m/k) and ϵ=O(k3/mn2)\epsilon=O(k^{3}/mn^{2}).

Notice that in the third step in Algorithm 4 we project back (with respect to Frobenius norm of the matrices) into a convex set B\mathcal{B} which we define below. Viewed as minimizing a convex function with convex constraints, this projection can be computed by various convex optimization algorithm, e.g. subgradient method (see Theorem 3.2.3 of Section 3.2.4 of Nesterov’s seminal Book Nesterov (2004) for more detail). Without this modification, it seems that the update rule given in Algorithm 4 does not necessarily preserve nearness.

Let B={AA\mboxisδ0\mboxclosetoA0\mboxandA2A}\mathcal{B}=\{A|A\mbox{ is }\delta_{0}\mbox{ close to }A^{0}\mbox{ and }\|A\|\leq 2\|A^{*}\|\}

The crucial properties of this set are summarized in the following claim:

(a)(a) ABA^{*}\in\mathcal{B} and (b)(b) for each ABA\in\mathcal{B}, AA is (2δ0,5)(2\delta_{0},5)-near to AA^{*}

The first part of the claim follows because by assumption AA^{*} is δ0\delta_{0}-close to A0A^{0} and AA02A\|A^{*}-A^{0}\|\leq 2\|A^{*}\|. Also the second part follows because AAAA0+A0A4A\|A-A^{*}\|\leq\|A-A^{0}\|+\|A^{0}-A^{*}\|\leq 4\|A^{*}\|. This completes the proof of the claim.

By the convexity of B\mathcal{B} and the fact that ABA^{*}\in\mathcal{B}, we have that projection doesn’t increase the error in Frobenius norm.

For any matrix AA, \mboxProjBAAFAAF\|\mbox{Proj}_{\mathcal{B}}A-A^{*}\|_{F}\leq\|A-A^{*}\|_{F}.

We now have the tools to analyze Algorithm 4 by fitting it into the framework of Corollary 2.4. In particular, we prove that it converges to a globally optimal solution by connecting it to an approximate form of projected gradient descent:

We note that projecting into B\mathcal{B} ensures that at the start of each step AsA5A\|A^{s}-A^{*}\|\leq 5\|A^{*}\|. Hence gisg^{s}_{i} is (Ω(k/m),Ω(m/k),O(k3/mn2))(\Omega(k/m),\Omega(m/k),O(k^{3}/mn^{2}))-correlated with AiA^{*}_{i} for each ii, which follows from Lemma B.3. This implies that gsg^{s} is (Ω(k/m),Ω(m/k),O(k3/n2))(\Omega(k/m),\Omega(m/k),O(k^{3}/n^{2}))-correlated with AA^{*} in Frobenius norm. Finally we can apply Corollary 2.4 (on the matrices with Frobenius) to complete the proof of the theorem.

B.2 Proof of Theorem 3.7

The proof of Theorem 3.7 is parallel to that of Theorem 4.1 and Theorem 3.6. As usual, our first step is to show that gsg_{s} is correlated with AA^{*}:

Suppose that AsA^{s} is (δ,5)(\delta,5)-near to AA^{*}. Then for each ii, gisg^{s}_{i} as defined in Algorithm 5 is (α,β,ϵ)(\alpha,\beta,\epsilon)-correlated with AiA^{*}_{i}, where α=Ω(k/m)\alpha=\Omega(k/m), βΩ(m/k)\beta\geq\Omega(m/k) and ϵnω(1)\epsilon\leq n^{-\omega(1)}.

We chose to write the proof of Lemma 3.4 so that we can reuse the calculation here. In particular, instead of substituting BB for AsA^{s} in the calculation we can substitute B(s,i)B^{(s,i)} instead and we get:

Recall that λis=Ais,Ai\lambda^{s}_{i}=\langle A_{i}^{s},A^{*}_{i}\rangle. Now we can write g(s,i)=piqi(AisAi)+vg^{(s,i)}=p_{i}q_{i}(A^{s}_{i}-A^{*}_{i})+v, where

Indeed the norm of the first term piqi(λis1)Aisp_{i}q_{i}(\lambda_{i}^{s}-1)A^{s}_{i} is smaller than piqiAisAip_{i}q_{i}\|A^{s}_{i}-A^{*}_{i}\|.

Recall that the second term was the main contribution to the systemic error, when we analyzed earlier update rules. However in this case we can use the fact that Bi(s,i)TAis=0B^{(s,i)T}_{-i}A^{s}_{i}=0 to rewrite the second term above as

Hence we can bound the norm of the second term by O(k2/mn)AiAisO(k^{2}/mn)\|A^{*}_{i}-A^{s}_{i}\|, which is also much smaller than piqiAisAip_{i}q_{i}\|A^{s}_{i}-A^{*}_{i}\|.

Combining these two bounds we have that vpiqiAisAi/4+γ\|v\|\leq p_{i}q_{i}\|A^{s}_{i}-A^{*}_{i}\|/4+\gamma, so we can take ζ=γ=nω(1)\zeta=\gamma=n^{-\omega(1)} in Lemma 4.2. We can complete the proof by invoking Lemma 4.2 which implies that the g(s,i)g^{(s,i)} is (Ω(k/m),Ω(m/k),nω(1))(\Omega(k/m),\Omega(m/k),n^{-\omega(1)})-correlated with AiA_{i}.

This lemma would be all we would need, if we added a third step that projects onto B\mathcal{B} as we did in Algorithm 4. However here we do not need to project at all, because the update rule maintains nearness and thus we can avoid this computationally intensive step.

Suppose that AsA^{s} is (δ,2)(\delta,2)-near to AA^{*}. Then As+1A2A\|A^{s+1}-A^{*}\|\leq 2\|A^{*}\| in Algorithm 5.

This proof of the above lemma parallels that of Lemma 4.6. We will focus on highlighting the differences in bounding the error term, to avoid repeating the same calculation.

We will use AA to denote AsA^{s} and B(i)B^{(i)} to denote B(s,i)B^{(s,i)} to simplify the notation. Also let Aˉi\bar{A}_{i} be normalized so that Aˉi=Ai/Ai\bar{A}_{i}=A_{i}/\|A_{i}\| and then we can write Bi(i)=(IAˉiAˉiT)AiB^{(i)}_{-i}=(I-\bar{A}_{i}\bar{A}_{i}^{T})A_{-i}. Hence the error term is given by

Let CC be a matrix whose columns are Ci=(IAˉiAˉiT)Ai=AiAˉi,AiAˉiC_{i}=(I-\bar{A}_{i}\bar{A}_{i}^{T})A^{*}_{i}=A_{i}-\langle\bar{A}_{i},A^{*}_{i}\rangle\bar{A}_{i}. This implies that CO(m/n)\|C\|\leq O(\sqrt{m/n}). We can now rewrite the error term above as

It follows from the proof of Lemma 4.6 that the first term above has spectral norm bounded by O(k/mm/n)O(k/m\cdot\sqrt{m/n}). This is because in Lemma 4.6 we bounded the term Ai\mboxdiag(qi,j)AiTAiA_{-i}\mbox{diag}(q_{i,j})A_{-i}^{T}A^{*}_{i} and in fact it is easily verified that all we used in that proof was the fact that A=O(m/n)\|A^{*}\|=O(\sqrt{m/n}), which also holds for CC.

All that remains is to bound the second term. We note that its columns are scalar multiples of Aˉi\bar{A}_{i}, where the coefficient can be bounded as follows: AˉiAi2\mboxdiag(qi,j)AiO(k2/mn)\|\bar{A}_{i}\|\|A_{-i}\|^{2}\|\mbox{diag}(q_{i,j})\|\|A^{*}_{i}\|\leq O(k^{2}/mn). Hence we can bound the spectral norm of the second term iby O(k2/mn)Aˉ=O(k/mm/n)O(k^{2}/mn)\|\bar{A}\|=O^{*}(k/m\cdot\sqrt{m/n}). We can now combine these two bounds, which together with the calculation in Lemma 4.6 completes the proof.

These two lemmas directly imply Theorem 3.7.

Appendix C Analysis of Initialization

Here we prove an infinite sample version of Theorem 5.1 by repeatedly invoking Lemma 5.2. We give sample complexity bounds for it in Appendix D.3 where we complete the proof of Theorem 5.1.

Under the assumption of Theorem 5.1, if Algorithm 3 has access to Mu,vM_{u,v} (defined in Lemma 5.2) instead of the empirical average M^u,v\widehat{M}_{u,v}, then with high probability AA is (δ,2)(\delta,2)-near to AA^{*} where δ=O(1/logn)\delta=O^{*}(1/\log n).

Our first step is to use Lemma 5.2 to show that when uu and vv share a unique dictionary element, there is only one large term in Mu,vM_{u,v} and the error terms are small. Hence the top singular vector of Mu,vM_{u,v} must be close to the corresponding dictionary element AiA_{i}.

Under the assumptions of Theorem 5.1, suppose u=Aαu=A^{*}\alpha and v=Aαv=A^{*}\alpha^{\prime} are two random samples with supports UU, VV respectively. When UV={i}U\cap V=\{i\} the top singular vector of Mu,vM_{u,v} is O(1/logn)O^{*}(1/\log n)-close to AiA^{*}_{i}.

When uu and vv share a unique dictionary element ii, the contribution of the first term in (4) is just qiciβiβiAiAiTq_{i}c_{i}\beta_{i}\beta^{\prime}_{i}A^{*}_{i}{A^{*}_{i}}^{T}. Moreover the coefficient qiciβiβiq_{i}c_{i}\beta_{i}\beta^{\prime}_{i} is at least Ω(k/m)\Omega(k/m) which follows from Lemma 5.2 and from the assumptions that ci1c_{i}\geq 1 and qi=Ω(k/m)q_{i}=\Omega(k/m).

On the other hand, the error terms are bounded by E1+E2+E3O(k/mlogm)\|E_{1}+E_{2}+E_{3}\|\leq O^{*}(k/m\log m) which again by Lemma 5.2. We can now apply Wedin’s Theorem (see e.g. Horn and Johnson (1990)) to

and conclude that its top singular vector must be O(k/mlogm)/Ω(k/m)=O(1/logm)O^{*}(k/m\log m)/\Omega(k/m)=O^{*}(1/\log m)-close to AiA^{*}_{i}, and this completes the proof of the lemma.

Using (4) again, we can verify whether or not the supports of uu and vv share a unique element.

Suppose u=Aαu=A^{*}\alpha and v=Aαv=A^{*}\alpha^{\prime} are two random samples with supports UU, VV respectively. Under the assumption of Theorem 5.1, if the top singular value of Mu,vM_{u,v} is at least Ω(k/m)\Omega(k/m) and the second largest singular value is at most O(k/mlogm)O^{*}(k/m\log m), then with high probability uu and vv share a unique dictionary element.

By Lemma 5.2 we know with high probability the error terms have spectral norm O(k/mlogm)O^{*}(k/m\log m). Here we show when that happens, and the top singular value is at least Ω(k/m)\Omega(k/m), second largest singular value is at most O(k/mlogm)O^{*}(k/m\log m), then uu and vv must share a unique dictionary element.

If uu and vv share no dictionary element, then the main part in Equation (4) is empty, and the error term has spectral norm O(k/mlogm)O^{*}(k/m\log m). In this case the top singular value of Mu,vM_{u,v} cannot be as large as Ω(k/m)\Omega(k/m).

If uu and vv share more than one dictionary element, there are more than one terms in the main part of (4). Let S=UVS=U\cap V, we know Mu,v=ASDSAST+E1+E2+E3M_{u,v}=A^{*}_{S}D_{S}{A^{*}_{S}}^{T}+E_{1}+E_{2}+E_{3} where DSD_{S} is a diagonal matrix whose entries are equal to qiciβiβiq_{i}c_{i}\beta_{i}\beta^{\prime}_{i}. All diagonal entries in DSD_{S} have magnitude at least Ω(k/m)\Omega(k/m). By incoherence we know ASA^{*}_{S} have smallest singular value at least 1/21/2, therefore the second largest singular value of ASDSASTA^{*}_{S}D_{S}{A^{*}_{S}}^{T} is at least:

Finally by Weyl’s theorem (see e.g. Horn and Johnson (1990)) we know σ2(Mu,v)σ2(ASDSAST)E1+E2+E3Ω(k/m)\sigma_{2}(M_{u,v})\geq\sigma_{2}(A^{*}_{S}D_{S}{A^{*}_{S}}^{T})-\|E_{1}+E_{2}+E_{3}\|\geq\Omega(k/m). Therefore in this case the second largest singular value cannot be as small as O(k/mlogm)O^{*}(k/m\log m).

Combining the above two cases, we know when the top two singular values satisfy the conditions in the lemma, and the error terms are small, uu and vv share a unique dictionary element.

Finally, we are ready to prove Theorem 5.1. The idea is every vector added to the list LL will be close to one of the dictionary elements (by Lemma C.4), and for every dictionary element the list LL contains at least one close vector because we have enough random samples.

By Lemma C.4 we know every vector added into LL must be close to one of the dictionary elements. On the other hand, for any dictionary element AiA^{*}_{i}, by the bounded moment condition of D\mathcal{D} we know

Here the inequality uses union bound. Therefore given O(m2log2n/k2)O(m^{2}\log^{2}n/k^{2}) trials, with high probability there is a pair of uu,vv that intersect uniquely at ii for all i[m]i\in[m]. By Lemma C.2 this implies there must be at least one vector that is close to AiA^{*}_{i} for all dictionary elements.

Finally, since all the dictionary elements have distance at least 1/21/2 (by incoherence), the connected components in LL correctly identifies different dictionary elements. The output AA must be O(1/logm)O^{*}(1/\log m) close to AA^{*}.

Appendix D Sample Complexity

In the previous sections, we analyzed various update rules assuming that the algorithm was given the exact expectation of some matrix-valued random variable. Here we show that these algorithms can just as well use approximations to the expectation (computed by taking a small number of samples). We will focus on analyzing the sample complexity of Algorithm 2, but a similar analysis extends to the other update rules as well.

We first give a generalization of the framework we presented in Section 2 that handles random update direction gsg^{s}.

A random vector gsg^{s} is (α,β,ϵs)(\alpha,\beta,\epsilon_{s})-correlated-whp with a desired solution zz^{*} if with probability at least 1nω(1)1-n^{-\omega(1)},

This is a strong condition as it requires the random vector is well-correlated with the desired solution with very high probability. In some cases we can further relax the definition as the following:

A random vector gsg^{s} is (α,β,ϵs)(\alpha,\beta,\epsilon_{s})-correlated-in-expectation with a desired solution zz^{*} if

We remark that E\/[gs2]\mathop{\bf E\/}[\|g^{s}\|^{2}] can be much larger than E\/[gs]2\|\mathop{\bf E\/}[g^{s}]\|^{2}, and so the above notion is still stronger than requiring (say) that the expected vector E\/[gs]\mathop{\bf E\/}[g^{s}] is (α,β,ϵs)(\alpha,\beta,\epsilon_{s})-correlated with zz^{*}.

In particular, if z0zδ0\|z^{0}-z^{*}\|\leq\delta_{0} and ϵsαo((12αη)s)δ02+ϵ\epsilon_{s}\leq\alpha\cdot o((1-2\alpha\eta)^{s})\delta_{0}^{2}+\epsilon, then the updates converge to zz^{*} geometrically with systematic error ϵ/α\epsilon/\alpha in the sense that

The proof is identical to that of Theorem 2.2 except that we take the expectation of both sides.

D.2 Proof of Theorem 3.2

In order to prove Theorem 3.2, we proceed in two steps. First we show when AsA^{s} is (δs,2)(\delta_{s},2)-near to AA^{*}, the approximate gradient is (α,β,ϵs)(\alpha,\beta,\epsilon_{s})-correlated-whp with optimal solution AA^{*}, with ϵsO(k2/mn)+αo(δs2)\epsilon_{s}\leq O(k^{2}/mn)+\alpha\cdot o(\delta_{s}^{2}). This allows us to use Theorem D.3 as long as we can guarantee the spectral norm of AsAA^{s}-A^{*} is small. Next we show a version of Lemma 4.6 which works even with the random approximate gradient, hence the nearness property is preserved during the iterations. These two steps are formalized in the following two lemmas, and we defer the proofs until the end of the section.

Suppose AsA^{s} is (2δ,2)(2\delta,2)-near to AA^{*} and ηmini(piqi(1δ))=O(m/k)\eta\leq\min_{i}(p_{i}q_{i}(1-\delta))=O(m/k), then g^is\widehat{g}^{s}_{i} as defined in Algorithm 2 is (α,β,ϵs)(\alpha,\beta,\epsilon_{s})-correlated-whp with AiA^{*}_{i} with α=Ω(k/m)\alpha=\Omega(k/m), β=Ω(m/k)\beta=\Omega(m/k) and ϵsαo(δs2)+O(k2/mn)\epsilon_{s}\leq\alpha\cdot o(\delta_{s}^{2})+O(k^{2}/mn).

Suppose AsA^{s} is (δs,2)(\delta_{s},2)-near to AA^{*} with δs=O(1/logn)\delta_{s}=O^{*}(1/\log n), and number of samples used in step ss is p=Ω~(mk)p=\widetilde{\Omega}(mk), then with high probability As+1A^{s+1} satisfies As+1A2A\|A^{s+1}-A^{*}\|\leq 2\|A^{*}\|.

We will prove these lemmas by bounding the difference between g^is\widehat{g}_{i}^{s} and gisg_{i}^{s} using various concentration inequalities. For example, we will use the fact that g^is\widehat{g}_{i}^{s} is close to gisg_{i}^{s} in Euclidean distance.

Suppose AsA^{s} is (δs,2)(\delta_{s},2)-near to AA^{*} with δs=O(1/logn)\delta_{s}=O^{*}(1/\log n), and number of samples used in step ss is p=Ω~(mk)p=\widetilde{\Omega}(mk), then with high probability g^isgisO(k/m)(o(δs)+O(k/n))\|\widehat{g}_{i}^{s}-g_{i}^{s}\|\leq O(k/m)\cdot(o(\delta_{s})+O(\sqrt{k/n})).

Using the above lemma, Lemma D.4 now follows the fact that gisg_{i}^{s} is correlated with AiA^{*}_{i}. The proof of Lemma D.5 mainly involves using matrix Bernstein’s inequality to bound the fluctuation of the spectral norm of As+1A^{s+1}.

The theorem now follows immediately by combining Lemma D.4 and Lemma D.5, and then applying Theorem D.3.

D.3 Sample Complexity for Algorithm 3

For the initialization procedure, when computing the reweighted covariance matrix Mu,vM_{u,v} we can only take the empirical average over samples. Here we show with only Ω~(mk)\widetilde{\Omega}(mk) samples, the difference between the true Mu,vM_{u,v} matrix and the estimated Mu,vM_{u,v} matrix is already small enough.

In Algorithm 3, if p=Ω~(mk)p=\widetilde{\Omega}(mk) then with high probability for any pair u,vu,v consider by Algorithm 3, we have Mu,vM^u,vO(k/mlogn)\|M_{u,v}-\widehat{M}_{u,v}\|\leq O^{*}(k/m\log n).

The proof of this Lemma is deferred to Section D.4.3. notice that although in Algorithm 3, we need to estimate Mu,vM_{u,v} for many pairs uu and vv, the samples used for different pairs do not need to be independent. Therefore we can partition the data into two parts, use the first part to sample pairs u,vu,v, and use the second part to estimate Mu,vM_{u,v}. In this way, we know that for each pair u,vu,v the whole initialization algorithm also takes Ω~(mk)\widetilde{\Omega}(mk) samples. Now we are ready to prove Theorem 5.1.

First of all, the conclusion of Lemma C.2 is still true for M^u,v\widehat{M}_{u,v} when p=Ω~(mk)p=\widetilde{\Omega}(mk). To see this, we could simply write

where E1,E2,E3E_{1},E_{2},E_{3} are the same as the proof of Lemma C.2. We can now view M^u,vMu,v\widehat{M}_{u,v}-M_{u,v} as an additional perturbation term with the same magnitude. We have that when UV={i}U\cap V=\{i\} the top singular vector of Mu,vM_{u,v} is O(1/logn)O^{*}(1/\log n)-close to AiA^{*}_{i}. Similarly, we can prove the conclusion of Lemma C.4 is also true for M^u,v\widehat{M}_{u,v}. Note that we actually choose pp such that the perturbation of M^u,v\widehat{M}_{u,v} matches noise level in Lemma C.4. Finally, the proof of the theorem follows exactly that of the infinite sample case given in Theorem C.1, except that we invoke the finite sample counterparts of Lemma 5.2 and Lemma C.4 that we gave above.

D.4 Proofs of Auxiliary Lemmas

Here we prove Lemma D.4, Lemma D.5, and Lemma D.8 which will follow from various versions of the Bernstein inequality. We first recall Bernstein’s inequality that we are going to use several times in this section. Let ZZ be a random variable (which could be a vector or a matrix) chosen from some distribution D\mathcal{D} and let Z(1),Z(2),...,Z(p)Z^{(1)},Z^{(2)},...,Z^{(p)} be pp independent and identically distributed samples from D\mathcal{D}. Bernstein’s inequality implies that if E\displaylimits[Z]=0\mathop{\mathbf{E}}\displaylimits[Z]=0 and for each jj, Z(j)R\|Z^{(j)}\|\leq R almost surely and E\/[(Z(j))2]σ2\mathop{\bf E\/}[(Z^{(j)})^{2}]\leq\sigma^{2}, then

with high probability. The proofs below will involve computing good bounds on RR and σ2\sigma^{2}. However in our setting, the random variables will not be bounded almost surely. We will use the following technical lemma to handle this issue.

Suppose that the distribution of ZZ satisfies Pr\/[ZR(log(1/ρ))C]1ρ\mathop{\bf Pr\/}[\|Z\|\geq R(\log(1/\rho))^{C}]\leq 1-\rho for some constant C>0C>0, then

If p=nO(1)p=n^{O(1)} then Z(j)O~(R)\|Z^{(j)}\|\leq\widetilde{O}(R) holds for each jj with high probability and

E\/[Z1ZΩ~(R)]=nω(1)\|\mathop{\bf E\/}[Z\mathbf{1}_{\|Z\|\geq\widetilde{\Omega}(R)}]\|=n^{-\omega(1)}.

In particular, if 1pj=1pZ(j)(11Z(j)Ω~(R))\frac{1}{p}\sum_{j=1}^{p}Z^{(j)}(1-\mathbf{1}_{\|Z^{(j)}\|\geq\widetilde{\Omega}(R)}) is concentrated with high probability, then 1pj=1pZ(j)\frac{1}{p}\sum_{j=1}^{p}Z^{(j)} is too.

The first part of the lemma follows from choosing ρ=nlogn\rho=n^{-\log n} and applying a union bound. The second part of the lemma follows from

All of the random variables we consider are themselves products of subgaussian random variables, so they satisfy the tail bounds in the above lemma. In the remaining proofs we will focus on bounding the norm of these variables with high probability.

Since ss is fixed throughout, we will use AA to denote AsA^{s}. Also we fix ii in this proof. Let SS denote the support of xx^{*}. Note that g^i\widehat{g}_{i} is a sum of random variable of the form (yAx)\mboxsgn(xi)(y-Ax)\mbox{sgn}(x_{i}). Therefore we are going to apply Bernstein inequality for proving g^i\widehat{g}_{i} concentrates around its mean gig_{i}. Since Bernstein is typically not tight for sparse random varaibles like in our case. We study the concentration of the random variable Z:=(yAx)\mboxsgn(xi)iSZ:=(y-Ax)\mbox{sgn}(x_{i})\mid i\in S first. We prove the following technical lemma at the end of this section.

Let W={j:i\mboxsupp(x(j))}W=\{j:i\in\mbox{supp}({x^{*(j)}})\} and then we have that

Therefore using Lemma 3.4 we can write g^is\widehat{g}_{i}^{s} (whp) as g^i=g^igi+gi=4α(AisAi)+v\widehat{g}_{i}=\widehat{g}_{i}-g_{i}+g_{i}=4\alpha(A_{i}^{s}-A^{*}_{i})+v with vαAisAi+O(k/m)(o(δs)+O(k/n))\|v\|\leq\alpha\|A_{i}^{s}-A^{*}_{i}\|+O(k/m)\cdot(o(\delta_{s})+O(\sqrt{k/n})). By Lemma 4.2 we have g^i\widehat{g}_{i} is (Ω(k/m),Ω(m/k),o(k/mδs2)+O(k2/mn))(\Omega(k/m),\Omega(m/k),o(k/m\cdot\delta_{s}^{2})+O(k^{2}/mn))-correlated-whp with AiA^{*}_{i}.

Then it suffices to prove Claim 7. To this end, we apply the Bernstein’s inequality stated in equation 5 with the additional technical lemma D.10. We are going to control the maximum norm of ZZ and as well as the variance of ZZ using Claim 8 and Claim 9 as follows:

Z=(yAx)\mboxsgn(xi)O~(μk/n+kδs2+kδs)\|Z\|=\|(y-Ax)\mbox{sgn}(x_{i})\|\leq\widetilde{O}(\mu k/\sqrt{n}+k\delta_{s}^{2}+\sqrt{k}\delta_{s}) holds with high probability

We write yAx=(ASASASTAS)xS=(ASAS)xS+AS(IASTAS)xSy-Ax=(A^{*}_{S}-A_{S}A_{S}^{T}A^{*}_{S})x^{*}_{S}=(A^{*}_{S}-A_{S})x^{*}_{S}+A_{S}(I-A_{S}^{T}A^{*}_{S})x^{*}_{S} and we will bound each term. For the first term, since AA is δs\delta_{s}-close to AA^{*} and SO(k)|S|\leq O(k), we have that ASASFO(δsk)\|A^{*}_{S}-A_{S}\|_{F}\leq O(\delta_{s}\sqrt{k}). And for the second term, we have

Here we have repeatedly used the bound UVFUVF\|UV\|_{F}\leq\|U\|\|V\|_{F} and the fact that AA^{*} is μ\mu incoherent which implies AS2\|A^{*}_{S}\|\leq 2. Recall that the entries in xSx^{*}_{S} are chosen independently of SS and are subgaussian. Hence if MM is fixed then MxSO~(MF)\|Mx^{*}_{S}\|\leq\widetilde{O}(\|M\|_{F}) holds with high probability. And so

which holds with high probability and this completes the proof.

E\displaylimits[Z2]=E\/[(yAx)\mboxsgn(xi)2iS]O(k2δs2)+O(k3/n)\mathop{\mathbf{E}}\displaylimits[\|Z\|^{2}]=\mathop{\bf E\/}[\|(y-Ax)\mbox{sgn}(x_{i})\|^{2}|i\in S]\leq O(k^{2}\delta_{s}^{2})+O(k^{3}/n)

We can again use the fact that yAx=(ASASASTAS)xSy-Ax=(A^{*}_{S}-A_{S}A_{S}^{T}A^{*}_{S})x^{*}_{S} and that xSx^{*}_{S} is conditionally independent of SS with E\displaylimits[xS(xS)T]=I\mathop{\mathbf{E}}\displaylimits[x^{*}_{S}(x^{*}_{S})^{T}]=I and conclude

Then again we write ASASASTASA^{*}_{S}-A_{S}A_{S}^{T}A^{*}_{S} as (ASAS)+AS(Ik×kASTAS)(A^{*}_{S}-A_{S})+A_{S}(I_{k\times k}-A_{S}^{T}A^{*}_{S}), and the bound the Frobenius norm of the two terms separately. First, since AA is δs\delta_{s}-close to AA^{*}, we have that ASASA^{*}_{S}-A_{S} has column-wise norm at most δs\delta_{s} and therefore ASASFkδs\|A^{*}_{S}-A_{S}\|_{F}\leq\sqrt{k}\delta_{s}. Second, note that ASFO(k)\|A_{S}\|_{F}\leq O(\sqrt{k}) since each column of AA has norm 1±δs1\pm\delta_{s}, we have that

D.4.2 Proof of Lemma D.5

We will apply the matrix Bernstein inequality. In order to do this, we need to establish bounds on the spectral norm and on the variance. For the spectral norm bound, we have (yAsx)\mboxsgn(x)T=(yAsx)\mboxsgn(x)=k(yAsx)\|(y-A^{s}x)\mbox{sgn}(x)^{T}\|=\|(y-A^{s}x)\|\|\mbox{sgn}(x)\|=\sqrt{k}\|(y-A^{s}x)\|. We can now use Claim 8 to conclude that (yAsx)O~(k)\|(y-A^{s}x)\|\leq\widetilde{O}(k), and hence (yAsx)\mboxsgn(x)O~(k3/2)\|(y-A^{s}x)\mbox{sgn}(x)\|\leq\widetilde{O}(k^{3/2}) holds with high probability.

For the variance, we need to bound both E\/[(yAsx)\mboxsgn(x)T\mboxsgn(x)(yAsx)T]\mathop{\bf E\/}[(y-A^{s}x)\mbox{sgn}(x)^{T}\mbox{sgn}(x)(y-A^{s}x)^{T}] and E\/[\mboxsgn(x)(yAsx)T(yAsx)\mboxsgn(x)T]\mathop{\bf E\/}[\mbox{sgn}(x)(y-A^{s}x)^{T}(y-A^{s}x)\mbox{sgn}(x)^{T}]. The first term is equal to kE\/[(yAsx)(yAsx)T]k\mathop{\bf E\/}[(y-A^{s}x)(y-A^{s}x)^{T}]. Again, the bound follows from the calculation in Lemma B.1 and we conclude that

Moreover we can now apply the matrix Bernstein inequality and conclude that when the number of samples is at least Ω~(mk)\widetilde{\Omega}(mk) we have

D.4.3 Proof of Lemma D.8

Again in order to apply the matrix Bernstein inequality we need to bound the spectral norm and the variance of each term of the form u,yv,yyyT\langle u,y\rangle\langle v,y\rangle yy^{T} . We make use of the following claim to bound the magnitude of the inner product:

u,yO~(k)|\langle u,y\rangle|\leq\widetilde{O}(\sqrt{k}) and yO~(k)\|y\|\leq\widetilde{O}(\sqrt{k}) hold with high probability

Since u=Aαu=A^{*}\alpha and because α\alpha is kk-sparse and has subgaussian non-zero entries we have that uO~(k)\|u\|\leq\widetilde{O}(\sqrt{k}), and the same bound holds for yy too. Next we write u,y=ASTu,xS|\langle u,y\rangle|=|\langle{A^{*}}_{S}^{T}u,x^{*}_{S}\rangle| where SS is the support of xx^{*}. Moreover for any set SS, we have that

holds with high probability, again because the entries of xSx^{*}_{S} are subgaussian we conclude that u,yO~(k)|\langle u,y\rangle|\leq\widetilde{O}(\sqrt{k}) with high probability.

This implies that u,yv,yyyTO~(k2)\|\langle u,y\rangle\langle v,y\rangle yy^{T}\|\leq\widetilde{O}(k^{2}) with high probability.

E\/[u,y2v,y2yyTyyT]O~(k3/m)\|\mathop{\bf E\/}[\langle u,y\rangle^{2}\langle v,y\rangle^{2}yy^{T}yy^{T}]\|\leq\widetilde{O}(k^{3}/m)

We have that with high probability y2O~(k)\|y\|^{2}\leq\widetilde{O}(k) and u,y2O~(k)\langle u,y\rangle^{2}\leq\widetilde{O}(k), and we can apply these bounds to obtain

On the other hand, notice that E\/[v,y2yyT]=Mv,v\mathop{\bf E\/}[\langle v,y\rangle^{2}yy^{T}]=M_{v,v} and using Lemma 5.2 we have that E\/[v,y2yyT]O(k/m)\|\mathop{\bf E\/}[\langle v,y\rangle^{2}yy^{T}]\|\leq O(k/m). Hence we conclude that the variance term is bounded by O~(k3/m)\widetilde{O}(k^{3}/m).

Now we can apply the matrix Bernstein inequality and conclude that when the number of samples is p=Ω~(mk)p=\widetilde{\Omega}(mk) then

with high probability, and this completes the proof.

Appendix E Neural Implementation

Here we sketch a neural architecture implementing Algorithm 2, essentially mimicking Olshausen-Field (Figure 5 in Olshausen and Field (1997a)), except our decoding rule is much simpler and takes a single neuronal step.

(a) The bottom layer of neurons take input yy and at the top layer neurons output the decoding xx of yy with respect to the current code. The middle layer labeled rr is used for intermediate computation. The code AA is stored as the weights between the top and middle layer on the synapses. Moreover these weights are set via a Hebbian rule, and upon receiving a new sample yy, we update the weight AijA_{ij} on the synapses by the product of the values of the two endpoint neurons, xjx_{j} and rir_{i}.

(b) The top layer neurons are equipped with a threshold function. The middle layer ones are equipped with simple linear functions (no threshold). The bottom layer can only be changed by stimulation from outside the system.

(c) We remark that the updates require some attention to timing, which can be accomplished via spike timing. In particular, when a new image is presented, the value of all neurons are updated to a (nonlinear) function of the weighted sum of the values of its neighbors with weights on the corresponding synapses. The execution of the network is shown at the right hand side of the figure. Upon receiving a new sample at time t=0t=0, the values of bottom layer are set to be yy and all the other layers are reset to zero. At time t=1t=1, the values in the middle layer are updated by the weighted sum of their neighbors, which is just yy. Then at time t=2t=2, the top layer obtains the decoding xx of yy by calculating thresholdC/2(ATy)\textrm{threshold}_{C/2}(A^{T}y). At time t=3t=3 the middle layer calculates the residual error yAxy-Ax and then at time t=4t=4 the synapse weights that store AA are updated via Hebbian rule (update proportional to the product of the endpoint values). Repeating this process with many images indeed implements Algorithm 2 and succeeds provided that the network is appropriately initialized to A0A^{0} that is (δ,2)(\delta,2) near to AA^{*}.

Here we sketch a neural implementation of Algorithm 5.1. This algorithm uses simple operations that can be implemented in neurons and composed, however unlike the above implementation we do not know of a two layer network, but note that this procedure need only be performed once so it need not have a particularly fast or short neural implementation.

(a) It is standard to compute the inner products y,u\langle y,u\rangle and y,v\langle y,v\rangle using neurons, and even the top singular vector can be computed using the classic Oja’s Rule Oja (1982) in an online manner where each sample yy is received sequentially. There are also generalizations to computing other principle components Oja (1992). However, we only need the top singular vector and the top two singular values.

(b) Also, the greedy clustering algorithm which preserves a single estimate zz in each equivalence class of vectors that are O(1/logm)O^{*}(1/\log m)-close (after sign flips) can be implemented using inner products. Finally, projecting the estimate A~\widetilde{A} onto the set B\mathcal{B} may not be required in real life (or even for correctness proof), but even if it is it can be accomplished via stochastic gradient descent where the gradient again makes use of the top singular vector of the matrix.