Signal Recovery from Pooling Representations

Joan Bruna, Arthur Szlam, Yann LeCun

Introduction

The purpose of the pooling layer in each module is to give invariance to the system, perhaps at the expense of resolution. This is done via a summary statistic over the outputs of groups of nodes. In the trained system, the columns of the weight matrix corresponding to nodes grouped together often exhibit similar characteristics, and code for perturbations of a template (Kavukcuoglu et al., 2009; Hyvärinen and Hoyer, 2001).

where as usual, if pp\rightarrow\infty, this is

If the outputs of the nonlinearity are nonnegative (as for the half rectification function), then p=1p=1 corresponds to average pooling, and the case p=p=\infty is max pooling.

2 Phase reconstruction

3 What vs. Where

If the columns of the weight matrix in a pool correspond to related features, it can be reasonable to see the entire pool as a “what”. That is, the modulus of the pool indicates the relative presence of a grouping of (sub)features into a template, and the phase of the pool describes the relative arrangement of the subfeatures, describing “where” the template is, or more generally, describing the “pose” of the template.

From this viewpoint, phase reconstruction results make rigorous the notion that given enough redundant versions of “what” and throwing away the “where”, we can still recover the “where”.

4 Contributions of this work

Injectivity and Lipschitz stability of Pooling Operators

This section studies necessary and sufficient conditions guaranteeing that pooling representations are invertible. It also computes upper and lower Lipschitz bounds, which are tight under certain configurations.

where αk\alpha_{k} contains the coordinates IkI_{k} of α\alpha.

We shall measure the stability of the inverse pooling with the Euclidean distance in the representation space. Given a distance d(x,x)d(x,x^{\prime}) in the input space, the Lipschitz bounds of a given operator Φ(x)\Phi(x) are defined as the constants 0AB0\leq A\leq B such that

In the remainder of the paper, given a frame F{\mathcal{F}}, we denote respectively by λ(F)\lambda_{-}({\mathcal{F}}) and λ+(F)\lambda_{+}({\mathcal{F}}) its lower and upper frame bounds. If F{\mathcal{F}} has MM vectors and Ω{1,,M}\Omega\subset\{1,\dots,M\}, we denote FΩ{\mathcal{F}}_{\Omega} the frame obtained by keeping the vectors indexed in Ω\Omega. Finally, we denote Ωc{\Omega}^{c} the complement of Ω\Omega.

In order to study the injectivity of pooling representations, we first focus on the properties of the operators defined by the point-wise nonlinearities.

have been extensively studied in the literature (Balan et al., 2006; Balan and Wang, 2013), in part motivated by applications to speech processing (Achan et al., 2003) or X-ray crystallography (Ohlsson, 2013). It is shown in (Balan et al., 2006) that if m>2n1m>2n-1 then it is possible to recover xx from M(x)M(x), up to a global sign change. In particular, (Balan and Wang, 2013) recently characterized the stability of the phaseless operator, that is summarized in the following proposition:

In particular, M(x)M(x) is injective if and only if for any subset Ω{1,,M}\Omega\subseteq\{1,\dots,M\}, either FΩ{\mathcal{F}}_{\Omega} or FΩc{\mathcal{F}}_{\Omega^{c}} is an invertible frame.

A frame F{\mathcal{F}} satisfying the previous condition is said to be phase retrievable.

We now turn our attention to the half-rectification operator, defined as

For that purpose, let us introduce some extra notation. Given a frame F={f1,,fM}{\mathcal{F}}=\{f_{1},\dots,f_{M}\}, a subset Ω{1M}\Omega\subset\{1\dots M\} is admissible if

We denote by Ω\overline{\Omega} the collection of all admissible sets, and VΩV_{\Omega} the vector space generated by Ω\Omega. The following proposition, proved in Section B, gives a necessary and sufficient condition for the injectivity of the half-rectification.

Let A_{0}=\min_{\Omega\in\overline{\Omega}}\lambda_{-}({\left.\kern-1.2pt{\mathcal{F}}_{\Omega}\vphantom{\big{|}}\right|_{V_{\Omega}}}). Then the half-rectification operator Mα(x)=ρα(FTx)M_{\alpha}(x)=\rho_{\alpha}({\mathcal{F}}^{T}x) is injective if and only if A0>0A_{0}>0. Moreover, it satisfies

with B0=maxΩΩλ+(FΩ)λ+(F)B_{0}=\max_{\Omega\in\overline{\Omega}}\lambda_{+}({\mathcal{F}}_{\Omega})\leq\lambda_{+}({\mathcal{F}}).

The half-rectification has the ability to recover the input signal, without the global sign ambiguity. The ability to reconstruct from MαM_{\alpha} is thus controlled by the rank of any matrix FΩ{\mathcal{F}}_{\Omega} whose columns are taken from a subset belonging to Ω\overline{\Omega}. In particular, if α0\alpha\equiv 0, since ΩΩΩcΩ\Omega\in\overline{\Omega}\Rightarrow\Omega^{c}\in\overline{\Omega}, it results that m2nm\geq 2n is necessary in order to have A0>0A_{0}>0.

The rectified linear operator creates a partition of the input space into polytopes, defined by (8), and computes a linear operator on each of these regions. A given input xx activates a set ΩxΩ\Omega_{x}\in\overline{\Omega}, encoded by the sign of the linear measurements x,fiαi\langle x,f_{i}\rangle-\alpha_{i}.

As opposed to the absolute value operator, the inverse of MαM_{\alpha}, whenever it exists, can be computed directly by locally inverting a linear operator. Indeed, the coordinates of Mα(x)M_{\alpha}(x) satisfying Mα(x)j>αjM_{\alpha}(x)_{j}>\alpha_{j} form a set s(x)s(x), which defines a linear model Fs(x){\mathcal{F}}_{s(x)} which is invertible if A0>0A_{0}>0.

In order to compare the stability of the half-rectification versus the full rectification, one can modify MαM_{\alpha} so that it maps xx and x-x to the same point. If one considers

It results that the bi-Lipschitz bounds of the half-rectification operator, when considered in under the equivalence xxx\sim-x, are controlled by the bounds of the absolute value operator, up to a factor 21/22^{-1/2}. However, the lower Lipschitz bound (11) consists in a minimum taken over a much smaller family of elements than in (5).

From its definition, it follows that pooling operators PpP_{p} and RpR_{p} can be expressed respectively as a function of phaseless and half-rectified operators, which implies that for the pooling to be invertible, it is necessary that the absolute value and rectified operators are invertible too. Naturally, the converse is not true.

Q2{\mathcal{Q}}_{2} thus contains all the orthogonal bases of each subspace Fk{\mathcal{F}}_{k}.

The following proposition, proved in section B, computes upper and lower bounds of the Lipschitz constants of P2P_{2}.

This proposition thus generalizes the results from (Cahill et al., 2013), since it shows that A2>0A_{2}>0 not only controls when P2P_{2} is invertible, but also controls the stability of the inverse mapping.

Proposition 2.4 and Corollary 2.5 give a lower Lipschitz bound which gives sufficient guarantees for the inversion of pooling representations. Corollary 2.5 indicates that, in the case d=2d=2, the lower Lipschitz bounds are sharper than the non-rectified case, in consistency with the results of section 2.1. The general case d>2d>2 remains an open issue.

We give in this section sufficient and necessary conditions such that the max-pooling operator PP_{\infty} is injective, and we compute a lower bound of its lower Lipschitz constant.

and as a result the input space is divided into a collection of Voronoi cells defined from linear equations. Restricted to each cone Cs{\mathcal{C}}_{s}, the max-pooling operator computes the phaseless mapping M(x)M(x) from equation (3) corresponding to Fs=(fs1,,fsK){\mathcal{F}}_{s}=(f_{s_{1}},\dots,f_{s_{K}}).

Given vectors uu and vv, as usual, set the angle θ(u,v)=arccos(u,vuv).\theta(u,v)=\arccos\left(\frac{|\langle u,v\rangle|}{\|u\|\|v\|}\right). For each s,sSs,s^{\prime}\in{\mathcal{S}} such that CsCs={\mathcal{C}}_{s}\cap{\mathcal{C}}_{s^{\prime}}=\emptyset and for each Ω{1K}\Omega\subset\{1\dots K\}, we define

This is a modified first principal angle between the subspaces {\left.\kern-1.2pt{\mathcal{F}}_{s}\vphantom{\big{|}}\right|_{\Omega}} and {\left.\kern-1.2pt{\mathcal{F}}_{s^{\prime}}\vphantom{\big{|}}\right|_{\Omega}}, where the infimum is taken only on the directions included in the respective cones. Set \Lambda_{s,s^{\prime},\Omega}=\lambda_{-}({\left.\kern-1.2pt{\mathcal{F}}\vphantom{\big{|}}\right|_{\Omega}})\cdot\sin(\beta(s,s^{\prime},\Omega)).

Given ss, ss^{\prime}, we also define J(s,s)={k;sk=sk}{\mathcal{J}}(s,s^{\prime})=\{k\,;\,s_{k}=s^{\prime}_{k}\}. Recall LL is the size of each pool. Set

The following proposition, proved in section B, gives a lower Lipschitz bound of the max-pooling operator.

For all xx and xx^{\prime}, the max-pooling operator PP_{\infty} satisfies

where d(x,x)=min(xx,x+x)d(x,x^{\prime})=\min(\|x-x^{\prime}\|,\|x+x^{\prime}\|).

Propostion 2.6 shows that the lower Lipschitz bound is controlled by two different phenomena. The first one depends upon how the cones corresponding to disjoint switches are aligned, whereas the second one depends on the internal incoherence of each frame FJ(s,s){\mathcal{F}}_{{\mathcal{J}}(s,s^{\prime})}. One may ask how do these constants evolve in different asymptotic regimes. For example, if one lets the number of pools KK be fixed but increases the size of each pool by increasing MM. In that case, the set of possible switches S{\mathcal{S}} increases, and each cone Cs{\mathcal{C}}_{s} gets smaller. The bound corresponding to CsCs{\mathcal{C}}_{s}\cap{\mathcal{C}}_{s^{\prime}}\neq\emptyset decreases since the infimum is taken over a larger family. However, as the cones Cs{\mathcal{C}}_{s} become smaller, the likelihood that any pair xxx\neq x^{\prime} share the same switches decreases, thus giving more importance to the case CsCs={\mathcal{C}}_{s}\cap{\mathcal{C}}_{s^{\prime}}=\emptyset. Although the ratio 1L\frac{1}{L} decreases, the lower frame bounds λ(FΩ)2\lambda_{-}({\mathcal{F}}_{\Omega})^{2}, λ(FΩc)2\lambda_{-}({\mathcal{F}}_{\Omega^{c}})^{2} will in general increase linearly with LL. The lower bound will thus mainly be driven by the principal angles β(s,s,Ω)\beta(s,s^{\prime},\Omega). Although the minimum in (28) is taken over a larger family, each angle is computed over a smaller region of the space, suggesting that one can indeed increase the size of each pool without compromising the injectivity of the max-pooling.

Another asymptotic regime considers pools of fixed size LL and increases the number of pools KK. In that case, the bound increases as long as the principal angles remain lower bounded.

We also consider the stability of max-pooling with a half-rectification. By redefining the switches s(x)s(x) accordingly:

the following proposition, proved in section B, computes a lower bound of the Lipschitz constant of RR_{\infty}.

The rectified max-pooling operator RR_{\infty} satisfies

defined using the cones Cs{\mathcal{C}}_{s} obtained from (18).

3 Random Pooling Operators

What is the minimum amount of redundancy needed to invert a pooling operator? As in previous works on compressed sensing (Candes and Tao, 2004) and phase recovery (Balan et al., 2006), one may address this question by studying random pooling operators. In this case, the lower Lipschitz bounds derived in previous sections can be shown to be positive with probability 11 given appropriate parameters KK and LL.

The size of the pools LL does not influence the injectivity of random pooling, but it affects the stability of the inverse, as shown in proposition 2.6. The half-rectified case requires extra care, since the set of admissible switches Ω\overline{\Omega} might contain frames with M<NM<N columns with non-zero probability, and is not considered in the present work.

Numerical Experiments

A greedy method for recovering the phase from the modulus of complex measurements is given in (Gerchberg and Saxton, 1972); this method naturally extends to the case at hand. As above, denote the frame {f1,...,fM}=F\{f_{1},...,f_{M}\}={\mathcal{F}}, let Fk{\mathcal{F}}_{k} be the frame vectors in the kkth block, and set IkI_{k} to be the indices of the kkth block. Let F(1){\mathcal{F}}^{(-1)} be the pseudoinverse of F{\mathcal{F}}; set (Pp(x))k=Fkxp(P_{p}(x))_{k}=||{\mathcal{F}}_{k}x||_{p}. Starting with an initial signal x0x^{0}, update

yIk(n)=(Pp(x))kFkx(n)Fkx(n)py^{(n)}_{I_{k}}=(P_{p}(x))_{k}\frac{{\mathcal{F}}_{k}x^{(n)}}{||{\mathcal{F}}_{k}x^{(n)}||_{p}}, k=1Kk=1\dots K,

In the experiments below, we will use random, Gaussian i.i.d. F{\mathcal{F}}, but also we will use the outputs of dictionary learning with block sparsity. The F{\mathcal{F}} generated this way is not really a frame, as the condition number of a trained dictionary on real data is often very high. In this case, we will renormalize each data point to have norm 11, and modify the update x(n+1)=F(1)y(n)x^{(n+1)}={\mathcal{F}}^{(-1)}y^{(n)} to

x(n+1)=argminx2=1Fxy(n)2x^{(n+1)}=arg\min\limits_{||x||_{2}=1}||{\mathcal{F}}x-y^{(n)}||^{2}.

In practice, this modification might not always be possible, since the norm x\|x\| is not explicitly presented in PpP_{p}. However, in the classical setting of Fourier measurements and positive xx, this information is available. Moreover, our empirical experience has been that using this regularization on well conditioned analysis dictionaries offers no benefit; in particular, it gives no benefit with random analysis matrices.

1.2 Phaselift and Phasecut

1.3 Nearest neighbors regression

We would like to use the same basic algorithm for all settings to get an idea of the relative difficulty of the recovery problem for different pp, but also would like an algorithm with good recovery performance. If our algorithm simply returns poor results in each case, differences between the cases might be masked.

The alternating minimization can be very effective when well initialized. When given a training set of the data to recover, we use a simple regression to find x0x_{0}. Fix a number of neighbors qq (in the experiments below we use q=10q=10, and suppose XX is the training set). Set G=Pp(X)G=P_{p}(X), and for a new point xx to recover from Pp(x)P_{p}(x), find the qq nearest neighbors in GG of Pp(x)P_{p}(x), and take their principal component to serve as x0x_{0} in the alternating minimization algorithm. We use the fast neighbor searcher from (Vedaldi and Fulkerson, 2008) to accelerate the search.

2 Experiments

We discuss results on the MNIST dataset, available at http://yann.lecun.com/exdb/mnist/, and on 16×1616\times 16 patches drawn from the VOC dataset, available at http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2012/. For each of these data sets, we run experiments with random dictionaries and adapted dictionaries. We also run experiments where the data and the dictionary are both Gaussian i.i.d.; in this case, we do not use adapted dictionaries.

In the experiments with adapted dictionaries, the dictionary is built using block OMP and batch updates with a K-SVD type update (Aharon et al., 2006); in this case, F{\mathcal{F}} is the transpose of the learned dictionary. We again use groups of size 22 in the adapted dictionary experiments.

We run two sets of experiments with Gaussian i.i.d. data and dictionaries, with n=20n=20 and n=40n=40. We consider mm in the range from n/2n/2 to 8n8n. On this data set, phaselift outperforms alternating minimization; see the supplementary material.

We draw approximately 5 million 16×1616\times 16 grayscale image patches from the PASCAL VOC data set; these are sorted by variance, and the largest variance 1 million are kept. The mean is removed from each patch. These are split into a training set of 900000 patches and a test set of 100000 patches. In this experiment, we let mm range from 30 to 830. On this data set, by m=330m=330 measurements, alternating minimization with nearest neighbor initialization recovers mean angle of .97.97; for comparison, Phaselift at m=330m=330 has mean angle of .39.39; see the supplementary material.

3 Analysis

The experiments show (see figures 1 and 2) that:

For every data set, with random initializations and dictionaries, recovery is easier with half rectification before pooling than without (green vs dark blue in figures).

Good initialization improves performance; indeed, alternating minimization with nearest neighbor regression outperforms phaselift and phasecut (which of course do not have the luxury of samples from the prior, as the regressor does). We believe this of independent interest.

Adapted analysis “frames” (with regularization) are easier to invert than random analysis frames, with or without regularization (the bottom row of each pair of graphs vs the top row of each pair in Figure 2).

Each of these conclusions is unfortunately only true up to the optimization method- it may be true that a different optimizer will lead to different results. With learned initializations and alternating minimization, recovery results can be better without half rectification. Note this is only up until the point where the alternating minimization gets better than the learned initialization without any refinement, and is especially true for random dictionaries. The simple interpretation is that the reconstruction step 2 of the alternating minimization just does not have a large enough span with roughly half the entries removed; that is, this is an effect of the optimization, not of the difference between the difficulty of the problems.

Conclusion

However, we are not anywhere near where we would like to be in understanding these systems, or even the invertibility of the layers of these systems. This work gives little direct help to a practicioner asking the question “how should I design my network?”. In particular, our results just barely touch on the distribution of the data; but the experiments make it clear (see also (Ohlsson et al., 2012)) that knowing more information about the data changes the invertibility of the mappings. Moreover, preservation of information needs to be balanced against invariance, and the tension between these is not discussed in this work. Even in the setting of this work, without consideration of the data distribution or tension with invariance, Proposition 2.4 although tight, is not easy to use, and even though we are able to use 2.6 to get an invertibility result, it is probably not tight.

References

Appendix A Comparison between phaselift and the various alternating minimization algorithms

In figure of 3, we use random data and a random dictionary. As the data has no structure, we only compare against random initialization, with and without half rectification. We can see from figure 3 in this case, where we do not know a good way to initialize the alternating minimization, alternating minimization is significantly worse than phasecut. On the other hand, recovery after rectified pooling with alternating minimization does almost as well as phasecut.

In the examples where we have training data, shown in figure 4, alternating minimization with the nearest neighbor regressor (red curve) performs significantly better than phasecut (green and blue curves). Of course phasecut does not get the knowledge of the data distribution used to generate the regressor.

Appendix B Proofs of results in Section 2

which in particular implies that xx is known to lie in VS(x)V_{S(x)}, the subspace generated by s(x)s(x). But the restriction Fs(x){\mathcal{F}}_{s(x)} is a linear operator, which can be inverted in VSV_{S} as long as \lambda_{-}({\left.\kern-1.2pt{\mathcal{F}}_{s(x)}\vphantom{\big{|}}\right|_{V_{S}}})\geq A_{0}>0.

Let us now show that A0>0A_{0}>0 is also necessary. Let us suppose that for some SS, FS{\mathcal{F}}_{S} is such that \lambda_{-}({\left.\kern-1.2pt{\mathcal{F}}_{S}\vphantom{\big{|}}\right|_{V_{S}}})=0. It results that there exists ηVS\eta\in V_{S} such that η>0\|\eta\|>0 but FSη=0\|{\mathcal{F}}_{S}\eta\|=0. Since SS is a cone, we can find xSx\in S and ϵ0\epsilon\neq 0 small enough such that x+ϵeSx+\epsilon e\in S. It results that Mα(x)=Mα(x+ϵe)M_{\alpha}(x)=M_{\alpha}(x+\epsilon e) which implies that MαM_{\alpha} cannot be injective.

Finally, let us prove (9). If x,xx,x^{\prime} are such that S=s(x)=s(x)S=s(x)=s({x^{\prime}}), then

If s(x)s(x)s(x)\neq s(x^{\prime}), we have that Mα(x)iMα(x)i=xx,fi|M_{\alpha}(x)_{i}-M_{\alpha}(x^{\prime})_{i}|=|\langle x-x^{\prime},f_{i}\rangle| if is(x)s(x)i\in s(x)\cap s({x^{\prime}}) and Mα(x)iMα(x)ixx,fi|M_{\alpha}(x)_{i}-M_{\alpha}(x^{\prime})_{i}|\geq|\langle x-x^{\prime},f_{i}\rangle| if is(x)s(x)i\in s(x)\cup s({x^{\prime}}), is(x)s(x)i\notin s(x)\cap s({x^{\prime}}). It results that

B.2 Proof of Proposition 2.4

The upper Lipschitz bound is obtained by observing that, in dimension dd,

which implies, by denoting M(x)=(x,f~k,j)k,jM(x)=(|\langle x,\widetilde{f}_{k,j}\rangle|)_{k,j}, that P2(x)P2(x)=M(x)M(x)\|P_{2}(x)-P_{2}(x^{\prime})\|=\|M(x)-M(x^{\prime})\|. Since F~Q2\widetilde{{\mathcal{F}}}\in{\mathcal{Q}}_{2}, it results from Proposition 2.1 that

B.3 Proof of Corollary 2.5

Given x,xx,x^{\prime}, let II denote the groups IkI_{k}, kKk\leq K such that SxSxIk=IkS_{x}\cap S_{x^{\prime}}\cap I_{k}=I_{k}. It results that

Finally, the upper Lipschitz bound is obtained by noting that

and using the same argument as in (B.2) \square.

B.4 Proof of Proposition 2.6

by Proposition 2.1 and by definition (LABEL:mpool_bobo).

We shall concentrate in each restriction independently. Since {\left.\kern-1.2pt{\mathcal{F}}_{s(x^{\prime})}\vphantom{\big{|}}\right|_{\Omega}}(x^{\prime})\in{\left.\kern-1.2pt{\mathcal{F}}_{s(x^{\prime})}\vphantom{\big{|}}\right|_{\Omega}}({\mathcal{C}}_{s(x^{\prime})}), it results that

it results, assuming without loss of generality that all pools have equal size ( Ik=MK)|I_{k}|=\frac{M}{K}),

Equivalently, since {\left.\kern-1.2pt{\mathcal{F}}_{s(x)}\vphantom{\big{|}}\right|_{\Omega}}(x)\in{\left.\kern-1.2pt{\mathcal{F}}_{s(x)}\vphantom{\big{|}}\right|_{\Omega}}({\mathcal{C}}_{s(x)}) we also have

By aggregating the bound for Ω\Omega and Ωc\Omega^{c} we obtain (28) \square.

These results easily extend to the so-called Maxout operator [Goodfellow et al., 2013], defined as xMO(x)={maxjIkx,fj;k=1K}x\mapsto MO(x)=\{\max_{j\in I_{k}}\langle x,f_{j}\rangle\,;\,k=1\dots K\}. By redefining the switches of xx as

the following corollary computes a Lower Lipschitz bound of MO(x)MO(x):

The Maxout operator MOMO satisfies (19) with A(s,s)A(s,s^{\prime}) defined using the switches (27).

It results that P1(x;F)P(x;F~)P_{1}(x;\mathcal{F})\equiv P_{\infty}(x;\widetilde{\mathcal{F}}), with

Each pool F~k\widetilde{{\mathcal{F}}}_{k} can be rewritten as F~k=HLFk\widetilde{{\mathcal{F}}}_{k}=H_{L}{\mathcal{F}}_{k}, where HLH_{L} is the L×2LL\times 2^{L} Hadamard matrix whose rows contain the ϵ\epsilon vectors. One can verify that HLTHL=2L1H_{L}^{T}H_{L}=2^{L}{\bf 1}, which implies that for any Ω{1K}\Omega\subseteq\{1\dots K\}, \lambda_{-}({\left.\kern-1.2pt\widetilde{{\mathcal{F}}}\vphantom{\big{|}}\right|_{\Omega}})=2^{L/2}\lambda_{-}({\left.\kern-1.2pt{{\mathcal{F}}}\vphantom{\big{|}}\right|_{\Omega}}). It results that

where d(x,x)=min(xx,x+x)d(x,x^{\prime})=\min(\|x-x^{\prime}\|,\|x+x^{\prime}\|) and

with s,ss,s^{\prime} and β(s,s)\beta(s,s^{\prime}) are defined on the frame F~\widetilde{\mathcal{F}}.

B.5 Proof of Corollaries 2.7 and B.1

The result follows immediately from Proposition 2.6, by replacing the phaseless invertibility condition of Propostion 2.1 by the one in Proposition 2.2. \square.

B.6 Proof of Proposition 2.8

Proposition 2.8 also extends to the maxout case. We restate it here with the extra result:

PpP_{p} is injective (modulo xxx\sim-x) if K4NK\geq 4N for p=1,p=1,\infty, and if K2N1K\geq 2N-1 for p=2p=2.

The Maxout operator MOMO is injective if K2N+1K\geq 2N+1.

Let us now prove the case p=2p=2. We start drawing a random basis for each of the pools F1,,FKF_{1},\dots,F_{K}. From proposition 2.4, it follows that we have to check that if M2NM\geq 2N, the quantity

with probability 11. If M2N1M\geq 2N-1, it follows that either Ω\Omega has the property that it intersects at least NN pools, either Ωc\Omega^{c} intersects NN pools. Say it is Ω\Omega. Now, for each pool with nonzero intersection, say FkF_{k}, we have that

for some fk,jf_{k,j} belonging to the initial random basis of FkF_{k}. It results that

where FF^{*} is a subset of NN columns of the original frame F{\mathcal{F}}, which means

Appendix C Notes on changes from cycle 1

The mathematical results have been essentially rewritten, for clarity as well as to sharpen the bounds. The proofs are now in the supplementary material, as requested by the reviewers. We have used the extra space to expand the indroduction, conclusion, and intro to the experiments, in part to to explain the connections between the theoretical and experimental parts of the paper, as requested by the reviewers. We also added results on the invertibility of random modules.

We have edited the text in the experiments section and in the captions of the figures to clarify them. Each curve is described in the caption and the text; the graphs are also now specifically referenced in the analysis bullets in section 3.3.

The introduction and conclusion more explicitly address take messages. Note that the take home message is not of the form “this is how to design a network”, but rather, “these conditions allow (stable) inversion”. We are sympathetic to the reviewers desire for a take home message giving insight into the actual design of networks for practical applications. That is, of course, the ultimate goal of a mathematical analysis of a learning algorithm. However, if the standard for theoretical papers analyzing deep models is that they lead immediately to design suggestions with associated performance increases on benchmarks, it is unlikely that there will ever be a mature enough theory to give honest design suggestions.

Finally, we reprint larger versions of the figures below.