Identity Matters in Deep Learning

Moritz Hardt, Tengyu Ma

Introduction

Traditional convolutional neural networks for image classification, such as AlexNet (), are parameterized in such a way that when all trainable weights are , a convolutional layer represents the -mapping. Moreover, the weights are initialized symmetrically around 0.0. This standard parameterization makes it non-trivial for a convolutional layer trained with stochastic gradient methods to preserve features that were already good. Put differently, such convolutional layers cannot easily converge to the identity transformation at training time.

This shortcoming was observed and partially addressed by through batch normalization, i.e., layer-wise whitening of the input with a learned mean and covariance. But the idea remained somewhat implicit until residual networks (; ) explicitly introduced a reparameterization of the convolutional layers such that when all trainable weights are 0,0, the layer represents the identity function. Formally, for an input x,x, each residual layer has the form x+h(x),x+h(x), rather than h(x).h(x). This simple reparameterization allows for much deeper architectures largely avoiding the problem of vanishing (or exploding) gradients. Residual networks, and subsequent architectures that use the same parameterization, have since then consistently achieved state-of-the-art results on various computer vision benchmarks such as CIFAR10 and ImageNet.

In this work, we consider identity parameterizations from a theoretical perspective, while translating some of our theoretical insight back into experiments. Loosely speaking, our first result underlines how identity parameterizations make optimization easier, while our second result shows the same is true for representation.

In analogy with residual networks, we will instead parameterize the objective function as

Here, A\|A\| denotes the spectral norm of A.A. The constant factor depends on the conditioning of R.R. We give the formal statement in Theorem 2.1. The theorem has the interesting consequence that as the depth increases, smaller norm solutions exist and hence regularization may offset the increase in parameters.

Universal finite sample expressivity.

Going back to non-linear residual networks with ReLU activations, we can ask: How expressive are deep neural networks that are solely based on residual layers with ReLU activations? To answer this question, we give a very simple construction showing that such residual networks have perfect finite sample expressivity. In other words, a residual network with ReLU activations can easily express any functions of a sample of size n,n, provided that it has sufficiently more than nn parameters. Note that this requirement is easily met in practice. On CIFAR 10 (n=50000n=50000), for example, successful residual networks often have more than 10610^{6} parameters. More formally, for a data set of size nn with rr classes, our construction requires O(nlogn+r2)O(n\log n+r^{2}) parameters. Theorem 3.2 gives the formal statement.

The power of all-convolutional residual networks.

2 Related Work

Since the advent of residual networks (; ), most state-of-the-art networks for image classification have adopted a residual parameterization of the convolutional layers. Further impressive improvements were reported by with a variant of residual networks, called dense nets. Rather than adding the original input to the output of a convolutional layer, these networks preserve the original features directly by concatenation. In doing so, dense nets are also able to easily encode an identity embedding in a higher-dimensional space. It would be interesting to see if our theoretical results also apply to this variant of residual networks.

There has been recent progress on understanding the optimization landscape of neural networks, though a comprehensive answer remains elusive. Experiments in and suggest that the training objectives have a limited number of bad local minima with large function values. Work by draws an analogy between the optimization landscape of neural nets and that of the spin glass model in physics (). showed that 22-layer neural networks have no bad differentiable local minima, but they didn’t prove that a good differentiable local minimum does exist. and show that linear neural networks have no bad local minima. In contrast, we show that the optimization landscape of deep linear residual networks has no bad critical point, which is a stronger and more desirable property. Our proof is also notably simpler illustrating the power of re-parametrization for optimization. Our results also indicate that deeper networks may have more desirable optimization landscapes compared with shallower ones.

Optimization landscape of linear residual networks

We will analyze the landscape of the population risk, defined as,

Recall that Ai\lVert A_{i}\rVert is the spectral norm of AiA_{i}. We define the norm {\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|} for the tensor AA as the maximum of the spectral norms of its slices,

We first note that the condition det(R)>0\det(R)>0 is without loss of generality in the following sense. Given any linear transformation RR with negative determinant, we can effectively flip the determinant by augmenting the data and the label with an additional dimension: let x=[x,b]x^{\prime}=[x,b] and y=[y,b]y^{\prime}=[y,-b] , where bb is an independent random variable (say, from standard normal distribution), and let R=[R001]R^{\prime}=\begin{bmatrix}R&0\\ 0&-1\end{bmatrix}. Then, we have that y=Rx+ξy^{\prime}=R^{\prime}x^{\prime}+\xi and det(R)=det(R)>0\det(R^{\prime})=-\det(R)>0.When the dimension is odd, there is an easier way to see this – flipping the label corresponds to flipping RR, and we have det(R)=det(R)\det(-R)=-\det(R).

Second, we note that here γ\gamma should be thought of as a constant since if RR is too large (or too small), we can scale the data properly so that σmin(R)1σmax(R)\sigma_{\min}(R)\leq 1\leq\sigma_{\max}(R). Concretely, if σmax(R)/σmin(R)=κ\sigma_{\max}(R)/\sigma_{\min}(R)=\kappa, then we can scaling for the outputs properly so that σmin(R)=1/κ\sigma_{\min}(R)=1/\sqrt{\kappa} and σmax(R)=κ\sigma_{\max}(R)=\sqrt{\kappa}. In this case, we have γ=logκ\gamma=\log\sqrt{\kappa}, which will remain a small constant for fairly large condition number κ\kappa. We also point out that we made no attempt to optimize the constant factors here in the analysis. The proof of Theorem 2.1 is rather involved and is deferred to Section A.

Given the observation of Theorem 2.1, we restrict our attention to analyzing the landscape of f()f(\cdot) in the set of AA with {\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}-norm less than τ\tau,

For any τ<1\tau<1, we have that any critical point AA of the objective function f()f(\cdot) inside the domain Bτ\mathcal{B}_{\tau} must also be a global minimum.

Theorem 2.2 suggests that it is sufficient for the optimizer to converge to critical points of the population risk, since all the critical points are also global minima.

Moreover, in addition to Theorem 2.2, we also have that any AA inside the domain Bτ\mathcal{B}_{\tau} satisfies that

Equation (2.3) says that the gradient has fairly large norm compared to the error, which guarantees convergence of the gradient descent to a global minimum () if the iterates stay inside the domain Bτ,\mathcal{B}_{\tau}, which is not guaranteed by Theorem 2.2 by itself.

Towards proving Theorem 2.2, we start off with a simple claim that simplifies the population risk. We use F\lVert\cdot\rVert_{F} to denote the Frobenius norm of a matrix, and A,B\langle A,B\rangle denotes the inner product of AA and BB in the standard basis (that is, A,B=tr(AB)\langle A,B\rangle=\textup{tr}(A^{\top}B) where tr()\textup{tr}(\cdot) denotes the trace of a matrix.)

Here CC is a constant that doesn’t depend on AA, and Σ1/2\Sigma^{1/2} denote the square root of Σ\Sigma, that is, the unique symmetric matrix BB that satisfies B2=ΣB^{2}=\Sigma.

Next we compute the gradients of the objective function f()f(\cdot) from straightforward matrix calculus. We defer the full proof to Section A.

The gradients of f()f(\cdot) can be written as,

Now we are ready to prove Theorem 2.2. The key observation is that each matric AjA_{j} has small norm and cannot cancel the identity matrix. Therefore, the gradients in equation (2.5) is a product of non-zero matrices, except for the error matrix EE. Therefore, if the gradient vanishes, then the only possibility is that the matrix EE vanishes, which in turns implies AA is an optimal solution.

Therefore we complete the proof of equation (2.3). Finally, if AA is a critical point, namely, f(A)=0\nabla f(A)=0, then by equation (2.3) we have that f(A)=Coptf(A)=C_{\textup{opt}}. That is, AA is a global minimum. ∎

Representational Power of the Residual Networks

A residual network is composed of a sequence of such residual blocks. In comparison with the full pre-activation architecture in , we remove two batch normalization layers and one ReLU layer in each building block.

We assume that for every 1i<jn1\leq i<j\leq n, we have x(i)x(j)2ρ\lVert x^{(i)}-x^{(j)}\rVert^{2}\geq\rho for some absolute constant ρ>0.\rho>0.

Images, for example, can always be imperceptibly perturbed in pixel space so as to satisfy this assumption for a small but constant ρ.\rho.

Under this mild assumption, we prove that residual networks have the power to express any possible labeling of the data as long as the number of parameters is a logarithmic factor larger than nn.

Suppose the training examples satisfy Assumption 3.1. Then, there exists a residual network NN (specified below) with O(nlogn+r2)O(n\log n+r^{2}) parameters that perfectly expresses the training data, i.e., for all i{1,,n},i\in\{1,\dots,n\}, the network NN maps x(i)x^{(i)} to y(i).y^{(i)}.

It is common in practice that n>r2,n>r^{2}, as is for example the case for the Imagenet data set where n>106n>10^{6} and r=1000.r=1000.

We briefly sketch the proof of the Lemma to provide intuitions, and defer the full proof to Section B. The operation that each residual block applies to the hidden variable can be abstractly written as,

where hh corresponds to the hidden variable before the block and h^\hat{h} corresponds to that after. We claim that for an (almost) arbitrary sequence of vectors h(1),,h(n)h^{(1)},\dots,h^{(n)}, there exists TU,V,s()\mathcal{T}_{U,V,s}(\cdot) such that operation (3.5) transforms kk vectors of h(i)h^{(i)}’s to an arbitrary set of other kk vectors that we can freely choose, and maintain the value of the rest of nkn-k vectors. Concretely, for any subset SS of size kk, and any desired vector v(i)(iS)v^{(i)}(i\in S), there exist U,V,sU,V,s such that

Power of all-convolutional residual networks

Inspired by our theory, we experimented with all-convolutional residual networks on standard image classification benchmarks.

We trained our models with the Tensorflow framework, using a momentum optimizer with momentum 0.9,0.9, and batch size is 128128. All convolutional weights are trained with weight decay 0.0001.0.0001. The initial learning rate is 0.05,0.05, which drops by a factor 1010 and 3000030000 and 5000050000 steps. The model reaches peak performance at around 50k50k steps, which takes about 24h24h on a single NVIDIA Tesla K40 GPU. Our code can be easily derived from an open source implementationhttps://github.com/tensorflow/models/tree/master/resnet by removing batch normalization, adjusting the residual components and model architecture. An important departure from the code is that we initialize a residual convolutional layer of kernel size k×kk\times k and cc output channels using a random normal initializer of standard deviation σ=1/k2c,\sigma=1/k^{2}c, rather than 1/kc1/k\sqrt{c} used for standard convolutional layers. This substantially smaller weight initialization helped training, while not affecting representation.

A notable difference from standard models is that the last layer is not trained, but simply a fixed random projection. On the one hand, this slightly improved test error (perhaps due to a regularizing effect). On the other hand, it means that the only trainable weights in our model are those of the convolutions, making our architecture “all-convolutional”.

An interesting aspect of our model is that despite its massive size of 13.5913.59 million trainable parameters, the model does not seem to overfit too quickly even though the data set size is 50000.50000. In contrast, we found it difficult to train a model with batch normalization of this size without significant overfitting on CIFAR10.

Table 2 summarizes the top-11 classification error of our models compared with a non-exhaustive list of previous works, restricted to the best previous all-convolutional result by , the first residual results , and state-of-the-art results on CIFAR by . All results are with standard data augmentation.

2 ImageNet

The ImageNet ILSVRC 2012 data set has 1,281,1671,281,167 data points with 10001000 classes. Each image is resized to 224×224224\times 224 pixels with 33 channels. We experimented with an all-convolutional variant of the 3434-layer network in . The original model achieved 25.03%25.03\% classification error. Our derived model has 35.7M35.7M trainable parameters. We trained the model with a momentum optimizer (with momentum 0.90.9) and a learning rate schedule that decays by a factor of 0.940.94 every two epochs, starting from the initial learning rate 0.1.0.1. Training was distributed across 66 machines updating asynchronously. Each machine was equipped with 88 GPUs (NVIDIA Tesla K40) and used batch size 256256 split across the 88 GPUs so that each GPU updated with batches of size 32.32.

In contrast to the situation with CIFAR10 and CIFAR100, on ImageNet our all-convolutional model performed significantly worse than its original counterpart. Specifically, we experienced a significant amount of underfitting suggesting that a larger model would likely perform better.

Despite this issue, our model still reached 35.29%35.29\% top-11 classification error on the test set (5000050000 data points), and 14.17%14.17\% top-55 test error after 700,000700,000 steps (about one week of training). While no longer state-of-the-art, this performance is significantly better than the 40.7%40.7\% reported by , as well as the best all-convolutional architecture by . We believe it is quite likely that a better learning rate schedule and hyperparameter settings of our model could substantially improve on the preliminary performance reported here.

Conclusion

Our theory underlines the importance of identity parameterizations when training deep artificial neural networks. An outstanding open problem is to extend our optimization result to the non-linear case where each residual has a single ReLU activiation as in our expressivity result. We conjecture that a result analogous to Theorem 2.2 is true for the general non-linear case. Unlike with the standard parameterization, we see no fundamental obstacle for such a result.

We hope our theory and experiments together help simplify the state of deep learning by aiming to explain its success with a few fundamental principles, rather than a multitude of tricks that need to be delicately combined. We believe that much of the advances in image recognition can be achieved with residual convolutional layers and ReLU activations alone. This could lead to extremely simple (albeit deep) architectures that match the state-of-the-art on all image classification benchmarks.

Acknowledgment: We thank Jason D. Lee, Qixing Huang, and Jonathan Shewchuk for helpful discussions and kindly pointing out errors in earlier versions of the paper. We also thank Jonathan Shewchuk for suggesting an improvement of equation (2.3) that is incorporated into the current version. Tengyu Ma would like to thank the support by Dodds Fellowship and Siebel Scholarship.

References

Appendix A Missing Proofs in Section 2

In this section, we give the complete proofs for Theorem 2.1 and Lemma 2.4, which are omitted in Section 2.

We see that the network defined by AA^{\star} reconstruct the transformation RR, and therefore it’s a global minimum of the population risk (formally see Claim 2.3 below). Next, we verify that each of the AjA^{\star}_{j} has small spectral norm:

Towards fully proving the Theorem 2.1, we start with the following Claim:

We first consider the case when QQ is a rotation. Each rotation matrix can be written as T(θ):=[cosθsinθsinθcosθ]T(\theta):=\begin{bmatrix}\cos\theta&-\sin\theta\\ \sin\theta&\cos\theta\end{bmatrix}. Suppose Q=T(θ)Q=T(\theta). Then we can take W1==Wq=T(θ/q)W_{1}=\dots=W_{q}=T(\theta/q) and Λ=I\Lambda=I. We can verify that

Next, we consider the case when QQ is a reflection. Then we have that QQ can be written as Q=T(θ)diag(1,1)Q=T(\theta)\cdot\operatorname{diag}(-1,1), where diag(1,1)\operatorname{diag}(-1,1) is the reflection with respect to the yy-axis. Then we can take W1==Wq=T(θ/q)W_{1}=\dots=W_{q}=T(\theta/q) and Λ=diag(1,1)\Lambda=\operatorname{diag}(-1,1) and complete the proof. ∎

Next we give the formal full proof of Theorem 2.1. The main idea is to reduce to the block diagonal situation and to apply the Claim above.

Let R=UKVR=UKV^{\top} be the singular value decomposition of RR, where UU,VV are two orthonormal matrices and KK is a diagonal matrix with nonnegative entries on the diagonal. Since det(R)=det(U)det(K)det(V)>0\det(R)=\det(U)\det(K)\det(V)>0 and det(K)>0\det(K)>0, we can flip U,VU,V properly so that det(U)=det(V)=1\det(U)=\det(V)=1. Since UU is a normal matrix (that is, UU satisfies that UU=UUUU^{\top}=U^{\top}U), by Claim C.1, we have that UU can be block-diagonalized by orthonormal matrix SS into U=SDS1U=SDS^{-1}, where D=diag(D1,,Dm)D=\operatorname{diag}(D_{1},\dots,D_{m}) is a real block diagonal matrix with each block DiD_{i} being of size at most 2×22\times 2. Using Claim A.1, we have that for any DiD_{i}, there exists Wi,1,,Wi,q,ΛiW_{i,1},\dots,W_{i,q},\Lambda_{i} such that

and Wi,jIπ/q\lVert W_{i,j}-I\rVert\leq\pi/q. Let Λ=diag(Λ1,,Λm)\Lambda=\operatorname{diag}(\Lambda_{1},\dots,\Lambda_{m}) and Wj=diag(W1,j,Wm,j)W_{j}=\operatorname{diag}(W_{1,j},\dots W_{m,j}). We can rewrite equation (A.2) as

Moreover, we have that Λ\Lambda is a diagonal matrix with ±1\pm 1 on the diagonal. Since Wi,jW_{i,j}’s are orthonormal matrix with determinant 1, we have det(Λ)=det(D)=det(U)=1\det(\Lambda)=\det(D)=\det(U)=1. That is, Λ\Lambda has an even number of 1-1’s on the diagonal. Then we can group the 1-1’s into 2×22\times 2 blocks. Note that [1001]\begin{bmatrix}-1&0\\ 0&-1\end{bmatrix} is the rotation matrix T(π)T(\pi). Thus we can write Λ\Lambda as a concatenation of +1+1’s on the diagonal and block T(π)T(\pi). Then applying Claim A.1 (on each of the block T(π)T(\pi)), we obtain that there are W1,,WqW_{1}^{\prime},\dots,W_{q}^{\prime} such that

where WjIπ/q\lVert W_{j}^{\prime}-I\rVert\leq\pi/q. Thus using equation (A.3) and (2.3), we obtain that

Moreover, we have that for every jj, SWjS1I=S(WjI)S1=WjIπ/q\lVert SW_{j}S^{-1}-I\rVert=\lVert S(W_{j}-I)S^{-1}\rVert=\lVert W_{j}-I\rVert\leq\pi/q, because SS is an orthonormal matrix. The same can be proved for WjW_{j}^{\prime}. Thus let Bj=SWjS1IB_{j}=SW_{j}S^{-1}-I for jqj\leq q and Bj+q=SWjS1IB_{j+q}=SW_{j}^{\prime}S^{-1}-I, and we can rewrite,

We can deal with VV similarly by decomposing VV into 2q2q matrices that are π/q\pi/q close to identity matrix,

Last, we deal with the diagonal matrix KK. Let K=diag(ki)K=\operatorname{diag}(k_{i}). We have minki=σmin(R),maxki=σmax(R)\min k_{i}=\sigma_{\min}(R),\max k_{i}=\sigma_{\max}(R). Then, we can write K=(K)pK=(K^{\prime})^{p} where K=diag(ki1/p)K^{\prime}=\operatorname{diag}(k_{i}^{1/p}) and pp is an integer to be chosen later. We have that KImaxki1/p1maxelogki1/p1\left\lVert K^{\prime}-I\right\rVert\leq\max|k_{i}^{1/p}-1|\leq\max|e^{\log k_{i}\cdot 1/p}-1|. When pγ=max{logmaxki,logminki}=max{logσmax(R),logσmin(R)}p\geq\gamma=\max\{\log\max k_{i},-\log\min k_{i}\}=\max\{\log\sigma_{\max}(R),-\log\sigma_{\min}(R)\}, we have that

A.2 Proof of Lemma 2.4

Appendix B Missing Proofs in Section 3

In this section, we provide the full proof of Theorem 3.2. We start with the following Lemma that constructs a building block T\mathcal{T} that transform kk vectors of an arbitrary sequence of nn vectors to any arbitrary set of vectors, and main the value of the others. For better abstraction we use α(i)\alpha^{(i)},β(i)\beta^{(i)} to denote the sequence of vectors.

which is a different way of writing equation (3.6).

Without loss of generality, suppose S={1,,k}S=\{1,\dots,k\}. We construct U,V,sU,V,s as follows. Let the ii-th row of UU be α(i)\alpha^{(i)} for i[k]i\in[k], and let s=(12ρ)1s=-(1-2\rho^{\prime})\cdot\mathbf{1} where 1\mathbf{1} denotes the all 1’s vector. Let the ii-column of VV be 1α(i)2(12ρ)(β(i)α(i))\frac{1}{\lVert\alpha^{(i)}\rVert^{2}-(1-2\rho^{\prime})}(\beta^{(i)}-\alpha^{(i)}) for i[k]i\in[k].

Next we verify that the correctness of the construction. We first consider 1ik1\leq i\leq k. We have that Uα(i)U\alpha^{(i)} is a a vector with ii-th coordinate equal to α(i)21ρ\lVert\alpha^{(i)}\rVert^{2}\geq 1-\rho^{\prime}. The jj-th coordinate of Uα(i)U\alpha^{(i)} is equal to α(j),α(i)\langle\alpha^{(j)},\alpha^{(i)}\rangle, which can be upperbounded using the assumption of the Lemma by

Therefore, this means Uα(i)(12ρ)1U\alpha^{(i)}-(1-2\rho^{\prime})\cdot\mathbf{1}contains a single positive entry (with value at least α(i)2(12ρ)ρ\lVert\alpha^{(i)}\rVert^{2}-(1-2\rho^{\prime})\geq\rho^{\prime}), and all other entries being non-positive. This means that ReLu(Uα(i)+b)=(α(i)2(12ρ))ei\textup{ReLu}(U\alpha^{(i)}+b)=\left(\lVert\alpha^{(i)}\rVert^{2}-(1-2\rho^{\prime})\right)e_{i} where eie_{i} is the ii-th natural basis vector. It follows that VReLu(Uα(i)+b)=(α(i)2(12ρ))Vei=β(i)α(i)V\textup{ReLu}(U\alpha^{(i)}+b)=(\lVert\alpha^{(i)}\rVert^{2}-(1-2\rho^{\prime}))Ve_{i}=\beta^{(i)}-\alpha^{(i)}.

Finally, consider ni>kn\geq i>k. Then similarly to the computation in equation (B.1), Uα(i)U\alpha^{(i)} is a vector with all coordinates less than 12ρ1-2\rho^{\prime}. Therefore Uα(i)+bU\alpha^{(i)}+b is a vector with negative entries. Hence we have ReLu(Uα(i)+b)=0\textup{ReLu}(U\alpha^{(i)}+b)=0, which implies VReLu(Uα(i)+b)=0V\textup{ReLu}(U\alpha^{(i)}+b)=0. ∎

Now we are ready to state the formal version of Lemma 3.3.

We will use Lemma B.1 repeatedly to construct building blocks TAj,Bk,sj()\mathcal{T}_{A_{j},B_{k},s_{j}}(\cdot), and thus prove Lemma B.2. Each building block TAj,Bk,sj()\mathcal{T}_{A_{j},B_{k},s_{j}}(\cdot) takes a subset of kk vectors among {z(1),,z(n)}\{z^{(1)},\dots,z^{(n)}\} and convert them to v(i)v^{(i)}’s, while maintaining all other vectors as fixed. Since they are totally n/kn/k layers, we finally maps all the z(i)z^{(i)}’s to the target vectors v(i)v^{(i)}’s.

We use Lemma B.1 repeatedly. Let S1=[1,,k]S_{1}=[1,\dots,k]. Then using Lemma B.1 with α(i)=z(i)\alpha^{(i)}=z^{(i)} and β(i)=v(i)\beta^{(i)}=v^{(i)} for i[n]i\in[n], we obtain that there exists A1,B1,b1A_{1},B_{1},b_{1} such that for iki\leq k, it holds that h1(i)=z(i)+TA1,B1,b1(z(i))=v(i)h_{1}^{(i)}=z^{(i)}+\mathcal{T}_{A_{1},B_{1},b_{1}}(z^{(i)})=v^{(i)}, and for iki\geq k, it holds that h1(i)=z(i)+TA1,B1,b1(z(i))=z(i)h_{1}^{(i)}=z^{(i)}+\mathcal{T}_{A_{1},B_{1},b_{1}}(z^{(i)})=z^{(i)}.

Now we construct the other layers inductively. We will construct the layers such that the hidden variable at layer jj satisfies hj(i)=v(i)h_{j}^{(i)}=v^{(i)} for every 1ijk1\leq i\leq jk, and hj(i)=z(i)h_{j}^{(i)}=z^{(i)} for every ni>jkn\geq i>jk. Assume that we have constructed the first jj layer and next we use Lemma B.1 to construct the j+1j+1 layer. Then we argue that the choice of α(1)=v(1),,α(jk)=v(jk)\alpha^{(1)}=v^{(1)},\dots,\alpha^{(jk)}=v^{(jk)}, α(jk+1)=z(jk+1),,α(n)=z(n)\alpha^{(jk+1)}=z^{(jk+1)},\dots,\alpha^{(n)}=z^{(n)}, and S={jk+1,,(j+1)k}S=\{jk+1,\dots,(j+1)k\} satisfies the assumption of Lemma B.1. Indeed, because qiq_{i}’s are chosen uniformly randomly, we have w.h.p for every ss and ii, qs,z(i)1ρ\langle q_{s},z^{(i)}\rangle\leq 1-\rho^{\prime}. Thus, since v(i){q1,,qr}v^{(i)}\in\{q_{1},\dots,q_{r}\}, we have that v(i)v^{(i)} also doesn’t correlate with any of the z(i)z^{(i)}. Then we apply Lemma B.1 and conclude that there exists Aj+1=U,Bj+1=V,bj+1=sA_{j+1}=U,B_{j+1}=V,b_{j+1}=s such that TAj+1,bj+1,bj+1(v(i))=0\mathcal{T}_{A_{j+1},b_{j+1},b_{j+1}}(v^{(i)})=0 for ijki\leq jk, TAj+1,bj+1,bj+1(z(i))=v(i)z(i)\mathcal{T}_{A_{j+1},b_{j+1},b_{j+1}}(z^{(i)})=v^{(i)}-z^{(i)} for jk<i(j+1)kjk<i\leq(j+1)k, and TAj+1,bj+1,bj+1(z(i))=0\mathcal{T}_{A_{j+1},b_{j+1},b_{j+1}}(z^{(i)})=0 for ni>(j+1)kn\geq i>(j+1)k. These imply that

Now we ready to prove Theorem 3.2, following the general plan sketched in Section 3.

Appendix C Toolbox

In this section, we state two folklore linear algebra statements. The following Claim should be known, but we can’t find it in the literature. We provide the proof here for completeness.

where DD is a real block diagonal matrix that consists of blocks with size at most 2×22\times 2.

The following Claim is used in the proof of Theorem 2.2. We provide a proof here for completeness.

Since σmin(A)2\sigma_{\min}(A)^{2} is the smallest eigenvalue of AAA^{\top}A, we have that

Taking square root of both sides completes the proof. ∎