Statistical guarantees for the EM algorithm: From population to sample-based analysis

Sivaraman Balakrishnan, Martin J. Wainwright, Bin Yu

Introduction

Data problems with missing values, corruptions, and latent variables are common in practice. From a computational standpoint, computing the maximum likelihood estimate (MLE) in such incomplete data problems can be quite complex. To a certain extent, these concerns have been assuaged by the development of the expectation-maximization (EM) algorithm, along with growth in computational resources. The EM algorithm is widely applied to incomplete data problems, and there is now a very rich literature on its behavior (e.g., ). However, a major issue is that in most models, although the MLE is known to have good statistical properties, the EM algorithm is only guaranteed to return a local optimum. The goal of this paper is to address this potential gap between statistical and computational guarantees in application of the EM algorithm.

The EM algorithm has a lengthy and rich history. Various algorithms of the EM-type were analyzed in early work (e.g.,), before Dempster et al. introduced the EM algorithm in its modern general form. Among other results, they established its well-known monotonicity properties. The subsequent work of Wu established some of the most general convergence results known for the EM algorithm; see also the more recent papers . Together with other results, Wu showed that if the likelihood is unimodal and certain regularity conditions hold, then the EM algorithm converges to the unique global optimum. However, in most interesting cases of the EM algorithm, the likelihood function is multi-modal, in which case the behavior of the EM algorithm remains a little more mysterious. Indeed, despite its popularity and widespread practical effectiveness, the EM algorithm is often considered a “sensible heuristic” with little or no theoretical backing.

One interesting observation with the EM algorithm is given a “suitable” initialization, it often converges to a statistically useful estimate. For instance, in application to a mixture of regressions problem (see Section 2.2.2 for more details), Chaganty and Liang empirically demonstrate good performance for a two-stage estimator, in which the method of methods is used as an initialization, and then the EM algorithm is applied to refine this initial estimator. Although encouraging, this type of behavior is not well understood in a quantitative sense, especially how EM fixed points reached by this type of two-stage estimator are related to the global maximizers of the population likelihood. The goal of this paper is to address this question, and to develop some general tools for characterizing fixed points of the suitably initialized sample-based EM algorithm, and their relation to maximum likelihood estimates.

Some two-stage estimators have recently been analyzed in work on alternating minimization algorithms (see e.g. ) which show that at least in certain special cases optimization methods can be locally effective despite non-convexity. Most directly related to our work is the paper of Yi et al. which considers a special (degenerate) noiseless case of the EM algorithm for the mixtures of regressions problem. Results for the noisy mixtures of regressions problem follow from our general treatment of the EM algorithm (see Section 2.2.2).

In some settings, performing an exact M-step is computationally burdensome, in which case a natural alternative is some form of generalized EM updates. In such an algorithm, instead of performing an exact maximization, we simply choose a parameter value that does not decrease the likelihood. In addition to the standard EM updates, we also analyze a particular case of such an algorithm, known as gradient EM, based on a taking a single gradient step per iteration.

Our main results concern the population EM and gradient EM algorithms and their finite-sample counterparts. Our first set of results (Theorems 1 and 3) give conditions under which the population algorithms are contractive to the MLE, in a ball around the MLE. These results are completely deterministic. This population-level analysis is based on viewing these algorithms as perturbed versions of certain “oracle” algorithms which are known to be contractive around the MLE. Our second set of results (Theorem 2, Theorem 4 and Theorem 5) concern the sample-based EM and gradient EM algorithms which approximate the population-based algorithms using a subset of samples at each step. We give conditions under which these sample operators converge to an ε\varepsilon-ball around the population MLE. These results involve probabilistic bounds on the deviations between the iterates of the population and sample-based algorithms.

The remainder of this paper is organized as follows. Section 2 provides an introduction to the EM and gradient EM algorithms, as well as a description of the three examples treated in detail in this paper—namely, Gaussian mixture models (Section 2.2.1), mixture of regressions (Section 2.2.2), and regression with missing covariates (Section 2.2.3). Section 3 is devoted to our general convergence results on both the EM and gradient EM algorithms. In Section 4, we revisit the three model classes previously introduced, and illustrate the use of our general theory by deriving some concrete corollaries. In concrete examples our theory gives a characterization of the quality of initialization needed and the rate of convergence of the EM and gradient EM algorithms. We complement these theoretical results with simulations that confirm various aspects of the theoretical predictions. In order to promote readability, we defer the more technical aspects of proofs to the appendices.

Background and model examples

We begin with basic background on the EM algorithm and its variants, along with a number of specific models that we revisit later in the paper.

Let YY and ZZ be random variables taking values in the sample spaces Y\mathcal{\mathcal{Y}} and Z\mathcal{\mathcal{Z}}, respectively. Suppose that the pair (Y,Z)(Y,Z) has a joint density function fθf_{\theta^{*}} that belongs to some parameterized family {fθθΩ}\{f_{\theta}\,\mid\,\theta\in\Omega\}, for a non-empty compact convex set Ω\Omega. Rather than observing the complete data (Y,Z)(Y,Z), we observe only component YY. Thus, the component ZZ corresponds to the missing or latent structure in the data.

Our goal is to obtain an estimate of the unknown parameter θ\theta^{*} via maximum likelihood—namely, to compute some θ^Ω\widehat{\theta}\in\Omega maximizing the function θgθ(y)\theta\mapsto g_{\theta}(y), where

is the density function of the observed variable YY. Throughout this paper, we assume that θ\theta^{*} is a maximizer of the population likelihood, but not that θ\theta^{*} is a unique maximizer. Uniqueness is often violated in mixture models for which parameters are typically only identifiable up to permutation. In the examples that we consider, this non-identifiability will be resolved by appropriate initialization conditions.

In many settings, it can be difficult or computationally expensive to evaluate the log likelihood of the observed data, but relatively easy to compute the log likelihood logfθ(y,z)\log f_{\theta}(y,z) of both the latent and observed variables. The EM algorithm is well-suited to such settings. For each θΩ\theta\in\Omega, let kθ(zy)k_{\theta}(z\mid y) denote the conditional density of zz given yy. A straightforward application of Jensen’s inequality then shows that the log likelihood at θΩ\theta^{\prime}\in\Omega can be lower bounded as

with equality holding when θ=θ\theta=\theta^{\prime}. Thus, we have a family of lower bounds on the log likelihood, and the EM algorithm successively maximizes this lower bound (MM-step), and then reevaluates the lower bound at the new parameter value (EE-step).

With this notation, it is easy to specify the EM iterations. The update θtθt+1\theta^{t}\rightarrow\theta^{t+1} consists of the following two steps.

E-step: Evaluate the expectation in equation (2) to compute Q(θt)Q(\cdot|\theta^{t}).

M-step: Compute the maximizer θt+1=argmaxθΩQ(θθt)\theta^{t+1}=\arg\max\limits_{\theta^{\prime}\in\Omega}Q(\theta^{\prime}|\theta^{t}).

For future use, it is convenient to introduce the mapping M:ΩΩM:\Omega\rightarrow\Omega given by

With this choice, the MM-step corresponds to the update θt+1=M(θt)\theta^{t+1}=M(\theta^{t}).

In a generalized EM algorithm, the requirements of the MM-step are relaxed: instead of finding the exact optimum, the algorithm is required only to find a value θt+1Ω\theta^{t+1}\in\Omega such that

Depending on how θt+1\theta^{t+1} is chosen, this requirement actually defines a family of algorithms.

A closely related variant of the generalized EM updates is what we refer to as the gradient EM updates, applicable in the case when the function Q(θt)Q(\cdot|\theta^{t}) is differentiable at each iteration tt. Given a step size α>0\alpha>0, these updates take the form

where the gradient is taken in the first argument of QQ. For ease of notation, we define the mapping G ⁣:ΩΩG\colon\Omega\rightarrow\Omega by

An iteration of gradient EM can now be written compactly as θt+1=G(θt)\theta^{t+1}=G(\theta^{t}).

There is a natural extension that includes a constraint arising from the parameter space Ω\Omega, in which the update is projected back onto the constraint set To avoid pathologies additionally assume that the constraint set is closed.. For simplicity, we focus on unconstrained problems in this paper, but all of our results extend in a straightforward way to constrained examples by incorporating the additional Euclidean projection. For appropriate choices of the step size parameter α\alpha, the gradient EM updates guarantee the ascent condition (4), so that it is a particular case of a generalized EM algorithm.

Let us now make an important distinction, namely, that between the population and sample-based versions of the EM updates. Up to this point, we have suppressed dependence on the number of observed samples nn. The population form of the (gradient) EM updates are an “oracle version”, in which we effectively observe an infinite number of samples, and consequently, the function Q(θ)Q(\cdot|\theta) takes the form

From here onwards, we use the notation MM and GG for the EM and gradient EM operators, respectively, both defined at the population level.

In the classical statistical settings, we observe only nn i.i.d. samples {yi}i=1n\{y_{i}\}_{i=1}^{n} of the YY component. Under the i.i.d. assumption, we define the function

so that the expectation over YY in equation (7) is replaced by the empirical expectation defined by the samples. The function QnQ_{n} defines an analog of the population EM operator (3), namely

In an analogous fashion, we define the sample-based analog of the gradient EM operator (6), namely

where α>0\alpha>0 is an appropriately chosen step size parameter.

2 Illustrative examples

The EM algorithm is popular and a variety of examples can be found in the literature. In this section, we review three specific models analyzed in this paper, and derive the form of the population and sample-based updates, both for the usual EM algorithm and the gradient EM algorithm.

An isotropic, balanced two-component Gaussian mixture model can be specified by a density of the form

Suppose that we are given nn i.i.d. samples {yi}i=1n\{y_{i}\}_{i=1}^{n} drawn from the mixture density (11). The complete data {(yi,zi)}i=1n\{(y_{i},z_{i})\}_{i=1}^{n} corresponds to the original samples along with the component indicator variables zi{0,1}z_{i}\in\{0,1\}. The sample-based function QnQ_{n} takes the form

where w_{\theta}(y):=e^{-\frac{\|\theta-y\|_{2}^{2}}{2\sigma^{2}}}\Big{[}e^{-\frac{\|\theta-y\|_{2}^{2}}{2\sigma^{2}}}+e^{-\frac{\|\theta+y\|_{2}^{2}}{2\sigma^{2}}}\Big{]}^{-1}.

where the empirical expectation has been replaced by expectation under the mixture distribution (11).

On the other hand, the sample-based and population gradient EM operators with step size α>0\alpha>0 are given by

We return to analyze the EM updates for the Gaussian mixture model in Section 4.1.

2.2 Mixture of regressions

On the other hand, the gradient EM operators are given by

where α>0\alpha>0 is a step size parameter.

We return to analyze the EM updates for the mixture of regressions model in Section 4.2.

2.3 Linear regression with missing covariates

Our first two examples involved mixture models in which the class membership variable was hidden. Another canonical use of the EM algorithm is in cases with corrupted or missing data. In this section, we consider a particular instantiation of such a problem, namely that of linear regression with the covariates missing completely at random.

where ρ[0,1)\rho\in[0,1) is the probability of missingness.

In writing all these expressions, we have assumed that the coordinates are permuted so that the missing values are in the first block.

We now have the necessary notation in place to describe the EM and gradient EM updates. For a given parameter θ\theta, the EM update is based on maximizing

Again, this optimization problem has an explicit solution, so that the sample-based EM operator is given by

On the other hand, the gradient EM algorithm with step size α\alpha takes the form

We return to analyze the gradient EM updates for this model in Section 4.3.

General convergence results

We now turn to analysis of the EM algorithm and gradient EM algorithms. In both cases, we let θ{\theta^{\ast}} denote a maximizer of the population likelihood. In this section, we give general sufficient conditions under which the population algorithms converge to θ{\theta^{\ast}} and under which the sample-based algorithms converge to an ε\varepsilon-ball around θ{\theta^{\ast}}. Our analysis of each algorithm is organized as follows:

Our second result in Sections 3.1 and 3.2 concern the sample-based EM and gradient EM operators. These sample-based operators approximate the population-based update using a subset of samples at each step. Theorem 2 and Theorem 4 for sample-based EM and gradient EM, respectively give conditions under which the sample-based operator is guaranteed to converge to an ε\varepsilon-ball around the fixed point θ{\theta^{\ast}}. These results involve probabilistic bounds on the deviations between the population-based and sample-based operators. In addition, for gradient EM, we also analyze a stochastic update that uses a single sample per update in the flavor of stochastic approximation algorithms (see Theorem 5 in Section 3.2.3).

Let us begin with analysis of the standard EM updates, starting with the population version before turning to a sample-based version.

Recall that we always assume that the vector θ{\theta^{\ast}} maximizes the population likelihood. It is a classical fact that it must then satisfy the condition

a property known as self-consistency. For this reason, the function q():=Q(θ)q(\cdot):=Q(\cdot|{\theta^{\ast}}) plays an important role in our analysis.

We assume throughout this section that the function qq is λ\lambda-strongly concave, meaning that

for all pairs (θ1,θ2)(\theta_{1},\theta_{2}) in a neighborhood of θ\theta^{*}. As we will illustrate, this condition holds in most concrete instantiations of EM, including the three model classes introduced in the previous section.

For any fixed θ\theta, in order to relate the population EM updates to the fixed point θ{\theta^{\ast}}, we require control on the two gradient mappings q()=Q(θ)\nabla q(\cdot)=\nabla Q(\cdot|{\theta^{\ast}}) and Q(θ)\nabla Q(\cdot|\theta). These mappings are central in characterizing the fixed point θ{\theta^{\ast}} and the update M(θ)M(\theta) respectively. Indeed, by virtue of the self-consistency property (24) and the convexity of Ω\Omega, the fixed point satisfies the first-order optimality condition

Similarly, for any θΩ\theta\in\Omega, since M(θ)M(\theta) maximizes the function θQ(θθ)\theta^{\prime}\mapsto Q(\theta^{\prime}|\theta), we have

Equations (26) and (27) are sets of inequalities that characterize the points M(θ)M(\theta) and θ{\theta^{\ast}}. Thus, at an intuitive level, in order to establish that M(θ)M(\theta) and θ{\theta^{\ast}} are close, it suffices to verify that these two characterizations are close in a suitable sense. We also note that inequalities similar to the condition (27) are often used as a starting point in the classical analysis of M-estimators (e.g., see van de Geer ). In the analysis of EM, we obtain additional leverage from the self-consistency condition (24) that characterizes θ\theta^{*}.

With this intuition in mind, we introduce the following regularity condition in order to relate conditions (27) and (24): The condition involves a Euclidean ball of radius rr around the fixed point θ{\theta^{\ast}}, given by

As a concrete example, recall the Gaussian mixture model first introduced in Section 2.2.1. For this model, the condition (29) is equivalent to

where wθw_{\theta} was previously defined following equation (12). Given that the function θwθ(y)\theta\mapsto w_{\theta}(y) is smooth in θ\theta, provided that γ\gamma is not too small, it is reasonable to expect that this condition will hold in a neighborhood of θ\theta^{*}, and we confirm this intuition in Corollary 1 to follow.

Under the conditions we have introduced, the following result guarantees that the population EM operator is locally contractive:

Since both M(θ)M(\theta) and θ{\theta^{\ast}} are in Ω\Omega, we may apply condition (26) with θ=M(θ)\theta^{\prime}=M(\theta) and condition (27) with θ=θ\theta^{\prime}={\theta^{\ast}}. Doing so, adding the resulting inequalities and then performing some algebra yields the condition

Now the λ\lambda-strong concavity condition (25) implies that the left-hand side is lower bounded as

Combining inequalities (32a) and (32b) with the original bound (31) yields

and canceling terms completes the proof. ∎

1.2 Guarantees for sample-based EM

We now turn to theoretical results on sample-based versions of the EM algorithm. More specifically, we consider two forms of the EM algorithm, the first being the standard form in which the operator Mn:ΩΩM_{n}:\Omega\rightarrow\Omega, as previously defined (9), is applied repeatedly, thereby generating the sequence θt+1=Mn(θt)\theta^{t+1}=M_{n}(\theta^{t}). We also analyze a sample-splitting From a practical point, a potential advantage of sample splitting is that each iteration may be cheaper, since it is based on a smaller sample size. In contrast, a disadvantage is that it can be difficult to correctly specify the number of iterations in advance. version of the EM algorithm, in which given a total of nn samples and TT iterations, we divide the full data set into TT subsets of size n/T\lfloor n/T\rfloor, and then perform the updates θt+1=Mn/T(θt)\theta^{t+1}=M_{n/T}(\theta^{t}), using a fresh subset of samples at each iteration.

with probability at least 1δ1-\delta. With these definitions, we have the following guarantees:

If the sample size nn is large enough to ensure that

For a given iteration number TT, suppose the sample size nn is large enough to ensure that

Figure 1 provides an illustration of the behavior predicted by Theorem 2: both algorithms are expected to show geometric convergence to the target parameter θ\theta^{*}, up to some tolerance. For the bound (35b) note that the first term is decreasing in tt, whereas the second term is independent of tt. Thus, for a fixed sample size nn, the bounds in Theorem 2 suggests a reasonable choice of the number of iterations. In particular, focusing on the standard EM algorithm, consider any positive integer As will be clarified in the sequel, such a choice of TT exists in various concrete models considered here. TT such that

This choice ensures that the first term in the bound (35b) is dominated by the second term, and hence that

with probability at least 1δ1-\delta. For the sample-splitting update in (36b) the first term is decreasing in tt, whereas the second term is increasing in tt. In this case, a similar conclusion holds when TT is chosen to be the smallest positive integer such that

Let us now turn to the proof of the theorem.

We give a detailed proof of the claim (36b), from which it will be clear that the claim (35b) follows by a nearly identical argument. For any iteration s{1,2,,T}s\in\{1,2,\ldots,T\}, we have

with probability at least 1δT1-\frac{\delta}{T}. Consequently, by a union bound over all TT indices, the bound (40) holds uniformly with probability at least 1δ1-\delta. We perform the remainder of our analysis under this event.

Indeed, when this bound holds, we may iterate it to show that

where the final step follows by summing the geometric series.

It remains to prove the claim (41), and we do so via induction on the iteration number. Beginning with s=1s=1, we have

In the induction from ss+1s\mapsto s+1, suppose that θsθ2r\|\theta^{s}-{\theta^{\ast}}\|_{2}\leq r, and the bound (41) holds at iteration ss. The same argument then implies that the bound (41) also holds for iteration s+1s+1, and that θs+1θ2r\|\theta^{s+1}-{\theta^{\ast}}\|_{2}\leq r, thus completing the proof. ∎

2 Analysis of gradient EM algorithm

We now turn to analysis of the gradient EM algorithm. As before, we separate our analysis into two parts, the first (Theorem 3) addressing the behavior of the population-level operator, and the second (Theorems 4 and 5) providing guarantees for sample-based updates.

Recall that the gradient EM algorithm generates a sequence of iterates {θt}t=0\{\theta^{t}\}_{t=0}^{\infty} via the recursion θt+1=G(θt)\theta^{t+1}=G(\theta^{t}), where

Here α>0\alpha>0 is a step size parameter to be chosen. For analyzing gradient EM, we also require an additional condition on the function q(θ)=Q(θθ)q(\theta)=Q(\theta|{\theta^{\ast}}), previously defined in Section 3.1. In addition to the λ\lambda-strong concavity assumption (25), we also assume that qq is μ\mu-smooth, meaning that

Under the stated strong concavity and smoothness assumptions, it is a standard result from optimization theory that the gradient operator T:ΩΩT:\Omega\rightarrow\Omega with step size choice α=2μ+λ\alpha=\frac{2}{\mu+\lambda} is contractive, in particular with

Intuitively, then, if the function Q(θ)Q(\cdot|\theta) is “close enough” to the function q()=Q(θ)q(\cdot)=Q(\cdot|{\theta^{\ast}}), then the gradient EM operator might be expected to satisfy a similar contractivity condition. The closeness requirement is formalized in the following condition:

See Figure 2 for an illustration of this condition. We give concrete examples of this condition and its verification in Section 4. As with the FOS condition observe that the GS condition is always satisfied at the fixed point θ\theta^{*}, i.e. for r=0r=0 with γ=0\gamma=0. Allowing for strictly positive γ\gamma, if the functions Q(θ)Q(\cdot|\theta) are sufficiently regular we expect the condition to hold in a region around θ\theta^{*}. Observe that this condition involves the gradient of the functions Q(θ)Q(\cdot|\theta) and Q(θ)Q(\cdot|{\theta^{\ast}}) at θ\theta, as opposed to M(θ)M(\theta) in the case of the FOS condition. For this reason, it can be easier to verify for specific models.

Under this condition, the following result guarantees local contractivity of the gradient EM operator (42):

By definition of the gradient EM update (42), we have

where step (i) follows from the triangle inequality, and step (ii) uses the contractivity of TT from equation (45), and condition GS. Substituting α=2μ+λ\alpha=\frac{2}{\mu+\lambda} and performing some algebra yields the claim. ∎

2.2 Guarantees for sample-based gradient EM

In this section, in parallel with our earlier analysis of sample-based version of the EM algorithm, we analyze two sample-based variants of the gradient EM algorithm, the first when the update operator GnG_{n} is computed using all nn samples and applied repeatedly, and the second based on sample-splitting.

If the sample size nn is large enough to ensure that

If the sample size nn is large enough to ensure that

2.3 Stochastic version of gradient EM

In this section, we analyze a sample-based variant of gradient EM that is inspired by stochastic approximation. It can be viewed as an extreme form of sample-splitting, in which we use only a single sample per iteration, but compensate for the noisiness using a decaying step size. Throughout this section we assume that (a lower bound on) the radius of convergence rr of the population operator is known to the algorithmThis assumption can be restrictive in practice. We believe the requirement can be eliminated by a more judicious choice of the step-size parameter in the first few iterations..

In particular, given a sequence of positive step sizes {αt}t=0\{\alpha^{t}\}_{t=0}^{\infty}, we analyze the recursion

In order to prove this theorem we first establish a recursion on the expected mean-squared error. As with Theorem 3 this result is established by relating the population gradient EM operator to the gradient ascent operator on the function q()q(\cdot). This key recursion along with some algebra will yield the theorem.

Given the stochastic EM gradient iterates with step sizes {αt}t=0\{\alpha^{t}\}_{t=0}^{\infty}, the error Δt+1:=θt+1θ\Delta^{t+1}:=\theta^{t+1}-\theta^{*} at iteration t+1t+1 satisfies the recursion

Using this result, we can now complete the proof of the bound (54). With the step size choice αt:=aξ(t+2)\alpha^{t}:=\frac{a}{\xi\,(t+2)} where a=32a=\frac{3}{2}, unwrapping the recursion (55) yields

In order to bound these terms we use the following fact: For any a(1,2)a\in(1,2), we have

See Noorshams and Wainwright for a proof. Using this fact in Equation (56) yields

Finally, applying the integral upper bound τ=2t+21τ2a1t+21x2adx  2(t+3)a1\sum\limits_{\tau=2}^{t+2}\frac{1}{\tau^{2-a}}\leq\int_{1}^{t+2}\frac{1}{x^{2-a}}dx\;\leq 2(t+3)^{a-1} yields the claim (54). ∎

For the convenience of the reader, let us now summarize the theorems given in this section, including the assumptions on which they rely and the results that they provide.

Consequences for specific models

In the previous section, we provided a number of general theorems on the behavior of the EM algorithm as well as the gradient EM algorithm, at both the population and sample levels. In this section, we develop some concrete consequences of this general theory for the three specific model classes previously introduced in Section 2.2.

We begin by analyzing the EM updates for the Gaussian mixture model previously introduced in Section 2.2.1. Our first result (Corollary 1) establishes contractivity for the population operator (13b), whereas our second result (Corollary 2) provides bounds for the sample-based EM updates.

Recall that our mixture model consists of two equally weighted components, with distributions N(θ,σ2I)\mathcal{N}(\theta^{*},\sigma^{2}I) and N(θ,σ2I)\mathcal{N}(-\theta^{*},\sigma^{2}I) respectively. The difficulty of estimating this mixture model can be characterized by the signal-to-noise ratio θ2σ\frac{\|\theta^{*}\|_{2}}{\sigma}, and our analysis requires a lower bound of the form

for a sufficiently large constant η>0\eta>0. Past work by Redner and Walker provides evidence for the necessity of this assumption: for Gaussian mixtures with low signal-to-noise ratio, they show that the ML solution has large variance and furthermore verify empirically that the convergence of the EM algorithm can be quite slow. Other researchers also provide theoretical justification for the slow convergence of EM on poorly separated Gaussian mixtures.

With the signal-to-noise ratio lower bound η\eta defined above we have the following guarantee:

This corollary guarantees that when the SNR is sufficiently large, then the MLE θ\theta^{*} has a basin of attraction that is at least a constant fraction of the signal strength. Moreover, the convergence rate of the population updates is geometric, with the contraction factor κ\kappa decreasing exponentially in the signal-to-noise ratio. The proof of Corollary 1 involves establishing that for a sufficiently large SNR, the strong concavity and FOS (γ\gamma) conditions hold for a Gaussian mixture model, so that Theorem 1 can be applied. Although the proof structure is conceptually straightforward, the details are quite technical, so that we defer it to Appendix B.1.

Based on the population-level contractivity guaranteed by Corollary 1, we can also establish guarantees for the standard EM sequence θt+1=Mn(θt)\theta^{t+1}=M_{n}(\theta^{t}), where the sample-based operator MnM_{n} was previously defined in equation (13a). This guarantee involves the function φ(σ;θ2):=θ2θ22+σ2\varphi(\sigma;\|\theta^{*}\|_{2}):=\|\theta^{*}\|_{2}\sqrt{\|\theta^{*}\|_{2}^{2}+\sigma^{2}}, as well as positive universal constants (c,c1,c2)(c,c_{1},c_{2}).

See Appendix B.2 for the proof of this result. In Appendix B.3, we also give guarantees for EM with sample-splitting which achieves better dependence on θ2\|\theta^{*}\|_{2} and σ\sigma with an easier proof at the cost of additional logarithmic factors in sample complexity.

A related result of Dasgupta and Schulman shows that when the SNR is sufficiently high a modified EM algorithm, with an intermediate pruning step, reaches a near-optimal solution in two iterations. On one hand, the SNR condition in our corollary is significantly weaker, requiring only that it is larger than a fixed constant independent of dimension (as opposed to scaling with dd), but their theory is developed for more general kk-mixtures.

The bound (59) provides a rough guide of how many iterations are required: consider the smallest positive integer such that

Corollary 2 makes a number of qualitative predictions that can be tested. To begin, it predicts that the statistical error θtθ2\|\theta^{t}-\theta^{*}\|_{2} should decrease geometrically, and then level off at a plateau. Figure 3 shows the results of simulations designed to test this prediction: for dimension d=10d=10 and sample size n=1000n=1000, we performed 1010 trials with the standard EM updates applied to Gaussian mixture models with SNR θ2σ=2\frac{\|\theta^{*}\|_{2}}{\sigma}=2. In panel (a), the red curves plot the log statistical error versus the iteration number, whereas the blue curves show the log optimization error versus iteration. As can be seen by the red curves, the statistical error decreases geometrically before leveling off at a plateau. On the other hand, the optimization error decreases geometrically to numerical tolerance. Panel (b) shows that the gradient EM updates have a qualitatively similar behavior for this model, although the overall convergence rate appears to be slower.

In conjunction with Corollary 1, Corollary 2 also predicts that the convergence rate should increase as the signal-to-noise ratio θ2σ\frac{\|\theta^{*}\|_{2}}{\sigma} is increased. Figure 3 shows the results of simulations designed to test this prediction: again, for mixture models with dimension d=10d=10 and sample size n=1000n=1000, we applied the standard EM updates to Gaussian mixture models with varying SNR θ2σ\frac{\|\theta^{*}\|_{2}}{\sigma}. For each choice of SNR, we performed 1010 trials, and plotted the log optimization error logθtθ^2\log\|\theta^{t}-\widehat{\theta}\|_{2} versus the iteration number. As expected, the convergence rate is geometric (linear on this logarithmic scale), and the rate of convergence increases as the SNR growsTo be clear, Corollary 2 predicts geometric convergence of the statistical error θtθ2\|\theta^{t}-\theta^{*}\|_{2}, whereas these plots show the optimization error θtθ^2\|\theta^{t}-\widehat{\theta}\|_{2}. However, the analysis underlying Corollary 2 can also be used to show geometric convergence of the optimization error..

2 Mixtures of regressions

In this section, we analyze the EM and gradient EM algorithms for the mixture of regressions (MOR) model, previously introduced in Section 2.2.2. As in our analysis of the Gaussian mixture model, our theory applies when the signal-to-noise ratio is sufficiently large, as enforced by a condition of the form

Under a suitable lower bound on this quantity, our first result guarantees that the population level operators (17b) and (18a) are locally contractive.

As shown in the proof, the contraction coefficient κ\kappa is again a decreasing function of the SNR parameter η\eta. However, its functional form is not as explicit as in the Gaussian mixture case. The proof of Corollary 3 involves verifying that the function qq for the MOR model satisfies the required concavity, smoothness, GS(γ\gamma) and FOS(γ\gamma) conditions. It is quite technically involved, so that we defer it to Appendix C.1.

Let us now provide guarantees for a sample-splitting version of the EM updates. Recall that sample-based EM operator was previously defined in equation (17a). For a given sample size nn and iteration number TT, suppose that we splitTo simplify exposition, assume that n/Tn/T is an integer. our full data set into TT subsets, each of size n/Tn/T. We then generate the sequence θt+1=Mn/T(θt)\theta^{t+1}=M_{n/T}(\theta^{t}), where we use a fresh subset at each iteration. In the following result, we use φ(σ;θ2)=σ2+θ22\varphi(\sigma;\|\theta^{*}\|_{2})=\sqrt{\sigma^{2}+\|\theta^{*}\|_{2}^{2}}, along with positive universal constants (c1c_{1},c2c_{2}).

We prove this corollary in Appendix C.2. Note the bound (63) again provides guidance on the number of iterations to perform. For a given sample size nn, suppose we perform T=clog(n/dφ2(σ;θ2))T=c\log(n/d\varphi^{2}(\sigma;\|\theta^{*}\|_{2})) iterations for a constant cc. The bound (63) then implies that

with probability at least 1δ1-\delta. Apart from the logarithmic penalty \log^{2}\big{(}\frac{n}{d\varphi^{2}(\sigma;\|\theta^{*}\|_{2})}\big{)}, this guarantee matches the minimax rate for estimation of a dd-dimensional regression vector. We note that the logarithmic penalty can be removed by instead analyzing the standard form of the EM updates, as we did for the Gaussian mixture model.

We conclude our discussion of the MOR model by stating a result for the stochastic form of gradient EM analyzed in Theorem 5. In particular, given a data set of size nn, we run the algorithm for nn iterations, with a step size αt:=3t+2\alpha^{t}:=\frac{3}{t+2} for iterations t=1,,nt=1,\ldots,n. Once again our result is terms of φ(σ;θ2)=σ2+θ22\varphi(\sigma;\|\theta^{*}\|_{2})=\sqrt{\sigma^{2}+\|\theta^{*}\|_{2}^{2}} and positive universal constants (c1c_{1},c2c_{2}).

We prove this corollary in Appendix C.3. Figure 4 illustrates this corollary showing the error as a function of iteration number (sample size) for the stochastic gradient EM algorithm.

3 Linear regression with missing covariates

This section is devoted to analysis of the gradient EM algorithm for the problem of linear regression with missing covariates, as previously introduced in Section 2.2.3. Here the central parameter is the probability ρ\rho that any given coordinate of the covariate vector is missing, and our analysis links this quantity to the signal-to-noise ratio and the radius of contractivity. Define ξ1\xi_{1} and ξ2\xi_{2} to be such that the following bounds hold,

For any given choice of (ξ1,ξ2)(\xi_{1},\xi_{2}) define ξ:=(ξ1+ξ2)2.\xi:=(\xi_{1}+\xi_{2})^{2}. Our guarantees apply whenever the missing probability is bounded as

See Appendix D.1 for the proof of Corollary 6. Relative to our previous results, this corollary is somewhat unusual, in that we require an upper bound on the ratio θ2σ\frac{\|\theta^{*}\|_{2}}{\sigma}. Although this requirement might seem counter-intuitive at first sight, known minimax lower bounds on regression with missing covariates show that it is unavoidable— that is, it is not an artifact of our analysis nor of the gradient EM algorithm. Roughly these lower bounds formalize the intuition that as the norm θ2\|\theta^{*}\|_{2} increases, the amount of missing information increases in proportion to the amount of observed information. Figure 6 provides the results of simulations that confirm this behavior, in particular showing that for regression with missing data, the radius of convergence eventually decreases as θ2\|\theta^{*}\|_{2} grows.

Let us now provide guarantees for a sample-splitting version of the EM updates, based on the sample-based EM operator in equation (22a). As usual, for a given sample size nn and iteration number TT, suppose that we split our full data set into TT subsets, each of size n/Tn/T. We then generate the sequence θt+1=Mn/T(θt)\theta^{t+1}=M_{n/T}(\theta^{t}), where we use a fresh subset at each iteration.

We prove this corollary in Appendix D.2. We note that the constant c2c_{2} is a monotonic function of the parameters (ξ1,ξ2)(\xi_{1},\xi_{2}), but does not otherwise depend on nn, dd, σ2\sigma^{2} or other problem-dependent parameters.

As with Corollary 4, this result provides guidance on the appropriate number of iterations to perform: in particular, if we set T=clognT=c\log n for a sufficiently large constant cc, then the bound (69) implies that

We conclude our discussion of the missing covariates model by stating a result for the stochastic form of gradient EM analyzed in Theorem 5. In particular, given a data set of size nn, we run the algorithm for nn iterations, with a step size αt:=3t+2\alpha^{t}:=\frac{3}{t+2} for iterations t=1,,nt=1,\ldots,n.

We prove this corollary in Appendix D.3. Figure 7 illustrates this.

Discussion

In this paper, we have provided some general techniques for studying the EM and gradient EM algorithms, at both the population and finite-sample levels. Although this paper focuses on these specific algorithms, we expect that the techniques could be useful in understanding the convergence behavior of other algorithms for potentially non-convex problems.

The analysis of this paper can be extended in various directions. For instance, in the three concrete models that we treated, we assumed that the model was correctly specified, and that the samples were drawn in an i.i.d. manner, both conditions that may be violated in statistical practice. Maximum likelihood estimation is known to have various robustness properties under model mis-specification. Developing an understanding of the EM algorithm in this setting is an important open problem.

Finally, we note that in concrete examples our analysis guarantees good behavior of the EM and gradient EM algorithms when they are given suitable initialization. For the three model classes treated in this paper, simple pilot estimators can be used to obtain such initializations—in particular using PCA for Gaussian mixtures and mixtures of regressions (e.g., ), and the plug-in principle for regression with missing data (e.g., ). These estimators can be seen as particular instantiations of the method of moments . Although still an active area of research, a line of recent work (e.g., ) has demonstrated the utility of moment-based estimators or initializations for other types of latent variable models, and it would be interesting to analyze the behavior of EM for such models.

This research was partially supported by ONR-MURI grant N00014-11-1-0688 and NSF grant CIF-31712-23800 to MJW, and by US NSF grants DMS-1107000, CDS&E-MSS 1228246, ARO grant W911NF-11-1-0114, the Center for Science of Information (CSoI), and US NSF Science and Technology Center, under grant agreement CCF-0939370. SB would also like to thank John Duchi for helpful discussions.

Appendix A Proofs for stochastic gradient EM

In this section we provide proofs of results related to Theorem 5 from Section 3.2.3. It only remains to prove Lemma 1.

In order to establish Lemma 1 we require an analogue of Theorem 3 that allows for a wider range of step sizes. Recall the classical gradient ascent operator on the function q(θ)=Q(θθ)q(\theta)=Q(\theta|\theta^{*}). For step size α>0\alpha>0, it takes the form T(θ)=θ+αq(θ)T(\theta)=\theta+\alpha\nabla q(\theta). Under the stated λ\lambda-concavity and μ\mu-smoothness conditions, for any step size 0<α2λ+μ0<\alpha\leq\frac{2}{\lambda+\mu}, the classical gradient operator TT is contractive with parameter

This follows from the classical analysis of gradient descent (e.g., ). Using this fact, we can prove the following about the population gradient EM operator:

For any step size 0<α2λ+μ0<\alpha\leq\frac{2}{\lambda+\mu}, the population gradient EM operator G:ΩΩG:\Omega\rightarrow\Omega is contractive with parameter κ(α)=1αξ\kappa(\alpha)=1-\alpha\xi, where

We omit the proof, since it follows from a similar argument to that of Theorem 3. With this preliminary in place we can now begin the proof of Lemma 1.

Introducing the shorthand W^(θ):=Q1(θθ)\widehat{W}(\theta):=\nabla Q_{1}(\theta|\theta), we have θ~t+1θt=αtW^(θ)\widetilde{\theta}^{t+1}-\theta^{t}=\alpha^{t}\widehat{W}(\theta), and hence

Letting Ft\mathcal{F}_{t} denote the σ\sigma-field of events up to the random variable θt\theta^{t}, note that

Consequently, by iterated expectations, we have

Combining with our earlier inequality (72) yields

Defining Gt(θt):=θt+αtW(θt)G^{t}(\theta^{t}):=\theta^{t}+\alpha^{t}W(\theta^{t}), we see that

where step (i) uses the contractivity of GtG^{t} established in Lemma 2 and step (ii) uses the definition of ξ\xi from equation (71). Putting together the pieces yields the claim (55).

Appendix B Proofs for Gaussian mixture models

In this section, we provide proofs of results related to the Gaussian mixture model, as presented in Section 4.1. More specifically, we first prove Corollary 1 on the population level behavior, followed by the proof of Corollary 2 on the behavior of the standard sample-based EM updates.

Here the weighting function takes the form

It remains to verify the FOS(γ\gamma) condition (29). The following auxiliary lemma is central to the proof:

Under the conditions of Corollary 1, there is a constant γ(0,1)\gamma\in(0,1) with γexp(c2η2)\gamma\leq\exp(-c_{2}\eta^{2}) such that

where Δw(y):=wθ(y)wθ(y)\Delta_{w}(y):=w_{\theta}(y)-w_{{\theta^{\ast}}}(y).

This follows immediately from Lemma 3. Thus, the FOS condition holds when γ<1.\gamma<1. The bound on the contraction parameter follows from the fact that γexp(c2η2)\gamma\leq\exp(-c_{2}\eta^{2}) and applying Theorem 1 yields Corollary 1.

We now prove Lemma 3. Our proof makes use of the following elementary facts:

For the function f(t)=t2exp(μt)f(t)=\frac{t^{2}}{\exp(\mu t)}, we have

For the function g(t)=1(exp(t)+exp(t))2g(t)=\frac{1}{(\exp(t)+\exp(-t))^{2}}, we have

With these preliminaries in place, we can now begin the proof. For each uu\in, define θu=θ+uΔ\theta_{u}={\theta^{\ast}}+u\Delta, where Δ:=θθ\Delta:=\theta-{\theta^{\ast}}. Taylor’s theorem applied to the function θwθ(Y)\theta\mapsto w_{\theta}(Y), followed by expectations, yields

By construction, the matrix DD is diagonal, so that it suffices to bound the diagonal terms. Beginning with the first diagonal entry, we have

Defining the event E={V1θ24}\mathcal{E}=\{V_{1}\leq\frac{\|\theta^{*}\|_{2}}{4}\}, we condition on it and its complement to obtain

Conditioned on E\mathcal{E} and Ec\mathcal{E}^{c}, respectively, we then apply the bounds (74a) and (74b) to obtain

provided θ2θu24σ2\|\theta^{*}\|_{2}\|\theta_{u}\|_{2}\geq 4\sigma^{2}. Noting that

Note that the mean of V1V_{1} is lower bounded as

where step (i) follows from the lower bound (77). Consequently, by standard Gaussian tail bounds, we have

On the other hand, for any index j1j\neq 1, we have

where the reader should recall the function gg from equation (75a). Once again, conditioning on the event E={V1θ24}\mathcal{E}=\{V_{1}\leq\frac{\|\theta^{*}\|_{2}}{4}\} and its complement yields

Returning to equation (76), we have shown that

whenever θ22σ2η216/3\frac{\|\theta^{*}\|_{2}^{2}}{\sigma^{2}}\geq\eta^{2}\geq 16/3. On this basis, the bound (73) holds as long as the signal-to-noise ratio is sufficiently large,

B.2 Proof of Corollary 2

where the final step uses the fact that ZuZvZuv2|Z_{u}-Z_{v}|\leq Z\,\|u-v\|_{2} for any pair (u,v)(u,v). Putting together the pieces, we conclude that

using a standard symmetrization result for empirical processes (e.g., ). Now observe that for any triplet of dd-vectors yy, θ\theta and θ\theta^{\prime}, we have the Lipschitz property

Consequently, by the Ledoux-Talagrand contraction for Rademacher processes , we have

Putting together the pieces, we conclude that

for all λ\lambda sufficiently small. Combined with our first discretization (79), we have thus shown that

Combined with the Chernoff approach, this bound on the MGF implies that, as long as nc1dlog(1/δ)n\geq c_{1}d\log(1/\delta) for a sufficiently large constant c1c_{1}, we have

B.3 Guarantees for EM with sample-splitting

In this section, we state and prove a result for the EM algorithm with sample-splitting for the mixture of Gaussians.

Consider a Gaussian mixture model satisfying the \mboxSNR(η)\mbox{SNR}(\eta) condition (57), and any initialization θ0\theta^{0} such that θ0θ2θ24\|\theta^{0}-\theta^{*}\|_{2}\leq\frac{\|\theta^{*}\|_{2}}{4}. Given a sample size n16Tlog(6T/δ)n\geq 16T\log(6T/\delta), then with probability at least 1δ1-\delta, the sample-splitting EM iterates {θt}t=0T\{\theta^{t}\}_{t=0}^{T} satisfy the bound

It is worth comparing the result here to the result established earlier in Corollary 2. The sample-splitting EM algorithm is more sensitive to the number of iterations which determines the batch size and needs to be chosen in advance. Supposing that the number of iterations were chosen optimally however the result has better dependence on θ2\|\theta^{*}\|_{2} and σ\sigma at the cost of a logarithmic factor in nn.

The proof follows by establishing a bound on the function εM(n,δ)\varepsilon_{M}(n,\delta). Define S={θ:θθ2θ24}\mathcal{S}=\{\theta:\|\theta-\theta^{*}\|_{2}\leq\frac{\|\theta^{*}\|_{2}}{4}\}. Recalling the updates in (13a) and (13b), note that

We bound each of these terms in turn, in particular showing that

Observe that since Y(2Z1)θ+vY\sim(2Z-1)\theta^{*}+v we have

Since ZiZ_{i} are i.i.d Bernoulli variables, Hoeffding’s inequality implies that

with probability at least 1δ41-\frac{\delta}{4}. On the other hand, the vector U1:=1ni=1nviU_{1}:=\frac{1}{n}\sum_{i=1}^{n}{v_{i}} is zero-mean and sub-Gaussian with parameter σ/n\sigma/\sqrt{n}, whence the squared norm U122\|U_{1}\|_{2}^{2} is sub-exponential. Using standard bounds for sub-exponential variates and the condition n>σdn>\sigma d, we obtain

with probability at least 1δ/41-\delta/4. Combining the pieces yields the claimed bound (82) on T1T_{1}.

The random variable wθ(Y)(2Z1)w_{\theta}(Y)(2Z-1) lies in the interval $$, so that Hoeffding’s inequality implies that

with probability at least 1δ/41-\delta/4. Putting together the pieces yields the claimed bound (82) on T2T_{2}, thereby completing the proof of the corollary. ∎

Appendix C Proofs for mixtures of regressions

In this appendix, we provide proofs of results related to the mixture of regressions model, as presented in Section 4.2. More specifically, we first prove Corollary 3 on the population level behavior, followed by the proof of Corollaries 4 and 5 on the behavior of sample-splitting EM updates and stochastic gradient EM updates, respectively.

We begin by proving part (a) of the corollary on the population EM update, which is based on maximizing the function

It remains to verify condition FOS. Define the difference function Δw(X,Y):=wθ(X,Y)wθ(X,Y)\Delta_{w}(X,Y):=w_{\theta}(X,Y)-w_{\theta^{*}}(X,Y), and the difference vectors Δ=θθ\Delta=\theta-\theta^{*}. Using this notation, for this model, we need to show that

Note that we can write Y=d(2Z1)X,θ+vY\stackrel{{\scriptstyle d}}{{=}}(2Z-1)\langle X,\,\theta^{*}\rangle+v, where Z\mboxBer(1/2)Z\sim\mbox{Ber}(1/2) is a Bernoulli variable. Using this notation, it is equivalent to show

for γ[0,1/2)\gamma\in[0,1/2) in order to establish contractivity. In order to prove the theorem with the desired upper bound on κ\kappa we need to show (83) with γ[0,1/4).\gamma\in[0,1/4). The following lemma provides control on the two terms:

Under the conditions of Corollary 3, there is a constant γ<1/4\gamma<1/4 such that for any fixed vector Δ~\widetilde{\Delta} we have

for κ[0,12).\kappa\in[0,\frac{1}{2}). This is an immediate consequence of Lemma 4.

It remains to prove Lemma 4. Since the standard deviation σ\sigma is known, a simple rescaling argument allows us to take σ=1\sigma=1, and replace the weight function in (16a) with

Our proof makes use of the following elementary result on Gaussian random vectors:

A similar argument yields the second claim. ∎

Noting that Lemma 4 consists of two separate inequalities (84a) and (84b), we treat these cases separately.

We split the proof of this bound into two separate cases: namely, Δ21\|\Delta\|_{2}\leq 1 and Δ2>1\|\Delta\|_{2}>1.

Thus, using a Taylor series with integral form remainder on the function θwθ(X,Y)\theta\mapsto w_{\theta}(X,Y) yields

where Zu:=YX,θ+uΔZ_{u}:=Y\langle X,\,\theta^{*}+u\Delta\rangle. Substituting for Δw(X,Y)\Delta_{w}(X,Y) in inequality (84a), we see that it suffices to show

for some γ[0,1/4)\gamma\in[0,1/4). The following auxiliary result is central to establishing this claim:

There is a γ[0,1/4)\gamma\in[0,1/4) such that for each uu\in, we have

See Section C.1.5 for the proof of this lemma.

Using Lemma 6, let us bound the quantity AuA_{u} from equation (89). Since θ=θuuΔ\theta^{*}=\theta_{u}-u\Delta, we have Au=B1+B2A_{u}=B_{1}+B_{2}, where

In order to show that Auγ2Δ2Δ~2A_{u}\leq\frac{\gamma}{2}\|\Delta\|_{2}\,\|\widetilde{\Delta}\|_{2}, it suffices to show that max{B1,B2}γ4Δ2Δ~2\max\{B_{1},B_{2}\}\leq\frac{\gamma}{4}\|\Delta\|_{2}\|\widetilde{\Delta}\|_{2}.

By the Cauchy-Schwarz inequality, we have

Similarly, another application of the Cauchy-Schwarz inequality yields

where the second step follows from the bound (90b), and the fact that (2Z1)2=1(2Z-1)^{2}=1. In this case, we have

where step (i) uses the bound (86b) from Lemma 5, and step (ii) that Δ21\|\Delta\|_{2}\leq 1. Combining the pieces, we conclude that B2γ4Δ2Δ~2B_{2}\leq\frac{\gamma}{4}\|\Delta\|_{2}\|\widetilde{\Delta}\|_{2}, which completes the proof of inequality (84a) in the case Δ21\|\Delta\|_{2}\leq 1.

We now turn to the second case of the bound (84a). Our argument (here and in later sections) makes use of various probability bounds on different events, which we state here for future reference. These events involve the scalar τ:=Cτlogθ2\tau:=C_{\tau}\sqrt{\log\|\theta^{*}\|_{2}} for a constant CτC_{\tau}, as well as the vectors

For the event \mathcal{E}_{2}:=\big{\{}|\langle X,\,\theta^{*}\rangle|>\tau\big{\}}\cap\big{\{}|\langle X,\,\theta_{u}\rangle|>\tau\big{\}}\cap\big{\{}|v|\leq\frac{\tau}{2}\big{\}}, we have

See Section C.1.7 for the proof of this result.

With this set-up, our goal is to bound the quantity

We bound each of these five terms in turn.

Applying the Cauchy-Schwarz inequality and using the fact that (2Z1)2=1(2Z-1)^{2}=1 yields

We now bound Δw(X,Y)\Delta_{w}(X,Y) conditioned on the event E1E2\mathcal{E}_{1}\cap\mathcal{E}_{2}. Since sign(X,θ)=sign(X,θu)\operatorname{sign}(\langle X,\,\theta^{*}\rangle)=\operatorname{sign}(\langle X,\,\theta_{u}\rangle) on the event E1\mathcal{E}_{1}, we have

Recalling the weight function (85), we claim that when conditions (93a) and (93b) hold, then

Combined with our earlier bound (92), we have shown

Applying Lemma 8 with E=E1E2\mathcal{E}=\mathcal{E}_{1}\cap\mathcal{E}_{2} yields Ψ(E1E2)2Δ~2θ2eτ2\Psi(\mathcal{E}_{1}\cap\mathcal{E}_{2})\leq 2\|\widetilde{\Delta}\|_{2}\|\theta^{*}\|_{2}e^{-\tau^{2}}.

Combining the Cauchy-Schwarz inequality with Lemma 7(i), we have

where step (i) makes use of the bound X,θX,θu0\langle X,\,\theta^{*}\rangle\,\langle X,\,\theta_{u}\rangle\leq 0; and step (ii) follows since θu=θ+uΔ\theta_{u}=\theta^{*}+u\Delta, and uu\in.

where step (i) follows from the conditional covariance bound of Lemma 8.

Combining the Cauchy-Schwarz inequality with Lemma 7(iv) yields

Together with Lemma 8, we obtain the bound

where step (i) uses the fact that Δ21\|\Delta\|_{2}\geq 1.

We have thus obtained bounds on all five terms in the decomposition (91). We combine these bounds with the with lower bound θu2θ22\|\theta_{u}\|_{2}\geq\frac{\|\theta^{*}\|_{2}}{2} from equation (87b), and then perform some algebra to obtain

where cc is a universal constant. In particular, selecting τ=cτlogθ2\tau=c_{\tau}\sqrt{\log\|\theta^{*}\|_{2}} for a sufficient large constant cτc_{\tau}, selecting the constant η\eta in (87a) sufficiently large yields the claim (84a).

C.1.2 Proof of inequality (84b)

As in Section C.1.1, we treat the cases Δ21\|\Delta\|_{2}\leq 1 and Δ21\|\Delta\|_{2}\geq 1 separately.

As before, by a Taylor expansion of the function θΔw(X,Y)\theta\mapsto\Delta_{w}(X,Y), it suffices to show that

For any fixed uu\in, the Cauchy-Schwarz inequality implies that

Conditioned on the event E1E2\mathcal{E}_{1}\cap\mathcal{E}_{2}, the bound (94) implies that Δw(X,Y)exp(τ2)|\Delta_{w}(X,Y)|\leq\exp(-\tau^{2}), and hence Ψ(E1E2)e2τ2\Psi(\mathcal{E}_{1}\cap\mathcal{E}_{2})\leq e^{-2\tau^{2}}.

Thus, we have derived bounds on each of the three terms in the decomposition (96): putting them together yields

By choosing CτC_{\tau} sufficiently large in the definition of τ\tau, selecting the signal-to-noise constant η\eta in condition (87a) sufficiently large, the claim follows.

C.1.5 Proof of Lemma 6

The lemma statement consists of two inequalities, and we divide our proof accordingly.

and we bound each of these terms in turn. The reader should recall the constant τ:=Cτlogθ2\tau:=C_{\tau}\sqrt{\log\|\theta^{*}\|_{2}}, as well as the events E3\mathcal{E}_{3} and E4\mathcal{E}_{4} from Lemma 7.

where the final step follows from inequality (74a). Combined with Lemma 7(iv), we conclude that Ψ(E4c)12e2eτ22\Psi(\mathcal{E}_{4}^{c})\leq\frac{1}{2e^{2}}e^{-\frac{\tau^{2}}{2}}.

where step (i) follows from inequality (98), and step (ii) follows from Lemma 7(iii).

Conditioned on the event E2\mathcal{E}_{2}, we have Y2X,θu2τ22Y^{2}\langle X,\,\theta_{u}\rangle^{2}\geq\frac{\tau^{2}}{2}, where we have used the lower bound (93b). Introducing the shorthand t=τ2/2t^{*}=\tau^{2}/2, this lower bound implies that

where inequality (i) is valid as long as t=τ2212t^{*}=\frac{\tau^{2}}{2}\geq\frac{1}{2}, or equivalently τ21\tau^{2}\geq 1.

Substituting our upper bounds on three components in the decomposition (97) yields

Setting CτC_{\tau} sufficiently large in the definition of τ\tau and choosing sufficiently large values of the signal-to-noise constant η\eta in the condition (87a) yields the claim.

Combining Lemma 8 with the bound Δ21\|\Delta\|_{2}\leq 1, we find that X,θ22τ2+4\langle X,\,\theta^{*}\rangle^{2}\leq 2\tau^{2}+4. Since Y=d(2Z1)X,θ+vY\stackrel{{\scriptstyle d}}{{=}}(2Z-1)\langle X,\,\theta^{*}\rangle+v, we have

Putting together the pieces, we conclude that Ψ(E5c)4τ3+10τ16θu2\Psi(\mathcal{E}_{5}^{c})\leq\frac{4\tau^{3}+10\tau}{16\|\theta_{u}\|_{2}}.

Recall that Zu=YX,θuZ_{u}=Y\langle X,\,\theta_{u}\rangle, so we have that

where step (i) follows from the bound (74a) and the observation that X,θuτ|\langle X,\,\theta_{u}\rangle|\geq\tau conditioned on the event E5\mathcal{E}_{5}.

Substituting our bounds on the two terms into the decomposition (97) yields

Once again, sufficiently large choices of the constant cτc_{\tau} and the signal-to-noise constant η\eta in equation (87a) yields the claim.

C.1.6 Proof of Lemma 7

In this section, we prove the probability bounds on events E1\mathcal{E}_{1} through E6\mathcal{E}_{6} stated in Lemma 7. In doing so, we make use of the following auxiliary result, due to Yi et al. (see Lemma 1 in their paper):

Note that the event E1c\mathcal{E}_{1}^{c} holds if and only if X,θX,θu<0\langle X,\,\theta^{*}\rangle\langle X,\,\theta_{u}\rangle<0, or equivalently, if and only if

For XN(0,σ2)X\sim\mathcal{N}(0,\sigma^{2}), we have

Similarly, this inequality follows from the tail bound (101).

This claim follows from parts (v) and (vi) of Lemma 7, combined with the union bound.

This bound follows from parts (iii) and (iv) of Lemma 7, combined with the union bound.

C.1.7 Proof of Lemma 8

For appropriate choices of cτc_{\tau} and the constant η\eta in the signal-to-noise condition (87a), the claim follows.

As before, note that the event E1c\mathcal{E}_{1}^{c} holds if and only if the inequality X,θ+θu<X,θθu|\langle X,\,\theta^{*}+\theta_{u}\rangle|<|\langle X,\,\theta^{*}-\theta_{u}\rangle| holds. Consequently, Lemma 9 implies that  ⁣ ⁣Γ(E1c) ⁣ ⁣\mboxop2|\!|\!|\Gamma(\mathcal{E}_{1}^{c})|\!|\!|_{{\text{\footnotesize{\mbox{op}}}}}\leq 2.

We make note of an elementary fact about Gaussians: for any scalar α>0\alpha>0 and unit norm vector v2=1\|v\|_{2}=1, for XN(0,Id)X\sim\mathcal{N}(0,I_{d}), we have

In particular, when α1\alpha\leq 1, then the operator norm is at most 11. This claim follows easily from the rotation invariance of the Gaussian, which allows us to assume that v=e1v=e_{1} without loss of generality. It is thus equivalent to bound the largest eigenvalue of the matrix

which is a diagonal matrix by independence of the entries of XX. Noting that D11α2D_{11}\leq\alpha^{2} and Djj=1D_{jj}=1 for j1j\neq 1 completes the proof of the bound (102).

Applying the bound (102), we find that |\!|\!|\Gamma(\mathcal{E}_{5}^{c})|\!|\!|_{{\text{\footnotesize{\mbox{op}}}}}\leq\max\Big{(}1,\frac{\tau^{2}}{\|\theta_{u}\|_{2}^{2}}\Big{)}. Consequently, the claim follows by making sufficiently large choices of cτc_{\tau} and the constant η\eta in the signal-to-noise condition (87a).

C.2 Proof of Corollary 4

We need to compute an upper bound on the function εM(n,δ)\varepsilon_{M}(n,\delta) previously defined in equation (33). For this particular model, we have

We bound each of the terms T1T_{1} and T2T_{2} in turn.

Recall the assumed lower bound on the sample size—namely n>cdlog(1/δ)n>c\,d\log(1/\delta) for a sufficiently large constant cc. Under this condition, standard bounds in random matrix theory , guarantee that  ⁣ ⁣Σ^Σ ⁣ ⁣\mboxop12|\!|\!|\widehat{\Sigma}-\Sigma|\!|\!|_{{\text{\footnotesize{\mbox{op}}}}}\leq\frac{1}{2} with probability at least 1δ1-\delta. When this bound holds, we have  ⁣ ⁣Σ^1 ⁣ ⁣\mboxop1/2|\!|\!|\widehat{\Sigma}^{-1}|\!|\!|_{{\text{\footnotesize{\mbox{op}}}}}\geq 1/2.

where we have applied the Ledoux-Talagrand contraction for Rademacher processes , using the fact that μθ(x,y)1|\mu_{\theta}(x,y)|\leq 1 for all pairs (x,y)(x,y). Now conditioned on xix_{i}, the random variable yiy_{i} is zero-mean and sub-Gaussian with parameter at most θ22+σ2\sqrt{\|\theta^{*}\|_{2}^{2}+\sigma^{2}}. Consequently, taking expectations over the distribution (yixi)(y_{i}\mid x_{i}) for each index ii, we find that

where the final inequality uses the definition of E\mathcal{E}. Using this bound on the moment-generating function, we find that

Since n>dn>d by assumption, standard results in random matrix theory imply that  ⁣ ⁣Σ^1Σ1 ⁣ ⁣\mboxopcdnlog(1/δ)|\!|\!|\widehat{\Sigma}^{-1}-\Sigma^{-1}|\!|\!|_{{\text{\footnotesize{\mbox{op}}}}}\leq c\sqrt{\frac{d}{n}\,\log(1/\delta)} with probability at least 1δ1-\delta. On the other hand, observe that

since the population operator MM is a contraction, and θ22θ2\|\theta\|_{2}\leq 2\|\theta^{*}\|_{2}. Combining the pieces, we see that T2cθ2dn  log(1/δ)T_{2}\leq c\|\theta^{*}\|_{2}\sqrt{\frac{d}{n}\;\log(1/\delta)} with probability at least 1δ1-\delta.

Finally, substituting our bounds on T1T_{1} and T2T_{2} into the decomposition (103) yields the claim.

C.3 Proof of Corollary 5

First considering T1T_{1}, recall that y1=z1x1,θ+v1y_{1}=z_{1}\langle x_{1},\,\theta^{*}\rangle+v_{1}, where vN(0,σ2)v\sim\mathcal{N}(0,\sigma^{2}) and z1z_{1} is a random sign, independent of (x1,v1)(x_{1},v_{1}). Consequently, we have

Putting together the pieces, we conclude that T18θ22d+2σ2dT_{1}\leq 8\|\theta^{*}\|_{2}^{2}d+2\sigma^{2}d.

Appendix D Proofs for missing covariates

In this appendix, we provide proofs of results related to regression with missing covariates, as presented in Section 4.3. More specifically, we first prove Corollary 6 on the population level behavior, followed by the proof of Corollaries 7 and 8 on the behavior of sample-splitting EM updates and stochastic gradient EM updates, respectively.

We need to verify the conditions of Theorem 3, namely that the function qq is μ\mu-smooth, λ\lambda-strongly concave, and that the GS condition is satisfied. In this case, qq is a quadratic of the form

showing that the expectation does not depend on the pattern of missingness. Consequently, the quadratic function qq has an identity Hessian, showing that smoothness and strong concavity hold with μ=λ=1\mu=\lambda=1.

where we have used our upper bound (107) on π1\pi_{1}. We need to ensure that γ<1\gamma<1. By assumption, we have θ2ξ1σ\|\theta^{*}\|_{2}\leq\xi_{1}\sigma and θθ2ξ2σ\|\theta-\theta^{*}\|_{2}\leq\xi_{2}\sigma, and hence θ2(ξ1+ξ2)σ\|\theta\|_{2}\leq(\xi_{1}+\xi_{2})\sigma. Thus, the coefficient γ2\gamma^{2} is upper bounded as

Under the stated conditions of the corollary, we have γ<1\gamma<1, thereby completing the proof.

D.2 Proof of Corollary 7

It is clear that u,(ΣˉΣ^)u\langle u,\,(\bar{\Sigma}-\widehat{\Sigma})u\rangle has zero mean. It remains to prove that u,(ΣˉΣ^)u\langle u,\,(\bar{\Sigma}-\widehat{\Sigma})u\rangle is sub-exponential. Note that Σ^\widehat{\Sigma} is a rescaled sum of rank one matrices, each of the form

Introducing the shorthand ω=(1z)θ\omega=(1-z)\odot\theta, we have

Moreover, since y=x,θ+vy=\langle x,\,\theta^{*}\rangle+v, we have

It suffices to show that each of the variables {Bj}j=14\{B_{j}\}_{j=1}^{4} is sub-Gaussian with a constant parameter. As discussed previously, the variable B1B_{1} is sub-Gaussian with parameter at most one. On the other hand, note that xx and ω\omega are independent. Moreover, with ω\omega fixed, the variable x,ω\langle x,\,\omega\rangle is sub-Gaussian with parameter ω22\|\omega\|_{2}^{2}, whence

where the final inequality uses the fact that ω,u2ω22\langle\omega,\,u\rangle^{2}\leq\|\omega\|_{2}^{2}. We have thus shown that B2B_{2} is sub-Gaussian with parameter one. Since θθ2ξ2σ\|\theta-\theta^{*}\|_{2}\leq\xi_{2}\sigma, the same argument shows that B3B_{3} is sub-Gaussian with parameter at most ξ2\xi_{2}. Since vv is sub-Gaussian with parameter σ\sigma and independent of ω\omega, the same argument shows that B4B_{4} is sub-Gaussian with parameter at most one, thereby completing the proof of the lemma. ∎

We now turn to the second term. Note the variational representation

By a discretization argument–say with a 1/21/2 cover {u1,,uM}\{u^{1},\ldots,u^{M}\} of the sphere with M2dM\leq 2^{d} elements—we obtain

Each term in this maximum is the product of two zero-mean variables, namely yy and μθ,u\langle\mu_{\theta},\,u\rangle. On one hand, the variable yy is sub-Gaussian with parameter at most θ22+σ2cσ\sqrt{\|\theta^{*}\|_{2}^{2}+\sigma^{2}}\leq c\sigma; on the other hand, Lemma 10 guarantees that μθ,u\langle\mu_{\theta},\,u\rangle is sub-Gaussian with constant parameter. The product of any two sub-Gaussian variables is sub-exponential, and thus, by standard sub-exponential tail bounds , we have

Since M2dM\leq 2^{d} and n>c1dn>c_{1}d, we conclude that T2c1+σ2dn  log(1/δ)T_{2}\leq c\sqrt{1+\sigma^{2}}\,\sqrt{\frac{d}{n}\;\log(1/\delta)} with probability at least 1δ1-\delta.

Combining our bounds on T1T_{1} and T2T_{2}, we conclude that εG(n,δ)c1+σ2  dn  log(1/δ)\varepsilon_{G}(n,\delta)\leq c\sqrt{1+\sigma^{2}}\;\sqrt{\frac{d}{n}\;\log(1/\delta)} with probability at least 1δ1-\delta. Thus, we see that Corollary 7 follows from Theorem 2.

D.3 Proof of Corollary 8

By the Cauchy-Schwarz inequality, we have

By the Cauchy-Schwarz inequality, we have

Note that yy is sub-Gaussian with parameter at most θ22+σ2\sqrt{\|\theta^{*}\|_{2}^{2}+\sigma^{2}}, whence

References