Statistical guarantees for the EM algorithm: From population to sample-based analysis

Sivaraman Balakrishnan, Martin J. Wainwright, Bin Yu

Introduction

Data problems with missing values, corruptions, and latent variables are common in practice. From a computational standpoint, computing the maximum likelihood estimate (MLE) in such incomplete data problems can be quite complex. To a certain extent, these concerns have been assuaged by the development of the expectation-maximization (EM) algorithm, along with growth in computational resources. The EM algorithm is widely applied to incomplete data problems, and there is now a very rich literature on its behavior (e.g., ). However, a major issue is that in most models, although the MLE is known to have good statistical properties, the EM algorithm is only guaranteed to return a local optimum. The goal of this paper is to address this potential gap between statistical and computational guarantees in application of the EM algorithm.

The EM algorithm has a lengthy and rich history. Various algorithms of the EM-type were analyzed in early work (e.g.,), before Dempster et al. introduced the EM algorithm in its modern general form. Among other results, they established its well-known monotonicity properties. The subsequent work of Wu established some of the most general convergence results known for the EM algorithm; see also the more recent papers . Together with other results, Wu showed that if the likelihood is unimodal and certain regularity conditions hold, then the EM algorithm converges to the unique global optimum. However, in most interesting cases of the EM algorithm, the likelihood function is multi-modal, in which case the behavior of the EM algorithm remains a little more mysterious. Indeed, despite its popularity and widespread practical effectiveness, the EM algorithm is often considered a “sensible heuristic” with little or no theoretical backing.

One interesting observation with the EM algorithm is given a “suitable” initialization, it often converges to a statistically useful estimate. For instance, in application to a mixture of regressions problem (see Section 2.2.2 for more details), Chaganty and Liang empirically demonstrate good performance for a two-stage estimator, in which the method of methods is used as an initialization, and then the EM algorithm is applied to refine this initial estimator. Although encouraging, this type of behavior is not well understood in a quantitative sense, especially how EM fixed points reached by this type of two-stage estimator are related to the global maximizers of the population likelihood. The goal of this paper is to address this question, and to develop some general tools for characterizing fixed points of the suitably initialized sample-based EM algorithm, and their relation to maximum likelihood estimates.

Some two-stage estimators have recently been analyzed in work on alternating minimization algorithms (see e.g. ) which show that at least in certain special cases optimization methods can be locally effective despite non-convexity. Most directly related to our work is the paper of Yi et al. which considers a special (degenerate) noiseless case of the EM algorithm for the mixtures of regressions problem. Results for the noisy mixtures of regressions problem follow from our general treatment of the EM algorithm (see Section 2.2.2).

In some settings, performing an exact M-step is computationally burdensome, in which case a natural alternative is some form of generalized EM updates. In such an algorithm, instead of performing an exact maximization, we simply choose a parameter value that does not decrease the likelihood. In addition to the standard EM updates, we also analyze a particular case of such an algorithm, known as gradient EM, based on a taking a single gradient step per iteration.

Our main results concern the population EM and gradient EM algorithms and their finite-sample counterparts. Our first set of results (Theorems 1 and 3) give conditions under which the population algorithms are contractive to the MLE, in a ball around the MLE. These results are completely deterministic. This population-level analysis is based on viewing these algorithms as perturbed versions of certain “oracle” algorithms which are known to be contractive around the MLE. Our second set of results (Theorem 2, Theorem 4 and Theorem 5) concern the sample-based EM and gradient EM algorithms which approximate the population-based algorithms using a subset of samples at each step. We give conditions under which these sample operators converge to an $\varepsilon$ -ball around the population MLE. These results involve probabilistic bounds on the deviations between the iterates of the population and sample-based algorithms.

The remainder of this paper is organized as follows. Section 2 provides an introduction to the EM and gradient EM algorithms, as well as a description of the three examples treated in detail in this paper—namely, Gaussian mixture models (Section 2.2.1), mixture of regressions (Section 2.2.2), and regression with missing covariates (Section 2.2.3). Section 3 is devoted to our general convergence results on both the EM and gradient EM algorithms. In Section 4, we revisit the three model classes previously introduced, and illustrate the use of our general theory by deriving some concrete corollaries. In concrete examples our theory gives a characterization of the quality of initialization needed and the rate of convergence of the EM and gradient EM algorithms. We complement these theoretical results with simulations that confirm various aspects of the theoretical predictions. In order to promote readability, we defer the more technical aspects of proofs to the appendices.

Background and model examples

We begin with basic background on the EM algorithm and its variants, along with a number of specific models that we revisit later in the paper.

Let $Y$ and $Z$ be random variables taking values in the sample spaces $\mathcal{\mathcal{Y}}$ and $\mathcal{\mathcal{Z}}$ , respectively. Suppose that the pair $(Y,Z)$ has a joint density function $f_{\theta^{*}}$ that belongs to some parameterized family $\{f_{\theta}\,\mid\,\theta\in\Omega\}$ , for a non-empty compact convex set $\Omega$ . Rather than observing the complete data $(Y,Z)$ , we observe only component $Y$ . Thus, the component $Z$ corresponds to the missing or latent structure in the data.

Our goal is to obtain an estimate of the unknown parameter $\theta^{*}$ via maximum likelihood—namely, to compute some $\widehat{\theta}\in\Omega$ maximizing the function $\theta\mapsto g_{\theta}(y)$ , where

is the density function of the observed variable $Y$ . Throughout this paper, we assume that $\theta^{*}$ is a maximizer of the population likelihood, but not that $\theta^{*}$ is a unique maximizer. Uniqueness is often violated in mixture models for which parameters are typically only identifiable up to permutation. In the examples that we consider, this non-identifiability will be resolved by appropriate initialization conditions.

In many settings, it can be difficult or computationally expensive to evaluate the log likelihood of the observed data, but relatively easy to compute the log likelihood $\log f_{\theta}(y,z)$ of both the latent and observed variables. The EM algorithm is well-suited to such settings. For each $\theta\in\Omega$ , let $k_{\theta}(z\mid y)$ denote the conditional density of $z$ given $y$ . A straightforward application of Jensen’s inequality then shows that the log likelihood at $\theta^{\prime}\in\Omega$ can be lower bounded as

with equality holding when $\theta=\theta^{\prime}$ . Thus, we have a family of lower bounds on the log likelihood, and the EM algorithm successively maximizes this lower bound ( $M$ -step), and then reevaluates the lower bound at the new parameter value ( $E$ -step).

With this notation, it is easy to specify the EM iterations. The update $\theta^{t}\rightarrow\theta^{t+1}$ consists of the following two steps.

E-step: Evaluate the expectation in equation (2) to compute $Q(\cdot|\theta^{t})$ .

M-step: Compute the maximizer $\theta^{t+1}=\arg\max\limits_{\theta^{\prime}\in\Omega}Q(\theta^{\prime}|\theta^{t})$ .

For future use, it is convenient to introduce the mapping $M:\Omega\rightarrow\Omega$ given by

With this choice, the $M$ -step corresponds to the update $\theta^{t+1}=M(\theta^{t})$ .

In a generalized EM algorithm, the requirements of the $M$ -step are relaxed: instead of finding the exact optimum, the algorithm is required only to find a value $\theta^{t+1}\in\Omega$ such that

Depending on how $\theta^{t+1}$ is chosen, this requirement actually defines a family of algorithms.

A closely related variant of the generalized EM updates is what we refer to as the gradient EM updates, applicable in the case when the function $Q(\cdot|\theta^{t})$ is differentiable at each iteration $t$ . Given a step size $\alpha>0$ , these updates take the form

where the gradient is taken in the first argument of $Q$ . For ease of notation, we define the mapping $G\colon\Omega\rightarrow\Omega$ by

An iteration of gradient EM can now be written compactly as $\theta^{t+1}=G(\theta^{t})$ .

There is a natural extension that includes a constraint arising from the parameter space $\Omega$ , in which the update is projected back onto the constraint set To avoid pathologies additionally assume that the constraint set is closed.. For simplicity, we focus on unconstrained problems in this paper, but all of our results extend in a straightforward way to constrained examples by incorporating the additional Euclidean projection. For appropriate choices of the step size parameter $\alpha$ , the gradient EM updates guarantee the ascent condition (4), so that it is a particular case of a generalized EM algorithm.

Let us now make an important distinction, namely, that between the population and sample-based versions of the EM updates. Up to this point, we have suppressed dependence on the number of observed samples $n$ . The population form of the (gradient) EM updates are an “oracle version”, in which we effectively observe an infinite number of samples, and consequently, the function $Q(\cdot|\theta)$ takes the form

From here onwards, we use the notation $M$ and $G$ for the EM and gradient EM operators, respectively, both defined at the population level.

In the classical statistical settings, we observe only $n$ i.i.d. samples $\{y_{i}\}_{i=1}^{n}$ of the $Y$ component. Under the i.i.d. assumption, we define the function

so that the expectation over $Y$ in equation (7) is replaced by the empirical expectation defined by the samples. The function $Q_{n}$ defines an analog of the population EM operator (3), namely

In an analogous fashion, we define the sample-based analog of the gradient EM operator (6), namely

where $\alpha>0$ is an appropriately chosen step size parameter.

2 Illustrative examples

The EM algorithm is popular and a variety of examples can be found in the literature. In this section, we review three specific models analyzed in this paper, and derive the form of the population and sample-based updates, both for the usual EM algorithm and the gradient EM algorithm.

An isotropic, balanced two-component Gaussian mixture model can be specified by a density of the form

Suppose that we are given $n$ i.i.d. samples $\{y_{i}\}_{i=1}^{n}$ drawn from the mixture density (11). The complete data $\{(y_{i},z_{i})\}_{i=1}^{n}$ corresponds to the original samples along with the component indicator variables $z_{i}\in\{0,1\}$ . The sample-based function $Q_{n}$ takes the form

where $w_{\theta}(y):=e^{-\frac{\|\theta-y\|_{2}^{2}}{2\sigma^{2}}}\Big{[}e^{-\frac{\|\theta-y\|_{2}^{2}}{2\sigma^{2}}}+e^{-\frac{\|\theta+y\|_{2}^{2}}{2\sigma^{2}}}\Big{]}^{-1}$ .

where the empirical expectation has been replaced by expectation under the mixture distribution (11).

On the other hand, the sample-based and population gradient EM operators with step size $\alpha>0$ are given by

We return to analyze the EM updates for the Gaussian mixture model in Section 4.1.

2.2 Mixture of regressions

On the other hand, the gradient EM operators are given by

where $\alpha>0$ is a step size parameter.

We return to analyze the EM updates for the mixture of regressions model in Section 4.2.

2.3 Linear regression with missing covariates

Our first two examples involved mixture models in which the class membership variable was hidden. Another canonical use of the EM algorithm is in cases with corrupted or missing data. In this section, we consider a particular instantiation of such a problem, namely that of linear regression with the covariates missing completely at random.

where $\rho\in[0,1)$ is the probability of missingness.

In writing all these expressions, we have assumed that the coordinates are permuted so that the missing values are in the first block.

We now have the necessary notation in place to describe the EM and gradient EM updates. For a given parameter $\theta$ , the EM update is based on maximizing

Again, this optimization problem has an explicit solution, so that the sample-based EM operator is given by

On the other hand, the gradient EM algorithm with step size $\alpha$ takes the form

We return to analyze the gradient EM updates for this model in Section 4.3.

General convergence results

We now turn to analysis of the EM algorithm and gradient EM algorithms. In both cases, we let ${\theta^{\ast}}$ denote a maximizer of the population likelihood. In this section, we give general sufficient conditions under which the population algorithms converge to ${\theta^{\ast}}$ and under which the sample-based algorithms converge to an $\varepsilon$ -ball around ${\theta^{\ast}}$ . Our analysis of each algorithm is organized as follows:

Our second result in Sections 3.1 and 3.2 concern the sample-based EM and gradient EM operators. These sample-based operators approximate the population-based update using a subset of samples at each step. Theorem 2 and Theorem 4 for sample-based EM and gradient EM, respectively give conditions under which the sample-based operator is guaranteed to converge to an $\varepsilon$ -ball around the fixed point ${\theta^{\ast}}$ . These results involve probabilistic bounds on the deviations between the population-based and sample-based operators. In addition, for gradient EM, we also analyze a stochastic update that uses a single sample per update in the flavor of stochastic approximation algorithms (see Theorem 5 in Section 3.2.3).

Let us begin with analysis of the standard EM updates, starting with the population version before turning to a sample-based version.

Recall that we always assume that the vector ${\theta^{\ast}}$ maximizes the population likelihood. It is a classical fact that it must then satisfy the condition

a property known as self-consistency. For this reason, the function $q(\cdot):=Q(\cdot|{\theta^{\ast}})$ plays an important role in our analysis.

We assume throughout this section that the function $q$ is $\lambda$ -strongly concave, meaning that

for all pairs $(\theta_{1},\theta_{2})$ in a neighborhood of $\theta^{*}$ . As we will illustrate, this condition holds in most concrete instantiations of EM, including the three model classes introduced in the previous section.

For any fixed $\theta$ , in order to relate the population EM updates to the fixed point ${\theta^{\ast}}$ , we require control on the two gradient mappings $\nabla q(\cdot)=\nabla Q(\cdot|{\theta^{\ast}})$ and $\nabla Q(\cdot|\theta)$ . These mappings are central in characterizing the fixed point ${\theta^{\ast}}$ and the update $M(\theta)$ respectively. Indeed, by virtue of the self-consistency property (24) and the convexity of $\Omega$ , the fixed point satisfies the first-order optimality condition

Similarly, for any $\theta\in\Omega$ , since $M(\theta)$ maximizes the function $\theta^{\prime}\mapsto Q(\theta^{\prime}|\theta)$ , we have

Equations (26) and (27) are sets of inequalities that characterize the points $M(\theta)$ and ${\theta^{\ast}}$ . Thus, at an intuitive level, in order to establish that $M(\theta)$ and ${\theta^{\ast}}$ are close, it suffices to verify that these two characterizations are close in a suitable sense. We also note that inequalities similar to the condition (27) are often used as a starting point in the classical analysis of M-estimators (e.g., see van de Geer ). In the analysis of EM, we obtain additional leverage from the self-consistency condition (24) that characterizes $\theta^{*}$ .

With this intuition in mind, we introduce the following regularity condition in order to relate conditions (27) and (24): The condition involves a Euclidean ball of radius $r$ around the fixed point ${\theta^{\ast}}$ , given by

As a concrete example, recall the Gaussian mixture model first introduced in Section 2.2.1. For this model, the condition (29) is equivalent to

where $w_{\theta}$ was previously defined following equation (12). Given that the function $\theta\mapsto w_{\theta}(y)$ is smooth in $\theta$ , provided that $\gamma$ is not too small, it is reasonable to expect that this condition will hold in a neighborhood of $\theta^{*}$ , and we confirm this intuition in Corollary 1 to follow.

Under the conditions we have introduced, the following result guarantees that the population EM operator is locally contractive:

Since both $M(\theta)$ and ${\theta^{\ast}}$ are in $\Omega$ , we may apply condition (26) with $\theta^{\prime}=M(\theta)$ and condition (27) with $\theta^{\prime}={\theta^{\ast}}$ . Doing so, adding the resulting inequalities and then performing some algebra yields the condition

Now the $\lambda$ -strong concavity condition (25) implies that the left-hand side is lower bounded as

Combining inequalities (32a) and (32b) with the original bound (31) yields

and canceling terms completes the proof. ∎

1.2 Guarantees for sample-based EM

We now turn to theoretical results on sample-based versions of the EM algorithm. More specifically, we consider two forms of the EM algorithm, the first being the standard form in which the operator $M_{n}:\Omega\rightarrow\Omega$ , as previously defined (9), is applied repeatedly, thereby generating the sequence $\theta^{t+1}=M_{n}(\theta^{t})$ . We also analyze a sample-splitting From a practical point, a potential advantage of sample splitting is that each iteration may be cheaper, since it is based on a smaller sample size. In contrast, a disadvantage is that it can be difficult to correctly specify the number of iterations in advance. version of the EM algorithm, in which given a total of $n$ samples and $T$ iterations, we divide the full data set into $T$ subsets of size $\lfloor n/T\rfloor$ , and then perform the updates $\theta^{t+1}=M_{n/T}(\theta^{t})$ , using a fresh subset of samples at each iteration.

with probability at least $1-\delta$ . With these definitions, we have the following guarantees:

If the sample size $n$ is large enough to ensure that

For a given iteration number $T$ , suppose the sample size $n$ is large enough to ensure that

Figure 1 provides an illustration of the behavior predicted by Theorem 2: both algorithms are expected to show geometric convergence to the target parameter $\theta^{*}$ , up to some tolerance. For the bound (35b) note that the first term is decreasing in $t$ , whereas the second term is independent of $t$ . Thus, for a fixed sample size $n$ , the bounds in Theorem 2 suggests a reasonable choice of the number of iterations. In particular, focusing on the standard EM algorithm, consider any positive integer As will be clarified in the sequel, such a choice of $T$ exists in various concrete models considered here. $T$ such that

This choice ensures that the first term in the bound (35b) is dominated by the second term, and hence that

with probability at least $1-\delta$ . For the sample-splitting update in (36b) the first term is decreasing in $t$ , whereas the second term is increasing in $t$ . In this case, a similar conclusion holds when $T$ is chosen to be the smallest positive integer such that

Let us now turn to the proof of the theorem.

We give a detailed proof of the claim (36b), from which it will be clear that the claim (35b) follows by a nearly identical argument. For any iteration $s\in\{1,2,\ldots,T\}$ , we have

with probability at least $1-\frac{\delta}{T}$ . Consequently, by a union bound over all $T$ indices, the bound (40) holds uniformly with probability at least $1-\delta$ . We perform the remainder of our analysis under this event.

Indeed, when this bound holds, we may iterate it to show that

where the final step follows by summing the geometric series.

It remains to prove the claim (41), and we do so via induction on the iteration number. Beginning with $s=1$ , we have

In the induction from $s\mapsto s+1$ , suppose that $\|\theta^{s}-{\theta^{\ast}}\|_{2}\leq r$ , and the bound (41) holds at iteration $s$ . The same argument then implies that the bound (41) also holds for iteration $s+1$ , and that $\|\theta^{s+1}-{\theta^{\ast}}\|_{2}\leq r$ , thus completing the proof. ∎

2 Analysis of gradient EM algorithm

We now turn to analysis of the gradient EM algorithm. As before, we separate our analysis into two parts, the first (Theorem 3) addressing the behavior of the population-level operator, and the second (Theorems 4 and 5) providing guarantees for sample-based updates.

Recall that the gradient EM algorithm generates a sequence of iterates $\{\theta^{t}\}_{t=0}^{\infty}$ via the recursion $\theta^{t+1}=G(\theta^{t})$ , where

Here $\alpha>0$ is a step size parameter to be chosen. For analyzing gradient EM, we also require an additional condition on the function $q(\theta)=Q(\theta|{\theta^{\ast}})$ , previously defined in Section 3.1. In addition to the $\lambda$ -strong concavity assumption (25), we also assume that $q$ is $\mu$ -smooth, meaning that

Under the stated strong concavity and smoothness assumptions, it is a standard result from optimization theory that the gradient operator $T:\Omega\rightarrow\Omega$ with step size choice $\alpha=\frac{2}{\mu+\lambda}$ is contractive, in particular with

Intuitively, then, if the function $Q(\cdot|\theta)$ is “close enough” to the function $q(\cdot)=Q(\cdot|{\theta^{\ast}})$ , then the gradient EM operator might be expected to satisfy a similar contractivity condition. The closeness requirement is formalized in the following condition:

See Figure 2 for an illustration of this condition. We give concrete examples of this condition and its verification in Section 4. As with the FOS condition observe that the GS condition is always satisfied at the fixed point $\theta^{*}$ , i.e. for $r=0$ with $\gamma=0$ . Allowing for strictly positive $\gamma$ , if the functions $Q(\cdot|\theta)$ are sufficiently regular we expect the condition to hold in a region around $\theta^{*}$ . Observe that this condition involves the gradient of the functions $Q(\cdot|\theta)$ and $Q(\cdot|{\theta^{\ast}})$ at $\theta$ , as opposed to $M(\theta)$ in the case of the FOS condition. For this reason, it can be easier to verify for specific models.

Under this condition, the following result guarantees local contractivity of the gradient EM operator (42):

By definition of the gradient EM update (42), we have

where step (i) follows from the triangle inequality, and step (ii) uses the contractivity of $T$ from equation (45), and condition GS. Substituting $\alpha=\frac{2}{\mu+\lambda}$ and performing some algebra yields the claim. ∎

2.2 Guarantees for sample-based gradient EM

In this section, in parallel with our earlier analysis of sample-based version of the EM algorithm, we analyze two sample-based variants of the gradient EM algorithm, the first when the update operator $G_{n}$ is computed using all $n$ samples and applied repeatedly, and the second based on sample-splitting.

If the sample size $n$ is large enough to ensure that

2.3 Stochastic version of gradient EM

In this section, we analyze a sample-based variant of gradient EM that is inspired by stochastic approximation. It can be viewed as an extreme form of sample-splitting, in which we use only a single sample per iteration, but compensate for the noisiness using a decaying step size. Throughout this section we assume that (a lower bound on) the radius of convergence $r$ of the population operator is known to the algorithmThis assumption can be restrictive in practice. We believe the requirement can be eliminated by a more judicious choice of the step-size parameter in the first few iterations..

In particular, given a sequence of positive step sizes $\{\alpha^{t}\}_{t=0}^{\infty}$ , we analyze the recursion

In order to prove this theorem we first establish a recursion on the expected mean-squared error. As with Theorem 3 this result is established by relating the population gradient EM operator to the gradient ascent operator on the function $q(\cdot)$ . This key recursion along with some algebra will yield the theorem.

Given the stochastic EM gradient iterates with step sizes $\{\alpha^{t}\}_{t=0}^{\infty}$ , the error $\Delta^{t+1}:=\theta^{t+1}-\theta^{*}$ at iteration $t+1$ satisfies the recursion

Using this result, we can now complete the proof of the bound (54). With the step size choice $\alpha^{t}:=\frac{a}{\xi\,(t+2)}$ where $a=\frac{3}{2}$ , unwrapping the recursion (55) yields

In order to bound these terms we use the following fact: For any $a\in(1,2)$ , we have

See Noorshams and Wainwright for a proof. Using this fact in Equation (56) yields

Finally, applying the integral upper bound $\sum\limits_{\tau=2}^{t+2}\frac{1}{\tau^{2-a}}\leq\int_{1}^{t+2}\frac{1}{x^{2-a}}dx\;\leq 2(t+3)^{a-1}$ yields the claim (54). ∎

For the convenience of the reader, let us now summarize the theorems given in this section, including the assumptions on which they rely and the results that they provide.

Consequences for specific models

In the previous section, we provided a number of general theorems on the behavior of the EM algorithm as well as the gradient EM algorithm, at both the population and sample levels. In this section, we develop some concrete consequences of this general theory for the three specific model classes previously introduced in Section 2.2.

We begin by analyzing the EM updates for the Gaussian mixture model previously introduced in Section 2.2.1. Our first result (Corollary 1) establishes contractivity for the population operator (13b), whereas our second result (Corollary 2) provides bounds for the sample-based EM updates.

Recall that our mixture model consists of two equally weighted components, with distributions $\mathcal{N}(\theta^{*},\sigma^{2}I)$ and $\mathcal{N}(-\theta^{*},\sigma^{2}I)$ respectively. The difficulty of estimating this mixture model can be characterized by the signal-to-noise ratio $\frac{\|\theta^{*}\|_{2}}{\sigma}$ , and our analysis requires a lower bound of the form

for a sufficiently large constant $\eta>0$ . Past work by Redner and Walker provides evidence for the necessity of this assumption: for Gaussian mixtures with low signal-to-noise ratio, they show that the ML solution has large variance and furthermore verify empirically that the convergence of the EM algorithm can be quite slow. Other researchers also provide theoretical justification for the slow convergence of EM on poorly separated Gaussian mixtures.

With the signal-to-noise ratio lower bound $\eta$ defined above we have the following guarantee:

This corollary guarantees that when the SNR is sufficiently large, then the MLE $\theta^{*}$ has a basin of attraction that is at least a constant fraction of the signal strength. Moreover, the convergence rate of the population updates is geometric, with the contraction factor $\kappa$ decreasing exponentially in the signal-to-noise ratio. The proof of Corollary 1 involves establishing that for a sufficiently large SNR, the strong concavity and FOS ( $\gamma$ ) conditions hold for a Gaussian mixture model, so that Theorem 1 can be applied. Although the proof structure is conceptually straightforward, the details are quite technical, so that we defer it to Appendix B.1.

Based on the population-level contractivity guaranteed by Corollary 1, we can also establish guarantees for the standard EM sequence $\theta^{t+1}=M_{n}(\theta^{t})$ , where the sample-based operator $M_{n}$ was previously defined in equation (13a). This guarantee involves the function $\varphi(\sigma;\|\theta^{*}\|_{2}):=\|\theta^{*}\|_{2}\sqrt{\|\theta^{*}\|_{2}^{2}+\sigma^{2}}$ , as well as positive universal constants $(c,c_{1},c_{2})$ .

See Appendix B.2 for the proof of this result. In Appendix B.3, we also give guarantees for EM with sample-splitting which achieves better dependence on $\|\theta^{*}\|_{2}$ and $\sigma$ with an easier proof at the cost of additional logarithmic factors in sample complexity.

A related result of Dasgupta and Schulman shows that when the SNR is sufficiently high a modified EM algorithm, with an intermediate pruning step, reaches a near-optimal solution in two iterations. On one hand, the SNR condition in our corollary is significantly weaker, requiring only that it is larger than a fixed constant independent of dimension (as opposed to scaling with $d$ ), but their theory is developed for more general $k$ -mixtures.

The bound (59) provides a rough guide of how many iterations are required: consider the smallest positive integer such that

Corollary 2 makes a number of qualitative predictions that can be tested. To begin, it predicts that the statistical error $\|\theta^{t}-\theta^{*}\|_{2}$ should decrease geometrically, and then level off at a plateau. Figure 3 shows the results of simulations designed to test this prediction: for dimension $d=10$ and sample size $n=1000$ , we performed $10$ trials with the standard EM updates applied to Gaussian mixture models with SNR $\frac{\|\theta^{*}\|_{2}}{\sigma}=2$ . In panel (a), the red curves plot the log statistical error versus the iteration number, whereas the blue curves show the log optimization error versus iteration. As can be seen by the red curves, the statistical error decreases geometrically before leveling off at a plateau. On the other hand, the optimization error decreases geometrically to numerical tolerance. Panel (b) shows that the gradient EM updates have a qualitatively similar behavior for this model, although the overall convergence rate appears to be slower.

In conjunction with Corollary 1, Corollary 2 also predicts that the convergence rate should increase as the signal-to-noise ratio $\frac{\|\theta^{*}\|_{2}}{\sigma}$ is increased. Figure 3 shows the results of simulations designed to test this prediction: again, for mixture models with dimension $d=10$ and sample size $n=1000$ , we applied the standard EM updates to Gaussian mixture models with varying SNR $\frac{\|\theta^{*}\|_{2}}{\sigma}$ . For each choice of SNR, we performed $10$ trials, and plotted the log optimization error $\log\|\theta^{t}-\widehat{\theta}\|_{2}$ versus the iteration number. As expected, the convergence rate is geometric (linear on this logarithmic scale), and the rate of convergence increases as the SNR growsTo be clear, Corollary 2 predicts geometric convergence of the statistical error $\|\theta^{t}-\theta^{*}\|_{2}$ , whereas these plots show the optimization error $\|\theta^{t}-\widehat{\theta}\|_{2}$ . However, the analysis underlying Corollary 2 can also be used to show geometric convergence of the optimization error..

2 Mixtures of regressions

In this section, we analyze the EM and gradient EM algorithms for the mixture of regressions (MOR) model, previously introduced in Section 2.2.2. As in our analysis of the Gaussian mixture model, our theory applies when the signal-to-noise ratio is sufficiently large, as enforced by a condition of the form

Under a suitable lower bound on this quantity, our first result guarantees that the population level operators (17b) and (18a) are locally contractive.

As shown in the proof, the contraction coefficient $\kappa$ is again a decreasing function of the SNR parameter $\eta$ . However, its functional form is not as explicit as in the Gaussian mixture case. The proof of Corollary 3 involves verifying that the function $q$ for the MOR model satisfies the required concavity, smoothness, GS( $\gamma$ ) and FOS( $\gamma$ ) conditions. It is quite technically involved, so that we defer it to Appendix C.1.

Let us now provide guarantees for a sample-splitting version of the EM updates. Recall that sample-based EM operator was previously defined in equation (17a). For a given sample size $n$ and iteration number $T$ , suppose that we splitTo simplify exposition, assume that $n/T$ is an integer. our full data set into $T$ subsets, each of size $n/T$ . We then generate the sequence $\theta^{t+1}=M_{n/T}(\theta^{t})$ , where we use a fresh subset at each iteration. In the following result, we use $\varphi(\sigma;\|\theta^{*}\|_{2})=\sqrt{\sigma^{2}+\|\theta^{*}\|_{2}^{2}}$ , along with positive universal constants ( $c_{1}$ , $c_{2}$ ).

We prove this corollary in Appendix C.2. Note the bound (63) again provides guidance on the number of iterations to perform. For a given sample size $n$ , suppose we perform $T=c\log(n/d\varphi^{2}(\sigma;\|\theta^{*}\|_{2}))$ iterations for a constant $c$ . The bound (63) then implies that

with probability at least $1-\delta$ . Apart from the logarithmic penalty $\log^{2}\big{(}\frac{n}{d\varphi^{2}(\sigma;\|\theta^{*}\|_{2})}\big{)}$ , this guarantee matches the minimax rate for estimation of a $d$ -dimensional regression vector. We note that the logarithmic penalty can be removed by instead analyzing the standard form of the EM updates, as we did for the Gaussian mixture model.

We conclude our discussion of the MOR model by stating a result for the stochastic form of gradient EM analyzed in Theorem 5. In particular, given a data set of size $n$ , we run the algorithm for $n$ iterations, with a step size $\alpha^{t}:=\frac{3}{t+2}$ for iterations $t=1,\ldots,n$ . Once again our result is terms of $\varphi(\sigma;\|\theta^{*}\|_{2})=\sqrt{\sigma^{2}+\|\theta^{*}\|_{2}^{2}}$ and positive universal constants ( $c_{1}$ , $c_{2}$ ).

We prove this corollary in Appendix C.3. Figure 4 illustrates this corollary showing the error as a function of iteration number (sample size) for the stochastic gradient EM algorithm.

3 Linear regression with missing covariates

This section is devoted to analysis of the gradient EM algorithm for the problem of linear regression with missing covariates, as previously introduced in Section 2.2.3. Here the central parameter is the probability $\rho$ that any given coordinate of the covariate vector is missing, and our analysis links this quantity to the signal-to-noise ratio and the radius of contractivity. Define $\xi_{1}$ and $\xi_{2}$ to be such that the following bounds hold,

For any given choice of $(\xi_{1},\xi_{2})$ define $\xi:=(\xi_{1}+\xi_{2})^{2}.$ Our guarantees apply whenever the missing probability is bounded as

See Appendix D.1 for the proof of Corollary 6. Relative to our previous results, this corollary is somewhat unusual, in that we require an upper bound on the ratio $\frac{\|\theta^{*}\|_{2}}{\sigma}$ . Although this requirement might seem counter-intuitive at first sight, known minimax lower bounds on regression with missing covariates show that it is unavoidable— that is, it is not an artifact of our analysis nor of the gradient EM algorithm. Roughly these lower bounds formalize the intuition that as the norm $\|\theta^{*}\|_{2}$ increases, the amount of missing information increases in proportion to the amount of observed information. Figure 6 provides the results of simulations that confirm this behavior, in particular showing that for regression with missing data, the radius of convergence eventually decreases as $\|\theta^{*}\|_{2}$ grows.

Let us now provide guarantees for a sample-splitting version of the EM updates, based on the sample-based EM operator in equation (22a). As usual, for a given sample size $n$ and iteration number $T$ , suppose that we split our full data set into $T$ subsets, each of size $n/T$ . We then generate the sequence $\theta^{t+1}=M_{n/T}(\theta^{t})$ , where we use a fresh subset at each iteration.

We prove this corollary in Appendix D.2. We note that the constant $c_{2}$ is a monotonic function of the parameters $(\xi_{1},\xi_{2})$ , but does not otherwise depend on $n$ , $d$ , $\sigma^{2}$ or other problem-dependent parameters.

As with Corollary 4, this result provides guidance on the appropriate number of iterations to perform: in particular, if we set $T=c\log n$ for a sufficiently large constant $c$ , then the bound (69) implies that

We conclude our discussion of the missing covariates model by stating a result for the stochastic form of gradient EM analyzed in Theorem 5. In particular, given a data set of size $n$ , we run the algorithm for $n$ iterations, with a step size $\alpha^{t}:=\frac{3}{t+2}$ for iterations $t=1,\ldots,n$ .

We prove this corollary in Appendix D.3. Figure 7 illustrates this.

Discussion

In this paper, we have provided some general techniques for studying the EM and gradient EM algorithms, at both the population and finite-sample levels. Although this paper focuses on these specific algorithms, we expect that the techniques could be useful in understanding the convergence behavior of other algorithms for potentially non-convex problems.

The analysis of this paper can be extended in various directions. For instance, in the three concrete models that we treated, we assumed that the model was correctly specified, and that the samples were drawn in an i.i.d. manner, both conditions that may be violated in statistical practice. Maximum likelihood estimation is known to have various robustness properties under model mis-specification. Developing an understanding of the EM algorithm in this setting is an important open problem.

Finally, we note that in concrete examples our analysis guarantees good behavior of the EM and gradient EM algorithms when they are given suitable initialization. For the three model classes treated in this paper, simple pilot estimators can be used to obtain such initializations—in particular using PCA for Gaussian mixtures and mixtures of regressions (e.g., ), and the plug-in principle for regression with missing data (e.g., ). These estimators can be seen as particular instantiations of the method of moments . Although still an active area of research, a line of recent work (e.g., ) has demonstrated the utility of moment-based estimators or initializations for other types of latent variable models, and it would be interesting to analyze the behavior of EM for such models.

This research was partially supported by ONR-MURI grant N00014-11-1-0688 and NSF grant CIF-31712-23800 to MJW, and by US NSF grants DMS-1107000, CDS&E-MSS 1228246, ARO grant W911NF-11-1-0114, the Center for Science of Information (CSoI), and US NSF Science and Technology Center, under grant agreement CCF-0939370. SB would also like to thank John Duchi for helpful discussions.

Appendix A Proofs for stochastic gradient EM

In this section we provide proofs of results related to Theorem 5 from Section 3.2.3. It only remains to prove Lemma 1.

In order to establish Lemma 1 we require an analogue of Theorem 3 that allows for a wider range of step sizes. Recall the classical gradient ascent operator on the function $q(\theta)=Q(\theta|\theta^{*})$ . For step size $\alpha>0$ , it takes the form $T(\theta)=\theta+\alpha\nabla q(\theta)$ . Under the stated $\lambda$ -concavity and $\mu$ -smoothness conditions, for any step size $0<\alpha\leq\frac{2}{\lambda+\mu}$ , the classical gradient operator $T$ is contractive with parameter

This follows from the classical analysis of gradient descent (e.g., ). Using this fact, we can prove the following about the population gradient EM operator:

For any step size $0<\alpha\leq\frac{2}{\lambda+\mu}$ , the population gradient EM operator $G:\Omega\rightarrow\Omega$ is contractive with parameter $\kappa(\alpha)=1-\alpha\xi$ , where

We omit the proof, since it follows from a similar argument to that of Theorem 3. With this preliminary in place we can now begin the proof of Lemma 1.

Introducing the shorthand $\widehat{W}(\theta):=\nabla Q_{1}(\theta|\theta)$ , we have $\widetilde{\theta}^{t+1}-\theta^{t}=\alpha^{t}\widehat{W}(\theta)$ , and hence

Letting $\mathcal{F}_{t}$ denote the $\sigma$ -field of events up to the random variable $\theta^{t}$ , note that

Consequently, by iterated expectations, we have

Combining with our earlier inequality (72) yields

Defining $G^{t}(\theta^{t}):=\theta^{t}+\alpha^{t}W(\theta^{t})$ , we see that

where step (i) uses the contractivity of $G^{t}$ established in Lemma 2 and step (ii) uses the definition of $\xi$ from equation (71). Putting together the pieces yields the claim (55).

Appendix B Proofs for Gaussian mixture models

In this section, we provide proofs of results related to the Gaussian mixture model, as presented in Section 4.1. More specifically, we first prove Corollary 1 on the population level behavior, followed by the proof of Corollary 2 on the behavior of the standard sample-based EM updates.

Here the weighting function takes the form

It remains to verify the FOS( $\gamma$ ) condition (29). The following auxiliary lemma is central to the proof:

Under the conditions of Corollary 1, there is a constant $\gamma\in(0,1)$ with $\gamma\leq\exp(-c_{2}\eta^{2})$ such that

where $\Delta_{w}(y):=w_{\theta}(y)-w_{{\theta^{\ast}}}(y)$ .

This follows immediately from Lemma 3. Thus, the FOS condition holds when $\gamma<1.$ The bound on the contraction parameter follows from the fact that $\gamma\leq\exp(-c_{2}\eta^{2})$ and applying Theorem 1 yields Corollary 1.

We now prove Lemma 3. Our proof makes use of the following elementary facts:

For the function $f(t)=\frac{t^{2}}{\exp(\mu t)}$ , we have

For the function $g(t)=\frac{1}{(\exp(t)+\exp(-t))^{2}}$ , we have

With these preliminaries in place, we can now begin the proof. For each $u\in$ , define $\theta_{u}={\theta^{\ast}}+u\Delta$ , where $\Delta:=\theta-{\theta^{\ast}}$ . Taylor’s theorem applied to the function $\theta\mapsto w_{\theta}(Y)$ , followed by expectations, yields

By construction, the matrix $D$ is diagonal, so that it suffices to bound the diagonal terms. Beginning with the first diagonal entry, we have

Defining the event $\mathcal{E}=\{V_{1}\leq\frac{\|\theta^{*}\|_{2}}{4}\}$ , we condition on it and its complement to obtain

Conditioned on $\mathcal{E}$ and $\mathcal{E}^{c}$ , respectively, we then apply the bounds (74a) and (74b) to obtain

provided $\|\theta^{*}\|_{2}\|\theta_{u}\|_{2}\geq 4\sigma^{2}$ . Noting that

Note that the mean of $V_{1}$ is lower bounded as

where step (i) follows from the lower bound (77). Consequently, by standard Gaussian tail bounds, we have

On the other hand, for any index $j\neq 1$ , we have

where the reader should recall the function $g$ from equation (75a). Once again, conditioning on the event $\mathcal{E}=\{V_{1}\leq\frac{\|\theta^{*}\|_{2}}{4}\}$ and its complement yields

Returning to equation (76), we have shown that

whenever $\frac{\|\theta^{*}\|_{2}^{2}}{\sigma^{2}}\geq\eta^{2}\geq 16/3$ . On this basis, the bound (73) holds as long as the signal-to-noise ratio is sufficiently large,

B.2 Proof of Corollary 2

where the final step uses the fact that $|Z_{u}-Z_{v}|\leq Z\,\|u-v\|_{2}$ for any pair $(u,v)$ . Putting together the pieces, we conclude that

using a standard symmetrization result for empirical processes (e.g., ). Now observe that for any triplet of $d$ -vectors $y$ , $\theta$ and $\theta^{\prime}$ , we have the Lipschitz property

Consequently, by the Ledoux-Talagrand contraction for Rademacher processes , we have

Putting together the pieces, we conclude that

for all $\lambda$ sufficiently small. Combined with our first discretization (79), we have thus shown that

Combined with the Chernoff approach, this bound on the MGF implies that, as long as $n\geq c_{1}d\log(1/\delta)$ for a sufficiently large constant $c_{1}$ , we have

B.3 Guarantees for EM with sample-splitting

In this section, we state and prove a result for the EM algorithm with sample-splitting for the mixture of Gaussians.

Consider a Gaussian mixture model satisfying the $\mbox{SNR}(\eta)$ condition (57), and any initialization $\theta^{0}$ such that $\|\theta^{0}-\theta^{*}\|_{2}\leq\frac{\|\theta^{*}\|_{2}}{4}$ . Given a sample size $n\geq 16T\log(6T/\delta)$ , then with probability at least $1-\delta$ , the sample-splitting EM iterates $\{\theta^{t}\}_{t=0}^{T}$ satisfy the bound

It is worth comparing the result here to the result established earlier in Corollary 2. The sample-splitting EM algorithm is more sensitive to the number of iterations which determines the batch size and needs to be chosen in advance. Supposing that the number of iterations were chosen optimally however the result has better dependence on $\|\theta^{*}\|_{2}$ and $\sigma$ at the cost of a logarithmic factor in $n$ .

The proof follows by establishing a bound on the function $\varepsilon_{M}(n,\delta)$ . Define $\mathcal{S}=\{\theta:\|\theta-\theta^{*}\|_{2}\leq\frac{\|\theta^{*}\|_{2}}{4}\}$ . Recalling the updates in (13a) and (13b), note that

We bound each of these terms in turn, in particular showing that

Observe that since $Y\sim(2Z-1)\theta^{*}+v$ we have

Since $Z_{i}$ are i.i.d Bernoulli variables, Hoeffding’s inequality implies that

with probability at least $1-\frac{\delta}{4}$ . On the other hand, the vector $U_{1}:=\frac{1}{n}\sum_{i=1}^{n}{v_{i}}$ is zero-mean and sub-Gaussian with parameter $\sigma/\sqrt{n}$ , whence the squared norm $\|U_{1}\|_{2}^{2}$ is sub-exponential. Using standard bounds for sub-exponential variates and the condition $n>\sigma d$ , we obtain

with probability at least $1-\delta/4$ . Combining the pieces yields the claimed bound (82) on $T_{1}$ .

The random variable $w_{\theta}(Y)(2Z-1)$ lies in the interval $$, so that Hoeffding’s inequality implies that

with probability at least $1-\delta/4$ . Putting together the pieces yields the claimed bound (82) on $T_{2}$ , thereby completing the proof of the corollary. ∎

Appendix C Proofs for mixtures of regressions

In this appendix, we provide proofs of results related to the mixture of regressions model, as presented in Section 4.2. More specifically, we first prove Corollary 3 on the population level behavior, followed by the proof of Corollaries 4 and 5 on the behavior of sample-splitting EM updates and stochastic gradient EM updates, respectively.

We begin by proving part (a) of the corollary on the population EM update, which is based on maximizing the function

It remains to verify condition FOS. Define the difference function $\Delta_{w}(X,Y):=w_{\theta}(X,Y)-w_{\theta^{*}}(X,Y)$ , and the difference vectors $\Delta=\theta-\theta^{*}$ . Using this notation, for this model, we need to show that

Note that we can write $Y\stackrel{{\scriptstyle d}}{{=}}(2Z-1)\langle X,\,\theta^{*}\rangle+v$ , where $Z\sim\mbox{Ber}(1/2)$ is a Bernoulli variable. Using this notation, it is equivalent to show

for $\gamma\in[0,1/2)$ in order to establish contractivity. In order to prove the theorem with the desired upper bound on $\kappa$ we need to show (83) with $\gamma\in[0,1/4).$ The following lemma provides control on the two terms:

Under the conditions of Corollary 3, there is a constant $\gamma<1/4$ such that for any fixed vector $\widetilde{\Delta}$ we have

for $\kappa\in[0,\frac{1}{2}).$ This is an immediate consequence of Lemma 4.

It remains to prove Lemma 4. Since the standard deviation $\sigma$ is known, a simple rescaling argument allows us to take $\sigma=1$ , and replace the weight function in (16a) with

Our proof makes use of the following elementary result on Gaussian random vectors:

A similar argument yields the second claim. ∎

Noting that Lemma 4 consists of two separate inequalities (84a) and (84b), we treat these cases separately.

We split the proof of this bound into two separate cases: namely, $\|\Delta\|_{2}\leq 1$ and $\|\Delta\|_{2}>1$ .

Thus, using a Taylor series with integral form remainder on the function $\theta\mapsto w_{\theta}(X,Y)$ yields

where $Z_{u}:=Y\langle X,\,\theta^{*}+u\Delta\rangle$ . Substituting for $\Delta_{w}(X,Y)$ in inequality (84a), we see that it suffices to show

for some $\gamma\in[0,1/4)$ . The following auxiliary result is central to establishing this claim:

There is a $\gamma\in[0,1/4)$ such that for each $u\in$ , we have

See Section C.1.5 for the proof of this lemma.

Using Lemma 6, let us bound the quantity $A_{u}$ from equation (89). Since $\theta^{*}=\theta_{u}-u\Delta$ , we have $A_{u}=B_{1}+B_{2}$ , where

In order to show that $A_{u}\leq\frac{\gamma}{2}\|\Delta\|_{2}\,\|\widetilde{\Delta}\|_{2}$ , it suffices to show that $\max\{B_{1},B_{2}\}\leq\frac{\gamma}{4}\|\Delta\|_{2}\|\widetilde{\Delta}\|_{2}$ .

By the Cauchy-Schwarz inequality, we have

Similarly, another application of the Cauchy-Schwarz inequality yields

where the second step follows from the bound (90b), and the fact that $(2Z-1)^{2}=1$ . In this case, we have

where step (i) uses the bound (86b) from Lemma 5, and step (ii) that $\|\Delta\|_{2}\leq 1$ . Combining the pieces, we conclude that $B_{2}\leq\frac{\gamma}{4}\|\Delta\|_{2}\|\widetilde{\Delta}\|_{2}$ , which completes the proof of inequality (84a) in the case $\|\Delta\|_{2}\leq 1$ .

We now turn to the second case of the bound (84a). Our argument (here and in later sections) makes use of various probability bounds on different events, which we state here for future reference. These events involve the scalar $\tau:=C_{\tau}\sqrt{\log\|\theta^{*}\|_{2}}$ for a constant $C_{\tau}$ , as well as the vectors

For the event $\mathcal{E}_{2}:=\big{\{}|\langle X,\,\theta^{*}\rangle|>\tau\big{\}}\cap\big{\{}|\langle X,\,\theta_{u}\rangle|>\tau\big{\}}\cap\big{\{}|v|\leq\frac{\tau}{2}\big{\}}$ , we have

See Section C.1.7 for the proof of this result.

With this set-up, our goal is to bound the quantity

We bound each of these five terms in turn.

Applying the Cauchy-Schwarz inequality and using the fact that $(2Z-1)^{2}=1$ yields

We now bound $\Delta_{w}(X,Y)$ conditioned on the event $\mathcal{E}_{1}\cap\mathcal{E}_{2}$ . Since $\operatorname{sign}(\langle X,\,\theta^{*}\rangle)=\operatorname{sign}(\langle X,\,\theta_{u}\rangle)$ on the event $\mathcal{E}_{1}$ , we have

Recalling the weight function (85), we claim that when conditions (93a) and (93b) hold, then

Combined with our earlier bound (92), we have shown

Applying Lemma 8 with $\mathcal{E}=\mathcal{E}_{1}\cap\mathcal{E}_{2}$ yields $\Psi(\mathcal{E}_{1}\cap\mathcal{E}_{2})\leq 2\|\widetilde{\Delta}\|_{2}\|\theta^{*}\|_{2}e^{-\tau^{2}}$ .

Combining the Cauchy-Schwarz inequality with Lemma 7(i), we have

where step (i) makes use of the bound $\langle X,\,\theta^{*}\rangle\,\langle X,\,\theta_{u}\rangle\leq 0$ ; and step (ii) follows since $\theta_{u}=\theta^{*}+u\Delta$ , and $u\in$ .

where step (i) follows from the conditional covariance bound of Lemma 8.

Combining the Cauchy-Schwarz inequality with Lemma 7(iv) yields

Together with Lemma 8, we obtain the bound

where step (i) uses the fact that $\|\Delta\|_{2}\geq 1$ .

We have thus obtained bounds on all five terms in the decomposition (91). We combine these bounds with the with lower bound $\|\theta_{u}\|_{2}\geq\frac{\|\theta^{*}\|_{2}}{2}$ from equation (87b), and then perform some algebra to obtain

where $c$ is a universal constant. In particular, selecting $\tau=c_{\tau}\sqrt{\log\|\theta^{*}\|_{2}}$ for a sufficient large constant $c_{\tau}$ , selecting the constant $\eta$ in (87a) sufficiently large yields the claim (84a).

C.1.2 Proof of inequality (84b)

As in Section C.1.1, we treat the cases $\|\Delta\|_{2}\leq 1$ and $\|\Delta\|_{2}\geq 1$ separately.

As before, by a Taylor expansion of the function $\theta\mapsto\Delta_{w}(X,Y)$ , it suffices to show that

For any fixed $u\in$ , the Cauchy-Schwarz inequality implies that

Conditioned on the event $\mathcal{E}_{1}\cap\mathcal{E}_{2}$ , the bound (94) implies that $|\Delta_{w}(X,Y)|\leq\exp(-\tau^{2})$ , and hence $\Psi(\mathcal{E}_{1}\cap\mathcal{E}_{2})\leq e^{-2\tau^{2}}$ .

Thus, we have derived bounds on each of the three terms in the decomposition (96): putting them together yields

By choosing $C_{\tau}$ sufficiently large in the definition of $\tau$ , selecting the signal-to-noise constant $\eta$ in condition (87a) sufficiently large, the claim follows.

C.1.5 Proof of Lemma 6

The lemma statement consists of two inequalities, and we divide our proof accordingly.

and we bound each of these terms in turn. The reader should recall the constant $\tau:=C_{\tau}\sqrt{\log\|\theta^{*}\|_{2}}$ , as well as the events $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$ from Lemma 7.

where the final step follows from inequality (74a). Combined with Lemma 7(iv), we conclude that $\Psi(\mathcal{E}_{4}^{c})\leq\frac{1}{2e^{2}}e^{-\frac{\tau^{2}}{2}}$ .

where step (i) follows from inequality (98), and step (ii) follows from Lemma 7(iii).

Conditioned on the event $\mathcal{E}_{2}$ , we have $Y^{2}\langle X,\,\theta_{u}\rangle^{2}\geq\frac{\tau^{2}}{2}$ , where we have used the lower bound (93b). Introducing the shorthand $t^{*}=\tau^{2}/2$ , this lower bound implies that

where inequality (i) is valid as long as $t^{*}=\frac{\tau^{2}}{2}\geq\frac{1}{2}$ , or equivalently $\tau^{2}\geq 1$ .

Substituting our upper bounds on three components in the decomposition (97) yields

Setting $C_{\tau}$ sufficiently large in the definition of $\tau$ and choosing sufficiently large values of the signal-to-noise constant $\eta$ in the condition (87a) yields the claim.

Combining Lemma 8 with the bound $\|\Delta\|_{2}\leq 1$ , we find that $\langle X,\,\theta^{*}\rangle^{2}\leq 2\tau^{2}+4$ . Since $Y\stackrel{{\scriptstyle d}}{{=}}(2Z-1)\langle X,\,\theta^{*}\rangle+v$ , we have

Putting together the pieces, we conclude that $\Psi(\mathcal{E}_{5}^{c})\leq\frac{4\tau^{3}+10\tau}{16\|\theta_{u}\|_{2}}$ .

Recall that $Z_{u}=Y\langle X,\,\theta_{u}\rangle$ , so we have that

where step (i) follows from the bound (74a) and the observation that $|\langle X,\,\theta_{u}\rangle|\geq\tau$ conditioned on the event $\mathcal{E}_{5}$ .

Substituting our bounds on the two terms into the decomposition (97) yields

Once again, sufficiently large choices of the constant $c_{\tau}$ and the signal-to-noise constant $\eta$ in equation (87a) yields the claim.

C.1.6 Proof of Lemma 7

In this section, we prove the probability bounds on events $\mathcal{E}_{1}$ through $\mathcal{E}_{6}$ stated in Lemma 7. In doing so, we make use of the following auxiliary result, due to Yi et al. (see Lemma 1 in their paper):

Note that the event $\mathcal{E}_{1}^{c}$ holds if and only if $\langle X,\,\theta^{*}\rangle\langle X,\,\theta_{u}\rangle<0$ , or equivalently, if and only if

For $X\sim\mathcal{N}(0,\sigma^{2})$ , we have

Similarly, this inequality follows from the tail bound (101).

This claim follows from parts (v) and (vi) of Lemma 7, combined with the union bound.

This bound follows from parts (iii) and (iv) of Lemma 7, combined with the union bound.

C.1.7 Proof of Lemma 8

For appropriate choices of $c_{\tau}$ and the constant $\eta$ in the signal-to-noise condition (87a), the claim follows.

As before, note that the event $\mathcal{E}_{1}^{c}$ holds if and only if the inequality $|\langle X,\,\theta^{*}+\theta_{u}\rangle|<|\langle X,\,\theta^{*}-\theta_{u}\rangle|$ holds. Consequently, Lemma 9 implies that $|\!|\!|\Gamma(\mathcal{E}_{1}^{c})|\!|\!|_{{\text{\footnotesize{\mbox{op}}}}}\leq 2$ .

We make note of an elementary fact about Gaussians: for any scalar $\alpha>0$ and unit norm vector $\|v\|_{2}=1$ , for $X\sim\mathcal{N}(0,I_{d})$ , we have

In particular, when $\alpha\leq 1$ , then the operator norm is at most $1$ . This claim follows easily from the rotation invariance of the Gaussian, which allows us to assume that $v=e_{1}$ without loss of generality. It is thus equivalent to bound the largest eigenvalue of the matrix

which is a diagonal matrix by independence of the entries of $X$ . Noting that $D_{11}\leq\alpha^{2}$ and $D_{jj}=1$ for $j\neq 1$ completes the proof of the bound (102).

Applying the bound (102), we find that $|\!|\!|\Gamma(\mathcal{E}_{5}^{c})|\!|\!|_{{\text{\footnotesize{\mbox{op}}}}}\leq\max\Big{(}1,\frac{\tau^{2}}{\|\theta_{u}\|_{2}^{2}}\Big{)}$ . Consequently, the claim follows by making sufficiently large choices of $c_{\tau}$ and the constant $\eta$ in the signal-to-noise condition (87a).

C.2 Proof of Corollary 4

We need to compute an upper bound on the function $\varepsilon_{M}(n,\delta)$ previously defined in equation (33). For this particular model, we have

We bound each of the terms $T_{1}$ and $T_{2}$ in turn.

Recall the assumed lower bound on the sample size—namely $n>c\,d\log(1/\delta)$ for a sufficiently large constant $c$ . Under this condition, standard bounds in random matrix theory , guarantee that $|\!|\!|\widehat{\Sigma}-\Sigma|\!|\!|_{{\text{\footnotesize{\mbox{op}}}}}\leq\frac{1}{2}$ with probability at least $1-\delta$ . When this bound holds, we have $|\!|\!|\widehat{\Sigma}^{-1}|\!|\!|_{{\text{\footnotesize{\mbox{op}}}}}\geq 1/2$ .

where we have applied the Ledoux-Talagrand contraction for Rademacher processes , using the fact that $|\mu_{\theta}(x,y)|\leq 1$ for all pairs $(x,y)$ . Now conditioned on $x_{i}$ , the random variable $y_{i}$ is zero-mean and sub-Gaussian with parameter at most $\sqrt{\|\theta^{*}\|_{2}^{2}+\sigma^{2}}$ . Consequently, taking expectations over the distribution $(y_{i}\mid x_{i})$ for each index $i$ , we find that

where the final inequality uses the definition of $\mathcal{E}$ . Using this bound on the moment-generating function, we find that

Since $n>d$ by assumption, standard results in random matrix theory imply that $|\!|\!|\widehat{\Sigma}^{-1}-\Sigma^{-1}|\!|\!|_{{\text{\footnotesize{\mbox{op}}}}}\leq c\sqrt{\frac{d}{n}\,\log(1/\delta)}$ with probability at least $1-\delta$ . On the other hand, observe that

since the population operator $M$ is a contraction, and $\|\theta\|_{2}\leq 2\|\theta^{*}\|_{2}$ . Combining the pieces, we see that $T_{2}\leq c\|\theta^{*}\|_{2}\sqrt{\frac{d}{n}\;\log(1/\delta)}$ with probability at least $1-\delta$ .

Finally, substituting our bounds on $T_{1}$ and $T_{2}$ into the decomposition (103) yields the claim.

C.3 Proof of Corollary 5

First considering $T_{1}$ , recall that $y_{1}=z_{1}\langle x_{1},\,\theta^{*}\rangle+v_{1}$ , where $v\sim\mathcal{N}(0,\sigma^{2})$ and $z_{1}$ is a random sign, independent of $(x_{1},v_{1})$ . Consequently, we have

Putting together the pieces, we conclude that $T_{1}\leq 8\|\theta^{*}\|_{2}^{2}d+2\sigma^{2}d$ .

Appendix D Proofs for missing covariates

In this appendix, we provide proofs of results related to regression with missing covariates, as presented in Section 4.3. More specifically, we first prove Corollary 6 on the population level behavior, followed by the proof of Corollaries 7 and 8 on the behavior of sample-splitting EM updates and stochastic gradient EM updates, respectively.

We need to verify the conditions of Theorem 3, namely that the function $q$ is $\mu$ -smooth, $\lambda$ -strongly concave, and that the GS condition is satisfied. In this case, $q$ is a quadratic of the form

showing that the expectation does not depend on the pattern of missingness. Consequently, the quadratic function $q$ has an identity Hessian, showing that smoothness and strong concavity hold with $\mu=\lambda=1$ .

where we have used our upper bound (107) on $\pi_{1}$ . We need to ensure that $\gamma<1$ . By assumption, we have $\|\theta^{*}\|_{2}\leq\xi_{1}\sigma$ and $\|\theta-\theta^{*}\|_{2}\leq\xi_{2}\sigma$ , and hence $\|\theta\|_{2}\leq(\xi_{1}+\xi_{2})\sigma$ . Thus, the coefficient $\gamma^{2}$ is upper bounded as

Under the stated conditions of the corollary, we have $\gamma<1$ , thereby completing the proof.

D.2 Proof of Corollary 7

It is clear that $\langle u,\,(\bar{\Sigma}-\widehat{\Sigma})u\rangle$ has zero mean. It remains to prove that $\langle u,\,(\bar{\Sigma}-\widehat{\Sigma})u\rangle$ is sub-exponential. Note that $\widehat{\Sigma}$ is a rescaled sum of rank one matrices, each of the form

Introducing the shorthand $\omega=(1-z)\odot\theta$ , we have

Moreover, since $y=\langle x,\,\theta^{*}\rangle+v$ , we have

It suffices to show that each of the variables $\{B_{j}\}_{j=1}^{4}$ is sub-Gaussian with a constant parameter. As discussed previously, the variable $B_{1}$ is sub-Gaussian with parameter at most one. On the other hand, note that $x$ and $\omega$ are independent. Moreover, with $\omega$ fixed, the variable $\langle x,\,\omega\rangle$ is sub-Gaussian with parameter $\|\omega\|_{2}^{2}$ , whence

where the final inequality uses the fact that $\langle\omega,\,u\rangle^{2}\leq\|\omega\|_{2}^{2}$ . We have thus shown that $B_{2}$ is sub-Gaussian with parameter one. Since $\|\theta-\theta^{*}\|_{2}\leq\xi_{2}\sigma$ , the same argument shows that $B_{3}$ is sub-Gaussian with parameter at most $\xi_{2}$ . Since $v$ is sub-Gaussian with parameter $\sigma$ and independent of $\omega$ , the same argument shows that $B_{4}$ is sub-Gaussian with parameter at most one, thereby completing the proof of the lemma. ∎

We now turn to the second term. Note the variational representation

By a discretization argument–say with a $1/2$ cover $\{u^{1},\ldots,u^{M}\}$ of the sphere with $M\leq 2^{d}$ elements—we obtain

Each term in this maximum is the product of two zero-mean variables, namely $y$ and $\langle\mu_{\theta},\,u\rangle$ . On one hand, the variable $y$ is sub-Gaussian with parameter at most $\sqrt{\|\theta^{*}\|_{2}^{2}+\sigma^{2}}\leq c\sigma$ ; on the other hand, Lemma 10 guarantees that $\langle\mu_{\theta},\,u\rangle$ is sub-Gaussian with constant parameter. The product of any two sub-Gaussian variables is sub-exponential, and thus, by standard sub-exponential tail bounds , we have

Since $M\leq 2^{d}$ and $n>c_{1}d$ , we conclude that $T_{2}\leq c\sqrt{1+\sigma^{2}}\,\sqrt{\frac{d}{n}\;\log(1/\delta)}$ with probability at least $1-\delta$ .

Combining our bounds on $T_{1}$ and $T_{2}$ , we conclude that $\varepsilon_{G}(n,\delta)\leq c\sqrt{1+\sigma^{2}}\;\sqrt{\frac{d}{n}\;\log(1/\delta)}$ with probability at least $1-\delta$ . Thus, we see that Corollary 7 follows from Theorem 2.

D.3 Proof of Corollary 8