Statistical guarantees for the EM algorithm: From population to sample-based analysis
Sivaraman Balakrishnan, Martin J. Wainwright, Bin Yu
Introduction
Data problems with missing values, corruptions, and latent variables are common in practice. From a computational standpoint, computing the maximum likelihood estimate (MLE) in such incomplete data problems can be quite complex. To a certain extent, these concerns have been assuaged by the development of the expectation-maximization (EM) algorithm, along with growth in computational resources. The EM algorithm is widely applied to incomplete data problems, and there is now a very rich literature on its behavior (e.g., ). However, a major issue is that in most models, although the MLE is known to have good statistical properties, the EM algorithm is only guaranteed to return a local optimum. The goal of this paper is to address this potential gap between statistical and computational guarantees in application of the EM algorithm.
The EM algorithm has a lengthy and rich history. Various algorithms of the EM-type were analyzed in early work (e.g.,), before Dempster et al. introduced the EM algorithm in its modern general form. Among other results, they established its well-known monotonicity properties. The subsequent work of Wu established some of the most general convergence results known for the EM algorithm; see also the more recent papers . Together with other results, Wu showed that if the likelihood is unimodal and certain regularity conditions hold, then the EM algorithm converges to the unique global optimum. However, in most interesting cases of the EM algorithm, the likelihood function is multi-modal, in which case the behavior of the EM algorithm remains a little more mysterious. Indeed, despite its popularity and widespread practical effectiveness, the EM algorithm is often considered a “sensible heuristic” with little or no theoretical backing.
One interesting observation with the EM algorithm is given a “suitable” initialization, it often converges to a statistically useful estimate. For instance, in application to a mixture of regressions problem (see Section 2.2.2 for more details), Chaganty and Liang empirically demonstrate good performance for a two-stage estimator, in which the method of methods is used as an initialization, and then the EM algorithm is applied to refine this initial estimator. Although encouraging, this type of behavior is not well understood in a quantitative sense, especially how EM fixed points reached by this type of two-stage estimator are related to the global maximizers of the population likelihood. The goal of this paper is to address this question, and to develop some general tools for characterizing fixed points of the suitably initialized sample-based EM algorithm, and their relation to maximum likelihood estimates.
Some two-stage estimators have recently been analyzed in work on alternating minimization algorithms (see e.g. ) which show that at least in certain special cases optimization methods can be locally effective despite non-convexity. Most directly related to our work is the paper of Yi et al. which considers a special (degenerate) noiseless case of the EM algorithm for the mixtures of regressions problem. Results for the noisy mixtures of regressions problem follow from our general treatment of the EM algorithm (see Section 2.2.2).
In some settings, performing an exact M-step is computationally burdensome, in which case a natural alternative is some form of generalized EM updates. In such an algorithm, instead of performing an exact maximization, we simply choose a parameter value that does not decrease the likelihood. In addition to the standard EM updates, we also analyze a particular case of such an algorithm, known as gradient EM, based on a taking a single gradient step per iteration.
Our main results concern the population EM and gradient EM algorithms and their finite-sample counterparts. Our first set of results (Theorems 1 and 3) give conditions under which the population algorithms are contractive to the MLE, in a ball around the MLE. These results are completely deterministic. This population-level analysis is based on viewing these algorithms as perturbed versions of certain “oracle” algorithms which are known to be contractive around the MLE. Our second set of results (Theorem 2, Theorem 4 and Theorem 5) concern the sample-based EM and gradient EM algorithms which approximate the population-based algorithms using a subset of samples at each step. We give conditions under which these sample operators converge to an -ball around the population MLE. These results involve probabilistic bounds on the deviations between the iterates of the population and sample-based algorithms.
The remainder of this paper is organized as follows. Section 2 provides an introduction to the EM and gradient EM algorithms, as well as a description of the three examples treated in detail in this paper—namely, Gaussian mixture models (Section 2.2.1), mixture of regressions (Section 2.2.2), and regression with missing covariates (Section 2.2.3). Section 3 is devoted to our general convergence results on both the EM and gradient EM algorithms. In Section 4, we revisit the three model classes previously introduced, and illustrate the use of our general theory by deriving some concrete corollaries. In concrete examples our theory gives a characterization of the quality of initialization needed and the rate of convergence of the EM and gradient EM algorithms. We complement these theoretical results with simulations that confirm various aspects of the theoretical predictions. In order to promote readability, we defer the more technical aspects of proofs to the appendices.
Background and model examples
We begin with basic background on the EM algorithm and its variants, along with a number of specific models that we revisit later in the paper.
Let and be random variables taking values in the sample spaces and , respectively. Suppose that the pair has a joint density function that belongs to some parameterized family , for a non-empty compact convex set . Rather than observing the complete data , we observe only component . Thus, the component corresponds to the missing or latent structure in the data.
Our goal is to obtain an estimate of the unknown parameter via maximum likelihood—namely, to compute some maximizing the function , where
is the density function of the observed variable . Throughout this paper, we assume that is a maximizer of the population likelihood, but not that is a unique maximizer. Uniqueness is often violated in mixture models for which parameters are typically only identifiable up to permutation. In the examples that we consider, this non-identifiability will be resolved by appropriate initialization conditions.
In many settings, it can be difficult or computationally expensive to evaluate the log likelihood of the observed data, but relatively easy to compute the log likelihood of both the latent and observed variables. The EM algorithm is well-suited to such settings. For each , let denote the conditional density of given . A straightforward application of Jensen’s inequality then shows that the log likelihood at can be lower bounded as
with equality holding when . Thus, we have a family of lower bounds on the log likelihood, and the EM algorithm successively maximizes this lower bound (-step), and then reevaluates the lower bound at the new parameter value (-step).
With this notation, it is easy to specify the EM iterations. The update consists of the following two steps.
E-step: Evaluate the expectation in equation (2) to compute .
M-step: Compute the maximizer .
For future use, it is convenient to introduce the mapping given by
With this choice, the -step corresponds to the update .
In a generalized EM algorithm, the requirements of the -step are relaxed: instead of finding the exact optimum, the algorithm is required only to find a value such that
Depending on how is chosen, this requirement actually defines a family of algorithms.
A closely related variant of the generalized EM updates is what we refer to as the gradient EM updates, applicable in the case when the function is differentiable at each iteration . Given a step size , these updates take the form
where the gradient is taken in the first argument of . For ease of notation, we define the mapping by
An iteration of gradient EM can now be written compactly as .
There is a natural extension that includes a constraint arising from the parameter space , in which the update is projected back onto the constraint set To avoid pathologies additionally assume that the constraint set is closed.. For simplicity, we focus on unconstrained problems in this paper, but all of our results extend in a straightforward way to constrained examples by incorporating the additional Euclidean projection. For appropriate choices of the step size parameter , the gradient EM updates guarantee the ascent condition (4), so that it is a particular case of a generalized EM algorithm.
Let us now make an important distinction, namely, that between the population and sample-based versions of the EM updates. Up to this point, we have suppressed dependence on the number of observed samples . The population form of the (gradient) EM updates are an “oracle version”, in which we effectively observe an infinite number of samples, and consequently, the function takes the form
From here onwards, we use the notation and for the EM and gradient EM operators, respectively, both defined at the population level.
In the classical statistical settings, we observe only i.i.d. samples of the component. Under the i.i.d. assumption, we define the function
so that the expectation over in equation (7) is replaced by the empirical expectation defined by the samples. The function defines an analog of the population EM operator (3), namely
In an analogous fashion, we define the sample-based analog of the gradient EM operator (6), namely
where is an appropriately chosen step size parameter.
2 Illustrative examples
The EM algorithm is popular and a variety of examples can be found in the literature. In this section, we review three specific models analyzed in this paper, and derive the form of the population and sample-based updates, both for the usual EM algorithm and the gradient EM algorithm.
An isotropic, balanced two-component Gaussian mixture model can be specified by a density of the form
Suppose that we are given i.i.d. samples drawn from the mixture density (11). The complete data corresponds to the original samples along with the component indicator variables . The sample-based function takes the form
where w_{\theta}(y):=e^{-\frac{\|\theta-y\|_{2}^{2}}{2\sigma^{2}}}\Big{[}e^{-\frac{\|\theta-y\|_{2}^{2}}{2\sigma^{2}}}+e^{-\frac{\|\theta+y\|_{2}^{2}}{2\sigma^{2}}}\Big{]}^{-1}.
where the empirical expectation has been replaced by expectation under the mixture distribution (11).
On the other hand, the sample-based and population gradient EM operators with step size are given by
We return to analyze the EM updates for the Gaussian mixture model in Section 4.1.
2.2 Mixture of regressions
On the other hand, the gradient EM operators are given by
where is a step size parameter.
We return to analyze the EM updates for the mixture of regressions model in Section 4.2.
2.3 Linear regression with missing covariates
Our first two examples involved mixture models in which the class membership variable was hidden. Another canonical use of the EM algorithm is in cases with corrupted or missing data. In this section, we consider a particular instantiation of such a problem, namely that of linear regression with the covariates missing completely at random.
where is the probability of missingness.
In writing all these expressions, we have assumed that the coordinates are permuted so that the missing values are in the first block.
We now have the necessary notation in place to describe the EM and gradient EM updates. For a given parameter , the EM update is based on maximizing
Again, this optimization problem has an explicit solution, so that the sample-based EM operator is given by
On the other hand, the gradient EM algorithm with step size takes the form
We return to analyze the gradient EM updates for this model in Section 4.3.
General convergence results
We now turn to analysis of the EM algorithm and gradient EM algorithms. In both cases, we let denote a maximizer of the population likelihood. In this section, we give general sufficient conditions under which the population algorithms converge to and under which the sample-based algorithms converge to an -ball around . Our analysis of each algorithm is organized as follows:
Our second result in Sections 3.1 and 3.2 concern the sample-based EM and gradient EM operators. These sample-based operators approximate the population-based update using a subset of samples at each step. Theorem 2 and Theorem 4 for sample-based EM and gradient EM, respectively give conditions under which the sample-based operator is guaranteed to converge to an -ball around the fixed point . These results involve probabilistic bounds on the deviations between the population-based and sample-based operators. In addition, for gradient EM, we also analyze a stochastic update that uses a single sample per update in the flavor of stochastic approximation algorithms (see Theorem 5 in Section 3.2.3).
Let us begin with analysis of the standard EM updates, starting with the population version before turning to a sample-based version.
Recall that we always assume that the vector maximizes the population likelihood. It is a classical fact that it must then satisfy the condition
a property known as self-consistency. For this reason, the function plays an important role in our analysis.
We assume throughout this section that the function is -strongly concave, meaning that
for all pairs in a neighborhood of . As we will illustrate, this condition holds in most concrete instantiations of EM, including the three model classes introduced in the previous section.
For any fixed , in order to relate the population EM updates to the fixed point , we require control on the two gradient mappings and . These mappings are central in characterizing the fixed point and the update respectively. Indeed, by virtue of the self-consistency property (24) and the convexity of , the fixed point satisfies the first-order optimality condition
Similarly, for any , since maximizes the function , we have
Equations (26) and (27) are sets of inequalities that characterize the points and . Thus, at an intuitive level, in order to establish that and are close, it suffices to verify that these two characterizations are close in a suitable sense. We also note that inequalities similar to the condition (27) are often used as a starting point in the classical analysis of M-estimators (e.g., see van de Geer ). In the analysis of EM, we obtain additional leverage from the self-consistency condition (24) that characterizes .
With this intuition in mind, we introduce the following regularity condition in order to relate conditions (27) and (24): The condition involves a Euclidean ball of radius around the fixed point , given by
As a concrete example, recall the Gaussian mixture model first introduced in Section 2.2.1. For this model, the condition (29) is equivalent to
where was previously defined following equation (12). Given that the function is smooth in , provided that is not too small, it is reasonable to expect that this condition will hold in a neighborhood of , and we confirm this intuition in Corollary 1 to follow.
Under the conditions we have introduced, the following result guarantees that the population EM operator is locally contractive:
Since both and are in , we may apply condition (26) with and condition (27) with . Doing so, adding the resulting inequalities and then performing some algebra yields the condition
Now the -strong concavity condition (25) implies that the left-hand side is lower bounded as
Combining inequalities (32a) and (32b) with the original bound (31) yields
and canceling terms completes the proof. ∎
1.2 Guarantees for sample-based EM
We now turn to theoretical results on sample-based versions of the EM algorithm. More specifically, we consider two forms of the EM algorithm, the first being the standard form in which the operator , as previously defined (9), is applied repeatedly, thereby generating the sequence . We also analyze a sample-splitting From a practical point, a potential advantage of sample splitting is that each iteration may be cheaper, since it is based on a smaller sample size. In contrast, a disadvantage is that it can be difficult to correctly specify the number of iterations in advance. version of the EM algorithm, in which given a total of samples and iterations, we divide the full data set into subsets of size , and then perform the updates , using a fresh subset of samples at each iteration.
with probability at least . With these definitions, we have the following guarantees:
If the sample size is large enough to ensure that
For a given iteration number , suppose the sample size is large enough to ensure that
Figure 1 provides an illustration of the behavior predicted by Theorem 2: both algorithms are expected to show geometric convergence to the target parameter , up to some tolerance. For the bound (35b) note that the first term is decreasing in , whereas the second term is independent of . Thus, for a fixed sample size , the bounds in Theorem 2 suggests a reasonable choice of the number of iterations. In particular, focusing on the standard EM algorithm, consider any positive integer As will be clarified in the sequel, such a choice of exists in various concrete models considered here. such that
This choice ensures that the first term in the bound (35b) is dominated by the second term, and hence that
with probability at least . For the sample-splitting update in (36b) the first term is decreasing in , whereas the second term is increasing in . In this case, a similar conclusion holds when is chosen to be the smallest positive integer such that
Let us now turn to the proof of the theorem.
We give a detailed proof of the claim (36b), from which it will be clear that the claim (35b) follows by a nearly identical argument. For any iteration , we have
with probability at least . Consequently, by a union bound over all indices, the bound (40) holds uniformly with probability at least . We perform the remainder of our analysis under this event.
Indeed, when this bound holds, we may iterate it to show that
where the final step follows by summing the geometric series.
It remains to prove the claim (41), and we do so via induction on the iteration number. Beginning with , we have
In the induction from , suppose that , and the bound (41) holds at iteration . The same argument then implies that the bound (41) also holds for iteration , and that , thus completing the proof. ∎
2 Analysis of gradient EM algorithm
We now turn to analysis of the gradient EM algorithm. As before, we separate our analysis into two parts, the first (Theorem 3) addressing the behavior of the population-level operator, and the second (Theorems 4 and 5) providing guarantees for sample-based updates.
Recall that the gradient EM algorithm generates a sequence of iterates via the recursion , where
Here is a step size parameter to be chosen. For analyzing gradient EM, we also require an additional condition on the function , previously defined in Section 3.1. In addition to the -strong concavity assumption (25), we also assume that is -smooth, meaning that
Under the stated strong concavity and smoothness assumptions, it is a standard result from optimization theory that the gradient operator with step size choice is contractive, in particular with
Intuitively, then, if the function is “close enough” to the function , then the gradient EM operator might be expected to satisfy a similar contractivity condition. The closeness requirement is formalized in the following condition:
See Figure 2 for an illustration of this condition. We give concrete examples of this condition and its verification in Section 4. As with the FOS condition observe that the GS condition is always satisfied at the fixed point , i.e. for with . Allowing for strictly positive , if the functions are sufficiently regular we expect the condition to hold in a region around . Observe that this condition involves the gradient of the functions and at , as opposed to in the case of the FOS condition. For this reason, it can be easier to verify for specific models.
Under this condition, the following result guarantees local contractivity of the gradient EM operator (42):
By definition of the gradient EM update (42), we have
where step (i) follows from the triangle inequality, and step (ii) uses the contractivity of from equation (45), and condition GS. Substituting and performing some algebra yields the claim. ∎
2.2 Guarantees for sample-based gradient EM
In this section, in parallel with our earlier analysis of sample-based version of the EM algorithm, we analyze two sample-based variants of the gradient EM algorithm, the first when the update operator is computed using all samples and applied repeatedly, and the second based on sample-splitting.
If the sample size is large enough to ensure that
If the sample size is large enough to ensure that
2.3 Stochastic version of gradient EM
In this section, we analyze a sample-based variant of gradient EM that is inspired by stochastic approximation. It can be viewed as an extreme form of sample-splitting, in which we use only a single sample per iteration, but compensate for the noisiness using a decaying step size. Throughout this section we assume that (a lower bound on) the radius of convergence of the population operator is known to the algorithmThis assumption can be restrictive in practice. We believe the requirement can be eliminated by a more judicious choice of the step-size parameter in the first few iterations..
In particular, given a sequence of positive step sizes , we analyze the recursion
In order to prove this theorem we first establish a recursion on the expected mean-squared error. As with Theorem 3 this result is established by relating the population gradient EM operator to the gradient ascent operator on the function . This key recursion along with some algebra will yield the theorem.
Given the stochastic EM gradient iterates with step sizes , the error at iteration satisfies the recursion
Using this result, we can now complete the proof of the bound (54). With the step size choice where , unwrapping the recursion (55) yields
In order to bound these terms we use the following fact: For any , we have
See Noorshams and Wainwright for a proof. Using this fact in Equation (56) yields
Finally, applying the integral upper bound yields the claim (54). ∎
For the convenience of the reader, let us now summarize the theorems given in this section, including the assumptions on which they rely and the results that they provide.
Consequences for specific models
In the previous section, we provided a number of general theorems on the behavior of the EM algorithm as well as the gradient EM algorithm, at both the population and sample levels. In this section, we develop some concrete consequences of this general theory for the three specific model classes previously introduced in Section 2.2.
We begin by analyzing the EM updates for the Gaussian mixture model previously introduced in Section 2.2.1. Our first result (Corollary 1) establishes contractivity for the population operator (13b), whereas our second result (Corollary 2) provides bounds for the sample-based EM updates.
Recall that our mixture model consists of two equally weighted components, with distributions and respectively. The difficulty of estimating this mixture model can be characterized by the signal-to-noise ratio , and our analysis requires a lower bound of the form
for a sufficiently large constant . Past work by Redner and Walker provides evidence for the necessity of this assumption: for Gaussian mixtures with low signal-to-noise ratio, they show that the ML solution has large variance and furthermore verify empirically that the convergence of the EM algorithm can be quite slow. Other researchers also provide theoretical justification for the slow convergence of EM on poorly separated Gaussian mixtures.
With the signal-to-noise ratio lower bound defined above we have the following guarantee:
This corollary guarantees that when the SNR is sufficiently large, then the MLE has a basin of attraction that is at least a constant fraction of the signal strength. Moreover, the convergence rate of the population updates is geometric, with the contraction factor decreasing exponentially in the signal-to-noise ratio. The proof of Corollary 1 involves establishing that for a sufficiently large SNR, the strong concavity and FOS () conditions hold for a Gaussian mixture model, so that Theorem 1 can be applied. Although the proof structure is conceptually straightforward, the details are quite technical, so that we defer it to Appendix B.1.
Based on the population-level contractivity guaranteed by Corollary 1, we can also establish guarantees for the standard EM sequence , where the sample-based operator was previously defined in equation (13a). This guarantee involves the function , as well as positive universal constants .
See Appendix B.2 for the proof of this result. In Appendix B.3, we also give guarantees for EM with sample-splitting which achieves better dependence on and with an easier proof at the cost of additional logarithmic factors in sample complexity.
A related result of Dasgupta and Schulman shows that when the SNR is sufficiently high a modified EM algorithm, with an intermediate pruning step, reaches a near-optimal solution in two iterations. On one hand, the SNR condition in our corollary is significantly weaker, requiring only that it is larger than a fixed constant independent of dimension (as opposed to scaling with ), but their theory is developed for more general -mixtures.
The bound (59) provides a rough guide of how many iterations are required: consider the smallest positive integer such that
Corollary 2 makes a number of qualitative predictions that can be tested. To begin, it predicts that the statistical error should decrease geometrically, and then level off at a plateau. Figure 3 shows the results of simulations designed to test this prediction: for dimension and sample size , we performed trials with the standard EM updates applied to Gaussian mixture models with SNR . In panel (a), the red curves plot the log statistical error versus the iteration number, whereas the blue curves show the log optimization error versus iteration. As can be seen by the red curves, the statistical error decreases geometrically before leveling off at a plateau. On the other hand, the optimization error decreases geometrically to numerical tolerance. Panel (b) shows that the gradient EM updates have a qualitatively similar behavior for this model, although the overall convergence rate appears to be slower.
In conjunction with Corollary 1, Corollary 2 also predicts that the convergence rate should increase as the signal-to-noise ratio is increased. Figure 3 shows the results of simulations designed to test this prediction: again, for mixture models with dimension and sample size , we applied the standard EM updates to Gaussian mixture models with varying SNR . For each choice of SNR, we performed trials, and plotted the log optimization error versus the iteration number. As expected, the convergence rate is geometric (linear on this logarithmic scale), and the rate of convergence increases as the SNR growsTo be clear, Corollary 2 predicts geometric convergence of the statistical error , whereas these plots show the optimization error . However, the analysis underlying Corollary 2 can also be used to show geometric convergence of the optimization error..
2 Mixtures of regressions
In this section, we analyze the EM and gradient EM algorithms for the mixture of regressions (MOR) model, previously introduced in Section 2.2.2. As in our analysis of the Gaussian mixture model, our theory applies when the signal-to-noise ratio is sufficiently large, as enforced by a condition of the form
Under a suitable lower bound on this quantity, our first result guarantees that the population level operators (17b) and (18a) are locally contractive.
As shown in the proof, the contraction coefficient is again a decreasing function of the SNR parameter . However, its functional form is not as explicit as in the Gaussian mixture case. The proof of Corollary 3 involves verifying that the function for the MOR model satisfies the required concavity, smoothness, GS() and FOS() conditions. It is quite technically involved, so that we defer it to Appendix C.1.
Let us now provide guarantees for a sample-splitting version of the EM updates. Recall that sample-based EM operator was previously defined in equation (17a). For a given sample size and iteration number , suppose that we splitTo simplify exposition, assume that is an integer. our full data set into subsets, each of size . We then generate the sequence , where we use a fresh subset at each iteration. In the following result, we use , along with positive universal constants (,).
We prove this corollary in Appendix C.2. Note the bound (63) again provides guidance on the number of iterations to perform. For a given sample size , suppose we perform iterations for a constant . The bound (63) then implies that
with probability at least . Apart from the logarithmic penalty \log^{2}\big{(}\frac{n}{d\varphi^{2}(\sigma;\|\theta^{*}\|_{2})}\big{)}, this guarantee matches the minimax rate for estimation of a -dimensional regression vector. We note that the logarithmic penalty can be removed by instead analyzing the standard form of the EM updates, as we did for the Gaussian mixture model.
We conclude our discussion of the MOR model by stating a result for the stochastic form of gradient EM analyzed in Theorem 5. In particular, given a data set of size , we run the algorithm for iterations, with a step size for iterations . Once again our result is terms of and positive universal constants (,).
We prove this corollary in Appendix C.3. Figure 4 illustrates this corollary showing the error as a function of iteration number (sample size) for the stochastic gradient EM algorithm.
3 Linear regression with missing covariates
This section is devoted to analysis of the gradient EM algorithm for the problem of linear regression with missing covariates, as previously introduced in Section 2.2.3. Here the central parameter is the probability that any given coordinate of the covariate vector is missing, and our analysis links this quantity to the signal-to-noise ratio and the radius of contractivity. Define and to be such that the following bounds hold,
For any given choice of define Our guarantees apply whenever the missing probability is bounded as
See Appendix D.1 for the proof of Corollary 6. Relative to our previous results, this corollary is somewhat unusual, in that we require an upper bound on the ratio . Although this requirement might seem counter-intuitive at first sight, known minimax lower bounds on regression with missing covariates show that it is unavoidable— that is, it is not an artifact of our analysis nor of the gradient EM algorithm. Roughly these lower bounds formalize the intuition that as the norm increases, the amount of missing information increases in proportion to the amount of observed information. Figure 6 provides the results of simulations that confirm this behavior, in particular showing that for regression with missing data, the radius of convergence eventually decreases as grows.
Let us now provide guarantees for a sample-splitting version of the EM updates, based on the sample-based EM operator in equation (22a). As usual, for a given sample size and iteration number , suppose that we split our full data set into subsets, each of size . We then generate the sequence , where we use a fresh subset at each iteration.
We prove this corollary in Appendix D.2. We note that the constant is a monotonic function of the parameters , but does not otherwise depend on , , or other problem-dependent parameters.
As with Corollary 4, this result provides guidance on the appropriate number of iterations to perform: in particular, if we set for a sufficiently large constant , then the bound (69) implies that
We conclude our discussion of the missing covariates model by stating a result for the stochastic form of gradient EM analyzed in Theorem 5. In particular, given a data set of size , we run the algorithm for iterations, with a step size for iterations .
We prove this corollary in Appendix D.3. Figure 7 illustrates this.
Discussion
In this paper, we have provided some general techniques for studying the EM and gradient EM algorithms, at both the population and finite-sample levels. Although this paper focuses on these specific algorithms, we expect that the techniques could be useful in understanding the convergence behavior of other algorithms for potentially non-convex problems.
The analysis of this paper can be extended in various directions. For instance, in the three concrete models that we treated, we assumed that the model was correctly specified, and that the samples were drawn in an i.i.d. manner, both conditions that may be violated in statistical practice. Maximum likelihood estimation is known to have various robustness properties under model mis-specification. Developing an understanding of the EM algorithm in this setting is an important open problem.
Finally, we note that in concrete examples our analysis guarantees good behavior of the EM and gradient EM algorithms when they are given suitable initialization. For the three model classes treated in this paper, simple pilot estimators can be used to obtain such initializations—in particular using PCA for Gaussian mixtures and mixtures of regressions (e.g., ), and the plug-in principle for regression with missing data (e.g., ). These estimators can be seen as particular instantiations of the method of moments . Although still an active area of research, a line of recent work (e.g., ) has demonstrated the utility of moment-based estimators or initializations for other types of latent variable models, and it would be interesting to analyze the behavior of EM for such models.
This research was partially supported by ONR-MURI grant N00014-11-1-0688 and NSF grant CIF-31712-23800 to MJW, and by US NSF grants DMS-1107000, CDS&E-MSS 1228246, ARO grant W911NF-11-1-0114, the Center for Science of Information (CSoI), and US NSF Science and Technology Center, under grant agreement CCF-0939370. SB would also like to thank John Duchi for helpful discussions.
Appendix A Proofs for stochastic gradient EM
In this section we provide proofs of results related to Theorem 5 from Section 3.2.3. It only remains to prove Lemma 1.
In order to establish Lemma 1 we require an analogue of Theorem 3 that allows for a wider range of step sizes. Recall the classical gradient ascent operator on the function . For step size , it takes the form . Under the stated -concavity and -smoothness conditions, for any step size , the classical gradient operator is contractive with parameter
This follows from the classical analysis of gradient descent (e.g., ). Using this fact, we can prove the following about the population gradient EM operator:
For any step size , the population gradient EM operator is contractive with parameter , where
We omit the proof, since it follows from a similar argument to that of Theorem 3. With this preliminary in place we can now begin the proof of Lemma 1.
Introducing the shorthand , we have , and hence
Letting denote the -field of events up to the random variable , note that
Consequently, by iterated expectations, we have
Combining with our earlier inequality (72) yields
Defining , we see that
where step (i) uses the contractivity of established in Lemma 2 and step (ii) uses the definition of from equation (71). Putting together the pieces yields the claim (55).
Appendix B Proofs for Gaussian mixture models
In this section, we provide proofs of results related to the Gaussian mixture model, as presented in Section 4.1. More specifically, we first prove Corollary 1 on the population level behavior, followed by the proof of Corollary 2 on the behavior of the standard sample-based EM updates.
Here the weighting function takes the form
It remains to verify the FOS() condition (29). The following auxiliary lemma is central to the proof:
Under the conditions of Corollary 1, there is a constant with such that
where .
This follows immediately from Lemma 3. Thus, the FOS condition holds when The bound on the contraction parameter follows from the fact that and applying Theorem 1 yields Corollary 1.
We now prove Lemma 3. Our proof makes use of the following elementary facts:
For the function , we have
For the function , we have
With these preliminaries in place, we can now begin the proof. For each , define , where . Taylor’s theorem applied to the function , followed by expectations, yields
By construction, the matrix is diagonal, so that it suffices to bound the diagonal terms. Beginning with the first diagonal entry, we have
Defining the event , we condition on it and its complement to obtain
Conditioned on and , respectively, we then apply the bounds (74a) and (74b) to obtain
provided . Noting that
Note that the mean of is lower bounded as
where step (i) follows from the lower bound (77). Consequently, by standard Gaussian tail bounds, we have
On the other hand, for any index , we have
where the reader should recall the function from equation (75a). Once again, conditioning on the event and its complement yields
Returning to equation (76), we have shown that
whenever . On this basis, the bound (73) holds as long as the signal-to-noise ratio is sufficiently large,
B.2 Proof of Corollary 2
where the final step uses the fact that for any pair . Putting together the pieces, we conclude that
using a standard symmetrization result for empirical processes (e.g., ). Now observe that for any triplet of -vectors , and , we have the Lipschitz property
Consequently, by the Ledoux-Talagrand contraction for Rademacher processes , we have
Putting together the pieces, we conclude that
for all sufficiently small. Combined with our first discretization (79), we have thus shown that
Combined with the Chernoff approach, this bound on the MGF implies that, as long as for a sufficiently large constant , we have
B.3 Guarantees for EM with sample-splitting
In this section, we state and prove a result for the EM algorithm with sample-splitting for the mixture of Gaussians.
Consider a Gaussian mixture model satisfying the condition (57), and any initialization such that . Given a sample size , then with probability at least , the sample-splitting EM iterates satisfy the bound
It is worth comparing the result here to the result established earlier in Corollary 2. The sample-splitting EM algorithm is more sensitive to the number of iterations which determines the batch size and needs to be chosen in advance. Supposing that the number of iterations were chosen optimally however the result has better dependence on and at the cost of a logarithmic factor in .
The proof follows by establishing a bound on the function . Define . Recalling the updates in (13a) and (13b), note that
We bound each of these terms in turn, in particular showing that
Observe that since we have
Since are i.i.d Bernoulli variables, Hoeffding’s inequality implies that
with probability at least . On the other hand, the vector is zero-mean and sub-Gaussian with parameter , whence the squared norm is sub-exponential. Using standard bounds for sub-exponential variates and the condition , we obtain
with probability at least . Combining the pieces yields the claimed bound (82) on .
The random variable lies in the interval $$, so that Hoeffding’s inequality implies that
with probability at least . Putting together the pieces yields the claimed bound (82) on , thereby completing the proof of the corollary. ∎
Appendix C Proofs for mixtures of regressions
In this appendix, we provide proofs of results related to the mixture of regressions model, as presented in Section 4.2. More specifically, we first prove Corollary 3 on the population level behavior, followed by the proof of Corollaries 4 and 5 on the behavior of sample-splitting EM updates and stochastic gradient EM updates, respectively.
We begin by proving part (a) of the corollary on the population EM update, which is based on maximizing the function
It remains to verify condition FOS. Define the difference function , and the difference vectors . Using this notation, for this model, we need to show that
Note that we can write , where is a Bernoulli variable. Using this notation, it is equivalent to show
for in order to establish contractivity. In order to prove the theorem with the desired upper bound on we need to show (83) with The following lemma provides control on the two terms:
Under the conditions of Corollary 3, there is a constant such that for any fixed vector we have
for This is an immediate consequence of Lemma 4.
It remains to prove Lemma 4. Since the standard deviation is known, a simple rescaling argument allows us to take , and replace the weight function in (16a) with
Our proof makes use of the following elementary result on Gaussian random vectors:
A similar argument yields the second claim. ∎
Noting that Lemma 4 consists of two separate inequalities (84a) and (84b), we treat these cases separately.
We split the proof of this bound into two separate cases: namely, and .
Thus, using a Taylor series with integral form remainder on the function yields
where . Substituting for in inequality (84a), we see that it suffices to show
for some . The following auxiliary result is central to establishing this claim:
There is a such that for each , we have
See Section C.1.5 for the proof of this lemma.
Using Lemma 6, let us bound the quantity from equation (89). Since , we have , where
In order to show that , it suffices to show that .
By the Cauchy-Schwarz inequality, we have
Similarly, another application of the Cauchy-Schwarz inequality yields
where the second step follows from the bound (90b), and the fact that . In this case, we have
where step (i) uses the bound (86b) from Lemma 5, and step (ii) that . Combining the pieces, we conclude that , which completes the proof of inequality (84a) in the case .
We now turn to the second case of the bound (84a). Our argument (here and in later sections) makes use of various probability bounds on different events, which we state here for future reference. These events involve the scalar for a constant , as well as the vectors
For the event \mathcal{E}_{2}:=\big{\{}|\langle X,\,\theta^{*}\rangle|>\tau\big{\}}\cap\big{\{}|\langle X,\,\theta_{u}\rangle|>\tau\big{\}}\cap\big{\{}|v|\leq\frac{\tau}{2}\big{\}}, we have
See Section C.1.7 for the proof of this result.
With this set-up, our goal is to bound the quantity
We bound each of these five terms in turn.
Applying the Cauchy-Schwarz inequality and using the fact that yields
We now bound conditioned on the event . Since on the event , we have
Recalling the weight function (85), we claim that when conditions (93a) and (93b) hold, then
Combined with our earlier bound (92), we have shown
Applying Lemma 8 with yields .
Combining the Cauchy-Schwarz inequality with Lemma 7(i), we have
where step (i) makes use of the bound ; and step (ii) follows since , and .
where step (i) follows from the conditional covariance bound of Lemma 8.
Combining the Cauchy-Schwarz inequality with Lemma 7(iv) yields
Together with Lemma 8, we obtain the bound
where step (i) uses the fact that .
We have thus obtained bounds on all five terms in the decomposition (91). We combine these bounds with the with lower bound from equation (87b), and then perform some algebra to obtain
where is a universal constant. In particular, selecting for a sufficient large constant , selecting the constant in (87a) sufficiently large yields the claim (84a).
C.1.2 Proof of inequality (84b)
As in Section C.1.1, we treat the cases and separately.
As before, by a Taylor expansion of the function , it suffices to show that
For any fixed , the Cauchy-Schwarz inequality implies that
Conditioned on the event , the bound (94) implies that , and hence .
Thus, we have derived bounds on each of the three terms in the decomposition (96): putting them together yields
By choosing sufficiently large in the definition of , selecting the signal-to-noise constant in condition (87a) sufficiently large, the claim follows.
C.1.5 Proof of Lemma 6
The lemma statement consists of two inequalities, and we divide our proof accordingly.
and we bound each of these terms in turn. The reader should recall the constant , as well as the events and from Lemma 7.
where the final step follows from inequality (74a). Combined with Lemma 7(iv), we conclude that .
where step (i) follows from inequality (98), and step (ii) follows from Lemma 7(iii).
Conditioned on the event , we have , where we have used the lower bound (93b). Introducing the shorthand , this lower bound implies that
where inequality (i) is valid as long as , or equivalently .
Substituting our upper bounds on three components in the decomposition (97) yields
Setting sufficiently large in the definition of and choosing sufficiently large values of the signal-to-noise constant in the condition (87a) yields the claim.
Combining Lemma 8 with the bound , we find that . Since , we have
Putting together the pieces, we conclude that .
Recall that , so we have that
where step (i) follows from the bound (74a) and the observation that conditioned on the event .
Substituting our bounds on the two terms into the decomposition (97) yields
Once again, sufficiently large choices of the constant and the signal-to-noise constant in equation (87a) yields the claim.
C.1.6 Proof of Lemma 7
In this section, we prove the probability bounds on events through stated in Lemma 7. In doing so, we make use of the following auxiliary result, due to Yi et al. (see Lemma 1 in their paper):
Note that the event holds if and only if , or equivalently, if and only if
For , we have
Similarly, this inequality follows from the tail bound (101).
This claim follows from parts (v) and (vi) of Lemma 7, combined with the union bound.
This bound follows from parts (iii) and (iv) of Lemma 7, combined with the union bound.
C.1.7 Proof of Lemma 8
For appropriate choices of and the constant in the signal-to-noise condition (87a), the claim follows.
As before, note that the event holds if and only if the inequality holds. Consequently, Lemma 9 implies that .
We make note of an elementary fact about Gaussians: for any scalar and unit norm vector , for , we have
In particular, when , then the operator norm is at most . This claim follows easily from the rotation invariance of the Gaussian, which allows us to assume that without loss of generality. It is thus equivalent to bound the largest eigenvalue of the matrix
which is a diagonal matrix by independence of the entries of . Noting that and for completes the proof of the bound (102).
Applying the bound (102), we find that |\!|\!|\Gamma(\mathcal{E}_{5}^{c})|\!|\!|_{{\text{\footnotesize{\mbox{op}}}}}\leq\max\Big{(}1,\frac{\tau^{2}}{\|\theta_{u}\|_{2}^{2}}\Big{)}. Consequently, the claim follows by making sufficiently large choices of and the constant in the signal-to-noise condition (87a).
C.2 Proof of Corollary 4
We need to compute an upper bound on the function previously defined in equation (33). For this particular model, we have
We bound each of the terms and in turn.
Recall the assumed lower bound on the sample size—namely for a sufficiently large constant . Under this condition, standard bounds in random matrix theory , guarantee that with probability at least . When this bound holds, we have .
where we have applied the Ledoux-Talagrand contraction for Rademacher processes , using the fact that for all pairs . Now conditioned on , the random variable is zero-mean and sub-Gaussian with parameter at most . Consequently, taking expectations over the distribution for each index , we find that
where the final inequality uses the definition of . Using this bound on the moment-generating function, we find that
Since by assumption, standard results in random matrix theory imply that with probability at least . On the other hand, observe that
since the population operator is a contraction, and . Combining the pieces, we see that with probability at least .
Finally, substituting our bounds on and into the decomposition (103) yields the claim.
C.3 Proof of Corollary 5
First considering , recall that , where and is a random sign, independent of . Consequently, we have
Putting together the pieces, we conclude that .
Appendix D Proofs for missing covariates
In this appendix, we provide proofs of results related to regression with missing covariates, as presented in Section 4.3. More specifically, we first prove Corollary 6 on the population level behavior, followed by the proof of Corollaries 7 and 8 on the behavior of sample-splitting EM updates and stochastic gradient EM updates, respectively.
We need to verify the conditions of Theorem 3, namely that the function is -smooth, -strongly concave, and that the GS condition is satisfied. In this case, is a quadratic of the form
showing that the expectation does not depend on the pattern of missingness. Consequently, the quadratic function has an identity Hessian, showing that smoothness and strong concavity hold with .
where we have used our upper bound (107) on . We need to ensure that . By assumption, we have and , and hence . Thus, the coefficient is upper bounded as
Under the stated conditions of the corollary, we have , thereby completing the proof.
D.2 Proof of Corollary 7
It is clear that has zero mean. It remains to prove that is sub-exponential. Note that is a rescaled sum of rank one matrices, each of the form
Introducing the shorthand , we have
Moreover, since , we have
It suffices to show that each of the variables is sub-Gaussian with a constant parameter. As discussed previously, the variable is sub-Gaussian with parameter at most one. On the other hand, note that and are independent. Moreover, with fixed, the variable is sub-Gaussian with parameter , whence
where the final inequality uses the fact that . We have thus shown that is sub-Gaussian with parameter one. Since , the same argument shows that is sub-Gaussian with parameter at most . Since is sub-Gaussian with parameter and independent of , the same argument shows that is sub-Gaussian with parameter at most one, thereby completing the proof of the lemma. ∎
We now turn to the second term. Note the variational representation
By a discretization argument–say with a cover of the sphere with elements—we obtain
Each term in this maximum is the product of two zero-mean variables, namely and . On one hand, the variable is sub-Gaussian with parameter at most ; on the other hand, Lemma 10 guarantees that is sub-Gaussian with constant parameter. The product of any two sub-Gaussian variables is sub-exponential, and thus, by standard sub-exponential tail bounds , we have
Since and , we conclude that with probability at least .
Combining our bounds on and , we conclude that with probability at least . Thus, we see that Corollary 7 follows from Theorem 2.
D.3 Proof of Corollary 8
By the Cauchy-Schwarz inequality, we have
By the Cauchy-Schwarz inequality, we have
Note that is sub-Gaussian with parameter at most , whence