Probabilistic Interpretation of Linear Solvers

Philipp Hennig

Introduction

Solving the unconstrained linear problem of finding $x$ in

It will turn out that a family of quasi-Newton methods (§1.2), more widely used to solve nonlinear optimization problems, can help answer this question, because classic derivations of these methods can be re-formulated and extended into a probabilistic interpretation of these methods as maxima of Gaussian posterior probability distributions (§2). The covariance of these Gaussians offers a new object of interest and provides error estimates (§3). Because there are entire linear spaces of Gaussian distributions with the same posterior mean but differing posterior error estimates, selecting one error measure consistent with the algorithm is a new statistical estimation task (§4).

2 The Dennis family of secant methods

This ensures the secant relation $y_{i}=B_{i+1}s_{i}$ , sometimes called ‘the quasi-Newton Equation’ . Convergence of the sequence of iterates $x_{i}$ for various members of this class (and the classes of Broyden and Davidon) are well-understood . The rules named above can be found in the Dennis class as :

Because the update of Equation (2) is of rank 2, the corresponding estimate for the inverse $H=B^{-1}$ (assuming it exists) can be constructed using the matrix inversion lemma. Alternatively, all Dennis rules can also be used as inverse updates , i.e. estimates for $H$ itself, by exchanging $s\operatorname*{\leftrightarrowtriangle}y$ and $B\operatorname*{\leftrightarrowtriangle}H$ , $B_{0}\operatorname*{\leftrightarrowtriangle}H_{0}$ above (corresponding to the secant relation $s=Hy$ ). Interestingly, the DFP and BFGS updates are duals of each other under this exchange : The inverse of $B_{1}$ as constructed by the DFP rule (6) equals the $H_{1}$ arising from the inverse BFGS rule (7). This does not mean BFGS and DFP are the same, but only that they fill opposing roles in the inverse and direct formulation. To avoid confusion, in this text the DFP rule will always be used in the sense of a direct update (estimating $B$ , with $c=y$ ), and the BFGS rule in the inverse sense (i.e. estimating $H$ , with $c=s$ ). The first parts of this text will focus on direct updates and thus mostly talk about the DFP method instead of the BFGS rule. All results extend to the inverse models (and thus BFGS) under the exchange of variables mentioned above. Sections 3.2 and 4 will make some specific choices geared to inverse updates. They will then talk explicitly about BFGS, always in the sense of an inverse update.

This text gives a probabilistic interpretation of the Dennis family, for the linear problems of Eq. (1). We will interpret the secant methods as estimators of (inverse) Hessians of an objective function, and ask what kind of prior assumptions would give rise to these specific estimators. This results in a self-contained derivation of inference rules for symmetric matrices. Some of the rules quoted above can be motivated as ‘natural’ from the inference perspective.

Another major strand of nonlinear optimization methods extends from the conjugate gradient algorithm of Hestenes & Stiefel for linear problems, nonlinearly extended by Fletcher and Reeves and others. On linear problems, the CG and quasi-Newton ideas are closely linked: Nazareth showed that CG is equivalent to BFGS for linear problems (with exact line searches, when the initial estimate $B_{0}=\boldsymbol{I}$ ). More generally, Dixon showed for quasi-Newton methods in the Broyden class (which also contains the methods listed above) that, under exact line searches and the same starting point, all methods in Broyden’s class generate a sequence of points identical to CG, if the starting matrix $B_{0}$ is taken as a preconditioner of CG. In this sense, this text also provides a novel derivation for conjugate gradients, and will use several well-known properties of that method. Implications of the results presented herein to nonlinear variants of conjugate gradients will be left for future work.

3 Numerical methods perform inference — The value of a statistical interpretation

The defining aspect of quasi-Newton methods is that they approximate—estimate—the Hessian matrix of the objective function, or its inverse, based on evaluations—observations—of the objective’s gradient and certain prior structural restrictions on the estimate. They can therefore be interpreted as inferring the latent quantity $B$ or $H$ from the observed quantities $s,y$ . This creates a connection to statistics and probability theory, in particular the probabilistic framework of encoding prior assumptions in a probability measure over a hypothesis space, and describing observations using a likelihood function, which combines with the prior according to Bayes’ theorem into a posterior measure over the hypothesis space (§2).

On the one hand, this elucidates prior assumptions of quasi-Newton methods (§3). On the other, it suggests new functionality for the existing methods, in particular error estimates on $B$ and $H$ (§4). In future work, it may also allow for algorithms robust to ‘noisy’ linear maps, such as they arise in physical inverse problems.

The interpretation of numerical problems as estimation was pointed out by statisticians like Diaconis in 1988 , and O’Hagan in 1992 , well after the introduction of quasi-Newton methods. To the author’s knowledge, the idea has rarely attracted interest in numerical mathematics, and has not been studied in the context of quasi-Newton methods before recent work by Hennig & Kiefel . An argument sometimes raised against analysing numerical methods probabilistically is that numerical problems do not generally feature an aspect of randomness. But probability theory makes no formal distinction between epistemic uncertainty, arising from lack of knowledge, and aleatoric uncertainty, arising from ‘randomness’, whatever the latter may be taken to mean precisely. Randomness is not a prerequisite for the use of probabilities. Those who do feel uneasy about applying probability theory to unknown deterministic quantities, however, may prefer another, perhaps more subjective argument: From the point of view of a numerical algorithm’s designer, the ‘population’ of problems that practitioners will apply the algorithm to does in fact form a probability distribution from which tasks are ‘sampled’.

Numerical algorithms running on a finite computational budget make numerical errors. A notion of the imprecision of this answers is helpful, in particular when the method is used within a larger computational framework. Explicit error estimates can be propagated through the computational pipeline, helping identify points of instability, and to distribute or save computational resources. Needless to say, it makes no sense to ask for the exact error (if the exact difference between the true and estimated answer where known, the exact answer would be known, too). But it is meaningful to ask for the remaining volume of hypotheses consistent with the computations so far. This paper attempts to construct such an answer for linear problems.

4 Overview of main results

As pointed out above, although quasi-Newton methods are most popular for nonlinear optimization, here the focus will be on linear problems. Extending the probabilistic interpretation constructed here to the nonlinear setting of inferring the (inverse) Hessian of a function $f$ will be left for future work (see for pointers). The present aim is an iterative linear solver iterating through posterior beliefs $\{p_{t}(x,H)\}_{t=1,\dots,M}$ for $H=B^{-1}$ and the solution $x=Hb$ of the linear problem. These beliefs will be constructed as Gaussian densities $p_{t}(H)=\mathcal{N}(H;H_{t},V_{t})$ over the elementsFor notational convenience, the elements of $H$ will be treated as the elements of a vector of length $N^{2}$ , see more at the beginning of §2.2. of $H$ , with mean $H_{t}$ and covariance matrix $V_{t}$ .

The results in this paper significantly clarify and extend previous results by Hennig & Kiefel and Hennig . Here is a brief outlook of the main results:

As a probabilistic interpretation of results by Dennis & Moré and Dennis & Schnabel , Hennig provided a derivation of rank-2 secant methods in terms of two independent observations of two separate parts of the Hessian. This viewpoint affords a nonparametric extension to nonlinear optimization, but is not particularly elegant. This paper provides a cleaner derivation: the Dennis family can in fact be derived naturally from a prior over only symmetric matrices. This extends the results of Dennis & Schnabel , from statements about the maximum of a Frobenius norm in the space of symmetric matrices to the entire structure of that norm in that space.

The choice of prior parameters distinguishes between the members of the Dennis family. An analysis shows that DFP and BFGS are ‘more correct’ than other members of the family in the sense that they are consistent with exact probabilistic inference for the entire run of the algorithm, while general Dennis rules are only consistent after the first step (Lemmas 6 and 7). Further, SR1, Greenstadt, DFP and BFGS all use different prior measures that, although all ‘scale-free’, give imperfect notions of calibration for the prior measure. Finally, because BFGS is equivalent to CG ( and Lemma 8), its set of evaluated gradients is orthogonal. This allows a computationally convenient parameterization of posterior uncertainty. Overall, the picture arising is that, from the probabilistic perspective, the DFP and particularly BFGS methods have convenient numerical properties, but their posterior measure can be calibrated better.

It will transpire that the decision for a specific member of the Dennis family still leaves a space of possible choices of prior covariances consistent with this update rule. Constructing a meaningful posterior uncertainty estimate (covariance) on $H$ after finitely many steps requires a choice in this unidentified space, which, as in other estimation problems, needs to be motivated based on some notion of regularity in $H$ . Several possible choices are discussed in Section 3, all of which add very low overhead to the standard conjugate gradient algorithm.

Gaussian inference from matrix-vector multiplications

then, by Bayes’ theorem and a simple linear computation (see e.g. [38, §2.1.2]), the posterior, the unique distribution over $v$ consistent with both prior and likelihood, is

This derivation also works in the limit of perfect information, i.e. for a well-defined limit of $\Lambda\operatorname*{\rightarrowtriangle}0$ , in which caseIf $A$ is not of maximal rank, a precise formulation requires a projection of $y$ into the preimage of $A$ . This is merely a technical complication. It is circumvented here by assuming, later on, that line-search directions are linearly independent. This amounts to a maximal-rank $A$ . the likelihood converges to the Dirac distribution $p(y\,|\,A,a,v)\operatorname*{\rightarrowtriangle}\delta(y-A^{\intercal}v-a)$ . The crucial point is that constructing the posterior after linear observations involves only linear algebraic operations, with the posterior covariance (the ‘error bar’) using many of the computations also required to compute the mean (the ‘best guess’).

Gaussian inference is closely linked to least-squares estimation: Because the logarithm is concave, the maximum of the posterior (10) (which equals the mean) is also the minimizer of the quadratic norm (using $\|x\|_{K}^{2}\colonequals x^{\intercal}K^{-1}x.$ )

The added value of the probabilistic interpretation is embodied in the posterior covariance, which quantifies remaining degrees of freedom of the estimator, and can thus also be interpreted as a measure of uncertainty, or estimated error.

2 Inference on asymmetric matrices from matrix vector multiplications

An important observation is that Broyden’s method ceases to be a direct match to this update after the first line search, because matrix $S^{\intercal}WS$ is not a diagonal matrix. This matrix will come to play a central role; we will call it the Gram matrix, because it is an inner product of $S$ weighted by the positive definite $W$ .

It is well known that, because the posterior mean of Eq. (12) is not in general a symmetric matrix, it is a suboptimal learning rule for the Hessian of an objective function. Which is why this class was quickly abandoned in favour of the rank-2 updates in the Dennis family mentioned above. Dennis & Moré and Dennis & Schnabel showed that the minimizer of weighted Frobenius regularizers (the maximizer of the Gaussian posterior) within the linear subspace of symmetric matrices is given by the Dennis class of update rules. Hennig & Kiefel constructed a probabilistic interpretation based on this derivation, which involves doubling the input domain of the objective function and introducing two separate, independent observations. This has the advantage of allowing for relatively straightforward nonparametric extensions, and a broad class of noise models for cases in which gradients can not be evaluated without error . But artificially doubling the input dimensionality is dissatisfying.

We now introduce a cleaner derivation of the same updates, by explicitly restricting the hypothesis class to symmetric matrices. This gives the covariance matrix a more involved structure than the Kronecker product, and makes derivations more challenging. It results in a new interpretation for the Dennis class, fully consistent with the probabilistic framework. Since the identity of the posterior mean was known from and , the interesting novel aspect here is the structure of the posterior covariance. In essence, it provides insight into the structure of the loss function around the previously known estimates.

We begin by building a Gaussian prior over the symmetric matrices, using the symmetrization operator $\Gamma$ , the linear operator acting on vectorized matrices defined implicitly through its effect $\Gamma\overrightarrow{A}=\nicefrac{{1}}{{2}}\overrightarrow{(A+A^{\intercal})}$ (explicit definition in Appendix A.1).

Here, $W\mathchoice{{\ooalign{$ \displaystyle\otimes $\cr\hfil$ \displaystyle\ominus $\hfil\cr}}}{{\ooalign{$ \textstyle\otimes $\cr\hfil$ \textstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptstyle\otimes $\cr\hfil$ \scriptstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptscriptstyle\otimes $\cr\hfil$ \scriptscriptstyle\ominus $\hfil\cr}}}W=\Gamma(W\otimes W)\Gamma^{\intercal}$ is the symmetric Kronecker product of $W$ with itself (see e.g. for an earlier mention). It is the matrix containing elements

The Dennis family of quasi-Newton methods is the posterior mean after one step ( $M=1$ ) of Gaussian regression on matrix elements.

2.2 Remark on the structure of the prior covariance

The fact that symmetric Kronecker product covariance matrices give rise to some of the most popular secant methods may be reason enough to be interested in these structured Gaussian priors. This section provides two additional arguments in their favor.

The first argument, applicable to the entire family of Gaussian inference rules, is that they give consistent estimates, and thus convergent solvers: The priors of Lemma 1 and Theorem 3 assign nonzero mass to all square, and all symmetric matrices, respectively. It thus follows, from standard theorems about the consistency of parametric Bayesian priors (e.g. ), that linear solvers based on the mean estimate arising from either of these two Gaussian priors, applied to linear problems of general, or symmetric structure, respectively, are guaranteed (assuming perfect arithmetic precision) to converge to the correct $B$ (and $B^{-1}$ , where it exists) after $M=N$ linearly independent line searches (i.e. $\operatorname{rk}(S)=M$ ). This is because the Schur complement $W-WS(S^{\intercal}WS)^{-1}S^{\intercal}W$ is of rank $N-M$ [41, Eq. 0.9.2], so the remaining belief after $M=N$ is a point-mass at the unique $B=YS^{-1}$ . By a generalization of the same argument, it also follows that these linear solvers are always exact within the vector space spanned by the line-search directions. This holds for all choices of prior parameters $B_{0}$ and $W$ , as long as $W$ is strictly positive definite. Of course, good convergence rates do depend crucially on these two choices. And the aim in this paper is to also identify choices for these parameters such that the posterior uncertainty around the mean estimate is meaningful, too.

Since we know $B$ to be positive definite, it would be desirable to restrict the prior explicitly to the positive definite cone. Unfortunately, this is not straightforward within the Gaussian family, because normal distributions have full support. A seemingly more natural prior over this cone is the Wishart distribution popular in statistics,

(the $\propto$ symbol suppresses an irrelevant normalization constant). Using this prior in conjunction with linear observations, however, causes various complications, because the Wishart is not conjugate to one-sided linear observations of the form discussed above. So one may be interested in finding a ‘linearization’ (a Gaussian approximation of some form) for the Wishart, for example through moment matching. And indeed, the second moment (covariance) of the Wishart is $\nu^{-1}(W\mathchoice{{\ooalign{$ \displaystyle\otimes $\cr\hfil$ \displaystyle\ominus $\hfil\cr}}}{{\ooalign{$ \textstyle\otimes $\cr\hfil$ \textstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptstyle\otimes $\cr\hfil$ \scriptstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptscriptstyle\otimes $\cr\hfil$ \scriptscriptstyle\ominus $\hfil\cr}}}W)$ (see e.g. ).

Choice of parameters

The Powell symmetric Broyden update rule is the one-step posterior mean for a Gaussian regression model with $W=\alpha\boldsymbol{I}$ .

The Symmetric Rank-1 rule is the one-step posterior mean for a Gaussian regression model with the implicit choice $W=\alpha(B-B_{0})$ . (For a specific rank- $1$ observation, there is a linear subspace of choices $W$ which give $WS=Y$ , but $W=B$ is the only globally consistent such choice).

The Greenstadt update rule is the one-step posterior mean for a Gaussian regression model with $W=\alpha B_{0}$ .

The DFP update is the one-step posterior mean for the implicit choice $W=\alpha B$ . (This choice is unique in a manner analogous to the above for SR1).

The BFGS rule is the one-step posterior mean for the implicit choice $W=\alpha\left(B+\sqrt{\frac{s^{\intercal}Bs}{s^{\intercal}B_{0}s}}B_{t}\right)$ . (This, too, is unique in a manner analogous to the above).

It may seem circular for an inference algorithm trying to infer the matrix $B$ to use that very matrix as part of its computations (SR1, DFP, BFGS). But computation of the mean in Equation (15) only requires the projections $BS$ of $B$ , which are accessible because $BS=Y$ . However, the posterior uncertainty (Eq. 16), which is not part of the optimizers in their contemporary form, can not be computed this way.

Hence, with the exception of PSB, the popular secant rules all involve what would be called empirical Bayesian estimation in statistics, i.e. parameter adaptation from observed data. We also note again that the connection between probabilistic maximum-a-posterior estimates and Dennis-class updates only applies in the first of $M$ steps. As such, the Dennis updates ignore the dependence between information collected in older and newer search directions that leads to the matrix inverse of $G=(S^{\intercal}WS)$ in Equations (15) and (16) (obviously, including this information explicitly requires solving $M$ linear problems, at additional cost). As will be shown in Lemma 7 below, though, for some members of the Dennis family, and for their use within linear problems, this simplification is in fact exact.

Because directions $s$ where chosen randomly, these results say little about these algorithms as optimizers. What they do offer is an intuition for the difference between the exact rank- $2M$ posterior and repeated application of rank- $2$ Dennis-class update rules. A first observation is that, in this setup, keeping track of the dependence between consecutive search directions through $S^{\intercal}WS$ makes a big difference: For both pairs of ‘related’ algorithms PSB and $W=\boldsymbol{I}$ , as well as DFP and $W=B$ , the full posterior mean dominates the simpler ‘independent’ update rule. In fact, the classic secant rules do not converge to the true Hessian $B$ in this setup. The consistency argument in Section 2.2.2 only applies to estimators constructed by exact inference. The experiment shows how crucial tracking the full Gram matrix $S^{\intercal}WS$ is after $M>1$ .

A second, not surprising observation is that, although both probabilistic algorithms are consistent—they converge to the correct $B$ after $N$ steps—the quality of the inferred point estimate after $M<N$ steps depends on the choice of parameters. The simpler $W=\boldsymbol{I}$ (PSB) choice performs qualitatively worse than the $W=B$ (DFP) choice.

The posterior covariances were used to compute posterior uncertainty estimates for $\|B_{M}-B\|_{F}$ (gray lines in Figure 1): The Frobenius norm can be written as $\|B_{M}-B\|^{2}_{F}=\overrightarrow{(B_{M}-B)}^{\intercal}\overrightarrow{(B_{M}-B)}$ ; thus the expected value of this quadratic form is

with $W_{M}\colonequals W-WS(S^{\intercal}WS)^{-1}S^{\intercal}W$ . (To be clear: for $W=B$ , computing this uncertainty required the unrealistic step of giving the algorithm access to $B$ , which only makes sense for this conceptual experiment). The uncertainty estimate for $W=\boldsymbol{I}$ (dashed gray lines) is all but invisible in the right hand plot because its values are very close to 0—this algorithm has a badly calibrated uncertainty measure that has no practical use as an estimate of error. The uncertainty under $W=B$ (solid gray lines), on the other hand, scales qualitatively with the size of $B$ . This is because scaling $B$ by a scalar factor automatically also scales the covariance by the same factor. This has been noted before as a ‘non-dimensional’ property of BFGS/DFP [34, Eq. 6.11]. However, it is also apparent that the uncertainty estimate is too large in both plots—here by about a factor of 5. To understand why, we consider the individual terms in the sum of Equation (18) at the beginning of the inference: The ratio between the true estimation error on element $B_{ij}$ and the estimated error is

One may argue that a ‘well-calibrated’ algorithm should achieve $e_{ij}\approx 1$ . A problem with the choice $W=B$ becomes apparent considering diagonal elements and $B_{0}=\boldsymbol{I}$ :

This means the DFP prior is well-calibrated only for large diagonal elements $(B_{ii}\gg 1)$ . For diagonal elements $B_{ii}\approx 1$ , it is under-confident ( $e_{ii}\operatorname*{\rightarrowtriangle}0$ , estimating too large an error), and for very small diagonal elements $B_{ii}>0,B_{ii}\ll 1$ , it can be severely over-confident ( $e_{ii}\operatorname*{\rightarrowtriangle}\infty$ estimating too small an error). For off-diagonal elements and unit prior mean, the error estimate is

For positive definite $B$ , $e_{ij}^{2}<1$ off the diagonal holds because, for such matrices, $B_{ij}^{2}<B_{ii}B_{jj}$ (see e.g. [28, Corollary 7.1.5]), but of course $e_{ij}^{2}$ can still be very small or even vanish, e.g. for diagonal matrices. It is possible to at least fix the over-confidence problem, using the degree of freedom in Corollary 5 to scale the prior covariance to $W=\theta^{2}B$ with $\theta=\lambda_{\min}/(\lambda_{\min}-1)$ , using $\lambda_{\min}$ , the smallest eigenvalue of $B$ . This at least ensures $e_{ij}\leq 1\;\forall(i,j)$ .

Interestingly, setting $W=B-B_{0}$ (which gives the SR1 rule after the first observation, but not after subsequent ones) gives $e_{ii}^{2}=1$ , and $e_{ij}<1$ for $i\neq j$ . It also has the property that norm of the true $B$ under this prior is

so the true $B$ is exactly one standard deviation away from the mean under this prior. These properties suggest this covariance, which will be called standardized norm covariance, for further investigation in §4, which addresses the question: Is it possible to construct a linear solver that, without ‘cheating’ (using $B$ or $H$ explicitly in the covariance), has a well-calibrated uncertainty measure, and can thus meaningfully estimate the error of its computation; ideally, without major cost increase?

2 Structure of the Gram matrix

The above established that, treated as inference rules for matrices, general Dennis rules are probabilistically exact only after one rank-1 observation $y=Bs$ . How strong is the error thus introduced? In fact, as the following lemma shows, there are choices of search directions $S$ for which the existing algorithms do become exact probabilistic inference.

For linear problems $Bx=b$ with symmetric positive definite $B$ and exact line searches, under the DFP covariance $W=B$ , and linesearches along the inverse of the posterior mean of the Gaussian belief, the Gram matrix is diagonal. Analogously for inverse updates: For inference on $H=B^{-1}$ under the BFGS covariance $W=H$ on the same linear optimization problem and linesearches along the posterior mean over $H$ , the Gram matrix is diagonal.

The following result by Nazareth establishes that, for linear problems, the inference interpretation for BFGS transfers directly to the conjugate gradient (CG) method of Hestenes & Stiefel .

The connection between BFGS and CG is intuitive within the probabilistic framework: BFGS uses $W=H$ , so its mean estimate $H_{M}$ is the ‘best guess’ for $H$ under (the minimizer of) the norm $\overrightarrow{(H-H_{M})}^{\intercal}(H\mathchoice{{\ooalign{$ \displaystyle\otimes $\cr\hfil$ \displaystyle\ominus $\hfil\cr}}}{{\ooalign{$ \textstyle\otimes $\cr\hfil$ \textstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptstyle\otimes $\cr\hfil$ \scriptstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptscriptstyle\otimes $\cr\hfil$ \scriptscriptstyle\ominus $\hfil\cr}}}H)^{-1}\overrightarrow{(H-H_{M})}$ , and its iterated estimate $x_{M}$ is the best rank- $M$ estimate for $x$ when the error is measured as $(x-x_{M})^{\intercal}W^{-1}(x-x_{M})=(x-x_{M})^{\intercal}B(x-x_{M})$ . Minimizing this quantity after $M$ steps is a well-known characterisation of CG [34, Eq. 5.27].

Theorem 8 implies that, describing BFGS in terms of Gaussian inference also gives a Gaussian interpretation for CG ‘for free’. From the probabilistic perspective, and exclusively for linear problems, CG is ‘just’ a compact implementation of iterated Gaussian inference on $H$ from $p(H)=\mathcal{N}(H;\boldsymbol{I},H\mathchoice{{\ooalign{$ \displaystyle\otimes $\cr\hfil$ \displaystyle\ominus $\hfil\cr}}}{{\ooalign{$ \textstyle\otimes $\cr\hfil$ \textstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptstyle\otimes $\cr\hfil$ \scriptstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptscriptstyle\otimes $\cr\hfil$ \scriptscriptstyle\ominus $\hfil\cr}}}H)$ , with search directions along $H_{M}F(x_{M})=H_{M}(Bx_{M}-b)$ . This observation has conceptual value in itself (the natural question, left open here, is what it implies for the nonparametric extensions of CG). But Theorem 8, among other things, also implies the following helpful properties for the search directions $s_{i}$ chosen by, and gradients $F_{i}$ ‘observed’ by the (scalar prior mean) BFGS algorithm. They are all well-known properties of the conjugate gradient method (e.g. [34, Thm. 5.3]). In the following, generally assume that the algorithm has not converged at step $M<N$ , and remember that the $F_{M}=Bx_{M}-b$ are the residuals (gradients of $f(x)=\nicefrac{{1}}{{2}}x^{\intercal}Bx-x^{\intercal}x$ ) after $M$ steps, which form $y_{M}=F_{M}-F_{M-1}$ .

the set of evaluated gradients / residuals is orthogonal:

the gradients (and thus $Y$ ) span the Krylov subspaces generated by $(B,b)$ :

line searches and gradients span the same vector space:

3 Discussion

We have established a probabilistic interpretation of the Dennis class of quasi-Newton methods, and the CG algorithm, as Gaussian inference: The Dennis class can be seen as Gaussian posterior means after the first line search (Corollary 4), but this connection extends to multiple search directions only if the search directions are conjugate under prior covariance (Lemma 6). For linear problems, this is the case for the DFP, BFGS update rules (Lemma 7). Since BFGS is equivalent to CG on linear problems (Lemma 8), this also establishes a probabilistic interpretation for linear CG. These results offer new ways of thinking about linear solvers, in terms of solving an inference problem by collecting information and building a model, rather than by designing a dynamic process converging to the minimum of a function. It is intriguing that, from this vantage point, the extremely popular CG / BFGS methods look less well-calibrated than one may have expected (§3.1).

The obvious next question is, can one design explicitly uncertain linear solvers with a reasonably well-calibrated posterior? In addition to the scaling issues, a challenge is that, for BFGS / CG, the prior covariance $W=H$ is only an implicit object. After $M<N$ steps, there exists a $\nicefrac{{1}}{{2}}(N-M)(N-M+1)$ -dimensional cone of positive definite covariance matrices fulfilling $WY=S$ (and, additionally, a scalar degree of freedom inherent to the Dennis class). How do we pick a point in this space?

Constructing explicit posteriors

Considering the structure of $\Sigma$ , one can write $T$ in terms of block matrices

A primary goal in designing a probabilistic linear solver is thus, at step $M<N$ , to (1) identify the span of $W_{M}$ , ideally without incurring additional cost, and to (2) fix the entries in the remaining free dimensions in $W_{M}$ , by using some regularity assumptionsA probabilistically more appealing approach would be to use a hyper-prior on the elements of $W$ , marginalized over the unidentified degrees of freedom. It is currently unclear to the author how to do this in a computationally efficient way. about $H$ . The equivalence between BFGS and CG offers an elegant way of solving problem (1), with no additional computational cost: Recall from Theorem 3 and Lemma 6 that the covariance after $M$ steps under $W=H$ is $W_{M}\mathchoice{{\ooalign{$ \displaystyle\otimes $\cr\hfil$ \displaystyle\ominus $\hfil\cr}}}{{\ooalign{$ \textstyle\otimes $\cr\hfil$ \textstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptstyle\otimes $\cr\hfil$ \scriptstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptscriptstyle\otimes $\cr\hfil$ \scriptscriptstyle\ominus $\hfil\cr}}}W_{M}$ with

Because, by Equation (25) the vector-space spanned by $S$ is identical to that spanned by the orthogonal gradients, we can write the space of all symmetric positive semidefinite matrices $W$ with the property $WY=S$ as

Eq. (30) parametrises posterior covariances of the BFGS family. In light of the scaling issues of these priors discussed in §3.1, one would prefer, from the probabilistic standpoint, to use the standardized norm priors $p(H)=\mathcal{N}(H;\alpha\boldsymbol{I},(H-H_{0})\mathchoice{{\ooalign{$ \displaystyle\otimes $\cr\hfil$ \displaystyle\ominus $\hfil\cr}}}{{\ooalign{$ \textstyle\otimes $\cr\hfil$ \textstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptstyle\otimes $\cr\hfil$ \scriptstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptscriptstyle\otimes $\cr\hfil$ \scriptscriptstyle\ominus $\hfil\cr}}}(H-H_{0}))$ , but these priors do not share BFGS/CG’s other good numerical properties. Instead, a hybrid algorithm can be constructed as follows:

Solve the linear problem using the conjugate gradient method. While the algorithm runs, collect $S,Y,\bar{F}$ . This has storage cost of $2NM+M$ floats: Because $Y$ consists of differences between subsequent columns of $F$ , it does not need to be stored explicitly, the column norms $\|F\|_{i}$ required to compute $\bar{F}$ require $M$ extra floats. The computation cost of the standard conjugate gradient algorithm is $\mathcal{O}(M)$ matrix-vector multiplications (that is, $\mathcal{O}(MN^{2})$ assuming a dense matrix), plus $\mathcal{O}(MN)$ operations for the algorithm itself (including computation of $\|F\|_{i}$ ).

Using the $(S,Y,\bar{F})$ constructed by CG, compute the standardized-norm posterior on $H$ , i.e. use the prior $p(H)$ defined above, which yields a Gaussian posterior with mean and covariance

A prerequisite for this is to choose $\alpha<\lambda_{\min}(H)$ , less than the smallest eigenvalue of $H$ , to ensure that $W=H-H_{0}$ is positive definite. But $\lambda_{\min}(H)=1./\lambda_{\max}(B)$ , which can be estimated efficiently (and without additional cost) from the $\|F\|_{i}$ . Another minor hurdle is that Equations (32) & (34) require the inverse of $S^{\intercal}Y-\alpha Y^{\intercal}Y$ . The columns of $Y$ are $Y_{i}=F_{i}-F_{i-1}$ , so, because conjugate gradient constructs orthogonal gradients, $Y^{\intercal}Y$ is a symmetric tridiagonal matrix, $Y_{i}^{\intercal}Y_{j}=\delta_{ij}(\|F_{i}\|^{2}+\|F_{i-1}\|^{2})+(\delta_{i(j-1)}+\delta_{(i+1)j})\|F_{i}\|^{2}$ , and $S^{\intercal}Y$ is diagonal because the $S$ are conjugate under $B$ . So the entire Gram matrix is tridiagonal, and the $M$ linear problems in $(Y^{\intercal}S-\alpha YY^{\intercal})^{-1}(S-\alpha Y)^{\intercal}$ can be solved in $\mathcal{O}(M^{2})$ , e.g. using the Thomas algorithm [7, Alg. 4.3]

estimate $\Omega$ according to some rule. §4.2 proposes several rules of $\mathcal{O}(M)$ cost.

While there is a vague connection between the standardized norm prior and the SR1 algorithm by Corollary 5, the algorithm described above is quite different from the SR1 method. It uses search directions constructed by BFGS/CG, and its update rule uses the exact Gram matrix, not the repeated rank-1 updates that give SR1 its name.

The computation overhead of constructing this posterior mean and covariance, after running the conjugate gradient algorithm, is $\mathcal{O}(M^{2})$ , which is small compared even to the internal $\mathcal{O}(MN)$ cost of CG, let alone the $\mathcal{O}(MN^{2})$ for the matrix-vector multiplications in CG. Storing the posterior mean and covariance requires $\mathcal{O}(NM)$ space, which is feasible even for relatively large problems. Crucially, retaining the covariance adds almost no overhead to storing the mean alone.

2 Estimation rules

The remaining step is to find estimates for $\Omega$ . It is clear that there are myriad options for fixing such rules. For an initial evaluation, we adopt the perhaps simplistic, but straightforward approach of estimating $\Omega$ to a scalar matrix $\Omega=\omega^{2}\boldsymbol{I}$ (one way to motivate this is to argue that, at step $M$ , future line searches $s_{M+i}$ will point in an unknown direction in the span of $\boldsymbol{I}-\bar{F}\bar{F}^{\intercal}$ , so it makes sense to not prefer any direction in the choice of $\Omega$ ).

A natural idea is to use regularity structure on quantities already computed during the run of the conjugate gradient algorithm: Assume the algorithm is currently at step $T$ . If, at step $M<T$ we had tried to predict the Gram matrix diagonal element $y_{M+1}^{\intercal}Wy_{M+1}=-s_{M+1}^{\intercal}F_{M}$ using the structure for $W$ described above, we would have predicted, because $F_{M}$ is known to be in the span of $S$ , and orthogonal to $(\boldsymbol{I}-\bar{F}\bar{F}^{\intercal})$ ,

$\|F_{M+1}\|$ can be estimated from the norm of preceding gradients. The second term on the right hand side of Equation (37) is known at step $M$ . The first term of the right hand side can be estimated by regression, in ways further explored below.

A simple baseline is a stationary model for the $\omega_{i}$ . This was used to construct error estimates in Figures 3 to 5 (in gray for the middle and bottom row, black for the top row). Of course, if the eigenvalues of $B$ are uniformly distributed in the top row, the eigenvalues of $H$ (their inverses) are not.

A slightly more elaborate model is a linear trend with noise: $\omega_{i}=ai+b+n$ (with $n\sim\mathcal{N}(0,\sigma^{2})$ ). Linear regression on the values of $\omega_{i}$ can be performed in $\mathcal{O}(M)$ . We can then set $\Omega=\bar{\omega}\boldsymbol{I}$ with $\bar{\omega}=aN+b$ the expected largest value of $\omega_{i}$ (i.e. a noisy upper bound). This approach was used to construct the (black) error estimates in the middle rows of Figures 3 to 5.

Finally, if structural knowledge is available, e.g. that the first $L$ eigenvalues of $B$ are $\alpha$ times larger than the later ones, on may use the stationary rule from above, but explicitly multiply the estimate $\omega$ by $\alpha$ for the first $L$ steps. This may seem contrived, but in fact it is not uncommon in applications to know an effective number of degrees of freedom in $B$ . For example, in nonparametric least-squares regression with a very large number of $N$ data points distributed approximately uniformly over a range of width $\rho$ , using an RBF kernel of length scale $\lambda$ , the model’s number of degrees of freedom is $L=\rho/(2\pi\lambda)$ [38, Eq. 4.3]. This rule was used to construct (black) error estimates in the bottom rows of Figures 3 to 5.

3 Estimating quantities of interest

This final part demonstrates a few example uses of the Gaussian posterior $p(H)=\mathcal{N}(\overrightarrow{H};\overrightarrow{H}_{M},W_{M}\mathchoice{{\ooalign{$ \displaystyle\otimes $\cr\hfil$ \displaystyle\ominus $\hfil\cr}}}{{\ooalign{$ \textstyle\otimes $\cr\hfil$ \textstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptstyle\otimes $\cr\hfil$ \scriptstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptscriptstyle\otimes $\cr\hfil$ \scriptscriptstyle\ominus $\hfil\cr}}}W_{M})$ on $H$ constructed by the BFGS / CG method. Figures 3 to 5 show three such uses, explained below. Each row of this figure uses data from one of the experiments shown in the corresponding row of Figure 2.

The most obvious question is how far the estimate $H_{M}$ for $H$ after $M$ steps is from the true $H$ . This distance is estimated directly by the Gaussian posterior of Equation (26). The marginal distribution on any linear projection $A\overrightarrow{H}$ is $\mathcal{N}(A\overrightarrow{H};A\overrightarrow{H}_{M},A(W_{M}\mathchoice{{\ooalign{$ \displaystyle\otimes $\cr\hfil$ \displaystyle\ominus $\hfil\cr}}}{{\ooalign{$ \textstyle\otimes $\cr\hfil$ \textstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptstyle\otimes $\cr\hfil$ \scriptstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptscriptstyle\otimes $\cr\hfil$ \scriptscriptstyle\ominus $\hfil\cr}}}W_{M})A^{\intercal})$ . In particular, the marginal distribution on each element $H_{ij}$ is a scalar Gaussian

Figure 3 shows this error estimate for 40 elements of one particular $H$ (drawn uniformly at random from the $4\cdot 10^{4}$ elements of the $200\times 200$ matrix). The estimate arising from the uniform estimation rule for $\omega$ from Section 4.2 is shown in gray in each panel (black for the top panel). The same quantity, estimated with the linear regression and structured estimation rules from Section 4.2 are shown in black in the middle and bottom row, respectively. The left column of the figure shows results from the BFGS/CG prior, the right column shows results using the standardized norm prior on data constructed with the CG algorithm as described in §4.1.1. As expected from the argument in §3.1, the BFGS estimates are regularly considerably too small, while the standardized-norm estimates have a meaningful width. The error estimators have varying behaviour. For the exponential eigenvalue spectrum, the estimator fluctuates strongly in the first few steps before settling to a good value (this could be corrected using a regularizer, left out here to not bias the results). For the structured-eigenvalues problems, the region around the step from small to large eigenvalues is problematic. But overall, they do provide a meaningful notion of error. In particular, they are rarely too small. For most uses of statistical error estimators, it is better to be too conservative (too large) than to be too confident. Of course, it would be great if future research would find better calibrated error estimates.

As explained in Equation (18), the same error estimates can also be collapsed into an error estimate on the norm $\|H-H_{M}\|_{F}$ . Figure 4 shows results from such an experiment, for the 20 different $H$ ’s from Figure 2. The quantitative results are similar to the previous figure, but this figure more clearly shows the difference between the baseline (gray) and exponential, structured error estimates (black), and the behaviour of the estimated errors relative to the varying norms of the drawn $H$ ’s.

3.2 Estimating solutions for new linear problems

An obvious use for the estimate for $H$ found by CG / BFGS when solving one linear problem $Bx=b$ is as an instantaneous solution estimate for other linear problems $Bx_{\text{test}}=b_{\text{test}}$ . The left and middle columns of Figure 5 shows this use. In each case, an $x_{\text{test}}$ was drawn from $\mathcal{N}(x;0,10\boldsymbol{I})$ , and the corresponding $b_{\text{test}}=Bx_{\text{test}}$ presented to the algorithm. Since $x_{\text{test}}=Hb_{\text{test}}$ is a linear projection of $H$ , the posterior marginal on $x_{\text{test}}$ is also Gaussian $p(x_{\text{test}}\,|\,S_{M},Y_{M})=\mathcal{N}(x_{\text{test}},H_{M}b_{\text{test}},\Sigma)$ , and has covariance matrix elements

Figure 4 shows the true errors on the elements of $x_{\text{test}}$ in blue, and the estimated marginal errors (the diagonal elements of $\Sigma$ ) in black for the stationary, linear, structured models, respectively (and, as in previous figures, the stationary model in gray in the two non-stationary cases). More drastically than the previous ones, these figures show that the BFGS posterior can severely underestimate the error on elements of $x_{\text{test}}$ , while the standardized norm prior at least provides outer bounds (albeit sometimes quite loose ones).

The error on $x_{\text{test}}$ does not always collapse over the course of finding $x$ . This says more about CG as such than about its probabilistic interpretation: CG does not aim to construct $H$ , but only to find $x_{*}$ . For simplicity of exposition, we have assumed that $H=B^{-1}$ exists, and CG requires the full $N$ steps to converge, thus identifying $B$ and $H$ . In general, CG regularly converges much earlier. For an intuition, consider the special case where $x_{0}=0$ and $b=[1,\dots,1,0,0,\dots,0]$ consists of $K$ consecutive ones and $N-K$ zeros. The CG/BFGS algorithm will never explore the lower $(N-K)\times(N-K)$ block of $H$ , which may contain arbitrary numbers. If the primary aim is not $x_{*}=Hb$ but $H$ itself, a more elaborate course is needed; e.g. choosing several $b$ to span a space of interest over $H$ . It is an interesting open question whether the probabilistic interpretation can be used to actively collapse the uncertainty on $H$ in a typically more efficient way than established matrix inversion methods like Gauss-Jordan (which is also a conjugate direction method ).

Conclusion & outlook

This text developed a probabilistic interpretation of iterative solvers for linear problems $Bx=b$ with symmetric $B$ . The Dennis family of secant updates can be derived as the posterior mean of a parametric Gaussian model after one rank-1 observation. For rank $M$ observations, the match between these updates and Gaussian inference only holds if the search directions are conjugate under the prior covariance. This is the case for the DFP direct and BFGS inverse updates rules. Their equivalence to CG in the linear case makes them particularly interesting. However, it also became apparent that, from a inference perspective, the BFGS rule does not yield a well-scaled error measure.

As a first step toward a better scaled Gaussian belief, the standardized norm covariance, was proposed. It is inspired by the SR1 rule, but leads to probabilistic corrections in the form of off-diagonal terms, and can be used with data produced by the CG algorithm, thus retaining the good numerical properties of that method. The space of possible covariance matrices consistent with the resulting mean is a sub-space of the positive definite cone, which collapses during the run of the algorithm (the same holds for the BFGS / CG method). Several possible estimation rules for choosing elements in this space of covariances where proposed, arising from different structural assumptions over $H$ . The resulting Gaussian posterior provides joint uncertainty estimates on the elements of $H$ , and all linear projections of $H$ , in particular of other linear problems $x_{\text{test}}=Hb_{\text{test}}$ . This adds functionality to the conjugate gradient method, at a computational overhead much smaller than the cost of CG itself.

The implications for nonlinear optimization methods of both the quasi-Newton and CG families remain interesting open questions. For example, clearly the conjugacy assumption implicit in the Dennis class members is inconsistent with the probabilistic interpretation. This was already noted by Hennig & Kiefel , who also proposed using a nonparametric Gaussian formulation to give a more explicit inference interpretation to nonlinear optimization. This left questions regarding the choice of prior covariance, which are only made more pressing by the results presented here. Another direction is inference from noisy evaluations, in which case the posterior covariance does not collapse to zero after finitely many steps of optimization, not even in the linear case. Some related results where previously discussed in , but the study of probabilistic numerical optimization remains at an early stage.

Appendix A Proofs for results from main text

Throughout the appendix, the notation $\Delta=Y-B_{0}S$ will be used to represent the residual.

A.2 Proof for Theorem 3

We begin with the posterior mean (43). From Equation (10), it has the form (with the prior covariance $V=W\mathchoice{{\ooalign{$ \displaystyle\otimes $\cr\hfil$ \displaystyle\ominus $\hfil\cr}}}{{\ooalign{$ \textstyle\otimes $\cr\hfil$ \textstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptstyle\otimes $\cr\hfil$ \scriptstyle\ominus $\hfil\cr}}}{{\ooalign{$ \scriptscriptstyle\otimes $\cr\hfil$ \scriptscriptstyle\ominus $\hfil\cr}}}W$ )

A few straightforward steps establish that the $NM\times NM$ matrix to be inverted is indeed invertible for linearly independent columns of $S$ , and has elements

Also, the elements of $V(\boldsymbol{I}\otimes S)$ are

which then gives the posterior as $\nicefrac{{1}}{{2}}(WXS^{\intercal}WSWSX^{\intercal}W)$ . (Because $X$ is rectangular, Equation (56) is a generalization of a Lyapunov equation. Standard solutions for such Equations do not apply directly). Instead of just presenting a solution, the following lines show a constructive proof. We first re-write Eq. (56) ( $S^{\intercal}WS$ is invertible because $W$ is positive definite, and $S$ is assumed to be of rank $M$ ) as

which identifies $R_{-}$ . Noting that $Q_{+}Q_{+}^{\intercal}=Q_{+}D^{-1}UU^{\intercal}DQ_{+}^{\intercal}=S^{+}S^{\intercal}$ , we can write

Plugging back into Equation (60), using $(S^{\intercal}WS)^{-1}UD=(Q_{+}^{\intercal}WS)^{-1}$ , we get

We see directly that this is a symmetric matrix, because $S^{\intercal}\Delta=S^{\intercal}BS-S^{\intercal}B_{0}S=\Delta^{\intercal}S$ . Now, noting that $XS^{\intercal}+SX^{\intercal}=Q_{+}(R_{+}+R_{+}^{\intercal})Q_{+}^{\intercal}+Q_{-}R_{-}Q_{+}^{\intercal}+Q_{+}R_{-}^{\intercal}Q_{-}^{\intercal}$ , we find

From Equation (55), the posterior mean can be written as

which is clearly equal to Equation (43). To establish the form of the posterior covariance, we make use of the structural similarities between the posterior mean and covariance (Equation (10)), and notice that we have just established

A.3 Proof for Lemma 6

To be shown: If the Gram matrix $S^{\intercal}WS$ is diagonal, then the exact posterior mean $B_{M}$ after $M$ steps, which is

is equal to the rank-2 update of $B_{M-1}$ using the Dennis update

We first harmonize the notation between the two formulations by writing the elements of the diagonal Gram matrix as $(S^{\intercal}WS)_{ij}=\delta_{ij}c^{\intercal}_{i}s_{i}\equalscolon\delta_{ij}a_{i}$ . With this notation, the posterior mean $B_{M}$ , Equation (72), can be written as

On the other hand, the expression $y_{M}-B_{M-1}s_{M}$ from Equation (73) can be written using Equation (74) as

But since, by assumption, $c_{i}^{\intercal}s_{M}=0$ for $i\neq M$ , this expression simplifies to

Similarly, the expression $s_{M}^{\intercal}(y_{M}-B_{M-1}s_{M})$ from Equation (73) simplifies to

Reinserting these expressions into Equation (73), we see that it equals Equation (76), which completes the proof.

A.4 Proof for Lemma 7

The DFP update is the direct update with the choice $W=B$ ; and the BFGS update is the inverse update with the choice $W=H$ . So the Gram matrix, in both cases, is $S^{\intercal}BS=Y^{\intercal}HY=S^{\intercal}Y$ . The $i,j$ -th element of this symmetric $M\times M$ matrix is $y_{i}^{\intercal}s_{j}$ . The statement to be shown is that this matrix is diagonal if the line search directions are chosen as

with the residual (the gradient of the equivalent quadratic optimization objective) $F_{i}=Bx_{i}-b$ . We also assume perfect line searches. First, consider the special case where $j=i+1$ (i.e. subsequent line searches). Because they are in the Dennis class, the estimates for $H$ (irrespective of whether they were constructed by inverting a direct estimate or using an inverse estimate directly) fulfill the ‘quasi-Newton equation’ $s_{i}=H_{i+1}y_{i}=H_{i+1}(F_{i}-F_{i-1})$ . Thus

The exact line search along $s_{i}$ ended when $s_{i}^{\intercal}F_{i}=0$ , so

(the last line follows again because, by the quasi-Newton equation, $s_{i}=H_{j}y_{i}$ for all $j>i$ ). By symmetry of the Gram matrix, Eq. (82) also implies $y_{i+1}^{\intercal}s_{i}=0$ . We complete the proof inductively: Let $j>i+1$ or $i>j+1$ , and assume $y_{i}^{\intercal}s_{j-a}=y_{j-a}^{\intercal}s_{i}=0\;\forall a>0$ . Also, $F_{j-1}$ can be written with a telescoping sum as

This also implies $F_{i}^{\intercal}s_{j}=0$ for $i\neq j$ : Assume w.l.o.g. that $i>j$ . Then use the telescoping sum of Equation (83) to get

Acknowledgments

The author would like to thank Maren Mahsereci and Martin Kiefel for helpful discussions both prior to and during the preparation of this manuscript. He is particularly grateful to Maren Mahsereci for helpful discussions about details of the symmetric basis in Theorem 3. In addition, the author is thankful for comments from two anonymous reviewers, particularly for pointing out relevant previous work.