A Theory of the Distortion-Perception Tradeoff in Wasserstein Space

Dror Freirich, Tomer Michaeli, Ron Meir

Introduction

Image restoration covers some fundamental settings in image processing such as denoising, deblurring and super-resolution. Over the past few years, image restoration methods have demonstrated impressive improvements in both visual quality and distortion measures such as peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) (31). It was noticed, however, that improvement in accuracy, as measured by distortion, does not necessarily lead to improvement in visual quality, referred to as perceptual quality. Furthermore, the lower the distortion of an estimator, the more the distribution of its outputs generally deviates from the distribution of the signals it attempts to estimate. This phenomenon, known as the perception-distortion tradeoff (4), has attracted significant attention, as it implies that faithfulness to ground truth images comes at the expense of perceptual quality, as measured by the deviation from the statistics of natural images. Several works have extended the perception-distortion tradeoff to settings such as lossy compression (5) and classification (14).

Despite the increasing popularity of performing comparisons on the perception-distortion plane, the exact characterization of the minimal distortion that can be achieved under a given perception constraint remains an important open question. Although Blau and Michaeli (4) investigated the basic properties of this distortion-perception function, such as monotonicity and convexity, little is known about its precise nature. While a general answer to this question is unavailable, in this paper, we derive a closed form expression for the distortion-perception (DP) function for the mean squared-error (MSE) distortion and the Wasserstein- $2$ perception index.

Our main contributions are: (i) We prove that the DP function is always quadratic in the perception constraint $P$ , regardless of the underlying distribution (Theorem 1). (ii) We show that it is possible to construct estimators on the DP curve from the estimators at the two extremes of the tradeoff (Theorem 3): The one that globally minimizes the MSE, and a minimizer of the MSE under a perfect perceptual quality constraint. The latter can be obtained as a stochastic transformation of the former. (iii) In the Gaussian setting, we further provide a closed form expression for optimal estimators and for the corresponding DP curve (Theorems 4 and 5). We show this Gaussian DP curve is a lower bound on the DP curve of any distribution having the same second order statistics. Finally, we illustrate our results, numerically and visually, in a super-resolution setting in Section 5. The proofs of all the theorems in the main text are provided in Appendix B.

Our theoretical results shed light on several topics that are subject to much practical activity. Particularly, in the domain of image restoration, numerous works target perceptual quality rather than distortion (e.g. (29; 13; 12)). However, it has recently been recognized that generating a single reconstructed image often does not convey to the user the inherent ambiguity in the problem. Therefore, many recent works target diverse perceptual image reconstruction, by employing randomization among possible restorations (15; 3; 21; 1). Commonly, such works perform sampling from the posterior distribution of natural images given the degraded input image. This is done e.g. using priors over image patches (7), conditional generative models (18; 20), or implicit priors induced by deep denoiser networks (10). Theoretically, posterior sampling leads to perfect perceptual quality (the restored outputs are distributed like the prior). However, a fundamental question is whether this is optimal in terms of distortion. As we show in Section 3.1, posterior sampling is often not an optimal strategy, in the sense that there often exist perfect perceptual quality estimators that achieve lower distortion.

Another topic of practical interest is the ability to traverse the distortion-perception tradeoff at test time, without having to train a different model for each working point. Recently, interpolation has been suggested for controlling several objectives at test-time. Shoshan et al. (25) propose using interpolation in some latent space in order to approximate intermediate objectives. Wang et al. (29) use per-pixel interpolation for balancing perceptual quality and fidelity. Studies of network parameter interpolation are presented by Wang et al. (29, 30). Deng (6) produces a low distortion reconstruction and a high perceptual quality one, and then uses style transfer to combine them. An important question, therefore, is which strategy is optimal. In Section 3.2 we show that for the MSE–Wasserstein-2 tradeoff, linear interpolation leads to optimal estimators. We also discuss a geometric connection between interpolation and the fact that estimators on the DP curve form a geodesic in Wasserstein space.

Problem setting and preliminaries

Let $X$ denote the signal of interest and $Y$ the measurements, and let $\hat{X}$ be an estimator of $X$ based on $Y$, specified by a conditional distribution $p_{\hat{X}|Y}$. In many practical cases, the goodness of an estimator is associated with two factors: (i) the degree to which $\hat{X}$ is close to $X$ on average (low distortion), and (ii) the degree to which the distribution of $\hat{X}$ is close to that of $X$ (good perceptual quality). An important question, then, is what is the minimal distortion that can be achieved under a given level of perceptual quality, and how can we construct estimators that achieve this lower bound? In mathematical language, we are interested in analyzing the distortion-perception (DP) function (defined similarly to the perception-distortion function of (4))

$$D(P)=\min_{p_{\hat{X}|Y}}\left\{\mathbb{E}\left[d(X,\hat{X})\right]\;:\;d_{p}(p_{X},p_{\hat{X}})\leq P\right\}.\tag{1}$$

As discussed in (4), the function $D(P)$ is monotonically non-increasing and is convex whenever $d_{p}(\cdot,\cdot)$ is convex in its second argument (which is the case for most popular divergences). However, without further concrete assumptions on the distortion measure $d(\cdot,\cdot)$ and the perception index $d_{p}(\cdot,\cdot)$ , little can be said about the precise nature of $D(P)$ .

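Here, we focus our attention on the squared-error distortion $d(x,\hat{x})=\|x-\hat{x}\|^{2}$ and the Wasserstein-2 distance $d_{p}(p_{X},p_{\hat{X}})=W_{2}(p_{X},p_{\hat{X}})$ , with which (1) reads

$$D(P)=\min_{p_{\hat{X}|Y}}\left\{\mathbb{E}\left[\|X-\hat{X}\|^{2}\right]\;:\;W_{2}(p_{X},p_{\hat{X}})\leq P\right\}.\tag{2}$$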

2.1 The Wasserstein and Gelbrich Distances

Before we present our main results, we briefly survey a few properties of the Wasserstein distance, mostly taken from (19). The Wasserstein- $p$ ( $p\geq 1$ ) distance between measures $\mu$ and $\gamma$ on a separable Banach space $\mathcal{X}$ with norm $\|\cdot\|$ is defined by

$$W_{p}^{p}(\mu,\gamma)=\inf\left\{\mathbb{E}_{(U,V)\sim\nu}\left[\|U-V\|^{p}\right]\;:\;\nu\in\Pi(\mu,\gamma)\right\},\tag{3}$$

where $\Pi(\mu,\gamma)$ is the set of all probabilities on $\mathcal{X}\times\mathcal{X}$ with marginals $\mu$ and $\gamma$ . A joint probability $\nu$ achieving the optimum in (3) is often referred to as an optimal plan. The Wasserstein space of probability measures is defined as

$$\mathcal{W}_{p}(\mathcal{X})=\left\{\gamma\;:\;\int_{\mathcal{X}}\|x\|^{p}\,d\gamma(x)<\infty\right\},$$

and $W_{p}$ constitutes a metric on $\mathcal{W}_{p}(\mathcal{X})$ .

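For Gaussian measures with non-singular covariances $\Sigma_{1}$ and $\Sigma_{2}$ , the matrix

$$T_{1\rightarrow 2}=\Sigma_{1}^{-1/2}\left(\Sigma_{1}^{1/2}\Sigma_{2}\Sigma_{1}^{1/2}\right)^{1/2}\Sigma_{1}^{-1/2}\tag{5}$$

is the optimal transformation pushing forward from $\mathcal{N}(0,\Sigma_{1})$ to $\mathcal{N}(0,\Sigma_{2})$ (11). This transformation satisfies $\Sigma_{2}=T_{1\rightarrow 2}\Sigma_{1}T_{1\rightarrow 2}.$ For a discussion on singular distributions, please see App. A.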
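The relation $\Sigma_{2}=T_{1\rightarrow 2}\Sigma_{1}T_{1\rightarrow 2}$ can be checked numerically. The following is a minimal sketch, assuming the map has the form $T_{1\rightarrow 2}=\Sigma_{1}^{-1/2}(\Sigma_{1}^{1/2}\Sigma_{2}\Sigma_{1}^{1/2})^{1/2}\Sigma_{1}^{-1/2}$ (the same form used for $T^{*}$ in Theorem 4). The helper names `psd_sqrt` and `gaussian_transport_map`, as well as the two example covariances, are hypothetical and chosen only for illustration.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative round-off
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_transport_map(sigma1, sigma2):
    """T = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}; requires S1 > 0."""
    root1 = psd_sqrt(sigma1)
    inv_root1 = np.linalg.inv(root1)
    return inv_root1 @ psd_sqrt(root1 @ sigma2 @ root1) @ inv_root1


# Hypothetical example covariances (any positive definite pair would do).
sigma1 = np.array([[2.0, 0.5], [0.5, 1.0]])
sigma2 = np.array([[1.0, -0.3], [-0.3, 3.0]])

t_map = gaussian_transport_map(sigma1, sigma2)
pushed = t_map @ sigma1 @ t_map  # covariance of T U when U ~ N(0, sigma1)
```

Here `pushed` should agree with `sigma2` up to round-off, and `t_map` is symmetric, which is why no transpose appears in the identity.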

Main results

The DP function (2) depends, of course, on the underlying joint probability $p_{XY}$ of the signal $X$ and measurements $Y$ . Our first key result is that this dependence can be expressed solely in terms of $D^{*}$ and $P^{*}$ . In other words, knowing the distortion and perception index attained by the minimum MSE estimator $X^{*}$ , suffices for determining $D(P)$ for any $P$ .

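Theorem 1. Let $D^{*}$ and $P^{*}$ denote the MSE and the perception index attained by $X^{*}$ , respectively. Then the DP function (2) is given by

$$D(P)=D^{*}+\left[(P^{*}-P)_{+}\right]^{2},\tag{8}$$

where $(x)_{+}=\max(0,x)$ . Furthermore, an estimator achieving perception index $P$ and distortion $D(P)$ can always be constructed by applying a (possibly stochastic) transformation to $X^{*}$ .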
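To make the shape of the curve concrete, the quadratic form $D(P)=D^{*}+[(P^{*}-P)_{+}]^{2}$ established in App. B.2 can be evaluated directly. The sketch below is only an illustration: the function name `dp_function` and the values of $D^{*}$ and $P^{*}$ are hypothetical, chosen so that $(P^{*})^{2}\leq D^{*}$.

```python
def dp_function(p, d_star, p_star):
    """Evaluate D(P) = D* + [(P* - P)_+]^2 for a perception constraint P >= 0."""
    if p < 0:
        raise ValueError("the perception constraint P must be non-negative")
    gap = max(p_star - p, 0.0)  # (P* - P)_+
    return d_star + gap ** 2


# Hypothetical values of D* and P* (not taken from any experiment).
d_star, p_star = 1.0, 0.5
curve = [dp_function(p, d_star, p_star) for p in (0.0, 0.25, 0.5, 1.0)]
# curve == [1.25, 1.0625, 1.0, 1.0]
```

The curve decreases quadratically until $P=P^{*}$ and stays flat at $D^{*}$ afterwards, since relaxing the constraint beyond the perception index of $X^{*}$ cannot reduce the MSE further.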

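Theorem 2. Let $G^{*}$ denote the Gelbrich distance between the first- and second-order statistics of $X$ and those of $X^{*}$ . Then

$$D(P)\geq D^{*}+\left[(G^{*}-P)_{+}\right]^{2}.\tag{9}$$

The bound is attained when $X$ and $Y$ are jointly Gaussian.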

A remark is in order regarding the uniqueness of an estimator achieving (8). As we discuss below, what defines an optimal estimator $\hat{X}$ is its joint distribution with $X^{*}$ . This joint distribution may not be unique, in which case the optimal estimator is not unique. Moreover, even if $p_{\hat{X}X^{*}}$ is unique, the uniqueness of the estimator is not guaranteed because there may be different conditional distributions $p_{\hat{X}|Y}$ that lead to the same optimal $p_{\hat{X}X^{*}}$ . In other words, given the optimal $p_{\hat{X}X^{*}}$ , one can choose any joint probability $p_{\hat{X}YX^{*}}$ that has marginals $p_{\hat{X}X^{*}}$ and $p_{YX^{*}}$ . One option is to take the estimator $\hat{X}$ to be a (possibly stochastic) transformation of $X^{*}$ , namely $p_{\hat{X}|Y}=p_{\hat{X}|X^{*}}p_{X^{*}|Y}$ . But there may be other options. In cases where either $Y$ or $\hat{X}$ is a deterministic transformation of $X^{*}$ (e.g. when $X^{*}$ has a density, or is an invertible function of $Y$ ), there is a unique joint distribution $p_{\hat{X}YX^{*}}$ with the given marginals (2, Lemma 5.3.2). In this case, if $p_{\hat{X}X^{*}}$ is unique then so is the estimator $p_{\hat{X}|Y}$ .

Under the settings of image restoration, many methods encourage diversity in their output by adding randomness (15; 3; 21). In our setting, we may ask under what conditions there exists an optimal estimator $\hat{X}$ which is a deterministic function of $Y$ . For example, when $p_{Y}=\delta_{0}$ but $X$ has some non-atomic distribution, it is clear that no deterministic function of $Y$ can attain perfect perceptual quality. It turns out that a sufficient condition for the optimal $\hat{X}$ to be a deterministic function of $Y$ is that $X^{*}$ have a density. We discuss this in App. B and explicitly illustrate it in the Gaussian case (see Sec. 3.3), where if $X^{*}$ has a non-singular covariance matrix then $\hat{X}$ is a deterministic function of $Y$ .

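In particular, at perfect perceptual quality ( $P=0$ ) the DP function satisfies

$$D(0)=D^{*}+(P^{*})^{2}\leq 2D^{*},$$

and the upper bound is attained when $(P^{*})^{2}=D^{*}$ . To see when this happens, observe that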

3.2 Optimal estimators

While Theorem 1 reveals the shape of the DP function, it does not provide a recipe for constructing optimal estimators on the DP tradeoff. We now discuss the nature of such estimators.

Note that the objective in (12) depends on the MSE between $\hat{X}$ and $X^{*}$ , so that we can perform the minimization on $p_{\hat{X}|X^{*}}$ rather than on $p_{\hat{X}|Y}$ (once we determine the optimal $p_{\hat{X}|X^{*}}$ we can construct a consistent $p_{\hat{X}|Y}$ as discussed above).

Now, let us start by examining the leftmost side of the curve $D(P)$ , which corresponds to a perfect perceptual quality estimator (i.e. $P=0$ ). In this case, the constraint becomes $p_{\hat{X}}=p_{X}$ . Therefore,

Let $\hat{X}_{0}$ be an estimator achieving perception index $0$ and MSE $D(0)$ . Then its joint distribution with $X^{*}$ attains the optimum in the definition of $W_{2}(p_{X},p_{X^{*}})$ . Namely, $p_{\hat{X}_{0}X^{*}}$ is an optimal plan between $p_{X}$ and $p_{X^{*}}$ .

Having understood the estimator $\hat{X}_{0}$ at the leftmost end of the tradeoff, we now turn to study optimal estimators for arbitrary $P$ . Interestingly, we can show that Problem (12) is equivalent to (see App. B)

Namely, an optimal $p_{\hat{X}}$ is closest to $p_{X^{*}}$ among all distributions within a ball of radius $P$ around $p_{X}$ , as illustrated in Fig. 1. Moreover, $p_{\hat{X}X^{*}}$ is an optimal plan between $p_{\hat{X}}$ and $p_{X^{*}}$ . As it turns out, this somewhat abstract viewpoint leads to a rather practical construction for $\hat{X}$ from the estimators $\hat{X}_{0}$ and $X^{*}$ at the two extremes of the tradeoff. Specifically, we have the following result, proved in App. B.

Theorem 3. Let $\hat{X}_{0}$ be an estimator achieving perception index $0$ and MSE $D(0)$ . Then for any $P\in[0,P^{*}]$ , the estimator

$$\hat{X}_{P}=\left(1-\frac{P}{P^{*}}\right)\hat{X}_{0}+\frac{P}{P^{*}}X^{*}\tag{15}$$

is optimal for perception index $P$ . Namely, it achieves perception index $P$ and distortion $D(P)$ .
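The interpolation $\hat{X}_{P}=(1-P/P^{*})\hat{X}_{0}+(P/P^{*})X^{*}$ (see the proof in App. B) can be illustrated with a simple Monte Carlo sketch in a scalar Gaussian model. Everything below is an assumption made for illustration: the model $Y=X+N$ with unit variances, the sample size, and the choice $\hat{X}_{0}=X^{*}/\sqrt{\rho}$, which is the monotone (hence optimal) one-dimensional map from $p_{X^{*}}$ to $p_{X}$. For zero-mean scalar Gaussians the Wasserstein-2 distance is the absolute difference of standard deviations, which is what the code uses as the perception index.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
sigma_x, sigma_n = 1.0, 1.0  # hypothetical signal and noise standard deviations

x = sigma_x * rng.standard_normal(n)
y = x + sigma_n * rng.standard_normal(n)

rho = sigma_x**2 / (sigma_x**2 + sigma_n**2)
x_star = rho * y                 # minimum MSE estimator E[X|Y]
x_hat0 = x_star / np.sqrt(rho)   # rescaled so that its variance equals sigma_x^2

d_star = sigma_x**2 * (1.0 - rho)           # MSE of x_star
p_star = sigma_x * (1.0 - np.sqrt(rho))     # W2 between N(0, sx^2) and N(0, rho sx^2)


def interpolate(p):
    """Estimator (1 - P/P*) x_hat0 + (P/P*) x_star for P in [0, P*]."""
    w = p / p_star
    return (1.0 - w) * x_hat0 + w * x_star


results = []
for p in (0.0, 0.5 * p_star, p_star):
    x_hat = interpolate(p)
    mse = float(np.mean((x - x_hat) ** 2))
    perception = float(abs(sigma_x - np.std(x_hat)))  # Gaussian closed form
    theory = d_star + (p_star - p) ** 2               # D* + (P* - P)^2
    results.append((p, perception, mse, theory))
```

Each tuple in `results` pairs the target $P$ with the empirical perception index, the empirical MSE, and the value predicted by the quadratic DP function; up to sampling error the empirical quantities should match the targets.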

Theorem 3 has important implications for perceptual signal restoration. For example, in the task of image super-resolution, there exist many deep network based methods that achieve a low MSE (13; 27; 24). These provide an approximation for $X^{*}$ . Moreover, there is an abundance of methods that achieve good perceptual quality at the price of a reasonable degradation in MSE (often by incorporating a GAN-based loss) (12; 29; 23). These constitute approximations for $\hat{X}_{0}$ . However, achieving results that strike other prescribed balances between MSE and perceptual quality commonly requires training a different model for each setting. Shoshan et al. (25) and Navarrete Michelini et al. (17) tried to address this difficulty by introducing new training techniques that allow traversing the distortion-perception tradeoff at test time. But, interestingly, Theorem 3 shows that in our setting such specialized training methods are not required. Having a model that leads to low MSE and one that leads to good perceptual quality, it is possible to construct any other estimator on the DP tradeoff, by simply averaging the outputs of these two models with appropriate weights. We illustrate this in Sec. 5.

3.3 The Gaussian setting

When $X$ and $Y$ are jointly Gaussian, it is well known that the minimum MSE estimator $X^{*}$ is a linear function of the measurements $Y$ . However, it is not a-priori clear whether all estimators along the DP tradeoff are linear in this case, and what kind of randomness they possess. As we now show, equipped with Theorem 3, we can obtain closed form expressions for optimal estimators for any $P$ . For simplicity, we assume here that $X$ and $Y$ have zero means and that $\Sigma_{X},\Sigma_{Y}\succ 0$ .

It is instructive to start by considering the simple case, where $\Sigma_{X^{*}}$ is non-singular (in Theorem 4 below we address the more general case of a possibly singular $\Sigma_{X^{*}}$ ). It is well known that

$$X^{*}=\Sigma_{XY}\Sigma_{Y}^{-1}Y,\qquad\Sigma_{X^{*}}=\Sigma_{XY}\Sigma_{Y}^{-1}\Sigma_{YX}.\tag{16}$$

Now, since we assumed that $\Sigma_{X},\Sigma_{X^{*}}\succ 0$ , we have from Theorem 2 and (6),(7) that

Finally, we know that $P^{*}=G^{*}$ , which is given by the left-hand side of (11). Substituting these expressions into (15), we obtain that an optimal estimator for perception $P\in[0,G^{*}]$ is given by

As can be seen, this optimal estimator is a deterministic linear transformation of $Y$ for any $P$ .

The setting just described does not cover the case where $Y$ is of lower dimensionality than $X$ because in that case $\Sigma_{X^{*}}$ is necessarily singular (it is an $n_{x}\times n_{x}$ matrix of rank at most $n_{y}$ ; see (16)). In this case, any deterministic linear function of $Y$ would result in an estimator $\hat{X}$ whose covariance has rank at most $n_{y}$ . Obviously, the distribution of such an estimator cannot be arbitrarily close to that of $X$ , whose covariance has rank $n_{x}$ . What is the optimal estimator in this more general setting, then?

Assume $X$ and $Y$ are zero-mean jointly Gaussian random vectors with $\Sigma_{X},\Sigma_{Y}\succ 0$ . Denote $T^{*}\triangleq T_{p_{X}\rightarrow p_{X^{*}}}=\Sigma_{X}^{-1/2}(\Sigma_{X}^{1/2}\Sigma_{X^{*}}\Sigma_{X}^{1/2})^{1/2}\Sigma_{X}^{-1/2}$ . Then for any $P\in[0,G^{*}]$ , an estimator with perception index $P$ and MSE $D(P)$ can be constructed as

where $W$ is a zero-mean Gaussian noise with covariance $\Sigma_{W}=\Sigma_{X}^{1/2}(I-\Sigma_{X}^{1/2}T^{*}\Sigma_{X^{*}}^{\dagger}T^{*}\Sigma_{X}^{1/2})\Sigma_{X}^{1/2}$ , which is independent of $Y,X$ , and $\Sigma_{X^{*}}^{\dagger}$ is the pseudo-inverse of $\Sigma_{X^{*}}$ .
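The role of the noise covariance $\Sigma_{W}$ can be checked numerically in a case where $Y$ has lower dimension than $X$. The sketch below assumes a hypothetical linear model $Y=HX+N$ (the matrices `sigma_x`, `h` and `sigma_n` are made up for illustration) and uses the cross-covariance $\Sigma_{\hat{X}_{0}Y}=\Sigma_{X}T^{*}\Sigma_{X^{*}}^{\dagger}\Sigma_{XY}$ from App. B.3. It verifies that $\Sigma_{W}$ is positive semi-definite and that the resulting perfect perception estimator $\hat{X}_{0}=\Sigma_{\hat{X}_{0}Y}\Sigma_{Y}^{-1}Y+W$ has covariance $\Sigma_{X}$.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


# Hypothetical model Y = H X + N with n_x = 3 and n_y = 2.
sigma_x = np.array([[2.0, 0.3, 0.0],
                    [0.3, 1.0, 0.2],
                    [0.0, 0.2, 1.5]])
h = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])
sigma_n = 0.5 * np.eye(2)

sigma_xy = sigma_x @ h.T
sigma_y = h @ sigma_x @ h.T + sigma_n
# Covariance of X* = Sigma_XY Sigma_Y^{-1} Y; its rank is at most n_y = 2.
sigma_xstar = sigma_xy @ np.linalg.solve(sigma_y, sigma_xy.T)

root_x = psd_sqrt(sigma_x)
inv_root_x = np.linalg.inv(root_x)
t_star = inv_root_x @ psd_sqrt(root_x @ sigma_xstar @ root_x) @ inv_root_x
# Explicit cutoff (second positional argument) so the numerically zero
# singular value of sigma_xstar is treated as exactly zero.
pinv_xstar = np.linalg.pinv(sigma_xstar, 1e-10)

# Noise covariance as given in the theorem statement.
inner = np.eye(3) - root_x @ t_star @ pinv_xstar @ t_star @ root_x
sigma_w = root_x @ inner @ root_x

# Covariance of X_hat_0 = Sigma_{X0 Y} Sigma_Y^{-1} Y + W.
sigma_x0y = sigma_x @ t_star @ pinv_xstar @ sigma_xy
sigma_x0 = sigma_x0y @ np.linalg.solve(sigma_y, sigma_x0y.T) + sigma_w
```

In this example `sigma_x0` should match `sigma_x` up to round-off, while `sigma_w` is non-zero: the deterministic part alone has a rank-2 covariance, and the noise supplies the missing direction.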

Note that in this case, we indeed have a random noise component that shapes the covariance of $\hat{X}_{P}$ to become closer to $\Sigma_{X}$ as $P$ gets closer to $0$ . It can be shown (see App. B) that when $\Sigma_{X^{*}}$ is invertible, $\Sigma_{W}=0$ and (19) reduces to (18). Also note that, as in (18), the dependence of $\hat{X}_{P}$ on $Y$ in (19) is only through $X^{*}=\Sigma_{XY}\Sigma_{Y}^{-1}Y$ .

As mentioned in Sec. 3.1, the optimal estimator is generally not unique. Interestingly, in the Gaussian setting we can explicitly characterize a set of optimal estimators.

and $W_{0}$ be a zero-mean Gaussian noise with covariance

that is independent of $X,Y$ . Then, for any $P\in[0,G^{*}]$ , an optimal estimator with perception index $P$ can be obtained by

The estimator given in (19) is one solution to (20)-(21), but is generally not unique.

A geometric perspective on the distortion-perception tradeoff

and it follows that $W_{2}(\gamma_{t},\gamma)=tW_{2}(\gamma,\mu)$ and $W_{2}(\gamma_{t},\mu)=(1-t)W_{2}(\gamma,\mu)$ . Furthermore, if $\gamma_{t},t\in[0,1]$ is a constant-speed geodesic with $\gamma_{0}=\gamma,\gamma_{1}=\mu$ , then the optimal plans between $\gamma,\gamma_{t}$ and between $\gamma_{t},\mu$ are given by

It is worth mentioning that this geometric interpretation is simplified under some common settings. For example, when $\gamma$ is absolutely continuous (w.r.t. the Lebesgue measure), we have a measurable map $T_{\gamma\rightarrow\mu}$ which is the solution to the optimal transport problem with the quadratic cost (19, Thm 1.6.2, p.16). The geodesic (23) then takes the form

Therefore, in our setting, if $\gamma=p_{X^{*}}$ has a density, then we can obtain $\hat{X}_{P}$ by the deterministic transformation $[X^{*}+\left(1-\frac{P}{P^{*}}\right)\left(T_{p_{X^{*}}\rightarrow p_{X}}(X^{*})-X^{*}\right)]$ (see Remark about randomness in Sec. 3.1).

Numerical illustration

In this Section we evaluate $12$ super-resolution algorithms on the BSD100 dataset (16). All codes are freely available and provided by the authors, and the BSD100 dataset is free to download for non-commercial research. The evaluated algorithms include EDSR (13), ESRGAN (29), SinGAN (23), ZSSR (24), DIP (27), SRResNet variants which optimize MSE and VGG$_{2,2}$, SRGAN variants which optimize MSE, VGG$_{2,2}$ and VGG$_{5,4}$ in addition to an adversarial loss (12), and ENet (22) (“PAT” and “E” variants). Low resolution images were obtained by $4\times$ downsampling using a bicubic kernel.

In Figure 2 we plot each method on the distortion-perception plane. Specifically, we consider natural (and reconstructed) images to be stationary random sources, and use $9\times 9$ patches ( $1.6\times 10^{6}$ patches in total) from the RGB images to empirically estimate the mean and covariance matrix for the ground-truth images, and for the reconstructions produced by each method. We then use the estimated Gelbrich distance (4) between the patch distribution of each method and that of the ground-truth images as a perceptual quality index. Recall that the Gelbrich distance is a lower bound on the Wasserstein distance.
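The patch-based perception index described above can be sketched as follows. This is not the code used for the experiments. It assumes non-overlapping patches (the text does not specify the patch stride), uses the standard form of the Gelbrich distance, $G^{2}=\|m_{1}-m_{2}\|^{2}+\mathrm{Tr}\{\Sigma_{1}+\Sigma_{2}-2(\Sigma_{1}^{1/2}\Sigma_{2}\Sigma_{1}^{1/2})^{1/2}\}$, and runs on synthetic random arrays rather than real images. All function names are hypothetical.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def extract_patches(image, size=9):
    """Flattened non-overlapping size x size patches of an H x W x C image."""
    h, w, c = image.shape
    rows, cols = h // size, w // size
    cropped = image[: rows * size, : cols * size]
    blocks = cropped.reshape(rows, size, cols, size, c).swapaxes(1, 2)
    return blocks.reshape(rows * cols, size * size * c)


def patch_statistics(images, size=9):
    """Empirical mean and covariance of all patches in a list of images."""
    patches = np.concatenate([extract_patches(img, size) for img in images])
    return patches.mean(axis=0), np.cov(patches, rowvar=False)


def gelbrich_distance(m1, s1, m2, s2):
    """Gelbrich distance between (m1, S1) and (m2, S2)."""
    root1 = psd_sqrt(s1)
    cross = psd_sqrt(root1 @ s2 @ root1)
    squared = np.sum((m1 - m2) ** 2) + np.trace(s1 + s2 - 2.0 * cross)
    return float(np.sqrt(max(squared, 0.0)))  # clamp round-off below zero


# Toy usage on synthetic arrays standing in for images (not real data).
rng = np.random.default_rng(0)
truth = [rng.random((36, 45, 3)) for _ in range(20)]
recon = [0.5 * img + 0.25 for img in truth]  # a stand-in "reconstruction"
m_t, s_t = patch_statistics(truth, size=3)
m_r, s_r = patch_statistics(recon, size=3)
perception_index = gelbrich_distance(m_t, s_t, m_r, s_r)
```

With $9\times 9$ RGB patches each row of the patch matrix has $243$ entries, so the covariance is $243\times 243$ and a large number of patches is needed for a stable estimate.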

We consider EDSR (13) to be the best MSE estimator $X^{*}$ since it achieves the lowest distortion among the evaluated methods. We therefore estimate the lower bound (9) as

$$\hat{D}(P)=D_{\text{EDSR}}+\left[(P_{\text{EDSR}}-P)_{+}\right]^{2},$$

where $D_{\text{EDSR}}$ is the MSE of EDSR, and $P_{\text{EDSR}}$ is the estimated Gelbrich distance between EDSR reconstructions and ground-truth images. Note the unoccupied region under the estimated curve in Figure 2, which is indeed unattainable according to the theory.

We also present 11 estimators $\hat{X}_{t}$ which we construct by interpolation between EDSR and ESRGAN (29), $\hat{X}_{t}=tX_{\text{EDSR}}+(1-t)X_{\text{ESRGAN}}$ . We observe (Figure 2) that estimators constructed using these two extreme points are closer to the optimal DP tradeoff than the evaluated methods. Also note that since ESRGAN does not attain a perception index of $0$ , we are able in practice to use negative values $t<0$ to extrapolate estimators with better perceptual quality, $\hat{X}_{-0.05}$ and $\hat{X}_{-0.1}$ . In Figure 3 we present a visual comparison between SRGAN-VGG$_{2,2}$ (12) and our interpolated estimator $\hat{X}_{0.12}$ . Both achieve roughly the same RMSE distortion ( $18.09$ for SRGAN, $18.15$ for $\hat{X}_{0.12}$ ), but our estimator achieves a lower perception index. Namely, by using interpolation, we manage to improve perceptual quality without degrading distortion. The improvement in visual quality is also apparent in the figure. Additional visual comparisons can be found in the Appendix.
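The per-pixel interpolation $\hat{X}_{t}=tX_{\text{EDSR}}+(1-t)X_{\text{ESRGAN}}$ amounts to a weighted average of two output arrays. A minimal sketch is given below. The function name `interpolate_outputs` is hypothetical, the inputs are assumed to be arrays of identical shape holding the two reconstructions, and no clipping to a valid pixel range is applied since the text does not specify any.

```python
import numpy as np


def interpolate_outputs(x_edsr, x_esrgan, t):
    """Per-pixel X_t = t * X_EDSR + (1 - t) * X_ESRGAN.

    t = 1 returns the low-distortion output, t = 0 the high perceptual
    quality output, and t < 0 extrapolates beyond the latter.
    """
    x_edsr = np.asarray(x_edsr, dtype=np.float64)
    x_esrgan = np.asarray(x_esrgan, dtype=np.float64)
    if x_edsr.shape != x_esrgan.shape:
        raise ValueError("both reconstructions must have the same shape")
    return t * x_edsr + (1.0 - t) * x_esrgan


# Toy arrays standing in for the two reconstructions of one image.
low_distortion = np.full((2, 2, 3), 100.0)
high_perception = np.full((2, 2, 3), 120.0)
blended = interpolate_outputs(low_distortion, high_perception, 0.12)
extrapolated = interpolate_outputs(low_distortion, high_perception, -0.1)
```

Because the combination happens at the output level, no retraining is involved: a single pair of forward passes yields every working point along the interpolation line.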

Conclusion

In this paper we provide a full characterization of the distortion-perception tradeoff for the MSE distortion and the Wasserstein- $2$ perception index. We show that optimal estimators are obtained by interpolation between the minimum MSE estimator and an optimal perfect perception quality estimator. In the Gaussian case, we explicitly formulate these estimators. To the best of our knowledge, this is the first work to derive such closed-form expressions. Our work paves the way towards fully understanding the DP tradeoff under more general distortions and perceptual criteria, and bridging between fidelity and visual quality at test-time, without training different models.

References

Appendix A Background and extensions

In Sec. 2 of the main text we presented the setting of Euclidean space for simplicity. For the sake of completeness, we present here a more general setup.

The problem of finding a perfect perceptual quality estimator can be now written as an optimal transport problem

A.2 The optimal transportation problem

In the Monge formulation, we search for an optimal transformation, often referred to as an optimal map, $T:\mathcal{Y}\rightarrow\mathcal{X}$ minimizing

Note that the Monge problem seeks a deterministic map, and might not have a solution.

In the Kantorovich formulation, we wish to find a probability measure $q=q_{XY}$ on $\mathcal{X}\times\mathcal{Y}$ , minimizing

$\Pi$ is the set of probabilities on $\mathcal{X}\times\mathcal{Y}$ with marginals $q^{(x)},p^{(y)}$ . A probability minimizing (32) is called an optimal plan, and we denote $q\in\Pi_{o}(q^{(x)},p^{(y)})$ . Note that when $\rho(x,y)=d^{p}(x,y)$ and $d(x,y)$ is a metric, taking $\inf$ over (32) yields the Wasserstein distance $W_{p}^{p}(q^{(x)},p^{(y)})$ induced by $d(x,y)$ .
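For intuition, the Kantorovich problem has a particularly simple solution for empirical measures on the real line with cost $|x-y|^{p}$, $p\geq 1$: when both samples have the same size, matching the sorted values is an optimal plan (and also a Monge map). The sketch below relies on this well-known one-dimensional fact. It is not a general-purpose solver, and the function name `wasserstein_1d` is hypothetical.

```python
import numpy as np


def wasserstein_1d(x, y, p=2):
    """W_p between two equal-size empirical measures on the real line.

    Sorting both samples and pairing them in order is an optimal plan
    for the cost |x - y|^p with p >= 1.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes samples of equal size")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))


# Pairing in sorted order: (0, 1), (1, 2), (2, 3), each at distance 1.
distance = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
```

In higher dimensions no such shortcut exists in general, and the minimization over $\Pi$ becomes a linear program over couplings.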

A.3 Optimal maps between Gaussian measures

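If $\Sigma_{1}$ and $\Sigma_{2}$ are non-singular, then the distribution attaining the optimum in (3) corresponds to $U\sim\mathcal{N}(0,\Sigma_{1})$ and $V=T_{1\rightarrow 2}U$ , where

$$T_{1\rightarrow 2}=\Sigma_{1}^{-1/2}\left(\Sigma_{1}^{1/2}\Sigma_{2}\Sigma_{1}^{1/2}\right)^{1/2}\Sigma_{1}^{-1/2}$$

is the optimal transformation pushing forward from $\mathcal{N}(0,\Sigma_{1})$ to $\mathcal{N}(0,\Sigma_{2})$ [Knott and Smith, 1984]. This transformation satisfies $\Sigma_{2}=T_{1\rightarrow 2}\Sigma_{1}T_{1\rightarrow 2}.$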

When distributions are singular, we have the following.

Appendix B Proof of main results

In this Section we provide proofs of the main results of this paper. In Lemmas 2 and 3 we present some alternative representations for $D(P)$ . In Lemma 4 we obtain a lower bound on $D(P)$ . We then prove Theorem 3 (via a more general result given by Lemma 5), where the lower bound of Lemma 4 is attained. Equipped with Theorem 3, we prove Theorem 1, which is the main result of our paper.

Since in our case $\hat{X}$ is independent of $X$ given $Y$ , we show that the third term vanishes.

Next, we express $D(P)$ in terms of the Wasserstein distance between $p_{\hat{X}}$ and $p_{X^{*}}$ .

Denote $W_{2}^{2}(\mathcal{B}_{P},p_{X^{*}})=\min_{p_{\hat{X}}:W_{2}(p_{\hat{X}},p_{X})\leq P}W^{2}_{2}(p_{\hat{X}},p_{X^{*}})$ , where $\mathcal{B}_{P}$ is the ball of radius $P$ around $p_{X}$ in Wasserstein space.

For every $p_{\hat{X}|Y}$ whose marginal attains $W_{2}(p_{\hat{X}},p_{X})\leq P$ we have,

which leads to $D(P)\geq D^{*}+W_{2}^{2}(\mathcal{B}_{P},p_{X^{*}})$ .

Taking the minimum over $p_{\hat{X}}$ yields $D(P)\leq D^{*}+W_{2}^{2}(\mathcal{B}_{P},p_{X^{*}})$ . Combining the upper and lower bounds, we obtain the desired result. ∎

For the proof of Theorem 3, we first prove the following

For every estimator satisfying $W_{2}(p_{\hat{X}},p_{X})\leq P$ , we have from the triangle inequality

Theorem 3. Let $\hat{X}_{0}$ be an estimator achieving perception index $0$ and MSE $D(0)$ . Then for any $P\in[0,P^{*}]$ , the estimator

$$\hat{X}_{P}=\left(1-\frac{P}{P^{*}}\right)\hat{X}_{0}+\frac{P}{P^{*}}X^{*}$$

is optimal for perception index $P$ , namely, it achieves perception index $P$ and distortion $D(P)$ .

Let us prove a stronger result, from which Theorem 3 will follow.

where the equality is based on (42). A direct calculation of the distortion yields

When ${X^{*}}$ has a density, $\hat{X}_{0}$ (hence $\hat{X}_{P}$ ) can be obtained via a deterministic transformation of $Y$ .

Since the distribution of $X^{*}$ is absolutely continuous, we have an optimal map $T_{p_{X^{*}}\rightarrow p_{X}}$ between the distributions of $X^{*}$ and $X$ (see discussion in App. A.2). Namely, we have that $\hat{X}_{0}=T_{p_{X^{*}}\rightarrow p_{X}}(X^{*})$ is an optimal estimator with perception index $0$ . Thus, according to (15), $\hat{X}_{P}=\left(1-\frac{P}{P^{*}}\right)T_{p_{X^{*}}\rightarrow p_{X}}(X^{*})+\frac{P}{P^{*}}X^{*}$ are optimal estimators, which in this case are given by a deterministic function of $Y$ . ∎

B.2 Proof of Theorem 1

With Theorem 3 and Lemma 5 in hand, we are now ready to prove our main result.

Furthermore, an estimator achieving perception index $P$ and distortion $D(P)$ can always be constructed by applying a (possibly stochastic) transformation to $X^{*}$ .

hence $D(P)\leq D^{*}+\left[(P^{*}-P)_{+}\right]^{2}$ . On the other hand, we have (Lemma 4) $D(P)\geq D^{*}+\left[(P^{*}-P)_{+}\right]^{2}$ , which completes the proof. ∎

B.3 The Gaussian setting

In this Section we prove Theorems 4 and 5. We begin by proving Theorem 5, and then show that Theorem 4 follows as a special case. Recall that

and $W_{0}$ be a zero-mean Gaussian noise with covariance

that is independent of $Y,X$ . Then, for any $P\in[0,G^{*}]$ , an optimal estimator with perception index $P$ can be obtained by

The estimator given in (50) is one solution to (46)-(47), but it is generally not unique.

(Theorem 5) Let $\hat{X}_{0}\triangleq\Sigma_{\hat{X}_{0}Y}\Sigma_{Y}^{-1}Y+W_{0}$ where $\Sigma_{\hat{X}_{0}Y}$ satisfies (46)-(47). It is easy to see that $\hat{X}_{0}\sim\mathcal{N}(0,\Sigma_{X})$ and it is jointly Gaussian with $(X,Y,X^{*})$ . We have by (46)

Summarizing, $\hat{X}_{0}$ is an optimal perfect perception quality estimator. Note that (48) can be written as

and by Theorem 3 we have that it is an optimal estimator. ∎

Before proceeding to the proof of Theorem 4, let us introduce some auxiliary facts.

The following Lemma is a reminder of Schur’s Complement and its properties.

[Schur’s complement]. Let $\Sigma=\left[\begin{array}[]{cc}A&B\\ B^{T}&C\end{array}\right]$ be a symmetric matrix where $A$ is PD. Then $\nicefrac{{\Sigma}}{{A}}\triangleq C-B^{T}A^{-1}B$ is the Schur complement of $\Sigma$ , and we have that $\Sigma$ is PSD iff $\nicefrac{{\Sigma}}{{A}}$ is PSD.

Theorem 4. Assume $X$ and $Y$ are zero-mean jointly Gaussian random vectors with $\Sigma_{X},\Sigma_{Y}\succ 0$ . Then for any $P\in[0,G^{*}]$ , an estimator with perception index $P$ and MSE $D(P)$ can be constructed as

We observe that (50) is a special case of (48), where $\Sigma_{\hat{X}_{0}Y}=\Sigma_{Y\hat{X}_{0}}^{T}=\Sigma_{X}^{\frac{1}{2}}\left(\Sigma_{X}^{\frac{1}{2}}\Sigma_{X^{*}}\Sigma_{X}^{\frac{1}{2}}\right)^{\frac{1}{2}}\Sigma_{X}^{-\frac{1}{2}}\Sigma_{X^{*}}^{\dagger}\Sigma_{XY}$ . We now show that $\Sigma_{\hat{X}_{0}Y}$ has the desired properties (46)-(47). By substitution,

Recall $\Sigma_{X^{*}}^{\dagger}\Sigma_{X^{*}}\Sigma_{X^{*}}^{\dagger}=\Sigma_{X^{*}}^{\dagger}$ , and we denote $T^{*}=\Sigma_{X}^{-\frac{1}{2}}\left(\Sigma_{X}^{\frac{1}{2}}\Sigma_{X^{*}}\Sigma_{X}^{\frac{1}{2}}\right)^{\frac{1}{2}}\Sigma_{X}^{-\frac{1}{2}}$ . We now have

Since $\Sigma_{X},\Sigma_{Y}\succ 0$ , (51) is Schur’s complement of $\begin{bmatrix}\Sigma_{X}&\Sigma_{\hat{X}_{0}Y}\\ \Sigma_{Y\hat{X}_{0}}&\Sigma_{Y}\end{bmatrix}\succeq 0$ , yielding

In the case where $\Sigma_{X^{*}}$ is invertible, $\Sigma_{\hat{X}_{0}Y}=\Sigma_{X}T^{*}\Sigma_{X^{*}}^{-1}\Sigma_{XY}$ in the proof of Theorem 4, and it is easy to see that the noise covariance is $\Sigma_{W}=0$ . In this case $\Sigma_{\hat{X}_{0}Y}$ is the unique solution to (46)-(47). This means that $\hat{X}_{0}$ (hence $\hat{X}_{P}$ ) is a deterministic function of $Y$ .

We first show $\Sigma_{W}=0$ . Let $M_{P}=\Sigma_{\hat{X}_{0}Y}=\Sigma_{X}T^{*}\Sigma_{X^{*}}^{-1}\Sigma_{XY}$ , then

Now, assume $M$ is a solution to (46)-(47), then $M_{\Delta}=M_{P}-M$ satisfies $M_{\Delta}\Sigma_{Y}^{-1}\Sigma_{YX}=0$ and

But, $M_{\Delta}\Sigma_{Y}^{-1}M_{P}^{T}=(M_{\Delta}\Sigma_{Y}^{-1}\Sigma_{YX})\Sigma_{X^{*}}^{-1}T^{*}\Sigma_{X}=0$ and $\Sigma_{X}-M_{P}\Sigma_{Y}^{-1}M_{P}^{T}=0$ , yielding $M_{\Delta}\Sigma_{Y}^{-1}M_{\Delta}^{T}\preceq 0$ . Since $M_{\Delta}\Sigma_{Y}^{-1}M_{\Delta}^{T}$ is PSD and $\Sigma_{Y}^{-1}$ is PD, we conclude that $M_{\Delta}=0$ . ∎

Appendix C Settings with commuting covariances

In many practical problems, covariance matrices may satisfy the commutative relation $\Sigma_{X}\Sigma_{X^{*}}=\Sigma_{X^{*}}\Sigma_{X}$ . This is the case, for example, for circulant or large Toeplitz matrices [Gray, 2006]. For natural images this is a reasonable assumption since shift-invariance induces diagonalization by the Fourier basis [Unser, 1984].

In the Gaussian setting of Sec. 3.3, where $\Sigma_{X},\Sigma_{X^{*}}$ commute, it is easy to see that the Gelbrich distance between them can be written as

It is easy to see that estimators obtained by $\hat{X}_{0},X^{*}$ using (15) are Gaussian with zero mean and covariance $\Sigma_{P}$ , given by

Note that since the roots commute, $\Sigma_{P}$ commutes with $\Sigma_{X},\Sigma_{X^{*}}$ , and

This further reduces the geometry of the problem to the $l^{2}$ -distance between commuting matrices.
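The reduction to an $l^{2}$ distance can be checked numerically using a well-known identity for commuting PSD matrices: $\mathrm{Tr}\{\Sigma_{1}+\Sigma_{2}-2(\Sigma_{1}^{1/2}\Sigma_{2}\Sigma_{1}^{1/2})^{1/2}\}=\|\Sigma_{1}^{1/2}-\Sigma_{2}^{1/2}\|_{F}^{2}$. The sketch below builds two covariances that share a randomly drawn orthonormal eigenbasis (so they commute) with hypothetical eigenvalues, and compares both sides for the zero-mean case.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


rng = np.random.default_rng(0)
# A shared orthonormal eigenbasis makes the two covariances commute.
q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
sigma_x = (q * np.array([4.0, 2.0, 1.0, 0.5])) @ q.T      # hypothetical spectrum
sigma_xstar = (q * np.array([3.0, 1.0, 0.2, 0.1])) @ q.T  # hypothetical spectrum

root_x = psd_sqrt(sigma_x)
root_xstar = psd_sqrt(sigma_xstar)
cross = psd_sqrt(root_x @ sigma_xstar @ root_x)

# Squared Gelbrich distance between the two zero-mean covariances.
gelbrich_sq = float(np.trace(sigma_x + sigma_xstar - 2.0 * cross))
# Squared Frobenius distance between the matrix square roots.
frobenius_sq = float(np.sum((root_x - root_xstar) ** 2))
```

Both quantities equal the sum of $(\sqrt{a_{i}}-\sqrt{b_{i}})^{2}$ over the shared eigen-directions, where $a_{i}$ and $b_{i}$ are the corresponding eigenvalues of the two covariances.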

Appendix D Numerical illustration

For each algorithm, we acquire $100$ RGB images which are reconstructions of BSD100 images. We extract $9\times 9$ patches from the RGB images, and then estimate:

where $p_{i}$ is the $i$ -th patch (a $243$ -row vector, $N_{\text{patches}}=1,632,800$ ). We compute using (4)

The estimators $\hat{X}_{t}$ are constructed using per-pixel interpolation between EDSR and ESRGAN

D.2 Visual illustration

Here we present a visual comparison between SR methods and our constructed estimators, achieving roughly the same MSE but with a lower perception index. We also present EDSR, ESRGAN, the low-resolution input, and the ground-truth BSD100 images.