A Theory of the Distortion-Perception Tradeoff in Wasserstein Space

Dror Freirich, Tomer Michaeli, Ron Meir

Introduction

Image restoration covers some fundamental settings in image processing such as denoising, deblurring and super-resolution. Over the past few years, image restoration methods have demonstrated impressive improvements in both visual quality and distortion measures such as peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) (31). It was noticed, however, that improvement in accuracy, as measured by distortion, does not necessarily lead to improvement in visual quality, referred to as perceptual quality. Furthermore, the lower the distortion of an estimator, the more the distribution of its outputs generally deviates from the distribution of the signals it attempts to estimate. This phenomenon, known as the perception-distortion tradeoff (4), has captured significant attention, where it implies that faithfulness to ground truth images comes at the expense of perceptual quality, namely the deviation from statistics of natural images. Several works have extended the perception-distortion tradeoff to settings such as lossy compression (5) and classification (14).

Despite the increasing popularity of performing comparisons on the perception-distortion plane, the exact characterization of the minimal distortion that can be achieved under a given perception constraint remains an important open question. Although Blau and Michaeli (4) investigated the basic properties of this distortion-perception function, such as monotonicity and convexity, little is known about its precise nature. While a general answer to this question is unavailable, in this paper we derive a closed-form expression for the distortion-perception (DP) function for the mean squared-error (MSE) distortion and the Wasserstein-2 perception index.

Our main contributions are: (i) We prove that the DP function is always quadratic in the perception constraint $P$, regardless of the underlying distribution (Theorem 1). (ii) We show that it is possible to construct estimators on the DP curve from the estimators at the two extremes of the tradeoff (Theorem 3): the one that globally minimizes the MSE, and a minimizer of the MSE under a perfect perceptual quality constraint. The latter can be obtained as a stochastic transformation of the former. (iii) In the Gaussian setting, we further provide a closed-form expression for optimal estimators and for the corresponding DP curve (Theorems 4 and 5). We show that this Gaussian DP curve is a lower bound on the DP curve of any distribution having the same second-order statistics. Finally, we illustrate our results, numerically and visually, in a super-resolution setting in Section 5. The proofs of all the theorems in the main text are provided in Appendix B.

Our theoretical results shed light on several topics that are subject to much practical activity. Particularly, in the domain of image restoration, numerous works target perceptual quality rather than distortion (e.g. (29; 13; 12)). However, it has recently been recognized that generating a single reconstructed image often does not convey to the user the inherent ambiguity in the problem. Therefore, many recent works target diverse perceptual image reconstruction, by employing randomization among possible restorations (15; 3; 21; 1). Commonly, such works perform sampling from the posterior distribution of natural images given the degraded input image. This is done e.g. using priors over image patches (7), conditional generative models (18; 20), or implicit priors induced by deep denoiser networks (10). Theoretically, posterior sampling leads to perfect perceptual quality (the restored outputs are distributed like the prior). However, a fundamental question is whether this is optimal in terms of distortion. As we show in Section 3.1, posterior sampling is often not an optimal strategy, in the sense that there often exist perfect perceptual quality estimators that achieve lower distortion.

Another topic of practical interest is the ability to traverse the distortion-perception tradeoff at test time, without having to train a different model for each working point. Recently, interpolation has been suggested for controlling several objectives at test time. Shoshan et al. (25) propose using interpolation in some latent space in order to approximate intermediate objectives. Wang et al. (29) use per-pixel interpolation for balancing perceptual quality and fidelity. Studies of network parameter interpolation are presented by Wang et al. (29, 30). Deng (6) produces a low-distortion reconstruction and a high perceptual quality one, and then uses style transfer to combine them. An important question, therefore, is which strategy is optimal. In Section 3.2 we show that for the MSE–Wasserstein-2 tradeoff, linear interpolation leads to optimal estimators. We also discuss a geometric connection between interpolation and the fact that estimators on the DP curve form a geodesic in Wasserstein space.

Problem setting and preliminaries

In many practical cases, the goodness of an estimator is associated with two factors: (i) the degree to which $\hat{X}$ is close to $X$ on average (low distortion), and (ii) the degree to which the distribution of $\hat{X}$ is close to that of $X$ (good perceptual quality). An important question, then, is: what is the minimal distortion that can be achieved under a given level of perceptual quality, and how can we construct estimators that achieve this lower bound? In mathematical language, we are interested in analyzing the distortion-perception (DP) function (defined similarly to the perception-distortion function of (4))

As discussed in (4), the function $D(P)$ is monotonically non-increasing and is convex whenever $d_{p}(\cdot,\cdot)$ is convex in its second argument (which is the case for most popular divergences). However, without further concrete assumptions on the distortion measure $d(\cdot,\cdot)$ and the perception index $d_{p}(\cdot,\cdot)$, little can be said about the precise nature of $D(P)$.

Here, we focus our attention on the squared-error distortion $d(x,\hat{x})=\|x-\hat{x}\|^{2}$ and the Wasserstein-2 distance $d_{p}(p_{X},p_{\hat{X}})=W_{2}(p_{X},p_{\hat{X}})$, with which (1) reads

2.2 The Wasserstein and Gelbrich Distances

Before we present our main results, we briefly survey a few properties of the Wasserstein distance, mostly taken from (19). The Wasserstein-$p$ ($p\geq 1$) distance between measures $\mu$ and $\gamma$ on a separable Banach space $\mathcal{X}$ with norm $\|\cdot\|$ is defined by

where $\Pi(\mu,\gamma)$ is the set of all probabilities on $\mathcal{X}\times\mathcal{X}$ with marginals $\mu$ and $\gamma$. A joint probability $\nu$ achieving the optimum in (3) is often referred to as an optimal plan. The Wasserstein space of probability measures is defined as

and $W_{p}$ constitutes a metric on $\mathcal{W}_{p}(\mathcal{X})$.

is the optimal transformation pushing forward from $\mathcal{N}(0,\Sigma_{1})$ to $\mathcal{N}(0,\Sigma_{2})$ (11). This transformation satisfies $\Sigma_{2}=T_{1\rightarrow 2}\Sigma_{1}T_{1\rightarrow 2}$. For a discussion on singular distributions, please see App. A.
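As a numerical sanity check of the push-forward property above, the following sketch builds the map in NumPy. It assumes the closed form $T_{1\rightarrow 2}=\Sigma_{1}^{-1/2}(\Sigma_{1}^{1/2}\Sigma_{2}\Sigma_{1}^{1/2})^{1/2}\Sigma_{1}^{-1/2}$ (the same expression used for $T^{*}$ in Sec. 3.3) and a non-singular $\Sigma_{1}$. The helper names `sqrtm_psd` and `gaussian_ot_map` and the example matrices are illustrative choices, not taken from the paper.

```python
import numpy as np

def sqrtm_psd(a):
    """Symmetric PSD square root via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)          # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def gaussian_ot_map(sigma1, sigma2):
    """T = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}; S1 must be positive definite."""
    r1 = sqrtm_psd(sigma1)
    r1_inv = np.linalg.inv(r1)
    return r1_inv @ sqrtm_psd(r1 @ sigma2 @ r1) @ r1_inv

# Illustrative covariances (both positive definite).
sigma1 = np.array([[2.0, 0.5], [0.5, 1.0]])
sigma2 = np.array([[1.0, -0.3], [-0.3, 3.0]])

T = gaussian_ot_map(sigma1, sigma2)
pushed = T @ sigma1 @ T                # covariance of T U when U ~ N(0, sigma1)
```

Under these assumptions `pushed` should agree with `sigma2` up to floating-point error, and `T` should be symmetric.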

Main results

The DP function (2) depends, of course, on the underlying joint probability $p_{XY}$ of the signal $X$ and measurements $Y$. Our first key result is that this dependence can be expressed solely in terms of $D^{*}$ and $P^{*}$. In other words, knowing the distortion and perception index attained by the minimum MSE estimator $X^{*}$ suffices for determining $D(P)$ for any $P$.

The DP function (2) is given by

$$D(P)=D^{*}+\left[(P^{*}-P)_{+}\right]^{2},\qquad (8)$$

where $(x)_{+}=\max(0,x)$. Furthermore, an estimator achieving perception index $P$ and distortion $D(P)$ can always be constructed by applying a (possibly stochastic) transformation to $X^{*}$.
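For concreteness, the closed form established in App. B, $D(P)=D^{*}+[(P^{*}-P)_{+}]^{2}$, can be evaluated directly. The sketch below is illustrative only: the function name `dp_function` and the numerical values chosen for $D^{*}$ and $P^{*}$ are hypothetical.

```python
def dp_function(P, D_star, P_star):
    """Evaluate D(P) = D* + [(P* - P)_+]^2 for a perception budget P >= 0."""
    if P < 0:
        raise ValueError("the perception index P must be non-negative")
    return D_star + max(P_star - P, 0.0) ** 2

# Hypothetical values: D* is the MMSE, P* is the W2 distance between p_X and p_{X*}.
D_star, P_star = 1.0, 0.5
curve = [dp_function(P, D_star, P_star) for P in (0.0, 0.25, 0.5, 1.0)]
```

The curve decreases quadratically from $D^{*}+(P^{*})^{2}$ at $P=0$ and stays flat at $D^{*}$ once $P\geq P^{*}$.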

The bound is attained when $X$ and $Y$ are jointly Gaussian.

A remark is in order regarding the uniqueness of an estimator achieving (8). As we discuss below, what defines an optimal estimator $\hat{X}$ is its joint distribution with $X^{*}$. This joint distribution may not be unique, in which case the optimal estimator is not unique. Moreover, even if $p_{\hat{X}X^{*}}$ is unique, the uniqueness of the estimator is not guaranteed because there may be different conditional distributions $p_{\hat{X}|Y}$ that lead to the same optimal $p_{\hat{X}X^{*}}$. In other words, given the optimal $p_{\hat{X}X^{*}}$, one can choose any joint probability $p_{\hat{X}YX^{*}}$ that has marginals $p_{\hat{X}X^{*}}$ and $p_{YX^{*}}$. One option is to take the estimator $\hat{X}$ to be a (possibly stochastic) transformation of $X^{*}$, namely $p_{\hat{X}|Y}=p_{\hat{X}|X^{*}}p_{X^{*}|Y}$. But there may be other options. In cases where either $Y$ or $\hat{X}$ is a deterministic transformation of $X^{*}$ (e.g. when $X^{*}$ has a density, or is an invertible function of $Y$), there is a unique joint distribution $p_{\hat{X}YX^{*}}$ with the given marginals (2, Lemma 5.3.2). In this case, if $p_{\hat{X}X^{*}}$ is unique then so is the estimator $p_{\hat{X}|Y}$.

Under the settings of image restoration, many methods encourage diversity in their output by adding randomness (15; 3; 21). In our setting, we may ask under what conditions there exists an optimal estimator $\hat{X}$ which is a deterministic function of $Y$. For example, when $p_{Y}=\delta_{0}$ but $X$ has some non-atomic distribution, it is clear that no deterministic function of $Y$ can attain perfect perceptual quality. It turns out that a sufficient condition for the optimal $\hat{X}$ to be a deterministic function of $Y$ is that $X^{*}$ have a density. We discuss this in App. B and explicitly illustrate it in the Gaussian case (see Sec. 3.3), where if $X^{*}$ has a non-singular covariance matrix then $\hat{X}$ is a deterministic function of $Y$.

and the upper bound is attained when $(P^{*})^{2}=D^{*}$. To see when this happens, observe that

3.2 Optimal estimators

While Theorem 1 reveals the shape of the DP function, it does not provide a recipe for constructing optimal estimators on the DP tradeoff. We now discuss the nature of such estimators.

Note that the objective in (12) depends on the MSE between $\hat{X}$ and $X^{*}$, so that we can perform the minimization on $p_{\hat{X}|X^{*}}$ rather than on $p_{\hat{X}|Y}$ (once we determine the optimal $p_{\hat{X}|X^{*}}$ we can construct a consistent $p_{\hat{X}|Y}$ as discussed above).

Now, let us start by examining the leftmost side of the curve $D(P)$, which corresponds to a perfect perceptual quality estimator (i.e. $P=0$). In this case, the constraint becomes $p_{\hat{X}}=p_{X}$. Therefore,

Let $\hat{X}_{0}$ be an estimator achieving perception index $0$ and MSE $D(0)$. Then its joint distribution with $X^{*}$ attains the optimum in the definition of $W_{2}(p_{X},p_{X^{*}})$. Namely, $p_{\hat{X}_{0}X^{*}}$ is an optimal plan between $p_{X}$ and $p_{X^{*}}$.

Having understood the estimator $\hat{X}_{0}$ at the leftmost end of the tradeoff, we now turn to study optimal estimators for arbitrary $P$. Interestingly, we can show that Problem (12) is equivalent to (see App. B)

Namely, an optimal $p_{\hat{X}}$ is closest to $p_{X^{*}}$ among all distributions within a ball of radius $P$ around $p_{X}$, as illustrated in Fig. 1. Moreover, $p_{\hat{X}X^{*}}$ is an optimal plan between $p_{\hat{X}}$ and $p_{X^{*}}$. As it turns out, this somewhat abstract viewpoint leads to a rather practical construction for $\hat{X}$ from the estimators $\hat{X}_{0}$ and $X^{*}$ at the two extremes of the tradeoff. Specifically, we have the following result, proved in App. B.

Let $\hat{X}_{0}$ be an estimator achieving perception index $0$ and MSE $D(0)$. Then for any $P\in[0,P^{*}]$, the estimator

$$\hat{X}_{P}=\left(1-\frac{P}{P^{*}}\right)\hat{X}_{0}+\frac{P}{P^{*}}X^{*}\qquad (15)$$

is optimal for perception index $P$. Namely, it achieves perception index $P$ and distortion $D(P)$.
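The interpolation result can be checked by simulation in a scalar Gaussian toy model. The sketch below assumes the interpolation form $\hat{X}_{P}=(1-P/P^{*})\hat{X}_{0}+(P/P^{*})X^{*}$ (as it is used in App. B) together with the DP formula $D(P)=D^{*}+(P^{*}-P)^{2}$ for $P\le P^{*}$. The model ($X\sim\mathcal{N}(0,1)$, $Y=X+N$ with unit-variance noise), the sample size, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed toy model: X ~ N(0, 1), Y = X + N with N ~ N(0, 1) independent of X.
x = rng.standard_normal(n)
y = x + rng.standard_normal(n)

x_star = 0.5 * y                          # MMSE estimator E[X|Y] for this model
sigma_x, sigma_star = 1.0, np.sqrt(0.5)   # std of X and of X*
D_star = 0.5                              # E[(X - X*)^2]
P_star = sigma_x - sigma_star             # W2 between N(0, 1) and N(0, 1/2)

# In the scalar zero-mean Gaussian case the optimal map from p_{X*} to p_X is a scaling.
x0_hat = (sigma_x / sigma_star) * x_star

def interpolate(P):
    """X_P = (1 - P/P*) X_0 + (P/P*) X*, for 0 <= P <= P*."""
    a = P / P_star
    return (1.0 - a) * x0_hat + a * x_star

P = 0.5 * P_star
x_p = interpolate(P)
mse_empirical = np.mean((x - x_p) ** 2)
mse_theory = D_star + (P_star - P) ** 2
perception_empirical = sigma_x - np.std(x_p)   # W2 to N(0, 1) for a zero-mean Gaussian
```

In this model the empirical MSE should match `mse_theory` and the empirical perception index should match `P`, up to Monte Carlo error.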

Theorem 3 has important implications for perceptual signal restoration. For example, in the task of image super-resolution, there exist many deep network based methods that achieve a low MSE (13; 27; 24). These provide an approximation for $X^{*}$. Moreover, there is an abundance of methods that achieve good perceptual quality at the price of a reasonable degradation in MSE (often by incorporating a GAN-based loss) (12; 29; 23). These constitute approximations for $\hat{X}_{0}$. However, achieving results that strike other prescribed balances between MSE and perceptual quality commonly requires training a different model for each setting. Shoshan et al. (25) and Navarrete Michelini et al. (17) tried to address this difficulty by introducing new training techniques that allow traversing the distortion-perception tradeoff at test time. But, interestingly, Theorem 3 shows that in our setting such specialized training methods are not required. Having a model that leads to low MSE and one that leads to good perceptual quality, it is possible to construct any other estimator on the DP tradeoff by simply averaging the outputs of these two models with appropriate weights. We illustrate this in Sec. 5.

3.3 The Gaussian setting

When $X$ and $Y$ are jointly Gaussian, it is well known that the minimum MSE estimator $X^{*}$ is a linear function of the measurements $Y$. However, it is not a-priori clear whether all estimators along the DP tradeoff are linear in this case, and what kind of randomness they possess. As we now show, equipped with Theorem 3, we can obtain closed-form expressions for optimal estimators for any $P$. For simplicity, we assume here that $X$ and $Y$ have zero means and that $\Sigma_{X},\Sigma_{Y}\succ 0$.

It is instructive to start by considering the simple case, where $\Sigma_{X^{*}}$ is non-singular (in Theorem 4 below we address the more general case of a possibly singular $\Sigma_{X^{*}}$). It is well known that

Now, since we assumed that $\Sigma_{X},\Sigma_{X^{*}}\succ 0$, we have from Theorem 2 and (6),(7) that

Finally, we know that $P^{*}=G^{*}$, which is given by the left-hand side of (11). Substituting these expressions into (15), we obtain that an optimal estimator for perception $P\in[0,G^{*}]$ is given by

As can be seen, this optimal estimator is a deterministic linear transformation of $Y$ for any $P$.

The setting just described does not cover the case where $Y$ is of lower dimensionality than $X$ because in that case $\Sigma_{X^{*}}$ is necessarily singular (it is an $n_{x}\times n_{x}$ matrix of rank at most $n_{y}$; see (16)). In this case, any deterministic linear function of $Y$ would result in an estimator $\hat{X}$ whose covariance has rank at most $n_{y}$. Obviously, the distribution of such an estimator cannot be arbitrarily close to that of $X$, whose covariance has rank $n_{x}$. What is the optimal estimator in this more general setting, then?
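The rank argument above is easy to see numerically. The following sketch draws an arbitrary linear estimator $\hat{X}=AY$ and inspects the rank of its covariance $A\Sigma_{Y}A^{T}$. The dimensions, the matrix `A`, and the covariance `sigma_y` are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y = 4, 2                                  # Y has lower dimension than X

sigma_y = np.array([[2.0, 0.3], [0.3, 1.0]])     # covariance of Y (positive definite)
A = rng.standard_normal((n_x, n_y))              # arbitrary deterministic linear estimator X_hat = A Y

sigma_x_hat = A @ sigma_y @ A.T                  # covariance of X_hat, an n_x x n_x matrix
rank = np.linalg.matrix_rank(sigma_x_hat)        # cannot exceed n_y
```

Since `rank` never exceeds `n_y`, such an estimator is supported on a proper subspace and cannot match a full-rank $\Sigma_{X}$ without an additional noise term.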

Assume $X$ and $Y$ are zero-mean jointly Gaussian random vectors with $\Sigma_{X},\Sigma_{Y}\succ 0$. Denote $T^{*}\triangleq T_{p_{X}\rightarrow p_{X^{*}}}=\Sigma_{X}^{-1/2}(\Sigma_{X}^{1/2}\Sigma_{X^{*}}\Sigma_{X}^{1/2})^{1/2}\Sigma_{X}^{-1/2}$. Then for any $P\in[0,G^{*}]$, an estimator with perception index $P$ and MSE $D(P)$ can be constructed as

where $W$ is a zero-mean Gaussian noise with covariance $\Sigma_{W}=\Sigma_{X}^{1/2}(I-\Sigma_{X}^{1/2}T^{*}\Sigma_{X^{*}}^{\dagger}T^{*}\Sigma_{X}^{1/2})\Sigma_{X}^{1/2}$, which is independent of $Y,X$, and $\Sigma_{X^{*}}^{\dagger}$ is the pseudo-inverse of $\Sigma_{X^{*}}$.

Note that in this case, we indeed have a random noise component that shapes the covariance of $\hat{X}_{P}$ to become closer to $\Sigma_{X}$ as $P$ gets closer to $0$. It can be shown (see App. B) that when $\Sigma_{X^{*}}$ is invertible, $\Sigma_{W}=0$ and (19) reduces to (18). Also note that, as in (18), the dependence of $\hat{X}_{P}$ on $Y$ in (19) is only through $X^{*}=\Sigma_{XY}\Sigma_{Y}^{-1}Y$.
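The behaviour of the noise covariance can be probed numerically. The sketch below evaluates $\Sigma_{W}=\Sigma_{X}^{1/2}(I-\Sigma_{X}^{1/2}T^{*}\Sigma_{X^{*}}^{\dagger}T^{*}\Sigma_{X}^{1/2})\Sigma_{X}^{1/2}$ for one invertible and one rank-deficient $\Sigma_{X^{*}}$. The helper names, the eigenvalue tolerance used for the pseudo-inverse, and the example matrices are assumptions made for illustration.

```python
import numpy as np

def _apply_to_eigenvalues(a, fn):
    """Apply fn to the eigenvalues of a symmetric matrix and rebuild the matrix."""
    w, v = np.linalg.eigh(a)
    return (v * fn(w)) @ v.T

def sqrtm_psd(a):
    return _apply_to_eigenvalues(a, lambda w: np.sqrt(np.clip(w, 0.0, None)))

def pinv_psd(a, tol=1e-10):
    """Pseudo-inverse of a symmetric PSD matrix; eigenvalues below tol are treated as zero."""
    def invert(w):
        out = np.zeros_like(w)
        mask = w > tol
        out[mask] = 1.0 / w[mask]
        return out
    return _apply_to_eigenvalues(a, invert)

def noise_covariance(sigma_x, sigma_xstar):
    r = sqrtm_psd(sigma_x)                                   # Sigma_X^{1/2}
    r_inv = np.linalg.inv(r)
    t_star = r_inv @ sqrtm_psd(r @ sigma_xstar @ r) @ r_inv  # T* as defined in Theorem 4
    inner = r @ t_star @ pinv_psd(sigma_xstar) @ t_star @ r
    return r @ (np.eye(len(sigma_x)) - inner) @ r

sigma_x = np.array([[2.0, 0.5], [0.5, 1.0]])
full_rank = np.array([[1.0, 0.2], [0.2, 0.5]])    # invertible Sigma_{X*}
rank_one = np.array([[1.0, 0.5], [0.5, 0.25]])    # singular Sigma_{X*} (e.g. scalar Y)

sigma_w_full = noise_covariance(sigma_x, full_rank)
sigma_w_low = noise_covariance(sigma_x, rank_one)
```

In the invertible case the result is numerically zero, consistent with the reduction of (19) to (18). In the rank-one case the noise covariance is positive semi-definite with rank one, filling the direction that a deterministic function of $Y$ cannot reach.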

As mentioned in Sec. 3.1, the optimal estimator is generally not unique. Interestingly, in the Gaussian setting we can explicitly characterize a set of optimal estimators.

and $W_{0}$ be a zero-mean Gaussian noise with covariance

that is independent of $X,Y$. Then, for any $P\in[0,G^{*}]$, an optimal estimator with perception index $P$ can be obtained by

The estimator given in (19) is one solution to (20)-(21), but is generally not unique.

A geometric perspective on the distortion-perception tradeoff

and it follows that $W_{2}(\gamma_{t},\gamma)=tW_{2}(\gamma,\mu)$ and $W_{2}(\gamma_{t},\mu)=(1-t)W_{2}(\gamma,\mu)$. Furthermore, if $\gamma_{t},t\in[0,1]$ is a constant-speed geodesic with $\gamma_{0}=\gamma,\gamma_{1}=\mu$, then the optimal plans between $\gamma,\gamma_{t}$ and between $\gamma_{t},\mu$ are given by

It is worth mentioning that this geometric interpretation is simplified under some common settings. For example, when $\gamma$ is absolutely continuous (w.r.t. the Lebesgue measure), we have a measurable map $T_{\gamma\rightarrow\mu}$ which is the solution to the optimal transport problem with the quadratic cost (19, Thm 1.6.2, p.16). The geodesic (23) then takes the form

Therefore, in our setting, if $\gamma=p_{X^{*}}$ has a density, then we can obtain $\hat{X}_{P}$ by the deterministic transformation $[X^{*}+\left(1-\frac{P}{P^{*}}\right)\left(T_{p_{X^{*}}\rightarrow p_{X}}(X^{*})-X^{*}\right)]$ (see Remark about randomness in Sec. 3.1).

Numerical illustration

In this section we evaluate 12 super-resolution algorithms on the BSD100 dataset (16). All code is freely available and provided by the authors, and the BSD100 dataset is free to download for non-commercial research. The evaluated algorithms include EDSR (13), ESRGAN (29), SinGAN (23), ZSSR (24), DIP (27), SRResNet variants which optimize MSE and VGG$_{2,2}$, SRGAN variants which optimize MSE, VGG$_{2,2}$ and VGG$_{5,4}$ in addition to an adversarial loss (12), and ENet (22) (“PAT” and “E” variants). Low-resolution images were obtained by $4\times$ downsampling using a bicubic kernel.

In Figure 2 we plot each method on the distortion-perception plane. Specifically, we consider natural (and reconstructed) images to be stationary random sources, and use $9\times 9$ patches ($1.6\times 10^{6}$ patches in total) from the RGB images to empirically estimate the mean and covariance matrix for the ground-truth images, and for the reconstructions produced by each method. We then use the estimated Gelbrich distances (4) between the patch distribution of each method and that of ground-truth images as a perceptual quality index. Recall that this is a lower bound on the Wasserstein distance.
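The evaluation pipeline described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the standard Gelbrich formula $G^{2}=\|m_{1}-m_{2}\|^{2}+\mathrm{Tr}\{\Sigma_{1}+\Sigma_{2}-2(\Sigma_{1}^{1/2}\Sigma_{2}\Sigma_{1}^{1/2})^{1/2}\}$, a patch stride of 3, and normalization of the covariance by the number of patches, none of which is specified in the text. The function names and the random stand-in images are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sqrtm_psd(a):
    """Symmetric PSD square root via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def patch_statistics(image, size=9, stride=3):
    """Mean and covariance of flattened size x size patches of an (H, W, C) image."""
    windows = sliding_window_view(image, (size, size), axis=(0, 1))  # (H-s+1, W-s+1, C, s, s)
    windows = windows[::stride, ::stride]                            # assumed patch stride
    patches = windows.reshape(-1, image.shape[2] * size * size).astype(np.float64)
    mean = patches.mean(axis=0)
    cov = np.cov(patches, rowvar=False, bias=True)                   # normalizes by N (assumed)
    return mean, cov

def gelbrich_distance(mean1, cov1, mean2, cov2):
    """Gelbrich distance between two (mean, covariance) pairs."""
    r1 = sqrtm_psd(cov1)
    cross = sqrtm_psd(r1 @ cov2 @ r1)
    sq = np.sum((mean1 - mean2) ** 2) + np.trace(cov1 + cov2 - 2.0 * cross)
    return float(np.sqrt(max(sq, 0.0)))                              # clamp round-off negatives

# Random stand-ins for a ground-truth image and a (smoother) reconstruction.
rng = np.random.default_rng(0)
ground_truth = rng.uniform(0, 255, size=(48, 48, 3))
reconstruction = 0.8 * ground_truth + 0.2 * ground_truth.mean()

m_gt, c_gt = patch_statistics(ground_truth)
m_rec, c_rec = patch_statistics(reconstruction)
perception_index = gelbrich_distance(m_gt, c_gt, m_rec, c_rec)
```

For RGB images each $9\times 9$ patch flattens to a vector of length 243, so the estimated covariance is a $243\times 243$ matrix. In practice the statistics would be accumulated over all images of a dataset rather than a single one.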

We consider EDSR (13) to be the best MSE estimator $X^{*}$ since it achieves the lowest distortion among the evaluated methods. We therefore estimate the lower bound (9) as

where $D_{\text{EDSR}}$ is the MSE of EDSR, and $P_{\text{EDSR}}$ is the estimated Gelbrich distance between EDSR reconstructions and ground-truth images. Note the unoccupied region under the estimated curve in Figure 2, which is indeed unattainable according to the theory.

We also present 11 estimators $\hat{X}_{t}$ which we construct by interpolation between EDSR and ESRGAN (29), $\hat{X}_{t}=tX_{\text{EDSR}}+(1-t)X_{\text{ESRGAN}}$. We observe (Figure 2) that estimators constructed using these two extreme points are closer to the optimal DP tradeoff than the evaluated methods. Also note that since ESRGAN does not attain a perception index of $0$, we are practically able to use negative values $t<0$ to extrapolate estimators with better perceptual quality, $\hat{X}_{-0.05}$ and $\hat{X}_{-0.1}$. In Figure 3 we present a visual comparison between SRGAN-VGG$_{2,2}$ (12) and our interpolated estimator $\hat{X}_{0.12}$. Both achieve roughly the same RMSE distortion (18.09 for SRGAN, 18.15 for $\hat{X}_{0.12}$), but our estimator achieves a lower perception index. Namely, by using interpolation, we manage to achieve an improvement in perceptual quality without degradation in distortion. The improvement in visual quality is also apparent in the figure. Additional visual comparisons can be found in the Appendix.
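A per-pixel implementation of $\hat{X}_{t}=tX_{\text{EDSR}}+(1-t)X_{\text{ESRGAN}}$ could look like the sketch below. The function name, the optional clipping to the valid intensity range, and the tiny example arrays are assumptions added for illustration; the text does not say whether the outputs were clipped.

```python
import numpy as np

def interpolate_reconstructions(x_low_distortion, x_perceptual, t, clip_range=None):
    """Per-pixel blend t * X_low_distortion + (1 - t) * X_perceptual.

    t in [0, 1] interpolates; t < 0 extrapolates beyond the perceptual estimator.
    clip_range, if given, is an assumed post-processing step and not part of the formula.
    """
    a = np.asarray(x_low_distortion, dtype=np.float64)
    b = np.asarray(x_perceptual, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("both reconstructions must have the same shape")
    x = t * a + (1.0 - t) * b
    if clip_range is not None:
        x = np.clip(x, clip_range[0], clip_range[1])
    return x

# Tiny stand-ins for the outputs of an MSE-oriented and a GAN-based model.
x_edsr = np.array([[100.0, 120.0], [140.0, 160.0]])
x_esrgan = np.array([[90.0, 130.0], [150.0, 170.0]])

x_mid = interpolate_reconstructions(x_edsr, x_esrgan, t=0.12)
x_extra = interpolate_reconstructions(x_edsr, x_esrgan, t=-0.1, clip_range=(0.0, 255.0))
```

Because the blend is applied at test time to the outputs of two fixed models, sweeping `t` traverses the tradeoff without any retraining.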

Conclusion

In this paper we provide a full characterization of the distortion-perception tradeoff for the MSE distortion and the Wasserstein-2 perception index. We show that optimal estimators are obtained by interpolation between the minimum MSE estimator and an optimal perfect perception quality estimator. In the Gaussian case, we explicitly formulate these estimators. To the best of our knowledge, this is the first work to derive such closed-form expressions. Our work paves the way towards fully understanding the DP tradeoff under more general distortions and perceptual criteria, and bridging between fidelity and visual quality at test time, without training different models.

References

Appendix A Background and extensions

In Sec. 2 of the main text we presented the setting of Euclidean space for simplicity. For the sake of completeness, we present here a more general setup.

The problem of finding a perfect perceptual quality estimator can be now written as an optimal transport problem

A.2 The optimal transportation problem

In the Monge formulation, we search for an optimal transformation, often referred to as an optimal map, $T:\mathcal{Y}\rightarrow\mathcal{X}$ minimizing

Note that the Monge problem seeks a deterministic map, and might not have a solution.

In the Kantorovich formulation, we wish to find a probability measure $q=q_{XY}$ on $\mathcal{X}\times\mathcal{Y}$, minimizing

$\Pi$ is the set of probabilities on $\mathcal{X}\times\mathcal{Y}$ with marginals $q^{(x)},p^{(y)}$. A probability minimizing (32) is called an optimal plan, and we denote $q\in\Pi_{o}(q^{(x)},p^{(y)})$. Note that when $\rho(x,y)=d^{p}(x,y)$ and $d(x,y)$ is a metric, taking $\inf$ over (32) yields the Wasserstein distance $W_{p}^{p}(q^{(x)},p^{(y)})$ induced by $d(x,y)$.
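A simple case where the Kantorovich problem can be solved by hand is that of two one-dimensional empirical measures with the same number of equally weighted samples: for the cost $|x-y|^{p}$ with $p\geq 1$, an optimal plan pairs the sorted samples. The sketch below uses this fact; the function name `wasserstein_1d` and the sample values are illustrative, and the approach does not extend to higher dimensions or unequal sample counts.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p between two 1-D empirical measures with equally many, equally weighted samples."""
    x = np.sort(np.asarray(x, dtype=np.float64))
    y = np.sort(np.asarray(y, dtype=np.float64))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes the same number of samples in x and y")
    # Sorting realizes the monotone coupling, which is optimal in one dimension.
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

shift = wasserstein_1d([0.0, 1.0], [2.0, 1.0])      # a pure shift by 1
spread = wasserstein_1d([0.0, 0.0], [3.0, 4.0])     # sqrt((9 + 16) / 2)
```

Here the optimal plan happens to be induced by a deterministic map, so the Monge and Kantorovich formulations coincide for the first pair; for the second pair the two atoms at $0$ must be sent to different targets, which a plan can do but a map cannot once the atoms are merged into a single point mass.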

A.3 Optimal maps between Gaussian measures

If $\Sigma_{1}$ and $\Sigma_{2}$ are non-singular, then the distribution attaining the optimum in (3) corresponds to

is the optimal transformation pushing forward from $\mathcal{N}(0,\Sigma_{1})$ to $\mathcal{N}(0,\Sigma_{2})$ [Knott and Smith, 1984]. This transformation satisfies $\Sigma_{2}=T_{1\rightarrow 2}\Sigma_{1}T_{1\rightarrow 2}$.

When distributions are singular, we have the following.

Appendix B Proof of main results

In this section we provide proofs of the main results of this paper. In Lemmas 2 and 3 we present some alternative representations for $D(P)$. In Lemma 4 we obtain a lower bound on $D(P)$. We then prove Theorem 3 (via a more general result given by Lemma 5), where the lower bound of Lemma 4 is attained. Equipped with Theorem 3, we prove Theorem 1, which is the main result of our paper.

Since in our case $\hat{X}$ is independent of $X$ given $Y$, we show that the third term vanishes.

Next, we express $D(P)$ in terms of the Wasserstein distance between $p_{\hat{X}}$ and $p_{X^{*}}$.

Denote $W_{2}^{2}(\mathcal{B}_{P},p_{X^{*}})=\min_{p_{\hat{X}}:W_{2}(p_{\hat{X}},p_{X})\leq P}W^{2}_{2}(p_{\hat{X}},p_{X^{*}})$, where $\mathcal{B}_{P}$ is the ball of radius $P$ around $p_{X}$ in Wasserstein space.

For every $p_{\hat{X}|Y}$ whose marginal attains $W_{2}(p_{\hat{X}},p_{X})\leq P$ we have,

which leads to $D(P)\geq D^{*}+W_{2}^{2}(\mathcal{B}_{P},p_{X^{*}})$.

Taking the minimum over $p_{\hat{X}}$ yields $D(P)\leq D^{*}+W_{2}^{2}(\mathcal{B}_{P},p_{X^{*}})$. Combining the upper and lower bounds, we obtain the desired result. ∎

For the proof of Theorem 3, we first prove the following

For every estimator satisfying $W_{2}(p_{\hat{X}},p_{X})\leq P$, we have from the triangle inequality

Theorem 3. Let $\hat{X}_{0}$ be an estimator achieving perception index $0$ and MSE $D(0)$. Then for any $P\in[0,P^{*}]$, the estimator

$$\hat{X}_{P}=\left(1-\frac{P}{P^{*}}\right)\hat{X}_{0}+\frac{P}{P^{*}}X^{*}$$

is optimal for perception index $P$, namely, it achieves perception index $P$ and distortion $D(P)$.

Let us prove a stronger result, from which Theorem 3 will follow.

where the equality is based on (42). A direct calculation of the distortion yields

When $X^{*}$ has a density, $\hat{X}_{0}$ (hence $\hat{X}_{P}$) can be obtained via a deterministic transformation of $Y$.

Since the distribution of $X^{*}$ is absolutely continuous, we have an optimal map $T_{p_{X^{*}}\rightarrow p_{X}}$ between the distributions of $X^{*}$ and $X$ (see discussion in App. A.2). Namely, we have that $\hat{X}_{0}=T_{p_{X^{*}}\rightarrow p_{X}}(X^{*})$ is an optimal estimator with perception index $0$. Thus, according to (15), $\hat{X}_{P}=\left(1-\frac{P}{P^{*}}\right)T_{p_{X^{*}}\rightarrow p_{X}}(X^{*})+\frac{P}{P^{*}}X^{*}$ are optimal estimators, which in this case are given by a deterministic function of $Y$. ∎

B.2 Proof of Theorem 1

With Theorem 3 and Lemma 5 in hand, we are now ready to prove our main result.

Furthermore, an estimator achieving perception index $P$ and distortion $D(P)$ can always be constructed by applying a (possibly stochastic) transformation to $X^{*}$.

hence $D(P)\leq D^{*}+\left[(P^{*}-P)_{+}\right]^{2}$. On the other hand, we have (Lemma 4) $D(P)\geq D^{*}+\left[(P^{*}-P)_{+}\right]^{2}$, which completes the proof. ∎

B.3 The Gaussian setting

In this section we prove Theorems 4 and 5. We begin by proving Theorem 5, and then show that Theorem 4 follows as a special case. Recall that

and $W_{0}$ be a zero-mean Gaussian noise with covariance

that is independent of $Y,X$. Then, for any $P\in[0,G^{*}]$, an optimal estimator with perception index $P$ can be obtained by

The estimator given in (50) is one solution to (46)-(47), but it is generally not unique.

(Theorem 5) Let $\hat{X}_{0}\triangleq\Sigma_{\hat{X}_{0}Y}\Sigma_{Y}^{-1}Y+W_{0}$ where $\Sigma_{\hat{X}_{0}Y}$ satisfies (46)-(47). It is easy to see that $\hat{X}_{0}\sim\mathcal{N}(0,\Sigma_{X})$ and it is jointly Gaussian with $(X,Y,X^{*})$. We have by (46)

Summarizing, X^0\hat{X}_{0} is an optimal perfect perception quality estimator. Note that (48) can be written as

and by Theorem 3 we have that it is an optimal estimator. ∎

Before proceeding to the proof of Theorem 4, let us introduce some auxiliary facts.

The following Lemma is a reminder of Schur’s Complement and its properties.

[Schur’s complement]. Let $\Sigma=\begin{bmatrix}A&B\\ B^{T}&C\end{bmatrix}$ be a symmetric matrix where $A$ is PD. Then $\Sigma/A\triangleq C-B^{T}A^{-1}B$ is the Schur complement of $\Sigma$, and we have that $\Sigma$ is PSD iff $\Sigma/A$ is PSD.
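The equivalence stated in the lemma can be verified on a small example. In the sketch below the blocks $A$, $B$, $C$ are hypothetical, and positive semi-definiteness is tested through the smallest eigenvalue with a small numerical tolerance.

```python
import numpy as np

def schur_complement(A, B, C):
    """Sigma / A = C - B^T A^{-1} B for Sigma = [[A, B], [B^T, C]] with A positive definite."""
    return C - B.T @ np.linalg.solve(A, B)

def is_psd(M, tol=1e-10):
    """True if the symmetric matrix M has no eigenvalue below -tol."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

A = np.array([[2.0, 0.0], [0.0, 1.0]])   # positive definite block
B = np.array([[1.0], [0.0]])

results = []
for c in (1.0, 0.4):                      # two choices of the lower-right block
    C = np.array([[c]])
    sigma = np.block([[A, B], [B.T, C]])
    results.append((is_psd(sigma), is_psd(schur_complement(A, B, C))))
```

With $C=1$ the Schur complement is $0.5$ and both tests succeed; with $C=0.4$ it is $-0.1$ and both fail, as the lemma predicts.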

Theorem 4. Assume $X$ and $Y$ are zero-mean jointly Gaussian random vectors with $\Sigma_{X},\Sigma_{Y}\succ 0$. Then for any $P\in[0,G^{*}]$, an estimator with perception index $P$ and MSE $D(P)$ can be constructed as

where $W$ is a zero-mean Gaussian noise with covariance $\Sigma_{W}=\Sigma_{X}^{1/2}(I-\Sigma_{X}^{1/2}T^{*}\Sigma_{X^{*}}^{\dagger}T^{*}\Sigma_{X}^{1/2})\Sigma_{X}^{1/2}$, which is independent of $Y,X$.

We observe that (50) is a special case of (48), where $\Sigma_{\hat{X}_{0}Y}=\Sigma_{Y\hat{X}_{0}}^{T}=\Sigma_{X}^{\frac{1}{2}}\left(\Sigma_{X}^{\frac{1}{2}}\Sigma_{X^{*}}\Sigma_{X}^{\frac{1}{2}}\right)^{\frac{1}{2}}\Sigma_{X}^{-\frac{1}{2}}\Sigma_{X^{*}}^{\dagger}\Sigma_{XY}$. We now show that $\Sigma_{\hat{X}_{0}Y}$ has the desired properties (46)-(47). By substitution,

Recall $\Sigma_{X^{*}}^{\dagger}\Sigma_{X^{*}}\Sigma_{X^{*}}^{\dagger}=\Sigma_{X^{*}}^{\dagger}$, and we denote $T^{*}=\Sigma_{X}^{-\frac{1}{2}}\left(\Sigma_{X}^{\frac{1}{2}}\Sigma_{X^{*}}\Sigma_{X}^{\frac{1}{2}}\right)^{\frac{1}{2}}\Sigma_{X}^{-\frac{1}{2}}$. We now have

Since $\Sigma_{X},\Sigma_{Y}\succ 0$, (51) is Schur’s complement of $\begin{bmatrix}\Sigma_{X}&\Sigma_{\hat{X}_{0}Y}\\ \Sigma_{Y\hat{X}_{0}}&\Sigma_{Y}\end{bmatrix}\succeq 0$, yielding

In the case where $\Sigma_{X^{*}}$ is invertible, $\Sigma_{\hat{X}_{0}Y}=\Sigma_{X}T^{*}\Sigma_{X^{*}}^{-1}\Sigma_{XY}$ in the proof of Theorem 4, and it is easy to see that the noise covariance is $\Sigma_{W}=0$. In this case $\Sigma_{\hat{X}_{0}Y}$ is the unique solution to (46)-(47). This means that $\hat{X}_{0}$ (hence $\hat{X}_{P}$) is a deterministic function of $Y$.

We first show $\Sigma_{W}=0$. Let $M_{P}=\Sigma_{\hat{X}_{0}Y}=\Sigma_{X}T^{*}\Sigma_{X^{*}}^{-1}\Sigma_{XY}$, then

Now, assume $M$ is a solution to (46)-(47), then $M_{\Delta}=M_{P}-M$ satisfies $M_{\Delta}\Sigma_{Y}^{-1}\Sigma_{YX}=0$ and

But, $M_{\Delta}\Sigma_{Y}^{-1}M_{P}^{T}=(M_{\Delta}\Sigma_{Y}^{-1}\Sigma_{YX})\Sigma_{X^{*}}^{-1}T^{*}\Sigma_{X}=0$ and $\Sigma_{X}-M_{P}\Sigma_{Y}^{-1}M_{P}^{T}=0$, yielding $M_{\Delta}\Sigma_{Y}^{-1}M_{\Delta}^{T}\preceq 0$. Since $M_{\Delta}\Sigma_{Y}^{-1}M_{\Delta}^{T}$ is PSD and $\Sigma_{Y}^{-1}$ is PD, we conclude that $M_{\Delta}=0$. ∎

Appendix C Settings with commuting covariances

In many practical problems, covariance matrices may have the commutative relation $\Sigma_{X}\Sigma_{X^{*}}=\Sigma_{X^{*}}\Sigma_{X}$. This is the case, for example, of circulant or large Toeplitz matrices [Gray, 2006]. For natural images this is a reasonable assumption since shift-invariance induces diagonalization by the Fourier basis [Unser, 1984].

In the Gaussian setting of Sec. 3.3, where $\Sigma_{X},\Sigma_{X^{*}}$ commute, it is easy to see that the Gelbrich distance between them can be written as

It is easy to see that estimators obtained from $\hat{X}_{0},X^{*}$ using (15) are Gaussian with zero mean and covariance $\Sigma_{P}$, given by

Note that since the roots commute, $\Sigma_{P}$ commutes with $\Sigma_{X},\Sigma_{X^{*}}$, and

This further reduces the geometry of the problem to the $l^{2}$-distance between commuting matrices.
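The reduction can be illustrated numerically. The sketch below assumes that, for commuting zero-mean covariances, the squared Gelbrich distance equals the squared Frobenius distance between the matrix square roots, $\|\Sigma_{X}^{1/2}-\Sigma_{X^{*}}^{1/2}\|_{F}^{2}$, and compares it with the general trace formula. The shared eigenbasis (a rotation by 0.3 rad) and the eigenvalues are hypothetical.

```python
import numpy as np

def sqrtm_psd(a):
    """Symmetric PSD square root via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

# Two covariances sharing an eigenbasis Q, hence commuting.
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
sigma_x = Q @ np.diag([4.0, 1.0]) @ Q.T
sigma_xstar = Q @ np.diag([1.0, 0.25]) @ Q.T

# General (zero-mean) Gelbrich formula.
r = sqrtm_psd(sigma_x)
gelbrich_sq = np.trace(sigma_x + sigma_xstar - 2.0 * sqrtm_psd(r @ sigma_xstar @ r))

# Frobenius distance between the square roots.
frobenius_sq = np.linalg.norm(sqrtm_psd(sigma_x) - sqrtm_psd(sigma_xstar), "fro") ** 2
```

Both quantities equal $(2-1)^{2}+(1-0.5)^{2}=1.25$ here, i.e. the sum of squared differences between the square roots of the eigenvalues in the common basis.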

Appendix D Numerical illustration

For each algorithm, we acquire 100 RGB images which are reconstructions of BSD100 images. We extract $9\times 9$ patches from the RGB images, and then estimate:

where $p_{i}$ is the $i$-th patch (a $243$-row vector, $N_{\text{patches}}=1,632,800$). We compute using (4)

The estimators X^t\hat{X}_{t} are constructed using per-pixel interpolation between EDSR and ESRGAN

D.2 Visual illustration

Here we present a visual comparison between SR methods and our constructed estimators, achieving roughly the same MSE but with a lower perception index. We also present EDSR, ESRGAN, the low-resolution input, and the ground-truth BSD100 images.