A Theory of the Distortion-Perception Tradeoff in Wasserstein Space
Dror Freirich, Tomer Michaeli, Ron Meir
Introduction
Image restoration covers some fundamental settings in image processing such as denoising, deblurring and super-resolution. Over the past few years, image restoration methods have demonstrated impressive improvements in both visual quality and distortion measures such as peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) (31). It was noticed, however, that improvement in accuracy, as measured by distortion, does not necessarily lead to improvement in visual quality, referred to as perceptual quality. Furthermore, the lower the distortion of an estimator, the more the distribution of its outputs generally deviates from the distribution of the signals it attempts to estimate. This phenomenon, known as the perception-distortion tradeoff (4), has captured significant attention, where it implies that faithfulness to ground truth images comes at the expense of perceptual quality, namely the deviation from statistics of natural images. Several works have extended the perception-distortion tradeoff to settings such as lossy compression (5) and classification (14).
Despite the increasing popularity of performing comparisons on the perception-distortion plane, the exact characterization of the minimal distortion that can be achieved under a given perception constraint remains an important open question. Although Blau and Michaeli (4) investigated the basic properties of this distortion-perception function, such as monotonicity and convexity, little is known about its precise nature. While a general answer to this question is unavailable, in this paper, we derive a closed form expression for the distortion-perception (DP) function for the mean squared-error (MSE) distortion and the Wasserstein- perception index.
Our main contributions are: (i) We prove that the DP function is always quadratic in the perception constraint , regardless of the underlying distribution (Theorem 1). (ii) We show that it is possible to construct estimators on the DP curve from the estimators at the two extremes of the tradeoff (Theorem 3): The one that globally minimizes the MSE, and a minimizer of the MSE under a perfect perceptual quality constraint. The latter can be obtained as a stochastic transformation of the former. (iii) In the Gaussian setting, we further provide a closed form expression for optimal estimators and for the corresponding DP curve (Theorems 4 and 5). We show this Gaussian DP curve is a lower bound on the DP curve of any distribution having the same second order statistics. Finally, we illustrate our results, numerically and visually, in a super-resolution setting in Section 5. The proofs of all the theorems in the main text are provided in Appendix B.
Our theoretical results shed light on several topics that are subject to much practical activity. Particularly, in the domain of image restoration, numerous works target perceptual quality rather than distortion (e.g. (29; 13; 12)). However, it has recently been recognized that generating a single reconstructed image often does not convey to the user the inherent ambiguity in the problem. Therefore, many recent works target diverse perceptual image reconstruction, by employing randomization among possible restorations (15; 3; 21; 1). Commonly, such works perform sampling from the posterior distribution of natural images given the degraded input image. This is done e.g. using priors over image patches (7), conditional generative models (18; 20), or implicit priors induced by deep denoiser networks (10). Theoretically, posterior sampling leads to perfect perceptual quality (the restored outputs are distributed like the prior). However, a fundamental question is whether this is optimal in terms of distortion. As we show in Section 3.1, posterior sampling is often not an optimal strategy, in the sense that there often exist perfect perceptual quality estimators that achieve lower distortion.
Another topic of practical interest, is the ability to traverse the distortion-perception tradeoff at test time, without having to train a different model for each working point. Recently, interpolation has been suggested for controlling several objectives at test-time. Shoshan et al. (25) propose using interpolation in some latent space in order to approximate intermediate objectives. Wang et al. (29) use per-pixel interpolation for balancing perceptual quality and fidelity. Studies of network parameter interpolation are presented by Wang et al. (29, 30). Deng (6) produces a low distortion reconstruction and a high perceptual quality one, and then uses style transfer to combine them. An important question, therefore, is which strategy is optimal. In Section 3.2 we show that for the MSE–Wasserstein-2 tradeoff, linear interpolation leads to optimal estimators. We also discuss a geometric connection between interpolation and the fact that estimators on the DP curve form a geodesic in Wasserstein space.
Problem setting and preliminaries
In many practical cases, the goodness of an estimator is associated with two factors: (i) the degree to which is close to on average (low distortion), and (ii) the degree to which the distribution of is close to that of (good perceptual quality). An important question, then, is what is the minimal distortion that can be achieved under a given level of perceptual quality? and how can we construct estimators that achieve this lower bound? In mathematical language, we are interested in analyzing the distortion-perception (DP) function (defined similarly to the perception-distortion function of (4))
As discussed in (4), the function is monotonically non-increasing and is convex whenever is convex in its second argument (which is the case for most popular divergences). However, without further concrete assumptions on the distortion measure and the perception index , little can be said about the precise nature of .
Here, we focus our attention on the squared-error distortion and the Wasserstein-2 distance , with which (1) reads
2 The Wasserstein and Gelbrich Distances
Before we present our main results, we briefly survey a few properties of the Wasserstein distance, mostly taken from (19). The Wasserstein- () distance between measures and on a separable Banach space with norm is defined by
where is the set of all probabilities on with marginals and . A joint probability achieving the optimum in (3) is often referred to as optimal plan. The Wasserstein space of probability measures is defined as
and constitutes a metric on .
is the optimal transformation pushing forward from to (11). This transformation satisfies For a discussion on singular distributions, please see App. A.
Main results
The DP function (2) depends, of course, on the underlying joint probability of the signal and measurements . Our first key result is that this dependence can be expressed solely in terms of and . In other words, knowing the distortion and perception index attained by the minimum MSE estimator , suffices for determining for any .
where . Furthermore, an estimator achieving perception index and distortion can always be constructed by applying a (possibly stochastic) transformation to .
The bound is attained when and are jointly Gaussian.
A remark is in place regarding the uniqueness of an estimator achieving (8). As we discuss below, what defines an optimal estimator is its joint distribution with . This joint distribution may not be unique, in which case the optimal estimator is not unique. Moreover, even if is unique, the uniqueness of the estimator is not guaranteed because there may be different conditional distributions that lead to the same optimal . In other words, given the optimal , one can choose any joint probability that has marginals and . One option is to take the estimator to be a (possibly stochastic) transformation of , namely . But there may be other options. In cases where either or are a deterministic transformation of (e.g. when has a density, or is an invertible function of ), there is a unique joint distribution with the given marginals (2, Lemma 5.3.2). In this case, if is unique then so is the estimator .
Under the settings of image restoration, many methods encourage diversity in their output by adding randomness (15; 3; 21). In our setting, we may ask under what conditions there exists an optimal estimator which is a deterministic function of . For example, when but has some non-atomic distribution, it is clear that no deterministic function of can attain perfect perceptual quality. It turns out that a sufficient condition for the optimal to be a deterministic function of is that have a density. We discuss this in App. B and explicitly illustrate it in the Gaussian case (see Sec. 3.3), where if has a non-singular covariance matrix then is a deterministic function of .
and the upper bound is attained when . To see when this happens, observe that
2 Optimal estimators
While Theorem 1 reveals the shape of the DP function, it does not provide a recipe for constructing optimal estimators on the DP tradeoff. We now discuss the nature of such estimators.
Note that the objective in (12) depends on the MSE between and , so that we can perform the minimization on rather than on (once we determine the optimal we can construct a consistent as discussed above).
Now, let us start by examining the leftmost side of the curve , which corresponds to a perfect perceptual quality estimator (i.e. ). In this case, the constraint becomes . Therefore,
Let be an estimator achieving perception index and MSE . Then its joint distribution with attains the optimum in the definition of . Namely, is an optimal plan between and .
Having understood the estimator at the leftmost end of the tradeoff, we now turn to study optimal estimators for arbitrary . Interestingly, we can show that Problem (12) is equivalent to (see App. B)
Namely, an optimal is closest to among all distributions within a ball of radius around , as illustrated in Fig. 1. Moreover, is an optimal plan between and . As it turns out, this somewhat abstract viewpoint leads to a rather practical construction for from the estimators and at the two extremes of the tradeoff. Specifically, we have the following result, proved in App. B.
Let be an estimator achieving perception index and MSE . Then for any , the estimator
is optimal for perception index . Namely, it achieves perception index and distortion .
Theorem 3 has important implications for perceptual signal restoration. For example, in the task of image super-resolution, there exist many deep network based methods that achieve a low MSE (13; 27; 24). These provide an approximation for . Moreover, there is an abundance of methods that achieve good perceptual quality at the price of a reasonable degradation in MSE (often by incorporating a GAN-based loss) (12; 29; 23). These constitute approximations for . However, achieving results that strike other prescribed balances between MSE and perceptual quality commonly require training a different model for each setting. Shoshan et al. (25) and Navarrete Michelini et al. (17) tried to address this difficulty by introducing new training techniques that allow traversing the distortion-perception tradeoff at test time. But, interestingly, Theorem 3 shows that in our setting such specialized training methods are not required. Having a model that leads to low MSE and one that leads to good perceptual quality, it is possible to construct any other estimator on the DP tradeoff, by simply averaging the outputs of these two models with appropriate weights. We illustrate this in Sec. 5.
3 The Gaussian setting
When and are jointly Gaussian, it is well known that the minimum MSE estimator is a linear function of the measurements . However, it is not a-priori clear whether all estimators along the DP tradeoff are linear in this case, and what kind of randomness they possess. As we now show, equipped with Theorem 3, we can obtain closed form expressions for optimal estimators for any . For simplicity, we assume here that and have zero means and that .
It is instructive to start by considering the simple case, where is non-singular (in Theorem 4 below we address the more general case of a possibly singular ). It is well known that
Now, since we assumed that , we have from Theorem 2 and (6),(7) that
Finally, we know that , which is given by the left-hand side of (11). Substituting these expressions into (15), we obtain that an optimal estimator for perception is given by
As can be seen, this optimal estimator is a deterministic linear transformation of for any .
The setting just described does not cover the case where is of lower dimensionality than because in that case is necessarily singular (it is a matrix of rank at most ; see (16)). In this case, any deterministic linear function of would result in an estimator with a rank- covariance. Obviously, the distribution of such an estimator cannot be arbitrarily close to that of , whose covariance has rank . What is the optimal estimator in this more general setting, then?
Assume and are zero-mean jointly Gaussian random vectors with . Denote . Then for any , an estimator with perception index and MSE can be constructed as
where is a zero-mean Gaussian noise with covariance , which is independent of , and is the pseudo-inverse of .
Note that in this case, we indeed have a random noise component that shapes the covariance of to become closer to as gets closer to . It can be shown (see App. B) that when is invertible, and (19) reduces to (18). Also note that, as in (18), the dependence of on in (19) is only through .
As mentioned in Sec. 3.1, the optimal estimator is generally not unique. Interestingly, in the Gaussian setting we can explicitly characterize a set of optimal estimators.
and be a zero-mean Gaussian noise with covariance
that is independent of . Then, for any , an optimal estimator with perception index can be obtained by
The estimator given in (19) is one solution to (20)-(21), but is generally not unique.
A geometric perspective on the distortion-perception tradeoff
and it follows that and . Furthermore, if is a constant-speed geodesic with , then the optimal plans between and between are given by
It is worth mentioning that this geometric interpretation is simplified under some common settings. For example, when is absolutely continuous (w.r.t. the Lebesgue measure), we have a measurable map which is the solution to the optimal transport problem with the quadratic cost (19, Thm 1.6.2, p.16). The geodesic (23) then takes the form
Therefore, in our setting, if has a density, then we can obtain by the deterministic transformation (see Remark about randomness in Sec. 3.1).
Numerical illustration
In this Section we evaluate super resolution algorithms on the BSD100 datasetAll codes are freely available and provided by the authors. The BSD100 dataset is free to download for non-commercial research. (16). The evaluated algorithms include EDSR (13), ESRGAN (29), SinGAN (23), ZSSR (24), DIP (27), SRResNet variants which optimize MSE and VGG2,2, SRGAN variants which optimize MSE, VGG2,2 and VGG5,4 in addition to an adversarial loss (12), ENet (22) (“PAT” and “E” variants). Low resolution images were obtained by downsampling using a bicubic kernel.
In Figure 2 we plot each method on the distortion-perception plane. Specifically, we consider natural (and reconstructed) images to be stationary random sources, and use patches (totally patches) from the RGB images to empirically estimate the mean and covariance matrix for the ground-truth images, and for the reconstructions produced by each method. We then use the estimated Gelbrich distances (4) between the patch distribution of each method and that of ground-truth images, as a perceptual quality index. Recall this is a lower bound on the Wasserstein distance.
We consider EDSR (13) to be the best MSE estimator since it achieves the lowest distortion among the evaluated methods. We therefore estimate the lower bound (9) as
where is the MSE of EDSR, and is the estimated Gelbrich distance between EDSR reconstructions and ground-truth images. Note the unoccupied region under the estimated curve in Figure 2, which is indeed unattainable according to the theory.
We also present 11 estimators which we construct by interpolation between EDSR and ESRGAN (29), . We observe (Figure 2) that estimators constructed using these two extreme points are closer to the optimal DP tradeoff than the evaluated methods. Also note that since ESRGAN does not attain -perception index, we are practically able to use negative values to extrapolate better perception-quality estimators and . In Figure 3 we present a visual comparison between SRGAN-VGG2,2 (12) and our interpolated estimator . Both achieve roughly the same RMSE distortion ( for SRGAN, for ), but our estimator achieves a lower perception index. Namely, by using interpolation, we manage to achieve improvement in perceptual quality, without degradation in distortion. The improvement in visual quality is also apparent in the figure. Additional visual comparisons can be found in the Appendix.
Conclusion
In this paper we provide a full characterization of the distortion-perception tradeoff for the MSE distortion and the Wasserstein- perception index. We show that optimal estimators are obtained by interpolation between the minimum MSE estimator and an optimal perfect perception quality estimator. In the Gaussian case, we explicitly formulate these estimators. To the best of our knowledge, this is the first work to derive such closed-form expressions. Our work paves the way towards fully understanding the DP tradeoff under more general distortions and perceptual criteria, and bridging between fidelity and visual quality at test-time, without training different models.
References
Appendix A Background and extensions
In Sec. 2 of the main text we presented the setting of Euclidean space for simplicity. For the sake of completeness, we present here a more general setup.
The problem of finding a perfect perceptual quality estimator can be now written as an optimal transport problem
A.2 The optimal transportation problem
In the Monge formulation, we search for an optimal transformation, often referred to as an optimal map, minimizing
Note that the Monge problem seeks for a deterministic map, and might not have a solution.
In the Kantorovich formulation, we wish to find a probability measure on , minimizing
is the set of probabilities on with marginals . A probability minimizing (32) is called an optimal plan, and we denote . Note that when and is a metric, taking over (32) yields the Wasserstein distance induced by .
A.3 Optimal maps between Gaussian measures
If and are non-singular, then the distribution attaining the optimum in (3) corresponds to
is the optimal transformation pushing forward from to [Knott and Smith, 1984]. This transformation satisfies
When distributions are singular, we have the following.
Appendix B Proof of main results
In this Section we provide proofs of the main results of this paper. In lemmas 2 and 3 we present some alternative representations for . In Lemma 4 we obtain a lower bound on . We then prove Theorem 3 (via a more general result given by Lemma 5), where the lower bound of Lemma 4 is attained. Equipped with Theorem 3, we prove Theorem 1 which is the main result of our paper.
Since in our case is independent of given , we show that the third term vanishes.
Next, we express in terms of the Wasserstein distance between and .
Denote , where is the ball of radius around in Wasserstein space.
For every whose marginal attains we have,
which leads to .
Taking the minimum over yields . Combining the upper and lower bounds, we obtain the desired result. ∎
For the proof of Theorem 3, we first prove the following
For every estimator satisfying , we have from the triangle inequality
3. Let be an estimator achieving perception index and MSE . Then for any , the estimator
is optimal for perception index , namely, it achieves perception index and distortion .
Let us prove a stronger result, from which Theorem 3 will follow.
where the equality is based on (42). A direct calculation of the distortion yields
When has a density, (hence ) can be obtained via a deterministic transformation of .
Since the distribution of is absolutely continuous, we have an optimal map between the distributions of and (see discussion in App. A.2). Namely, we have that is an optimal estimator with perception index . Thus, according to (15) are optimal estimators, which in this case are given by a deterministic function of . ∎
B.2 Proof of theorem 1
With Theorem 3 and Lemma 5 in hand, we are now ready to prove our main result.
Furthermore, an estimator achieving perception index and distortion can always be constructed by applying a (possibly stochastic) transformation to .
hence . On the other hand, we have (Lemma 4) , which completes the proof. ∎
B.3 The Gaussian setting
In this Section we prove Theorems 4 and 5. We begin by proving Theorem 5, and then show that Theorem 4 follows as a special case. Recall that
and be a zero-mean Gaussian noise with covariance
that is independent of . Then, for any , an optimal estimator with perception index can be obtained by
The estimator given in (50) is one solution to (46)-(47), but it is generally not unique.
(Theorem 5) Let where satisfies (46)-(47). It is easy to see that and it is jointly Gaussian with . We have by (46)
Summarizing, is an optimal perfect perception quality estimator. Note that (48) can be written as
and by Theorem 3 we have that it is an optimal estimator. ∎
Before proceeding to the proof of Theorem 4, let us introduce some auxiliary facts.
The following Lemma is a reminder of Schur’s Complement and its properties.
[Schur’s complement]. Let \Sigma=\left[\begin{array}[]{cc}A&B\\ B^{T}&C\end{array}\right] be a symmetric matrix where is PD. Then is the Schur complement of , and we have that is PSD iff is PSD.
4. Assume and are zero-mean jointly Gaussian random vectors with . Then for any , an estimator with perception index and MSE can be constructed as
where is a zero-mean Gaussian noise with covariance , which is independent of .
We observe that (50) is a special case of (48), where . We now show that has the desired properties (46)-(47). By substitution,
Recall , and we denote . We now have
Since , (51) is Schur’s complement of , yielding
In the case where is invertible, in the proof of Theorem 4, and it is easy to see that the noise covariance is . In this case is the unique solution to (46)-(47). This means that (hence ) is a deterministic function of .
We first show . Let , then
Now, assume is a solution to (46)-(47), then satisfies and
But, and , yielding . Since is PSD and is PD, we conclude that . ∎
Appendix C Settings with commuting covariances
In many practical problems, covariance matrices may have the commutative relation . This is the case, for example, of circulant or large Toeplitz matrices [Gray, 2006]. For natural images this is a reasonable assumption since shift-invariance induces diagonalization by the Fourier basis [Unser, 1984].
In the Gaussian settings of Sec. 3.3, where commute it is easy to see that the Gelbrich distance between them can be written as
It is easy to see that estimators obtained by using (15) are Gaussian with zero mean and covariance , given by
Pay attention that since the roots commute, commmutes with , and
This further reduces the geometry of the problem to the -distance between commuting matrices.
Appendix D Numerical illustration
For each algorithm, we acquire RGB images which are reconstructions of BSD100 images. We extract patches from the RGB images, and then estimate:
where is the -th patch (a -row vector, ). We compute using (4)
The estimators are constructed using per-pixel interpolation between EDSR and ESRGAN
D.2 Visual illustration
Here we present a visual comparison between SR methods and our constructed estimators, achieving roughly the same MSE but with a lower perception index. We also present EDSR, ESRGAN, the low-resolution input, and the ground-truth BSD100 images.