Assessing Generative Models via Precision and Recall

Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, Sylvain Gelly

Introduction

Deep generative models, such as Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN) , have received a great deal of attention due to their ability to learn complex, high-dimensional distributions. One of the biggest impediments to future research is the lack of quantitative evaluation methods to accurately assess the quality of trained models. Without a proper evaluation metric researchers often need to visually inspect generated samples or resort to qualitative techniques which can be subjective. One of the main difficulties for quantitative assessment lies in the fact that the distribution is only specified implicitly – one can learn to sample from a predefined distribution, but cannot evaluate the likelihood efficiently. In fact, even if likelihood computation were computationally tractable, it might be inadequate and misleading for high-dimensional problems .

As a result, surrogate metrics are often used to assess the quality of the trained models. Some proposed measures, such as Inception Score (IS) and Fréchet Inception Distance (FID) , have shown promising results in practice. In particular, FID has been shown to be robust to image corruption, it correlates well with the visual fidelity of the samples, and it can be computed on unlabeled data.

However, all of the metrics commonly applied to evaluating generative models share a crucial weakness: Since they yield a one-dimensional score, they are unable to distinguish between different failure cases. For example, the generative models shown in Figure 1 obtain similar FIDs but exhibit different sample characteristics: the model on the left trained on MNIST produces realistic samples, but only generates a subset of the digits. On the other hand, the model on the right produces low-quality samples which appear to cover all digits. A similar effect can be observed on the CelebA data set. In this work we argue that a single-value summary is not adequate to compare generative models.

Motivated by this shortcoming, we present a novel approach which disentangles the divergence between distributions into two components: precision and recall. Given a reference distribution PP and a learned distribution QQ, precision intuitively measures the quality of samples from QQ, while recall measures the proportion of PP that is covered by QQ. Furthermore, we propose an elegant algorithm which can compute these quantities based on samples from PP and QQ. In particular, using this approach we are able to quantify the degree of mode dropping and mode inventing based on samples from the true and the learned distributions.

Our contributions: (1) We introduce a novel definition of precision and recall for distributions and prove that the notion is theoretically sound and has desirable properties, (2) we propose an efficient algorithm to compute these quantities, (3) we relate these notions to total variation, IS and FID, (4) we demonstrate that in practice one can quantify the degree of mode dropping and mode inventing on real world data sets (image and text data), and (5) we compare several types of generative models based on the proposed approach – to our knowledge, this is the first metric that experimentally confirms the folklore that GANs often produce "sharper" images, but can suffer from mode collapse (high precision, low recall), while VAEs produce "blurry" images, but cover more modes of the distribution (low precision, high recall).

Background and Related Work

The task of evaluating generative models is an active research area. Here we focus on recent work in the context of deep generative models for image and text data. Classic approaches relying on comparing log-likelihood have received some criticism due the fact that one can achieve high likelihood, but low image quality, and conversely, high-quality images but low likelihood . While the likelihood can be approximated in some settings, kernel density estimation in high-dimensional spaces is extremely challenging . Other failure modes related to density estimation in high-dimensional spaces have been elaborated in . A recent review of popular approaches is presented in .

The FID provides an alternative approach which requires no labeled data. The samples are first embedded in some feature space (e.g., a specific layer of Inception network for images). Then, a continuous multivariate Gaussian is fit to the data and the distance computed as FID(x,g)=μxμg22+Tr(Σx+Σg2(ΣxΣg)12)\operatorname{FID}(x,g)=||\mu_{x}-\mu_{g}||_{2}^{2}+\operatorname{Tr}(\Sigma_{x}+\Sigma_{g}-2(\Sigma_{x}\Sigma_{g})^{\frac{1}{2}}), where μ\mu and Σ\Sigma denote the mean and covariance of the corresponding samples. FID is sensitive to both the addition of spurious modes as well as to mode dropping (see Figure 5 and results in ). recently introduced an unbiased alternative to FID, the Kernel Inception Distance. While unbiased, it shares an extremely high Spearman rank-order correlation with FID .

Another approach is to train a classifier between the real and fake distributions and to use its accuracy on a test set as a proxy for the quality of the samples . This approach necessitates training of a classifier for each model which is seldom practical. Furthermore, the classifier might detect a single dimension where the true and generated samples differ (e.g., barely visible artifacts in generated images) and enjoy high accuracy, which runs the risk of assigning lower quality to a better model.

To the best of our knowledge, all commonly used metrics for evaluating generative models are one-dimensional in that they only yield a single score or distance. A notion of precision and recall has previously been introduced in where the authors compute the distance to the manifold of the true data and use it as a proxy for precision and recall on a synthetic data set. Unfortunately, it is not possible to compute this quantity for more complex data sets.

PRD: Precision and Recall for Distributions

In this section, we derive a novel notion of precision and recall to compare a distribution QQ to a reference distribution PP. The key intuition is that precision should measure how much of QQ can be generated by a “part” of PP while recall should measure how much of PP can be generated by a “part” of QQ. Figure 4 (a)-(d) show four toy examples for PP and QQ to visualize this idea: (a) If PP is bimodal and QQ only captures one of the modes, we should have perfect precision but only limited recall. (b) In the opposite case, we should have perfect recall but only limited precision. (c) If Q=PQ=P, we should have perfect precision and recall. (d) If the supports of PP and QQ are disjoint, we should have zero precision and recall.

Let S=supp(P)supp(Q)S=\operatorname{supp}(P)\cap\operatorname{supp}(Q) be the (non-empty) intersection of the supportsFor a distribution PP defined on a finite state space Ω\Omega, we define supp(P)={ωΩP(ω)>0}\operatorname{supp}(P)=\{\omega\in\Omega\mid P(\omega)>0\}. of PP and QQ. Then, PP may be viewed as a two-component mixture where the first component PSP_{S} is a probability distribution on SS and the second component PSP_{\overline{S}} is defined on the complement of SS. Similarly, QQ may be rewritten as a mixture of QSQ_{S} and QSQ_{\overline{S}}. More formally, for some αˉ,βˉ(0,1]\bar{\alpha},\bar{\beta}\in(0,1], we define

This decomposition allows for a natural interpretation: PSP_{\overline{S}} is the part of PP that cannot be generated by QQ, so its mixture weight 1βˉ1-\bar{\beta} may be viewed as a loss in recall. Similarly, QSQ_{\overline{S}} is the part of QQ that cannot be generated by PP, so 1αˉ1-\bar{\alpha} may be regarded as a loss in precision. In the case where PS=QSP_{S}=Q_{S}, i.e., the distributions PP and QQ agree on SS up to scaling, αˉ\bar{\alpha} and βˉ\bar{\beta} provide us with a simple two-number precision and recall summary satisfying the examples in Figure 4 (a)-(d).

If PSQSP_{S}\neq Q_{S}, we are faced with a conundrum: Should the differences in PSP_{S} and QSQ_{S} be attributed to losses in precision or recall? Is QSQ_{S} inadequately “covering” PSP_{S} or is it generating “unnecessary” noise? Inspired by PR curves for binary classification, we propose to resolve this predicament by providing a trade-off between precision and recall instead of a two-number summary for any two distributions PP and QQ. To parametrize this trade-off, we consider a distribution μ\mu on SS that signifies a “true” common component of PSP_{S} and QSQ_{S} and similarly to (1), we decompose both PSP_{S} and QSQ_{S} as

The distribution PSP_{S} is viewed as a two-component mixture where the first component is μ\mu and the second component PμP_{\mu} signifies the part of PSP_{S} that is “missed” by QSQ_{S} and should thus be considered a recall loss. Similarly, QSQ_{S} is decomposed into μ\mu and the part QμQ_{\mu} that signifies noise and should thus be considered a precision loss. As μ\mu is varied, this leads to a trade-off between precision and recall.

It should be noted that unlike PR curves for binary classification where different thresholds lead to different classifiers, trade-offs between precision and recall here do not constitute different models or distributions – the proposed PRD curves only serve as a description of the characteristics of the model with respect to the target distribution.

2 Formal definition

For simplicity, we consider distributions PP and QQ that are defined on a finite state space, though the notion of precision and recall can be extended to arbitrary distributions. By combining (1) and (2), we obtain the following formal definition of precision and recall.

For α,β(0,1]\alpha,\beta\in(0,1], the probability distribution QQ has precision α\alpha at recall β\beta w.r.t. PP if there exist distributions μ\mu, νP\nu_{P} and νQ\nu_{Q} such that

The component νP\nu_{P} denotes the part of PP that is “missed” by QQ and encompasses both PSP_{\overline{S}} in (1) and PμP_{\mu} in (2). Similarly, νQ\nu_{Q} denotes the noise part of QQ and includes both QSQ_{\overline{S}} in (1) and QμQ_{\mu} in (2).

The set of attainable pairs of precision and recall of a distribution QQ w.r.t. a distribution PP is denoted by PRD(Q,P)\operatorname{PRD{}}(Q,P) and it consists of all (α,β)(\alpha,\beta) satisfying Definition 1 and the pair (0,0)(0,0).

The set PRD(Q,P)\operatorname{PRD{}}(Q,P) characterizes the above-mentioned trade-off between precision and recall and can be visualized similarly to PR curves in binary classification: Figure 4 (a)-(d) show the set PRD(Q,P)\operatorname{PRD{}}(Q,P) on a 2D-plot for the examples (a)-(d) in Figure 4. Note how the plot distinguishes between (a) and (b): Any symmetric evaluation method (such as FID) assigns these cases the same score although they are highly different. The interpretation of the set PRD(Q,P)\operatorname{PRD{}}(Q,P) is further aided by the following set of basic properties which we prove in Section A.1 in the appendix.

Let PP and QQ be probability distributions defined on a finite state space Ω\Omega. The set PRD(Q,P)\operatorname{PRD{}}(Q,P) satisfies the following properties:

(1,1)PRD(Q,P)        Q=P(1,1)\in\operatorname{PRD{}}(Q,P)\;\;\Leftrightarrow\;\;Q=P (equality)

PRD(Q,P)={(0,0)}        supp(Q)supp(P)=\operatorname{PRD{}}(Q,P)=\{(0,0)\}\;\;\Leftrightarrow\;\;\operatorname{supp}(Q)\cap\operatorname{supp}(P)=\emptyset (disjoint supports)

Q(supp(P))=αˉ=max(α,β)PRD(Q,P)αQ(\operatorname{supp}(P))=\bar{\alpha}=\max_{(\alpha,\beta)\in\operatorname{PRD{}}(Q,P)}\alpha (max precision)

P(supp(Q))=βˉ=max(α,β)PRD(Q,P)βP(\operatorname{supp}(Q))=\bar{\beta}=\max_{(\alpha,\beta)\in\operatorname{PRD{}}(Q,P)}\beta (max recall)

(α,β)PRD(Q,P)(\alpha^{\prime},\beta^{\prime})\in\operatorname{PRD{}}(Q,P) if α(0,α]\alpha^{\prime}\in(0,\alpha], β(0,β]\beta^{\prime}\in(0,\beta], (α,β)PRD(Q,P)(\alpha,\beta)\in\operatorname{PRD{}}(Q,P) (monotonicity)

(α,β)PRD(Q,P)        (β,α)PRD(P,Q)(\alpha,\beta)\in\operatorname{PRD{}}(Q,P)\;\;\Leftrightarrow\;\;(\beta,\alpha)\in\operatorname{PRD{}}(P,Q) (duality)

Property i in combination with Property v guarantees that Q=PQ=P if the set PRD(Q,P)\operatorname{PRD{}}(Q,P) contains the interior of the unit square, see case (c) in Figures 4 and 4. Similarly, Property ii assures that whenever there is no overlap between PP and QQ, PRD(Q,P)\operatorname{PRD{}}(Q,P) only contains the origin, see case (d) of Figures 4 and 4. Properties iii and iv provide a connection to the decomposition in (1) and allow an analysis of the cases (a) and (b) in Figures 4 and 4: As expected, QQ in (a) achieves a maximum precision of 1 but only a maximum recall of 0.5 while in (b), maximum recall is 1 but maximum precision is 0.5. Note that the quantities αˉ\bar{\alpha} and βˉ\bar{\beta} here are by construction the same as in (1). Finally, Property vi provides a natural interpretation of precision and recall: The precision of QQ w.r.t. PP is equal to the recall of PP w.r.t. QQ and vice versa.

Clearly, not all cases are as simple as the examples (a)-(d) in Figures 4 and 4, in particular if PP and QQ are different on the intersection SS of their support. The examples (e) and (f) in Figure 4 and the resulting sets PRD(Q,P)\operatorname{PRD{}}(Q,P) in Figure 4 illustrate the importance of the trade-off between precision and recall as well as the utility of the set PRD(Q,P)\operatorname{PRD{}}(Q,P). In both cases, PP and QQ have the same support while QQ has high precision and low recall in case (e) and low precision and high recall in case (f). This is clearly captured by the sets PRD(Q,P)\operatorname{PRD{}}(Q,P). Intuitively, the examples (e) and (f) may be viewed as noisy versions of the cases (a) and (b) in Figure 4.

3 Algorithm

Computing the set PRD(Q,P)\operatorname{PRD{}}(Q,P) based on Definitions 1 and 2 is non-trivial as one has to check whether there exist suitable distributions μ\mu, νP\nu_{P} and νQ\nu_{Q} for all possible values of α\alpha and β\beta. We introduce an equivalent definition of PRD(Q,P)\operatorname{PRD{}}(Q,P) in Theorem 2 that does not depend on the distributions μ\mu, νP\nu_{P} and νQ\nu_{Q} and that leads to an elegant algorithm to compute practical PRD curves.

Let PP and QQ be two probability distributions defined on a finite state space Ω\Omega. For λ>0\lambda>0 define the functions

To compare different distributions QiQ_{i}, one may simply plot their respective PRD curves PRD^(Qi,P)\widehat{\operatorname{PRD{}}}(Q_{i},P), while an approximation of the full sets PRD(Qi,P)\operatorname{PRD{}}(Q_{i},P) may be computed by interpolation between PRD^(Qi,P)\widehat{\operatorname{PRD{}}}(Q_{i},P) and the origin. An implementation of the algorithm is available at https://github.com/msmsajjadi/precision-recall-distributions.

4 Connection to total variation distance

Theorem 2 provides a natural interpretation of the proposed approach. For λ=1\lambda=1, we have

where δ(P,Q)\delta(P,Q) denotes the total variation distance between PP and QQ. As such, our notion of precision and recall may be viewed as a generalization of total variation distance.

Application to Deep Generative Models

In this section, we show that the algorithm introduced in Section 3.3 can be readily applied to evaluate precision and recall of deep generative models. In practice, access to PP and QQ is given via samples P^P\hat{P}\sim P and Q^Q\hat{Q}\sim Q. Given that both PP and QQ are continuous distributions, the probability of generating a point sampled from QQ is . Furthermore, there is strong empirical evidence that comparing samples in image space runs the risk of assigning higher quality to a worse model . A common remedy is to apply a pre-trained classifier trained on natural images and to compare P^\hat{P} and Q^\hat{Q} at a feature level. Intuitively, in this feature space the samples should be compared based on statistical regularities in the images rather than random artifacts resulting from the generative process .

Following this line of work, we first use a pre-trained Inception network to embed the samples (i.e. using the Pool3 layer ). We then cluster the union of P^\hat{P} and Q^\hat{Q} in this feature space using mini-batch k-means with k=20k=20 . Intuitively, we reduce the problem to a one dimensional problem where the histogram over the cluster assignments can be meaningfully compared. Hence, failing to produce samples from a cluster with many samples from the true distribution will hurt recall, and producing samples in clusters without many real samples will hurt precision. As the clustering algorithm is randomized, we run the procedure several times and average over the PRD curves. We note that such a clustering is meaningful as shown in Figure 9 in the appendix and that it can be efficiently scaled to very large sample sizes .

We stress that from the point of view of the proposed algorithm, only a meaningful embedding is required. As such, the algorithm can be applied to various data modalities. In particular, we show in Section 4.1 that besides image data the algorithm can be applied to a text generation task.

Mode collapse or mode dropping is a major challenge in GANs . Due to the symmetry of commonly used metrics with respect to precision and recall, the only way to assess whether the model is producing low-quality images or dropping modes is by visual inspection. In stark contrast, the proposed metric can quantitatively disentangle these effects which we empirically demonstrate.

We consider three data sets commonly used in the GAN literature: MNIST , Fashion-MNIST , and CIFAR-10 . These data sets are labeled and consist of 10 balanced classes. To show the sensitivity of the proposed measure to mode dropping and mode inventing, we first fix P^\hat{P} to contain samples from the first 5 classes in the respective test set. Then, for a fixed i=1,,10i=1,\dots,10, we generate a set Q^i\hat{Q}_{i}, which consists of samples from the first ii classes from the training set. As ii increases, Q^i\hat{Q}_{i} covers an increasing number of classes from P^\hat{P} which should result in higher recall. As we increase ii beyond 5, Q^i\hat{Q}_{i} includes samples from an increasing number of classes that are not present in P^\hat{P} which should result in a loss in precision, but not in recall as the other classes are already covered. Finally, the set Q^5\hat{Q}_{5} covers the same classes as P^\hat{P}, so it should have high precision and high recall.

Figure 5 (left) shows the IS and FID for the CIFAR-10 data set (results on the other data sets are shown in Figure 11 in the appendix). Since the IS is not computed w.r.t. a reference distribution, it is invariant to the choice of P^\hat{P}, so as we add classes to Q^i\hat{Q}_{i}, the IS increases. The FID decreases as we add more classes until Q^5\hat{Q}_{5} before it starts to increase as we add spurious modes. Critically, FID fails to distinguish the cases of mode dropping and mode inventing: Q^4\hat{Q}_{4} and Q^6\hat{Q}_{6} share similar FIDs. In contrast, Figure 5 (middle) shows our PRD curves as we vary the number of classes in Q^i\hat{Q}_{i}. Adding correct modes leads to an increase in recall, while adding fake modes leads to a loss of precision.

We also apply the proposed approach on text data as shown in Figure 5 (right). In particular, we use the MultiNLI corpus of crowd-sourced sentence pairs annotated with topic and textual entailment information . After discarding the entailment label, we collect all unique sentences for the same topic. Following , we embed these sentences using a BiLSTM with 2048 cells in each direction and max pooling, leading to a 4096-dimensional embedding . We consider 5 classes from this data set and fix P^\hat{P} to contain samples from all classes to measure the loss in recall for different QiQ_{i}. Figure 5 (right) curves successfully demonstrate the sensitivity of recall to mode dropping.

2 Assessing class imbalances for GANs

In this section we analyze the effect of class imbalance on the PRD curves. Figure 6 shows a pair of GANs trained on MNIST which have virtually the same FID, but very different PRD curves. The model on the left generates a subset of the digits of high quality, while the model on the right seems to generate all digits, but each has low quality. We can naturally interpret this difference via the PRD curves: For a desired recall level of less than 0.6{\sim}0.6, the model on the left enjoys higher precision – it generates several digits of high quality. If, however, one desires a recall higher than 0.6{\sim}0.6, the model on the right enjoys higher precision as it covers all digits. To confirm this, we train an MNIST classifier on the embedding of P^\hat{P} with the ground truth labels and plot the distribution of the predicted classes for both models. The histograms clearly show that the model on the left failed to generate all classes (loss in recall), while the model on the right is producing a more balanced distribution over all classes (high recall). At the same time, the classifier has an average confidenceWe denote the output of the classifier for its highest value at the softmax layer as confidence. The intuition is that higher values signify higher confidence of the model for the given label. of 96.7% on the model on the left compared to 88.6% on the model on the right, indicating that the sample quality of the former is higher. This aligns very well with the PRD plots: samples on the left have high quality but are not diverse in contrast to the samples on the right which are diverse but have low quality.

This analysis reveals a connection to IS which is based on the premise that the conditional label distribution p(yx)p(y|x) should have low entropy, while the marginal p(y)=p(yx=G(z))dzp(y)=\int p(y|x=G(z))dz should have high entropy. To further analyze the relationship between the proposed approach and PRD curves, we plot p(yx)p(y|x) against precision and p(y)p(y) against recall in Figure 10 in the appendix. The results over a large number of GANs and VAEs show a large Spearman correlation of -0.83 for precision and 0.89 for recall. We however stress two key differences between the approaches: Firstly, to compute the quantities in IS one needs a classifier and a labeled data set in contrast to the proposed PRD metric which can be applied on unlabeled data. Secondly, IS only captures losses in recall w.r.t. classes, while our metric measures more fine-grained recall losses (see Figure 8 in the appendix).

3 Application to GANs and VAEs

We evaluate the precision and recall of 7 GAN types and the VAE with 100 hyperparameter settings each as provided by . In order to visualize this vast quantity of models, one needs to summarize the PRD curves. A natural idea is to compute the maximum F1F_{1} score, which corresponds to the harmonic mean between precision and recall as a single-number summary. This idea is fundamentally flawed as F1F_{1} is symmetric. However, its generalization, defined as Fβ=(1+β2)pr(β2p)+rF_{\beta}=(1+\beta^{2})\frac{p\cdot r}{(\beta^{2}p)+r}, provides a way to quantify the relative importance of precision and recall: β>1\beta>1 weighs recall higher than precision, whereas β<1\beta<1 weighs precision higher than recall. As a result, we propose to distill each PRD curve into a pair of values: FβF_{\beta} and F1/βF_{1/\beta}.

Figure 7 compares the maximum F8F_{8} with the maximum F1/8F_{1/8} for these models on the Fashion-MNIST data set. We choose β=8\beta=8 as it offers a good insight into the bias towards precision versus recall. Since F8F_{8} weighs recall higher than precision and F1/8F_{1/8} does the opposite, models with higher recall than precision will lie below the diagonal F8=F1/8F_{8}=F_{1/8} and models with higher precision than recall will lie above. To our knowledge, this is the first metric which confirms the folklore that VAEs are biased towards higher recall, but may suffer from precision issues (e.g., due to blurring effects), at least on this data set. On the right, we show samples from four models on the extreme ends of the plot for all combinations of high and low precision and recall. We have included similar plots on the MNIST, CIFAR-10 and CelebA data sets in the appendix.

Conclusion

Quantitatively evaluating generative models is a challenging task of paramount importance. In this work we show that one-dimensional scores are not sufficient to capture different failure cases of current state-of-the-art generative models. As an alternative, we propose a novel notion of precision and recall for distributions and prove that both notions are theoretically sound and have desirable properties. We then connect these notions to total variation distance as well as FID and IS and we develop an efficient algorithm that can be readily applied to evaluate deep generative models based on samples. We investigate the properties of the proposed algorithm on real-world data sets, including image and text generation, and show that it captures the precision and recall of generative models. Finally, we find empirical evidence supporting the folklore that VAEs produce samples of lower quality, while being less prone to mode collapse than GANs.

References

Appendix A Proofs

We first show the following auxiliary result and then prove Theorems 1 and 2.

Let PP and QQ be probability distributions defined on a finite state space Ω\Omega. Let α(0,1]\alpha\in(0,1] and β(0,1]\beta\in(0,1]. Then, (α,β)PRD(Q,P)(\alpha,\beta)\in\operatorname{PRD{}}(Q,P) if and only if there exists a distribution μ\mu such that for all ωΩ\omega\in\Omega

If (α,β)PRD(Q,P)(\alpha,\beta)\in\operatorname{PRD{}}(Q,P), then (3) and the non-negativity of νP\nu_{P} and νQ\nu_{Q} directly imply (5) for the same choice of μ\mu. Conversely, if (5) holds for a distribution μ\mu, we may define the distributions

By definition α\alpha, β\beta, μ\mu, νP\nu_{P} and νQ\nu_{Q} satisfy (3) in Definition 1 which implies (α,β)PRD(Q,P)(\alpha,\beta)\in\operatorname{PRD{}}(Q,P). ∎

We show each of the properties independently:

i Equality: If (1,1)PRD(Q,P)(1,1)\in\operatorname{PRD{}}(Q,P), then we have by Definition 1 that P=μP=\mu and Q=μQ=\mu which implies P=QP=Q as claimed. Conversely, if P=QP=Q, Definition 1 is satisfied for α=β=1\alpha=\beta=1 by choosing μ=νP=νQ=P\mu=\nu_{P}=\nu_{Q}=P. Hence, (1,1)PRD(Q,P)(1,1)\in\operatorname{PRD{}}(Q,P) as claimed.

ii Disjoint support: We show both directions of the claim by contraposition, i.e., we show supp(P)supp(Q)    PRD(Q,P){(0,0)}\operatorname{supp}(P)\cap\operatorname{supp}(Q)\neq\emptyset\;\Leftrightarrow\;\operatorname{PRD{}}(Q,P)\supset\{(0,0)\}. Consider an arbitrary ωsupp(P)supp(Q)\omega\in\operatorname{supp}(P)\cap\operatorname{supp}(Q). Then, by definition we have P(ω)>0P(\omega)>0 and Q(ω)>0Q(\omega)>0. Let μ\mu be defined as the distribution with μ(ω)=1\mu(\omega)=1 and μ(ω)=0\mu(\omega^{\prime})=0 for all ωΩ{ω}\omega^{\prime}\in\Omega\setminus\{\omega\}. Clearly, it holds that P(ω)P(ω)μ(ω)P(\omega)\geq P(\omega)\mu(\omega) and Q(ω)Q(ω)μ(ω)Q(\omega)\geq Q(\omega)\mu(\omega) for all ωΩ\omega\in\Omega. Hence, by Lemma 1, we have (Q(ω),P(ω))PRD(Q,P)\left(Q(\omega),P(\omega)\right)\in\operatorname{PRD{}}(Q,P) which implies that PRD(Q,P){(0,0)}\operatorname{PRD{}}(Q,P)\supset\{(0,0)\} as claimed. Conversely, PRD(Q,P){(0,0)}\operatorname{PRD{}}(Q,P)\supset\{(0,0)\} implies by Lemma 1 that there exist α(0,1]\alpha\in(0,1] and β(0,1]\beta\in(0,1] as well as a distribution μ\mu satisfying (5). Let ωsupp(μ)\omega\in\operatorname{supp}(\mu) which implies μ(ω)>0\mu(\omega)>0 and thus by (5) also P(ω)>0P(\omega)>0 and Q(ω)>0Q(\omega)>0. Hence, ω\omega is both in the support of PP and QQ which implies supp(P)supp(Q)\operatorname{supp}(P)\cap\operatorname{supp}(Q)\neq\emptyset as claimed.

iii Maximum precision: If (α,β)PRD(Q,P)(\alpha,\beta)\in\operatorname{PRD{}}(Q,P), then by Lemma 1 there exists a distribution μ\mu such that for all ωΩ\omega\in\Omega we have P(ω)βμ(ω)P(\omega)\geq\beta\mu(\omega) and Q(ω)αμ(ω)Q(\omega)\geq\alpha\mu(\omega). P(ω)βμ(ω)P(\omega)\geq\beta\mu(\omega) implies supp(μ)supp(P)\operatorname{supp}(\mu)\subseteq\operatorname{supp}(P) and hence ωsupp(P)μ(ω)=1\sum_{\omega\in\operatorname{supp}(P)}\mu(\omega)=1. Together with Q(ω)αμ(ω)Q(\omega)\geq\alpha\mu(\omega), this yields Q(supp(P))=ωsupp(P)Q(ω)αωsupp(P)μ(ω)=αQ(\operatorname{supp}(P))=\sum_{\omega\in\operatorname{supp}(P)}Q(\omega)\geq\alpha\sum_{\omega\in\operatorname{supp}(P)}\mu(\omega)=\alpha which implies αQ(supp(P))\alpha\leq Q(\operatorname{supp}(P)) for all (α,β)PRD(Q,P)(\alpha,\beta)\in\operatorname{PRD{}}(Q,P).

To prove the claim, we next show that there exists (α,β)PRD(Q,P)(\alpha,\beta)\in\operatorname{PRD{}}(Q,P) with α=Q(supp(P))\alpha=Q(\operatorname{supp}(P)). Let S=supp(P)supp(Q)S=\operatorname{supp}(P)\cap\operatorname{supp}(Q). If S=S=\emptyset, then α=Q(supp(P))=0\alpha=Q(\operatorname{supp}(P))=0 and (0,0)PRD(Q,P)(0,0)\in\operatorname{PRD{}}(Q,P) by Definition 2 as claimed. For the case SS\neq\emptyset, let β=minωSP(ω)Q(S)Q(ω)\beta=\min_{\omega\in S}\frac{P(\omega)Q(S)}{Q(\omega)}. By definition of SS, we have β>0\beta>0. Furthermore, βP(S)1\beta\leq P(S)\leq 1 since P(ω)P(S)Q(ω)Q(S)\frac{P(\omega)}{P(S)}\leq\frac{Q(\omega)}{Q(S)} for at least one ωS\omega\in S. Consider the distribution μ\mu where μ(ω)=Q(ω)Q(S)\mu(\omega)=\frac{Q(\omega)}{Q(S)} for all ωS\omega\in S and μ(ω)=0\mu(\omega)=0 for ωΩS\omega\in\Omega\setminus S. By construction, μ\mu satisfies (5) in Lemma 1 and hence (α,β)PRD(Q,P)(\alpha,\beta)\in\operatorname{PRD{}}(Q,P) as claimed.

iv Maximum recall: This follows directly from applying Property vi to Property iii.

v Monotonicity: If (α,β)PRD(Q,P)(\alpha,\beta)\in\operatorname{PRD{}}(Q,P), then by Lemma 1 there exists a distribution μ\mu such that for all ωΩ\omega\in\Omega we have that P(ω)βμ(ω)P(\omega)\geq\beta\mu(\omega) and Q(ω)αμ(ω)Q(\omega)\geq\alpha\mu(\omega). For α(0,α]\alpha^{\prime}\in(0,\alpha] and β(0,β]\beta^{\prime}\in(0,\beta], it follows that P(ω)βμ(ω)P(\omega)\geq\beta^{\prime}\mu(\omega) and Q(ω)αμ(ω)Q(\omega)\geq\alpha^{\prime}\mu(\omega) for all ωΩ\omega\in\Omega. By Lemma 1 this implies (α,β)PRD(Q,P)(\alpha^{\prime},\beta^{\prime})\in\operatorname{PRD{}}(Q,P) as claimed.

vi Duality: This follows directly from switching α\alpha with β\beta, PP with QQ and νP\nu_{P} with νQ\nu_{Q} in Definition 1.

A.2 Proof of Theorem 2

We first show that PRD(Q,P){(θα(λ),θβ(λ))λ(0,),θ}\operatorname{PRD{}}(Q,P)\subseteq\left\{\left(\theta\alpha(\lambda),\theta\beta(\lambda)\right)\mid\lambda\in(0,\infty),\theta\in\right\}. We consider an arbitrary element (α,β)PRD(Q,P)(\alpha^{\prime},\beta^{\prime})\in\operatorname{PRD{}}(Q,P) and show that (α,β)=(θα(λ),θβ(λ))(\alpha^{\prime},\beta^{\prime})=\left(\theta\alpha(\lambda),\theta\beta(\lambda)\right) for some λ(0,)\lambda\in(0,\infty) and θ\theta\in. For the case (α,β)=(0,0)(\alpha^{\prime},\beta^{\prime})=(0,0), the result holds trivially for the choice of λ=1\lambda=1 and θ=0\theta=0. For the case (α,β)(0,0)(\alpha^{\prime},\beta^{\prime})\neq(0,0), we choose λ=αβ\lambda=\frac{\alpha^{\prime}}{\beta^{\prime}} and θ=ββ(λ)\theta=\frac{\beta^{\prime}}{\beta(\lambda)}. Since α(λ)=λβ(λ)\alpha(\lambda)=\lambda\beta(\lambda) by definition, this implies (α,β)=(θα(λ),θβ(λ))(\alpha^{\prime},\beta^{\prime})=\left(\theta\alpha(\lambda),\theta\beta(\lambda)\right) as required. Furthermore, λ(0,)\lambda\in(0,\infty) since by Definitions 1 and 2 α>0\alpha^{\prime}>0 if and only if β>0\beta^{\prime}>0. Similarly, we show that θ\theta\in: By Lemma 1 there exists a distribution μ\mu such that βμ(ω)P(ω)\beta^{\prime}\mu(\omega)\leq P(\omega) and αμ(ω)Q(ω)\alpha^{\prime}\mu(\omega)\leq Q(\omega) for all ωΩ\omega\in\Omega. This implies that βμ(ω)Q(ω)λ\beta^{\prime}\mu(\omega)\leq\frac{Q(\omega)}{\lambda} and thus βμ(ω)min(P(ω),Q(ω)λ)\beta^{\prime}\mu(\omega)\leq\min\left(P(\omega),\frac{Q(\omega)}{\lambda}\right) for all ωΩ\omega\in\Omega. Summing over all ωΩ\omega\in\Omega, we obtain βωΩmin(P(ω),Q(ω)λ)=β(λ)\beta^{\prime}\leq\sum_{\omega\in\Omega}\min\left(P(\omega),\frac{Q(\omega)}{\lambda}\right)=\beta(\lambda) which implies θ\theta\in.

Finally, we show that PRD(Q,P){(θα(λ),θβ(λ))λ(0,),θ}\operatorname{PRD{}}(Q,P)\supseteq\left\{\left(\theta\alpha(\lambda),\theta\beta(\lambda)\right)\mid\lambda\in(0,\infty),\theta\in\right\}. Consider arbitrary λ(0,)\lambda\in(0,\infty) and θ\theta\in. If β(λ)=0\beta(\lambda)=0, the claim holds trivially since (0,0)PRD(Q,P)(0,0)\in\operatorname{PRD{}}(Q,P). Otherwise, define the distribution μ\mu by μ(ω)=min(P(ω),Q(ω)λ)/β(λ)\mu(\omega)=\min\left(P(\omega),\frac{Q(\omega)}{\lambda}\right)/\beta(\lambda) for all ωΩ\omega\in\Omega. By definition, β(λ)μ(ω)min(P(ω),Q(ω)λ)P(ω)\beta(\lambda)\mu(\omega)\leq\min\left(P(\omega),\frac{Q(\omega)}{\lambda}\right)\leq P(\omega) for all ωΩ\omega\in\Omega. Similarly, α(λ)μ(ω)min(λP(ω),Q(ω))Q(ω)\alpha(\lambda)\mu(\omega)\leq\min\left(\lambda P(\omega),Q(\omega)\right)\leq Q(\omega) for all ωΩ\omega\in\Omega since α(λ)=λβ(λ)\alpha(\lambda)=\lambda\beta(\lambda). Because θ\theta\in, this implies θβ(λ)μ(ω)P(ω)\theta\beta(\lambda)\mu(\omega)\leq P(\omega) and θα(λ)μ(ω)Q(ω)\theta\alpha(\lambda)\mu(\omega)\leq Q(\omega) for all ωΩ\omega\in\Omega. Hence, by Lemma 1, (θα(λ),θβ(λ))PRD(Q,P)\left(\theta\alpha(\lambda),\theta\beta(\lambda)\right)\in\operatorname{PRD{}}(Q,P) for all λ(0,)\lambda\in(0,\infty) and θ\theta\in as claimed. ∎

Appendix B Further figures