What Makes for Good Views for Contrastive Learning?

Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola

Introduction

It is common sense that how you look at an object does not change its identity. Nonetheless, Jorge Luis Borges imagined the alternative. In his short story on Funes the Memorious, the titular character becomes bothered that a “dog at three fourteen (seen from the side) should have the same name as the dog at three fifteen (seen from the front)”. The curse of Funes is that he has a perfect memory, and every new way he looks at the world reveals a percept minutely distinct from anything he has seen before. He cannot collate the disparate experiences.

Most of us, fortunately, do not suffer from this curse. We build mental representations of identity that discard nuisances like time of day and viewing angle. The ability to build up view-invariant representations is central to a rich body of research on multiview learning. These methods seek representations of the world that are invariant to a family of viewing conditions. Currently, a popular paradigm is contrastive multiview learning, where two views of the same scene are brought together in representation space, and two views of different scenes are pushed apart.

This is a natural and powerful idea but it leaves open an important question: “which viewing conditions should we be invariant to?" It’s possible to go too far: if our task is to classify the time of day then we certainly should not use a representation that is invariant to time. Or, like Funes, we could go not far enough: representing each specific viewing angle independently would cripple our ability to track a dog as it moves about a scene.

We therefore seek representations with enough invariance to be robust to inconsequential variations but not so much as to discard information required by downstream tasks. In contrastive learning, the choice of “views" is what controls the information the representation captures, as the framework results in representations that focus on the shared information between views . Views are commonly different sensory signals, like photos and sounds , or different image channels or slices in time , but may also be different “augmented" versions of the same data tensor . If the shared information is small, then the learned representation can discard more information about the input and achieve a greater degree of invariance against nuisance variables. How can we find the right balance of views that share just the information we need, no more and no less?

We investigate this question in two ways: 1) we demonstrate that the optimal choice of views depends critically on the downstream task. If you know the task, it is often possible to design effective views. 2) We empirically demonstrate that for many common ways of generating views, there is a sweet spot in terms of downstream performance where the mutual information (MI) between views is neither too high nor too low.

Our analysis suggests an “InfoMin principle". A good set of views are those that share the minimal information necessary to perform well at the downstream task. This idea is related to the idea of minimal sufficient statistics and the Information Bottleneck theory , which have been previously articulated in the representation learning literature. This principle also complements the already popular “InfoMax principle" , which states that a goal in representation learning is to capture as much information as possible about the stimulus. We argue that maximizing information is only useful in so far as that information is task-relevant. Beyond that point, learning representations that throw out information about nuisance variables is preferable as it can improve generalization and decrease sample complexity on downstream tasks .

Based on our findings, we also introduce a semi-supervised method to learn views that are effective for learning good representations when the downstream task is known. We additionally demonstrate that the InfoMin principle can be practically applied by simply seeking stronger data augmentation to further reduce mutual information toward the sweet spot. This effort results in state of the art accuracy on a standard benchmark.

Our contributions include:

Demonstrating that optimal views for contrastive representation learning are task-dependent.

Empirically finding a U-shaped relationship between an estimate of mutual information and representation quality in a variety of settings.

A new semi-supervised method to learn effective views for a given task.

Applying our understanding to achieve state of the art accuracy of $73.0\%$ on the ImageNet linear readout benchmark with a ResNet-50.

Related Work

Recently the most competitive methods for learning representations without labels have been self-supervised contrastive representation learning . These methods learn representations by a “contrastive” loss which pushes apart dissimilar data pairs while pulling together similar pairs, an idea similar to exemplar learning . Models based on contrastive losses have significantly outperformed other approaches .

One of the major design choices in contrastive learning is how to select the similar (or positive) and dissimilar (or negative) pairs. The standard approach for generating positive pairs without additional annotations is to create multiple views of each datapoint. For example: luminance and chrominance decomposition , randomly augmenting an image twice , using different time-steps of videos , patches of the same image , multiple sensory data , text and its context , or representations of student and teacher models . Negative pairs can be randomly chosen images/videos/texts. Theoretically, we can think of the positive pairs as coming from a joint distribution over views $p(\mathbf{v_{1}},\mathbf{v_{2}})$ , and the negative pairs from a product of marginals $p(\mathbf{v_{1}})p(\mathbf{v_{2}})$ . The contrastive learning objective InfoNCE (or Deep InfoMax ) is developed to maximize a lower bound on the mutual information between the two views $I(\mathbf{v_{1}};\mathbf{v_{2}})$ . Such connection has been discussed further in .

Leveraging labeled data in contrastive representation learning has been shown to guide representations towards task-relevant features that improve performance. Here we use labeled data to learn better views, but still perform contrastive learning using only unlabeled data. Future work could combine these approaches to leverage labels for both view learning and representation learning. In addition, previous work has studied the effects of augmentation with different amounts of images.

What Are the Optimal Views for Contrastive Learning?

In this section, we first introduce the standard multiview contrastive representation learning formulation, and then investigate what would be the optimal views for contrastive learning.

Given two random variables $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ , the goal of contrastive learning is to learn a parametric function to discriminate between samples from the empirical joint distribution $p(\mathbf{v_{1}})p(\mathbf{v_{2}}|\mathbf{v_{1}})$ and samples from the product of marginals $p(\mathbf{v_{1}})p(\mathbf{v_{2}})$ . The resulting function is an estimator of the mutual information between $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ , and the InfoNCE loss has been shown to maximize a lower bound on $I(\mathbf{v_{1}};\mathbf{v_{2}})$ . In practice, given an anchor point $\mathbf{v_{1,i}}$ , the InfoNCE loss is optimized to score the correct positive $\mathbf{v_{2,i}}\sim p(\mathbf{v_{2}}|\mathbf{v_{1,i}})$ higher compared to a set of $K$ distractors $\mathbf{v_{2,j}}\sim p(\mathbf{v_{2}})$ :
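For reference, a standard form of the InfoNCE loss that matches the notation used here (score function $h$, one positive among $K$ candidates) can be written as follows. The exact normalization is an assumption, chosen to be consistent with the bound $\log(K)-\mathcal{L}_{\text{NCE}}$ stated next:

$$\mathcal{L}_{\text{NCE}}=-\mathbb{E}\left[\log\frac{e^{h(\mathbf{v_{1,i}},\mathbf{v_{2,i}})}}{\sum_{j=1}^{K}e^{h(\mathbf{v_{1,i}},\mathbf{v_{2,j}})}}\right]$$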

Minimizing this loss equivalently maximizes a lower bound (a.k.a. $I_{\text{NCE}}(\mathbf{v_{1}};\mathbf{v_{2}})$ ) on $I(\mathbf{v_{1}};\mathbf{v_{2}})$ , i.e., $I(\mathbf{v_{1}};\mathbf{v_{2}})\geq\log(K)-\mathcal{L}_{\text{NCE}}=I_{\text{NCE}}(\mathbf{v_{1}};\mathbf{v_{2}})$ . In practice, $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ are two views of the data $\mathbf{x}$ , such as different augmentations of the same image , different image channels , or video and text pairs . The score function $h(\cdot,\cdot)$ typically consists of two encoders ( $f_{1}$ for $\mathbf{v_{1}}$ and $f_{2}$ for $\mathbf{v_{2}}$ ), which may or may not share parameters depending on whether $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ are from the same domain. The resulting representations are $\mathbf{z_{1}}=f_{1}(\mathbf{v_{1}})$ and $\mathbf{z_{2}}=f_{2}(\mathbf{v_{2}})$ (see Fig. 1a).
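The loss and the resulting estimate $I_{\text{NCE}}=\log(K)-\mathcal{L}_{\text{NCE}}$ can be sketched numerically. The snippet below is a minimal illustration and not the authors' implementation: it assumes the embeddings $\mathbf{z_{1}},\mathbf{z_{2}}$ are already computed, uses a temperature-scaled cosine similarity as the score $h$, and treats the other rows of the batch as distractors, so $K$ equals the batch size. The function name `info_nce` and the `temperature` parameter are hypothetical.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Return (L_NCE, I_NCE) for paired embeddings, where z1[i] matches z2[i].

    Assumptions for this sketch: the score h is a cosine similarity divided by
    a temperature, and the candidates for each anchor are all rows of z2.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    scores = z1 @ z2.T / temperature  # scores[i, j] = h(v1_i, v2_j)
    # Numerically stable log-softmax over the candidates j of each anchor i.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    loss = -np.mean(np.diag(log_prob))  # positives sit on the diagonal
    k = z2.shape[0]
    return loss, np.log(k) - loss

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_same, mi_same = info_nce(z, z)                         # strongly dependent views
loss_rand, mi_rand = info_nce(z, rng.normal(size=(8, 16)))  # independent views
```

Because the estimate can never exceed $\log(K)$, comparisons of $I_{\text{NCE}}$ across view choices are only meaningful when $K$ and the encoders are held fixed.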

Definition 1. (Sufficient Encoder) The encoder $f_{1}$ of $\mathbf{v_{1}}$ is sufficient in the contrastive learning framework if and only if $I(\mathbf{v_{1}};\mathbf{v_{2}})=I(f_{1}(\mathbf{v_{1}});\mathbf{v_{2}})$.

Intuitively, the encoder $f_{1}$ is sufficient if the amount of information in $\mathbf{v_{1}}$ about $\mathbf{v_{2}}$ is lossless during the encoding procedure. In other words, $\mathbf{z_{1}}$ has kept all the information that the contrastive learning objective requires. Symmetrically, $f_{2}$ is sufficient if $I(\mathbf{v_{1}};\mathbf{v_{2}})=I(\mathbf{v_{1}};f_{2}(\mathbf{v_{2}}))$ .

Definition 2. (Minimal Sufficient Encoder) A sufficient encoder $f_{1}$ of $\mathbf{v_{1}}$ is minimal if and only if $I(f_{1}(\mathbf{v_{1}});\mathbf{v_{1}})\leq I(f(\mathbf{v_{1}});\mathbf{v_{1}})$ for all $f$ that are sufficient.

Among those encoders which are sufficient, the minimal ones only extract relevant information of the contrastive task and throw away other irrelevant information. This is appealing in cases where the views are constructed in a way that all the information we care about is shared between them.

The representations learned in the contrastive framework are typically used in a separate downstream task. To characterize what representations are good for a downstream task, we define the optimality of representations. To make notation simple, we use $\mathbf{z}$ to mean it can be either $\mathbf{z_{1}}$ or $\mathbf{z_{2}}$ .

Definition 3. (Optimal Representation of a Task) For a task $\mathcal{T}$ whose goal is to predict a semantic label $\mathbf{y}$ from the input data $\mathbf{x}$, the optimal representation $\mathbf{z}^{*}$ encoded from $\mathbf{x}$ is the minimal sufficient statistic with respect to $\mathbf{y}$.

This says a model built on top of $\mathbf{z}^{*}$ has all the information necessary to predict $\mathbf{y}$ as accurately as if it were to access $\mathbf{x}$. Furthermore, $\mathbf{z}^{*}$ maintains the smallest complexity, i.e., it contains no other information besides that about $\mathbf{y}$, which makes it more generalizable. We refer the reader to prior work for a more in-depth discussion of optimal visual representations and minimal sufficient statistics.

3.2 Three Regimes of Information Captured

As our representations $\mathbf{z_{1}},\mathbf{z_{2}}$ are built from our views and learned by the contrastive objective with the assumption of minimal sufficient encoders, the amount and type of information shared between $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ (i.e., $I(\mathbf{v_{1}};\mathbf{v_{2}})$ ) determines how well we perform on downstream tasks. As in information bottleneck , we can trace out a tradeoff between how much information our views share about the input, and how well our learned representation performs at predicting $\mathbf{y}$ for a task. Depending on how our views are constructed, we may find that we are keeping too many irrelevant variables while discarding relevant variables, leading to suboptimal performance on the information plane. Alternatively, we can find the views that maximize $I(\mathbf{v_{1}};\mathbf{y})$ and $I(\mathbf{v_{2}};\mathbf{y})$ (how much information is contained about the task label) while minimizing $I(\mathbf{v_{1}};\mathbf{v_{2}})$ (how much information is shared about the input, including both task-relevant and irrelevant information). Even in the case of these optimal traces, there are three regimes of performance we can consider that are depicted in Fig. 1b, and have been discussed previously in information bottleneck literature :

Missing information: When $I(\mathbf{v_{1}};\mathbf{v_{2}})<I(\mathbf{x};\mathbf{y})$ , there is information about the task-relevant variable that is discarded by the view, degrading performance.

Sweet spot: When $I(\mathbf{v_{1}};\mathbf{y})=I(\mathbf{v_{2}};\mathbf{y})=I(\mathbf{v_{1}};\mathbf{v_{2}})=I(\mathbf{x};\mathbf{y})$ , the only information shared between $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ is task-relevant, and there is no irrelevant noise.

Excess noise: As we increase the amount of information shared in the views beyond $I(\mathbf{x};\mathbf{y})$ , we begin to include additional information that is irrelevant for the downstream task. This can lead to worse generalization on the downstream task .

We hypothesize that the best performing views will be close to the sweet spot: containing as much task-relevant information as possible while discarding as much irrelevant information in the input as possible. More formally, the following InfoMin proposition articulates which views are optimal supposing that we know the specific downstream task $\mathcal{T}$ in advance. The proof is in Section A.2 of the Appendix.

Proposition 3.1. Suppose $f_{1}$ and $f_{2}$ are minimal sufficient encoders. Given a downstream task $\mathcal{T}$ with label $\mathbf{y}$, the optimal views created from the data $\mathbf{x}$ are $(\mathbf{v_{1}}^{*},\mathbf{v_{2}}^{*})=\operatorname*{arg\,min}_{\mathbf{v_{1}},\mathbf{v_{2}}}I(\mathbf{v_{1}};\mathbf{v_{2}})$, subject to $I(\mathbf{v_{1}};\mathbf{y})=I(\mathbf{v_{2}};\mathbf{y})=I(\mathbf{x};\mathbf{y})$. Given $\mathbf{v_{1}}^{*},\mathbf{v_{2}}^{*}$, the representation $\mathbf{z_{1}^{*}}$ (or $\mathbf{z_{2}^{*}}$) learned by contrastive learning is optimal for $\mathcal{T}$ (Def. 3), thanks to the minimality and sufficiency of $f_{1}$ and $f_{2}$.

Unlike in information bottleneck, for contrastive learning we often do not have access to a fully-labeled training set that specifies the downstream task in advance, and thus evaluating how much task-relevant information is contained in the views and representation at training time is challenging. Instead, the construction of views has typically been guided by domain knowledge that alters the input while preserving the task-relevant variable.

3.3 View Selection Influences Mutual Information and Accuracy

The above analysis suggests that transfer performance will be upper-bounded by a reverse-U shaped curve (Fig. 1b, right), with the sweet spot at the top of the curve. In theory, when the mutual information between views is changed, information about the downstream task and nuisance variables can be selectively included or excluded, biasing the learned representation, as shown in Fig. 2. The upper-bound reverse-U might not be reached if views are selected that share noise rather than signal. But practically, a recent study suggests that the reverse-U shape is quite common. Here we show several examples where reducing $I(\mathbf{v_{1}};\mathbf{v_{2}})$ improves downstream accuracy. We use $I_{\text{NCE}}$ as a neural proxy for $I$ , and note it depends on network architectures. Therefore for each plot in this paper, we only vary the input views while keeping other settings the same, to make the results comparable.

Example 1: Reducing $I(\mathbf{v_{1}};\mathbf{v_{2}})$ with spatial distance. We create views by randomly cropping two patches of size 64x64 from the same image with various offsets. Namely, one patch starts at position $(x,y)$ while the other starts at $(x+d,y+d)$, with $(x,y)$ randomly generated. We increase $d$ from $64$ to $384$, and sample patches from inside high resolution images in the DIV2K dataset. After the contrastive training stage, we evaluate on STL-10 and CIFAR-10 by freezing the encoder and training a linear classifier. The plots in Fig. 3 show mutual information vs. accuracy. The results show that the reverse-U curve is consistent across both STL-10 and CIFAR-10. We can identify the sweet spot at $d=128$. More details are provided in the Appendix.
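The view construction in Example 1 can be sketched as follows. This is an illustrative sketch only: the function name `sample_patch_pair`, the uniform sampling of $(x,y)$, and the boundary handling are assumptions, since the text states only that $(x,y)$ is randomly generated and that the second patch is offset by $d$ along both axes.

```python
import numpy as np

def sample_patch_pair(image, d, patch=64, rng=None):
    """Crop two patch x patch views whose top-left corners differ by (d, d).

    image: array of shape (H, W) or (H, W, C).
    Returns (v1, v2, (x, y)) where v1 starts at (x, y) and v2 at (x + d, y + d).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    max_y, max_x = h - patch - d, w - patch - d
    if max_y < 0 or max_x < 0:
        raise ValueError("image too small for this patch size and offset")
    # Assumption: (x, y) is drawn uniformly among positions where both patches fit.
    y = int(rng.integers(0, max_y + 1))
    x = int(rng.integers(0, max_x + 1))
    v1 = image[y:y + patch, x:x + patch]
    v2 = image[y + d:y + d + patch, x + d:x + d + patch]
    return v1, v2, (x, y)

# Toy image whose pixel value encodes its own position, to make the offset visible.
image = np.arange(512 * 512).reshape(512, 512)
v1, v2, (x, y) = sample_patch_pair(image, d=128, rng=np.random.default_rng(0))
```

With a patch size of 64, any $d\geq 64$ yields non-overlapping patches, so increasing $d$ changes only how far apart the two views are.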

Example 2: Reducing $I(\mathbf{v_{1}};\mathbf{v_{2}})$ with different color spaces. The correlation between channels may vary significantly across different color spaces. Following prior work, we split each color space into two views, such as $\{Y,DbDr\}$ and $\{R,GB\}$. We perform contrastive learning on STL-10, and measure the representation quality by linear classification accuracy on STL-10 and segmentation performance on NYU-V2 images. As shown in Fig. 4, the downstream performance keeps increasing as $I_{\text{NCE}}$ decreases for both classification and segmentation. Here we do not observe the left half of the reverse U-shape, but in Sec. 4.2 we will show a learning method that generates color spaces which reveal the full shape and touch the sweet spot.
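As a concrete illustration of the $\{Y,DbDr\}$ split, the sketch below converts an RGB image to YDbDr and separates the luminance channel from the two chrominance channels. The matrix holds the commonly cited YDbDr coefficients; whether the experiments used exactly these values or applied any additional normalization is an assumption, and the function name `split_ydbdr` is hypothetical.

```python
import numpy as np

# Commonly cited RGB -> YDbDr coefficients (rows: Y, Db, Dr).
RGB_TO_YDBDR = np.array([
    [0.299, 0.587, 0.114],
    [-0.450, -0.883, 1.333],
    [-1.333, 1.116, 0.217],
])

def split_ydbdr(rgb):
    """Map an (H, W, 3) RGB image to two views: Y (H, W, 1) and DbDr (H, W, 2)."""
    ydbdr = rgb @ RGB_TO_YDBDR.T  # per-pixel linear transform
    return ydbdr[..., :1], ydbdr[..., 1:]

# A uniform gray image carries no chrominance, so the DbDr view is zero.
gray = np.full((4, 4, 3), 0.5)
v1, v2 = split_ydbdr(gray)
```

The analogous $\{R,GB\}$ split needs no transform: it simply slices the first channel from the remaining two.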

3.4 Data Augmentation to Reduce Mutual Information between Views

Multiple views can also be generated through augmenting an input in different ways. We can unify several recent contrastive learning methods through the perspective of view generation: despite differences in architecture, objective, and engineering tricks, all recent contrastive learning methods create two views $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ that implicitly follow the InfoMin principle. Below, we consider several recent works in this framework:

CMC. CMC further splits images across color channels such that $\mathbf{v}_{1}^{\text{cmc}}$ is the first color channel of $\mathbf{v}_{1}$, and $\mathbf{v}_{2}^{\text{cmc}}$ is the last two channels of $\mathbf{v}_{2}$. By this design, $I(\mathbf{v}_{1}^{\text{cmc}};\mathbf{v}_{2}^{\text{cmc}})\leq I(\mathbf{v}_{1};\mathbf{v}_{2})$ is theoretically guaranteed, and we observe that CMC performs better than InstDis.

PIRL. PIRL keeps $\mathbf{v}_{1}^{\text{pirl}}=\mathbf{v}_{1}$ but transforms the other view $\mathbf{v}_{2}$ with random JigSaw shuffling $h$ to get $\mathbf{v}_{2}^{\text{pirl}}=h(\mathbf{v}_{2})$. Similarly, we have $I(\mathbf{v}_{1}^{\text{pirl}};\mathbf{v}_{2}^{\text{pirl}})\leq I(\mathbf{v}_{1};\mathbf{v}_{2})$ as $h(\cdot)$ introduces randomness.

CPC . Different from the above methods that create views at the image level, CPC gets views $\mathbf{v}_{1}^{\text{cpc}}$ , $\mathbf{v}_{2}^{\text{cpc}}$ from local patches with strong data augmentation (e.g., RA ) which results in smaller $I(\mathbf{v}_{1}^{\text{cpc}};\mathbf{v}_{2}^{\text{cpc}})$ . As in Sec. 3, cropping views from disjoint patches also reduces $I(\mathbf{v}_{1}^{\text{cpc}};\mathbf{v}_{2}^{\text{cpc}})$ .

Besides, we also analyze how changing the magnitude parameter of individual augmentation functions traces out reverse-U shapes. We consider RandomResizedCrop and Color Jittering. For the former, a parameter c sets a lower bound on the crop area, and a smaller c indicates stronger augmentation. For the latter, a parameter x controls the strength. The plots on ImageNet are shown in Fig. 5, where we identify a sweet spot at $1.0$ for Color Jittering and $0.2$ for RandomResizedCrop.

Motivated by the InfoMin principle, we propose a new set of data augmentations, called InfoMin Aug. In combination with the JigSaw strategy proposed in PIRL, our InfoMin Aug achieves $73.0\%$ top-1 accuracy on the ImageNet linear readout benchmark with ResNet-50, outperforming SimCLR by nearly $4\%$, as shown in Table 1. We also found that transferring our models, pre-trained without labels, to PASCAL VOC object detection and COCO instance segmentation consistently outperforms supervised ImageNet pre-training. More details and results are in the Appendix.

One goal of unsupervised pre-training is to learn transferable representations that are beneficial for downstream tasks. The rapid progress of many vision tasks in past years can be ascribed to the paradigm of fine-tuning models that are initialized from supervised pre-training on ImageNet. When transferring to PASCAL VOC and COCO , we found our InfoMin pre-training consistently outperforms supervised pre-training as well as other unsupervised pre-training methods.

COCO Object Detection/Segmentation. Feature normalization has been shown to be important during fine-tuning . Therefore, we fine-tune the backbone with Synchronized BN (SyncBN ) and add SyncBN to newly initialized layers (e.g., FPN ). Table 2 reports the bounding box AP and mask AP on val2017 on COCO, using the Mask R-CNN R50-FPN pipeline. All results are reported on Detectron2 .

We have tried different popular detection frameworks with various backbones, extended the fine-tuning schedule (e.g., 6$\times$ schedule), and compared InfoMin ResNeXt-152 trained on ImageNet-1k with supervised ResNeXt-152 trained on ImageNet-5k (6 times larger than ImageNet-1k). In all cases, InfoMin consistently outperforms supervised pre-training. Please see Section D for more detailed comparisons.

Pascal VOC Object Detection. We strictly follow the setting introduced in prior work. Specifically, we use Faster R-CNN with the R50-C4 architecture. We fine-tune all layers for 24000 iterations, each consisting of 16 images. The results are reported in Table 3.

Learning views for contrastive learning

Hand-designed data augmentation is an effective method for generating views that have reduced mutual information and strong transfer performance for images. However, as contrastive learning is applied to new domains, generating views through careful construction of data augmentation strategies may prove ineffective. Furthermore, the types of views that are useful depend on the downstream task. Here we show the task-dependence of optimal views on a simple toy problem, and propose unsupervised and semi-supervised methods to learn views from data.

To understand how the choice of views impacts the representations learned by contrastive learning, we construct a toy dataset that mixes three tasks. We build our toy dataset by combining Moving-MNIST (consisting of videos where digits move inside a black canvas with constant speed and bounce off of image boundaries) with a fixed background image sampled from the STL-10 dataset. We call this dataset Colorful Moving-MNIST, which consists of three factors of variation in each frame: the class of the digit, the position of the digit, and the class of background image (see Appendix for more details). Here we analyze how the choice of views impacts which of these factors are extracted by contrastive learning.

Setup. We fix view $\mathbf{v_{1}}$ as the sequence of past frames $\mathbf{x}_{1:k}$. For simplicity, we consider $\mathbf{v_{2}}$ as a single image, and construct it by referring to a future frame $\mathbf{x}_{t}$ with $t>k$. One example is visualized in Fig. 9; please refer to the Appendix for more details. We consider 3 downstream tasks for an image: (1) predict the digit class; (2) localize the digit; (3) classify the background image (10 classes from STL-10). This is performed by freezing the backbone and training a linear task-specific head. We also provide a “supervised” baseline that is trained end-to-end for comparison.

Single Factor Shared. We consider the case that $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ only share one of the three factors: digit, position, or background. We synthesize $\mathbf{v_{2}}$ by setting one of the three factors the same as $\mathbf{x}_{t}$ but randomly picking the other two. In such cases, the mutual information $I(\mathbf{v_{1}};\mathbf{v_{2}})$ is either about digit, position, or background. The results are summarized in Table 4, which clearly shows that the performance is significantly affected by what is shared between $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ . Specifically, if the downstream task is relevant to one factor, $I(\mathbf{v_{1}};\mathbf{v_{2}})$ should include that factor rather than others. For example, when $\mathbf{v_{2}}$ only shares background image with $\mathbf{v_{1}}$ , contrastive learning can hardly learn representations that capture digit class and location.

Multiple Factors Shared. We further explore how representation quality is changed if $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ share multiple factors. We follow a similar procedure as above to control factors shared by $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$, and present the results in Table 4. We found that one factor can overwhelm another; for instance, whenever background is shared, the latent representation leaves out information for discriminating or localizing digits. This might be because the information bits of the background predominate, and the encoder chooses the background as a “shortcut” to solve the contrastive pre-training task. When $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ share digit and position, the former is preferred over the latter.

4.2 Synthesizing Views with Invertible Generators

In this section, we design unsupervised and semi-supervised methods that synthesize novel views following the InfoMin principle. Concretely, we extend the color space experiments in Sec. 3.3 by learning flow-based models that transfer natural color spaces into novel color spaces, from which we split the channels to get views. We still refer to the outputs of the flow-based models as color spaces because the flows are designed to be pixel-wise and bijective (by their nature), which follows the property of color space conversion. After the views have been learned, we perform standard contrastive learning followed by linear classifier evaluation.

Practically, the flow-based model $g$ is restricted to pixel-wise 1x1 convolutions and ReLU activations, operating independently on each pixel. We try both volume preserving (VP) and non-volume preserving (NVP) flows. For an input image $X$ , the splitting over channels is represented as $\{X_{1},X_{2:3}\}$ . $\hat{X}$ signifies the transformed image, i.e., $\hat{X}=g(X)$ . Experiments are conducted on STL-10, which includes 100k unlabeled and 5k labeled images. More details are in Appendix.
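To make the notion of a pixel-wise, bijective generator concrete, the sketch below implements one additive coupling layer applied independently to every pixel. It is an illustration under stated assumptions and not the authors' architecture: the class name `AdditiveCoupling`, the hidden width, and the random (untrained) weights are hypothetical, and additive coupling is one standard way to obtain a volume preserving map. The per-pixel two-layer network with ReLU plays the role of stacked 1x1 convolutions.

```python
import numpy as np

class AdditiveCoupling:
    """Per-pixel map y1 = x1, y2 = x2 + m(x1) on a 3-channel image.

    m is a small ReLU network applied to each pixel independently, which is
    equivalent to 1x1 convolutions. The Jacobian is triangular with unit
    diagonal, so the map is volume preserving and exactly invertible.
    """

    def __init__(self, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.5, size=(1, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.5, size=(hidden, 2))

    def _shift(self, x1):
        # (H, W, 1) -> (H, W, hidden) -> (H, W, 2)
        return np.maximum(x1 @ self.w1 + self.b1, 0.0) @ self.w2

    def forward(self, x):
        x1, x2 = x[..., :1], x[..., 1:]
        return np.concatenate([x1, x2 + self._shift(x1)], axis=-1)

    def inverse(self, y):
        y1, y2 = y[..., :1], y[..., 1:]
        return np.concatenate([y1, y2 - self._shift(y1)], axis=-1)

g = AdditiveCoupling()
X = np.random.default_rng(1).uniform(size=(8, 8, 3))
X_hat = g.forward(X)
views = (X_hat[..., :1], X_hat[..., 1:])  # the split {X_hat_1, X_hat_2:3}
X_rec = g.inverse(X_hat)
```

A single layer leaves the first channel untouched, so a practical generator would presumably stack several such layers with the roles of the channels permuted between them; a non-volume preserving variant would add a learned per-pixel scaling to the shift.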

The idea is to leverage an adversarial training strategy . Given $\hat{X}=g(X)$ , we train two encoders $f_{1},f_{2}$ to maximize $I_{\text{NCE}}(\hat{X}_{1};\hat{X}_{2:3})$ as in Eqn. 1, similar to the discriminator of GAN . Meanwhile, $g$ is adversarially trained to minimize $I_{\text{NCE}}(\hat{X}_{1};\hat{X}_{2:3})$ . Formally, the objective is:
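A min-max objective consistent with this description, written in the notation of this section, is given below. The superscript on $I_{\text{NCE}}$, marking the encoders that parameterize the estimate, is a notational assumption:

$$\min_{g}\;\max_{f_{1},f_{2}}\;I_{\text{NCE}}^{f_{1},f_{2}}\big(g(X)_{1};\,g(X)_{2:3}\big)$$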

Alternatively, one may use other MI bounds, but we find $I_{\text{NCE}}$ works well and keep using it. We note that the invertibility of $g(\cdot)$ prevents it from learning degenerate or trivial solutions.

Results. We experiment with RGB and YDbDr. As shown in Fig. 7(a), a reverse U-shape of $I_{\text{NCE}}$ and downstream accuracy is present. Interestingly, YDbDr is already near the sweet spot. This happens to be in line with our human prior that the “luminance-chrominance” decomposition is a good way to decorrelate colors but still retains recognizability of objects. We also note that another luminance-chrominance decomposition Lab, which performs similarly well to YDbDr (Fig. 4), was designed to mimic the way humans perceive color . Our analysis therefore suggests yet another rational explanation for why humans perceive color the way we do – human perception of color may be near optimal for self-supervised representation learning.

With this unsupervised objective, in most cases $I_{\text{NCE}}$ between views is overly reduced. In addition, we found this GAN-style training is unstable, as different runs with the same hyper-parameters vary significantly. We conjecture it is because the view generator has no knowledge about the downstream task, and thus the constraint $I(\mathbf{v_{1}};\mathbf{y})=I(\mathbf{v_{2}};\mathbf{y})=I(\mathbf{x};\mathbf{y})$ in Proposition 3.1 is heavily broken. To overcome this, we further develop a semi-supervised view learning method.

4.2.2 Semi-supervised View Learning: Find Views that Share the Label Information

We assume a handful of labels for the downstream task are available. Thus we can guide the generator $g$ to retain $I(g(X)_{1},\mathbf{y})$ and $I(g(X)_{2:3},\mathbf{y})$ . Practically, we introduce two classifiers on each of the learned views to perform classification during the view learning process. Formally, we optimize:
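An objective consistent with this description can be written as below. It assumes a cross-entropy loss $\mathcal{L}_{\text{ce}}$ for the two classifiers and equal weighting of the three terms; neither choice is confirmed by the surrounding text:

$$\min_{g,c_{1},c_{2}}\;\max_{f_{1},f_{2}}\;I_{\text{NCE}}^{f_{1},f_{2}}\big(g(X)_{1};\,g(X)_{2:3}\big)+\mathcal{L}_{\text{ce}}\big(c_{1}(g(X)_{1}),\mathbf{y}\big)+\mathcal{L}_{\text{ce}}\big(c_{2}(g(X)_{2:3}),\mathbf{y}\big)$$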

where $c_{1},c_{2}$ are the classifiers. The $I_{\text{NCE}}$ term applies to all data while the latter two are only for labeled data. In each iteration, we sample an unlabeled batch and a labeled batch. After this process is done, we use frozen $g$ to generate views for unsupervised contrastive representation learning.

Results. The plots are shown in Figure 7(b). Now the learned views are centered around the sweet spot, no matter what the input color space is and whether the generator is VP or NVP, which highlights the importance of keeping information about $\mathbf{y}$. Meanwhile, to see the importance of the unsupervised term, which reduces $I_{\text{NCE}}$, we train another view generator with only the supervised loss. We further compare “supervised”, “unsupervised” and “semi-supervised” (the supervised + unsupervised losses) generators in Table 6, where we also include contrastive learning over the original color space (“raw views”) as a baseline. The semi-supervised view generator significantly outperforms the supervised one, validating the importance of reducing $I(\mathbf{v_{1}};\mathbf{v_{2}})$. We further compare $g(X)$ with $X$ ($X$ is RGB or YDbDr) on larger backbone networks, as shown in Fig. 6. We see that the learned views consistently outperform their raw inputs, e.g., $g(RGB)$ surpasses $RGB$ by a large margin and reaches $94\%$ classification accuracy.

Conclusion

We have shown that good views for a given task in the contrastive representation learning framework should retain task-relevant information while minimizing irrelevant nuisances, which we call the InfoMin principle. Based on it, we demonstrate that optimal views are task-dependent in both theory and practice. We further propose a semi-supervised method to learn effective views for a given task. In addition, we analyze the data augmentation used in recent methods from the InfoMin perspective, and further propose a new set of data augmentations that achieved a new state-of-the-art top-1 accuracy on the ImageNet linear readout benchmark with a ResNet-50.

Broader Impact

This paper is on the basic science of representation learning, and we believe it will be beneficial to both the theory and practice of this field. An immediate application of self-supervised representation learning is to reduce the reliance on labeled data for downstream applications. This may have the beneficial effects of being more cost effective and reducing biases introduced by human annotations. At the same time, these methods open up the ability to use uncurated data more effectively, and such data may hide errors and biases that would have been uncovered via the human curation process. We also note that the view constructions we propose are not bias free, even when they do not use labels: using one color space or another may hide or reveal different properties of the data. The choice of views therefore plays a similar role to the choice of training data and training annotations in traditional supervised learning.

Acknowledgments and Disclosure of Funding

Acknowledgements. This work was done when Yonglong Tian was a student researcher at Google. We thank Kevin Murphy for fruitful and insightful discussion; Lucas Beyer for feedback on related work; and Google Cloud team for supporting computation resources. Yonglong is grateful to Zhoutong Zhang for encouragement and feedback on experimental design.

Funding. Funding for this project was provided by Google, as part of Yonglong Tian’s role as a student researcher at Google.

Competing interests. In the past 36 months, Phillip Isola has had employment at MIT, Google, and OpenAI; honorarium for lecturing at the ACDL summer school in Italy; honorarium for speaking at GIST AI Day in South Korea. P.I.’s lab at MIT has been supported by grants from Facebook, IBM, and the US Air Force; start up funding from iFlyTech via MIT; gifts from Adobe and Google; compute credit donations from Google Cloud. Yonglong Tian is a Ph.D. student supported by MIT EECS department. Chen Sun, Ben Poole, Dilip Krishnan, and Cordelia Schmid are employees at Google.

References

Appendix A Proof of Proposition 3.1

In this section, we provide a proof for the statement regarding optimal views in Proposition 3.1 of the main text. As a warmup, we first recap some properties of mutual information.

A.2 Proof

According to Proposition 3.1, the optimal views ${\mathbf{v}}_{1}^{*},{\mathbf{v}}_{2}^{*}$ for task $\mathcal{T}$ with label ${\mathbf{y}}$ are views such that $I({\mathbf{v}}_{1}^{*};{\mathbf{v}}_{2}^{*})=I({\mathbf{v}}_{1}^{*};{\mathbf{y}})=I({\mathbf{v}}_{2}^{*};{\mathbf{y}})=I({\mathbf{x}};{\mathbf{y}})$.

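Since $I({\mathbf{v}}_{1};{\mathbf{y}})=I({\mathbf{v}}_{2};{\mathbf{y}})=I({\mathbf{x}};{\mathbf{y}})$ and ${\mathbf{v}}_{1}$, ${\mathbf{v}}_{2}$ are functions of ${\mathbf{x}}$, the pair $({\mathbf{v}}_{1},{\mathbf{v}}_{2})$ carries exactly as much information about ${\mathbf{y}}$ as ${\mathbf{x}}$ does.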
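One way to write this step out, using only the data processing inequality and the chain rule for mutual information (a reconstruction from the surrounding argument, not necessarily the authors' exact display), is:

$$I({\mathbf{y}};{\mathbf{x}})=I({\mathbf{y}};{\mathbf{v}}_{1},{\mathbf{v}}_{2})=I({\mathbf{y}};{\mathbf{v}}_{1})+I({\mathbf{y}};{\mathbf{v}}_{2}|{\mathbf{v}}_{1})=I({\mathbf{y}};{\mathbf{x}})+I({\mathbf{y}};{\mathbf{v}}_{2}|{\mathbf{v}}_{1})$$

The first equality holds because $I({\mathbf{y}};{\mathbf{v}}_{1},{\mathbf{v}}_{2})\leq I({\mathbf{y}};{\mathbf{x}})$ (both views are functions of ${\mathbf{x}}$) and $I({\mathbf{y}};{\mathbf{v}}_{1},{\mathbf{v}}_{2})\geq I({\mathbf{y}};{\mathbf{v}}_{1})=I({\mathbf{y}};{\mathbf{x}})$. The last equality substitutes $I({\mathbf{y}};{\mathbf{v}}_{1})=I({\mathbf{x}};{\mathbf{y}})$.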

Therefore $I({\mathbf{y}};{\mathbf{v}}_{2}|{\mathbf{v}}_{1})=0$ , due to the nonnegativity. Then we have:
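A chain of (in)equalities consistent with this step and with the conclusion that follows (again a reconstruction under the stated constraint, not a verbatim display) is:

$$I({\mathbf{v}}_{1};{\mathbf{v}}_{2})=I({\mathbf{v}}_{1};{\mathbf{v}}_{2})+I({\mathbf{y}};{\mathbf{v}}_{2}|{\mathbf{v}}_{1})=I({\mathbf{v}}_{2};{\mathbf{v}}_{1},{\mathbf{y}})=I({\mathbf{v}}_{2};{\mathbf{y}})+I({\mathbf{v}}_{2};{\mathbf{v}}_{1}|{\mathbf{y}})\geq I({\mathbf{v}}_{2};{\mathbf{y}})=I({\mathbf{x}};{\mathbf{y}})$$

The second and third equalities are the chain rule applied in the two possible orders, and the inequality is tight exactly when $I({\mathbf{v}}_{2};{\mathbf{v}}_{1}|{\mathbf{y}})=0$.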

Therefore the optimal views ${\mathbf{v}}_{1}^{*},{\mathbf{v}}_{2}^{*}$ that minimize $I({\mathbf{v}}_{1};{\mathbf{v}}_{2})$ subject to the constraint yield $I({\mathbf{v}}_{1}^{*};{\mathbf{v}}_{2}^{*})=I({\mathbf{x}};{\mathbf{y}})$. Also note that the optimal views ${\mathbf{v}}_{1}^{*},{\mathbf{v}}_{2}^{*}$ are conditionally independent given ${\mathbf{y}}$, as now $I({\mathbf{v}}_{2}^{*};{\mathbf{v}}_{1}^{*}|{\mathbf{y}})=0$. ∎

Given optimal views ${\mathbf{v}}_{1}^{*},{\mathbf{v}}_{2}^{*}$ and minimal sufficient encoders $f_{1}$, $f_{2}$, the learned representation ${\mathbf{z}}_{1}$ (or ${\mathbf{z}}_{2}$) is a sufficient statistic of ${\mathbf{v}}_{1}$ (or ${\mathbf{v}}_{2}$) for ${\mathbf{y}}$, i.e., $I({\mathbf{z}}_{1};{\mathbf{y}})=I({\mathbf{v}}_{1};{\mathbf{y}})$ or $I({\mathbf{z}}_{2};{\mathbf{y}})=I({\mathbf{v}}_{2};{\mathbf{y}})$.

We prove the claim for ${\mathbf{z}}_{1}$. Since ${\mathbf{z}}_{1}$ is a function of ${\mathbf{v}}_{1}$, we have:
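A chain-rule identity that matches this step and the sentence that follows it (a reconstruction, not necessarily the authors' exact display) is:

$$I({\mathbf{y}};{\mathbf{v}}_{1})=I({\mathbf{y}};{\mathbf{v}}_{1},{\mathbf{z}}_{1})=I({\mathbf{y}};{\mathbf{z}}_{1})+I({\mathbf{y}};{\mathbf{v}}_{1}|{\mathbf{z}}_{1})$$

The first equality uses $I({\mathbf{y}};{\mathbf{z}}_{1}|{\mathbf{v}}_{1})=0$, which holds because ${\mathbf{z}}_{1}$ is determined by ${\mathbf{v}}_{1}$.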

To prove $I({\mathbf{y}};{\mathbf{v}}_{1})=I({\mathbf{y}};{\mathbf{z}}_{1})$ , we need to prove $I({\mathbf{y}};{\mathbf{v}}_{1}|{\mathbf{z}}_{1})=0$ .

In the above derivation, $I({\mathbf{y}};{\mathbf{z}}_{1}|{\mathbf{v}}_{1},{\mathbf{v}}_{2})=0$ because ${\mathbf{z}}_{1}$ is a function of ${\mathbf{v}}_{1}$, and $I({\mathbf{v}}_{1};{\mathbf{v}}_{2}|{\mathbf{y}},{\mathbf{z}}_{1})=0$ because the optimal views ${\mathbf{v}}_{1},{\mathbf{v}}_{2}$ are conditionally independent given ${\mathbf{y}}$ (see Proposition A.1). We can also show $I({\mathbf{y}};{\mathbf{v}}_{1}|{\mathbf{v}}_{2})=0$ by following a procedure similar to that in Proposition A.1. If we can further prove $I({\mathbf{v}}_{1};{\mathbf{v}}_{2}|{\mathbf{z}}_{1})=0$, then we get $I({\mathbf{y}};{\mathbf{v}}_{1}|{\mathbf{z}}_{1})\leq 0$, and by nonnegativity $I({\mathbf{y}};{\mathbf{v}}_{1}|{\mathbf{z}}_{1})=0$.

To see $I({\mathbf{v}}_{1};{\mathbf{v}}_{2}|{\mathbf{z}}_{1})=0$ , recall that our encoders are sufficient. According to Definition 1, we have $I({\mathbf{v}}_{1};{\mathbf{v}}_{2})=I({\mathbf{v}}_{2};{\mathbf{z}}_{1})$ :
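One way to make this explicit is to expand $I({\mathbf{v}}_{2};{\mathbf{v}}_{1},{\mathbf{z}}_{1})$ with the chain rule in both orders (a sketch consistent with the sufficiency condition above, not a verbatim display):

$$I({\mathbf{v}}_{2};{\mathbf{z}}_{1})+I({\mathbf{v}}_{1};{\mathbf{v}}_{2}|{\mathbf{z}}_{1})=I({\mathbf{v}}_{2};{\mathbf{v}}_{1},{\mathbf{z}}_{1})=I({\mathbf{v}}_{1};{\mathbf{v}}_{2})+I({\mathbf{v}}_{2};{\mathbf{z}}_{1}|{\mathbf{v}}_{1})=I({\mathbf{v}}_{1};{\mathbf{v}}_{2})$$

where $I({\mathbf{v}}_{2};{\mathbf{z}}_{1}|{\mathbf{v}}_{1})=0$ because ${\mathbf{z}}_{1}$ is a function of ${\mathbf{v}}_{1}$. Hence $I({\mathbf{v}}_{1};{\mathbf{v}}_{2}|{\mathbf{z}}_{1})=I({\mathbf{v}}_{1};{\mathbf{v}}_{2})-I({\mathbf{v}}_{2};{\mathbf{z}}_{1})=0$.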

The representations ${\mathbf{z}}_{1}$ and ${\mathbf{z}}_{2}$ are also minimal for ${\mathbf{y}}$.

For all sufficient encoders, we have proved that ${\mathbf{z}}_{1}$ is a sufficient statistic of ${\mathbf{v}}_{1}$ for predicting ${\mathbf{y}}$, namely $I({\mathbf{v}}_{1};{\mathbf{y}}|{\mathbf{z}}_{1})=0$. Now:
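A decomposition consistent with the conclusion drawn next (a reconstruction via the chain rule, not necessarily the authors' exact display) is:

$$I({\mathbf{z}}_{1};{\mathbf{v}}_{1})=I({\mathbf{z}}_{1};{\mathbf{v}}_{1})+I({\mathbf{v}}_{1};{\mathbf{y}}|{\mathbf{z}}_{1})=I({\mathbf{v}}_{1};{\mathbf{z}}_{1},{\mathbf{y}})=I({\mathbf{v}}_{1};{\mathbf{y}})+I({\mathbf{z}}_{1};{\mathbf{v}}_{1}|{\mathbf{y}})\geq I({\mathbf{v}}_{1};{\mathbf{y}})$$

The lower bound $I({\mathbf{v}}_{1};{\mathbf{y}})$ is attained exactly when $I({\mathbf{z}}_{1};{\mathbf{v}}_{1}|{\mathbf{y}})=0$.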

The minimal sufficient encoder will minimize $I({\mathbf{z}}_{1};{\mathbf{v}}_{1})$ to $I({\mathbf{v}}_{1};{\mathbf{y}})$. This is achievable and leads to $I({\mathbf{z}}_{1};{\mathbf{v}}_{1}|{\mathbf{y}})=0$. Therefore, ${\mathbf{z}}_{1}$ is a minimal sufficient statistic for predicting ${\mathbf{y}}$, thus optimal. Similarly, ${\mathbf{z}}_{2}$ is also optimal. ∎

Appendix B Implementation Details

Why use DIV2K? Recall that we randomly sample pairs of patches separated by a distance d. This sampling process is potentially biased: with a relatively small image (e.g., 512 $\times$ 512), a large d (e.g., 384) will always push the two patches toward the image boundary. To minimize this bias, we use high-resolution images (e.g., 2K) from the DIV2K dataset.
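The boundary bias can be illustrated with a minimal sketch of distance-constrained patch sampling. The function name `sample_patch_pair`, the patch size of 64, the use of rejection sampling, and measuring d between top-left corners are all assumptions made for illustration; they are not taken from the released code.

```python
import math
import random


def sample_patch_pair(height, width, patch, d, rng, max_tries=10_000):
    """Sample top-left corners (y, x) of two patch-by-patch crops that are
    d pixels apart (up to rounding) and both lie fully inside the image.

    Hypothetical helper: uses simple rejection sampling over a random
    first corner and a random direction.
    """
    for _ in range(max_tries):
        y1 = rng.randint(0, height - patch)
        x1 = rng.randint(0, width - patch)
        theta = rng.uniform(0.0, 2.0 * math.pi)
        y2 = round(y1 + d * math.sin(theta))
        x2 = round(x1 + d * math.cos(theta))
        if 0 <= y2 <= height - patch and 0 <= x2 <= width - patch:
            return (y1, x1), (y2, x2)
    raise ValueError("no valid pair found; image too small for this distance")


if __name__ == "__main__":
    rng = random.Random(0)
    # Small image, large distance: accepted corners are forced toward the border.
    print([sample_patch_pair(512, 512, 64, 384, rng) for _ in range(3)])
    # A 2K-sized image leaves far more room for the same distance.
    print([sample_patch_pair(2040, 2040, 64, 384, rng) for _ in range(3)])
```

In this sketch, a 512 $\times$ 512 image with d = 384 never yields a patch whose corner lies in the central region of the image, because the second patch would then fall outside the image; a 2K image does not have this restriction.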

Setup and Training. We use the training framework of CMC . The backbone network is a tiny AlexNet, following . We train for $3000$ epochs, with the learning rate initialized as $0.03$ and decayed with cosine annealing.
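For concreteness, the learning rate schedule can be sketched as below. We assume the standard cosine annealing form that decays to zero over the full run with no warmup; the function name `cosine_lr` and the `min_lr` parameter are illustrative and not taken from the training code.

```python
import math


def cosine_lr(epoch, total_epochs=3000, base_lr=0.03, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch.

    Assumes the standard form
    lr = min_lr + 0.5 * (base_lr - min_lr) * (1 + cos(pi * epoch / total_epochs)).
    """
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 750, 1500, 2250, 3000):
        print(epoch, cosine_lr(epoch))
```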

Evaluation. We evaluate the learned representation on both STL-10 and CIFAR-10 datasets. For CIFAR-10, we resize the image to 64 $\times$ 64 to extract features. The linear classifier is trained for 100 epochs.

B.2 Channel Splitting with Various Color Spaces

Setup and Training. The backbone network is also a tiny AlexNet, with the first layer adapted to take inputs with $1$ or $2$ channels. We follow the training recipe in .

Evaluation. For the evaluation on the STL-10 dataset, we train a linear classifier for 100 epochs and report the single-crop classification accuracy. For the NYU-Depth-v2 segmentation task, we freeze the backbone network and train a 4-layer decoder on top of the learned representations. We report the mean IoU for labeled classes.

Another example we consider is to separate images into low- and high-frequency images. To simplify, we extract $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ by Gaussian blur, i.e.,

where Blur is the Gaussian blur function and $\sigma$ is the parameter controlling the kernel. Extremely small or large $\sigma$ can make the high- or low-frequency image contain little information. In theory, the maximal $I(\mathbf{v_{1}};\mathbf{v_{2}})$ is obtained with some intermediate $\sigma$. As shown in Figure 8, we found that $\sigma=0.7$ leads to the maximal $I_{NCE}$ on the STL-10 dataset. Blurring either more or less reduces $I_{NCE}$, but interestingly blurring more leads to a different trajectory in the plot than blurring less. When increasing $\sigma$ from 0.7, the accuracy first improves and then drops, forming a reverse-U shape with a sweet spot at $\sigma=1.0$. This situation corresponds to (a) in Figure 2 of the main paper. When decreasing $\sigma$ from 0.7, the accuracy keeps dropping, corresponding to (b) in Figure 2 of the main paper. This reminds us of the two aspects in Proposition 3.1: mutual information is not the whole story; what information is shared between the two views also matters.
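The effect of $\sigma$ can be illustrated with a small sketch. We assume here that the low-frequency view is the blurred input and the high-frequency view is the residual, i.e., $\mathbf{v_{1}}=\mathrm{Blur}(\mathbf{x},\sigma)$ and $\mathbf{v_{2}}=\mathbf{x}-\mathbf{v_{1}}$; this residual form, the 1-D signal, the kernel radius of $3\sigma$, and the edge replication are all illustrative assumptions. For images, the same blur would be applied along rows and then columns.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Normalized 1-D Gaussian kernel; the default radius of 3*sigma is an assumption."""
    if radius is None:
        radius = max(1, int(math.ceil(3.0 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(signal, sigma):
    """Gaussian blur of a 1-D signal, replicating the edge samples."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def split_frequencies(signal, sigma):
    """Return (low, high) with low = Blur(signal, sigma) and high = signal - low."""
    low = blur_1d(signal, sigma)
    high = [x - l for x, l in zip(signal, low)]
    return low, high


if __name__ == "__main__":
    step = [0.0] * 16 + [1.0] * 16
    for sigma in (0.1, 0.7, 5.0):
        _, high = split_frequencies(step, sigma)
        print(sigma, sum(h * h for h in high))  # energy left in the high-frequency view
```

With a very small $\sigma$ the blur is close to the identity, so the high-frequency view is almost empty; with a very large $\sigma$ the low-frequency view approaches a constant. Both extremes leave little information to share between the two views.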

Setup and Training. The setup is almost the same as that in color channel splitting experiments, except that each view consists of three input channels. We follow the training recipe in .

Evaluation. We train a linear classifier for 100 epochs on the STL-10 dataset and 40 epochs on the TinyImageNet dataset.

B.4 Colorful Moving MNIST

Dataset. Following the original Moving MNIST dataset, we use a canvas of size 64 $\times$ 64, which contains a digit of size 28 $\times$ 28. The background image is a random crop from the original STL-10 images (96 $\times$ 96). The starting position of the digit is uniformly sampled inside the canvas. The direction of the velocity is uniformly sampled in $[0,2\pi]$, while its magnitude is kept at $0.1$ of the canvas size. When the digit touches the boundary, the velocity is reflected.
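The digit dynamics described above can be sketched as follows. We assume that the position refers to the top-left corner of the digit, that the velocity is applied once per frame, and that reflection folds any overshoot back into the canvas; the function name `simulate_trajectory` is hypothetical.

```python
import math
import random

CANVAS, DIGIT = 64, 28      # canvas and digit side lengths in pixels
SPEED = 0.1 * CANVAS        # displacement per frame (assumed to be one step per frame)


def simulate_trajectory(num_frames, rng):
    """Return the digit's top-left (x, y) position for each frame."""
    limit = CANVAS - DIGIT  # largest valid top-left coordinate
    x, y = rng.uniform(0, limit), rng.uniform(0, limit)
    theta = rng.uniform(0.0, 2.0 * math.pi)
    vx, vy = SPEED * math.cos(theta), SPEED * math.sin(theta)
    frames = []
    for _ in range(num_frames):
        frames.append((x, y))
        x, y = x + vx, y + vy
        if x < 0 or x > limit:  # bounce off the left or right wall
            vx = -vx
            x = -x if x < 0 else 2 * limit - x
        if y < 0 or y > limit:  # bounce off the top or bottom wall
            vy = -vy
            y = -y if y < 0 else 2 * limit - y
    return frames


if __name__ == "__main__":
    for position in simulate_trajectory(20, random.Random(0)):
        print(position)
```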

Setup. We use the first 10 frames as $\mathbf{v_{1}}$ (namely $k=10$), and we construct $\mathbf{v_{2}}$ by referring to the 20th frame (namely $t=20$). During the contrastive learning phase, we employ a 4-layer ConvNet to encode images and use a single-layer LSTM on top of the ConvNet to aggregate features of consecutive frames. The CNN backbone consists of 4 layers with $8,16,32,64$ filters from low to high. Average pooling is applied after the last convolutional layer, resulting in a 64-dimensional representation. The dimensions of the hidden layer and output in the LSTM are both 64.

Examples. Examples of $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$ are shown in Figure 9, where the three rows on the right-hand side show cases where only a single factor (digit, position, or background) is shared.

Training. We perform intra-batch contrast. Namely, inside each batch of size 128, we contrast each sample with the other 127 samples. We train for 200 epochs, with the learning rate initialized as $0.03$ and decayed with cosine annealing.
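Intra-batch contrast can be sketched as an InfoNCE-style loss in which, for each sample, the matching view is the positive and the other samples in the batch serve as negatives. The sketch below assumes L2-normalized embeddings, a dot-product critic, a single contrast direction (from the first view to the second), and a hypothetical temperature of 0.07; none of these details are specified for this experiment.

```python
import math


def info_nce_loss(z1, z2, temperature=0.07):
    """Average InfoNCE loss over a batch.

    z1[i] and z2[i] are embeddings (lists of floats, assumed L2-normalized)
    of the two views of sample i. For anchor z1[i], the positive is z2[i]
    and the negatives are z2[j] for every j != i in the same batch.
    """
    n = len(z1)
    total = 0.0
    for i in range(n):
        logits = [sum(a * b for a, b in zip(z1[i], z2[j])) / temperature
                  for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce_loss(aligned, aligned))      # views agree, samples are distinct
    collapsed = [[1.0, 0.0]] * 4
    print(info_nce_loss(collapsed, collapsed))  # all embeddings identical
```

When all embeddings collapse to the same vector, every logit is equal and the loss is $\log N$ for a batch of $N$ samples, which is the value a critic that ignores its input would achieve.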

B.5 Un-/Semi-supervised View Learning

Invertible Generator. Figure 10 shows the basic building block for the Volume-Preserving (VP) and Non-Volume-Preserving (NVP) invertible view generators. $F$ and $G$ are pixel-wise convolutional functions, i.e., convolutional layers with 1 $\times$ 1 kernels. $\mathbf{X}_{1}$ and $\mathbf{Y}_{1}$ represent a single channel of the input and output respectively, while $\mathbf{X}_{2}$ and $\mathbf{Y}_{2}$ represent the other two channels. When stacking basic building blocks, we alternately select the first, second, and third channel as $\mathbf{X}_{1}$, to enhance the expressivity of the view generator.
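To illustrate why such blocks are invertible, the sketch below applies coupling transforms at a single pixel. It assumes the standard additive (NICE-style) form for the VP block and the affine (RealNVP-style) form for the NVP block; the exact wiring in Figure 10 may differ. The functions `F` and `G` here are fixed toy stand-ins for the learned 1 $\times$ 1 convolutions, not the actual networks.

```python
import math


def F(x1):
    """Toy stand-in for a learned 1x1 convolution: 1 channel -> 2 channels (shift)."""
    return (0.5 * x1 + 0.1, -0.3 * x1 + 0.2)


def G(x1):
    """Toy stand-in for a learned log-scale branch, bounded by tanh."""
    return (math.tanh(0.7 * x1), math.tanh(-0.4 * x1))


def vp_forward(x1, x2):
    """Additive coupling: Y1 = X1, Y2 = X2 + F(X1). Jacobian determinant is 1."""
    return x1, tuple(a + s for a, s in zip(x2, F(x1)))


def vp_inverse(y1, y2):
    return y1, tuple(a - s for a, s in zip(y2, F(y1)))


def nvp_forward(x1, x2):
    """Affine coupling: Y1 = X1, Y2 = X2 * exp(G(X1)) + F(X1)."""
    return x1, tuple(a * math.exp(g) + s for a, g, s in zip(x2, G(x1), F(x1)))


def nvp_inverse(y1, y2):
    return y1, tuple((a - s) * math.exp(-g) for a, g, s in zip(y2, G(y1), F(y1)))


if __name__ == "__main__":
    pixel = (0.3, (0.6, -0.2))  # (X1, X2) for one pixel of a 3-channel image
    print(vp_inverse(*vp_forward(*pixel)))
    print(nvp_inverse(*nvp_forward(*pixel)))
```

Because $\mathbf{Y}_{1}=\mathbf{X}_{1}$ passes through unchanged, the inverse can recompute $F$ and $G$ from the output alone, so neither function needs to be invertible itself.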

Setup and Training. For unsupervised view learning that only uses the adversarial $I_{NCE}$ loss, we found the training to be relatively unstable, as also observed in GANs. We found the learning rate of the view generator should be larger than that of the $I_{NCE}$ approximator. Concretely, we use the Adam optimizer, and we set the learning rates of the view generator and the $I_{NCE}$ approximator to $2\times 10^{-4}$ and $6\times 10^{-4}$, respectively. For semi-supervised view learning, we found the training to be stable across different learning rate combinations, which we consider an advantage. To be fair, we still use the same learning rates for both the view generator and the $I_{NCE}$ approximator.

Contrastive Learning and Evaluation. After the view learning stage, we perform contrastive learning and evaluation by following the recipe in Section B.2.

Appendix C Data Augmentation as InfoMin

C.2 Analysis of Data Augmentation as it relates to MI and Transfer Performance

We also investigate how sliding the strength parameter of individual augmentation functions leads to reverse-U curves in practice, as shown in Figures 12 and 13.

Cropping. In PyTorch, the RandomResizedCrop(scale=(c, 1.0)) data augmentation function sets a lower bound c on the crop area, expressed as a fraction of the image area. A smaller c means more aggressive data augmentation. We vary c for both a linear critic head (with temperature 0.07) and a nonlinear critic head (with temperature 0.15), as shown in Figure 12. In both cases, decreasing c traces a reverse-U shape between $I_{NCE}$ and linear classification accuracy, with a sweet spot at $c=0.2$. This differs from the widely used $0.08$ in the supervised learning setting. Using $0.08$ can lead to a drop of more than $1\%$ in accuracy compared to the optimal $0.2$ when a nonlinear projection head is applied.
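The role of c can be made concrete with a simplified sketch of how a crop box is drawn. It mirrors, to the best of our understanding, the sampling logic of torchvision's RandomResizedCrop (area fraction uniform in [c, 1], aspect ratio log-uniform in [3/4, 4/3], a fixed number of attempts), but it is not the library code: the function name `sample_crop_box` is hypothetical and the fallback is simplified to returning the whole image.

```python
import math
import random


def sample_crop_box(height, width, c, rng, ratio=(3 / 4, 4 / 3), attempts=10):
    """Return (top, left, h, w) of a random crop covering at least a fraction c
    of the image area, up to rounding."""
    area = height * width
    for _ in range(attempts):
        target_area = area * rng.uniform(c, 1.0)
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            top = rng.randint(0, height - h)
            left = rng.randint(0, width - w)
            return top, left, h, w
    return 0, 0, height, width  # simplified fallback: keep the whole image


def mean_area_fraction(c, n=2000, size=224, seed=0):
    """Average fraction of the image covered by n sampled crops."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        _, _, h, w = sample_crop_box(size, size, c, rng)
        total += (h * w) / (size * size)
    return total / n


if __name__ == "__main__":
    for c in (0.08, 0.2, 0.5):
        print(c, mean_area_fraction(c))
```

Lowering c admits smaller crops, so two crops of the same image overlap less on average and share less information.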

Color Jittering. As shown in Figure 11(b), we adopt a parameter $x$ to control the strength of the color jittering function. As shown in Figure 13, increasing $x$ from $0.125$ to $2.5$ also traces a reverse-U shape, regardless of whether a linear or nonlinear projection head is used. The sweet spot lies around $x=1.0$, which is the same value as used in SimCLR. In practice, we see that the accuracy is more sensitive around the sweet spot for the nonlinear projection head, which also happens for cropping. This implies that it is important to find the sweet spot when designing future augmentation functions.

Details. These plots are based on the MoCo framework. We use $65536$ negatives and pre-train for 100 epochs on 8 GPUs with a batch size of 256. The learning rate starts as $0.03$ and decays following a cosine annealing schedule. For the downstream task of linear evaluation, we train the linear classifier for 60 epochs with an initial learning rate of 30, following .

C.3 Results on ImageNet Benchmark

On top of the “RA-CJ-Blur” augmentations shown in Figure 11, we further reduce the mutual information (or enhance the invariance) of views by using PIRL, i.e., adding JigSaw. This improves the accuracy of the linear classifier from $63.6\%$ to $65.9\%$. Replacing the widely-used linear projection head with a 2-layer MLP increases the accuracy to $67.3\%$. When using this nonlinear projection head, we found a larger temperature is beneficial for downstream linear readout (as also reported in ). All these numbers are obtained with 100 epochs of pre-training. For simplicity, we call such unsupervised pre-training InfoMin pre-training (i.e., pre-training with our InfoMin-inspired augmentation). As shown in Table 7, our InfoMin model trained for 200 epochs achieves $70.1\%$, outperforming SimCLR trained for 1000 epochs. Finally, a new state of the art, $73.0\%$, is obtained by training for 800 epochs. Compared to SimCLR, which requires 128 TPUs for large-batch training, our model can be trained with as few as 4 GPUs on a single machine.

For future improvement, there is still room for manually designing better data augmentation. As shown in Figure 11(a), using “RA-CJ-Blur” has not touched the sweet spot yet. Another way is to learn to synthesize better views (augmentations) by following (and expanding) the idea of the semi-supervised view learning method presented in Section 4.2.2 of the main paper.

Different Architectures. We further include the performance of InfoMin as well as other SoTA methods with different architectures in Table 7. Increasing the network capacity leads to significant improvement of linear readout performance on ImageNet for InfoMin, which is consistent with previous literature.

C.4 Comparing with SoTA in Transfer Learning

Appendix D Transfer Learning with Various Backbones and Detectors on COCO

We evaluated the transferability of various models pre-trained with InfoMin, under different detection frameworks and fine-tuning schedules. In all cases we tested, models pre-trained with InfoMin outperform those pre-trained with supervised cross-entropy loss. Interestingly, ResNeXt-152 trained with InfoMin on ImageNet-1K beats its supervised counterpart trained on ImageNet-5K, which is $6\times$ larger. Bounding box AP and mask AP are reported on val2017.

The results of Mask R-CNN with R-50 C4 backbone are shown in Table 8. We experimented with $1\times$ and $2\times$ schedules.

D.2 ResNet-50 with Mask R-CNN, FPN architecture

The results of Mask R-CNN with R-50 FPN backbone are shown in Table 9. We compared with MoCo and MoCo v2 under the $2\times$ schedule, and also experimented with the $6\times$ schedule.

D.3 ResNet-101 with Mask R-CNN, C4 architecture

The results of Mask R-CNN with R-101 C4 backbone are shown in Table 10. We experimented with $1\times$ and $1\times$ schedule.

D.4 ResNet-101 with Mask R-CNN, FPN architecture

The results of Mask R-CNN with R-101 FPN backbone are shown in Table 11. We experimented with $1\times$, $2\times$, and $6\times$ schedules.

D.5 ResNet-101 with Cascade Mask R-CNN, FPN architecture

The results of Cascade Mask R-CNN with R-101 FPN backbone are shown in Table 12. We experimented with $1\times$, $2\times$, and $6\times$ schedules.

D.6 ResNeXt-101 with Mask R-CNN, FPN architecture

The results of Mask R-CNN with X-101 FPN backbone are shown in Table 13. We experimented with $1\times$ and $2\times$ schedules.

D.7 ResNeXt-152 with Mask R-CNN, FPN architecture

The results of Mask R-CNN with X-152 FPN backbone are shown in Table 14. We experimented with the $1\times$ schedule. Note that in this case, while the InfoMin model is pre-trained on the standard ImageNet-1K dataset, the supervised model is pre-trained on ImageNet-5K, which is $6\times$ larger than ImageNet-1K. That said, we found that InfoMin still outperforms supervised pre-training.

Appendix E Change Log

arXiv v2 Paper accepted to NeurIPS 2020. Updated to the camera-ready version.

arXiv v3 Included more details in disclosure of funding.