Learning visual groups from co-occurrences in space and time

Phillip Isola, Daniel Zoran, Dilip Krishnan, Edward H. Adelson

Introduction

Clown fish live next to sea anemones, lightning is always accompanied by thunder. When looking at the world around us, we constantly notice which things go with which. These associations allow us to segment and organize the world into coherent visual representations.

This paper addresses how representations like “objects” and “scenes” might be learned from natural visual experience. A large body of work has focused on learning these representations as a supervised problem (e.g., by regressing on image labels) and current object and scene classifiers are highly effective. However, in the absence of expert annotations, it remains unclear how we might uncover these representations in the first place. Do “objects” fall directly out of the statistics of the environment, or are they a more subjective, human-specific construct?

Here we probe the former hypothesis. Because the physical world is highly structured, adjacent locations are usually semantically related, whereas far apart locations are more often semantically distinct. By modeling spatial and temporal dependencies, we may therefore learn something about semantic relatedness.

We investigate how these dependences may be learned from unlabeled sensory input. We train a deep neural network to predict whether or not two input images or patches are likely to be found next to each other in space or time. We demonstrate that the network learns dependencies that can be used to uncover meaningful visual groups. We apply the method to generate fast and accurate object proposals that are competitive with recent supervised methods, as well as to automatic movie scene segmentation, and to the grouping of semantically related photographs.

Related work

The idea that perceptual groups reflect statistical structure in the environment has deep roots in the perception and cognition literature (Barlow (1985); Wilkin & Tenenbaum (1985); Tenenbaum & Witkin (1983); Lowe (2012); Rock (1983)). Barlow postulated that the brain is constantly on the lookout for events that co-occur much more often than chance, and uses these “suspicious coincidences” to discover underlying structure in the world (Barlow (1985)). Subsequent researchers have argued that infants learn about linguistic groups from phoneme co-occurrence (Saffran et al. (1996)), and that humans may also pick up on visual patterns from co-occurrence statistics alone (Fiser & Aslin (2001); Schapiro et al. (2013)).

Visual grouping is also a central problem in computer vision, showing up in the tasks of edge/contour detection Canny (1986); Arbelaez et al. (2011); Isola et al. (2014) , (semantic) segmentation Shi & Malik (2000); Malisiewicz & Efros (2007), and object proposals Alexe et al. (2012); Zitnick & Dollár (2014); Krahnenbuhl & Koltun (2015), among others. Many papers in this field take the approach of first modeling the affinity between visual elements, then grouping elements with high affinity (e.g., Shi & Malik (2000)). Our work follows this approach. However, rather than using a hand-engineered grouping cue Shi & Malik (2000); Zitnick & Dollár (2014), or learning to group with direct supervision Dollár & Zitnick (2013); Krahnenbuhl & Koltun (2015), we use a measure of spatial and temporal dependence as the affinity.

Grouping based on co-occurrence has received some prior attention in computer vision (Sivic et al. (2005); Faktor & Irani (2012; 2013); Isola et al. (2014)). Sivic et al. (2005) demonstrated that object categories can be discovered and roughly localized using an unsupervised generative model. Isola et al. (2014) showed that statistical dependences between adjacent pixel colors, measured by pointwise mutual information (PMI) can be very effective at localizing object boundaries. Both these methods require modeling generative probability distributions, which restricts their ability to scale to high-dimensional data. Our model, on the other hand, is discriminative and can be easily scaled.

A recent line of work in representation learning has taken a similar tack, training discriminative models to predict one aspect of raw sensory data from another. This work may be termed self-supervised learning and has a number of flavors. The common theme is exploiting spatial and/or temporal structure as supervisory signals. Mobahi et al. (2009) learn a feature embedding such that features adjacent in time are similar and features far apart in time are dissimilar. Srivastava et al. (2015) predict future frames in a video, and rely on strict temporal ordering; extension to spatial or unordered data is unclear. Wang & Gupta (2015) use a siamese triplet loss to learn a representation that can track patches through a video. They rely on training input from a separate tracking algorithm. Agrawal et al. (2015) as well as Jayaraman & Grauman (2015) regress on egomotion signals to learn a representation. Finally, Doersch et al. (2015) learn features by predicting the relative orientation between patches within an image.

Each of these works focus on learning good generic features, which may then be applied as pre-training for a supervised model. Our current goal is rather different. Rather than learning a vector space representation of images, we search for more explicit structure, in the form of visual groups. We show that the learned groups are semantically meaningful in and of themselves. This differs from the usual approach in feature learning, where the features are not necessarily interpretable, but are instead used as an intermediate representation on top of which further models can be trained.

Modeling visual affinities by predicting co-occurrence

We would like to group visual primitives, AA and BB, based on the probability that they belong to the same semantic entity. AA and BB may, for example, be two image patches, in which case we would want to group them if they belong to the same visual object.

Given object labels, a straightforward approach to this problem would be to train a supervised classifier to predict indicator variable Q{0,1}\mathcal{Q}\in\{0,1\}, where Q=1\mathcal{Q}=1 iff AA and BB lie on the same object Manen et al. (2013). Throughout this paper, we use Q\mathcal{Q} to indicate the property that AA and BB share the same semantic label.

Acquiring training data for Q\mathcal{Q} may require time-consuming and expensive annotation. We instead will explore an alternative strategy. Instead of training a classifier to predict Q\mathcal{Q} directly, we train classifiers to predict spatial or temporal proximity, denoted by C{0,1}\mathcal{C}\in\{0,1\}. Because the semantics of the world change slowly over space and time, we hope that C\mathcal{C} might serve as a cheap proxy for Q\mathcal{Q} (c.f. Kayser et al. (2001); Wiskott & Sejnowski (2002)). The degree to which this is true is an empirical question, which we will test below. Throughout the paper, C=1\mathcal{C}=1 iff AA and BB are nearby each other in space or time.

Formally, we model the affinity between visual primitives AA and BB as

In other words, we model affinity as the probability that two primitives will co-occur within some context (with symmetry between the order of AA and BB enforced). We will then use this affinity metric to cluster primitives, in particular using spectral clustering (Section 4).

This affinity can be understood in a variety of ways. First, as described above, it can be seen as a proxy for what we are really after, P(Q=1A,B)P(\mathcal{Q}=1|A,B). Second, it is a measure of the spatial/temporal dependence between AA and BB. Applying Bayes’ rule we can see that P(C=1A,B)P(A,BC=1)P(A,B)P(\mathcal{C}=1|A,B)\propto\frac{P(A,B|\mathcal{C}=1)}{P(A,B)}, which factors to P(A,BC=1)P(A)P(B)\frac{P(A,B|\mathcal{C}=1)}{P(A)P(B)} when AA and BB are independent. Therefore, if we sample primitives iid in space or timeIn each of our applications we sample 50% positive (C=1\mathcal{C}=1) and 50% negative (C=0\mathcal{C}=0) examples. Under our logistic regression model, this results in a function monotonically related to what we would get if we had sampled iid (see appendix B.4 in King & Zeng (2001)), our affinity models how much more often AA and BB appear nearby each other than they would if they were independent (the rate of co-occurrences were they independent would simply be P(A)P(B)P(A)P(B)). This value is closely related to the pointwise mutual information (PMI) between AA and BB conditioned on C=1\mathcal{C}=1. Previous work on visual grouping Isola et al. (2014) and word embeddings (Church & Hanks (1990); Levy et al. (2014)) has found PMI to be an effective measure for these tasks.

To model w(A,B)w(A,B) we use a Convolutional Neural Net (CNN) with a Siamese-style architecture (Figure 2, Chopra et al. (2005)), which we implement in Caffe (Jia et al. (2014)). The network has two convolutional branches, one to process AA and the other to process BB, with shared weights. These branches can be regarded as feature extractors. The features are then concatenated and fed to a set of fully connected layers that compare the features and try to predict C\mathcal{C}. We use a logistic loss over C\mathcal{C} and train all models with stochastic gradient descent. Our objective can be expressed as

where θ\mathbf{\theta} are the net parameters we optimize over (weights and biases), NN is the number of training examples, σ\sigma is the logistic function, and ff is a neural net. For each of our experiments, N=500,000N=500,000 training examples, 50% of which are positive (C=1\mathcal{C}=1) and 50% negative (C=0\mathcal{C}=0).

We examine three domains: 1) learning to group patches based on their spatial adjacency in images, 2) learning to group video frames based on their temporal adjacency in movies, and 3) learning to group photos based on their geospatial proximity.

Each task corresponds to a different choice of AA, BB, C\mathcal{C}, and Q\mathcal{Q}. In each case, we analyze performance at predicting C\mathcal{C} and at predicting Q\mathcal{Q}, comparing our CNN to baseline grouping cues. Each baseline corresponds to a measure of the similarity between the primitives. Similarity measures like these are commonly used in visual grouping algorithms Arbelaez et al. (2011); Faktor & Irani (2012). Full results of this analysis are given in Table 3.1. In all cases, our co-occurrence classifier matches or outperforms the baselines.

We first examine whether or not predicting patch co-occurrence in images results in an effective affinity measure for object-level grouping. We set AA and BB to be 17×1717\times 17 pixel patches (with circular masks). The context function C\mathcal{C} is spatial adjacency. Positive examples (C=1\mathcal{C}=1) are pairs of adjacent patches (with no overlap between them) and negative examples (C=0\mathcal{C}=0) are pairs sampled from random locations across the dataset. We sample training patches from the Pascal VOC 2012 training set.

We train a CNN (two convolutional layers, two fully connected layers; Figure 2) to model P(C=1A,B)P(\mathcal{C}=1|A,B). Figure 3 shows a t-SNE visualization of the learned affinities (van der Maaten & Hinton (2009)). As can be seen, the network learns to associate patches with different kinds of structure such as texture, local features, and color similarities.

To evaluate performance, we sample 10,000 patches from the Pascal VOC 2012 validation set, 50%50\% with C=1\mathcal{C}=1 and 50%50\% with C=0\mathcal{C}=0. In Table 3.1 we measure the Average Precision of using several affinity measures as a binary classifier of either C\mathcal{C} or Q\mathcal{Q}. In this case, we defined Q\mathcal{Q} to indicate whether or not the center pixel of the two patches lies on the same labeled object instance. To test Q\mathcal{Q} independently from C\mathcal{C} we create the Q\mathcal{Q} test set by only sampling from patch pairs for which C=1\mathcal{C}=1 (so the net cannot do well at predicting Q\mathcal{Q} simply by doing well at predicting C\mathcal{C}). Our network performs well relative to the baseline affinity metrics, although color histogram similarity does reach a similar performance on predicting Q\mathcal{Q}.

Even though it was only trained to predict C\mathcal{C}, our method is effective at predicting Q\mathcal{Q} as well, achieving an average precision (AP) of 0.800.80. This validates that spatial proximity, C\mathcal{C}, is a good surrogate for “same object”, Q\mathcal{Q}. This raises the question, would we do any better if we directly trained on Q\mathcal{Q}? We tested this, training a new network on 50% patches with Q=1\mathcal{Q}=1 and 50% with Q=0\mathcal{Q}=0. This net achieves higher performance on predicting Q\mathcal{Q} (AP = 0.850.85) and lower performance at predicting C\mathcal{C} (AP = 0.920.92), than our net trained to predict C\mathcal{C}. Therefore, although predicting co-occurrence may be a decent proxy for predicting semantic sameness, there is still a gap in performance compared to directly training on Q\mathcal{Q}. Designing better context functions, C\mathcal{C}, that narrow this gap is an important direction for future research.

3 Frame affinities from co-occurrence in movies

Our framework can also be applied to learning temporal associations. To test this, we set AA and BB to be frames, cropped and down sampled to 33×3333\times 33 pixels, from a set of 96 movies sampled from the top 100 rated movies on IMDBhttp://www.imdb.com/. In this setting, C\mathcal{C} indicates temporal adjacency – specifically, two frames are assigned C=1\mathcal{C}=1 if they are at least 3 seconds from each other and not more than 10 seconds apart. C=0\mathcal{C}=0 otherwise.

Again we train a CNN to model P(C=1A,B)P(\mathcal{C}=1|A,B) (three convolutional layers, two fully connected layers). To evaluate predicting C\mathcal{C}, we train on half the movies and test on the remaining half. Our method can learn to predict C\mathcal{C} quite effectively, reaching an Average Precision of 0.950.95 on the test set.

How do the learned temporal associations relate to semantic visual scenes? To test this, we compared against DVD chapter annotations, setting Q\mathcal{Q} to be “do these two frames occur in the same DVD chapter?” We sample 10,000 frame pairs, 50% with Q=1\mathcal{Q}=1 and 50% with Q=0\mathcal{Q}=0, while holding C\mathcal{C} constant (so that good performance at predicting Q\mathcal{Q} cannot be achieved simply by doing well at predicting C\mathcal{C}). Our network achieves an AP of 0.670.67 on this task. Similar to above, we can then see that temporal adjacency, C\mathcal{C}, is an effective surrogate for learning about semantic sameness, Q\mathcal{Q}.

4 Photo affinities from geospatial co-occurrence

Just as an object is a collection of associated patches, and a movie scene is a collection of associated frames, a visual place can be viewed a collection of associated photographs. Here we set AA and BB to be geotagged photos, cropped and down sampled to 33×3333\times 33 pixels, and C\mathcal{C} indicates whether or not AA and BB are taken within 11 meters of one another (we exclude exact duplicate locations).

Using the same CNN architecture as for the movie frame network, we again learn P(C=1A,B)P(\mathcal{C}=1|A,B), but for this new setting of the variables. We train on five cities selected from the MIT City Database Zhou et al. (2014) and test predicting C\mathcal{C} on a held out set of three more cities from that dataset. We also test how well the network predicts place semantics. For this, we define Q\mathcal{Q} as “do these two photos belong to the same place category?” We test this task on the LabelMe Outdoors dataset Liu et al. (2009) for which each photo was assigned to one of eight place categories (e.g., “coast”, “highway”, “tall building”). Our network shows promising performance on this task, reaching 0.790.79 AP on predicting Q\mathcal{Q}. HOG similarity reaches the same performance, which corroborates past findings that HOG is effective at grouping related photos (Dalal & Triggs (2005)).

Notice that while HOG does well on associating photographs, it does not do well at associating movie frames nor image patches. On the other hand, color histogram similarity does well on associating image patches and movie frames, but fails at grouping everyday photographs – while patches on an object, or frames in a movie scene, may tend to all use a consistent color palette, tourist photos of the same location will have high color variance, due to seasonal and lighting variations. Different grouping rules will be effective at different tasks. Our learning based approach has the advantage that it automatically figures out the appropriate grouping cue for each new domain, and thereby achieves good performance on all our tasks.

5 Which cues did the networks learn to use?

In each domain tested above, the grouping rules may be very different. Here we study them by probing the trained networks with controlled stimuli. Similar to how a psychophysicist might experiment on human perception, we show our networks specially made stimuli. For each test, we feed the networks many pairs {A,B}\{A,B\}, sampled from locations such that C=1\mathcal{C}=1. We leave AA unaltered, but modify BB in a controlled way. This allows us to test what kinds of transformations of BB will change the network’s prediction as to wether or not C=1\mathcal{C}=1 (c.f. Lenc & Vedaldi (2014)). We consider the following modifications: rotation by 90°, mirroring vertically and horizontally, removal of color (by replacing each pixel with its mean over color channels), and a darkening transformation in which we multiply each color channel by 0.5.

Results of these tests are given in Table 2. Each number is the mean output of the network for a given test case. The left-most column provides the mean output without any transformation applied. Interestingly, the patch network is almost entirely invariant to 90°rotations and mirror flips – according to the network, sharing a common orientation does not increase the probability of two patches being nearby. On the other hand, the patch network’s output depends dramatically on color similarity. The frame network behaves similarly, but shows some sensitivity to geometric transformations. On the other hand, the geo-photo network exhibits the opposite pattern of behavior: it’s output is changed more when geometric transformations are applied than when color transformations are applied. According to the geo-network, two photos may have different color compositions and still be likely to be nearby.

From predicting co-occurrence to forming visual groups

We apply the following general approach to learning visual groups:

Define AA, BB, and C\mathcal{C} based on the domain

Learn P(CA,B)P(\mathcal{C}|A,B) using a CNN. 3. Setup a graph in which Ai=1N\mathbf{A}_{i=1}^{N} and Bi=1N\mathbf{B}_{i=1}^{N} are nodes and edge weights are given by w(Ai,Bi)w(\mathbf{A}_{i},\mathbf{B}_{i}) (Eqn. 1). Then partition the graph into visual groups using spectral clustering.

As demonstrated in Section 4, patches that predictably co-occur usually belong to the same object. This suggests that we can localize objects in an image by grouping patches using our co-occurrence affinity. We focus on the specific problem of “object proposals” (Zitnick & Dollár (2014); Krahnenbuhl & Koltun (2015)), where the goal is to localize all objects in an image.

We use the patch associations P(C=1A,B)P(\mathcal{C}=1|A,B) defined in Section 4, trained on the Pascal VOC 2012 training set. Follow previous benchmarks, we test on the validation set. Given a test image, we sample all 17×1717\times 17 patches at a stride of 88 pixels. We construct a graph in which each patch is a node and nodes are connected by an edge if the spatial distance between the patch centers is at least 1717 pixels and no more than 3333 pixels. Each patch is multiplied by a circular mask so that no two patches connected by an edge see any overlapping pixels (see Figure 2(right)). Each edge, indexed by i,ji,j, is weighted by Wi,j=w(Ai,Bi)α\mathbf{W}_{i,j}=w(\mathbf{A}_{i},\mathbf{B}_{i})^{\alpha}, resulting in the affinity matrix W\mathbf{W}, where we use the value α=20\alpha=20 in our experiments.

To globalize the associations, we apply spectral clustering to the matrix W\mathbf{W}. First we create the Laplacian eigenmap LL for W\mathbf{W}, using the 2nd through 16th eigenvectors with largest eigenvalues. Each eigenvector is scaled by λ12\lambda^{-\frac{1}{2}} where λ\lambda is the corresponding eigenvalue. We then generate object proposals simply by applying k-means to the Laplacian eigenmap. To generate more than a few proposals, we run k-means multiple times with random restarts and for values of k from 5 to 16. Finally, we prune redundant proposals and sort proposals to achieve diversity throughout the ranking (by encouraging proposals to be made from different values of k before giving proposals from a random restart at the same value of k).

Qualitative results from our method are shown in Figure 4. In each case, we show the proposals that have best overlap with the ground truth object masks for 100 proposals. We quantitatively compare against other recent methods in Figure 5. Even though our method is not trained on labeled images, it reaches performance comparable to recent supervised methods at proposing up to 100 objects per image. Our implementation runs in about 4 seconds per image on a 2015 Macbook Pro.

2 Segmenting movies

Just as objects are composed of associated patches, scenes in a movie are composed of associated frames. Here we show how our learned frame affinities can be used to break a movie into coherent scenes, a problem that has received some prior attention (Chen et al. (2008); Zhai & Shah (2006)).

To segment a movie, we build a graph in which each frame is a node and all frames within ten seconds of one another are connected by an edge. We then weight the edges using the frame-associations P(C=1A,B)P(\mathcal{C}=1|A,B) (Section 4), and partition the graph using spectral clustering.

To evaluate, we use DVD chapter annotations as ground truth. Following a standard evaluation procedure in image boundary detection (Arbelaez et al. (2011)), we measure performance on the retrieval task of finding all ground truth boundaries. In Figure 6(right), we compare against the baseline affinity metrics from Section 4. In each case, we apply the same spectral clustering pipeline as for our method, except with edge weights given by the baseline metrics instead of by our net. It is important to properly scale each similarity metric or else spectral clustering will fail. To provide fair comparison, we sweep over a range of scale factors α\alpha, setting edge weights as exp(w(A,B)α2)\exp(\frac{w(A,B)}{\alpha^{2}}), where ww is the affinity measure. In Figure 6(right) we show results selecting the optimal α\alpha for each method.

Our approach finds more boundaries with higher precision than these baselines, except for color histogram similarity, which reaches a similar performance. Figure 6 (left) shows an example segmentation of a section of The Two Towers. The movie is displayed as a “movie barcode”http://moviebarcode.tumblr.com/ in which each frame is squished into a single column and time advances to the right. On top are the DVD chapter annotations, and on the bottom are our inferred boundaries.

3 Discovering place categories

Taking the geospatial-associations model from Section 4, we cluster photos into coherent types of places. Here we create a fully connected graph between all photos in a given collection, weight the edges with P(C=1A,B)P(\mathcal{C}=1|A,B) and then apply spectral clustering to partition the collection. We test the purity of the clusters on LabelMe Outdoors dataset (Liu et al. (2009)). Clustering purity versus number of clusters kk is given in Figure 7 (right), showing that our method is effective at discovering semantic place categories. As in our movie segmentation experiments, we select the optimal α\alpha to scale the affinity of our method as well each baseline. Figure 7 (left) shows random sample images from each cluster after clustering into 8 categories. This clustering has 59%59\% purity.

Conclusion

We have presented a simple and general approach to learning visual groupings, which requires no pre-defined labels. Instead our framework uses co-occurrence in space or time as a supervisory signal. By doing so, we learn different clustering mechanisms for a variety of tasks. Our approach achieves competitive results on object proposal generation, even when compared to methods trained on labeled data. Additionally, we demonstrated that the same method can be used to segment movies into scenes and to uncover semantic place categories. The principles underlying the framework are quite general and may be applicable to data in other domains, when there are natural co-occurrence signals and groupings.

We thank William T. Freeman, Joshua B. Tenenbaum, and Alexei A. Efros for helpful feedback and discussions. Thanks to Andrew Owens for helping collect the movie dataset. This work is supported by NSF award 1161731, Understanding Translucency, and by Shell Research.

References