Weakly-supervised Compositional Feature Aggregation for Few-shot Recognition

Ping Hu, Ximeng Sun, Kate Saenko, Stan Sclaroff

Introduction

The human visual system has a remarkable ability to efficiently learn semantic concepts from just one or a few examples. To achieve a similar ability with machine learning, Few-Shot Learning (FSL) has emerged as an important research topic in recent years. Given a small set of labeled examples (support set) of novel object categories (novel classes), FSL aims to apply a model trained on the known object classes (base classes) to classify the unlabeled samples (query set) from the novel classes. To tackle this problem, meta-learning based models train a meta-learner that can be quickly adapted to new recognition tasks for the novel object categories; feature-hallucination approaches learn to generalize the base classes’ distribution to augment samples in the support set. Compared to these models with sophisticated frameworks and protocols, a more straightforward yet effective approach employs metric-learning based models that classify samples in the query set based on their metric distance to the labeled examples in the support set.

Although the aforementioned methods have achieved significant results when combined with deep CNNs’ superior learning and generalization capabilities, they may still suffer from the inherent limitation of FSL tasks: the support set is typically too small to reliably generalize the model learned on base categories to novel classes. To alleviate this issue, one possible approach is to take into consideration the compositionality of concept representation (e.g., the fact that objects are built from parts and composed of semantic attributes). Compositionality plays a key role in the human visual system, as it represents novel concepts in terms of known primitives, which helps us learn efficiently from a few examples. Inspired by these findings, some prior models learn with attribute-level annotations to encourage the deep features to encode semantic compositionality. However, as shown in Fig. 1, such methods need to predefine a fixed set of attributes and rely on attribute-level annotations for training, which may be sub-optimal and limit the range of applications. Moreover, these methods apply mean/max pooling over feature maps to produce image-level representations, thus losing the objects’ spatial compositionality, which is important for visual understanding.

In order to effectively impose both spatial and semantic compositionality to enhance few-shot learning performance, in this work we propose the Compositional Feature Aggregation (CFA) module as a weakly-supervised module for end-to-end optimization. Given the feature maps extracted from the input, we first explicitly disentangle the feature space into independent semantic subspaces to encourage semantic compositionality. Then, to further impose spatial compositionality in each of the semantic subspaces, rather than simply applying mean/max pooling, we aggregate the sub-feature maps via bilinear aggregation to extract second-order statistics and capture translation-invariant spatial structure. Finally, we concatenate the aggregated feature vectors from all the subspaces and use the result as the final descriptor. The proposed CFA module explicitly imposes semantic and spatial compositionality to help models focus on generalizing semantic knowledge at the attribute level and object-part level rather than at a holistic level, thus improving learning and generalization. Moreover, CFA imposes compositionality onto deep features in a weakly-supervised way and does not need any annotations of attributes or object parts for training.

To summarize, this paper makes the following contributions. (i) We propose to explicitly impose both semantic and spatial compositionality in the form of weakly supervised regularization for deep networks to improve generalization in few-shot recognition tasks. (ii) We propose CFA as a convenient pluggable module for end-to-end optimization without requiring annotations for semantic attributes or object parts. (iii) We evaluate our method with extensive experiments on few-shot image classification and action recognition tasks. The state-of-the-art performance validates our method’s effectiveness for few-shot recognition tasks.

Related Work

Few-shot Learning. Few-shot learning methods aim to classify new categories based on limited supervision. Recent approaches to this task can be roughly grouped into three types: meta-learning based approaches, feature-hallucination based methods, and metric-learning based models. The meta-learning based approaches aim to learn a "meta-learner" that provides proper initialization or weight updates for models to quickly adapt to novel tasks with few training examples. The feature-hallucination based methods adopt generators to learn to transfer data distributions or visual styles to augment the novel examples. The metric-learning based models learn to encode and compare features such that samples of the same category show higher similarity than those of different categories, where the similarity can be evaluated with cosine similarity, Euclidean distance, a deep relation module, or graph neural networks. Qi et al. utilize the features of novel classes as class prototypes to extend the weight matrix of the final classification layer, so that the networks can dynamically process both base and novel classes. Similarly, Gidaris et al. and Qiao et al. learn to predict the weights of the final classification layer for novel classes. Inspired by the compositionality in human visual perception, Tokmakov et al. proposed to explicitly learn semantic compositional representations for few-shot image learning. However, their method requires attribute-level annotations during training, and applies a mean pooling operation that loses discriminative information contained in the object parts’ spatial structure. In contrast, our method imposes both spatial and semantic compositionality in a weakly supervised way without the need for supervision of semantic attributes or object parts, and can conveniently learn end-to-end to produce discriminative representations.

Compositional Representation. Compositionality plays a key role in the human visual system, as it allows novel concepts to be represented in terms of known primitives and thus learned efficiently from a few examples. To exploit this property to enhance deep neural networks, Andreas et al. and Tokmakov et al. utilize attribute annotations to learn deep embeddings for compositional features, in which the embedding of an input is the sum of the encodings of its attributes. Misra et al. train classifiers for different attributes and combine them to represent novel concepts. One limitation of these methods is that they apply mean/max pooling operations over the feature maps, thus losing the spatial compositionality of visual concepts, which may result in less discriminative representations. Stone et al. address spatial compositionality by constraining the object parts to be independent in the representation space. However, all these methods rely on annotations for attributes or object parts.

Bilinear Feature Aggregation. Bilinear models were proposed to model two-factor variations like “style” and “content” for images. To improve image recognition performance with richer spatial structure, bilinear models have been utilized to model the variations arising from appearance and part locations. Compared to mean/max pooling operations that extract first-order statistics, bilinear aggregation models compute second-order statistics to preserve more complex relations. Lin et al. show that the bilinear model also generalizes to orderless second-order pooling techniques like VLAD and Fisher Vector. In our method, in order to retain richer spatial compositional information when aggregating feature maps, we build on NetVLAD, which is a differentiable version of VLAD, and extend it with semantic compositionality to enhance performance for few-shot learning.

Compositional Feature Aggregation

As shown in Fig. 2, our CFA module is a trainable module that can be conveniently plugged into standard deep CNNs to learn to aggregate image-level compositional semantic representations. By decomposing the semantic feature space into subspaces and bilinearly aggregating features in each of them, we impose both semantic and spatial compositionality in a weakly-supervised way without requiring annotations of semantic attributes and object parts for training.

We follow the $X$-way $Y$-shot protocol adopted in recent few-shot learning methods. Formally, an $X$-way $Y$-shot learning task involves three sets of data: a training set $\textbf{T}_{b}$ containing labelled samples from the base classes; a support set $\textbf{S}_{n}$ consisting of $X$ novel classes with $Y$ labelled examples for each; and a query set $\textbf{Q}_{n}$ composed of unlabelled examples from the same $X$ novel classes. Usually, the number of samples for each class in $\textbf{T}_{b}$ is much larger than $Y$. To learn an effective embedding for the few-shot learning task, we adopt episode-based training, which uses training samples to mimic the target task of classifying samples in $\textbf{Q}_{n}$ conditioned on $\textbf{S}_{n}$. At each training iteration we randomly sample $X$ classes with $Y$ labelled samples for each from $\textbf{T}_{b}$ to serve as a support set $\textbf{S}_{b}$, and a fraction of the remaining samples in the same $X$ base classes are selected as the query set $\textbf{Q}_{b}$. The objective is to train a nearest-neighbor based classifier $M$ to minimize the $X$-way prediction loss. In the episode testing stage after training, we apply $M$ to perform nearest neighbor search over $\textbf{S}_{n}$ to classify the query samples in $\textbf{Q}_{n}$.
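The episode construction described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `sample_episode`, the dictionary layout of the base set, and the toy sample ids are assumptions made for the example, not part of the original method description.

```python
import random

def sample_episode(labels_to_items, x_way, y_shot, n_query, rng):
    """Draw one X-way Y-shot episode from a dict {class_label: [items]}."""
    # Pick X classes at random from the available (base) classes.
    classes = rng.sample(sorted(labels_to_items), x_way)
    support, query = [], []
    for label in classes:
        # Draw support and query items together so they never overlap.
        picked = rng.sample(labels_to_items[label], y_shot + n_query)
        support += [(item, label) for item in picked[:y_shot]]
        query += [(item, label) for item in picked[y_shot:]]
    return support, query

# Toy base set: 10 classes with 30 hypothetical sample ids each.
base = {c: [f"img_{c}_{i}" for i in range(30)] for c in range(10)}
support, query = sample_episode(base, x_way=5, y_shot=1, n_query=16,
                                rng=random.Random(0))
```

Here the query set takes a fixed number of the remaining samples per class; the text only says "a fraction", so the exact query size is a free choice.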

Semantic Decomposition

One of the key factors in human vision’s superior ability to learn from few examples is the semantic and spatial compositionality in concept representation. In order to conveniently learn to produce a compositional representation without the need for extra annotations, inspired by Group Convolution, we explicitly decompose the feature space into independent subspaces in order to regularize the deep representation with semantic compositionality. Given a deep CNN $F(\cdot|\theta)$ that maps an image patch into a $C$-dimension vector, we uniformly divide the vector into a predefined number $N$ of disjoint groups along the channel dimension. Each of these sub-vectors has $\frac{C}{N}$ channels and corresponds to one semantic subspace. By further explicitly imposing spatial compositionality within each of the semantic subspaces (details in the next subsection), we can regularize $F(\cdot|\theta)$ to focus on generalizing semantic knowledge at the attribute level and part level rather than at the holistic instance level, thus reducing the difficulty of learning. Since our method does not use attribute-level annotations, the manually defined $N$ subspaces may correspond to semantic attributes that are not as meaningful to humans as the predefined attributes used by attribute-supervised methods. However, since our model can be conveniently optimized end-to-end, the learnt attributes can be better adapted to the task. It is also possible to explore other methods of grouping the feature channels rather than evenly dividing them into $N$ parts, yet this is beyond the scope of this paper and we leave it for future research.
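The uniform channel split can be illustrated with a few lines of Python. The helper name `split_into_subspaces` and the toy 8-channel feature are assumptions for the example; in practice the same slicing would be applied to every spatial location of a feature map.

```python
def split_into_subspaces(feature, n_groups):
    """Split a C-dimensional feature into N contiguous groups of C/N channels."""
    c = len(feature)
    if c % n_groups != 0:
        raise ValueError("C must be divisible by N")
    width = c // n_groups
    # Group n holds channels [n * C/N, (n + 1) * C/N).
    return [feature[n * width:(n + 1) * width] for n in range(n_groups)]

feature = [float(i) for i in range(8)]      # toy local feature with C = 8
groups = split_into_subspaces(feature, n_groups=4)
```

With $N$=1 the split returns the whole vector as a single group, which corresponds to the setting that uses spatial compositionality only.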

Bilinear Aggregation in Semantic Subspace

After decomposing the feature space into the $N$ semantic subspaces, the feature maps encoded by $F(\cdot|\theta)$ are divided into $N$ groups of sub-feature maps that share the same spatial structure but correspond to different semantic attributes. In order to convert these feature maps into a fixed-length representation vector, a common practice is to apply spatial mean/max pooling over the feature map. Yet mean/max pooling loses the spatial compositionality information of object parts in the input image, leading to sub-optimal and less discriminative representations for few-shot recognition tasks. Another straightforward choice is to directly flatten the feature maps, which keeps the exact spatial structure of the input image. However, we found that this drastically decreases performance, because flattened feature maps are not translation-invariant, while objects from the same category may show different spatial layouts.
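The contrast between pooling and flattening can be seen on a toy example. The sketch below assumes a one-dimensional "feature map" of four locations and models translation as a circular shift by one location; both simplifications are assumptions for illustration only.

```python
def mean_pool(fmap):
    """Average a list of per-location feature vectors into one vector."""
    n = len(fmap)
    return [sum(channel) / n for channel in zip(*fmap)]

def flatten(fmap):
    """Concatenate per-location features in spatial order."""
    return [value for location in fmap for value in location]

# Four spatial locations with 2-channel features.
fmap = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
# Same content moved by one location (circular shift).
shifted = fmap[-1:] + fmap[:-1]

same_pooled = mean_pool(fmap) == mean_pool(shifted)   # pooling ignores position
same_flat = flatten(fmap) == flatten(shifted)         # flattening does not
```

Mean pooling yields the same vector for both inputs but also discards which features co-occur locally, whereas the flattened vectors differ even though the content is identical.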

To effectively retain the spatial compositionality, we propose to bilinearly aggregate local features in each of the semantic subspaces. As shown in prior work on bilinear models, the Vector of Locally Aggregated Descriptors (VLAD), as a generalized bilinear model, is able to aggregate feature maps in a translation-invariant way without losing spatial compositional information. We therefore build our aggregation model on NetVLAD, which is a differentiable version of VLAD. Given an $H\times W$ feature map, let $x_{i,n}\in\mathbb{R}^{\frac{C}{N}}$ be the $\frac{C}{N}$-dimension feature at spatial location $i\in\{1,...,HW\}$ in semantic subspace $n\in\{1,...,N\}$. We learn to divide the $\frac{C}{N}$-dimension semantic subspace $n$ into $K$ cells via $K$ cluster centers ("semantic prototypes") $\{c_{k,n}|k=1,...,K\}$. Each local semantic sub-feature $x_{i,n}$ is then assigned to its nearest center and the residual vector $x_{i,n}-c_{k,n}$ is recorded. For each of the cells in semantic subspace $n$, the residual vectors are then summed spatially as,

where $\alpha$ is set to a high value (100 in our experiments) to achieve the effect of hard assignment, and $K$=32 as suggested in prior work. The vector $v_{k,n}$ is a $\frac{C}{N}$-dimension vector that describes the distribution of the input object’s local parts in the cell of the $k$-th semantic prototype of the $n$-th semantic subspace. An illustration is shown in Fig. 3. Compared to mean/max pooling, which pools features over the entire feature space, the local aggregation method can be seen as pooling features within the cell of each semantic prototype, thus retaining richer spatial compositional information.
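For reference, the NetVLAD soft-assignment form that this aggregation builds on can be written with the symbols defined above. The expression below is a reconstruction from the standard NetVLAD formulation and the surrounding definitions ($x_{i,n}$, $c_{k,n}$, $\alpha$, $K$); it should be read as a sketch consistent with the text rather than a verbatim copy of Equation 1.

```latex
v_{k,n} \;=\; \sum_{i=1}^{HW}
  \frac{e^{-\alpha \lVert x_{i,n} - c_{k,n} \rVert^{2}}}
       {\sum_{k'=1}^{K} e^{-\alpha \lVert x_{i,n} - c_{k',n} \rVert^{2}}}
  \,\bigl(x_{i,n} - c_{k,n}\bigr)
```

For large $\alpha$ the softmax weight approaches 1 for the nearest prototype and 0 for all others, which recovers the hard assignment of the original VLAD while keeping the expression differentiable.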

In a few-shot recognition task, we may have $Y>1$ labeled examples as support for each novel class. Based on Equation 1, given $x_{i,n}^{t}$ as $x_{i,n}$ from the $t$-th sample in the support set, the information in multiple samples can be conveniently aggregated as,

By stacking all the $\{v_{k,n}|k=1,...,K\}$ together, we get a $\frac{CK}{N}$-dimension descriptor for the input image in the $n$-th semantic subspace, which is $V_{n}=[v_{1,n};v_{2,n};...;v_{K,n}]$. Further, we concatenate the descriptors of the different semantic subspaces to form the $CK$-dimension overall representation $I=[V_{1};V_{2};...;V_{N}]$.

The image-level representation $I$ is finally L2-normalized, as suggested in prior work, to be our compositional aggregated feature. With such a design, our CFA module is able to explicitly impose spatial compositionality information as regularization for deep representations. Moreover, different from mean/max pooling operations that aggregate both foreground and background features equally, the above locality-based feature aggregation also helps to highlight the similar contents among images and suppress the influence of background, which helps to produce more discriminative representations for few-shot recognition. As all the operations of our CFA module are differentiable, the proposed module can be conveniently plugged into other deep CNNs for end-to-end optimization. It should be noted that the aggregation method we use is based on NetVLAD, yet we extend it with semantic compositionality to tackle the task of few-shot recognition.
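The aggregation pipeline of this subsection can be sketched in plain Python. The sketch assumes the NetVLAD-style soft assignment, i.e. a softmax over $-\alpha\|x-c_{k,n}\|^{2}$ across the $K$ prototypes, sums the weighted residuals over all locations and all $Y$ support samples, stacks the $K$ vectors into $V_{n}$, concatenates the $N$ subspaces, and L2-normalizes the result. The function names, the fixed toy prototypes, and the list-based layout are illustrative assumptions, not the authors' implementation, which would use learnable prototypes inside a network layer.

```python
import math

def soft_assign(x, centers, alpha):
    """Softmax over negative scaled squared distances to the K prototypes."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    nearest = min(dists)  # shift by the minimum for numerical stability
    weights = [math.exp(-alpha * (d - nearest)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

def aggregate_subspace(samples, centers, alpha=100.0):
    """Sum weighted residuals over all locations of all samples; returns V_n."""
    dim = len(centers[0])
    v = [[0.0] * dim for _ in centers]
    for fmap in samples:                      # t = 1..Y support samples
        for x in fmap:                        # i = 1..HW spatial locations
            weights = soft_assign(x, centers, alpha)
            for k, (w, c) in enumerate(zip(weights, centers)):
                for j in range(dim):
                    v[k][j] += w * (x[j] - c[j])
    return [value for row in v for value in row]   # length K * C/N

def cfa_descriptor(samples, centers_per_subspace, alpha=100.0):
    """samples[t][i] is a C-dim local feature; returns the L2-normalized CK-dim vector."""
    n_groups = len(centers_per_subspace)
    width = len(samples[0][0]) // n_groups
    descriptor = []
    for n, centers in enumerate(centers_per_subspace):
        # Keep only the channels of subspace n at every location.
        sub = [[x[n * width:(n + 1) * width] for x in fmap] for fmap in samples]
        descriptor += aggregate_subspace(sub, centers, alpha)
    norm = math.sqrt(sum(value * value for value in descriptor)) or 1.0
    return [value / norm for value in descriptor]

# Toy setting: C = 4, N = 2 subspaces, K = 2 prototypes each, 3 locations.
centers = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
sample = [[0.1, 0.0, 0.0, 0.9], [0.9, 1.0, 1.0, 0.2], [0.0, 0.2, 0.1, 1.0]]
I_vec = cfa_descriptor([sample], centers)
```

Because the residuals are summed over locations, reordering the locations of `sample` leaves `I_vec` unchanged up to floating-point rounding, which is the translation invariance the text refers to. Passing several feature maps in the `samples` list aggregates a multi-shot support set into a single class descriptor.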

CFA for Few-shot Recognition

As our CFA module naturally aggregates multiple support examples into a single L2-normalized representation vector, we adopt the cosine-similarity-based nearest neighbor classifier,

where $d(\cdot)$ is the cosine similarity, $I_{i},i\in\{1,...,X\}$ is the overall representation vector for the $i$-th category, $\hat{I}$ is the representation vector for the query sample, $l_{i},i\in\{1,...,X\}$ is the class label for the $i$-th category, and $\hat{l}$ is the predicted label for the query sample. Given the ground-truth label $l_{gt}$ for the query sample, we adopt the cross-entropy loss $L_{cls}=-l_{gt}\cdot\log(\hat{l})$ as the objective for classification.
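A cosine-similarity nearest neighbor classifier with a cross-entropy objective can be sketched as below. One plausible reading, assumed here, is that $\hat{l}$ is a softmax over the cosine similarities between the query and the class representations, so that the cross-entropy loss is well defined; the exact form of the classifier equation, any temperature scaling, and the function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(query, class_reprs):
    """Return (index of the most similar class, softmax probabilities)."""
    sims = [cosine(query, rep) for rep in class_reprs]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]   # shifted for stability
    total = sum(exps)
    return sims.index(top), [e / total for e in exps]

def cross_entropy(probs, gt_index):
    """Negative log-probability assigned to the ground-truth class."""
    return -math.log(probs[gt_index])

# Toy class representations for 3 classes and one query representation.
class_reprs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.2, 0.9, 0.1]
pred, probs = classify(query, class_reprs)
loss = cross_entropy(probs, gt_index=1)
```

At test time only the arg-max over similarities is needed; the softmax matters only for computing the training loss.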

Furthermore, in the proposed CFA module, the $\frac{C}{N}$-dimension semantic prototypes $\{c_{k,n}|k=1,...,K;n=1,...,N\}$ are important learnable parameters, as the local features are grouped and aggregated over them. To avoid learning trivial solutions for these parameters during end-to-end training, we add a regularization term to the loss function to enforce orthogonality between semantic prototypes within each attribute subspace. As a result, the final loss function for training is,

where $\gamma$ is the weight of the orthogonality constraint, and $Iden(K)$ is a $K\times K$ identity matrix.
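An orthogonality regularizer of this kind is commonly written as the distance between the Gram matrix of the prototypes and the identity. The sketch below assumes a squared Frobenius norm of $C_{n}C_{n}^{T}-Iden(K)$ summed over subspaces, where $C_{n}$ is the $K\times\frac{C}{N}$ matrix of prototypes in subspace $n$; the choice of norm, the symbol $C_{n}$, and the function names are assumptions rather than the paper's exact formulation.

```python
def orthogonality_penalty(centers):
    """Squared Frobenius norm of (C C^T - I) for one subspace (centers is K x D)."""
    k = len(centers)
    penalty = 0.0
    for a in range(k):
        for b in range(k):
            gram = sum(x * y for x, y in zip(centers[a], centers[b]))
            target = 1.0 if a == b else 0.0
            penalty += (gram - target) ** 2
    return penalty

def total_loss(cls_loss, centers_per_subspace, gamma):
    """Classification loss plus the weighted orthogonality term."""
    reg = sum(orthogonality_penalty(c) for c in centers_per_subspace)
    return cls_loss + gamma * reg

orthonormal = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]   # two identical prototypes (a trivial solution)
```

Orthonormal prototypes incur zero penalty, while collapsed prototypes are penalized through the off-diagonal entries of the Gram matrix, which is how the term discourages trivial solutions.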

Experiments

We evaluate our method on two few-shot recognition scenarios: image classification and action recognition. All the experiments are implemented with PyTorch on an NVIDIA Titan Xp GPU.

Dataset. miniImagenet is a popular dataset for evaluating few-shot learning models. It contains 100 classes with 600 images each, drawn from the ImageNet dataset. We follow the data split of prior work with 64 base, 16 validation, and 20 novel categories. Another dataset we use is the CUB dataset, which is a fine-grained bird species dataset composed of 11,788 images of 200 classes. We evaluate on this dataset with 64, 16, and 20 classes for training, validation, and testing, respectively, following prior work.

Implementation. We use ResNet-18 as the feature encoder $F(\cdot|\theta)$, and randomly initialize the parameters before training on each dataset. To effectively train with our CFA module, we first pretrain the feature encoder as a normal classifier on the base classes for 30,000 iterations, and then perform joint episode training with the loss function in Equation (5) for another 30,000 iterations. The validation set is used to select the iteration with the best accuracy. In each testing episode of a $Y$-shot recognition task, we randomly select 5 classes from the testing set. For each class, $Y$ labelled examples and 16 unlabelled examples are selected as the support set and query set, respectively. To evaluate the performance, we perform 600 testing episodes and report the average accuracy with 95$\%$ confidence intervals as the final results. We also implement and evaluate several recent state-of-the-art few-shot learning methods in the same setting for fair comparison. All these methods are optimized using the Adam optimizer with an initial learning rate of 0.001 and a batch size of 16 for 60,000 iterations.
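The reported figure (mean accuracy with a 95% confidence interval over test episodes) can be computed as sketched below. The text does not state how the interval is obtained; a common choice, assumed here, is the normal approximation $1.96\,\sigma/\sqrt{n}$ over per-episode accuracies. The episode accuracies in the example are simulated and are not results from the paper.

```python
import math
import random
import statistics

def mean_with_ci95(accuracies):
    """Mean and half-width of the 95% confidence interval (normal approximation)."""
    mean = statistics.mean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, half_width

# 600 simulated episodes; each has 5 classes x 16 queries = 80 predictions,
# each counted as correct with a hypothetical probability of 0.6.
rng = random.Random(0)
episode_acc = [sum(rng.random() < 0.6 for _ in range(80)) / 80 for _ in range(600)]
mean, ci = mean_with_ci95(episode_acc)
print(f"{100 * mean:.2f} +/- {100 * ci:.2f} %")
```

Since the half-width shrinks with the square root of the number of episodes, the 20,000-episode protocol used for action recognition gives much tighter intervals than 600 episodes at the same per-episode variance.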

Results. We compare with recent state-of-the-art models including MatchingNet, ProtoNet, RelationNet, and MAML with the same backbone and training/testing protocol. For our CFA model, we report the results for $N$=1, which only considers spatial compositionality, and $N$=64, which imposes both semantic and spatial compositionality. As shown in Table 1, our CFA performs better than other methods on the miniImagenet dataset for different sizes of support set. The results on the CUB dataset are shown in Table 2. Our CFA ($N$=64) outperforms other methods for both 1-shot and 3-shot tasks, and achieves a similar accuracy to ProtoNet for the 5-shot task on this dataset. This similar 5-shot result may be because CUB, as a fine-grained bird dataset, shows small intra-class variance, so a larger support set helps to estimate a better class center and greatly benefits methods like ProtoNet that rely on distribution estimation. Moreover, due to the relatively smaller inter-class variance of the CUB dataset (all images are birds) compared to miniImagenet, the benefit of imposing spatial compositionality alone is limited. Thus CFA ($N$=1) achieves better accuracy than these state-of-the-art methods on miniImagenet but performs worse on CUB. However, by further incorporating semantic compositionality, our CFA with $N$=64 shows better accuracy on both datasets, especially for low-shot cases. The improvement achieved on both datasets shows that our method is effective at learning from a few examples for both generic and fine-grained image classification.

Action Recognition

Dataset. We evaluate the performance for few-shot action recognition on two datasets: Kinetics-CMN and Jester. The Kinetics-CMN dataset contains 100 classes with 100 examples each, selected from the Kinetics dataset. We evaluate with the previously published splits of 64, 12, and 24 non-overlapping classes for base, validation, and novel classes, respectively. The Jester dataset is a hand gesture dataset containing 27 categories of hand gestures with 148,092 video samples in total. In our experiments, we randomly select 1,000 video samples for each hand gesture and then randomly split the 27 classes into 13, 5, and 9 non-overlapping classes to be the base, validation, and novel categories, respectively.

Implementation. To extract feature maps from the video sequences, we adopt the RGB stream of the two-stream model with ResNet-18 as the backbone. Following prior practice, 10 frames are randomly sampled from each video to be the input sequence for the deep CNN. We initialize the backbone with parameters pretrained on the ImageNet dataset, then perform episode training for 10 epochs on each dataset. It should be noted that for the action recognition task, our CFA is extended to the temporal dimension and aggregates the feature maps spatio-temporally. The model with the best accuracy is selected using the validation set. As in the image-level few-shot learning task, we also implement recent state-of-the-art methods for comparison. For these methods, spatio-temporal feature maps are mean pooled to form the video-level representation, as in prior work. All the methods are trained using the Adam optimizer with an initial learning rate of 0.0001 and a batch size of 1. During the episode testing stage, we randomly sample 20,000 episodes, as in prior work, and report the mean accuracy as well as the 95$\%$ confidence intervals as the final results.

Results. We compare our method with several approaches including MatchingNet, ProtoNet, RelationNet, and CMN. As presented in Table 3, our CFA ($N$=64) outperforms other methods for 1-shot, 3-shot, and 5-shot recognition on the Kinetics-CMN dataset. Given the very large spatio-temporal sample space for videos and the very limited training data (100 videos for each base class), it is challenging for deep models to effectively learn a mapping that generalizes well. As a result, the distribution-based model ProtoNet performs worst among these methods for small support sets such as the 1-shot task. In contrast, our CFA achieves 69.9$\%$ for the 1-shot case, which is nearly 14$\%$ higher than ProtoNet. The large gap shows the effectiveness of the spatio-temporal and semantic compositionality imposed by our method. The Jester dataset is harder for few-shot learning since the inter-class variety of hand gestures is much smaller than that of generic actions like those in Kinetics-CMN. As shown in Table 4, the accuracy of previous methods like MatchingNet, ProtoNet, and RelationNet drops drastically compared to our CFA ($N$=64). By only considering the spatial bilinear aggregation, rather than applying mean pooling to the spatio-temporal feature maps, our CFA ($N$=1) still achieves better performance than previous methods, especially for the 1-shot task. By further imposing semantic compositionality on the deep features, CFA ($N$=64) achieves a much higher accuracy. This shows that both semantic and spatial compositionality are important for few-shot action recognition, and that our CFA is effective at learning to produce discriminative compositional features in a weakly-supervised way.

Method Analysis

We first analyse the effect of $N$, which is the number of predefined semantic subspaces. The results for different values of $N$ on the four datasets are presented in Fig. 4. When increasing $N$ from 1, the improvements on the miniImagenet and Kinetics-CMN datasets are less obvious than those on the CUB and Jester datasets, indicating that for generic image/action classification the spatial/spatio-temporal compositionality is more effective, while for fine-grained classification the semantic compositionality plays a more important role. Moreover, on both the CUB and Jester datasets, the accuracy improves greatly for some values of $N$ ($N$=64 in Fig. 4(a) and $N$=4 in Fig. 4(d)) and then becomes stable. This shows that for fine-grained datasets, there exists a group of optimal semantic attributes that generalize well, and our CFA model can effectively learn to find them with a relatively larger $N$.

We also show in Fig. 5 the effect of the weight $\gamma$ of the orthogonality constraint in the loss function (Equation 5). As we can see, on the miniImagenet and Kinetics-CMN datasets, which contain more generic scenes and objects, the performance is less sensitive to the value of $\gamma$, since the high inter-class and intra-class variance helps our CFA to find and summarize meaningful centers. For the CUB and Jester datasets, without the orthogonality constraint ($\gamma=0$) the model may learn trivial solutions, leading to low accuracy. However, too high a $\gamma$ will also harm the performance: as these two datasets have small inter-class and intra-class variance, forcing the semantic prototypes to be orthogonal to each other may lead to a poorly generalizing representation.

Finally, to evaluate our method’s ability for cross-domain few-shot learning, we train the model on datasets for generic classification and then test the performance on datasets for fine-grained category recognition. As shown in Table 5, our CFA shows better transfer ability than other methods for both image classification and action recognition. In the first two rows, we compare our CFA with semantic compositionality ($N$=64) and without semantic compositionality ($N$=1). For both tasks, by only considering spatial compositionality, CFA ($N$=1) achieves better performance than previous methods. Further considering attribute compositionality, CFA ($N$=64) leads to further improvement on image classification tasks, but worse accuracy on video recognition tasks. The decrease for video transfer-learning tasks may be because the Kinetics-CMN dataset, which has only 10,000 videos in total, is too small to learn effective semantic sub-features that generalize well to the indoor hand gesture dataset.

Conclusion

In this paper, we propose the Compositional Feature Aggregation module as a pluggable end-to-end layer for few-shot learning tasks. By decomposing the feature space into attribute subspaces and applying bilinear local aggregation in each subspace, CFA imposes both spatial and semantic compositionality as a regularization to improve FSL, and produces more discriminative representations. We evaluate our model on both generic and fine-grained image classification and video classification tasks, and the improvements on all four datasets validate the proposed method’s effectiveness.