Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly

Yongqin Xian, Christoph H. Lampert, Bernt Schiele, Zeynep Akata

Introduction

Zero-shot learning aims to recognize objects whose instances may not have been seen during training. The number of new zero-shot learning methods proposed every year has been increasing rapidly, i.e. the good aspects as our title suggests. Although each new method has been shown to make progress over the previous one, it is difficult to quantify this progress without an established evaluation protocol, i.e. the bad aspects. In fact, the quest for improving numbers has even led to flawed evaluation protocols, i.e. the ugly aspects. Therefore, in this work, we extensively evaluate a significant number of recent zero-shot learning methods in depth on several small to large-scale datasets using the same evaluation protocol, both in the zero-shot setting, where training and test classes are disjoint, and in the more realistic generalized zero-shot learning setting, where training classes are also present at test time. Figure 1 presents an illustration of zero-shot and generalized zero-shot learning tasks.

We benchmark and systematically evaluate zero-shot learning w.r.t. three aspects, i.e. methods, datasets and evaluation protocol. The crux of the matter for all zero-shot learning methods is to associate observed and non-observed classes through some form of auxiliary information which encodes visually distinguishing properties of objects. The methods that we evaluate in this work fall into three groups. Linear and nonlinear compatibility learning frameworks have dominated the zero-shot learning literature in the past few years. An orthogonal direction is learning independent attribute classifiers. Finally, hybrid models between independent classifier learning and compatibility learning have demonstrated improved results over the compatibility learning frameworks in both zero-shot and generalized zero-shot learning settings.

We thoroughly evaluate the second aspect of zero-shot learning by using multiple splits of several small, medium and large-scale datasets. Among these, the Animals with Attributes (AWA1) dataset, introduced as a zero-shot learning dataset with per-class attribute annotations, has been one of the most widely used datasets for zero-shot learning. However, as AWA1 images do not have a public copyright license, only some image features of the AWA1 dataset, i.e. SIFT, DECAF and VGG19, are publicly available, rather than the raw images. On the other hand, improving image features is a significant part of the progress both for supervised learning and for zero-shot learning. In fact, with the fast pace of deep learning, new deep neural network (DNN) models that improve ImageNet classification performance are proposed every day. Without access to images, those new DNN models cannot be evaluated on the AWA1 dataset. Therefore, with this work, we introduce the Animals with Attributes 2 (AWA2) dataset, which has roughly the same number of images, all with public licenses, and exactly the same number of classes and attributes as the AWA1 dataset. We will make both ResNet features of AWA2 images and the images themselves publicly available.

We propose a unified evaluation protocol to address the third aspect of zero-shot learning, which is one of the most important ones. We emphasize the necessity of tuning hyperparameters of the methods on a validation class split that is disjoint from training classes, as improving zero-shot learning performance via tuning parameters on test classes violates the zero-shot assumption. We argue that per-class averaged top-1 accuracy is an important evaluation metric when the dataset is not well balanced with respect to the number of images per class. We point out that extracting image features via a deep neural network (DNN) pre-trained on a large dataset that contains zero-shot test classes also violates the zero-shot learning idea, as image feature extraction is a part of the training procedure. Moreover, we argue that demonstrating zero-shot performance on small-scale and coarse-grained datasets, e.g. aPY, is not conclusive. On the other hand, with this work we emphasize that it is hard to obtain labeled training data for fine-grained classes of rare objects, whose recognition requires expert opinion. Therefore, we argue that zero-shot learning methods should also be evaluated on the least populated or rare classes. We recommend abstracting away from the restricted nature of zero-shot evaluation and making the task more practical by including training classes in the search space, i.e. the generalized zero-shot learning setting. Therefore, we argue that our work plays an important role in advancing the zero-shot learning field by analyzing the good and bad aspects of the zero-shot learning task as well as proposing ways to eliminate the ugly ones.

Related Work

Early works of zero-shot learning make use of attributes within a two-stage approach to infer the label of an image that belongs to one of the unseen classes. In the most general sense, the attributes of an input image are predicted in the first stage, then its class label is inferred by searching for the class which attains the most similar set of attributes. For instance, DAP first estimates the posterior of each attribute for an image by learning probabilistic attribute classifiers. It then calculates the class posteriors and predicts the class label using the MAP estimate. Similarly, first learns a probabilistic classifier for each attribute. It then estimates the class posteriors through a random forest which is able to handle unreliable attributes. IAP first predicts the class posterior of seen classes, then the probability of each class is used to calculate the attribute posteriors of an image. The class posterior of seen classes is predicted by a multi-class classifier. In addition, this two-stage approach has been extended to the case when attributes are not available. For example, following IAP, CONSE first predicts seen class posteriors, then it projects the image feature into the Word2vec space by taking the convex combination of the top $T$ most probable seen classes. The two-stage models suffer from domain shift between the intermediate task and the target task, e.g. although the target task is to predict the class label, the intermediate task of DAP is to learn attribute classifiers.

Recent advances in zero-shot learning directly learn a mapping from an image feature space to a semantic space. Among those, SOC maps the image features into the semantic space and then searches for the nearest class embedding vector. ALE learns a bilinear compatibility function between the image and the attribute space using a ranking loss. DeViSE also learns a linear mapping between image and semantic space using an efficient ranking loss formulation, and it is evaluated on the large-scale ImageNet dataset. SJE optimizes the structural SVM loss to learn the bilinear compatibility. On the other hand, ESZSL uses the square loss to learn the bilinear compatibility and explicitly regularizes the objective w.r.t. the Frobenius norm. The $l_{2,1}$-based objective function of suppresses the noise in the semantic space. embeds visual features into the attribute space, and then learns a metric to improve the consistency of the semantic embedding. Recently, SAE proposed a semantic auto-encoder to regularize the model by enforcing that the image feature projected to the semantic space can be reconstructed.

Other zero-shot learning approaches learn non-linear multi-modal embeddings. LatEm extends the bilinear compatibility model of SJE to a piecewise linear one by learning multiple linear mappings, with the selection among them being a latent variable. CMT uses a neural network with two hidden layers to learn a non-linear projection from image feature space to word2vec space. Unlike other works which build their embedding on top of fixed image features, trains a deep convolutional neural network while learning a visual semantic embedding. Similarly, argues that the visual feature space is more discriminative than the semantic space, thus it proposes an end-to-end deep embedding model which maps semantic features into the visual space. proposes a simple model by projecting class semantic representations into the visual feature space and performing nearest neighbor classification among those projected representations. The projection is learned through a support vector regressor with visual exemplars of seen classes, i.e. class centroids in the feature space.

Embedding both the image and semantic features into another common intermediate space is another direction that zero-shot learning approaches adopt. SSE uses the mixture of seen class proportions as the common space and argues that images belonging to the same class should have a similar mixture pattern. JLSE maps visual features and semantic features into two separate latent spaces, and measures their similarity by learning another bilinear compatibility function. Furthermore, hybrid models such as jointly embeds multiple text representations and multiple visual parts to ground attributes on different image regions. SYNC constructs the classifiers of unseen classes by taking linear combinations of base classifiers, which are trained in a discriminative learning framework.

While most zero-shot learning methods learn the cross-modal mapping between the image and class embedding space with discriminative losses, there are a few generative models that represent each class as a probability distribution. GFZSL models each class-conditional distribution as a Gaussian and learns a regression function that maps a class embedding into the latent space. GLaP assumes that each class-conditional distribution follows a Gaussian and generates virtual instances of unseen classes from the learned distribution. learns a multimodal mapping where class and image embeddings of categories are both represented by Gaussian distributions.

Apart from the inductive zero-shot learning set-up where the model has access to neither visual nor side information of unseen classes, transductive zero-shot learning approaches use visual or semantic information of both seen and unseen classes without having access to the label information. combines DAP and graph-based label propagation. uses the idea of domain adaptation frameworks. proposes hypergraph label propagation which allows the use of multiple class embeddings. use semi-supervised learning based on max-margin framework.

In zero-shot learning, some form of side information is required to share information between classes so that the knowledge learned from seen classes is transferred to unseen classes. One popular form of side information is attributes, i.e. shared and nameable visual properties of objects. However, attributes usually require costly manual annotation. Thus, there has been a large group of studies which exploit other auxiliary information that reduces this annotation effort. does not use side information; however, it requires a one-shot image of the novel class to perform nearest neighbor search with the learned metric. SJE evaluates four different class embeddings including attributes, word2vec, glove and the wordnet hierarchy. On ImageNet, leverages the wordnet hierarchy. leverages the rich information of detailed visual descriptions obtained from novice users and improves the performance of attributes obtained from experts. Recently, took a different approach and learned class embeddings using human gaze tracks, showing that human gaze is class-specific.

Zero-shot learning has been criticized for being a restrictive setup, as it comes with the strong assumption that the image used at prediction time can only come from unseen classes. Therefore, the generalized zero-shot learning setting has been proposed to generalize the zero-shot learning task to the case where both seen and unseen classes are used at test time. argues that although ImageNet classification challenge performance has reached beyond human performance, we do not observe similar behavior of the methods that compete at the detection challenge, which involves rejecting unknown objects while detecting the position and label of a known object. uses label embeddings to operate on the generalized zero-shot learning setting whereas proposes to learn latent representations for images and classes through coupled linear regression of factorized joint embeddings. On the other hand, introduces a new model layer to the deep net which estimates the probability of an input being from an unknown class and proposes a novelty detection mechanism.

Although zero-shot vs generalized zero-shot learning evaluation works exist in the literature, our work stands out in multiple aspects. For instance, operates on the ImageNet 1K by using 800 classes for training and 200 for test. One of the most comprehensive works, provides a comparison between five methods evaluated on three datasets including ImageNet with three standard splits and proposes a metric to evaluate generalized zero-shot learning performance. On the other hand, we evaluate ten zero-shot learning methods on five datasets with several splits both for zero-shot and generalized zero-shot learning settings, provide statistical significance and robustness tests, and present other valuable insights that emerge from our benchmark. In this sense, ours is the most extensive evaluation of zero-shot and generalized zero-shot learning tasks in the literature.

Evaluated Methods

We start by formalizing the zero-shot learning task and then we describe the zero-shot learning methods that we evaluate in this work. Given a training set $\mathcal{S}=\{(x_{n},y_{n}),n=1...N\}$, with $y_{n}\in\mathcal{Y}^{tr}$ belonging to training classes, the task is to learn $f:\mathcal{X}\rightarrow\mathcal{Y}$ by minimizing the regularized empirical risk:

$$\frac{1}{N}\sum_{n=1}^{N}L(y_{n},f(x_{n};W))+\Omega(W)$$

where $L(.)$ is the loss function and $\Omega(.)$ is the regularization term. Here, the mapping $f:\mathcal{X}\rightarrow\mathcal{Y}$ from input to output embeddings is defined as:

$$f(x;W)=\mathop{\rm argmax}_{y\in\mathcal{Y}}F(x,y;W)$$

At test time, in zero-shot learning setting, the aim is to assign a test image to an unseen class label, i.e. $\mathcal{Y}^{ts}\subset\mathcal{Y}$ and in generalized zero-shot learning setting, the test image can be assigned either to seen or unseen classes, i.e. $\mathcal{Y}^{tr+ts}\subset\mathcal{Y}$ with the highest compatibility score.

Attribute Label Embedding (ALE), Deep Visual Semantic Embedding (DEVISE) and Structured Joint Embedding (SJE) use a bilinear compatibility function to associate visual and auxiliary information:

$$F(x,y;W)=\theta(x)^{T}W\phi(y)$$

where $\theta(x)$ and $\phi(y)$ are the image and class embeddings, both of which are given. $F(.)$ is parameterized by the mapping $W$, which is to be learned. Given an image, compatibility learning frameworks predict the class which attains the maximum compatibility score with the image.
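As an illustration of how a learned bilinear compatibility is used at test time, the following is a minimal sketch assuming the form $F(x,y;W)=\theta(x)^{T}W\phi(y)$. The function names, array shapes and toy data are hypothetical and not part of the benchmark code. The only difference between the zero-shot and generalized zero-shot settings in this sketch is the set of candidate classes passed to the prediction function.

```python
import numpy as np

def compatibility(theta_x, W, phi):
    """Bilinear scores F(x, y; W) = theta(x)^T W phi(y) for every class.

    theta_x: (d_img,) image embedding
    W:       (d_img, d_cls) learned mapping
    phi:     (n_classes, d_cls) class embeddings, one row per class
    """
    return phi @ (W.T @ theta_x)

def predict(theta_x, W, phi, candidate_classes):
    """Return the candidate class with the highest compatibility score."""
    scores = compatibility(theta_x, W, phi)
    candidates = np.asarray(candidate_classes)
    return int(candidates[np.argmax(scores[candidates])])

# Toy data (hypothetical): 4 classes, classes 0-1 are seen, 2-3 are unseen.
rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3))
W = rng.normal(size=(5, 3))
theta_x = rng.normal(size=5)

zsl_label = predict(theta_x, W, phi, [2, 3])         # search unseen classes only
gzsl_label = predict(theta_x, W, phi, [0, 1, 2, 3])  # search seen and unseen classes
```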

Among the methods that are detailed below, ALE , DEVISE and SJE do early stopping to implicitly regularize Stochastic Gradient Descent (SGD) while ESZSL and SAE explicitly regularize the embedding model as detailed below. In the following, we provide a unified formulation of these five zero-shot learning methods.

DEVISE uses a pairwise ranking objective that is inspired by the unregularized ranking SVM:

$$\sum_{y\in\mathcal{Y}^{tr}}[\Delta(y_{n},y)+F(x_{n},y;W)-F(x_{n},y_{n};W)]_{+} \tag{4}$$

where $\Delta(y_{n},y)$ is equal to 0 if $y_{n}=y$, otherwise 1, so that a margin is enforced only against incorrect classes. The objective function is convex and is optimized by Stochastic Gradient Descent.
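To make the pairwise ranking objective concrete, the sketch below evaluates the per-sample loss from a vector of compatibility scores. It assumes the usual 0/1 margin, i.e. $\Delta(y_{n},y)$ is 0 for the correct class and 1 for every other class; the function name and the toy scores are hypothetical.

```python
import numpy as np

def devise_loss(scores, true_idx):
    """Sum of hinge terms [Delta(y_n, y) + F(x_n, y) - F(x_n, y_n)]_+ over training classes.

    scores:   (n_train_classes,) compatibility scores F(x_n, y; W)
    true_idx: index of the ground-truth class y_n
    """
    delta = np.ones_like(scores)
    delta[true_idx] = 0.0  # assumed convention: no margin for the correct class
    return float(np.maximum(0.0, delta + scores - scores[true_idx]).sum())

scores = np.array([2.0, 1.5, -1.0])
loss = devise_loss(scores, true_idx=0)  # only class 1 violates the margin
```

A class contributes to the loss only when its score comes within the margin of the score of the correct class.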

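ALE uses the weighted approximate ranking objective for zero-shot learning in the following way:

$$\sum_{y\in\mathcal{Y}^{tr}}\frac{l_{r_{\Delta(x_{n},y_{n})}}}{r_{\Delta(x_{n},y_{n})}}[\Delta(y_{n},y)+F(x_{n},y;W)-F(x_{n},y_{n};W)]_{+}$$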

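where $l_{k}=\sum_{i=1}^{k}\alpha_{i}$ and $r_{\Delta(x_{n},y_{n})}$ is defined as:

$$r_{\Delta(x_{n},y_{n})}=\sum_{y\in\mathcal{Y}^{tr}}\mathbb{1}(F(x_{n},y;W)+\Delta(y_{n},y)\geq F(x_{n},y_{n};W))$$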

Following the heuristic in , selects $\alpha_{i}=1/i$ which puts a high emphasis on the top of the rank list.
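The effect of the rank weighting can be illustrated with a short sketch. It assumes that the rank $r$ counts all training classes whose margin-augmented score is at least the score of the correct class (so the correct class itself is always counted and $r\geq 1$), and that $\Delta$ is 0 for the correct class and 1 otherwise; both conventions, as well as the function name, are assumptions made for illustration.

```python
import numpy as np

def ale_loss(scores, true_idx):
    """Weighted approximate ranking loss with alpha_i = 1 / i.

    scores:   (n_train_classes,) compatibility scores F(x_n, y; W)
    true_idx: index of the ground-truth class y_n
    """
    delta = np.ones_like(scores)
    delta[true_idx] = 0.0
    margins = delta + scores - scores[true_idx]
    r = int(np.sum(margins >= 0))             # rank estimate, at least 1
    l_r = np.sum(1.0 / np.arange(1, r + 1))   # l_r = sum_{i<=r} alpha_i
    return float(l_r / r * np.maximum(0.0, margins).sum())

loss = ale_loss(np.array([2.0, 1.5, -1.0]), true_idx=0)
```

Since $l_{r}/r$ decreases as $r$ grows, each violation is weighted most heavily when the correct class is already near the top of the ranked list.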

SJE gives the full weight to the top of the ranked list and is inspired by the structured SVM:

$$[\max_{y\in\mathcal{Y}^{tr}}(\Delta(y_{n},y)+F(x_{n},y;W))-F(x_{n},y_{n};W)]_{+}$$

The prediction can only be made after computing the score against all the classifiers, so as to find the maximum violating class, which makes SJE less efficient than DEVISE and ALE.

ESZSL applies a square loss to the ranking formulation and adds the following explicit regularization term to the unregularized risk minimization formulation:

$$\gamma\|W\phi(y)\|^{2}+\lambda\|\theta(x)^{T}W\|^{2}+\beta\|W\|^{2}$$

where $\gamma,\lambda,\beta$ are regularization parameters. The first two terms bound the Euclidean norm of projected attributes in the feature space and projected image feature in the attribute space respectively. The advantage of this approach is that the objective function is convex and has a closed form solution.
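For illustration, a closed-form solution of this kind can be sketched as follows. The sketch assumes the matrix form commonly associated with ESZSL, in which choosing $\beta=\gamma\lambda$ yields $W=(XX^{T}+\gamma I)^{-1}XYS^{T}(SS^{T}+\lambda I)^{-1}$, where the columns of $X$ are image embeddings, $Y$ holds one-hot training labels and the columns of $S$ are class embeddings. The symbols $X$, $Y$, $S$, the function name and the toy data are assumptions that do not appear in the text above.

```python
import numpy as np

def eszsl_fit(X, Y, S, gamma, lam):
    """Closed-form W under the assumption beta = gamma * lam.

    X: (d, n) image embeddings as columns
    Y: (n, z) one-hot labels over z training classes
    S: (a, z) class embeddings as columns
    Returns W of shape (d, a).
    """
    d, a = X.shape[0], S.shape[0]
    left = np.linalg.solve(X @ X.T + gamma * np.eye(d), X @ Y @ S.T)
    # Right-multiply by (S S^T + lam I)^{-1} via a second linear solve.
    return np.linalg.solve((S @ S.T + lam * np.eye(a)).T, left.T).T

# Toy data (hypothetical): 6 images, 4-dim features, 3 classes, 2 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
Y = np.eye(3)[rng.integers(0, 3, size=6)]
S = rng.normal(size=(2, 3))
gamma, lam = 0.1, 0.5
W = eszsl_fit(X, Y, S, gamma, lam)
```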

SAE also learns the linear projection from image embedding space to class embedding space, but it further constrains that the projection must be able to reconstruct the original image embedding. Similar to the linear auto-encoder, SAE optimizes the following objective:

$$\min_{W}\|\theta(x)-W^{T}\phi(y)\|^{2}+\lambda\|W\theta(x)-\phi(y)\|^{2}$$

where $\lambda$ is a hyperparameter to be tuned. The optimization problem can be transformed such that Bartels-Stewart algorithm is able to solve it efficiently.

2 Learning Nonlinear Compatibility

Latent Embeddings (LATEM) and Cross Modal Transfer (CMT) encode an additional non-linearity component to linear compatibility learning framework.

LATEM constructs a piece-wise linear compatibility:

$$F(x,y;W_{i})=\max_{1\leq i\leq K}\theta(x)^{T}W_{i}\phi(y)$$

where every $W_{i}$ models a different visual characteristic of the data and the selection of which matrix to use to do the mapping is a latent variable and $K$ is a hyperparameter to be tuned. LATEM uses the ranking loss formulated in Equation 4 and Stochastic Gradient Descent as the optimizer.
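The piece-wise linear compatibility can be sketched in a few lines, assuming the score is the maximum over $K$ bilinear maps. The function name and the stacked layout of the $W_{i}$ matrices are hypothetical choices for illustration.

```python
import numpy as np

def latem_score(theta_x, Ws, phi_y):
    """Piece-wise linear compatibility: max over K bilinear maps.

    theta_x: (d_img,) image embedding
    Ws:      (K, d_img, d_cls) stacked matrices W_1..W_K
    phi_y:   (d_cls,) class embedding
    Returns the best score and the index of the selected (latent) matrix.
    """
    scores = np.einsum("i,kij,j->k", theta_x, Ws, phi_y)
    best = int(np.argmax(scores))
    return float(scores[best]), best

theta_x = np.array([1.0, 0.0])
phi_y = np.array([1.0, 1.0])
Ws = np.stack([np.eye(2), 2.0 * np.eye(2)])
score, selected = latem_score(theta_x, Ws, phi_y)
```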

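CMT first maps images into a semantic space of words, i.e. class names, where a neural network with $\tanh$ nonlinearity learns the mapping:

$$\sum_{y\in\mathcal{Y}^{tr}}\sum_{x\in\mathcal{X}_{y}}\|\phi(y)-W_{1}\tanh(W_{2}\theta(x))\|^{2}$$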

where $(W_{1},W_{2})$ are weights of the two layer neural network. This is followed by a novelty detection mechanism that assigns images to unseen or seen classes. The novelty is detected either via thresholds learned using the embedded images of the seen classes or the outlier probabilities are obtained in an unsupervised way. As zero-shot learning assumes that test images are only from unseen classes, in our experiments when we refer to CMT, that means we do not use the novelty detection component. On the other hand, we name the CMT with novelty detection as CMT* when we apply it to the generalized zero-shot learning setting.

3 Learning Intermediate Attribute Classifiers

Although Direct Attribute Prediction (DAP) and Indirect Attribute Prediction (IAP) have been shown to perform poorly compared to compatibility learning frameworks, we include them in our evaluation for being historically the most widely used methods in the literature.

DAP learns probabilistic attribute classifiers and makes a class prediction by combining scores of the learned attribute classifiers. A novel image is assigned to one of the unknown classes using:

$$f(x)=\mathop{\rm argmax}_{c}\prod_{m=1}^{M}\frac{p(a_{m}^{c}|x)}{p(a_{m}^{c})} \tag{12}$$

with $M$ being the total number of attributes, $a^{c}_{m}$ is the m-th attribute of class $c$ , $p(a_{m}^{c}|x)$ is the attribute probability given image $x$ which is obtained from the attribute classifiers whereas $p(a_{m}^{c})$ is the attribute prior estimated by the empirical mean of attributes over training classes. We train binary classifiers with logistic regression that gives probability scores of attributes with respect to training classes.
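A minimal sketch of this decision rule is given below, assuming binary class attributes and a product of per-attribute ratios $p(a_{m}^{c}|x)/p(a_{m}^{c})$ that is evaluated in log space for numerical stability. The function name, the clipping constant and the toy values are hypothetical.

```python
import numpy as np

def dap_predict(attr_probs, class_attrs, attr_prior, eps=1e-12):
    """Pick the class maximizing prod_m p(a_m^c | x) / p(a_m^c).

    attr_probs:  (M,) classifier outputs p(a_m = 1 | x)
    class_attrs: (C, M) binary attribute signature of each candidate class
    attr_prior:  (M,) empirical prior p(a_m = 1) over training classes
    """
    # Probability that attribute m takes the value class c prescribes.
    p_x = np.where(class_attrs == 1, attr_probs, 1.0 - attr_probs)
    p_prior = np.where(class_attrs == 1, attr_prior, 1.0 - attr_prior)
    log_score = np.log(np.clip(p_x, eps, None)) - np.log(np.clip(p_prior, eps, None))
    return int(np.argmax(log_score.sum(axis=1)))

attr_probs = np.array([0.9, 0.2])
class_attrs = np.array([[1, 0], [0, 1]])
attr_prior = np.array([0.5, 0.5])
predicted = dap_predict(attr_probs, class_attrs, attr_prior)
```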

IAP indirectly estimates the attribute probabilities of an image by first predicting the probabilities of each training class, then multiplying by the class attribute matrix. The attribute probabilities are obtained by:

$$p(a_{m}|x)=\sum_{k=1}^{K}p(a_{m}|y_{k})p(y_{k}|x)$$

where $K$ is the number of training classes, $p(a_{m}|y_{k})$ is the predefined class attribute and $p(y_{k}|x)$ is the training class posterior from the multi-class classifier. Equation 12 is then used to predict the class label. We train the multi-class classifier on training classes with logistic regression.
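The indirect estimation step amounts to a single matrix-vector product, as the following sketch shows. It assumes that the class posteriors and the class attribute matrix are given as arrays; the function name and toy values are hypothetical.

```python
import numpy as np

def iap_attribute_probs(class_posteriors, train_class_attrs):
    """p(a_m | x) = sum_k p(a_m | y_k) p(y_k | x).

    class_posteriors:  (K,) posteriors p(y_k | x) over training classes
    train_class_attrs: (K, M) class attribute matrix p(a_m | y_k)
    """
    return class_posteriors @ train_class_attrs

posteriors = np.array([0.7, 0.3])
train_attrs = np.array([[1.0, 0.0, 1.0],
                        [0.0, 1.0, 1.0]])
attr_probs = iap_attribute_probs(posteriors, train_attrs)
```

The resulting attribute probabilities can then be passed to the same class-level decision rule as in DAP.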

4 Hybrid Models

Semantic Similarity Embedding (SSE) , Convex Combination of Semantic Embeddings (CONSE) and Synthesized Classifiers (SYNC) express images and semantic class embeddings as a mixture of seen class proportions, hence we group them as hybrid models.

SSE leverages similar class relationships both in image and semantic embedding space. An image is labeled with:

$$\mathop{\rm argmax}_{y\in\mathcal{Y}^{ts}}\pi(\theta(x))^{T}\psi(\phi(y))$$

where $\pi,\psi$ are mappings of class and image embeddings into a common space defined by the mixture of seen classes proportions. Specifically, $\psi$ is learned by sparse coding and $\pi$ is by class dependent transformation.

CONSE learns the probability of a training image belonging to a training class:

$$f(x,t)=\mathop{\rm argmax}_{y\in\mathcal{Y}^{tr}}p_{tr}(y|x)$$

where $y$ denotes the most likely training label ($t=1$) for image $x$. A combination of semantic embeddings ($s$) is used to assign an unknown image to an unseen class:

$$\frac{1}{Z}\sum_{i=1}^{T}p_{tr}(f(x,i)|x)\cdot s(f(x,i))$$

where $Z=\sum_{i=1}^{T}p_{tr}(f(x,i)|x)$, $f(x,t)$ denotes the $t$-th most likely label for image $x$ and $T$ controls the maximum number of semantic embedding vectors.
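The convex combination step can be sketched as follows. The embedding of an image is the probability-weighted average of the embeddings of its top $T$ training classes. The final nearest-class search by cosine similarity is an assumption made for illustration, as are the function names and toy values.

```python
import numpy as np

def conse_embed(train_probs, train_embeddings, T):
    """Convex combination of the embeddings of the T most likely training classes.

    train_probs:      (K,) posteriors p_tr(y | x) over training classes
    train_embeddings: (K, d) semantic embeddings s(y) of training classes
    """
    top = np.argsort(train_probs)[::-1][:T]
    Z = train_probs[top].sum()
    return (train_probs[top, None] * train_embeddings[top]).sum(axis=0) / Z

def conse_predict(train_probs, train_embeddings, unseen_embeddings, T):
    """Assign the unseen class whose embedding is closest in cosine similarity (assumed)."""
    e = conse_embed(train_probs, train_embeddings, T)
    norms = np.linalg.norm(unseen_embeddings, axis=1) * np.linalg.norm(e)
    return int(np.argmax(unseen_embeddings @ e / norms))

train_probs = np.array([0.6, 0.3, 0.1])
train_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
unseen_embeddings = np.array([[1.0, 0.2], [0.0, 1.0]])
embedding = conse_embed(train_probs, train_embeddings, T=2)
label = conse_predict(train_probs, train_embeddings, unseen_embeddings, T=2)
```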

SYNC learns a mapping between the semantic class embedding space and a model space. In the model space, training classes and a set of phantom classes form a weighted bipartite graph. The objective is to minimize distortion error:

$$\min_{w_{c},v_{r}}\|w_{c}-\sum_{r=1}^{R}s_{cr}v_{r}\|_{2}^{2}$$

Semantic and model spaces are aligned by embedding classifiers of real classes ( $w_{c}$ ) and classifiers of phantom classes ( $v_{r}$ ) in the weighted graph ( $s_{cr}$ ). The classifiers for novel classes are constructed by linearly combining classifiers of phantom classes.

GFZSL proposes a generative framework for zero-shot learning by modeling each class-conditional distribution as a multi-variate Gaussian with mean vector $\mu$ and diagonal covariance matrix $\sigma$. While the parameters of seen classes can be estimated by MLE, those of unseen classes are computed by learning the following two regression functions:

$$\mu_{y}=f_{\mu}(\phi(y))\quad\text{and}\quad\sigma_{y}=f_{\sigma}(\phi(y))$$

Given an image $x$, its class is predicted by searching for the class with the maximum probability, i.e. $\mathop{\rm argmax}_{y}p(x|\sigma_{y},\mu_{y})$.

5 Transductive Zero-Shot Learning Setting

In zero-shot learning, the transductive setting implies that unlabeled images from unseen classes are available during training. Using unlabeled images is expected to improve performance as they possibly contain useful latent information of unseen classes. Here, we mainly focus on two state-of-the-art transductive approaches and show how to extend ALE to the transductive learning setting.

GFZSL-tran uses an Expectation-Maximization (EM) based procedure that alternates between inferring the labels of unlabeled examples of unseen classes and using the inferred labels to update the parameter estimates of unseen class distributions. Since the class-conditional distribution is assumed to be Gaussian, this procedure is equivalent to repeatedly estimating a Gaussian Mixture Model (GMM) with the unlabeled data from unseen classes and using the inferred class labels to re-estimate the GMM.

DSRL proposes to simultaneously learn image features with non-negative matrix factorization and align them with their corresponding class attributes. This step gives us an initial prediction score matrix $S_{0}$ in which each row is one instance and indicates the prediction scores for all unseen classes. To improve the prediction score matrix by transductive learning, a graph-based label propagation algorithm is applied. Specifically, a KNN graph is constructed with the projected instances of unseen classes in the class embedding space,

$$M_{ij}=\begin{cases}\exp\left(-\frac{d(x_{i},x_{j})^{2}}{2\sigma^{2}}\right)&\text{if } i\in\text{KNN}(j)\text{ or } j\in\text{KNN}(i)\\0&\text{otherwise}\end{cases}$$

where KNN($i$) denotes the k-nearest neighbors of the $i$-th instance and $d(x_{i},x_{j})$ measures the Euclidean distance between $x_{i}$ and $x_{j}$. Given the affinity matrix $M$, a normalized Laplacian matrix $L$ can be computed as $L=Q^{-1/2}MQ^{-1/2}$ where $Q$ is a diagonal matrix with $Q_{ii}=\sum_{j}M_{ij}$. Finally, the standard label propagation gives the closed-form solution:

$$S=(I-\alpha L)^{-1}S_{0}$$

where $\alpha\in[0,1]$ is a regularization trade-off parameter and $S$ is the score matrix. The class label of an instance is predicted by searching for the class with the highest score, i.e. $\mathop{\rm argmax}_{y}S_{iy}$.
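The label propagation step can be sketched as follows, using $L=Q^{-1/2}MQ^{-1/2}$ and a closed form of the type $S=(I-\alpha L)^{-1}S_{0}$. The Gaussian affinity with bandwidth `sigma`, the rule that two instances are connected if either is among the other's nearest neighbors, and the requirement $\alpha<1$ for invertibility are assumptions of this sketch; the function name and toy data are hypothetical.

```python
import numpy as np

def propagate_labels(Z, S0, k=1, sigma=1.0, alpha=0.5):
    """Graph-based label propagation over projected unseen-class instances.

    Z:  (n, d) instances projected into the class embedding space
    S0: (n, c) initial prediction scores for the c unseen classes
    Assumes distinct points (so each point is its own nearest neighbor),
    k < n, and alpha < 1.
    """
    n = Z.shape[0]
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    neighbors = np.argsort(dist, axis=1)[:, 1:k + 1]  # skip the point itself
    knn = np.zeros((n, n), dtype=bool)
    knn[np.repeat(np.arange(n), k), neighbors.ravel()] = True
    connected = knn | knn.T                            # i in KNN(j) or j in KNN(i)
    M = np.where(connected, np.exp(-dist ** 2 / (2 * sigma ** 2)), 0.0)
    q = M.sum(axis=1)
    L = M / np.sqrt(np.outer(q, q))                    # Q^{-1/2} M Q^{-1/2}
    return np.linalg.solve(np.eye(n) - alpha * L, S0)

# Two well-separated pairs; instance 1 starts with a slightly wrong prediction.
Z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
S0 = np.array([[0.9, 0.1], [0.45, 0.55], [0.1, 0.9], [0.2, 0.8]])
S = propagate_labels(Z, S0, k=1, sigma=1.0, alpha=0.5)
labels = S.argmax(axis=1)
```

In this toy example the initially misclassified instance inherits the label of its confident neighbor after propagation.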

ALE-tran. Any compatibility learning method that explicitly learns a cross-modal mapping from image feature space to class embedding space can be extended to the transductive setting following the label propagation procedure of DSRL. Taking ALE as an example, after learning the linear mapping $W$, instances of unseen classes can be projected into the class embedding space and a score matrix $S_{0}$ can be computed similarly.

Datasets

Among the most widely used datasets for zero-shot learning, we select two coarse-grained datasets, one small (aPY) and one medium-scale (AWA1), and two fine-grained, medium-scale datasets (SUN, CUB) with attributes, and one large-scale dataset (ImageNet) without. Here, we consider datasets with between $10K$ and $1M$ images and between $100$ and $1K$ classes as medium-scale. Details of dataset statistics in terms of the number of images, classes and attributes for the attribute datasets are in Table I. Furthermore, we introduce our Animals With Attributes 2 (AWA2) dataset and position it with respect to existing datasets.

Attribute Pascal and Yahoo (aPY) is a small-scale coarse-grained dataset with $64$ attributes. Among the total number of $32$ classes, $20$ Pascal classes are used for training (we randomly select $5$ for validation) and $12$ Yahoo classes are used for testing. The original Animals with Attributes (AWA1) is a coarse-grained dataset that is medium-scale in terms of the number of images, i.e. $30,475$, and small-scale in terms of the number of classes, i.e. $50$ classes. introduces a standard zero-shot split with $40$ classes for training (we randomly select $13$ classes for validation) and $10$ classes for testing. AWA1 has $85$ attributes. Caltech-UCSD-Birds 200-2011 (CUB) is a fine-grained and medium-scale dataset with respect to both number of images and number of classes, i.e. $11,788$ images from $200$ different types of birds annotated with $312$ attributes. introduces the first zero-shot split of CUB with $150$ training ($50$ validation classes) and $50$ test classes. SUN is a fine-grained and medium-scale dataset with respect to both number of images and number of classes, i.e. SUN contains $14,340$ images coming from $717$ types of scenes annotated with $102$ attributes. Following we use $645$ classes of SUN for training (we randomly select $65$ classes for validation) and $72$ classes for testing.

Animals with Attributes2 (AWA2) Dataset. One disadvantage of the AWA1 dataset is that the images are not publicly available. As having highly descriptive image features is an important component for zero-shot learning, in order to enable vision research on the objects of the AWA1 dataset, we introduce the Animals with Attributes2 (AWA2) dataset. Following , we collect $37,322$ images for the $50$ classes of the AWA1 dataset from public web sources, i.e. Flickr, Wikipedia, etc., making sure that all images of AWA2 have free-use and redistribution licenses and that they do not overlap with images of the original Animals with Attributes dataset. The AWA2 dataset uses the same $50$ animal classes as the AWA1 dataset, and the $85$ binary and continuous class attributes are likewise shared. In total, AWA2 has $37,322$ images compared to $30,475$ images of AWA1. On average, each class includes $746$ images, where the least populated class, i.e. mole, has $100$ examples and the most populated class, i.e. horse, has $1645$ examples. Some example images from the polar bear, zebra, otter and tiger classes along with sample attributes from our AWA2 dataset are shown in Figure 1.

In Figure 2, we provide some statistics on the AWA2 dataset in comparison with the AWA1 dataset in terms of the number of images and also the distribution of the image features. Compared to AWA1, our proposed AWA2 dataset contains more images for several classes, e.g. horse and dolphin among the test classes, antelope and cow among the training classes. Moreover, the t-SNE embedding of the test classes with more training data, e.g. horse, dolphin, seal etc., shows that AWA2 leads to slightly more visible clusters of ResNet features. The images, their labels and ResNet features of our AWA2 are publicly available at http://cvml.ist.ac.at/AwA2.

2 Large-Scale ImageNet

We also evaluate the performance of methods on the large scale ImageNet which contains a total of $14$ million images from $21$ K classes, each one labeled with one label, and the classes are hierarchically related as ImageNet follows the WordNet .

ImageNet is a natural fit for zero-shot and generalized zero-shot learning as there is a large class imbalance problem. Moreover, ImageNet is diverse in terms of granularity, i.e. it contains a collection of fine-grained datasets, e.g. different vehicle types, as well as coarse-grained datasets. The most populated class contains $3,047$ images whereas there are many classes that contain only a single image. A balanced subset of ImageNet with $1$ K classes containing about $1000$ images each is used to train CNNs.

Previous works proposed to split the balanced subset of $1$ K classes into $800$ training and $200$ test classes. In this work, from the total of $21$ K classes, we use $1$ K classes for training (among which we use $200$ classes for validation) and the test split is either all the remaining 20K classes or a subset of them. We determine these subsets based on the hierarchical distance between classes and the population of classes. The details of these splits are provided in the following section.

Evaluation Protocol

In this section, we provide several components of previously used and our proposed ZSL and GZSL evaluation protocols, e.g. image and class encodings, dataset splits and the evaluation criteria. Our benchmark is available at http://www.mpi-inf.mpg.de/zsl-benchmark.

We extract image features, namely image embeddings, from the entire image for SUN, CUB, AWA1, our AWA2 and ImageNet, with no image pre-processing. For aPY, following the original publication in , we crop the images from bounding boxes. Our image embeddings are $2048$ -dim top-layer pooling units of the $101$ -layered ResNet as we found that it performs better than $1,024$ -dim top-layer pooling units of GoogleNet . We use the original ResNet- $101$ that is pre-trained on ImageNet with $1$ K classes, i.e. the balanced subset, and we do not fine-tune it for any of the mentioned datasets. In addition to the ResNet features, we re-evaluate all methods with their published image features.

In zero-shot learning, class embeddings are as important as image features. As class embeddings, for aPY, AWA1, AWA2, CUB and SUN, we use the per-class attributes with values between $0$ and $1$ that are provided with the datasets, as binary attributes have been shown to be weaker than continuous attributes. For ImageNet, as attributes of $21$ K classes are not available, we use Word2Vec trained on Wikipedia provided by . Note that an evaluation of class embeddings is out of the scope of this paper. We refer the reader to for more details on the topic.

2 Dataset Splits

Zero-shot learning assumes disjoint training and test classes. Hence, as deep neural network (DNN) training for image feature extraction is actually a part of model training, the dataset used to train DNNs, e.g. ImageNet, should not include any of the test classes. However, we notice from the standard splits (SS) of the aPY and AWA1 datasets that 7 aPY test classes out of 12 (monkey, wolf, zebra, mug, building, bag, carriage) and 6 AWA1 test classes out of 10 (chimpanzee, giant panda, leopard, persian cat, pig, hippopotamus) are among the 1K classes of ImageNet, i.e. are used to pre-train ResNet. On the other hand, the most widely used splits, which we term standard splits (SS), for SUN from and CUB from show us that 1 CUB test class out of 50 (Indigo Bunting) and 6 SUN test classes out of 72 (restaurant, supermarket, planetarium, tent, market, bridge) are also among the 1K classes of ImageNet.

We noticed that the accuracy of all methods on those overlapping test classes is higher than on the others. Therefore, we propose new dataset splits, i.e. proposed splits (PS), ensuring that none of the test classes appear in ImageNet 1K, i.e. the classes used to train the ResNet model. We present the differences between the standard splits (SS) and the proposed splits (PS) in Table I. While in SS and PS no image from test classes is present at training time, at test time our PS includes images from training classes. We designed the PS this way as evaluating accuracy on both training and test classes is crucial to show the generalization of the methods.

For SUN, CUB, AWA1, aPY, and our proposed AWA2 dataset, for measuring the significance of the results, we propose 3 different splits of $580$ , $100$ , $27$ , $15$ and $27$ training classes respectively while keeping the $72$ , $50$ , $10$ , $12$ and $10$ test classes the same. It is important to perform hyperparameter search on a disjoint validation set of $65$ , $50$ , $13$ , $5$ and $13$ classes respectively. We keep the number of classes the same for SS and PS; however, we choose different classes while making sure that the test classes do not overlap with the $1$ K training classes of ImageNet. Note that we introduce Proposed Split Version 2.0 (http://www.mpi-inf.mpg.de/zsl-benchmark).

ImageNet provides possibilities of constructing several zero-shot evaluation splits. Following , our first two standard splits consider all the classes that are 2-hops and 3-hops away from the original 1K classes according to the ImageNet label hierarchy, corresponding to $1509$ and $7678$ classes. This split measures the generalization ability of the models with respect to the hierarchical and semantic similarity between classes. As discussed in the previous section, another characteristic of ImageNet is the imbalanced sample size. Therefore, our proposed split considers 500, 1K and 5K most populated classes among the remaining 21K classes of ImageNet with approximately $1756$ , $1624$ and $1335$ images per class on average. Similarly, we consider 500, 1K and 5K least-populated classes in ImageNet which correspond to most fine-grained subsets of ImageNet with approximately $1$ , $3$ and $51$ images per class on average. We measure the generalization of methods to the entire ImageNet data distribution by considering a final split of all the remaining approximately $20$ K classes of ImageNet with at least $1$ image per-class, i.e. approximately $631$ images per class on average.
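The population-based splits described above can be sketched as a sort over per-class image counts. The following is only an illustration under stated assumptions: `populated_splits` is a hypothetical helper, the counts are toy values, and alphabetical tie-breaking is a choice made here for determinism.

```python
def populated_splits(images_per_class, seen_classes, k):
    """Return (k most populated, k least populated) unseen classes with >= 1 image."""
    seen = set(seen_classes)
    candidates = {c: n for c, n in images_per_class.items()
                  if c not in seen and n >= 1}
    # Ties are broken alphabetically so the split is deterministic (an assumption).
    most = sorted(candidates, key=lambda c: (-candidates[c], c))[:k]
    least = sorted(candidates, key=lambda c: (candidates[c], c))[:k]
    return most, least


# Toy counts; class "f" plays the role of a seen (pre-training) class
# and class "d" has no images, so both are excluded.
toy_counts = {"a": 1300, "b": 2, "c": 900, "d": 0, "e": 45, "f": 1500}
most, least = populated_splits(toy_counts, seen_classes=["f"], k=2)
print(most, least)
```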

3 Evaluation Criteria

Single label image classification accuracy has been measured with Top-1 accuracy, i.e. the prediction is accurate when the predicted class is the correct one. If the accuracy is averaged over all images, high performance on densely populated classes is encouraged. However, we are interested in having high performance also on sparsely populated classes. Therefore, we compute the fraction of correct predictions independently for each class and then divide the cumulative sum by the number of classes, i.e. we measure average per-class top-1 accuracy in the following way:

$$acc_{\mathcal{Y}} = \frac{1}{\|\mathcal{Y}\|} \sum_{c=1}^{\|\mathcal{Y}\|} \frac{\#\,\text{correct predictions in } c}{\#\,\text{samples in } c}$$
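The difference between per-image and per-class averaging can be made concrete with a small sketch. The function name and the toy labels below are assumptions chosen for illustration; the computation follows the description above, i.e. accuracy is computed within each class first and then averaged over classes.

```python
from collections import defaultdict


def per_class_top1_accuracy(y_true, y_pred):
    """Average of the per-class top-1 accuracies (each class weighted equally)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy imbalanced example: class "a" has 8 images, class "b" only 2.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]

per_image = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = per_class_top1_accuracy(y_true, y_pred)
print(per_image, per_class)
```

In this toy case the per-image accuracy is dominated by the densely populated class, whereas the per-class average exposes the weaker performance on the sparsely populated one.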

In the generalized zero-shot learning setting, the search space at evaluation time is not restricted to only test classes ( $\mathcal{Y}^{ts}$ ), but includes also the training classes ( $\mathcal{Y}^{tr}$ ), hence this setting is more practical. As our proposed split gives us access to some images from training classes at test time, we first compute the average per-class top-1 accuracy on training and test classes and then compute the harmonic mean of training and test accuracies:

$$H = \frac{2 \cdot acc_{\mathcal{Y}^{tr}} \cdot acc_{\mathcal{Y}^{ts}}}{acc_{\mathcal{Y}^{tr}} + acc_{\mathcal{Y}^{ts}}} \qquad (17)$$

where $acc_{\mathcal{Y}^{tr}}$ and $acc_{\mathcal{Y}^{ts}}$ represent the accuracy on images from seen ( $\mathcal{Y}^{tr}$ ) and unseen ( $\mathcal{Y}^{ts}$ ) classes respectively. We choose the harmonic mean as our evaluation criterion and not the arithmetic mean because with the arithmetic mean, if the seen class accuracy is much higher, it affects the overall results significantly. Instead, our aim is high accuracy on both seen and unseen classes.
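A short sketch illustrates why the harmonic mean is preferred here. The function name and the accuracy values are illustrative assumptions; returning zero when both accuracies are zero is a convention chosen for the example to avoid a division by zero.

```python
def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of seen-class and unseen-class accuracy."""
    if acc_seen + acc_unseen == 0:
        return 0.0  # convention assumed here for the degenerate case
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)


# Hypothetical method that is strong on seen classes but weak on unseen ones.
acc_seen, acc_unseen = 0.8, 0.1
arithmetic = (acc_seen + acc_unseen) / 2
harmonic = harmonic_mean(acc_seen, acc_unseen)
print(arithmetic, harmonic)
```

For these toy values the arithmetic mean is 0.45 while the harmonic mean stays below 0.18, so a method cannot obtain a high score by performing well on seen classes only.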

Experiments

We first provide ZSL results on the attribute datasets SUN, CUB, AWA1, AWA2 and aPY and then on the large-scale ImageNet dataset. Finally, we present results for the GZSL setting.

On attribute datasets, i.e. SUN, CUB, AWA1, AWA2, and aPY, we first reproduce the results of each method using their evaluation protocol, then provide a unified evaluation protocol using the same train/val/test class splits, followed by our proposed train/val/test class splits on SUN, CUB, AWA1, aPY and AWA2. We also evaluate the robustness of the methods to parameter tuning and visualize the ranking of different methods. Finally, we evaluate the methods on the large-scale ImageNet dataset.

Comparing State-of-The-Art Models. As a sanity check, we re-evaluate methods and using publicly available features and code from the original publication on SUN, CUB, AWA1 and aPY (CMT evaluates on the CIFAR dataset). We observe from the results in Table II that our reproduced results of DAP, SYNC , GFZSL , GFZSL-tran , DSRL and SAE are nearly identical to the numbers reported in their original publications. For LATEM , we obtain slightly different results which can be explained by the non-convexity and thus the sensitivity to initialization. Similarly, for SJE random sampling in SGD might lead to slightly different results. ESZSL has some variance because its algorithm randomly picks a validation set during each run, which leads to different hyperparameters. Notable observations on SSE results are as follows. The published code has hard-coded hyperparameters operational on aPY, i.e. number of iterations, number of data points to train SVM, and one regularizer parameter $\gamma$ , which lead to results inferior to the ones reported here; therefore we set these parameters on validation sets. On SUN, SSE uses $10$ classes (instead of $72$ ) and our results with validated parameters show an improvement of $0.5\%$ that may be due to random sampling of training images. On AWA1, our reproduced result of $64.9\%$ is significantly lower than the reported result ( $76.3\%$ ). However, we could not reach the reported result even by tuning parameters on the test set ( $73.8\%$ ).

In addition to , we re-implement based on the original publications. We use train, validation, test splits as provided in Table I and report results in Table III with deep ResNet features. DAP uses hand-crafted image features and thus reproduced results with those features are significantly lower than the results with deep features ( $22.1\%$ vs $38.9\%$ ). When we investigated the results in detail, we noticed two irregularities with reported results on SUN. First, SSE and ESZSL report results on a test split with $10$ classes whereas the standard split of SUN contains $72$ test classes ( $74.5\%$ vs $54.5\%$ with SSE and $64.3\%$ vs $57.3\%$ with ESZSL ). Second, after careful examination and correspondence with the authors of SYNC , we detected that SUN features were extracted with an MIT Places pre-trained model. As the MIT Places dataset intersects with both training and test classes of SUN, it is expected to lead to significantly better results than ImageNet pre-trained models ( $62.8\%$ vs $59.1\%$ ). In addition, while SAE reported $84.7\%$ on AWA1, we obtain only $80.7\%$ on the standard split. This could be explained by two differences. First, we measure per-class accuracy but SAE reports per-image accuracy, which is typically higher when the dataset is class-imbalanced, e.g. AWA1. Indeed, their reported accuracy decreases to $82.0\%$ if per-class accuracy is applied. Second, we confirmed with the authors of SAE that they improved GoogleNet by adding Batch Normalization and averaging 5 randomly cropped images to obtain better image features. Therefore, as expected, improving visual features leads to improved results in zero-shot learning.

Promoting Our Proposed Splits (PS). We propose new dataset splits (see details in section 4) ensuring that test classes of any of the datasets do not overlap with the ImageNet1K used to pre-train ResNet. As training ResNet is a part of the training procedure, including test classes in the dataset used for pre-training ResNet would violate the zero-shot learning conditions. We compare the results obtained with our proposed split (PS) with previously published standard split (SS) results in Table III.

Our first observation is that the results on the PS are significantly lower than on the SS for AWA1 and AWA2. This is expected as most of the test classes of AWA1 and AWA2 in SS overlap with ImageNet 1K. On the other hand, for the fine-grained datasets CUB and SUN, the results are not significantly affected as the overlap in that case was not as significant. Our second observation regarding the method ranking is as follows. On SS, GFZSL is the best performing method on the SUN ( $62.9\%$ ) and aPY ( $51.3\%$ ) datasets whereas SJE performs the best on CUB ( $55.3\%$ ) and SAE performs the best on the AWA1 ( $80.6\%$ ) and AWA2 ( $80.7\%$ ) datasets. On PS, GFZSL performs the best on SUN ( $60.6\%$ ), AWA1 ( $68.2\%$ ) and AWA2 ( $63.8\%$ ), ALE on aPY ( $39.7\%$ ), and SYNC on CUB ( $56.0\%$ ). ALE, SJE and DEVISE all use the max-margin bi-linear compatibility learning framework, which seems to perform better than others.

Evaluating Robustness. We evaluate the robustness of $13$ methods, i.e. , to hyperparameters by setting them on $3$ different validation splits while keeping the test split intact. We report results on SS (Figure 3, top) and PS (Figure 3, bottom) for the SUN, CUB, AWA1, AWA2 and aPY datasets. On SUN and CUB, the results are stable across methods and across dataset splits. This is expected as these datasets both have a balanced number of images across classes and they are fine-grained datasets. Therefore, the validation splits are similar. On the other hand, aPY, being a small and coarse-grained dataset, has several issues. First, many of the test classes of aPY are included in ImageNet1K. Second, it is not well balanced, i.e. different validation class splits contain significantly different numbers of images. Third, the class embeddings are far from each other, i.e. objects are semantically different, therefore different validation splits learn a different mapping between images and classes. On AWA1 and AWA2, on SS, the DEVISE method seems to show the largest variance. This might be due to the fact that the AWA1 and AWA2 datasets are also coarse-grained and test classes overlap with ImageNet training classes. Indeed, AWA2 being slightly more balanced than AWA1, in the proposed split it does not lead to such a high variance for DEVISE.

Visualizing Method Ranking. We first evaluate the $13$ methods using three different validation splits as in the previous experiment. We then rank them based on their per-class top-1 accuracy using the non-parametric Friedman test , which does not assume a distribution on performance but rather uses algorithm ranking. Each entry of the rank matrix on Figure 4 indicates the number of times the method is ranked at the first to thirteenth rank. We then compute the mean rank of each method and order them based on the mean rank across datasets.
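The rank matrix and the mean rank described above can be sketched as follows. This is an illustration only: the method names "A", "B", "C" and their accuracies are toy values, ties are broken by input order (a simplification, since the Friedman procedure assigns average ranks to ties), and the sketch computes only the ranking, not the Friedman test statistic.

```python
def rank_methods(results):
    """results: dict mapping method -> list of accuracies, one per trial.

    Returns (counts, mean_rank), where counts[m][r] is the number of trials
    in which method m was placed at rank r + 1 (rank 1 = highest accuracy).
    """
    methods = list(results)
    n_trials = len(results[methods[0]])
    counts = {m: [0] * len(methods) for m in methods}
    for t in range(n_trials):
        # Sort methods by accuracy in trial t, best first.
        ordered = sorted(methods, key=lambda m: -results[m][t])
        for position, m in enumerate(ordered):
            counts[m][position] += 1
    mean_rank = {
        m: sum((pos + 1) * c for pos, c in enumerate(counts[m])) / n_trials
        for m in methods
    }
    return counts, mean_rank


# Toy accuracies of three hypothetical methods over three trials
# (a trial would be one dataset / validation-split combination).
toy_results = {"A": [0.6, 0.5, 0.7], "B": [0.5, 0.6, 0.6], "C": [0.4, 0.4, 0.5]}
counts, mean_rank = rank_methods(toy_results)
for m in sorted(mean_rank, key=mean_rank.get):
    print(m, counts[m], round(mean_rank[m], 2))
```

Each row of `counts` corresponds to one row of a rank matrix, and sorting by `mean_rank` gives the order in which the methods would be listed.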

Our general observation is that the highest ranked method on both splits is GFZSL, and the second highest ranked method on the standard split (SS) is SYNC, while it drops to the seventh rank on the proposed split (PS). On the other hand, ALE ranks second on the SS and first on the PS. We reinforce our initial observation from numerical results and conclude that GFZSL and ALE seem to be the methods that are the most robust in the zero-shot learning setting for attribute datasets. These results also indicate the importance of choosing zero-shot splits carefully. On the PS, two of the three highest ranked methods are compatibility learning methods, i.e. ALE and DEVISE, whereas the three lowest ranked methods are attribute classifier learning or hybrid methods, i.e. IAP, CMT and CONSE. Therefore, max-margin compatibility learning methods lead to consistently better results in the zero-shot learning task compared to learning independent classifiers. Finally, visualizing the method ranking in this way provides a visually interpretable way of showing how models compare across datasets.

Results on Our Proposed AWA2. We introduce AWA2 which has the same classes and attributes as AWA1, but contains different images each coming with a public copyright license. In order to show that AWA1 and AWA2 images are not the same but similar in nature, we compare the zero-shot learning results on AWA1 and AWA2 in Table III. Under the Standard Splits (SS), SAE is the best performing method on both AWA1 ( $80.6\%$ ) and AWA2 ( $80.7\%$ ). Similarly, for most of the methods, the results on AWA1 are close to those on AWA2, for instance, DAP obtains $57.1\%$ on AWA1 and $58.7\%$ on AWA2, SSE obtains $68.8\%$ on AWA1 and $67.5\%$ on AWA2, etc. The results under the Proposed Splits (PS) are also consistent across AWA1 and AWA2. For $8$ out of $12$ methods, the performance difference between AWA1 and AWA2 is within $2\%$ . On the other hand, the same consistency is not observed for DEVISE , SJE and SYNC . For instance, SJE obtains $65.6\%$ on AWA1 and $61.9\%$ on AWA2. After careful examination, we noticed that SJE selects different hyperparameters for AWA1 and AWA2, which results in different performance on those two datasets. In our opinion, this does not indicate a possible dataset artifact, but rather shows that zero-shot learning is sensitive to parameter setting.

Commonly, a model is trained and evaluated on the same dataset. Cross-dataset experiments are not easy as different datasets do not share the same attributes. However, AWA1 and AWA2 share both classes and attributes. In order to verify that AWA2 is a good replacement for AWA1, we conduct a cross-dataset evaluation for 12 methods, i.e. . In particular, with our Proposed Splits (PS), we train one model on the training set of AWA1 and evaluate it on the test set of AWA2 in the zero-shot learning setting, and vice versa. From Table IV, we observe that all the models trained on AWA1 generalize well to AWA2 and vice versa.

In addition, we notice that the cross-dataset result is dependent on the training set. For instance, for all the methods, if we fix the training set to be from AWA1, the results on the test sets of AWA1 and AWA2 are close. To verify this hypothesis, we performed a paired t-test which determines if the mean difference between paired results is significantly different from zero. To that end, we take the 24 pairs of results whose test sets are the same, i.e. the results obtained with 12 methods on AWA1:AWA2 and AWA2:AWA2 (2nd and 3rd column) as well as the results obtained with 12 methods on AWA1:AWA1 and AWA2:AWA1 (1st and 4th column). The paired t-test rejects the null hypothesis with p-value $=0.007$ , indicating that the results are significantly different if the test set is the same but the training set is different. In conclusion, the training set is an important indicator of the final result and the two datasets, i.e. AWA1 and AWA2, are sufficiently similar. Therefore, our cross-dataset experimental results indicate that AWA2 is a good replacement for AWA1.
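The paired t-test used above operates on the per-pair differences. The sketch below computes only the t statistic and the degrees of freedom in plain Python; the function name and the four toy pairs are assumptions for illustration and are not taken from Table IV. Turning the statistic into a p-value requires the Student t distribution, which is not implemented here.

```python
import math


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1


# Toy accuracies (in %) of four hypothetical methods evaluated on the same
# test set after training on two different training sets.
trained_on_first = [63, 58, 66, 60]
trained_on_second = [60, 57, 62, 58]
t, dof = paired_t_statistic(trained_on_first, trained_on_second)
print(round(t, 3), dof)
```

If SciPy is available, `scipy.stats.ttest_rel(a, b)` returns both the statistic and the two-sided p-value for the same test.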

Zero-Shot Learning Results on ImageNet. ImageNet scales the methods to a truly large-scale setting, thus these experiments provide further insights on how to tackle the zero-shot learning problem from the practical point of view. Here, we evaluate $10$ methods, i.e. . We exclude DAP and IAP as attributes are not available for all ImageNet classes, as well as SSE due to scalability issues of the public implementation of the method. Table V shows that the best performing method is SYNC, which may either indicate that it performs well in the large-scale setting or that it can learn under uncertainty due to the usage of Word2Vec instead of attributes. Another possibility is that Word2Vec may be tuned for SYNC as it is provided by the same authors. However, we refrain from making a strong claim as this would require a full evaluation of class embeddings, which is out of the scope of this paper. On the other hand, GFZSL, which is the best performing model for attribute datasets, performs poorly on ImageNet, which may indicate that generative models require a strong class embedding space such as attributes to perform well on the ZSL task. Note that due to computational issues, we were not able to obtain results for GFZSL for 3H, M5K, L5K and All 20K classes.

More detailed observations are as follows. The second highest performing method is ESZSL, which is one of the linear embedding models that have an explicit regularization mechanism, which seems to be more effective than early stopping as an implicit regularizer. A general observation from the results of all the methods is that on the most populated classes the results are higher than on the least populated classes, which indicates that zero-shot learning on fine-grained ImageNet subsets is a more difficult task. Moreover, we conclude that the nature of the test set, e.g. the type of the classes being tested, is more important than the number of classes. Therefore, the selection of the test set is an important aspect of zero-shot learning on large-scale datasets. Furthermore, for all methods we consistently observe a large drop in accuracy between the 1K and 5K most populated classes, which is expected as 5K contains $\approx 6.6$ M images, making the problem much more difficult than 1K ( $\approx 1624$ images per class, i.e. $\approx 1.6$ M images). It is worth noting that measuring per-image accuracy in this case would lead to higher results if the labels of the highly populated class samples are predicted correctly. Finally, on the largest test set, i.e. All 20K, the results are poor for all methods, which indicates the difficulty of this problem where there is large room for improvement.

Several models in the literature evaluate Top-5 and Top-10 as well as Top-1 accuracy on ImageNet. Top-5 and Top-10 accuracy in this case is reasonable as an image usually contains multiple objects; however, by construction it is associated with a single label in ImageNet. Hence, we provide a comparison of the same $9$ models according to all these three criteria in Figure 5. We observe that SYNC performs significantly better than other methods when the number of images is higher, e.g. 2H, M500, M1K, whereas the gap reduces when the number of images and the number of classes increase, e.g. 3H, L5K and All. In fact, for All, all the methods perform similarly and poorly, which indicates that there is large room for improvement in this task. This observation carries over to all three accuracy measures. For Top-5 (middle) and Top-10 (right) accuracy, although the numbers are as expected in general higher, the winning model remains SYNC, significantly for 2H, M500 and M1K, whereas the difference is smaller with 3H, L5H, L1K. On the other hand, all methods perform similarly when all 20K classes are tested.
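Top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. The sketch below is a minimal illustration that assumes the same per-class averaging as for Top-1; the function name, the class names and the scores are toy assumptions.

```python
def per_class_topk_accuracy(scores, labels, k):
    """scores: list of dicts (class -> score), one per image; labels: true classes."""
    hits, totals = {}, {}
    for s, y in zip(scores, labels):
        # The k classes with the highest score for this image.
        topk = sorted(s, key=s.get, reverse=True)[:k]
        totals[y] = totals.get(y, 0) + 1
        hits[y] = hits.get(y, 0) + int(y in topk)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy scores for three images, one per class.
toy_scores = [
    {"cat": 0.5, "dog": 0.3, "fox": 0.2},  # true class: cat (rank 1)
    {"cat": 0.6, "dog": 0.3, "fox": 0.1},  # true class: dog (rank 2)
    {"cat": 0.5, "dog": 0.4, "fox": 0.1},  # true class: fox (rank 3)
]
toy_labels = ["cat", "dog", "fox"]
for k in (1, 2, 3):
    print(k, per_class_topk_accuracy(toy_scores, toy_labels, k))
```

As k grows the measure can only increase, which is why Top-5 and Top-10 numbers are in general higher than Top-1.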

2 Generalized Zero-Shot Learning Results

In real world applications, image classification systems do not know in advance whether a novel image belongs to a seen or an unseen class. Hence, generalized zero-shot learning is interesting from a practical point of view. Here, we use the same models trained in the ZSL setting on our proposed splits (PS). We evaluate performance on both $\mathcal{Y}^{tr}$ and $\mathcal{Y}^{ts}$ (using held-out images).

As shown in Table VI, generalized zero-shot learning results are significantly lower than zero-shot learning results. This is due to the fact that training classes are included in the search space and act as distractors for the images that come from test classes. An interesting observation is that compatibility learning frameworks, e.g. ALE, DEVISE, SJE, perform well on test classes. However, methods that learn independent attribute or object classifiers, e.g. DAP and CONSE, perform well on training classes. Due to this discrepancy, we evaluate the harmonic mean which takes a weighted average of training and test class accuracy as shown in Equation 17. The harmonic mean measure ranks ALE as the best performing method on the SUN, CUB and AWA1 datasets, whereas on our AWA2 dataset DEVISE performs the best and on the aPY dataset CMT* performs the best. Note that CMT* has an integrated novelty detection phase for which the method receives another supervision signal determining if the image belongs to a training or a test class. Similar to the ImageNet results, GFZSL performs poorly in the GZSL setting.
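The distractor effect can be illustrated with a toy compatibility model. The sketch assumes a bilinear compatibility score of the form $x^\top W \phi(y)$ between an image feature $x$ and a class embedding $\phi(y)$; the matrix `W`, the two-dimensional embeddings and the class names are all hypothetical values chosen so that the effect is visible, not learned parameters.

```python
def compatibility(x, W, phi):
    """Bilinear score x^T W phi (assumed form of the compatibility function)."""
    return sum(x[i] * W[i][j] * phi[j]
               for i in range(len(x)) for j in range(len(phi)))


def predict(x, W, class_embeddings, search_space):
    """Return the class in search_space with the highest compatibility score."""
    return max(search_space, key=lambda c: compatibility(x, W, class_embeddings[c]))


# Hypothetical 2-D class embeddings: "horse" is a seen class, the others unseen.
class_embeddings = {"horse": [1.0, 0.0], "zebra": [0.6, 0.8], "whale": [0.0, 1.0]}
seen, unseen = ["horse"], ["zebra", "whale"]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, standing in for a learned matrix

x_zebra = [0.9, 0.3]  # toy feature of a zebra image, biased toward the seen class
print(predict(x_zebra, W, class_embeddings, unseen))         # ZSL search space
print(predict(x_zebra, W, class_embeddings, seen + unseen))  # GZSL search space
```

With the search space restricted to unseen classes the toy image is labeled correctly, whereas adding the seen class to the search space lets it win the argmax. This is the mechanism by which training classes act as distractors in the generalized setting.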

As for the generalized zero-shot learning setting on ImageNet, we report results measured on unseen classes in Figure 6, as no images are reserved from seen classes. Our first observation is that there is no winning model in all cases; the results diverge for different splits and different accuracy measures. For instance, when the performance is measured with Top-1 accuracy, in general the best performing models seem to be DEVISE, ALE and SJE, which are all linear compatibility learning models. On the other hand, for Top-5 accuracy different models take the lead in different splits, e.g. CONSE works the best for 3H and M5K, indicating that it performs better when the number of images that come from unseen classes is larger, whereas SJE and ESZSL work better for the 2H, M500 and L5H settings. Finally, for Top-10 accuracy, the best performing model overall is ESZSL, which is the model that learns a linear compatibility with an explicit regularization scheme. For Top-1, Top-5 and Top-10 results we observe the same trend when all the unseen classes are included in the test set, i.e. the models perform similarly; however, CONSE slightly stands out in the Top-5 and Top-10 accuracy plots.

In summary, the generalized zero-shot learning setting provides one more level of detail on the performance of zero-shot learning methods. Our take-home message is that the accuracy on training classes is as important as the accuracy on test classes in real world scenarios. Therefore, methods should be designed in a way that they are able to predict labels well for both training and test classes.

Visualizing Method Ranking. Similar to the analysis in the previous section that was conducted for zero-shot learning setting, we rank the $13$ methods, i.e. , based on their results obtained on SUN, CUB, AWA1, AWA2 and aPY. The performance is measured on seen classes, unseen classes and the Harmonic mean of the two.

The rank matrix of test classes, i.e. Figure 7 top left, shows that the highest ranked methods are ALE, DEVISE and SJE, although overall the absolute accuracy numbers are lower (Table VI). Note that in Figure 4 GFZSL ranked highest, which shows that GFZSL is not as strong for the GZSL task. The rank matrix of the harmonic mean shows the same trend. However, the rank matrix of training classes, i.e. Figure 7 top right, shows that models that learn intermediate attribute classifiers perform well for the images that come from training classes. However, these models typically do not lead to a high accuracy for the images that belong to unseen classes, as shown in Table VI. This eventually makes the harmonic mean, i.e. the overall accuracy on both training and test classes, lower. These results clearly suggest that one should not only optimize for test class accuracy but also for training class accuracy while evaluating generalized zero-shot learning.

Our final observation from Figure 7 is that CMT* is better than CMT in all cases which supports the argument that a simple novelty detection scheme helps to improve results. However, it is important to note that the proposed novelty detection mechanism uses more supervision than classic zero-shot learning models. Although the label of test classes is not used, whether the sample comes from a seen or unseen class is an additional supervision.

3 Transductive (Generalized) Zero-Shot Learning

In contrast to previous zero-shot learning approaches that learn only with data from training classes, transductive approaches use unlabeled images from test classes. In this section, we evaluate three state-of-the-art transductive ZSL approaches, i.e. DSRL , GFZSL-tran , and ALE-tran . Similar to the previous section, we evaluate those approaches on our proposed splits in both zero-shot learning, where the test time search space is composed of only unseen classes, and generalized zero-shot learning, where it contains both seen and unseen classes. The performance is measured as per-class averaged top-1 accuracy.

Our transductive learning results are presented in Figure 8. We observe that in the ZSL setting, transductive learning leads to an accuracy improvement, e.g. ALE-tran and GFZSL-tran outperform ALE and GFZSL respectively in almost all cases. In particular, on AWA2, GFZSL-tran achieves $78.6\%$ , significantly improving over GFZSL ( $63.8\%$ ). On aPY, ALE-tran obtains $45.5\%$ and significantly improves over ALE ( $37.1\%$ ) as well. Moreover, GFZSL-tran outperforms ALE-tran and DSRL on SUN, AWA1 and AWA2. However, ALE-tran performs the best on CUB and aPY. In the GZSL setting we observe a different trend, i.e. transductive learning does not improve results for ALE on any of the datasets. Although on AWA1 and AWA2 GFZSL results improve significantly in the transductive learning setting, on the other datasets the GFZSL model performs poorly both in the inductive and in the transductive setting.

Conclusion

In this work, we evaluated a significant number of state-of-the-art zero-shot learning methods, i.e. , on several datasets, i.e. SUN, CUB, AWA1, AWA2, aPY and ImageNet, within a unified evaluation protocol both in zero-shot and generalized zero-shot settings.

Our evaluation showed that generative models and compatibility learning frameworks have an edge over learning independent object or attribute classifiers and also over other hybrid models for the classic zero-shot learning setting. We observed that unlabeled data of unseen classes can further improve the zero-shot learning results, thus it is not fair to compare transductive learning approaches with inductive ones. We discovered that some standard zero-shot dataset splits may treat feature learning as disjoint from the training stage, as several test classes are included in the ImageNet1K dataset that is used to train the deep neural networks that act as feature extractors. Therefore, we proposed new dataset splits making sure that none of the test classes in any of the datasets belong to ImageNet1K. Moreover, a disjoint training and validation class split is a necessary component of parameter tuning in the zero-shot learning setting.

In addition, we introduced a new Animals with Attributes (AWA2) dataset. AWA2 inherits the same 50 classes and attribute annotations from the original Animals with Attributes (AWA1) dataset, but consists of $37,322$ different images with a publicly available redistribution license. Our experimental results showed that the 12 methods that we evaluated perform similarly on AWA2 and AWA1. Moreover, our statistical consistency test indicated that AWA1 and AWA2 are compatible with each other.

Finally, including training classes in the search space while evaluating the methods, i.e. generalized zero-shot learning, provides an interesting playground for future research. Although the generalized zero-shot learning accuracy obtained with the 13 models is significantly lower than their zero-shot learning accuracy, the relative performance comparison of different models remains the same. Having noticed that some models perform well when the test set is composed only of seen classes, while some others perform well when the test set is composed only of unseen classes, we proposed the harmonic mean of seen and unseen class accuracy as a unified measure of performance in the GZSL setting. The harmonic mean encourages the models to perform well on both seen and unseen class samples, which is closer to a real world setting. In summary, our work extensively evaluated the good and bad aspects of zero-shot learning while sanitizing the ugly ones.