From Zero-shot Learning to Conventional Supervised Classification: Unseen Visual Data Synthesis

Yang Long, Li Liu, Ling Shao, Fumin Shen, Guiguang Ding, Jungong Han

Introduction

Object recognition is arguably one of the most fundamental tasks in the field of computer vision. Most conventional frameworks, e.g. Deep Neural Networks (DNN), rely on a large number of training samples to build statistical models. However, such a premise is unattainable in many real-world situations. The main reasons can be summarised as follows: 1) Obtaining well-annotated training samples is expensive. Although abundant digital images and videos can be retrieved from the Internet, existing search engines crucially depend on user-defined keywords that are often vague and not suitable for learning tasks. 2) The number of newly defined classes is ever-growing. Meanwhile, fine-grained tasks make existing categories go deeper, e.g. recognising a newly released handbag with a novel pattern. Training a particular model for each of them is infeasible. 3) Collecting instances for rare classes is difficult. For example, one might wish to detect an ancient or rare species automatically. It could be difficult to provide even a single example of such a species, since the available knowledge may consist only of textual descriptions or some distinctive attributes.

As a feasible solution, Zero-shot Learning (ZSL) aims to leverage a closed set of semantic models that can generalise to unseen classes. The common paradigm of ZSL methods first trains a prediction model that maps visual data to a semantic representation. Thereafter, new objects can be recognised from their semantic descriptions alone. However, existing methods cannot expand the training data for new unseen classes. As illustrated in Fig. 2, such frameworks prevent existing methods from scaling up, since the fixed seen data are eventually too limited to represent the ever-growing semantic concepts.

In this paper, we investigate how to synthesise high-quality visual features from semantic attributes so that the ZSL problem can be converted into conventional supervised classification. Our idea is inspired by the human ability to imagine, as shown in Fig. 1. Given a semantic description, we humans can associate familiar visual elements and then imagine an approximate scene. Accordingly, we synthesise discriminative low-level features from semantic attributes as a substitute for feature extraction from real images. Our contributions can be summarised as follows:

1) We provide a feasible framework to synthesise unseen visual features from given semantic attributes without acquiring real images. The synthesised data obtained at the training stage can be fed directly to conventional classifiers, so that ZSL recognition is converted into a conventional supervised problem. This leads to state-of-the-art recognition performance on four benchmark datasets.

2) We introduce the variance decay problem that arises during semantic-visual embedding and propose a novel Diffusion Regularisation that explicitly makes information diffuse to each dimension of the synthesised data. We achieve information diffusion by optimising an orthogonal rotation problem. We provide an efficient optimisation strategy that solves this problem together with the structural difference and training bias problems.

Related Work

Zero-shot Recognition Schemes: We summarise previous ZSL schemes in Fig. 2, in contrast to conventional supervised classification (Fig. 2(A)). Since collecting well-labelled visual data for novel classes is expensive, as shown in Fig. 2(B), zero-shot learning techniques are proposed to recognise novel classes without acquiring the visual data. Most of the early works are based on the Direct-Attribute Prediction (DAP) model. Such a model utilises semantic attributes as intermediate clues. A test sample is classified by each attribute classifier in turn, and the class label is predicted by probabilistic estimation. Despite the merits of DAP, there are some concerns about its deficiencies. It has been pointed out that the attributes may correlate with each other, resulting in significant information redundancy and poor performance. The human labelling involved in attribute annotation may also be unreliable.

To circumvent learning independent attributes, embedding-based ZSL frameworks (Fig. 2(C)) are proposed to learn a projection that maps the visual features to all of the attributes at once. The class label is then inferred in the semantic space using various measurements. Since the attribute annotations are expensive to acquire, attributes are substituted by visual similarity and data distribution information in transductive ZSL settings. However, these methods involve the data of unseen classes when learning the model, which to some extent breaches the strict ZSL settings. Recent work combines the embedding-inferring procedure into a unified framework and empirically demonstrates better performance. The most closely related work goes one step further and synthesises classifiers or prototypes for unseen classes.

Our method takes advantage of semantic embedding. However, the inference direction is different from existing work. Rather than mapping visual data to the label space, our method inversely synthesises visual feature vectors, as many as there are available semantic instances.

Semantic Side Information: ZSL tasks require side information as intermediate clues. Such frameworks not only broaden the classification settings but also enable various kinds of information to aid visual systems. Since textual sources are relatively easy to obtain from the Internet, it has been proposed to estimate the semantic relatedness of novel classes from text. Pseudo-concepts have also been learnt from Wikipedia articles to associate novel classes. Recently, lexical hierarchies from ontology engineering have also been exploited to find the relationships between classes.

Although various kinds of side information have been studied, attribute-based ZSL methods remain the most popular. One reason is that ZSL by learning attributes often gives prominent classification performance. Another reason is that attribute representation is a compact way to describe an image further, using concrete words that are human-understandable. Various types of attributes are proposed to enrich applicable tasks and improve the performance, such as relative attributes, class-similarity attributes, and augmented attributes. The motivation of this paper is not only to improve ZSL performance but also to seek a reliable solution for synthesising high-quality visual features.

Approach

Unseen Visual Data Synthesis: We aim to synthesise the visual features of unseen classes from the given semantic attributes. Specifically, we learn an embedding function on the training set, $f^{\prime}:\mathcal{A}_{s}\rightarrow\mathcal{X}_{s}$. After that, we are able to infer $\mathcal{X}_{u}$ through $\mathcal{X}_{u}=f^{\prime}(\mathcal{A}_{u})$.

Zero-shot Recognition: Using the synthesised visual features, the ZSL recognition is converted to a typical classification problem. It is straightforward to employ conventional supervised classifiers, e.g. SVM, to predict the labels of unseen classes: $f_{\text{SVM}}:\mathcal{X}_{u}\rightarrow\mathcal{Y}_{u}$.

To synthesise visual features, the most intuitive framework is to learn a mapping function from the semantic space to the visual feature space:

$$\min_{P}\ \mathcal{L}(\mathcal{A}_{s}P,\mathcal{X}_{s})+\lambda\Omega(P),\qquad(1)$$

where $P$ is the projection matrix, $\mathcal{L}$ is a loss function, and $\Omega$ is a regularisation term with its hyper-parameter $\lambda$. It is common to choose $\Omega(P)=\|P\|_{F}^{2}$, where $\|\cdot\|_{F}$ is the Frobenius norm of a matrix, i.e. the square root of the sum of its squared entries. Before the test, we can synthesise unseen visual features from the given attributes of the unseen instances:

$$\mathcal{X}_{u}=\mathcal{A}_{u}P.\qquad(2)$$
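As a minimal sketch of this baseline, assume a squared-error loss for $\mathcal{L}$ (the text leaves the loss generic), $\Omega(P)=\|P\|_{F}^{2}$, and the row-wise convention $\mathcal{V}=\mathcal{A}P$ used later in the text. Under these assumptions the projection has the familiar ridge-regression closed form. The function names and toy shapes are illustrative only.

```python
import numpy as np

def fit_projection(A_s, X_s, lam=1.0):
    """Closed-form minimiser of ||A_s P - X_s||_F^2 + lam * ||P||_F^2 (assumed loss)."""
    M = A_s.shape[1]
    return np.linalg.solve(A_s.T @ A_s + lam * np.eye(M), A_s.T @ X_s)

def synthesise(A_u, P):
    """Map unseen attributes (one row per instance) to synthetic visual features."""
    return A_u @ P

# Toy data: 200 seen instances, M = 8 attributes, D = 32 feature dimensions.
rng = np.random.default_rng(0)
A_s = rng.normal(size=(200, 8))
P_true = rng.normal(size=(8, 32))
X_s = A_s @ P_true + 0.01 * rng.normal(size=(200, 32))

P = fit_projection(A_s, X_s, lam=1e-3)
A_u = rng.normal(size=(5, 8))        # attributes of 5 unseen instances
X_u = synthesise(A_u, P)             # one synthetic feature vector per row
```

With nearly noise-free toy data the recovered `P` is close to `P_true`; on real data this unregularised-by-structure baseline is exactly the case the following regularisers are meant to improve.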

In the embedding space $\mathcal{V}$, we expect that, if $g_{i}$ and $g_{j}$ are connected in both graphs, the embedded points $v_{i}$ and $v_{j}$ are also close to each other. However, $W_{\mathcal{X}}$ and $W_{\mathcal{A}}$ are not always consistent due to the visual-semantic gap. To reconcile such conflicts, we compute the mean of the visual and attribute graphs, i.e. $W=\frac{1}{2}(W_{\mathcal{X}}+W_{\mathcal{A}})$. The resulting regularisation is:

where $\mathbf{D}$ is the degree matrix of $W$, with $\mathbf{D}_{ii}=\sum_{j}w_{ij}$, $L=\mathbf{D}-W$ is known as the graph Laplacian matrix, and $Tr(\cdot)$ computes the trace of a matrix.
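The quantities above can be sketched numerically as follows. The sketch assumes binary, symmetrised $k$-nn weights (the weighting scheme is not specified in this section) and checks the standard identity $\frac{1}{2}\sum_{i,j}w_{ij}\|v_{i}-v_{j}\|^{2}=Tr(\mathcal{V}^{\top}L\mathcal{V})$, which holds for any symmetric $W$. Function names are illustrative.

```python
import numpy as np

def knn_graph(Z, k):
    """Symmetric binary k-nn adjacency matrix (an assumed weighting scheme)."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                     # a point is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :k]              # k nearest neighbours per row
    W = np.zeros((Z.shape[0], Z.shape[0]))
    rows = np.repeat(np.arange(Z.shape[0]), k)
    W[rows, idx.ravel()] = 1.0
    return np.maximum(W, W.T)                        # symmetrise

def laplacian(W):
    """Graph Laplacian L = D - W with D_ii = sum_j w_ij."""
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 16))    # toy visual features
A = rng.normal(size=(30, 4))     # toy attributes
W = 0.5 * (knn_graph(X, 5) + knn_graph(A, 5))   # mean of the two graphs
L = laplacian(W)

V = rng.normal(size=(30, 16))    # any embedding with one row per instance
smooth_trace = np.trace(V.T @ L @ V)
diff = V[:, None, :] - V[None, :, :]
smooth_pairs = 0.5 * np.sum(W * np.sum(diff ** 2, axis=2))
```

Both `smooth_trace` and `smooth_pairs` measure how much connected points differ in the embedding, so a small value means the embedding respects the averaged graph.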

Diffusion Regularisation: In this paper, we identify another fundamental problem: variance decay. When we learn visual features from the attributes, in particular when projecting $\mathcal{A}$ to $\mathcal{V}$ using $P$, the dimension difference $D\gg M$ leads the learning algorithm to progressively pick directions with low variance. As shown in Fig. 3, most of the information (variance) is contained in a few projections. As a result, the remaining dimensions of the synthesised data suffer a dramatic variance decay, which indicates that the learnt representation is severely redundant. To address the problem, we expect that the concentrated information can effectively diffuse to all of the learnt dimensions through an adjustment rotation. Therefore, we modify the rotation matrix $Q$ in Eq. (3). In this paper, we consider an orthogonal rotation, i.e. $QQ^{\top}=I$ (where $I$ is an identity matrix), since it is easy to show that $Tr(Q^{\top}P^{\top}\mathcal{A}^{\top}\mathcal{A}PQ)=Tr(P^{\top}\mathcal{A}^{\top}\mathcal{A}P)$. It has been reported that such an orthogonal rotation preserves the properties captured in the semantic space. Next, we show how the rotation can control variance diffusion.

From Eq. (3), the optimal synthesised data is $\mathcal{X}=\mathcal{V}Q$, where $\mathcal{V}=\mathcal{A}P$. We first prove that the overall variance does not change after rotation. Before rotation, $\mathcal{V}$ is centralised, i.e. $\sum_{n=1}^{N}v_{n}=\textbf{0}$. The original overall variance $\Gamma$ of $\mathcal{V}$ is $\Gamma=N\sum_{d=1}^{D}\sigma_{d}$, where $\sigma_{d}=(\sum_{n=1}^{N}v^{2}_{nd})/N$ denotes the variance of the $d$-th dimension. After the rotation $Q$, each dimension has a new variance $\sigma_{d}^{\prime}$, and the overall variance becomes $\Gamma^{\prime}=N\sum_{d=1}^{D}\sigma_{d}^{\prime}$. We show $\Gamma=\Gamma^{\prime}$ in the following:
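One way to verify this claim from the definitions above is sketched below. It uses only $QQ^{\top}=I$, the cyclic property of the trace, and the fact that the rotated data stay centralised because $\sum_{n}v_{n}Q=\textbf{0}Q=\textbf{0}$. This is a reconstruction from the stated definitions and may differ in form from the original display.

```latex
\begin{align*}
\Gamma' &= N\sum_{d=1}^{D}\sigma'_{d}
         = \sum_{n=1}^{N}\sum_{d=1}^{D}\big(v_{n}Q\big)_{d}^{2}
         = Tr\big(Q^{\top}\mathcal{V}^{\top}\mathcal{V}Q\big) \\
        &= Tr\big(\mathcal{V}^{\top}\mathcal{V}QQ^{\top}\big)
         = Tr\big(\mathcal{V}^{\top}\mathcal{V}\big)
         = \sum_{n=1}^{N}\sum_{d=1}^{D}v_{nd}^{2}
         = N\sum_{d=1}^{D}\sigma_{d}
         = \Gamma .
\end{align*}
```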

We want the overall variance $\Gamma$ to diffuse equally to all of the learnt dimensions in order to recover the real data distribution of $\mathcal{X}$. In other words, the variance $\Pi$ of the diffused standard deviations in the synthesised data should be small, where $\Pi=\frac{1}{D}\sum_{d=1}^{D}(\pi_{d}-\bar{\pi})^{2}$, $\pi_{d}=\sqrt{\sigma^{\prime}_{d}}$, and $\bar{\pi}$ is the mean of all standard deviations. According to Eq. (6) above, we have $\sum_{d=1}^{D}\pi_{d}^{2}=\sum_{d=1}^{D}{\sigma^{\prime}_{d}}=\sum_{d=1}^{D}\sigma_{d}=\epsilon$. Next, we show how to minimise $\Pi$ in our learning framework to find the orthogonal rotation:
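Using only the definitions above, with $\bar{\pi}=\frac{1}{D}\sum_{d}\pi_{d}$ and $\sum_{d}\pi_{d}^{2}=\epsilon$, the expansion can be sketched as follows. This is a reconstruction from the stated definitions, so the original display may be arranged differently.

```latex
\begin{align*}
\Pi &= \frac{1}{D}\sum_{d=1}^{D}(\pi_{d}-\bar{\pi})^{2}
     = \frac{1}{D}\sum_{d=1}^{D}\pi_{d}^{2}-\bar{\pi}^{2}
     = \frac{\epsilon}{D}-\frac{1}{D^{2}}\Big(\sum_{d=1}^{D}\pi_{d}\Big)^{2},
\end{align*}
```

where $\epsilon$ and $D$ are fixed and every $\pi_{d}\geq 0$.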

The above equation shows that minimising $\Pi$ is equivalent to maximising the sum of the diffused standard deviations. Such a deduction is intuitive because our goal is a higher overall sum of standard deviations, so that the synthesised data can carry more information. Moreover, we discover a novel relationship between the sum of the diffused standard deviations and the orthogonal rotation:

Optimisation Strategy

The problem raised in Eq. (9) is a non-convex optimisation problem. To the best of our knowledge, there is no direct way to find its optimal solution. Similar to [kodirov2015unsupervised], we propose an iterative scheme that uses alternating optimisation to obtain a locally optimal solution. Specifically, we initialise $Q=I$ and $\mathcal{V}=\mathcal{X}_{s}$. The initialisation of $P$ can be obtained via $P=(\mathcal{A}_{s}^{\top}\mathcal{A}_{s})^{-1}\mathcal{A}_{s}^{\top}\mathcal{V}$. The whole alternating procedure of the proposed UVDS is listed as follows.

1. $\mathcal{V}$-step: By fixing $P$ and $Q$, we can reduce Eq. (9) to the following sub-problem:

where the extra term $\gamma\|\textbf{1}\mathcal{V}\|_{2}^{2}$ constrains the learnt $\mathcal{V}$ to be centralised according to Eq. (6). The minimising $\mathcal{V}$ can be obtained by setting the partial derivative of the above sub-problem with respect to $\mathcal{V}$ to zero, which gives

which is a typical Sylvester equation, so that $\mathcal{V}$ can be solved efficiently with the lyap() function in MATLAB. Afterwards, the learnt $\mathcal{V}$ needs to be further centralised, $v_{n}\leftarrow v_{n}-(\sum_{n=1}^{N}v_{n})/N$, to satisfy Eq. (6).
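The following sketch illustrates how such a step can be carried out outside MATLAB. Because the exact coefficient matrices of the sub-problem are not reproduced here, it uses a stand-in objective, $\|\mathcal{V}B-X\|_{F}^{2}+\beta\,Tr(\mathcal{V}^{\top}L\mathcal{V})$ with a hypothetical matrix $B$, whose stationarity condition $\beta L\mathcal{V}+\mathcal{V}BB^{\top}=XB^{\top}$ is a Sylvester equation with symmetric coefficients. The solver below relies on that symmetry; for general coefficients, `scipy.linalg.solve_sylvester` plays the role of lyap().

```python
import numpy as np

def solve_sylvester_sym(S_left, S_right, C):
    """Solve S_left @ V + V @ S_right = C for symmetric S_left and S_right."""
    lam, U = np.linalg.eigh(S_left)
    mu, W = np.linalg.eigh(S_right)
    C_t = U.T @ C @ W                           # right-hand side in the eigenbases
    V_t = C_t / (lam[:, None] + mu[None, :])    # decoupled scalar equations
    return U @ V_t @ W.T

def centre_rows(V):
    """Subtract the mean instance so that the rows of V sum to zero."""
    return V - V.mean(axis=0, keepdims=True)

# Toy stand-in problem (B, X and the graph are illustrative, not from the paper).
rng = np.random.default_rng(0)
N, D, beta = 20, 6, 0.5
W_graph = rng.random((N, N))
W_graph = 0.5 * (W_graph + W_graph.T)
np.fill_diagonal(W_graph, 0.0)
L = np.diag(W_graph.sum(axis=1)) - W_graph
B = np.eye(D) + 0.1 * rng.normal(size=(D, D))
X = rng.normal(size=(N, D))

V = solve_sylvester_sym(beta * L, B @ B.T, X @ B.T)
grad = 2.0 * (V @ B - X) @ B.T + 2.0 * beta * L @ V   # gradient of the stand-in objective
V_c = centre_rows(V)
```

The gradient vanishes at the returned `V`, confirming that solving the Sylvester equation is the same as setting the derivative to zero, and `V_c` is the centralised result.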

2. $Q$-step: By fixing $P$ and $\mathcal{V}$, we can reduce Eq. (9) to the following sub-problem:

Since we need to solve for $Q$ under the orthogonality constraint in Eq. (13), we adopt the gradient flow, an iterative scheme for optimising generic problems with orthogonality constraints in which every iterate remains feasible. Specifically, given the orthogonal rotation $Q_{t}$ at the $t$-th iteration, a better solution $Q_{t+1}$ is obtained via the Cayley transformation:

where $H_{t}$ is the Cayley transformation matrix and defined as

where the diagonal matrix $E$ is defined the same as that in Eq. (11). For the $Q$-step, we repeat the above update of $Q$ until convergence.
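A minimal sketch of a Cayley-transform update is given below. It is not the paper's exact $H_{t}$, which is not reproduced here. As assumptions, it maximises the sum of per-dimension standard deviations of $\mathcal{V}Q$, builds the skew-symmetric matrix $\Phi=GQ^{\top}-QG^{\top}$ from the Euclidean gradient $G$, and uses a simple backtracking step size that is an addition of this sketch.

```python
import numpy as np

def sum_std(V, Q):
    """Sum over dimensions of the standard deviation of the rotated, centred data."""
    X = V @ Q
    return np.sqrt(np.mean(X ** 2, axis=0)).sum()

def grad_neg_sum_std(V, Q):
    """Euclidean gradient of -sum_std with respect to Q (we minimise the negative)."""
    N = V.shape[0]
    X = V @ Q
    s = np.sqrt(np.mean(X ** 2, axis=0))
    E = np.diag(1.0 / (N * s))           # diagonal scaling, one entry per dimension
    return -V.T @ X @ E

def cayley_step(Q, G, tau):
    """Q_next = (I + tau/2 Phi)^{-1} (I - tau/2 Phi) Q with Phi skew-symmetric."""
    I = np.eye(Q.shape[0])
    Phi = G @ Q.T - Q @ G.T
    H = np.linalg.solve(I + 0.5 * tau * Phi, I - 0.5 * tau * Phi)
    return H @ Q

# Toy centred data whose variance is concentrated along a few hidden directions.
rng = np.random.default_rng(0)
N, D = 200, 6
R0, _ = np.linalg.qr(rng.normal(size=(D, D)))
V = (rng.normal(size=(N, D)) * np.array([5.0, 2.0, 1.0, 0.5, 0.2, 0.1])) @ R0
V -= V.mean(axis=0)

Q = np.eye(D)
start = sum_std(V, Q)
for _ in range(50):
    G = grad_neg_sum_std(V, Q)
    tau, current = 1.0, sum_std(V, Q)
    Q_new = cayley_step(Q, G, tau)
    while sum_std(V, Q_new) < current and tau > 1e-8:
        tau *= 0.5                        # backtrack until the objective improves
        Q_new = cayley_step(Q, G, tau)
    if sum_std(V, Q_new) >= current:
        Q = Q_new
end = sum_std(V, Q)
```

Because the Cayley transform of a skew-symmetric matrix is orthogonal, `Q` stays orthogonal at every iteration without any re-projection, which is the main appeal of this kind of update.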

3. $P$-step: By fixing $Q$ and $\mathcal{V}$, we can reduce Eq. (9) to the following sub-problem:

This is a standard least-squares problem with the following analytical solution:

In this way, we sequentially update $\mathcal{V}$, $Q$ and $P$ for $T$ iterations to optimise UVDS by coordinate descent. Each sub-problem is solved to a global or local optimum, so the objective does not increase from one update to the next; since it is also bounded from below, the procedure converges. In practice, UVDS converges well within $T=5\sim 10$ iterations.

Zero-shot Recognition

Once we obtain the embedding matrices $P$ and $Q$, the visual features of unseen classes can be easily synthesised from their semantic attributes:

$$\mathcal{X}_{u}=\mathcal{A}_{u}PQ.$$

Note that for image-level attributes, $\mathcal{X}_{u}$ contains as many instances as the test set. The zero-shot recognition task now becomes a typical classification problem. Thus, any existing supervised classifier, e.g. SVM, can be applied. For class-level attributes, only one prototype feature is synthesised for each class. Either few-shot learning techniques or the simplest Nearest Neighbour (NN) algorithm can be adopted. Since we focus on the quality of the synthesised features, we simply use NN and SVM for image-level tasks and NN for class-level tasks.
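The class-level case can be sketched as follows: one prototype per unseen class is synthesised as $\mathcal{A}_{u}PQ$, and each real test feature is assigned the label of its nearest prototype. The matrices `P` and `Q` below are random stand-ins for the learnt ones, and the test features are simulated, so the numbers carry no experimental meaning.

```python
import numpy as np

def nn_predict(X_test, prototypes, proto_labels):
    """Label each test feature by its nearest synthesised prototype (Euclidean)."""
    d2 = ((X_test[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return proto_labels[np.argmin(d2, axis=1)]

rng = np.random.default_rng(0)
M, D = 4, 16
P = rng.normal(size=(M, D))                     # stand-in for the learnt projection
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))    # stand-in orthogonal rotation
A_u = rng.normal(size=(3, M))                   # class-level attributes, 3 unseen classes
proto_labels = np.array([10, 11, 12])           # hypothetical class ids

prototypes = A_u @ P @ Q                        # one synthesised prototype per class
y_true = rng.integers(0, 3, size=20)
X_test = prototypes[y_true] + 0.01 * rng.normal(size=(20, D))   # simulated test features
y_pred = nn_predict(X_test, prototypes, proto_labels)
```

For image-level attributes the same synthesised matrix would contain one row per instance, and those rows could be used as the training set of an SVM or any other supervised classifier instead of the nearest-prototype rule.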

Experiments

Settings: We evaluate our method on four benchmark datasets and strictly follow the published seen/unseen splits. For AwA and aPY, we follow the standard 40/10 and 20/12 splits, like most existing methods. For CUB, we use the 150/50 setting. For SUN, we use the simple 707/10 setting, as previously reported. Methods evaluated under different settings, or using other kinds of semantic information, are not included in the comparison.

Semantic Attributes: Existing attributes are divided into image-level and class-level. On the CUB, aPY, and SUN datasets, image-level attributes are provided, so our approach can synthesise the visual features for all unseen instances. We compute class-level attributes by averaging the image-level attributes for each class. For the AwA dataset, only class-level attributes are provided.

Visual Features: For low-level visual features, we use those provided with the four datasets. For deep learning features, we adopt publicly released CNN features for the four datasets, extracted with the VGG-19 model.

Implementation Parameters: Half of the data in each class in the training sets are used as the validation set. We use 10-fold cross-validation to obtain the optimal hyper-parameters $\lambda$ and $\beta$. $k$ is fixed to 10 for the $k$-nn graph.

Table 1 summarises our comparison with the published results of state-of-the-art methods. The hyphens indicate that the compared methods were not tested on the corresponding datasets in the original papers. In the first section, all of the compared methods were tested using conventional low-level features. In the second section, deep learning features are employed. For all four datasets, we first evaluate our method using class-level attributes (CA). In this scenario, each unseen class gains a synthesised visual feature prototype from the class attribute signature. The unseen test images are predicted by NN classification using these prototypes. When image-level attributes are available, i.e. on CUB, aPY, and SUN, we further conduct experiments using SVM classifiers. The visual feature vector of each unseen image is synthesised by the proposed UVDS and then used to train SVM models. During the test, visual features extracted from the unseen images are fed to the trained SVM to obtain the prediction. Our method consistently outperforms the state-of-the-art methods in conventional ZSL scenarios. Our results also exceed two of the results based on transductive settings, which supports the claim that our synthesised visual features are highly discriminative. While deep learning features can boost the performance, our method also achieves acceptable results with low-level features. In most cases, using SVM further improves the recognition rates, especially when the class-level attributes are noisy, e.g. on aPY and SUN. However, if the class-level attributes are more precise, e.g. on CUB, the class-level NN classifier can be better than SVM.

Detailed Evaluations

Baseline methods: To understand the effect of each term in Eq. (9), we compare our method with several baseline methods in Table 2. Since AwA only provides class-level attributes, the following experiments are conducted on CUB, SUN, and aPY only. The first method is simply Linear Regression, in which we solve Eq. (1) and synthesise prototypes of unseen classes using Eq. (2). The second and third methods are denoted as Graph-Regularisation (GR) only ($\beta=0$) and Diffusion-Regularisation (DR) only ($\lambda=0$). For the training bias problem, we use the validation set to test the methods on seen classes. We also investigate ZSL under both class-level and image-level attribute scenarios. The first scenario is prototype-based, i.e. each unseen class gains only one visual prototype. We compare two possible ways to obtain the class-level visual prototype: 1) we compute the mean of the image-level attributes in each class and use the averaged class-level attributes (CA) to synthesise one visual prototype for each class; 2) we first synthesise the visual features from the image-level attributes and use the mean of the features (MF) as the class prototype. During the test, we use NN classification to predict the label of the test image. The second scenario is sample-based, i.e. each unseen image has one unique attribute description. In this scenario, we synthesise all of the visual features of the unseen classes and use them as training examples. We show how an advanced classifier, e.g. SVM, can further boost the performance.

In summary, our method effectively prevents the training bias, whereas linear regression without regularisation suffers a performance degradation of 30% on average from seen to unseen classes. DR is complementary to GR and can further boost the performance. There is no significant difference between the CA and MF scenarios. Therefore, our proposed method can be reliably applied to both image-level and class-level attributes. Another advantage is that the synthesised visual data can be fed to typical supervised classifiers to achieve better performance, as supported by the results using SVM.

Further Discussion: Two more questions remain: (1) What are the outcomes of the diffusion regularisation? (2) What kind of visual features are synthesised? In Fig. 3, we show the variance of each dimension of the synthesised data. The variances are sorted in descending order. We compare with the real unseen data and with the data synthesised without diffusion regularisation ($\beta=0$). Note that, in the synthesised data without DR (red), most of the variance is concentrated in a few dimensions (roughly 1000, 1500, and 500 on SUN, aPY, and CUB), while most of the remaining dimensions gain very low variance. In comparison, the variances of our proposed synthesised data (green) and of the real data are more informative. Furthermore, thanks to the DR, the variances in our proposed data are more balanced than in the real data, i.e. the information is spread more evenly across the dimensions. Such quantitative evidence explains the success of our proposed method in ZSL recognition.
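The kind of curve described for Fig. 3 can be sketched as follows: compute the per-dimension variance, sort it in descending order, and summarise how evenly it is spread with the variance $\Pi$ of the standard deviations. The data are synthetic and a random orthogonal rotation stands in for the learnt $Q$, so this only illustrates the measurement, not the paper's results.

```python
import numpy as np

def variance_profile(X):
    """Per-dimension variance of the centred data, sorted in descending order."""
    Xc = X - X.mean(axis=0)
    return np.sort((Xc ** 2).mean(axis=0))[::-1]

def std_spread(X):
    """Variance of the per-dimension standard deviations (the quantity Pi)."""
    pi = np.sqrt(variance_profile(X))
    return np.mean((pi - pi.mean()) ** 2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 6)) * np.array([5.0, 2.0, 1.0, 0.5, 0.2, 0.1])
Zc = Z - Z.mean(axis=0)
_, eigvecs = np.linalg.eigh(Zc.T @ Zc)
V = Zc @ eigvecs                                 # axis-aligned: most concentrated case

Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))     # random orthogonal stand-in rotation
X_rot = V @ Q

total_before = variance_profile(V).sum()
total_after = variance_profile(X_rot).sum()
spread_before = std_spread(V)
spread_after = std_spread(X_rot)
```

The total variance is the same before and after the rotation, while the spread of the standard deviations drops, which is the behaviour the diffusion regularisation is designed to encourage in a learnt rather than random way.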

Finally, we provide some qualitative results of our method. We use the synthesised features as queries and retrieve real images from the unseen datasets. In Fig. 4, we show some success cases in which most of the top-5 results have the same class label as the query. In particular, the third result for Bag is the very image paired with the attributes that were used to synthesise the data. Such results demonstrate that the synthesised data is close to the samples from the same class in the feature space. We also provide some failure cases in which the top-1 retrieval result does not have the same class label. Some of them are due to the ambiguity of the semantic meaning, e.g. the flea market has many attributes in common with the shoe shop. In other cases, e.g. on the CUB dataset, the real data of the birds are not distinctive from those of other classes. Therefore, the NN-based retrieval gives a mixture of true positives and false positives. Such failures due to the ambiguity of the visual features are not common. We can still achieve a 45.72% overall recognition rate on the CUB dataset.

Conclusion

In this paper, we proposed a novel algorithm that synthesises visual data for unseen classes using semantic attributes. The experiments show that direct embedding using regression-based models can lead to low recognition rates owing to three main problems: structural difference, training bias, and variance decay. Correspondingly, we introduced a latent structure-preserving space with the diffusion regularisation. Our approach outperformed the state-of-the-art methods on all four benchmark datasets. For future work, a worthwhile direction is to replace the semantic attributes with word vectors derived automatically from text. In this way, the cost of synthesising data can be further reduced.