Factorized Bilinear Models for Image Recognition

Yanghao Li, Naiyan Wang, Jiaying Liu, Xiaodi Hou

Introduction

Deep convolutional neural networks (CNNs) have demonstrated their power in most computer vision tasks, from image classification and object detection to semantic segmentation. The impressive fitting power of a deep net mainly comes from its recursive feature transformations. Most efforts to enhance the representation power of a deep neural net can be roughly categorized into two lines. One line of work focuses on increasing the depth of the network, namely the number of non-linear transformations. ResNet is a classic example of such an extremely deep network. By using skip connections to overcome the vanishing/exploding gradient and degradation problems, ResNet achieves significant performance improvements. The other line of work aims at enhancing the fitting power of each layer. For example, Deep Neural Decision Forests was proposed to integrate differentiable decision forests as the classifier. Another approach modeled pairwise feature interactions using an explicit outer product at the final classification layer. The main drawbacks of these approaches are that they either introduce a large number of additional parameters (for instance, the outer-product model introduces 250M additional parameters for ImageNet classification) or converge slowly (the decision-forest model requires 10x more epochs to converge than a typical GoogLeNet).

In this paper, we propose the Factorized Bilinear (FB) model to enhance the capacity of CNN layers in a simple and effective way. At a glance, the FB model can be considered a generalized approximation of Bilinear Pooling, but with two modifications. First, our FB model generalizes the original Bilinear Pooling to all convolutional and fully connected layers. In this way, all computational layers in a CNN can have larger capacity with pairwise interactions. However, under the original settings of Bilinear Pooling, such a generalization leads to an explosion of parameters. To mitigate this problem, we constrain the rank of all quadratic matrices. This constraint significantly reduces the number of parameters and the computational cost, making the complexity of an FB layer linear with respect to the original conv/fc layer. Furthermore, in order to cope with overfitting, we propose a regularization method called DropFactor for the FB model. Analogous to Dropout, DropFactor randomly sets some elements of the bilinear terms to zero during each iteration in the training phase, and uses all of them in the testing phase.

To summarize, the contributions of this work are three-fold:

We present a novel Factorized Bilinear (FB) model to consider the pairwise feature interactions with linear complexity. We further demonstrate that the FB model can be easily incorporated into convolutional and fully connected layers.

We propose a novel method DropFactor for the FB layers to prevent overfitting by randomly dropping factors in the training phase.

We validate the effectiveness of our approach on several standard benchmarks. Our proposed method achieves remarkable performance compared to state-of-the-art methods with affordable complexity.

Related Work

The Tao of tuning the layer-wise capacity of a DNN lies in the balance between model complexity and computation efficiency. The naive, linear approach to increasing layer capacity is either adding more nodes or enlarging receptive fields. As discussed in prior work, these methods have a beneficial effect only up to a limit. From a different perspective, PReLU and ELU add flexibility to the activation function at a minimal cost, by providing a single learned parameter for each rectifier. Besides activation functions, many other works tried to use more complex, non-linear models to replace vector/matrix operations in each layer. For instance, Network In Network (NIN) replaced the linear convolutional filters with a multilayer perceptron (MLP), which is proven to be a general function approximator. The MLP is essentially a stack of fully connected layers. Thus, NIN is equivalent to increasing the depth of the network. In Deep Neural Decision Forests, random forests were unified with DNNs as the final predictors in a stochastic and differentiable way. This back-propagation compatible version of random forests guides the lower layers to learn better representations in an end-to-end manner. However, the large computational overhead makes this method inappropriate for practical applications.

Before the invention of deep learning, one of the most common tricks to increase model capacity was to apply kernels. Although the computational burden of some kernel methods can be prohibitively high, the simplest form, the bilinear kernel, is certainly affordable. In fact, many of today's DNNs have adopted bilinear kernels and have achieved remarkable performance in various tasks, such as fine-grained classification, semantic segmentation, face identification, and person re-identification.

One such method is Bilinear Pooling. In this model, the final output is obtained by a weighted pooling of a global descriptor, which comes from the outer product of the final convolutional layer with itself (although the original Bilinear Pooling supports input vectors from two different networks, there is little difference performance-wise; for simplicity, we only consider the bilinear model using identical input vectors in this paper):
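Written out as a sketch consistent with the description above, bilinear pooling and its classifier take the following form. The symbols are assumptions introduced here for illustration: $\mathbf{x}_{i}\in\mathbb{R}^{n}$ is the feature at spatial location $i$ of the final convolutional layer, $\mathcal{S}$ is the set of spatial locations, $\mathbf{z}\in\mathbb{R}^{n\times n}$ is the global descriptor, and $\mathbf{W}$, $\mathbf{b}$ are the weights and bias of the fully connected layer.

$$\mathbf{z}=\sum_{i\in\mathcal{S}}\mathbf{x}_{i}\mathbf{x}_{i}^{T},\qquad \mathbf{y}=\mathbf{b}+\mathbf{W}\,\mathrm{vec}(\mathbf{z})$$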

It is easy to see that the size of the global descriptor can become huge. To reduce the dimensionality of this quadratic term in bilinear pooling, compact bilinear pooling proposed two approximations to obtain compact bilinear representations. Despite these efforts to reduce dimensionality, bilinear pooling still has a large number of parameters and a heavy computational burden. In addition, all of these models are based on the interactions of the final convolutional layer and cannot be extended to earlier feature nodes in DNNs.

The Model

Before introducing the FB models, we first rewrite the bilinear pooling with its fully connected layer as below:
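A form consistent with the surrounding text, for the $j$-th output neuron, is sketched below. It is numbered (3) only to match the later reference to bilinear pooling in Eq. (3). The assumptions are that $\mathbf{W}_{j\cdot}$ is the $j$-th row of the fully connected weight, $\mathbf{W}_{j\cdot}^{R}\in\mathbb{R}^{n\times n}$ is that row reshaped into a matrix, and $\mathcal{S}$ is the set of spatial locations.

$$y_{j}=b_{j}+\mathbf{W}_{j\cdot}^{T}\,\mathrm{vec}\Big(\sum_{i\in\mathcal{S}}\mathbf{x}_{i}\mathbf{x}_{i}^{T}\Big)=b_{j}+\sum_{i\in\mathcal{S}}\mathbf{x}_{i}^{T}\mathbf{W}_{j\cdot}^{R}\mathbf{x}_{i} \tag{3}$$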

Previous studies have observed patterns of co-activation among intra-layer nodes. The responses of convolutional kernels often form clusters that have semantic meanings. This observation motivates us to regularize $\mathbf{W}_{j\cdot}^{R}$ by its rank to simplify computations and fight against overfitting.

To incorporate the interactions term, we present the factorized bilinear model as follows:
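Under the symbol definitions given next, the model for one output neuron can be sketched in matrix form and in expanded form. The tags (5) and (6) follow the way the complexity discussion refers to these two forms. The bias $b$, the input dimension $n$, and the shape $\mathbf{F}\in\mathbb{R}^{k\times n}$ (with $k$ factors) are stated here as assumptions.

$$y=b+\mathbf{w}^{T}\mathbf{x}+\mathbf{x}^{T}\mathbf{F}^{T}\mathbf{F}\mathbf{x} \tag{5}$$

$$y=b+\sum_{i=1}^{n}w_{i}x_{i}+\sum_{i=1}^{n}\sum_{j=1}^{n}\langle{\bf f}_{\cdot i},{\bf f}_{\cdot j}\rangle x_{i}x_{j} \tag{6}$$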

where $x_{i}$ is the $i$-th variable of the input feature $\mathbf{x}$, $w_{i}$ is the $i$-th value of the first-order weight and ${\bf f}_{\cdot i}$ is the $i$-th column of $\mathbf{F}$. $\langle{\bf f}_{\cdot i},{\bf f}_{\cdot j}\rangle$ is defined as the inner product of ${\bf f}_{\cdot i}$ and ${\bf f}_{\cdot j}$, which describes the interaction between the $i$-th and $j$-th variables of the input feature vector.

During the training, the parameters in FB model can be updated by back-propagating the gradients of the loss $l$ . Let $\partial l/\partial y$ be the gradient of the loss function with respect to $y$ , then by the chain rule we have:
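Differentiating $y=b+\mathbf{w}^{T}\mathbf{x}+\mathbf{x}^{T}\mathbf{F}^{T}\mathbf{F}\mathbf{x}$ term by term gives the gradients below. This is a derivation sketch that assumes a single output neuron with scalar $y$, bias $b$ and $\mathbf{F}\in\mathbb{R}^{k\times n}$.

$$\frac{\partial l}{\partial b}=\frac{\partial l}{\partial y},\qquad \frac{\partial l}{\partial\mathbf{w}}=\frac{\partial l}{\partial y}\,\mathbf{x},\qquad \frac{\partial l}{\partial\mathbf{F}}=2\,\frac{\partial l}{\partial y}\,\mathbf{F}\mathbf{x}\mathbf{x}^{T},\qquad \frac{\partial l}{\partial\mathbf{x}}=\frac{\partial l}{\partial y}\,\big(\mathbf{w}+2\,\mathbf{F}^{T}\mathbf{F}\mathbf{x}\big)$$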

Thus, the FB model applied in DNNs can be easily trained along with other layers by existing optimizers, such as stochastic gradient descent.
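As an illustration of this trainability, the following NumPy sketch implements the forward pass and the chain-rule gradients of a single FB neuron and checks them against finite differences. It is not the authors' MXNet implementation. The function names, the toy loss $l=y$, and the sizes `n = 6`, `k = 3` are hypothetical choices made for the example.

```python
import numpy as np

def fb_forward(x, b, w, F):
    # y = b + w^T x + x^T F^T F x, computed through the k-vector F x.
    Fx = F @ x
    return b + w @ x + Fx @ Fx

def fb_backward(x, w, F, dy):
    # Chain rule for one output neuron; dy is the upstream gradient dl/dy.
    Fx = F @ x
    grad_b = dy
    grad_w = dy * x
    grad_F = 2.0 * dy * np.outer(Fx, x)    # shape (k, n)
    grad_x = dy * (w + 2.0 * (F.T @ Fx))   # shape (n,)
    return grad_b, grad_w, grad_F, grad_x

def numeric_grad(f, v, eps=1e-6):
    # Central finite differences of the scalar f() with respect to array v,
    # perturbing v in place one entry at a time.
    g = np.zeros_like(v)
    for idx in np.ndindex(v.shape):
        old = v[idx]
        v[idx] = old + eps
        up = f()
        v[idx] = old - eps
        down = f()
        v[idx] = old
        g[idx] = (up - down) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
n, k = 6, 3                      # hypothetical sizes for the check
x, w = rng.normal(size=n), rng.normal(size=n)
F = rng.normal(size=(k, n))
b = 0.5

# Toy loss l = y, so dl/dy = 1.
_, grad_w, grad_F, grad_x = fb_backward(x, w, F, dy=1.0)
f = lambda: fb_forward(x, b, w, F)
num_w, num_F, num_x = numeric_grad(f, w), numeric_grad(f, F), numeric_grad(f, x)
```

The analytic and numeric gradients should agree to within the finite-difference error, which suggests that an FB neuron can be handled by any gradient-based optimizer in the same way as a linear neuron.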

The aforementioned FB model can be applied in fully connected layers easily by considering all the output neurons. Besides, the above formulations and analyses can also be extended to convolutional layers. Specifically, the patches of the input feature map in a convolutional layer can be rearranged into vectors using the im2col trick, so that the convolution operation is converted to a dense matrix multiplication as in fully connected layers. Most popular deep learning frameworks use this reformulation to compute the convolution operator, since dense matrix multiplication can maximize GPU utilization. Thus, the convolutional layer can also benefit from the proposed FB model.

According to the definition of the interaction weight $\mathbf{F}$ in Eq. (5), the space complexity, i.e. the number of parameters for one neuron in the FB model, is $O(kn)$. Although the naïve computation of Eq. (6) costs $O(kn^{2})$, we can compute the factorized bilinear term efficiently by changing the order of matrix multiplication in Eq. (5). By computing $\mathbf{F}\mathbf{x}$ (and hence its transpose $\mathbf{x}^{T}\mathbf{F}^{T}$) first, $\mathbf{x}^{T}\mathbf{F}^{T}\mathbf{F}\mathbf{x}$ can be computed in $O(kn)$. Thus, the total computational complexity of Eq. (5) is also $O(kn)$. As a result, the FB model has linear complexity in terms of both $k$ and $n$ for the computation and the number of parameters. We will show the actual runtime of the FB layers in our implementation in the experiments section.
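The reordering argument can be made concrete with a short NumPy sketch. The function names and the sizes `n = 512`, `k = 20` are illustrative assumptions, and the code is not taken from the authors' implementation.

```python
import numpy as np

def bilinear_naive(x, F):
    # Builds the n x n interaction matrix F^T F explicitly: O(k n^2) time.
    return x @ (F.T @ F) @ x

def bilinear_factorized(x, F):
    # Projects x onto the k factors first, then sums the squares: O(k n) time.
    Fx = F @ x
    return Fx @ Fx

rng = np.random.default_rng(0)
n, k = 512, 20                   # hypothetical input dimension and factor count
x = rng.normal(size=n)
F = rng.normal(size=(k, n))

naive = bilinear_naive(x, F)
fast = bilinear_factorized(x, F)
```

Both functions return the same scalar up to floating-point rounding. The second one never materializes the $n\times n$ matrix, which is where the linear cost in $n$ comes from.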

DropFactor

Dropout is a simple yet effective regularization to prevent DNNs from overfitting. The idea behind Dropout is that it provides an efficient way to combine exponentially many different neural networks by randomly dropping neurons. Inspired by this technique, we propose a specific DropFactor method for our FB model.
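Expanding the bilinear term of the FB model factor by factor gives the form below. It is numbered (8) to match the reference that follows, and treating $\mathbf{f}_{j\cdot}$ as a column vector is a notational assumption.

$$y=b+\mathbf{w}^{T}\mathbf{x}+\sum_{j=1}^{k}\mathbf{x}^{T}\mathbf{f}_{j\cdot}\mathbf{f}_{j\cdot}^{T}\mathbf{x} \tag{8}$$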

where $\mathbf{f}_{j\cdot}$ is the $j$-th row of the interaction weight $\mathbf{F}$, which represents the $j$-th factor. Based on Eq. (8), Fig. 1(a) shows the expanded structure of the FB layer, which is composed of one linear transformation path and $k$ bilinear paths. The key idea of our DropFactor is to randomly drop the bilinear paths corresponding to the $k$ factors during training. This prevents the $k$ factors from co-adapting.

In our implementation, each factor is retained with a fixed probability $p$ during training. With the DropFactor, the formulation of FB layer in the training becomes:
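A form consistent with this description is sketched below. The mask variables $m_{j}$ are notation assumed here, with each $m_{j}$ drawn independently from a Bernoulli distribution with retain probability $p$.

$$y=b+\mathbf{w}^{T}\mathbf{x}+\sum_{j=1}^{k}m_{j}\,\mathbf{x}^{T}\mathbf{f}_{j\cdot}\mathbf{f}_{j\cdot}^{T}\mathbf{x},\qquad m_{j}\sim\mathrm{Bernoulli}(p)$$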

For testing, instead of explicitly averaging the outputs from all $2^{k}$ thinned networks, we use the approximate “Mean Network” scheme used in Dropout. As shown in Fig. 1(c), each factor term $\mathbf{x}^{T}\mathbf{f}_{j\cdot}\mathbf{f}_{j\cdot}^{T}\mathbf{x}$ is multiplied by $p$ at testing time:
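Written out under the same notation (a sketch that follows directly from scaling each factor term by $p$):

$$y=b+\mathbf{w}^{T}\mathbf{x}+\sum_{j=1}^{k}p\,\mathbf{x}^{T}\mathbf{f}_{j\cdot}\mathbf{f}_{j\cdot}^{T}\mathbf{x}$$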

In this way, the output of each neuron at testing time is the same as the expected output of $2^{k}$ different networks at training time.
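This equality can be checked numerically for a small $k$ by enumerating all $2^{k}$ masks. The NumPy sketch below does so for one neuron. The function names, the independent per-factor Bernoulli mask, and the sizes are assumptions made for illustration rather than the authors' implementation.

```python
import itertools
import numpy as np

def fb_masked(x, b, w, F, mask):
    # FB output when factor j is kept iff mask[j] == 1.
    Fx = F @ x
    return b + w @ x + np.sum(mask * Fx**2)

def fb_dropfactor_train(x, b, w, F, p, rng):
    # Training pass: keep each of the k factors independently with probability p.
    mask = (rng.random(F.shape[0]) < p).astype(float)
    return fb_masked(x, b, w, F, mask)

def fb_dropfactor_test(x, b, w, F, p):
    # "Mean network": every factor term is scaled by the retain probability p.
    Fx = F @ x
    return b + w @ x + p * np.sum(Fx**2)

def fb_expected_train(x, b, w, F, p):
    # Exact expectation of the training output over all 2^k masks.
    k = F.shape[0]
    total = 0.0
    for bits in itertools.product([0, 1], repeat=k):
        mask = np.array(bits, dtype=float)
        kept = mask.sum()
        prob = p**kept * (1.0 - p)**(k - kept)
        total += prob * fb_masked(x, b, w, F, mask)
    return total

rng = np.random.default_rng(0)
n, k, p = 8, 4, 0.5              # hypothetical sizes and retain probability
x, w = rng.normal(size=n), rng.normal(size=n)
F = rng.normal(size=(k, n))
b = 0.1

sample = fb_dropfactor_train(x, b, w, F, p, rng)
expected = fb_expected_train(x, b, w, F, p)
test_out = fb_dropfactor_test(x, b, w, F, p)
```

Because the output is linear in the mask variables, the expectation over masks matches the scaled test-time output exactly for a single FB neuron. For a deep network with non-linearities after the FB layer, the mean network is only an approximation, as in Dropout.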

Relationship to Existing Methods

In this section, we connect our proposed FB model with several closely related works, and discuss the differences and advantages over them.

Relationship to Bilinear Pooling. Bilinear pooling models pairwise interactions of features by the outer product of two vectors. In the following, we demonstrate that our FB block is a generalized form of the bilinear pooling block.

As shown in Fig. 2(a), bilinear pooling is applied after the last convolutional layer of a CNN (e.g. VGG), followed by a fully connected layer for classification. We construct an equivalent structure with our FB model by using the FB convolutional layer with a $1\times 1$ kernel, as shown in Fig. 2(b). The final average pooling layer is used to aggregate the scores across the spatial locations. Thus, Eq. (5) can be reformulated as:
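A form consistent with this construction is sketched below. It assumes that $\mathcal{S}$ denotes the set of spatial locations, that $\mathbf{x}_{i}$ is the feature at location $i$, and that the average pooling weights all locations equally.

$$y=\frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}\Big(b+\mathbf{w}^{T}\mathbf{x}_{i}+\mathbf{x}_{i}^{T}\mathbf{F}^{T}\mathbf{F}\mathbf{x}_{i}\Big)$$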

Compared with bilinear pooling in Eq. (3), we add the linear term and replace the pairwise matrix $\mathbf{W}_{j\cdot}^{R}$ with the factorized bilinear weight $\mathbf{F}^{T}\mathbf{F}$.

We argue that such symmetric and low-rank constraints on the interaction matrix are reasonable in our case. First, the interaction between the $i$-th and $j$-th features and that between the $j$-th and $i$-th features should be the same. Second, due to the redundancy in neural networks, the neurons usually form clusters. As a result, only a few factors should be enough to capture the interactions between them. Besides reducing the space and time complexity, restricting $k$ also potentially prevents overfitting and leads to better generalization.

An improvement over bilinear pooling is compact bilinear pooling, which reduces the feature dimension of bilinear pooling using two approximation methods: Random Maclaurin (RM) and Tensor Sketch (TS). However, the dimension of the projected compact bilinear feature is still too large (10K for the 512-dimensional input) for deep networks. Table 1 compares the factorized bilinear model with bilinear pooling and its variant, compact bilinear pooling. Similar to compact bilinear pooling, our FB model requires far fewer parameters than bilinear pooling. It also reduces the computational complexity significantly (from 133M in TS to 10M) at the same time. In addition, beyond serving as the final prediction layer, our method can also be applied in the early layers as a common transformation layer, which makes it much more general than the bilinear pooling methods.

Relationship to Factorization Machines. Factorization Machine (FM) is a popular predictor in machine learning and data mining, especially for very sparse data. Similar to our FB model, FM also captures the interactions of the input features through a factorized parametrization. However, since FM is only a classifier, its applications are restricted to simple regression and classification. In fact, a 2-way FM can be constructed by a tiny network composed of a single FB layer with one output unit. In this way, a 2-way FM is a special case of our FB model, while our FB model is much more general and can be integrated into regular neural networks seamlessly for different kinds of tasks.

Experiments

In this section, we conduct comprehensive experiments to validate the effectiveness of the proposed FB model. In Sec. 5.1, we first investigate the design choices and properties of the proposed FB model, including the architecture of the network, parameter settings and speed. Then, in Sec. 5.2, we conduct several experiments on three well-known standard image classification datasets and two fine-grained classification datasets. In the following experiments, we refer to the CNN equipped with our FB model as the Factorized Bilinear Network (FBN). Code is available at https://github.com/lyttonhao/Factorized-Bilinear-Network.

Implementation Details. We adopt two standard network structures, Inception-BN and ResNet, as our baselines. Our FBN improves upon these two structures. Some details are elaborated below. For a given network and its corresponding FBN, we use the same experimental settings (e.g. the training policy and data augmentation), except for two special treatments for FBNs. (i) To prevent the quadratic term in the FB layer from growing too large, we change the activation before every FB layer from ‘ReLU’ to ‘Tanh’, which restricts the output range of FB layers. We do not use the power normalization and $l_{2}$ normalization used in bilinear pooling, for two reasons: 1) the square root is not numerically stable around zero; 2) we do not calculate the bilinear features explicitly. (ii) We use a slow start training scheme, which shrinks the initial learning rate of the FB layer by a tenth and gradually increases the learning rate to the regular level over several epochs (e.g. 3 epochs). This treatment learns a good initialization and helps the convergence of FBNs, similar to a warmup step.
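One possible reading of the slow start scheme is sketched below in Python. The source does not specify the ramp, so the linear interpolation, the starting scale of one tenth of the regular rate, the function name and its parameters are all assumptions made for illustration.

```python
def fb_slow_start_lr(epoch, base_lr, warmup_epochs=3, start_scale=0.1):
    # Hypothetical schedule for the FB layer only:
    # start at start_scale * base_lr and ramp linearly to base_lr
    # over warmup_epochs epochs (epochs counted from 0).
    if epoch >= warmup_epochs:
        return base_lr
    frac = epoch / warmup_epochs
    return base_lr * (start_scale + (1.0 - start_scale) * frac)

# Learning rate of the FB layer for the first five epochs, regular rate 0.1.
fb_rates = [fb_slow_start_lr(e, base_lr=0.1) for e in range(5)]
```

Under these assumptions the FB layer starts at a learning rate of about 0.01 and reaches the regular rate of 0.1 from the fourth epoch onward, while all other layers use the regular schedule throughout.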

In this section we investigate the architecture design of FBNs and the appropriate parameters, such as the number of factors $k$ and the DropFactor rate $p$ (more exploration experiments can be found in the supplemental material). Most of the following experiments are conducted on a simplified version of the Inception-BN network (https://goo.gl/QwVS3Z) on CIFAR-100. Some details about the experiment settings, such as training policies and data augmentations, will be explained in Sec. 5.2.

As discovered in prior work, the lower layers of CNNs usually respond to simple low-level features. Thus, a linear transformation is enough to abstract the concepts within images. Consequently, we modify the higher layers of the Inception-BN network to build our FBNs. As shown in Fig. 3, the original Inception-BN is constructed from several SimpleFactories, and each SimpleFactory contains a $1\times 1$ conv block and a $3\times 3$ conv block. The five FBNs with different structures are explained as follows:

In5a-FBN. We replace the 1x1 conv layer in In5a factory with our FB convolutional layer. The parameters such as kernel size and stride size are kept the same.

In5b-FBN. This is the same as In5a-FBN, except that we apply the FB model in the In5b factory.

FC-FBN. The final fully-connected layer is replaced by our FB fully connected layer.

Conv-FBN. As shown in Fig. 2(b), Conv-FBN is constructed by inserting an FB conv layer with a $1\times 1$ kernel before the global pooling layer and removing the fully connected layer.

Conv+In5b-FBN. This network combines Conv-FBN and In5b-FBN.

The results of the original Inception-BN and the five FBNs are shown in Table 2. The training and testing curves for these networks are presented in Fig. 4. The number of factors $k$ of the different FBNs is fixed at $20$, and appropriate values of the DropFactor rate $p$ are chosen for the different FBNs (more experiments on $k$ and $p$ are shown in Table 3 and Fig. 5). From Table 2, we can see that most FBNs achieve better results than the baseline Inception-BN model, and Conv-FBN achieves 21.98% error, which outperforms Inception-BN by a large margin of 2.72%. This demonstrates that incorporating the FB model indeed improves the performance of the network.

From Table 2 and Fig. 4, we have several interesting findings: 1) Comparing the results of Conv-FBN, In5b-FBN and In5a-FBN, we find that incorporating the FB model in the lower layers may lead to inferior results and more severe overfitting. 2) The difference between FC-FBN and Conv-FBN is whether the interactions across different locations of the input feature map are considered. The results show that the pairwise interactions should be captured at each position of the input separately. 3) Incorporating two FB blocks (Conv+In5b-FBN) does not further improve the performance, at least on CIFAR-100, but leads to more severe overfitting instead.

As the number of parameters and the computational complexity of the FB layer increase linearly with the number of factors $k$, we also evaluate the sensitivity to $k$ in an FB layer. Table 3 shows the results of In5b-FBN and Conv-FBN on CIFAR-100. As can be seen, after $k$ grows beyond 20, the increase in performance is marginal, and too large a $k$ may even be detrimental. Thus, we choose 20 factors in all the subsequent experiments.

We also vary the DropFactor rate $p$ to see how it affects the performance. Fig. 5(a) shows the testing error on CIFAR-100 of In5b-FBN and Conv-FBN with different $p$. Note that even the FBNs without DropFactor ($p=1.0$) can achieve better results than the baseline method. With DropFactor, FBNs further improve the result and achieve the best result of $21.98\%$ when $p=0.5$ for Conv-FBN and $22.63\%$ when $p=0.8$ for In5b-FBN. Fig. 5(b) and 5(c) show the training and testing curves with different $p$. As illustrated, the testing curves are similar in the first 200 epochs for the different networks, yet the training curves differ considerably. A smaller DropFactor rate $p$ makes the network less prone to overfitting, which demonstrates the effectiveness of DropFactor. On the other hand, too small a rate may harm the convergence of the FBNs.

We show the runtime speed comparison of a small network (Inception-BN) and a relatively large network (ResNet of 1001 layers) with and without FB layers in Table 4. The test is performed on a Titan X GPU. Since our own implementation of the FB layers is not optimized with advanced libraries such as cuDNN, we also show the results of all methods without cuDNN for a fair comparison. For Inception-BN, the loss of speed is still tolerable. In addition, since we only insert a single FB block, it has little impact on the speed of large networks, e.g. ResNet-1001. Lastly, cuDNN accelerates all methods considerably. We believe that the training speed of our FB layers will also benefit from a careful optimization.

Evaluation on Multiple Datasets

In this section, we compare our FBN with other state-of-the-art methods on multiple datasets, including CIFAR-10, CIFAR-100, ImageNet and two fine-grained classification datasets. For the following experiments, we do not perform an exhaustive parameter search and use the Conv-FBN network with fixed factors $k=20$ and DropFactor rate $p=0.5$ as the default setting of FBNs, since this setting achieves the best performance according to the ablation experiments in Sec. 5.1. Our FB layers are implemented in MXNet and we follow some training policies in “fb.resnet” (https://github.com/facebook/fb.resnet.torch).

The CIFAR-10 and CIFAR-100 datasets contain 50,000 training images and 10,000 testing images of 10 and 100 classes, respectively. The resolution of each image is $32\times 32$. We follow the commonly used moderate data augmentation for training: a random crop is taken from the image padded by 4 pixels or from its horizontal flip. We use SGD for optimization with a weight decay of 0.0001 and momentum of 0.9. All models are trained with a minibatch size of 128 on two GPUs. For ResNet and its corresponding FBNs, we start training with a learning rate of 0.1 for a total of 200 epochs and divide it by 10 at 100 and 150 epochs. For Inception-BN based models, the learning rate starts at 0.2 and is divided by 10 at 200 and 300 epochs, for a total of 400 epochs.
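The two step schedules described above can be sketched with one small Python helper. The function name is hypothetical, and counting epochs from 0 so that the drop takes effect at the listed epoch index is an assumption.

```python
def step_lr(epoch, base_lr, milestones, factor=0.1):
    # Multiply the learning rate by `factor` once for every milestone
    # that has been reached (epochs counted from 0).
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor**drops

# ResNet-based models: 0.1, divided by 10 at epochs 100 and 150 (200 epochs total).
resnet_lrs = [step_lr(e, 0.1, (100, 150)) for e in range(200)]

# Inception-BN based models: 0.2, divided by 10 at epochs 200 and 300 (400 epochs total).
inception_lrs = [step_lr(e, 0.2, (200, 300)) for e in range(400)]
```

With these settings the ResNet schedule passes through 0.1, 0.01 and 0.001, and the Inception-BN schedule through 0.2, 0.02 and 0.002.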

We train three different networks, Inception-BN, ResNet-164 and ResNet-1001, and their corresponding FB networks. Note that we use the pre-activation version of ResNet instead of the original ResNet. Table 5 summarizes the results of our FBNs and other state-of-the-art algorithms. Our FBNs show consistent improvements over all three corresponding baselines. Specifically, our Inception-BN-FBN outperforms Inception-BN by 2.72% on CIFAR-100 and 0.24% on CIFAR-10, and ResNet-1001-FBN achieves the best result of 19.67% on CIFAR-100 among all the methods. A more intuitive comparison is given in Fig. 6. Most remarkably, our method improves the performance with only a slight additional cost in parameters. For example, compared to ResNet-1001 with 10.7M parameters, our ResNet-1001-FBN obtains better results with only 0.5M (5%) additional parameters. This result is also better than the best Wide ResNet, which uses 36.5M parameters. Although Bilinear Pooling methods have not been used in general image classification tasks, we also re-implement them here using the Inception-BN and ResNet-164 architectures. Their performance is inferior to our results.

Results on ImageNet

Although many works show improvements on small datasets such as CIFAR-10 and CIFAR-100, few prove their effectiveness on large-scale datasets. Thus, in this section we evaluate our method on the ImageNet dataset, which is the gold standard for image classification. The dataset contains 1.28M training images, 50K validation images and 100K testing images. We report the Top-1 and Top-5 errors on the validation set in the single-model, single center-crop setting. For the choice of FBN, we use the Conv-FBN structure in Sec. 5.1, and the DropFactor rate is set to 0.5. In training, we also follow some well-known strategies in “fb.resnet”, such as the data augmentations and the initialization method. The learning rate starts from 0.1 and is divided by 10 at 60, 75 and 90 epochs, for a total of 120 epochs.

We adopt two modern network structures, Inception-BN and ResNet, in this experiment. Table 6 shows their results compared with their FB variants. Relative to the original Inception-BN, our Inception-BN-FBN has a Top-1 error of 26.4%, which is 1.1% lower. ResNet-34-FBN and ResNet-50-FBN achieve 26.8% and 24.7% Top-1 error, improving by 1.4% and 0.7% over the baselines, respectively. The results demonstrate the effectiveness of our FB models on a large-scale dataset.

Results on Fine-grained Recognition Datasets

The original Bilinear Pooling methods only report results on fine-grained recognition applications, so we apply our FB models to two fine-grained datasets, CUB-200-2011 and the Describable Texture Dataset (DTD), for comparison. We use the same base network, VGG-16, in this experiment. Table 7 compares our method with bilinear pooling and two compact bilinear pooling methods (RM and TS). The results show that our FBN and the bilinear pooling methods all improve significantly over VGG-16. We also re-implement bilinear pooling under the same training setting as our FBN; it is fairer to compare these results (in the brackets) with our FBN. Note that our FBN also has a much lower cost in memory and computation than the bilinear pooling methods, as described in Sec. 4.

Conclusion and Future Work

In this paper, we have presented the Factorized Bilinear (FB) model to incorporate pairwise interactions of features in neural networks. The method has low cost in both memory and computation, and can be easily trained in an end-to-end manner. To prevent overfitting, we have further proposed a specific regularization method DropFactor by randomly dropping factors in FB layers. Our method achieves remarkable performance in several standard benchmarks, including CIFAR-10, CIFAR-100 and ImageNet.

In future work, we will go beyond the interactions inside features and explore generalizing the model to capture correlations between samples in more complicated tasks, such as face verification and re-identification.

Acknowledgement

This work was supported by the National Natural Science Foundation of China under Contract 61472011.

Supplementary: More Exploration Experiments of Factorized Bilinear Models

We study the effect of different kernel sizes in Factorized Bilinear (FB) Models. We also present comparisons with Dropout and our DropFactor in FB models.

Tab. 8 shows the results of different kernel sizes (1x1 and 3x3) for FB layers. We conduct experiments on the CIFAR-100 dataset with two FB networks, In5b-FBN and Conv-FBN, as described in the paper. We insert a 1x1 FB layer and a 3x3 FB layer, respectively, in both FBNs. The results with the 3x3 kernel size are still better than the baseline. This demonstrates that our FB models can generalize to model interactions with a larger kernel size. However, the 3x3 kernel also leads to more severe over-fitting than 1x1, at least on CIFAR-100, and has 9 times as many parameters as a 1x1 FB layer. Thus, incorporating a 1x1 FB layer is both more efficient and more effective.
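The factor of 9 follows from the size of the interaction weight, as the short Python sketch below illustrates. It assumes that every output channel has its own $k\times n$ interaction matrix with $n$ equal to the number of input channels times the kernel area. The function name and the channel counts are hypothetical.

```python
def fb_interaction_params(c_in, c_out, kernel, k):
    # Each output channel owns a k x n interaction matrix F,
    # where n = c_in * kernel * kernel is the im2col patch length.
    n = c_in * kernel * kernel
    return c_out * k * n

# Hypothetical layer: 256 input channels, 100 output channels, k = 20 factors.
params_1x1 = fb_interaction_params(256, 100, kernel=1, k=20)
params_3x3 = fb_interaction_params(256, 100, kernel=3, k=20)
ratio = params_3x3 // params_1x1
```

The ratio is 9 regardless of the channel counts or of $k$, since only the kernel area changes between the two layers.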

Comparisons with Dropout and DropFactor

Our DropFactor scheme shares a similar idea with Dropout, which is also a simple yet effective regularization to prevent over-fitting. We compare the performance of Dropout with our specifically designed DropFactor for Factorized Bilinear models. Tab. 9 shows the results of the two methods on the CIFAR-100 dataset. We adopt the Inception-BN and ResNet-164 networks as the base networks in this experiment. The FBN models are constructed by inserting the FB layers into the base networks. As shown in the table, Dropout and DropFactor both individually improve the performance over the original FBN model. DropFactor achieves even better results, and combining them does not bring further improvement. This demonstrates the effectiveness of the DropFactor scheme in reducing the over-fitting of FB models.