RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition

Xiaohan Ding, Chunlong Xia, Xiangyu Zhang, Xiaojie Chu, Jungong Han, Guiguang Ding

Introduction

The locality of images (i.e., a pixel is more related to its neighbors than the distant pixels) makes Convolutional Neural Network (ConvNet) successful in image recognition, as a conv layer only processes a local neighborhood. In this paper, we refer to this inductive bias as the local prior.

On top of that, we also desire the ability to capture long-range dependencies, which is referred to as the global capacity in this paper. Traditional ConvNets model long-range dependencies by the large receptive fields formed by deep stacks of conv layers. However, repeating local operations is computationally inefficient and may cause optimization difficulties. Some prior works enhance the global capacity with self-attention-based modules, which have no local prior. For example, ViT is a pure-Transformer model without convolution, which feeds images into the Transformers as a sequence. Due to the lack of local prior as an important inductive bias, ViT needs an enormous amount of training data ($3\times 10^{8}$ images in JFT-300M) to converge.

On the other hand, some images have intrinsic positional prior, which cannot be effectively utilized by a conv layer because it shares parameters among different positions. For example, when someone tries to unlock a cellphone via face recognition, the photo of the face is very likely to be centered and aligned so that the eyes appear at the top and the nose shows at the middle. We refer to the ability to utilize such positional prior as the positional perception.

This paper revisits fully-connected (FC) layers to provide traditional ConvNet with global capacity and positional perception. We directly use an FC as the transformation between feature maps to replace conv in some cases. By flattening a feature map, feeding it through FC, and reshaping back, we can enjoy the positional perception (because its parameters are position-related) and global capacity (because every output point is related to every input point). Such an operation is efficient in terms of both the actual speed and theoretical FLOPs, as shown in Table 4. For the application scenarios where the primary concerns are the accuracy and throughput but not the number of parameters, one may prefer FC-based models to traditional ConvNets. For example, GPU inference servers usually have tens of GBs of memory, so the memory occupied by the parameters is minor compared to that consumed by the computations and internal feature maps.

However, an FC has no local prior because the spatial information is lost. In this paper, we propose to incorporate local prior into FC with a structural re-parameterization technique. Specifically, we construct conv and batch normalization (BN) layers parallel to the FC during training, then merge the trained parameters into the FC to reduce the number of parameters and latency for inference. Based on that, we propose a re-parameterized multi-layer perceptron (RepMLP). As shown in Fig. 1, the training-time RepMLP has FC, conv, and BN layers but can be equivalently converted into an inference-time block with only three FC layers. The meaning of structural re-parameterization is that the training-time model has a set of parameters while the inference-time model has another set, and we parameterize the latter with the parameters transformed from the former. Note that we do not derive the parameters before each inference. Instead, we convert it once for all, and then the training-time model can be discarded.

Compared to conv, RepMLP runs faster under the same number of parameters and has global capacity and positional perception. Compared to a self-attention module, it is simpler and can utilize the locality of images. As shown in our experiments (Tables 4, 5, 6), RepMLP outperforms the traditional ConvNets in a variety of vision tasks, including 1) general classification (ImageNet), 2) a task with positional prior (face recognition) and 3) a task with translation invariance (semantic segmentation).

Our contributions are summarized as follows.

We propose to utilize the global capacity and positional perception of FC and equip it with local prior for image recognition.

We propose a simple, platform-agnostic and differentiable algorithm to merge the parallel conv and BN into FC for the local prior without any inference-time costs.

We propose RepMLP, an efficient building block, and show its effectiveness on multiple vision tasks.

Related Work

Non-local Network proposed to model the long-range dependencies via the self-attention mechanism. For each query position, the non-local module first computes the pairwise relations between the query position and all positions to form an attention map and then aggregates the features of all the positions by a weighted sum with the weights defined by the attention map. Then the aggregated features are added to the features of each query position.

GCNet created a simplified network based on a query-independent formulation, which maintains the accuracy of Non-local Network with less computation. The input to a GC block goes through a global attention pooling, feature transform (a $1\times 1$ conv), and feature aggregation.

Compared to these works, RepMLP is simpler as it uses no self-attention and contains only three FC layers. As will be shown in Table 4, RepMLP improves the performance of ResNet-50 more than the Non-local module and GC block.

2 Structural Re-parameterization

In this paper, structural re-parameterization refers to constructing the conv and BN layers parallel to an FC for training and then merging the parameters into the FC for inference. The following two prior works can also be categorized into structural re-parameterization.

Asymmetric Convolution Block (ACB) is a replacement for regular conv layers, which uses horizontal (e.g., $1\times 3$) and vertical ($3\times 1$) conv to strengthen the “skeleton” of a square ($3\times 3$) conv. Reasonable performance improvements are reported on several ConvNet benchmarks.

RepVGG is a VGG-like architecture, as its body uses only $3\times 3$ conv and ReLU for inference. Such an inference-time architecture is converted from a training-time architecture with identity and $1\times 1$ branches.

RepMLP is more related to ACB since they are both neural network building blocks, but our contributions are not about making convolutions stronger but making MLP powerful for image recognition as a replacement for regular conv. Besides, the training-time convolutions inside RepMLP may be enhanced by ACB, RepVGG block, or other forms of convolution for further improvements.

RepMLP

A training-time RepMLP is composed of three parts termed as Global Perceptron, Partition Perceptron and Local Perceptron (Fig. 1). In this section, we introduce our formulation, describe every component, and show how to convert a training-time RepMLP into three FC layers for inference, where the key is a simple, platform-agnostic and differentiable method for merging a conv into an FC.

In this paper, a feature map is denoted by a tensor $\mathrm{M}\in\mathbb{R}^{N\times C\times H\times W}$, where $N$ is the batch size, $C$ is the number of channels, and $H$ and $W$ are the height and width, respectively. We use $\mathrm{F}$ and $\mathrm{W}$ for the kernel of conv and FC, respectively. For simplicity and ease of re-implementation, we use the same data format as PyTorch and formulate the transformations in a pseudo-code style. For example, the data flow through a $K\times K$ conv is formulated as

$$\mathrm{M}^{(\text{out})}=\text{CONV}(\mathrm{M}^{(\text{in})},\mathrm{F},p)\,, \tag{1}$$

where $\mathrm{M}^{(\text{out})}\in\mathbb{R}^{N\times O\times H^{\prime}\times W^{\prime}}$ is the output feature map, $O$ is the number of output channels, $p$ is the number of pixels to pad, and $\mathrm{F}\in\mathbb{R}^{O\times C\times K\times K}$ is the conv kernel (we temporarily assume the conv is dense, i.e., the number of groups is 1). From now on, we assume $H^{\prime}=H,W^{\prime}=W$ for simplicity (i.e., the stride is 1 and $p=\lfloor\frac{K}{2}\rfloor$).

For an FC, let $P$ and $Q$ be the input and output dimensions, and $\mathrm{V}^{(\text{in})}\in\mathbb{R}^{N\times P}$ and $\mathrm{V}^{(\text{out})}\in\mathbb{R}^{N\times Q}$ be the input and output, respectively. The kernel is $\mathrm{W}\in\mathbb{R}^{Q\times P}$ and the matrix multiplication (MMUL) is formulated as

$$\mathrm{V}^{(\text{out})}=\text{MMUL}(\mathrm{V}^{(\text{in})},\mathrm{W})=\mathrm{V}^{(\text{in})}\cdot\mathrm{W}^{\intercal}\,. \tag{2}$$

We now focus on an FC that takes $\mathrm{M}^{(\text{in})}$ as input and outputs $\mathrm{M}^{(\text{out})}$. We assume the FC does not change the resolution, i.e., $H^{\prime}=H,W^{\prime}=W$. We use RS (short for “reshape”) as the function that only changes the shape specification of tensors but not the order of data in memory, which is cost-free. The input is first flattened into $N$ vectors of length $CHW$, which is $\mathrm{V}^{(\text{in})}=\text{RS}(\mathrm{M}^{(\text{in})},(N,CHW))$, multiplied by the kernel $\mathrm{W}(OHW,CHW)$, then the output $\mathrm{V}^{(\text{out})}(N,OHW)$ is reshaped back into $\mathrm{M}^{(\text{out})}(N,O,H,W)$. For better readability, we omit the RS if there is no ambiguity,

$$\mathrm{M}^{(\text{out})}=\text{MMUL}(\mathrm{M}^{(\text{in})},\mathrm{W})\,. \tag{3}$$

Such an FC cannot take advantage of the locality of images as it computes each output point according to every input point, unaware of the positional information.
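The flatten, multiply and reshape steps described above can be sketched in a few lines. This is a minimal NumPy illustration rather than the authors' PyTorch code; the function name `fc_on_feature_map` and the toy sizes are assumptions made only for the example.

```python
import numpy as np

def fc_on_feature_map(m_in, weight, out_channels):
    """Apply a dense FC to a feature map: flatten, multiply, reshape back.

    m_in:   (N, C, H, W) feature map
    weight: (O*H*W, C*H*W) FC kernel in the (out_features, in_features) layout
    """
    n, c, h, w = m_in.shape
    v_in = m_in.reshape(n, c * h * w)          # RS: only the shape changes
    v_out = v_in @ weight.T                    # MMUL(V_in, W) = V_in . W^T
    return v_out.reshape(n, out_channels, h, w)

rng = np.random.default_rng(0)
N, C, O, H, W = 2, 3, 4, 5, 5                  # toy sizes (hypothetical)
m_in = rng.standard_normal((N, C, H, W))
weight = rng.standard_normal((O * H * W, C * H * W))
m_out = fc_on_feature_map(m_in, weight, O)
print(m_out.shape)                             # (2, 4, 5, 5)
```

Every output value depends on every input value of the same sample, which is the global capacity discussed above; nothing in the kernel favors nearby positions.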

2 Components of RepMLP

We do not use FC in the above-mentioned manner because of not only the lack of local prior but also the huge number of parameters, which is $COH^{2}W^{2}$. With the common settings, e.g., $H=W=28,C=O=128$ on ImageNet, this single FC would have 10G parameters, which is clearly unacceptable. To reduce the parameters, we propose Global Perceptron and Partition Perceptron to model the inter- and intra-partition dependencies separately.
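The 10G figure follows directly from the kernel shape. The short calculation below reproduces it and, as an assumed illustration, compares it with the count for $7\times 7$ partitions (the partition size used later in the ImageNet experiments); `dense_fc_params` is a hypothetical helper name.

```python
def dense_fc_params(c, o, h, w):
    """Kernel size of an FC mapping (C, H, W) to (O, H, W): (O*H*W) x (C*H*W)."""
    return (o * h * w) * (c * h * w)

full = dense_fc_params(c=128, o=128, h=28, w=28)
partitioned = dense_fc_params(c=128, o=128, h=7, w=7)   # assumed 7x7 partitions

print(full)                  # 10070523904, about 10G
print(partitioned)           # 39337984
print(full // partitioned)   # 256, i.e. (28/7)**4
```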

Global Perceptron splits up the feature map so that different partitions can share parameters. For example, an $(N,C,14,14)$ input can be split into $(4N,C,7,7)$, and we refer to every $7\times 7$ block as a partition. We use an efficient implementation for such splitting with a single operation of memory re-arrangement. Let $h$ and $w$ be the desired height and width of every partition (we assume $H,W$ are divisible by $h,w$ respectively; otherwise we can simply pad the input). The input $\mathrm{M}\in\mathbb{R}^{N\times C\times H\times W}$ is first reshaped into $(N,C,\frac{H}{h},h,\frac{W}{w},w)$. Note that this operation is cost-free as it does not move data in memory. Then we re-arrange the order of axes as $(N,\frac{H}{h},\frac{W}{w},C,h,w)$, which moves the data in memory efficiently. For example, it requires only one function call (permute) in PyTorch. Then the $(N,\frac{H}{h},\frac{W}{w},C,h,w)$ tensor is reshaped (which is cost-free again) as $(\frac{NHW}{hw},C,h,w)$ (noted as a partition map in Fig. 1). In this way, the number of parameters required is reduced from $COH^{2}W^{2}$ to $COh^{2}w^{2}$.
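The reshape, re-arrange and reshape sequence can be sketched as follows. This is a NumPy approximation of the idea, with `transpose` standing in for PyTorch's `permute`; the helper names `to_partitions` and `from_partitions` are hypothetical, and the inverse is included because the Partition Perceptron output is mapped back in the same way.

```python
import numpy as np

def to_partitions(m, h, w):
    """Split (N, C, H, W) into a partition map of shape (N*H*W/(h*w), C, h, w)."""
    n, c, H, W = m.shape
    assert H % h == 0 and W % w == 0, "pad the input first if not divisible"
    x = m.reshape(n, c, H // h, h, W // w, w)   # view only, no data movement
    x = x.transpose(0, 2, 4, 1, 3, 5)           # (N, H/h, W/w, C, h, w)
    # NumPy's transpose is itself a view; the copy happens in this reshape.
    return x.reshape(-1, c, h, w)

def from_partitions(p, n, H, W):
    """Inverse of to_partitions: back to (N, O, H, W)."""
    _, o, h, w = p.shape
    x = p.reshape(n, H // h, W // w, o, h, w)
    x = x.transpose(0, 3, 1, 4, 2, 5)           # (N, O, H/h, h, W/w, w)
    return x.reshape(n, o, H, W)

m = np.arange(2 * 3 * 14 * 14, dtype=np.float64).reshape(2, 3, 14, 14)
parts = to_partitions(m, 7, 7)
print(parts.shape)                              # (8, 3, 7, 7)
```

With this axis order, partition index 1 holds the top-right $7\times 7$ block of the first sample, so the partitions of one image stay contiguous along the first axis.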

However, splitting breaks the correlations among different partitions of the same channel. In other words, the model will view the partitions separately, totally unaware that they were positioned side by side. To add correlations onto each partition, Global Perceptron 1) uses average pooling to obtain a pixel for each partition, 2) feeds it through BN and a two-layer MLP, then 3) reshapes and adds it onto the partition map. This addition can be efficiently implemented with automatic broadcasting (i.e., implicitly replicating $(\frac{NHW}{hw},C,1,1)$ into $(\frac{NHW}{hw},C,h,w)$) so that every pixel is related to the other partitions. Then the partition map is fed into Partition Perceptron and Local Perceptron. Note that if $H=h,W=w$, we directly feed the input feature map into Partition Perceptron and Local Perceptron without splitting, hence there will be no Global Perceptron.
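A rough sketch of the three steps (pool, MLP, broadcast-add) is given below. It is an illustration under several assumptions rather than the authors' implementation: the BN is omitted, the MLP is FC, ReLU, FC with random weights, and the pooled values of all partitions of one image are concatenated in (partition, channel) order, which is only one possible layout. The name `global_perceptron` and all sizes are hypothetical.

```python
import numpy as np

def global_perceptron(parts, n, w1, b1, w2, b2):
    """Add inter-partition information onto a partition map.

    parts: (n * P, C, h, w) partition map with P partitions per image
    w1, b1, w2, b2: weights of an assumed two-layer MLP (FC1, ReLU, FC2)
    The BN that precedes the MLP in the text is left out for brevity.
    """
    total, c, h, w = parts.shape
    p = total // n
    pooled = parts.mean(axis=(2, 3))            # one value per partition and channel
    v = pooled.reshape(n, p * c)                # all partitions of an image in one vector
    hidden = np.maximum(v @ w1.T + b1, 0.0)     # FC1 + ReLU
    v = hidden @ w2.T + b2                      # FC2
    v = v.reshape(total, c, 1, 1)
    return parts + v                            # broadcast over the h x w positions

rng = np.random.default_rng(0)
n, p, c, h, w, hidden = 2, 4, 3, 7, 7, 8        # toy sizes (hypothetical)
parts = rng.standard_normal((n * p, c, h, w))
w1, b1 = rng.standard_normal((hidden, p * c)), np.zeros(hidden)
w2, b2 = rng.standard_normal((p * c, hidden)), np.zeros(p * c)
out = global_perceptron(parts, n, w1, b1, w2, b2)
print(out.shape)                                # (8, 3, 7, 7)
```

Because the MLP sees the pooled values of every partition of an image, the constant added to each partition carries information from the others, which is how the correlations are restored.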

Partition Perceptron has an FC and a BN layer, which takes the partition map. The output $(\frac{NHW}{hw},O,h,w)$ is reshaped, re-arranged and reshaped in the inverse order as before into $(N,O,H,W)$. We further reduce the parameters of FC3 inspired by groupwise conv. With $g$ as the number of groups, we formulate the groupwise conv as

$$\mathrm{M}^{(\text{out})}=\text{gCONV}(\mathrm{M}^{(\text{in})},\mathrm{F},g,p)\,,\quad\mathrm{F}\in\mathbb{R}^{O\times\frac{C}{g}\times K\times K}\,. \tag{4}$$

Similarly, the kernel of groupwise FC is $\mathrm{W}\in\mathbb{R}^{Q\times\frac{P}{g}}$, which has $g\times$ fewer parameters. Though groupwise FC is not directly supported by some computing frameworks like PyTorch, it can be alternatively implemented by a groupwise $1\times 1$ conv. The implementation is composed of three steps: 1) reshaping $\mathrm{V}^{(\text{in})}$ as a “feature map” with spatial size of $1\times 1$; 2) performing $1\times 1$ conv with $g$ groups; 3) reshaping the output “feature map” into $\mathrm{V}^{(\text{out})}$. We formulate the groupwise matrix multiplication (gMMUL) as

$$\mathrm{M}^{\prime}=\text{RS}(\mathrm{V}^{(\text{in})},(N,P,1,1))\,,\quad\mathrm{F}^{\prime}=\text{RS}(\mathrm{W},(Q,\tfrac{P}{g},1,1))\,,$$

$$\text{gMMUL}(\mathrm{V}^{(\text{in})},\mathrm{W},g)=\text{RS}(\text{gCONV}(\mathrm{M}^{\prime},\mathrm{F}^{\prime},g,0),(N,Q))\,. \tag{5}$$
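To make the groupwise FC concrete, the sketch below computes it with one small matrix multiplication per group and checks it against a dense FC whose kernel is block-diagonal. Note the swap: it uses `np.einsum` instead of the grouped $1\times 1$ conv described in the text, assuming the usual grouping convention in which both input and output features are split into $g$ contiguous chunks. The name `gmmul` and the sizes are hypothetical.

```python
import numpy as np

def gmmul(v_in, weight, g):
    """Groupwise matrix multiplication.

    v_in:   (N, P) input
    weight: (Q, P // g) kernel; rows [k*Q/g, (k+1)*Q/g) belong to group k
    Returns an (N, Q) output.
    """
    n, p = v_in.shape
    q = weight.shape[0]
    assert p % g == 0 and q % g == 0
    x = v_in.reshape(n, g, p // g)               # split the input features into g groups
    k = weight.reshape(g, q // g, p // g)        # one small dense kernel per group
    out = np.einsum("ngp,gqp->ngq", x, k)        # independent MMUL in every group
    return out.reshape(n, q)

rng = np.random.default_rng(0)
n, p, q, g = 3, 8, 6, 2                          # toy sizes (hypothetical)
v_in = rng.standard_normal((n, p))
weight = rng.standard_normal((q, p // g))

# Equivalent dense kernel: block-diagonal, zeros outside the groups.
dense = np.zeros((q, p))
for k in range(g):
    rows = slice(k * q // g, (k + 1) * q // g)
    cols = slice(k * p // g, (k + 1) * p // g)
    dense[rows, cols] = weight[rows]

print(np.allclose(gmmul(v_in, weight, g), v_in @ dense.T))   # True
print(weight.size, dense.size)                               # 24 48
```

The groupwise kernel stores only the non-zero blocks of the dense kernel, hence the $g\times$ saving.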

Local Perceptron feeds the partition map through several conv layers. A BN follows every conv, as inspired by prior work. Fig. 1 shows an example of $h,w>7$ and $K=1,3,5,7$. Theoretically, the only constraint on the kernel size $K$ is $K\leq h,w$ (because it does not make sense to use kernels larger than the resolution), but we only use odd kernel sizes as a common practice in ConvNet. We use $K\times K$ just for the simplicity of notation and a non-square conv (e.g., $1\times 3$ or $3\times 5$) also works. The padding of conv should be configured to maintain the resolution (e.g., $p=0,1,2,3$ for $K=1,3,5,7$, respectively), and the number of groups $g$ should be the same as the Partition Perceptron. The outputs of all the conv branches and Partition Perceptron are added up as the final output.
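The padding rule can be checked with the standard output-size formula for a conv without dilation. The helper `conv_output_size` is a hypothetical name, and the partition size 7 is only an example.

```python
def conv_output_size(size, k, p, stride=1):
    """Spatial output size of a conv along one axis (no dilation)."""
    return (size + 2 * p - k) // stride + 1

h = 7
for k in (1, 3, 5, 7):
    p = k // 2
    print(k, p, conv_output_size(h, k, p))   # the output size stays 7 for every odd K

# An even kernel with p = K // 2 changes the resolution, so the branch outputs
# could no longer be added to the Partition Perceptron output.
print(conv_output_size(h, 4, 2))             # 8
```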

3 A Simple, Platform-agnostic, Differentiable Algorithm for Merging Conv into FC

Before converting a RepMLP into three FC layers, we first show how to merge a conv into FC. With the FC kernel $\mathrm{W}^{(1)}(Ohw,Chw)$, conv kernel $\mathrm{F}(O,C,K,K)$ ($K\leq h,w$) and padding $p$, we desire to construct $\mathrm{W}^{\prime}$ so that

$$\text{MMUL}(\mathrm{M}^{(\text{in})},\mathrm{W}^{\prime})=\text{MMUL}(\mathrm{M}^{(\text{in})},\mathrm{W}^{(1)})+\text{CONV}(\mathrm{M}^{(\text{in})},\mathrm{F},p)\,. \tag{6}$$

We note that for any kernel $\mathrm{W}^{(2)}$ of the same shape as $\mathrm{W}^{(1)}$, the additivity of MMUL ensures that

$$\text{MMUL}(\mathrm{M}^{(\text{in})},\mathrm{W}^{(1)})+\text{MMUL}(\mathrm{M}^{(\text{in})},\mathrm{W}^{(2)})=\text{MMUL}(\mathrm{M}^{(\text{in})},\mathrm{W}^{(1)}+\mathrm{W}^{(2)})\,, \tag{7}$$

so we can merge $\mathrm{F}$ into $\mathrm{W}^{(1)}$ as long as we manage to construct $\mathrm{W}^{(\mathrm{F},p)}$ of the same shape as $\mathrm{W}^{(1)}$ which satisfies

$$\text{MMUL}(\mathrm{M}^{(\text{in})},\mathrm{W}^{(\mathrm{F},p)})=\text{CONV}(\mathrm{M}^{(\text{in})},\mathrm{F},p)\,. \tag{8}$$

Obviously, $\mathrm{W}^{(\mathrm{F},p)}$ must exist, since a conv can be viewed as a sparse FC that shares parameters among spatial positions, which is exactly the source of its translation invariance, but how to construct it from given $\mathrm{F}$ and $p$ is not obvious. As modern computing platforms use different algorithms of convolution (e.g., im2col-, Winograd-, FFT-, MEC-, and sliding-window-based) and the memory allocation of data and implementations of padding may be different, a means for constructing the matrix on a specific platform may not work on another platform. In this paper, we propose a simple and platform-agnostic solution.

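As discussed above, for any input $\mathrm{M}^{(\text{in})}$, conv kernel $\mathrm{F}$ and padding $p$, there exists an FC kernel $\mathrm{W}^{(\mathrm{F},p)}$ such that

$$\mathrm{M}^{(\text{out})}=\text{CONV}(\mathrm{M}^{(\text{in})},\mathrm{F},p)=\text{MMUL}(\mathrm{M}^{(\text{in})},\mathrm{W}^{(\mathrm{F},p)})\,. \tag{9}$$

With the formulation used before (Eq. 2), we have

$$\mathrm{V}^{(\text{out})}=\mathrm{V}^{(\text{in})}\cdot\mathrm{W}^{(\mathrm{F},p)\intercal}\,. \tag{10}$$

We insert an identity matrix $\mathrm{I}$ $(Chw,Chw)$ and use the associative law

$$\mathrm{V}^{(\text{out})}=\mathrm{V}^{(\text{in})}\cdot(\mathrm{I}\cdot\mathrm{W}^{(\mathrm{F},p)\intercal})\,. \tag{11}$$

We note that because $\mathrm{W}^{(\mathrm{F},p)}$ is constructed with $\mathrm{F}$, $\mathrm{I}\cdot\mathrm{W}^{(\mathrm{F},p)\intercal}$ is a convolution with $\mathrm{F}$ on a feature map $\mathrm{M}^{(\mathrm{I})}$ which is reshaped from $\mathrm{I}$. With explicit RS, we have

$$\mathrm{M}^{(\mathrm{I})}=\text{RS}(\mathrm{I},(Chw,C,h,w))\,, \tag{12}$$

$$\mathrm{I}\cdot\mathrm{W}^{(\mathrm{F},p)\intercal}=\text{CONV}(\mathrm{M}^{(\mathrm{I})},\mathrm{F},p)\,, \tag{13}$$

$$\mathrm{V}^{(\text{out})}=\mathrm{V}^{(\text{in})}\cdot\text{RS}(\mathrm{I}\cdot\mathrm{W}^{(\mathrm{F},p)\intercal},(Chw,Ohw))\,. \tag{14}$$

Comparing Eq. 10 with Eq. 13, 14, we have

$$\mathrm{W}^{(\mathrm{F},p)}=\text{RS}(\text{CONV}(\mathrm{M}^{(\mathrm{I})},\mathrm{F},p),(Chw,Ohw))^{\intercal}\,, \tag{15}$$

which is exactly the expression we desire for constructing $\mathrm{W}^{(\mathrm{F},p)}$ with $\mathrm{F},p$. In short, the equivalent FC kernel of a conv kernel is the result of convolution on an identity matrix with proper reshaping. Better still, the conversion is efficient and differentiable, so one may derive the FC kernel during training and use it in the objective function (e.g., for penalty-based pruning). The expression and code for the groupwise case are derived in a similar way and provided in the supplementary material.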
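The "convolution on an identity matrix" recipe can be verified numerically. The sketch below is a NumPy illustration for the dense case (one group), not the authors' PyTorch code: `conv2d` is a naive stride-1 cross-correlation written for the example, `conv_to_fc` is a hypothetical name, and the sizes are arbitrary. NumPy does not track gradients, so this only checks the equivalence, not the differentiability.

```python
import numpy as np

def conv2d(m, f, p):
    """Naive stride-1 cross-correlation (the semantics of PyTorch's conv2d), 1 group.

    m: (N, C, H, W), f: (O, C, K, K), zero padding of p pixels on each side.
    """
    n, c, h, w = m.shape
    o, _, k, _ = f.shape
    x = np.pad(m, ((0, 0), (0, 0), (p, p), (p, p)))
    ho, wo = h + 2 * p - k + 1, w + 2 * p - k + 1
    out = np.zeros((n, o, ho, wo))
    for i in range(k):
        for j in range(k):
            patch = x[:, :, i:i + ho, j:j + wo]               # (N, C, ho, wo)
            out += np.einsum("nchw,oc->nohw", patch, f[:, :, i, j])
    return out

def conv_to_fc(f, p, c, h, w):
    """FC kernel equivalent to the conv: convolve a reshaped identity matrix."""
    o = f.shape[0]
    m_i = np.eye(c * h * w).reshape(c * h * w, c, h, w)       # identity as a feature map
    return conv2d(m_i, f, p).reshape(c * h * w, o * h * w).T  # (O*h*w, C*h*w)

rng = np.random.default_rng(0)
N, C, O, h, w, K, p = 2, 3, 4, 5, 5, 3, 1                     # toy sizes (hypothetical)
m = rng.standard_normal((N, C, h, w))
f = rng.standard_normal((O, C, K, K))

W_fc = conv_to_fc(f, p, C, h, w)
via_conv = conv2d(m, f, p).reshape(N, -1)
via_fc = m.reshape(N, -1) @ W_fc.T
print(W_fc.shape)                        # (100, 75)
print(np.allclose(via_conv, via_fc))     # True
```

Each row of the reshaped identity is a one-hot input, so convolving it yields one column of the FC kernel; the construction only calls the platform's own conv, which is why it does not depend on how that conv is implemented.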

4 Converting RepMLP into Three FC Layers

To use the theory presented above, we need to first eliminate the BN layers by equivalently fusing them into the preceding conv layers and FC3. Let $\mathrm{F}\in\mathbb{R}^{O\times\frac{C}{g}\times K\times K}$ be the conv kernel, and $\boldsymbol{\mu},\boldsymbol{\sigma},\boldsymbol{\gamma},\boldsymbol{\beta}\in\mathbb{R}^{O}$ be the accumulated mean, standard deviation and learned scaling factor and bias of the following BN. We construct the kernel $\mathrm{F}^{\prime}$ and bias $\mathbf{b}^{\prime}$ as

$$\mathrm{F}^{\prime}_{i,:,:,:}=\frac{\boldsymbol{\gamma}_{i}}{\boldsymbol{\sigma}_{i}}\mathrm{F}_{i,:,:,:}\,,\quad\mathbf{b}^{\prime}_{i}=-\frac{\boldsymbol{\mu}_{i}\boldsymbol{\gamma}_{i}}{\boldsymbol{\sigma}_{i}}+\boldsymbol{\beta}_{i}\,. \tag{16}$$

Then it is easy to verify the equivalence:

$$\frac{\boldsymbol{\gamma}_{i}}{\boldsymbol{\sigma}_{i}}(\text{CONV}(\mathrm{M},\mathrm{F},p)_{:,i,:,:}-\boldsymbol{\mu}_{i})+\boldsymbol{\beta}_{i}=\text{CONV}(\mathrm{M},\mathrm{F}^{\prime},p)_{:,i,:,:}+\mathbf{b}^{\prime}_{i}\,,\quad\forall 1\leq i\leq O\,, \tag{17}$$

where the left side is the original computation flow of a conv-BN, and the right is the constructed conv with bias.
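The fusion rule can be checked numerically. The sketch below is a NumPy illustration with a $1\times 1$ conv written as an `einsum`, chosen only to keep the example short, since the BN acts per output channel regardless of kernel size. The names `fuse_conv_bn` and `conv1x1` are hypothetical; `std` stands for the BN's standard deviation, which frameworks such as PyTorch typically store as a running variance plus a small epsilon.

```python
import numpy as np

def fuse_conv_bn(f, mean, std, gamma, beta):
    """Fold the following BN into the conv kernel; returns (F', b')."""
    scale = gamma / std                                   # one factor per output channel
    return f * scale[:, None, None, None], beta - mean * scale

def conv1x1(x, kernel):
    """1x1 conv without bias: x is (N, C, H, W), kernel is (O, C, 1, 1)."""
    return np.einsum("nchw,oc->nohw", x, kernel[:, :, 0, 0])

rng = np.random.default_rng(0)
n, c, o, h, w = 2, 3, 4, 5, 5                             # toy sizes (hypothetical)
m = rng.standard_normal((n, c, h, w))
f = rng.standard_normal((o, c, 1, 1))
mean, beta = rng.standard_normal(o), rng.standard_normal(o)
std, gamma = rng.uniform(0.5, 2.0, o), rng.standard_normal(o)

f_fused, b_fused = fuse_conv_bn(f, mean, std, gamma, beta)

# Left side: conv followed by BN. Right side: fused conv with bias.
scale = (gamma / std)[None, :, None, None]
conv_bn = scale * (conv1x1(m, f) - mean[None, :, None, None]) + beta[None, :, None, None]
fused = conv1x1(m, f_fused) + b_fused[None, :, None, None]
print(np.allclose(conv_bn, fused))                        # True
```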

The 1D BN and FC3 of Partition Perceptron are fused in a similar way into $\hat{\mathrm{W}}\in\mathbb{R}^{Ohw\times\frac{Chw}{g}}$, $\hat{\mathbf{b}}\in\mathbb{R}^{Ohw}$. Then we convert every conv via Eq. 15 and add the resultant matrix onto $\hat{\mathrm{W}}$. The biases of conv are simply replicated $hw$ times (because all the points on the same channel share a bias value) and added onto $\hat{\mathbf{b}}$. Finally, we obtain a single FC kernel and a single bias vector, which will be used to parameterize the inference-time FC3.
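The bias replication depends on the flattening order: since the FC output is laid out as (channel, row, column), each per-channel bias must be repeated for a run of consecutive positions. A minimal NumPy sketch, with the hypothetical helper name `conv_bias_to_fc_bias` and toy values:

```python
import numpy as np

def conv_bias_to_fc_bias(b, h, w):
    """Repeat each per-channel conv bias h*w times to match the (O*h*w,) FC bias."""
    return np.repeat(b, h * w)

b = np.array([0.5, -1.0])                  # biases of two output channels
fc_b = conv_bias_to_fc_bias(b, 2, 2)
print(fc_b)                                # [ 0.5  0.5  0.5  0.5 -1.  -1.  -1.  -1. ]

# Same result as broadcasting the bias over an (O, h, w) map and flattening it.
print(np.array_equal(fc_b, (b[:, None, None] * np.ones((2, 2, 2))).reshape(-1)))  # True
```

Tiling the whole bias vector (as `np.tile` would) gives a different order and would assign biases to the wrong channels.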

The BN in Global Perceptron is also removed because the removal is equivalent to applying an affine transformation before FC1, which can be absorbed by FC1 as two sequential MMULs can be merged into one. The formulas and code are provided in the supplementary material.

5 RepMLP-ResNet

The design of RepMLP and the methodology of re-parameterizing conv into FC are generic and hence may be used in numerous models, including traditional CNNs and the concurrently proposed all-MLP models, e.g., MLP-Mixer, ResMLP, gMLP, AS-MLP, etc. In this paper, we use RepMLP in ResNet for most of our experiments because this work was finished before the publication of all the above-mentioned all-MLP models. The application of RepMLP on the all-MLP models is scheduled as our future work.

In order to use RepMLP in ResNet, we follow the bottleneck design principle of ResNet-50 to reduce the channels by $4\times$ via $1\times 1$ conv. Moreover, we further perform $r\times$ channel reduction before RepMLP and $r\times$ channel expansion afterwards via $3\times 3$ conv. The whole block is termed as RepMLP Bottleneck (Fig. 4). For a specific stage, we replace all the stride-1 bottlenecks with RepMLP Bottlenecks and keep the original stride-2 (i.e., the first) bottleneck.

The design of RepMLP Bottleneck is relevant to GLFP Module, which uses a bottleneck structure with $1\times 1$, $3\times 3$ conv and FC for human face recognition, but the differences are significant. 1) GLFP directly flattens the input feature maps as vectors then feeds them into the FC layer, which is novel and insightful but may be inefficient on tasks with large input resolution such as ImageNet classification and semantic segmentation. In contrast, RepMLP partitions the input feature maps and uses Global Perceptron to add the global information. 2) GLFP uses a $3\times 3$ conv branch parallel to the $1\times 1$-FC-$3\times 3$ branch to capture the local patterns. Unlike the Local Perceptron of RepMLP that can be merged into the FC for inference, the conv branch of GLFP is essential for both training and inference. 3) There are some differences in the topology (e.g., addition vs. concatenation). It should be noted again that the core contribution of this paper is not the solution to insert RepMLP into ResNet but the methodology of re-parameterizing conv into FC and the three components of RepMLP.

Experiments

We first verify the effectiveness of RepMLP by testing a pure MLP model on CIFAR-10. More precisely, since an FC is equivalent to a $1\times 1$ conv, by “pure MLP” we mean no usage of conv kernels bigger than $1\times 1$. We interleave RepMLP and regular FC ($1\times 1$ conv) to construct three stages and downsample by max pooling, as shown in Fig. 3, and construct a ConvNet counterpart for comparison by replacing the RepMLPs with $3\times 3$ conv. For comparable FLOPs, the channels of the three stages are 16,32,64 for the pure MLP and 32,64,128 for the ConvNet, so the latter is named Wide ConvNet. We adopt the standard data augmentation: padding to $40\times 40$, random cropping and left-right flipping. The models are trained with a batch size of 128 and a cosine learning rate annealing from 0.2 to 0 in 100 epochs. As shown in Table 1, the pure MLP model reaches 91.11% accuracy with only 52.8M FLOPs. Not surprisingly, the pure MLP model does not outperform the Wide ConvNet, motivating us to combine RepMLP and traditional ConvNet.

Then we conduct a series of ablation studies. A) We also report the FLOPs of the MLP before the conversion, which still contains conv and BN layers. The FLOPs are much higher though the extra parameters are marginal, which shows the significance of structural re-parameterization. B) “w/o Local” is a variant with no Local Perceptron, and the accuracy is 8.5% lower, which shows the significance of local prior. C) “w/o Global” removes FC1 and FC2 and directly feeds the partition map into Local Perceptron and Partition Perceptron. D) “FC3 as conv9” replaces FC3 with a conv ($K=9$ and $p=4$, so that its receptive field is larger than FC3) followed by BN to compare the representational capacity of FC3 to a regular conv. Though the comparison is biased towards conv because its receptive field is larger, its accuracy is 3.5% lower, which validates that FC is more powerful than conv since a conv is a degraded FC. E) “RepMLP as conv9” directly replaces the RepMLP with a $9\times 9$ conv and BN. Compared to D, its accuracy is lower as it has no Global Perceptrons.

2 RepMLP-ResNet for ImageNet Classification

We take ResNet-50 (the torchvision version) as the base architecture to evaluate RepMLP as a building block in traditional ConvNet. For a fair comparison, all the models are trained with identical settings in 100 epochs: global batch size of 256 on 8 GPUs, weight decay of $10^{-4}$, momentum of 0.9, and cosine learning rate annealing from 0.1 to 0. We use mixup and a data augmentation pipeline of Autoaugment, random cropping and flipping. All the models are evaluated with single central crop and the speed is tested on the same 1080Ti GPU with a batch size of 128 and measured in examples/second. For a fair comparison, the RepMLPs are converted and all the original conv-BN structures of every model are also converted into conv layers with bias for the speed tests.

As a common practice, we refer to the four residual stages of ResNet-50 as c2, c3, c4, c5, respectively. With $224\times 224$ input, the output resolutions of the four stages are $56,28,14,7$, and the $3\times 3$ conv layers in the four stages have $C=O=64,128,256,512$, respectively. To replace the big $3\times 3$ conv layers with RepMLP, we use $h=w=7$ and three conv branches in the Local Perceptron with $K=1,3,5$.

We begin by using RepMLP in c4 only and varying the hyper-parameters $r$ and $g$ to test how they influence the accuracy, speed, and number of parameters (Table 2). Notably, with an aggressive $8\times$ reduction (so that the input and output channels of RepMLP are $256/8=32$), RepMLP-Res50 has fewer parameters and runs 10% faster than ResNet-50. The comparison between the first two rows suggests that the current groupwise $1\times 1$ conv is inefficient, as the parameters increase by 59% but the speed decreases by only 0.7%. Further optimizations on groupwise $1\times 1$ conv may make RepMLP more efficient. In the following experiments, we use $r=2$ or 4 and $g=4$ or 8 for a better trade-off.

We continue to test RepMLP in different stages. Specifically, we set $g=8$ and $r=2,2,4,4$ for c2,c3,c4,c5, respectively, for reasonable model sizes. Table 3 shows that replacing the original bottlenecks with RepMLP Bottlenecks causes very minor slowdown, and the accuracy is significantly improved. Using RepMLP only on c4 brings only 5M more parameters but 0.94% higher accuracy, and using RepMLP in c3 and c4 offers the best trade-off. It also suggests that RepMLP should be combined with traditional conv for the best performance, as using it in all the four stages delivers lower accuracy than c2+c3+c4 and c3+c4. We use RepMLP in c3+c4 in the following experiments.

The comparisons to the larger traditional ConvNets with higher input resolution (Table 4) further justify the effectiveness of RepMLP and offer some interesting discoveries. When trained and tested with $320\times 320$ inputs, we use RepMLP with $h=w=10$ and the Local Perceptron has four branches with $K=1,3,5,7$. We also vary the number of groups to generate three models with different sizes. For example, g8/16 means that $g=8$ for c3 and 16 for c4. As two classic models for modeling the long-range dependencies, we construct the Non-local and GC counterparts following the instructions in the original papers, and the models are trained with the identical settings. We also present the well-known EfficientNet series as a strong baseline trained with the identical settings again. We have the following observations.

1) Compared to the traditional ConvNets with comparable numbers of parameters, the FLOPs of RepMLP-Res50 are much lower and the speed is faster. For example, compared to ResNet-101 with $224\times 224$ inputs, RepMLP-Res50 has only 50% FLOPs and 4M fewer parameters, runs 50% faster, but their accuracies are the same. With $320\times 320$ inputs, RepMLP-Res50 outperforms in accuracy, speed, and FLOPs by a large margin. Additionally, the improvements over ResNet-50 should not be simply attributed to the increased depth because RepMLP-Res50 is still shallower than ResNet-101. 2) Increasing the parameters of RepMLPs causes very minor slowdown. From RepMLP-Res50-g8/16 to RepMLP-Res50-g4/8, the parameters increase by 47%, but the FLOPs increase by only 3.6% and the speed is lowered by only 2.2%. This property is particularly useful for high-throughput inference on large-scale servers, where the throughput and accuracy are our major concerns while the model size is not. 3) Compared to Non-local and GC, the speed of RepMLP-Res50 is almost the same, but the accuracy is around 1% higher. 4) Compared to EfficientNets, which are actually not efficient on GPU, RepMLP-Res50 outperforms in both the speed and accuracy.

We visualize the weights of FC3 in Fig. 5, where the sampled output point (6,6) is marked by a dashed square. The original FC3 has no local prior as the marked point and the neighborhood have no larger values than the others. But after merging the Local Perceptron, the resultant FC3 kernel has larger values around the marked point, suggesting that the model focuses more on the neighborhood, which is expected. Besides, the global capacity is not lost because some points (marked by red rectangles) outside the largest conv kernel ($7\times 7$ in this case, marked by a blue square) still have larger values than the points inside.

We also present another design of bottleneck (RepMLP Light Block) in the Appendix, which uses no $3\times 3$ conv but only $1\times 1$ for $8\times$ channel reduction/expansion. Compared to the original ResNet-50, it achieves comparable accuracy (77.14% vs. 77.19%) with 30% lower FLOPs and 55% faster speed.

3 Face Recognition

Unlike conv, FC is not translation-invariant, making RepMLP particularly effective for images with positional prior, e.g., human faces. The dataset we use for training is MS1M-V2, a large-scale face dataset with 5.8M images from 85k celebrities. It is a semi-automatically refined version of the MS-Celeb-1M dataset, which consists of 1M photos from 100k identities and has many noisy images and wrong ID labels. We use MegaFace for evaluation, which includes 1M images of 60k identities as the gallery set and 100k images of 530 identities from FaceScrub as the probe set. It is also a version refined by manual cleaning. We use $96\times 96$ inputs for both training and evaluation.

Apart from MobileFaceNet as a well-known baseline, which was originally designed for low-power devices, we also use a customized ResNet (referred to as FaceResNet in this paper) as a stronger baseline. Compared to a regular ResNet-50, the numbers of blocks in c2,c3,c4,c5 are reduced from 3,4,6,3 to 3,2,2,2, the widths are reduced from 256,512,1024,2048 to 128,256,512,1024, and the channels of $3\times 3$ are increased from 64,128,256,512 to 128,256,512,1024. In other words, the $1\times 1$ conv layers in residual blocks do not reduce or expand the channels. Because the input resolution is $96\times 96$, the spatial sizes of c2,c3,c4,c5 are 24,12,6,3, respectively. For the RepMLP counterpart, we modify FaceResNet by replacing the stride-1 bottlenecks of c2,c3,c4 (i.e., the last two bottlenecks of c2 and the last blocks of c3,c4) by RepMLP Bottlenecks with $h=w=6,r=2,g=4$.

For training, we use a batch size of 512, momentum of 0.9, AM-Softmax loss, and weight decay following prior work. All the models are trained for 420k iterations with a learning rate beginning with 0.1 and divided by 10 at 252k, 364k and 406k iterations. For evaluation, we report the top-1 accuracy on MegaFace. Table 5 shows that FaceResNet delivers higher accuracy than MobileFaceNet but runs slower, while RepMLP-FaceRes outperforms in both accuracy and speed. Compared to MobileFaceNet, RepMLP-FaceRes shows 4.91% higher accuracy and runs 8% faster (though it has $2.5\times$ FLOPs), making it a better fit for high-power devices.
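The step schedule above can be written as a small function. This is a sketch of the stated settings only; the name `step_lr` is hypothetical, and whether the drop applies exactly at the milestone iteration or one step later is an assumption (here it applies at the milestone).

```python
def step_lr(iteration, base_lr=0.1, milestones=(252_000, 364_000, 406_000), factor=0.1):
    """Learning rate at a given iteration: base_lr times factor per milestone reached."""
    passed = sum(iteration >= m for m in milestones)   # number of milestones reached
    return base_lr * factor ** passed

for it in (0, 251_999, 252_000, 364_000, 406_000, 419_999):
    print(it, step_lr(it))
```

The rate is 0.1 for the first 252k iterations, then roughly 0.01, 0.001 and 0.0001 for the three remaining phases of the 420k-iteration run.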

4 Semantic Segmentation

Semantic segmentation is a representative task with translation invariance, as a car may occur at the left or right. We verify the generalization performance of ImageNet-pretrained RepMLP-Res50 on Cityscapes, which contains 5K finely annotated images and 19 categories. We use the RepMLP-Res50-g4/8 and the original ResNet-50 pretrained with $320\times 320$ on ImageNet as the backbones. For better reproducibility, we simply adopt the official implementation and default configurations of the PSPNet framework: poly learning rate policy with base of 0.01 and power of 0.9, weight decay of $10^{-4}$ and a global batch size of 16 on 8 GPUs for 200 epochs. Following PSPNet-50, we use dilated conv in c5 of both models and c4 of the original ResNet-50. We do not use dilated conv in c4 of RepMLP-Res50-g4/8 because its receptive field is already large. Since the resolution of c3 and c4 becomes $90\times 90$, the Global Perceptron will have 81 partitions of each channel hence more parameters in FC1 and FC2. We address this problem by reducing the output dimensions of FC1 and the input dimensions of FC2 by $4\times$ for c3 and $8\times$ for c4. FC1 and FC2 are initialized randomly, and all the other parameters are inherited from the ImageNet-pretrained model.

Table 6 shows that the PSPNet with RepMLP-Res50-g4/8 outperforms the Res-50 backbone by 2.21% in mIoU. Though it has more parameters, the FLOPs are lower and the speed is faster. Of note is that our PSPNet baseline is lower than the reported PSPNet-50 because the latter was customized for semantic segmentation (added two more layers before the max pooling) but ours is not.

Conclusion

An FC has stronger representational capacity than a conv, as the latter can be viewed as a sparse FC with shared parameters. However, an FC has no local prior, which makes it less favored for image recognition. In this paper, we have proposed RepMLP, which utilizes the global capacity and positional perception of FC and incorporates the local prior into FC by re-parameterizing convolutions into it via a simple and platform-agnostic algorithm. From the theoretical side, viewing conv as a degraded case of FC opens up a new perspective, which may deepen our understanding of the traditional ConvNets. It should not be left unmentioned that RepMLP is designed for the application scenarios where the major concerns are the inference throughput and accuracy, less concerning the number of parameters.

Appendix A: RepMLP-ResNet for High Speed

The RepMLP Bottleneck presented in the paper is designed to improve the accuracy. Here we present another means of using RepMLP in ResNet for higher speed. Specifically, we build a RepMLP Light Block (Fig. 6) with no $3\times 3$ conv but drastic $8\times$ channel reduction/expansion via $1\times 1$ conv before and after RepMLP. Same as the 78.55%-accuracy RepMLP-ResNet50 reported in the paper, we use $h=w=7$, $g=8$ and three conv branches in the Local Perceptron with $K=1,3,5$. The speed is tested in the same way as all the models reported in the paper. Table 7 shows that the ResNet with RepMLP Light Block achieves almost the same accuracy as the original ResNet-50 with 30% lower FLOPs and 55% faster speed.

Of note is that RepMLP is a building block that can be combined with numerous other structures in various ways. We only present two means for using RepMLP in ResNet, which may not be optimal. We will make the code and models publicly available to encourage further research.

Appendix B: Converting Groupwise Conv into FC

The groupwise case of converting conv into FC is a bit more complicated, which can be derived by first splitting the input into $g$ parallel groups and then converting every group separately. The PyTorch code is shown in Alg. 1 and the submitted repmlp.py contains an executable example to verify the equivalence. It is easy to verify that with $g=1$ the code exactly implements Eq. 15 in the paper.
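The "split into groups, convert every group separately" idea can be sketched as follows. This is a NumPy illustration written for this purpose, not the authors' Alg. 1 or repmlp.py: `gconv2d` is a naive grouped conv, `gconv_to_gfc` and `gmmul` are hypothetical names, and the usual convention of contiguous channel groups is assumed.

```python
import numpy as np

def gconv2d(m, f, g, p):
    """Naive stride-1 grouped conv. m: (N, C, H, W), f: (O, C/g, K, K)."""
    n, c, h, w = m.shape
    o, cg, k, _ = f.shape
    og = o // g
    x = np.pad(m, ((0, 0), (0, 0), (p, p), (p, p)))
    ho, wo = h + 2 * p - k + 1, w + 2 * p - k + 1
    out = np.zeros((n, o, ho, wo))
    for grp in range(g):
        xs = x[:, grp * cg:(grp + 1) * cg]                 # input channels of this group
        fs = f[grp * og:(grp + 1) * og]                    # kernels of this group
        for i in range(k):
            for j in range(k):
                out[:, grp * og:(grp + 1) * og] += np.einsum(
                    "nchw,oc->nohw", xs[:, :, i:i + ho, j:j + wo], fs[:, :, i, j])
    return out

def gconv_to_gfc(f, g, p, h, w):
    """Groupwise FC kernel (O*h*w, C*h*w/g) equivalent to the grouped conv."""
    o, cg, _, _ = f.shape
    og = o // g
    eye = np.eye(cg * h * w).reshape(cg * h * w, cg, h, w)
    blocks = []
    for grp in range(g):
        y = gconv2d(eye, f[grp * og:(grp + 1) * og], 1, p)  # dense conv within the group
        blocks.append(y.reshape(cg * h * w, og * h * w).T)  # (og*h*w, cg*h*w)
    return np.concatenate(blocks, axis=0)

def gmmul(v, weight, g):
    """Groupwise matrix multiplication: v is (N, P), weight is (Q, P/g)."""
    n, p_dim = v.shape
    q = weight.shape[0]
    x = v.reshape(n, g, p_dim // g)
    kk = weight.reshape(g, q // g, p_dim // g)
    return np.einsum("ngp,gqp->ngq", x, kk).reshape(n, q)

rng = np.random.default_rng(0)
N, C, O, g, h, w, K, p = 2, 4, 6, 2, 4, 4, 3, 1            # toy sizes (hypothetical)
m = rng.standard_normal((N, C, h, w))
f = rng.standard_normal((O, C // g, K, K))

W_g = gconv_to_gfc(f, g, p, h, w)
print(W_g.shape)                                           # (96, 32)
print(np.allclose(gconv2d(m, f, g, p).reshape(N, -1),
                  gmmul(m.reshape(N, -1), W_g, g)))        # True
```

With `g=1` the loop runs once over the whole kernel, and the function reduces to convolving a reshaped identity matrix as in the dense case.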

Appendix C: Absorbing BN into FC1

The BN in Global Perceptron applies a linear scaling and a bias adding to the input. After the matrix multiplication by the FC1 kernel, the added bias is projected and then added onto the bias of FC1. Therefore, the removal of this BN can be offset by scaling the kernel of FC1 and changing the bias of FC1. The code is shown in Alg. 2 and the submitted repmlp.py contains an executable example to verify the equivalence.
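The scaling and bias projection described above can be checked numerically. The sketch below is a NumPy illustration, not the authors' Alg. 2: it assumes the BN statistics are given as one value per input feature of FC1 (in a per-channel BN they would first be repeated to match the flattened vector), and `absorb_bn_into_fc` is a hypothetical name.

```python
import numpy as np

def absorb_bn_into_fc(weight, bias, mean, std, gamma, beta):
    """Fold a preceding BN (x -> gamma * (x - mean) / std + beta) into an FC.

    weight: (Q, P), bias: (Q,); BN statistics: (P,), one value per input feature.
    """
    scale = gamma / std
    shift = beta - mean * scale
    # W (scale * x + shift) + b = (W * scale) x + (W shift + b)
    return weight * scale[None, :], bias + weight @ shift

rng = np.random.default_rng(0)
n, p, q = 4, 6, 3                                       # toy sizes (hypothetical)
x = rng.standard_normal((n, p))
weight, bias = rng.standard_normal((q, p)), rng.standard_normal(q)
mean, beta = rng.standard_normal(p), rng.standard_normal(p)
std, gamma = rng.uniform(0.5, 2.0, p), rng.standard_normal(p)

bn_then_fc = (gamma * (x - mean) / std + beta) @ weight.T + bias
w_new, b_new = absorb_bn_into_fc(weight, bias, mean, std, gamma, beta)
print(np.allclose(bn_then_fc, x @ w_new.T + b_new))     # True
```

Unlike the conv-BN case, the BN here comes before the linear layer, so the scale multiplies the kernel's columns (input features) and the BN bias has to be projected through the kernel.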