Long-Tailed Classification by Keeping the Good and Removing the Bad Momentum Causal Effect

Kaihua Tang, Jianqiang Huang, Hanwang Zhang

Introduction

Over the years, we have witnessed the fast development of computer vision techniques, stemming from large and balanced datasets such as ImageNet and MS-COCO. Along with the growth of the digital data created by us, the crux of making a large-scale dataset is no longer about where to collect, but how to balance. However, the cost of expanding them to a larger class vocabulary with balanced data is not linear but exponential, as the data will inevitably be long-tailed by Zipf’s law. Specifically, adding a single sample for one data-poor tail class will bring in more samples from the data-rich head. Sometimes, even worse, re-balancing the classes is impossible. For example, in instance segmentation, if we aim to increase the images of tail class instances like “remote controller”, we have to bring in more head instances like “sofa” and “TV” simultaneously in every newly added image.

Therefore, long-tailed classification is indispensable for training deep models at scale. Recent work has started to close the performance gap between class-balanced and long-tailed datasets, while new long-tailed benchmarks are springing up, such as Long-tailed CIFAR-10/-100 and ImageNet-LT for image classification and LVIS for object detection and instance segmentation. Despite the vigorous development of this field, we find that the fundamental theory is still missing. We conjecture that this is mainly due to the paradoxical effects of the long tail. On one hand, it is bad because the classification is severely biased towards the data-rich head. On the other hand, it is good because the long-tailed distribution essentially encodes the natural inter-dependencies of classes (“TV” is indeed a good context for “controller”), so any disrespect of it will hurt the feature representation learning, e.g., re-weighting or re-sampling inevitably causes under-fitting to the head or over-fitting to the tail.

Inspired by the above paradox, the latest studies show promising results in disentangling the “good” from the “bad” by a naïve two-stage separation of imbalanced feature learning and balanced classifier training. However, such disentanglement does not explain the whys and wherefores of the paradox, leaving critical questions unanswered: given that re-balancing causes under-fitting/over-fitting, why is the re-balanced classifier good but the re-balanced feature learning bad? The two-stage design clearly defies the end-to-end merit that we have believed in since the start of the deep learning era; so why does two-stage training significantly outperform the end-to-end one in long-tailed classification?

In this paper, we propose a causal framework that not only fundamentally explains the previous methods, but also provides a principled solution to further improve long-tailed classification. The proposed causal graph of this framework is given in Figure 1 (a). We find that the momentum $M$ in any SGD optimizer (also called betas in the Adam optimizer), which is indispensable for stabilizing gradients, is a confounder: it is the common cause of the sample feature $X$ (via $M\rightarrow X$) and the classification logits $Y$ (via $M\rightarrow D\rightarrow Y$). In particular, $D$ denotes the projection of $X$ on the head feature direction, which eventually makes $X$ deviate. We will justify the graph later in Section 3. Here, Figure 1 (b&c) sheds some light on how the momentum affects the feature $X$ and the prediction $Y$. From the causal graph, we may revisit the “bad” long-tailed bias in a causal view: the backdoor path $X\leftarrow M\rightarrow D\rightarrow Y$ causes a spurious correlation even if $X$ has nothing to do with the predicted $Y$, e.g., misclassifying a tail sample as a head class. Also, the mediation path $X\rightarrow D\rightarrow Y$ mixes up the pure contribution made by $X\rightarrow Y$. For the “good” bias, $X\rightarrow D\rightarrow Y$ respects the inter-relationships of the semantic concepts in classification, that is, the head class knowledge contributes reliable evidence to filter out wrong predictions. For example, if a rare sample is closer to the head classes “TV” and “sofa”, it is more likely to be a living room object (e.g., “remote controller”) than an outdoor one (e.g., “car”).

Based on the graph that explains the paradox of the “bad” and the “good”, we propose a principled solution for long-tailed classification. It is a natural derivation of pursuing the direct causal effect along $X\rightarrow Y$ by removing the momentum effect. Thanks to causal inference, we can elegantly keep the “good” while removing the “bad”. First, to learn the model parameters, we apply de-confounded training with causal intervention: it removes the “bad” by backdoor adjustment, which cuts off the backdoor confounding path $X\leftarrow M\rightarrow D\rightarrow Y$, and it keeps the “good” by retaining the mediation $X\rightarrow D\rightarrow Y$. Second, we calculate the direct causal effect of $X\rightarrow Y$ as the final prediction logits. It disentangles the “good” from the “bad” in a counterfactual world, where the bad effect is considered as the indirect effect on $Y$ when $X$ is zero but $D$ retains the value it would have had when $X=\boldsymbol{x}$. In contrast to the prevailing two-stage design that requires unbiased re-training in the 2nd stage, our solution is one-stage and re-training free. Interestingly, as discussed in Section 4.4, we show why re-training is inevitable in their method and why ours can avoid it with even better performance.

On the image classification benchmarks Long-tailed CIFAR-10/-100 and ImageNet-LT, we outperform the previous state-of-the-art methods on all splits and settings, showing that the performance gain is not merely from catering to the long tail or a specific imbalanced distribution. On the object detection and instance segmentation benchmark LVIS, our method also has a significant advantage over the former winner of the LVIS 2019 challenge. We achieve 3.5% and 3.1% absolute improvements on mask AP and box AP using the same Cascade Mask R-CNN with an R101-FPN backbone.

Related Work

Re-Balanced Training. The most widely-used solution for long-tailed classification is arguably to re-balance the contribution of each class in the training phase. It can be achieved by either re-sampling or re-weighting. However, both inevitably cause under-fitting of the head classes or over-fitting of the tail classes. Besides, relying on access to the data distribution also limits their application scope, e.g., they are not applicable to online and streaming data.

Hard Example Mining. Instance-level re-weighting is also a practical solution. Instead of hacking the prior distribution of classes, focusing on the hard samples also alleviates the long-tailed issue, e.g., using meta-learning to find the conditional weights for each sample, or enhancing the samples of hard categories by group softmax.

Transfer Learning/Two-Stage Approach. Recent work shows a new trend of addressing the long-tailed problem by transferring knowledge from head to tail. The sharing bilateral-branch network, the two-stage training, the dynamic curriculum learning and the transfer of memory features / head distributions are all shown to be effective in long-tailed recognition, yet they either significantly increase the parameters or require a complicated training strategy.

Causal Inference. Causal inference has been widely adopted in psychology, politics and epidemiology for years. It does not just serve as an interpretation framework, but also provides solutions to achieve the desired objectives by pursuing the causal effect. Recently, causal inference has also attracted increasing attention in the computer vision community for removing dataset bias in domain-specific applications, e.g., using the pure direct effect to capture the spurious bias in VQA and NWGM for captioning. Compared to them, our method offers a fundamental framework for general long-tailed visual recognition.

A Causal View on Momentum Effect

To systematically study long-tailed classification and how momentum affects the prediction, we construct a causal graph in Figure 1 (a) with four variables: momentum ($M$), object feature ($X$), projection on head direction ($D$), and model prediction ($Y$). The causal graph is a directed acyclic graph used to indicate how the variables of interest $\{M,X,D,Y\}$ interact with each other through causal links. The nodes $M$ and $D$ constitute a confounder and a mediator, respectively. A confounder is a variable that influences both the dependent and the independent variable, creating a spurious statistical correlation between them. Consider the causal graph $\mathbf{exercise\leftarrow age\rightarrow cancer}$: older people spend more time on physical exercise after retirement, and they are also more likely to get cancer due to their age, so the confounder $age$ creates a spurious correlation suggesting that more physical exercise increases the chance of getting cancer. An example of a mediator is $\mathbf{drug\rightarrow placebo\rightarrow cure}$, where the mediator $placebo$ is the side effect of taking the $drug$ that prevents us from observing the direct effect $\mathbf{drug\rightarrow cure}$.

Before we delve into the rationale of our causal graph, let us briefly review SGD with momentum. Without loss of generality, we adopt the PyTorch implementation:

$$v_{t}=\mu\cdot v_{t-1}+g_{t},\qquad\theta_{t}=\theta_{t-1}-lr\cdot v_{t},\tag{1}$$

where the notations in the $t$-th iteration are: model parameters $\theta_{t}$, gradient $g_{t}$, velocity $v_{t}$, momentum decay ratio $\mu$, and learning rate $lr$. Other versions of SGD only change the position of some hyper-parameters, and they can easily be shown to be equivalent to each other. The use of momentum considerably dampens the oscillations caused by each single sample. In our causal graph, momentum $M$ is the overall effect of $\mu\cdot v_{T-1}$ at convergence $t=T$, which is the exponential moving average of the gradient over all past samples with decay rate $\mu$. Eq. (1) shows that, given fixed hyper-parameters $\mu$ and $lr$, each sample $M=\boldsymbol{m}$ is a function of the model initialization and the mini-batch sampling strategy, that is, $M$ has infinitely many possible values.
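As a minimal sketch of the momentum update described above, assuming the plain PyTorch convention (zero dampening, no weight decay, no Nesterov), the velocity is a decayed sum of past gradients. The function name `sgd_momentum_step` and the toy gradient values are hypothetical and chosen only for illustration.

```python
def sgd_momentum_step(theta, v, g, mu=0.9, lr=0.1):
    """One step of v_t = mu * v_{t-1} + g_t and theta_t = theta_{t-1} - lr * v_t."""
    v_new = [mu * vi + gi for vi, gi in zip(v, g)]
    theta_new = [ti - lr * vi for ti, vi in zip(theta, v_new)]
    return theta_new, v_new


# Toy gradient stream (illustrative values): nine head-like gradients (1, 1)
# followed by a single tail-like gradient (-1, 1).
grads = [[1.0, 1.0]] * 9 + [[-1.0, 1.0]]

theta, v = [0.0, 0.0], [0.0, 0.0]
for g in grads:
    theta, v = sgd_momentum_step(theta, v, g)

# Even right after the tail gradient, the first velocity component keeps the
# sign pushed by the accumulated head gradients.
print(v)
```

In this toy stream the single tail gradient cannot flip the direction accumulated from the head gradients, which is the intuition behind treating the momentum as being dominated by head samples.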

In a balanced dataset, every class contributes equally to the momentum. However, when the dataset is long-tailed, the momentum is dominated by the head samples, giving rise to the following causal links:

$\mathbf{\emph{M}\rightarrow\emph{X}}.$ This link says that the backbone parameters used to generate the feature vectors $X$ are trained under the effect of $M$. This is obvious from Eq. (1) and is illustrated in Figure 1 (b), where we visualize how the magnitudes of $X$ change from head to tail.

$\mathbf{(\emph{M},\emph{X})\rightarrow\textit{D}}.$ This link denotes that the momentum also causes the feature vector $X$ to deviate towards the head direction $D$, which is also determined by $M$. In a long-tailed dataset, a few head classes possess most of the training samples, which have less variance than the data-poor but class-rich tail, so the moving-averaged momentum will point to a stable head direction. Specifically, as shown in Figure 2, we can decompose any feature vector $\boldsymbol{x}$ into $\boldsymbol{x}=\ddot{\boldsymbol{x}}+\boldsymbol{d}$, where $D=\boldsymbol{d}=\boldsymbol{\hat{d}}\cos(\boldsymbol{x},\boldsymbol{\hat{d}})\lVert\boldsymbol{x}\rVert$. In particular, the head direction $\boldsymbol{\hat{d}}$ is given in Assumption 1, whose validity is detailed in Appendix A.

Assumption 1. The head direction $\boldsymbol{\hat{d}}$ is the unit vector of the exponential moving average features with decay rate $\mu$ like momentum, i.e., $\boldsymbol{\hat{d}}=\overline{\boldsymbol{x}}_{T}/\lVert\overline{\boldsymbol{x}}_{T}\rVert$, where $\overline{\boldsymbol{x}}_{t}=\mu\cdot\overline{\boldsymbol{x}}_{t-1}+\boldsymbol{x}_{t}$ and $T$ is the total number of training iterations.

Note that Assumption 1 says that the head direction is exactly determined by the sample moving average in the dataset, which does not need the accessibility of the class statistics at all. In particular, as we show in Appendix A, when the dataset is balanced, Assumption 1 also holds but suggests that $X\rightarrow Y$ is naturally not affected by $M$ .
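The head direction of Assumption 1 and the decomposition $\boldsymbol{x}=\ddot{\boldsymbol{x}}+\boldsymbol{d}$ can be sketched as follows. This is a toy illustration, not the authors' code: the helper names (`head_direction`, `decompose`) and the feature values are hypothetical, and features are plain Python lists rather than network outputs.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def head_direction(features, mu=0.9):
    """Unit vector of the moving-average feature x_bar_t = mu * x_bar_{t-1} + x_t."""
    x_bar = [0.0] * len(features[0])
    for x in features:
        x_bar = [mu * b + xi for b, xi in zip(x_bar, x)]
    n = norm(x_bar)
    return [b / n for b in x_bar]


def decompose(x, d_hat):
    """Split x into d (its projection on d_hat) and x_ddot (the orthogonal rest)."""
    scale = dot(x, d_hat)  # equals cos(x, d_hat) * ||x|| because ||d_hat|| = 1
    d = [scale * di for di in d_hat]
    x_ddot = [xi - di for xi, di in zip(x, d)]
    return x_ddot, d


# Toy feature stream (illustrative): nine head-like samples and one tail-like sample.
features = [[1.0, 0.1]] * 9 + [[0.0, 1.0]]
d_hat = head_direction(features)
x_ddot, d = decompose([0.5, 1.0], d_hat)
print(d_hat, x_ddot, d)
```

Note that no class labels or class counts enter the computation; the direction is obtained purely from the stream of features, in line with the remark above that the class statistics are not needed.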

$\mathbf{\emph{X}\rightarrow\emph{D}\rightarrow\emph{Y}\;\&\;\emph{X}\rightarrow\emph{Y}}.$ These links indicate that the effect of $X$ can be disentangled into an indirect (mediation) and a direct effect. Thanks to the above orthogonal decomposition: $\boldsymbol{x}=\ddot{\boldsymbol{x}}+\boldsymbol{d}$ , the indirect effect is affected by $\boldsymbol{d}$ while the direct effect is affected by $\ddot{\boldsymbol{x}}$ , and they together determine the total effect. As shown in Figure 4, when we change the scale parameter $\alpha$ of $\boldsymbol{d}$ , the performance of the tail classes monotonically increases with $\alpha$ , which inspires us to remove the mediation effect of $D$ in Section 4.2.

The Proposed Solution

Based on the proposed causal graph in Figure 1 (a), we can delineate our goal for long-tailed classification: the pursuit of the direct causal effect along $X\rightarrow Y$. In causal inference, it is defined as the Total Direct Effect (TDE):

$$TDE(Y_{i})=[Y_{\boldsymbol{d}}=i|do(X=\boldsymbol{x})]-[Y_{\boldsymbol{d}}=i|do(X=\boldsymbol{x}_{0})].\tag{2}$$

We would like to highlight that Eq. (2) removes the “bad” while keeping the “good” in a reconcilable way. First, in training, the $do$-operator removes the “bad” confounder bias while keeping the “good” mediator bias, because the $do$-operator retains the mediation path. Second, in inference, the mediator value $\boldsymbol{d}$ is imposed in both terms to keep the “good” of the mediator bias (towards the head) in logit prediction; it also removes its “bad” by subtracting the second term: the prediction when the input $X$ is null ($\boldsymbol{x}_{0}$) but the mediator $D$ still takes the value $\boldsymbol{d}$ that it would have had if $X$ had been $\boldsymbol{x}$. Note that such a counterfactual subtraction elegantly characterizes the “bad” mediation bias, just like how we capture the tricky placebo effect: we trick the patient into taking a placebo drug, setting the direct drug effect $\mathbf{drug\rightarrow cure}$ to zero; thus, any cure observed must be purely due to the non-zero placebo effect $\mathbf{drug\rightarrow placebo\rightarrow cure}$.

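4.1 De-confounded Training

The model for the proposed causal graph is optimized under the causal intervention $do(X=\boldsymbol{x})$, which aims to preserve the “good” feature learning from the momentum and cut off its “bad” confounding effect. We apply the backdoor adjustment to derive the de-confounded model:

$$P(Y=i|do(X=\boldsymbol{x}))=\sum_{\boldsymbol{m}}P(Y=i|X=\boldsymbol{x},M=\boldsymbol{m})P(M=\boldsymbol{m})\tag{3}$$

$$=\sum_{\boldsymbol{m}}\frac{P(Y=i,X=\boldsymbol{x}|M=\boldsymbol{m})P(M=\boldsymbol{m})}{P(X=\boldsymbol{x}|M=\boldsymbol{m})}.\tag{4}$$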

As there are an infinite number of values $M=\boldsymbol{m}$, it is prohibitive to compute the above backdoor adjustment directly. Fortunately, the Inverse Probability Weighting formulation in Eq. (4) provides us with a new perspective for approximating the infinite sampling $(i,\boldsymbol{x})|\boldsymbol{m}$. For a finite dataset, no matter how many $\boldsymbol{m}$ there are, we can only observe one $(i,\boldsymbol{x})$ given one $\boldsymbol{m}$. In such cases, the number of $\boldsymbol{m}$ values that Eq. (4) would encounter is equal to the number of available samples $(i,\boldsymbol{x})$, not to the prohibitive number of possible $\boldsymbol{m}$ values. In fact, thanks to the backdoor adjustment, which establishes the equivalence between the originally confounded model $P$ and the de-confounded model $P$ with $do(X)$, we can collect samples from the former that act as though they were drawn from the latter. Therefore, Eq. (4) can be approximated as

$$P(Y=i|do(X=\boldsymbol{x}))\approx\frac{1}{K}\sum_{k=1}^{K}\widetilde{P}(Y=i,X=\boldsymbol{x}^{k}|M=\boldsymbol{m}),\tag{5}$$

where $\widetilde{P}$ is the inverse weighted probability and we will drop $M=\boldsymbol{m}$ in the rest of the paper for notation simplicity and bear in mind that $\boldsymbol{x}$ still depends on $\boldsymbol{m}$ . In particular, compared to the vanilla trick, we apply a multi-head strategy to equally divide the channel (or dimensions) of weights and features into $K$ groups, which can be considered as $K$ times more fine-grained sampling.

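We model $\widetilde{P}$ in Eq. (5) as the softmax activated probability of the energy-based model:

$$\widetilde{P}(Y=i,X=\boldsymbol{x}^{k})\propto E(i,\boldsymbol{x}^{k};\boldsymbol{w}_{i}^{k})=\tau\frac{f(i,\boldsymbol{x}^{k};\boldsymbol{w}_{i}^{k})}{g(i,\boldsymbol{x}^{k};\boldsymbol{w}_{i}^{k})},\tag{6}$$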

where $\tau$ is a positive scaling factor akin to the inverse temperature in Gibbs distribution. Recall Assumption 1 that $\boldsymbol{x}^{k}=\ddot{\boldsymbol{x}}^{k}+\boldsymbol{d}^{k}$ . The numerator, i.e., the unnormalized effect, can be implemented as logits $f(i,\boldsymbol{x}^{k};\boldsymbol{w}_{i}^{k})=(\boldsymbol{w}_{i}^{k})^{\top}(\ddot{\boldsymbol{x}}^{k}+\boldsymbol{d}^{k})=(\boldsymbol{w}_{i}^{k})^{\top}\boldsymbol{x}^{k}$ , and the denominator is a normalization term (or propensity score ) that only balances the magnitude of the variables: $g(i,\boldsymbol{x}^{k};\boldsymbol{w}_{i}^{k})=\|\boldsymbol{x}^{k}\|\cdot\|\boldsymbol{w}_{i}^{k}\|+\gamma\|\boldsymbol{x}^{k}\|$ , where the first term is a class-specific energy and the second term is a class-agnostic baseline energy.

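Putting the above all together, the logit calculation for $P(Y=i|do(X=\boldsymbol{x}))$ can be formulated as:

$$[Y=i|do(X=\boldsymbol{x})]=\frac{\tau}{K}\sum_{k=1}^{K}\frac{(\boldsymbol{w}_{i}^{k})^{\top}(\ddot{\boldsymbol{x}}^{k}+\boldsymbol{d}^{k})}{(\lVert\boldsymbol{w}_{i}^{k}\rVert+\gamma)\lVert\boldsymbol{x}^{k}\rVert}.\tag{7}$$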

Interestingly, this model also explains the effectiveness of normalized classifiers like the cosine classifier. We will further discuss it in Section 4.4.
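The de-confounded logit can be sketched in a few lines, using the numerator $f=(\boldsymbol{w}_{i}^{k})^{\top}\boldsymbol{x}^{k}$ and the denominator $g=\lVert\boldsymbol{x}^{k}\rVert\cdot\lVert\boldsymbol{w}_{i}^{k}\rVert+\gamma\lVert\boldsymbol{x}^{k}\rVert$ defined above. This is only a sketch under stated assumptions: the per-head terms are assumed to be combined by a plain average over the $K$ heads, every head is assumed to have a non-zero norm, and the function names are hypothetical.

```python
import math


def split_heads(v, K):
    """Equally divide the channels of a vector into K contiguous groups."""
    size = len(v) // K
    return [v[k * size:(k + 1) * size] for k in range(K)]


def l2(v):
    return math.sqrt(sum(a * a for a in v))


def deconfounded_logit(x, w_i, K=2, tau=16.0, gamma=1.0 / 32):
    """Average over K heads of tau * f / g for one class i (assumed combination)."""
    total = 0.0
    for xk, wk in zip(split_heads(x, K), split_heads(w_i, K)):
        f = sum(a * b for a, b in zip(wk, xk))    # unnormalized effect w^T x
        g = l2(xk) * l2(wk) + gamma * l2(xk)      # class-specific + baseline energy
        total += tau * f / g
    return total / K


x = [3.0, 4.0, 0.0, 1.0]
w = [3.0, 4.0, 0.0, 1.0]
print(deconfounded_logit(x, w, K=2, tau=1.0, gamma=0.0))
print(deconfounded_logit(x, w))
```

With $\gamma=0$ each head reduces to a cosine similarity, which is one way to see why a cosine classifier can be viewed as an approximation of this model. The logit is also invariant to rescaling the feature $\boldsymbol{x}$, since both $f$ and $g$ scale linearly with $\lVert\boldsymbol{x}^{k}\rVert$.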

4.2 Total Direct Effect Inference

After the de-confounded training, the causal graph is ready for inference. The TDE of $X\rightarrow Y$ in Eq. (2) can thus be depicted as in Figure 3. By applying the counterfactual consistency rule, we have $[Y_{\boldsymbol{d}}=i|do(X=\boldsymbol{x})]=[Y=i|do(X=\boldsymbol{x})]$. This indicates that we can use Eq. (7) to calculate the first term of Eq. (2). Thanks to Assumption 1, we can disentangle $\boldsymbol{x}$ by $\boldsymbol{x}=\ddot{\boldsymbol{x}}+\boldsymbol{d}$, where $\boldsymbol{d}=\lVert\boldsymbol{d}\rVert\cdot\boldsymbol{\hat{d}}=\cos(\boldsymbol{x},\boldsymbol{\hat{d}})\lVert\boldsymbol{x}\rVert\cdot\boldsymbol{\hat{d}}$. Therefore, the second term $[Y_{\boldsymbol{d}}=i|do(X=\boldsymbol{x}_{0})]$ is obtained by replacing the $\ddot{\boldsymbol{x}}$ in Eq. (7) with the zero vector, just like “cheating” the model with a null input while keeping everything else unchanged. Overall, the final TDE calculation for Eq. (2) is

$$TDE(Y_{i})=\frac{\tau}{K}\sum_{k=1}^{K}\left(\frac{(\boldsymbol{w}_{i}^{k})^{\top}\boldsymbol{x}^{k}}{(\lVert\boldsymbol{w}_{i}^{k}\rVert+\gamma)\lVert\boldsymbol{x}^{k}\rVert}-\alpha\cdot\frac{\cos(\boldsymbol{x}^{k},\boldsymbol{\hat{d}}^{k})\cdot(\boldsymbol{w}_{i}^{k})^{\top}\boldsymbol{\hat{d}}^{k}}{\lVert\boldsymbol{w}_{i}^{k}\rVert+\gamma}\right),\tag{8}$$

where $\alpha$ controls the trade-off between the indirect and direct effect as shown in Figure 4.
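The inference step can be sketched as the normalized logit minus $\alpha$ times the same logit evaluated with $\ddot{\boldsymbol{x}}$ set to zero, so that only the projection $\boldsymbol{d}$ on the head direction remains. This is a sketch under stated assumptions rather than the authors' implementation: the head direction is assumed to be re-normalized to unit length within each of the $K$ heads, the heads are averaged, and all names (`tde_logit`, `x_bar`) are hypothetical.

```python
import math


def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def _heads(v, K):
    size = len(v) // K
    return [v[k * size:(k + 1) * size] for k in range(K)]


def tde_logit(x, w_i, x_bar, K=2, tau=16.0, gamma=1.0 / 32, alpha=3.0):
    """Factual normalized logit minus alpha times the counterfactual (x_ddot = 0) logit."""
    total = 0.0
    for xk, wk, bk in zip(_heads(x, K), _heads(w_i, K), _heads(x_bar, K)):
        d_hat = [b / _norm(bk) for b in bk]          # per-head unit head direction
        cos_xd = _dot(xk, d_hat) / _norm(xk)         # cos(x^k, d_hat^k)
        denom = _norm(wk) + gamma
        factual = _dot(wk, xk) / (denom * _norm(xk))
        counterfactual = cos_xd * _dot(wk, d_hat) / denom
        total += tau * (factual - alpha * counterfactual)
    return total / K


w = [1.0, 1.0, 1.0, 1.0]
x_bar = [1.0, 0.0, 0.0, 2.0]  # moving-average feature (illustrative values)
print(tde_logit([2.0, 0.0, 0.0, 4.0], w, x_bar, alpha=1.0))  # x parallel to d_hat
print(tde_logit([0.0, 3.0, 5.0, 0.0], w, x_bar, alpha=3.0))  # x orthogonal to d_hat
```

Two sanity checks follow from the formula: a feature lying entirely along the head direction has zero direct effect when $\alpha=1$, while a feature orthogonal to the head direction is unaffected by the subtraction for any $\alpha$.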

4.3 Background-Exempted Inference

Some classification tasks need a special “background” class to filter out samples belonging to none of the classes of interest, e.g., object detection and instance segmentation use the background class to remove non-object regions, and recommender systems assume that the majority of the items are irrelevant to a user. In such tasks, most of the training samples are background and hence the background class is a good head class, whose effect should be kept and thus exempted from the TDE calculation. To this end, we propose a background-exempted inference that specifically uses the original inference (total effect) for the background class. The inference can be formulated as:

$$\arg\max_{i\in C}\begin{cases}(1-p_{0})\cdot\dfrac{q_{i}}{1-q_{0}},&i\neq 0\\ p_{0},&i=0\end{cases}\tag{9}$$

where $i=0$ is the background class, $p_{i}=P(Y=i|do(X=\boldsymbol{x}))$ is the de-confounded probability that we defined in Section 4.1, $q_{i}$ is the softmax activated probability of the original $TDE(Y_{i})$ in Eq. (8). Note that Eq. (9) adds up to 1 from $i=0$ to $C$ .
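One form that satisfies the properties stated above (the background keeps its de-confounded probability $p_{0}$, the foreground classes follow the TDE probabilities $q_{i}$, and the scores add up to 1) is to renormalize $q_{i}$ over the foreground classes and rescale by $1-p_{0}$. The sketch below assumes this form; the function names and the toy logits are hypothetical.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def background_exempted(p, q):
    """Combine p (de-confounded probs) and q (softmax of TDE logits); index 0 is background."""
    fg_mass = 1.0 - p[0]                      # probability mass left for foreground classes
    scores = [p[0]]                           # background keeps its total-effect probability
    scores += [fg_mass * qi / (1.0 - q[0]) for qi in q[1:]]
    return scores


p = softmax([2.0, 1.0, 0.0])   # illustrative de-confounded logits
q = softmax([0.5, 1.0, 1.5])   # illustrative TDE logits
scores = background_exempted(p, q)
print(scores, sum(scores))
```

Under this form, the ranking among foreground classes is decided purely by the TDE probabilities, while the foreground/background decision is left to the original total effect.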

4.4 Revisiting Two-stage Training

The proposed framework also theoretically explains the previous state-of-the-art methods, as shown in Table 1. Please see Appendix B for a detailed revisit of each method.

Two-stage Re-balancing. Naïve re-balanced training fails to retain a natural mediation $D$ that respects the inter-dependencies among classes. Therefore, two-stage training is adopted by most re-balancing methods: first pre-training the backbone on imbalanced data with the natural $D$, and then re-training a fair classifier on balanced data with the backbone fixed for feature representation. Later, we will show that the second-stage re-balancing essentially plays a counterfactual role, which reveals why stage 2 is indispensable.

De-confounded Training. Technically, the proposed de-confounded training in Eq. (7) is a multi-head classifier with normalization. Normalized classifiers, like the cosine classifier, have already been embraced by various methods based on empirical practice. However, as we will show in Table 2, without the guidance of our causal graph, their normalizations perform worse than the proposed de-confounded model. For example, methods like Decouple apply normalization only in the 2nd-stage balanced classifier training, and hence their feature learning is not de-confounded.

Direct Effect. The one-stage re-weighting/re-sampling training methods, like LDAM, can be interpreted as calculating the Controlled Direct Effect (CDE): $CDE(Y_{i})=[Y=i|do(X=\boldsymbol{x}),do(D=\boldsymbol{d}_{0})]-[Y=i|do(X=\boldsymbol{x}_{0}),do(D=\boldsymbol{d}_{0})]$, where $\boldsymbol{x}_{0}$ is a dummy vector and $\boldsymbol{d}_{0}$ is a constant vector. CDE performs a physical intervention (re-balancing) on the training data by setting the bias $D$ to a constant. Note that the second term of CDE is a constant that does not affect the classification. However, CDE removes the “bad” at the cost of hurting the “good” during representation learning, as $D$ is no longer a natural mediation generated by $X$.

The two-stage methods essentially compute the Natural Direct Effect (NDE), where the stage-2 re-balanced training is actually an intervention on $D$ that forces the direction $\boldsymbol{\hat{d}}$ not to point towards any class. Therefore, when attached to the stage-1 imbalanced pre-trained features, the balanced classifier calculates the NDE: $NDE(Y_{i})=[Y_{\boldsymbol{d}_{0}}=i|do(X=\boldsymbol{x})]-[Y_{\boldsymbol{d}_{0}}=i|do(X=\boldsymbol{x}_{0})]$, where $\boldsymbol{x}_{0}$ and $\boldsymbol{d}_{0}$ are dummy vectors. Here, $\boldsymbol{d}_{0}$ arises because the stage-2 balanced classifier forces the logits to nullify any class-specific momentum direction; $do(X=\boldsymbol{x})$ holds because the stage-1 backbone is frozen and thus $M\not\rightarrow X$; and the second term can be omitted as it is a class-agnostic constant. Besides the fact that their stage-1 training is still confounded, as we will show in experiments, our TDE is better than NDE because the latter completely removes the entire effect of $D$ by setting $D=\boldsymbol{d}_{0}$, even though this effect is sometimes good, e.g., mis-classifying “warthog” as the head class “pig” is better than mis-classifying it as “car”. TDE admits the effect by keeping $D=\boldsymbol{d}$ as a baseline and further compares the fine-grained difference via the direct effect, e.g., by admitting that “warthog” does look like “pig”, TDE finds out that the tusk is the key difference between “warthog” and “pig”, which is why our method can focus on more discriminative regions in Figure 5.

Experiments

The proposed method was evaluated on three long-tailed benchmarks: Long-tailed CIFAR-10/-100, ImageNet-LT for image classification and LVIS for object detection and instance segmentation. The consistent improvements across different tasks demonstrate our broad application domain.

Datasets and Protocols. Following prior work, we collected the long-tailed versions of CIFAR-10/-100 with controllable degrees of data imbalance ratio ($\frac{N_{max}}{N_{min}}$, where $N$ is the number of samples in each category), which controls the distribution of the training sets. ImageNet-LT is a long-tailed subset of the ImageNet dataset. It consists of 1k classes over 186k images, with 116k/20k/50k for the train/val/test sets, respectively. In the train set, the number of images per class ranges from 1,280 to 5, which imitates the long-tailed distribution that commonly exists in the real world. The test and val sets are balanced, and results are reported on four splits: Many-shot containing classes with $>100$ images, Medium-shot including classes with $\geq 20$ and $\leq 100$ images, Few-shot covering classes with $<20$ images, and Overall for all classes. LVIS is a large vocabulary instance segmentation dataset with 1,230/1,203 categories in V0.5/V1.0, respectively. It contains a 57k/100k train set (V0.5/V1.0) under a significant long-tailed distribution, and a relatively balanced 5k/20k val set (V0.5/V1.0) and 20k test set.

Implementation Details. For image classification on ImageNet-LT, we used ResNeXt-50-32x4d as our backbone for all experiments. All models were trained using the SGD optimizer with momentum $\mu=0.9$ and batch size 512. The learning rate was decayed by a cosine scheduler from 0.2 to 0.0 in 90 epochs. Hyper-parameters were chosen by the performance on the ImageNet-LT val set, and we set $K=2,\tau=16,\gamma=1/32,\alpha=3.0$. For Long-tailed CIFAR-10/-100, we changed the backbone to ResNet-32 and the training scheduler to a warm-up scheduler like BBN for fair comparisons. All hyper-parameters were inherited from ImageNet-LT except for $\alpha$, which was set to $1.0/1.5$ for CIFAR-10/-100, respectively. For instance segmentation and object detection on LVIS, we chose the Cascade Mask R-CNN framework implemented by . The optimizer was also SGD with momentum $\mu=0.9$, and we used batch size 16 for an R101-FPN backbone. The models were trained for 20 epochs with the learning rate starting at 0.02 and decaying by a factor of 0.1 at the 16-th and 19-th epochs. We selected the top 300 predicted boxes following . The hyper-parameters on LVIS were directly adopted from ImageNet-LT, except for $\alpha=1.5$. The main difference between image classification and object detection/instance segmentation is that the latter includes a background class $i=0$, which is a head class used to make a binary decision between foreground and background. As we discussed in Section 4.3, the Background-Exempted Inference should be used to retain the good background bias. The comparison with and without Background-Exempted Inference is given in Appendix C.
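The two learning-rate schedules mentioned above can be sketched as plain functions of the epoch index. Several details are assumptions made for illustration and are not stated in the text: the cosine schedule uses the standard half-cosine form, both schedules are evaluated per epoch rather than per iteration, epochs are counted from 0, and the function names are hypothetical.

```python
import math


def cosine_lr(epoch, base_lr=0.2, total_epochs=90):
    """Half-cosine decay from base_lr at epoch 0 to 0.0 at total_epochs (assumed form)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


def step_lr(epoch, base_lr=0.02, milestones=(16, 19), factor=0.1):
    """Multiply the rate by `factor` once for every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# ImageNet-LT setting: 0.2 -> 0.0 over 90 epochs.
print([round(cosine_lr(e), 4) for e in (0, 45, 90)])
# LVIS setting: 0.02, decayed by 0.1 at epochs 16 and 19 of a 20-epoch run.
print([step_lr(e) for e in (0, 16, 19)])
```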

Ablation studies. To study the effectiveness of the proposed de-confounded training and TDE inference, we tested a variety of ablation models: 1) the linear classifier baseline (no bias term); 2) the cosine classifier; 3) the capsule classifier, where $x$ is normalized by the non-linear function from ; 4) the proposed de-confounded model with normal softmax inference; 5) different versions of the TDE. As reported in Tables 2 and 4, the de-confounded TDE achieves the best performance under all settings. The TDE inference improves all three normalized models, because the cosine and capsule classifiers can be considered as approximations to the proposed de-confounded model. To show that the mediation effect removed by TDE indeed controls the preference towards the head direction, we varied the parameter $\alpha$ as shown in Figure 4, resulting in a smooth increase/decrease of the performance on tail/head classes, respectively.

Comparisons with State-of-The-Art Methods. The previous state-of-the-art results on ImageNet-LT are achieved by the two-stage re-balanced training that decouples the backbone and classifier. However, as we discussed in Section 4.4, this kind of approach is less effective or less efficient. On Long-tailed CIFAR-10/-100, we outperform the previous methods at all imbalance ratios, which shows that the proposed method can automatically adapt to different data distributions. On the LVIS dataset, after a simple adaptation, we beat EQL, the champion of the LVIS Challenge 2019, in Table 4. All reported results in Table 4 use the same Cascade Mask R-CNN framework and R101-FPN backbone for fair comparison. The EQL results were copied from , which were trained with 16 GPUs and a batch size of 32, while the proposed method only used 8 GPUs and half of the batch size. We did not compare with the EQL results on the final challenge test server, because they claimed to exploit an external dataset and other tricks like ensembling to win the challenge. Note that EQL is also a re-balanced method, having the same problems as . We also visualized the activation maps using Grad-CAM in Figure 5. The linear classifier baseline and decouple-LWS usually activate the entire object and some context regions to make a prediction. Meanwhile, the de-confounded TDE only focuses on the direct effect, i.e., the most discriminative regions, so it usually activates a more compact area, which is less likely to be biased towards similar head classes. For example, to classify a “kimono”, the proposed method only focuses on the discriminative feature rather than the entire body, which is similar to some other clothes like “dress”.

Conclusions

In this work, we first proposed a causal framework to pinpoint the causal effect of momentum in long-tailed classification, which not only theoretically explains the previous methods, but also provides an elegant one-stage training solution to extract the unbiased direct effect of each instance. The detailed implementation consists of de-confounded training and total direct effect inference, which is simple, adaptive, and agnostic to the prior statistics of the class distribution. We achieved new state-of-the-art results for various tasks on both the ImageNet-LT and LVIS benchmarks. Moving forward, we are going to 1) further validate our theory in a wider spectrum of application domains and 2) seek better feature disentanglement algorithms for more precise counterfactual effects.

Broader Impact

The positive impacts of this work are two-fold: 1) it improves the fairness of the classifier, which prevents the potential discrimination of deep models, e.g., an unfair AI could blindly cater to the majority, causing gender, racial or religious discrimination; 2) it allows larger vocabulary datasets to be easily collected without a compulsory class-balancing pre-processing, e.g., to train autonomous vehicles with the proposed method, we do not need to collect as many ambulance images as normal van images. Negative impacts could also arise if the proposed long-tailed classification technique falls into the wrong hands, e.g., it can be used to identify minority groups for malicious purposes. Therefore, it is our duty to make sure that the long-tailed classification technique is used for the right purpose.

Appendix A Additional Explanations of Assumption 1

To better understand $\mathbf{(\emph{M},\emph{X})\rightarrow\textit{D}}$ and Assumption 1, let us take a simple example. Consider a learnable parameter $\theta\in\mathbb{R}^{2}$ whose per-instance gradients for classes A and B are approximately (1, 1) and (-1, 1), respectively. If each of these two classes has 50 samples, the mean gradient would be (0, 1), which is the optimal gradient direction shared by both A and B. The momentum will thus accelerate in this direction, which optimizes the model to fairly discriminate the two classes. However, if there are 99 samples from class A and only 1 sample from class B (a long-tailed dataset), the mean gradient would be (0.98, 1). In this case, the momentum direction approximates the class A (head) gradients, encouraging the backbone parameters to generate head-like feature vectors, i.e., creating an unfair deviation towards the head.
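The arithmetic of this example can be checked with a short sketch; the function name `mean_gradient` is hypothetical and the per-class gradients are the idealized values (1, 1) and (-1, 1) used above.

```python
def mean_gradient(n_a, n_b, g_a=(1.0, 1.0), g_b=(-1.0, 1.0)):
    """Sample-weighted mean of the per-class gradients g_a and g_b."""
    n = n_a + n_b
    return tuple((n_a * a + n_b * b) / n for a, b in zip(g_a, g_b))


print(mean_gradient(50, 50))  # balanced: the first component cancels out
print(mean_gradient(99, 1))   # long-tailed: (99 - 1) / 100 = 0.98 in the first component
```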

Since the momentum in SGD usually dominates the gradient velocity, the effect of such a deviation is not trivial, and it will eventually create the head projection $D$ on all feature vectors generated by the backbone. It is worth noting that although there are non-linear activation layers in the backbone, due to the central limit theorem, the overall effect of these deviated parameters still follows a normal distribution, which means we can use the moving-averaged feature to approximate this head direction, i.e., Assumption 1 in the main paper.

In addition, even in a balanced dataset, Assumption 1 still holds. Considering the above example, the mean gradient is (0, 1) for balanced A and B, which is not biased towards either direction: (1, 1) or (-1, 1). In other words, $D$ still exists for the balanced dataset, but $\cos(\boldsymbol{x},\boldsymbol{\hat{d}})$ should be almost the same for all classes. Therefore, $M\rightarrow D\rightarrow Y$ will not cause any preference in the balanced dataset, which naturally keeps $X\rightarrow Y$ free from the effect of $M$. This is also intuitively easy to understand, because when the dataset is balanced, the mean feature only represents the common patterns shared by all classes, e.g., the $D$ in a balanced face recognition dataset is the mean face, which would be a contour of a human head that is not biased towards any specific face category.

Appendix B Revisiting Previous Methods in Long-Tailed Classification

In this section, we revisit previous state-of-the-art methods from two aspects: the normalized classifiers and the re-balancing strategies.

Normalized Classifiers. Normalized classifiers have already been widely adopted in long-tailed classification based on empirical practice. As discussed in Section 4, correctly applied normalized classifiers are approximations of the proposed de-confounded training. However, without the guidance of the proposed causal framework, most of them are not used properly. We define the general normalized classifier as the following equation:

The cosine classifier is defined based on the cosine similarity, i.e., $N(\boldsymbol{x},\boldsymbol{w}_{i})=\lVert\boldsymbol{x}\rVert\cdot\lVert\boldsymbol{w}_{i}\rVert$. It is commonly used in tasks like few-shot learning. In Tables 2 and 3 of the original paper, we have demonstrated its effectiveness in long-tailed classification. The capsule classifier was proposed by Liu et al. as a replacement for the vanilla cosine classifier in OLTR. It changes the $l_2$ norm of $\boldsymbol{x}$ into the squashing non-linear function proposed in Capsule Network, which gives the normalized $\boldsymbol{x}$ a magnitude between 0 and 1, representing the probability of $\boldsymbol{x}$ in its direction. The final normalization term can thus be defined as $N(\boldsymbol{x},\boldsymbol{w}_{i})=(\lVert\boldsymbol{x}\rVert+1)\cdot\lVert\boldsymbol{w}_{i}\rVert$. However, OLTR doesn’t use it to de-confound the visual feature. Instead, its $\boldsymbol{x}$ is the joint embedding of the feature vector and an attentive memory vector. Decouple also introduces two different types of normalized classifiers: the $\tau$-norm classifier and the Learnable Weight Scaling (LWS) classifier. Its authors empirically found that the $l_2$ norm of $\boldsymbol{w}_{i}$ is not uniform in a long-tailed dataset and is positively correlated with the number of training samples for class $i$, as shown in Figure 6. Therefore, their normalized classifiers only normalize $\boldsymbol{w}_{i}$: the $\tau$-norm classifier is defined as $N(\boldsymbol{x},\boldsymbol{w}_{i})=\lVert\boldsymbol{w}_{i}\rVert^{\tau},\tau\in$ while LWS is $N(\boldsymbol{x},\boldsymbol{w}_{i})=g_{i},$ where $g_{i}$ is a learnable parameter. Yet, these Decouple classifiers fail to de-confound $M\rightarrow X$ for two reasons: 1) they don’t consider the confounding effect on $\boldsymbol{x}$; 2) they only apply the normalized classifiers in the 2nd stage, when the backbone has already been frozen.
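The normalization terms listed above can be compared side by side in a small sketch. It assumes that the logit of class $i$ takes the form $\boldsymbol{w}_{i}^{\top}\boldsymbol{x}/N(\boldsymbol{x},\boldsymbol{w}_{i})$ with any global scaling factor omitted; this form, the helper names, and the toy weights are assumptions made for illustration, not the exact definition used in the paper.

```python
import numpy as np


def normalized_logits(x, W, norm_fn):
    """Assumed general form: logit_i = w_i^T x / N(x, w_i), one row of W per class."""
    raw = W @ x
    denom = np.array([norm_fn(x, w) for w in W])
    return raw / denom


def cosine_norm(x, w):
    """Cosine classifier: N = ||x|| * ||w_i||."""
    return np.linalg.norm(x) * np.linalg.norm(w)


def capsule_norm(x, w):
    """Capsule classifier: N = (||x|| + 1) * ||w_i||."""
    return (np.linalg.norm(x) + 1.0) * np.linalg.norm(w)


def tau_norm(tau):
    """tau-norm classifier: N = ||w_i||^tau, which ignores the norm of x."""
    def norm_fn(x, w):
        return np.linalg.norm(w) ** tau
    return norm_fn


def lws_logits(x, W, g):
    """LWS classifier: N = g_i, one learnable scalar per class (fixed here)."""
    return (W @ x) / g


# Toy weights (hypothetical): the head class (row 0) has a larger weight norm.
W = np.array([[3.0, 0.0], [0.0, 1.0]])
x = np.array([1.0, 1.0])

raw = W @ x
cos_logits = normalized_logits(x, W, cosine_norm)
cap_logits = normalized_logits(x, W, capsule_norm)
tau1_logits = normalized_logits(x, W, tau_norm(1.0))
```

In this toy case the raw logits favour the class with the larger weight norm, while the cosine, capsule, and $\tau=1$ variants all remove that advantage. Only the first two also depend on $\lVert\boldsymbol{x}\rVert$.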

Re-balancing Strategies. Both OLTR and Decouple adopt the same class-aware sampler in their 2nd-stage training, which forces each class to contribute the same number of samples regardless of its size. To combine the two training stages dynamically, BBN uses a bilateral-branch design that smoothly transfers the sampling strategy from the imbalanced branch to the re-balancing branch. The two branches share the same set of parameters but learn from different sampling strategies, which is in the same spirit as the two-stage design in OLTR and Decouple. As for EQL, since re-sampling is complicated in object detection and instance segmentation, where objects from different classes co-exist in one image, its authors choose a re-weighted loss to balance the contributions of different classes.
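The idea behind a class-aware sampler can be sketched as follows. This is a simplified illustration that draws with replacement, first picking a class uniformly and then an instance within it; the samplers actually used by OLTR and Decouple may differ in their details, and the function name `class_aware_sample` is hypothetical.

```python
import random
from collections import Counter, defaultdict


def class_aware_sample(labels, num_draws, seed=0):
    """Draw sample indices so that every class is equally likely, whatever its size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    draws = []
    for _ in range(num_draws):
        cls = rng.choice(classes)                # uniform over classes
        draws.append(rng.choice(by_class[cls]))  # uniform within the chosen class
    return draws


# Long-tailed toy label list: 99 head samples and 1 tail sample.
labels = ["head"] * 99 + ["tail"] * 1
picked = class_aware_sample(labels, num_draws=10_000)
counts = Counter(labels[i] for i in picked)
```

With such a sampler the single tail instance is drawn about as often as all 99 head instances together, which also illustrates why re-sampling risks over-fitting to the tail.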

Appendix C Background-Exempted Inference

The results with and without Background-Exempted Inference are reported in Table 5. As we can see, the Background-Exempted strategy successfully prevents the TDE from hurting the foreground-background selection. It is the key to applying TDE in tasks like object detection and instance segmentation that include one or more legitimately biased head categories, i.e., this strategy allows us to conduct TDE on a selected subset of categories.
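One way to realize such a strategy is sketched below. It assumes a specific combination rule that is not stated in this appendix: the background class keeps its original softmax probability, and the remaining probability mass is distributed over the foreground classes in proportion to their TDE scores. The function name, the `bg_index` argument, and the toy logits are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def background_exempted_probs(orig_logits, tde_logits, bg_index=0):
    """Keep the original background probability; rank foreground classes by TDE.

    Assumed rule: P(bg) = p_bg, and P(i) = (1 - p_bg) * q_i / sum_fg(q) for i != bg,
    where p comes from the original logits and q from the TDE logits.
    """
    p = softmax(np.asarray(orig_logits, dtype=float))
    q = softmax(np.asarray(tde_logits, dtype=float))
    fg = np.arange(len(p)) != bg_index
    out = np.empty_like(p)
    out[bg_index] = p[bg_index]
    out[fg] = (1.0 - p[bg_index]) * q[fg] / q[fg].sum()
    return out


# Toy logits (hypothetical): index 0 is background, 1 is a head class, 2 a tail class.
orig_logits = np.array([2.0, 3.0, 0.5])
tde_logits = np.array([2.0, 1.0, 1.5])
probs = background_exempted_probs(orig_logits, tde_logits)
```

Under this rule the foreground-background decision is untouched by TDE, while the ordering among foreground classes follows the de-biased scores.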

Appendix D The Difference Between Re-balancing NDE and The Proposed TDE

In this section, we further discuss the relationship between the two-stage re-balancing NDE and the proposed TDE. As discussed in Section 4.3 of the original paper, the 2nd-stage re-balanced classifier essentially calculates $NDE(Y_{i})=[Y_{\boldsymbol{d}^{\prime}}=i|do(X=\boldsymbol{x})]-[Y_{\boldsymbol{d}^{\prime}}=i|do(X=\boldsymbol{x}^{\prime})]$, where the second term can be omitted: $\boldsymbol{x}^{\prime}$ is a dummy vector and the moving-averaged $\boldsymbol{d}^{\prime}$ in a balanced set won’t point to any specific class, so this term is just a constant offset. Therefore, the crux of understanding the NDE is why the 2nd-stage re-balanced training is equivalent to the first term $[Y_{\boldsymbol{d}^{\prime}}=i|do(X=\boldsymbol{x})]$. The reason is that freezing the backbone breaks the dependency $M\rightarrow X$, which is a straightforward implementation of the causal intervention $do(X=\boldsymbol{x})$. The original OLTR violates this intervention by fine-tuning the backbone parameters in the 2nd stage, and it thus performs much worse in Table 2 of the original paper than Decouple-OLTR, which freezes the backbone parameters. Meanwhile, the balanced re-sampling also brings a fair $\boldsymbol{d}^{\prime}$, as discussed in the third paragraph of Section A.

To better illustrate both the similarity and the difference between the re-balancing NDE and the proposed TDE, we constructed a one-dimensional binary classification example for the conventional classifier, the one-/two-stage re-balancing classifiers, and the proposed TDE in Figure 7, where the Gaussian curve represents the feature distribution generated by the backbone, and the 0 point is the classifier’s decision boundary. The conventional classifier and one-stage re-balancing are fundamentally problematic, because they either cause a mismatch at inference or learn a bad backbone model. In contrast, both two-stage re-balancing and the proposed TDE are able to correctly remove the bias by proper adjustments. The 2nd-stage re-balanced training (NDE) fixes the backbone parameters $do(X=\boldsymbol{x})$ learnt from the 1st-stage imbalanced training, i.e., the frozen curve in the figure, and then re-samples an artificially balanced data distribution to create a fair $\boldsymbol{d}^{\prime}$. The overall re-balancing NDE can be considered as subtracting a bias offset from the original decision boundary. Meanwhile, the proposed TDE removes the bias effect (head projection) from the feature vectors. Both types of adjustment can properly remove the head bias in this example. That’s why TDE and NDE should be theoretically identical in the long-tailed classification scenario.
However, the 2nd-stage re-balancing NDE has two disadvantages: 1) its adjustment requires an additional training stage to fine-tune the classifier weights, which relies on access to the data distribution; 2) if non-linear modules are applied to the feature vectors, e.g., a global context layer that conducts interactions among all objects $\{\boldsymbol{x}_{j}\}$ in an image, the NDE can only remove a linear approximation of this non-linearly activated head bias, while the TDE is able to maintain the natural interactions of features in both the original logit term and the subtracted counterfactual term. This explains why Decouple-OLTR in Table 2 of the original paper doesn’t perform as well as Decouple-$\tau$-norm or Decouple-LWS: OLTR involves non-linear interactions between feature vectors and memory vectors, so a linear adjustment of the classifier’s decision boundary cannot completely remove the head bias.
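The one-dimensional intuition can be reproduced with a toy sketch. It is not the construction behind Figure 7: it assumes that the head bias is a single known additive shift `head_bias` on otherwise clean features, which is exactly the linear case in which shifting the decision boundary and correcting the features coincide. All names and numbers are hypothetical.

```python
import numpy as np


def conventional(x):
    """Predict head (1) when the biased feature is positive, tail (0) otherwise."""
    return (x > 0).astype(int)


def nde_style(x, offset):
    """Keep the features and move the decision boundary by a bias offset."""
    return (x > offset).astype(int)


def tde_style(x, head_bias):
    """Remove the head bias from the features and keep the boundary at 0."""
    return ((x - head_bias) > 0).astype(int)


# Clean 1-D features (assumption): negative -> tail (0), positive -> head (1).
clean = np.array([-2.0, -1.0, -0.3, 0.4, 1.5])
truth = (clean > 0).astype(int)

head_bias = 0.5             # hypothetical shift towards the head class
biased = clean + head_bias  # features as emitted by a head-biased backbone

pred_conventional = conventional(biased)
pred_nde = nde_style(biased, head_bias)
pred_tde = tde_style(biased, head_bias)
```

Here the conventional classifier mislabels the tail sample closest to the boundary as head, while both adjustments recover the clean labels. The two stop being interchangeable once the bias enters through a non-linear module, which is the second disadvantage noted above.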

Appendix E Additional Ablation Studies

The hyper-parameters used in the original paper were selected according to the performance on the ImageNet-LT val set, as shown in Table 6. To further study the multi-head strategy on different normalized classifiers, we tested $K=2$ on the cosine classifier and the capsule classifier in Table 7. The results show that the advantage of the proposed de-confounded model doesn’t come from a larger $K$, and that the multi-head fine-grained sampling generally improves the de-confounded training, no matter which normalization function we choose.

As shown in Tables 8 and 9, we tested the proposed method on different backbones. When equipped with ResNeXt-101-32x4d and ResNeXt-101-64x4d for ImageNet-LT and LVIS V0.5, respectively, the proposed method gains additional improvements. On the ImageNet-LT dataset, we changed some hyper-parameters ($K=4,\gamma=1/64.0$) and increased the number of training epochs to 120, because of the significantly increased number of model parameters. The hyper-parameters for LVIS are the same as in the original paper.

We also report the performance of the proposed method on the LVIS V0.5 evaluation test server in Table 10, where we used the ResNeXt-101-64x4d backbone and the original hyper-parameters. It’s worth noting that these are single-model results, which neither exploited external datasets nor used any model enhancement tricks.