Strategic Preys Make Acute Predators: Enhancing Camouflaged Object Detectors by Generating Camouflaged Objects
Chunming He, Kai Li, Yachao Zhang, Yulun Zhang, Zhenhua Guo, Xiu Li, Martin Danelljan, Fisher Yu
Introduction
The never-ending prey-vs-predator game drives preys to develop various escaping strategies. One of the most effective and ubiquitous strategies is camouflage. Preys use camouflage to blend into the surrounding environment, striving to escape hunting from predators. For survival, predators, on the other hand, must develop acute vision systems to decipher camouflage tricks.
Camouflaged object detection (COD) is the task that aims to mimic predators’ vision systems and localize foreground objects that have subtle differences from the background. The intrinsic similarity between camouflaged objects and their backgrounds makes COD more challenging than traditional object detection, and the task has attracted increasing research attention for its potential applications in medical image analysis, species discovery, and ecological protection.
Traditional COD solutions rely on manually designed detection strategies with hand-crafted extractors, and thus are constrained by the limited feature discriminability.
Benefiting from the powerful feature extraction capacity of convolutional neural networks, a series of deep learning-based methods have been proposed and have achieved remarkable success on the COD task. However, in some extreme camouflage scenarios, those methods still struggle to excavate the discriminative cues needed to precisely localize objects of interest. For example, as shown in the top row of Fig. 1, the state-of-the-art COD method FGANet cannot even roughly localize the object and thus produces a completely wrong result. Even when a rough position can be obtained, FGANet still fails to precisely segment the objects, as shown in the middle and bottom rows of Fig. 1: the results are either incomplete (middle row: some key parts of the dog are missing) or ambiguous (bottom row: the boundaries of the frog are not segmented out).
This paper aims to address these limitations. We are inspired by the prey-vs-predator game, in which preys develop more deceptive camouflage skills to escape predators, which in turn pushes predators to develop more acute vision systems to discern the camouflage tricks. This game leads to ever-strategic preys and ever-acute predators. With this inspiration, we propose to address COD by developing algorithms on both sides: the prey side, which generates more deceptive camouflaged objects, and the predator side, which produces complete and precise detection results.
On the prey side, we propose a novel adversarial training framework, Camouflageator, which generates more camouflaged objects that are even harder for existing detectors to detect and thus enhances the generalizability of the detectors. Specifically, as shown in Fig. 2, Camouflageator comprises an auxiliary generator and a detector, which can be any existing detector. We adopt an alternating two-phase training mechanism to train the generator and the detector. In Phase I, we fix the detector and train the generator to synthesize camouflaged objects that aim to deceive the detector. In Phase II, we fix the generator and train the detector to accurately segment the synthesized camouflaged objects. By iteratively alternating Phases I and II, the generator and the detector both evolve, which helps to obtain better COD results.
On the predator side, we present a novel COD detector, named Internal Coherence and Edge Guidance (ICEG), which particularly aims to address the issues of incomplete segmentation and ambiguous boundaries of existing COD detectors. For incomplete segmentation, we introduce a camouflaged feature coherence (CFC) module to excavate the internal coherence of camouflaged objects. We first explore the feature correlations using two feature aggregation components, i.e., the intra-layer feature aggregation and the contextual feature aggregation. Then, we propose a camouflaged consistency loss to constrain the internal consistency of camouflaged objects. To eliminate ambiguous boundaries, we propose an edge-guided separated calibration (ESC) module. ESC separates foreground and background features using attentive masks to decrease uncertainty boundaries and remove false predictions. Additionally, ESC leverages edge features to adaptively guide segmentation and reinforce the feature-level edge information to achieve the sharp edge for segmentation results.
Our contributions are summarized as follows:
We introduce an adversarial training framework, Camouflageator, for the COD task. Camouflageator introduces an auxiliary generator that generates more camouflaged objects that are harder for COD detectors to detect, and hence enhances the generalizability of the COD detectors. Camouflageator is flexible and can be integrated with various existing COD detectors.
We propose a new COD detector, ICEG, to address the issues of incomplete segmentation and ambiguous boundaries that existing detectors face. ICEG introduces a novel CFC module to excavate the internal coherence of camouflaged objects to obtain complete segmentation results, and an ESC module to leverage edge information to get precise boundaries.
Experiments on four datasets verify that Camouflageator can promote the performance of various existing COD detectors, ICEG significantly outperforms existing COD detectors, and integrating Camouflageator with ICEG reaches even better results.
Related work
Traditional COD methods rely on hand-crafted operators with limited feature discriminability and thus struggle to handle complex scenarios. Learning-based approaches have recently become mainstream in COD and fall into three main categories: (i) Multi-stage framework: SegMaR was the first plug-and-play framework for the COD task to integrate segment, magnify, and reiterate under a multi-stage framework. However, SegMaR has limited flexibility because it is not end-to-end trainable. (ii) Multi-scale feature aggregation: PreyNet proposed a bidirectional bridging interaction module to aggregate cross-layer features with attentive guidance. Similarly, FGANet designed a collaborative local information interaction module and a global information interaction module to aggregate structure context features. (iii) Joint training strategy: LSR presented the first multi-task framework for COD to simultaneously localize, segment, and rank camouflaged objects using a joint training strategy. Analogously, BGNet jointly trained the edge detection task with the COD task and guided the segmentation with the detected edge.
We improve existing methods in three aspects: (i) Camouflageator is the first end-to-end trainable plug-and-play framework for COD, thus ensuring flexibility. (ii) ICEG is the first COD detector to alleviate incomplete segmentation by excavating the internal coherence of camouflaged objects. (iii) Unlike existing edge-based detectors, ICEG employs edge information to guide segmentation adaptively under the separated attentive framework.
2 Adversarial training
Adversarial training is a widely used solution with many applications, including adversarial attacks and generative adversarial networks (GANs). Recently, several GAN-based methods have been proposed for the COD task. JCOD introduced a GAN-based framework to measure the prediction uncertainty. ADENet employed a GAN to weigh the contribution of depth for COD. Distinct from those GAN-based methods, our Camouflageator enhances the generalizability of existing COD detectors by generating more camouflaged objects that are harder to detect.
Methodology
When preys develop more deceptive camouflage skills to escape predators, the predators respond by evolving more acute vision systems to discern the camouflage tricks. Drawing inspiration from this prey-vs-predator game, we propose to address COD by developing the Camouflageator and ICEG algorithms that mimic preys and predators, respectively, to generate more camouflaged objects and to more accurately detect those camouflaged objects.
Camouflageator is an adversarial training framework that employs an auxiliary generator to synthesize more camouflaged objects that are even harder for existing detectors to detect, and thus enhances the generalizability of the detectors. We train the generator and the detector alternately in a two-phase adversarial training scheme. Fig. 2 shows the framework.
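The alternation between the two phases can be sketched as a simple loop. This is only a schematic under stated assumptions: `alternate_training`, `generator_step`, and `detector_step` are hypothetical names, not taken from the paper, and each callback is assumed to freeze the other network, perform one training pass, and return its loss.

```python
from typing import Callable, List, Tuple


def alternate_training(
    generator_step: Callable[[], float],
    detector_step: Callable[[], float],
    num_rounds: int,
) -> List[Tuple[str, float]]:
    """Alternate the two phases; each step is assumed to return its loss."""
    history: List[Tuple[str, float]] = []
    for _ in range(num_rounds):
        # Phase I: detector frozen, generator updated to deceive it.
        history.append(("phase_I_generator", generator_step()))
        # Phase II: generator frozen, detector updated on synthesized images.
        history.append(("phase_II_detector", detector_step()))
    return history
```

Keeping the two updates in separate callbacks mirrors the requirement that only one of the two networks is trained in each phase.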
Training the generator. In Phase I, we fix the detector and train the generator to produce more deceptive objects that cause the detector to fail. Given a camouflaged image, the generator synthesizes a new image that we expect to be more deceptive to the detector than the original. To achieve this, the synthesized image should be visually consistent with the original (similar in global appearance) but have the discriminative features crucial for detection hidden or reduced.
To encourage visual consistency, we propose to optimize the fidelity loss represented by the following formulation:
Here the loss is computed with the ground-truth binary mask and element-wise multiplication. Since the complement of the ground-truth mask is the background mask, this term in essence encourages the generated image to be similar to the original image in the background region. We encourage fidelity by preserving only the background rather than the whole image, because preserving the whole image would hinder the generation of camouflaged objects in the foreground.
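A minimal NumPy sketch of a background-only fidelity term follows. It assumes a mean squared difference as the distance, which the text above does not confirm, and it assumes images of shape H x W x C with a binary mask of shape H x W; the function name `fidelity_loss` is illustrative.

```python
import numpy as np


def fidelity_loss(original: np.ndarray, generated: np.ndarray, mask: np.ndarray) -> float:
    """Mean squared difference restricted to the background (mask == 0)."""
    # Background mask, expanded so it broadcasts over the channel axis.
    background = (1.0 - mask)[..., None]
    # Element-wise masking removes any contribution from foreground pixels.
    diff = background * generated - background * original
    return float(np.mean(diff ** 2))
```

With this form, arbitrary changes inside the foreground leave the loss at zero, so the generator remains free to alter the object itself.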
To hide discriminative features, we optimize the following concealment loss to imitate the bio-camouflage strategies, i.e., internal similarity and edge disruption , as
The concealment loss uses three quantities: a weighted edge mask, dilated by a Gaussian function to capture richer edge information; an image-level object prototype, which is the average of the foreground pixels; and an image-level edge prototype, which is the average of the edge pixels specified by the edge mask. All three are derived from the provided ground truth and help to train the model. This term encourages individual pixels of the foreground region and the edge region of the generated image to be similar to the average values, which has a smoothing effect and thus hides discriminative features.
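The prototype idea can be illustrated with a short NumPy sketch. Several points are assumptions rather than the paper's definition: the prototypes are taken from the original image under the ground-truth masks, the distance is a weighted mean squared distance, and the edge mask is passed in ready-made (the Gaussian dilation is omitted). The helper names are hypothetical.

```python
import numpy as np


def prototype(image, weights, eps=1e-8):
    """Weighted average pixel value over the region selected by `weights` (H x W)."""
    return (image * weights[..., None]).sum(axis=(0, 1)) / (weights.sum() + eps)


def region_spread(image, proto, weights, eps=1e-8):
    """Weighted mean squared distance between the pixels of a region and a prototype."""
    sq_dist = ((image - proto) ** 2).sum(axis=-1)
    return float((weights * sq_dist).sum() / (weights.sum() + eps))


def concealment_loss(original, generated, mask, edge_mask):
    """Pull foreground and edge pixels of `generated` toward their prototypes."""
    obj_proto = prototype(original, mask)        # image-level object prototype
    edge_proto = prototype(original, edge_mask)  # image-level edge prototype
    return (region_spread(generated, obj_proto, mask)
            + region_spread(generated, edge_proto, edge_mask))
```

A generated image whose foreground is flattened toward the object prototype scores lower than the untouched original, which is the smoothing effect described above.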
Apart from the above concealment loss, we further employ the detector to reinforce the concealment effect. The idea is that if the generated image is perfectly deceptive, the detector tends to detect nothing as foreground. To this end, we optimize
where the target is an all-zero mask with the same size as the ground-truth mask. L^{w}_{BCE}(\cdot) and L^{w}_{IoU}(\cdot) are the weighted binary cross-entropy loss and the weighted intersection-over-union loss.
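The sketch below illustrates the all-zero-target idea in NumPy. As a simplification it uses the plain (unweighted) binary cross-entropy and a soft IoU loss rather than the weighted variants named above; the function names and the smoothing constants are assumptions, and `pred` is assumed to hold per-pixel foreground probabilities in [0, 1].

```python
import numpy as np


def bce_loss(pred, target, eps=1e-7):
    """Plain binary cross-entropy on probabilities (unweighted simplification)."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))


def iou_loss(pred, target, smooth=1.0):
    """Soft IoU loss with additive smoothing to avoid division by zero."""
    inter = (pred * target).sum()
    union = (pred + target).sum() - inter
    return float(1.0 - (inter + smooth) / (union + smooth))


def deception_loss(pred):
    """Penalize any foreground response: the target is an all-zero mask."""
    zeros = np.zeros_like(pred)
    return bce_loss(pred, zeros) + iou_loss(pred, zeros)
```

The loss is near zero when the detector predicts no foreground on the generated image and grows as the detector becomes more confident that an object is present.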
Our overall learning objective for training the generator is as follows:
Training the detector. In Phase II, we fix the generator and train the detector to accurately segment the synthesized camouflaged objects. This is the standard COD task and various existing COD detectors can be employed, for example, the simple one we used above,
2 ICEG
We further propose a novel detector, ICEG, to solve the problems of existing detectors. Given of size , we start by using a basic encoder to extract a set of deep features with the resolution of and employ ResNet50 as the default architecture. As shown in Fig. 3, we then feed these features, i.e., , to the camouflaged feature coherence (CFC) module and the edge-guided segmentation decoder (ESD) for further processing. Moreover, the last feature map , which has rich semantic cues, is fed into an atrous spatial pyramid pooling (ASPP) module and a convolution to generate a coarse segmentation result : , where shares the same spatial resolution with .
To alleviate incomplete segmentation, we propose the camouflaged feature coherence (CFC) module to excavate the internal coherence of camouflaged objects. CFC consists of two feature aggregation components, i.e., the intra-layer feature aggregation (IFA) and the contextual feature aggregation (CFA), to explore feature correlations. Additionally, CFC introduces a camouflaged consistency loss to constrain the internal consistency of camouflaged objects.
Intra-layer feature aggregation. As shown in Fig. 4, IFA seeks feature correlations by integrating multi-scale features with different receptive fields within a single layer, ensuring that the aggregated features capture scale-invariant information. Given the feature of a layer, a convolution is first applied for channel reduction, followed by two parallel convolutions with different kernel sizes. This process produces two features with varying receptive fields:
where is convolution. Then we combine and , process them with two parallel convolutions, and multiply the outputs to excavate the scale-invariant information:
where conca(\cdot) denotes concatenation. We then integrate the three features and process them with a CRB block CRB(\cdot), i.e., convolution, ReLU, and batch normalization. By summing with the channel-wise down-sampled feature, the aggregated features are formulated as follows:
Contextual feature aggregation. CFA explores the inter-layer feature correlations by selectively interacting cross-level information with channel attention and spatial attention, which ensures the retention of significant coherence. The aggregated feature is:
where up(\cdot) is the up-sampling operation. CA(\cdot) and SA(\cdot) are channel attention and spatial attention. Note that . Having acquired , the integrated features conveyed to the decoder are formulated as follows:
We employ for channel integration and .
Camouflaged consistency loss. To enforce the internal consistency of the camouflaged object, we propose a camouflaged consistency loss to enable more compact internal features. To achieve this, one intuitive idea is to decrease the variance of the camouflaged internal features. However, such a constraint can lead to feature collapse, i.e., all extracted features are too clustered to be separated, thus diminishing the segmentation capacity. Therefore, apart from the above constraint, we propose an extra requirement to keep the internal and external features as far away as possible. We apply the feature-level consistency loss to the deepest feature for its abundant semantic information:
where is the down-sampled ground truth mask. and denote the feature-level prototypes of the camouflaged object and the background, respectively.
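The two requirements above (compact internal features, well-separated internal and external features) can be sketched as follows. The exact form of the loss is not recoverable from the text, so this NumPy sketch assumes one plausible combination: the mean squared distance of foreground features to the object prototype, minus the squared distance between the object and background prototypes. Features are assumed to have shape H x W x C and the mask shape H x W; `consistency_loss` is a hypothetical name.

```python
import numpy as np


def consistency_loss(features, mask, eps=1e-8):
    """Compactness of foreground features minus prototype separation (assumed form)."""
    fg = mask[..., None]
    bg = 1.0 - fg
    # Feature-level prototypes of the camouflaged object and the background.
    fg_proto = (features * fg).sum(axis=(0, 1)) / (fg.sum() + eps)
    bg_proto = (features * bg).sum(axis=(0, 1)) / (bg.sum() + eps)
    # Term 1: foreground features should cluster around their prototype.
    sq_dist = ((features - fg_proto) ** 2).sum(axis=-1)
    compactness = (mask * sq_dist).sum() / (mask.sum() + eps)
    # Term 2: object and background prototypes should stay far apart,
    # which counteracts the feature collapse described above.
    separation = ((fg_proto - bg_proto) ** 2).sum()
    return float(compactness - separation)
```

Minimizing the first term alone would be satisfied by mapping every feature to the same point; the second term removes that trivial solution.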
Discussions. Apart from focusing on feature correlations as in existing detectors, we design a novel camouflaged consistency loss to enhance the internal consistency of camouflaged objects, facilitating complete segmentation.
2.2 Edge-guided segmentation decoder
As depicted in Fig. 3, the edge-guided segmentation decoder (ESD) comprises an edge reconstruction (ER) module and an edge-guided separated calibration (ESC) module to generate the edge predictions and the segmentation results, respectively.
Edge reconstruction module. We introduce an ER module to reconstruct the object boundary. Assisted by the edge map and the segmentation feature from the former decoder, the edge feature is presented as follows:
where and . and are set as zero for initialization. We repeat as a 64-dimension tensor to ensure channel consistency with in Eq. 13.
Edge-guided separated calibration module. Ambiguous boundary, a common problem in COD, manifests as a high degree of uncertainty in the fringes and the unclear edge of the segmented object. We have observed that the high degree of uncertainty is mainly due to the intrinsic similarity between the camouflaged object and the background. To address this issue, we separate the features from the foreground and the background by introducing the corresponding attentive masks, and design a two-branch network to process the attentive features. This approach helps decrease uncertainty fringes and remove false predictions, including false-positive and false-negative errors. Given the prediction map , the network is defined as follows:
where and are the foreground attentive feature and the background attentive feature, which are formulated as:
where S(\cdot) and R(\cdot) are the Sigmoid and reverse operators; the reverse operator is element-wise subtraction from 1. RCAB(\cdot) is the residual channel attention block, which is used to emphasize informative channels and high-frequency information.
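The foreground/background separation can be sketched in NumPy as below. The sketch assumes that the attentive masks are the Sigmoid of the prediction map and its reverse, applied to the features by element-wise multiplication; the residual channel attention block is omitted, and the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def split_attentive_features(features, pred_logits):
    """Split H x W x C features into foreground and background attentive parts."""
    fg_mask = sigmoid(pred_logits)  # Sigmoid of the prediction map
    bg_mask = 1.0 - fg_mask         # reverse operator: subtraction from 1
    fg_features = features * fg_mask[..., None]
    bg_features = features * bg_mask[..., None]
    return fg_features, bg_features
```

Because the two masks sum to one at every pixel, the two branches partition the feature map softly, and pixels with an uncertain prediction contribute to both branches.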
The second phenomenon, unclear edge, is due to the extracted features giving insufficient importance to edge information. In this case, we explicitly incorporate edge features to guide the segmentation process and promote edge prominence. Instead of simply superimposing, we design an adaptive normalization (AN) strategy with edge features to guide the segmentation in a variational manner, which reinforces the feature-level edge information and thus ensures the sharp edge of the segmented object. Given the edge feature , the attentive features can be acquired by:
where and are the corresponding variational parameters. In AN, can be calculated by:
Discussions. Unlike existing edge-guided methods that focus only on edge guidance, we combine edge guidance with foreground/background splitting using attentive masks. This integration enables us to decrease uncertainty fringes and remove false predictions along edges.
2.3 Loss functions of ICEG
Apart from the camouflaged consistency loss, ICEG is also constrained by the segmentation loss and the edge loss, which supervise the segmentation results and the reconstructed edge results, respectively. Following prior work, the segmentation loss consists of L^{w}_{BCE}(\cdot) and L^{w}_{IoU}(\cdot):
For edge supervision, we employ the dice loss L_{dice}(\cdot) to overcome the extreme imbalance in edge maps:
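A minimal NumPy sketch of a dice loss is given here for illustration. It uses one common variant (sums of predictions and targets in the denominator, with additive smoothing); the exact variant and the smoothing constant used by ICEG are assumptions.

```python
import numpy as np


def dice_loss(pred, target, smooth=1.0):
    """Dice loss: 1 - 2|P ∩ G| / (|P| + |G|), with additive smoothing."""
    inter = (pred * target).sum()
    denom = pred.sum() + target.sum()
    return float(1.0 - (2.0 * inter + smooth) / (denom + smooth))
```

The loss is normalized by the size of the predicted and true edge regions rather than by the number of pixels. For an edge map with 4 edge pixels out of 100, an all-zero prediction is 96% accurate pixel-wise, yet its dice loss under this sketch is 0.8, so missing the sparse edge is still penalized heavily.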
Datasets. We use four COD datasets for evaluation: CHAMELEON, CAMO, COD10K, and NC4K. CHAMELEON comprises 76 camouflaged images. CAMO contains 1,250 images in 8 categories. COD10K consists of 5,066 images in 10 super-classes. NC4K is the largest test set, with 4,121 images. Following the common setting, our training set comprises 1,000 images from CAMO and 3,040 images from COD10K, and our test set comprises the remaining images from the four datasets.
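The split sizes implied by these numbers can be worked out with a few lines of Python. The only assumption is that each dataset's test portion is its total minus the images used for training; the variable names are illustrative.

```python
# Image counts stated in the text.
totals = {"CHAMELEON": 76, "CAMO": 1250, "COD10K": 5066, "NC4K": 4121}
train = {"CAMO": 1000, "COD10K": 3040}

# Test portion of each dataset: everything not used for training.
test = {name: count - train.get(name, 0) for name, count in totals.items()}

train_total = sum(train.values())
test_total = sum(test.values())
```

Under this assumption the training set has 4,040 images, and the test sets hold 76 (CHAMELEON), 250 (CAMO), 2,026 (COD10K), and 4,121 (NC4K) images.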
Metrics. Following previous methods, we employ four commonly used metrics: mean absolute error, adaptive F-measure, mean E-measure, and structure measure. A smaller mean absolute error and larger F-measure, E-measure, and structure measure signify better performance.
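Of the four metrics, the mean absolute error is the simplest to state in code: it is the average per-pixel absolute difference between the predicted map and the ground-truth mask. The sketch below assumes both are arrays of the same shape with values in [0, 1]; the function name is illustrative.

```python
import numpy as np


def mean_absolute_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average per-pixel absolute difference between prediction and ground truth."""
    return float(np.mean(np.abs(pred - gt)))
```

For example, a 2 x 2 prediction that is off by 0.5 at one pixel and exact elsewhere has a mean absolute error of 0.125.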
2 Comparison with the state-of-the-arts
Quantitative analysis. We compare our ICEG with 13 state-of-the-art (SOTA) solutions in three different settings. Apart from the common setting (single input scale and single stage), two other settings (multiple input scales and multiple stages) are also included for a comprehensive evaluation, where ICEG follows the corresponding practices of ZoomNet and SegMaR . As shown in Sec. 3.2.3, ICEG outperforms the SOTAs by a large margin in all settings and backbones. For instance, in the common setting, ICEG overall surpasses the second-best methods in , , with the backbone of ResNet50 (FGANet ), Res2Net50 (BSA-Net ), Swin Transformer (DTIT ). Moreover, we also present the results of detectors optimized under the Camouflageator framework in the common setting. In Sec. 3.2.3, Camouflageator generally improves other detectors by (PreyNet) and (FGANet) and increases our ICEG by (ResNet50), (Res2Net50), and (Swin Transformer), which verifies that our Camouflageator is a plug-and-play framework. Results of the compared methods are generated by their provided models for fairness.
Qualitative analysis. Fig. 5 shows that ICEG produces more complete predictions than existing methods, especially for large objects whose intrinsic correlations are more dispersed (the last two rows). This substantiates the effectiveness of the proposed CFC module, which excavates the internal coherence of camouflaged objects to generate more complete prediction maps. ICEG also produces clearer edges than the existing methods, thanks to the proposed ESD module, which decreases uncertainty fringes and eliminates unclear edges of the segmented object. Finally, ICEG+ obtains even better results than ICEG, further verifying the effectiveness of the proposed Camouflageator framework.
3 Ablation study and analysis
We conduct the ablation study and analysis on the two largest datasets, i.e., COD10K and NC4K. The results for the most significant components are shown here; more can be found in the supplementary material.
Conclusion
In this paper, we propose to address COD on both the prey and predator sides. On the prey side, we introduce a novel adversarial training strategy, Camouflageator, to enhance the generalizability of the detector by generating more camouflaged objects that are harder for a COD detector to detect. On the predator side, we design a novel detector, dubbed ICEG, to address the issues of incomplete segmentation and ambiguous boundaries. Specifically, ICEG employs the CFC module to excavate the internal coherence of camouflaged objects and applies the ESD module for edge prominence, thus producing complete and precise detection results. Extensive experiments demonstrate the effectiveness of Camouflageator and the superiority of ICEG.