Capsule-Forensics: Using Capsule Networks to Detect Forged Images and Videos

Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

INTRODUCTION

Forged images and videos can be used to bypass facial authentication and to create fake news media. The quality of manipulated images and videos has improved significantly with the development of advanced network architectures and the use of large amounts of training data, and this has dramatically simplified the creation of facial forgeries. Nowadays, all that is needed to create a forged facial image is a short video of the target person or an ID photo. The techniques developed by Chung et al. and Suwajanakorn et al. can improve the ability of attackers to learn the mapping between speech and lip motion, enabling the creation of fully synthesized audio-video data for any person. In this age of social networks serving as major sources of information, fake news with manipulated multimedia can spread quickly and have significant effects. The “deepfake” phenomenon is a good example of this threat: any person with a personal computer can create videos incorporating the facial image of any celebrity by using a human image synthesis technique based on artificial intelligence.

Several countermeasures have been proposed to deal with manipulated images and videos. However, most of them target particular types of attacks. For example, local binary pattern (LBP)-based methods are effective against replay attacks, in which the attacker places a printed photo or displays a video on a screen in front of the camera. In contrast, the eyes-focused method designed to detect deepfake forgeries can fail against a replay attack when the video displayed is of the actual target person. Other methods generalize better; for instance, Fridrich and Kodovsky’s method can be applied to both steganalysis and the detection of facial reenactment videos. However, its performance on secondary tasks is limited in comparison with task-specific methods such as that of Rossler et al. Moreover, while some methods can detect a single forged image, others require video input.

This paper presents a method that uses a capsule network to detect forged images and videos in a wide range of forgery scenarios, including replay attack detection and (both fully and partially) computer-generated image/video detection. This is pioneering work in the use of capsule networks, which were originally designed for computer vision problems, to solve digital forensics problems. A comprehensive survey of state-of-the-art related work and intensive comparisons using four major datasets demonstrated the superior performance of the proposed method.

RELATED WORK

In this section, we group forgery detection approaches into replay attack detection and computer-generated image/video detection on the basis of the features used and their target. Note that some approaches cover both categories while others are applicable only to certain types of attacks. We also provide some basic information about capsule networks and the dynamic routing algorithm that made this kind of network practical.

Prior to the current deep learning era, LBP methods were the primary defense against replay attacks. The method introduced by Kim et al., which is based on local patterns of the diffusion speed (“local speed patterns”), achieves higher accuracy than LBP-based methods. Now, with the introduction of deep learning, the ability to detect replay attacks has been greatly improved. The method of Yang et al. uses a support vector machine to classify features extracted by a pre-trained convolutional neural network (CNN). That of Menotti et al. uses a similar procedure but optimizes the filters in an available high-performance CNN architecture. The method of Alotaibi and Mahmood uses nonlinear diffusion based on an additive operator splitting scheme in their own CNN. The recently introduced method of Ito et al. leverages a pre-trained CNN and utilizes the whole image instead of only the extracted face region.

2 Computer-Generated Image/Video Detection

There are several state-of-the-art methods for detecting computer-generated images or videos produced for the purpose of forgery using, for example, a deepfake technique for face swapping, the Face2Face method for facial reenactment, or the deep video portraits technique. Fridrich and Kodovsky proposed a noise-based approach with hand-crafted features for steganalysis that can also be used for forgery detection. Cozzolino et al. implemented a CNN version of this approach. Raghavendra et al. described the special case of fine-tuning two available CNNs while Rossler et al. used only one CNN. Bayar and Stamm, Rahmouni et al., Afchar et al., Quan et al., and Li et al. proposed their own networks. Li et al.’s network, for example, is video based and uses temporal information to detect eye blinking. In our previous work, we used a hybrid approach incorporating part of a pre-trained VGG (Visual Geometry Group)-19 network and a proposed CNN. Zhou et al. proposed a two-stream network.

3 Capsule Networks

Hinton et al. addressed the limitations of CNNs applied to inverse graphics tasks and laid the foundation for a more robust “capsule” architecture in 2011. However, this complex architecture could not be effectively implemented at the time due to the lack of an efficient algorithm and the limitations of computer hardware. Instead, easy-to-design, easy-to-train CNNs became widely used. Now, with the introduction of the dynamic routing algorithm and the expectation-maximization routing algorithm, capsule networks have been implemented with remarkable initial results. Two recent studies demonstrated that, with the agreement between capsules calculated by the dynamic routing algorithm, the hierarchical pose relationships between object parts can be well described. This has improved the accuracy of vision tasks. Application of a capsule network to the forensics task, the focus of this paper, is a challenging problem. However, the agreement between capsules achieved by using the dynamic routing algorithm could boost detection performance on complex and nearly flawless forged images and videos.

CAPSULE-FORENSICS

The proposed method (Fig. 1) works for both images and videos. For video input, the video is split into frames in the pre-processing phase. The classification results (posterior probabilities) are then acquired from the frames. The probabilities are averaged in the post-processing phase to get the final result. The remaining steps are the same as for image input.
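The video-level post-processing described above (averaging the frame-level posterior probabilities) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the convention that each probability is the posterior of the "fake" class, and the 0.5 decision threshold are all assumptions made for the example.

```python
from statistics import fmean


def video_score(frame_probs):
    """Average the per-frame posterior probabilities of the 'fake' class.

    frame_probs: one probability per frame (assumed convention).
    """
    if not frame_probs:
        raise ValueError("at least one frame probability is required")
    return fmean(frame_probs)


def video_label(frame_probs, threshold=0.5):
    """Map the averaged probability to a label.

    The 0.5 threshold is a hypothetical choice, not taken from the paper.
    """
    return "fake" if video_score(frame_probs) >= threshold else "real"


if __name__ == "__main__":
    probs = [0.9, 0.8, 0.4]
    print(video_score(probs), video_label(probs))
```

Averaging lets a few misclassified frames be outvoted by the rest of the video, which is one plausible reason for aggregating at this stage.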

In the pre-processing phase, faces are detected and scaled to $128\times 128$. As in our previous work, we use part of the VGG-19 network to extract the latent features, which are the inputs to the capsule network. Unlike in our previous work, however, we take the output of the third max-pooling layer instead of three outputs before the ReLU layers. We do this to reduce the size of the inputs to the capsule network.
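The size reduction obtained by cutting VGG-19 at the third max-pooling layer can be checked with simple shape arithmetic. The sketch below assumes the standard VGG-19 configuration (3x3 convolutions with padding 1, each block followed by 2x2 max pooling with stride 2); the block table and the function name are illustrative and are not taken from the paper.

```python
# (output channels, number of 3x3 conv layers) for the first three blocks
# of the standard VGG-19 configuration (assumed, not stated in the paper).
VGG19_FIRST_BLOCKS = [(64, 2), (128, 2), (256, 4)]


def feature_shape(height, width, blocks=VGG19_FIRST_BLOCKS):
    """Return (channels, height, width) after the given VGG-style blocks."""
    channels = 3  # RGB input
    for out_channels, _num_convs in blocks:
        channels = out_channels  # 3x3 convs with padding 1 keep H and W
        height //= 2             # 2x2 max pooling, stride 2
        width //= 2
    return channels, height, width


if __name__ == "__main__":
    print(feature_shape(128, 128))  # (256, 16, 16)
```

Under these assumptions, a $128\times 128$ face crop yields a $256\times 16\times 16$ feature tensor as the input to the capsule network.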

2 Capsule Design

The proposed network consists of three primary capsules and two output capsules, one for real and one for fake images (Fig. 2). The latent features extracted by part of the VGG-19 network are the inputs, which are distributed to the three primary capsules (Fig. 3). As in our previous work, statistical pooling, which is important for forgery detection, is used. The outputs of the three capsules ($\textbf{u}_{j|i}$) are dynamically routed to the output capsules ($\textbf{v}_{j}$) for $r$ iterations using Algorithm 1. The network has approximately 2.8 million parameters, a relatively small number for such networks. We slightly improved the algorithm of Sabour et al. by adding Gaussian random noise to the 3-D weight tensor $W$ and applying one additional $squash$ (equation 1) before the routing iterations. The added noise helps reduce over-fitting while the additional squash keeps the network more stable. The outputs of the primary and output capsules are illustrated in Fig. 4.
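The routing step can be sketched in plain Python as below. This is not a reproduction of Algorithm 1: it follows the routing-by-agreement scheme of Sabour et al., with Gaussian noise added to the weights and a squash applied to the prediction vectors before the iterations. The exact placement of the noise and of the extra squash, the names `squash`, `softmax`, `matvec`, and `route`, and the parameter `noise_std` are assumptions for illustration. Statistical pooling and the convolutional parts of the capsules are omitted.

```python
import math
import random


def squash(s):
    """Squash of Sabour et al.: keep direction, map length into [0, 1)."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0] * len(s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def route(u, W, r=2, noise_std=0.0, rng=None):
    """Route primary capsule outputs u[i] to output capsules.

    W[i][j] is the matrix mapping primary capsule i to output capsule j.
    noise_std > 0 adds Gaussian noise to W (training only in the paper).
    """
    rng = rng or random.Random(0)
    n_in, n_out = len(W), len(W[0])
    # Prediction vectors, with the additional squash before routing (assumed).
    u_hat = []
    for i in range(n_in):
        preds = []
        for j in range(n_out):
            noisy = [[w + rng.gauss(0.0, noise_std) for w in row]
                     for row in W[i][j]]
            preds.append(squash(matvec(noisy, u[i])))
        u_hat.append(preds)
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits
    v = []
    for _ in range(r):
        c = [softmax(b_i) for b_i in b]  # coupling coefficients per input
        v = []
        for j in range(n_out):
            dim = len(u_hat[0][j])
            s_j = [sum(c[i][j] * u_hat[i][j][k] for i in range(n_in))
                   for k in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):
            for j in range(n_out):
                # Agreement: dot product of prediction and output vectors.
                b[i][j] += sum(x * y for x, y in zip(u_hat[i][j], v[j]))
    return v
```

With three primary capsules and two output capsules, `route(u, W, r=2)` returns two vectors whose lengths are below 1; leaving `noise_std` at 0 corresponds to the noise-free setting used at test time.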

Unlike Sabour et al., we use the cross-entropy loss function:

where $y$ is the ground truth label and $\hat{y}$ is the predicted label calculated using equation 3, in which $m$ is the number of dimensions of the output capsule $\textbf{v}_{j}$.
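For reference, with the symbols defined above and a binary real/fake label $y\in\{0,1\}$, the textbook form of the binary cross-entropy loss is given below. That the loss used here takes exactly this form is an assumption based on the surrounding definitions.

$$L(y,\hat{y}) = -\big(y\log\hat{y} + (1-y)\log(1-\hat{y})\big)$$

The loss is zero when $\hat{y}$ matches $y$ exactly and grows without bound as the predicted probability of the correct class approaches zero.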

The use of equation 3 instead of simply using the length of the output capsules promotes separation between the two output capsules on each dimension.

EVALUATION

To evaluate the advantage of using random noise, we tested the proposed method with and without using random noise (Capsule-Forensics-Noise and Capsule-Forensics, respectively). The random noise was generated from a normal distribution $N(0,0.01)$ and was used in the training phase only. Two iterations ($r=2$) were used in the dynamic routing algorithm. We used the half total error rate (HTER), $\frac{FRR+FAR}{2}$, and accuracy, $\frac{TP+TN}{TP+TN+FP+FN}$, as metrics.
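The two metrics can be computed from confusion-matrix counts as sketched below. The convention that "positive" means a genuine (real) sample, so that FAR is the fraction of forged samples accepted and FRR is the fraction of genuine samples rejected, is an assumption; the text does not state which class is treated as positive. The function names are illustrative.

```python
def hter(tp, tn, fp, fn):
    """Half total error rate: the mean of FAR and FRR.

    Assumed convention: positive = genuine sample, negative = forged sample.
    """
    far = fp / (fp + tn)  # forged samples wrongly accepted as genuine
    frr = fn / (fn + tp)  # genuine samples wrongly rejected
    return (far + frr) / 2


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    counts = dict(tp=90, tn=95, fp=5, fn=10)
    print(hter(**counts), accuracy(**counts))
```

Because HTER averages the two error rates rather than pooling the counts, it is not dominated by the larger class when the genuine and forged sets differ in size.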

To determine the ability of the proposed method to detect replay attacks, we compared its performance with that of eight state-of-the-art detection methods on the well-known Idiap REPLAY-ATTACK dataset. As shown in Table 1, the proposed method with random noise (Capsule-Forensics-Noise), as well as our previous method, had an HTER of zero.

2 Face Swapping Detection

We determined the ability of our proposed method to detect face swapping using a deepfake technique on the deepfake dataset proposed by Afchar et al. at both the frame and video levels. As shown in Tables 2 and 3, our proposed method with random noise (Capsule-Forensics-Noise) had the highest accuracy in both cases.

3 Facial Reenactment Detection

We determined the ability of our proposed method to detect facial reenactment on the FaceForensics dataset, which was created using the Face2Face method. We strictly followed the authors’ guidelines for processing the data. As shown in Table 4, on average, the proposed method (with and without noise) had performance comparable to that of the best-performing state-of-the-art methods.

We also tested our method at the video level and compared its performance with that of Afchar et al.’s MesoNet facial video forgery detection network. For our method, we used only the first ten frames instead of the entire video. As shown in Table 5, our method outperformed Afchar et al.’s network.

4 Fully Computer-Generated Image Detection

Finally, we compared the performance of our proposed method with that of state-of-the-art methods on computer-generated images (CGIs) and photographic images (PIs) on the dataset proposed by Rahmouni et al. Once again, as shown in Table 6, our method had the best performance and had perfect accuracy on full-size test images.

CONCLUSION

Our comprehensive experiments demonstrated the feasibility of building a general detection method that is effective for a wide range of forged image and video attacks. They also demonstrated that capsule networks can be used in domains other than computer vision. The proposed use of random noise in the training phase proved beneficial in most cases. Future work will mainly focus on evaluating the ability of the proposed method to resist adversarial machine attacks, in particular the effect of applying the proposed random noise at test time, and on enhancing that resistance. It will also focus on making the proposed method robust against mixed attacks and on raising this critical issue in the research community.

ACKNOWLEDGMENTS

This work was supported by JSPS KAKENHI Grant Numbers (16H06302, 17H04687, 18H04120, 18H04112, 18KT0051) and by JST CREST Grant Number JPMJCR18A6, Japan.