Multi-scale Convolutional Neural Networks for Crowd Counting

Lingke Zeng, Xiangmin Xu, Bolun Cai, Suo Qiu, Tong Zhang

Introduction

Crowd counting aims to estimate the number of people in crowded images or video feeds from surveillance cameras. Overcrowding in scenarios such as tourist attractions and public rallies can cause crowd crushes, blockages and even stampedes. Producing an accurate and robust crowd count estimate with computer vision techniques is therefore of great significance to public safety.

Existing methods of crowd counting can be generally divided into two categories: detection-based methods and regression-based methods.

Detection-based methods generally assume that each person in a crowd image can be detected and located by a given visual object detector, and obtain the count by accumulating the detected persons. However, these methods require substantial computing resources and are often limited by occlusion between people and complex backgrounds in practical scenarios, resulting in relatively low robustness and accuracy.

Regression-based methods regress the crowd count from the image directly. Chan et al. used handcrafted features to translate the crowd counting task into a regression problem. Subsequent works proposed more kinds of crowd-relevant features, including segment-based features, structural-based features and local texture features. Lempitsky et al. proposed a density-based algorithm that obtains the count by integrating the estimated density map.

Recently, deep convolutional neural networks have been shown to be effective in crowd counting. Zhang et al. proposed a convolutional neural network (CNN) that alternately learns the crowd density and the crowd count. Wang et al. directly used a CNN-based model to map an image patch to its people count. However, these single-CNN-based algorithms are limited in their ability to extract scale-relevant features and struggle to handle the scale variations in crowd images. Zhang et al. proposed a multi-column CNN that extracts multi-scale features using columns with different kernel sizes. Boominathan et al. proposed a multi-network CNN that combines a deep and a shallow network to improve the spatial resolution. These improved algorithms mitigate the scale variation problem to some extent, but they still have two shortcomings:

Multi-column/network models require each single network to be pre-trained before global optimization, which is more complicated than end-to-end training.

Multi-column/network models introduce more parameters and consume more computing resources, which makes them hard to apply in practice.

In this paper, we propose a multi-scale convolutional neural network (MSCNN) to extract scale-relevant features. Rather than adding more columns or networks, we only introduce a multi-scale blob with different kernel sizes, similar to the naive Inception module. Our approach outperforms the state-of-the-art methods on the ShanghaiTech and UCF_CC_50 datasets with a small number of parameters.

MULTI-SCALE CNN FOR CROWD COUNTING

Crowd images usually contain people whose pixel sizes vary widely due to perspective distortion. A single network built from one combination of same-sized kernels struggles to counter these scale variations. The Inception module was proposed to process visual information at various scales and aggregate it for the next stage. Motivated by it, we designed a multi-scale convolutional neural network (MSCNN) to learn scale-relevant density maps from original images.

An overview of MSCNN is illustrated in Figure 1, including feature remapping, multi-scale feature extraction, and density map regression. The first convolution layer is a traditional convolutional layer with single-sized kernels that remaps the image features. The Multi-Scale Blob (MSB) is an Inception-like module (see Figure 2) that extracts the scale-relevant features; it consists of multiple filters with different kernel sizes (9 $\times$ 9, 7 $\times$ 7, 5 $\times$ 5 and 3 $\times$ 3). A multi-layer perceptron (MLP) convolution layer works as a pixel-wise fully connected layer, using multiple $1\times 1$ convolutional filters to regress the density map. A rectified linear unit (ReLU) is applied after every convolutional layer. For all layers except the last, it serves as the activation function. Since the values in a density map are never negative, adding a ReLU after the last convolutional layer as well can enhance the density map restoration. Detailed parameter settings are listed in Table 1.
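The structure of one multi-scale blob can be sketched with some simple bookkeeping. The snippet below is only an illustration under stated assumptions: the four branches are assumed to run in parallel on the same input with "same" zero padding and to be concatenated along the channel axis, as in a naive Inception module, and the channel widths in the usage note are hypothetical, not the settings of Table 1. The helper names are ours.

```python
# Hypothetical helper names; channel widths passed in are illustrative only.
KERNEL_SIZES = (9, 7, 5, 3)  # kernel sizes named in the text


def same_padding(kernel_size):
    """Zero padding that keeps the spatial size for an odd kernel at stride 1."""
    return kernel_size // 2


def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per output channel of a square convolution."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def msb_summary(in_channels, branch_channels, kernel_sizes=KERNEL_SIZES):
    """Return (output channels, parameter count) of one multi-scale blob.

    Assumes every branch has the same number of output channels and that
    the branch outputs are concatenated along the channel axis.
    """
    out_channels = branch_channels * len(kernel_sizes)
    params = sum(conv_params(k, in_channels, branch_channels) for k in kernel_sizes)
    return out_channels, params
```

For example, with a hypothetical 64 input channels and 16 channels per branch, `msb_summary(64, 16)` reports 64 output channels, and the 9 $\times$ 9 branch accounts for roughly half of the blob's parameters.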

2.2 Scale-relevant Density Map

Following Zhang et al., we estimate the crowd density map directly from the input image. To generate a scale-relevant density map with high quality, the scale-adaptive kernel is currently the best choice. We represent each head annotation in the image as a delta function $\delta\left(x-x_{i}\right)$ and describe its distribution with a Gaussian kernel $G_{\sigma}$, so that the density map can be represented as $F\left(x\right)=H\left(x\right)*G_{\sigma}\left(x\right)$, where $H\left(x\right)$ is the sum of the delta functions over all head annotations. Integrating the density map gives the crowd count. If we assume that the crowd is evenly distributed on the ground plane, the average distance $\overline{d_{i}}$ between the head $x_{i}$ and its 10 nearest annotations can generally characterize the geometric distortion caused by the perspective effect using Eq. (1), where $M$ is the total number of head annotations in the image and we empirically fix $\beta=0.3$.
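A minimal pure-Python sketch of this ground-truth generation follows. It assumes the common geometry-adaptive form in which each head gets its own bandwidth $\sigma_{i}=\beta\overline{d_{i}}$; this is our reading of the description, not a transcription of Eq. (1). The lower bound `min_sigma` and the per-head renormalization over the image grid are additional assumptions made so that the map sums to the head count.

```python
import math


def knn_mean_distance(points, i, k=10):
    """Mean distance from points[i] to its k nearest other annotations."""
    xi, yi = points[i]
    dists = sorted(
        math.hypot(xi - x, yi - y) for j, (x, y) in enumerate(points) if j != i
    )
    nearest = dists[:k]
    return sum(nearest) / len(nearest) if nearest else 0.0


def density_map(points, height, width, beta=0.3, k=10, min_sigma=1.0):
    """Sum of per-head Gaussians with sigma_i = beta * mean k-NN distance.

    points are (x, y) head positions assumed to lie inside the image.
    min_sigma (an assumption) avoids degenerate kernels for isolated heads.
    """
    density = [[0.0] * width for _ in range(height)]
    for i, (px, py) in enumerate(points):
        sigma = max(beta * knn_mean_distance(points, i, k), min_sigma)
        weights = [
            [
                math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        total = sum(map(sum, weights))
        # Normalise so that every head contributes exactly one unit of mass.
        for y in range(height):
            for x in range(width):
                density[y][x] += weights[y][x] / total
    return density
```

Summing all entries of the returned map recovers the number of annotated heads, which is the property the counting step relies on.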

2.3 Model Optimization

The output of our model is mapped to the density map, and the Euclidean distance is used to measure the difference between the output feature map and the corresponding ground truth. The loss function to be optimized is defined as Eq. (2), where $\Theta$ represents the parameters of the model and $F\left(X_{i};\Theta\right)$ represents the output of the model. $X_{i}$ and $F_{i}$ are the $i^{th}$ input image and its ground-truth density map, respectively.
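The Euclidean loss described above can be sketched as follows. The $1/(2N)$ normalization is an assumption, chosen because it matches the convention of Caffe's Euclidean loss layer; the function name and the nested-list representation of density maps are ours.

```python
def euclidean_loss(predictions, targets):
    """Sum of squared pixel differences over a batch, divided by 2N.

    predictions and targets are lists of N density maps, each map being a
    list of rows. The 1/(2N) factor is an assumed normalisation.
    """
    n = len(predictions)
    total = 0.0
    for pred, truth in zip(predictions, targets):
        for pred_row, truth_row in zip(pred, truth):
            total += sum((p - t) ** 2 for p, t in zip(pred_row, truth_row))
    return total / (2 * n)
```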

EXPERIMENTS

We evaluate our multi-scale convolutional neural network (MSCNN) for crowd counting on two different datasets: ShanghaiTech and UCF_CC_50. The experimental results show that our MSCNN outperforms the state-of-the-art methods in both accuracy and robustness with far fewer parameters. All of the convolutional neural networks are trained using Caffe.

Following existing state-of-the-art methods, we use the mean absolute error (MAE), the mean squared error (MSE) and the number of neural network parameters (PARAMS) to evaluate the performance on the testing datasets. The MAE and the MSE are defined in Eq. (3) and Eq. (4).

Here $N$ represents the total number of images in the testing dataset, and $z_{i}$ and $\hat{z}_{i}$ are the ground truth and the estimated count, respectively, for the $i^{th}$ image. In general, MAE, MSE and PARAMS indicate the accuracy, robustness and computational complexity of a method, respectively.
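The two error metrics can be computed as in the sketch below. One caveat is labeled as an assumption: crowd-counting papers commonly report "MSE" as the square root of the mean squared count error, so the sketch takes the root by default and exposes a flag (`take_root`, our name) for the plain mean squared error.

```python
import math


def mae(truth, estimate):
    """Mean absolute error between ground-truth and estimated counts."""
    return sum(abs(z - z_hat) for z, z_hat in zip(truth, estimate)) / len(truth)


def mse(truth, estimate, take_root=True):
    """Mean squared count error; rooted by default (an assumed convention)."""
    mean_sq = sum((z - z_hat) ** 2 for z, z_hat in zip(truth, estimate)) / len(truth)
    return math.sqrt(mean_sq) if take_root else mean_sq
```

For two test images with true counts 100 and 200 and estimates 110 and 180, the MAE is 15, while the rooted MSE is about 15.8 because the larger error is weighted more heavily.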

3.2 The ShanghaiTech Dataset

The ShanghaiTech dataset is a large-scale crowd counting dataset. It contains 1198 annotated images with a total of 330,165 persons. The dataset consists of 2 parts: Part_A has 482 images crawled from the Internet and Part_B has 716 images taken on busy streets. Following previous work, both parts are divided into a training set with 300 images and a testing set with the remainder.

To ensure a sufficient amount of data for model training, we perform data augmentation by cropping 9 patches from each image and flipping them. We fix the 9 crop positions as the combinations of top, center and bottom with left, center and right. Each patch is 90% of the original size.
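The cropping scheme can be sketched as below, with images represented as nested lists of pixel rows. Two points are assumptions on our part: "90% of the original size" is read as 90% of each side length, and "flipping" is read as a horizontal mirror. In practice the same crop and flip would also have to be applied to the ground-truth density map.

```python
def nine_crop_boxes(width, height, scale=0.9):
    """(left, top, right, bottom) boxes for the 3 x 3 grid of crop anchors."""
    crop_w, crop_h = round(width * scale), round(height * scale)
    lefts = [0, (width - crop_w) // 2, width - crop_w]   # left, center, right
    tops = [0, (height - crop_h) // 2, height - crop_h]  # top, center, bottom
    return [(x, y, x + crop_w, y + crop_h) for y in tops for x in lefts]


def crop(image, box):
    """Cut a box out of an image stored as a list of rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def hflip(image):
    """Mirror an image left to right (assumed meaning of 'flipping')."""
    return [row[::-1] for row in image]


def augment(image):
    """Return the 9 crops followed by their 9 mirrored copies."""
    height, width = len(image), len(image[0])
    patches = [crop(image, box) for box in nine_crop_boxes(width, height)]
    return patches + [hflip(patch) for patch in patches]
```

Each training image therefore yields 18 patches under these assumptions.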

To facilitate comparison with the MCNN architecture, the network was designed to be similar to the largest column of MCNN but with the MSB; its detailed settings are described in Table 1. All convolutional kernels are initialized from a Gaussian distribution with a standard deviation of 0.01. As described in Sec. 2.3, we use SGD optimization with a momentum of 0.9 and a weight decay of 0.0005.
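For readers unfamiliar with these settings, the following sketch shows Gaussian initialization and a single SGD step with momentum and L2 weight decay on a flat list of weights. The update rule follows the usual momentum form (velocity updated first, then added to the weights), with weight decay added to the gradient; this is an assumed formulation rather than a transcription of the solver used, and the learning rate is left as a required argument.

```python
import random


def init_gaussian(n, std=0.01, seed=0):
    """Draw n weights from a zero-mean Gaussian with the given std."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]


def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay.

    v <- momentum * v - lr * (grad + weight_decay * w)
    w <- w + v
    Returns the new weights and the new velocity.
    """
    new_velocity = [
        momentum * v - lr * (g + weight_decay * w)
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```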

3.2.2 Results

We compare our method with 4 existing methods on the ShanghaiTech dataset. The LBP+RR method uses LBP features to regress the mapping between the input image and the count. Zhang et al. designed a convolutional network to regress both the density map and the crowd count from the original pixels. A multi-column CNN was proposed to estimate the crowd count (MCNN-CCR) and the crowd density map (MCNN).

The results in Table 2 show that our approach achieves state-of-the-art performance on the ShanghaiTech dataset. In addition, it should be emphasized that our model has far fewer parameters than the other two CNN-based algorithms. MSCNN uses approximately 7 $\times$ fewer parameters than the state-of-the-art method (MCNN) while achieving higher accuracy and robustness.

3.3 The UCF_CC_50 Dataset

The UCF_CC_50 dataset contains 50 grayscale images with a total of 63,974 annotated persons. The number of people per image ranges from 94 to 4543, with an average of 1280 individuals. Following previous work, we divide the dataset evenly into five splits so that each split contains 10 images. We then use 5-fold cross-validation to evaluate the performance of our proposed method.

The most challenging aspect of the UCF_CC_50 dataset is that the number of training images is limited while the people count per image spans a very large range. To ensure a sufficient amount of training data, we follow previous work and augment the data by randomly cropping 36 patches of size 225 $\times$ 225 from each image and flipping them, as in Sec. 3.2.1.

We train 5 models, one for each of the 5 training splits. The MAE and the MSE are calculated after all 5 models have produced estimates for their corresponding validation sets. During training, the MSCNN model is initialized almost the same way as in the experiment on the ShanghaiTech dataset, except that the learning rate is fixed to 1e-7 to guarantee model convergence.
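The evaluation protocol can be sketched as follows. The splits here are contiguous blocks of image indices, which is an assumption since the actual assignment of images to splits is not stated; `predict_for_fold` is a hypothetical callback standing in for training a model on the four training splits and returning count estimates for the held-out images.

```python
def five_fold_splits(n_images=50, n_folds=5):
    """Yield (train_indices, val_indices) for each fold.

    Assumes contiguous, equally sized folds; n_images should be divisible
    by n_folds, as is the case for 50 images and 5 folds.
    """
    indices = list(range(n_images))
    fold_size = n_images // n_folds
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def cross_validated_mae(truth, predict_for_fold):
    """Pool absolute errors over all validation folds, then average them."""
    errors = []
    for k, (train, val) in enumerate(five_fold_splits(len(truth))):
        estimates = predict_for_fold(k, train, val)  # hypothetical callback
        errors.extend(abs(truth[i] - e) for i, e in zip(val, estimates))
    return sum(errors) / len(errors)
```

Pooling the errors from all five validation sets before averaging means every image in the dataset contributes exactly once to the reported metric.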

3.3.2 Results

We compare our method with 6 existing methods on the UCF_CC_50 dataset. Some of them use handcrafted features to regress the density map from the input image. Three CNN-based methods that use multi-column/network architectures have also been evaluated on the UCF_CC_50 dataset.

Table 3 shows that our approach also achieves state-of-the-art performance on the UCF_CC_50 dataset. Here MSCNN uses approximately 5 $\times$ fewer parameters than the CrowdNet model while achieving higher accuracy and robustness.

CONCLUSION

In this paper, we proposed a multi-scale convolutional neural network (MSCNN) for crowd counting. Compared with recent CNN-based methods, our algorithm can extract scale-relevant features from crowd images using a single-column network based on the multi-scale blob (MSB). It is trained end to end and requires no multi-column/network pre-training. Our method achieves more accurate and robust crowd counting performance with far fewer parameters, which makes it more suitable for practical applications.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (61171142, 61401163), Science and Technology Planning Project of Guangdong Province of China (2014B010111003, 2014B010111006), and Guangzhou Key Lab of Body Data Science (201605030011).