Faceless Person Recognition; Privacy Implications in Social Media
Seong Joon Oh, Rodrigo Benenson, Mario Fritz, Bernt Schiele
Introduction
With the growth of the internet, more and more people share and disseminate large amounts of personal data be it on webpages, in social networks, or through personal communication. The steadily growing computation power, advances in machine learning, and the growth of the internet economy, have created strong revenue streams and a thriving industry built on monetising user data. It is clear that visual data contains private information, yet the privacy implications of this data dissemination are unclear, even for computer vision experts. We are aiming for a transparent and quantifiable understanding of the loss in privacy incurred by sharing personal data online, both for the uploader and other users who appear in the data.
In this work, we investigate the privacy implications of disseminating photos of people through social media. Although social media data allows a person to be identified via different data types (timeline, geolocation, language, user profile, etc.), we focus on the pixel content of an image. We want to know how well a vision system can recognise a person in social photos (using the image content only), and how well users can control their privacy when limiting the number of tagged images or when adding varying degrees of obfuscation (see figure 1) to their heads.
An important component to extract maximal information out of visual data in social networks is to fuse different data and provide a joint analysis. We propose our new Faceless Person Recogniser (described in §5), which not only reasons about individual images, but uses graph inference to deduce identities in a group of non-tagged images. We study the performance of our system on multiple privacy sensitive user scenarios (described in §3), analyse the main results in §6, and discuss implications and future work in §7. Since we focus on the image content itself, our results are a lower-bound on the privacy loss resulting from sharing such images.
We discuss dimensions that affect the privacy of online photos, and define a set of scenarios to study the question of privacy loss when social media images are aggregated and processed by a vision system.
We propose our new Faceless Person Recogniser, which uses convnet features in a graphical model for joint inference over identities.
We study the interplay and effectiveness of obfuscation techniques with regard to our vision system.
Related work
Nowadays, essentially all online activities can potentially be used to identify an internet user. Privacy of users in social networks is a well studied topic in the security community. There are works which consider the relationship between privacy and photo sharing activities, yet they do not perform quantitative studies.
Some works have shown that it is possible to identify the camera taking the photos (and thus link photos and events via the photographer), either from the file itself or from recognisable sensing noise . In this work we focus exclusively on the image content, and leave the exploitation of image content together with other forms of privacy cues (e.g. additional meta-data from the social network) for future work.
Most previous work on person recognition in images has focused either on face images (mainly frontal head) or on the surveillance scenario, where the full body is visible, usually in low resolution. As in other areas of computer vision, recent years have seen a shift from classifiers based on hand-crafted features and metric learning approaches towards methods based on deep learning. Unlike face recognition and surveillance scenarios, the social network images studied here tend to show a diverse range of poses, activities, points of view, scenes (indoors, outdoors), and illumination. This increased diversity makes recognition more challenging, and only a handful of works have explicitly addressed this scenario. We construct our experiments on top of the recently introduced PIPA dataset, discussed in §4.
The notion of “person recognition” encompasses multiple related problems . Typical “person recognition” considers a few training samples over many different identities, and a large test set. It is thus akin to fine grained categorization. When only one training sample is available and many test images (typical for face recognition and surveillance scenarios ), the problem is usually named “re-identification”, and it becomes akin to metric learning or ranking problems. Other related tasks are, for example, face clustering , finding important people , or associating names in text to faces in images . In this work we focus on person recognition with on average training samples per identity (and hundreds of identities), as in typical social network scenario.
Given a rough bounding box locating a person, different cues can be used to recognise a person. Much work has focused on the face itself ( to name a few recent ones). Pose-independent descriptors have been explored for the body region . Various other cues have been explored, for example: attributes classification , social context , relative camera positions , space-time priors , and photo-album priors . In this work, we build upon which fuses multiple convnet cues from head, body, and the full scene. As we will discuss in the following sections, we will also indirectly use photo-album information.
Some previous works have considered the challenges of detection and recognition under obfuscation (e.g. see figure 1). Recently, quantified the decrease in Facebook face detection accuracy with respect to different types of obfuscation, e.g. blur, blacking-out, swirl, and dark spots. However, in principle, obfuscation patterns can expose faces to a higher risk of detection by a fine-tuned detector (e.g. blur detector). Unlike their work, we consider the identification problem with a system adapted to obfuscation patterns. Similarly, a few other works studied face recognition under blur. However, to the best of our knowledge, we are the first to consider person recognition under head obfuscation using a trainable system that leverages full-body cues.
Privacy scenarios
We consider a hypothetical social photo sharing service user. The user has a set of photos of herself and others in her account. Some of these photos have identity tags and the others do not have such identity tags. We assume that all heads on the test photos have been detected, either by an automatic detection system, or because a user is querying the identity of a specific head. Note that we do not assume that the faces are visible nor that persons are in a frontal-upstanding pose. A “tag” is an association between a given head and a unique identifier linked to a specific identity (social media user profile).
The task of our recognition system is to identify a person of interest (marked via its head bounding box), by leveraging all the photos available (both with and without identity tags). In this work, we want to explore how effective different strategies are to protect the user identity.
We consider four different dimensions that affect how hard or easy it is to recognise a user:
We vary the number of tagged images available per identity. The more tagged images available, the easier it should be to recognise someone in new photos. In the experiments of §5 & §6 we consider that tagged images are available per person.
Users concerned with their privacy might take protective measures by blurring or masking their heads. Other than the fully visible case (non-obfuscated), we consider three other obfuscation types, shown in figure 2. We consider both black and white, since showed that commercial systems might react differently to these. The blurring parameters are chosen to resemble the YouTube face blur feature.
Depending on the user’s activities (and her friends posting photos of her), not all photos might be obfuscated. We consider a variable fraction of these.
For the recognition task, it makes a difference whether all photos belong to the same event, where the appearance of people changes little, or whether the set of photos without tags corresponds to a different event than the ones with identity tags. Recognising a person when the clothing, context, and illumination have changed (“across events”) is more challenging than when they have not (“within events”).
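The head obfuscation types described above can be illustrated on a toy grayscale image. This is a minimal sketch, not the authors' preprocessing: the function names, the box format (top, left, bottom, right) and the blur radius are assumptions, and a plain box blur stands in for the blur used in the paper, whose parameters are not given in this text.

```python
def fill_box(image, box, value):
    """Return a copy of a grayscale image (list of rows) with the box set to value."""
    top, left, bottom, right = box
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = value
    return out


def blur_box(image, box, radius=1):
    """Return a copy with each pixel in the box replaced by the mean of its window."""
    top, left, bottom, right = box
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) / len(window)
    return out


# Toy 4x4 image with a bright 2x2 "head" region in the top-left corner.
image = [[200, 200, 10, 10],
         [200, 200, 10, 10],
         [10, 10, 10, 10],
         [10, 10, 10, 10]]
head = (0, 0, 2, 2)  # hypothetical head box: rows 0-1, columns 0-1

black = fill_box(image, head, 0)
white = fill_box(image, head, 255)
blurred = blur_box(image, head, radius=1)
```

Only the head box is altered in each case; the body and scene pixels stay intact, which is why they remain usable as recognition cues.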
Based on these four dimensions, we discuss a set of scenarios, summarised in table 1. Clearly, these only cover a subset of all possible combinations along the mentioned four dimensions. However, we argue that this subset covers important and relevant aspects for our exploration on privacy implications.
Here all heads are fully visible and tagged. Since all heads are tagged, the user is fully identifiable. This is the classic case without any privacy.
There is no obfuscation but not all images are tagged. This is the scenario commonly considered for person recognition, e.g. . Unless otherwise specified we use , where an average of instances of the person are tagged (average across all identities). This is a common scenario for social media users, where some pictures are tagged, but many are not.
Here the user has all of her heads visible, except for the one non-tagged head being queried. This would model the case where the user wants to conceal her identity in one particular published photo.
The user aims at protecting her identity by obfuscating all her heads (using any obfuscation type, see figure 2). Both tagged and non-tagged heads are obfuscated. This scenario models a privacy concerned user. Note that the body is still visible and thus usable to recognise the user.
These consider the case of a user who uses the obfuscation tactic inconsistently to protect her identity. Although on the surface these seem like different scenarios, if the visual information of the heads cannot be propagated from/to the tagged/non-tagged heads, then these are functionally equivalent to .
Each of these scenarios can be applied for the “across/within events” dimension. In the following sections we will build a system able to recognise persons across these different scenarios, and quantify the effect of each dimension on the recognition capabilities (and thus their implication on privacy). For our system, the tagged heads become training data, while the non-tagged heads are used as test data.
Experimental setup
We investigate the scenarios proposed in §3 through a set of controlled experiments on a recently introduced social media dataset: PIPA (People In Photo Albums). In this section, we project the scenarios in §3 onto specific aspects of the PIPA dataset, describing how much realism can be achieved and what the possible limitations are.
From the Flickr website, each photo is associated with an album identifier. The test instances are grouped in photos belonging to albums. We use the photo album information indirectly during our graph inference (§5.3).
When instantiating the scenarios from §3, the tagged faces are all part of . In , , and , is never tagged. The task of our Faceless Person Recognition System is to recognise every query instance from , possibly leveraging other non-tagged instances in .
Other than the one split defined in , proposed additional splits with increasing recognition difficulty. We use the “Original” split as a good proxy for the “within events” case, and the “Day” split for “across events”. In the day split, and contain images of a given person across different days.
Faceless Recognition System
In this section, we introduce the Faceless Recognition System to study the effectiveness of the privacy protective measures in §3. We choose to build our own baseline system, as opposed to using an existing system as in , for adaptability of the system to obfuscation and reproducibility for future research.
Our system does joint recognition employing a conditional random field (CRF) model. CRFs are often used for joint recognition problems in computer vision. They enable the communication of information across instances, strengthening weak individual cues. Our CRF model is formulated as follows:
with observations , identities and unary potentials defined on each node (detailed in §5.1) as well as pairwise potentials defined on each edge (detailed in §5.2). is the indicator function, and controls the unary-pairwise balance.
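The formula itself did not survive in this copy of the text. One form consistent with the description above is sketched here under assumed notation: the symbols $X_i$, $Y_i$, $\phi$, $\psi$, $\alpha$, $V$ and $E$ are placeholders chosen for illustration, and any normalisation constants the authors may use are omitted.

```latex
\hat{Y} \;=\; \arg\max_{Y} \; \sum_{i \in V} \phi(Y_i \mid X_i)
\;+\; \alpha \sum_{(i,j) \in E} \mathbb{1}\left[ Y_i = Y_j \right] \, \psi(X_i, X_j)
```

In this reading, $V$ is the set of nodes (person instances) and $E$ the set of edges. The indicator rewards only those edges whose endpoints receive the same identity, and $\alpha$ trades off the pairwise reward against the unary scores.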
We build our unary upon a state of the art, publicly available person recognition system, . The system was shown to be robust to decreasing number of tagged examples. It uses not only the face but also context (e.g. body and scene) as cues. Here, we also explore its robustness to obfuscation, see §5.1.
By adding pairwise terms over the unaries, we expect the system to propagate predictions across nodes (instances). When a unary prediction is weak (e.g. obfuscated head), the system aggregates information from connected nodes with possibly stronger predictions (e.g. visible face), and thus deduces the query identity. Our pairwise term is a siamese network built on top of the unary features, see §5.2.
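How a confident neighbour can correct a weak unary prediction can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' PyStruct implementation: the score (sum of unary scores plus a weighted reward on edges whose endpoints share a label), the variable names and all numbers are made up for illustration, and exhaustive search is only feasible for tiny graphs.

```python
from itertools import product


def joint_score(labels, unaries, edges, alpha):
    """Unary scores plus alpha-weighted match scores on label-agreeing edges."""
    score = sum(unaries[i][y] for i, y in enumerate(labels))
    score += alpha * sum(psi for (i, j), psi in edges.items()
                         if labels[i] == labels[j])
    return score


def brute_force_map(unaries, edges, alpha):
    """Exhaustively search all labelings for the highest joint score."""
    num_ids = len(unaries[0])
    return max(product(range(num_ids), repeat=len(unaries)),
               key=lambda labels: joint_score(labels, unaries, edges, alpha))


# Node 0: obfuscated head, weak unary slightly favouring identity 1 (wrong).
# Node 1: visible face, confident unary for identity 0.
unaries = [[0.45, 0.55],
           [0.90, 0.10]]
edges = {(0, 1): 0.8}  # hypothetical match score for the pair

independent = brute_force_map(unaries, edges, alpha=0.0)  # unaries only
joint = brute_force_map(unaries, edges, alpha=1.0)        # with pairwise term
```

With the pairwise weight set to zero each node keeps its own best label, so node 0 is labelled 1. With the pairwise term switched on, the strong match with node 1 pulls node 0 to identity 0.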
Experiments on the validation set indicate that, for all scenarios, the performance improves with increasing values of , and reaches the plateau around . We use this value for all the experiments and analysis.
In the rest of the section, we provide a detailed description of the different parts and evaluate our system component-wise.
5.1 Single person recognition
We build our single person recogniser (the unary potential of the CRF model) upon the state of the art person recognition system .
First, AlexNet cues are extracted and concatenated from multiple regions (head, body, and scene) defined relative to the ground truth head boxes. We then train per-identity logistic regression models on top of the resulting dimensional feature vector, which constitute the vector.
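The two steps above (concatenating per-region cues, then fitting one logistic regression per identity) can be sketched as follows. This is a toy illustration in plain Python, not the authors' pipeline: the convnet features are replaced by hand-written vectors, and `concat_cues`, `train_one_vs_rest`, the learning rate and the number of epochs are all hypothetical choices.

```python
import math


def concat_cues(head, body, scene):
    """Concatenate per-region feature vectors into one descriptor."""
    return list(head) + list(body) + list(scene)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_one_vs_rest(features, labels, num_ids, lr=0.5, epochs=200):
    """Fit one binary logistic regression per identity with stochastic gradient descent."""
    dim = len(features[0])
    models = []
    for identity in range(num_ids):
        w, b = [0.0] * dim, 0.0
        for _ in range(epochs):
            for x, y in zip(features, labels):
                target = 1.0 if y == identity else 0.0
                err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
                w = [wi - lr * err * xi for wi, xi in zip(w, x)]
                b -= lr * err
        models.append((w, b))
    return models


def predict_scores(models, x):
    """Per-identity scores for one instance (the unary vector in this sketch)."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for w, b in models]


# Toy tagged examples: two identities, made-up features from three regions.
train_x = [concat_cues([1.0, 0.0], [0.9], [0.2]),
           concat_cues([0.9, 0.1], [1.0], [0.8]),
           concat_cues([0.0, 1.0], [0.1], [0.3]),
           concat_cues([0.1, 0.9], [0.0], [0.7])]
train_y = [0, 0, 1, 1]

models = train_one_vs_rest(train_x, train_y, num_ids=2)
scores = predict_scores(models, concat_cues([0.95, 0.05], [0.8], [0.5]))
predicted = max(range(len(scores)), key=scores.__getitem__)
```

The vector returned by `predict_scores` plays the role of the unary scores for one instance: one entry per identity, usable on its own (argmax) or as input to joint inference.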
The AlexNet models are trained on the PIPA train set, while the logistic regression weights are trained on the tagged examples (). For each obfuscation case, we also train new AlexNet models over obfuscated images (referred to as “adapted” in figure 3). We assume that at test time the obfuscation can be easily detected, and the appropriate model is used. We always use the “adapted” model unless otherwise stated.
Figures 3 and 5 evaluate our unary term over the PIPA validation set, under different obfuscation, within/across events, and with varying number of training tags. In the following, we discuss our main findings on how single person recognition is affected by these measures.
When comparing “adapted” to “non-adapted” in figure 3, we see that adaptation of the convnet models is overall positive. It makes minor differences for black or white fill-in, but provides a good boost in recognition accuracy for the blur case, especially in the across events case ( percent points gain).
After applying black obfuscation in the within events case, our unary performs only slightly worse (from “visible” to “black adapted” ). This is times better than a naive baseline classifier () that blindly predicts the most popular class. In the across events case, the “visible” performance drops from to after black obfuscation, which is still more than times more accurate than the naive baseline ().
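The naive baseline mentioned above is straightforward to compute. A minimal sketch follows, assuming the most popular class is determined from the tagged (training) labels; the text does not say which set is used, and the names and labels below are made up.

```python
from collections import Counter


def most_popular_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training identity."""
    most_common_id, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == most_common_id)
    return most_common_id, hits / len(test_labels)


train_labels = ["anna", "anna", "anna", "ben", "ben", "carl"]
test_labels = ["anna", "ben", "anna", "carl", "ben", "anna", "ben", "carl"]
guess, accuracy = most_popular_class_accuracy(train_labels, test_labels)
```

Because this baseline never looks at the image, any accuracy above it measures what the visual content itself reveals.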
suggests that white fill-in confuses a detection system more than does the black. In our recognition setting, black and white fill-in have similar effects: and respectively, for within events, adapted case (see figure 3). Thus, we omit the experiments for white fill-in obfuscation in the next sections.
5.2 Person pair matching
In this section, we introduce a method for predicting matches between a pair of persons based on head and body cues. This is the pairwise term in our CRF formulation (equation 2). Note that person pair matching in social media context is challenging due to clothing changes and varying poses (see figure 5).
We build a Siamese neural network to compute the match probability . A pair of instances are given as input, whose head and body features are then computed using the single person recogniser (§5.1), resulting in a dimensional feature vector. These features are passed through three fully connected layers with ReLU activations with a binary prediction at the end (match, no-match).
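The forward pass of such a matching head can be sketched in plain Python. This is an untrained illustration under several assumptions: the two instances' features are combined by concatenation, the layer sizes are arbitrary small numbers, the weights are random, and ReLU follows the first two layers only. None of these values are taken from the paper.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def match_probability(feat_a, feat_b, layers):
    """Concatenate both instances' features and apply three FC layers."""
    x = list(feat_a) + list(feat_b)
    (w1, b1), (w2, b2), (w3, b3) = layers
    h = relu(linear(x, w1, b1))
    h = relu(linear(h, w2, b2))
    no_match, match = softmax(linear(h, w3, b3))
    return match


rng = random.Random(0)
feat_dim = 6  # hypothetical per-instance feature size (head + body)
layers = [init_layer(rng, 2 * feat_dim, 8),
          init_layer(rng, 8, 4),
          init_layer(rng, 4, 2)]
feat_a = [rng.random() for _ in range(feat_dim)]
feat_b = [rng.random() for _ in range(feat_dim)]
p_match = match_probability(feat_a, feat_b, layers)
```

In the actual system both inputs share the same feature extractor (the single person recogniser), which is what makes the network siamese; here the features are simply given.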
We first train the siamese network on the PIPA train set, and then fine-tune it over , the set of tagged samples. We train three types of models: one for visible pairs, one for obfuscated pairs, and one for mixed pairs. Like for the unary term, we assume that obfuscation is detected at test time, so that the appropriate model is used. Further details can be found in the supplementary materials.
Figure 6 shows the matching performance. We evaluate on the set of pairs within albums (used for graph inference in §5.3). The performance is evaluated in the equal error rate (EER), the accuracy at the score threshold where false positive and false negative rates meet. The three obfuscation type models are evaluated on the corresponding obfuscation pairs.
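The equal error rate operating point can be found by sweeping the score threshold. The sketch below is a simple illustration with made-up scores; the function name and the choice to scan only the observed score values are assumptions, not the authors' evaluation code.

```python
def equal_error_rate(scores, labels):
    """Find the threshold where false positive and false negative rates are closest.

    scores: match scores (higher means more likely a match).
    labels: 1 for a true match, 0 for a non-match.
    Returns (threshold, false_positive_rate, false_negative_rate, accuracy).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for threshold in sorted(set(scores)):
        predictions = [1 if s >= threshold else 0 for s in scores]
        fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
        fpr, fnr = fp / negatives, fn / positives
        accuracy = 1.0 - (fp + fn) / len(labels)
        gap = abs(fpr - fnr)
        if best is None or gap < best[0]:
            best = (gap, threshold, fpr, fnr, accuracy)
    return best[1:]


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
threshold, fpr, fnr, accuracy = equal_error_rate(scores, labels)
```

In this toy example the two error rates meet at a threshold of 0.6, where one non-match is accepted and one match is rejected, giving an accuracy of 0.75.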
By fine-tuning on the tagged examples of query identities, matching performance improves significantly. For the visible pair model, EER improves from to in the within events setting, and from to in across events.
In order to evaluate whether the matching network has learned to predict a match better than its initialisation model, we consider the unary baseline. See “visible unary base” in figure 6. It first compares the unary predictions (argmax) for a given pair, and then determines its confidence using the prediction entropies. See supplementary materials for more detail.
The unary baseline performs marginally better than the visible pair model in the within events setting: versus . In the across events setting, on the other hand, the visible pair model beats the baseline by a large margin: versus (figure 6). In practice, the system has no information on whether the query image is from within or across events. The system thus uses the pairwise trained model (visible pair model), which performs better on average.
The matching network performs better under the within events setting than across events, and better for the visible pairs than for mixed or black pairs. See figure 6.
5.3 Graph inference
Given the unaries from §5.1 and pairwise from §5.2, we perform a joint inference to perform more robust recognition. The graph inference is implemented via PyStruct . The results of the joint inference (for the black obfuscation case) are presented in figure 7, and discussed in the next paragraphs.
We introduce some graph pruning strategies which make the inference tractable and more robust to noisy predictions. Some of the scenarios considered (e.g. ) require running inference for each instance in the test set ( for within events). In order to lower the computational cost from days to hours, we prune all edges across albums. The resulting graph only has fully connected cliques within albums. The across-album edge pruning reduces the number of edges by two orders of magnitude.
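The across-album pruning can be sketched by building edges only between instances that share an album identifier. The album labels below are hypothetical and the example is far smaller than the real test set, so the reduction it shows is much more modest than the two orders of magnitude reported above.

```python
from itertools import combinations


def full_graph_edges(instances):
    """All unordered pairs of instances (fully connected graph)."""
    return list(combinations(range(len(instances)), 2))


def within_album_edges(instances):
    """Keep only pairs whose two instances share an album identifier."""
    by_album = {}
    for index, album in enumerate(instances):
        by_album.setdefault(album, []).append(index)
    edges = []
    for members in by_album.values():
        edges.extend(combinations(members, 2))
    return edges


# Hypothetical album identifier for each test instance.
albums = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "d"]
all_edges = full_graph_edges(albums)
pruned_edges = within_album_edges(albums)
```

Since each album becomes an independent clique, inference can also be run album by album.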
As can be seen in figure 7, simply adding pairwise terms (“unary+pairwise (no pruning)”) can hurt the unaries only performance. This happens because many pairwise terms are erroneous. This can be mitigated by only selecting confident (high quality, low recall) predictions from . We found that selecting positive pairs works best (any threshold in the range works equally fine). These are the “unary+pairwise” results in figure 7, which show an improvement over the unary case, especially for the across events setting. The main gain is observed for (one obfuscated head) across events, where the pairwise term brings a jump from to .
To put the gains from the graph inference in context, we build an oracle case that assumes perfect pairwise potentials (, where is the indicator function and are the ground truth identities ). We do not perform negative edge pruning here. The unaries are the same as for the other cases in figure 7. We can see that the “unary+pairwise” results are within of the oracle case “(oracle)”, indicating that the pairwise potential is rather strong. The cases where the oracle performs poorly (e.g. across events) indicate that stronger unaries or better graph inference is needed. Finally, even if no negative edge is pruned, adding oracle pairwise terms improves the performance, indicating that negative edge pruning is needed only when the pairwise term is imperfect.
After graph inference, all scenarios in the within events case reach recognition rates above (figure 7a). Across events, both and are above (figure 7b). These recognition rates are far above the chance level (/ within/across events, shown in figure 3). Only (all user heads with black obfuscation) shows a severe drop in recognition rate, where neither the unaries nor the pairwise terms bring much help. See supplementary materials for more details on this section.
Test set results & analysis
Following the experimental protocol in §4, we now evaluate our Faceless Recognition System on the PIPA test set. The main results are summarised in figures 10 and 10. We observe the same trends as the validation set results discussed in §5. Figure 14 shows some qualitative results over the test set. We organize the results along the same privacy sensitive dimensions that we defined in §3 in order to build our study scenarios.
Figure 10 shows that even with only tagged photos per person on average, the system can recognise users far better than chance level (naive baseline; best guess before looking at the image). Even with such a small amount of training data, the system predicts of the instances correctly within events and across events; which is and higher than chance level, respectively. We see that even a few tags pose a threat to privacy, and thus users concerned with their privacy should avoid having (any of) their photos tagged.
For both scenario and , figure 10 (and the results from §5.1) indicates the same privacy protection ranking for the different obfuscation types. From higher protection to lower protection, we have . Although blurring does provide some protection, the machine learning algorithm still extracts useful information from that region. When our full Faceless Recognition System is in use, one can see (figure 10) that obfuscation helps, but only to a limited degree: e.g. () to () under within events and () to () under across events.
We cover three scenarios: every head fully visible (), only the test head obfuscated (), and every head fully obfuscated (). Figure 10 shows that within events obfuscating either one () or all () heads is not very effective, compared to the across events case, where one can see larger drops for and . Notice that unary performances are identical for and in all settings, but using the full system raises the recognition accuracy for (since seeing the other heads allows identities to be ruled out for the obfuscated head). We conclude that within events head obfuscation has only limited effectiveness, whereas across events only blacking out all heads seems truly effective ( black).
In all scenarios, the recognition accuracy is significantly worse in the across events case than within events (about drop in accuracy across all other dimensions). For a user, it is a better privacy policy to make sure no tagged heads exist for the same event than to black out all of her heads in the event.
Discussion & Conclusion
Within the limitations of any study based on public data, we believe the results presented here offer a fresh view on the capabilities of machine learning to enable person recognition in social media under adversarial conditions. From a privacy perspective, the results presented here should raise concern. We show that, when using state of the art techniques, blurring a head has limited effect. We also show that only a handful of tagged heads are enough to enable recognition, even across different events (different day, clothes, poses, point of view). In the most aggressive scenario considered (all user heads blacked-out, tagged images from a different event), the recognition accuracy of our system is higher than chance level. It is very probable that undisclosed systems similar to the ones described here already operate online. We believe it is the responsibility of the computer vision community to quantify and disseminate the privacy implications of the images users share online. This work is a first step in this direction. We conclude by discussing some future challenges and directions on privacy implications of social visual media.
The current results focus solely on the photo content itself and are therefore a lower bound on the privacy implications of posting such photos. It remains as future work to explore an integrated system that will also exploit the images’ meta-data (timestamp, geolocation, camera identifier, related user comments, etc.). In the era of “selfie” photos, meta-data can be as effective as head tags. Younger users also tend to cross-post across multiple social media, and make greater use of video (e.g. Vine). Using these forms of data will require developing new techniques.
The performance of recent techniques of feature learning and inference is strongly coupled with the amount of available training data. Person recognition systems like all rely on undisclosed training data in the order of millions of training samples. Similarly, the evaluation of privacy issues in social networks requires access to sensitive data, which is often not available to the public research community (for good reasons). The PIPA dataset used here serves as a good proxy, but has its limitations. It is an emerging challenge to keep representative data in the public domain in order to model privacy implications of social media and keep up with the rapidly evolving technology that is enabled by such sources.
In this work, we focus on the analysis aspect of person recognition in social media. In the future, one would like to translate such analyses to actionable systems that enable users to control their privacy while still enabling communication via visual media exchanges.
This research was supported by the German Research Foundation (DFG CRC 1223).
Supplementary Materials
Section 0.B of these supplementary materials provides details of the training procedure for the model components. Sections 0.C and 0.G present the quantitative tables behind the bar plots of the main paper. Section 0.D discusses in more detail the pairwise term of our model. Section 0.E discusses in more detail some of the design choices for the graph inference. Section 0.F gives the rough computation cost of our method. Finally, section 0.H shows additional qualitative examples of our faceless recognition system.
Appendix 0.B Convnet training details
The convnet parts of our recognition system are built using Caffe, and the CRF is built using PyStruct.
We initialise the AlexNet network with ImageNet pretrained model, and use the following parameters from to fine-tune the respective models:
We choose the batch size . This corresponds to epochs. This setting is used for fine-tuning all the unary models, including the ones adapted to different head obfuscation types (black, white, and blur).
The network consists of the Siamese part with our unary model, followed by three fully connected layers with , , and dimensional weights. The first two fully connected layers are followed by ReLU activation layers, and by additional dropout layers (with chance) during the training phase.
The network is first trained on the PIPA train set, and then fine-tuned for instances (tagged). The learning parameters are as follows for both training and fine-tuning:
We choose the batch size of and maintain the same ratio of positive and negative pairs () for each batch. The training pairs consist of the pairs within PIPA albums because eventually these are the edges used in the graph inference. Depending on the setting (within/across events), training iterations correspond to epochs. We stop at training iterations, as the loss does not decrease further. For fine-tuning, we stop at iterations.
Appendix 0.C Unaries recognition accuracy
Tables 2 and 3 show the accuracy of the unary system alone in the presence of head obfuscation and with different tag rates, respectively.
Appendix 0.D Pairwise term
We discussed the effectiveness of fine-tuning on examples in the main paper. Here, we provide ROC curves for the visible pair models before and after fine-tuning on the validation set (figure 11). We observe that the matching network indeed performs better when it has been trained on examples of the identities to be queried.
We introduced the unary baseline person matcher in the main paper in order to verify that the network has learned to predict a match better than its initialisation model. We provide further details here.
The match probability of a person pair is computed solely from the unary prediction probabilities and . In order to model the match confidence, we use the average of the unary entropies and , where the entropy is .
Specifically, the unary baseline match probability is computed as follows:
Compute unary predictions and .
For a typical unary prediction, entropy takes a value . Thus, if , then the match probability is within , with a lower value when the mean entropy (uncertainty) is higher; in the other case , it takes a value in with a higher value for higher mean entropy (uncertainty).
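The exact formula and value ranges did not survive in this copy of the text, so the following is only one mapping that matches the qualitative behaviour described above. It assumes scores in [0.5, 1] when the two argmax predictions agree and in [0, 0.5] when they differ, and it normalises the mean entropy by its maximum, log K. The function names and the normalisation are assumptions, not the authors' definition.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def unary_baseline_match(probs_a, probs_b):
    """Match score from two unary distributions (assumed mapping, see text)."""
    argmax_a = max(range(len(probs_a)), key=probs_a.__getitem__)
    argmax_b = max(range(len(probs_b)), key=probs_b.__getitem__)
    mean_entropy = 0.5 * (entropy(probs_a) + entropy(probs_b))
    # Normalise uncertainty to [0, 1] by the maximum possible entropy, log(K).
    uncertainty = mean_entropy / math.log(len(probs_a))
    confidence = 1.0 - uncertainty
    if argmax_a == argmax_b:
        return 0.5 + 0.5 * confidence  # same prediction: more certain, closer to 1
    return 0.5 - 0.5 * confidence      # different prediction: more certain, closer to 0


confident_same = unary_baseline_match([0.9, 0.05, 0.05], [0.8, 0.1, 0.1])
unsure_same = unary_baseline_match([0.4, 0.3, 0.3], [0.4, 0.35, 0.25])
unsure_diff = unary_baseline_match([0.4, 0.3, 0.3], [0.3, 0.4, 0.3])
confident_diff = unary_baseline_match([0.9, 0.05, 0.05], [0.05, 0.9, 0.05])
```

Higher mean entropy pushes the score towards 0.5 from either side, so uncertain unary predictions yield uninformative match scores.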
Appendix 0.E CRF inference
In this section, we supplement the discussion about the following inference problem in the main paper.
We describe the effect of changing values of in §0.E.1. We then discuss the pruning (§0.E.2) and approximate inference (§0.E.3) strategies to realise efficient inference. We include the numerical results for the validation graph inference experiment (figure 7 in the main paper) in §0.E.4.
We use the unary-pairwise balancing term for all the experiments in the main paper, as the performance reaches a plateau for around . In figure 12, we show the performance of the system in three different scenarios (, , and ) at different values of on the validation set. Black obfuscation is used for scenarios and . We observe that the plateau is reached at around for both within and across events cases.
E.2 Graph pruning
As described in the main paper, we prune the full graph down by only having fully connected cliques within albums. As shown in table 4, this reduces the number of edges by two orders of magnitude. This also allows album-wise parallel computation in a multi-core environment.
We provide details of the “preliminary oracle experiments” discussed in §5.3 of the main paper. In order to quantify how much we lose from the pruning, we perform an oracle experiment assuming perfect propagation (given actual unaries) on the validation set. In the within events case, perfect propagation inside album cliques already gives , compared to for full graph propagation. Thus, nearly all the information for a perfect inference is already present inside each album. In the across events case, the oracle numbers are (inside album propagation) and (full graph propagation). As the current unary model performance on across events () is still far worse than those oracles, we choose efficiency over the extra boost in the oracle performance.
We prune edges below certain match score (), as simply adding pairwise terms can hurt the performance (figure 7 in main paper). In figure 13, we show the performance of the Faceless Recognition system on the validation set at different pruning thresholds . Again for obfuscation scenarios ( and ), black obfuscation is used. As mentioned in the main paper, we observe that any threshold in the range works equally fine, and thus use in all the experiments.
E.3 Approximate inference
For further efficiency, we perform approximate inference on the graph. Specifically, given a node whose identity is to be inferred, we consider propagation only along the edges neighbouring that node. Since the resulting graph is a tree, this significantly reduces the computation time, while achieving accuracy similar to (or even better than) the full max-product inference (table 5). For within events, the reduction in inference time for the whole validation set is from hours to only seconds.
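On a star-shaped graph centred on the query node, max-product inference reduces to one independent choice per neighbour. The sketch below assumes a score made of unary terms plus a weighted reward on edges whose endpoints share a label, with no normalisation; the function name, the weight `alpha` and all numbers are illustrative and not taken from the paper.

```python
def infer_query_identity(query_unary, neighbours, alpha):
    """Exact max-product on a star graph centred on the query node.

    query_unary: unary scores for the query node, one per identity.
    neighbours: list of (unary_scores, match_score), one per neighbouring node.
    """
    best_label, best_score = None, float("-inf")
    for label, score in enumerate(query_unary):
        for unary, psi in neighbours:
            # The neighbour either keeps its own best label, or agrees with
            # the query label and collects the pairwise reward.
            score += max(max(unary), unary[label] + alpha * psi)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


query_unary = [0.3, 0.4, 0.3]                # weak, e.g. an obfuscated head
neighbours = [([0.9, 0.05, 0.05], 0.9),      # confident neighbour, strong match
              ([0.2, 0.6, 0.2], 0.1)]        # another neighbour, weak match

with_pairwise = infer_query_identity(query_unary, neighbours, alpha=1.0)
unary_only = infer_query_identity(query_unary, neighbours, alpha=0.0)
```

The cost is linear in the number of neighbours and identities, which is in line with the large reduction in inference time reported above.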
E.4 Full validation set results
We include the full numerical results for the validation set graphical inference results (figure 7 in the main paper) in table 6 below.
Appendix 0.F Computation time
For unary convnet training, it takes days to train on a single GPU machine. Unary logistic regression training takes minutes. On a single GPU, pairwise matching network training and fine-tuning take hours and hours, respectively.
Details of the graph inference time can be found in table 5. Note that before inter-album pruning, inference over the entire test set takes several days. However, after pruning and applying the approximate inference, it takes seconds for across events, and minutes for within events.
Appendix 0.G Test results
Numerical results for figures 8 and 9 of the main paper are presented below (tables 8 and 7). The tables show the accuracy of the Faceless Recognition system under different scenarios (, , and ) in the within and across events cases.
Appendix 0.H Qualitative results
See figure 14 for additional qualitative examples of success cases. Note the difficulty of the dataset (various pose, back view, and changing clothing).