Learning Differentially Private Recurrent Language Models

H. Brendan McMahan, Daniel Ramage, Kunal Talwar, Li Zhang

Introduction

Deep recurrent models like long short-term memory (LSTM) recurrent neural networks (RNNs) have become a standard building block in modern approaches to language modeling, with applications in speech recognition, input decoding for mobile keyboards, and language translation. Because language usage varies widely by problem domain and dataset, training a language model on data from the right distribution is critical. For example, a model to aid typing on a mobile keyboard is better served by training data typed in mobile apps rather than from scanned books or transcribed utterances. However, language data can be uniquely privacy sensitive. In the case of text typed on a mobile phone, this sensitive information might include passwords, text messages, and search queries. In general, language data may identify a speaker—explicitly by name or implicitly, for example via a rare or unique phrase—and link that speaker to secret or sensitive information.

Ideally, a language model’s parameters would encode patterns of language use common to many users without memorizing any individual user’s unique input sequences. However, we know convolutional NNs can memorize arbitrary labelings of the training data [Zhang et al., 2017] and recurrent language models are also capable of memorizing unique patterns in the training data [Carlini et al., 2018]. Recent attacks on neural networks such as those of Shokri et al. underscore the implicit risk. The main goal of our work is to provide a strong guarantee that the trained model protects the privacy of individuals’ data without undue sacrifice in model quality.

We are motivated by the problem of training models for next-word prediction in a mobile keyboard, and use this as a running example. This problem is well suited to the techniques we introduce, as differential privacy may allow for training on data from the true distribution (actual mobile usage) rather than on proxy data from some other source that would produce inferior models. However, to facilitate reproducibility and comparison to non-private models, our experiments are conducted on a public dataset as is standard in differential privacy research. The remainder of this paper is structured around the following contributions:

1. We apply differential privacy to model training using the notion of user-adjacent datasets, leading to formal guarantees of user-level privacy, rather than privacy for single examples.

2. We introduce a noised version of the federated averaging algorithm [McMahan et al., 2016] in $\S$ 2, which satisfies user-adjacent differential privacy via use of the moments accountant [Abadi et al., 2016a] first developed to analyze differentially private stochastic gradient descent (SGD) for example-level privacy. The federated averaging approach groups multiple SGD updates together, enabling large-step model updates.

3. We demonstrate the first high quality LSTM language model trained with strong privacy guarantees in $\S$ 3, showing no significant decrease in model accuracy given a large enough dataset. For example, on a dataset of 763,430 users, baseline (non-private) training achieves an accuracy of $17.5\%$ in 4120 rounds of training, where we use the data from 100 random users on each round. We achieve this same level of accuracy with $(4.6,10^{-9})$ -differential privacy in 4980 rounds, processing on average 5000 users per round—maintaining the same level of accuracy at a significant computational cost of roughly $60\times$ .The additional computational cost could be mitigated by initializing by training on a public dataset, rather than starting from random initialization as we do in our experiments. Running the same computation on a larger dataset with $10^{8}$ users would improve the privacy guarantee to $(1.2,10^{-9})$ . We guarantee privacy and maintain utility despite the complex internal structure of the LSTM—with per-word embeddings as well as dense state transitions—by using the federated averaging algorithm. We demonstrate that the noised model’s metrics and qualitative behavior (with respect to head words) does not differ significantly from the non-private model. To our knowledge, our work represents the most sophisticated machine learning model, judged by the size and the complexity of the model, ever trained with privacy guarantees, and the first such model trained with user-level privacy.

4. In extensive experiments in $\S$ 3, we offer guidelines for parameter tuning when training complex models with differential privacy guarantees. We show that a small number of experiments can narrow the parameter space into a regime where we pay for privacy not in terms of a loss in utility but in terms of an increased computational cost.

We now introduce a few preliminaries. Differential privacy (DP) [Dwork et al., 2006, Dwork, 2011, Dwork and Roth, 2014] provides a well-tested formalization for the release of information derived from private data. Applied to machine learning, a differentially private training mechanism allows the public release of model parameters with a strong guarantee: adversaries are severely limited in what they can learn about the original training data based on analyzing the parameters, even when they have access to arbitrary side information. Formally, it says:

Differential Privacy: A randomized mechanism $\mathcal{M}\colon\mathcal{D}\rightarrow\mathcal{R}$ with a domain $\mathcal{D}$ (e.g., possible training datasets) and range $\mathcal{R}$ (e.g., all possible trained models) satisfies $(\epsilon,\delta)$ -differential privacy if for any two adjacent datasets $d,d^{\prime}\in\mathcal{D}$ and for any subset of outputs $S\subseteq\mathcal{R}$ it holds that $\Pr[\mathcal{M}(d)\in S]\leq e^{\epsilon}\Pr[\mathcal{M}(d^{\prime})\in S]+\delta.$

The definition above leaves open the definition of adjacent datasets which will depend on the application. Most prior work on differentially private machine learning (e.g. Chaudhuri et al. , Bassily et al. , Abadi et al. [2016a], Wu et al. , Papernot et al. ) deals with example-level privacy: two datasets $d$ and $d^{\prime}$ are defined to be adjacent if $d^{\prime}$ can be formed by adding or removing a single training example from $d$ . We remark that while the recent PATE approach of [Papernot et al., 2017] can be adapted to give user-level privacy, it is not suited for a language model where the number of classes (possible output words) is large.

For problems like language modeling, protecting individual examples is insufficient—each typed word makes its own contribution to the RNN’s training objective, so one user may contribute many thousands of examples to the training data. A sensitive word or phrase may be typed several times by an individual user, but it should still be protected.Differential privacy satisfies a property known as group privacy that can allow translation from example-level privacy to user-level privacy at the cost of an increased $\epsilon$ . In our setting, such a blackbox approach would incur a prohibitive privacy cost. This forces us to directly address user-level privacy. In this work, we therefore apply the definition of differential privacy to protect whole user histories in the training set. This user-level privacy is ensured by using an appropriate adjacency relation:

User-adjacent datasets: Let $d$ and $d^{\prime}$ be two datasets of training examples, where each example is associated with a user. Then, $d$ and $d^{\prime}$ are adjacent if $d^{\prime}$ can be formed by adding or removing all of the examples associated with a single user from $d$ .

Model training that satisfies differential privacy with respect to datasets that are user-adjacent satisfies the intuitive notion of privacy we aim to protect for language modeling: the presence or absence of any specific user’s data in the training set has an imperceptible impact on the (distribution over) the parameters of the learned model. It follows that an adversary looking at the trained model cannot infer whether any specific user’s data was used in the training, irrespective of what auxiliary information they may have. In particular, differential privacy rules out memorization of sensitive information in a strong information theoretic sense.

Algorithms for user-level differentially private training

Our private algorithm relies heavily on two prior works: the FederatedAveraging (or FedAvg) algorithm of McMahan et al. , which trains deep networks on user-partitioned data, and the moments accountant of Abadi et al. [2016a], which provides tight composition guarantees for the repeated application of the Gaussian mechanism combined with amplification-via-sampling. While we have attempted to make the current work as self-contained as possible, the above references provide useful background.

FedAvg was introduced by McMahan et al. for federated learning, where the goal is to train a shared model while leaving the training data on each user’s mobile device. Instead, devices download the current model and compute an update by performing local computation on their dataset. It is worthwhile to perform extra computation on each user’s data to minimize the number of communication rounds required to train a model, due to the significantly limited bandwidth when training data remains decentralized on mobile devices. We observe, however, that FedAvg is of interest even in the datacenter when DP is applied: larger updates are more resistant to noise, and fewer rounds of training can imply less privacy cost. Most importantly, the algorithm naturally forms per-user updates based on a single user’s data, and these updates are then averaged to compute the final update applied to the shared model on each round. As we will see, this structure makes it possible to extend the algorithm to provide a user-level differential privacy guarantee.

We also evaluate the FederatedSGD algorithm, essentially large-batch SGD where each minibatch is composed of “microbatches” that include data from a single distinct user. In some datacenter applications FedSGD might be preferable to FedAvg, since fast networks make it more practical to run more iterations. However, those additional iterations come at a privacy cost. Further, the privacy benefits of federated learning are nicely complementary to those of differential privacy, and FedAvg can be applied in the datacenter as well, so we focus on this algorithm while showing that our results also extend to FedSGD.

Both FedAvg and FedSGD are iterative procedures, and in both cases we make the following modifications to the non-private versions in order to achieve differential privacy:

We use random-sized batches where we select users independently with probability $q$ , rather than always selecting a fixed number of users.

We enforce clipping of per-user updates so the total update has bounded $L_{2}$ norm.

We use different estimators for the average update (introduced next).

We add Gaussian noise to the final average update.

The pseudocode for DP-FedAvg and DP-FedSGD is given as Algorithm 1. In the remainder of this section, we introduce estimators for C) and then different clipping strategies for B). Adding the sampling procedure from A) and noise added in D) allows us to apply the moments accountant to bound the total privacy loss of the algorithm, given in Theorem 1. Finally, we consider the properties of the moments accountant that make training on large datasets particular attractive.

Randomly sampling users (or training examples) by selecting each independently with probability $q$ is crucial for proving low privacy loss through the use of the moments accountant [Abadi et al., 2016a]. However, this procedure produces variable-sized samples $\mathcal{C}$ , and when the quantity to be estimated $f(\mathcal{C})$ is an average rather than a sum (as in computing the weighted average update in FedAvg or the average loss on a minibatch in SGD with example-level DP), this has ramifications for the sensitivity of the query $f$ .

Specifically, we consider weighted databases $d$ where each row $k\in d$ is associated with a particular user, and has an associated weight $w_{k}\in$ . This weight captures the desired influence of the row on the final outcome. For example, we might think of row $k$ containing $n_{k}$ different training examples all generated by user $k$ , with weight $w_{k}$ proportional to $n_{k}$ . We are then interested in a bounded-sensitivity estimate of $f(\mathcal{C})=\frac{\sum_{k\in\mathcal{C}}w_{k}\Delta_{k}}{\sum_{k\in\mathcal{C}}w_{k}}$ for per-user vectors $\Delta_{k}$ , for example to estimate the weighted-average user update in FedAvg. Let $W=\sum_{k}w_{k}$ . We consider two such estimators:

Clipping strategies for multi-layer models

However, for deep networks it is more natural to treat the parameters of each layer as a separate vector. The updates to each of these layers could have vastly different $L_{2}$ norms, and so it can be preferable to clip each layer separately.

Formally, suppose each update $\Delta_{k}$ contains $m$ vectors $\Delta_{k}=(\Delta_{k}(1),\dots,\Delta_{k}(m))$ . We consider the following clipping strategies, both of which ensure the total update has norm at most $S$ :

Flat clipping Given an overall clipping parameter $S$ , we clip the concatenation of all the layers as $\Delta^{\prime}_{k}=\operatorname{\pi}(\Delta_{k},S)$ .

Per-layer clipping Given a per-layer clipping parameter $S_{j}$ for each layer, we set $\Delta_{k}^{\prime}(j)=\operatorname{\pi}(\Delta_{k}(j),S_{j})$ . Let $S=\sqrt{\sum_{j=1}^{m}S_{j}^{2}}$ . The simplest model-independent choice is to take $S_{j}=\frac{S}{\sqrt{m}}$ for all $j$ , which we use in experiments.

We remark here that clipping itself leads to additional bias, and ideally, we would choose the clipping parameter to be large enough that nearly all updates are smaller than the clip value. On the other hand, a larger $S$ will require more noise in order to achieve privacy, potentially slowing training. We treat $S$ as a hyper-parameter and tune it.

A privacy guarantee

Differential privacy for large datasets

We use the implementation of the moments accountant from Abadi et al. [2016b]. The moments accountant makes strong use of amplification via sampling, which means increasing dataset size makes achieving high levels of privacy significantly easier. Table 1 summarizes the privacy guarantees offered as we vary some of the key parameters.

The takeaway from this table is that as long as we can afford the cost in utility of adding noise proportional to $z$ times the sensitivity of the updates, we can get reasonable privacy guarantees over a large range of parameters. The size of the dataset has a modest impact on the privacy cost of a single query (1 round column), but a large effect on the number of queries that can be run without significantly increasing the privacy cost (compare the $10^{6}$ round column). For example, on a dataset with $10^{9}$ users, the privacy upper bound is nearly constant between $1$ and $10^{6}$ calls to the mechanism (that is, rounds of the optimization algorithm).

Experimental Results

In this section, we evaluate DP-FedAvg while training an LSTM RNN tuned for language modeling in a mobile keyboard. We vary noise, clipping, and the number of users per round to develop an intuition of how privacy affects model quality in practice.

We defer our experimental results on FedSGD as well as on models with larger dictionaries to Appendix $\S$ D. To summarize, they show that FedAvg gives better privacy-utility trade-offs than FedSGD, and that our empirical conclusions extend to larger dictionaries with relatively little need for additional parameter tuning despite the significantly larger models. Some less important plots are deferred to $\S$ C.

The goal of a language model is to predict the next word in a sequence $s_{t}$ from the preceding words $s_{0}...s_{t-1}$ . The neural language model architecture used here is a variant of the LSTM recurrent neural network [Hochreiter and Schmidhuber, 1997] trained to predict the next word (from a fixed dictionary) given the current word and a state vector passed from the previous time step. LSTM language models are competitive with traditional n-gram models [Sundermeyer et al., 2012] and are a standard baseline for a variety of ever more advanced neural language model architectures [Grave et al., 2016, Merity et al., 2016, Gal and Ghahramani, 2016]. Our model uses a few tricks to decrease the size for deployment on mobile devices (total size is 1.35M parameters), but is otherwise standard. We evaluate using AccuracyTop1, the probability that the word to which the model assigns highest probability is correct . Details on the model and evaluation metrics are given in $\S$ B. All training began from a common random initialization, though for real-world applications pre-training on public data is likely preferable (see $\S$ B for additional discussion).

Dataset

Building towards DP: sampling, estimators, clipping, and noise

Recall achieving differential privacy for FedAvg required a number of changes ( $\S$ 2, items A–D). In this section, we examine the impact of each of these changes, both to understand the immediate effects and to enable the selection of reasonable parameters for our final DP experiments. This sequence of experiments also provides a general road-map for applying differentially private training to new models and datasets. For these experiments, we use the FedAvg algorithm with a fixed learning rate of 6.0, which we verified was a reasonable choice in preliminary experiments.The proper choice of for the clipping parameters may depend on the learning rate, so if the learning rate is changed, clipping parameter choices will also need to be re-evaluated. In all FedAvg experiments, we used a local batch size of $B=8$ , an unroll size of 10 tokens, and made $E=1$ passes over the local dataset; thus FedAvg processes $80$ tokens per batch, processing a user’s 1600 tokens in 20 batches per round.

Next, we investigate the impact of flat and per-layer clipping on the convergence rate of FedAvg. The model has 11 parameter vectors, and for per-layer clipping we simply chose to distribute the clipping budget equally across layers with $S_{j}=S/\sqrt{11}$ . Figure 2 shows that choosing $S\in$ has at most a small effect on convergence rate.

Estimating the accuracy of private models for large datasets

Adjusting noise and clipping as training progresses

Figure 1 shows that as training progresses, each level of noise eventually becomes detrimental (the line drops somewhat below the baseline). This suggests using a smaller $\sigma$ and correspondingly smaller $S$ (thus fixing $z$ so the privacy cost of each round is unchanged) as training progresses. Figure 4 (and Figure 8 in $\S$ C) shows this can be effective. We indeed observe that early in training (red), $S$ in the $10$ – $12.6$ range works well ( $\sigma=0.006$ – $0.0076$ ). However, if we adjust the clipping/noise tradeoff after 4885 rounds of training and continue for another 6000, switching to $S=7.9$ and $\sigma=0.0048$ performs better.

Comparing DP and non-DP models

Conclusions

In this work, we introduced an algorithm for user-level differentially private training of large neural networks, in particular a complex sequence model for next-word prediction. We empirically evaluated the algorithm on a realistic dataset and demonstrated that such training is possible at a negligible loss in utility, instead paying a cost in additional computation. Such private training, combined with federated learning (which leaves the sensitive training data on device rather than centralizing it), shows the possibility of training models with significant privacy guarantees for important real world applications. Much future work remains, for example designing private algorithms that automate and make adaptive the tuning of the clipping/noise tradeoff, and the application to a wider range of model families and architectures, for example GRUs and character-level models. Our work also highlights the open direction of reducing the computational overhead of differentially private training of non-convex models.

References

Appendix A Additional proofs

It suffices to verify that 1. the moments (of the privacy loss) at each step are correctly bounded; and, 2. the composability holds when accumulating the moments of multiple steps.

Our algorithm, as described in Figure 1, uses a fixed noise variance and generates the i.i.d. noise independent of the private data. Hence we can apply the composability as in Theorem 2.1 in Abadi et al. [2016a].

Appendix B Experiment details

The first step in training a word-level recurrent language model is selecting the vocabulary of words to model, with remaining words mapped to a special “UNK” (unknown) token. Training a fully differentially private language model from scratch requires a private mechanism to discover which words are frequent across the corpus, for example using techniques like distributed heavy-hitter estimation [Chan et al., 2012, Bassily et al., 2017]. For this work, we simplified the problem by pre-selecting a dictionary of the most frequent 10,000 words (after normalization) in a large corpus of mixed material from the web and message boards (but not our training or test dataset).

Unlike standard LSTM language models, our model uses the same learned embedding for the input tokens and for determining the predicted distribution on output tokens from the softmax.Press and Wolf independently introduced this technique and provide an empirical analysis comparing models with and without weight tying. This reduces the size of the model by about 40% for a small decrease in model quality, an advantageous tradeoff for mobile applications. Another change from many standard LSTM RNN approaches is that we train these models to restrict the word embeddings to have a fixed $L_{2}$ norm of 1.0, a modification found in earlier experiments to improve convergence time. In total the model has 1.35M trainable parameters.

Initialization and personalization

For many applications public proxy data is available, e.g., for next-word prediction one could use public domain books, Wikipedia articles, or other web content. In this case, an initial model trained with standard (non-private) algorithms on the public data (which is likely drawn from the wrong distribution) can then be further refined by continuing with differentially-private training on the private data for the precise problem at hand. Such pre-training is likely the best approach for practical applications. However, since training models purely on private data (starting from random initialization) is a strictly harder problem, we focus on this scenario for our experiments.

Our focus is also on training a single model which is shared by all users. However, we note that our approach is fully compatible with further on-device personalization of these models to the particular data of each user. It is also possible to give the central model some ability to personalize simply by providing information about the user as a feature vector along with the raw text input. LSTMs are well-suited to incorporating such additional context.

Accuracy metrics

We evaluate using AccuracyTop1, the probability that the word to which the model assigns highest probability is correct (after some minimal normalization). We always count it as a mistake if the true next word is not in the dictionary, even if the model predicts UNK, in order to allow fair comparisons of models using different dictionaries. In our experiments, we found that our model architecture is competitive on AccuracyTop1 and related metrics (Top3, Top5, and perplexity) across a variety of tasks and corpora.

Dataset

The Reddit dataset can be accessed through Google BigQuery [Reddit Comments Dataset, ]. Since our goal is to limit the contribution of any one author to the final model, it is not necessary to include all the data from users with a large number of posts. On the other hand, processing users with too little data slows experiments (due to constant per-user overhead). Thus, we use a training set where we have removed all users with fewer than 1600 tokens (words), and truncated the remaining $K=763,430$ users to have exactly 1600 tokens.

We intentionally chose a public dataset for research purposes, but carefully chose one with a structure and contents similar to private datasets that arise in real-world language modeling task such as predicting the next-word in a mobile keyboard. This allows for reproducibility, comparisons to non-private models, and inspection of the data to understand the impact of differential privacy beyond coarse aggregate statistics (as in Table 3).

Appendix C Supplementary Plots

Appendix D Additional Experiments

Models with larger dictionaries

Other experiments

We experimented with adding an explicit $L_{2}$ penalty on the model updates (not the full model) on each user, hoping this would decrease the need for clipping by preferring updates with a smaller $L_{2}$ norm. However, we saw no positive effect from this.