Stealing Machine Learning Models via Prediction APIs

Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, Thomas Ristenpart

Introduction

Machine learning (ML) aims to provide automated extraction of insights from data by means of a predictive model. A predictive model is a function that maps feature vectors to a categorical or real-valued output. In a supervised setting, a previously gathered data set consisting of possibly confidential feature-vector inputs (e.g., digitized health records) with corresponding output class labels (e.g., a diagnosis) serves to train a predictive model that can generate labels on future inputs. Popular models include support vector machines (SVMs), logistic regressions, neural networks, and decision trees.

ML algorithms’ success in the lab and in practice has led to an explosion in demand. Open-source frameworks such as PredictionIO and cloud-based services offered by Amazon, Google, Microsoft, BigML, and others have arisen to broaden and simplify ML model deployment.

Cloud-based ML services often allow model owners to charge others for queries to their commercially valuable models. This pay-per-query deployment option exemplifies an increasingly common tension: The query interface of an ML model may be widely accessible, yet the model itself and the data on which it was trained may be proprietary and confidential. Models may also be privacy-sensitive because they leak information about training data . For security applications such as spam or fraud detection , an ML model’s confidentiality is critical to its utility: An adversary that can learn the model can also often evade detection .

In this paper we explore model extraction attacks, which exploit the tension between query access and confidentiality in ML models. We consider an adversary that can query an ML model (a.k.a. a prediction API) to obtain predictions on input feature vectors. The model may be viewed as a black box. The adversary may or may not know the model type (logistic regression, decision tree, etc.) or the distribution over the data used to train the model. The adversary’s goal is to extract an equivalent or near-equivalent ML model, i.e., one that achieves (close to) 100% agreement on an input space of interest.

We demonstrate successful model extraction attacks against a wide variety of ML model types, including decision trees, logistic regressions, SVMs, and deep neural networks, and against production ML-as-a-service (MLaaS) providers, including Amazon and BigML.We simulated victims by training models in our own accounts. We have disclosed our results to affected services in February 2016. In nearly all cases, our attacks yield models that are functionally very close to the target. In some cases, our attacks extract the exact parameters of the target (e.g., the coefficients of a linear classifier or the paths of a decision tree). For some targets employing a model type, parameters or features unknown to the attacker, we additionally show a successful preliminary attack step involving reverse-engineering these model characteristics.

Our most successful attacks rely on the information-rich outputs returned by the ML prediction APIs of all cloud-based services we investigated. Those of Google, Amazon, Microsoft, and BigML all return high-precision confidence values in addition to class labels. They also respond to partial queries lacking one or more features. Our setting thus differs from traditional learning-theory settings that assume only membership queries, outputs consisting of a class label only. For example, for logistic regression, the confidence value is a simple log-linear function $1/(1+e^{-(\mathbf{w}\cdot\mathbf{x}+\beta)})$ of the $d$ -dimensional input vector $\mathbf{x}$ . By querying $d+1$ random $d$ -dimensional inputs, an attacker can with high probability solve for the unknown $d+1$ parameters $\mathbf{w}$ and $\beta$ defining the model. We emphasize that while this model extraction attack is simple and non-adaptive, it affects all of the ML services we have investigated.

Such equation-solving attacks extend to multiclass logistic regressions and neural networks, but do not work for decision trees, a popular model choice. (BigML, for example, initially offered only decision trees.) For decision trees, a confidence value reflects the number of training data points labeled correctly on an input’s path in the tree; simple equation-solving is thus inapplicable. We show how confidence values can nonetheless be exploited as pseudo-identifiers for paths in the tree, facilitating discovery of the tree’s structure. We demonstrate successful model extraction attacks that use adaptive, iterative search algorithms to discover paths in a tree.

We experimentally evaluate our attacks by training models on an array of public data sets suitable as stand-ins for proprietary ones. We validate the attacks locally using standard ML libraries, and then present case studies on BigML and Amazon. For both services, we show computationally fast attacks that use a small number of queries to extract models matching the targets on 100% of tested inputs. See Table 1 for a quantitative summary.

Having demonstrated the broad applicability of model extraction attacks to existing services, we consider the most obvious potential countermeasure ML services might adopt: Omission of confidence values, i.e., output of class labels only. This approach would place model extraction back in the membership query setting of prior work in learning theory . We demonstrate a generalization of an adaptive algorithm by Lowd and Meek from binary linear classifiers to more complex model types, and also propose an attack inspired by the agnostic learning algorithm of Cohn et al. . Our new attacks extract models matching targets on $>$ 99% of the input space for a variety of model classes, but need up to $100\times$ more queries than equation-solving attacks (specifically for multiclass linear regression and neural networks). While less effective than equation-solving, these attacks remain attractive for certain types of adversary. We thus discuss further ideas for countermeasures.

In summary, we explore model extraction attacks, a practical kind of learning task that, in particular, affects emerging cloud-based ML services being built by Amazon, Google, Microsoft, BigML, and others. We show:

Simple equation-solving model extraction attacks that use non-adaptive, random queries to solve for the parameters of a target model. These attacks affect a wide variety of ML models that output confidence values. We show their success against Amazon’s service (using our own models as stand-ins for victims’), and also report successful reverse-engineering of the (only partially documented) model type employed by Amazon.

A new path-finding algorithm for extracting decision trees that abuses confidence values as quasi-identifiers for paths. To our knowledge, this is the first example of practical “exact” decision tree learning. We demonstrate the attack’s efficacy via experiments on BigML.

Model extraction attacks against models that output only class labels, the obvious countermeasure against extraction attacks that rely on confidence values. We show slower, but still potentially dangerous, attacks in this setting that build on prior work in learning theory.

We additionally make a number of observations about the implications of extraction. For example, attacks against Amazon’s system indirectly leak various summary statistics about a private training set, while extraction against kernel logistic regression models recovers significant information about individual training data points.

The source code for our attacks is available online at https://github.com/ftramer/Steal-ML.

Background

For our purposes, a ML model is a function $f\colon\mathcal{X}\rightarrow\mathcal{Y}$ . An input is a $d$ -dimensional vector in the feature space $\mathcal{X}=\mathcal{X}_{1}\times\mathcal{X}_{2}\times\dots\times\mathcal{X}_{d}$ . Outputs lie in the range $\mathcal{Y}$ .

We consider models obtained via supervised learning. These models are generated by a training algorithm $\mathcal{T}$ that takes as input a training set $\{(\mathbf{x}_{i},y_{i})\}_{i}$ , where $(\mathbf{x}_{i},y_{i})\in\mathcal{X}\times\mathcal{Y}$ is an input with an associated (presumptively correct) class label. The output of $\mathcal{T}$ is a model $f$ defined by a set of parameters, which are model-specific, and hyper-parameters, which specify the type of models $\mathcal{T}$ generates. Hyper-parameters may be viewed as distinguished parameters, often taken from a small number of standard values; for example, the kernel-type used in an SVM, of which only a small set are used in practice, may be seen as a hyper-parameter.

Model Extraction Attacks

An ML model extraction attack arises when an adversary obtains black-box access to some target model $f$ and attempts to learn a model $\hat{f}$ that closely approximates, or even matches, $f$ (see Figure 1).

As mentioned previously, the restricted case in which $f$ outputs class labels only, matches the membership query setting considered in learning theory, e.g., PAC learning and other previous works . Learning theory algorithms have seen only limited study in practice, e.g., in , and our investigation may be viewed as a practice-oriented exploration of this branch of research. Our initial focus, however, is on a different setting common in today’s MLaaS services, which we now explain in detail. Models trained by these services emit data-rich outputs that often include confidence values, and in which partial feature vectors may be considered valid inputs. As we show later, this setting greatly advantages adversaries.

A number of companies have launched or are planning to launch cloud-based ML services. A common denominator is the ability of users to upload data sets, have the provider run training algorithms on the data, and make the resulting models generally available for prediction queries. Simple-to-use Web APIs handle the entire interaction. This service model lets users capitalize on their data without having to set up their own large-scale ML infrastructure. Details vary greatly across services. We summarize a number of them in Table 2 and now explain some of the salient features.

A model is white-box if a user may download a representation suitable for local use. It is black-box if accessible only via a prediction query interface. Amazon and Google, for example, provide black-box-only services. Google does not even specify what training algorithm their service uses, while Amazon provides only partial documentation for its feature extraction ex (see Section 5). Some services allow users to monetize trained models by charging others for prediction queries.

To use these services, a user uploads a data set and optionally applies some data pre-processing (e.g., field removal or handling of missing values). She then trains a model by either choosing one of many supported model classes (as in BigML, Microsoft, and PredictionIO) or having the service choose an appropriate model class (as in Amazon and Google). Two services have also announced upcoming support for users to upload their own trained models (Google) and their own custom learning algorithms (PredictionIO). When training a model, users may tune various parameters of the model or training-algorithm (e.g., regularizers, tree size, learning rates) and control feature-extraction and transformation methods.

For black-box models, the service provides users with information needed to create and interpret predictions, such as the list of input features and their types. Some services also supply the model class, chosen training parameters, and training data statistics (e.g., BigML gives the range, mean, and standard deviation of each feature).

To get a prediction from a model, a user sends one or more input queries. The services we reviewed accept both synchronous requests and asynchronous ‘batch’ requests for multiple predictions. We further found varying degrees of support for ‘incomplete’ queries, in which some input features are left unspecified . We will show that exploiting incomplete queries can drastically improve the success of some of our attacks. Apart from PredictionIO, all of the services we examined respond to prediction queries with not only class labels, but a variety of additional information, including confidence scores (typically class probabilities) for the predicted outputs.

We now describe possible motivations for adversaries to perform model extraction attacks. We then present a more detailed threat model informed by characteristics of the aforementioned ML services.

Avoiding query charges. Successful monetization of prediction queries by the owner of an ML model $f$ requires confidentiality of $f$ . A malicious user may seek to launch what we call a cross-user model extraction attack, stealing $f$ for subsequent free use. More subtly, in black-box-only settings (e.g., Google and Amazon), a service’s business model may involve amortizing up-front training costs by charging users for future predictions. A model extraction attack will undermine the provider’s business model if a malicious user pays less for training and extracting than for paying per-query charges.

Violating training-data privacy. Model extraction could, in turn, leak information about sensitive training data. Prior attacks such as model inversion have shown that access to a model can be abused to infer information about training set points. Many of these attacks work better in white-box settings; model extraction may thus be a stepping stone to such privacy-abusing attacks. Looking ahead, we will see that in some cases, significant information about training data is leaked trivially by successful model extraction, because the model itself directly incorporates training set points.

Stepping stone to evasion. In settings where an ML model serves to detect adversarial behavior, such as identification of spam, malware classification, and network anomaly detection, model extraction can facilitate evasion attacks. An adversary may use knowledge of the ML model to avoid detection by it .

In all of these settings, there is an inherent assumption of secrecy of the ML model in use. We show that this assumption is broken for all ML APIs that we investigate.

Two distinct adversarial models arise in practice. An adversary may be able to make direct queries, providing an arbitrary input $\mathbf{x}$ to a model $f$ and obtaining the output $f(\mathbf{x})$ . Or the adversary may be able to make only indirect queries, i.e., queries on points in input space $\mathcal{M}$ yielding outputs $f(\textsf{ex}(M))$ . The feature extraction mechanism ex may be unknown to the adversary. In Section 5, we show how ML APIs can further be exploited to “learn” feature extraction mechanisms. Both direct and indirect access to $f$ arise in ML services. (Direct query interfaces arise when clients are expected to perform feature extraction locally.) In either case, the output value can be a class label, a confidence value vector, or some data structure revealing various levels of information, depending on the exposed API.

We model the adversary, denoted by $\mathcal{A}$ , as a randomized algorithm. The adversary’s goal is to use as few queries as possible to $f$ in order to efficiently compute an approximation $\hat{f}$ that closely matches $f$ . We formalize “closely matching” using two different error measures:

An adversary may know any of a number of pieces of information about a target $f$ : What training algorithm $\mathcal{T}$ generated $f$ , the hyper-parameters used with $\mathcal{T}$ , the feature extraction function ex, etc. We will investigate a variety of settings in this work corresponding to different APIs seen in practice. We assume that $\mathcal{A}$ has no more information about a model’s training data, than what is provided by an ML API (e.g., summary statistics). For simplicity, we focus on proper model extraction: If $\mathcal{A}$ believes that $f$ belongs to some model class, then $\mathcal{A}$ ’s goal is to extract a model $\hat{f}$ from the same class. We discuss some intuition in favor of proper extraction in Appendix D, and leave a broader treatment of improper extraction strategies as an interesting open problem.

Extraction with Confidence Values

We begin our study of extraction attacks by focusing on prediction APIs that return confidence values. As per Section 2, the output of a query to $f$ thus falls in a range $^{c}$ where $c$ is the number of classes. To motivate this, we recall that most ML APIs reveal confidence values for models that support them (see Table 2). This includes logistic regressions (LR), neural networks, and decision trees, defined formally in Appendix A. We first introduce a generic equation-solving attack that applies to all logistic models (LR and neural networks). In Section 4.2, we present two novel path-finding attacks on decision trees.

Many ML models we consider directly compute class probabilities as a continuous function of the input $\mathbf{x}$ and real-valued model parameters. In this case, an API that reveals these class probabilities provides an adversary $\mathcal{A}$ with samples $(\mathbf{x},f(\mathbf{x}))$ that can be viewed as equations in the unknown model parameters. For a large class of models, these equation systems can be efficiently solved, thus recovering $f$ (or some good approximation of it).

Our approach for evaluating attacks will primarily be experimental. We use a suite of synthetic or publicly available data sets to serve as stand-ins for proprietary data that might be the target of an extraction attack. Table 3 displays the data sets used in this section, which we obtained from various sources: the synthetic ones we generated; the others are taken from public surveys (Steak Survey and GSS Survey ), from scikit (Digits) or from the UCI ML library . Mre details about these data sets are in Appendix B.

As a simple starting point, we consider the case of logistic regression (LR). A LR model performs binary classification $(c=2)$ , by estimating the probability of a binary response, based on a number of independent features. LR is one of the most popular binary classifiers, due to its simplicity and efficiency. It is widely used in many scientific fields (e.g., medical and social sciences) and is supported by all the ML services we reviewed.

Given an oracle sample $(\mathbf{x},f(\mathbf{x}))$ , we get a linear equation $\mathbf{w}\cdot\mathbf{x}+\beta=\sigma^{-1}(f_{1}(\mathbf{x}))$ . Thus, $d+1$ samples are both necessary and sufficient (if the queried $\mathbf{x}$ are linearly independent) to recover $\mathbf{w}$ and $\beta$ . Note that the required samples are chosen non-adaptively, and can thus be obtained from a single batch request to the ML service.

We stress that while this extraction attack is rather straightforward, it directly applies, with possibly devastating consequences, to all cloud-based ML services we considered. As an example, recall that some services (e.g., BigML and Google) let model owners monetize black-box access to their models. Any user who wishes to make more than $d+1$ queries to a model would then minimize the prediction cost by first running a cross-user model extraction attack, and then using the extracted model for personal use, free of charge. As mentioned in Section 3, attackers with a final goal of model-inversion or evasion may also have incentives to first extract the model. Moreover, for services with black-box-only access (e.g., Amazon or Google), a user may abuse the service’s resources to train a model over a large data set $D$ (i.e., $|D|\gg d$ ), and extract it after only $d+1$ predictions. Crucially, the extraction cost is independent of $|D|$ . This could undermine a service’s business model, should prediction fees be used to amortize the high cost of training.

1.2 Multiclass LRs and Multilayer Perceptrons

We now show that such equation-solving attacks broadly extend to all model classes with a ‘logistic’ layer, including multiclass $(c>2)$ LR and deeper neural networks. We define these models formally in Appendix A.

A multiclass logistic regression (MLR) combines $c$ binary models, each with parameters $\mathbf{w}_{i},\beta_{i}$ , to form a multiclass model. MLRs are available in all ML services we reviewed. We consider two types of MLR models: softmax and one-vs-rest (OvR), that differ in how the $c$ binary models are trained and combined: A softmax model fits a joint multinomial distribution to all training samples, while a OvR model trains a separate binary LR for each class, and then normalizes the class probabilities.

To illustrate our attack’s success, we train a softmax regression, a OvR regression and a MLP on the Adult data set with target ‘Race’ ( $c=5$ ). For the non-linear equation systems we obtain, we do not know a priori how many samples we need to find a solution (in contrast to linear systems where $d+1$ samples are necessary and sufficient). We thus explore various query budgets of the form $\alpha\cdot k$ , where $k$ is the number of unknown model parameters, and $\alpha$ is a budget scaling factor. For MLRs, we solve the equation system with BFGS in scikit . For MLPs, we use theano to run stochastic gradient descent for $1{,}000$ epochs. Our experiments were performed on a commodity laptop (2-core Intel CPU @3.1GHz, 16GB RAM, no GPU acceleration).

Table 4 shows the extraction success for each model, as we vary $\alpha$ from $0.5$ to at most $5$ . For MLR models (softmax and OvR), the attack is extremely efficient, requiring around one query per unknown parameter of $f$ (each query yields $c=5$ equations). For MLPs, the system to solve is more complex, with about $4$ times more unknowns. With a sufficiently over-determined system, we converge to a model $\hat{f}$ that very closely approximates $f$ . As for LR models, queries are chosen non-adaptively, so $\mathcal{A}$ may submit a single ‘batch request’ to the API.

1.3 Training Data Leakage for Kernel LR

We now move to a less mainstream model class, kernel logistic regression , that illustrates how extraction attacks can leak private training data, when a model’s outputs are directly computed as a function of that data.

Kernel methods are commonly used to efficiently extend support vector machines (SVM) to nonlinear classifiers , but similar techniques can be applied to logistic regression . Compared to kernel SVMs, kernel logistic regressions (KLR) have the advantage of computing class probabilities, and of naturally extending to multiclass problems. Yet, KLRs have not reached the popularity of kernel SVMs or standard LRs, and are not provided by any MLaaS provider at the time. We note that KLRs could easily be constructed in any ML library that supports both kernel functions and LR models.

A KLR model is a softmax model, where we replace the linear components $\mathbf{w}_{i}\cdot\mathbf{x}+\beta_{i}$ by a mapping $\sum_{r=1}^{s}\alpha_{i,r}K(\mathbf{x},\mathbf{x}_{r})+\beta_{i}$ . Here, $K$ is a kernel function, and the representers $\mathbf{x}_{1},\dots,\mathbf{x}_{s}$ are a chosen subset of the training points . More details are in Appendix A.

We build two KLR models with a radial-basis function (RBF) kernel for a data set of handwritten digits. We select $20$ random digits as representers for the first model, and all $1{,}257$ training points for the second. We extract the first model, assuming knowledge of $s$ , by solving a system of $50{,}000$ equations in $1{,}490$ unknowns. We use the same approach as for MLPs, i.e., logistic-loss minimization using gradient descent. We initialize the extracted representers to uniformly random vectors in $\mathcal{X}$ , as we assume $\mathcal{A}$ does not know the training data distribution. In Figure 3(a), we plot $5$ of the model’s representers from the training data, and the $5$ closest (in $l_{1}$ norm) extracted representers. The attack clearly leaks information on individual training points. We measure the attack’s robustness to uncertainty about $s$ , by attacking the second model with only $10$ local representers ( $10{,}000$ equations in $750$ unknowns). Figure 3(b) shows the average image of training points classified as a $3,4,5,6$ or $7$ by the target model $f$ , along with $5$ extracted representers of $\hat{f}$ . Surprisingly maybe, the attack seems to be leaking the ‘average representor’ of each class in the training data.

1.4 Model Inversion Attacks on Extracted Models

Access to a model may enable inference of privacy-damaging information, particularly about the training set . The model inversion attack explored by Fredrikson et al. uses access to a classifier $f$ to find the input $\mathbf{x}_{\rm opt}$ that maximizes the class probability for class $i$ , i.e., $\mathbf{x}_{\rm opt}=\operatorname*{argmax}_{\mathbf{x}\in\mathcal{X}}f_{i}(\mathbf{x})$ . This was shown to allow recovery of recognizable images of training set members’ faces when $f$ is a facial recognition model.

Their attacks work best in a white-box setting, where the attacker knows $f$ and its parameters. Yet, the authors also note that in a black-box setting, remote queries to a prediction API, combined with numerical approximation techniques, enable successful, albeit much less efficient, attacks. Furthermore, their black-box attacks inherently require $f$ to be queried adaptively. They leave as an open question making black-box attacks more efficient.

We explore composing an attack that first attempts to extract a model $\hat{f}\approx f$ , and then uses it with the white-box inversion attack. Our extraction techniques replace adaptive queries with a non-adaptive “batch” query to $f$ , followed by local computation. We show that extraction plus inversion can require fewer queries and less time than performing black-box inversion directly.

As a case study, we use the softmax model from , trained over the AT&T Faces data . The data set consists of images of faces ( $92\times 112$ pixels) of $40$ people. The black-box attack from needs about $20{,}600$ queries to reconstruct a recognizable face for a single training set individual. Reconstructing the faces of all $40$ individuals would require around $800{,}000$ online queries.

For black-box attacks, the authors of estimate a query latency of $70$ milliseconds (a little less than in our own measurements of ML services, see Table 1). Thus, it takes $24$ minutes to recover a single face (the inversion attack runs in seconds), and $16$ hours to recover all $40$ images. In contrast, solving the large equation system underlying our model-extraction attack took $10$ hours. The $41{,}216$ online queries would take under one hour if executed sequentially and even less with a batch query. The cost of the $40$ local white-box attacks is negligible.

Thus, if the goal is to reconstruct faces for all $40$ training individuals, performing model inversion over a previously extracted model results in an attack that is both faster and requires $20\times$ fewer online queries.

2 Decision Tree Path-Finding Attacks

Contrary to logistic models, decision trees do not compute class probabilities as a continuous function of their input. Rather, decision trees partition the input space into discrete regions, each of which is assigned a label and confidence score. We propose a new path-finding attack, that exploits API particularities to extract the ‘decisions’ taken by a tree when classifying an input.

Prior work on decision tree extraction has focused on trees with Boolean features and outputs. While of theoretical importance, such trees have limited practical use. Kushilevitz and Mansour showed that Boolean trees can be extracted using membership queries (arbitrary queries for class labels), but their algorithm does not extend to more general trees. Here, we propose attacks that exploit ML API specificities, and that apply to decision tree models used in MLaaS platforms.

Our tree model, defined formally in Appendix A, allows for binary and multi-ary splits over categorical features, and binary splits over numeric features. Each leaf of the tree is labeled with a class label and a confidence score. We note that our attacks also apply (often with better results) to regression trees. In regression trees, each leaf is labeled with a real-valued output and confidence.

The key idea behind our attack is to use the rich information provided by APIs on a prediction query, as a pseudo-identifier for the path that the input traversed in the tree. By varying the value of each input feature, we then find the predicates to be satisfied, for an input to follow a given path in the tree. We will also exploit the ability to query incomplete inputs, in which each feature $x_{i}$ is chosen from a space $\mathcal{X}_{i}\cup\left\{\bot\right\}$ , where $\bot$ encodes the absence of a value. One way of handling such inputs () is to label each node in the tree with an output value. On an input, we traverse the tree until we reach a leaf or an internal node with a split over a missing feature, and output that value of that leaf or node.

We formalize these notions by defining oracles that $\mathcal{A}$ can query to obtain an identifier for the leaf or internal node reached by an input. In practice, we instantiate these oracles using prediction API peculiarities.

Let each node $v$ of a tree $T$ be assigned some identifier $\texttt{id}_{v}$ . A leaf-identity oracle $\mathcal{O}$ takes as input a query $\mathbf{x}\in\mathcal{X}$ and returns the identifier of the leaf of the tree $T$ that is reached on input $\mathbf{x}$ .

A node-identity oracle $\mathcal{O}_{\bot}$ takes as input a query $\mathbf{x}\in\mathcal{X}_{1}\cup\left\{\bot\right\}\times\cdots\times\mathcal{X}_{d}\cup\left\{\bot\right\}$ and returns the identifier of the node or leaf of $T$ at which the tree computation halts.

We now present our path-finding attack (Algorithm 1), that assumes a leaf-identity oracle that returns unique identifiers for each leaf. We will relax the uniqueness assumption further on. The attack starts with a random input $\mathbf{x}$ and gets the leaf id from the oracle. We then search for all constraints on $\mathbf{x}$ that have to be satisfied to remain in that leaf, using procedures LINE_SEARCH (for continuous features) and CAT_SPLIT (for categorical features) described below. From this information, we then create new queries for unvisited leaves. Once all leaves have been found, the algorithm returns, for each leaf, the corresponding constraints on $\mathbf{x}$ . We analyze the algorithm’s correctness and complexity in Appendix C.

We illustrate our algorithm with a toy example of a tree over continuous feature Size and categorical feature Color (see Figure 4). The current query is $\mathbf{x}=\{Size=50,\ Color=R\}$ and $\mathcal{O}(\mathbf{x})=\texttt{id}_{2}$ . Our goal is two-fold: (1) Find the predicates that $\mathbf{x}$ has to satisfy to end up in leaf $\texttt{id}_{2}$ (i.e., $Size\in(40,60],\ Color=R$ ), and (2) create new inputs $\mathbf{x}^{\prime}$ to explore other paths in the tree.

The CATEGORY_SPLIT procedure (line 20) finds splits on categorical features. In our example, we vary the value of Color in $\mathbf{x}$ and query $\mathcal{O}$ to get a leaf id for each value. We then build a set $S$ of values that lead to the current leaf, i.e., $S=\left\{\texttt{R}\right\}$ , and a set $V$ of values to set in $\mathbf{x}$ to explore other leaves (one representative per leaf). In our example, we could have $V=\left\{\texttt{B},\texttt{G},\texttt{Y}\right\}$ or $V=\left\{\texttt{B},\texttt{G},\texttt{O}\right\}$ .

Using these two procedures, we thus find the predicates defining the path to leaf $\texttt{id}_{2}$ , and generate new queries $\mathbf{x}^{\prime}$ for unvisited leaves of the tree.

We propose an empirically more efficient top-down algorithm that exploits queries over partial inputs. It extracts the tree ‘layer by layer’, starting at the root: We start with an empty query (all features set to $\bot$ ) and get the root’s id by querying $\mathcal{O}_{\bot}$ . We then set each feature in turn and query $\mathcal{O}$ again. For exactly one feature (the root’s splitting feature), the input will reach a different node. With similar procedures as described previously, we extract the root’s splitting criterion, and recursively search lower layers of the tree.

As we verify empirically, our attacks are resilient to some nodes or leaves sharing the same id. We can modify line 7 in Algorithm 1 to detect id duplicates, by checking not only whether a leaf with the current id was already visited, but also whether the current query violates that leaf’s predicates. The main issue with duplicate ids comes from the LINE_SEARCH and CATEGORY_SPLIT procedures: if two queries $\mathbf{x}$ and $\mathbf{x}^{\prime}$ differ in a single feature and reach different leaves with the same id, the split on that feature will be missed.

2.2 Attack Evaluation

Our tree model (see Appendix A) is the one used by BigML. Other ML services use similar tree models. For our experiments, we downloaded eight public decision trees from BigML (see Table 5), and queried them locally using available API bindings. More details on these models are in Appendix B. We show online extraction attacks on black-box models from BigML in Section 5.

To emulate black-box model access, we first issue online queries to BigML, to determine the information contained in the service’s responses. We then simulate black-box access locally, by discarding any extra information returned by the local API. Specifically, we make use of the following fields in query responses:

Prediction. This entry contains the predicted class label (classification) or real-valued output (regression).

Confidence. For classification and regression trees, BigML computes confidence scores based on a confidence interval for predictions at each node . The prediction and confidence value constitute a node’s id.

Fields. Responses to black-box queries contain a ‘fields’ property, that lists all features that appear either in the input query or on the path traversed in the tree. If a partial query $\mathbf{x}$ reaches an internal node $v$ , this entry tells us which feature $v$ splits on (the feature is in the ‘fields’ entry, but not in the input $\mathbf{x}$ ). We make use of this property for the top-down attack variant.

Table 6 displays the results of our attacks. For each tree, we give its number of leaves, the number of unique leaf ids, and the tree depth. We display the success rate for Algorithm 1 and for the “top-down” variant with incomplete queries. Querying partial inputs vastly improves our attack: we require far less queries (except for the Steak Survey model, where Algorithm 1 only visits a fraction of all leaves and thus achieves low success) and achieve higher accuracy for trees with duplicate leaf ids. As expected, both attacks achieve perfect extraction when all leaves have unique ids. While this is not always the case for classification trees, it is far more likely for regression trees, where both the label and confidence score take real values. Surprisingly maybe, the top-down approach also fully extracts some trees with a large number of duplicate leaf ids. The attacks are also efficient: The top-down approach takes less than $10$ seconds to extract a tree, and Algorithm 1 takes less than $6$ minutes for the largest tree. For online attacks on ML services, discussed next, this cost is trumped by the delay for the inherently adaptive prediction queries that are issued.

Online Model Extraction Attacks

In this section, we showcase online model extraction attacks against two ML services: BigML and Amazon. For BigML, we focus on extracting models set up by a user, who wishes to charge for predictions. For Amazon, our goal is to extract a model trained by ourselves, to which we only get black-box access. Our attacks only use exposed APIs, and do not in any way attempt to bypass the services’ authentication or access-control mechanisms. We only attack models trained in our own accounts.

BigML currently only allows monetization of decision trees . We train a tree on the German Credit data, and set it up as a black-box model. The tree has $26$ leaves, two of which share the same label and confidence score. From another account, we extract the model using the two attacks from Section 4.2. We first find the tree’s number of features, their type and their range, from BigML’s public gallery. Our attacks (Algorithm 1 and the top-down variant) extract an exact description of the tree’s paths, using respectively $1{,}722$ and $1{,}150$ queries. Both attacks’ duration ( $1{,}030$ seconds and $631$ seconds) is dominated by query latency ( $\approx 500\text{ms}/\text{query}$ ). The monetary cost of the attack depends on the per-prediction-fee set by the model owner. In any case, a user who wishes to make more than $1{,}150$ predictions has economic incentives to run an extraction attack.

2 Case Study 2: Amazon Web Services

We now consider models with feature extraction enabled. We assume that $\mathcal{A}$ knows the input space $\mathcal{M}$ , but not the training data distribution. For one-hot-encoding, knowledge of $\mathcal{M}$ suffices to apply the same encoding locally. For quantile binning however, applying ex locally requires knowledge of the training data quantiles. To reverse-engineer the binning transformation, we use line-searches similar to those we used for decision trees: For each numeric feature, we search the feature’s range in input space for thresholds (up to a granularity $\epsilon$ ) where $f$ ’s output changes. This indicates our value landed in an adjacent bin, with a different learned regression coefficient. Note that learning the bin boundaries may be interesting in its own right, as it leaks information about the training data distribution. Having found the bin boundaries, we can apply both one-hot-encoding and binning locally, and extract $f$ over its feature space. As we are restricted to queries over $\mathcal{M}$ , we cannot define an arbitrary system of equations over $\mathcal{X}$ . Building a well-determined and consistent system can be difficult, as the encoding ex generates sparse inputs over $\mathcal{X}$ . However, Amazon facilitates this process with the way it handles queries with missing features: if a feature is omitted from a query, all corresponding features in $\mathcal{X}$ are set to . For a linear model for instance, we can trivially re-construct the model by issuing queries with a single feature specified, such as to obtain equations with a single unknown in $\mathcal{X}$ .

We trained models for the Circles, Iris and Adult data sets, with Amazon’s default feature-extraction settings. Table 7 shows the results of our attacks, for the reverse-engineering of ex and extraction of $f$ . For binary models (Circles and Adult), we use $d+1$ queries to solve a linear equation-system over $\mathcal{X}$ . For models with $c>2$ classes, we use $c\cdot(d+1)$ queries. In all cases, the extracted model matches $f$ on 100% of tested inputs. To optimize the query complexity, the queries we use to find quantile bins are re-used for equation-solving. As line searches require adaptive queries, we do not use batch predictions. However, even for the Digits model, we resorted to using real-time predictions, because of the service’s significant overhead in evaluating batches. For attacks that require a large number of non-adaptive queries, we expect batch predictions to be faster than real-time predictions.

3 Discussion

In some ML services we considered, users may enable further feature extractors. A common transformation is feature scaling or normalization. If $\mathcal{A}$ has access to training data statistics (as provided by BigML for instance), applying the transformation locally is trivial. More generally, for models with a linear input layer (i.e., logistic regressions, linear SVMs, MLPs) the scaling or normalization can be seen as being applied to the learned weights, rather than the input features. We can thus view the composition $f\circ\textsf{ex}$ as a model $f^{\prime}$ that operates over the ‘un-scaled’ input space $\mathcal{M}$ and extract $f^{\prime}$ directly using equation-solving.

Further extractors include text analysis (e.g., bag-of-words or n-gram models) and Cartesian products (grouping many features into one). We have not analyzed these in this work, but we believe that they could also be easily reverse-engineered, especially given some training data statistics and the ability to make incomplete queries.

For our online attacks, we obtained information about the model class of $f$ , the enabled feature extraction ex, and other hyper-parameters, directly from the ML service or its documentation. More generally, if $\mathcal{A}$ does not have full certainty about certain model characteristics, it may be able to narrow down a guess to a small range. Model hyper-parameters for instance (such as the free parameter of an RBF kernel) are typically chosen through cross-validation over a default range of values.

Our attacks on Amazon’s service followed this approach: We first formulated guesses for model characteristics left unspecified by the documentation (e.g., we found no mention of one-hot-encoding, or of how missing inputs are handled). We then evaluated our assumptions with successive extraction attempts. Our results indicate that Amazon uses softmax regression and does not create binary predictors for missing values. Interestingly, BigML takes the ’opposite’ approach (i.e., BigML uses OvR regression and adds predictors for missing values).

Extraction Given Class Labels Only

The successful attacks given in Sections 4 and 5 show the danger of revealing confidence values. While current ML services have been designed to reveal rich information, our attacks may suggest that returning only labels would be safer. Here we explore model extraction in a setting with no confidence scores. We will discuss further countermeasures in Section 7. We primarily focus on settings where $\mathcal{A}$ can make direct queries to an API, i.e., queries for arbitrary inputs $\mathbf{x}\in\mathcal{X}$ . We briefly discuss indirect queries in the context of linear classifiers.

This attack only works for linear binary models. We describe a straightforward extension to some non-linear models, such as polynomial kernel SVMs. Extracting a polynomial kernel SVM can be reduced to extracting a linear SVM in the transformed feature space. Indeed, for any kernel $K_{\text{poly}}(\mathbf{x},\mathbf{x}^{\prime})<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mo>=</mo></mrow><annotation encoding="application/x-tex">=</annotation></semantics></math>=(\mathbf{x}^{T}\cdot\mathbf{x}^{\prime}+1)^{d}$ , we can derive a projection function $\phi(\cdot)$ , so that $K_{\text{poly}}(\mathbf{x},\mathbf{x}^{\prime})<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mo>=</mo></mrow><annotation encoding="application/x-tex">=</annotation></semantics></math>=\phi(\mathbf{x})^{T}\cdot\phi(\mathbf{x}^{\prime})$ . This transforms the kernel SVM into a linear one, since the decision boundary now becomes $\mathbf{w}^{F}\cdot\phi(\mathbf{x})+\beta=0$ where $\mathbf{w}^{F}=\sum_{i=1}^{t}\alpha_{i}\phi(\mathbf{x}_{i})$ . We can use the Lowd-Meek attack to extract $\mathbf{w}^{F}$ and $\beta$ as long as $\phi(\mathbf{x})$ and its inverse are feasible to compute; this is unfortunately not the case for the more common RBF kernels.We did explore using approximations of $\phi$ , but found that the adaptive re-training techniques discussed in this section perform better.

In addition to evaluating the Lowd-Meek attack against ML APIs, we introduce a number of other approaches based on the broad strategy of re-training a model locally, given input-output examples. Informally, our hope is that by extracting a model that achieves low training error over the queried samples, we would effectively approximate the target model’s decision boundaries. We consider three retraining strategies, described below. We apply these to the model classes that we previously extracted using equation-solving attacks, as well as to SVMs.We do not expect retraining attacks to work well for decision trees, because of the greedy approach taken by learning algorithms. We have not evaluated extraction of trees, given class labels only, in this work.

Retraining with uniform queries. This baseline strategy simply consists in sampling $m$ points $\mathbf{x}_{i}\in\mathcal{X}$ uniformly at random, querying the oracle, and training a model $\hat{f}$ on these samples.

Line-search retraining. This strategy can be seen as a model-agnostic generalization of the Lowd-Meek attack. It issues $m$ adaptive queries to the oracle using line search techniques, to find samples close to the decision boundaries of $f$ . A model $\hat{f}$ is then trained on the $m$ queried samples.

Adaptive retraining. This strategy applies techniques from active learning . For some number $r$ of rounds and a query budget $m$ , it first queries the oracle on $\frac{m}{r}$ uniform points, and trains a model $\hat{f}$ . Over a total of $r$ rounds, it then selects $\frac{m}{r}$ new points, along the decision boundary of $\hat{f}$ (intuitively, these are points $\hat{f}$ is least certain about), and sends those to the oracle before retraining $\hat{f}$ .

1 Linear Binary Models

The retraining strategies that search for points near the decision boundary clearly perform better than simple uniform retraining. The adaptive strategy is the most efficient of our three strategies. For relatively low budgets, it even outperforms the Lowd-Meek attack. However, for budgets large enough to run line searches in each dimension, the Lowd-Meek attack is clearly the most efficient.

For the models we trained, about $2{,}050$ queries on average, and $5{,}650$ at most, are needed to run the Lowd-Meek attack effectively. This is $50\times$ more queries than what we needed for equation-solving attacks. With $827$ queries on average, adaptive retraining yields a model $\hat{f}$ that matches $f$ on over $99\%$ of tested inputs. Thus, even if an ML API only provides class labels, efficient extraction attacks on linear models remain possible.

We further consider a setting where feature-extraction (specifically one-hot-encoding of categorical features) is applied by the ML service, rather than by the user. $\mathcal{A}$ is then limited to indirect queries in input space. Lowd and Meek note that their extraction attack does not work in this setting, as $\mathcal{A}$ can not run line searches directly over $\mathcal{X}$ . In contrast, for the linear models we trained, we observed no major difference in extraction accuracy for the adaptive-retraining strategy, when limited to queries over $\mathcal{M}$ . We leave an in-depth study of model extraction with indirect queries, and class labels only, for future work.

2 Multiclass LR Models

The Lowd-Meek attack is not applicable in multiclass ( $c>2$ ) settings, even when the decision boundary is a combination of linear boundaries (as in multiclass regression) . We thus focus on evaluating the three retraining attacks we introduced, for the type of ML models we expect to find in real-world applications.

We focus on softmax models here, as softmax and one-vs-rest models have identical output behaviors when only class labels are provided: in both cases, the class label for an input $\mathbf{x}$ is given by $\operatorname*{argmax}_{i}(\mathbf{w}_{i}\cdot\mathbf{x}+\beta_{i})$ . From an extractor’s perspective, it is thus irrelevant whether the target was trained using a softmax or OvR approach.

For all models, $100\cdot c\cdot(d+1)$ queries resulted in extraction accuracy above $99.9\%$ . This represents $26{,}000$ queries on average, and $65{,}000$ at the most (Digits data set). Our equation-solving attacks achieved similar or better results with 100 $\times$ less queries. Yet, for scenarios with high monetary incentives (e.g., intrusion detector evasion), extraction attacks on MLR models may be attractive, even if APIs only provide class labels.

3 Neural Networks

We now turn to attacks on more complex deep neural networks. We expect these to be harder to retrain than multiclass regressions, as deep networks have more parameters and non-linear decision-boundaries. Therefore, we may need to find a large number of points close to a decision boundary in order to extract it accurately.

4 RBF Kernel SVMs

Another class of nonlinear models that we consider are support-vector machines (SVMs) with radial-basis function (RBF) kernels. A kernel SVM first maps inputs into a higher-dimensional space, and then finds the hyperplane that maximally separates the two classes. As mentioned in Section 6, SVMs with polynomial kernels can be extracted using the Lowd-Meek attack in the transformed feature space. For RBF kernels, this is not possible because the transformed space has infinite dimension.

SVMs do not provide class probability estimates. Our only applicable attack is thus retraining. As for linear models, we vary the query budget as $\alpha\cdot(d+1)$ , where $d$ is the input dimension. We further use the extract-and-test approach from Section 5 to find the value of the RBF kernel’s hyper-parameter. Results of our attacks are in Figure 7. Again, we see that adaptive retraining performs best, even though the decision boundary to extract is non-linear (in input space) here. Kernel SVMs models are overall harder to retrain than models with linear decision boundaries. Yet, for our largest budgets ( $2{,}050$ queries on average), we do extract models with over 99% accuracy, which may suffice in certain adversarial settings.

Extraction Countermeasures

We have shown in Sections 4 and 5 that adversarial clients can effectively extract ML models given access to rich prediction APIs. Given that this undermines the financial models targeted by some ML cloud services, and potentially leaks confidential training data, we believe researchers should seek countermeasures.

In Section 6, we analyzed the most obvious defense against our attacks: prediction API minimization. The constraint here is that the resulting API must still be useful in (honest) applications. For example, it is simple to change APIs to not return confidences and not respond to incomplete queries, assuming applications can get by without it. This will prevent many of our attacks, most notably the ones described in Section 4 as well as the feature discovery techniques used in our Amazon case study (Section 5). Yet, we showed that even if we strip an API to only provide class labels, successful attacks remain possible (Section 6), albeit at a much higher query cost.

We discuss further potential countermeasures below.

Applications might need confidences, but only at lower granularity. A possible defense is to round confidence scores to some fixed precision . We note that ML APIs already work with some finite precision when answering queries. For instance, BigML reports confidences with $5$ decimal places, and Amazon provides values with $16$ significant digits.

To understand the effects of limiting precision further, we re-evaluate equation-solving and decision tree path-finding attacks with confidence scores rounded to a fixed decimal place. For equation-solving attacks, rounding the class probabilities means that the solution to the obtained equation-system might not be the target $f$ , but some truncated version of it. For decision trees, rounding confidence scores increases the chance of node id collisions, and thus decreases our attacks’ success rate.

For regression trees, rounding has no effect on our attacks. Indeed, for the models we considered, the output itself is unique in each leaf (we could also round outputs, but the impact on utility may be more critical). For classification trees, we re-evaluated our top-down attack, with confidence scores rounded to fewer than $5$ decimal places. The attacks on the ‘IRS Tax Patterns’ and ‘Email Importance’ models are the most resilient, and suffer no success degradation before scores are rounded to $2$ decimal places. For the other models, rounding confidences to $3$ or $4$ decimal places severely undermines our attack.

Differential privacy (DP) and its variants have been explored as mechanisms for protecting, in particular, the privacy of ML training data . DP learning has been applied to regressions , SVMs , decision trees and neural networks . As some of our extraction attacks leak training data information (Section 4.1.3), one may ask whether DP can prevent extraction, or at least reduce the severity of the privacy violations that extraction enables.

Consider naïve application of DP to protect individual training data elements. This should, in theory, decrease the ability of an adversary $\mathcal{A}$ to learn information about training set elements, when given access to prediction queries. One would not expect, however, that this prevents model extraction, as DP is not defined to do so: consider a trivially useless learning algorithm for binary logistic regression, that discards the training data and sets $\mathbf{w}$ and $\beta$ to . This algorithm is differentially private, yet $\mathbf{w}$ and $\beta$ can easily be recovered using equation-solving.

A more appropriate strategy would be to apply DP directly to the model parameters, which would amount to saying that a query should not allow $\mathcal{A}$ to distinguish between closely neighboring model parameters. How exactly this would work and what privacy budgets would be required is left as an open question by our work.

Ensemble methods such as random forests return as prediction an aggregation of predictions by a number of individual models. While we have not experimented with ensemble methods as targets, we suspect that they may be more resilient to extraction attacks, in the sense that attackers will only be able to obtain relatively coarse approximations of the target function. Nevertheless, ensemble methods may still be vulnerable to other attacks such as model evasion .

Related Work

Our work is related to the extensive literature on learning theory, such as PAC learning and its variants . Indeed, extraction can be viewed as a type of learning, in which an unknown instance of a known hypothesis class (model type) is providing labels (without error). This is often called learning with membership queries . Our setting differs from these in two ways. The first is conceptual: in PAC learning one builds algorithms to learn a concept — the terminology belies the motivation of formalizing learning from data. In model extraction, an attacker is literally given a function oracle that it seeks to illicitly determine. The second difference is more pragmatic: prediction APIs reveal richer information than assumed in prior learning theory work, and we exploit that.

Algorithms for learning with membership queries have been proposed for Boolean functions and various binary classifiers . The latter line of work, initiated by Lowd and Meek , studies strategies for model evasion, in the context of spam or fraud detectors . Intuitively, model extraction seems harder than evasion, and this is corroborated by results from theory and practice .

Evasion attacks fall into the larger field of adversarial machine learning, that studies machine learning in general adversarial settings . In that context, a number of authors have considered strategies and defenses for poisoning attacks, that consist in injecting maliciously crafted samples into a model’s train or test data, so as to decrease the learned model’s accuracy .

In a non-adversarial setting, improper model extraction techniques have been applied for interpreting and compressing complex neural networks.

Conclusion

We demonstrated how the flexible prediction APIs exposed by current ML-as-a-service providers enable new model extraction attacks that could subvert model monetization, violate training-data privacy, and facilitate model evasion. Through local experiments and online attacks on two major providers, BigML and Amazon, we illustrated the efficiency and broad applicability of attacks that exploit common API features, such as the availability of confidence scores or the ability to query arbitrary partial inputs. We presented a generic equation-solving attack for models with a logistic output layer and a novel path-finding algorithm for decision trees.

We further explored potential countermeasures to these attacks, the most obvious being a restriction on the information provided by ML APIs. Building upon prior work from learning-theory, we showed how an attacker that only obtains class labels for adaptively chosen inputs, may launch less effective, yet potentially harmful, retraining attacks. Evaluating these attacks, as well as more refined countermeasures, on production-grade ML services is an interesting avenue for future work.

Acknowledgments. We thank Martín Abadi and the anonymous reviewers for their comments. This work was supported by NSF grants 1330599, 1330308, and 1546033, as well as a generous gift from Microsoft.

References

Appendix A Some Details on Models

Logistic regression. SVMs do not directly generalize to multiclass settings $c>2$ , nor do they output class probabilities. Logistic regression (LR) is a popular classifier that does. A binary LR model is defined as $f_{1}(\mathbf{x})=\sigma(\mathbf{w}\cdot\mathbf{x}+\beta)=1/(1+e^{-(\mathbf{w}\cdot\mathbf{x}+\beta)})$ and $f_{0}(\mathbf{x})=1-f_{1}(\mathbf{x})$ . A class label is chosen as 1 iff $f_{1}(\mathbf{x})>0.5$ .

These are log-linear models, and may not be suitable for data that is not linearly separable in $\mathcal{X}$ . Again, one may use kernel techniques to deal with more complex data relationships (c.f., ). Then, one replaces $\mathbf{w}_{i}\cdot\mathbf{x}+\beta_{i}$ with $\sum_{r=1}^{t}\alpha_{i,r}K(\mathbf{x},\mathbf{x}_{r})+\beta_{i}$ . As written, this uses the entire set of training data points $\mathbf{x}_{1},\ldots,\mathbf{x}_{t}$ as so-called representors (here analogous to support vectors). Unlike with SVMs, where most training data set points will never end up as support vectors, here all training set points are potentially representors. In practice one uses a size $s<t$ random subset of training data .

On an input $\mathbf{x}=(x_{1},x_{2},\dots,x_{d})$ , a tree $T$ defines a computation as follows, starting at the root. When we reach a node $v$ , labeled by $\left\{i,\rho\right\}$ , we proceed to the child of $v$ indexed by $\rho(x_{i})$ . We consider three types of splitting functions $\rho$ that are typically used in practice ():

The feature $x_{i}$ is continuous with $\mathcal{X}_{i}=[a,b]$ . Let $a<t<b$ be a threshold. Then $k_{v}=2$ and $\rho(x_{i})=0$ if $x_{i}\leq t$ and $\rho(x_{i})=1$ if $x_{i}>t$ . This is a binary split on a continuous feature with threshold $t$ .

When we reach a leaf, we terminate and output that leaf’s value. This value can be a class label, or a class label and confidence score. This defines a function $f:\mathcal{X}\rightarrow\mathcal{Y}$ .

Appendix B Details on Data Sets

Here we give some more information about the data sets we used in this work. Refer back to Table 3 and Table 5.

Synthetic data sets. We used 4 synthetic data sets from scikit . The first two data sets are classic examples of non-linearly separable data, consisting of two concentric Circles, or two interleaving Moons. The next two synthetic data sets, Blobs and 5-Class, consist of Gaussian clusters of points assigned to either $3$ or $5$ classes.

Public data sets. We gathered a varied set of data sets representative of the type of data we would expect ML service users to use to train logistic and SVM based models. These include famous data sets used for supervised learning, obtained from the UCI ML repository (Adult, Iris, Breast Cancer, Mushrooms, Diabetes). We also consider the Steak and GSS data sets used in prior work on model inversion . Finally, we add a data set of digits available in scikit, to visually illustrate training data leakage in kernelized logistic models (c.f. Section 4.1.3).

Public data sets and models from BigML. For experiments on decision trees, we chose a varied set of models publicly available on BigML’s platform. These models were trained by real MLaaS users and they cover a wide range of application scenarios, thus providing a realistic benchmark for the evaluation of our extraction attacks.

The IRS model predicts a US state, based on administrative tax records. The Steak and GSS models respectively predict a person’s preferred steak preparation and happiness level, from survey and demographic data. These two models were also considered in . The Email Importance model predicts whether Gmail classifies an email as ‘important’ or not, given message metadata. The Email Spam model classifies emails as spam, given the presence of certain words in its content. The German Credit data set was taken from the UCI library and classifies a user’s loan risk. Finally, two regression models respectively predict Medical Charges in the US based on state demographics, and the Bitcoin Market Price from daily opening and closing values.

Appendix C Analysis of the Path-Finding Algorithm

In this section, we analyze the correctness and complexity of the decision tree extraction algorithm in Algorithm 1. We assume that all leaves are assigned a unique id by the oracle $\mathcal{O}$ , and that no continuous feature is split into intervals of width smaller than $\epsilon$ . We may use id to refer directly to the leaf with identity id.

Correctness. Termination of the algorithm follows immediately from the fact that new queries are only added to $Q$ when a new leaf is visited. As the number of leaves in the tree is bounded, the algorithm must terminate.

We prove by contradiction that all leaves are eventually visited. Let the depth of a node $v$ , denote the length of the path from $v$ to the root (the root has depth ). For two leaves $\texttt{id},\texttt{id}^{\prime}$ , let $A$ be their deepest common ancestor ( $A$ is the deepest node appearing on both the paths of id and $\texttt{id}^{\prime}$ ). We denote the depth of $A$ as $\Delta(\texttt{id},\texttt{id}^{\prime})$ .

Suppose Algorithm 1 terminates without visiting all leaves, and let $(\texttt{id},\texttt{id}^{\prime})$ be a pair of leaves with maximal $\Delta(\texttt{id},\texttt{id}^{\prime})$ , such that id was visited but $\texttt{id}^{\prime}$ was not. Let $x_{i}$ be the feature that their deepest common ancestor $A$ splits on. When id is visited, the algorithm calls LINE_SEARCH or CATEGORY_SPLIT on feature $x_{i}$ . As all leaf ids are unique and there are no intervals smaller than $\epsilon$ , we will discover a leaf in each sub-tree rooted at $A$ , including the one that contains $\texttt{id}^{\prime}$ . Thus, we visit a leaf $\texttt{id}^{\prime\prime}$ for which $\Delta(\texttt{id}^{\prime\prime},\texttt{id}^{\prime})>\Delta(\texttt{id},\texttt{id}^{\prime})$ , a contradiction.

Complexity. Let $m$ denote the number of leaves in the tree. Each leaf is visited exactly once, and for each leaf we check all $d$ features. Suppose continuous features have range $[0,b]$ , and categorical features have arity $k$ . For continuous features, finding one threshold takes at most $\log_{2}(\frac{b}{\epsilon})$ queries. As the total number of splits on one feature is at most $m$ (i.e., all nodes split on the same feature), finding all thresholds uses at most $m\cdot\log_{2}(\frac{b}{\epsilon})$ queries. Testing a categorical feature uses $k$ queries. The total query complexity is $O(m\cdot(d_{cat}\cdot k+d_{cont}\cdot m\cdot\log(\frac{b}{\epsilon}))$ , where $d_{cat}$ and $d_{cont}$ represent respectively the number of categorical and continuous features.

For the special case of boolean trees, the complexity is $O(m\cdot d)$ . In comparison, the algorithm of , that uses membership queries only, has a complexity polynomial in $d$ and $2^{\delta}$ , where $\delta$ is the tree depth. For degenerate trees, $2^{\delta}$ can be exponential in $m$ , implying that the assumption of unique leaf identities (obtained from confidence scores for instance) provides an exponential speed-up over the best-known approach with class labels only. The algorithm from can be extended to regression trees, with a complexity polynomial in the size of the output range $\mathcal{Y}$ . Again, under the assumption of unique leaf identities (which could be obtained solely from the output values) we obtain a much more efficient algorithm, with a complexity independent of the output range.

The Top-Down Approach. The correctness and complexity of the top-down algorithm from Section 4.2 (which uses incomplete queries), follow from a similar analysis. The main difference is that we assume that all nodes have a unique id, rather than only the leaves.

Appendix D A Note on Improper Extraction

However, this strategy intuitively does not appear to be optimal. Even if we know that we can find a multilayer perceptron $\hat{f}$ that closely matches $f$ , $\hat{f}$ might have a far more complex representation (more parameters) than $f$ . Thus, tailoring the extraction to the ‘simpler’ model class of the target $f$ appears more efficient. In learning theory, the problem of finding a succinct representation of some target model $f$ is known as Occam Learning .

Our experiments indicate that such generic improper extraction indeed appears sub-optimal, in the context of equation-solving attacks. We train a softmax regression over the Adult data set with target “Race”. The model $f$ is defined by $530$ real-valued parameters. As shown in Section 4.1.2, using only $530$ queries, we extract a model $\hat{f}$ from the same model class, that closely matches $f$ ( $\hat{f}$ and $f$ predict the same labels on 100% of tested inputs, and produce class probabilities that differ by less than $10^{-7}$ in TV distance). We also extracted the same model, assuming a multilayer perceptron target class. Even with $1{,}000$ hidden nodes (this model has $111{,}005$ parameters), and 10 $\times$ more queries ( $5{,}300$ ), the extracted model $\hat{f}$ is a weaker approximation of $f$ ( $99.5\%$ accuracy for class labels and TV distance of $10^{-2}$ for class probabilities).