Information-Theoretic Probing for Linguistic Structure

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, Ryan Cotterell

Introduction

Neural networks are the backbone of modern state-of-the-art natural language processing (NLP) systems. One inherent by-product of training a neural network is the production of real-valued representations. Many speculate that these representations encode a continuous analogue of discrete linguistic properties, e.g., part-of-speech tags, due to the networks' impressive performance on many NLP tasks (Belinkov et al., 2017). As a result of this speculation, one common thread of research focuses on the construction of probes, i.e., supervised models that are trained to extract the linguistic properties directly (Belinkov et al., 2017; Conneau et al., 2018; Peters et al., 2018b; Zhang and Bowman, 2018; Naik et al., 2018; Tenney et al., 2019). A syntactic probe, then, is a model for extracting syntactic properties, such as part of speech, from the representations (Hewitt and Liang, 2019).

In this work, we question what the goal of probing for linguistic properties ought to be. Informally, probing is often described as an attempt to discern how much information representations encode about a specific linguistic property. We make this statement more formal: We assert that the natural operationalization of probing is estimating the mutual information (Cover and Thomas, 2012) between a representation-valued random variable and a linguistic property–valued random variable. This operationalization gives probing a clean, information-theoretic foundation, and allows us to consider what ``probing'' actually means.

Our analysis also provides insight into how to choose a probe family: We show that choosing the highest-performing probe, independent of its complexity, is optimal for achieving the best estimate of mutual information (MI). This contradicts the received wisdom that one should always select simple probes over more complex ones (Alain and Bengio, 2017; Liu et al., 2019; Hewitt and Manning, 2019). In this context, we also discuss the recent work of Hewitt and Liang (2019), who propose selectivity as a criterion for choosing families of probes. Hewitt and Liang (2019) define selectivity as the performance difference between a probe on the target task and a control task, writing ``[t]he selectivity of a probe puts linguistic task accuracy in context with the probe's capacity to memorize from word types.'' They further ponder: ``when a probe achieves high accuracy on a linguistic task using a representation, can we conclude that the representation encodes linguistic structure, or has the probe just learned the task?'' Information-theoretically, there is no difference between learning the task and probing for linguistic structure, as we will show; thus, it follows that one should always employ the best possible probe for the task without resorting to artificial constraints.

In the experimental portion of the paper, we empirically analyze word-level part-of-speech labeling, a common syntactic probing task (Hewitt and Liang, 2019; Sahin et al., 2019), within our MI operationalization. Working on a typologically diverse set of languages (Basque, Czech, English, Finnish, Indonesian, Korean, Marathi, Tamil, Telugu, Turkish and Urdu), we show that only in five of these eleven languages do we recover higher estimates of mutual information between part-of-speech tags and BERT (Devlin et al., 2019), a common contextualized embedder, than from a control. These modest improvements suggest that most of the information needed to tag parts of speech well is encoded at the lexical level, and does not require sentential context. Put more simply, words are not very ambiguous with respect to part of speech, a result known to practitioners of NLP (Garrette et al., 2013). We interpret this to mean that part-of-speech labeling is not a very informative probing task. We further investigate how BERT fares in dependency labeling, as analysed by Tenney et al. (2019). In this task, estimates based on BERT return more information than a type-level embedding in all analysed languages. However, our MI estimates still only show that BERT contains at most 12% more information than the control.

We also remark that operationalizing probing information-theoretically gives us a simple, but stunning result: contextual word embeddings, e.g., BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018a), contain the same amount of information about the linguistic property of interest as the original sentence. This follows from the data-processing inequality under a very mild assumption. What this suggests is that, in a certain sense, probing for linguistic properties in representations may not be a well-grounded enterprise at all. It also highlights the need to more formally define ease of extraction.

Word-Level Syntactic Probes for Contextual Embeddings

Following Hewitt and Liang (2019), we consider probes that examine syntactic knowledge in contextualized embeddings. These probes only consider a token's embedding in isolation, and try to perform the task using only that information. Specifically, in this work, we consider part-of-speech (POS) and dependency labeling: determining a word's part of speech in a given sentence and the dependency relation for a pair of tokens joined by a dependency arc. Say we wish to determine whether the word love is a noun or a verb. This task requires the sentential context for success. As an example, consider the utterance ``love is blind'' where, only with the context, is it clear that love is a noun. Thus, to do well on this task, the contextualized embeddings need to encode enough about the surrounding context to correctly guess the POS. Analogously, we need the whole sentence to know that love is the nominal subject, whereas in the sentence ``greed can blind love'', love is the direct object.

Since contextual embeddings are a deterministic function of a sentence $\mathbf{s}$, the augmented distribution in eq. 1 has no more randomness than the original: its entropy is the same. We assume the values of the random variables defined above are distributed according to this (unknown) $p$. While we do not have access to $p$, we assume the data in our corpus were drawn according to it. Note that $W$, the random variable over possible word types, is distributed according to

where we define the deterministic distribution

2 Probing as Mutual Information

The task of supervised probing is an attempt to ascertain how much information a specific representation $\mathbf{r}$ conveys about the value of $t$. This is naturally operationalized as the mutual information, a quantity from information theory:

where we define the entropy, which is constant with respect to the representations, as

where the point-wise conditional entropy inside the sum is defined as

Again, we will not know any of the distributions required to compute these quantities; the distributions in the formulae are marginals and conditionals of the true distribution discussed in eq. 1.
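For reference, the quantities named in this section have standard textbook forms, sketched below. The symbols $T$, $R$ and $p$ follow the text; the label set $\mathcal{T}$ and the use of an expectation over $\mathbf{r}$ are notational assumptions made here, so these displays may not match the paper's own equations or numbering exactly.

```latex
\begin{align*}
\mathrm{I}(T;R) &= \mathrm{H}(T) - \mathrm{H}(T\mid R), \\
\mathrm{H}(T) &= -\sum_{t\in\mathcal{T}} p(t)\,\log p(t), \\
\mathrm{H}(T\mid R) &= \mathbb{E}_{\mathbf{r}\sim p}\!\left[\mathrm{H}(T\mid R=\mathbf{r})\right], \\
\mathrm{H}(T\mid R=\mathbf{r}) &= -\sum_{t\in\mathcal{T}} p(t\mid\mathbf{r})\,\log p(t\mid\mathbf{r}).
\end{align*}
```

Only $\mathrm{H}(T\mid R)$ depends on the representations, which is why the entropy term is described as constant with respect to them.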

3 Bounding Mutual Information

This bound gets tighter the more similar (in the sense of the KL divergence) $q_{\boldsymbol{\theta}}(\cdot\mid\mathbf{r})$ is to the true distribution $p(\cdot\mid\mathbf{r})$.

If we accept mutual information as a natural operationalization for how much representations encode a target linguistic task (§2.2), the best estimate of that mutual information is the one where the probe $q_{\boldsymbol{\theta}}(t\mid\mathbf{r})$ is best at the target task. In other words, we want the best probe $q_{\boldsymbol{\theta}}(t\mid\mathbf{r})$ such that we get the tightest bound to the actual distribution $p(t\mid\mathbf{r})$. This paints the question posed in Hewitt and Liang (2019), who write

``when a probe achieves high accuracy on a linguistic task using a representation, can we conclude that the representation encodes linguistic structure, or has the probe just learned the task?''

as a false dichotomy (assuming that the authors intended ‘or’ here as strictly non-inclusive; see Levinson (2000, 91) and Chevallier et al. (2008, 1743) on conversational implicatures from ‘or’). From an information-theoretic view, we will always prefer the probe that does better at the target task, since there is no difference between learning a task and the representations encoding the linguistic structure.
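The argument for preferring the best-performing probe can be illustrated numerically. The sketch below assumes the standard variational bound in which a probe's cross-entropy upper-bounds the true conditional entropy, so that the label entropy minus the cross-entropy lower-bounds the mutual information. The function names, the toy labels and the probe outputs are hypothetical, and the label entropy is itself only a plug-in estimate from the sample.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Plug-in estimate of H(T), in nats, from observed gold labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def probe_cross_entropy(labels, probe_probs):
    """Mean negative log-probability the probe assigns to each gold label.

    probe_probs[i] is a dict mapping label -> q(label | r_i); the gold
    label must receive non-zero probability.
    """
    total = sum(math.log(q[t]) for t, q in zip(labels, probe_probs))
    return -total / len(labels)


def mi_lower_bound(labels, probe_probs):
    """H(T) minus the probe cross-entropy: a lower-bound estimate of I(T;R)."""
    return label_entropy(labels) - probe_cross_entropy(labels, probe_probs)


# Hypothetical toy data: four tokens and two probes of different quality.
labels = ["NOUN", "VERB", "NOUN", "VERB"]
weak_probe = [{"NOUN": 0.5, "VERB": 0.5}] * 4
strong_probe = [
    {"NOUN": 0.9, "VERB": 0.1},
    {"NOUN": 0.1, "VERB": 0.9},
    {"NOUN": 0.9, "VERB": 0.1},
    {"NOUN": 0.1, "VERB": 0.9},
]

weak_bound = mi_lower_bound(labels, weak_probe)
strong_bound = mi_lower_bound(labels, strong_probe)
```

Under these assumptions the more accurate probe yields the larger, and therefore tighter, lower bound, regardless of how complex the probe is.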

Control Functions

In other words, information can only be lost by processing data. A common adage associated with this inequality is ``garbage in, garbage out.''

We focus on type-level control functions in this paper. These functions have the effect of decontextualizing the embeddings, and are related to the common trend of analyzing probe results in comparison to input layer embeddings (Belinkov and Glass, 2017; Liu et al., 2019; Hewitt and Manning, 2019; Tenney et al., 2019). Such functions allow us to inquire how much the contextual aspect of the contextual embeddings helps the probe perform the target task. To show that we may map from contextual embeddings to the identity of the word type, we need the following assumption.

Assumption 1. Every contextualized embedding is unique, i.e., for any pair of sentences $\mathbf{s},\mathbf{s}^{\prime}\in\mathcal{V}^{*}$, we have $(\mathbf{s}\neq\mathbf{s}^{\prime})\mid\mid(i\neq j)\Rightarrow\textsc{bert}(\mathbf{s})_{i}\neq\textsc{bert}(\mathbf{s}^{\prime})_{j}$ for all $i\in\{1,\ldots,|\mathbf{s}|\}$ and $j\in\{1,\ldots,|\mathbf{s}^{\prime}|\}$.

This result is intuitive and, perhaps, trivial: context matters information-theoretically. (Note that although this result holds in theory, in practice the functions id and $\mathbf{e}(\cdot)$ might be arbitrarily hard to estimate; this is discussed at length in § 4.3.) However, it gives us a principled foundation by which to measure the effectiveness of probes as we will show in § 3.2.

2 How Much Information Did We Gain?

We will now quantify how much a contextualized word embedding knows about a task with respect to a specific control function $\mathbf{c}(\cdot)$. We term how much more information the contextualized embeddings have about a task than a control variable the gain, $\mathcal{G}$, which we define as

The gain function will be our method for measuring how much more information contextualized representations have over a controlled baseline, encoded as the function $\mathbf{c}$. We will empirically estimate this value in § 6. Interestingly enough, the gain has a straightforward interpretation.

Proposition 1. The gain function is equal to the following conditional mutual information

The jump from the first to the second equality follows since $R$ encodes, by construction, all the information about $T$ provided by $\mathbf{c}(R)$. ∎

Proposition 1 gives us a clear understanding of the quantity we wish to estimate: It is how much information about a task is encoded in the representations, given some control knowledge. If properly designed, this control transformation will remove information from the probed representations.
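A small discrete example can make this concrete. The sketch below reads the gain as $\mathrm{I}(T;R)-\mathrm{I}(T;\mathbf{c}(R))$ and compares it with the conditional mutual information $\mathrm{I}(T;R\mid\mathbf{c}(R))$ on an invented toy distribution in which each "embedding" is simply a (word type, context id) pair and the control keeps only the word type. All names and probabilities are hypothetical.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(X;Y) in nats for a joint distribution given as {(x, y): prob}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


def push_forward(joint, f):
    """Joint distribution of (T, f(R)) given the joint of (T, R)."""
    out = defaultdict(float)
    for (t, r), p in joint.items():
        out[(t, f(r))] += p
    return dict(out)


def conditional_mi(joint, f):
    """I(T;R | f(R)): average over values c of f(R) of I(T;R | f(R)=c)."""
    groups = defaultdict(dict)
    for (t, r), p in joint.items():
        groups[f(r)][(t, r)] = p
    total = 0.0
    for sub in groups.values():
        pc = sum(sub.values())
        total += pc * mutual_information({k: v / pc for k, v in sub.items()})
    return total


def type_level_control(r):
    """Hypothetical control: discard the context id, keep the word type."""
    return r[0]


# Hypothetical toy joint p(t, r); "love" is ambiguous, the other types are not.
joint = {
    ("NOUN", ("love", 0)): 0.25,
    ("VERB", ("love", 1)): 0.25,
    ("NOUN", ("dog", 0)): 0.30,
    ("ADJ", ("blind", 0)): 0.20,
}

gain = mutual_information(joint) - mutual_information(
    push_forward(joint, type_level_control)
)
cond_mi = conditional_mi(joint, type_level_control)
```

In this toy case the two quantities coincide, and all of the gain comes from the one ambiguous word type, which is the sense in which a type-level control isolates the contribution of context.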

3 Approximating the Gain

The gain, as defined in eq. 13, is intractable to compute. In this section we derive a pair of variational bounds on $\mathcal{G}(T,R,\mathbf{e})$, one upper and one lower. To approximate the gain, we will simultaneously minimize an upper bound and maximize a lower bound on eq. 13. We begin by approximating the gain in the following manner

these cross-entropies can be empirically estimated. We will assume access to a corpus $\{(t_{i},\mathbf{r}_{i})\}_{i=1}^{N}$ that is human-annotated for the target linguistic property; we further assume that these are samples $(t_{i},\mathbf{r}_{i})\sim p(\cdot,\cdot)$ from the true distribution. This yields a second approximation that is tractable:

This approximation is exact in the limit $N\rightarrow\infty$ by the law of large numbers.

We note the approximation given in eq. 15 may be either positive or negative and its estimation error follows from eq. 9:

where we abuse the KL notation to simplify the equation. This is an undesired behavior since we know the gain itself is non-negative by the data-processing inequality, but we have yet to devise a remedy.
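The sign ambiguity can be traced with one standard identity: a cross-entropy equals the true conditional entropy plus an expected KL divergence. The sketch below assumes the estimate is a difference of two probe cross-entropies, as the surrounding text suggests. The names $q_{R}$ (probe on $\mathbf{r}$) and $q_{\mathbf{c}}$ (probe on $\mathbf{c}(\mathbf{r})$) are labels introduced here for illustration, and the paper's own displays and equation numbers may differ.

```latex
\begin{align*}
\mathrm{H}_{q_{R}}(T\mid R)
  &= \mathrm{H}(T\mid R)
   + \mathbb{E}_{\mathbf{r}\sim p}\!\left[\mathrm{KL}\!\left(p(\cdot\mid\mathbf{r})\,\|\,q_{R}(\cdot\mid\mathbf{r})\right)\right], \\
\mathrm{H}_{q_{\mathbf{c}}}(T\mid\mathbf{c}(R))
  &= \mathrm{H}(T\mid\mathbf{c}(R))
   + \mathbb{E}_{\mathbf{r}\sim p}\!\left[\mathrm{KL}\!\left(p(\cdot\mid\mathbf{c}(\mathbf{r}))\,\|\,q_{\mathbf{c}}(\cdot\mid\mathbf{c}(\mathbf{r}))\right)\right], \\
\underbrace{\mathrm{H}_{q_{\mathbf{c}}}(T\mid\mathbf{c}(R)) - \mathrm{H}_{q_{R}}(T\mid R)}_{\text{estimated gain}}
 \;-\; \underbrace{\left(\mathrm{H}(T\mid\mathbf{c}(R)) - \mathrm{H}(T\mid R)\right)}_{\text{true gain}}
  &= \mathbb{E}\!\left[\mathrm{KL}_{\mathbf{c}}\right] - \mathbb{E}\!\left[\mathrm{KL}_{R}\right].
\end{align*}
```

Because each expected KL term is non-negative, the error lies between $-\mathbb{E}[\mathrm{KL}_{R}]$ and $+\mathbb{E}[\mathrm{KL}_{\mathbf{c}}]$. The estimate can therefore fall below zero when the probe on the full representation is fit worse than the probe on the control.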

We justify the approximation in eq. 15 with a pair of variational bounds. The following two corollaries are a result of Theorem 2 in App. A.

Corollary 2. We have the following upper-bound on the gain

Corollary 3. We have the following lower-bound on the gain

The conjunction of Corollary 2 and Corollary 3 suggests a simple procedure for finding a good approximation: We choose $q_{\boldsymbol{\theta}1}(\cdot\mid r)$ and $q_{\boldsymbol{\theta}2}(\cdot\mid r)$ so as to minimize eq. 18 and maximize eq. 19, respectively. These distributions contain no overlapping parameters, by construction, so these two optimization routines may be performed independently. We will optimize both with a gradient-based procedure, discussed in § 6.
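Once the two probes have been trained separately, the empirical estimate reduces to a difference of held-out cross-entropies. The following is a minimal sketch of that final step only, assuming each probe's output is available as a per-token dictionary of label probabilities; the function names and the toy numbers are hypothetical, and training the probes is out of scope here.

```python
import math


def cross_entropy(gold, probs):
    """Mean negative log-probability assigned to the gold labels (nats)."""
    return -sum(math.log(q[t]) for t, q in zip(gold, probs)) / len(gold)


def estimated_gain(gold, probs_repr, probs_control):
    """Cross-entropy of the control probe minus that of the representation probe."""
    return cross_entropy(gold, probs_control) - cross_entropy(gold, probs_repr)


# Hypothetical held-out predictions for two tokens.
gold = ["NOUN", "VERB"]
contextual = [{"NOUN": 0.8, "VERB": 0.2}, {"NOUN": 0.2, "VERB": 0.8}]
type_level = [{"NOUN": 0.5, "VERB": 0.5}, {"NOUN": 0.5, "VERB": 0.5}]

positive_gain = estimated_gain(gold, contextual, type_level)
# Swapping the roles shows that the estimate is not guaranteed to be non-negative.
negative_gain = estimated_gain(gold, type_level, contextual)
```

The second call illustrates the caveat raised earlier: nothing in the estimator itself prevents a negative value when the probe on the richer input happens to be the worse fit.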

Understanding Probing Information-Theoretically

In § 3, we developed an information-theoretic framework for thinking about probing contextual word embeddings for linguistic structure. However, we now cast doubt on whether probing makes sense as a scientific endeavour. We prove in § 4.1 that contextualized word embeddings, by construction, contain no more information about a word-level syntactic task than the original sentence itself. Nevertheless, we do find a meaningful scientific interpretation of control functions. We expound upon this in § 4.2, arguing that control functions are useful, not for understanding representations, but rather for understanding the influence of sentential context on word-level syntactic tasks, e.g., labeling words with their part of speech.

To start, we note the following corollary

It directly follows from Assumption 1 that bert is a bijection between sentences $\mathbf{s}$ and sequences of embeddings $\langle\mathbf{r}_{1},\ldots,\mathbf{r}_{|\mathbf{s}|}\rangle$. As bert is a bijection, it has an inverse, which we will denote as $\textsc{bert}^{-1}$.

Theorem 1. $\textsc{bert}(S)$ cannot provide more information about $T$ than the sentence $S$ itself.

While Theorem 1 is a straightforward application of the data-processing inequality, it has deeper ramifications for probing. It means that if we search for syntax in the contextualized word embeddings of a sentence, we should not expect to find any more syntax than is present in the original sentence. In a sense, Theorem 1 is a cynical statement: under our operationalization, the endeavour of finding syntax in the contextualized embeddings of sentences is nonsensical. This is because, under Assumption 1, we know the answer a priori: the contextualized word embeddings of a sentence contain exactly the same amount of information about syntax as does the sentence itself.

2 What Do Control Functions Mean?

3 Discussion: Ease of Extraction

We do acknowledge another interpretation of the work of Hewitt and Liang (2019), inter alia: BERT makes the syntactic information present in an ordered sequence of words more easily extractable. However, ease of extraction is not a trivial notion to operationalize, and indeed, we know of no attempt to do so (Xu et al. (2020) is a possible exception); it is certainly more complex to determine than the number of layers in a multi-layer perceptron (MLP). Indeed, an MLP with a single hidden layer can approximate any continuous function over the unit cube, with the caveat that we may need a very large number of hidden units (Cybenko, 1989).

Although for perfect probes the above results should hold, in practice $\texttt{id}(\cdot)$ and $\mathbf{c}(\cdot)$ may be hard to approximate. Furthermore, if these functions were to be learned, they might require an unreasonably large dataset. Learning a random embedding control function, for example, would require a dataset containing all words in the vocabulary $V$; in an open-vocabulary setting, an infinite dataset would be required! ``Better'' representations should make their respective probes easily learnable, and consequently their encoded information is more accessible (Voita and Titov, 2020).

We suggest that future work on probing should focus on operationalizing ease of extraction more rigorously—even though we do not attempt this ourselves. As previously argued by Saphra and Lopez (2019, §5), the advantage of simple probes is that they may reveal something about the structure of the encoded information—i.e., is it structured in such a way that it can be easily taken advantage of by downstream consumers of the contextualized embeddings? Many researchers who are interested in less complex probes have, either implicitly or explicitly, had this in mind.

A Critique of Control Tasks

We agree with Hewitt and Liang (2019)—and with both Zhang and Bowman (2018) and Tenney et al. (2019)—that we should have controlled baselines when probing for linguistic properties. However, we disagree with parts of their methodology for constructing control tasks. We present these disagreements here.

Hewitt and Liang (2019) introduce control tasks to evaluate the effectiveness of probes. We draw inspiration from this technique as evidenced by our introduction of control functions. However, we take issue with the suggestion that controls should have structure and randomness, to use the terminology from Hewitt and Liang (2019). They define structure as ``the output for a word token is a deterministic function of the word type.'' This means that they are stripping the language of ambiguity with respect to the target task. In the case of part-of-speech labeling, love would either be a noun or a verb in a control task, never both: this is a problem. The second feature of control tasks is randomness, i.e., ``the output for each word type is sampled independently at random.'' In conjunction, structure and randomness may yield a relatively trivial task that does not look like natural language.

What is more, there is a closed-form solution for an optimal, retrieval-based ``probe'' that has zero learned parameters: If a word type appears in the training set, return the label with which it was annotated there, otherwise return the most frequently occurring label across all words in the training set. This probe will achieve an accuracy of 1 minus the product of two quantities: the out-of-vocabulary rate (the number of tokens in the test set that correspond to novel types divided by the number of tokens) and the percentage of tags in the test set that do not correspond to the most frequent tag (the error rate of the guess-the-most-frequent-tag classifier). In short, the best model for a control task is a pure memorizer that guesses the most frequent tag for out-of-vocabulary words.
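The retrieval-based probe described above is short enough to write out. The sketch below is one possible rendering, assuming control-task data arrives as (word type, label) pairs; the function names and the toy data are hypothetical.

```python
from collections import Counter


def fit_memorizer(train_pairs):
    """Build the zero-parameter probe: a type -> label table plus a fallback.

    In a control task each word type has exactly one label, so storing the
    label seen for a type is enough. The fallback is the most frequent
    label over all training tokens.
    """
    lookup = {}
    for word, label in train_pairs:
        lookup[word] = label
    fallback = Counter(label for _, label in train_pairs).most_common(1)[0][0]
    return lookup, fallback


def predict(word, lookup, fallback):
    """Return the memorized label, or the fallback for unseen word types."""
    return lookup.get(word, fallback)


def accuracy(test_pairs, lookup, fallback):
    """Fraction of test tokens whose predicted label matches the gold label."""
    hits = sum(predict(w, lookup, fallback) == y for w, y in test_pairs)
    return hits / len(test_pairs)


# Hypothetical control-task data: labels are a fixed function of word type.
train = [("the", "A"), ("dog", "B"), ("the", "A"), ("runs", "B"), ("the", "A")]
test = [("the", "A"), ("cat", "A"), ("fish", "B"), ("dog", "B")]

lookup, fallback = fit_memorizer(train)
memorizer_accuracy = accuracy(test, lookup, fallback)
```

On this toy split, half of the test tokens are out of vocabulary and half of the test tags differ from the fallback label, so the observed accuracy agrees with the closed-form expression, 1 - 0.5 × 0.5.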

2 What's Wrong with Memorization?

Hewitt and Liang (2019) propose that probes should be optimized to maximize accuracy and selectivity. Recall that selectivity is given by the distance between the accuracy on the original task and the accuracy on the control task using the same architecture. Given their characterization of control tasks, maximising selectivity leads to the selection of a model that is bad at memorization. But why should we punish memorization? Much of linguistic competence is about generalization; however, memorization also plays a key role (Fodor et al., 1974; Nooteboom et al., 2002; Fromkin et al., 2018), with word learning (Carey, 1978) being an obvious example. Indeed, maximizing selectivity as a criterion for creating probes seems to artificially disfavor this property.

3 What Low-Selectivity Means

Hewitt and Liang (2019) acknowledge that for the more complex task of dependency edge prediction, an MLP probe is more accurate and, therefore, preferable despite its low selectivity. However, they offer two counter-examples where the less selective neural probe exhibits drawbacks when compared to its more selective, linear counterpart. We believe both examples are a symptom of using a simple probe rather than of selectivity being a useful metric for probe selection.

First, Hewitt and Liang (2019, §3.6) point out that, in their experiments, the MLP-1 model frequently mislabels words with the suffix -s as NNPS on the POS labeling task. They present this finding as a possible example of a less selective probe being less faithful in representing what linguistic information the model has learned. Our analysis leads us to believe that, on the contrary, this shows that one should be using the best possible probe to minimize the chance of misinterpreting its encoded information. Since more complex probes achieve higher accuracy on the task, as evidenced by the findings of Hewitt and Liang (2019), we believe that the overall trend of misinterpretation is higher for the probes with higher selectivity. The same applies to the second example in Hewitt and Liang (2019, §4.2), where a less selective probe appears to be less faithful. The paper shows that the representations on ELMo's second layer fail to outperform its word type ones (layer zero) on the POS labeling task when using the MLP-1 probe. While the paper argues this is evidence for selectivity being a useful metric in choosing appropriate probes, we argue that this demonstrates, yet again, that one needs to use a more complex probe to minimize the chances of misinterpreting what the model has learned. The fact that the linear probe shows a difference only demonstrates that the information is perhaps more accessible with ELMo, not that it is not present.

Experiments

Despite our discussion in § 4, we still wish to empirically vet our estimation technique for the gain and we use this section to highlight the need to formally define ease of extraction (as argued in § 4.3). We consider the tasks of POS and dependency labeling, using the universal POS tags (Petrov et al., 2012) and dependency label information from the Universal Dependencies 2.5 (Zeman et al., 2019). We probe the multilingual release of BERT (using the implementation of Wolf et al. (2019)) on eleven typologically diverse languages: Basque, Czech, English, Finnish, Indonesian, Korean, Marathi, Tamil, Telugu, Turkish and Urdu; and we compute the contextual representations of each sentence by feeding it into BERT and averaging the output word piece representations for each word, as tokenized in the treebank.
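The word-piece averaging step can be sketched independently of any particular BERT implementation. The example below assumes the word-piece vectors and an alignment from each piece to its treebank word are already available as plain Python lists; the function name, the alignment format and the toy vectors are assumptions made for illustration, not the interface of any specific library.

```python
def average_word_pieces(piece_vectors, word_ids):
    """Average word-piece vectors that belong to the same treebank word.

    piece_vectors: list of equal-length lists of floats, one per word piece.
    word_ids: word_ids[i] is the index of the word that piece i belongs to.
    Returns one averaged vector per word, ordered by word index.
    """
    sums, counts = {}, {}
    for vec, wid in zip(piece_vectors, word_ids):
        if wid not in sums:
            sums[wid] = [0.0] * len(vec)
            counts[wid] = 0
        sums[wid] = [a + b for a, b in zip(sums[wid], vec)]
        counts[wid] += 1
    return [[x / counts[w] for x in sums[w]] for w in sorted(sums)]


# Hypothetical example: the first word is split into two pieces.
pieces = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
alignment = [0, 0, 1]
word_vectors = average_word_pieces(pieces, alignment)
```

Each word then has a single vector of the same dimensionality as the word-piece outputs, which is what a word-level probe takes as input.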

We will consider two different control functions. Each is defined as the composition $\mathbf{c}=\mathbf{e}\circ\texttt{id}$ with a different look-up function:

$\mathbf{e}_{\textit{fastText}}$ returns a language-specific fastText embedding (Bojanowski et al., 2017);

$\mathbf{e}_{\textit{onehot}}$ returns a one-hot embedding. (We initialize random embeddings at the type level, and let them train during the model's optimization. We also experiment with fixed random embeddings; results for this control are in the Appendix.)

These functions can be considered type level, as they remove the influence of context on the word.

2 Probe Architecture

As expounded upon above, our purpose is to achieve the best bound on mutual information we can. To this end, we employ a deep MLP as our probe. We define the probe as

3 Results

We know bert can generate text in many languages. Here we assess how much it actually ``knows'' about syntax in those languages—or at least how much we can extract from it given as powerful probes as we can train. We further evaluate how much it knows above and beyond simple type-level baselines.

Dependency labels

As shown in Table 2, bert improves over type-level embeddings in all languages on this task. Nonetheless, although this is a much more context-dependent task, we see bert-based estimates reveal at most 12% more information than fastText in English, the highest-resource language in our set. If we look at the lower-resource languages, in five of them the gains are less than 5%.

Discussion

When put into perspective, multilingual bert's representations do not seem to encode much more information about syntax than a simple baseline. On POS labeling, bert only improves upon fastText in five of the eleven analysed languages, and by small amounts (less than 9%) when it does. Even at dependency labelling, a task considered to require more contextual knowledge, we could only decode from bert at most (in English) 12% additional information, which again highlights the need to formalize ease of extraction.

Conclusion

We propose an information-theoretic operationalization of probing that defines it as the task of estimating conditional mutual information. We introduce control functions, which put in context our mutual information estimates—how much more informative are contextual representations than some knowledge judged to be trivial? We further explored our operationalization and showed that, given perfect probes, probing can only yield insights into the language itself and cannot tell us anything about the representations under investigation. Keeping this in mind, we suggest a change of focus—instead of concentrating on probe size or information, we should pursue ease of extraction going forward.

Acknowledgements

The authors would like to thank Adam Poliak and John Hewitt for several helpful suggestions.

References

Appendix A Variational Bounds

Theorem 2. The estimation error between $\mathcal{G}_{q_{\boldsymbol{\theta}}}(T,R,\mathbf{e})$ and the true gain can be upper- and lower-bounded by two distinct Kullback–Leibler divergences.

We first find the error given by our estimate, which is a difference between two KL divergences—as shown in eq. 22 in Figure 1. Making use of this error, we trivially find an upper-bound on the estimation error as

which follows since KL divergences are never negative. Analogously, we find a lower-bound as

Appendix B Further Results

In this section, we present accuracies for the models trained using bert, fastText and one-hot embeddings, and the full results on random embeddings. These random embeddings are generated once before the task, at the type level, and kept fixed without training. Table 3 shows that both BERT and fastText present high accuracies at POS labeling in all languages, except Tamil and Marathi. One-hot and random results are considerably worse, as expected, since they could not do more than take random guesses (e.g. guessing the most frequent label in the training set) for any word which was not seen during training. Table 4 presents similar results for dependency labeling, although accuracies for this task are considerably lower.