Explainable Prediction of Medical Codes from Clinical Text

James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, Jacob Eisenstein

Introduction

Clinical notes are free text narratives generated by clinicians during patient encounters. They are typically accompanied by a set of metadata codes from the International Classification of Diseases (ICD), which present a standardized way of indicating diagnoses and procedures that were performed during the encounter. ICD codes have a variety of uses, ranging from billing to predictive modeling of patient state Choi et al. (2016); Ranganath et al. (2015); Denny et al. (2010); Avati et al. (2017). Because manual coding is time-consuming and error-prone, automatic coding has been studied since at least the 1990s de Lima et al. (1998). The task is difficult for two main reasons. First, the label space is very high-dimensional, with over 15,000 codes in the ICD-9 taxonomy, and over 140,000 codes combined in the newer ICD-10-CM and ICD-10-PCS taxonomies World Health Organization (2016). Second, clinical text includes irrelevant information, misspellings and non-standard abbreviations, and a large medical vocabulary. These features combine to make the prediction of ICD codes from clinical notes an especially difficult task, for computers and human coders alike Birman-Deych et al. (2005).

In this application paper, we develop convolutional neural network (CNN)-based methods for automatic ICD code assignment based on text discharge summaries from intensive care unit (ICU) stays. To better adapt to the multi-label setting, we employ a per-label attention mechanism, which allows our model to learn distinct document representations for each label. We call our method Convolutional Attention for Multi-Label classification (CAML). Our model design is motivated by the conjecture that important information correlated with a code’s presence may be contained in short snippets of text which could be anywhere in the document, and that these snippets likely differ for different labels. To cope with the large label space, we exploit the textual descriptions of each code to guide our model towards appropriate parameters: in the absence of many labeled examples for a given code, its parameters should be similar to those of codes with similar textual descriptions.

We evaluate our approach on two versions of MIMIC Johnson et al. (2016), an open dataset of ICU medical records. Each record includes a variety of narrative notes describing a patient’s stay, including diagnoses and procedures. Our approach substantially outperforms previous results on medical code prediction on both MIMIC-II and MIMIC-III datasets.

We consider applications of this work in a decision support setting. Interpretability is important for any decision support system, especially in the medical domain. The system should be able to explain why it predicted each code; even if the codes are manually annotated, it is desirable to explain what parts of the text are most relevant to each code. These considerations further motivate our per-label attention mechanism, which assigns importance values to $n$-grams in the input document, and which can therefore provide explanations for each code, in the form of extracted snippets of text from the input document. We perform a human evaluation of the quality of the explanations provided by the attention mechanism, asking a physician to rate the informativeness of a set of automatically generated explanations. Our code, data splits, and pre-trained models are available at github.com/jamesmullenbach/caml-mimic.

Method

Attention

where $\text{SoftMax}(\boldsymbol{x})=\frac{\exp(\boldsymbol{x})}{\sum_{i}\exp(x_{i})}$ , and $\exp(\boldsymbol{x})$ is the element-wise exponentiation of the vector $\boldsymbol{x}$ . The attention vector $\boldsymbol{\alpha}$ is then used to compute vector representations for each label,

As a baseline model, we instead use max-pooling to compute a single vector $\boldsymbol{v}$ for all labels,
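The contrast between per-label attention and the max-pooling baseline can be illustrated with a minimal sketch. This is not the released implementation. It assumes the convolution output is a list `H` of position vectors and that each label has its own query vector `u`, scored against each position by a dot product; the names `H`, `u`, `per_label_attention`, and `max_pool` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def per_label_attention(H, u):
    """H: list of N position vectors (length d); u: label query vector (length d).
    Returns the attention weights alpha and the label-specific vector v."""
    # Assumption: positions are scored by a dot product with the label vector.
    scores = [sum(h_j * u_j for h_j, u_j in zip(h, u)) for h in H]
    alpha = softmax(scores)
    d = len(u)
    # Attention-weighted sum of the position vectors.
    v = [sum(alpha[n] * H[n][j] for n in range(len(H))) for j in range(d)]
    return alpha, v

def max_pool(H):
    """Baseline: one vector shared by all labels, max over positions per dimension."""
    d = len(H[0])
    return [max(h[j] for h in H) for j in range(d)]

# Toy input: three positions, two feature dimensions.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
u_label = [2.0, 0.0]  # hypothetical query vector for one label
alpha, v_label = per_label_attention(H, u_label)
v_shared = max_pool(H)
```

With attention, a different `u` produces a different document vector for each label, whereas `max_pool` returns the same vector regardless of the label.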

Classification

Training

The training procedure minimizes the binary cross-entropy loss,

plus the L2 norm of the model weights, using the Adam optimizer Kingma and Ba (2015).
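As a rough sketch of this objective, the following computes a binary cross-entropy summed over labels plus a weight penalty. Summing (rather than averaging) over labels, using the squared L2 norm, and the coefficient name `rho` are assumptions made for illustration; the optimizer step is omitted.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy summed over labels.
    y_true: 0/1 indicators; y_prob: predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def l2_penalty(weights, rho):
    """Assumed form of the weight penalty: rho times the squared L2 norm."""
    return rho * sum(w * w for w in weights)

# Toy example: two labels, uninformative predictions, two weights.
loss = binary_cross_entropy([1, 0], [0.5, 0.5]) + l2_penalty([3.0, 4.0], rho=0.1)
```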

Embedding label descriptions

where $\lambda$ is a tradeoff hyperparameter that calibrates the performance of the two objectives. We call this model variant Description Regularized-CAML (DR-CAML).
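The role of $\lambda$ can be sketched as follows. The text here states only that $\lambda$ trades off two objectives, so the particular regularizer below (the mean Euclidean distance between each label's parameter vector and an embedding of its description) is an assumption for illustration, as are the names `description_penalty` and `dr_objective`.

```python
import math

def description_penalty(label_params, desc_embeddings):
    """Assumed regularizer: mean Euclidean distance between each label's
    parameter vector and the embedding of that label's text description."""
    dists = [
        math.sqrt(sum((b - z) ** 2 for b, z in zip(beta, z_vec)))
        for beta, z_vec in zip(label_params, desc_embeddings)
    ]
    return sum(dists) / len(dists)

def dr_objective(bce_loss, label_params, desc_embeddings, lam):
    """Prediction loss plus lambda times the description regularizer."""
    return bce_loss + lam * description_penalty(label_params, desc_embeddings)

# Toy example with two labels and two-dimensional vectors.
params = [[0.0, 0.0], [1.0, 1.0]]
descs = [[3.0, 4.0], [1.0, 1.0]]
total = dr_objective(1.0, params, descs, lam=0.01)
```

A larger `lam` pulls label parameters more strongly toward their description embeddings, which matters most for labels with few training examples.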

Evaluation of code prediction

This section evaluates the accuracy of code prediction, comparing our models against several competitive baselines.

MIMIC-III Johnson et al. (2016) is an open-access dataset of text and structured records from a hospital ICU. Following previous work, we focus on discharge summaries, which condense information about a stay into a single document. In MIMIC-III, some admissions have addenda to their summary, which we concatenate to form one document.

Each admission is tagged by human coders with a set of ICD-9 codes, describing both diagnoses and procedures which occurred during the patient’s stay. There are 8,921 unique ICD-9 codes present in our datasets, including 6,918 diagnosis codes and 2,003 procedure codes. Some patients have multiple admissions and therefore multiple discharge summaries; we split the data by patient ID, so that no patient appears in both the training and test sets.
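Splitting by patient rather than by admission can be sketched as below. The record layout, the `test_fraction` value, and the seeded shuffle are hypothetical choices for illustration and do not reproduce the published splits.

```python
import random

def split_by_patient(records, test_fraction=0.1, seed=0):
    """records: list of (patient_id, summary) pairs.
    All summaries of a patient land on the same side of the split."""
    patients = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_fraction))
    test_ids = set(patients[:n_test])
    train = [r for r in records if r[0] not in test_ids]
    test = [r for r in records if r[0] in test_ids]
    return train, test

# Patient 1 has two admissions; both must stay together.
records = [(1, "summary a"), (1, "summary b"), (2, "summary c"), (3, "summary d")]
train, test = split_by_patient(records)
```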

In this full-label setting, we use a set of 47,724 discharge summaries from 36,998 patients for training, with 1,632 summaries and 3,372 summaries for validation and testing, respectively.

For comparison with prior work, we also follow Shi et al. (2017) and train and evaluate on a label set consisting of the 50 most frequent labels. In this setting, we filter each dataset down to the instances that have at least one of the top 50 most frequent codes, and subset the training data to equal the size of the training set of Shi et al. (2017), resulting in 8,067 summaries for training, 1,574 for validation, and 1,730 for testing.

We also run experiments with the MIMIC-II dataset, to compare with prior work by Baumel et al. (2018) and Perotte et al. (2013). We use the train/test split of Perotte et al. (2013), which consists of 20,533 training examples and 2,282 testing examples. Detailed statistics for the three settings are summarized in Table 2.

Preprocessing

We remove tokens that contain no alphabetic characters (e.g., removing “500” but keeping “250mg”), lowercase all tokens, and replace tokens that appear in fewer than three training documents with an ‘UNK’ token. We pretrain word embeddings of size $d_{e}=100$ using the word2vec CBOW method Mikolov et al. (2013) on the preprocessed text from all discharge summaries. All documents are truncated to a maximum length of 2500 tokens.
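The token-level steps above can be sketched as follows. Whitespace tokenization is an assumption (the tokenizer is not specified here), the function names are hypothetical, and the word2vec pretraining step is omitted.

```python
from collections import Counter

def clean_tokens(text):
    """Lowercase and drop tokens with no alphabetic character
    (e.g., drops "500" but keeps "250mg")."""
    return [t.lower() for t in text.split() if any(c.isalpha() for c in t)]

def build_vocab(train_docs, min_doc_freq=3):
    """Keep tokens that appear in at least `min_doc_freq` training documents."""
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return {t for t, c in doc_freq.items() if c >= min_doc_freq}

def encode(tokens, vocab, max_len=2500, unk="UNK"):
    """Replace out-of-vocabulary tokens with UNK, then truncate."""
    return [t if t in vocab else unk for t in tokens][:max_len]

raw = ["Patient given 250mg aspirin 500", "patient stable", "PATIENT discharged 12"]
train_docs = [clean_tokens(t) for t in raw]
vocab = build_vocab(train_docs)
encoded = encode(train_docs[0], vocab)
```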

Systems

We compare against the following baselines:

a single-layer one-dimensional convolutional neural network Kim (2014);

a bag-of-words logistic regression model;

a bidirectional gated recurrent unit (Bi-GRU). Our pilot experiments found that GRU was stronger than long short-term memory (LSTM) for this task.

For the CNN and Bi-GRU, we initialize the embedding weights using the same pretrained word2vec vectors that we use for the CAML models. All neural models are implemented using PyTorch (https://github.com/pytorch/pytorch). The logistic regression model consists of $|\mathcal{L}|$ binary one-vs-rest classifiers acting on unigram bag-of-words features for all labels present in the training data. If a label is not present in the training data, the model will never predict it in the held-out data.

We tune the hyperparameters of the CAML model and the neural baselines using the Spearmint Bayesian optimization package (Snoek et al., 2012; Swersky et al., 2013; https://github.com/HIPS/Spearmint). We allow Spearmint to sample parameter values for the L2 penalty on the model weights $\rho$ and learning rate $\eta$, as well as filter size $k$, number of filters $d_{c}$, and dropout probability $q$ for the convolutional models, and number of hidden layers $s$ of dimension $v$ for the Bi-GRU, using precision@8 on the MIMIC-III full-label validation set as the performance measure. We use these parameters for DR-CAML as well, and port the optimized parameters to the MIMIC-II full-label and MIMIC-III 50-label models, and manually fine-tune the learning rate in these settings. We select $\lambda$ for DR-CAML based on pilot experiments on the validation sets. Hyperparameter tuning is summarized in Table 3. Convolutional models are trained with dropout after the embedding layer. We use a fixed batch size of 16 for all models and datasets. Models are trained with early stopping on the validation set; training terminates after the precision@8 does not improve for 10 epochs, and the model at the time of the highest precision@8 is used on the test set.
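The early-stopping rule can be sketched as a function over a sequence of per-epoch validation scores. The function name and the toy scores are hypothetical; in practice the scores would be precision@8 values computed after each training epoch.

```python
def early_stopping_epoch(val_scores, patience=10):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation scores.
    Training stops once the score has not improved for `patience` epochs;
    the checkpoint from `best_epoch` would be used on the test set."""
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores) - 1

# Validation score peaks at epoch 1 and never improves afterwards.
scores = [0.1, 0.5] + [0.4] * 12
best, stopped = early_stopping_epoch(scores)
```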

Evaluation Metrics

To facilitate comparison with both future and prior work, we report a variety of metrics, focusing on the micro-averaged and macro-averaged F1 and area under the ROC curve (AUC). Micro-averaged values are calculated by treating each (text, code) pair as a separate prediction. Macro-averaged values, while less frequently reported in the multi-label classification literature, are calculated by averaging metrics computed per-label. For recall, the metrics are distinguished as follows:

$$\text{Micro-R}=\frac{\sum_{\ell=1}^{|\mathcal{L}|}\text{TP}_{\ell}}{\sum_{\ell=1}^{|\mathcal{L}|}\left(\text{TP}_{\ell}+\text{FN}_{\ell}\right)},\qquad\text{Macro-R}=\frac{1}{|\mathcal{L}|}\sum_{\ell=1}^{|\mathcal{L}|}\frac{\text{TP}_{\ell}}{\text{TP}_{\ell}+\text{FN}_{\ell}}$$

where TP denotes true positive examples and FN denotes false negative examples. Precision is computed analogously. The macro-averaged metrics place much more emphasis on rare label prediction.
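A small worked example shows why macro-averaging emphasizes rare labels. The per-label counts are invented for illustration, and scoring a label with no positive examples as recall 0 is an assumption of this sketch.

```python
def micro_macro_recall(tp, fn):
    """tp, fn: dicts mapping label -> true positive / false negative counts."""
    labels = list(tp)
    # Micro: pool counts over all labels before dividing.
    micro = sum(tp[l] for l in labels) / sum(tp[l] + fn[l] for l in labels)
    # Macro: compute recall per label, then average with equal weight.
    per_label = [
        tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] > 0 else 0.0
        for l in labels
    ]
    macro = sum(per_label) / len(labels)
    return micro, macro

# One frequent label predicted well, one rare label predicted poorly.
tp = {"common": 90, "rare": 1}
fn = {"common": 10, "rare": 9}
micro, macro = micro_macro_recall(tp, fn)
```

Here the micro-averaged recall is 91/110 (about 0.83) because the frequent label dominates the pooled counts, while the macro-averaged recall is (0.9 + 0.1)/2 = 0.5.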

We also report precision at $n$ (denoted as ‘P@n’), which is the fraction of the $n$ highest-scored labels that are present in the ground truth. This is motivated by the potential use case as a decision support application, in which a user is presented with a fixed number of predicted codes to review. In such a case, it is more suitable to select a model with high precision than high recall. We choose $n=5$ and $n=8$ to compare with prior work Vani et al. (2017); Prakash et al. (2017). For the MIMIC-III full label setting, we also compute precision@15, which roughly corresponds to the average number of codes in MIMIC-III discharge summaries (Table 2).
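Precision at $n$ for a single document can be sketched as follows; the label names and scores are placeholders rather than real ICD-9 predictions.

```python
def precision_at_n(scores, gold, n):
    """scores: dict mapping label -> model score; gold: set of true labels.
    Returns the fraction of the n highest-scored labels that are in gold."""
    top = sorted(scores, key=scores.get, reverse=True)[:n]
    return sum(1 for label in top if label in gold) / n

scores = {"code_a": 0.9, "code_b": 0.8, "code_c": 0.3, "code_d": 0.1}
gold = {"code_a", "code_d"}
p_at_2 = precision_at_n(scores, gold, n=2)
```

The corpus-level figure would then be the average of this value over all test documents.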

Results

Our main quantitative evaluation involves predicting the full set of ICD-9 codes based on the text of the MIMIC-III discharge summaries. These results are shown in Table 4. The CAML model gives the strongest results on all metrics. Attention yields substantial improvements over the “vanilla” convolutional neural network (CNN). The recurrent Bi-GRU architecture is comparable to the vanilla CNN, and the logistic regression baseline is substantially worse than all neural architectures. The best-performing CNN model has 9.86M tunable parameters, compared with 6.14M tunable parameters for CAML. This is due to the hyperparameter search preferring a larger number of filters for the CNN. Finally, we observe that the DR-CAML performs worse on most metrics than CAML, with a tuned regularization coefficient of $\lambda=0.01$ .

Among prior work, only Scheurwegs et al. (2017) evaluate on the full ICD-9 code set for MIMIC-III. Their reported results distinguished between diagnosis codes and procedure codes. The CAML models are stronger on both sets. Additionally, our method does not make use of any external information or structured data, while Scheurwegs et al. use structured data and various medical ontologies in their text representation.

We feel that precision@8 is the most informative of the metrics, as it measures the ability of the system to return a small high-confidence subset of codes. Even with a space of thousands of labels, our models achieve relatively high precision: of the eight most confident predictions, on average 5.5 are correct. It is also apparent how difficult it is to achieve high Macro-F1 scores, due to the metric’s emphasis on rare-label performance. To put these results in context, a hypothetical system that performs perfectly on the 500 most common labels, and ignores all others, would achieve a Macro-F1 of 0.052 and a Micro-F1 of 0.842.

To compare with prior published work, we also evaluate on the 50 most common codes in MIMIC-III (Table 5), and on MIMIC-II (Table 6). We report DR-CAML results on the 50-label setting of MIMIC-III with $\lambda=10$ , and on MIMIC-II with $\lambda=0.1$ , which were determined by grid search on a validation set. The other hyperparameters were left at the settings for the main MIMIC-III evaluation, as described in Table 3. In the 50-label setting of MIMIC-III, we see strong improvement over prior work in all reported metrics, as well as against the baselines, with the exception of precision@5, on which the CNN baseline performs best. We hypothesize that this is because the relatively large value of $k=10$ for CAML leads to a larger network that is more suited to larger datasets; tuning CAML’s hyperparameters on this dataset would be expected to improve performance on all metrics. Baumel et al. (2018) additionally report a micro-F1 score of 0.407 by training on MIMIC-III, and evaluating on MIMIC-II. Our model achieves better performance using only the (smaller) MIMIC-II training set, leaving this alternative training protocol for future work.

Evaluation of Interpretability

We now evaluate the explanations generated by CAML’s attention mechanism, in comparison with three alternative heuristics. A physician was presented with explanations from four methods, using a random sample of 100 predicted codes from the MIMIC-III full-label test set. The most important $k$-gram from each method was extracted, along with a window of five words on either side for context. We select $k=4$ in this setting to emulate a span of attention over words likely to be given by a human reader. Examples can be found in Table 1. Observe that the snippets may overlap in multiple words. Given each code and its description, we prompted the evaluator to select all text snippets that he felt adequately explained the presence of the code, with the option to mark snippets as “highly informative” should he find them particularly informative relative to the others.
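Extracting a snippet from per-position importance scores can be sketched as below. The sketch assumes one weight per $k$-gram start position; the function name and the toy weights are hypothetical.

```python
def explanation_snippet(tokens, weights, k=4, context=5):
    """tokens: document tokens; weights: one importance score per k-gram
    start position. Returns the snippet tokens and the (start, end) span
    of the highest-weighted k-gram."""
    # max() returns the first position in case of ties.
    start = max(range(len(weights)), key=weights.__getitem__)
    lo = max(0, start - context)
    hi = min(len(tokens), start + k + context)
    return tokens[lo:hi], (start, start + k)

tokens = ["w%d" % i for i in range(20)]
weights = [0.01] * 17  # 20 - 4 + 1 possible 4-gram start positions
weights[8] = 0.9
snippet, span = explanation_snippet(tokens, weights)
```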

Max-pooling CNN

We select the $k$-grams that provide the maximum value chosen by max-pooling at least once, and weight them by the final-layer weights. Defining an argmax vector $\boldsymbol{a}$ which results from the max-pooling step as

Logistic regression

Code descriptions

Finally, we calculate a word similarity metric between each stemmed $k$-gram and the stemmed ICD-9 code description. We compute the idf-weighted cosine similarity, with idf weights calculated on the corpus consisting of all notes and relevant code descriptions. We then select the argmax over $k$-grams in the document, breaking ties by selecting the first occurrence. We remove those note-label pairs for which no $k$-gram has a score greater than 0, which gives an “unfair” advantage to this baseline.
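This baseline can be sketched as follows. Stemming is omitted, and the idf formula $\log(N/\text{df})$ is an assumption, since the exact weighting is not given here; all function names are hypothetical.

```python
import math
from collections import Counter

def idf_weights(corpus):
    """corpus: list of token lists. Assumed idf(t) = log(N / df(t))."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    return {t: math.log(n_docs / c) for t, c in doc_freq.items()}

def weighted_cosine(a_tokens, b_tokens, idf):
    """Cosine similarity between idf-weighted bag-of-words vectors."""
    va = {t: c * idf.get(t, 0.0) for t, c in Counter(a_tokens).items()}
    vb = {t: c * idf.get(t, 0.0) for t, c in Counter(b_tokens).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def best_kgram(doc_tokens, desc_tokens, idf, k=4):
    """Return (start, score) of the most similar k-gram, or (None, 0.0)
    when no k-gram scores above 0 (such pairs are excluded)."""
    best_start, best_score = None, 0.0
    for i in range(len(doc_tokens) - k + 1):
        s = weighted_cosine(doc_tokens[i:i + k], desc_tokens, idf)
        if s > best_score:  # strict ">" keeps the first occurrence on ties
            best_start, best_score = i, s
    return best_start, best_score

note = "patient has acute renal failure today".split()
desc = "acute renal failure".split()
idf = idf_weights([note, desc, "patient seen today".split()])
start, score = best_kgram(note, desc, idf)
```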

Results

The results of the interpretability evaluation are presented in Table 7. Our model selects the greatest number of “highly informative” explanations, and selects more “informative” explanations than both the CNN baseline and the logistic regression model. While the cosine similarity metric also performs well, the examples in Table 1 demonstrate the strengths of CAML in extracting text snippets in line with more intuitive explanations for the presence of a code. As noted above, there exist some cases, which we exclude, where the cosine similarity method is unable to provide any explanation, because no $k$-grams in a note have a non-zero similarity for a given label description. This occurs for about 12% of all note-label pairs in the test set.

Related Work

CNNs have been successfully applied to tasks such as sentiment classification Kim (2014) and language modeling Dauphin et al. (2017). Our work combines convolution with attention Bahdanau et al. (2015); Yang et al. (2016) to select the most relevant parts of the discharge summary. Other recent work has combined convolution and attention (e.g., Allamanis et al., 2016; Yin et al., 2016; dos Santos et al., 2016; Yin and Schütze, 2017). Our attention mechanism is most similar to those of Yang et al. (2016) and Allamanis et al. (2016), in that we use context vectors to compute attention over specific locations in the text. Our work differs in that we compute separate attention weights for each label in our label space, which is better tuned to our goal of selecting locations in a document which are most important for predicting specific labels.

Automatic ICD coding

ICD coding is a long-standing task in the medical informatics community, which has been approached with machine learning and handcrafted methods Scheurwegs et al. (2015). Many recent approaches, like ours, use unstructured text data as the only source of information (e.g., Kavuluru et al., 2015; Subotin and Davis, 2014), though some incorporate structured data as well (e.g., Scheurwegs et al., 2017; Wang et al., 2016). Most previous methods have either evaluated only on a strict subset of the full ICD label space Wang et al. (2016), relied on datasets that focus on a subset of medical scenarios Zhang et al. (2017), or evaluated on data that are not publicly available, making direct comparison difficult Subotin and Davis (2016). A recent shared task for ICD-10 coding focused on coding of death certificates in English and French Névéol et al. (2017). This dataset also contains shorter documents than those we consider, with an average of 18 tokens per certificate in the French corpus. We use the open-access MIMIC datasets containing de-identified, general-purpose records of intensive care unit stays at a single hospital.

Perotte et al. (2013) use “flat” and “hierarchical” SVMs; the former treats each code as an individual prediction, while the latter trains on child codes only if the parent code is present, and predicts on child codes only if the parent code was positively predicted. Scheurwegs et al. (2017) use a feature selection approach to ICD-9 and ICD-10 classification, incorporating structured and unstructured text information from EHRs. They evaluate over various medical specialties and on the MIMIC-III dataset. We compare directly to their results on the full label set of MIMIC-III.

Other recent approaches have employed neural network architectures. Baumel et al. (2018) apply recurrent networks with hierarchical sentence and word attention (the HA-GRU) to classify ICD-9 diagnosis codes while providing insights into the model decision process. Similarly, Shi et al. (2017) apply character-aware LSTMs to generate sentence representations from specific subsections of discharge summaries, and apply attention to form a soft matching between the representations and the top 50 codes. Prakash et al. (2017) use memory networks that draw from discharge summaries as well as Wikipedia, to predict top-50 and top-100 codes. Another recent neural architecture is the Grounded Recurrent Neural Network Vani et al. (2017), which employs a modified GRU with dimensions dedicated to predicting the presence of individual labels. We compare directly with published results from all of these papers, except Vani et al. (2017), who evaluate on only a 5,000-code subset of ICD-9. Empirically, the CAML architecture proposed in this paper yields stronger results across all experimental conditions. We attribute these improvements to the attention mechanism, which focuses on the most critical features for each code, rather than applying a uniform pooling operation for all codes. We also observed that convolution-based models are at least as effective as, and significantly more computationally efficient than, recurrent neural networks such as the Bi-GRU.

Explainable text classification

A goal of this work is that the code predictions be explainable from features of the text. Prior work has also emphasized explainability. Lei et al. (2016) model “rationales” through a latent variable, which tags each word as relevant to the document label. Li et al. (2016) compute the salience of individual words by the derivative of the label score with respect to the word embedding. Ribeiro et al. (2016) use submodular optimization to select a subset of features that closely approximate a specific classification decision (this work is also notable for extensive human evaluations). In comparison to these approaches, we employ a relatively simple attentional architecture; this simplicity is motivated by the challenge of scaling to multi-label classification with thousands of possible labels. Other prior work has emphasized the use of attention for highlighting salient features of the text (e.g., Rush et al., 2015; Rocktäschel et al., 2016), although these papers did not perform human evaluations of the interpretability of the features selected by the attention mechanism.

Conclusions and Future Work

We present CAML, a convolutional neural network for multi-label document classification, which employs an attention mechanism to adaptively pool the convolution output for each label, learning to identify highly-predictive locations for each label. CAML yields strong improvements over previously reported results on several formulations of the ICD-9 code prediction task, while providing satisfactory explanations for its predictions. Although we focus on a clinical setting, CAML is extensible without modification to other multi-label document tagging tasks, including ICD-10 coding. We see a number of directions for future work. From the linguistic side, we plan to integrate the document structure of discharge summaries in MIMIC-III, and to better handle non-standard writing and other sources of out-of-vocabulary tokens. From the application perspective, we plan to build models that leverage the hierarchy of ICD codes Choi et al. (2016), and to attempt the more difficult task of predicting diagnosis and treatment codes for future visits from discharge summaries.

Helpful feedback was provided by the anonymous reviewers, and by the members of the Georgia Tech Computational Linguistics lab. The project was partially supported by project HDTRA1-15-1-0019 from the Defense Threat Reduction Agency, by the National Science Foundation under awards IIS-1418511 and CCF-1533768, by the National Institutes of Health under awards 1R01MD011682-01 and R56HL138415, by Children’s Healthcare of Atlanta, and by UCB.