On the Automatic Generation of Medical Imaging Reports

Baoyu Jing, Pengtao Xie, Eric Xing

Introduction

Medical images, such as radiology and pathology images, are widely used in hospitals for the diagnosis and treatment of many diseases, such as pneumonia and pneumothorax. The reading and interpretation of medical images are usually conducted by specialized medical professionals. For example, radiology images are read by radiologists. They write textual reports (Figure 1) to narrate the findings regarding each area of the body examined in the imaging study, specifically whether each area was found to be normal, abnormal or potentially abnormal.

For less-experienced radiologists and pathologists, especially those working in rural areas where the quality of healthcare is relatively low, writing medical-imaging reports is demanding. For instance, to correctly read a chest x-ray image, the following skills are needed Delrue et al. (2011): (1) thorough knowledge of the normal anatomy of the thorax and the basic physiology of chest diseases; (2) skill in analyzing the radiograph through a fixed pattern; (3) the ability to evaluate the evolution over time; (4) knowledge of clinical presentation and history; (5) knowledge of the correlation with other diagnostic results (laboratory results, electrocardiogram, and respiratory function tests).

For experienced radiologists and pathologists, writing imaging reports is tedious and time-consuming. In nations with a large population, such as China, a radiologist may need to read hundreds of radiology images per day. Typing the findings of each image into a computer takes about 5-10 minutes, which occupies most of their working time. In sum, for both inexperienced and experienced medical professionals, writing imaging reports is unpleasant.

This motivates us to investigate whether it is possible to automatically generate medical image reports. Several challenges need to be addressed. First, a complete diagnostic report is comprised of multiple heterogeneous forms of information. As shown in Figure 1, the report for a chest x-ray contains an impression, which is a sentence; findings, which form a paragraph; and tags, which are a list of keywords. Generating this heterogeneous information in a unified framework is technically demanding. We address this problem by building a multi-task framework, which treats the prediction of tags as a multi-label classification task and the generation of long descriptions as a text generation task.

Second, localizing image regions and attaching the right descriptions to them is challenging. We solve these problems by introducing a co-attention mechanism, which simultaneously attends to images and predicted tags and explores the synergistic effects of visual and semantic information.

Third, the descriptions in imaging reports are usually long, containing multiple sentences. Generating such long text is highly nontrivial. Rather than adopting a single-layer LSTM Hochreiter and Schmidhuber (1997), which is less capable of modeling long word sequences, we leverage the compositional nature of the report and adopt a hierarchical LSTM to produce long texts. Combined with the co-attention mechanism, the hierarchical LSTM first generates high-level topics, and then produces fine-grained descriptions according to the topics.

Overall, the main contributions of our work are:

We propose a multi-task learning framework which can simultaneously predict the tags and generate the text descriptions.

We introduce a co-attention mechanism for localizing sub-regions in the image and generating the corresponding descriptions.

We build a hierarchical LSTM to generate long paragraphs.

We perform extensive experiments to show the effectiveness of the proposed methods.

The rest of the paper is organized as follows. Section 2 reviews related works. Section 3 introduces the method. Section 4 presents the experimental results and Section 5 concludes the paper.

Related Works

There have been several works aiming at attaching “texts” to medical images. In their settings, the target “texts” are either fully-structured or semi-structured (e.g. tags, templates), rather than natural texts. Kisilev et al. (2015) build a pipeline to predict the attributes of medical images. Shin et al. (2016) adopt a CNN-RNN based framework to predict tags (e.g. locations, severities) of chest x-ray images. The work closest to ours is that of Zhang et al. (2017), which aims at generating semi-structured pathology reports whose contents are restricted to 5 predefined topics.

However, in the real world, different physicians usually have different writing habits, and different x-ray images show different abnormalities. Therefore, collecting semi-structured reports is less practical, and it is important to build models that learn from natural reports. To the best of our knowledge, our work is the first to generate truly natural reports of the kind written by physicians, which are usually long and cover diverse topics.

Image captioning aims at automatically generating text descriptions for given images. Most recent image captioning models are based on a CNN-RNN framework Vinyals et al. (2015); Fang et al. (2015); Karpathy and Fei-Fei (2015); Xu et al. (2015); You et al. (2016); Krause et al. (2017).

Recently, attention mechanisms have been shown to be useful for image captioning Xu et al. (2015); You et al. (2016). Xu et al. (2015) introduce a spatial-visual attention mechanism over image features extracted from intermediate layers of the CNN. You et al. (2016) propose a semantic attention mechanism over tags of given images. To better leverage both the visual features and semantic tags, we propose a co-attention mechanism for report generation.

Instead of generating only a one-sentence caption per image, Krause et al. (2017) and Liang et al. (2017) generate paragraph captions using a hierarchical LSTM. Our method also adopts a hierarchical LSTM for paragraph generation, but unlike Krause et al. (2017), we use a co-attention network to generate topics.

Methods

A complete diagnostic report for a medical image is comprised of both text descriptions (long paragraphs) and lists of tags, as shown in Figure 1. We propose a multi-task hierarchical model with co-attention for automatically predicting keywords and generating long paragraphs. Given an image that is divided into regions, we use a CNN to learn visual features for these regions. These visual features are then fed into a multi-label classification (MLC) network to predict the relevant tags. In the tag vocabulary, each tag is represented by a word-embedding vector. Given the predicted tags for a specific image, their word-embedding vectors serve as the semantic features of this image. The visual features and semantic features are then fed into a co-attention model to generate a context vector that simultaneously captures the visual and semantic information of this image. At this point, the encoding process is complete.

Next, starting from the context vector, the decoding process generates the text descriptions. The description of a medical image usually contains multiple sentences, and each sentence focuses on one specific topic. Our model leverages this compositional structure to generate reports in a hierarchical way: it first generates a sequence of high-level topic vectors representing sentences, then generates a sentence from each topic vector. Specifically, the context vector is fed into a sentence LSTM, which unrolls for a few steps and produces a topic vector at each step. A topic vector represents the semantics of a sentence to be generated. Given a topic vector, the word LSTM takes it as input and generates a sequence of words to form a sentence. The termination of the unrolling process is controlled by the sentence LSTM.

2 Tag Prediction

For simplicity, we extract visual features from the last convolutional layer of the VGG-19 model Simonyan and Zisserman (2014) and use the last two fully connected layers of VGG-19 for MLC.

3 Co-Attention

Previous works have shown that visual attention alone can perform fairly well for localizing objects Ba et al. (2015) and aiding caption generation Xu et al. (2015). However, visual attention does not provide sufficient high-level semantic information. For example, when looking only at the right lower region of the chest x-ray image (Figure 1) without accounting for other areas, we might not be able to recognize what we are looking at, let alone detect the abnormalities. In contrast, the tags can always provide the needed high-level information. To this end, we propose a co-attention mechanism which can simultaneously attend to visual and semantic modalities.

where $\mathbf{W}_{\mathbf{v}}$ , $\mathbf{W}_{\mathbf{v},\mathbf{h}}$ , and $\mathbf{W}_{\mathbf{v}_{att}}$ are parameter matrices of the visual attention network. $\mathbf{W}_{\mathbf{a}}$ , $\mathbf{W}_{\mathbf{a},\mathbf{h}}$ , and $\mathbf{W}_{\mathbf{a}_{att}}$ are parameter matrices of the semantic attention network.

The visual and semantic context vectors are computed as:

There are many ways to combine the visual and semantic context vectors, such as concatenation and element-wise operations. In this paper, we first concatenate these two vectors as $[\mathbf{v}_{att}^{(s)};\mathbf{a}_{att}^{(s)}]$ , and then use a fully connected layer $\mathbf{W}_{fc}$ to obtain a joint context vector:

$$\mathbf{ctx}^{(s)}=\mathbf{W}_{fc}[\mathbf{v}_{att}^{(s)};\mathbf{a}_{att}^{(s)}]$$
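The co-attention step described in this section can be sketched in plain Python. The display equations are not reproduced in this text, so the scoring function below (additive attention, with a softmax over regions or tags) is an assumption chosen to match the named parameters; the attention parameters `w_v_att` and `w_a_att` are treated as single-row vectors, and all sizes and helper names (`matvec`, `attend`, `rand_matrix`) are hypothetical toy choices rather than the paper's 196 regions and 10 tags.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, h_prev, W_f, W_h, w_att):
    """Assumed additive attention: score_i = w_att . tanh(W_f f_i + W_h h_prev).

    Returns the attention weights and the weighted sum of the features.
    """
    proj_h = matvec(W_h, h_prev)
    scores = []
    for f in features:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W_f, f), proj_h)]
        scores.append(sum(w * x for w, x in zip(w_att, hidden)))
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(wt * f[d] for wt, f in zip(weights, features)) for d in range(dim)]
    return weights, context


def co_attention(visual, semantic, h_prev, p):
    """Attend to both modalities, then fuse: ctx = W_fc [v_att; a_att]."""
    alpha, v_att = attend(visual, h_prev, p["W_v"], p["W_vh"], p["w_v_att"])
    beta, a_att = attend(semantic, h_prev, p["W_a"], p["W_ah"], p["w_a_att"])
    ctx = matvec(p["W_fc"], v_att + a_att)  # list "+" concatenates the two vectors
    return alpha, beta, ctx


def rand_matrix(rows, cols, rng):
    """Small random matrix, used only to make the toy example runnable."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical): 4 regions, 3 tags.
rng = random.Random(0)
N, M, D_V, D_A, D_H, D_K, D_CTX = 4, 3, 5, 6, 7, 8, 9
visual = rand_matrix(N, D_V, rng)      # region features v_n
semantic = rand_matrix(M, D_A, rng)    # tag embeddings a_m
h_prev = rand_matrix(1, D_H, rng)[0]   # previous sentence-LSTM hidden state
params = {
    "W_v": rand_matrix(D_K, D_V, rng),
    "W_vh": rand_matrix(D_K, D_H, rng),
    "w_v_att": rand_matrix(1, D_K, rng)[0],
    "W_a": rand_matrix(D_K, D_A, rng),
    "W_ah": rand_matrix(D_K, D_H, rng),
    "w_a_att": rand_matrix(1, D_K, rng)[0],
    "W_fc": rand_matrix(D_CTX, D_V + D_A, rng),
}
alpha, beta, ctx = co_attention(visual, semantic, h_prev, params)
```

Both weight vectors sum to one, so each context vector is a convex combination of its inputs; conditioning both attentions on the same previous hidden state is what lets the two modalities focus on the same topic at a given sentence step.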

4 Sentence LSTM

We use a deep output layer Pascanu et al. (2014) to strengthen the context information in topic vector $\mathbf{t}^{(s)}$ , by combining the hidden state $\mathbf{h}_{sent}^{(s)}$ and the joint context vector $\mathbf{ctx}^{(s)}$ of the current step:

where $\mathbf{W}_{\mathbf{t},\mathbf{h}_{sent}}$ and $\mathbf{W}_{\mathbf{t},\mathbf{ctx}}$ are weight parameters.

We also apply a deep output layer to control the continuation of the sentence LSTM. The layer takes the previous and current hidden states $\mathbf{h}_{sent}^{(s-1)}$ , $\mathbf{h}_{sent}^{(s)}$ as input and produces a distribution over {STOP=1, CONTINUE=0}:

where $\mathbf{W}_{stop}$ , $\mathbf{W}_{stop,s-1}$ and $\mathbf{W}_{stop,s}$ are parameter matrices. If $p(STOP|\mathbf{h}_{sent}^{(s-1)},\mathbf{h}_{sent}^{(s)})$ is greater than a predefined threshold (e.g. 0.5), then the sentence LSTM will stop producing new topic vectors and the word LSTM will also stop producing words.
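A minimal sketch of the two deep output layers of the sentence LSTM follows. Since the equations are not shown in this text, the `tanh` nonlinearity and the two-way softmax are assumptions that fit the listed parameters; the function names and the toy weights are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_vector(h_sent, ctx, W_t_h, W_t_ctx):
    """Assumed form: t = tanh(W_t_h h_sent + W_t_ctx ctx)."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_t_h, h_sent), matvec(W_t_ctx, ctx))]


def stop_probability(h_prev, h_curr, W_stop, W_stop_prev, W_stop_curr):
    """Assumed form: softmax(W_stop tanh(W_stop_prev h_prev + W_stop_curr h_curr)).

    W_stop has two rows: index 0 is CONTINUE, index 1 is STOP.
    """
    hidden = [
        math.tanh(a + b)
        for a, b in zip(matvec(W_stop_prev, h_prev), matvec(W_stop_curr, h_curr))
    ]
    return softmax(matvec(W_stop, hidden))[1]


def should_stop(p_stop, threshold=0.5):
    """Stop only when the STOP probability is strictly greater than the threshold."""
    return p_stop > threshold


# Toy example with 2-dimensional states and identity projections.
I2 = [[1.0, 0.0], [0.0, 1.0]]
h_prev, h_curr = [1.0, 0.0], [0.0, 1.0]
t = topic_vector(h_curr, [0.5, -0.5], I2, I2)
p_stop = stop_probability(h_prev, h_curr, [[-1.0, -1.0], [1.0, 1.0]], I2, I2)
p_even = stop_probability(h_prev, h_curr, [[0.0, 0.0], [0.0, 0.0]], I2, I2)
```

With all-zero `W_stop` the two outcomes are equally likely, and because the comparison is strict the model continues at exactly 0.5.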

5 Word LSTM

The words of each sentence are generated by a word LSTM. Similar to Krause et al. (2017), the topic vector $\mathbf{t}$ produced by the sentence LSTM and the special START token are used as the first and second input of the word LSTM, and the subsequent inputs are the word sequence.

where $\mathbf{W}_{out}$ is the parameter matrix. After each word-LSTM has generated its word sequences, the final report is simply the concatenation of all the generated sequences.
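The control flow of hierarchical decoding can be sketched independently of any particular LSTM implementation. In the sketch below, `sentence_step` and `word_step` are hypothetical callables standing in for one step of the sentence LSTM and the word LSTM; the `<END>` token and the `max_sentences` and `max_words` caps are assumptions added so that the loops always terminate.

```python
START, END = "<START>", "<END>"


def generate_sentence(word_step, topic, max_words=30):
    """Decode one sentence: topic vector first, START second, then previous words."""
    state, _ = word_step(None, topic)  # first input: the topic vector
    words, inp = [], START             # second input: the START token
    for _ in range(max_words):
        state, word = word_step(state, inp)
        if word == END:
            break
        words.append(word)
        inp = word                     # subsequent inputs: the generated words
    return words


def generate_report(sentence_step, word_step, init_state,
                    stop_threshold=0.5, max_sentences=10, max_words=30):
    """Unroll the sentence level until STOP, decoding one sentence per topic."""
    state, report = init_state, []
    for _ in range(max_sentences):
        state, topic, p_stop = sentence_step(state)
        if p_stop > stop_threshold:
            break                      # no new topics and no further words
        report.extend(generate_sentence(word_step, topic, max_words))
    return " ".join(report)            # the report is the concatenation of sentences


# Toy stand-ins for the two LSTMs (scripted outputs, for illustration only).
SCRIPT = {0: ["no", "effusion", ".", END], 1: ["heart", "size", "normal", ".", END]}


def toy_sentence_step(s):
    """Return (next state, topic, p_stop); stops at the third step."""
    return s + 1, s, (0.9 if s >= 2 else 0.1)


def toy_word_step(state, inp):
    """Return (next state, next word); the first call only consumes the topic."""
    if state is None:
        return (inp, 0), None
    topic, pos = state
    return (topic, pos + 1), SCRIPT[topic][pos]


report = generate_report(toy_sentence_step, toy_word_step, init_state=0)
```

Passing the two steps in as callables keeps the stopping logic separate from the recurrent cells, so the same loop could drive a trained model or a scripted stub like the one above.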

6 Parameter Learning

Each training example is a tuple ( $I$ , $\mathbf{l}$ , $\mathbf{w}$ ) where $I$ is an image, $\mathbf{l}$ denotes the ground-truth tag vector, and $\mathbf{w}$ is the diagnostic paragraph, which is comprised of $S$ sentences and each sentence consists of $T_{s}$ words.

Such regularization encourages the model to pay equal attention over different image regions and different tags.
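The regularization term itself is not shown in this text. One form that matches the description, assumed here and patterned on the doubly stochastic attention penalty of Xu et al. (2015), penalizes each region or tag whose attention, summed over all sentence steps, deviates from one. The function names below are hypothetical.

```python
def attention_penalty(weights_per_step):
    """Sum over items of (1 - total attention the item received across steps)**2.

    weights_per_step[s][i] is the attention on item i at sentence step s.
    """
    n_items = len(weights_per_step[0])
    totals = [sum(step[i] for step in weights_per_step) for i in range(n_items)]
    return sum((1.0 - t) ** 2 for t in totals)


def regularization_loss(alpha, beta, lam_reg=1.0):
    """Assumed form: lam_reg * (penalty over image regions + penalty over tags)."""
    return lam_reg * (attention_penalty(alpha) + attention_penalty(beta))


# Two sentence steps over two regions (and two tags).
even = [[0.5, 0.5], [0.5, 0.5]]    # attention spread evenly across steps
peaked = [[1.0, 0.0], [1.0, 0.0]]  # the same item attended at every step
loss_even = regularization_loss(even, even)
loss_peaked = regularization_loss(peaked, even)
```

In this toy case the evenly spread attention incurs no penalty, while attending to the same region at every step is penalized, which is the behavior the text describes.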

Experiments

In this section, we evaluate the proposed model with extensive quantitative and qualitative experiments.

We used two publicly available medical image datasets to evaluate our proposed model.

The Indiana University Chest X-Ray Collection (IU X-Ray) Demner-Fushman et al. (2015) is a set of chest x-ray images paired with their corresponding diagnostic reports. The dataset contains 7,470 pairs of images and reports. Each report consists of the following sections: impression, findings, tags, comparison, and indication. There are two types of tags: manually generated (MeSH) and Medical Text Indexer (MTI) generated. In this paper, we treat the contents in impression and findings as the target captions to be generated and the Medical Text Indexer (MTI) annotated tags as the target tags to be predicted (Figure 1 provides an example). The impression and findings sections are concatenated together as a long paragraph, since impression can be viewed as a conclusion or topic sentence of the report.

We preprocessed the data by converting all tokens to lowercase and removing all non-alpha tokens, which resulted in 572 unique tags and 1,915 unique words. On average, each image is associated with 2.2 tags and 5.7 sentences, and each sentence contains 6.5 words. We also found that the top 1,000 words cover 99.0% of the word occurrences in the dataset, so we included only the top 1,000 words in the dictionary. Finally, we randomly selected 500 images for validation and 500 images for testing.
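The preprocessing described above can be sketched as follows. The exact tokenizer is not specified in this text, so splitting on word characters with a regular expression is an assumption, and the function names and the two example reports are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only purely alphabetic tokens."""
    return [tok for tok in re.findall(r"\w+", text.lower()) if tok.isalpha()]


def build_vocab(reports, max_size=1000):
    """Keep the max_size most frequent words and report their coverage.

    Coverage is the fraction of all word occurrences that fall in the vocabulary.
    """
    counts = Counter(tok for report in reports for tok in tokenize(report))
    vocab = [word for word, _ in counts.most_common(max_size)]
    total = sum(counts.values())
    covered = sum(counts[word] for word in vocab)
    coverage = covered / total if total else 0.0
    return vocab, coverage


# Toy corpus (made up for illustration, not taken from IU X-Ray).
reports = [
    "The heart is normal .",
    "No pneumothorax . The lungs are clear .",
]
tokens = tokenize("Heart size is 2.5 cm , NORMAL.")
vocab, coverage = build_vocab(reports, max_size=3)
```

On a real corpus the same coverage figure is what justifies truncating the dictionary: if the top 1,000 words cover almost all occurrences, little information is lost by dropping the rest.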

The Pathology Education Informational Resource (PEIR) digital library (http://peir.path.uab.edu/library/; PEIR is © University of Alabama at Birmingham, Department of Pathology) is a public medical image library for medical education. We collected the images together with their descriptions in the Gross sub-collection, resulting in the PEIR Gross dataset that contains 7,442 image-caption pairs from 21 different sub-categories. Different from the IU X-Ray dataset, each caption in PEIR Gross contains only one sentence. We used this dataset to evaluate our model’s ability to generate single-sentence reports.

For PEIR Gross, we applied the same preprocessing as for IU X-Ray, which yields 4,452 unique words. On average, each caption contains 12.0 words. In addition, for each caption, we selected the 5 words with the highest tf-idf scores as tags.
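Selecting tags by tf-idf can be sketched as below. The tf-idf variant is not stated in this text; the sketch assumes term frequency normalized by caption length and an unsmoothed logarithmic inverse document frequency, with ties broken alphabetically. The function name and the example captions are hypothetical.

```python
import math
from collections import Counter


def tfidf_tags(captions, k=5):
    """Return, for each caption, its k words with the highest tf-idf score."""
    docs = [caption.lower().split() for caption in captions]
    n_docs = len(docs)
    # Document frequency: in how many captions each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    tags = []
    for doc in docs:
        tf = Counter(doc)
        scores = {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}
        # Highest score first; alphabetical order breaks ties deterministically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        tags.append(ranked[:k])
    return tags


# Made-up captions in the style of single-sentence gross pathology descriptions.
captions = [
    "liver with cirrhosis",
    "liver with metastatic tumor",
    "kidney with infarct",
]
tags = tfidf_tags(captions, k=2)
```

Words that occur in every caption, such as "with" here, receive a score of zero and are never preferred over caption-specific terms.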

2 Implementation Details

We used the full VGG-19 model Simonyan and Zisserman (2014) for tag prediction. As for the training loss of the multi-label classification (MLC) task, since the number of tags for semantic attention is fixed as 10, we treat MLC as a multi-label retrieval task and adopt a softmax cross-entropy loss (a multi-label ranking loss), similar to Gong et al. (2013); Guillaumin et al. (2009).
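A softmax cross-entropy loss for multi-label targets can be sketched as follows. The target distribution is not specified in this text; the sketch assumes the binary label vector is normalized to sum to one, and the helper names are hypothetical. The second function shows how a fixed number of tags could be retrieved by ranking the logits.

```python
import math


def softmax_cross_entropy(logits, labels):
    """Cross-entropy between softmax(logits) and the normalized label vector.

    labels is a binary vector; the assumed target is labels / sum(labels).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0  # no positive tags: nothing to rank
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # log-sum-exp
    return -sum((y / n_pos) * (x - log_z) for x, y in zip(logits, labels))


def top_k_tags(logits, k=10):
    """Indices of the k highest-scoring tags, best first."""
    return sorted(range(len(logits)), key=lambda i: -logits[i])[:k]


# Uniform logits over four tags, two of which are positive.
loss_uniform = softmax_cross_entropy([0.0, 0.0, 0.0, 0.0], [1, 1, 0, 0])
# Raising the logits of the positive tags lowers the loss.
loss_better = softmax_cross_entropy([2.0, 2.0, 0.0, 0.0], [1, 1, 0, 0])
best_two = top_k_tags([0.1, 2.0, -1.0, 0.5], k=2)
```

Because the softmax ranks tags against each other instead of thresholding each one independently, taking the top-scoring tags always yields the fixed number that the semantic attention expects.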

In paragraph generation, we set the dimensions of all hidden states and word embeddings as 512. For words and tags, different embedding matrices were used since a tag might contain multiple words. We utilized the embeddings of the 10 most likely tags as the semantic feature vectors $\{\mathbf{a}_{m}\}_{m=1}^{M=10}$ . We extracted the visual features from the last convolutional layer of the VGG-19 network, which yields a $14\times 14\times 512$ feature map.

We used the Adam Kingma and Ba (2014) optimizer for parameter learning. The learning rates for the CNN (VGG-19) and the hierarchical LSTM were 1e-5 and 5e-4 respectively. The weights ( $\lambda_{tag}$ , $\lambda_{sent}$ , $\lambda_{word}$ and $\lambda_{reg}$ ) of different losses were set to 1.0. The threshold for stop control was 0.5. Early stopping was used to prevent over-fitting.

3 Baselines

We compared our method with several state-of-the-art image captioning models: CNN-RNN Vinyals et al. (2015), LRCN Donahue et al. (2015), Soft ATT Xu et al. (2015), and ATT-RK You et al. (2016). We re-implemented all of these models and adopted VGG-19 Simonyan and Zisserman (2014) as the CNN encoder. Because these models are built for single-sentence captions, and to better show the effectiveness of the hierarchical LSTM and the attention mechanism for paragraph generation, we also implemented a hierarchical model without any attention: Ours-no-Attention. The input of Ours-no-Attention is the overall image feature of VGG-19, which has a dimension of 4096. Ours-no-Attention can be viewed as a CNN-RNN Vinyals et al. (2015) equipped with a hierarchical LSTM decoder. To further show the effectiveness of the proposed co-attention mechanism, we also implemented two ablated versions of our model: Ours-Semantic-only and Ours-Visual-only, which take only the semantic attention or the visual attention context vector, respectively, to produce topic vectors.

4 Quantitative Results

We report the paragraph generation (upper part of Table 1) and single-sentence generation (lower part of Table 1) results using the standard image captioning evaluation tool (https://github.com/tylin/coco-caption), which provides evaluation on the following metrics: BLEU Papineni et al. (2002), METEOR Denkowski and Lavie (2014), ROUGE Lin (2004), and CIDEr Vedantam et al. (2015).

For paragraph generation, as shown in the upper part of Table 1, it is clear that models with a single LSTM decoder perform much worse than those with a hierarchical LSTM decoder. Note that the only difference between Ours-no-Attention and CNN-RNN Vinyals et al. (2015) is that Ours-no-Attention adopts a hierarchical LSTM decoder while CNN-RNN Vinyals et al. (2015) adopts a single-layer LSTM. The comparison between these two models directly demonstrates the effectiveness of the hierarchical LSTM. This result is not surprising since it is well-known that a single-layer LSTM cannot effectively model long sequences Liu et al. (2015); Martin and Cundy (2018). Additionally, employing semantic attention alone (Ours-Semantic-only) or visual attention alone (Ours-Visual-only) to generate topic vectors does not seem to help caption generation much. A potential reason is that visual attention can only capture the visual information of sub-regions of the image and is unable to correctly capture the semantics of the entire image, while semantic attention is inadequate for localizing small abnormal image regions. Finally, our full model (Ours-CoAttention) achieves the best results on all of the evaluation metrics, which demonstrates the effectiveness of the proposed co-attention mechanism.

For the single-sentence generation results (shown in the lower part of Table 1), the ablated versions of our model (Ours-Semantic-only and Ours-Visual-only) achieve competitive scores compared with the state-of-the-art methods. Our full model (Ours-CoAttention) outperforms all of the baselines, which indicates the effectiveness of the proposed co-attention mechanism.

5 Qualitative Results

An illustration of paragraph generation by three models (Ours-CoAttention, Ours-no-Attention and Soft Attention) is shown in Figure 3. We observe that different sentences have different topics. The first sentence is usually a high-level description of the image, while each of the following sentences is associated with one area of the image (e.g. “lung”, “heart”). The Soft Attention and Ours-no-Attention models detect only a few abnormalities in the images, and the detected abnormalities are incorrect. In contrast, the Ours-CoAttention model is able to correctly describe many true abnormalities (as shown in the top three images). This comparison demonstrates that co-attention is better at capturing abnormalities.

For the third image, the Ours-CoAttention model successfully detects the area (“right lower lobe”) which is abnormal (“eventration”); however, it fails to precisely describe this abnormality. In addition, the model also finds abnormalities about “interstitial opacities” and “atheroscalerotic calcification”, which are not considered true abnormalities by human experts. A potential reason for this mis-description is that this x-ray image is darker (compared with the above images), and our model might be very sensitive to this change.

The image at the bottom is a failure case of Ours-CoAttention. However, even though the model makes the wrong judgment about the major abnormalities in the image, it does find some unusual regions: “lateral lucency” and “left lower lobe”.

To further understand the models’ ability to detect abnormalities, we present the proportion of sentences which describe normalities and abnormalities in Table 2. We consider sentences which contain “no”, “normal”, “clear”, or “stable” as sentences describing normalities. It is clear that Ours-CoAttention best approximates the ground truth distribution over normality and abnormality.
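The keyword rule for labeling sentences can be sketched as follows. Matching whole words instead of substrings is an assumption made here, so that a word such as "nodule" is not counted as containing "no"; the function names and example sentences are hypothetical.

```python
import re

NORMAL_KEYWORDS = {"no", "normal", "clear", "stable"}


def is_normal_sentence(sentence):
    """True if the sentence contains any normality keyword as a whole word."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(token in NORMAL_KEYWORDS for token in tokens)


def normality_ratio(sentences):
    """Fraction of sentences that are labeled as describing normalities."""
    if not sentences:
        return 0.0
    return sum(is_normal_sentence(s) for s in sentences) / len(sentences)


# Made-up sentences in the style of a chest x-ray report.
sentences = [
    "no pleural effusion .",
    "the heart is normal in size .",
    "there is a nodule in the right lower lobe .",
]
ratio = normality_ratio(sentences)
```

Applying the same function to generated and ground-truth reports gives the two proportions that a comparison like Table 2 needs.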

5.2 Co-Attention Learning

Figure 4 presents visualizations of co-attention. The first property shown by Figure 4 is that the sentence LSTM can generate different topics at different time steps since the model focuses on different image regions and tags for different sentences. The next finding is that visual attention can guide our model to concentrate on relevant regions of the image. For example, the third sentence of the first example is about “cardio”, and the visual attention concentrates on regions near the heart. Similar behavior can also be found for semantic attention: for the last sentence in the first example, our model correctly concentrates on “degenerative change” which is the topic of the sentence. Finally, the first sentence of the last example presents a mis-description caused by incorrect semantic attention over tags. Such incorrect attention can be reduced by building a better tag prediction module.

Conclusion

In this paper, we study how to automatically generate textual reports for medical images, with the goal of helping medical professionals produce reports more accurately and efficiently. Our proposed methods address three major challenges: (1) how to generate multiple heterogeneous forms of information within a unified framework, (2) how to localize abnormal regions and produce accurate descriptions for them, and (3) how to generate long texts that contain multiple sentences or even paragraphs. To cope with these challenges, we propose a multi-task learning framework which jointly predicts tags and generates descriptions. We introduce a co-attention mechanism that can simultaneously explore visual and semantic information to accurately localize and describe abnormal regions. We develop a hierarchical LSTM network that can more effectively capture long-range semantics and produce high-quality long texts. On two medical datasets containing radiology and pathology images, we demonstrate the effectiveness of the proposed methods through quantitative and qualitative studies.