Hybrid Retrieval-Generation Reinforced Agent for Medical Image Report Generation

Christy Y. Li, Xiaodan Liang, Zhiting Hu, Eric P. Xing

Introduction

Beyond the traditional visual captioning task that produces a single sentence, generating long and topic-coherent stories or reports to describe visual content (images or videos) has recently attracted increasing research interest, posed as a more challenging and realistic goal towards bridging visual patterns with human linguistic descriptions. Report generation in particular poses several challenges: 1) The generated report is a long narrative consisting of multiple sentences or paragraphs, which must follow a plausible logic and stay on consistent topics. 2) There is a presumed content coverage and specific terminology or phrasing, depending on the task at hand. For example, a sports game report should describe competing teams, winning points, and outstanding players. 3) Content ordering is crucial. For example, a sports game report usually covers the competition results before describing teams and players in detail.

As one of the most representative and practical report generation tasks, medical image report generation must satisfy stricter protocols and ensure the correct use of medical terms. As shown in Figure 1, a medical report consists of a findings section describing medical observations in detail, covering both normal and abnormal features; an impression or conclusion sentence indicating the most prominent medical observation or conclusion; and comparison and indication sections that list the patient's peripheral information. Among these sections, the findings section, regarded as the most important component, ought to cover various aspects such as heart size, lung opacity, and bone structure; any abnormality appearing in the lungs, aorta, and hilum; and potential diseases such as effusion, pneumothorax, and consolidation. In terms of content ordering, the narrative of the findings section usually follows a presumptive order, e.g. heart size and mediastinum contour followed by lung opacity, and remarkable abnormalities followed by mild or potential abnormalities.

State-of-the-art caption generation models tend to perform poorly on medical report generation with specific content requirements for several reasons. First, medical reports are usually dominated by normal findings; that is, a small set of majority sentences usually forms a template database. For these normal cases, a retrieval-based system (e.g. one that directly performs classification among a list of majority sentences given image features) can perform surprisingly well due to the low variance of the language. For instance, in Figure 1, a retrieval-based system correctly detects effusion from a chest x-ray image, while a generative model that generates word by word given image features fails to detect effusion. On the other hand, abnormal findings, which are relatively rare and remarkably diverse, are of higher importance. Current text generation approaches often fail to capture the diversity of this small portion of descriptions, and pure generation pipelines are biased towards generating plausible sentences that look natural to the language model but are poorly grounded visually. In contrast, a desirable medical report has to not only describe normal and abnormal findings, but also support itself with visual evidence such as the location and attributes of the detected findings appearing in the image.

Inspired by the fact that radiologists often follow templates for writing reports and modify them accordingly for each individual case, we propose a Hybrid Retrieval-Generation Reinforced Agent (HRGR-Agent), which is the first attempt to incorporate human prior knowledge with learning-based generation for medical reports. HRGR-Agent employs a retrieval policy module to decide between automatically generating sentences with a generation module and retrieving specific sentences from the template database, and then sequentially generates multiple sentences via hierarchical decision-making. The template database is built from human prior knowledge collected from available medical reports. To enable effective and robust report generation, we jointly train the retrieval policy module and the generation module via reinforcement learning (RL) guided by sentence-level and word-level rewards, respectively. Figure 1 shows an example report generated by our HRGR-Agent, which correctly describes "a small effusion" from the chest x-ray image and supports its finding by providing the appearance ("blunting") and location ("costophrenic sulcus") of the evidence.

Our main contribution is to bridge rule-based (retrieval) and learning-based generation via reinforcement learning, which can achieve plausible, correct, and diverse medical report generation. Moreover, our HRGR-Agent has several technical merits compared to existing retrieval-generation-based models: 1) our retrieval and generation modules are updated jointly and benefit from each other via policy learning; 2) the retrieval actions are regarded as a part of the generation, so the selection of templates directly influences the final generated result; 3) the generation module is encouraged to learn diverse and complicated sentences while the retrieval policy module learns template-like sentences, driven by distinct word-level and sentence-level rewards, respectively. Other work still forces the generative model to predict template-like sentences.

We conduct extensive experiments on two medical image report datasets. Our HRGR-Agent achieves state-of-the-art performance on both datasets under three kinds of evaluation metrics: automatic metrics such as CIDEr, BLEU, and ROUGE; human evaluation; and detection precision of medical terminologies. Experiments show that the sentences generated by HRGR-Agent strike a decent balance between concise template sentences and complicated, diverse sentences.

Related Work

Visual Captioning and Report Generation. Visual captioning aims at generating a descriptive sentence for images or videos. State-of-the-art approaches use CNN-RNN architectures and attention mechanisms. The generated sequence is usually short, describing only the dominating visual event, and is primarily rewarded for language fluency in practice. Generating reports that are informative and span multiple sentences poses higher requirements on content selection, relation generation, and content ordering. The task differs from image captioning and sentence generation, where usually a single sentence or a few sentences are required, and from summarization, where summaries tend to be more diverse without clear template sentences. State-of-the-art methods for report generation still largely clone expert behaviour, and are incapable of diversifying language and depicting rare but prominent findings. Our approach avoids mimicking teacher behaviour by relieving the generative model of part of its burden through a template selection and retrieval mechanism, which by design promotes language diversity and better content selection.

Template Based Sequence Generation. Some recent approaches bridge generative language approaches and traditional template-based methods. However, state-of-the-art approaches either treat a retrieval mechanism as latent guidance, whose impact on text generation is limited, or still encourage the generation network to mimic template-like sequences. Our method is close to previous copy mechanism work such as the pointer-generator; however, it differs in that: 1) our retrieval module aims to retrieve from an external common template base, which is particularly effective for the task, as opposed to copying from a specific source article; 2) we formulate the retrieval-generation choices as discrete actions (as opposed to soft weights as in previous work) and learn with hierarchical reinforcement learning to optimize both short- and long-term goals.

Reinforcement Learning for Sequence Generation. Recently, reinforcement learning (RL) has become increasingly popular in sequence generation tasks such as visual captioning, text summarization, and machine translation. Traditional methods use a cross entropy loss, which is prone to exposure bias and does not necessarily optimize evaluation metrics such as CIDEr, ROUGE, BLEU, and METEOR. In contrast, reinforcement learning can directly use the evaluation metrics as rewards and update model parameters via policy gradient. There have been some recent efforts to apply hierarchical reinforcement learning (HRL), where sequence generation is broken down into several sub-tasks, each of which targets a chunk of words. However, HRL for long report generation is still under-explored.

Approach

As described in Figure 2, a set of images for each sample is first fed into a CNN to extract visual features, which are then transformed into a context vector by an image encoder. Then a sentence decoder recurrently generates a sequence of hidden states $\textbf{q}=(\textbf{q}_{1},\textbf{q}_{2},\dots,\textbf{q}_{M})$ which represent sentence topics. Given each topic state $\textbf{q}_{i}$, a retrieval policy module decides to either automatically generate a new sentence by invoking a generation module, or retrieve an existing template from the template database. Both the retrieval policy module (which chooses between automatic generation and template retrieval) and the generation module (which generates words) make discrete decisions and are updated via the REINFORCE algorithm. We devise sentence-level and word-level rewards for the two modules, respectively.

Sentence Decoder. The sentence decoder comprises stacked RNN layers which generate a sequence of topic states $\textbf{q}$. We equip the stacked RNNs with an attention mechanism to enhance text generation. Each stacked RNN first generates an attentive context vector $\mathbf{c}_{i}^{s}$, where $i$ indicates the time step, given the image context vector $\mathbf{h}^{v}$ and the previous hidden state $\mathbf{h}_{i-1}^{s}$. It then generates a hidden state $\mathbf{h}_{i}^{s}$ based on $\mathbf{c}_{i}^{s}$ and $\mathbf{h}_{i-1}^{s}$. The generated hidden state $\mathbf{h}_{i}^{s}$ is further projected into a topic space as $\mathbf{q}_{i}$ and into a stop control probability $z_{i}\in[0,1]$ through separate non-linear functions. Formally, the sentence decoder can be written as:
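A form of these four steps that is consistent with the symbols defined in this section is sketched below. It is a reconstruction from the surrounding definitions, not a verbatim copy; in particular, the sigmoid on the stop control is an assumption based on $z_{i}$ being a probability.

$$
\begin{aligned}
\mathbf{c}_{i}^{s} &= F^{s}_{\text{attn}}(\mathbf{h}^{v}, \mathbf{h}_{i-1}^{s}) \\
\mathbf{h}_{i}^{s} &= F^{s}_{\text{RNN}}(\mathbf{c}_{i}^{s}, \mathbf{h}_{i-1}^{s}) \\
\mathbf{q}_{i} &= \sigma(\mathbf{W}_{q}\mathbf{h}_{i}^{s} + \mathbf{b}_{q}) \\
z_{i} &= \text{Sigmoid}(\mathbf{W}_{z}\mathbf{h}_{i}^{s} + \mathbf{b}_{z})
\end{aligned}
$$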

where $F^{s}_{\text{attn}}$ denotes the attention mechanism, $F^{s}_{\text{RNN}}$ denotes the non-linear functions of the stacked RNN, $\mathbf{W}_{q}$ and $\mathbf{b}_{q}$ are parameters which project hidden states into the topic space, $\mathbf{W}_{z}$ and $\mathbf{b}_{z}$ are parameters for stop control, and $\sigma$ is a non-linear activation function. When the stop control probability $z_{i}$ is greater than or equal to a predefined threshold (e.g. 0.5), the decoder stops generating topic states, which ends the hierarchical report generation process.

where $\mathbf{W}_{u}$ and $\mathbf{b}_{u}$ are network parameters, and the resulting $m_{i}$ is the index of the highest probability in $\mathbf{u}_{i}$.
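The control flow implied by the stop probability $z_{i}$ and the action index $m_{i}$ can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `decode_report`, the convention that index 0 means "generate" while index $k \geq 1$ selects the $k$-th template, and the choice to check the stop condition before emitting a sentence are all hypothetical and not taken from the text.

```python
def decode_report(stop_probs, action_probs, templates, generate, threshold=0.5):
    """Assemble a report from per-topic stop probabilities and action distributions.

    Assumptions (not from the paper text): index 0 of each action
    distribution means "generate a new sentence", and index k >= 1 means
    "retrieve templates[k - 1]".
    """
    sentences = []
    for i, (z_i, u_i) in enumerate(zip(stop_probs, action_probs)):
        if z_i >= threshold:
            # Stop control: no further topic states are produced.
            break
        # m_i is the index of the highest probability in u_i (argmax).
        m_i = max(range(len(u_i)), key=u_i.__getitem__)
        if m_i == 0:
            sentences.append(generate(i))          # generation module
        else:
            sentences.append(templates[m_i - 1])   # template retrieval
    return sentences


templates = ["the heart is normal in size .", "the lungs are clear ."]
report = decode_report(
    stop_probs=[0.1, 0.2, 0.9],
    action_probs=[[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]],
    templates=templates,
    generate=lambda i: f"<generated sentence {i}>",
)
```

Here the first topic retrieves a template, the second invokes the (stubbed) generator, and the third triggers the stop condition.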

Reward Module. We use the automatic metric CIDEr for computing rewards, since recent work on image captioning has shown that CIDEr performs better than many traditional automatic metrics such as BLEU, METEOR, and ROUGE. We consider two kinds of reward functions: a sentence-level reward and a word-level reward. For the $i$-th generated sentence $\textbf{y}_{i}=(y_{i,1},y_{i,2},\dots,y_{i,N})$, obtained from either retrieval or generation, we compute a delta CIDEr score at sentence level, which is $R_{sent}(\textbf{y}_{i})=f(\{\textbf{y}_{k}\}_{k=1}^{i},\text{gt})-f(\{\textbf{y}_{k}\}_{k=1}^{i-1},\text{gt})$, where $f$ denotes CIDEr evaluation and gt denotes the ground truth report. This measures the gain that the generated sentence adds to the existing sentences when evaluating the quality of the whole report. For a single word, we likewise use a delta CIDEr score as the reward, $R_{word}(y_{t})=f(\{y_{k}\}_{k=1}^{t},\text{gt}^{s})-f(\{y_{k}\}_{k=1}^{t-1},\text{gt}^{s})$, where $\text{gt}^{s}$ denotes the ground truth sentence. The sentence-level and word-level rewards are used for computing the discounted rewards for the retrieval policy module and the generation module, respectively.
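The delta-score construction can be illustrated with a small self-contained sketch. Note the assumption: a real CIDEr implementation is replaced here by a hypothetical stand-in metric (`unigram_f1`, a unigram-overlap F1) so that the example runs without external packages. Only the differencing logic in `delta_rewards` mirrors the definition above.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """Hypothetical stand-in for f (this is NOT CIDEr): unigram-overlap F1."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def delta_rewards(units, reference_tokens, metric=unigram_f1):
    """Reward of unit i = metric(units[:i+1]) - metric(units[:i]).

    A unit is a list of tokens: a whole sentence for the sentence-level
    reward, or a single-token list for the word-level reward.
    """
    rewards, prefix = [], []
    previous = metric(prefix, reference_tokens)
    for unit in units:
        prefix = prefix + list(unit)
        current = metric(prefix, reference_tokens)
        rewards.append(current - previous)
        previous = current
    return rewards


reference = "the heart is normal . no pleural effusion .".split()
sentences = [["the", "heart", "is", "normal", "."],
             ["no", "pleural", "effusion", "."]]
sentence_rewards = delta_rewards(sentences, reference)
word_rewards = delta_rewards([[w] for w in sentences[0]], sentences[0])
```

Because consecutive differences telescope, the rewards of all units sum to the score of the complete output, so each unit is credited only for what it adds.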

Hierarchical Reinforcement Learning

Our objective is to maximize the reward of the generated report $\mathbf{Y}$ compared to the ground truth report $\mathbf{Y}^{*}$. Omitting the condition on image features for simplicity, the loss function can be written as:

Policy Update for Retrieval Policy Module. We define the reward for the retrieval policy module $R^{r}$ at sentence level. The generated sentence or retrieved template sentence is used for computing the reward. The discounted sentence-level reward and its corresponding policy update according to the REINFORCE algorithm can be written as:

where $\gamma$ is a discount factor; $\mathbf{y}_{i}$ is the $i$-th generated sequence; and $\theta_{r}$ represents the parameters of the retrieval policy module, which are $W_{u}$ and $b_{u}$ in Equation 5.

Policy Update for Generation Module. We define the word-level reward $R^{g}(y_{t})$ for each word generated by the generation module as the discounted reward of all generated words after the considered word. The discounted reward function and its policy update for the generation module can be written as:

where $\gamma$ is a discount factor, and $\theta_{g}$ represents the parameters of the generation module, such as $W_{y}$, $b_{y}$, and $W_{e}$ in Equations 9-11, the parameters of the attention functions in Equation 7, and those of the RNNs in Equation 8. The detailed policy update algorithm is provided in the supplementary materials.
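Both policy updates rely on the same two ingredients: a discounted sum of per-step rewards and a REINFORCE-style surrogate loss. The sketch below shows these in plain Python as a generic illustration, not the exact update rule used here. It assumes that the discounted sum starts at the current step, and the names `discounted_returns` and `reinforce_loss` are hypothetical.

```python
import math


def discounted_returns(rewards, gamma):
    """returns[t] = sum over j >= 0 of gamma**j * rewards[t + j].

    Assumption: the sum includes the reward of the current step t.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_loss(action_probs, returns):
    """Surrogate loss -sum_t returns[t] * log p(action_t).

    Differentiating this with respect to the policy parameters (with the
    returns held constant) gives the REINFORCE gradient estimate.
    """
    return -sum(r * math.log(p) for p, r in zip(action_probs, returns))


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
loss = reinforce_loss([0.5, 0.5, 0.5], returns)
```

The same helpers apply to the sentence-level rewards of the retrieval policy module (one step per sentence) and to the word-level rewards of the generation module (one step per word).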

Experiments and Analysis

Datasets. We conduct experiments on two medical image report datasets. First, the Indiana University Chest X-Ray Collection (IU X-Ray) is a public dataset consisting of 7,470 frontal and lateral-view chest x-ray images paired with their corresponding diagnostic reports. Each patient has 2 images and a report which includes impression, findings, comparison, and indication sections. We preprocess the reports by tokenizing, converting to lowercase, and keeping tokens with frequency no less than 3 as the vocabulary, which results in 1185 unique tokens covering over 99.0% of word occurrences in the corpus.
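The vocabulary construction and the coverage figure can be reproduced with a short sketch. It assumes whitespace tokenization as a stand-in for the actual tokenizer, and the helper name `build_vocab` is hypothetical.

```python
from collections import Counter


def build_vocab(tokenized_reports, min_freq=3):
    """Keep tokens whose corpus frequency is at least min_freq.

    Returns the sorted vocabulary and the fraction of all word
    occurrences in the corpus that the vocabulary covers.
    """
    counts = Counter(tok for report in tokenized_reports for tok in report)
    vocab = sorted(tok for tok, c in counts.items() if c >= min_freq)
    covered = sum(c for c in counts.values() if c >= min_freq)
    total = sum(counts.values())
    coverage = covered / total if total else 0.0
    return vocab, coverage


raw_reports = [
    "The lungs are clear .",
    "The lungs are clear .",
    "The heart is normal .",
    "Mild cardiomegaly .",
]
# Assumption: lowercasing plus whitespace splitting stands in for tokenization.
reports = [r.lower().split() for r in raw_reports]
vocab, coverage = build_vocab(reports, min_freq=3)
```

On this toy corpus only "." and "the" reach the threshold, covering 7 of 18 token occurrences. On a real corpus the long tail of rare tokens is much thinner, which is how a small vocabulary can cover over 99% of occurrences.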

CX-CHR is a proprietary internal dataset of chest X-ray images with Chinese reports collected from a professional medical institution for health checking. The dataset consists of 35,500 patients. Each patient has one or multiple chest x-ray images in different views, such as posteroanterior and lateral, and a corresponding Chinese report. We select patients with no more than 2 images and obtain 33,236 patient samples in total, which cover over 93% of the dataset. We preprocess the reports by tokenizing with Jieba and keeping tokens with frequency no less than 3 as the vocabulary, which results in 1282 unique tokens.

On both datasets, we randomly split the data by patient into training, validation, and testing sets with a ratio of 7:1:2. There is no overlap between patients in different sets. We predict the "findings" section, as it is the most important component of a report. On the CX-CHR dataset, we pretrain a DenseNet on classification with the publicly available ChestX-ray8 dataset, and fine-tune it on CX-CHR with 20 common thorax disease labels. As the IU X-Ray dataset is relatively small, we do not directly fine-tune the pretrained DenseNet on it, and instead extract visual features from a DenseNet pretrained jointly on the ChestX-ray8 and CX-CHR datasets. Please see the Supplementary Material for more details.
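A patient-level split can be sketched as follows. Shuffling unique patient identifiers, and not individual images, is what guarantees that no patient appears in more than one set. The function name `split_by_patient`, the seed, and the identifier format are assumptions made for illustration.

```python
import random


def split_by_patient(patient_ids, ratios=(7, 1, 2), seed=0):
    """Split unique patient ids into train/val/test by the given ratios."""
    ids = sorted(set(patient_ids))       # deduplicate, fix a stable order
    random.Random(seed).shuffle(ids)     # reproducible shuffle
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]         # remainder goes to the test set
    return train, val, test


# Hypothetical ids; each patient may own several images downstream.
train, val, test = split_by_patient([f"patient_{i}" for i in range(100)])
```

All images and the report of a patient then follow that patient's identifier into the corresponding set.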

Template Database. We select sentences in the training set whose document frequencies (the number of occurrences of a sentence in training documents) are no less than a threshold as template candidates. We further group candidates that express the same meaning but have slight linguistic variations. For example, "no pleural effusion or pneumothorax" and "there is no pleural effusion or pneumothorax" are grouped as one template. This results in 97 templates with document frequency greater than 500 for CX-CHR and 28 templates with document frequency greater than 100 for IU X-Ray. Upon retrieval, only the most frequent sentence of a template group is retrieved, both for HRGR-Agent and for any rule-based model that we compare with. Although this introduces a minor but inevitable error in the generated results, our experiments show that the error is negligible compared to the advantages that a hybrid of retrieval-based and generation-based approaches brings. Besides, separating templates of the same meaning into different categories would diminish the capability of the retrieval policy module to predict the most suitable template for a given visual input, as multiple templates would share exactly the same meaning. Table 1 shows examples of templates for the IU X-Ray dataset. More template examples are provided in the supplementary materials.
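The selection procedure (threshold on document frequency, group equivalent candidates, keep the most frequent member of each group) can be sketched as below. The text does not say how the grouping is performed, so the `group_key` function that maps equivalent sentences to a shared key is a purely hypothetical placeholder, and `build_templates` is an illustrative name.

```python
from collections import Counter, defaultdict


def build_templates(reports, min_doc_freq, group_key=lambda s: s):
    """reports: list of reports, each given as a list of sentence strings."""
    doc_freq = Counter()
    for sentences in reports:
        doc_freq.update(set(sentences))  # count a sentence once per report
    groups = defaultdict(Counter)
    for sentence, freq in doc_freq.items():
        if freq >= min_doc_freq:         # candidate selection by threshold
            groups[group_key(sentence)][sentence] = freq
    # The representative of a group is its most frequent member.
    return sorted(members.most_common(1)[0][0] for members in groups.values())


def strip_there_is(sentence):
    """Hypothetical grouping rule: ignore a leading "there is "."""
    prefix = "there is "
    return sentence[len(prefix):] if sentence.startswith(prefix) else sentence


reports = (
    [["no pleural effusion or pneumothorax", "the heart is normal"]] * 3
    + [["there is no pleural effusion or pneumothorax"]] * 2
    + [["mild cardiomegaly"]]
)
templates = build_templates(reports, min_doc_freq=2, group_key=strip_there_is)
```

In this toy corpus the two effusion sentences fall into one group represented by its more frequent variant, and the rare "mild cardiomegaly" sentence is left to the generation module.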

Evaluation Metrics. We use three kinds of evaluation metrics: 1) automatic metrics including CIDEr, ROUGE, and BLEU; 2) medical abnormality terminology detection accuracy: we select the 10 most frequent medical abnormality terminologies in medical reports and evaluate the average precision and average false positive (AFP) of the compared models; 3) human evaluation: we randomly select 100 samples from the testing set for each method and conduct surveys through Amazon Mechanical Turk. Each survey question gives a ground truth report and asks the participant to choose, among reports generated by different models, the one that best matches the ground truth report in terms of language fluency, content selection, and correctness of medical abnormal findings. A default choice is provided in case neither or both reports are preferred. We collect results from 20 participants and compute the average preference percentage for each model, excluding default choices.

Training Details. We implement our model in PyTorch and train on a GeForce GTX TITAN GPU. We first train all models with cross entropy loss for 30 epochs with an initial learning rate of 5e-4, and then fine-tune the retrieval policy module and the generation module of HRGR-Agent via RL with a fixed learning rate of 5e-5 for another 30 epochs. We use 512 as the dimension of all hidden states and word embeddings, and a batch size of 16. We set the maximum number of sentences in a report and the maximum number of tokens in a sentence to 18 and 44 for CX-CHR, and to 7 and 15 for IU X-Ray. Besides, since baseline models tend to predict the most popular normal reports for all testing samples and most medical reports describe normal cases, we add a post-processing step that increases the length and comprehensiveness of the generated reports for both datasets while keeping the design of HRGR-Agent that lets it better predict abnormalities. The post-processing works as follows: we first select the 4 key words most commonly predicted with normal descriptions by the other baselines; then, for each key word, if the generated report describes neither an abnormality nor a normality of that key word, we add a corresponding sentence that describes its normal case. The key words for IU X-Ray are "heart size and mediastinal contours", "pleural effusion or pneumothorax", "consolidation", and "lungs are clear". As observed in our experiments, this step leaves the medical abnormality term detection results unchanged and improves the automatic report generation metrics, especially the BLEU-n metrics.
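The post-processing step can be sketched as follows. The text does not specify how a report is judged to "describe" a key word or which normal sentences are appended, so the substring triggers and the sentences in `rules` are hypothetical placeholders, as is the function name.

```python
def add_missing_normal_sentences(report_sentences, rules):
    """Append a normal-case sentence for every key word the report omits.

    rules: list of (trigger_terms, normal_sentence) pairs. A key word
    counts as described if any trigger term occurs in the report
    (assumption: plain substring matching on the original report).
    """
    text = " ".join(report_sentences).lower()
    result = list(report_sentences)
    for trigger_terms, normal_sentence in rules:
        if not any(term in text for term in trigger_terms):
            result.append(normal_sentence)
    return result


# Hypothetical triggers and sentences for the four IU X-Ray key words.
rules = [
    (("heart", "mediastin"), "heart size and mediastinal contours are normal ."),
    (("effusion", "pneumothorax"), "no pleural effusion or pneumothorax ."),
    (("consolidation",), "no focal consolidation ."),
    (("lung",), "the lungs are clear ."),
]
report = ["there is a small left pleural effusion ."]
processed = add_missing_normal_sentences(report, rules)
```

Because the check runs against the original report only, a sentence that already mentions effusion suppresses the normal effusion sentence, so the step never contradicts an abnormality that the model has reported for that key word.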

Baselines. On both datasets, we compare with four state-of-the-art image captioning models: CNN-RNN, LRCN, AdaAtt, and Att2in. Visual features for all models are extracted from the last convolutional layer of the respective pretrained DenseNets as mentioned in Section 4, yielding 16 $\times$ 16 $\times$ 256 feature maps for both datasets. We use greedy search and argmax sampling for HRGR-Agent and the baselines on both datasets. On the IU X-Ray dataset, we also compare with CoAtt, which uses different visual features extracted from a pretrained ResNet. The authors of CoAtt re-trained their model using our train/test split, and provided evaluation results for automatic report generation metrics using greedy search and a sampling temperature of 0.5 at test time. We further evaluated their predictions to obtain medical abnormality terminology detection precision and AFP. Due to the relatively large size of CX-CHR, we conduct additional experiments on it to compare HRGR-Agent with variants obtained by removing individual components (Retrieval, Generation, RL). We train a hierarchical generative model without any template retrieval or RL fine-tuning (Generation), and our model without RL fine-tuning (HRG). To examine the quality of our pre-defined templates, we separately evaluate the retrieval policy module of HRGR-Agent by masking out the generation part and using only the retrieved templates as the prediction (Retrieval). Note that Retrieval uses the same model as HRGR-Agent, whose training involves automatic generation of sentences, so its results may be higher than those of a general retrieval-based system (e.g. one that directly performs classification among a list of majority sentences given image features).

Automatic Evaluation. Table 2 shows the automatic evaluation comparison of state-of-the-art methods and our model variants. Most importantly, HRGR-Agent outperforms all baseline models that have no retrieval mechanism or hierarchical structure on both datasets by large margins, demonstrating its effectiveness and robustness. On the IU X-Ray dataset, HRGR-Agent achieves slightly lower BLEU-1,4 and ROUGE scores than CoAtt. However, CoAtt uses different pre-processing of reports and visual features, jointly predicts "impression" and "findings", and uses single-image input, while our method focuses on "findings" and uses combined frontal and lateral views of patients. On CX-CHR, HRGR-Agent increases the CIDEr score by 0.73 compared to HRG, demonstrating that reinforcement fine-tuning is crucial to the performance increase since it directly optimizes the evaluation metric. Besides, Retrieval surpasses Generation by relatively large margins, showing that a retrieval-based method is beneficial for generating structured reports, which leads to the boosted performance of HRGR-Agent when combined with a neural generation approach (the generation module). To give a better picture of HRGR-Agent's output, each report generated at test time has on average 7.2 and 4.8 sentences for the CX-CHR and IU X-Ray datasets, respectively. The percentages of retrieval vs. generation are 83.5 vs. 16.5 on CX-CHR, and 82.0 vs. 18.0 on IU X-Ray.

Medical Abnormality Terminology Evaluation. Table 3 shows evaluation results for the average precision and average false positive of medical abnormality terminology detection. HRGR-Agent achieves the highest precision, and its AFP is only slightly lower than that of CoAtt, demonstrating its robustness in detecting rare abnormal findings, which are among the most important components of medical reports.

Retrieval vs. Generation. It is worth noting that on CX-CHR, Retrieval achieves higher automatic evaluation scores (Table 2, 7th row) but lower medical term detection precision (Table 3, 2nd column) than Generation. Note that Retrieval evaluates the retrieval policy module of HRGR-Agent by masking out the results of the generation module. The result shows that simply composing templates that mostly describe normal medical findings can lead to high automatic evaluation scores, since the majority of reports describe normal cases. However, this kind of retrieval-based approach lacks the capability to detect significant but rare abnormal findings. On the other hand, the high medical abnormality term detection precision and low average false positive of HRGR-Agent verify that its generation module learns to describe abnormal findings. The win-win combination of the retrieval policy module and the generation module leads to the state-of-the-art performance of HRGR-Agent, surpassing a generative model (Generation) that is trained purely without any retrieval mechanism.

Human Evaluation. Table 3 (last row) shows the average human preference percentage of HRGR-Agent compared with Generation and CoAtt on CX-CHR and IU X-Ray respectively, evaluated in terms of content coverage, specific terminology accuracy, and language fluency. HRGR-Agent achieves much higher human preference than the baseline models, showing that it is able to generate natural and plausible reports that humans prefer.

Qualitative Analysis. Figure 3 shows qualitative results of HRGR-Agent and baseline models on the IU X-Ray dataset. The reports of HRGR-Agent are generally longer than those of the baseline models, and strike a good balance between templates and generated sentences. Moreover, among the generated sentences, HRGR-Agent has a higher rate of detecting abnormal findings.

Conclusion

In this paper, we introduce a novel Hybrid Retrieval-Generation Reinforced Agent (HRGR-Agent) to perform robust medical image report generation. Our approach is the first attempt to bridge human prior knowledge and generative neural networks via reinforcement learning. Experiments show that HRGR-Agent not only achieves state-of-the-art performance on two medical image report datasets, but also generates robust reports that have high precision on medical abnormal findings detection and the best human preference.

Appendix A Policy Update Algorithm

Appendix B DenseNet Pretraining

We pretrain a DenseNet on multi-label classification with the publicly available ChestX-ray8 dataset, and fine-tune it on the CX-CHR dataset with 20 common thorax disease labels. The ChestX-ray8 dataset comprises 108,948 frontal-view X-ray images of 32,717 unique patients, with each image labeled with the occurrence of 14 common thorax diseases, where labels were text-mined from the associated radiological reports using natural language processing. We expand the 14 labels with 6 additional labels text-mined from the CX-CHR dataset for fine-tuning. The additional 6 labels are: tortuous aortic sclerosis, bronchitis, calcification, tuberculosis, interstitial lung disease, and patchy consolidation.

We implement our model in PyTorch and train on a single GeForce GTX TITAN GPU. We add an additional lateral layer, as in Feature Pyramid Network, for the last three dense blocks, and additional convolutional layers to transform the feature dimension to 256. We extract features from the last convolutional layer of the second dense block, which yields 16 $\times$ 16 $\times$ 256 feature maps. Compared with features extracted directly from the last layer of DenseNet (e.g., 16 $\times$ 16 $\times$ 1024 feature maps), these feature maps contain higher-resolution details and more location information without expanding the total feature size. We use an initial learning rate of 0.1 and multiply it by 0.1 every 10 epochs. We train for 30 epochs and select the best model by validation performance. The classification model achieves a 78.00% AUC score.
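The step schedule described above (start at 0.1, multiply by 0.1 every 10 epochs, train for 30 epochs) amounts to the following small function, written in plain Python for illustration. The name `step_lr` is hypothetical, and the zero-based epoch indexing is an assumption.

```python
def step_lr(epoch, base_lr=0.1, factor=0.1, step=10):
    """Learning rate at a zero-based epoch under a step decay schedule."""
    return base_lr * factor ** (epoch // step)


# Epochs 0-9 use 0.1, epochs 10-19 use 0.01, epochs 20-29 use 0.001.
schedule = [step_lr(epoch) for epoch in range(30)]
```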

Appendix C Template Database

Table 4 shows examples from the template database of the CX-CHR dataset. The template databases are built by selecting the most frequent sentences above a threshold in the training corpus and grouping sentences that have the same meaning but slightly different wording. The document frequency thresholds for the IU X-Ray and CX-CHR datasets are 100 and 500, respectively.