Towards Adversarial Retinal Image Synthesis

Pedro Costa, Adrian Galdran, Maria Inês Meyer, Michael David Abràmoff, Meindert Niemeijer, Ana Maria Mendonça, Aurélio Campilho

cs.CV cs.LG stat.ML

Introduction

Modern machine learning methods require large amounts of data to be trained. This data is rarely available in the field of medical image analysis, since obtaining clinical annotations is often a costly process. Therefore, the possibility of synthetically generating medical visual data is greatly appealing, and has been explored for years. However, the realistic generation of high-quality medical imagery still remains a complex unsolved challenge for current computer vision methods.

Early methods for medical image generation consisted of digital phantoms, following simplified mathematical models of human anatomy. These models slowly evolved into more complex techniques, able to reliably model relevant aspects of the different acquisition devices. When combined with anatomical and physiological information arising from expert medical knowledge, realistic images can be produced. These are useful for validating image analysis techniques, for medical training, for therapy planning, and for a wide range of other applications.

However, the traditional top-down approach of observing the available data and formulating mathematical models that explain it (image simulation) requires modeling complex natural laws through unavoidably simplifying assumptions. More recently, a new paradigm has arisen in the field of medical image generation, exploiting the bottom-up approach of learning the relevant information directly from the data. This is achieved with machine learning systems able to automatically learn the inner variability of a large training dataset. Once trained, the same system can be sampled to output a new but plausible image (image synthesis).

In the general computer vision field, the synthesis of natural images has recently experienced dramatic progress, based on the general idea of adversarial learning. In this context, a generator component synthesizes images from random noise, and an auxiliary discriminator system trained on real data is assigned the task of discerning whether the generated data is real or not. In the training process, the generator is expected to learn to produce images that pose an increasingly difficult classification problem for the discriminator.

Although adversarial techniques have achieved great success in the generation of natural images, their application to medical imaging is still incipient. This is partially due to the lack of large amounts of training data, and partially to the difficulty of finely controlling the output of the adversarial generator. In this work, we propose to apply the adversarial learning framework to retinal images. Notably, instead of generating images from scratch, we propose to generate new plausible images from binary retinal vessel trees. Therefore, the task of the generator remains achievable, as it only needs to learn how to generate part of the retinal content, such as the optic disc or the texture of the background (Figure 1).

The remainder of this work is organized as follows: we first describe a recent generative adversarial framework that can be employed on pairs of vessel trees and retinal images to learn how to map the former to the latter. Then, we briefly review U-Net, a Deep Convolutional Neural Network architecture designed for image segmentation, which allows us to generate pairs of retinal images and corresponding binary vessel trees. This model provides us with a dataset of vessel trees and corresponding retinal images that we then use to train an adversarial model, producing new good-quality retinal images out of a new vessel tree. Finally, the quality of the generated images is evaluated qualitatively and quantitatively, and a description of potential future research directions is presented.

Adversarial Retinal Image Synthesis

Image-to-image translation is a relatively recent computer vision task in which the goal is to learn a mapping $G$, called the Generator, from an image $x$ into another representation $y$. Once the model has been trained, it is able to predict the most likely representation $G(x_{new})$ for a previously unseen image $x_{new}$.

However, for many problems a single input image can correspond to many different correct representations. If we consider the mapping $G$ between a retinal vessel tree $v$ and a corresponding retinal fundus image $r$, variations in color or illumination may produce many acceptable retinal images that correspond to the same vessel tree, i.e. $G(v)=\{r_{1},r_{2},\ldots,r_{n}\}$. Directly related to this is the choice of the objective function to be minimized while learning $G$, which turns out to be critical. Training a model to naively minimize the $L2$ distance between $G(v_{i})$ and $r_{i}$ for a collection of training pairs given by $\{(r_{1},v_{1}),\ldots,(r_{n},v_{n})\}$ is known to produce low-quality results that lack detail, because the model selects an average of many equally valid representations.

Instead of explicitly defining a particular loss function for each task, it is possible to employ Generative Adversarial Networks to implicitly build a more appropriate loss. In this case, the learning process attempts to maximize the misclassification error of a neural network (called the Discriminator, $D$) that is trained jointly with $G$, but with the goal of discriminating between real and generated images. This way, not only $G$ but also the loss are progressively learned from examples, and adapt to each other: while $G$ tries to generate increasingly plausible representations $G(v_{i})$ that can deceive $D$, $D$ becomes better at its task, thereby improving the ability of $G$ to generate high-quality samples. Specifically, the adversarial loss is defined by:
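Assuming the standard conditional formulation, in which $D$ receives a vessel tree together with either the real image or the generated one, this loss can be written as:

$$\mathcal{L}_{adv}(G,D)=\mathbb{E}_{v,r}\left[\log D(v,r)\right]+\mathbb{E}_{v}\left[\log\left(1-D(v,G(v))\right)\right] \qquad (1)$$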

To generate realistic retinal images from binary vessel trees, we follow recent ideas that propose combining the adversarial loss with a global $L1$ loss to produce sharper results. Thus, the loss function to optimize becomes:
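Assuming the usual min-max form of this combination, and with $\mathcal{L}_{adv}(G,D)$ denoting the adversarial loss, the objective can be written as:

$$G^{*}=\arg\min_{G}\max_{D}\;\mathcal{L}_{adv}(G,D)+\lambda\,\mathbb{E}_{v,r}\left[\lVert r-G(v)\rVert_{1}\right] \qquad (2)$$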

where $\lambda$ balances the contribution of the two losses. The goal of the learning process is thus to find an equilibrium of this expression. The discriminator $D$ attempts to maximize eq. (2) by classifying each $N\times N$ patch of a retinal image, deciding if it comes from a real or synthetic image, while the generator aims at minimizing it. The $L1$ loss controls low-frequency information in images generated by $G$ in order to produce globally consistent results, while the adversarial loss promotes sharp results. Once $G$ is trained, it is able to produce a realistic retinal image from a new binary vessel tree.
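As a rough numerical illustration of how the two terms interact, the sketch below evaluates an adversarial term and an $L1$ term on toy values. The function names, the toy discriminator outputs, and the value of `lam` are assumptions made for illustration only; they are not taken from the released code.

```python
import math

def adversarial_loss(d_real, d_fake):
    """Mean of log D(v, r) plus mean of log(1 - D(v, G(v))).

    d_real, d_fake: lists of discriminator probabilities in (0, 1),
    one per image patch (hypothetical toy values here).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def l1_loss(real, generated):
    """Mean absolute difference between two flattened images."""
    return sum(abs(a - b) for a, b in zip(real, generated)) / len(real)

def generator_objective(d_fake, real, generated, lam):
    """Part of the objective that depends on G: fake term plus lam * L1."""
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return fake_term + lam * l1_loss(real, generated)

# Toy example: D is fairly sure about real patches and unsure about fakes.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.4, 0.5, 0.3]
r = [0.2, 0.5, 0.7, 0.1]      # flattened real image
g = [0.25, 0.45, 0.6, 0.1]    # flattened generated image
lam = 100.0                   # assumed weight, not reported in the text

print(adversarial_loss(d_real, d_fake))
print(generator_objective(d_fake, r, g, lam))
```

With a large `lam`, the $L1$ term dominates the generator objective, so the weight effectively sets how strongly pixel-wise agreement is enforced relative to the adversarial term.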

2.2 Obtaining Training Data

The model described above requires training data in the form of pairs of binary retinal vessel trees and corresponding retinal images. Since no such large-scale, manually annotated database is available, we apply a state-of-the-art retinal vessel segmentation algorithm to obtain enough data for the model to learn the mapping from vessel trees to retinal images. There exist a large number of methods capable of providing reliable retinal vessel segmentations. Here we employ a supervised method based on Convolutional Neural Networks (CNNs), namely the U-Net architecture, first proposed for the segmentation of biomedical images. This technique is an extension of the idea of Fully-Convolutional Networks, adapted to be trained with a low number of images and to produce more precise segmentations.

The architecture of the U-Net consists of a downsampling and an upsampling block. The first half of the network follows a typical CNN architecture, with stacked convolutional layers of stride two and Rectified Linear Unit (ReLU) activations. The second part of the architecture upsamples the input feature map symmetrically to the downsampling path. The feature map of the last layer of the downsampling path is upsampled so that it has the same dimensions as the second-to-last layer. The result is concatenated with the feature map of the corresponding layer in the downsampling path, and this new feature map undergoes convolution and activation. This is repeated until the upsampling path layers reach the same dimensions as the first layer of the network.
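The bookkeeping of spatial sizes and concatenated channels described above can be traced with a short sketch. The channel counts, the doubling rule, and the depth used below are illustrative assumptions, not the configuration used in this work.

```python
def unet_shapes(input_size, base_channels, depth):
    """Trace (size, channels) through a U-Net-like encoder and decoder.

    Assumes each downsampling step halves the spatial size (stride two)
    and doubles the number of channels; both rules are illustrative.
    """
    encoder = []
    size, channels = input_size, base_channels
    for _ in range(depth):
        encoder.append((size, channels))
        size //= 2
        channels *= 2
    bottleneck = (size, channels)

    decoder = []
    for skip_size, skip_channels in reversed(encoder):
        size *= 2                                    # upsample to the skip size
        assert size == skip_size
        concat_channels = channels + skip_channels   # concatenation with the skip
        channels = skip_channels                     # convolution after concatenation
        decoder.append((size, concat_channels, channels))
    return encoder, bottleneck, decoder

encoder, bottleneck, decoder = unet_shapes(input_size=64, base_channels=32, depth=3)
for level in decoder:
    print(level)
```

Each decoder entry lists the spatial size, the channel count right after concatenation, and the channel count after the following convolution, so the last entry matches the size of the first encoder layer.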

The final layer is a convolution followed by a sigmoid activation in order to map each feature vector into vessel/non-vessel classes. The concatenation operation allows for very precise spatial localization, while preserving the coarse-level features learned during the downsampling path. A representation of this architecture as used in the present work is depicted in Figure 3.

2.3 Implementation

For the purpose of retinal vessel segmentation, the DRIVE database was used to train the method described in the previous section. The images and the ground truth annotations were divided into overlapping patches of $64\times 64$ pixels and fed to the U-Net in random order, with 10% of the patches being used for validation. The network was trained using the Adam optimizer and binary cross-entropy as the loss function.
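A minimal sketch of this patch preparation is given below. The stride of 32 pixels, the random seed, and the toy image are assumptions, since only the patch size, the overlap, and the 10% validation fraction are stated above; in practice the same crop would be applied to the image and to its ground truth annotation.

```python
import random

def extract_patches(image, patch_size=64, stride=32):
    """Return overlapping square patches from a 2-D image (list of rows).

    The stride is an assumption: the text only states that patches overlap.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches

def train_val_split(items, val_fraction=0.1, seed=0):
    """Shuffle the items and hold out a fraction of them for validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = int(round(len(items) * val_fraction))
    return items[n_val:], items[:n_val]

# Toy 128x128 "image" whose pixel value encodes its position.
image = [[y * 128 + x for x in range(128)] for y in range(128)]
patches = extract_patches(image)
train, val = train_val_split(patches)
print(len(patches), len(train), len(val))
```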

Retinal vessel segmentation using the U-Net was evaluated on DRIVE’s test set, achieving an AUC of $0.9755$, in line with state-of-the-art results. The binarization threshold maximizing the Youden index was selected. Messidor images were cropped to display only the field of view and downscaled to $512\times 512$. Then, the segmentation method was applied to these images. Messidor contains $1200$ images annotated with the corresponding diabetic retinopathy grade, and displays more color and texture variability than DRIVE’s $20$ training images. Because the U-Net was trained and tested on different datasets, some of the produced segmentations were not entirely correct. This may be related to DRIVE containing only $7$ examples of images with signs of mild diabetic retinopathy (grade 1). For this reason, we decided to retain only pairs of images and vessel trees in which the corresponding image had grade 0, 1, or 2.
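The Youden index is $J = \text{sensitivity} + \text{specificity} - 1$, and the selected threshold is the one maximizing $J$. The sketch below illustrates this selection on made-up pixel scores and labels; the candidate threshold grid is also an assumption.

```python
def youden_threshold(scores, labels, thresholds):
    """Return the threshold maximizing J = sensitivity + specificity - 1."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    best_t, best_j = None, float("-inf")
    for t in thresholds:
        # A pixel is called "vessel" when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Toy vessel probabilities and ground-truth labels (not real data).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
thresholds = [i / 20 for i in range(1, 20)]
best_t, best_j = youden_threshold(scores, labels, thresholds)
print(best_t, best_j)
```

When several thresholds reach the same $J$, this sketch keeps the lowest one, which is a design choice rather than part of the definition.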

The final dataset collected for training our adversarial model consisted of $946$ Messidor image pairs. This dataset was further randomly divided into training ($614$ pairs), validation ($155$ pairs) and test ($177$ pairs) sets. Regarding image resolution, the original model used pairs of $256\times 256$ images, with a U-Net-like generator $G$. We modified the architecture to handle $512\times 512$ pairs, which is closer to the resolution of DRIVE images. For that, we added one layer to the downsampling part and another to the upsampling part of $G$. The discriminator $D$ classifies $16\times 16$ overlapping patches of size $63\times 63$. The implementation was developed in Python using Keras. Code to reproduce our results is available at https://github.com/costapt/vess2ret. The learning process starts by training $D$ with real $(v,r)$ and generated pairs $(v,G(v))$. Then, $G$ is trained with real $(v,r)$ pairs. This process is repeated iteratively until the losses of $D$ and $G$ stabilize.
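The alternating schedule can be summarized with the skeleton below. It is only a sketch: `generate`, `d_step`, and `g_step` are hypothetical placeholders standing in for the actual Keras models and update calls, and the toy stand-ins at the bottom merely exercise the control flow.

```python
def train_adversarial(batches, generate, d_step, g_step, n_epochs=1):
    """Alternate discriminator and generator updates.

    generate(v) returns a synthetic image; d_step and g_step perform one
    update and return the current loss. All three callables are
    hypothetical placeholders for the real models.
    """
    history = []
    for _ in range(n_epochs):
        for v, r in batches:
            fake = generate(v)
            # 1) Update D on a real pair (label 1) and a generated pair (label 0).
            d_loss = 0.5 * (d_step(v, r, 1) + d_step(v, fake, 0))
            # 2) Update G on the real pair: deceive D and stay close to r in L1.
            g_loss = g_step(v, r)
            history.append((d_loss, g_loss))
    return history

# Toy stand-ins that only record the order of the calls.
calls = []

def toy_generate(v):
    return v * 0.5

def toy_d_step(v, image, label):
    calls.append(("D", label))
    return 0.7

def toy_g_step(v, r):
    calls.append(("G", None))
    return 1.2

history = train_adversarial([(1.0, 0.4), (2.0, 0.9)],
                            toy_generate, toy_d_step, toy_g_step)
print(history)
```

In a real run, the returned history would be monitored to decide when the losses of $D$ and $G$ have stabilized.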

Experimental Evaluation

For a subjective visual evaluation of the images generated by our model, we show some results in Figure 4. The first row depicts a random sample of real images extracted from the held-out test set, which was not used during training. The second row shows vessel trees segmented from those images with the method outlined in Section 2.2, and the bottom row shows the synthetic retinal images produced by the proposed technique. We see that the original and the generated images share some global geometric characteristics. This is natural, since they approximately share the same vascular structure. However, the synthetic images have markedly different high-level visual features, such as the color and tone of the image, or the illumination. This information was extracted by our model from the training set, and effectively applied to the input vessel trees in order to produce realistic retinal images.

The first seven columns of Figure 4 show results in which the model behaved as expected: the vessel trees retrieved from the images in the first row were approximately correct, and provided sufficient information for the generator to create new consistent information in the synthetic image, shown in the last row. The last column in Figure 4 shows a failure case of the proposed technique. Therein, the segmentation technique described in Section 2.2 failed to produce a meaningful vessel network out of the original image. This is probably due to the high degree of defocus in the input image. In this situation, the binary vessel tree supplied to the generator contained too little information, leading to the appearance of spurious artifacts and chromatic noise in the synthetic image. Fortunately, the number of cases in which this happened was relatively low: out of our test set of $177$ images, $6$ were found to suffer from artifacts.

Objective image quality verification is known to be a hard challenge when no reference is available. In addition, for generative models it has been recently observed that specialized evaluation should be performed for each problem. In our case, to achieve a meaningful objective quantitative evaluation of the quality of the generated images, we apply two different retinal image quality metrics, namely the $Q_{v}$ score and the Image Structure Clustering (ISC) metric. Both metrics have been employed previously to assess the quality of retinal images. While the $Q_{v}$ score focuses more on the assessment of contrast around vessel pixels, the ISC metric performs a more global evaluation. Thus, together they provide an appropriate mechanism to quantitatively evaluate the correctness of a synthetically generated retinal image.

It is worth noting that in cases where artifacts and distortions were generated due to the undercomplete vessel network problem explained above, the ISC metric tended to artificially raise the quality score of the synthetic image, as compared to the real one. Due to this, synthetic images containing this class of degradations were manually identified and removed from the ISC metric analysis below, together with their real counterparts. A more detailed discussion of both of the employed retinal image quality metrics, and of their behavior when distorted images were supplied to them, is provided in Appendix A, together with supplementary results generated by the proposed technique.

The ISC score was computed on a reduced test set of $171$ images (after removing the $6$ images with visual artifacts), while the $Q_{v}$ score was computed on all the $177$ images. The statistical analysis performed on both quality score distributions showed that both were normal according to the Kolmogorov-Smirnov test. The resulting data was therefore expressed as mean $\pm$ standard deviation, and compared with the paired Student’s t-test. All $p$-values were two-tailed and $p<0.05$ was considered significant. Statistical analyses were performed using GraphPad Prism 7 (GraphPad Software Inc.). Results obtained with this methodology are shown in Table 1.
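For reference, the paired t statistic is the mean of the per-image score differences divided by its standard error, with $n-1$ degrees of freedom. The sketch below computes it with the Python standard library on made-up scores; the numbers are not results from this work, and the two-tailed $p$-value would then be read from the Student's t distribution with a statistics package.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t = mean_d / (sd_d / math.sqrt(n))      # mean difference over its standard error
    return t, n - 1

# Made-up quality scores for five real images and their synthetic counterparts.
real_scores = [0.10, 0.12, 0.11, 0.13, 0.09]
synthetic_scores = [0.08, 0.11, 0.10, 0.10, 0.09]
t_value, dof = paired_t_statistic(real_scores, synthetic_scores)
print(t_value, dof)
```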

In the case of the ISC metric, the synthetic images produced a slightly higher quality score, with the difference between them not statistically significant ($p=0.2188$). For the $Q_{v}$ score, the real images were considered to be of better quality than their synthetic counterparts, the difference being statistically significant ($p<0.05$). However, it should be considered that the $Q_{v}$ score consists of an anisotropy measure weighted by the values of a simple vessel detector (see Appendix A.1). In this case, it can be expected that image regions around the vessels of a synthetic image will probably not be of better quality than the original ones. On the other hand, results on the ISC metric, which has a more global nature, point to a similar quality in the real and synthetic images, which agrees with the subjective visual quality found in the produced images (see Appendix A.2).

Conclusions and Future Work

The above visual and quantitative results demonstrate the feasibility of learning to synthesize new retinal images from a dataset of pairs of retinal vessel trees and corresponding retinal images, applying current generative adversarial models. In addition, the dimension of the produced images was $512\times 512$, which is greater than that of images commonly generated in general computer vision problems. We believe that achieving this resolution was only possible due to the constrained class of images to which the method was applied: contrary to generic natural images, retinal images show a repetitive geometry, where high-level structures such as the field of view, the optic disc, or the macula are usually present in the image, and act as a guide for the model to learn how to produce new texture and background intensities.

The main limitation of the presented method is its dependence on a pre-existing vessel tree in order to generate a new image. Furthermore, if the vessel tree comes from the application of a segmentation technique to the original image, the potential weaknesses of the segmentation algorithm will be inherited by the synthesized image. We are currently working on overcoming these challenges.

This work is financed by the ERDF – European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 Programme, by National Funds through the FCT – Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project CMUP-ERI/TIC/0028/2014 and by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement within the project "NanoSTIMA: Macro-to-Nano Human Sensing: Towards Integrated Multimodal Health Monitoring and Analytics/NORTE-01-0145-FEDER-000016". MDA is the recipient of the Robert C. Watzke Professor of Ophthalmology and Visual Sciences. IDx LLC has no interest in any of the algorithms discussed in this study.

Appendix A Synthetic Retinal Image Quality Evaluation - Discussion

We now discuss the technical details of the two retinal image quality metrics employed in this work. The $Q_{v}$ score is a no-reference quality metric that proceeds by computing a local degree of vesselness around each pixel. This is achieved by building a multiscale version of the input image, represented by the local Hessian matrix around each pixel extracted from the green channel. Frangi’s vesselness measure is then computed, and used as an estimate of visible vessel pixels. Next, an anisotropy measure based on a local Singular Value Decomposition is computed, and the final quality score is obtained as an average of the local anisotropy values weighted by the vesselness map. This way, only vessel pixels are considered in this metric, since these are expected to be good candidates for a reliable contrast and focus estimate.
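To make the two ingredients concrete, the sketch below computes the standard 2-D Frangi vesselness from the entries of a single Hessian matrix and then combines vesselness and anisotropy values into a weighted average. It is a single-scale, per-pixel simplification; the parameters `beta` and `c`, the normalization by the sum of weights, and the toy values are assumptions, not the exact definition of the $Q_{v}$ score.

```python
import math

def hessian_eigenvalues(hxx, hxy, hyy):
    """Eigenvalues of a symmetric 2x2 Hessian, ordered so |lam1| <= |lam2|."""
    mean = 0.5 * (hxx + hyy)
    radius = math.sqrt((0.5 * (hxx - hyy)) ** 2 + hxy ** 2)
    return sorted((mean - radius, mean + radius), key=abs)

def frangi_vesselness(hxx, hxy, hyy, beta=0.5, c=5.0):
    """2-D Frangi vesselness for dark vessels on a brighter background."""
    lam1, lam2 = hessian_eigenvalues(hxx, hxy, hyy)
    if lam2 <= 0:                      # dark tubular structures need lam2 > 0
        return 0.0
    blobness = lam1 / lam2             # close to 0 for line-like structures
    structure = math.sqrt(lam1 ** 2 + lam2 ** 2)
    return (math.exp(-blobness ** 2 / (2 * beta ** 2))
            * (1.0 - math.exp(-structure ** 2 / (2 * c ** 2))))

def vessel_weighted_score(vesselness, anisotropy):
    """Average of anisotropy values weighted by vesselness (assumed form)."""
    total = sum(vesselness)
    if total == 0:
        return 0.0
    return sum(w * a for w, a in zip(vesselness, anisotropy)) / total

line_like = frangi_vesselness(0.0, 0.0, 10.0)    # curvature in one direction only
blob_like = frangi_vesselness(10.0, 0.0, 10.0)   # equal curvature in both directions
print(line_like, blob_like)
print(vessel_weighted_score([1.0, 0.0], [0.8, 0.2]))
```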

The Image Structure Clustering (ISC) metric, on the other hand, follows a substantially different approach. Although it is also a no-reference quality metric, it is trained on a dataset of retinal images. This dataset contained $1000$ images (independent of our training set) that had been previously labeled by medical experts, depending on whether they showed enough visibility to perform diagnosis. The ISC metric assesses a correct distribution of pixel intensities corresponding to the relevant anatomical structures present in the retina. This is achieved by extracting features consisting of intensities and Gaussian derivatives of the $R$, $G$, and $B$ channels, and then employing k-means to group them into $5$ different clusters. These are observed to be sufficient to model the relevant regions of a retinal image (vessels, optic disc, macula, background-to-foreground and foreground-to-background transitions). Histograms of counts of the computed features are then passed to an SVM, which is trained to predict whether the presence and proportion of pixels associated with those structures is consistent with the corresponding quantities in the training set.
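The histogram step can be illustrated as follows: each pixel feature vector is assigned to its nearest cluster centre, and the normalized counts form the descriptor that a classifier would receive. The two-dimensional features and the five centroids below are made-up values, and the SVM stage is omitted from this sketch.

```python
def nearest_cluster(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    distances = [sum((f - c) ** 2 for f, c in zip(feature, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))

def cluster_histogram(features, centroids):
    """Fraction of feature vectors assigned to each cluster."""
    counts = [0] * len(centroids)
    for feature in features:
        counts[nearest_cluster(feature, centroids)] += 1
    total = float(len(features))
    return [count / total for count in counts]

# Five hypothetical cluster centres, as if obtained beforehand with k-means.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
# Toy per-pixel features (not real intensities or Gaussian derivatives).
features = [(0.1, 0.0), (0.9, 0.1), (0.1, 0.9), (0.5, 0.6), (0.45, 0.5), (1.0, 0.9)]
histogram = cluster_histogram(features, centroids)
print(histogram)
```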

Both metrics thus seem quite complementary, since the ISC technique considers regions of the image that are not addressed by the $Q_{v}$ score. In our experiments, however, we noticed that the artifacts produced when the generative model was provided an undercomplete vessel tree tended to raise the ISC score. This drawback was not observed when the $Q_{v}$ score was computed.

We believe that the reason for this is the following: starting from a real image, our method employs the vessel tree extracted from it to synthesize a new image; thus, the amount of vessel pixels present in a real image will always be greater than in the corresponding synthetic image, which favors the real image under the $Q_{v}$ score. The ISC metric does not rely only on vessels, but also on other anatomical structures. In addition, it considers the three color channels, while the $Q_{v}$ score employs only one of them. When supplied an image with artifacts such as those in Figure 5, the ISC score finds that the proportion of colors and edges is not adequate, but still relatively acceptable (note that the scores assigned to the synthetic images are not high in these cases). This situation was detected in only $6$ of the $177$ images present in our test set. Accordingly, for a fair comparison, those images were removed from the statistical experiments that involved the ISC score. Since the $Q_{v}$ score seemed to be unaffected by this problem, we include every test image in its analysis.

We believe that current retinal image quality metrics are reasonably suitable to assess the visual quality of synthetic images. However, the study of the anatomical plausibility of these images may benefit from specifically designed quality metrics, which may involve different aspects (local and global) of existing quality assessment approaches.

A.2 Supplementary Results

Below we show a random sample of the results produced by our model, together with their real counterparts.