RV-GAN: Segmenting Retinal Vascular Structure in Fundus Photographs Using a Novel Multiscale Generative Adversarial Network

Table of contents

1. Network Architecture

2. Loss function

3. Dataset

3.1 Dataset Introduction

3.2 Data preprocessing

3.3 Parameter initialization

4. Experimental results

5. Conclusion


Conference: MICCAI | Release time: 2021/1

Authors: Sharif Amit Kamran1, Khondker Fariha Hossain1, Alireza Tavakkoli1, Stewart Lee Zuckerbrod2, Kenton M. Sanders3, and Salah A. Baker3

Introduction of MICCAI:

MICCAI is a leading venue and bellwether in the field of medical image analysis, with strong international influence and high academic authority. It is an internationally recognized, top-tier academic conference in medical image computing and computer-assisted intervention.

1. Network Architecture

Overview: In order to better perform pixel-by-pixel segmentation, this paper designs an architecture that extracts both global and local features from images. Specifically, it uses two generators and two discriminators. The fine generator Gf segments images by capturing local information such as small vessel branches, whereas the coarse generator Gc learns and preserves global information, such as the structure of the larger macular branches, while yielding a less detailed microvessel segmentation. To enable the overall adversarial training, each generator is paired with its own discriminator.

 

In detail:

For Gc, the input has two parts: the original image (256, 256, 3) and the mask (256, 256, 1). These two inputs are first concatenated; the result is then given a 3×3 mirror (reflection) padding and passed through a convolution, followed by normalization and a LeakyReLU activation. Next come two downsampling blocks; each time the feature map is downsampled, its number of channels is doubled. These are followed by nine residual blocks. Then there are two upsampling operations: the output of the residual blocks is first decoded, and the decoded result is added element-wise to the corresponding encoder feature map on the left side of the network; this sum serves as the input of the next upsampling stage, which is handled in the same way. The resulting feature map is used in two places: one branch is fed to Gf as the input of its residual blocks, and the other receives another 3×3 mirror padding, a convolution, and a tanh activation. The activated result is the generator's output segmentation map and also the input of the discriminator Dc.
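To make this data flow concrete, here is a minimal Keras sketch of a coarse generator in this spirit. It is only an illustration under assumptions: it uses plain Conv2D/BatchNormalization blocks instead of the paper's actual encoder, decoder, SFA and residual block designs, and the kernel sizes and helper names (reflect_pad, residual_block, build_coarse_generator) are hypothetical.

```python
import tensorflow as tf
from tensorflow.keras import layers

def reflect_pad(x, p):
    # mirror padding of p pixels on each spatial side
    return tf.pad(x, [[0, 0], [p, p], [p, p], [0, 0]], mode="REFLECT")

def residual_block(x, ch):
    # simplified residual block (the paper's block internals differ)
    y = layers.Conv2D(ch, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU(0.2)(y)
    y = layers.Conv2D(ch, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.Add()([x, y])

def build_coarse_generator(ch=64):
    img = layers.Input((256, 256, 3))
    mask = layers.Input((256, 256, 1))
    x = layers.Concatenate()([img, mask])                # concatenate image and mask
    x = layers.Lambda(lambda t: reflect_pad(t, 3))(x)    # mirror padding, then convolution
    x = layers.Conv2D(ch, 7, padding="valid")(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU(0.2)(x)
    skips = [x]
    for _ in range(2):                                   # two downsampling blocks
        ch *= 2                                          # channels double each time
        x = layers.Conv2D(ch, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
        skips.append(x)
    for _ in range(9):                                   # nine residual blocks
        x = residual_block(x, ch)
    for i in range(2):                                   # two upsampling blocks
        ch //= 2
        x = layers.Conv2DTranspose(ch, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
        x = layers.Add()([x, skips[-(i + 2)]])           # additive skip connection
    x_coarse = x                                         # (256, 256, 64), passed on to Gf
    out = layers.Lambda(lambda t: reflect_pad(t, 3))(x)
    out = layers.Conv2D(1, 7, padding="valid")(out)
    out = layers.Activation("tanh")(out)                 # segmentation map, fed to Dc
    return tf.keras.Model([img, mask], [out, x_coarse])
```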

For Gf, there are three inputs: the original image (512, 512, 3), the mask (512, 512, 1), and x_coarse (256, 256, 64) passed in from Gc. The subsequent operations are roughly the same as in Gc. The differences are, first, the larger input image size and the extra input coming from Gc; and second, that there is only one downsampling layer and one upsampling layer, only one SFA module, and only three residual blocks. Overall, Gf performs fewer operations on the image than Gc does.

For Df, there are two inputs: the fundus image (512, 512, 3) and the label (512, 512, 1). These two inputs are first concatenated; then follow three downsampling-plus-residual blocks and three upsampling blocks, six stages in total. In the code, a list of feature maps is maintained, meaning that the feature map produced by each stage is appended to this list (these maps are later reused by the feature matching loss). After these stages, the implementation ends with a Conv2D convolution and a tanh activation.
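Below is a correspondingly minimal sketch of such an encoder-decoder discriminator that keeps a list of intermediate feature maps, as described above. The block internals, kernel sizes, and names (residual_block, build_fine_discriminator) are simplified assumptions rather than the paper's exact design.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, ch):
    # simplified residual block, as in the generator sketch above
    y = layers.Conv2D(ch, 3, padding="same")(x)
    y = layers.LeakyReLU(0.2)(y)
    y = layers.Conv2D(ch, 3, padding="same")(y)
    return layers.Add()([x, y])

def build_fine_discriminator(ch=64):
    img = layers.Input((512, 512, 3))
    seg = layers.Input((512, 512, 1))
    x = layers.Concatenate()([img, seg])          # concatenate fundus image and label
    feats = []                                    # feature maps saved for the feature matching loss
    for _ in range(3):                            # three downsampling + residual stages
        x = layers.Conv2D(ch, 3, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
        x = residual_block(x, ch)
        feats.append(x)
        ch *= 2
    for _ in range(3):                            # three upsampling stages
        ch //= 2
        x = layers.Conv2DTranspose(ch, 3, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
        feats.append(x)
    out = layers.Conv2D(1, 3, padding="same", activation="tanh")(x)   # final conv + tanh
    return tf.keras.Model([img, seg], [out] + feats)
```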

For Dc, the operations are similar to those of Df.

2. Loss function

Feature matching loss: the feature matching loss supports semantic segmentation by comparing features extracted from the discriminator.

Through successive downsampling and upsampling we lose basic spatial information and features; this is why different components of the overall architecture need to be assigned different weights. To overcome this, a new weighted feature matching loss is proposed, shown in Equations 1 and 2, which combines elements from the encoder and decoder and prioritizes specific features. In our case, experiments show that giving larger weights to the decoder feature maps leads to better vessel segmentation.

For (1), k indexes the pixels and N is the number of feature maps. For the encoder of the discriminator, two inputs are compared: the original image paired with the ground-truth label, and the original image paired with the image produced by the generator. The values at each corresponding position of the two resulting feature maps are subtracted and a norm of this element-wise difference is taken (absolute values, summation, and a square root in the paper's notation); the result is finally divided by N to obtain the average over the feature maps.

For (2), the same operation as in (1) is applied to the decoder feature maps for the different inputs, and the encoder and decoder parts are then added together, each scaled by its own hyperparameter.

The loss is computed by extracting features from each downsampling and upsampling block of the discriminator's encoder and decoder, feeding in the real and synthetic segmentation maps in turn. N represents the number of feature maps. Here, λenc and λdec are internal weight multipliers for each extracted feature map; the weight values lie in [0, 1] and sum to 1, and a higher weight is used for the decoder feature maps than for the encoder feature maps.
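The following is a minimal sketch of how such a weighted feature matching term could be computed from the discriminator's saved encoder and decoder feature maps; the exact norm and normalization of the paper's Equations (1)-(2) may differ, and the function name and arguments are hypothetical.

```python
import tensorflow as tf

def weighted_feature_matching_loss(enc_real, enc_fake, dec_real, dec_fake,
                                   lambda_enc=0.4, lambda_dec=0.6):
    # enc_real / enc_fake: lists of encoder feature maps for (x, y) and (x, G(x))
    # dec_real / dec_fake: the corresponding decoder feature maps
    def mean_abs_diff(real_list, fake_list):
        n = float(len(real_list))
        return tf.add_n([tf.reduce_mean(tf.abs(r - f))
                         for r, f in zip(real_list, fake_list)]) / n

    # decoder feature maps receive the larger weight, as stated above
    return (lambda_enc * mean_abs_diff(enc_real, enc_fake) +
            lambda_dec * mean_abs_diff(dec_real, dec_fake))
```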

For (3), wherever a value of the feature map produced by the discriminator is greater than 1 or less than -1, the loss is 0; that is, these are treated as confident predictions. For values between -1 and 1 the loss is non-zero, i.e. the prediction is uncertain, and training must keep reducing this loss towards its minimum.

For (4), the image generated by the generator and its label are fed into the discriminator, and one of the discriminator's outputs is a feature map. (4) takes the average of the values in that feature map.

For (5), the adversarial loss is the sum of the two parts in (3) and (4).
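As a rough illustration of how the terms described in (3)-(5) can be combined, a hedged TensorFlow sketch follows; the sign conventions and the margin here are a simplification and may differ from the paper's exact hinge formulation.

```python
import tensorflow as tf

def discriminator_adversarial_loss(d_real, d_fake):
    # (3): hinge-style term that becomes zero once the prediction clears the +/-1 margin
    hinge_term = tf.reduce_mean(tf.nn.relu(1.0 - d_real))
    # (4): mean of the discriminator's output feature map for the (x, G(x)) pair
    fake_term = tf.reduce_mean(d_fake)
    # (5): sum of the two parts
    return hinge_term + fake_term
```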

We first train the discriminator on the real fundus image x and the real segmentation map y. After that, we train it with the real fundus image x and the synthetic segmentation map G(x). We train the discriminators Df and Dc on batches of training data for several iterations. Next, we train Gc while keeping the discriminator weights fixed. In the same manner, we train Gf on batches of training images while keeping all discriminator weights fixed.

The generator also incorporates a reconstruction loss (mean squared error), shown in Equation (6). This loss helps ensure that the synthetic images contain more realistic microvessel, artery, and vessel structures.

 

By adding equations 2, 5 and 6, we can formulate our final objective function as equation (7)
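Putting the pieces together, a hedged sketch of the generator-side objective might look as follows; the adversarial term (negative mean of the discriminator output on the synthetic pair) and the exact composition of Equation (7) are assumptions, based on the loss weights reported in Section 3.3.

```python
import tensorflow as tf

def generator_objective(d_fake, y_real, y_fake, wfm_loss,
                        lambda_adv=10.0, lambda_rec=10.0, lambda_wfm=10.0):
    adv_term = -tf.reduce_mean(d_fake)                     # adversarial term (from Eq. 5)
    rec_term = tf.reduce_mean(tf.square(y_real - y_fake))  # Eq. (6): MSE reconstruction loss
    # Eq. (7): weighted sum of adversarial, weighted feature matching and reconstruction terms
    return lambda_adv * adv_term + lambda_wfm * wfm_loss + lambda_rec * rec_term
```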

 

3. Dataset

We use three public retina datasets, DRIVE, CHASE-DB1 and STARE, whose images are provided as .tif (565 × 584), .jpg (999 × 960) and .ppm (700 × 605) files respectively.

3.1 Dataset Introduction

The DRIVE dataset was created for comparative studies of blood vessel segmentation in retinal images; its data come from a diabetic retinopathy screening programme. It contains 40 retinal images, 20 for training and 20 for testing, with an original image size of 565 × 584.

CHASE-DB1: 20 images for training and 8 for testing; the original image size is 999 × 960.

The STARE dataset stems from a project initiated by Michael Goldbaum in 1975 and was first used and published in a paper by Hoover et al. in 2000. It is a color fundus image database for retinal vessel segmentation, containing 20 fundus images, 10 with lesions and 10 without, at a resolution of 605 × 700. Each image comes with manual segmentations from two experts, which makes it one of the most commonly used fundus image benchmarks. The database itself provides no masks, so they have to be created manually. It has since been extended to 40 hand-annotated images for vessel segmentation and 80 for optic nerve detection. Here, 16 images are used for training and 4 for testing.

3.2 Data preprocessing

We train three different RV-GAN networks, one per dataset, using 5-fold cross-validation. For training and validation we use overlapping image patches of size 128 × 128 extracted with a stride of 32. This yields 4,320 patches for STARE, 15,120 for CHASE-DB1 and 4,200 for DRIVE; the overlapping patches act as data augmentation.
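A minimal NumPy sketch of this overlapping-patch extraction is shown below; the function name and the exact border handling are assumptions, not the authors' preprocessing code.

```python
import numpy as np

def extract_patches(image, patch_size=128, stride=32):
    # slide a patch_size x patch_size window across the image with the given stride
    h, w = image.shape[:2]
    patches, coords = [], []
    for top in range(0, h - patch_size + 1, stride):
        for left in range(0, w - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size])
            coords.append((top, left))
    return np.stack(patches), coords
```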

The DRIVE dataset comes with official FoV masks for the test images. For the CHASE-DB1 and STARE datasets, we generate FoV masks following Li et al. [16]. For testing, overlapping image patches with a stride of 3 are extracted from the 20, 8 and 4 test images of DRIVE, CHASE-DB1 and STARE respectively, and the overlapping predictions are averaged.
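The averaging of overlapping predictions at test time could be sketched as follows; the helper name, the single-channel prediction maps, and the coords argument (the (top, left) offsets, e.g. as returned by the extract_patches sketch above) are all hypothetical.

```python
import numpy as np

def reassemble_patches(pred_patches, coords, image_shape, patch_size=128):
    # average overlapping single-channel prediction patches back into a full-size map
    acc = np.zeros(image_shape, dtype=np.float32)
    cnt = np.zeros(image_shape, dtype=np.float32)
    for p, (top, left) in zip(pred_patches, coords):
        acc[top:top + patch_size, left:left + patch_size] += p
        cnt[top:top + patch_size, left:left + patch_size] += 1.0
    return acc / np.maximum(cnt, 1.0)
```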

3.3 Parameter initialization

For adversarial training, we use the hinge loss. We choose λenc = 0.4 (Eq. 1), λdec = 0.6 (Eq. 2), λadv = 10 (Eq. 5), λrec = 10 (Eq. 6) and λwfm = 10 (Eq. 7). We use the Adam optimizer with learning rate α = 0.0002, β1 = 0.5 and β2 = 0.999. We train for 100 epochs with mini-batches of size b = 24, in three stages, using TensorFlow. Training the model on an Nvidia P100 GPU takes 24-48 hours depending on the dataset; because DRIVE and STARE have fewer patches than CHASE-DB1, their training load is smaller. Inference takes 0.025 seconds per image.
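For reference, the reported hyperparameters translate into a TensorFlow setup roughly like the one below (a sketch of the settings, not the authors' training script):

```python
import tensorflow as tf

# Adam optimizer with the learning rate and betas reported above
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)

EPOCHS = 100
BATCH_SIZE = 24
LAMBDA_ENC, LAMBDA_DEC = 0.4, 0.6            # weighted feature matching weights (Eqs. 1-2)
LAMBDA_ADV = LAMBDA_REC = LAMBDA_WFM = 10.0  # loss weights (Eqs. 5-7)
```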

4. Experimental results

Our model outperforms UNet-derived architectures and recent GAN-based models in terms of AUC-ROC, Mean-IoU and SSIM (the three main metrics for this task). M-GAN achieves better specificity and accuracy on CHASE-DB1 and STARE.

5 Conclusion

In this paper, we propose RV-GAN, a new multi-scale generative architecture. Combined with our novel weighted feature matching loss, the architecture synthesizes accurate segmentations of venule structure with high confidence scores on two relevant metrics. The architecture can therefore be adopted effectively in various applications in ophthalmology. The model is best suited for analyzing retinal degenerative diseases and monitoring future prognosis. We hope to extend this work to other data modalities.

Origin blog.csdn.net/weixin_51781852/article/details/126203826