【Paper Notes】Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation (CVPR 2021)

Original paper: https://arxiv.org/pdf/2106.06963.pdf

Reference: https://blog.csdn.net/qq_45645521/article/details/123493075

Prior knowledge: these persimmons are red, so they must be ripe (an inference made before tasting).
Posterior knowledge: I just ate one of these persimmons and it was ripe (knowledge gained from direct observation).

 

Abstract

Posterior-and-Prior Knowledge Exploring-and-Distilling approach (PPKED) imitates the working patterns of radiologists, who:

  • first examine the abnormal regions

  • assign the disease topic tags to the abnormal regions

  • PPKED includes three modules:

    • Posterior Knowledge Explorer (PoKE)
      • explores the posterior knowledge
      • provides explicit abnormal visual regions
      • alleviates the visual data bias
      • explores posterior knowledge with a bag of disease topics, capturing the rare, diverse, and important abnormal regions
    • Prior Knowledge Explorer (PrKE)
      • explores the prior knowledge from the prior medical knowledge graph (PrMK, $G_{\text{Pr}}$) and prior radiology reports (prior working experience, PrWE, $W_{\text{Pr}}$)
      • alleviates the textual data bias
    • Multi-domain Knowledge Distiller (MKD)
      • generates the final reports
      • distills the explored posterior and prior knowledge to generate reports
      • adaptive distilling attention (ADA)
        • makes the model adaptively learn to distill correlated knowledge

 

Introduction

directly applying image captioning approaches to radiology images has problems:

  • visual data bias - the visual distribution is unbalanced (normal images dominate)
  • textual data bias - reports contain too many descriptions of normal findings

 

Related Works

Image Captioning

encoder-decoder framework - translates the image into a single descriptive sentence

radiology report generation - aims to generate a long paragraph consisting of multiple structured sentences

  • each one focusing on a specific medical observation for a specific region in the radiology image

 

Image Paragraph Generation

  • in a natural image paragraph: each sentence has equal importance
  • in a radiology report: generating abnormalities should be emphasized more than normalities

 

Radiology Report Generation

explore and distill the posterior and prior knowledge for accurate radiology report generation

  1. for the network structure: explore the posterior knowledge of input radiology image by proposing to explicitly extract the abnormal regions
  2. leverage the retrieved reports and medical knowledge graph to model the prior working experience and prior medical knowledge
  3. retrieve a large amount of similar reports
  4. treat the retrieved reports as latent guidance
    (rather than as fixed templates, which would introduce inevitable errors)

 

Posterior-and-Prior Knowledge Exploring-and-Distilling (PPKED)

PPKED = PoKE (Posterior Knowledge Explorer) + PrKE (Prior Knowledge Explorer) + MKD (Multi-domain Knowledge Distiller)

  • PoKE: explores the posterior knowledge by extracting the explicit abnormal regions
  • PrKE: explores the relevant prior knowledge for the input image
  • MKD: distills the accurate posterior and prior knowledge and adaptively merges them to generate accurate reports

 

Backgrounds

Problem Formulation

$$\text{PoKE}:\{I,T\}\to I'$$
$$\text{PrKE}:\{I',W_{\text{Pr}}\}\to W'_{\text{Pr}};\quad \{I',G_{\text{Pr}}\}\to G'_{\text{Pr}}$$
$$\text{MKD}:\{I',W'_{\text{Pr}},G'_{\text{Pr}}\}\to R$$

 

Information Sources

  • $I$: adopt ResNet-152 to extract 2048 $7\times 7$ image feature maps, which are further projected into 512 $7\times 7$ feature maps, resulting in $I=\{i_1,i_2,...,i_{N_1}\}\in \mathbb{R}^{N_1 \times d}$ ($N_1=49$, $d=512$)

  • $T$: topic bag (common abnormality topics or findings)

    • $T=\{t_1,t_2,...,t_{N_T}\}\in \mathbb{R}^{N_T \times d}$
    • $t_i\in\mathbb{R}^d$: the word embedding of the $i^{th}$ topic
  • $W_{\text{Pr}}$: the reports of the top-$N_K$ retrieved images are returned and encoded as $W_{\text{Pr}}=\{R_1,R_2,...,R_{N_K}\}\in\mathbb{R}^{N_K\times d}$

    • a BERT encoder followed by a max-pooling layer over all output vectors serves as the report embedding module, producing the embedding $R_i\in\mathbb{R}^d$ of the $i^{th}$ retrieved report
    • prior working experience: extract an image embedding from the last average pooling layer of ResNet-152 for every image in the corpus; for a given input image, find the $N_K=100$ corpus images with the highest cosine similarity and encode their reports with BERT and max-pooling to obtain the working experience (see the retrieval sketch after this list)
  • $G_{\text{Pr}}$:

    1. build a universal graph $G_{\text{Uni}}=(V,E)$ that models the domain-specific prior knowledge structure
    2. compose a graph that covers the most common abnormalities or findings
    3. connect nodes with bidirectional edges
      • nodes $V$: the $N_T$ common topics in $T$
    4. acquire a set of node embeddings $V'=\{v_1',v_2',...,v'_{N_T}\}\in \mathbb{R}^{N_T\times d}$ encoded by a graph embedding module
      • based on the graph convolution operation
    • prior medical knowledge: construct a medical graph whose nodes are the topics in the bag of words, grouped according to their related organs and body parts; topics grouped together are connected by edges, and the prior medical knowledge is extracted with a graph convolutional network
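
As a concrete illustration of the retrieval step above, here is a minimal PyTorch sketch (an illustration, not the authors' code; the function name and the precomputed `corpus_feats` / `corpus_report_embs` tensors are hypothetical):

```python
import torch
import torch.nn.functional as F

def retrieve_prior_reports(img_feat, corpus_feats, corpus_report_embs, k=100):
    """Return W_Pr: embeddings of the reports whose images are the
    top-k cosine-similarity neighbours of the input image."""
    # img_feat: (d_img,) average-pooled ResNet-152 feature of the input image
    # corpus_feats: (M, d_img) pooled features of all corpus images
    # corpus_report_embs: (M, d) BERT + max-pooled embeddings of their reports
    sims = F.cosine_similarity(img_feat.unsqueeze(0), corpus_feats, dim=-1)  # (M,)
    topk = sims.topk(k).indices      # indices of the N_K = 100 nearest images
    return corpus_report_embs[topk]  # W_Pr in R^{N_K x d}
```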

 

Basic Module

Multi-Head Attention (MHA)

The MHA consists of n parallel heads and each head is defined as a scaled dot-product attention:
$$\text{Att}_i(X,Y)=\text{softmax}\left(\frac{X\text{W}_i^\text{Q}(Y\text{W}_i^\text{K})^T}{\sqrt{d_n}}\right)Y\text{W}_i^\text{V}$$
$$\text{MHA}(X,Y)=[\text{Att}_1(X,Y);...;\text{Att}_n(X,Y)]\text{W}^{\text{O}}$$

  • $X\in\mathbb{R}^{l_x \times d}$: the Query matrix

  • $Y\in\mathbb{R}^{l_y \times d}$: the Key/Value matrix

  • $\text{W}_i^\text{Q},\text{W}_i^\text{K},\text{W}_i^\text{V}\in\mathbb{R}^{d\times d_n}$, $\text{W}^\text{O}\in \mathbb{R}^{d\times d}$: learnable parameters
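
A minimal PyTorch sketch of the MHA defined above (an illustration, not the paper's code; the per-head projections are stacked into single $d\times d$ linear layers, which is the standard equivalent formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHA(nn.Module):
    def __init__(self, d=512, n=8):
        super().__init__()
        assert d % n == 0
        self.n, self.dn = n, d // n
        self.Wq = nn.Linear(d, d, bias=False)  # stacks W_i^Q over all heads
        self.Wk = nn.Linear(d, d, bias=False)  # stacks W_i^K
        self.Wv = nn.Linear(d, d, bias=False)  # stacks W_i^V
        self.Wo = nn.Linear(d, d, bias=False)  # W^O

    def forward(self, X, Y):
        # X: (l_x, d) Query matrix; Y: (l_y, d) Key/Value matrix
        lx, ly, d = X.size(0), Y.size(0), X.size(1)
        Q = self.Wq(X).view(lx, self.n, self.dn).transpose(0, 1)  # (n, l_x, d_n)
        K = self.Wk(Y).view(ly, self.n, self.dn).transpose(0, 1)  # (n, l_y, d_n)
        V = self.Wv(Y).view(ly, self.n, self.dn).transpose(0, 1)  # (n, l_y, d_n)
        # Att_i(X, Y) = softmax(Q K^T / sqrt(d_n)) V, computed per head
        A = F.softmax(Q @ K.transpose(-2, -1) / self.dn ** 0.5, dim=-1)
        heads = (A @ V).transpose(0, 1).reshape(lx, d)  # concatenate the n heads
        return self.Wo(heads)  # [Att_1(X,Y); ...; Att_n(X,Y)] W^O
```

For example, `MHA()(I, T)` would attend the 49 image-region features over the topic bag, as PoKE does below.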

 

Feed-Forward Network (FFN)

$$\text{FFN}(x)=\max(0,x\text{W}_\text{f}+\text{b}_\text{f})\text{W}_\text{ff}+\text{b}_\text{ff}$$

  • $\max(0,*)$: ReLU activation function
  • $\text{W}_\text{f} \in \mathbb{R}^{d\times 4d}$ and $\text{W}_\text{ff} \in \mathbb{R}^{4d\times d}$: learnable matrices for linear transformations
  • $\text{b}_\text{f}$ and $\text{b}_\text{ff}$: bias terms
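
And the matching FFN sketch, following the $d \to 4d \to d$ shapes listed above:

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.f = nn.Linear(d, 4 * d)   # W_f, b_f
        self.ff = nn.Linear(4 * d, d)  # W_ff, b_ff

    def forward(self, x):
        # FFN(x) = max(0, x W_f + b_f) W_ff + b_ff
        return self.ff(torch.relu(self.f(x)))
```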

 

Motivation

  • MHA computes the association weights between different features
    • allows probabilistic many-to-many relations

apply MHA to correlate the posterior and prior knowledge for the input radiology image, as well as to distill useful knowledge to generate accurate reports

 

Posterior Knowledge Explorer (PoKE)

extract the posterior knowledge (i.e., abnormal regions) from the input image:

$$\hat{T}=\text{FFN}(\text{MHA}(I,T));\quad \hat{I}=\text{FFN}(\text{MHA}(\hat{T},I))$$

the image features $I\in\mathbb{R}^{N_1\times d}$ are first used to find the most relevant topics and filter out the irrelevant ones, resulting in $\hat{T}\in\mathbb{R}^{N_1\times d}$. Then the attended topics $\hat{T}$ are further used to mine the topic-related image features $\hat{I}\in\mathbb{R}^{N_1\times d}$

in other words, the abnormal topics contained in the bag-of-words are used to find the abnormal regions in the image

align the attended abnormal regions with the relevant topics

  • need to filter out the irrelevant topics

 


since $\hat{I}$ and $\hat{T}$ are aligned, we directly add them up to acquire the posterior knowledge of the input image:

$$I'=\text{LayerNorm}(\hat{I}+\hat{T})$$

  • $\text{LayerNorm}$: layer normalization
  • $I'$: the first impression of radiologists after checking the abnormal regions
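
Putting the two attention equations and the LayerNorm fusion together, a minimal sketch that reuses the `MHA`/`FFN` modules above (the exact module wiring is an assumption for illustration):

```python
import torch.nn as nn

class PoKE(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.topic_attn, self.region_attn = MHA(d), MHA(d)
        self.ffn_t, self.ffn_i = FFN(d), FFN(d)
        self.norm = nn.LayerNorm(d)

    def forward(self, I, T):
        # I: (N_1, d) image features; T: (N_T, d) topic-bag embeddings
        T_hat = self.ffn_t(self.topic_attn(I, T))       # attended topics, (N_1, d)
        I_hat = self.ffn_i(self.region_attn(T_hat, I))  # attended regions, (N_1, d)
        return self.norm(I_hat + T_hat)                 # I': posterior knowledge
```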

 

Prior Knowledge Explorer (PrKE)

PrKE consists of a Prior Working Experience component and a Prior Medical Knowledge component

  • they obtain prior knowledge from the prior medical knowledge graph and the existing radiology report corpus, represented as $G_{\text{Pr}}$ and $W_{\text{Pr}}$ respectively
  • $W'_{\text{Pr}}$ and $G'_{\text{Pr}}$: the attended prior working experience and prior medical knowledge relating to the abnormal regions of the input image
  • $I'\in\mathbb{R}^{N_\text{I} \times d}$: Query
  • $W_{\text{Pr}} \in\mathbb{R}^{N_\text{K} \times d}$ and $G_{\text{Pr}} \in\mathbb{R}^{N_\text{T} \times d}$: Key/Value matrices in their respective attention operations

$$W'_{\text{Pr}}=\text{FFN}(\text{MHA}(I',W_{\text{Pr}}));\quad G'_{\text{Pr}}=\text{FFN}(\text{MHA}(I',G_{\text{Pr}}))$$

  • $W'_{\text{Pr}} \in\mathbb{R}^{N_\text{I} \times d}$ and $G'_{\text{Pr}} \in\mathbb{R}^{N_\text{I} \times d}$: a set of attended prior knowledge related to the abnormalities of the input image
    • have the potential to alleviate the textual data bias

processing the posterior knowledge from PoKE through these two components yields the prior knowledge relevant to the abnormal regions of the input image
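
The same pattern applies here, with $I'$ querying the two prior-knowledge memories (a sketch under the shapes above):

```python
import torch.nn as nn

class PrKE(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.attn_w, self.attn_g = MHA(d), MHA(d)
        self.ffn_w, self.ffn_g = FFN(d), FFN(d)

    def forward(self, I_prime, W_pr, G_pr):
        # I_prime: (N_I, d); W_pr: (N_K, d); G_pr: (N_T, d)
        W_prime = self.ffn_w(self.attn_w(I_prime, W_pr))  # W'_Pr: (N_I, d)
        G_prime = self.ffn_g(self.attn_g(I_prime, G_pr))  # G'_Pr: (N_I, d)
        return W_prime, G_prime
```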

 

Multi-domain Knowledge Distiller (MKD)

Performs as a decoder to generate the final radiology report.
It takes the embedding of the current input word $x_t=w_t+e_t$ as input:

  • $w_t$: word embedding
  • $e_t$: fixed position embedding

$$h_t = \text{MHA}(x_t,x_{1:t})$$

Then employ the proposed Adaptive Distilling Attention (ADA) to distill the useful and related knowledge:

$$h_t'=\text{ADA}(h_t,I',G'_{\text{Pr}},W'_{\text{Pr}})$$

Finally, $h_t'$ is passed to an FFN and a linear layer to predict the next word:

$$y_t\sim p_t= \text{softmax}(\text{FFN}(h'_t)\text{W}_p+\text{b}_p)$$

  • $\text{W}_p$ and $\text{b}_p$: learnable parameters
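
A sketch of one decoding step, wired up from the modules above (`ADA` is sketched in the subsection below; the vocabulary size and the omission of masking and residual connections are simplifications of this note, not the paper's):

```python
import torch
import torch.nn as nn

class MKDStep(nn.Module):
    def __init__(self, d=512, vocab=10000):  # vocabulary size is an assumption
        super().__init__()
        self.self_attn, self.ada, self.ffn = MHA(d), ADA(d), FFN(d)
        self.proj = nn.Linear(d, vocab)  # W_p, b_p

    def forward(self, x_t, x_hist, I_prime, G_prime, W_prime):
        # x_t: (1, d) current input embedding w_t + e_t; x_hist: (t, d)
        h_t = self.self_attn(x_t, x_hist)               # h_t = MHA(x_t, x_{1:t})
        h_t = self.ada(h_t, I_prime, G_prime, W_prime)  # h'_t
        return torch.softmax(self.proj(self.ffn(h_t)), dim=-1)  # p_t over vocab
```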

 

train the PPKED by minimizing the cross-entropy loss:
$$L_{\text{CE}}(\theta)=-\sum_{i=1}^{N_R}\log\left(p_\theta(y_i^*\mid y_{1:i-1}^*)\right)$$

  • $R^*=\{y_1^*,y_2^*,...,y_{N_R}^*\}$: the ground truth report
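
In PyTorch this is the standard token-level cross-entropy under teacher forcing; a tiny sketch with placeholder shapes:

```python
import torch
import torch.nn as nn

vocab, N_R = 10000, 60                    # assumed vocabulary size and report length
logits = torch.randn(N_R, vocab)          # pre-softmax scores from the MKD (placeholder)
target = torch.randint(0, vocab, (N_R,))  # ground-truth token ids y_1*, ..., y_{N_R}*

# L_CE = -sum_i log p_theta(y_i* | y_{1:i-1}*); CrossEntropyLoss applies
# log-softmax internally, so it is fed raw logits.
loss = nn.CrossEntropyLoss(reduction='sum')(logits, target)
```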

 

Adaptive Distilling Attention (ADA)

makes the model adaptively learn to distill correlated knowledge:

$$\text{ADA}(h_t,I',G'_{\text{Pr}},W'_{\text{Pr}})=\text{MHA}(h_t,I'+\lambda_1\odot G'_{\text{Pr}}+\lambda_2\odot W'_{\text{Pr}})$$
$$\lambda_1,\lambda_2 = \sigma(h_t\text{W}_h\oplus(I'\text{W}_I+G'_{\text{Pr}}\text{W}_G+W'_{\text{Pr}}\text{W}_W))$$

  • $\text{W}_h,\text{W}_I,\text{W}_G,\text{W}_W\in\mathbb{R}^{d\times 2}$: learnable parameters
  • $\odot$: element-wise (Hadamard) multiplication
  • $\sigma$: sigmoid function
  • $\oplus$: matrix-vector addition
  • $\lambda_1,\lambda_2\in [0,1]$: weight the expected importance of $G'_{\text{Pr}}$ and $W'_{\text{Pr}}$ for each target word
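
A sketch of ADA under the definitions above. Reading the gate output as per-row $(\lambda_1,\lambda_2)$ pairs that broadcast over the feature dimension is one interpretation of the element-wise multiplication; the authors' implementation may differ:

```python
import torch
import torch.nn as nn

class ADA(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.mha = MHA(d)
        # W_h, W_I, W_G, W_W in R^{d x 2}
        self.Wh = nn.Linear(d, 2, bias=False)
        self.WI = nn.Linear(d, 2, bias=False)
        self.WG = nn.Linear(d, 2, bias=False)
        self.WW = nn.Linear(d, 2, bias=False)

    def forward(self, h_t, I_prime, G_prime, W_prime):
        # h_t: (1, d); I_prime, G_prime, W_prime: (N_I, d)
        # lambda_1, lambda_2 = sigma(h_t W_h (+) (I' W_I + G' W_G + W' W_W))
        gate = torch.sigmoid(self.Wh(h_t)  # (1, 2), broadcast-added to each row
                             + self.WI(I_prime) + self.WG(G_prime) + self.WW(W_prime))
        lam1, lam2 = gate[:, :1], gate[:, 1:]              # (N_I, 1) each
        fused = I_prime + lam1 * G_prime + lam2 * W_prime  # (N_I, d)
        return self.mha(h_t, fused)                        # h'_t: (1, d)
```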

 

Experiments

datasets: IU-Xray and MIMIC-CXR


 

Quantitative Analysis


 

Posterior Knowledge Explorer

PoKE can better recognize abnormalities


 

Prior Knowledge Explorer

  • PrMK: can help the model learn enriched medical knowledge of the most common abnormalities or findings
  • PrWE: verifies the effectiveness of introducing existing similar reports


 

Multi-domain Knowledge Distiller

based on the Transformer Decoder equipped with the proposed Adaptive Distilling Attention

 

Qualitative Analysis

prove their arguments and verify the effectiveness of the proposed approach in alleviating the data bias problem by exploring and distilling posterior and prior knowledge

 

Conclusion

  • generates meaningful and robust radiology reports supported by accurate abnormal descriptions and regions
  • outperforms previous state-of-the-art models on the two public datasets (IU-Xray and MIMIC-CXR)
