Original paper: https://arxiv.org/pdf/2106.06963.pdf
Reference: https://blog.csdn.net/qq_45645521/article/details/123493075
Prior knowledge: these persimmons are red, so they must be ripe
Posterior knowledge: I just ate one of the persimmons, and it is indeed ripe
Abstract
Posterior-and-Prior Knowledge Exploring-and-Distilling approach (PPKED)
- imitates the working pattern of radiologists:
  - first examine the abnormal regions
  - then assign disease topic tags to those regions
- includes the following modules:
- Posterior Knowledge Explorer (PoKE)
  - explores the posterior knowledge
  - provides explicit abnormal visual regions
  - alleviates visual data bias
  - explores posterior knowledge using a bag of disease topics, capturing rare, diverse, and important abnormal regions
- Prior Knowledge Explorer (PrKE)
  - explores the prior knowledge from the prior medical knowledge graph (prior medical knowledge, PrMK, $G_{\text{Pr}}$) and prior radiology reports (prior working experience, PrWE, $W_{\text{Pr}}$)
  - alleviates textual data bias
- Multi-domain Knowledge Distiller (MKD)
  - distills the explored knowledge to generate the final reports
  - adaptive distilling attention (ADA): makes the model adaptively learn to distill correlated knowledge
Introduction
directly applying image captioning approaches to radiology images has two problems:
- visual data bias: the visual distribution is unbalanced, with normal images and normal regions dominating
- textual data bias: the reports contain too many normal descriptions
Related Works
Image Captioning
encoder-decoder framework: translates the image into a single descriptive sentence
radiology report generation: aims to generate a long paragraph that consists of multiple structured sentences
- each sentence focuses on a specific medical observation for a specific region in the radiology image
Image Paragraph Generation
- in a natural image paragraph: each sentence has equal importance
- in radiology report: generating abnormalities should be emphasized more than other normalities
Radiology Report Generation
explore and distill the posterior and prior knowledge for accurate radiology report generation
- for the network structure: explore the posterior knowledge of the input radiology image by explicitly extracting the abnormal regions
- leverage the retrieved reports and the medical knowledge graph to model the prior working experience and prior medical knowledge
  - retrieve a large number of similar reports
  - treat the retrieved reports as latent guidance (using them as fixed templates would inevitably introduce errors)
Posterior-and-Prior Knowledge Exploring-and-Distilling (PPKED)
PoKE Posterior Knowledge Explorer + PrKE Prior Knowledge Explorer + MKD Multi-Domain Knowledge Distiller
- PoKE: explores the posterior knowledge by extracting the explicit abnormal regions
- PrKE: explores the relevant prior knowledge for the input image
- MKD: distills accurate posterior and prior knowledge and adaptively merges them to generate accurate reports
Backgrounds
Problem Formulation
$$
\begin{aligned}
\text{PoKE} &: \{I,T\}\to I'; \\
\text{PrKE} &: \{I',W_{\text{Pr}}\}\to W'_{\text{Pr}};\ \{I',G_{\text{Pr}}\}\to G'_{\text{Pr}}; \\
\text{MKD} &: \{I',W'_{\text{Pr}},G'_{\text{Pr}}\}\to R
\end{aligned}
$$
Information Sources
- $I$: adopt ResNet-152 to extract 2048 $7\times 7$ image feature maps, which are further projected into 512 $7\times 7$ feature maps, resulting in $I=\{i_1,i_2,...,i_{N_I}\}\in \mathbb{R}^{N_I \times d}$ ($N_I=49$, $d=512$)
- $T$: topic bag (common abnormality topics or findings)
  - $T=\{t_1,t_2,...,t_{N_T}\}\in \mathbb{R}^{N_T \times d}$
  - $t_i\in\mathbb{R}^d$: the word embedding of the $i^{th}$ topic
- $W_{\text{Pr}}$: the reports of the top-$N_K$ retrieved images are returned and encoded as $W_{\text{Pr}}=\{R_1,R_2,...,R_{N_K}\}\in\mathbb{R}^{N_K\times d}$
  - report embedding module: a BERT encoder followed by a max-pooling layer over all output vectors, which yields the embedding $R_i\in\mathbb{R}^d$ of the $i^{th}$ retrieved report
  - Prior working experience: extract an image embedding from the last average pooling layer of ResNet-152 for every image in the corpus; then, for a given input image, find the 100 images in the corpus with the highest cosine similarity to it, and encode the reports of these 100 retrieved images with BERT and the max-pooling layer to obtain the prior working experience
- $G_{\text{Pr}}$:
  - build a universal graph $G_{\text{Uni}}=(V,E)$ that models the domain-specific prior knowledge structure
  - compose a graph that covers the most common abnormalities or findings
  - connect nodes with bidirectional edges
  - nodes $V$: the $N_T$ common topics in $T$
  - acquire a set of nodes $V'=\{v_1',v_2',...,v_{N_T}'\}\in \mathbb{R}^{N_T\times d}$ encoded by a graph embedding module based on the graph convolution operation
  - Prior medical knowledge: construct a medical graph. The topics in the topic bag are set as nodes and grouped according to their related organs and body parts; topics in the same group are connected by edges, and the prior medical knowledge is extracted by a graph convolutional network
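The two prior-knowledge sources above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `retrieve_top_k`, `max_pool_report`, and `gcn_layer` are hypothetical, the toy vectors stand in for ResNet-152 and BERT outputs, and the symmetric normalisation in `gcn_layer` is one common form of graph convolution; the notes only say that the graph embedding module is "based on the graph convolution operation".

```python
import numpy as np

def retrieve_top_k(query_emb, corpus_embs, k):
    """Indices of the k corpus images most cosine-similar to the query image."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                       # cosine similarity to every corpus image
    return np.argsort(-sims)[:k]       # sorted by descending similarity

def max_pool_report(token_vectors):
    """Max-pool encoder outputs (tokens x d) into one report embedding R_i (d,)."""
    return token_vectors.max(axis=0)

def gcn_layer(node_feats, adjacency, weight):
    """One graph-convolution step (assumed form): ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, a_norm @ node_feats @ weight)

# Toy corpus of four image embeddings and one query embedding.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.1])
top2 = retrieve_top_k(query, corpus, k=2)

# Toy graph: topics 0 and 1 belong to the same organ group, topic 2 is alone.
adjacency = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
topics = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
encoded = gcn_layer(topics, adjacency, np.eye(2))
```

In the toy graph, the two connected topics end up with the same averaged embedding, while the isolated topic keeps its own features.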
Basic Module
Multi-Head Attention (MHA)
The MHA consists of $n$ parallel heads, and each head is defined as a scaled dot-product attention:
$$
\begin{aligned}
\text{Att}_i(X,Y) &= \text{softmax}\left(\frac{X\text{W}_i^\text{Q}(Y\text{W}_i^\text{K})^T}{\sqrt{d_n}}\right)Y\text{W}_i^\text{V} \\
\text{MHA}(X,Y) &= [\text{Att}_1(X,Y);...;\text{Att}_n(X,Y)]\text{W}^{\text{O}}
\end{aligned}
$$
- $X\in\mathbb{R}^{l_x \times d}$: the Query matrix
- $Y\in\mathbb{R}^{l_y \times d}$: the Key/Value matrix
- $\text{W}_i^\text{Q},\text{W}_i^\text{K},\text{W}_i^\text{V}\in\mathbb{R}^{d\times d_n}$, $\text{W}^\text{O}\in \mathbb{R}^{d\times d}$: learnable parameters
- $d_n=d/n$
- $[\cdot;\cdot]$: concatenation operation
Reference on the concatenation operation: https://blog.csdn.net/Frank_LJiang/article/details/104333272
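The MHA definition above maps almost line by line onto NumPy. The following is a minimal sketch, assuming random weights in place of trained parameters and toy sizes ($d=8$, $n=2$); the function names `softmax` and `mha` are chosen here for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mha(X, Y, Wq, Wk, Wv, Wo):
    """MHA(X, Y): Wq, Wk, Wv have shape (n, d, d_n); Wo has shape (d, d)."""
    d_n = Wq.shape[-1]
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        scores = (X @ Wq_i) @ (Y @ Wk_i).T / np.sqrt(d_n)   # (l_x, l_y)
        heads.append(softmax(scores) @ (Y @ Wv_i))            # (l_x, d_n)
    return np.concatenate(heads, axis=-1) @ Wo                # (l_x, d)

rng = np.random.default_rng(0)
d, n, l_x, l_y = 8, 2, 3, 5
d_n = d // n
X = rng.normal(size=(l_x, d))      # Query matrix
Y = rng.normal(size=(l_y, d))      # Key/Value matrix
Wq, Wk, Wv = (rng.normal(size=(n, d, d_n)) for _ in range(3))
Wo = rng.normal(size=(d, d))
out = mha(X, Y, Wq, Wk, Wv, Wo)    # one output row per query row
```

The output always has $l_x$ rows, whatever $l_y$ is, which is why features of different lengths can be correlated this way.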
Feed-Forward Network (FFN)
$$
\text{FFN}(x)=\max(0,x\text{W}_\text{f}+\text{b}_\text{f})\text{W}_\text{ff}+\text{b}_\text{ff}
$$
- $\max(0,*)$: ReLU activation function
- $\text{W}_\text{f} \in \mathbb{R}^{d\times4d}$ & $\text{W}_\text{ff} \in \mathbb{R}^{4d\times d}$: learnable matrices for linear transformation
- $\text{b}_\text{f}$ & $\text{b}_\text{ff}$: bias terms
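As a small worked instance of the FFN formula, the sketch below uses hand-picked weights (an assumption for illustration, not trained values) so that the result can be checked by hand: the hidden layer computes $\text{ReLU}(x)$ and $\text{ReLU}(-x)$, and the output layer subtracts them, which reproduces $x$.

```python
import numpy as np

def ffn(x, W_f, b_f, W_ff, b_ff):
    hidden = np.maximum(0.0, x @ W_f + b_f)   # ReLU, shape (..., 4d)
    return hidden @ W_ff + b_ff               # back to shape (..., d)

d = 2
x = np.array([[1.0, -2.0]])
W_f = np.hstack([np.eye(d), -np.eye(d), np.zeros((d, 2 * d))])   # (d, 4d)
b_f = np.zeros(4 * d)
W_ff = np.vstack([np.eye(d), -np.eye(d), np.zeros((2 * d, d))])  # (4d, d)
b_ff = np.zeros(d)
y = ffn(x, W_f, b_f, W_ff, b_ff)   # ReLU(x) - ReLU(-x), i.e. x itself
```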
Motivation
- MHA computes the association weights between different features
- this allows probabilistic many-to-many relations
apply MHA to correlate the posterior and prior knowledge for the input radiology image, as well as to distill useful knowledge to generate accurate reports
Posterior Knowledge Explorer (PoKE)
extract the posterior knowledge (abnormal regions) from the input image
$$
\begin{aligned}
\hat{T} &= \text{FFN}(\text{MHA}(I,T)); \\
\hat{I} &= \text{FFN}(\text{MHA}(\hat{T},I))
\end{aligned}
$$
the image features $I\in\mathbb{R}^{N_I\times d}$ are first used to find the most relevant topics and filter out the irrelevant topics, resulting in $\hat{T}\in\mathbb{R}^{N_I\times d}$. Then the attended topics $\hat{T}$ are further used to mine topic-related image features $\hat{I}\in\mathbb{R}^{N_I\times d}$
- in other words, the abnormal topics contained in the topic bag are used to find the abnormal regions in the image
- the attended abnormal regions are aligned with the relevant topics, which requires filtering out the irrelevant topics
since $\hat{I}$ and $\hat{T}$ are aligned, we directly add them up to acquire the posterior knowledge of the input image:
$$
I'=\text{LayerNorm}(\hat{I}+\hat{T})
$$
- $\text{LayerNorm}$: layer normalization
- $I'$: the first impression of radiologists after checking the abnormal regions
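The PoKE data flow can be traced with a simplified sketch. The assumptions are significant: `attend` is single-head attention without learned projections (a stand-in for MHA), the FFN defaults to the identity, `layer_norm` has no learnable scale or shift, and $N_T=20$ is an arbitrary toy value. The point is only to show how the shapes line up so that $\hat{I}$ and $\hat{T}$ can be added.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(X, Y):
    """Simplified attention: each row of X queries the rows of Y."""
    return softmax(X @ Y.T / np.sqrt(X.shape[-1])) @ Y

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def poke(I, T, ffn=lambda x: x):
    T_hat = ffn(attend(I, T))        # (N_I, d): topics relevant to each region
    I_hat = ffn(attend(T_hat, I))    # (N_I, d): image features tied to those topics
    return layer_norm(I_hat + T_hat) # posterior knowledge I'

rng = np.random.default_rng(0)
N_I, N_T, d = 49, 20, 512
I = rng.normal(size=(N_I, d))        # stand-in for projected ResNet features
T = rng.normal(size=(N_T, d))        # stand-in for topic embeddings
I_prime = poke(I, T)
```

Because the image features act as the query in the first step, $\hat{T}$ has one row per image region rather than one row per topic, and this is what makes the element-wise sum well defined.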
Prior Knowledge Explorer (PrKE)
PrKE consists of a Prior Working Experience component and a Prior Medical Knowledge component
- both obtain prior knowledge from the existing radiology report corpus and represent it as $W_{\text{Pr}}$ & $G_{\text{Pr}}$
- $W'_{\text{Pr}}$ & $G'_{\text{Pr}}$: the prior working experience and prior medical knowledge relating to the abnormal regions of the input image
- $I'\in\mathbb{R}^{N_\text{I} \times d}$: Query in both attention operations
- $W_{\text{Pr}} \in\mathbb{R}^{N_\text{K} \times d}$: Key/Value for the prior working experience
- $G_{\text{Pr}} \in\mathbb{R}^{N_\text{T} \times d}$: Key/Value for the prior medical knowledge
$$
\begin{aligned}
W'_{\text{Pr}} &= \text{FFN}(\text{MHA}(I',W_{\text{Pr}})) \\
G'_{\text{Pr}} &= \text{FFN}(\text{MHA}(I',G_{\text{Pr}}))
\end{aligned}
$$
- $W'_{\text{Pr}} \in\mathbb{R}^{N_\text{I} \times d}$ & $G'_{\text{Pr}} \in\mathbb{R}^{N_\text{I} \times d}$: sets of attended prior knowledge related to the abnormalities of the input image
- have the potential to alleviate the textual data bias
By using the posterior knowledge $I'$ from PoKE as the query in these two components, the prior knowledge related to the abnormal regions of the input image is obtained
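A quick shape check makes the role of the query explicit. The sketch below again uses a simplified single-head attention without learned weights or FFN (an assumption, as are the toy sizes $d=64$ and $N_T=20$); it shows that $N_K$ retrieved reports and $N_T$ graph nodes are both summarised into $N_I$ rows, one per image region.

```python
import numpy as np

def attend(X, Y):
    """Simplified attention returning the attended values and the weights."""
    scores = X @ Y.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ Y, w

rng = np.random.default_rng(0)
N_I, N_K, N_T, d = 49, 100, 20, 64
I_prime = rng.normal(size=(N_I, d))   # posterior knowledge from PoKE (query)
W_pr = rng.normal(size=(N_K, d))      # encoded retrieved reports
G_pr = rng.normal(size=(N_T, d))      # encoded graph nodes

W_pr_att, report_weights = attend(I_prime, W_pr)   # (N_I, d), (N_I, N_K)
G_pr_att, node_weights = attend(I_prime, G_pr)     # (N_I, d), (N_I, N_T)
```

Each row of `report_weights` is a probability distribution over the retrieved reports for one image region.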
Multi-domain Knowledge Distiller (MKD)
Performs as a decoder to generate the final radiology report
at each decoding step $t$, take the embedding of the current input word $x_t=w_t+e_t$ as input:
- $w_t$: word embedding
- $e_t$: fixed position embedding
$$
h_t = \text{MHA}(x_t,x_{1:t})
$$
Then employ the proposed Adaptive Distilling Attention (ADA) to distill the useful and related knowledge:
$$
h_t'=\text{ADA}(h_t,I',G'_{\text{Pr}},W'_{\text{Pr}})
$$
Finally, $h_t'$ is passed to an FFN and a linear layer to predict the next word:
$$
y_t\sim p_t= \text{softmax}(\text{FFN}(h'_t)\text{W}_p+\text{b}_p)
$$
- $\text{W}_p$ & $\text{b}_p$: learnable parameters
train the PPKED by minimizing the cross-entropy loss:
$$
L_{\text{CE}}(\theta)=-\sum_{i=1}^{N_R}\log(p_\theta(y_i^*|y_{1:i-1}^*))
$$
- $R^*=\{y_1^*,y_2^*,...,y_{N_R}^*\}$: ground truth report
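The cross-entropy loss can be checked on a tiny example. In this sketch the per-step distributions `step_probs` are made-up numbers over a three-word vocabulary; in training they would be the $p_t$ produced by the decoder when it is fed the ground-truth prefix $y^*_{1:i-1}$.

```python
import math

def cross_entropy(step_probs, target_ids):
    """Sum of -log p_theta(y_i* | y_{1:i-1}*) over the ground-truth report."""
    return -sum(math.log(p[y]) for p, y in zip(step_probs, target_ids))

# One predicted distribution per position of a three-word report.
step_probs = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
    [0.125, 0.125, 0.75],
]
target_ids = [0, 1, 0]   # ground-truth word ids y_1*, y_2*, y_3*
loss = cross_entropy(step_probs, target_ids)
```

Here the loss is $-(\log 0.5+\log 0.5+\log 0.125)=5\log 2$; the last position contributes the most because the model assigned the correct word only 0.125 probability.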
Adaptive Distilling Attention (ADA)
make the model adaptively learn to distill correlated knowledge:
$$
\begin{aligned}
\text{ADA}(h_t,I',G'_{\text{Pr}},W'_{\text{Pr}}) &= \text{MHA}(h_t,I'+\lambda_1\odot G'_{\text{Pr}}+\lambda_2\odot W'_{\text{Pr}}) \\
\lambda_1,\lambda_2 &= \sigma(h_t\text{W}_h\oplus(I'\text{W}_I+G'_{\text{Pr}}\text{W}_G+W'_{\text{Pr}}\text{W}_W))
\end{aligned}
$$
- $\text{W}_h,\text{W}_I,\text{W}_G,\text{W}_W\in\mathbb{R}^{d\times 2}$: learnable parameters
- $\odot$: element-wise multiplication (Hadamard product)
- $\sigma$: sigmoid function
- $\oplus$: matrix-vector addition
- $\lambda_1,\lambda_2\in [0,1]$: weight the expected importance of $G'_{\text{Pr}}$ & $W'_{\text{Pr}}$ for each target word
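The gating in ADA can be sketched as follows. Several details are assumptions made for illustration: the two columns of the $d\times 2$ projections are read as the logits of $\lambda_1$ and $\lambda_2$, giving one gate value per image region that is broadcast over the feature dimension; the final MHA is replaced by a simplified single-head attention; and all weights are random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attend(X, Y):
    """Simplified single-head attention standing in for MHA."""
    scores = X @ Y.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ Y

def ada(h_t, I_prime, G_prime, W_prime, W_h, W_I, W_G, W_W):
    # (1, 2) row added to every row of an (N_I, 2) matrix: matrix-vector addition
    gates = sigmoid(h_t @ W_h + (I_prime @ W_I + G_prime @ W_G + W_prime @ W_W))
    lam1, lam2 = gates[:, :1], gates[:, 1:]                 # assumed split, (N_I, 1) each
    memory = I_prime + lam1 * G_prime + lam2 * W_prime      # (N_I, d)
    return attend(h_t, memory), lam1, lam2

rng = np.random.default_rng(0)
N_I, d = 49, 16
h_t = rng.normal(size=(1, d))                               # decoder state at step t
I_prime, G_prime, W_prime = (rng.normal(size=(N_I, d)) for _ in range(3))
W_h, W_I, W_G, W_W = (rng.normal(size=(d, 2)) for _ in range(4))
h_t_new, lam1, lam2 = ada(h_t, I_prime, G_prime, W_prime, W_h, W_I, W_G, W_W)
```

Since the gates depend on $h_t$, the mix of medical knowledge and working experience changes from one target word to the next; with all-zero gate weights both gates sit at 0.5.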
Experiments
datasets: IU-Xray and MIMIC-CXR
Quantitative Analysis
Posterior Knowledge Explorer
PoKE can better recognize abnormalities
Prior Knowledge Explorer
- PrMK: can help the model learn enriched medical knowledge of the most common abnormalities or findings
- PrWE: verifies the effectiveness of introducing existing similar reports
Multi-domain Knowledge Distiller
based on the Transformer Decoder equipped with the proposed Adaptive Distilling Attention
Qualitative Analysis
the qualitative results support the authors' arguments and verify the effectiveness of the proposed approach in alleviating the data bias problem by exploring and distilling posterior and prior knowledge
Conclusion
- generates meaningful and robust radiology reports supported by accurate abnormal descriptions and regions
- outperforms previous state-of-the-art models on the 2 public datasets