Cross-Modal Complementary Network with Hierarchical Fusion for Multimodal Sentiment Classification

1. The problem solved in this paper

  • Strategies that improperly measure the strength of the association between an image and its text may lead to false fusion, i.e., image-text pairs that are not actually related to each other still get fused;
  • Even when a real connection exists, simply concatenating the feature vectors of each modality cannot fully exploit the feature information within a single modality and across modalities.

  Starting from the above two problems, this paper proposes the CMCN model (Cross-Modal Complementary Network with hierarchical fusion). The model structure is as follows:
[Figure: overall architecture of the CMCN model]

  The model is divided into three parts: FEM (Feature Extraction Module), which extracts the features; FAM (Feature Attention Module), which applies attention operations to the image and text features; and CMHF (Cross-Modal Hierarchical Fusion module), which performs the layered fusion.

  The authors believe that text carries higher-level semantic features. For sentiment classification, the text features obtained through the attention mechanism are more discriminative and more semantically rich, so text is treated as the main modality, and the text attention vector is used to guide the generation of the image attention vector.

1.1 FAM

  This module generates the image-text correlation. Its inputs are the encoded original text feature $F_t$ and the feature $F_{ti}$ of the text transcribed from the corresponding image. Using cosine similarity to measure the relevance between the image and the text, the module computes a value $c$:
$c = \cos(F_t, F_{ti}) = \dfrac{F_t \cdot F_{ti}}{\lVert F_t \rVert \, \lVert F_{ti} \rVert}$

[Formula: generation of the image attention vector, weighted by $c$]

$c$ indicates how large a role the text attention vector plays in generating the image attention vector.
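
To make the role of $c$ concrete, here is a minimal PyTorch sketch of the FAM idea as described above: the relevance score comes from cosine similarity, and $c$ scales how strongly the text attention vector guides the image attention vector. The specific attention form (sigmoid gating) and all names here are my own illustration, not the paper's exact formulation.

```python
# Minimal sketch of the FAM idea: cosine similarity between the original text
# feature F_t and the feature of text transcribed from the image F_ti gives a
# relevance score c, which then scales how strongly the text attention vector
# guides the image attention vector. The gating form below is an assumption.
import torch
import torch.nn.functional as F

def image_text_relevance(F_t: torch.Tensor, F_ti: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between text feature and image-transcribed text feature."""
    return F.cosine_similarity(F_t, F_ti, dim=-1)          # shape: (batch,)

def image_attention(F_i: torch.Tensor, F_t_att: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Hypothetical image attention: the text attention vector contributes in
    proportion to the relevance score c (my own simplification)."""
    c = c.unsqueeze(-1)                                     # (batch, 1) for broadcasting
    guidance = c * F_t_att                                  # text guidance weighted by relevance
    attn = torch.sigmoid(F_i + guidance)                    # attention weights over image feature dims
    return attn * F_i                                       # attended image feature F_i_att

# toy shapes: batch of 2, common feature dim 512
F_t, F_ti, F_i, F_t_att = (torch.randn(2, 512) for _ in range(4))
c = image_text_relevance(F_t, F_ti)
F_i_att = image_attention(F_i, F_t_att, c)
print(c.shape, F_i_att.shape)   # torch.Size([2]) torch.Size([2, 512])
```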

1.2 CMHF

This layer consists of four parts. The Upsampling part maps the four features $F_t$, $F_i$, $F_{t_{att}}$, $F_{i_{att}}$ into the same dimensional space;

[Formula: projection of $F_t$, $F_i$, $F_{t_{att}}$, $F_{i_{att}}$ into a common dimensional space]
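
A minimal sketch of the Upsampling step: each of the four features is projected into one common dimension so that they can be fused later. Using a separate linear layer per feature is an assumption on my part; the summary does not state the exact operator.

```python
# Minimal sketch of the "Upsampling" step: project F_t, F_i, F_t_att, F_i_att
# into one common dimension. Linear projections are an assumption.
import torch
import torch.nn as nn

class Upsampling(nn.Module):
    def __init__(self, text_dim: int, img_dim: int, common_dim: int):
        super().__init__()
        self.proj_t     = nn.Linear(text_dim, common_dim)   # F_t     -> common space
        self.proj_i     = nn.Linear(img_dim,  common_dim)   # F_i     -> common space
        self.proj_t_att = nn.Linear(text_dim, common_dim)   # F_t_att -> common space
        self.proj_i_att = nn.Linear(img_dim,  common_dim)   # F_i_att -> common space

    def forward(self, F_t, F_i, F_t_att, F_i_att):
        return (self.proj_t(F_t), self.proj_i(F_i),
                self.proj_t_att(F_t_att), self.proj_i_att(F_i_att))

up = Upsampling(text_dim=768, img_dim=2048, common_dim=512)
F_t, F_t_att = torch.randn(2, 768), torch.randn(2, 768)
F_i, F_i_att = torch.randn(2, 2048), torch.randn(2, 2048)
outs = up(F_t, F_i, F_t_att, F_i_att)
print([o.shape for o in outs])   # four tensors of shape (2, 512)
```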

One level up, four fusion operations are performed: fusion within a modality and fusion between modalities, where $g(\cdot)$ denotes fusion by dot product;
[Formula: the four intra-modal and inter-modal fusion operations $g(\cdot)$]
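
A minimal sketch of the four fusion operations, with $g(\cdot)$ realized as an element-wise (Hadamard) product. Which exact feature pairs form the intra-modal and inter-modal fusions is my own guess from the description above.

```python
# Minimal sketch of the four fusion operations; the exact pairings are assumed.
import torch

def g(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Fusion by element-wise product, as described for g(.)."""
    return a * b

def pairwise_fusion(F_t, F_i, F_t_att, F_i_att):
    H_tt = g(F_t, F_t_att)   # intra-modal: text with its attention vector
    H_ii = g(F_i, F_i_att)   # intra-modal: image with its attention vector
    H_ti = g(F_t, F_i_att)   # inter-modal: text with image attention (assumed pairing)
    H_it = g(F_i, F_t_att)   # inter-modal: image with text attention (assumed pairing)
    return H_tt, H_ii, H_ti, H_it

feats = [torch.randn(2, 512) for _ in range(4)]
fused = pairwise_fusion(*feats)
print([h.shape for h in fused])   # four tensors of shape (2, 512)
```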

One layer up, the global fusion operation is performed: the four vectors obtained in the previous layer are fused into a global feature vector.
[Formula: global fusion of the four vectors into the global feature vector]
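
A minimal sketch of the global fusion layer: the four fused vectors from the previous level are combined into a single global feature vector. Concatenation followed by a linear projection is an assumption; the paper may combine them differently.

```python
# Minimal sketch of global fusion: concatenate the four fused vectors and
# project back to the common dimension (assumed implementation).
import torch
import torch.nn as nn

class GlobalFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(4 * dim, dim)   # 4 fused vectors -> 1 global vector

    def forward(self, H_tt, H_ii, H_ti, H_it):
        H = torch.cat([H_tt, H_ii, H_ti, H_it], dim=-1)   # (batch, 4*dim)
        return torch.tanh(self.proj(H))                    # global feature (batch, dim)

gf = GlobalFusion(dim=512)
H_global = gf(*[torch.randn(2, 512) for _ in range(4)])
print(H_global.shape)   # torch.Size([2, 512])
```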

Four predicted labels are obtained; each is compared against the ground truth with a cross-entropy loss, giving four losses, and the model is trained by jointly optimizing these four losses.
[Formula: joint loss combining the four cross-entropy terms]
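
A minimal sketch of the joint objective: four classifier heads produce four predictions, each scored with cross-entropy against the same sentiment label, and the four losses are summed for joint optimization. Equal loss weights and the exact sources of the four predictions are assumptions on my part.

```python
# Minimal sketch of joint optimization over four cross-entropy losses.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
dim, num_classes, batch = 512, 3, 2
heads = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(4))

# stand-ins for the four features feeding the four classifier heads
features = [torch.randn(batch, dim) for _ in range(4)]
labels = torch.randint(0, num_classes, (batch,))

losses = [criterion(head(f), labels) for head, f in zip(heads, features)]
total_loss = sum(losses)          # joint optimization of the four losses
total_loss.backward()
print([round(l.item(), 3) for l in losses], round(total_loss.item(), 3))
```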

2. The data set used in the experiment
[Table: statistics of the datasets used in the experiments]

3. Experimental results

[Table: experimental results]

4. Summary

  There are some problems with this article. First, regarding the formulas, the dimensions and shapes of the quantities are never stated, which makes the derivations hard to follow; when working through them, the two tensors being combined sometimes appear to have incompatible dimensions, so the computation cannot be reproduced (judging from the model diagram, the intermediate vectors in the article are all one-dimensional vectors).

  At the beginning, the article says that an inappropriate image-text correlation measurement strategy may lead to wrong fusion. My initial impression was that if an image and its text are unrelated, they simply would not be fused, so wrong fusion could not occur. But after reading the formulas, I found this is not the case: every image-text pair in the dataset is still fused.

Origin blog.csdn.net/qq_43775680/article/details/130092588