Paper Reading - AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake (multimodal dataset DefakeAVMiT + multimodal forgery detection method AVoiD-DF)

1. Paper information

Paper name: AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake

Author team:

 

2. Main innovations

Previous methods focus only on forgery within a single modality. Even methods that use multi-modal data tend to treat the audio signal merely as a supervisory signal, ignoring the possibility that the audio itself is forged.

  • A new multimodal benchmark dataset, DefakeAVMiT, is proposed, which contains sufficient video and audio fake content, with fakes in both modalities.

  • An audiovisual joint learning method for deepfake detection (AVoiD-DF) is proposed, which exploits audiovisual inconsistency for multimodal forgery detection.

3. Method

AVoiD-DF consists of three key parts: a Temporal-Spatial Encoder (TSE), a Multi-Modal Joint-Decoder (MMD), and a Cross-Modal Classifier that takes the output of MMD and performs the final multimodal classification.

1. Temporal-Spatial Encoder (TSE)

 This module consists of two transformer encoders connected in series. Audio and video are first given unified frame sampling and preprocessing. The first (temporal) encoder models the interaction between time steps, i.e. between the temporal embeddings within the same window. The second (spatial) encoder then produces encodings that represent the spatial features at each temporal index, so the output carries spatio-temporal information. The features of the two modalities are then sent in parallel to MMD for multimodal fusion.
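A minimal PyTorch sketch of this temporal-then-spatial encoding, assuming per-frame token embeddings as input; the class name, dimensions, and layer counts are illustrative assumptions, not the paper's exact configuration:

```python
import torch.nn as nn

def _encoder(dim, heads, layers):
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class TemporalSpatialEncoder(nn.Module):
    """Two serial transformer encoders: temporal attention first, then spatial."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        self.temporal_encoder = _encoder(dim, heads, layers)  # attends across time steps
        self.spatial_encoder = _encoder(dim, heads, layers)   # attends across tokens per time index

    def forward(self, x):
        # x: (batch, time, tokens, dim) embeddings of one modality
        # (visual frame patches or audio spectrogram patches)
        b, t, p, d = x.shape
        # temporal attention: each token position forms a sequence over time
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt = self.temporal_encoder(xt).reshape(b, p, t, d).permute(0, 2, 1, 3)
        # spatial attention: attend across tokens within each time index
        xs = xt.reshape(b * t, p, d)
        return self.spatial_encoder(xs).reshape(b, t, p, d)
```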

2. Multi-Modal Joint-Decoder (MMD)

 Modal fusion is performed by the MMD module. The input visual and audio embedding blocks are fed through two parallel decoder channels; each channel has a bidirectional cross-attention (BiCroAtt) module, followed by self-attention blocks and feed-forward layers. BiCroAtt is the key component: it enables information sharing and joint learning between the two modalities.
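A minimal PyTorch sketch of bidirectional cross-attention, assuming standard multi-head attention with residual connections; `BiCroAtt` here is an illustrative re-implementation of the idea, not the paper's exact module:

```python
import torch.nn as nn

class BiCroAtt(nn.Module):
    """Bidirectional cross-attention between visual and audio token sequences."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, v, a):
        # v: (batch, Nv, dim) visual tokens; a: (batch, Na, dim) audio tokens
        # visual queries attend to audio keys/values, and vice versa
        v_out, _ = self.v_from_a(query=v, key=a, value=a)
        a_out, _ = self.a_from_v(query=a, key=v, value=v)
        # residual connections keep each stream's own information
        return v + v_out, a + a_out
```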

BiCroAtt:
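As a sketch, assuming the standard scaled dot-product formulation, cross-attention from the visual stream onto the audio stream can be written as follows (and symmetrically with the roles of the modalities swapped for the other direction):

$$
\mathrm{CroAtt}(Q_v, K_a, V_a) = \mathrm{softmax}\!\left(\frac{Q_v K_a^{\top}}{\sqrt{d_k}}\right) V_a
$$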

 self-attention:
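Assuming the usual formulation, self-attention within a single modality's token sequence $X$ reads:

$$
\mathrm{SelfAtt}(X) = \mathrm{softmax}\!\left(\frac{(XW_Q)(XW_K)^{\top}}{\sqrt{d_k}}\right) XW_V
$$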

 3. Cross-Modal Classifier

Taking the final output of MMD, this module performs the final multimodal (real/fake) classification.
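The post does not detail the classifier head; a plausible minimal sketch, assuming simple concatenation fusion of the pooled decoder outputs followed by an MLP (an assumption, not the paper's exact design):

```python
import torch
import torch.nn as nn

class CrossModalClassifier(nn.Module):
    """Fuse the decoded audio/visual features from MMD and classify real vs. fake."""
    def __init__(self, dim=256, num_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(2 * dim),
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, num_classes),
        )

    def forward(self, v_feat, a_feat):
        # v_feat, a_feat: (batch, dim) pooled outputs of the two decoder channels
        fused = torch.cat([v_feat, a_feat], dim=-1)  # simple concatenation fusion
        return self.head(fused)  # logits for real / fake
```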

 4. Loss function

1) Contrastive loss Lcon: encourages the audio and visual features of matched (genuine) audio-visual pairs to be similar and pushes apart mismatched or forged pairs. Audio-visual matches are positive samples, and the rest are negative samples.

2) Cross-entropy loss Lce: the standard classification loss on the real/fake prediction.

3) Additive Angular Margin Loss (ArcFace loss) Larc: originally proposed for face recognition, it adds an angular margin to the softmax to enlarge inter-class separation.

 

 4) The overall loss is as follows:
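As a sketch, assuming the overall objective is a weighted sum of the three terms above (λ1 and λ2 are hyperparameter weights; the paper's exact combination may differ):

$$
\mathcal{L} = \mathcal{L}_{ce} + \lambda_{1}\,\mathcal{L}_{con} + \lambda_{2}\,\mathcal{L}_{arc}
$$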

 4. Dataset: DefakeAVMiT

 A total of 8 forgery generation techniques are used: 5 visual generation techniques and 3 speech generation techniques. The real videos come from the VidTIMIT dataset, and the fake content is produced with Faceswap (face swapping), DeepFaceLab (high-quality face swapping), Wav2Lip (lip-synced talking-face generation), EVP (audio-driven portrait animation), PC-AVS (lip-synced talking-face generation), SV2TTS (real-time voice cloning, so different speakers can produce the same voice audio), Voice Replay (replaying pre-recorded audio of real people under a false identity), and AV exemplar autoencoders (converting any input speech into an audio-visual stream that imitates a specific target).

 5. Experimental results

1. Detection performance

 2. Generalization

 3. Ablation experiment

Origin blog.csdn.net/qq_43687860/article/details/130943953