Unbiased Scene Graph Generation in Videos paper explanation

This is a CVPR 2023 paper that builds on the earlier STTran work.

Paper address: https://arxiv.org/pdf/2304.00733.pdf

Code address: https://github.com/sayaknag/unbiasedSGG

Abstract

Generating dynamic scene graphs from videos is complicated and challenging due to the inherent dynamics of a scene, the temporal fluctuation of model predictions, and the long-tailed distribution of visual relationships, on top of the challenges that already exist in image-based scene graph generation (SGG).

Existing dynamic SGG methods focus mainly on capturing spatio-temporal context with complex architectures, without addressing the challenges above, especially the long-tailed distribution of relationships. This often results in biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA (TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation) for unbiased dynamic SGG. TEMPURA enforces object-level temporal consistency through transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using Gaussian mixture models (GMMs). Extensive experiments show that the method achieves significant (in some cases up to 10%) performance gains over existing methods, highlighting its advantage in generating more unbiased scene graphs.

Main contributions

1) TEMPURA models the prediction uncertainty associated with dynamic SGG and attenuates the impact of noisy annotations, resulting in a more unbiased scene graph.

2) Utilizing a novel memory-guided training method, TEMPURA learns to generate more unbiased predicate representations by diffusing knowledge from frequent predicate classes to rare predicate classes.

3) Leveraging a Transformer-based sequence processing mechanism, TEMPURA produces more temporally consistent object classifications, a problem that remains largely unsolved in the SGG literature.

4) Compared with existing state-of-the-art methods, TEMPURA achieves significant performance improvements in mean-Recall@K, highlighting its advantage in generating more unbiased scene graphs.

Challenges identified

1. (a) The long-tail distribution of predicate classes in Action Genome. (b) The visual relationship (predicate) classification performance of STTran and TRACE, two state-of-the-art dynamic SGG methods, drops significantly on the tail classes.

2. Action Genome's noisy scene graph annotations increase the uncertainty of the predicted scene graphs.

3. Occlusion and motion blur caused by moving objects in videos prevent existing object detectors, such as Faster R-CNN, from producing consistent object classifications.

Overview

To generate more unbiased scene graphs from videos, the challenges highlighted in Figures 1, 2, and 3 must be addressed. To this end, we propose TEMPURA for unbiased dynamic SGG. As shown in Figure 4, TEMPURA works with a Predicate Embedding Generator (PEG), which can be adopted from any existing dynamic SGG model. Since transformer-based models have proven to be strong spatio-temporal learners, the PEG is modeled as a spatio-temporal Transformer built on the vanilla Transformer architecture. The Object Sequence Processing Unit (OSPU) makes object classification consistent over time, while the Memory Diffusion Unit (MDU) and the Gaussian Mixture Model (GMM) heads address the long-tail bias and the overall noise in video SGG data, respectively. The following sections describe these units in more detail, along with the training and testing details of TEMPURA.

Figure 4. Framework of TEMPURA. The object detector generates initial object proposals for each RGB frame in the video. These proposals are passed to the OSPU, where they are first linked into sequences based on the object detector's confidence scores. The sequences are processed by a Transformer encoder to produce temporally consistent object embeddings for improved object classification. The proposals and semantic information of each subject-object pair are passed to the PEG to generate a spatio-temporal representation of the relationship between them. As a spatio-temporal Transformer, the PEG's encoder learns the spatial context of relationships, and its decoder learns their temporal dependencies. Due to the long-tail nature of the relationship/predicate classes, a memory bank combined with the MDU is used during training to debias the PEG, enabling it to produce more generalized predicate embeddings. Finally, K GMM heads classify the PEG embeddings and model the uncertainty associated with each predicate class for a given subject-object pair.
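To make this data flow concrete, here is a minimal forward-pass sketch in PyTorch-style pseudocode. Every module name and signature below is a hypothetical stand-in for this post, not the authors' actual API.

```python
# Illustrative forward pass of the TEMPURA pipeline (hypothetical API).
def tempura_forward(video_frames, detector, ospu, peg, mdu, gmm_heads,
                    memory_bank, training=True):
    # 1. Per-frame object proposals from an off-the-shelf detector.
    proposals = [detector(frame) for frame in video_frames]

    # 2. OSPU: link proposals across frames by detector confidence, then
    #    encode the sequences for temporally consistent object embeddings.
    object_sequences = ospu.link(proposals)
    object_embeddings = ospu.encode(object_sequences)

    # 3. PEG: spatio-temporal Transformer over subject-object pairs; the
    #    encoder captures spatial context, the decoder temporal dependencies.
    r_tem = peg(object_embeddings, proposals)

    # 4. MDU (training only): debias the embeddings via the memory bank.
    if training and memory_bank is not None:
        r_hat = mdu(r_tem, memory_bank)
    else:
        r_hat = r_tem  # MDU is bypassed at test time and in the first epoch

    # 5. GMM heads: predicate classification with per-class uncertainty.
    return gmm_heads(r_hat)
```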

Main modules

The PEG is very similar to the spatio-temporal Transformer in the earlier STTran paper, so it is not described in detail here.

Memory-guided training

Due to the long-tail bias in SGG datasets, the raw PEG embeddings R_{tem} are biased towards the data-rich predicate classes, so they need to be debiased. For any given relation embedding r^{j}_{tem} \in R_{tem}, the Memory Diffusion Unit (MDU) first retrieves relevant information from a memory bank of predicate class prototypes \Omega_R and uses it to enrich r^{j}_{tem}, producing a more balanced embedding \hat{r}^{j}_{tem}. The memory bank \boldsymbol{\Omega}_R=\{\boldsymbol{\omega}_p\}_{p=1}^{\mathcal{C}_r} consists of a set of memory prototypes, each an abstraction of one predicate class, computed as a function of its corresponding PEG embeddings.

In the paper, the prototype is defined as the centroid of a specific category.

\boldsymbol{\omega}_p = \frac{1}{N_{y_{r_p}}} \sum_{j=1}^{N_{y_{r_p}}} \boldsymbol{r}_{tem}^{j} \quad \forall p \in \mathcal{Y}_r,

where N_{y_{r_p}} is the total number of subject-object pairs in the entire training set that map to the predicate class y_{r_p}.
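As a concrete illustration, here is a minimal PyTorch sketch of this centroid computation. The function name and tensor layout are assumptions, and it treats each subject-object pair as mapping to a single predicate label for simplicity.

```python
import torch

def compute_memory_prototypes(r_tem, labels, num_classes):
    """Memory prototypes as per-class centroids of PEG embeddings.

    r_tem:   (N, d) PEG relation embeddings collected over the training set
    labels:  (N,)   predicate class index of each embedding
    returns: (num_classes, d) memory bank Omega_R
    """
    omega = torch.zeros(num_classes, r_tem.size(1))
    for p in range(num_classes):
        mask = labels == p
        if mask.any():
            omega[p] = r_tem[mask].mean(dim=0)  # centroid of class p
    return omega
```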

Progressive memory computation. \Omega_R is computed progressively: the memory for the current epoch is computed with the model weights from the previous one, i.e., the memory for epoch α is computed using the weights from epoch α−1. This makes \Omega_R more refined with each epoch. Since no memory is available at the first epoch, the MDU stays inactive then, and \hat{r}^{j}_{tem} = r^{j}_{tem}.
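A minimal training-loop skeleton of this progressive scheme might look as follows, with collect_peg_embeddings and train_one_epoch as hypothetical helpers:

```python
memory_bank = None  # no memory before the first epoch, so the MDU is inactive

for epoch in range(num_epochs):
    if epoch > 0:
        # Rebuild Omega_R with the weights learned in epoch (epoch - 1),
        # so the prototypes are refined a little more every epoch.
        with torch.no_grad():
            r_all, y_all = collect_peg_embeddings(model, train_loader)
            memory_bank = compute_memory_prototypes(r_all, y_all, num_classes)
    train_one_epoch(model, train_loader, memory_bank)
```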

MDU: the purpose is to obtain a more balanced relation embedding

As shown in the structure diagram, for a given query the MDU uses an attention operator to retrieve relevant information from \Omega_R as the memory-diffused feature:

\boldsymbol{r}_{mem}^j=\mathbb{A}(\boldsymbol{Q}W_Q^{mem},\boldsymbol{K}W_K^{mem},\boldsymbol{V}W_V^{mem}), (10)

where Q = r^{j}_{tem} and K = V = \Omega_R.
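A minimal single-head sketch of this attention operation, assuming standard scaled dot-product attention (the paper may use a multi-head variant):

```python
import torch.nn.functional as F

def memory_diffusion(r_tem, omega, W_Q, W_K, W_V):
    """Eq. 10 as single-head scaled dot-product attention: the query is the
    relation embedding, the keys/values are the memory prototypes Omega_R."""
    Q = r_tem @ W_Q   # (B, d)   projected queries
    K = omega @ W_K   # (C_r, d) projected prototype keys
    V = omega @ W_V   # (C_r, d) projected prototype values
    attn = F.softmax(Q @ K.t() / K.size(-1) ** 0.5, dim=-1)  # (B, C_r)
    return attn @ V   # r_mem: the memory-diffused feature
```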

Since multiple predicates can map to the same subject-object pair, many visual relationships share similar characteristics, which means their corresponding memory prototypes \omega_p share multiple predicate embeddings. The attention operation of Eq. 10 therefore uses the memory bank to transfer knowledge from data-rich classes to data-poor ones, so that r^{j}_{mem} carries compensating information about the data-poor classes that is missing from r^{j}_{tem}. Diffusing this information back into r^{j}_{tem} yields the balanced embedding \hat{r}^{j}_{tem}, as shown in the figure below:

MDU structure diagram:

As shown in Figure 4, the MDU is used only during training: it is not kept as a forward network module at test time but acts as a meta-learning-inspired structural regularizer. Since \Omega_R is computed directly from the PEG embeddings, backpropagating through the MDU refines the computed memory prototypes, enabling better information diffusion and essentially teaching the PEG to generate more balanced embeddings that do not underfit the data-poor relationships. Here λ acts as a gradient scaling factor. Since the initial PEG embeddings are heavily biased towards the data-rich classes, if λ is too high the compensating effect of the memory-diffused features is greatly reduced; if λ is too low, too much knowledge is transferred from data-rich to data-poor classes, hurting performance on the former.
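The post does not reproduce the exact combination formula, but one plausible reading consistent with the behavior described above is that λ scales the original embedding before the memory-diffused feature is added back. The sketch below is an assumption, not the paper's verified equation.

```python
def diffuse_back(r_tem, r_mem, lam):
    """Hypothetical diffusion step: lam scales the (biased) original
    embedding, so a high lam weakens the compensating effect of r_mem,
    while a low lam transfers more knowledge from data-rich to data-poor
    classes. An assumption, not the paper's verified formula."""
    return lam * r_tem + r_mem
```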

GMM

If you are interested, you can look into it yourself; I haven't studied it closely yet.
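For completeness, here is a generic mixture-density classification head of the kind Figure 4 describes (K Gaussian components per predicate class, with the mixture variance serving as an uncertainty estimate). Everything below is an illustrative sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GMMHead(nn.Module):
    """Generic mixture-density head: for each predicate class it predicts
    K mixture weights, means, and variances, so the predicted variance can
    serve as an uncertainty estimate. Illustrative only."""

    def __init__(self, d_model, num_classes, k=4):
        super().__init__()
        self.k, self.c = k, num_classes
        self.pi = nn.Linear(d_model, num_classes * k)       # mixture weights
        self.mu = nn.Linear(d_model, num_classes * k)       # component means
        self.log_var = nn.Linear(d_model, num_classes * k)  # component log-variances

    def forward(self, r_hat):
        b = r_hat.size(0)
        pi = torch.softmax(self.pi(r_hat).view(b, self.c, self.k), dim=-1)
        mu = self.mu(r_hat).view(b, self.c, self.k)
        var = self.log_var(r_hat).view(b, self.c, self.k).exp()
        # Mixture mean as the class score; mixture variance as uncertainty:
        # Var = sum(pi * (var + mu^2)) - (sum(pi * mu))^2
        score = (pi * mu).sum(-1)
        uncertainty = (pi * (var + mu.pow(2))).sum(-1) - score.pow(2)
        return score, uncertainty
```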
