Unpaired/Partially/Unsupervised Image Captioning

These notes cover the following three papers:

Unpaired Image Captioning by Language Pivoting (ECCV 2018)

Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data (ECCV 2018)

Unsupervised Image Captioning (CVPR 2019)

 

1. Unpaired Image Captioning by Language Pivoting (ECCV 2018)

Abstract

The authors propose to solve the unpaired image captioning problem, where no paired images and descriptions are available, by means of a pivot language (language pivoting).

Their method can effectively capture the characteristics of an image captioner from the pivot language (Chinese) and align it to the target language (English) using a pivot-target (Chinese-English) parallel sentence corpus.

Introduction

Since the encoder-decoder architecture needs a large number of image-caption pairs for training, and such large-scale labeled data is usually hard to obtain, researchers have begun to exploit unpaired data or semi-supervised methods that reuse paired labeled data from other domains. In this paper, the authors use a source language, Chinese, as a pivot to bridge the gap between the input image and the target-language (English) description. This requires an image-Chinese captioning dataset and a Chinese-English parallel corpus, so that English captions can be generated without any image-English paired data.

The authors say this idea comes from machine translation, where pivot-based methods usually work in two steps: first translate the source language into the pivot language, then translate the pivot language into the target language. However, image captioning differs from machine translation in several ways: (1) the image-Chinese captions and the Chinese-English sentences differ greatly in style and vocabulary distribution; (2) errors made in the source-to-pivot conversion are passed on to the pivot-to-target step.
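To make the two-step pipeline concrete, here is a minimal sketch of pivot-based caption generation at inference time; the class names and methods (ImageToPivotCaptioner.caption, PivotToTargetTranslator.translate) are hypothetical placeholders, not the paper's code:

```python
from typing import Protocol


class ImageToPivotCaptioner(Protocol):
    """Hypothetical interface: image -> pivot-language (Chinese) caption."""
    def caption(self, image) -> str: ...


class PivotToTargetTranslator(Protocol):
    """Hypothetical interface: pivot (Chinese) sentence -> target (English) sentence."""
    def translate(self, sentence: str) -> str: ...


def pivot_caption(image,
                  captioner: ImageToPivotCaptioner,
                  translator: PivotToTargetTranslator) -> str:
    """Two-step pivoting: image i -> pivot caption x -> target caption y.

    Any error made in the image-to-pivot step is passed on to the
    pivot-to-target step, which is one of the issues the paper addresses.
    """
    pivot_sentence = captioner.caption(image)               # x given i
    target_sentence = translator.translate(pivot_sentence)  # y given x
    return target_sentence
```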

The authors use AIC-ICC and AIC-MT as training datasets, and two datasets (MSCOCO and Flickr30K) as validation datasets.

Notation: i is the source image, x is the pivot-language sentence, y is the target-language sentence, and y_hat denotes ground-truth captions in the target language (here, y_hat is taken from randomly sampled MSCOCO training captions and is used to train the autoencoder).
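With this notation, the generic pivot decomposition that motivates the approach can be written as follows (a sketch of the standard pivoting idea; the paper additionally adapts the two models to each other, which this factorization does not show):

```latex
% Generic pivot decomposition over the pivot sentence x
% (a sketch of the standard pivoting idea, not the paper's exact objective):
p(y \mid i) = \sum_{x} p(y \mid x;\, \theta_{x \rightarrow y}) \, p(x \mid i;\, \theta_{i \rightarrow x})
\;\approx\; p(y \mid \hat{x};\, \theta_{x \rightarrow y}),
\qquad \hat{x} = \operatorname*{arg\,max}_{x} \, p(x \mid i;\, \theta_{i \rightarrow x})
```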

 

The idea of the paper is easy to understand; the hard part is linking the Image-to-Pivot and Pivot-to-Target datasets while overcoming their inconsistent language styles and vocabulary distributions.

2. Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data (ECCV 2018)

In this paper, the authors point out that existing captioning models tend to copy sentences or phrases from the training set, so the generated descriptions are usually generic and template-like and lack the ability to produce discriminative captions.

GAN-based captioning models can improve sentence diversity, but they perform relatively poorly on the standard evaluation metrics.

The authors propose attaching a Self-retrieval Module to the Captioning Module in order to generate discriminative captions.
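One simple way such a self-retrieval signal can be implemented is as a retrieval reward in a shared image-text embedding space; the sketch below is an illustrative assumption (the function name, temperature, and REINFORCE usage are not taken from the paper's code):

```python
import torch
import torch.nn.functional as F


def self_retrieval_reward(caption_emb: torch.Tensor,
                          image_embs: torch.Tensor,
                          target_idx: int,
                          temperature: float = 0.1) -> torch.Tensor:
    """Reward a generated caption for retrieving its own image.

    caption_emb: (d,) embedding of the sampled caption.
    image_embs:  (n, d) embeddings of the target image plus n-1 distractors.
    target_idx:  row of image_embs corresponding to the caption's own image.

    The reward is the softmax probability assigned to the correct image under
    cosine similarity; a discriminative caption gets a high reward, which can
    then drive a REINFORCE-style update of the captioner.
    """
    caption_emb = F.normalize(caption_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    sims = image_embs @ caption_emb                 # (n,) cosine similarities
    probs = F.softmax(sims / temperature, dim=0)
    return probs[target_idx]
```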

 

3. Unsupervised Image Captioning (CVPR 2019)

This is a truly unsupervised approach to image captioning: it does not rely on any labeled image-sentence pairs.

Compared with unsupervised machine translation, unsupervised image captioning is more challenging because images and sentences are two different modalities with a large gap between them.

The model consists of an image encoder, a sentence generator, and a sentence discriminator.

Encoder:

Any standard image encoder works; the authors use Inception-V4.

Generator:

A decoder built from an LSTM.

Discriminator:

Also implemented with an LSTM; it is used to distinguish whether a partial sentence is a real sentence from the corpus or one generated by the model.
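A compact PyTorch-style sketch of the generator and discriminator described above; the hidden sizes, the initial-input wiring, and the module names are illustrative assumptions, and the image feature img_feat is assumed to come from a CNN encoder such as the Inception-V4 used in the paper:

```python
import torch
import torch.nn as nn


class SentenceGenerator(nn.Module):
    """LSTM decoder: image feature -> word logits at each time step (sketch)."""
    def __init__(self, feat_dim: int, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.init_proj = nn.Linear(feat_dim, hidden)  # image feature fed as the first input
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, img_feat: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        start = self.init_proj(img_feat).unsqueeze(1)        # (B, 1, H)
        inputs = torch.cat([start, self.embed(tokens)], 1)   # (B, 1+T, H)
        states, _ = self.lstm(inputs)
        return self.out(states)                              # (B, 1+T, vocab)


class SentenceDiscriminator(nn.Module):
    """LSTM that scores, at every step, whether the partial sentence looks real (sketch)."""
    def __init__(self, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        states, _ = self.lstm(self.embed(tokens))
        return torch.sigmoid(self.score(states)).squeeze(-1)  # (B, T) realness scores
```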

 

Training:

Since there is no paired image-sentence data, the model cannot be trained in a supervised way, so the authors design three objectives to realize unsupervised image captioning:

Adversarial Caption Generation: the discriminator is used adversarially, pushing the generator to produce captions that look like real sentences from the corpus.

Visual Concept Distillation: the generator is rewarded when the words it generates match the visual concepts detected in the image, tying the caption to the image content.

Bi-directional Image-Sentence Reconstruction:

Image Reconstruction: reconstruct the image features (rather than the full image) from the generated sentence.

Sentence Reconstruction: the discriminator encodes a sentence and projects it into the common latent space, which can be viewed as an image representation related to the given sentence; the generator then reconstructs the sentence from this representation.
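A hedged sketch of the two reconstruction terms, assuming image features, a sentence latent code produced by the discriminator, and a learned projection layer; the exact layers and loss forms in the paper may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def image_reconstruction_loss(img_feat: torch.Tensor,
                              sentence_latent: torch.Tensor,
                              proj: nn.Linear) -> torch.Tensor:
    """Reconstruct the image *features* (not pixels) from the latent code that
    the discriminator assigns to the caption generated for the image."""
    reconstructed = proj(sentence_latent)  # map the sentence latent back to feature space
    return F.mse_loss(reconstructed, img_feat)


def sentence_reconstruction_loss(word_logits: torch.Tensor,
                                 target_tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the sentence re-generated from a corpus sentence's
    latent code and the original sentence (teacher forcing)."""
    vocab = word_logits.size(-1)
    return F.cross_entropy(word_logits.reshape(-1, vocab),
                           target_tokens.reshape(-1))
```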

Integration: the objectives above are combined into the overall training losses for the generator and the discriminator.
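One plausible way the objectives could be combined is as weighted sums per network; the weights λ and the exact assignment of terms below are assumptions for illustration, not the paper's equations:

```latex
% Illustrative combined objectives (weights \lambda and term assignment are assumptions):
\mathcal{L}_{G} = \mathcal{L}_{adv}^{G}
               + \lambda_{c}\,\mathcal{L}_{concept}
               + \lambda_{im}\,\mathcal{L}_{im}
               + \lambda_{sen}\,\mathcal{L}_{sen}
\qquad
\mathcal{L}_{D} = \mathcal{L}_{adv}^{D} + \lambda_{im}\,\mathcal{L}_{im}
```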

 

Initialization

It is challenging to adequately train the image captioning model from scratch with the given unpaired data, so the authors need an initialization pipeline to pre-train the generator and the discriminator.

For the generator:

Firstly, build a concept dictionary consisting of the object classes in the OpenImages dataset.

Second, train a concept-to-sentence (con2sen) model using the sentence corpus only.

Third, detect the visual concepts in each image using an existing visual concept detector, then use the detected concepts and the concept-to-sentence model to generate a pseudo caption for each image.

Fourth, train the generator with the pseudo image-caption pairs (steps three and four are sketched in code below).
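The pseudo-labeling part of this pipeline can be sketched as a small driver function; detect_concepts and con2sen are hypothetical stand-ins for the paper's concept detector and concept-to-sentence model:

```python
from typing import Callable, List, Sequence, Tuple


def build_pseudo_pairs(
    images: Sequence[object],
    detect_concepts: Callable[[object], List[str]],  # existing visual concept detector
    con2sen: Callable[[List[str]], str],             # concept-to-sentence model from step two
) -> List[Tuple[object, str]]:
    """Steps three and four of the generator initialization: detected concepts
    are turned into pseudo captions, giving pseudo image-caption pairs that the
    generator can then be pre-trained on. All callables are hypothetical
    stand-ins for the paper's components."""
    pseudo_pairs = []
    for image in images:
        concepts = detect_concepts(image)    # e.g. ["dog", "frisbee", "grass"]
        pseudo_caption = con2sen(concepts)   # e.g. "a dog catches a frisbee on the grass"
        pseudo_pairs.append((image, pseudo_caption))
    return pseudo_pairs
```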

 

For the discriminator: it is initialized by training an adversarial sentence generation model on the sentence corpus.
