【Paper】CNN-LSTM: Show and Tell: A Neural Image Caption Generator

Venue: CVPR
Year: 2015
Citations: 3390 (as of 04/22/20)



Abstract

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.



1. Introduction

Being able to automatically describe the content of an image using properly formed English sentences is a very challenging task, but it could have great impact, for instance by helping visually impaired people better understand the content of images on the web. This task is significantly harder, for example, than the well-studied image classification or object recognition tasks, which have been a main focus in the computer vision community [27]. Indeed, a description must capture not only the objects contained in an image, but it also must express how these objects relate to each other as well as their attributes and the activities they are involved in. Moreover, the above semantic knowledge has to be expressed in a natural language like English, which means that a language model is needed in addition to visual understanding.


Most previous attempts have proposed to stitch together existing solutions of the above sub-problems, in order to go from an image to its description [6, 16]. In contrast, we would like to present in this work a single joint model that takes an image $I$ as input, and is trained to maximize the likelihood $p(S|I)$ of producing a target sequence of words $S = \{S_1, S_2, \dots\}$, where each word $S_t$ comes from a given dictionary, that describes the image adequately.


The main inspiration of our work comes from recent advances in machine translation, where the task is to transform a sentence $S$ written in a source language into its translation $T$ in the target language, by maximizing $p(T|S)$. For many years, machine translation was also achieved by a series of separate tasks (translating words individually, aligning words, reordering, etc.), but recent work has shown that translation can be done in a much simpler way using Recurrent Neural Networks (RNNs) [3, 2, 30] and still reach state-of-the-art performance. An “encoder” RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a “decoder” RNN that generates the target sentence.



Figure 1. NIC, our model, is based end-to-end on a neural network consisting of a vision CNN followed by a language generating RNN. It generates complete sentences in natural language from an input image, as shown on the example above.

Here, we propose to follow this elegant recipe, replacing the encoder RNN by a deep convolution neural network (CNN). Over the last few years it has been convincingly shown that CNNs can produce a rich representation of the input image by embedding it to a fixed-length vector, such that this representation can be used for a variety of vision tasks [28]. Hence, it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences (see Fig. 1). We call this model the Neural Image Caption, or NIC.


Our contributions are as follows. First, we present an end-to-end system for the problem. It is a neural net which is fully trainable using stochastic gradient descent. Second, our model combines state-of-art sub-networks for vision and language models. These can be pre-trained on larger corpora and thus can take advantage of additional data. Finally, it yields significantly better performance compared to state-of-the-art approaches; for instance, on the Pascal dataset, NIC yielded a BLEU score of 59, to be compared to the current state-of-the-art of 25, while human performance reaches 69. On Flickr30k, we improve from 56 to 66, and on SBU, from 19 to 28.



2. Related Work

The problem of generating natural language descriptions from visual data has long been studied in computer vision, but mainly for video [7, 32]. This has led to complex systems composed of visual primitive recognizers combined with a structured formal language, e.g. And-Or Graphs or logic systems, which are further converted to natural language via rule-based systems. Such systems are heavily hand-designed, relatively brittle and have been demonstrated only on limited domains, e.g. traffic scenes or sports.


The problem of still image description with natural text has gained interest more recently. Leveraging recent advances in recognition of objects, their attributes and locations allows us to drive natural language generation systems, though these are limited in their expressivity. Farhadi et al. [6] use detections to infer a triplet of scene elements which is converted to text using templates. Similarly, Li et al. [19] start off with detections and piece together a final description using phrases containing detected objects and relationships. A more complex graph of detections beyond triplets is used by Kulkarni et al. [16], but with template-based text generation. More powerful language models based on language parsing have been used as well [23, 1, 17, 18, 5]. The above approaches have been able to describe images “in the wild”, but they are heavily hand-designed and rigid when it comes to text generation.


A large body of work has addressed the problem of ranking descriptions for a given image [11, 8, 24]. Such approaches are based on the idea of co-embedding of images and text in the same vector space. For an image query, descriptions are retrieved which lie close to the image in the embedding space. Most closely, neural networks are used to co-embed images and sentences together [29] or even image crops and subsentences [13], but do not attempt to generate novel descriptions. In general, the above approaches cannot describe previously unseen compositions of objects, even though the individual objects might have been observed in the training data. Moreover, they avoid addressing the problem of evaluating how good a generated description is.


In this work we combine deep convolutional nets for image classification [12] with recurrent networks for sequence modeling [10], to create a single network that generates descriptions of images. The RNN is trained in the context of this single “end-to-end” network. The model is inspired by recent successes of sequence generation in machine translation [3, 2, 30], with the difference that instead of starting with a sentence, we provide an image processed by a convolutional net. The closest works are by Kiros et al. [15] who use a neural net, but a feedforward one, to predict the next word given the image and previous words. A recent work by Mao et al. [21] uses a recurrent NN for the same prediction task. This is very similar to the present proposal but there are a number of important differences: we use a more powerful RNN model, and provide the visual input to the RNN model directly, which makes it possible for the RNN to keep track of the objects that have been explained by the text. As a result of these seemingly insignificant differences, our system achieves substantially better results on the established benchmarks. Lastly, Kiros et al. [14] propose to construct a joint multimodal embedding space by using a powerful computer vision model and an LSTM that encodes text. In contrast to our approach, they use two separate pathways (one for images, one for text) to define a joint embedding, and, even though they can generate text, their approach is highly tuned for ranking.



3. Model

In this paper, we propose a neural and probabilistic framework to generate descriptions from images. Recent advances in statistical machine translation have shown that, given a powerful sequence model, it is possible to achieve state-of-the-art results by directly maximizing the probability of the correct translation given an input sentence in an “end-to-end” fashion – both for training and inference. These models make use of a recurrent neural network which encodes the variable length input into a fixed dimensional vector, and uses this representation to “decode” it to the desired output sentence. Thus, it is natural to use the same approach where, given an image (instead of an input sentence in the source language), one applies the same principle of “translating” it into its description.


Thus, we propose to directly maximize the probability of the correct description given the image by using the following formulation:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S|I; \theta) \tag{1}$$

where $\theta$ are the parameters of our model, $I$ is an image, and $S$ its correct transcription. Since $S$ represents any sentence, its length is unbounded. Thus, it is common to apply the chain rule to model the joint probability over $S_0, \dots, S_N$, where $N$ is the length of this particular example, as

$$\log p(S|I) = \sum_{t=0}^{N} \log p(S_t|I, S_0, \dots, S_{t-1}) \tag{2}$$

where we dropped the dependency on $\theta$ for convenience. At training time, $(S, I)$ is a training example pair, and we optimize the sum of the log probabilities as described in (2) over the whole training set using stochastic gradient descent (further training details are given in Section 4).
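As a concrete illustration of the chain-rule decomposition in (2), the sketch below (PyTorch, illustrative only) scores a caption by summing the per-word log-probabilities; the `step_probs` tensor standing in for the decoder's softmax outputs is an assumed input, not part of the paper.

```python
import torch

def caption_log_likelihood(step_probs: torch.Tensor, word_ids: torch.Tensor) -> torch.Tensor:
    """Sum of log p(S_t | I, S_0, ..., S_{t-1}) over the sentence.

    step_probs: (N, V) softmax output of the decoder at each of N steps
    word_ids:   (N,)   index of the ground-truth word S_t at each step
    """
    # Pick the probability assigned to the correct word at every step,
    # then sum the logs, matching Eq. (2).
    per_step = step_probs.gather(1, word_ids.unsqueeze(1)).squeeze(1)  # (N,)
    return per_step.log().sum()

# Toy usage: 4 decoding steps over a 10-word vocabulary.
probs = torch.softmax(torch.randn(4, 10), dim=1)
sentence = torch.tensor([2, 5, 5, 9])
print(caption_log_likelihood(probs, sentence))
```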

It is natural to model $p(S_t|I, S_0, \dots, S_{t-1})$ with a Recurrent Neural Network (RNN), where the variable number of words we condition upon up to $t-1$ is expressed by a fixed-length hidden state or memory $h_t$. This memory is updated after seeing a new input $x_t$ by using a non-linear function $f$:

$$h_{t+1} = f(h_t; x_t) \tag{3}$$

To make the above RNN more concrete, two crucial design choices are to be made: what is the exact form of $f$ and how are the images and words fed as inputs $x_t$. For $f$ we use a Long Short-Term Memory (LSTM) net, which has shown state-of-the-art performance on sequence tasks such as translation. This model is outlined in the next section.


For the representation of images, we use a Convolutional Neural Network (CNN). They have been widely used and studied for image tasks, and are currently state-of-the-art for object recognition and detection. Our particular choice of CNN uses a novel approach to batch normalization and yields the current best performance on the ILSVRC 2014 classification competition [12]. Furthermore, they have been shown to generalize to other tasks such as scene classification by means of transfer learning [4]. The words are represented with an embedding model.



3.1. LSTM-based Sentence Generator

Figure 2. LSTM: the memory block contains a cell $c$ which is controlled by three gates. In blue we show the recurrent connections – the output $m$ at time $t-1$ is fed back to the memory at time $t$ via the three gates; the cell value is fed back via the forget gate; the predicted word at time $t-1$ is fed back, in addition to the memory output $m$ at time $t$, into the Softmax for word prediction.

The choice of $f$ in (3) is governed by its ability to deal with vanishing and exploding gradients [10], the most common challenge in designing and training RNNs. To address this challenge, a particular form of recurrent nets, called LSTM, was introduced [10] and applied with great success to translation [3, 30] and sequence generation [9]. The core of the LSTM model is a memory cell $c$ encoding knowledge at every time step of what inputs have been observed up to this step (see Figure 2). The behavior of the cell is controlled by “gates” – layers which are applied multiplicatively and thus can either keep a value from the gated layer if the gate is 1 or zero this value if the gate is 0. In particular, three gates are being used which control whether to forget the current cell value (forget gate $f$), if it should read its input (input gate $i$) and whether to output the new cell value (output gate $o$). The definition of the gates and cell update and output are as follows:


$$
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{im} m_{t-1}) \\
f_t &= \sigma(W_{fx} x_t + W_{fm} m_{t-1}) \\
o_t &= \sigma(W_{ox} x_t + W_{om} m_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot h(W_{cx} x_t + W_{cm} m_{t-1}) \\
m_t &= o_t \odot c_t \\
p_{t+1} &= \mathrm{Softmax}(m_t)
\end{aligned}
$$

where $\odot$ represents the product with a gate value, and the various $W$ matrices are trained parameters. Such multiplicative gates make it possible to train the LSTM robustly as these gates deal well with exploding and vanishing gradients [10]. The nonlinearities are the sigmoid $\sigma(\cdot)$ and the hyperbolic tangent $h(\cdot)$. The last equation, $m_t$, is what is used to feed to a Softmax, which will produce a probability distribution $p_t$ over all words.

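To make the update rule concrete, here is a minimal NumPy sketch of a single LSTM step following the gate equations above (the weight matrices are random placeholders, and bias terms, which the equations omit, are left out as well):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, m_prev, c_prev, W):
    """One step of the LSTM used as f in Eq. (3).

    x_t: input at time t; m_prev/c_prev: previous memory output and cell state;
    W: dict of weight matrices, e.g. W['ix'] maps the input to the input gate.
    """
    i = sigmoid(W['ix'] @ x_t + W['im'] @ m_prev)                    # input gate
    f = sigmoid(W['fx'] @ x_t + W['fm'] @ m_prev)                    # forget gate
    o = sigmoid(W['ox'] @ x_t + W['om'] @ m_prev)                    # output gate
    c = f * c_prev + i * np.tanh(W['cx'] @ x_t + W['cm'] @ m_prev)   # cell update
    m = o * c                                                        # memory output, fed to the Softmax
    return m, c

# Toy dimensions: 8-d inputs, 16-d memory.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(16, 8)) if k.endswith('x') else rng.normal(size=(16, 16))
     for k in ['ix', 'im', 'fx', 'fm', 'ox', 'om', 'cx', 'cm']}
m, c = lstm_step(rng.normal(size=8), np.zeros(16), np.zeros(16), W)
```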

Training The LSTM model is trained to predict each word of the sentence after it has seen the image as well as all preceding words, as defined by $p(S_t|I, S_0, \dots, S_{t-1})$. For this purpose, it is instructive to think of the LSTM in unrolled form – a copy of the LSTM memory is created for the image and each sentence word such that all LSTMs share the same parameters, and the output $m_{t-1}$ of the LSTM at time $t-1$ is fed to the LSTM at time $t$ (see Figure 3). All recurrent connections are transformed to feed-forward connections in the unrolled version. In more detail, if we denote by $I$ the input image and by $S = (S_0, \dots, S_N)$ a true sentence describing this image, the unrolling procedure reads:

$$
\begin{aligned}
x_{-1} &= \mathrm{CNN}(I) \\
x_t &= W_e S_t, \quad t \in \{0, \dots, N-1\} \\
p_{t+1} &= \mathrm{LSTM}(x_t), \quad t \in \{0, \dots, N-1\}
\end{aligned}
$$

where we represent each word as a one-hot vector $S_t$ of dimension equal to the size of the dictionary. Note that we denote by $S_0$ a special start word and by $S_N$ a special stop word which designates the start and end of the sentence. In particular, by emitting the stop word the LSTM signals that a complete sentence has been generated. Both the image and the words are mapped to the same space, the image by using a vision CNN, the words by using word embedding $W_e$. The image $I$ is only input once, at $t = -1$, to inform the LSTM about the image contents. We empirically verified that feeding the image at each time step as an extra input yields inferior results, as the network can explicitly exploit noise in the image and overfits more easily.

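The unrolled procedure above can be sketched as follows (PyTorch); the module and its dimensions are illustrative stand-ins rather than the authors' implementation, with the CNN reduced to a pre-computed feature vector that gets projected into the embedding space:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Illustrative unrolled decoder: the image is fed once at t = -1, words afterwards."""

    def __init__(self, vocab_size, embed_dim=512, cnn_feat_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(cnn_feat_dim, embed_dim)   # x_{-1} = CNN(I) projected
        self.embed = nn.Embedding(vocab_size, embed_dim)     # x_t = W_e S_t
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)          # feeds the Softmax

    def forward(self, cnn_features, word_ids):
        # cnn_features: (B, cnn_feat_dim); word_ids: (B, N) holding S_0 .. S_{N-1}
        x_img = self.img_proj(cnn_features).unsqueeze(1)     # (B, 1, E), used only at t = -1
        x_words = self.embed(word_ids)                       # (B, N, E)
        inputs = torch.cat([x_img, x_words], dim=1)          # image first, then the words
        hidden, _ = self.lstm(inputs)
        logits = self.out(hidden[:, 1:, :])                  # drop the image step's prediction
        return logits                                        # (B, N, vocab); p_{t+1} = softmax(logits)
```

Feeding `cnn_features` only at the first step mirrors the choice of presenting the image once at $t = -1$.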

Our loss is the sum of the negative log likelihood of the correct word at each step as follows:

$$L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)$$

The above loss is minimized w.r.t. all the parameters of the LSTM, the top layer of the image embedder CNN, and the word embeddings $W_e$.
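In code, this loss is just a summed cross-entropy over the softmax outputs; the sketch below assumes logits shaped like those of the illustrative decoder above, and the padding mask is an assumption the paper does not discuss:

```python
import torch.nn.functional as F

def caption_loss(logits, target_ids, pad_id=0):
    # logits: (B, N, V) predicting S_1..S_N; target_ids: (B, N) holding S_1..S_N (padded)
    B, N, V = logits.shape
    # cross_entropy = -log softmax(logits) picked at the correct word, i.e. -log p_t(S_t)
    loss = F.cross_entropy(logits.reshape(B * N, V), target_ids.reshape(B * N),
                           ignore_index=pad_id, reduction='sum')
    return loss / B   # average negative log-likelihood per caption
```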

Figure 3. LSTM model combined with a CNN image embedder (as defined in [12]) and word embeddings. The unrolled connections between the LSTM memories are in blue and they correspond to the recurrent connections in Figure 2. All LSTMs share the same parameters.

Inference There are multiple approaches that can be used to generate a sentence given an image, with NIC. The first one is Sampling, where we just sample the first word according to $p_1$, then provide the corresponding embedding as input and sample $p_2$, continuing like this until we sample the special end-of-sentence token or some maximum length. The second one is BeamSearch: iteratively consider the set of the $k$ best sentences up to time $t$ as candidates to generate sentences of size $t+1$, and keep only the resulting best $k$ of them. This better approximates $S = \arg\max_{S'} p(S'|I)$. We used the BeamSearch approach in the following experiments, with a beam of size 20. Using a beam size of 1 (i.e., greedy search) did degrade our results by 2 BLEU points on average.

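A compact sketch of such a beam search is given below; `step_fn`, a callable returning (word id, log-probability) pairs for the next word given the current prefix, is an assumed interface for illustration:

```python
def beam_search(step_fn, start_id, end_id, beam_size=20, max_len=20):
    """Keep the k best partial sentences (scored by cumulative log-probability) at every step."""
    beams = [([start_id], 0.0)]            # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:
                finished.append((seq, score))   # completed sentences are set aside
                continue
            for word_id, logp in step_fn(seq):  # per-word log p(S_t | I, prefix)
                candidates.append((seq + [word_id], score + logp))
        if not candidates:
            break
        # keep only the best k candidate sentences of length t + 1
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```

With `beam_size=1` this reduces to the greedy search mentioned above.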


4. Experiments

To compare with prior art, we performed an extensive set of experiments to assess the effectiveness of our model using several metrics, data sources, and model architectures.

4.1 Evaluation Metrics

Although it is sometimes not clear whether a description should be deemed successful or not given an image, prior art has proposed several evaluation metrics. The most reliable (but time consuming) is to ask for raters to give a subjective score on the usefulness of each description given the image. In this paper, we used this to reinforce that some of the automatic metrics indeed correlate with this subjective score, following the guidelines proposed in [11], which asks the graders to evaluate each generated sentence with a scale from 1 to 4.


For this metric, we set up an Amazon Mechanical Turk experiment. Each image was rated by 2 workers. The typical level of agreement between workers is 65%. In case of disagreement we simply average the scores and record the average as the score. For variance analysis, we perform bootstrapping (re-sampling the results with replacement and computing means/standard deviation over the resampled results). Like [11], we report the fraction of scores which are greater than or equal to a set of predefined thresholds.
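The bootstrapping step can be sketched as follows (NumPy); the `scores` argument, a list of per-image ratings, is an assumed input:

```python
import numpy as np

def bootstrap_stats(scores, n_resamples=1000, seed=0):
    """Resample the per-image scores with replacement and summarize the resampled means."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_resamples)]
    return float(np.mean(means)), float(np.std(means))   # mean and std over resamples
```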


The rest of the metrics can be computed automatically assuming one has access to groundtruth, i.e. human generated descriptions. The most commonly used metric so far in the image description literature has been the BLEU score [25], which is a form of precision of word n-grams between generated and reference sentences. Even though this metric has some obvious drawbacks, it has been shown to correlate well with human evaluations. In this work, we corroborate this as well, as we show in Section 4.3. An extensive evaluation protocol, as well as the generated outputs of our system, can be found at http://nic.droppages.com/.


Besides BLEU, one can use the perplexity of the model for a given transcription (which is closely related to our objective function in (1)). The perplexity is the geometric mean of the inverse probability for each predicted word. We used this metric to make choices regarding model selection and hyperparameter tuning on our held-out set, but we do not report it since BLEU is always preferred. A much more detailed discussion regarding metrics can be found in [31], and research groups working on this topic have been reporting other metrics which are deemed more appropriate for evaluating captions. We report two such metrics - METEOR and CIDEr - hoping for much more discussion and research to arise regarding the choice of metric.

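For clarity, the perplexity defined above (the geometric mean of the inverse probability of each predicted word) can be computed as in this small sketch; the per-word probabilities are assumed inputs:

```python
import numpy as np

def perplexity(word_probs):
    """Geometric mean of 1/p over the words of a transcription."""
    word_probs = np.asarray(word_probs, dtype=float)
    # exp(mean(-log p)) == (prod 1/p)^(1/N), computed in log space for stability
    return float(np.exp(-np.log(word_probs).mean()))

print(perplexity([0.25, 0.1, 0.5]))   # higher when the model is more "surprised"
```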

Lastly, the current literature on image description has also been using the proxy task of ranking a set of available descriptions with respect to a given image (see for instance [14]). Doing so has the advantage that one can use known ranking metrics like recall@k. On the other hand, transforming the description generation task into a ranking task is unsatisfactory: as the complexity of images to describe grows, together with its dictionary, the number of possible sentences grows exponentially with the size of the dictionary, and the likelihood that a predefined sentence will fit a new image will go down unless the number of such sentences also grows exponentially, which is not realistic; not to mention the underlying computational complexity of evaluating efficiently such a large corpus of stored sentences for each image. The same argument has been used in speech recognition, where one has to produce the sentence corresponding to a given acoustic sequence; while early attempts concentrated on classification of isolated phonemes or words, state-of-the-art approaches for this task are now generative and can produce sentences from a large dictionary.


Since our model can generate descriptions of reasonable quality, and despite the ambiguity of evaluating an image description (there may be multiple valid descriptions that are not in the groundtruth), we believe we should concentrate on evaluation metrics for the generation task rather than for ranking.



4.2. Datasets

For evaluation we use a number of datasets which consist of images and sentences in English describing these images. The statistics of the datasets are as follows:
[Table: dataset statistics (training/validation/test sizes) omitted]
With the exception of SBU, each image has been annotated by labelers with 5 sentences that are relatively visual and unbiased. SBU consists of descriptions given by image owners when they uploaded them to Flickr. As such they are not guaranteed to be visual or unbiased, and thus this dataset has more noise.

The Pascal dataset is customarily used only for testing after the system has been trained on different data, such as any of the other four datasets. In the case of SBU, we hold out 1000 images for testing and train on the rest, following [18]. Similarly, we reserve 4K random images from the MSCOCO validation set as test (called COCO-4k), and use it to report results in the next section.


4.3. Results

Since our model is data driven and trained end-to-end, and given the abundance of datasets, we wanted to answer questions such as "how does dataset size affect generalization", "what kinds of transfer learning can it achieve", and "how does it deal with weakly labeled examples". As a result, we performed experiments on five different datasets, explained in Section 4.2, which enable us to understand our model in depth.


4.3.1 Training Details

Many of the challenges we faced when training our models had to do with overfitting. Indeed, purely supervised approaches require large amounts of data, but the high-quality datasets have fewer than 100,000 images. The task of assigning a description is strictly harder than object classification, and data-driven approaches have only recently become dominant thanks to datasets as large as ImageNet (which, with the exception of SBU, has ten times more data than the datasets we describe in this paper). As a result, we believe that, even though we obtain quite good results, the advantage of our approach over current human-engineered methods will only grow in the coming years as training set sizes increase.

Nonetheless, we explored several techniques to deal with overfitting. The most obvious way to avoid overfitting is to initialize the weights of the CNN component of our system with a pretrained model (e.g., on ImageNet). We did this in all experiments (similar to [8]), and it did help quite a lot in terms of generalization. Another set of weights that could be sensibly initialized are the word embeddings $W_e$. We tried initializing them from a large news corpus [22] but observed no significant gains, and decided to leave them uninitialized for simplicity. Finally, we tried several strategies to curb overfitting at the model level: dropout [34] and ensembling models, as well as exploring the capacity of the model by trading off the number of hidden units versus depth. Dropout and ensembling improved BLEU, as reported in the paper.

We trained all sets of weights using stochastic gradient descent with a fixed learning rate and no momentum. All weights were randomly initialized except for the CNN weights, which we kept fixed because changing them had a negative impact. We used 512 dimensions for the embeddings and for the size of the LSTM memory.
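A sketch of that training setup in PyTorch terms is shown below; the modules are minimal stand-ins and the learning-rate value is a placeholder, since the paper does not state it:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU())   # stand-in for the pre-trained vision CNN
decoder = nn.LSTM(512, 512)                           # stand-in for the 512-d LSTM sentence generator

for p in cnn.parameters():                            # CNN weights are held fixed during training
    p.requires_grad = False

# Plain SGD with a fixed learning rate and no momentum; the lr value is a placeholder.
optimizer = torch.optim.SGD(decoder.parameters(), lr=0.01, momentum=0.0)
```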

Descriptions were preprocessed with basic tokenization, keeping all words that appear at least 5 times in the training set.
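The vocabulary rule can be sketched as below; the special start/stop/unknown tokens are illustrative placeholders (the paper only mentions special start and stop words):

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """captions: iterable of already-tokenized training sentences (lists of words)."""
    counts = Counter(w for caption in captions for w in caption)
    words = [w for w, c in counts.items() if c >= min_count]   # keep words seen >= 5 times
    # reserve ids for the special tokens used by the decoder (names are placeholders)
    vocab = {w: i for i, w in enumerate(['<start>', '<end>', '<unk>'] + sorted(words))}
    return vocab
```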


4.3.2 Generation Results

[Tables 1 and 2: BLEU results on the relevant datasets (omitted)]
We report our main results on all the relevant datasets in Tables 1 and 2. Since PASCAL does not have a training set, we used the system trained on MSCOCO (arguably the largest and highest-quality dataset for this task). The previous state-of-the-art results for PASCAL and SBU did not use image features based on deep learning, so arguably a large part of the improvement on those scores comes from that change alone. The Flickr datasets have been used recently [11, 21, 14], but mostly evaluated in a retrieval framework. A notable exception is [21], which does both retrieval and generation and yields the best performance on the Flickr datasets to date.

The human scores in Table 2 were computed by comparing one of the human captions against the other four. We do this for each of the five raters and average their BLEU scores. Since this gives a slight advantage to our system, whose BLEU scores are computed against five reference sentences rather than four, we add back to the human scores the average difference of having five references instead of four.
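That leave-one-out computation can be sketched with NLTK's sentence-level BLEU as below; the smoothing choice is an assumption, not something specified in the paper:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def human_bleu1(captions):
    """captions: the 5 tokenized human captions of one image."""
    smooth = SmoothingFunction().method1
    scores = []
    for i, hypo in enumerate(captions):
        refs = [c for j, c in enumerate(captions) if j != i]      # the other four captions
        scores.append(sentence_bleu(refs, hypo, weights=(1, 0, 0, 0),
                                    smoothing_function=smooth))   # BLEU-1
    return sum(scores) / len(scores)
```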

Given that the field has seen significant advances in the last few years, we believe it is more meaningful to report BLEU-4, which is the standard in machine translation going forward. Additionally, we report metrics shown to correlate better with human evaluation, as given in Table 1. Despite recent efforts on better evaluation metrics [31], our model fares well against human raters on these automatic metrics. However, when our captions are evaluated by human raters (see Section 4.3.6), our model does much worse, suggesting that more work is needed towards better metrics. On the official test set, for which labels are only available through the official website, our model achieved a BLEU-4 of 27.2.

4.3.3 Transfer Learning, Data Size and Label Quality

Since we have trained many models and have several testing sets, we wanted to study whether we could transfer a model to a different dataset, and how much the mismatch in domain would be compensated for by, e.g., higher-quality labels or more training data.

The most obvious case for transfer learning and data size is between Flickr30k and Flickr8k. The two datasets are similarly labeled, as they were created by the same group. Indeed, when training on Flickr30k (with about 4 times more training data), the results are 4 BLEU points better. It is clear that in this case we gain by adding more training data, since the whole process is data driven and prone to overfitting. MSCOCO is even bigger (5 times more training data than Flickr30k), but since the collection process was done differently, there are likely more vocabulary differences and a larger mismatch. Indeed, all BLEU scores degrade by 10 points. Nonetheless, the descriptions are still reasonable. Since PASCAL has no official training set and was collected independently of Flickr and MSCOCO, we report transfer learning from MSCOCO (see Table 2). Doing transfer learning from Flickr30k yielded worse results, with a BLEU-1 of 53 (cf. 59).


4.3.4 Generation Diversity Discussion

Having trained a generative model of $p(S|I)$, an obvious question is whether the captions the model generates are novel, and whether they are diverse and of high quality. Table 3 shows some samples obtained by returning the N-best list from our beam search decoder instead of the best hypothesis. Notice how the samples are diverse and may show different aspects of the same image. The top 15 generated sentences score 58 in BLEU, which is similar to humans among themselves. This indicates the amount of diversity our model generates. In bold are the sentences that are not present in the training set. If we take the best candidate, the sentence is present in the training set 80% of the time. This is not too surprising since the amount of training data is quite small, so it is relatively easy for the model to pick "exemplar" sentences and use them to generate descriptions. If we instead analyze the top 15 generated sentences, about half of the time we see a completely novel description, but still with a similar BLEU score, which suggests they are of sufficient quality yet provide a healthy amount of diversity.

Table 3. N-best examples from the MSCOCO test set. Bold lines indicate a novel sentence not present in the training set.


4.3.5 Ranking Results

Even though we believe ranking is an unsatisfactory way to evaluate description generation from images, many papers report ranking scores, using the set of test captions as candidates to rank given a test image. The approach best suited to these metrics (MNLM) specifically implements a ranking-aware loss. Nevertheless, NIC does surprisingly well on both ranking tasks (ranking descriptions given an image, and ranking images given a description), as shown in Tables 4 and 5. Note that for the image annotation task we normalize our scores similarly to [21].
[Tables 4 and 5: ranking results (omitted)]
Figure 4. Flickr-8k: NIC: predictions produced by NIC on the Flickr8k test set (average score: 2.37); Pascal: NIC: (average score: 2.45); COCO-1k: NIC: A subset of 1000 images from the MSCOCO test set with descriptions produced by NIC (average score: 2.72); Flickr-8k: ref: these are results from [11] on Flickr8k rated using the same protocol, as a baseline (average score: 2.08); Flickr-8k: GT: we rated the groundtruth labels from Flickr8k using the same protocol. This provides us with a “calibration” of the scores (average score: 3.89)


4.3.6 Human Evaluation

Figure 4 shows the results of the human evaluation of the descriptions produced by NIC, as well as the reference system and the groundtruth, on various datasets. We can see that NIC is better than the reference system, but clearly worse than the groundtruth, as expected. This shows that BLEU is not a perfect metric, as it does not capture well the difference between NIC and human descriptions as assessed by raters. Examples of rated images are shown in Figure 5. It is interesting to see, for instance in the second image of the first column, how the model was able to notice the frisbee given its size.
[Figure 5: example images with their generated descriptions and ratings (omitted)]


4.3.7 Analysis of Embeddings

In order to represent the previous word $S_{t-1}$ as input to the decoding LSTM producing $S_t$, we use word embedding vectors [22], which have the advantage of being independent of the size of the dictionary (contrary to a simpler one-hot-encoding approach). Furthermore, these word embeddings can be jointly trained with the rest of the model. It is remarkable to see how the learned representations have captured some semantics from the statistics of the language. Table 4.3.7 shows, for a few example words, the nearest other words found in the learned embedding space.


Note how some of the relationships learned by the model will help the vision component. Indeed, having “horse”, “pony”, and “donkey” close to each other will encourage the CNN to extract features that are relevant to horse-looking animals. We hypothesize that, in the extreme case where we see very few examples of a class (e.g., “unicorn”), its proximity to other word embeddings (e.g., “horse”) should provide a lot more information that would be completely lost with more traditional bag-of-words based approaches.

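The kind of nearest-neighbour lookup behind these observations can be sketched as follows (NumPy, using cosine similarity); the vocabulary and the learned embedding matrix $W_e$ are assumed inputs:

```python
import numpy as np

def nearest_words(query, vocab, embeddings, k=5):
    """vocab: word -> row index; embeddings: (V, D) learned word vectors W_e."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-normalize rows
    q = E[vocab[query]]
    sims = E @ q                                   # cosine similarity to every word
    order = np.argsort(-sims)                      # most similar first
    idx_to_word = {i: w for w, i in vocab.items()}
    return [idx_to_word[i] for i in order if i != vocab[query]][:k]
```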


5. Conclusion

We have presented NIC, an end-to-end neural network system that can automatically view an image and generate a reasonable description in plain English. NIC is based on a convolution neural network that encodes an image into a compact representation, followed by a recurrent neural network that generates a corresponding sentence. The model is trained to maximize the likelihood of the sentence given the image. Experiments on several datasets show the robustness of NIC in terms of qualitative results (the generated sentences are very reasonable) and quantitative evaluations, using either ranking metrics or BLEU, a metric used in machine translation to evaluate the quality of generated sentences. It is clear from these experiments that, as the size of the available datasets for image description increases, so will the performance of approaches like NIC. Furthermore, it will be interesting to see how one can use unsupervised data, both from images alone and text alone, to improve image description approaches.



Acknowledgement

We would like to thank Geoffrey Hinton, Ilya Sutskever, Quoc Le, Vincent Vanhoucke, and Jeff Dean for useful discussions on the ideas behind the paper, and the write up.


References

[1] A. Aker and R. Gaizauskas. Generating image descriptions using dependency relational patterns. In ACL, 2010.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
[3] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[5] D. Elliott and F. Keller. Image description using visual dependency representations. In EMNLP, 2013.
[6] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[7] R. Gerber and H.-H. Nagel. Knowledge representation for the generation of quantified natural language descriptions of vehicle traffic in image sequences. In ICIP. IEEE, 1996.
[8] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, 2014.
[9] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
[11] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47, 2013.
[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
[13] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, 2014.
[14] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014.
[15] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In NIPS Deep Learning Workshop, 2013.
[16] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
[17] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012.
[18] P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi. TreeTalk: Composition and compression of trees for image descriptions. ACL, 2(10), 2014.
[19] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In Conference on Computational Natural Language Learning, 2011.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv:1405.0312, 2014.
[21] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Explain images with multimodal recurrent neural networks. arXiv:1410.1090, 2014.
[22] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR, 2013.
[23] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. C. Berg, K. Yamaguchi, T. L. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
[24] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[25] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[26] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139–147, 2010.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.
[28] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.
[29] R. Socher, A. Karpathy, Q. V. Le, C. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. In ACL, 2014.
[30] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[31] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. arXiv:1411.5726, 2015.
[32] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8), 2010.
[33] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In ACL, 2014.
[34] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv:1409.2329, 2014.

