[Paper Reading] Image Captioning Papers Published on arXiv -- Continuously Updated

Page link: https://arxiv.org/search/?searchtype=all&query=image+captioning&abstracts=show&size=50&order=announced_date_first

  • This post lists papers related to the Image Captioning direction published on arXiv.
  • The summary of each paper focuses mainly on its abstract; key papers are marked with "★" or "☆" (importance: ★ > ☆).

Paper List

  • arXiv:1409.2329  <Recurrent Neural Network Regularization> ★: proposes a regularization method for RNNs/LSTMs.

We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. Dropout, the most successful technique for regularizing neural networks, does not work well with RNNs and LSTMs. In this paper, we show how to correctly apply dropout to LSTMs, and show that it substantially reduces overfitting on a variety of tasks. These tasks include language modeling, speech recognition, image caption generation, and machine translation.
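
A minimal sketch (assuming PyTorch) of the paper's core idea: apply dropout only to the non-recurrent connections (inputs, between stacked layers, and before the output), never to the hidden-to-hidden recurrence. The hyperparameters below are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RegularizedLSTMLM(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=650, hidden_dim=650,
                 num_layers=2, dropout=0.5):
        super().__init__()
        self.drop = nn.Dropout(dropout)            # non-recurrent dropout
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # `dropout=` here is applied only between stacked LSTM layers,
        # i.e. on vertical (non-recurrent) connections.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        x = self.drop(self.embed(tokens))          # dropout on the input embedding
        out, state = self.lstm(x, state)           # recurrent connections left intact
        logits = self.fc(self.drop(out))           # dropout before the softmax layer
        return logits, state

# Usage: next-word logits for a batch of token ids.
model = RegularizedLSTMLM()
logits, _ = model(torch.randint(0, 10000, (4, 20)))  # (batch=4, seq_len=20)
```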

  • arXiv:1411.4555  <Show and Tell: A Neural Image Caption Generator> ★: uses a deep neural network to tackle the Image Captioning task; introduces the encoder-decoder framework to the Image Captioning field.

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.
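
A hedged encoder-decoder sketch in the spirit of Show and Tell: a CNN encodes the image into a feature vector that conditions an LSTM decoder trained by maximum likelihood. It assumes PyTorch/torchvision and uses ResNet-18 as a stand-in for the paper's CNN; all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ShowAndTell(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier
        self.img_proj = nn.Linear(512, embed_dim)   # image feature -> word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)            # (B, 512)
        img_token = self.img_proj(feats).unsqueeze(1)      # image fed as the first "word"
        words = self.embed(captions)                       # (B, T, embed_dim)
        inputs = torch.cat([img_token, words], dim=1)
        out, _ = self.lstm(inputs)
        return self.fc(out)                                # logits for next-word prediction

# Training maximizes log p(caption | image) via cross-entropy on these logits.
model = ShowAndTell()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 15)))
```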

  • arXiv:1411.4952  <From Captions to Visual Concepts and Back> ★: uses Multiple Instance Learning to extract words from images (word detectors); adopts a traditional (maximum-entropy) language-modeling approach; re-ranks the generated descriptions using sentence-level features and a deep multimodal similarity model.

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. When human judges compare the system captions to ones written by other people on our held-out test set, the system captions have equal or better quality 34% of the time.
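
A rough sketch of the multiple-instance-learning word detectors via a noisy-OR: the image is a "bag" of region features, and a word is predicted present if at least one region fires for it. The linear per-region scorer and feature size below are placeholders, not the paper's CNN detector.

```python
import torch
import torch.nn as nn

class NoisyORWordDetector(nn.Module):
    def __init__(self, feat_dim=4096, vocab_size=1000):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, vocab_size)  # per-region word scores

    def forward(self, region_feats):
        # region_feats: (num_regions, feat_dim) for a single image
        p_region = torch.sigmoid(self.scorer(region_feats))   # p(word | region)
        p_image = 1 - torch.prod(1 - p_region, dim=0)         # noisy-OR over the bag
        return p_image                                        # (vocab_size,)

detector = NoisyORWordDetector()
word_probs = detector(torch.randn(12, 4096))   # 12 candidate regions of one image
```

The detected words would then condition a language model, with candidate captions re-ranked by sentence-level features and a multimodal similarity model as the abstract describes.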

  • arXiv:1411.5654  <Learning a Recurrent Visual Representation for Image Caption Generation>: explores the bi-directional mapping between images and their descriptions.

In this paper we explore the bi-directional mapping between images and their sentence-based descriptions. We propose learning this mapping using a recurrent neural network. Unlike previous approaches that map both sentences and images to a common embedding, we enable the generation of novel sentences given an image. Using the same model, we can also reconstruct the visual features associated with an image given its visual description. We use a novel recurrent visual memory that automatically learns to remember long-term visual concepts to aid in both sentence generation and visual feature reconstruction. We evaluate our approach on several tasks. These include sentence generation, sentence retrieval and image retrieval. State-of-the-art results are shown for the task of generating novel image descriptions. When compared to human generated captions, our automatically generated captions are preferred by humans over 19.8% of the time. Results are better than or comparable to state-of-the-art results on the image and sentence retrieval tasks for methods using similar visual features.
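
A simplified sketch of the bi-directional idea: one recurrent model whose hidden "visual memory" is trained both to predict the next word (image-to-sentence) and to reconstruct the image feature vector from text (sentence-to-image). The architecture and dimensions below are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class BiDirectionalCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512, feat_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim + feat_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)   # generation direction
        self.feat_head = nn.Linear(hidden_dim, feat_dim)     # visual reconstruction direction

    def forward(self, words, image_feats):
        # image_feats can be zeros when only the sentence is given (reconstruction mode)
        x = torch.cat([self.embed(words),
                       image_feats.unsqueeze(1).expand(-1, words.size(1), -1)], dim=-1)
        h, _ = self.rnn(x)
        return self.word_head(h), self.feat_head(h[:, -1])

model = BiDirectionalCaptioner()
word_logits, recon_feat = model(torch.randint(0, 10000, (2, 12)), torch.randn(2, 4096))
```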

  • arXiv:1412.6632  <Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)>: proposes a multimodal Recurrent Neural Network (m-RNN) whose multimodal layer connects the language-model part and the vision part.

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, we apply the m-RNN model to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval. The project page of this work is: www.stat.ucla.edu/~junhua.mao/m-RNN.html .
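
A sketch of the multimodal layer that joins the two sub-networks: the word embedding, the recurrent hidden state, and the CNN image feature are each projected to a common space and fused (here by element-wise addition and a nonlinearity) before predicting the next word. Dimensions and the choice of tanh are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultimodalLayer(nn.Module):
    def __init__(self, embed_dim=256, hidden_dim=256, img_dim=4096, mm_dim=512,
                 vocab_size=10000):
        super().__init__()
        self.w_proj = nn.Linear(embed_dim, mm_dim)   # language side: word embedding
        self.r_proj = nn.Linear(hidden_dim, mm_dim)  # language side: recurrent state
        self.i_proj = nn.Linear(img_dim, mm_dim)     # vision side: CNN image feature
        self.out = nn.Linear(mm_dim, vocab_size)

    def forward(self, word_emb, rnn_state, img_feat):
        m = self.w_proj(word_emb) + self.r_proj(rnn_state) + self.i_proj(img_feat)
        return self.out(torch.tanh(m))   # logits over the next word

layer = MultimodalLayer()
logits = layer(torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 4096))
```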

  • arXiv:1412.8419  <Simple Image Description Generator via a Linear Phrase-Based Approach>: learns a common space between image features and phrase representations.

Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representation (generated from a previously trained Convolutional Neural Network) and phrases that are used to describe them. The system is then able to infer phrases from a given image sample. Based on caption syntax statistics, we propose a simple language model that can produce relevant descriptions for a given test image using the phrases inferred. Our approach, which is considerably simpler than state-of-the-art models, achieves comparable results on the recently released Microsoft COCO dataset.
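
A sketch of the bilinear scoring idea: a CNN image feature x and a phrase representation p are compared through a learned bilinear form s(x, p) = xᵀ W p, so candidate phrases can be ranked for a given image and handed to a simple language model. Feature sizes are placeholders.

```python
import torch
import torch.nn as nn

class BilinearPhraseScorer(nn.Module):
    def __init__(self, img_dim=4096, phrase_dim=400):
        super().__init__()
        # W defines the common space between image features and phrase embeddings
        self.W = nn.Parameter(torch.randn(img_dim, phrase_dim) * 0.01)

    def forward(self, img_feat, phrase_embs):
        # img_feat: (img_dim,), phrase_embs: (num_phrases, phrase_dim)
        return phrase_embs @ (self.W.t() @ img_feat)   # one score per phrase

scorer = BilinearPhraseScorer()
scores = scorer(torch.randn(4096), torch.randn(50, 400))  # rank 50 candidate phrases
top_phrases = scores.topk(5).indices                      # phrases passed to the language model
```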

  • arXiv:1502.03044  <Show, Attend and Tell: Neural Image Caption Generation with Visual Attention>: introduces the attention mechanism to the Image Captioning field, proposing hard attention and soft attention.

Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
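
A sketch of the soft (deterministic) attention variant: at each decoding step, spatial CNN features are scored against the decoder's hidden state, softmax-normalized into weights, and averaged into a context vector; the hard variant would instead sample one location and train with a variational lower bound. Names and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, feat_dim) spatial features, hidden: (B, hidden_dim) decoder state
        e = self.score(torch.tanh(self.feat_proj(feats) +
                                  self.hid_proj(hidden).unsqueeze(1)))  # (B, L, 1)
        alpha = F.softmax(e, dim=1)                  # attention weights over locations
        context = (alpha * feats).sum(dim=1)         # (B, feat_dim) expected context vector
        return context, alpha.squeeze(-1)

attn = SoftAttention()
ctx, weights = attn(torch.randn(2, 196, 512), torch.randn(2, 512))  # 14x14 feature grid
```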

Reposted from www.cnblogs.com/zlian2016/p/11038179.html