[TPAMI-2023] From Show to Tell: A Survey on Deep Learning-Based Image Captioning

Paper reading: [TPAMI-2023] From Show to Tell: A Survey on Deep Learning-Based Image Captioning

Paper search (studyai.com)

Search for the paper: From Show to Tell: A Survey on Deep Learning-Based Image Captioning

Search for the paper: http://www.studyai.com/search/whole-site/?q=From+Show+to+Tell:+A+Survey+on+Deep+Learning-Based+Image+Captioning&fr=csdn

Keywords

Visualization; Feature extraction; Task analysis; Convolutional neural networks; Additives; Image coding; Training; Image captioning; vision-and-language; deep learning; survey

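Machine vision; natural language processing

Visual (video) captioning; multimodal perception; vision-language tasks; language models; text generation; BERT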

Abstract

Connecting Vision and Language plays an essential role in Generative Intelligence.

For this reason, large research efforts have been devoted to image captioning, i.e., describing images with syntactically and semantically meaningful sentences.

Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation.

During these years, both components have evolved considerably through the exploitation of object regions, attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies.

However, regardless of the impressive results, research in image captioning has not reached a conclusive answer yet.

This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics.

In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies.

Moreover, many variants of the problem and its open challenges are discussed.

The final goal of this work is to serve as a tool for understanding the existing literature and highlighting the future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy…

Authors

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, Rita Cucchiara

Reposted from blog.csdn.net/weixin_42155685/article/details/129353886