Paper Reading - Show and Tell: A Neural Image Caption Generator ( CVPR 2015 )

Link of the Paper: https://arxiv.org/abs/1411.4555

Main Points:

  1. A generative model ( NIC, CNN + RNN ) based on a deep recurrent architecture: the model is trained to maximize the likelihood P(S|I) of the target description sentence S given the training image I. Here S = { S1, S2, ... } is the target sequence of words that adequately describes the image, and each word St comes from a given dictionary.
  2. The authors use a CNN as an image "encoder", by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences. They call this model the Neural Image Caption, or NIC.
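
The likelihood in point 1 is usually handled through the chain rule, log P(S|I) = sum over t of log p(St | I, S0, ..., St-1), so the decoder only has to score one word at a time. The following is a minimal sketch of that sum, assuming the per-step word distributions are already available; the function name `caption_log_likelihood`, the toy dictionary, and all probabilities are hypothetical and do not come from the paper.

```python
import math

def caption_log_likelihood(step_probs, caption):
    """Return log P(S | I) as the sum over t of log p(S_t | I, S_0..S_{t-1}).

    step_probs[t] maps each dictionary word to the probability a decoder
    assigns it at step t, already conditioned on the image and on the
    previous words. The values used below are made up for illustration.
    """
    if len(step_probs) != len(caption):
        raise ValueError("need one distribution per caption word")
    return sum(math.log(dist[word]) for dist, word in zip(step_probs, caption))

# Toy dictionary of three words plus an end token (hypothetical values).
step_probs = [
    {"dog": 0.7, "cat": 0.2, "runs": 0.05, "<end>": 0.05},
    {"dog": 0.05, "cat": 0.05, "runs": 0.8, "<end>": 0.1},
    {"dog": 0.05, "cat": 0.05, "runs": 0.1, "<end>": 0.8},
]
caption = ["dog", "runs", "<end>"]

log_p = caption_log_likelihood(step_probs, caption)
loss = -log_p  # maximizing the likelihood equals minimizing this value
```

In a real model the distributions would come from the RNN decoder fed with the CNN image features, and the loss would be summed over all training (image, caption) pairs.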

Other Key Points:

  1. A description must capture not only the objects contained in an image, but also how these objects relate to each other, their attributes, and the activities they are involved in.
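  2. The approach is inspired by advances in Machine Translation: where a translation model uses an encoder RNN to read the source sentence, NIC replaces that encoder with a CNN that reads the image.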

Reposted from www.cnblogs.com/zlian2016/p/9471483.html