Paper link: https://arxiv.org/abs/1411.5654
Main Points:
- A bi-directional mapping model built on recurrent neural networks: unlike previous approaches that map both sentences and images to a common embedding (and then compute a similarity to match or generate, or simply use an Encoder-Decoder model, I guess), this model converts in both directions between visual features and visual descriptions.
- A novel recurrent visual memory: automatically learns to remember long-term visual concepts.
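To make the two main points concrete, here is a minimal toy sketch of a recurrent step that carries a visual memory alongside the usual language state. It is not the paper's exact formulation: the class name `ToyVisualMemoryRNN`, every weight name (`W_sw`, `W_uu`, `W_vu`, and so on), the `tanh` activations, and the layer sizes are assumptions made for illustration, and the weights are random and untrained. The memory `u` is updated from the words read so far and is used both to reconstruct visual features (sentence to image direction) and to help predict the next word (image to sentence direction).

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def add(*vecs):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vecs)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyVisualMemoryRNN:
    """Untrained toy model; all weight names here are hypothetical."""

    def __init__(self, vocab_size, hidden_size, visual_size, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.visual_size = visual_size
        self.W_sw = init(hidden_size, vocab_size)   # word -> language state
        self.W_ss = init(hidden_size, hidden_size)  # language state recurrence
        self.W_uw = init(visual_size, vocab_size)   # word -> visual memory
        self.W_uu = init(visual_size, visual_size)  # visual memory recurrence
        self.W_vu = init(visual_size, visual_size)  # memory -> reconstructed features
        self.W_os = init(vocab_size, hidden_size)   # language state -> word logits
        self.W_ou = init(vocab_size, visual_size)   # visual memory -> word logits
        self.W_ov = init(vocab_size, visual_size)   # image features -> word logits

    def step(self, word_id, s_prev, u_prev, v=None):
        """Consume one word; return new state, new memory, v_hat, next-word probs."""
        w = [0.0] * self.vocab_size
        w[word_id] = 1.0  # one-hot encoding of the current word
        s = [math.tanh(x) for x in add(matvec(self.W_ss, s_prev), matvec(self.W_sw, w))]
        u = [math.tanh(x) for x in add(matvec(self.W_uu, u_prev), matvec(self.W_uw, w))]
        v_hat = matvec(self.W_vu, u)  # sentence -> visual features direction
        logits = add(matvec(self.W_os, s), matvec(self.W_ou, u))
        if v is not None:  # visual features -> sentence direction
            logits = add(logits, matvec(self.W_ov, v))
        return s, u, v_hat, softmax(logits)

    def read_sentence(self, word_ids):
        """Read a whole sentence and return the reconstructed visual features."""
        s = [0.0] * self.hidden_size
        u = [0.0] * self.visual_size
        v_hat = [0.0] * self.visual_size
        for word_id in word_ids:
            s, u, v_hat, _ = self.step(word_id, s, u)
        return v_hat


model = ToyVisualMemoryRNN(vocab_size=5, hidden_size=4, visual_size=3)

# Sentence -> visual features: read three word ids, get a feature estimate.
v_hat = model.read_sentence([1, 3, 2])

# Visual features -> sentence: condition the next-word distribution on an image.
image_features = [0.2, -0.1, 0.5]  # made-up values for illustration
_, _, _, probs = model.step(0, [0.0] * 4, [0.0] * 3, v=image_features)
```

In a trained model, the reconstruction `v_hat` would be pushed toward the real image features, which is what would encourage `u` to retain visual concepts that were mentioned many words earlier.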
Other Key Points: