Neural Baby Talk

Claudia Zhu
4 min read · Feb 17, 2021

Review of this article

Conceptual Contributions

This paper aims to improve image captioning, one of the primary challenges at the intersection of computer vision (CV) and natural language processing (NLP). Tangible progress here improves applications ranging from aids for visually impaired users to human-computer interaction. Current state-of-the-art captioning models are typically large end-to-end neural networks: a CNN encodes the image into a feature vector, and a recurrent neural network (RNN) decodes that vector into a caption.

This paper introduces a novel framework for image captioning in which the words describing image content are restricted to the set of objects that the model is able to detect in the image. The approach draws on both classical slot-filling approaches and modern neural captioning approaches. Classical slot-filling approaches produce captions that are deeply grounded in the image (all information is carefully extracted via detection), but the captions sound unnatural and “canned”. Modern neural approaches often produce very natural-sounding sentences (since they are trained on natural language captions), but those sentences are not necessarily derived from image-specific objects or features. The paper combines the two frameworks so that captions are both visually grounded, in that they draw on object detection output, and natural sounding. The model works as follows: first, it generates a template sentence that has “slots” tied to specific regions in the image. These slots are then filled with words chosen from the “visual concepts” identified by an object detector.
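To make the template-then-fill idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `Detection` class, the `<region-k>` slot syntax, and the `fill_template` helper are all hypothetical names invented for illustration, and the template and detections are hard-coded rather than produced by a trained model.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One output of a hypothetical object detector (illustrative only)."""
    label: str    # category name, e.g. "dog"
    score: float  # detector confidence in [0, 1]


def fill_template(template, detections):
    """Replace each <region-k> slot with the label of detection k."""
    words = []
    for token in template:
        if token.startswith("<region-") and token.endswith(">"):
            # The slot points at a detected region; use its label as the word.
            index = int(token[len("<region-"):-1])
            words.append(detections[index].label)
        else:
            # Ordinary textual word from the language model; keep as-is.
            words.append(token)
    return " ".join(words)


# Assumed inputs: in a real system both would come from trained models.
detections = [Detection("dog", 0.97), Detection("frisbee", 0.88)]
template = ["A", "<region-0>", "is", "catching", "a", "<region-1>", "."]

caption = fill_template(template, detections)
print(caption)
```

The point of the sketch is only the division of labor: the sentence structure comes from the template, while the content words come from what the detector actually found in the image.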

Technical Contributions

The objective that the model learns is split into two parts. Given an input image I and a corresponding target caption y, the model first maximizes the probability of generating the “template”, which contains grounding regions, or slots, to fill in. Then the model learns to find the visual word that denotes a specific image region and fills in the corresponding slot.
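As a rough illustration of a two-part objective, the sketch below scores one caption as the sum of a template log-probability and a slot-filling log-probability. This is a simplification and not the paper's exact loss: the function name `caption_log_likelihood` and all of the probabilities are made up for the example, and the template and the slot fillers are treated as given.

```python
import math


def caption_log_likelihood(template_probs, slot_probs):
    """Sum the log-probabilities of the two parts of a caption.

    template_probs: probability assigned to each template token
                    (textual words and slot markers).
    slot_probs:     probability assigned to the visual word chosen
                    for each slot.
    """
    template_term = sum(math.log(p) for p in template_probs)
    slot_term = sum(math.log(p) for p in slot_probs)
    return template_term + slot_term, template_term, slot_term


# Made-up numbers standing in for model outputs on one training caption.
template_probs = [0.9, 0.8, 0.9, 0.7, 0.9, 0.6, 0.95]
slot_probs = [0.85, 0.75]

total, template_term, slot_term = caption_log_likelihood(template_probs, slot_probs)
loss = -total  # training would minimize the negative log-likelihood
print(f"template: {template_term:.3f}  slots: {slot_term:.3f}  loss: {loss:.3f}")
```

Because the two terms simply add, a caption is penalized both for an unlikely sentence structure and for picking the wrong word for a region.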

To do this, define a latent variable r_t that can be either a visual word (y_{vis}) or a textual…