Neural Baby Talk

Claudia Zhu
4 min read · Feb 17, 2021

A review of the paper “Neural Baby Talk” (Lu et al., CVPR 2018)

Conceptual Contributions

This paper aims to improve image captioning, one of the primary challenges at the intersection of computer vision (CV) and natural language processing (NLP). Tangible progress improves applications ranging from aiding visually impaired users to human-computer interaction. Current state-of-the-art captioning models are typically end-to-end neural networks in which a CNN encodes the image into a feature vector and a Recurrent Neural Network (RNN) decodes that vector into a caption.
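
To make the pipeline concrete, here is a minimal sketch of that standard CNN-encoder / RNN-decoder captioner, assuming PyTorch. The dimensions, vocabulary size, and class name are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class EncoderDecoderCaptioner(nn.Module):
    """Toy CNN-feature -> LSTM caption decoder (illustrative sizes)."""
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # feat_dim-sized features would come from a pretrained CNN (e.g. ResNet).
        self.fc = nn.Linear(feat_dim, embed_dim)       # project image feature
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)   # word distribution per step

    def forward(self, image_feat, captions):
        # image_feat: (B, feat_dim); captions: (B, T) token ids
        img = self.fc(image_feat).unsqueeze(1)         # (B, 1, embed_dim)
        words = self.embed(captions)                   # (B, T, embed_dim)
        seq = torch.cat([img, words], dim=1)           # image acts as the first "token"
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                        # (B, T+1, vocab_size) logits

# Usage (feats would come from a CNN over the images):
# logits = EncoderDecoderCaptioner()(feats, caption_tokens)
```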

This paper introduces a novel framework for image captioning in which captions are explicitly grounded in the set of objects the model detects in the image. The approach combines classical slot-filling techniques with modern neural captioning. Classical slot-filling approaches produce captions that are deeply grounded in the image (all information is carefully extracted via detection), but the captions sound unnatural and “canned.” The neural technique that most modern approaches follow produces very natural-sounding sentences (since it is trained on natural-language captions), but these are not necessarily derived from image-specific objects and features. This paper presents a novel approach that combines the two frameworks: it is visually grounded, in that it pulls from object-detection output, yet its captions are generated such that they still sound natural.
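
The core slot-filling idea can be illustrated with a toy sketch: the decoder emits a sentence "template" containing visual-word slots, and each slot is filled with the category of a detected region. This is a simplification for intuition, not the paper's actual model; the token format and function name are hypothetical, and detections are assumed to be given.

```python
def fill_template(template, detections):
    """template: list of tokens where '<region-k>' marks a visual-word slot;
    detections: list of detected object category names, indexed by k."""
    caption = []
    for tok in template:
        if tok.startswith("<region-") and tok.endswith(">"):
            k = int(tok[len("<region-"):-1])
            caption.append(detections[k])   # ground the slot in a detection
        else:
            caption.append(tok)             # ordinary textual word
    return " ".join(caption)

# fill_template(["a", "<region-0>", "sitting", "on", "a", "<region-1>"],
#               ["dog", "couch"])  ->  "a dog sitting on a couch"
```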
