Deep Dive into DeViSE

Review of DeViSE

Claudia Zhu
4 min read · Feb 17, 2021

Conceptual Contributions

In DeViSE, the authors tackle a known limitation of visual recognition systems: they struggle to scale to large numbers of categories, partly because balanced labeled datasets of that size are hard to acquire and partly because classification over a fixed set of classes is traditionally rigid. The authors propose a new deep visual-semantic embedding model (DeViSE) that uses text data to train visual models and to constrain their predictions. DeViSE leverages both labeled image data and unannotated text data. The model uses the text to learn semantic relationships between the labels of the image data. This lets it extrapolate beyond the previous state of the art, a deep CNN with a softmax output layer, because DeViSE can generalize to new classes.

The model comes in two parts. The first is a language model based on the skip-gram text modeling architecture, which can efficiently learn semantically meaningful vector embeddings from unannotated text by learning the relationship of words relative to each other. As a result, related labels cluster together in the embedding space. The authors begin by pre-training a simple neural language model that learns semantically meaningful, dense vector representations of words. The authors simultaneously pre-train a state-of-the-art deep CNN for visual object…
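To make "learning the relationship of words relative to each other" a bit more concrete, here is a minimal sketch of how a skip-gram model turns raw text into training examples. It only builds the (center word, context word) pairs; it does not train any embeddings. The function name `skipgram_pairs`, the `window` parameter, and the toy sentence are assumptions made for illustration and do not come from the DeViSE paper.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbors.

    Illustrative helper (hypothetical name): a skip-gram model is trained
    to predict each context word from its center word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the context window to the bounds of the sentence.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy sentence, chosen only for illustration.
sentence = "the tiger shark swims near the reef".split()
pairs = skipgram_pairs(sentence, window=1)

for center, context in pairs[:4]:
    print(center, "->", context)
```

Words that appear in similar contexts produce similar training pairs, which is the intuition behind semantically related labels ending up close together in the embedding space.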
