Deep Dive into DeViSE

Review of DeViSE

Claudia Zhu

--

Conceptual Contributions

In DeViSE, the authors tackle the problem that visual recognition systems are often limited in their ability to scale to large numbers of classification categories, partly because of the difficulty of acquiring sufficiently balanced training data and partly because of the traditionally rigid nature of classification over a fixed set of classes. The authors propose a new deep visual-semantic embedding model (DeViSE) that uses text data to train visual models and to constrain their predictions. DeViSE leverages both labeled image data and unannotated text data: the textual data is used to learn semantic relationships between the labels of the image data, which lets the model extrapolate beyond the previous state of the art (a deep CNN with a softmax output layer), since DeViSE can generalize to classes it has never seen.

The model has two parts. The first is a language model based on the skip-gram text modeling architecture, which efficiently learns semantically meaningful vector embeddings of unannotated text by modeling the relationships of words to one another; this yields clustering of related labels. The authors begin by pre-training this simple neural language model to learn dense, semantically meaningful vector representations of words. In parallel, they pre-train a state-of-the-art deep CNN for visual object recognition with a traditional softmax output layer. Finally, they construct the deep visual-semantic model by taking the trained visual object recognition network, removing the softmax layer, and retraining it to predict the vector representation of each image's label text as learned by the language model. The loss function used here combines dot-product similarity with a hinge rank loss, which trains the model to score its output as more similar to the vector representation of the correct label than to those of incorrect labels. It is defined as follows:
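Restated here following the formulation in the original DeViSE paper, the per-example hinge rank loss is

$$\text{loss}(\text{image}, \text{label}) = \sum_{j \neq \text{label}} \max\!\big[0,\; \text{margin} - \vec{t}_{\text{label}}\, M\, \vec{v}(\text{image}) + \vec{t}_{j}\, M\, \vec{v}(\text{image})\big],$$

where $\vec{v}(\text{image})$ is the output of the core visual network for the given image, $M$ is the matrix of trainable parameters of the linear mapping layer, $\vec{t}_{\text{label}}$ is the learned embedding vector for the correct label, and $\vec{t}_{j}$ ranges over the embeddings of the other candidate labels.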
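To make the objective concrete, below is a minimal sketch of this loss in PyTorch (a framework choice not made by the authors). The tensor names and shapes are illustrative assumptions, and batching and the paper's sampling of false labels are omitted.

```python
import torch

def devise_hinge_rank_loss(visual_features, M, label_embeddings, target_idx, margin=0.1):
    """Hinge rank loss sketch for a single image.

    visual_features:  (d_visual,)  output of the core visual network with softmax removed
    M:                (d_text, d_visual) trainable linear mapping into the embedding space
    label_embeddings: (n_labels, d_text) skip-gram label vectors (assumed unit-normalized)
    target_idx:       index of the correct label
    margin:           hinge margin (illustrative default)
    """
    # Project the visual representation into the text embedding space.
    projected = M @ visual_features                 # (d_text,)

    # Dot-product similarity between the projection and every label embedding.
    scores = label_embeddings @ projected           # (n_labels,)
    correct = scores[target_idx]

    # Hinge term for each label; mask out the correct label so it contributes zero.
    hinge = torch.clamp(margin - correct + scores, min=0.0)
    mask = torch.ones_like(hinge)
    mask[target_idx] = 0.0
    return (hinge * mask).sum()
```

The masking step simply enforces the $j \neq \text{label}$ condition in the sum; everything else is a direct transcription of the equation above.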

Technical Contributions

The authors implemented the model as follows. For the language model, they trained a skip-gram text model on a tokenized corpus of 5.7 million documents (5.4 billion words) extracted from wikipedia.org. The lexicon contains around 155,000 terms, comprising common English words and phrases as well as terms from…
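As an illustration of how such a skip-gram language model might be trained, here is a minimal sketch using gensim's Word2Vec. The authors did not use gensim, and the toy corpus and hyperparameter values below are assumptions standing in for the paper's actual Wikipedia corpus and settings.

```python
from gensim.models import Word2Vec

# Toy stand-in for the tokenized Wikipedia corpus; each inner list is one tokenized document.
corpus = [
    ["tiger", "shark", "is", "a", "species", "of", "requiem", "shark"],
    ["the", "bull", "shark", "is", "found", "in", "warm", "coastal", "waters"],
    ["a", "tiger", "is", "a", "large", "cat", "native", "to", "asia"],
]

# Skip-gram (sg=1) with hierarchical softmax (hs=1); vector_size and window are
# illustrative choices, not the authors' exact configuration.
model = Word2Vec(
    sentences=corpus,
    vector_size=500,
    window=20,
    sg=1,
    hs=1,
    min_count=1,   # keep every term in this toy example
    workers=4,
)

# Unit-normalized embedding for a label term, ready to serve as a DeViSE target vector.
shark_vec = model.wv.get_vector("shark", norm=True)
```

The resulting label embeddings are the vectors the visual model is retrained to predict.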

--