Knowledge Graphs and NLP

Semantic interpretation, or making inferences about the meaning of a text, is a fundamental step in language understanding. Drawing on the human ability to envision mental images when prompted by a description, this paper introduces a novel approach for automatically identifying similarities between descriptions of everyday situations via sets of images and captions, detailed further below. Existing approaches include standard distributional approaches to lexical similarity, built on the idea that “linguistic expressions that appear in similar contexts have a similar meaning.” These approaches compute vector-based distributional similarities by representing each word as a vector of counts of how often the word appears with other words.
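To make the count-vector idea concrete, here is a minimal, hypothetical sketch (toy corpus and vocabulary invented for illustration, not the paper's data or code) of count-based word vectors compared with cosine similarity:

```python
import numpy as np
from itertools import combinations

# Toy corpus (assumed for illustration): each sentence is a list of tokens.
corpus = [
    "a dog runs in the park".split(),
    "a cat runs in the park".split(),
    "a man sits on the couch".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts: how often each word appears with every other word
# within the same sentence.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for w1, w2 in combinations(set(sent), 2):
        counts[index[w1], index[w2]] += 1
        counts[index[w2], index[w1]] += 1

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# "dog" and "cat" appear in near-identical contexts here, so their count
# vectors are very similar; "dog" and "couch" share far fewer contexts.
print(cosine(counts[index["dog"]], counts[index["cat"]]))
print(cosine(counts[index["dog"]], counts[index["couch"]]))
```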

[Figure: sample captions of images, capturing the variety and complexity of finding a good caption]

While standard distributional approaches are successful in identifying which words are related under broad topics and offer useful features for semantic interpretation, they fail when tasked with capturing precise entailment between complex expressions.

This paper presents a denotation approach. Intuitively, the paper follows a truth-conditional viewpoint, so the “denotation” of a declarative sentence is the set of all possible worlds in which the sentence is true.

To capture this, the paper sets the denotation of a sentence to be the set of images it describes. That is, it defines the interpretation function [·] as a function that maps sentences or captions to their visual denotations (the sets of images that a caption can truthfully describe). Formally, if s is a sentence and i is an image, then

[s] = { i ∈ U | s is a truthful description of i }
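As a concrete, entirely hypothetical illustration of this interpretation function, the sketch below maps captions to the set of image IDs they truthfully describe; the image IDs and truthfulness judgments are invented for the example:

```python
# Toy universe of images and assumed truthfulness annotations (not the paper's data).
universe = {"img_01", "img_07", "img_13", "img_42"}

truthful = {
    ("a dog runs", "img_01"),
    ("a dog runs", "img_07"),
    ("a brown dog runs in the park", "img_07"),
    ("an animal runs", "img_01"),
    ("an animal runs", "img_07"),
    ("an animal runs", "img_13"),
}

def denotation(sentence):
    """[sentence] = the set of images in the universe that the sentence truthfully describes."""
    return {i for i in universe if (sentence, i) in truthful}

print(denotation("a dog runs"))  # {'img_01', 'img_07'}
```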

The paper then constructs a denotation graph to create an ordering from more specific to less specific captions. These orderings are called subsumption hierarchies.
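Under this view, the ordering can be read off the denotations themselves; a hedged sketch with toy image sets (not the actual corpus) is below. Note the paper derives the actual graph edges from syntactic and lexical rewrite rules; this only shows the set-theoretic reading of subsumption.

```python
# Toy denotations (assumed): more general captions describe supersets of images.
denotations = {
    "an animal runs":               {"img_01", "img_07", "img_13"},
    "a dog runs":                   {"img_01", "img_07"},
    "a brown dog runs in the park": {"img_07"},
}

def subsumes(general, specific):
    """True if `general` is at least as general as `specific`: [specific] is a subset of [general]."""
    return denotations[specific] <= denotations[general]

print(subsumes("an animal runs", "a dog runs"))                    # True
print(subsumes("a brown dog runs in the park", "an animal runs"))  # False
```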

The paper relies on a database of images of everyday activities where each image is described by multiple captions. The corpus contains 158,439 unique captions and 31,783 images; the denotation graph contains 1,749,097 captions, of which 230,811 describe more than a single image. The paper also applies various cleaning techniques to the captions, such as normalization of tenses and format cleaning.

Another method of note is hypernym lexicon construction. For each head noun whose sense is not clear, the paper looks at every coreference chain the noun appears in and reduces its synsets to those that stand in a hypernym-hyponym relation with at least one other head noun in the chain. A greedy majority voting algorithm then reduces these to a single synset, ensuring that the chosen synset is compatible with the largest number of coreference chains (a sketch of this step appears below).

The algorithm to construct the denotation graph uses “purely syntactic and lexical rules to produce simpler captions”. Note that since each image is associated with several captions (and each caption with several images), the graph is able to capture more subtle relations, such as similarities between syntactically and lexically unrelated descriptions, because the graph relationship can capture both pieces of information.
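A simplified, hypothetical sketch of the hypernym-lexicon step, using NLTK's WordNet interface (the paper's exact procedure and data structures may differ, and the coreference chains here are invented):

```python
# Requires: nltk and the WordNet data (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def related(syn_a, syn_b):
    """True if syn_a and syn_b stand in a hypernym-hyponym relation in WordNet."""
    ancestors_a = set(syn_a.closure(lambda s: s.hypernyms()))
    ancestors_b = set(syn_b.closure(lambda s: s.hypernyms()))
    return syn_b in ancestors_a or syn_a in ancestors_b

def compatible_synsets(head, chain):
    """Synsets of `head` linked by hypernymy to at least one other head noun in the chain."""
    keep = []
    for syn in wn.synsets(head, pos=wn.NOUN):
        others = (o for other in chain if other != head
                    for o in wn.synsets(other, pos=wn.NOUN))
        if any(related(syn, o) for o in others):
            keep.append(syn)
    return keep

def choose_synset(head, chains):
    """Greedy majority vote: pick the synset compatible with the most chains."""
    votes = {}
    for chain in chains:
        for syn in compatible_synsets(head, chain):
            votes[syn] = votes.get(syn, 0) + 1
    return max(votes, key=votes.get) if votes else None

# Example with made-up chains containing the ambiguous head noun "dog".
print(choose_synset("dog", [["dog", "animal"], ["dog", "pet"]]))
```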

The paper evaluates the usefulness of the denotation graph on two different tasks that both require semantic inference over textual information. One is an approximate entailment recognition task that aims to decide whether an image caption describes the same image as another set of four captions. The evaluation metric for this task is the standard distance between the word2vec embeddings of the two captions. The second task is the Semantic Textual Similarity task, which is a graded version of paraphrase detection.
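For reference, a minimal sketch of the kind of embedding-similarity baseline referred to here (the vectors below are stand-ins, not real word2vec vectors, and the setup is assumed rather than taken from the paper):

```python
import numpy as np

# Stand-in word vectors; a real setup would load pre-trained word2vec embeddings.
embeddings = {
    "dog":    np.array([0.9, 0.1, 0.0]),
    "cat":    np.array([0.8, 0.2, 0.1]),
    "runs":   np.array([0.2, 0.8, 0.1]),
    "sleeps": np.array([0.1, 0.1, 0.9]),
}

def caption_vector(caption):
    """Average the vectors of the caption's known words (bag-of-vectors)."""
    vecs = [embeddings[w] for w in caption.split() if w in embeddings]
    return np.mean(vecs, axis=0)

def similarity(c1, c2):
    u, v = caption_vector(c1), caption_vector(c2)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(similarity("dog runs", "cat runs"))    # higher: overlapping content
print(similarity("dog runs", "cat sleeps"))  # lower
```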

This paper is very useful because it introduces a new method of studying the relation between different sentences, including a hierarchy of generic versus specific statements. Further, the success of the results indicates that the denotation graph, which captures denotational similarities, is at least as effective as standard approaches to textual similarity, showing that there is promise in this direction.

However, I think there should be more discussion of the data that the captions provide. For example, the paper only uses positive captioning (the presence of items rather than their absence). It would be interesting to see what would happen with subjective captions, which are probably more common in text. Furthermore, the paper is missing an analysis of the distributions of the images and captions it uses. It is not immediately clear that the distribution of images in the corpus matches the distribution of captions in text at large. Finally, I find the evaluation via word2vec similarity a bit confusing, since a system optimized within that framework would naturally score best under it.
