ML asks “wHY?!”

Claudia Zhu
4 min read · Feb 17, 2021

Review of MIT’s WHY!

Conceptual Contributions

How we determine the “why” behind an image is an interesting and challenging question. One proposed explanation is “Theory of Mind”: psychophysics researchers hypothesize that our capacity to reliably infer another person’s motivation stems from our ability to impute our own beliefs to others. That is, if we experience a certain emotion while doing something, we infer that others doing the same thing experience that emotion as well. This paper seeks to computationally deduce the motivation behind people’s actions in images. The problem is challenging because it is far from clear how to reduce theory-of-mind reasoning to a set of operations that a machine can run. However, since humans perform this task consistently, the authors believe that introducing the problem to the computer vision community will spur research in this direction, and they introduce a new dataset to facilitate that research. The dataset was created by a group of human workers who annotated why the people in photographs are likely undertaking their actions. These annotations were then combined with state-of-the-art image features to train data-driven classifiers that predict a person’s motivation from images.

The paper presents an incipient framework for inferring people’s motivations in images. First, the authors propose to give computer vision systems access to many human experiences by using state-of-the-art language models, estimated on billions of websites, to extract common knowledge about people’s experiences, such as their interactions with objects, their environments, and their motivations. Let y_i ∈ {1, ···, M_i} be a type of visual concept, such as objects or scenes, for i ∈ {1, ···, N}. The paper assigns each y_i to one of the M_i vocabulary terms from the dataset. This formulation is general across types of visual concepts, but for simplicity the paper restricts itself to y_1 as motivation, y_2 as action, y_3 as object, and y_4 as scene. The language potentials are then defined by calculating the log-probability L_{ij}(y_i, y_j) that the visual concepts y_i and y_j are related, obtained by querying a language model with sentences about those concepts. The resulting model is a third-order factor graph. The model then learns the parameters w for the visual features and u for the language potentials from training data of images and their corresponding labels, {x_n, y_n}.
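To make the scoring idea concrete, here is a minimal Python sketch of how visual evidence and language potentials might be combined to pick a joint labeling. It is a toy illustration under stated assumptions, not the paper's implementation: the vocabularies, the numbers in VISUAL, and the helper toy_pair_logprob are all made up, a single shared weight u stands in for the learned language parameters, only pairwise terms are used (the model described above is third order), and inference is done by brute force.

```python
from itertools import combinations, product

# Hypothetical toy vocabularies for the four concept types named in the text.
VOCAB = {
    "motivation": ["to relax", "to commute"],
    "action": ["sitting", "riding"],
    "object": ["sofa", "bicycle"],
    "scene": ["living room", "street"],
}
CONCEPTS = list(VOCAB)

# Made-up per-term visual scores, standing in for the learned visual terms.
# The image evidence alone slightly (and wrongly) favors "to commute".
VISUAL = {
    "motivation": {"to relax": 0.0, "to commute": 0.1},
    "action": {"sitting": 1.0, "riding": 0.2},
    "object": {"sofa": 0.9, "bicycle": 0.1},
    "scene": {"living room": 0.8, "street": 0.3},
}

HOME = {"to relax", "sitting", "sofa", "living room"}
TRAVEL = {"to commute", "riding", "bicycle", "street"}


def toy_pair_logprob(a, b):
    """Stand-in for L_ij: a high log-probability if two terms plausibly co-occur."""
    same_group = ({a, b} <= HOME) or ({a, b} <= TRAVEL)
    return -0.1 if same_group else -3.0


def score(labels, visual_scores, pair_logprob, u):
    """Score one joint labeling: visual evidence plus weighted pairwise language terms."""
    visual = sum(visual_scores[c][labels[c]] for c in CONCEPTS)
    language = sum(
        pair_logprob(labels[a], labels[b]) for a, b in combinations(CONCEPTS, 2)
    )
    return visual + u * language


def infer(visual_scores, pair_logprob, u):
    """Brute-force search over all joint labelings (only viable for toy vocabularies)."""
    best_labels, best_score = None, float("-inf")
    for terms in product(*(VOCAB[c] for c in CONCEPTS)):
        labels = dict(zip(CONCEPTS, terms))
        s = score(labels, visual_scores, pair_logprob, u)
        if s > best_score:
            best_labels, best_score = labels, s
    return best_labels, best_score


# u = 0 ignores the language terms; u = 1 lets them influence the labeling.
visual_only, _ = infer(VISUAL, toy_pair_logprob, u=0.0)
with_language, _ = infer(VISUAL, toy_pair_logprob, u=1.0)
print(visual_only["motivation"], "->", with_language["motivation"])
```

In this toy setup the visual scores alone pick “to commute”, while adding the language terms switches the motivation to “to relax”, because that term is consistent with the confidently detected action, object, and scene. That is the intuition behind borrowing common knowledge from a language model: motivation is rarely visible directly, so it is inferred from what co-occurs with it.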