The World’s Biggest Game of Memory

Claudia Zhu
11 min read · Aug 15, 2020

An Overview of Segmenting Scenes by Matching Image Composites

How do we derive meaning by matching visually similar images? Image and caption from paper.


One of the major goals in computer vision is semantic object segmentation: identifying the objects that carry semantic meaning (a toothbrush, a tree) within an image. This is a difficult endeavor because objects in a semantic class take on a variety of forms and can rely on context spread throughout an image.

Beyond basic methods such as Canny edge detection, where we look for a large difference in intensity between neighboring pixels, we wish to identify entire objects. This presents technical challenges: how do we identify windows as part of a building even though classical edge detection will highlight the windows as something separate? How do we identify a grey car in front of a grey lamp post in front of a grey building as different objects even though they are the same color? Much of this relies on information beyond the local appearance of the objects we are interested in parsing.
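To make that baseline concrete, here is a minimal NumPy sketch of the intensity-difference idea behind Canny-style edge detection (the threshold value is invented for illustration; real Canny adds smoothing, non-maximum suppression, and hysteresis):

```python
import numpy as np

def gradient_edges(img, thresh=0.2):
    """Toy edge detector: mark pixels where neighboring intensities
    differ sharply -- the core intuition behind Canny-style methods."""
    gy, gx = np.gradient(img.astype(float))  # intensity differences per axis
    mag = np.hypot(gx, gy)                   # gradient magnitude
    return mag > thresh

# A tiny image: left half dark, right half bright.
img = np.zeros((4, 8))
img[:, 4:] = 1.0
edges = gradient_edges(img)
print(edges[0])  # only the columns around the dark/bright transition fire
```

Note that this detector responds wherever intensity changes sharply, which is exactly why it splits windows from their building: it has no notion of objects, only of local contrast.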

In the following sections, we will explore the paper by Russell et al., which obtains such information by looking at visually similar images (images that look similar to one another in positive/negative space, object placement, lighting, etc.).

This paper presents a model that finds matching images via a low-dimensional descriptor (think color, shape, texture, and region layout) and then transfers the associated labels from the matching images to the new image. Russell et al. note that this strategy has achieved stellar results for object recognition, detection, and annotation; image localization, completion, exploration, and colorization; and scene recognition and estimation.
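The match-then-transfer idea can be sketched in a few lines. The descriptors and labels below are invented toy values, and L1 nearest-neighbor matching is one simple choice; the paper's actual descriptors are richer:

```python
import numpy as np

# Hypothetical setup: each database image is summarized by a short
# descriptor (e.g., color/texture/layout statistics) and has a label.
db_descriptors = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])
db_labels = ["street", "forest", "street"]

def transfer_label(query_desc):
    # Find the nearest database image under L1 distance and
    # transfer its label to the query image.
    dists = np.abs(db_descriptors - query_desc).sum(axis=1)
    return db_labels[int(np.argmin(dists))]

print(transfer_label(np.array([0.85, 0.15])))  # -> "street"
```

The appeal of this scheme is that no model of "street-ness" is ever learned; the labels ride along with whichever images happen to match.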

Prior Work & Intuition

There has been much prior work on producing an image from a stack of pre-aligned images that show the same scene. One of the most successful is the PhotoMontage framework, which selects regions from the aligned stack by optimizing a quality score to create a visually appealing composite. In an effort to perform image segmentation and region filling (inserting a computer-generated patch into gaps of another image), recent work has used PhotoMontage to automatically align images that contain the same scene (semantically).

The approach used by Russell et al. is different because they use a stack of visually similar images that are semantically different. Intuitively, this means that we only care about the rough layout of the picture (shallow similarity) and not the objects in it, in an effort to obtain semantic results. This is beneficial because finding semantically similar scenes, as PhotoMontage does, is difficult and expensive to compute, and does not always generate good results: semantically similar scenes work no better than visually similar scenes for segmenting semantic objects.

To learn more, check out this paper on generating computer graphics from datasets of globally similar images, and this paper on selecting objects based on large sets of images.

Motivation & Overview

Thanks to the internet, smartphones, and the general accessibility of cameras, we have a surplus of images, which allows us to compile massive collections. In this case, since we are specifically looking at scenery, we could compile pictures of roads, mountains, lakes, etc. From here, we select an image stack, a set of 5,000 images found by matching images (using a standard image-matching technique that we discuss later on) to a given input image.
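Building the stack is just a nearest-neighbor retrieval over the whole collection. A sketch, with random vectors standing in for real scene descriptors and L1 distance as an assumed metric:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical database of 10,000 images, each reduced to a
# low-dimensional scene descriptor.
database = rng.random((10_000, 8))
query = rng.random(8)

def build_stack(query, database, k=5000):
    # Keep the k database images whose descriptors are closest
    # to the query under L1 distance -- this is the "image stack".
    dists = np.abs(database - query).sum(axis=1)
    return np.argsort(dists)[:k]

stack = build_stack(query, database)
print(len(stack))  # 5000
```

At real web scale an exact argsort would be replaced by approximate nearest-neighbor search, but the idea is the same.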

From here, we try to use the image stack of “similar” images to give a semantically meaningful explanation of the new input image. This means that we are searching for things by object and purpose instead of by color or texture. All of this relies on the fundamental assumption that the image stack is already coarsely aligned, so that the regions of objects we care about are in roughly the same position. For example, mountains will always be near the bottom of an image, and the sky will always be near the top.

Then, we can use the semantic objects from our image stack to explain the input image. Instead of current methods such as CNNs, which rely on matching to a few training images and then interpolating, we use the large stack to compute a matching image composite: each region in the input image is matched to a semantically similar region in other images in the stack (note that this is informed by the differences between images, as explained in the next paragraph). We can then use the image composite to explain the input image.

However, scene matches are never perfect: some parts of an image get matched up very well while others do not. This Scene Segmentation via Matching method takes advantage of that matching-of-parts property. For example, for a street scene with a tree and a building, one matching image could match the building very well but get the shape of the road wrong, while another could get the road exactly right but have a tree instead of a building. The differences actually tell us more about the objects in images and help us identify semantic boundaries, since we can aggregate all of the images to create a matching composite.

The top row shows matches for semantically similar images, which are quite varied even though they all depict crosswalks; the bottom row shows visually similar images, which are much less varied even though they do not share meaning. Image and caption from paper.


Russell et al.'s approach combines boundary-based and region-based segmentation within a single MRF framework. The boundary process (Le Boundary Process) tries to infer the semantic boundaries between objects using the stack. The region process (Le Region Process) groups pixels that belong to the same object based on consistent objects in the stack. These are aggregated in an MRF framework, which is solved using graph-cut optimization (Le MRF Process).

Le Boundary Process

Russell et al. are able to exploit the stack of images for boundary detection by relying on the assumption that objects in an image (which are difficult to detect from a single image due to color similarities or a lack of features) do not rest at the same relative locations across images: cars could be parked anywhere on the street, and people could be anywhere in the image. Internal object structures, however, such as windows relative to a building or wheels relative to a car, will always be in the same place. Therefore the success of this approach is measured by the consistency of matches made locally (on the internal structures) across images in the stack.

Further, instead of matching the entire image, we find regions in an image where objects match up to corresponding regions of images in the stack. Pixels outside of a boundary belong to a different region and will match with other images.

Formally, given an oriented line passing through an image point p at orientation θ, we independently search for matches in the stack by forming a local image descriptor modulated by a weighted mask on each side of the line. Again, remember that the images in the stack are assumed to be “coarsely aligned”, so we only look for matches on both sides of the line at point p and orientation θ across the stack; in other words, matching is not translation invariant. See the figure below.

Data driven boundary detection. Left: Input image with a line at a point and orientation. Right: The top 9 matches in a large collection for both sides of the queried line. Image and caption from paper.

Notice that for point B both sides retrieve similar images, resulting in a high rank correlation score. Point A lies on a boundary, so the two sets of retrieved images are very different, resulting in a low rank correlation score.

The ordering in each list is given by the L1 distance between the local descriptors g(p, θ) of the target image and each image in the stack. They then compute Spearman's rank correlation coefficient between the two rank-ordered lists, shown below:

Spearman’s rank correlation: ρ = 1 − 6 Σᵢ dᵢ² / (n(n² − 1)), where dᵢ is the difference between the ranks of stack image i in the two lists and n is the number of images in the stack.

High rank correlation indicates that point p lies inside an object’s extent, whereas a low correlation should indicate that point p is at an object boundary with orientation θ.
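Assuming each side of the line is summarized by a descriptor vector, the boundary score can be sketched as follows. The L1 ranking and Spearman correlation follow the paper's description; the descriptors themselves are placeholders:

```python
import numpy as np

def spearman(rank_a, rank_b):
    """Spearman's rank correlation between two rank-ordered lists
    of the same items (here: stack image indices)."""
    n = len(rank_a)
    # Recover each item's position (rank) in each ordering.
    pos_a = np.empty(n); pos_a[rank_a] = np.arange(n)
    pos_b = np.empty(n); pos_b[rank_b] = np.arange(n)
    d = pos_a - pos_b
    return 1 - 6 * np.sum(d * d) / (n * (n * n - 1))

def boundary_score(side1_desc, side2_desc, stack_descs):
    # Rank the stack images by L1 distance to the descriptor on each
    # side of the candidate line; a low correlation between the two
    # rankings suggests p lies on an object boundary.
    r1 = np.argsort(np.abs(stack_descs - side1_desc).sum(axis=1))
    r2 = np.argsort(np.abs(stack_descs - side2_desc).sum(axis=1))
    return spearman(r1, r2)

# Sanity check: identical sides produce perfect rank agreement.
rng = np.random.default_rng(0)
stack = rng.random((50, 4))
side = rng.random(4)
print(boundary_score(side, side, stack))  # 1.0
```

Sweeping p and θ over the image and thresholding low correlations yields the data-driven boundary map discussed next.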

Note that this data-driven boundary detection is different from image-based edge detection (e.g., PB). Strong image edges can receive a low boundary score when the queries on both sides retrieve similar pictures (at windows, both sides pull up similar building images, since windows are likely to be in the same spot). Weak image edges can also receive a high boundary score if such a boundary is uncommon, for example a grey car parked in front of a grey building.

Further, data-driven boundary scores are determined by the occurrence and matching statistics of similar scenes, requiring no additional manual supervision.

Data driven boundary detection: (a) Input image (b) Ground truth boundaries (c) PB boundary detection (d) Proposed data driven boundary detection. In (d), we observe clearer object boundaries and suppressed false-positive boundaries within objects (the leaves of the tree). Image and caption from paper.

Le Region Process

The second part is to group pixels in the input image to form object regions. This entails finding clusters within the stack that are “self-consistent” and “explain the input image well”. Again, the paper presents a data-driven grouping approach, starting from the hypothesis that regions of semantically meaningful objects are similar across the stack (cars look similar, trees look similar).

Russell et al.'s approach is to “find clusters of image patches that match the same images within the stack”. This means that two patches in the input image will be regions of the same object if their respective matching images from the stack are similar.
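One simple way to realize "two patches match the same images" is to compare the sets of stack images each patch retrieves. The Jaccard overlap below is my stand-in for the paper's similarity measure, and the descriptors are placeholders:

```python
import numpy as np

def top_matches(patch_desc, stack_descs, m=10):
    # Indices of the m stack images closest to this patch (L1 distance).
    d = np.abs(stack_descs - patch_desc).sum(axis=1)
    return set(np.argsort(d)[:m].tolist())

def patch_affinity(desc_a, desc_b, stack_descs, m=10):
    # Heuristic: two patches likely belong to the same object when they
    # retrieve (mostly) the same stack images. Jaccard overlap of the
    # top-m match sets is one simple way to score that.
    a = top_matches(desc_a, stack_descs, m)
    b = top_matches(desc_b, stack_descs, m)
    return len(a & b) / len(a | b)

stack = np.random.default_rng(1).random((100, 4))
patch = np.random.default_rng(2).random(4)
print(patch_affinity(patch, patch, stack))  # 1.0 -- identical queries
```

Crucially, this affinity can be high even when the two patches look nothing alike, which is exactly the wheel-and-window behavior described below.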

Similar to the boundary process, query patches are compared at a particular patch location, so again matching is not translation invariant. And compared with standard image-region processing, patches with very different appearance can be grouped together if they retrieve the same images: a car wheel and a car window look very different but will return similar images. This reveals more semantic objects than visual similarity alone would.

In the figure below, patches that were matched to similar images, and therefore belong to the same object or “major scene components”, are highlighted. This again shows that local regions with different visual appearance map to the same object, since their queries pull up the same images. Also note that there can be multiple hypotheses (multiple different interpretations of the semantic objects in an image) for a patch.

Data driven image grouping. Left: input image. Right: heat maps indicating groupings of pixels belonging to the same scene component, which are found by clustering image patches that match the same set of images in the stack (warmer colors correspond to higher similarity to a cluster center). Image and caption from paper.

Finally, we find cluster centers c_k for k ∈ {1,…,K}. Intuitively, we can understand the object clusters as “topics of the image stack”, or different possible objects. These cluster centers can be found via k-means or other standard topic-discovery methods. Further, since we aren't aiming to find all the semantic objects in the stack, only those that can explain the query image, a relatively small K ≈ 5 suffices.
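A plain k-means pass over the patches' match profiles is enough to illustrate the clustering step. The match vectors below are random placeholders; in the real system each patch would be described by which stack images it retrieves:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: each image patch is described by a vector of match
# scores against the stack (which stack images it retrieves).
patches = rng.random((200, 16))

def kmeans(data, k=5, iters=20):
    # Plain k-means: alternate assigning points to the nearest
    # center and recomputing each center as its cluster's mean.
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return centers, labels

centers, labels = kmeans(patches, k=5)  # K ~ 5, as in the write-up
```

Each resulting center c_k plays the role of a "topic": a candidate scene component that the MRF stage can then assign pixels to.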

Le MRF Process

From the data-driven boundary process we estimated clean boundaries, and from the data-driven region process we grew regions of semantic objects. Even though the two processes pull from the same data, they provide complementary information: the region process groups as many pixels as it can but is imprecise about exact boundary locations, while the boundary process focuses on local image behavior at specific points and is well localized.

An MRF-based optimization framework ties the global region process and the well-localized boundary process together. The paper formulates a multi-state MRF on pixels for segmentation; we will only give a general explanation, paraphrased from the paper. The state x_i ∈ {0, 1,…,K} at pixel i corresponds to one of the K image stack groups. The unary costs φ_i are defined by the similarity of the patch at pixel i, described by an indicator vector y_i, to each of the K image stack groups. The pairwise costs ψ_i,j form a boundary-dependent Potts model.
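Paraphrasing that setup, the energy being minimized has the general shape of a boundary-modulated Potts MRF. This is a schematic sketch, not the paper's exact costs; λ and β are illustrative weights:

```latex
E(\mathbf{x}) \;=\; \sum_i \phi_i(x_i) \;+\; \lambda \sum_{(i,j)\in\mathcal{N}} \psi_{i,j}(x_i, x_j),
\qquad
\psi_{i,j}(x_i, x_j) \;=\; \mathbb{1}[x_i \neq x_j]\, e^{-\beta\, b_{i,j}}
```

where b_{i,j} is the data-driven boundary strength between neighboring pixels i and j, so switching labels is cheap across strong boundaries and expensive inside uniform regions. Energies of this form are exactly what graph-cut solvers optimize efficiently.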

Novel Methods

One of the challenges in scene matching (the process of selecting images from the dataset so that we have a stack to run our process on) is that it is difficult to obtain good matches even with a massive database. Because we rely on low-level image descriptors (no semantics), it is difficult to find matches good enough to let us label parts of the new image. Several approaches exist: synthetically increasing the dataset with transformed copies of images, which adds translational diversity; cleaning match results using clustering, which limits error; or simply picking good matches by hand. However, these do not resolve the fundamental issue that “because there is a potentially infinite cardinality of possible images to match, it is very unlikely to obtain a good overall match”.

Instead, Russell et al. propose that the new image is better matched by breaking it into chunks that have good matches in the database but are large enough to be meaningful as chunks.


The figure below shows a precision-recall curve for the data-driven boundary detector. Precision is the fraction of instances labeled positive that are truly positive, and recall is the fraction of truly positive instances that were labeled positive, regardless of how they were labeled. Ideally we want both to be high. The curve comes from sweeping the detection threshold to examine performance at various operating points.

Evaluation of the boundary detection task on the principal occlusion and contact boundaries extracted from the LabelMe database. We show precision-recall curves for PB (blue triangle line) and our data-driven boundary detector (red circle line)
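The precision/recall computation behind such a curve can be sketched directly; the scores and ground truth below are toy values:

```python
import numpy as np

def precision_recall(scores, truth, thresh):
    # Label a pixel "boundary" when its detector score exceeds the
    # threshold, then compare against the ground-truth boundary mask.
    pred = scores >= thresh
    tp = np.sum(pred & truth)
    precision = tp / max(np.sum(pred), 1)  # correct among predicted
    recall = tp / max(np.sum(truth), 1)    # found among actual
    return precision, recall

scores = np.array([0.9, 0.8, 0.4, 0.3, 0.1])
truth = np.array([True, True, True, False, False])
print(precision_recall(scores, truth, 0.5))  # (1.0, 0.666...)
```

Evaluating this pair at many thresholds and plotting recall against precision traces out exactly the kind of curve shown in the figure.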

Notice that Russell et al.'s method achieves higher precision at all recall levels; at the same recall level, PB achieves 0.45 precision while the data-driven boundary detector achieves 0.50. In other words, the average precision is higher for this method, and therefore more desirable.

The star (output segmentation) produced the highest precision (0.55) at 0.09 recall. Improved performance at lower recall is due to the model suppressing interior contours within objects. At higher recall, the model tends to miss small, movable objects.

Left: Output segmentation produced by our system. Notice that the segment boundaries align well with the depicted objects in the scene. Top right: Top matches for each recovered segment, which are stitched together to form a composite. Bottom right: Top whole-image matches using the gist descriptor. By recovering the segmentation, we are able to recover improved semantic matches. Image and caption from paper.