One Model’s Trash is Another’s Treasure

Review of Microsoft’s Deep Residual Learning for Image Recognition

Claudia Zhu
Jun 14, 2019

Note: This paper was a bit difficult for me and I relied heavily on other sources to help me understand the material. I still don’t understand a lot of the mechanisms and whys of ResNets well enough to feel confident explaining them to others, so I’m going to keep this one short, as a brief overview.

In this review, I will discuss the motivations for this work, their solution, and how the residual network helps, without going into much detail. If you would like a more detailed explanation of the exact mechanisms and procedure, I recommend this source. If you want to know more, from an industry standpoint, about why this is so important, I recommend this source.

This paper starts off with:

Deeper neural networks are more difficult to train.

I thought it was ironic that this was the starting line, especially as I am cramming hard for finals week. This thought has stuck in my head for a while.

Image via Twitter! One of my very good and very smart friends sent this to me. Ilya Sutskever, as he tells me, is one of the leading ML researchers in the world. He works at OpenAI.

The premise of this paper is that, although deeper models have produced a lot of significant results (and undoubtedly take longer to train), it is not always true that a model with more layers is better. In other words, constructing a more successful deep model is not as easy as simply stacking more layers. Experiments have found some sort of “sweet spot” between 16 and 30 layers, but after that, accuracy seems to saturate. This is known as the degradation problem: accuracy peaks and then quickly degrades as depth increases.

However, this degradation is not caused by overfitting. Which is spooky, because so many of our problems in machine learning are caused by overfitting, and so much of deep learning is concerned with minimizing it. From my understanding, the degradation problem arises because these very deep networks become too difficult to optimize: there are just so many nodes and variables that training cannot find a good solution in a reasonable amount of time.

Even more shocking, however, is the fact that adding more layers to a model in the “sweet spot” depth range produces higher training error as well as higher testing error (see Figure 1 below). Microsoft explains why this is surprising with a model built by construction, which I’ll call the constructed model. The constructed model simply takes a trained shallower network and makes it deeper by adding many layers that are just identity mappings (so these layers don’t do anything and it’s essentially JUST the shallow network placed in a deeper network). Because this constructed model exists, any deeper model should be able to do AT LEAST as well as its shallower counterpart. The Microsoft researchers found that this was not true in practice: the deeper networks they actually trained could not match the constructed solution, and ended up with higher training and testing error than the shallower ones. See the figure below.

From Microsoft Paper (Figure 1). Shockingly, the deeper 56-layer plain network has higher training and testing error than the 20-layer one.
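To make the identity-mapping argument concrete, here is a minimal sketch in plain Python. Everything in it is a made-up stand-in: `shallow_net` is a toy function rather than a trained network, and the names `identity_layer` and `constructed_deep_net` are mine, not the paper’s. It only illustrates why a deeper model should, in principle, be able to match a shallower one.

```python
def shallow_net(x):
    # Hypothetical stand-in for an already-trained shallow network.
    return [2.0 * v + 1.0 for v in x]


def identity_layer(x):
    # An added layer that passes its input through unchanged.
    return list(x)


def constructed_deep_net(x, extra_layers=36):
    # The shallow network with identity layers stacked on top.
    # The number of extra layers is arbitrary; it never changes the output.
    out = shallow_net(x)
    for _ in range(extra_layers):
        out = identity_layer(out)
    return out


x = [0.5, -1.0, 3.0]
# The deeper constructed model gives exactly the shallow model's answer.
print(constructed_deep_net(x) == shallow_net(x))
```

Since the deeper model can always fall back to this solution, higher training error in a trained deep network points to an optimization difficulty rather than a lack of capacity.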

So now, the solution. To combat this, they created residual networks. This is pretty much what the paper is focused on. From the abstract:

We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.

Again, I won’t dig too deep here because it’s beyond me, but from my understanding, the idea is that we have some shallower network of some sweet-spot depth. Remember how the other layers in the constructed model were all just identity mappings? In a residual network, these other layers are instead replaced with a model that essentially learns the difference.

Residual just means the difference or error

So for example, if my model consistently outputs 3 when the answer is 4, I can store this difference, 1, and have another model learn that. The rest of the layers learn my deviation from the correct answer for whatever input I have, and then update my answer by adding this residual in.
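The 3-versus-4 example can be written out as a toy sketch. This only illustrates the analogy above and is not how a ResNet is actually trained: `base_model`, the tiny dataset, and “learning” the residual by averaging are all assumptions I made up for illustration.

```python
# Toy data: the right answer is always 4 (made up for illustration).
inputs = [1.0, 2.0, 3.0]
targets = [4.0, 4.0, 4.0]


def base_model(x):
    # Hypothetical model that consistently answers 3.
    return 3.0


# The residual is the gap between the right answer and the base answer.
residuals = [t - base_model(x) for x, t in zip(inputs, targets)]

# "Learn" the residual in the simplest possible way: average the gaps.
learned_residual = sum(residuals) / len(residuals)


def corrected_model(x):
    # Base answer plus the learned residual.
    return base_model(x) + learned_residual


print(learned_residual)      # 1.0
print(corrected_model(2.0))  # 4.0
```

In the paper itself, the residual is defined relative to a block’s input rather than to a separate model’s output, and everything is trained together, but the “learn the difference and add it back” idea is the same.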

A way that they can do this is with shortcut connections, which carry a block’s input forward and add it to that block’s output at deeper levels of the network. You can think of this intuitively as sort of a “refresh” for the model to look at the input again. We see below in Figure 4 that the residual network is wildly successful: unlike the plain networks, the deeper residual network has a lower error rate than the shallower one.

From Microsoft Paper (Figure 4). Plain neural nets are on the left and res nets are on the right.
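The idea of adding the input back in can be sketched as a single residual block in plain Python. This is a simplified illustration under my own assumptions: the real blocks in the paper use convolutions and batch normalization, and the helper names (`relu`, `linear`, `residual_block`) and weights here are hypothetical.

```python
def relu(v):
    # Element-wise ReLU.
    return [max(0.0, a) for a in v]


def linear(weights, v):
    # Matrix-vector product; weights is a list of rows.
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    # F(x) is what the layers learn: here, two small linear layers
    # with a ReLU in between.
    fx = linear(w2, relu(linear(w1, x)))
    # Shortcut connection: add the original input back onto F(x).
    return [f + a for f, a in zip(fx, x)]


x = [1.5, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# With all-zero weights, F(x) is 0 and the block is an identity mapping.
print(residual_block(x, zeros, zeros) == x)
```

The design point is that doing nothing becomes the easy case: if the extra layers have nothing useful to add, they only need to push F(x) toward zero, instead of having to learn an identity mapping from scratch.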

Residual networks are used in many state-of-the-art machine learning and AI systems. They have had remarkable results, especially in image classification and object detection.