Some bits of theory to better understand deep learning

This is a collection of links to some interesting recent theoretical advances that help to better understand deep learning. It is neither complete nor comprehensive; it is more of a reading list for the upcoming Christmas holidays. More on the topic of deep learning theory can be found in this new MOOC by Stanford professors Hatef Monajemi, David Donoho and Vardan Papyan, in this DL theory class by Ioannis Mitliagkas of U Montreal, and in this overview blog post by Dmytrii S. on why deep neural nets generalize so well. A good collection of links is also included in this blog post by Carlos Perez.

Insights from physics

Lin, Tegmark and Rolnick (2016) argue that in physics, most complex phenomena are explained using simple models and formulas; this might explain why deep neural nets learn so well despite the vast complexity of the tasks.

  • Paper from the Journal of Statistical Physics

  • Summary on Technology Review

Insights from information theory

Naftali Tishby and colleagues argue that neural nets learn in two distinct phases: first (and quickly), they coarsely fit the labels using the data, and then (over a much longer period) they learn to generalize by compressing the noisy data representation. We are proud to have Prof. Tishby giving a keynote on this topic at SDS 2018, the 5th Swiss Conference on Data Science!

  • Paper 1 and 2 on arXiv

  • Summary on The Neural Perspective (here’s also a more intuitive one)

  • Tishby’s talk that started the discussion
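To make the idea of "compressing the representation" a bit more tangible, here is a minimal sketch in Python. It assumes that compression is measured as mutual information between the input X, a hidden representation T and the label Y. The toy distributions and the helper name `mutual_information` are made up for illustration and are not taken from the papers.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in bits) of a discrete joint distribution p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                 # normalise to a probability table
    pa = joint.sum(axis=1, keepdims=True)       # marginal p(a), shape (n, 1)
    pb = joint.sum(axis=0, keepdims=True)       # marginal p(b), shape (1, m)
    nz = joint > 0                              # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])))

# Toy setting (hypothetical): four equally likely inputs x = 0..3,
# and the label is y = x // 2.
# Representation A keeps every input distinct (t = x).
p_x_tA = np.eye(4) / 4.0
# Representation B merges the inputs pairwise (t = x // 2).
p_x_tB = np.zeros((4, 2))
for x in range(4):
    p_x_tB[x, x // 2] = 0.25

# Joint distributions of representation and label.
p_tA_y = p_x_tB.copy()          # t = x, y = x // 2: same table as above
p_tB_y = np.eye(2) / 2.0        # t = y in this toy case

info_x_tA = mutual_information(p_x_tA)
info_x_tB = mutual_information(p_x_tB)
info_tA_y = mutual_information(p_tA_y)
info_tB_y = mutual_information(p_tB_y)
```

In this toy case the merged representation B carries less information about the input than A, while both carry the same information about the label. That is the kind of trade-off the compression phase is said to move towards; estimating these quantities for a real network is much harder than for a small probability table.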

Insights from neuroscience

Chang and Tsao show that monkey brains decode faces into ca. 200 orthogonal axes of facial appearance features, and that each face is recognized not by a single neuron (i.e., a softmax layer), but by a specific firing pattern of these 200 neurons. The monkey brain thus learns a face embedding.

  • Paper in the journal Cell

  • My summary: It is argued that this encoding procedure of the monkey brain is similar to how CNNs learn representations, and that the discovery of the same mechanism in the brain might hint at a general underlying mechanism (“The fact that the CNN developed these properties even though it was not explicitly trained to extract appearance coordinates suggests that an axis representation may arise naturally from general constraints on efficient face recognition.”). The paper concludes with the following paragraph that cites Lin & Tegmark (see “physics” above) and hints at Hinton’s new Capsule Networks (see below): “Our finding that AM cells are coding axes of shape-free appearance representations rather than ‘Eigenface’ features (Figure 4L) is consistent with a recently proposed explanation for the effectiveness of deep neural networks in image recognition (Lin and Tegmark, 2016): a visual image on the retina can be considered the result of a hierarchical generative model starting from a set of simple variables, e.g., shape-free appearance features. Deep neural networks are reversing this generative hierarchy, one step at a time, to derive these variables at the final layers. According to this view, the reason the brain codes shape and appearance features is that these are the key input variables to the hierarchical generative model for producing face images that the brain has learned to reverse.”
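The difference between "one neuron per face" and "a firing pattern over axes" can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the feature dimension, the random orthonormal axes and the helper names `encode` and `identify` are all hypothetical stand-ins, not the procedure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_axes = 200        # number of axes as quoted in the text above
n_features = 400    # hypothetical size of the raw face feature vector

# Build orthonormal axes (assumption: random ones via QR decomposition).
q, _ = np.linalg.qr(rng.normal(size=(n_features, n_axes)))
axes = q.T                                   # shape (n_axes, n_features)

def encode(face):
    """Project a face onto every axis: the 'firing pattern' of the face."""
    return axes @ face

# Five known faces (random stand-ins) and their stored firing patterns.
faces = rng.normal(size=(5, n_features))
codes = faces @ axes.T                       # shape (5, n_axes)

def identify(face, codes):
    """Return the index of the known face whose pattern is closest."""
    pattern = encode(face)
    return int(np.argmin(np.linalg.norm(codes - pattern, axis=1)))

# A slightly perturbed view of face 3 still maps to the same identity.
noisy_view = faces[3] + 0.1 * rng.normal(size=n_features)
match = identify(noisy_view, codes)
```

No single entry of the pattern identifies a face here; identity is read from the whole vector of projections, which is what makes it an embedding rather than a one-hot code.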

Insights from Geoffrey Hinton

Geoffrey Hinton published a remedy for the shortcomings he finds in current CNNs (max pooling, need for many training examples): capsule networks. They are able to learn hierarchies of representations with respect to their relation to each other (e.g., in computer vision: translation and rotation). This makes it much easier for the network to learn that objects depicted from different viewpoints are actually the same thing (i.e., CapsNets need less training data).

Insights from Bayesian theory

Yarin Gal and colleagues argue that keeping dropout active in the evaluation phase of a deep net (thus effectively executing slightly different versions of the same net on exactly the same test data), and then computing statistics over the prediction probabilities for all classes, helps in judging whether the network can classify a certain example well or in fact has no clue what the current input is. The latter might happen if you train a network on dogs only and then show it a cat in evaluation: it will classify the cat as some dog, maybe even with high probability, because it does not have the option to say “I don’t know”.
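The procedure described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the tiny two-layer network, its random weights, the dropout rate and the function name `mc_dropout_predict` are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, w1, w2, n_passes=100, p_drop=0.5, rng=None):
    """Run several forward passes with dropout left switched on."""
    rng = np.random.default_rng() if rng is None else rng
    probs = []
    for _ in range(n_passes):
        h = np.maximum(0.0, x @ w1)              # hidden ReLU layer
        mask = rng.random(h.shape) >= p_drop     # drop each unit with prob. p_drop
        h = h * mask / (1.0 - p_drop)            # inverted-dropout rescaling
        probs.append(softmax(h @ w2))
    probs = np.stack(probs)                      # shape (n_passes, n_classes)
    mean = probs.mean(axis=0)                    # averaged class probabilities
    std = probs.std(axis=0)                      # spread across the passes
    entropy = -np.sum(mean * np.log(mean + 1e-12))
    return mean, std, entropy

# Hypothetical toy network: 4 inputs, 16 hidden units, 3 classes.
rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 16))
w2 = rng.normal(size=(16, 3))
x = rng.normal(size=4)

mean, std, entropy = mc_dropout_predict(x, w1, w2, rng=rng)
```

A large spread across the passes, or a high entropy of the averaged probabilities, is the signal that the net is unsure about the input. Where to put the threshold for "I don't know" is a separate choice that this sketch leaves open.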

(Updated on May 14, 2018; June 06, 2018)

Written on November 22, 2017 (last modified: June 6, 2018)