Many slides and slide ideas thanks to Marc'Aurelio Ranzato and Michael Nielson. http://www.cs.toronto.edu/~ranzato/files/ranzato_CNN_stanford2015.pdf http://neuralnetworksanddeeplearning.com/
http://illusionoftheyear.com/2009/05/the-illusion-of-sex/ http://public.gettysburg.edu/~rrussell/Russell_2009.pdf Left appears female, right appears male. Eyes and lips have not had their brightness/contrast adjusted, but it looks like the face on the left has darker lips This is an example of simultaneous contrast adjustment Russell, 2009
Simultaneous contrast http://pixgood.com/color-contrast-art-examples.html
In this one, the skin was left unchanged, but the lips and eyes were lightened on the left and darkened on the right.
No illusion – both faces appear androgenous “A sex difference in facial contrast and its exaggeration by cosmetics” Female faces have greater luminance contrast between the eyes, lips, and surrounding skin than do male faces. This contrast change in some way represents ‘femininity’. Cosmetics could be seen to increase contrast of these areas, and so increase femininity.
Perceptron: This is convolution!
v v v Shared weights Output should be a little smaller, as we aren’t padding v
Filter = ‘local’ perceptron. Also called kernel. This is the major trick. For each layer of the neural network, we’re going to learn multiple filters.
By pooling responses at different locations, we gain robustness to the exact spatial location of image features.
FLOPS = just to execute, not train.
Universality A single-layer network can learn any function: So long as it is differentiable To some approximation; More perceptrons = a better approximation Visual proof (Michael Nielson): http://neuralnetworksanddeeplearning.com/chap4.html Visual proof is with sigmoid activation functions
If a single-layer network can learn any function… …given enough parameters… …then why do we go deeper? Intuitively, composition is efficient because it allows reuse. Empirically, deep networks do a better job than shallow networks at learning such hierarchies of knowledge.
Problem of fitting Too many parameters = overfitting Not enough parameters = underfitting More data = less chance to overfit How do we know what is required?
Regularization Attempt to guide solution to not overfit But still give freedom with many parameters
Data fitting problem [Nielson]
Which is better? Which is better a priori? How do we know a priori how many parameters we need? Cross-validation is one way. 9th order polynomial 1st order polynomial [Nielson]
Regularization Attempt to guide solution to not overfit But still give freedom with many parameters Idea: Penalize the use of parameters to prefer small weights.
Regularization: Idea: add a cost to having high weights λ = regularization parameter n is size of training data Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other things being equal. Large weights will only be allowed if they considerably improve the first part of the cost function. Put another way, regularization can be viewed as a way of compromising between finding small weights and minimizing the original cost function. [Nielson]
Both can describe the data… …but one is simpler. Occam’s razor: “Among competing hypotheses, the one with the fewest assumptions should be selected” For us: Large weights cause large changes in behaviour in response to small changes in the input. Simpler models (or smaller changes) are more robust to noise. Why does this help reduce overfitting?
Regularization Idea: add a cost to having high weights λ = regularization parameter n is size of training data Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other things being equal. Large weights will only be allowed if they considerably improve the first part of the cost function. Put another way, regularization can be viewed as a way of compromising between finding small weights and minimizing the original cost function. Normal cross-entropy loss (binary classes) Regularization term [Nielson]
Regularization: Dropout Our networks typically start with random weights. Every time we train = slightly different outcome. Why random weights? If weights are all equal, response across filters will be equivalent. Network doesn’t train. http://stackoverflow.com/questions/20027598/why-should-weights-of-neural-networks-be-initialized-to-random-numbers [Nielson]
Regularization Our networks typically start with random weights. Every time we train = slightly different outcome. Why not train 5 different networks with random starts and vote on their outcome? Works fine! Helps generalization because error is averaged. Any problem here? But, kind of slow?
Regularization: Dropout [Nielson]
Regularization: Dropout At each mini-batch: Randomly select a subset of neurons. Ignore them. On test: half weights outgoing to compensate for training on half neurons. Effect: Neurons become less dependent on output of connected neurons. Forces network to learn more robust features that are useful to more subsets of neurons. Like averaging over many different trained networks with different random initializations. Except cheaper to train. [Nielson]
More regularization Adding more data is a kind of regularization Pooling is a kind of regularization Data augmentation is a kind of regularization
CNNs for Medical Imaging - Connectomics This is some of James’ work with Harvard Brain Science Center. The brain represented as a network: image obtained from MR connectomics. (Provided by Patric Hagmann, CHUV-UNIL, Lausanne, Switzerland) [Patric Hagmann]
Vision for understanding the brain 1mm cubed of brain Image at 5-30 nanometers How much data? [Kaynig-Fittkau et al.]
Vision for understanding the brain 1mm cubed of brain Image at 5-30 nanometers How much data? 1 Petabyte – 1,000,000,000,000,000 ~ All photos uploaded to Facebook per day [Kaynig-Fittkau et al.]
[Kaynig-Fittkau et al.]
Vision for understanding the brain Use computer vision to classify neurons in the brain. Attempt to build a connectivity graph of all neurons – ‘connectome’ – so that we can map and simulate brain function.
[Kaynig-Fittkau et al.]
https://arxiv.org/abs/1704.00848 [Haehn et al.]
Network Architecture [Haehn et al.] 170,000 parameters 9000 positive / negative example patches 75 x 75 80,000 image patches https://arxiv.org/abs/1704.00848 [Haehn et al.]
https://arxiv.org/abs/1704.00848 [Haehn et al.]
https://arxiv.org/abs/1704.00848 [Haehn et al.]