1
Topics in Deep Learning
2
Computation Graphs
A computation graph is a way to represent a computation in software. It is a directed acyclic graph (DAG) in which:
- Nodes correspond to variables, constants, or mathematical operations
- Edges correspond to the flow of intermediate values
No parentheses are needed in a computation graph; the order of evaluation is defined by a topological sort. Computation graphs are used to calculate both function values and gradients.
3
The Chain Rule
Scalar function: $Y = f(g(x)) \;\Rightarrow\; \frac{dY}{dx}(x_0) = \frac{df}{dg}\big(g(x_0)\big)\,\frac{dg}{dx}(x_0)$
In the general case we look for gradients: $\nabla Y(X) = \frac{\partial Y}{\partial X} = \left[\frac{\partial Y}{\partial x_1}, \ldots, \frac{\partial Y}{\partial x_k}\right]$
$Y = f\big(X(t)\big) \;\Rightarrow\; \frac{dY}{dt}(t_0) = \sum_{i=1}^{k} \frac{\partial Y}{\partial x_i}\big(X(t_0)\big)\,\frac{dx_i}{dt}(t_0)$
The same reasoning applies when Y depends on many inputs through many intermediate variables.
4
The Chain Rule as Computation Graphs
[Diagram: the chain rule drawn as computation graphs – the scalar case x → g → f, and the multivariate case t → (x_1, …, x_k) → f]
5
Computation Graphs – Toy Example
For the function $Y(X) = (2x_1 + x_2)\,x_3$ (the graph on the next slide) and a given $X = [x_1 = 7,\; x_2 = 5,\; x_3 = -3]$, we will calculate:
- $Y(X)$
- $\nabla Y(X) = \frac{\partial Y}{\partial X} = \left[\frac{\partial Y}{\partial x_1}, \frac{\partial Y}{\partial x_2}, \frac{\partial Y}{\partial x_3}\right]$
The analytic gradient is $\nabla Y = \left[\,2x_3,\; x_3,\; 2x_1 + x_2\,\right]$.
6
Computation Graphs – Toy Example Cont.
[Diagram: the computation graph for $Y = (2x_1 + x_2)\,x_3$ – the inputs $x_1, x_2, x_3$ feed a ×2 node, a + node, and a final × node; the white numbers on the nodes are the topological-sort indices.]
7
Computation Graphs – Forward Algorithm
Rules for the forward pass:
- If a node has no parents, it is either a constant or a bound input variable
- The calculation is done in topological-sort order
- For each intermediate node being computed:
  - Gather the values of its parents (they are ready from previous steps)
  - Apply the function written in the node
A minimal sketch of this pass on the toy graph is given below.
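As a hedged illustration only (the node names, the tuple representation, and the forward function are my own, not from the lecture), here is a minimal Python sketch of the forward pass on the toy graph $Y = (2x_1 + x_2)\,x_3$:

# Toy computation graph for Y = (2*x1 + x2) * x3, stored in topological order.
# Each node is (name, function, parent_names); input nodes have no function.
graph = [
    ('x1', None, []),                          # bound input
    ('x2', None, []),                          # bound input
    ('x3', None, []),                          # bound input
    ('u4', lambda a: 2 * a, ['x1']),           # 2 * x1
    ('u5', lambda a, b: a + b, ['u4', 'x2']),  # 2*x1 + x2
    ('y',  lambda a, b: a * b, ['u5', 'x3']),  # (2*x1 + x2) * x3
]

def forward(graph, inputs):
    values = dict(inputs)                      # values of the bound variables
    for name, fn, parents in graph:
        if fn is not None:                     # intermediate node: apply its function
            values[name] = fn(*(values[p] for p in parents))
    return values

values = forward(graph, {'x1': 7, 'x2': 5, 'x3': -3})
print(values['y'])                             # -57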
8
Computation Graphs – Backward Algorithm
Rules for the backward pass:
- The nodes of the calculation are denoted $u_i$, $1 \le i \le n$, in topological order; the iteration is done in reverse order
- We keep a vector of values $d_i = \frac{\partial y}{\partial u_i}(x_0)$
- The output node has no children, so $d_n = \frac{dy}{du_n} = 1$
- For each intermediate node being computed:
  - Gather the derivatives of its children (they are ready from previous steps)
  - $d_i = \sum_{j \in \mathrm{children}(i)} d_j \,\frac{\partial f_j}{\partial u_i}\big(u_i(x_0)\big)$
A sketch of this pass, continuing the forward example above, is given below.
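Continuing the sketch above (again my own illustrative representation: the table of local derivatives, `grads`, is written by hand per node), a minimal backward pass that reproduces the analytic gradient $[-6, -3, 19]$:

# Local partial derivatives of each node's function w.r.t. each of its parents,
# evaluated at the forward values computed above.
grads = {
    'u4': lambda v: [2.0],                     # d(2*x1)/dx1
    'u5': lambda v: [1.0, 1.0],                # d(a+b)/da, d(a+b)/db
    'y':  lambda v: [v['x3'], v['u5']],        # d(a*b)/da = b, d(a*b)/db = a
}

def backward(graph, values):
    d = {name: 0.0 for name, _, _ in graph}    # d_i = dy/du_i
    d['y'] = 1.0                               # the output node: dy/dy = 1
    for name, fn, parents in reversed(graph):  # reverse topological order
        if fn is None:
            continue                           # input nodes have nothing to propagate
        local = grads[name](values)
        for parent, g in zip(parents, local):
            d[parent] += d[name] * g           # accumulate over all children
    return d

d = backward(graph, values)
print(d['x1'], d['x2'], d['x3'])               # -6.0 -3.0 19.0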
9
Computation Graphs – Software Packages
The most commonly used software packages are:
- TensorFlow – Google (similar to Theano)
- Caffe2 – Facebook (has predefined models)
- PyTorch – mostly used by researchers
- MXNet – Amazon, a new player on the court
- Wrappers – Keras
10
Software Packages – Main Features
- Definition of the computation graph (static / dynamic graphs)
- Function evaluation
- Automatic derivatives
- Automatic training – support for mini-batches
- Support for GPUs / cloud computing
- Predefined layers / pre-trained models
11
Predefined Types of Nonlinearities
- sigmoid – output between 0 and 1; the derivative vanishes at the extremes
- soft-max (many categories): $y_j(X) = \frac{e^{x_j}}{\sum_{i=1}^{k} e^{x_i}}$, with derivative $\frac{\partial y_i}{\partial x_j} = y_i(\delta_{i,j} - y_j)$
- tanh – output between -1 and 1, again with a vanishing derivative at the extremes
- ReLU – $\max(0, x)$, the same shape as the hinge loss
A short numerical sketch of these functions follows.
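A minimal NumPy sketch of these nonlinearities and of the softmax Jacobian (this code is my own illustration, not from the lecture):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # output in (0, 1)

def softmax(x):
    e = np.exp(x - np.max(x))                  # subtract the max for numerical stability
    return e / e.sum()

def relu(x):
    return np.maximum(0.0, x)                  # max(0, x)

x = np.array([1.0, 2.0, -0.5])
y = softmax(x)
J = np.diag(y) - np.outer(y, y)                # Jacobian: dy_i/dx_j = y_i * (delta_ij - y_j)
print(sigmoid(x), np.tanh(x), relu(x), y, J, sep='\n')   # tanh: output in (-1, 1)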
12
Implement your own Layer (keras)
from keras import backend as K
from keras.layers import Layer, multiply
from keras.initializers import RandomNormal


class MyLayer(Layer):
    def __init__(self, filters=10, kernel_size=(3, 3), **kwargs):
        self.filters = filters
        self.kernel_size = kernel_size
        super(MyLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create a trainable weight variable for this layer
        num_channels = input_shape[-1]
        self.kernel = self.add_weight(
            name='kernel',
            shape=self.kernel_size + (num_channels, self.filters),
            initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None),
            trainable=True)
        super(MyLayer, self).build(input_shape)

    def call(self, x):
        # Convolve with the kernel and with its spatially transposed version,
        # then combine the two feature maps elementwise
        f1 = K.conv2d(x, self.kernel, padding='same')
        f2 = K.conv2d(x, K.permute_dimensions(self.kernel, (1, 0, 2, 3)),
                      padding='same')
        return multiply([f1, f2])

    def compute_output_shape(self, input_shape):
        return input_shape[:-1] + (self.filters,)
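One possible way to use this layer inside a model (the surrounding model, the input shape, and the class count are my own assumptions, added only for illustration):

from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
model.add(MyLayer(filters=10, kernel_size=(3, 3), input_shape=(32, 32, 3)))  # the custom layer
model.add(Flatten())
model.add(Dense(10, activation='softmax'))     # hypothetical 10-class head
model.compile(loss='categorical_crossentropy', optimizer='adam')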
13
Predefined Types of Layers
- Fully connected (Dense) – all inputs connected to all outputs; used, for example, as a classifier
- Batchnorm – normalizes the outputs; not very fashionable nowadays
- Dropout – used for regularization, and only during training: during training, randomly drop a fraction p of the units and scale all the others (a sketch follows this list)
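A hedged sketch of the dropout idea in NumPy (the "inverted dropout" variant, which scales the surviving units during training so that the layer is a no-op at test time; this is my own illustration):

import numpy as np

def dropout(activations, p, training=True):
    # p is the probability of dropping each unit; at test time nothing changes
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) >= p).astype(activations.dtype)
    return activations * mask / (1.0 - p)      # scale the survivors by 1/(1-p)

print(dropout(np.ones((2, 4)), p=0.25))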
14
Predefined Types of Layers – Cont.
- Conv2D – performs a convolution on the image; our subject on the next slides
- Embedding – computes a mapping of objects (e.g., words) to a vector representation; the subject right after it
- LSTM – used for sequences of unknown size, recurrent transformations (if time permits I will talk about it)
15
Convolutions (1-d and 2-d)
Convolution: an operation between functions, $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$
The same definition extends to $\mathbb{R}^d$ and to discrete sets.
Example: $f(x) = 1$ for $-1 \le x \le 1$, otherwise $0$.
The convolution $f * f$ is a triangle of width 4 and maximum height 2 (the overlap of the two boxes at $t = 0$).
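A quick numerical check of the box example, discretizing $f$ and using np.convolve (my own illustration):

import numpy as np

dx = 0.01
t = np.arange(-3, 3, dx)
f = ((t >= -1) & (t <= 1)).astype(float)       # indicator of [-1, 1]

conv = np.convolve(f, f, mode='same') * dx     # approximates the convolution integral
print(conv.max())                              # ~2.0: the peak of the triangle
# The result is nonzero only for |t| < 2, i.e. a triangle of width 4.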
16
Convolutions – Cont. (example from Wikipedia)
What happens if we convolve an image with these masks? (slightly lying)
$L_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad L_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$
(the Sobel operators)
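A short SciPy sketch of applying these two masks to a grayscale image (the random array stands in for a real image; this is my own illustration):

import numpy as np
from scipy.signal import convolve2d

Lx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
Ly = Lx.T                                      # the mask for the other direction

image = np.random.rand(64, 64)                 # stand-in for a grayscale image
gx = convolve2d(image, Lx, mode='same', boundary='symm')  # responds to vertical edges
gy = convolve2d(image, Ly, mode='same', boundary='symm')  # responds to horizontal edges
edges = np.hypot(gx, gy)                       # gradient magnitude ~ an edge map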
17
Convolution in Image Processing
A very useful tool for edge detection, smoothing, sharpening, and other operations. But all of these are manually designed filters.
- Can we let the system learn the filters?
- What happens if we run the filters one on top of the other?
- Can we match a filter to detect certain shapes?
18
From Convolutions to Convolution-NN
Convolution is linear, so running two filters consecutively is equivalent to applying a single, different filter once (see the numerical check after this list). So some non-linearity is required.
A convolutional neural network does exactly what we asked for on the previous slide:
- It learns a set of filters
- It uses non-linearity to get more complex results
- The layers adapt themselves to the objects we want to learn
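The promised check of the linearity claim, in 1-D (my own illustration): applying two filters one after the other gives the same result as applying their convolution once.

import numpy as np

signal = np.random.randn(100)
f1 = np.array([1.0, 2.0, 1.0])
f2 = np.array([-1.0, 0.0, 1.0])

two_steps = np.convolve(np.convolve(signal, f1, mode='full'), f2, mode='full')
one_step = np.convolve(signal, np.convolve(f1, f2, mode='full'), mode='full')
print(np.allclose(two_steps, one_step))        # True: one combined filter suffices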
19
The Challenge of Image Classification - ImageNet
ImageNet is a dataset of labeled images taken from the internet:
- Over 14M color images (224 × 224 × 3)
- Over 20K categories (nested)
- The challenge (approximately 1K categories) was announced in 2010
- The error was > 25% before 2012
- In 2012 ConvNet got 16% error!!!
20
Architecture of ConvNet
Each convolution layer is a convolution + ReLU + max pooling + batchnorm.
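As a hedged illustration of such a block in Keras (the filter count, kernel size, and input shape here are my own choices, not the original architecture's):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization

model = Sequential()
model.add(Conv2D(64, (3, 3), activation='relu', padding='same',
                 input_shape=(224, 224, 3)))   # convolution + ReLU
model.add(MaxPooling2D(pool_size=(2, 2)))      # max pooling
model.add(BatchNormalization())                # batchnorm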
21
The Next Winner: VGG16 – Same as ConvNet but bigger!!!
22
ConvNet Visualization
Can we visualize what the network has learned?
- For the first layer the results are linear filters, and they can easily be visualized
- What about the next layers? Why is it interesting?
The next slides are taken from Zeiler and Fergus 2013, “Visualizing and Understanding Convolutional Networks”.
23
ConvNet Visualization – Cont.
24
ConvNet Visualization – Cont. II
25
ConvNet Visualization – Cont. III
The secret behind these images is the deconv layer:
- Revert the max pooling
- The ReLU is preserved if active
- The filter needs to be transposed (I won’t explain why)
This layer is also used in other reconstruction problems. Another secret, of course, is to select the right images!!!
26
ResNet – The New Architecture Type
The winner of ILSVRC 2015 defeated everyone on all criteria – ResNet
27
Word2Vec In language problems we may want to use words as features
The obvious solution, which people have been using for over 20 years, is to represent each word as an indicator vector (with some enhancements). This representation loses all notion of proximity. We would like to map the vocabulary to a vector space in a way that preserves semantic (and syntactic) relations.
28
Word2Vec – Cont.
“You shall know a word by the company it keeps” (Firth, 1957)
How can we exploit this distributional similarity to learn a representation?
- We define a vector for each word (a distributed representation)
- These vectors should be such that you can predict the neighboring words from them
- Word2Vec is an algorithm that finds this vector representation
29
Word2Vec – Cont. II
Algorithm (Skip-gram):
- Select a window of k words surrounding a word (k ≈ 4 or another small number); a minimal pair-extraction sketch follows this list
- Write the predictions $p(w_{t-j} \mid w_t; \theta)$ (as a function of the parameters)
- Optimize $\theta$ by maximizing the likelihood over the whole corpus
CBOW is similar, but the prediction goes the other way around; in both cases the relative position is irrelevant.
OK, but what is the functional form of the p’s?
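The promised sketch of the pair-extraction step (the tokenized sentence and the function name are illustrative only):

def skipgram_pairs(tokens, k=4):
    # For every position t, pair the center word with each word within k positions
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-k, k + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))   # the relative position is ignored
    return pairs

print(skipgram_pairs("you shall know a word by the company it keeps".split(), k=2))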
30
Word2Vec – Cont. III
$p(w_r \mid w_q) = \frac{\exp(w_q^T \cdot w_r)}{\sum_{i \in V} \exp(w_q^T \cdot w_i)}$ ; a softmax representation
- We may choose two different representations for each word (as predictor and as predicted)
- Let’s write the likelihood as a sum and take the derivative of one of the summands
The main paper that made it efficient is “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al., 2013).
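A hedged NumPy sketch of this softmax with two embedding matrices, one for the predictor word and one for the predicted word (the vocabulary size and dimension are made up):

import numpy as np

V, d = 1000, 50                                # vocabulary size, embedding dimension
W_in = 0.01 * np.random.randn(V, d)            # predictor ("input") vectors
W_out = 0.01 * np.random.randn(V, d)           # predicted ("output") vectors

def p_context_given_center(q, W_in, W_out):
    scores = W_out @ W_in[q]                   # w_q^T . w_i for every i in V
    scores -= scores.max()                     # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()                         # p(w_r | w_q) for every r

probs = p_context_given_center(17, W_in, W_out)
print(probs.shape, probs.sum())                # (1000,) 1.0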
31
Word2Vec – Results
Interesting feature: small windows are good for syntax.
Some cool semantic relations (Mikolov et al., 2013).
32
Embedding Layer
In Word2Vec we learned the representation by solving a classification problem. This problem was unsupervised, since we only learned proximity. In an Embedding layer we also learn proximity, for example as a step in a supervised learning problem. It is similar to using principal components as features in a regression.
33
Embedding Layer – Example (Keras)
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.optimizers import Adam


def build_model(n_words, emb_size=10, lstm_h_size=5, n_out=2):
    model = Sequential()
    model.add(Embedding(n_words + 1, emb_size))   # +1 for the padding symbol
    model.add(LSTM(lstm_h_size))
    model.add(Dense(500, activation='relu'))
    model.add(Dropout(.25))
    model.add(Dense(n_out, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(1e-3),
                  metrics=['accuracy'])
    model.summary()
    return model
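One possible way to call it (the toy sequences, padding length, and labels below are invented just to show the shapes):

import numpy as np
from keras.preprocessing.sequence import pad_sequences

sequences = [[3, 7, 2], [5, 1], [4, 4, 9, 2]]  # word-index sequences
X = pad_sequences(sequences, maxlen=10)        # pad with 0, the reserved symbol
y = np.array([[1, 0], [0, 1], [1, 0]])         # one-hot labels for n_out=2

model = build_model(n_words=10)
model.fit(X, y, batch_size=2, epochs=1)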
34
Generative Adversarial Network
Generative models – find the probabilistic process that generates the data
- This can be done parametrically, using approximations to the likelihood
- GAN changed the focus, and since then it is by far the most used model
- The generating distribution is not kept in an explicit form
The original paper is “Generative Adversarial Nets” (Goodfellow et al., 2014).
35
The Idea Behind GAN
We define $p_x$ to be the generating distribution of the data, and $p_g$ the distribution that our generator has learned so far.
- We want $p_g \to p_x$, and we achieve that by training a network D to classify the source of its input: is it from $p_x$ or from $p_g$?
- The generator G is also represented by a network
- When D can no longer differentiate between the inputs, we get $p_g = p_x$
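For reference, a hedged sketch of the value function from the Goodfellow et al. 2014 paper that D maximizes and G minimizes, $V(D,G) = \mathbb{E}_x[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))]$; the D and G below are placeholder callables just to make the sketch runnable, not trained networks:

import numpy as np

def gan_value(D, G, real_batch, noise_batch, eps=1e-8):
    # V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
    d_real = D(real_batch)                     # D's probability that a real sample is real
    d_fake = D(G(noise_batch))                 # D's probability on generated samples
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

D = lambda x: 1.0 / (1.0 + np.exp(-x.mean(axis=1)))    # placeholder "discriminator"
G = lambda z: z @ np.random.randn(z.shape[1], 4)       # placeholder "generator"
print(gan_value(D, G, np.random.randn(8, 4), np.random.randn(8, 3)))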