1
Deep Learning. LeCun, Bengio, and Hinton. doi:10.1038/nature14539
2
Speech Recognition / Image Recognition / Complex Interaction Prediction
Wat? Deep learning is a family of machine learning techniques that are being widely employed in image recognition and categorization, speech recognition, and translation, including semantic processing. It is also used for predicting the behavior of complex interactions, such as those between drugs and bio-chemicals, and the effects of mutations in non-coding DNA on gene expression.
3
Moar Wat? [Slide figure: an input X mapped to an output label Y, such as "Cat".] The topic we will concentrate on the most is image and object recognition. Most widely covered: cats. And in one case, a dog.
4
Teaching to Learn: Supervised / Unsupervised Learning, Representation Learning, Kernel Methods, Convolutional Neural Networks, Recurrent Neural Networks
Supervised Learning – involves presenting the machine with data that has been previously collected, categorized, and labeled. The machine produces a vector of scores, one per category.
Unsupervised Learning – presenting unlabeled data to the machine and allowing it to self-categorize and internally correlate the information. This can lead to better generalization and helps prevent overfitting (when a model describes error/noise instead of the underlying relationship, usually because it has too many parameters relative to the number of observations).
Representation Learning – also called 'feature learning', a set of techniques that learn a feature: a transformation of raw input data into a representation that can be effectively exploited in machine learning tasks. This allows a machine both to learn a specific task and to learn the features themselves: basically, it teaches machines how to learn.
Kernel Methods – examples: Support Vector Machines (supervised learning techniques that use learning algorithms to solve classification and regression problems), polynomial kernels (for learning non-linear models), and the Fisher kernel (classification and information retrieval).
Convolutional Neural Networks – widely deployed for image recognition: a feed-forward neural network whose neurons are tiled so that they respond to overlapping regions of the visual field (more on this later).
Recurrent Neural Networks – an architecture for sequence recognition and reproduction, and for temporal association and prediction (more on this later).
5
How is this possible? Loads of data, time, and/or
colossal amounts of computational horsepower. Almost every supervised machine learning technique requires a mountain of data to learn to classify with any significant degree of accuracy. There are exceptions and trade-offs for each technique: some require smaller or larger amounts of data, and each is faster or slower, and more or less accurate, in certain arenas.
6
More specifically, how? Generally speaking, an input image is ingested, processed, and output as a vector of scores, one per category. Units that are neither input nor output are usually referred to as hidden. Error is then calculated from the output versus the desired pattern of scores, and the machine modifies its weights accordingly. Weights can be thought of as the tunable elements in the machine. In practice, most practitioners train the machine with stochastic gradient descent and then gauge its ability to generalize on images it has not seen before.
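To make the "tunable weights plus stochastic gradient descent" idea concrete, here is a minimal sketch (my own toy code, not from the paper) that tunes a single hypothetical weight w in a model y ≈ w·x by nudging it against the error gradient, one example at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=200)
ys = 3.0 * xs + 0.1 * rng.normal(size=200)   # toy data generated with true weight 3.0

w, lr = 0.0, 0.05                            # initial weight and learning rate
for epoch in range(5):
    for x, y in zip(xs, ys):
        error = w * x - y                    # output minus desired value
        grad = error * x                     # d(0.5 * error**2) / dw
        w -= lr * grad                       # adjust the tunable weight

print(round(w, 2))                           # ends up close to 3.0
```

A real network does exactly this, only for millions of weights at once, with the gradients supplied by backpropagation.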
7
MOAR SPECIFICALLY! Even more specifically, because machine learning isn't difficult enough yet... The math needed to build a machine like this leans heavily on partial derivatives (the Jacobian) and the chain rule. If you're familiar with the chain rule of derivatives, you can see how a small change to the variable x affects y, and how the change in y in turn affects z, yielding a cascading effect through the network. If you're not familiar, just notice that all of the functions are related; changing one of them will change them all. Look at the left graph, center screen; the graph of the input image shows nearly homogeneous curves. When the machine applies a non-linear function of the weights to the data set, it affects not only the curves but the entire manifold, warping and distorting the input space; this process makes the data sets LINEARLY SEPARABLE. Dare I say: almost binary. The bottom two images help to visualize each layer: both their values and their abstract network topology.
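For reference, the chain rule being appealed to here can be written generically as follows (a standard identity, not a formula taken from the slide):

```latex
\frac{\partial z}{\partial x}
  = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x},
\qquad
\Delta z \approx
  \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}\,\Delta x .
```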
8
Feed-forward network with one input layer of 3 units, 2 hidden layers with 4 and 3 units respectively, and a single output layer with 2 units. (Starting from the bottom right of the screen, moving up.) First, compute the input to each unit (z), which is the weighted sum of the previous layer's outputs. Next, apply the non-linear function f(z) to compute the unit's output; here that is the Rectified Linear Unit, f(z) = max(0, z). Repeat this process for all subsequent layers until the output layer is reached.
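As an illustration (my own sketch, not code from the paper), the forward pass through that 3-4-3-2 network fits in a few lines of NumPy; the layer sizes match the slide, while the random weights and the input vector are just placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 3, 2]                 # input, two hidden layers, output (as on the slide)

# Hypothetical weights and biases; in a trained network these are the tuned parameters.
W = [rng.normal(scale=0.5, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

def relu(z):
    return np.maximum(0.0, z)        # Rectified Linear Unit, f(z) = max(0, z)

y = rng.normal(size=3)               # a stand-in input vector
for W_l, b_l in zip(W, b):
    z = y @ W_l + b_l                # z: weighted sum of the previous layer's outputs
    y = relu(z)                      # output of every unit in this layer
print(y)                             # two scores, one per output unit
```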
9
Backward Propagation of Errors – This is the same topology, just backwards.
The error at each unit is calculated layer by layer and eventually returned to the original input layer so that each layer's weights can be tuned. This is done by computing the partial derivative of the (E)rror with respect to the output of each unit, which is in turn a weighted sum of the partial derivatives of the error with respect to the total inputs to the units in the layer above. Computationally, this is nothing more than the chain rule of derivatives, again. "We then convert the error derivative with respect to the output into the error derivative with respect to the input, ∂E/∂z_k, by multiplying it by the gradient of f(z) [middle-left of the screen, bottom line]. At the output layer, the error derivative with respect to the output of a unit is computed by differentiating the cost function. This gives y_l − t_l if the cost function for unit l is 0.5(y_l − t_l)^2, where t_l is the target value. Once ∂E/∂z_k is known, the error derivative for the weight w_jk on the connection from unit j in the layer below is just y_j ∂E/∂z_k."
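Collecting the quoted relations in one place (with l indexing output units, k a unit in the current layer, and j a unit in the layer below), the backward pass is just:

```latex
\frac{\partial E}{\partial y_l} = y_l - t_l
  \quad\text{for the cost } E = \tfrac{1}{2}(y_l - t_l)^2,
\qquad
\frac{\partial E}{\partial z_k}
  = \frac{\partial E}{\partial y_k}\,\frac{\partial f(z_k)}{\partial z_k},
\qquad
\frac{\partial E}{\partial y_j}
  = \sum_k w_{jk}\,\frac{\partial E}{\partial z_k},
\qquad
\frac{\partial E}{\partial w_{jk}}
  = y_j\,\frac{\partial E}{\partial z_k}.
```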
10
Any questions?
11
A brief history. Revival surged in 2006 thanks to CIFAR. The researchers introduced unsupervised learning procedures that work without labelled data. Remarkable performance when detecting pedestrians and recognizing handwritten digits. The revival of interest in deep feed-forward networks happened in 2006, when researchers brought together by the Canadian Institute for Advanced Research (CIFAR) succeeded in creating unsupervised learning machines that performed extremely well at recognizing handwritten digits and detecting pedestrians while using very limited labelled data.
12
Cheap GPUs. A 20-fold increase in speed. Pre-training isn't as necessary as originally thought.
One type of network stood out as easier to train and performed better when generalizing.... - Graphical Processing Units became cheaper and easier to program. - Their power, coupled with more modern deep learning algorithms, led to a 20-fold increase in speed. - Shortly after this revival, deep learning advocates learned that pre-training was only needed for smaller data sets.
13
The Convolutional Neural Network
Inspired by the human visual cortex. High-level features are composed of lower-level ones. Backpropagating gradients through a CNN is as simple as through a regular deep neural network. ConvNets process input data in the form of multiple arrays – a good example is an image with three color-intensity channels: red, green, and blue. A CNN digests the data in a series of stages, moving from a convolutional layer to a pooling layer and then repeating that cycle. - Units in the convolutional layer are organized into feature maps connected through a filter bank (the weights in the previous diagrams); each layer's filter bank is different from the previous one, which helps the network distinguish local motifs and find the same motif in different locations of an image. - The pooling layer then merges semantically similar features into one: it coarse-grains the position of each feature, typically by computing the maximum over a local patch of units, as sketched below.
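As a rough sketch of the two stages just described (my own toy code, not the paper's), here is a single-channel convolution with one hypothetical 2x2 filter, a ReLU, and a non-overlapping max-pooling step:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2-D cross-correlation of a single-channel image with one filter."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: coarse-grains the position of each feature."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size
    return fmap[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

img = np.random.default_rng(0).normal(size=(8, 8))       # toy single-channel "image"
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])        # a simple vertical-edge motif
feature_map = np.maximum(0.0, conv2d(img, edge_filter))   # convolution followed by ReLU
pooled = max_pool(feature_map)                            # pooling layer
print(feature_map.shape, pooled.shape)                    # (7, 7) (3, 3)
```

A real ConvNet learns many such filters per layer and operates on all color channels at once; this only shows the conv, ReLU, pool cycle.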
14
Convolutional Neural Network
Let's look at a Convolutional Neural Network. From the bottom, the network is given an image, which it processes upward layer by layer. The features uncovered at the lower levels act as edge detectors for each subsequent layer, and a score is computed for each image class at the output. Edges are detected to form motifs, motifs then form parts, and parts make up objects. ReLU = Rectified Linear Unit.
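Stacking those stages gives the kind of architecture the slide describes. A hypothetical small ConvNet (sizes chosen only for illustration, assuming 32x32 RGB inputs and 10 classes, written with PyTorch) might look like:

```python
import torch
from torch import nn

# Hypothetical ConvNet: two conv/ReLU/pool stages followed by a layer that
# turns the pooled feature maps into one score per image class.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # edges -> simple motifs
    nn.ReLU(),
    nn.MaxPool2d(2),                              # coarse-grain feature positions
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # motifs -> parts
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # parts -> class scores
)

scores = model(torch.randn(1, 3, 32, 32))         # one toy image in, 10 scores out
print(scores.shape)                               # torch.Size([1, 10])
```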
15
Into the future… Traffic Sign Recognition / Biological Segmentation / Facial Recognition
So the present state of deep learning is already somewhat sobering. One can imagine how all of these self-learning mechanisms could be employed, and what they may even evolve into. Between self-driving cars, natural language processing, and general semantic understanding, it's no wonder convolutional networks have become the de facto standard for recognition and detection tasks.
16
Now I want to talk to you about a powerful convolutional network that has been blowing up the internet lately: Google Deep Dream. It is a convolutional neural network paired with an algorithm that searches according to a series of input base cases, in this case an image and a description of what is in it. The algorithm searches the image space for the objects given in the description and, in the process, projects images that meet the weighted criteria; it acts as a form of pareidolia, the psychological phenomenon in which a pattern is perceived where no such pattern actually exists (e.g. the face on Mars, the man in the moon). Google recently open-sourced the algorithm; since then, a few sites have popped up that let you make your own Deep Dream distortions.
17
Any questions?
18
A brief overview of Recurrent Neural Networks
Another architecture worth mentioning, due to its uncanny power, is the recurrent neural network. The neurons act like those in a typical deep learning network, but with one tricky difference: one of the output connections feeds back into the network itself. This allows the network to maintain a state vector that implicitly contains information about the history of all past elements of the sequence. A naïve method that competes with state-of-the-art translation algorithms.
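A minimal sketch of that feedback loop (my own toy code with made-up sizes, not the paper's): the state vector h is updated from the current input and the previous state, so it carries a summary of everything the network has seen so far:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 5, 8                        # hypothetical input and state sizes
W_x = rng.normal(scale=0.1, size=(n_hid, n_in))
W_h = rng.normal(scale=0.1, size=(n_hid, n_hid))
b = np.zeros(n_hid)

def step(x_t, h_prev):
    """One recurrent step: the new state mixes the current input with the previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(n_hid)                       # state vector summarizing the sequence so far
for x_t in rng.normal(size=(10, n_in)):   # a toy sequence of 10 input vectors
    h = step(x_t, h)
print(h.shape)                            # (8,)
```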
19
They're hard. Problems: backpropagated gradients can grow without bound (explode) or shrink toward zero (vanish), and the networks are architecturally difficult to train.
Recurrent Neural Networks have presented some significant challenges in the past, such that the architecture of the network itself and the training regimes have had to adapt in order to overcome a once limited scope of use. Now, however, thanks to advances in architecture and training, they have been found to be the best predictors of the next character in text and the next word in a sequence, and are even capable of language translation.
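The explode-or-vanish problem can be seen by unrolling the recurrence (a standard analysis, assuming the simple recurrence h_t = f(z_t) with z_t = W_h h_{t-1} + W_x x_t, not a formula from the slide): the gradient flowing back over many time steps is a long product of per-step Jacobians, so its size scales multiplicatively with the number of steps.

```latex
\frac{\partial h_T}{\partial h_t}
  = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}
  = \prod_{k=t+1}^{T} \operatorname{diag}\!\bigl(f'(z_k)\bigr)\, W_h ,
\qquad
\Bigl\lVert \frac{\partial h_T}{\partial h_t} \Bigr\rVert
  \;\sim\; \lVert W_h \rVert^{\,T-t},
```

so depending on whether that factor is larger or smaller than one, the gradient explodes or vanishes over long sequences.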
20
Now for the future! Who thinks robots will take the place of humans?
And why not? If time permits: where else can convolutional networks be applied? All images are public record/open source.