Deep Learning for Speech and Language
Yoshua Bengio, U. Montreal
NIPS’2009 Workshop on Deep Learning for Speech Recognition and Related Applications
December 12, 2009

Interesting Experimental Results with Deep Architectures
- Beating shallow neural networks on vision and NLP tasks
- Beating SVMs on vision tasks from pixels (and handling dataset sizes that SVMs cannot handle in NLP)
- Reaching or beating state-of-the-art performance in NLP and phoneme classification
- Beating deep neural nets that lack an unsupervised component
- Learning visual features similar to those of V1 and V2 neurons, as well as auditory cortex neurons

Deep Motivations
- Brains have a deep architecture
- Humans organize their ideas hierarchically, through composition of simpler ideas
- Insufficiently deep architectures can be exponentially inefficient
- Distributed (possibly sparse) representations are necessary to achieve non-local generalization
- Multiple levels of latent variables allow combinatorial sharing of statistical strength

Architecture Depth
[Figure: two example computation graphs, one of depth 3 and one of depth 4]

Deep Architectures are More Expressive
Theoretical arguments: 2 layers of logic gates, formal neurons, or RBF units are each a universal approximator.
Theorems for all 3 (Hastad et al. 1986 & 1991, Bengio et al. 2007): functions compactly represented with k layers may require exponential size with k-1 layers.
[Figure: a shallow 2-layer network with exponentially many hidden units contrasted with a deeper, compact architecture]
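As a concrete illustration of this depth-versus-size tradeoff (the standard example from circuit complexity, added here rather than taken from the slide), consider the d-bit parity function:

\[
  \mathrm{parity}(x_1,\dots,x_d) \;=\; x_1 \oplus x_2 \oplus \cdots \oplus x_d
\]

% A balanced tree of 2-input XOR gates computes parity with d-1 gates and depth O(log d),
% whereas Hastad's theorem shows that any depth-2 AND/OR circuit for parity needs 2^{\Omega(d)} gates.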

Deep Architectures and Sharing Statistical Strength, Multi-Task Learning
- Generalizing better to new tasks is crucial to approach AI
- Deep architectures learn good intermediate representations that can be shared across tasks (a minimal sketch follows below)
- A good representation is one that makes sense for many tasks
[Figure: raw input x feeding a shared intermediate representation h, which feeds the outputs y1, y2, y3 of tasks 1, 2, 3]
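A minimal sketch of this shared-representation setup, assuming a PyTorch implementation with illustrative layer sizes and three toy tasks (none of these choices come from the talk): one trunk maps the raw input x to a shared representation h, and separate linear heads produce the per-task outputs.

import torch
import torch.nn as nn

class SharedTrunkMultiTask(nn.Module):
    """One shared intermediate representation h, several task-specific output heads."""
    def __init__(self, input_dim=100, hidden_dim=50, task_output_dims=(10, 5, 2)):
        super().__init__()
        # Shared trunk: raw input x -> intermediate representation h
        self.trunk = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Tanh())
        # One linear head per task, all reading the same h
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, d) for d in task_output_dims)

    def forward(self, x):
        h = self.trunk(x)                         # computed once, shared by all tasks
        return [head(h) for head in self.heads]   # y1, y2, y3

model = SharedTrunkMultiTask()
x = torch.randn(4, 100)                           # a toy batch of raw inputs
y1, y2, y3 = model(x)
print(y1.shape, y2.shape, y3.shape)               # torch.Size([4, 10]) torch.Size([4, 5]) torch.Size([4, 2])

Training on several tasks at once then amounts to summing the per-task losses, so every gradient step updates the shared trunk with statistical strength from all tasks.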

Feature and Sub-Feature Sharing
- Different tasks can share the same high-level feature
- Different high-level features can be built from the same set of lower-level features
- More levels = up to exponential gain in representational efficiency
[Figure: task outputs y1 ... yN built on shared high-level features, which are in turn built from shared low-level features]

Sharing Components in a Deep Architecture
A polynomial expressed with shared components: the advantage of depth may grow exponentially (illustrated below).
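A hedged illustration of the argument (the particular polynomial shown on the original slide is not reproduced here; this is the generic product-of-sums example): a product of k sums expands into exponentially many monomials, but the factored, deeper form reuses each sum as a shared component.

\[
  p(x) \;=\; \prod_{i=1}^{k} \left( x_{2i-1} + x_{2i} \right)
\]

% Written as a depth-2 sum of monomials, p has 2^k terms;
% computed in its factored (deeper) form it needs only k additions and k-1 multiplications,
% each factor being a component shared across all 2^k monomials of the expansion.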

The Deep Breakthrough
- Before 2006, training deep architectures was unsuccessful, except for convolutional neural nets
- Hinton, Osindero & Teh, « A Fast Learning Algorithm for Deep Belief Nets », Neural Computation, 2006
- Bengio, Lamblin, Popovici, Larochelle, « Greedy Layer-Wise Training of Deep Networks », NIPS’2006
- Ranzato, Poultney, Chopra, LeCun, « Efficient Learning of Sparse Representations with an Energy-Based Model », NIPS’2006

The need for non-local generalization and distributed (possibly sparse) representations
- Most machine learning algorithms are based on local generalization
- Curse-of-dimensionality effect with local generalizers
- How distributed representations can help

Locally Capture the Variations

Easy with Few Variations

The Curse of Dimensionality
To generalize locally, one needs representative examples for all possible variations!
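A quick counting argument behind this slide (added here; the grid granularity epsilon is an assumed parameter, not a value from the talk): covering the input space with purely local regions requires a number of examples exponential in the dimension.

\[
  \#\left\{ \text{cells of side } \varepsilon \text{ tiling } [0,1]^d \right\} \;=\; (1/\varepsilon)^d
\]

% For example, epsilon = 0.1 and d = 10 already gives 10^{10} cells,
% each needing its own representative example if generalization is only local.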

Limits of Local Generalization: Theoretical Results
- Theorem: Gaussian kernel machines need at least k examples to learn a function that has 2k zero-crossings along some line
- Theorem: For a Gaussian kernel machine to learn certain maximally varying functions over d inputs requires O(2^d) examples (Bengio & Delalleau 2007)
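A simple function exhibiting the first theorem's condition (an illustrative choice, not one named on the slide):

\[
  f(x) \;=\; \operatorname{sign}\!\left( \sin(\omega x) \right)
\]

% f changes sign every pi/omega along the line, so an interval of length 2*pi*k/omega contains 2k
% zero-crossings; by the theorem, a Gaussian kernel machine needs at least k examples to learn f there.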

Curse of Dimensionality When Generalizing Locally on a Manifold

How to Beat the Curse of Many Factors of Variation?
Compositionality: exponential gain in representational power
- Distributed representations
- Deep architecture

Distributed Representations (Hinton 1986)
- Many neurons active simultaneously
- Input represented by the activation of a set of features that are not mutually exclusive
- Can be exponentially more efficient than local representations

Local vs Distributed
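A tiny numeric illustration of this contrast (plain NumPy, with an arbitrary choice of n = 8 units): a local, one-hot code over n units distinguishes only n inputs, while a distributed binary code over the same n units distinguishes up to 2^n.

import numpy as np

n = 8

# Local (one-hot) code: each input gets its own dedicated unit.
local_codes = np.eye(n, dtype=int)                 # at most n distinct patterns

# Distributed code: each input is a combination of shared binary features.
distributed_codes = np.array(
    [[(i >> b) & 1 for b in range(n)] for i in range(2 ** n)]
)                                                  # up to 2**n distinct patterns

print(len(local_codes), "patterns with a local code of", n, "units")
print(len(distributed_codes), "patterns with a distributed code of", n, "units")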

Current Speech Recognition & Language Modeling
- Acoustic model: Gaussian mixture with a huge number of components, trained on very large datasets, on a spectral representation
- Within-phoneme model: HMMs = dynamically warpable templates for phoneme-context-dependent distributions
- Within-word models: concatenating phoneme models based on transcribed or learned phonetic transcriptions
- Word sequence models: smoothed n-grams

Current Speech Recognition & Language Modeling: Local
- Acoustic model: GMM = local generalization only, Euclidean distance
- Within-phoneme model: HMM = local generalization with time-warping-invariant similarity
- Within-word models: exact template matching
- Word sequence models: n-grams = non-parametric template matching (histograms) with a suffix prior (use longer suffixes if there is enough data); a minimal smoothed n-gram sketch follows below
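For concreteness, a minimal sketch of the smoothed n-gram baseline in Python, using simple interpolation between bigram and unigram maximum-likelihood estimates (the interpolation weight lam is an assumption, not a value from the talk):

from collections import Counter

corpus = "the cat is walking in the bedroom".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def interpolated_bigram_prob(prev, word, lam=0.7):
    """P(word | prev) = lam * P_ML(word | prev) + (1 - lam) * P_ML(word)."""
    p_bigram = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_unigram = unigrams[word] / total
    return lam * p_bigram + (1 - lam) * p_unigram

print(interpolated_bigram_prob("the", "cat"))   # seen bigram: high probability
print(interpolated_bigram_prob("the", "dog"))   # never-seen word: probability 0.0

Because the model only matches observed suffix histograms, word sequences unseen in training get essentially no probability; this is the local-generalization limitation that the following slides contrast with neural language models.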

Deep & Distributed NLP
- See the “Neural Net Language Models” Scholarpedia entry
- NIPS’2000 and JMLR 2003, “A Neural Probabilistic Language Model” (sketched below)
- Each word represented by a distributed continuous-valued code
- Generalizes to sequences of words that are semantically similar to training sequences
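A minimal sketch, assuming a PyTorch implementation, of the core of that NIPS’2000 / JMLR 2003 model: each word is mapped to a learned continuous code, the codes of the previous context words are concatenated, and a softmax over the vocabulary scores the next word. All sizes and names here are illustrative rather than the paper's.

import torch
import torch.nn as nn

class NeuralProbabilisticLM(nn.Module):
    """Next-word predictor over a fixed context window of word indices."""
    def __init__(self, vocab_size=10000, embed_dim=50, context_size=3, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # distributed word codes
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) integer word indices
        e = self.embed(context_ids).flatten(start_dim=1)       # concatenate the word codes
        h = torch.tanh(self.hidden(e))
        return self.out(h)                                     # logits over the next word

model = NeuralProbabilisticLM()
context = torch.randint(0, 10000, (2, 3))                      # two toy 3-word contexts
logits = model(context)
print(logits.shape)                                            # torch.Size([2, 10000])

Because similar words end up with nearby codes, probability mass spreads to word sequences that were never observed but are semantically close to training sequences.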

Generalization through distributed semantic representation
- Training sentence: “The cat is walking in the bedroom”
- can generalize to: “A dog was running in a room”
- because of the similarity between the distributed representations for (a, the), (cat, dog), (is, was), etc.

Results with deep distributed representations for NLP
- (Bengio et al. 2001, 2003): beating n-grams on small datasets (Brown & APNews), but much slower
- (Schwenk et al. 2002, 2004, 2006): beating a state-of-the-art large-vocabulary speech recognizer using a deep & distributed NLP model, with *real-time* speech recognition
- (Morin & Bengio 2005; Blitzer et al. 2005; Mnih & Hinton 2007, 2009): better & faster models through hierarchical representations
- (Collobert & Weston 2008): reaching or beating the state of the art on multiple NLP tasks (SRL, POS, NER, chunking) thanks to unsupervised pre-training and multi-task learning
- (Bai et al. 2009): ranking & semantic indexing (information retrieval)

Thank you for your attention!
- Questions?
- Comments?