
A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning
Ronan Collobert, Jason Weston
Presented by Jie Peng

Motivation
Most existing NLP systems possess few characteristics that would help develop a unified architecture:
– (i) they are shallow, in the sense that the classifier is often linear;
– (ii) to perform well with a linear classifier they must incorporate many hand-engineered features specific to the task;
– (iii) they cascade features learnt separately from other tasks, thus propagating errors.
The objective of this paper is to define a unified architecture for Natural Language Processing that:
– learns features relevant to the tasks at hand given very limited prior knowledge;
– is achieved by training a deep neural network, building upon work by (Bengio & Ducharme, 2001) and (Collobert & Weston, 2007).
All the tasks except the language model are supervised tasks with labeled training data. The language model is trained in an unsupervised fashion on the whole of Wikipedia, making the overall approach semi-supervised.

Motivation
Multitask Learning (MTL): learn features on one task that are also useful for other tasks. Approaches include sharing parameters, sharing hidden nodes, or creating a common set of features.

NLP Tasks in This Paper
– Part-of-Speech Tagging (POS): syntactic roles (noun, adverb, ...)
– Chunking: syntactic constituents (noun phrase, verb phrase, ...)
– Named Entity Recognition (NER): person / company / location, ...
– Semantic Role Labeling (SRL): giving a semantic role to a syntactic constituent of a sentence
– Language Models: a language model traditionally estimates the probability of the next word being w in a sequence
– Semantically Related Words: predicting whether two words are semantically related (synonyms, holonyms, hypernyms, ...)

Multitask Learning in NLP
– Cascading features: the most obvious way to approach MTL in NLP is to train one task and use its output as features for another task.
– Shallow joint training: hand-craft features and train on all tasks at the same time, e.g. Conditional Random Fields and statistical parsing models for POS, NER, chunking and relation extraction. Problem: requires jointly labeled data.
– Learn each task independently on different training sets, but leverage the predictions jointly. Problem: the shared tasks are not used during training.
– SRL by joint inference with semantic parsing. Problem: not state of the art.

Multitask Learning in NLP
Multitask Learning: learn features on one task that are also useful for other tasks, e.g. by sharing parameters, sharing hidden nodes, or creating a common set of features.
Cascading features: good or bad? Errors propagate, and hand-made features must be extracted for each task.
How about an end-to-end system that learns the features completely implicitly? That is the motivation of this paper.

The Most Intuitive Way
Craft a deep architecture and train everything by back-propagation, end-to-end.

How Does the System Evolve?
Train all tasks jointly: this can gain performance on all tasks and fully exploit the potential of the shared features.

The Implementation of the Intuitive Way
But there is a problem: for SRL, to tag a word with respect to a verb, the full sentence must be seen at the same time. How to solve this is the highlight of this paper.

The General Deep NN Architecture

How to Represent a Word in a Common Way?
The first layer has to map words into real-valued vectors for processing by subsequent layers. Each word is embedded into a d-dimensional space using a lookup table: words are treated as indices in a finite dictionary of words D. W is a matrix of parameters to be learnt, of size d x |D|; its i-th column W_i is the embedding of word i, and d is the word vector size (wsz) chosen by the user. An input sequence of n words from the dictionary D is thus transformed into a sequence of n vectors. Pre-processing can be applied, e.g. stemming and lower-casing, with capitalization kept as an additional feature.
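A minimal sketch of this lookup-table layer, assuming a toy dictionary D and an illustrative word vector size d; all names and sizes here are for illustration only:

import numpy as np

D = {"the": 0, "cat": 1, "sat": 2}           # finite dictionary of words
d = 4                                        # word vector size wsz, chosen by the user
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, len(D)))  # d x |D| parameter matrix, learnt by backprop

def lookup(sentence):
    """Map a sequence of n words in D to a sequence of n d-dimensional vectors."""
    idx = [w.lower() for w in sentence]      # pre-processing: lower-casing
    return W[:, [D[w] for w in idx]]         # column W[:, i] is the embedding of word i

print(lookup(["The", "cat", "sat"]).shape)   # (4, 3)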

How to Represent a Word in a Common Way?
A word can also be decomposed into K discrete features, so that each word is represented as a tuple of K indices, one per feature. For SRL, classifying with respect to a predicate adds such a feature: the relative position of the word with respect to the verb being labeled.

How to Represent a Word in a Common Way?
Each feature k has its own lookup table with embedding dimension d_k, so a word i is embedded in a (d_1 + ... + d_K)-dimensional space by concatenating all the lookup-table outputs. This defines a unified representation of a word. The lookup-table layer thus maps the original sentence into a sequence of n identically sized vectors of dimension d = d_1 + ... + d_K. There remains a problem: the length n of the sequence varies from sentence to sentence, but standard NNs cannot handle sequences of variable length.
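A sketch of concatenating K per-feature lookup tables (here K = 2: word identity and a capitalisation flag); the feature choices and sizes are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
W_word = rng.normal(scale=0.1, size=(4, 3))  # d1 x |D1|: word lookup table
W_caps = rng.normal(scale=0.1, size=(2, 2))  # d2 x |D2|: capitalised-or-not lookup table

def embed(word_idx, caps_idx):
    # Word i is embedded in a (d1 + d2)-dimensional space by concatenation.
    return np.concatenate([W_word[:, word_idx], W_caps[:, caps_idx]])

print(embed(1, 0).shape)                     # (6,)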

How to Solve the Problem?
One possible way is to use a fixed window around the word to be labeled:
– works for simple tasks like POS;
– fails on more complex tasks like SRL, because the role of a word may depend on words far away in the sentence, outside the window.
This paper instead proposes using TDNNs. When modeling long-distance dependencies is important, Time-Delay Neural Networks (TDNNs) are a better choice; here "time" refers to the position of a word in the sentence. A TDNN "reads" the sequence in an online fashion: at time t it sees the t-th word, and it performs a convolution over windows of the sequence, where the convolution weights are the parameters of the layer and the output dimension is the number of hidden units (nhu).

How to Solve the Problem?
A classical window approach only considers words in a window of size ksz around the word to be labeled, whereas a TDNN considers all windows of ksz words in the sentence at the same time: it shares its weights through time. TDNN layers can also be stacked, so that lower layers extract local features and subsequent ones extract more global features, as is typically done in convolutional networks for vision tasks. How, then, to handle sequences of variable length? Take all windows of words in the sentence to predict the label of the one word of interest, and add a layer that captures the most relevant features over the sentence by feeding the TDNN layer(s) into a "Max" layer.
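A minimal sketch of such a TDNN/convolution layer: one shared affine map applied to every window of ksz consecutive word vectors (weights shared through time). The matrix shapes and the window flattening used here are illustrative assumptions:

import numpy as np

def tdnn(X, L, b, ksz):
    """X: d x n word-vector sequence; L: nhu x (d*ksz) weights; b: length-nhu bias.
    Returns an nhu x (n - ksz + 1) sequence of local feature vectors."""
    d, n = X.shape
    windows = [X[:, t:t + ksz].reshape(-1) for t in range(n - ksz + 1)]
    return L @ np.stack(windows, axis=1) + b[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 9))                     # a 9-word sentence with wsz = 50
L = rng.normal(scale=0.1, size=(100, 50 * 3))    # ksz = 3, nhu = 100 hidden units
b = np.zeros(100)
print(tdnn(X, L, b, ksz=3).shape)                # (100, 7)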

Max Layer
The local feature vectors extracted by the convolutional layers have to be combined into a global feature vector of fixed size, independent of the sentence length, so that standard affine layers can be applied afterwards. Traditional networks often average over time, but averaging makes little sense here: in general, most words in the sentence have no influence on the semantic role of the word to tag. Instead, a max operation is used, which forces the network to keep only the most useful local features produced by the convolutional layer: each component of the output is the maximum over time of the corresponding component of the convolutional layer's output.
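A sketch of the max layer: a component-wise max over time turns the variable-length sequence of local feature vectors into one fixed-size global feature vector.

import numpy as np

def max_over_time(H):
    """H: nhu x (n - ksz + 1) output of the convolutional layer."""
    return H.max(axis=1)                     # fixed-size vector, independent of n

H = np.random.default_rng(0).normal(size=(100, 7))
print(max_over_time(H).shape)                # (100,)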

Deeper Architecture
A TDNN layer performs a linear operation over the input sequence. A linear approach works for POS and NER but NOT for SRL: more complex tasks like SRL require nonlinear models. We can therefore add one or more nonlinear layers to the NN; the output of such a layer is a nonlinearity (the paper uses a "hard" version of tanh) applied to an affine transform of its input, and the whole network is still trained by back-propagation.
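A sketch of one such nonlinear layer, a "hard" tanh squashing applied to an affine transform of the global feature vector; the layer sizes are illustrative assumptions:

import numpy as np

def hard_tanh(x):
    return np.clip(x, -1.0, 1.0)

def nonlinear_layer(x, M, c):
    return hard_tanh(M @ x + c)

rng = np.random.default_rng(0)
x = rng.normal(size=100)                     # global feature vector from the max layer
M = rng.normal(scale=0.1, size=(100, 100))   # 100 hidden units
c = np.zeros(100)
print(nonlinear_layer(x, M, c).shape)        # (100,)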

Deeper Architecture
The size of the last output layer is the number of classes considered in the NLP task. A softmax layer follows, to ensure positivity and summation to 1, which allows the outputs to be interpreted as class probabilities. The whole network is trained with the cross-entropy criterion.
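A sketch of this output stage: a softmax over per-class scores and the cross-entropy (negative log-likelihood) criterion; the scores below are made-up numbers:

import numpy as np

def softmax(z):
    z = z - z.max()                          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(scores, true_class):
    return -np.log(softmax(scores)[true_class])

scores = np.array([2.0, 0.5, -1.0])          # one score per class from the last layer
print(cross_entropy(scores, true_class=0))   # small loss when the true class scores highest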

Multitask Learning
Tasks being related means that features useful for one task might be useful for the others. The deepest layer implicitly learns a feature vector for each word in D, so it is reasonable to expect that training NNs on related tasks that share their deep layers would improve performance. The NN is trained by looping over the tasks:
1. Select the next task.
2. Select a training example for this task at random.
3. Update the NN by taking a gradient step w.r.t. this example.
4. Go to 1.
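A minimal sketch of this joint training loop; `tasks` and `sgd_step` are hypothetical placeholders standing in for per-task datasets and update functions, and in the real model every task's network shares its deepest layers (the word lookup tables):

import random

def train_jointly(tasks, n_steps):
    for _ in range(n_steps):
        task = random.choice(tasks)              # 1. select the next task
        example = random.choice(task["data"])    # 2. pick one of its training examples at random
        task["sgd_step"](example)                # 3. gradient step w.r.t. this example
                                                 # 4. go back to 1

# Dummy stand-ins for POS, chunking, SRL, ...
tasks = [{"data": list(range(10)), "sgd_step": lambda ex: None} for _ in range(3)]
train_jointly(tasks, n_steps=100)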

Example of MTL

Leveraging Unlabeled Data
Labeling in NLP is expensive, while unlabeled data are abundant. We can leverage unlabeled data through unsupervised learning and jointly train supervised and unsupervised tasks. The language model is cast as a two-class classification task: is the word in the middle of the input window related to its context or not? A ranking-type cost is used. The dataset for this task consists of all possible ksz-word windows of text from the whole of English Wikipedia.
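A sketch of the ranking-type cost: the score of a genuine Wikipedia window should exceed, by a margin of 1, the score of the same window with its middle word replaced by a random word. The scores below are made-up numbers.

def ranking_loss(score_true, score_corrupted):
    return max(0.0, 1.0 - score_true + score_corrupted)

print(ranking_loss(score_true=2.3, score_corrupted=0.1))   # 0.0: margin satisfied, no cost
print(ranking_loss(score_true=0.2, score_corrupted=0.4))   # 1.2: genuine window ranked too low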

Language Model Results
The language model is trained alone first, and the embedding obtained in the lookup table is very successful: semantically related words end up close to each other in the embedding space. The word lookup tables of the supervised networks are then simply initialized with the embeddings computed by the language model.

Experiments
– Sections of the PropBank dataset (a labeled community dataset) are used for training and testing the SRL task.
– POS, NER and chunking were trained with the window version, ksz = 5; linear models for POS and NER, a hidden layer of 200 units for chunking.
– The language model was trained with ksz = 5 and 100 hidden units, using two lookup tables: one of dimension wsz (the word) and one of dimension 2 (capitalized or not).
– The SRL NN had a convolution layer with ksz = 5 and 100 hidden units, followed by another layer of 100 units, and used three lookup tables: one for the word and two for relative distances.
– The language model was trained on Wikipedia (631 million words) with ksz = 11 and 100 hidden units, using only one lookup table (the word in lower case). Training it takes about a week on one computer.

Experiments
A deep architecture for SRL improves by learning auxiliary tasks that share the first layer, which represents words as wsz-dimensional vectors. The paper reports per-word error rates for wsz = 15, 50 and 100 and various combinations of shared tasks.

Experiments
Test error versus number of training epochs over PropBank, for the SRL task alone and for SRL jointly trained with various other NLP tasks, using deep NNs. Training was achieved in a few epochs (about a day) over the PropBank dataset. All MTL experiments performed better than SRL alone, and with larger wsz (i.e. larger capacity) the relative improvement from MTL over training the task alone becomes larger, which shows that MTL is a good way of regularizing.

Experiments
SRL with the language model, i.e. semi-supervised training, achieves the best results, beating the state of the art: 14.30% vs. 16.54% error. Modest improvements to POS and chunking are also obtained with MTL: 2.91% error for POS and 3.8% (92.71 F-measure) for chunking, which are state of the art.

Conclusion
A general deep NN architecture for NLP is feasible. Advantages:
– avoids hand-crafted feature engineering and requires little empirical NLP expertise;
– extremely fast;
– MTL can improve performance;
– when leveraging semi-supervised learning, results are superior;
– very convenient to implement;
– one architecture can be applied to various tasks.
This is an important result, given that the NLP community considers syntax a mandatory feature for semantic extraction.

Reference
Collobert R., Weston J., Bottou L., et al. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 2011, 12: 2493-2537.