Deep Learning in NLP: Word Representation and How to Use It for Parsing


Deep Learning in NLP: Word Representation and How to Use It for Parsing. Sig Group Seminar Talk, Wenyi Huang (harrywy@gmail.com)

Existing NLP Applications: language modeling, speech recognition, machine translation, part-of-speech tagging, chunking, named entity recognition, semantic role labeling, sentiment analysis, paraphrasing, question answering, and word-sense disambiguation.

Word Representation. One-hot representation: movie = [0 0 0 0 0 0 0 0 1 0 0 0 0] and film = [0 0 0 0 0 1 0 0 0 0 0 0 0], so their dot product is 0 and the representation captures no similarity between the two words. Distributional-similarity-based representations: you can get a lot of value by representing a word by means of its neighbors. Class-based (hard) clustering word representations: Brown clustering (Brown et al., 1992), exchange clustering (Martin et al., 1998; Clark, 2003). Soft clustering word representations: LSA/LSI, LDA.
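A minimal sketch (with an assumed toy vocabulary) of why one-hot vectors carry no similarity information:

```python
import numpy as np

# Toy vocabulary for illustration; the indices are arbitrary assumptions.
vocab = ["the", "a", "movie", "film", "watch", "popcorn"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

movie, film = one_hot("movie"), one_hot("film")
# Distinct one-hot vectors are always orthogonal, so the representation
# encodes no notion of similarity between "movie" and "film".
print(movie @ film)  # 0.0
```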

Brown Clustering: Example Clusters (from Brown et al., 1992)

The Formulation. Let $V$ be the set of all words seen in the corpus $w_1, w_2, \ldots, w_T$, and let $C : V \rightarrow \{1, 2, \ldots, k\}$ be a partition of the vocabulary into $k$ classes. The model: $p(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid C(w_i))\, p(C(w_i) \mid C(w_{i-1}))$. Quality of a partition $C$: $Q(C) = \frac{1}{n} \sum_{i=1}^{n} \log \big[\, p(w_i \mid C(w_i))\, p(C(w_i) \mid C(w_{i-1})) \,\big]$. This is a class-based bigram model (an HMM over classes).
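A hedged sketch of how the partition quality $Q(C)$ could be computed from maximum-likelihood counts; the corpus, class assignment, and boundary handling here are illustrative assumptions:

```python
import math
from collections import Counter

def partition_quality(corpus, cls):
    """Q(C) ~ (1/n) * sum_i log[ p(w_i | C(w_i)) * p(C(w_i) | C(w_{i-1})) ].
    `corpus` is a list of tokens, `cls` maps each word to a class id.
    Maximum-likelihood estimates from counts; the first token (which has
    no predecessor) is skipped for simplicity."""
    n = len(corpus)
    word_count = Counter(corpus)
    class_count = Counter(cls[w] for w in corpus)
    class_bigram = Counter((cls[a], cls[b]) for a, b in zip(corpus, corpus[1:]))
    q = 0.0
    for prev, w in zip(corpus, corpus[1:]):
        p_w_given_c = word_count[w] / class_count[cls[w]]
        p_c_given_prev = class_bigram[(cls[prev], cls[w])] / class_count[cls[prev]]
        q += math.log(p_w_given_c * p_c_given_prev)
    return q / n
```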

A Sample Hierarchy (from Miller et al., NAACL 2004): words are leaves of a binary hierarchy and receive Huffman-style bit-string codes; shared prefixes indicate similarity (coarser clusters).

Language Model. N-gram model (n = 3): $P(\text{I}, \text{saw}, \text{the}, \text{red}, \text{house}) \approx P(\text{I} \mid \langle s\rangle, \langle s\rangle)\, P(\text{saw} \mid \langle s\rangle, \text{I})\, P(\text{the} \mid \text{I}, \text{saw})\, P(\text{red} \mid \text{saw}, \text{the})\, P(\text{house} \mid \text{the}, \text{red})\, P(\langle/s\rangle \mid \text{red}, \text{house})$. Probabilities are calculated from n-gram frequency counts: $P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \dfrac{\operatorname{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\operatorname{count}(w_{i-(n-1)}, \ldots, w_{i-1})}$.
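A small sketch of maximum-likelihood trigram estimation from counts (the single padded sentence here is only an illustrative corpus):

```python
from collections import Counter

def trigram_model(tokens):
    """Maximum-likelihood trigram model estimated from a (tiny) corpus:
    P(w_i | w_{i-2}, w_{i-1}) = count(w_{i-2}, w_{i-1}, w_i) / count(w_{i-2}, w_{i-1})."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    tri = Counter(zip(padded, padded[1:], padded[2:]))
    bi = Counter(zip(padded, padded[1:]))

    def prob(w, u, v):
        # probability of w given the two preceding words (u, v)
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    return prob

p = trigram_model("I saw the red house".split())
print(p("the", "I", "saw"))   # count(I, saw, the) / count(I, saw) = 1.0
```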

A Neural Probabilistic Language Model (Bengio et al., NIPS 2000 and JMLR 2003). Motivation: an n-gram LM does not take into account contexts farther than a couple of words, and does not take into account the similarity between words. Idea: associate each word $w$ with a distributed feature vector (a real-valued vector in $\mathbb{R}^{m}$, where $m$ is much smaller than the vocabulary size); express the joint probability function $f$ of word sequences in terms of these feature vectors; and learn simultaneously the word feature vectors and the parameters of $f$.

Neural Language Model. Neural architecture: $f(i, w_{t-1}, \ldots, w_{t-n+1}) = g(i, C(w_{t-1}), \ldots, C(w_{t-n+1}))$.

Neural Language Model. Softmax output layer: $P(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$, where $y_i$ is the unnormalized log-probability for each output word $i$, computed as $y = b + U \tanh(d + Hx)$, and $x$ is the word-features layer activation vector $x = (C(w_{t-1}), \ldots, C(w_{t-n+1}))$. The free parameters of the model are $\theta = (b, d, W, U, H, C)$: $b$ the output biases ($|V|$), $d$ the hidden layer biases ($h$), $U$ the hidden-to-output weights ($|V| \times h$), $H$ the hidden layer weights ($h \times (n-1)m$), and $C$ the word features ($|V| \times m$). Training took 4 weeks (40 CPUs) on a 14,000,000-word training set with $|V| = 17{,}964$.
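A minimal forward-pass sketch of this architecture (toy sizes, no direct word-to-output connections, and randomly initialized parameters as stand-ins for trained ones):

```python
import numpy as np

def nplm_forward(context_ids, C, H, d, U, b):
    """One forward pass of a Bengio-style neural LM (simplified sketch).
    context_ids: indices of the n-1 previous words."""
    x = np.concatenate([C[i] for i in context_ids])   # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
    y = b + U @ np.tanh(d + H @ x)                    # unnormalized log-probabilities
    p = np.exp(y - y.max())
    return p / p.sum()                                # softmax over the vocabulary

# Toy sizes (assumptions): |V| = 10, embedding m = 4, hidden h = 8, context n-1 = 2
V, m, h, ctx = 10, 4, 8, 2
rng = np.random.default_rng(0)
C = rng.normal(size=(V, m)); H = rng.normal(size=(h, ctx * m))
d = np.zeros(h); U = rng.normal(size=(V, h)); b = np.zeros(V)
print(nplm_forward([3, 7], C, H, d, U, b).sum())  # ~ 1.0
```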

Neural word Embedding as a distributed representation http://metaoptimize.com/projects/wordreprs/

A neural network for learning word vectors (Collobert et al., JMLR 2011). A word in its context is a positive training sample; a random word substituted into the same context gives a negative training sample: score(cat chills [on] a mat) should exceed score(cat chills [god] a mat). What to feed into the NN: each word is an n-dimensional vector taken from a lookup table $L \in \mathbb{R}^{n \times |V|}$. Training objective: minimize the hinge loss $\max\{0,\, 1 - s_{pos} + s_{neg}\}$ over $\theta$. 3-layer NN: $s = U^{T} f_{\theta}(Wx + b)$, where $f_{\theta}(\cdot)$ is the network nonlinearity. SENNA: http://ml.nec-labs.com/senna/. Window size n = 11, $|V| = 130{,}000$, about 7 weeks of training.
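A sketch of the pairwise ranking objective and a window scorer under assumed shapes; this is not the exact SENNA architecture:

```python
import numpy as np

def ranking_loss(score_pos, score_neg):
    """Pairwise hinge loss: push the score of an observed window above the
    score of a corrupted window by a margin of 1."""
    return max(0.0, 1.0 - score_pos + score_neg)

def window_score(word_ids, L, W, b, u):
    """Score a window of words: look up each word's column in L, concatenate,
    pass through one hidden layer, then a linear output.
    Shapes are illustrative assumptions, not the published configuration."""
    x = np.concatenate([L[:, i] for i in word_ids])   # lookup table L in R^{n x |V|}
    return float(u @ np.tanh(W @ x + b))
```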

Linguistic Regularities in Continuous Space Word Representations (Mikolov et al., 2013). Recurrent neural network model [figure: the one-hot input word w(t) and the context carried over from time t-1 feed the hidden layer; the output y(t) is compared with the one-hot target d(t) to form the error].

Linguistic Regularities in Continuous Space Word Representations (Mikolov et al., 2013). Recurrent neural network model: the input vector $w(t)$ represents the input word at time $t$, encoded as a one-hot vector; the output layer $y(t)$ produces a probability distribution over words; the hidden layer $s(t)$ maintains a representation of the sentence history. $w(t)$ and $y(t)$ have the same dimension as the vocabulary. Model: $s(t) = f(Uw(t) + Ws(t-1))$ and $y(t) = g(Vs(t))$, where $f$ is the sigmoid function and $g$ is the softmax function.
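A minimal sketch of one recurrent step under these equations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnnlm_step(w_t, s_prev, U, W, V):
    """One step of the simple recurrent LM:
    s(t) = sigmoid(U w(t) + W s(t-1)),  y(t) = softmax(V s(t)).
    w_t is a one-hot vector, so U @ w_t just selects a column of U."""
    s_t = sigmoid(U @ w_t + W @ s_prev)
    y_t = softmax(V @ s_t)
    return s_t, y_t
```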

Linguistic Regularities in Continuous Space Word Representations (Mikolov et al., 2013). Training: stochastic gradient descent (SGD). Objective (error) function: $\mathrm{error}(t) = d(t) - y(t)$, where $d(t)$ is the desired vector, i.e. the one-hot encoding of $w(t)$. Go through all the training data iteratively and update the weight matrices $U$, $V$ and $W$ online (after processing every word); training is performed over several epochs (usually 5-10). Where is the word representation? In the input weight matrix $U$: each column of $U$ is the learned vector of one vocabulary word.

Linguistic Regularities in Continuous Space Word Representations (Mikolov et al., 2013). Measuring linguistic regularity: a syntactic/semantic test. These representations are surprisingly good at capturing syntactic and semantic regularities in language, and each relationship is characterized by a relation-specific vector offset (e.g. vector("king") - vector("man") + vector("woman") lies close to vector("queen")).
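A small sketch of the vector-offset analogy test, assuming a dictionary mapping words to (already trained) embedding vectors:

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest (by cosine) to
    vec(b) - vec(a) + vec(c); e.g. analogy("man", "king", "woman", ...)
    should recover "queen" if the embeddings capture the offset."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in vectors.items():
        if w in (a, b, c):
            continue
        sim = float(v @ target) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```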

Exploiting Similarities among Languages for Machine Translation (Mikolov et al., 2013; http://arxiv.org/pdf/1309.4168.pdf). Figure 1: distributed word vector representations of numbers and animals in English (left) and Spanish (right). The five vectors in each language were projected down to two dimensions using PCA, and then manually rotated to accentuate their similarity. It can be seen that these concepts have similar geometric arrangements in both spaces, suggesting that it is possible to learn an accurate linear mapping from one space to another.
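A hedged sketch of how such a linear mapping could be fit by least squares from a small seed dictionary of paired vectors (the paired data is an assumption here):

```python
import numpy as np

def fit_translation_matrix(X, Z):
    """Least-squares estimate of a linear map with W @ x_i ~ z_i, where the
    rows of X (source-language vectors) and Z (target-language vectors) are
    aligned by a seed dictionary. A sketch of the idea behind the figure."""
    A, *_ = np.linalg.lstsq(X, Z, rcond=None)   # solves min_A ||X A - Z||_F^2
    return A.T                                  # so that W @ x maps source -> target

# To "translate" a new source vector x: find the target word nearest to W @ x.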

Improving Word Representations via Global Context and Multiple Word Prototypes (Huang et al., ACL 2012)

Improving Word Representations via Global Context and Multiple Word Prototypes (Huang et al., ACL 2012). Improves Collobert & Weston's model: the training objective $\max\{0,\, 1 - s_{pos} + s_{neg}\}$ becomes $\max\{0,\, 1 - s(pos, d) + s(neg, d)\}$, where $d$ is the document context (a weighted sum of the word vectors in the document); a rough sketch of this scoring is given below.
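A loose sketch of scoring with an added global document vector; the exact network in the paper differs, and the uniform document weighting here is an assumption:

```python
import numpy as np

def score_with_global_context(window_ids, doc_ids, L, W, b, u, weights=None):
    """Score a local window together with a global document vector d, where d
    is a weighted average of the document's word vectors (uniform weights
    assumed here). Shapes of W, b, u are illustrative assumptions."""
    local = np.concatenate([L[:, i] for i in window_ids])
    w = np.ones(len(doc_ids)) if weights is None else weights
    d = (L[:, doc_ids] * w).sum(axis=1) / w.sum()    # document context vector
    x = np.concatenate([local, d])
    return float(u @ np.tanh(W @ x + b))
```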

Improving Word Representations via Global Context and Multiple Word Prototypes (Huang et al., ACL 2012). The model: cluster each word's occurrences based on their contexts, then retrain the model so that each word receives multiple prototype vectors (one per cluster).

Summary: Word Representation and Language Model.
Bengio et al. (2003): Associated Press (AP) News from 1995 and 1996, 14,000,000 words, vocabulary of 17,964; trained for 4 weeks on 40 CPUs.
Collobert & Weston: English Wikipedia + Reuters RCV1, 130,000 words, 50 dimensions; trained for 7 weeks.
Mikolov: broadcast news, 82,390 words, dimensions 80, 640 and 1600; trained for several days.
Huang (2012): English Wikipedia, 100,232 words, 50 dimensions; multi-prototype vocabulary of 6,000 words with 10 clusters per word.

Parsing. What we want: [figure: a parse tree for an example sentence].

Using the word vector space model: the meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.

Recursive Neural Networks for Structure Prediction. Inputs: the representations of two candidate children. Outputs: the semantic representation of the node obtained by merging them, and a score for how plausible the new node would be; a minimal composition function is sketched below.
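A minimal sketch of the composition and scoring step (assumed parameter shapes; tanh as the nonlinearity):

```python
import numpy as np

def compose(c1, c2, W, b, u):
    """Recursive-NN composition: parent vector from the concatenated children,
    plus a scalar score for how plausible the merge is.
    W in R^{n x 2n}, b and u in R^n (illustrative shapes)."""
    x = np.concatenate([c1, c2])   # [c1; c2] in R^{2n}
    p = np.tanh(W @ x + b)         # parent representation, same size as the children
    score = float(u @ p)           # plausibility of merging these two nodes
    return p, score
```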

Recursive Neural Networks

Parsing Example with an RNN


Parsing with Compositional Vector Grammars (Socher et al., ACL 2013). A Compositional Vector Grammar (CVG) combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations.

Parsing with Compositional Vector Grammars (Socher et al., ACL 2013). Probabilistic Context-Free Grammars (PCFGs): a PCFG consists of a context-free grammar $G = (N, \Sigma, S, R)$ and a parameter $q(\alpha \rightarrow \beta)$ for each rule $\alpha \rightarrow \beta \in R$. The parameter $q(\alpha \rightarrow \beta)$ can be interpreted as the conditional probability of choosing rule $\alpha \rightarrow \beta$ in a left-most derivation, given that the non-terminal being expanded is $\alpha$. For any $X \in N$ we have the constraint $\sum_{\alpha \rightarrow \beta \in R,\ \alpha = X} q(\alpha \rightarrow \beta) = 1$. Given a parse tree $t \in T_G$ containing rules $\alpha_1 \rightarrow \beta_1, \alpha_2 \rightarrow \beta_2, \ldots, \alpha_n \rightarrow \beta_n$, the probability of $t$ under the PCFG is $p(t) = \prod_{i=1}^{n} q(\alpha_i \rightarrow \beta_i)$. Converting the grammar to Chomsky Normal Form yields binary parse trees.
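A small worked example of $p(t)$ as the product of rule probabilities, using a toy CNF-style grammar with illustrative probabilities:

```python
import math

def tree_log_prob(rules, q):
    """log p(t) = sum_i log q(alpha_i -> beta_i) over the rules used in tree t.
    `rules` is a list of (lhs, rhs) pairs, `q` maps (lhs, rhs) -> probability."""
    return sum(math.log(q[r]) for r in rules)

# Toy grammar (illustrative probabilities, not from any treebank):
q = {("S", ("NP", "VP")): 1.0,
     ("NP", ("D", "N")): 0.6, ("NP", ("N",)): 0.4,
     ("VP", ("V", "NP")): 1.0}
t = [("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("V", "NP")), ("NP", ("D", "N"))]
print(math.exp(tree_log_prob(t, q)))   # 1.0 * 0.4 * 1.0 * 0.6 = 0.24
```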

Parsing with Compositional Vector Grammars (Socher et al., ACL 2013). Define a structured margin loss $\Delta(y_i, \hat{y})$ for predicting a tree $\hat{y}$ for a given correct tree $y_i$: $\Delta(y_i, \hat{y}) = \sum_{d \in N(\hat{y})} \kappa \cdot \mathbf{1}\{d \notin N(y_i)\}$, i.e. each node of the predicted tree that does not appear in the correct tree is penalized by $\kappa$. For a given set of training instances $(x_i, y_i)$, we search for the function $g_\theta$, parameterized by $\theta$, with the smallest expected loss on a new sentence: $g_\theta(x) = \arg\max_{\hat{y} \in Y(x)} s(\mathrm{CVG}(\theta, x, \hat{y}))$, subject to $s(\mathrm{CVG}(\theta, x_i, y_i)) \geq s(\mathrm{CVG}(\theta, x_i, \hat{y})) + \Delta(y_i, \hat{y})$.
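A sketch of the structured margin, treating tree nodes as hashable span/label items; the value of kappa below is only illustrative:

```python
def structured_margin(pred_nodes, gold_nodes, kappa=0.1):
    """Delta(y_i, y_hat): each node of the predicted tree that does not appear
    in the correct tree adds a fixed penalty kappa (illustrative value)."""
    gold = set(gold_nodes)
    return kappa * sum(1 for d in pred_nodes if d not in gold)
```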

Parsing with Compositional Vector Grammars (Socher et al., ACL 2013). The CVG computes the first parent vector via the SU-RNN: $p^{(1)} = f\big(W^{(B,C)} [b;\, c]\big)$, where $[b;\,c]$ stacks the two child vectors and $W^{(B,C)} \in \mathbb{R}^{n \times 2n}$ is now a matrix that depends on the syntactic categories of the two children. The score for each node sums two terms: $s\big(p^{(1)}\big) = v_{(B,C)}^{T} p^{(1)} + \log P(P_1 \rightarrow B\ C)$, where $v \in \mathbb{R}^{n}$ is a vector of parameters that need to be trained and $P(P_1 \rightarrow B\ C)$ comes from the PCFG.
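A sketch of the syntactically untied node computation, with the category-indexed parameters and the PCFG log-probability passed in as assumptions:

```python
import numpy as np

def su_rnn_node(b_vec, c_vec, cat_B, cat_C, W_by_cats, v_by_cats, pcfg_log_prob):
    """CVG node computation (sketch): the composition matrix W^{(B,C)} and the
    scoring vector v_{(B,C)} are selected by the children's syntactic
    categories, and the PCFG log-probability of P -> B C is added to the score."""
    W = W_by_cats[(cat_B, cat_C)]                 # W^{(B,C)} in R^{n x 2n}
    v = v_by_cats[(cat_B, cat_C)]
    p1 = np.tanh(W @ np.concatenate([b_vec, c_vec]))
    score = float(v @ p1) + pcfg_log_prob         # s(p1) = v^T p1 + log P(P -> B C)
    return p1, score
```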

Parsing with CVGs: bottom-up beam search, keeping a k-best list at every cell of the chart; the CVG then acts as a reranker over these candidates. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%.

Q&A Thanks!