Deep Learning for Efficient Discriminative Parsing
Niranjan Balasubramanian, September 2nd, 2015
Slides based on Ronan Collobert's paper and video.

Motivation
- How far can one go without architecting features?
- Feature engineering is difficult.
- Adapting to different domains/tasks requires work.

Task Formulations
- Parsing as recognizing strings in a grammar.
- Parsing as finding the maximum-weighted spanning tree.
- Parsing as predicting a sequence of transitions.

Formulation: High-level Idea
A parse tree can be represented as a set of tags at each level.

Formulation: High-level Idea
Tags are encoded with the Inside, Outside, Begin, End, Single (IOBES) scheme.
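To make the encoding concrete, here is a small illustrative sketch (not from the slides) that converts the phrase spans of one tree level into IOBES tags; the toy spans reuse the "stocks kept falling" example from a later slide.

```python
def iobes_tags(n_words, phrases):
    """Encode the phrases of one tree level as IOBES tags.

    phrases: list of (label, start, end) spans (end exclusive),
    assumed non-overlapping within a single level.
    """
    tags = ["O"] * n_words                      # Outside by default
    for label, start, end in phrases:
        if end - start == 1:
            tags[start] = "S-" + label          # Single-word phrase
        else:
            tags[start] = "B-" + label          # Begin
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label          # Inside
            tags[end - 1] = "E-" + label        # End
    return tags

# One level over "stocks kept falling": NP covers "stocks", VP covers "kept falling".
print(iobes_tags(3, [("NP", 0, 1), ("VP", 1, 3)]))   # ['S-NP', 'B-VP', 'E-VP']
```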

Formulation: High-level Idea
Task: Predict the IOBES tag for each word, iteratively, level by level, starting from level 1.
Question: Will this terminate?

Formulation: High-level Idea
Task: Predict the IOBES tag for each word, iteratively, level by level, starting from level 1.
Question: Will this terminate?
Not necessarily, but a simple constraint on the tags ensures that it does: every phrase that overlaps a lower-level phrase must be strictly bigger.

Window Approach
1) Words, represented by K features, are embedded into a D-dimensional vector space.
2) Hidden layer + non-linearity.
3) Linear layer assigns a score to each tag.
Key Issue: Long-range dependencies between words are not captured.
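A rough PyTorch-style sketch of the window network (not the authors' code; only the word-identity feature is shown, and the window size, dimensions, and tag count are illustrative defaults):

```python
import torch.nn as nn

class WindowTagger(nn.Module):
    """Window approach: embed a K-word window around the target word,
    pass it through one hidden layer with a non-linearity, and score each tag."""
    def __init__(self, vocab_size, emb_dim=50, window=5, hidden_dim=300, n_tags=19):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # lookup table
        self.hidden = nn.Linear(window * emb_dim, hidden_dim)   # combine the window
        self.act = nn.Hardtanh()                                # non-linearity
        self.out = nn.Linear(hidden_dim, n_tags)                # one score per tag

    def forward(self, window_ids):
        # window_ids: (batch, window) word indices centered on the word to tag
        x = self.embed(window_ids).flatten(start_dim=1)         # (batch, window * emb_dim)
        return self.out(self.act(self.hidden(x)))               # (batch, n_tags)
```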

Modeling Long-distance Dependencies: Incorporate Relative Distance
Key Idea: Use the entire sentence, but incorporate each word's relative distance to the word being tagged.
- Add the relative distance of the neighboring words to the embedding.
- Use a new matrix M^1 to combine the entries from the lookup table.
- The rest of the architecture remains the same!
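A minimal sketch of how the relative-distance feature might be attached to the lookup-table output (the clipping range and dimensions are my assumptions, not from the slides):

```python
import torch
import torch.nn as nn

class DistanceAugmentedLookup(nn.Module):
    """Concatenate each word embedding with an embedding of its relative
    distance to the word currently being tagged (distances are clipped)."""
    def __init__(self, vocab_size, emb_dim=50, dist_dim=5, max_dist=30):
        super().__init__()
        self.max_dist = max_dist
        self.word_embed = nn.Embedding(vocab_size, emb_dim)
        # one row per clipped distance in [-max_dist, max_dist]
        self.dist_embed = nn.Embedding(2 * max_dist + 1, dist_dim)

    def forward(self, word_ids, target_pos):
        # word_ids: (sent_len,) indices for the whole sentence
        positions = torch.arange(word_ids.size(0))
        rel = (positions - target_pos).clamp(-self.max_dist, self.max_dist)
        return torch.cat([self.word_embed(word_ids),
                          self.dist_embed(rel + self.max_dist)], dim=-1)
```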

Sentence-level (Convolutional) Approach
Key Idea: For every word that needs to be tagged:
1) Slide a K-word window over the entire length of the sentence (# of tiles = # of words).
2) For each window position, the convolution produces a fixed D-dimensional representation.
3) A max-over-time layer compresses the whole sequence into a single D-dimensional vector.
4) The rest of the architecture is the same.
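A minimal PyTorch-style sketch of the convolution plus max-over-time step (dimensions and window size are assumptions; it consumes the per-word features produced by the lookup above):

```python
import torch.nn as nn

class SentenceConvEncoder(nn.Module):
    """Slide a K-word convolution over the whole sentence, then take the
    max over time to obtain one fixed-size vector for the word being tagged."""
    def __init__(self, in_dim=55, conv_dim=300, window=5):
        super().__init__()
        # 'same' padding keeps one output position per word (# of tiles = # of words)
        self.conv = nn.Conv1d(in_dim, conv_dim, kernel_size=window,
                              padding=window // 2)

    def forward(self, features):
        # features: (sent_len, in_dim), e.g. word + distance embeddings
        x = features.t().unsqueeze(0)            # (1, in_dim, sent_len)
        c = self.conv(x)                         # (1, conv_dim, sent_len)
        return c.max(dim=2).values.squeeze(0)    # (conv_dim,): max over time
```

The max over time is what turns a variable-length sentence into a fixed-size vector, so the rest of the window architecture can be reused unchanged.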

Structured Prediction
Key Issue: The model still makes independent predictions for each word; the tag chosen for one word should influence the tag of the next.

Structured Prediction
Key Idea: Output decisions on neighboring words should influence each other.
- Treat tagging as a sequence prediction problem.
- Learn the transitions between tags as parameters as well.
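Concretely, the score of a complete tag path adds the per-word network scores and the learned transition scores; a minimal sketch (the names f and A are hypothetical, not from the slides):

```python
def path_score(f, A, tags):
    """Score of one tag sequence.

    f: (sent_len, n_tags) per-word tag scores from the network,
    A: (n_tags, n_tags) learned transition scores,
    tags: list of tag indices, one per word.
    """
    score = f[0][tags[0]]
    for t in range(1, len(tags)):
        score = score + A[tags[t - 1]][tags[t]] + f[t][tags[t]]
    return score
```

At test time, the highest-scoring path under these scores can be found with Viterbi-style dynamic programming.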

Structured Prediction Constraints: Chunk History and Tree-ness
- Use chunks from the previous level as history.
- If several chunks overlap, use only the largest chunk.
- e.g., at level 4 the chunks in the history would be: NP for "stocks" and VP for "kept falling".
- Use the IOBES tags of the chunk words as features.

Structured Prediction Constraints: Chunk History and Tree-ness
- Not all tag paths lead to valid trees.
- Simple rules/constraints disallow invalid trees.
- Perform inference only over the restricted set of paths.

Training: Objective & Optimization
Find parameters that maximize the log-likelihood of the data, i.e., the log probability of the tag sequence given the input sequence and the parameters (theta).
The logadd component can become exponential in the length of the sentence; a recursive definition (a la Viterbi inference) makes it tractable.
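In the notation of Collobert et al., the sentence-level log-likelihood and the recursion that keeps the logadd tractable take roughly the following form (a reconstruction, not copied from the slide):

```latex
\log p(y \mid x, \theta) \;=\; s(x, y, \theta) \;-\; \operatorname{logadd}_{\forall y'} \, s(x, y', \theta),
\qquad
s(x, y, \theta) \;=\; \sum_{t=1}^{T} \bigl( A_{y_{t-1}, y_t} + f_\theta(y_t \mid x, t) \bigr)

% The logadd over all tag paths is computed recursively, Viterbi-style:
\delta_t(k) \;=\; f_\theta(k \mid x, t) \;+\; \operatorname{logadd}_{j} \bigl( \delta_{t-1}(j) + A_{j,k} \bigr),
\qquad
\operatorname{logadd}_{\forall y'} s(x, y', \theta) \;=\; \operatorname{logadd}_{k} \, \delta_T(k)
```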

Experimental Results: Is this any good?
- Comparable to prior feature-based approaches, which is often the case for deep learning results in NLP.
- Benefits are more likely when applying the model to a new domain or language.
- What next? Add some of the linguistically motivated features back?

Summary and Issues
- A somewhat complicated architecture that achieves state-of-the-art results.
- Seems to force-fit a naturally recursive problem into a CNN; recurrent neural networks appear more natural.