Shallow Parsing. Swapnil Chaudhari (11305R011), Ankur Aher (113059006), Raj Dabre (11305R001).

Purpose of the Seminar To emphasize the need for shallow parsing. To impart basic information about techniques for performing shallow parsing. To emphasize the power of CRFs for sequence labeling tasks.

Overall flow Motivation Introduction to Shallow Parsing Overview and limitations of approaches Conditional Random Fields Training Decoding Experiments Terminology Results Conclusions

A quick refresher Identify the core noun phrases: "He reckons the current account deficit will narrow to only 1.8 billion in September." becomes [NP He] reckons [NP the current account deficit] will narrow to only [NP 1.8 billion] in [NP September].

Flow in MT Taken from NLP slides by Pushpak Bhattacharyya, IITB.

Motivation The logical stage after POS tagging is parsing. Constituents of sentences are related to each other, and identifying these relations helps move towards meaning. Identification of base phrases precedes full parsing. Closely related languages need only shallow parsing. It is also crucial to IR: [Picnic locations] in [Alibaug]. [Details] of [Earthquake] in [2006]. [War] between [USA] and [Iraq].

Introduction Shallow parsing is the process of identifying the non-recursive cores of the various phrase types in a given text; it is also called chunking. The most basic work is NP chunking. It results in a flat structure: "Ram goes to the market" is chunked as [Ram-NP] [goes-VG] [to-PP] [the market-NP]. From chunks one can move towards full parsing, and then transfer, for translation purposes.

Identification of the Problem The task is sequence labeling, i.e. identifying a label for each constituent indicating whether it is in or out of a chunk. It can also be treated as a sequence of classification problems, one for each label in the sequence (e.g. with an SVM). There are two major approaches: generative models and discriminative models.

Generative vs. Discriminative Models In generative models we model the joint probability P(X,Y), which factors into P(X|Y) and P(Y), where Y is the label sequence and X the observation sequence. The model (the labels) generates the observations, so given the observations one can recover the labels. Generative models are easy to build, but they do not incorporate arbitrary features and are less effective when data is sparse (though the EM algorithm can be used). An example is the HMM.

Generative vs. Discriminative Models In discriminative models we model P(Y|X) directly. Features distinguish between labels, and a weighted combination of the features decides them. They are harder to design (the choice of features matters) but are overall more effective when data is sparse. Examples are the MEMM and the CRF.

Motivation: Shortcomings of the Hidden Markov Model The HMM models direct dependence between each state and only its corresponding observation: P(X|Y) decomposes into the product of P(x_i|y_i) over all positions i in the input. This independence assumption simplifies the model but ignores potential relations between neighbouring words. The HMM learns a joint distribution of states and observations P(Y,X), but in a prediction task we need the conditional probability P(Y|X). [Figure: HMM graphical model with states Y_1 … Y_n and observations X_1 … X_n.]
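The factorization described here can be made concrete. Below is a minimal sketch (Python) of how an HMM scores a state sequence and an observation sequence jointly as a product of local transition and emission terms; the probability tables are toy numbers, not values from the slides.

```python
# Minimal sketch of the HMM factorization P(Y, X) = prod_i P(y_i | y_{i-1}) * P(x_i | y_i).
# The transition/emission tables below are toy numbers for illustration only.

transition = {                      # P(y_i | y_{i-1}); 'START' is the initial state
    ('START', 'NP'): 0.6, ('START', 'VG'): 0.4,
    ('NP', 'NP'): 0.3, ('NP', 'VG'): 0.7,
    ('VG', 'NP'): 0.8, ('VG', 'VG'): 0.2,
}
emission = {                        # P(x_i | y_i)
    ('NP', 'Ram'): 0.05, ('NP', 'market'): 0.02,
    ('VG', 'goes'): 0.1,
}

def hmm_joint(states, words):
    """Joint probability of a state sequence and an observation sequence."""
    prob = 1.0
    prev = 'START'
    for y, x in zip(states, words):
        prob *= transition[(prev, y)] * emission[(y, x)]
        prev = y
    return prob

print(hmm_joint(['NP', 'VG', 'NP'], ['Ram', 'goes', 'market']))
```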

Solution: Maximum Entropy Markov Model (MEMM) Models the dependence between each state and the full observation sequence explicitly, so it is more expressive than the HMM. It is a discriminative model: it completely ignores modeling P(X), which saves modeling effort, and its learning objective is consistent with the predictive task P(Y|X). [Figure: MEMM graphical model with states Y_1 … Y_n conditioned on the full observation x_{1:n}.]

MEMM: Label bias problem [Figure: a lattice of five states over four observations with locally normalized transition probabilities.] What the local transition probabilities say: state 1 almost always prefers to go to state 2, and state 2 almost always prefers to stay in state 2.

MEMM: Label bias problem Probability of path 1->1->1->1: 0.4 x 0.45 x 0.5 = 0.09.

MEMM: Label bias problem Probability of path 2->2->2->2: 0.2 x 0.3 x 0.3 = 0.018. Other paths: 1->1->1->1: 0.09.

MEMM: Label bias problem Probability of path 1->2->1->2: 0.6 x 0.2 x 0.5 = 0.06. Other paths: 1->1->1->1: 0.09, 2->2->2->2: 0.018.

MEMM: Label bias problem Probability of path 1->1->2->2: 0.4 x 0.55 x 0.3 = 0.066. Other paths: 1->1->1->1: 0.09, 2->2->2->2: 0.018, 1->2->1->2: 0.06.

MEMM: Label bias problem Most likely path: 1->1->1->1, although locally it seems state 1 wants to go to state 2 and state 2 wants to remain in state 2.

MEMM: Label bias problem Most likely path: 1->1->1->1. State 1 has only two outgoing transitions but state 2 has five, so the average transition probability from state 2 is lower.

MEMM: Label bias problem The label bias problem in MEMMs: states with fewer outgoing transitions are preferred over others.
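The arithmetic of this example can be reproduced directly. The sketch below (Python) scores only the four paths listed on the preceding slides, each as a product of the locally normalized transition probabilities an MEMM would use; the rest of the lattice is not specified on the slides and is left out.

```python
# Reproduces the path-probability arithmetic from the label-bias example.
# Each path is scored as a product of locally normalized transition probabilities,
# exactly as an MEMM would do; only the four paths shown on the slides are listed.

paths = {
    '1->1->1->1': [0.4, 0.45, 0.5],
    '2->2->2->2': [0.2, 0.3, 0.3],
    '1->2->1->2': [0.6, 0.2, 0.5],
    '1->1->2->2': [0.4, 0.55, 0.3],
}

scores = {}
for path, probs in paths.items():
    score = 1.0
    for p in probs:
        score *= p
    scores[path] = score

for path, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f'{path}: {score:.3f}')
# 1->1->1->1 wins (0.090) even though, locally, state 1 prefers state 2 and
# state 2 prefers to stay in state 2: state 2 spreads its mass over more transitions.
```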

Interpretation States with fewer outgoing transitions effectively ignore the observations. The normalization is done up to the current position and not over all positions, so uncertainty in tagging decisions at previous positions affects the current decision. Thus a global normalizer is needed.

Conditional Random Fields CRFs define conditional probability distributions P(Y|X) of label sequences given input sequences. X and Y have the same length: x = x_1 … x_n and y = y_1 … y_n. f is a vector of local features with a corresponding weight vector λ. A Markovian assumption on the labels is needed.

Conditional Random Fields The CRF is an undirected graphical model and, like the MEMM, a discriminative one. The use of a global normalizer Z(x) overcomes the label bias problem of the MEMM. It models the dependence between each state and the entire observation sequence (like the MEMM). [Figure: linear-chain CRF with labels Y_1 … Y_n conditioned on the full observation x_{1:n}.]

Some rewriting Let us adopt some easier-to-write notation: Z(x_{1:n}) is written as Z_λ(x); f(y_i, y_{i-1}, x_{1:n}, i) as f(y, x, i); P(y_{1:n} | x_{1:n}) as P_λ(Y|X); and w as λ.

Conditional Random Fields The global feature vector is F(y, x) = Σ_i f(y, x, i), where i ranges over input positions. The transition features f and state features g are combined into one vector. The conditional probability distribution defined by the CRF is then P_λ(Y|X) = exp(λ · F(Y, X)) / Z_λ(X), where Z_λ(x) = Σ_y exp(λ · F(y, x)).
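As a concrete illustration of the global normalizer, here is a toy, brute-force sketch of this distribution in Python. The label set, feature templates, and weights are made up for illustration; enumerating all label sequences to compute Z(x) is only feasible at toy sizes, which is exactly why the forward-backward machinery later on is needed.

```python
import itertools
import math

# Toy linear-chain CRF: P(y | x) = exp(sum_i lam . f(y_{i-1}, y_i, x, i)) / Z(x).
# Features and weights below are illustrative, not the ones used in the paper.

LABELS = ['B', 'I', 'O']

def local_features(y_prev, y_cur, x, i):
    """f(y_{i-1}, y_i, x, i): a couple of indicator features."""
    return {
        f'word={x[i]}|label={y_cur}': 1.0,
        f'trans={y_prev}->{y_cur}': 1.0,
    }

def score(y, x, weights):
    """lam . F(y, x): sum over positions of the weighted local features."""
    total = 0.0
    y_prev = 'START'
    for i, y_cur in enumerate(y):
        for name, value in local_features(y_prev, y_cur, x, i).items():
            total += weights.get(name, 0.0) * value
        y_prev = y_cur
    return total

def conditional_prob(y, x, weights):
    """Globally normalized probability: the normalizer runs over ALL label sequences."""
    num = math.exp(score(y, x, weights))
    z = sum(math.exp(score(cand, x, weights))
            for cand in itertools.product(LABELS, repeat=len(x)))
    return num / z

x = ['the', 'market']
weights = {'word=the|label=B': 1.5, 'word=market|label=I': 1.0, 'trans=B->I': 0.8}
print(conditional_prob(('B', 'I'), x, weights))
```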

Training the CRF We train a CRF by maximizing the log-likelihood of a given training set T = {(x_k, y_k)} for k = 1 … n, i.e. over all training pairs. The log-likelihood is L(λ) = Σ_k [λ · F(y_k, x_k) − log Z_λ(x_k)].

Solving the optimization problem We seek a zero of the gradient ∇L(λ) = Σ_k [F(y_k, x_k) − E_{P_λ(Y|x_k)} F(Y, x_k)], where E_{P_λ(Y|x)} F(Y, x) is the feature expectation under the model. Since the number of label sequences is exponential, the expectation is computed by dynamic programming.

CRF learning Computing marginals using junction-tree calibration. [Figure: the linear-chain CRF Y_1 … Y_n conditioned on x_{1:n} and its junction tree, with cliques (Y_1,Y_2), (Y_2,Y_3), …, (Y_{n-1},Y_n) and separators Y_2, …, Y_{n-1}.]

Solving the optimization problem The feature expectation can be computed with per-position transition matrices M_i(y', y | x) = exp(λ · f(y', y, x, i)): E_{P_λ(Y|x)} F(Y, x) = Σ_i Σ_{y', y} α_{i-1}(y') M_i(y', y | x) f(y', y, x, i) β_i(y) / Z_λ(x), where α_i and β_i are the forward and backward state-cost vectors defined by α_i = α_{i-1} M_i and β_i^T = M_{i+1} β_{i+1}^T, with Z_λ(x) = α_n · 1^T. We can use the forward-backward algorithm to calculate the α_i and β_i and hence the feature expectations.
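These recursions can be sketched in a few lines, assuming the per-position matrices M_i have already been built from the features; the matrices below are random placeholders standing in for real feature scores. The sketch computes Z_λ(x) with the forward pass and the pairwise marginals that enter the feature expectation.

```python
import numpy as np

# Forward-backward over the per-position transition matrices
# M_i[y', y] = exp(lam . f(y', y, x, i)).  Given the M_i, this computes the
# normalizer Z(x) and the pairwise marginals P(y_{i-1}=y', y_i=y | x) that the
# feature-expectation term of the gradient needs.

rng = np.random.default_rng(0)
n_positions, n_labels = 4, 3
M = np.exp(rng.normal(size=(n_positions, n_labels, n_labels)))  # placeholder M_1 ... M_n

# Forward: alpha_i = alpha_{i-1} M_i, starting from an all-ones vector.
alpha = np.zeros((n_positions + 1, n_labels))
alpha[0] = 1.0
for i in range(n_positions):
    alpha[i + 1] = alpha[i] @ M[i]

# Backward: beta_i = M_{i+1} beta_{i+1}, starting from an all-ones vector.
beta = np.zeros((n_positions + 1, n_labels))
beta[n_positions] = 1.0
for i in range(n_positions - 1, -1, -1):
    beta[i] = M[i] @ beta[i + 1]

Z = alpha[n_positions].sum()

# Pairwise marginal at step i: alpha_{i-1}(y') * M_i(y', y) * beta_i(y) / Z.
i = 2
pairwise = alpha[i][:, None] * M[i] * beta[i + 1][None, :] / Z
print(Z, pairwise.sum())   # the pairwise marginals at any position sum to 1
```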

Avoiding overfitting To avoid overfitting, we penalize the likelihood with a Gaussian prior on the weights: L(λ) − ||λ||² / (2σ²).

Decoding the label sequence The most likely label sequence is ŷ = argmax_y P_λ(y|x) = argmax_y λ · F(y, x), since Z_λ(x) does not depend on y. It is found with the Viterbi algorithm.
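A minimal Viterbi sketch under the same linear-chain assumptions; the score arrays stand in for λ · f(y', y, x, i) and are random placeholders rather than trained weights.

```python
import numpy as np

# Viterbi decoding for a linear-chain model: find argmax_y  sum_i score_i(y_{i-1}, y_i).
# Because Z(x) does not depend on y, maximizing lam . F(y, x) is enough.

rng = np.random.default_rng(1)
n_positions, n_labels = 5, 3
start_score = rng.normal(size=n_labels)                               # score of the first label
trans_score = rng.normal(size=(n_positions - 1, n_labels, n_labels))  # score[i][y', y]

def viterbi(start_score, trans_score):
    best = start_score.copy()             # best score of a prefix ending in each label
    backptrs = []
    for step in trans_score:
        totals = best[:, None] + step     # totals[y', y]
        backptrs.append(totals.argmax(axis=0))
        best = totals.max(axis=0)
    # Trace back the highest-scoring path.
    path = [int(best.argmax())]
    for bp in reversed(backptrs):
        path.append(int(bp[path[-1]]))
    return list(reversed(path)), float(best.max())

print(viterbi(start_score, trans_score))
```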

Experiments Experiments were carried out on the CoNLL-2000 and RM (Ramshaw and Marcus) datasets. A 2nd-order Markov dependency between chunk labels was considered. The highest overall F-scores, around 94%, were obtained.

Terminology The input to the NP chunker consists of the words in a sentence annotated automatically with part-of-speech (POS) tags. The chunker's task is to label each word with a label indicating whether the word is outside a chunk (O), starts a chunk (B), or continues a chunk (I). For example, [Rockwell International Corp.] ['s Tulsa unit] said [it] signed [a tentative agreement] extending [its contract] with [Boeing Co.] to provide [structural parts] for [Boeing] ['s 747 jetliners]. has the chunk tags BIIBIIOBOBIIOBIOBIOOBIOBBII.
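A small sketch of this chunk-to-BIO conversion, applied to the bracketed example from the earlier refresher slide; the function name and input format are mine.

```python
# Turn a bracketed chunking into per-word B/I/O tags.
# Chunks are given as lists of words; anything else is outside every chunk.

def chunks_to_bio(segments):
    """segments: a mix of plain words (outside chunks) and lists of words (chunks)."""
    words, tags = [], []
    for seg in segments:
        if isinstance(seg, list):                 # a chunk: B for the first word, I after
            for pos, word in enumerate(seg):
                words.append(word)
                tags.append('B' if pos == 0 else 'I')
        else:                                     # not inside any chunk
            words.append(seg)
            tags.append('O')
    return words, tags

segments = [['He'], 'reckons', ['the', 'current', 'account', 'deficit'],
            'will', 'narrow', 'to', 'only', ['1.8', 'billion'], 'in', ['September']]
words, tags = chunks_to_bio(segments)
print(list(zip(words, tags)))
# [('He', 'B'), ('reckons', 'O'), ('the', 'B'), ('current', 'I'), ...]
```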

Terminology The 2nd-order Markov dependency is encoded by making the CRF labels pairs of consecutive chunk tags: the label at position i is y_i = c_{i-1}c_i, where c_i is the chunk tag of word i, one of O, B, or I. Since B must be used to start a chunk, the label OI is impossible. In addition, successive labels are constrained: y_{i-1} = c_{i-2}c_{i-1} and y_i = c_{i-1}c_i must agree on the shared tag, and c_0 = O. The constraints are enforced by giving the corresponding features a weight of minus infinity, which forces all forbidden labelings to have zero probability (sketched below).
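A sketch of how the pair labels and the minus-infinity constraints could be enumerated; the helper names are mine, and the zero weights stand for weights still to be learned.

```python
import itertools

# Second-order chunk labels are pairs of consecutive chunk tags c_{i-1} c_i.
# 'OI' is impossible (a chunk must start with B), and consecutive labels must agree
# on the shared tag; forbidden cases get a -inf weight so their probability is zero.

CHUNK_TAGS = ['O', 'B', 'I']
PAIR_LABELS = [a + b for a, b in itertools.product(CHUNK_TAGS, repeat=2) if a + b != 'OI']

def transition_weight(prev_label, label):
    """-inf for label sequences the chunk grammar forbids, 0.0 (to be learned) otherwise."""
    if prev_label[1] != label[0]:   # y_{i-1} = c_{i-2} c_{i-1} must match y_i = c_{i-1} c_i
        return float('-inf')
    return 0.0

print(PAIR_LABELS)                  # 8 labels: OI is excluded
print(transition_weight('OB', 'BI'))   # allowed: 0.0
print(transition_weight('OB', 'IB'))   # inconsistent overlap: -inf
```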

Features used 3.8 million features, in a factored representation: p(x, i) is a predicate on the input sequence x and the current position i, and q(y_{i-1}, y_i) is a predicate on pairs of labels. For instance, p(x, i) might be "the word at position i is 'the'" or "the POS tags at positions i-1, i are DT, NN", and q(y_{i-1}, y_i) might be "the current label is IB and the previous label is BI".

The predicate p can be the identity, so that only the labels are considered, or it can be a combination of features consisting of POS tags and words. The more features, the greater the complexity (see the sketch below).
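A sketch of this factored representation: every feature is the conjunction of an input predicate p(x, i) with a label-pair predicate q(y_{i-1}, y_i). The specific predicates and naming scheme are illustrative, not the paper's exact feature set.

```python
# Factored features: a feature fires when an input predicate p(x, i) holds AND a
# label-pair predicate q(y_{i-1}, y_i) holds.  The predicates below are illustrative.

def input_predicates(words, pos_tags, i):
    """p(x, i): facts about the input around position i."""
    preds = [f'word_i={words[i]}', f'pos_i={pos_tags[i]}']
    if i > 0:
        preds.append(f'pos_pair={pos_tags[i-1]},{pos_tags[i]}')
    return preds

def label_predicates(prev_label, label):
    """q(y_{i-1}, y_i): facts about the current label pair."""
    return [f'label={label}', f'labels={prev_label},{label}']

def features(words, pos_tags, prev_label, label, i):
    """The conjunction of every input predicate with every label predicate."""
    return [f'{p}&{q}'
            for p in input_predicates(words, pos_tags, i)
            for q in label_predicates(prev_label, label)]

words = ['the', 'current', 'account', 'deficit']
pos_tags = ['DT', 'JJ', 'NN', 'NN']
print(features(words, pos_tags, 'BI', 'II', 2))
```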

Using the tags For a given position i, w_i is the word, t_i its POS tag, and y_i its label. For any label y = c'c, c(y) = c is the corresponding chunk tag; for example, c(OB) = B. Using chunk tags as well as labels provides a form of backoff from the very small feature counts in a second-order model, and also allows significant associations between tag pairs and input predicates to be modelled.

Evaluation Metrics The standard evaluation metrics for a chunker are precision P (the fraction of output chunks that exactly match the reference chunks), recall R (the fraction of reference chunks returned by the chunker), and the F1 score F1 = 2*P*R / (P + R).
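These metrics are easy to state in code. The sketch below computes chunk-level precision, recall, and F1 by exact span matching over B/I/O tags (helper names are mine); it also reproduces the point of the next slide, where a single tag error destroys recall.

```python
# Chunk-level precision / recall / F1: a predicted chunk counts only if it matches a
# reference chunk exactly (same start and end).

def extract_chunks(tags):
    """Return the set of (start, end) spans of B/I chunks in a tag sequence."""
    chunks, start = set(), None
    for i, tag in enumerate(tags + ['O']):        # sentinel 'O' closes any open chunk
        if tag in ('B', 'O'):
            if start is not None:
                chunks.add((start, i))
                start = None
            if tag == 'B':
                start = i
        # tag == 'I' just extends the open chunk
    return chunks

def prf1(gold_tags, pred_tags):
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# The example from the next slide: BIIIIIII labelled as BIIBIIII.
print(prf1(list('BIIIIIII'), list('BIIBIIII')))   # precision 0, recall 0, F1 0
```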

Labelling Accuracy The accuracy rate for individual labelling decisions is over-optimistic as an accuracy measure for shallow parsing. For instance, if the chunk BIIIIIII is labelled as BIIBIIII, the labelling accuracy is 87.5% (1 of 8 tags is wrong), but the recall is 0. Usually accuracies are high, precisions are lower, and recalls even lower, because complete chunks are not identified.

Comparisons

Model                  F-Score
SVM                    94.39%
CRF                    94.38%
Voted Perceptron       94.09%
Generalized Winnow     93.89%
MEMM                   93.70%

Conclusions CRFs have strong capabilities for sequence labeling tasks and can be used to exploit rich features. They are extremely useful for agglutinative languages, which have many features available after morphological analysis. Even linear-chain CRFs have significant power: better than HMMs and MEMMs, and comparable to SVMs. Training time is a major drawback: increasing the complexity of a CRF increases the training time drastically.

References
Fei Sha and Fernando Pereira (2003). "Shallow Parsing with Conditional Random Fields". Proceedings of HLT-NAACL 2003. Department of Computer and Information Science, University of Pennsylvania.
Roman Klinger and Katrin Tomanek (2007). "Classical Probabilistic Models and Conditional Random Fields". Algorithm Engineering Report, Faculty of Computer Science, Technische Universität Dortmund, Germany.
Animations from slides by Ramesh Nallapati, CMU.