
Chunking
Pierre Bourreau, Cristina España i Bonet
LSI-UPC, PLN-PTM

Plan
- Introduction
- Methods: HMM, SVM, CRF
- Global analysis
- Conclusion

Introduction
- What is chunking? Identifying groups of contiguous words.
  - Example: He is the person you read about.
    [NP He] [VP is] [NP the person] [NP you] [VP read] [PP about].
- First step towards full parsing.

Introduction
- Chunking task in CoNLL-2000
  - Based on a previous POS tagging
  - Chunk tags: B-/I-/O-Chunk
  - Chunk types: ADJP (adjective phrase), ADVP (adverb phrase), CONJP (conjunction phrase), INTJ (interjection), LST (list marker), NP (noun phrase), PP (prepositional phrase), PRT (particles), SBAR (subordinated clause), UCP (unlike coordinated phrase), VP (verb phrase)
  - [The slide also showed the frequency of each chunk type; the figures are not preserved in this transcript.]

Introduction
- Corpus: Wall Street Journal (WSJ)
  - Training set: four sections
  - Test set: one section

Evaluation
- Output file format: Word POS Real-Chunk Processed-Chunk
  Example:
    Boeing    NNP  B-NP  I-NP
    's        POS  B-NP  B-NP
    747       CD   I-NP  I-NP
    jetliners NNS  I-NP  I-NP
    .         .    O     O
- The evaluation script reports precision, recall and the F1 score:
  F_β = (1 + β²) · precision · recall / (β² · precision + recall), with β = 1
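To make the metric concrete, here is a minimal, illustrative sketch of chunk-level evaluation (it is not the official CoNLL conlleval script): chunks are read off the BIO tags as (type, start, end) spans and scored by exact match.

```python
# Minimal chunk-level evaluation sketch (not the official conlleval script).
def extract_chunks(tags):
    """Read (type, start, end) spans off a BIO tag sequence."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):      # sentinel closes the last chunk
        if tag == "O" or tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != ctype):
            if ctype is not None:
                chunks.add((ctype, start, i))
            start, ctype = (i, tag[2:]) if tag != "O" else (None, None)
    return chunks

def precision_recall_f(gold_tags, pred_tags, beta=1.0):
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

print(precision_recall_f(["B-NP", "I-NP", "O", "B-VP"],
                         ["B-NP", "I-NP", "B-VP", "I-VP"]))   # (0.5, 0.5, 0.5)
```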

Plan Introduction Methods  HMM  SVM  CRF Global analysis Conclusion

Hidden Markov Models (HMM)
- A bit of theory...
  - Find the most probable tag sequence for a sentence, given a vocabulary and the set of possible tags (Bayes' theorem).
  - All the states?! Approximations:
    - First order: bigrams
    - Second order: trigrams
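The formulas behind this slide are the standard HMM tagging equations; the reconstruction below (T = tag sequence, W = word sequence) is an assumption about what the original slide showed.

```latex
\hat{T} = \arg\max_{T} P(T \mid W)
        = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
        = \arg\max_{T} P(W \mid T)\, P(T)
% First order (bigrams):
P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}), \qquad
P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
% Second order (trigrams):
P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1})
```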

HMM: Chunking
- Common setting (POS tagging):
  - Input sentence: words
  - Input/output tags: POS
- Chunking:
  - Input sentence: POS tags
  - Input/output tags: chunks
- Problem for the tagger: small vocabulary!

HMM: Chunking
- Solution: specialization
  - Input sentence: POS tags
  - Input/output tags: POS + chunks
- Improvement:
  - Input sentence: special words + POS
  - Input/output tags: special words + POS + chunks

HMM: Chunking
- In practice: modification of the input data (WSJ train and test). Each original line "word POS chunk" becomes an input/output pair; special words keep their lexical form (original → transformed):

    Chancellor NNP O     →  NNP         NNP·O
    of IN B-PP           →  of·IN       of·IN·B-PP
    the DT B-NP          →  the·DT      the·DT·B-NP
    Exchequer NNP I-NP   →  NNP         NNP·I-NP
    Nigel NNP B-NP       →  NNP         NNP·B-NP
    Lawson NNP I-NP      →  NNP         NNP·I-NP
    's POS B-NP          →  POS         POS·B-NP
    restated VBN I-NP    →  VBN         VBN·I-NP
    commitment NN I-NP   →  NN          NN·I-NP
    to TO B-PP           →  to·TO       to·TO·B-PP
    a DT B-NP            →  a·DT        a·DT·B-NP
    firm NN I-NP         →  NN          NN·I-NP
    monetary JJ I-NP     →  JJ          JJ·I-NP
    policy NN I-NP       →  NN          NN·I-NP
    has VBZ B-VP         →  has·VBZ     has·VBZ·B-VP
    helped VBN I-VP      →  helped·VBN  helped·VBN·I-VP
    to TO I-VP           →  to·TO       to·TO·I-VP
    prevent VB I-VP      →  VB          VB·I-VP
    a DT B-NP            →  a·DT        a·DT·B-NP
    freefall NN I-NP     →  NN          NN·I-NP
    in IN B-PP           →  in·IN       in·IN·B-PP
    sterling NN B-NP     →  NN          NN·B-NP
    over IN B-PP         →  over·IN     over·IN·B-PP
    the DT B-NP          →  the·DT      the·DT·B-NP
    past JJ I-NP         →  past·JJ     past·JJ·I-NP
    week NN I-NP         →  NN          NN·I-NP
    . . O                →  .           .·O
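As a sketch of this transformation (the word set below is only a stand-in for the 409 special words of F. Pla, not the real list):

```python
# Specialization sketch: "word POS chunk" -> (input token, output tag).
SPECIAL_WORDS = {"of", "the", "to", "a", "has", "helped", "in", "over", "past"}  # illustrative stand-in

def specialize(word, pos, chunk):
    token = f"{word}·{pos}" if word.lower() in SPECIAL_WORDS else pos
    return token, f"{token}·{chunk}"

print(specialize("Chancellor", "NNP", "O"))   # ('NNP', 'NNP·O')
print(specialize("of", "IN", "B-PP"))         # ('of·IN', 'of·IN·B-PP')
```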

HMM: Results
- Tool: TnT Tagger (Thorsten Brants)
  - Implements the Viterbi algorithm for second-order Markov models
  - Allows evaluating unigram, bigram and trigram models
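For reference, a compact first-order Viterbi decoder (TnT itself implements the second-order variant with suffix-based smoothing; this sketch only illustrates the recursion, with hypothetical log-probability tables trans and emit):

```python
def viterbi(words, tags, trans, emit, start="<s>"):
    """First-order Viterbi over log-probabilities trans[(prev, cur)] and emit[(tag, word)]."""
    NEG = -1e9  # stand-in for log(0)
    best = {t: (trans.get((start, t), NEG) + emit.get((t, words[0]), NEG), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # best previous tag for current tag t
            score, prev = max((best[s][0] + trans.get((s, t), NEG), s) for s in tags)
            new[t] = (score + emit.get((t, w), NEG), best[prev][1] + [t])
        best = new
    score, path = max(best.values(), key=lambda v: v[0])
    return path, score
```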

HMM: Results
- Configuration 1:
  - No special words, no POS
  - Trigrams
  - Default parameters
- Results: far from the best scores (F1 ≈ 94%).
- [Per-chunk table (precision, recall, F1 for ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, UCP and the total): the numbers are not preserved in this transcript; only CONJP = 0.00 and LST = 0.00 survive.]

HMM: Results
- Trying to improve...
- Configuration 2:
  - Lexical specialization (409 words, F. Pla)
  - Trigrams
- Configuration 3:
  - Lexical specialization (409 words, F. Pla)
  - Bigrams (does it make any difference?)

HMM: Results
- [Per-chunk comparison table, bigrams vs. trigrams (precision, recall, F1 for ADJP, ADVP, CONJP, INTJ, NP, PP, PRT, SBAR, UCP, VP and the total): the numbers are not preserved in this transcript.]

HMM: Results
- Comments:
  - Adding specialization information improves the total F1 by about 7 points.
  - That is much more than the improvement from using trigrams instead of bigrams (≈ 1%).
  - As before, NP and PP are the best-identified chunks.
  - Impressive improvement for PRT and SBAR (but their counts are small).

HMM: Results
- Importance of the training set size:
  - Test: divide the training set into 7 parts (≈ 17,000 tokens per part) and compute the results adding one part at a time.
  - Conclusion: performance improves with the training set size (see plot).
  - Limit? Molina & Pla obtained F1 = 93.26% with 18 sections of the WSJ as training set.

HMM: Results
[Plot: F1 as a function of the training-set size; not preserved in this transcript.]

Support Vector Machines (SVM)
- A bit of theory...
  - Objective: maximize the minimum margin.
  - Allow misclassifications, controlled by the C parameter.
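The standard soft-margin objective behind these two bullets (the ξ_i are the slack variables that allow misclassifications):

```latex
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_i
\quad\text{subject to}\quad
y_i \,(w \cdot x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0
```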

SVM
- Tool: SVMTool (Jesús Giménez & Lluís Màrquez)
  - Uses SVMLight (Thorsten Joachims) for learning.
  - A sequential tagger, applied here to chunking: no need to change the input data.
  - Binarizes the multi-class problem in order to apply SVMs.
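The "binarization" corresponds to the usual one-versus-rest reduction. A toy sketch (scikit-learn's LinearSVC is used only as a stand-in for SVMTool/SVMLight, and the window features are illustrative):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def window_features(pos_tags, i, frame=2):
    """POS tags in a +/-frame window around position i."""
    return {f"pos[{d}]": pos_tags[i + d]
            for d in range(-frame, frame + 1) if 0 <= i + d < len(pos_tags)}

pos_tags = ["NNP", "IN", "DT", "NNP", "NNP"]          # toy sentence (POS only)
chunks   = ["B-NP", "B-PP", "B-NP", "I-NP", "I-NP"]   # toy chunk tags

vec = DictVectorizer()
X = vec.fit_transform([window_features(pos_tags, i) for i in range(len(pos_tags))])
clf = LinearSVC(C=0.1).fit(X, chunks)   # trains one binary classifier per chunk tag (one-vs-rest)
print(clf.predict(vec.transform([window_features(pos_tags, 1)])))
```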

SVM
- Features (model 0): [feature template table not preserved in this transcript.]

SVM
- Results:
  - Default parameters, varying C and/or the tagging direction (LR / LRL).
  - Very small variations with this configuration.
  - [Table of overall precision, recall and F1 for C=0.1 LR, C=0.1 LRL, C=1 LR and C=0.05 LR: the numbers are not preserved in this transcript.]

SVM
- Best results:
  - F1 > 90% for the three main chunk types.
  - Modest values for the others.
  - Main difference with respect to HMM is on PP.
  - [Per-chunk table (precision, recall, F1 for ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, VP and the total): the numbers are not preserved in this transcript; only LST = 0.00 survives.]

Conditional Random Fields (CRF)
- A bit of theory...
  - The idea extends HMMs and Maximum-Entropy models.
  - Instead of a chain, we consider a graph G = (V, E), conditioned on the observation sequence variable X.
  - Each node v represents a value Y_v of the output label sequence Y.

Conditional Random Fields
- Model P(y|x) (Lafferty et al., 2001), where y is a label sequence and x an observation sequence:
  - t_j is a transition feature function (over the previous and current labels and the observation sequence).
  - s_k is a state feature function (over the current label and the observation sequence).
  - The λ weighting factors are estimated during training.
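The formula this slide refers to, in its linear-chain form from Lafferty et al. (2001); Z(x) is the normalization over all label sequences:

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\!\Bigg(
      \sum_{i}\sum_{j} \lambda_j \, t_j(y_{i-1}, y_i, x, i)
    + \sum_{i}\sum_{k} \mu_k \, s_k(y_i, x, i) \Bigg)
```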

Conditional Random Fields (CRF)
- Tool: CRF++, developed by Taku Kudo (2nd at CoNLL-2000 with an SVM combination).
- Parameters:
  - Features used: words and POS tags. We tried three alternatives:
    - binary combinations of word + POS within a frame of size 2;
    - the above plus 3-ary combinations of POS tags;
    - POS tags only, within a frame of size 3.
  - Unigram or bigram (output) templates: the score is computed for the current label alone or for the pair of consecutive labels.
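A rough sketch of what such feature combinations look like for a token at position i (offsets, names and the exact combination scheme are illustrative assumptions, not Kudo's actual templates):

```python
from itertools import combinations

def pairwise_word_pos(sent, i, frame=2):
    """Binary combinations of word/POS attributes in a +/-frame window."""
    atoms = []
    for d in range(-frame, frame + 1):
        if 0 <= i + d < len(sent):
            word, pos = sent[i + d]
            atoms += [f"w[{d}]={word}", f"p[{d}]={pos}"]
    return [f"{a}&{b}" for a, b in combinations(atoms, 2)]

def pos_trigrams(sent, i, frame=1):
    """3-ary combinations of the POS tags in a small window."""
    pos = [f"p[{d}]={sent[i + d][1]}"
           for d in range(-frame, frame + 1) if 0 <= i + d < len(sent)]
    return ["&".join(c) for c in combinations(pos, 3)]

sent = [("He", "PRP"), ("is", "VBZ"), ("here", "RB")]
print(pairwise_word_pos(sent, 1)[:3])
print(pos_trigrams(sent, 1))
```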

Conditional Random Fields (CRF)
- Results: [results table not preserved in this transcript.]

Conditional Random Fields (CRF)
- Analysis:
  - Bigram templates with the maximum feature set -> 93.81% global F1.
  - The global F1 score does not depend much on the feature window, but rather on the bigram/unigram selection: tagging pairs of tokens gives more power than tagging single tokens.
  - Occurrences: LST -> 0, INTJ -> 1, CONJP -> 9, so the results for these are identical across settings.
  - PRT is the only chunk type that depends more on the feature window; it works better with size-2 windows, since tagging particles/prepositions (e.g. out, around, in, ...) relies on bigger windows.
  - Roughly the same holds for SBAR (e.g. than, rather, ...).

Conditional Random Fields (CRF)
- How to improve the results?
  - Molina & Pla's method? It should improve the performance on SBAR and CONJP.
  - Mixing the different methods?

Plan
- Introduction
- Methods: HMM, SVM, CRF
- Global analysis
- Conclusion

Global Analysis

Global Analysis
- CRF outperforms HMM and SVM.
- HMM performs better than SVM, thanks to its larger context; this is particularly evident for SBAR and PRT.
- HMM even outperforms CRF for CONJP!
  - HMM uses trigrams -> better for expressions like "as well as", "rather than".
  - HMM also benefits from Pla's specialization method.

Global Analysis
- CRF results are close to the CoNLL-2000 best results:
  [Comparison table of precision, recall and F1 for [ZDJ01], [KM01], our CRF and [CM03]: the numbers are not preserved in this transcript.]
- A finer analysis, per POS, is needed.

Global Analysis
- Combining the three methods (by voting; a sketch follows):
  [Combination results not preserved in this transcript.]
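A minimal sketch of the combination by per-token majority voting over the three taggers' outputs (the tie-breaking rule, falling back to the CRF prediction, is an assumption):

```python
from collections import Counter

def vote(hmm_tags, svm_tags, crf_tags):
    """Per-token majority vote; fall back to the CRF tag on three-way ties."""
    combined = []
    for h, s, c in zip(hmm_tags, svm_tags, crf_tags):
        tag, count = Counter([h, s, c]).most_common(1)[0]
        combined.append(tag if count >= 2 else c)
    return combined

print(vote(["B-NP", "B-PP"], ["B-NP", "B-NP"], ["B-NP", "B-PP"]))   # ['B-NP', 'B-PP']
```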

Global Analysis
- Combining does not help for PRT, where the difference between HMM and CRF was large!
- It helps... just a bit, on SBAR.
- The global results are better for CRF alone: 93.81 > 93.57.

Plan
- Introduction
- Methods: HMM, SVM, CRF
- Global analysis
- Conclusion

Conclusion
- Without lexicalization, SVM performs a lot better than HMM.
- With lexical specialization, HMM performs better than SVM... and is a lot faster!
- Only 3 systems for voting: too few. The taggers make mistakes on the same POS tags.

Conclusion
- At a certain stage, it becomes hard to improve the results.
- CRF proves to be efficient without any specific modification -> how can we improve it? CRF with 3-grams... but probably really slow.
- Some finer comparisons with the CoNLL results?