Chunking. Pierre Bourreau, Cristina España i Bonet (LSI-UPC, PLN-PTM)
Plan Introduction Methods HMM SVM CRF Global analysis Conclusion
Introduction. What is chunking? Identifying groups of contiguous, syntactically related words. Ex: He is the person you read about. -> [NP He] [VP is] [NP the person] [NP you] [VP read] [PP about]. First step towards full parsing.
Introduction. Chunking task in CoNLL-2000, based on a previous POS tagging. Chunk tags use the B/I/O scheme over the chunk types: ADJP (adjective phrase), ADVP (adverb phrase), CONJP (conjunction phrase), INTJ (interjection), LST (list marker), NP (noun phrase), PP (prepositional phrase), PRT (particle), SBAR (subordinated clause), UCP (unlike coordinated phrase), VP (verb phrase). [Chart: chunk-type counts over the corpus; values not recoverable.]
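To make the B/I/O scheme concrete, here is a minimal Python sketch (the helper name to_bio is ours, not part of the CoNLL tooling) that converts the bracketed example from the previous slide into B/I/O tags:

# Minimal illustration of the B/I/O encoding used for chunk tags.
# Each chunk starts with a B-<type> tag; the following words of the
# same chunk get I-<type>; words outside any chunk get O.

def to_bio(chunks):
    """chunks: list of (chunk_type_or_None, [words])."""
    tagged = []
    for ctype, words in chunks:
        if ctype is None:
            tagged.extend((w, "O") for w in words)
        else:
            tagged.append((words[0], "B-" + ctype))
            tagged.extend((w, "I-" + ctype) for w in words[1:])
    return tagged

example = [("NP", ["He"]), ("VP", ["is"]), ("NP", ["the", "person"]),
           ("NP", ["you"]), ("VP", ["read"]), ("PP", ["about"]), (None, ["."])]
print(to_bio(example))
# [('He', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('person', 'I-NP'),
#  ('you', 'B-NP'), ('read', 'B-VP'), ('about', 'B-PP'), ('.', 'O')]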
Introduction. Corpus: Wall Street Journal (WSJ). Training set: four sections (… tokens). Test set: one section (… tokens).
Evaluation. Output file format: Word POS Real-Chunk Processed-Chunk. Ex: Boeing NNP B-NP I-NP | 's POS B-NP B-NP | 747 CD I-NP I-NP | jetliners NNS I-NP I-NP | . . O O. Evaluation script: precision, recall and the F_1 score, $F_\beta = \frac{(1+\beta^2)\cdot \mathrm{precision}\cdot \mathrm{recall}}{\beta^2\cdot \mathrm{precision} + \mathrm{recall}}$ with $\beta = 1$.
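For illustration, a small Python sketch of a chunk-level evaluation in the spirit of the CoNLL script (not the official conlleval.pl; function names are ours): chunks are extracted as (start, end, type) spans from the B/I/O tags and scored with the formula above.

def extract_chunks(tags):
    """Return the set of (start, end, type) spans encoded by a B/I/O sequence."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel closes the last chunk
        inside = tag.startswith("I-") and tag[2:] == ctype
        if ctype is not None and not inside:    # current chunk ends here
            chunks.add((start, i - 1, ctype))
            start, ctype = None, None
        if tag.startswith("B-"):                # a new chunk starts
            start, ctype = i, tag[2:]
    return chunks

def evaluate(gold_tags, pred_tags, beta=1.0):
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

gold = ["B-NP", "B-NP", "I-NP", "I-NP", "O"]
pred = ["I-NP", "B-NP", "I-NP", "I-NP", "O"]
print(evaluate(gold, pred))   # (1.0, 0.5, 0.666...)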
Plan Introduction Methods HMM SVM CRF Global analysis Conclusion
Hidden Markov Models (HMM). A bit of theory... Find the most probable tag sequence $T$ for a sentence $I$, given a vocabulary and the set of possible tags. By Bayes' theorem, $\hat{T} = \arg\max_T P(T \mid I) = \arg\max_T P(I \mid T)\,P(T)$. Considering all the states is intractable, so the emission term is factorized as $P(I \mid T) \approx \prod_i P(w_i \mid t_i)$ and $P(T)$ is approximated with a Markov assumption: first order, $P(T) \approx \prod_i P(t_i \mid t_{i-1})$ (bigrams), or second order, $P(T) \approx \prod_i P(t_i \mid t_{i-2}, t_{i-1})$ (trigrams).
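As an illustration of the decoding step, a minimal Viterbi sketch for a first-order (bigram) HMM with toy probability tables; TnT implements the second-order (trigram) version, and all names and numbers below are ours.

# Minimal Viterbi decoder for a first-order (bigram) HMM tagger:
# argmax over tag sequences of  prod_i P(t_i | t_{i-1}) * P(w_i | t_i).
# Probabilities are toy values; a real tagger (e.g. TnT) estimates them
# from the training corpus with smoothing.

def viterbi(words, tags, trans, emit, start):
    # delta[t] = best probability of any tag sequence ending in tag t
    delta = {t: start.get(t, 1e-12) * emit[t].get(words[0], 1e-12) for t in tags}
    back = []
    for w in words[1:]:
        new_delta, pointers = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p] * trans[p].get(t, 1e-12))
            new_delta[t] = delta[best_prev] * trans[best_prev].get(t, 1e-12) \
                           * emit[t].get(w, 1e-12)
            pointers[t] = best_prev
        delta, back = new_delta, back + [pointers]
    best = max(tags, key=lambda t: delta[t])    # best final tag
    path = [best]
    for pointers in reversed(back):             # follow the backpointers
        path.insert(0, pointers[path[0]])
    return path

tags = ["NP", "VP"]
start = {"NP": 0.7, "VP": 0.3}
trans = {"NP": {"NP": 0.4, "VP": 0.6}, "VP": {"NP": 0.8, "VP": 0.2}}
emit = {"NP": {"he": 0.5, "person": 0.5}, "VP": {"is": 1.0}}
print(viterbi(["he", "is", "person"], tags, trans, emit, start))  # ['NP', 'VP', 'NP']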
HMM: Chunking. Common tagger setting: input sentence = words, input/output tags = POS. Chunking setting: input sentence = POS, input/output tags = chunks. Problem: the tagger's vocabulary becomes very small (only POS tags), so lexical information is lost.
HMM: Chunking. Solution: specialization. Input sentence: POS; input/output tags: POS + chunks. Improvement: input sentence: special words + POS; input/output tags: special words + POS + chunks.
HMM: Chunking. In practice: modification of the input data (WSJ train and test). Original (word POS chunk) -> specialized (token tag):
Chancellor NNP O -> NNP NNP·O
of IN B-PP -> of·IN of·IN·B-PP
the DT B-NP -> the·DT the·DT·B-NP
Exchequer NNP I-NP -> NNP NNP·I-NP
Nigel NNP B-NP -> NNP NNP·B-NP
Lawson NNP I-NP -> NNP NNP·I-NP
's POS B-NP -> POS POS·B-NP
restated VBN I-NP -> VBN VBN·I-NP
commitment NN I-NP -> NN NN·I-NP
to TO B-PP -> to·TO to·TO·B-PP
a DT B-NP -> a·DT a·DT·B-NP
firm NN I-NP -> NN NN·I-NP
monetary JJ I-NP -> JJ JJ·I-NP
policy NN I-NP -> NN NN·I-NP
has VBZ B-VP -> has·VBZ has·VBZ·B-VP
helped VBN I-VP -> helped·VBN helped·VBN·I-VP
to TO I-VP -> to·TO to·TO·I-VP
prevent VB I-VP -> VB VB·I-VP
a DT B-NP -> a·DT a·DT·B-NP
freefall NN I-NP -> NN NN·I-NP
in IN B-PP -> in·IN in·IN·B-PP
sterling NN B-NP -> NN NN·B-NP
over IN B-PP -> over·IN over·IN·B-PP
the DT B-NP -> the·DT the·DT·B-NP
past JJ I-NP -> past·JJ past·JJ·I-NP
week NN I-NP -> NN NN·I-NP
. . O -> . .·O
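A sketch of this specialization step, assuming CoNLL-style "word POS chunk" lines; the small special-word set below is an illustrative subset, not F. Pla's 409-word list.

# Sketch of the lexical-specialization transformation illustrated above:
# the TnT "word" becomes the POS (or word·POS for special words) and the
# TnT "tag" becomes that same token with the chunk label appended, so the
# HMM sees a richer vocabulary than bare POS tags.

SPECIAL_WORDS = {"of", "the", "to", "a", "has", "helped", "in", "over", "past"}
# illustrative subset only; the experiments use F. Pla's 409-word list

def specialize(conll_lines, special_words=SPECIAL_WORDS):
    out = []
    for line in conll_lines:
        if not line.strip():                    # keep sentence boundaries
            out.append("")
            continue
        word, pos, chunk = line.split()
        token = f"{word.lower()}·{pos}" if word.lower() in special_words else pos
        out.append(f"{token}\t{token}·{chunk}")
    return out

sample = ["Chancellor NNP O", "of IN B-PP", "the DT B-NP", "Exchequer NNP I-NP"]
print("\n".join(specialize(sample)))
# NNP     NNP·O
# of·IN   of·IN·B-PP
# the·DT  the·DT·B-NP
# NNP     NNP·I-NP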
HMM: Results. Tool: TnT Tagger (Thorsten Brants). Implements the Viterbi algorithm for second-order Markov models. Allows evaluating unigram, bigram and trigram models.
HMM: Results. Configuration 1: no special words, no POS, trigrams, default parameters. Results: far from the best scores (F_1 ~94%). [Table (non-lexicalized model): precision (%), recall (%) and F_β=1 (%) per chunk type (ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, UCP) and in total; numeric values lost except CONJP and LST at 0.00.]
HMM: Results. Trying to improve... Configuration 2: lexical specialization (409 words, F. Pla), trigrams. Configuration 3: lexical specialization (409 words, F. Pla), bigrams. Does it make any difference?
HMM: Results. [Table: precision (%), recall (%) and F_β=1 (%) per chunk type (ADJP, ADVP, CONJP, INTJ, NP, PP, PRT, SBAR, UCP, VP) and in total, for bigrams vs. trigrams with lexical specialization; numeric values lost except UCP at 0.00.]
HMM: Results. Comments: Adding specialization information improves the total F_1 by 7 points, much more than the improvement from using trigrams instead of bigrams (~1 point). As before, NP and PP are the best-determined chunks. Impressive improvement for PRT and SBAR (but their counts are small).
HMM: Results. Importance of the training set size. Test: divide the training set into 7 parts (~17000 tokens per part) and compute the results adding one part at a time. Conclusion: performance improves with the training set size (see plot). Is there a limit? Molina & Pla obtained F_1 = 93.26% with 18 WSJ sections as training set.
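A sketch of this training-set-size experiment; train_and_evaluate is a hypothetical placeholder for running TnT on the cumulative training data and scoring the output, and sentence-boundary handling is omitted for brevity.

# Split the training data into 7 parts, train on an increasing number of
# parts, and record the F_1 score each time (the learning curve plotted
# on the next slide).  train_and_evaluate() is a placeholder: its name
# and signature are assumptions, not part of TnT.

def split_into_parts(lines, n_parts=7):
    size = len(lines) // n_parts
    return [lines[i * size:(i + 1) * size] for i in range(n_parts - 1)] \
           + [lines[(n_parts - 1) * size:]]

def learning_curve(train_lines, test_lines, train_and_evaluate, n_parts=7):
    parts, scores, cumulative = split_into_parts(train_lines, n_parts), [], []
    for k, part in enumerate(parts, start=1):
        cumulative.extend(part)
        f1 = train_and_evaluate(cumulative, test_lines)   # returns F_1 (%)
        scores.append((k, f1))
    return scores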
HMM: Results. [Plot: F_1 vs. training set size (number of parts used).]
Support Vector Machines (SVM). A bit of theory... Objective: maximize the minimum margin. Misclassifications are allowed, controlled by the C parameter.
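The standard soft-margin formulation the slide refers to, written out; C weights the slack (misclassification) terms against margin maximization:

$$\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\; i = 1,\dots,n.$$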
SVM. Tool: SVMTool (Jesús Giménez & Lluís Màrquez), which uses SVMLight (Thorsten Joachims) for learning. Sequential tagger applied to chunking; no need to change the input data. Binarizes the multi-class problem in order to apply SVMs.
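Not SVMTool itself, but a minimal scikit-learn stand-in for the same idea (window features around each token, a linear SVM with one-vs-rest binarization, and the C parameter from the previous slide); feature choices and names are illustrative, and the left-to-right sequential aspect of SVMTool is not modelled here.

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def token_features(sent, i, window=2):
    # word and POS features in a +-window frame around token i
    feats = {}
    for d in range(-window, window + 1):
        j = i + d
        if 0 <= j < len(sent):
            word, pos = sent[j][0], sent[j][1]
            feats[f"word[{d}]"] = word.lower()
            feats[f"pos[{d}]"] = pos
    return feats

def build_dataset(sentences):                 # sentences: [[(word, pos, chunk), ...]]
    X = [token_features(s, i) for s in sentences for i in range(len(s))]
    y = [tok[2] for s in sentences for tok in s]
    return X, y

train = [[("He", "PRP", "B-NP"), ("is", "VBZ", "B-VP"),
          ("the", "DT", "B-NP"), ("person", "NN", "I-NP")]]
X, y = build_dataset(train)
model = make_pipeline(DictVectorizer(), LinearSVC(C=0.1))   # one-vs-rest internally
model.fit(X, y)
print(model.predict(build_dataset(train)[0]))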
SVM. Features (model 0): [feature list not recoverable from the slide.]
SVM: Results. Default parameters, varying C and/or the tagging direction (LR / LRL). Very small variations with this configuration. [Table: precision (%), recall (%) and F_β=1 (%) for the models C=0.1 LR, C=0.1 LRL, C=1 LR, C=0.05 LR; numeric values lost.]
SVM: Results. [Table: precision (%), recall (%) and F_β=1 (%) per chunk type (ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, VP) and in total; numeric values lost except LST at 0.00.] Best results: F_1 > 90% for the three main chunks, modest values for the others. The main difference with the HMM is in PP.
Conditional Random Fields (CRF). A bit of theory... The idea is an extension of HMMs and Maximum-Entropy models. We no longer consider a chain but a graph G=(V,E), conditioned on X, the observation sequence variable. Each node represents a value Y_v of Y (the output label).
Conditional Random Fields. $P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i}\sum_{j} \lambda_j\, t_j(y_{i-1}, y_i, x, i) + \sum_{i}\sum_{k} \mu_k\, s_k(y_i, x, i) \Big)$ (Lafferty et al., 2001), where y is a label sequence and x an observation sequence. t_j is a transition feature function (depending on the previous label, the current label and the observation sequence); s_k is a state feature function (depending on the current label and the observation sequence). The weights λ_j and μ_k are set during training.
Conditional Random Fields (CRF). Tool: CRF++, developed by Taku Kudo (2nd at CoNLL-2000 with an SVM combination). Parameters: the features being used; we can use words and POS tags. We proposed three alternatives: (1) binary combinations of word+POS over a window of size 2; (2) the above plus a 3-ary combination of POS tags; (3) only POS over a window of size 3. Unigrams or bigrams: the score is computed from the probabilities for the current token alone or for the pair of consecutive tokens. (See the feature sketch below.)
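A sketch of the kind of feature combinations just listed, written as plain feature-string generation; in the experiments these combinations are declared through CRF++ templates (not shown), and the feature names here are ours.

# Sketch of the feature alternatives described above: unary word/POS
# features in a +-window frame, binary word+POS combinations at each
# offset, and one 3-ary POS combination around the current token.

def crf_features(sent, i, window=2):          # sent: [(word, pos), ...]
    def at(j, field):
        return sent[j][field] if 0 <= j < len(sent) else "_PAD_"
    feats = []
    for d in range(-window, window + 1):      # unary features
        feats.append(f"w[{d}]={at(i + d, 0)}")
        feats.append(f"p[{d}]={at(i + d, 1)}")
    for d in range(-window, window + 1):      # binary word+POS combinations
        feats.append(f"w|p[{d}]={at(i + d, 0)}|{at(i + d, 1)}")
    # one 3-ary POS combination around the current token
    feats.append(f"p[-1]|p[0]|p[1]={at(i - 1, 1)}|{at(i, 1)}|{at(i + 1, 1)}")
    return feats

sent = [("He", "PRP"), ("is", "VBZ"), ("the", "DT"), ("person", "NN")]
print(crf_features(sent, 1))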
Conditional Random Fields (CRF). Results: [table of per-chunk and total scores for the different feature configurations; numeric values lost.]
Conditional Random Fields (CRF). Analysis: Bigrams with the maximum feature set -> 93.81% global F_1. The global F_1 score does not depend much on the feature window, but on the bigram/unigram selection: tagging pairs of tokens gives more power than tagging single tokens. Occurrences: LST -> 0, INTJ -> 1, CONJP -> 9 => identical results across configurations. PRT is the only chunk type that depends noticeably on the feature window, and it works better with size-2 windows: tagging prepositions/particles (e.g. out, around, in, ...) relies on bigger windows. Slightly the same for SBAR (e.g. than, rather, ...).
Conditional Random Fields (CRF). How to improve the results? Molina & Pla's method? It should improve the performance on SBAR and CONJP. Mixing the different methods?
Plan Introduction Methods HMM SVM CRF Global analysis Conclusion
Global Analysis. [Per-chunk comparison of HMM, SVM and CRF; table/plot not recoverable.]
CRF outperforms HMM and SVM. HMM performs better than SVM because of the context it uses; this is particularly evident for SBAR and PRT. HMM even outperforms CRF for CONJP: it uses 3-grams, which are better for expressions like "as well as" or "rather than". The HMM improvement comes with Pla's specialization method.
Global Analysis. CRF results are close to the CoNLL-2000 best results; a finer analysis, per chunk type, is needed. [Table: precision, recall and F_1 for [ZDJ01], [KM01], our CRF and [CM03]; numeric values lost.]
Global Analysis. Combining the three methods: [table of combination results; numeric values lost.]
Global Analysis. Combining does not help for PRT, where the difference between HMM and CRF was big! It helps just a bit on SBAR. The global result is better for CRF alone: 93.81 > 93.57. (A minimal voting sketch follows.)
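A minimal token-level majority-voting sketch of the combination; the fallback to the CRF prediction on a three-way disagreement is our assumption, not necessarily what the experiments used.

# Majority vote over the HMM, SVM and CRF chunk tags for each token,
# falling back to the CRF prediction when all three systems disagree.

from collections import Counter

def vote(hmm_tags, svm_tags, crf_tags):
    combined = []
    for h, s, c in zip(hmm_tags, svm_tags, crf_tags):
        tag, count = Counter([h, s, c]).most_common(1)[0]
        combined.append(tag if count >= 2 else c)
    return combined

print(vote(["B-NP", "I-NP", "B-PP"],
           ["B-NP", "B-NP", "B-PP"],
           ["B-NP", "I-NP", "B-VP"]))   # ['B-NP', 'I-NP', 'B-PP']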
Plan Introduction Methods HMM SVM CRF Global analysis Conclusion
Without lexicalization, SVM performs a lot better than HMM. With lexical specialization, HMM performs better than SVM... and is a lot faster! Only 3 systems for the voting: too few. The taggers make mistakes on the same tags.
Conclusion. At a certain stage, it becomes hard to improve the results. CRF proves to be efficient without any specific modification -> how can we improve it? => CRF with 3-grams... but probably really slow. Some finer comparisons with the CoNLL results?