
1 Chunking Pierre Bourreau Cristina España i Bonet LSI-UPC PLN-PTM

2 Plan Introduction Methods  HMM  SVM  CRF Global analysis Conclusion

3 Introduction What is chunking?  Identifying non-recursive groups of contiguous words (phrases).  Ex: He is the person you read about. [NP He] [VP is] [NP the person] [NP you] [VP read] [PP about].  First step towards full parsing.

4 Introduction Chunking task in CoNLL-2000  Based on a previous POS tagging  Chunk tags: B-/I-/O + chunk type

  Chunk type                         Count
  ADJP (adjective phrase)            2,060
  ADVP (adverb phrase)               4,227
  CONJP (conjunction phrase)            56
  INTJ (interjection)                   31
  LST (list marker)                     10
  NP (noun phrase)                  55,081
  PP (prepositional phrase)         21,281
  PRT (particles)                      556
  SBAR (subordinated clause)         2,207
  UCP (unlike coordinated phrase)        2
  VP (verb phrase)                  21,467
  (over 106,978 chunks in total)
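As an illustration of the B-/I-/O encoding (not part of the original slides), here is a minimal Python sketch that turns a bracketed chunk analysis like the one on slide 3 into per-token chunk tags; the token grouping is taken from that example, everything else is illustrative:

    def brackets_to_bio(chunked):
        """Convert a list of (chunk_type, tokens) groups into per-token B-/I-/O tags.
        Tokens outside any chunk are passed with chunk_type None."""
        tagged = []
        for label, tokens in chunked:
            for i, tok in enumerate(tokens):
                if label is None:
                    tagged.append((tok, "O"))
                else:
                    tagged.append((tok, ("B-" if i == 0 else "I-") + label))
        return tagged

    # [NP He] [VP is] [NP the person] [NP you] [VP read] [PP about] .
    sentence = [("NP", ["He"]), ("VP", ["is"]), ("NP", ["the", "person"]),
                ("NP", ["you"]), ("VP", ["read"]), ("PP", ["about"]), (None, ["."])]
    for tok, tag in brackets_to_bio(sentence):
        print(tok, tag)
    # He B-NP, is B-VP, the B-NP, person I-NP, you B-NP, read B-VP, about B-PP, . O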

5 Introduction Corpus  Wall Street Journal (WSJ)  Training set: four sections (15-18), 211,727 tokens  Test set: one section (20), 47,377 tokens

6 Evaluation Output file format: Word POS Real-Chunk Processed-Chunk
 Ex:
   Boeing    NNP  B-NP  I-NP
   's        POS  B-NP  B-NP
   747       CD   I-NP  I-NP
   jetliners NNS  I-NP  I-NP
   .         .    O     O
 Evaluation script: precision, recall and the F score
   F_β = (1+β²)·precision·recall / (β²·precision + recall), with β=1
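The official CoNLL evaluation script works at the chunk level (a chunk counts as correct only if both its type and its full span match). As an informal illustration only, not the official script, chunk-level scores can be computed roughly like this:

    def chunk_spans(tags):
        """Extract (type, start, end) spans from a B-/I-/O tag sequence."""
        spans, ctype, start = [], None, None
        for i, tag in enumerate(list(tags) + ["O"]):      # sentinel closes the last chunk
            if tag.startswith("I-") and tag[2:] == ctype:
                continue                                  # current chunk continues
            if ctype is not None:
                spans.append((ctype, start, i))           # close the open chunk
            if tag.startswith(("B-", "I-")):
                ctype, start = tag[2:], i                 # open a new chunk
            else:
                ctype, start = None, None
        return spans

    def chunk_scores(gold_tags, pred_tags):
        """Chunk-level precision, recall and F1 for one tag sequence
        (for a whole corpus, sequences must be scored with non-overlapping offsets)."""
        gold, pred = set(chunk_spans(gold_tags)), set(chunk_spans(pred_tags))
        correct = len(gold & pred)
        precision = correct / len(pred) if pred else 0.0
        recall = correct / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1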

7 Plan Introduction Methods  HMM  SVM  CRF Global analysis Conclusion

8 Hidden Markov Models (HMM) A bit of theory...  Find the most probable tag sequence T for a sentence W, given a vocabulary and the set of possible tags: T* = argmax_T P(T|W) = argmax_T P(W|T)·P(T) (Bayes' theorem).  Enumerating all state sequences is intractable, so Markov assumptions are made: first order, P(t_i | t_{i-1}) (bigrams), or second order, P(t_i | t_{i-1}, t_{i-2}) (trigrams), with P(W|T) ≈ Π_i P(w_i | t_i).
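A minimal sketch of the decoding step such a tagger performs (a first-order / bigram Viterbi decoder; TnT, used below, implements the second-order analogue with smoothing and suffix-based handling of unknown tokens). The probability tables here are placeholders supplied by the caller:

    import math

    def viterbi(observations, states, log_init, log_trans, log_emit):
        """First-order Viterbi decoding.
        log_init[s]      ~ log P(s) for the first position
        log_trans[a][b]  ~ log P(b | a)
        log_emit[s][o]   ~ log P(o | s)
        Returns the most probable state sequence for the observations."""
        V = [{s: log_init.get(s, -math.inf) + log_emit[s].get(observations[0], -math.inf)
              for s in states}]
        back = [{}]
        for t in range(1, len(observations)):
            V.append({})
            back.append({})
            for s in states:
                prev, score = max(((p, V[t - 1][p] + log_trans[p].get(s, -math.inf))
                                   for p in states), key=lambda x: x[1])
                V[t][s] = score + log_emit[s].get(observations[t], -math.inf)
                back[t][s] = prev
        best = max(V[-1], key=V[-1].get)                  # best final state
        path = [best]
        for t in range(len(observations) - 1, 0, -1):     # follow back-pointers
            path.append(back[t][path[-1]])
        return list(reversed(path))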

9 HMM: Chunking Common setting (POS tagging)  Input sentence: words  Tags: POS Chunking as tagging  Input sentence: POS tags  Tags: chunks  Problem: the vocabulary seen by the tagger (the POS tag set) is very small

10 HMM: Chunking Solution: Specialization  Input sentence: POS  Input/output tags: POS + Chunks Improvement  Input sentence: Special words + POS  Input/output tags: Special words + POS + Chunks

11 HMM: Chunking In practice: modification of the input data (WSJ train and test). Example sentence in CoNLL format (word, POS, chunk) and its specialized version (tagger input, tagger output):

  Word        POS  Chunk   Input        Output
  Chancellor  NNP  O       NNP          NNP·O
  of          IN   B-PP    of·IN        of·IN·B-PP
  the         DT   B-NP    the·DT       the·DT·B-NP
  Exchequer   NNP  I-NP    NNP          NNP·I-NP
  Nigel       NNP  B-NP    NNP          NNP·B-NP
  Lawson      NNP  I-NP    NNP          NNP·I-NP
  's          POS  B-NP    POS          POS·B-NP
  restated    VBN  I-NP    VBN          VBN·I-NP
  commitment  NN   I-NP    NN           NN·I-NP
  to          TO   B-PP    to·TO        to·TO·B-PP
  a           DT   B-NP    a·DT         a·DT·B-NP
  firm        NN   I-NP    NN           NN·I-NP
  monetary    JJ   I-NP    JJ           JJ·I-NP
  policy      NN   I-NP    NN           NN·I-NP
  has         VBZ  B-VP    has·VBZ      has·VBZ·B-VP
  helped      VBN  I-VP    helped·VBN   helped·VBN·I-VP
  to          TO   I-VP    to·TO        to·TO·I-VP
  prevent     VB   I-VP    VB           VB·I-VP
  a           DT   B-NP    a·DT         a·DT·B-NP
  freefall    NN   I-NP    NN           NN·I-NP
  in          IN   B-PP    in·IN        in·IN·B-PP
  sterling    NN   B-NP    NN           NN·B-NP
  over        IN   B-PP    over·IN      over·IN·B-PP
  the         DT   B-NP    the·DT       the·DT·B-NP
  past        JJ   I-NP    past·JJ      past·JJ·I-NP
  week        NN   I-NP    NN           NN·I-NP
  .           .    O       .            .·O
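A minimal sketch of this specialization step (the SPECIAL_WORDS set below is a tiny hypothetical stand-in for the 409-word list of F. Pla that is actually used):

    # Hypothetical subset of the lexical specialization list (the real list has 409 words).
    SPECIAL_WORDS = {"of", "the", "to", "a", "has", "helped", "in", "over", "past"}

    def specialize(word, pos, chunk=None):
        """Build the specialized tagger input and, for training data, the output tag.
        Special words keep their surface form attached to the POS tag; all other
        words are reduced to their POS tag alone."""
        base = f"{word.lower()}·{pos}" if word.lower() in SPECIAL_WORDS else pos
        output = f"{base}·{chunk}" if chunk is not None else None
        return base, output

    print(specialize("of", "IN", "B-PP"))        # ('of·IN', 'of·IN·B-PP')
    print(specialize("Chancellor", "NNP", "O"))  # ('NNP', 'NNP·O')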

12 HMM: Results Tool:  TnT Tagger (Thorsten Brants)  Implements the Viterbi algorithm for second-order Markov models  Allows evaluating unigram, bigram and trigram models

13 HMM: Results Configuration 1:  No special words, no POS  Trigrams  Default parameters Results:  far from the best scores (F1 ~94%)

  Non-lex chunks  Precision (%)  Recall (%)  Fβ=1 (%)
  ADJP            67.97          47.49       55.91
  ADVP            71.41          67.78       69.55
  CONJP            0.00           0.00        0.00
  INTJ            50.00          50.00       50.00
  LST              0.00           0.00        0.00
  NP              85.50          86.70       86.09
  PP              83.57          93.76       88.37
  PRT             55.00          20.75       30.14
  SBAR            80.49           6.17       11.46
  VP              85.74          85.57       85.66
  Total           84.31          84.35       84.33

14 HMM: Results Trying to improve… Configuration 2:  Lexical specialization (409 words, F. Pla)  Trigrams Configuration 3:  Lexical specialization (409 words, F. Pla)  Bigrams (does it make any difference?)

15 HMM: Results

                       Bigrams                        Trigrams
  Chunks   Prec. (%)  Rec. (%)  Fβ=1 (%)   Prec. (%)  Rec. (%)  Fβ=1 (%)
  ADJP     69.95      69.63     69.79      68.69      69.63     69.16
  ADVP     79.72      78.52     79.12      79.44      78.52     78.98
  CONJP    33.33      55.56     41.67      45.45      55.56     50.00
  INTJ     33.33      50.00     40.00      50.00      50.00     50.00
  NP       90.49      91.04     90.76      91.86      92.61     92.23
  PP       96.39      96.63     96.51      96.66      97.55     97.10
  PRT      71.54      83.02     76.86      71.43      75.47     73.39
  SBAR     85.28      83.36     84.31      85.96      84.67     85.31
  UCP       0.00       0.00      0.00       0.00       0.00      0.00
  VP       90.26      91.95     91.10      90.16      91.91     91.03
  Total    90.62      91.25     90.93      91.37      92.24     91.81

16 HMM: Results Comments:  Adding specialization information improves the total F1 by about 7 points.  That is much more than the improvement from using trigrams instead of bigrams (~1 point).  As before, NP and PP are the best determined chunks.  Impressive improvement for PRT and SBAR (but their counts are small).

17 HMM: Results Importance of the training set size:  Test: divide the training set into 7 parts (~17,000 tokens per part) and compute the results, adding one part each time.  Conclusion: performance improves with the training-set size (see plot).  Limit? Molina & Pla obtained F1 = 93.26% with 18 sections of the WSJ as training set.
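A rough sketch of that experiment; train_fn and eval_fn are caller-supplied stand-ins for the actual TnT training run and the CoNLL evaluation script, so the snippet only fixes the incremental-training logic:

    def learning_curve(train_sentences, test_sentences, train_fn, eval_fn, n_parts=7):
        """Train on a growing prefix of the training data and record the chunk F1.
        train_fn(sentences) -> model; eval_fn(model, test_sentences) -> F1."""
        part = len(train_sentences) // n_parts
        scores = []
        for k in range(1, n_parts + 1):
            subset = train_sentences[: k * part] if k < n_parts else train_sentences
            model = train_fn(subset)
            scores.append((len(subset), eval_fn(model, test_sentences)))
        return scores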

18 HMM: Results [plot: F1 as a function of the training-set size]

19 Support Vector Machines (SVM) A bit of theory…  Objective: maximize the minimum margin  Allow misclassifications, controlled by the C parameter
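For reference, a compact statement of the soft-margin objective behind these two points (the standard formulation, not anything specific to the tool used below):

  minimize over w, b, ξ:   (1/2)·||w||² + C · Σ_i ξ_i
  subject to:              y_i·(w·x_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0  for all i

The slack variables ξ_i measure how far each training example may violate the margin; a large C penalizes these violations heavily (fewer misclassifications, narrower margin), a small C tolerates them (wider margin).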

20 SVM Tool:  SVMTool (Jesús Giménez & Lluís Màrquez), uses SVMLight (Thorsten Joachims) for learning  Sequential tagger applied to chunking: no need to change the input data  Binarizes the multi-class problem to apply SVMs

21 SVM Features (model 0):

22 SVM Results  Default parameters, varying C and/or the tagging direction (LR / LRL)  Very small variations with this configuration

  Model        Precision (%)  Recall (%)  Fβ=1 (%)
  C=0.1, LR    89.83          90.18       90.01
  C=0.1, LRL   89.32          90.10       89.71
  C=1, LR      89.01          89.26       89.13
  C=0.05, LR   89.68          90.06       89.87

23 SVM

  Chunks  Precision (%)  Recall (%)  Fβ=1 (%)
  ADJP     69.12          66.44       67.75
  ADVP     74.86          75.64       75.24
  CONJP    44.44          44.44       44.44
  INTJ    100.00          50.00       66.67
  LST       0.00           0.00        0.00
  NP       91.72          91.79       91.76
  PP       88.50          96.01       92.10
  PRT      51.85          26.42       35.00
  SBAR     82.02          34.95       49.02
  VP       91.86          92.81       92.33
  Total    89.83          90.18       90.01

Best results:  F1 > 90% for the three main chunks.  Modest values for the others.  Main difference with HMM is in PP.

24 Conditional Random Fields (CRF) A bit of theory…  The idea extends HMMs and Maximum-Entropy models.  We no longer consider a chain but a graph G=(V,E), conditioned on the observation sequence variable X; each node v represents a value Y_v of the output label sequence Y.

25 Conditional Random Fields The model defines P(y|x) (Lafferty et al.), where y is a label sequence and x an observation sequence; for the linear-chain case:

  P(y|x) = (1/Z(x)) · exp( Σ_i Σ_j λ_j·t_j(y_{i-1}, y_i, x, i) + Σ_i Σ_k μ_k·s_k(y_i, x, i) )

 t_j is a transition feature function (looking at the previous and current labels and the observation sequence).
 s_k is a state feature function (looking at the current label and the observation sequence).
 The weights λ_j and μ_k are set during training; Z(x) is a normalization factor.

26 Conditional Random Fields (CRF) CRF++ 0.45  Developed by Taku Kudo (2nd at CoNLL-2000 with an SVM combination)  Parameters: Features being used:  We can use words and POS tags  We proposed three alternatives (see the feature sketch below):  binary combinations of words+POS within a window of size 2  the above plus a 3-ary combination of POS tags  only POS tags within a window of size 3 Unigram or bigram templates  the score is computed over the label of the current token alone, or over the pair of labels of the current and previous tokens.
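A rough illustration of the first alternative (pairwise word+POS combinations within a ±2 window around the current token). The exact templates used in the experiments are not reproduced here, and the POS tags in the example are illustrative, so the feature names and combinations below are only a sketch:

    def window_features(words, pos_tags, i, frame=2):
        """Unigram-style features plus pairwise (binary) combinations of word and POS
        inside a +/- frame window around position i."""
        n = len(words)
        positions = [j for j in range(i - frame, i + frame + 1) if 0 <= j < n]
        feats = []
        for j in positions:
            feats.append(f"w[{j - i}]={words[j]}")
            feats.append(f"p[{j - i}]={pos_tags[j]}")
        for a in positions:                       # pairwise combinations
            for b in positions:
                if a < b:
                    feats.append(f"p[{a - i}]|p[{b - i}]={pos_tags[a]}|{pos_tags[b]}")
                    feats.append(f"w[{a - i}]|p[{b - i}]={words[a]}|{pos_tags[b]}")
        return feats

    words = ["He", "is", "the", "person", "you", "read", "about", "."]
    pos   = ["PRP", "VBZ", "DT", "NN", "PRP", "VBP", "IN", "."]
    print(window_features(words, pos, 3)[:6])     # features for "person"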

27 Conditional Random Fields (CRF) Results

28 Conditional Random Fields (CRF) Analysis:  Bigram templates with the maximum feature set -> 93.81% global F1  The global F1 score does not depend much on the feature window, but on the unigram/bigram selection: tagging pairs of tokens gives more power than tagging single tokens  Occurrences: LST -> 0, INTJ -> 1, CONJP -> 9 => identical results across configurations  PRT is the only chunk type that depends noticeably on the feature window and works better with size-2 windows; tagging particles/prepositions (e.g. out, around, in, …) relies on a wider context  Roughly the same holds for SBAR (e.g. than, rather, …)

29 Conditional Random Fields (CRF) How to improve the results:  Molina & Pla’s specialization method? It should improve performance on SBAR and CONJP  Combining the different methods?

30 Plan Introduction Methods  HMM  SVM  CRF Global analysis Conclusion

31 Global Analysis

32 Global Analysis CRF outperforms HMM and SVM. HMM performs better than SVM thanks to the context it uses; this is particularly evident for SBAR and PRT. HMM even outperforms CRF for CONJP!  HMM uses trigrams -> better for expressions like “as well as”, “rather than”  HMM also improves with Pla’s specialization method

33 Global Analysis CRF results are close to the CoNLL-2000 best results; a finer per-POS analysis is needed.

  System   Precision  Recall  F1
  [ZDJ01]  94.29      94.01   94.13
  [KM01]   93.89      93.92   93.91
  CRF++    93.81      93.96   93.66
  [CM03]   94.19      93.29   93.74

34 Global Analysis Combining the three methods (by voting):
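A minimal sketch of this kind of combination (majority voting per token over the three taggers' outputs; the tie-breaking priority used here is an assumption, not necessarily what was done in the experiments):

    from collections import Counter

    def vote(hmm_tags, svm_tags, crf_tags, priority=("CRF", "HMM", "SVM")):
        """Per-token majority vote over three chunk-tag sequences of equal length.
        On a three-way disagreement, fall back to the first system in `priority`.
        Note: the voted sequence may contain locally inconsistent B-/I- patterns."""
        systems = {"HMM": hmm_tags, "SVM": svm_tags, "CRF": crf_tags}
        combined = []
        for i in range(len(crf_tags)):
            counts = Counter(seq[i] for seq in systems.values())
            tag, n = counts.most_common(1)[0]
            if n == 1:                                   # all three disagree
                tag = systems[priority[0]][i]
            combined.append(tag)
        return combined

    print(vote(["B-NP", "I-NP", "B-PP", "B-VP"],
               ["B-NP", "I-NP", "B-NP", "B-PP"],
               ["B-NP", "I-NP", "B-NP", "O"]))
    # ['B-NP', 'I-NP', 'B-NP', 'O']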

35 Global Analysis Combining does not help for PRT, where the difference between HMM and CRF was big! It helps… just a bit for SBAR. The global result is better for CRF alone: 93.81 > 93.57.

36 Plan Introduction Methods  HMM  SVM  CRF Global analysis Conclusion

37 Conclusion Without lexicalization, SVM performs much better than HMM. With lexical specialization, HMM performs better than SVM… and is a lot faster! Only 3 systems for voting: too few. The taggers tend to make mistakes on the same POS tags.

38 Conclusion At a certain stage it becomes hard to improve the results. CRF proves to be effective without any specific modification -> how can we improve it? => CRF with trigrams… but probably really slow. Some finer comparisons with the CoNLL results?

