1
Dept. of Computer Science & Engg., Indian Institute of Technology Kharagpur
Part-of-Speech Tagging and Chunking with Maximum Entropy Model
Sandipan Dandapat
Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur
2
Goal
  Lexical analysis
    Part-of-speech (POS) tagging: assigning a part of speech (e.g. noun, verb) to each word
  Syntactic analysis
    Chunking: identifying and labeling phrases such as verb phrases and noun phrases
3
Machine Learning to Resolve POS Tagging and Chunking
  HMM
    Supervised (DeRose, 88; Meteer, 91; Brants, 2000; etc.)
    Semi-supervised (Cutting, 92; Merialdo, 94; Kupiec, 92; etc.)
  Maximum Entropy (Ratnaparkhi, 96; etc.)
  TB(ED)L (Brill, 92, 94, 95; etc.)
  Decision Tree (Black, 92; Marquez, 97; etc.)
4
Our Approach
  Maximum Entropy based
    Diverse and overlapping features
    Language independence
    Reasonably good accuracy
    Data intensive
    Absence of sequence information
5
POS Tagging Schema
  [Diagram: raw text → POS tagging (language model, disambiguation algorithm, possible POS class restriction, …) → tagged text]
6
POS Tagging: Our Approach
  [Diagram: raw text → POS tagging (ME model, disambiguation algorithm, possible POS class restriction, …) → tagged text]
  ME model: current state depends on history (features)
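The slide does not spell the model out. As a sketch, the standard conditional maximum entropy form (as in Ratnaparkhi, 96, cited earlier) can be written as follows; the symbols $h_i$ (history at position $i$), $f_j$ (binary feature functions) and $\lambda_j$ (feature weights) are notation assumed here, not taken from the slides.

```latex
p(t_i \mid h_i) = \frac{1}{Z(h_i)} \exp\Big( \sum_j \lambda_j f_j(h_i, t_i) \Big),
\qquad
Z(h_i) = \sum_{t'} \exp\Big( \sum_j \lambda_j f_j(h_i, t') \Big)
```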
8
Learning ME Model
  GIS (Generalized Iterative Scaling)
    Finds the model parameters that define the maximum entropy classifier for a given feature set and training corpus
  The parameters of the ME model are estimated using an off-the-shelf toolkit (http://maxent.sourceforge.net)
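To make the GIS update concrete, here is a minimal, self-contained sketch of Generalized Iterative Scaling on toy data. It is not the API of the toolkit named above: the function names (`train_gis`, `predict_proba`), the feature strings and the toy samples are all hypothetical. The sketch assumes each context is a set of feature names and pads every (context, label) pair with a correction feature so that the feature values always sum to the constant C, as GIS requires.

```python
import math
from collections import defaultdict

def predict_proba(weights, corr_weight, C, context, labels):
    """Return p(label | context) under the current weights."""
    scores = {}
    for t in labels:
        active = [(f, t) for f in context if (f, t) in weights]
        # The correction feature takes the value C - (number of active features).
        scores[t] = sum(weights[k] for k in active) + corr_weight * (C - len(active))
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - m) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: v / z for t, v in exps.items()}

def train_gis(data, iterations=200):
    """data: list of (set_of_context_features, gold_label) pairs."""
    labels = sorted({t for _, t in data})
    n = len(data)
    # Empirical expectation of each joint feature (context feature, label).
    emp = defaultdict(float)
    for ctx, t in data:
        for f in ctx:
            emp[(f, t)] += 1.0 / n
    # C exceeds the largest feature count, so the correction feature is always >= 1.
    C = max(len(ctx) for ctx, _ in data) + 1
    emp_corr = sum(C - len(ctx) for ctx, _ in data) / n
    weights = {k: 0.0 for k in emp}
    corr_weight = 0.0
    for _ in range(iterations):
        model = defaultdict(float)
        model_corr = 0.0
        for ctx, _ in data:
            p = predict_proba(weights, corr_weight, C, ctx, labels)
            for t in labels:
                active = [(f, t) for f in ctx if (f, t) in weights]
                for k in active:
                    model[k] += p[t] / n
                model_corr += p[t] * (C - len(active)) / n
        # GIS update: move each weight by (1/C) * log(empirical / model expectation).
        for k in weights:
            weights[k] += math.log(emp[k] / model[k]) / C
        corr_weight += math.log(emp_corr / model_corr) / C
    return weights, corr_weight, C, labels

# Toy training data (illustrative only).
toy_data = [
    ({"w=the", "suf=he"}, "DET"),
    ({"w=dog", "suf=og"}, "NN"),
    ({"w=runs", "suf=ns"}, "VB"),
    ({"w=dog", "suf=og"}, "NN"),
]
weights, corr_weight, C, labels = train_gis(toy_data)
probs = predict_proba(weights, corr_weight, C, {"w=dog", "suf=og"}, labels)
```

After training, `probs` puts most of its mass on the label seen with that context in the toy data. A production trainer would also need smoothing and a convergence check, which this sketch omits.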
9
POS Tagging: Our Approach
  [Diagram: raw text → POS tagging (ME model, disambiguation algorithm, …) → tagged text]
  $t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$
  $\{T\}$: set of all tags
  $T_{MA}(w_i)$: set of tags computed by the morphological analyzer
10
POS Tagging: Our Approach
  [Diagram: raw text → POS tagging (ME model, beam search, …) → tagged text]
  $t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$
  $\{T\}$: set of all tags
  $T_{MA}(w_i)$: set of tags computed by the morphological analyzer
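The beam search with an optional tag restriction can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `prob_fn` is a hypothetical stand-in for the trained ME model, `allowed_tags` is a hypothetical stand-in for the morphological analyzer, and falling back to the full tag set when the analyzer returns nothing is a choice made here.

```python
import math

def beam_search(words, prob_fn, all_tags, allowed_tags=None, beam_size=3):
    """Return the best tag sequence found by beam search.

    prob_fn(words, i, prev_tags) -> {tag: probability}   (hypothetical model interface)
    allowed_tags(word) -> iterable of tags, or None      (hypothetical analyzer interface)
    """
    beam = [(0.0, [])]  # each entry: (log-probability, tags chosen so far)
    for i, word in enumerate(words):
        candidates = set(all_tags)
        if allowed_tags is not None:
            restricted = allowed_tags(word)
            if restricted:  # assumption: use the full tag set if the analyzer gives nothing
                candidates = set(restricted)
        expanded = []
        for logp, tags in beam:
            dist = prob_fn(words, i, tags)
            for t in candidates:
                p = dist.get(t, 0.0)
                if p > 0.0:
                    expanded.append((logp + math.log(p), tags + [t]))
        # Keep only the beam_size highest-scoring partial sequences.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_size]
    return beam[0][1] if beam else []

def toy_prob(words, i, prev_tags):
    """Toy distribution used only to exercise the search."""
    if words[i] == "the":
        return {"DET": 1.0}
    if words[i] == "book":
        if prev_tags and prev_tags[-1] == "DET":
            return {"NN": 0.9, "VB": 0.1}
        return {"NN": 0.4, "VB": 0.6}
    return {"NN": 0.5, "VB": 0.5}

TAGS = ["DET", "NN", "VB"]
unrestricted = beam_search(["book"], toy_prob, TAGS)
restricted = beam_search(["book"], toy_prob, TAGS,
                         allowed_tags=lambda w: {"NN"} if w == "book" else None)
in_context = beam_search(["the", "book"], toy_prob, TAGS)
```

In the toy run the restriction changes the outcome for the ambiguous word: without it the search follows the model's preferred tag, with it only the analyzer's tag is considered.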
11
Disambiguation Algorithm
  Text:
  Tags:
  Where $t_i \in \{T\}$, $\forall w_i$; $\{T\}$ = set of tags
12
Disambiguation Algorithm
  Text:
  Tags:
  Where $t_i \in T_{MA}(w_i)$, $\forall w_i$; $\{T\}$ = set of tags
13
What are Features?
  Feature function
    Binary function of the history and target
  Example:
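The slide's own example is not preserved in this transcript. As a purely illustrative stand-in, a binary feature function of the history and the target tag might look like the sketch below; the dictionary layout of `history`, the suffix test and the function name are all assumptions.

```python
def f_suffix_er_and_nn(history, tag):
    """Hypothetical binary feature: fires when the current word ends in 'er'
    and the target tag is NN; returns 0 otherwise."""
    return 1 if history["word"].endswith("er") and tag == "NN" else 0

# The history is modelled here as a dict holding the current word and the previous tag.
example_history = {"word": "teacher", "prev_tag": "DET"}
fires = f_suffix_er_and_nn(example_history, "NN")
silent = f_suffix_er_and_nn(example_history, "VB")
```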
14
POS Tagging Features
  [Diagram: context window over positions i-3 … i+3 with rows pos, word and POS_Tag; the tag at position i (T4) is the estimated tag, and the surrounding words and tags form the feature set]
  40 different experiments were conducted, taking several combinations from set 'F'
15
POS Tagging Features
  [Diagram: context window over positions i-3 … i+3 with rows pos, word and POS_Tag, marking the estimated tag and the feature set]
  Condition: static features for all words
    Current word ($w_i$)
    Previous word ($w_{i-1}$)
    Next word ($w_{i+1}$)
    |prefix| ≤ 4
    |suffix| ≤ 4
  Condition: dynamic features for all words
    POS tag of previous word ($t_{i-1}$)
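The feature set listed on this slide can be sketched as a small extraction function. The feature-name strings, the boundary markers `<BOS>`/`<EOS>` and the function name are assumptions for illustration; only the choice of features comes from the slide.

```python
def pos_features(words, i, prev_tag):
    """Collect the static and dynamic POS-tagging features for position i."""
    w = words[i]
    feats = {
        "w=" + w,                                                  # current word
        "w-1=" + (words[i - 1] if i > 0 else "<BOS>"),             # previous word
        "w+1=" + (words[i + 1] if i + 1 < len(words) else "<EOS>"),  # next word
        "t-1=" + prev_tag,                                         # dynamic: previous POS tag
    }
    # Prefixes and suffixes of length 1 to 4 (shorter words yield fewer of them).
    for k in range(1, min(4, len(w)) + 1):
        feats.add("pre=" + w[:k])
        feats.add("suf=" + w[-k:])
    return feats

sample = pos_features(["a", "tagging", "demo"], 1, "DET")
```

Because the previous tag is only known once the preceding word has been tagged, this function would be called during decoding rather than precomputed for a whole sentence.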
16
Chunking Features
  [Diagram: context window over positions -3 … +3 with word (W), POS tag (T) and chunk tag (C) rows, marking the estimated tag and the feature set]
  Static features for all words
    Current word ($w_i$)
    POS tag of the current word ($t_i$)
    POS tags of previous two words ($t_{i-1}$ and $t_{i-2}$)
    POS tags of next two words ($t_{i+1}$ and $t_{i+2}$)
  Dynamic features for all words
    Chunk tags of previous two words ($C_{i-1}$ and $C_{i-2}$)
17
Chunking Features
  [Diagram: context window over positions i-3 … i+3 with rows pos, word, POS_Tag and Chunk_Tag, marking the estimated tag and the feature set]
  Static features for all words
    Current word ($w_i$)
    POS tag of the current word ($t_i$)
    POS tags of previous two words ($t_{i-1}$ and $t_{i-2}$)
    POS tags of next two words ($t_{i+1}$ and $t_{i+2}$)
  Dynamic features for all words
    Chunk tags of previous two words ($C_{i-1}$ and $C_{i-2}$)
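The chunking feature set above can be sketched in the same way. The feature-name strings, the `<PAD>` marker for positions outside the sentence, the `B-NP` chunk label in the usage line and the function name are assumptions; the slide only specifies which words and tags are used.

```python
def chunk_features(words, pos_tags, i, prev_chunks):
    """Collect the static and dynamic chunking features for position i.

    prev_chunks holds the chunk tags already assigned to positions 0 .. i-1.
    """
    def tag_at(j):
        # POS tag at position j, or a padding marker outside the sentence.
        return pos_tags[j] if 0 <= j < len(pos_tags) else "<PAD>"

    def chunk_back(k):
        # Chunk tag k positions back, or a padding marker if not yet available.
        return prev_chunks[-k] if len(prev_chunks) >= k else "<PAD>"

    return {
        "w=" + words[i],
        "t=" + tag_at(i),
        "t-1=" + tag_at(i - 1),
        "t-2=" + tag_at(i - 2),
        "t+1=" + tag_at(i + 1),
        "t+2=" + tag_at(i + 2),
        "c-1=" + chunk_back(1),
        "c-2=" + chunk_back(2),
    }

sample_chunk_feats = chunk_features(["w1", "w2", "w3"], ["NN", "JJ", "VFM"], 1, ["B-NP"])
```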
18
Experiments: POS tagging
  Baseline Model
  Maximum Entropy Model
    ME (Bengali, Hindi and Telugu)
    ME + IMA (Bengali)
    ME + CMA (Bengali)
  Data Used

  Language              Bengali   Hindi    Telugu
  Training data         20,396    21,470   21,416
  Development data       5,023     5,681    6,098
  Test data              5,226     4,924    5,193
  No. of chunk labels        6         7        6

  No. of POS tags: 27, 25
19
Tagset and Corpus Ambiguity
  Tagset consists of 27 grammatical classes
  Corpus ambiguity
    Mean number of possible tags for each word
    Measured in the training tagged data

  Language           Dutch   German   English   French   Bengali   Hindi   Telugu
  Corpus ambiguity   1.11    1.3      1.34      1.69     1.75      1.85    1.70
  Accuracy           96%     97%      96.5%     94.5%    ?         ?       ?
  Unknown words      13%     9%       11%       5%       33%       21%     56%

  (Dermatas et al 1995)
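One way to compute the corpus ambiguity defined above is sketched here. It assumes the mean is taken over word tokens in the tagged training data (an average over distinct word types would give a different figure); the slide does not say which is meant, and the function name and toy data are hypothetical.

```python
from collections import defaultdict

def corpus_ambiguity(tagged_tokens):
    """Mean number of possible tags per token, given (word, tag) pairs."""
    if not tagged_tokens:
        raise ValueError("tagged_tokens must not be empty")
    # First pass: collect every tag observed for each word.
    tags_for = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_for[word].add(tag)
    # Second pass: average the tag-set size over all tokens.
    return sum(len(tags_for[word]) for word, _ in tagged_tokens) / len(tagged_tokens)

toy_corpus = [("a", "NN"), ("a", "VB"), ("b", "NN"), ("b", "NN")]
ambiguity = corpus_ambiguity(toy_corpus)
```

In the toy corpus, "a" has two possible tags and "b" has one, so the token-level mean is (2 + 2 + 1 + 1) / 4.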
20
POS Tagging Results on Development Set
  Overall Accuracy

  Language           Bengali   Hindi    Telugu
  Corpus ambiguity   1.75      1.85     1.70
  Accuracy           79.74%    83.10%   67.12%
  Unknown words      33%       21%      56%
21
POS Tagging Results on Development Set
  [Chart: accuracy on known words, unknown words and overall]
22
POS Tagging Results - Bengali
23
Results on Development Set

  Method     Bengali              Hindi                Telugu
  Baseline   58.88                68.93                -
  ME         79.74 (89.3, 60.5)   83.10 (90.9, 53.7)   67.82 (82.5, 70.0)
  ME + IMA   83.51 (84.2, 82.1)   -                    -
  ME + CMA   88.25 (89.3, 86.2)   -                    -
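The slide does not label the bracketed pairs. Assuming they are (known-word accuracy, unknown-word accuracy), and assuming the unknown-word rates reported earlier for Bengali (33%) and Hindi (21%) apply to the development set, the overall ME figures should be close to the weighted average of each pair. The short check below illustrates that arithmetic; the function name is hypothetical.

```python
def overall_accuracy(known_acc, unknown_acc, unknown_rate):
    """Weighted average of known- and unknown-word accuracy (all in percent)."""
    return (1.0 - unknown_rate) * known_acc + unknown_rate * unknown_acc

# ME row, Bengali and Hindi columns, with the unknown-word rates from the earlier slide.
bengali_overall = overall_accuracy(89.3, 60.5, 0.33)
hindi_overall = overall_accuracy(90.9, 53.7, 0.21)
```

Under these assumptions both values land within about 0.1 of the reported overall accuracies (79.74 and 83.10), which supports reading the pairs this way for those two languages.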
24
Chunking Results
  Two different measures
    Per word basis
    Per chunk basis
      Correctly identified groups along with correctly labeled groups

  Evaluation criteria   Method       Bengali      Hindi        Telugu
  Per word basis        ME + I_POS   84.45        79.88        65.92
  Per chunk basis       ME + I_POS   87.3, 80.6   74.1, 67.4   69.6, 56.7
  Per chunk basis       ME + C_POS   93.3, 87.7   78.5, 74.4   -
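The two measures can be sketched as follows. Several things here are assumptions rather than facts from the slides: chunk tags are taken to use a B-X / I-X / O encoding, the per-chunk scores are computed relative to the gold chunks, and "identified" is read as matching boundaries while "labeled" additionally requires the same chunk label. The function names are hypothetical.

```python
def extract_chunks(chunk_tags):
    """Return (start, end, label) spans from B-X / I-X / O tags (end exclusive)."""
    chunks, start, label = [], None, None
    for i, tag in enumerate(list(chunk_tags) + ["O"]):  # sentinel closes the last chunk
        continues = start is not None and tag == "I-" + label
        if start is not None and not continues:
            chunks.append((start, i, label))
            start = None
        if tag != "O" and not continues:
            start, label = i, tag[2:]
    return chunks

def chunk_scores(gold_tags, pred_tags):
    """Return (per-word accuracy, identified-chunk rate, labeled-chunk rate)."""
    per_word = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)
    gold = extract_chunks(gold_tags)
    pred = set(extract_chunks(pred_tags))
    pred_spans = {(s, e) for s, e, _ in pred}
    # Identified: boundaries match; labeled: boundaries and label both match.
    identified = sum((s, e) in pred_spans for s, e, _ in gold) / len(gold)
    labeled = sum(c in pred for c in gold) / len(gold)
    return per_word, identified, labeled

gold = ["B-NP", "I-NP", "B-VG", "B-NP"]
pred = ["B-NP", "I-NP", "B-NP", "B-NP"]
per_word, identified, labeled = chunk_scores(gold, pred)
```

In the toy example one chunk has the right boundaries but the wrong label, so the identified rate is higher than the labeled rate, mirroring the pattern of the paired figures in the table. The sketch assumes the gold sequence is non-empty and contains at least one chunk.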
25
Assessment of Error Types

  Bengali
  Predicted class   Actual class   % of total error   % of class error
  NN                NNC            10.4               3.43
  NN                JJ             7.9                2.6
  NN                NNP            6.0                1.9
  VFM               VRB            4.4                5.4
  NNP               NNPC           4.4                11.11

  Hindi
  Predicted class   Actual class   % of total error   % of class error
  NN                NNP            14.5               10.2
  NN                JJ             7.9                5.6
  NN                NNC            6.0                4.27
  JJ                NN             3.9                14.34
  VFM               VAUX           3.1                5.4

  Telugu
  Predicted class   Actual class   % of total error   % of class error
  NN                JJ             12.5               9.5
  NN                NNP            10.9               8.3
  PREP              NLOC           6.1                23.7
  NN                RB             4.5                3.4
26
Results on Test Set
  Bengali data has been tagged using the ME + IMA model
  Hindi and Telugu data have been tagged with the simple ME model

  Language   Number of words   POS tagging accuracy   Chunking accuracy
  Bengali    5225              77.61                  80.59
  Hindi      4924              75.69                  74.92
  Telugu     5193              74.47                  68.59

  Chunk accuracy has been measured on a per-word basis
27
Conclusion and Future Scope
  Morphological restriction on tags gives an efficient tagging model even when only a small amount of labeled text is available
  The performance on Hindi and Telugu can be improved using morphological analyzers for those languages
  Linguistic prefix and suffix information can be adopted
  More features can be explored for chunking
28
Thank You