Dept. of Computer Science & Engg. Indian Institute of Technology Kharagpur Part-of-Speech Tagging and Chunking with Maximum Entropy Model Sandipan Dandapat.

Part-of-Speech Tagging and Chunking with Maximum Entropy Model
Sandipan Dandapat
Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur

Goal
- Lexical analysis: Part-of-Speech (POS) tagging assigns a part of speech (e.g. noun, verb) to each word.
- Syntactic analysis: Chunking identifies and labels phrases such as noun phrases and verb phrases.

Machine Learning to Resolve POS Tagging and Chunking
- HMM
  - Supervised (DeRose, 88; Meteer, 91; Brants, 2000; etc.)
  - Semi-supervised (Cutting, 92; Merialdo, 94; Kupiec, 92; etc.)
- Maximum Entropy (Ratnaparkhi, 96; etc.)
- TB(ED)L (Brill, 92, 94, 95; etc.)
- Decision Tree (Black, 92; Marquez, 97; etc.)

Our Approach
- Maximum Entropy based
  - Diverse and overlapping features
  - Language independence
  - Reasonably good accuracy
  - Data intensive
  - Absence of sequence information

POS Tagging Schema
[Diagram: Raw text → POS tagging (Language Model, Disambiguation Algorithm, Possible POS Class Restriction, …) → Tagged text]

POS Tagging: Our Approach
[Diagram: Raw text → POS tagging (ME Model, Disambiguation Algorithm, Possible POS Class Restriction, …) → Tagged text]
ME Model: the current state depends on the history (features).

Learning ME Model
- GIS (Generalized Iterative Scaling) finds the model parameters that define the maximum entropy classifier for a given feature set and training corpus.
- The parameters of the ME model are estimated using an off-the-shelf toolkit (http://maxent.sourceforge.net).
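The GIS update can be illustrated with a minimal sketch. This is not the toolkit's code or API: the names `predict` and `train_gis`, the `(predicate, label)` feature encoding, and the toy data are all assumptions made for illustration. The sketch also omits the GIS correction feature, as some implementations do, and assumes every sample has at least one active predicate.

```python
import math
from collections import defaultdict


def predict(weights, labels, predicates):
    """Return p(label | predicates) under a log-linear (maximum entropy) model."""
    scores = {y: sum(weights.get((p, y), 0.0) for p in predicates) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def train_gis(samples, iterations=50):
    """Estimate weights with GIS from (predicates, label) training pairs."""
    labels = sorted({y for _, y in samples})
    # GIS constant: the largest number of active predicates in any sample.
    c = max(len(preds) for preds, _ in samples)
    observed = defaultdict(float)  # empirical feature counts
    for preds, y in samples:
        for p in preds:
            observed[(p, y)] += 1.0
    weights = {feature: 0.0 for feature in observed}
    for _ in range(iterations):
        expected = defaultdict(float)  # feature counts under the current model
        for preds, _ in samples:
            for y, prob in predict(weights, labels, preds).items():
                for p in preds:
                    if (p, y) in weights:
                        expected[(p, y)] += prob
        # Move each weight so the expected count approaches the observed count.
        for feature in weights:
            weights[feature] += math.log(observed[feature] / expected[feature]) / c
    return weights, labels


# Hypothetical toy data: one predicate per sample.
samples = [(["w=the"], "DET"), (["w=dog"], "NN"), (["w=the"], "DET"), (["w=runs"], "VB")]
weights, labels = train_gis(samples)
dist = predict(weights, labels, ["w=the"])
```

Each iteration compares how often a feature fires in the training data with how often the current model expects it to fire, and adjusts the weight by the log of that ratio scaled by the constant `c`.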

POS Tagging: Our Approach
[Diagram: Raw text → POS tagging (ME Model, Disambiguation Algorithm, …) → Tagged text]
$t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$
$\{T\}$: set of all tags
$T_{MA}(w_i)$: set of tags computed by the Morphological Analyzer

POS Tagging: Our Approach
[Diagram: Raw text → POS tagging (ME Model, Beam Search, …) → Tagged text]
$t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$
$\{T\}$: set of all tags
$T_{MA}(w_i)$: set of tags computed by the Morphological Analyzer
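A minimal sketch of beam search with an optional per-word tag restriction follows. The function name `beam_search`, its parameters, the probability floor, and the toy model `toy_prob` are hypothetical; the slides do not say whether probabilities are renormalized over the restricted tag set, and this sketch does not renormalize.

```python
import math


def beam_search(words, tag_prob, all_tags, allowed_tags=None, beam_size=3):
    """Return the best tag sequence found by beam search.

    tag_prob(words, i, prev_tags) -> dict mapping tag to probability.
    allowed_tags(word) -> set of tags from a morphological analyzer, or None.
    """
    beam = [(0.0, [])]  # (log probability, tags chosen so far)
    for i, word in enumerate(words):
        candidates = allowed_tags(word) if allowed_tags else None
        if not candidates:
            candidates = all_tags  # no restriction known: fall back to the full tagset
        extended = []
        for logp, tags in beam:
            dist = tag_prob(words, i, tags)
            for tag in candidates:
                p = max(dist.get(tag, 0.0), 1e-12)  # floor avoids log(0)
                extended.append((logp + math.log(p), tags + [tag]))
        extended.sort(key=lambda item: item[0], reverse=True)
        beam = extended[:beam_size]  # keep only the best partial sequences
    return beam[0][1]


def toy_prob(words, i, prev_tags):
    """Hypothetical stand-in for the ME model's conditional distribution."""
    if words[i] == "the":
        return {"DET": 0.9, "NN": 0.05, "VB": 0.05}
    if prev_tags and prev_tags[-1] == "DET":
        return {"NN": 0.7, "VB": 0.2, "DET": 0.1}
    return {"VB": 0.6, "NN": 0.3, "DET": 0.1}


tags = ["DET", "NN", "VB"]
unrestricted = beam_search(["the", "book"], toy_prob, tags)
restricted = beam_search(["the", "book"], toy_prob, tags,
                         allowed_tags=lambda w: {"VB"} if w == "book" else None)
```

In the restricted call the analyzer stub allows only `VB` for "book", so the search never considers the otherwise more probable `NN`.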

Disambiguation Algorithm
Text: $w_1 \, w_2 \ldots w_n$
Tags: $t_1 \, t_2 \ldots t_n$
where $t_i \in \{T\}$ for all $w_i$; $\{T\}$ = set of tags

Disambiguation Algorithm
Text: $w_1 \, w_2 \ldots w_n$
Tags: $t_1 \, t_2 \ldots t_n$
where $t_i \in T_{MA}(w_i)$ for all $w_i$; $\{T\}$ = set of tags
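A sketch of the scoring that such a search relies on, assuming the standard formulation for maximum entropy taggers (Ratnaparkhi, 96) rather than anything stated on the slide. The symbols $h_i$ (history at position $i$), $f_j$ (binary feature functions), $\lambda_j$ (their weights) and $Z$ (normalizer) are introduced here for illustration:

```latex
P(t_1 \ldots t_n \mid w_1 \ldots w_n) \approx \prod_{i=1}^{n} p(t_i \mid h_i),
\qquad
p(t \mid h) = \frac{1}{Z(h)} \exp\Big( \sum_j \lambda_j f_j(h, t) \Big),
\qquad
Z(h) = \sum_{t'} \exp\Big( \sum_j \lambda_j f_j(h, t') \Big)
```

Under this assumption, disambiguation looks for the tag sequence with the highest product, with each $t_i$ drawn from $T_{MA}(w_i)$ when the morphological restriction is used and from $\{T\}$ otherwise.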

What are Features?
- Feature function: a binary function of the history and the target
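A hypothetical feature function, written as a sketch. The English suffix "ing", the tag `VBG`, and the dictionary layout of the history are assumptions chosen for illustration, not taken from the slides or their tagset:

```python
def suffix_feature(history, tag):
    """Return 1 when the current word ends in 'ing' and the candidate tag is 'VBG', else 0."""
    return 1 if history["word"].endswith("ing") and tag == "VBG" else 0


fires = suffix_feature({"word": "tagging"}, "VBG")
silent = suffix_feature({"word": "tagging"}, "NN")
```

The function looks at both the history and the candidate target, so the same word can activate it for one tag and not for another.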

POS Tagging Features
[Diagram: words (word) and POS tags (POS_Tag) at positions (pos) i-3 to i+3; labels: Estimated Tag (T4), Feature Set]
- 40 different experiments were conducted, taking several combinations from the set 'F'.

POS Tagging Features
[Diagram: words (word) and POS tags (POS_Tag) at positions (pos) i-3 to i+3; labels: Estimated Tag, Feature Set]
Condition: Static features for all words
- Current word ($w_i$)
- Previous word ($w_{i-1}$)
- Next word ($w_{i+1}$)
- |prefix| ≤ 4
- |suffix| ≤ 4
Condition: Dynamic features for all words
- POS tag of previous word ($t_{i-1}$)
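The feature set listed above can be sketched as a function that turns a position in a sentence into a list of string-valued predicates. The name `pos_features`, the predicate prefixes such as `w=` and `suf=`, and the boundary symbols `<s>` and `</s>` are assumptions for illustration:

```python
def pos_features(words, i, prev_tag):
    """Build static and dynamic features for the word at position i."""
    word = words[i]
    features = [
        "w=" + word,                                                 # current word
        "w-1=" + (words[i - 1] if i > 0 else "<s>"),                 # previous word
        "w+1=" + (words[i + 1] if i + 1 < len(words) else "</s>"),   # next word
        "t-1=" + prev_tag,                                           # dynamic: previous POS tag
    ]
    # Prefixes and suffixes of length 1 to 4 (or shorter for short words).
    for k in range(1, min(4, len(word)) + 1):
        features.append("pre=" + word[:k])
        features.append("suf=" + word[-k:])
    return features


example = pos_features(["He", "was", "tagging"], 2, "VAUX")
```

The previous tag is called dynamic because it is only known once the tagger has committed to a tag for the preceding word, whereas the word and affix features can be read directly from the input.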

Chunking Features
[Diagram: words (W1 … W7), POS tags (T1 … T7) and chunk tags (C1 … C7) at positions -3 to +3; labels: Estimated Tag, Feature Set]
Static features for all words
- Current word ($w_i$)
- POS tag of the current word ($t_i$)
- POS tags of previous two words ($t_{i-1}$ and $t_{i-2}$)
- POS tags of next two words ($t_{i+1}$ and $t_{i+2}$)
Dynamic features for all words
- Chunk tags of previous two words ($C_{i-1}$ and $C_{i-2}$)

Chunking Features
[Diagram: words (word), POS tags (POS_Tag) and chunk tags (Chunk_Tag) at positions (pos) i-3 to i+3; labels: Estimated Tag, Feature Set]
Static features for all words
- Current word ($w_i$)
- POS tag of the current word ($t_i$)
- POS tags of previous two words ($t_{i-1}$ and $t_{i-2}$)
- POS tags of next two words ($t_{i+1}$ and $t_{i+2}$)
Dynamic features for all words
- Chunk tags of previous two words ($C_{i-1}$ and $C_{i-2}$)
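A sketch of the chunking feature set in the same style. The name `chunk_features`, the predicate prefixes, the `<pad>` symbol for positions outside the sentence, and the example tags are assumptions for illustration:

```python
def chunk_features(words, pos_tags, i, prev_chunks):
    """Build chunking features for position i from words, POS tags and earlier chunk tags."""
    def pos_at(k):
        # POS tag at position k, or a padding symbol outside the sentence.
        return pos_tags[k] if 0 <= k < len(pos_tags) else "<pad>"

    c1 = prev_chunks[-1] if len(prev_chunks) >= 1 else "<pad>"
    c2 = prev_chunks[-2] if len(prev_chunks) >= 2 else "<pad>"
    return [
        "w=" + words[i],          # current word
        "t=" + pos_at(i),         # POS tag of the current word
        "t-1=" + pos_at(i - 1),   # POS tags of the previous two words
        "t-2=" + pos_at(i - 2),
        "t+1=" + pos_at(i + 1),   # POS tags of the next two words
        "t+2=" + pos_at(i + 2),
        "c-1=" + c1,              # dynamic: chunk tags of the previous two words
        "c-2=" + c2,
    ]


example = chunk_features(["the", "red", "book"], ["DET", "JJ", "NN"], 1, ["B-NP"])
```

Because the POS tags are inputs here, chunking quality depends on the tagger that produced them.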

Experiments: POS tagging
- Baseline Model
- Maximum Entropy Model
  - ME (Bengali, Hindi and Telugu)
  - ME + IMA (Bengali)
  - ME + CMA (Bengali)
- Data Used

Language              Bengali   Hindi    Telugu
Training data         20,396    21,470   21,416
Development data      5,023     5,681    6,098
Test data             5,226     4,924    5,193
No. of Chunk labels   6         7        6
No. of POS tags: 27, 25

Tagset and Corpus Ambiguity
- The tagset consists of 27 grammatical classes.
- Corpus ambiguity
  - Mean number of possible tags for each word
  - Measured in the tagged training data

Language           Dutch   German   English   French   Bengali   Hindi   Telugu
Corpus Ambiguity   1.11    1.3      1.34      1.69     1.75      1.85    1.70
Accuracy           96%     97%      96.5%     94.5%    ?         ?       ?
Unknown Words      13%     9%       11%       5%       33%       21%     56%
(Dermatas et al 1995)
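Corpus ambiguity as defined above can be computed with a short sketch. It assumes the mean is taken over word tokens (so frequent ambiguous words count more); the slides do not say whether tokens or distinct word types are averaged. The name `corpus_ambiguity` and the toy data are hypothetical:

```python
from collections import defaultdict


def corpus_ambiguity(tagged_tokens):
    """Mean number of distinct tags seen per word, averaged over tokens."""
    tags_of = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_of[word].add(tag)
    return sum(len(tags_of[word]) for word, _ in tagged_tokens) / len(tagged_tokens)


# "book" is seen with two tags, "the" with one: (1 + 2 + 2 + 1) / 4
toy = [("the", "DET"), ("book", "NN"), ("book", "VB"), ("the", "DET")]
ambiguity = corpus_ambiguity(toy)
```

A value of 1.0 would mean every word in the training data always carries the same tag.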

POS Tagging Results on Development Set
Overall Accuracy

Language           Bengali   Hindi    Telugu
Corpus Ambiguity   1.75      1.85     1.70
Accuracy           79.74%    83.10%   67.12%
Unknown Words      33%       21%      56%

POS Tagging Results on Development Set
[Chart: accuracy on known words, unknown words, and overall]

POS Tagging Results - Bengali

Results on Development set
Accuracy (%): overall (known words, unknown words)

Method     Bengali              Hindi                Telugu
Baseline   58.88                68.93                -
ME         79.74 (89.3, 60.5)   83.10 (90.9, 53.7)   67.82 (82.5, 70.0)
ME + IMA   83.51 (84.2, 82.1)   -                    -
ME + CMA   88.25 (89.3, 86.2)   -                    -
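A sketch of how overall, known-word and unknown-word accuracy can be computed, treating a word as known when it occurs in the training vocabulary. The name `accuracy_report`, its arguments, and the toy data are assumptions for illustration:

```python
def accuracy_report(words, gold_tags, predicted_tags, training_vocab):
    """Return percentage accuracy overall and split by known/unknown words."""
    def pct(correct, total):
        return 100.0 * correct / total if total else None

    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total] per bucket
    for word, gold, pred in zip(words, gold_tags, predicted_tags):
        bucket = "known" if word in training_vocab else "unknown"
        counts[bucket][0] += int(gold == pred)
        counts[bucket][1] += 1
    correct = counts["known"][0] + counts["unknown"][0]
    total = counts["known"][1] + counts["unknown"][1]
    return {
        "overall": pct(correct, total),
        "known": pct(*counts["known"]),
        "unknown": pct(*counts["unknown"]),
    }


report = accuracy_report(
    ["the", "dog", "zorp", "runs"],
    ["DET", "NN", "NN", "VB"],
    ["DET", "NN", "VB", "VB"],
    {"the", "dog", "runs"},
)
```

The overall figure is the average of the two split figures weighted by the share of each kind of word. For Bengali with 33% unknown words, 0.67 × 89.3 + 0.33 × 60.5 ≈ 79.8, which is close to the reported 79.74.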

Chunking Results
- Two different measures
  - Per word basis
  - Per chunk basis
    - Correctly identified groups along with correctly labeled groups

Evaluation Criteria   Method       Bengali      Hindi        Telugu
Per word basis        ME + I_POS   84.45        79.88        65.92
Per chunk basis       ME + I_POS   87.3, 80.6   74.1, 67.4   69.6, 56.7
Per chunk basis       ME + C_POS   93.3, 87.7   78.5, 74.4   -
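The two per-chunk figures can be sketched as follows, under several assumptions that the slides do not confirm: chunk tags use a B-/I-/O scheme such as `B-NP` and `I-NP`, both scores are measured against the gold-standard chunks, "identified" means the chunk boundaries match, and "labeled" means boundaries and label both match. The function names and toy data are hypothetical:

```python
def extract_chunks(tags):
    """Return (start, end, label) spans from a sequence of B-/I-/O chunk tags."""
    chunks = []
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        begins = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if (tag == "O" or begins) and start is not None:
            chunks.append((start, i, label))
            start, label = None, None
        if begins:
            start, label = i, tag[2:]
    return chunks


def chunk_scores(gold_tags, predicted_tags):
    """Return per-word accuracy and per-chunk (identified, labeled) percentages."""
    gold = extract_chunks(gold_tags)
    predicted = set(extract_chunks(predicted_tags))
    predicted_spans = {(s, e) for s, e, _ in predicted}
    identified = sum(1 for s, e, _ in gold if (s, e) in predicted_spans)
    labeled = sum(1 for chunk in gold if chunk in predicted)
    per_word = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return {
        "per_word": 100.0 * per_word / len(gold_tags),
        "identified": 100.0 * identified / len(gold),
        "labeled": 100.0 * labeled / len(gold),
    }


scores = chunk_scores(["B-NP", "I-NP", "B-VP", "B-NP"],
                      ["B-NP", "I-NP", "B-NP", "B-NP"])
```

In the toy call all three chunk boundaries are found, but the middle chunk carries the wrong label, so the labeled score is lower than the identified score.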

Assessment of Error Types

Bengali
Predicted Class   Actual Class   % of total error   % of class error
NN                NNC            10.4               3.43
NN                JJ             7.9                2.6
NN                NNP            6.0                1.9
VFM               VRB            4.4                5.4
NNP               NNPC           4.4                11.11

Hindi
Predicted Class   Actual Class   % of total error   % of class error
NN                NNP            14.5               10.2
NN                JJ             7.9                5.6
NN                NNC            6.0                4.27
JJ                NN             3.9                14.34
VFM               VAUX           3.1                5.4

Telugu
Predicted Class   Actual Class   % of total error   % of class error
NN                JJ             12.5               9.5
NN                NNP            10.9               8.3
PREP              NLOC           6.1                23.7
NN                RB             4.5                3.4
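A sketch of how such an error breakdown can be produced from gold and predicted tags. The slides do not define the two percentages, so the definitions here are assumptions: "% of total error" is the share of all tagging errors due to one (predicted, actual) pair, and "% of class error" is the share of tokens of the actual class that received that predicted tag. The name `confusion_summary` and the toy data are hypothetical:

```python
from collections import Counter


def confusion_summary(gold_tags, predicted_tags, top=5):
    """Return rows of (predicted, actual, % of total error, % of class error)."""
    errors = Counter((p, g) for g, p in zip(gold_tags, predicted_tags) if g != p)
    class_totals = Counter(gold_tags)  # number of tokens per actual class
    total_errors = sum(errors.values())
    rows = []
    for (predicted, actual), n in errors.most_common(top):
        rows.append((
            predicted,
            actual,
            100.0 * n / total_errors,
            100.0 * n / class_totals[actual],
        ))
    return rows


rows = confusion_summary(["NNC", "NNC", "NN", "JJ", "NN"],
                         ["NN", "NNC", "NN", "NN", "NN"])
```

Under these definitions a rare class can have a small share of the total error and still a large class error, which is the pattern seen for NNPC in Bengali and NLOC in Telugu.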

Results on Test Set
- Bengali data has been tagged using the ME + IMA model.
- Hindi and Telugu data have been tagged with the simple ME model.

Language   Number of Words   POS Tagging Accuracy   Chunking Accuracy
Bengali    5225              77.61                  80.59
Hindi      4924              75.69                  74.92
Telugu     5193              74.47                  68.59

- Chunk accuracy has been measured on a per-word basis.

Conclusion and Future Scope
- Morphological restriction on tags gives an efficient tagging model even when only a small amount of labeled text is available.
- The performance on Hindi and Telugu can be improved using morphological analyzers for those languages.
- Linguistic prefix and suffix information can be adopted.
- More features can be explored for chunking.

Thank You
