Published by Anabel Nichols. Modified over 9 years ago.
1
A New Approach for HMM Based Chunking for Hindi
Ashish Tiwari, Arnab Sinha
Under the guidance of Dr. Sudeshna Sarkar
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur
2
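TnT
1. A script converts the .utf file (IIIT corpus) into a .tt file (tagged training data).
2. tnt_para builds the model files (.123 file and .lex file) from the .tt file.
3. TnT uses the model files to tag the .t file (untagged data), producing a .tts file (tagged by TnT).
4. tnt_diff compares the .tts file with the .tt file (tagged) and reports the accuracy.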
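The training file in this flow pairs each token with its tag. A minimal sketch of producing such lines, assuming the layout of one token and one tag per line separated by whitespace; the function name `to_tt_lines` is hypothetical and not part of TnT or of the authors' script:

```python
from typing import List


def to_tt_lines(tokens: List[str], tags: List[str]) -> str:
    """Render parallel token and tag lists as 'token<TAB>tag' lines.

    Assumption: the tagged training file holds one token/tag pair per line.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return "".join(f"{tok}\t{tag}\n" for tok, tag in zip(tokens, tags))


if __name__ == "__main__":
    # Tokens in the word_POS scheme, tags in the 2-tag chunk scheme.
    print(to_tt_lines(["ashish_NN", "arnab_NN", "of_PREP"],
                      ["STRT", "STRT", "CNT"]), end="")
```

For chunking, the "tag" column carries the chunk-tag rather than a POS tag, which is what lets a POS tagger be reused as a chunker.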
3
Tool Flow
Chunk Boundary
1. Parse the corpus
2. Apply 4 types of token schemes
3. Apply 3 different tag schemes
4. Add POS context to chunk-tags
5. Results
Chunk labeling
1. Do chunk labeling
2. Results
Compare the accuracies
Recommendations
4
Token schemes
Example sentence, used for all four schemes:
(( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
Gloss: ashish arnab of behind market in went .
POS: NN NN PREP PREP NN PREP VB SYM
1. (word-token, Chunk-tag): ashish arnab of behind market in went .
2. (POS-tag, Chunk-tag): NN NN PREP PREP NN PREP VB SYM
3. (word_POS-tag, Chunk-tag): ashish_NN arnab_NN of_PREP behind_PREP market_NN in_PREP went_VB ._SYM
4. (POS-tag_word, Chunk-tag): NN_ashish NN_arnab PREP_of PREP_behind NN_market PREP_in VB_went SYM_.
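The four token schemes differ only in what string is handed to the tagger as the observed token. A minimal sketch, using the gloss words and POS tags from the example sentence; the function name `make_tokens` and the scheme names are illustrative, not taken from the authors' tool:

```python
from typing import List, Tuple

# (gloss word, POS tag) pairs for the example sentence.
SENTENCE: List[Tuple[str, str]] = [
    ("ashish", "NN"), ("arnab", "NN"), ("of", "PREP"), ("behind", "PREP"),
    ("market", "NN"), ("in", "PREP"), ("went", "VB"), (".", "SYM"),
]


def make_tokens(sentence: List[Tuple[str, str]], scheme: str) -> List[str]:
    """Build the observation tokens for one of the four token schemes."""
    if scheme == "word":
        return [word for word, _ in sentence]
    if scheme == "pos":
        return [pos for _, pos in sentence]
    if scheme == "word_pos":
        return [f"{word}_{pos}" for word, pos in sentence]
    if scheme == "pos_word":
        return [f"{pos}_{word}" for word, pos in sentence]
    raise ValueError(f"unknown token scheme: {scheme}")


if __name__ == "__main__":
    for name in ("word", "pos", "word_pos", "pos_word"):
        print(name, make_tokens(SENTENCE, name))
```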
5
Chunk Tag schemes
2-Tag Scheme: {STRT, CNT}
3-Tag Scheme: {STRT, CNT, END}
4-Tag Scheme: {STRT, CNT, END, STRT_END}
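The tag sets can be read as position markers inside a chunk. A minimal sketch of assigning them from chunk lengths, under two assumptions that the slide does not state: STRT_END marks a single-token chunk in the 4-tag scheme, and a single-token chunk is tagged STRT in the 2-tag and 3-tag schemes. The function name `chunk_tags` is hypothetical:

```python
from typing import List


def chunk_tags(chunk_lengths: List[int], scheme: int) -> List[str]:
    """Tag every token by its position in its chunk (scheme is 2, 3 or 4)."""
    if scheme not in (2, 3, 4):
        raise ValueError("scheme must be 2, 3 or 4")
    tags: List[str] = []
    for length in chunk_lengths:
        if length < 1:
            raise ValueError("chunks must contain at least one token")
        for i in range(length):
            first, last = i == 0, i == length - 1
            if scheme == 4 and first and last:
                tags.append("STRT_END")   # assumed: single-token chunk
            elif first:
                tags.append("STRT")
            elif scheme >= 3 and last:
                tags.append("END")
            else:
                tags.append("CNT")
    return tags


if __name__ == "__main__":
    # (( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
    lengths = [1, 3, 2, 2]
    for scheme in (2, 3, 4):
        print(scheme, chunk_tags(lengths, scheme))
```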
6
Adding POS-tag to Chunk-tag
Ex: Word as token and POS:2tag chunking
(( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
Token: ashish arnab of behind market in went .
POS: NN NN PREP PREP NN PREP VB SYM
Label: NN:STRT NN:STRT PREP:CNT PREP:CNT NN:STRT PREP:CNT VB:STRT SYM:CNT
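Adding POS context amounts to joining each POS tag and chunk-tag with a colon before training, and splitting them apart again after tagging. A minimal sketch; both function names are hypothetical:

```python
from typing import List


def add_pos_context(pos_tags: List[str], chunk_tags: List[str]) -> List[str]:
    """Join each POS tag with its chunk-tag as 'POS:CHUNK'."""
    if len(pos_tags) != len(chunk_tags):
        raise ValueError("pos_tags and chunk_tags must have the same length")
    return [f"{pos}:{chunk}" for pos, chunk in zip(pos_tags, chunk_tags)]


def strip_pos_context(labels: List[str]) -> List[str]:
    """Recover the plain chunk-tags from 'POS:CHUNK' labels."""
    return [label.split(":", 1)[1] for label in labels]


if __name__ == "__main__":
    pos = ["NN", "NN", "PREP", "PREP", "NN", "PREP", "VB", "SYM"]
    two_tag = ["STRT", "STRT", "CNT", "CNT", "STRT", "CNT", "STRT", "CNT"]
    labels = add_pos_context(pos, two_tag)
    print(labels)
    print(strip_pos_context(labels))
```

The combined labels enlarge the tag set the HMM has to learn, so each state carries information about the POS tag as well as the chunk position.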
7
Colon vs Non-Colon
Corpus size = 20000 words
Marginal improvement
In a large data set, token might perform better
8
Chunk Boundary identification
Results are improved.
Tagging with the 4-tag scheme and then converting to the 2-tag scheme gives the highest precision and recall.
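Converting 4-tag output to the 2-tag set can be sketched as a simple relabeling. The mapping below is an assumption, since the slides do not spell it out: tags that open a chunk (STRT, STRT_END) become STRT, and all others become CNT.

```python
from typing import Dict, List

# Assumed mapping from the 4-tag set to the 2-tag set.
FOUR_TO_TWO: Dict[str, str] = {
    "STRT": "STRT",
    "STRT_END": "STRT",
    "CNT": "CNT",
    "END": "CNT",
}


def to_two_tag(tags: List[str]) -> List[str]:
    """Collapse a 4-tag (or 3-tag) sequence to the 2-tag set."""
    return [FOUR_TO_TWO[tag] for tag in tags]


if __name__ == "__main__":
    four = ["STRT_END", "STRT", "CNT", "END", "STRT", "END", "STRT", "END"]
    print(to_two_tag(four))
```

The finer tag set is used only during tagging; evaluation then happens on the collapsed tags, where only chunk starts matter for boundary identification.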
9
Addition of POS-tag Information to Chunk-tags
A significant increase in precision and recall is observed.
The 4-tag to 2-tag scheme scores highest.
10
Labeling the Chunks
First Scheme
token: _
label: :POS-tag: (if this is the first token of the chunk), :POS-tag (otherwise)
Second Scheme
token: _
label: :POS-tag: (for all tokens)
Third Scheme
token: _
label: :POS-tag: (if this is the last token of the chunk), :POS-tag (otherwise)
11
Results: Labelling of Chunks
The first scheme gives the highest precision, 89.02%. The word_pos token approach is not far behind, with 85.58% precision and the highest recall, 98.48%.
The recall of the word_pos and pos_word approaches is the same in all schemes, apparently because the ordering of word and POS tag within the token adds no new knowledge to the model.
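The slides report precision and recall without defining them. A minimal sketch, assuming the common chunk-level definition in which a predicted chunk counts as correct only if both of its boundaries match a gold chunk; it scores boundaries only, from 2-tag sequences, and the function names are hypothetical:

```python
from typing import List, Set, Tuple


def chunk_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Turn a 2-tag sequence into (start, end) spans, end exclusive.

    Tokens before the first STRT belong to no chunk and are ignored.
    """
    spans: Set[Tuple[int, int]] = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "STRT":
            if start is not None:
                spans.add((start, i))
            start = i
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def precision_recall(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Chunk-level precision and recall of pred against gold."""
    gold_spans, pred_spans = chunk_spans(gold), chunk_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = ["STRT", "STRT", "CNT", "CNT", "STRT", "CNT", "STRT", "CNT"]
    pred = ["STRT", "STRT", "CNT", "STRT", "STRT", "CNT", "STRT", "CNT"]
    print(precision_recall(gold, pred))
```

Scoring labeled chunks would follow the same pattern with the chunk label added to each span tuple.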
12
Recommendations
For identification of chunk boundary
Best option: :
Subsequent conversion to the 2-tag set gives better results.
For chunk labeling
Scheme 1 is best.
Adding POS-tag information improves the precision and recall of chunk labeling.
This approach can be used for other Indian languages as well.
13
References
Daniel Jurafsky and James H. Martin. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
Miles Osborne. 2000. Shallow Parsing as Part-of-Speech Tagging. Proceedings of CoNLL-2000.
Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text Chunking Using Transformation-Based Learning. Proceedings of the 3rd Workshop on Very Large Corpora, 88-94.
W. Skut and T. Brants. 1998. Chunk Tagger, Statistical Recognition of Noun Phrases. ESSLLI-1998.
Thorsten Brants. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Conference on Applied Natural Language Processing, 224-231.