Published by Tobias Ellis. Modified over 9 years ago.
1
ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task
Avinash Yadav, Robins Yadav, Sukomal Pal
Department of Computer Science & Engineering
Indian School of Mines, Dhanbad, India
2
Contents
- Introduction
- Adhoc retrieval task participation
- Morpheme Extraction Task participation
- Conclusion
3
Introduction
- Stemmer
- ISMstemmer
- Evaluation
4
Stemmer
- Attempts to reduce word variants to their stem or root form
- Example: education, educating, educative all reduce to educat
- Approaches for stemming:
  - Language-based approach
  - Statistical approach
5
ISMstemmer
- Statistical stemmer
- Based on suffix extraction
- Suffix frequency algorithm
6
Data Preprocessing
1. Convert the corpus into a single file: File 1, File 2, …, File n → Single File
2. Clean the data
   Before: John asked a girl with an apple of Kashmir, “ do you have the time ”. She said, “ yes ”.
   After: John asked a girl with an apple of Kashmir do you have the time she said yes
3. Remove stop words
   Before: John asked a girl with an apple of Kashmir do you have the time she said yes
   After: John asked girl with apple Kashmir you time she said yes
4. Convert the file into a single column (one word per line)
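The four preprocessing steps can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the stop-word list is hypothetical (picked only so that the slide's example sentence loses the same words), the cleaning rule handles ASCII letters only, and, unlike the slide's output, the sketch does not change letter case.

```python
import re

# Hypothetical stop-word list; the list actually used is not given in the slides.
STOP_WORDS = {"a", "an", "of", "do", "have", "the"}

def clean(text):
    """Replace every character that is not an ASCII letter or whitespace with a space."""
    return re.sub(r"[^A-Za-z\s]", " ", text)

def preprocess(documents, stop_words=STOP_WORDS):
    """Merge documents, clean them, drop stop words, and return one word per entry."""
    merged = " ".join(documents)        # step 1: single file
    tokens = clean(merged).split()      # step 2: cleaning
    # step 3: stop-word removal (case-insensitive match, original case kept)
    return [t for t in tokens if t.lower() not in stop_words]

docs = [
    'John asked a girl with an apple of Kashmir, " do you have the time ".',
    'She said, " yes ".',
]
column = preprocess(docs)
single_column_file = "\n".join(column)  # step 4: one word per line
```

Hindi text would need a Unicode-aware cleaning rule in place of the ASCII pattern assumed here.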
7
Data preprocessing (contd.)
Unique words extracted:
- Hindi: 4,90,391
- English: 7,95,144
8
Find valid suffixes
1. Reverse the words of the single-column file
   Words: aborning absolution absorption abuilding acquisition activation added addition admiration admitted admitting agreed agreeing allotted allotting ambling angling
   Reversed: gninroba noitulosba noitprosba gnidliuba noitisiuqca noitavitca dedda noitidda noitarimda dettimda gnittimda deerga gnieerga dettolla gnittolla gnilbma gnilgna
2. Sort the reversed list
   dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
3. Find suffixes according to the threshold
   Reversed suffixes found: de, gni, noit (frequency annotations on the slide: 17%, 40%)
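A rough sketch of the suffix-frequency idea on the 17 example words is given below. The slides do not spell out how candidate suffixes are enumerated, so the length limits (`min_len`, `max_len`) and the rule "share of words ending in the suffix must reach the threshold" are assumptions made for illustration.

```python
from collections import Counter

WORDS = [
    "aborning", "absolution", "absorption", "abuilding", "acquisition",
    "activation", "added", "addition", "admiration", "admitted", "admitting",
    "agreed", "agreeing", "allotted", "allotting", "ambling", "angling",
]

def reversed_sorted(words):
    """Reverse every word and sort, so words sharing a suffix become neighbours."""
    return sorted(w[::-1] for w in words)

def suffix_frequencies(words, min_len=2, max_len=4):
    """Share of words ending in each candidate suffix (lengths are assumed limits)."""
    counts = Counter()
    for rev in reversed_sorted(words):
        # A prefix of the reversed word is a suffix of the original word;
        # at least one character is always left for the stem.
        for n in range(min_len, min(max_len, len(rev) - 1) + 1):
            counts[rev[:n][::-1]] += 1
    return {suffix: c / len(words) for suffix, c in counts.items()}

def valid_suffixes(words, threshold):
    """Keep suffixes whose share of the vocabulary is at least `threshold`."""
    return {s: f for s, f in suffix_frequencies(words).items() if f >= threshold}

# 0.2 is an illustrative threshold for this tiny list, not the one used on the corpora.
frequent = valid_suffixes(WORDS, threshold=0.2)
```

On a vocabulary of several hundred thousand unique words, much smaller thresholds are needed for the same suffixes to qualify.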
9
Threshold used
- English: 0.01 – 0.1%
- Hindi: 0.1 – 1.0%
10
Stemming of corpus
1. Strip the reversed valid suffixes from the reversed words
   Reversed words: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
   After stripping: dda erga ttimda ttolla dliuba eerga lbma lgna nroba ttimda ttolla arimda avitca idda isiuqca prosba ulosba
2. Reverse the stripped words back to obtain the stems
   add agre admitt allott abuild agree ambl angl aborn admitt allott admira activa addi acquisi absorp absolu
11
Note: If the stem left after stripping would be shorter than 3 letters, the word is not stemmed.
Example: aging (would give ag) and king (would give k) are left unchanged.
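The stripping step together with the minimum-length rule can be sketched as below. The suffix set is taken from the worked example (ed, ing, tion); trying the longest suffix first is an assumption, since the slides do not say how competing suffixes are resolved.

```python
MIN_STEM_LEN = 3  # stems shorter than this are rejected, per the note above

def stem(word, suffixes, min_stem_len=MIN_STEM_LEN):
    """Strip the first matching suffix, working on reversed strings as in the slides."""
    rev = word[::-1]
    # Assumption: longer suffixes are tried before shorter ones.
    for suffix in sorted(suffixes, key=len, reverse=True):
        rev_suffix = suffix[::-1]
        if rev.startswith(rev_suffix):
            rev_stem = rev[len(rev_suffix):]
            if len(rev_stem) >= min_stem_len:
                return rev_stem[::-1]   # reverse back to get the stem
            return word                 # stem too short: leave the word unstemmed
    return word                         # no valid suffix matched

SUFFIXES = {"ed", "ing", "tion"}
examples = ["added", "agreed", "admitting", "admiration", "aging", "king"]
stems = {w: stem(w, SUFFIXES) for w in examples}
```

The guard only looks at length, so a capitalised name that happens to end in a valid suffix is still stripped.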
12
Evaluation of ISMstemmer
To evaluate ISMstemmer, we participated in:
1. Monolingual Adhoc retrieval task in English and Hindi
2. Morpheme Extraction Task (MET) of FIRE-2012
13
Adhoc Retrieval Task (ART) Participation
Monolingual task. Languages chosen:
- English: approach, results
- Hindi: approach, results
14
ART: English
Approach:
- Indexing: search engine used: Indri (IndriBuildIndex)
- Retrieval: search engine used: Lemur (RetEval)
Data provided:
- Corpus from The Telegraph and BD News
- 50-query set
15
ART: English (contd.)
Results:
Run id                 No. of queries  No. of results  No. of relevant docs  No. of rel. docs ret.  MAP value
EE.ism.unstemmed       50              50000           3539                  2503                   0.2264
EE.ism.krovetzstemmer  50              50000           3539                  2504                   0.2255
EE.ism.ismstemmer      50              50000           3539                  2415                   0.2096
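For readers unfamiliar with the MAP column, the sketch below shows the standard textbook definition of mean average precision, plus the recall implied by the counts in the table. It is an illustration only; the official scores come from the track's evaluation tooling, and the toy document ids and function names here are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Mean of precision@k over the ranks k at which a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant_ids)

def mean_average_precision(runs, qrels):
    """Average the per-query AP over all queries that have relevance judgements."""
    return sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items()) / len(qrels)

# Toy example with hypothetical document ids.
toy_runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["a", "b"]}
toy_qrels = {"q1": {"d1", "d3"}, "q2": {"b"}}
toy_map = mean_average_precision(toy_runs, toy_qrels)

# Recall implied by the table: relevant documents retrieved / relevant documents.
RELEVANT = 3539
retrieved_relevant = {
    "EE.ism.unstemmed": 2503,
    "EE.ism.krovetzstemmer": 2504,
    "EE.ism.ismstemmer": 2415,
}
recall = {run: n / RELEVANT for run, n in retrieved_relevant.items()}
```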
16
ART: Hindi
Approach:
- Indexing: search engine used: Indri (IndriBuildIndex)
- Retrieval: search engine used: Indri (IndriRunQuery)
Data provided:
- Corpus from Navbharat Times and Amar Ujala
- 50-query set
17
ART: Hindi (contd.)
Results:
Run id                            No. of queries  No. of results  No. of relevant docs  No. of rel. docs ret.  MAP value
HH.ism.unstemmed.indri            50              50000           2309                  222                    0.0173
HH.stemmmedcorpus.unstemmedquery  50              50000           2309                  98                     0.0026
HH.stemmmedcorpus.stemmedquery    50              50000           2309                  209                    0.0137
18
Morpheme Extraction Task Participation
- Tool submitted
- Results
19
MET Tool Submission
- ISMstemmer submitted
- Evaluated at IR Labs, DAIICT, Gujarat
- Tested on 6 languages of South Asian origin
- Gave good results with 3 languages
20
MET Results:
1. BENGALI
Institute     Language  MAP Obtained
Baseline      Bengali   0.2740
JU            Bengali   0.3307
DCU           Bengali   0.3300
IIT-KGP       Bengali   0.3225
CVPR-Team1    Bengali   0.3159
ISM           Bengali   0.3103
CVPR-Team2 +  Bengali   NA
21
MET Results (contd.)
2. GUJARATI
Institute  Language  MAP Obtained
Baseline   Gujarati  0.2677
ISM        Gujarati  0.2824
3. MARATHI
Institute  Language  MAP Obtained
Baseline   Marathi   0.2320
ISM        Marathi   0.2797
IIT-B      Marathi   0.2684
22
MET Results (contd.)
4. ODIA
Institute  Language  MAP Obtained
Baseline   Odia      0.1537
IIIT-Bh    Odia      0.1537
ISM        Odia      0.1537
5. HINDI
Institute  Language  MAP Obtained
Baseline   Hindi     0.2821
DCU        Hindi     0.2963
ISM        Hindi     0.2793
23
MET Results (contd.)
6. TAMIL
Institute  Language  MAP Obtained
Baseline   Tamil     NA
AUCEG      Tamil     NA
ISM        Tamil     NA
NA: results are not available, due to non-availability of qrels
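To compare the ISM runs with the baselines across languages, the relative change in MAP can be computed from the figures reported in the MET tables. The sketch below does only that arithmetic; the variable and function names are illustrative, and Tamil is omitted because no scores are available.

```python
# (baseline MAP, ISM MAP) per language, copied from the MET result tables.
MET_MAP = {
    "Bengali": (0.2740, 0.3103),
    "Gujarati": (0.2677, 0.2824),
    "Marathi": (0.2320, 0.2797),
    "Odia": (0.1537, 0.1537),
    "Hindi": (0.2821, 0.2793),
}

def relative_change_percent(baseline, score):
    """Percentage change of `score` relative to `baseline`."""
    return (score - baseline) / baseline * 100

relative = {lang: relative_change_percent(b, s) for lang, (b, s) in MET_MAP.items()}

for lang, change in relative.items():
    print(f"{lang:<9} {change:+.2f}%")
```

A positive value means the ISM run scored above the baseline, zero means it matched the baseline, and a negative value means it fell below.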
24
Reasons for Underperformance with Hindi
- Overstemming
- Undesired stemming of proper nouns
25
Overstemming
Overstemming occurs when words that should not be grouped together are reduced to the same stem.
Example (correct grouping):
1. accent, accentual, accentuate → stem: accent
2. accept, acceptant, acceptor → stem: accept
3. access, accessible, accession → stem: access
With overstemming, all of these words may be grouped under the wrong stem: acce
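The effect can be made concrete by grouping the nine example words by stem. In the sketch below, `overstem` is a deliberately crude stand-in (it keeps only the first four letters) and is not how ISMstemmer works; it merely produces the acce stem mentioned above so that the two groupings can be compared.

```python
from collections import defaultdict

WORDS = [
    "accent", "accentual", "accentuate",
    "accept", "acceptant", "acceptor",
    "access", "accessible", "accession",
]
INTENDED_STEMS = ("accent", "accept", "access")

def intended_stem(word):
    """Map each word to the stem listed for it in the example."""
    for s in INTENDED_STEMS:
        if word.startswith(s):
            return s
    return word

def overstem(word):
    """Hypothetical over-aggressive stemmer: truncate to four letters."""
    return word[:4]

def group_by_stem(words, stemmer):
    """Collect words that share a stem into one group."""
    groups = defaultdict(list)
    for w in words:
        groups[stemmer(w)].append(w)
    return dict(groups)

correct_groups = group_by_stem(WORDS, intended_stem)  # three separate groups
merged_groups = group_by_stem(WORDS, overstem)        # one group under "acce"
```

In retrieval terms, a query for one of these words would then match documents about all three unrelated concepts.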
26
Undesired stemming of proper nouns
- Proper nouns should not be stemmed, as they are not inflected
- Example: Beijing gets stemmed to Beij
27
Conclusion
ART:
- English: not satisfactory
- Hindi: poor
- Reasons: overstemming; undesired stemming of proper nouns
MET:
- Performed well with Bengali, Gujarati and Marathi
- Performed up to the mark with Odia
- Underperformed with Hindi
30
THANK YOU!!