ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task
Avinash Yadav, Robins Yadav, Sukomal Pal
Department of Computer Science & Engineering, Indian School of Mines, Dhanbad, India

Contents
- Introduction
- Adhoc retrieval task participation
- Morpheme Extraction Task participation
- Conclusion

Introduction
- Stemmer
- ISMstemmer
- Evaluation

Stemmer
- Attempts to reduce word variants to their stem or root form
- Example: education, educating, educative will all reduce to educat
Approaches for stemming
- Language-based approach
- Statistical approach

ISMstemmer
- Statistical stemmer
- Based on suffix extraction
- Suffix frequency algorithm

Data Preprocessing
1. Convert the corpus into a single file: File 1, File 2, …, File n -> Single File
2. Clean the data
   Before: John asked a girl with an apple of Kashmir, “do you have the time”. She said, “yes”.
   After: John asked a girl with an apple of Kashmir do you have the time she said yes
3. Remove stop words
   Before: John asked a girl with an apple of Kashmir do you have the time she said yes
   After: John asked girl with apple Kashmir you time she said yes
4. Convert the file into a single column

Data Preprocessing (contd.)
- Unique words extracted
  - Hindi: 4,90,391
  - English: 7,95,144
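The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration for the English example sentence: the stop-word list is hypothetical (the slides do not say which list was used), everything is lowercased for simplicity, and the ASCII-only tokenizer would have to be replaced for Hindi text.

```python
import re

# Hypothetical stop-word list, chosen only to reproduce the slide example.
STOP_WORDS = {"a", "an", "of", "do", "have", "the"}

def clean(text):
    """Lowercase the text and keep only runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

def unique_words(tokens):
    """One entry per distinct word, sorted: the 'single column' vocabulary."""
    return sorted(set(tokens))

sentence = ('John asked a girl with an apple of Kashmir, '
            '"do you have the time". She said, "yes".')
tokens = remove_stop_words(clean(sentence))
print(" ".join(tokens))
print(unique_words(tokens))
```

Writing `unique_words(tokens)` to a file, one word per line, would give the single-column file that the later steps operate on.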

Find valid suffixes
1. Reverse the words of the single-column file
   Words: aborning absolution absorption abuilding acquisition activation added addition admiration admitted admitting agreed agreeing allotted allotting ambling angling
   Reversed: gninroba noitulosba noitprosba gnidliuba noitisiuqca noitavitca dedda noitidda noitarimda dettimda gnittimda deerga gnieerga dettolla gnittolla gnilbma gnilgna
2. Sort the reversed list
   Sorted: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
3. Find suffixes according to the threshold
   Reversed suffixes marked on the slide: de, gni, noit (percentages shown on the slide: 17%, 40%)

Threshold used
- English: 0.01 – 0.1%
- Hindi: 0.1 – 1.0%
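One way to read the reverse, sort and threshold procedure is as a suffix frequency count: a prefix of a reversed word is a suffix of the original word, so counting shared prefixes in the reversed list gives the share of the vocabulary that ends in each suffix. The sketch below follows that reading. The function name, the `min_len` and `max_len` limits and the 20% toy threshold are assumptions made for this small word list, not values from the slides, and the actual ISMstemmer selection rule may differ.

```python
from collections import Counter

WORDS = ("aborning absolution absorption abuilding acquisition activation "
         "added addition admiration admitted admitting agreed agreeing "
         "allotted allotting ambling angling").split()

def candidate_suffixes(words, threshold, min_len=2, max_len=4):
    """Return {suffix: share of vocabulary} for suffixes at or above threshold."""
    # Reverse every distinct word and sort, as on the slide.
    reversed_words = sorted(w[::-1] for w in set(words))
    counts = Counter()
    for rw in reversed_words:
        # Never count the whole word as a suffix: leave at least one letter.
        for k in range(min_len, min(max_len, len(rw) - 1) + 1):
            counts[rw[:k]] += 1
    total = len(reversed_words)
    return {rs[::-1]: n / total
            for rs, n in counts.items()
            if n / total >= threshold}

# Toy threshold of 20%; the slides use far smaller values on the full corpus.
for suffix, share in sorted(candidate_suffixes(WORDS, 0.20).items()):
    print(f"{suffix}: {share:.0%}")
```

On a vocabulary of several hundred thousand words, the much smaller thresholds listed on the slide play the same role as the toy 20% here.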

Stemming of corpus
1. Stem the reversed words with the reversed valid suffixes
   Reversed words: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
   Stemmed: dda erga ttimda ttolla dliuba eerga lbma lgna nroba ttimda ttolla arimda avitca idda isiuqca prosba ulosba
2. Reverse the stemmed words to restore the original letter order
   Stems: add agre admitt allott abuild agree ambl angl aborn admitt allott admira activa addi acquisi absorp absolu

Note: If the stem left after removing the suffix would be shorter than 3 letters, the word is not stemmed. Example: aging (would become ag) and king (would become k) are left unchanged.
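The suffix stripping and the minimum-length rule can be sketched as follows. The slides strip reversed suffixes from reversed words; `str.endswith` on the original word is equivalent. Trying the longest suffix first is an assumption (the slides do not say how competing suffixes are resolved), and the suffix set is the toy one from the example.

```python
SUFFIXES = {"ed", "ing", "tion"}  # toy suffix set taken from the slide example

def stem(word, suffixes=SUFFIXES, min_stem_len=3):
    """Strip the first matching suffix unless the stem would be too short."""
    # Assumption: longer suffixes take priority over shorter ones.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            candidate = word[:-len(suffix)]
            # Minimum-length rule from the note: keep the word as it is.
            return candidate if len(candidate) >= min_stem_len else word
    return word

for w in ["added", "agreed", "admitting", "admiration", "aging", "king"]:
    print(w, "->", stem(w))
```

The same function also reproduces the proper-noun problem discussed later in the talk: `stem("Beijing")` returns `Beij`, because nothing in the rule distinguishes names from inflected words.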

Evaluation of ISMstemmer
To evaluate ISMstemmer, we participated in:
1. Monolingual Adhoc retrieval task in English and Hindi
2. Morpheme Extraction Task (MET) of FIRE 2012

Adhoc Retrieval Task (ART) Participation
Monolingual task
Languages chosen:
- English
  - Approach
  - Results
- Hindi
  - Approach
  - Results

ART: English
Approach:
- Indexing
  - Search engine used: Indri (IndriBuildIndex)
- Retrieval
  - Search engine used: Lemur (RetEval)
- Data provided
  - Corpus from The Telegraph and BD News
  - 50-query set

ART: English (contd.)
Results:
Run id | No. of queries | No. of results | No. of relevant docs | No. of rel. docs ret. | MAP value
EE.ism.unstemmed | 50 | 50000 | 3539 | 2503 | 0.2264
EE.ism.krovetzstemmer | 50 | 50000 | 3539 | 2504 | 0.2255
EE.ism.ismstemmer | 50 | 50000 | 3539 | 2415 | 0.2096
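For readers unfamiliar with the MAP column, the sketch below shows the standard definition of mean average precision: for each query, average the precision at every rank where a relevant document appears, divide by the number of relevant documents for that query, then take the mean over queries. The slides do not say which tool computed the reported values, so the function names and the tiny data set here are purely illustrative.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)

def mean_average_precision(runs, qrels):
    """Mean of the per-query average precision over all queries in qrels."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Illustrative data only: two queries with made-up document ids.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d6", "d7"}}
print(round(mean_average_precision(runs, qrels), 4))
```

Under this definition a run can retrieve more relevant documents than another and still score a lower MAP if it ranks them further down the list.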

ART: Hindi
Approach:
- Indexing
  - Search engine used: Indri (IndriBuildIndex)
- Retrieval
  - Search engine used: Indri (IndriRunQuery)
- Data provided
  - Corpus from Navbharat Times and Amar Ujala
  - 50-query set

ART: Hindi (contd.)
Results:
Run id | No. of queries | No. of results | No. of relevant docs | No. of rel. docs ret. | MAP value
HH.ism.unstemmed.indri | 50 | 50000 | 2309 | 222 | 0.0173
HH.stemmmedcorpus.unstemmedquery | 50 | 50000 | 2309 | 98 | 0.0026
HH.stemmmedcorpus.stemmedquery | 50 | 50000 | 2309 | 209 | 0.0137

Morpheme Extraction Task Participation
- Tool submitted
- Results

MET Tool Submission
- ISMstemmer submitted
- Evaluated at IR Labs, DAIICT, Gujarat
- Tested on 6 languages of South Asian origin
- Gave good results for 3 of those languages

MET Results
1. BENGALI
Institute | Language | MAP obtained
Baseline | Bengali | 0.2740
JU | Bengali | 0.3307
DCU | Bengali | 0.3300
IIT-KGP | Bengali | 0.3225
CVPR-Team1 | Bengali | 0.3159
ISM | Bengali | 0.3103
CVPR-Team2 + | Bengali | NA

MET Results (contd.)
2. GUJARATI
Institute | Language | MAP obtained
Baseline | Gujarati | 0.2677
ISM | Gujarati | 0.2824
3. MARATHI
Institute | Language | MAP obtained
Baseline | Marathi | 0.2320
ISM | Marathi | 0.2797
IIT-B | Marathi | 0.2684

MET Results (contd.)
4. ODIA
Institute | Language | MAP obtained
Baseline | Odia | 0.1537
IIIT-Bh | Odia | 0.1537
ISM | Odia | 0.1537
5. HINDI
Institute | Language | MAP obtained
Baseline | Hindi | 0.2821
DCU | Hindi | 0.2963
ISM | Hindi | 0.2793

MET Results (contd.)
6. TAMIL
Institute | Language | MAP obtained
Baseline | Tamil | NA
AUCEG | Tamil | NA
ISM | Tamil | NA
NA: results are not available because qrels were not available

Reasons for Underperformance with Hindi
- Overstemming
- Undesired stemming of proper nouns

Overstemming
Words that should not be grouped together end up with the same stem.
Example:
1. accent, accentual, accentuate: stem word accent
2. accept, acceptant, acceptor: stem word accept
3. access, accessible, accession: stem word access
With overstemming, all of these may be grouped under the wrong stem acce.

Undesired stemming of proper nouns
- Proper nouns should not be stemmed, as they are not inflected
- Example: Beijing gets stemmed to Beij

Conclusion
ART:
- English: not satisfactory
- Hindi: poor
- Reasons:
  - Overstemming
  - Undesired stemming of proper nouns
MET:
- Performed well with Bengali, Gujarati and Marathi
- Performed on par with the baseline for Odia
- Underperformed with Hindi

THANK YOU!!
