1 ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task
Avinash Yadav, Robins Yadav, Sukomal Pal
Department of Computer Science & Engineering, Indian School of Mines, Dhanbad, India

2 Contents
- Introduction
- Adhoc retrieval task participation
- Morpheme Extraction Task participation
- Conclusion

3 Introduction
- Stemmer
- ISMstemmer
- Evaluation

4 Stemmer
A stemmer attempts to reduce the variants of a word to their common stem or root form.
Example: education, educating and educative all reduce to educat.
Approaches to stemming:
- Language-based approach
- Statistical approach

5 ISMstemmer
- statistical stemmer
- based on suffix extraction
- suffix frequency algorithm

6 Data Preprocessing
1. Convert the corpus into a single file: File 1, File 2, …, File n → Single File
2. Clean the data:
   Before: John asked a girl with an apple of Kashmir, “ do you have the time ”. She said, “ yes ”.
   After: John asked a girl with an apple of Kashmir do you have the time she said yes
3. Remove stop words:
   Before: John asked a girl with an apple of Kashmir do you have the time she said yes
   After: John asked girl with apple Kashmir you time she said yes
4. Convert the file into a single column (one word per line).
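The cleaning and stop-word steps above can be sketched in Python as follows. This is only an illustration under assumptions: the slides do not give the stop-word list or the cleaning rules, so `STOP_WORDS` and the regular expression below are hypothetical and were chosen to reproduce the English example. Unlike the slide, the sketch does not lowercase "She".

```python
import re

# Hypothetical stop-word list, picked so the slide's example is reproduced.
STOP_WORDS = {"a", "an", "of", "do", "have", "the"}

def preprocess(text, stop_words=STOP_WORDS):
    """Strip punctuation and digits, drop stop words, return a list of tokens."""
    # Replace punctuation, digits and underscores with spaces.
    cleaned = re.sub(r"[^\w\s]|\d|_", " ", text)
    tokens = cleaned.split()
    # Stop-word matching is case-insensitive; the tokens keep their case.
    return [t for t in tokens if t.lower() not in stop_words]

sentence = ("John asked a girl with an apple of Kashmir, "
            "“ do you have the time ”. She said, “ yes ”.")
column = preprocess(sentence)

# One token per line corresponds to the "single column" file.
print("\n".join(column))
```

Taking `set(column)` over a whole corpus would give the unique-word list used by the later steps.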

7 Data preprocessing (contd.)
Unique words extracted:
- Hindi: 4,90,391
- English: 7,95,144

8 Find valid suffixes
1. Reverse the words of the single-column file:
   Words: aborning absolution absorption abuilding acquisition activation added addition admiration admitted admitting agreed agreeing allotted allotting ambling angling
   Reversed: gninroba noitulosba noitprosba gnidliuba noitisiuqca noitavitca dedda noitidda noitarimda dettimda gnittimda deerga gnieerga dettolla gnittolla gnilbma gnilgna
2. Sort the reversed list:
   dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
3. Find suffixes according to the threshold:
   Reversed suffixes identified in the example: de, gni, noit
   Frequency labels shown on the slide: 17%, 40%
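A minimal sketch of the reverse, sort and threshold idea follows. The slides do not state the exact counting rule, so this is an assumption rather than the ISMstemmer implementation: it counts every suffix between `min_len` and `max_len` characters (both hypothetical bounds) and keeps those whose share of the vocabulary reaches `threshold`.

```python
from collections import Counter

def find_suffixes(words, threshold, min_len=2, max_len=4):
    """Return {suffix: share of words} for suffixes at or above threshold.

    threshold is a fraction of the vocabulary size (0.2 means 20%).
    min_len and max_len are assumed bounds, not values from the slides.
    """
    # Reversing and sorting places words with a common ending side by side.
    reversed_sorted = sorted(w[::-1] for w in set(words))
    counts = Counter()
    for rev in reversed_sorted:
        # A prefix of the reversed word is a suffix of the original word.
        # The upper bound keeps at least one character for the stem.
        for n in range(min_len, min(max_len, len(rev) - 1) + 1):
            counts[rev[:n]] += 1
    total = len(reversed_sorted)
    return {rev[::-1]: c / total
            for rev, c in counts.items() if c / total >= threshold}

WORDS = ["aborning", "absolution", "absorption", "abuilding", "acquisition",
         "activation", "added", "addition", "admiration", "admitted",
         "admitting", "agreed", "agreeing", "allotted", "allotting",
         "ambling", "angling"]

for suffix, share in sorted(find_suffixes(WORDS, 0.2).items()):
    print(f"{suffix}: {share:.0%}")
```

On this 17-word sample, nested suffixes such as "ng" and "ing" pass the threshold together; how the stemmer chooses among them is not described on the slides, so the sketch leaves that choice open.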

9 Threshold used
- English: 0.01% to 0.1%
- Hindi: 0.1% to 1.0%

10 Stemming of corpus
1. Stem the reversed words with the reversed valid suffixes:
   Reversed words: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
   After stemming: dda erga ttimda ttolla dliuba eerga lbma lgna nroba ttimda ttolla arimda avitca idda isiuqca prosba ulosba
2. Reverse the stemmed words to restore the original character order:
   add agre admitt allott abuild agree ambl angl aborn admitt allott admira activa addi acquisi absorp absolu

11 Note: If stemming would leave a word with fewer than 3 letters, the word is not stemmed.
Example: aging and king would become ag and k, so they are left unchanged.
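The stemming step and the minimum-length rule can be sketched as below. The slides strip reversed suffixes from reversed words; the sketch uses `str.endswith` on the original word, which gives the same result. Trying longer suffixes first is an assumption, and the suffix list is taken from the worked example.

```python
def stem(word, suffixes, min_stem_len=3):
    """Strip the longest matching suffix unless the stem would be too short."""
    # Longest-first matching is an assumption; the slides give no order.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            # Rule from the slide: stems shorter than 3 letters are rejected.
            if len(candidate) >= min_stem_len:
                return candidate
            return word
    return word

SUFFIXES = ["ed", "ing", "tion"]  # suffixes from the worked example

for w in ["added", "agreed", "agreeing", "admiration", "aging", "king"]:
    print(w, "->", stem(w, SUFFIXES))
```

With this suffix list, "aging" and "king" come back unchanged because their candidate stems "ag" and "k" are shorter than three letters.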

12 Evaluation of ISMstemmer
To evaluate ISMstemmer, we participated in:
1. Monolingual Adhoc retrieval task in English and Hindi
2. Morpheme Extraction Task (MET) of FIRE 2012

13 Adhoc Retrieval Task (ART) Participation
Monolingual task
Languages chosen:
- English
  - Approach
  - Results
- Hindi
  - Approach
  - Results

14 ART: English
Approach:
- Indexing:
  - Search engine used: Indri (IndriBuildIndex)
- Retrieval:
  - Search engine used: Lemur (RetEval)
- Data provided:
  - Corpus from The Telegraph and BD News
  - 50 query set

15 ART: English (contd.)
Results:
Run id                  No. of queries  No. of results  No. of relevant docs  No. of rel. docs ret.  MAP value
EE.ism.unstemmed        50              50000           3539                  2503                   0.2264
EE.ism.krovetzstemmer   50              50000           3539                  2504                   0.2255
EE.ism.ismstemmer       50              50000           3539                  2415                   0.2096
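For readers unfamiliar with the MAP column, the following sketch computes mean average precision from ranked result lists using the standard definition. It is not the evaluation tool used for the official runs; the function names and the dictionary layout of `runs` and `qrels` are assumptions made for illustration.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document retrieved.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)

def mean_average_precision(runs, qrels):
    """runs: {query_id: ranked doc ids}; qrels: {query_id: relevant doc ids}."""
    scores = [average_precision(docs, qrels.get(q, ())) for q, docs in runs.items()]
    return sum(scores) / len(scores)

# Toy data, not taken from the FIRE collections.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d4"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d4"}}
print(round(mean_average_precision(runs, qrels), 4))
```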

16 ART: Hindi
Approach:
- Indexing:
  - Search engine used: Indri (IndriBuildIndex)
- Retrieval:
  - Search engine used: Indri (IndriRunQuery)
- Data provided:
  - Corpus from Navbharat Times and Amar Ujala
  - 50 query set

17 ART: Hindi (contd.)
Results:
Run id                             No. of queries  No. of results  No. of relevant docs  No. of rel. docs ret.  MAP value
HH.ism.unstemmed.indri             50              50000           2309                  222                    0.0173
HH.stemmmedcorpus.unstemmedquery   50              50000           2309                  98                     0.0026
HH.stemmmedcorpus.stemmedquery     50              50000           2309                  209                    0.0137

18 Morpheme Extraction Task Participation
- Tool submitted
- Results

19 MET Tool Submission
- ISMstemmer was submitted as the tool.
- It was evaluated at IR Labs, DAIICT, Gujarat.
- It was tested on 6 languages of South Asian origin.
- It gave good results (MAP above the baseline) for 3 of these languages.

20 MET Results
1. BENGALI
Institute      Language  MAP obtained
Baseline       Bengali   0.2740
JU             Bengali   0.3307
DCU            Bengali   0.3300
IIT-KGP        Bengali   0.3225
CVPR-Team1     Bengali   0.3159
ISM            Bengali   0.3103
CVPR-Team2 +   Bengali   NA

21 MET Results (contd.)
2. GUJARATI
Institute   Language  MAP obtained
Baseline    Gujarati  0.2677
ISM         Gujarati  0.2824
3. MARATHI
Institute   Language  MAP obtained
Baseline    Marathi   0.2320
ISM         Marathi   0.2797
IIT-B       Marathi   0.2684

22 MET Results (contd.)
4. ODIA
Institute   Language  MAP obtained
Baseline    Odia      0.1537
IIIT-Bh     Odia      0.1537
ISM         Odia      0.1537
5. HINDI
Institute   Language  MAP obtained
Baseline    Hindi     0.2821
DCU         Hindi     0.2963
ISM         Hindi     0.2793
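To compare the ISM runs with the baselines across languages, the relative change in MAP can be computed directly from the figures reported in the MET tables. The short script below does only this arithmetic; the dictionary name and layout are illustrative.

```python
# (baseline MAP, ISM MAP) as reported in the MET result tables.
MET_MAP = {
    "Bengali": (0.2740, 0.3103),
    "Gujarati": (0.2677, 0.2824),
    "Marathi": (0.2320, 0.2797),
    "Odia": (0.1537, 0.1537),
    "Hindi": (0.2821, 0.2793),
}

def relative_change(baseline, score):
    """Relative change of score with respect to baseline, as a fraction."""
    return (score - baseline) / baseline

for language, (baseline, ism) in MET_MAP.items():
    change = relative_change(baseline, ism)
    print(f"{language:9s} baseline={baseline:.4f} ISM={ism:.4f} change={change:+.2%}")
```

The sign of the change separates the three languages where ISM beats the baseline (Bengali, Gujarati, Marathi) from Odia, where it is equal, and Hindi, where it is slightly lower.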

23 MET Results (contd.)
6. TAMIL
Institute   Language  MAP obtained
Baseline    Tamil     NA
AUCEG       Tamil     NA
ISM         Tamil     NA
NA: results are not available, due to non-availability of qrels.

24 Reasons for Underperformance with Hindi
- overstemming
- undesired stemming of proper nouns

25 Overstemming
Overstemming occurs when words that should not be grouped together are reduced to the same stem.
Example:
1. accent, accentual, accentuate: stem word accent
2. accept, acceptant, acceptor: stem word accept
3. access, accessible, accession: stem word access
With overstemming, all three groups may collapse into the single wrong stem acce.
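The effect can be made concrete by grouping words under the stem each one receives. In the sketch below, `truncating_stemmer` is a hypothetical, deliberately over-aggressive stemmer (it simply keeps the first four letters) and is not ISMstemmer; it only shows how three unrelated word families end up in one class.

```python
from collections import defaultdict

def conflation_classes(words, stemmer):
    """Group words by the stem that the given stemmer assigns to them."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer(word)].append(word)
    return dict(groups)

def truncating_stemmer(word, keep=4):
    # Hypothetical over-aggressive stemmer used only for illustration.
    return word[:keep]

WORDS = ["accent", "accentual", "accentuate",
         "accept", "acceptant", "acceptor",
         "access", "accessible", "accession"]

classes = conflation_classes(WORDS, truncating_stemmer)
for stem_form, members in classes.items():
    print(stem_form, "<-", ", ".join(members))
```

A stemmer that behaved well on this sample would produce three classes (accent, accept, access) instead of one.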

26 Undesired stemming of proper nouns
Proper nouns should not be stemmed, because they are not inflected.
Example: Beijing gets stemmed to Beij.

27 Conclusion
ART:
- English: not satisfactory
- Hindi: poor
- Reasons:
  - overstemming
  - undesired stemming of proper nouns
MET:
- performed well (above the baseline) with Bengali, Gujarati and Marathi
- matched the baseline with Odia
- underperformed with Hindi


30 THANK YOU!!

