Marathi – Marathi Monolingual Information Retrieval Mr. Ashish Almeida Prof. Pushpak Bhattacharyya.


1 Marathi – Marathi Monolingual Information Retrieval Mr. Ashish Almeida Prof. Pushpak Bhattacharyya

2 Overview – Morphological analyzer – Suffix processing – Stop-words – Future work

3 Present work Search “भारत” – bhaarat – Bharat – Will not match pages that contain inflected forms such as भारताचा – bhaarataachaa – of Bharat, or भारतात – bhaarataat – in Bharat. Lack of a large corpus. Unavailability of tools.

4 Corpus statistics (Marathi) 99,275 documents (510 MB) – Maharashtra Times – Sakal. News from April 2004 to September 2007. UTF-8 encoding. XML tags – DOC: document – DOCNO: document identifier – TEXT: article
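A minimal sketch of reading one such document, assuming that DOCNO and TEXT are nested directly inside DOC (the slide lists the tags but not their nesting). The sample content and the helper name `parse_doc` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the tag names come from the slide.
SAMPLE = """<DOC>
<DOCNO>example-0001</DOCNO>
<TEXT>भारतात पाऊस</TEXT>
</DOC>"""


def parse_doc(xml_string):
    """Return (document identifier, article text) for one DOC element."""
    root = ET.fromstring(xml_string)
    docno = root.findtext("DOCNO", default="").strip()
    text = root.findtext("TEXT", default="").strip()
    return docno, text


docno, text = parse_doc(SAMPLE)
print(docno, text)
```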

5 Document: example MaharashtraC06E811C6B.htm.txt मोहफूल वेचण्यास गेलेल्या तरुणावर बिबट्याचा हल्ला (a leopard attacks a young man who had gone to collect Moha flowers) इस्लापूर, ता. २२ - चारोळी आणि मोहफूल वेचण्यासाठी जंगलात गेलेल्या एका आदिवासी तरुणावर बिबट्याने अचानक हल्ला केल्याने तो तरुण गंभीर जखमी झाला आहे. ही घटना शुक्रवारी ( ता. २० ) मुळझरा ( ता. किनवट ) या गावाच्या जंगलात घडली........ इस्लापूर वन परिक्षेत्र कार्यालयाअंतर्गत येणाऱ्या मुळझरा येथील आदिवासी तरुण मनोहर...

6 Topics 100 topics, aligned with English topics. XML tags – num: query identifier – title: title of the query – desc: description – narr: additional information about the query. Cover all issues – local, international

7 Topic example 1 ट्वेंटी - २० विश्वचषकातील भारताचे क्रीडापटुत्व (India’s championship in the Twenty-20 World Cup) पहिल्या आयसीसी विश्व ट्वेंटी - २० सर्वोत्कृष्ट - विजेता - स्पर्धेतील भारताच्या विजयाचे वृत्त देणारा लेख शोधा. ट्वेंटी - २० विश्चचषक स्पर्धेमधील पाकिस्तान विरूद्ध भारताचा विजय, ह्या ऐतिहासिक विजया निमित्त खेळाडूंनी केलेले विक्रम त्यांनी मिळविलेली बक्षिसे व पुरस्काराची रक्कम सामनावीराचे तसेच मालिकावीराचे नाव, माजी खेळाडूंनी आणि जगभरातील लोकांनी केलेली प्रशंसा यासंदर्भात आम्ही उचित माहिती मिळवत आहोत.

8 Tools Terrier – Open source IR system – Models: TF-IDF (vector space model) and DFR-BM25 (probabilistic) – Both models are available in Terrier. Evaluation against relevance-judged documents for 25 queries

9 Lemmatizer vs. stemmer – भारताला – bhaarataalaa – for Bharat – भारताचा – bhaarataachaa – of Bharat – भारतात – bhaarataat – in Bharat – भारतावर – bhaarataavar – on Bharat. Lemmatizer finds the lemma – भारत. Stemmer finds the stem, the longest unchanging word prefix – भारता
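The contrast on this slide can be sketched as follows. The stem is approximated here as the longest common prefix of the four listed forms, which is only an illustration of "longest unchanging prefix" (a real stemmer strips suffixes by rule and sees one word at a time). The lemma lookup table is a hypothetical stand-in for a morphological analyzer.

```python
FORMS = ["भारताला", "भारताचा", "भारतात", "भारतावर"]


def longest_common_prefix(words):
    """Return the longest prefix shared by every word (compared per code point)."""
    prefix = []
    for chars in zip(*words):
        if len(set(chars)) != 1:
            break
        prefix.append(chars[0])
    return "".join(prefix)


# Hypothetical lemma lookup; a real lemmatizer would derive this from morphology rules.
LEMMA_TABLE = {form: "भारत" for form in FORMS}


def lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)


stem = longest_common_prefix(FORMS)
lemmas = {lemmatize(form) for form in FORMS}
print(stem, lemmas)
```

The stem keeps the oblique vowel (भारता), so it does not equal the query word भारत, whereas the lemma does.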

10 Marathi suffixes Suffixes include case markers, postposition markers, etc. A suffix may be attached after another suffix. Example: – घरासमोरचादेखिल – घरा - समोर - चा - देखिल – gharaa-samor-chaa-dekhil – house-front-of-also – Root word: घर (ghar) (house)
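A rough sketch of how stacked suffixes could be peeled off from the right, using the example word from this slide. The three-entry suffix list and the oblique-form table are hypothetical and cover only this example; this is not the analyzer used in the presentation.

```python
# Hypothetical suffix inventory, just enough for the slide's example.
SUFFIXES = ["देखिल", "समोर", "चा"]
# Hypothetical mapping from oblique stem to root word.
OBLIQUE_TO_ROOT = {"घरा": "घर"}


def strip_suffixes(word, suffixes=SUFFIXES):
    """Repeatedly remove a known suffix from the end of the word.

    Returns the remaining stem and the removed suffixes in surface order.
    """
    stripped = []
    ordered = sorted(suffixes, key=len, reverse=True)  # try longer suffixes first
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            # Never strip the whole word away.
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)]
                stripped.insert(0, suffix)
                changed = True
                break
    return word, stripped


stem, suffixes = strip_suffixes("घरासमोरचादेखिल")
root = OBLIQUE_TO_ROOT.get(stem, stem)
print(stem, suffixes, root)
```

Stripping alone stops at the oblique stem घरा; reaching the root घर needs morphological knowledge, represented here by the lookup table.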

11 Morphological analyzer Uses a Marathi morphological analyzer – Better matching of word forms, e.g. राम versus रामा. The analyzer gives all possible roots – the first root, which is the most frequent, is selected. Used at both the indexing end and the query-processing end
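The idea of applying the same normalization at indexing time and at query time can be sketched with a toy inverted index. The `ANALYSES` table is a hypothetical stand-in for the analyzer's output (candidate roots, most frequent first), and the three documents are made up for illustration.

```python
from collections import defaultdict

# Hypothetical analyzer output: surface form -> candidate roots, most frequent first.
ANALYSES = {
    "भारताचा": ["भारत"],
    "भारतात": ["भारत"],
    "रामा": ["राम", "रामा"],
}


def normalize(token):
    """Pick the first candidate root; fall back to the token itself."""
    roots = ANALYSES.get(token)
    return roots[0] if roots else token


def build_index(docs):
    """Map each normalized term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query):
    """Normalize query terms the same way and return matching document ids."""
    results = set()
    for token in query.split():
        results |= index.get(normalize(token), set())
    return results


DOCS = {"d1": "भारतात पाऊस", "d2": "भारताचा विजय", "d3": "रामा आला"}
INDEX = build_index(DOCS)
print(search(INDEX, "भारत"))
```

With normalization on both sides, the query भारत reaches documents that contain only भारतात or भारताचा.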

12 Lemmatizer results
Run                                        MAP     R-precision  Precision at 5  Precision at 10  Recall
TF-IDF without lemmatizer                  0.3366  0.2944       0.3167          0.2583           0.8724
TF-IDF + lemmatizer                        0.4003  0.3551       0.3417          0.2917           0.9686
DFR-BM25 without lemmatizer                0.3455  0.3209       0.3500          0.2667           0.8744
DFR-BM25 + lemmatizer                      0.4140  0.3686       0.3833          0.3083           0.9619
DFR-BM25 + lemmatizer (FIRE submission)    0.3625  0.3797       0.4600          0.3960           0.9178
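For readers unfamiliar with the column headings, the following sketch computes the same kinds of measures from their standard textbook definitions on a made-up ranking. The slides do not say which evaluation tool or cut-offs produced the numbers above, so this is only an illustration, and the document ids are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def recall(ranked, relevant):
    """Fraction of the relevant documents that were retrieved at all."""
    return len(set(ranked) & relevant) / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP: average precision averaged over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical ranking and relevance judgements for a single query.
RANKED = ["d1", "d2", "d3", "d4"]
RELEVANT = {"d1", "d3"}
print(average_precision(RANKED, RELEVANT), precision_at_k(RANKED, RELEVANT, 2))
```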

13 Suffixes Usually ignored; indexing of suffixes has not been studied. Index selected suffixes – Suffixes of space and time: वर – var – on; समोर – samor – in front of; मध्ये – madhye – in; नंतर – nantar – after. Created manually – a list of 66 words
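One possible reading of "index selected suffixes" is that a word contributes its lemma plus any suffix found on the whitelist as separate index terms. The sketch below assumes that reading; the slides do not describe the actual representation. Only the four suffixes shown on the slide are included, not the full 66-entry list, and `index_terms` is a hypothetical helper.

```python
# The four space/time suffixes shown on the slide; the full manual list is not reproduced.
SPACE_TIME_SUFFIXES = {"वर", "समोर", "मध्ये", "नंतर"}


def index_terms(lemma, suffixes):
    """Return the lemma followed by those suffixes that are selected for indexing."""
    terms = [lemma]
    terms.extend(s for s in suffixes if s in SPACE_TIME_SUFFIXES)
    return terms


# घरासमोरचादेखिल analysed as घर + समोर + चा + देखिल: only समोर is kept as a term.
terms = index_terms("घर", ["समोर", "चा", "देखिल"])
print(terms)
```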

14 Stop-words Most frequently occurring words, with little discriminatory value. Occur in 80% or more of the documents. Selected stop-words – ती, ते, या, ून, अस, आह, ये, हो, कर, त
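The 80% criterion can be sketched as a document-frequency cut-off. The tiny tokenized corpus below is made up, and the function name and default threshold parameter are assumptions for illustration.

```python
from collections import Counter


def find_stop_words(docs, threshold=0.8):
    """Return terms that occur in at least `threshold` of the documents.

    `docs` is a list of token lists; each term is counted once per document.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    return {term for term, count in doc_freq.items()
            if count / len(docs) >= threshold}


# Hypothetical corpus of five already-tokenized documents.
DOCS = [
    ["ते", "भारत", "आह"],
    ["ते", "घर", "आह"],
    ["ते", "पाऊस", "आह"],
    ["ते", "विजय", "आह"],
    ["ते", "राम"],
]
stop_words = find_stop_words(DOCS)
print(stop_words)
```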

15 Results: suffix indexing and stop-words
Run                                                       MAP     R-precision  Precision at 5  Precision at 10  Recall
DFR-BM25 + lemmatization + suffix indexing                0.4381  0.3846       0.3917          0.3167           0.97085
DFR-BM25 + lemmatization + suffix indexing + stop-words   0.4433  0.3798       0.4000          0.3208           0.9731

16 P-R graph Precision-recall graph for all four cases [figure not included in this transcript]

17 Future work Morphological analyzer – Accuracy is 94.5%; needs to be improved. Heuristic suffix stripping for unknown words. Handle derivational morphology. Handle spelling variations and common spelling mistakes

18 Acknowledgement “Cross Lingual Information Access” Project. Maharashtra Times: Times Media Group – http://in.indiatimes.com/aboutus.cms. Sakal: Sakal Media Group – http://www.sakaal.in/

19 References – Terrier: http://ir.dcs.gla.ac.uk/terrier/ – Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval – Jacques Savoy, Searching Strategies for the Bulgarian Language – Morphological Analyzer, CFILT

20 Thank you

