Published by Janice Williams. Modified over 9 years ago.
1
Marathi – Marathi Monolingual Information Retrieval
Mr. Ashish Almeida
Prof. Pushpak Bhattacharyya
2
Overview
– Morphological analyzer
– Suffix processing
– Stop-words
– Future work
3
Present work
A search for “भारत” (bhaarat, Bharat) will not match pages that contain only inflected forms such as:
– भारताचा – bhaarataachaa – of Bharat
– भारतात – bhaarataat – in Bharat
Lack of a large corpus
Unavailability of tools
4
Corpus statistics – Marathi
99,275 documents (510 MB)
– Maharashtra Times
– Sakal
News from April 2004 to September 2007
UTF-8 encoding
XML tags
– DOC – document
– DOCNO – document identifier
– TEXT – article
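The tag layout above can be read with a few lines of Python. This is a minimal sketch, assuming each document is wrapped as <DOC>…</DOC> with one <DOCNO> and one <TEXT> element inside; the sample string, the document identifiers, and the function name parse_docs are illustrative and are not taken from the corpus.

```python
import re

# Hypothetical sample in the assumed DOC/DOCNO/TEXT layout (not real corpus data).
SAMPLE = """<DOC>
<DOCNO>doc-001</DOCNO>
<TEXT>भारतात पाऊस</TEXT>
</DOC>
<DOC>
<DOCNO>doc-002</DOCNO>
<TEXT>भारताचा विजय</TEXT>
</DOC>"""

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def parse_docs(raw):
    """Return a dict mapping each DOCNO to its TEXT content."""
    docs = {}
    for match in DOC_RE.finditer(raw):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        text = TEXT_RE.search(body)
        # Skip records that lack either field instead of guessing.
        if docno and text:
            docs[docno.group(1).strip()] = text.group(1).strip()
    return docs


parsed = parse_docs(SAMPLE)
```

A regular expression is used here rather than an XML parser because collections in this style are often a concatenation of records without a single root element; whether that holds for this corpus is an assumption.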
5
Document: example
DOCNO: MaharashtraC06E811C6B.htm.txt
TEXT: मोहफूल वेचण्यास गेलेल्या तरुणावर बिबट्याचा हल्ला (A leopard attacks a young man who had gone to gather moha flowers)
इस्लापूर, ता. २२ - चारोळी आणि मोहफूल वेचण्यासाठी जंगलात गेलेल्या एका आदिवासी तरुणावर बिबट्याने अचानक हल्ला केल्याने तो तरुण गंभीर जखमी झाला आहे. ही घटना शुक्रवारी ( ता. २० ) मुळझरा ( ता. किनवट ) या गावाच्या जंगलात घडली........ इस्लापूर वन परिक्षेत्र कार्यालयाअंतर्गत येणाऱ्या मुळझरा येथील आदिवासी तरुण मनोहर...
6
Topics
100 topics, aligned with the English topics
XML tags
– num: query identifier
– title: title of the query
– desc: description
– narr: additional information about the query
Cover all kinds of issues, local and international
7
Topic example
num: 1
title: ट्वेंटी - २० विश्वचषकातील भारताचे क्रीडापटुत्व (India’s championship in the Twenty-20 World Cup)
desc: पहिल्या आयसीसी विश्व ट्वेंटी - २० सर्वोत्कृष्ट - विजेता - स्पर्धेतील भारताच्या विजयाचे वृत्त देणारा लेख शोधा.
narr: ट्वेंटी - २० विश्चचषक स्पर्धेमधील पाकिस्तान विरूद्ध भारताचा विजय, ह्या ऐतिहासिक विजया निमित्त खेळाडूंनी केलेले विक्रम त्यांनी मिळविलेली बक्षिसे व पुरस्काराची रक्कम सामनावीराचे तसेच मालिकावीराचे नाव, माजी खेळाडूंनी आणि जगभरातील लोकांनी केलेली प्रशंसा यासंदर्भात आम्ही उचित माहिती मिळवत आहोत.
8
Tools
Terrier – open source IR system
Models (both available in Terrier)
– TF-IDF (vector space model)
– DFR-BM25 (probabilistic)
Evaluation against relevance-judged documents for 25 queries
9
Lemmatizer vs. stemmer
– भारताला – bhaarataalaa – for Bharat
– भारताचा – bhaarataachaa – of Bharat
– भारतात – bhaarataat – in Bharat
– भारतावर – bhaarataavar – on Bharat
Lemmatizer finds the lemma: भारत
Stemmer finds the stem, the longest unchanging word prefix: भारता
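The stem in this example can be reproduced by taking the longest prefix shared by the four inflected forms. The sketch below is only an illustration of that definition; the function name is made up here, the comparison is by Unicode code point rather than by Devanagari syllable, and a real stemmer would not have all forms of a word available at once.

```python
def longest_common_prefix(words):
    """Return the longest string that every word in `words` starts with."""
    if not words:
        return ""
    prefix = words[0]
    for word in words[1:]:
        i = 0
        # Advance while both strings agree, code point by code point.
        while i < min(len(prefix), len(word)) and prefix[i] == word[i]:
            i += 1
        prefix = prefix[:i]
    return prefix


# The four inflected forms listed on the slide.
FORMS = ["भारताला", "भारताचा", "भारतात", "भारतावर"]
stem = longest_common_prefix(FORMS)
```

The result keeps the oblique vowel sign (भारता), which is why a query for the lemma भारत still would not match a stem-based index term exactly.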
10
Marathi suffixes
Suffixes include case markers, postposition markers, etc.
A suffix may attach after another suffix
Example:
– घरासमोरचादेखिल
– घरा - समोर - चा - देखिल
– gharaa-samor-chaa-dekhil
– house-front-of-also
– Root word: घर (ghar, house)
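Stacked suffixes like these can be peeled off from the right, one at a time. The following is a toy sketch, not the analyzer used in this work: the suffix list contains only the three suffixes from the example, and the oblique-to-root table is a hypothetical stand-in for real morphological rules.

```python
# Hypothetical mini suffix list, taken from the example on this slide only.
SUFFIXES = ["देखिल", "समोर", "चा"]
# Toy mapping from oblique stem to root; a real analyzer would derive this by rule.
OBLIQUE_TO_ROOT = {"घरा": "घर"}


def strip_suffixes(word):
    """Repeatedly strip known suffixes from the end of `word`.

    Returns (root, oblique_stem, suffixes_in_surface_order).
    """
    stripped = []
    changed = True
    while changed:
        changed = False
        # Try longer suffixes first so a short one cannot pre-empt a long one.
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)]
                stripped.insert(0, suffix)
                changed = True
                break
    return OBLIQUE_TO_ROOT.get(word, word), word, stripped


root, oblique, suffixes = strip_suffixes("घरासमोरचादेखिल")
```

The final step from घरा to घर is done by table lookup because the change from oblique form to root is not a plain suffix removal in general.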
11
Morphological analyzer
Use of a Marathi morphological analyzer
– Better matching of words, e.g. राम versus रामा
Gives all possible roots
– Selects the first root, which is the most frequent
Used at both the indexing and the query-processing end
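The effect of applying the same normalisation at both ends can be sketched as follows. The ANALYSES table is a hypothetical stand-in for the analyzer's output (candidate roots, most frequent first); its entries and the function names are assumptions made for illustration.

```python
# Hypothetical analyzer output: surface form -> candidate roots, most frequent first.
ANALYSES = {
    "भारताचा": ["भारत"],
    "भारतात": ["भारत"],
    "रामा": ["राम", "रामा"],
}


def lemmatize(token):
    """Pick the first candidate root; leave unknown tokens unchanged."""
    roots = ANALYSES.get(token)
    return roots[0] if roots else token


def normalise(tokens):
    """Apply the same lemmatization used for both documents and queries."""
    return [lemmatize(token) for token in tokens]


# Indexing side: the document contains only an inflected form.
doc_terms = set(normalise(["भारताचा", "विजय"]))
# Query side: the user types the bare lemma.
query_terms = normalise(["भारत"])
matched = [term for term in query_terms if term in doc_terms]
```

Because both sides pass through normalise, the query भारत now matches a document that contains only भारताचा.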
12
Lemmatizer results
Run                                        MAP     R-precision  Precision at 5  Precision at 10  Recall
TF-IDF without lemmatizer                  0.3366  0.2944       0.3167          0.2583           0.8724
TF-IDF + lemmatizer                        0.4003  0.3551       0.3417          0.2917           0.9686
DFR-BM25 without lemmatizer                0.3455  0.3209       0.3500          0.2667           0.8744
DFR-BM25 + lemmatizer                      0.4140  0.3686       0.3833          0.3083           0.9619
DFR-BM25 + lemmatizer (FIRE submission)    0.3625  0.3797       0.4600          0.3960           0.9178
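For readers unfamiliar with the measures in the table above, the sketch below computes them for a single query using their standard textbook definitions. The slides do not say which evaluation tool or exact definitions were used, so this is an assumption; the ranking and the relevance set are invented toy data.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(ranking, relevant, len(relevant))


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def recall(ranking, relevant):
    """Fraction of all relevant documents that were retrieved."""
    return len(set(ranking) & relevant) / len(relevant)


def mean_average_precision(runs):
    """MAP: average of average_precision over (ranking, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data for one query: five retrieved documents, three relevant ones.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
```

In this toy query two of the three relevant documents are retrieved, at ranks 1 and 3, so recall and R-precision are both 2/3 and precision at 5 is 2/5.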
13
Suffixes
Usually ignored
Indexing of suffixes has not been studied
Index selected suffixes
– Suffixes of space and time
  वर – var – on
  समोर – samor – in front of
  मध्ये – madhye – in
  नंतर – nantar – after
Created manually: a list of 66 words
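One way to index selected suffixes is to emit them as extra index terms next to the root. The slides do not describe how the suffix terms were actually stored, so the sketch below is an assumption; it also assumes the analyzer returns (root, suffixes) pairs, and it lists only the four suffixes shown above rather than the full 66-entry list.

```python
# Hypothetical subset of the manually created list: only the four suffixes on the slide.
INDEXED_SUFFIXES = {"वर", "समोर", "मध्ये", "नंतर"}


def index_terms(analysed_tokens):
    """Turn (root, suffixes) pairs into a flat list of index terms.

    Roots are always kept; a suffix is kept only if it is on the selected list.
    """
    terms = []
    for root, suffixes in analysed_tokens:
        terms.append(root)
        terms.extend(s for s in suffixes if s in INDEXED_SUFFIXES)
    return terms


# Toy analyzer output for three words.
analysed = [("घर", ["समोर", "चा"]), ("भारत", ["वर"]), ("राम", [])]
terms = index_terms(analysed)
```

Here the spatial suffixes समोर and वर become searchable terms, while the case marker चा is dropped as before.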
14
Stop-words
Most frequently occurring words
Little discriminatory value
Occur in 80% or more of the documents
Selected stop-words
– ती, ते, या, ून, अस, आह, ये, हो, कर, त
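The 80% criterion above amounts to a document-frequency threshold. A minimal sketch, assuming documents are already tokenized and lemmatized; the five toy documents and the function name are invented for illustration.

```python
from collections import Counter


def select_stop_words(docs, threshold=0.8):
    """Return terms that occur in at least `threshold` of the documents."""
    doc_freq = Counter()
    for tokens in docs:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    return {term for term, count in doc_freq.items() if count / n_docs >= threshold}


# Toy collection of five tokenized documents.
toy_docs = [
    ["ते", "भारत", "आह"],
    ["ते", "घर", "आह"],
    ["ते", "राम", "कर"],
    ["ते", "भारत", "आह"],
    ["ते", "घर", "आह"],
]
stop_words = select_stop_words(toy_docs)
```

In the toy collection ते occurs in all five documents and आह in four of five, so both reach the 80% threshold; every other term falls below it.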
15
Results: suffix indexing and stop-words
Run                                                      MAP     R-precision  Precision at 5  Precision at 10  Recall
DFR-BM25 + lemmatization + suffix indexing               0.4381  0.3846       0.3917          0.3167           0.97085
DFR-BM25 + lemmatization + suffix indexing + stop-words  0.4433  0.3798       0.4000          0.3208           0.9731
16
P-R graph
Precision-recall graph for all four cases (graph not reproduced in this transcript)
17
Future work
Morphological analyzer
– Accuracy is 94.5%; needs to be improved
Heuristic suffix stripping for unknown words
Handle derivational morphology
Handle spelling variations and common spelling mistakes
18
Acknowledgement
“Cross Lingual Information Access” project
Maharashtra Times: Times Media Group – http://in.indiatimes.com/aboutus.cms
Sakal: Sakal Media Group – http://www.sakaal.in/
19
References
– Terrier: http://ir.dcs.gla.ac.uk/terrier/
– Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval
– Jacques Savoy, Searching Strategies for the Bulgarian Language
– Morphological Analyzer, CFILT
20
Thank you