Published by Raymond Manning. Modified over 9 years ago.
1
ISM@FIRE MET-2013
Amit Jain, Nitish Gupta, Sukomal Pal
Indian School of Mines, Dhanbad
2
Contents
- Introduction to Morpheme
- ISMStemmer
- Result of MET at FIRE-2013
- Problems in ISMStemmer
- Conclusion
3
Morpheme
In linguistics, a morpheme is the smallest grammatical unit in a language. Every word comprises one or more morphemes. Morphological analysis is the process of segmenting a word into its component morphemes.
e.g. "Unbreakable" comprises three morphemes:
- un- (a morpheme signifying "not")
- -break- (the stem, a free morpheme)
- -able (a morpheme signifying "can be done")
4
Stemmer
Attempts to reduce the variants of a word to their common stem or root form.
Example: education, educating, educative will all reduce to educat.
Reasons:
- search engines are based on string matching
- similarity of a document with respect to a query is mostly determined by exact term overlap
- vocabulary mismatch arises, as natural language documents use different forms of a word for the same content
5
Why stemming? (contd…)
Example: suppose we have to search for information about "education".
Query: education
Documents (doc 1 to doc 4):
- For children education is very important
- What is the reason we educate children
- Educating young minds is the job of a teacher
- Government aims to make people educated
6
Why stemming? (contd…)
Query: education
Documents (doc 1 to doc 4):
- For children education is very important
- What is the reason we educate children
- Educating young minds is the job of a teacher
- Government aims to make people educated
By stemming:
- Original words: education, educate
- Stemmed word: educat
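The effect on retrieval can be sketched in a few lines of Python. This is only an illustration: `toy_stem` and its suffix list are hypothetical stand-ins chosen so that the four example words collapse to educat. They are not ISMStemmer or its learned suffixes.

```python
import re

# Hypothetical toy suffix list, chosen only so that the slide's example
# words (education, educate, educating, educated) share the stem "educat".
TOY_SUFFIXES = ("ion", "ing", "ed", "e")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching toy suffix, keeping at least min_stem_len letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def tokens(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def search(query, docs, stem=False):
    """Return indices of documents sharing at least one (optionally stemmed) term."""
    normalise = toy_stem if stem else (lambda w: w)
    query_terms = {normalise(t) for t in tokens(query)}
    return [
        i for i, doc in enumerate(docs)
        if query_terms & {normalise(t) for t in tokens(doc)}
    ]


DOCS = [
    "For children education is very important",
    "What is the reason we educate children",
    "Educating young minds is the job of a teacher",
    "Government aims to make people educated",
]

exact_hits = search("education", DOCS)               # exact term overlap only
stemmed_hits = search("education", DOCS, stem=True)  # overlap on stems
print(exact_hits, stemmed_hits)
```

With exact matching only the document containing the literal term "education" is retrieved. After stemming both the query and the documents, all four match.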
7
ISMStemmer
Approaches for stemming:
- Language-based approach
- Statistical approach
ISMStemmer is statistical:
- Based on suffix extraction
- Suffixes identified by applying the Apriori algorithm (Agrawal and Srikant, 1994)
8
ISMStemmer algorithm
1. Single-column refined file
2. Generate valid suffixes (Apriori algorithm)
3. Strip off valid suffixes to get stems
Input words: aborning, absolution, absorption, abuilding, acquisition, activation, added, addition, admiration, admitted, admitting, agreed, agreeing, allotted, allotting, ambling, angling
Stems: aborn, absolu, absorp, abuild, acquisi, activa, add, admira, admitt, agre, agree, allott, ambl, angl
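The stripping step can be sketched as follows. The suffix list (ing, ed, tion, er, ment) is the one shown on the Suffix Generation slide. The longest-match-first rule and the `min_stem_len` guard are assumptions made for this illustration, since the slides do not state how competing suffixes are resolved.

```python
# Valid suffixes as listed on the Suffix Generation slide.
VALID_SUFFIXES = ["ing", "ed", "tion", "er", "ment"]


def strip_suffix(word, suffixes=VALID_SUFFIXES, min_stem_len=2):
    """Remove the longest matching valid suffix (assumed rule), if the stem stays long enough."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no valid suffix: the word is its own stem


WORDS = ["aborning", "absolution", "added", "admitted", "agreed", "agreeing", "ambling"]
STEMS = {w: strip_suffix(w) for w in WORDS}
for word, stem in STEMS.items():
    print(word, "->", stem)
```

Under these assumptions the sketch reproduces stems from the slide such as aborn, absolu, add, admitt, agre, agree and ambl.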
9
Suffix Generation
- Input is the single-column sorted refined file.
- Reverse the unique sorted word file.
- Generate frequent suffixes (of length 1 character, 2 characters and so on).
- Find valid suffixes whose frequency is above a pre-decided threshold value α.
Input words: aborning, absolution, absorption, abuilding, acquisition, activation, added, addition, admiration, admitted, admitting, agreed, agreeing, allotted, allotting, ambling, angling
Reversed and sorted: dedda, dettolla, …, noitidda, noitulosba, …, gnidliuba, gnieerga, gnilgna, …
Valid suffixes: ing, ed, tion, er, ment
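A minimal sketch of the level-wise counting, assuming an Apriori-style pruning rule: a suffix of length k+1 is only counted for words whose length-k suffix was already frequent. The function name, the `max_len` cap and the threshold of 4 are hypothetical choices for this example. The slides do not give the value of α or say how shorter suffixes contained in longer ones (such as "g" in "ing") are filtered out.

```python
from collections import Counter


def frequent_suffixes(words, threshold, max_len=4):
    """Level-wise suffix counting with Apriori-style pruning (illustrative only)."""
    # Reversing turns suffixes into prefixes, as on the slide.
    candidates = sorted({w[::-1] for w in words})
    frequent = {}
    for k in range(1, max_len + 1):
        # Count length-k suffixes, requiring at least one character of stem.
        counts = Counter(rw[:k] for rw in candidates if len(rw) > k)
        level = {p: c for p, c in counts.items() if c >= threshold}
        if not level:
            break
        frequent.update({p[::-1]: c for p, c in level.items()})
        # Pruning: only words ending in a frequent length-k suffix can
        # contribute a frequent suffix of length k + 1.
        candidates = [rw for rw in candidates if rw[:k] in level]
    return frequent


WORDS = [
    "aborning", "absolution", "absorption", "abuilding", "acquisition",
    "activation", "added", "addition", "admiration", "admitted", "admitting",
    "agreed", "agreeing", "allotted", "allotting", "ambling", "angling",
]

SUFFIX_COUNTS = frequent_suffixes(WORDS, threshold=4)
print(SUFFIX_COUNTS)
```

On the slide's 17 words with a threshold of 4, the suffixes ing, ed and tion survive, while rarer candidates such as ted are dropped.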
10
Evaluation of ISMStemmer
For evaluation, ISMStemmer participated in the Morpheme Extraction Task (MET) of FIRE-2013.
The submitted ISMStemmer was:
- evaluated at IR Labs, DAIICT, Gujarat
- tested on 5 languages of South Asian origin
- found to give good results for 3 of the languages
11
MET Results (IR Evaluation)
Language    Baseline MAP    Obtained MAP    % improvement
Bengali     0.2740          0.3158          15.25%
Hindi       0.2821          0.2793          -0.99%
Gujarati    0.2677          0.2824          5.49%
Marathi     0.2320          0.2797          20.56%
Odia        0.1537          0.1583          2.99%
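The percentage column is consistent with the usual relative-improvement formula, (obtained - baseline) / baseline × 100. The short check below recomputes it from the MAP values in the table. The function and variable names are illustrative, and small last-digit differences from the slide can occur because the MAP values are already rounded.

```python
def percent_improvement(baseline, obtained):
    """Relative change of the obtained MAP over the baseline MAP, in percent."""
    return (obtained - baseline) / baseline * 100.0


# (baseline MAP, obtained MAP) per language, copied from the results table.
MAP_SCORES = {
    "Bengali": (0.2740, 0.3158),
    "Hindi": (0.2821, 0.2793),
    "Gujarati": (0.2677, 0.2824),
    "Marathi": (0.2320, 0.2797),
    "Odia": (0.1537, 0.1583),
}

IMPROVEMENTS = {
    lang: percent_improvement(base, got) for lang, (base, got) in MAP_SCORES.items()
}
for lang, value in IMPROVEMENTS.items():
    print(f"{lang}: {value:+.2f}%")
```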
12
Results (Linguistic Evaluation)
Tamil:
- Precision: 80.22%; non-affixes: 80.22%
- Recall: 18.86%; non-affixes: 18.86%
- F-measure: 30.54%; non-affixes: 30.54%
Bengali:
- Precision: 60.64%; non-affixes: 60.64%
- Recall: 32.15%; non-affixes: 32.15%
- F-measure: 42.02%; non-affixes: 42.02%
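The reported F-measures agree with the standard balanced F-measure, the harmonic mean of precision and recall: F = 2PR / (P + R). The sketch below recomputes it from the precision and recall figures on this slide. That the task used exactly this balanced form is an assumption, although the numbers match it.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) per language, copied from the slide.
LINGUISTIC_SCORES = {
    "Tamil": (80.22, 18.86),
    "Bengali": (60.64, 32.15),
}

F_SCORES = {lang: f_measure(p, r) for lang, (p, r) in LINGUISTIC_SCORES.items()}
for lang, f in F_SCORES.items():
    print(f"{lang}: F-measure = {f:.2f}%")
```

The low Tamil F-measure despite high precision shows how strongly the harmonic mean is pulled toward the weaker of the two scores, here recall.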
13
Post-hoc Analysis
Overstemming:
1. accent, accentual, accentuate: expected stem accent
2. accept, acceptant, acceptor: expected stem accept
3. access, accessible, accession: expected stem access
Due to overstemming, all of these are reduced to acce.
Stemming of named entities:
1. Beijing -> Beij
14
Analysis
15
Future plan
- Need to consider the prefix as well
  - Clustering based on prefix
- Identification of NEs (use of NERs)
- ….
16
Thank you! Questions?