Published by Linette Blair. Modified over 9 years ago.
1
Mono & Cross Language Experiments on Persian Text
Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Database Research Group, School of Electrical and Computer Engineering, University of Tehran
18 Sep 2008, Persian@CLEF 2008
2
Outline
- Persian Language
- Persian Test Collections
- Hamshahri in CLEF 2008
- UT Participants
    - Using Part of Speech Tagging in Persian Information Retrieval
    - Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
    - Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
    - Cross Language Experiments at Persian@CLEF 2008
- Next Year
3
The Persian Language
- A branch of the Indo-European languages
- Official language of Iran, Afghanistan and Tajikistan
- Its morphological analysis is comparatively difficult
- The word "خبر" ("news") has two plural forms:
    - Persian rules: "خبرها"
    - Arabic rules: "اخبار"
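The two plural forms above suggest why simple suffix stripping is not enough for Persian. The following is a minimal sketch, not the authors' stemmer: it assumes a hypothetical lookup table for Arabic-rule plurals (containing only the slide's example) and strips the regular Persian suffix "ها" otherwise. The names `BROKEN_PLURALS` and `singularize` are illustrative.

```python
# Hypothetical lookup for Arabic-rule (irregular) plurals.
# Only the example from the slide is listed; a real table would be much larger.
BROKEN_PLURALS = {"اخبار": "خبر"}

# Regular Persian plural suffix, as in the slide's example "خبرها".
PERSIAN_PLURAL_SUFFIX = "ها"


def singularize(word: str) -> str:
    """Return the singular form if the plural is recognised, else the word unchanged."""
    # Irregular plurals cannot be derived by rule, so look them up first.
    if word in BROKEN_PLURALS:
        return BROKEN_PLURALS[word]
    # Regular plurals: strip the suffix, but never reduce a word to nothing.
    if word.endswith(PERSIAN_PLURAL_SUFFIX) and len(word) > len(PERSIAN_PLURAL_SUFFIX):
        return word[: -len(PERSIAN_PLURAL_SUFFIX)]
    return word
```

With this sketch both plural forms of the slide's example map to the same index term, which is the behaviour a retrieval system would presumably want.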
4
Some Processing Issues
- Writing style issues:
    - e.g. "می شود" and "میشود" are the same
    - e.g. "کتابها" and "کتاب ها" are the same
- KASRE (a short vowel that is usually not written):
    - e.g. "چراغ علی خانه را سوزاند" has two different meanings:
        - CheraghAli burned the house
        - Ali's lantern burned the house
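The writing style issues above can be handled by normalising spacing before indexing. The sketch below is only an illustration under stated assumptions: it covers just the prefix "می" and the suffix "ها" from the slide's examples, treats the zero-width non-joiner as a space, and joins the affix to its word. The function name `normalize_spacing` is hypothetical, and a real normaliser would need a lexicon to avoid joining words that only look like affixes.

```python
import re

ZWNJ = "\u200c"   # zero-width non-joiner, often typed instead of a space
PREFIX_MI = "می"  # verbal prefix from the slide's example
SUFFIX_HA = "ها"  # plural suffix from the slide's example

# A standalone prefix followed by whitespace, and whitespace followed by a
# standalone suffix. Python's \b is Unicode-aware for str patterns.
_PREFIX_RE = re.compile(rf"\b{PREFIX_MI}\s+")
_SUFFIX_RE = re.compile(rf"\s+{SUFFIX_HA}\b")


def normalize_spacing(text: str) -> str:
    """Join a detached prefix or suffix to its word so spelling variants match."""
    text = text.replace(ZWNJ, " ")           # unify the two kinds of separator
    text = _PREFIX_RE.sub(PREFIX_MI, text)   # "می شود"  -> "میشود"
    text = _SUFFIX_RE.sub(SUFFIX_HA, text)   # "کتاب ها" -> "کتابها"
    return text
```

Applying the same function to both documents and queries makes the variant spellings map to one index term. The KASRE ambiguity is not addressed by this sketch, since it cannot be resolved from spelling alone.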
5
Encoding
6
Persian in the Middle East
User Population Growth on the Web (2000-2008)
Source: Internet World Statistics, http://internetworldstats.com/, December 31, 2007
7
Persian Test Collections
- IR Domain
    - Ghavanin (domain specific)
    - Hamshahri (news)
      WEB: http://ece.ut.ac.ir/dbrg/hamshahri
- NLP Domain
    - Bijankhan (2 million words)
      WEB: http://ece.ut.ac.ir/dbrg/bijankhan
8
Hamshahri in CLEF 2008
- News articles of the Hamshahri newspaper from 1996 to 2002
- Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB)
- 22 assessors
- Evaluation based on the DIRECT system
9
Hamshahri in CLEF 2008
Collection size              564 MB (Unicode text)
No. of documents             166,774
No. of unique terms          417,339
Average length of documents  380 terms
No. of categories            9
No. of topics                50 bilingual
10
Implementation of our methods
- We submitted the top 100 for each run
11
Using Part of Speech Tagging in Persian Information Retrieval
Reza Karimpour, Amineh Ghorbani, Azadeh Pishdad, Mitra Mohtarami, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
12
Using Part of Speech Tagging in Persian Information Retrieval
Config.  Corpus                        Query
1        Tagged                        Title with equal weighting for all POS tags
2        Stemmed and tagged            Stemmed title with equal weighting for all POS tags
3        Stemmed                       Stemmed title without POS tagging
4        Stemmed                       Stemmed title plus description
5        Stemmed (stop words removed)  Stemmed title plus description (stop words removed)
6        Tagged                        Title plus description with equal weighting for all POS tags
7        Tagged                        Title with various weighting schemes for different POS tags
8        Normal                        Title (neither stemmed nor tagged)
13
Using Part of Speech Tagging in Persian Information Retrieval
Weighting scheme                                Average precision  R-Precision
20 less used tags omitted, others equal weight  0.2745             0.3097
Noun=3 Verb=2 Adj=1 Adv=1                       0.2635             0.3104
Noun=3 Verb=0 Adj=3 Adv=0                       0.2597             0.2888
Noun=0 Verb=2 Adj=0 Adv=0                       0.1108             0.1256
Noun=0 Verb=0 Adj=1 Adv=0                       0.1198             0.1186
Noun=0 Verb=0 Adj=0 Adv=1                       0.0977             0.1111
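The weighting schemes above assign one weight per POS tag. The slides do not say how the weights were passed to the retrieval engine, so the following is only a sketch under the assumption that each query term receives the weight of its tag and that terms with weight 0 are dropped. The function `weight_query` and the English example terms are hypothetical.

```python
from collections import defaultdict


def weight_query(tagged_terms, pos_weights):
    """Map each query term to the weight of its POS tag.

    tagged_terms: list of (term, tag) pairs produced by a POS tagger.
    pos_weights:  dict from tag name to weight, e.g. one row of the table.
    Terms whose tag has weight 0 (or is unknown) are left out of the query.
    """
    weights = defaultdict(float)
    for term, tag in tagged_terms:
        w = pos_weights.get(tag, 0)
        if w > 0:
            weights[term] += w  # a repeated term accumulates weight
    return dict(weights)


# One scheme from the table, applied to an illustrative tagged query.
scheme = {"Noun": 3, "Verb": 2, "Adj": 1, "Adv": 1}
query = [("book", "Noun"), ("buy", "Verb"), ("new", "Adj")]
weighted = weight_query(query, scheme)
```

Under a scheme such as Noun=0 Verb=2 Adj=0 Adv=0, the same query keeps only its verb, which is consistent with the much lower precision reported for the single-tag schemes.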
14
Using Part of Speech Tagging in Persian Information Retrieval
15
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Zahra Aghazade, Nazanin Dehghani, Leili Farzinvash, Razieh Rahimi, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Terrier Open Source Retrieval Engine: http://ir.dcs.gla.ac.uk/terrier/
Weighting Model: Description
- BB2: Bose-Einstein model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
- BM25: The BM25 probabilistic model
- DFR_BM25: The DFR version of BM25
- IFB2: Inverse Term Frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
- In_expB2: Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
- In_expC2: Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization with natural logarithm
- InL2: Inverse document frequency model for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
- PL2: Poisson estimation for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
- TF_IDF: The tf*idf weighting function, where tf is given by Robertson's tf and idf is given by the standard Sparck Jones' idf
16
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Weighting Model  Average Precision  R-Precision
BB2              0.3854             0.4167
BM25             0.3562             0.4009
DFR_BM25         0.4006             0.4347
IFB2             0.4017             0.4328
In_expB2         0.3997             0.4329
In_expC2         0.4190             0.4461
InL2             0.3832             0.4200
PL2              0.4314             0.4548
TF_IDF           0.3574             0.4017
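The two measures reported in the table can be computed per topic as in the sketch below, which follows the standard definitions of (non-interpolated) average precision and R-precision. It is not the evaluation code used at CLEF, and the document identifiers are made up for illustration.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document.

    ranking:  list of document ids, best first.
    relevant: set of ids judged relevant for the topic.
    Relevant documents that are never retrieved contribute zero.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def r_precision(ranking, relevant):
    """Precision after R documents, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranking[:r] if doc_id in relevant) / r


# Illustrative topic: three relevant documents, two of them retrieved.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
ap = average_precision(ranking, relevant)   # (1/1 + 2/3) / 3
rp = r_precision(ranking, relevant)         # 2 of the top 3 are relevant
```

The figures in the table would then be the mean of these per-topic values over all topics.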
17
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
And two other variations of this operator: IOWA and NOWA
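The operator itself appears only as an image on the slide. Assuming it is the ordered weighted averaging (OWA) operator, of which IOWA and NOWA would be variants, a minimal sketch of the basic form is given below. The assumption, the function name `owa` and the example numbers are not from the source, and the IOWA and NOWA variants are not sketched here.

```python
def owa(scores, weights):
    """Ordered weighted average of one document's scores from several engines.

    The weights are attached to positions in the sorted score list, not to
    particular engines: weights[0] multiplies the largest score, and so on.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))


# Scores for one document from three retrieval runs, assumed to be
# normalised to a common range beforehand.
fused = owa([0.2, 0.9, 0.5], [0.5, 0.3, 0.2])
```

Ranking documents by the fused score gives a single merged result list. Depending on the weights, the operator ranges from the maximum (all weight on the first position) to the minimum (all weight on the last).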
18
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
19
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Post hoc Results
Retrieval Method                    Toolkit  Average Precision  R-Precision  Dif
TF_IDF with unstemmed single terms  Terrier  0.3847             0.4122
PL2 with 4gram terms                Terrier  0.3669             0.3939
Indri with stemmed terms            Lemur    0.3955             0.4149
IOWA                                         0.4515             0.4708       +5.6
NOWA                                         0.4522             0.4736       +5.67
20
Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
Amir Hossein Jadidinejad, Mitra Mohtarami, Hadi Amiri
21
Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
But the results were not good on the test set
22
Cross Language Experiments at Persian@CLEF 2008
Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian
Run                  tot-ret  rel-ret  MAP    Retrieval Model    Tool
Using Light Stemmer  5161     1967     26.89  Vector Space       Lucene
Without Stemmer      5161     1991     27.08  Vector Space       Lucene
3Grams               5161     1901     26.07  Language Modeling  Lemur
4Grams               5161     1950     26.70  Language Modeling  Lemur
5Grams               5161     1983     27.13  Language Modeling  Lemur
Term-Based           5161     2035     28.14  Language Modeling  Lemur
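The 3Grams, 4Grams and 5Grams runs index character n-grams instead of whole terms. The slides do not say how the n-grams were produced, so the sketch below assumes overlapping n-grams taken within each whitespace-separated word, with shorter words kept whole. The function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n):
    """Split text into overlapping character n-grams, word by word.

    Words with n characters or fewer are kept as a single token so that
    short words are not lost from the index.
    """
    grams = []
    for word in text.split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


tokens = char_ngrams("persian text", 4)
```

N-gram indexing is a language-independent alternative to stemming, since different surface forms of a word still share most of their n-grams.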
23
Cross Language Experiments at Persian@CLEF 2008
Query Translation
- Probabilistic Structured Queries (PSQ)
- Combinatorial Translation Probability (CTP)
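In PSQ as it is commonly described in the cross-language retrieval literature, the term frequency of a source-language query term in a target-language document is estimated as the sum of the frequencies of its translations, each weighted by its translation probability. The sketch below shows only that step, with made-up placeholder terms and probabilities. It is not the authors' implementation, and CTP is not sketched because the slides give no details of it.

```python
def psq_term_frequency(translations, doc_term_counts):
    """Estimated frequency of a query term in a document written in another language.

    translations:    dict mapping each candidate translation to its probability
                     p(translation | query term).
    doc_term_counts: dict mapping document terms to their raw counts.
    """
    return sum(
        prob * doc_term_counts.get(term, 0)
        for term, prob in translations.items()
    )


# Hypothetical translation table for one English query term and the term
# counts of one Persian document (placeholder tokens, not real data).
translations = {"fa_term_a": 0.7, "fa_term_b": 0.3}
document = {"fa_term_a": 2, "fa_term_b": 1, "fa_term_c": 5}
estimated_tf = psq_term_frequency(translations, document)
```

The estimated frequency can then be used in place of a raw term frequency by whichever retrieval model scores the document.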
24
Cross Language Experiments at Persian@CLEF 2008
Query Translation Results
25
Cross Language Experiments at Persian@CLEF 2008
Document Translation
- Using the Shiraz machine translation system from CRL of NMSU
- It took 10 days to translate 130,000+ documents from Persian to English
26
Cross Language Experiments at Persian@CLEF 2008
Document Translation & Hybrid Results
27
Next Year
Ham2 for the Next Year
- Extended version of the Hamshahri collection
- 2 times larger (~1.5 GB)
HAM2-851011-001
/1385/851011/news/_adabh.htm
دوشنبه 11 دي 1385 - سال چهاردهم - شماره 4172 - Jan 1, 2007
2007-01-01
ادب و هنر
Literature and Art
<TITLE> <![CDATA[ مديركل كتاب و كتابخواني وزارت فرهنگ و ارشاد اسلامي خبر داد آيين نامه خريد كتاب اصلاح شد ]]></TITLE>
/1385/851011/news/008505.jpg
<![CDATA[ فارس : مدير كل كتاب و كتاب خواني وزارت فرهنگ و ارشاد اسلامي گفت : آيين نام
28
Questions?
Thanks For Your Attention
Database Research Group
http://ece.ut.ac.ir/dbrg