Microsoft Research India’s Participation in FIRE2008 Raghavendra Udupa

1 Microsoft Research India’s Participation in FIRE2008 Raghavendra Udupa raghavu@microsoft.com

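2 CLIR System (diagram)
Components: Query Translator, Dictionary, Document Ranker, Inverted Index, LA Times 2002 articles
Example query (CLEF’07 Query #10.2452/447-AH):
• पिम फोरत्यून की राजनीति ("Pim Fortuyn's politics")
• ऐसे दस्तावेज खोजिए जिनमें पिम फोरत्यून के राजनैतिक विचारों पर चर्चा की गई हो। ("Find documents that discuss Pim Fortuyn's political views.")
Translated query: Pim Fortuyn politics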

3 CLIR System (diagram)
Components: Inverted Index, Dictionary, Document Collection, Document Ranker, Query Translator
Annotations:
• Domain Adaptation
• Mining Translation Lexicon from Comparable Corpora
• Mining transliterations of OOV words
• Cross-Language Ranking Model
• Mining NETE Transliterations from Comparable Corpora

4 CLIR System (diagram)
Components: Inverted Index, Dictionary, Document Collection, Document Ranker, Query Translator
Annotations:
• Domain Adaptation
• Mining transliterations of OOV terms (ECIR 2009)
• Cross-Language Ranking Models
• Mining NETE Transliterations from Comparable Corpora (CIKM’08)
• Mining Translation Lexicon from Comparable Corpora (MT Summit 2007)

5 Baseline Retrieval System
• Language Model-Based Retrieval
• Probabilistic Translation Lexicon
• ~100K parallel sentences
• IBM Model 3 Alignment
• GIZA++
Reference: J. Jagarlamudi and A. Kumaran, Cross-Lingual Information Retrieval System for Indian Languages. Working Notes for the CLEF 2007 Workshop.
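The slide names only the ingredients of the baseline (language-model retrieval plus a probabilistic translation lexicon). As a rough illustration of how such pieces commonly fit together, the sketch below scores a target-language document against source-language query terms by summing over lexicon translations. The function name `clir_log_likelihood`, the linear smoothing weight `lam`, and the `floor` given to terms with no lexicon entry are all assumptions made for this example, not details taken from the slides or the cited working notes.

```python
import math
from collections import Counter

def clir_log_likelihood(query_terms, doc_tokens, coll_counts, coll_len,
                        lexicon, lam=0.5, floor=1e-9):
    """Log-likelihood of source-language query terms given a target-language document.

    lexicon maps a source term to {target_term: translation_probability}.
    The smoothing scheme (linear interpolation with the collection model)
    is an assumption for illustration.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    log_p = 0.0
    for s in query_terms:
        p_s = 0.0
        for t, p_trans in lexicon.get(s, {}).items():
            p_doc = doc_counts[t] / doc_len if doc_len else 0.0
            p_col = coll_counts.get(t, 0) / coll_len
            p_s += p_trans * ((1 - lam) * p_doc + lam * p_col)
        # A term missing from the lexicon (OOV) contributes only the floor,
        # so it cannot help separate relevant from non-relevant documents.
        log_p += math.log(max(p_s, floor))
    return log_p
```

In this sketch an out-of-vocabulary query term adds the same constant to every document's score, which is one way to see why OOV terms matter for ranking.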

6 FIRE Fighting
• Mining Transliterations of Out-Of-Vocabulary Query Terms.
• Date-Based Document Restriction.

7 Mining Transliterations of Out-Of-Vocabulary Query Terms Raghavendra Udupa

8 OOV Query Terms
• Many OOV query terms are NEs
• NEs are often the focus of a query
• NEs form an open class of terms in all languages.
• Getting their transliterations right is extremely important
• Many OOV query terms are not NEs but transliterations of English words.
• E.g. सेमिनार (seminar), कार्पोंरेशन (corporation), चैम्पियन (champion), फिल्म (film)

9 A Hypothesis
• The transliterations of most of the transliteratable OOV terms of a query can be found in documents relevant to the query.

10 Empirical Validation
Collection | Transliteratable OOV terms | Terms with transliterations in at least one relevant document | Terms with transliteration in at least 50% of relevant documents
CLEF 2006 (Hindi) | 62 | 58 (94%) | 49 (79%)
CLEF 2007 (Hindi) | 47 | 42 (89%) | 34 (72%)
CLEF 2007 (Tamil) | 43 | 42 (98%) | 39 (89%)

11 A Practical Hypothesis
• The transliterations of many of the transliteratable OOV terms of a query can be found in the top results of the CLIR system for the query.

12 Mining OOV Transliteration Equivalents
• Basic Idea:
  • Pair the query with each of the top N results.
  • Treat each pair as a comparable document pair.
  • Mine transliteration equivalents from the comparable document pairs.
Reference: “They are out there, if you know where to look”: Mining Transliterations of OOV Query Terms for Cross-Language Information Retrieval, ECIR 2009, Toulouse
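The three steps of the basic idea can be sketched as a loop over the top N results. The slides do not describe the model that decides whether two words are transliteration equivalents, so this example substitutes a plain string-similarity ratio from Python's `difflib` applied to a romanized form of each OOV term. The romanization input, the `threshold` value, and the function names are hypothetical stand-ins, not the method of the ECIR 2009 paper.

```python
from difflib import SequenceMatcher

PUNCT = ".,;:!?\"'()"

def string_similarity(a, b):
    """Stand-in similarity: character-level match ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def mine_transliterations(oov_romanized, top_docs, threshold=0.75):
    """Return {oov_term: best matching word or None}.

    oov_romanized maps each OOV query term to a romanized form
    (assumed to be available; how it is produced is out of scope here).
    top_docs is the list of top-N result texts for the query.
    """
    mined = {}
    for term, roman in oov_romanized.items():
        best_word, best_score = None, threshold
        for doc in top_docs:  # each (query, doc) pair is treated as comparable
            for raw in doc.split():
                word = raw.strip(PUNCT)
                score = string_similarity(roman, word)
                if word and score > best_score:
                    best_word, best_score = word, score
        mined[term] = best_word
    return mined

top_docs = ["Pim Fortuyn was a Dutch politician.", "The election results"]
print(mine_transliterations({"फोरत्यून": "fortyun"}, top_docs))
```

A mined equivalent would then be added to the translated query before the final retrieval pass.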

13 Long Queries: MAP
Collection | Baseline | Transliterations Mining | % change over baseline
CLEF 2006 (Hindi) | 0.1463 | 0.2476 | +69.24*
CLEF 2007 (Hindi) | 0.2521 | 0.3389 | +34.43*
CLEF 2007 (Tamil) | 0.1848 | 0.2270 | +22.84*
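The "% change over baseline" column is the relative MAP improvement, (mined - baseline) / baseline × 100. The short check below recomputes it from the long-query MAP values on this slide; the function name `percent_change` is just a label chosen for the example.

```python
def percent_change(baseline, improved):
    """Relative change of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0

long_query_map = {
    "CLEF 2006 (Hindi)": (0.1463, 0.2476),
    "CLEF 2007 (Hindi)": (0.2521, 0.3389),
    "CLEF 2007 (Tamil)": (0.1848, 0.2270),
}

for collection, (baseline, mined) in long_query_map.items():
    print(f"{collection}: {percent_change(baseline, mined):+.2f}%")
```

Rounded to two decimals, the three results match the +69.24, +34.43 and +22.84 reported on the slide.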

14 Short Queries: MAP
Collection | Baseline | Transliterations Mining | % change over baseline
CLEF 2006 (Hindi) | 0.0877 | 0.1467 | 67.3
CLEF 2007 (Hindi) | 0.1829 | 0.2323 | 27.0
CLEF 2007 (Tamil) | 0.1024 | 0.1265 | 23.5

15 FIRE 2008: MAP
Run | Baseline | Transliterations Mining | % change over baseline
Short (unofficial) | 0.2616 | 0.3191 | 22
Long (unofficial) | 0.4351 | 0.4871 | 12
Long (official) | 0.4140 | 0.4526 | 9

16 FIRE2008: MAP Difference (Long, official)

17 FIRE 2008: Num_Rel_Ret
Run | Baseline | Transliterations Mining
Short (unofficial) | 70.60 | 80.0
Long (unofficial) | 84.55% | 88.54%
Long (official) | 79.68% | 82.11%

18 FIRE 2008: P@10
Run | Baseline | Transliterations Mining
Short (unofficial) | 0.1000 | 0.4320
Long (unofficial) | 0.6260 | 0.6540
Long (official) | 0.6120 | 0.6480
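For readers less familiar with the metrics reported on these slides, the sketch below gives the standard textbook definitions of P@k and (mean) average precision for a ranked list with binary relevance. The slides do not say which evaluation tool produced the reported numbers, so this is an illustration of the definitions only; the document IDs in the example are made up.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision values at the rank of each retrieved relevant document,
    divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranked = ["d1", "d2", "d3", "d4"]   # hypothetical ranking
relevant = {"d1", "d3"}             # hypothetical judgments
print(precision_at_k(ranked, relevant, k=2), average_precision(ranked, relevant))
```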

19 Mining Transliterations @ FIRE2008
• Worked.

20 Date-Based Document Restriction Raghavendra Udupa

21 Dates
• Some queries contain dates
• CLEF 2007, Topic 407: Who was the Australian Prime Minister in 2002?
• CLEF 2007, Topic 411: …terrorist car bomb in Bali, Indonesia, in 2002.
• CLEF 2006, Topic 326: …winners in any category of the 1995 Emmy Awards.
• CLEF 2006, Topic 327: …earthquakes in Mexico City in 1995.

22 Hypothesis
• If a query contains a date then the relevant documents for the query are likely to be from the same time period.

23 Empirical Validation
• CLEF’07
  • LATimes 2002
• CLEF’06
  • GH 95, LATimes 1994

24 CLEF’06: C327
• Title: Earthquakes in Mexico City
• Description: Find documents that provide details on the impact of or the damage caused by earthquakes in Mexico City in 1995.
• Narrative: Relevant document should contain some information on earthquakes in Mexico City in 1995, such as their magnitude, damages caused, panic of the inhabitants, etc. Documents on earthquakes in other places in Mexico are not relevant unless the seismic impact was also felt in Mexico City.

25 Relevant Document
• LA121194-0313
• 107228
• December 11, 1994, Sunday, Home Edition
• A magnitude 6.3 earthquake rocked Mexico City, causing people to flee their homes in fear. There were no immediate reports of injuries or severe damage. The U.S. Geological Survey's National Earthquake Information Center in Golden, Colo., said the quake's epicenter was in Petatlan in the southwestern state of Guerrero.

26 Date-Based Document Restriction
• Identify dates (if any) in the query.
• Restrict candidate documents to the set of documents coming from the same time period.
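The two steps above can be sketched as a year extractor followed by a filter. The slides do not define what counts as "the same time period", so this example assumes it means the same calendar year and recognizes only four-digit years; both choices, along with the function names and the second document ID, are assumptions for illustration.

```python
import re
from datetime import date

# Four-digit years in the 1900s or 2000s (an assumed, deliberately simple pattern).
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def query_years(query):
    """Step 1: identify years mentioned in the query text."""
    return {int(match.group(0)) for match in YEAR_RE.finditer(query)}

def restrict_by_date(query, docs):
    """Step 2: keep only documents dated in a year mentioned by the query.

    docs is a list of (doc_id, publication_date) pairs. A query without
    a year leaves the candidate set unchanged.
    """
    years = query_years(query)
    if not years:
        return list(docs)
    return [(doc_id, d) for doc_id, d in docs if d.year in years]

docs = [
    ("LA121194-0313", date(1994, 12, 11)),   # the relevant document shown on slide 25
    ("hypothetical-doc", date(1995, 3, 1)),  # made-up document for contrast
]
print(restrict_by_date("earthquakes in Mexico City in 1995", docs))
```

Under this same-year reading, the relevant document from slide 25 (dated December 11, 1994) would be filtered out for a query that mentions 1995.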

27 FIRE 2008: Relevant Docs
Topic | Relevant docs from different time period
44 | (11/56)
47 | (23/32)
48 | (70/76)
50 | (18/61)
52 | (2/38)
73 | (10/53)
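Reading each entry as "relevant documents from a different time period / all relevant documents for the topic" (an interpretation of the slide's notation, not something it states explicitly), the share of relevant documents that a strict date restriction would exclude can be computed directly:

```python
# (docs from a different time period, all relevant docs) per FIRE 2008 topic,
# as listed on the slide; the interpretation of the pair is an assumption.
counts = {44: (11, 56), 47: (23, 32), 48: (70, 76),
          50: (18, 61), 52: (2, 38), 73: (10, 53)}

def outside_share(counts):
    """Fraction of relevant documents lying outside the query's time period."""
    return {topic: outside / total for topic, (outside, total) in counts.items()}

for topic, share in outside_share(counts).items():
    print(f"Topic {topic}: {share:.1%}")
```

For topic 48, for instance, 70 of 76 relevant documents (about 92%) fall outside the period under this reading.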

28 FIRE 2008: Hindi → English MAP (DR = date-based document restriction)
Queries | Without DR | With DR
Short | 0.2616 (unofficial) | 0.2601 (unofficial)
Long | 0.4351 (unofficial) | 0.4140 (official)

29 Date-Based Document Restriction @ FIRE2008
• Hurt us.
• Deeper investigation needed.

