
Cross-Language Name Search. Raghavendra Udupa (Microsoft Research India), Mitesh Khapra (IIT Bombay). NAACL-HLT 2010, June 3, 2010.


1 Cross-Language Name Search
Raghavendra Udupa, Microsoft Research India
Mitesh Khapra, IIT Bombay
NAACL-HLT 2010, June 3, 2010
Improving the Multilingual User Experience of Wikipedia Using Cross-Language Name Search

2 Name Search
Searching people directories by name.
– Facebook Friend Search
– Outlook Address Book Search

3 Cross-Language Name Search
Searching people directories by name across languages.
– Query in Russian
– Query in Hebrew

4 Challenges
– Script and phonetic differences
– Large directories: millions of names
– Multi-word names and partial matches
– Spelling variations

5 Naive Approach
Transliterate and Search
– רשיד → Rashid
Limitations
– Slow, as it involves the intermediate step of transliteration generation.
– Machine transliteration is not perfect: transliteration errors affect search results.
Is transliteration generation necessary?

6 Our Approach
[Diagram: Names (רשיד, אנטוני, Rashid) → Language-Independent Geometric Representation → Similarity]

7 Search Overview
[Diagram: Query (רשיד) and Names compared by Geometric Distance; Geometric Nearest Neighbor Search]

8 What is the advantage?
Can scale to reasonably large name directories
– Compact geometric representation: 50-dimensional space, 6 M names
Search is effective and efficient
– Geometric nearest-neighbor search using Approximate Nearest Neighbor (ANN) [Arya et al., 1998]
– ~1 s per query for searching 6 M names
– >20% improvement in MRR over Transliterate-and-Search
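The search step can be illustrated with a minimal sketch. It assumes names have already been mapped to vectors in the common space, and it uses an exact linear scan rather than the ANN library cited on the slide. The 3-dimensional vectors, the directory entries, and the function names are invented for illustration; the slides describe a 50-dimensional space.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_names(query_vec, directory, k=2):
    """Return the k directory names whose vectors lie closest to query_vec.

    Exact linear scan: fine for a toy directory, too slow for millions of
    names, which is why the slides use approximate nearest-neighbor search.
    """
    ranked = sorted(directory, key=lambda name: euclidean(query_vec, directory[name]))
    return ranked[:k]

# Hypothetical vectors; real ones would come from the learned projection.
directory = {
    "Rashid": [0.90, 0.10, 0.30],
    "Anthony": [0.10, 0.80, 0.50],
    "Rashida": [0.80, 0.20, 0.40],
}
query = [0.88, 0.12, 0.31]  # stands in for the projected query רשיד
results = nearest_names(query, directory, k=2)
print(results)
```

Because the query and the directory live in the same space, no transliteration is generated at query time; only a projection and a distance computation are needed.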

9 What is the challenge?
Language/Script-Independent Representation
– Learning a common geometric feature space from training data
Multi-Word Names and Partial Matches
– Maximum Weighted Bipartite Matching

10 Previous Work
Language-Independent Representation
– (2007) Canonical Correlation Analysis: An overview with application to learning methods. D. Hardoon et al., Neural Computation 2004.
Transliteration Equivalence
– (2006) Named entity transliteration and discovery from multilingual comparable corpora. A. Klementiev and D. Roth, HLT-NAACL 2006.
– (2009) Learning better transliterations. J. Pasternack and D. Roth, CIKM 2009.
– (2010) Transliteration equivalence using canonical correlation analysis. R. Udupa and M. Khapra, ECIR 2010.

11 Common Feature Space
Training Data: Parallel Names
Parallel Names → Similar Vectors in the Common Feature Space

12 Feature Vectors

13 Learning Common Feature Space Canonical Correlation Analysis

14 [Figure: numbered points; no other text recoverable]

15 Learning Common Feature Space
Canonical Correlation Analysis (Hotelling, 1936)
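For reference, the standard textbook formulation of CCA is sketched below. The slides name the technique but give no formulas, so the symbols here (x, y, a, b, C) are assumptions chosen for illustration, and the covariance matrices are assumed to be invertible.

```latex
% x, y: centered feature vectors of a name and its parallel name
% C_{xx}, C_{yy}: covariance matrices; C_{xy} = C_{yx}^{\top}: cross-covariance
\rho = \max_{a,\,b} \;
  \frac{a^{\top} C_{xy}\, b}
       {\sqrt{a^{\top} C_{xx}\, a}\;\sqrt{b^{\top} C_{yy}\, b}}

% Equivalent eigenproblem for a; b then follows from a:
C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}\, a = \rho^{2}\, a,
\qquad
b = \frac{1}{\rho}\, C_{yy}^{-1} C_{yx}\, a
```

Stacking the leading d solutions as columns of matrices A and B gives the two projections; a name with feature vector x is mapped to A^T x, and a name in the other language to B^T y, so that parallel names land close together.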

16 Multi-Word Names
[Figure: word-to-word similarities 0.97 and 0.91]
Score = Maximum Weighted Matching / (m – n + 1)
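The score can be sketched in a few lines. This is a brute-force illustration, not the authors' implementation. It assumes that m and n are the word counts of the two names with m ≥ n (the slide does not define them) and that each cell of the hypothetical matrix `sim` holds the similarity of one word pair.

```python
from itertools import permutations

def max_weighted_matching(sim):
    """Weight of the best one-to-one word matching (brute force).

    sim[i][j] is the similarity of word i of one name and word j of the
    other. Enumerating permutations is only practical for short names.
    """
    rows, cols = len(sim), len(sim[0])
    if rows > cols:
        # Transpose so that every row (the shorter name) gets matched.
        sim = [list(col) for col in zip(*sim)]
        rows, cols = cols, rows
    return max(
        sum(sim[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(cols), rows)
    )

def name_score(sim):
    """Matching weight divided by (m - n + 1), with m >= n (assumed)."""
    m = max(len(sim), len(sim[0]))
    n = min(len(sim), len(sim[0]))
    return max_weighted_matching(sim) / (m - n + 1)

# Two-word query against a two-word name, using the weights from the figure.
full_match = name_score([[0.97, 0.10],
                         [0.05, 0.91]])
# Same query against a three-word name: the extra word lowers the score.
partial_match = name_score([[0.97, 0.10],
                            [0.05, 0.91],
                            [0.20, 0.30]])
print(full_match, partial_match)
```

Under this reading, the denominator penalizes a difference in length, so a partial match ranks below a full match with the same matched words.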

17 Experimental Setup
Name Directory: English Wikipedia Titles
– 6 Million Titles, 2 Million Unique Words
Query Languages: Russian, Hebrew, Kannada, Tamil, Hindi, Bengali
– 1000 multi-word names in each language
Baseline: State-of-the-art Machine Transliteration (NEWS 2009)

18 Experimental Results
MRR scale: 0 (Very Bad) to 1 (Perfect)
[Chart legend: Competitor, GEOM-SEARCH]

Algorithm      Russian  Kannada  Tamil  Hindi
TRANS-SEARCH   0.47     0.52     0.29   0.49
GEOM-SEARCH    0.56     0.69     0.49   0.69
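For readers unfamiliar with the metric, Mean Reciprocal Rank (MRR) can be computed as in the sketch below. The queries and result lists are invented for illustration and are not taken from the experiments.

```python
def mean_reciprocal_rank(ranked_results, gold):
    """Average of 1/rank of the correct name over all queries.

    A query whose correct name is absent from its result list contributes 0.
    """
    total = 0.0
    for results, answer in zip(ranked_results, gold):
        if answer in results:
            total += 1.0 / (results.index(answer) + 1)
    return total / len(gold)

# Hypothetical output of a search system for three queries.
ranked_results = [
    ["Rashid", "Rashida"],   # correct name at rank 1 -> 1.0
    ["Antonio", "Anthony"],  # correct name at rank 2 -> 0.5
    ["Ravi", "Rahul"],       # correct name missing   -> 0.0
]
gold = ["Rashid", "Anthony", "Raghavendra"]
mrr = mean_reciprocal_rank(ranked_results, gold)
print(mrr)
```

An MRR of 1 means the correct name is always ranked first; values near 0 mean it is ranked low or not retrieved at all.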

19 Conclusions
Pros
– Data driven: easy to include new languages.
– Not training-data hungry: a few thousand parallel names suffice.
– Bridge languages are useful: the feature space for (P,Q) can be learnt using only data in (P,R) and (Q,R) (Khapra and Udupa, AAAI 2010).
– Fast search: ~1 s for a 6 M names directory.
– Applications: Cross-Language Wikipedia Search, Spelling Correction of Personal Names

20 Thank you!
Raghavendra Udupa, Microsoft Research India
Mitesh Khapra, IIT Bombay
NAACL-HLT 2010, June 3, 2010
Improving the Multilingual User Experience of Wikipedia Using Cross-Language Name Search

