
Cross-Language Name Search
Improving the Multilingual User Experience of Wikipedia Using Cross-Language Name Search
Raghavendra Udupa, Microsoft Research India
Mitesh Khapra, IIT Bombay
NAACL-HLT 2010, June 3, 2010

Name Search
Searching people directories by name.
– Facebook Friend Search
– Outlook Address Book Search

Cross-Language Name Search
Searching people directories by name across languages.
– Query in Russian
– Query in Hebrew

Challenges
– Script and phonetic differences
– Large directories: millions of names
– Multi-word names and partial matches
– Spelling variations

Naive Approach
Transliterate and Search
– רשיד → Rashid
Limitations
– Slow, as it involves the intermediate step of transliteration generation.
– Machine transliteration is not perfect: transliteration errors affect search results.
Is transliteration generation necessary?

Our Approach
Names (e.g. רשיד, אנטוני, Rashid) → Language-Independent Geometric Representation → Similarity

Search Overview
– The query (e.g. רשיד) and the directory names are compared by geometric distance.
– Retrieval is geometric nearest-neighbor search.

What is the advantage?
– Can scale to reasonably large name directories
  – Compact geometric representation
  – 50-dimensional space
  – 6 M names
– Search is effective and efficient
  – Geometric nearest-neighbor search using Approximate Nearest Neighbor (ANN) [Arya et al., 1998]
  – ~1 s per query for searching 6 M names
  – >20% improvement in MRR over Transliterate-and-Search
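The search step can be illustrated with a minimal sketch. It uses an exact linear scan rather than the ANN library cited on the slide, so it shows only the idea of ranking directory names by geometric distance, not the authors' implementation. The function names and the toy 3-dimensional vectors are hypothetical; the slides use a 50-dimensional space.

```python
import heapq
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_names(query_vec, directory, k=3):
    """Return the k (name, vector) pairs closest to query_vec.

    directory is a list of (name, vector) pairs. This is an exact
    linear scan (assumption: stands in for approximate search).
    """
    return heapq.nsmallest(
        k, directory, key=lambda item: euclidean(query_vec, item[1])
    )


# Hypothetical toy vectors; real vectors would come from the learnt projection.
directory = [
    ("Rashid", [0.9, 0.1, 0.0]),
    ("Antony", [0.0, 0.8, 0.3]),
    ("Rashida", [0.8, 0.2, 0.1]),
]
query = [0.88, 0.12, 0.02]  # assumed projection of the query name
top = nearest_names(query, directory, k=2)
top_names = [name for name, _ in top]
```

A linear scan costs time proportional to the directory size per query, which is why an approximate nearest-neighbor index is attractive at the 6 M-name scale.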

What is the challenge?
– Language/script-independent representation
  – Learning a common geometric feature space from training data
– Multi-word names and partial matches
  – Maximum weighted bipartite matching

Previous Work
Language-Independent Representation
– (2007) Canonical Correlation Analysis: An overview with application to learning methods. D. Hardoon et al., Neural Computation, 2004.
Transliteration Equivalence
– (2006) Named entity transliteration and discovery from multilingual comparable corpora. A. Klementiev and D. Roth, HLT-NAACL 2006.
– (2009) Learning better transliterations. J. Pasternack and D. Roth, CIKM 2009.
– (2010) Transliteration equivalence using canonical correlation analysis. R. Udupa and M. Khapra, ECIR 2010.

Common Feature Space
– Training data: parallel names
– Parallel names → similar vectors in the common feature space

Feature Vectors

Learning Common Feature Space
Canonical Correlation Analysis


Learning Common Feature Space
Canonical Correlation Analysis (Hotelling, 1936)
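For reference, the standard CCA objective can be written as below. The notation is an assumption introduced here, not taken from the slides: x_i and y_i are the centered feature vectors of the i-th parallel name pair in the two languages, C_xx and C_yy are their covariance matrices, C_xy is the cross-covariance matrix, and a, b are the projection directions being learnt.

```latex
\[
(a^{*}, b^{*}) \;=\; \arg\max_{a,\,b}\;
\frac{a^{\top} C_{xy}\, b}
     {\sqrt{a^{\top} C_{xx}\, a}\;\sqrt{b^{\top} C_{yy}\, b}},
\qquad
C_{xy} \;=\; \frac{1}{N}\sum_{i=1}^{N} x_i\, y_i^{\top}
\]
```

Stacking the top d direction pairs as the columns of matrices A and B gives the projections A^T x and B^T y, which place names from both languages in the same d-dimensional space; the slides report a 50-dimensional space.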

Multi-Word Names
Score = Maximum Weighted Matching / (m – n + 1)
Example similarity values shown: 0.97, 0.91
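The scoring rule above can be sketched as follows. The slide does not define m and n, so this sketch assumes they are the word counts of the longer and shorter name. The matching is found by brute force over permutations, which is adequate for names of a few words but is not necessarily how the authors computed it. The off-diagonal similarity values are hypothetical.

```python
from itertools import permutations


def max_weight_matching(sim):
    """Weight of the maximum weighted bipartite matching.

    sim[i][j] is the similarity between word i of one name and
    word j of the other. Brute force; fine for short names.
    """
    rows, cols = len(sim), len(sim[0])
    if rows > cols:
        # Transpose so that every row can be matched to a distinct column.
        sim = [list(col) for col in zip(*sim)]
        rows, cols = cols, rows
    best = float("-inf")
    for perm in permutations(range(cols), rows):
        best = max(best, sum(sim[i][j] for i, j in enumerate(perm)))
    return best


def multiword_score(sim):
    """Score = maximum weighted matching / (m - n + 1).

    Assumption: m and n are the larger and smaller word counts.
    """
    m = max(len(sim), len(sim[0]))
    n = min(len(sim), len(sim[0]))
    return max_weight_matching(sim) / (m - n + 1)


# Two-word query against a two-word name (0.97 and 0.91 as on the slide).
full = multiword_score([[0.97, 0.10], [0.05, 0.91]])
# One-word query against a two-word name: a partial match.
partial = multiword_score([[0.97, 0.10]])
```

Under this reading, the denominator penalizes a difference in word count, so a one-word query that matches one word of a two-word name scores lower than a full match.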

Experimental Setup
– Name directory: English Wikipedia titles
  – 6 million titles, 2 million unique words
– Query languages: Russian, Hebrew, Kannada, Tamil, Hindi, Bengali
  – 1000 multi-word names in each language
– Baseline: state-of-the-art machine transliteration (NEWS 2009)

Experimental Results
MRR scale: 0 (very bad) to 1 (perfect)
Algorithm      Russian  Kannada  Tamil  Hindi
TRANS-SEARCH   0.47     0.52     0.29   0.49
GEOM-SEARCH    0.56     0.69     0.49   0.69
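For readers unfamiliar with the metric, mean reciprocal rank (MRR) can be computed as in the sketch below. The function name and the toy queries are hypothetical and are not the evaluation data from the talk.

```python
def mean_reciprocal_rank(ranked_results, gold):
    """Average of 1/rank of the correct answer over all queries.

    ranked_results[i] is the ranked list returned for query i and
    gold[i] is its correct answer. A query whose answer is missing
    from the list contributes 0.
    """
    total = 0.0
    for results, answer in zip(ranked_results, gold):
        if answer in results:
            total += 1.0 / (results.index(answer) + 1)
    return total / len(gold)


# Hypothetical toy data: correct answer at rank 1, at rank 2, and missing.
ranked = [
    ["Rashid", "Rashida"],
    ["Anthony", "Antony"],
    ["Ravi", "Rao"],
]
gold = ["Rashid", "Antony", "Raghu"]
mrr = mean_reciprocal_rank(ranked, gold)  # (1 + 1/2 + 0) / 3
```

An MRR of 1 means the correct name is always ranked first, and 0 means it is never retrieved, which matches the scale given on the slide.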

Conclusions
Pros
– Data driven: easy to include new languages.
– Not training-data hungry: a few thousand parallel names suffice.
– Bridge languages are useful: the feature space for (P, Q) can be learnt using only data in (P, R) and (Q, R) (Khapra and Udupa, AAAI 2010).
– Fast search: ~1 s for a directory of 6 M names.
Applications
– Cross-Language Wikipedia Search
– Spelling Correction of Personal Names

Thank you!
