Cross-Language Name Search
Improving the Multilingual User Experience of Wikipedia Using Cross-Language Name Search
Raghavendra Udupa, Microsoft Research India
Mitesh Khapra, IIT Bombay
NAACL-HLT 2010, June 3, 2010
Name Search: searching people directories by name. Examples: Facebook Friend Search, Outlook Address Book Search.
Cross-Language Name Search: searching people directories by name across languages. Examples: query in Russian, query in Hebrew.
Challenges
– Script and phonetic differences
– Large directories: millions of names
– Multi-word names and partial matches
– Spelling variations
Naive Approach
Transliterate and Search: רשיד → Rashid
Limitations
– Slow, as it involves the intermediate step of transliteration generation.
– Machine transliteration is not perfect: transliteration errors affect search results.
Is transliteration generation necessary?
Our Approach Names (רשיד, אנטוני, Rashid) → Language-Independent Geometric Representation → Similarity
Search Overview The query (רשיד) is compared with the directory names by geometric distance: geometric nearest-neighbor search.
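The search step can be illustrated with a minimal sketch. It assumes the query and every directory name have already been mapped to vectors in the common space; the three-dimensional vectors, the names in `directory`, and the helper `nearest_names` are invented for illustration. It uses an exact linear scan, whereas the talk uses approximate nearest-neighbor search for large directories.

```python
import heapq
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_names(query_vec, directory, k=3):
    """Return the k names whose vectors lie closest to query_vec.

    `directory` maps each name to its vector in the common space.
    This is an exact linear scan; an ANN index would replace it at scale.
    """
    return heapq.nsmallest(
        k, directory, key=lambda name: euclidean(query_vec, directory[name])
    )


# Toy vectors, invented for illustration only.
directory = {
    "Rashid": [0.9, 0.1, 0.0],
    "Anthony": [0.1, 0.8, 0.2],
    "Rachel": [0.6, 0.2, 0.5],
}
query = [0.85, 0.15, 0.05]  # hypothetical projection of the Hebrew query

print(nearest_names(query, directory, k=2))
```

No transliteration is generated at query time: the query is projected once and compared to precomputed name vectors.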
What is the advantage?
– Can scale to reasonably large name directories
  – Compact geometric representation: 6 M names in a 50-dimensional space
– Search is effective and efficient
  – Geometric nearest-neighbor search using Approximate Nearest Neighbor (ANN) [Arya et al., 1998]
  – ~1 s per query for searching 6 M names
  – >20% improvement in MRR over Transliterate-and-Search
What is the challenge?
– Language/script-independent representation: learning a common geometric feature space from training data
– Multi-word names and partial matches: maximum weighted bipartite matching
Previous Work
Language-Independent Representation
– D. Hardoon et al. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation.
Transliteration Equivalence
– A. Klementiev and D. Roth (2006). Named entity transliteration and discovery from multilingual comparable corpora. HLT-NAACL.
– J. Pasternack and D. Roth (2009). Learning better transliterations. CIKM.
– R. Udupa and M. Khapra (2010). Transliteration equivalence using canonical correlation analysis. ECIR.
Common Feature Space Training data: parallel names. Parallel names map to similar vectors in the common feature space.
Feature Vectors
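The slide does not spell out the features, so the following is only a sketch under an assumption: each name is represented by counts of its character bigrams, with `^` and `$` as boundary markers. The helpers `char_ngrams` and `feature_vector` and the tiny vocabulary are hypothetical.

```python
from collections import Counter


def char_ngrams(name, n=2):
    """Character n-grams of a lowercased name, padded with boundary markers."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def feature_vector(name, vocabulary, n=2):
    """Count vector of the name's n-grams over a fixed, ordered vocabulary."""
    counts = Counter(char_ngrams(name, n))
    return [counts[gram] for gram in vocabulary]


# A tiny illustrative vocabulary; a real one would be built from training names.
vocabulary = ["ra", "ch", "sh", "id"]

print(char_ngrams("Rashid"))
print(feature_vector("Rashid", vocabulary))
```

Each language gets its own vocabulary and hence its own vector space; the common feature space is what makes vectors from different scripts comparable.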
Learning Common Feature Space Canonical Correlation Analysis (Hotelling, 1936)
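For reference, a standard textbook statement of CCA (not taken from the slides): let x and y be the centered feature vectors of a parallel name pair in the two languages, with covariance matrices C_xx and C_yy and cross-covariance C_xy. CCA seeks directions a and b that maximize the correlation of the projected features. The symbols are notational assumptions.

```latex
\rho = \max_{a,\,b}\;
  \frac{a^{\top} C_{xy}\, b}
       {\sqrt{a^{\top} C_{xx}\, a}\;\sqrt{b^{\top} C_{yy}\, b}},
\qquad
C_{xy}\, C_{yy}^{-1}\, C_{yx}\, a = \lambda^{2}\, C_{xx}\, a,
\qquad
b = \tfrac{1}{\lambda}\, C_{yy}^{-1}\, C_{yx}\, a .
```

Stacking the top d eigen-directions as columns of matrices A and B gives the projections A^T x and B^T y into a shared d-dimensional space, in which parallel names land close together.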
Multi-Word Names Score = Maximum Weighted Matching / (m – n + 1)
Experimental Setup
– Name directory: English Wikipedia titles (6 million titles, 2 million unique words)
– Query languages: Russian, Hebrew, Kannada, Tamil, Hindi, Bengali (1000 multi-word names in each language)
– Baseline: state-of-the-art machine transliteration (NEWS 2009)
Experimental Results MRR scale: 0 (very bad) to 1 (perfect), with the competitor and GEOM-SEARCH marked. [Table: MRR by algorithm (TRANS-SEARCH, GEOM-SEARCH) for Russian, Kannada, Tamil, Hindi]
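Mean reciprocal rank (MRR), the metric used here, averages 1/rank of the correct answer over all queries, counting 0 when the correct answer is not returned. A minimal sketch follows; the function name and the toy result lists are invented for illustration and are not the experimental data.

```python
def mean_reciprocal_rank(ranked_results, correct):
    """Average of 1/rank of the correct answer; 0 if it is not returned."""
    total = 0.0
    for results, answer in zip(ranked_results, correct):
        if answer in results:
            total += 1.0 / (results.index(answer) + 1)
    return total / len(correct)


# Toy example: correct answer at rank 1, at rank 2, and missing.
ranked_results = [
    ["Rashid", "Rachel"],
    ["Anthony", "Antony"],
    ["Rachel"],
]
correct = ["Rashid", "Antony", "Rashid"]

print(mean_reciprocal_rank(ranked_results, correct))
```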
Conclusions
Pros
– Data driven: easy to include new languages.
– Not training-data hungry: a few thousand parallel names suffice.
– Bridge languages are useful: the feature space for (P, Q) can be learnt using only data in (P, R) and (Q, R) (Khapra and Udupa, AAAI 2010).
– Fast search: ~1 s for a 6 M name directory.
– Applications: cross-language Wikipedia search, spelling correction of personal names.
Thank you!