Cross-Language Name Search. Raghavendra Udupa (Microsoft Research India), Mitesh Khapra (IIT Bombay). NAACL-HLT 2010, June 3, 2010. Improving the Multilingual User Experience of Wikipedia Using Cross-Language Name Search.

Presentation transcript:

Cross-Language Name Search
Raghavendra Udupa (Microsoft Research India), Mitesh Khapra (IIT Bombay)
NAACL-HLT 2010, June 3, 2010
Improving the Multilingual User Experience of Wikipedia Using Cross-Language Name Search

Name Search
Searching people directories by name.
– Facebook Friend Search
– Outlook Address Book Search

Cross-Language Name Search
Searching people directories by name across languages.
– Query in Russian
– Query in Hebrew

Challenges
– Script and phonetic differences
– Large directories: millions of names
– Multi-word names and partial matches
– Spelling variations

Naive Approach
Transliterate and Search
– רשיד → Rashid
Limitations
– Slow, as it involves the intermediate step of transliteration generation.
– Machine transliteration is not perfect: transliteration errors affect search results.
Is transliteration generation necessary?

Our Approach
[Figure: names in different scripts (רשיד, אנטוני, Rashid) are mapped to a language-independent geometric representation, in which their similarity is measured.]

Search Overview
[Figure: the query (רשיד) and the directory names are compared by geometric distance; matches are found by geometric nearest-neighbor search.]

What is the advantage?
– Can scale to reasonably large name directories
  – Compact geometric representation: 50-dimensional space, 6 M names
– Search is effective and efficient
  – Geometric nearest-neighbor search using Approximate Nearest Neighbor (ANN) [Arya et al., 1998]
  – ~1 s per query for searching 6 M names
  – >20% improvement in MRR over Transliterate-and-Search
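The idea of searching by geometric distance can be sketched in a few lines. This is a minimal illustration only: it uses exact brute-force search rather than the ANN method cited on the slide, and the three-dimensional vectors, the names in `directory`, and the helper `nearest_names` are all made up for the example (the slide mentions a 50-dimensional space).

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_names(query_vec, directory, k=3):
    """Return the k directory names whose vectors are closest to query_vec.

    directory is a list of (name, vector) pairs. This is an exact linear
    scan; an approximate nearest-neighbor index would replace it at scale.
    """
    ranked = sorted(directory, key=lambda item: euclidean(query_vec, item[1]))
    return [name for name, _ in ranked[:k]]

# Hypothetical vectors standing in for names projected into the common space.
directory = [
    ("Rashid", [0.9, 0.1, 0.0]),
    ("Antony", [0.1, 0.8, 0.2]),
    ("Rachel", [0.7, 0.2, 0.3]),
]
query = [0.88, 0.12, 0.05]  # hypothetical projection of the query רשיד
top = nearest_names(query, directory, k=2)
```

A linear scan costs time proportional to the directory size per query, which is why a directory of millions of names calls for an approximate index instead.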

What is the challenge?
– Language/script-independent representation
  – Learning a common geometric feature space from training data
– Multi-word names and partial matches
  – Maximum weighted bipartite matching

Previous Work
Language-Independent Representation
– (2004) Canonical correlation analysis: An overview with application to learning methods. D. Hardoon et al., Neural Computation.
Transliteration Equivalence
– (2006) Named entity transliteration and discovery from multilingual comparable corpora. A. Klementiev and D. Roth, HLT-NAACL.
– (2009) Learning better transliterations. J. Pasternack and D. Roth, CIKM.
– (2010) Transliteration equivalence using canonical correlation analysis. R. Udupa and M. Khapra, ECIR.

Common Feature Space
Training Data: Parallel Names → Similar Vectors
[Figure: Common Feature Space]

Feature Vectors
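The transcript does not preserve which features this slide showed. As a hedged illustration only, one common way to turn a name into a feature vector is to count its character n-grams; the function `char_ngrams`, the boundary markers `^` and `$`, and the choice of bigrams below are assumptions made for the example, not details taken from the slides.

```python
from collections import Counter

def char_ngrams(name, n=2):
    """Count character n-grams of a name, with ^ and $ marking its boundaries.

    The result is a sparse feature vector: n-gram -> number of occurrences.
    """
    padded = "^" + name.lower() + "$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

features = char_ngrams("Rashid")
```

The same function works unchanged on names in any script, since it only relies on the sequence of characters.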

Learning Common Feature Space
Canonical Correlation Analysis

Learning Common Feature Space
Canonical Correlation Analysis (Hotelling, 1936)
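The slide's equations did not survive in the transcript. For orientation, the standard textbook objective of canonical correlation analysis is given below; the notation is an assumption introduced here: $x$ and $y$ are the feature vectors of a name in the two languages, $C_{xx}$ and $C_{yy}$ their covariance matrices, $C_{xy}$ the cross-covariance, and $a$, $b$ the projection directions being learnt.

```latex
(a^{*}, b^{*}) \;=\; \arg\max_{a,\,b}\;
\frac{a^{\top} C_{xy}\, b}
     {\sqrt{a^{\top} C_{xx}\, a}\;\sqrt{b^{\top} C_{yy}\, b}}
```

In words, CCA looks for a pair of directions along which the projections of parallel names are maximally correlated; stacking several such pairs gives the two projection matrices that map both languages into one common space.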

Multi-Word Names
Score = Maximum Weighted Matching / (m – n + 1)
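The scoring formula above can be sketched as follows. Several things here are assumptions, since the transcript does not define them: m and n are taken to be the word counts of the longer and shorter name, `word_sim` is a hypothetical symmetric word-level similarity (in the talk's setting it would come from the geometric representation), and the matching is found by brute force over permutations, which is only practical because names have few words.

```python
from itertools import permutations

def name_score(query_words, name_words, word_sim):
    """Maximum weighted bipartite matching weight divided by (m - n + 1).

    m and n are the word counts of the longer and shorter name (assumed).
    word_sim(a, b) must be symmetric and return a non-negative weight.
    """
    short, long_ = sorted((query_words, name_words), key=len)
    n, m = len(short), len(long_)
    if n == 0:
        return 0.0
    best = 0.0
    # Try every way of assigning each word of the shorter name to a
    # distinct word of the longer name and keep the heaviest assignment.
    for assigned in permutations(range(m), n):
        weight = sum(word_sim(short[i], long_[j]) for i, j in enumerate(assigned))
        best = max(best, weight)
    return best / (m - n + 1)

# Hypothetical word similarities, for illustration only.
toy_sim = {("rashid", "Rashid"): 0.9, ("khan", "Khan"): 0.8}

def word_sim(a, b):
    return toy_sim.get((a, b), toy_sim.get((b, a), 0.0))

full = name_score(["rashid", "khan"], ["Rashid", "Khan"], word_sim)
partial = name_score(["rashid"], ["Rashid", "Ali", "Khan"], word_sim)
```

With these toy weights the full two-word match scores higher than the one-word partial match, because the denominator penalizes the difference in word counts.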

Experimental Setup
– Name directory: English Wikipedia titles
  – 6 million titles, 2 million unique words
– Query languages: Russian, Hebrew, Kannada, Tamil, Hindi, Bengali
  – 1000 multi-word names in each language
– Baseline: state-of-the-art machine transliteration (NEWS 2009)

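Experimental Results
[Figure: MRR scale from 0 (very bad) to 1 (perfect), with the positions of the competitor and GEOM-SEARCH marked.]
[Table: MRR by algorithm (TRANS-SEARCH, GEOM-SEARCH) and query language (Russian, Kannada, Tamil, Hindi).]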
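For readers unfamiliar with the metric, mean reciprocal rank (MRR) averages 1/rank of the correct answer over all queries, counting 0 when the answer is not retrieved. The sketch below uses made-up result lists; the function name and the data are illustrative assumptions, not the experiment's actual output.

```python
def mean_reciprocal_rank(ranked_results, correct_answers):
    """Average of 1/rank of the correct answer over all queries.

    ranked_results[i] is the ranked list returned for query i, and
    correct_answers[i] is its correct name; a miss contributes 0.
    """
    total = 0.0
    for results, answer in zip(ranked_results, correct_answers):
        if answer in results:
            total += 1.0 / (results.index(answer) + 1)
    return total / len(correct_answers)

# Three hypothetical queries: correct at rank 1, at rank 2, and not found.
results = [["Rashid", "Rachel"], ["Anton", "Antony"], ["Ravi"]]
answers = ["Rashid", "Antony", "Rashid"]
mrr = mean_reciprocal_rank(results, answers)  # (1 + 1/2 + 0) / 3
```

An MRR of 1 means the correct name is always ranked first, and an MRR of 0 means it is never retrieved.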

Conclusions
Pros
– Data driven: easy to include new languages.
– Not training-data hungry: a few thousand parallel names suffice.
– Bridge languages are useful: the feature space for (P,Q) can be learnt using only data in (P,R) and (Q,R) (Khapra and Udupa, AAAI 2010).
– Fast search: ~1 s for a directory of 6 M names.
– Applications: cross-language Wikipedia search, spelling correction of personal names.

Raghavendra Udupa (Microsoft Research India), Mitesh Khapra (IIT Bombay)
NAACL-HLT 2010, June 3, 2010
Improving the Multilingual User Experience of Wikipedia Using Cross-Language Name Search
Thank you!