Automatic Translation of Named Entities in Multiple Languages Using Web Search Engines Presented by Richard C. Wang Supervised by Teruko Mitamura December.

Presentation transcript:

Automatic Translation of Named Entities in Multiple Languages Using Web Search Engines
Presented by Richard C. Wang
Supervised by Teruko Mitamura
December 15, 2005

Presentation Outline
  Introduction
  Related Work
  System Implementation
  Experimental Results
  Conclusion
  Future Work

Introduction
  Machine Translation
    - Reduces human labor for translating text
  One challenge: translating newly emerged proper nouns (named entities)
    - Movie, book, magazine, protein, cell, disease, person, location, company, organization names, etc.
    - No central database stores NEs (and their translations)
  World Wide Web: an enormous unstructured corpus
    - Contains named entities in various languages
  Automatic translation of NEs in multiple languages
    - Near-language-independent approach
    - Utilizes popular search engines: Google, Yahoo!, AlltheWeb

Related Work
  Wu, Lin, and Chang (2005): English-to-Chinese NE MT
    - Surface pattern knowledge learner
    - Trained transliteration model
  Shima (2005): English-to-Japanese NE MT
    - Hand-coded transliteration model
    - Heuristic computations using N-grams
  Huang, Zhang, and Vogel (2005): Chinese-to-English NE MT
    - Cross-lingual query expansion
    - Trained transliteration model
    - IBM translation model
    - Frequency-Distance model
  In contrast, our system does not require any
    - Training data or training process
    - Transliteration model

System Architecture Overview
  Source word s
    -> Search Engines (External) -> Search Results
    -> Segment Parser -> Segments
    -> Translation Candidate Extractor -> Translation Candidates
    -> Translation Candidate Filter -> Filtered Translation Candidates
    -> Candidate Score Calculator -> Scored Translation Candidates

Querying the World Wide Web
  We want to retrieve documents containing
    - Source word s and target word t
  Search for s using Google, Yahoo!, and AlltheWeb
    - Request web pages written in the same language as t
  The current system supports these target languages:
    - English, Simplified/Traditional Chinese, Japanese, and Korean
    - A target language can be added easily (see the Adding a New Target Language slide)
  The current system allows s to be in any language, with one restriction:
    - s and t have to be in languages that use different character sets, e.g. (English, Chinese), (Korean, Japanese), (Hebrew, English)
    - This can be overcome by using English as a pivot language (see the Future Work slides)

Preprocessing Returned Results
  [Figure labels: Snippet, Segments]
  Our system preprocesses results by:
    - Extracting each snippet and inserting it into a set N such that no two snippets in N have the same title
    - Extracting each segment from each snippet in N and inserting it into a set G such that no segment in G is a substring of another segment in G
  This prevents words from receiving biased weights
    - Weights depend on how frequently the words occur
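The two deduplication rules on this slide can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function names are hypothetical, search results are assumed to arrive as (title, snippet) pairs, and the way a snippet is split into segments is not stated on the slide, so segments are taken as given.

```python
from typing import Iterable, List, Tuple


def dedupe_snippets(results: Iterable[Tuple[str, str]]) -> List[str]:
    """Build N: keep the first snippet seen for each distinct title."""
    seen_titles = set()
    snippets = []
    for title, snippet in results:
        if title in seen_titles:
            continue  # a snippet with this title is already in N
        seen_titles.add(title)
        snippets.append(snippet)
    return snippets


def dedupe_segments(segments: Iterable[str]) -> List[str]:
    """Build G: drop any segment that is a substring of another segment."""
    # Longest first, so every possible container is examined before its parts.
    ordered = sorted(set(segments), key=lambda s: (-len(s), s))
    kept: List[str] = []
    for segment in ordered:
        if not any(segment in longer for longer in kept):
            kept.append(segment)
    return kept


if __name__ == "__main__":
    print(dedupe_snippets([("t1", "a"), ("t2", "b"), ("t1", "c")]))
    print(dedupe_segments(["Back to the Future (1985)", "Back to the Future", "Gladiator"]))
```

Processing segments longest-first means a short segment only has to be compared against segments already kept, since anything it could be contained in has been seen by then.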

Extracting Translation Candidates
  Translation Candidate
    - Any lonely cluster in the target language that resides in the same segment as the source word
  Our system uses regular expression patterns to extract lonely clusters
  The returned results often contain at least one correct translation that is a lonely cluster
  [Figure labels: Lonely Clusters, Clusters]

Filtering Translation Candidates
  Suppose candidate A is a substring of candidate B. If B occurs more than half as often as A does across all segments, then A is discarded.
  For example:
    Candidate                  | TF
    A = "Back to the"          | 40
    B = "Back to the Future"   | 25
  Since TF(B) = 25 > 0.5 x TF(A) = 20, A is discarded
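The filtering rule can be written down directly. The sketch below is only an illustration under stated assumptions: the function name is hypothetical, term frequencies are assumed to be precomputed and passed in as a dictionary, and every candidate is compared against the full unfiltered set (the slide does not say whether filtering is applied iteratively).

```python
from typing import Dict


def filter_candidates(tf: Dict[str, int]) -> Dict[str, int]:
    """Discard candidate A if some longer candidate B contains A and TF(B) > 0.5 * TF(A)."""
    kept = {}
    for a, tf_a in tf.items():
        absorbed = any(
            a != b and a in b and tf_b > 0.5 * tf_a
            for b, tf_b in tf.items()
        )
        if not absorbed:
            kept[a] = tf_a
    return kept


if __name__ == "__main__":
    # The slide's example: 25 > 0.5 * 40, so "Back to the" is dropped.
    print(filter_candidates({"Back to the": 40, "Back to the Future": 25}))
```

With the slide's numbers only "Back to the Future" survives. If TF(B) were exactly 20, the strict inequality would fail and both candidates would be kept.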

Ranking Translation Candidates
  Source word: "The Lord of the Rings"
  Target language: Japanese

  Feature | Definition
  TF_c    | number of occurrences of c in all segments
  DF_c    | number of segments that contain c
  CTF_c   | number of occurrences of lonely c in all segments
  CDF_c   | number of segments that contain lonely c
  NG_c    | number of grams that c consists of
  WD_c    | sum of inverse word distances between s and c in all segments
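As a small worked illustration of the first two features, the sketch below computes TF and DF for a candidate over a list of segments. The function names and the example segments are invented for illustration. The remaining features are left out because the slides do not pin down exactly how "lonely" occurrences or word distances are measured, and how the features are combined into a final score is not stated either.

```python
from typing import List


def term_frequency(candidate: str, segments: List[str]) -> int:
    """TF_c: total (non-overlapping) occurrences of the candidate in all segments."""
    return sum(segment.count(candidate) for segment in segments)


def document_frequency(candidate: str, segments: List[str]) -> int:
    """DF_c: number of segments containing the candidate at least once."""
    return sum(1 for segment in segments if candidate in segment)


if __name__ == "__main__":
    # Invented segments, for illustration only.
    segments = [
        "The Lord of the Rings ロード・オブ・ザ・リング",
        "ロード・オブ・ザ・リング 王の帰還",
        "ロード・オブ・ザ・リング ロード・オブ・ザ・リング",
        "指輪物語 The Lord of the Rings",
    ]
    candidate = "ロード・オブ・ザ・リング"
    print(term_frequency(candidate, segments), document_frequency(candidate, segments))
```

On these invented segments the candidate occurs four times in total but in only three segments, which shows why the two counts are kept as separate features.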

Adding a New Target Language
  Three basic elements:
    - Tokenization Pattern
        A regular expression pattern for tokenization
    - Search Engine Language Code
        Language codes for the target language for each of the search engines
    - Other General Properties
        Common minimum number of grams/alphabets for named entities in the target language
        Whether the language is spaced or non-spaced
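One way to picture the three elements is as a small configuration record per target language. The sketch below is hypothetical throughout: the class and field names, the language-code values, the minimum gram count, and the Japanese character ranges used in the pattern are assumptions chosen for illustration, not settings taken from the presented system.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TargetLanguage:
    name: str
    tokenization_pattern: str              # regular expression matching one token
    engine_language_codes: Dict[str, str]  # search engine name -> language code
    min_grams: int                         # common minimum length of a named entity
    is_spaced: bool                        # whether words are separated by spaces

    def tokenize(self, text: str) -> List[str]:
        """Return all substrings of text that match the tokenization pattern."""
        return re.findall(self.tokenization_pattern, text)


# Placeholder values, for illustration only.
JAPANESE = TargetLanguage(
    name="Japanese",
    # Hiragana and katakana (U+3040-U+30FF) plus CJK unified ideographs (U+4E00-U+9FFF).
    tokenization_pattern=r"[\u3040-\u30ff\u4e00-\u9fff]+",
    engine_language_codes={"google": "lang_ja", "yahoo": "ja", "alltheweb": "ja"},
    min_grams=2,
    is_spaced=False,
)

if __name__ == "__main__":
    print(JAPANESE.tokenize("The Lord of the Rings ロード・オブ・ザ・リング"))
```

Keeping everything language-specific in one record is what would make adding a language a matter of supplying data rather than changing code, which fits the slide's claim that a target language can be added easily.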

Experimental Data

  Test Word         | Original Gold Translation | Additional Gold Trans.
  纽约客            | new yorker | The New Yorker
  牛虻              | The Gadfly | gadfly
  汇丰银行          | HSBC | Hong Kong and Shang Hai Banking Corporation
  海豹              | seal | seals
  Mt. Pinatubo      | ピナツボ火山, ピナトゥボ火山 | ピナツボ山
  Roger Dingman     | ロージャー・ディングマン, ロージャーディングマン | ロジャーディングマン
  Jean-Henri Dunant | アンリ・デュナン, アンリデュナン | ジャン・アンリ・デュナン
  Charles Wang      | チャールズ・ウォン, チャールズウォン | チャールズ・ワン

  Dataset | # Test Words | Source-Target
  EJ      | 202          | English-Japanese
  CE      | 310          | Simplified Chinese-English

Evaluation Metric
  Translatable words
    - Words for which our system is able to produce at least one translation candidate

Usefulness of Features

Experimental Results (EJ)

Experimental Results (CE)

Performance Comparison (EJ)
  [Figure: two panels, Original Test Set and Extended Test Set; labeled series: Our System (All Features)]

Performance Comparison (CE)
  [Figure: two panels, Original Test Set and Extended Test Set]

Conclusion
  Even though our system is not state-of-the-art, it is capable of translating named entities in multiple languages with decent performance
  Most of the correct translations are ranked in the top 3
  Coverage of correct translations in the search results needs improvement

Future Work (1)
  Incorporate a language-specific transliteration model to improve performance on a particular language pair
  Try techniques similar to the cross-lingual query expansion proposed in Fei's paper (Huang, Zhang, and Vogel, 2005) to expand the coverage of correct translations in the returned search results

Future Work (2)
  Translate named entities from non-English to non-English by using English as a pivot
  Example:
    - 神鬼戰士 (Traditional Chinese)
    - 角斗士 (Simplified Chinese)
    - Gladiator (English): "I am a pivot!"
    - グラディエーター (Japanese)
  Note: This is a real working example
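The pivot idea amounts to chaining two translation steps. The sketch below is a hedged illustration only: `translate_via_pivot` and `toy_translate` are hypothetical names, and the lookup table stands in for the search-engine-based translator, with entries taken from the slide's Gladiator example.

```python
from typing import Callable, Optional

# A translator takes (term, target language code) and returns a translation or None.
Translator = Callable[[str, str], Optional[str]]


def translate_via_pivot(
    source: str,
    target_lang: str,
    translate: Translator,
    pivot_lang: str = "en",
) -> Optional[str]:
    """Translate source -> pivot language -> target language."""
    pivot = translate(source, pivot_lang)
    if pivot is None:
        return None  # no pivot translation found, so the chain stops here
    return translate(pivot, target_lang)


# Stand-in translator backed by a tiny table (hypothetical, for illustration).
TOY_TABLE = {
    ("神鬼戰士", "en"): "Gladiator",
    ("角斗士", "en"): "Gladiator",
    ("Gladiator", "ja"): "グラディエーター",
}


def toy_translate(term: str, target_lang: str) -> Optional[str]:
    return TOY_TABLE.get((term, target_lang))


if __name__ == "__main__":
    print(translate_via_pivot("神鬼戰士", "ja", toy_translate))
```

Because each step pairs a non-Latin script with English, a chain like this would sidestep the restriction that the source and target must use different character sets.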

Future Work (3)
  Search for closely related named entities
  Named Entity | Translations
  Alon Lavie Lori Levin Donna Gates Alex Waibel Carnegie Mellon University Language Technologies Institute Teruko Mitamura Eric Nyberg Carnegie Mellon University Lori Levin Language Technologies Institute Keith Miller Kathryn Baker Owen Rambow Robert Frederking Carnegie Mellon University Ralf Brown Eduard Hovy Alan Black Lori Levin Eric Nyberg Alon Lavie Nancy Ide Teruko Mitamura Language Technologies Institute Carnegie Mellon university Pennsylvania Pittsburgh