Published by Cory Harrington. Modified over 9 years ago.
1
Automatic Translation of Named Entities in Multiple Languages Using Web Search Engines
Presented by Richard C. Wang
Supervised by Teruko Mitamura
December 15, 2005
2
Presentation Outline
- Introduction
- Related Work
- System Implementation
- Experimental Results
- Conclusion
- Future Work
3
Introduction
- Machine translation
  - Reduces human labor for translating text
  - One challenge: translating newly emerged proper nouns (named entities, NE)
    - Movie, book, magazine, protein, cell, disease, person, location, company, and organization names, etc.
    - No central database stores NEs and their translations
- World Wide Web: an enormous unstructured corpus
  - Contains named entities in various languages
- Automatic translation of NEs in multiple languages
  - Near-language-independent approach
  - Utilizes popular search engines: Google, Yahoo!, AlltheWeb
4
Related Work
- Wu, Lin, and Chang (2005): English-to-Chinese NE MT
  - Surface pattern knowledge learner
  - Trained transliteration model
- Shima (2005): English-to-Japanese NE MT
  - Hand-coded transliteration model
  - Heuristic computations using N-grams
- Huang, Zhang, and Vogel (2005): Chinese-to-English NE MT
  - Cross-lingual query expansion
  - Trained transliteration model
  - IBM translation model
  - Frequency-Distance model
- In contrast, our system does not require any
  - Training data or training process
  - Transliteration model
5
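System Architecture Overview
1. Search Engines (external): source word s in, search results out
2. Segment Parser: search results in, segments out
3. Translation Candidate Extractor: segments in, translation candidates out
4. Translation Candidate Filter: translation candidates in, filtered translation candidates out
5. Candidate Score Calculator: filtered translation candidates in, scored translation candidates out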
6
Querying the World Wide Web
- We want to retrieve documents containing both the source word s and the target word t
  - Search for s using Google, Yahoo!, and AlltheWeb
  - Request web pages written in the same language as t
- Supported target languages: English, Simplified/Traditional Chinese, Japanese, and Korean
  - A target language can be added easily (see the Adding a New Target Language slide)
- s can be in any language, with one restriction: s and t must be in languages that use different character sets
  - e.g. (English, Chinese), (Korean, Japanese), (Hebrew, English)
  - This restriction can be overcome by using English as a pivot language (see the Future Work slides)
7
Preprocessing Returned Results
[Figure labels: Snippet, Segments]
Our system preprocesses the results by:
- Extracting each snippet and inserting it into N such that no two snippets in N have the same title
- Extracting each segment from each snippet in N and inserting it into G such that no segment in G is a substring of another segment in G
This prevents words from receiving biased weights, because the weights depend on how frequently the words occur.
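The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the presented system's code: the `Snippet` type and both function names are hypothetical, keeping the first snippet per title is an assumption, and so is keeping the longer segment when one segment contains another.

```python
from typing import NamedTuple


class Snippet(NamedTuple):
    """Hypothetical container for one search result."""
    title: str
    segments: list[str]  # text fragments that make up the snippet


def dedupe_snippets(snippets: list[Snippet]) -> list[Snippet]:
    """Build N: keep only the first snippet seen for each title (assumed policy)."""
    seen_titles: set[str] = set()
    kept: list[Snippet] = []
    for snippet in snippets:
        if snippet.title not in seen_titles:
            seen_titles.add(snippet.title)
            kept.append(snippet)
    return kept


def collect_segments(snippets: list[Snippet]) -> list[str]:
    """Build G: no segment in G is a substring of another segment in G."""
    unique = {seg for snippet in snippets for seg in snippet.segments}
    # Longest first, so any containing segment is already in G
    # when a shorter segment is tested against it.
    ordered = sorted(unique, key=lambda seg: (-len(seg), seg))
    kept: list[str] = []
    for seg in ordered:
        if not any(seg in longer for longer in kept):
            kept.append(seg)
    return kept
```

Removing duplicates before counting matters because the ranking features are frequency-based: a page mirrored many times would otherwise inflate the counts of every word it contains.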
8
Extracting Translation Candidates
- Translation candidate: any lonely cluster in the target language that resides in the same segment as the source word
- Our system uses regular expression patterns to extract lonely clusters
- Oftentimes at least one correct translation in the returned results is lonely
[Figure labels: Lonely Clusters, Clusters]
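A rough sketch of regex-based cluster extraction is shown below. The slide does not spell out what makes a cluster "lonely", so this sketch stops at plain clusters, taken here to mean maximal runs of target-script characters; that reading, the Japanese character ranges, the function name, and the sample segments are all assumptions made for illustration.

```python
import re

# Assumed pattern for Japanese: hiragana, katakana, and common kanji.
JAPANESE_CLUSTER = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")


def extract_clusters(source_word: str, segments: list[str],
                     pattern: re.Pattern = JAPANESE_CLUSTER) -> list[str]:
    """Collect target-script clusters from segments that contain the source word."""
    candidates: list[str] = []
    for segment in segments:
        # Only segments that also contain the source word are considered.
        if source_word.lower() not in segment.lower():
            continue
        candidates.extend(pattern.findall(segment))
    return candidates


# Made-up segments for illustration only.
segments = [
    "Mt. Pinatubo (ピナツボ火山) erupted",
    "ピナツボ火山 only",                # no source word, so it is skipped
    "Mt. Pinatubo ピナトゥボ火山, 1991",
]
print(extract_clusters("Mt. Pinatubo", segments))
```

A stricter "lonely" test would be applied to each match before it is accepted as a candidate.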
9
Filtering Translation Candidates
Suppose candidate A is a substring of candidate B. If B occurs more than half as many times as A occurs in all segments, then A is discarded.
For example:
Candidate | Text | TF
A | “Back to the” | 40
B | “Back to the Future” | 25
Since TF(B) > 0.5 x TF(A), that is 25 > 20, A is discarded.
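The filtering rule is small enough to sketch directly. The function name and the dictionary representation (candidate text mapped to its TF) are assumptions, as is checking each candidate against the full unfiltered set rather than filtering iteratively.

```python
def filter_candidates(tf: dict[str, int]) -> dict[str, int]:
    """Drop candidate A when some longer candidate B contains A
    and TF(B) > 0.5 * TF(A)."""
    kept: dict[str, int] = {}
    for a, tf_a in tf.items():
        discard = any(
            a != b and a in b and tf_b > 0.5 * tf_a
            for b, tf_b in tf.items()
        )
        if not discard:
            kept[a] = tf_a
    return kept


# The slide's example: 25 > 0.5 * 40, so "Back to the" is discarded.
print(filter_candidates({"Back to the": 40, "Back to the Future": 25}))
```

The rule removes truncated fragments of a longer name while still keeping a short candidate that occurs far more often on its own than inside any longer one.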
10
Ranking Translation Candidates
Example: source word “The Lord of the Rings”, target language Japanese
Feature | Definition
TF(c) | number of occurrences of c in all segments
DF(c) | number of segments that contain c
CTF(c) | number of occurrences of lonely c in all segments
CDF(c) | number of segments that contain lonely c
NG(c) | number of grams that c consists of
WD(c) | sum of inverse word distances between s and c in all segments
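Four of the six features can be sketched under stated assumptions. Segments are taken to be pre-tokenized lists, s and c are token sequences, word distance is the difference between the start positions of s and c, and every (s, c) occurrence pair in a segment contributes to WD. None of these details are given on the slide. CTF and CDF are left out because they need a definition of "lonely", and the slide does not say how the features are combined into a score.

```python
from dataclasses import dataclass


@dataclass
class Features:
    tf: int    # occurrences of c in all segments
    df: int    # segments that contain c
    ng: int    # number of grams in c
    wd: float  # sum of inverse word distances between s and c


def positions(tokens: list[str], phrase: list[str]) -> list[int]:
    """Start indices at which `phrase` occurs in `tokens`."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def compute_features(s: list[str], c: list[str],
                     segments: list[list[str]]) -> Features:
    tf = df = 0
    wd = 0.0
    for tokens in segments:
        c_pos = positions(tokens, c)
        s_pos = positions(tokens, s)
        tf += len(c_pos)
        df += 1 if c_pos else 0
        # Assumed distance: gap between start positions of s and c.
        for i in s_pos:
            for j in c_pos:
                if i != j:
                    wd += 1.0 / abs(i - j)
    return Features(tf=tf, df=df, ng=len(c), wd=wd)
```

For a non-spaced target language such as Japanese, the token lists would hold characters or character n-grams rather than whitespace-separated words.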
11
Adding a New Target Language
Three basic elements:
- Tokenization pattern
  - A regular expression pattern for tokenization
- Search engine language code
  - The language code for the target language in each of the search engines
- Other general properties
  - The common minimum number of grams/letters for named entities in the target language
  - Whether the language is spaced or non-spaced
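The three elements map naturally onto a small configuration record. The sketch below is hypothetical: the class, field names, and the Japanese values are illustrative placeholders, and the per-engine codes in particular are not the actual values any search engine expects ("ja" is simply the ISO 639-1 code for Japanese).

```python
import re
from dataclasses import dataclass


@dataclass
class TargetLanguage:
    name: str
    tokenization_pattern: str            # regular expression used for tokenization
    search_engine_codes: dict[str, str]  # search engine name -> language code
    min_grams: int                       # common minimum length of a named entity
    spaced: bool                         # True if words are separated by spaces

    def tokenize(self, text: str) -> list[str]:
        return re.findall(self.tokenization_pattern, text)


# Placeholder values for illustration; real codes differ per engine.
JAPANESE = TargetLanguage(
    name="Japanese",
    tokenization_pattern=r"[\u3040-\u30ff\u4e00-\u9fff]+",
    search_engine_codes={"Google": "ja", "Yahoo!": "ja", "AlltheWeb": "ja"},
    min_grams=2,
    spaced=False,
)

print(JAPANESE.tokenize("ピナツボ火山 erupted in 1991"))
```

Keeping everything language-specific in one record is what lets the rest of the pipeline stay near-language-independent.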
12
Experimental Data
Gold translations:
Test Word | Original Gold Translation | Additional Gold Translation
纽约客 | new yorker | The New Yorker
牛虻 | The Gadfly | gadfly
汇丰银行 | HSBC | Hong Kong and Shang Hai Banking Corporation
海豹 | seal | seals
Mt. Pinatubo | ピナツボ火山, ピナトゥボ火山 | ピナツボ山
Roger Dingman | ロージャー・ディングマン, ロージャーディングマン | ロジャーディングマン
Jean-Henri Dunant | アンリ・デュナン, アンリデュナン | ジャン・アンリ・デュナン
Charles Wang | チャールズ・ウォン, チャールズウォン | チャールズ・ワン
Datasets:
Dataset | # Test Words | Source-Target
EJ | 202 | English-Japanese
CE | 310 | Simplified Chinese-English
13
Evaluation Metric
- Translatable words: words for which our system is able to produce at least one translation candidate
14
Usefulness of Features
15
Experimental Results (EJ)
16
Experimental Results (CE)
17
Performance Comparison (EJ)
Our System | All Features | 155 | 100 | 0.645 | 0.495 | 0.560
Our System | All Features | 155 | 93 | 0.600 | 0.460 | 0.521
Labels on the slide: Original Test Set, Extended Test Set
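The column headers of the EJ table did not survive in this transcript. The figures are, however, consistent with reading each row as translatable words, correct translations, precision, recall, and F-measure, using the 202 EJ test words listed on the Experimental Data slide. That reading is an inference from the numbers, not something the slide states. A quick check under that assumption:

```python
def precision_recall_f(translatable: int, correct: int,
                       total: int) -> tuple[float, float, float]:
    """Assumed definitions: precision = correct / translatable,
    recall = correct / total, F = harmonic mean of the two."""
    precision = correct / translatable
    recall = correct / total
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


for correct in (100, 93):
    p, r, f = precision_recall_f(translatable=155, correct=correct, total=202)
    print(f"{correct}: {p:.3f} {r:.3f} {f:.3f}")
```

Under this reading, precision measures quality on the words the system could translate at all, while recall also penalizes the 47 words for which no candidate was produced.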
18
Performance Comparison (CE)
79 | 71.5 | 78.5 | 83.9 | 84.7 | 86.5
79 | 64.6 | 72.3 | 77.0 | 78.5 | 80.2
Labels on the slide: Original Test Set, Extended Test Set
19
Conclusion
- Even though our system is not state-of-the-art, it can translate named entities in multiple languages with decent performance
- Most of the correct translations are ranked in the top 3
- Coverage of correct translations in the search results needs improvement
20
Future Work (1)
- Incorporate a language-specific transliteration model to improve performance on a particular language pair
- Try techniques similar to the cross-lingual query expansion proposed in Fei Huang's paper (Huang, Zhang, and Vogel, 2005) to expand the coverage of correct translations in the returned search results
21
Future Work (2)
- Translate named entities from non-English to non-English by using English as a pivot
- Example: 神鬼戰士 (Traditional Chinese), 角斗士 (Simplified Chinese), and グラディエーター (Japanese) are all linked through Gladiator (English), labeled "I am a pivot!" on the slide
- Note: this is a real working example
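Pivot translation amounts to composing two translation steps. In the sketch below the two lookup tables are stand-ins built from the slide's Gladiator example; in the proposed setup each step would be a call to the web-based translator, and the function name is hypothetical.

```python
from typing import Callable, Optional

Translator = Callable[[str], Optional[str]]


def pivot_translate(word: str, to_pivot: Translator,
                    from_pivot: Translator) -> Optional[str]:
    """Translate source -> pivot (English) -> target; None if either step fails."""
    pivot = to_pivot(word)
    if pivot is None:
        return None
    return from_pivot(pivot)


# Stand-in translators using the slide's example.
zh_to_en: Translator = {"神鬼戰士": "Gladiator", "角斗士": "Gladiator"}.get
en_to_ja: Translator = {"Gladiator": "グラディエーター"}.get

print(pivot_translate("神鬼戰士", zh_to_en, en_to_ja))
```

Because each step pairs English with a language in a different character set, the pivot also sidesteps the restriction that s and t cannot share a character set, for example between Traditional and Simplified Chinese.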
22
Future Work (3)
- Search for closely related named entities
Table columns: Named Entity, Translations
Table entries, in slide order: Alon Lavie; Lori Levin; Donna Gates; Alex Waibel; Carnegie Mellon University; Language Technologies Institute; Teruko Mitamura; Eric Nyberg; Carnegie Mellon University; Lori Levin; Language Technologies Institute; Keith Miller; Kathryn Baker; Owen Rambow; Robert Frederking; Carnegie Mellon University; Ralf Brown; Eduard Hovy; Alan Black; Lori Levin; Eric Nyberg; Alon Lavie; Nancy Ide; Teruko Mitamura; Language Technologies Institute; Carnegie Mellon university; Pennsylvania; Pittsburgh