Automatic Transliteration for Japanese-to-English Text Retrieval


Automatic Transliteration for Japanese-to-English Text Retrieval Yan Qu, Gregory Grefenstette, David A. Evans Clairvoyance Corporation SIGIR 2003 Toronto, Canada

Cross-Language Information Retrieval (CLIR)
- User query in one language; documents to be retrieved in another
- Common approach: translate the query into the target language using a bilingual translation dictionary, then perform monolingual search
- Problem: unknown words, i.e., words missing from the translation dictionary (proper names, technical terms)
- Solutions: use cognates, or pass the word through unchanged; for languages with different writing systems, transliterate

English-to-Japanese Transliteration
- Katakana: the Japanese syllabic alphabet, e.g., コ, ン, ピ, ュ, ー, タ
- Used for foreign proper names and borrowed technical terms
- "computer" → "konpyuuta" → コンピュータ (コ for "ko", ン for "n", ピ for "pi", ュー for "yuu", タ for "ta")

Research Questions
- How can an English-to-Japanese translation dictionary be automatically augmented with new transliterations? (Explore a generate-and-attest method.)
- Does CLIR performance improve?

Generate and Attest Method
- Generate possible transliterations:
  - English word → English sound sequence
  - English sounds → Japanese sounds
  - Japanese sounds → katakana
- Attest validity with corpus statistics: use a monolingual Japanese corpus to attest the katakana candidates

Generate and Attest Method (Cont.)
Pipeline diagram: an English word (e.g., YEMEN) is looked up in an English pronunciation dictionary to get its English sound sequence (Y-EH-M-AH-N); an English-Japanese phoneme mapping (Y:y, Y:i, Y:yu, …; AH:a, AH:o, AH:e, …) expands it into candidate Japanese sound sequences (y-e-m-a-n, i-e-m-a-n, yu-e-m-a-n, …, yu-a-n-u-n); a kana-to-JP-phoneme mapping turns these into katakana sequences (イエマン, ユエマン, …, ユアヌン), which are validated against a web corpus (イエマン: 306; ユエマン: 0; …; ユアヌン: 0); unattested sequences are discarded.
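The whole generate-and-attest pipeline can be sketched in a few lines. Everything below — the pronunciation entry, phoneme mappings, kana table, and corpus counts — is toy illustrative data standing in for CMUdict, the Knight & Graehl table, and the web corpus, not the paper's actual resources.

```python
from itertools import product

# Toy stand-ins for the real resources used in the paper (illustrative only).
PRON_DICT = {"YEMEN": ["Y", "EH", "M", "AH", "N"]}                  # CMUdict-style
EN2JP = {"Y": ["i", "yu"], "EH": ["e"], "M": ["m"],                 # phoneme mapping
         "AH": ["e", "a"], "N": ["n"]}
KANA = {"i": "イ", "yu": "ユ", "e": "エ", "a": "ア",
        "me": "メ", "ma": "マ", "n": "ン"}
CORPUS_COUNTS = {"イエメン": 306}            # frequencies from a monolingual corpus

def sounds_to_kana(sounds):
    """Greedy longest-match conversion of a Japanese sound sequence to kana."""
    kana, i = "", 0
    while i < len(sounds):
        pair = "".join(sounds[i:i + 2])
        if pair in KANA:                     # consonant+vowel, e.g. "m"+"e" -> メ
            kana += KANA[pair]; i += 2
        elif sounds[i] in KANA:
            kana += KANA[sounds[i]]; i += 1
        else:
            return None                      # unmappable: discard this candidate
    return kana

def generate_and_attest(word):
    """Generate katakana candidates for an English word, keep attested ones."""
    candidates = set()
    for seq in product(*(EN2JP[p] for p in PRON_DICT[word])):
        kana = sounds_to_kana(list(seq))
        if kana is not None:
            candidates.add(kana)
    # Attestation: keep only sequences that occur in the monolingual corpus.
    return {k: CORPUS_COUNTS[k] for k in candidates if CORPUS_COUNTS.get(k, 0) > 0}
```

With this toy data, `generate_and_attest("YEMEN")` generates candidates such as イエメン, イエマン, and ユエマン, and only the attested イエメン (frequency 306) survives.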

English Word to English Phone
CMUdict 0.6, the Carnegie Mellon University Pronouncing Dictionary (www.speech.cs.cmu.edu/cgi-bin/cmudict), with 39 English "phonemes":

Phoneme  Example  Translation
AA       odd      AA D
AE       at       AE T
AH       hut      HH AH T
AO       ought    AO T
AW       cow      K AW

English Word to English Phone (cont.)
CMUdict 0.6 lexicon sample:

ACTOR'S     AE1 K T ER0 Z
ACTORS      AE1 K T ER0 Z
ACTORS'     AE1 K T ER0 Z
ACTRESS     AE1 K T R AH0 S
ACTRESS'S   AE1 K T R AH0 S AH0 Z
ACTRESSES   AE1 K T R AH0 S AH0 Z
ACTS        AE1 K T S
ACTS(2)     AE1 K S
ACTUAL      AE1 K CH AH0 W AH0 L
ACTUAL(2)   AE1 K SH AH0 L
ACTUALITY   AE2 K CH AH0 W AE1 L AH0 T IY0
ACTUALIZE   AE1 K CH AH0 W AH0 L AY2 Z
ACTUALLY    AE1 K CH AH0 W AH0 L IY0

Stress information is removed.
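Stripping the stress digits from an ARPAbet transcription is a one-line regex; a minimal sketch (the helper name is mine, not the paper's):

```python
import re

def strip_stress(transcription):
    """Drop the 0/1/2 stress digits from an ARPAbet phone string,
    e.g. CMUdict's "AE1 K T R AH0 S" -> ["AE", "K", "T", "R", "AH", "S"]."""
    return [re.sub(r"\d$", "", phone) for phone in transcription.split()]
```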

English Phone to Japanese Phone
Mapping between English and Japanese phonemes (Knight & Graehl, 1998):

English Phone  Japanese Phone  Prob
AA             o               0.566
AA             a               0.382
AA             aa              0.024
AA             oo              0.018
AE             a               0.942
AE             ya              0.046

English Phone to Japanese Phone (cont.)
- Generate hypothetical Japanese sound sequences
- Heuristics for pruning the hypothesis space:
  - Discard a hypothesis if its last sound is a consonant (other than "n")
  - Discard a hypothesis if an EN-JP mapping probability falls below 0.05

English Phone to Japanese Phone (cont.)
Hypotheses for "computer": kampyuutaa, kuampyuutaa, kkuampyuutaa, kompyuutaa, kuompyuutaa, kkuompyuutaa, kempyuutaa, kuempyuutaa, kkuempyuutaa, kimpyuutaa, …
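Hypothesis generation with both pruning heuristics can be sketched as follows. The probability table is a toy fragment in the style of Knight & Graehl (1998) with made-up values, so the surviving hypotheses only illustrate the mechanism:

```python
from itertools import product

# Toy probabilistic phoneme table (values are illustrative, not the paper's).
EN2JP_PROB = {
    "K":  [("k", 0.95), ("kk", 0.03), ("ky", 0.02)],
    "AH": [("a", 0.5), ("o", 0.3), ("e", 0.15), ("u", 0.04)],
    "M":  [("m", 0.98), ("mm", 0.02)],
    "P":  [("p", 0.97), ("pp", 0.03)],
    "Y":  [("y", 0.9), ("i", 0.1)],
    "UW": [("uu", 0.8), ("u", 0.2)],
    "T":  [("t", 0.96), ("tt", 0.04)],
    "ER": [("aa", 0.7), ("a", 0.3)],
}
VOWELS = set("aiueo")

def hypotheses(phones, min_prob=0.05):
    """Expand English phones into Japanese sound sequences, applying both
    pruning heuristics from the slide."""
    # Heuristic 2: drop mappings whose probability falls below min_prob.
    pruned = [[jp for jp, p in EN2JP_PROB[ph] if p >= min_prob] for ph in phones]
    out = []
    for seq in product(*pruned):
        # Heuristic 1: discard if the last sound is a consonant other than "n".
        last = seq[-1]
        if last != "n" and last[-1] not in VOWELS:
            continue
        out.append("".join(seq))
    return out
```

For the phones of "computer" (K AH M P Y UW T ER), this toy table produces "kampyuutaa" among its hypotheses, while "kumpyuutaa" is pruned away because AH→u has probability 0.04 < 0.05.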

Japanese Phone to Katakana
Mapping between katakana characters and phonemes:
ボ bo, ポ po, マ ma, ミ mi, ム mu, メ me, モ mo, トヮ toa, モヮ moa, ウィ wi, ウェ we, ウォ wo, ファ fa, フィ fi

Japanese Phone to Katakana (cont.)
- Use longest match to segment JP sound sequences: konpyuutaa (computer) → ko n pyu u ta a
- For vowel lengthening, add "ー": supiido (speed) → スピード (su pi i do)
- For consonant doubling (t, k, p, s, etc.), add "ッ": baggu (bag) → バッグ (ba g gu)
- If multiple mappings exist, keep all candidates
- If no mapping is available, discard the candidate
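The conversion rules above can be sketched on an already-segmented sound sequence. The kana table is a small illustrative fragment and the helper name is mine; a full system would cover the whole kana inventory:

```python
# Tiny illustrative kana fragment (a real table covers all kana, including
# standalone vowels).
KANA = {"su": "ス", "pi": "ピ", "do": "ド", "ba": "バ", "gu": "グ",
        "ko": "コ", "n": "ン", "pyu": "ピュ", "ta": "タ"}
VOWELS = set("aiueo")

def syllables_to_kana(sylls):
    """Convert segmented Japanese sounds to katakana, applying the
    vowel-lengthening ("ー") and consonant-doubling ("ッ") rules."""
    kana, prev = [], ""
    for s in sylls:
        if s in VOWELS and prev.endswith(s):
            kana.append("ー")                    # lengthening: supiido -> スピード
        elif len(s) > 1 and s[0] == s[1]:
            kana.append("ッ" + KANA[s[1:]])      # doubling: baggu -> バッグ
        elif s in KANA:
            kana.append(KANA[s])
        else:
            return None                          # no mapping: discard
        prev = s
    return "".join(kana)
```

For example, `syllables_to_kana(["ko", "n", "pyu", "u", "ta", "a"])` yields コンピューター, one of the attested forms of "computer" shown on the next slide.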

Katakana Sequence Validation
Use a monolingual Japanese corpus for validation: a sequence is validated if it is attested in the corpus, and the frequency of the attested sequence is recorded.

YEMEN     イエメン        306
COMPUTER  コンピュータ    8331
COMPUTER  コンピューター  6184
COMPUTER  コンピュタ      13
COMPUTER  コンピュター    13
COMPUTER  カンピューター  4
COMPUTER  コムピュータ    1
COMPUTER  コンピユーター  1
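The validation step amounts to frequency lookup. A minimal sketch using plain substring counts over a toy corpus string (a real system would query an indexed corpus; note that substring counting lets longer attested forms also contribute to their prefixes' counts):

```python
def validate(candidates, corpus_text):
    """Keep only candidates attested in the corpus (frequency > 0),
    highest-frequency first."""
    counts = {c: corpus_text.count(c) for c in candidates}
    return {c: n for c, n in sorted(counts.items(), key=lambda kv: -kv[1]) if n > 0}

# Toy corpus: "The history of computers. コンピューター and コンピュータ are synonyms."
corpus = "コンピュータの歴史。コンピューターとコンピュータは同義。"
```

Here `validate(["コンピュータ", "コンピューター", "カンピューター"], corpus)` keeps the two attested spellings and discards the unattested カンピューター.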

Prediction Accuracy: English Name to Katakana

Candidates        Matches  %
top candidate     860      58.5
top 2 candidates  1094     74.5
top 3 candidates  1158     78.8
top 4 candidates  1185     80.7
top 5 candidates  1202     81.8
top 6 candidates  1207     82.2
top 7 candidates  1211     82.4

Results based on 1,469 English names, evaluated against a gold standard.

Japanese-to-English IR
- Target documents: LA Times 94 from CLEF (113,000 documents)
- Japanese source queries: CLEF 2001 (37 topics); CLEF 2002 (30 topics)
- Topics with a title field and a description field, and with at least one katakana sequence
- Example topic:
<num> C091 </num>
<JA-title>ラテンアメリカにおけるAI</JA-title> ("AI [Amnesty International] in Latin America")
<JA-desc>ラテンアメリカにおける人権についてのアムネスティ・インターナショナルの報告書</JA-desc> ("Amnesty International reports on human rights in Latin America")
<JA-narr>適合文書は、ラテンアメリカにおける人権に関するアムネスティ・インターナショナルの報告書、またはこの報告書に対する反応についての情報を読者に提供するもの。</JA-narr> ("Relevant documents provide the reader with information on Amnesty International reports on human rights in Latin America, or on reactions to these reports.")

Japanese-to-English IR (Cont.)
- Japanese word segmentation: dictionary-based, using EDICT (http://www.csse.monash.edu.au/~jwb/edict.html) expanded with the discovered transliterations
- Japanese topic translation: same dictionaries as above; the best translation candidate is picked by measuring pair-wise coherence of the translation alternatives (Qu, Grefenstette & Evans, CLEF 2002)
- Validation corpora (the independent variable):
  - Japanese corpus from LDC (230 MB of newswire)
  - Japanese corpus from NTCIR-1 (240 MB of technical abstracts)
  - LDC plus NTCIR-1
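The idea of coherence-based translation selection can be illustrated with a toy sketch: score every combination of one candidate per source term by summed pair-wise co-occurrence and keep the best. The co-occurrence table and the scoring function here are made up for illustration; the actual measure in Qu, Grefenstette & Evans (CLEF 2002) differs.

```python
from itertools import combinations, product

# Hypothetical co-occurrence counts standing in for corpus statistics.
COOC = {("shore", "river"): 30, ("bank", "money"): 40,
        ("bank", "river"): 2, ("shore", "money"): 1}

def cooc(a, b):
    # Symmetric lookup; unseen pairs score 0.
    return COOC.get((a, b), COOC.get((b, a), 0))

def pick_coherent(candidate_sets):
    """Choose one translation per term, maximizing summed pair-wise coherence
    over all combinations (exhaustive; fine for short queries)."""
    best, best_score = None, float("-inf")
    for combo in product(*candidate_sets):
        score = sum(cooc(a, b) for a, b in combinations(combo, 2))
        if score > best_score:
            best, best_score = combo, score
    return list(best)
```

With these toy counts, a term translating to "bank" or "shore" resolves differently depending on whether it appears alongside "river" or "money", which is the disambiguation effect the method exploits.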

Japanese-to-English IR Results: CLEF 2001 Topics

No feedback             Recall         Avg. Prec
No translit (baseline)  476 / 587      0.2667
Translit, NTCIR1        457 (-5.0%)    0.2888 (+8.3%)
Translit, LDC           491 (+3.2%)    0.2952 (+10.7%)
Translit, NTCIR1+LDC

With feedback           Recall         Avg. Prec
No translit (baseline)  508 / 587      0.2763
Translit, NTCIR1        488 (-3.9%)    0.2833 (+2.5%)
Translit, LDC           517 (+1.8%)    0.3032 (+9.7%)
Translit, NTCIR1+LDC

Japanese-to-English IR Results: CLEF 2002 Topics

No feedback             Recall         Avg. Prec
No translit (baseline)  396 / 513      0.1518
Translit, NTCIR1        391 (-1.3%)    0.2206 (+45.3%)
Translit, LDC           407 (+2.8%)    0.2501 (+64.8%)
Translit, NTCIR1+LDC                   0.2500 (+64.8%)

With feedback           Recall         Avg. Prec
No translit (baseline)  367 / 513      0.2063
Translit, NTCIR1        390 (+6.3%)    0.2746 (+33.1%)
Translit, LDC           405 (+10.4%)   0.2707 (+31.2%)
Translit, NTCIR1+LDC

CLEF 2002 Topic-by-Topic Differences

Reasons for Better Performance
- Improved segmentation and translation. E.g., Topic 85 ("Turquoise Program in Rwanda"): EDICT has no entry for ルワンダ as Rwanda; automatic transliteration adds ルワンダ: Rwanda; precision increased from 0.0005 to 0.2648.
- Added new translations. E.g., Topic 47 ("Russian Intervention in Chechnya"): EDICT gives チェチェン: Chechin, Chechnia; automatic transliteration adds チェチェン: Chechen; precision increased from 0.0533 to 0.7335.

Error Analysis
- Transliteration not from English: ベネチア comes from Italian Venezia, not English Venice
- Incomplete mapping: for the query term "energy", エネルギー (e-ne-ru-gi-i) is not in the hypothesis space
- Erroneous mapping from Japanese sound sequences to katakana, due to the greedy match algorithm
- Insufficient coverage of the reference corpus: Chechen is not present in the NTCIR-1 corpus

Related Work
- Transliteration: English/Japanese (Knight & Graehl, 1998), English/Korean (Kang & Choi, 1999), English/Chinese (Meng et al., 2001), English/Arabic (Stalls & Knight, 1998)
- Knight & Graehl (1998) focused on back-transliteration
- Fujii & Ishikawa (2001): English-string-to-Japanese-string transliteration with corpus validation, which also improved CLIR

Conclusion
- Automated English-to-Japanese transliteration (for names and technical terms) is possible through generation and validation
- Augmenting the translation dictionary with the discovered transliterations improves CLIR performance

The End