Automatic Transliteration for Japanese-to-English Text Retrieval
Yan Qu, Gregory Grefenstette, David A. Evans
Clairvoyance Corporation
SIGIR 2003, Toronto, Canada

Cross Language Information Retrieval (CLIR)
  User query in one language, documents to be retrieved in another
  Common approach
    Translate query into target language using a bilingual translation dictionary
    Then, monolingual search
  Problem: unknown word
    Word not in translation dictionary
    Proper name, technical term
  Solutions
    Cognates: pass through unchanged
    Languages with different writing systems: transliteration

English-to-Japanese Transliteration
  Katakana – Japanese syllabic alphabet, e.g., コ, ン, ピ, ュ, ー, タ
  Applies to foreign proper names and borrowed technical terms
  “computer” – “konpiyuuta” – コンピュータ
    コ for “ko”
    ン for “n”
    ピ for “pi”
    ュー for “yuu”
    タ for “ta”

Research Questions
  How to automatically augment an English-to-Japanese translation dictionary with new transliterations?
    Explore the generate-and-attest method
  Does CLIR performance improve?

Generate and Attest Method
  Generate possible transliterations
    English word to English sound sequence
    English sounds to Japanese sounds
    Japanese sounds to katakana
  Attest validity with corpus statistics
    Use monolingual Japanese corpus to attest katakana

Generate and Attest Method (Cont.)
  English word, e.g., YEMEN
  Get EN sound sequence (English pronunciation dictionary)
    Y-EH-M-AH-N
  Map to possible JP sound sequences (English-Japanese phoneme mapping)
    Y:y  Y:i  Y:yu  …  AH:a  AH:o  AH:e  …
    YEMEN: y-e-m-a-n, i-e-m-a-n, yu-e-m-a-n, …, yu-a-n-u-n
  Get katakana sequences (kana to JP phoneme mapping), or Discard
    イエマン, ユエマン, …, ユアヌン
  Validate katakana sequences (web corpus)
    イエマン -- 306
    ユエマン -- 0
    …
    ユアヌン -- 0

English Word to English Phone
  CMUDICT 0.6: Carnegie Mellon University Pronunciation Dictionary
  www.speech.cs.cmu.edu/cgi-bin/cmudict
  39 English “phonemes”

  Phoneme  Example  Translation
  AA       odd      AA D
  AE       at       AE T
  AH       hut      HH AH T
  AO       ought    AO T
  AW       cow      K AW

English Word to English Phone (cont.)
  CMUDICT 0.6 lexicon sample

  ACTOR'S     AE1 K T ER0 Z
  ACTORS      AE1 K T ER0 Z
  ACTORS'     AE1 K T ER0 Z
  ACTRESS     AE1 K T R AH0 S
  ACTRESS'S   AE1 K T R AH0 S AH0 Z
  ACTRESSES   AE1 K T R AH0 S AH0 Z
  ACTS        AE1 K T S
  ACTS(2)     AE1 K S
  ACTUAL      AE1 K CH AH0 W AH0 L
  ACTUAL(2)   AE1 K SH AH0 L
  ACTUALITY   AE2 K CH AH0 W AE1 L AH0 T IY0
  ACTUALIZE   AE1 K CH AH0 W AH0 L AY2 Z
  ACTUALLY    AE1 K CH AH0 W AH0 L IY0

  Stress information is removed

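The lookup step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes lexicon lines of the form "WORD PHONE PHONE ...", that alternative pronunciations carry a "(2)"-style suffix as in the sample, and that comment lines start with ";;;". The names SAMPLE and parse_cmudict are hypothetical.

```python
import re

# A few entries copied from the lexicon sample on the slide.
SAMPLE = """\
ACTOR'S  AE1 K T ER0 Z
ACTS  AE1 K T S
ACTS(2)  AE1 K S
ACTUAL  AE1 K CH AH0 W AH0 L
ACTUAL(2)  AE1 K SH AH0 L
"""

def parse_cmudict(text):
    """Map each word to a list of pronunciations with stress digits removed."""
    lexicon = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):  # assumed comment marker
            continue
        head, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", head)          # ACTS(2) -> ACTS
        stripped = [p.rstrip("012") for p in phones]  # AE1 -> AE, AH0 -> AH
        lexicon.setdefault(word, []).append(stripped)
    return lexicon

lexicon = parse_cmudict(SAMPLE)
print(lexicon["ACTS"])
```

Folding variant entries such as ACTS(2) into the base word keeps every pronunciation available to the later mapping steps.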
English Phone to Japanese Phone
  Mapping between English and Japanese phonemes (Knight & Graehl, 1998)

  EnglishPhone  JapanesePhone  Prob
  AA            o              0.566
  AA            a              0.382
  AA            aa             0.024
  AA            oo             0.018
  AE            a              0.942
  AE            ya             0.046

English Phone to Japanese Phone (cont.)
  Generate hypothetical Japanese sound sequences
  Heuristics for pruning hypothesis space
    Discard if the last sound is a consonant (except “n”)
    Discard if EN-JP mapping probability falls under 0.05

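A rough sketch of hypothesis generation with the two pruning heuristics is given below. The AA and AE rows come from the mapping table in this deck; the T and D rows are invented for illustration only. It also assumes the 0.05 threshold applies to each individual phone mapping rather than to the product over the whole sequence, which the slides do not specify. The names EN_TO_JP, MIN_PROB, and generate are hypothetical.

```python
from itertools import product
from math import prod

EN_TO_JP = {
    # Rows taken from the mapping table on the slide.
    "AA": [("o", 0.566), ("a", 0.382), ("aa", 0.024), ("oo", 0.018)],
    "AE": [("a", 0.942), ("ya", 0.046)],
    # Hypothetical rows, invented for this sketch only.
    "T": [("t", 0.5), ("to", 0.4), ("tto", 0.1)],
    "D": [("d", 0.5), ("do", 0.5)],
}
MIN_PROB = 0.05   # assumption: threshold applied per phone mapping
VOWELS = "aiueo"

def generate(phones):
    """Return (sound sequence, score) pairs, best first, after pruning."""
    if not phones:
        return []
    options = []
    for phone in phones:
        kept = [(jp, p) for jp, p in EN_TO_JP.get(phone, []) if p >= MIN_PROB]
        if not kept:          # unknown phone or every mapping pruned
            return []
        options.append(kept)
    candidates = []
    for combo in product(*options):
        last_sound = combo[-1][0]
        # Discard sequences ending in a consonant other than "n".
        if last_sound[-1] not in VOWELS and last_sound != "n":
            continue
        sequence = "".join(jp for jp, _ in combo)
        candidates.append((sequence, prod(p for _, p in combo)))
    return sorted(candidates, key=lambda c: c[1], reverse=True)

print(generate(["AE", "T"]))   # "at"
print(generate(["AA", "D"]))   # "odd"
```

The score here is simply the product of the mapping probabilities; it only orders the hypotheses, since the final decision is left to corpus validation.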
English Phone to Japanese Phone (cont.)
  computer:
    kampyuutaa
    kuampyuutaa
    kkuampyuutaa
    kompyuutaa
    kuompyuutaa
    kkuompyuutaa
    kempyuutaa
    kuempyuutaa
    kkuempyuutaa
    kimpyuutaa
    …

Japanese Phone to Katakana
  Mapping between katakana characters and phonemes

  ボ   bo
  ポ   po
  マ   ma
  ミ   mi
  ム   mu
  メ   me
  モ   mo
  トヮ  toa
  モヮ  moa
  ウィ  wi
  ウェ  we
  ウォ  wo
  ファ  fa
  フィ  fi

Japanese Phone to Katakana (cont.)
  Longest match to segment JP sound sequences
    konpyuutaa (computer): ko n pyu u ta a
  If vowel lengthening, add “ー”
    supiido (speed): スピード (su pi i do)
  If consonant doubling (t, k, p, s, etc.), add “ッ”
    baggu (bag): バッグ (ba g gu)
  If multiple mappings, keep all candidates
  If mapping unavailable, discard

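The segmentation rules above can be sketched as a greedy longest-match loop. This is an illustrative simplification under several assumptions: KANA is a tiny hand-picked subset of the mapping table, the doubling set holds only t, k, p, s plus the g from the "baggu" example, and the function returns a single candidate rather than keeping all alternatives. The names KANA, DOUBLING, and to_katakana are hypothetical.

```python
# Small illustrative subset of the phoneme-to-katakana table.
KANA = {
    "a": "ア", "i": "イ", "u": "ウ", "e": "エ", "o": "オ",
    "ko": "コ", "n": "ン", "pyu": "ピュ", "ta": "タ",
    "su": "ス", "pi": "ピ", "do": "ド", "ba": "バ", "gu": "グ",
}
VOWELS = set("aiueo")
DOUBLING = set("tkpsg")  # assumption: the slide's "etc." is not spelled out
MAX_LEN = max(len(key) for key in KANA)

def to_katakana(sounds):
    """Convert a romanized sound string to katakana, or None if unmappable."""
    out = []
    i = 0
    while i < len(sounds):
        ch = sounds[i]
        # Vowel lengthening: a vowel repeated right after itself becomes "ー".
        if ch in VOWELS and i > 0 and sounds[i - 1] == ch:
            out.append("ー")
            i += 1
            continue
        # Consonant doubling: the first of two identical consonants becomes "ッ".
        if ch in DOUBLING and i + 1 < len(sounds) and sounds[i + 1] == ch:
            out.append("ッ")
            i += 1
            continue
        # Longest match against the table, trying the longest chunk first.
        for length in range(min(MAX_LEN, len(sounds) - i), 0, -1):
            chunk = sounds[i:i + length]
            if chunk in KANA:
                out.append(KANA[chunk])
                i += length
                break
        else:
            return None  # mapping unavailable: discard this hypothesis
    return "".join(out)

for word in ("konpyuutaa", "supiido", "baggu"):
    print(word, to_katakana(word))
```

A fuller version would branch wherever several mappings apply and return every resulting katakana string, so that corpus validation can choose among them.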
Katakana Sequence Validation
  Use monolingual Japanese corpus for validation
    If the sequence is attested, then it’s validated
    The frequency of the attested sequence is recorded

  YEMEN     イエメン         306
  COMPUTER  コンピュータ      8331
  COMPUTER  コンピューター     6184
  COMPUTER  コンピュタ        13
  COMPUTER  コンピュター      13
  COMPUTER  カンピューター     4
  COMPUTER  コムピュータ      1
  COMPUTER  コンピユーター     1

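One way to sketch the attestation step is to count maximal katakana runs in the corpus and keep only the candidates that occur. The toy corpus string below is invented, and counting whole katakana runs (rather than substrings or segmenter tokens) is an assumption of this sketch; the slides do not say how the corpus was tokenized. The names KATAKANA_RUN, katakana_counts, and attest are hypothetical.

```python
import re
from collections import Counter

# Katakana letters U+30A1..U+30FA plus the long-vowel mark U+30FC.
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")

def katakana_counts(corpus):
    """Count every maximal katakana run in the corpus text."""
    return Counter(KATAKANA_RUN.findall(corpus))

def attest(candidates, counts):
    """Keep candidates seen in the corpus, most frequent first."""
    attested = [(c, counts[c]) for c in candidates if counts[c] > 0]
    return sorted(attested, key=lambda pair: pair[1], reverse=True)

# Invented toy corpus, for illustration only.
corpus = "新しいコンピュータを買った。コンピューターは速い。コンピュータの価格。イエメンの首都"
counts = katakana_counts(corpus)
candidates = ["コンピュータ", "コンピューター", "カンピューター", "コムピュータ"]
print(attest(candidates, counts))
```

Matching whole runs avoids crediting コンピュータ for every occurrence of the longer コンピューター, which plain substring counting would do.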
Prediction Accuracy
  English Name to Katakana

                     matches  %
  top candidate      860      58.5
  top 2 candidates   1094     74.5
  top 3 candidates   1158     78.8
  top 4 candidates   1185     80.7
  top 5 candidates   1202     81.8
  top 6 candidates   1207     82.2
  top 7 candidates   1211     82.4

  Results based on 1,469 English names, evaluated against a gold standard

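The top-k figures can be read as the share of names whose gold-standard katakana appears among the first k ranked candidates. The sketch below assumes one gold form per name, which the slides do not state; the toy ranking and gold data are invented, and top_k_accuracy is a hypothetical name.

```python
def top_k_accuracy(ranked, gold, k):
    """Return (matches, percent) for names whose gold form is in the top k."""
    hits = sum(1 for name, cands in ranked.items() if gold[name] in cands[:k])
    return hits, 100.0 * hits / len(ranked)

# Invented toy data, for illustration only.
ranked = {
    "COMPUTER": ["コンピュータ", "コンピューター"],
    "YEMEN": ["イエマン", "イエメン"],
}
gold = {"COMPUTER": "コンピュータ", "YEMEN": "イエメン"}

print(top_k_accuracy(ranked, gold, 1))
print(top_k_accuracy(ranked, gold, 2))

# The percentages in the table follow from matches / 1,469 names.
print(round(100.0 * 860 / 1469, 1))
```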
Japanese-to-English IR
  Target documents
    LA Times 94 from CLEF: 113,000 documents
  Japanese source queries
    CLEF 2001: 37 topics; CLEF 2002: 30 topics
    Topics with title field and description field
    With at least one katakana sequence

  <num> C091 </num>
  <JA-title>ラテンアメリカにおけるAI</JA-title>
  <JA-desc> ラテンアメリカにおける人権についてのアムネスティ・インターナショナルの報告書</JA-desc>
  <JA-narr> 適合文書は、ラテンアメリカにおける人権に関するアムネスティ・インターナショナルの報告書、またはこの報告書に対する反応についての情報を読者に提供するもの。</JA-narr>

Japanese-to-English IR (Cont.)
  Japanese word segmentation
    Dictionary-based
      EDICT http://www.csse.monash.edu.au/~jwb/edict.html
      EDICT expanded with transliterations
  Japanese topic translation
    Use same dictionaries mentioned above
    Pick best translation candidate by measuring pair-wise coherence of translation alternatives (Qu, Grefenstette, Evans, CLEF 2002)
  Validation corpora (independent variable)
    Japanese Corpus from LDC (230 MB newswire)
    Japanese Corpus from NTCIR-1 (240 MB technical abstracts)
    LDC plus NTCIR1

Japanese-to-English IR Results
  CLEF 2001 Topics

  No feedback             Recall        Avg. Prec
  No translit (baseline)  476 / 587     0.2667
  Translit, NTCIR1        457 (-5.0%)   0.2888 (+8.3%)
  Translit, LDC           491 (+3.2%)   0.2952 (+10.7%)
  Translit, NTCIR1+LDC

  With feedback           Recall        Avg. Prec
  No translit (baseline)  508 / 587     0.2763
  Translit, NTCIR1        488 (-3.9%)   0.2833 (+2.5%)
  Translit, LDC           517 (+1.8%)   0.3032 (+9.7%)
  Translit, NTCIR1+LDC

Japanese-to-English IR Results
  CLEF 2002 Topics

  No feedback             Recall        Avg. Prec
  No translit (baseline)  396 / 513     0.1518
  Translit, NTCIR1        391 (-1.3%)   0.2206 (+45.3%)
  Translit, LDC           407 (+2.8%)   0.2501 (+64.8%)
  Translit, NTCIR1+LDC                  0.2500 (+64.8%)

  With feedback           Recall        Avg. Prec
  No translit (baseline)  367 / 513     0.2063
  Translit, NTCIR1        390 (+6.3%)   0.2746 (+33.1%)
  Translit, LDC           405 (+10.4%)  0.2707 (+31.2%)
  Translit, NTCIR1+LDC

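The parenthesized figures appear to be relative changes against the no-transliteration baseline of the same block. A small sketch of that arithmetic, using values from the CLEF 2002 no-feedback rows (relative_change is a hypothetical helper name):

```python
def relative_change(baseline, value):
    """Percent change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline

# CLEF 2002, no feedback: average precision, baseline vs. NTCIR1 validation.
print(round(relative_change(0.1518, 0.2206), 1))
# CLEF 2002, no feedback: relevant documents retrieved, baseline vs. LDC.
print(round(relative_change(396, 407), 1))
```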
CLEF 2002 Topic-by-Topic Differences
Reasons for Better Performance
  Improved segmentation and translation
    E.g., Topic 85 (“Turquoise Program in Rwanda”)
    In EDICT: no entry for ルワンダ as Rwanda
    Automatic transliteration adds ルワンダ: Rwanda
    Increased precision from 0.0005 to 0.2648
  Added new translations
    E.g., Topic 47 (“Russian Intervention in Chechnya”)
    In EDICT: チェチェン: Chechin; Chechnia
    Automatic transliteration adds チェチェン: Chechen
    Increased precision from 0.0533 to 0.7335

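Both cases above amount to merging validated transliterations into the translation dictionary, either as a new entry or as an extra translation for an existing one. A minimal sketch, assuming the dictionary is held as a mapping from a katakana string to a list of English translations (the data layout and the name augment are assumptions of this sketch):

```python
def augment(dictionary, transliterations):
    """Return a copy of the dictionary with (katakana, english) pairs merged in."""
    merged = {kana: list(words) for kana, words in dictionary.items()}
    for kana, english in transliterations:
        entries = merged.setdefault(kana, [])
        if english not in entries:   # avoid duplicate translations
            entries.append(english)
    return merged

# Entries taken from the two examples on the slide.
edict = {"チェチェン": ["Chechin", "Chechnia"]}
found = [("ルワンダ", "Rwanda"), ("チェチェン", "Chechen")]

expanded = augment(edict, found)
print(expanded)
```

Because the same expanded dictionary also drives word segmentation, a new entry such as ルワンダ can help at both the segmentation and the translation stage.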
Error Analysis
  Transliteration not from English
    ベネチア from Italian Venezia, not Venice
  Incomplete mapping
    Query term Energy: エネルギー (e-ne-ru-gi-i) not in the hypothesis space
  Erroneous mapping from Japanese sound sequences to katakana
    Greedy match algorithm used
  Insufficient coverage of the reference corpus
    Chechen not present in NTCIR1 corpus

Related Work
  English/Japanese (Knight & Graehl 98), English/Korean (Kang & Choi 99), English/Chinese (Meng, et al., 01), English/Arabic (Stalls & Knight 98)
  Knight & Graehl (98)
    Focused on back-transliteration
  Fujii & Ishikawa (01)
    English-string-to-Japanese-string transliteration with corpus validation, also improved CLIR

Conclusion
  Automated English-to-Japanese transliteration (for names & technical terms) is possible through generation and validation
  Augmenting the translation dictionary with discovered transliterations improves CLIR performance

The End