Published by Opal Matthews. Modified over 6 years ago.
1
Automatic Transliteration for Japanese-to-English Text Retrieval
Yan Qu, Gregory Grefenstette, David A. Evans
Clairvoyance Corporation
SIGIR 2003, Toronto, Canada
2
Cross Language Information Retrieval (CLIR)
User query in one language, documents to be retrieved in another
Common approach
  Translate the query into the target language using a bilingual translation dictionary
  Then run a monolingual search
Problem: unknown words
  Word not in the translation dictionary
  Proper names, technical terms
Solutions
  Cognates, or pass the word through unchanged
  For languages with different writing systems: transliteration
3
English-to-Japanese Transliteration
Katakana – Japanese syllabic alphabet, e.g., コ, ン, ピ, ュ, ー, タ
Applies to foreign proper names and borrowed technical terms
“computer” – “konpiyuuta” – コンピュータ
  コ for “ko”
  ン for “n”
  ピ for “pi”
  ュー for “yuu”
  タ for “ta”
4
Research Questions
How can an English-to-Japanese translation dictionary be augmented automatically with new transliterations?
  Explore the generate-and-attest method
Does CLIR performance improve?
5
Generate and Attest Method
Generate possible transliterations
  English word to English sound sequence
  English sounds to Japanese sounds
  Japanese sounds to katakana
Attest validity with corpus statistics
  Use a monolingual Japanese corpus to attest katakana
6
Generate and Attest Method (Cont.)
Step 1: English word, e.g., YEMEN
Step 2: Get EN sound sequence (resource: English pronunciation dictionary)
  YEMEN: Y-EH-M-AH-N
Step 3: Map to possible JP sound sequences (resource: English-Japanese phoneme mapping)
  Mapping entries: Y:y, Y:i, Y:yu, … AH:a, AH:o, AH:e, …
  y-e-m-a-n
  i-e-m-a-n
  yu-e-m-a-n
  …
  yu-a-n-u-n
Step 4: Get katakana sequences (resource: kana to JP phoneme mapping)
  イエマン
  ユエマン
  …
  ユアヌン
Step 5: Validate katakana sequences (resource: web corpus)
  イエマン
  ユエマン -- 0
  …
  ユアヌン -- 0
  Discard sequences that are not attested
7
English Word to English Phone
CMUDICT 0.6: Carnegie Mellon University Pronunciation Dictionary
39 English “phonemes”

Phoneme   Example   Translation
AA        odd       AA D
AE        at        AE T
AH        hut       HH AH T
AO        ought     AO T
AW        cow       K AW
8
English Word to English Phone (cont.)
CMUDICT lexicon sample

ACTOR'S      AE1 K T ER0 Z
ACTORS       AE1 K T ER0 Z
ACTORS'      AE1 K T ER0 Z
ACTRESS      AE1 K T R AH0 S
ACTRESS'S    AE1 K T R AH0 S AH0 Z
ACTRESSES    AE1 K T R AH0 S AH0 Z
ACTS         AE1 K T S
ACTS(2)      AE1 K S
ACTUAL       AE1 K CH AH0 W AH0 L
ACTUAL(2)    AE1 K SH AH0 L
ACTUALITY    AE2 K CH AH0 W AE1 L AH0 T IY0
ACTUALIZE    AE1 K CH AH0 W AH0 L AY2 Z
ACTUALLY     AE1 K CH AH0 W AH0 L IY0

Stress information (the digits 0, 1, 2 on vowel phones) is removed
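The stress-removal step can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `parse_cmudict_line` is hypothetical, and folding variant entries such as ACTS(2) into their base word is an assumption about how alternative pronunciations might be handled.

```python
import re

def parse_cmudict_line(line: str) -> tuple[str, list[str]]:
    """Split one CMUDICT entry into (word, phones) with stress digits removed."""
    word, *phones = line.split()
    # Assumption: a variant marker such as "(2)" is dropped, so "ACTS(2)" -> "ACTS".
    word = re.sub(r"\(\d+\)$", "", word)
    # Remove the trailing stress digit (0, 1 or 2) that CMUDICT puts on vowels.
    phones = [re.sub(r"\d$", "", phone) for phone in phones]
    return word, phones

if __name__ == "__main__":
    print(parse_cmudict_line("ACTUALITY  AE2 K CH AH0 W AE1 L AH0 T IY0"))
    print(parse_cmudict_line("ACTS(2)  AE1 K S"))
```

Consonant phones carry no digit, so they pass through unchanged.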
9
English Phone to Japanese Phone
Mapping between English and Japanese phonemes (Knight & Graehl, 1998)

EnglishPhone   JapanesePhone   Prob
AA             o
AA             a
AA             aa
AA             oo
AE             a
AE             ya
10
English Phone to Japanese Phone (cont.)
Generate hypothetical Japanese sound sequences
Heuristics for pruning the hypothesis space
  Discard if the last sound is a consonant (except “n”)
  Discard if the EN-JP mapping probability falls under 0.05
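A minimal Python sketch of hypothesis generation with the two pruning heuristics follows. The mapping table and its probabilities are invented for illustration (the slides use the Knight & Graehl table), the name `generate_hypotheses` is hypothetical, and applying the 0.05 threshold to each individual phone mapping rather than to a whole sequence is an assumption.

```python
from itertools import product

# Hypothetical English-phone -> [(Japanese sounds, probability)] table.
# The numbers are made up for this example.
MAPPING = {
    "Y": [("y", 0.7), ("i", 0.2), ("yu", 0.1)],
    "EH": [("e", 0.9), ("a", 0.04)],
    "M": [("m", 0.6), ("mu", 0.3), ("n", 0.1)],
    "AH": [("a", 0.5), ("o", 0.3), ("e", 0.2)],
    "N": [("n", 0.8), ("nu", 0.2)],
    "T": [("to", 0.6), ("t", 0.3), ("tsu", 0.1)],
}
VOWELS = set("aiueo")
MIN_PROB = 0.05

def generate_hypotheses(phones):
    """Yield Japanese sound sequences such as 'y-e-m-a-n' for an English phone list."""
    # Heuristic 2 (assumed per mapping): drop mappings whose probability is under 0.05.
    options = [[jp for jp, prob in MAPPING[ph] if prob >= MIN_PROB] for ph in phones]
    for combo in product(*options):
        last_sound = "".join(combo)[-1]
        # Heuristic 1: discard sequences ending in a consonant other than "n".
        if last_sound in VOWELS or last_sound == "n":
            yield "-".join(combo)

if __name__ == "__main__":
    for hypothesis in list(generate_hypotheses(["Y", "EH", "M", "AH", "N"]))[:5]:
        print(hypothesis)
```

Because every phone can map to several Japanese sounds, the number of hypotheses grows multiplicatively with word length, which is why pruning matters before the katakana step.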
11
English Phone to Japanese Phone (cont)
computer:
  kampyuutaa
  kuampyuutaa
  kkuampyuutaa
  kompyuutaa
  kuompyuutaa
  kkuompyuutaa
  kempyuutaa
  kuempyuutaa
  kkuempyuutaa
  kimpyuutaa
  …
12
Japanese Phone to Katakana
Mapping between katakana characters and phonemes

ボ bo
ポ po
マ ma
ミ mi
ム mu
メ me
モ mo
トヮ toa
モヮ moa
ウィ wi
ウェ we
ウォ wo
ファ fa
フィ fi
13
Japanese Phone to Katakana (cont.)
Longest match to segment JP sound sequences
  konpyuutaa (computer): ko n pyu u ta a
If vowel lengthening, add “ー”
  supiido (speed): スピード (su pi i do)
If consonant doubling (t, k, p, s, etc.), add “ッ”
  baggu (bag): バッグ (ba g gu)
If multiple mappings, keep all candidates
If mapping unavailable, discard
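The segmentation rules above can be sketched as a greedy longest-match loop in Python. The kana table here is a tiny hypothetical subset, the set of doubling consonants is an assumption based on the examples, and for brevity the sketch returns a single candidate instead of keeping all alternatives when several mappings exist.

```python
# Hypothetical subset of the phoneme -> katakana table.
KANA = {
    "a": "ア", "i": "イ", "u": "ウ", "e": "エ", "o": "オ",
    "ko": "コ", "n": "ン", "pyu": "ピュ", "ta": "タ",
    "su": "ス", "pi": "ピ", "do": "ド", "ba": "バ", "gu": "グ",
}
VOWELS = set("aiueo")
DOUBLING = set("tkpsgdbz")  # assumed set; the slide lists "t, k, p, s, etc."
MAX_LEN = max(len(key) for key in KANA)

def to_katakana(sounds):
    """Convert a romanized sound sequence to katakana, or None if unmappable."""
    out, i, prev_vowel = [], 0, ""
    while i < len(sounds):
        ch = sounds[i]
        # Vowel lengthening: a vowel repeating the previous unit's vowel becomes "ー".
        if ch in VOWELS and ch == prev_vowel:
            out.append("ー")
            i, prev_vowel = i + 1, ""
            continue
        # Consonant doubling: the first of two identical consonants becomes "ッ".
        if ch in DOUBLING and sounds[i + 1 : i + 2] == ch:
            out.append("ッ")
            i, prev_vowel = i + 1, ""
            continue
        # Greedy longest match against the kana table.
        for length in range(MAX_LEN, 0, -1):
            chunk = sounds[i : i + length]
            if chunk in KANA:
                out.append(KANA[chunk])
                i += len(chunk)
                prev_vowel = chunk[-1] if chunk[-1] in VOWELS else ""
                break
        else:
            return None  # mapping unavailable: discard this hypothesis
    return "".join(out)

if __name__ == "__main__":
    for word in ("konpyuutaa", "supiido", "baggu"):
        print(word, to_katakana(word))
```

A greedy match commits to the longest unit at each position and never backtracks, so a hypothesis can be segmented wrongly or discarded even though another segmentation exists.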
14
Katakana Sequence Validation
Use a monolingual Japanese corpus for validation
If the sequence is attested, then it is validated
The frequency of the attested sequence is recorded

English word   Katakana sequence   Frequency
YEMEN          イエメン
COMPUTER       コンピュータ        8331
COMPUTER       コンピューター      6184
COMPUTER       コンピュタ
COMPUTER       コンピュター
COMPUTER       カンピューター
COMPUTER       コムピュータ
COMPUTER       コンピユーター
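One way to implement the attestation step is to count maximal katakana runs in the corpus and look candidates up in that table. The sketch below is an assumption about the mechanics, not the authors' implementation; the function names and the toy corpus are hypothetical. Counting whole runs rather than substrings keeps コンピュータ from being credited for every occurrence of コンピューター.

```python
import re
from collections import Counter

# Maximal runs of katakana letters (U+30A1..U+30FA) and the length mark "ー" (U+30FC).
# The middle dot "・" (U+30FB) is excluded so that it separates words.
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")

def katakana_counts(corpus_text):
    """Count every maximal katakana sequence in the corpus text."""
    return Counter(KATAKANA_RUN.findall(corpus_text))

def attest(candidates, counts):
    """Keep candidates seen in the corpus, most frequent first, with their counts."""
    attested = [(cand, counts[cand]) for cand in candidates if counts[cand] > 0]
    return sorted(attested, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    # Toy corpus for illustration only.
    corpus = "新しいコンピュータを買った。コンピューターは速い。コンピュータの時代。"
    counts = katakana_counts(corpus)
    print(attest(["コンピュータ", "コンピューター", "カンピューター"], counts))
```

Unattested candidates simply drop out, and the recorded frequency gives a natural ranking when more than one spelling is attested.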
15
Prediction Accuracy
English Name to Katakana

                     matches   %
top candidate        860       58.5
top 2 candidates     1094      74.5
top 3 candidates     1158      78.8
top 4 candidates     1185      80.7
top 5 candidates     1202      81.8
top 6 candidates     1207      82.2
top 7 candidates     1211      82.4

Results based on 1,469 English names, evaluated against a gold standard
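The percentages in the table are the match counts divided by the 1,469 test names. A short Python check, with hypothetical names, reproduces them and shows how a top-k hit could be decided for a single name; treating a hit as "the gold katakana form appears among the first k ranked candidates" is an assumption about the evaluation.

```python
TOTAL_NAMES = 1469
MATCHES_BY_K = {1: 860, 2: 1094, 3: 1158, 4: 1185, 5: 1202, 6: 1207, 7: 1211}

def accuracy_percent(matches, total=TOTAL_NAMES):
    """Share of names matched, as a percentage rounded to one decimal."""
    return round(100 * matches / total, 1)

def top_k_hit(ranked_candidates, gold, k):
    """True if the gold katakana form is among the first k ranked candidates."""
    return gold in ranked_candidates[:k]

if __name__ == "__main__":
    for k, matches in MATCHES_BY_K.items():
        print(f"top {k}: {matches} matches, {accuracy_percent(matches)}%")
```

Most of the gain comes from the second candidate; beyond the top 4 or 5 the curve is nearly flat.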
16
Japanese-to-English IR
Target documents
  LA Times 94 from CLEF: 113,000 documents
Japanese source queries
  CLEF 2001: 37 topics; CLEF 2002: 30 topics
  Topics with title field and description field
  With at least one katakana sequence

<num> C091 </num>
<JA-title>ラテンアメリカにおけるAI</JA-title>
<JA-desc> ラテンアメリカにおける人権についてのアムネスティ・インターナショナルの報告書</JA-desc>
<JA-narr> 適合文書は、ラテンアメリカにおける人権に関するアムネスティ・インターナショナルの報告書、またはこの報告書に対する反応についての情報を読者に提供するもの。</JA-narr>
17
Japanese-to-English IR (Cont.)
Japanese word segmentation
  Dictionary-based
    EDICT
    EDICT expanded with transliterations
Japanese topic translation
  Use the same dictionaries mentioned above
  Pick the best translation candidate by measuring pair-wise coherence of translation alternatives (Qu, Grefenstette, Evans, CLEF 2002)
Validation corpora (independent variable)
  Japanese corpus from LDC (230 MB newswire)
  Japanese corpus from NTCIR-1 (240 MB technical abstracts)
  LDC plus NTCIR1
18
Japanese-to-English IR Results
CLEF 2001 Topics

No feedback               Recall          Avg. Prec
No translit (baseline)    476 / 587       0.2667
Translit, NTCIR1          457 (-5.0%)     (+8.3%)
Translit, LDC             491 (+3.2%)     (+10.7%)
Translit, NTCIR1+LDC

With feedback             Recall          Avg. Prec
No translit (baseline)    508 / 587       0.2763
Translit, NTCIR1          488 (-3.9%)     (+2.5%)
Translit, LDC             517 (+1.8%)     (+9.7%)
Translit, NTCIR1+LDC
19
Japanese-to-English IR Results
CLEF 2002 Topics

No feedback               Recall          Avg. Prec
No translit (baseline)    396 / 513       0.1518
Translit, NTCIR1          391 (-1.3%)     (+45.3%)
Translit, LDC             407 (+2.8%)     (+64.8%)
Translit, NTCIR1+LDC                      (+64.8%)

With feedback             Recall          Avg. Prec
No translit (baseline)    367 / 513       0.2063
Translit, NTCIR1          390 (+6.3%)     (+33.1%)
Translit, LDC             405 (+10.4%)    (+31.2%)
Translit, NTCIR1+LDC
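The percentages in parentheses read as relative changes against the no-transliteration baseline of the same block. A small Python check, with a hypothetical helper name, reproduces the CLEF 2002 recall figures under that reading.

```python
def relative_change_percent(value, baseline):
    """Relative change of value against baseline, in percent, one decimal."""
    return round(100 * (value - baseline) / baseline, 1)

if __name__ == "__main__":
    # Relevant documents retrieved for CLEF 2002 topics, taken from the table.
    print(relative_change_percent(391, 396))  # NTCIR1, no feedback
    print(relative_change_percent(407, 396))  # LDC, no feedback
    print(relative_change_percent(390, 367))  # NTCIR1, with feedback
    print(relative_change_percent(405, 367))  # LDC, with feedback
```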
20
CLEF 2002 Topic-by-Topic Differences
21
Reasons for Better Performance
Improved segmentation and translation
  E.g., Topic 85 (“Turquoise Program in Rwanda”)
  EDICT has no entry for ルワンダ as Rwanda
  Automatic transliteration adds ルワンダ: Rwanda
  Increased precision from [value missing] to [value missing]
Added new translations
  E.g., Topic 47 (“Russian Intervention in Chechnya”)
  チェチェン: Chechin; Chechnia
  Automatic transliteration adds チェチェン: Chechen
  Increased precision from [value missing] to [value missing]
22
Error Analysis
Transliteration not from English
  ベネチア from Italian Venezia, not Venice
Incomplete mapping
  Query term Energy: エネルギー (e-ne-ru-gi-i) not in the hypothesis space
Erroneous mapping from Japanese sound sequences to katakana
  Greedy match algorithm used
Insufficient coverage of the reference corpus
  Chechen not present in NTCIR1 corpus
23
Related Work
English/Japanese (Knight & Graehl 98), English/Korean (Kang & Choi 99), English/Chinese (Meng, et al., 01), English/Arabic (Stalls & Knight 98)
Knight & Graehl (98)
  Focused on back-transliteration
Fujii & Ishikawa (01)
  English-string-to-Japanese-string transliteration with corpus validation; also improved CLIR
24
Conclusion
Automated English-to-Japanese transliteration (for names and technical terms) is possible through generation and validation
Augmenting the translation dictionary with discovered transliterations improves CLIR performance
25
The End