1
A Method for Enhancing Search Using Transliteration of Mandarin Chinese
Vijay John
vijayjohn@mail.utexas.edu
2
Transliterated Mandarin Search Google suggests spelling correction
3
Alternate Transliterations? Want to say “Did you mean Peiching?”
4
Transliteration Problems “Beijing” provides many results Google doesn’t find “Peiching,” “Peking,” “Bukgyeong,” etc. Many pages using variety of transliterations Transliterations unorganized This paper organizes for Mandarin Chinese
5
The Problem (Cont’d) Why variety of transliterations? Web content: 82% Romanized Majority’s native languages: other scripts Standard keyboards Non-Romanized sources normally transliterated (esp. on Web) Transliteration variations
6
Example 1: Tibetan Four languages: transliteration problems Hello in Tibetan Wylie (bkra shis bde legs) Tibetan Pinyin Several unofficial systems based on pronunciation Spelled/transcribed in several ways (with some guidelines)
7
Example 2: Malayalam No official transliteration system Transliteration based on personal preference (many unorganized variations) Script conversion programs: more consistent systems /maleja: m/ usu. transcribed “Malayalam” malayaaLam (Maya), Malajal- (Slavic)
8
Example 3: Romani Vlax Romani standard Literacy → few adopt standard Different countries, different official languages → different spellings No official systems (government) Several transliteration systems exist (often inconsistent)—as in last 2 languages
9
Example 4: Mandarin Hànyŭ Pīnyīn Tōngyòng Pīnyīn Wade-Giles Gwoyeu Romatzyh (Yóuzhèngshì Pīnyīn) (etc.)
10
Prior Work In Mandarin: geared towards Chinese users searching for information from West Western names-Hànzĭ-Hànyŭ Pīnyīn-Hànzĭ Algorithms designed for Arabic & Japanese transliteration Google This method designed for Western users searching for Chinese information
11
Initial Effort on Mandarin Practical first step: increased trade with China Simple transliteration problem (relatively) Modifications for Tibetan, Romani, Hindustani, etc. Intact for some other languages? (e.g. Russian, Arabic, Japanese, Korean) Input = Hànyŭ Pīnyīn; output = other systems
12
Initial Program Combined many systems Ying – yink – yenk – yenk’ – yemk’ – yermk’ – yarmk’ Instead of “victory,” searched for “Yarmuk” River in Middle East Transliteration systems organized by row but not by column
13
Organize into Transliteration Table
Entries for “beijing” in two systems (Purpose is to go from one column to another)
    Hanyu Pinyin    Wade-Giles
1   b               p
2   ei
3   j               ch
4   ing
14
Part of Patterns Table 8 systems
15
Decomposition
Search for “Beijing” in table
Delete one letter; search for “Beijin”
Beiji, Beij…B
Search for “eijing” (beijing – b) similarly
“Ei” found, search for “jing”
“J” found, search for “ing”
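The decomposition steps above amount to a greedy longest-prefix match against the first column of the table. The following is a minimal sketch of that idea, not the original program: the class name `PrefixDecomposer`, the method `decompose`, and the small entry set are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class PrefixDecomposer {
    // Hypothetical subset of first-column (Hanyu Pinyin) entries; the real table is larger.
    static final Set<String> PINYIN_ENTRIES = Set.of("b", "ei", "j", "ing", "in", "i");

    // Splits a word into table entries by repeatedly taking the longest matching prefix.
    static List<String> decompose(String word) {
        String rest = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        while (!rest.isEmpty()) {
            int end = rest.length();
            // Drop one trailing letter at a time until the remaining prefix is a table entry.
            while (end > 0 && !PINYIN_ENTRIES.contains(rest.substring(0, end))) {
                end--;
            }
            if (end == 0) {
                throw new IllegalArgumentException("No table entry matches the start of: " + rest);
            }
            parts.add(rest.substring(0, end));
            // Continue with what is left after the matched prefix.
            rest = rest.substring(end);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(decompose("Beijing")); // prints [b, ei, j, ing]
    }
}
```

With this entry set, “Beijing” is cut back letter by letter to “b”, then “eijing” to “ei”, then “jing” to “j”, leaving “ing”, which matches the sequence on the slide.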
16
Composing new search terms
Components: b, ei, j, ing
b → b, p
ei → ei
j → j, ch
ing → ing
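Composition can be read as taking every combination of the alternatives listed for each component. The sketch below illustrates that reading under stated assumptions: the class `CandidateComposer`, the method `compose`, and the map layout are hypothetical, and only the four mappings shown on the slide are included.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CandidateComposer {
    // Alternatives per component, copied from the slide: b -> b, p; ei -> ei; j -> j, ch; ing -> ing.
    static final Map<String, List<String>> ALTERNATIVES = Map.of(
            "b", List.of("b", "p"),
            "ei", List.of("ei"),
            "j", List.of("j", "ch"),
            "ing", List.of("ing"));

    // Builds every spelling obtainable by picking one alternative per component, in order.
    static List<String> compose(List<String> components) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (String component : components) {
            // Components missing from the map are kept unchanged (an assumption of this sketch).
            List<String> options = ALTERNATIVES.getOrDefault(component, List.of(component));
            List<String> extended = new ArrayList<>();
            for (String prefix : results) {
                for (String option : options) {
                    extended.add(prefix + option);
                }
            }
            results = extended;
        }
        return results;
    }

    public static void main(String[] args) {
        // prints [beijing, beiching, peijing, peiching]
        System.out.println(compose(List.of("b", "ei", "j", "ing")));
    }
}
```

Two components with two alternatives each give four candidate search terms, including the “Peiching” spelling mentioned earlier in the talk.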
17
Implementation Java program After composition, how does algorithm search? Connects to Google via Google API (Application Programming Interface) Google searches 1-2 second delay (due to Google)
18
Transliteration Patterns
Transliterations organized into table
{"üe", "yue", "yue", "ue", "ve", "üeh", "üeh", "üeh"}
lüe, lyue, lue, lve, lüeh
3 transliteration systems; at most 5 patterns
First column Hànyŭ Pīnyīn like “ing” “b” “ei”
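One way to see why an eight-column row yields at most five patterns is to attach an initial to each entry and drop the duplicates. This is a small sketch under assumptions: the class `PatternRow` and the method `variants` are hypothetical names, and the row is the one quoted on the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class PatternRow {
    // Row copied from the slide: one entry per system column, first column Hanyu Pinyin.
    static final String[] UE_ROW = {"üe", "yue", "yue", "ue", "ve", "üeh", "üeh", "üeh"};

    // Prefixes each pattern with the initial and keeps the first occurrence of each spelling.
    static List<String> variants(String initial, String[] row) {
        Set<String> seen = new LinkedHashSet<>(); // preserves column order while removing repeats
        for (String pattern : row) {
            seen.add(initial + pattern);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(variants("l", UE_ROW));
    }
}
```

For the initial “l” the method returns five distinct spellings (lüe, lyue, lue, lve, lüeh), the same five listed on the slide.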
19
Transliteration Systems By Column
Only 3 systems (in effect)
Hànyŭ Pīnyīn (HP)
Tōngyòng Pīnyīn #1 (TP1) & Tōngyòng Pīnyīn #2 (TP2)
Modified Hànyŭ Pīnyīn #1 (MHP1) & Modified Hànyŭ Pīnyīn #2 (MHP2)
Wade-Giles #1 (WG1), Wade-Giles #2 (WG2), & Wade-Giles #3 (WG3)
20
Differences Between Transliteration System Variants
TP1: iu, ui, ‘
TP2: iou, uei, -
WG2: h’ung (not hung)
WG3: ts’u (not tz’u)
WG1: szu (not ssu)
21
Web version http://www.translitsearch.com/demos/demos.htm
22
Web search
23
What is the effect?
Search for 130 Pinyin cities/regions
16 – no other transliteration
60 – at least two others
6 – three or more
How much did Xiaozhi find? (8% more)
5 min. 12 sec. – entire search
24
Further work 1 Include Yale, GR (Gwoyeu Romatzyh), &c. YZSPY (Yóuzhèngshì Pīnyīn) Accents Hanja- and Kanji-based transliterations Application to research archives
25
Further Work 2 Improvements in accuracy of transliteration Search in other transliterations Japanese version of current paper Hindustani version Romani with Indic cognates Extension to translation (transliterated Mandarin-Cantonese characters)
26
Solutions for Tibetan Start with Wylie Xiaozhi with adjustments Dzongkha Dzongkha-based variations? Analysis of common transliteration patterns (usu. based on closest pronunciation)
27
Solutions for Malayalam Start with Maya (script conversion program) Include minor variations from other script conversion programs Analysis of transliterations used
28
Solutions for Romani Start with Vlax Romani Standard Regional variations Some transliterations easier to use on computers e.g. chh, sh to omit hacek
29
Conclusions
Enhances search by finding alternate transliterations
– Applied to Mandarin
– Applicable to other languages
Applicable to lesser-studied (& other) languages
Language- (or script-) specific