A Method for Enhancing Search Using Transliteration of Mandarin Chinese Vijay John


1 A Method for Enhancing Search Using Transliteration of Mandarin Chinese Vijay John vijayjohn@mail.utexas.edu

2 Transliterated Mandarin Search: Google suggests a spelling correction

3 Alternate Transliterations? Want to say “Did you mean Peiching?”

4 Transliteration Problems: “Beijing” provides many results; Google doesn’t find “Peiching,” “Peking,” “Bukgyeong,” etc.; many pages use a variety of transliterations; transliterations are unorganized; this paper organizes them for Mandarin Chinese.

5 The Problem (Cont’d): Why a variety of transliterations? Web content: 82% Romanized; majority’s native languages: other scripts; standard keyboards; non-Romanized sources normally transliterated (esp. on Web); transliteration variations.

6 Example 1: Tibetan. Four languages with transliteration problems; “Hello” in Tibetan: Wylie (bkra shis bde legs); Tibetan Pinyin; several unofficial systems based on pronunciation; spelled/transcribed in several ways (with some guidelines).

7 Example 2: Malayalam. No official transliteration system; transliteration based on personal preference (many unorganized variations); script conversion programs: more consistent systems; /maleja:  m/ usu. transcribed “Malayalam”; malayaaLam (Maya), Malajal- (Slavic).

8 Example 3: Romani. Vlax Romani standard; literacy → few adopt standard; different countries, different official languages → different spellings; no official systems (government); several transliteration systems exist (often inconsistent), as in the last 2 languages.

9 Example 4: Mandarin. Hànyǔ Pīnyīn; Tōngyòng Pīnyīn; Wade-Giles; Gwoyeu Romatzyh; (Yóuzhèngshì Pīnyīn); (etc.)

10 Prior Work. In Mandarin: geared towards Chinese users searching for information from the West; Western names-Hànzì-Hànyǔ Pīnyīn-Hànzì; algorithms designed for Arabic & Japanese transliteration; Google; this method is designed for Western users searching for Chinese information.

11 Initial Effort on Mandarin. Practical first step: increased trade with China; simple transliteration problem (relatively); modifications for Tibetan, Romani, Hindustani, etc.; intact for some other languages? (e.g. Russian, Arabic, Japanese, Korean); input = Hànyǔ Pīnyīn, output = other systems.

12 Initial Program. Combined many systems: ying – yink – yenk – yenk’ – yemk’ – yermk’ – yarmk’; instead of “victory,” searched for “Yarmuk” River in Middle East; transliteration systems organized by row but not by column.

13 Organize into Transliteration Table. Entries for “beijing” in two systems (purpose is to go from one column to another):
   #  Hanyu Pinyin  Wade-Giles
   1  b             p
   2  ei            ei
   3  j             ch
   4  ing           ing

14 Part of Patterns Table: 8 systems

15 Decomposition. Search for “Beijing” in table. Delete one letter and search for “Beijin”, then “Beiji”, “Beij”, … “B”. Search for “eijing” (“beijing” minus “b”) similarly. “Ei” found; search for “jing”. “J” found; search for “ing”.
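The decomposition steps on this slide amount to a greedy longest-prefix match against the table's first column. A minimal Java sketch of that idea follows; the class name `PinyinDecomposer`, the method `decompose`, and the small `PATTERNS` set are hypothetical stand-ins, not the author's code or the real patterns table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PinyinDecomposer {
    // Hypothetical subset of the first (Hanyu Pinyin) column of the patterns table.
    static final Set<String> PATTERNS = Set.of("b", "p", "j", "ch", "ei", "ing", "in", "i");

    // Splits a word into table entries by trying the whole remaining string first,
    // then dropping one trailing letter at a time until an entry is found.
    static List<String> decompose(String word) {
        List<String> parts = new ArrayList<>();
        String rest = word.toLowerCase(Locale.ROOT);
        while (!rest.isEmpty()) {
            int end = rest.length();
            while (end > 0 && !PATTERNS.contains(rest.substring(0, end))) {
                end--; // delete one letter from the end and search again
            }
            if (end == 0) {
                throw new IllegalArgumentException("No table entry matches the start of: " + rest);
            }
            parts.add(rest.substring(0, end));
            rest = rest.substring(end); // continue with what is left, e.g. "eijing"
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(decompose("Beijing")); // prints [b, ei, j, ing]
    }
}
```

With this toy table, “Beijing” is split into b, ei, j, ing, the same components listed on the next slide. A fuller table would need care wherever a longer entry shadows the intended split.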

16 Composing new search terms. Components: b, ei, j, ing. b → b, p; ei → ei; j → j, ch; ing → ing.
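One way to read the composition step, given that slide 13 says the purpose is to go from one column to another, is to pick a target system and concatenate each component's entry in that column. The Java sketch below assumes exactly that; `ColumnComposer`, `TABLE`, and `compose` are hypothetical names, and the two-column table holds only the “beijing” rows from slide 13. Whether the original program ever mixes columns within one word is not stated in the slides.

```java
import java.util.List;
import java.util.Map;

class ColumnComposer {
    // Rows keyed by the Hanyu Pinyin component.
    // Column 0 = Hanyu Pinyin, column 1 = Wade-Giles (the two systems on slide 13).
    static final Map<String, String[]> TABLE = Map.of(
        "b", new String[] {"b", "p"},
        "ei", new String[] {"ei", "ei"},
        "j", new String[] {"j", "ch"},
        "ing", new String[] {"ing", "ing"});

    // Builds one search term by reading every component from the same column.
    static String compose(List<String> components, int column) {
        StringBuilder term = new StringBuilder();
        for (String component : components) {
            String[] row = TABLE.get(component);
            if (row == null) {
                throw new IllegalArgumentException("No table row for: " + component);
            }
            term.append(row[column]);
        }
        return term.toString();
    }

    public static void main(String[] args) {
        List<String> parts = List.of("b", "ei", "j", "ing");
        System.out.println(compose(parts, 0)); // prints beijing
        System.out.println(compose(parts, 1)); // prints peiching
    }
}
```

Reading a single column per term avoids the mixed-system strings (such as the “yarmk’” chain) that slide 12 reports for the initial program.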

17 Implementation. Java program; after composition, how does the algorithm search? Connects to Google via Google API (Application Programming Interface); Google searches; 1-2 second delay (due to Google).

18 Transliteration Patterns. Transliterations organized into table: {"üe", "yue", "yue", "ue", "ve", "üeh", "üeh", "üeh"}; lüe, lyue, lue, lve, lüeh; 3 transliteration systems, at most 5 patterns; first column is Hànyǔ Pīnyīn, like “ing”, “b”, “ei”.
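The row quoted on this slide has eight columns but only five distinct spellings, which is why “lüe” yields five variants. A short Java sketch of that deduplication is given below. The names `PatternRow`, `UE_ROW`, and `variants` are hypothetical, the column order is assumed to follow the list on slide 19, and the initial “l” is assumed to be spelled the same in every column.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class PatternRow {
    // The row quoted on the slide; \u00fc is the letter u with diaeresis.
    // Assumed column order: HP, TP1, TP2, MHP1, MHP2, WG1, WG2, WG3.
    static final String[] UE_ROW = {
        "\u00fce", "yue", "yue", "ue", "ve", "\u00fceh", "\u00fceh", "\u00fceh"};

    // Prefixes the initial to each column entry and keeps the first
    // occurrence of every distinct spelling, in column order.
    static List<String> variants(String initial) {
        LinkedHashSet<String> distinct = new LinkedHashSet<>();
        for (String pattern : UE_ROW) {
            distinct.add(initial + pattern);
        }
        return new ArrayList<>(distinct);
    }

    public static void main(String[] args) {
        // Five distinct variants from eight columns.
        System.out.println(variants("l").size()); // prints 5
    }
}
```

Collapsing duplicates before searching keeps the number of queries per word at the number of distinct patterns rather than the number of columns.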

19 Transliteration Systems By Column. Only 3 systems (in effect); Hànyǔ Pīnyīn (HP); Tōngyòng Pīnyīn #1 (TP1) & Tōngyòng Pīnyīn #2 (TP2); Modified Hànyǔ Pīnyīn #1 (MHP1) & Modified Hànyǔ Pīnyīn #2 (MHP2); Wade-Giles #1 (WG1), Wade-Giles #2 (WG2), & Wade-Giles #3 (WG3).

20 Differences Between Transliteration System Variants. TP1: iu, ui, ‘; TP2: iou, uei, -; WG2: h’ung (not hung); WG3: ts’u (not tz’u); WG1: szu (not ssu).

21 Web version http://www.translitsearch.com/demos/demos.htm

22 Web search

23 What is the effect? Search for 130 Pinyin cities/regions: 16 – no other transliteration; 60 – at least two others; 6 – three or more. How much did Xiaozhi find? (8% more). 5 min. 12 sec. – entire search.

24 Further Work 1. Include Yale, GR (Gwoyeu Romatzyh), &c.; YZSPY (Yóuzhèngshì Pīnyīn); accents; Hanja- and Kanji-based transliterations; application to research archives.

25 Further Work 2. Improvements in accuracy of transliteration; search in other transliterations; Japanese version of current paper; Hindustani version; Romani with Indic cognates; extension to translation (transliterated Mandarin-Cantonese characters).

26 Solutions for Tibetan. Start with Wylie; Xiaozhi with adjustments; Dzongkha; Dzongkha-based variations?; analysis of common transliteration patterns (usu. based on closest pronunciation).

27 Solutions for Malayalam. Start with Maya (script conversion program); include minor variations from other script conversion programs; analysis of transliterations used.

28 Solutions for Romani. Start with Vlax Romani Standard; regional variations; some transliterations easier to use on computers, e.g. chh, sh to omit hacek.

29 Conclusions. Enhances search by finding alternate transliterations (applied to Mandarin; applicable to other languages); applicable to lesser-studied (& other) languages; language- (or script-) specific.

