A Method for Enhancing Search Using Transliteration of Mandarin Chinese
Vijay John

Transliterated Mandarin Search
- Google suggests a spelling correction

Alternate Transliterations?
- Want to say “Did you mean Peiching?”

Transliteration Problems
- “Beijing” provides many results
- Google doesn’t find “Peiching,” “Peking,” “Bukgyeong,” etc.
- Many pages use a variety of transliterations
- Transliterations are unorganized
- This paper organizes them for Mandarin Chinese

The Problem (Cont’d)
- Why the variety of transliterations?
- Web content: 82% Romanized
- Majority’s native languages: other scripts
- Standard keyboards
- Non-Romanized sources normally transliterated (esp. on Web)
- Transliteration variations

Example 1: Tibetan
- Four languages: transliteration problems
- Hello in Tibetan
- Wylie (bkra shis bde legs)
- Tibetan Pinyin
- Several unofficial systems based on pronunciation
- Spelled/transcribed in several ways (with some guidelines)

Example 2: Malayalam
- No official transliteration system
- Transliteration based on personal preference (many unorganized variations)
- Script conversion programs: more consistent systems
- /maleja:  m/ usu. transcribed “Malayalam”
- malayaaLam (Maya), Malajal- (Slavic)

Example 3: Romani
- Vlax Romani standard
- Literacy → few adopt standard
- Different countries, different official languages → different spellings
- No official systems (government)
- Several transliteration systems exist (often inconsistent), as in last 2 languages

Example 4: Mandarin
- Hànyǔ Pīnyīn
- Tōngyòng Pīnyīn
- Wade-Giles
- Gwoyeu Romatzyh
- (Yóuzhèngshì Pīnyīn)
- (etc.)

Prior Work
- In Mandarin: geared towards Chinese users searching for information from the West
- Western names-Hànzì-Hànyǔ Pīnyīn-Hànzì
- Algorithms designed for Arabic & Japanese transliteration
- Google
- This method is designed for Western users searching for Chinese information

Initial Effort on Mandarin
- Practical first step: increased trade with China
- Simple transliteration problem (relatively)
- Modifications for Tibetan, Romani, Hindustani, etc.
- Intact for some other languages? (e.g. Russian, Arabic, Japanese, Korean)
- Input = Hànyǔ Pīnyīn; output = other systems

Initial Program
- Combined many systems
- Ying → yink → yenk → yenk’ → yemk’ → yermk’ → yarmk’
- Instead of “victory,” searched for “Yarmuk” River in Middle East
- Transliteration systems organized by row but not by column

Organize into Transliteration Table
- Entries for “beijing” in two systems (purpose is to go from one column to another)

     Hanyu Pinyin   Wade-Giles
  1  b              p
  2  ei             ei
  3  j              ch
  4  ing            ing

Part of Patterns Table
- 8 systems

Decomposition
- Search for “Beijing” in table
- Delete one letter; search for “Beijin”
- Beiji, Beij … B
- Search for “eijing” (beijing – b) similarly
- “Ei” found; search for “jing”
- “J” found; search for “ing”
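
The decomposition steps above can be sketched as a longest-prefix match. This is a minimal illustration, not the author's Java program: the function name `decompose` and the four-entry pattern set are assumptions, and the real first column of the table is larger.

```python
# Hypothetical subset of the table's first column (Hanyu Pinyin patterns).
PINYIN_PATTERNS = {"b", "ei", "j", "ing"}

def decompose(word, patterns=PINYIN_PATTERNS):
    """Split a Pinyin word into table patterns by longest-prefix matching."""
    word = word.lower()
    components = []
    start = 0
    while start < len(word):
        # Try the whole remainder first, then drop one trailing letter at a time.
        for end in range(len(word), start, -1):
            if word[start:end] in patterns:
                components.append(word[start:end])
                start = end  # continue with what is left after the match
                break
        else:
            raise ValueError(f"no pattern matches {word[start:]!r}")
    return components

print(decompose("Beijing"))  # ['b', 'ei', 'j', 'ing']
```

Trying the longest candidate first mirrors the slide's order ("Beijing", "Beijin", ... "B"); with a fuller table it would also prefer a long pattern over a shorter one that shares its first letters.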

Composing new search terms
- Components: b, ei, j, ing
- b → b, p
- ei → ei
- j → j, ch
- ing → ing
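
A minimal sketch of the composition step, using only the two-system rows shown for "beijing". It assumes that each new search term is built by reading down a single system's column, which the slides imply ("go from one column to another") but do not state outright; `TABLE`, `SYSTEMS`, and `compose` are hypothetical names.

```python
# Rows from the slides: (Hanyu Pinyin, Wade-Giles) spelling of each component.
TABLE = {
    "b": ("b", "p"),
    "ei": ("ei", "ei"),
    "j": ("j", "ch"),
    "ing": ("ing", "ing"),
}
SYSTEMS = ("Hanyu Pinyin", "Wade-Giles")

def compose(components, table=TABLE, systems=SYSTEMS):
    """Build one candidate spelling per system by reading down one column."""
    terms = {}
    for col, system in enumerate(systems):
        # Assumption: components are never mixed across systems.
        terms[system] = "".join(table[c][col] for c in components)
    return terms

print(compose(["b", "ei", "j", "ing"]))
# {'Hanyu Pinyin': 'beijing', 'Wade-Giles': 'peiching'}
```

Keeping each term inside one column avoids mixed spellings such as "beiching", which belong to no single system.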

Implementation
- Java program
- After composition, how does the algorithm search?
- Connects to Google via the Google API (Application Programming Interface)
- Google searches
- 1-2 second delay (due to Google)

Transliteration Patterns
- Transliterations organized into table
- {"üe", "yue", "yue", "ue", "ve", "üeh", "üeh", "üeh"}
- lüe, lyue, lue, lve, lüeh
- 3 transliteration systems; at most 5 patterns
- First column Hànyǔ Pīnyīn, like “ing” “b” “ei”
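
The eight-entry row above collapses to five distinct spellings once duplicates are dropped. A short sketch of that reduction follows; the helper name `distinct_spellings` is an assumption, and the initial "l" is taken to be spelled the same way in every system, as in the slide's example.

```python
# The row quoted in the slides, one entry per system column.
ROW = ["üe", "yue", "yue", "ue", "ve", "üeh", "üeh", "üeh"]

def distinct_spellings(initial, row):
    """Prefix each distinct pattern with the initial, keeping column order."""
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return [initial + pattern for pattern in dict.fromkeys(row)]

print(distinct_spellings("l", ROW))
# ['lüe', 'lyue', 'lue', 'lve', 'lüeh']
```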

Transliteration Systems By Column
- Only 3 systems (in effect)
- Hànyǔ Pīnyīn (HP)
- Tōngyòng Pīnyīn #1 (TP1) & Tōngyòng Pīnyīn #2 (TP2)
- Modified Hànyǔ Pīnyīn #1 (MHP1) & Modified Hànyǔ Pīnyīn #2 (MHP2)
- Wade-Giles #1 (WG1), Wade-Giles #2 (WG2), & Wade-Giles #3 (WG3)
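
To show how the eight columns relate to the system labels, the sketch below attaches labels to the "üe" row quoted in the slides and groups the systems that share a spelling. The column order is an assumption: the slides confirm only that Hanyu Pinyin comes first, and the rest is inferred from the order in which the systems are listed. `COLUMNS` and `systems_by_spelling` are hypothetical names.

```python
from collections import defaultdict

# Assumed column order; only the first column (HP) is confirmed by the slides.
COLUMNS = ["HP", "TP1", "TP2", "MHP1", "MHP2", "WG1", "WG2", "WG3"]

# Example row quoted in the slides.
ROW = ["üe", "yue", "yue", "ue", "ve", "üeh", "üeh", "üeh"]

def systems_by_spelling(row, columns=COLUMNS):
    """Group system labels by the spelling they share in this row."""
    if len(row) != len(columns):
        raise ValueError("row must have one entry per system column")
    groups = defaultdict(list)
    for system, spelling in zip(columns, row):
        groups[spelling].append(system)
    return dict(groups)

for spelling, systems in systems_by_spelling(ROW).items():
    print(spelling, systems)
```

Under this assumed order, the two Tongyong variants agree with each other on this row, as do the three Wade-Giles variants, while the two Modified Hanyu Pinyin variants differ.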

Differences Between Transliteration System Variants
- TP1: iu, ui, ‘
- TP2: iou, uei, -
- WG2: h’ung (not hung)
- WG3: ts’u (not tz’u)
- WG1: szu (not ssu)

Web version

Web search

What is the effect?
- Search for 130 Pinyin cities/regions
- 16: no other transliteration
- 60: at least two others
- 6: three or more
- How much did Xiaozhi find? (8% more)
- 5 min. 12 sec.: entire search

Further Work 1
- Include Yale, GR (Gwoyeu Romatzyh), etc.
- YZSPY (Yóuzhèngshì Pīnyīn)
- Accents
- Hanja- and Kanji-based transliterations
- Application to research archives

Further Work 2
- Improvements in accuracy of transliteration
- Search in other transliterations
- Japanese version of current paper
- Hindustani version
- Romani with Indic cognates
- Extension to translation (transliterated Mandarin-Cantonese characters)

Solutions for Tibetan
- Start with Wylie
- Xiaozhi with adjustments
- Dzongkha
- Dzongkha-based variations?
- Analysis of common transliteration patterns (usu. based on closest pronunciation)

Solutions for Malayalam
- Start with Maya (script conversion program)
- Include minor variations from other script conversion programs
- Analysis of transliterations used

Solutions for Romani
- Start with Vlax Romani Standard
- Regional variations
- Some transliterations are easier to use on computers, e.g. chh, sh to omit the hacek

Conclusions
- Enhances search by finding alternate transliterations
  - Applied to Mandarin
  - Applicable to other languages
- Applicable to lesser-studied (& other) languages
- Language- (or script-) specific