A Study on Implementation of Southern-Min Taiwanese Tone Sandhi System

Presentation transcript:

1 A Study on Implementation of Southern-Min Taiwanese Tone Sandhi System. Iuⁿ Un-gian, Lau Kiat-gak, Li Sheng-an, Kao Cheng-yan. Dept. of Computer Sci. and Info. Eng., National Taiwan Univ., Taiwan. PACLIC /1~3

2 Paper Outline-1 In the past two hundred years or so, a sizable corpus of Taiwanese text in Latin script has been accumulated. However, due to the political and historical situation of Taiwan, few people can read these materials at present. It is regrettable that the utilization of these plentiful materials is very low. This paper addresses problems raised by the Taiwanese tone sandhi system by describing a set of computational rules to approximate this system, as well as the results obtained from our implementation.

3 Paper Outline-2 Using Latinized Taiwanese text as the source, we take the sentence as the unit, translate every word into Chinese via a Taiwanese-Chinese dictionary, and obtain its POS information from the data provided by the CKIP group of the Academia Sinica. Using the POS data and the tone sandhi rules we formulated on linguistic grounds, we then tag each syllable with its post-sandhi tone marker.

4 Paper Outline-3 Finally we implemented a Taiwanese tone sandhi processing system which takes a Latinized sentence as input and outputs the tone markers. We were able to obtain an accuracy rate of 97.56% and 88.90% with training and testing data, respectively. We analyze the sources of error for the purpose of future improvement. Keywords: written Taiwanese, tone sandhi system, Taiwanese latinization

5 Tone Sandhi at Word Level -1 Normal sandhi: most cases follow this rule: 1 → 7; 2 → 1; 3 → 2; 4 → 2 with final -h (8 with final -p, -t, -k); 5 → 7 (3); 7 → 3; 8 → 3 with final -h (4 with final -p, -t, -k).

6 Tone Sandhi at Word Level -2 Following sandhi: this pattern generally occurs on pronouns or the suffixes of names. The tone pitch depends on that of the immediately preceding syllable and is either tone 1, 3, or 7. Neutral sandhi: the previous syllable is read in its base tone, and the neutral-sandhi syllables are read softly, as if they were tone 3 or tone 4. Double sandhi: this pattern mostly appears in syllables ending in the glottal stop (-h) and having tone 4. The normal sandhi rules are applied twice in sequence (i.e. tone 4 → tone 2 → tone 1).

7 Tone Sandhi at Word Level -3 Pre-á sandhi: syllables before á do not follow the normal sandhi rules unless they are tone 1 or tone 2. Triplicated sandhi: the first syllable of a triplicated word does not follow the normal sandhi rules unless it is tone 2, 3, or 4. Rising sandhi: this pattern usually occurs in loanwords from Japanese; the sandhi tone is similar to tone 5.
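
The normal and double sandhi rules listed on slides 5 and 6 can be sketched as a small lookup. This is a minimal illustration, not the authors' implementation: the names `normal_sandhi` and `double_sandhi`, the `final` argument, and the `tone5_variant` parameter (which picks between the two tone-5 outcomes given on slide 5) are assumptions introduced here. The other word-level patterns are not covered because the slides do not give their full tables.

```python
# Open-syllable tones from the "normal sandhi" list on slide 5.
NORMAL_SANDHI = {1: 7, 2: 1, 3: 2, 7: 3}


def normal_sandhi(tone, final="", tone5_variant=7):
    """Return the post-sandhi tone for one syllable.

    `final` is the syllable-final stop ("h", "p", "t", "k") for the
    checked tones 4 and 8. `tone5_variant` is a hypothetical parameter
    that selects between the two outcomes the slide gives for tone 5.
    """
    if tone in (4, 8):
        if final == "h":
            return 2 if tone == 4 else 3
        if final in ("p", "t", "k"):
            return 8 if tone == 4 else 4
        raise ValueError("tones 4 and 8 need a final of h, p, t or k")
    if tone == 5:
        if tone5_variant not in (7, 3):
            raise ValueError("tone 5 becomes 7 or 3")
        return tone5_variant
    if tone not in NORMAL_SANDHI:
        raise ValueError(f"no normal sandhi rule for tone {tone}")
    return NORMAL_SANDHI[tone]


def double_sandhi(tone, final):
    """Apply the normal rule twice, e.g. tone 4 with -h: 4 -> 2 -> 1."""
    first = normal_sandhi(tone, final)
    # The second step uses the open-syllable table, as in the slide's example.
    return normal_sandhi(first)
```

For instance, `normal_sandhi(4, "h")` gives 2, and `double_sandhi(4, "h")` continues to 1, matching the 4 → 2 → 1 chain on slide 6.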

8 Tone Sandhi at Sentence Level In brief, tonal groups are related to syntax in such a way that a sentence can be cut into a sequence of tonal groups on the basis of its syntactic structural description. A sentence has one or more tonal groups; a boundary falls at the last syllable of the sentence, at the syllable preceding ê, at the last syllable of a noun phrase, and so on. The boundary syllable is pronounced in its base tone. In fact, the full account is a very long story.
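
The three boundary conditions named on this slide can be sketched as follows. This is only an illustration under stated assumptions: the function `boundary_flags`, the constant `E_PARTICLE`, and the `np_final` argument (indices of words that end a noun phrase, assumed to come from some earlier syntactic step) are hypothetical names, and the slide makes clear that the real set of conditions is longer.

```python
E_PARTICLE = "ê"  # the particle whose preceding syllable is a boundary


def boundary_flags(words, np_final=frozenset()):
    """Mark tonal-group boundary syllables.

    `words` is a list of words, each given as a list of syllables.
    `np_final` holds the indices of words that end a noun phrase
    (assumed to be supplied by the caller).
    Returns a flat list of (syllable, keeps_base_tone) pairs.
    """
    flags = []
    for i, word in enumerate(words):
        is_last_word = i == len(words) - 1
        before_e = not is_last_word and words[i + 1] == [E_PARTICLE]
        is_boundary_word = is_last_word or before_e or i in np_final
        for j, syllable in enumerate(word):
            # Only the final syllable of a boundary word keeps its base tone.
            is_final_syllable = j == len(word) - 1
            flags.append((syllable, is_boundary_word and is_final_syllable))
    return flags
```

Every syllable flagged `False` would then go through the word-level sandhi rules, while flagged syllables stay in their base tone.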

9 Our method -1 Method: we use a rule-based rather than a statistical method because no public training data are available at present. Data: we select 8 segments of Latinized Taiwanese text from 4 articles as training data, with publication dates ranging from the 1910s to the 1960s and 614 syllables in total; another 8 segments of text serve as testing data, with publication dates ranging from the 1880s to the 1990s and 955 syllables in total. POS: we obtain the corresponding Chinese translation for each Taiwanese word by looking it up in the Taiwanese-Chinese On-line Dictionary. We then look up the POS of the Chinese word in the CKIP database.
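
The two-step POS lookup described on this slide might look roughly like the sketch below. The two dictionaries are toy stand-ins for the Taiwanese-Chinese On-line Dictionary and the CKIP database; their entries, the tag strings, and the function names `lookup_pos` and `tag_sentence` are illustrative assumptions, not data or interfaces from the paper.

```python
# Toy stand-ins for the two resources named on the slide.
# All entries and tag strings are illustrative placeholders.
TAIWANESE_TO_CHINESE = {"soaⁿ": "山", "hái": "海", "ū": "有"}
CHINESE_TO_POS = {"山": "Na", "海": "Na", "有": "V_2"}


def lookup_pos(word):
    """Taiwanese word -> Chinese translation -> POS tag, or None."""
    chinese = TAIWANESE_TO_CHINESE.get(word)
    if chinese is None:
        return None  # word missing from the bilingual dictionary
    return CHINESE_TO_POS.get(chinese)  # None if the POS is unknown


def tag_sentence(words):
    """Attach a POS tag (or None) to every word of a sentence."""
    return [(word, lookup_pos(word)) for word in words]
```

Returning `None` for unknown words keeps the gaps visible, which matters because the later slides name dictionary coverage and POS ambiguity as error sources.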

10 Our method -2 Rules: we formulate 20 rules on 4 different levels: the syllable, the word, the POS, and the sentence pattern (syntax). Example: Chhin-chhiūⁿ án-ni lâi kóng, chāi lán Tâi-ôan kīn-kīn chít-tiap-á-kú ê kang-hu, ài soaⁿ chiū ū soaⁿ, ài hái chiū ū hái, beh jóah chiū ū jóah, kôaⁿ chiū ū kôaⁿ. (如此說來，在台灣只要花一點工夫，要山就有山、要海就有海；要熱就有熱、冷就有冷。 In English: "That is to say, in Taiwan, with only a little time, if you want mountains there are mountains, if you want sea there is sea; if you want heat there is heat, and if cold, there is cold.") → Chhin-chhiūⁿ án-ni# lâi kóng#, chāi lán Tâi-ôan# kīn-kīn chít-tiap&-á-kú# ê kang-hu#, ài soaⁿ# chiū ū soaⁿ#, ài hái# chiū ū hái#, beh jóah# chiū ū jóah#, kôaⁿ# chiū ū kôaⁿ#. (tone markers added)
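
To work with output like the marked sentence above, one could split it into syllables and read off the trailing marker of each. The sketch below is an assumption-laden illustration: the slides do not define the markers, so the reading that "#" flags a syllable kept in its base tone and "&" flags a special word-level case (here, the syllable before á) is a guess from the example, and `parse_marked` is a hypothetical helper name.

```python
import re

# A syllable is any run of characters other than whitespace, hyphens,
# commas, and full stops; a marker, if present, is its last character.
SYLLABLE_RE = re.compile(r"[^\s\-,.]+")
MARKERS = "#&"


def parse_marked(sentence):
    """Return (syllable, marker) pairs; marker is '#', '&', or ''."""
    pairs = []
    for token in SYLLABLE_RE.findall(sentence):
        marker = token[-1] if token[-1] in MARKERS else ""
        syllable = token[:-1] if marker else token
        pairs.append((syllable, marker))
    return pairs
```

Applied to the fragment `chít-tiap&-á-kú# ê kang-hu#`, this yields seven pairs, with "&" on tiap and "#" on kú and hu; all unmarked syllables would be the ones that undergo normal sandhi under the reading assumed here.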

11 Results
Accuracy rates of sandhi marks:
Data set | Syllables | Errors | Acc. rate
Training data | 614 | | 97.56%
Testing data | 955 | | 88.90%
Problems: lack of POS standards for Taiwanese; lack of a word segmentation standard, and of a dictionary following that standard, for Taiwanese; lack of standardization of written Taiwanese; some tone sandhi problems cannot be solved by POS order.
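
The error counts in the results table did not survive in this transcript, but they can be estimated from the syllable totals on slide 9 and the accuracy rates on slide 4. The sketch below does that arithmetic; the resulting counts are an inference from those figures, not values reported by the authors, and the function names are introduced here for illustration.

```python
def implied_errors(syllables, accuracy_pct):
    """Estimate the error count behind a reported accuracy percentage."""
    return round(syllables * (1 - accuracy_pct / 100))


def accuracy_pct(syllables, errors):
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100 * (syllables - errors) / syllables, 2)


# Syllable totals from slide 9, accuracy rates from slide 4.
for name, total, reported in [("training", 614, 97.56), ("testing", 955, 88.90)]:
    errors = implied_errors(total, reported)
    print(name, total, errors, accuracy_pct(total, errors))
```

Under this estimate the training set has about 15 errors out of 614 syllables and the testing set about 106 out of 955, and both counts round back to the reported rates.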

12 Future Work Solicit assistance from linguists; improve word segmentation, especially the processing of morphology, quantitative words, and proper nouns; improve the processing of POS tags to account for ambiguity; improve the part-of-speech dictionary; improve the sandhi rules; find alternative ways of modeling sandhi processing.