1 A Study on Implementation of Southern-Min Taiwanese Tone Sandhi System Iu n Un-gian Lau Kiat-gak Li Sheng-an Kao Cheng-yan Dept. of Computer Sci. and Info. Eng., National Taiwan Univ., Taiwan PACLIC /1~3
2 Paper Outline-1 In the past two hundred years or so, a sizable corpus of Taiwanese text in Latin script has been accumulated. However, due to the political and historical situation of Taiwan, few people can read these materials at present. It is regrettable that the utilization of these plentiful materials is very low. This paper addresses problems raised by the Taiwanese tone sandhi system by describing a set of computational rules to approximate this system, as well as the results obtained from our implementation.
3 Paper Outline-2 Using Latinized Taiwanese text as the source and the sentence as the processing unit, we translate every word into Chinese via a Taiwanese-Chinese dictionary and look up its POS information in the database compiled by the CKIP group at Academia Sinica. Using the POS data and tone sandhi rules that we formulated from linguistic descriptions, we then tag each syllable with its post-sandhi tone marker.
4 Paper Outline-3 Finally we implemented a Taiwanese tone sandhi processing system which takes a Latinized sentence as input and outputs the tone markers. We were able to obtain an accuracy rate of 97.56% and 88.90% with training and testing data, respectively. We analyze the sources of error for the purpose of future improvement. Keywords: written Taiwanese, tone sandhi system, Taiwanese latinization
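5 Tone Sandhi at Word Level -1 Normal sandhi : most cases follow this rule: 1 → 7; 2 → 1; 3 → 2; 4 → 2 when the syllable ends in -h (8 when it ends in -p, -t, or -k); 5 → 7 (or 3); 7 → 3; 8 → 3 when the syllable ends in -h (4 when it ends in -p, -t, or -k).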
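The normal sandhi rule is a small lookup keyed on the base tone and, for the checked tones 4 and 8, on the final stop. The following is a minimal sketch of that lookup, not the system described in this talk; the function name `normal_sandhi` and the parameters `final` and `tone5_to` are assumptions introduced here for illustration.

```python
# Hypothetical helper: maps a base tone to its post-sandhi tone
# under the normal sandhi rule listed on this slide.
UNCHECKED = {1: 7, 2: 1, 3: 2, 7: 3}


def normal_sandhi(tone: int, final: str = "", tone5_to: int = 7) -> int:
    """Return the post-sandhi tone.

    tone     -- base tone number (1, 2, 3, 4, 5, 7, or 8)
    final    -- final stop of a checked syllable: "h", "p", "t", or "k"
    tone5_to -- target of tone 5; the slide gives 7, with 3 as a variant
    """
    if tone in (4, 8):
        if final == "h":
            return 2 if tone == 4 else 3
        if final in ("p", "t", "k"):
            return 8 if tone == 4 else 4
        raise ValueError("tones 4 and 8 need a final of h, p, t, or k")
    if tone == 5:
        return tone5_to
    if tone in UNCHECKED:
        return UNCHECKED[tone]
    raise ValueError(f"unknown base tone: {tone}")


print(normal_sandhi(1))       # 7
print(normal_sandhi(4, "h"))  # 2
print(normal_sandhi(4, "t"))  # 8
```

Keeping the tone 5 target as a parameter leaves the choice between 7 and 3 to the caller instead of fixing one variant in the table.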
6 Tone Sandhi at Word Level -2 Following sandhi : this pattern generally occurs on pronouns or the suffixes of names. The tone depends on that of the immediately preceding syllable and is tone 1, 3, or 7. Neutral sandhi : the previous syllable is read in its base tone, and the neutral-sandhi syllables are read softly, as if they were tone 3 or tone 4. Double sandhi : this pattern mostly appears in syllables ending in the glottal stop (-h) and having tone 4. The normal sandhi rules are applied twice in sequence (i.e. tone 4 → tone 2 → tone 1).
7 Tone Sandhi at Word Level -3 Pre-á sandhi : syllables before á do not follow the normal sandhi rules unless they are tone 1 or tone 2. Triplicated sandhi : the first syllable of a triplicated word does not follow the normal sandhi rules unless it is tone 2, 3, or 4. Rising sandhi : this pattern usually occurs in loanwords from Japanese; the sandhi tone is similar to tone 5.
8 Tone Sandhi at Sentence Level In brief, tonal groups are related to syntax in such a way that a sentence can be cut into a sequence of tonal groups on the basis of its syntactic structural description. A sentence has one or more tonal groups; boundaries fall on the last syllable of the sentence, the syllable preceding ê, the last syllable of a noun phrase, and so on. The boundary syllable is pronounced in its base tone. In fact, the full set of conditions is a very long story.
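Two of the boundary conditions named above can be checked without any syntactic analysis: the last syllable of the sentence and the syllable preceding ê. The sketch below marks only those two cases and is an illustration, not the rule set used in this work; the function name `mark_boundaries` and the input format (a list of words whose syllables are joined by hyphens) are assumptions. Noun-phrase boundaries would need POS or syntax information and are left out.

```python
import unicodedata


def mark_boundaries(words: list[str]) -> list[tuple[str, bool]]:
    """Pair each syllable with True if it is a tonal-group boundary.

    Only two conditions are covered: the final syllable of the
    sentence and the syllable immediately preceding the particle ê.
    """
    # Normalize so that precomposed and decomposed "ê" compare equal.
    words = [unicodedata.normalize("NFC", w) for w in words]
    particle = unicodedata.normalize("NFC", "ê")
    marked = []
    for i, word in enumerate(words):
        syllables = word.split("-")
        is_last_word = i == len(words) - 1
        next_is_particle = not is_last_word and words[i + 1] == particle
        for j, syllable in enumerate(syllables):
            is_word_final = j == len(syllables) - 1
            boundary = is_word_final and (is_last_word or next_is_particle)
            marked.append((syllable, boundary))
    return marked


phrase = ["chít-tiap-á-kú", "ê", "kang-hu"]
for syllable, boundary in mark_boundaries(phrase):
    print(syllable, "base tone" if boundary else "sandhi tone")
```

Every syllable not marked as a boundary would then be passed to the word-level sandhi rules.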
9 Our method -1 Method : we use a rule-based rather than a statistical method because no public training data are available at present. Data : we select 8 segments of Latinized Taiwanese text from 4 articles as training data, published between the 1910s and the 1960s and totaling 614 syllables; another 8 segments of text serve as testing data, published between the 1880s and the 1990s and totaling 955 syllables. POS : we obtain the corresponding Chinese translation for each Taiwanese word by looking it up in the Taiwanese-Chinese On-line Dictionary. We then look up the POS of the Chinese word in the CKIP database.
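The POS step is a two-stage lookup: Taiwanese word to Chinese translation, then Chinese word to POS tag. A minimal sketch of that chain follows. Both dictionaries are toy stand-ins written for this illustration, not the On-line Dictionary or the CKIP database, and the tags shown are only assumed to resemble CKIP-style labels.

```python
from typing import Optional

# Hypothetical toy data; the real resources are far larger.
TAIWANESE_TO_CHINESE = {"soaⁿ": "山", "hái": "海", "ū": "有", "chiū": "就"}
CHINESE_TO_POS = {"山": "Na", "海": "Na", "有": "V_2", "就": "D"}


def pos_tags(words: list[str]) -> list[tuple[str, Optional[str], Optional[str]]]:
    """Return (word, Chinese translation, POS) for each Taiwanese word.

    A word missing from either dictionary yields None in that slot,
    so later rules can tell that no POS information is available.
    """
    tagged = []
    for word in words:
        chinese = TAIWANESE_TO_CHINESE.get(word)
        pos = CHINESE_TO_POS.get(chinese) if chinese is not None else None
        tagged.append((word, chinese, pos))
    return tagged


for row in pos_tags(["ài", "soaⁿ", "chiū", "ū", "soaⁿ"]):
    print(row)
```

Returning None for unknown words, rather than raising, keeps a single missing dictionary entry from stopping the processing of a whole sentence.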
10 Our method -2 Rules : we formulate 20 rules on 4 different levels : the syllable, the word, the POS, and the sentence pattern (syntax).
Example : Chhin-chhiūⁿ án-ni lâi kóng, chāi lán Tâi-ôan kīn-kīn chít-tiap-á-kú ê kang-hu, ài soaⁿ chiū ū soaⁿ, ài hái chiū ū hái, beh jóah chiū ū jóah, kôaⁿ chiū ū kôaⁿ. (如此說來,在台灣只要花一點工夫,要山就有山、要海就有海;要熱就有熱、冷就有冷。 "That being so, in Taiwan it takes only a little effort: if you want mountains there are mountains, if you want the sea there is the sea; if you want heat there is heat, and cold there is cold.")
→ Chhin-chhiūⁿ án-ni# lâi kóng#, chāi lán Tâi-ôan# kīn-kīn chít-tiap&-á-kú# ê kang-hu#, ài soaⁿ# chiū ū soaⁿ#, ài hái# chiū ū hái#, beh jóah# chiū ū jóah#, kôaⁿ# chiū ū kôaⁿ#. (tone markers added)
11 Results Accuracy rates of sandhi marks :
                 Syllables   Errors   Acc. Rate
Training data    614                  97.56%
Testing data     955                  88.90%
Problems : Lack of POS standards for Taiwanese ; Lack of a word segmentation standard, and of a dictionary following such a standard, for Taiwanese ; Lack of standardization of written Taiwanese ; Some tone sandhi problems cannot be solved by POS order.
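As a consistency check on the reported figures (614 training syllables at 97.56%, 955 testing syllables at 88.90%), the error counts they imply can be derived by simple arithmetic. The sketch below assumes that accuracy is computed per syllable as 1 - errors / syllables; that definition and the function name `implied_errors` are assumptions, and the resulting counts are derived values, not figures quoted from the results table.

```python
def implied_errors(syllables: int, accuracy_percent: float) -> int:
    """Error count implied by a syllable total and an accuracy rate,
    assuming accuracy = 1 - errors / syllables."""
    return round(syllables * (1 - accuracy_percent / 100))


def accuracy(syllables: int, errors: int) -> float:
    """Accuracy in percent, rounded to two decimals."""
    return round(100 * (1 - errors / syllables), 2)


for name, total, rate in [("training", 614, 97.56), ("testing", 955, 88.90)]:
    errors = implied_errors(total, rate)
    # Recompute the rate from the implied count to confirm it round-trips.
    print(name, errors, accuracy(total, errors))
```

Under this assumption, each implied count reproduces the reported rate when the accuracy is recomputed from it.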
12 Future Work Solicit assistance from linguists ; Improve word segmentation, especially the processing of morphology, quantitative words, and proper nouns ; Improve the processing of POS tags to account for ambiguity ; Improve the part-of-speech dictionary ; Improve the sandhi rules ; Find alternative ways of modeling sandhi processing.