Current Status and Future of Language Resources in Taiwan Chu-Ren Huang Institute of Linguistics, Academia Sinica Symposium on Language Resources in Asia January 19, 2001, Tokyo, Japan
Languages of Concern
--Modern Mandarin Chinese
--Archaic, Ancient, and Near Modern Chinese (the diachronic record of three thousand years of Chinese)
--Formosan Languages (endangered, one of the richest branches of the Austronesian languages)
Sharable Resources for Chinese Computational Linguistics
--Corpora
--Lexicons
--Procedures
Sharable Resources for Chinese Computational Linguistics--Corpora -Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus) -Sinica Treebank -Standard Segmentation Corpus -ROCLING Corpus -Mandarin-Across-Taiwan (MAT) Speech Database
Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus)
--5 million words, segmented and tagged
--Direct WWW Access: words/modern-words/index.html
--License Information
Sinica Treebank ,725 Trees 239,532 Words Direct WWW Access (1000 sample trees) License Information
Mandarin-Across-Taiwan (MAT) Speech Database
--Speech files are collected through telephone networks.
--The content includes spontaneous speech (short answers) and read speech (numbers, Mandarin syllables, words of 2 to 4 syllables, phonetically balanced sentences).
--MAT-160 (160 speakers) - MAT
A Database of Chinese Characters (i.e. Kanji)
For each character:
--The Component Composition (部件組成) information is important
--Over 10,000 Components (部件) have been identified for Chinese, roughly 2,000 of them productive
--optional: radicals, number of strokes, variants
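One way to picture a record in such a character database is sketched below. This is a minimal illustration only: the `CharacterEntry` class, its field names, and the `build_component_index` helper are hypothetical and are not taken from the actual database. The three sample characters are ordinary dictionary decompositions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CharacterEntry:
    """Hypothetical record: one character and its component composition."""
    char: str
    components: List[str]            # component composition (部件組成)
    radical: Optional[str] = None    # optional field
    strokes: Optional[int] = None    # optional field
    variants: List[str] = field(default_factory=list)  # optional field


def build_component_index(entries):
    """Map each component to the set of characters that contain it."""
    index = defaultdict(set)
    for entry in entries:
        for comp in entry.components:
            index[comp].add(entry.char)
    return index


# Toy sample entries (assumed for illustration, not drawn from the database).
ENTRIES = [
    CharacterEntry("好", ["女", "子"], radical="女", strokes=6),
    CharacterEntry("明", ["日", "月"], radical="日", strokes=8),
    CharacterEntry("如", ["女", "口"], radical="女", strokes=6),
]

INDEX = build_component_index(ENTRIES)
print(sorted(INDEX["女"]))  # characters that share the component 女
```

Indexing by component rather than by radical alone lets a lookup find every character that shares a productive component, which is presumably why the composition information is singled out as important.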
Sharable Resources for Chinese Computational Linguistics--Procedures
Segmentation Standard for Chinese Language Processing
--Segmentation Standard
--Standard Segmentation Corpus (2 million words, segmented)
--Standard Segmentation Lexicon (42,138 entries, w/ frequency)
--Segmentation Program (free download)
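To illustrate how a segmentation lexicon can drive a segmentation program, here is a minimal sketch of greedy forward maximum matching. This is a generic textbook technique swapped in for illustration; it is not claimed to be the algorithm of the Segmentation Program listed above. The `segment` function, the `TOY_LEXICON` entries, and their frequency values are all invented for the example.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon word (up to max_len
    characters); characters not covered by the lexicon become
    single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon of word -> frequency (values are made up; only membership
# is used by this sketch).
TOY_LEXICON = {
    "中央": 120,
    "研究": 300,
    "研究院": 95,
    "研究所": 80,
    "語言": 210,
    "語言學": 40,
}

print(segment("中央研究院語言學研究所", TOY_LEXICON))
```

The frequency column goes unused here. A fuller program could use it to choose among competing segmentations instead of always preferring the longest match.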
Sharable Resources in Languages Other than Modern Mandarin
--Classical Chinese Corpora
--Corpus of Formosan Austronesian Languages (under construction, part of the National Digital Archive Initiative)
--Lexical Databases of other Sino-Tibetan and Tibeto-Burman Languages
Synchronic and Diachronic Chinese Corpora
Three Projects Sponsored by the CCK Foundation
--Chu-Ren Huang, Keh-jiann Chen and Pei-chuan Wei, Academia Sinica
--Paul Thompson, SOAS, University of London
--Chaofen Sun, Stanford University
Mechanisms for Scholarly Exchange and Collaboration
Department of International Programs, NSC
--Canada: NRC
--France: CNRS
--Japan: EAACST
--Germany: DFG, DAAD, DKFG
--Netherlands: NWO, IIAS
--USA: NSF, NIH
--UK: Royal Society of London, etc.
Other Resources in our area: Singapore (K.T. Lua)
Consortium of Asian Language Resources
---Last Updated Oct
--Contains detailed information on about 50 (mostly Chinese) linguistic resources, including comprehensive reviews as well as license information
Other Resources: HowNet: An attribute-based Semantic Network (Dong Zhendong)
Future 1. Linguistic Ontology: Wordnets
--Bi- or Multi-lingual Wordnets in EuroWordNet style
--Collaboration among Chinese-speaking communities (Academia Sinica, City University of Hong Kong, Peking University)
Future 2. Language Archives under the Digital Archive National Project
--Digital Archive Initiatives Started in
--The Language Resource Project (PI: Huang) includes 3 corpus projects on
  20th Century Taiwan Mandarin
  Near Modern Chinese ( Century)
  Pilot project on Formosan language corpora
--Expected to become a National Project in 2002
Future
3. A universal and sharable scheme for encoding Chinese characters
4. Join the Open Language Archives Community (OLAC)
5. Participation in and Conformance to International Standards for Language Engineering (ISLE)