Thai Linguistic Resources
Virach Sornlertlamvanich
Information R&D Division (iTech)
National Electronics and Computer Technology Center (NECTEC), THAILAND
Symposium on Language Resources in Asia
19 January 2001
How Important!
[Diagram labels: Language Processing; Defining Rules; Linguistic Knowledge; Statistical Modeling; Training Resources; Linguistic Knowledge; Top-Down; Bottom-Up; Evaluation; Models; Adjust; Evaluation Resources]
- Linguistic resources are necessary in both top-down and bottom-up design
- Exploitable in modeling and evaluation
What we need?
- Lexicon / Dictionary (30k)
- Tagged Text (2MB) / Speech Corpora
- Language Model
- Word Extraction (ML; p=85%; r=56%)
- Word Segmentation / POS tagger (ML; 96-97%)
- Sentence Segmentation (ML; 85-89%)
- Grapheme-to-Phoneme Conversion (PGLR; 73-90%)
- Word Sense Disambiguation
- Corpus / UNL / UW (concept) Editor
- MT (ParSit; / UNL
- Text Summarization
- Speech Recognition / Synthesis
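The word-extraction figures above are quoted as two separate numbers. As a rough illustration that is not part of the original slides, they can be combined into a single balanced F-measure. This sketch assumes that p and r stand for precision and recall and that the two are weighted equally; the function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures quoted for word extraction: p = 85%, r = 56%
# (read here as precision and recall, which is an assumption).
f1 = f_measure(0.85, 0.56)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.675"
```

The harmonic mean sits closer to the lower of the two values, so the combined score mostly reflects the 56% figure.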
Our Workbench …
Open Linguistic Resources
LEXiTRON v 1.1 (a corpus-based T-E dictionary, 1994)
- About 11,000 Thai entries; 9,000 English entries
ORCHID POS-Tagged Corpus (supported by CRL, 1997)
- 160 documents; 2MB text; 400K words
- XML tagged for Paragraph, Sentence, Word, Part-of-Speech (47 tags)
Thai Royal Institute Dictionary (T-T dictionary)
- Basic terms: 32,000 entries
- Technical terms: 15,339 entries
ParSit ( 2000)
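To make the idea of a corpus tagged in XML for paragraph, sentence, word, and part-of-speech more concrete, here is a minimal sketch of reading such a file. The element names (`corpus`, `paragraph`, `sentence`, `word`), the `pos` attribute, and the tag values `NOUN` and `VERB` are all assumptions made for illustration; they are not the actual ORCHID markup or its 47-tag set.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: element names, the "pos" attribute, and the tag
# values are placeholders, not the real ORCHID format or tagset.
SAMPLE = """<corpus>
  <paragraph>
    <sentence>
      <word pos="NOUN">แมว</word>
      <word pos="VERB">กิน</word>
      <word pos="NOUN">ปลา</word>
    </sentence>
  </paragraph>
</corpus>"""


def read_tagged_sentences(xml_text):
    """Return each sentence as a list of (word, tag) pairs."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sent in root.iter("sentence"):
        sentences.append([(w.text, w.get("pos")) for w in sent.iter("word")])
    return sentences


sentences = read_tagged_sentences(SAMPLE)
# Count how often each part-of-speech tag occurs across the sample.
tag_counts = Counter(tag for sent in sentences for _, tag in sent)
print(sentences)
print(tag_counts)
```

Word/tag pairs in this shape are the usual input for training or evaluating a word segmenter or POS tagger.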
Ongoing : Thai Speech Corpus #1
Scope (2001)
Large Vocabulary Continuous Speech Recognition (LVCSR) Corpus
- Phonetically-balanced sentences
- 5K vocabulary coverage sentences
Corpus for Text-to-Speech Synthesis
- Phonetically and prosodically balanced sentences
- For probabilistic prosody generation
Dialog speech corpus (collaboration with ATR)
- 50 conversations, 2,099 sentences
- 5,000 words, 866 phonetically-balanced sentences
- 40 speakers (males and females)
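The slides do not say how the phonetically-balanced sentences are chosen. One common approach, shown here purely as an assumption, is a greedy selection that repeatedly picks the sentence adding the most phonetic units not yet covered. The function name `select_balanced`, the sentence ids, and the toy phone inventories below are all hypothetical.

```python
def select_balanced(sentences, budget):
    """Greedy coverage: sentences maps an id to its set of phonetic units.

    Repeatedly choose the sentence that adds the most uncovered units,
    stopping at `budget` sentences or when nothing new can be added.
    """
    covered = set()
    chosen = []
    remaining = dict(sentences)
    while remaining and len(chosen) < budget:
        # sorted() makes tie-breaking deterministic (lowest id wins).
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining sentence adds a new unit
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Toy inventories, invented for illustration only.
toy = {
    "s1": {"k", "a", "n"},
    "s2": {"k", "a"},
    "s3": {"m", "i"},
    "s4": {"n", "i", "t"},
}
chosen, covered = select_balanced(toy, budget=3)
print(chosen)           # prints ['s1', 's3', 's4']
print(sorted(covered))  # prints ['a', 'i', 'k', 'm', 'n', 't']
```

This sketch maximizes coverage only. A real selection would likely also weight units by frequency so that their distribution, and not just their presence, is balanced.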
Ongoing : Thai Speech Corpus #2
Procedure
Ongoing : Thai Speech Corpus #3
Tools
Plain Text Corpus Editor XML Corpus
Ongoing : Thai Speech Corpus #4
Text Sources
- Technology Promotion Association (Thailand-Japan)
- Amarin Printing Co., Ltd.
- Matichon Public Co., Ltd.
Project Collaboration
- Kasetsart University
- Thammasat University
- King Mongkut's University of Technology Thonburi
- Prince of Songkla University
Ongoing : Thai Speech Corpus #5
Ongoing : LEXiTRON v 2.0 #1
Scope (2001)
Entries
- 25,000 Thai-English
- 25,000 English-Thai
Fields
- Translation
- Phonetics
- Root of vocabulary
- Part-of-speech
- Synonym
- Antonym
- Sentence sample
Procedure
Ongoing : LEXiTRON v 2.0 #2
Tools
- Dictionary DB
- Phonetic Symbols
- Wordnet
- Corpus-based Sample Sentences
Discussion
- Language difficulties; 13 Tai-family languages
- Text sources
- Common tagset
- Resource center
- Institutional collaboration