1
Thai Linguistic Resources
Virach Sornlertlamvanich
Information R&D Division (iTech)
National Electronics and Computer Technology Center (NECTEC), THAILAND
Symposium on Language Resources in Asia, 19 January 2001
2
How Important!
[Diagram labels: Language Processing; Defining Rules; Linguistic Knowledge; Statistical Modeling; Training Resources; Linguistic Knowledge; Top-Down; Bottom-Up; Evaluation; Models; Adjust; Evaluation Resources]
- Linguistic resources are necessary in both top-down and bottom-up design
- Exploitable in modeling and evaluation
3
What do we need?
- Lexicon / Dictionary (30k)
- Tagged Text (2MB) / Speech Corpora
- Language Model
- Word Extraction (ML; p=85%; r=56%)
- Word Segmentation / POS tagger (ML; 96-97%)
- Sentence Segmentation (ML; 85-89%)
- Grapheme-to-Phoneme Conversion (PGLR; 73-90%)
- Word Sense Disambiguation
- Corpus / UNL / UW (concept) Editor
- MT (ParSit; http://come.to/parsit) / UNL
- Text Summarization
- Speech Recognition / Synthesis
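The word extraction figures can be combined into a single score. The sketch below assumes that p and r on the slide denote precision and recall (the slide does not spell this out) and computes the balanced F-measure from them. The function name `f_measure` is illustrative, and the resulting value is derived here, not reported in the presentation.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta**2) * precision * recall / denominator


# Word extraction figures from the slide, read as precision and recall (assumption).
f1 = f_measure(0.85, 0.56)
print(f"F1 = {f1:.3f}")  # F1 = 0.675
```

The harmonic mean is pulled toward the lower of the two values, so the low recall dominates the combined score.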
4
Our Workbench …
5
Open Linguistic Resources
LEXiTRON v 1.1 (a corpus-based T-E dictionary, 1994)
- About 11,000 Thai entries; 9,000 English entries
- http://www.links.nectec.or.th/lexit
ORCHID POS-Tagged Corpus (supported by CRL, 1997)
- 160 documents; 2MB text; 400K words
- XML tagged for Paragraph, Sentence, Word, Part-of-Speech (47 tags)
- http://www.links.nectec.or.th/orchid
Thai Royal Institute Dictionary (T-T dictionary)
- Basic terms: 32,000 entries
- Technical terms: 15,339 entries
- http://www.royin.go.th/
ParSit (http://come.to/parsit, 2000)
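To illustrate what a corpus tagged for Paragraph, Sentence, Word, and Part-of-Speech makes possible, here is a minimal sketch of reading such a file. The element names (`corpus`, `paragraph`, `sentence`, `word`), the `pos` attribute, and the sample content are hypothetical; the actual ORCHID markup is not shown on the slide and may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: element and attribute names are assumptions,
# not the real ORCHID schema. POS values are illustrative only.
SAMPLE = """<corpus>
  <paragraph>
    <sentence>
      <word pos="NCMN">แมว</word>
      <word pos="VACT">กิน</word>
      <word pos="NCMN">ปลา</word>
    </sentence>
  </paragraph>
</corpus>"""


def read_tagged_sentences(xml_text):
    """Return each sentence as a list of (word, pos) pairs."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sentence in root.iter("sentence"):
        pairs = [(w.text, w.get("pos")) for w in sentence.iter("word")]
        sentences.append(pairs)
    return sentences


sentences = read_tagged_sentences(SAMPLE)
# Count how often each POS tag occurs across all sentences.
tag_counts = Counter(pos for sent in sentences for _, pos in sent)
print(sentences)
print(tag_counts.most_common())
```

Word and tag pairs in this form are the usual training input for a statistical word segmenter or POS tagger.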
6
Ongoing : Thai Speech Corpus #1
Scope (2001)
Large Vocabulary Continuous Speech Recognition (LVCSR) Corpus
- Phonetically-balanced sentences
- 5K vocabulary coverage sentences
Corpus for Text-to-Speech Synthesis
- 400 phonetically and prosodically balanced sentences
- For probabilistic prosody generation
Dialog speech corpus (collaboration with ATR)
- 50 conversations, 2,099 sentences
- 5,000 words, 866 phonetically-balanced sentences
- 40 speakers (males and females)
7
Ongoing : Thai Speech Corpus #2
Procedure
8
Ongoing : Thai Speech Corpus #3
Tools
- Plain Text
- Corpus Editor
- XML Corpus
9
Ongoing : Thai Speech Corpus #4
Text Sources
- Technology Promotion Association (Thailand-Japan)
- Amarin Printing Co., Ltd.
- Matichon Public Co., Ltd.
Project Collaboration
- Kasetsart University
- Thammasat University
- King Mongkut's University of Technology Thonburi
- Prince of Songkla University
10
Ongoing : Thai Speech Corpus #5
11
Ongoing : LEXiTRON v 2.0 #1
Scope (2001)
Entries
- 25,000 Thai-English
- 25,000 English-Thai
Fields
- Translation
- Phonetics
- Root of vocabulary
- Part-of-speech
- Synonym
- Antonym
- Sentence sample
Procedure
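The fields listed for LEXiTRON v 2.0 can be pictured as one record per headword. The sketch below is only an illustration of that layout: the class name `LexiconEntry`, the field names, and the sample entry are hypothetical and do not describe the actual LEXiTRON database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LexiconEntry:
    """One bilingual dictionary entry (hypothetical layout)."""
    headword: str
    translations: List[str]
    phonetics: str = ""
    root: str = ""
    part_of_speech: str = ""
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)
    sample_sentences: List[str] = field(default_factory=list)


# Illustrative Thai-English entry; the content is made up for this example.
entry = LexiconEntry(
    headword="แมว",
    translations=["cat"],
    part_of_speech="noun",
)
print(entry.headword, "->", ", ".join(entry.translations))
```

Keeping synonyms, antonyms, and sample sentences as lists lets an entry start with only a translation and be enriched later, for example with corpus-based sample sentences.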
12
Ongoing : LEXiTRON v 2.0 #2
Tools
- Dictionary DB
- Phonetic Symbols
- WordNet
- Corpus-based Sample Sentences
13
Discussion
- Language difficulties; 13 Tai-family languages
- Text sources
- Common tagset
- Resource center
- Institutional collaboration