Virach Sornlertlamvanich Information R&D Division (iTech) National Electronics and Computer Technology Center (NECTEC) THAILAND 19 January 2001 Symposium.


1 Thai Linguistic Resources
Virach Sornlertlamvanich
Information R&D Division (iTech), National Electronics and Computer Technology Center (NECTEC), THAILAND
Symposium on Language Resources in Asia, 19 January 2001

2 How Important!
Diagram labels: Language Processing; Top-Down; Bottom-Up; Defining Rules; Linguistic Knowledge; Statistical Modeling; Training Resources; Linguistic Knowledge; Evaluation; Models; Adjust; Evaluation Resources
- Linguistic resources are necessary in both top-down and bottom-up design
- Exploitable in modeling and evaluation

3 What we need?
- Lexicon / Dictionary (30k)
- Tagged Text (2MB) / Speech Corpora
- Language Model
- Word Extraction (ML; p=85%; r=56%)
- Word Segmentation / POS tagger (ML; 96-97%)
- Sentence Segmentation (ML; 85-89%)
- Grapheme-to-Phoneme Conversion (PGLR; 73-90%)
- Word Sense Disambiguation
- Corpus / UNL / UW (concept) Editor
- MT (ParSit; http://come.to/parsit) / UNL
- Text Summarization
- Speech Recognition / Synthesis
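The Word Extraction figures on this slide read as precision (p) and recall (r). A minimal sketch, assuming the usual balanced F-measure (which the slide itself does not report), shows how the two numbers combine into a single score. The function name `f1_score` is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Word Extraction figures quoted on the slide: p = 85%, r = 56%.
word_extraction_f1 = f1_score(0.85, 0.56)
print(f"F1 = {word_extraction_f1:.3f}")
```

With the quoted figures this comes to roughly 0.675, which reflects how much the lower recall pulls the combined score down.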

4 Our Workbench …

5 Open Linguistic Resources
- LEXiTRON v 1.1 (a corpus-based T-E dictionary, 1994)
  - About 11,000 Thai entries; 9,000 English entries
  - http://www.links.nectec.or.th/lexit
- ORCHID POS-Tagged Corpus (supported by CRL, 1997)
  - 160 documents; 2MB text; 400K words
  - XML tagged for Paragraph, Sentence, Word, Part-of-Speech (47 tags)
  - http://www.links.nectec.or.th/orchid
- Thai Royal Institute Dictionary (T-T dictionary)
  - Basic terms: 32,000 entries
  - Technical terms: 15,339 entries
  - http://www.royin.go.th/
- ParSit (http://come.to/parsit, 2000)
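The slide describes ORCHID as XML tagged at the paragraph, sentence, word, and part-of-speech levels. The sketch below shows how such a corpus could be read with Python's standard library. The element names (`paragraph`, `sentence`, `word`), the `pos` attribute, and the tag labels are hypothetical placeholders; the actual ORCHID markup and its 47-tag set may look quite different.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: names and POS labels are placeholders,
# not the real ORCHID schema or tagset.
SAMPLE = """
<document>
  <paragraph>
    <sentence>
      <word pos="TAG_A">w1</word>
      <word pos="TAG_B">w2</word>
    </sentence>
    <sentence>
      <word pos="TAG_A">w3</word>
    </sentence>
  </paragraph>
</document>
"""


def pos_counts(xml_text: str) -> Counter:
    """Count how often each POS label occurs in the document."""
    root = ET.fromstring(xml_text.strip())
    return Counter(word.get("pos") for word in root.iter("word"))


def tagged_sentences(xml_text: str) -> list:
    """Return each sentence as a list of (word, pos) pairs."""
    root = ET.fromstring(xml_text.strip())
    return [
        [(word.text, word.get("pos")) for word in sentence.iter("word")]
        for sentence in root.iter("sentence")
    ]


print(pos_counts(SAMPLE))
print(tagged_sentences(SAMPLE))
```

Keeping the structural levels as nested elements means sentence-level and word-level statistics can be drawn from the same file without a separate alignment step.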

6 Ongoing : Thai Speech Corpus #1
Scope (2001)
- Large Vocabulary Continuous Speech Recognition (LVCSR) Corpus
  - Phonetically-balanced sentences
  - 5K vocabulary coverage sentences
- Corpus for Text-to-Speech Synthesis
  - 400 phonetically and prosodically balanced sentences
  - For probabilistic prosody generation
- Dialog speech corpus (collaboration with ATR)
  - 50 conversations, 2,099 sentences
  - 5,000 words, 866 phonetically-balanced sentences
  - 40 speakers (males and females)
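The slide does not say how the phonetically balanced sentences are chosen. One common approach, shown here purely as an assumption and not as the project's method, is greedy selection: repeatedly pick the candidate sentence that adds the most phone units not yet covered. This simplifies "balanced" to "covering"; matching a target phone distribution would need a different scoring function. The sentence IDs and phone symbols are made-up placeholders, not Thai data.

```python
def greedy_select(candidates: dict[str, set[str]], max_sentences: int) -> list[str]:
    """Pick sentences greedily by the number of newly covered phone units."""
    covered: set[str] = set()
    chosen: list[str] = []
    remaining = dict(candidates)
    while remaining and len(chosen) < max_sentences:
        # Candidate with the largest number of still-uncovered units.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # nothing left adds coverage
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen


# Placeholder data: sentence ID -> set of phone units it contains.
candidates = {
    "s1": {"a", "b", "c"},
    "s2": {"c", "d"},
    "s3": {"a", "b"},
    "s4": {"e"},
}
selected = greedy_select(candidates, max_sentences=3)
print(selected)
```

In this toy run `s3` is never chosen because everything it contains is already covered by `s1`.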

7 Ongoing : Thai Speech Corpus #2 Procedure

8 Ongoing : Thai Speech Corpus #3
Tools
- Plain Text
- Corpus Editor
- XML Corpus

9 Ongoing : Thai Speech Corpus #4
Text Sources
- Technology Promotion Association (Thailand-Japan)
- Amarin Printing Co., Ltd.
- Matichon Public Co., Ltd.
Project Collaboration
- Kasetsart University
- Thammasat University
- King Mongkut's University of Technology Thonburi
- Prince of Songkla University

10 Ongoing : Thai Speech Corpus #5

11 Ongoing : LEXiTRON v 2.0 #1
Scope (2001)
Entries
- 25,000 Thai - English
- 25,000 English - Thai
Fields
- Translation
- Phonetics
- Root of vocabulary
- Part-of-speech
- Synonym
- Antonym
- Sentence sample
Procedure
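The fields listed for LEXiTRON v 2.0 can be pictured as a single record per entry. The sketch below is only an illustration of that field list; the class name `DictionaryEntry`, the attribute names, and the choice of types are assumptions and do not reflect the actual LEXiTRON database schema.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    """Hypothetical record mirroring the fields named on the slide."""
    headword: str
    translations: list[str]                                    # Translation
    phonetics: str = ""                                        # Phonetics
    root: str = ""                                             # Root of vocabulary
    part_of_speech: str = ""                                   # Part-of-speech
    synonyms: list[str] = field(default_factory=list)          # Synonym
    antonyms: list[str] = field(default_factory=list)          # Antonym
    sample_sentences: list[str] = field(default_factory=list)  # Sentence sample


# Placeholder values only; not taken from the dictionary.
entry = DictionaryEntry(
    headword="example",
    translations=["<Thai translation>"],
    part_of_speech="noun",
    sample_sentences=["This is an example."],
)
print(entry)
```

Using list-valued fields for synonyms, antonyms, and sample sentences lets an entry carry any number of each without changing the record shape.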

12 Ongoing : LEXiTRON v 2.0 #2
Tools
- Dictionary DB
- Phonetic Symbols
- WordNet
- Corpus-based Sample Sentences

13 Discussion
- Language difficulties; 13 Tai-family languages
- Text sources
- Common tagset
- Resource center
- Institutional collaboration

