Virach Sornlertlamvanich Information R&D Division (iTech) National Electronics and Computer Technology Center (NECTEC) THAILAND 19 January 2001 Symposium.

Presentation transcript:

Thai Linguistic Resources
Virach Sornlertlamvanich
Information R&D Division (iTech), National Electronics and Computer Technology Center (NECTEC), Thailand
19 January 2001, Symposium on Language Resources in Asia

How Important!
[Diagram labels: Language Processing; Top-Down; Bottom-Up; Defining Rules; Linguistic Knowledge; Statistical Modeling; Training Resources; Models; Evaluation; Adjust; Evaluation Resources]
- Linguistic resources are necessary in both top-down and bottom-up design
- Exploitable in modeling and evaluation

What we need?
- Lexicon / Dictionary (30k)
- Tagged Text (2MB) / Speech Corpora
- Language Model
- Word Extraction (ML; p=85%; r=56%)
- Word Segmentation / POS tagger (ML; 96-97%)
- Sentence Segmentation (ML; 85-89%)
- Grapheme-to-Phoneme Conversion (PGLR; 73-90%)
- Word Sense Disambiguation
- Corpus / UNL / UW (concept) Editor
- MT (ParSit; / UNL
- Text Summarization
- Speech Recognition / Synthesis
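The word extraction figures above give precision and recall separately (p=85%, r=56%). A minimal sketch of how the two combine into a single F-measure follows; the function name `f_measure` and the choice of the balanced F1 score are assumptions for illustration, since the slides do not report an F-score.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Word extraction figures quoted on the slide: p = 85%, r = 56%.
f1 = f_measure(0.85, 0.56)
print(round(f1, 3))  # about 0.675
```

The low recall pulls the combined score well below the precision, which is one way to see that the extractor misses many words even though most of what it proposes is correct.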

Our Workbench …

Open Linguistic Resources
- LEXiTRON v 1.1 (a corpus based T-E dictionary, 1994)
  - About 11,000 Thai entries; 9,000 English entries
- ORCHID POS-Tagged Corpus (supported by CRL, 1997)
  - 160 documents; 2MB text; 400K words
  - XML tagged for Paragraph, Sentence, Word, Part-of-Speech (47 tags)
- Thai Royal Institute Dictionary (T-T dictionary)
  - Basic term 32,000 entries
  - Technical term 15,339 entries
- ParSit ( 2000)
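A corpus tagged in XML for paragraph, sentence, word and part-of-speech can be read with a standard XML parser. The sketch below is only an illustration: the element names (`corpus`, `paragraph`, `sentence`, `word`), the `pos` attribute and the tag labels `NOUN`/`VERB` are hypothetical placeholders, not the actual ORCHID markup or its 47-tag set.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; the real ORCHID element names and tags may differ.
SAMPLE = """<corpus>
  <paragraph>
    <sentence>
      <word pos="NOUN">แมว</word>
      <word pos="VERB">กิน</word>
      <word pos="NOUN">ปลา</word>
    </sentence>
    <sentence>
      <word pos="NOUN">นก</word>
      <word pos="VERB">บิน</word>
    </sentence>
  </paragraph>
</corpus>"""


def read_sentences(xml_text):
    """Return each sentence as a list of (word, pos) pairs."""
    root = ET.fromstring(xml_text)
    return [
        [(w.text, w.get("pos")) for w in s.iter("word")]
        for s in root.iter("sentence")
    ]


sentences = read_sentences(SAMPLE)
# Count how often each part-of-speech tag occurs in the sample.
tag_counts = Counter(pos for sent in sentences for _, pos in sent)
print(len(sentences), dict(tag_counts))
```

Keeping sentence and word boundaries in the markup matters for Thai in particular, since the written text does not mark them with spaces.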

Ongoing : Thai Speech Corpus #1
Scope (2001)
- Large Vocabulary Continuous Speech Recognition (LVCSR) Corpus
  - Phonetically-balanced sentences
  - 5K vocabulary coverage sentences
- Corpus for Text-to-Speech Synthesis
  - Phonetically and prosodically balanced sentences
  - For probabilistic prosody generation
- Dialog speech corpus (collaboration with ATR)
  - 50 conversations, 2,099 sentences
  - 5,000 words, 866 phonetically-balanced sentences
  - 40 speakers (males and females)
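One common way to pick a small set of sentences that covers a target inventory of phonetic units is greedy selection. The sketch below illustrates that general idea only; it is not the procedure used for this corpus, it treats "balanced" as simple coverage rather than matching a frequency distribution, and the function name, sentence ids and unit symbols are all hypothetical.

```python
def select_covering_sentences(candidates, target_units):
    """Greedily pick sentences until every target unit is covered.

    candidates: dict mapping a sentence id to the set of units it contains.
    Returns (chosen ids in selection order, units left uncovered).
    """
    uncovered = set(target_units)
    remaining = dict(candidates)
    chosen = []
    while uncovered and remaining:
        # Take the sentence that covers the most still-missing units.
        best_id = max(remaining, key=lambda sid: len(remaining[sid] & uncovered))
        gain = remaining[best_id] & uncovered
        if not gain:
            break  # no remaining sentence adds anything new
        chosen.append(best_id)
        uncovered -= gain
        del remaining[best_id]
    return chosen, uncovered


# Toy candidates with made-up unit symbols.
candidates = {
    "s1": {"k", "a", "t"},
    "s2": {"m", "a"},
    "s3": {"t", "i", "m"},
    "s4": {"k", "i"},
}
chosen, missing = select_covering_sentences(candidates, {"k", "a", "t", "i", "m"})
print(chosen, missing)
```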

Ongoing : Thai Speech Corpus #2 Procedure

Ongoing : Thai Speech Corpus #3
Tools
- Plain Text
- Corpus Editor
- XML Corpus

Ongoing : Thai Speech Corpus #4
Text Sources
- Technology Promotion Association (Thailand-Japan)
- Amarin Printing Co., Ltd.
- Matichon Public Co., Ltd.
Project Collaboration
- Kasetsart University
- Thammasat University
- King Mongkut's University of Technology Thonburi
- Prince of Songkhla University

Ongoing : Thai Speech Corpus #5

Ongoing : LEXiTRON v 2.0 #1
Scope (2001)
- Entries
  - 25,000 Thai-English
  - 25,000 English-Thai
- Fields
  - Translation
  - Phonetics
  - Root of vocabulary
  - Part-of-speech
  - Synonym
  - Antonym
  - Sentence sample
Procedure

Ongoing : LEXiTRON v 2.0 #2
Tools
- Dictionary DB
- Phonetic Symbols
- WordNet
- Corpus-based Sample Sentences
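The entry fields planned for LEXiTRON v 2.0 (translation, phonetics, root, part-of-speech, synonym, antonym, sentence sample) map naturally onto a record type. The sketch below is a hypothetical in-memory layout for one entry; the class name, attribute names and sample values are assumptions and do not reflect the actual dictionary database schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LexiconEntry:
    """One bilingual dictionary entry (hypothetical layout)."""
    headword: str
    translations: List[str]
    part_of_speech: str
    phonetics: Optional[str] = None   # pronunciation in phonetic symbols
    root: Optional[str] = None        # root of the vocabulary item
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)
    sample_sentences: List[str] = field(default_factory=list)


# Illustrative English-Thai entry; values are examples, not dictionary data.
entry = LexiconEntry(
    headword="cat",
    translations=["แมว"],
    part_of_speech="noun",
    sample_sentences=["The cat sleeps."],
)
print(entry.headword, entry.translations, entry.part_of_speech)
```

Optional fields default to empty so that entries can be created before phonetics or corpus-based sample sentences have been added.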

Discussion
- Language difficulties; 13 Tai-family languages
- Text sources
- Common tagset
- Resource center
- Institutional collaboration