LANGUAGE RESOURCES IN MALAYSIA Zaharin Yusoff Computer-Aided Translation Unit School of Computer Sciences Universiti Sains Malaysia 11800 Penang, Malaysia.


Symposium on Language Resources in Asia

HISTORICAL PERSPECTIVE

[Diagram labels, in slide order]
USM: GETA, UTMK; MT; MT, MAHT; CL TOOLS; NLP APPLICATIONS
UTM: CICC; MT
ITNM: TRANSLATION
UKM: NLP
UM, UiTM: MT, CALL

[Legend]
UNIVERSITI SAINS MALAYSIA (USM)
  Unit Terjemahan Melalui Komputer (UTMK)
UNIV. TEKNOLOGI MALAYSIA (UTM)
INSTITUT TERJEMAHAN NEGARA (ITNM)
UNIV. KEBANGSAAN MALAYSIA (UKM)
UNIVERSITI MALAYA (UM)
UNIV. (INSTITUT) TEKNOLOGI MARA (UiTM)
DEWAN BAHASA DAN PUSTAKA (DBP)

MAIN POINTS

LANGUAGE RESOURCES

COMP. LING. TOOLS
  GENERIC TOOLS
  - NOT TOO MANY
  - MOSTLY NOT UPDATED
  - SOME ARE REUSABLE
  APPLICATION BASED
  - THE MORE RECENT ONES
  - DEPENDENT ON DEMAND
  - BUT MODULAR & REPROGRAMMABLE

LINGUISTIC DATA
  LINGWARE DATA
  - VERY LITTLE
  - NOT REUSABLE
  - METHODOLOGIES OK
  LANGUAGE DATA
  - REASONABLE
  - SOME INCOMPLETE
  - DIFFICULT TO ACQUIRE BUT REUSABLE

RECALL:
- Too Few Researchers (60 at peak in 1991, now 15)
- Lacking in Formal Linguistic Studies for Malay
- Lack of Culture of Data Accumulation

LINGUISTIC RESOURCES

GENERIC TOOLS
MT software:
-JEMAH
-Automatic Generator of Lingware
-Analysis
-Synthesis
-User-Driven MT System
Language Tools:
-Spellchecker
-Desktop Accessories (Dicts)
-Text Analysis
-etc.
Linguistic Tools:
-Corpus System
-Dictionary System
-Grammar Editor (STCG)
-Bilingual Corpus Bank
-etc.
Assessment: NOT TOO MANY; MOSTLY NOT UPDATED; SOME ARE REUSABLE

APPLICATION BASED TOOLS
-MAHT system: SISKEP
-Example Based MT
-EDI (parsing/generation msg. types)
-Semantic Driven Search Engine
-WEB Crawler
-Internet Portal (??)
Assessment: THE MORE RECENT ONES; DEPENDENT ON DEMAND BUT MODULAR & REPROGRAMMABLE

LINGWARE DATA
-Ariane/Jemah MT English->Malay (all phases)
-STCG Malay Grammar
Assessment: VERY LITTLE; NOT REUSABLE; METHODOLOGIES OK

LANGUAGE DATA

DICTIONARIES (WINHELP)
  ENGLISH-MALAY DBP (KIMD)    10.16 MB    1945 pages
  MALAY DBP (KD)               6.63 MB    1566 pages
  TERMINOLOGIES (MABBIM)       8.13 MB    1069 pages
  COMPUTER (Malay)             1.15 MB    …
  FRENCH-ENGLISH-MALAY         3.57 MB    …

DICTIONARIES (Databases: attribute format)
  KIMD (as above)    missing data B,O,R,S,T,U,V,X,Y,Z
  KD (as above)      alphabet A only (1,544 words)
  MALAY THESAURUS

CORPUS
  Malay Books, Letters to Editor (System)    2.2 million words
  Translations (Malay only in MS Word)       23 titles (average 1.5 MB, 350 pages)
  English-Malay (Parallel Text)              3 titles (1 with sentence alignment)

Assessment: REASONABLE; SOME INCOMPLETE; DIFFICULT TO ACQUIRE BUT REUSABLE

LANGUAGE DATA (cont..)

KIMD-WordNet Link (A->F only)
Sources are KIMD and WordNet, linked by sense entry in WordNet and KIMD, e.g. abacus:

KIMD(abacus,n,1 [device, for, calculating, ’,’, a, square, or, rectangular, frame, ….]).
***(entry and definition taken from KIMD; some redefined to fit)
WORDNET( , 1, ‘abacus’, n, 2, 0, [performs, arithmetic, functions, by, ….]).
***(entry and definition taken from WordNet)
===sepua, sempoa, dekak-dekak
***(Malay equivalent taken from KIMD)

KD Sense Processing (A->Z)
Source is KAMUS DEWAN (KD)
Steps of process:
-Extract word senses (ws) from KD (result: approx. 30K ws with definition)
-Extract primitive words (ps) from KD based on frequency (result: approx. 5K ps with definition)
-Extract synonyms from KD (result: approx. 6K synonyms)
-Use KD sense numbering to tag synonyms.
Example of result:
syn_kd(adem1, sejuk1)
syn_kd(adem3, tenang2)
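
The sense-tagged synonym pairs shown in the example of result (syn_kd(adem1, sejuk1), syn_kd(adem3, tenang2)) could be loaded into a lookup table roughly as sketched below. This is only an illustration in Python, not the UTMK implementation: the function names parse_syn_kd and build_index are hypothetical, the regular expression assumes each argument is a lowercase word followed directly by its KD sense number, and treating synonymy as symmetric is an assumption not stated on the slide.

```python
import re
from collections import defaultdict

# Assumed fact shape: syn_kd(<word><sense>, <word><sense>), e.g. syn_kd(adem1, sejuk1).
FACT = re.compile(r"syn_kd\(\s*([a-z-]+)(\d+)\s*,\s*([a-z-]+)(\d+)\s*\)")


def parse_syn_kd(text):
    """Return a list of ((word, sense), (word, sense)) pairs found in text."""
    pairs = []
    for w1, s1, w2, s2 in FACT.findall(text):
        pairs.append(((w1, int(s1)), (w2, int(s2))))
    return pairs


def build_index(pairs):
    """Map each (word, sense) to the set of (word, sense) synonyms."""
    index = defaultdict(set)
    for left, right in pairs:
        index[left].add(right)
        index[right].add(left)  # assumption: synonymy holds in both directions
    return index


# The two example facts from the slide.
facts = "syn_kd(adem1, sejuk1) syn_kd(adem3, tenang2)"
index = build_index(parse_syn_kd(facts))
print(sorted(index[("adem", 1)]))  # [('sejuk', 1)]
```

Keying the table on (word, sense) rather than on the bare word keeps the first and third senses of adem apart, which appears to be the point of tagging synonyms with KD sense numbers.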

LANGUAGE DATA (cont..)

OTHER POSSIBLE SOURCES OF DATA

DEWAN BAHASA DAN PUSTAKA (LANGUAGE ACADEMY)
-Copies of all types in UTMK (perhaps more volume)
-Corpus: more recent publications (books, novels, journals, etc.)

NEWSPAPERS
-Corpus: more recent years, i.e. since publishing on internet
-STAR, NEW STRAITS TIMES, etc.

OTHER R&D CENTRES
-UNIV. TEKNOLOGI MALAYSIA (UTM)
-INSTITUT TERJEMAHAN NEGARA (ITNM)
-UNIV. KEBANGSAAN MALAYSIA (UKM)
-UNIVERSITI MALAYA (UM)
-UNIV. (INSTITUT) TEKNOLOGI MARA (UiTM)

THANK YOU
ARIGATO
MERCI
SHUKRIYA
GRAZIE
XIE-XIE NI
TERIMA KASIH