Current Status and Future of Language Resources in Taiwan Chu-Ren Huang Institute of Linguistics, Academia Sinica Symposium on Language Resources in Asia.


Current Status and Future of Language Resources in Taiwan Chu-Ren Huang Institute of Linguistics, Academia Sinica Symposium on Language Resources in Asia January 19, 2001, Tokyo, Japan

Languages of Concern
-- Modern Mandarin Chinese
-- Archaic, Ancient, and Near Modern Chinese (the diachronic record of three thousand years of Chinese)
-- Formosan Languages (endangered; one of the richest branches of the Austronesian languages)

Sharable Resources for Chinese Computational Linguistics
-- Corpora
-- Lexicons
-- Procedures

Sharable Resources for Chinese Computational Linguistics: Corpora
-- Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus)
-- Sinica Treebank
-- Standard Segmentation Corpus
-- ROCLING Corpus
-- Mandarin-Across-Taiwan (MAT) Speech Database

Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus)
-- 5 million words, segmented and tagged
-- Direct WWW Access - words/modern-words/index.html OR -
-- License Information -

Sinica Treebank
-- ,725 Trees
-- 239,532 Words
-- Direct WWW Access (1000 sample trees)
-- License Information

Mandarin-Across-Taiwan (MAT) Speech Database
-- Speech files are collected through telephone networks.
-- The content includes spontaneous speech (short answering statements) and read speech (numbers, Mandarin syllables, words of 2 to 4 syllables, phonetically balanced sentences).
-- MAT-160 (160 speakers) - MAT

A Database of Chinese Characters (i.e. Kanji)
For each character:
-- The component composition (部件組成) information is important
-- Over 10,000 components (部件) have been identified for Chinese, roughly 2,000 of them productive
-- Optional: radicals, number of strokes, variants
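As a rough illustration of the record described on this slide, the sketch below models one character entry with a required component composition and the optional fields (radical, stroke count, variants). The class name `CharacterEntry`, the helper `characters_sharing`, and the three sample entries are hypothetical, chosen only for illustration; they are not taken from the actual Academia Sinica database or its schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CharacterEntry:
    """One record in a hypothetical character database (illustrative only)."""
    character: str
    components: List[str]            # component composition (部件組成), always present
    radical: Optional[str] = None    # optional field
    strokes: Optional[int] = None    # optional field: number of strokes
    variants: List[str] = field(default_factory=list)  # optional field


def characters_sharing(component: str, entries: List[CharacterEntry]) -> List[str]:
    """Return the characters whose composition includes the given component."""
    return [e.character for e in entries if component in e.components]


# Sample entries; the optional fields are left unset for the last one.
entries = [
    CharacterEntry("好", ["女", "子"], radical="女", strokes=6),
    CharacterEntry("明", ["日", "月"], radical="日", strokes=8),
    CharacterEntry("字", ["宀", "子"]),
]

shared = characters_sharing("子", entries)
print(shared)
```

Indexing by component rather than by radical alone is one way such a database could support lookups across characters that share a productive component.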

Sharable Resources for Chinese Computational Linguistics: Procedures
Segmentation Standard for Chinese Language Processing
-- Segmentation Standard
-- Standard Segmentation Corpus (2 million words, segmented)
-- Standard Segmentation Lexicon (42,138 entries, with frequency)
-- Segmentation Program (free download)

Sharable Resources in Languages Other than Modern Mandarin
-- Classical Chinese Corpora
-- Corpus of Formosan Austronesian Languages (under construction, part of the National Digital Archive Initiative)
-- Lexical Databases of other Sino-Tibetan and Tibeto-Burman Languages

Synchronic and Diachronic Chinese Corpora
Three Projects Sponsored by the CCK Foundation ( )
-- Chu-Ren Huang, Keh-jiann Chen and Pei-chuan Wei, Academia Sinica
-- Paul Thompson, SOAS, University of London
-- Chaofen Sun, Stanford University

Mechanisms for Scholarly Exchange and Collaboration
Department of International Programs, NSC
-- Canada: NRC
-- France: CNRS
-- Japan: EAACST
-- Germany: DFG, DAAD, DKFG
-- Netherlands: NWO, IIAS
-- USA: NSF, NIH
-- UK: Royal Society of London, etc.

Other Resources in our area: Singapore (K.T. Lua)
Consortium of Asian Language Resources
-- Last Updated Oct
-- Contains detailed information on about 50 (mostly Chinese) linguistic resources, including comprehensive reviews as well as license information

Other Resources: HowNet: An Attribute-based Semantic Network (Dong Zhendong)

Future
1. Linguistic Ontology: Wordnets
-- Bi- or multilingual wordnets in EuroWordNet style
-- Collaboration among Chinese-speaking communities (Academia Sinica, City University of Hong Kong, Peking University)

Future
2. Language Archives under the Digital Archive National Project
-- Digital Archive Initiatives Started in
-- The Language Resource Project (PI: Huang) includes 3 corpus projects on: 20th Century Taiwan Mandarin; Near Modern Chinese ( Century); a pilot project on Formosan language corpora
-- Expected to become a National Project in 2002

Future
3. A universal and sharable scheme for encoding Chinese characters
4. Join the Open Language Archives Community (OLAC)
5. Participation in and conformance to International Standards for Language Engineering (ISLE)