4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007 From Synergy to Knowledge: Integrating multiple language resources Part.

From Synergy to Knowledge: Integrating Multiple Language Resources
Part I: Language Resources and Tools
Chu-Ren Huang, Academia Sinica
4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16, 2007

p. 2 C.R. Huang, 4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16, 2007
Outline: Language Resources and Tools
• Introduction: 10 Years in Chinese Language Processing, a Mirror for Other Asian Languages
• The Starting Point: Resources and Resource Sharing
  - OLAC: The Open Language Archives Community
  - Asian Language Resources Committee of AFNLP
  - Standards: ISO TC37 Language Resources Management
  - Language Archives Project of Taiwan
• Tools: Getting Started in NLP with NLTK

p. 3 Why Resources and Tools
Language Resources
• Foundation and empirical basis of scientific studies of natural languages
  - The only reliable source for language-specific features
• Infrastructure for knowledge representation and knowledge engineering
• Essential to preserve linguistic and cultural diversity
Tools
• Needed to ‘process’
• General enough for multilingual processing and cross-lingual comparison
• Robust enough to deal with language-specific issues

p. 4 Chinese Language Processing as a Mirror
For the development of Asian language processing:
• Unlike Japanese, which has enjoyed being one of the leaders in technological innovation
• The development of Chinese language processing coincided with the developing economies of Taiwan and China
• Especially with the availability of Chinese-language PCs
• Similar to the situation of many Asian languages now

p. 5 CLP in the past 10 years
A review of what happened in the past ten years in Chinese Language Processing, from a somewhat personal perspective
1992
• Corpora: completion of the first Chinese corpus for linguistic research (Huang and Chen, COLING)
  - untagged, non-segmented
  - but searchable

p. 6 CLP 1992 –
• Segmentation Standard: announcement of the first national standard for word segmentation by the PRC government, 《GB 信息處理用現代漢語分詞規範》 (Contemporary Chinese Word Segmentation Specification for Information Processing)
• Lexicon: completion and release of the first version of the CKIP lexicon (with the category set and ICG thematic roles)
• First version of K. Chen’s parser for Chinese

p. 7 CLP Corpus 1994 –
• th year anniversary for the Automation of Chinese historical textual databases
• Completion of the pre-Qin Classic Chinese corpus at Academia Sinica
• Completion of Sinica Corpus (v million words), the first balanced and tagged Chinese corpus

p. 8 CLP 1996
• Research Institutes
  - 10th anniversary of the Institute of Computational Linguistics at Peking University
  - 10th anniversary of the Chinese Knowledge Information Processing Group at Academia Sinica
• Anthology of Papers
  - Readings in Chinese Natural Language Processing (Journal of Chinese Linguistics Monograph), edited by Huang, Chen, and T’sou

p. 9 CLP 1996 November-1997
• Sinica Corpus on the Web: one of the first fully searchable language corpora on the WWW (old webpage in web archives) (current page)
• 1997: publication of the first Chinese dictionary compiled directly from a corpus (Huang et al.’s Mandarin Daily Classifier Dictionary and Noun-Classifier Collocation Dictionary)
• The Tenth Annual ROCLING conference

p. 10 CLP 1998
• KnowledgeNet: release of HowNet, the first full-fledged Chinese and English-Chinese LKB (lexical knowledge base)
• Segmentation Standard: official announcement of CNS14366 for Taiwan

p. 11 CLP 2000
• Treebanks: simultaneous completion and announcement of two Chinese treebanks
  - Penn Chinese Treebank
  - Sinica Treebank
• ACL Workshop on Chinese Language Processing

p. 12 CLP
• Society: formal approval of the formation of ACL SigHAN, the first international organization on Chinese Language Processing
• 2002
  - First SigHAN workshop on Chinese Language Processing
  - Formal launch of Hsieh’s Intelligent Character Encoding System (a sustainable solution to the missing character problem)
  - COLING 2002 in Taipei

p. 13 CLP
• The First International Chinese Word Segmentation Bakeoff
• Chinese Proposition Bank, 2005, 2007
• Chinese Gigaword Corpus v.1, v.2, and tagged version

p. 14 What CLP Development Showed
• Resources lead
• When tools and standards complete a comprehensive infrastructure
• Research will bloom

p. 15 Resources Development
• Towards a Sharable and Sustainable Model of Resources Development
OLAC: Open Language Archives Community

p. 16 OLAC Aims
OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by:
• developing consensus on best current practice for the digital archiving of language resources;
• developing a network of interoperating repositories and services for housing and accessing such resources.

p. 17 OLAC Organization
Coordinators: Steven Bird & Gary Simons
Council: Anthony Aristar (Linguist List), Christopher Cieri (LDC), Gary Holton (Alaska Native Language Center), Chu-Ren Huang (Academia Sinica), Heidi Johnson (Archive of the Indigenous Languages of Latin America), Laurent Romary (Atilf, University of Nancy), Joan Spanne (SIL), Martin Wynne (Oxford Text Archive)
Participating Archives & Services: 39 archives including LDC, ELRA, DFKI, CBOLD, ANLC, LACITO, Perseus, SIL, APS, Utrecht, Academia Sinica, TalkBank, Rosetta, MPI
Individual Members: ~120

p. 18 Types of Language Resource
• DATA: any information which documents or describes a language, such as a monograph, data file, shoebox of index cards, unanalyzed recordings, heavily annotated texts, or complete descriptive grammar
• TOOLS: computational resources that facilitate creating, viewing, querying, or otherwise using language data; includes fonts, stylesheets, DTDs, and schemas
• ADVICE: any information about reliable data sources, appropriate tools and practices

p. 19 The Gap

p. 20 Coordinated Approach
OAI, OLAC
"A shared architectural vision, having many components, and implemented in stages by the community, will bridge the gap"
Analogies: federated databases; semantic web

p. 21 [Diagram. Labels: CREATE, CONVERT, EXPORT, DELIVER, FORMAT; OLAC, OAI; CONTENT, METADATA; OLAC REPOSITORIES, OLAC SERVICES, USER SERVICES; OLAC PROC, OLAC MHP, OAI MS, DC. Row labels: Software, Recommendations, Initiatives, Standards]

p. 22 The Foundation: 3 initiatives
• Dublin Core Metadata Initiative (DC): founded in 1995 (Dublin, Ohio); conventions for resource discovery on the web
• Open Archives Initiative (OAI): founded in 1999 (Santa Fe); interoperability of e-print services
• Open Language Archives Community (OLAC): founded in 2000 (Philadelphia); a partnership of institutions and individuals creating a worldwide virtual library of language resources

p. 23 Foundation 1: DC Elements
15 metadata elements:
• broad interdisciplinary consensus
• each element is optional and repeatable
• applies to digital and traditional formats
Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights.
dublincore.org

p. 24 Foundation 1: DC Qualifiers
Encoding Schemes:
• a controlled vocabulary or notation used to express the value of an element
• helps a client system to interpret the element content
• e.g. Language = "en" (not "English", "Anglais", ...)
Refinements:
• make the meaning of an element more specific
• e.g. Subject.language, Type.linguistic
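
To make the two kinds of qualifier concrete, here is a minimal Python sketch of a qualified Dublin Core record. It is an illustration only, not the OLAC or DC reference encoding: the `Statement` class, the `values_of` helper, the scheme label "language-code" and all sample values are assumptions made for this example. Only the 15 element names and the `Subject.language` refinement come from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional

# The 15 Dublin Core elements listed on the slide.
DC_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier", "Source",
    "Language", "Relation", "Coverage", "Rights",
)


@dataclass(frozen=True)
class Statement:
    """One element/value pair, optionally qualified (hypothetical class)."""
    element: str
    value: str
    refinement: Optional[str] = None  # narrows the meaning of the element
    scheme: Optional[str] = None      # names the vocabulary the value comes from

    def __post_init__(self) -> None:
        if self.element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {self.element}")

    @property
    def name(self) -> str:
        # Refined elements are written Element.refinement, as on the slide.
        if self.refinement:
            return f"{self.element}.{self.refinement}"
        return self.element


def values_of(record: List[Statement], name: str) -> List[str]:
    """Return every value recorded under a (possibly refined) element name."""
    return [s.value for s in record if s.name == name]


# Elements are optional and repeatable, so a record is just a list.
record = [
    Statement("Title", "Field notes on Amis"),       # invented sample value
    Statement("Creator", "A. Linguist"),             # invented sample value
    Statement("Creator", "B. Linguist"),             # repeated element
    Statement("Language", "en", scheme="language-code"),
    Statement("Subject", "ALV", refinement="language", scheme="language-code"),
]
```

A client can then ask for `values_of(record, "Subject.language")` without parsing free text, which is the point of coded values such as "en" over "English" or "Anglais".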

p. 25 Foundation 2: OAI Repository

p. 26 Foundation 2: OAI Standards
To implement the OAI infrastructure, an archive must comply with two standards:
1. The OAI Shared Metadata Set
  - Dublin Core
  - interoperability across all repositories
2. The OAI Metadata Harvesting Protocol
  - HTTP requests, 6 verbs: Identify, ListIdentifiers, ListMetadataFormats, ListSets, ListRecords, GetRecord
  - XML responses
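
As a rough illustration of the "HTTP requests, 6 verbs" point, the sketch below builds harvesting request URLs with the Python standard library. The base URL and the record identifier are hypothetical, and `build_request` is a helper invented for this example. The `verb`, `metadataPrefix` and `identifier` argument names and the `oai_dc` prefix follow the published OAI protocol. No network access is performed.

```python
from urllib.parse import urlencode

# The six verbs named on the slide.
OAI_VERBS = {
    "Identify", "ListIdentifiers", "ListMetadataFormats",
    "ListSets", "ListRecords", "GetRecord",
}


def build_request(base_url: str, verb: str, **arguments: str) -> str:
    """Return the URL for one harvesting request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    # The verb and its arguments travel as ordinary HTTP query parameters.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


BASE_URL = "http://archive.example.org/oai"  # hypothetical repository

# Ask the repository to describe itself.
identify_url = build_request(BASE_URL, "Identify")

# Harvest all records in the shared Dublin Core format.
records_url = build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

# Fetch a single record by its identifier (identifier is invented).
one_record_url = build_request(
    BASE_URL, "GetRecord",
    identifier="oai:archive.example.org:item-1",
    metadataPrefix="oai_dc",
)
```

A harvester would issue these URLs as HTTP requests and parse the XML responses. That step is left out here because it depends on a live repository.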

p. 27 Foundation 2: OAI Service Providers and Data Providers

p. 28 Foundation 3: OLAC & OAI
Recall: OAI data providers must support:
• Dublin Core metadata
• the OAI metadata harvesting protocol
BUT: OAI data providers can support:
• a more specialized metadata format
• a more specialized harvesting protocol
What OLAC does:
• specialized metadata for language resources
• specialized harvesting (extra validation)

p. 29 OLAC Standards
Aside:
• standards = the protocols and interfaces that allow the community to function
• recommendations = "standards" for representing linguistic content
OLAC has three primary standards:
• OLACMS: the OLAC Metadata Set (Qualified DC)
• OLAC MHP: refinements to the OAI protocol
• OLAC Process: a procedure for identifying Best Common Practice Recommendations

p. 30 The OLAC Metadata Set
The three categories of metadata:
• Work language: describes information entities and their intellectual attributes, e.g. names of works and their creators
• Document language: describes and provides access to the physical manifestation of information, e.g. format, publisher, date, rights
• Subject language: describes what a document is about, e.g. subject, description

p. 31 OLACMS and Controlled Vocabularies
• Language: a language of the intellectual content of the resource (OLAC-Language)
• Subject.language: a language which the content of the resource describes or discusses (OLAC-Language)
• OLAC-Language: a vocabulary for identifying the language(s) that the data is in, or that a piece of linguistic description is about, or that a particular tool can process

p. 32 Summary: With the software in place, we have a complete platform
[Diagram. Labels: CREATE, CONVERT, EXPORT, DELIVER, FORMAT; OAI; CONTENT, METADATA; OLAC PROC, OLAC MHP, OAI MS, DC. Row labels: Software, Recommendations, Initiatives, Standards]

p. 33 Summary: Repositories completely bridge the gap, letting us consistently organize and archive our resources
[Diagram. Labels: CREATE, CONVERT, EXPORT, DELIVER, FORMAT; OAI; CONTENT, METADATA; OLAC REPOSITORIES; OLAC PROC, OLAC MHP, OAI MS, DC. Row labels: Software, Recommendations, Initiatives, Standards]

p. 34 [Diagram. Labels: CREATE, CONVERT, EXPORT, DELIVER, FORMAT; OLAC, OAI; CONTENT, METADATA; OLAC REPOSITORIES, OLAC SERVICES, USER SERVICES; OLAC PROC, OLAC MHP, OAI MS, DC. Row labels: Software, Recommendations, Initiatives, Standards]
Acknowledgements: ISLE and TalkBank projects (NSF), participants of the Philadelphia workshop, Eva Banik (programmer), Hernando de Soto (the analogy)

p. 35 OLACMS helps archive versatility
Given a shared metadata standard:
• New language archives can be created on the fly by harvesting existing archives
• Rich information can be inferred by establishing temporal and geographic anchors for each document
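
The first point can be sketched in a few lines of Python. Once every archive exposes records with the same element names, a new "virtual" archive is little more than a filter over harvested records. The archive names, record contents and the `build_virtual_archive` function below are invented for illustration, and records are simplified to dictionaries mapping an element name to a list of values.

```python
from typing import Dict, List

Record = Dict[str, List[str]]


def build_virtual_archive(
    harvested: Dict[str, List[Record]], subject_language: str
) -> List[dict]:
    """Collect, across archives, the records about one language."""
    selected = []
    for archive_id, records in harvested.items():
        for record in records:
            # Shared element names make this test valid for every archive.
            if subject_language in record.get("Subject.language", []):
                # Remember where each record was harvested from.
                selected.append({**record, "source_archive": archive_id})
    return selected


# Invented sample data standing in for two harvested archives.
harvested = {
    "archive-a": [
        {"Title": ["Amis word list"], "Subject.language": ["ALV"]},
        {"Title": ["Paiwan texts"], "Subject.language": ["PWN"]},
    ],
    "archive-b": [
        {"Title": ["Amis grammar sketch"], "Subject.language": ["ALV"]},
        {"Title": ["Untitled recording"]},  # every element is optional
    ],
}

amis_archive = build_virtual_archive(harvested, "ALV")
```

Here `amis_archive` holds two records, one from each source archive, each tagged with its origin.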

p. 36 OLAC Infrastructure Helps to Solve Language Archive Problems
such as
• Language identification, and
• A metadata set for multi-lingual language archives

p. 37 The Language Identification Problem
• The DC code (e.g. ‘en’ for English) is not enough to describe all the languages in the world
• Ethnologue is comprehensive but not complete
• Potential problems of using Ethnologue (or any existing language list):
  - over-splitting
  - over-chunking
  - omission

p. 38 A Fundamental Solution to Language Identification Problems
• Registering language groups with an OLAC registration service
• An OLAC language classification server would house a comprehensive list of language family names (defined by users) and their extensional definitions (i.e. sets of Ethnologue codes)
• Example: AS:Amis = {ALV, AIS}, where ALV = Amis and AIS = Nataoran
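
A minimal Python sketch of the extensional-definition idea follows. The registry is an ordinary in-memory dictionary and the `expand` and `matches` functions are hypothetical. The slides do not describe the interface of the proposed classification server, so this shows only the lookup logic, using the AS:Amis example from the slide.

```python
# User-defined group names mapped to sets of language codes (their extension).
LANGUAGE_GROUPS = {
    "AS:Amis": frozenset({"ALV", "AIS"}),
}

# Names for the individual codes, as given on the slide.
CODE_NAMES = {"ALV": "Amis", "AIS": "Nataoran"}


def expand(identifier: str) -> frozenset:
    """Return the set of codes an identifier stands for.

    A registered group expands to its members; anything else is treated
    as a single language code standing for itself.
    """
    return LANGUAGE_GROUPS.get(identifier, frozenset({identifier}))


def matches(query: str, record_code: str) -> bool:
    """True if a record tagged with record_code falls under the query."""
    return record_code in expand(query)
```

With this in place, a search for `AS:Amis` finds records tagged either `ALV` or `AIS`, while a search for `ALV` alone does not match a record tagged `AIS`. This lets an archive define its own grouping without changing the underlying code list.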

p. 39 Describing Multi-Lingual Resources in OLACMS
• Directionality is crucial in multilingual resources
• However, OLAC metadata is flat and unordered
Bi-directional MT

p. 40 Multi-lingual Resources II
• Text: language
• Bitext (bilingual aligned corpus): there is always a directionality
  - Original: language
  - Translation: Subject.language
• Language Description (Field Notes): elicitation, transcription, translation, notes
  - Multiple related resources
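
One way to read the bitext convention is sketched below in Python. Even though the metadata is a flat, unordered bag of statements, direction can be recovered if the original is always recorded under Language and the translation under Subject.language. This is an interpretation of the slide, not an OLAC rule. The `translation_directions` function and the two-letter codes are assumptions for the example.

```python
from typing import List, Tuple

Statement = Tuple[str, str]  # (element name, value)


def translation_directions(statements: List[Statement]) -> List[Tuple[str, str]]:
    """Recover (original, translation) pairs from flat metadata."""
    originals = [v for e, v in statements if e == "Language"]
    translations = [v for e, v in statements if e == "Subject.language"]
    # The order of statements carries no meaning; only element names do.
    return [(o, t) for o in originals for t in translations if o != t]


# A bitext translated from Chinese into English.
bitext = [("Subject.language", "en"), ("Language", "zh")]

# A bi-directional MT system lists both languages in both roles.
bidirectional_mt = [
    ("Language", "zh"), ("Language", "en"),
    ("Subject.language", "zh"), ("Subject.language", "en"),
]

bitext_pairs = translation_directions(bitext)
mt_pairs = translation_directions(bidirectional_mt)
```

The bitext yields the single pair `("zh", "en")` regardless of statement order, and the bi-directional system yields both directions.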

p. 41 Language Archives Project of Taiwan
• Part of the National Digital Archives Project (NDAP)
• Pilot Stage
• First Phase:
• Both Language Archives
• And Linguistic Anchor

p. 42 Language and Digital Archives

p. 43 Digital Archives are Linguistically Anchored
Archives are anchored with a Lexical KnowledgeBase (LKB)
- because the LKB, as a collection of lexical types instantiated in archives, uniquely defines each archive
- and each lexical item is the conceptual atom projecting knowledge from archive to archive

p. 44 Multi-anchor Knowledge Linking
• Geographical anchor based on GIS (geographic information system)
  - Ecology (fauna, weather, geology, etc.)
  - Socio-anthropological classification
• Linguistic anchor based on LKB
  - etymology, language grouping, loan words

p. 45 Institute of Linguistics Language Archives

p. 46 Two branch projects:
1. Chinese Archives (5 sub-projects):
  - Early-Mandarin Chinese Lexicon
  - Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts
  - Modern Chinese Corpus and Treebank
  - New Age Corpus: Linguistic Representations and Archives of Multimedia Data
  - Southern-Min Archive: A Database of Historical Change in Language Distribution
2. Formosan Language Archives

p. 47 Early-Mandarin Chinese Lexicon
Goal:
1. Collect the corpus and the lexicon of the Early Mandarin Chinese period.
2. Provide a systematic knowledge thesaurus as well as a powerful instrument for the study of grammatical development.
Archives Description:
1. Digitization of texts (10,000,000 characters).
2. Tagging of grammatical markers (3,500,000 characters).
3. Construction of the lexical database.

p. 49 Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts
Archives Description:
• Digitize the bronze inscriptions from the Shang to the Eastern Chou dynasties.
• Construct a typological lexicon of bronze inscriptions and bamboo scripts.
• Provide accurate encoding and analysis for the bronze inscriptions and Chu scripts.
Achievement:
• Proofread bronze inscriptions (12,113 pieces).

p. 51 Modern Chinese Corpus and Treebank
Achievement:
• Segmented words tagged with their part of speech (10-million-word version in 2006).
• Syntactic tree structures: 30,000.

p. 54 Treebank

p. 55 New Age Corpus: Linguistic Representations and Archives of Multimedia Data
Archives Description:
1. A multimodal corpus of spoken Mandarin in Taiwan.
2. Collected by means of different designs of tasks and scenarios.
3. Combines written transcripts with digital video and audio processing technology.

p. 56 New Age Corpus: Linguistic Representations and Archives of Multimedia Data
Achievement:
• Transcribed and transformed 11 hours of digital data.
• Tagged 5 hours of speech data.

p. 58 Southern-Min Archive: A Database of Historical Change in Language Distribution
Archives Description:
1. Built from the perspectives of historical change and geographical distribution.
2. A tagged corpus of Southern Min written documents from the 16th century to the 20th century.
3. A linguistic geographic information system displaying distributions of languages in Hsinfeng.

p. 60 Formosan Language Archives
Archives Description:
1. Preserve the endangered Formosan Austronesian languages
  1.1 corpora, lexicons and grammars
  1.2 integration of linguistic information with GIS
2. Fifteen extant Formosan languages
  2.1 Rukai, Yami, Saisiyat, Tsou, Atayal, Bunun, Paiwan, Amis and Puyuma

p. 64 Sinica BOW: Bilingual Ontological Wordnet
• To construct a Chinese WordNet as the linguistic ontology for knowledge representation;
• To provide linguistic anchoring grounded with temporal information by building a synchronic lexicon for all historical periods; and
• To provide linguistic anchoring reference and implementation services.

p. 66 Asian Language Resources Committee
Mail List:
• Affiliated with AFNLP
• Cataloguing Asian language resources; will adopt OLACMS and search engine
• Hosting ALR Workshops (5 so far)
• Asian Language Processing special issues in Language Resources and Evaluation
• Co-Chairs: Tokunaga, Huang

An overview of the Natural Language Toolkit
Project Leaders: Steven Bird, Edward Loper, Ewan Klein
Acknowledgement: I would like to thank Steven Bird for agreeing to let me use these slides on NLTK

p. 68 Summary
• NLTK is a suite of open source Python modules, data sets and tutorials
• supporting research and development in natural language processing
• Download NLTK from nltk.sourceforge.net
• A truly multilingual toolkit accessible to beginning researchers in NLP
  - A good way to attract international scholars to research on your language
• Also a good stepping stone for a language with developing HLT to test a full range of NLP applications

p. 69 Components of NLTK
1. Code: corpus readers, tokenizers, stemmers, taggers, chunkers, parsers, wordnet, ... (50k lines of code)
2. Corpora: 20+ annotated data sets widely used in natural language processing (300 MB of data)
3. Documentation: a 360-page book, articles, reviews, API documentation

p. 70 Code
• corpus readers
• tokenizers
• stemmers
• taggers
• parsers
• wordnet
• semantic interpretation
• clusterers
• evaluation metrics
• …

p. 71 Corpora
• Brown Corpus
• Carnegie Mellon Pronouncing Dictionary
• CoNLL 2000 Chunking Corpus
• Project Gutenberg Selections
• NIST 1999 Information Extraction: Entity Recognition Corpus
• US Presidential Inaugural Address Corpus
• Indian Language POS-Tagged Corpus
• Prepositional Phrase Attachment Corpus
• SENSEVAL 2 Corpus
• Sinica Treebank Corpus Sample
• Universal Declaration of Human Rights Corpus
• Stopwords Corpus
• TIMIT Corpus Sample
• Treebank Corpus Sample
• …

p. 72 Documentation
• a 360-page book about natural language processing in Python and NLTK
  - teaches Python and NLP
  - provides numerous examples and exercises
• installation instructions
• presentation slides for some of the book chapters
• API documentation: describes every module, interface, class, and method

p. 73 Parser demonstrations

p. 74 Interactive session (WordNet)

p. 75 Adoption in NLP courses
Amsterdam, Ben-Gurion, Brown, Bryn Mawr, CDAC-Mumbai, Coruña, Edinburgh, Erlangen, Georgetown, Helsinki, IIT-Bombay, Iowa State, Konstanz, MIT, Macquarie, Magdeburg, Malta, Marquette, Melbourne, Nancy, Naval Postgraduate School, Northeastern, Ohio State, Pitt, San Diego State, Simon Fraser, Stanford, Syracuse University, Tsuda College, U Colorado, UC Berkeley, UMass Amherst, UNAM, U Penn, UT Austin, Warsaw

p. 76 Contribute…
• NLTK is an open source project
• all code, data, and documentation are free
• dozens of people have contributed over the past 6 years
• please visit the website for project ideas
• sign up on the NLTK-Announce mailing list to hear about new releases