

A Country Report – COCOSDA Activities in China
Data: more and more companies supplying data resources and services are emerging in China, a new trend
O-COCOSDA 2016, Bali, Indonesia
Aijun LI, *Dong WANG
Institute of Linguistics, Chinese Academy of Social Sciences
*Research Institute of Information Technology, Tsinghua University

Institute of Linguistics, CASS
AESOP-CASS: 10,000 h; more annotated data and more research carried out.
Discourse-CASS: more than 1000 dialogues and 100 discourses with rich annotation of segmental and prosodic structure, information structure, rhetorical structure, speech acts (topics, adjacency pairs), referential structure, dependency relations, and expression.
Word-Child-CASS: word database for ages 1.5 to 6 years, 4000 children, all with canonical and actual pronunciation annotation.
Articulatory EMA-CASS: English word EMA data for English L2 learners and native speakers, 10 speakers (new in 2016).
Articulatory FMRI-CASS: syllable data from one Chinese speaker.

Chinese LDC
To date there are 102 corpora, including speech synthesis/recognition corpora, machine translation corpora, lexicons, and other natural language processing corpora.
In 2016, 4 new corpora were added (1 syntactic and 3 others), and 20 corpora (7 spoken language, 5 translation, 3 speech recognition/synthesis, 2 lexicon, and 3 others) were distributed to 11 institutes and companies.
[Charts: types of the corpora; providers of the corpora; types of the users]

Tsinghua University, CSLT
Free Chinese data: THCHS-30 and THUYG-20
  Kaldi recipe available
  All accompanying resources available
  On OpenSLR
Free speech data for minority languages coming soon
  Supported by an NSFC important project on multilingual minority-language ASR
  In collaboration with Xinjiang University and Northwest National University
  Uyghur, Tibetan, Mongolian, Kazak, Kirghiz
  First stage: 20 hours per language, with accompanying lexicon and text data

http://www.speechocean.com/
A worldwide data resources and services supplier with 15 years of experience in Human Computer Interaction and Human Language Technology, covering speech synthesis, speech recognition, machine translation, web search, image recognition, and natural language understanding.
Volume increasing
Languages increasing

Datatang
globalsales@datatang.com
Datatang aims to make your products even smarter. Training algorithms and building machine learning models have never been easier.
Data resource, Data Market, application
Datatang can collect data from various industries, including government, finance, health care, and traffic.
Has 500,000 certified collectors distributed globally.
Releases 100 new collection missions weekly.
Steadily collects various types of data, including speech, image, and text.
Data Mall is the largest data exchange platform: 45,000 datasets in all domains, 1,600,000+ exchanges in 2015.
Trusted partner of Fortune 500 companies
Processed over 10 million images and 100,000 hours of speech in 2015
Employees: 1000+
Data Service, Data Exchange, API/SDK

http://www.huitingtech.com/
About Huiting Data
Founded in 2011, Huiting Data is a leading multimedia data and technology service provider. Huiting Data collaborates with international artificial intelligence technology companies, providing high-quality speech, image, text, and other multimedia databases.
Award-Winning Corpus
Multi-accent Mandarin Speech Recognition Database
  Awarded as one of the five creative products by the Speech Industry Alliance of China (SIAC)
  Speakers' dialects include Cantonese, Min Dialect, Gan Dialect, Sichuan Dialect, Wu Dialect, and Xiang Dialect
  Sentence-wise error rate < 2% | Recorded by smartphone
  3000 hours | 3000 speakers | Mono PCM | 16 kHz | 16-bit
Featured Corpuses
  Cantonese Speech Recognition Database: 1500 speakers | 1000 hours
  Cantonese-English Mixed Recognition Database: 1400 speakers | 500 hours
  Mandarin-English Mixed Recognition Database: 2700 speakers | 1000 hours
  Multi-language Speech Recognition Database: 300 speakers | 90 hours for each language
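
The audio format quoted for the multi-accent Mandarin database (mono PCM, 16 kHz, 16-bit) is easy to check on delivery. The sketch below is illustrative only and is not part of the original slides. It assumes the recordings arrive as standard WAV files, which the slides do not state, and the function names (`matches_spec`, `raw_size_bytes`, `make_test_wav`) are hypothetical. It also estimates the raw size of the sample data for a given number of hours at that format.

```python
import io
import wave

# Format stated on the slide: mono PCM, 16 kHz, 16-bit.
EXPECTED_CHANNELS = 1
EXPECTED_RATE_HZ = 16_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample


def matches_spec(wav_file) -> bool:
    """Return True if a WAV file (path or file object) matches the stated format."""
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )


def raw_size_bytes(hours: float) -> int:
    """Uncompressed sample data size for the given duration, excluding file headers."""
    seconds = hours * 3600
    bytes_per_second = (
        EXPECTED_RATE_HZ * EXPECTED_SAMPLE_WIDTH_BYTES * EXPECTED_CHANNELS
    )
    return int(seconds * bytes_per_second)


def make_test_wav(rate_hz: int) -> io.BytesIO:
    """Build a one-second silent mono 16-bit WAV in memory (for illustration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * rate_hz)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(matches_spec(make_test_wav(16_000)))  # file at the stated sample rate
    print(matches_spec(make_test_wav(8_000)))   # file at a different sample rate
    print(raw_size_bytes(3000) / 1e9, "GB of raw samples for 3000 hours")
```

At 32,000 bytes per second, the 3000 hours quoted above come to roughly 346 GB of raw sample data, which gives a sense of the storage and transfer volumes involved in corpora of this size.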