LR College Paris: 10th ECESS Meeting, College Language Resources, Paris, January 2008

Presentation transcript:

10th ECESS Meeting, College Language Resources, Paris, January 2008
1. Goal of meeting
2. Status of College members
3. Interests and acceptance of associated members and observers
4. Acceptance of College minutes of last meeting
5. College Action List of 9th meeting

6. Status of partners (as in the TA and in the Maribor Pool-xls-data)
o Pronunciation lexica (Pool Lex1, Pool Lex2)
o Acoustic data for TTS voices (Pool Voice1, Pool Voice2)
o Text corpora (Pool Text1, Pool Text2)
7. The current state of the LR specification
o Settling/finalizing the specification for text corpora (Pool Text1, Pool Text2)
o Settling/finalizing the specification for acoustic data for TTS voices (minimal requirements, Pool Voice2)
8. Interests and further plans of partners

9. Discussion: general issues
o ECESS LR specification documents (public page)
o LSPs specifications (internal page)
o LR distribution (internal page)
o LR exchanging agreement
o Splitting LR

10. Discussion: further directions of LR College
o Promotion of ECESS LR
o Extension of LR collection. New types of Pools (e.g. acoustic databases for speaker characterization, emotional databases, special databases with pathological voices/speech), depending on the interests and needs of ECESS.
11. New Action List of College

1. Main goals
o Status and further plans of partners
o Interests and acceptance of associated members
o Settling/finalizing the specification for POS tagging
o ECESS LR specification documents (public and internal page)
o Extension of LR collection
o Distribution of LR

2. Status of LR College members
o AMU University of Poznan (Coordinator: Grażyna Demenko)
o Siemens (Harald Höge)
o Middle East Technical University, Ankara (Tolga Çiloğlu)
o CAS (Jinhua Tao)
o Uni Bonn (Stefan Breuer)
o Uni Munich ( )
Associates and observers:
o Nokia (Imre Kiss)
o Microsoft Portugal (Daniela Braga)

3. Interests and acceptance of associated members and observers
Uni Bielefeld (Dafydd Gibbon)
1) MBROLA diphone voice creation service for new languages
2) German lexicon (details to be specified)
3) An experimental child's voice, with recordings and a report on the issues involved
4) Particular interest in multilingual resources and in under-documented languages
Other members/observers?

4. Acceptance of College minutes of last meeting
5. College Action List of 9th meeting
o Settling/finalizing specifications for text corpora (POS): PT1, PT2 Pool
o Settling/finalizing specifications for LR voice database: non-standard PV2 Pool
o Lexicon: PL1, PL2 Pool final documentation, end of 2007 (internal ECESS pages)

6. Status of partners (as in the TA and in the Maribor Pool-xls-data)
Types of LR and related Pools
Pools for pronunciation lexica
(1) PL1: Pool Lex1, according to LC-STAR specs
(2) PL2: Pool Lex2, according to minimum requirements
Pools for voices
(1) PV1: Pool Voice1, according to TC-STAR specs
(2) PV2: Pool Voice2, according to minimum requirements
Pools for text corpora
(1) PT1: Pool Text1, according to ECESS specs
(2) PT2: Pool Text2, according to minimum requirements
Pools Lex1 and Voice1
o Pool Lex1: according to LC-STAR specs as described earlier (documents available from the ECESS website)
o Pool Voice1: according to TC-STAR specs as described earlier (documents available from the ECESS website)
Pools Lex2 and Voice2
o Specifications of minimum requirements and thresholds will be defined during the first period of ECESS (coordinated by Uni. Munich).
o Preferably defined as a subset of TC/LC-STAR criteria.
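The pool taxonomy on this slide can be written down as a small lookup table. The Python sketch below is only an illustration: the names `Pool`, `POOLS` and `pools_for`, and the resource-type labels, are assumptions made for the example and do not come from any ECESS document. Only the pool identifiers and the specification each pool follows are taken from the slide.

```python
from typing import NamedTuple


class Pool(NamedTuple):
    # Hypothetical record layout, chosen for this example only.
    name: str
    resource_type: str
    specification: str


# Pool identifiers and specifications as listed on the slide.
POOLS = {
    "PL1": Pool("Pool Lex1", "lexicon", "LC-STAR specs"),
    "PL2": Pool("Pool Lex2", "lexicon", "minimum requirements"),
    "PV1": Pool("Pool Voice1", "voice", "TC-STAR specs"),
    "PV2": Pool("Pool Voice2", "voice", "minimum requirements"),
    "PT1": Pool("Pool Text1", "text", "ECESS specs"),
    "PT2": Pool("Pool Text2", "text", "minimum requirements"),
}


def pools_for(resource_type):
    """Return the sorted identifiers of all pools of one resource type."""
    return sorted(
        pool_id
        for pool_id, pool in POOLS.items()
        if pool.resource_type == resource_type
    )
```

For example, `pools_for("voice")` lists the two voice pools, and `POOLS["PT1"].specification` gives the specification that Pool Text1 follows.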

Technical Annex

PRESENT RESOURCES
o Siemens: UK lexicon (10/2007), UK baseline voice validated
o Nokia: LC-STAR Mandarin lexicon and TC-STAR Mandarin TTS database (1 male voice) for exchange in ECESS
o AMU: LC-STAR Polish lexicon
o UPC: Catalan, 2 speakers, 10 h baseline voices

7. The current state of the LR specification
o Settling/finalizing the specification for text corpora (Pool Text1, Pool Text2)
o Settling/finalizing the specification for acoustic data for TTS voices (minimal requirements, Pool Voice2)

Design Principles of the Acoustic Corpora
Size of corpus: 10 h of speech per baseline speaker per language
The 'Baseline Text Corpus' is composed of the following corpora (sizes in words):
o Transcribed speech
o Written text (novels and short stories with short sentences)
o Selected phrases (frequent phrases, triphone sentences, mimic sentences)
Minimal requirements for acoustic data: coordinated by University of Munich

LR College Maribor: 9th ECESS meeting
Text corpus specifications (for POS tagging)
Size of corpus:
o Expected size of text data: 100K tokens minimum, 100% manually checked
o The rest (500K-1M) can be done automatically
Domains:
o Mandatory: 20K should come from spoken transliterations
o Preferred: in line with the TC-STAR text corpora (in line with acoustic data creation)
TC-STAR text corpus as basis for POS tagging (90K words)
LC-STAR tag set, or comparable, but the tag set in the lexicon and in the tagged text corpus must match
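The size and consistency requirements on this slide lend themselves to an automatic check. The following Python sketch is illustrative only and is not an ECESS tool: the `Token` layout, the function name `check_pos_corpus` and its parameters are assumptions. Only the two thresholds (100K manually checked tokens, 20K tokens from spoken material) and the rule that corpus and lexicon tag sets must match come from the slide. The sketch also assumes that the 20K spoken tokens are counted over the whole corpus, and it checks the tag sets in one direction only (every corpus tag must appear in the lexicon tag set).

```python
from dataclasses import dataclass

MIN_MANUAL_TOKENS = 100_000  # slide: 100K tokens minimum, manually checked
MIN_SPOKEN_TOKENS = 20_000   # slide: 20K from spoken material


@dataclass
class Token:
    # Hypothetical record layout; the slide does not define a corpus format.
    form: str
    tag: str
    manually_checked: bool
    spoken: bool


def check_pos_corpus(tokens, lexicon_tags,
                     min_manual=MIN_MANUAL_TOKENS,
                     min_spoken=MIN_SPOKEN_TOKENS):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    manual = sum(1 for t in tokens if t.manually_checked)
    spoken = sum(1 for t in tokens if t.spoken)
    if manual < min_manual:
        problems.append(f"only {manual} manually checked tokens")
    if spoken < min_spoken:
        problems.append(f"only {spoken} tokens from spoken material")
    # Assumption: corpus tags must be a subset of the lexicon tag set.
    unknown = {t.tag for t in tokens} - set(lexicon_tags)
    if unknown:
        problems.append(f"tags not in lexicon tag set: {sorted(unknown)}")
    return problems
```

The thresholds are keyword arguments so that the same check can be tried on a small sample before the full corpus is available.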

LR College Maribor: 9th ECESS meeting
Discussion: POS tagging
o Size of text, domains
o Tokenization problems
o POS tagging sets
o Format of POS tagging
o Validation

8. Plans of partners

9. Discussion: general issues
o ECESS LR specification documents (public page)
o LSPs specifications (internal page)
o Splitting LR
o LR distribution (internal page)
o LR exchanging agreement

ECESS LR specification documents (public page, internal page)
o The language-independent specification is public and should be accessible from the public ECESS web page.
o The language-specific data (Language Specific Peculiarities, LSP; the LSP could be extended to contain all the 'contact information') are part of the LR dedicated to a pool.
o The LSPs have to be approved by the LR College.
o The LSPs are located on the internal web page of ECESS (College LR).
o A new public 'ECESS' specification document (different LC-STAR and TC-STAR documents together, ECESS specification LR papers, publication)

Splitting LR
SIE suggests splitting the data in the lexicon pool into a 'lexicon for common words' (which we will deliver for UK) and a 'lexicon for proper names'. Partners interested only in parts of the lexica could then choose what they want to deliver and exchange.
Advantage: some partners may only want to deliver or get certain parts of a particular language; production costs for the different parts are more comparable.
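The proposed split amounts to partitioning lexicon entries by category. The Python sketch below illustrates this under stated assumptions: entries are represented as dictionaries with a `"tag"` field, and proper names are marked with the tag `"PROPER"`. Both the entry layout and the tag value are hypothetical and are not taken from the LC-STAR or ECESS specifications.

```python
def split_lexicon(entries, proper_name_tag="PROPER"):
    """Partition lexicon entries into (common_words, proper_names).

    Assumes each entry is a dict with a "tag" key; the tag value that
    marks proper names is a placeholder chosen for this example.
    """
    common, proper = [], []
    for entry in entries:
        target = proper if entry["tag"] == proper_name_tag else common
        target.append(entry)
    return common, proper
```

Each partner could then deliver or request either part on its own, for example `common, proper = split_lexicon(entries)`.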

LR College Maribor: 9th ECESS meeting
o LR distribution (internal page)
o LR exchanging agreement
LR-agreement: within the college 'Tools' Uni. Maribor acts as a distributor of tools needed for evaluation.

10. Discussion: further directions of LR College
o Promotion of ECESS LR
o Extension of LR collection. New types of Pools (e.g. acoustic databases for speaker characterization, emotional databases, special databases with pathological voices/speech), depending on the interests and needs of ECESS.
11. New Action List of College