Published by Magnus Roberts. Modified over 9 years ago.
1
LR College Paris, 10th ECESS meeting
10th ECESS Meeting, College Language Resources, Paris, January 2008
1. Goal of meeting
2. Status of College members
3. Interests and acceptance of associated members and observers
4. Acceptance of College minutes of last meeting
5. College Action List of 9th meeting
2
6. Status of partners (as in TA and in Maribor Pool-xls-data)
o Pronunciation lexica (Pool Lex1, Pool Lex2)
o Acoustic data for TTS voices (Pool Voice1, Pool Voice2)
o Text corpora (Pool Text1, Pool Text2)
7. The current state of the LR specification
o Settling/finalizing the specification for text corpora (Pool Text1, Pool Text2)
o Settling/finalizing the specification for acoustic data for TTS voices (minimal requirements, Pool Voice2)
8. Interests and further plans of partners
3
9. Discussion: general issues
o ECESS LR specification documents (public page)
o LSP specifications (internal page)
o LR distribution (internal page)
o LR exchange agreement
o Splitting LR
4
10. Discussion: further directions of LR College
o Promotion of ECESS LR
o Extension of LR collection: new types of Pools (e.g. acoustic databases for speaker characterization, emotional databases, special databases with pathological voices/speech), depending on the interests and needs of ECESS
11. New Action List of College
5
1. Main Goals
o Status and further plans of partners
o Interests and acceptance of associated members
o Settling/finalizing the specification for POS tagging
o ECESS LR specification documents (public and internal page)
o Extension of LR collection
o Distribution of LR
6
2. Status members of LR College
o AMU University of Poznan (Coordinator: Grażyna Demenko)
o Siemens (Harald Höge)
o Middle East Technical University, Ankara (Tolga Çiloğlu)
o CAS (Jinhua Tao)
o Uni Bonn (Stefan Breuer)
o Uni Munich ( )
Associates and Observers:
o Nokia (Imre Kiss)
o Microsoft Portugal (Daniela Braga)
7
3. Interests and acceptance of associated members and observers
Uni Bielefeld (Dafydd Gibbon)
1) MBROLA diphone voice creation service for new languages
2) German lexicon (details to be specified)
3) An experimental child's voice, with recordings and a report on the issues involved
4) Particular interest in multilingual resources and in under-documented languages
Other members/observers?
8
4. Acceptance of College minutes of last meeting
5. College Action List of 9th meeting
o Settling/finalizing specifications for text corpora (POS): PT1, PT2 Pool
o Settling/finalizing specifications for LR voice database: non-standard PV2 Pool
o Lexicon: PL1, PL2 Pool
o Final documentation: end of 2007 (internal ECESS pages)
9
6. Status of partners (as in TA and in Maribor Pool-xls-data)
Types of LR and related Pools
Pools for Pronunciation lexica
(1) PL1: Pool Lex1, according to LC-STAR specs
(2) (PL2): Pool Lex2, according to minimum requirements
Pools for Voices
(1) PV1: Pool Voice1, according to TC-STAR specs
(2) (PV2): Pool Voice2, according to minimum requirements
Pools for Text Corpora
(1) PT1: Pool Text1, according to ECESS specs
(2) (PT2): Pool Text2, according to minimum requirements
Pools Lex1 and Voice1
o Pool Lex1: according to LC-STAR specs as described earlier (documents available from the ECESS website)
o Pool Voice1: according to TC-STAR specs as described earlier (documents available from the ECESS website)
Pools Lex2 and Voice2
o Specifications of minimum requirements and thresholds will be defined during the first period of ECESS (coordinated by Uni. Munich).
o Preferably defined as a subset of the TC/LC-STAR criteria.
10
Technical Annex
11
PRESENT RESOURCES
o Siemens: UK lexicon (10/2007), UK baseline voice validated
o Nokia: LC-STAR Mandarin lexicon and TC-STAR Mandarin TTS database (1 male voice) for exchange in ECESS
o AMU: LC-STAR Polish lexicon
o UPC: Catalan, 2 speakers, 10 h baseline voices
12
7. The current state of the LR specification
o Settling/finalizing the specification for text corpora (Pool Text1, Pool Text2)
o Settling/finalizing the specification for acoustic data for TTS voices (minimal requirements, Pool Voice2)
13
Design Principles of the Acoustic Corpora
Size of corpus: 10 h of speech per baseline speaker per language
The 'Baseline Text Corpus' is composed of the following corpora:
o Transcribed speech: 45 000 words
o Written text (novels and short stories with short sentences): 27 000 words
o Selected phrases (frequent phrases, triphone sentences, mimic sentences): 18 000 words
Minimal requirements for acoustic data: coordinated by University of Munich
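As a quick arithmetic check, the three components of the Baseline Text Corpus add up to 90 000 words, which agrees with the 90K-word TC-STAR text corpus cited on the POS tagging slide. The short sketch below computes the total and each component's share; the dictionary keys are illustrative names, not identifiers from the specification.

```python
# Word counts per component, as listed on the slide.
# The key names are illustrative assumptions, not official labels.
baseline_text_corpus = {
    "transcribed_speech": 45_000,
    "written_text": 27_000,
    "selected_phrases": 18_000,
}

# Total size of the Baseline Text Corpus in words.
total_words = sum(baseline_text_corpus.values())

# Fraction of the corpus contributed by each component.
shares = {name: words / total_words for name, words in baseline_text_corpus.items()}

for name, words in baseline_text_corpus.items():
    print(f"{name}: {words} words ({shares[name]:.0%})")
print(f"total: {total_words} words")
```

The split works out to one half transcribed speech, three tenths written text, and one fifth selected phrases.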
14
LR College Maribor: 9th ECESS meeting
Text corpus specifications (for POS tagging)
Size of corpus:
o Expected size of text data: 100K tokens minimum, 100% manually checked
o The rest (500K-1M) can be done automatically
Domains:
o Mandatory: 20K should come from spoken transliterations
o Preferred: in line with the TC-STAR text corpora (in line with acoustic data creation)
o TC-STAR text corpus as basis for POS tagging (90K words)
o LC-STAR tag set, or comparable, but the tag set in the lexicon and in the tagged text corpus must match
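The minimum requirements above could be checked mechanically before a corpus is submitted to a pool. The following is a minimal sketch, not an ECESS tool: the function name, its parameters, and the placeholder tags are hypothetical. It also makes two assumptions the slide does not state: that the 20K spoken tokens count toward the 100K manually checked tokens, and that "must match" means every tag used in the corpus also appears in the lexicon tag set.

```python
# Thresholds taken from the slide.
MIN_MANUAL_TOKENS = 100_000  # "100K tokens minimum, 100% manually checked"
MIN_SPOKEN_TOKENS = 20_000   # "20K should come from spoken transliterations"


def check_pos_corpus(manual_tokens, spoken_tokens, corpus_tags, lexicon_tags):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper. Assumes "tag sets must match" means the corpus
    uses no tag that is absent from the lexicon tag set.
    """
    problems = []
    if manual_tokens < MIN_MANUAL_TOKENS:
        problems.append(
            f"only {manual_tokens} manually checked tokens, need {MIN_MANUAL_TOKENS}"
        )
    if spoken_tokens < MIN_SPOKEN_TOKENS:
        problems.append(
            f"only {spoken_tokens} spoken tokens, need {MIN_SPOKEN_TOKENS}"
        )
    unknown = sorted(set(corpus_tags) - set(lexicon_tags))
    if unknown:
        problems.append("tags missing from lexicon tag set: " + ", ".join(unknown))
    return problems


# Placeholder tags for illustration only, not an actual tag inventory.
ok = check_pos_corpus(120_000, 25_000, {"NOM", "VER"}, {"NOM", "VER", "ADJ"})
bad = check_pos_corpus(80_000, 5_000, {"NOM", "XYZ"}, {"NOM", "VER"})
print(ok)
print(bad)
```

Returning a list of problems rather than a single boolean lets a partner see every unmet requirement in one pass.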
15
Discussion: POS tagging
o Size of text, domains
o Tokenization problems
o POS tag sets
o Format of POS tagging
o Validation
16
8. Plans of Partners
17
9. Discussion: general issues
o ECESS LR specification documents (public page)
o LSP specifications (internal page)
o Splitting LR
o LR distribution (internal page)
o LR exchange agreement
18
ECESS LR specification documents (public page, internal page)
o The language-independent specification is public and should be accessible from the public ECESS web page.
o The language-specific data (Language Specific Peculiarities, LSP; the LSP could be extended to contain all the 'contact information') is part of the LR dedicated to a pool.
o The LSPs have to be approved by the LR College.
o The LSPs are located on the internal web page of ECESS (College LR).
o A new public 'ECESS' specification document (the different LC-STAR and TC-STAR documents together, ECESS LR specification papers, publication)
19
Splitting LR
SIE suggests splitting the data in the lexicon pool into 'lexicon for common words' (which we will deliver for UK) and 'lexicon for proper names'. Partners interested only in parts of the lexica could then choose what they want to deliver and exchange.
Advantages: some partners may only want to deliver or get certain parts of a particular language; production costs for the different parts are more comparable.
20
o LR distribution (internal page)
o LR exchange agreement
LR agreement: within the College 'Tools', Uni. Maribor acts as a distributor of the tools needed for evaluation.
21
10. Discussion: further directions of LR College
o Promotion of ECESS LR
o Extension of LR collection: new types of Pools (e.g. acoustic databases for speaker characterization, emotional databases, special databases with pathological voices/speech), depending on the interests and needs of ECESS
11. New Action List of College