Tanja Schultz, Carnegie Mellon University. Cairo, Egypt, May 21, 2001. Data Recording, Transcription, and Speech Recognition for Egypt.



Outline

1. Requirements for Speech Recognition
   - Data requirements
   - Audio data
   - Pronunciation dictionary
   - Text corpus data
   - Recording of audio data
   - Transcription of audio data
2. Initialization of an Egyptian Arabic Speech Recognition Engine
   - Multilingual speech recognition
   - Rapid adaptation to new languages

Part 1: Requirements for Speech Recognition

- Data requirements
- Audio data
- Pronunciation dictionary
- Text corpus data
- Recording of audio data
- Transcription of audio data

Thanks to Celine Morel and Susanne Burger

Speech Recognition

[Block diagram: speech input -> preprocessing -> decoding/search -> postprocessing -> synthesis (TTS); output hypotheses such as "Hello", "Hale Bob", "Hallo"]

Fundamental Equation of SR

P(W|x) = P(x|W) * P(W) / P(x)

The acoustic model scores P(x|W), the pronunciation dictionary links words to phoneme sequences (am -> AE M, are -> A R, I -> AI, you -> J U, we -> V E), and the language model scores P(W) over word sequences (I am, you are, we are, ...).
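To make the decision rule concrete, here is a minimal sketch; the candidate word sequences and all scores are invented for illustration. Since P(x) is the same for every candidate W, the recognizer simply maximizes P(x|W) * P(W), usually in the log domain:

```python
import math

# Toy decoder for the fundamental equation; candidates and scores are
# invented for illustration only.
candidates = {
    "hello":    {"log_p_x_given_w": -12.0, "log_p_w": math.log(0.020)},
    "hale bob": {"log_p_x_given_w": -11.5, "log_p_w": math.log(0.0001)},
    "hallo":    {"log_p_x_given_w": -12.2, "log_p_w": math.log(0.004)},
}

def decode(cands):
    """Return the W maximizing P(x|W) * P(W); P(x) is constant over W."""
    return max(cands, key=lambda w: cands[w]["log_p_x_given_w"] + cands[w]["log_p_w"])

best = decode(candidates)
print(best)  # hello
```

Working in log space avoids numerical underflow when many frame-level probabilities are multiplied.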

SR: Data Requirements

- Audio data -> acoustic model
- Phoneme set and pronunciation dictionary (am -> AE M, are -> A R, I -> AI, you -> J U, we -> V E)
- Text data -> language model (I am, you are, we are, ...)

Audio Data

For training and testing the SR engine, a large amount of high-quality data in the target language must be collected.
- What kind of data is needed
  - Scenario and task
  - How to collect the data, recording setup
  - Preparation of information
- Quality of data
  - Sampling rate, resolution
- Amount of data
  - Number of dialogs and speakers
- Transcription of audio data

What Kind of Audio Data

C-STAR scenario: travel arrangement (planning a vacation trip, booking a hotel room, ...)
- The scenario is realistic and attractive to the speakers
- Dialog between two people:
  - One agent: travel assistant
  - One client: traveler, pretends to visit a specific site
- Speakers get instructions about what task they have to accomplish, but not how to do it
- Role-playing setup

How to Collect Audio Data

- Recording setup
  - The dialog partners cannot see each other, i.e. no face-to-face contact (in preparation for telephone and web applications)
  - No non-verbal communication
  - Spontaneous speech (noise effects, disfluencies, ... may occur)
  - No push-to-talk; try to avoid crosstalk
  - Balanced dialogs
- Dialog structure, task
  - Greetings and formalities between dialog partners
  - Client gives information such as number of persons traveling, dates of travel (arrival/departure), interests
  - Client asks questions about means of transportation (train, flight), hotel or apartment modalities, visits of sights or cultural events
  - Agent provides information according to the client's questions

Prepare Information for Client and Agent

- Agent: hotel list (3-4 hotels per dialog)
- Agent: transportation list (3-4 flights, train and bus schedules)
- Agent: list of 3-4 cultural events per dialog
- Client: information about the specific task:
  - who is traveling (e.g. client travels with partner and two kids)
  - when is s/he traveling (e.g. 2-week vacation trip in July)
  - where (e.g. trip to Pennsylvania, US)
  - how (e.g. direct flight to Pittsburgh, rental car)
  - what are the places of interest (CMU in Pittsburgh, Liberty Bell in Philadelphia, ...)
- Date and time of recording might be faked
- Dialog takes place at the recording site
- Example sheets: Celine Morel

Quality and Quantity of Audio Data

- Quality of data
  - High-quality clean speech: close-speaking microphone, e.g. Sennheiser H-420
  - 16 kHz sampling rate, 16-bit resolution
- Amount of data
  - Minimum of 10 hours of spoken speech
  - Average length of dialogs: … minutes
  - 10 hours = … dialogs
- Number of speakers
  - As many speakers as possible (speaker-independent AM)
  - … dialogs = a maximum of 120 different speakers
  - Split the speakers/dialogs into three disjoint subsets: training set, development test set, evaluation test set

Recording Tool: Total Recorder

- http://…
- Registration fee: $…
- IBM-compatible PC with soundcard (e.g. Soundblaster)
- Close-speaking microphone (e.g. Sennheiser H-420)
- Win95, Win98, Win2000, WinNT

[Diagram: sound board -> driver -> Total Recorder]

Transcription of Audio Data

For training the SR engine, the spoken data must be transcribed manually.
- Very time consuming (10-20 times real time)
- The more accurately transcribed, the more valuable the data
- Since we have the pronunciations, only word-based transcriptions are needed
- Transcription convention from Susanne Burger
  - download from …
  - describes the notation
- Transcription tool: transEdit (Burger & Meier)

Transliteration Conventions

Example:
tanja_0001: this sentence +uhm+ was spoken +pause+ by ~Tanja and +/cos/+ contains one restart

- Parsability: one turn per line (tanja_0001)
- Consistency
- Filter programs
  - tagging of proper names (~Tanja)
  - tagging of numbers
  - special noise markers (+uhm+)
  - no capitalization at the beginning of turns
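A filter program for such conventions can be a handful of regular expressions. This is a minimal sketch; the marker patterns and tag names are read off the example above, not the official Burger convention document:

```python
import re

# Illustrative patterns for the transcription markers shown above.
NAME = re.compile(r"~(\w+)")             # ~Tanja   -> proper name
NOISE = re.compile(r"\+(\w+)\+")         # +uhm+    -> noise/filler marker
RESTART = re.compile(r"\+/([^/]*)/\+")   # +/cos/+  -> restart fragment

def check_turn(line):
    """Split the turn id from the text and collect tagged tokens."""
    turn_id, text = line.split(":", 1)
    return {
        "turn": turn_id.strip(),
        "names": NAME.findall(text),
        "noises": NOISE.findall(text),
        "restarts": RESTART.findall(text),
    }

turn = check_turn("tanja_0001: this sentence +uhm+ was spoken +pause+ by ~Tanja "
                  "and +/cos/+ contains one restart")
print(turn["names"], turn["noises"], turn["restarts"])
```

Running every transcript line through such a checker is an easy way to enforce the one-turn-per-line and consistency requirements before training.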

Pronunciation Dictionary

For each word seen in the training set, a pronunciation has to be defined in terms of the phoneme set.
- Define an appropriate phoneme set: the atomic sounds of the language
- Describe each word to be recognized in terms of this phoneme set
- Example in English: I -> AI, you -> J U
- Strong grapheme-to-phoneme relation in Egyptian/Arabic IF the vocalization is transcribed (romanized transcription)
- Grapheme-to-phoneme tool for Standard Arabic (collected in Tunisia and Palestine) already developed at CMU (master's student Jamal Abu-Alwan)
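As a toy illustration of a strong grapheme-to-phoneme relation, the sketch below converts romanized words into phoneme strings with ordered rewrite rules. The rules, words, and phoneme names are invented for illustration and are far simpler than a real vocalized Arabic rule set:

```python
# Longest-match rewrite rules: digraphs are listed before single letters.
# All rules and phoneme names here are illustrative, not a real rule set.
RULES = [
    ("sh", "SH"),
    ("th", "TH"),
    ("a", "AE"),
    ("i", "IH"),
    ("u", "UH"),
    ("s", "S"),
    ("h", "H"),
    ("m", "M"),
]

def g2p(word):
    """Convert a romanized word to a phoneme sequence via the rules."""
    phones, i = [], 0
    while i < len(word):
        for graph, phone in RULES:
            if word[i:i + len(graph)] == graph:
                phones.append(phone)
                i += len(graph)
                break
        else:
            raise ValueError(f"no rule for {word[i]!r}")
    return phones

# Build a dictionary entry for every word seen in the training transcripts.
vocab = ["shams", "hims"]
lexicon = {w: g2p(w) for w in vocab}
print(lexicon["shams"])
```

The `else` clause on the loop flags unmapped graphemes, which in practice catches transcription errors and missing vocalization.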

Phoneme Set (e.g. Standard Arabic)

Text Data

For training the language model we need a huge corpus of text data from the same domain.
- The language model helps guide the search
- Compute probabilities of words, word pairs, and word triples
- Millions of words are needed to estimate these probabilities
- The text corpus should be as close as possible to the given domain
- The writing systems must be the same
- Other text might be useful as background information
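The word-pair probabilities mentioned above can be estimated by counting. A minimal maximum-likelihood bigram sketch follows; the toy corpus is invented for illustration, and a real model needs millions of words plus smoothing for unseen pairs:

```python
from collections import Counter

# Toy corpus; a real language model is trained on millions of words.
corpus = "i am here you are here we are happy i am happy".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("i", "am"))     # "i" is always followed by "am" in this corpus
print(p_bigram("are", "here"))
```

The same counting scheme extends to triples (trigrams), at the cost of many more parameters, which is why the corpus must be both large and close to the target domain.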

Computer Requirements

- Data collection
  - IBM-compatible PC
  - High-quality soundcard, e.g. Soundblaster
  - Close-speaking microphone, e.g. Sennheiser H-420
  - Operating system: Win95 or later
  - Large hard disk: 16 kHz x 2 bytes per sample is about 30 kBytes/sec, i.e. about 2 MBytes/min and 120 MBytes/hr, so about 1.2 GBytes for 10 hours of spoken speech
- Speech recognition
  - Processor as fast as possible
  - RAM: 512 MB or more
  - An additional 2-4 GBytes for temporary files during training and testing
- Translation
  - Donna, Lori?
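The hard-disk estimate for uncompressed 16 kHz / 16-bit mono audio can be checked in a few lines:

```python
# Storage estimate for uncompressed 16 kHz / 16-bit mono recordings,
# matching the figures on the slide.
SAMPLE_RATE = 16_000      # samples per second
BYTES_PER_SAMPLE = 2      # 16-bit resolution

bytes_per_sec = SAMPLE_RATE * BYTES_PER_SAMPLE      # 32,000 B/s (~30 kB/s)
mb_per_min = bytes_per_sec * 60 / 1e6               # ~2 MB per minute
mb_per_hour = mb_per_min * 60                       # ~115 MB per hour
gb_for_10_hours = mb_per_hour * 10 / 1000           # ~1.2 GB for 10 hours

print(f"{bytes_per_sec} B/s, {mb_per_hour:.0f} MB/h, {gb_for_10_hours:.2f} GB")
```

The exact figure is 1.15 GB for 10 hours, consistent with the slide's rounded 1.2 GBytes.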

Discussion

- Speech recognizer for Egyptian Arabic or Standard Arabic?
- Egyptian Arabic
  - Spoken, actually used language: more interesting for a human-to-human speech-to-speech translation system?
  - Standardized pronunciation?
  - Large text resources available in Egyptian Arabic?
  - Parser output follows Standard Arabic vocalization?
  - Use Egyptian CallHome data and pronunciation dictionaries (LDC)?
- Standard Arabic
  - Useful to a larger community?
  - Canonical pronunciation?
  - Preliminary speech recognizer and data already available at CMU
  - Larger text resources available?
- Do we want monolingual dialogs (agent and client) or multilingual recordings?

Part 2: Initialization of an Egyptian Arabic Speech Recognition Engine

- Multilingual speech recognition
- Rapid adaptation to new languages

Initialization of the Egyptian Arabic SR Engine

- Rapid initialization of an Egyptian/Arabic speech recognizer?
- Pronunciation dictionary: grapheme-to-phoneme tool available if vocalization/romanization is provided by the transliteration
- Language model: text corpora, if vocalized
- Apply the Egyptian parser for vocalization?
- Acoustic models: initialization or adaptation according to our fast adaptation approach, PDTS

GlobalPhone Multilingual Database

- Widespread languages
- Native speakers
- Uniformity
- Broad domain
- Huge text resources: Internet newspapers

Total sum of resources:
- 15 languages so far
- About 300 hours of speech data
- About 1400 native speakers

Languages: Arabic, Chinese (Mandarin), Chinese (Shanghai), English, French, German, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Turkish

Speech Recognition in Multiple Languages

Goal: speech recognition in many different languages.
Problem: only few or no training data available (costs, time).

Each language needs a sound system, pronunciation rules, text data, and speech data (about 10 hours) to build the acoustic model (AM), lexicon (Lex), and language model (LM); for Portuguese, e.g., lexicon entries ela /e/l/a/, eu /e/u/, sou /s/u/ and LM text such as "eu sou", "você é", "ela é".


Multilingual Acoustic Modeling

Step 1: combine acoustic models and share data across languages.

Multilingual Acoustic Modeling

Sound production is human, not language specific:
- International Phonetic Alphabet (IPA)
- Multilingual acoustic modeling:
  1. Universal sound inventory based on IPA: 485 sounds are reduced to 162 IPA sound classes
  2. Each sound class is represented by one "phoneme" which is trained through data sharing across languages
     - m, n, s, l occur in all languages
     - p, b, t, d, k, g, f and i, u, e, a, o occur in almost all languages
     - no sharing of triphthongs and palatal consonants
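The data-sharing idea can be sketched as a grouping operation: language-specific phonemes that map to the same IPA class pool their training data into one shared model. All language names, phoneme labels, and mappings below are illustrative:

```python
from collections import defaultdict

# Illustrative phoneme-to-IPA mapping; a real system maps all 485 sounds.
TO_IPA = {
    ("english", "M"): "m", ("german", "M"): "m", ("spanish", "M"): "m",
    ("english", "IY"): "i", ("spanish", "I"): "i",
    ("german", "PF"): "pf",   # language-specific sound, no sharing partner
}

def pool(training_frames):
    """Group per-language phoneme data into universal IPA sound classes."""
    pooled = defaultdict(list)
    for (lang, phone), frames in training_frames.items():
        pooled[TO_IPA[(lang, phone)]].extend(frames)
    return pooled

frames = {
    ("english", "M"): ["en1", "en2"],
    ("german", "M"): ["de1"],
    ("spanish", "M"): ["es1"],
    ("german", "PF"): ["de2"],
}
pooled = pool(frames)
print(sorted(pooled["m"]))   # data from all three languages trains one /m/
```

Sounds without a sharing partner (like the illustrative /pf/) simply keep their single-language data, which mirrors the slide's exclusion of triphthongs and palatal consonants from sharing.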

Rapid Language Adaptation

Step 2: use the multilingual acoustic models and borrow their data; adapt the multilingual acoustic models to the target language to bootstrap its AM, lexicon, and LM.

Rapid Language Adaptation

Model mapping to the target language:
1. Map the multilingual phonemes to Portuguese ones based on the IPA scheme
2. Copy the corresponding acoustic models in order to initialize the Portuguese models

Problem: contexts are language specific; how can context-dependent models be applied to a new target language?
Solution: adaptation of the multilingual contexts to the target language based on limited training data.
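The two mapping steps can be sketched as follows; the model and phoneme names are invented placeholders, and a real system copies HMM parameters rather than strings and then adapts them on the limited target-language data:

```python
# Illustrative multilingual models, indexed by IPA class, and an
# illustrative Portuguese phoneme -> IPA mapping.
ML_MODELS = {"m": "ml_model_m", "a": "ml_model_a", "u": "ml_model_u"}
PT_TO_IPA = {"M": "m", "A": "a", "U": "u"}

def initialize(pt_phonemes):
    """Step 1 + 2: map each Portuguese phoneme to its IPA class and copy
    the corresponding multilingual model as its initial acoustic model."""
    return {p: ML_MODELS[PT_TO_IPA[p]] for p in pt_phonemes}

pt_models = initialize(["M", "A", "U"])
print(pt_models["M"])  # ml_model_m
```

After this initialization, the context-dependent (polyphone) decision trees are specialized to the target language on the limited training data, which is the PDTS step evaluated in the next slide.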

Language Adaptation Experiments

[Chart comparing the ML-Tree and Po-Tree systems with PDTS; not reproduced in the transcript]

Summary

- Multilingual database suitable for MLVCSR
- Covers the most widespread languages
- Language-dependent recognition in 10 languages
- Language-independent acoustic modeling
- Global phoneme set that covers 10 languages
- Data sharing through multilingual models
- Language-adaptive speech recognition
- Limited amount of language-specific data

Create speech engines in new target languages using only limited data; save time and money.

Selected Publications

- Tanja Schultz and Alex Waibel: Language Independent and Language Adaptive Acoustic Modeling. Speech Communication, to appear 2001.
- Alex Waibel, Petra Geutner, Laura Mayfield-Tomokiyo, Tanja Schultz, and Monika Woszczyna: Multilinguality in Speech and Spoken Language Systems. Proceedings of the IEEE, Special Issue on Spoken Language Processing, 88(8), August 2000.
- Tanja Schultz and Alex Waibel: Polyphone Decision Tree Specialization for Language Adaptation. Proceedings of ICASSP-2000, Istanbul, Turkey, June 2000.
- Download from …