Speech Technology for Language Learning


Speech Technology for Language Learning
Yoon Kim, NeoSpeech, Inc.
IEEE SCV Signal Processing Society Meeting, February 5, 2004

About NeoSpeech
- Launched in August 2002; based in Fremont, CA
- Backed by Voiceware, Korea's leading speech technology provider
- Products & services
  - Core technology: TTS, ASR, speaker verification, and voice animation
  - Applications: computer-assisted language learning; automatic outbound notification
- Over 36 multimedia/PC, telephony, and reseller customers
- One of the fastest-growing speech technology providers

Outline
- Introduction
  - What is CALL?
  - Why is speech technology useful for CALL?
- Speech Technologies for CALL
  - Automatic Speech Recognition (ASR)
  - Text-to-Speech (TTS) Synthesis
- Demos
- Challenges and the Future

Introduction

What is CALL?
- CALL: Computer-Aided Language Learning
- General term for using computing resources for the acquisition, training, and evaluation of language skills in four areas: reading, writing, listening, and speaking

Why is CALL useful?
- Convenient, anytime access to language education
- Self-paced tool that complements human language instruction
- Can alleviate the fear of practicing a new language in human-to-human interactions
- Human-computer interaction can engage younger users who are already familiar with computers

Why is Speech Technology Important in CALL?
- Speech is perhaps the most effective means of communication between humans
- Listening and speaking involve processing speech from both acoustic/phonetic and linguistic perspectives
- Computers are multi-modal in nature
  - Speech technology lets CALL systems combine these modalities (speech, visual/haptic)
  - The result is a more complete interaction for students, increasing learning efficacy and user satisfaction

Speech Technologies Used in CALL Systems
- Speech input: speech recognition
  - ASR for grammar-based verbal interaction
  - Pronunciation scoring
  - Detection of, and feedback on, mispronunciation
- Speech output: text-to-speech
  - Listening to, and verification of, dynamic content

Automatic Speech Recognition for CALL

Automatic Speech Recognition (ASR) and Understanding
- Speech recognition: the process of decoding the raw speech waveform and extracting linguistic information for human-machine communication
- Speech understanding: the process of comprehending communicative intent in addition to the linguistic decoding of the raw acoustic speech signal

Speech Recognition Process
- Objective: given the sequence of acoustic feature vectors X extracted from the input speech (e.g., "Call George Bush at home"), find the most likely word string W that could have been uttered
- Acoustic front end: the input speech signal is converted to a sequence of feature vectors X, based on a cepstral (time-quefrency) analysis
- Acoustic models P(X|W): represent sub-word units, such as phonemes, as finite-state machines in which states model spectral structure and transitions model temporal structure
- Language model P(W): predicts the next set of words and controls which models are hypothesized
- Search: combines acoustic and language model scores to produce the recognized utterance

ASR-CALL Applications

ASR for Verbal Interaction
- Uses grammar-constrained continuous speech recognition to handle the expected words and phrases
- Interaction-specific, dynamic grammars
- Applications: interactive lessons with voice input, using ASR as an option
  - Simple multiple-choice questions
  - Fill-in-the-blank questions
  - Word-unscrambling drills
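A minimal sketch of how such a drill might use a grammar, under assumptions of my own (the question, grammar, and function names are hypothetical; a real system would compile the grammar into the recognizer rather than post-filter a transcript):

```python
# Hypothetical grammar-constrained answer checking for a CALL drill.
# The recognizer only needs to distinguish the few in-grammar answers,
# which keeps accuracy high even for non-native speech.

ANSWER_GRAMMAR = {
    "fill_blank_q1": {"went", "goes", "going"},  # "Yesterday I ___ to school."
}

def check_answer(question_id, transcript, correct):
    """Accept the utterance only if it is in the question's grammar,
    then grade it against the correct answer."""
    word = transcript.strip().lower()
    if word not in ANSWER_GRAMMAR[question_id]:
        return "out-of-grammar"  # low-confidence / off-topic: re-prompt the student
    return "correct" if word == correct else "incorrect"
```

An out-of-grammar result typically triggers a re-prompt rather than a grade, since the student may have said something the grammar cannot represent.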

Pronunciation Scoring
- Scoring is performed by analyzing the following cues from non-native and native acoustic models:
  - Statistical match
  - Duration
  - Prosody
  - Rate of speech
- The grammar is singular and well defined (the prompted text is known in advance)
- Scoring can be done at the following levels:
  - Specific phone segments
  - Words/phrases
  - Sentences
  - Overall student proficiency
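One common way to compute the "statistical match" cue (not necessarily NeoSpeech's method) is a per-phone log-likelihood ratio between the native model and the best competing model, in the spirit of Witt and Young's Goodness of Pronunciation score. All numbers below are made up for illustration.

```python
# Sketch of likelihood-ratio pronunciation scoring with hypothetical numbers.
# For each aligned phone segment, compare the log-likelihood under the
# native acoustic model with the best competing model, normalized by
# segment length in frames.

def phone_score(native_logp, best_logp, n_frames):
    """Per-frame log-likelihood ratio: near 0 = native-like, very negative = poor match."""
    return (native_logp - best_logp) / n_frames

def word_score(segments, threshold=-1.0):
    """Average the phone scores and flag phones scoring below the threshold."""
    scores = [phone_score(s["native"], s["best"], s["frames"]) for s in segments]
    flagged = [s["phone"] for s, sc in zip(segments, scores) if sc < threshold]
    return sum(scores) / len(scores), flagged

segments = [
    {"phone": "AE", "native": -90.0,  "best": -88.0, "frames": 10},  # native-like
    {"phone": "F",  "native": -60.0,  "best": -59.0, "frames": 8},   # native-like
    {"phone": "ER", "native": -120.0, "best": -95.0, "frames": 12},  # poor match
]
avg, flagged = word_score(segments)
```

Duration and rate-of-speech cues would enter as additional penalty terms against native duration statistics; the threshold is tuned on labeled native/non-native data.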

Mispronunciation Feedback
- Detection
  - Similar to keyword spotting
  - Alternative pronunciation networks can be used
  - Detection hot list
- Correction
  - Segment-specific training
  - Confusable-pair training (e.g., /r/ versus /l/ for Korean students)
  - Can provide feedback/tips on potential corrections
- Applications
  - Reading tutor for children
  - Detection and correction of common pronunciation mistakes (depends on the student's source language)
- Example: pronunciation of the word "afternoon"
  - Native:  AE2 F T ER0 N UW1 N
  - Student: AE1 F T EL N UW1 N
  - Phones /AE1/ and /EL/ are detected as mispronunciations → the student is given tips on how to pronounce /AE/ and /ER/ correctly
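The "afternoon" example above can be sketched as a comparison of aligned phone strings. This is a simplification of my own: a real system decodes the student through an alternative pronunciation network rather than string-matching, and the confusable-pair list here is illustrative.

```python
# Toy mispronunciation detection over ARPAbet-style phones (digits mark
# lexical stress). Assumes the native and student phone strings are
# already time-aligned one-to-one.

CONFUSABLE = {("ER", "EL"), ("R", "L"), ("L", "R")}  # common for Korean learners

def find_mispronunciations(native, student):
    """Return (index, expected, observed, error_kind) for each mismatch."""
    errors = []
    for i, (n, s) in enumerate(zip(native, student)):
        if n == s:
            continue
        n0, s0 = n.rstrip("012"), s.rstrip("012")  # strip stress digits
        if n0 == s0:
            kind = "stress"            # same phone, wrong stress
        elif (n0, s0) in CONFUSABLE:
            kind = "confusable-pair"   # known L1-dependent confusion
        else:
            kind = "substitution"
        errors.append((i, n, s, kind))
    return errors

native  = ["AE2", "F", "T", "ER0", "N", "UW1", "N"]
student = ["AE1", "F", "T", "EL",  "N", "UW1", "N"]
errors = find_mispronunciations(native, student)
```

Each error kind can then be mapped to a targeted tip, e.g. an /r/-versus-/l/ articulation drill for the confusable-pair case.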

Text-to-Speech Synthesis for CALL

Definition of TTS Synthesis
- Text-to-Speech (TTS) synthesis: automatic production of an acoustic speech waveform from arbitrary text input
- Better than human recordings in some ways:
  - Cheaper
  - Can be more intelligible
  - More flexible than recording
- Worse than humans in other ways:
  - Ungraceful degradation for longer sentences
  - Mechanical timbre

Speech Synthesis Process
- Input text: "My office was on St. Mary's St. one block from the coffee shop."
- Text processing: "My office was on Saint Mary's Street, one block from the coffee shop."
- Prosody prediction: "*My office | was on Saint *Mary's Street || *one block | from the *coffee shop."
- Phonetic processing: m *ay ao1 f ax s | w ax z ao n s ey n t m *eh r iy z s t r iy t || w *ah n b l aa k | f r ax m dh ax k *ao f iy sh aa p
- Waveform generation → synthesized output

Speaker notes:
- Prosody prediction often includes some syntactic analysis, or an approximation of it
- A paper with Mari and Ivan showed that all these levels play a role (more on metrics)
- F0 contour prediction has played a significant role in improving naturalness
- Symbolic phrase and accent location markers gave significant improvement over the baseline reference, but the additional gain from specific tonal markers was not significant
- Predicted F0 and natural phone durations separately produced the same perceptual effect; when combined, however, these two prosodic features appeared to amplify each other's contribution
- Subjects can be categorized by their sensitivity to pitch range and/or phone durations

TTS-CALL Applications

TTS-Based Learning and Comprehension
- Large-corpus, concatenation-based TTS systems
- Fortified grapheme-to-phoneme rules
- Offers instant multimedia content generation for learning new words, phrases, or sentences
  - Any text content can be "read out" using a TTS voice
  - Interactive, focused topics
  - Easy accessibility
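The grapheme-to-phoneme stage mentioned above is typically a large pronunciation lexicon "fortified" with rule (or statistical) fallback for unknown words. A toy sketch, with a lexicon and letter-to-sound rules that are illustrative only and far cruder than any production front end:

```python
# Toy dictionary-plus-rules grapheme-to-phoneme (G2P) conversion.
# Known words hit the lexicon; out-of-vocabulary words fall back to
# naive one-letter-per-phone rules (real systems use context-dependent
# rules or trained models).

LEXICON = {
    "coffee": ["K", "AO1", "F", "IY0"],
    "shop":   ["SH", "AA1", "P"],
}

LETTER_RULES = {"a": "AE", "e": "EH", "i": "IH", "o": "AA", "u": "AH",
                "b": "B", "d": "D", "f": "F", "g": "G", "h": "HH",
                "k": "K", "l": "L", "m": "M", "n": "N", "p": "P",
                "r": "R", "s": "S", "t": "T"}

def g2p(word):
    """Look the word up in the lexicon; fall back to letter rules if unknown."""
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    return [LETTER_RULES[ch] for ch in w if ch in LETTER_RULES]

phones = g2p("coffee")  # lexicon hit
oov = g2p("frost")      # rule-based fallback
```

The fallback is what lets a TTS voice "read out" arbitrary learner content, including names and new vocabulary absent from the lexicon.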

Demos and Conclusion

Demos
- NeoSpeech/Voiceware (www.neospeech.com)
  - Magic English Plus (TTS)
  - Cong Cong – Talking in English (ASR)
- BravoBrava! (www.bravobrava.com)
  - SpeaK! (ASR/TTS)

Challenges and the Future
- ASR
  - Accuracy and robustness of non-native speech recognition
  - Variety of source- and target-language configurations
  - Micro-level pronunciation feedback
  - Normalizing speaker characteristics (acoustic, linguistic) and channel/environment
  - Robust rejection schemes
- TTS
  - Accuracy and naturalness of TTS systems for advanced listening lessons
- Combination of CALL with spoken language translation

Thank You!
yoon.kim@neospeech.com
www.neospeech.com