Speech Technology for Language Learning


1 Speech Technology for Language Learning
Yoon Kim, NeoSpeech, Inc.
IEEE SCV Signal Processing Society Meeting, February 5, 2004

2 About NeoSpeech
Launched August of 2002
Based in Fremont, CA
Backed by Voiceware, Korea’s leading speech technology provider
Products & Services
  Core technology: TTS, ASR, Speaker Verification and Voice Animation
  Applications: Computer assisted language learning; Automatic Outbound Notification
Over 36 Multimedia/PC, telephony and reseller customers
One of the fastest-growing speech technology providers

3 Outline
Introduction
  What is CALL?
  Why is speech technology useful for CALL?
Speech Technologies for CALL
  Automatic Speech Recognition (ASR)
  Text-to-Speech (TTS) Synthesis
Demos
Challenges and the Future

4 Introduction

5 What is CALL?
CALL: Computer Aided Language Learning
General term for using computing resources for the acquisition, training and evaluation of language skills in the following areas:
  Reading, Writing, Listening, Speaking
Why is CALL useful?
  Convenient, anytime access to language education
  Self-paced tool that aids human language instruction
  Can ease the fear of practicing a new language in human-to-human interactions
  Human-computer interaction can engage younger users who are already familiar with computers

6 Why is Speech Technology Important in CALL?
Speech is perhaps the most effective means of communication between humans
Listening and speaking involve processing of speech from acoustic/phonetic and linguistic perspectives
Computers are multi-modal in nature
Speech technology enables systems to use these different modalities (speech, visual/haptic) for CALL
The result is a more complete interaction for students, increasing learning efficacy and user satisfaction

7 Speech Technologies used in CALL Systems
Speech Input: Speech Recognition
  ASR for grammar-based verbal interaction
  Pronunciation Scoring
  Detection/Feedback of Mispronunciation
Speech Output: Text-to-Speech
  Listening and verification of dynamic content

8 Automatic Speech Recognition for CALL

9 Automatic Speech Recognition (ASR) and Understanding
Automatic Speech Recognition
  Process of decoding the raw speech waveform and extracting linguistic information for human-machine communication
Speech Understanding
  Process of comprehending communicative intent in addition to the linguistic decoding of the raw acoustic speech signal

10 Speech Recognition Process
Objective: Given a sequence of acoustic feature vectors X extracted from the input speech, find the most likely word string W that could have been uttered.
Pipeline: Input Speech (“Call George Bush at home”) → Acoustic Front-end (Cepstral Analysis) → Search, driven by Acoustic Models P(X|W) and Language Model P(W) → Recognized Utterance
Acoustic Front-end: The input speech signal is converted to a sequence of feature vectors X, based on a cepstral, time-quefrency analysis.
Acoustic Models P(X|W): Acoustic models represent sub-word units, such as phonemes, as a finite-state machine in which states model spectral structure and transitions model temporal structure.
Language Model P(W): The language model predicts the next set of words, and controls which models are hypothesized.
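
The objective on this slide is usually written as W* = argmax over W of P(X|W) P(W). A minimal sketch of that decision rule follows, assuming the acoustic and language model scores are already available as log-probabilities for a handful of candidate word strings. The candidate list and every number in it are hypothetical; a real recognizer searches a far larger space and never enumerates it explicitly.

```python
import math

def decode(acoustic_logprob, lm_logprob):
    """Return the word string W that maximizes log P(X|W) + log P(W)."""
    best, best_score = None, -math.inf
    for words, acoustic in acoustic_logprob.items():
        # A word string the language model does not know gets probability zero.
        score = acoustic + lm_logprob.get(words, -math.inf)
        if score > best_score:
            best, best_score = words, score
    return best, best_score

# Hypothetical scores for one utterance (illustrative values only).
acoustic = {
    "call george bush at home": -120.0,
    "call george bush at rome": -119.5,
    "tall george bush at home": -121.0,
}
language = {
    "call george bush at home": math.log(0.6),
    "call george bush at rome": math.log(0.1),
    "tall george bush at home": math.log(0.3),
}

best, score = decode(acoustic, language)
print(best)
```

In this toy setup the second candidate has the best acoustic score, but the language model prior outweighs that small advantage, which is the role the slide assigns to P(W).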

11 ASR-CALL Applications

12 ASR for Verbal Interaction
Use continuous grammar to handle words and phrases
Interaction specific, dynamic grammar
Applications:
  Interactive lessons with voice input using ASR as an option
  Simple multiple choice questions
  Fill in the blank questions
  Word unscrambling drills
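
A fill-in-the-blank drill with an interaction-specific grammar could be sketched as below. This is only an illustration: `build_grammar` and `match_in_grammar` are hypothetical helpers, the N-best list stands in for recognizer output, and a real grammar-based recognizer would constrain the search itself rather than filter hypotheses afterwards.

```python
def build_grammar(template, choices):
    """Expand a fill-in-the-blank template into the set of phrases the lesson accepts."""
    return {template.replace("___", choice).lower() for choice in choices}

def match_in_grammar(nbest, grammar):
    """Return the first recognizer hypothesis that the grammar accepts, else None."""
    for hypothesis in nbest:
        normalized = " ".join(hypothesis.lower().split())
        if normalized in grammar:
            return normalized
    return None

# The grammar is rebuilt for each question, which is what makes it dynamic.
grammar = build_grammar("I ___ to school every day", ["go", "goes", "going"])

# Hypothetical N-best output, best hypothesis first.
nbest = ["I goal to school every day", "I go to school every day"]

answer = match_in_grammar(nbest, grammar)
is_correct = answer == "i go to school every day"
print(answer, is_correct)
```

Keeping the grammar this small is the point of the design: with only a few legal phrases per question, out-of-grammar hypotheses can be discarded cheaply.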

13 Pronunciation Scoring
Scoring performed by analyzing the following cues from non-native and native acoustic models
  Statistical match
  Duration
  Prosody
  Rate of Speech
Grammar is singular and well defined
Scoring can be done at the following levels
  Specific phone segments
  Words/Phrases
  Sentences
  Overall student proficiency
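
The sketch below illustrates one way the "statistical match" cue could be turned into scores at the phone, word and sentence levels. It assumes each phone segment already carries a log-likelihood under a native and a non-native acoustic model, and uses their difference as the phone score. The `PhoneSegment` type, the averaging scheme and all numbers are assumptions for illustration, not the method used in any particular product; duration, prosody and rate of speech are left out.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class PhoneSegment:
    phone: str
    word: str
    native_loglik: float     # log-likelihood under the native model (hypothetical value)
    nonnative_loglik: float  # log-likelihood under the non-native model (hypothetical value)

def phone_score(segment):
    """Log-likelihood ratio: positive means the segment sounds more native."""
    return segment.native_loglik - segment.nonnative_loglik

def word_scores(segments):
    """Average the phone scores within each word."""
    by_word = {}
    for segment in segments:
        by_word.setdefault(segment.word, []).append(phone_score(segment))
    return {word: mean(scores) for word, scores in by_word.items()}

def sentence_score(segments):
    """Average the phone scores over the whole utterance."""
    return mean(phone_score(segment) for segment in segments)

segments = [
    PhoneSegment("AE", "after", -4.0, -5.0),
    PhoneSegment("F", "after", -3.0, -3.5),
    PhoneSegment("N", "noon", -6.0, -4.0),
    PhoneSegment("UW", "noon", -5.0, -4.0),
]

print(word_scores(segments))
print(sentence_score(segments))
```

Because the expected utterance is known in advance, the segmentation into phones and words can come from a forced alignment, which is why the grammar being singular matters here.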

14 Mispronunciation Feedback
Detection
  Similar to keyword spotting
  Alternative pronunciation networks can be used
  Detection hot list
Correction
  Segment specific training
  Confusable pair training (e.g. /r/ versus /l/ for Korean students)
  Can provide feedback/tips on potential correction
Applications
  Reading tutor for children
  Detection and correction of common pronunciation mistakes (depends on the source language of student)
Example: pronunciation of the word “Afternoon”
  Native : AE2 F T ER0 N UW1 N
  Student : AE1 F T ELb N UW1 N
  Phones /AE1/, /ELb/ are detected as mispronunciations → Student is given tips on how to pronounce /AE/ and /ER/ correctly.
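
As a rough illustration of the detection step, the “Afternoon” example can be reproduced by aligning the two phone strings and reporting the positions where they differ. This sketch uses Python's standard `difflib.SequenceMatcher` as a stand-in for the alignment; an actual system would decide between alternative pronunciations in an acoustic network, not compare symbol strings. The native transcription is written here with the stress digit 0 on ER.

```python
from difflib import SequenceMatcher

def detect_mispronunciations(native, student):
    """Align two phone sequences and return (expected, produced) pairs that differ."""
    matcher = SequenceMatcher(a=native, b=student, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors.append((native[i1:i2], student[j1:j2]))
    return errors

native = "AE2 F T ER0 N UW1 N".split()
student = "AE1 F T ELb N UW1 N".split()

errors = detect_mispronunciations(native, student)
for expected, produced in errors:
    print("expected", expected, "but heard", produced)
```

The flagged pairs are what a feedback module would map to tips, for example a hot list entry keyed on the expected phone.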

15 Text-to-Speech Synthesis for CALL

16 Definition of TTS Synthesis
Text-To-Speech (TTS) Synthesis
  Automatic production of acoustic speech waveform from arbitrary text input
Better than humans in some ways
  Cheaper
  Can be more intelligible
  More flexible than recording
Worse than humans in other ways
  Ungraceful degradation for longer sentences
  Mechanical timbre

17 Speech Synthesis Process
Input Text: My office was on St. Mary’s St. one block from the coffee shop.
Text Processing: My office was on Saint Mary’s Street, one block from the coffee shop.
Prosody Prediction: *My office |was on Saint *Mary’s Street || *one block | from the *coffee shop.
Phonetic Processing: m *ay ao1 f ax s|w ax z ao n s ey n t m *eh r iy z s t r iy t || w *ah n b l aa k | f r ax m dh ax k ao* f iy sh aa p
Waveform Generation: Synthesized Output
Notes:
Prosody often includes some syntactic analysis or approximation of it
Paper with Mari and Ivan showed that all these levels play a role (more on metrics)
· F0 contour prediction has played a significant role in improving the naturalness.
· Symbolic phrase and accent location markers gave significant improvement over the baseline reference, but the additional gain from specific tonal markers was not significant.
· Predicted F0 and natural phone durations separately produced the same perceptual effect. However, when combined, these two prosodic features appeared to amplify each other's contribution.
· Subjects can be categorized by their sensitivity to pitch range and/or phone durations.
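
The text processing stage has to decide that the first “St.” in the example is “Saint” and the second is “Street”. A minimal sketch of one possible heuristic is shown below, assuming that “St.” directly before a capitalized word is a saint's name and that any other “St.” is a street. The `expand_st` function and the rule itself are illustrative assumptions; production systems use much richer context, and this rule would misread a “St.” that ends a sentence and is followed by a capitalized word. It also does not insert the comma shown on the slide.

```python
import re

def expand_st(text):
    """Expand the abbreviation "St." using a simple capitalization heuristic."""
    # "St." directly before a capitalized word is read as "Saint".
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Any remaining "St." is read as "Street".
    text = re.sub(r"\bSt\.", "Street", text)
    return text

sentence = "My office was on St. Mary's St. one block from the coffee shop."
normalized = expand_st(sentence)
print(normalized)
```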

18 TTS-CALL Applications

19 TTS-Based Learning and Comprehension
Large-corpus, concatenative TTS systems
Fortified grapheme-to-phoneme rules
Offers instant multimedia content generation for learning new words, phrases or sentences
Any text content can be “read out” using a TTS voice
Interactive, focused topics
Easy accessibility

20 Demos and Conclusion

21 Demos
NeoSpeech/Voiceware (www.neospeech.com)
  Magic English Plus (TTS)
  Cong Cong – Talking in English (ASR)
BravoBrava!
  SpeaK! (ASR/TTS)

22 Challenges and the Future
ASR:
  Accuracy and robustness of non-native speech recognition
  Variety of source and target language configurations
  Micro-level pronunciation feedback
  Normalizing speaker characteristics (acoustic, linguistic) and channel/environment
  Robust rejection schemes
TTS:
  Accuracy and naturalness of TTS systems for advanced listening lessons
Combination of CALL with Spoken Language Translation

23 Thank You! yoon.kim@neospeech.com www.neospeech.com

