Published by Crystal Harper. Modified over 9 years ago.
2
Clinical Applications of Speech Technology
Phil Green
Speech and Hearing Research Group
Dept of Computer Science
University of Sheffield
pdg@dcs.shef.ac.uk
3
CAST December 2007
Talk Overview
  SPandH - Speech and Hearing @ Sheffield
  The CAST group
  Building Automatic Speech Recognisers – conventional methodology
  ASR for clients with speech disorders
  Kinematic Maps
  Voice-driven Environmental Control
  VIVOCA
  Customising Voices
  Future Directions
4
SPandH
  Phonetics & Linguistics
  Hearing & Acoustics
  Electrical Engineering & Signal Processing
  Speech & Language Therapy
  Auditory Scene Analysis
  Missing Data Theory
  Glimpsing
  CAST
5
Prof Mark Hawley, School of Health and Related Research: Assistive Technology
Prof Pam Enderby, Institute of General Practice and Primary Care, University of Sheffield: Speech Therapy
Prof Phil Green and Prof Roger K Moore, Speech and Hearing Research Group, Department of Computer Science, University of Sheffield: Speech Technology
Dr Stuart Cunningham, Department of Human Communication Sciences, University of Sheffield: Speech Perception, Speech Technology
Contact: pdg@dcs.shef.ac.uk
6
Conventional Automatic Speech Recogniser Construction
Standard technique uses generative statistical models:
  Each speech unit is modelled by an HMM with a number of states.
  Each state is characterised by a Gaussian mixture distribution over the components of the acoustic vector x.
  Parameters of the distributions are estimated in training (EM – Baum-Welch).
All this is the acoustic model. There will also be a language model.
Decoding finds the model and state sequence most likely to generate the observed sequence of acoustic vectors X.
Training is based on a large pre-recorded speaker-independent speech corpus.
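The decoding step on this slide can be illustrated with a small Viterbi search over one HMM whose states emit from diagonal-covariance Gaussian mixtures. This is a minimal sketch, not the recogniser used in the talk: the two-state model, its parameters and the one-dimensional "acoustic vectors" are all invented for illustration, and no language model is included.

```python
import math

NEG_INF = float("-inf")

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for finite values."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_emission(x, mixture):
    """Log likelihood of x under a mixture given as [(weight, mean, var), ...]."""
    return log_sum_exp(
        [math.log(w) + log_gaussian(x, mean, var) for w, mean, var in mixture]
    )

def viterbi(frames, log_init, log_trans, mixtures):
    """Return the most likely state sequence and its log probability."""
    n = len(log_init)
    score = [log_init[s] + log_emission(frames[0], mixtures[s]) for s in range(n)]
    back = []
    for x in frames[1:]:
        prev, score, pointers = score, [], []
        for s in range(n):
            # Best predecessor state for arriving in state s at this frame.
            best = max(range(n), key=lambda p: prev[p] + log_trans[p][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emission(x, mixtures[s]))
        back.append(pointers)
    state = max(range(n), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1], best_score

# Hypothetical two-state left-to-right model over 1-D features.
mixtures = [
    [(1.0, [0.0], [1.0])],                        # state 0: single Gaussian
    [(0.5, [4.0], [1.0]), (0.5, [6.0], [1.0])],   # state 1: two-component mixture
]
log_init = [0.0, NEG_INF]                          # must start in state 0
log_trans = [[math.log(0.5), math.log(0.5)],       # state 0: stay or advance
             [NEG_INF, 0.0]]                       # state 1: stay only
frames = [[0.1], [-0.2], [4.2], [5.9]]

path, log_prob = viterbi(frames, log_init, log_trans, mixtures)
print(path)  # [0, 0, 1, 1]
```

In a full recogniser the same search runs over a network of word or phone models, with the language model contributing to the transition scores between them.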
7
Dysarthria
Loss of control of speech articulators
Stroke victims, cerebral palsy, MS..
Affects 170 per 100,000 population
Severe cases unintelligible to strangers (example words: channel, lamp, radio)
Often accompanied by physical disability
8
STARDUST: ASR for Dysarthric Speakers
NHS NEAT Funding
Environmental control
Small vocabulary, isolated words
Speaker-dependent
Sparse training data
Variable training data
9
STARDUST Methodology
Initial recordings -> Train Recogniser -> Confusability Analysis -> Client Practice For Consistency -> New Recordings
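The slide does not say how the confusability analysis was carried out. One simple way to approach it, sketched here under that assumption, is to count which intended words the recogniser mistakes for which others and to compute per-word accuracy, so that practice can target the least consistent words. The trial data below are invented; only the words themselves come from the talk.

```python
from collections import Counter

def confusion_counts(trials):
    """Count (intended, recognised) pairs where the recogniser was wrong."""
    return Counter((spoken, heard) for spoken, heard in trials if spoken != heard)

def per_word_accuracy(trials):
    """Fraction of attempts at each intended word that were recognised correctly."""
    total, correct = Counter(), Counter()
    for spoken, heard in trials:
        total[spoken] += 1
        if spoken == heard:
            correct[spoken] += 1
    return {word: correct[word] / total[word] for word in total}

# Invented (intended word, recognised word) outcomes for illustration only.
trials = [
    ("lamp", "lamp"),
    ("lamp", "radio"),
    ("radio", "radio"),
    ("channel", "channel"),
    ("lamp", "radio"),
    ("radio", "lamp"),
]

worst_pair = confusion_counts(trials).most_common(1)
accuracy = per_word_accuracy(trials)
print(worst_pair)  # [(('lamp', 'radio'), 2)]
```

Words with low accuracy, or pairs that are frequently confused, would be candidates for further practice or for replacement in the vocabulary.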
10
STARDUST training results

Client  Sentence             Word                 Vocabulary  Pre-training  Post-training
        Intelligibility (%)  Intelligibility (%)  Size        (%)           (%)
CC      6                    10                   11          95.79         100.00
PH      34                   22                   10          96.22         100.00
GR      0                    0                    10          82.00         86.00
JT      10                   22                   13          96.92         99.74
KD      -                    -                    13          80.00         90.77
MR      -                    -                    11          77.27         95.45
FL      -                    -                    11          92.73         96.36

ECS trial: halved the average time to execute a command
11
STARDUST Consistency Training
12
STARDUST Clinical Trial
13
OPTACIA: Kinematic Maps
Pronunciation Training Aid
EC Funding
Speech acoustics mapped to x,y position in map window in real time
Mapping by trained Neural Net
Customise for exercises and clients
Diagram: Speech -> Signal Processing -> ANN Mapping -> map window (regions labelled sh, s, i)
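The mapping from an acoustic feature vector to a point in the map window can be pictured as a small feed-forward network with two outputs. The sketch below is illustrative only: the slide does not give the network architecture, so the single tanh hidden layer, the sigmoid outputs scaled to a unit square, and all weights are assumptions.

```python
import math

def map_to_window(features, w_hidden, b_hidden, w_out, b_out):
    """Map one acoustic feature vector to (x, y) in a unit-square map window."""
    hidden = [
        math.tanh(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    # Sigmoid keeps each coordinate strictly between 0 and 1.
    return tuple(
        1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
        for row, b in zip(w_out, b_out)
    )

# Hypothetical weights: 3 input features, 2 hidden units, 2 outputs (x, y).
w_hidden = [[0.8, -0.3, 0.1], [-0.5, 0.9, 0.4]]
b_hidden = [0.0, 0.0]
w_out = [[1.2, -0.7], [0.3, 1.5]]
b_out = [0.0, 0.0]

centre = map_to_window([0.0, 0.0, 0.0], w_hidden, b_hidden, w_out, b_out)
point = map_to_window([1.0, 0.5, -0.2], w_hidden, b_hidden, w_out, b_out)
print(centre)  # (0.5, 0.5)
```

Calling the function once per analysis frame gives a moving cursor in the window; in a trained system the weights would be learned so that target sounds land in their labelled regions.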
14
Example: Vowel Map
15
SPECS: Speech-Driven Environmental Control Systems
NHS HTD Funding
Industrial exploitation
STARDUST on ‘balloon board’
16
VIVOCA - Voice Input Voice Output Communication Aid
NHS NEAT funding
Assists communication with strangers:
  Client: ‘buy tea’ [unintelligible]
  VIVOCA: ‘A cup of tea with milk and no sugar please’ [intelligible synthesised speech]
Runs on a PDA
Diagram: Dysarthric speech -> ASR -> Text Generation -> Speech Synthesis -> Intelligible speech
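The three stages in the diagram can be sketched as a chain of functions. The slide does not describe how text generation works, so the phrase lookup table here is an assumption, and the recogniser and synthesiser are stand-in stubs rather than real components; only the ‘buy tea’ example comes from the talk.

```python
# Hypothetical phrase table: recognised keyword sequence -> full message.
PHRASES = {
    ("buy", "tea"): "A cup of tea with milk and no sugar please",
}

def generate_text(recognised_words, phrase_table):
    """Expand a recognised keyword sequence into a full message, or None if unknown."""
    key = tuple(word.lower() for word in recognised_words)
    return phrase_table.get(key)

def vivoca_pipeline(audio, recognise, synthesise, phrase_table):
    """Run ASR, then text generation, then speech synthesis."""
    words = recognise(audio)
    message = generate_text(words, phrase_table)
    if message is None:
        return None  # nothing sensible to say; a real aid might ask for a repeat
    return synthesise(message)

# Stand-in components so the sketch runs without any speech software.
def stub_recognise(audio):
    return ["buy", "tea"]

def stub_synthesise(text):
    return {"waveform": None, "text": text}

output = vivoca_pipeline(b"dysarthric-audio", stub_recognise, stub_synthesise, PHRASES)
print(output["text"])  # A cup of tea with milk and no sugar please
```

Keeping the stages as separate callables reflects the diagram: the speaker-dependent recogniser and the personalised voice can each be replaced without touching the text generation step.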
17
Voices for VIVOCA
It is possible to build voices from training data
A local voice is preferable
Yorkshire voices: Ian MacMillan, Christa Ackroyd
18
Concatenative synthesis
Input data: Speech recordings -> Unit segmentation -> Unit database
Text input -> Unit selection -> Concatenation + smoothing -> Synthesised speech
Example units in diagram: i, a, sh
Festvox: http://festvox.org/
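The unit selection step is commonly framed as a search for the sequence of database units that minimises a target cost (how well each unit matches what is wanted) plus a join cost (how smoothly adjacent units concatenate). The sketch below shows that general idea with dynamic programming; it is not Festvox's implementation, and the unit records, pitch values and both cost functions are invented for illustration.

```python
def select_units(targets, database, target_cost, join_cost):
    """Pick one database unit per target, minimising total target + join cost."""
    candidates = [[u for u in database if u["phone"] == t["phone"]] for t in targets]
    if any(not c for c in candidates):
        raise ValueError("no candidate unit for some target phone")
    cost = [target_cost(targets[0], u) for u in candidates[0]]
    back = []
    for i in range(1, len(targets)):
        new_cost, pointers = [], []
        for u in candidates[i]:
            # Cheapest way to reach unit u from any candidate for the previous target.
            best = min(
                range(len(candidates[i - 1])),
                key=lambda k: cost[k] + join_cost(candidates[i - 1][k], u),
            )
            pointers.append(best)
            new_cost.append(
                cost[best] + join_cost(candidates[i - 1][best], u) + target_cost(targets[i], u)
            )
        cost = new_cost
        back.append(pointers)
    k = min(range(len(cost)), key=lambda j: cost[j])
    total = cost[k]
    chosen = [k]
    for pointers in reversed(back):
        k = pointers[k]
        chosen.append(k)
    chosen.reverse()
    return [candidates[i][k] for i, k in enumerate(chosen)], total

def pitch_target_cost(target, unit):
    """Hypothetical target cost: distance from the desired pitch."""
    return abs(target["pitch"] - unit["pitch"])

def pitch_join_cost(left, right):
    """Hypothetical join cost: free if the units were adjacent in the recording."""
    if left["id"] + 1 == right["id"]:
        return 0
    return abs(left["pitch"] - right["pitch"])

# Invented unit database; "id" is the position of the unit in the recordings.
database = [
    {"id": 0, "phone": "sh", "pitch": 100},
    {"id": 1, "phone": "i", "pitch": 140},
    {"id": 5, "phone": "sh", "pitch": 120},
    {"id": 8, "phone": "i", "pitch": 118},
]
targets = [{"phone": "sh", "pitch": 120}, {"phone": "i", "pitch": 120}]

units, total = select_units(targets, database, pitch_target_cost, pitch_join_cost)
print([u["id"] for u in units], total)  # [5, 8] 4
```

The search considers every candidate at each position but keeps only the cheapest route to each one, so its cost grows with the number of targets rather than with the number of possible unit sequences.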
19
Concatenative synthesis
High quality
Natural sounding
Sounds like original speaker
Need a lot of data (~600 sentences)
Can be inconsistent
Difficult to manipulate prosody
20
HMM synthesis yes yes
21
HMM synthesis: adaptation
Training: Input data (speech recordings) -> Average speaker model
Adaptation: Speech recordings -> Adapted speaker model
Synthesis: Text input + Adapted speaker model -> Synthesised speech
Other diagram labels: e, t; 100, 200
HTS http://hts.sp.nitech.ac.jp/
22
HMM synthesis
Consistent
Intelligible
Easier to manipulate prosody
Needs relatively little adaptation data (>5 sentences)
Less natural than concatenative
23
Personalisation for individuals with progressive speech disorders
Voice banking: before deterioration
Capturing the essence of a voice: during deterioration
24
HMM synthesis: adaptation for dysarthric speech
Training: Input data (speech recordings) -> Average speaker model
Adaptation: Speech recordings -> Adapted speaker model
Synthesis: Text input + Adapted speaker model -> Synthesised speech
Other diagram labels: e, t; Duration, phonation and energy information
HTS http://hts.sp.nitech.ac.jp/
25
Future directions
Personal Adaptive Listeners (PALS)
‘Home Service’
Companions
26
The PALS Concept
A PAL is a portable (PDA, wearable..) device which you own
Your PAL is like your valet
It knows a lot about you..
  The way you speak, the words you like to use
  Your interests, contacts, networks
You talk with it
  The knowledge makes conversational dialogues viable
It does things for you
  Bookings, appointments, reminders
  Communication
  Access to services..
It learns to do a better job
  By explicit training (this is how I refer to things, these are the names I use..): USER-AS-TEACHER
  By Automatic Adaptation: acoustic models, language models, dialogue models