Automatic Speech Recognition with Sparse Training Data for Dysarthric Speakers
P. Green¹, J. Carmichael¹, A. Hatzis¹, P. Enderby³, M. Hawley² & M. Parker²
¹ Department of Computer Science, University of Sheffield
² Department of Medical Physics & Clinical Engineering, Barnsley District General Hospital NHS Trust
³ Institute of General Practice, University of Sheffield

Abstract

We describe an unusual ASR application: recognition of command words from severely dysarthric speakers, who have poor control of their articulators. The goal is to allow these clients to control assistive technology by voice. While this is a small-vocabulary, speaker-dependent, isolated-word application, the speech material is more variable than normal and only a small amount of data is available for training. After training a CDHMM recogniser, it is necessary to predict its likely performance without using an independent test set, so that confusable words can be replaced by alternatives. We present a battery of measures of consistency and confusability, based on forced alignment, which can be used to predict recogniser performance. We show how these measures perform, and how they are presented to the clinicians who are the users of the system.

Motivation

- Dysarthrias (a family of neurologically based speech disorders characterised by loss of control of the articulators) are often connected to a more generalised motor impairment, e.g. from stroke or multiple sclerosis, making normal interaction with the environment difficult.
- This physical incapacity makes voice control of Electronic Assistive Technology (EAT) an attractive option, BUT...
- Severely dysarthric speech is so abnormal that off-the-shelf ASR products fail.
- The STARDUST project therefore aims to use custom-built ASR for control of EAT by severe dysarthrics.

Recogniser Design

Continuous Density HMMs (CDHMMs) with:
- whole-word rather than phone-level modelling, with training data labelled at the word level (typically 20 utterances per word)
- 11 HMM states per word model
- 3 Gaussian mixture components per state
- straight-through (strict left-to-right) model topology
- 12 MFCCs
- 16 kHz sampling rate with a 10 ms frame window

A minimal sketch of this design in code is given below.
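The following sketch illustrates the configuration above using modern open-source tools, assuming librosa for MFCC extraction and hmmlearn for the continuous-density HMMs. The original STARDUST recogniser was not built with these libraries, and helper names such as `train_word_model` are invented for the example.

```python
# Illustrative sketch (not the original STARDUST code): one whole-word,
# straight-through CDHMM per vocabulary item, trained on ~20 utterances.
import numpy as np
import librosa
from hmmlearn.hmm import GMMHMM

SR = 16000      # 16 kHz sampling rate
HOP = 160       # 10 ms frame shift at 16 kHz
N_MFCC = 12     # 12 MFCCs per frame

def mfcc_frames(wav_path):
    """Return an (n_frames, 12) MFCC matrix for one utterance."""
    y, _ = librosa.load(wav_path, sr=SR)
    return librosa.feature.mfcc(y=y, sr=SR, n_mfcc=N_MFCC, hop_length=HOP).T

def straight_through_hmm(n_states=11, n_mix=3):
    """11 states, 3 Gaussian mixture components per state, and a
    straight-through (strict left-to-right) transition topology."""
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=20,
                   init_params="mcw",   # keep our start/transition structure
                   params="tmcw")
    model.startprob_ = np.eye(n_states)[0]      # must start in state 0
    trans = 0.5 * (np.eye(n_states) + np.eye(n_states, k=1))
    trans[-1, -1] = 1.0                         # final state is absorbing
    model.transmat_ = trans                     # zero entries survive EM
    return model

def train_word_model(utterance_paths):
    """Train one whole-word model from the repetitions of a single word."""
    feats = [mfcc_frames(p) for p in utterance_paths]
    model = straight_through_hmm()
    model.fit(np.vstack(feats), lengths=[len(f) for f in feats])
    return model

# One model per vocabulary word, e.g.:
# models = {w: train_word_model(paths[w]) for w in ["TV", "Alarm", "Lamp", ...]}
```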
The STARDUST Project's Achievements

We have:
- used computer-based training to improve the speech consistency of most of the clients enrolled in the pilot project, making the speech recognition task easier (see Hatzis et al., this conference);
- built small-vocabulary isolated-word recognisers for severely disordered speech; the accuracy of these speaker-dependent recognisers is encouraging (10-word vocabulary):

    Speaker                     Recognition Accuracy (%)
    MP (Normal)                 100
    AH (Normal)                 100
    GR (Severely Dysarthric)     87
    JT (Severely Dysarthric)    100
    CC (Severely Dysarthric)     96

- successfully used these recognisers to control Assistive Technology.

Confusability: Forecasting Recogniser Performance from the Training Set

Dealing with sparse training data: severe dysarthrics cannot be asked to produce large quantities of training data. Data scarcity implies that all available speech (except extreme outliers) should be used for training, so there are no separate training and test sets. We therefore need to predict which words the recogniser is likely to confuse with one another, in order to modify the vocabulary if necessary. Phonetically based confusability measures are not applicable to dysarthric speech.

The following measures use only:
- a training set for a vocabulary of N words W_1, ..., W_N, where w_jk is the k-th repetition of the j-th word;
- a set of CDHMMs M_i trained on this data.

The measures are based on forced alignment: L_ijk is the per-frame log likelihood of model M_i generating example w_jk along the Viterbi path.

Word-level consistency: the consistency of a word is obtained by averaging the L_iik for the correct word model,

    κ_i = (Σ_k L_iik) / n_i

where n_i is the total number of examples of W_i. The overall consistency of the training corpus is the average of the κ_i:

    κ = (Σ_i κ_i) / N

Justification: forced-alignment likelihoods will be lower for an inconsistently spoken word than for a consistent one, since its distributions will be flatter. This is confirmed by experiments with mixed training sets.

The confusability between any two words W_i and W_j is

    C_ij = (Σ_k L_ijk) / n_j

C_ij is the average score obtained by aligning the examples of W_j against M_i. The higher this score, the greater the likelihood that W_j will be misrecognised as W_i. (An illustrative sketch of these computations is given at the end of this document.)

Visualising Confusability

Inter- and intra-word model confusability can be visualised as a matrix. For greater visual impact, we use colour-coding to depict the range of values. Ideally, areas of high confusability should occur only along the diagonal of the matrix (each word 'confusing' with itself). For dysarthric speech, areas of high confusability are often found off the diagonal, in unexpected locations.

[Table 2: Colour-coded confusability matrix for normal speaker MP (10-word vocabulary: TV, Alarm, Lamp, Chan., On, Off, Up, Down, Radio, Vol.); the colour scale runs from low to high confusability.]

[Table 3: Colour-coded confusability matrix for severely dysarthric speaker GR (same 10-word vocabulary).]

The Implications for Vocabulary Selection

A closer look at a section of GR's matrix (Table 3) shows that 'Alarm' and 'Lamp' are highly confusable with each other (but not with the other words in the vocabulary), so one of them should be removed and replaced with an alternative item, perhaps 'Light' instead of 'Lamp'. It is not always easy to guess which items will be confused with others: for the normal speaker MP, the word 'Volume' shows low confusability in contrast to the other words, but this is not so for the dysarthric speaker GR. In practice, it was necessary to replace 'Volume' in the vocabulary of GR's recogniser with 'Power'.

Does it Predict Actual Performance?

[Table 4: GR's test-set confusions superimposed on his confusability matrix. The observed confusions involve 'Alarm', 'Lamp' and 'Volume'.]

Future Work

- The relationship between speech intelligibility and consistency.
- The use of this tool for speech disorder diagnostics: subjectively assessed intelligibility tests are psychometrically weak and inconsistent, whereas confusability metrics are objective and repeatable.
- Incorporating this tool into speech training software (see Hatzis et al., this conference).

Acknowledgements

This research was sponsored by the UK Department of Health New and Emerging Applications of Technology (NEAT) programme and received a proportion of its funding from the NHS Executive. The views expressed in this publication are those of the authors and not necessarily those of the Department of Health or the NHS Executive.
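Appendix: Computing the Measures (Illustrative Sketch)

To make the consistency and confusability measures concrete, here is a minimal sketch of how κ_i, κ and C_ij could be computed and colour-coded. It is illustrative only: it reuses the hypothetical `models` dictionary and `mfcc_frames` helper from the training sketch above, assumes an `examples` dict mapping each word to the MFCC matrices of its training repetitions, and relies on hmmlearn's Viterbi decoder and matplotlib rather than the tools used in the original system.

```python
# Illustrative sketch (not the original STARDUST code): consistency and
# confusability measures from forced-alignment likelihoods.
import numpy as np
import matplotlib.pyplot as plt

def per_frame_loglik(model, feats):
    """L_ijk: per-frame log likelihood of a model generating one example
    along the Viterbi (forced-alignment) path."""
    logprob, _ = model.decode(feats)      # Viterbi alignment score
    return logprob / len(feats)

def confusability_matrix(models, examples, vocab):
    """C[i, j] = C_ij: mean per-frame score of model M_i over all
    training examples of word W_j."""
    N = len(vocab)
    C = np.zeros((N, N))
    for i, wi in enumerate(vocab):
        for j, wj in enumerate(vocab):
            C[i, j] = np.mean([per_frame_loglik(models[wi], f)
                               for f in examples[wj]])
    return C

def consistency(C):
    """kappa_i is the diagonal entry C_ii; overall kappa is their mean."""
    kappa_i = np.diag(C)
    return kappa_i, kappa_i.mean()

def plot_confusability(C, vocab):
    """Colour-coded matrix: ideally only the diagonal is 'hot'; hot cells
    off the diagonal flag word pairs likely to be confused."""
    plt.imshow(C, cmap="hot")
    plt.xticks(range(len(vocab)), vocab, rotation=45)
    plt.yticks(range(len(vocab)), vocab)
    plt.colorbar(label="mean per-frame log likelihood")
    plt.tight_layout()
    plt.show()

# e.g. vocab = ["TV", "Alarm", "Lamp", "Chan.", "On", "Off",
#               "Up", "Down", "Radio", "Vol."]
# examples = {w: [mfcc_frames(p) for p in paths[w]] for w in vocab}
# plot_confusability(confusability_matrix(models, examples, vocab), vocab)
```

High off-diagonal C_ij values, read against the diagonal, are what drive the vocabulary changes described above, such as replacing 'Lamp' with 'Light' or 'Volume' with 'Power'.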