Philip Jackson and Martin Russell, Electronic, Electrical and Computer Engineering: Models of speech dynamics in a segmental-HMM recognizer using intermediate linear representations


Speech dynamics into ASR (INTRODUCTION)

Conventional model (INTRODUCTION): an HMM with an acoustic PDF generates the acoustic observations.

Linear-trajectory model (INTRODUCTION): a segmental HMM generates linear trajectories in an intermediate "articulatory" layer, which an acoustic mapping W projects into an acoustic PDF over the acoustic observations.

Multi-level Segmental HMM (INTRODUCTION)
– segmental finite-state process
– intermediate "articulatory" layer: linear trajectories
– mapping required: linear transformation, or radial basis function network
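The second mapping option above can be sketched in code. This is a minimal Gaussian radial-basis-function network with a linear output layer; the dimensions, centres, and weights here are purely illustrative, not the trained values from the paper:

```python
import numpy as np

def rbf_map(x, centres, widths, weights, bias):
    """Map intermediate-layer vectors x (N x p) to acoustic space (N x q)
    via Gaussian basis functions followed by a linear output layer."""
    # Squared distances from each input to each basis-function centre
    d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-d2 / (2.0 * widths ** 2))  # basis-function activations
    return phi @ weights + bias              # linear output layer

# Illustrative sizes: 5 basis functions over a 3-D intermediate space,
# mapped to a 13-D acoustic vector (e.g. MFCC13)
rng = np.random.default_rng(0)
centres = rng.normal(size=(5, 3))
widths = np.ones(5)
weights = rng.normal(size=(5, 13))
bias = np.zeros(13)
y = rbf_map(rng.normal(size=(4, 3)), centres, widths, weights, bias)
```

In practice the centres, widths, and output weights would be trained on matched articulatory-acoustic data; the linear-transformation alternative is the special case of a single affine output layer.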

Estimation of linear mapping (THEORY): from matched articulatory and acoustic frame sequences.
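For matched frame sequences, the linear mapping has a closed-form least-squares estimate. A minimal sketch (matrix shapes and the synthetic data are illustrative):

```python
import numpy as np

def estimate_linear_mapping(X, Y):
    """Least-squares estimate of W (including a bias row) such that
    Y ~= [X, 1] @ W, for matched articulatory frames X (T x p) and
    acoustic frames Y (T x q)."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    return W

# Synthetic check: recover a known mapping from noiseless matched data
rng = np.random.default_rng(1)
W_true = rng.normal(size=(4, 2))   # 3 articulatory dims + bias -> 2 acoustic dims
X = rng.normal(size=(50, 3))
Y = np.hstack([X, np.ones((50, 1))]) @ W_true
W_hat = estimate_linear_mapping(X, Y)
```

With noisy data the same call returns the minimum-mean-squared-error linear fit rather than an exact recovery.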

Linear-trajectory equations (THEORY): each segment's trajectory is defined by a midpoint and a slope.
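A linear trajectory over a segment of duration tau frames can be written as the midpoint c plus the slope m times a centred time index; a minimal sketch, assuming the centre-of-segment convention:

```python
import numpy as np

def linear_trajectory(c, m, tau):
    """Frames t = 1..tau of the trajectory c + m * (t - (tau + 1) / 2):
    a straight line through midpoint c with slope m per frame."""
    t = np.arange(1, tau + 1)
    return c + np.outer(t - (tau + 1) / 2.0, m)

# A 5-frame segment in a 2-D feature space
traj = linear_trajectory(c=np.array([0.0, 1.0]), m=np.array([0.5, -0.5]), tau=5)
```

The constant-trajectory model is the special case m = 0, which is why it serves as the baseline (ID_0) against the linear-trajectory model (ID_1) in the results below.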

Training the model parameters (THEORY): optimal least-squares estimates of the midpoint and slope, in the acoustic domain.

Training the model parameters (THEORY): optimal least-squares estimates of the midpoint and slope, in the articulatory domain.

Training the model parameters (THEORY): optimal maximum-likelihood estimates of the midpoint and slope, in the articulatory domain.
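The least-squares estimates of midpoint and slope reduce to ordinary linear regression of each feature dimension on a centred time index; a sketch of that computation (the synthetic segment is illustrative):

```python
import numpy as np

def fit_midpoint_slope(frames):
    """Least-squares midpoint and slope of a (tau x d) segment of frames,
    regressing each dimension on the centred time index."""
    tau = frames.shape[0]
    t = np.arange(1, tau + 1) - (tau + 1) / 2.0  # centred time, sums to zero
    c_hat = frames.mean(axis=0)                  # midpoint = per-dimension mean
    m_hat = (t @ frames) / (t @ t)               # slope = regression coefficient
    return c_hat, m_hat

# Noiseless check: recover the generating midpoint and slope
t = np.arange(1, 6) - 3.0
frames = np.array([2.0, -1.0]) + np.outer(t, np.array([0.3, 0.7]))
c_hat, m_hat = fit_midpoint_slope(frames)
```

Under a Gaussian observation model with fixed per-state variances, the maximum-likelihood estimates coincide with these least-squares ones; the distinction matters once the mapping is estimated jointly in the articulatory domain.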

Tests on MOCHA (METHOD)
Southern British English, at 16 kHz (Wrench, 2000):
– MFCC13 acoustic features, incl. the zeroth coefficient
– articulatory x- and y-coordinates from 7 EMA coils
– PCA9+Lx: first nine articulatory modes plus the laryngograph log energy

MOCHA baseline performance (RESULTS): constant-trajectory SHMM (ID_0) vs. linear-trajectory SHMM (ID_1).

Performance across mappings (RESULTS)

Phone categorisation (METHOD)
Grouping / No. of classes / Description:
– A / 1 / all data
– B / 2 / silence; speech
– C / 6 / linguistic categories: silence/stop; vowel; liquid; nasal; fricative; affricate
– D / 10 / as (Deng and Ma, 2000): silence; vowel; liquid; nasal; UV fric; /s,ch/; V fric; /z,jh/; UV stop; V stop
– E / 10 / discrete articulatory regions
– F / 49 / silence; individual phones

Tests on TIMIT (METHOD)
North American English, at 8 kHz:
– MFCC13 acoustic features, incl. the zeroth coefficient
a) F1-3: formants F1, F2 and F3, estimated by the Holmes formant tracker
b) F1-3+BE5: five band energies added
c) PFS12: synthesiser control parameters

TIMIT baseline performance (RESULTS): constant-trajectory SHMM (ID_0) vs. linear-trajectory SHMM (ID_1).

Performance across feature sets (RESULTS)

Performance across groupings (RESULTS)

Results across groupings (RESULTS)

Model visualisation (DISCUSSION): original acoustic data; constant-trajectory model; linear-trajectory model (c, F).

Conclusions (SUMMARY)
– Developed a framework for modelling speech dynamics in an intermediate space
– Linear trajectories with a piecewise-linear mapping are bounded by the performance of linear trajectories in the acoustic space
– Near-optimal performance was achieved with more than 3 formant parameters, and with 6 or more linear mappings
– Formants and articulatory parameters gave qualitatively similar results
– What next?

Further work (SUMMARY)
– Complete experiments with a language model
– Include segment-duration models
– Derive pseudo-articulatory representations by unsupervised (embedded) training
– Implement a non-linear mapping (i.e., RBF)
Further information: here and now; web.bham.ac.uk/p.jackson/balthasar