A Recognition Model for Speech Coding Wendy Holmes 20/20 Speech Limited, UK A DERA/NXT Joint Venture.



2 Introduction
–Speech coding at low data rates (a few hundred bits/s) requires a compact, low-dimensional representation => code variable-length speech “segments”.
–Automatic speech recognition is potentially a powerful way to identify useful segments for coding.
–BUT HMM-based coding has limitations:
–shortcomings of HMMs as production models
–typical recognition feature sets (e.g. cepstral coefficients) impose limits on coded speech quality
–difficulty retaining speaker characteristics (at least for speaker-independent recognition).

3 A “unified” model for speech coding

4 A simple coding scheme
–Demonstrates the principle of coding using the same model for both recognition and synthesis.
–The model represents linear formant trajectories.
–Recognition: linear-trajectory segmental HMMs of formant features.
–Synthesis: JSRU parallel-formant synthesizer.
–Coding is applied to analysed formant trajectories => relatively high bit rate (typically bits/s).
–Recognition is used mainly to identify segment boundaries, but also to guide the coding of the trajectories.
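The key difference from a conventional HMM is that a linear-trajectory segmental HMM scores a whole variable-length run of frames against a straight-line mean, rather than a constant mean per state. A minimal sketch of that scoring step, assuming a single Gaussian feature with the line parametrised by its slope and midpoint (the function name and parametrisation are illustrative, not taken from the original system):

```python
import math

def segment_loglik(frames, slope, midpoint, var):
    """Score a variable-length run of frames against a linear-trajectory
    segment model: the mean follows a straight line (midpoint plus slope
    about the segment centre) instead of staying constant within a state."""
    n = len(frames)
    centre = (n - 1) / 2.0
    ll = 0.0
    for k, x in enumerate(frames):
        mean = midpoint + slope * (k - centre)
        ll += -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return ll
```

A rising formant track therefore matches a rising-slope segment model better than a flat one of the same midpoint, which is what lets the recognizer place segment boundaries at trajectory changes.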

5 Segment coding scheme overview

6 Formant analyser (EUROSPEECH’97)
–Each formant frequency estimate is assigned a value representing confidence in its measurement accuracy. When formants are indistinct, confidence is low.
–In cases of ambiguity, the analyser offers two alternative sets of formant trajectories for resolution in the recognition process.
[Figure: formant tracks for “four seven”]

7 Linear formant trajectory recognition
–Feature set: formant frequencies plus mel-cepstrum coefficients and an overall energy feature.
–Confidences are represented as variances: low confidence => large variance. The confidence variance is added to the model variance, so low-confidence features have little influence.
–Formant alternatives: choose the one giving the highest probability for each possible data segment and model state.
–Number of segments depends on phone identity: e.g. 1 segment for fricatives; 3 for voiceless stops.
–Range of durations: segment-dependent minimum and maximum segment duration.
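The confidence-weighting step above can be sketched for a single Gaussian feature. Adding the confidence variance to the model variance flattens the likelihood for unreliable frames, so they barely affect the recognition score (a minimal sketch; the function name and values are illustrative):

```python
import math

def gaussian_loglik(x, mean, model_var, conf_var=0.0):
    """Log-likelihood of one feature value under a Gaussian state model.
    The confidence variance is added to the model variance, so a
    low-confidence formant estimate (large conf_var) yields a flatter
    likelihood and has little influence on the recognition result."""
    var = model_var + conf_var
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# A confident frame discriminates strongly between a good and a bad match;
# an unreliable frame (huge conf_var) scores both almost identically.
sharp = gaussian_loglik(600.0, 600.0, 100.0) - gaussian_loglik(500.0, 600.0, 100.0)
flat = gaussian_loglik(600.0, 600.0, 100.0, 1e6) - gaussian_loglik(500.0, 600.0, 100.0, 1e6)
```

Here `sharp` is large while `flat` is near zero: the same 100 Hz mismatch that strongly penalises a confident frame is effectively ignored for an unreliable one.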

8 Frame-by-frame synthesizer controls
Values for each of 10 synthesizer control parameters are obtained at 10 ms intervals:
–Voicing and fundamental frequency from the excitation analysis program.
–3 formant frequency controls from the formant analyser.
–5 formant amplitude controls from an FFT-based method.
With 6 bits assigned to each of the 10 controls, the baseline data rate is 6000 bits/s.
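The baseline rate follows directly from the figures on this slide: 10 controls at 6 bits each, every 10 ms (i.e. 100 frames per second):

```python
# Baseline bit-rate arithmetic for the frame-by-frame scheme.
controls = 10          # synthesizer control parameters per frame
bits_per_control = 6   # quantisation of each control
frame_rate = 100       # frames per second (10 ms intervals)

bits_per_frame = controls * bits_per_control   # 60 bits per frame
baseline_rate = bits_per_frame * frame_rate    # 6000 bits/s
```

This is the uncoded reference rate; the segment coding on the next slide reduces it by roughly an order of magnitude.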

9 Segment coding
–Segments identified by the recognizer are coded using straight-line fits to the observed formant parameters.
–The fit uses a least-mean-square-error criterion. For formant frequencies, each frame’s error is weighted by its confidence variance, so the more reliable frames have more influence.
–To code a segment, represent the value at its start and the difference of the end value from the start value.
–Continuity is forced across segment boundaries where smooth changes are required for naturalness (e.g. semivowel-vowel boundaries).
–When there are formant alternatives, use those selected by the recognizer.
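The weighted straight-line fit and the (start, end-minus-start) coding of a segment can be sketched as follows. This is a minimal illustration assuming inverse-confidence-variance weights; the function name and interface are hypothetical, not from the original coder:

```python
import numpy as np

def fit_segment(t, y, conf_var):
    """Weighted least-squares straight-line fit to formant values y
    observed at frame times t. Each frame is weighted by the inverse of
    its confidence variance, so reliable frames dominate the fit.
    Returns (start_value, end_minus_start), the two coded quantities."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(conf_var, dtype=float)
    A = np.stack([np.ones_like(t), t], axis=1)   # model: y ~ a + b*t
    W = np.diag(w)
    a, b = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    start = a + b * t[0]
    end = a + b * t[-1]
    return start, end - start
```

With this weighting, a frame whose formant estimate is unreliable (very large confidence variance) can deviate wildly without pulling the fitted trajectory away from the reliable frames.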

10 Coding experiments
–Tested on two tasks: speaker-independent connected-digit recognition and speaker-dependent recognition of airborne reconnaissance reports (500-word vocabulary).
–Frame-by-frame analysis-synthesis (at 6000 bits/s) generally produced a close copy of the original speech.
–Segment-coded versions preserved the main characteristics, although there were some instances of formant analysis errors.
–In some cases, using the recognizer to select between alternative formant trajectories improved segment coding quality.
–In general, coding still works well even when there are recognition errors, as the main requirement is to identify suitable segments for linear trajectory coding.

11 Speech coding results
Coded at about 600 bits/s:
–Speaker 1: digits
–Speaker 2: digits
–Speaker 3: digits
–Speaker 1: ARM report
Natural:
–Speaker 1: digits
–Speaker 2: digits
–Speaker 3: digits
–Speaker 1: ARM report
Achievements of the study:
–Established the principle of using a formant trajectory model for both recognition and synthesis, including using information from recognition to assist in coding.
Future work:
–Better quality coding should be possible by further integrating formant analysis, recognition and synthesis within a common framework.