Presentation transcript:

Overview ► Recall ► What are sound features? ► Feature detection and extraction ► Features in Sphinx III

Recall: ► The speech signal is a ‘slowly’ time-varying signal ► There are a number of linguistically distinct speech sounds (phonemes) in a language ► The sound can be represented in a 3D spectrogram: speech intensity in different frequency bands over time ► Most SR systems rely heavily on vowel recognition to achieve high performance (vowels are long in duration and spectrally well defined, and therefore easily recognized)
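A minimal sketch of computing such a spectrogram with SciPy follows; the input file name and the 25 ms / 10 ms window settings are illustrative assumptions, not part of the slides.

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("speech.wav")      # hypothetical input file
freqs, times, Sxx = spectrogram(
    samples.astype(float),
    fs=rate,
    window="hamming",
    nperseg=int(0.025 * rate),                  # 25 ms analysis window
    noverlap=int(0.015 * rate),                 # 10 ms frame shift
)
log_Sxx = 10 * np.log10(Sxx + 1e-10)            # intensity in dB
print(log_Sxx.shape)                            # (frequency bands, time frames)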

Speech sounds and features Examples: ► Vowels (a, u, …) ► Diphthongs (f.i. aY as in ‘guy’, …) ► Semivowels (w, l, r, y) ► Nasal Consonants (m, n) ► Unvoiced Fricatives (f, s) ► Voiced Fricatives (v, th, z) ► Voiced Stops (b, d, g) and Unvoiced Stops (p, t, k) ► They all have their own characteristics (features)

ASR Stages
1) Speech analysis: provide an appropriate spectral representation of the characteristics of the time-varying speech signal.
2) Feature detection: convert the spectral measurements to a set of features that describe the broad acoustic properties of the different phonetic units (f.i. nasality, frication, formant locations, voiced/unvoiced classification, ratios of high- and low-frequency energy, etc.).
3) Segmentation and labeling: find stable regions, then label each segmented region according to how well the features within that region match those of individual phonetic units.
4) Final output: the recognizer outputs the word or word sequence that best matches the feature sequence.
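To make the four stages concrete, here is a toy, runnable skeleton of the pipeline; the function names, frame settings, and stand-in features (energy ratio, a trivial voiced/unvoiced rule) are invented for illustration and are not Sphinx code.

import numpy as np

def speech_analysis(samples, rate, frame_ms=25, shift_ms=10):
    # Stage 1: spectral representation, one magnitude spectrum per frame.
    n, step = int(rate * frame_ms / 1000), int(rate * shift_ms / 1000)
    frames = [samples[i:i + n] for i in range(0, len(samples) - n, step)]
    return np.array([np.abs(np.fft.rfft(f * np.hamming(n))) for f in frames])

def feature_detection(spectra):
    # Stage 2: broad acoustic properties per frame; here just the
    # low/high-frequency energy ratio and the total energy as stand-ins.
    half = spectra.shape[1] // 2
    low, high = spectra[:, :half].sum(1), spectra[:, half:].sum(1) + 1e-10
    return np.column_stack([low / high, low + high])

def segment_and_label(features):
    # Stage 3: label each frame by a trivial rule (placeholder for
    # matching the features against phonetic-unit templates).
    return ["voiced-like" if ratio > 1.0 else "unvoiced-like"
            for ratio, _ in features]

rate = 16000
samples = np.random.randn(rate)          # one second of noise as dummy input
labels = segment_and_label(feature_detection(speech_analysis(samples, rate)))
print(labels[:10])                       # Stage 4 would map labels to words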

Feature detection (and extraction) ► A speech segment contains certain characteristics: features. ► Different segments of speech contain different features, specific to the kind of segment! ► The goal is to classify a speech segment into one of several broad speech classes (f.i. via a binary tree: compact/diffuse, acute/grave, long/short, high/low frequency, etc.) ► Ideally, the feature vectors for a given word are the same regardless of the way in which the word has been uttered
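As a toy illustration of the binary-tree idea, the sketch below assigns a segment to a broad class with two hand-made splits; the features and thresholds are invented for illustration only.

def broad_class(energy, zero_crossing_rate, duration_ms):
    # Each split mimics one decision in the tree (high/low frequency,
    # long/short, ...); real systems derive these tests from data.
    if zero_crossing_rate > 0.3:          # high-frequency content
        return "voiced fricative" if energy > 0.5 else "unvoiced fricative"
    return "vowel-like" if duration_ms > 80 else "stop-like"

print(broad_class(energy=0.7, zero_crossing_rate=0.1, duration_ms=120))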

Last week: Mel-Frequency Cepstrum Coefficients ► The Fourier Transform extracts the frequency components of a signal in the time domain ► The frequency domain is filtered/sliced into 12 smaller parts, and for each part its own coefficient (MFCC) can be calculated ► MFCCs use the log-spectrum of the speech signal. The logarithmic nature of the technique is significant, since the human auditory system perceives sound on a logarithmic scale above certain frequencies
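For reference, a minimal sketch of extracting the coefficients per frame with the librosa library (assuming librosa is installed; the file name and the 25 ms / 10 ms frame settings are hypothetical):

import librosa

y, sr = librosa.load("speech.wav", sr=16000)       # hypothetical input file
# 13 coefficients per frame: C0 (energy-like) plus C1-C12, computed from
# the log mel spectrum followed by a discrete cosine transform.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)   # 25 ms / 10 ms
print(mfcc.shape)                                  # (13, number of frames)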

Acoustic Modeling: Feature Extraction
Pipeline: Input Speech → Fourier Transform → Cepstral Analysis → Perceptual Weighting → Energy + Mel-Spaced Cepstrum; Time Derivatives then yield Delta Energy + Delta Cepstrum and Delta-Delta Energy + Delta-Delta Cepstrum.
► MFCCs are beautiful, because they incorporate knowledge of the nature of speech sounds in the measurement of the features, and utilize rudimentary models of human perception.
► Fourier Transform: time domain → frequency domain.
► The frequency domain is sliced into 12 smaller parts, each with its own MFCC.
► Include absolute energy and the 12 spectral measurements.
► Time derivatives model spectral change.

What ‘to do’ with the MFCCs: ► A speech recognizer can be built using the energy values (time domain) and 12 MFCCs (frequency domain), plus the first- and second-order derivatives of those coefficients.
Basic MFCC Front End:
13  Absolute Energy (1) and MFCCs (12)
13  Delta: first-order derivatives of the 13 absolute coefficients
13  Delta-Delta: second-order derivatives of the 13 absolute coefficients
39  Total
► The derivatives are useful because they provide information about spectral change
► This total of 39 coefficients provides information about the different features in that segment!
► The feature measurements of the segments are stored in so-called ‘feature vectors’, which are used in the next stage of speech recognition (f.i. a Hidden Markov Model)
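A minimal sketch of assembling the 39-dimensional vectors from the 13 absolute coefficients; a simple first difference stands in for the delta computation (real front ends usually use a regression over a window of frames):

import numpy as np

def deltas(x):
    # First-order difference along the time axis as a simple delta stand-in.
    return np.diff(x, axis=1, prepend=x[:, :1])

mfcc = np.random.randn(13, 100)        # stand-in: 13 coefficients x 100 frames
d = deltas(mfcc)                       # 13 delta coefficients
dd = deltas(d)                         # 13 delta-delta coefficients
features = np.vstack([mfcc, d, dd])    # 39 coefficients per frame
print(features.shape)                  # (39, 100)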

In Sphinx III: computation of feature vectors ► feat_s2mfc2feat ► feat_s2mfc2feat_block
1. The MFC file is read.
2. Initialization: define the kind of input-to-feature conversion desired (there are some differences between Sphinx II and Sphinx III).
3. Feature vectors are computed for the entire segment specified (feat_s2mfc2feat and feat_s2mfc2feat_block).
In Sphinx, the streams of features are stored within the feature vectors as follows: ► CEP: C1-C12 ► DCEP: D1-D12 ► Energy values: C0, D0, DD0 ► D2CEP: DD1-DD12
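The stream layout above can be mirrored with a small helper that slices one 39-dimensional frame into the four streams; the function itself is a hypothetical illustration, not part of the Sphinx API, and it assumes the frame is ordered [C0..C12, D0..D12, DD0..DD12].

import numpy as np

def to_sphinx_streams(frame39):
    c, d, dd = frame39[:13], frame39[13:26], frame39[26:]
    cep    = c[1:]                            # CEP:   C1-C12
    dcep   = d[1:]                            # DCEP:  D1-D12
    energy = np.array([c[0], d[0], dd[0]])    # C0, D0, DD0
    d2cep  = dd[1:]                           # D2CEP: DD1-DD12
    return cep, dcep, energy, d2cep

streams = to_sphinx_streams(np.arange(39.0))
print([s.shape for s in streams])             # [(12,), (12,), (3,), (12,)]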

► So, at this point in the speech recognition process, you have stored feature vectors for the entire speech segment you are looking at, providing the necessary information about what kind of features are in that segment. ► Now the feature stream can be analyzed using a Hidden Markov Model (HMM).
[Diagram: input speech passes through feature extraction modules (frication, burst, voicing, round, nasal, glide); the per-module features are concatenated into a feature vector, which is used to train word models such as “one”, “two”, “oh”, …]
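To make the HMM step concrete, here is a minimal Viterbi decoding sketch over a toy discrete HMM; the two states, the probabilities, and the quantized observation sequence are invented stand-ins for phonetic states scored against the feature vectors.

import numpy as np

# Toy 2-state HMM with invented probabilities.
log_init  = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3],
                    [0.4, 0.6]])
log_emit  = np.log([[0.9, 0.1],               # P(observation | state)
                    [0.2, 0.8]])
obs = [0, 0, 1, 1, 1]                         # quantized feature stream (toy)

# Viterbi: most likely state path given the observations.
v = log_init + log_emit[:, obs[0]]
back = []
for o in obs[1:]:
    scores = v[:, None] + log_trans           # scores[i, j]: from state i to j
    back.append(scores.argmax(axis=0))        # best predecessor per state
    v = scores.max(axis=0) + log_emit[:, o]
path = [int(v.argmax())]
for b in reversed(back):
    path.append(int(b[path[-1]]))
print(path[::-1])                             # most likely state sequence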