
So far:
– Historical introduction
– Mathematical background (e.g., pattern classification, acoustics)
– Feature extraction for speech recognition (and some neural processing)
– What sound units are typically defined
– Audio signal processing topics (pitch extraction, perceptual audio coding, source separation, music analysis)
Now: back to pattern recognition, but including time

Deterministic Sequence Recognition

Sequence recognition for ASR
– ASR = static pattern classification + sequence recognition
– Deterministic sequence recognition: template matching
– Templates are typically word-based; phonetic sound units are not needed per se
– Still need to combine local distances into something global (per word or utterance)

Front end analysis
Basic approach is the same for deterministic and statistical systems:
– 25 ms windows (e.g., Hamming), 10 ms steps (one frame per step)
– Some kind of cepstral analysis (e.g., MFCC or PLP)
– The cepstral vector at time n is called x_n
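
As a rough illustration, a minimal numpy sketch of the framing step above (window and hop sizes as on the slide; everything else, including the function name, is my own illustrative choice, not the course's reference code). The cepstral analysis itself (MFCC or PLP) would then be applied to each row.

import numpy as np

def frames(signal, sr, win_ms=25.0, hop_ms=10.0):
    # Slice a waveform into overlapping Hamming-windowed frames:
    # one frame every hop_ms, each frame win_ms long.
    win = int(sr * win_ms / 1000)   # e.g., 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)   # e.g., 160 samples at 16 kHz
    n = 1 + max(0, (len(signal) - win) // hop)
    w = np.hamming(win)
    return np.stack([w * signal[i * hop : i * hop + win] for i in range(n)])

# Each row would then go through cepstral analysis (e.g., MFCC or PLP)
# to yield the feature vector x_n for frame n.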

Speech sound categories
– Words and phones are the most common units
– For template-based ASR, mostly words
– For template-based ASR, local distances are based on examples (reference frames) versus input frames

From Frames to Sequence
– Easy if local matches are all correct (never happens!)
– Local matches are unreliable
– Need a measure of goodness of fit
– Need to integrate local matches into a global measure
– Need to consider all possible sequences

Templates: Isolated Word Example
– Matrix for comparison between frames
– Word template = multiple feature vectors
– Reference template R = r_1, ..., r_M; input template X = x_1, ..., x_N
– Need to find D(X, R)

Template Matching Problems
– Time normalization
– Which references to use
– Defining distances/costs
– Endpoints for input templates

Time Normalization
– Linear time normalization
– Nonlinear time normalization: Dynamic Time Warp (DTW)
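
A minimal sketch of linear time normalization (my own illustration, not from the slides): resample an input of N frames onto a reference length M by linear interpolation in each feature dimension.

import numpy as np

def linear_normalize(frames, target_len):
    # Linearly warp an (N, d) frame sequence to target_len frames.
    # Every part of the input is stretched or compressed by the same
    # factor, which is exactly the limitation the next slide discusses.
    n, d = frames.shape
    src = np.linspace(0.0, n - 1, num=target_len)
    out = np.empty((target_len, d))
    for k in range(d):
        out[:, k] = np.interp(src, np.arange(n), frames[:, k])
    return out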

Linear Time Normalization: Limitations
– Different speech sounds stretch and compress differently
– Stop consonants versus vowels
– Need to normalize differently

Generalized Time Warping
– Permit many more variations
– Ideally, compare all possible time warpings
– Vintsyuk (1968): use dynamic programming

Dynamic programming
– Bellman optimality principle (1962): the optimal policy is built from optimal policies for subproblems
– Best path through a grid: if the best path goes through a grid point, it includes the best partial path to that grid point
– Classic example: knapsack problem

Knapsack problem
– Stuffing a sack with items of different value
– Goal: maximize the value in the sack
– Key point 1: if the maximum size is 10 and we know the solutions for a maximum size of 9, we can compute the final answer by considering the value of adding each item
– Key point 2: point 1 sounds recursive, but it can be made efficiently nonrecursive by building a table (sketched below)
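
A minimal table-building sketch of the 0/1 knapsack DP (the standard textbook form, used here only to illustrate key point 2):

def knapsack(sizes, values, capacity):
    # Bottom-up 0/1 knapsack: best[c] = max value achievable with capacity c.
    # The table replaces the recursion: the answer for capacity c reuses
    # the already-computed answers for smaller capacities.
    best = [0] * (capacity + 1)
    for size, value in zip(sizes, values):
        # iterate capacities downward so each item is used at most once
        for c in range(capacity, size - 1, -1):
            best[c] = max(best[c], best[c - size] + value)
    return best[capacity]

# e.g., knapsack([3, 4, 5], [4, 5, 7], 10) returns 12 (take the
# size-5 and size-4 items).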

Basic DTW step with simple local constraints: each (i,j) cell has a local distance d and a cumulative distortion D. With the simplest local constraints, the basic computational step is
D(i,j) = d(i,j) + min[ D(i-1,j), D(i-1,j-1), D(i,j-1) ]

Dynamic Time Warp (DTW)
Apply DP to ASR: Vintsyuk, Bridle, Sakoe
– Let D(i,j) = total distortion up to frame i in the input and frame j in the reference
– Let d(i,j) = local distance between frame i in the input and frame j in the reference
– Let p(i,j) = set of possible predecessors of cell (i,j)
Then D(i,j) = d(i,j) + min over p in p(i,j) of D(p)
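
A minimal numpy sketch of this recurrence (the Euclidean local distance and the three simple predecessors are my assumptions; real systems vary both):

import numpy as np

def dtw(x, r):
    # Cumulative distortion between input frames x (N, d) and reference
    # frames r (M, d), with predecessors (i-1,j), (i-1,j-1), (i,j-1).
    # Returns D(N-1, M-1), the total distortion of the best path.
    N, M = len(x), len(r)
    D = np.full((N, M), np.inf)
    for i in range(N):
        for j in range(M):
            d = np.linalg.norm(x[i] - r[j])   # local distance d(i,j)
            if i == 0 and j == 0:
                D[i, j] = d
                continue
            preds = []
            if i > 0:
                preds.append(D[i - 1, j])
            if i > 0 and j > 0:
                preds.append(D[i - 1, j - 1])
            if j > 0:
                preds.append(D[i, j - 1])
            D[i, j] = d + min(preds)
    return D[-1, -1]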

DTW steps
(1) Compute the local distance d in the 1st column (1st frame of input) for each reference template; let D(0,j) = d(0,j) for each cell in each template
(2) For i = 1 (2nd column), starting at j = 0, compute d(i,j) and add it to the minimum of all possible predecessor values of D to get the local value of D; repeat for each frame in each template
(3) Repeat (2) for each column to the end of the input
(4) For each template, find the best D in the last column of the input
(5) Choose the word for the template with the smallest D (a recognizer loop along these lines is sketched below)
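
Putting the steps together, a sketch of an isolated-word recognizer using the dtw() function above (the vocabulary and templates shown in the usage comment are hypothetical placeholders):

def recognize(x, templates):
    # templates: dict mapping word -> (M, d) reference frame array.
    # Runs DTW against every template and returns the word whose
    # template gives the smallest total distortion (steps 4 and 5).
    return min(templates, key=lambda w: dtw(x, templates[w]))

# Hypothetical usage, one reference template per vocabulary word:
# templates = {"yes": yes_frames, "no": no_frames}
# word = recognize(input_frames, templates)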

DTW Complexity
– Time: O(N_ref_frames × N_input_frames × N_templates)
– Storage, though, can be just O(N_ref_frames × N_templates) (store only the current and previous columns)
– Constant reduction: global constraints
– Constant reduction: local constraints

[Figure: typical global slope constraints for dynamic programming]
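
As an illustration of a global constraint, a Sakoe–Chiba-style band (the band radius and its placement along the scaled diagonal are my assumptions) simply skips cells far from the diagonal, cutting the constant factor noted above:

import numpy as np

def dtw_banded(x, r, radius=10):
    # DTW restricted to a band around the (scaled) diagonal;
    # cells outside the band stay at infinity and are never computed.
    N, M = len(x), len(r)
    D = np.full((N, M), np.inf)
    for i in range(N):
        j0 = int(i * M / N)   # diagonal position for this input frame
        for j in range(max(0, j0 - radius), min(M, j0 + radius + 1)):
            d = np.linalg.norm(x[i] - r[j])
            if i == 0 and j == 0:
                D[i, j] = d
                continue
            preds = [D[i - 1, j] if i > 0 else np.inf,
                     D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                     D[i, j - 1] if j > 0 else np.inf]
            D[i, j] = d + min(preds)
    return D[-1, -1]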

Which reference templates?
– All examples? Prototypes?
– DTW-based global distances permit clustering

DTW-based K-means
(1) Initialize (how many centers, where)
(2) Assign examples to the closest center (DTW distance)
(3) For each cluster, find the template that minimizes the maximum distance to the cluster's members, and call it the center
(4) Repeat (2) and (3) until some stopping criterion is reached
(5) Use the center templates as references for ASR
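
A sketch of this procedure using the dtw() function from earlier (really a k-medoids variant, since centers must be actual templates; random initialization and a fixed iteration count are my assumptions):

import random

def dtw_kmeans(examples, k, iters=10):
    # Cluster templates by DTW distance; each center is the member of
    # its cluster whose worst-case (maximum) DTW distance to the other
    # members is smallest (step 3 on the slide).
    centers = random.sample(examples, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for ex in examples:   # step 2: assign to nearest center
            c = min(range(k), key=lambda i: dtw(ex, centers[i]))
            clusters[c].append(ex)
        for i, members in enumerate(clusters):   # step 3: minimax center
            if members:
                centers[i] = min(
                    members,
                    key=lambda m: max(dtw(m, o) for o in members))
    return centers   # step 5: use these as ASR reference templates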

Defining local distance
– Normalizing for scale
– Cepstral weighting
– Perceptual weighting, e.g., based on JNDs (just-noticeable differences)
– Learning distances, e.g., with ANNs or statistics
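
For instance, a minimal weighted local distance; the raised-sine lifter shown is one common cepstral weighting, and treating it as the course's choice is my assumption:

import numpy as np

def weighted_dist(x, r, w=None):
    # Weighted Euclidean local distance between two cepstral vectors.
    if w is None:
        w = np.ones_like(x)
    return np.sqrt(np.sum(w * (x - r) ** 2))

def raised_sine_lifter(d, L=22):
    # A common cepstral weighting: de-emphasizes the lowest and
    # highest cepstral coefficients.
    k = np.arange(d)
    return 1.0 + (L / 2.0) * np.sin(np.pi * k / L)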

Endpoint detection: big problem!
– Sounds easy
– Hard in practice (noise, reverberation, gain issues)
– Simple systems use energy and time thresholds
– More complex ones also use the spectrum
– Can be tuned, but not robust
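
A bare-bones detector of the "energy and time thresholds" kind (the relative threshold and minimum-duration rule are illustrative assumptions, and exactly the sort of tuning the slide warns is not robust):

import numpy as np

def endpoints(frames, rel_thresh=0.1, min_frames=10):
    # Return (start, end) frame indices of the longest run whose log
    # energy stays above a threshold set relative to the dynamic range.
    # frames: (N, win) array of windowed samples.
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    floor, peak = energy.min(), energy.max()
    active = energy > floor + rel_thresh * (peak - floor)
    best, start, run = None, None, 0
    for i, a in enumerate(np.append(active, False)):  # False flushes the last run
        if a:
            if run == 0:
                start = i
            run += 1
        else:
            if run >= min_frames and (best is None or run > best[1] - best[0]):
                best = (start, i)
            run = 0
    return best   # None if nothing exceeded the threshold long enough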

Connected Word ASR by DTW
– Time normalization
– Recognition
– Segmentation
– Can't have templates for all possible utterances
– DP to the rescue

DP for Connected Word ASR by DTW
– Vintsyuk, Bridle, Sakoe
– Sakoe: 2-level algorithm
– Vintsyuk, Bridle: one-stage
– Ney's explanation: Ney, H., "The use of a one-stage dynamic programming algorithm for connected word recognition," IEEE Trans. Acoust. Speech Signal Process. 32(2):263–271, 1984

Connected Algorithm
– In principle: one big distortion matrix (for 20,000 words, 50 frames/word, and a 1000-frame input [10 seconds], that's 10^9 cells!)
– Also requires a backtracking matrix (since the word segmentation is not known)
– Get the best distortion, then backtrack to get the words
– Fundamental principle: find the best segmentation and classification as part of the same process, not as sequential steps

[Figure: DTW path for connected words]

DTW for connected words
– In principle, the backtracking matrix points back to the best previous cell
– Mostly we just need to backtrack to the end of the previous word
– Simplifications are possible

Storage efficiency
– Distortion matrix -> 2 columns
– Backtracking matrix -> 2 rows
– "From template" points to the template with the lowest cost ending here
– "From frame" points to the end frame of the previous word
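
To make the bookkeeping concrete, a heavily simplified one-stage sketch (my own illustration after Ney's formulation, not the lecture's code): within-word predecessors are the three simple ones from before, any word may follow any word at no transition cost, and instead of a full backtracking matrix each cell carries the input frame at which its word hypothesis started, so backtracking hops from word end to word end.

import numpy as np

def one_stage_connected(inp, templates, vocab):
    # inp: (N, d) input frames; templates: list of (M_k, d) arrays, one
    # per word in vocab. Returns (total distortion, decoded word list).
    N, K = len(inp), len(templates)
    INF = float("inf")
    D = [np.full(len(t), INF) for t in templates]          # current column costs
    S = [np.zeros(len(t), dtype=int) for t in templates]   # word-start frames
    hist = [None] * N   # per input frame: (cost, word index, word start)
    for i in range(N):
        entry = 0.0 if i == 0 else hist[i - 1][0]  # best cost before a new word
        newD = [np.full(len(t), INF) for t in templates]
        newS = [np.zeros(len(t), dtype=int) for t in templates]
        for k, T in enumerate(templates):
            for j in range(len(T)):
                d = float(np.linalg.norm(inp[i] - T[j]))
                cands = []                           # (predecessor cost, word start)
                if j == 0:
                    cands.append((entry, i))         # enter this word at frame i
                if i > 0:
                    cands.append((D[k][j], S[k][j]))             # from (i-1, j)
                    if j > 0:
                        cands.append((D[k][j - 1], S[k][j - 1])) # from (i-1, j-1)
                if j > 0:
                    cands.append((newD[k][j - 1], newS[k][j - 1]))  # from (i, j-1)
                best, start = min(cands)
                newD[k][j] = best + d
                newS[k][j] = start
        D, S = newD, newS
        k_best = min(range(K), key=lambda k: D[k][-1])  # best word ending here
        hist[i] = (D[k_best][-1], k_best, S[k_best][-1])
    # word-level backtracking: hop from each word's start to the previous word's end
    words, i = [], N - 1
    while i >= 0:
        _, k, start = hist[i]
        words.append(vocab[k])
        i = start - 1
    return hist[N - 1][0], list(reversed(words))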

More on connected templates
– "Within word" local constraints
– "Between word" local constraints
– Grammars
– Transition costs

Knowledge-based segmentation
– DTW combines segmentation, time normalization, and recognition; all segmentations are considered
– The same feature vectors are used everywhere
– Could segment separately, using acoustic-phonetic features cleverly
– Example: FEATURE, Ron Cole (1983)

Limitations of the DTW approach
– No structure from subword units
– Average or exemplar values only
– Cross-word pronunciation effects not handled
– Limited flexibility for distance/distortion
– Limited mathematical basis -> statistics!

Epilog: "episodic" ASR
– Having examples gets interesting again when there are many of them
– Potentially an augmentation of statistical methods
– Recent experiments show decent results
– Somewhat different properties -> promising in combination

The rest of the course
– Statistical ASR
– Speech synthesis
– Speaker recognition
– Speaker diarization
– Oral presentations on your projects
– Written report on your project

Class project timing
– Week of April 30: no class Monday, double class Wednesday May 2 (is that what people want?)
– 8 oral presentations by individuals, 12 minutes each + 3 minutes for questions
– 2 oral presentations by pairs, 17 minutes each + 3 minutes for questions
– 3:10 PM to 6 PM with a 10-minute mid-session break
– Written report due Wednesday May 9, no late submissions (email attachment is fine)