

DYNAMIC TIME WARPING IN KEY WORD SPOTTING

OUTLINE
- KWS and the role of DTW in it
- Brief outline of DTW
- What is training and why is it needed?
- DTW training algorithm
- Raw data preprocessing
- Results
- Suggestions

SYSTEM BLOCK-DIAGRAM

FUNCTIONS OF SUBSYSTEMS
- Audio interface: samples the sound and provides it to the other subsystems; also indicates detection of the KEY word.
- Front-end: detects intervals of data with voice present and converts them into sequences of feature vectors.
- Back-end: compares the feature vectors from the front-end with the template (or set of templates) provided by the training block and sends the score to the analysis unit.
- Training/Testing/Analysis: creates the templates, analyzes the matching score, and decides which KEY word the input corresponds to.

DUAL ROLE OF DTW
- DTW can be used as a comparison method in the back-end.
- DTW can be used as an averaging tool in training to create the template.
In the first case it is used on-line; in the second, off-line.

PRINCIPLES OF DTW
When comparing two sequences of feature vectors, the global difference must be the sum of the local differences between the frames the sequences consist of. Because the phonetic content distribution of an input is not known in advance, it is possible to compare frames corresponding to different phones: comparing "apples and oranges", so to speak. DTW minimizes the effect of such comparisons by finding the correspondence between frames for which the global distance is minimal.

GLOBAL AND LOCAL DISTANCE
The global distance D is accumulated from the local distances d by the recurrence
D(i+1, j+1) = min( D(i, j+1), D(i, j), D(i+1, j) ) + d(i+1, j+1)
(figure: a 2x2 block of grid cells showing D and d at (i, j), (i, j+1), (i+1, j), and (i+1, j+1))

(figure: the DTW grid, populated from the starting cell D[0, 0] onward)

DTW continued
Iteratively populating the array as described above leads to the global distance D[N, M]. The path itself, however, is not known. If the application requires knowledge of the path, the array can be populated with data structures containing not only the global distance at a cell but also the indexes of the preceding cell on the possible trajectory.
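The recurrence and the backtracking step above can be sketched as follows. This is a minimal illustration, not the original program: the Euclidean local distance and the sequence layout (lists of equal-dimension frame tuples) are assumptions.

```python
import math

def dtw(X, Y):
    """DTW between two sequences of feature vectors X and Y (lists of
    equal-dimension tuples).  Returns the global distance D[N, M] and
    the warping path, recovered by backtracking, as 0-based (i, j)
    frame-index pairs.  The local distance d is Euclidean."""
    N, M = len(X), len(Y)
    INF = float("inf")
    # D[i][j]: global distance of the best path ending at frames (i-1, j-1)
    D = [[INF] * (M + 1) for _ in range(N + 1)]
    D[0][0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            d = math.dist(X[i - 1], Y[j - 1])
            D[i][j] = min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1]) + d
    # Backtrack: each cell's predecessor is the cheapest of the three
    # cells the recurrence could have come from.
    i, j = N, M
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        i, j = min(((i - 1, j), (i - 1, j - 1), (i, j - 1)),
                   key=lambda c: D[c[0]][c[1]])
        path.append((i - 1, j - 1))
    path.reverse()
    return D[N][M], path
```

For identical inputs the path is the diagonal and the global distance is zero; storing only the predecessor indexes, as the slide suggests, avoids keeping full data structures in every cell.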

What is training and why is it needed?
There is a huge number of realizations, or tokens, of the same word we wish to recognize. They differ in length and acoustic content, so the distance between some realizations of a word might be bigger than the distance between a realization of that word and a realization of a different word. That could lead to a wrong decision in recognition.

Training continued
To reduce the probability of the scenario described above, an averaging procedure is designed. Its result is a realization of the word which is, overall, well matched with all the tokens of the training data. This constructed realization of the word is called the template. Some tokens within the data might be matched perfectly, some not so well, but the template matches fairly well with any token of the training data.

Possible length distribution in the data

TRAINING USING DTW
- Find the distribution of lengths of the training tokens and the token(s) of average length.
- Use this average token as one input to the DTW program.
- Use the rest of the training data, iteratively, as the other input to the DTW.
- Every token of data is warped against the average-length input; frames of different tokens corresponding to the same frame of the average input are added together and averaged.
- The template obtained in this manner is used as an input to the DTW program to repeat the cycle.
- Repeat until convergence.

AVERAGING WITH DTW
(figure: accumulator arrays A and counters C, with entries such as Am, Cm and An, Cn, one per frame of the average input)

After one cycle, Xn = An / Cn. Before the next cycle, An and Cn must be reset to 0.
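The averaging cycle can be sketched as below. This is one possible reading of the slides, not the original training program: the alignment details, the accumulators A and counters C, and the fixed iteration count standing in for the convergence test are illustrative assumptions.

```python
import math

def dtw_path(X, Y):
    """Minimal DTW returning only the warping path as 0-based
    (token frame, template frame) pairs; local distance is Euclidean."""
    N, M = len(X), len(Y)
    INF = float("inf")
    D = [[INF] * (M + 1) for _ in range(N + 1)]
    D[0][0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            D[i][j] = math.dist(X[i - 1], Y[j - 1]) + min(
                D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    i, j, path = N, M, [(N - 1, M - 1)]
    while (i, j) != (1, 1):
        i, j = min(((i - 1, j), (i - 1, j - 1), (i, j - 1)),
                   key=lambda c: D[c[0]][c[1]])
        path.append((i - 1, j - 1))
    return path[::-1]

def train_template(tokens, average_token, n_iter=5):
    """Warp every training token against the current template,
    accumulate the frames aligned to each template frame (A), count
    them (C), and set the new template to A / C.  A and C are reset
    to zero at the start of every cycle, as the slide requires."""
    dim = len(average_token[0])
    template = [list(f) for f in average_token]
    for _ in range(n_iter):
        A = [[0.0] * dim for _ in template]   # accumulators, reset each cycle
        C = [0] * len(template)               # counters, reset each cycle
        for token in tokens:
            for t_idx, tmpl_idx in dtw_path(token, template):
                for k in range(dim):
                    A[tmpl_idx][k] += token[t_idx][k]
                C[tmpl_idx] += 1
        template = [[a / c for a in row] for row, c in zip(A, C)]
    return template
```

Because the warping path always covers every template frame, every counter Cn is nonzero and the update Xn = An / Cn is well defined.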

RAW DATA
All training data were given as a directory of raw data files, each the result of sampling an analog signal at 8 kHz. A description file with information about the date, time, filename, beginning and end of speech and, finally, the word spoken was also given.

Raw data preprocessing
The training program required knowledge of the beginning of the segment of speech as well as of its length, both expressed in frames. For this reason, the description file was read and the necessary information was extracted and written into the file 'filelist'; a Perl program was used. The training program also required tokens represented by feature vectors, which is why the raw data files corresponding to the key word were converted into feature files. Again, the program was written in Perl, with the Front-end called from inside. As a result, a new directory 'Operator' with feature files was created.

The file 'filelist' obtained in this manner, together with the feature file corresponding to the token of average length, formed the inputs to the training program using the DTW averaging algorithm described above. The average token was the input X, and the input Y was read sequentially from the directory 'Operator', as pointed to by the file paths in the input file 'filelist'.

RESULTS
The DTW program was written and tested on data obtained in Dr. Silaghi's Speech Recognition class through the HTK Front-end. Although a few deviations were observed, there was a clear correlation between the phonetic similarity of words and the DTW score. At the same time, it was a mistake not to test it on the data used in the training before writing the training program.

RESULTS CONTINUED
When the training program was finished, it turned out that DTW did not discriminate the training data well enough, so it was pointless to test the template produced by the training program. Because the DTW program is relatively straightforward, it is more likely that the problem lies in the Front-end; in any case, it remains to be seen. Overall, it can be concluded that the DTW program, and the training program based on it, were developed but not tested, due to possible problems in the Front-end.

SUGGESTIONS
To gain better insight into DTW, it would be interesting to incorporate MATLAB into studying it. The DTW code is simple enough to be executed as an M-script; it only needs two arrays, X and Y, as inputs, imported after reading the corresponding feature files. The advantage is that MATLAB provides extensive tools for examining the DTW path. The wave files of the inputs should also be imported and their spectrograms studied, to see the correlation between the phonetic similarities indicated by the spectrograms and the behavior of the DTW path.
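Short of MATLAB, the kind of path examination suggested here can be sketched in plain Python: build the accumulated-distance matrix for two sequences and print it with the optimal path marked. The scalar inputs and the text rendering are illustrative assumptions only.

```python
def dtw_matrix(X, Y):
    """Accumulated-distance matrix D for scalar sequences X and Y,
    using |x - y| as the local distance."""
    N, M = len(X), len(Y)
    INF = float("inf")
    D = [[INF] * (M + 1) for _ in range(N + 1)]
    D[0][0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            D[i][j] = abs(X[i - 1] - Y[j - 1]) + min(
                D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D

def path_from(D):
    """Set of 1-based (i, j) cells on the optimal path, by backtracking."""
    i, j = len(D) - 1, len(D[0]) - 1
    path = {(i, j)}
    while (i, j) != (1, 1):
        i, j = min(((i - 1, j), (i - 1, j - 1), (i, j - 1)),
                   key=lambda c: D[c[0]][c[1]])
        path.add((i, j))
    return path

def show(X, Y):
    """Print the D matrix row by row, marking path cells with '*'."""
    D = dtw_matrix(X, Y)
    p = path_from(D)
    for i in range(1, len(X) + 1):
        print("".join("{}{:5.1f}".format("*" if (i, j) in p else " ", D[i][j])
                      for j in range(1, len(Y) + 1)))
```

Plotting the same matrix and path over the two spectrograms would make the phone-level correspondences visible, which is exactly the study proposed above.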