YES/NO RECOGNITION SYSTEM
Instructor: Dr. Veton Kepuska
Student: Dileep Narayan Koneru

Construction Steps
The main construction steps are the following:
1. Creation of a training database: each element of the vocabulary is recorded several times and labelled with the corresponding word.
2. Acoustical analysis: the training waveforms are converted into series of coefficient vectors.
3. Definition of the models: a prototype Hidden Markov Model (HMM) is defined for each element of the task vocabulary.
4. Training of the models: each HMM is initialised and trained with the training data.
5. Definition of the task: the grammar of the recogniser (what can be recognised) is defined.
6. Recognition of an input signal.
7. Evaluation: the performance of the recogniser can be evaluated on a corpus of test data.
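The files produced by these steps can be organised in a simple directory tree. The layout below is only a suggestion, but the directory names match the ones used in the following slides:

```
data/
  train/
    sig/     recorded training waveforms (HSLab output)
    lab/     label files (.lab)
    mfcc/    acoustical coefficient files (.mfcc)
  test/      test corpus (same sub-structure)
model/
  proto/     HMM prototypes (hmm_yes, hmm_no, hmm_sil)
  hmm0/      initialised HMMs (HInit output)
  hmm0flat/  flat-initialised HMMs (HCompv output)
```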

Creation of the Training Corpus
Record the signal.
Label the signal.

To create and label a speech file:
HSLab any_name.sig
The tool’s graphical interface then appears.

Acoustical Analysis
The speech recognition tools cannot process speech waveforms directly; these have to be represented in a more compact and efficient way. This step is called “acoustical analysis”:
The signal is segmented into successive overlapping frames (whose length is typically chosen between 20 ms and 40 ms).
Each frame is multiplied by a windowing function (e.g. a Hamming window).
A vector of acoustical coefficients (giving a compact representation of the spectral properties of the frame) is extracted from each windowed frame.

The conversion from the original waveform to a series of acoustical vectors is done with the HTK tool HCopy:
HCopy -A -D -C analysis.conf -S targetlist.txt
analysis.conf is a configuration file setting the parameters of the acoustical coefficient extraction.
targetlist.txt specifies the name and location of each waveform to process, along with the name and location of the target coefficient files.
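As an illustration, a minimal analysis.conf could look like the sketch below. The parameter values are typical choices, not prescribed by the slides: 25 ms Hamming windows shifted by 10 ms, with 12 cepstral coefficients plus energy and delta coefficients.

```
# analysis.conf (illustrative values)
SOURCEFORMAT = HTK         # HSLab writes HTK-format waveform files
TARGETKIND   = MFCC_0_D    # MFCCs + energy (c0) + delta coefficients
WINDOWSIZE   = 250000.0    # 25 ms frames (in 100 ns units)
TARGETRATE   = 100000.0    # 10 ms frame shift
NUMCEPS      = 12          # 12 cepstral coefficients
USEHAMMING   = T           # Hamming windowing function
PREEMCOEF    = 0.97        # pre-emphasis coefficient
NUMCHANS     = 26          # filterbank channels
CEPLIFTER    = 22          # cepstral liftering
```

Each line of targetlist.txt names a source waveform and the target coefficient file (the file names are illustrative):

```
data/train/sig/yes01.sig data/train/mfcc/yes01.mfcc
data/train/sig/no01.sig  data/train/mfcc/no01.mfcc
```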

HMM Definition
A prototype has to be generated for each event to model. In our case, we have to write a prototype for 3 HMMs that we will call “yes”, “no”, and “sil” (with headers ~h "yes", ~h "no" and ~h "sil" in the 3 description files). These 3 files could be named hmm_yes, hmm_no, hmm_sil and be stored in a directory called model/proto/.
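For example, model/proto/hmm_yes could contain a left-to-right prototype like the sketch below. The topology (6 states, of which 4 emit) and all numeric values are placeholders chosen for illustration and are re-estimated during training; the vector size 26 assumes MFCC_0_D coefficients (12 MFCCs plus energy, plus deltas).

```
~o <VecSize> 26 <MFCC_0_D>
~h "yes"
<BeginHMM>
  <NumStates> 6
  <State> 2
    <Mean> 26
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 26
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <State> 3
    <Mean> 26
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 26
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <State> 4
    <Mean> 26
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 26
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <State> 5
    <Mean> 26
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 26
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <TransP> 6
    0.0 1.0 0.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0 0.0
    0.0 0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.0 0.6 0.4
    0.0 0.0 0.0 0.0 0.0 0.0
<EndHMM>
```

The transition matrix only allows self-loops and forward moves, which gives the usual left-to-right topology for isolated-word models.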

Initialisation for HMM Training
Before starting the training process, the HMM parameters must be properly initialised with training data in order to allow a fast and precise convergence of the training algorithm. HTK offers 2 different initialisation tools: HInit and HCompv.

The following command line initialises the HMM by time-alignment of the training data with a Viterbi algorithm:
HInit -A -D -T 1 -S trainlist.txt -M model/hmm0 \
-H model/proto/hmmfile -l label -L label_dir nameofhmm
nameofhmm is the name of the HMM to initialise (here: yes, no, or sil).
hmmfile is a description file containing the prototype of the HMM called nameofhmm (here: proto/hmm_yes, proto/hmm_no, or proto/hmm_sil).

trainlist.txt gives the complete list of the .mfcc files forming the training corpus (stored in directory data/train/mfcc/).
label_dir is the directory where the label files (.lab) corresponding to the training corpus are stored (here: data/train/lab/).
label indicates which labelled segment must be used within the training corpus (here: yes, no, or sil, because we have used the same names for the labels and the HMMs, but this is not mandatory).
model/hmm0 is the name of the directory (which must be created beforehand) where the resulting initialised HMM description will be output.
This procedure has to be repeated for each model (hmm_yes, hmm_no, hmm_sil).

The HCompv tool performs a “flat” initialisation of a model: every state of the HMM is given the same mean and variance vectors, computed globally on the whole training corpus. In this case the initialisation command line is:
HCompv -A -D -T 1 -S trainlist.txt -M model/hmm0flat -H model/proto/hmmfile -f 0.01 nameofhmm
model/hmm0flat: the output directory must be different from the one used with HInit (to avoid overwriting).

Task Definition
Grammar and Dictionary: before using our word models, we have to define the basic architecture of our recogniser (the task grammar). We will first define the simplest one: a start silence, followed by a single word (in our case “Yes” or “No”), followed by an end silence. The system must of course know which HMM corresponds to each of the grammar variables YES, NO, START_SIL and END_SIL. This information is stored in a text file called the task dictionary.

The task is therefore described by two small text files: the task grammar and the task dictionary.
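As an illustration (file names and contents follow the structure described above), gram.txt can define the start-silence / word / end-silence structure like this:

```
$WORD = YES | NO;
( START_SIL $WORD END_SIL )
```

and dict.txt can map each grammar variable to its HMM, the bracketed symbol being the output transcription:

```
YES        [yes]  yes
NO         [no]   no
START_SIL  [sil]  sil
END_SIL    [sil]  sil
```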

Network
The task grammar (described in the file gram.txt) has to be compiled with the HParse tool to obtain the task network (written to net.slf):
HParse -A -D -T 1 gram.txt net.slf

Recognition
Let’s come to the recognition procedure itself. An input speech signal input.sig is first transformed into a series of “acoustical vectors” (here MFCCs) with the HCopy tool, in the same way as was done with the training data (Acoustical Analysis step). The result is stored in an input.mfcc file (often called the acoustical observation). The input observation is then processed by a Viterbi algorithm, which matches it against the recogniser’s Markov models. This is done by the HVite tool:
HVite -A -D -T 1 -H hmmsdef.mmf -i reco.mlf -w net.slf dict.txt hmmlist.txt input.mfcc

input.mfcc is the input data to be recognised.
hmmlist.txt lists the names of the models to use (yes, no, and sil), one per line. Don’t forget to insert a new line after the last element.
dict.txt is the task dictionary.
net.slf is the task network.
reco.mlf is the output recognition transcription file.
hmmsdef.mmf contains the definition of the HMMs. It is possible to repeat the -H option to list several HMM definition files.
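For this task hmmlist.txt is just three lines (followed by a final newline):

```
yes
no
sil
```

and reco.mlf comes out in HTK’s master label file format, one line per recognised segment with start time, end time, label and log-likelihood. The sketch below shows the shape of the file; the <...> fields are placeholders, not actual values:

```
#!MLF!#
"input.rec"
<start> <end> START_SIL <log-likelihood>
<start> <end> YES       <log-likelihood>
<start> <end> END_SIL   <log-likelihood>
.
```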

[Figure: block diagram of the recognition of an input signal]

A more interactive way of testing the recogniser is to use the “direct input” options of HVite:
HVite -A -D -T 1 -C directin.conf -g -H hmmsdef.mmf -w net.slf dict.txt hmmlist.txt
No input signal argument or output file is required in this case: a prompt READY[1]> appears on screen, indicating that the first input signal is being recorded. The recording is stopped by a key press. The recognition result is then displayed, and a prompt READY[2]> immediately appears, waiting for a new input.
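directin.conf reuses the acoustical analysis parameters (WINDOWSIZE, TARGETRATE, NUMCEPS, etc.) and adds the direct-audio settings. A sketch, with illustrative values assuming 16 kHz sampling:

```
# directin.conf (illustrative)
SOURCEKIND = HAUDIO    # take input directly from the audio device
SOURCERATE = 625.0     # 16 kHz sampling (sample period in 100 ns units)
TARGETKIND = MFCC_0_D  # same front end as analysis.conf
AUDIOSIG   = -1        # negative value: recording is stopped by a key press
```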

Error Rates
The ref.mlf and rec.mlf transcriptions are compared with the HTK performance evaluation tool, HResults:
HResults -A -D -T 1 -e ??? sil -I ref.mlf labellist.txt rec.mlf > results.txt
results.txt contains the output performance statistics.
rec.mlf contains the transcriptions of the test data, as output by the recogniser.
labellist.txt lists the labels appearing in the transcription files (here: yes, no, and sil).
ref.mlf contains the reference transcriptions of the test data.
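results.txt follows the usual HResults layout, roughly as sketched below; all figures are placeholders, not measured results:

```
====================== HTK Results Analysis =======================
  Date: <date>
  Ref : ref.mlf
  Rec : rec.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=<..> [H=<..>, S=<..>, N=<..>]
WORD: %Corr=<..>, Acc=<..> [H=<..>, D=<..>, S=<..>, I=<..>, N=<..>]
===================================================================
```

Here H, D, S and I count hits, deletions, substitutions and insertions, and N is the total number of reference labels.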

[Figure: result of the performance test]

[Figure: results for direct audio input]

Conclusions
The yes/no speech recognition system gives us a basic idea of how to build a speech recognition system using the HTK toolkit. The same approach can be used to build a more complex speech recognition system, with more training and testing data and a more complex grammar and dictionary.