LYU0103 Speech Recognition Techniques for Digital Video Library
Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong, Lei Mo

Outline of Presentation
- Project objectives
- ViaVoice recognition experiments
- Speech recognition editing tool
- Audio scene change detection
- Speech classification
- Summary

Our Project Objectives
- Audio information retrieval
- Speech recognition

Last Term’s Work
- Extracted the audio channel (stereo, 44.1 kHz) of MPEG video files into wave files (mono, 22 kHz)
- Segmented the wave files into sentences by detecting frame energy
- Developed a visual training tool
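The frame-energy segmentation step can be sketched as below: frames whose energy stays under a threshold long enough are treated as pauses between sentences. The frame length, threshold, and minimum-pause length here are illustrative assumptions, not the values actually used last term.

```python
import numpy as np

def segment_by_energy(samples, sr, frame_ms=20, threshold=0.01, min_silence_frames=10):
    """Split an audio signal into sentence-like segments by frame energy.

    A segment ends once `min_silence_frames` consecutive frames fall
    below the mean-square energy `threshold`. All settings are
    illustrative, not the project's actual parameters.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    energy = np.array([
        np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n_frames)
    ])
    voiced = energy > threshold
    segments, start, silence = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i           # a new segment begins
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # close the segment at the last voiced frame
                segments.append((start * frame_len, (i - silence + 1) * frame_len))
                start, silence = None, 0
    if start is not None:           # flush a segment still open at the end
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```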

Visual Training Tool: video window, dictation window, text editor

IBM ViaVoice Experiments
- Employed 7 student helpers to produce transcripts of 77 news video clips
- Four experiments:
  - Baseline measurement
  - Trained model measurement
  - Slow-down measurement
  - Indoor news measurement

Baseline Measurement
- Measures ViaVoice recognition accuracy on TVB news video
- Testing set: 10 video clips
- The segmented wave files are dictated
- The Hidden Markov Model Toolkit (HTK) is used to measure accuracy
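HTK's HResults tool scores word accuracy as (N − S − D − I)/N over a minimum-edit-distance alignment of reference and hypothesis. A minimal stand-in for that computation (not HTK itself) can be sketched as:

```python
def word_accuracy(reference, hypothesis):
    """HTK-style word accuracy: (N - S - D - I) / N, where S, D, I come
    from a standard Levenshtein alignment of the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    # the minimal edit distance equals S + D + I
    return 1.0 - d[n][m] / n
```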

Trained Model Measurement
- Measures the accuracy of ViaVoice after training it with its own correctly recognized words
- 10 video clips are segmented and dictated
- The correctly dictated words of the training set are fed back to ViaVoice through the SMAPI function SmWordCorrection
- The "baseline measurement" procedure is repeated after training to get the recognition performance
- The whole procedure is repeated with 20 video clips

Slow-Down Measurement
- Investigates the effect of slowing down the audio channel
- The segmented wave files in the testing set are resampled at ratios of 1.05, 1.1, 1.15, 1.2, 1.3, 1.4 and 1.6
- The "baseline measurement" procedure is then repeated
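The slow-down step can be sketched as plain resampling: stretch the waveform by the ratio and play it back at the original rate (this lowers pitch as well as speed, unlike pitch-preserving time stretching). A rough illustration, not the project's actual resampler:

```python
import numpy as np

def slow_down(samples, ratio):
    """Stretch a waveform by `ratio` (e.g. 1.15) with linear-interpolation
    resampling; played back at the original sample rate, the audio lasts
    `ratio` times longer. Illustrative sketch only."""
    n_out = int(len(samples) * ratio)
    old_idx = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(old_idx, np.arange(len(samples)), samples)
```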

Indoor News Measurement
- Eliminates the effect of noise by selecting only indoor news reporter sentences
- The test set is dictated with the untrained model
- The procedure is repeated with the trained model

Experimental Results

Overall recognition results (ViaVoice, TVB News):

  Experiment                        Accuracy (max. performance)
  Baseline                          25.27%
  Trained model                     25.87% (with 20 videos trained)
  Slow speech                       25.67% (max. at ratio = 1.15)
  Indoor speech (untrained model)   35.22%
  Indoor speech (trained model)     36.31% (with 20 videos trained)

Experimental Results (cont.)

Result of the trained model with different numbers of training videos:

  Trained videos   untrained   10 videos   20 videos
  Accuracy         25.27%      25.82%      25.87%

Result of using different slow-down ratios: (chart of accuracy (%) vs. ratio)

Analysis of Experimental Results
- Trained model: about 1% accuracy improvement
- Slowing down speech: about 1% accuracy improvement
- Indoor speech is recognized much better
- Mandarin: estimated baseline accuracy is about 70%, far higher than for Cantonese

Speech Processor
- Training does not increase accuracy significantly
- The recognition result needs manual editing
- Word timing information is also important

Editing Functionality
- The recognition result is organized in basic units called "firm words"
- Timing information is retrieved from the speech engine
- The timing of every firm word is recorded in an index
- The corresponding firm word is highlighted during video playback

Dynamic Time Index Alignment
- Editing the recognition result may change the firm word structure
- The time index must be updated to match the new firm words
- In the speech processor, the time index is re-aligned with the firm words whenever the user edits the text
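One step of such re-alignment might look like the hypothetical sketch below: when the user replaces one firm word with several words, the original time span is divided among the new words in proportion to their lengths. The function `split_firm_word` and its proportional rule are assumptions for illustration; the real processor works from SMAPI timing data.

```python
def split_firm_word(firm_words, i, new_words):
    """Replace firm word `i` with several words while keeping the time
    index consistent: the original span is shared out in proportion to
    each new word's character length. A hypothetical sketch, not the
    project's actual alignment code."""
    old = firm_words[i]
    span = old["end"] - old["start"]
    total = sum(len(w) for w in new_words)
    out, t = [], old["start"]
    for w in new_words:
        dur = span * len(w) / total
        out.append({"word": w, "start": t, "end": t + dur})
        t += dur
    out[-1]["end"] = old["end"]   # pin the last boundary to avoid rounding drift
    return firm_words[:i] + out + firm_words[i + 1:]
```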

Time Index Alignment Example (before editing, during editing, after editing)

Motivation for Speech Segmentation and Classification
- Gender classification lets us build gender-dependent models
- Scene change detection from video content alone is not accurate enough, so audio scene change detection is needed as an assistant tool

Flow Diagram of the Audio Information Retrieval System

Audio signal (news audio channel) → MFCC feature extraction → audio scene change detection / segmentation (by MFCC variance) → speech/non-speech classification (continuous vowel contour > 30% ⇒ speech; otherwise music, detected by pattern matching) → male/female classification (by a 256-component GMM) → speaker identification/classification (by clustering)

Feature Extraction by MFCC
- The first processing step on the raw audio input
- MFCC stands for "mel-frequency cepstral coefficients"
- Human perception of sound frequency does not follow a linear scale
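A compact sketch of the MFCC computation (framing → power spectrum → mel filterbank → log → DCT). The frame size, hop, and filter counts are illustrative defaults; a production front end would add pre-emphasis, liftering, and delta features.

```python
import numpy as np

def mfcc(signal, sr, n_filters=26, n_coeffs=13, frame_len=512, hop=256):
    """Minimal MFCC sketch: power spectrum -> mel filterbank -> log -> DCT.
    Parameter values are illustrative assumptions."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # frame the signal and take the magnitude-squared FFT
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * np.hanning(frame_len)
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # triangular filters equally spaced on the mel scale (non-linear in Hz)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, power.shape[1]))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)

    # DCT-II decorrelates the log filterbank energies into cepstral coefficients
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * k + 1) / (2 * n_filters)))
    return log_energy @ basis.T
```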

Detection of Audio Scene Changes by the Bayesian Information Criterion (BIC)
- BIC is a likelihood criterion
- The likelihood function is maximized separately for each model M, giving L(X, M)
- The main principle is to penalize a model by its complexity

Detection of a Single Change Point Using BIC

We define two hypotheses:
  H0: x1, x2, ..., xN ~ N(μ, Σ) — the whole sequence contains no change
  H1: x1, x2, ..., xL ~ N(μ1, Σ1) and x(L+1), x(L+2), ..., xN ~ N(μ2, Σ2) — a change occurs at time L

The maximum likelihood ratio is defined as:
  R(L) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2|
where N1 = L and N2 = N − L.

Detection of a Single Change Point Using BIC (cont.)

The difference between the BIC values of the two models can be expressed as:
  BIC(L) = R(L) − λP,  with  P = (1/2)(d + (1/2)d(d + 1)) log N
where d is the dimension of the feature vectors. If the maximum BIC value is greater than 0, a scene change is detected.
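The search over candidate change points can be written directly from these formulas. This sketch assumes full-covariance Gaussians and a minimum segment length of 10 frames (an assumption, to keep the covariance estimates well-conditioned); λ defaults to 1 as on the slide, though in practice it is tuned.

```python
import numpy as np

def bic_change_point(X, lam=1.0, min_seg=10):
    """Find the best single change point in X (N x d feature vectors):
        R(L)   = N log|S| - L log|S1| - (N - L) log|S2|
        BIC(L) = R(L) - lam * (1/2)(d + d(d+1)/2) log N
    Returns (best_L, best_bic); a change is declared when best_bic > 0."""
    N, d = X.shape
    logdet = lambda M: np.linalg.slogdet(np.cov(M, rowvar=False, bias=True))[1]
    penalty = lam * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    whole = N * logdet(X)
    best_L, best = None, -np.inf
    for L in range(min_seg, N - min_seg):
        bic = whole - L * logdet(X[:L]) - (N - L) * logdet(X[L:]) - penalty
        if bic > best:
            best_L, best = L, bic
    return best_L, best
```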

Detection of Multiple Change Points by BIC
a. Initialize the interval [a, b] with a = 1, b = 2
b. Detect whether there is one change point in the interval [a, b] using BIC
c. If there is no change in [a, b], let b = b + 1; otherwise, let t be the change point detected and assign a = t + 1, b = a + 1
d. Go to step (b) if necessary
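The growing-window loop of steps (a)–(d) might be sketched as follows, with the single-change BIC test inlined. The window-growth step `grow` and the penalty weight λ = 2 are assumed tuning values, not the project's actual settings.

```python
import numpy as np

def _best_bic(X, lam=2.0, min_seg=10):
    # largest BIC(L) over candidate change points L within this window
    N, d = X.shape
    ld = lambda M: np.linalg.slogdet(np.cov(M, rowvar=False, bias=True))[1]
    pen = lam * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return max((N * ld(X) - L * ld(X[:L]) - (N - L) * ld(X[L:]) - pen, L)
               for L in range(min_seg, N - min_seg))

def detect_changes(X, grow=20, lam=2.0):
    """Steps (a)-(d) above: grow the search window until BIC signals a
    change, then restart the window just past it."""
    changes, a, b = [], 0, 2 * grow
    while b <= len(X):
        bic, L = _best_bic(X[a:b], lam)
        if bic <= 0:
            b += grow                    # step (c): no change, widen the window
        else:
            t = a + L                    # change detected at absolute index t
            changes.append(t)
            a, b = t + 1, t + 1 + 2 * grow
    return changes
```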

Advantages of the BIC Approach
- Robustness
- Thresholding-free
- Optimality

Comparison of different algorithms

Audio scene change detection

Gender Classification
- The means and covariances of male and female feature vectors are quite different
- So each gender can be modeled by a Gaussian Mixture Model (GMM)
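As a minimal illustration, the sketch below models each gender with a single full-covariance Gaussian — a degenerate one-component "GMM" — and classifies by comparing log-likelihoods. The actual system on the flow diagram uses 256-component mixtures trained with EM; this simplification is an assumption for brevity.

```python
import numpy as np

class GaussianClassifier:
    """One full-covariance Gaussian per class; prediction picks the class
    with the higher log-likelihood. A one-component stand-in for the
    project's 256-component GMM."""

    def fit(self, X_by_class):
        self.params = {}
        for label, X in X_by_class.items():
            mu = X.mean(axis=0)
            cov = np.cov(X, rowvar=False, bias=True)
            # store mean, precision matrix, and log-determinant of the covariance
            self.params[label] = (mu, np.linalg.inv(cov),
                                  np.linalg.slogdet(cov)[1])
        return self

    def log_likelihood(self, x, label):
        mu, prec, logdet = self.params[label]
        diff = x - mu
        return -0.5 * (logdet + diff @ prec @ diff)

    def predict(self, x):
        return max(self.params, key=lambda lb: self.log_likelihood(x, lb))
```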

Male/Female Classification (frequency count vs. value; male and female feature histograms)

Gender Classification

Music/Speech Classification by Pitch Tracking
- Speech has a more continuous pitch contour than music
- A speech clip typically has a 30%–55% continuous contour, whereas silence or music has 1%–15%
- Thus we use > 20% continuous contour as the criterion for speech
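A rough sketch of the continuity measure: estimate a per-frame pitch by autocorrelation, then count the fraction of frames whose pitch stays close to the previous voiced frame; the slide's > 20% rule would then be applied to this ratio. The voicing threshold and the 10% continuity tolerance are illustrative assumptions.

```python
import numpy as np

def pitch_track(x, sr, frame=1024, hop=512, fmin=60, fmax=500, vthresh=0.3):
    """Per-frame pitch estimate via the autocorrelation peak in the
    [fmin, fmax] lag range; 0.0 marks unvoiced frames. Thresholds are
    illustrative assumptions."""
    lo, hi = int(sr / fmax), int(sr / fmin)
    pitches = []
    for i in range(0, len(x) - frame, hop):
        w = x[i:i + frame] * np.hanning(frame)
        ac = np.correlate(w, w, mode="full")[frame - 1:]
        if ac[0] <= 0:
            pitches.append(0.0)
            continue
        lag = lo + int(np.argmax(ac[lo:hi]))
        # voiced only if the autocorrelation peak is strong enough
        pitches.append(sr / lag if ac[lag] / ac[0] > vthresh else 0.0)
    return np.array(pitches)

def continuity_ratio(pitches, tol=0.1):
    """Fraction of frame pairs whose pitch stays within `tol` (relative)
    of the previous voiced frame -- the 'continuous contour' measure."""
    ok = 0
    for prev, cur in zip(pitches, pitches[1:]):
        if prev > 0 and cur > 0 and abs(cur - prev) / prev < tol:
            ok += 1
    return ok / max(len(pitches) - 1, 1)
```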

Frequency vs. number of frames (speech and music pitch contours)

Summary
- ViaVoice training experiments
- Speech recognition editing tool
- Dynamic time index alignment
- Audio scene change detection
- Speech classification
- All of the above integrated into a speech processor

Future Work
- Classify indoor and outdoor news for further processing of the video clips
- Train gender-dependent models for the ViaVoice engine, which may increase recognition accuracy