
1 LYU0103 Speech Recognition Techniques for Digital Video Library
Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong Lei Mo

2 Outline of Presentation
- Project objectives
- ViaVoice recognition experiments
- Speech recognition editing tool
- Audio scene change detection
- Speech classification
- Summary

3 Our Project Objectives
- Audio information retrieval
- Speech recognition

4 Last Term's Work
- Extracted the audio channel (stereo, 44.1 kHz) from MPEG video files into wave files (mono, 22 kHz)
- Segmented the wave files into sentences by detecting frame energy (see the sketch below)
- Developed a visual training tool
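A minimal sketch of the frame-energy segmentation idea, in Python with numpy; the frame length, energy threshold, and minimum silence length are illustrative assumptions, not the project's actual parameters.

```python
import numpy as np

def segment_by_energy(samples, sr=22050, frame_ms=20,
                      energy_thresh=1e-4, min_silence_frames=15):
    """Split a mono waveform into sentence-like chunks at low-energy runs.
    All thresholds here are illustrative."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)

    segments, start, silence = [], None, 0
    for i, e in enumerate(energy):
        if e > energy_thresh:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:   # a long pause ends a sentence
                segments.append((start * frame_len, (i - silence + 1) * frame_len))
                start, silence = None, 0
    if start is not None:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments  # list of (start_sample, end_sample) pairs
```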

5 Visual Training Tool
Components: video window, dictation window, text editor

6 IBM ViaVoice Experiments
- Employed 7 student helpers
- Produced transcripts of 77 news video clips
- Four experiments:
   Baseline measurement
   Trained model measurement
   Slow-down measurement
   Indoor news measurement

7 Baseline Measurement
- To measure ViaVoice recognition accuracy on TVB news video
- Testing set: 10 video clips
- The segmented wave files are dictated
- Employ the Hidden Markov Model Toolkit (HTK) to measure the accuracy (a scoring sketch follows below)
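HTK's HResults scores recognizer output by aligning it against the reference transcript. The sketch below computes the same accuracy figure, (N - S - D - I) / N, from a plain edit-distance alignment; it assumes pre-tokenized word lists (an assumption, since Chinese transcripts would first need word segmentation, and HResults uses its own alignment penalties).

```python
def word_accuracy(ref, hyp):
    """HTK-style accuracy (N - S - D - I) / N from a minimal
    edit-distance alignment of tokenized word lists."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum edits aligning ref[:i] with hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return (n - dp[n][m]) / n  # minimum S + D + I is the edit distance

ref = "the news report ends here".split()
hyp = "the news reports end here".split()
print(f"accuracy = {word_accuracy(ref, hyp):.2%}")  # 2 substitutions -> 60%
```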

8 Trained Model Measurement
- To measure the accuracy of ViaVoice after it is trained with its own correctly recognized words
- 10 video clips are segmented and dictated
- The correctly dictated words in the training set are fed back to ViaVoice through the SMAPI function SmWordCorrection
- Repeat the procedure of the baseline measurement after training to get the recognition performance
- Repeat the procedure using 20 video clips

9 Slow-Down Measurement
- Investigate the effect of slowing down the audio channel
- Resample the segmented wave files in the testing set by ratios of 1.05, 1.1, 1.15, 1.2, 1.3, 1.4, and 1.5 (a resampling sketch follows below)
- Repeat the procedure of the baseline measurement
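One way to realize the slow-down, sketched with scipy (an assumption; the slides do not name the resampling tool): stretch the waveform to ratio times as many samples and write it back at the original sample rate.

```python
from scipy.io import wavfile
from scipy.signal import resample

def slow_down(in_path, out_path, ratio=1.15):
    """Stretch a wave file by `ratio` (> 1 means slower) via resampling.
    Note this lowers the pitch along with the speed; whether the project
    compensated for pitch is not stated in the slides."""
    sr, x = wavfile.read(in_path)
    y = resample(x, int(len(x) * ratio))   # more samples at the same rate
    wavfile.write(out_path, sr, y.astype(x.dtype))
```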

10 Indoor News Measurement
- Eliminate the effect of noise
- Select the indoor news reporter sentences
- Dictate the test set using the untrained model
- Repeat the procedure using the trained model

11 Experimental Results
Overall recognition results (ViaVoice, TVB News):

Experiment                        Accuracy (max. performance)
Baseline                          25.27%
Trained model                     25.87% (with 20 videos trained)
Slow speech                       25.67% (max. at ratio = 1.15)
Indoor speech (untrained model)   35.22%
Indoor speech (trained model)     36.31% (with 20 videos trained)

12 Experimental Results (cont.)
Results of the trained model with different numbers of training videos:

Trained video number   Untrained   10 videos   20 videos
Accuracy               25.27%      25.82%      25.87%

Results of different slow-down ratios:

Ratio          1       1.05    1.1     1.15    1.2     1.3     1.4     1.5
Accuracy (%)   25.27   25.46   25.63   25.67   25.82   17.18   12.34   4.04

13 Analysis of Experimental Results
- Trained model: about 1% accuracy improvement
- Slowing down speech: about 1% accuracy improvement
- Indoor speech is recognized much better
- Mandarin: the estimated baseline accuracy is about 70%, far higher than for Cantonese

14 Speech Processor
- Training does not increase accuracy significantly
- Manual editing of the recognition result is needed
- Word timing information is also important

15 Editing Functionality
- The recognition result is organized in a basic unit called a "firm word"
- Retrieve the timing information from the speech engine
- Record the timing information of every firm word in an index (see the sketch below)
- Highlight the corresponding firm word during video playback
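A minimal sketch of such a time index, assuming each firm word carries start and end times in milliseconds; the field names and lookup are illustrative, not the actual SMAPI structures.

```python
import bisect
from dataclasses import dataclass

@dataclass
class FirmWord:
    text: str
    start_ms: int   # timing retrieved from the speech engine
    end_ms: int

class TimeIndex:
    """Maps a playback time to the firm word to highlight."""
    def __init__(self, words):
        self.words = sorted(words, key=lambda w: w.start_ms)
        self.starts = [w.start_ms for w in self.words]

    def word_at(self, t_ms):
        i = bisect.bisect_right(self.starts, t_ms) - 1
        if i >= 0 and t_ms <= self.words[i].end_ms:
            return self.words[i]
        return None  # playback is in a gap between firm words

index = TimeIndex([FirmWord("hello", 0, 400), FirmWord("world", 450, 900)])
print(index.word_at(500).text)  # -> "world"
```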

16 (no transcript text)

17 Dynamic Time Index Alignment
- While editing the recognition result, the firm word structure may change
- The time index needs to be updated to match the new firm words
- In the speech processor, the time index is realigned with the firm words whenever the user edits the text (one possible realignment is sketched below)
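The slides do not spell out the alignment algorithm, so the sketch below is one plausible approach rather than the project's actual method: keep the timings of unchanged firm words and spread each edited span's time evenly over its replacement words. It reuses FirmWord from the earlier sketch.

```python
import difflib

def realign(old_words, new_texts):
    """old_words: list of FirmWord; new_texts: the edited word list.
    Reuses timings for unchanged words and distributes an edited span's
    time linearly over its replacements. A sketch only."""
    sm = difflib.SequenceMatcher(a=[w.text for w in old_words], b=new_texts)
    result = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            result.extend(old_words[i1:i2])          # timings unchanged
        elif j2 > j1:                                # replace or insert
            # Time span covered by the edit: the old span if any, else a
            # point at the previous word's end (0 at the very start).
            if i2 > i1:
                t0, t1 = old_words[i1].start_ms, old_words[i2 - 1].end_ms
            else:
                t0 = t1 = old_words[i1 - 1].end_ms if i1 > 0 else 0
            step = max((t1 - t0) // (j2 - j1), 0)
            for k, text in enumerate(new_texts[j1:j2]):
                result.append(FirmWord(text, t0 + k * step, t0 + (k + 1) * step))
        # pure deletions (j1 == j2): the old timing is simply dropped
    return result
```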

18 Time Index Alignment Example
(figure: the index before editing, during editing, and after editing)

19 Motivation for Speech Segmentation and Classification
- Gender classification can help us build gender-dependent models
- Detection of scene changes from the video content alone is not accurate enough, so we use audio scene change detection as an assisting tool

20 Flow Diagram of the Audio Information Retrieval System
Audio signal (from the news audio channel) -> MFCC feature extraction -> segmentation / audio scene change detection -> speech vs. non-speech decision (continuous vowel contour > 30% means speech):
- Speech -> male/female classification (by 256-mixture GMM), then speaker identification/classification (by clustering)
- Non-speech -> music pattern matching (by MFCC variance)

21 Feature Extraction by MFCC
- The first step applied to the raw audio input data (a sketch follows below)
- MFCC stands for "mel-frequency cepstral coefficients"
- Human perception of the frequency of sound does not follow a linear scale
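A minimal sketch of MFCC extraction, assuming the librosa library; the file path and frame parameters are illustrative, as the slides do not specify the project's front end.

```python
import librosa

# Load a (mono, 22 kHz) wave file and compute 13 MFCCs per frame.
y, sr = librosa.load("segment.wav", sr=22050, mono=True)  # path is hypothetical
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames): one 13-dim feature vector per frame
```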

22 Detection of Audio Scene Changes by the Bayesian Information Criterion (BIC)
- The Bayesian information criterion (BIC) is a likelihood criterion
- We maximize the likelihood function separately for each model M and obtain L(X, M)
- The main principle is to penalize the model complexity

23 Detection of a Single Change Point Using BIC
We define:
H0: x_1, x_2, ..., x_N ~ N(μ, Σ) (the whole sequence contains no change)
H1: x_1, ..., x_L ~ N(μ_1, Σ_1); x_{L+1}, ..., x_N ~ N(μ_2, Σ_2) (a change occurs at time L)
The maximum likelihood ratio is defined as:
R(L) = N log|Σ| - N_1 log|Σ_1| - N_2 log|Σ_2|
where N_1 = L and N_2 = N - L.

24 Detection of a Single Change Point Using BIC (cont.)
The difference between the BIC values of the two models can be expressed as:
BIC(L) = R(L) - λP, with P = (1/2)(d + (1/2)d(d + 1)) log N
where d is the dimension of the feature vectors. If the BIC value is positive, a scene change is detected at L. (A sketch follows below.)
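A sketch of the single-change-point test in Python with numpy, following the formulas above; the margin that keeps both segments large enough to estimate a covariance is an illustrative assumption.

```python
import numpy as np

def bic_change_point(X, lam=1.0):
    """Detect a single change point in a sequence of feature vectors X
    (shape N x d) using BIC. Returns (index, bic) of the best candidate
    with positive BIC, or (None, 0.0) if no change is detected."""
    N, d = X.shape
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)  # model-complexity penalty

    def logdet_cov(Y):
        # Log determinant of the sample covariance of the segment.
        sign, ld = np.linalg.slogdet(np.cov(Y, rowvar=False))
        return ld

    full = N * logdet_cov(X)
    best_i, best_bic = None, 0.0
    for L in range(d + 1, N - d - 1):       # keep both segments estimable
        R = full - L * logdet_cov(X[:L]) - (N - L) * logdet_cov(X[L:])
        bic = R - lam * P
        if bic > best_bic:
            best_i, best_bic = L, bic
    return best_i, best_bic
```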

25 Detection of Multiple Change Points by BIC
a. Initialize the interval [a, b] with a = 1, b = 2
b. Detect whether there is one change point in the interval [a, b] using BIC
c. If there is no change in [a, b], let b = b + 1;
   otherwise, let t be the change point detected and assign a = t + 1, b = a + 1
d. Go to step (b) if necessary
(A sketch of this grow-and-reset loop follows below.)
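A sketch of the multiple-change-point loop, reusing bic_change_point from the sketch above. One adaptation, flagged as an assumption: the window starts at min_len frames rather than 2, since a covariance cannot be estimated from two frames.

```python
def detect_all_changes(X, min_len=50, lam=1.0):
    """Grow the window until a change is found, then restart just after it.
    min_len is an illustrative lower bound on the window size."""
    changes, a, b = [], 0, min_len
    N = len(X)
    while b <= N:
        t, bic = bic_change_point(X[a:b], lam)
        if t is None:
            b += 1                      # no change: grow the window
        else:
            changes.append(a + t)       # record the absolute frame index
            a = a + t + 1               # restart just after the change
            b = a + min_len
    return changes
```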

26 Advantages of the BIC Approach
- Robustness
- Threshold-free
- Optimality

27 Comparison of Different Algorithms (figure)

28 Audio Scene Change Detection (figure)

29 Gender Classification
- The means and covariances of the male and female feature vectors are quite different
- So we can model each gender with a Gaussian mixture model (GMM); a sketch follows below
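A minimal sketch of GMM-based gender classification over MFCC frames, assuming scikit-learn; the 256 mixtures match the flow diagram on slide 20, while the covariance type and decision rule are illustrative assumptions.

```python
from sklearn.mixture import GaussianMixture

def train_gmm(frames, n_components=256):
    """Fit one GMM per gender on MFCC frames of shape (n_frames, n_mfcc)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(frames)
    return gmm

def classify_gender(gmm_male, gmm_female, frames):
    # Compare the average per-frame log-likelihood under each model.
    return "male" if gmm_male.score(frames) > gmm_female.score(frames) else "female"
```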

30 Male/Female Classification
(figure: frequency count vs. feature values, male and female histograms)

31 Gender Classification (figure)

32 Music/Speech Classification by Pitch Tracking
- Speech has a more continuous pitch contour than music
- A speech clip always has a 30%-55% continuous contour, whereas silence or music has 1%-15%
- Thus, we choose > 20% continuous contour as the threshold for speech (see the sketch below)
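A sketch of the contour-continuity measure, assuming librosa's yin pitch tracker; the pitch range and the maximum frame-to-frame jump that still counts as "continuous" are illustrative assumptions, not the project's exact method.

```python
import numpy as np
import librosa

def continuous_contour_ratio(path, fmin=60.0, fmax=400.0, max_jump_hz=10.0):
    """Fraction of frames whose pitch continues smoothly from the previous
    frame; thresholds are illustrative."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    f0 = librosa.yin(y, fmin=fmin, fmax=fmax, sr=sr)  # per-frame pitch (Hz)
    jumps = np.abs(np.diff(f0))
    return float((jumps < max_jump_hz).mean())

# Decision rule from the slide: > 20% continuous contour means speech.
def is_speech(path):
    return continuous_contour_ratio(path) > 0.20
```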

33 Frequency vs. Number of Frames
(figure: pitch contours of speech and music)

34 Summary
- ViaVoice training experiments
- Speech recognition editing tool
- Dynamic time index alignment
- Audio scene change detection
- Speech classification
- Integrated the above functions into a speech processor

35 Future Work
- Classify indoor and outdoor news to further process the video clips
- Train gender-dependent models for the ViaVoice engine; a gender-dependent model may increase the recognition accuracy

