LYU0103 Speech Recognition Techniques for Digital Video Library
Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong, Lei Mo

Outline of Presentation
- Project objectives
- ViaVoice recognition experiments
- Speech information processor
- Audio information retrieval
- Summary

Our Project Objectives
- Speech recognition
- Audio information retrieval

Last Term's Work
- Extracted the audio channel (stereo, 44.1 kHz) from MPEG video files into wave files (mono, 22 kHz)
- Segmented the wave files into sentences by detecting frame energy
- Real-time dictation with IBM ViaVoice (a speech recognition engine developed by IBM)
- Developed a visual training tool
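The sentence segmentation step can be sketched roughly as follows. The frame length, energy threshold, and minimum pause length here are illustrative guesses, not the project's actual parameters.

```python
import numpy as np

def segment_by_energy(samples, rate, frame_ms=20, threshold=0.01, min_silence_frames=15):
    """Split a mono signal into speech segments wherever a long run of
    low-energy frames (a pause) is found.  Returns (start, end) sample
    offsets.  All parameter values are illustrative."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)   # mean energy per frame
    voiced = energy > threshold

    segments, start, silence = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i                     # a new sentence begins
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:  # a long pause ends the sentence
                segments.append((start * frame_len, (i - silence + 1) * frame_len))
                start, silence = None, 0
    if start is not None:                      # flush a trailing segment
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```

Short pauses inside a sentence do not split it; only a silence run longer than `min_silence_frames` closes the current segment.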

Visual Training Tool: video window, dictation window, text editor

IBM ViaVoice Experiments
- Employed 7 student helpers
- Produced transcripts of 77 news video clips
- Four experiments:
  - Baseline measurement
  - Trained model measurement
  - Slow-down measurement
  - Indoor news measurement

Baseline Measurement
- Measures the ViaVoice recognition accuracy on TVB news video
- Testing set: 10 video clips
- The segmented wave files are dictated
- The Hidden Markov Model Toolkit (HTK) is used to evaluate the accuracy
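HTK scores word accuracy as (N − S − D − I) / N, where S, D, and I are the substitutions, deletions, and insertions found by a minimum-edit-distance alignment of reference and hypothesis. A minimal re-implementation of that metric (a sketch, not HTK's HResults itself) might look like:

```python
def align_counts(ref, hyp):
    """Count substitutions, deletions, insertions from a minimum
    edit-distance alignment of two word lists."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]   # d[i][j]: min edits, ref[:i] vs hyp[:j]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace to attribute each edit to S, D, or I.
    i, j, S, D, I = n, m, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D, i = D + 1, i - 1
        else:
            I, j = I + 1, j - 1
    return S, D, I

def accuracy(ref, hyp):
    """HTK-style word accuracy: (N - S - D - I) / N."""
    S, D, I = align_counts(ref, hyp)
    return (len(ref) - S - D - I) / len(ref)
```

For example, a hypothesis that drops one word of a six-word reference scores 5/6.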

Trained Model Measurement
- Measures the accuracy of ViaVoice after it is trained on its own correctly recognized words
- 10 video clips are segmented and dictated
- The correctly dictated words of the training set are fed back to ViaVoice through the SMAPI function SmWordCorrection
- The "baseline measurement" procedure is repeated after training to obtain the recognition performance
- The whole procedure is repeated with 20 video clips

Slow Down Measurement
- Investigates the effect of slowing down the audio channel
- The segmented wave files in the testing set are resampled at ratios of 1.05, 1.1, 1.15, 1.2, 1.3, 1.4, and 1.6
- The "baseline measurement" procedure is then repeated
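Resampling by a ratio, so that playback at the original sample rate comes out that many times slower, can be sketched with simple linear interpolation; the experiments' actual resampler may have differed.

```python
import numpy as np

def slow_down(samples, ratio):
    """Stretch a signal by `ratio` using linear-interpolation resampling.
    Played back at the original sample rate, the result is `ratio` times
    longer (and correspondingly lower in pitch, since this is plain
    resampling rather than time-scale modification)."""
    n_out = int(round(len(samples) * ratio))
    src = np.linspace(0, len(samples) - 1, n_out)  # fractional source positions
    return np.interp(src, np.arange(len(samples)), samples)
```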

Indoor News Measurement
- Eliminates the effect of noise: only indoor news-reporter sentences are selected
- The test set is dictated with the untrained model
- The procedure is repeated with the trained model

Experimental Results

Overall recognition results (ViaVoice, TVB news):

  Experiment                         Accuracy (max. performance)
  Baseline                           25.27%
  Trained model                      25.87% (trained with 20 videos)
  Slow speech                        25.67% (max. at ratio = 1.15)
  Indoor speech (untrained model)    35.22%
  Indoor speech (trained model)      36.31% (trained with 20 videos)

Experimental Results (cont.)

Accuracy of the trained model with different numbers of training videos:

  Training videos   Untrained   10 videos   20 videos
  Accuracy          25.27%      25.82%      25.87%

[Figure: accuracy (%) at each slow-down ratio]

Analysis of Experimental Results
- Trained model: about 1% accuracy improvement
- Slowed-down speech: about 1% accuracy improvement
- Indoor speech is recognized much better
- Mandarin: the estimated baseline accuracy is about 70%, far higher than for Cantonese

Experiment Conclusions
Four reasons for the low accuracy:
- Language model mismatch
- Voice channel mismatch
- The broadcast speech is very fast, and some characters are not clearly articulated
- The volume of the video clips is too loud
The first two reasons are the most critical.

Speech Recognition Approach
- We cannot do much acoustic model training with the ViaVoice API
- Training is speaker dependent
- There is a great difference between the news audio and the speech ViaVoice was trained on
- A tool to adapt the acoustic model is not currently available
- Manual editing is therefore necessary to produce correct subtitles

Speech Information Processor (SIP): media player, text editor, audio information panel

Main Features
- Media playback
- Real-time dictation
- Word time information
- Dynamic recognition text editing
- Audio scene change detection
- Audio segment classification
- Gender classification

System Chart

Timing Information Retrieval
- Uses the ViaVoice Speech Manager API (SMAPI) with asynchronous callbacks
- The recognized text is organized in a basic unit called the "firm word"
- SIP builds an index storing the position and time of each firm word
- The corresponding firm words are highlighted during video playback
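A minimal sketch of such a firm-word index; the field names are hypothetical, since SMAPI's actual structures are not shown on the slides. Because firm words arrive in time order, a binary search finds the word to highlight at any playback position.

```python
import bisect
from dataclasses import dataclass

@dataclass
class FirmWord:
    text: str
    char_pos: int   # character offset in the editor's text
    start_ms: int   # start time in the audio

class WordIndex:
    """Index of firm words, kept sorted by start time as they arrive."""
    def __init__(self):
        self.words = []

    def add(self, word):
        self.words.append(word)

    def word_at(self, t_ms):
        """Return the firm word playing at time t_ms, or None."""
        times = [w.start_ms for w in self.words]
        i = bisect.bisect_right(times, t_ms) - 1
        return self.words[i] if i >= 0 else None
```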

Highlight words during playback

Dynamic Index Alignment
- Editing the recognized result may change the firm word structure
- The word index needs to be updated accordingly
- SIP captures the WM_CHAR events of the text editor
- It then searches for the modified words and updates the corresponding index entries
- In practice, binary search gives good response time
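The "search, then update" step can be sketched like this. The real SIP handler reacts to WM_CHAR events in the editor; this sketch simply takes the edit offset and length directly, and uses binary search to find the first affected index entry.

```python
import bisect

def apply_edit(positions, edit_pos, delta):
    """positions: sorted character offsets of firm words.  After an
    insertion (delta > 0) or deletion (delta < 0) at edit_pos, shift
    every word at or after that offset.  Binary search locates the
    first affected entry; everything before it is untouched."""
    i = bisect.bisect_left(positions, edit_pos)
    for j in range(i, len(positions)):
        positions[j] += delta
    return positions
```

Only the tail of the index is rewritten, which keeps the update cheap for edits near the end of the transcript.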

Time Index Alignment Example: before editing, during editing, after editing

Audio Information Panel
- The entire clip is divided into segments separated by audio scene changes
- SIP classifies the segments into three categories: male, female, and non-speech
- Clicking a segment previews it

Audio Information Retrieval

Detection of Audio Scene Changes: Motivations
- Segments with different properties can be handled differently
- Unsupervised learning can be applied to the different clusters
- It serves as an auxiliary tool for video scene change detection

Bayesian Information Criterion (BIC)
- Gaussian distributions model the input stream
- Maximum likelihood detects the turns
- BIC makes the decision

Principle of BIC
The Bayesian information criterion (BIC) is a likelihood criterion whose main principle is to penalize a model by its complexity.

Detection of a Single Change Point Using BIC
H0: x_1, x_2, ..., x_N ~ N(μ, Σ)
H1: x_1, x_2, ..., x_i ~ N(μ_1, Σ_1) and x_{i+1}, x_{i+2}, ..., x_N ~ N(μ_2, Σ_2)
The maximum likelihood ratio is defined as:
R(i) = N log|Σ| - N_1 log|Σ_1| - N_2 log|Σ_2|

Detection of a Single Change Point Using BIC (cont.)
The difference between the BIC values of the two models can be expressed as:
BIC(i) = R(i) - λP, where P = (1/2)(d + (1/2)d(d+1)) log N
and d is the dimension of the feature vectors.
If BIC(i) > 0, a scene change is detected at i.

Detection of Multiple Change Points by BIC
a. Initialize the interval [a, b] with a = 1, b = 2
b. Detect whether there is a change point in the interval [a, b] using BIC
c. If there is no change in [a, b]:
       b = b + 1
   else:
       let t be the detected change point
       a = t + 1; b = a + 1
d. Go to step (b) if necessary
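The single-change test at the heart of step (b) can be sketched directly from the formulas on the previous slides; the penalty weight λ and the search margin are illustrative choices, not the project's.

```python
import numpy as np

def bic_delta(x, i, lam=1.0):
    """BIC(i) = R(i) - lam * P for a candidate change at frame i, with
    R(i) = N log|S| - N1 log|S1| - N2 log|S2| and
    P = (1/2)(d + (1/2) d (d+1)) log N, as on the slides.
    x: (N, d) array of feature vectors (e.g. MFCCs)."""
    n, d = x.shape

    def logdet(a):
        cov = np.atleast_2d(np.cov(a, rowvar=False, bias=True))
        return np.log(np.linalg.det(cov))

    r = n * logdet(x) - i * logdet(x[:i]) - (n - i) * logdet(x[i:])
    p = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return r - lam * p

def detect_change(x, lam=1.0, margin=5):
    """Return the frame maximizing BIC(i) if that maximum is positive,
    else None (i.e. no change point in this window)."""
    best_i, best_v = None, 0.0
    for i in range(margin, len(x) - margin):
        v = bic_delta(x, i, lam)
        if v > best_v:
            best_i, best_v = i, v
    return best_i
```

The growing-window loop of steps (a)-(d) would call `detect_change` on each interval [a, b], widening b while the result is None.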

Advantages of the BIC Approach
- Robustness
- Threshold-free
- Optimality

Comparison of different algorithms

Gender Classification: Motivation and Purpose
- Allows different speech analysis algorithms for each gender
- Facilitates speech recognition by cutting the search space in half
- Helps us build gender-dependent recognition models and train the system better

Gender Classification
[Figure: male and female speech examples]

Speech/Non-Speech Classification
- Motivation
- One method we used: pitch tracking
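A rough sketch of autocorrelation-based pitch tracking for speech/non-speech classification: voiced speech shows a strong periodic peak in the autocorrelation, while noise and most non-speech audio do not. The frequency range, peak threshold, and voiced ratio here are illustrative, not the project's values.

```python
import numpy as np

def frame_pitch(frame, rate, fmin=60, fmax=400):
    """Autocorrelation pitch estimate for one frame, in Hz; returns 0.0
    when no clear periodicity is found in the [fmin, fmax] range."""
    frame = frame - frame.mean()
    if not frame.any():
        return 0.0
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    lo, hi = int(rate / fmax), min(int(rate / fmin), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi]))
    if ac[lag] < 0.3 * ac[0]:       # weak peak: treat as unvoiced
        return 0.0
    return rate / lag

def is_speech(frames, rate, voiced_ratio=0.4):
    """Classify a segment as speech if enough of its frames carry pitch."""
    voiced = sum(frame_pitch(f, rate) > 0 for f in frames)
    return voiced / len(frames) >= voiced_ratio
```

Non-speech segments such as music would need extra care (music is also pitched), which is one reason the slides call this only "one method we used".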

Speech/Non-Speech Classification
[Figure: speech and non-speech examples]

Summary
- ViaVoice training experiments
- Speech recognition editing
- Dynamic index alignment
- Audio scene change detection
- Speech classification
- All of the above functions are integrated into a speech information processor

Q & A