
Review of ICASSP 2004 Arthur Chan

Part I of This Presentation (6 pages)
- Pointers of ICASSP 2004 (2 pages)
- NIST Meeting Transcription Workshop (2 pages)

Session Summary
- Speech Processing Sessions (SP-L1 to SP-L11, SP-P1 to SP-P16)
  - Many people attended this year; last year's conference in Hong Kong was affected by SARS.
  - Speech/speaker recognition, TTS/voice morphing, speech coding
- Signal Processing Sessions (SAM*, SPTM*, AE-P6)
- Image Processing Sessions (IMDSP*)
- Machine Learning Sessions (MLSP*)
- Multimedia Processing Sessions (MSP*)
- Applications (ITT*)

Quick Speech Paper Pointers
- Acoustic Modeling and Adaptation (SP-P2, SP-P3, SP-P14)
- Noisy Speech Processing/Recognition (SP-P6, SP-P13)
- Language Modeling (SP-L11)
- Speech Processing in the meeting domain
  - RT04 Rich Transcription in the meeting domain; the handbook can be obtained from Arthur.
- Speech Applications/Systems (ITT-P2, MSP-P1, MSP-P2)
- Speech Understanding (SP-P4)
- Feature Analysis (SP-P6, SP-L6)
- Voice Morphing (SP-L1)
- TTS

Meeting Transcription Workshop
Message: meeting transcription is hard.
- Problems in core technology: cross talk causes a lot of trouble for speech recognition and speaker segmentation.
- Problems in evaluation: cross talk causes a lot of trouble in string evaluation.
- Problems in resource creation: transcription becomes very hard, and tools are not yet available.

Speech Recognition
- Big challenge in speech recognition: ~65% average error rate using state-of-the-art technology for
  - acoustic modeling and language modeling
  - speaker adaptation
  - discriminative training
  - signal processing using multiple distant microphones
- Observations:
  - Speech recognition becomes poorer when there are more speakers.
  - Multiple distant microphones are a big win; maybe microphone arrays will be too.

End of Part I
On Jun 18, 2004, Jim asked why FA is counted. Q: "Is it reasonable to give the same weighting to FA as to Missing Speaker and Wrong Speaker?"

Part II
- More on Diarization Error Measurement (7 pages): Is the current DER reasonable?
- Lightly Supervised Training (6 pages)

More on Diarization Error Measurement (7 pages)
Goal:
- Discover how many persons are involved in the conversation.
- Assign each speech segment to a particular speaker.
- Usually assumes no prior knowledge of the speakers.
Applications: unsupervised speaker adaptation; automatic archiving and indexing of acoustic data.

Usual Procedure of Speaker Diarization
1. Speaker Segmentation: segment an N-speaker audio document into segments, each believed to be spoken by a single speaker.
2. Speaker Clustering: assign the segments to hypothesized speakers.
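The two-stage procedure can be sketched in code. This is only an illustration: it assumes each segment has already been reduced to a feature vector, and the greedy centroid-distance merging below is a hypothetical stand-in for a real speaker-clustering criterion (e.g. BIC-based agglomerative clustering), which the slides do not specify.

```python
import math

def cluster_segments(segment_vectors, threshold):
    """Greedy agglomerative clustering: repeatedly merge the two closest
    clusters (by centroid distance) until no pair is closer than threshold.
    Each resulting cluster corresponds to one hypothesized speaker."""
    clusters = [[v] for v in segment_vectors]

    def centroid(cluster):
        # Per-dimension mean of all vectors in the cluster.
        return [sum(dim) / len(cluster) for dim in zip(*cluster)]

    def dist(a, b):
        return math.dist(centroid(a), centroid(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Four one-dimensional "segments" from two well-separated speakers.
clusters = cluster_segments([[0.0], [0.1], [5.0], [5.1]], threshold=1.0)
print(len(clusters))  # 2 hypothesized speakers
```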

Diarization Process
[Diagram: reference timeline (Ref_Spk1, Ref_Spk2) aligned against system hypothesis (Hyp_Spk1, Hyp_Spk2), illustrating False Alarm and Missed Speaker errors]

Definition of Diarization Error
Rough segmentations are first provided as reference; another stage of acoustic segmentation is then applied on top of that segmentation.
Definition:

DER = Σ_s dur(s) · (max(N_ref(s), N_sys(s)) − N_correct(s)) / Σ_s dur(s) · N_ref(s)

where, for each segment s:
- dur(s): duration of the segment
- N_ref(s): number of speakers in the reference
- N_sys(s): number of speakers provided by the system
- N_correct(s): number of speakers in the reference hypothesized correctly by the system

Breakdown into Three Types of Errors
- Speaker Error time: sum of segments attributed to the wrong speaker.
- Missed Speaker time: sum of segments where the reference has more speakers than the system.
- False Alarm time: sum of segments where the system has more speakers than the reference.
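The DER computation above can be sketched in a few lines. This assumes the per-segment counts (a hypothetical tuple layout) have already been derived from the reference and hypothesis timelines via an optimal speaker mapping, which is where the real scoring work lies.

```python
def diarization_error_rate(segments):
    """DER = sum(dur * (max(n_ref, n_sys) - n_correct)) / sum(dur * n_ref).
    Each segment is a tuple (dur, n_ref, n_sys, n_correct)."""
    error_time = sum(d * (max(nr, ns) - nc) for d, nr, ns, nc in segments)
    scored_time = sum(d * nr for d, nr, _, _ in segments)
    return error_time / scored_time

# Two 10-second segments: one correctly attributed, one attributed to
# the wrong speaker -> DER = 10 / 20 = 0.5
print(diarization_error_rate([(10.0, 1, 1, 1), (10.0, 1, 1, 0)]))  # 0.5
```

Note that a false-alarm segment (system speaker where the reference has none) adds to the error time but not to the scored time, which is exactly why DER can exceed 100%.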

Re: Jim, Possible Extension of the Measure
The current measure is weighted by the number of mistakes made. There may be ways to extend the definition.

Other Practical Concerns in Measuring DER
In the NIST evaluation guidelines:
- Only rough segmentation is provided at the beginning.
- A 250 ms time collar is provided in the evaluation.
- Breaks of a speaker shorter than 0.3 s don't count.

My Conclusion
Weaknesses of the current measure:
- Because of FA, DER can be larger than 100%, but most systems perform much better than that.
- Constraints are also provided to make the measure reasonable.
- Also, as with WER, it is pretty hard to decide how to weigh deletion and insertion errors.
So the current measure is imperfect; however, it might be possible to extend it to be more reasonable.
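The analogy to WER can be made concrete. Below is a minimal sketch of the standard WER computation by edit distance, where deletions, insertions, and substitutions all get the same unit weight; that equal weighting is precisely the kind of choice the slide calls hard to justify.

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance between the reference and
    hypothesis word sequences (unit cost for substitution, insertion,
    and deletion), divided by the reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat on"))  # one insertion -> 1/3
```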

Further References
- Spring 2004 (RT-04S) Rich Transcription Meeting Recognition Evaluation Plan: spring/documents/rt04s-meeting-eval-plan-v1.pdf
- "Speaker Segmentation and Clustering in Meetings" by Qin Jin et al., in the RT 2004 Spring Meeting Recognition Workshop.

Lightly Supervised Training (6 pages)
- Light supervision in acoustic model training: >1000 hours of training (by BBN) using the TDT (Topic Detection and Tracking) corpus.
- The corpus (1400 hrs in total) contains news from ABC/CNN (TDT2), MSNBC and NBC (TDT3 and TDT4).
- Lightly supervised training uses only the closed-caption transcription, not transcribed by humans: "decoding as a second opinion."
- Adapted results (WER): BL (Hub4) 12.7% -> +TDT4 12.0% -> +TDT2 11.6% -> +TDT3 10.9% -> with MMIE 10.5%

How Does It Work?
- Requires a very strict automatic selection criterion.
- What kills the recognizer is insertion and deletion of phrases:
  CC: "The republican leadership council is going to air ads promoting Ralph Nader"
  Actual: "The republican leadership council, a moderate group, is going to air ads promoting the Green Party candidate, Ralph Nader."
  -> This corrupts the phoneme alignments.

Pointing Out Errors: Biased LM for Lightly Supervised Decoding
- Instead of using a standard LM, use an LM biased toward the CC.
- Argument: a good recognizer can figure out whether there is an error; however, it is not easy to know automatically that there is one.
- An LM highly biased toward the CC gives low WER on correct CC and can point out errors better; however, it also makes the recognizer repeat the same errors as the CC.
- Authors: "... the art is in such a way that the recognizer can confirm correct words ... and point out the errors."

Selection of Sentences: Lightly Supervised Decoding
- Lightly supervised decoding: use a 10xRT decoder to run through 1400 hrs of speech (about 1.5 years on a single-processor machine). Authors: "It takes some time to run."
- Selection: only choose the files with 3 or more contiguous correct words (or files with no error). Only 50% of the data is selected (around 700 hrs).
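The "3 or more contiguous correct words" criterion can be sketched by aligning the recognizer output against the closed captions and keeping only the regions where they agree for long enough. The helper name and the use of `difflib` are illustrative assumptions; the paper's own alignment is done at the decoding level.

```python
from difflib import SequenceMatcher

def select_regions(cc_words, hyp_words, min_run=3):
    """Return the closed-caption word spans on which the recognizer
    hypothesis agrees for at least min_run contiguous words."""
    sm = SequenceMatcher(a=cc_words, b=hyp_words, autojunk=False)
    regions = []
    for block in sm.get_matching_blocks():
        if block.size >= min_run:
            regions.append(cc_words[block.a : block.a + block.size])
    return regions

cc = "the republican leadership council is going to air ads".split()
hyp = ("the republican leadership council a moderate group "
       "is going to air ads").split()
# The inserted phrase "a moderate group" splits the agreement into
# two long matching runs, both of which survive selection.
for region in select_regions(cc, hyp):
    print(" ".join(region))
```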

Model Scalability and Conclusion
- Hours: 141h -> 843h
- Speakers: 7k -> 31k
- Codebooks: 6k -> 34k
- Gaussians: 164k -> 983k

Conclusion and Discussion
A new challenge for speech recognition: are we using the right method for this task?
- Is increasing the number of parameters correct?
- Will more complex models (n-phones, n-grams) work better in cases with >1000 hrs?

Related Work in ICASSP 2004
- Lightly supervised acoustic model training using consensus networks (LIMSI, on TDT4 Mandarin).
- Improving broadcast news transcription by lightly supervised discriminative training (very similar work by Cambridge): uses a faster decoder (5xRT); discriminative training is the main theme.