Building A Highly Accurate Mandarin Speech Recognizer
Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf


1 Building A Highly Accurate Mandarin Speech Recognizer
Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf
12/12/2007

2 Outline
- Goal: a highly accurate Mandarin ASR
- Background
- Acoustic segmentation
- Acoustic models and adaptation
- Language models and adaptation
- Cross adaptation
- System combination
- Error analysis
- Future

3 Background
- 870 hours of acoustic training data.
- N-gram based (N=1) ML Chinese word segmentation.
- 60K-word lexicon.
- 1.2G words of LM training text; trigrams and 4-grams.

          n2     n3      n4      Dev07-IV Perplexity
LM3       58M    108M
qLM3      6M     3M
LM4       58M    316M    201M    297.8
qLM4      19M    24M     6M      383.2
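The N=1 (unigram) maximum-likelihood word segmentation mentioned above can be sketched as a Viterbi search over a lexicon. This is a minimal illustration, not the system's actual segmenter; the toy probabilities and the unknown-character floor are assumptions.

```python
import math

def segment(text, unigram_probs, max_word_len=4):
    """Viterbi word segmentation under a unigram (N=1) ML model.

    unigram_probs maps known words to probabilities; unknown single
    characters receive a small floor probability so decoding never fails.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n      # best log-prob of a segmentation ending at i
    back = [0] * (n + 1)                # backpointer: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            p = unigram_probs.get(word, 1e-8 if i - j == 1 else 0.0)
            if p > 0 and best[j] + math.log(p) > best[i]:
                best[i] = best[j] + math.log(p)
                back[i] = j
    # Recover the best segmentation by following backpointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

probs = {"北京": 0.3, "大学": 0.2, "北": 0.05, "京": 0.05, "大": 0.05, "学": 0.05}
print(segment("北京大学", probs))  # ['北京', '大学']
```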

4 Acoustic Segmentation
- The former segmenter caused high deletion errors: it misclassified some speech segments as noise.
- Minimum speech-segment duration: 18 x 30 ms = 540 ms, about 0.5 s.

Vocabulary    Pronunciation
speech        18 + fg
noise         rej
silence       bg

[HMM diagram: null start/end states connected through speech, silence, and noise loops]

5 New Acoustic Segmenter
- Allow shorter speech durations.
- Model Mandarin vs. foreign (English) speech separately.

Vocabulary    Pronunciation
Mandarin1     I1 F
Mandarin2     I2 F
Foreign       forgn
Noise         rej
Silence       bg

[HMM diagram: null start/end states connected through Mandarin, foreign, silence, and noise loops]

6 Two Sets of Acoustic Models
- For cross adaptation and system combination:
  - different error behaviors,
  - similar error-rate performance.

           System-MLP         System-PLP
Features   74 (MFCC+3+32)     42 (PLP+3)
fMPE       no                 yes
Phones     72                 81

7 MLP Phoneme Posterior Features
- Compute Tandem features with pitch+PLP input.
- Compute HATs features with 19 critical bands.
- Combine the Tandem and HATs posterior vectors into one.
- Log(PCA(71) -> 32).
- MFCC + pitch + MLP = 74 dimensions.
- 3500x128 Gaussians, MPE trained.
- Both cross-word (CW) and non-CW triphones trained.
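The 74-dimensional front end above can be sketched in NumPy. The averaging of the two posterior streams and the log-before-PCA ordering are assumptions for illustration; the slide only says the two 71-dim vectors are combined and Log/PCA reduces them to 32 dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def combine_mlp_features(tandem_post, hats_post, pca_basis, mfcc, pitch):
    """Build the 74-dim feature: MFCC (39) + pitch (3) + 32-dim
    log-PCA of the combined 71-dim MLP posteriors."""
    post = 0.5 * (tandem_post + hats_post)        # combined 71-dim posterior (assumed average)
    logpost = np.log(post + 1e-10)                # log domain, floored for stability
    mlp32 = pca_basis @ logpost                   # PCA projection: 71 -> 32
    return np.concatenate([mfcc, pitch, mlp32])   # 39 + 3 + 32 = 74

feat = combine_mlp_features(
    tandem_post=rng.dirichlet(np.ones(71)),       # stand-in MLP outputs
    hats_post=rng.dirichlet(np.ones(71)),
    pca_basis=rng.standard_normal((32, 71)),      # stand-in PCA basis
    mfcc=rng.standard_normal(39),
    pitch=rng.standard_normal(3),
)
print(feat.shape)  # (74,)
```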

8 Tandem Features [T1, T2, ..., T71]
- Input: 9 frames of PLP+pitch: PLP (39x9) + pitch (3x9).
- MLP topology: (42x9) x 15000 x 71.

9 HATS Features [H1, H2, ..., H71]
- First stage: one MLP per critical band (E1 ... E19), each 51 x 60 x 71.
- Merger MLP: (60*19) x 8000 x 71.

10 Phone-81: Diphthongs for BC
- Add diphthongs (4x4=16) for fast speech and to model longer triphone context.
- Maintain unique syllabification.
- Syllable-ending W and Y are no longer needed.

Example      Phone-72   Phone-81
要 /yao4/    a4 W       aw4
北 /bei3/    E3 Y       ey3
有 /you3/    o3 W       ow3
爱 /ai4/     a4 Y       ay4

11 Phone-81: Frequent Neutral Tones for BC
- Neutral tones are more common in conversation.
- Neutral tones were previously not modeled; the 3rd tone was used as a replacement.
- Add 3 neutral tones for frequent characters.

Example      Phone-72   Phone-81
了 /e5/      e3         e5
吗 /ma5/     a3         a5
子 /zi5/     i3         i5

12 Phone-81: Special CI Phones for BC
- Filled pauses (hmm, ah) are common in BC; add two CI phones for them.
- Add CI /V/ for English.

Example      Phone-72   Phone-81
victory      w          V
呃 /ah/      o3         fp_o
嗯 /hmm/     e3 N       fp_en

13 Phone-81: Simplification of Other Phones
- The additions bring the total to 92 phones: too many triphones to model.
- Merge similar phones to reduce the number of triphones. I2 was formerly modeled by I1; now by i2.
- 92 - (4x3 - 1) = 81 phones.

Example      Phone-72   Phone-81
安 /an1/     A1 N       a1 N
词 /ci2/     I1         i2
池 /chi2/    IH2        i2
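The Phone-72 to Phone-81 revisions above amount to a symbol mapping. A tiny sketch covering only the examples shown on these slides (the full 81-phone inventory is not listed here, so the tables below are deliberately partial):

```python
# Merge rules from slide 13 (examples only).
MERGE = {
    "I1": "i2",    # 词 /ci2/: I1 merged into i2
    "IH2": "i2",   # 池 /chi2/: IH2 merged into i2
}
# Diphthong rules from slide 10: vowel + syllable-ending glide -> diphthong.
DIPHTHONG = {
    "a4 W": "aw4", "E3 Y": "ey3", "o3 W": "ow3", "a4 Y": "ay4",
}

def map_phones(seq):
    """Apply the diphthong and merge rules to a Phone-72 sequence."""
    joined = " ".join(seq)
    for old, new in DIPHTHONG.items():
        joined = joined.replace(old, new)
    return [MERGE.get(p, p) for p in joined.split()]

print(map_phones(["y", "a4", "W"]))   # ['y', 'aw4']
print(map_phones(["I1"]))             # ['i2']
```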

14 PLP Models with fMPE Transform
- PLP model with fMPE transform, to compete with the MLP model.
- Smaller ML-trained Gaussian posterior model: 3500x32, CW+SAT.
- 5 neighboring frames of Gaussian posteriors.
- M is 42 x (3500*32*5); h is (3500*32*5) x 1.
- Ref: Zheng, ICASSP 2007.
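The fMPE transform above adds a learned projection of stacked Gaussian posteriors to each raw feature vector, y = x + M h. A toy-dimension sketch (the real sizes are feat_dim=42 and n_gauss*context = 3500*32*5, and M is trained discriminatively under MPE rather than left at zero):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only.
feat_dim, n_gauss, context = 4, 8, 5

M = np.zeros((feat_dim, n_gauss * context))   # would be MPE-trained in practice

def fmpe_transform(x, posteriors):
    """fMPE: y = x + M h, where h stacks the Gaussian posterior
    vectors of `context` neighboring frames into one long vector."""
    h = posteriors.reshape(-1)                # (n_gauss * context,)
    return x + M @ h

x = rng.standard_normal(feat_dim)
post = rng.dirichlet(np.ones(n_gauss), size=context)  # one posterior per frame
y = fmpe_transform(x, post)
print(np.allclose(y, x))  # True: M is still at its zero initialization
```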

15 Topic-Based LM Adaptation
- A Latent Dirichlet Allocation topic model infers topic weights θ from {w | w in the same story}, using a 4 s window around each sentence.
- The 4 s window makes adaptation more robust against ASR errors.
- The words {w} are weighted based on distance.

16 Topic-Based LM Adaptation
- Training: one topic per sentence; train 64 topic-dependent 4-grams LM1, LM2, ..., LM64.
- Decoding: top n topics per sentence, where θi' > threshold.

17 Improved Acoustic Segmentation
Pruned trigram, SI non-CW MLP MPE, on eval06:

Segmenter   Sub   Del   Ins   Total
OLD
NEW
Oracle

18 Different Phone Sets
Pruned trigram, SI non-CW PLP ML, on dev07:

           BN    BC    Avg
Phone-72
Phone-81

Indeed different error behaviors --- good for system combination.

19 Decoding Architecture
- First pass: MLP non-CW with qLM3.
- Adapt/rescore: PLP CW+SAT+fMPE (MLLR, LM3) and MLP CW+SAT (MLLR, LM3), then qLM4.
- Confusion network combination, including the Aachen system.

20 Topic-Based LM Adaptation (NTU)
- Training, per sentence:
  - 64 topics: θ = (θ1, θ2, ..., θm).
  - Topic(sentence) = k = argmax {θ1, θ2, ..., θm}.
  - Train 64 topic-dependent (TD) 4-grams.
- Testing, per utterance:
  - {w}: N-best confidence-based weighting + distance weighting.
  - Pick all TD 4-grams whose θi is above a threshold.
  - Interpolate with the topic-independent 4-gram.
  - Rescore the N-best list.
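The decoding-time interpolation described above can be sketched as follows. LMs are abstracted here as callables returning P(word | history); the interpolation weight `lam` and the renormalization of the active topic posteriors are assumptions not specified on the slide.

```python
import math

def adapted_logprob(word, history, topic_lms, generic_lm,
                    topic_post, threshold, lam=0.5):
    """Interpolate topic-dependent LMs whose posterior exceeds the
    threshold with the topic-independent LM, then return a log-prob."""
    active = {k: p for k, p in topic_post.items() if p > threshold}
    if not active:
        return math.log(generic_lm(word, history))
    z = sum(active.values())                      # renormalize active posteriors
    p_topic = sum((p / z) * topic_lms[k](word, history)
                  for k, p in active.items())
    p = lam * p_topic + (1 - lam) * generic_lm(word, history)
    return math.log(p)

# Toy LMs: a generic LM giving 0.1 and one topic LM giving 0.3.
uniform = lambda w, h: 0.1
topical = {7: (lambda w, h: 0.3)}
lp = adapted_logprob("股票", (), topical, uniform, {7: 0.9}, threshold=0.5)
print(round(math.exp(lp), 3))  # 0.2  (= 0.5*0.3 + 0.5*0.1)
```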

21 CERs with Different LMs (internal use)

AM (adapt. hyps)   PLP (MLP)   MLP (PLP)   MLP (Aachen)   PLP (Aachen)   Rover
qLM3
LM3
Adapted qLM4

22 Topic-Based LM Adaptation (NTU)

AM (adapt. hyps)   PLP (MLP)   MLP (PLP)   MLP (Aachen)   PLP (Aachen)   CNC Rover
Adapted qLM4

- "q" stands for "quick", i.e. tightly pruned.
- Oracle CNC: 4.7%. Could it be a broken word sequence? Need to verify with word perplexity and HTER.

23 CER on Eval07

ASR System     Sub   Del   Ins   Total
2006 system
2007 system

37% relative improvement!!

24 Eval07 BN ASR Error Distribution
[Histogram: % snippets vs. CER (%); 66 BN snippets, average CER 3.4%]

25 Eval07 BC ASR Error Distribution
[Histogram: % snippets vs. CER (%); 53 BC snippets, average CER 15.9%]

26 What Worked for Mandarin ASR?
- MLP features
- MPE
- CW+SAT
- fMPE
- Improved acoustic segmentation, particularly for deletion errors
- CNC Rover

27 Small Help for ASR
- Topic-dependent LM adaptation.
- Outside regions as additional AM adaptation data.
- A new phone set with diphthongs, offering different error behaviors.
- Pitch input in Tandem features.
- Cross adaptation with Aachen.
- Successful collaboration among 5 team members from 3 continents.

28 Error Analysis on Extreme Cases

Snippet        Dur   CER     HTER
a) Worst BN    87s   10.9%   47.73%
b) Worst BC    72s   24.9%   48.37%
c) Best BN     62s   0%      12.67%
d) Best BC     77s   15.2%   14.20%

- CER is not directly related to HTER; genre matters.
- Still, better CER does ease MT.

29 Error Analysis
- (a) Worst BN: OOV names.
- (b) Worst BC: overlapped speech.
- (c) Best BN: composite sentences.
- (d) Best BC: simple sentences with disfluency and restarts.

30 Error Analysis
- OOV words (especially names) are problematic for both ASR and MT:
  徐昌霖 / 徐成民 / 徐长明 (Xu, Chang-Lin)
  黄竹琴 / 黄朱琴 / 黄朱勤 / 皇猪禽 / 黄朱其 (Huang, Zhu-Qin)
- Overlapped speech: what to do?
- Content-word misrecognition (not all errors are equal!), e.g. 升值 (increase in value) vs. 甚至 (even). Parsing scores?

31 Error Analysis
- MT BN high errors: composite syntactic structure; syntactic parsing would be useful.
- MT BC high errors: overlapped speech; high ASR errors due to disfluency.
- Conjecture: MT on perfect BC ASR output would be easy, thanks to its simple/short sentence structure.

32 Next ASR: Chinese OOV Org Names
- Semi-automatic abbreviation generation for long words:
  - Segment a long word into a sequence of shorter words.
  - Extract the 1st character of each shorter word: World Health Organization -> WHO.
  - (Make sure the abbreviations are in the MT translation table, too.)
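The abbreviation scheme above can be sketched in a few lines; `str.split` stands in for the real Chinese word segmenter, which would be swapped in for Chinese organization names.

```python
def abbreviate(long_word, segmenter):
    """Segment a long name into shorter words and keep each word's
    first character: 'World Health Organization' -> 'WHO'."""
    return "".join(w[0] for w in segmenter(long_word))

# Whitespace splitting works for the English example; a Chinese word
# segmenter would play the same role for 世界卫生组织 etc.
print(abbreviate("World Health Organization", str.split))  # WHO
```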

33 Next ASR: Chinese OOV Person Names
- Mandarin has a high rate of homophones: 408 syllables vs. ~6000 common characters, i.e. about 14 homophone characters per syllable!!
- Given a spoken Chinese OOV name, there is no way to be sure which characters to use. But MT doesn't care anyway, as long as the syllables are correct!!
- Recognize repetitions of the same name within a snippet via CNC at the syllable level:
  Xu -> {Chang, Cheng} -> {Lin, Min, Ming}
  Huang -> Zhu -> {Qin, Qi}
- After syllable CNC, apply the same name to all occurrences in Pinyin.
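The syllable-level consistency idea above (vote across occurrences of the same spoken name, then apply the winner everywhere) can be sketched as follows. This is a simplification of true confusion-network combination, and the pinyin strings are illustrative.

```python
from collections import Counter

def consistent_name(occurrences):
    """Majority-vote per syllable position across all hypothesized
    occurrences of the same name, returning one winning syllable
    sequence to substitute back at every occurrence."""
    winner = []
    for position in zip(*occurrences):            # iterate syllable positions
        winner.append(Counter(position).most_common(1)[0][0])
    return winner

# Three hypothesized decodings of the same spoken name in one snippet.
occ = [["xu2", "chang1", "lin2"],
       ["xu2", "cheng2", "lin2"],
       ["xu2", "chang1", "min2"]]
print(consistent_name(occ))  # ['xu2', 'chang1', 'lin2']
```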

34 Next ASR: English OOV Names
- English spelling in the lexicon, with (multiple) Mandarin pronunciations:
  Bush /bu4 shi2/ or /bu4 xi1/
  Bin Laden /ben1 la1 deng1/ or /ben1 la1 dan1/
  John /yue1 han4/
  Sadr /sa4 de2 er3/
- Name mapping from MT?
- Need to run name tagging on the training text (Yang Liu), convert Chinese names to English spelling, and retrain the n-gram.

35 Next ASR: LM
- LM adaptation with finer topics, each with a small vocabulary.
- Spontaneous speech: should the n-gram back off to content words in search or on N-best lists? Text parsing modeling?
  我想那 (也)(也) 也是 -> 我想那也是
  I think it, (too), (too), is, too. -> I think it is, too.
- If optimizing CER, the stm needs to be designed such that disfluencies are optionally deletable.

36 Next ASR: AM
- Add explicit tone modeling (Lei07):
  - prosodic information: duration and pitch contour at the word level,
  - various backoff schemes for infrequent words.
- Better understand why outside regions do not help AM adaptation:
  - add an SD MLLR regression tree (Mandal06),
  - improve automatic speaker clustering (smaller clusters, better performance).

37 ASR & MT Integration
- Do we need to merge lexicons? ASR <= MT.
- Do we need to use the same word segmenter?
- Is word- or character-level CNC output better for MT?
- Open questions and feedback!!!