Building A Highly Accurate Mandarin Speech Recognizer

Building A Highly Accurate Mandarin Speech Recognizer
Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf
Dustin: confidence measures using SVM

Outline

Mandarin-specific modules:
- Word segmentation.
- Tonal phonetic pronunciations.
- Pronunciation look-up tools.
- Linguistic questions for CART state clustering.
- Pitch features.
- Mandarin-optimized acoustic segmenter.

Outline

Language-independent techniques:
- MPE training.
- fMPE feature transform.
- MLP feature front end.
- System combination.
- Jan-08 system.
- Future work.

Word segmentation and lexicon

- Started from the BBN 64K lexicon (originally from the LDC 44K lexicon): /g/ssli/data/mandarin-bn/external-sites/
- Added 20K new entries (especially names) from various sources.
- First pass: longest-first match (LFM) word segmentation.
- Selected the most frequent 60K words as our decoding lexicon:
  - UW ∩ BBN = 46.8K
  - UW \ BBN = 13.6K (阿扁, 马英九)
  - BBN \ UW = 17.3K (狼狈为奸, 心慌意乱, 北京烤鸭)
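The longest-first (greedy maximum) match pass can be sketched as follows. This is an illustrative sketch, not the actual UW tool: the function name is hypothetical and the lexicon is a tiny subset chosen to reproduce the slide's example.

```python
# Hypothetical sketch of longest-first match (LFM) word segmentation.
def lfm_segment(text, lexicon, max_word_len=8):
    """Greedily match the longest lexicon entry at each position;
    fall back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Illustrative lexicon containing the entry 从中, which triggers the
# greedy mis-segmentation shown on the next slide.
lexicon = {"记者", "从中", "从", "中国", "国家计划委员会", "有关部门", "获悉"}
print("-".join(lfm_segment("记者从中国国家计划委员会有关部门获悉", lexicon)))
# → 记者-从中-国-国家计划委员会-有关部门-获悉
```

Because LFM commits to the longest match at each position, it picks 从中 and strands 国 as a singleton; the ML re-segmentation pass described next recovers 从-中国 instead.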

Word segmentation and lexicon

- Train a 3-gram LM; treat OOV = @reject@ = garbage.
- Second pass: re-segment the training text with maximum-likelihood (ML) word segmentation:
  /homes/mhwang/src/ngramseg/wseg/ngram -order 1 -lm <DARPA n-gram>
- The output depends on (1) the algorithm and (2) the lexicon:
  记者-从中-国-国家计划委员会-有关部门-获悉 (LFM)
  记者-从-中国-国家计划委员会-有关部门-获悉 (ML)
- Re-train 3-gram, 4-gram, and 5-gram LMs: very minor perplexity improvement.
- Character accuracy improved from 74.42% (LFM) to 75.01% (ML), measured by NTU.

Lexicon and Perplexity

1.2B words of training text. qLMn: quick (highly pruned) n-gram.

LM     #bigrams  #trigrams  #4-grams  Dev07-IV perplexity
LM3    58M       108M       ---       325.7
qLM3   6M        3M         ---       379.8
LM4    316M      201M       ?         297.8
qLM4   19M       24M        ?         331.2

Two Tonal Phone Sets

- 70 tonal phones from BBN originally, using the IBM main-vowel idea: split each Mandarin final into vowel + coda to increase parameter sharing.
  bang /b a NG/, ban /b a N/
- {n,N}, {y,Y}, {w,W} for unique syllabification.
- silence for pauses, and rej for noises/garbage/foreign speech.
- Introducing diphthongs and neutral tones for BC → 79 tonal phones.

Phone-81: Diphthongs for BC

- Add diphthongs (4x4=16) for fast speech and for modeling longer triphone context.
- Maintain unique syllabification: syllable-ending W and Y are not needed anymore.

Example     Phone-72  Phone-81
要 /yao4/   a4 W      aw4
北 /bei3/   E3 Y      ey3
有 /you3/   o3 W      ow3
爱 /ai4/    a4 Y      ay4

Phone-81: Frequent Neutral Tones

- Neutral tones are more common in conversation.
- Neutral tones were not modeled; the 3rd tone was used as a replacement.
- Add 3 neutral tones for frequent characters.

Example    Phone-72  Phone-81
了 /e5/    e3        e5
吗 /ma5/   a3        a5
子 /zi5/   i3        i5

Phone-81: Special CI Phones

- Filled pauses (hmm, ah) are common in BC; add two CI phones for them.
- Add CI /V/ for English.

Example      Phone-72  Phone-81
victory      w         V
呃 /ah/      o3        fp_o
嗯 /hmm/     e3 N      fp_en

Phone-81: Simplification of Other Phones

- Now 72+14+3+3 = 92 phones: too many triphones to model.
- Merge similar phones to reduce #triphones.
- I2 was modeled by I1; now merged into i2.
- 92 − (4×3−1) = 81 phones.

Example     Phone-72  Phone-81
安 /an1/    A1 N      a1 N
词 /ci2/    I1        i2
池 /chi2/   IH2       i2

Different Phone Sets

CER (%) with a pruned trigram, SI non-CW PLP ML models, on dev07:

Phone set  BN   BC    Avg
Phone-81   7.6  27.3  18.9
Phone-72   7.4  27.6  19.0

Indeed different error behaviors --- good for system combination.

Pronunciation Look-up Tools

SRC=/g/ssli/data/mandarin-bn/scripts/pron
- $SRC/wlookup.pl: look up pronunciations from a word dictionary, for Chinese and/or English words.
- $SRC/eng2bbn.pl: look up English word pronunciations in the Mandarin phone set.
- $SRC/standarnd-all.sc: P72 single-character lexicon; first pronunciation = most common.
- $SRC/sc2bbn.pl: look up Chinese word pronunciations from individual characters.
- $SRC/pconvert.pl: convert a dictionary from one phone set to another.
- $SRC/RWTH/: RWTH-70 phone set (3rd phone set).

Pitch Features

- get_f0 computes pitch for voiced segments.
- Pass to graphtrack to reduce the pitch halving/doubling problem.
- Spline interpolation for unvoiced regions.
- Log, delta, delta-delta.

Feature   CER
MFCC      24.1%
MFCC+F0   21.4%
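A minimal sketch of the interpolation and derivative steps, assuming per-frame F0 in Hz with 0 marking unvoiced frames (the get_f0/graphtrack tools themselves are not reproduced; the function name is hypothetical):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def pitch_features(f0):
    """Spline-interpolate unvoiced regions of an F0 track, take the log,
    and append delta / delta-delta (here approximated with np.gradient).
    Assumes at least a few voiced frames (f0 > 0)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = np.nonzero(f0 > 0)[0]
    spline = CubicSpline(voiced, f0[voiced])   # fit on voiced frames only
    filled = f0.copy()
    unvoiced = np.nonzero(f0 <= 0)[0]
    filled[unvoiced] = spline(unvoiced)        # fill unvoiced gaps
    logf0 = np.log(np.maximum(filled, 1e-3))   # guard against non-positive values
    d = np.gradient(logf0)                     # delta
    dd = np.gradient(d)                        # delta-delta
    return np.stack([logf0, d, dd], axis=1)    # (T, 3) pitch feature matrix
```

The resulting 3-dim stream is what gets appended to the MFCC (or PLP) features in the tables above.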

Acoustic segmentation

- The former segmenter, inherited from the English system, caused high deletion errors: it mis-classified some speech segments as noise.
- Minimum speech-segment duration: 18 × 30 ms = 540 ms ≈ 0.5 s.

Vocabulary  Pronunciation
speech      18+ fg
noise       rej
silence     bg

[HMM topology diagram: Start (null) → {speech, silence, noise} → End]

New Acoustic Segmenter

- Allow shorter speech durations.
- Model Mandarin vs. foreign (English) speech separately.

Vocabulary  Pronunciation
Mandarin1   I1 F
Mandarin2   I2 F
Foreign     forgn
Noise       rej
Silence     bg

[HMM topology diagram: Start (null) → {Mandarin1, Mandarin2, Foreign, silence, noise} → End]

Improved Acoustic Segmentation

CER (%) with a pruned trigram, SI non-CW MLP MPE models, on Eval06:

Segmenter  Sub  Del  Ins  Total
OLD        9.7  7.0  1.9  18.6
NEW        9.9  6.4  2.0  18.3

Language-Independent Technologies

WER for combinations of MLP features, MPE training, and fMPE:

MLP  MPE  fMPE  WER
-    -    -     17.1%
+    -    -     15.3%
-    +    -     14.6%
-    -    +     15.6%
+    +    -     13.4%
+    -    +     14.7%
-    +    +     13.9%
+    +    +     13.1%

Two Sets of Acoustic Models

MLP-model (P72):
- MFCC + pitch + MLP (32-dim) = 74 dims.
- CW triphones with SD SAT feature transform.
- MPE trained.

PLP-model (P81):
- PLP + pitch = 42 dims.
- Followed by an fMPE SI feature transform.

MLP Phoneme Posterior Features

- One MLP computes Tandem features from pitch+PLP input; 71 output units.
- 20 MLPs compute HATs features from 19 critical bands; 71 output units.
- Combine the Tandem and HATs posterior vectors into one 71-dim vector, valued in [0..1].
- PCA(Log(71)) → 32.
- MFCC + pitch + MLP = 74 dims.
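The combination and dimensionality-reduction steps could look roughly like the sketch below. The averaging rule is an assumption (the slide only says the two posterior streams are combined into one 71-dim vector), and PCA is done here via a plain SVD:

```python
import numpy as np

def mlp_feature_projection(tandem_post, hats_post, out_dim=32):
    """Illustrative sketch: merge Tandem and HATs 71-dim posterior streams
    (averaging is an assumed combination rule), take the log, then reduce
    to out_dim with PCA computed from an SVD of the centered log-posteriors."""
    post = 0.5 * (tandem_post + hats_post)       # (T, 71), values in [0, 1]
    logp = np.log(np.maximum(post, 1e-8))        # avoid log(0)
    centered = logp - logp.mean(axis=0)          # center before PCA
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dim].T             # (T, out_dim) MLP features
```

The 32-dim output is then appended to MFCC + pitch to form the 74-dim front end.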

Tandem Features [T1, T2, …, T71]

- Input: 9 frames of PLP+pitch: PLP (39×9) + pitch (3×9).
- MLP size: (42×9) × 15000 × 71.

MLP and Pitch Features

CER (%) with non-CW ML models, Hub4 training, MLLR, LM2, on Eval04:

HMM Feature                MLP Input       CER
MFCC (39-dim)              none            24.1
MFCC+F0 (42-dim)           none            21.4
MFCC+F0+Tandem (74-dim)    PLP (39×9)      20.3
MFCC+F0+Tandem (74-dim)    PLP+F0 (42×9)   19.7

HATS Features [H1, H2, …, H71]

- 19 band-level MLPs (E1 … E19), one per critical band: 51 × 60 × 71 each.
- Merger MLP over the bands' hidden activations: (60×19) × 8000 × 71.

PLP Models with fMPE Transform

- PLP model with an fMPE transform, to compete with the MLP model.
- Smaller ML-trained Gaussian posterior model: 3500 × 32, CW+SAT.
- 5 neighboring frames of Gaussian posteriors.
- M is 42 × (3500×32×5); h_t is (3500×32×5) × 1.
- Ref: Zheng, ICASSP 2007 paper.
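The fMPE transform adds a discriminatively trained projection of high-dimensional Gaussian posteriors to each frame's base features; with the dimensions on this slide:

```latex
y_t = x_t + M h_t, \qquad
x_t \in \mathbb{R}^{42}, \quad
M \in \mathbb{R}^{42 \times (3500 \cdot 32 \cdot 5)}, \quad
h_t \in \mathbb{R}^{(3500 \cdot 32 \cdot 5) \times 1}
```

Here x_t is the 42-dim PLP+pitch feature at frame t, h_t stacks the 3500×32 Gaussian posteriors over 5 neighboring frames, and M is trained with the MPE criterion.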

Eval07: June 2007

Team      CER
UW        9.1%
RWTH      12.1%
UW+RWTH   8.9%
CU+BBN    9.4%
IBM+CMU   9.8%

Jan 08: RWTH Improvements

- Using the RWTH-70 phone set, converted from the UW dictionary.
- Using UW-ICSI MLP features.

CER on Dev07:

System                   CER
UW, June 2007 (auto AS)  11.2%
RWTH (MLP), Jan08        9.9%
UW-1 (MLP), Jan08        9.8%

Jan-2008: Decoding Architecture

- Manual acoustic segmentation:
  - Removing sub-segments.
  - Removing the ending of the first utterance when partially overlapped.
- Gender ID per utterance.
- Auto speaker clustering per gender.
- VTLN per speaker.
- CMN/CVN per utterance.

Jan-2008 Decoding Architecture

First pass: SI MLP non-CW decoding with qLM3. Adapted rescoring passes:
- PLP-SA: PLP CW SAT+fMPE, MLLR, LM3.
- MLP-SA: MLP CW SAT, MLLR, LM3.
- Aachen (RWTH) system.
Final outputs merged by confusion network combination.

Re-Test: Jan 2008

On Dev07:
- PLP-SA-1: 10.2%
- PLP-SA-2: 9.9% (very competitive with the MLP model after adaptation)
- MLP-SA-2: 9.8%
- {PLP-SA-2, MLP-SA-2}: 9.5%
- RWTH: 9.9% (more substitution errors, fewer deletion errors)
- {RWTH, PLP-SA-2, MLP-SA-2}: 9.2%

Eval07 retest: 8.1% → 7.3%.

Future Work

- Putting all words together, re-do word segmentation and re-select the decoding lexicon.
- Automatically create new words using pointwise mutual information: PMI(w1, w2) = log [ P(w1, w2) / (P(w1) P(w2)) ].
- LM adaptation:
  - Finer topics.
  - Names (has to coordinate with MT/NE).
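As a hypothetical sketch of the PMI idea: score adjacent tokens in segmented text by PMI and propose high-scoring, frequent pairs as candidate new words. Function name, thresholds, and the adjacent-pair restriction are all illustrative choices, not the actual planned tool:

```python
import math
from collections import Counter

def pmi_new_words(tokens, min_count=2, threshold=3.0):
    """Score adjacent token pairs by
    PMI(w1, w2) = log [ P(w1, w2) / (P(w1) P(w2)) ]
    and return frequent, high-PMI pairs as candidate new words."""
    n = len(tokens)
    uni = Counter(tokens)                      # unigram counts
    bi = Counter(zip(tokens, tokens[1:]))      # adjacent-pair counts
    candidates = []
    for (w1, w2), c in bi.items():
        if c < min_count:
            continue
        pmi = math.log((c / (n - 1)) / ((uni[w1] / n) * (uni[w2] / n)))
        if pmi >= threshold:
            candidates.append((w1 + w2, pmi))
    return sorted(candidates, key=lambda x: -x[1])
```

Pairs that always co-occur (e.g. the characters of a name like 马英九) get high PMI, while pairs that merely happen to be adjacent once score low and are filtered out.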