Managing Ambiguity, Gaps, and Errors in Spoken Language Processing
Gina-Anne Levow
May 14, 2009

Research Background & Interests
- Dialogue
  - Spoken and multimodal dialogue systems
  - Recognizing spoken corrections (Levow 98, 99, 01, 04)
  - Predicting turn-taking behavior (Levow 05)
- Recognizing prominence and tone (Levow 05, 06, 08; WL07, DL08)
- Topic and story segmentation (Levow 04+; Matveeva & Levow 07)
  - Prosodic and lexical evidence
- Search and retrieval (Meng et al 01; Levow 03, 07; Levow et al 05)
  - Focus on speech, cross-language
  - News, conversation/interview data

Roadmap
- Motivation
- Tone and Pitch Accent
  - Challenges: contextual variation & training demands
  - Data collections and processing
  - Modeling context for tone and pitch accent
  - Aside: novel features
  - Reducing training demands: semi- and unsupervised learning
- Cross-language Spoken Document Retrieval
  - Challenges: ambiguity, gaps, and errors
  - Retrieval setting
  - Multi-scale processing, phrases to subwords: translation, indexing, and expansion
- Conclusion

Challenges
- Growing opportunities for speech applications
  - Explosive growth in audio data
  - Ubiquitous mobile devices
- Systems rely on supporting resources and technologies, which may be
  - limited: labeled training data for the task, linguistic resources
  - or noisy: speech recognition, machine translation
- Successful systems must remain effective: employ techniques to mitigate the impact of ambiguity, gaps, and errors

Challenges: Context
- Tone and pitch accent recognition: a key component of language understanding
  - Lexical tone carries word meaning
  - Pitch accent carries semantic, pragmatic, and discourse meaning
- Non-canonical forms (Shen 90, Shih 00, Xu 01)
  - Tonal coarticulation modifies surface realization; in extreme cases, a fall becomes a rise
- Tone is relative
  - To speaker range: high for a male speaker may be low for a female speaker
  - To phrase range and to other tones, e.g. downstep

Challenges: Training Demands
- Tone and pitch accent recognition exploits data-intensive machine learning
  - SVMs (Thubthong 01, Levow 05, SLX05)
  - Boosted and bagged decision trees (X. Sun 02)
  - HMMs (Wang & Seneff 00, Zhou et al 04, Hasegawa-Johnson et al 04, ...)
- Can achieve good results with huge sample sets
  - SLX05: ~10K labeled syllable samples -> >90% accuracy
- Training data is expensive to acquire
  - Time: pitch accent labeling takes tens of times real time
  - Money: it requires skilled labelers
- This limits investigation across domains, styles, etc.
- Human language acquisition doesn't use labels

Strategy: Overall
- Common model across languages
  - Common machine learning classifiers
  - Acoustic-prosodic model
    - No word label, POS, or lexical stress information
    - No explicit tone label sequence model
- Languages: English, Mandarin Chinese (also isiZulu, Cantonese)

Strategy: Context
- Exploit contextual information
  - Features from adjacent syllables: height and shape, both direct and relative
  - Compensate for the phrase contour
- Analyze the impact of context position and context encoding
- Up to 24% reduction in error over no context

English Data Collection
- Boston University Radio News Corpus (Ostendorf et al 95)
  - Manually ToBI annotated, aligned, syllabified
- Pitch accent aligned to syllables
  - Unaccented, High, Downstepped High, Low (Sun 02; Ross & Ostendorf 95)
  - Unaccented vs. Accented

Mandarin Data Collections
- Lab speech data (Xu 1999)
  - 5-syllable utterances varying tone and focus position
  - In-focus, pre-focus, post-focus
- TDT2 Voice of America Mandarin Broadcast News
  - Automatically force-aligned to anchor scripts
  - Automatically segmented; pinyin pronunciation lexicon
  - Manually constructed pinyin-to-ARPABET mapping
  - CU Sonic recognizer, language porting
- Tones: High, Mid-rising, Low, High-falling (, Neutral)

Local Feature Extraction
- Uniform representation for tone and pitch accent
- Motivated by the Pitch Target Approximation Model
  - The tone/pitch accent target is exponentially approached
  - Linear target: height, slope (Xu et al 99)
- Base features (see the sketch below):
  - Pitch and intensity: max, mean, min, range (extracted with Praat; speaker normalized)
  - Pitch at 5 points across the voiced region
  - Duration
  - Initial/final position in phrase
  - Slope: linear fit to the last half of the pitch contour
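
A minimal sketch of this feature inventory, assuming the f0 and intensity contours have already been extracted (e.g. with Praat) and that speaker normalization is a z-score; the function name and argument layout are illustrative, not the talk's actual code:

```python
import numpy as np

def syllable_features(f0, intensity, duration, spk_mean, spk_std):
    """Base features for one syllable, following the slide's inventory.

    f0 and intensity are 1-D arrays sampled over the syllable's voiced
    region; pitch is z-score normalized by per-speaker statistics.
    """
    f0 = (f0 - spk_mean) / spk_std                        # speaker normalization
    feats = [f0.max(), f0.mean(), f0.min(), f0.max() - f0.min(),
             intensity.max(), intensity.mean(),
             intensity.min(), intensity.max() - intensity.min()]
    idx = np.linspace(0, len(f0) - 1, 5).astype(int)      # 5 points across region
    feats.extend(f0[idx])
    feats.append(duration)
    half = f0[len(f0) // 2:]                              # last half of the contour
    feats.append(np.polyfit(np.arange(len(half)), half, 1)[0])  # fitted slope
    return np.array(feats)
```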

Context Features
- Local context
  - Extended features: adjacent points of the preceding and following syllables
  - Difference features: differences in pitch max, mean, mid, and slope, and in intensity max and mean, between the preceding/following syllable and the current syllable
- Phrasal context
  - Compute the collection-average phrase slope
  - Compute scalar pitch values adjusted for that slope
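
A sketch of the difference-feature encoding; the dictionary keys are illustrative names for the quantities listed on the slide, not the talk's actual feature names:

```python
def difference_features(prev, cur, nxt):
    """Difference features between adjacent syllables (a sketch).

    prev, cur, nxt: dicts of per-syllable measurements.
    """
    keys = ["pitch_max", "pitch_mean", "pitch_mid", "pitch_slope",
            "intensity_max", "intensity_mean"]
    pre = [cur[k] - prev[k] for k in keys]    # preceding-syllable differences
    post = [nxt[k] - cur[k] for k in keys]    # following-syllable differences
    return pre + post
```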

Supervised Classification
- Classifier: support vector machine
  - Linear kernel; multiclass formulation
  - SVMlight (Joachims), LibSVM (Chang & Lin 01)
  - 4:1 training/test splits
- Experiments examine the effects of
  - Context position: preceding, following, none, both
  - Context encoding: extended vs. difference features
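
A minimal sketch of this setup, with scikit-learn standing in for SVMlight/LibSVM and synthetic data standing in for the real syllable features and tone labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24))          # hypothetical feature matrix
y = rng.integers(0, 4, size=1000)        # hypothetical 4-way tone labels
# 4:1 training/test split, as on the slide
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="linear")               # linear kernel; one-vs-one multiclass
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```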

Results: Local Context

Context          Mandarin Tone   English Pitch Accent
Full             74.5%           81.3%
Extend PrePost   74%             80.7%
Extend Pre       74%             79.9%
Extend Post      70.5%           76.7%
Diffs PrePost    75.5%           80.7%
Diffs Pre        76.5%           79.5%
Diffs Post       69%             77.3%
Both Pre         76.5%           79.7%
Both Post        71.5%           77.6%
No context       68.5%           75.9%

Context: Summary
- Employ a common acoustic representation for tone (Mandarin) and pitch accent (English)
- SVM classifiers with a linear kernel: 76% (tone), 81% (pitch accent)
- Local context effects:
  - Up to >20% relative reduction in error
  - Preceding context makes the greatest contribution (carryover vs. anticipatory)
  - Consistent with phonetic analysis (Xu) showing that carryover coarticulation is greater than anticipatory

Aside: Voice Quality & Energy (with Dinoj Surendran)
- Assess local voice quality and energy features for tone, which are not typically associated with Mandarin tones
- Considered: voice quality (NAQ, AQ, etc.), spectral balance, spectral tilt, band energy
- Useful: band energy significantly improves Mandarin neutral tone
  - Supports identification of unstressed syllables
  - Spectral balance predicts stress in Dutch

Roadmap
- Motivation
- Tone and Pitch Accent
  - Challenges: contextual variation & training demands
  - Data collections and processing
  - Modeling context for tone and pitch accent
  - Aside: novel features
  - Reducing training demands: semi- and unsupervised learning
- Cross-language Spoken Document Retrieval
  - Challenges: ambiguity, gaps, and errors
  - Retrieval setting
  - Multi-scale processing, phrases to subwords: translation, indexing, and expansion
- Conclusion

Strategy: Training
- Challenge: can we use the underlying acoustic structure of the language, through unlabeled examples, to reduce the need for expensive labeled training data?
- Exploit semi-supervised and unsupervised learning
  - Semi-supervised Laplacian SVM
  - K-means and spectral clustering
- Both substantially outperform baselines and can approach supervised levels

Semi-supervised Learning
- Approach: employ a small amount of labeled data; exploit information from additional, presumably more available, unlabeled data
  - Few prior examples: self-training (Ostendorf TR)
- Classifier: Laplacian SVM (Sindhwani, Belkin & Niyogi 05)
  - A semi-supervised variant of the SVM that exploits unlabeled examples
  - RBF kernel, typically 6 nearest neighbors, transductive
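
scikit-learn has no Laplacian SVM, so the sketch below uses LabelSpreading, an analogous graph-based semi-supervised learner, on synthetic stand-in data; unlabeled points are marked -1 as the library requires:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
y_semi = y.copy()
rng = np.random.default_rng(0)
y_semi[rng.random(len(y)) > 0.3] = -1    # hide ~70% of labels; -1 = unlabeled
model = LabelSpreading(kernel="knn", n_neighbors=6)  # 6-NN graph, as on the slide
model.fit(X, y_semi)
mask = y_semi == -1
print("accuracy on unlabeled points:",
      (model.transduction_[mask] == y[mask]).mean())
```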

Experiments
- Pitch accent recognition: binary classification (Unaccented/Accented)
  - 1000 instances, proportionally sampled; labeled training: 200 unaccented, 100 accented
  - 80% accuracy (cf. 84% for a supervised SVM with 15x the labeled data)
- Mandarin tone recognition: 4-way classification via n(n-1)/2 binary classifiers
  - 400 instances, balanced; 160 labeled
  - Clean lab speech, in-focus: 94% (cf. 99% for an SVM trained on 1000s of samples; 85% for an SVM with 160 training samples)
  - Broadcast news: 70% (cf. <50% for an SVM with 160 training samples)

Pitch Accent Learning Curves

Unsupervised Learning
- Question: can we identify the tone structure of a language from the acoustic space without training? Analogous to language acquisition
- Significant recent research in unsupervised clustering
  - Established approaches: k-means; spectral clustering (Shi & Malik 97; Fischer & Poland 2004: asymmetric k-lines)
- Little research for tone
  - Self-organizing maps (Gauthier et al 2005): tones identified in lab speech using f0 velocities
  - Cluster-based bootstrapping (Narayanan et al 2006)

Pitch Accent Clustering
- Clustering alternatives:
  - 3 spectral approaches, each performing spectral decomposition of an affinity matrix:
    - Asymmetric k-lines (Fischer & Poland 2004)
    - Symmetric k-lines (Fischer & Poland 2004)
    - Laplacian eigenmaps (Belkin, Niyogi & Sindhwani 2004): binary weights, k-lines clustering
  - K-means: standard Euclidean distance
- Number of clusters: 2-16; assign the most frequent class label to each cluster (see the sketch below)
- 4-way distinction: 1000 samples, proportional
- Best results: >78%
  - 2 clusters: asymmetric k-lines; >2 clusters: k-means
  - With larger numbers of clusters, all methods are similar
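
A sketch of the clustering-then-majority-label evaluation. Synthetic blobs stand in for the prosodic features, and SpectralClustering stands in for the k-lines variants, which have no standard scikit-learn implementation:

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1000, centers=4, random_state=0)
for algo in (KMeans(n_clusters=4, n_init=10, random_state=0),
             SpectralClustering(n_clusters=4, affinity="nearest_neighbors",
                                random_state=0)):
    clusters = algo.fit_predict(X)
    pred = np.empty_like(y)
    for c in np.unique(clusters):
        members = clusters == c
        labels, counts = np.unique(y[members], return_counts=True)
        pred[members] = labels[np.argmax(counts)]   # most frequent class label
    print(type(algo).__name__, "accuracy:", (pred == y).mean())
```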

Contrasting Learners

Tone Clustering: I
- Mandarin four tones: 400 samples, balanced
- 2-phase clustering: 2-5 clusters each; asymmetric k-lines and k-means clustering
- Clean read speech:
  - In-focus syllables: 87% (cf. 99% supervised)
  - In-focus and pre-focus: 77% (cf. 93% supervised)
- Broadcast news: 57% (cf. 74% supervised)
- K-means requires more clusters to reach the k-lines level

Tone Structure
- The first phase of clustering splits high/rising from low/falling tones by slope
- The second phase splits by pitch height

Discussion
- Common prosodic framework for tone and pitch accent recognition
- Contextual modeling enhances recognition
  - Local context and broad phrase contour
  - Carryover coarticulation has a larger effect for Mandarin
- Exploiting unlabeled examples for recognition
  - Semi- and unsupervised approaches; best cases approach supervised levels with less training
  - Exploits the acoustic structure of the tone and accent space

Roadmap
- Motivation
- Tone and Pitch Accent
  - Challenges: contextual variation & training demands
  - Data collections and processing
  - Modeling context for tone and pitch accent
  - Aside: novel features
  - Reducing training demands: semi- and unsupervised learning
- Cross-language Spoken Document Retrieval
  - Challenges: ambiguity, gaps, and errors
  - Retrieval setting
  - Multi-scale processing, phrases to subwords: translation, indexing, and expansion
- Conclusion

Cross-language Spoken Document Retrieval
- Explosive growth in online audio; an increasing proportion is non-English, especially Chinese
- Challenges:
  - Ambiguity: many translations; acoustic confusion
  - Gaps: out-of-vocabulary terms in recognition and translation (especially named entities)
  - Errors: misrecognition, mistranslation, missegmentation
- Solution I: subword units may help
  - Transliteration, e.g. Northern Ireland -> /bei3 ai4 er3 lan2/ (in the query)
  - Recognition and matching of subword units, e.g. Iraq -> "a rock" (in the document)
- Solution II: expansion may help
  - Clean side collections provide terms

The Answer: A Preview
- Perform word-based speech recognition: lexicon constraints greatly improve accuracy
- Perform phrase-based query translation: minimizes translation ambiguity
- Convert both to character bigrams and match: elegantly handles ambiguous term granularity
- Add evidence from proper name matching, using syllable bigrams
- Perform document expansion to enhance matching

CL-SDR Architecture
- Documents: VOA Mandarin broadcast audio (2265 stories) -> speech recognition -> Mandarin transcription -> document expansion using comparable Mandarin documents
- Queries: NYT English newswire exemplars -> balanced gloss translation with stemming backoff and χ2 feature selection
- Retrieval: Inquery 3.1p1 Mandarin retrieval system, producing a ranked list
- Evaluation: 17 topics, exhaustive relevance judgments, mean average precision (mAP)
- (Beginning from and building on the MEI project, Meng et al 01)

Document Expansion (Singhal 99)
- Run each (errorful) original document as a query against a retrieval system over comparable documents
- Select terms from the resulting ranked document list
- Add the selected terms to produce expanded documents, which then feed the downstream retrieval system
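
A minimal sketch of this loop under stated assumptions: `retrieve` and `side_index` are hypothetical stand-ins for a retrieval system over a clean comparable collection, and simple frequency counting stands in for the actual term selection method:

```python
from collections import Counter

def expand_document(transcript_terms, side_index, retrieve,
                    n_docs=5, n_terms=20):
    """Document expansion sketch (after Singhal 99, as diagrammed above)."""
    # Use the (possibly errorful) transcript as a query over the side collection
    top_docs = retrieve(side_index, transcript_terms)[:n_docs]
    counts = Counter()
    for doc in top_docs:        # each doc: a list of terms
        counts.update(doc)
    # Add the most frequent terms from the retrieved comparable documents
    expansion = [t for t, _ in counts.most_common(n_terms)]
    return transcript_terms + expansion
```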

Translation Granularity
- Ambiguity: the word "human" has 7 candidate translations and "rights" has 30, but the phrase "human rights" has just 1
- Phrases beat words
- Three phrase sources: translation lexicon, named entities, numeric expressions
(Chart of mean average precision omitted)

Multi-scale Indexing and Expansion
- Goal: allow partial matching while minimizing ambiguity
- Word segmentation: noisy and errorful in Mandarin text and speech
- Single characters: ambiguous
- Overlapping character bigrams: partial matching with semantic units (see the sketch below)
- Retrieval is best with character bigrams, which also outperform syllables
- Words improve expansion term choice
(Chart of mean average precision omitted)
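
Generating overlapping character bigrams is straightforward; a minimal sketch, using the slide's "Northern Ireland" rendered in Chinese characters as the example:

```python
def char_bigrams(text):
    """Overlapping character bigrams for indexing, skipping whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

# A four-character term yields three overlapping bigrams, so a query and a
# document can partially match even when word segmentations disagree:
print(char_bigrams("北爱尔兰"))  # ['北爱', '爱尔', '尔兰']
```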

Transliteration: Subword Translation
- Untranslatable names: create a cross-language phonetic mapping
- Train transformation-based error-driven learning on name translation pairs
- Produce 1-best syllable sequences; combine with standard term translation
- Small, consistent improvement
(Chart of mean average precision omitted)

Pre- and Post-translation Document Expansion
- VOA Mandarin broadcast audio (2265 stories) -> speech recognition -> Mandarin transcription
- Pre-translation document expansion using comparable Mandarin documents
- Balanced gloss translation into English
- Post-translation document expansion using comparable English documents
- English retrieval system (Inquery 3.1p1) with NYT English newswire exemplar queries, producing a ranked list
- Evaluation: 17 topics, exhaustive relevance judgments, mAP

Bridging Gaps with Document Expansion
- Prior work on query expansion (European languages): pre-translation expansion matters most, since the need is for terms
- Pre-translation expansion: compensates for ASR errors; yields more translatable terms
- Post-translation expansion: recovers translation/ASR gaps and errors; enriches the document
- Post-translation expansion improves results significantly
  - Outperforms monolingual retrieval and retrieval on manual transcriptions
  - Recovers (translation) OOV terms, e.g. Tariq Aziz, Boris Yeltsin
  - Less problematic in other languages
(Chart of mean average precision omitted)

Discussion
- Multi-scale processing enables:
  - Matching at the highest level of precision, to reduce ambiguity
  - Partial matching, for robustness to gaps and errors
- In conjunction with document expansion, it can outperform retrieval on monolingual, manually produced transcriptions

Future Challenges
- SDR: good effectiveness shown for clean broadcast news speech
  - Conversational speech is more challenging; higher WER will put more weight on partial matching
- Needed: systematic, statistically valid integration of
  - ASR word/subword hypotheses at multiple scales
  - Alternate retrieval models
- CL-SDR: integrate approaches that better model uncertainty and ambiguity in translation

Future Challenges & Opportunities
- Prominence and tone
  - Identify prominence and emphasis to improve spoken language understanding
  - Integrate with speech recognition: rerank candidates
  - Enhance unit selection for contextually appropriate prosody
- Dialogue and intonation
  - Predicting and managing turn-taking
  - Utterance classification
  - Handling miscommunication

Thanks
- Dinoj Surendran, Siwei Wang, Yi Xu
- V. Sindhwani, M. Belkin & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin
- This work supported by NSF Grant #
- MEI Team: Helen Meng, Sanjeev Khudanpur, Hsin-min Wang, Wai-Kit Lo, Doug Oard, et al.

Results & Discussion: Phrasal Context

Phrase Context   Mandarin Tone   English Pitch Accent
Phrase           75.5%           81.3%
No Phrase        72%             79.9%

- Phrase contour compensation enhances recognition
- This is a simple strategy; non-linear slope compensation may improve it

Aside: More Tones
- Cantonese: CUSENT corpus of read broadcast news text; same feature extraction and representation
- 6 tones: high level, high rise, mid level, low fall, low rise, low level
- SVM classification: linear kernel 64%; Gaussian kernel 68%
  - Tones 3 and 6: 50%, mutually indistinguishable (50% pairwise)
  - Human levels: 50% without context; 68% with context
- Augmenting with the syllable's phone sequence: 86% accuracy
  - For 90% of syllables with tone 3 or 6, one of the two dominates

Results: Local Context

Context          Mandarin Tone   English Pitch Accent   isiZulu Tone
Full             74.5%           81.3%                  76.2%
Extend PrePost   74%             80.7%                  74.7%
Extend Pre       74%             79.9%                  75.3%
Extend Post      70.5%           76.7%                  74.6%
Diffs PrePost    75.5%           80.7%                  76.2%
Diffs Pre        76.5%           79.5%                  76.5%
Diffs Post       69%             77.3%                  74.6%
Both Pre         76.5%           79.7%                  76.5%
Both Post        71.5%           77.6%                  74.8%
No context       68.5%           75.9%                  74.1%

Confusion Matrix (English)
(Columns: manually labeled accent; rows: recognized accent)

Recognized   Unaccented       High            Low           D.S. High
Unaccented   95% (888/934)    25% (110/440)   100% (12/12)  53.5% (61/114)
High         4.6% (43/934)    73% (322/440)   0%            38.5% (44/114)
Low          0%               0%              0%            0%
D.S. High    0.3% (3/934)     2% (8/440)      0%            8% (9/114)

Confusion Matrix (Mandarin)
(Columns: manually labeled tone; rows: recognized tone)

Recognized     High          Mid-Rising      Low           High-Falling   Neutral
High           84% (38/45)   9% (5/56)       5% (1/20)     13% (9/68)     0%
Mid-Rising     6.7% (3/45)   78.6% (44/56)   10% (2/20)    7% (5/68)      27.3% (3/11)
Low            0%            3.6% (2/56)     70% (14/20)   7% (5/68)      27.3% (3/11)
High-Falling   7.4% (4/45)   3.6% (2/56)     10% (2/20)    70% (48/68)    0%
Neutral        0%            5.3% (3/56)     5% (1/20)     1.5% (1/68)    45% (5/11)

Related Work
- Tonal coarticulation: Xu & Sun 02; Xu 97; Shih & Kochanski 00
- English pitch accent: X. Sun 02; Hasegawa-Johnson et al 04; Ross & Ostendorf 95
- Lexical tone recognition
  - SVM recognition of Thai tone: Thubthong 01
  - Context-dependent tone models: Wang & Seneff 00; Zhou et al 04

Pitch Target Approximation Model
- Pitch target: a linear model, exponentially approximated (see the reconstruction below)
- In practice, assume the target is well approximated by its value at the syllable midpoint (Sun 02)
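
The slide's equations were images and did not survive the transcript; the following is the standard formulation (after Xu et al. 99), which may differ in notation from the original. The underlying pitch target is linear, and the surface f0 contour approaches it exponentially:

$$ T(t) = a t + b $$
$$ f_0(t) = \beta e^{-\lambda t} + T(t) $$

where $a$ and $b$ are the target's slope and height, $\beta$ is the initial deviation of $f_0$ from the target, and $\lambda$ is the rate of approach.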

Context: Summary
- Employ a common acoustic representation: tone (Mandarin, isiZulu), pitch accent (English)
- Cantonese: ~64%; 68% with an RBF kernel
- SVM classifiers with a linear kernel: 76% (Mandarin), 76% (isiZulu), 81% (English)
- Local context effects:
  - Up to >20% relative reduction in error
  - Preceding context makes the greatest contribution (carryover vs. anticipatory)
- Phrasal context effects: compensation for the phrasal contour improves recognition

“Bounds” on Subword-Based Systems
- Character bigrams for indexing marginally outperform word-based systems
- Syllable bigrams are quite competitive, though somewhat behind
- A mean average precision of ~0.6 is a good CL-SDR target

Cross-Language SDR Challenges
- Query processing (translation)
  - Tokenization
  - Translation lexicon coverage
  - Selection among alternate translations
- Document processing (recognition)
  - Tokenization
  - Recognition lexicon coverage
  - Selection among alternate recognition hypotheses

Laplacian SVM
- For SVMs, we solve the standard regularized hinge-loss problem (see the reconstruction below)
- Assumption: if samples are close in the intrinsic geometry, then the conditionals P(y|x1) and P(y|x2) are similar
- Add a new regularizer to control complexity in the intrinsic geometry as well as in the ambient space
- For the Laplacian SVM, solve the manifold-regularized problem (see below)
- The support of P_x is assumed to lie on a compact submanifold
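
The slide's equations are missing from the transcript; the following is a standard statement of the two objectives after Belkin, Niyogi & Sindhwani's manifold regularization framework, which may differ in notation from the original slides. With $l$ labeled and $u$ unlabeled examples and graph Laplacian $L$:

$$ f^* = \operatorname*{argmin}_{f \in \mathcal{H}_K} \; \frac{1}{l}\sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i f(x_i)\bigr) + \gamma \|f\|_K^2 \quad \text{(standard SVM)} $$

$$ f^* = \operatorname*{argmin}_{f \in \mathcal{H}_K} \; \frac{1}{l}\sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i f(x_i)\bigr) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(l+u)^2}\, \mathbf{f}^\top L\, \mathbf{f} \quad \text{(Laplacian SVM)} $$

where $\mathbf{f} = (f(x_1), \ldots, f(x_{l+u}))^\top$; the final term is the intrinsic (manifold) regularizer, and $\gamma_A$, $\gamma_I$ weight ambient vs. intrinsic smoothness.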

Manifold Regularization

Spectral clustering
- Employs the spectrum of a similarity matrix S to perform dimensionality reduction for clustering
- Meila-Shi: clusters based on the eigenvectors corresponding to the k largest eigenvalues
- Laplacian eigenmaps:
  - Create edges for the k nearest neighbors; choose weights (binary or heat kernel)
  - Compute eigenvectors of the generalized eigenproblem L y = λ D y
  - Represent each point by the eigenvectors of the m smallest eigenvalues
  - Preserves distances for near neighbors
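
A minimal sketch of the Laplacian eigenmap steps just listed, assuming binary edge weights and a small dense dataset; the function name and defaults are illustrative:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def laplacian_eigenmap(X, k=6, m=2):
    """Embed X via the generalized eigenproblem L y = lambda D y."""
    D2 = cdist(X, X)
    W = np.zeros_like(D2)
    nn = np.argsort(D2, axis=1)[:, 1:k + 1]   # k nearest neighbors (skip self)
    for i, js in enumerate(nn):
        W[i, js] = 1.0                        # binary weights
    W = np.maximum(W, W.T)                    # symmetrize the graph
    D = np.diag(W.sum(axis=1))
    L = D - W                                 # unnormalized graph Laplacian
    vals, vecs = eigh(L, D)                   # generalized eigenproblem, ascending
    return vecs[:, 1:m + 1]                   # drop the constant eigenvector
```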

K-lines
- Given vectors y (the spectral representation):
  - Randomly generate center vectors m_j (or initialize them to eigenvectors of y)
  - Form a matrix M_j from all vectors y closest to each m_j
  - Update each m_j to the first eigenvector of M_j M_j^T
  - Iterate to convergence

Asymmetric clustering
- Clusterable data are not necessarily Gaussian
- New affinity measure: conductivity
- Question: how to choose the kernel radius?
  - Often one simply picks the best; here it is selected automatically and context-dependently
  - The measure is also asymmetric: in a dense region, other points appear further away
  - The symmetric variant takes the minimum

K-means clustering
- Select k, the number of clusters
- Randomly choose k points as cluster centers
- A) Assign each point to the nearest cluster
- B) Recompute the cluster centers
- Repeat A and B until convergence
- Pro: fast; handles large data sets
- Con: sensitive to the random initial assignment
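
A minimal NumPy implementation of exactly the steps A and B above, for illustration:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means following the slide's steps A and B."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # A) assign each point to the nearest cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # B) recompute each center as the mean of its assigned points
        new_centers = np.array(
            [X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
             for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return assign, centers
```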

Sequence and Factorial Models: I

Sequence and Factorial Models
- Compare 0th-order and 1st-order models
- Compare linear-chain vs. factorial models

Query by Example
- English newswire exemplar: "President Bill Clinton and Chinese President Jiang Zemin engaged in a spirited, televised debate Saturday over human rights and the Tiananmen Square crackdown, and announced a string of agreements on arms control, energy and environmental matters. There were no announced breakthroughs on American human rights concerns, including Tibet, but both leaders accentuated the positive ..."
- Matching Mandarin audio story (English translation): "Aides to U.S. President Clinton praised Chinese officials for permitting a live television broadcast of the joint press conference Clinton and Jiang Zemin held after their summit ... in particular the 1989 decision to suppress the democracy movement. He said the suppression of the Tiananmen democracy movement was wrong, and he also criticized China's treatment of Tibet's spiritual leader ... National Security Adviser Berger said the live broadcast let the Chinese people hear, for the first time in such a public forum, discussion of sensitive human rights issues. At the press conference ..."