1 Prosody-Based Automatic Segmentation of Speech into Sentences and Topics Elizabeth Shriberg, Andreas Stolcke (Speech Technology and Research Laboratory); Dilek Hakkani-Tur, Gokhan Tur (Department of Computer Engineering, Bilkent University) To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio Presenter: Yi-Ting Chen

2 Outline Introduction Method –Prosodic modeling –Language modeling –Model combination –Data Results and discussion Summary and conclusion

3 Introduction (1/2) Why process audio data? Why automatic segmentation? –A crucial step toward robust information extraction from speech is the automatic determination of topic, sentence, and phrase boundaries Why use prosody? –In all languages, prosody is used to convey structural, semantic, and functional information –Prosodic cues are by their nature relatively unaffected by word identity –Unlike spectral features, some prosodic features are largely invariant to changes in channel characteristics –Prosodic feature extraction can be achieved with minimal additional computational load and no additional training data

4 Introduction (2/2) This paper describes the prosodic modeling in detail Decision tree and hidden Markov modeling techniques are used to combine prosodic cues with word-based approaches, and performance is evaluated on two speech corpora Results are reported both for true words and for words as hypothesized by a speech recognizer

5 Method (1/6) –Prosodic modeling Feature extraction regions –For each inter-word boundary, they looked at prosodic features of the word immediately preceding and following the boundary, or alternatively within a window of 20 frames (200 ms) before and after the boundary –They extracted prosodic features reflecting pause durations, phone durations, pitch information, and voice quality information –They chose not to use amplitude- or energy-based features, since previous work showed these features to be both less reliable than and largely redundant with duration and pitch features

6 Method (2/6) –Prosodic modeling Features: –The features were designed to be independent of word identities –They began with a set of over 100 features, which was pared down to a smaller set by eliminating features –Pause features: Important cues to boundaries between semantic units The pause model was trained as an individual phone If there was no pause at the boundary, the pause duration feature was output as 0 The duration of the pause preceding the word before the boundary was also used Both raw durations and durations normalized by the particular speaker's pause duration distribution were investigated (a sketch of such features follows below)
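A minimal sketch of how such pause features might be computed from a time-aligned word sequence, assuming hypothetical `Word` records with start/end times and a per-speaker mean pause duration for normalization (the names and the exact normalization are illustrative assumptions, not taken from the paper):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Word:
    text: str
    start: float   # seconds
    end: float     # seconds
    speaker: str

def pause_features(words: List[Word], i: int, mean_pause: Dict[str, float]) -> Dict[str, float]:
    """Pause-based features for the inter-word boundary after words[i].

    Illustrative only: the paper reports raw pauses as well as pauses
    normalized by speaker-specific pause statistics; the exact
    normalization used there may differ from this sketch.
    """
    w, w_next = words[i], words[i + 1]
    pause = max(0.0, w_next.start - w.end)        # 0 if there is no pause at the boundary
    spk_mean = mean_pause.get(w.speaker, 1e-3)    # guard against division by zero
    return {
        "pause_raw": pause,
        "pause_norm": pause / spk_mean,           # speaker-normalized pause duration
    }
```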

7 Method (3/6) –Prosodic modeling Features: –Phone and rhyme duration features: a slowing down toward the ends of units, or preboundary lengthening Preboundary lengthening typically affects the nucleus and coda of syllables Duration characteristics of the last rhyme of the syllable preceding the boundary were used Each phone in the rhyme was normalized for inherent duration (see the reconstruction below)
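The normalization formula itself is not included in the transcript; a plausible reconstruction, assuming the common z-score normalization against phone-specific duration statistics (notation assumed, not verbatim from the slide):

```latex
\tilde{d}_p = \frac{d_p - \mu_p}{\sigma_p}
```

where $d_p$ is the observed duration of phone $p$ in the rhyme, and $\mu_p$ and $\sigma_p$ are the mean and standard deviation of that phone's duration estimated from training data.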

8 Method (4/6) –Prosodic modeling Features: –F0 features: Pitch information is typically less robust and more difficult to model than other prosodic features The raw F0 contours were therefore preprocessed to smooth out microintonation and tracking errors, to simplify F0 feature computation, and to identify speaking-range parameters for each speaker

9 Method (5/6) –Prosodic modeling Features: –F0 features: Reset features, range features, F0 slope features, F0 continuity features (a sketch of a reset feature follows below) –Estimated voice quality features –Other features: speaker gender, turn boundaries, time elapsed from the start of the turn, and the turn count in the conversation
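A hedged sketch of one such feature, an F0 reset across the boundary, computed here as the log ratio of the first voiced F0 value after the boundary to the last voiced F0 value before it (the windowing and F0 stylization in the paper differ; the function and its inputs are hypothetical):

```python
import math
from typing import Optional, Sequence

def f0_reset(f0_before: Sequence[float], f0_after: Sequence[float]) -> Optional[float]:
    """Log F0 reset across an inter-word boundary.

    f0_before / f0_after: voiced F0 values (Hz) from fixed windows on
    each side of the boundary; unvoiced frames are assumed to have been
    filtered out already.  Returns None when either side is unvoiced.
    """
    if not f0_before or not f0_after:
        return None
    last_before = f0_before[-1]
    first_after = f0_after[0]
    return math.log(first_after / last_before)   # > 0: pitch resets upward, < 0: continues falling
```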

10 Method (6/6) –Prosodic modeling Decision trees –Decision trees are probabilistic classifiers –Given a set of features and a labeled training set, the decision tree construction algorithm repeatedly selects a single feature that has the highest predictive value –The leaves of the tree store probabilities about the class distribution of all samples falling into the corresponding region of the feature space –Decision trees make no assumptions about the shape of feature distributions –It is not necessary to convert feature values to some standard scale Feature selection algorithm
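A minimal sketch of such a probabilistic decision-tree classifier using scikit-learn, under the assumption that the majority "no boundary" class is downsampled so the tree models the rarer boundary events in more detail (the paper used CART-style trees with its own feature-selection wrapper; this is only illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_prosody_tree(X: np.ndarray, y: np.ndarray, seed: int = 0) -> DecisionTreeClassifier:
    """Train a decision tree on prosodic features X with boundary labels y (0/1).

    The majority 'no boundary' class is downsampled to roughly balance
    the classes before training (cf. the skewed-distribution discussion
    in the model-combination slides).
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    neg_sample = rng.choice(neg, size=min(len(neg), len(pos)), replace=False)
    idx = np.concatenate([pos, neg_sample])
    tree = DecisionTreeClassifier(min_samples_leaf=50)
    tree.fit(X[idx], y[idx])
    return tree

# tree.predict_proba(X_test)[:, 1] then estimates P(boundary | prosodic features),
# the posterior that is later combined with the language model.
```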

11 Method (1/3) –Language modeling The goal: to capture information about segment boundaries contained in the word sequences The joint distribution of boundary types and words is modeled in a hidden Markov model (HMM) Denoting the boundary classifications by $T$ and the word sequence by $W$, the structure of the HMM makes the boundary types hidden states and the words observations, so that it models $P(W, T)$ The slightly more complex forward-backward algorithm is used to maximize the posterior probability of each individual boundary classification

12 Method (2/3) –Language modeling Sentence segmentation –A hidden-event N-gram language model –The states of the HMM consist of the end-of-sentence status of each word, plus any preceding words and possibly boundary tags to fill up the N-gram context –Transition probabilities are given by N-gram probabilities estimated from boundary-tagged training data using Katz backoff –Example: see the toy sketch below
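A toy sketch of the hidden-event idea using only a bigram model, in which case each boundary decision becomes local (the paper uses higher-order N-grams with Katz backoff and full Viterbi or forward-backward decoding over the tag sequence; the fallback log probability for unseen bigrams here is purely illustrative):

```python
from typing import Dict, List, Tuple

BOUNDARY = "<s>"

def segment_bigram(words: List[str],
                   logp: Dict[Tuple[str, str], float]) -> List[bool]:
    """Hidden-event segmentation with a boundary-tagged bigram LM (illustrative).

    logp[(h, w)] is log P(w | h) from a bigram model trained on text with
    <s> tokens at sentence boundaries.  With a bigram, each inter-word
    boundary can be decided independently by comparing the two hypotheses.
    """
    decisions = []
    for w, w_next in zip(words, words[1:]):
        no_boundary = logp.get((w, w_next), -20.0)
        with_boundary = (logp.get((w, BOUNDARY), -20.0)
                         + logp.get((BOUNDARY, w_next), -20.0))
        decisions.append(with_boundary > no_boundary)
    return decisions
```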

13 Method (3/3) –Language modeling Topic segmentation –First, 100 individual unigram topic cluster language models were constructed using the multipass k-means algorithm (on TDT data) –Then an HMM was built in which the states are topic clusters and the observations are sentences –In addition to the basic HMM segmenter, two states were incorporated for modeling the initial and final sentences of a topic segment
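With unigram topic cluster models, the HMM observation likelihood of a sentence $s = w_1 \dots w_m$ in topic state $t$ presumably factors as a product of unigram probabilities (notation assumed here, not taken from the slide):

```latex
P(s \mid t) = \prod_{j=1}^{m} P(w_j \mid t)
```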

14 Method (1/3) –Model combination Prosodic and lexical segmentation cues are expected to be partly complementary –Posterior probability interpolation (see the formula below) –Integrated hidden Markov modeling With suitable independence assumptions, the familiar HMM techniques can be applied to compute either the most likely boundary sequence $\arg\max_T P(T \mid W, F)$ or the per-boundary posteriors $P(T_i \mid W, F)$ To incorporate the prosodic information into the HMM, prosodic features $F$ are modeled as emissions from the relevant HMM states, with likelihoods $P(F_i \mid T_i)$ So a complete path through the HMM is associated with the total probability $P(W, T, F) = P(W, T) \prod_i P(F_i \mid T_i)$
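The interpolation formula itself is not reproduced in the transcript; a hedged reconstruction, with $\lambda$ assumed as the interpolation weight, $P_{LM}$ the language-model posterior and $P_{DT}$ the decision-tree posterior:

```latex
P(T_i \mid W, F) \approx \lambda \, P_{LM}(T_i \mid W) + (1 - \lambda) \, P_{DT}(T_i \mid F_i)
```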

15 Method (2/3) –Model combination Prosodic and lexical segmentation cues are expected to be partly complementary –Integrated hidden Markov modeling How to estimate the likelihoods –Note that the decision tree estimates posteriors $P(T_i \mid F_i)$ –These can be converted to likelihoods using Bayes' rule, $P(F_i \mid T_i) = P(F_i)\, P(T_i \mid F_i) / P(T_i) \propto P(T_i \mid F_i) / P(T_i)$, since $P(F_i)$ is constant across classes –A beneficial side effect of this approach is that the decision tree models the lower-frequency events in greater detail than if presented with the raw, highly skewed class distribution –A tunable model combination weight (MCW) was introduced (see the sketch below)
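A minimal sketch of the Bayes'-rule conversion together with a model combination weight, assuming the weight acts as a scale on the prosodic log-likelihood (one common choice; the exact form used in the paper may differ, and the variable names are illustrative):

```python
import numpy as np

def prosody_log_likelihoods(dt_posteriors: np.ndarray,
                            class_priors: np.ndarray,
                            mcw: float = 0.5) -> np.ndarray:
    """Convert decision-tree posteriors P(T_i | F_i) into weighted log-likelihoods.

    dt_posteriors: array of shape (n_boundaries, n_classes) from predict_proba.
    class_priors:  the class priors implicit in the tree's training data.
    By Bayes' rule, P(F_i | T_i) is proportional to P(T_i | F_i) / P(T_i);
    the term P(F_i) is constant across classes and can be dropped.
    The model combination weight (MCW) scales the prosodic contribution
    before it enters the HMM path probability.
    """
    scaled = dt_posteriors / class_priors          # proportional to P(F_i | T_i)
    return mcw * np.log(scaled + 1e-12)            # weighted log-likelihoods
```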

16 Method (3/3) –Model combination Prosodic and lexical segmentation cues are expected to be partly complementary –HMM posteriors as decision tree features For practical reasons this was not used in this work Drawback: it would overestimate the informativeness of the word-based posteriors when these are based on automatic transcriptions –Alternative models HMM: a drawback is that the independence assumptions may be inappropriate and inherently limit the performance of the model Decision trees: advantages: enhanced discrimination between the target classifications; input features can be combined easily; drawbacks: sensitivity to skewed class distributions; expensive to model multiple target variables

17 Method (1/2) –Data Speech data and annotations –Switchboard data: a subset of the corpus that had been hand-labeled for sentence boundaries by the LDC –Broadcast News data for topic and sentence segmentation was extracted from the LDC's 1997 Broadcast News (BN) release –Training of the Broadcast News language models used an additional 130 million words of text-only transcripts from the 1996 Hub-4 language model corpus (for sentence segmentation) Training, tuning, and test sets

18 Method (2/2) –Data Word recognition –1-best output from SRI's DECIPHER large-vocabulary speech recognizer –Several of the computationally expensive or cumbersome steps (such as acoustic adaptation) were skipped –Switchboard test set: 46.7% WER –Broadcast News: 30.5% WER Evaluation metrics –Sentence segmentation performance for true words was measured by boundary classification error –For recognized words, a string alignment of the automatically labeled recognition hypothesis is performed, and the error rate is then calculated over the aligned boundaries –Topic segmentation was evaluated using the metric defined by NIST for the TDT-2 evaluation
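For true words the sentence-segmentation metric reduces to a simple per-boundary classification error; a sketch (the string-alignment step needed for recognized words is omitted here):

```python
from typing import Sequence

def boundary_error_rate(ref: Sequence[bool], hyp: Sequence[bool]) -> float:
    """Fraction of inter-word boundaries labeled with the wrong class
    (boundary vs. no boundary), given reference and hypothesized label
    sequences of equal length."""
    assert len(ref) == len(hyp), "labels must cover the same boundaries"
    if not ref:
        return 0.0
    return sum(r != h for r, h in zip(ref, hyp)) / len(ref)
```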

19 Results and discussion (1/10) Task 1: Sentence segmentation of Broadcast News data –Prosodic features usage The best-performing tree identified six features for this task, which fall into four groups, in decreasing order of importance: pause > turn > F0 > rhyme duration The behavior of the features is precisely as expected from the descriptive literature

20 Results and discussion (2/10) Task 1: Sentence segmentation of Broadcast News data –Error reduction from prosody –The prosodic model alone performs better than a word-based language model –The prosodic model is somewhat more robust to recognizer output than the language model

21 Results and discussion (3/10) Task 1: Sentence segmentation of Broadcast News data –Performance without F0 features The F0 features used are not typically extracted or computed in most ASR systems Removing all F0 features: it could also indicate a higher degree of correlation between true words and the prosodic features

22 Results and discussion (4/10) Task 2: Sentence segmentation of Switchboard data –Prosodic feature usage A different distribution of features than observed for Broadcast News The primary feature type used here is preboundary duration Pause duration at the boundary was also useful Most interesting about this tree was the consistent behavior of the duration features, with longer durations giving higher probability to a sentence boundary

23 Results and discussion (5/10) Task 2: Sentence segmentation of Switchboard data –Error reduction from prosody The prosodic model alone is not a particularly good model here Combining prosody with the language model resulted in a statistically significant improvement All differences were statistically significant

24 Results and discussion (6/10) Task 3: Topic segmentation of Broadcast News data –Prosodic feature usage Five feature types were most helpful for this task The results are similar to those seen earlier for sentence segmentation in Broadcast News The importance of pause duration is underestimated

25 Results and discussion (7/10) Task 3: Topic segmentation of Broadcast News data –Prosodic feature usage The speaker-gender feature –The women in a sense behave more "neatly" than the men –One possible explanation is that men are more likely than women to produce regions of nonmodal voicing at topic boundaries

26 Results and discussion (8/10) Task 3: Topic segmentation of Broadcast News data –Error reduction from prosody All results reflect the word-averaged, weighted error metric used in the TDT-2 evaluation Chance here corresponds to outputting the "no boundary" class at all locations, meaning that the false alarm rate will be zero and the miss rate will be 1 A weight of 0.7 is given to false alarms and 0.3 to misses (see the formula below)
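With the weights quoted on the slide, the weighted segmentation error takes the form below (written in the obvious way; the official TDT-2 definition adds word-averaging details not shown here):

```latex
C_{\text{seg}} = 0.7 \cdot P_{\text{false alarm}} + 0.3 \cdot P_{\text{miss}}
```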

27 Results and discussion (9/10) Task 3: Topic segmentation of Broadcast News data –Performance without F0 features The experiments were conducted only for true words, since, as shown in Table 5, results are similar to those for recognized words

28 Results and discussion (10/10) Comparisons of error reduction across conditions –Performance without F0 features While researchers have typically found Switchboard a difficult corpus to process, in the case of sentence segmentation on true words it is just the opposite: the corpus is atypically easy Previous work on automatic segmentation of Switchboard transcripts is therefore likely to overestimate success for other corpora

29 Summary and conclusion The use of prosodic information for sentence and topic segmentation has been studied Results showed that on Broadcast News the prosodic model alone performed as well as purely word-based statistical language models Interestingly, the integrated HMM worked best on transcribed words, while the posterior interpolation approach was much more robust in the case of recognized words