Recognizing Structure: Sentence, Speaker, and Topic Segmentation

Recognizing Structure: Sentence, Speaker, and Topic Segmentation
Julia Hirschberg
CS 4706
11/27/2018

Today
- Recognizing structural information in speech
- Learning from generation
- Learning from text segmentation
- Types of structural information
- Segmentation in spoken corpora

Recall: Discourse Structure for Speech Generation
- Theoretical accounts (e.g. Grosz & Sidner ’86)
- Empirical studies
- Text vs. speech
- How can they help in recognition?
- Features to test
  - Acoustic/prosodic features
  - Lexical features

Indicators of Structure in Text
- Cue phrases: now, well, first
- Pronominal reference
- Orthography and formatting (in text)
- Lexical information (Hearst ’94, Reynar ’98, Beeferman et al ’99)
  - Domain dependent
  - Domain independent

Methods of Text Segmentation
- Lexical cohesion methods vs. multiple source
- Vocabulary similarity indicates topic cohesion
- Intuition from Halliday & Hasan ’76
- Features:
  - Stem repetition
  - Entity repetition
  - Word frequency
  - Context vectors
  - Semantic similarity
  - Word distance
- Methods:
  - Sliding window
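A minimal sketch of the sliding-window idea, under several assumptions that do not come from the slides: plain regex tokenization with no stemming, Jaccard overlap of word sets as the vocabulary-similarity measure, and a fixed window of two sentences. The function names are hypothetical.

```python
import re


def tokens(sentence):
    """Lowercase word tokens; a real system would also stem and drop stop words."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def gap_scores(sentences, window=2):
    """Score each gap between sentences by the vocabulary overlap of the
    `window` sentences before it and the `window` sentences after it.
    Low scores suggest a topic boundary."""
    scores = []
    for gap in range(1, len(sentences)):
        left = set().union(*(tokens(s) for s in sentences[max(0, gap - window):gap]))
        right = set().union(*(tokens(s) for s in sentences[gap:gap + window]))
        union = left | right
        # Jaccard overlap (an assumption; any vocabulary similarity would do).
        scores.append(len(left & right) / len(union) if union else 0.0)
    return scores


def weakest_gap(sentences, window=2):
    """Index of the sentence that follows the least cohesive gap."""
    scores = gap_scores(sentences, window)
    return scores.index(min(scores)) + 1
```

With two sentences about a cat followed by two about stock prices, the lowest score falls at the gap between them, so `weakest_gap` returns the index of the first stock-price sentence.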

Methods of lexical cohesion (continued from the previous slide)
- Lexical chains
- Clustering
Combine lexical cohesion with other cues
- Features
  - Cue phrases
  - Reference (e.g. pronouns)
  - Syntactic features
- Methods
  - Machine learning from labeled corpora

Choi 2000: Text Segmentation
- Implements leading methods and compares new algorithm to them on corpus of 700 concatenated documents
- Comparison algorithms:
  - Baselines:
    - No boundaries
    - All boundaries
    - Regular partition
    - Random # of random partitions
    - Actual # of random partitions

Comparison algorithms (continued)
- TextTiling algorithm (Hearst ’94)
- DotPlot algorithms (Reynar ’98)
- Segmenter (Kan et al ’98)
- Choi ’00 proposal
  - Cosine similarity measure
  - Same: 1; no overlap: 0
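A short sketch of the cosine similarity measure between two sentences, treating each as a term-frequency vector. Input is assumed to be already tokenized (and, in practice, stemmed); the function name is hypothetical. Identical sentences score 1 and sentences with no shared words score 0, matching the slide.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two term-frequency vectors."""
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    # Only words present in both sentences contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence shares nothing with anything
    return dot / (norm_a * norm_b)
```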

Choi’s algorithm has best performance (9-12% error)
- Similarity matrix → rank matrix
  - Minimize effect of outliers
  - How likely is this sentence to be a boundary, compared to other sentences?
- Divisive clustering based on D(n) = (sum of rank values s_{i,j} of segment n) / (inside area of segment n, (j-i+1)), for i, j the sentences at the beginning and end of segment n
- Keep dividing the corpus until δD(n) = D(n) - D(n-1) shows little change
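A rough sketch of the two quantities on this slide, not Choi's implementation. Assumptions of the sketch: the rank of a cell is the fraction of its neighbours within `radius` that have a lower similarity; D is read as the density of a whole segmentation (summed rank values over summed areas); and the inside area of a segment is taken as the number of cells in its square block, (j-i+1) squared, where the slide writes (j-i+1). Function names are hypothetical.

```python
def rank_matrix(sim, radius=1):
    """Replace each similarity by the fraction of its neighbours
    (within `radius` rows and columns) that have a lower value."""
    n = len(sim)
    ranks = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            lower = examined = 0
            for a in range(max(0, i - radius), min(n, i + radius + 1)):
                for b in range(max(0, j - radius), min(n, j + radius + 1)):
                    if (a, b) != (i, j):
                        examined += 1
                        lower += sim[a][b] < sim[i][j]
            ranks[i][j] = lower / examined if examined else 0.0
    return ranks


def inside_density(ranks, boundaries):
    """Density of a segmentation: rank values inside each segment's square
    block, summed over segments, divided by the summed block areas.
    `boundaries` lists segment start indices followed by the final end index."""
    total = area = 0.0
    for i, end in zip(boundaries, boundaries[1:]):
        j = end - 1  # last sentence of the segment
        total += sum(ranks[r][c] for r in range(i, j + 1) for c in range(i, j + 1))
        area += (j - i + 1) ** 2  # assumption: area of the square block
    return total / area if area else 0.0
```

A divisive clusterer would try each candidate split, keep the one that raises the density most, and stop once the gain between successive splits levels off.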

Utiyama & Isahara ’02: What if we have no labeled data for our domain?

Types of Discourse Structure in Spoken Corpora
- Domain independent
  - Sentence/utterance boundaries
  - Speaker turn segmentation
  - Topic segmentation
- Domain dependent
  - Broadcast news
  - Meetings
  - Telephone conversations

Spoken Cues to Discourse Structure
- Pitch range: Lehiste ’75, Brown et al ’83, Silverman ’86, Avesani & Vayra ’88, Ayers ’92, Swerts et al ’92, Grosz & Hirschberg ’92, Swerts & Ostendorf ’95, Hirschberg & Nakatani ’96
- Preceding pause: Lehiste ’79, Chafe ’80, Brown et al ’83, Silverman ’86, Woodbury ’87, Avesani & Vayra ’88, Grosz & Hirschberg ’92, Passonneau & Litman ’93, Hirschberg & Nakatani ’96

Brown et al ’83, Grosz & Hirschberg ’92, Hirschberg & Nakatani ’96
- Rate: Butterworth ’75, Lehiste ’80, Grosz & Hirschberg ’92, Hirschberg & Nakatani ’96
- Amplitude: Brown et al ’83, Grosz & Hirschberg ’92, Hirschberg & Nakatani ’96
- Contour: Brown et al ’83, Woodbury ’87, Swerts et al ’92
Add Audix tree??

Finding Sentence and Topic Boundaries
- Statistical, machine learning approaches with large segmented corpora
- Features:
  - Lexical cues
    - Domain dependent
    - Sensitive to ASR performance
  - Acoustic/prosodic cues
    - Domain independent
    - Sensitive to speaker identity

Shriberg et al ’00: Prosodic Cues
- Prosodic cues perform as well as or better than text-based cues at sentence and topic segmentation, and generalize better?
- Goal: identify sentence and topic boundaries at ASR-defined word boundaries
- CART decision trees provided boundary predictions
- HMM combined these with lexical boundary predictions from LM
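The slide describes an HMM that merges the decision-tree and LM predictions. As a much simpler stand-in, and explicitly not that HMM, the sketch below linearly interpolates two per-boundary probabilities and thresholds the result. The weight `lam`, the threshold, and the function name are all assumptions made for illustration.

```python
def combine_posteriors(prosody_probs, lm_probs, lam=0.5, threshold=0.5):
    """Interpolate two boundary probabilities per candidate location and
    return True where the combined score reaches the threshold."""
    if len(prosody_probs) != len(lm_probs):
        raise ValueError("both models must score the same candidate boundaries")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    combined = [lam * p + (1.0 - lam) * q for p, q in zip(prosody_probs, lm_probs)]
    return [score >= threshold for score in combined]
```

In practice `lam` would be tuned on held-out data, since the relative reliability of the two models differs between corpora.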

Features
For each potential boundary location:
- Pause at boundary (raw and normalized by speaker)
- Pause at word before boundary (is this a new ‘turn’ or part of continuous speech segment?)
- Phone and rhyme duration (normalized by inherent duration) (phrase-final lengthening?)
- F0 (smoothed and stylized): reset, range (topline, baseline), slope and continuity
- Raw pause worked better than normalized
- Speaker id/change hand marked apparently
- F0 reset measured before and after potential boundary; range is defined over preceding word and compared to speaker-specific parameters

Features (continued)
- Voice quality (halving/doubling estimates as correlates of creak or glottalization)
- Speaker change, time from start of turn, # turns in conversation, and gender
Trained/tested on Switchboard and Broadcast News
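A small sketch of how a few of the per-boundary features listed above might be computed from time-aligned ASR output. Everything here is illustrative: the `Word` record and its field names are hypothetical, F0 reset is approximated as the difference in mean F0 between the words on either side of the boundary, and no smoothing, stylization, or speaker normalization is attempted.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Word:
    """One time-aligned ASR word; all field names are hypothetical."""
    text: str
    start: float   # seconds
    end: float     # seconds
    speaker: str
    f0: list       # F0 values (Hz) for voiced frames, may be empty


def boundary_features(words, k):
    """Features for the candidate boundary between words[k] and words[k + 1]."""
    before, after = words[k], words[k + 1]
    pause = max(0.0, after.start - before.end)
    # Pause at the preceding word boundary: is this continuous speech?
    prev_pause = max(0.0, before.start - words[k - 1].end) if k > 0 else 0.0
    if before.f0 and after.f0:
        f0_reset = mean(after.f0) - mean(before.f0)
    else:
        f0_reset = 0.0  # assumption: unvoiced words contribute no reset
    return {
        "pause": pause,
        "prev_pause": prev_pause,
        "f0_reset": f0_reset,
        "speaker_change": before.speaker != after.speaker,
    }
```

A feature dictionary like this, one per candidate boundary, is the kind of input a decision-tree learner would be trained on.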

Sentence segmentation results
- Prosodic features
  - Better than LM for BN
  - Worse (on transcription) and same for ASR transcript on SB
  - All better than chance
- Useful features for BN
  - Pause at boundary, turn/no turn, f0 diff across boundary, rhyme duration
- Useful features for SB
  - Phone/rhyme duration before boundary, pause at boundary, turn/no turn, pause at preceding word boundary, time in turn

BN sentence boundaries vs SB

BN:
Model          ASR     Trans
Chance(nonb)   13.3    6.2
Comb HMM       11.7    3.3
LM only        11.8    4.1
Pros only      10.9    3.6

SB:
Model          ASR     Trans
Chance(nonb)   25.8    11.0
Comb HMM       22.5    4.0
LM only        22.8    4.3
Pros only      22.9    6.7

BN topic
Model          ASR     Trans
Chance(nonb)       .3
Comb HMM       .1438   .1377
LM only        .1897   .1895
Pros only      .1731   .1657

Topic segmentation results (BN only)
- Useful features
  - Pause at boundary, f0 range, turn/no turn, gender, time in turn
- Prosody alone better than LM
- Combined model improves significantly

Next Class
- Identifying Speech Acts
- Reading: This chapter of J&M is a beta version
- Please keep a diary for:
  - Any typos
  - Any passages you think are hard to follow
  - Any suggestions
- HW 3a due by class (2:40pm)