Recognizing Structure: Sentence and Topic Segmentation

Slides:

Advertisements

Similar presentations

Atomatic summarization of voic messages using lexical and prosodic features Koumpis and Renals Presented by Daniel Vassilev.

Advertisements

Human Speech Recognition Julia Hirschberg CS4706 (thanks to John-Paul Hosum for some slides)

5/5/20151 Recognizing Metadata: Segmentation and Disfluencies Julia Hirschberg CS 4706.

Automatic Prosodic Event Detection Using Acoustic, Lexical, and Syntactic Evidence Sankaranarayanan Ananthakrishnan, Shrikanth S. Narayanan IEEE 2007 Min-Hsuan.

1 Spoken Dialogue Systems Dialogue and Conversational Agents (Part IV) Chapter 19: Draft of May 18, 2005 Speech and Language Processing: An Introduction.

Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News Gina-Anne Levow University of Chicago SIGHAN July 25, 2004.

Comparing American and Palestinian Perceptions of Charisma Using Acoustic-Prosodic and Lexical Analysis Fadi Biadsy, Julia Hirschberg, Andrew Rosenberg,

Presented by Ravi Kiran. Julia Hirschberg Stefan Benus Jason M. Brenier Frank Enos Sarah Friedman Sarah Gilman Cynthia Girand Martin Graciarena Andreas.

Prosodic Cues to Discourse Segment Boundaries in Human-Computer Dialogue SIGDial 2004 Gina-Anne Levow April 30, 2004.

Spoken Language Processing Lab Who we are: Julia Hirschberg, Stefan Benus, Fadi Biadsy, Frank Enos, Agus Gravano, Jackson Liscombe, Sameer Maskey, Andrew.

Automatic Prosody Labeling Final Presentation Andrew Rosenberg ELEN Speech and Audio Processing and Recognition 4/27/05.

Context in Multilingual Tone and Pitch Accent Recognition Gina-Anne Levow University of Chicago September 7, 2005.

On the Correlation between Energy and Pitch Accent in Read English Speech Andrew Rosenberg, Julia Hirschberg Columbia University Interspeech /14/06.

On the Correlation between Energy and Pitch Accent in Read English Speech Andrew Rosenberg Weekly Speech Lab Talk 6/27/06.

Turn-taking in Mandarin Dialogue: Interactions of Tone and Intonation Gina-Anne Levow University of Chicago October 14, 2005.

Classification of Discourse Functions of Affirmative Words in Spoken Dialogue Julia Agustín Gravano, Stefan Benus, Julia Hirschberg Shira Mitchell, Ilia.

EE225D Final Project Text-Constrained Speaker Recognition Using Hidden Markov Models Kofi A. Boakye EE225D Final Project.

Varying Input Segmentation for Story Boundary Detection Julia Hirschberg GALE PI Meeting March 23, 2007.

DIVINES – Speech Rec. and Intrinsic Variation W.S.May 20, 2006 Richard Rose DIVINES SRIV Workshop The Influence of Word Detection Variability on IR Performance.

Copyright 2007, Toshiba Corporation. How (not) to Select Your Voice Corpus: Random Selection vs. Phonologically Balanced Tanya Lambert, Norbert Braunschweiler,

On Speaker-Specific Prosodic Models for Automatic Dialog Act Segmentation of Multi-Party Meetings Jáchym Kolář 1,2 Elizabeth Shriberg 1,3 Yang Liu 1,4.

Yun-Nung (Vivian) Chen, Yu Huang, Sheng-Yi Kong, Lin-Shan Lee National Taiwan University, Taiwan.

Turn-taking Discourse and Dialogue CS 359 November 6, 2001.

Automatic Cue-Based Dialogue Act Tagging Discourse & Dialogue CMSC November 3, 2006.

Predicting Student Emotions in Computer-Human Tutoring Dialogues Diane J. Litman&Kate Forbes-Riley University of Pittsburgh Department of Computer Science.

Recognizing Discourse Structure: Speech Discourse & Dialogue CMSC October 11, 2006.

1 Prosody-Based Automatic Segmentation of Speech into Sentences and Topics Elizabeth Shriberg Andreas Stolcke Speech Technology and Research Laboratory.

National Taiwan University, Taiwan

1 Broadcast News Segmentation using Metadata and Speech-To-Text Information to Improve Speech Recognition Sebastien Coquoz, Swiss Federal Institute of.

Combining Speech Attributes for Speech Recognition Jeremy Morris November 9, 2006.

Finding document topics for improving topic segmentation Source: ACL2007 Authors: Olivier Ferret (18 route du Panorama, BP6) Reporter:Yong-Xiang Chen.

Lexical, Prosodic, and Syntactics Cues for Dialog Acts.

Adapting Dialogue Models Discourse & Dialogue CMSC November 19, 2006.

1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.

Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning Speech Communication, 2000 Authors: S. M. Witt, S. J. Young Presenter:

Acoustic Cues to Emotional Speech Julia Hirschberg (joint work with Jennifer Venditti and Jackson Liscombe) Columbia University 26 June 2003.

A Text-free Approach to Assessing Nonnative Intonation Joseph Tepperman, Abe Kazemzadeh, and Shrikanth Narayanan Signal Analysis and Interpretation Laboratory,

Discourse Structure and Text Coherence

Investigating Pitch Accent Recognition in Non-native Speech

Towards Emotion Prediction in Spoken Tutoring Dialogues

Conditional Random Fields for ASR

Recognizing Disfluencies

Recognizing Structure: Dialogue Acts and Segmentation

Studying Intonation Julia Hirschberg CS /21/2018.

Intonational and Its Meanings

Intonational and Its Meanings

Prosody in Recognition/Understanding

Detecting Prosody Improvement in Oral Rereading

Dialogue Acts Julia Hirschberg CS /18/2018.

Comparing American and Palestinian Perceptions of Charisma Using Acoustic-Prosodic and Lexical Analysis Fadi Biadsy, Julia Hirschberg, Andrew Rosenberg,

Data Mining, Information Extraction and Search in Spoken Documents

Statistical NLP: Lecture 9

Turn-taking and Disfluencies

Studying Spoken Language Text 17, 18 and 19

Advanced NLP: Speech Research and Technologies

Recognizing Structure: Sentence, Speaker, andTopic Segmentation

Searching and Summarizing Speech

High Frequency Word Entrainment in Spoken Dialogue

Searching and Summarizing Speech

Data Mining, Information Extraction and Search in Spoken Documents

Agustín Gravano & Julia Hirschberg {agus,

Recognizing Disfluencies

Advanced NLP: Speech Research and Technologies

Discourse Structure in Generation

Tetsuya Nasukawa, IBM Tokyo Research Lab

Recognizing Structure: Dialogue Acts and Segmentation

Anthor: Andreas Tsiartas, Prasanta Kumar Ghosh,

Low Level Cues to Emotion

Acoustic-Prosodic and Lexical Entrainment in Deceptive Dialogue

Automatic Prosodic Event Detection

Presentation transcript:

Recognizing Structure: Sentence and Topic Segmentation Julia Hirschberg CS 4706 9/19/2018

Types of Discourse Structure in Spoken Corpora Domain independent Sentence/utterance boundaries Speaker turn segmentation Topic/story segmentation Domain dependent Broadcast news Meetings Telephone conversations 9/19/2018

Today Theoretical studies of discourse structure Text-based approaches and cues Speech-based approaches and cues Practical goals for speech segmentation and state-of-the-art 9/19/2018

Hierarchical Structure in Discourse? Welcome to word processing. That’s using a computer to type letters and reports. Make a typo? No problem. Just back up, type over the mistake, and it’s gone. And, it eliminates retyping. And, it eliminates retyping. 9/19/2018

Structures of Discourse Structure (Grosz & Sidner ‘86) Leading theory of discourse structure Three components: Linguistic structure What is said Assumption: divided into appropriate units of analysis (Discourse Segments) Intentional structure How are segments structured into topics, subtopics Relations: satisfaction-precedence, dominance Attentional structure 9/19/2018

How are these relations recognized in a discourse? Linguistic markers: tense and aspect cue phrases: now, well Inference of S intentions Inference from task structure Intonational Information 9/19/2018

Structural Information in Text Example: reviews of Ipod nano Cue phrases: now, well, first Pronominal reference Orthography and formatting -- in text Lexical information (Hearst ‘94, Reynar ’98, Beeferman et al ‘99): Domain dependent Domain independent 9/19/2018

Methods of Text Segmentation Lexical cohesion methods vs. multiple source Vocabulary similarity indicates topic cohesion Intuition from Halliday & Hasan ’76 Features to capture: Stem repetition Entity repetition Word frequency Context vectors Semantic similarity Word distance Methods: Sliding window 9/19/2018

Combine lexical cohesion with other cues Lexical chains Clustering Combine lexical cohesion with other cues Features Cue phrases Reference (e.g. pronouns) Syntactic features Methods Machine Learning from labeled corpora 9/19/2018

Choi 2000: An Example Implements leading methods and compares new algorithm to them on corpus of 700 concatenated documents Comparison algorithms: Baselines: No boundaries All boundaries Regular partition Random # of random partitions Actual # of random partitions 9/19/2018

Textiling Algorithm (Hearst ’94) DotPlot algorithms (Reynar ’98) Segmenter (Kan et al ’98) Choi ’00 proposal Cosine similarity measure on stems Same sentence: 1; no overlap 0 Similarity matrix  rank matrix Replace each cell with # of lower-valued neighbor cells, normalized by # of neighboring cells How likely is this sentence to be a boundary, compared to other sentences? Minimize effect of outliers 9/19/2018

Choi’s algorithm performs best (9-12% error) Divisive clustering based on D(n) = sum of rank values (sI,j) of segment n/ inside area of segment n (j-i+1) – for i,j the sentences at the beginning and end of segment n Homogeneous segments have large rank values within a small area of the matrix Keep dividing the corpus until D(n) = D(n) - D(n-1) shows little change Choi’s algorithm performs best (9-12% error) 9/19/2018

Acoustic and Prosodic Cues to Segmentation Intuition: Speakers vary acoustic and prosodic cues to convey variation in discourse structure Systematic? In read or spontaneous speech? Evidence: Observations from recorded corpora Laboratory experiments Machine learning of discourse structure from acoustic/prosodic features 9/19/2018

Spoken Cues to Discourse/Topic Structure Pitch range Lehiste ’75, Brown et al ’83, Silverman ’86, Avesani & Vayra ’88, Ayers ’92, Swerts et al ’92, Grosz & Hirschberg’92, Swerts & Ostendorf ’95, Hirschberg & Nakatani ‘96 Preceding pause Lehiste ’79, Chafe ’80, Brown et al ’83, Silverman ’86, Woodbury ’87, Avesani & Vayra ’88, Grosz & Hirschberg’92, Passoneau & Litman ’93, Hirschberg & Nakatani ‘96 9/19/2018

Rate Amplitude Contour Butterworth ’75, Lehiste ’80, Grosz & Hirschberg’92, Hirschberg & Nakatani ‘96 Amplitude Brown et al ’83, Grosz & Hirschberg’92, Hirschberg & Nakatani ‘96 Contour Brown et al ’83, Woodbury ’87, Swerts et al ‘92 Add Audix tree?? 9/19/2018

A Practical Problem: Finding Sentence and Topic/Story Boundaries in ASR Transcripts Motivation: Finding sentences critical to further syntactic/semantic analysis Topic/story id important to identify common regions for q/a, extraction Features: Lexical cues Domain dependent Sensitive to ASR performance Acoustic/prosodic cues Domain independent Sensitive to speaker identity Statistical, Machine Learning approaches with large segmented corpora (e.g. Broadcast News) 9/19/2018

ASR Transcription aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please 9/19/2018

Speaker segmentation (Diarization) Speaker: 0 - aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston Speaker: 1 - the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please 9/19/2018

Sentence detection, punctuation Speaker: Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston. Speaker: Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please. 9/19/2018

Story boundary detection Speaker: Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston. Speaker: Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please. 9/19/2018

Prosodic Cues (Shriberg et al ’00) Text-based segmentation is fine…if you have reliable text Could prosodic cues perform as well or better at sentence and topic segmentation in ASR transcripts? – more robust? – more general? Goal: identify sentence and topic boundaries at ASR-defined word boundaries CART decision trees and LM HMM combined prosodic and LM results 9/19/2018

Features --for each potential boundary location: Pause at boundary (raw and normalized by speaker) Pause at word before boundary (is this a new ‘turn’ or part of continuous speech segment?) Phone and rhyme duration (normalized by inherent duration) (phrase-final lengthening?) F0 (smoothed and stylized): reset, range (topline, baseline), slope and continuity Voice quality (halving/doubling estimates as correlates of creak or glottalization) Speaker change, time from start of turn, # turns in conversation and gender Trained/tested on Switchboard and Broadcast News Raw pause worked better than normalized Speaker id/change hand marked apparently F0 reset measured before and after potential boundary, range is defined over preceding word and compared to speaker-specific parameters 9/19/2018

Sentence segmentation results Prosodic only model Better than LM for BN Worse (on hand transcription) and same (for ASR transcript) on SB Slightly improves LM on SB Useful features for BN Pause at boundary, turn change/no turn change, f0 diff across boundary, rhyme duration Useful features for SB Phone/rhyme duration before boundary, pause at boundary, turn/no turn, pause at preceding word boundary, time in turn BN sentence boundaries vs SB ASR Trans Model 13.3 6.2 Chance(nonb) 11.7 3.3 Comb HMM 11.8 4.1 LM only 10.9 3.6 Pros only ASR Trans Model 25.8 11.0 Chance(nonb) 22.5 4.0 Comb HMM 22.8 4.3 LM only 22.9 6.7 Pros only 9/19/2018 BN topic vs SB ASR Trans Model 13.3 6.2 Chance(nonb) 11.7 3.3 Comb HMM 11.8 4.1 LM only 10.9 3.6 Pros only ASR Trans Model .3 Chance(nonb) .1438 .1377 Comb HMM .1897 .1895 LM only .1731 .1657 Pros only

Topic segmentation results (BN only): Useful features Pause at boundary, f0 range, turn/no turn, gender, time in turn Prosody alone better than LM Combined model improves significantly 9/19/2018

Recent Work on Story Segmentation (Rosenberg et al ’07) Story Segmentation goal: Divide each show into homogenous regions, each about a single topic Task: focussed Q/A Issue: what unit of anlysis should we use in assessing potential boundaries? 9/19/2018

TDT-4 Corpus English: 312.5 hours, 250 broadcasts, 6 shows Arabic: 88.5 hours, 109 broadcasts, 2 shows Mandarin: 109 hours, 134 broadcasts, 3 shows Manually annotated story boundaries ASR Hypotheses Speaker Diarization Hypotheses 9/19/2018

Approach Identify set of segments which define: Unit of analysis Candidate boundaries Classify each candidate boundary based on features extracted from segments C4.5 Decision Tree Model each show-type separately E.g. CNN “Headline News” and ABC “World News Tonight have distinct models Evaluate using WindowDiff with k=100 Diff between # boundaries in true seg and proposed, within sliding window over each – normalizing sum of diffs by number of sentences – window size Lower score is better 9/19/2018

Segment Boundary Modeling Features Acoustic Pitch & Intensity speaker normalized min, mean, max, stdev, slope Speaking Rate vowels/sec, voiced frames/sec Final Vowel, Rhyme Length Pause Length Lexical TextTiling scores LCSeg scores Story beginning and ending keywords Structural Position in show Speaker participation First or last speaker turn? 9/19/2018

Input Segmentations ASR Word boundaries Hypothesized Sentence Units No segmentation baseline Hypothesized Sentence Units Boundaries with 0.5, 0.3 and 0.1 confidence thresholds Pause-based Segmentation Boundaries at pauses over 500ms and 250ms Hypothesized Intonational Phrases 9/19/2018

Hypothesizing Intonational Phrases ~30 minutes manually annotated ASR BN from reserved TDT-4 CNN show. Phrase was marked if a phrase boundary occurred since the previous word boundary. C4.5 Decision Tree Pitch, Energy and Duration Features Normalized by hypothesized speaker id and surrounding context 66.5% F-Measure (p=.683, r=.647) 9/19/2018

Story Segmentation Results 9/19/2018

Input Segmentation Statistics Exact Story Boundary Coverage (pct.) Mean Distance to Nearest True Boundary (words) Segment to True Boundary Ratio 9/19/2018

Conclusions and Future Work Best Performance: Low threshold (0.1) sentences Short pause (250ms) segmentation Hyp. IPs perform better than sentences. Would increased SU, IP accuracy improve story segmentation? External evaluation: impact on IR and MT performance. 9/19/2018

Next Class Spoken Dialogue Systems 9/19/2018