Recognizing Discourse Structure: Speech
Discourse & Dialogue, CMSC 35900-1
October 11, 2006
Roadmap
Recognizing discourse structure in speech
Analyzing spoken monologue
Automatic topic segmentation
–Acoustic cues, text cues, and their integration
Conclusions & plans
Recognizing Discourse Structure
Hypothesis:
–Discourse can be decomposed into subunits
Formal written text:
–Clues to structure: paragraphs, chapters, sections
Spoken discourse:
–Lacks orthographic cues
–Are compensating features available?
Prosody & Discourse Structure
Discourse structure model (Grosz & Sidner 1986):
–Global structure: discourse segments, embedding
–Local structure: prominence, salience
Linguistic structure includes intonation
–Signals global or local structure
–Use of phrases to signal global structure
–Signals parentheticals
Intonational Features
Theoretical framework:
–Tone and Break Index (ToBI; Pierrehumbert)
–Tones: pitch contours; breaks: phrase units
–"Intermediate" phrases are the basic units
Features:
–Pitch range within and between phrases
–Amplitude (loudness)
–Pitch contour type
–Speaking rate (syllables/sec)
–Inter-phrase pause duration
Speech Corpora
Corpora vary on:
–Speaker type: professional vs. non-professional
–Speaking style: read vs. spontaneous
–Speech content: news, directions, etc.
This variability shows up in the prosody as well.
Pilot Study I: Newswire
3 AP newswire stories, professionally read
Manual segmentation from text only vs. from speech
–Consensus labels: segment-beginning (SB) and segment-final (SF)
Correlation of pitch range, amplitude, and speaking rate with structure
–Structure can be identified via the hand-labelings
Issues:
–Labeling is difficult; broadcast-news speech is idiosyncratic
Pilot Study II: Prominence and Discourse
Prominence: accent/stress on a word
–Typically associated with NEW information
–Contrast: locally NEW (within a segment) vs. globally NEW
Analyzed all NPs in 20 minutes of spontaneous speech
Differences in position and form have an influence:
–Full forms are accented; pronouns etc. are not
–Mismatches imply a role for global vs. local status
Issues:
–Labeling is difficult; speakers vary between full names and pronouns
Direction-giving Corpus
Spontaneous and read speech from non-professional speakers
–Task-oriented: giving directions, varied in complexity
–Speakers returned later to read their own original transcriptions
Discourse segment labeling: text vs. speech
–More consensus labels for speech than for text
–Speech allows more reliable segmentation
–Spontaneous speech segmented more reliably than read (for segment-medial boundaries)
Acoustic Analysis
Features: max/mean f0 (pitch), amplitude, speaking rate, pause (preceding/following)
Findings:
–Segment beginnings: higher max/mean f0 and amplitude; shorter following pause (longer preceding pause in read speech)
–Segment endings: lower max/mean f0 and amplitude
–Similar for text and speech annotations
Issues: single speaker
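To make the feature list concrete, here is a minimal sketch of how such per-phrase measurements might be computed, using librosa's pYIN pitch tracker and RMS energy. The phrase times are assumed to come from an existing phrase segmentation; this is an illustration, not the study's actual pipeline.

```python
# Minimal sketch: per-phrase prosodic features (max/mean f0, amplitude,
# following pause), assuming phrase boundaries are already known.
# Illustrative only; not the pipeline used in the study.
import numpy as np
import librosa

def phrase_features(y, sr, start_s, end_s, next_start_s):
    seg = y[int(start_s * sr):int(end_s * sr)]
    f0, voiced, _ = librosa.pyin(seg,
                                 fmin=librosa.note_to_hz('C2'),
                                 fmax=librosa.note_to_hz('C6'),
                                 sr=sr)
    f0 = f0[voiced]                      # keep voiced frames only
    rms = librosa.feature.rms(y=seg)[0]  # frame-level amplitude
    return {
        'max_f0': float(np.max(f0)) if f0.size else 0.0,
        'mean_f0': float(np.mean(f0)) if f0.size else 0.0,
        'mean_amplitude': float(np.mean(rms)),
        'pause_after_s': next_start_s - end_s,  # inter-phrase pause
    }
```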
Prominence and Discourse
NPs annotated for:
–Lexical form (full NP vs. pronoun), grammatical role, surface position (sentence/phrase), accent
Findings:
–23% of NPs show reduced stress
–Effects of lexical form and grammatical role
–Repetition does not necessarily mean reduction
–Reduced forms also appear in contrasts
Summary
Clear prosodic cues to discourse structure:
–Across speakers, speaking styles, and content
–Initiation: high max/average pitch and amplitude; preceding pause
–Finality is the converse
Information status:
–Few clear correlates with accentuation
–Mediated by lexical form and grammatical role
Prosodic and Lexical Cues to Topic Segmentation
Broadcast news story-level segmentation
–Television and radio
Contrast with GHN:
–Fully automatic: transcription, prosodic labeling
–Large data set with multiple speakers
–All teleprompted news
Possible Signals
Lexical topic similarity in vector space (Hearst 1994)
Lexical discourse cues (Beeferman et al.)
–E.g., "CNN" in a reporter sign-off
–HMM topic model
Prosodic cues:
–Pitch, loudness, duration, speaker change, …
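The Hearst (1994) signal can be sketched concretely: score each candidate gap by the cosine similarity of word-count vectors on either side, and treat low-similarity valleys as candidate topic boundaries. The block size below is an illustrative choice, not a value from the talk.

```python
# Sketch: Hearst (1994)-style lexical cohesion between adjacent blocks.
# Block size is illustrative, not a value from the talk.
import numpy as np
from collections import Counter

def cohesion_scores(tokens, block=20):
    """Cosine similarity of word-count vectors on each side of every gap."""
    scores = []
    for gap in range(block, len(tokens) - block):
        left = Counter(tokens[gap - block:gap])
        right = Counter(tokens[gap:gap + block])
        shared = set(left) | set(right)
        l = np.array([left[w] for w in shared], dtype=float)
        r = np.array([right[w] for w in shared], dtype=float)
        scores.append(l @ r / (np.linalg.norm(l) * np.linalg.norm(r)))
    return scores  # low valleys suggest candidate topic boundaries
```

A boundary detector would then look for sufficiently deep valleys in this score sequence.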
Basic Approach
Chop the audio stream into "sentences"
Group the "sentences" into topics
Classify each sentence boundary as a topic boundary or not
Probabilistic framework:
–argmax_B Pr(B|W,F), where B is the sequence of boundary decisions, W the words, and F the prosodic features
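Read literally, the criterion searches over all boundary labelings. The talk's models factor this search (per-gap trees, or Viterbi over an HMM), but a brute-force sketch makes the objective explicit; the score function below is a stand-in for Pr(B|W,F), not a model from the talk.

```python
# Brute-force sketch of argmax_B Pr(B|W,F): enumerate all 2^n labelings.
# Only feasible for tiny n; real decoding factors the search.
from itertools import product

def decode(n_gaps, score):
    """Return the 0/1 boundary tuple B maximizing score(B) ~ Pr(B|W,F)."""
    return max(product((0, 1), repeat=n_gaps), key=score)
```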
Prosodic Classification
Features:
–Pitch (f0) before and after the possible boundary
–Duration: final phoneme, final rhyme, pause
–No amplitude features: viewed as redundant with pitch
Classifier: decision trees
–Features selected by a wrapper loop on the training data
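A minimal sketch of such a classifier, using scikit-learn in place of whatever tree learner the original work used; the feature values and the omission of wrapper-based feature selection are illustrative simplifications.

```python
# Sketch: a decision tree over prosodic features at candidate boundaries.
# Data and feature layout are hypothetical; wrapper-based feature
# selection from the talk is omitted.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# columns: pause duration (s), f0 difference across the gap (Hz),
# speaker change (0/1)
X = np.array([[0.9, -40.0, 1],
              [0.1,   5.0, 0],
              [1.3, -60.0, 1],
              [0.2,  10.0, 0]])
y = np.array([1, 0, 1, 0])  # 1 = topic boundary

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
p_boundary = tree.predict_proba(X)[:, 1]  # Pr(topic boundary | prosody)
```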
Lexical Classification
HMM topic language models:
–Train one language model per topic
–Begin/end states; trained on previously seen topics
–Later augmented with topic-boundary states
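As a rough stand-in for the per-topic language models, here is a sketch using unigram LMs with add-alpha smoothing; the actual system's LM form and HMM topology go beyond what the slide specifies.

```python
# Sketch: unigram topic LMs scoring a sentence, as a stand-in for the
# talk's HMM topic language models. Smoothing scheme is illustrative.
import math
from collections import Counter

def train_unigram(sentences, vocab_size, alpha=1.0):
    """Return a log-probability function for one topic's language model."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    return lambda w: math.log((counts[w] + alpha) / (total + alpha * vocab_size))

def sentence_loglik(logprob, sentence):
    """Emission score of one sentence under one topic state."""
    return sum(logprob(w) for w in sentence)
```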
Integrating Models
With decision trees:
–Incorporate the HMM topic-boundary probability as an additional feature
–Label a boundary when the score exceeds some threshold
With HMMs:
–Use the prosodic trees to estimate likelihoods
–Use standard Viterbi decoding to find the best boundary sequence
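The HMM-side combination can be sketched as Viterbi decoding over a two-state (boundary / no-boundary) chain, where each gap's emission score combines the lexical and prosodic log-likelihoods. The states, transition scores, and scoring interface below are my illustrative choices, not the talk's exact model.

```python
# Sketch: Viterbi over a two-state chain. emit[t][s] is the combined
# lexical + prosodic log-likelihood of state s at gap t; trans[p][s] is
# an (illustrative) transition log score from state p to state s.
def viterbi(emit, trans, n):
    states = (0, 1)  # 0 = no boundary, 1 = topic boundary
    V = [{s: emit[0][s] for s in states}]
    back = []
    for t in range(1, n):
        back.append({})
        row = {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + trans[p][s])
            back[-1][s] = prev
            row[s] = V[-1][prev] + trans[prev][s] + emit[t][s]
        V.append(row)
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for bp in reversed(back):    # trace back-pointers to recover the path
        path.append(bp[path[-1]])
    return path[::-1]            # best boundary/no-boundary sequence
```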
Testing & Evaluation
Based on 6 shows; 104 shows used for training
Used ASR output for words and positions
–Contrasted with correct forced-alignment transcripts
Used manual speaker segmentation
Evaluated with a rather bizarre cost metric
Basic units: chop the stream at pauses of 0.572 sec or longer
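The pause-based chopping is simple enough to sketch directly; the 0.572 s threshold is from the slide, while the word-timing representation is an assumption of mine.

```python
# Sketch: forming the basic "sentence" units by chopping at long pauses.
# The 0.572 s threshold is from the slide; word timings are assumed
# available as (token, start_s, end_s) tuples.
def chop_at_pauses(words, pause_threshold=0.572):
    units, current = [], []
    for i, (tok, start, end) in enumerate(words):
        current.append(tok)
        next_start = words[i + 1][1] if i + 1 < len(words) else None
        if next_start is None or next_start - end >= pause_threshold:
            units.append(current)  # close the unit at a long pause
            current = []
    return units
```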
Decision Tree Classification
Prosody-only features:
–Pause duration, f0 difference, speaker change, gender
–Consistent with GHN
–Gender? Different styles for males and females
Combined features:
–HMM LM likelihoods, pause duration, f0 difference
Best Results
Best results come from integrating prosodic and lexical cues
HMM-based model combination works better
–Decision-tree thresholding is inconsistent
–Improves over the HMM classifier alone