Recognizing Discourse Structure: Speech
Discourse & Dialogue, CMSC 35900-1
October 11, 2006
Roadmap
Recognizing discourse structure in speech
Analyzing spoken monologue
Automatic topic segmentation
–Acoustic cues, text cues, and their integration
Conclusions & plans
Recognizing Discourse Structure
Hypothesis:
–Discourse can be decomposed into subunits
Formal written text:
–Clues to structure: paragraphs, chapters, sections
Spoken discourse:
–Lacks orthographic cues
–Are compensating features available?
Prosody & Discourse Structure
Discourse structure model (Grosz & Sidner 1986):
–Global structure: discourse segments, embedding
–Local structure: prominence, salience
Linguistic structure includes intonation
–Signals global or local structure
–Use of phrases to signal global structure
–Signals parentheticals
Intonational Features
Theoretical framework:
–Tone and Break Index (ToBI; Pierrehumbert)
–Tones: pitch contours; breaks: phrase units
–"Intermediate" phrases are the basic units
Features:
–Pitch range within and between phrases
–Amplitude (loudness)
–Pitch contour type
–Speaking rate (syllables/sec)
–Inter-phrase pause duration
Speech Corpora
Corpora vary on:
–Speaker type: professional vs. non-professional
–Speaking style: read vs. spontaneous
–Speech content: news, directions, etc.
This variability shows up in the prosody as well.
Pilot Study I: Newswire
3 AP newswire stories, professionally read
Manual segmentation from text only vs. from speech
–Consensus labels: segment-beginning (SB) and segment-final (SF)
Correlation of pitch range, amplitude, and speaking rate with structure
–Structure can be identified via the hand-labelings
Issues:
–Labeling is difficult; broadcast-news speech is idiosyncratic
Pilot Study II: Prominence and Discourse
Prominence: accent/stress on a word
–Typically associated with NEW information
–Contrast: locally NEW (within a segment) vs. globally NEW
Analyzed all NPs in 20 minutes of spontaneous speech
Differences in position and form have an influence:
–Full forms are accented; pronouns etc. are not
–Mismatches imply a role for global vs. local status
Issues:
–Labeling is difficult; speakers vary between full names and pronouns
Direction-giving Corpus
Spontaneous and read speech from non-professional speakers
–Task-oriented: giving directions, varied in complexity
–Speakers returned later to read their own original transcriptions
Discourse segment labeling: text vs. speech
–More consensus labels for speech than for text
–Speech allows more reliable segmentation
–Spontaneous speech segmented more reliably than read (for segment-medial boundaries)
Acoustic Analysis
Features: max/mean f0 (pitch), amplitude, speaking rate, pause (preceding/following)
Findings:
–Segment beginnings: higher max/mean f0 and amplitude; shorter following pause (longer preceding pause in read speech)
–Segment endings: lower max/mean f0 and amplitude
–Similar for text and speech annotations
Issues: single speaker
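To make the feature list concrete, here is a minimal sketch of how such per-phrase measurements might be computed, using librosa's pYIN pitch tracker and RMS energy. The phrase times are assumed to come from an existing phrase segmentation; this is an illustration, not the study's actual pipeline.

```python
# Minimal sketch: per-phrase prosodic features (max/mean f0, amplitude,
# following pause), assuming phrase boundaries are already known.
# Illustrative only; not the pipeline used in the study.
import numpy as np
import librosa

def phrase_features(y, sr, start_s, end_s, next_start_s):
    seg = y[int(start_s * sr):int(end_s * sr)]
    f0, voiced, _ = librosa.pyin(seg,
                                 fmin=librosa.note_to_hz('C2'),
                                 fmax=librosa.note_to_hz('C6'),
                                 sr=sr)
    f0 = f0[voiced]                      # keep voiced frames only
    rms = librosa.feature.rms(y=seg)[0]  # frame-level amplitude
    return {
        'max_f0': float(np.max(f0)) if f0.size else 0.0,
        'mean_f0': float(np.mean(f0)) if f0.size else 0.0,
        'mean_amplitude': float(np.mean(rms)),
        'pause_after_s': next_start_s - end_s,  # inter-phrase pause
    }
```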
Prominence and Discourse
NPs annotated for:
–Lexical form (full NP vs. pronoun), grammatical role, surface position (sentence/phrase), accent
Findings:
–23% of NPs show reduced stress
–Effects of lexical form and grammatical role
–Repetition does not necessarily mean reduction
–Reduced forms also appear in contrasts
Summary
Clear prosodic cues to discourse structure:
–Across speakers, speaking styles, and content
–Initiation: high max/average pitch and amplitude; preceding pause
–Finality is the converse
Information status:
–Few clear correlates with accentuation
–Mediated by lexical form and grammatical role
Prosodic and Lexical Cues to Topic Segmentation
Broadcast news story-level segmentation
–Television and radio
Contrast with GHN:
–Fully automatic: transcription, prosodic labeling
–Large data set with multiple speakers
–All teleprompted news
Possible Signals
Lexical topic similarity in vector space (Hearst 1994)
Lexical discourse cues (Beeferman et al.)
–E.g., "CNN" in a reporter sign-off
–HMM topic model
Prosodic cues:
–Pitch, loudness, duration, speaker change, …
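The Hearst (1994) signal can be sketched concretely: score each candidate gap by the cosine similarity of word-count vectors on either side, and treat low-similarity valleys as candidate topic boundaries. The block size below is an illustrative choice, not a value from the talk.

```python
# Sketch: Hearst (1994)-style lexical cohesion between adjacent blocks.
# Block size is illustrative, not a value from the talk.
import numpy as np
from collections import Counter

def cohesion_scores(tokens, block=20):
    """Cosine similarity of word-count vectors on each side of every gap."""
    scores = []
    for gap in range(block, len(tokens) - block):
        left = Counter(tokens[gap - block:gap])
        right = Counter(tokens[gap:gap + block])
        shared = set(left) | set(right)
        l = np.array([left[w] for w in shared], dtype=float)
        r = np.array([right[w] for w in shared], dtype=float)
        scores.append(l @ r / (np.linalg.norm(l) * np.linalg.norm(r)))
    return scores  # low valleys suggest candidate topic boundaries
```

A boundary detector would then look for sufficiently deep valleys in this score sequence.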
Basic Approach
Chop the audio stream into "sentences"
Group the "sentences" into topics
Classify each sentence boundary as a topic boundary or not
Probabilistic framework:
–argmax_B Pr(B|W,F), where B is the sequence of boundary decisions, W the words, and F the prosodic features
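Read literally, the criterion searches over all boundary labelings. The talk's models factor this search (per-gap trees, or Viterbi over an HMM), but a brute-force sketch makes the objective explicit; the score function below is a stand-in for Pr(B|W,F), not a model from the talk.

```python
# Brute-force sketch of argmax_B Pr(B|W,F): enumerate all 2^n labelings.
# Only feasible for tiny n; real decoding factors the search.
from itertools import product

def decode(n_gaps, score):
    """Return the 0/1 boundary tuple B maximizing score(B) ~ Pr(B|W,F)."""
    return max(product((0, 1), repeat=n_gaps), key=score)
```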
Prosodic Classification
Features:
–Pitch (f0) before and after the possible boundary
–Duration: final phoneme, final rhyme, pause
–No amplitude features: viewed as redundant with pitch
Classifier: decision trees
–Features selected by a wrapper loop on the training data
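A minimal sketch of such a classifier, using scikit-learn in place of whatever tree learner the original work used; the feature values and the omission of wrapper-based feature selection are illustrative simplifications.

```python
# Sketch: a decision tree over prosodic features at candidate boundaries.
# Data and feature layout are hypothetical; wrapper-based feature
# selection from the talk is omitted.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# columns: pause duration (s), f0 difference across the gap (Hz),
# speaker change (0/1)
X = np.array([[0.9, -40.0, 1],
              [0.1,   5.0, 0],
              [1.3, -60.0, 1],
              [0.2,  10.0, 0]])
y = np.array([1, 0, 1, 0])  # 1 = topic boundary

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
p_boundary = tree.predict_proba(X)[:, 1]  # Pr(topic boundary | prosody)
```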
Lexical Classification
HMM topic language models:
–Train one language model per topic
–Begin/end states; trained on previously seen topics
–Later augmented with topic-boundary states
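As a rough stand-in for the per-topic language models, here is a sketch using unigram LMs with add-alpha smoothing; the actual system's LM form and HMM topology go beyond what the slide specifies.

```python
# Sketch: unigram topic LMs scoring a sentence, as a stand-in for the
# talk's HMM topic language models. Smoothing scheme is illustrative.
import math
from collections import Counter

def train_unigram(sentences, vocab_size, alpha=1.0):
    """Return a log-probability function for one topic's language model."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    return lambda w: math.log((counts[w] + alpha) / (total + alpha * vocab_size))

def sentence_loglik(logprob, sentence):
    """Emission score of one sentence under one topic state."""
    return sum(logprob(w) for w in sentence)
```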
Integrating Models
With decision trees:
–Incorporate the HMM topic-boundary probability as an additional feature
–Label a boundary when the score exceeds some threshold
With HMMs:
–Use the prosodic trees to estimate likelihoods
–Use standard Viterbi decoding to find the best boundary sequence
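The HMM-side combination can be sketched as Viterbi decoding over a two-state (boundary / no-boundary) chain, where each gap's emission score combines the lexical and prosodic log-likelihoods. The states, transition scores, and scoring interface below are my illustrative choices, not the talk's exact model.

```python
# Sketch: Viterbi over a two-state chain. emit[t][s] is the combined
# lexical + prosodic log-likelihood of state s at gap t; trans[p][s] is
# an (illustrative) transition log score from state p to state s.
def viterbi(emit, trans, n):
    states = (0, 1)  # 0 = no boundary, 1 = topic boundary
    V = [{s: emit[0][s] for s in states}]
    back = []
    for t in range(1, n):
        back.append({})
        row = {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + trans[p][s])
            back[-1][s] = prev
            row[s] = V[-1][prev] + trans[prev][s] + emit[t][s]
        V.append(row)
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for bp in reversed(back):    # trace back-pointers to recover the path
        path.append(bp[path[-1]])
    return path[::-1]            # best boundary/no-boundary sequence
```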
Testing & Evaluation
Based on 6 shows; 104 shows used for training
Used ASR output for words and positions
–Contrasted with correct forced-alignment transcripts
Used manual speaker segmentation
Evaluated with a rather bizarre cost metric
Basic units: chop the stream at pauses of 0.572 sec or longer
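The pause-based chopping is simple enough to sketch directly; the 0.572 s threshold is from the slide, while the word-timing representation is an assumption of mine.

```python
# Sketch: forming the basic "sentence" units by chopping at long pauses.
# The 0.572 s threshold is from the slide; word timings are assumed
# available as (token, start_s, end_s) tuples.
def chop_at_pauses(words, pause_threshold=0.572):
    units, current = [], []
    for i, (tok, start, end) in enumerate(words):
        current.append(tok)
        next_start = words[i + 1][1] if i + 1 < len(words) else None
        if next_start is None or next_start - end >= pause_threshold:
            units.append(current)  # close the unit at a long pause
            current = []
    return units
```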
Decision Tree Classification
Prosody-only features:
–Pause duration, f0 difference, speaker change, gender
–Consistent with GHN
–Gender? Different styles for males and females
Combined features:
–HMM LM likelihoods, pause duration, f0 difference
Best Results
Best results come from integrating prosodic and lexical cues
HMM-based model combination works better
–Decision-tree thresholding is inconsistent
–Improves over the HMM classifier alone