Towards a model of speech production: Cognitive modeling and computational applications. Michelle L. Gregory, SNeRG 2003.

Outline. Where I’ve been: predictability effects on word duration; predictability effects on pitch accent. Where I’m at: computational model of pitch accent; prosodic information to aid parsing; psychological models of production. Where I’d like to go: prosody; disfluencies; speech synthesis.

Where I’ve been Bad idea despair Cute! knowledge

Where I’m at… CU-BOULDER LINGUISTICS PROFESSOR WINS 2002 MACARTHUR FELLOWSHIP

Where I’m headed …

Predictability effects on word duration. Methodology: Corpus and design (Switchboard, regression). Measures of predictability (frequency, bigram, joint, mutual information, repetition). Function words (top ten most frequent): vowel reduction, coda deletion, duration. Content words: t/d deletion, duration.
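The predictability measures listed on this slide can be illustrated with a minimal sketch. It assumes plain maximum-likelihood estimates from raw counts, no smoothing, and base-2 logarithms; the slides do not say how the measures were actually estimated, and the function name `predictability_measures` is hypothetical.

```python
import math
from collections import Counter

def predictability_measures(tokens):
    """Return a scorer for adjacent word pairs, estimated from `tokens`.

    Assumption: unsmoothed maximum-likelihood estimates; a real study
    would need a large corpus and probably some smoothing.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    def measures(prev, word):
        pair_count = bigrams[(prev, word)]
        if pair_count == 0:
            return None  # unseen pair: undefined without smoothing
        p_word = unigrams[word] / n_uni          # relative frequency
        p_prev = unigrams[prev] / n_uni
        p_joint = pair_count / n_bi              # joint probability of the pair
        p_cond = pair_count / unigrams[prev]     # conditional bigram P(word | prev)
        mi = math.log2(p_joint / (p_prev * p_word))  # pointwise mutual information
        return {
            "frequency": p_word,
            "conditional": p_cond,
            "joint": p_joint,
            "mutual_information": mi,
        }

    return measures

# Toy usage on a six-word string (illustrative only, not corpus data).
score = predictability_measures("the cat sat on the mat".split())
m = score("the", "cat")
```

Each value could then serve as one predictor in a regression on duration or reduction, which is the kind of design the slide describes.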

Predictability effects on word duration. The probabilistic reduction hypothesis: the higher the probability of a word, the more it is reduced/shortened/lenited in lexical production (Gregory et al. 1999; Jurafsky, Bell, Gregory, and Raymond 2000). Implications: Any factor that increases the probability of a word also increases phonological reduction. Is that the only role of probabilistic information?

Predictability effects on pitch accent. Same database and regression models, but this time coded for pitch accent. What is pitch accent? A perceptual phenomenon (Hirschberg, 1993), associated with the duration, amplitude, and F0 of units. Words that appear more intonationally prominent than others are said to bear pitch accent.

Predictability and pitch accent Pitch accent is associated with meaning:

Predictability effects on pitch accent. Results: More predictable words are less likely to bear pitch accent (this is true for all parts of speech), as measured by frequency, conditional bigram probability, joint probability, semantic relatedness, and repetition. These are not all the same measures that affect reduction; e.g., preceding context is more important with pitch accent. Implications: The role of predictability is not limited to reduction processes. Predictability is not just a fact about lexical access; this information is available during phonological encoding. Prosody in speech synthesis is rudimentary, and a probabilistic model is (relatively) easy to implement.

Current Research Computational model of pitch accent Prosodic information to aid parsing Psychological models of production

Computational model of pitch accent (joint work with Yasemin Altun). Problem: Predicting accent is not an exact science. Hirschberg (1993) and Pan & Hirschberg (2000) demonstrate that frequency and conditional probability increase accuracy in pitch accent prediction. Accuracy by predictor set: (1) function vs. content word only, 68%; (2) frequency and conditional probability, 71%; both 1 and 2, 73%. Will the addition of more/different probabilistic variables increase accuracy as well?
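To make the function-vs-content baseline concrete, here is a minimal sketch of how such a rule could be scored. The tag set `FUNCTION_POS`, the helper names, and the five toy examples are all assumptions made for illustration; they are not the data or the tag inventory behind the figures on the slide.

```python
# Hypothetical set of function-word part-of-speech tags (an assumption).
FUNCTION_POS = {"DT", "IN", "PRP", "CC", "TO", "MD"}

def baseline_predict(pos):
    """Baseline rule: content words are accented, function words are not."""
    return pos not in FUNCTION_POS

def accuracy(examples, predict):
    """Share of (word, pos, accented) examples the rule labels correctly."""
    correct = sum(predict(pos) == accented for _, pos, accented in examples)
    return correct / len(examples)

# Toy, hand-made examples (not corpus data): an accented pronoun is the
# kind of case a function/content rule gets wrong.
toy = [
    ("move", "VB", True),
    ("the", "DT", False),
    ("red", "JJ", True),
    ("cup", "NN", True),
    ("it", "PRP", True),
]
toy_accuracy = accuracy(toy, baseline_predict)
```

A probabilistic model would replace `baseline_predict` with a classifier over frequency, conditional probability, and the other variables under test, scored with the same `accuracy` function.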

Computational model of pitch accent. Testing more variables: joint probability and reverse conditional probability; the effects of surrounding accents; finer-grained part of speech; things like rate of speech, etc.

Prosodic information to aid parsing (joint work with Mark Johnson and Eugene Charniak). Problem: Parsing conversational speech is difficult. Accuracy of parsing: Wall Street Journal (WSJ), 90% (Charniak 2000); Switchboard (swbd), 84.5%; WSJ, no punctuation, 86%; swbd, no punctuation, 81%. Proposal: Add prosodic features instead of punctuation.

Prosodic information to aid parsing. Methodology: Get timing information from the transcripts. Add pause duration information as a term in the parser (use pauses as a cue instead of punctuation), for sentence-internal punctuation only. See http://cog.brown.edu:16080/~mj/papers/acl02-emptynodes.pdf. Results: Accuracy goes down (80%). Because the language model is not as strong?
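One way to picture "pauses as a cue instead of punctuation" is to turn inter-word gaps into pseudo-tokens. The sketch below is an assumption about how that could be done, not the method used in the work described: the thresholds, the token names `<PAUSE_SHORT>` and `<PAUSE_LONG>`, and the function name are all hypothetical.

```python
def insert_pause_tokens(words, min_pause=0.2, long_pause=0.5):
    """Insert pause pseudo-tokens between time-aligned words.

    `words` is a list of (token, start_sec, end_sec) tuples.
    Thresholds are illustrative assumptions, not values from the talk.
    """
    out = []
    for i, (token, start, _end) in enumerate(words):
        if i > 0:
            gap = start - words[i - 1][2]  # silence since previous word ended
            if gap >= long_pause:
                out.append("<PAUSE_LONG>")
            elif gap >= min_pause:
                out.append("<PAUSE_SHORT>")
        out.append(token)
    return out

# Toy alignment (made-up timings): a 0.6 s silence before "red".
aligned = [
    ("move", 0.0, 0.3),
    ("the", 0.3, 0.4),
    ("red", 1.0, 1.3),
    ("cup", 1.35, 1.6),
]
tokens_with_pauses = insert_pause_tokens(aligned)
```

Binning the gap into discrete tokens lets a parser trained on punctuated text treat pauses like punctuation marks; feeding the raw duration in as a numeric term would be a different design.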

Psychological models of production: Disfluencies (joint work with Julie Sedivy and Dan Grodner). Looking at what’s going on during speech, and when. Initially, we were interested in how prosody maps to discourse constraints in the production of prenominal adjectives, e.g., "Move the red cup." Facts: Speakers only use scalar or material adjectives in the environment of a contrast. Speakers use color adjectives ALL the time. Marking a contrast is prosodically marked (there is an increase in pitch range in the presence of a contrast). Despite an increase in pitch range, there is no duration increase with adjectives produced in a contrastive environment. BUT scalar adjectives are longer.

Psychological models of production: Disfluencies. Really neat fact: Speakers produce more disfluencies with scalar adjectives compared to material or color adjectives. Disfluencies account for about 6% of spontaneous speech (Shriberg 2002). Types: silent pauses ("move the red …"); elongated pronunciations ("move theee"); filled pauses ("move the um"); repetitions ("move the the"); restarts ("move the uh the red …").

Psychological models of production: Disfluencies. Used an eye-tracking device to find out what’s happening during the disfluency, e.g., "Move the, uh, big car next to the turtle."

Psychological models of production: Disfluencies. Results: We found that speakers look more at the contrasting object in the case of the scalars, during the disfluency AND during the adjective! Implications: Marking a contrast set does not increase processing load. Encoding a relative property does increase processing load. Duration is affected by lexical encoding (suggests a continuum of planning difficulty effects).

Near-future research

Prosody. In general, continue looking at the factors that influence prosodic variation and see if these can be modeled probabilistically. The challenge: Lots of people have found that discourse-pragmatic factors contribute to prosodic marking. Others, including myself, have found that prosody is affected by probabilistic variables. How can we model aspects of the speech context probabilistically?

Disfluencies. Disfluencies have proven to be a very useful window into processes of speech production. Are there more disfluencies around evaluative terms in general? Do different types of disfluencies correspond to difficulties associated with different aspects of production (initial planning versus lexical encoding and access)? Investigate more fully the connection between disfluencies and the length of surrounding words. Why is it that words following a disfluency are longer? How much of duration variation can be accounted for by planning difficulties versus other factors?

Speech synthesis. Three types of TTS systems: (1) Concatenated or diphone models. Advantages: the ability to process novel strings of text; does not require a huge database of stored speech. Disadvantages: mechanical-sounding speech; a lot of post-processing. (2) Corpus-based: prosodic patterns (durations, stress, F0 contours) are not defined by the signal processor; rather, the phoneme sequences are chosen based on exact prosodic pattern matches in a corpus. Advantage: natural-sounding speech, specifically with regard to prosody. Disadvantage: a much larger database is required, with a lot more hand coding involved. It also does not allow for totally novel sequences of sounds or words that are not in the database. (3) Phrase splicing (unit selection): selects the largest unit possible from a corpus of one speaker. Advantage: very natural; requires very little post-speech processing from a signal processor. Disadvantage: requires an extremely large (~10 hours) hand-annotated corpus of speech. It also does not allow for novel sequences of speech, and thus must be used in conjunction with a diphone model.
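The "largest unit possible, with a diphone fallback" idea in phrase splicing can be sketched as a greedy longest-match over a set of recorded units. This is a simplification under stated assumptions: units are word sequences rather than audio segments, matching is greedy left to right, and the name `plan_splice` and the "recorded"/"fallback" labels are hypothetical.

```python
def plan_splice(words, recorded_units):
    """Cover `words` with the longest recorded units, left to right.

    `recorded_units` is a set of word tuples assumed to exist as recordings.
    Words not covered by any unit are marked for a fallback synthesizer
    (e.g. a diphone model). Greedy matching is an assumption of this sketch.
    """
    max_len = max((len(unit) for unit in recorded_units), default=0)
    plan = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(words) - i), 0, -1):
            unit = tuple(words[i:i + n])
            if unit in recorded_units:
                plan.append(("recorded", unit))
                i += n
                break
        else:
            # No recorded unit starts here: hand one word to the fallback.
            plan.append(("fallback", (words[i],)))
            i += 1
    return plan

# Toy inventory and request (illustrative only).
inventory = {("move", "the"), ("red", "cup"), ("move",)}
splice_plan = plan_splice("move the blue red cup".split(), inventory)
```

A real unit-selection system would also weigh how well neighbouring units join acoustically; the greedy plan above only shows the coverage logic and why novel words force a fallback.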

Speech synthesis (joint work with Mike Buckley and Kris Schindler). Using a probabilistic model to improve speech synthesis in the UB Talker. The UB Talker: an artificial speaking device with a menu-driven means of selecting words and phrases. Menus, words, and phrases can be pre-programmed or entered on-screen. Uses context-awareness and phrase completion to predict responses. Statistics are derived using frequency of use, most-recently used, time of day, day of week, and time of year to present the most likely phrases to users.
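A rough sketch of ranking phrases by usage statistics follows. It covers only two of the signals named on the slide (frequency of use and most-recently used) and combines them with an arbitrary weight; the class `PhraseRanker`, its scoring formula, and the weight are assumptions for illustration and are not taken from the UB Talker itself.

```python
from collections import Counter

class PhraseRanker:
    """Rank stored phrases by frequency and recency of use (a sketch)."""

    def __init__(self, recency_weight=0.5):
        self.counts = Counter()   # how often each phrase was selected
        self.last_used = {}       # logical time of the latest selection
        self.clock = 0            # increments on every selection
        self.recency_weight = recency_weight  # arbitrary, assumed weight

    def record(self, phrase):
        """Register that the user selected `phrase`."""
        self.clock += 1
        self.counts[phrase] += 1
        self.last_used[phrase] = self.clock

    def _score(self, phrase):
        frequency = self.counts[phrase] / self.clock
        recency = self.last_used[phrase] / self.clock
        return frequency + self.recency_weight * recency

    def rank(self, prefix="", k=3):
        """Return up to `k` phrases starting with `prefix`, best first."""
        candidates = [p for p in self.counts if p.startswith(prefix)]
        return sorted(candidates, key=lambda p: (-self._score(p), p))[:k]

# Toy usage history (made up for illustration).
ranker = PhraseRanker()
for phrase in ["I am hungry", "I am tired", "I am hungry", "hello"]:
    ranker.record(phrase)
completions = ranker.rank("I am")
```

Time of day, day of week, and time of year could be added as further terms in `_score`, for example by keeping separate counts per time bucket.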

Speech synthesis. Once a string is selected, a synthesizer component produces speech. Two goals: 1. Add a probabilistic model of prosody to the current free TTS system. 2. Build a corpus of speech toward a unit selection model (the client has about 2,000 phrases in the system that can be pre-recorded).

Speech synthesis. Some academically and commercially available synthesizers: http://www.cstr.ed.ac.uk/projects/festival/userin.html http://www.rhetorical.com/cgi-bin/demo.cgi http://www.research.att.com/projects/tts/demo.html
