Towards a model of speech production: Cognitive modeling and computational applications Michelle L. Gregory SNeRG 2003.



Outline
  Where I’ve been
    Predictability effects on word duration
    Predictability effects on pitch accent
  Where I’m at
    Computational model of pitch accent
    Prosodic information to aid parsing
    Psychological models of production
  Where I’d like to go
    Prosody
    Disfluencies
    Speech synthesis

Where I’ve been Bad idea despair Cute! knowledge

Where I’m at… CU-BOULDER LINGUISTICS PROFESSOR WINS 2002 MACARTHUR FELLOWSHIP

Where I’m headed …

Predictability effects on word duration
  Methodology
    Corpus and design (Switchboard, regression)
    Measures of predictability (frequency, bigram, joint, mutual information, repetition)
  Function words (top ten most frequent)
    Vowel reduction
    Coda deletion
    Duration
  Content words
    t/d deletion
    Duration
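The predictability measures named on this slide can be estimated from raw counts. The following is a minimal sketch, not the procedure used in the study: the function name `predictability_measures`, the maximum-likelihood estimates without smoothing, and the made-up token sequence are all assumptions for illustration. Repetition is not covered.

```python
import math
from collections import Counter

def predictability_measures(tokens):
    """Estimate predictability measures for every bigram in a token list.

    Assumption: plain relative-frequency estimates with no smoothing.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = max(n - 1, 1)
    measures = {}
    for (prev, word), count in bigrams.items():
        p_word = unigrams[word] / n          # relative frequency of the word
        p_prev = unigrams[prev] / n          # relative frequency of the previous word
        p_joint = count / n_bigrams          # joint probability of the pair
        measures[(prev, word)] = {
            "frequency": p_word,
            "joint": p_joint,
            # conditional bigram probability P(word | prev)
            "conditional": count / unigrams[prev],
            # pointwise mutual information of the pair, in bits
            "mutual_information": math.log2(p_joint / (p_prev * p_word)),
        }
    return measures

# Made-up token sequence, for illustration only
tokens = "i want to go to the store and i want to rest".split()
for pair, values in predictability_measures(tokens).items():
    print(pair, values)
```

In a regression design like the one described, each of these values would serve as a predictor for a word's duration or reduction.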

Predictability effects on word duration
  The probabilistic reduction hypothesis: the higher the probability of a word, the more it is reduced, shortened, or lenited in lexical production (Gregory et al. 1999; Jurafsky, Bell, Gregory, and Raymond 2000).
  Implications
    Any factor that increases the probability of a word also increases phonological reduction.
    Is that the only role of probabilistic information?

Predictability effects on pitch accent
  Same database and regression models, but this time coded for pitch accent.
  What is pitch accent?
    Perceptual phenomenon (Hirschberg, 1993)
    Associated with duration, amplitude, and F0 of units
    Words that appear more intonationally prominent than others are said to bear pitch accent.

Predictability and pitch accent Pitch accent is associated with meaning:

Predictability effects on pitch accent
  Results
    More predictable words are less likely to bear pitch accent. This is true for all parts of speech, with predictability measured by:
      Frequency
      Conditional bigram probability
      Joint probability
      Semantic relatedness
      Repetition
    (These are not all the same measures that affect reduction; e.g., preceding context is more important with pitch accent.)
  Implications
    The role of predictability is not limited to reduction processes.
    Predictability is not just a fact about lexical access; this information is available during phonological encoding.
    Prosody in speech synthesis is rudimentary; a probabilistic model is (relatively) easy to implement.

Current Research
  Computational model of pitch accent
  Prosodic information to aid parsing
  Psychological models of production

Computational model of pitch accent (joint work with Yasemin Altun)
  Problem
    Predicting accent is not an exact science.
    Hirschberg (1993) and Pan & Hirschberg (2000) demonstrate that frequency and conditional probability increase accuracy in pitch accent prediction:
      1. Function vs. content only: 68%
      2. Frequency, conditional probability: 71%
      Both 1 and 2: 73%
    Will the addition of more or different probabilistic variables increase accuracy as well?

Computational model of pitch accent
  Testing more variables
    Joint probability, reverse conditional probability
    The effects of surrounding accents
    More fine-grained part of speech
    Things like rate of speech, etc.

Prosodic information to aid parsing (joint work with Mark Johnson and Eugene Charniak)
  Problem: Parsing conversational speech is difficult.
  Accuracy of parsing:
    Wall Street Journal (Charniak 2000)   90%
    Switchboard                           84.5%
    WSJ, no punctuation                   86%
    SWBD, no punctuation                  81%
  Add prosodic features instead of punctuation.

Prosodic information to aid parsing
  Methodology
    Get timing information from the transcripts.
    Add pause duration information as a term in the parser (use pauses as a cue instead of punctuation).
    For sentence-internal punctuation only.
  Results
    Accuracy goes down (80%).
    Because the language model is not as strong?
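One way to picture "pauses as a cue instead of punctuation" is to turn gaps between time-aligned words into pseudo-punctuation tokens before parsing. This is only a sketch under stated assumptions: the function `insert_pause_tokens`, the token names, and the 0.2 s and 0.5 s thresholds are hypothetical and are not taken from the work described on the slide.

```python
def insert_pause_tokens(timed_words, short=0.2, long=0.5):
    """Insert pause markers between words based on silence duration.

    timed_words: list of (word, start_sec, end_sec) tuples in utterance order.
    short, long: hypothetical thresholds in seconds for the two pause classes.
    """
    tokens = []
    prev_end = None
    for word, start, end in timed_words:
        if prev_end is not None:
            gap = start - prev_end  # silence between this word and the previous one
            if gap >= long:
                tokens.append("<PAUSE_LONG>")
            elif gap >= short:
                tokens.append("<PAUSE_SHORT>")
        tokens.append(word)
        prev_end = end
    return tokens

# Made-up timings, for illustration only
utterance = [("well", 0.00, 0.30), ("i", 0.95, 1.05),
             ("think", 1.05, 1.40), ("so", 1.65, 1.90)]
print(insert_pause_tokens(utterance))
# → ['well', '<PAUSE_LONG>', 'i', 'think', '<PAUSE_SHORT>', 'so']
```

Binning pause durations into a few classes keeps the parser's vocabulary small; a model could instead take the raw duration as a feature.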

Psychological models of production: Disfluencies (joint work with Julie Sedivy and Dan Grodner)
  Looking at what’s going on during speech, and when.
  Initially, we were interested in how prosody maps to discourse constraints in the production of prenominal adjectives:
    Move the red cup
  Facts:
    Speakers only use scalar or material adjectives in the environment of a contrast.
    Speakers use color adjectives ALL the time.
    Contrast is prosodically marked (there is an increase in pitch range in the presence of a contrast).
    Despite an increase in pitch range, there is no duration increase with adjectives produced in a contrastive environment.
    BUT scalar adjectives are longer.

Psychological models of production: Disfluencies
  Really neat fact: Speakers produce more disfluencies with scalar adjectives than with material or color adjectives.
  Disfluencies account for about 6% of spontaneous speech (Shriberg 2002).
    silent pauses: move the red …
    elongated pronunciations: move theee
    filled pauses: move the um
    repetitions: move the the
    restarts: move the uh the red …
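Two of the disfluency types listed above, filled pauses and repetitions, can be spotted from a word transcript alone. The sketch below is a hypothetical illustration, not the coding scheme used in the study: the function `tag_disfluencies`, the two-word filler inventory, and the tag names are assumptions. Silent pauses and elongations need timing information, and restarts that span a filler are not handled.

```python
FILLED_PAUSES = {"um", "uh"}  # assumption: a minimal filler inventory

def tag_disfluencies(tokens):
    """Tag each token as 'filled_pause', 'repetition', or 'fluent'.

    A token is tagged 'repetition' when the next token is identical to it,
    so the first copy of a repeated word carries the tag.
    """
    tagged = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in FILLED_PAUSES:
            tag = "filled_pause"
        elif i + 1 < len(tokens) and tokens[i + 1].lower() == word:
            tag = "repetition"
        else:
            tag = "fluent"
        tagged.append((token, tag))
    return tagged

print(tag_disfluencies("move the um red cup".split()))
print(tag_disfluencies("move the the red cup".split()))
```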

Psychological models of production: Disfluencies Used an eye-tracking device to find out what’s happening during the disfluency Move the, uh, big car next to the turtle

Psychological models of production: Disfluencies
  Results: Speakers look more at the contrasting object in the case of the scalars, during the disfluency AND during the adjective!
  Implications:
    Marking a contrast set does not increase processing load.
    Encoding a relative property does increase processing load.
    Duration is affected by lexical encoding (suggests a continuum of planning difficulty effects).

Near-future research

Prosody
  In general, continue looking at the factors that influence prosodic variation and see whether they can be modeled probabilistically.
  The challenge:
    Many researchers have found that discourse-pragmatic factors contribute to prosodic marking.
    Others, including myself, have found that prosody is affected by probabilistic variables.
    How can we model aspects of the speech context probabilistically?

Disfluencies
  Disfluencies have proven to be a very useful window into processes of speech production.
  Are there more disfluencies around evaluative terms in general?
  Do different types of disfluencies correspond to difficulties associated with different aspects of production (initial planning versus lexical encoding and access)?
  Investigate more fully the connection between disfluencies and the length of surrounding words. Why are words following a disfluency longer?
  How much of duration variation can be accounted for by planning difficulties versus other factors?

Speech synthesis
  Three types of TTS systems:
  Concatenated or diphone models
    Advantages: the ability to process novel strings of text; does not require a huge database of stored speech.
    Disadvantages: mechanical-sounding speech; a lot of post-processing.
  Corpus-based: prosodic patterns (durations, stress, F0 contours) are not defined by the signal processor; rather, the phoneme sequences are chosen based on exact prosodic pattern matches in a corpus.
    Advantage: natural-sounding speech, specifically with regard to prosody.
    Disadvantage: a much larger database is required, with a lot more hand coding involved. It also does not allow for totally novel sequences of sounds or words that are not in the database.
  Phrase splicing (unit selection): selects the largest unit possible from a corpus of one speaker.
    Advantage: very natural; requires very little post-speech processing from a signal processor.
    Disadvantage: requires an extremely large (~10 hours) hand-annotated corpus of speech. It also does not allow for novel sequences of speech, and thus must be used in conjunction with a diphone model.
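The phrase-splicing idea of taking the largest recorded unit available, and falling back to another synthesizer for anything not in the corpus, can be sketched as a greedy longest-match over words. This is a simplification under stated assumptions: the function `select_units` and the phrase inventory are hypothetical, matching is on word strings only, and real unit-selection systems also weigh how well candidate units fit the target and join with each other.

```python
def select_units(target_words, recorded_phrases):
    """Cover target_words with the longest recorded phrases available.

    Returns a list of (kind, words) pairs, where kind is 'recorded' for a
    span found in the inventory and 'fallback' for a single word that would
    have to be produced some other way (e.g., by a diphone synthesizer).
    """
    inventory = {tuple(phrase.split()) for phrase in recorded_phrases}
    max_len = max((len(phrase) for phrase in inventory), default=0)
    units = []
    i = 0
    while i < len(target_words):
        # Try the longest possible span first, then shrink it.
        for span in range(min(max_len, len(target_words) - i), 0, -1):
            candidate = tuple(target_words[i:i + span])
            if candidate in inventory:
                units.append(("recorded", candidate))
                i += span
                break
        else:
            # No recorded unit starts here: hand one word to the fallback.
            units.append(("fallback", (target_words[i],)))
            i += 1
    return units

# Made-up inventory and target, for illustration only
recorded = ["i would like", "a glass of water", "please"]
print(select_units("i would like a glass of milk please".split(), recorded))
```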

Speech synthesis (joint work with Mike Buckley and Kris Schindler)
  Using a Probabilistic Model to Improve Speech Synthesis in the UB Talker
  The UB Talker:
    An artificial speaking device with a menu-driven means of selecting words and phrases.
    Menus, words, and phrases can be pre-programmed or entered on-screen.
    Uses context-awareness and phrase completion to predict responses.
    Statistics are derived using frequency of use, most-recently used, time of day, day of week, and time of year to present the most likely phrases to users.
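How usage statistics might be combined to present the most likely phrases can be sketched as a weighted score. Nothing here reflects the actual UB Talker implementation: the `PhraseStats` record, the `rank_phrases` function, the equal default weights, and the recency formula are all assumptions, and only three of the listed statistics (frequency of use, most-recently used, time of day) are modeled.

```python
from dataclasses import dataclass, field

@dataclass
class PhraseStats:
    text: str
    use_count: int = 0
    last_used: float = 0.0                            # timestamp in seconds
    hour_counts: dict = field(default_factory=dict)   # hour of day -> number of uses

def rank_phrases(phrases, now, hour, weights=(1.0, 1.0, 1.0)):
    """Order phrases from most to least likely under a hypothetical score."""
    w_freq, w_recent, w_hour = weights
    total_uses = sum(p.use_count for p in phrases) or 1

    def score(p):
        frequency = p.use_count / total_uses
        # Recency decays with the number of hours since the last use.
        recency = 1.0 / (1.0 + (now - p.last_used) / 3600.0)
        # Share of this phrase's uses that fell in the current hour of day.
        hour_share = p.hour_counts.get(hour, 0) / p.use_count if p.use_count else 0.0
        return w_freq * frequency + w_recent * recency + w_hour * hour_share

    return sorted(phrases, key=score, reverse=True)

# Made-up usage history, for illustration only
now = 1_000_000.0
phrases = [
    PhraseStats("good morning", 10, now - 86400, {8: 9, 20: 1}),
    PhraseStats("good night", 10, now - 86400, {21: 10}),
]
print([p.text for p in rank_phrases(phrases, now, hour=8)])
```

Day of week and time of year could be added as further terms in the same way.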

Speech synthesis
  Once a string is selected, a synthesizer component produces speech.
  Two goals:
    1. Add a probabilistic model of prosody to the current free TTS system.
    2. Build a corpus of speech toward a unit selection model (the client has about 2,000 phrases in the system that can be pre-recorded).

Speech synthesis
  Some academically and commercially available synthesizers: