CS 224S / LINGUIST 285 Spoken Language Processing

Slides:



Advertisements
Similar presentations
Speech: Fundamentals CS 3710 / ISSP 3565
Advertisements

Acoustic/Prosodic Features
Tom Lentz (slides Ivana Brasileiro)
Basic Spectrogram & Clinical Application: Consonants
Acoustic Characteristics of Consonants
From Resonance to Vowels March 8, 2013 Friday Frivolity Some project reports to hand back… Mystery spectrogram reading exercise: solved! We need to plan.
1 CS 551/651: Structure of Spoken Language Spectrogram Reading: Stops John-Paul Hosom Fall 2010.
Vowels and Tubes (again) March 22, 2011 Today’s Plan Perception experiment! Discuss vowel theory #2: tubes! Then: some thoughts on music. First: let’s.
ACOUSTICS OF SPEECH AND SINGING MUSICAL ACOUSTICS Science of Sound, Chapters 15, 17 P. Denes & E. Pinson, The Speech Chain (1963, 1993) J. Sundberg, The.
Prosodics, Part 1 LIN Prosodics, or Suprasegmentals Remember, from our first discussions in class, that speech is really a continuous flow of initiation,
Basic Spectrogram Lab 8. Spectrograms §Spectrograph: Produces visible patterns of acoustic energy called spectrograms §Spectrographic Analysis: l Acoustic.
The Human Voice. I. Speech production 1. The vocal organs
ACOUSTICAL THEORY OF SPEECH PRODUCTION
Speech Perception Overview of Questions Can computers perceive speech as well as humans? Does each word that we hear have a unique pattern associated.
The Human Voice Chapters 15 and 17. Main Vocal Organs Lungs Reservoir and energy source Larynx Vocal folds Cavities: pharynx, nasal, oral Air exits through.
PH 105 Dr. Cecilia Vogel Lecture 14. OUTLINE  consonants  vowels  vocal folds as sound source  formants  speech spectrograms  singing.
Vowel Acoustics, part 2 November 14, 2012 The Master Plan Acoustics Homeworks are due! Today: Source/Filter Theory On Friday: Transcription of Quantity/More.
Unit 4 Articulation I.The Stops II.The Fricatives III.The Affricates IV.The Nasals.
Computational Extraction of Social and Interactional Meaning SSLST, Summer 2011 Dan Jurafsky Prosody IP notice: many slides for today from Jennifer Venditti,
Introduction to Intonation Jennifer J. Venditti Cognitive Science March 2001.
What is Phonetics? Short answer: The study of speech sounds in all their aspects. Phonetics is about describing speech. (Note: phonetics ¹ phonics) Phonetic.
Chapter three Phonology
Intonation September 18, 2014 The Plan for Today Also: I have posted a couple of readings on TOBI (an intonation transcription system) to the course.
Harmonics, Timbre & The Frequency Domain
Source/Filter Theory and Vowels February 4, 2010.
Acoustic Phonetics 3/9/00. Acoustic Theory of Speech Production Modeling the vocal tract –Modeling= the construction of some replica of the actual physical.
1 Speech Perception 3/30/00. 2 Speech Perception How do we perceive speech? –Multifaceted process –Not fully understood –Models & theories attempt to.
Vowel Acoustics November 2, 2012 Some Announcements Mid-terms will be back on Monday… Today: more resonance + the acoustics of vowels Also on Monday:
ECE 598: The Speech Chain Lecture 7: Fourier Transform; Speech Sources and Filters.
Harmonics November 1, 2010 What’s next? We’re halfway through grading the mid-terms. For the next two weeks: more acoustics It’s going to get worse before.
LING 001 Introduction to Linguistics Fall 2010 Sound Structure I: Phonetics Acoustic phonetics Jan. 27.
Speech Science Fall 2009 Oct 28, Outline Acoustical characteristics of Nasal Speech Sounds Stop Consonants Fricatives Affricates.
Speech Science Oct 7, 2009.
Voice Quality + Stop Acoustics
Say “blink” For each segment (phoneme) write a script using terms of the basic articulators that will say “blink.” Consider breathing, voicing, and controlling.
Frequency, Pitch, Tone and Length October 16, 2013 Thanks to Chilin Shih for making some of these lecture materials available.
Structure of Spoken Language
Speech Science VI Resonances WS Resonances Reading: Borden, Harris & Raphael, p Kentp Pompino-Marschallp Reetzp
Resonance October 23, 2014 Leading Off… Don’t forget: Korean stops homework is due on Tuesday! Also new: mystery spectrograms! Today: Resonance Before.
Stops Stops include / p, b, t, d, k, g/ (and glottal stop)
Vowel Acoustics March 10, 2014 Some Announcements Today and Wednesday: more resonance + the acoustics of vowels On Friday: identifying vowels from spectrograms.
Statistical NLP Spring 2011
Frequency, Pitch, Tone and Length February 12, 2014 Thanks to Chilin Shih for making some of these lecture materials available.
Resonance January 28, 2010 Last Time We discussed the difference between sine waves and complex waves. Complex waves can always be understood as combinations.
Intonational Meaning in Discourse Jennifer J. Venditti Tutorial for the IRCS 5 th Annual Undergraduate Summer Workshop in Cognitive Science 18 June 2002.
Phonetics, part III: Suprasegmentals October 19, 2012.
Stop Acoustics and Glides December 2, 2013 Where Do We Go From Here? The Final Exam has been scheduled! Wednesday, December 18 th 8-10 am (!) Kinesiology.
Voicing + Basic Acoustics October 14, 2015 Agenda Production Exercise #2 is due on Friday! No transcription exercise this Friday! Today, we’ll begin.
Stop + Approximant Acoustics
CS 188: Artificial Intelligence Spring 2007 Speech Recognition 03/20/2007 Srini Narayanan – ICSI and UC Berkeley.
1 Acoustic Phonetics 3/28/00. 2 Nasal Consonants Produced with nasal radiation of acoustic energy Sound energy is transmitted through the nasal cavity.
P105 Lecture #27 visuals 20 March 2013.
Acoustic Phonetics 3/14/00.
Phonetics, part III: Suprasegmentals October 18, 2010.
Pitch Tracking + Prosody January 19, 2012 Homework! For Tuesday: introductory course project report Background information on your consultant and the.
Suprasegmental features and Prosody Lect 6A&B LING1005/6105.
HOW WE TRANSMIT SOUNDS? Media and communication 김경은 김다솜 고우.
Basic Acoustics + Digital Signal Processing January 11, 2013.
Resonance October 29, 2015 Looking Ahead I’m still behind on grading the mid-term and Production Exercise #1… They should be back to you by Monday. Today:
Statistical NLP Spring 2010
The Human Voice. 1. The vocal organs
Structure of Spoken Language
The Human Voice. 1. The vocal organs
Lecture 9: Speech Recognition (I) October 26, 2004 Dan Jurafsky
Vowel Formants 1.
What is Phonetics? Short answer: The study of speech sounds in all their aspects. Phonetics is about describing speech. (Note: phonetics ¹ phonics) Phonetic.
Remember me? The number of times this happens in 1 second determines the frequency of the sound wave.
The Production of Speech
Speech Perception (acoustic cues)
CS 188: Artificial Intelligence Spring 2006
Presentation transcript:

CS 224S / LINGUIST 285 Spoken Language Processing Dan Jurafsky Stanford University Spring 2014 Lecture 2: Acoustic Phonetics and Intonation

Today: Acoustic Phonetics No math today PRAAT and sound waves Segmental Phenomena: Spectra Spectrograms Formants Reading spectrograms The Source Filter Theory: why formants Suprasegmental Phenomena: Intonation F0, pitch, intensity Accents + Prominence Boundaries Tunes

Homework 1 http://www.stanford.edu/class/cs224s/hw1.html You’ll need to download PRAAT; details are in the homework.

Sound waves are longitudinal waves Dan Rusell Figure

particle dispacment pressure Dan Rusell Figure

Remember High School Physics Simple Period Waves (sine waves) Characterized by: period: T amplitude A phase  Fundamental frequency in cycles per second, or Hz F0=1/T 1 cycle

Simple periodic waves The frequency of a wave: Amplitude: 1 Computing the frequency of a wave: 5 cycles in .5 seconds = 10 cycles/second = 10 Hz Amplitude: 1 Equation: Y = A sin(2ft) The frequency of a wave: 5 cycles in .5 seconds = 10 cycles/second = 10 Hz Amplitude: 1

Speech sound waves X axis: time. Y axis: A little piece from the waveform of the vowel [iy] X axis: time. Y axis: Amplitude = air pressure at that time +: compression 0: normal air pressure, -: rarefaction

Back to waves: Fundamental frequency Waveform of the vowel [iy] Frequency: 10 repetitions / .03875 seconds = 258 Hz This is speed that vocal folds move, hence voicing Each peak corresponds to an opening of the vocal folds The low frequency of the complex wave is called the fundamental frequency of the wave or F0

She just had a baby Note that vowels all have regular amplitude peaks Stop consonant Closure followed by release Notice the silence followed by slight bursts of emphasis: very clear for [b] of “baby” Fricative: noisy. [sh] of “she” at beginning

Fricative

Back to freshman physics: Waves have different frequencies 100 Hz 1000 Hz

Complex waves: Adding a 100 Hz and 1000 Hz wave together

Spectrum Amplitude 1000 100 Frequency in Hz Frequency components (100 and 1000 Hz) on x-axis

Spectra continued Fourier analysis: any wave can be represented as the (infinite) sum of sine waves of different frequencies (amplitude, phase)

Spectrum of one instant in an actual soundwave: many components across frequency range

Part of [ae] waveform from “had” Note complex wave repeating nine times in figure Plus smaller waves which repeats 4 times for every large pattern Large wave has frequency of 250 Hz (9 times in .036 seconds) Small wave roughly 4 times this, or roughly 1000 Hz Two little tiny waves on top of peak of 1000 Hz waves

Back to spectrum Spectrum represents these freq components Computed by Fourier transform x-axis shows frequency, y-axis shows magnitude (in decibels) Peaks at 930 Hz, 1860 Hz, and 3020 Hz.

Seeing formants: the spectrogram 1/5/07

Formants Vowels largely distinguished by 2 characteristic pitches. One of them (the higher of the two) goes downward throughout the series iy ih eh ae aa ao ou u The other goes up for the first four vowels and then down for the next four. These are called "formants" of the vowels, lower is 1st formant, higher is 2nd formant.

Spectrogram: spectrum + time dimension

How to read spectrograms bab: closure of lips lowers all formants: so rapid increase in all formants at beginning of "bab” dad: first formant increases, but F2 and F3 slight fall gag: F2 and F3 come together: this is a characteristic of velars. Formant transitions take longer in velars than in alveolars or labials From Ladefoged “A Course in Phonetics”

She came back and started again 1. lots of high-freq energy 3. closure for k 4. burst of aspiration for k 5. ey vowel;faint 1100 Hz formant is nasalization 6. bilabial nasal 7. short b closure, voicing barely visible. 8. ae; note upward transitions after bilabial stop at beginning 9. note F2 and F3 coming together for "k” From Ladefoged “A Course in Phonetics”

Formants and the Source Filter Model

Different vowels have different formants Every time the vocal cords open and close, pulse of air from the lungs is sharp tap on air in vocal tract. Setting air in vocal cavity vibrating, producing different harmonics

Vocal Fold Cycles

The vocal source at 150 Hz a

The harmonics a

Source filter model of vowels Any body of air will vibrate in a way that depends on its size and shape. Vocal tract as "amplifier"; amplifies certain harmonics Formants are result of different shapes of vocal tract.

The oral cavity amplifies some harmonics

Source-filter model of speech production Input Filter Output Glottal spectrum Vocal tract frequency response function Source and filter are independent, so: Different vowels can have same pitch The same vowel can have different pitch Figures and text from Ratree Wayland slide from his website

From Mark Liberman’s Web site

Deriving schwa: how shape of mouth (filter function) creates formants of [ax] f = c/ c = speed of sound (approx 35,000 cm/sec) A sound with =10 meters has low frequency f = 35 Hz (35,000/1000) A sound with =2 centimeters has high frequency f = 17,500 Hz (35,000/2)

Resonances of the vocal tract The human vocal tract as an open tube Air in a tube of a given length will tend to vibrate at resonance frequency of tube. Closed end Open end Length 17.5 cm. Figure from Ladefoged(1996) p 117

Resonances of the vocal tract The human vocal tract as an open tube Air in a tube of a given length will tend to vibrate at resonance frequency of tube. Closed end Open end Length 17.5 cm. Figure from W. Barry Speech Science slides

Resonances of the vocal tract If vocal tract is cylindrical tube open at one end Standing waves form in tubes Waves will resonate if their wavelength corresponds to dimensions of tube Constraint: Pressure differential should be maximal at (closed) glottal end and minimal at (open) lip end. Next slide shows what kind of length of waves can fit into a tube with this contraint

1/5/07 From Sundberg

Computing the 3 formants of schwa Let the length of the tube be L F1 = c/1 = c/(4L) = 35,000/4*17.5 = 500Hz F2 = c/2 = c/(4/3L) = 3c/4L = 3*35,000/4*17.5 = 1500Hz F3 = c/3 = c/(4/5L) = 5c/4L = 5*35,000/4*17.5 = 2500Hz So we expect a neutral vowel to have 3 resonances at 500, 1500, and 2500 Hz These vowel resonances are called formants

This can help solve a mystery of musical life: Why are opera singers (especially sopranos) so hard to understand?

Vowel [i] sung at successively higher pitch. 2 1 3 5 6 4 7 Figures from Ratree Wayland slides from his website

Defining Intonation Ladd (1996) “Intonational phonology” “The use of suprasegmental phonetic features Suprasegmental = above & beyond the segment/phone F0 Intensity (energy) Duration to convey sentence-level pragmatic meanings” I.e. meanings that apply to phrases or utterances as a whole, not lexical stress, not lexical tone.

Pitch track

Pitch is not Frequency Pitch is the mental sensation or perceptual correlated of F0 Relationship between pitch and F0 is not linear; human pitch perception is most accurate between 100Hz and 1000Hz. Linear in this range Logarithmic above 1000Hz Mel scale is one model of this F0-pitch mapping A mel is a unit of pitch defined so that pairs of sounds which are perceptually equidistant in pitch are separated by an equal number of mels Frequency in mels = 1127 ln (1 + f/700)

Plot of Intensity

Three aspects of prosody Prominence: some syllables/words are more prominent than others Structure/boundaries: sentences have prosodic structure Some words group naturally together Others have a noticeable break or disjuncture between them Tune: the intonational melody of an utterance. From Ladd (1996)

Prosodic Prominence: Pitch Accents A: What types of foods are a good source of vitamins? B1: Legumes are a good source of VITAMINS. B2: LEGUMES are a good source of vitamins. Prominent syllables are: Louder Longer Have higher F0 and/or sharper changes in F0 (higher F0 velocity) Slide from Jennifer Venditti

Prosodic Boundaries I met Mary and Elena’s mother at the mall yesterday. French [bread and cheese] [French bread] and [cheese] Slide from Jennifer Venditti

Prosodic Tunes Legumes are a good source of vitamins. Are legumes a good source of vitamins? Slide from Jennifer Venditti

Thinking about F0

Graphic representation of F0 F0 (in Hertz) legumes are a good source of VITAMINS time Slide from Jennifer Venditti

The ‘ripples’ legumes are a good source of VITAMINS F0 is not defined for consonants without vocal fold vibration. Slide from Jennifer Venditti

The ‘ripples’ legumes are a good source of VITAMINS [ z ] [ g ] [ v ] legumes are a good source of VITAMINS ... and F0 can be perturbed by consonants with an extreme constriction in the vocal tract. Slide from Jennifer Venditti

Abstraction of the F0 contour legumes are a good source of VITAMINS Our perception of the intonation contour abstracts away from these perturbations. Slide from Jennifer Venditti

The ‘waves’ and the ‘swells’ ‘wave’ = accent ‘swell’ = phrase legumes are a good source of VITAMINS Slide from Jennifer Venditti

Placement of Pitch Accents Prominence: Placement of Pitch Accents

Stress vs. accent Stress is a structural property of a word it marks a potential (arbitrary) location for an accent to occur, if there is one. Accent is a property of a word in context it is a way to mark intonational prominence in order to ‘highlight’ important words in the discourse. (x) (accented syll) x stressed syll full vowels syllables vi ta mins Ca li for nia Slide from Jennifer Venditti

Stress vs. accent (2) The speaker decides to make the word vitamin more prominent by accenting it. Lexical stress tell us that this prominence will appear on the first syllable, hence VItamin. So prosodic prominence is a function of lexicon context I’m a little surPRISED to hear it CHARacterized as upBEAT

Which word receives an accent? It depends on the context. The ‘new’ information in the answer to a question is often accented while the ‘old’ information is usually not. Q1: What types of foods are a good source of vitamins? A1: LEGUMES are a good source of vitamins. Q2: Are legumes a source of vitamins? A2: Legumes are a GOOD source of vitamins. Q3: I’ve heard that legumes are healthy, but what are they a good source of ? A3: Legumes are a good source of VITAMINS. Slide from Jennifer Venditti

Same ‘tune’, different alignment LEGUMES are a good source of vitamins The main rise-fall accent (= “I assert this”) shifts locations. Slide from Jennifer Venditti

Same ‘tune’, different alignment Legumes are a GOOD source of vitamins The main rise-fall accent (= “I assert this”) shifts locations. Slide from Jennifer Venditti

Same ‘tune’, different alignment legumes are a good source of VITAMINS The main rise-fall accent (= “I assert this”) shifts locations. Slide from Jennifer Venditti

Levels of prominence Most phrases have more than one accent The last accent in a phrase is perceived as more prominent Called the Nuclear Accent Emphatic accents like nuclear accent often used for semantic purposes, such as indicating that a word is contrastive, or the semantic focus. The kind of thing you uses ***s in IM, or capitalized letters ‘I know SOMETHING interesting is sure to happen,’ she said to herself. Can also have words that are less prominent than usual Reduced words, especially function words. Often use 4 classes of prominence: Emphatic accent, pitch accent, unaccented, reduced

Intonational phrasing/boundaries

A single intonation phrase legumes are a good source of vitamins Broad focus statement consisting of one intonation phrase (that is, one intonation tune spans the whole unit). Slide from Jennifer Venditti

Multiple phrases legumes are a good source of vitamins Utterances can be ‘chunked’ up into smaller phrases in order to signal the importance of information in each unit. Slide from Jennifer Venditti

Phrasing sometimes helps disambiguate Global ambiguity: The old men and women stayed home. Sally saw the man with the binoculars. John doesn’t drink because he’s unhappy. Slide from Jennifer Venditti

Phrasing can disambiguate Global ambiguity: The old men and women stayed home. The old men % and women % stayed home. Sally saw % the man with the binoculars. Sally saw the man % with the binoculars. John doesn’t drink because he’s unhappy. John doesn’t drink % because he’s unhappy. Slide from Jennifer Venditti

Phrasing sometimes helps disambiguate Temporary ambiguity: When Madonna sings the song ... Slide from Jennifer Venditti

Phrasing sometimes helps disambiguate Temporary ambiguity: When Madonna sings the song is a hit. Slide from Jennifer Venditti

Phrasing sometimes helps disambiguate Temporary ambiguity: When Madonna sings % the song is a hit. When Madonna sings the song % it’s a hit. [from Speer & Kjelgaard (1992)] Slide from Jennifer Venditti

Phrasing sometimes helps disambiguate Mary & Elena’s mother mall I met Mary and Elena’s mother at the mall yesterday One intonation phrase with relatively flat overall pitch range. Slide from Jennifer Venditti

Phrasing sometimes helps disambiguate Elena’s mother mall Mary I met Mary and Elena’s mother at the mall yesterday Separate phrases, with expanded pitch movements. Slide from Jennifer Venditti

Intonational tunes

Yes-No question tune are LEGUMES a good source of vitamins Rise from the main accent to the end of the sentence. Slide from Jennifer Venditti

Yes-No question tune are legumes a GOOD source of vitamins Rise from the main accent to the end of the sentence. Slide from Jennifer Venditti

Yes-No question tune are legumes a good source of VITAMINS Rise from the main accent to the end of the sentence. Slide from Jennifer Venditti

WH-questions WHAT are a good source of vitamins [I know that many natural foods are healthy, but ...] WHAT are a good source of vitamins WH-questions typically have falling contours, like statements. Slide from Jennifer Venditti

Broad focus “Tell me something about the world.” legumes are a good source of vitamins In the absence of narrow focus, English tends to mark the first and last ‘content’ words with perceptually prominent accents. Slide from Jennifer Venditti

Rising statements “Tell me something I didn’t already know.” legumes are a good source of vitamins [... does this statement qualify?] High-rising statements can signal that the speaker is seeking approval. Slide from Jennifer Venditti

Yes-No question are legumes a good source of VITAMINS Rise from the main accent to the end of the sentence. Slide from Jennifer Venditti

‘Surprise-redundancy’ tune [How many times do I have to tell you ...] legumes are a good source of vitamins Low beginning followed by a gradual rise to a high at the end. Slide from Jennifer Venditti

‘Contradiction’ tune linguini isn’t a good source of vitamins “I’ve heard that linguini is a good source of vitamins.” linguini isn’t a good source of vitamins [... how could you think that?] Sharp fall at the beginning, flat and low, then rising at the end. Slide from Jennifer Venditti

Using Intonation in Spoken Language Processing Prominence/Accent: Tells us about focus of utterance Tune: whether utterance is question/statement, important for affect extraction Boundaries: can help parsing

Today: Acoustic Phonetics No math today PRAAT and sound waves Segmental Phenomena: Spectra Spectrograms Formants Reading spectrograms The Source Filter Theory: why formants Suprasegmental Phenomena: Intonation F0, pitch, intensity Accents + Prominence Boundaries Tunes