Pitch Tracking + Prosody January 17, 2012

The Plan for Today
One announcement: On Thursday, we'll meet in Craigie Hall D 428. We'll be working on intonation transcription…
The plan for today:
1. Wrap up A-to-D conversion
2. Automatic pitch tracking
3. (Brief) suprasegmentals review
4. The basics of English intonation

Sample Size Demo
Compare: 11k / 16 bits, 11k / 8 bits, 8k / 16 bits, 8k / 8 bits (telephone quality).
Note: CDs sample at 44,100 Hz and have 16-bit quantization.
Also check out the "bad" and "actedout" examples in Praat. Also: look at Praat's representation of a sound file.

Quantization Range
With 16-bit quantization, we can encode 2^16 = 65,536 different possible amplitude values.
Remember that I(dB) = 10 * log10(A^2 / r^2)
Substitute the max and min amplitude values for A and r, respectively, and we get:
I(dB) = 10 * log10(65,536^2 / 1^2) = 96.3 dB
Some newer machines have 24-bit quantization: 2^24 = 16,777,216 possible amplitude values.
I(dB) = 10 * log10(16,777,216^2 / 1^2) = 144.5 dB
This is bigger than the range of sounds we can listen to without damaging our hearing.
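As a quick check on these numbers, here is a minimal Python sketch (not part of the original slides; the function name is invented for illustration):

```python
import math

def dynamic_range_db(bits):
    """Dynamic range implied by a sample size, via I(dB) = 10 * log10(A^2 / r^2),
    with A = the largest encodable amplitude (2**bits - 1) and r = the smallest (1)."""
    a_max = 2 ** bits - 1
    return 10 * math.log10(a_max ** 2 / 1 ** 2)   # same as 20 * log10(a_max)

print(f"16-bit: {dynamic_range_db(16):.1f} dB")   # 96.3 dB
print(f"24-bit: {dynamic_range_db(24):.1f} dB")   # 144.5 dB
```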

Problem: Clipping
Clipping occurs when the pressure values in the analog signal exceed the range that the sample size can represent during digitization.
Check out the "sylvester" and "normal" files in Praat.

A Note on Formats
Digitized sound files come in different formats: .wav, .aiff, .au, etc.
Lossless formats digitize sound in the way I've just described; they differ only in terms of "header" information, specified limits on file size, etc.
Lossy formats use algorithms to condense the size of sound files… and the sound file loses information in the process.
For instance: the .mp3 format primarily saves space by eliminating some very high frequency information (which is hard for people to hear).

AIFF vs. MP3
Compare the .aiff format with the .mp3 format (digitized at 128 kbps). This trick can work pretty well…

MP3 vs. MP3
Compare the .mp3 format at 128 kbps with the .mp3 format at 64 kbps.
.mp3 conversion can induce reverb artifacts, and can also cut down on temporal resolution (among other things).

Sound Digitization Summary
Samples are taken of an analog sound's pressure value at a recurring sampling rate. This digitizes the time dimension of the waveform.
The sampling rate needs to be twice as high as any frequency component you want to capture in the signal. (Speech contains useful information up to roughly 10,000 Hz, so sampling rates of 20,000 Hz or more are typical for speech.)
Quantization converts the amplitude value of each sample into a binary number in the computer. This digitizes the amplitude dimension of the waveform.
Rounding errors can lead to quantization noise.
Excessive amplitude can lead to clipping errors.
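Both steps fit in a few lines of Python; this sketch (with invented parameter values, not from the slides) digitizes a test tone and measures the resulting quantization noise:

```python
import numpy as np

fs = 8000     # sampling rate (Hz): must exceed twice the highest frequency of interest
bits = 8      # sample size: 8 bits allows 2**8 = 256 amplitude levels

# Sampling: record the "analog" pressure value at regular intervals
t = np.arange(0, 0.01, 1 / fs)            # 10 ms of sample times
analog = np.sin(2 * np.pi * 440 * t)      # stand-in for a continuous signal

# Quantization: round each sample to the nearest encodable level
max_level = 2 ** (bits - 1) - 1           # 127 for 8-bit samples
digital = np.round(analog * max_level) / max_level

# The rounding error is what we hear as quantization noise
print(f"peak rounding error: {np.abs(analog - digital).max():.4f}")
```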

The Digitization of Pitch
Praat can give us a representation of speech in which a line traces the fundamental frequency (F0) of the speaker's voice: the blue line in Praat's display. This is also known as a "pitch track". How can we automatically "track" F0 in a sample of speech?

Pitch Tracking
Voicing: airflow through the vocal folds causes them to open and close rapidly, due to the Bernoulli effect. Each cycle sends an acoustic shockwave through the vocal tract… which takes the form of a complex wave. The rate at which the vocal folds open and close becomes the fundamental frequency (F0) of a voiced sound.

Voicing Bars

Individual glottal pulses

Voicing = Complex Wave
Note: voicing is not perfectly periodic; there is always some random variation from one cycle to the next. How can we measure the fundamental frequency of a complex wave?

The basic idea: figure out the duration of the period between successive cycles of the complex wave. Fundamental frequency = 1 / period duration.

Measuring F0
To figure out where one cycle ends and the next begins, the basic idea is to find how well successive "chunks" of a waveform match up with each other. One period = the length of the chunk that matches up best with the next chunk.
Automatic pitch tracking parameters to think about:
1. Window size (i.e., chunk size)
2. Step size
3. Frequency range (= period range)

Window (Chunk) Size Here’s an example of a small window

Window (Chunk) Size Here’s an example of a large(r) window

The initial window of the waveform is compared to another window (of the same duration) at a later point in the waveform.

Matching
The waveforms in the two windows are compared to see how well they match up. Correlation = a measure of how well the two windows match.

Autocorrelation The measure of correlation = Sum of the point-by-point products of the two chunks. The technical name for this is autocorrelation… because two parts of the same wave are being matched up against each other. (“auto” = self)

Autocorrelation Example
Consider window x, with n samples. What's its correlation with window y? (Note: window y must also have n samples.)
x1 = first sample of window x
x2 = second sample of window x
…
xn = nth (final) sample of window x
y1 = first sample of window y, etc.
Correlation (R) = x1*y1 + x2*y2 + … + xn*yn
The larger R is, the better the correlation.

By the Numbers
(The slide's table of sample-by-sample x, y, and product values is not preserved.) The sum of the point-by-point products is -.48: these two chunks are poorly correlated with each other.

By the Numbers, part 2
(Again, the table of sample values is not preserved.) The sum of the point-by-point products of windows x and z is 1.26: these two chunks are well correlated with each other (or at least better than the previous pair). Note: matching peaks count for more than matches close to 0.
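Since the slides' tables aren't preserved, here is a hedged Python reconstruction of the same exercise with invented sample values; note how the big positive products come from peaks lining up with peaks:

```python
def correlation(x, y):
    """Sum of the point-by-point products of two equal-length windows."""
    assert len(x) == len(y)
    return sum(xi * yi for xi, yi in zip(x, y))

# Invented values (the slides' actual numbers are not preserved):
x = [0.1, 0.6, 0.9, 0.4, -0.2, -0.8]
y = [-0.3, -0.7, 0.2, 0.5, 0.1, 0.6]   # misaligned window: products cancel or go negative
z = [0.2, 0.5, 0.8, 0.5, -0.1, -0.7]   # well-aligned window: peaks match peaks

print(correlation(x, y))   # -0.57: poorly correlated
print(correlation(x, z))   #  1.82: well correlated
```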

Back to (Digital) Reality
The waveforms in the two windows are compared to see how well they match up. Correlation = a measure of how well the two windows match. These two windows are poorly correlated.

Next: the pitch tracking algorithm moves further down the waveform and grabs a new window

The distance the algorithm moves forward in the waveform is called the step size.

Matching, again
The next window gets compared to the original. These two windows are also poorly correlated.

The algorithm keeps chugging along, one step at a time, and eventually…

Matching, again
…the best match is found. These two windows are highly correlated.

The fundamental period can be determined by calculating the length of time between the start of window 1 and the start of (well-correlated) window 2.

Mopping Up
Frequency is 1 / period. Q: How many possible periods does the algorithm need to check? A: Only those that fall within the frequency range (default in Praat: 75 to 600 Hz).
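To make the bookkeeping concrete, here is a small Python sketch (the sampling rate and "best lag" value are invented for the example):

```python
fs = 22050                  # sampling rate (Hz)
f_min, f_max = 75, 600      # Praat's default frequency range

# The frequency range caps how many candidate periods (lags) must be checked:
lag_min = int(fs / f_max)   # shortest period to try: 36 samples
lag_max = int(fs / f_min)   # longest period to try: 294 samples

best_lag = 150              # suppose this lag matched best (invented value)
f0 = fs / best_lag          # frequency = 1 / period
print(f"{f0:.1f} Hz")       # 147.0 Hz
```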

Moving on Another comparison window is selected and the whole process starts over again.

The algorithm ultimately spits out a pitch track. This one shows you the F0 value at each step, for the utterance "Uhm I would like a flight to Seattle from Albuquerque". Thanks to Chilin Shih for making these materials available.

Pitch Tracking in Praat
Play with the F0 range. Create a Pitch object. Also go to Manipulation… Pitch.

Summing Up
Pitch tracking uses three parameters:
1. Window size: ensures reliability. In Praat, the window size is always three times the longest possible period (e.g., 3 * 1/75 = .04 sec).
2. Step size: for temporal precision.
3. Frequency range: reduces the computational load.
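Putting the three parameters together, here is a simplified autocorrelation pitch tracker in Python. It is only a sketch of the idea from these slides, not Praat's actual algorithm (Praat normalizes and interpolates its correlations, among other refinements):

```python
import numpy as np

def track_pitch(signal, fs, f_min=75.0, f_max=600.0, step_s=0.01):
    """Simplified autocorrelation pitch tracker (illustration only)."""
    lag_min = int(fs / f_max)          # shortest candidate period, in samples
    lag_max = int(fs / f_min)          # longest candidate period, in samples
    win = 3 * lag_max                  # window size: 3x the longest period
    step = int(step_s * fs)            # step size, in samples

    track = []
    for start in range(0, len(signal) - win - lag_max, step):
        chunk = signal[start:start + win]
        best_lag, best_r = None, 0.0
        for lag in range(lag_min, lag_max + 1):
            shifted = signal[start + lag:start + lag + win]
            r = float(np.dot(chunk, shifted))   # sum of point-by-point products
            if r > best_r:
                best_lag, best_r = lag, r
        track.append(fs / best_lag if best_lag else 0.0)   # 0.0 = no match found
    return track

# Usage: a synthetic 220 Hz "voice" (fundamental plus one harmonic)
fs = 22050
t = np.arange(0, 0.5, 1 / fs)
voiced = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
print(track_pitch(voiced, fs)[:5])   # ~220.5 Hz on most steps (the nearest
                                     # whole-sample period is 100 samples)
```

On frames where the lag at twice the true period happens to match slightly better, this raw sketch reports half the true F0; that failure mode is exactly the pitch halving described next.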

Deep Thought Questions
What might happen if the shortest period checked is longer than the fundamental period, AND two fundamental periods fit inside a window?
Potential Problem #1: Pitch Halving
The pitch tracker thinks the fundamental period is twice as long as it is in reality → it estimates F0 to be half of its actual value.

Pitch Halving
(In the figure, the pitch track drops to half the true F0.) Check out the "normal" file in Praat.

More Deep Thoughts
What might happen if the shortest period checked is less than half of the fundamental period, AND the second half of the fundamental cycle is very similar to the first?
Potential Problem #2: Pitch Doubling
The pitch tracker thinks the fundamental period is half as long as it actually is → it estimates F0 to be twice as high as it is in reality.
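A hedged numerical illustration of that scenario (Python, with an invented signal): when the 300 Hz harmonic dominates a 150 Hz voice, each half-cycle looks much like the next, and the half-period lag matches almost as well as the true period:

```python
import numpy as np

fs = 22050
t = np.arange(0, 0.1, 1 / fs)

# A 150 Hz "voice" whose 300 Hz harmonic dominates: the second half of each
# fundamental cycle closely resembles the first half.
x = 0.2 * np.sin(2 * np.pi * 150 * t) + 1.0 * np.sin(2 * np.pi * 300 * t)

win = 441                     # a 20 ms comparison window

def r(lag):
    """Sum of point-by-point products at a given lag."""
    return float(np.dot(x[:win], x[lag:lag + win]))

true_period = fs // 150       # 147 samples
print(r(true_period))         # large: the true period matches well...
print(r(true_period // 2))    # ...but the half-period match is nearly as large,
                              # so a tracker can report 300 Hz instead of 150 Hz
```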

Pitch Doubling
(In the figure, the pitch track jumps to twice the true F0.)

Microperturbations
Another problem: speech waveforms are partly shaped by the type of segment being produced, so pitch tracking can become erratic at the juncture of two segments. In particular: voiced to voiceless segments, and sonorants to obstruents. These discontinuities in F0 are known as microperturbations. Also: transitions between modal and creaky voicing tend to be problematic.

Back to Language
F0 is important because it can be used by languages to signal differences in meaning. Note the three levels of description:
Acoustic = fundamental frequency
Perceptual = pitch
Linguistic = tone

A Typology
F0 is generally used in three different ways in language:
1. Tone languages (Chinese, Navajo, Igbo): lexically determined tone on every syllable. "Syllable-based" tone languages.
2. Accentual languages (Japanese, Swedish): the location of an accent in a particular word is lexically marked. "Word-based" tone languages.
3. Stress languages (English, Russian): it's complicated.