CS 224S / LINGUIST 285 Spoken Language Processing Dan Jurafsky Stanford University Lecture 1: Introduction, ARPAbet, Articulatory Phonetics.



Today, April 1, Week 1
  Overview of 3++ components of course
    1. ASR
    2. Affect Extraction
    3. Dialogue
    ++ TTS, Speaker Recognition
  Very brief history
  Articulatory Phonetics
  Administration
  ARPAbet transcription

Speech Processing tasks
  speech recognition
  dialogue/conversational agents
  extracting spoken emotion and social meaning
  speaker id, speaker verification
  speech synthesis

Applications of Speech Recognition and Synthesis
  Personal assistants
  Hands-free (in car)
  Gaming
  Education
  Language teaching

LVCSR: Large Vocabulary Continuous Speech Recognition
  ~64,000 words
  Speaker independent (vs. speaker-dependent)
  Continuous speech (vs. isolated-word)

Current error rates
  Task                        Vocabulary   Word Error Rate %
  Digits                      11           0.5
  WSJ read speech             5K           3
  WSJ read speech             20K          3
  Broadcast news              64,000+      5
  Conversational Telephone    64,
  Ballpark numbers; exact numbers depend very much on the specific corpus
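The "Word Error Rate" column is conventionally computed as the word-level edit distance (substitutions + deletions + insertions) between the recognizer output and a reference transcript, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization; the function name `word_error_rate` is illustrative and not from the slides:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: 100 * (S + D + I) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.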

HSR versus ASR
  Task                  Vocab   ASR   Hum SR
  Continuous digits
  WSJ 1995 clean        5K      3     0.9
  WSJ 1995 w/noise      5K      9     1.1
  SWBD                  K       10?   3-4?
  Conclusions:
    Machines about 5 times worse than humans
    Gap increases with noisy speech
    These numbers are rough; take them with a grain of salt
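As a rough check on the "about 5 times worse" conclusion, the machine-to-human error ratio can be computed for the two rows whose values are fully legible here (WSJ 1995 clean and with noise). This is only an arithmetic sketch; the variable and function names are illustrative:

```python
# (task, ASR error %, human error %) for the rows that are fully legible above.
rows = [
    ("WSJ 1995 clean", 3.0, 0.9),
    ("WSJ 1995 w/noise", 9.0, 1.1),
]


def error_ratio(asr_pct: float, human_pct: float) -> float:
    """How many times higher the machine error rate is than the human one."""
    return asr_pct / human_pct


ratios = {task: round(error_ratio(asr, human), 1) for task, asr, human in rows}

if __name__ == "__main__":
    for task, ratio in ratios.items():
        print(f"{task}: machine error is about {ratio}x human error")
```

The ratio is larger for the noisy condition than for the clean one, which matches the "gap increases with noisy speech" bullet.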

Why is conversational speech harder? A piece of an utterance without context The same utterance with more context

Why foreign accents are hard A word by itself The word in context

Speech Recognition Design Intuition
  Build a statistical model of the speech-to-words process
  Collect lots and lots of speech, and transcribe all the words
  Train the model on the labeled speech
  Paradigm: Supervised Machine Learning + Search

Dialogue (= Conversational Agents)
  Personal Assistants
    Apple SIRI
    Microsoft Cortana
    Google Now

Two Paradigms for Dialogue
  POMDP
    Partially-Observed Markov Decision Processes
    Reinforcement Learning to learn what action to take
    Asking a question or answering one are just actions: “Speech acts”
  Simple regular expressions and slot filling
    Pre-built frames
      Calendar: Who, When, Where
    Filled by hand-built rules (“ on (Mon|Tue|Wed…) ”)
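The regular-expression and slot-filling paradigm can be sketched in a few lines. This is a toy illustration, assuming a calendar frame with the Who/When/Where slots named on the slide; only the "on (Mon|Tue|Wed…)" rule comes from the slide, while the "with" and "at" rules, the full day list, and the name `fill_frame` are hypothetical additions:

```python
import re

# Hand-built rules, one per slot. Only the "on <day>" rule is taken from the
# slide; the other two rules are invented here for illustration.
SLOT_RULES = {
    "who": re.compile(r"\bwith (?P<value>[A-Z][a-z]+)\b"),
    "when": re.compile(r"\bon (?P<value>Mon|Tue|Wed|Thu|Fri|Sat|Sun)\b"),
    "where": re.compile(r"\bat (?P<value>[A-Z][a-z]+)\b"),
}


def fill_frame(utterance: str) -> dict:
    """Fill the calendar frame; slots with no matching rule stay None."""
    frame = {}
    for slot, rule in SLOT_RULES.items():
        match = rule.search(utterance)
        frame[slot] = match.group("value") if match else None
    return frame


if __name__ == "__main__":
    print(fill_frame("Schedule a meeting with Jane on Tue at Gates"))
```

A dialogue system built this way would typically ask a follow-up question for any slot that is still `None`.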

Two Paradigms for Dialogue
  POMDP
    Exciting Research
    Studied by all universities
    Implemented in no commercial systems
  Simple regular expressions and slot filling
    State of the art used in all systems

Extraction of Social Meaning from Speech
  Detection of student uncertainty in tutoring: Forbes-Riley et al. (2008)
  Emotion detection (annoyance): Ang et al. (2002)
  Detection of deception: Newman et al. (2003)
  Detection of charisma: Rosenberg and Hirschberg (2005)
  Speaker stress, trauma: Rude et al. (2004), Pennebaker and Lay (2002)

Conversational style
  Given speech and text from a conversation, can we tell if a speaker is
    Awkward?
    Flirtatious?
    Friendly?
  Dataset: minute “speed-dates”
  Each subject rated their partner for these styles
  The following segment has been lightly signal-processed:

Speaker Recognition tasks
  Speaker Recognition
    Speaker Verification (Speaker Detection)
      Is this speech sample from a particular speaker?
      Is that Jane?
    Speaker Identification
      Which of these speakers does this sample come from?
      Who is that?
  Related tasks: Gender ID, Language ID
    Is this a woman or a man?
  Speaker Diarization
    Segmenting a dialogue or multiparty conversation
    Who spoke when?
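The difference between verification (a yes/no decision about one claimed speaker) and identification (choosing one of N enrolled speakers) can be made concrete with a toy sketch. It assumes, purely for illustration, that each speaker and each sample has already been reduced to a fixed-length feature vector and that cosine similarity is the score; the slides do not specify any of this, and the names, vectors, and threshold are hypothetical:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(sample, claimed_speaker_vector, threshold=0.8):
    """Verification: binary decision, 'is that Jane?'."""
    return cosine(sample, claimed_speaker_vector) >= threshold


def identify(sample, enrolled):
    """Identification: pick the best-scoring of N enrolled speakers."""
    return max(enrolled, key=lambda name: cosine(sample, enrolled[name]))


if __name__ == "__main__":
    enrolled = {"jane": [1.0, 0.0], "bob": [0.0, 1.0]}  # made-up vectors
    sample = [0.9, 0.1]
    print(verify(sample, enrolled["jane"]), identify(sample, enrolled))
```

Verification needs a threshold because "none of the above" is a possible answer; closed-set identification, as sketched, always returns some enrolled speaker.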

Applications of Speaker Recognition
  Speaker Recognition:
    Speaker verification (binary decision)
      Voice password
      Telephone assistant
    Speaker identification (one of N)
      Criminal investigation
  Diarization
    Transcribing meetings

TTS (= Text-to-Speech) (= Speech Synthesis)
  Produce speech from a text input
  Applications:
    Personal Assistants
      Apple SIRI
      Microsoft Cortana
      Google Now
    Games
    Airport Announcements

Unit Selection TTS Overview
  Main Commercial Algorithm
    Google TTS
  Collect lots of speech (5-50 hours) from one speaker, transcribe very carefully, all the syllables and phones and whatnot
  To synthesize a sentence, patch together syllables and phones from the training data
  Paradigm: search
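To make "Paradigm: search" concrete, here is a toy sketch of choosing one recorded unit per target phone so that neighbouring units join smoothly. Everything specific is an assumption made for illustration: the database layout, the use of a single pitch value per unit, the join cost (absolute pitch difference), and the name `select_units`. Real systems use much richer target and join costs.

```python
def select_units(targets, database):
    """Pick one unit per target phone, minimizing the summed join cost.

    database maps a phone to a list of (unit_id, pitch_hz) candidates.
    The join cost between consecutive units is |pitch difference| (a toy cost).
    """
    candidates = [database[phone] for phone in targets]

    # costs[i][j]: cheapest total cost of any path ending in candidate j of target i.
    # back[i][j]: index of the previous candidate on that cheapest path.
    costs = [[0.0] * len(candidates[0])]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row_cost, row_back = [], []
        for _, pitch in candidates[i]:
            options = [
                costs[i - 1][k] + abs(pitch - prev_pitch)
                for k, (_, prev_pitch) in enumerate(candidates[i - 1])
            ]
            best_k = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[best_k])
            row_back.append(best_k)
        costs.append(row_cost)
        back.append(row_back)

    # Trace back from the cheapest final candidate.
    j = min(range(len(costs[-1])), key=costs[-1].__getitem__)
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j][0])
        j = back[i][j]
    return path[::-1]


if __name__ == "__main__":
    db = {
        "k": [("k1", 100), ("k2", 200)],
        "ae": [("ae1", 210), ("ae2", 120)],
        "t": [("t1", 205)],
    }
    print(select_units(["k", "ae", "t"], db))
```

The dynamic program avoids enumerating every combination of units, which matters when each phone has thousands of recorded candidates.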

History: foundational insights 1900s-1950s
  Automaton:
    Markov 1911
    Turing 1936
    McCulloch-Pitts neuron (1943)
    Shannon (1948) link between automata and Markov models
  Human speech processing
    Fletcher at Bell Labs (1920’s)
  Probabilistic/Information-theoretic models
    Shannon (1948)

Speech synthesis is old!
  Pictures and some text from Hartmut Traunmüller’s web site:
  Von Kempelen 1780 (b. Bratislava 1734, d. Vienna 1804)
  Leather resonator manipulated by the operator to try and copy vocal tract configuration during sonorants (vowels, glides, nasals)
  Bellows provided air stream, counterweight provided inhalation
  Vibrating reed produced periodic pressure wave

Von Kempelen:
  Small whistles controlled consonants
  Rubber mouth and nose; nose had to be covered with two fingers for non-nasals
  Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air
  From Traunmüller’s web site

History: Early Recognition
  1920’s Radio Rex
    Celluloid dog with iron base held within house by electromagnet against force of spring
    Current to magnet flowed through bridge which was sensitive to energy at 500 Hz
    500 Hz energy caused bridge to vibrate, interrupting current, making dog spring forward
    The sound “e” (ARPAbet [eh]) in Rex has 500 Hz component
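Radio Rex was a mechanical detector of energy near 500 Hz. A software analogue, offered only as a sketch, is to measure the signal's energy at 500 Hz and trigger when it crosses a threshold. The Goertzel recurrence used below is a standard way to evaluate a single frequency; the sample rate, threshold, and function names are assumptions made for illustration:

```python
import math


def energy_at(samples, freq_hz, sample_rate):
    """Normalized energy at one frequency via the Goertzel recurrence.

    A unit-amplitude sine at freq_hz that completes a whole number of
    cycles in the buffer yields approximately 1.0.
    """
    n = len(samples)
    w = 2.0 * math.pi * freq_hz / sample_rate
    coeff = 2.0 * math.cos(w)
    s1 = s2 = 0.0  # the two most recent filter states
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    power = s1 * s1 + s2 * s2 - coeff * s1 * s2  # squared DFT magnitude
    return power / (n / 2.0) ** 2


def tone(freq_hz, sample_rate=8000, n=800):
    """A unit-amplitude sine wave, used here as a stand-in for speech."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def rex_jumps(samples, sample_rate=8000, threshold=0.5):
    """The dog springs forward when 500 Hz energy exceeds the threshold."""
    return energy_at(samples, 500.0, sample_rate) > threshold


if __name__ == "__main__":
    print(rex_jumps(tone(500)), rex_jumps(tone(1500)))
```

Like the original toy, this responds to any sound with strong 500 Hz energy, not specifically to the word "Rex".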

History: early ASR systems
  1950’s: Early Speech recognizers
    1952: Bell Labs single-speaker digit recognizer
      Measured energy from two bands (formants)
      Built with analog electrical components
      2% error rate for single speaker, isolated digits
    1958: Dudley built classifier that used continuous spectrum rather than just formants
    1959: Denes ASR combining grammar and acoustic probability

History: early ASR systems
  1960’s
    FFT - Fast Fourier transform (Cooley and Tukey 1965)
    LPC - linear prediction (1968)
    1969 John Pierce letter “Whither Speech Recognition?”
      Random tuning of parameters
      Lack of scientific rigor, no evaluation metrics
      Need to rely on higher level knowledge

ASR: 1970’s and 1980’s
  Hidden Markov Model 1972
    Independent application of Baker (CMU) and Jelinek/Bahl/Mercer lab (IBM) following work of Baum and colleagues at IDA
  ARPA project year speech understanding project: 1000 word vocab, continuous speech, multi-speaker
    SDC, CMU, BBN
    Only 1 CMU system achieved goal
  1980’s+ Annual ARPA “Bakeoffs”
  Large corpus collection
    TIMIT
    Resource Management
    Wall Street Journal

Admin: Requirements and Grading
  Readings:
    Selected chapters from Jurafsky & Martin, Speech and Language Processing.
    A few conference and journal papers
  Grading
    Homework: 45%
      1 transcription assignment, 4 programming assignments
    Final Project: 45%
      Group projects of 3 people; 2 if necessary.
    Participation: 10%

Overview of the course The TAs: Andrew Maas (Head TA) Peng Qi Sushobhan Nayak plus one more to come

Phonetics
  ARPAbet
    An alphabet for transcribing American English phonetic sounds.
  Articulatory Phonetics
    How speech sounds are made by articulators (moving organs) in mouth.
  Acoustic Phonetics
    Acoustic properties of speech sounds

ARPAbet
  The CMU Pronouncing Dictionary
  What about other languages?
    International Phonetic Alphabet

ARPAbet Vowels
      b_d          ARPA        b_d       ARPA
   1  bead         iy       9  bode      ow
   2  bid          ih      10  booed     uw
   3  bayed        ey      11  bud       ah
   4  bed          eh      12  bird      er
   5  bad          ae      13  bide      ay
   6  bod(y)       aa      14  bowed     aw
   7  bawd         ao      15  Boyd      oy
   8  Budd(hist)   uh
  Sounds from Ladefoged
  Note: Many speakers pronounce Buddhist with the vowel uw as in booed, so for them [uh] is instead the vowel in “put” or “book”
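For programming work, the vowel chart is convenient to hold as a lookup table. A minimal sketch: the word-to-vowel pairs are copied from the chart, while the dictionary and function names are illustrative:

```python
# Example b_d word -> ARPAbet vowel symbol, copied from the chart above.
VOWEL_EXAMPLES = {
    "bead": "iy", "bid": "ih", "bayed": "ey", "bed": "eh", "bad": "ae",
    "bod(y)": "aa", "bawd": "ao", "Budd(hist)": "uh", "bode": "ow",
    "booed": "uw", "bud": "ah", "bird": "er", "bide": "ay",
    "bowed": "aw", "Boyd": "oy",
}


def example_word_for(vowel: str) -> str:
    """Return the chart's example word for an ARPAbet vowel symbol."""
    for word, symbol in VOWEL_EXAMPLES.items():
        if symbol == vowel:
            return word
    raise KeyError(f"no example word for vowel {vowel!r}")


if __name__ == "__main__":
    print(VOWEL_EXAMPLES["bead"], example_word_for("ae"))
```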

The Speech Chain (Denes and Pinson) SPEAKER HEARER

Speech Production Process
  Respiration:
    We (normally) speak while breathing out. Respiration provides airflow. “Pulmonic egressive airstream”
  Phonation
    Airstream sets vocal folds in motion. Vibration of vocal folds produces sounds. Sound is then modulated by:
  Articulation and Resonance
    Shape of vocal tract, characterized by:
      Oral tract
        Teeth, soft palate (velum), hard palate
        Tongue, lips, uvula
      Nasal tract
  Text adapted from Sharon Rose

Sagittal section of the vocal tract (Techmer 1880)
  Figure labels: Nasal Cavity, Pharynx, Vocal Folds (within the Larynx), Trachea, Lungs
  Text copyright J. J. Ohala, Sept 2001, from Sharon Rose slide

From Mark Liberman’s website, from Ultimate Visual Dictionary

From Mark Liberman’s Web Site, from Language Files (7th ed)

Vocal tract Figure thnx to John Coleman!!

Figure of Ken Stevens, labels from Peter Ladefoged’s web site

Vocal tract movie (high speed x-ray) Figure of Ken Stevens, from Peter Ladefoged’s web site

USC’s SAIL Lab Shri Narayanan

Tamil

Larynx and Vocal Folds
  The Larynx (voice box)
    A structure made of cartilage and muscle
    Located above the trachea (windpipe) and below the pharynx (throat)
    Contains the vocal folds
    (adjective for larynx: laryngeal)
  Vocal Folds (older term: vocal cords)
    Two bands of muscle and tissue in the larynx
    Can be set in motion to produce sound (voicing)
  Text from slides by Sharon Rose UCSD LING 111 handout

The larynx, external structure, from front Figure thnx to John Coleman!!

Vertical slice through larynx, as seen from back Figure thnx to John Coleman!!

Voicing:
  Air comes up from lungs
  Forces its way through vocal cords, pushing open (2,3,4)
  This causes air pressure in glottis to fall, since:
    when gas runs through constricted passage, its velocity increases (Venturi tube effect)
    this increase in velocity results in a drop in pressure (Bernoulli principle)
  Because of drop in pressure, vocal cords snap together again (6-10)
  Single cycle: ~1/100 of a second.
  Figure & text from John Coleman’s web site
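The "~1/100 of a second" cycle time corresponds to a vocal fold vibration rate of roughly 100 cycles per second, since frequency is the reciprocal of the period. A tiny arithmetic sketch; the function name is illustrative:

```python
def frequency_from_period(period_seconds: float) -> float:
    """Vibration rate in Hz for one open-close cycle of the given duration."""
    return 1.0 / period_seconds


if __name__ == "__main__":
    # One cycle of about 1/100 s, as stated on the slide.
    print(frequency_from_period(1 / 100))
```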

Voicelessness
  When vocal cords are open, air passes through unobstructed
  Voiceless sounds: p/t/k/s/f/sh/th/ch
  If the air moves very quickly, the turbulence causes a different kind of phonation: whisper

Vocal folds open during breathing From Mark Liberman’s web site, from Ultimate Visual Dictionary

Vocal Fold Vibration UCLA Phonetics Lab Demo

Consonants and Vowels
  Consonants: phonetically, sounds with audible noise produced by a constriction
  Vowels: phonetically, sounds with no audible noise produced by a constriction
  (it’s more complicated than this, since we have to consider syllabic function, but this will do for now)
  Text adapted from John Coleman

Place of Articulation
  Consonants are classified according to the location where the airflow is most constricted. This is called place of articulation.
  Three major kinds of place of articulation:
    Labial (with lips)
    Coronal (using tip or blade of tongue)
    Dorsal (using back of tongue)

Places of articulation labial dental alveolar post-alveolar/palatal velar uvular pharyngeal laryngeal/glottal Figure thanks to Jennifer Venditti

Coronal place
  Figure labels: dental, alveolar, post-alveolar/palatal
  Dental: th/dh
  Alveolar: t/d/s/z/l
  Post: sh/zh/y
  Figure thanks to Jennifer Venditti

Dorsal Place
  Figure labels: velar, uvular, pharyngeal
  Velar: k/g/ng
  Figure thanks to Jennifer Venditti

Manner of Articulation
  Stop: complete closure of articulators, so no air escapes through mouth
    Oral stop: soft palate (velum) is raised, no air escapes through nose. Air pressure builds up behind closure, explodes when released
      p, t, k, b, d, g
    Nasal stop: oral closure, but soft palate (velum) is lowered, air escapes through nose.
      m, n, ng

Oral vs. Nasal Sounds Thanks to Jong-bok Kim for this figure!

More on Manner of articulation of consonants
  Fricatives
    Close approximation of two articulators, resulting in turbulent airflow between them, producing a hissing sound.
      f, v, s, z, th, dh
  Approximant
    Not quite-so-close approximation of two articulators, so no turbulence
      y, r
  Lateral approximant
    Obstruction of airstream along center of oral tract, with opening around sides of tongue.
      l
  Text from Ladefoged “A Course in Phonetics”

More on manner of articulation of consonants
  Tap or flap
    Tongue makes a single tap against the alveolar ridge
      dx in “butter”
  Affricate
    Stop immediately followed by a fricative
      ch, jh

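Articulatory parameters for English consonants (in ARPAbet)
  MANNER OF ARTICULATION, with PLACE OF ARTICULATION for each consonant
    stop:     bilabial p b; alveolar t d; velar k g; glottal q
    fric.:    labio-dental f v; inter-dental th dh; alveolar s z; palatal sh zh; glottal h
    affric.:  palatal ch jh
    nasal:    bilabial m; alveolar n; velar ng
    approx:   bilabial w; alveolar l/r; palatal y
    flap:     alveolar dx
  VOICING: voiceless, voiced (in each pair, the voiceless consonant is listed first)
  Table from Jennifer Venditti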
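The consonant chart is essentially a three-feature description (voicing, place, manner) of each ARPAbet symbol, which maps naturally onto a dictionary. A sketch under that reading of the chart; the names `CONSONANTS`, `describe`, and `differ_only_in_voicing` are illustrative:

```python
# ARPAbet consonant -> (voicing, place, manner), following the chart above.
CONSONANTS = {
    "p": ("voiceless", "bilabial", "stop"),
    "b": ("voiced", "bilabial", "stop"),
    "t": ("voiceless", "alveolar", "stop"),
    "d": ("voiced", "alveolar", "stop"),
    "k": ("voiceless", "velar", "stop"),
    "g": ("voiced", "velar", "stop"),
    "q": ("voiceless", "glottal", "stop"),
    "f": ("voiceless", "labiodental", "fricative"),
    "v": ("voiced", "labiodental", "fricative"),
    "th": ("voiceless", "interdental", "fricative"),
    "dh": ("voiced", "interdental", "fricative"),
    "s": ("voiceless", "alveolar", "fricative"),
    "z": ("voiced", "alveolar", "fricative"),
    "sh": ("voiceless", "palatal", "fricative"),
    "zh": ("voiced", "palatal", "fricative"),
    "h": ("voiceless", "glottal", "fricative"),
    "ch": ("voiceless", "palatal", "affricate"),
    "jh": ("voiced", "palatal", "affricate"),
    "m": ("voiced", "bilabial", "nasal"),
    "n": ("voiced", "alveolar", "nasal"),
    "ng": ("voiced", "velar", "nasal"),
    "w": ("voiced", "bilabial", "approximant"),
    "l": ("voiced", "alveolar", "approximant"),
    "r": ("voiced", "alveolar", "approximant"),
    "y": ("voiced", "palatal", "approximant"),
    "dx": ("voiced", "alveolar", "flap"),
}


def describe(phone: str) -> str:
    """Spell out a consonant as 'voicing place manner'."""
    return " ".join(CONSONANTS[phone])


def differ_only_in_voicing(a: str, b: str) -> bool:
    """True for pairs such as p/b that share place and manner."""
    return CONSONANTS[a][1:] == CONSONANTS[b][1:] and CONSONANTS[a][0] != CONSONANTS[b][0]


if __name__ == "__main__":
    print(describe("dh"), differ_only_in_voicing("s", "z"))
```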

Tongue position for vowels

Vowels: IY, AA, UW
  Fig. from Eric Keller

American English Vowel Space (figure)
  Axes: FRONT to BACK, HIGH to LOW
  Vowels shown: ey ow aw oy ay iy ih eh ae aa ao uw uh ah ax ix ux
  Figure from Jennifer Venditti

[iy] vs. [uw] Figure from Jennifer Venditti, from a lecture given by Rochelle Newman

[ae] vs. [aa] Figure from Jennifer Venditti, from a lecture given by Rochelle Newman

More phonetic structure Syllables Composed of vowels and consonants. Not well defined. Something like a “vowel nucleus with some of its surrounding consonants”.

More phonetic structure
  Stress
    Some syllables have more energy than others
    Stressed syllables versus unstressed syllables
      (an) ‘INsult vs. (to) in’SULT
      (an) ‘OBject vs. (to) ob’JECT
    Simple model: every multi-syllabic word has one syllable with “primary stress”
      We can represent by using the number “1” on the vowel (and an implicit unmarking on the other vowels)
      “table”: t ey1 b ax l
      “machine”: m ax sh iy1 n
    Also possible: “secondary stress”, marked with a “2”
      ih-2 n f axr m ey-1 sh ax n
    Third category: reduced: schwa: ax
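The stress digits are easy to read off programmatically. A minimal sketch that accepts both spellings used on the slide ("ey1" and "ey-1") and treats a phone without a digit as unmarked; the function names are illustrative:

```python
def parse_pronunciation(pron: str):
    """Split a transcription into (phone, stress) pairs.

    stress is 1 (primary), 2 (secondary), or None when the phone is unmarked.
    """
    phones = []
    for token in pron.split():
        token = token.replace("-", "")  # accept "ey-1" as well as "ey1"
        if token[-1] in "12":
            phones.append((token[:-1], int(token[-1])))
        else:
            phones.append((token, None))
    return phones


def vowels_with_stress(pron: str, level: int = 1):
    """Return the vowels carrying the given stress level, in order."""
    return [phone for phone, stress in parse_pronunciation(pron) if stress == level]


if __name__ == "__main__":
    print(vowels_with_stress("t ey1 b ax l"))
    print(vowels_with_stress("ih-2 n f axr m ey-1 sh ax n", level=2))
```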

Where to go for more info
  Ladefoged, Peter. A Course in Phonetics
  Mark Liberman’s site: 1/ling001/phonetics.html
  John Coleman’s site: phil_phonetics_course_index.html

Summary
  Overview of 3++ parts of course
    ASR
    Dialogue
    Affect Extraction
    + TTS and Speaker Recognition
  Very brief history
  Articulatory Phonetics
  ARPAbet transcription
  NEXT TIME: Acoustic phonetics