Structure of Spoken Language


CS 551/651: Structure of Spoken Language
Lecture 5: Characteristics of Place of Articulation; Phonetic Transcription
John-Paul Hosom, Fall 2010

Acoustic-Phonetic Features: Manner of Articulation

Approximately 8 manners of articulation:

Name          Sub-Types           Examples
Vowel         vowel, diphthong    aa, iy, uw, eh, ow, …
Approximant   liquid, glide       l, r, w, y
Nasal                             m, n, ng
Stop          unvoiced, voiced    p, t, k, b, d, g
Fricative     unvoiced, voiced    f, th, s, sh, v, dh, z, zh
Affricate     unvoiced, voiced    ch, jh
Aspiration                        h
Flap                              dx, nx

A change in manner of articulation is usually abrupt and visible; manner provides much information about the location of phonemes.

Acoustic-Phonetic Features: Place of Articulation

Approximately 8 places of articulation for consonants:

Name              Examples
Labial            p, b, m, (w)
Labio-Dental      f, v
Dental            th, dh
Alveolar          t, d, s, z, n, l*
Palato-Alveolar   sh, zh, ch**, jh**, r***
Palatal           y
Velar             k, g, ng, (w)
Glottal           h

*   /l/ doesn't have the same coarticulatory properties as other alveolars
**  starts as alveolar (/t/, /d/), then becomes palato-alveolar
*** /r/ can have a complex place of articulation

Place of articulation is more subject to coarticulation than manner; the F2 trajectory is important for identifying place of articulation.
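The place table above can be held as a simple lookup structure. The following is a minimal sketch; the names `PLACE_OF_ARTICULATION` and `places_for` are illustrative and not from the lecture. It also shows that /w/ appears under two places.

```python
# Place-of-articulation lookup built from the table on the slide.
# The dictionary and function names are illustrative, not from the lecture.
PLACE_OF_ARTICULATION = {
    "labial": ["p", "b", "m", "w"],
    "labio-dental": ["f", "v"],
    "dental": ["th", "dh"],
    "alveolar": ["t", "d", "s", "z", "n", "l"],
    "palato-alveolar": ["sh", "zh", "ch", "jh", "r"],
    "palatal": ["y"],
    "velar": ["k", "g", "ng", "w"],
    "glottal": ["h"],
}


def places_for(phoneme):
    """Return every place listed for a phoneme, in table order."""
    return [place for place, phones in PLACE_OF_ARTICULATION.items()
            if phoneme in phones]


print(places_for("w"))   # /w/ is listed as both labial and velar
print(places_for("sh"))
```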

Acoustic-Phonetic Features: Place of Articulation

Labial (/p/, /b/, /m/, /w/):
• constriction (or complete closure) at lips
• the only unvoiced labial is /p/
• the only nasal labial is /m/
• characterized by F1, F2, (even) F3 of adjacent vowel(s) rapidly and briefly decreasing at border with labial

Acoustic-Phonetic Features: Place of Articulation

Labio-Dental (/f/, /v/):
• produced by constriction between lower lip and upper teeth
• in English, all labio-dental phonemes are fricatives
• can be characterized by formants of adjacent vowel(s) decreasing at border with the labio-dental (similar to characteristics of labials)

Dental (/th/, /dh/):
• produced by constriction between tongue tip and upper teeth (sometimes tongue tip is closer to alveolar ridge)
• in English, all dental phonemes are fricatives
• may be characterized by stronger energy above 6 kHz, but weaker than the /sh/, /zh/ fricatives

Acoustic-Phonetic Features: Place of Articulation

Alveolar (/t/, /d/, /s/, /z/, /n/, /l/):
• tongue tip is at or near alveolar ridge
• a large number of English consonants are alveolar
• primary cue to alveolars: F2 of neighboring vowel(s) is around 1800 Hz, except for /l/
• /l/ has low F1 (about 400 Hz) and F2 (about 1000 Hz), high F3
• /l/ before a vowel is “light” /l/; after a vowel it is “dark” /l/

Acoustic-Phonetic Features: Place of Articulation

Palato-Alveolar (/sh/, /zh/, /ch/, /jh/, /r/):
• tongue is between alveolar ridge and hard palate
• 2 fricatives, 2 affricates, 1 rhotic consonant (r sound)
• retroflex has “depression” midway along tongue
• the palato-alveolar fricatives tend to have strong energy due to weak constriction allowing large airflow
• /r/ (and /er/) most easily identified by F3 below 2000 Hz
• /r/ sometimes considered alveolar approximant

Palatal (/y/):
• produced with tongue close to hard palate
• “extreme” production of /iy/
• F1-F2 tend to be more spread than /iy/, F1 is lower than /iy/

Acoustic-Phonetic Features: Place of Articulation

Velar (/k/, /g/, /ng/):
• produced with constriction against velum (soft palate)
• only plosives /k/ and /g/, and nasal /ng/
• characteristic of velars is the “velar pinch”, in which F2 and F3 of the neighboring vowel become very close at the boundary with the velar; most visible in front vowel /ih/

Acoustic-Phonetic Features: Place of Articulation

Glottal (/h/):
• /h/ is the nominal glottal phoneme in English; in reality, the tongue can be in any vowel-like position
• the primary cue for /h/ is formant structure without voicing, an energy dip, and/or an increase in aspiration noise in higher frequencies

Distinctive Phonetic Features: Summary

• Distinctive features may be used to categorize phonetic sub-classes and show relationships between phonemes.
• There is often not a one-to-one correspondence between a feature value and a particular trait in the speech signal.
• A variety of context-dependent and context-independent cues (sometimes conflicting, sometimes complementary) serve to identify features.
• Speech is highly variable and highly context-dependent, and cues to phonemic identity are spread in both the spectral and time domains.
• This diffusion of features makes automatic speech recognition difficult, but human speech recognition is able to use it for robustness.

Redundancy

Distinctive features are not always independent; some redundancy may be implied (especially with binary features).

Example: Spanish

         i   e   a   o   u
High     +   −   −   −   +
Low      −   −   +   −   −
Back     −   −   +   +   +
Round    −   −   −   +   +

+high → −low
+low → −high
−back → −round
+round → +back
+low → +back
+low → −round
−back → −low
+round → −low

These relationships are language and feature-set specific.

(from Schane, p. 35-38)
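The redundancy implications for Spanish can be checked mechanically. The sketch below assumes the standard binary feature values for the five Spanish vowels and reads each implication as "premise → conclusion", with the signs restored from that matrix; the matrix values, the rule signs, and the names `SPANISH_VOWELS`, `RULES`, and `implication_holds` are assumptions made for illustration rather than text taken from the slide.

```python
# Assumed binary feature matrix for Spanish vowels (True = "+", False = "-").
SPANISH_VOWELS = {
    "i": {"high": True,  "low": False, "back": False, "round": False},
    "e": {"high": False, "low": False, "back": False, "round": False},
    "a": {"high": False, "low": True,  "back": True,  "round": False},
    "o": {"high": False, "low": False, "back": True,  "round": True},
    "u": {"high": True,  "low": False, "back": True,  "round": True},
}

# Each rule is (premise, conclusion); ("high", True) stands for +high.
RULES = [
    (("high", True), ("low", False)),
    (("low", True), ("high", False)),
    (("back", False), ("round", False)),
    (("round", True), ("back", True)),
    (("low", True), ("back", True)),
    (("low", True), ("round", False)),
    (("back", False), ("low", False)),
    (("round", True), ("low", False)),
]


def implication_holds(vowels, premise, conclusion):
    """True if every vowel matching the premise also matches the conclusion."""
    p_feat, p_val = premise
    c_feat, c_val = conclusion
    return all(feats[c_feat] == c_val
               for feats in vowels.values()
               if feats[p_feat] == p_val)


for premise, conclusion in RULES:
    print(premise, "->", conclusion, implication_holds(SPANISH_VOWELS, premise, conclusion))
```

A rule that is not a redundancy in this system, such as +back → +round, fails the check because /a/ is back but unrounded.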

Redundancy

Redundant information can be indicated by circling redundant features:

[Feature matrix for i, e, a, o, u (High, Low, Back, Round) with redundant values circled]

Some redundancies are universal (a sound can’t be both +high and +low).

Phonetic sequences also have constraints (redundant information): English has no more than 3 word-initial consonants; in this case, the first consonant is always /s/; the next is always /p/, /t/, or /k/; the third is always /r/ or /l/.

(from Schane, p. 36-40)
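The sequence constraint on three-consonant word onsets, as stated on the slide, can be written as a small predicate. This is a sketch of that stated constraint only: it is a necessary condition, not a full model of English phonotactics, and the function name is hypothetical.

```python
def satisfies_onset_constraint(c1, c2, c3):
    """Check the slide's constraint on a three-consonant word-initial cluster:
    first /s/, then /p/, /t/, or /k/, then /r/ or /l/."""
    return c1 == "s" and c2 in {"p", "t", "k"} and c3 in {"r", "l"}


print(satisfies_onset_constraint("s", "t", "r"))  # as in "string"
print(satisfies_onset_constraint("s", "p", "l"))  # as in "splash"
print(satisfies_onset_constraint("t", "s", "r"))  # violates the constraint
```

Because the first consonant is fully predictable once three consonants are present, it carries no information, which is the sense in which the slide calls it redundant.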

Phonetic Transcription

Given a corpus of speech data, it’s often necessary to create a transcription:
• word level
• phoneme level
• time-aligned phoneme level
• time-aligned detailed phoneme level (with diacritics)
• other information: phonetic stress, emotion, syntax, repairs

Most common are word-level and time-aligned phoneme level.

Time-aligned phonetic transcription example (start, end, label):

0     110   .pau
110   180   h
180   240   eh
240   280   l
280   390   ow
390   540   .pau

Other labels shown on the slide without times: t, uw, .br
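A time-aligned phoneme transcription of the kind shown above is easy to load into a structured form. The sketch below assumes one "start end label" triple per line and assumes the times are in milliseconds, which the slide does not state; `Segment` and `parse_alignment` are illustrative names.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start_ms: int   # assumed unit: milliseconds
    end_ms: int
    label: str


def parse_alignment(text: str) -> List[Segment]:
    """Parse lines of the form 'start end label' into Segment records."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        start, end, label = fields
        segments.append(Segment(int(start), int(end), label))
    return segments


EXAMPLE = """\
0 110 .pau
110 180 h
180 240 eh
240 280 l
280 390 ow
390 540 .pau
"""

for seg in parse_alignment(EXAMPLE):
    print(seg.label, seg.end_ms - seg.start_ms)
```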

Phonetic Transcription

Are phonemes precise quantities with exact boundaries? No… humans disagree on phonetic labels and boundary positions; disagreement may be a matter of interpretation of the utterance.

Phonetic label agreement between humans:

            Full Labels   Base Labels   Broad Categories
English     70%           71%           89%
German      61%           65%           81%
Mandarin    66%           78%           87%
Spanish     74%           82%           90%

Full, Base Label Set: 55 (English), 62 (German), 50 (Mandarin), 42 (Spanish)
Broad Categories: 7, corresponding to manner of articulation

*From Cole, Oshika, et al., ICSLP’94
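The gap between full-label and broad-category agreement can be illustrated with a small calculation. This sketch assumes the two labelings are already paired segment by segment, and it collapses labels using the manner classes from the earlier manner-of-articulation slide; the cited study used 7 broad categories whose exact definition is not given here, so the mapping below is an assumption, as are all names.

```python
# Assumed collapse from phoneme label to manner class (from the manner slide).
MANNER = {}
for manner, phones in {
    "vowel": ["aa", "iy", "uw", "eh", "ow"],
    "approximant": ["l", "r", "w", "y"],
    "nasal": ["m", "n", "ng"],
    "stop": ["p", "t", "k", "b", "d", "g"],
    "fricative": ["f", "th", "s", "sh", "v", "dh", "z", "zh"],
    "affricate": ["ch", "jh"],
    "aspiration": ["h"],
    "flap": ["dx", "nx"],
}.items():
    for phone in phones:
        MANNER[phone] = manner


def label_agreement(labels_a, labels_b, mapping=None):
    """Percentage of paired segments with the same label.
    If `mapping` is given, labels are collapsed through it first."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("labelings must be non-empty and of equal length")
    if mapping is not None:
        labels_a = [mapping[x] for x in labels_a]
        labels_b = [mapping[x] for x in labels_b]
    same = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return 100.0 * same / len(labels_a)


labeler_a = ["p", "aa", "t"]
labeler_b = ["b", "aa", "d"]
print(label_agreement(labeler_a, labeler_b))          # full labels
print(label_agreement(labeler_a, labeler_b, MANNER))  # broad categories
```

Voicing disagreements such as /p/ versus /b/ count against full-label agreement but disappear once labels are collapsed to manner, which is one reason broad-category agreement is higher.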

Phonetic Transcription

Human labelers: 70% agreement on 55 phonemes, 89% agreement on 7 categories.

Best phoneme-level automatic speech recognition results on TIMIT, with a 39-phoneme symbol set: 75.8% (Antoniou, 2001; Reynolds and Antoniou, 2003?)

Differences:
• Human agreement was evaluated on spontaneous speech (stories); TIMIT is read speech.
• Humans used 55 phonemes; 39 phonemes are used for evaluating TIMIT.

Phoneme agreement doesn’t translate into word accuracy… the human word error rate is typically an order of magnitude lower than that of the best automatic speech recognition system.

Phonetic Transcription

Phonetic label boundary agreement between humans:

Agreement is measured by comparing two manual labelings, A and B, and computing the percentage of cases in which B labels are within some threshold (20 msec) of A labels.

[Figure: agreement (%) versus threshold (msec)]

Average agreement of 93.8% within the 20 msec threshold; maximum agreement of 96% within 20 msec.
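The boundary-agreement measure described above can be sketched as follows. The sketch assumes both labelings cover the same phoneme sequence so that boundaries can be paired by index, and it treats "within" as inclusive of the threshold; both are assumptions, and the function name is illustrative.

```python
def boundary_agreement(bounds_a, bounds_b, threshold_ms=20.0):
    """Percentage of B boundaries lying within threshold_ms of the
    corresponding A boundary (boundaries paired by index)."""
    if len(bounds_a) != len(bounds_b) or not bounds_a:
        raise ValueError("boundary lists must be non-empty and of equal length")
    hits = sum(1 for a, b in zip(bounds_a, bounds_b)
               if abs(a - b) <= threshold_ms)
    return 100.0 * hits / len(bounds_a)


labeler_a = [110, 180, 240, 280]   # boundary times in msec
labeler_b = [112, 205, 240, 300]
print(boundary_agreement(labeler_a, labeler_b))        # 20 msec threshold
print(boundary_agreement(labeler_a, labeler_b, 30.0))  # looser threshold
```

Sweeping `threshold_ms` over a range of values produces an agreement-versus-threshold curve.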

Phonetic Transcription

Is there a “correct” answer?
No; it is inherently subjective, although semi-arbitrary guidelines can be imposed.

Is measuring accuracy meaningless?
No; phonemes do have identity and order, although details may be subjective. Sometimes very precise (if semi-arbitrary) labels and boundaries are extremely important (e.g. concatenative text-to-speech databases).

What about getting a computer to generate transcriptions, or at least phonetic boundaries?
• Advantages: consistent, fast
• Disadvantages: not accurate compared to human transcription; not robust to different speakers and environments

Phonetic Transcription

Automatic Phonetic Alignment (assume phonetic identity is known). Two common methods:

(1) “Forced Alignment”: Use an existing speech recognizer, constrained to recognize only the “correct” phoneme sequence. The search process used by HMM recognizers returns both phoneme identity and location. Location information is boundary information.

(2) Dynamic Time Warping:
    (a) Use text-to-speech or utterance “templates” to generate the same speech content with known boundaries.
    (b) Warp the time scale of the reference (TTS or template) with the input speech to minimize spectral error.
    (c) Convert known boundary locations to the original time scale.
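Steps (b) and (c) of the Dynamic Time Warping method can be sketched with a textbook DTW over frame sequences. To stay short, this example uses one scalar per frame as a stand-in for a spectral feature vector and absolute difference as the "spectral error"; a real system would use multi-dimensional features and a vector distance. The function names and the choice to map a reference boundary to the first input frame aligned with it are assumptions for illustration.

```python
import math


def dtw_path(ref, inp):
    """Return the minimum-cost warping path as (ref_index, inp_index) pairs."""
    n, m = len(ref), len(inp)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - inp[j - 1])          # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # advance reference only
                                 cost[i][j - 1],      # advance input only
                                 cost[i - 1][j - 1])  # advance both
    # Backtrack from the end to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def map_boundaries(path, ref_boundaries):
    """Map each reference frame index to the first input frame aligned to it."""
    first = {}
    for r, s in path:
        first.setdefault(r, s)
    return [first[b] for b in ref_boundaries]


reference = [0, 0, 1, 1, 2, 2]            # known boundaries at frames 2 and 4
spoken = [0, 0, 0, 1, 1, 1, 2, 2, 2]      # same content, spoken more slowly
path = dtw_path(reference, spoken)
print(map_boundaries(path, [2, 4]))
```

In this toy case the input is a time-stretched copy of the reference, so the known reference boundaries land exactly where the values change in the input.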

Phonetic Transcription

Accuracy of automatic alignment: speaker-independent alignment using Forced Alignment.

[Figure: agreement (%) versus threshold (msec)]

Phonetic Transcription

Comparing manual and automatic alignment of the TIMIT corpus:
• Automatic method still makes “stupid” mistakes.
• Manual labeling criteria are not rigorously defined.
• Performance degrades significantly in the presence of noise.
• Assumes the correct phonetic sequence is known…