Structure of Spoken Language


1 Structure of Spoken Language
CS 551/651: Structure of Spoken Language
Lecture 5: Characteristics of Place of Articulation; Phonetic Transcription
John-Paul Hosom, Fall 2010

2 Acoustic-Phonetic Features: Manner of Articulation
Approximately 8 manners of articulation:

  Name         Sub-Types          Examples
  Vowel        vowel, diphthong   aa, iy, uw, eh, ow, ...
  Approximant  liquid, glide      l, r, w, y
  Nasal                           m, n, ng
  Stop         unvoiced, voiced   p, t, k, b, d, g
  Fricative    unvoiced, voiced   f, th, s, sh, v, dh, z, zh
  Affricate    unvoiced, voiced   ch, jh
  Aspiration                      h
  Flap                            dx, nx

Change in manner of articulation is usually abrupt and visible; manner provides much information about the location of phonemes.
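The manner table can be read as a simple phone-to-manner lookup. Below is a minimal Python sketch of such a lookup using the ARPAbet-style symbols from the slide; the dictionary, the function name, and the restriction to the listed example vowels are illustrative assumptions, not part of the course materials.

```python
# Illustrative mapping from ARPAbet-style phone symbols (as used on the
# slide) to manner of articulation.  The vowel set is abbreviated with
# "..." on the slide, so only the listed example vowels appear here.
MANNER = {
    **dict.fromkeys(["aa", "iy", "uw", "eh", "ow"], "vowel"),
    **dict.fromkeys(["l", "r", "w", "y"], "approximant"),
    **dict.fromkeys(["m", "n", "ng"], "nasal"),
    **dict.fromkeys(["p", "t", "k", "b", "d", "g"], "stop"),
    **dict.fromkeys(["f", "th", "s", "sh", "v", "dh", "z", "zh"], "fricative"),
    **dict.fromkeys(["ch", "jh"], "affricate"),
    "h": "aspiration",
    **dict.fromkeys(["dx", "nx"], "flap"),
}

def manner_of(phone: str) -> str:
    """Return the manner class for a phone symbol, or 'unknown'."""
    return MANNER.get(phone.lower(), "unknown")

print(manner_of("sh"))   # fricative
print(manner_of("jh"))   # affricate
```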

3 Acoustic-Phonetic Features: Place of Articulation
Approximately 8 places of articulation for consonants:

  Name             Examples
  Labial           p, b, m, (w)
  Labio-Dental     f, v
  Dental           th, dh
  Alveolar         t, d, s, z, n, l*
  Palato-Alveolar  sh, zh, ch**, jh**, r***
  Palatal          y
  Velar            k, g, ng, (w)
  Glottal          h

  *   /l/ doesn’t have the same coarticulatory properties as other alveolars
  **  /ch/ and /jh/ start as alveolar (/t/, /d/), then become palato-alveolar
  *** /r/ can have a complex place of articulation

Place of articulation is more subject to coarticulation than manner; the F2 trajectory is important for identifying place of articulation.

4 Acoustic-Phonetic Features: Place of Articulation
Labial (/p/, /b/, /m/, /w/):
• constriction (or complete closure) at the lips
• the only unvoiced labial is /p/; the only nasal labial is /m/
• characterized by F1, F2, (even) F3 of adjacent vowel(s) rapidly and briefly decreasing at the border with the labial

5 Acoustic-Phonetic Features: Place of Articulation
Labio-Dental (/f/, /v/):
• produced by constriction between the lower lip and the upper teeth
• in English, all labio-dental phonemes are fricatives
• can be characterized by formants of adjacent vowel(s) decreasing at the border with the consonant (similar to the characteristics of labials)

Dental (/th/, /dh/):
• produced by constriction between the tongue tip and the upper teeth (sometimes the tongue tip is closer to the alveolar ridge)
• in English, all dental phonemes are fricatives
• may be characterized by stronger energy above 6 kHz, but weaker than the /sh/, /zh/ fricatives

6 Acoustic-Phonetic Features: Place of Articulation
Alveolar (/t/, /d/, /s/, /z/, /n/, /l/):
• tongue tip is at or near the alveolar ridge
• a large number of English consonants are alveolar
• primary cue to alveolars: F2 of neighboring vowel(s) is around 1800 Hz, except for /l/
• /l/ has low F1 (≈400 Hz) and F2 (≈1000 Hz), and high F3
• /l/ before a vowel is “light” /l/; after a vowel it is “dark” /l/

7 Acoustic-Phonetic Features: Place of Articulation
Palato-Alveolar (/sh/, /zh/, /ch/, /jh/, /r/):
• tongue is between the alveolar ridge and the hard palate
• 2 fricatives, 2 affricates, 1 rhotic consonant (r sound)
• the retroflex /r/ has a “depression” midway along the tongue
• the palato-alveolar fricatives tend to have strong energy due to a weak constriction allowing large airflow
• /r/ (and /er/) is most easily identified by F3 below 2000 Hz
• /r/ is sometimes considered an alveolar approximant

Palatal (/y/):
• produced with the tongue close to the hard palate
• an “extreme” production of /iy/
• F1-F2 tend to be more spread than for /iy/; F1 is lower than for /iy/

8 Acoustic-Phonetic Features: Place of Articulation
Velar (/k/, /g/, /ng/):
• produced with constriction against the velum (soft palate)
• only the plosives /k/ and /g/, and the nasal /ng/
• characteristic of velars is the “velar pinch”, in which F2 and F3 of the neighboring vowel become very close at the boundary with the velar; most visible with the front vowel /ih/

9 Acoustic-Phonetic Features: Place of Articulation
Glottal (/h/):
• /h/ is the nominal glottal phoneme in English; in reality, the tongue can be in any vowel-like position
• the primary cue for /h/ is formant structure without voicing, an energy dip, and/or an increase in aspiration noise at higher frequencies

10 Distinctive Phonetic Features: Summary
• Distinctive features may be used to categorize phonetic sub-classes and show relationships between phonemes.
• There is often not a one-to-one correspondence between a feature value and a particular trait in the speech signal.
• A variety of context-dependent and context-independent cues (sometimes conflicting, sometimes complementary) serve to identify features.
• Speech is highly variable and highly context-dependent, and cues to phonemic identity are spread across both the spectral and time domains.
• The diffusion of features makes automatic speech recognition difficult, but human speech recognition is able to use this diffusion for robustness.

11 Redundancy
Distinctive features are not always independent; some redundancy may be implied (especially with binary features). Example: Spanish vowels

          i   e   a   o   u
  High    +   -   -   -   +
  Low     -   -   +   -   -
  Back    -   -   +   +   +
  Round   -   -   -   +   +

  +high  → -low      +round → +back     -back  → -low
  +low   → -high     +low   → +back     +round → -low
  -back  → -round    +low   → -round

These relationships are language and feature-set specific. (from Schane, p )
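To make the implications concrete, the sketch below encodes the Spanish vowel feature matrix and the redundancy rules above, then checks that every vowel satisfies every rule. This is a minimal illustration; the data structures and names are invented for this example.

```python
# Feature matrix for the five Spanish vowels (values as in the table above).
FEATURES = {
    "i": {"high": +1, "low": -1, "back": -1, "round": -1},
    "e": {"high": -1, "low": -1, "back": -1, "round": -1},
    "a": {"high": -1, "low": +1, "back": +1, "round": -1},
    "o": {"high": -1, "low": -1, "back": +1, "round": +1},
    "u": {"high": +1, "low": -1, "back": +1, "round": +1},
}

# Redundancy rules as (antecedent, consequent) pairs of (feature, value).
RULES = [
    (("high", +1), ("low", -1)),
    (("low", +1), ("high", -1)),
    (("back", -1), ("round", -1)),
    (("round", +1), ("back", +1)),
    (("low", +1), ("back", +1)),
    (("low", +1), ("round", -1)),
    (("back", -1), ("low", -1)),
    (("round", +1), ("low", -1)),
]

def satisfies(vowel_feats, rule):
    """True if the implication holds for this vowel (vacuously true
    when the antecedent does not apply)."""
    (f1, v1), (f2, v2) = rule
    return vowel_feats[f1] != v1 or vowel_feats[f2] == v2

for vowel, feats in FEATURES.items():
    assert all(satisfies(feats, r) for r in RULES), vowel
print("all redundancy rules hold for i, e, a, o, u")
```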

12 Redundancy
Redundant information can be indicated by circling the redundant (predictable) feature values in the matrix:

          i   e   a   o   u
  High    +   -   -   -   +
  Low     -   -   +   -   -
  Back    -   -   +   +   +
  Round   -   -   -   +   +

Some redundancies are universal (a segment can’t be both +high and +low). Phonetic sequences also have constraints (redundant information): English has no more than 3 word-initial consonants; in this case, the first consonant is always /s/, the next is always /p/, /t/, or /k/, and the third is always /r/ or /l/. (from Schane, p )

13 Phonetic Transcription
Given a corpus of speech data, it’s often necessary to create a transcription:
• word level
• phoneme level
• time-aligned phoneme level
• time-aligned detailed phoneme level (with diacritics)
• other information: phonetic stress, emotion, syntax, repairs
Most common are word-level and time-aligned phoneme-level transcriptions.
Time-aligned phonetic transcription example: pau h eh l ow pau t uw .br
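For the time-aligned formats above, a transcription is typically stored as one labeled time span per line. The sketch below assumes a simple “start end label” text format with times in seconds; this format, and the function name, are illustrative assumptions rather than the specific file format used with this corpus.

```python
from typing import List, Tuple

def read_time_aligned_labels(path: str) -> List[Tuple[float, float, str]]:
    """Read a time-aligned phoneme transcription.

    Assumes one "start end label" triple per line, e.g.:
        0.00 0.12 pau
        0.12 0.18 h
        0.18 0.27 eh
    """
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            start, end, label = float(parts[0]), float(parts[1]), parts[2]
            segments.append((start, end, label))
    return segments
```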

14 Phonetic Transcription
Are phonemes precise quantities with exact boundaries? No… humans disagree on phonetic labels and boundary positions; disagreement may be a matter of interpretation of the utterance.

Phonetic label agreement between humans:

              Full Labels   Base Labels   Broad Categories
  English         70%           71%             89%
  German          61%           65%             81%
  Mandarin        66%           78%             87%
  Spanish         74%           82%             90%

Full/Base label set sizes: 55 (English), 62 (German), 50 (Mandarin), (Spanish)
Broad categories: 7, corresponding to manner of articulation
*From Cole, Oshika, et al., ICSLP’94

15 Phonetic Transcription
70% agreement on 55 phonemes; 89% agreement on 7 categories.
Best phoneme-level automatic speech recognition results on TIMIT, with a 39-phoneme symbol set: 75.8% (Antoniou, 2001; Reynolds and Antoniou, 2003?)
Differences:
• Human agreement was evaluated on spontaneous speech (stories); TIMIT is read speech.
• Humans used 55 phonemes; 39 phonemes were used for evaluating TIMIT.
Phoneme agreement doesn’t translate into word accuracy… human word accuracy is typically an order of magnitude better than the best automatic speech recognition system.

16 Phonetic Transcription
Phonetic label boundary agreement between humans:
Agreement is measured by comparing two manual labelings, A and B, and computing the percentage of cases in which the B labels are within some threshold (20 msec) of the A labels.
[Figure: boundary agreement (%) vs. threshold (msec)]
Average agreement of 93.8% within a 20-msec threshold; maximum agreement of 96% within 20 msec.
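The agreement measure described above can be computed directly from two lists of boundary times. This is a minimal sketch under the assumption that both labelings contain the same phoneme sequence, so boundaries correspond one-to-one; the function name and example times are illustrative.

```python
def boundary_agreement(boundaries_a, boundaries_b, threshold=0.020):
    """Percentage of B boundaries within `threshold` seconds of the
    corresponding A boundary.

    Assumes both labelings share the same phoneme sequence, so the i-th
    boundary in A corresponds to the i-th boundary in B.
    """
    if len(boundaries_a) != len(boundaries_b):
        raise ValueError("labelings must have the same number of boundaries")
    within = sum(abs(a - b) <= threshold
                 for a, b in zip(boundaries_a, boundaries_b))
    return 100.0 * within / len(boundaries_a)

# Example: three of four boundaries agree within 20 msec -> 75.0
print(boundary_agreement([0.10, 0.25, 0.40, 0.62],
                         [0.11, 0.24, 0.47, 0.63]))
```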

17 Phonetic Transcription
Is there a “correct” answer? No; it is inherently subjective, although semi-arbitrary guidelines can be imposed.
Is measuring accuracy meaningless? No; phonemes do have identity and order, although the details may be subjective.
Sometimes very precise (if semi-arbitrary) labels and boundaries are extremely important (e.g., concatenative text-to-speech databases).
What about getting a computer to generate transcriptions, or at least phonetic boundaries?
• Advantages: consistent, fast
• Disadvantages: not as accurate as human transcription; not robust to different speakers and environments

18 Phonetic Transcription
Automatic Phonetic Alignment (assume the phonetic identity is known). Two common methods:
(1) “Forced Alignment”: Use an existing speech recognizer, constrained to recognize only the “correct” phoneme sequence. The search process used by HMM recognizers returns both phoneme identity and location; the location information provides the boundaries.
(2) Dynamic Time Warping:
  (a) Use text-to-speech or utterance “templates” to generate the same speech content with known boundaries.
  (b) Warp the time scale of the reference (TTS or template) against the input speech to minimize spectral error.
  (c) Convert the known boundary locations to the original time scale.
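A minimal sketch of the DTW step in method (2) is shown below: it computes a frame-by-frame spectral distance between the reference and the input, finds the minimum-cost warping path, and maps the known reference boundary frames onto the input’s time scale. The Euclidean frame distance, the absence of slope constraints, and all names are simplifying assumptions for illustration, not the recognizer or TTS system used in the course.

```python
import numpy as np

def dtw_map_boundaries(ref_feats, inp_feats, ref_boundaries):
    """Map reference frame indices (known phoneme boundaries) onto the
    input's time scale via a minimum-cost DTW path.

    ref_feats, inp_feats: (n_frames, n_dims) spectral feature arrays.
    ref_boundaries: frame indices of boundaries in the reference.
    Returns the corresponding frame indices in the input.
    """
    ref_feats = np.asarray(ref_feats, dtype=float)
    inp_feats = np.asarray(inp_feats, dtype=float)
    n, m = len(ref_feats), len(inp_feats)

    # Local distance: Euclidean distance between each ref/input frame pair.
    dist = np.linalg.norm(ref_feats[:, None, :] - inp_feats[None, :, :], axis=2)

    # Accumulated cost with predecessors (i-1,j), (i,j-1), (i-1,j-1).
    cost = np.full((n, m), np.inf)
    cost[0, 0] = dist[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(cost[i - 1, j] if i > 0 else np.inf,
                            cost[i, j - 1] if j > 0 else np.inf,
                            cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            cost[i, j] = dist[i, j] + best_prev

    # Backtrack the warping path from (n-1, m-1) to (0, 0).
    path = [(n - 1, m - 1)]
    i, j = n - 1, m - 1
    while (i, j) != (0, 0):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        candidates = [(a, b) for a, b in candidates if a >= 0 and b >= 0]
        i, j = min(candidates, key=lambda ab: cost[ab])
        path.append((i, j))
    path.reverse()

    # For each reference boundary frame, take the first input frame that
    # the warping path pairs with it (step (c): convert the time scale).
    ref_to_inp = {}
    for ri, ii in path:
        ref_to_inp.setdefault(ri, ii)
    return [ref_to_inp[b] for b in ref_boundaries]
```

In practice, per-frame features such as MFCCs and path-slope constraints would be used; the point here is only steps (b) and (c), warping the reference against the input and converting the known boundaries.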

19 Phonetic Transcription: Accuracy of Automatic Alignment
Speaker-independent alignment using Forced Alignment:
[Figure: agreement (%) vs. threshold (msec)]

20 Phonetic Transcription
Comparing manual and automatic alignment of the TIMIT corpus:
• The automatic method still makes “stupid” mistakes.
• Manual labeling criteria are not rigorously defined.
• Performance degrades significantly in the presence of noise.
• It assumes the correct phonetic sequence is known…

