Investigating /l/ variation in English through forced alignment
Jiahong Yuan & Mark Liberman
University of Pennsylvania
Sept. 9, 2009


Yuan & Liberman: Interspeech

Introduction

English /l/ is traditionally classified into at least two allophones: "dark /l/", which appears in syllable rimes, and "clear /l/", which appears in syllable onsets.

Sproat and Fujimura (1993): the clear and dark allophones are not categorically distinct; a single phonological entity /l/ involves two gestures, a vocalic dorsal gesture and a consonantal apical gesture. The two gestures are inherently asynchronous: the vocalic gesture is attracted to the nucleus of the syllable, while the consonantal gesture is attracted to the margin ("gestural affinity"). In a syllable-final /l/, the tongue dorsum gesture shifts leftward toward the syllable nucleus, so the vocalic gesture precedes the consonantal, tongue-apex gesture. In a syllable-initial /l/, the reverse holds.

Introduction

Clear /l/ has a relatively high F2 and a low F1; dark /l/ has a lower F2 and a higher F1; intervocalic /l/s are intermediate between the clear and dark variants (Lehiste 1964).

An important piece of evidence for the "gestural affinity" proposal: Sproat and Fujimura (1993) found that the backness of pre-boundary intervocalic /l/ (in /i - ɪ/) is correlated with the duration of the pre-boundary rime. The /l/ in longer rimes is darker.

S&F (1993) devised a set of boundaries of varying strength to elicit different rime durations in laboratory speech:
Major intonation boundary: Beel, equate the actors. "|"
VP phrase boundary: Beel equates the actors. "V"
Compound-internal boundary: The beel-equator's amazing. "C"
'#' boundary: The beel-ing men are actors. "#"
No boundary: Mr Beelik wants actors. "%"

Introduction

Figure 1 in Sproat and Fujimura (1993): relation between F2-F1 (in Hz) and pre-boundary rime duration (in s) for (a) speaker CS and (b) speaker RS.

Introduction

Figure 4 in Sproat and Fujimura (1993): a schematic illustration of the effects of rime duration on pre-boundary post-nuclear /l/.

Introduction

Following Sproat and Fujimura (1993), Huffman (1997) showed that onset [l]s also vary in backness. That study suggested that the dorsum gesture for intervocalic onset [l]s (e.g., in below) may be shifted leftward in time relative to the apical gesture, which makes a dark(er) /l/.

The data in these studies contained only a few hundred tokens of /l/ in laboratory speech. "The relation of duration and backness can be complicated by differences in coarticulatory effects of neighboring vowels, or by speaker-specific constraints on absolute degree of backness." (Huffman 1997)
=> Our study utilizes a very large speech corpus.

Automatic formant tracking is error-prone, and measuring formants by hand is time-consuming.
=> We develop a new method to quantify /l/ backness without formant tracking.

Data

The SCOTUS corpus includes more than 50 years of oral arguments from the Supreme Court of the United States, nearly 9,000 hours in total. For this study, we used only the Justices' speech (25.5 hours) from the 2001-term arguments, along with the orthographic transcripts.

The phone boundaries were automatically aligned using the Penn Phonetics Lab forced aligner trained on the same data, with the HTK toolkit and the CMU Pronouncing Dictionary.

The dataset contains 21,706 tokens of /l/: 3,410 word-initial, 7,565 word-final, and 10,731 word-medial.

The Penn Phonetics Lab Forced Aligner

The aligner's acoustic models are GMM-based monophone HMMs on 39 PLP coefficients. The monophones include:
speech segments: /t/, /l/, /aa1/, /ih0/, … (ARPAbet)
non-speech segments: {sil} silence; {LG} laugh; {NS} noise; {BR} breath; {CG} cough; {LS} lip smack

{sp} is a "tee" model that has a direct transition from the entry to the exit node of the HMM; therefore, "sp" can have zero length. The tee model is used for handling possible inter-word silence.

The mean absolute difference between manual and automatically aligned phone boundaries on TIMIT is about 12 milliseconds.
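The 12 ms figure above is a mean absolute boundary error against manual TIMIT labels. A minimal sketch of that evaluation (the list-of-times input format is an assumption for illustration; real aligners emit TextGrid or HTK label files that would be parsed first):

```python
def mean_absolute_boundary_error(manual, automatic):
    """Mean absolute difference (in ms) between paired phone boundaries.

    `manual` and `automatic` are equal-length lists of boundary times in
    seconds for the same phone sequence (hypothetical simplified format).
    """
    assert len(manual) == len(automatic) and manual
    total = sum(abs(m - a) for m, a in zip(manual, automatic))
    return 1000 * total / len(manual)

# Toy example: three boundaries, off by 10, 15, and 5 ms
err = mean_absolute_boundary_error([0.10, 0.25, 0.40], [0.11, 0.235, 0.405])
print(round(err, 1))
```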

Forced Alignment Architecture

(Architecture diagram: word and phone boundaries located.)

Method

To measure the "darkness" of /l/ through forced alignment, we first split /l/ into two phones, L1 for the clear /l/ and L2 for the dark /l/, and retrained the acoustic models for the new phone set. In training, word-initial [l]s (e.g., like, please) were categorized as L1 (clear); word-final [l]s (e.g., full, felt) were categorized as L2 (dark). All other [l]s were ambiguous: they could be either L1 or L2. During each iteration of training, the 'real' pronunciations of the ambiguous [l]s were automatically determined, and the acoustic models of L1 and L2 were then updated.

The new acoustic models were tested on both the training data and on a data subset that had been set aside for testing. During the tests, all [l]s were treated as ambiguous: the aligner determined whether a given [l] was L1 or L2.
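The relabeling step above can be sketched as a dictionary transformation, a hypothetical helper (not the authors' actual tooling) that maps word-initial /l/ to L1, word-final /l/ to L2, and emits one pronunciation variant per L1/L2 choice for each ambiguous /l/:

```python
def relabel_l(phones):
    """Split /l/ ("L" in ARPAbet) into L1 (clear) and L2 (dark) variants.

    Word-initial L -> L1; word-final L -> L2; any other L is ambiguous,
    so both an L1 and an L2 variant are generated for it.
    Returns a list of pronunciation variants (lists of phone labels).
    """
    variants = [[]]
    last = len(phones) - 1
    for i, p in enumerate(phones):
        if p != "L":
            variants = [v + [p] for v in variants]
        elif i == 0:
            variants = [v + ["L1"] for v in variants]      # word-initial: clear
        elif i == last:
            variants = [v + ["L2"] for v in variants]      # word-final: dark
        else:
            variants = [v + [x] for v in variants for x in ("L1", "L2")]
    return variants

print(relabel_l(["L", "AY1", "K"]))        # "like": word-initial -> L1 only
print(relabel_l(["F", "UH1", "L"]))        # "full": word-final -> L2 only
print(relabel_l(["B", "IH0", "L", "OW1"])) # "below": ambiguous -> both variants
```

During training, the aligner's choice among the variants of an ambiguous token plays the role of determining its 'real' pronunciation.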

Method

An example of L1/L2 classification through forced alignment:

Method

If we use word-initial vs. word-final position as the gold standard, the accuracy of /l/ classification by forced alignment is 93.8% on the training data and 92.8% on the test data (confusion tables: gold standard by word position vs. classification by the aligner).

These results suggest that acoustic fit to the clear/dark allophone models in forced alignment is an accurate way to estimate the darkness of /l/.
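The accuracy figures are simple agreement rates between the aligner's L1/L2 labels and the word-position gold standard. A minimal sketch (the label lists are hypothetical toy data, not the corpus counts):

```python
def classification_accuracy(gold, predicted):
    """Fraction of /l/ tokens whose aligner label (L1/L2) matches the
    word-position gold standard (word-initial -> L1, word-final -> L2)."""
    assert len(gold) == len(predicted) and gold
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Toy example with hypothetical labels: 4 of 5 tokens agree
gold = ["L1", "L2", "L2", "L1", "L2"]
pred = ["L1", "L2", "L1", "L1", "L2"]
print(classification_accuracy(gold, pred))
```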

Method

To compute a metric for the degree of /l/-darkness, we therefore ran forced alignment twice: all [l]s were first aligned with the L1 model, and then with the L2 model. The difference in log-likelihood scores between the L2 and L1 alignments, the D score, measures the darkness of an [l]: the larger the D score, the darker the [l].

Histograms of the D scores (word-medial /l/s were classified as L1 or L2 by forced alignment):
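The D score itself is just a per-token difference of alignment log likelihoods from the two passes. A schematic version (the score dictionaries below are hypothetical stand-ins for values parsed from the aligner's output):

```python
def d_score(loglik_L2, loglik_L1):
    """Darkness score: log likelihood of the L2 (dark) alignment minus
    that of the L1 (clear) alignment. Positive -> the dark model fits better."""
    return loglik_L2 - loglik_L1

# Hypothetical per-token alignment scores from the two passes
scores_L1 = {"tok1": -412.7, "tok2": -390.2}
scores_L2 = {"tok1": -405.3, "tok2": -398.8}

for tok in sorted(scores_L1):
    d = d_score(scores_L2[tok], scores_L1[tok])
    label = "dark" if d > 0 else "clear"
    print(tok, round(d, 1), label)
```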

Results

To study the relation between rime duration and /l/-darkness, we use the [l]s that follow a primary-stress vowel (denoted '1'). Such [l]s can precede a word boundary ('#'), a consonant ('C'), or a non-stress vowel ('0') within the word.

Results

From the figure we can see that:
- The [l]s in longer rimes have larger D scores and hence are darker. This result is consistent with Sproat and Fujimura (1993).
- Rime duration being equal, the [l]s preceding a non-stress vowel (1_L_0) are less dark than the [l]s preceding a word boundary (1_L_#) or a consonant (1_L_C).
- The relation between rime duration and darkness for the /l/ in 1_L_C is non-linear: the /l/ reaches its peak of darkness when the rime (more precisely, the stressed vowel plus /l/) is about ms.
- The syllable-final [l]s are always dark (D > 0), even in very short rimes (less than 100 ms). This contradicts Sproat and Fujimura (1993)'s finding that the syllable-final /l/ in very short rimes can be as clear as the canonical clear /l/.

Results

To further examine the difference between clear and dark /l/, we compare the intervocalic syllable-final /l/ (1_L_0) with the intervocalic syllable-initial /l/ (0_L_1). The "rime" duration of the intervocalic syllable-initial /l/ (0_L_1) was measured as the duration of the non-stress vowel plus /l/, to be comparable to that of the intervocalic syllable-final /l/.

Results

From the figures we can see that:
- The intervocalic syllable-final [l]s have positive D scores, whereas the intervocalic syllable-initial [l]s have negative D scores.
- There is a positive correlation between darkness and rime duration (i.e., the duration of /l/ plus its preceding vowel) for the intervocalic syllable-final [l]s, but no correlation for the intervocalic syllable-initial [l]s.
- For the intervocalic syllable-final /l/, there is also a positive correlation between /l/ duration and darkness; no such correlation was found for the intervocalic syllable-initial /l/.
- These results point to a clear difference between the intervocalic syllable-final and syllable-initial /l/s.

Conclusions

We found a strong correlation between rime duration and /l/-darkness for syllable-final /l/. This result is consistent with Sproat and Fujimura (1993).

We found no correlation between /l/ duration and darkness for syllable-initial /l/. This result differs from Huffman (1997).

We found a clear difference in /l/ darkness between the 0_1 and 1_0 stress contexts, across all values of V+/l/ duration and of /l/ duration.

We found that the syllable-final /l/ preceding a non-stress vowel was less dark than that preceding a consonant or a word boundary. Also, there was a non-linear relationship between timing and quality for the /l/ preceding a consonant and following a primary-stress vowel: such an /l/ reaches its peak of darkness when the duration of the stressed vowel plus /l/ is about ms.

Further research is needed to confirm and explain these results.