Automatic Identification and Classification of Words using Phonetic and Prosodic Features
Vidya Mohan
Center for Speech and Language Engineering, The Johns Hopkins University

Introduction: Motivation
– Substitution errors in Automatic Speech Recognition (ASR) tasks could be reduced if finer-grained units than phonemes were used to capture changes in the waveform.
– A syllable structure, along with its constituent components (onset, nucleus, coda), could better represent acoustic/articulatory/prosodic features.
– By building better acoustic-phonetic models, in which features are weighted according to their discriminative ability, word error rate (WER) might be decreased.
– Current speech recognizers generally accord the same level of importance to the onset and the coda, as well as to accented and unaccented parts of speech.
– It would therefore be useful to understand more fully how words are related to their acoustic/articulatory/prosodic features.

Introduction: Statement of Proposal
– Build word models by identifying which phonetic and prosodic features are critical for recognition, and use them to create word templates that define each word.
– Evaluation: structured word identification and classification.
– Note:
  – Word identification would be used as a proof of concept.
  – Classification in confusion networks would allow integration with current ASR systems.

Candidate Features
– We are interested in features that preserve as much information from the speech waveform as possible.
– We already have feature detectors for most of these features.
– Features of interest: articulatory, acoustic, and prosodic.
– Articulatory features:
  – Manner: fricative, spirant, stop, nasal, flap, lateral, rhotic, glide, vowel, diphthong
  – Place: anterior, central, posterior, back, front, tense, labial, dental, alveolar, velar
  – Voicing
  – Lip rounding
– Prosodic features:
  – Prosody and stress accent: some syllables are more stressed than others.

Candidate Features (continued)
– Acoustic features:
  – Knowledge-based (formant) acoustic parameters
  – Neural firing rate features (rate-scale)
  – Energy level and modulation
  – Duration of words, syllables, and constituents
– Other features:
  – Number of syllables
  – Sensitivity to context (feature weighting)
– Summary: which of the above features are most likely to be preserved in all representations of the word?
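To make the feature inventory concrete, the sketch below shows one hypothetical way to bundle the candidate articulatory, acoustic, and prosodic features for a single syllable and group syllables into a word token. The field names and types are illustrative assumptions, not part of the original proposal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SyllableFeatures:
    """Hypothetical container for the candidate features of one syllable."""
    # Articulatory features
    manner: Optional[str] = None          # e.g. "fricative", "stop", "nasal", "glide"
    place: Optional[str] = None           # e.g. "alveolar", "velar", "labial"
    voiced: Optional[bool] = None
    lip_rounding: Optional[bool] = None
    # Acoustic features
    formants_hz: List[float] = field(default_factory=list)  # knowledge-based formant estimates
    energy_db: Optional[float] = None
    duration_ms: Optional[float] = None
    # Prosodic features
    stress_accent: Optional[float] = None  # e.g. detector score in [0, 1]


@dataclass
class WordObservation:
    """One spoken instance of a word, as a sequence of syllable feature bundles."""
    word: str
    syllables: List[SyllableFeatures] = field(default_factory=list)
```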

Word Template Representation
– Summary: a template consists of features selected and weighted according to their importance in the identification of the word.
– Thus the system might learn such a representation for a word, say, "center".
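As an illustration only (the original slide presented the learned template for "center" graphically), a word template could be stored as a set of per-syllable feature expectations with weights reflecting their importance. The structure and the toy weights below are a hypothetical sketch, not the system's actual format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class WordTemplate:
    """Hypothetical word template: each (syllable position, feature name) pair
    carries an expected value and a weight for its importance in identification."""
    word: str
    # key: (syllable_index, feature_name) -> (expected_value, weight)
    features: Dict[Tuple[int, str], Tuple[str, float]]


# A toy template for "center"; values and weights are invented for illustration.
center = WordTemplate(
    word="center",
    features={
        (0, "onset_manner"):   ("fricative", 0.9),  # /s/ onset carries high weight
        (0, "nucleus_stress"): ("accented",  0.8),  # stressed first syllable
        (1, "coda_manner"):    ("rhotic",    0.3),  # unstressed coda weighted less
    },
)
```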

Methodology
– Choosing the specific words to study:
  – Common confusable words from Switchboard
– Broadly a three-step process:
  – Feature generation
  – Word template creation
  – Evaluation
– Feature generation:
  – Use the phonetic classifiers to generate features for the whole corpus.
  – Stress accent detector for syllables: 79% accuracy on a manually transcribed Switchboard corpus was obtained at WS '04.

Methodology: Word Template Creation
– Features of a particular word are selected according to their information content.
– Mutual information between a word and a feature captures this notion of information content quantitatively:
  I(W; F) = H(W) − H(W|F) = Σ_w Σ_f p(w, f) log [ p(w|f) / p(w) ]
– The importance of a feature will be weighted according to its mutual information value, the accuracy of the classifier, and the stress-accent pattern of the syllable.
– Σ_i I(W; F_i) ≥ I(W; F_1, F_2, …, F_n): there will be dependencies between the features, so methods for their careful selection would be used.
– Features can also be selected by discriminative analysis, especially in cases with data sparsity.
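A minimal sketch of the mutual-information computation, assuming joint counts of (word, feature value) pairs are available from the feature-generation step; the toy counts below are invented purely to make the example runnable.

```python
import math
from collections import Counter


def mutual_information(joint_counts):
    """I(W; F) = sum_{w,f} p(w, f) * log( p(w|f) / p(w) ), from co-occurrence counts.

    joint_counts: dict mapping (word, feature_value) -> count
    """
    total = sum(joint_counts.values())
    p_w = Counter()
    p_f = Counter()
    for (w, f), c in joint_counts.items():
        p_w[w] += c / total
        p_f[f] += c / total

    mi = 0.0
    for (w, f), c in joint_counts.items():
        p_wf = c / total
        # p(w|f) / p(w) == p(w, f) / (p(w) * p(f))
        mi += p_wf * math.log(p_wf / (p_w[w] * p_f[f]))
    return mi


# Toy example: how informative is the onset-manner feature about the word identity?
counts = {
    ("center", "fricative"): 40, ("center", "stop"): 5,
    ("sender", "fricative"): 30, ("sender", "stop"): 2,
    ("tender", "stop"): 35,      ("tender", "fricative"): 3,
}
print(mutual_information(counts))  # in nats; higher means more discriminative
```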

Methodology: Evaluation Task
– The feature set will be evaluated on two tasks.
– Word identification:
  – Use the template to find the temporal bounds of a particular word in an utterance.
  – Evaluation metric: equal error rate (EER).
  – Utterances will be chosen from the TIMIT and Switchboard corpora.
  – Segment the utterances using a syllable classifier.
– Word classification:
  – Develop classifiers that could be used to classify word confusion pairs that exist in a lattice.
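As a reference for the word-identification metric, the sketch below estimates an equal error rate from detection scores for target trials (the word is present) and impostor trials (it is not). The scores are synthetic, and the simple threshold sweep is only one of several ways to compute EER.

```python
import numpy as np


def equal_error_rate(target_scores, impostor_scores):
    """Return the EER: the operating point where the false-rejection rate on
    target trials equals the false-acceptance rate on impostor trials."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        frr = np.mean(target_scores < t)      # targets wrongly rejected
        far = np.mean(impostor_scores >= t)   # impostors wrongly accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer


# Synthetic detection scores for illustration only.
rng = np.random.default_rng(0)
targets = rng.normal(1.0, 0.5, 200)    # higher scores where the word truly occurs
impostors = rng.normal(0.0, 0.5, 200)  # lower scores elsewhere
print(f"EER ~ {equal_error_rate(targets, impostors):.3f}")
```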

Summary
– Collaborators:
  – Mark Hasegawa-Johnson, University of Illinois
  – Kemal Sonmez, SRI
  – Steve Greenberg, University of California
– Johns Hopkins supervisors:
  – Sanjeev Khudanpur
  – Izhak Shafran
– The objective of this proposal is to obtain a minimal-set representation of a word with respect to its acoustic/articulatory/prosodic features, using mutual information to choose the features.
– Phonetic classifiers developed and tuned this summer will be used for this purpose.
– Initial word-identification experiments will serve as a proof of concept.
– Given time and the efficacy of the method, it will be integrated with the current LVCSR system to distinguish between confusable pairs of words.
– The project will hopefully provide interesting insights into what the key features of a word are.