Phonetic features in ASR
Short talk (Kurzvortrag), Institut für Kommunikationsforschung und Phonetik, Bonn, 17 June 1999
Jacques Koreman
Institute of Phonetics, University of the Saarland, Saarbrücken

ICSLP’98
Do phonetic features help to improve consonant identification in ASR?
Jacques Koreman, Bistra Andreeva, William J. Barry
Institute of Phonetics, University of the Saarland, Saarbrücken, Germany

INTRODUCTION
Variation in the acoustic signal is not a problem for human perception, but it causes inhomogeneity in the phone models for ASR, leading to poor consonant identification. We should "directly target the linguistic information in the signal and ... minimize other extra-linguistic information that may yield large speech variability" (Bitar & Espy-Wilson 1995a, p. 1411). Bitar & Espy-Wilson do this by using a knowledge-based event-seeking approach for extracting phonetic features from the microphone signal on the basis of acoustic cues. We propose an acoustic-phonetic mapping procedure on the basis of a Kohonen network.

DATA: Texts
English, German, Italian and Dutch texts from the EUROM 0 database, read by 2 male + 2 female speakers per language.

DATA: Signals
16 kHz microphone signals
Hamming window: 15 ms; step size: 5 ms; pre-emphasis: 0.97
12 mel-frequency cepstral coefficients (MFCCs), energy, and the corresponding delta parameters
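As an illustration, the front end described above can be reproduced roughly as follows. This is a minimal sketch using librosa, not the original 1999 implementation; the file name "utterance.wav" is hypothetical, and only the window, step size, pre-emphasis and coefficient counts are taken from the slide.

```python
import librosa
import numpy as np

# 16 kHz microphone signal (hypothetical file name)
y, sr = librosa.load("utterance.wav", sr=16000)
y = librosa.effects.preemphasis(y, coef=0.97)       # pre-emphasis: 0.97

win = int(0.015 * sr)   # 15 ms Hamming window -> 240 samples
hop = int(0.005 * sr)   # 5 ms step size       -> 80 samples

# 12 MFCCs per frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=512,
                            win_length=win, hop_length=hop,
                            window="hamming")

# frame energy (log of RMS), appended as a 13th coefficient
energy = librosa.feature.rms(y=y, frame_length=win, hop_length=hop)
feats = np.vstack([mfcc, np.log(energy + 1e-10)])

# corresponding delta parameters -> 26 values per frame
feats = np.vstack([feats, librosa.feature.delta(feats)])
# feats.shape == (26, n_frames)
```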

SYSTEM ARCHITECTURE
[Diagram: in the BASELINE system, the MFCCs + energy and delta parameters feed directly into hidden Markov modelling of the consonants; in the mapping system, a Kohonen network first maps these acoustic parameters onto phonetic features, which are then modelled by the hidden Markov models. Both systems share the consonant language model and lexicon.]
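To make the mapping component concrete, below is a minimal sketch of a Kohonen network (self-organising map) trained on acoustic frame vectors. It is an illustrative NumPy implementation under generic assumptions, not the authors' network; grid size, learning rate and neighbourhood schedule are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0):
    """Train a 2-D Kohonen map on acoustic feature vectors (rows of `data`)."""
    h, w = grid
    weights = rng.normal(size=(h * w, data.shape[1]))
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    n_steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # best-matching unit: closest weight vector to the input frame
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            lr = lr0 * np.exp(-t / n_steps)          # decaying learning rate
            sigma = sigma0 * np.exp(-t / n_steps)    # shrinking neighbourhood
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            nbh = np.exp(-d2 / (2 * sigma ** 2))     # Gaussian neighbourhood
            weights += lr * nbh[:, None] * (x - weights)
            t += 1
    return weights
```

In a setup like this, each map unit would afterwards be labelled with a phonetic feature vector (manner, place, voicing), and at recognition time every incoming frame is replaced by the label of its best-matching unit before hidden Markov modelling.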

CONFUSIONS: BASELINE
[Figure: consonant confusion matrix for the baseline system, with confusions coded by phonetic categories (manner, place, voicing): 1 category wrong, 2 categories wrong, 3 categories wrong. Figure by Attilio Erriquez.]

CONFUSIONS: MAPPING
[Figure: consonant confusion matrix for the mapping system, with confusions coded by phonetic categories (manner, place, voicing): 1 category wrong, 2 categories wrong, 3 categories wrong. Figure by Attilio Erriquez.]
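The "categories wrong" coding used in these figures can be computed directly from a feature table. The sketch below assumes a hypothetical feature table with illustrative entries; the actual feature assignments of the study are not reproduced here.

```python
# Hypothetical feature table (manner, place, voicing); entries are
# illustrative, not the authors' feature set.
FEATURES = {
    "p": ("plosive", "labial",   "voiceless"),
    "b": ("plosive", "labial",   "voiced"),
    "m": ("nasal",   "labial",   "voiced"),
    "n": ("nasal",   "alveolar", "voiced"),
}

def categories_wrong(produced, identified):
    """Count how many of manner, place and voicing differ (0-3)."""
    return sum(a != b for a, b in zip(FEATURES[produced], FEATURES[identified]))

assert categories_wrong("p", "b") == 1   # voicing only
assert categories_wrong("p", "n") == 3   # manner, place and voicing
```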

ACIS = total of all correct identification percentages / number of consonants to be identified

Baseline system: 31.22 %
Mapping system: 68.47 %

The Average Correct Identification Score compensates for the number of occurrences in the database, giving each consonant equal weight. It is the total of all percentage numbers along the diagonal of the confusion matrix divided by the number of consonants.
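A minimal sketch of this score, assuming the confusion matrix has already been row-normalised to percentages (so each consonant gets equal weight regardless of its frequency in the database):

```python
import numpy as np

def acis(confusions):
    """Average Correct Identification Score.

    `confusions[i, j]` is the percentage of consonant i identified as j,
    with each row summing to 100. ACIS is the mean of the diagonal.
    """
    return np.trace(confusions) / confusions.shape[0]
```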

BASELINE SYSTEM
Good identification of language-specific phones (reason: acoustic homogeneity); poor identification of other phones.
[Table: % correct (baseline vs. mapping) for language-specific consonants in German, Italian, English and Dutch; the consonant symbols and percentages were lost in transcription.]

MAPPING SYSTEM
Good identification, also of acoustically variable phones (reason: variable acoustic parameters are mapped onto homogeneous, distinctive phonetic features).
[Table: % correct (baseline vs. mapping) per consonant: h (English, German, Dutch); k, b, d, t, p (all languages); etc. The percentages were lost in transcription.]

APMS = sum of (misidentification percentage × number of misidentified phonetic categories) / sum of the misidentification percentages

Baseline system: 1.79
Mapping system: 1.57

The Average Phonetic Misidentification Score gives a measure of the severity of the consonant confusions in terms of phonetic features. The numerator is the sum, over all non-diagonal cells of the confusion matrix, of the misidentification percentage times the number of misidentified phonetic categories (manner, place and voicing). It is divided by the total of all percentage numbers in the non-diagonal cells.
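A minimal sketch of this score, reusing the confusion matrix from the ACIS example and a matrix of category counts such as the `categories_wrong` function above would produce:

```python
import numpy as np

def apms(confusions, n_wrong):
    """Average Phonetic Misidentification Score.

    `confusions[i, j]` is the percentage of consonant i identified as j;
    `n_wrong[i, j]` counts how many of the categories manner, place and
    voicing differ between i and j (0 on the diagonal).
    """
    off = ~np.eye(confusions.shape[0], dtype=bool)   # non-diagonal cells only
    return (confusions[off] * n_wrong[off]).sum() / confusions[off].sum()
```

Lower values mean that the confusions which do occur are phonetically less severe, which is the sense in which the mapping system (1.57) improves on the baseline (1.79).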

APMS
After mapping, an incorrectly identified consonant is on average closer to the phonetic identity of the consonant which was produced. Reason: the Kohonen network is able to extract linguistically distinctive phonetic features, which allow for a better separation of the consonants in hidden Markov modelling.

CONSONANT CONFUSIONS

BASELINE (consonant → identified as):
r → r (84%), [?] (5%), l (4%)
j → j (94%), z (6%)
m → m (63%), n (11%), [?] (10%), r (6%)
n → n (26%), m (21%), [?] (20%), r (6%)
[?] → [?] (46%), n (23%), m (15%), [?] (8%)

MAPPING (consonant → identified as):
r → g (61%), [?] (16%), [?] (13%)
j → [?] (53%), j (18%), [?] (12%), [?] (6%), r (6%), [?] (6%)
m → [?] (23%), [?] (18%), m (16%), [?] (13%), [?] (10%)
n → [?] (28%), [?] (18%), [?] (16%), [?] (12%), m (8%), [?] (8%)
[?] → [?] (42%), [?] (15%), [?] (15%), m (8%), [?] (8%), [?] (8%)

([?] marks IPA symbols lost in transcription.)

CONCLUSIONS
Acoustic-phonetic mapping helps to address linguistically relevant information in the speech signal, ignoring extra-linguistic sources of variation. The advantages of mapping are reflected in the two measures which we have presented: ACIS shows that mapping leads to better consonant identification rates for all except a few of the language-specific consonants. The improvement can be put down to the system's ability to map acoustically variable consonant realisations to more homogeneous phonetic feature vectors.

CONCLUSIONS
Acoustic-phonetic mapping helps to address linguistically relevant information in the speech signal, ignoring extra-linguistic sources of variation. The advantages of mapping are reflected in the two measures which we have presented: APMS shows that the confusions which occur in the mapping experiment are less severe than in the baseline experiment from a phonetic point of view. There are fewer confusions on the phonetic dimensions manner, place and voicing when mapping is applied, because the system focuses on distinctive information in the acoustic signals.

SUMMARY
Acoustic-phonetic mapping leads to fewer and phonetically less severe consonant confusions.

THE END
THANK YOU FOR YOUR ATTENTION!