2003 NIST LID Evaluation
On Use of Temporal Dynamics of Speech for Language Identification
Andre Adami, Pavel Matejka, Petr Schwarz, Hynek Hermansky
AGA, 4/28/2003


Slide 1: On Use of Temporal Dynamics of Speech for Language Identification
Andre Adami, Pavel Matejka, Petr Schwarz, Hynek Hermansky
Anthropic Signal Processing Group

Slide 2: OGI-4 – ASP System

Goal
– Convert the speech signal into a sequence of discrete sub-word units that can characterize the language

Approach
– Use temporal trajectories of speech parameters to obtain the sequence of units
– Model the sequence of discrete sub-word units with an N-gram language model

Sub-word units
– TRAP-derived American English phonemes
– Symbols derived from prosodic-cue dynamics
– Phonemes from OGI-LID

[Block diagram: speech signal -> segmentation -> units -> target language model (+) and background model (-) -> score]
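The scoring scheme on this slide, a unit sequence evaluated under a target-language model against a background model, can be sketched with an add-alpha smoothed N-gram LM. This is a minimal sketch with bigrams for brevity (the system itself uses 3-grams), and all function names are illustrative:

```python
import math
from collections import defaultdict

def train_bigram(sequences, vocab, alpha=1.0):
    """Add-alpha smoothed bigram model over discrete sub-word units.
    (Sketch: the slides use 3-gram models; bigrams keep the example short.)"""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    model = {}
    for prev in vocab:
        total = sum(counts[prev].values()) + alpha * len(vocab)
        model[prev] = {cur: (counts[prev][cur] + alpha) / total for cur in vocab}
    return model

def loglik(model, seq):
    """Log-likelihood of a unit sequence under the model."""
    return sum(math.log(model[p][c]) for p, c in zip(seq, seq[1:]))

def lid_score(target_lm, background_lm, seq):
    """LID score as on the diagram: target (+) minus background (-) log-likelihood."""
    return loglik(target_lm, seq) - loglik(background_lm, seq)
```

A positive score indicates the sequence is better explained by the target-language model than by the background model.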

Slide 3: American English Phoneme Recognition

Phoneme set
– 39 American English phonemes (CMU-like)

Phoneme recognizer
– Trained on NTIMIT
– TRAP (Temporal Patterns) based
– Speech segments for training obtained from energy-based speech/nonspeech segmentation

Modeling
– 3-gram language model

[Diagram: time-frequency plane -> short-term analysis -> temporal-patterns classifier -> phone]

Slide 4: English Phoneme System

Temporal trajectories
– 23 mel-scale frequency bands
– 1 s segments of the log-energy trajectory

Band classifiers
– MLP (101x300x39)
– Hidden-unit nonlinearities: sigmoids
– Output nonlinearities: softmax

Merger
– MLP (897x300x39)

Viterbi search
– Penalty factor tuned so that deletions = insertions

Training
– NTIMIT

[Diagram: band classifiers 1..N over the time-frequency plane -> merger -> Viterbi search]
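The band-classifier/merger pipeline on this slide can be sketched as a NumPy forward pass. Random weights stand in for networks trained on NTIMIT; only the layer sizes, sigmoid hidden units, and softmax outputs follow the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: sigmoid hidden units, softmax outputs."""
    return softmax(W2 @ sigmoid(W1 @ x + b1) + b2)

# Sizes from the slide: 23 bands, 101-frame (1 s) trajectories,
# 300 hidden units, 39 phoneme classes; merger input 23*39 = 897.
n_bands, traj_len, hidden, n_phones = 23, 101, 300, 39

# Toy weights; the real networks are trained on NTIMIT.
band_nets = [
    (rng.standard_normal((hidden, traj_len)) * 0.01, np.zeros(hidden),
     rng.standard_normal((n_phones, hidden)) * 0.01, np.zeros(n_phones))
    for _ in range(n_bands)
]
merger = (rng.standard_normal((hidden, n_bands * n_phones)) * 0.01, np.zeros(hidden),
          rng.standard_normal((n_phones, hidden)) * 0.01, np.zeros(n_phones))

# One second of log-energy trajectories in 23 mel bands.
trajectories = rng.standard_normal((n_bands, traj_len))
band_posteriors = np.concatenate(
    [mlp_forward(trajectories[b], *band_nets[b]) for b in range(n_bands)])
phone_posterior = mlp_forward(band_posteriors, *merger)  # 39 phoneme posteriors
```

In the full system the per-frame merger posteriors feed the Viterbi search that produces the phoneme string.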

Slide 5: Prosodic Cues Dynamics

Technique
– Use prosodic cues (intensity and pitch trajectories) to derive the sub-word units

Approach
– Segment the speech signal at the inflection points of the trajectories (zero-crossings of the derivative) and at the onsets and offsets of voicing
– Label each segment by the direction of change of the parameters within the segment

Class | Temporal trajectory description
  1   | rising f0 and rising energy
  2   | rising f0 and falling energy
  3   | falling f0 and rising energy
  4   | falling f0 and falling energy
  5   | unvoiced segment
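The segmentation-and-labeling rule above can be sketched as follows. This is a simplification, assuming pitch and energy arrive as per-frame arrays with f0 = 0 marking unvoiced frames; the real system operates on estimated prosodic trajectories:

```python
import numpy as np

def prosodic_labels(f0, energy):
    """Segment at voicing onsets/offsets and at derivative zero-crossings of
    the f0 and energy trajectories, then label each segment 1-5 per the table."""
    f0 = np.asarray(f0, float)
    energy = np.asarray(energy, float)
    voiced = f0 > 0
    d_f0, d_en = np.diff(f0), np.diff(energy)

    # Boundaries: voicing changes plus sign changes of either derivative.
    bounds = {0, len(f0)}
    for i in range(1, len(f0)):
        if voiced[i] != voiced[i - 1]:
            bounds.add(i)
    for i in range(1, len(d_f0)):
        if (np.sign(d_f0[i]) != np.sign(d_f0[i - 1])
                or np.sign(d_en[i]) != np.sign(d_en[i - 1])):
            bounds.add(i)
    bounds = sorted(bounds)

    labels = []
    for s, e in zip(bounds, bounds[1:]):
        if not voiced[s]:
            labels.append(5)  # unvoiced segment
        else:
            rising_f0 = bool(f0[e - 1] >= f0[s])
            rising_en = bool(energy[e - 1] >= energy[s])
            labels.append({(True, True): 1, (True, False): 2,
                           (False, True): 3, (False, False): 4}[(rising_f0, rising_en)])
    return labels
```

Each utterance thus becomes a short string of class symbols that the N-gram language model can score.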

Slide 6: Prosodic Cues Dynamics (cont.)

Duration
– The duration of each segment is labeled "short" (less than 8 frames) or "long" (10 symbols in total)

Broad phonetic category (BFC)
– Finer labeling achieved by estimating the broad phonetic category (vowel+diphthong+glide, schwa, stop, fricative, flap, nasal, and silence) coinciding with each prosodic segment (61 symbols in total)
– BFC TRAPs trained on NTIMIT are used to derive the broad phonetic categories

Modeling
– 3-gram language model

BFC TRAPs setup
– Input temporal vectors: 15 bark-scale frequency-band energies, 1 s segments of the log-energy trajectory, mean- and variance-normalized, dimension reduction by DCT
– Band classifiers: MLP (15x100x7), sigmoid hidden units, softmax output units
– Merger: MLP (105x100x7)
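The normalization and DCT dimension-reduction step in the BFC TRAPs setup can be sketched as below. The coefficient count is an assumption (the 15-input band classifiers on the slide suggest roughly 15 coefficients per band):

```python
import numpy as np

def dct_features(traj, n_coef=15):
    """Mean/variance-normalize a temporal trajectory and keep the first
    DCT-II coefficients (sketch of the 'dimension reduction: DCT' step;
    the coefficient count here is an assumption)."""
    x = np.asarray(traj, float)
    x = (x - x.mean()) / (x.std() + 1e-9)
    n = len(x)
    k = np.arange(n_coef)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))  # DCT-II basis
    return basis @ x
```

The low-order DCT coefficients compactly capture the smooth shape of the 1 s trajectory while discarding frame-level detail.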

Slide 7: OGI-4 – ASP System (evaluation results)

[DET plot; the individual subsystems score EER_30s = 41.4%, 32.1%, and 19.3%, and the combined evaluation system reaches EER_30s = 17.8%]
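The EER_30s figures on this and the following slides are equal error rates: the operating point on the DET curve where the miss rate equals the false-alarm rate. A minimal sketch of computing it from target and non-target trial scores:

```python
def eer(target_scores, nontarget_scores):
    """Equal error rate: sweep thresholds and return the average of miss and
    false-alarm rates at the point where the two are closest."""
    best = None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        miss = sum(s < t for s in target_scores) / len(target_scores)
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if best is None or abs(miss - fa) < best[0]:
            best = (abs(miss - fa), (miss + fa) / 2)
    return best[1]
```

Perfectly separated score distributions give an EER of 0; overlapping distributions give the crossing point of the two error curves.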

Slide 8: OGI-4 – ASP System (evaluation results, cont.)

[DET plot; combined system EER_30s = 17.8%]

Slide 9: Post-Evaluation – Phoneme System

– Speech/nonspeech segmentation using the silence classes from TRAP-based classification
– TRAPs classifier
  – Temporal trajectory duration: 400 ms
  – 3 bands as the input trajectory for each band classifier, to exploit the correlation between adjacent bands; the trajectories of the 3 bands are projected onto a DCT basis (20 coefficients)
  – Viterbi search tuned for language identification
– Training data: CallFriend training and development sets

Slide 10: Post-Evaluation – Phoneme System (results)

[DET plot; EER_30s = 12.7%, a 34% relative improvement]

Slide 11: Post-Evaluation – Prosodic Cues System

– No energy-based segmentation: unvoiced segments longer than 2 seconds are treated as non-speech
– No broad-phonetic-category labeling: only the rate of change plus the quantized duration (10 tokens)
– Training data: CallFriend training and development sets

Slide 12: Post-Evaluation – Prosodic Cues System (results)

[DET plot; EER_30s = 22.2%, a 30% relative improvement]

Slide 13: Fusion – 30 s Condition

Fusing the scores from the prosodic cues system
– with TRAP-derived phonemes: EER_30s = 10.5% (17% relative improvement)
– with OGI-LID-derived phonemes: EER_30s = 6.6% (14% relative improvement)

TRAP-derived phoneme system fused with OGI-LID
– EER_30s = 6.2% (19% relative improvement)

All three systems fused
– EER_30s = 5.7% (26% relative improvement)
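The slide reports fusion results but not the combination rule. As an assumption, a weighted sum of per-system scores is the usual minimal approach:

```python
def fuse_scores(system_scores, weights=None):
    """Combine per-system LID scores by a weighted sum (an assumed fusion
    rule; the slide does not state the actual combination method).
    Defaults to equal-weight averaging."""
    if weights is None:
        weights = [1.0 / len(system_scores)] * len(system_scores)
    return sum(w * s for w, s in zip(weights, system_scores))
```

In practice the weights would be tuned on held-out data to minimize the fused EER.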

Slide 14: Conclusions

– Sequences of discrete symbols derived from speech dynamics provide useful information for characterizing the language
– Two techniques for deriving the symbol sequences were investigated:
  – segmentation and labeling based on prosodic cues
  – segmentation and labeling based on TRAP-derived phonetic labels
– The introduced techniques combine well with each other as well as with more conventional language-ID techniques