ZRE 2009/10 introductory talk. Honza Černocký, Brno University of Technology, Czech Republic. 8.2.2010.

Agenda: Where we are and who we are; Needle in a haystack; Simple example - Gender ID; Speaker recognition; Language identification; Keyword spotting; CZ projects.

Where is Brno?

The place: Brno University of Technology – 2nd largest technical university in the Czech Republic (~2500 staff, ~18000 students). Faculty of Information Technology (FIT) – its youngest faculty (created in January 2002). Reconstruction of the campus finished in Nov 2007 – now a beautiful place marrying an old Carthusian monastery and modern buildings.

Department of Computer Graphics and Multimedia: video/image processing, speech processing, knowledge engineering and natural language processing, medical visualization and 3D modeling.

University research group (faculty, researchers, students and support staff in 2009). Also provides education within the Dpt. of Computer Graphics and Multimedia. Cooperating with EU and US universities and companies. Supported by EC, US and national projects. Goal: high-profile research in speech theory and algorithms.

Key people. Directors: Dr. Jan “Honza” Černocký – executive direction; Prof. Hynek Heřmanský (Johns Hopkins University, USA) – advisor and guru; Dr. Lukáš Burget – scientific director. Sub-group leaders: Petr Schwarz – phonemes, implementation; Pavel “Pája” Matějka – SpeakerID, LanguageID; Pavel Smrž – NLP and semantic Web.

The steel and soft… Steel: 3 IBM Blade centers with 42 IBM Blade servers, each with 2 dual-core CPUs; another ~120 computers in classrooms; >16 TB of disk space; professional and friendly administration. Soft: common tools – HTK, Matlab, QuickNet, SGE; own SW – STK, BS-CORE, BS-API.

Funding: faculty (faculty members and faculty-wide research funds); EU projects (FP[4567]) – past: SpeechDat-E, SpeeCon, M4, AMI, CareTaker; running: AMIDA, MOBIO, weKnowIt; US funding – Air Force's EOARD; local funding agencies – Grant Agency of the Czech Republic, Ministry of Education, Ministry of Trade and Commerce; Czech “force” ministries – Defense, Interior; industrial contracts; spin-off – Phonexia, Ltd.

Phonexia Ltd. Company created in 2006 by 6 members, closely cooperating with the research group. Key people: Dr. Pavel Matějka, CEO; Dr. Petr Schwarz, CTO; Igor Szöke, CFO; Dr. Lukáš Burget, research coordinator; Dr. Jan Černocký, university relations; Tomáš Kašpárek, hardware architect. Phonexia's goal: bringing mature technologies to the market, especially in the security/defense sector.

Agenda: Where we are and who we are; Needle in a haystack; Simple example - Gender ID; Speaker recognition; Language identification; Keyword spotting; CZ projects.

Needle in a haystack. Speech is the most important modality of human-human communication (~80% of information)… and criminals and terrorists also communicate by speech. Speech is easy to acquire in both civilian and intelligence/defense scenarios; the difficult part is finding what we are looking for. This is typically done by human experts, but always count on limited personnel, limited budget, not enough languages spoken, and insufficient security clearances. Speech processing technologies are not almighty, but they can help to narrow the search space.

“Speech recognition”. GOAL: automatically extract information transmitted in the speech signal. From the same speech input, different technologies extract different information: Speaker Recognition – speaker name (e.g. John Doe); Gender Recognition – gender (male or female); Language Recognition – language (English/German/…); Speech Recognition – what was said (“Hallo Crete!”); Keyword spotting – e.g. “Crete” spotted.

Focus on evaluations. “I'm better than the other guys” is not relevant unless everyone uses the same data and evaluation metrics. NIST, a US government agency, runs regular benchmark campaigns – evaluations – of speech technologies. All participants get the same data and the same limited time to process them and send results to NIST, which makes the comparison objective. The results and details of the systems are discussed at NIST workshops. The group has participated extensively in NIST evaluations: Transcription 2005, 2006, 2007, 2009; Language ID 2003, 2005, 2007, 2009; Speaker Verification 1998, 1999, 2006, 2008; Spoken Term Detection 2006. Why are we doing this? We believe that evaluations really advance the state of the art, and we do not want to waste our time on useless work…

What are we really doing? Following the recipe from any pattern-recognition book:

And what is the result? Something you've probably already seen: input → feature extraction → evaluation of probabilities or likelihoods (using models) → “decoding” → decision.

Agenda: Where we are and who we are; Needle in a haystack; Simple example - Gender ID; Speaker recognition; Language identification; Keyword spotting; CZ projects.

The simplest example… GID (Gender Identification). The easiest speech application to deploy… and the most accurate (>96% on challenging channels). Limits the search space by 50%.

So how is Gender-ID done? MFCC features are extracted from the input, GMM likelihoods are evaluated against Gaussian Mixture Models for males and females, and a decision is made: male/female.

Features – Mel Frequency Cepstral Coefficients (MFCC). The signal is not stationary, and hearing is not linear.
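These two observations motivate the MFCC recipe: short overlapping frames deal with non-stationarity, and a mel-spaced filterbank mimics the nonlinear frequency resolution of hearing. The following is a minimal NumPy sketch of a typical MFCC computation; the 8 kHz sampling rate, 25 ms/10 ms framing and 23 filters are assumed defaults for illustration, not the exact front-end used in these systems.

```python
# Minimal MFCC sketch (illustrative; parameters are assumptions, not this system's front-end).
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_len=200, frame_shift=80, n_fft=256,
         n_filters=23, n_ceps=13):
    # 1) short-time analysis: overlapping, windowed frames (handles non-stationarity)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2) magnitude spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, n_fft))

    # 3) triangular mel filterbank (nonlinear frequency axis, as in hearing)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    log_energies = np.log(spec @ fbank.T + 1e-10)

    # 4) DCT decorrelates the log filterbank energies -> cepstral coefficients
    #    (delta features and cepstral mean normalization are omitted here)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```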

The evaluation of likelihoods: GMM (Gaussian Mixture Model).
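As a hedged illustration, the per-file log-likelihood of a diagonal-covariance GMM can be evaluated as below; training of the male and female models (typically EM) is assumed to have happened elsewhere, and the function name is mine, not the talk's.

```python
# Diagonal-covariance GMM log-likelihood evaluation (a sketch, not a production scorer).
import numpy as np
from scipy.special import logsumexp

def gmm_loglik(X, weights, means, variances):
    """X: (T, D) feature frames; weights: (M,); means, variances: (M, D).
    Returns log p(X | GMM), with frames assumed independent."""
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)     # (M,)
    diff = X[:, None, :] - means[None, :, :]                              # (T, M, D)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)     # (T, M)
    # log sum_m w_m N(x_t | mu_m, sigma_m^2), summed over frames t
    return float(np.sum(logsumexp(np.log(weights) + log_comp, axis=1)))
```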

The decision – Bayes rule. GID DEMO.
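The final decision follows from Bayes' rule: pick the class with the higher posterior. A minimal sketch, assuming equal priors and reusing the hypothetical gmm_loglik() helper from the previous example:

```python
import numpy as np

def gender_id(mfcc_frames, male_gmm, female_gmm,
              log_prior_male=np.log(0.5), log_prior_female=np.log(0.5)):
    """Bayes decision: compare (log) posteriors of the two classes.
    male_gmm / female_gmm are (weights, means, variances) tuples for gmm_loglik()."""
    score_m = gmm_loglik(mfcc_frames, *male_gmm) + log_prior_male
    score_f = gmm_loglik(mfcc_frames, *female_gmm) + log_prior_female
    return "male" if score_m > score_f else "female"
```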

Agenda: Where we are and who we are; Needle in a haystack; Simple example - Gender ID; Speaker recognition; Language identification; Keyword spotting; CZ projects.

Speaker recognition aims at recognizing “who said it”. In speaker identification, the task is to assign a speech signal to one out of N speakers. In speaker verification, the claimed identity is known and the question to be answered is “was the speaker really Mr. XYZ, or an impostor?” The standard pipeline: front-end processing of the enrollment and test speech, a target model adapted from a background model, scoring of the test data against both models, and score normalization.
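A minimal sketch of the classical GMM-UBM version of this pipeline is shown below: relevance-MAP adaptation of the UBM means gives the target model, and the score is a per-frame log-likelihood ratio against the UBM. This is an assumed textbook simplification that reuses the hypothetical gmm_loglik() helper from the Gender-ID example, not the BUT system described later.

```python
import numpy as np
from scipy.special import logsumexp

def map_adapt_means(enroll_frames, ubm, relevance=16.0):
    """Relevance-MAP adaptation of the UBM means toward the enrollment data
    (means only, a common simplification); ubm = (weights, means, variances)."""
    w, mu, var = ubm
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)
    diff = enroll_frames[:, None, :] - mu[None, :, :]
    log_comp = np.log(w) + log_norm - 0.5 * np.sum(diff ** 2 / var, axis=2)   # (T, M)
    gamma = np.exp(log_comp - logsumexp(log_comp, axis=1, keepdims=True))     # responsibilities
    n = gamma.sum(axis=0)                                                     # soft counts (M,)
    ex = gamma.T @ enroll_frames / np.maximum(n, 1e-10)[:, None]              # (M, D)
    alpha = (n / (n + relevance))[:, None]
    return w, alpha * ex + (1.0 - alpha) * mu, var                            # target model

def verification_score(test_frames, target_gmm, ubm):
    """Per-frame log-likelihood ratio: target model vs. background model (UBM)."""
    return (gmm_loglik(test_frames, *target_gmm)
            - gmm_loglik(test_frames, *ubm)) / len(test_frames)
```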

Example: a single Gaussian model with 2D features. The figure shows the UBM and a target speaker model, with a direction of high inter-speaker variability and a direction of high inter-session variability; the session variability is the unwanted (“bad”) one.

And what to do about it: for recognition, move both models (the UBM and the target speaker model) along the high inter-session variability direction(s) so that they fit the test data well.

Research achievements. Key thing: Joint Factor Analysis (JFA) decomposes the models into channel and speaker sub-spaces, coping with the unwanted variability. At the same time, it provides a compact representation of speakers, allowing extremely fast scoring of speech files. Speaker search DEMO. NIST SRE 2006: BUT and the STBU consortium; later NIST SREs: confirming the leading position.
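For reference, the standard JFA decomposition of the speaker- and session-dependent GMM mean supervector, in the notation of the JFA literature rather than of these slides, is:

```latex
M = m + V\,y + U\,x + D\,z
```

Here m is the UBM mean supervector, V and D span the speaker subspace (with speaker factors y and z), and U spans the channel/session subspace (with channel factors x), i.e. the unwanted inter-session direction from the previous slides.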

Agenda: Where we are and who we are; Needle in a haystack; Simple example - Gender ID; Speaker recognition; Language identification; Keyword spotting; CZ projects.

The goal of language ID (LID): determine the language of a speech segment.

Two main approaches to LID: acoustic – Gaussian Mixture Model; phonotactic – phone recognition followed by a language model.

Acoustic approach: good for short speech segments and dialect recognition; relies on the sounds themselves; done with discriminatively trained GMMs with channel compensation.

Phonotactic approach: good for longer speech segments; robust against dialects within one language; eliminates speech characteristics of the speaker's native language. Based on a high-quality NN-based phone recognizer producing strings or lattices.

Phonotactic modeling – example. Token counts differ across languages: German – “und” 25, “and” 3, “the” 0, …; English – “und” 1, “and” 32, “the” …; test segment – “und” 5, “and” 0, “the” 1, … These counts are modeled with: N-gram language models (discounting, backoff); binary decision trees (adaptation from a UBM); Support Vector Machines (vectors of counts).
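A toy sketch of the n-gram flavour of phonotactic scoring follows; working on 1-best phone strings instead of lattices and using add-alpha smoothing in place of proper discounting and backoff are assumed simplifications for illustration.

```python
from collections import Counter
import math

def ngram_counts(phones, n=3):
    """Counts of overlapping phone n-grams in a decoded phone string."""
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))

def train_phonotactic_lm(train_strings, n=3, alpha=1.0):
    """Return a function giving a smoothed log-probability for any n-gram."""
    counts = Counter()
    for s in train_strings:
        counts.update(ngram_counts(s, n))
    total = sum(counts.values())
    vocab = len(counts) + 1   # crude open-vocabulary term
    return lambda g: math.log((counts[g] + alpha) / (total + alpha * vocab))

def lid_score(test_phones, language_lms, n=3):
    """Score the test phone string with each language's model, pick the best."""
    grams = ngram_counts(test_phones, n)
    scores = {lang: sum(c * lm(g) for g, c in grams.items())
              for lang, lm in language_lms.items()}
    return max(scores, key=scores.get), scores
```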

Research achievements. NIST evaluation results: LRE 2005 – the best in 2 out of 3 categories; LRE 2007 – confirmation of the leading position; LRE 2009 – a bit of bad luck, but a very good post-submission system. Example demo outputs (T marks the correct language): French 99.9, English 93.3, German 84.7, Arabic 42.9. Key things: discriminative modeling; channel compensation; gathering training data from public sources. Web demo:

Agenda: Where we are and who we are; Needle in a haystack; Simple example - Gender ID; Speaker recognition; Language identification; Keyword spotting; CZ projects.

Keyword spotting: what was said? In which recording and when? With what confidence? The confidence comes from comparing the keyword model output with an anti-model. Technical approaches: acoustic keyword spotting; searching in the output of a Large Vocabulary Continuous Speech Recognizer (LVCSR); searching in the output of an LVCSR completed with sub-word units. The choices: what is the needed tradeoff between speed and accuracy, and how to cope with the “devil” of keyword spotting – Out-of-Vocabulary (OOV) words?

Acoustic KWS: a model of the word is scored against a background model; no language model. Pros: no problem with OOVs; runs down to 0.01×RT. Cons: indexing is not possible – everything must be searched through; does not have the strength of an LM – problems with short words and sub-words.
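A crude sketch of the idea, assuming per-segment log-likelihood functions for the keyword model and the background model exist; real systems decode keyword and filler HMMs with Viterbi rather than scoring every candidate segment by brute force as done here.

```python
def spot_keyword(frames, keyword_loglik, background_loglik,
                 min_len=20, max_len=80, threshold=1.0):
    """frames: (T, D) features; keyword_loglik / background_loglik: callables
    mapping an (L, D) segment to its total log-likelihood.
    Returns (start_frame, end_frame, score) candidate detections."""
    detections = []
    T = len(frames)
    for start in range(0, T - min_len):
        for length in range(min_len, min(max_len, T - start) + 1):
            segment = frames[start:start + length]
            # per-frame log-likelihood ratio: keyword model vs. background model
            llr = (keyword_loglik(segment) - background_loglik(segment)) / length
            if llr > threshold:
                detections.append((start, start + length, llr))
    return detections
```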

Searching in the output of LVCSR: run the LVCSR, then search in the 1-best output or in a lattice; indexing is possible. Pros: speed of search; more precise on frequent words. Cons: limited by the LVCSR vocabulary (OOV); LVCSR is more complex and slower.
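A hedged sketch of the indexing idea on 1-best output follows; the recording IDs, times and confidences are hypothetical, and the real systems index lattices with posterior-based confidences rather than plain 1-best words.

```python
from collections import defaultdict

def build_index(transcripts):
    """transcripts: {recording_id: [(word, start_s, end_s, confidence), ...]}
    Returns an inverted index: word -> list of (recording_id, start_s, end_s, confidence)."""
    index = defaultdict(list)
    for rec_id, words in transcripts.items():
        for word, start, end, conf in words:
            index[word.lower()].append((rec_id, start, end, conf))
    return index

def search(index, keyword):
    """All occurrences of the keyword, found without touching the audio again."""
    return index.get(keyword.lower(), [])

# hypothetical usage
idx = build_index({"call_001": [("hello", 0.10, 0.45, 0.93),
                                ("crete", 0.52, 0.94, 0.81)]})
print(search(idx, "Crete"))   # -> [('call_001', 0.52, 0.94, 0.81)]
```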

Searching in the output of LVCSR + sub-words: run the LVCSR with both words and sub-word units, and index both. Pros: speed of search preserved; precision on frequent words preserved; allows searching for OOVs without additional processing of all the data. Con: the LVCSR and the indexing are more complex.
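And a toy sketch of how an OOV query could be answered from a sub-word index built like the word index above but keyed by phone n-gram tuples; the query is assumed to be already converted to phones (e.g. by grapheme-to-phoneme rules), and the coverage threshold is an arbitrary illustrative choice.

```python
from collections import Counter

def subword_query(index, query_phones, n=3, coverage=0.7):
    """Look up overlapping phone n-grams of an OOV query in a sub-word index
    and keep recordings that contain most of the query's n-grams (crude voting)."""
    grams = [tuple(query_phones[i:i + n]) for i in range(len(query_phones) - n + 1)]
    hits = Counter()
    for g in grams:
        for rec_id, start, end, conf in index.get(g, []):
            hits[rec_id] += 1
    return [rec for rec, count in hits.items() if count >= coverage * len(grams)]
```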

Research achievements. Key things: expertise with acoustic, word and sub-word recognition; excellent front-ends – LVCSR and phone recognizer; speech indexing and search; normalization of scores. DEMO – Russian acoustic KWS. NIST STD 2006 – English; MV Task 2008 – Czech.

Agenda: Where we are and who we are; Needle in a haystack; Simple example - Gender ID; Speaker recognition; Language identification; Keyword spotting; CZ projects.