ZRE 2009/10 introductory talk
Honza Černocký, Speech@FIT, Brno University of Technology, Czech Republic
ZRE, 8 February 2010
Agenda
- Where we are and who we are
- Needle in a haystack
- Simple example - Gender ID
- Speaker recognition
- Language identification
- Keyword spotting
- CZ projects
Where is Brno?
The place
Brno University of Technology is the 2nd largest technical university in the Czech Republic (~2500 staff, ~18000 students). The Faculty of Information Technology (FIT) is its youngest faculty (created in January 2002). Reconstruction of the campus finished in November 2007; it is now a beautiful place marrying an old Carthusian monastery and modern buildings.
Department of Computer Graphics and Multimedia
- Video/image processing
- Speech processing
- Knowledge engineering and natural language processing
- Medical visualization and 3D modeling
http://www.fit.vutbr.cz/units/UPGM/
Speech@FIT
- University research group established in 1997.
- 20 people in 2009 (faculty, researchers, students, support staff).
- Also provides education within the Department of Computer Graphics and Multimedia.
- Cooperates with EU and US universities and companies.
- Supported by EC, US, and national projects.
Speech@FIT's goal: high-profile research in speech theory and algorithms.
Key people
Directors:
- Dr. Jan "Honza" Černocký – executive direction
- Prof. Hynek Heřmanský (Johns Hopkins University, USA) – advisor and guru
- Dr. Lukáš Burget – scientific director
Sub-group leaders:
- Petr Schwarz – phonemes, implementation
- Pavel "Pája" Matějka – speaker ID, language ID
- Pavel Smrž – NLP and the Semantic Web
The steel and the soft
Steel:
- 3 IBM Blade centers with 42 IBM Blade servers, each with 2 dual-core CPUs
- Another ~120 computers in classrooms
- >16 TB of disk space
- Professional and friendly administration
Soft:
- Common: HTK, Matlab, QuickNet, SGE
- Own SW: STK, BS-CORE, BS-API
Speech@FIT funding
- Faculty (faculty members and faculty-wide research funds)
- EU projects (FP[4567]). Past: SpeechDat-E, SpeeCon, M4, AMI, CareTaker. Running: AMIDA, MOBIO, weKnowIt.
- US funding – the Air Force's EOARD
- Local funding agencies – Grant Agency of the Czech Republic, Ministry of Education, Ministry of Trade and Commerce
- Czech "force" ministries – Defense, Interior
- Industrial contracts
- Spin-off – Phonexia, Ltd.
Phonexia Ltd.
Company created in 2006 by 6 Speech@FIT members, closely cooperating with the research group.
Key people:
- Dr. Pavel Matějka, CEO
- Dr. Petr Schwarz, CTO
- Igor Szöke, CFO
- Dr. Lukáš Burget, research coordinator
- Dr. Jan Černocký, university relations
- Tomáš Kašpárek, hardware architect
Phonexia's goal: bringing mature technologies to the market, especially in the security/defense sector.
Agenda
- Where we are and who we are
- Needle in a haystack
- Simple example - Gender ID
- Speaker recognition
- Language identification
- Keyword spotting
- CZ projects
Needle in a haystack
Speech is the most important modality of human-human communication (~80% of information) … and criminals and terrorists also communicate by speech.
Speech is easy to acquire in both civilian and intelligence/defense scenarios; it is harder to find what we are looking for. The search is typically done by human experts, but always count on:
- limited personnel,
- limited budget,
- not enough languages spoken,
- insufficient security clearances.
Speech processing technologies are not almighty, but they can help narrow the search space.
"Speech recognition"
GOAL: automatically extract the information transmitted in the speech signal.
- Speaker recognition → speaker name (John Doe)
- Gender recognition → gender (male or female)
- Language recognition → language (English/German/??)
- Speech recognition → what was said ("Hallo Crete!")
- Keyword spotting → keyword occurrences ("Crete" spotted)
Focus on evaluations
"I'm better than the other guys" is not relevant unless everyone uses the same data and the same evaluation metrics.
NIST (a US government agency, http://www.nist.gov/speech) runs regular benchmark campaigns – evaluations – of speech technologies. All participants get the same data and the same limited time to process them and send results to NIST => objective comparison. The results and system details are discussed at NIST workshops.
Speech@FIT participates extensively in NIST evaluations:
- Transcription: 2005, 2006, 2007, 2009
- Language ID: 2003, 2005, 2007, 2009
- Speaker verification: 1998, 1999, 2006, 2008
- Spoken term detection: 2006
Why are we doing this? We believe that evaluations really advance the state of the art, and we do not want to waste our time on useless work.
What are we really doing?
Following the recipe from any pattern-recognition book:
[Figure: the textbook pattern-recognition recipe.]
And what is the result?
Something you've probably already seen:
input → feature extraction → evaluation of probabilities or likelihoods (using models) → "decoding" → decision
Agenda
- Where we are and who we are
- Needle in a haystack
- Simple example - Gender ID
- Speaker recognition
- Language identification
- Keyword spotting
- CZ projects
The simplest example: GID
Gender identification (GID) is the easiest speech application to deploy … and the most accurate (>96% on challenging channels). It limits the search space by 50%.
So how is gender ID done?
MFCC input → evaluation of GMM likelihoods (Gaussian mixture models for boys and girls) → decision: male/female
Features – Mel Frequency Cepstral Coefficients (MFCC)
The signal is not stationary, so it is analyzed in short overlapping frames; and hearing is not linear in frequency, so the spectrum is warped to the mel scale (a computation sketch follows).
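To make the two observations concrete, here is a minimal MFCC sketch in NumPy/SciPy. It assumes a 1-D float array sampled at 8 kHz; the frame sizes, FFT length, and filter count are common textbook values, not the exact settings of the BUT front-end.

```python
# Minimal MFCC sketch (illustrative parameter values, not the exact
# settings of any BUT system). Input: 1-D NumPy float array at 8 kHz.
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=8000, frame_len=200, frame_shift=80,
         n_fft=256, n_filters=23, n_ceps=13):
    # Short-time analysis: speech is only quasi-stationary, so cut it
    # into overlapping ~25 ms frames and window each one.
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    frames = signal[idx] * np.hamming(frame_len)

    # Magnitude spectrum of each frame.
    spec = np.abs(np.fft.rfft(frames, n_fft))

    # Mel filterbank: hearing is not linear in frequency, so triangular
    # filters sit densely at low and sparsely at high frequencies.
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel2hz(np.linspace(hz2mel(0.0), hz2mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log filterbank energies, then DCT -> cepstral coefficients.
    log_e = np.log(spec @ fbank.T + 1e-10)
    return dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]
```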
The evaluation of likelihoods: GMM
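For reference, the likelihood being evaluated here is the standard Gaussian mixture model (a textbook formulation; the transcript itself carries no equations). With parameters $\lambda = \{w_m, \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m\}_{m=1}^{M}$ and frames $\mathbf{x}_1, \dots, \mathbf{x}_T$ treated as independent:

```latex
p(\mathbf{x}_t \mid \lambda) = \sum_{m=1}^{M} w_m \, \mathcal{N}(\mathbf{x}_t;\, \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m),
\qquad
\log p(X \mid \lambda) = \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda),
\qquad
\sum_{m=1}^{M} w_m = 1.
```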
The decision – Bayes rule.
GID DEMO
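The decision is the usual maximum a posteriori rule (standard Bayes reasoning, not quoted from the slide): with gender models $\lambda_M, \lambda_F$ and priors $P(M), P(F)$, Bayes' rule gives $P(M \mid X) \propto p(X \mid \lambda_M)\,P(M)$, since $p(X)$ is common to both classes, so the classifier outputs "male" iff

```latex
\log p(X \mid \lambda_M) + \log P(M) \;>\; \log p(X \mid \lambda_F) + \log P(F).
```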
Agenda
- Where we are and who we are
- Needle in a haystack
- Simple example - Gender ID
- Speaker recognition
- Language identification
- Keyword spotting
- CZ projects
Speaker recognition
Speaker recognition aims at recognizing "who said it". In speaker identification, the task is to assign a speech signal to one out of N speakers. In speaker verification, the claimed identity is known and the question to be answered is "was the speaker really Mr. XYZ, or an impostor?"
Typical pipeline: front-end processing → target model (adapted from the background model) and background model → score normalization (a scoring sketch follows).
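A minimal sketch of the verification score in this pipeline, using scikit-learn's GaussianMixture. Real systems MAP-adapt the target model from the UBM (the "adapt" arrow above) and normalize scores (z-norm/t-norm); both are elided here, and the model sizes are illustrative.

```python
# Minimal GMM-UBM verification scoring sketch (illustrative only; MAP
# adaptation of the target model and z-/t-norm score normalization are
# elided).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=64):
    # features: (n_frames, n_dims) array, e.g. MFCCs.
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
    gmm.fit(features)
    return gmm

def verification_score(target_gmm, ubm, test_features):
    # Average per-frame log-likelihood ratio between the target speaker
    # model and the universal background model; accept above a threshold.
    llr = (target_gmm.score_samples(test_features)
           - ubm.score_samples(test_features))
    return llr.mean()
```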
Example: single Gaussian models with 2D features
[Figure: UBM and target speaker model in a 2D feature space; arrows mark the direction of high speaker variability and the (unwanted) direction of high inter-session variability.]
And what to do about it?
[Figure: the same 2D space with the UBM, the target speaker model, and test data; axes of high inter-speaker and high inter-session variability.]
For recognition, move both models along the high inter-session variability direction(s) so that they fit the test data well.
Research achievements
Key thing: Joint Factor Analysis (JFA) decomposes models into channel and speaker sub-spaces, coping with unwanted variability. At the same time, it gives a compact representation of speakers that allows extremely fast scoring of speech files.
Speaker search DEMO
NIST SRE 2006: BUT in the STBU consortium. NIST SRE 2008: confirming the leading position.
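In the usual JFA notation from the literature (the slide itself shows no formulas), the speaker- and session-dependent GMM mean supervector is decomposed as

```latex
\mathbf{M} = \mathbf{m} + \mathbf{V}\mathbf{y} + \mathbf{U}\mathbf{x} + \mathbf{D}\mathbf{z},
```

where m is the UBM mean supervector, V spans the speaker subspace (eigenvoices), U the channel/session subspace (eigenchannels), D is a diagonal residual, and y, x, z are low-dimensional latent factors. The speaker factors y are the compact representation that makes the fast scoring possible.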
Agenda
- Where we are and who we are
- Needle in a haystack
- Simple example - Gender ID
- Speaker recognition
- Language identification
- Keyword spotting
- CZ projects
The goal of language ID
Determine the language of a speech segment.
Two main approaches to LID
- Acoustic: Gaussian Mixture Models
- Phonotactic: Phone Recognition followed by Language Model
The acoustic approach
- Good for short speech segments and dialect recognition.
- Relies on the sounds themselves.
- Done with discriminatively trained GMMs with channel compensation.
The phonotactic approach
- Good for longer speech segments.
- Robust against dialects within one language.
- Eliminates speech characteristics of the speaker's native language.
- Based on a high-quality NN-based phone recognizer producing strings or lattices.
Phonotactic modeling – example
Token counts in German and English training data vs. a test segment:

          German   English   Test
  und         25         1      5
  and          3        32      0
  the          0        13      1

Modeling options (a toy scoring example follows):
- N-gram language models – discounting, backoff
- Binary decision trees – adaptation from a UBM
- Support Vector Machines – vectors with counts
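A toy scoring example matched to the table above: smoothed unigram log-likelihoods of the test counts under each language. Purely illustrative; the real systems use the n-gram, decision-tree, or SVM models listed on the slide.

```python
# Toy phonotactic scoring using the counts from the table above:
# add-alpha smoothed unigram log-likelihood of the test segment's counts
# under each language (illustrative only).
from math import log

german  = {'und': 25, 'and': 3,  'the': 0}
english = {'und': 1,  'and': 32, 'the': 13}
test    = {'und': 5,  'and': 0,  'the': 1}

def loglik(test_counts, lang_counts, alpha=1.0):
    total = sum(lang_counts.values())
    vocab = len(lang_counts)
    return sum(c * log((lang_counts[tok] + alpha) / (total + alpha * vocab))
               for tok, c in test_counts.items())

scores = {'German': loglik(test, german), 'English': loglik(test, english)}
print(max(scores, key=scores.get))  # 'German' wins on this toy example
```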
Research achievements
NIST evaluation results:
- LRE 2005 – Speech@FIT the best in 2 out of 3 categories.
- LRE 2007 – confirmation of the leading position.
- LRE 2009 – a bit of bad luck, but a very good post-submission system.
[Example outputs on four test segments, scores per language with the true language marked T: fre 99.9, eng 93.3, ger 84.7, ara 42.9.]
Key things:
- Discriminative modeling
- Channel compensation
- Gathering training data from public sources
Web demo: http://speech.fit.vutbr.cz/lid-demo/
Agenda
- Where we are and who we are
- Needle in a haystack
- Simple example - Gender ID
- Speaker recognition
- Language identification
- Keyword spotting
- CZ projects
Keyword spotting
The task: what was said? In which recording, and when? With what confidence (comparing the keyword model output with an anti-model)?
Technical approaches:
- acoustic keyword spotting;
- searching in the output of a large-vocabulary continuous speech recognizer (LVCSR);
- searching in the output of an LVCSR completed with sub-word units.
The choices: what is the needed tradeoff between speed and accuracy, and how to cope with the "devil" of keyword spotting – out-of-vocabulary (OOV) words?
Acoustic KWS
A model of the word is scored against a background model; no language model is used.
+ No problem with OOVs.
- Indexing is not possible – one needs to go through all the data, even if down to 0.01×RT.
- Lacks the strength of an LM – problems with short words and sub-words.
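One common way to formalize the "word model against background model" comparison (a standard formulation, not taken from the slide) is a log-likelihood ratio over a candidate segment $X_{t_1:t_2}$:

```latex
\mathrm{LLR}(t_1, t_2) = \log p(X_{t_1:t_2} \mid \lambda_{\text{keyword}})
                       - \log p(X_{t_1:t_2} \mid \lambda_{\text{background}}),
```

with a detection declared when the ratio exceeds a threshold tuned for the desired hit/false-alarm tradeoff.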
Searching in the output of LVCSR
Run LVCSR, then search in the 1-best output or in lattices; indexing is possible.
+ Speed of search.
+ More precise on frequent words.
- Limited by the LVCSR vocabulary – OOVs.
- LVCSR is more complex and slower.
Searching in the output of LVCSR + sub-words
Run LVCSR with words and sub-word units, and index both (a sketch of such an index follows).
+ Speed of search preserved.
+ Precision on frequent words preserved.
+ Allows searching for OOVs without additional processing of all the data.
- LVCSR and indexing are more complex.
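A minimal sketch of what a combined word + sub-word inverted index might look like, assuming lattices have already been reduced to (recording, time, confidence) tuples. The to_subwords helper is hypothetical; this is not the BUT indexing scheme.

```python
# Sketch of a combined word + sub-word inverted index (illustrative).
from collections import defaultdict

word_index = defaultdict(list)     # word -> [(recording, time, confidence)]
subword_index = defaultdict(list)  # sub-word unit -> same tuples

def search(query, to_subwords):
    # In-vocabulary queries hit the word index directly; OOV queries are
    # decomposed into sub-word units (to_subwords is a hypothetical
    # helper, e.g. a grapheme-to-phoneme step) and looked up there.
    if query in word_index:
        return word_index[query]
    hits = []
    for unit in to_subwords(query):
        hits.extend(subword_index.get(unit, []))
    return hits
```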
Research achievements
Key things:
- Expertise with acoustic, word, and sub-word recognition
- Excellent front-ends – LVCSR and phone recognizer
- Speech indexing and search
- Normalization of scores
DEMO – Russian acoustic KWS
NIST STD 2006 – English; MV Task 2008 – Czech
Agenda
- Where we are and who we are
- Needle in a haystack
- Simple example - Gender ID
- Speaker recognition
- Language identification
- Keyword spotting
- CZ projects