From research to products
3–5–2017
Phonexia helps clients automatically extract the maximum of valuable information from speech – it turns speech into knowledge.
Founded in 2006 as a spin-off of Brno University of Technology; headquarters and main office in Brno.
Customers worldwide: governmental agencies, call centers, banks, telco operators, broadcast service companies.
The main focus is on the platform and developer tools.
Profitable, no external funding.
From research to products
Research: scientific papers, reports, experimental code (Matlab, Python, C++), lots of glue (shell scripts), data files. The goal is accuracy; stability, speed, reproducibility and documentation are less important. Openness.
Technologies: the goal is stability (error handling, code verification, testing cycles at various levels) and speed. Regular development cycles and planning. Well-defined application programming interfaces (APIs). Documentation, licensing.
Products: development of new applications. Integration with the client's technologies and systems. The goal is the functionality of the integrated solution. User interfaces.
What is in speech? Content Speaker Equipment Environment
Content: language, dialect; keywords, phrases; speech transcription; topic; data mining.
Equipment: device (phone/mike/...); transmission channels (landline/cell phone/Skype); codecs (GSM/MP3/…); speech quality.
Speaker: gender, age; speaker identity; emotion, speaker origin; education, relation; when the speaker speaks.
Environment: where the speaker speaks; to whom the speaker speaks (dialog, reading, public talk); other sounds (music, vehicles, animals…).
Voice interfaces – potential or threat?
Technologies
Voice activity detection, keyword spotting, language identification, speech transcription, gender recognition, dialog analysis, speaker identification, emotion recognition, diarization, sentiment analysis
Use cases
Call centers - quality control / business intelligence
News agencies - searching for specific information
Intelligence agencies - searching for information in a haystack, forensic expertise
Banks - fraud detection / voice as a password
Speech platform
We deliver tools for developers
One interface for all technologies – REST + HTTP/RTP streams
Simple installation, integration and scalability
Partnering with developers and integrators all around the world
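To make the idea of a single REST interface concrete, here is a minimal sketch of what a client call might look like in Python. The server address, endpoint path and response field are hypothetical placeholders for illustration, not the actual Phonexia API.

```python
# Hypothetical sketch of calling a speech-platform REST interface.
# The base URL, endpoint path, and response fields below are made-up
# placeholders; they are NOT the actual Phonexia API.
import requests

BASE_URL = "https://speech-platform.example.com"   # assumption: server address

with open("call_recording.wav", "rb") as f:
    audio = f.read()

# Upload a recording and request a transcription (hypothetical endpoint).
resp = requests.post(
    f"{BASE_URL}/technologies/speech-to-text",
    data=audio,
    headers={"Content-Type": "audio/wav"},
    timeout=60,
)
resp.raise_for_status()
result = resp.json()

# 'transcript' is an assumed field name in the hypothetical response.
print(result.get("transcript", ""))
```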
Hot tasks
Neural networks
Current systems are based on deep neural networks
Recurrent neural networks are more powerful, but gradients vanish too quickly -> move to Long Short-Term Memory (LSTM) or bidirectional LSTM (BLSTM) networks
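As a minimal illustration of the kind of layer such systems build on, the following sketch defines a small BLSTM model in PyTorch; the feature dimension, hidden size and number of phoneme classes are arbitrary assumptions, not values from the talk.

```python
# Minimal sketch of a BLSTM acoustic-model layer in PyTorch.
# Feature dimension, hidden size, and number of phoneme classes are
# arbitrary assumptions for illustration.
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    def __init__(self, n_features=40, hidden=256, n_phonemes=45):
        super().__init__()
        # A bidirectional LSTM keeps long temporal context in both directions,
        # avoiding the vanishing-gradient problem of plain RNNs.
        self.blstm = nn.LSTM(n_features, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, frames):                 # frames: (batch, time, n_features)
        hidden_states, _ = self.blstm(frames)
        return self.out(hidden_states)         # per-frame phoneme scores

model = BLSTMAcousticModel()
scores = model(torch.randn(1, 100, 40))        # 100 frames of 40-dim features
print(scores.shape)                            # torch.Size([1, 100, 45])
```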
Vector representation of words – from words to meaning
The word2vec algorithm was proposed by Tomáš Mikolov
A word in its context is mapped to a short N-dimensional vector
Similar words are mapped to nearby points in the N-dimensional space
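A minimal sketch of training such embeddings, assuming the gensim library and a made-up toy corpus:

```python
# Minimal word2vec sketch using gensim (assumed library choice; any
# word-embedding toolkit would do). The toy corpus is made up.
from gensim.models import Word2Vec

corpus = [
    ["the", "agent", "was", "very", "helpful"],
    ["excellent", "service", "good", "job", "thank", "you"],
    ["very", "poor", "service", "totally", "unprofessional"],
]

# Each word is mapped to a short N-dimensional vector (here N = 50).
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=100)

vector = model.wv["service"]            # 50-dimensional embedding
print(model.wv.most_similar("service")) # nearby words in the vector space
```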
Punctuation and capitalization
ben is asked to wait for amy but he does not wait he continues to run so amy's request is changed now ben is asked to help amy ben stops and amy is helped
Ben is asked to wait for Amy, but he does not wait. He continues to run. So Amy's request is changed. Now Ben is asked to help Amy. Ben stops and Amy is helped.
A neural network is trained to predict upper/lower case and punctuation. The input is a vector representation of words. A similar technique can be trained to detect POS tags, named entities and other key information.
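One possible shape of such a model is per-word sequence tagging over word vectors; the sketch below (PyTorch, with an assumed label set and dimensions) is illustrative, not the actual system.

```python
# Sketch of punctuation/capitalization restoration as per-word tagging.
# Assumes pretrained word vectors; labels and dimensions are illustrative.
import torch
import torch.nn as nn

LABELS = ["lower", "capitalize", "lower+comma", "lower+period",
          "capitalize+period"]            # assumed label set

class PunctCaseTagger(nn.Module):
    def __init__(self, embedding_dim=50, hidden=128, n_labels=len(LABELS)):
        super().__init__()
        # A recurrent layer reads the word-vector sequence and predicts,
        # for every word, its casing and the punctuation that follows it.
        self.rnn = nn.GRU(embedding_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_labels)

    def forward(self, word_vectors):       # (batch, words, embedding_dim)
        states, _ = self.rnn(word_vectors)
        return self.classifier(states)     # per-word label scores

tagger = PunctCaseTagger()
sentence = torch.randn(1, 9, 50)           # e.g. "ben is asked to wait for amy ..."
print(tagger(sentence).argmax(dim=-1))     # predicted label per word
```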
Sentiment analysis
very poor service totally unprofessional cost too much money to do i wish it didn't cost so much and we're really really disappointed
my agent was very very helpful and informative excellent service good job thank you everyone was extremely happy and very nice thank you had great courtesy fast service with a smile
Again, a neural network can be trained for this task. The input is a vector representation of words.
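A minimal sketch of this idea, assuming each utterance is reduced to the average of its word vectors and fed to a small feed-forward network (scikit-learn's MLPClassifier is used here for brevity); the data are random placeholders.

```python
# Sketch of sentiment classification from averaged word vectors.
# The vectors and labels are random placeholders; in practice each
# utterance would be represented by its (e.g. word2vec) embeddings.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Assume 200 utterances, each reduced to the mean of its 50-dim word vectors.
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)          # 0 = negative, 1 = positive (toy labels)

# A small feed-forward neural network trained on the utterance embeddings.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X, y)

new_utterance = rng.normal(size=(1, 50))  # averaged word vectors of a new call
print("positive" if clf.predict(new_utterance)[0] == 1 else "negative")
```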
Training data preparation and cleaning
Each speech recognizer needs word-transcribed speech recordings for training – 100 hours and more; 1 hour of audio takes roughly 15 hours of human time
Transcription can be done by crowd-sourcing, but good quality control is necessary
The goal is to develop speech recognizers in just days
For quality control, a confidence score based on comparing a sentence model with an arbitrary word-sequence model can be used
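A sketch of how such a confidence score could be computed, assuming the recognizer provides the acoustic log-likelihood of the audio aligned to the claimed transcript and of an unconstrained word-loop decoding (the numbers below are placeholders):

```python
# Sketch of a transcription-quality confidence score: compare the acoustic
# log-likelihood of the audio forced-aligned to the claimed transcript
# (the "sentence model") against decoding with an unconstrained word loop
# (the "arbitrary word-sequence model"). The likelihood values here are
# placeholders; a real recognizer would supply them.
import numpy as np

def transcript_confidence(ll_sentence_model, ll_free_loop, n_frames):
    """Per-frame log-likelihood ratio; near 0 means the transcript fits well,
    strongly negative means the crowd-sourced transcript is suspicious."""
    return (ll_sentence_model - ll_free_loop) / n_frames

# Toy numbers standing in for recognizer output on one utterance.
print(transcript_confidence(ll_sentence_model=-4210.0,
                            ll_free_loop=-4180.0,
                            n_frames=800))     # ~ -0.04 -> plausible transcript
```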
Dynamic decoder for lower memory consumption and embedded use
A decoder is a key part of each speech recognizer
A static decoder uses a precompiled recognition network M = H ∘ C ∘ L ∘ G (the composition of the HMM, context, lexicon and grammar transducers), but it is very memory consuming
A dynamic decoder does the composition on the fly, so it needs only a little memory; the whole recognizer can fit into embedded devices
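The following toy sketch (not Phonexia's decoder) contrasts the two approaches: composition is reduced to pairing states of two tiny graphs, computed either up front (static) or on demand (dynamic).

```python
# Conceptual sketch of static vs. dynamic decoding networks. "Composition"
# is reduced to pairing the states of two toy graphs; real decoders compose
# weighted finite-state transducers (H ∘ C ∘ L ∘ G).
from itertools import product

lexicon_arcs = {"s0": [("hello", "s1")], "s1": [("world", "s2")]}
grammar_arcs = {"g0": [("hello", "g1")], "g1": [("world", "g2")]}

# Static: precompute every reachable composite state up front (memory heavy).
static_network = {
    (l, g): [(w, (l2, g2))
             for (w, l2) in lexicon_arcs.get(l, [])
             for (w2, g2) in grammar_arcs.get(g, []) if w == w2]
    for l, g in product(lexicon_arcs, grammar_arcs)
}

# Dynamic: expand successors of a composite state only when the search asks.
def expand(state):
    l, g = state
    return [(w, (l2, g2))
            for (w, l2) in lexicon_arcs.get(l, [])
            for (w2, g2) in grammar_arcs.get(g, []) if w == w2]

print(static_network[("s0", "g0")])   # precomputed arcs
print(expand(("s0", "g0")))           # the same arcs, computed on demand
```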
One-state Hidden Markov Models
The classical concept uses several states per phoneme in the HMM
Neural networks and long temporal context modelling make it possible to use just one state per phoneme
One-state phoneme models increase the recognition speed about 3x
Knowing more about speech source
The recognition result depends on the audio quality
Speech can be degraded by noise, reverberation, or lossy compression
The audio can pass through many codecs
Detection of audio codecs and bit rates is key to predicting the quality of the output data
It is important for forensic expertise
It is important for preventing spoofing attacks on voice biometry
Social network / link analysis
There is a lot of metadata in speech recordings
Linking metadata within a recording or across recordings makes particular pieces of information easier to find
Current search capabilities can be improved by a few orders of magnitude
The same happened with text when Google came
There are well-known algorithms from text processing
Some technologies
Voice activity detection
Pipeline (increasing accuracy, decreasing speed): energy-based VAD -> technical signal removal -> VAD based on f0 tracking -> neural network VAD
Energy-based VAD – fast removal of low-energy parts
Technical signal removal and noise filtering – removal of tones, removal of flat-spectrum signals, removal of stationary signals, filtering of pulse noise
VAD based on f0 tracking – removal of other non-speech signals
Neural network VAD – very accurate VAD based on phoneme recognition
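As an illustration of the simplest stage, here is a minimal energy-based VAD sketch in numpy; the frame length, shift and threshold margin are arbitrary assumptions.

```python
# Minimal sketch of an energy-based VAD: frames whose short-term energy
# falls below a threshold relative to the quietest frames are dropped.
import numpy as np

def energy_vad(samples, sample_rate=8000, frame_ms=25, shift_ms=10, margin_db=12.0):
    frame = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(samples) - frame) // shift)

    # Short-term log energy of each frame.
    energies = np.array([
        10.0 * np.log10(np.sum(samples[i * shift:i * shift + frame] ** 2) + 1e-10)
        for i in range(n_frames)
    ])

    # Speech = frames sufficiently above the estimated noise floor.
    noise_floor = np.percentile(energies, 10)
    return energies > noise_floor + margin_db

# Toy signal: silence, then a louder "speech-like" burst, then silence.
signal = np.concatenate([np.random.randn(8000) * 0.01,
                         np.random.randn(8000) * 0.5,
                         np.random.randn(8000) * 0.01])
print(energy_vad(signal).astype(int))
```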
Language identification
Pipeline: feature extraction -> collection of UBM statistics -> projection to iVectors -> language classifier (MLR) -> score calibration / transform -> language scores
Model parameters (UBM, projection parameters, language parameters, calibration parameters) are prepared by Phonexia; the system is fully trainable by the client
Language prints (iVectors) can be easily transferred over low-capacity links
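A sketch of the final classification stage, assuming i-vectors are already extracted (the i-vectors below are random placeholders and the dimensionality is an assumption); multiclass logistic regression (MLR) maps an i-vector to language scores.

```python
# Sketch of the language-classifier stage: multiclass logistic regression
# (MLR) over i-vectors. A real system extracts the i-vectors from UBM
# statistics; here they are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
IVECTOR_DIM = 400                       # typical i-vector size (assumption)
LANGUAGES = ["en", "cs", "de"]

# Toy training data: 100 i-vectors per language.
X = rng.normal(size=(300, IVECTOR_DIM))
y = np.repeat(LANGUAGES, 100)

mlr = LogisticRegression(max_iter=1000)
mlr.fit(X, y)

# A new recording's "language print" is just its i-vector, so only this
# small vector needs to be transferred over a low-capacity link.
language_print = rng.normal(size=(1, IVECTOR_DIM))
print(dict(zip(mlr.classes_, mlr.predict_proba(language_print)[0].round(3))))
```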
Speaker identification – voiceprint extraction
Pipeline: feature extraction -> collection of UBM statistics -> projection to iVectors -> extraction of speaker information (LDA) -> user normalization -> voiceprint
Model parameters (UBM, projection parameters, normalization parameters) are prepared by Phonexia
The iVector describes the total variability inside a speech recording
LDA removes non-speaker variability
User normalization helps the user normalize to unseen channels (mean subtraction)
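A minimal sketch of the last two stages, assuming i-vectors are already available (random placeholders below): LDA keeps speaker-discriminative directions and a mean subtraction normalizes for the user's channel.

```python
# Sketch of the voiceprint-extraction back end: LDA projects i-vectors into
# a space that keeps speaker variability, and mean subtraction normalizes
# for the user's (unseen) channel. The i-vectors are random placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Toy development data: 400-dim i-vectors from 20 speakers, 30 each.
ivectors = rng.normal(size=(600, 400))
speaker_ids = np.repeat(np.arange(20), 30)

# LDA keeps directions that separate speakers, removing non-speaker variability.
lda = LinearDiscriminantAnalysis(n_components=19)
lda.fit(ivectors, speaker_ids)

def extract_voiceprint(ivector, channel_mean):
    """Project to the speaker subspace, then subtract the user's channel mean."""
    return lda.transform(ivector.reshape(1, -1))[0] - channel_mean

# The channel mean would come from the user's own recordings on a new channel.
channel_mean = lda.transform(ivectors[:30]).mean(axis=0)
print(extract_voiceprint(rng.normal(size=400), channel_mean).shape)   # (19,)
```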
Speaker identification – voiceprint comparison
Pipeline: voiceprint 1 + voiceprint 2 -> voiceprint comparator (WCCN + PLDA) -> length-dependent calibration (piecewise LR) -> score transform (logistic function) -> score; trainable by the user
The voiceprint comparator returns a log likelihood
Calibration ensures a probabilistic interpretation of the score under different speech lengths
The score transform enables selecting a log likelihood ratio or a percentage score
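A toy sketch of the comparison and calibration steps; cosine similarity serves here as a stand-in for the WCCN+PLDA comparator, and the logistic parameters are assumed rather than trained.

```python
# Sketch of voiceprint comparison and calibration. Real systems use
# WCCN+PLDA scoring; cosine similarity is used below as a stand-in
# comparator, followed by a logistic transform to a percentage-like score.
import numpy as np

def compare_voiceprints(vp1, vp2):
    """Stand-in comparator: cosine similarity between two voiceprints."""
    return float(vp1 @ vp2 / (np.linalg.norm(vp1) * np.linalg.norm(vp2) + 1e-10))

def calibrate(raw_score, scale=8.0, offset=0.4):
    """Logistic transform of the raw score to a 0-100 'percentage' score.
    The scale/offset would be fitted on held-out trials (assumed values here)."""
    return 100.0 / (1.0 + np.exp(-scale * (raw_score - offset)))

rng = np.random.default_rng(0)
target = rng.normal(size=19)
same = target + rng.normal(scale=0.1, size=19)     # same speaker, slight noise
other = rng.normal(size=19)                        # different speaker

print(round(calibrate(compare_voiceprints(target, same)), 1))   # high score
print(round(calibrate(compare_voiceprints(target, other)), 1))  # low score
```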
Thanks for your attention
Petr Schwarz, CTO – phonexia.com