From research to products

From research to products 3–5–2017

Phonexia helps clients automatically extract the maximum of valuable information from spoken speech. It turns speech into knowledge. Founded in 2006 as a spin-off of Brno University of Technology. Headquarters and main office in Brno. Customers worldwide: governmental agencies, call centers, banks, telco operators, broadcast service companies. Main focus on the platform and developer tools. Profitable, no external funding.

From research to products: Research, Technologies, Products. Research: scientific papers, reports, experimental code (Matlab, Python, C++), lots of glue (shell scripts), data files. The goal is accuracy; stability, speed, reproducibility and documentation are less important. Openness. Technologies: the goal is stability (error handling, code verification, testing cycles at various levels) and speed. Regular development cycles and planning. Well-defined application interfaces (APIs). Documentation, licensing. Products: development of new applications. Integration with clients' technologies and systems. The goal is functionality of the integrated solution. User interfaces.

What is in speech? Content, speaker, equipment, environment. Content: language, dialect; keywords, phrases; speech transcription; topic; data mining. Speaker: gender, age; speaker identity; emotion, speaker origin; education, relation; when the speaker speaks. Equipment: device (phone/mike/...); transmission channels (landline/cell phone/Skype); codecs (GSM/MP3/...); speech quality. Environment: where the speaker speaks; to whom the speaker speaks (dialog, reading, public talk); other sounds (music, vehicles, animals...).

Voice interfaces – potential or threat?

Technologies: voice activity detection, keyword spotting, language identification, speech transcription, gender recognition, dialog analysis, speaker identification, emotion recognition, diarization, sentiment analysis.

Use cases: Call centers – quality control / business intelligence. News agencies – search for information. Intelligence agencies – search for information in a haystack, forensic expertise. Banks – fraud detection / voice as a password.

Speech platform. We are delivering tools for developers. One interface for all technologies – REST + HTTP/RTP streams. Simple installation, integration and scalability. Partnering with developers and integrators all around the world.

Hot tasks

Neural networks. Current systems are based on deep neural networks. Recurrent neural networks are more powerful, but their gradients vanish too quickly, so the field has moved to Long Short-Term Memory (LSTM) or bidirectional LSTM (BLSTM) networks.
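The vanishing-gradient problem mentioned above can be illustrated with a minimal sketch (not from the slides): in a linear RNN h_t = w * h_{t-1}, the gradient of h_T with respect to h_0 is w**T, which shrinks exponentially when |w| < 1.

```python
# Toy illustration of vanishing gradients in a linear RNN h_t = w * h_{t-1}.
# The gradient of the final state w.r.t. the initial state is one factor
# of the recurrent weight per time step, i.e. w**T.
def gradient_through_time(w: float, steps: int) -> float:
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one factor of the recurrent weight per time step
    return grad

if __name__ == "__main__":
    for steps in (1, 10, 50, 100):
        print(steps, gradient_through_time(0.9, steps))
```

With w = 0.9 the gradient after 100 steps is already below 1e-4, which is why plain RNNs struggle to learn long-range dependencies and gated architectures such as LSTMs are used instead.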

Vector representation of words – from words to meaning. The word2vec algorithm was proposed by Tomáš Mikolov. A word in its context is mapped to a short N-dimensional vector. Similar words are mapped to nearby points in the N-dimensional space.
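A much-simplified, count-based sketch of the idea behind word2vec: words that appear in similar contexts get similar vectors. Real word2vec learns dense vectors with a neural network; here we only build sparse co-occurrence vectors over an invented toy corpus and compare them with cosine similarity.

```python
# Count-based sketch: words sharing contexts get similar vectors.
from collections import Counter
from math import sqrt

corpus = [
    "the cat drinks milk", "the dog drinks water",
    "the cat chases mice", "the dog chases cats",
    "the bank holds money", "the bank holds accounts",
]

def context_vector(word, sentences, window=2):
    """Count the words appearing within `window` positions of `word`."""
    vec = Counter()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == word:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        vec[toks[j]] += 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

cat, dog, bank = (context_vector(w, corpus) for w in ("cat", "dog", "bank"))
# "cat" and "dog" share contexts (drinks, chases); "bank" does not.
print(cosine(cat, dog), cosine(cat, bank))
```

"cat" ends up closer to "dog" than to "bank", which is the property word2vec exploits at scale with learned dense embeddings.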

Punctuation and capitalization. Input: "ben is asked to wait for amy but he does not wait he continues to run so amy's request is changed now ben is asked to help amy ben stops and amy is helped". Output: "Ben is asked to wait for Amy, but he does not wait. He continues to run. So Amy's request is changed. Now Ben is asked to help Amy. Ben stops and Amy is helped." A neural network is trained to predict upper/lower case and punctuation. The input is a vector representation of words. A similar technique can be trained to detect POS tags, named entities and other key information.
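A hypothetical sketch (not Phonexia's code) of how punctuated text yields training targets for such a model: each token becomes (lowercased word, capitalize?, punctuation-after), which is the kind of per-token supervision a neural tagger over word vectors would be trained on.

```python
# Turn punctuated, cased text into per-token casing/punctuation targets.
import re

def make_targets(text):
    targets = []
    for raw in text.split():
        m = re.match(r"^([A-Za-z']+)([.,?!]?)$", raw)
        if not m:
            continue  # skip tokens outside the toy pattern
        word, punct = m.groups()
        targets.append((word.lower(), word[0].isupper(), punct))
    return targets

print(make_targets("Now Ben is asked to help Amy."))
```

Training then pairs the lowercased ASR-style input with these labels, so the network learns to restore casing and punctuation from context.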

Sentiment analysis. Example scores: -0.981 "very poor service totally unprofessional"; -0.971 "cost too much money to do"; -0.962 "i wish it didn't cost so much"; -0.940 "and we're really really disappointed"; 0.946 "my agent was very very helpful and informative"; 0.995 "excellent service good job thank you"; 0.996 "everyone was extremely happy and very nice thank you"; 0.996 "had great courtesy fast service with a smile". Again, a neural network can be trained for this task. The input is a vector representation of words.
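A toy illustration of the score representation only (the slide's real system is a neural network over word2vec-style inputs). Each word gets a small hand-made vector; an utterance is the mean of its word vectors, and tanh of a linear score maps it into (-1, 1) like the numbers above. All vectors and weights here are invented.

```python
# Mean word vector -> linear layer -> tanh, giving a score in (-1, 1).
from math import tanh

EMB = {  # invented 2-d "embeddings" for illustration
    "excellent": (1.0, 0.2), "great": (0.9, 0.1), "helpful": (0.8, 0.0),
    "poor": (-0.9, 0.1), "disappointed": (-1.0, 0.2), "service": (0.0, 0.3),
}
W = (2.5, 0.0)  # stand-in for a trained output layer

def sentiment(utterance):
    vecs = [EMB[w] for w in utterance.split() if w in EMB]
    if not vecs:
        return 0.0  # no known words: neutral
    mean = [sum(c) / len(vecs) for c in zip(*vecs)]
    return tanh(sum(m * w for m, w in zip(mean, W)))

print(sentiment("excellent service"), sentiment("poor service"))
```

The tanh output naturally lives in (-1, 1), matching the score scale shown on the slide.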

Training data preparation and cleaning. Each speech recognizer needs word-level transcribed speech recordings for training: 100 hours and more, where 1 hour of audio takes roughly 15 hours of human time. Transcription can be done by crowd-sourcing, but good quality control is necessary. The goal is to develop speech recognizers in just days. A confidence score based on comparing a sentence model against an arbitrary-word-sequence model can be used.
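A hypothetical sketch of that confidence idea: compare the likelihood of the audio under the claimed transcript (forced alignment) with its likelihood under an unconstrained word-loop model, and flag transcripts whose normalized ratio falls below a threshold. The log-likelihoods and threshold below are invented; a real system gets them from the recognizer and tunes the operating point on held-out data.

```python
# Length-normalized log-likelihood ratio as a transcript confidence score.
def transcript_confidence(forced_loglik, free_loglik, n_words):
    # Normalizing by length keeps utterances of different sizes comparable.
    return (forced_loglik - free_loglik) / n_words

# A good transcript scores close to the free decode; a bad one much worse.
good = transcript_confidence(-210.0, -205.0, 10)   # -0.5 per word
bad = transcript_confidence(-320.0, -205.0, 10)    # -11.5 per word
THRESHOLD = -3.0  # assumed operating point
print(good > THRESHOLD, bad > THRESHOLD)
```

Crowd-sourced transcripts scoring below the threshold are sent back for review, which is the quality-control loop the slide calls for.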

Dynamic decoder for lower memory consumption and embedding. A decoder is a key part of each speech recognizer. A static decoder uses a precompiled recognition network, but it is very memory consuming: M = H ∘ C ∘ L ∘ G, the composition of the HMM (H), context (C), lexicon (L) and language model (G) transducers. A dynamic decoder does the composition on the fly, so it needs only a little memory. The whole recognizer can then fit on embedded devices.
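A hand-rolled sketch of the on-the-fly (lazy) composition idea behind a dynamic decoder: instead of precompiling the full product network, product states are created only when the search actually reaches them. The two toy automata below are invented for illustration.

```python
# Lazy composition of two toy automata: state -> {label: next_state}.
A = {0: {"a": 1, "b": 2}, 1: {"c": 3}, 2: {"c": 3}, 3: {}}
B = {0: {"a": 1}, 1: {"c": 2}, 2: {}}

visited = set()

def step(sa, sb, label):
    """Expand one arc of the composed machine on demand."""
    if label in A[sa] and label in B[sb]:
        nxt = (A[sa][label], B[sb][label])
        visited.add(nxt)
        return nxt
    return None

# Follow the single path "a c"; only states on that path materialize.
state = (0, 0)
visited.add(state)
for label in ("a", "c"):
    state = step(*state, label)

print(len(visited), "of", len(A) * len(B), "product states created")
```

Only 3 of the 12 possible product states are ever built; in a real decoder the same effect over H ∘ C ∘ L ∘ G is what lets the recognizer fit into embedded-device memory.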

One-state Hidden Markov Models. The classical concept uses several states per phoneme in the HMM. Neural networks and long temporal context modelling make it possible to use just one state per phoneme. One-state phoneme models increase recognition speed about 3x.

Knowing more about the speech source. The recognition result is largely determined by audio quality. Speech can be degraded by noise, reverberation, or lossy compression, and the audio can pass through many codecs. Detection of audio codecs and bit rates is key to predicting the quality of the output data. It is important for any forensic expertise, and for preventing spoofing attacks on voice biometry.

Social network / link analysis. There is a lot of metadata in speech recordings. Building relations among metadata within a recording, or across recordings, makes particular pieces of information much easier to find. Current search capabilities can be improved by a few orders of magnitude; the same happened with text when Google came. There are well-known algorithms from text processing that can be reused.
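A hypothetical sketch of link analysis over recording metadata: each recording contributes edges between the entities found in it (speakers, phone numbers, topics), and connected entities can then be retrieved together. The metadata records below are invented for illustration.

```python
# Build an entity graph from recording metadata and walk it.
from collections import defaultdict
from itertools import combinations

recordings = [
    {"speaker_A", "speaker_B", "+420555001"},
    {"speaker_B", "speaker_C", "topic:fraud"},
    {"speaker_D", "+420555099"},
]

links = defaultdict(set)
for entities in recordings:
    for a, b in combinations(sorted(entities), 2):
        links[a].add(b)
        links[b].add(a)

def connected_to(entity, depth=2):
    """Entities reachable from `entity` within `depth` hops."""
    frontier, seen = {entity}, {entity}
    for _ in range(depth):
        frontier = {n for e in frontier for n in links[e]} - seen
        seen |= frontier
    return seen - {entity}

# speaker_A reaches speaker_C through speaker_B, but never speaker_D.
print(sorted(connected_to("speaker_A")))
```

An analyst querying for speaker_A would surface speaker_C and the shared topic via the intermediate recording, which is exactly the "find the needle via its links" capability the slide describes.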

Some technologies

Voice activity detection, ordered by increasing accuracy and decreasing speed: Energy-based VAD – fast removal of low-energy parts. Technical signal removal and noise filtering – removal of tones, flat-spectrum signals and stationary signals, filtering of pulse noise. VAD based on f0 tracking – removal of other non-speech signals. Neural network VAD – very accurate VAD based on phoneme recognition.
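A minimal sketch of the simplest detector above, energy-based VAD, hand-rolled for illustration: frames whose short-time energy falls below a threshold relative to the loudest frame are marked non-speech. The synthetic signal and threshold ratio are invented.

```python
# Frame-wise energy thresholding: the simplest voice activity detector.
def energy_vad(samples, frame_len=160, threshold_ratio=0.1):
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(s * s for s in f) / frame_len for f in frames]
    threshold = threshold_ratio * max(energies)
    return [e >= threshold for e in energies]  # True = speech-like frame

# Synthetic signal: silence, then a loud segment, then silence again.
signal = [0.01] * 160 + [0.5, -0.5] * 80 + [0.01] * 160
print(energy_vad(signal))
```

Only the loud middle frame is kept; the more accurate detectors on the slide (f0-based, neural) refine exactly this keep/drop decision.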

Language identification. Pipeline: feature extraction → collection of UBM statistics → projection to i-vectors → language classifier (multi-class logistic regression, MLR) → score calibration / transformation to language scores. Model parameters: UBM, projection parameters, language parameters, calibration parameters. Parts are prepared by Phonexia; the classifier is fully trainable by the client. Language prints (i-vectors) can be easily transferred over low-capacity links.
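A hypothetical sketch of the final classification step: a multi-class logistic regression turning an i-vector into per-language scores via softmax. The tiny i-vector and weights below are invented; real i-vectors have a few hundred dimensions and the weights are trained on labelled data.

```python
# Softmax over per-language linear scores of an i-vector.
from math import exp

IVECTOR = [0.4, -1.2, 0.7]
WEIGHTS = {  # one weight vector (plus bias) per language, invented
    "english": ([1.0, -0.5, 0.2], 0.1),
    "czech": ([-0.3, 0.8, 1.1], 0.0),
    "german": ([0.2, 0.1, -0.4], -0.2),
}

def language_scores(ivec):
    logits = {lang: sum(w * x for w, x in zip(ws, ivec)) + b
              for lang, (ws, b) in WEIGHTS.items()}
    z = sum(exp(v) for v in logits.values())
    return {lang: exp(v) / z for lang, v in logits.items()}

scores = language_scores(IVECTOR)
print(max(scores, key=scores.get), scores)
```

Because only the small weight table and the i-vector are needed at test time, language prints really can travel over low-capacity links, as the slide notes.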

Speaker identification – voiceprint extraction. Pipeline: feature extraction → collection of UBM statistics → projection to i-vectors → extraction of speaker information (LDA) → user normalization → voiceprint. Model parameters: UBM, projection parameters, normalization parameters, prepared by Phonexia. The i-vector describes the total variability inside a speech recording. LDA removes non-speaker variability. User normalization (mean subtraction) helps the user normalize to unseen channels.

Speaker identification – voiceprint comparison. Pipeline: voiceprint 1 + voiceprint 2 → voiceprint comparator (WCCN + PLDA) → length-dependent calibration (piecewise linear) → score transform (logistic function). Model parameters and calibration parameters are trainable by the user. The voiceprint comparator returns a log likelihood. Calibration ensures probabilistic interpretation of the score under different speech lengths. The score transform lets the user select a log-likelihood ratio or a percentage score.
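A simplified sketch of voiceprint comparison. A real system uses WCCN+PLDA to produce a log-likelihood ratio; here a plain cosine score stands in, followed by the logistic transform that maps a calibrated score to the percentage output mentioned above. All vectors and calibration constants are invented.

```python
# Cosine stand-in for PLDA scoring, plus a logistic percentage transform.
from math import exp, sqrt

def cosine_score(vp1, vp2):
    dot = sum(a * b for a, b in zip(vp1, vp2))
    return dot / (sqrt(sum(a * a for a in vp1)) * sqrt(sum(b * b for b in vp2)))

def percentage(score, scale=4.0, shift=0.5):
    # Logistic transform: calibrated score -> "probability-like" percent.
    return 100.0 / (1.0 + exp(-scale * (score - shift)))

same = cosine_score([0.9, 0.1, 0.3], [0.8, 0.2, 0.25])
diff = cosine_score([0.9, 0.1, 0.3], [-0.2, 0.9, -0.1])
print(round(percentage(same)), round(percentage(diff)))
```

The two output modes on the slide correspond to returning the raw (log-likelihood-style) score directly or passing it through this kind of logistic transform for an end-user-friendly percentage.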

Thanks for your attention. Petr Schwarz, CTO. T +420 733 532 891, E petr.schwarz@phonexia.com, phonexia.com