1/41 SPEECH DATA MINING, SPEECH ANALYTICS, VOICE BIOMETRY (www.phonexia.com)

2/41 OVERVIEW
How to move speech technology from research labs to the market?
What are the current challenges in speech recognition research?
Phonexia introduction
Technology deployment use cases
Technologies and what is behind them
Speech core and application interfaces
Grand challenges

3/41 WHAT IS IN SPEECH?
Speaker: gender, age; speaker identity; emotion, speaker origin; education, relation; when the speaker speaks
Content: language, dialect; keywords, phrases; speech transcription; topic; data mining
Environment: where the speaker speaks; to whom the speaker speaks (dialog, reading, public talk); other sounds (music, vehicles, animals, …); speech quality
Equipment: device (phone/mike/...); transmit channels (landline/cell phone/Skype); codecs (gsm/mp3/…)

4/41 PHONEXIA
Goal: help clients automatically extract the maximum of valuable information from spoken speech.
Founded in 2006 as a spin-off of Brno University of Technology
Seat and main office in Brno, Czech Republic; active worldwide
Customers in more than 20 countries (governmental agencies, call centers, banks, telco operators, broadcast service companies, …)
Profitable, no external funding

5/41 FROM RESEARCH TO MARKET
Research: scientific papers, reports, experimental code (Matlab, Python, C++), lots of glue (shell scripts), data files; the goal is accuracy; stability, speed, reproducibility and documentation are less important; openness
Technologies: the goal is stability (error handling, code verification, testing cycles at various levels) and speed; regular development cycles and planning; well defined application interfaces (API); documentation, licensing
Products: development of new applications; integration with the client's technologies and systems; the goal is functionality of the integrated solution; user interfaces

6/41 USE CASES Call centers Banks Intelligence agencies

7/41 CALL CENTERS
Two main application areas:
1) Quality control
2) Data mining from voice traffic

8/41 CALL CENTERS – QUALITY CONTROL
Supervisor is responsible for: team leading, rating of calls, evaluation of operators, analysis of results, reporting
→ Only 3% of calls are analyzed by listening
→ 100% of calls are analyzed using speech technologies, new statistics
→ Lower staff costs, lower operating costs
→ Higher satisfaction of customers

9/41 CALL CENTERS – QUALITY CONTROL II
Technologies:
VAD + discourse analysis – to get important statistics about call progress (start time, speaker turns, speech speed, reaction times, …)
Diarization – separation of a summed conversation into two channels
Keyword/phrase detector – detection of obligatory phrases, offensive words, call script compliance, …
Speech transcription + search – to search for important places in calls
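As a small illustration of the kind of statistics meant here, the Python sketch below derives speaker turns, reaction times and talk time from a toy list of diarized segments; the segment format and the statistic definitions are assumptions for illustration, not the product's actual output.

# Toy diarization output: (speaker, start_seconds, end_seconds), sorted by time.
segments = [
    ("operator", 0.0, 4.2),
    ("customer", 5.1, 11.3),
    ("operator", 12.0, 15.5),
    ("customer", 15.9, 20.0),
]

def call_statistics(segments):
    """Derive simple call-progress statistics from labelled speech segments."""
    turns = sum(1 for a, b in zip(segments, segments[1:]) if a[0] != b[0])
    reactions = [b[1] - a[2] for a, b in zip(segments, segments[1:]) if a[0] != b[0]]
    talk_time = {}
    for spk, start, end in segments:
        talk_time[spk] = talk_time.get(spk, 0.0) + (end - start)
    return {
        "speaker_turns": turns,
        "avg_reaction_time_s": sum(reactions) / len(reactions) if reactions else 0.0,
        "talk_time_s": talk_time,
    }

print(call_statistics(segments))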

10/41 DATA MINING FROM INCOMING CALLS
Use cases:
Prevention of call center overloading (for example, during a large power outage)
Added value information for business (big data)
Technology:
Speech transcription
Data mining tool
Search engine

11/41 BANKS
Use cases:
Banks have call centers
‒ Quality control
‒ Data mining from incoming traffic
Authentication of people using voice biometry
‒ Using a key phrase (text-dependent speaker identification)
‒ Authentication in the background (text-independent speaker identification)
Identification of frauds
‒ People with fake identities call repeatedly to request loans

12/41 INTELLIGENCE AGENCIES
Huge amount of information that cannot be processed manually
- public news, telecommunication networks, air communication, internet, ...
Searching for a needle in a haystack
Combination of all technologies
- language identification, gender identification, speaker identification, diarization, keyword spotting, speech transcription
- data mining tools
- correlation with other metadata
Operational and forensic speaker identification

13/41 TECHNOLOGIES
Voice activity detection
Language identification
Gender recognition
Speaker identification
Diarization
Keyword spotting
Speech transcription
Dialog analysis
Emotion recognition

14/41 VOICE ACTIVITY DETECTION
Energy based VAD – fast removal of low energy parts
Technical signal removal and noise filtering – removal of tones, removal of flat spectrum signals, removal of stationary signals, filtering of pulse noise
VAD based on f0 tracking – removal of other non-speech signals
Neural network VAD – very accurate VAD based on phoneme recognition
Cascade (higher accuracy, lower speed): energy based VAD → technical signal removal → VAD based on f0 tracking → neural network VAD
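As an illustration of the first stage only, here is a minimal energy-based VAD sketch in Python, assuming 8 kHz mono samples in a NumPy array; the frame size, hop and threshold margin are my own illustrative choices, not Phonexia's parameters.

import numpy as np

def energy_vad(samples, sample_rate=8000, frame_ms=25, hop_ms=10, margin_db=15.0):
    """Label each frame True (speech) or False (silence) by log energy.

    A fast first pass: frames well above the estimated noise floor are kept;
    later stages (f0 tracking, neural network VAD) would refine this decision."""
    frame = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = max(0, 1 + (len(samples) - frame) // hop)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        chunk = samples[i * hop:i * hop + frame].astype(np.float64)
        energies[i] = 10.0 * np.log10(np.sum(chunk ** 2) + 1e-10)
    noise_floor = np.percentile(energies, 5)      # the quietest frames approximate noise
    return energies > noise_floor + margin_db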

15/41 VAD CHALLENGES
Important area of research, not fully solved; VAD is a key part of other technologies and directly affects their accuracy
Music/singing detector
Detectors of non-speech speaker sounds (cough, laugh)
Detectors of other environment sounds (transport vehicles, animals, electric tools, door slam)
Technical signal detectors
VAD for highly noisy speech (SNR lower than 0 dB)
VADs for distorted channels
Non-parametric VADs
Distant microphone VADs

16/41 LANGUAGE IDENTIFICATION
Automatic recognition of the spoken language
60 languages + users can add new ones themselves
Can also be used as dialect recognition
iVector based technology, discriminative training, < 1 kB language prints
Acoustic channel independent
Usage:
Crime is very often caused by small groups speaking specific languages
Call record forwarding (to an operator / other technologies / archive, ...)
Analysis of audio archives
Insertion of advertisements into media
Language verification in broadcast signal distribution

17/41 LID SYSTEM ARCHITECTURE (IVECTOR BASED SYSTEM)
Pipeline: feature extraction → collection of UBM statistics → projection to iVectors → language classifier (MLR) → score calibration / transform → language scores
Models: the UBM and projection parameters are prepared by Phonexia; the language parameters and calibration parameters are fully trainable by the client
Language prints (iVectors) can be easily transferred over low capacity links
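To make the split between the fixed front end and the client-trainable back end concrete, here is a Python sketch using scikit-learn's logistic regression over already-extracted iVectors; the 400-dimensional prints, the random training data and the language labels are placeholders, not real language prints or Phonexia's API.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one 400-dimensional iVector (language print) per recording.
train_ivectors = np.random.randn(300, 400)
train_labels = np.random.choice(["cs", "en", "ru"], size=300)

# The client-trainable part: a multiclass logistic regression over language prints.
mlr = LogisticRegression(max_iter=1000)
mlr.fit(train_ivectors, train_labels)

# Scoring a new language print yields per-language scores that can then be calibrated.
test_print = np.random.randn(1, 400)
posterior = mlr.predict_proba(test_print)
print(dict(zip(mlr.classes_, posterior[0].round(3))))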

18/41 SPEAKER RECOGNITION
Several scenarios: speaker verification, speaker search, speaker spotting, link/pattern analysis
Text independent or text dependent mode
iVector based technology, < 1 kB voiceprints
Voiceprint extraction and scoring
Millions of comparisons in a fraction of a second
Diarization (speaker segmentation)
User-based system training, user-based calibration

19/41 SYSTEM ARCHITECTURE - VOICEPRINT EXTRACTION
Pipeline: feature extraction → collection of UBM statistics → projection to iVectors → extraction of speaker information (LDA) → user normalization → voiceprint
Models: the UBM and projection parameters are prepared by Phonexia; the normalization parameters are trainable by the user
The iVector describes the total variability inside a speech record
LDA removes non-speaker variability
User normalization helps the user normalize to unseen channels (mean subtraction)
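A minimal Python sketch of the last two blocks, assuming iVectors have already been extracted; the dimensionalities, the random development data and the normalization set are illustrative assumptions, not the real models.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical development data: iVectors labelled by speaker, used once to
# learn the speaker-discriminative projection (removes non-speaker variability).
dev_ivectors = np.random.randn(500, 400)
dev_speakers = np.random.randint(0, 50, size=500)

lda = LinearDiscriminantAnalysis(n_components=49)
lda.fit(dev_ivectors, dev_speakers)

# User normalization: mean subtraction estimated on the user's own
# (unseen-channel) recordings, applied before any comparison.
user_ivectors = np.random.randn(30, 400)
channel_mean = lda.transform(user_ivectors).mean(axis=0)

def extract_voiceprint(ivector):
    """Project an iVector to the speaker subspace and normalize it."""
    return lda.transform(ivector.reshape(1, -1))[0] - channel_mean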

20/41 SYSTEM ARCHITECTURE - VOICEPRINT COMPARISON
Pipeline: voiceprint 1 + voiceprint 2 → voiceprint comparator (WCCN + PLDA) → length-dependent calibration (piecewise LR) → score transform (logistic function) → score
Model parameters and calibration parameters are trainable by the user
The voiceprint comparator returns a log likelihood
Calibration ensures a probabilistic interpretation of the score under different speech lengths
The score transform enables selecting a log likelihood ratio or a percentage score
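A minimal sketch of comparison and calibration, with cosine scoring standing in for the WCCN+PLDA comparator and a single logistic calibration standing in for the length-dependent piecewise model; both simplifications and the calibration constants are my own, not Phonexia's method.

import numpy as np

def raw_score(vp1, vp2):
    """Stand-in comparator: cosine similarity between two voiceprints."""
    return float(vp1 @ vp2 / (np.linalg.norm(vp1) * np.linalg.norm(vp2) + 1e-10))

# Hypothetical calibration parameters, normally fitted on scored
# same-speaker / different-speaker trial pairs of known duration.
CAL_SCALE, CAL_OFFSET = 12.0, -3.0

def calibrated_score(vp1, vp2, as_percentage=True):
    """Map the raw score to a log likelihood ratio or a percentage score."""
    llr = CAL_SCALE * raw_score(vp1, vp2) + CAL_OFFSET
    if not as_percentage:
        return llr
    return 100.0 / (1.0 + np.exp(-llr))    # logistic score transform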

21/41 LID AND SID CHALLENGES
LID/SID on very short records (< 3 s) while keeping training at the user side
How to ensure accuracy over a large number of acoustic channels and languages (SID)
Graphical tools for system training/calibration and evaluation at the user side
LID/SID on Voice over IP networks
LID/SID from distant microphones

22/41 SPEAKER DIARIZATION
Fully Bayesian approach with eigenvoice priors (Valente, Kenny)
The initial number of speakers is higher than the expected number of speakers
A duration model is used to prevent fast jumps among speakers
The target number of speakers can be chosen based on a minimal speaker posterior probability, or pre-set by the user
Very accurate, but slower and more memory consuming
Iteration loop: random alignment of frames to speakers → collection of GMM statistics for each speaker → estimation of speaker factors for each speaker → conversion of speaker factors to speaker GMMs, new alignment → generation of speaker labels (spk1, spk2, spk1, …)
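The loop below is only a structural sketch of that iteration, with per-speaker mean vectors and hard nearest-model assignment standing in for the eigenvoice priors, speaker factors and Bayesian posteriors; it illustrates the shape of the algorithm (over-initialize, re-align, prune), not its actual statistical model.

import numpy as np

def diarize(frames, initial_speakers=8, min_occupancy=0.02, iterations=10):
    """Iteratively re-align frames to speaker models and prune rare speakers.

    frames: (n_frames, dim) array of feature vectors (speech only, VAD applied).
    Starts with more speakers than expected; speakers whose share of frames
    falls below min_occupancy are dropped, which settles the speaker count."""
    rng = np.random.default_rng(0)
    labels = rng.integers(0, initial_speakers, size=len(frames))
    for _ in range(iterations):
        # Keep speakers with enough occupancy (stand-in for the posterior test).
        active = [s for s in np.unique(labels) if np.mean(labels == s) >= min_occupancy]
        # "Collect statistics" per speaker: here just a mean feature vector.
        means = np.stack([frames[labels == s].mean(axis=0) for s in active])
        # Re-align every frame to its closest speaker model (hard assignment).
        dists = np.linalg.norm(frames[:, None, :] - means[None, :, :], axis=2)
        labels = np.array(active)[dists.argmin(axis=1)]
    return labels          # one speaker label per frame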

23/41 DIARIZATION CHALLENGES
Diarization is a technology that still needs a lot of research
Very sensitive to initialization
Very sensitive to non-speech sounds (laugh, cough, environment sounds); integration of a good VAD is necessary
Very sensitive to changes in transmit channels and language
Even with DER close to 1%, there are recordings where current algorithms fail completely (often two women speaking with high pitch)
Uses an iterative approach; one iteration is equivalent to one run of SID, yet users expect a much faster run than SID
Diarization from distant microphones
Beam-forming and diarization from microphone arrays
How to accurately estimate the number of speakers

24/41 KEYWORD SPOTTING
Two approaches:
KWS based on LVCSR
‒ Very accurate
‒ Slower
‒ Expensive for development
‒ Speech transcription based on state-of-the-art acoustic and language models
‒ Posteriors from the confusion network used as confidences
‒ About 100 h of training data
Acoustic KWS
‒ Fast
‒ Less accurate
‒ Cheap development
‒ Simple neural network based acoustic model
‒ Simple language model (phone loop as background)
‒ About 20 h of training data
‒ Phoneme-based calibration

25/41 SPEECH TRANSCRIPTION
PLP + bottle-neck features, HLDA
Fast VTLN estimated using a set of GMMs
GMM or NN based system
Discriminative training
Speaker adaptation
3-gram language model
Strings / lattices / confusion networks

26/41 SPEECH TRANSCRIPTION CHALLENGES
Rather engineering than research challenges:
Accuracy
Speed
Lower memory consumption
How to train a new system fully automatically
How to run hundreds of recognizers in parallel
How to do channel normalization and speaker adaptation for any length of speech utterance

27/41 CONNECTION TO TEXT BASED DATA MINING TOOLS
It is much easier to sell speech transcription with a higher-level data mining tool
‒ There is too much text to read
‒ The text has too many errors (users will never be happy unless the text is 100% correct)
This can be overcome by integration with existing text-based data-mining tools:
‒ Categorization of recordings
‒ Indexing and search using complex queries
‒ Exploration of new topics
‒ Content analysis
‒ Reaction to trends
Integration is done on confusion networks (alternative hypotheses, increased probability of finding specific information)
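To show why indexing confusion networks instead of 1-best text increases the chance of finding specific information, here is a toy Python sketch; the data structure and the posteriors are invented for illustration and do not reflect any particular tool's index format.

# Toy confusion network: one list of (word, posterior) alternatives per slot.
confusion_network = [
    [("please", 0.9), ("police", 0.1)],
    [("cancel", 0.6), ("council", 0.4)],
    [("my", 0.95), ("a", 0.05)],
    [("contact", 0.7), ("contract", 0.3)],   # 1-best text would output "contact"
]

def find_term(cn, term):
    """Return (slot index, posterior) wherever the term appears as any
    hypothesis, so a word missing from the 1-best text can still be found."""
    return [(i, p) for i, slot in enumerate(cn) for w, p in slot if w == term]

print(find_term(confusion_network, "contract"))   # [(3, 0.3)] despite losing the 1-best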

28/41 TOVEK TOOLS – CONTENT ANALYSIS

29/41 HOW TO MOVE OUR SOFTWARE TO USERS?
Decision to write a new speech core in 2007 – Brno Speech Core
Focus on stability, speed and proper error handling
Object oriented design, proper interfaces, no dependency among modules except through regular interfaces
One C++ compiler, binary compatibility of libraries with others
One code base for all technologies

30/41 BRNO SPEECH CORE
More than 250 objects covering a large range of speech algorithms (feature extraction, acoustic models, decoders, transforms, grammar compilers, …)
More than a million source code lines
Code versioning, automatic builds, test suites, licensing
Still easily maintainable
‒ extension of functionality inside objects
‒ splitting of functionality into more objects
‒ replacement of objects (fixed interfaces)

31/41 FAST PROTOTYPING
Research-to-product transfer time is essential for commercial success
Research done using standard toolkits – STK, TNET, KALDI, Python scripts, …
For production systems:
A new system can be implemented in a few days
Often not a single line of C/C++ code is written
Only objects for new algorithms are implemented
Objects are connected through one configuration file to form data streams

32/41 BSCORE CONFIG

[source:SFileWaveformSourceI]
…
[fconvertor:SWaveformFormatConvertorI]
input_format_str=lin16
output_format_str=float
nchannels=1
…
[melbanks:SMelBanksI]
sample_freq=8000
vector_size=200
preem_coef=0.97
nbanks=15
…
[posteriors:SNNetPosteriorEstimatorI]
…
[decoder:SPhnDecoderI]
…
[output:STranscriptionNodeI]
...
[links]
source->fconvertor
fconvertor->melbanks
melbanks->posteriors
posteriors->decoder
decoder->output

33/41 APPLICATION INTERFACES
Customers are used to working with specific programming tools and do not want to change their habits
GUI / command line / SDKs
C/C++ API – binary compatibility with many compilers
Java API – a middle layer created using the Java Native Interface (JNI)
C# API – automatically generated using SWIG
MRCPv2 network interface – for integration into telephone infrastructure (IVRs), through the UniMRCP open source project
REST – server platform, simple network interface for each technology
Supported OS – Windows/Linux, 32/64 bit, Android

34/41 USUAL DESIGN OF SYSTEMS WITH WEB BASED GUI
Speech server (REST), application server, web server, database
Speech technology inside the application server (Tomcat, JBoss) through the Java API
Components connected over the network: web server, application server, speech server and data storage

35/41 CLIENT FOR REST SERVER
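Since this slide is only a screenshot in the original deck, here is a minimal illustrative sketch of what a client for such a REST server could look like in Python; the server address, endpoint paths, parameters and JSON fields are invented for illustration and are not Phonexia's actual REST API.

import requests

SERVER = "http://localhost:8080"          # hypothetical speech server address

def identify_language(wav_path):
    """Upload a recording and query the (hypothetical) LID endpoint."""
    with open(wav_path, "rb") as f:
        upload = requests.post(f"{SERVER}/recordings", files={"file": f})
    upload.raise_for_status()
    rec_id = upload.json()["id"]

    result = requests.get(f"{SERVER}/technologies/lid", params={"recording": rec_id})
    result.raise_for_status()
    return result.json()                   # e.g. per-language scores

if __name__ == "__main__":
    print(identify_language("call_0001.wav"))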

36/41 GRAND CHALLENGES
Training data collection
Guarantee of accuracy
Reduction of hardware cost

37/41 TRAINING DATA COLLECTION
We would like to offer cheap speech technology to anyone on this planet
Hundreds of languages and thousands of dialects
The data is costly – about … EUR for an existing language, more than … EUR for collection and annotation of new corpora
The existing data often do not match the target dialect and acoustic channels
→ We can add only a few languages to our offer per year
Can we find a smarter and cheaper way to collect data?

38/41 DATA COLLECTION PROJECT
1. Collection of data from public sources (broadcast)
2. Automatic detection of phone calls within broadcast (high variability in speakers, dialects and speaking style)
3. Language identification to verify the language, speaker identification to ensure speaker variability
4. Annotation through a crowd-sourcing platform
5. Fully automatic process including training of ASR, unsupervised adaptation
Experience from data collection for language identification (LDC and NIST adopted this process, now mainstream in language ID)
SpokenData.com ready for online annotation (ReplayWell company)
The cost of collecting 100 h of data, annotation and system training could be reduced below … EUR per language
Interested? Please write to …

39/41 GUARANTEE OF ACCURACY
More and more customers buy speech solutions, but each new installation brings new risks
Speaker identification is not fully language independent
Language identification is not dialect independent
Speech transcription is not domain independent
All technologies are not channel independent
How to estimate in advance whether the installation is going to be successful? What will be the target accuracy of each technology?
→ The data collection project can help
→ Can be extended to a World Map of Spoken Languages

40/41 HARDWARE COST
A speech solution is not only software, but also computational hardware, data storage, physical planning of the HW, etc.
Computers and cooling consume electricity
The additional cost can be about 50% of the total project cost
Most research is directed at reaching maximal accuracy
Any improvement in speed can have a large effect on the success of your technology
→ Perfect optimization of software
→ Use of HW acceleration (GPU cards etc.)

41/41 Phonexia s.r.o. Q & A THANKS!