... NOT JUST ANOTHER PUBLIC DOMAIN SOFTWARE PROJECT ...

Presentation transcript:

... NOT JUST ANOTHER PUBLIC DOMAIN SOFTWARE PROJECT ... (UAB – CIS)
• Joseph Picone, Inst. for Signal and Info. Processing, Dept. of Electrical and Computer Eng., Mississippi State University
• Contact Information: Box 9571, Mississippi State, Mississippi 39762; Tel: 662-325-3149; Fax: 662-325-2298; Email: picone@isip.msstate.edu
• Acknowledgement: Supported by several NSF grants (e.g., EIA-9809300).
• URL: www.isip.msstate.edu/publications/seminars/external/2003/uab

TECHNOLOGY: PUBLIC DOMAIN SOFTWARE
• Focus: speech recognition; state of the art; statistical (e.g., HMM); continuous speech; large vocabulary; speaker independent
• Goal: accelerate research; flexibility, extensibility, modularity; efficiency (C++, parallel processing); ease of use (documentation, toolkits, GUIs)
• Benefit: technology; standard benchmarks; conversational speech

MISSION: “GNU FOR DSP”
Origins date to work at Texas Instruments in 1985. The Institute for Signal and Information Processing (ISIP) was created in 1994 at Mississippi State University with a simple vision: to develop public domain software. Key differentiating characteristics of this project are:
• Public Domain: unrestricted software (including commercial use); no copyrights, licenses, or research-only restrictions.
• Increase Participation: competitive technology plus application-specific toolkits reduce start-up costs.
• Lasting Infrastructure: support, training, education, and dissemination of information are priorities.

APPROACH: FLEXIBLE YET EFFICIENT
• Research environments: Matlab, Octave, Python
• ASR toolkits: HTK, SPHINX, CSLU
• ISIP: IFCs, Java apps, toolkits
• Research priorities: rapid prototyping; “fair” evaluations; ease of use; lightweight programming
• Efficiency priorities: memory; hyper-real-time training; parallel processing; data-intensive workloads

APPROACH: PLATFORMS AND COMPILERS
• Supported platforms: Linux (Red Hat 6.1 or greater); Sun x86 Solaris 7 or greater; Windows (Cygwin tools); Sun Sparc recently phased out
• Languages and compilers: remember Lisp? Java? Tk/Tcl? Avoid a reliance on Perl! C++ was the obvious choice as a tradeoff between stability, standardization, and efficiency.

APPROACH: DOCUMENTATION AND WORKSHOPS
• Extensive online software documentation, tutorials, and training materials
• Self-documenting software
• Over 100 students and professionals representing 25 countries and 75 institutions have attended our workshops
• Over a dozen companies have trained in our lab

APPLICATIONS: REAL-TIME INFORMATION EXTRACTION
• Metadata extraction from conversational speech
• Automatic gisting and intelligence gathering
• Speech-to-text is the core technology challenge
• Machines vs. humans
• Real-time audio indexing
• Time-varying channels
• Dynamic language models
• Multilingual and cross-lingual operation

APPLICATIONS: DIALOG SYSTEMS FOR THE CAR
• In-vehicle dialog systems improve information access.
• Advanced user interfaces enhance workforce training and increase manufacturing efficiency.
• Noise robustness is required in both environments to improve recognition performance.
• Both draw on advanced statistical models and machine learning technology.

APPLICATIONS: SPEAKER RECOGNITION
• Voice verification for calling card security
• First widespread deployment of recognition technology in the telephone network
• An extension of the same statistical modeling technology used in speech recognition

APPLICATIONS: SPEAKER STRESS AND FATIGUE
• Recognition of emotion, stress, fatigue, and other voice qualities is possible from enhanced descriptions of the speech signal
• Fundamentally the same statistical modeling problem as other speech applications
• Fatigue analysis from voice is under development under an SBIR

APPLICATIONS: UNIQUE FEATURES OF OUR RESEARCH
• Acoustic modeling: risk minimization; relevance vectors; syllable modeling; network training
• Language modeling: hierarchical decoder; dynamic models; NLP integration
• Basic principles: fundamental algorithm research captured in a consistent software framework

INTRODUCTION: SPEECH RECOGNITION RESEARCH?
Why do we work on speech recognition?
• “Language is the preeminent trait of the human species.”
• “I never met someone who wasn’t interested in language.”
• “I decided to work on language because it seemed to be the hardest problem to solve.”
Why should we work on speech recognition?
• Antiterrorism, homeland security, military applications
• Telecommunications, mobile communications
• Education, learning tools, educational toys, enrichment
• Computing, intelligent systems, machine learning
Commodity or liability? The technology remains fragile and error prone.
The recently completed Aurora evaluation, part of the ETSI standards activity to develop a standard for feature extraction for client/server applications in cellular telephony, would provide a nice framework in which to conduct this research. The evaluation task is based on the WSJ 5,000-word closed-vocabulary task and includes clean speech, telephone-bandwidth speech, and speech degraded by digitally added noise. Performance on clean conditions (matched training and testing) is on the order of 7% WER. Performance on the noise conditions, even after training on noisy data, is on the order of 30% WER. See http://www.isip.msstate.edu/projects/aurora/performance/index.html for more details on the baseline performance. Two noise-adaptive front ends recently developed by a consortium of sites interested in this problem reduced error rates by 50%, but performance was still far from that achieved in clean conditions.

INTRODUCTION: FUNDAMENTAL CHALLENGES

INTRODUCTION: PROBABILISTIC FRAMEWORK
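
The equations on this slide did not survive extraction. What is typically presented under this heading is the Bayes decision rule for recognition; a minimal sketch, with W denoting a word sequence and A the acoustic observations (my notation, not necessarily the slide's):

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid A)
        \;=\; \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
        \;=\; \arg\max_{W} \underbrace{P(A \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}
```

P(A) can be dropped because it does not depend on W; the acoustic model P(A|W) is realized with hidden Markov models and the language model P(W) with statistical N-grams, matching the block diagram on the next slide.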

SPEECH RECOGNITION: BLOCK DIAGRAM OVERVIEW
Core components:
• Transduction
• Feature extraction
• Acoustic modeling (hidden Markov models)
• Language modeling (statistical N-grams)
• Search (Viterbi beam)
• Knowledge sources

SPEECH RECOGNITION: FEATURE EXTRACTION

SPEECH RECOGNITION: ACOUSTIC MODELING

SPEECH RECOGNITION: LANGUAGE MODELING

SPEECH RECOGNITION: VITERBI BEAM SEARCH
• Breadth-first
• Time synchronous
• Beam pruning
• Supervision
• Word prediction
• Natural language
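
To make the search component concrete, here is a minimal sketch of time-synchronous Viterbi decoding over an HMM with beam pruning. This is my own toy illustration, not the ISIP decoder; all names and the interface are assumptions.

```python
import numpy as np

def viterbi_beam(log_trans, log_obs, log_init, beam=10.0):
    """Time-synchronous Viterbi search with beam pruning.

    log_trans[i, j]: log transition probability from state i to state j
    log_obs[t, j]:   log likelihood of frame t under state j
    log_init[j]:     log initial-state probability
    beam:            prune hypotheses more than `beam` below the best score
    """
    T, N = log_obs.shape
    score = log_init + log_obs[0]            # best score ending in each state
    backptr = np.zeros((T, N), dtype=int)

    for t in range(1, T):
        # beam pruning: deactivate states far below the current best score
        active = score >= score.max() - beam
        score = np.where(active, score, -np.inf)

        # breadth-first expansion: all active states advance one frame together
        cand = score[:, None] + log_trans     # cand[i, j] = score of path i -> j
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_obs[t]

    # trace back the best state sequence
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return list(reversed(path)), float(score.max())
```

A real large-vocabulary decoder additionally expands words predicted by the language model and applies the beam at the word and phone levels, but the time-synchronous dynamic-programming core is the same.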

SPEECH RECOGNITION: APPLICATION OF INFORMATION RETRIEVAL
• Traditional output: best word sequence; time alignment of information
• Other outputs: word graphs; N-best sentences; confidence measures; metadata such as speaker identity, accent, and prosody

RISK MINIMIZATION: ML CONVERGENCE NOT OPTIMAL
• Finding the optimal decision boundary requires only one parameter.
• Maximum likelihood convergence does not translate to optimal classification if a priori assumptions about the data are not correct.
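
A toy numerical illustration of this point (my own example, not from the slide): when the assumed class-conditional model is wrong, the maximum-likelihood plug-in boundary can be worse than a boundary chosen to minimize classification error directly, even though both rules have only one free parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes in 1-D: class 0 is heavily skewed (lognormal), class 1 is Gaussian.
x0 = rng.lognormal(mean=0.0, sigma=0.75, size=5000)     # class 0
x1 = rng.normal(loc=4.0, scale=1.0, size=5000)           # class 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros_like(x0), np.ones_like(x1)])

def error_at(threshold):
    """Classification error of the rule: predict class 1 if x > threshold."""
    pred = (x > threshold).astype(int)
    return np.mean(pred != y)

# ML plug-in boundary: fit one Gaussian per class (the wrong model for class 0)
mu0, s0 = x0.mean(), x0.std()
mu1, s1 = x1.mean(), x1.std()
grid = np.linspace(x.min(), x.max(), 2000)
loglik0 = -0.5 * ((grid - mu0) / s0) ** 2 - np.log(s0)
loglik1 = -0.5 * ((grid - mu1) / s1) ** 2 - np.log(s1)
ml_threshold = grid[np.argmax(loglik1 > loglik0)]   # first point where class 1 wins

# Direct minimization of the empirical error over the same single parameter
best_threshold = grid[np.argmin([error_at(t) for t in grid])]

print(f"ML plug-in boundary: {ml_threshold:.2f}, error = {error_at(ml_threshold):.3f}")
print(f"min-error boundary : {best_threshold:.2f}, error = {error_at(best_threshold):.3f}")
```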

RISK MINIMIZATION: GENERALIZATION AND RISK
• How much can we trust isolated data points?
• (Figure panels: optimal decision surface is a line; optimal decision surface is still a line; optimal decision surface changes abruptly.)
• Can we integrate prior knowledge about data, confidence, or willingness to take risk?

RISK MINIMIZATION: STRUCTURAL OPTIMIZATION
(Figure: error vs. model complexity, with training set error decreasing and open-loop error reaching an optimum at intermediate complexity.)
• Structural optimization is often guided by an Occam’s Razor approach
• Trading goodness of fit against model complexity
• Examples: MDL, BIC, AIC, Structural Risk Minimization, Automatic Relevance Determination
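
As one concrete instance of this trade-off (an illustrative sketch, not taken from the slide), BIC penalizes the data log-likelihood by a term that grows with the parameter count, and the selected model sits at the minimum of the penalized criterion. Here is a small example using scikit-learn's GaussianMixture, which exposes a bic() method; the data and component range are made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic 2-D data drawn from three well-separated clusters
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(300, 2)),
    rng.normal(loc=[3, 0], scale=0.5, size=(300, 2)),
    rng.normal(loc=[0, 3], scale=0.5, size=(300, 2)),
])

# Sweep model complexity (number of mixture components) and score each by BIC:
# BIC = -2 * log-likelihood + (#parameters) * log(#samples)
scores = {}
for k in range(1, 9):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X)
    scores[k] = gmm.bic(X)

best_k = min(scores, key=scores.get)
for k, bic in scores.items():
    marker = "  <-- selected" if k == best_k else ""
    print(f"components = {k}: BIC = {bic:.1f}{marker}")
```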

STRUCTURAL RISK MINIMIZATION
• Expected risk: $R(\alpha) = \int \tfrac{1}{2}\,|y - f(x,\alpha)|\,dP(x,y)$; not possible to estimate directly, since $P(x,y)$ is unknown.
• Empirical risk: $R_{emp}(\alpha) = \tfrac{1}{2l}\sum_{i=1}^{l} |y_i - f(x_i,\alpha)|$
• The two are related through the VC dimension $h$ (with probability $1-\eta$):
  $R(\alpha) \le R_{emp}(\alpha) + \sqrt{\dfrac{h\,(\ln(2l/h)+1) - \ln(\eta/4)}{l}}$
• Approach: choose the machine that gives the least upper bound on the actual risk.
(Figure: bound on the expected risk = empirical risk + VC confidence, plotted against VC dimension, with the optimum at the minimum of the bound.)
• The VC dimension is a measure of the complexity of the learning machine.
• A higher VC dimension gives a looser bound on the actual risk, thus penalizing a more complex model (Vapnik).
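
A small numerical sketch of the selection rule (my own illustration; the empirical risks and VC dimensions for the nested family of machines are invented for the example, only the bound formula comes from the slide):

```python
import numpy as np

def vc_confidence(h, l, eta=0.05):
    """VC confidence term of the bound R <= R_emp + sqrt((h(ln(2l/h)+1) - ln(eta/4)) / l)."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

l = 1000  # number of training samples
# Nested family of machines: increasing VC dimension, decreasing empirical risk
machines = [
    {"name": "simple",  "h": 10,  "R_emp": 0.20},
    {"name": "medium",  "h": 50,  "R_emp": 0.10},
    {"name": "complex", "h": 400, "R_emp": 0.05},
]

for m in machines:
    m["bound"] = m["R_emp"] + vc_confidence(m["h"], l)

best = min(machines, key=lambda m: m["bound"])
for m in machines:
    flag = "  <-- least upper bound" if m is best else ""
    print(f"{m['name']:8s} h={m['h']:4d}  R_emp={m['R_emp']:.2f}  bound={m['bound']:.3f}{flag}")
```

The machine with the smallest empirical risk is not necessarily selected; the VC confidence term penalizes the more complex machines, which is exactly the behavior sketched in the figure.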

RISK MINIMIZATION: SUPPORT VECTOR MACHINES
Optimization, separable data:
• Hyperplane: $w \cdot x + b = 0$
• Constraints: $y_i\,(w \cdot x_i + b) \ge 1$ for all training points $(x_i, y_i)$
• Quadratic optimization of a Lagrange functional minimizes the risk criterion (maximizes the margin). Only a small portion of the training points become support vectors.
• Final classifier: $f(x) = \operatorname{sgn}\!\left(\sum_i \alpha_i\, y_i\,(x_i \cdot x) + b\right)$
(Figure: two classes separated by candidate hyperplanes C0, C1, C2 and margin hyperplanes H1, H2, with normal vector w drawn from the origin.)
• Hyperplanes C0 through C2 all achieve zero empirical risk; C0 generalizes optimally.
• The data points that define the boundary are called support vectors.
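
A minimal illustration of this formulation (my own sketch using scikit-learn's SVC, not the ISIP implementation): the solver performs the quadratic optimization, and only a small subset of the training points ends up as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Two linearly separable point clouds in 2-D
X = np.vstack([rng.normal([-2, -2], 0.5, (200, 2)),
               rng.normal([+2, +2], 0.5, (200, 2))])
y = np.array([-1] * 200 + [+1] * 200)

# A large C approximates the hard-margin (separable) formulation
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]       # normal vector of the separating hyperplane
b = clf.intercept_[0]  # bias term
margin = 2.0 / np.linalg.norm(w)

print(f"hyperplane: {w[0]:.3f}*x1 + {w[1]:.3f}*x2 + {b:.3f} = 0")
print(f"margin width: {margin:.3f}")
print(f"support vectors: {len(clf.support_)} of {len(X)} training points")
```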

EXPERIMENTAL RESULTS: DETERDING VOWEL DATA
Deterding Vowel Data: 11 vowels spoken in “h*d” context; 10 log-area parameters; 528 train, 462 speaker-independent test

  Approach                   % Error    # Parameters
  SVM: Polynomial Kernels    49%
  K-Nearest Neighbor         44%
  Gaussian Node Network
  SVM: RBF Kernels           35%        83 SVs
  Separable Mixture Models   30%
  RVM: RBF Kernels                      13 RVs

EXPERIMENTAL RESULTS: SVM ALPHADIGIT RECOGNITION

  Transcription   Segmentation   SVM     HMM
  N-best          Hypothesis     11.0%   11.9%
  N-best+Ref      Reference      3.3%    6.3%

• The HMM system uses cross-word state-tied triphones with 16-mixture Gaussian models.
• The SVM system uses monophone models with segmental features.
• A system combination experiment yields another 1% reduction in error.

RELEVANCE VECTOR MACHINES: AUTOMATIC RELEVANCE DETERMINATION
• A kernel-based learning machine
• Incorporates an automatic relevance determination (ARD) prior over each weight (MacKay): $p(w_i \mid \alpha_i) = \mathcal{N}(w_i \mid 0, \alpha_i^{-1})$
• A flat (non-informative) prior over each $\alpha_i$ completes the Bayesian specification
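
A compact sketch of the ARD update loop for a sparse Bayesian linear model (my own simplified illustration of Tipping-style re-estimation, not the ISIP code; the fixed noise precision and pruning threshold are assumptions): weights whose precision alpha_i diverges are pruned, leaving only the relevance vectors.

```python
import numpy as np

def rvm_regression(Phi, t, beta=100.0, n_iter=200, prune_at=1e6):
    """Sparse Bayesian linear regression with an ARD prior.

    Phi:  (N, M) design matrix (e.g., kernel evaluations)
    t:    (N,) targets
    beta: fixed noise precision (kept constant here for simplicity)
    """
    N, M = Phi.shape
    alpha = np.ones(M)                      # one ARD precision per weight
    active = np.arange(M)                   # indices of surviving basis functions

    for _ in range(n_iter):
        P = Phi[:, active]
        # Posterior over the active weights
        Sigma = np.linalg.inv(np.diag(alpha[active]) + beta * P.T @ P)
        mu = beta * Sigma @ P.T @ t
        # ARD re-estimation: gamma_i = 1 - alpha_i * Sigma_ii, alpha_i = gamma_i / mu_i^2
        gamma = 1.0 - alpha[active] * np.diag(Sigma)
        alpha[active] = gamma / (mu ** 2 + 1e-12)
        # Prune weights whose precision has diverged (weight forced to zero)
        keep = alpha[active] < prune_at
        active = active[keep]

    weights = np.zeros(M)
    weights[active] = mu[keep]
    return weights, active

# Toy usage: noisy sinc targets with one RBF basis function per training point
rng = np.random.default_rng(3)
x = np.linspace(-10, 10, 100)
t = np.sinc(x / np.pi) + 0.05 * rng.standard_normal(100)
Phi = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
w, relevance = rvm_regression(Phi, t)
print(f"relevance vectors kept: {len(relevance)} of {len(x)}")
```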

EXPERIMENTAL RESULTS: SVM/RVM ALPHADIGIT COMPARISON

  Approach   Error Rate   Avg. # Parameters   Training Time   Testing Time
  SVM        16.4%        257                 0.5 hours       30 mins
  RVM        16.2%        12                  30 days         1 min

• RVMs yield a large reduction in the parameter count while attaining superior performance.
• For RVMs the computational cost lies mainly in training, but it is still prohibitive for larger sets.

EXPERIMENTAL RESULTS: PRACTICAL RISK MINIMIZATION?
Reduction of complexity at the same level of performance is interesting:
• Results hold across tasks
• RVMs have been trained on 100,000 vectors
• Results suggest integrated training is critical
Risk minimization provides a family of solutions:
• Is there a better solution than minimum risk?
• What is the impact on complexity and robustness?
Applications to other problems?
• Speech/non-speech classification?
• Speaker adaptation?
• Language modeling?

EXPERIMENTAL RESULTS: PRELIMINARY RESULTS

  Approach            Error Rate   Avg. # Parameters   Training Time   Testing Time
  SVM                 15.5%        994                 3 hours         1.5 hours
  RVM (Constructive)  14.8%        72                  5 days          5 mins
  RVM (Reduction)                  74                  6 days

• Data increased to 10,000 training vectors.
• The Reduction method has been trained on up to 100k vectors (on a toy task); this is not possible for the Constructive method.

SUMMARY: RELEVANT SOFTWARE RESOURCES
• Pattern Recognition Applet: compare popular algorithms on standard or custom data sets
• Speech Processing Toolkits: state-of-the-art toolkits for speech recognition, speaker recognition and verification, statistical modeling, and machine learning
• Foundation Classes: generic C++ implementations of many popular statistical modeling approaches
• Fun Stuff: have you seen our commercial on the Home Shopping Channel?

SUMMARY: BRIEF BIBLIOGRAPHY
Applications to speech recognition:
• J. Hamaker and J. Picone, “Advances in Speech Recognition Using Sparse Bayesian Methods,” submitted to the IEEE Transactions on Speech and Audio Processing, January 2003.
• A. Ganapathiraju, J. Hamaker and J. Picone, “Applications of Risk Minimization to Speech Recognition,” submitted to the IEEE Transactions on Signal Processing, July 2003.
• J. Hamaker, J. Picone, and A. Ganapathiraju, “A Sparse Modeling Approach to Speech Recognition Based on Relevance Vector Machines,” Proceedings of the International Conference on Spoken Language Processing, vol. 2, pp. 1001-1004, Denver, Colorado, USA, September 2002.
• J. Hamaker, Sparse Bayesian Methods for Continuous Speech Recognition, Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, December 2003.
• A. Ganapathiraju, Support Vector Machines for Speech Recognition, Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, January 2002.
Influential work:
• M. Tipping, “Sparse Bayesian Learning and the Relevance Vector Machine,” Journal of Machine Learning Research, vol. 1, pp. 211-244, June 2001.
• D. J. C. MacKay, “Probable Networks and Plausible Predictions: A Review of Practical Bayesian Methods for Supervised Neural Networks,” Network: Computation in Neural Systems, vol. 6, pp. 469-505, 1995.
• D. J. C. MacKay, Bayesian Methods for Adaptive Models, Ph.D. Thesis, California Institute of Technology, Pasadena, California, USA, 1991.
• E. T. Jaynes, “Bayesian Methods: General Background,” Maximum Entropy and Bayesian Methods in Applied Statistics, J. H. Justice, ed., pp. 1-25, Cambridge University Press, Cambridge, UK, 1986.
• V. N. Vapnik, Statistical Learning Theory, John Wiley, New York, NY, USA, 1998.
• V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, NY, USA, 1995.
• C. J. C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” AT&T Bell Laboratories, November 1999.