POTENTIAL SYNERGIES BETWEEN SPEECH RECOGNITION AND PROTEOMICS

Joseph Picone, PhD
Professor, Department of Electrical and Computer Engineering
Mississippi State University

Engineering Terminology

Speech recognition is essentially an application of pattern recognition or machine learning to audio signals:

Pattern Recognition: "The act of taking in raw data and taking an action based on the category of the pattern."

Machine Learning: the ability of a machine to improve its performance based on previous results.

A popular application of pattern recognition is the development of a functional mapping between inputs (observations) and desired outcomes or actions (classes); a minimal sketch follows this slide.

For the past 30 years, statistical methods have dominated the fields of pattern recognition and machine learning. Unfortunately, these methods typically require large amounts of truth-marked data to be effective.

Generalization and Risk: Many algorithms produce very low error rates on small data sets, but have trouble generalizing when constrained to limited amounts of training data or when evaluation conditions differ from the training data.
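
To make "functional mapping" concrete, here is a minimal sketch in Python: a nearest-centroid classifier trained on synthetic, truth-marked data. The data, dimensions, and accuracy check are illustrative assumptions, not anything from the original slides.

```python
# A minimal sketch of a learned functional mapping from observations to
# classes: a nearest-centroid classifier on synthetic data.
import numpy as np

rng = np.random.default_rng(0)

# Two classes of 10-dimensional "feature vectors" (truth-marked data).
X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 10))
X1 = rng.normal(loc=2.0, scale=1.0, size=(100, 10))

# "Training" = estimating one centroid per class.
centroids = np.stack([X0.mean(axis=0), X1.mean(axis=0)])

def classify(x):
    """Map an observation x to the class with the nearest centroid."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# Generalization check: accuracy on held-out samples drawn from class 1.
test = rng.normal(loc=2.0, scale=1.0, size=(50, 10))
print(np.mean([classify(x) == 1 for x in test]))
```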

Proteomics (From Wikipedia)

Proteomics is the large-scale study of proteins, particularly their structures and functions. Proteomics is often considered the next step beyond genomics in the study of biological systems. It is much more complicated than genomics, mostly because while an organism's genome is more or less constant, the proteome differs from cell to cell and from time to time, since distinct genes are expressed in distinct cell types. This means that even the basic set of proteins produced in a cell needs to be determined.

Several new methods have emerged to probe protein-protein interactions, including protein microarrays and immunoaffinity chromatography followed by mass spectrometry. Unlike the speech recognition problem, identifying proteins using mass spectrometry is a mature process. But can enough data be generated? The fundamental challenge is understanding protein interactions and how these relate to diagnostic techniques and disease treatments.

Collaboration Challenge: What role can the ability to learn functional mappings from data play in proteomics?

Speech Recognition Overview

Conversion of a 1D time series (sound pressure wave vs. time) to a symbolic description. The system exploits "domain" knowledge at each level of the hierarchy to constrain the search space and improve accuracy.

The exact locations of symbols in the signal are unknown. Segmentation, or locating the symbols, is done in a statistically optimal manner as part of the search process. The complexity of the search space is exponential.

From a Signal to a Spectrogram

Convert a one-dimensional signal (sound pressure wave vs. time) to a time-frequency representation that better depicts the "signature" of a sound. Use simple linear transforms such as a Fourier transform to generate a "spectrogram" of the signal (spectral magnitude vs. time and frequency), as sketched below.

Key challenge: where do sounds begin and end in the signal?
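
A minimal sketch of this step, assuming a 16 kHz sample rate and the 25 ms window / 10 ms hop framing typical of speech front ends; the input is a synthetic tone standing in for speech.

```python
# Frame the signal, window each frame, and take the magnitude of a
# short-time Fourier transform: spectral magnitude vs. time and frequency.
import numpy as np

fs = 16000                          # sample rate (Hz), an assumption
t = np.arange(fs) / fs              # one second of signal
x = np.sin(2 * np.pi * 440 * t)     # 440 Hz tone as a stand-in for speech

frame_len = int(0.025 * fs)         # 25 ms analysis window
hop = int(0.010 * fs)               # 10 ms frame advance
window = np.hamming(frame_len)

frames = [x[i:i + frame_len] * window
          for i in range(0, len(x) - frame_len, hop)]
spectrogram = np.abs(np.fft.rfft(frames, axis=1))
print(spectrogram.shape)            # (num_frames, frame_len // 2 + 1)
```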

From a Spectrum to Phonemes

The spectral signature of a sound varies with its context (e.g., there are 39 variants of "t" in English). We use context-dependent models that take into account the left and right context (e.g., "k-ah+t"). This unfortunately causes exponential growth in the search space: there are approximately 40 phones in English, and approximately 10,000 possible combinations of three phones, which we refer to as triphones. Decision-tree clustering is used to reduce the number of parameters required to describe these models.

Since any phone can occur at any time, and any phone can follow any other phone, every frame of processing requires starting 10,000 new hypotheses. Hence, complexity is controlled using top-down supervision (a time-synchronous, breadth-first search), and less probable hypotheses are discarded each frame (beam search), as in the sketch below.
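
A minimal sketch of time-synchronous beam pruning, with a toy phone set and random stand-in scores; a real decoder would use HMM state likelihoods and lexical constraints.

```python
# At each frame, every active hypothesis is extended by every phone, and
# hypotheses scoring too far below the best are discarded (beam pruning).
import numpy as np

rng = np.random.default_rng(1)
phones = ["ah", "k", "t", "s"]      # toy phone inventory
beam_width = 10.0                   # log-score pruning threshold

hyps = {(): 0.0}                    # phone sequence -> log score
for frame in range(5):
    extended = {}
    for seq, score in hyps.items():
        for p in phones:
            # Stand-in for a per-frame acoustic log likelihood.
            extended[seq + (p,)] = score + rng.normal(-5.0, 1.0)
    best = max(extended.values())
    hyps = {s: v for s, v in extended.items() if v > best - beam_width}
    print(f"frame {frame}: {len(hyps)} hypotheses survive")
```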

From Phonemes to Words

Phones are converted to words using a lexicon that typically contains between 100K and 1M words (see the sketch below). About 10% of the expected phonemes are deleted in conversational speech, so pronunciation models must be robust to missing data. Many words have alternate pronunciations based on context, dialect, accent, speaking rate, etc.

Phoneme recognition accuracies are low (approximately 60%), but with word-level supervision, recognition accuracy can be high (greater than 90%). If any of 1M words can occur at almost any time, the size of the search space is enormous. Hence, efficient search strategies are critical, and only suboptimal solutions are feasible.
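
A minimal sketch of a pronunciation lexicon with alternate pronunciations; the entries and phone symbols are illustrative (real lexicons contain 100K-1M words).

```python
# Each word maps to one or more phone sequences (alternate pronunciations).
lexicon = {
    "data":   [("d", "ey", "t", "ah"), ("d", "ae", "t", "ah")],
    "the":    [("dh", "ah"), ("dh", "iy")],
    "tomato": [("t", "ah", "m", "ey", "t", "ow"),
               ("t", "ah", "m", "aa", "t", "ow")],
}

def pronunciations(word):
    """Return all phone sequences for a word (empty if out of vocabulary)."""
    return lexicon.get(word.lower(), [])

for w in ["the", "data", "xylophone"]:
    print(w, "->", pronunciations(w))
```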

From Words to Concepts

Words can be converted to concepts or actions using various mapping functions (e.g., finite state machines, neural networks, formal languages); a finite-state sketch follows. Statistical models can be used, but these require large amounts of labeled data (word sequences and corresponding actions). Domain knowledge is used to limit the search space.
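
A minimal finite-state sketch mapping word sequences to actions; the states, vocabulary, and actions are hypothetical, and a deployed system would derive them from domain knowledge or labeled data.

```python
# A tiny finite state machine: (state, word) -> next state. "*" matches
# any word, e.g., a name or song title.
transitions = {
    ("start", "call"):   "await_name",
    ("start", "play"):   "await_song",
    ("await_name", "*"): "do_call",
    ("await_song", "*"): "do_play",
}

def interpret(words):
    """Walk the FSM over a word sequence; the final state is the action."""
    state = "start"
    for w in words:
        state = transitions.get((state, w)) or transitions.get((state, "*"))
        if state is None:
            return "reject"
    return state

print(interpret(["call", "alice"]))       # -> do_call
print(interpret(["play", "yesterday"]))   # -> do_play
```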

Next Steps

Speech recognition expertise that is of potential value:
- The ability to train sophisticated statistical models on large amounts of data.
- The ability to efficiently search enormously large search spaces.
- The ability to convert domain knowledge into statistical models (e.g., prior probabilities in a Bayesian framework).

Next steps:
- Determine a small pilot project that is demonstrative of the type of data or problems you need solved.
- Reality is in the data: transfer some data sets that we can use to create an experimental environment for our algorithms.
- Establish baseline performance (e.g., accuracy, complexity, memory, speed) of the current state of the art.
- Understand through error analysis what the dominant failure modes are, and what types of improvements are desired.

Relevant Publications and Online Resources

Recent relevant peer-reviewed publications:
- S. Srinivasan, T. Ma, D. May, G. Lazarou and J. Picone, "Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition," Proc. ICSLP, pp. 960-963, Brisbane, Australia, September 2008.
- S. Prasad, S. Srinivasan, M. Pannuri, G. Lazarou and J. Picone, "Nonlinear Dynamical Invariants for Speech Recognition," Proc. ICSLP, pp. 2518-2521, Pittsburgh, Pennsylvania, USA, September 2006.
- J. Baca and J. Picone, "Effects of Navigational Displayless Interfaces on User Prosodics," Speech Communication, vol. 45, no. 2, pp. 187-202, February 2005.
- A. Ganapathiraju, J. Hamaker and J. Picone, "Applications of Support Vector Machines to Speech Recognition," IEEE Trans. on Signal Processing, vol. 52, no. 8, pp. 2348-2355, August 2004.
- R. Sundaram and J. Picone, "Effects of Transcription Errors on Supervised Learning in Speech Recognition," Proc. ICASSP, pp. 169-172, Montreal, Quebec, Canada, May 2004.
- I. Alphonso and J. Picone, "Network Training for Continuous Speech Recognition," Proc. EURASIP, pp. 565-568, Vienna, Austria, September 2004.
- J. Hamaker, J. Picone and A. Ganapathiraju, "A Sparse Modeling Approach to Speech Recognition Based on Relevance Vector Machines," Proc. ICSLP, pp. 1001-1004, Denver, Colorado, USA, September 2002.

Relevant online resources:
- "Institute for Signal and Information Processing," http://www.isip.piconepress.com.
- "Internet-Accessible Speech Recognition Technology," http://www.isip.piconepress.com/projects/speech/.
- "An Open-Source Speech Recognition System," http://www.isip.piconepress.com/projects/speech/software/.
- "Nonlinear Statistical Modeling of Speech," http://www.piconepress.com/projects/nsf_nonlinear/.
- "An On-line Tutorial on Speech Recognition," http://www.isip.piconepress.com/projects/speech/software/tutorials/production/fundamentals/current/.
- "Speech and Signal Processing Demonstrations," http://www.isip.piconepress.com/projects/speech/software/demonstrations/.
- "Fundamentals of Speech Recognition," http://www.isip.piconepress.com/publications/courses/ece_8463/.
- "Pattern Recognition," http://www.isip.piconepress.com/publications/courses/ece_8463/.
- "Adaptive Signal Processing," http://www.isip.piconepress.com/publications/courses/ece_8423/.

Appendix: Relevant Resources

- Interactive Software: Java applets, GUIs, dialog systems, code generators, and more.
- Speech Recognition Toolkits: compare SVMs and RVMs to standard approaches using a state-of-the-art ASR toolkit.
- Foundation Classes: generic C++ implementations of many popular statistical modeling approaches.
- Fun Stuff: have you seen our campus bus tracking system? Or our Home Shopping Channel commercial?

Appendix: ISIP Is More Than Just Software

- Extensive online software documentation, tutorials, and training materials.
- Extensive archive of graduate and undergraduate coursework.
- Web-based instructional materials, including demos and applets.
- Self-documenting software.
- Summer workshops at which students receive intensive hands-on training.
- Jointly develop advanced prototypes in partnerships with commercial entities.
- Provide consulting services to industry across a broad range of human language technology.
- Commitment to open source.

Appendix: Speech Recognition Architectures

Core components:
- transduction
- feature extraction
- acoustic modeling (hidden Markov models)
- language modeling (statistical N-grams)
- search (Viterbi beam)
- knowledge sources

Our focus has traditionally been on the acoustic modeling components of the system. A skeleton of how these components fit together is sketched below.
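
A skeleton showing how the components chain together. Every function here is a stub with hypothetical names and toy scores, not the actual ISIP system.

```python
# Recognizer skeleton: features -> acoustic scores + language model -> search.
def feature_extraction(signal):
    return [signal]                  # placeholder feature frames (e.g., MFCCs)

def acoustic_score(frames):
    return {"hello": -12.0, "yellow": -15.0}   # toy per-word HMM log scores

def language_score(word):
    return {"hello": -1.0, "yellow": -4.0}.get(word, -10.0)  # toy N-gram prior

def search(frames):
    # Stand-in for Viterbi beam search: pick the best combined score.
    scores = acoustic_score(frames)
    return max(scores, key=lambda w: scores[w] + language_score(w))

print(search(feature_extraction([0.1, 0.2, 0.3])))   # -> "hello"
```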

Appendix: Noisy Communication Channel Model

Appendix: Feature Extraction

A popular approach for capturing these dynamics is the Mel-Frequency Cepstral Coefficients (MFCC) "front end," in which cepstral coefficients are computed as the discrete cosine transform of log mel filterbank energies:

c_n = \sum_{k=1}^{K} \log(E_k) \cos\left[ n \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right]

where E_k is the output energy of the k-th mel filter and K is the number of filters.
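
A minimal sketch implementing the equation above for one frame. The band pooling here is a crude linear stand-in for true triangular mel filters, so the outputs are illustrative rather than production MFCCs.

```python
# DCT of log band energies for one 25 ms frame (400 samples at 16 kHz).
import numpy as np

def mfcc_frame(frame, num_bands=24, num_ceps=13):
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # Pool power into K bands (real systems use triangular mel filters).
    bands = np.array_split(power, num_bands)
    log_e = np.log(np.array([b.sum() for b in bands]) + 1e-10)
    # c_n = sum_k log(E_k) * cos(n * (k - 1/2) * pi / K)
    k = np.arange(1, num_bands + 1)
    return np.array([np.sum(log_e * np.cos(n * (k - 0.5) * np.pi / num_bands))
                     for n in range(num_ceps)])

frame = np.random.default_rng(0).normal(size=400)   # noise as a stand-in
print(mfcc_frame(frame)[:4])
```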

Appendix: Acoustic Modeling

Appendix: Context-Dependent Phones

Appendix: Language Modeling

Appendix: Statistical N-gram Models

Appendix: Search Strategies

- breadth-first
- time synchronous
- beam pruning
- supervision
- word prediction
- natural language

Appendix: Evolution of Knowledge in HLT Systems

[Figure: performance vs. source of knowledge]

A priori expert knowledge created a generation of highly constrained systems (e.g., isolated word recognition, parsing of written text, fixed-font OCR). Statistical methods created a generation of data-driven approaches that supplanted expert systems (e.g., conversational speech to text, speech synthesis, machine translation from parallel text).

... but that isn't the end of the story. A number of fundamental problems still remain (e.g., channel and noise robustness, less dense or less common languages). The solution will require approaches that use expert knowledge from related, more dense domains (e.g., similar languages) and the ability to learn from small amounts of target data (e.g., autonomic).

Appendix: Predicting User Preferences

These models can be used to generate alternatives that are consistent with your previous choices (or the choices of people like you). Such models are referred to as generative models because they can spontaneously generate new data that is statistically consistent with previously collected data. Alternatively, you can build graphs in which movies are nodes and links represent connections between movies judged to be similar.

Some sites, such as Pandora, allow you to continuously rate choices and adapt the mathematical models of your preferences in real time. This area of science is known as adaptive systems, and deals with algorithms for rapidly adjusting to new data.

Appendix: Functional Mappings

A simple model of your behavior is a weighted linear combination of input features, y = \sum_j w_j x_j:

- The inputs, x_j, can represent names, places, or even features of the sites you visit frequently (e.g., purchases).
- The weights, w_j, can be set heuristically (e.g., visiting www.aljazeera.com is much more important than visiting www.msms.k12.ms.us).
- The parameters of the model can be optimized to minimize the error in predicting your choices, or to maximize the probability of predicting a correct choice.
- We can weight these probabilities by the a priori likelihood that the average user would make certain choices (Bayesian models).

[Figure: a linear classifier mapping inputs such as retail sites and newspapers to predictions]
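
A minimal sketch of this weighted-sum model; the feature names, weights, and threshold are hypothetical, chosen only to echo the example in the text.

```python
# y = sum_j w_j * x_j over visit-count features, thresholded into a decision.
import numpy as np

features = ["visits_news", "visits_retail", "visits_school"]
w = np.array([3.0, 1.0, 0.2])          # heuristic weights, per the slide

def predict(x, threshold=2.0):
    """Predict interest from a feature vector x of visit counts."""
    return "recommend" if float(w @ x) > threshold else "skip"

print(predict(np.array([1, 0, 0])))    # news-heavy user -> recommend
print(predict(np.array([0, 0, 5])))    # school-only user -> skip
```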

Appendix: Major ISIP Milestones

- 1994: Founded the Institute for Signal and Information Processing (ISIP)
- 1995: Human listening benchmarks established for the DARPA speech program
- 1997: DoD funds the initial development of our public domain speech recognition system
- 1997: Syllable-based speech recognition
- 1998: NSF CAREER award for Internet-Accessible Speech Recognition Technology
- 1998: First large-vocabulary speech recognition application of Support Vector Machines
- 1999: First release of high-quality SWB transcriptions and segmentations
- 2000: First participation in the annual DARPA evaluations (only university site to participate)
- 2000: NSF funds a multi-university collaboration on integrating speech and natural language
- 2001: Demonstrated the small impact of transcription errors on HMM training
- 2002: First viable application of Relevance Vector Machines to speech recognition
- 2002: Distribution of the Aurora toolkit
- 2002: Evolution of ISIP into the Institute for Intelligent Electronic Systems
- 2002: The "Crazy Joe" commercial becomes the most widely viewed ISIP document
- 2003: IIES joins the Center for Advanced Vehicular Systems
- 2004: NSF funds nonlinear statistical modeling research and supports the development of speaker verification technology
- 2004: ISIP's first speaker verification system
- 2005: ISIP's first dialog system, based on our port of the DARPA Communicator system
- 2006: Automatic detection of fatigue
- 2007: Integration of nonlinear features into a speech recognition front end
- 2008: ISIP's first keyword search system
- 2008: Nonlinear mixture autoregressive models for speech recognition
- 2008: Linear dynamic models for speech recognition
- 2009: Launch of our first commercial web site and associated business venture…

Biography Joseph Picone received his Ph.D. in Electrical Engineering in 1983 from the Illinois Institute of Technology. He is currently a Professor in the Department of Electrical and Computer Engineering at Mississippi State University. He recently completed a three-year sabbatical at the Department of Defense where he directed human language technology research and development. His primary research interests are currently machine learning approaches to acoustic modeling in speech recognition. For over 25 years he has conducted research on many aspects of digital speech and signal processing. He has also been a long-term advocate of open source technology, delivering one of the first state-of-the-art open source speech recognition systems, and maintaining one of the more comprehensive web sites related to signal processing. His research group is known for producing many innovative educational materials that have increased access to the field. Dr. Picone has previously been employed by Texas Instruments and AT&T Bell Laboratories, including a two-year assignment in Japan establishing Texas Instruments’ first international research center. He is a Senior Member of the IEEE and has been active in several professional societies related to human language technology. He has authored numerous papers on the subject and holds 8 patents.