HUMAN LANGUAGE TECHNOLOGY: From Bits to Blogs


HUMAN LANGUAGE TECHNOLOGY: From Bits to Blogs
Joseph Picone, PhD
Professor and Chair, Department of Electrical and Computer Engineering, Temple University
URL:

Statistical Approach: Noisy Communication Channel Model

Speech Recognition Overview
Based on a noisy communication channel model in which the intended message is corrupted by a sequence of noisy channels. The Bayesian approach is most common:

  P(W|A) = P(A|W) P(W) / P(A)

Objective: minimize the word error rate by maximizing P(W|A).
- P(A|W): acoustic model
- P(W): language model
- P(A): evidence (ignored, since it is constant across competing word sequences)

Acoustic models use hidden Markov models with Gaussian mixtures. P(W) is estimated using probabilistic N-gram models. Parameters can be trained using generative (ML) or discriminative (e.g., MMIE, MCE, or MPE) approaches.

[Block diagram: Input Speech → Acoustic Front-end (Feature Extraction) → Search, driven by Acoustic Models P(A|W) and Language Model P(W) → Recognized Utterance]
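The decision rule above can be sketched in a few lines. The probabilities below are made-up numbers for illustration, not a trained model:

```python
# Toy illustration of the Bayesian decoding rule: pick the hypothesis W
# that maximizes P(W|A) ∝ P(A|W) * P(W). All numbers are invented.

acoustic = {            # P(A|W): how well each hypothesis explains the audio
    "recognize": 0.0020,
    "wreck a nice": 0.0025,
}
language = {            # P(W): prior probability of the word sequence
    "recognize": 0.10,
    "wreck a nice": 0.001,
}

def decode(hypotheses):
    # P(A) is the same for every hypothesis, so it can be ignored.
    return max(hypotheses, key=lambda w: acoustic[w] * language[w])

best = decode(acoustic)
print(best)  # "recognize": the language model overrides the acoustic score
```

Note how the language model resolves the acoustically ambiguous case, which is exactly why P(W) appears in the product.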

Signal Processing in Speech Recognition

Features: Convert a Signal to a Sequence of Vectors
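A minimal sketch of this step, assuming a single log-energy feature per frame as a stand-in for real front ends (which compute MFCC or filterbank features); frame and hop sizes correspond to 25 ms / 10 ms at 16 kHz:

```python
import numpy as np

def frames_to_features(signal, frame_len=400, hop=160):
    """Slice a waveform into overlapping frames and compute a simple
    log-energy feature per frame (a placeholder for MFCCs)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    feats = []
    for i in range(n):
        frame = signal[i * hop : i * hop + frame_len]
        window = np.hamming(frame_len)          # taper to reduce edge effects
        energy = np.log(np.sum((frame * window) ** 2) + 1e-10)
        feats.append(energy)
    return np.array(feats)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)  # 1 s of a 440 Hz tone
f = frames_to_features(sig)
print(f.shape)  # one feature per 10 ms hop
```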

iVectors: Towards Invariant Features
The i-vector representation is a data-driven approach for feature extraction that provides a general framework for separating systematic variations in the data, such as channel, speaker, and language. The feature vector is modeled as a sum of three (or more) components:

  M = m + Tw + ε

where M is the supervector, m is a universal background model, ε is a noise term, and w is the target low-dimensional feature vector. M is formed as a concatenation of consecutive feature vectors. This high-dimensional feature vector is then mapped into a low-dimensional space using factor analysis techniques such as Linear Discriminant Analysis (LDA). The dimension of T can be extremely large (~20,000 x 100), but the dimension of the resulting feature vector, w, is on the order of a traditional feature vector (~50). The i-vector representation has been shown to give significant reductions (20% relative) in EER on speaker/language identification tasks.
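A toy numerical sketch of the decomposition M = m + Tw + ε: once T and m are known, the low-dimensional w can be recovered from a supervector by least squares. All matrices here are random stand-ins (and tiny), not trained quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 200, 5                  # supervector dim, i-vector dim (toy sizes)
T = rng.normal(size=(D, d))    # total-variability matrix (would be trained)
m = rng.normal(size=D)         # universal background supervector

w_true = rng.normal(size=d)
M = m + T @ w_true + 0.01 * rng.normal(size=D)   # M = m + Tw + eps

# Recover the low-dimensional representation by least squares:
w_hat, *_ = np.linalg.lstsq(T, M - m, rcond=None)
print(np.allclose(w_hat, w_true, atol=0.05))     # close to the true w
```

The point of the representation is exactly this compression: a ~20,000-dimensional M is summarized by a w of ordinary feature-vector size.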

Speech Recognition Overview (repeated): the same noisy-channel slide as above, with part of the block diagram highlighted as the research focus.

Acoustic Models: Capture the Time-Frequency Evolution
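The HMM machinery behind the acoustic models can be illustrated with the forward algorithm on a tiny discrete two-state HMM; the transition and emission probabilities below are illustrative numbers only:

```python
import numpy as np

A = np.array([[0.7, 0.3],      # state transition probabilities
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],      # P(observation | state)
              [0.3, 0.7]])
pi = np.array([0.5, 0.5])      # initial state distribution

def forward(obs):
    """Forward algorithm: total probability of an observation sequence."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
    return alpha.sum()

p = forward([0, 1, 1])
print(round(p, 5))  # P(sequence | model)
```

Real acoustic models replace the discrete emission table B with Gaussian mixture densities over the feature vectors, but the recursion is the same.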

Language Modeling: Word Prediction
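Word prediction with the N-gram models mentioned earlier can be sketched as a bigram model over a toy corpus; the add-one smoothing here is an assumption for illustration (real systems use more refined smoothing):

```python
from collections import Counter

# Minimal bigram language model with add-one (Laplace) smoothing.
corpus = "the cat sat on the mat the cat ate".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = set(corpus)

def p_next(word, nxt):
    # P(nxt | word), smoothed so unseen bigrams get nonzero probability
    return (bigrams[(word, nxt)] + 1) / (unigrams[word] + len(vocab))

# After "the", "cat" is the most likely continuation in this corpus.
best = max(vocab, key=lambda w: p_next("the", w))
print(best)
```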

Search: Finding the Best Path
- breadth-first, time-synchronous
- beam pruning
- supervision
- word prediction
- natural language
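The time-synchronous beam pruning listed above can be sketched as follows; the per-frame scores are invented log-probabilities, not a real recognizer's output:

```python
import math

def beam_search(frame_scores, beam_width=2):
    """Toy time-synchronous beam search: at each frame, extend every
    hypothesis in the beam, then prune to the best `beam_width` by
    accumulated log score."""
    beam = [([], 0.0)]                        # (label sequence, log score)
    for scores in frame_scores:               # one dict per time frame
        candidates = []
        for seq, lp in beam:
            for label, s in scores.items():
                candidates.append((seq + [label], lp + s))
        # beam pruning: keep only the top hypotheses
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beam[0]

frames = [
    {"h": math.log(0.6), "x": math.log(0.4)},
    {"i": math.log(0.7), "y": math.log(0.3)},
]
seq, score = beam_search(frames)
print("".join(seq))  # "hi"
```

Pruning keeps the search tractable: without it, the number of hypotheses grows exponentially with the number of frames.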

Information Retrieval From Voice Enables Analytics
Example query: "What is the number one complaint of my customers?"
Processing stages: Speech Activity Detection → Language Identification → Gender Identification → Speaker Identification → Speech to Text → Keyword Search → Entity Extraction → Relationship Analysis → Relational Database
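The cascade above can be sketched as a chain of stages that each attach metadata to an utterance record before it is stored for querying; every stage body here is a placeholder for a trained model:

```python
# Placeholder pipeline: each stage enriches the record with metadata.
def detect_speech(rec):
    rec["has_speech"] = len(rec["audio"]) > 0
    return rec

def identify_language(rec):
    rec["language"] = "en"                     # placeholder for a real model
    return rec

def transcribe(rec):
    rec["text"] = "my bill is wrong"           # placeholder for speech-to-text
    return rec

def extract_entities(rec):
    rec["entities"] = [w for w in rec["text"].split() if w == "bill"]
    return rec

database = []                                  # stand-in for a relational DB

def process(audio):
    rec = {"audio": audio}
    for stage in (detect_speech, identify_language, transcribe, extract_entities):
        rec = stage(rec)
    database.append(rec)
    return rec

r = process([0.1, -0.2, 0.05])
print(r["entities"])  # ["bill"]
```

Once the records land in the database, the analytics question becomes an ordinary query over the extracted metadata.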

Content-Based Searching Once the underlying data is analyzed and "marked up" with metadata that reveals content such as language and topic, search engines can match based on meaning. Such sites make use of several human language technologies and allow you to search multiple types of media (e.g., audio tracks of broadcast news). This is an emerging area for the next-generation Internet.

Applications Continually Find New Uses for the Technology
- Real-time translation of news broadcasts in multiple languages (DARPA GALE)
- Google search using voice queries
- Keyword search of audio and video
- Real-time speech translation in 54 languages
- Monitoring of communications networks for military and homeland security applications

Analytics Definition: A tool or process that allows an entity (e.g., a business) to arrive at an optimal or realistic decision based on existing data (Wikipedia). Google is building a highly profitable business around analytics derived from people using its search engine. Any time you access a web page, you are leaving a footprint of yourself, particularly with respect to what you like to look at. This allows advertisers to tailor their ads to your personal interests by adapting web pages to your habits. Web sites such as amazon.com, netflix.com, and pandora.com have taken this concept of personalization to the next level. As people do more browsing from their telephones, which are now GPS enabled, an entirely new class of applications is emerging that can track your location, your interests, and your network of "friends."

Speech Recognition is Information Extraction
Traditional outputs: best word sequence; time alignment of information.
Other outputs: word graphs; N-best sentences; confidence measures; metadata such as speaker identity, accent, and prosody.
Applications: information localization; data mining; emotional state (stress, fatigue, deception).

Dialog Systems
- DARPA Communicator architecture: extendable distributed processing architecture, frame-based dialog manager, open-source speech recognition
- Goal: combine the best of all research systems to assess the state of the art
- Dialog systems involve speech recognition, speech synthesis, avatars, and even gesture and emotion recognition
- Avatars are increasingly lifelike
- But... systems tend to be application-specific

Future Directions
How do we get better?
- Supervised transcription is slow, expensive, and limited.
- Unsupervised learning on large amounts of data is viable.
- More data, more data, more data... YouTube is opening new possibilities; courtroom and governmental proceedings are providing significant amounts of parallel text; Google???
- But this type of data is imperfect...
- ...and learning algorithms are still very primitive.
- And neuroscience has yet to inform our learning algorithms!

Brief Bibliography of Related Research
- S. Pinker, The Language Instinct: How the Mind Creates Language, William Morrow and Company, New York, NY, USA, 1994.
- B.H. Juang and L.R. Rabiner, "Automatic Speech Recognition - A Brief History of the Technology," Elsevier Encyclopedia of Language and Linguistics, 2nd Edition, 2005.
- M. Benzeghiba, et al., "Automatic Speech Recognition and Speech Variability: A Review," Speech Communication, vol. 49, no. 10-11, pp. 763-786, October 2007.
- B.J. Kröger, et al., "Towards a Neurocomputational Model of Speech Production and Perception," Speech Communication, vol. 51, no. 9, pp. 793-809, September 2009.
- B. Lee, "The Biological Foundations of Language," available at http://www.duke.edu/~pk10/language/neuro.htm (a review paper).
- M. Gladwell, Blink: The Power of Thinking Without Thinking, Little, Brown and Company, New York, NY, USA, 2005.