Neuroscience Program's Seminar Series
HUMAN LANGUAGE TECHNOLOGY: From Bits to Blogs
Joseph Picone, PhD
Professor and Chair, Department of Electrical and Computer Engineering, Temple University

NeuroSci: Slide 1 Abstract

What makes machine understanding of human language so difficult?
- "In any natural history of the human species, language would stand out as the preeminent trait."
- "For you and I belong to a species with a remarkable trait: we can shape events in each other's brains with exquisite precision."
(S. Pinker, The Language Instinct: How the Mind Creates Language, 1994)

In this presentation, we will:
- Discuss the complexity of the language problem in terms of three key engineering approaches: statistics, signal processing and machine learning.
- Introduce the basic ways in which we process language by computer.
- Discuss some important applications that continue to drive the field (commercial and defense/homeland security).

NeuroSci: Slide 2 Language Defies Conventional Mathematical Descriptions

According to the Oxford English Dictionary, the 500 words used most in the English language each have an average of 23 different meanings. The word "round," for instance, has 70 distinctly different meanings. (J. Gray)

Are you smarter than a 5th grader? "The tourist saw the astronomer on the hill with a telescope."

There are hundreds of linguistic phenomena we must take into account to understand written language, and each cannot always be perfectly identified (e.g., by Microsoft Word): 95% x 95% x ... = a small number.

Is SMS messaging even a language? "y do tngrs luv 2 txt msg?" (D. Radev, Ambiguity of Language)
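The arithmetic in that compounding argument is worth making concrete. A minimal sketch (the counts of phenomena are invented for illustration): if each phenomenon is identified correctly 95% of the time, the probability that a sentence touching n of them is analyzed perfectly collapses quickly.

```python
# Hedged illustration: per-phenomenon accuracy of 95%, n independent
# phenomena; the joint success probability decays geometrically.
for n in (10, 50, 100):
    print(f"{n} phenomena: {0.95 ** n:.3f}")
# 10 phenomena: 0.599
# 50 phenomena: 0.077
# 100 phenomena: 0.006
```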

NeuroSci: Slide 3 Communication Depends on Statistical Outliers

A small percentage of words constitutes a large percentage of the word tokens used in conversational speech. Consequence: the prior probability of just about any meaningful sentence is close to zero. Why? Conventional statistical approaches are based on average behavior (means) and deviations from this average behavior (variance).

Consider the sentence: "Show me all the web pages about Franklin Telephone in Oktoc County." Key words such as "Franklin" and "Oktoc" play a significant role in the meaning of the sentence. What are the prior probabilities of these words?
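One can see the skew directly by tabulating token coverage on any conversational transcript. A minimal sketch, with a made-up token list standing in for a real corpus:

```python
from collections import Counter

# Hedged sketch: a handful of frequent word types ("i", "the", "that", ...)
# covers most tokens, while content words like "Franklin" or "Oktoc" are
# vanishingly rare. `tokens` is a stand-in for a real transcript.
tokens = "well i think that the thing is that you know i mean it is".split()
counts = Counter(tokens)
total = sum(counts.values())
top10 = counts.most_common(10)
print(f"top 10 types cover {sum(c for _, c in top10) / total:.0%} of tokens")
```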

NeuroSci: Slide 4 Maybe We Don't Need to Understand Language?

See ISIP Phonetic Units to run a demo of the influence of phonetic units on different speaking styles.

NeuroSci: Slide 5 Fundamental Challenges in Spontaneous Speech
- Common phrases experience significant reduction (e.g., "Did you get" becomes "jyuge").
- Approximately 12% of phonemes and 1% of syllables are deleted; robustness to missing data is a critical element of any system.
- Linguistic phenomena such as coarticulation produce significant overlap in the feature space.
- Decreasing the classification error rate requires increasing the amount of linguistic context; modern systems condition acoustic probabilities using units ranging from phones to multiword phrases.

NeuroSci: Slide 6 Human Performance is Impressive
- Human performance exceeds machine performance by a factor ranging from 4x to 10x, depending on the task.
- On some tasks, such as credit card number recognition, machine performance exceeds that of humans due to limits on human memory retrieval capacity.
- The nature of the noise is as important as the SNR (e.g., cellular phones).
- A primary failure mode for humans is inattention. A second major failure mode is lack of familiarity with the domain (e.g., business terms and corporation names).

[Figure: word error rate (0% to 20%) vs. speech-to-noise ratio (quiet, 22 dB, 16 dB, 10 dB) on Wall Street Journal with additive noise, comparing machines to a committee of human listeners.]

NeuroSci: Slide 7 Human Performance is Robust
- Cocktail Party Effect: the ability to focus one's listening attention on a single talker among a mixture of conversations and noises. Sound localization is enabled by our binaural hearing, but also involves cognition.
- McGurk Effect: visual cues (e.g., seeing a speaker mouth "ga" while hearing "ba") cause a shift in the perception of the sound, demonstrating multimodal speech perception. This suggests that audiovisual integration mechanisms in speech take place rather early in the perceptual process.

NeuroSci: Slide 8 Human Language Technology (HLT)

Audio Processing:
- Speech Coding/Compression (MPEG)
- Text to Speech Synthesis (voice response systems)

Pattern Recognition / Machine Learning:
- Language Identification (defense)
- Speaker Identification (biometrics for security)
- Speech Recognition (automated operator services)

Natural Language Processing (NLP):
- Entity/Content Extraction (ask.com, cuil.com)
- Summarization and Gisting (CNN, defense)
- Machine Translation (Google search)

Integrated Technologies:
- Real-time Speech to Speech Translation (videoconferencing)
- Multimodal Speech Recognition (automotive)
- Human Computer Interfaces (tablet computing)

All of these technologies share a common technology base: machine learning.

NeuroSci: Slide 9 The World's Languages
- There are over 6,000 known languages in the world.
- The dominance of English is being challenged by growth in Asian and Arabic languages.
- Common languages are used to facilitate communication; native languages are often used for covert communications.

[Figure: U.S. Census data on non-English languages spoken in the U.S.]

NeuroSci: Slide 10 Basic Hardware: Acoustic to Electrical Transducer

NeuroSci: Slide 11 Speech Recognition Architectures

Core components of modern speech recognition systems:
- Transduction: conversion of an electrical or acoustic signal to a digital signal;
- Feature Extraction: conversion of samples to vectors containing the salient information;
- Acoustic Model: statistical representation of basic sound patterns (e.g., hidden Markov models);
- Language Model: statistical model of common words or phrases (e.g., N-grams);
- Search: finding the best hypothesis for the data using an optimization procedure.

[Block diagram: Input Speech -> Acoustic Front-end -> Search -> Recognized Utterance, with the search guided by Acoustic Models P(A|W) and a Language Model P(W).]

NeuroSci: Slide 12 Signal Processing in Speech Recognition

NeuroSci: Slide 13 Feature Extraction in Speech Recognition
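This slide is a figure; as a hedged illustration of what such a front end computes, here is a minimal MFCC-style pipeline (pre-emphasis, framing, windowing, FFT, mel filterbank, log, DCT). The constants (25 ms frames, 10 ms hop, 26 filters, 13 cepstra) are common defaults, not values taken from the talk.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    # Pre-emphasis boosts the high frequencies attenuated in speech.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)   # 25 ms / 10 ms
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i*hop : i*hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]

feats = mfcc(np.random.randn(16000))   # 1 s of noise -> (98, 13) feature vectors
print(feats.shape)
```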

NeuroSci: Slide 14 Adding More Knowledge to the Front End

NeuroSci: Slide 15 Noise Compensation Techniques
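The slide is a figure, and the specific techniques it covers are not recoverable from the transcript. As one classical example of the genre, here is a hedged sketch of spectral subtraction (my choice of technique, not necessarily the talk's): a noise magnitude spectrum estimated from non-speech frames is subtracted from each frame's spectrum.

```python
import numpy as np

def spectral_subtract(frames_fft, noise_mag, floor=0.01):
    """frames_fft: (n_frames, n_bins) complex STFT; noise_mag: (n_bins,) estimate."""
    mag, phase = np.abs(frames_fft), np.angle(frames_fft)
    # A spectral floor keeps bins from going negative ("musical noise").
    clean = np.maximum(mag - noise_mag, floor * mag)
    return clean * np.exp(1j * phase)   # reuse the noisy phase
```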

NeuroSci: Slide 16 Speech Recognition Architectures

(The core-components overview from Slide 11 is repeated here as a section divider.)

NeuroSci: Slide 17 Statistical Approach: Noisy Communication Channel Model
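The slide itself is a figure, but the formulation it depicts is the standard Bayesian decomposition used throughout this talk: the recognizer picks the word sequence W that is most probable given the acoustics A, and since P(A) does not depend on W, the search reduces to the product of the acoustic and language models.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
        = \arg\max_{W} \underbrace{P(A \mid W)}_{\text{acoustic model}}\;
                       \underbrace{P(W)}_{\text{language model}}
```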

NeuroSci: Slide 18 Doubly Stochastic Systems

Modeling acoustics in speech involves using models with hidden parameters that self-organize information. The 1-coin model to the left is observable because the output sequence can be mapped to a specific sequence of state transitions. The remaining models are hidden because the underlying state sequence cannot be directly inferred from the output sequence.

With hidden Markov models, we can learn the parameters of these models from data. One approach is to maximize the likelihood of the data given the model (the forward computation sketched below).
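A hedged sketch of that likelihood computation, the forward algorithm, which sums over all hidden state paths; the two-state, two-symbol model below is a toy coin-flip example, not a model from the talk.

```python
import numpy as np

def forward(obs, A, B, pi):
    """A: (S,S) transition probs, B: (S,V) emission probs, pi: (S,) initial probs."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate and weight by emission
    return alpha.sum()                   # P(observations | model)

A  = np.array([[0.7, 0.3], [0.4, 0.6]])  # two hidden states (e.g., two biased coins)
B  = np.array([[0.9, 0.1], [0.2, 0.8]])  # two observable symbols (heads/tails)
pi = np.array([0.5, 0.5])
print(forward([0, 1, 0], A, B, pi))
```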

NeuroSci: Slide 19 Acoustic Modeling: Hidden Markov Models
- Acoustic models encode the temporal evolution of the features (spectrum).
- Gaussian mixture distributions are used to account for variations in speaker, accent and pronunciation.
- Phonetic model topologies are simple left-to-right structures. Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
- Sharing model parameters is a common strategy to reduce complexity. Model parameters are optimized using data-driven training techniques.

NeuroSci: Slide 20 Context-Dependent Acoustic Units

NeuroSci: Slide 21 Data-Driven Parameter Sharing Is Crucial

NeuroSci: Slide 22 Speech Recognition Architectures

(The core-components overview from Slide 11 is repeated here as a section divider.)

NeuroSci: Slide 23 Language is Redundant
- Written languages such as English are redundant: words and phrases can be guessed even when many letters are missing. Logographic languages do not share this property.
- Some languages are inflected (words change according to grammatical function). Some languages do not mark word boundaries (e.g., with spaces) in text.
- English as a spoken language is considered to be of average difficulty for automated speech recognition.
- Combinations of words, known as N-grams, are a simple and powerful, yet imperfect, way to model spoken English.

NeuroSci: Slide 24 Language Defies a Mathematical Description

Finite state machines are one of many types of grammar formalisms that can be used to process language. We categorize these formalisms by their generative capacity (the Chomsky hierarchy):

| Type of Grammar   | Constraints     | Automata                                       |
|-------------------|-----------------|------------------------------------------------|
| Phrase Structure  | A -> B          | Turing Machine (unrestricted)                  |
| Context-Sensitive | aAc -> aBc      | Linear Bounded Automata (N-grams, Unification) |
| Context-Free      | A -> w, A -> BC | Pushdown Automata (CFG, BNF, JSGF, RTN)        |
| Regular           | A -> w, A -> wB | Finite State Automata (transducers)            |

CFGs offer a good compromise between parsing efficiency and representational power, and provide a natural bridge between speech recognition and natural language processing.

NeuroSci: Slide 25 The Best and Worst of N-grams Bigram Language Model: the probability of a word sequence is factored into a product of its bigrams.
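As a concrete (and hedged) illustration with an invented toy corpus: maximum-likelihood bigram estimates are just normalized counts. Real systems add smoothing and backoff so unseen bigrams do not receive probability zero, which is exactly where the "worst" of N-grams shows up.

```python
from collections import Counter

corpus = "show me the pages show me the web pages".split()  # toy corpus
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w):
    # MLE estimate P(w | w_prev) = count(w_prev, w) / count(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def p_sentence(words):
    p = unigrams[words[0]] / len(corpus)          # unigram prior on first word
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(prev, w)
    return p

print(p_sentence("show me the web pages".split()))
```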

NeuroSci: Slide 26 Speech Recognition Architectures

(The core-components overview from Slide 11 is repeated here as a section divider.)

NeuroSci: Slide 27 Search Algorithms are Based on Dynamic Programming
- Search is time synchronous and "left-to-right."
- Arbitrary amounts of silence must be permitted between each word.
- Words are hypothesized many times with different start/stop times, which significantly increases search complexity.
- Finding optimal solutions is expensive; suboptimal solutions work well in practice.
- Search complexity must be linear with respect to length/duration to be practical.
- Most systems use multiple passes and invoke several search algorithms.
- Lookahead and pruning are essential parts of search (sketched below).
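A hedged sketch of the dynamic-programming core: time-synchronous Viterbi decoding over HMM states, with a simple beam that prunes hypotheses falling too far below the current best score. The model values are toys, not from the talk.

```python
import numpy as np

def viterbi_beam(log_A, log_B, log_pi, obs, beam=10.0):
    """Best state path through an HMM; scores are log probabilities."""
    score = log_pi + log_B[:, obs[0]]
    back = []
    for o in obs[1:]:
        cand = score[:, None] + log_A               # cand[i, j]: from state i to j
        best_prev = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_B[:, o]
        score[score < score.max() - beam] = -np.inf  # beam pruning
        back.append(best_prev)
    path = [int(score.argmax())]                     # trace back the best path
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1], score.max()

logA = np.log([[0.7, 0.3], [0.4, 0.6]])
logB = np.log([[0.9, 0.1], [0.2, 0.8]])
path, best = viterbi_beam(logA, logB, np.log([0.5, 0.5]), [0, 1, 0])
print(path, best)
```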

NeuroSci: Slide 28 Hierarchical Search vs. Finite State Transducers
- Breadth-first, time-synchronous hierarchical search is very convenient for integrating linguistic constraints.
- Efficient Viterbi search of a hierarchical network is a much more complicated problem because of ambiguity in the network (e.g., the same word sequence can appear in multiple places in the network).
- Special care must be taken to synchronize all hypotheses so each acoustic model is evaluated as few times as possible. Since many hypotheses might need the same phone at the same time, coordinating this search becomes a nontrivial problem.
- Finite state transducers, which compile the hierarchical network into one large, flat network, are now commonly used, trading memory for speed.

NeuroSci: Slide 29 Cross-Word Decoding Using Lexical Trees
- Cross-word decoding: since word boundaries are not acoustically marked in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
- Cross-word decoding significantly increases memory requirements.
- The lexicon can be converted to a tree structure (a lexical tree, sketched below) to improve efficiency.
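A minimal sketch of the lexical-tree idea with an invented three-word lexicon: pronunciations sharing a prefix of phones share one path, so shared phones are evaluated once, and the word identity is only known at a leaf.

```python
# Toy pronunciation lexicon (phone sequences are illustrative).
lexicon = {
    "start": ["s", "t", "aa", "r", "t"],
    "stars": ["s", "t", "aa", "r", "z"],
    "stop":  ["s", "t", "aa", "p"],
}

tree = {}
for word, phones in lexicon.items():
    node = tree
    for ph in phones:
        node = node.setdefault(ph, {})   # shared prefixes reuse existing nodes
    node["#"] = word                     # word identity attached at the leaf

print(tree)   # the s -> t -> aa prefix is stored (and searched) only once
```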

NeuroSci: Slide 30 Speech Recognition Architectures

(The core-components overview from Slide 11 is repeated here.) What applications can be built from this type of technology? Speech recognition applications continue to evolve from simple speech-to-text to complex information retrieval tasks.

NeuroSci: Slide 31 Analytics

Definition: a tool or process that allows an entity (e.g., a business) to arrive at an optimal or realistic decision based on existing data (Wiki).

Google is building a highly profitable business around analytics derived from people using its search engine. Any time you access a web page, you are leaving a footprint of yourself, particularly with respect to what you like to look at. This allows advertisers to tailor their ads to your personal interests by adapting web pages to your habits.

Web sites such as amazon.com, netflix.com and pandora.com have taken this concept of personalization to the next level. As people do more browsing from their telephones, which are now GPS enabled, an entirely new class of applications is emerging that can track your location, your interests and your network of "friends."

NeuroSci: Slide 32 Information Retrieval From Voice Enables Analytics

[Diagram: audio is processed by Speech Activity Detection, Language Identification, Gender Identification, Speaker Identification, Speech to Text and Keyword Search; the outputs feed Entity Extraction and Relationship Analysis into a Relational Database, answering questions such as "What is the number one complaint of my customers?"]

NeuroSci: Slide 33 Speech Recognition is Information Extraction

Traditional outputs:
- best word sequence
- time alignment of information

Other outputs:
- word graphs
- N-best sentences
- confidence measures
- metadata such as speaker identity, accent, and prosody

Applications:
- information localization
- data mining
- emotional state (stress, fatigue, deception)

NeuroSci: Slide 34 Predicting User Preferences

Mathematical models of your preferences can be used to generate alternatives for you that are consistent with your previous choices (or the choices of people like you). Such models are referred to as generative models because they can spontaneously generate new data that is statistically consistent with previously collected data.

Alternately, you can build graphs in which movies are nodes and links represent connections between movies judged to be similar (sketched below). Some sites, such as Pandora, allow you to continuously rate choices and adapt the mathematical models of your preferences in real time. This area of science is known as adaptive systems, and deals with algorithms for rapidly adjusting to new data.
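A hedged sketch of the movies-as-nodes idea (the ratings matrix and the 0.8 threshold are invented for illustration): link two movies whenever users rate them similarly, here measured by cosine similarity of their rating vectors.

```python
import numpy as np

ratings = np.array([        # rows: users, columns: movies M0..M3 (toy data)
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Add an edge between two movies when their rating vectors are similar.
edges = [(i, j) for i in range(4) for j in range(i + 1, 4)
         if cosine(ratings[:, i], ratings[:, j]) > 0.8]
print(edges)   # here (0, 1) and (2, 3) end up linked
```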

NeuroSci: Slide 35 Content-Based Searching

Once the underlying data is analyzed and "marked up" with metadata that reveals content such as language and topic, search engines can match based on meaning. Such sites make use of several human language technologies and allow you to search multiple types of media (e.g., audio tracks of broadcast news). This is an emerging area for the next-generation Internet.

NeuroSci: Slide 36 Applications Continually Find New Uses for the Technology
- Real-time translation of news broadcasts in multiple languages (DARPA GALE)
- Google search using voice queries
- Keyword search of audio and video
- Real-time speech translation in 54 languages
- Monitoring of communications networks for military and homeland security applications

NeuroSci: Slide 37 Future Directions

How do we get better?
- Supervised transcription is slow, expensive and limited; unsupervised learning on large amounts of data is viable.
- More data, more data, more data... YouTube is opening new possibilities; courtroom and governmental proceedings are providing significant amounts of parallel text; Google???
- But this type of data is imperfect... and learning algorithms are still very primitive.
- And neuroscience has yet to inform our learning algorithms!

NeuroSci: Slide 38 Brief Bibliography of Related Research

S. Pinker, The Language Instinct: How the Mind Creates Language, William Morrow and Company, New York, New York, USA, 1994.
B.H. Juang and L.R. Rabiner, "Automatic Speech Recognition - A Brief History of the Technology," Elsevier Encyclopedia of Language and Linguistics, 2nd Edition, 2005.
M. Benzeghiba, et al., "Automatic Speech Recognition and Speech Variability: A Review," Speech Communication, vol. 49, pp. 763-786, October 2007.
B.J. Kroger, et al., "Towards a Neurocomputational Model of Speech Production and Perception," Speech Communication, vol. 51, no. 9, September 2009.
B. Lee, "The Biological Foundations of Language" (a review paper).
M. Gladwell, Blink: The Power of Thinking Without Thinking, Little, Brown and Company, New York, New York, USA, 2005.

NeuroSci: Slide 39 Biography

Joseph Picone received his Ph.D. in Electrical Engineering in 1983 from the Illinois Institute of Technology. He is currently Professor and Chair of the Department of Electrical and Computer Engineering at Temple University. He recently completed a three-year sabbatical at the Department of Defense where he directed human language technology research and development. His primary research interests are currently machine learning approaches to acoustic modeling in speech recognition. For over 25 years he has conducted research on many aspects of digital speech and signal processing. He has also been a long-term advocate of open source technology, delivering one of the first state-of-the-art open source speech recognition systems, and maintaining one of the more comprehensive web sites related to signal processing. His research group is known for producing many innovative educational materials that have increased access to the field.

Dr. Picone has previously been employed by Texas Instruments and AT&T Bell Laboratories, including a two-year assignment in Japan establishing Texas Instruments' first international research center. He is a Senior Member of the IEEE, holds several patents in this area, and has been active in several professional societies related to human language technology.