A few thoughts about ASAT

Some slides from NSF workshop presentation on knowledge integration
Thoughts about “islands of certainty”
Neural networks: the good, the bad, and the ugly
Short intro to the OSU team du jour
Outline (or, rather, my list of questions)

What is Knowledge Integration (KI)?
How has KI influenced ASR to date?
Where should KI be headed?
–What types of cues should we be looking for?
–How should cues be combined?
What is Knowledge Integration?

It means different things to different people
–Combining multiple hypotheses
–Bringing linguistic information to bear in ASR
Working definition:
–Combining multiple sources of evidence to produce a final (or intermediate) hypothesis
–The traditional ASR process already uses KI: it combines acoustic, lexical, and syntactic information
But this is only the tip of the iceberg
KI examples in ASR

[Diagram: feature calculation feeds acoustic, pronunciation (dog: dog, mail: mAl, the: D&, DE, …), and language (bigram: cat dog, cat the, the cat, the dog, the mail, …) models, which SEARCH combines into “The cat chased the dog”]
Acoustic model gives state hypotheses from features
Search integrates knowledge from acoustic, pronunciation, and language models
Statistical models have “simple” dependencies: P(X|Q) P(Q|W) P(W)
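The traditional decomposition P(X|Q) P(Q|W) P(W) on this slide can be sketched as a log-space score combination, which is how search actually ranks word sequences. This is a minimal illustration: the function name and all probability values are made up, and the language-model weight is a common practical addition rather than part of the pure formula.

```python
import math

def score_hypothesis(log_p_x_given_q, log_p_q_given_w, log_p_w, lm_weight=1.0):
    """Combine acoustic P(X|Q), pronunciation P(Q|W), and language P(W)
    scores in log space, as search does when ranking word sequences."""
    return log_p_x_given_q + log_p_q_given_w + lm_weight * log_p_w

# Toy comparison of two competing hypotheses (all numbers illustrative):
h1 = score_hypothesis(math.log(0.02), math.log(0.9), math.log(0.001))
h2 = score_hypothesis(math.log(0.03), math.log(0.5), math.log(0.0005))
# Here hypothesis 1 wins despite a worse acoustic score, because the
# pronunciation and language models favor it.
```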
KI: Statistical Dependencies

“Side information” from the speech waveform
–Speaking rate
–Prosodic information
–Syllable boundaries
KI: Statistical Dependencies

Information from sources outside the “traditional” system
–Class n-grams, CFG/Collins-style parsers
–Sentence-level stress
–Vocal-tract length normalization
KI: Statistical Dependencies

Information from “internal” knowledge sources
–Pronunciations with multi-words, LM probabilities
–State-level pronunciation modeling
–Buried Markov Models
KI: Statistical Dependencies

Information from errors made by the system
–Discriminative acoustic, pronunciation, and language modeling
KI: Model Combination

Integrate multiple “final” hypotheses
–ROVER
–Word sausages (Mangu et al.)
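The ROVER idea of integrating multiple final hypotheses can be sketched as word-level voting. Real ROVER first aligns the system outputs into a word transition network and can weight votes by confidence; the toy below assumes the outputs are already aligned slot-by-slot and only shows the majority-vote step (the function name and example sequences are made up).

```python
from collections import Counter

def rover_vote(aligned_outputs):
    """Pick the majority word at each aligned slot across systems.
    Only the voting step of ROVER; alignment is assumed done."""
    result = []
    for slot in zip(*aligned_outputs):
        word, _count = Counter(slot).most_common(1)[0]
        result.append(word)
    return result

hyps = [
    ["the", "cat", "chased", "the", "dog"],
    ["the", "cat", "chase",  "the", "dog"],
    ["a",   "cat", "chased", "the", "dog"],
]
consensus = rover_vote(hyps)  # majority wins at every slot
```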
KI: Model Combination

Combine multiple “non-final” hypotheses
–Multi-stream modeling
–Synchronous phonological feature modeling
–Boosting
–Interpolated language models
Summary: Current uses of KI

Probability conditioning: P(A|B) → P(A|B,X,Y,Z)
–More refined (accurate?) models
–Can complicate the overall equation
Model merging: P(A|B) → f(P1(A|B), w1) + f(P2(A|B), w2)
–Different views of the information are (usually) good
–But sometimes combination methods are not as principled as one would like
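The simplest instance of the model-merging pattern f(P1(A|B), w1) + f(P2(A|B), w2) is linear interpolation, as used for interpolated language models. A minimal sketch, with made-up probabilities; the weight would normally be tuned on held-out data:

```python
def interpolate(p1, p2, w1=0.5):
    """Linear interpolation of two estimates of P(A|B):
    w1 * P1(A|B) + (1 - w1) * P2(A|B)."""
    return w1 * p1 + (1.0 - w1) * p2

# Two models disagree about P(A|B); the merged estimate lies between them.
p = interpolate(0.2, 0.4, w1=0.25)  # 0.25*0.2 + 0.75*0.4 = 0.35
```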
Where should we go from here?

As a field, we have investigated many sources of knowledge
–We learn more about language this way (cf. the “more data is better data” school)
To make an impact we need
–A common framework
–Easy ways to combine knowledge
–“Interesting” sources of knowledge
KI in Event-Driven ASR

Phonological features as events (from Chin’s proposal)
[Diagram: detector streams over the word “can’t” — consonant/vowel/nasal, back/alveolar place, mid-low height, closure/burst events]
KI in Event-Driven ASR

Integrating multiple detectors
–Easy if the detectors are of the same type, e.g., P(back|detector1) and P(back|detector2)
–Use both conditioning and model combination
KI in Event-Driven ASR

Integrating multiple cross-type detectors, e.g., P(k|features)
–Simplest approach: use the Naïve Bayes assumption
P(X|e1,e2,e3) = P(e1|X) P(e2|X) P(e3|X) P(X) / Z
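The Naïve Bayes combination above can be written out directly: multiply the prior by each per-detector likelihood and normalize by Z. A minimal sketch; the class labels and all probabilities below are illustrative, not real detector outputs.

```python
def naive_bayes_combine(likelihoods, priors):
    """P(X|e1..en) ∝ P(X) * Π_i P(ei|X), normalized over classes by Z.

    likelihoods: class -> list of per-detector likelihoods [P(e1|X), ...]
    priors:      class -> P(X)
    """
    unnorm = {}
    for x, ls in likelihoods.items():
        score = priors[x]
        for l in ls:
            score *= l          # independence assumption: product of evidence
        unnorm[x] = score
    z = sum(unnorm.values())    # the normalizer Z
    return {x: s / z for x, s in unnorm.items()}

# Three detector events favoring /k/ over a competitor /t/:
post = naive_bayes_combine(
    {"k": [0.8, 0.7, 0.9], "t": [0.3, 0.4, 0.2]},
    {"k": 0.5, "t": 0.5},
)
```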
KI in Event-Driven ASR

Breakdown in Naïve Bayes
–Detectors aren’t always independent
–E.g., feature spreading correlated with vowel raising calls for a new, non-independent detector
KI in Event-Driven ASR

Wanted: a Gestalt detector
–View the overall shape of the detector streams, i.e., estimate P(can’t | the whole pattern of streams)
The Challenge of Plug-n-Play

Shouldn’t have to re-learn the entire system every time a new detector is added
–Can’t have one global P(can’t | all variables)
–Changes should be localized, which implies the need for hierarchical structure
Composition structure should enable combination of radically different forms of information
–E.g., audio-visual speech recognition
The Challenge of Plug-n-Play

Perhaps we need three types of structures
–Event integrators: “Is this a CVC syllable?” Problems like feature spreading become local
–Hypothesis generators: “I think the word ‘can’t’ is here.” Combine evidence from top-level integrators
–Hypothesis validators: “Is this hypothesis consistent?” Language model, word boundary detection, …
Still probably have Naïve Bayes problems
What type of detectors should we be thinking about?

Phonological features
Phones
Syllables? Words? Function words?
Syllable/word boundaries
Prosodic stress
… and a whole bunch of other things
–We’ve already looked at a number of them
–And Jim’s already made some of these points
Putting it all together

Huge multi-dimensional graph search
Should not be strictly “left-to-right”
–“Islands of certainty”
–People tend to emphasize the important words, and we can usually detect them better
–Work backwards to firm up uncertain segments
Summary

As a field, we have looked at many influences on our probabilistic models
We have gained expertise in
–Probability conditioning
–Model combination
Event-driven ASR may provide a challenging but interesting framework for incorporating different ideas
Thoughts about “islands of certainty”
We can’t parse everything

At least not on the first pass
Need to find ways to cleverly reduce computation: center around things we’re sure about
–Can we use confidence values from “light” detectors and refine? (likely)
–Can we use external sources of knowledge to help guide search? (likely)
Word/syllable onset detection

Several lines of evidence point to cues that can help with word segmentation
–Psychology experiments suggest that phonotactics plays a big role (e.g., Saffran et al.)
–Shire (at ICSI) was able to train a fairly reliable syllable boundary detector from acoustics
–Syllable onsets are pronounced more canonically than nuclei or codas: 84% vs. 65% on Switchboard, 90% vs. 62%/80% on TIMIT (Fosler-Lussier et al. 99)
Can we build “island of certainty” models by looking at a combination of acoustic/phonetic factors?
Pronunciation numbers
Integrating multiple units

Naïve method: just try to combine everything in sight
Refined method: process left to right, but over a buffer (e.g., 0.5-2 sec)
–Look for islands
–Back-fit the other material in a way that makes sense given the islands
–Can use external measures like speaking rate to validate the likelihood of the inferred structure
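The “look for islands” step in a buffer can be sketched as finding contiguous high-confidence runs; the decoder would then anchor on those runs and back-fit the uncertain gaps between them. A minimal sketch: the function name, the threshold, and the confidence scores are all made-up illustrations, not output of a real detector.

```python
def find_islands(confidences, threshold=0.8):
    """Return (start, end) index pairs of contiguous runs whose
    confidence meets the threshold: the 'islands of certainty'
    inside one buffer of segment confidences."""
    islands, start = [], None
    for i, c in enumerate(confidences):
        if c >= threshold and start is None:
            start = i                      # island begins
        elif c < threshold and start is not None:
            islands.append((start, i - 1))  # island ends
            start = None
    if start is not None:
        islands.append((start, len(confidences) - 1))
    return islands

# One buffer of per-segment confidences from a "light" detector:
islands = find_islands([0.3, 0.9, 0.95, 0.4, 0.2, 0.85, 0.9, 0.1])
# Two islands found; the segments between them get back-fit later.
```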
Neural nets

ANNs are good as non-linear discriminators
But they have a problem: when they’re wrong, they are often REALLY wrong
–Example: training on TI digits (30 phones, easy)
–CV frame-level margin: P(correct) - P(next competitor)
–9% of frames have margin < -0.4, 8% have margin -0.4 to 0, 8% have margin 0 to 0.4, 75% have margin > 0.4
Could chalk this up to “pronunciation variation”
Current thinking: if training were more responsive to the margin, we might move some of that 9% upward
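The frame-level margin defined on this slide, P(correct) minus the probability of the strongest competitor, is simple to compute from a frame’s posterior vector. A minimal sketch with made-up posteriors; a margin below -0.4 is the “REALLY wrong” tail discussed above.

```python
def frame_margin(posteriors, correct_idx):
    """Margin = P(correct class) - P(strongest competing class).
    Negative: the net prefers a wrong phone for this frame.
    Large positive: a confident correct decision."""
    competitors = [p for i, p in enumerate(posteriors) if i != correct_idx]
    return posteriors[correct_idx] - max(competitors)

m_good = frame_margin([0.05, 0.85, 0.10], correct_idx=1)  # confident, correct
m_bad = frame_margin([0.70, 0.20, 0.10], correct_idx=1)   # badly wrong frame
```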
Current personnel

Me
Keith as consultant
Anton Rytting (Linguistics): part-time senior grad student; works on word segmentation in Greek; currently twisting his arm
Linguistics student TBA 1/05
Incoming students (we’ll see who works out)
–1 ECE student (signal processing)
–2 CSE students (MS in reinforcement learning, BA in genetic algorithms)