1 Less is More? Yi Wu Advisor: Alex Rudnicky

2 People say: "There is no data like more data!"

3 Goal: Use Less to Perform More. Identify an informative subset of a large corpus for acoustic model (AM) training. Expectations for the selected set: good in performance, fast in selection.

4 Motivation: The improvement in the system becomes increasingly small as we keep adding data. Training an acoustic model is time consuming. We need guidance on which data is most needed.

5 Approach Overview: Applied to well-transcribed data. Selection is based on the transcription. Choose a subset that has a "uniform" distribution over speech units (words, phonemes, characters).

6 How to sample data wisely? -- A simple example: k Gaussian classes with known priors ω_i and unknown density functions f_i(μ_i, σ_i).
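
Spelled out, assuming univariate Gaussian class-conditional densities (a reading of the slide, not its exact notation):

    P(\text{class} = i) = \omega_i, \qquad x \mid \text{class} = i \sim \mathcal{N}(\mu_i, \sigma_i^2), \qquad i = 1, \dots, k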

7 How to sample wisely? -- A simplified example: We are given access to at most N examples and may choose how many we take from each class. We train the model with the MLE estimator. When a new sample is generated, we use our model to determine its class. Question: how should we sample to achieve minimum error?

8 The optimal Bayes classifier: if we have the exact form of each f_i(x), the classification rule on this slide is optimal.
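
The rule itself appears only as an image in the source slides; in the notation of slide 6, the standard MAP rule for this setup would be

    \hat{c}(x) = \arg\max_{i \in \{1, \dots, k\}} \; \omega_i \, f_i(x)

i.e., assign x to the class with the largest prior-weighted class-conditional density.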

9 To approximate the optimal classifier, we plug in our MLE estimates. The true error is then bounded by the optimal Bayes error plus an error term for our worst-estimated class.
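
One standard way to make this precise for a plug-in rule built from estimates (a hedged reconstruction, not necessarily the exact bound intended on the slide):

    R(\hat{c}) - R^{*} \;\le\; \sum_{i=1}^{k} \int \bigl| \omega_i f_i(x) - \hat{\omega}_i \hat{f}_i(x) \bigr| \, dx

Here R^{*} is the Bayes error, so the excess error is controlled by the quality of the worst-estimated class density, which is what motivates giving every class enough samples.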

10 Sample uniformly: we want to sample each class equally. The selected data will then have good coverage of each class, which gives a robust estimate for each class.

11 The Real ASR system

12 Data selection for an ASR system: the prior over word sequences is estimated independently by the language model. To make the acoustic model accurate, we want to sample W uniformly. The unit can be a phoneme, a character, or a word; we want its distribution to be uniform.
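
For reference, the standard decomposition this slide relies on, with W the word sequence and X the acoustic observations:

    \hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \, P(X \mid W) \, P(W)

P(W) is the language-model prior mentioned above; P(X | W) is the acoustic model that the selected data is used to train.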

13 Entropy: a measure of "uniformness". Use the entropy of the word (phoneme) distribution for evaluation. Suppose the words (phonemes) have a sample distribution p_1, p_2, ..., p_n. Choose the subset that maximizes the entropy H = -p_1 log(p_1) - p_2 log(p_2) - ... - p_n log(p_n). Maximizing entropy is equivalent to minimizing the KL divergence from the uniform distribution, since KL(p || uniform) = log(n) - H(p).
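
As a concrete illustration of this objective (not code from the original system; the function and variable names are made up), the following Python sketch computes the entropy of the empirical unit distribution of a selected transcription set:

    import math
    from collections import Counter

    def unit_entropy(transcripts, tokenize):
        # Entropy (in nats) of the empirical unit distribution
        # over the selected transcripts.
        counts = Counter()
        for sentence in transcripts:
            counts.update(tokenize(sentence))
        total = sum(counts.values())
        return -sum((c / total) * math.log(c / total) for c in counts.values())

    # Example: word-level entropy of a toy selection.
    selected = ["less is more", "there is no data like more data"]
    h = unit_entropy(selected, tokenize=str.split)

The same function can be applied with a phoneme or character tokenizer to obtain the other unit distributions.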

14 Computational issue: it is computationally intractable to find the transcription subset that maximizes the entropy, so we use a forward greedy search.
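
A minimal sketch of such a forward greedy search, assuming each candidate utterance is represented by a Counter of its unit counts (names and structure are illustrative, not the original implementation):

    import math
    from collections import Counter

    def entropy_of_counts(counts):
        # Entropy (in nats) of the distribution implied by unit counts.
        total = sum(counts.values())
        return -sum((c / total) * math.log(c / total)
                    for c in counts.values() if c > 0)

    def greedy_select(utterance_counts, budget):
        # utterance_counts: one Counter per candidate utterance (assumed non-empty).
        # budget: number of utterances to select.
        selected, accumulated = [], Counter()
        remaining = set(range(len(utterance_counts)))
        for _ in range(min(budget, len(utterance_counts))):
            # Add the utterance that maximizes the entropy of the
            # accumulated unit distribution.
            best, best_h = None, float("-inf")
            for i in remaining:
                h = entropy_of_counts(accumulated + utterance_counts[i])
                if h > best_h:
                    best, best_h = i, h
            selected.append(best)
            accumulated += utterance_counts[best]
            remaining.remove(best)
        return selected

This naive version rescans the whole pool at every step; a practical implementation would update the entropy incrementally and stop when a target number of hours, rather than utterances, is reached.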

15 Combination: there are multiple entropies (e.g., word, character, phoneme) we want to maximize. Combination methods: weighted sum, or add data sequentially (a sketch of the weighted sum follows).
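
A hedged sketch of the weighted-sum variant, reusing entropy_of_counts from the greedy-search sketch above (the unit names and weights are placeholders, not values from the experiments):

    def combined_entropy(counts_by_unit, weights):
        # counts_by_unit: e.g. {"word": Counter(...), "char": Counter(...), "phone": Counter(...)}
        # weights: e.g. {"word": 1.0, "char": 1.0, "phone": 1.0}
        return sum(weights[unit] * entropy_of_counts(counts)
                   for unit, counts in counts_by_unit.items())

The greedy search can then maximize this combined score instead of a single entropy.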

16 Experiment setup. System: Sphinx III. Features: 39-dimensional MFCC. Training corpus: Chinese BN 97 (30 hr) + Gale Y1 (810 hr). Test set: RT04 (60 min).

17 Experiment 1 (using the word distribution). Table 1 compares Random (all) and Max-entropy selection at several training-data sizes (hours); the numbers themselves are not preserved in this transcript.

18 More results: the same comparison at 30 h, 50 h, 100 h, 150 h, and 840 h, broken down by test source (CCTV and NTDTV broadcast news, RFA broadcast conversation) and by the BC/BN hour ratio of the selected data.

19 Experiment 2 (adding phoneme and character entropies sequentially, 150 hr). Table 2 reports results on CCTV, NTDTV, RFA, and ALL for Random (150 h), Max-entropy (word+char), Max-entropy (word+phone), and All data (840 hr).

20 Experiments 1 and 2.

21 Experiment 3 (with VTLN). Table 3 reports results on CCTV, NTDTV, RFA, and ALL for the 150 hr (word+phone) selection, with and without VTLN.

22 Summary: choose data uniformly according to speech units; maximize entropy using a greedy algorithm; add data sequentially. Future work: combine multiple sources; select un-transcribed data.