
Classification and Ranking Approaches to Discriminative Language Modeling for ASR
Erinç Dikici, Murat Semerci, Murat Saraçlar, Ethem Alpaydın
IEEE Transactions, 2013
Presenter: 郝柏翰, 2013/01/28

Outline
Introduction, Methods, Experiments, Conclusion

Introduction
Discriminative language modeling (DLM) is a feature-based approach used as an error-correcting step after hypothesis generation in automatic speech recognition (ASR). We formulate DLM both as a classification and as a ranking problem, and employ the perceptron, the margin infused relaxed algorithm (MIRA), and the support vector machine (SVM). To decrease training complexity, we apply count-based thresholding for feature selection and data sampling from the list of hypotheses.

Discriminative Language Modeling Using Linear Models
Each example in the training set is represented by a feature vector derived from the acoustic input x and the candidate hypothesis y. In testing, this vector is used to select the highest-scoring hypothesis (the decoding rule is sketched below). Learning corresponds to optimizing the weight vector w, and this can be defined as either a classification or a ranking problem.
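The scoring equation on this slide did not survive transcription. A minimal reconstruction of the standard DLM decoding rule, assuming Φ(x, y) denotes the joint feature vector and GEN(x) the N-best list produced by the recognizer for input x:

```latex
\hat{y} = \operatorname*{arg\,max}_{y \in \mathrm{GEN}(x)} \; \mathbf{w} \cdot \Phi(x, y)
```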

Classification Algorithms
1) Perceptron: The idea of the perceptron algorithm is to reward features associated with the gold-standard hypothesis and to penalize features associated with the current best. To increase model robustness, the weights are averaged over the number of utterances (I) and epochs (T); the update and averaging are sketched below.
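A minimal Python sketch of the averaged structured-perceptron training loop described above, under stated assumptions: the feature function `phi`, the oracle (lowest-WER hypothesis in each N-best list), and the data layout are hypothetical stand-ins, not the paper's exact implementation.

```python
import numpy as np

def averaged_perceptron(train_set, phi, dim, epochs=1):
    """train_set: list of (x, hyps, oracle) tuples, where `oracle`
    is the lowest-WER hypothesis in the N-best list `hyps`."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    n_seen = 0
    for _ in range(epochs):                      # T epochs
        for x, hyps, oracle in train_set:        # I utterances
            # current best hypothesis under the model
            z = max(hyps, key=lambda y: w @ phi(x, y))
            if z != oracle:
                # reward gold-standard features, penalize current best
                w += phi(x, oracle) - phi(x, z)
            w_sum += w                           # accumulate for averaging
            n_seen += 1
    return w_sum / n_seen                        # average over I * T steps
```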

Classification Algorithms
2) MIRA: The margin is defined as the minimum difference between the dot products, and the aim is to train a classifier with a large margin. Binary MIRA iteratively updates a single prototype w, just like the perceptron. Here the learning rates τ are hypothesis-specific and are found by solving a per-hypothesis optimization problem; the usual closed-form solution is sketched below.
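A hedged reconstruction of the hypothesis-specific learning rate, using the usual closed-form MIRA solution; L(y, z) is assumed to be the loss (e.g., the word-error difference) between the oracle y and the hypothesis z, and C an upper bound on the step size. The paper's exact loss and clipping terms may differ.

```latex
\tau = \min\!\left( C,\;
  \frac{L(y, z) - \mathbf{w} \cdot \big(\Phi(x, y) - \Phi(x, z)\big)}
       {\lVert \Phi(x, y) - \Phi(x, z) \rVert^{2}} \right),
\qquad
\mathbf{w} \leftarrow \mathbf{w} + \tau \,\big(\Phi(x, y) - \Phi(x, z)\big)
```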

Classification Algorithms
Note that binary MIRA cannot be directly applied to our problem, since we do not have the true labels of the instances. We therefore propose single- and multiple-update versions of MIRA for structured prediction. The single-update version updates only when the current best z is not the oracle y; the multiple-update version scans over pairs of the oracle y_i and all other hypotheses y_k. Both strategies are sketched below.
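A schematic Python sketch contrasting the two update strategies; `mira_step` stands in for the τ-scaled update shown earlier and is a hypothetical helper, as are the argument names.

```python
def mira_single_update(w, x, hyps, oracle, phi, mira_step):
    # update only against the current best, and only if it differs from the oracle
    z = max(hyps, key=lambda y: w @ phi(x, y))
    if z != oracle:
        w = mira_step(w, phi(x, oracle), phi(x, z))
    return w

def mira_multiple_update(w, x, hyps, oracle, phi, mira_step):
    # scan over pairs of the oracle and every other hypothesis in the list
    for y_k in hyps:
        if y_k != oracle:
            w = mira_step(w, phi(x, oracle), phi(x, y_k))
    return w
```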

Classification Algorithms
3) SVM: The SVM is also a linear classifier; its aim is to find a separating hyperplane that maximizes the margin between the nearest samples of the two classes, obtained by solving a constrained optimization problem (a standard form is sketched below). C is a user-defined trade-off parameter between margin violations and smoothness. It is possible to assign different C values to the positive and negative classes, especially when the classes are not balanced. In our implementation, we choose all hypotheses having the lowest error rate in their N-best list as the positive examples, and the rest as the negative examples.
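A standard soft-margin SVM primal with class-dependent penalties C+ and C-, consistent with the description above; the paper's exact formulation (e.g., whether a bias term is used) may differ. Here t_i in {+1, -1} is the class label assigned to hypothesis y_i.

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\;
\frac{1}{2}\lVert \mathbf{w} \rVert^{2}
 + C^{+} \sum_{i:\, t_i = +1} \xi_i
 + C^{-} \sum_{i:\, t_i = -1} \xi_i
\quad \text{s.t.} \quad
 t_i \big( \mathbf{w} \cdot \Phi(x_i, y_i) + b \big) \ge 1 - \xi_i,
 \qquad \xi_i \ge 0
```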

Ranking Algorithms
Using classification, we define DLM as the separation of better examples from worse. But the entries in the list have different degrees of goodness, so it is more natural to regard DLM as reranking of the list, where the number of word errors provides the target ranking for each possible transcription. Note that rank ordering is the opposite of numeric ordering: if r_a = 1 and r_b = 2, then r_a > r_b, i.e., a is ranked better than b.

Ranking Algorithms
1) Ranking Perceptron
2) Ranking MIRA
Both are pairwise variants of their classification counterparts, updating on mis-ordered hypothesis pairs rather than on a single oracle/best pair; a generic sketch of the pairwise update follows.
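The update equations for the ranking perceptron and ranking MIRA appeared as figures on the original slide and were lost in transcription. As a hedged sketch of the idea only (the paper's exact margin and step-size terms may differ): for any two hypotheses y_j, y_k of the same utterance with y_j ranked better (r_j < r_k numerically), update when the model mis-orders them,

```latex
\text{if } \mathbf{w} \cdot \Phi(x, y_j) \le \mathbf{w} \cdot \Phi(x, y_k)
\quad \Longrightarrow \quad
\mathbf{w} \leftarrow \mathbf{w} + \eta \,\big( \Phi(x, y_j) - \Phi(x, y_k) \big)
```

with a fixed learning rate η for the ranking perceptron and a pair-specific τ (as in MIRA) for the ranking MIRA.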

Ranking Algorithms
3) Ranking SVM: The ranking SVM algorithm is a modification of the classical SVM setup to handle the reranking problem. The pairwise constraint (Eq. (23) in the paper) implies that the ranking optimization can also be viewed as an SVM classification problem on pairwise difference vectors. In that sense, the algorithm tries to find a large-margin linear function that minimizes the number of pairs of training examples that need to be swapped to achieve the desired ranking.
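A ranking-SVM formulation in the style of Joachims' SVMrank, which matches the pairwise-difference interpretation above; the paper's Eq. (23) is presumably of this form, though details may differ. The constraints range over all hypothesis pairs of the same utterance where y_j is ranked better than y_k (r_j < r_k numerically).

```latex
\min_{\mathbf{w},\, \boldsymbol{\xi}} \;\;
\frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i, j, k} \xi_{ijk}
\quad \text{s.t.} \quad
\mathbf{w} \cdot \big( \Phi(x_i, y_j) - \Phi(x_i, y_k) \big) \ge 1 - \xi_{ijk},
\qquad \xi_{ijk} \ge 0
```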

Data Sampling
1) Uniform Sampling (US): directly decreases the number of instances (a sketch follows this list).
2) Rank Grouping (RG): keeps at least one instance from every available rank.
3) Rank Clustering (RC): guarantees a decrease in both the number of instances and the number of ranks.
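A minimal Python sketch of the uniform-sampling idea, under the assumption that hypotheses are first sorted by word error and that "uniform" means evenly spaced picks over that sorted list; the paper's exact scheme may differ.

```python
def uniform_sample(hyps_sorted_by_we, m):
    """Keep m hypotheses spaced evenly across a WE-sorted N-best list.
    One plausible reading of uniform sampling (US); an assumption,
    not the paper's verified procedure."""
    n = len(hyps_sorted_by_we)
    if m >= n:
        return list(hyps_sorted_by_we)
    step = n / m
    return [hyps_sorted_by_we[int(i * step)] for i in range(m)]
```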

Experimental Setup
We use a Turkish broadcast news data set of approximately 194 hours.

              Utterances    Words/Morphs    Time (hours)
  Training    105,
  Held-out    1,947         23k words       3.1
  Testing     1,784         32k morphs      3.3

Using this setup, the generative baseline word error rates on the held-out and test sets are 22.93% and 22.37%, respectively.

Increasing the Number of Features
As the figure shows, for all algorithms, adding bigram (2g) and trigram (3g) features increases rather than decreases the error rates, most likely due to the lack of sufficient training data, which leads to overfitting.

Dimensionality Reduction
Reducing the number of dimensions eases the curse of dimensionality and drastically decreases space and time complexity. It also provides better generalization by eliminating the effect of noise in redundant features.

Sampling From the N-Best List
For the perceptron (Per), using only two examples (the best and the worst in terms of word error) is as good as using all the examples in the list. The SVM, the weakest method in the 50-best setting, prefers a sampled training set but cannot compete with the perceptron.

Comparison of CPU Times
Although SVMrank training takes on the order of hours, an intelligent sampling technique over the N-best list makes it possible to decrease the number of training instances, and thus the CPU time, considerably, while keeping the WERs in a tolerable region.

Conclusion
We see first that ranking leads to lower WER than classification; this is true for all algorithms. The disadvantage is that ranking takes more time, which is where feature selection and sampling become more interesting. Ranking algorithms, though more accurate, use more features than classification algorithms, because they make more updates, whereas the classification algorithms make only one update per hypothesis set. Using higher-order N-gram data does not improve performance but increases the complexity and the training time.

Conclusion
It is possible to do feature selection and use a subset of the unigram features, thereby decreasing complexity without losing accuracy. The unigram representation is very sparse and high-dimensional, and an interesting future research direction is to represent such a long sparse vector using fewer features.