Random Forests for Language Modeling
Peng Xu and Frederick Jelinek
CLSP, The Johns Hopkins University
IPAM: January 24, 2006

What Is a Language Model?
A probability distribution over word sequences, built from conditional probability distributions: the probability of a word given its history (the past words).
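Concretely, the distribution over a whole word sequence factors by the chain rule into exactly these word-given-history terms; a minimal statement of the decomposition the slide refers to:

```latex
P(w_1, w_2, \ldots, w_N) \;=\; \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1})
```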

What Is a Language Model For?
Speech recognition: the source-channel model. Given the acoustic observation A, find the word sequence W* that maximizes P(W | A); the language model supplies the prior P(W).
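Written out, the source-channel decision rule the slide sketches (A is the acoustic observation, W a word sequence):

```latex
W^{*} \;=\; \arg\max_{W} P(W \mid A) \;=\; \arg\max_{W} P(A \mid W)\, P(W)
```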

n-gram Language Models
A simple yet powerful solution to language modeling: keep only the last (n-1) words of the history, giving the n-gram model, and use the Maximum Likelihood (ML) estimate.
Sparseness problem: training and test data are mismatched, and most n-grams are never seen in training, so smoothing is needed.
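The formulas referred to on the slide are not preserved in the transcript; the standard n-gram approximation and its Maximum Likelihood estimate, with C(·) counting occurrences in the training data, are:

```latex
P(w_i \mid w_1, \ldots, w_{i-1}) \;\approx\; P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}),
\qquad
P_{\mathrm{ML}}(w_i \mid w_{i-n+1}^{\,i-1}) \;=\; \frac{C(w_{i-n+1}^{\,i})}{C(w_{i-n+1}^{\,i-1})}
```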

Sparseness Problem
Example: UPenn Treebank portion of WSJ; 1 million words of training data, 82 thousand words of test data, 10-thousand-word open vocabulary.

  n-gram order:  3   4   5   6
  % unseen:      (values not preserved in this transcript)

Sparseness makes language modeling a difficult regression problem: an n-gram model needs at least |V|^n words of data to cover all n-grams.

More Data
More data is a natural answer to data sparseness, but:
The web has "everything", yet web data is noisy.
The web does NOT have everything: language models built from web data still suffer from data sparseness. [Zhu & Rosenfeld, 2001] In 24 random web news sentences, 46 out of 453 trigrams were not covered by Altavista.
In-domain training data is not always easy to get.

Dealing With Sparseness in n-grams
Smoothing: take some probability mass away from seen n-grams and distribute it among unseen n-grams.
Interpolated Kneser-Ney gives consistently the best performance [Chen & Goodman, 1998].
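For reference (the slide itself shows no formula), interpolated Kneser-Ney has roughly the following form, where h' is the history shortened by one word, D is a discount, N_{1+}(h •) is the number of distinct words that follow h in training, and the lower-order distributions are built from continuation counts rather than raw counts:

```latex
P_{\mathrm{KN}}(w \mid h) \;=\; \frac{\max\!\big(C(h, w) - D,\, 0\big)}{C(h)} \;+\; \lambda(h)\, P_{\mathrm{KN}}(w \mid h'),
\qquad
\lambda(h) \;=\; \frac{D \cdot N_{1+}(h\,\bullet)}{C(h)}
```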

Our Approach
Extend the appealing idea of clustering histories via decision trees.
Overcome problems in decision tree construction ... by using Random Forests!

Decision Tree Language Models
Decision trees: equivalence classification of histories.
Each leaf is specified by the answers to a series of questions (posed to the "history") which lead to the leaf from the root.
Each leaf corresponds to a subset of the histories; thus the histories are partitioned (i.e., classified).

Decision Tree Language Models: An Example
Training data (two-word history plus predicted word): aba, aca, bcb, bbb, ada.
Root node: histories {ab, ac, bc, bb, ad}, with predicted-word counts a:3, b:2.
"Is the first word in {a}?" leads to the leaf {ab, ac, ad} with counts a:3, b:0; "Is the first word in {b}?" leads to the leaf {bc, bb} with counts a:0, b:2.
New test events 'bdb' and 'adb' can still be mapped to a leaf; the new test event 'cba' gets stuck: its first word 'c' answers neither question.

Decision Tree Language Models: An Example (continued)
Example: trigrams (w_{-2}, w_{-1}, w_0).
Questions about positions: "Is w_{-i} ∈ S?" and "Is w_{-i} ∈ S^c?" There are two history positions for a trigram.
Each pair (S, S^c) defines a possible split of a node, and therefore of the training data. S and S^c are complements with respect to the training data, so a node gets less data than its ancestors.
(S, S^c) are obtained by an exchange algorithm.
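The exchange algorithm itself is not shown on the slides. Below is a minimal sketch of the greedy idea, under the assumption that the split criterion is training-data log-likelihood (the criterion named two slides later); the helper names exchange_split and node_loglik are made up for this illustration:

```python
import math
import random
from collections import defaultdict

def node_loglik(counts):
    """Training-data log-likelihood of a node under its ML unigram
    distribution over predicted words: sum_w c(w) * log(c(w) / total)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c * math.log(c / total) for c in counts.values() if c > 0)

def exchange_split(events, position, seed=0):
    """Greedy exchange: partition the values seen at `position` of the
    history into (S, S_c) so as to maximize the log-likelihood of the
    resulting two-way split.  `events` is a list of (history, word) pairs."""
    # Count predicted words for each value observed at this history position.
    per_value = defaultdict(lambda: defaultdict(int))
    for hist, w in events:
        per_value[hist[position]][w] += 1
    values = list(per_value)

    # Random initialization of the partition (the randomness later
    # exploited by the random-forest construction).
    rng = random.Random(seed)
    S = {v for v in values if rng.random() < 0.5}

    def split_loglik(S):
        left, right = defaultdict(int), defaultdict(int)
        for v, cnts in per_value.items():
            side = left if v in S else right
            for w, c in cnts.items():
                side[w] += c
        return node_loglik(left) + node_loglik(right)

    best = split_loglik(S)
    improved = True
    while improved:                  # keep sweeping until no move helps
        improved = False
        for v in values:
            S ^= {v}                 # tentatively move v to the other side
            cand = split_loglik(S)
            if cand > best + 1e-12:
                best = cand          # keep the move
                improved = True
            else:
                S ^= {v}             # undo the move
    return S, set(values) - S, best

# Tiny usage example with the slide's toy data (histories are (w_-2, w_-1)):
events = [(("a", "b"), "a"), (("a", "c"), "a"), (("b", "c"), "b"),
          (("b", "b"), "b"), (("a", "d"), "a")]
S, S_c, ll = exchange_split(events, position=0)
print(S, S_c, round(ll, 3))
```

Each sweep toggles one value at a time between S and S^c and keeps the move only if the split's log-likelihood improves, so the procedure is greedy and depends on its random initialization.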

Construction of Decision Trees
Data-driven: decision trees are constructed on the basis of training data. The construction requires:
1. The set of possible questions
2. A criterion evaluating the desirability of questions
3. A construction stopping rule or post-pruning rule

Construction of Decision Trees: Our Approach
Grow a decision tree to maximum depth using the training data: use training-data likelihood to evaluate questions, and perform no smoothing during growing.
Prune the fully grown decision tree to maximize heldout-data likelihood, incorporating KN smoothing during pruning.

Smoothing Decision Trees
Using ideas similar to interpolated Kneser-Ney smoothing.
Note: all histories in one node are not smoothed in the same way; only the leaves are used as equivalence classes.
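The smoothing formula did not survive the transcript; in the authors' published description of this model it has roughly the form below, where Φ(h) is the leaf reached by history h, D is a discount, and the interpolation backs off to the KN bigram P_KN(w | w_{-1}), which is why two histories in the same leaf need not receive identical smoothed probabilities:

```latex
P_{\mathrm{DT}}(w \mid h) \;=\; \frac{\max\!\big(C(\Phi(h), w) - D,\; 0\big)}{C(\Phi(h))}
\;+\; \lambda\big(\Phi(h)\big)\, P_{\mathrm{KN}}(w \mid w_{-1})
```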

Problems with Decision Trees
Training data fragmentation: as the tree is developed, questions are selected on the basis of less and less data.
Lack of optimality: the exchange algorithm is greedy, and so is the tree-growing algorithm.
Overtraining and undertraining: deep trees fit the training data well but will not generalize well to new test data; shallow trees are not sufficiently refined.

Amelioration: Random Forests
Breiman applied the idea of random forests to relatively small problems [Breiman, 2001]:
Using different random samples of the data and randomly chosen subsets of questions, construct K decision trees.
Apply a test datum x to all the decision trees, producing classes y_1, y_2, ..., y_K.
Accept the plurality decision.
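As an illustration of Breiman-style bagging with plurality voting (not the language-model trees used in the talk), here is a small sketch using scikit-learn's DecisionTreeClassifier as the base learner; train_forest and predict_plurality are hypothetical helper names:

```python
# Bagging + random feature subsets + plurality vote, as on the slide.
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_trees=100, seed=0):
    """Grow n_trees trees, each on a bootstrap sample of (X, y) and with a
    random subset of features considered at every split."""
    rng = np.random.default_rng(seed)
    forest = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)        # bootstrap sample of the data
        tree = DecisionTreeClassifier(
            max_features="sqrt",                # random subset of "questions" per split
            random_state=int(rng.integers(1 << 30)),
        )
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def predict_plurality(forest, x):
    """Classify x with every tree and accept the plurality decision."""
    votes = [tree.predict(x.reshape(1, -1))[0] for tree in forest]
    return Counter(votes).most_common(1)[0][0]

# Tiny usage example on synthetic data:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10, dtype=float)
y = np.array([0, 1, 1, 0] * 10)
forest = train_forest(X, y, n_trees=25)
print(predict_plurality(forest, np.array([1.0, 0.0])))
```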

Example of a Random Forest
(Figure: three decision trees T_1, T_2, T_3 each classify the example x; the forest assigns x the class chosen by the plurality of the trees.)

Random Forests for Language Modeling
Two kinds of randomness:
Selection of the history position to ask about (alternatives: position 1, position 2, or the better of the two).
Random initialization of the exchange algorithm.
100 decision trees are grown; the i-th tree estimates P_DT^(i)(w_0 | w_{-2}, w_{-1}), and the final estimate is the average over all trees.
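Reconstructed from the description on the slide, with M = 100 trees, the final estimate is:

```latex
P_{\mathrm{RF}}(w_0 \mid w_{-2}, w_{-1}) \;=\; \frac{1}{M} \sum_{i=1}^{M} P_{\mathrm{DT}}^{(i)}(w_0 \mid w_{-2}, w_{-1})
```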

Experiments
Evaluation by perplexity (PPL).
UPenn Treebank part of WSJ: about 1 million words for training and heldout (90%/10%), 82 thousand words for test.
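The perplexity formula is missing from the transcript; the standard definition over a test set of N words is:

```latex
\mathrm{PPL} \;=\; \exp\!\Big(-\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid h_i)\Big)
```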

Experiments: Trigram
Baseline: KN-trigram. No randomization: DT-trigram. 100 random DTs: RF-trigram.
(Table: heldout and test perplexity, with relative gains over the KN baseline, for KN-trigram, DT-trigram, and RF-trigram; the numbers were lost in transcription.)

Experiments: Aggregating
Considerable improvement already with 10 trees!

Experiments: Analysis
A "seen event" is defined relative to the training data: for the KN-trigram, the trigram itself occurs in the training data; for the DT-trigram, the corresponding (history equivalence class, word) event occurs in the training data.
Analyze test-data events by the number of times they are seen among the 100 DTs.

Experiments: Stability
PPL results of different random realizations vary, but the differences are small.

Experiments: Aggregation vs. Interpolation
Aggregation: uniform average of the tree estimates.
Weighted average (interpolation): estimate per-tree weights so as to maximize heldout-data log-likelihood.
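Written out, the two combination rules being compared are uniform aggregation and a weighted interpolation whose weights λ_i (non-negative, summing to one) are estimated on heldout data, e.g. with EM:

```latex
P_{\mathrm{agg}}(w \mid h) \;=\; \frac{1}{M}\sum_{i=1}^{M} P_{\mathrm{DT}}^{(i)}(w \mid h),
\qquad
P_{\mathrm{int}}(w \mid h) \;=\; \sum_{i=1}^{M} \lambda_i\, P_{\mathrm{DT}}^{(i)}(w \mid h)
```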

Experiments: Aggregation vs. Interpolation
Optimal interpolation gains almost nothing over uniform aggregation!

Experiments: Higher-Order n-gram Models
Baseline: KN n-gram. 100 random DTs: RF n-gram.
(Table: test perplexity for n = 3, 4, 5, 6 under KN and RF smoothing; the numbers were lost in transcription.)

Applying Random Forests to Other Models: SLM
Structured Language Model (SLM) [Chelba & Jelinek, 2000]. Approximation: use tree triples.

  SLM perplexity:  KN 137.9,  RF 122.8

Speech Recognition Experiments (I)
Word Error Rate (WER) by N-best rescoring.
WSJ text: 20 or 40 million words of training data.
WSJ DARPA '93 HUB1 test data: 213 utterances, 3,446 words.
N-best rescoring: baseline WER is 13.7%. The N-best lists were generated by a trigram baseline using Katz backoff smoothing; the baseline trigram used 40 million words for training; the oracle error rate is around 6%.
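N-best rescoring itself is simple to sketch; the following is a generic illustration (not the authors' actual scripts), with an assumed language-model weight and word-insertion penalty:

```python
# Each hypothesis carries an acoustic log-score; the original LM score is
# replaced by the new (e.g. random-forest) LM score and the list is re-ranked.
from typing import Callable, List, Sequence, Tuple

def rescore_nbest(
    nbest: List[Tuple[Sequence[str], float]],      # (word sequence, acoustic log-score)
    lm_logprob: Callable[[Sequence[str]], float],  # new LM: total log P(W)
    lm_weight: float = 12.0,                       # LM scale factor (assumed value)
    word_penalty: float = 0.0,                     # word-insertion penalty (assumed)
) -> Sequence[str]:
    """Return the hypothesis with the best combined score."""
    def total(entry):
        words, acoustic = entry
        return acoustic + lm_weight * lm_logprob(words) + word_penalty * len(words)
    return max(nbest, key=total)[0]

# Usage with a toy unigram "LM" standing in for the random-forest model:
toy_lm = {"the": -1.0, "cat": -2.0, "hat": -2.5, "cap": -3.5}
lm = lambda ws: sum(toy_lm.get(w, -10.0) for w in ws)
nbest = [(["the", "cat", "hat"], -105.0), (["the", "cat", "cap"], -104.0)]
print(rescore_nbest(nbest, lm))
```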

Speech Recognition Experiments (I)
Baseline: KN smoothing. 100 random DTs for the RF 3-gram; 100 random DTs for the PREDICTOR in the SLM; approximation in the SLM.

             3-gram (20M)   3-gram (40M)   SLM (20M)
  KN            14.0%          13.0%         12.8%
  RF            12.9%          12.4%         11.9%
  p-value      <0.001          <0.05        <0.001

Speech Recognition Experiments (II)
Word error rate by lattice rescoring.
IBM 2004 Conversational Telephony System for Rich Transcription: 1st place in the RT-04 evaluation.
Fisher data: 22 million words. WEB data: 525 million words, collected using frequent Fisher n-grams as queries. Other data: Switchboard, Broadcast News, etc.
Lattice language model: 4-gram with interpolated Kneser-Ney smoothing, pruned to 3.2 million unique n-grams; WER is 14.4%.
Test set: DEV04, 37,834 words.

Speech Recognition Experiments (II)
Baseline: KN 4-gram. 110 random DTs for the EB-RF 4-gram, sampling data without replacement. The Fisher and WEB models are interpolated.

             Fisher 4-gram   WEB 4-gram   Fisher+WEB 4-gram
  KN             14.1%          15.2%           13.7%
  RF             13.5%          15.0%           13.1%
  p-value       <0.001            -

Practical Limitations of the RF Approach
Memory: decision tree construction uses much more memory.
Little performance gain when the training data is really large.
Because we have 100 trees, the final model becomes too large to fit into memory; effective language model compression or pruning remains an open question.

Conclusions: Random Forests
A new RF language modeling approach.
A more general LM: random forests generalize decision trees, which in turn generalize n-gram models.
Randomized history clustering.
Good generalization: better n-gram coverage, less bias toward the training data.
An extension of Breiman's random forests to the data sparseness problem.

Conclusions: Random Forests
Improvements in perplexity and/or word error rate over interpolated Kneser-Ney smoothing for different models: n-gram (up to n = 6), class-based trigram, Structured Language Model.
Significant improvements in the best-performing large-vocabulary conversational telephony speech recognition system.