Smoothing 10/27/2017.


References Not Mentioned (much)
Capture-Recapture
Kneser-Ney
Additive Smoothing: https://en.wikipedia.org/wiki/Additive_smoothing
Laplace Smoothing
Jeffreys
Dirichlet Prior
What's wrong with adding one? (see the sketch below)
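To make "what's wrong with adding one?" concrete, here is a minimal sketch (not from the slides; the function name and the k parameter are illustrative) of additive smoothing, which covers Laplace (k=1), Jeffreys (k=0.5), and the general symmetric-Dirichlet-prior view:

```python
def additive_prob(count, total, vocab_size, k=1.0):
    """Additive smoothing of a unigram probability:
    P(w) = (count(w) + k) / (total + k * |V|).
    k=1 is Laplace (add-one), k=0.5 corresponds to the Jeffreys prior,
    and in general k is the parameter of a symmetric Dirichlet prior."""
    return (count + k) / (total + k * vocab_size)
```

The objection to k=1 is that with a large vocabulary the k * |V| term dominates the denominator, so too much probability mass is shifted onto unseen words and frequent words are discounted far more than the data warrants; smaller k softens this but does not fix it.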

Good-Turing Assumptions: Modest, but…

Jurafsky (Lecture 4-7): https://www.youtube.com/watch

Tweets: "Don't forget about smoothing just because your language model has gone neural. Sparse data is real. Always has been, always will be."

Good-Turing Smoothing & Large Counts
Smoothing may illustrate some of the limitations of the statistical approach to natural language processing. The Good-Turing method … suffers from noise as r gets larger… While smoothing produced the desired model and is backed by rigorous mathematics, the process can also leave the impression that the mathematics was "bent" to produce that model. And while statistical techniques such as clustering and smoothing have delivered far better performance on many natural language processing tasks, this raises the question of whether other approaches could enhance the currently dominant statistical approach.
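The noise mentioned above is easy to see in the basic Turing estimate. The sketch below (illustrative, not from the slides) computes the adjusted count r* = (r + 1) N_{r+1} / N_r directly from the count-of-counts table and shows why it breaks down for large r:

```python
from collections import Counter

def turing_adjusted_counts(counts):
    """Unsmoothed Good-Turing adjusted counts: r* = (r + 1) * N_{r+1} / N_r,
    where N_r is the number of types observed exactly r times.
    For large r the count-of-counts table is sparse, so N_{r+1} is often 0
    and r* is undefined or jumps around wildly -- the noise problem that
    motivates smoothing N_r itself (e.g., Simple Good-Turing)."""
    n_r = Counter(counts.values())          # the N_r table
    adjusted = {}
    for word, r in counts.items():
        if n_r.get(r + 1, 0) == 0:
            adjusted[word] = None           # no estimate without smoothing N_r
        else:
            adjusted[word] = (r + 1) * n_r[r + 1] / n_r[r]
    return adjusted
```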

References

Is Smoothing Obsolete? (1 of 2)
Is it true that "there is no data like more data" even when that data is sparse? Were Allison et al. correct in 2006 when they predicted that improved data-processing capabilities would make smoothing techniques less necessary? If the entire data set could be used to calculate population frequencies, MLE would return as the popular choice for frequency estimation, since it is straightforward and precise. But as long as we work with corpora too large and sparse not to sample, smoothing remains very important.

Is Smoothing Obsolete? (2 of 2)
Good-Turing smoothing was in some sense a general solution to the data sparseness problem. Improvements on it include Simple Good-Turing (Gale and Sampson, 1995) and Katz backoff (Katz, 1987). Many survey papers, such as Chen and Goodman's, compare smoothing methods or combine them to produce more accurate results. Clearly smoothing techniques have been important to NLP, information retrieval, and biology. Smoothing will become obsolete only if our ability to process every word in a corpus (or every web page in a search result, as Gao et al. were working with) scales with the growth of corpora. Thus the assertion that smoothing might become less and less necessary does not hold.
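The core move in Simple Good-Turing (Gale and Sampson, 1995) is to smooth the count-of-counts before applying the Turing formula. The sketch below shows that idea only; it omits the Z_r averaging and the rule for switching between the Turing and log-linear estimates in the full method, and the function name is illustrative:

```python
import math
from collections import Counter

def sgt_adjusted_counts(counts):
    """Reduced sketch of Simple Good-Turing: fit log N_r = a + b*log r by
    least squares, then use the smoothed S(r) in r* = (r+1) * S(r+1) / S(r).
    Assumes at least two distinct observed count values."""
    n_r = Counter(counts.values())
    rs = sorted(n_r)
    xs = [math.log(r) for r in rs]
    ys = [math.log(n_r[r]) for r in rs]
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    s = lambda r: math.exp(a + b * math.log(r))   # smoothed count-of-counts
    return {w: (r + 1) * s(r + 1) / s(r) for w, r in counts.items()}
```

Because S(r) is a smooth function of r, the adjusted counts no longer depend on individual sparse N_r values, which is exactly what removes the large-r noise complained about on the earlier slide.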

"It never pays to think until you've run out of data" – Eric Brill
Moore's Law Constant: Data Collection Rates ≫ Improvement Rates
Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001)
No consistently best learner
There is no data like more data
Quoted out of context
Fire everybody and spend the money on data
(6/27/2014, Fillmore Workshop)

There is no data like more data
Counts are growing 1000x per decade (same as disks)
Rising Tide of Data Lifts All Boats

Neural Nets Have More Parameters
More parameters → more powerful, but they also require:
More data
More smoothing
Is there implicit smoothing in standard training mechanisms?
Dropout
Cross-validation
As well as explicit smoothing: the f(x) weighting function in GloVe (c^0.75); see the sketch below.
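For reference, the GloVe weighting function alluded to above is f(x) = (x / x_max)^α with α = 0.75, capped at 1 for counts above x_max (Pennington et al., 2014). A minimal sketch, with the paper's default x_max:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x) for a co-occurrence count x: rare pairs get
    small weight, and very frequent pairs are capped at weight 1 so they
    cannot dominate the objective -- an explicit, built-in smoothing choice."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```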

Smoothing & Neural Nets
Despite the recent success of recurrent neural nets (RNNs), [which] do not rely on word counts, smoothing remains an important problem in the field of language modeling. … RNNs at first glance appear to sidestep the issue of smoothing altogether. However…
low-frequency words
<UNK> tags
weighting functions used in word embeddings
In this paper we … [argue] that smoothing remains a problem even as the community shifts [to embeddings & neural networks]
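One concrete place where the low-frequency-word issue shows up is vocabulary construction for a neural language model. The sketch below is illustrative only (not from the paper quoted above); it shows the usual <UNK> thresholding, which is itself a blunt smoothing decision:

```python
from collections import Counter

def replace_rare_words(tokens, min_count=2, unk="<UNK>"):
    """Map words seen fewer than min_count times to a single <UNK> type.
    This pools the probability mass of many rare words into one shared
    symbol -- a crude form of smoothing baked into neural LM preprocessing."""
    freq = Counter(tokens)
    vocab = {w for w, c in freq.items() if c >= min_count}
    return [t if t in vocab else unk for t in tokens]
```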

Smoothing & Embeddings
The advances in smoothing, and their application to estimating low-frequency n-grams in traditional language models, do not seem to have passed down to word embedding models, which are commonly believed to outperform n-gram models on most NLP tasks. It is argued that, empirically, all context counts need to be raised to a power of 0.75: a 'magical' fixed hyperparameter that achieves the best accuracy no matter which task we are performing or what training data we are drawing from.
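The 0.75 exponent referred to above appears, for example, in word2vec's negative-sampling distribution, where unigram counts are raised to the 3/4 power (Mikolov et al., 2013). A small sketch, with illustrative function names:

```python
from collections import Counter
import random

def negative_sampling_dist(tokens, power=0.75):
    """Unigram distribution raised to the 0.75 power, as used for negative
    sampling in word2vec. The exponent flattens the distribution: rare words
    are sampled more often than their raw counts would suggest, and very
    frequent words less often -- the fixed 'magical' hyperparameter."""
    weights = {w: c ** power for w, c in Counter(tokens).items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

# Example: draw 5 negative samples from a toy corpus.
dist = negative_sampling_dist("the cat sat on the the mat".split())
negatives = random.choices(list(dist), weights=list(dist.values()), k=5)
```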

Although the embedding models do penalize the overestimation of small counts, they improperly down-weight the large counts as well. An alternative, for example, would be to apply discounting based on the Good-Turing estimate of the probability of unseen or low-frequency words. There are many other smoothing methods, and revisiting such improvements could have a substantial impact on performance.
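A purely hypothetical illustration of that suggestion (not something the slides implement): discount small co-occurrence counts with the Turing formula while leaving large, reliable counts untouched. The function name, the cutoff k, and the fallback behaviour are all assumptions:

```python
def discounted_count(c, n_r, k=5):
    """Hypothetical alternative to the fixed c**0.75 weighting: apply a
    Good-Turing style discount only to small counts (c <= k), and trust
    large counts as-is instead of down-weighting them.
    c   -- raw co-occurrence count (c >= 1)
    n_r -- dict mapping r to N_r, the number of events seen exactly r times
    k   -- cutoff above which raw counts are kept unchanged"""
    if c > k or not n_r.get(c) or not n_r.get(c + 1):
        return float(c)                       # keep reliable or un-discountable counts
    return (c + 1) * n_r[c + 1] / n_r[c]      # Turing-discounted count
```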