Smoothing 10/27/2017
References Not Mentioned (much)
- Capture-Recapture
- Kneser-Ney
- Additive Smoothing: https://en.wikipedia.org/wiki/Additive_smoothing
  - Laplace Smoothing
  - Jeffreys / Dirichlet Prior
  - What’s wrong with adding one?
Good-Turing Assumptions: Modest, but…
Jurafsky (Lecture 4-7): https://www.youtube.com/watch
Tweets
“Don’t forget about smoothing just because your language model has gone neural. Sparse data is real. Always has been, always will be.”
Good-Turing Smoothing & Large Counts
Smoothing may illustrate some of the limitations of the statistical approach in natural language processing. The Good-Turing method … suffers from noise as r gets larger… While smoothing produced the desired model and is substantiated by rigorous math, the process also gives the impression that the mathematics was “bent” to create the desired model. While many statistical approaches such as clustering and smoothing have given far better performance on many natural language processing tasks, this raises the question of whether other approaches can enhance the currently dominant statistical approach in natural language processing.
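To make the large-count noise concrete, here is a minimal sketch of the Good-Turing adjusted count r* = (r+1)·N_{r+1}/N_r (the toy counts and function name are mine, not from the slides): for small r the frequency-of-frequencies N_r is large and the estimate is stable, but for large r, N_{r+1} is often 0 or 1, so r* is undefined or wildly noisy.

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r.

    `counts` maps each type (word or n-gram) to its observed count r.
    Returns a dict mapping each observed r to its adjusted count r*.
    Where N_{r+1} == 0 (common for large r), no estimate is produced:
    exactly the noise/sparsity problem for large counts.
    """
    # N_r = number of types observed exactly r times ("frequency of frequencies")
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for r, n_r in sorted(freq_of_freq.items()):
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        if n_r_plus_1 > 0:
            adjusted[r] = (r + 1) * n_r_plus_1 / n_r
    return adjusted

# Toy Zipf-like counts: many singletons, a few high-count types.
counts = {f"w{i}": c for i, c in enumerate(
    [1] * 50 + [2] * 20 + [3] * 10 + [4] * 6 + [5] * 3 + [20] * 2 + [21])}
print(good_turing_adjusted_counts(counts))
# {1: 0.8, 2: 1.5, 3: 2.4, 4: 2.5, 20: 10.5}
# Small r: stable estimates.  r = 20 collapses to 10.5 because N_20 = 2 and
# N_21 = 1, and r = 5 and r = 21 get no estimate at all: noise dominates.
```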
References
Is Smoothing Obsolete? (1 of 2)
Is it true that “there is no data like more data” even when that data is sparse? Were Allison et al. correct in 2006 when they predicted that improved data processing capabilities would make smoothing techniques less necessary? If the entire data set could be used to calculate population frequencies, MLE would return as the natural choice of frequency estimator, since it is straightforward and exact. But as long as we work with corpora too large and sparse not to sample, smoothing remains very important.
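As a concrete reminder of why MLE alone breaks down on sparse samples, here is a minimal sketch (the toy data, vocabulary size, and function names are mine): MLE assigns probability zero to any word not in the sample, which makes perplexity infinite, while even simple add-one (Laplace) smoothing reserves some mass for unseen words, at the cost of over-shrinking the seen ones.

```python
from collections import Counter

def mle_prob(word, counts, total):
    # Maximum likelihood estimate: relative frequency; zero for unseen words.
    return counts.get(word, 0) / total

def add_one_prob(word, counts, total, vocab_size):
    # Laplace (add-one) smoothing: every word, seen or unseen, gets some mass.
    return (counts.get(word, 0) + 1) / (total + vocab_size)

sample = "the cat sat on the mat".split()
counts, total = Counter(sample), len(sample)
vocab_size = 10_000   # assumed size of the full vocabulary

print(mle_prob("dog", counts, total))                    # 0.0 -> infinite perplexity
print(add_one_prob("dog", counts, total, vocab_size))    # small but nonzero
print(mle_prob("the", counts, total))                    # 0.333...
print(add_one_prob("the", counts, total, vocab_size))    # heavily shrunk: "what's
                                                         # wrong with adding one"
```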
Is Smoothing Obsolete? (2 of 2)
Good-Turing smoothing was in some sense a general solution to the data sparseness problem. Improvements on the Good-Turing solution include Simple Good-Turing (Gale and Sampson, 1995) and Katz smoothing (Katz, 1987). Many survey papers, such as Chen and Goodman’s, compare smoothing methods or combine various smoothing methods to produce more accurate results. Clearly smoothing techniques have been important to NLP, information retrieval, and biology. Smoothing will only become obsolete if our ability to process every word in a corpus (or every web page in a search result, as Gao et al. were working with) scales with the growth of corpora. Thus the assertion that smoothing might become less and less necessary does not hold.
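Since Simple Good-Turing (Gale and Sampson, 1995) is mentioned above, here is a deliberately simplified sketch of its core idea: replace the noisy raw N_r with a power-law fit before applying the Good-Turing formula. The published method also averages N_r over gaps in r (the Z_r transform) and switches between raw and smoothed estimates; both refinements are omitted here, and the toy counts carry over from the earlier sketch.

```python
import math

def simple_good_turing_sketch(freq_of_freq):
    """Crude sketch of Simple Good-Turing (after Gale & Sampson, 1995).

    Fit log(N_r) = a + b*log(r) by least squares and use the smoothed
    S(r) = exp(a + b*log(r)) in place of the noisy raw N_r when computing
    r* = (r + 1) * S(r + 1) / S(r).
    """
    rs = sorted(freq_of_freq)
    xs = [math.log(r) for r in rs]
    ys = [math.log(freq_of_freq[r]) for r in rs]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    s = lambda r: math.exp(a + b * math.log(r))          # smoothed N_r
    return {r: (r + 1) * s(r + 1) / s(r) for r in rs}

freq_of_freq = {1: 50, 2: 20, 3: 10, 4: 6, 5: 3, 20: 2, 21: 1}
print(simple_good_turing_sketch(freq_of_freq))
# Every observed r now gets an estimate, and r* for r = 20 stays near 20
# instead of collapsing to the noisy raw Good-Turing value.
```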
“It never pays to think until you’ve run out of data” – Eric Brill
- Moore’s Law Constant: Data Collection Rates vs. Improvement Rates
- Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001)
  - No consistently best learner
  - “There is no data like more data”
  - Quoted out of context: “Fire everybody and spend the money on data”
There is no data like more data
- Counts are growing 1000x per decade (same as disks)
- Rising Tide of Data Lifts All Boats
Neural Nets Have More Parameters
- More parameters: more powerful
- But they also require:
  - More data
  - More smoothing
- Is there implicit smoothing in standard training mechanisms?
  - Dropout
  - Cross-validation
- As well as explicit smoothing:
  - The f(x) weighting function in GloVe: c^0.75 (see the sketch below)
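For reference, the explicit smoothing mentioned in the last bullet is the f(x) weighting function from the GloVe paper (Pennington et al., 2014), which raises co-occurrence counts to the 0.75 power below a cap. The cap x_max = 100 and exponent 0.75 below are the paper’s defaults; the toy counts are mine.

```python
def glove_weight(count, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x): down-weights rare co-occurrence
    counts and caps the influence of very frequent ones."""
    return (count / x_max) ** alpha if count < x_max else 1.0

for c in (1, 5, 50, 100, 1000):
    print(c, round(glove_weight(c), 3))
# 1 0.032 | 5 0.106 | 50 0.595 | 100 1.0 | 1000 1.0
```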
Smoothing & Neural Nets
“Despite the recent success of recurrent neural nets (RNNs), [which] do not rely on word counts, smoothing remains an important problem in the field of language modeling. … RNNs at first glance appear to sidestep the issue of smoothing altogether. However…”
- low-frequency words
- <UNK> tags
- weighting functions used in word embeddings
“In this paper we … [argue] that smoothing remains a problem even as the community shifts [to embeddings & neural networks].”
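One place smoothing sneaks back into neural models is the handling of low-frequency words via <UNK> tags. A minimal sketch (the threshold, data, and function name are my own choices): pooling all rare words into a single <UNK> token is itself a crude way of reserving probability mass for words the model has barely seen or never will see.

```python
from collections import Counter

def unk_rare_words(tokens, min_count=2, unk="<UNK>"):
    """Replace words seen fewer than `min_count` times with an <UNK> tag,
    a common pre-processing step before training a neural language model."""
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else unk for t in tokens]

tokens = "the cat sat on the mat while the zyzzyva slept".split()
print(unk_rare_words(tokens))
# Only 'the' survives; every singleton collapses into the shared <UNK> token,
# whose count then stands in for all future out-of-vocabulary words.
```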
Smoothing & Embeddings
The advances in smoothing and their application to estimating low-frequency n-grams in traditional language models do not seem to have passed down to word embedding models, which are commonly believed to outperform n-gram models on most NLP tasks. It is argued that, empirically, all context counts need to be raised to the power of 0.75: a “magical” fixed hyperparameter that achieves the best accuracy no matter which task we are performing or what training data we are drawing from.
Although the embedding models penalize the overestimation of small counts, they improperly down-weight the large counts as well. An alternative method, for example, would be to apply discounting based on the Good-Turing estimate of the probability of unseen or low-frequency words. There are many other smoothing methods, and revisiting such improvements could dramatically impact performance.
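To illustrate the claim about large counts, here is a side-by-side sketch (toy counts and function names are mine, and I use absolute discounting as a simple stand-in for a Good-Turing-style discount): the fixed 0.75 power reshapes the whole distribution and shrinks large counts heavily, while a discount mainly touches the small counts.

```python
from collections import Counter

def power_distribution(counts, alpha=0.75):
    """Context distribution with every count raised to a fixed power
    (the c^0.75 heuristic used in embedding models)."""
    weights = {w: c ** alpha for w, c in counts.items()}
    z = sum(weights.values())
    return {w: round(v / z, 4) for w, v in weights.items()}

def discounted_distribution(counts, d=0.75):
    """Absolute discounting as a simple stand-in for a Good-Turing-style
    discount: subtract a small fixed amount from every count (the freed
    mass would go to unseen events, ignored here) and renormalize."""
    weights = {w: max(c - d, 0.0) for w, c in counts.items()}
    z = sum(weights.values())
    return {w: round(v / z, 4) for w, v in weights.items()}

counts = Counter({"the": 1000, "cat": 10, "zyzzyva": 1})
print(power_distribution(counts))       # 'the' falls from ~0.989 to ~0.964
print(discounted_distribution(counts))  # 'the' barely moves; 'zyzzyva' shrinks
```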