Smoothing
References Not Mentioned (much)
Capture-Recapture
Kneser-Ney
Additive Smoothing
Laplace Smoothing
Jeffreys Dirichlet Prior
What's wrong with adding one? (see the sketch below)
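To make "what's wrong with adding one?" concrete, here is a minimal Python sketch of additive (Laplace/Lidstone) smoothing on a toy corpus; the vocabulary size of 10,000 and the example tokens are illustrative assumptions, not anything taken from the slides.

```python
from collections import Counter

def additive_prob(counts, total, vocab_size, w, alpha=1.0):
    """Additive smoothing: P(w) = (count(w) + alpha) / (total + alpha * |V|).

    alpha = 1 is Laplace ("add one"); smaller alpha is Lidstone smoothing.
    """
    return (counts[w] + alpha) / (total + alpha * vocab_size)

# Toy illustration of "what's wrong with adding one": with a large vocabulary
# and little data, add-one smoothing hands almost all probability mass to
# unseen words and crushes the estimates of words we actually observed.
tokens = "the cat sat on the mat".split()
counts, total, vocab_size = Counter(tokens), len(tokens), 10_000

print("MLE     P(the) =", counts["the"] / total)                            # 0.333...
print("add-one P(the) =", additive_prob(counts, total, vocab_size, "the"))  # ~0.0003
print("add-one mass on unseen words =",
      (vocab_size - len(counts)) * additive_prob(counts, total, vocab_size, "zebra"))
```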
Good-Turing Assumptions: Modest, but…
Jurafsky (Lecture 4-7) https://www.youtube.com/watch
Tweets
"Don't forget about smoothing just because your language model has gone neural. Sparse data is real. Always has been, always will be."
Good-Turing Smoothing & Large Counts
Smoothing may illustrate some of the limitations of the statistical approach to natural language processing. The Good-Turing method … suffers from noise as r gets larger… While smoothing produced the desired model and is substantiated by rigorous math, the process also gives the impression that the mathematics was "bent" to create that model. And while statistical approaches such as clustering and smoothing have delivered far better performance on many natural language processing tasks, this raises the question of whether other approaches can enhance the currently dominant statistical approach in natural language processing.
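For reference, the comment about noise at large r follows directly from the basic Good-Turing formula r* = (r + 1) N_{r+1} / N_r: once the counts-of-counts N_r become sparse, the estimate is undefined or erratic. A minimal sketch, on a made-up toy corpus (not data from the slides):

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Basic (unsmoothed) Good-Turing: r* = (r + 1) * N_{r+1} / N_r.

    counts: mapping from item -> observed count r.
    Returns a dict r -> adjusted count r*, with None wherever N_{r+1} == 0,
    which is exactly the large-r noise problem mentioned on the slide.
    """
    freq_of_freq = Counter(counts.values())   # N_r: how many items occur exactly r times
    adjusted = {}
    for r, n_r in sorted(freq_of_freq.items()):
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        adjusted[r] = (r + 1) * n_r_plus_1 / n_r if n_r_plus_1 else None
    return adjusted

# Toy corpus: the estimate is fine at r = 1, but already at larger r there is
# no N_{r+1}, so r* is undefined without extra smoothing (Simple Good-Turing
# fixes this by fitting a line to log N_r versus log r).
counts = Counter("the cat sat on the mat and the dog sat on the log".split())
print(good_turing_adjusted_counts(counts))
```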
References
Is Smoothing Obsolete? (1 of 2)
Is it true that “there is no data like more data” even when that data is sparse? Were Allison et al. correct when they predicted that improved data-processing capabilities would make smoothing techniques less necessary? If the entire data set could be used to calculate population frequencies, MLE would again become the natural choice of frequency estimator, since it is straightforward and precise. But as long as we work with corpora too large and too sparse not to sample, smoothing remains very important.
Is Smoothing Obsolete? (2 of 2)
Good-Turing smoothing was in some sense a general solution to the data sparseness problem. Improvements on the Good-Turing solution include Simple Good-Turing (Gale and Sampson, 1995) and Katz backoff (Katz, 1987). Many survey papers, such as Chen and Goodman’s, compare smoothing methods or combine them to produce more accurate results. Clearly, smoothing techniques have been important to NLP, information retrieval, and biology. Smoothing will become obsolete only if our ability to process every word in a corpus (or every web page in a search result, as Gao et al. were doing) scales with the growth of corpora. Thus the assertion that smoothing might become less and less necessary does not hold.
“It never pays to think until you’ve run out of data” – Eric Brill
Moore’s Law Constant: Data Collection Rates, Improvement Rates
Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001)
No consistently best learner
"There is no data like more data"
Quoted out of context
"Fire everybody and spend the money on data"
There is no data like more data
Counts are growing 1000x per decade (same as disks)
Rising Tide of Data Lifts All Boats
Neural Nets Have More Parameters
More parameters, more powerful
But they also require:
  More data
  More smoothing
Is there implicit smoothing in the standard training mechanisms?
  Dropout
  Cross-validation
As well as explicit smoothing:
  The f(x) weighting function in GloVe (c^0.75), sketched below
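As a concrete reminder of the explicit-smoothing point, here is a small sketch of GloVe's weighting function f(x) = (x/x_max)^0.75, capped at 1 for x ≥ x_max; the value x_max = 100 follows the published GloVe recipe, and the sample counts below are arbitrary.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x) applied to a co-occurrence count x.

    f(x) = (x / x_max)**alpha for x < x_max, and 1 otherwise, so rare
    co-occurrences are down-weighted and very frequent ones are capped --
    an explicit smoothing choice baked into the training objective.
    """
    return (x / x_max) ** alpha if x < x_max else 1.0

for x in (1, 10, 100, 1000):
    print(x, round(glove_weight(x), 3))
```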
Smoothing & Neural Nets
"Despite the recent success of recurrent neural nets (RNNs) [which] do not rely on word counts, smoothing remains an important problem in the field of language modeling. … RNNs at first glance appear to sidestep the issue of smoothing altogether. However…"
  Low-frequency words
  <UNK> tags
  Weighting functions used in word embeddings
"In this paper we … [argue] that smoothing remains a problem even as the community shifts [to embeddings & neural networks]"
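The <UNK> point above is itself a smoothing decision made before any parameters are trained. A minimal illustrative sketch (the threshold of 2 and the toy sentence are assumptions, not taken from any of the cited papers):

```python
from collections import Counter

def build_vocab_with_unk(tokens, min_count=2, unk="<UNK>"):
    """Map every word seen fewer than `min_count` times to an <UNK> token.

    Pooling all rare words into a single <UNK> symbol is effectively a
    smoothing decision: their probability mass is shared by one pseudo-word.
    """
    counts = Counter(tokens)
    vocab = {w for w, c in counts.items() if c >= min_count} | {unk}
    mapped = [w if w in vocab else unk for w in tokens]
    return vocab, mapped

tokens = "the cat sat on the mat and the dog sat on the log".split()
vocab, mapped = build_vocab_with_unk(tokens)
print(sorted(vocab))
print(mapped)
```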
Smoothing & Embeddings
The advances in smoothing and their application to estimating low-frequency n-grams in traditional language models do not seem to have carried over to word embedding models, which are commonly believed to outperform n-gram models on most NLP tasks. The claim is that, empirically, all context counts need to be raised to the power of 0.75: a 'magical' fixed hyperparameter that achieves the best accuracy no matter which task we are performing or what training data we are drawing from.
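A quick sketch of what raising context counts to the power 0.75 does to a unigram distribution, compared with the maximum-likelihood estimate; the toy sentence is an arbitrary example, not data from the slides.

```python
from collections import Counter

def smoothed_unigram(tokens, power=0.75):
    """Unigram distribution with counts raised to `power` and renormalised.

    With power=0.75 (the 'magical' fixed hyperparameter on the slide),
    frequent words lose probability mass and rare words gain it, relative
    to the maximum-likelihood estimate (power=1.0).
    """
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: v / z for w, v in weights.items()}

tokens = "the the the the the cat sat on a mat".split()
print("MLE    :", {w: round(p, 3) for w, p in smoothed_unigram(tokens, 1.0).items()})
print("c^0.75 :", {w: round(p, 3) for w, p in smoothed_unigram(tokens, 0.75).items()})
```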
Although the embedding models' weighting penalizes the overestimation of small counts, it improperly down-weights the large counts as well. An alternative method, for example, would be to apply discounting based on the Good-Turing estimate of the probability of unseen or low-frequency words. There are many other smoothing methods, and revisiting such improvements could dramatically impact performance.
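As a rough illustration of the alternative suggested above (and only that; this is not a method proposed in any cited paper), one could replace the fixed c^0.75 weights with Good-Turing adjusted counts, falling back to raw counts where the estimate is undefined:

```python
from collections import Counter

def good_turing_weights(counts):
    """Illustrative alternative to fixed c**0.75 weighting: re-weight each
    word by its Good-Turing adjusted count r* = (r + 1) * N_{r+1} / N_r.

    This is a sketch of the idea on the slide, not an established recipe:
    where N_{r+1} is zero (the noisy large-r case) we simply fall back to
    the raw count, which a real implementation would instead handle with
    Simple Good-Turing smoothing of the counts-of-counts.
    """
    freq_of_freq = Counter(counts.values())
    weights = {}
    for w, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        weights[w] = (r + 1) * n_r1 / n_r if n_r1 else float(r)
    return weights

counts = Counter("the cat sat on the mat and the dog sat on the log".split())
print(good_turing_weights(counts))
```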