Language-Model-Based Similarity on Large Texts
Tolga Çekiç
Language-model-based similarity

For term t and document/text x, the maximum-likelihood estimate is
$p_{ML}(t \mid x) = \frac{c(t, x)}{|x|}$,
where $c(t, x)$ is the count of t in x and $|x|$ is the length of x.

With a smoothing estimate using the Dirichlet distribution:
$p_\mu(t \mid x) = \frac{c(t, x) + \mu\, p(t \mid C)}{|x| + \mu}$,
where $p(t \mid C)$ is the collection language model and $\mu > 0$ is the smoothing parameter.
Smoothing

General form of a smoothed model:
$p(t \mid d) = \begin{cases} p_s(t \mid d) & \text{if } c(t, d) > 0 \\ \alpha_d\, p(t \mid C) & \text{otherwise} \end{cases}$
where $p_s(t \mid d)$ is the smoothed estimate for terms seen in d and $\alpha_d$ is a document-dependent normalizing constant.

Many smoothing methods use this same basic principle: the Jelinek-Mercer method, Katz smoothing, Bayesian smoothing, absolute discounting, etc.
Bayesian Smoothing Using Dirichlet Priors

Dirichlet distribution with parameters:
$(\mu\, p(t_1 \mid C),\, \mu\, p(t_2 \mid C),\, \ldots,\, \mu\, p(t_n \mid C))$

The model becomes:
$p_\mu(t \mid d) = \frac{c(t, d) + \mu\, p(t \mid C)}{|d| + \mu}$
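The Dirichlet-smoothed estimator above can be sketched in a few lines; the document, collection counts, and µ value here are illustrative assumptions, not from the deck:

```python
from collections import Counter

def dirichlet_smoothed_prob(term, doc_counts, doc_len, coll_counts, coll_len, mu=2000.0):
    """p_mu(t|d) = (c(t,d) + mu * p(t|C)) / (|d| + mu)."""
    p_coll = coll_counts.get(term, 0) / coll_len   # collection model p(t|C)
    return (doc_counts.get(term, 0) + mu * p_coll) / (doc_len + mu)

# Toy document and collection counts (illustrative only).
doc = Counter({"language": 2, "model": 1})
coll = Counter({"language": 3, "model": 2, "smoothing": 5})

# A term unseen in the document still gets nonzero probability
# because the collection model backs it off.
p_unseen = dirichlet_smoothed_prob("smoothing", doc, 3, coll, 10, mu=100.0)
```

Because the collection model sums to one over the vocabulary, the smoothed probabilities also sum to one over the vocabulary for any µ.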
Dirichlet Priors

When using Dirichlet priors, $\alpha_d$ is document-dependent:
$\alpha_d = \frac{\mu}{|d| + \mu}$
The optimal value of µ is not much different for larger and smaller queries. The optimal prior µ depends on the collection.
Estimator for Term Sequences

Assuming term independence (the unigram approach), the language-model estimator for a term sequence $t_1 \ldots t_n$ is calculated as:
$p(t_1 \ldots t_n \mid x) = \prod_{i=1}^{n} p(t_i \mid x)$
This is the standard estimator for the probability of a short query. An alternative estimation approach utilizes the cross-entropy measure.
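Under the independence assumption the sequence probability is just a product of per-term probabilities; summing logs instead of multiplying keeps it numerically stable. A minimal sketch, using a made-up three-word smoothed model:

```python
import math

def sequence_log_prob(terms, model):
    """log p(t_1..t_n | x) = sum_i log p(t_i | x) under the unigram assumption.

    `model` maps each term to its (already smoothed) probability p(t|x);
    a smoothed model assigns every vocabulary term a nonzero probability.
    """
    return sum(math.log(model[t]) for t in terms)

# Toy smoothed model over a three-word vocabulary (probabilities sum to 1).
model = {"language": 0.5, "model": 0.3, "similarity": 0.2}
lp = sequence_log_prob(["language", "model"], model)
```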
Cross Entropy

The cross entropy for two distributions p and q over the same probability space:
$H(p, q) = -\sum_{t} p(t) \log q(t) = H(p) + D(p \parallel q)$
where $D(p \parallel q)$ is the Kullback-Leibler divergence from q to p.
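The decomposition $H(p, q) = H(p) + D(p \parallel q)$ can be checked numerically; a small sketch in which the two distributions are arbitrary examples:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_t p(t) * log2 q(t), over terms where p(t) > 0."""
    return -sum(p[t] * math.log2(q[t]) for t in p if p[t] > 0)

def entropy(p):
    """H(p) = -sum_t p(t) * log2 p(t)."""
    return -sum(pt * math.log2(pt) for pt in p.values() if pt > 0)

def kl_divergence(p, q):
    """D(p || q) = sum_t p(t) * log2(p(t) / q(t))."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)

# Two arbitrary distributions over the same two-element space.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
```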
Large Text Similarity

When two large texts are compared, i.e. when n is a large number, the product
$\prod_{i=1}^{n} p(t_i \mid x)$
is very close to zero and underflow problems occur. One way to circumvent the problem is to take the n-th root of the result:
$\left(\prod_{i=1}^{n} p(t_i \mid x)\right)^{1/n}$
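The underflow, and the log-space fix, can be demonstrated directly; the per-term probability 1e-4 and the length 100 below are arbitrary illustrative values:

```python
import math

def geometric_mean_prob(probs):
    """n-th root of the product, computed in log space to avoid underflow:
    (prod p_i)^(1/n) = exp((1/n) * sum_i log p_i)."""
    n = len(probs)
    return math.exp(sum(math.log(p) for p in probs) / n)

# A naive product of many small per-term probabilities underflows to 0.0 ...
probs = [1e-4] * 100
naive = math.prod(probs)
# ... while the log-space geometric mean stays well-defined.
gm = geometric_mean_prob(probs)
```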
Large Text Similarity

Using cross entropy also yields this result: taking logarithms of the n-th root gives
$\left(\prod_{i=1}^{n} p(t_i \mid x)\right)^{1/n} = 2^{-H(\tilde{p}_y,\, p_x)}$
where $\tilde{p}_y$ is the empirical term distribution of the compared text y. Another estimator can be calculated using the KL divergence.
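The equivalence between the n-th root and the cross-entropy form can be verified numerically; a sketch under an arbitrary toy model (the terms and probabilities are assumptions):

```python
import math
from collections import Counter

def nth_root_likelihood(terms, model):
    """(prod_i p(t_i|x))^(1/n), computed in log space."""
    n = len(terms)
    return math.exp(sum(math.log(model[t]) for t in terms) / n)

def cross_entropy_sim(terms, model):
    """2^{-H(p_y, p_x)}, with p_y the empirical term distribution of the text."""
    n = len(terms)
    emp = {t: c / n for t, c in Counter(terms).items()}
    h = -sum(p * math.log2(model[t]) for t, p in emp.items())
    return 2 ** (-h)

terms = ["a", "a", "b"]
model = {"a": 0.6, "b": 0.4}
```

Both expressions reduce to the same geometric mean of the per-term probabilities, so they agree up to floating-point error.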
References

Zhai, C. and Lafferty, J. D. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR.
Chen, S. F. and Goodman, J. (1998). An empirical study of smoothing techniques for language modeling. Tech. Rep. TR-10-98, Harvard University.
Lavrenko, V., Allan, J., DeGuzman, E., LaFlamme, D., Pollard, V., and Thomas, S. (2002). Relevance models for topic detection and tracking. In Proceedings of the Human Language Technology Conference (HLT).
Kurland, O. (2006). Inter-Document Similarities, Language Models, and Ad-Hoc Information Retrieval. Ph.D. Thesis, Cornell University.