Language-model-based similarity on large texts Tolga Çekiç 7.7.2014 1/10.

Slides:



Advertisements
Similar presentations
Less is More Probabilistic Model for Retrieving Fewer Relevant Docuemtns Harr Chen and David R. Karger MIT CSAIL SIGIR2006 4/30/2007.
Advertisements

Language Models Naama Kraus (Modified by Amit Gross) Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze.
Information retrieval – LSI, pLSI and LDA
1 Language Models for TR (Lecture for CS410-CXZ Text Info Systems) Feb. 25, 2011 ChengXiang Zhai Department of Computer Science University of Illinois,
Language Models Hongning Wang
Cumulative Progress in Language Models for Information Retrieval Antti Puurula 6/12/2013 Australasian Language Technology Workshop University of Waikato.
Information Retrieval in Practice
Chapter 7 Retrieval Models.
IR Challenges and Language Modeling. IR Achievements Search engines  Meta-search  Cross-lingual search  Factoid question answering  Filtering Statistical.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
Latent Dirichlet Allocation a generative model for text
Language Models for TR Rong Jin Department of Computer Science and Engineering Michigan State University.
Formal Multinomial and Multiple- Bernoulli Language Models Don Metzler.
1 CS 430 / INFO 430 Information Retrieval Lecture 10 Probabilistic Information Retrieval.
Language Modeling Approaches for Information Retrieval Rong Jin.
Chapter 7 Retrieval Models.
Generating Impact-Based Summaries for Scientific Literature Qiaozhu Mei, ChengXiang Zhai University of Illinois at Urbana-Champaign 1.
Probabilistic Model for Definitional Question Answering Kyoung-Soo Han, Young-In Song, and Hae-Chang Rim Korea University SIGIR 2006.
IRDM WS Chapter 4: Advanced IR Models 4.1 Probabilistic IR 4.2 Statistical Language Models (LMs) Principles and Basic LMs Smoothing.
Improved search for Socially Annotated Data Authors: Nikos Sarkas, Gautam Das, Nick Koudas Presented by: Amanda Cohen Mostafavi.
Language Models Hongning Wang Two-stage smoothing [Zhai & Lafferty 02] c(w,d) |d| P(w|d) = +  p(w|C) ++ Stage-1 -Explain unseen words -Dirichlet.
Bayesian Extension to the Language Model for Ad Hoc Information Retrieval Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping Presented by Chen Yi-Ting.
Estimating Topical Context by Diverging from External Resources SIGIR’13, July 28–August 1, 2013, Dublin, Ireland. Presenter: SHIH, KAI WUN Romain Deveaud.
A General Optimization Framework for Smoothing Language Models on Graph Structures Qiaozhu Mei, Duo Zhang, ChengXiang Zhai University of Illinois at Urbana-Champaign.
Multilingual Retrieval Experiments with MIMOR at the University of Hildesheim René Hackl, Ralph Kölle, Thomas Mandl, Alexandra Ploedt, Jan-Hendrik Scheufen,
Relevance Feedback Hongning Wang What we have learned so far Information Retrieval User results Query Rep Doc Rep (Index) Ranker.
Web Image Retrieval Re-Ranking with Relevance Model Wei-Hao Lin, Rong Jin, Alexander Hauptmann Language Technologies Institute School of Computer Science.
Less is More Probabilistic Models for Retrieving Fewer Relevant Documents Harr Chen, David R. Karger MIT CSAIL ACM SIGIR 2006 August 9, 2006.
Lecture 1: Overview of IR Maya Ramanath. Who hasn’t used Google? Why did Google return these results first ? Can we improve on it? Is this a good result.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Positional Relevance Model for Pseudo–Relevance Feedback Yuanhua Lv & ChengXiang Zhai Department of Computer Science, UIUC Presented by Bo Man 2014/11/18.
Semantic v.s. Positions: Utilizing Balanced Proximity in Language Model Smoothing for Information Retrieval Rui Yan†, ♮, Han Jiang†, ♮, Mirella Lapata‡,
Carnegie Mellon Novelty and Redundancy Detection in Adaptive Filtering Yi Zhang, Jamie Callan, Thomas Minka Carnegie Mellon University {yiz, callan,
A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation Frank Wood and Yee Whye Teh AISTATS 2009 Presented by: Mingyuan.
Latent Dirichlet Allocation
Relevance-Based Language Models Victor Lavrenko and W.Bruce Croft Department of Computer Science University of Massachusetts, Amherst, MA SIGIR 2001.
Dependence Language Model for Information Retrieval Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu, Guihong Cao, Dependence Language Model for Information Retrieval,
Language Modeling Putting a curve to the bag of words Courtesy of Chris Jordan.
NTNU Speech Lab Dirichlet Mixtures for Query Estimation in Information Retrieval Mark D. Smucker, David Kulp, James Allan Center for Intelligent Information.
Using Social Annotations to Improve Language Model for Information Retrieval Shengliang Xu, Shenghua Bao, Yong Yu Shanghai Jiao Tong University Yunbo Cao.
Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining Qiaozhu Mei and ChengXiang Zhai Department of Computer Science.
Relevance Models and Answer Granularity for Question Answering W. Bruce Croft and James Allan CIIR University of Massachusetts, Amherst.
Relevance Feedback Hongning Wang
A Generation Model to Unify Topic Relevance and Lexicon-based Sentiment for Opinion Retrieval Min Zhang, Xinyao Ye Tsinghua University SIGIR
A Framework to Predict the Quality of Answers with Non-Textual Features Jiwoon Jeon, W. Bruce Croft(University of Massachusetts-Amherst) Joon Ho Lee (Soongsil.
MTH 253 Calculus (Other Topics) Chapter 11 – Infinite Sequences and Series Section 11.5 – The Ratio and Root Tests Copyright © 2009 by Ron Wallace, all.
The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University.
Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.
A Study of Poisson Query Generation Model for Information Retrieval
Chapter 7 Retrieval Models 1.  Provide a mathematical framework for defining the search process  Includes explanation of assumptions  Basis of many.
A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval Chengxiang Zhai, John Lafferty School of Computer Science Carnegie.
Language Modeling Again So are we smooth now? Courtesy of Chris Jordan.
Using Statistical Decision Theory and Relevance Models for Query-Performance Prediction Anna Shtok and Oren Kurland and David Carmel SIGIR 2010 Hao-Chin.
Bayesian Extension to the Language Model for Ad Hoc Information Retrieval Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping Microsoft Research Cambridge,
VECTOR SPACE INFORMATION RETRIEVAL 1Adrienn Skrop.
A Formal Study of Information Retrieval Heuristics
True/False questions (3pts*2)
Statistical Language Models
An Empirical Study of Learning to Rank for Entity Search
Smoothing 10/27/2017.
Relevance Feedback Hongning Wang
Introduction to Statistical Modeling
John Lafferty, Chengxiang Zhai School of Computer Science
CS 4501: Information Retrieval
Junghoo “John” Cho UCLA
Language Models Hongning Wang
CS590I: Information Retrieval
INF 141: Information Retrieval
Conceptual grounding Nisheeth 26th March 2019.
Language Models for TR Rong Jin
Presentation transcript:

Language-model-based similarity on large texts Tolga Çekiç /10

Language-model-based similarity For term t and document/text x: With a smoothing estimate using Dirichlet disribution: 2/10

Smoothing General form of a smoothed model: Many smoothing methods that use same basic principle; Jelinek-Mercer Method, Katz Smoothing, Bayesian Smoothing, Absolute Discounting, etc. 3/10

Bayesian Smoothing Using Dirichlet Priors Dirichlet distribution with parameters: The model becomes: 4/10

Dirichlet Priors When using Dirichlet priors, α d is document dependent The optimal value of µ is not much different for larger and smaller queries The optimal prior µ depends on the collection but mostly around /10

Estimator for Term Sequence Assuming term indepence(unigram approach) for a term sequence the language model estimator is calculated as: Standard estimator for a probability of a short query An estimation approach that utilizes Cross Entropy measure 6/10

Cross Entropy The cross entropy for two distributions over the same probability space: is Kullback-Leibler divergence from q to p 7/10

Large Text Similarity When two large texts are compared, i.e when n is a large number the result is very close to zero and underflow problems occur One way to circumvent the problem is to take n-th root of the result 8/10

Large Text Similarity Using cross entropy also shows that result: Another estimator can be calculated using KL- divergence 9/10

References Zhai, C. and Laerty, J. D. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR, pages S. F. Chen and J. Goodman (1998). “An empirical study of smoothing techniques for language modeling,” Tech. Rep. TR-10-98, Harvard University. Lavrenko, V., Allan, J., DeGuzman, E., LaFlamme, D., Pollard, V., and Thomas, S. (2002a). Relevance models for topic detection and tracking. In Proceedings of the Human Language Technology Conference (HLT), pages Kurland, O. (2006). Inter-Document Similarities, Language Models, and Ad-Hoc Information Retrieval. Ph.D. Thesis. CornellUniversity. 10/10