A Language Modeling Approach to Information Retrieval
한경수, 2002-04-02
Outline: Introduction; Previous Work; Model Description; Empirical Results; Conclusions


A Language Modeling Approach to Information Retrieval (한경수)
Outline: Introduction; Previous Work; Model Description; Empirical Results; Conclusions and Future Work; Relevance Feedback in LM

한경수, LM Approach to IR

Indexing model of the probabilistic retrieval model (Introduction)
- A model of the assignment of indexing terms to documents.
- Indexing model of the 2-Poisson model: indicate the useful indexing terms by means of the differences in their rate of occurrence in documents elite for a given term vs. those without the property of eliteness.
- The current indexing models have not led to improved retrieval results, due to two unwarranted assumptions:
  – Documents are members of pre-defined classes, which leads to a combinatorial explosion of elite sets.
  – The parametric assumption: it is unnecessary to construct a parametric model of the data when we have the actual data.

Retrieval based on a probabilistic LM (Introduction)
- Treat the generation of queries as a random process.
- Approach:
  – Infer a language model for each document.
  – Estimate the probability of generating the query according to each of these models.
  – Rank the documents according to these probabilities.
- Intuition: users have a reasonable idea of the terms that are likely to occur in documents of interest, and will choose query terms that distinguish these documents from others in the collection.
- Collection statistics are integral parts of the language model; they are not used heuristically as in many other approaches.

Probabilistic IR (Introduction)
[Figure: an information need is expressed as a query, which is matched against documents d1, d2, …, dn in the collection.]

IR based on LM (Introduction)
[Figure: each document d1, d2, …, dn in the collection is viewed as generating the query that expresses the information need.]

Previous Work
- Differences from the 2-Poisson model:
  – No distributional assumptions are made.
  – No subset of specialty words is distinguished: no pre-existing classification of documents into elite and non-elite sets is assumed.
- Differences from the Robertson & Sparck Jones model and the Croft & Harper model:
  – Relevance is not a focus, except to the extent that the process of query production is correlated with it.
- Other related work: the Fuhr model, INQUERY, and the models of Kwok, Wong & Yao, and Kalt.

Query generation probability (Model Description)
- Ranking formula: rank document d by p(Q | M_d), the probability of producing the query Q given the language model of document d.
- Assumption: given a particular language model, the query terms occur independently:

    p(Q | M_d) = ∏_{t ∈ Q} p_ml(t | M_d),  with  p_ml(t | M_d) = tf(t,d) / dl(d)

- Notation: M_d is the language model of document d; tf(t,d) is the raw term frequency of t in d; dl(d) is the total number of tokens in d.
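As a rough sketch of the maximum likelihood estimate above (the toy text and function names are my own, not the paper's):

```python
from collections import Counter

def ml_query_likelihood(query_terms, doc_tokens):
    """p_ml(Q | M_d): product over query terms of tf(t,d) / dl(d)."""
    counts = Counter(doc_tokens)
    dl = len(doc_tokens)
    p = 1.0
    for t in query_terms:
        p *= counts[t] / dl  # zero when t is absent; addressed on the next slide
    return p

doc = "satellite launch contract for a satellite payload".split()
print(ml_query_likelihood(["satellite", "launch"], doc))  # (2/7) * (1/7)
```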

Insufficient data (Model Description)
- Zero probability: we do not wish to assign a probability of zero to a document that is missing one or more of the query terms; it is a somewhat radical assumption to infer that p(t | M_d) = 0 whenever tf(t,d) = 0.
- Assumption: a non-occurring term is possible, but no more likely than what would be expected by chance in the collection. If tf(t,d) = 0,

    p(t | M_d) = cf(t) / cs

- Notation: cf(t) is the raw count of term t in the collection; cs is the raw collection size (the total number of tokens in the collection).

Averaging for robustness (Model Description)
- If we could get an arbitrarily sized sample of data from M_d, we could be reasonably confident in the maximum likelihood estimator; but we only have a document-sized sample from that distribution.
- To circumvent this problem, we need an estimate from a larger amount of data: the mean of p_ml(t | M_d) over all documents containing t,

    p_avg(t) = ( Σ_{d : tf(t,d) > 0} p_ml(t | M_d) ) / df(t)

- Notation: df(t) is the document frequency of t.
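A small sketch of the averaged estimate, assuming documents are given as token lists (a representation I chose for illustration):

```python
from collections import Counter

def p_avg(term, docs):
    """Mean of p_ml(term | M_d) over the documents that contain the term."""
    estimates = [Counter(d)[term] / len(d) for d in docs if term in d]
    if not estimates:
        return 0.0
    return sum(estimates) / len(estimates)  # len(estimates) == df(term)

docs = [["a", "b", "a", "c"], ["b", "c"], ["a", "a"]]
print(p_avg("a", docs))  # (2/4 + 2/2) / 2 = 0.75
```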

The Risk (Model Description)
- We cannot, and are not, assuming that every document containing t is drawn from the same language model, so there is some risk in using the mean to estimate p(t | M_d): if we used the mean by itself, there would be no distinction between documents with different term frequencies.
- The risk for a term t in a document d is modeled with a geometric distribution:

    R(t,d) = (1 / (1 + f(t,d))) × (f(t,d) / (1 + f(t,d)))^tf(t,d)

  where f(t,d) = p_avg(t) × dl(d) is the mean term frequency of term t in documents where t occurs, normalized by document length and scaled to document d.
- As tf(t,d) gets further away from the normalized mean, the mean probability becomes riskier to use as an estimate.
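The geometric risk can be sketched as a small helper (function and parameter names are mine):

```python
def risk(tf, f_mean):
    """R(t,d) = (1 / (1 + f)) * (f / (1 + f)) ** tf, where f is the mean
    term frequency of t, normalized by document length and scaled to d."""
    return (1.0 / (1.0 + f_mean)) * (f_mean / (1.0 + f_mean)) ** tf

# The risk is highest when the term is rare in this document and shrinks
# as tf grows, so the mean estimate matters less for frequent terms.
print(risk(0, 1.0), risk(1, 1.0), risk(3, 1.0))  # 0.5 0.25 0.0625
```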

Combining the two estimates (Model Description)
- The risk R(t,d) mixes the document-based and mean-based estimates:

    p(t | M_d) = p_ml(t | M_d)^(1 − R(t,d)) × p_avg(t)^R(t,d)   if tf(t,d) > 0
    p(t | M_d) = cf(t) / cs                                      otherwise
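Putting the pieces together, a minimal sketch of the combined estimator (toy corpus; helper names are my own, and the scored document is assumed to be included in `docs`):

```python
from collections import Counter

def score(query_terms, doc_tokens, docs, coll_counts, cs):
    """p(Q | M_d) with the combined estimate:
    p_ml^(1 - R) * p_avg^R when tf > 0, else cf(t) / cs."""
    counts = Counter(doc_tokens)
    dl = len(doc_tokens)
    total = 1.0
    for t in query_terms:
        tf = counts[t]
        if tf > 0:
            p_ml = tf / dl
            ests = [Counter(d)[t] / len(d) for d in docs if t in d]
            p_avg = sum(ests) / len(ests)
            f = p_avg * dl  # mean tf scaled to this document's length
            r = (1.0 / (1.0 + f)) * (f / (1.0 + f)) ** tf
            total *= (p_ml ** (1.0 - r)) * (p_avg ** r)
        else:
            total *= coll_counts[t] / cs  # collection fallback for missing terms
    return total

docs = [["a", "b", "a"], ["b", "c"]]
coll = Counter(t for d in docs for t in d)
cs = sum(coll.values())
print(score(["a"], docs[0], docs, coll, cs))  # p_ml == p_avg == 2/3 here
print(score(["a"], docs[1], docs, coll, cs))  # falls back to cf/cs = 2/5
```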

Analysis of the formulation (Model Description)
- Generalization: the formulation of the LM for IR combines a general language model (the collection-wide estimate) with an individual-document model.
- Conception: the user has a document in mind and generates the query from this document; the equation represents the probability that the document the user had in mind was in fact this one.

Experiment Environment (Empirical Results)
- Data:
  – TREC topics on TREC disks 2 and 3: natural-language queries consisting of one sentence each.
  – TREC topics on TREC disk 3 using the concept fields: lists of good terms.
- Example topic:
    Number: 054
    Domain: International Economics
    Topic: Satellite Launch Contracts
    Description: …
    Concept(s):
      1. Contract, agreement
      2. Launch vehicle, rocket, payload, satellite
      3. Launch services, …
    …

Recall/Precision Experiments (1) (Empirical Results) [figure: recall/precision results]

Recall/Precision Experiments (2) (Empirical Results) [figure: recall/precision results]

Improving the Basic Model (1) (Empirical Results)
- Smoothing the estimate of the average probability for terms with low document frequency: the estimate is based on a small amount of data, so it can be sensitive to outliers.
- Binned estimate:
  – Bin the low-frequency data by document frequency (cutoff: df = 100).
  – Use the binned estimate for the average.

Improving the Basic Model (2) (Empirical Results) [figure: results]

Improving the Basic Model (3) (Empirical Results) [figure: results]

Conclusions & Future Work
- Conclusions:
  – A novel way of looking at the problem of text retrieval, based on probabilistic language modeling; conceptually simple and explanatory.
  – LM will provide effective retrieval, and can be improved to the extent that the following conditions are met: our language models are accurate representations of the data; users understand our approach to retrieval; and users have some sense of term distribution.
  – The ability to think about retrieval in a new way.
- Future work:
  – The estimate of the default probability: the current estimator could, in some strange cases, assign a higher probability to a non-occurring query term.
  – Query expansion.

LM approach to multiple relevant documents (Relevance Feedback in LM)
- The current LM approach allows for N+1 language models: N individual-document models (N = collection size) plus the general language model.
  – The relationship between the general language model and the individual-document models is never raised: how can a document be generated from one language model when the entire collection is generated from a different one?
- What we need: a general model for some accumulation of text, which is modified (not replaced) by a local model for some smaller part of the same text.

3-level model (1) (Relevance Feedback in LM)
- The 3-level model consists of:
  – a whole-collection model,
  – a specific-topic model (a relevant-documents model), and
  – an individual-document model.
- Relevance hypothesis:
  – A request (query; topic) is generated from a specific-topic model combined with the collection model.
  – Iff a document is relevant to the topic, the same model will apply to the document: it will replace part of the individual-document model in explaining the document.
  – The probability of relevance of a document is the probability that this model explains part of the document, i.e. that the {collection, topic, document} combination explains the document better than the {collection, document} combination.

3-level model (2) [figure: generation of the query expressing the information need from the 3-level models, over documents d1, d2, …, dn in the collection]

Geometric distribution (1)
- Geometric distribution: repeat independent Bernoulli trials with success probability p until the first success. If X is the total number of trials, then X follows the geometric distribution.
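The definition can be sketched numerically (helper names are mine):

```python
def geometric_pmf(k, p):
    """P(X = k): the first success occurs on trial k, for k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p

def geometric_mean(p):
    """E[X] = 1 / p, the expected number of trials until the first success."""
    return 1.0 / p

# probabilities over k sum to 1, and E[X] for p = 0.2 is 5 trials
print(sum(geometric_pmf(k, 0.2) for k in range(1, 200)))
print(geometric_mean(0.2))  # 5.0
```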

한경수 LM Approach to IR24 Geometric distribution(2)  예 어떤 실험을 한번 하는데 드는 비용은 10 만원이다. 이 실험이 성공 할 확률은 0.2 이고 성공할 때까지 이 실험을 반복한다고 할 때 실험 에 드는 총비용을 얼마로 예상하면 될까 ?