
China-US-France Summer School, Lotus Hill Institute, 2008
Lectures 2 & 3: Statistical Language Models for Information Retrieval
ChengXiang Zhai (翟成祥)
Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, and Statistics
University of Illinois, Urbana-Champaign

Query Generation
Rank documents by p(d|q) ∝ p(q|d) p(d): the query likelihood p(q|θ_d) times the document prior p(d). Assuming a uniform prior, ranking reduces to the query likelihood p(q|θ_d).
Now, the question is how to compute p(q|θ_d). Generally this involves two steps: (1) estimate a language model θ_d based on document d; (2) compute the query likelihood according to the estimated model.
This leads to the so-called "Language Modeling Approach"...

Outline
1. Overview
2. The Basic Language Modeling Approach
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks
5. Summary

What is a Statistical LM?
A probability distribution over word sequences:
– p("Today is Wednesday")
– p("Today Wednesday is")
– p("The eigenvalue is positive")
(The first should get much higher probability than the ungrammatical second; the third depends on the topic.) Context/topic dependent!
A LM can also be regarded as a probabilistic mechanism for "generating" text, and is thus also called a "generative" model.

Why is a LM Useful?
Provides a principled way to quantify the uncertainties associated with natural language.
Allows us to answer questions like:
– Given that we see "John" and "feels", how likely will we see "happy" as opposed to "habit" as the next word? (speech recognition)
– Given that we observe "baseball" three times and "game" once in a news article, how likely is the article about sports? (text categorization, information retrieval)
– Given that a user is interested in sports news, how likely would the user use "baseball" in a query? (information retrieval)

Source-Channel Framework (Model of a Communication System [Shannon 48])
Source → Transmitter (encoder) → Noisy Channel → Receiver (decoder) → Destination
The source emits X with distribution p(X); the channel transforms it via p(Y|X); the receiver observes Y and must recover X, i.e., compute p(X|Y). By Bayes' rule, p(X|Y) ∝ p(Y|X) p(X). When X is text, p(X) is a language model.
Many examples:
– Speech recognition: X = word sequence, Y = speech signal
– Machine translation: X = English sentence, Y = Chinese sentence
– OCR error correction: X = correct word, Y = erroneous word
– Information retrieval: X = document, Y = query
– Summarization: X = summary, Y = document

The Simplest Language Model (Unigram Model)
Generate a piece of text by generating each word independently. Thus,
p(w_1 w_2 ... w_n) = p(w_1) p(w_2) ... p(w_n)
Parameters: {p(w_i)}, with p(w_1) + ... + p(w_N) = 1 (N is the vocabulary size).
Essentially a multinomial distribution over words; a piece of text can be regarded as a sample drawn according to this word distribution.
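To make the generative view concrete, here is a minimal Python sketch of a unigram model; the class, method names, and toy probabilities are our own illustration, not from the slides:

```python
import math
import random

class UnigramLM:
    def __init__(self, word_probs):
        # word_probs: dict mapping word -> probability, summing to 1
        self.p = word_probs

    def prob(self, text):
        # p(w1 w2 ... wn) = p(w1) * p(w2) * ... * p(wn), by independence
        result = 1.0
        for w in text:
            result *= self.p.get(w, 0.0)
        return result

    def log_prob(self, text):
        # Log-space version, avoiding underflow on long texts
        return sum(math.log(self.p[w]) for w in text if w in self.p)

    def sample(self, length, rng=random):
        # Draw each word independently from the multinomial distribution
        words, probs = zip(*self.p.items())
        return rng.choices(words, weights=probs, k=length)

# A toy "text mining" topic model
theta = {"text": 0.2, "mining": 0.1, "association": 0.01, "clustering": 0.02, "the": 0.67}
lm = UnigramLM(theta)
print(lm.prob(["text", "mining"]))  # 0.2 * 0.1 = 0.02
```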

Text Generation with Unigram LM
A (unigram) language model θ gives p(w|θ); sampling from it generates a document d, and given θ, p(d|θ) varies according to d.
– Topic 1 (text mining): p(text) = 0.2, p(mining) = 0.1, p(clustering) = 0.02, p(association) = 0.01, ..., low probability for "food" — likely to generate a text mining paper
– Topic 2 (health): p(food) = 0.25, p(nutrition) = 0.1, p(healthy) = 0.05, p(diet) = 0.02, ... — likely to generate a food nutrition paper

Estimation of Unigram LM
Given a document (total #words = 100) with counts: text 10, mining 5, association 3, database 3, algorithm 2, ..., query 1, efficient 1, ..., estimate p(w|θ) = ?
Maximum likelihood estimate: p(text|θ) = 10/100, p(mining|θ) = 5/100, p(database|θ) = 3/100, ..., p(query|θ) = 1/100, ...
How good is the estimated model? It gives our document sample the highest probability, but it doesn't generalize well... More about this later.
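A minimal sketch of this maximum likelihood estimate, also showing the zero-probability problem that motivates smoothing; the function name and toy document are hypothetical:

```python
from collections import Counter

def mle_unigram(doc_words):
    # p_ML(w|d) = c(w, d) / |d| -- relative frequency
    counts = Counter(doc_words)
    total = len(doc_words)
    return {w: c / total for w, c in counts.items()}

doc = ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["query"]
theta_d = mle_unigram(doc)
print(theta_d["text"])            # 10/19 here; 10/100 in the slide's 100-word example
print(theta_d.get("data", 0.0))   # 0.0 -- unseen words get zero probability
```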

More Sophisticated LMs
N-gram language models:
– In general, p(w_1 w_2 ... w_n) = p(w_1) p(w_2|w_1) ... p(w_n|w_1 ... w_{n-1})
– An n-gram model conditions only on the past n-1 words
– E.g., bigram: p(w_1 ... w_n) = p(w_1) p(w_2|w_1) p(w_3|w_2) ... p(w_n|w_{n-1})
Remote-dependence language models (e.g., maximum entropy models)
Structured language models (e.g., probabilistic context-free grammars)
These will not be covered in detail in this tutorial; if interested, see [Jelinek 98, Manning & Schütze 99, Rosenfeld 00].
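For comparison with the unigram case, a small bigram sketch (our own illustration; note the simplification that the context count c(w_1) is taken from the full unigram counts):

```python
from collections import Counter

# Bigram LM: p(w1 ... wn) = p(w1) * prod_i p(w_i | w_{i-1}), estimated by MLE
def train_bigram(tokens):
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    p_uni = {w: c / total for w, c in uni.items()}
    p_bi = {(a, b): c / uni[a] for (a, b), c in bi.items()}  # p(b|a) = c(a b) / c(a)
    return p_uni, p_bi

def bigram_prob(p_uni, p_bi, tokens):
    prob = p_uni.get(tokens[0], 0.0)
    for pair in zip(tokens, tokens[1:]):
        prob *= p_bi.get(pair, 0.0)   # zero for unseen bigrams -- smoothing needed
    return prob

p_uni, p_bi = train_bigram("the cat sat on the mat".split())
print(bigram_prob(p_uni, p_bi, ["the", "cat"]))  # p(the) * p(cat|the) = 1/3 * 1/2
```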

Why Just Unigram Models?
Difficulty in moving toward more complex models:
– They involve more parameters, so they need more data to estimate (a document is an extremely small sample)
– They significantly increase the computational complexity, in both time and space
Capturing word order or structure may not add much value for "topical inference".
But using more sophisticated models can still be expected to improve performance...

Evaluation of SLMs
Direct evaluation criterion: how well does the model fit the data to be modeled?
– Example measures: data likelihood, perplexity, cross entropy, Kullback-Leibler divergence (mostly equivalent)
Indirect evaluation criterion: does the model help improve the performance of the task?
– The specific measure is task dependent; for retrieval, we look at whether a model helps improve retrieval accuracy
– We hope more "reasonable" LMs will achieve better retrieval performance

Representative LMs for IR (the landscape, roughly grouped)
– Basic LM (query likelihood scoring): Ponte & Croft 98; Hiemstra & Kraaij 99; Miller et al. 99; parameter sensitivity [Ng 00]
– Improved basic LM: beyond unigram [Song & Croft 99]; smoothing examined [Zhai & Lafferty 01a]; two-stage LMs [Zhai & Lafferty 02]; Bayesian query likelihood [Zaragoza et al. 03]; translation model [Berger & Lafferty 99]; URL prior [Kraaij et al. 02]; time prior [Li & Croft 03]; title LM [Jin et al. 02]; term-specific smoothing [Hiemstra 02]; concept likelihood [Srikanth & Srihari 03]; dependency LM [Gao et al. 04]; cluster LM [Kurland & Lee 04]; cluster smoothing [Liu & Croft 04; Tao et al. 06]; parsimonious LM [Hiemstra et al. 04]; thesauri [Cao et al. 05]; robust estimation [Tao & Zhai 06]
– Theoretical justification: Lafferty & Zhai 01a, 01b
– Query/relevance models & feedback: relevance LM [Lavrenko & Croft 01]; Markov-chain query model [Lafferty & Zhai 01b]; model-based feedback [Zhai & Lafferty 01b]; relevant query feedback [Nallapati et al. 03]; pseudo query [Kurland et al. 05]; query expansion [Bai et al. 05]
– Special IR tasks: Xu & Croft 99; Xu et al. 01; Lavrenko et al. 02; Zhang et al. 02; Cronen-Townsend et al. 02; Si et al. 02; Ogilvie & Callan 03; Zhai et al. 03; Shen et al. 05; Kurland & Lee 05; Tan et al. 06
– Dissertations: Ponte 98; Berger 01; Hiemstra 01; Zhai 02; Lavrenko 04; Kraaij 04; Srikanth 04; Tao 06; Kurland 06

Ponte & Croft's Pioneering Work [Ponte & Croft 98]
Contribution 1: a new "query likelihood" scoring method, p(Q|D)
– [Maron & Kuhns 60] had the idea of query likelihood, but didn't work out how to estimate p(Q|D)
Contribution 2: connecting LMs with text representation and weighting in IR
– [Wong & Yao 89] had the idea of representing text with a multinomial distribution (relative frequency), but didn't study the estimation problem
Good performance is reported using the simple query likelihood method.

Early Work (1998-1999)
– At about the same time as SIGIR 98, in TREC-7, two groups explored similar ideas independently: BBN [Miller et al. 99] and the Univ. of Twente [Hiemstra & Kraaij 99]
– In TREC-8, Ng from MIT motivated the same query likelihood method in a different way [Ng 99]
– All follow the simple query likelihood method; the methods differ in how the model is estimated and in the event model for the query
– All show promising empirical results
Main problems: feedback is explored only heuristically, and there is a lack of understanding of why the method works...

Later Work (1999-)
– Attempts to understand why LMs work [Zhai & Lafferty 01a, Lafferty & Zhai 01a, Ponte 01, Greiff & Morgan 03, Sparck Jones et al. 03, Lavrenko 04]
– Further extensions/improvements of the basic LMs [Song & Croft 99, Berger & Lafferty 99, Jin et al. 02, Nallapati & Allan 02, Hiemstra 02, Zaragoza et al. 03, Srikanth & Srihari 03, Nallapati et al. 03, Li & Croft 03, Gao et al. 04, Liu & Croft 04, Kurland & Lee 04, Hiemstra et al. 04, Cao et al. 05, Tao et al. 06]
– Alternative ways of using LMs for retrieval (mostly query/relevance model estimation) [Xu & Croft 99, Lavrenko & Croft 01, Lafferty & Zhai 01a, Zhai & Lafferty 01b, Lavrenko 04, Kurland et al. 05, Bai et al. 05, Tao & Zhai 06]
– Use of SLMs for special retrieval tasks [Xu & Croft 99, Xu et al. 01, Lavrenko et al. 02, Cronen-Townsend et al. 02, Zhang et al. 02, Ogilvie & Callan 03, Zhai et al. 03, Kurland & Lee 05, Shen et al. 05, Balog et al. 06, Fang & Zhai 07]

The Basic Language Modeling Approach

The Basic LM Approach [Ponte & Croft 98]
Estimate a language model for each document, e.g.:
– Text mining paper → p(text) = ?, p(mining) = ?, p(association) = ?, p(clustering) = ?, ..., p(food) = ?, ...
– Food nutrition paper → p(food) = ?, p(nutrition) = ?, p(healthy) = ?, p(diet) = ?, ...
Query = "data mining algorithms": which model would most likely have generated this query?

Ranking Docs by Query Likelihood
For each document d_i in the collection d_1, ..., d_N, estimate a document LM θ_{d_i}; then rank the documents by the query likelihood: d_1 → p(q|θ_{d_1}), d_2 → p(q|θ_{d_2}), ..., d_N → p(q|θ_{d_N}).

Modeling Queries: Different Assumptions
Multi-Bernoulli: modeling word presence/absence
– q = (x_1, ..., x_{|V|}), where x_i = 1 for presence of word w_i and x_i = 0 for absence
– Parameters: {p(w_i=1|d), p(w_i=0|d)}, with p(w_i=1|d) + p(w_i=0|d) = 1
Multinomial (unigram LM): modeling word frequency
– q = q_1 ... q_m, where q_j is a query word; c(w_i,q) is the count of w_i in q
– Parameters: {p(w_i|d)}, with p(w_1|d) + ... + p(w_{|V|}|d) = 1
[Ponte & Croft 98] uses multi-Bernoulli; most other work uses multinomial, which seems to work better [Song & Croft 99, McCallum & Nigam 98, Lavrenko 04].

Retrieval as LM Estimation
Document ranking based on query likelihood:
log p(q|d) = Σ_{i=1}^{m} log p(q_i|d) = Σ_{w∈V} c(w,q) log p(w|d)
where q = q_1 ... q_m and p(w|d) is the document language model.
The retrieval problem thus reduces to the estimation of p(w_i|d). Smoothing is an important issue, and it distinguishes different approaches.

How to Estimate p(w|d)?
Simplest solution: the maximum likelihood estimator, p(w|d) = relative frequency of word w in d.
But what if a word doesn't appear in the text? Then p(w|d) = 0.
In general, what probability should we give a word that has not been observed? If we want to assign non-zero probabilities to such words, we have to discount the probabilities of observed words. This is what "smoothing" is about...

Language Model Smoothing (Illustration)
[Figure omitted: p(w) plotted over words w — the smoothed LM lies below the maximum likelihood estimate on seen words and assigns non-zero probability to unseen words.]

How to Smooth?
All smoothing methods try to:
– discount the probability of words seen in a document
– re-allocate the extra probability mass so that unseen words get a non-zero count
Method 1: Additive smoothing [Chen & Goodman 98] — add a constant δ to the counts of each word:
p(w|d) = (c(w,d) + δ) / (|d| + δ|V|)
where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size. With δ = 1 this is "add one" (Laplace) smoothing.
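A sketch of additive smoothing, assuming a closed vocabulary; the function name and toy data are ours:

```python
from collections import Counter

# Additive (add-delta) smoothing: p(w|d) = (c(w,d) + delta) / (|d| + delta * |V|)
def additive_smoothing(doc_words, vocab, delta=1.0):
    counts = Counter(doc_words)
    denom = len(doc_words) + delta * len(vocab)
    return {w: (counts[w] + delta) / denom for w in vocab}

vocab = {"text", "mining", "data", "algorithms"}
p = additive_smoothing(["text", "text", "mining"], vocab, delta=1.0)
print(p["data"])  # unseen word now gets (0 + 1) / (3 + 4) = 1/7
```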

Improve Additive Smoothing
Should all unseen words get equal probabilities? No — we can use a reference model to discriminate among unseen words:
– p(w|d) = p_seen(w|d) if w is seen in d (the discounted ML estimate)
– p(w|d) = α_d p(w|REF) if w is unseen
where p(w|REF) is the reference language model and α_d is a normalizer controlling the probability mass reserved for unseen words.

Other Smoothing Methods
Method 2: Absolute discounting [Ney et al. 94] — subtract a constant δ from the count of each seen word:
p(w|d) = (max(c(w,d) − δ, 0) + δ |d|_u p(w|REF)) / |d|
where |d|_u is the number of unique words in d.
Method 3: Linear interpolation [Jelinek & Mercer 80] — "shrink" uniformly toward p(w|REF):
p(w|d) = (1 − λ) p_ML(w|d) + λ p(w|REF)
with parameter λ ∈ [0,1] and ML estimate p_ML(w|d) = c(w,d)/|d|.
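Jelinek-Mercer interpolation in code (a sketch; the value lam = 0.3 and the toy reference model are arbitrary):

```python
from collections import Counter

# Jelinek-Mercer smoothing: p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|REF)
def jelinek_mercer(doc_words, p_ref, lam=0.5):
    counts = Counter(doc_words)
    dlen = len(doc_words)
    return lambda w: (1 - lam) * counts[w] / dlen + lam * p_ref.get(w, 0.0)

p_ref = {"text": 0.01, "mining": 0.005, "data": 0.02, "algorithms": 0.003}
p = jelinek_mercer(["text", "text", "mining"], p_ref, lam=0.3)
print(p("data"))  # unseen in d, but non-zero via the reference model: 0.3 * 0.02
```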

Other Smoothing Methods (cont.)
Method 4: Dirichlet prior / Bayesian [MacKay & Peto 95, Zhai & Lafferty 01a, Zhai & Lafferty 02] — assume pseudo counts μ p(w|REF):
p(w|d) = (c(w,d) + μ p(w|REF)) / (|d| + μ)
with parameter μ > 0.
Method 5: Good-Turing [Good 53] — assume the total count of unseen events equals n_1 (the number of singletons), and adjust the counts of seen events in the same way; additional heuristics are needed.
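Dirichlet smoothing plus query-likelihood ranking on a toy collection (a sketch; mu = 2000 is only a typical ballpark value, and the helper names and toy documents are ours):

```python
import math
from collections import Counter

def dirichlet_lm(doc_words, p_ref, mu=2000.0):
    counts = Counter(doc_words)
    dlen = len(doc_words)
    # p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)
    return lambda w: (counts[w] + mu * p_ref.get(w, 1e-9)) / (dlen + mu)

def query_log_likelihood(query_words, p_wd):
    # log p(q|d) = sum_i log p(q_i|d)
    return sum(math.log(p_wd(w)) for w in query_words)

# Rank a toy two-document collection for the query "data mining"
docs = {"d1": "text mining and data mining methods".split(),
        "d2": "food nutrition and healthy diet".split()}
coll = [w for d in docs.values() for w in d]
p_ref = {w: c / len(coll) for w, c in Counter(coll).items()}  # collection LM
scores = {name: query_log_likelihood(["data", "mining"], dirichlet_lm(d, p_ref))
          for name, d in docs.items()}
print(sorted(scores, key=scores.get, reverse=True))  # d1 should rank first
```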

So, Which Method Is the Best?
It depends on the data and the task! Cross-validation is generally used to choose the best method and/or to set the smoothing parameters.
For retrieval, Dirichlet prior smoothing performs well. Backoff smoothing [Katz 87] doesn't work well, due to a lack of second-stage smoothing.
Note that many other smoothing methods exist; see [Chen & Goodman 98] and other publications in speech recognition.

Comparison of Three Methods [Zhai & Lafferty 01a]
[Table omitted.] The comparison is performed on a variety of test collections.

Understanding Smoothing
The general smoothing scheme:
– p(w|d) = p_seen(w|d) for words seen in d (the discounted ML estimate)
– p(w|d) = α_d p(w|REF) for unseen words (the reference language model)
The key step is rewriting the query-likelihood retrieval formula using this general scheme (next slide); similar rewritings are very common when using LMs for IR...

Smoothing & TF-IDF Weighting [Zhai & Lafferty 01a]
Plugging the general smoothing scheme into the query-likelihood retrieval formula, we obtain
log p(q|d) = Σ_{w∈q∩d} c(w,q) log [ p_seen(w|d) / (α_d p(w|C)) ] + m log α_d + Σ_w c(w,q) log p(w|C)
– The sum over words in both query and document gives TF weighting combined with IDF-like weighting (the division by p(w|C))
– The m log α_d term acts like document length normalization (a long doc is expected to have a smaller α_d)
– The last term is document-independent and can be ignored for ranking
Smoothing with p(w|C) ⇒ TF-IDF weighting + length normalization: smoothing implements traditional retrieval heuristics, and LMs with simple smoothing can be computed as efficiently as traditional retrieval models.
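For reference, here is the rewriting spelled out (our transcription of the derivation; q = q_1 ... q_m, so Σ_w c(w,q) = m, and p(w|d) = α_d p(w|C) for unseen words):

```latex
\log p(q \mid d)
  = \sum_{w} c(w,q)\,\log p(w\mid d)
  = \sum_{w \in q \cap d} c(w,q)\,\log p_{\mathrm{seen}}(w\mid d)
    + \sum_{w \in q,\, w \notin d} c(w,q)\,\log\bigl(\alpha_d\, p(w\mid C)\bigr)
  = \sum_{w \in q \cap d} c(w,q)\,
      \log \frac{p_{\mathrm{seen}}(w\mid d)}{\alpha_d\, p(w\mid C)}
    + m \log \alpha_d
    + \sum_{w} c(w,q)\,\log p(w\mid C)
```

The second step adds and subtracts the term log(α_d p(w|C)) for the seen words, which is what isolates the document-independent last term.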

The Dual Role of Smoothing [Zhai & Lafferty 02]
[Figure omitted: retrieval precision as a function of the smoothing parameter, for keyword (short) queries vs. verbose (long) queries.]
Verbose queries are far more sensitive to smoothing than keyword queries. Why does query type affect smoothing sensitivity?

Another Reason for Smoothing
Query = "the algorithms for data mining"
Compare the unsmoothed ML models p_ML(w|d1) and p_ML(w|d2) of two documents:
– p("algorithms"|d1) = p("algorithms"|d2)
– p("data"|d1) < p("data"|d2)
– p("mining"|d1) < p("mining"|d2)
Intuitively, d2 should score higher on the content words, yet p(q|d1) > p(q|d2) because the common words "the" and "for" dominate.
So we should make p("the") and p("for") less different across documents, and smoothing with a background model p(w|REF) helps achieve exactly this goal...

Two-Stage Smoothing [Zhai & Lafferty 02]
p(w|d) = (1 − λ) · (c(w,d) + μ p(w|C)) / (|d| + μ) + λ p(w|U)
– Stage 1 (Dirichlet prior, Bayesian): explains unseen words, using the collection LM p(w|C) with parameter μ
– Stage 2 (two-component mixture): explains noise in the query, interpolating with a user background model p(w|U) (which can be approximated by p(w|C)) with parameter λ
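The two-stage formula in code (a sketch; the parameter values are placeholders, since the point of the paper is to estimate μ and λ automatically, as the next slides describe):

```python
from collections import Counter

# Two-stage smoothing [Zhai & Lafferty 02]:
# p(w|d) = (1 - lam) * (c(w,d) + mu * p(w|C)) / (|d| + mu) + lam * p(w|U)
def two_stage_lm(doc_words, p_coll, mu=2000.0, lam=0.5, p_user=None):
    p_user = p_user or p_coll            # approximate p(w|U) by p(w|C)
    counts = Counter(doc_words)
    dlen = len(doc_words)
    def p(w):
        dirichlet = (counts[w] + mu * p_coll.get(w, 1e-9)) / (dlen + mu)  # stage 1
        return (1 - lam) * dirichlet + lam * p_user.get(w, 1e-9)          # stage 2
    return p
```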

China-US-France Summer School, Lotus Hill Inst © ChengXiang Zhai, Estimating  using leave-one-out [Zhai & Lafferty 02] P(w 1 |d - w 1 ) P(w 2 |d - w 2 ) log-likelihood Maximum Likelihood Estimator Newton’s Method Leave-one-out w1w1 w2w2 P(w n |d - w n ) wnwn...

Why Would "Leave-One-Out" Work?
Consider two 20-word samples:
– Author 1: "abc abc ab c d d abc cd d d abd ab ab ab ab cd d e cd e" (few distinct words)
– Author 2: "abc abc ab c d d abe cb e f acf fb ef aff abef cdc db ge f s" (many distinct words)
Suppose we keep sampling and get 10 more words. Which author is likely to "write" more new words? Author 2 — so Author 2's model needs more smoothing (μ must be big), while Author 1's μ doesn't have to be big.
Now, suppose we leave a singleton like "e" out: a model with too little smoothing assigns it nearly zero probability, so the leave-one-out likelihood pushes μ up exactly when new words are likely. The amount of smoothing is closely related to the underlying vocabulary size.

Estimating λ Using a Mixture Model [Zhai & Lafferty 02]
Stage 1: estimate each document model p(w|d_i), i = 1, ..., N, with Dirichlet smoothing (μ estimated as above).
Stage 2: treat the query q = q_1 ... q_m as a sample from a mixture over the documents, where each component is (1 − λ) p(w|d_i) + λ p(w|U).
λ is set to its maximum likelihood estimate, computed with the Expectation-Maximization (EM) algorithm.

Automatic 2-Stage Results ≈ Optimal 1-Stage Results [Zhai & Lafferty 02]
[Table omitted: average precision over 3 databases and 4 query types, 150 topics; * indicates a significant difference.]
Completely automatic tuning of parameters IS POSSIBLE!

The Notion of Relevance
– Similarity-based relevance: Δ(Rep(q), Rep(d)), with different representations and similarity functions — the vector space model [Salton et al. 75] and the probabilistic distribution model [Wong & Yao 89]
– Probability of relevance, P(r=1|q,d) with r ∈ {0,1}:
– Generative models, doc generation: the classical probabilistic model [Robertson & Sparck Jones 76] and the regression model [Fox 83]
– Generative models, query generation: the basic LM approach [Ponte & Croft 98] — this is how LMs were initially applied to IR
– Probabilistic inference, P(d→q) or P(q→d): the probabilistic concept space model [Wong & Yao 95] and, with a different inference system, the inference network model [Turtle & Croft 91] — later, LMs were used along these lines too

Interpretation of Query Likelihood [Lafferty & Zhai 01a]
P(Q|D) = P(Q|D,R=1): the probability that a user who likes D would pose query Q — a relevance-based interpretation of the so-called "document language model".
Assuming a uniform document prior, ranking by P(D|Q,R=1) reduces to ranking by the query likelihood p(q|θ_d).
Computing P(Q|D,R=1) generally involves two steps: (1) estimate a language model based on D; (2) compute the query likelihood according to the estimated model.

Variants of the Basic LM Approach
Different smoothing strategies:
– Hidden Markov Models (essentially linear interpolation) [Miller et al. 99]
– Smoothing with an IDF-like reference model [Hiemstra & Kraaij 99]
– Performance tends to be similar to the basic LM approach; many other possibilities for smoothing exist [Chen & Goodman 98]
Different priors:
– Link information as prior leads to significant improvement of Web entry page retrieval performance [Kraaij et al. 02]
– Time as prior [Li & Croft 03]
– PageRank as prior [Kurland & Lee 05]
Passage retrieval [Liu & Croft 02]

More Advanced Language Models

Improving the Basic LM Approach
– Capturing limited dependencies: bigrams/trigrams [Song & Croft 99]; grammatical dependency [Nallapati & Allan 02, Srikanth & Srihari 03, Gao et al. 04] — generally insignificant improvement compared with other extensions such as feedback
– Full Bayesian query likelihood [Zaragoza et al. 03] — performance similar to the basic LM approach
– Translation model for p(Q|D,R) [Berger & Lafferty 99, Jin et al. 02, Cao et al. 05] — addresses polysemy and synonyms; improves over the basic LM methods, but computationally expensive
– Cluster-based smoothing/scoring [Liu & Croft 04, Kurland & Lee 04, Tao et al. 06] — improves over the basic LM, but computationally expensive
– Parsimonious LMs [Hiemstra et al. 04] — use a mixture model to "factor out" non-discriminative words

Translation Models
Directly model the "translation" relationship between words in the query and words in a document:
p(q|d) = Π_{i=1}^{m} Σ_w p_t(q_i|w) p(w|d)
where p_t is the translation model and p(w|d) is the regular document LM.
When relevance judgments are available, (q,d) pairs serve as data to train the translation model. Without relevance judgments, we can use synthetic data [Berger & Lafferty 99, Jin et al. 02] or thesauri [Cao et al. 05].
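A toy translation-model scorer; all the probability tables below are made up purely for illustration:

```python
# Translation-model query likelihood: p(q|d) = prod_i sum_w p_t(q_i|w) * p(w|d)
def translation_likelihood(query_words, p_t, p_wd):
    prob = 1.0
    for q in query_words:
        # Sum over document words w that can "translate" into the query word q
        prob *= sum(p_t.get((q, w), 0.0) * pw for w, pw in p_wd.items())
    return prob

p_wd = {"car": 0.5, "engine": 0.5}                     # toy document LM
p_t = {("automobile", "car"): 0.6, ("engine", "engine"): 0.9}
print(translation_likelihood(["automobile"], p_t, p_wd))  # matches via synonymy: 0.3
```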

Cluster-Based Smoothing/Scoring
– Cluster-based smoothing: smooth a document LM with a cluster of similar documents [Liu & Croft 04]; improves over the basic LM, but insignificantly
– Document expansion smoothing: smooth a document LM with its neighboring documents (essentially one cluster per document) [Tao et al. 06]; improves over the basic LM more significantly
– Cluster-based query likelihood: similar to the translation model, but "translate" the whole document to the query through a set of clusters [Kurland & Lee 04]: p(q|d) = Σ_C p(q|C) p(C|d), combining the likelihood of q given cluster C with how likely document d belongs to C; only effective when interpolated with the basic LM scores

Feedback in Language Models

Overview of Feedback Techniques
Feedback as machine learning — many possibilities:
– Standard ML: given examples of relevant (and non-relevant) documents, learn to classify a new document as either "relevant" or "non-relevant"
– "Modified" ML: given a query and examples of relevant (and non-relevant) documents, learn to rank new documents by relevance
– Challenges: sparse data; censored sample; how to deal with the query?
– Modeling noise in pseudo feedback (as semi-supervised learning)
Feedback as query expansion — traditional IR:
– Step 1: term selection; Step 2: query expansion; Step 3: query term re-weighting
Traditional IR is still robust (Rocchio), but ML approaches can potentially be more accurate.

Feedback and Doc/Query Generation
– Classic probabilistic model (doc generation): estimate the relevant doc model P(D|Q,R=1) and the non-relevant doc model P(D|Q,R=0)
– Query likelihood ("language model", query generation): estimate the "relevant query" model P(Q|D,R=1)
Parameter estimation uses judged tuples (q,d,r), e.g., (q1,d1,1), (q1,d2,1), (q1,d3,1), (q1,d4,0), (q1,d5,0), (q3,d1,1), (q4,d1,1), (q5,d1,1), (q6,d2,1), (q6,d3,0).
Initial retrieval: query as relevant doc vs. doc as relevant query; P(Q|D,R=1) is more accurate.
Feedback: P(D|Q,R=1) can be improved for the current query and future documents (doc-based feedback); P(Q|D,R=1) can also be improved, but only for the current document and future queries (query-based feedback).

Difficulty in Feedback with Query Likelihood
– Traditional query expansion [Ponte 98, Miller et al. 99, Ng 99]: improvement is reported, but there is a conceptual inconsistency — what is an expanded query, a piece of text or a set of terms?
– Avoiding expansion: query term reweighting [Hiemstra 01, Hiemstra 02]; translation models [Berger & Lafferty 99, Jin et al. 02] — only limited feedback is achieved
– Doing relevant query expansion instead [Nallapati et al. 03]
The difficulty is due to the lack of a query/relevance model. It can be overcome with alternative ways of using LMs for retrieval, e.g., the relevance model [Lavrenko & Croft 01] or query model estimation [Lafferty & Zhai 01b; Zhai & Lafferty 01b].

Two Alternative Ways of Using LMs
– Classic probabilistic model: doc generation, as opposed to query generation — natural for relevance feedback; the challenge is estimating p(D|Q,R=1) without relevance feedback, to which the relevance model [Lavrenko & Croft 01] provides a good solution
– Probabilistic distance model: similar to the vector-space model, but with LMs instead of TF-IDF weight vectors — a popular distance function is the Kullback-Leibler (KL) divergence, which covers query likelihood as a special case; retrieval becomes estimating query & document models, and feedback is treated as query LM updating [Lafferty & Zhai 01b; Zhai & Lafferty 01b]
Both methods outperform the basic LM significantly.

Relevance Model Estimation [Lavrenko & Croft 01]
Question: how to estimate P(D|Q,R) (or p(w|Q,R)) without relevant documents?
Key idea:
– Treat the query as observations about p(w|Q,R)
– Approximate the model space with document models
Two methods for decomposing p(w,Q):
– Independent sampling (Bayesian model averaging) — the original formula in [Lavrenko & Croft 01]
– Conditional sampling: p(w,Q) = p(w) p(Q|w)
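A sketch of the independent-sampling estimate, p(w|Q,R) ∝ Σ_D p(D) p(w|D) Π_i p(q_i|D), assuming uniform document priors and Jelinek-Mercer-smoothed document models; all names are ours:

```python
import math
from collections import Counter

def relevance_model(query, doc_models, p_coll, lam=0.5):
    # doc_models: list of ML document models (dicts); p_coll: collection LM
    def p(w, model):
        # Jelinek-Mercer smoothed p(w|D)
        return (1 - lam) * model.get(w, 0.0) + lam * p_coll.get(w, 1e-9)
    rm = Counter()
    for model in doc_models:                      # uniform prior p(D)
        q_lik = math.prod(p(q, model) for q in query)
        for w in set(model) | set(p_coll):
            rm[w] += p(w, model) * q_lik          # weight words by query likelihood
    total = sum(rm.values())
    return {w: v / total for w, v in rm.items()}  # normalize to p(w|Q,R)
```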

Kernel-Based Allocation [Lavrenko 04]
A general generative model for text — a kernel-based density function:
p(w_1 ... w_n) = (1/|T|) Σ_{t∈T} K(w_1 ... w_n; t)
i.e., the average probability of w_1 ... w_n over all training points t in the training data T — an infinite mixture model.
Choices of the kernel function K:
– Delta kernel: each training point contributes only its own probability
– Dirichlet kernel: allows a training point to "spread" its influence

Query Model Estimation [Lafferty & Zhai 01b, Zhai & Lafferty 01b]
Question: how to estimate a better query model than the ML estimate based on the original query?
"Massive feedback": improve a query model through co-occurrence patterns learned from:
– A document-term Markov chain that outputs the query [Lafferty & Zhai 01b]
– Thesauri, corpus [Bai et al. 05, Collins-Thompson & Callan 05]
Model-based feedback: improve the estimate of the query model by exploiting pseudo-relevance feedback:
– Update the query model by interpolating the original query model with a learned feedback model [Zhai & Lafferty 01b]
– Estimate a more integrated mixture model using pseudo-feedback documents [Tao & Zhai 06]

Feedback as Model Interpolation [Zhai & Lafferty 01b]
Retrieval: score each document D against the query model θ_Q (e.g., by divergence minimization between θ_Q and θ_D).
Feedback: from the feedback documents F = {d_1, d_2, ..., d_n}, estimate a feedback model θ_F (by a generative model or divergence minimization), then interpolate:
θ_Q' = (1 − α) θ_Q + α θ_F
α = 0: no feedback; α = 1: full feedback.
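Interpolation feedback plus KL-based scoring in code (a sketch; the constant eps and the function names are ours):

```python
import math

# theta_Q' = (1 - alpha) * theta_Q + alpha * theta_F
def interpolate(theta_q, theta_f, alpha=0.5):
    words = set(theta_q) | set(theta_f)
    return {w: (1 - alpha) * theta_q.get(w, 0.0) + alpha * theta_f.get(w, 0.0)
            for w in words}

def kl_score(theta_q, theta_d, eps=1e-9):
    # Rank-equivalent to -KL(theta_q || theta_d): sum_w p(w|theta_q) * log p(w|theta_d)
    return sum(p * math.log(theta_d.get(w, eps)) for w, p in theta_q.items() if p > 0)
```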

China-US-France Summer School, Lotus Hill Inst © ChengXiang Zhai,  F Estimation Method I: Generative Mixture Model w w F={D 1, …, D n } Maximum Likelihood P(w|  ) P(w| C) 1- P(source) Background words Topic words The learned topic model is called a “parsimonious language model” in [Hiemstra et al. 04]

China-US-France Summer School, Lotus Hill Inst © ChengXiang Zhai,  F Estimation Method II: Empirical Divergence Minimization D1D1 F={D 1, …, D n } DnDn  close Empirical divergence Divergence minimization far ( ) C Background model

Example of Feedback Query Model
TREC topic 412: "airport security", using the mixture-model approach on a Web database, with the top 10 documents used for feedback.
[Tables omitted: the learned feedback query models for λ = 0.9 and λ = 0.7.]

Model-Based Feedback Improves over the Simple LM [Zhai & Lafferty 01b]
Translation models, relevance models, and feedback-based query models have all been shown to improve performance significantly over the simple LMs. (Parameter tuning is necessary in many cases, but see [Tao & Zhai 06] for "parameter-free" pseudo feedback.)

Some Further Improvements [Tao & Zhai 06]
– Document-specific mixing coefficients (to model non-relevant content)
– Use the query as a prior
– Regularized EM: start with a strong prior and gradually reduce its strength to achieve the feedback effect
These increase the robustness of the model.

LMs for Special Retrieval Tasks

Cross-Lingual IR
Use a query in language A (e.g., English) to retrieve documents in language B (e.g., Chinese).
– Method 1: cross-lingual p(Q|D,R) [Xu et al. 01] — generate each English query word from the Chinese document through a translation model, roughly p(Q|D) ≈ Π_i Σ_c p_t(e_i|c) p(c|D), where the translation model p_t(e|c) from Chinese words to English words is estimated with a bilingual lexicon or parallel corpora
– Method 2: cross-lingual p(D|Q,R) [Lavrenko et al. 02] — a cross-lingual relevance model, estimated with parallel corpora

Distributed IR
Retrieve documents from multiple collections; the task is generally decomposed into two subtasks: collection selection and result fusion.
Using LMs for collection selection [Xu & Croft 99, Si et al. 02]:
– Treat collection selection as "retrieving collections" as opposed to "documents"
– Estimate each collection model by maximum likelihood [Si et al. 02] or clustering [Xu & Croft 99]
Using LMs for result fusion [Si et al. 02]:
– Assume query-likelihood scoring for all collections, but with a distinct reference LM used for smoothing on each collection
– Adjust the biased score p(Q|D,Collection) to recover the fair score p(Q|D)

Structured Document Retrieval [Ogilvie & Callan 03]
A document D consists of parts D_1 (title), D_2 (abstract), D_3, ... (body parts), ..., D_k. Generate a query word by first selecting a part D_j and then generating the word from D_j:
p(w|D) = Σ_j p(D_j|D) p(w|D_j)
– The "part selection" probability p(D_j|D) serves as the weight for D_j and can be trained using EM
– The goal is to combine different parts of a document with appropriate weights; anchor text can be treated as a "part" of a document
– Applicable to XML retrieval
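A sketch of the part-mixture document model; the weights here are hand-set for illustration rather than EM-trained:

```python
# Structured-document LM: p(w|D) = sum_j p(D_j|D) * p(w|D_j)
def structured_lm(part_models, part_weights):
    # part_models: list of per-part LMs p(w|D_j); part_weights: p(D_j|D), summing to 1
    def p(w):
        return sum(wt * model.get(w, 0.0)
                   for wt, model in zip(part_weights, part_models))
    return p

title = {"language": 0.5, "models": 0.5}
body = {"retrieval": 0.2, "language": 0.1, "smoothing": 0.1, "the": 0.6}
p = structured_lm([title, body], [0.3, 0.7])
print(p("language"))  # 0.3*0.5 + 0.7*0.1 = 0.22
```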

Personalized/Context-Sensitive Search [Shen et al. 05, Tan et al. 06]
User information and search context can be used to estimate a better query model; refinement of this model leads to specific retrieval formulas.
Simple models often end up interpolating many unigram language models based on different sources of evidence, e.g., short-term search history [Shen et al. 05] or long-term search history [Tan et al. 06].
Context-independent query LM vs. context-sensitive query LM: the latter conditions the query model on the user's history and search context.

Modeling Redundancy
Given two documents D_1 and D_2, decide how redundant D_1 (or D_2) is w.r.t. D_2 (or D_1).
Redundancy of D_1 ≈ "to what extent can D_1 be explained by a model estimated based on D_2?"
Use a unigram mixture model [Zhai 02]: fit p(w) = λ p(w|θ_{D_2}) + (1 − λ) p(w|REF) to D_1, where the LM for D_2 is mixed with a reference LM; the maximum likelihood estimate of λ (computed with the EM algorithm) serves as the measure of redundancy.
See [Zhang et al. 02] for a 3-component redundancy model. Along a similar line, we could measure document similarity in an asymmetric way [Kurland & Lee 05].
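An EM sketch that fits only the mixing weight λ, which then serves as the redundancy score (a minimal version of our own; the iteration count is arbitrary):

```python
from collections import Counter

# Fit p(w) = lam * p(w|D2) + (1 - lam) * p(w|REF) to D1 by EM over lam alone
def redundancy(d1_words, p_d2, p_ref, iters=50):
    counts = Counter(d1_words)
    total = sum(counts.values())
    lam = 0.5
    for _ in range(iters):
        # E-step: posterior that each word occurrence in D1 came from the D2 model
        expected = sum(c * (lam * p_d2.get(w, 1e-9)) /
                       (lam * p_d2.get(w, 1e-9) + (1 - lam) * p_ref.get(w, 1e-9))
                       for w, c in counts.items())
        lam = expected / total   # M-step: expected fraction of D2-generated words
    return lam                   # high lam -> D1 is largely explained by D2
```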

Predicting Query Difficulty [Cronen-Townsend et al. 02]
Observations:
– Discriminative queries tend to be easier
– Comparing the query model with the collection model indicates how discriminative a query is
Method:
– Define "query clarity" as the KL divergence between an estimated query model (or relevance model) and the collection LM
– An enriched query LM can be estimated by exploiting pseudo feedback (e.g., a relevance model)
A correlation between clarity scores and retrieval performance is found.
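A clarity-score sketch, clarity(Q) = KL(θ_Q || collection); base-2 logs and the smoothing constant eps are our own choices:

```python
import math

def clarity(theta_q, p_coll, eps=1e-9):
    # KL(theta_q || p_coll) = sum_w p(w|theta_Q) * log2( p(w|theta_Q) / p(w|C) )
    return sum(p * math.log2(p / p_coll.get(w, eps))
               for w, p in theta_q.items() if p > 0)

# A query model concentrated on rare (discriminative) words scores high;
# one close to the collection distribution scores near zero (a hard query).
```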

Expert Finding [Balog et al. 06, Fang & Zhai 07]
Task: given a topic T, a list of candidates {C_i}, and a collection of support documents S = {D_i}, rank the candidates according to the likelihood that a candidate C is an expert on T.
Retrieval analogy:
– Query = topic T; document = candidate C; rank according to P(R=1|T,C)
– Derivations similar to those on slides 55-56 and 64 can be made, yielding a candidate generation model and a topic generation model

Summary

SLMs vs. Traditional IR
Pros:
– Statistical foundations (better parameter setting)
– More principled way of handling term weighting
– More powerful for modeling subtopics, passages, ...
– Leverage LMs developed in related areas
– Empirically as effective as well-tuned traditional models, with potential for automatic parameter tuning
Cons:
– Lack of discrimination (a common problem with generative models)
– Less robust in some cases (e.g., when queries are semi-structured)
– Computationally complex
– Empirically, performance appears to be inferior to well-tuned full-fledged traditional methods (at least, there is no evidence for beating them)

What We Have Achieved So Far
– A framework and justification for using LMs for IR
– Several effective models have been developed:
– The basic LM with Dirichlet prior smoothing is a reasonable baseline
– The basic LM with informative priors often improves performance
– The translation model handles polysemy & synonyms
– The relevance model incorporates LMs into the classic probabilistic IR model
– The KL-divergence model ties feedback with query model estimation
– Mixture models can model redundancy and subtopics
– Completely automatic tuning of parameters is possible
– LMs can be applied to virtually any retrieval task, with great potential for modeling complex IR problems

Challenges and Future Directions
Challenge 1: Establish a robust and effective LM that:
– Optimizes retrieval parameters automatically
– Performs as well as or better than well-tuned traditional retrieval methods with pseudo feedback
– Is as efficient as traditional retrieval methods
Can LMs consistently (convincingly) outperform traditional methods without sacrificing efficiency?
Challenge 2: Demonstrate consistent and substantial improvement by going beyond unigram LMs:
– Model limited dependency between terms
– Derive more principled weighting methods for phrases
Can we do much better by going beyond unigram LMs?

Challenges and Future Directions (cont.)
Challenge 3: Develop LMs that can support "life-time learning":
– Improve accuracy for a current query by learning from past relevance judgments
– Support collaborative information retrieval
How can we learn effectively from past relevance judgments?
Challenge 4: Develop LMs that can model document structures and subtopics:
– Recognize query-specific boundaries of relevant passages
– Passage-based/subtopic-based feedback
– Combine different structural components of a document
How can we break the document unit in a principled way?

Challenges and Future Directions (cont.)
Challenge 5: Develop LMs to support personalized search:
– Infer and track a user's interests with LMs
– Incorporate the user's preferences and search context in retrieval
– Customize/organize search results according to the user's interests
How can we exploit user information and search context to improve search?
Challenge 6: Generalize LMs to handle relational data:
– Develop LMs for semi-structured data (e.g., XML)
– Develop LMs to handle structured queries
– Develop LMs for keyword search in relational databases
What role can LMs play when combining text with relational data?

Challenges and Future Directions (cont.)
Challenge 7: Develop LMs for hypertext retrieval:
– Combine LMs with link information
– Model and exploit anchor text
– Develop a unified LM for hypertext search
How can we develop an effective unified retrieval model for Web search?
Challenge 8: Develop LMs for retrieval with complex information needs, e.g.:
– Subtopic retrieval
– Readability-constrained retrieval
– Entity retrieval (e.g., expert search)
How can we exploit LMs to develop models for complex retrieval tasks?

Lectures 2 & 3: Key Points
Statistical language models represent a new generation of probabilistic models for retrieval:
– They better connect IR with statistics (estimation)
– They better connect search with machine learning (unsupervised and semi-supervised learning)
– They achieve good empirical performance
– They can model a variety of special retrieval problems
Performance-wise, they haven't yet convincingly outperformed traditional TF-IDF models.

References
[Agichtein & Cucerzan 05] E. Agichtein and S. Cucerzan. Predicting accuracy of extracting information from unstructured text collections. Proceedings of ACM CIKM 2005.
[Baeza-Yates & Ribeiro-Neto 99] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley.
[Bai et al. 05] J. Bai, D. Song, P. Bruza, J.-Y. Nie, G. Cao. Query expansion using term relationships in language models for information retrieval. Proceedings of ACM CIKM 2005.
[Balog et al. 06] K. Balog, L. Azzopardi, M. de Rijke. Formal models for expert finding in enterprise corpora. Proceedings of ACM SIGIR 2006.
[Berger & Lafferty 99] A. Berger and J. Lafferty. Information retrieval as statistical translation. Proceedings of ACM SIGIR 1999.
[Berger 01] A. Berger. Statistical machine learning for information retrieval. Ph.D. dissertation, Carnegie Mellon University.
[Blei et al. 02] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA.
[Cao et al. 05] G. Cao, J.-Y. Nie, J. Bai. Integrating word relationships into language models. Proceedings of ACM SIGIR 2005.
[Carbonell & Goldstein 98] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of ACM SIGIR 1998.
[Chen & Goodman 98] S. F. Chen and J. T. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University.
[Collins-Thompson & Callan 05] K. Collins-Thompson and J. Callan. Query expansion using random walk models. Proceedings of ACM CIKM 2005.
[Cronen-Townsend et al. 02] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proceedings of ACM SIGIR 2002.
[Croft & Lafferty 03] W. B. Croft and J. Lafferty (eds). Language Modeling and Information Retrieval. Kluwer Academic Publishers.
[Fang et al. 04] H. Fang, T. Tao and C. Zhai. A formal study of information retrieval heuristics. Proceedings of ACM SIGIR 2004.

References (cont.)
[Fang & Zhai 07] H. Fang and C. Zhai. Probabilistic models for expert finding. Proceedings of ECIR 2007.
[Fox 83] E. Fox. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. PhD thesis, Cornell University.
[Fuhr 01] N. Fuhr. Language models and uncertain inference in information retrieval. In Proceedings of the Language Modeling and IR workshop.
[Gao et al. 04] J. Gao, J. Nie, G. Wu, and G. Cao. Dependence language model for information retrieval. In Proceedings of ACM SIGIR 2004.
[Good 53] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3 and 4).
[Greiff & Morgan 03] W. Greiff and W. Morgan. Contributions of language modeling to the theory and practice of IR. In W. B. Croft and J. Lafferty (eds), Language Modeling for Information Retrieval, Kluwer Academic Publishers.
[Grossman & Frieder 04] D. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics, 2nd edition. Springer.
[He & Ounis 05] B. He and I. Ounis. A study of the Dirichlet priors for term frequency normalisation. Proceedings of ACM SIGIR 2005.
[Hiemstra & Kraaij 99] D. Hiemstra and W. Kraaij. Twenty-One at TREC-7: ad-hoc and cross-language track. In Proceedings of the Seventh Text REtrieval Conference (TREC-7).
[Hiemstra 01] D. Hiemstra. Using Language Models for Information Retrieval. PhD dissertation, University of Twente, Enschede, The Netherlands.
[Hiemstra 02] D. Hiemstra. Term-specific smoothing for the language modeling approach to information retrieval: the importance of a query term. In Proceedings of ACM SIGIR 2002.
[Hiemstra et al. 04] D. Hiemstra, S. Robertson, and H. Zaragoza. Parsimonious language models for information retrieval. In Proceedings of ACM SIGIR 2004.
[Hofmann 99] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of ACM SIGIR 1999.
[Jarvelin & Kekalainen 02] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM TOIS, 20(4).
[Jelinek 98] F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge.
[Jelinek & Mercer 80] F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice. North-Holland, Amsterdam.

References (cont.)
[Jeon et al. 03] J. Jeon, V. Lavrenko and R. Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of ACM SIGIR 2003.
[Jin et al. 02] R. Jin, A. Hauptmann, and C. Zhai. Title language models for information retrieval. In Proceedings of ACM SIGIR 2002.
[Kalt 96] T. Kalt. A new probabilistic model of text classification and retrieval. University of Massachusetts Technical Report TR98-18, 1996.
[Katz 87] S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35.
[Kraaij et al. 02] W. Kraaij, T. Westerveld, D. Hiemstra. The importance of prior probabilities for entry page search. Proceedings of ACM SIGIR 2002.
[Kraaij 04] W. Kraaij. Variations on Language Modeling for Information Retrieval. Ph.D. thesis, University of Twente, 2004.
[Kurland & Lee 04] O. Kurland and L. Lee. Corpus structure, language models, and ad hoc information retrieval. In Proceedings of ACM SIGIR 2004.
[Kurland et al. 05] O. Kurland, L. Lee, C. Domshlak. Better than the real thing?: iterative pseudo-query processing using cluster-based language models. Proceedings of ACM SIGIR 2005.
[Kurland & Lee 05] O. Kurland and L. Lee. PageRank without hyperlinks: structural re-ranking using links induced by language models. Proceedings of ACM SIGIR 2005.
[Lafferty & Zhai 01a] J. Lafferty and C. Zhai. Probabilistic IR models based on query and document generation. In Proceedings of the Language Modeling and IR workshop.
[Lafferty & Zhai 01b] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proceedings of ACM SIGIR 2001.
[Lavrenko & Croft 01] V. Lavrenko and W. B. Croft. Relevance-based language models. In Proceedings of ACM SIGIR 2001.
[Lavrenko et al. 02] V. Lavrenko, M. Choquette, and W. Croft. Cross-lingual relevance models. In Proceedings of ACM SIGIR 2002.
[Lavrenko 04] V. Lavrenko. A Generative Theory of Relevance. Ph.D. thesis, University of Massachusetts.
[Li & Croft 03] X. Li and W. B. Croft. Time-based language models. In Proceedings of CIKM 2003.
[Liu & Croft 02] X. Liu and W. B. Croft. Passage retrieval based on language models. In Proceedings of CIKM 2002.

References (cont.)
[Liu & Croft 04] X. Liu and W. B. Croft. Cluster-based retrieval using language models. In Proceedings of ACM SIGIR 2004.
[MacKay & Peto 95] D. MacKay and L. Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(3).
[Maron & Kuhns 60] M. E. Maron and J. L. Kuhns. On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7.
[McCallum & Nigam 98] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In AAAI-1998 Learning for Text Categorization Workshop.
[Miller et al. 99] D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden Markov model information retrieval system. In Proceedings of ACM SIGIR 1999.
[Minka & Lafferty 03] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Proceedings of UAI 2002.
[Nallapati & Allan 02] R. Nallapati and J. Allan. Capturing term dependencies using a language model based on sentence trees. In Proceedings of CIKM 2002.
[Nallapati et al. 03] R. Nallapati, W. B. Croft, and J. Allan. Relevant query feedback in statistical language modeling. In Proceedings of CIKM 2003.
[Ney et al. 94] H. Ney, U. Essen, and R. Kneser. On structuring probabilistic dependencies in stochastic language modeling. Computer Speech and Language, 8(1).
[Ng 00] K. Ng. A maximum likelihood ratio information retrieval model. In E. Voorhees and D. Harman, editors, Proceedings of the Eighth Text REtrieval Conference (TREC-8).
[Ogilvie & Callan 03] P. Ogilvie and J. Callan. Combining document representations for known item search. In Proceedings of ACM SIGIR 2003.
[Ponte & Croft 98] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of ACM SIGIR 1998.
[Ponte 98] J. M. Ponte. A language modeling approach to information retrieval. PhD dissertation, University of Massachusetts, Amherst, MA, September 1998.

References (cont.)
[Ponte 01] J. Ponte. Is information retrieval anything more than smoothing? In Proceedings of the Workshop on Language Modeling and Information Retrieval.
[Robertson & Sparck Jones 76] S. Robertson and K. Sparck Jones. Relevance weighting of search terms. JASIS, 27.
[Robertson 77] S. E. Robertson. The probability ranking principle in IR. Journal of Documentation, 33.
[Robertson & Walker 94] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. Proceedings of ACM SIGIR 1994.
[Rosenfeld 00] R. Rosenfeld. Two decades of statistical language modeling: where do we go from here? In Proceedings of the IEEE, volume 88.
[Salton et al. 75] G. Salton, A. Wong and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11).
[Salton & Buckley 88] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5).
[Shannon 48] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27 (1948).
[Shen et al. 05] X. Shen, B. Tan, and C. Zhai. Context-sensitive information retrieval with implicit feedback. In Proceedings of ACM SIGIR 2005.
[Si et al. 02] L. Si, R. Jin, J. Callan and P. Ogilvie. A language model framework for resource selection and results merging. In Proceedings of CIKM 2002.
[Singhal et al. 96] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. Proceedings of ACM SIGIR 1996.
[Singhal 01] A. Singhal. Modern information retrieval: a brief overview. IEEE Data Engineering Bulletin, 24(4).
[Song & Croft 99] F. Song and W. B. Croft. A general language model for information retrieval. In Proceedings of CIKM 1999.
[Sparck Jones 72] K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28 (1972); reprinted in 60 (2004).

References (cont.)
[Sparck Jones et al. 00] K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: development and comparative experiments, parts 1 and 2. Information Processing and Management, 36(6).
[Sparck Jones et al. 03] K. Sparck Jones, S. Robertson, D. Hiemstra, H. Zaragoza. Language modeling and relevance. In W. B. Croft and J. Lafferty (eds), Language Modeling for Information Retrieval, Kluwer Academic Publishers.
[Srikanth & Srihari 03] M. Srikanth, R. K. Srihari. Exploiting syntactic structure of queries in a language modeling approach to IR. In Proceedings of CIKM 2003.
[Srikanth 04] M. Srikanth. Exploiting query features in language modeling approach for information retrieval. Ph.D. dissertation, State University of New York at Buffalo.
[Tan et al. 06] B. Tan, X. Shen, and C. Zhai. Mining long-term search history to improve search accuracy. Proceedings of ACM KDD 2006.
[Tao et al. 06] T. Tao, X. Wang, Q. Mei, and C. Zhai. Language model information retrieval with document expansion. Proceedings of HLT/NAACL 2006.
[Tao & Zhai 06] T. Tao and C. Zhai. Regularized estimation of mixture models for robust pseudo-relevance feedback. Proceedings of ACM SIGIR 2006.
[Turtle & Croft 91] H. Turtle and W. B. Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3).
[van Rijsbergen 86] C. J. van Rijsbergen. A non-classical logic for information retrieval. The Computer Journal, 29(6).
[Witten et al. 99] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Academic Press, San Diego, 2nd edition.
[Wong & Yao 89] S. K. M. Wong and Y. Y. Yao. A probability distribution model for information retrieval. Information Processing and Management, 25(1).
[Wong & Yao 95] S. K. M. Wong and Y. Y. Yao. On modeling information retrieval with probabilistic inference. ACM Transactions on Information Systems, 13(1).
[Xu & Croft 99] J. Xu and W. B. Croft. Cluster-based language models for distributed retrieval. In Proceedings of ACM SIGIR 1999.
[Xu et al. 01] J. Xu, R. Weischedel, and C. Nguyen. Evaluating a probabilistic model for cross-lingual information retrieval. In Proceedings of ACM SIGIR 2001.

References (cont.)
[Zaragoza et al. 03] H. Zaragoza, D. Hiemstra and M. Tipping. Bayesian extension to the language model for ad hoc information retrieval. In Proceedings of ACM SIGIR 2003.
[Zhai & Lafferty 01a] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of ACM SIGIR 2001.
[Zhai & Lafferty 01b] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of CIKM 2001.
[Zhai & Lafferty 02] C. Zhai and J. Lafferty. Two-stage language models for information retrieval. In Proceedings of ACM SIGIR 2002.
[Zhai et al. 03] C. Zhai, W. Cohen, and J. Lafferty. Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In Proceedings of ACM SIGIR 2003.
[Zhai & Lafferty 06] C. Zhai and J. Lafferty. A risk minimization framework for information retrieval. Information Processing and Management, 42(1), Jan. 2006.
[Zhai 02] C. Zhai. Language Modeling and Risk Minimization in Text Retrieval. Ph.D. thesis, Carnegie Mellon University.
[Zhang et al. 02] Y. Zhang, J. Callan, and T. P. Minka. Novelty and redundancy detection in adaptive filtering. In Proceedings of ACM SIGIR 2002.

Discussion
– Generative models for text vs. generative models for images/video
– Query models in multimedia retrieval: independent models for different media vs. joint models; how to learn such a query model using "multimedia feedback"? (Learn a text model from image feedback? Learn an image model from text feedback?)
– Special retrieval tasks for multimedia: entity retrieval, video summarization, cross-language image search, ...