Language Models Hongning Wang
Recap: document generation model
Model of relevant docs for Q vs. model of non-relevant docs for Q. Assume independent attributes A_1 … A_k (why?). Let D = d_1 … d_k, where d_i ∈ {0,1} is the value of attribute A_i (similarly Q = q_1 … q_k).

                           relevant (R=1)   non-relevant (R=0)
  term present (A_i = 1)        p_i                u_i
  term absent  (A_i = 0)      1 - p_i            1 - u_i

(Factors that do not depend on the document are ignored for ranking.)
Recap: document generation model (cont.)

                           relevant (R=1)   non-relevant (R=0)
  term present (A_i = 1)        p_i                u_i
  term absent  (A_i = 0)      1 - p_i            1 - u_i

Assumption (an important trick): terms not occurring in the query are equally likely to occur in relevant and non-relevant documents, i.e., p_t = u_t.
Recap: maximum likelihood vs. Bayesian. Maximum likelihood estimation – "best" means the data likelihood reaches its maximum (ML: the frequentist's point of view) – issue: small sample size. Bayesian estimation – "best" means being consistent with our "prior" knowledge while explaining the data well – a.k.a. maximum a posteriori (MAP) estimation (the Bayesian's point of view) – issue: how to define the prior?
Recap: Robertson-Sparck Jones model (Robertson & Sparck Jones 76). Two parameters for each term A_i: p_i = P(A_i=1|Q,R=1), the probability that term A_i occurs in a relevant doc, and u_i = P(A_i=1|Q,R=0), the probability that term A_i occurs in a non-relevant doc (the RSJ model). How do we estimate these parameters? Suppose we have relevance judgments:
  p_i = (#relevant docs containing A_i + 0.5) / (#relevant docs + 1)
  u_i = (#non-relevant docs containing A_i + 0.5) / (#non-relevant docs + 1)
The "+0.5" and "+1" can be justified by Bayesian estimation as priors. Note this is per-query estimation!
Recap: the BM25 formula

score(q,d) = Σ_{w ∈ q∩d} log[(N − df(w) + 0.5)/(df(w) + 0.5)] · [(k1+1)·c(w,d)] / [k1·(1 − b + b·|d|/avdl) + c(w,d)] · [(k3+1)·c(w,q)] / [k3 + c(w,q)]

The first two factors form a TF-IDF component for the document; the last is a TF component for the query. It is essentially a vector space model with a TF-IDF schema!
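To make the factorization concrete, here is a minimal Python sketch of one common BM25 parameterization (the function name bm25_term and the values k1=1.2, b=0.75, k3=1000 are conventional illustrative choices, not fixed by the slide):

```python
import math

def bm25_term(df, N, tf_d, dl, avgdl, tf_q, k1=1.2, b=0.75, k3=1000):
    idf = math.log((N - df + 0.5) / (df + 0.5))            # RSJ-style IDF
    tf_doc = (k1 + 1) * tf_d / (k1 * (1 - b + b * dl / avgdl) + tf_d)
    tf_query = (k3 + 1) * tf_q / (k3 + tf_q)               # query TF component
    return idf * tf_doc * tf_query

# one query term appearing 3x in a 120-word doc (avg length 100),
# occurring in 50 of 10,000 docs (all numbers hypothetical)
print(bm25_term(df=50, N=10_000, tf_d=3, dl=120, avgdl=100, tf_q=1))
```

The full score is the sum of bm25_term over the terms shared by query and document.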
Notion of relevance
- Relevance
  - (Rep(q), Rep(d)) similarity → different rep & similarity: vector space model (Salton et al., 75); prob. distr. model (Wong & Yao, 89); …
  - P(r=1|q,d), r ∈ {0,1}: probability of relevance
    - Regression model (Fox 83)
    - Generative model
      - Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
      - Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
  - P(d→q) or P(q→d): probabilistic inference → different inference systems: prob. concept space model (Wong & Yao, 95); inference network model (Turtle & Croft, 91)
What is a statistical LM? A model specifying a probability distribution over word sequences – p("Today is Wednesday") – p("Today Wednesday is") – p("The eigenvalue is positive"). It can be regarded as a probabilistic mechanism for "generating" text, thus also called a "generative" model.
Why is a LM useful? Provides a principled way to quantify the uncertainties associated with natural language. Allows us to answer questions like: – Given that we see "John" and "feels", how likely will we see "happy" as opposed to "habit" as the next word? (speech recognition) – Given that we observe "baseball" three times and "game" once in a news article, how likely is it about "sports"? (text categorization, information retrieval) – Given that a user is interested in sports news, how likely would the user use "baseball" in a query? (information retrieval)
Source-Channel framework [Shannon 48]
Source → Transmitter (encoder) → Noisy Channel → Receiver (decoder) → Destination: the source emits X ~ P(X), the channel turns it into Y ~ P(Y|X), and the receiver decodes X' = argmax_X P(X|Y) = argmax_X P(Y|X)·P(X) (Bayes rule). When X is text, p(X) is a language model.
Many examples:
– Speech recognition: X = word sequence, Y = speech signal
– Machine translation: X = English sentence, Y = Chinese sentence
– OCR error correction: X = correct word, Y = erroneous word
– Information retrieval: X = document, Y = query
– Summarization: X = summary, Y = document
Language model for text
Chain rule: from conditional probabilities to the joint probability of a sentence:
p(w_1 w_2 … w_n) = p(w_1) · p(w_2|w_1) · p(w_3|w_1 w_2) ··· p(w_n|w_1 … w_{n-1})
How large is this model? The average English sentence length is 14.3 words, and there are 475,000 main headwords in Webster's Third New International Dictionary, so the full joint model has an astronomical number of parameters (on the order of 475,000^14). We need independence assumptions!
Unigram language model
Assume each word is generated independently:
p(w_1 w_2 … w_n) = ∏_{i=1}^{n} p(w_i)
The simplest and most popular choice! (Only |V| − 1 free parameters, one probability per vocabulary word.)
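As a quick illustration, a minimal Python sketch scoring word sequences under a toy unigram model (the probabilities below are made up):

```python
import math

# assumed toy unigram model: p(w) for each word (illustrative values)
unigram = {"today": 0.002, "is": 0.07, "wednesday": 0.001}

def sentence_log_prob(words, model):
    # log p(w1 ... wn) = sum_i log p(wi) under the unigram assumption
    return sum(math.log(model[w]) for w in words)

print(sentence_log_prob(["today", "is", "wednesday"], unigram))
print(sentence_log_prob(["today", "wednesday", "is"], unigram))  # identical score
```

Note that reordering the words leaves the score unchanged, a direct consequence of the independence assumption.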
More sophisticated LMs: N-gram language models condition each word on the previous n−1 words, e.g., a bigram model uses p(w_1 w_2 … w_n) = p(w_1) ∏_{i=2}^{n} p(w_i|w_{i-1}); beyond N-grams there are remote-dependency and structured language models (e.g., probabilistic grammars).
Why just unigram models? Difficulty in moving toward more complex models – they involve more parameters, so they need more data to estimate – they increase the computational complexity significantly, both in time and space. Capturing word order or structure may not add much value for "topical inference". But using more sophisticated models can still be expected to improve performance.
Generative view of text documents: a (unigram) language model p(w|θ) per topic, from which documents are generated by sampling.
Topic 1: text mining – text 0.2, mining 0.1, association 0.01, clustering 0.02, … food … → a text mining document
Topic 2: health – text 0.01, food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … → a food nutrition document
Sampling with replacement: pick a random shape, then put it back in the bag.
How to generate a text document from an N-gram language model? Draw one word at a time from the model's distribution, conditioned on the previous n−1 words, and append it to the document; under a unigram model every word is drawn independently from the same distribution (sampling with replacement).
Generating text from language models. Under a unigram language model, a document's likelihood is just the product of its word probabilities, so any reordering of the same words has exactly the same likelihood!
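A minimal sketch of this sampling-with-replacement view, assuming an illustrative "text mining" topic distribution (random.choices normalizes the weights internally):

```python
import random

# illustrative topic distribution (truncated; weights need not sum to 1)
vocab = ["text", "mining", "association", "clustering"]
weights = [0.2, 0.1, 0.01, 0.02]

random.seed(0)
doc = random.choices(vocab, weights=weights, k=10)  # 10 independent draws
print(" ".join(doc))
```

Because every draw is independent, shuffling the generated document does not change its probability under the model.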
N-gram language models will help (samples generated from language models trained on the New York Times):
Unigram – Months the my and issue of year foreign new exchange's september were recession exchange new endorsed a q acquire to six executives.
Bigram – Last December through the way to preserve the Hudson corporation N.B.E.C. Taylor would seem to complete the major central planners one point five percent of U.S.E. has already told M.X. corporation of living on information such as more frequently fishing to keep her.
Trigram – They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions.
Turing test: generating Shakespeare. Four text samples, A through D: which are real Shakespeare and which were generated from a language model? (In the same spirit, see SCIgen, an automatic CS paper generator.)
Recap: what is a statistical LM? A model specifying a probability distribution over word sequences – p("Today is Wednesday") – p("Today Wednesday is") – p("The eigenvalue is positive"). It can be regarded as a probabilistic mechanism for "generating" text, thus also called a "generative" model.
Recap: Source-Channel framework [Shannon 48]
Source → Transmitter (encoder) → Noisy Channel → Receiver (decoder) → Destination: the source emits X ~ P(X), the channel turns it into Y ~ P(Y|X), and the receiver decodes X' = argmax_X P(X|Y) = argmax_X P(Y|X)·P(X) (Bayes rule). When X is text, p(X) is a language model.
Many examples: speech recognition (X = word sequence, Y = speech signal); machine translation (X = English sentence, Y = Chinese sentence); OCR error correction (X = correct word, Y = erroneous word); information retrieval (X = document, Y = query); summarization (X = summary, Y = document).
Recap: language model for text
Chain rule: from conditional probabilities to the joint probability of a sentence:
p(w_1 w_2 … w_n) = p(w_1) · p(w_2|w_1) · p(w_3|w_1 w_2) ··· p(w_n|w_1 … w_{n-1})
With an average English sentence length of 14.3 words and 475,000 main headwords in Webster's Third New International Dictionary, the full joint model is astronomically large. We need independence assumptions!
Recap: how to generate a text document from an N-gram language model? Draw one word at a time from the model's distribution, conditioned on the previous n−1 words; under a unigram model every word is drawn independently from the same distribution.
Estimation of language models: given a "text mining" paper (total #words = 100) with counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, estimate the unigram language model p(w|θ) = ? (text ?, mining ?, association ?, database ?, …, query ?, …)
Sampling with replacement: pick a random shape, then put it back in the bag.
Estimation of language models: maximum likelihood estimation. For the "text mining" paper (total #words = 100): p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, …, p(query|θ) = 1/100, …
Maximum likelihood estimation: maximize the data likelihood

θ̂ = argmax_θ p(d|θ) = argmax_θ Σ_w c(w,d) · log p(w|θ)

Using a Lagrange multiplier for the constraint Σ_w p(w|θ) = 1 gives

p(w|θ̂) = c(w,d) / |d|

where |d| is the length of the document (or the total number of words in a corpus).
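For instance, a small Python sketch of the MLE computation using the toy counts from the "text mining" paper above:

```python
from collections import Counter

# toy counts from the slide; total document length is 100 words
counts = Counter({"text": 10, "mining": 5, "association": 3,
                  "database": 3, "algorithm": 2, "query": 1, "efficient": 1})
doc_length = 100

# p(w|θ̂) = c(w,d) / |d|
p_mle = {w: c / doc_length for w, c in counts.items()}
print(p_mle["text"])   # 0.1
print(p_mle["query"])  # 0.01
```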
Language models for IR [Ponte & Croft SIGIR'98]: estimate a language model for each document (text mining paper: text ?, mining ?, association ?, clustering ?, …, food ?, …; food nutrition paper: food ?, nutrition ?, healthy ?, diet ?, …), then ask: which model would most likely have generated the query "data mining algorithms"?
Ranking docs by query likelihood: estimate a doc LM θ_{d_i} for each document d_1, d_2, …, d_N, then rank by the query likelihoods p(q|θ_{d_1}), p(q|θ_{d_2}), …, p(q|θ_{d_N}). Justification: PRP.
Justification from PRP: by Bayes rule, p(d|q) ∝ p(q|d) · p(d), i.e., the query likelihood (query generation) times a document prior. Assuming a uniform document prior, ranking by p(d|q) is equivalent to ranking by p(q|d).
Retrieval as language model estimation: rank by

log p(q|d) = Σ_i log p(q_i|θ_d) = Σ_{w∈V} c(w,q) · log p(w|θ_d)

so the key problem is estimating the document language model p(w|θ_d).
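Putting the last two slides together, a sketch of query-likelihood ranking over hypothetical (already smoothed) document models:

```python
import math
from collections import Counter

def query_log_likelihood(query_words, model):
    # log p(q|θ_d) = Σ_w c(w,q) · log p(w|θ_d)
    return sum(c * math.log(model[w]) for w, c in Counter(query_words).items())

# hypothetical smoothed document models; every query word must have nonzero
# probability, which is exactly what smoothing (below) guarantees
doc_models = {
    "d1": {"data": 0.02, "mining": 0.05, "algorithms": 0.01},
    "d2": {"data": 0.04, "mining": 0.01, "algorithms": 0.01},
}
query = ["data", "mining", "algorithms"]
print(sorted(doc_models,
             key=lambda d: query_log_likelihood(query, doc_models[d]),
             reverse=True))  # ['d1', 'd2']
```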
Problem with MLE: unseen events. A large collection of Yelp reviews contains 440K tokens, but only 30,000 unique words occur, and only 0.04% of all possible bigrams occur. This means any word/N-gram that does not occur in the collection has zero probability under MLE, so the model says no future document can contain those unseen words/N-grams! (Figure: word frequency vs. word rank by frequency in Wikipedia, Nov 27, 2006.)
Problem with MLE: what probability should we give a word that has not been observed in the document? (Its log-likelihood contribution would be log 0!) If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of the observed words. This is the so-called "smoothing".
General idea of smoothing: all smoothing methods try to (1) discount the probability of words seen in a document, and (2) re-allocate the extra counts such that unseen words will have a non-zero count.
Illustration of language model smoothing: plotting P(w|d) against words w, the smoothed LM discounts probability from the seen words (relative to the max. likelihood estimate) and assigns nonzero probabilities to the unseen words.
Smoothing methods. Method 1: additive smoothing; add a constant to the counts of each word ("add one", Laplace smoothing):

p(w|d) = (c(w,d) + 1) / (|d| + |V|)

where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size. Problems? Hint: are all words equally important?
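A one-function sketch of add-one smoothing (the document length and vocabulary size below are hypothetical):

```python
# add-one (Laplace) smoothing: p(w|d) = (c(w,d) + 1) / (|d| + |V|)
def add_one(count_w, doc_len, vocab_size):
    return (count_w + 1) / (doc_len + vocab_size)

# a 100-word doc over a 30,000-word vocabulary (illustrative numbers)
print(add_one(10, 100, 30_000))  # seen word
print(add_one(0, 100, 30_000))   # unseen word still gets nonzero probability
```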
Add-one smoothing for bigrams: p(w_i|w_{i-1}) = (c(w_{i-1} w_i) + 1) / (c(w_{i-1}) + |V|). (Example: a bigram count matrix before and after add-one smoothing.)
After smoothing: comparing the smoothed counts with the originals shows that add-one smoothing gives too much probability mass to the unseen events.
Refine the idea of smoothing: should all unseen words get equal probabilities? We can use a reference language model to discriminate between unseen words:

p(w|d) = p_seen(w|d) if w is seen in d (the discounted ML estimate)
p(w|d) = α_d · p(w|REF) otherwise (the reference language model)

where α_d ensures the probabilities sum to one.
Smoothing methods. Method 2: absolute discounting; subtract a constant δ from the counts of each word:

p(w|d) = (max(c(w,d) − δ, 0) + δ·|d|_u · p(w|REF)) / |d|

where |d|_u is the number of unique words in d. Problems? Hint: varied document length?
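A sketch of absolute discounting, assuming illustrative values for δ and the reference probability:

```python
# absolute discounting:
# p(w|d) = (max(c(w,d) - δ, 0) + δ·|d|_u·p(w|REF)) / |d|
def absolute_discount(count_w, doc_len, uniq_words, p_ref, delta=0.7):
    return (max(count_w - delta, 0) + delta * uniq_words * p_ref) / doc_len

print(absolute_discount(10, 100, uniq_words=30, p_ref=0.001))  # seen word
print(absolute_discount(0, 100, uniq_words=30, p_ref=0.001))   # unseen word
```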
Smoothing methods. Method 3: linear interpolation (Jelinek-Mercer); "shrink" uniformly toward p(w|REF):

p(w|d) = (1 − λ) · c(w,d)/|d| + λ · p(w|REF)

combining the MLE c(w,d)/|d| with the reference model via a parameter λ ∈ (0,1). Problems? Hint: what is missing?
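A sketch of Jelinek-Mercer interpolation; note that the fixed λ is applied regardless of document length, which is exactly the hinted problem:

```python
# Jelinek-Mercer: p(w|d) = (1-λ)·c(w,d)/|d| + λ·p(w|REF)
def jelinek_mercer(count_w, doc_len, p_ref, lam=0.1):
    return (1 - lam) * count_w / doc_len + lam * p_ref

print(jelinek_mercer(10, 100, p_ref=0.001))  # seen word
print(jelinek_mercer(0, 100, p_ref=0.001))   # unseen word gets λ·p(w|REF)
```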
Smoothing methods. Method 4: Dirichlet prior / Bayesian; assume μ·p(w|REF) pseudo counts for each word w:

p(w|d) = (c(w,d) + μ·p(w|REF)) / (|d| + μ)

with parameter μ > 0. Problems?
Dirichlet prior smoothing: put a Dirichlet prior over the unigram model θ, Dir(θ | α_1, …, α_N), with "extra"/"pseudo" word counts α_i = μ·p(w_i|REF); combining this prior over models with the likelihood of the doc given the model yields the posterior.
Some background knowledge. A conjugate prior is one for which the posterior distribution is in the same family as the prior. The Dirichlet distribution is continuous, and a sample from it gives the parameters of a multinomial distribution. Conjugate pairs: Gaussian → Gaussian, Beta → Binomial, Dirichlet → Multinomial.
Dirichlet prior smoothing (cont.): the posterior distribution of the parameters is Dir(θ | c(w_1,d) + μ·p(w_1|REF), …, c(w_N,d) + μ·p(w_N|REF)), and the predictive distribution is the same as its mean:

p(w_i|d) = (c(w_i,d) + μ·p(w_i|REF)) / (|d| + μ)

which is exactly Dirichlet prior smoothing.
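A sketch of Dirichlet prior smoothing; the comparison shows how the effective amount of smoothing, μ/(|d|+μ), adapts to document length (μ=2000 is just a typical magnitude, not prescribed here):

```python
# Dirichlet prior smoothing: p(w|d) = (c(w,d) + μ·p(w|REF)) / (|d| + μ)
def dirichlet(count_w, doc_len, p_ref, mu=2000):
    return (count_w + mu * p_ref) / (doc_len + mu)

print(dirichlet(10, 100, p_ref=0.001))     # short doc: heavy smoothing
print(dirichlet(10, 10_000, p_ref=0.001))  # long doc: close to the MLE
```

Unlike Jelinek-Mercer's fixed λ, the interpolation weight here shrinks as documents get longer.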
Estimating μ using leave-one-out [Zhai & Lafferty 02]: for each word occurrence w_i in the document, hold it out and compute p(w_i | d − w_i) under the model smoothed from the remaining words; sum log p(w_i | d − w_i) over all occurrences, and choose the μ that maximizes this leave-one-out log-likelihood (a maximum likelihood estimator for μ).
Why would "leave-one-out" work? Suppose two authors each write 20 words. Author 1 reuses a small vocabulary ("abc abc ab c d d abc cd d d abd ab ab ab ab cd d e cd e"): held-out words are usually still supported by the rest of the text, so μ doesn't have to be big. Author 2 uses a diverse vocabulary ("abc abc ab c d d abe cb e f acf fb ef aff abef cdc db ge f s"): now suppose we leave "e" out; the remaining text gives it no support, so μ must be big (more smoothing). The amount of smoothing is closely related to the underlying vocabulary size.
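A small sketch of the leave-one-out criterion; a grid search over candidate μ values stands in for the Newton-style update used in the paper, and the toy corpus and uniform reference model are purely illustrative:

```python
import math
from collections import Counter

def loo_log_likelihood(docs, p_ref, mu):
    ll = 0.0
    for doc in docs:
        counts, n = Counter(doc), len(doc)
        for w, c in counts.items():
            # leave one occurrence of w out: count c-1, length n-1
            p = (c - 1 + mu * p_ref[w]) / (n - 1 + mu)
            ll += c * math.log(p)
    return ll

docs = [["a", "b", "a", "c"], ["a", "a", "d", "b"]]
vocab = {w for d in docs for w in d}
p_ref = {w: 1 / len(vocab) for w in vocab}  # uniform reference model
best_mu = max([0.1, 0.5, 1, 2, 5, 10],
              key=lambda m: loo_log_likelihood(docs, p_ref, m))
print(best_mu)
```

Singleton words (count 1) would get probability zero without smoothing, so documents full of singletons push the optimum toward larger μ.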
Recap: ranking docs by query likelihood. Estimate a doc LM θ_{d_i} for each document d_1, d_2, …, d_N, then rank by the query likelihoods p(q|θ_{d_1}), p(q|θ_{d_2}), …, p(q|θ_{d_N}). Justification: PRP.
Recap: retrieval as language model estimation. Rank by log p(q|d) = Σ_{w∈V} c(w,q) · log p(w|θ_d); the key problem is estimating the document language model p(w|θ_d).
Recap: illustration of language model smoothing. Plotting P(w|d) against words w, the smoothed LM discounts probability from the seen words (relative to the max. likelihood estimate) and assigns nonzero probabilities to the unseen words.
Recap: smoothing methods. Method 1: additive smoothing; add a constant to the counts of each word ("add one", Laplace smoothing): p(w|d) = (c(w,d) + 1) / (|d| + |V|), where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size. Problems? Hint: are all words equally important?
Recap: refine the idea of smoothing. Should all unseen words get equal probabilities? We can use a reference language model to discriminate between unseen words: p(w|d) = p_seen(w|d) (the discounted ML estimate) if w is seen in d, and p(w|d) = α_d · p(w|REF) (the reference language model) otherwise.
Recap: smoothing methods. Method 2: absolute discounting; subtract a constant δ from the counts of each word: p(w|d) = (max(c(w,d) − δ, 0) + δ·|d|_u · p(w|REF)) / |d|, where |d|_u is the number of unique words in d. Problems? Hint: varied document length?
Recap: smoothing methods. Method 3: linear interpolation (Jelinek-Mercer); "shrink" uniformly toward p(w|REF): p(w|d) = (1 − λ) · c(w,d)/|d| + λ · p(w|REF), combining the MLE with the reference model via a parameter λ. Problems? Hint: what is missing?
Recap: smoothing methods. Method 4: Dirichlet prior / Bayesian; assume μ·p(w|REF) pseudo counts for each word: p(w|d) = (c(w,d) + μ·p(w|REF)) / (|d| + μ), with parameter μ > 0. Problems?
Recap: understanding smoothing. Query = "the algorithms for data mining". Under the unsmoothed ML models, p_ML("algorithms"|d1) = p_ML("algorithms"|d2), p_ML("data"|d1) < p_ML("data"|d2), and p_ML("mining"|d1) < p_ML("mining"|d2), so intuitively d2 should have a higher score on the topical words; yet p(q|d1) > p(q|d2), because the common words "the" and "for" dominate. We should therefore make p("the") and p("for") less different across all docs, and smoothing each document model with P(w|REF) helps achieve exactly this goal.
Two-stage smoothing [Zhai & Lafferty 02]:

p(w|d) = (1 − λ) · (c(w,d) + μ·p(w|C)) / (|d| + μ) + λ · p(w|U)

Stage 1 (a Dirichlet prior, Bayesian, with the collection LM p(w|C)) explains unseen words in the document; stage 2 (a two-component mixture with the user background model p(w|U)) explains noise in the query.
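A sketch combining the two stages (the λ and μ values are illustrative; in the paper both are estimated automatically):

```python
# two-stage smoothing: Dirichlet with collection LM p(w|C) (stage 1),
# then JM interpolation with a user background model p(w|U) (stage 2)
def two_stage(count_w, doc_len, p_c, p_u, mu=2000, lam=0.1):
    dirichlet = (count_w + mu * p_c) / (doc_len + mu)
    return (1 - lam) * dirichlet + lam * p_u

print(two_stage(10, 100, p_c=0.001, p_u=0.002))
```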
Understanding smoothing: plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain

log p(q|d) = Σ_{w: c(w,d)>0} c(w,q) · log[ p_seen(w|d) / (α_d · p(w|C)) ] + |q|·log α_d + Σ_w c(w,q)·log p(w|C)

The first sum provides TF weighting (via p_seen) and IDF weighting (via the division by p(w|C)); the |q|·log α_d term provides doc length normalization (a longer doc is expected to have a smaller α_d); the last term does not depend on the document and is ignored for ranking. Smoothing with p(w|C) thus yields TF-IDF weighting plus doc-length normalization!
Smoothing & TF-IDF weighting: the retrieval formula under the general smoothing scheme uses a smoothed ML estimate for seen words and the reference language model for unseen words; the key rewriting step (where did we see it before?) separates the document-dependent part from a document-independent constant. Similar rewritings are very common when using probabilistic models for IR.
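Spelled out, the rewriting assumes the general smoothing scheme (p_seen for words seen in d, α_d·p(w|C) otherwise) and adds and subtracts the unseen-word term for the seen words:

```latex
\log p(q \mid d)
  = \sum_{w \in q} c(w,q)\,\log p(w \mid d)
  = \sum_{w \in q,\, c(w,d)>0} c(w,q)\,
      \log \frac{p_{\mathrm{seen}}(w \mid d)}{\alpha_d\, p(w \mid C)}
    \;+\; |q| \log \alpha_d
    \;+\; \sum_{w \in q} c(w,q)\, \log p(w \mid C)
```

The last sum is document-independent, so it can be dropped for ranking; the remaining terms give the TF-IDF-like weighting and length normalization described above.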
What you should know – how to estimate a language model – general idea and different ways of smoothing – effect of smoothing
Today's reading: Introduction to Information Retrieval, Chapter 12: Language models for information retrieval