Slide 1: Empirical Development of an Exponential Probabilistic Model: Using Textual Analysis to Build a Better Model
Jaime Teevan & David R. Karger, CSAIL (LCS+AI), MIT
Slide 2: Goal: Better Generative Model
- Generative vs. discriminative models
- Applies to many applications: information retrieval (IR), relevance feedback, using unlabeled data, classification
- Assumptions are explicit
Slide 3: Using a Model for IR
1. Define model
2. Learn parameters from query
3. Rank documents
- Hyper-learn a better model; improvements trickle down to retrieval, classification, relevance feedback, …
- Corpus-specific models
Slide 4: Overview
- Related work
- Probabilistic models; example: the Poisson Model
- Compare model to text
- Hyper-learning the model: exponential framework
- Investigate retrieval performance
- Conclusion and future work
Slide 5: Related Work
- Using text for retrieval algorithms [Jones, 1972], [Greiff, 1998]
- Using text to model text [Church & Gale, 1995], [Katz, 1996]
- Learning model parameters [Zhai & Lafferty, 2002]
- Here: hyper-learn the model itself from text!
Slide 6: Probabilistic Models
- Rank documents by relevance value RV = Pr(rel|d)
- Naïve Bayesian models
Slide 7: Probabilistic Models
- Rank documents by RV = Pr(rel|d)
- Naïve Bayesian models factor the document probability: Pr(d|rel) = ∏_(features t) Pr(d_t|rel)
- Open assumptions define the model:
  - Feature definition (here: words, with d_t = # occurrences in the document)
  - Feature distribution family
Slide 8: Using a Naïve Bayesian Model
1. Define model
2. Learn parameters from query
3. Rank documents
Slide 9: Using a Naïve Bayesian Model
1. Define model: Pr(d_t|rel) = ?
2. Learn parameters from query
3. Rank documents
Slide 10: Using a Naïve Bayesian Model
1. Define model (Poisson Model): Pr(d_t|rel) = θ^d_t e^-θ / d_t!
2. Learn parameters from query
3. Rank documents
- θ specifies the term's distribution
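The Poisson term model above is one line of code; a minimal sketch (the θ value matches the example on the next slide):

```python
import math

def poisson_pmf(d_t, theta):
    """Pr(d_t | rel) under the Poisson Model: theta^d_t * e^-theta / d_t!."""
    return (theta ** d_t) * math.exp(-theta) / math.factorial(d_t)

# With a tiny rate like theta = 0.0006, zero occurrences take almost all
# of the probability mass, and large counts are vanishingly unlikely:
print(poisson_pmf(0, 0.0006))   # ≈ 0.9994
print(poisson_pmf(15, 0.0006))  # far below 1e-15
```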
Slide 11: Example Poisson Distribution
[Plot: Pr(d_t|rel) vs. "term occurs exactly d_t times" for θ = 0.0006; tail probability ≈ 1E-15, with an observed data point marked]
Slide 12: Using a Naïve Bayesian Model
1. Define model
2. Learn parameters from query: learn a θ for each term
3. Rank documents
- Maximum-likelihood θ is the term's average number of occurrences
- Easy to incorporate prior expectations
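Since the maximum-likelihood θ is just the term's average occurrence count, the learning step is tiny; a sketch (the pseudo-count prior below is an illustrative assumption — the slide only says priors are easy to incorporate):

```python
def learn_theta(counts, prior_theta=0.0, prior_weight=0.0):
    """ML estimate of a term's Poisson rate: its average number of
    occurrences over the labeled documents, optionally blended with a
    pseudo-count prior (prior_weight imaginary docs at rate prior_theta)."""
    return (sum(counts) + prior_theta * prior_weight) / (len(counts) + prior_weight)

# Term appears 0, 1, 0, 2 times in four relevant documents:
print(learn_theta([0, 1, 0, 2]))                                   # 0.75
print(learn_theta([0, 1, 0, 2], prior_theta=0.5, prior_weight=4))  # pulled toward 0.5
```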
Slide 13: Using a Naïve Bayesian Model
1. Define model
2. Learn parameters from query
3. Rank documents
Slide 14: Using a Naïve Bayesian Model
1. Define model
2. Learn parameters from query
3. Rank documents: for each document, find RV = ∏_(words t) Pr(d_t|rel); sort documents by RV
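The ranking step multiplies many small probabilities, so in practice one sums logs; a toy sketch under the Poisson model (the documents, terms, and rates are all made up for illustration):

```python
import math

def poisson_log_pmf(d_t, theta):
    # log of theta^d_t * e^-theta / d_t!
    return d_t * math.log(theta) - theta - math.lgamma(d_t + 1)

def rank_documents(docs, theta):
    """docs: {doc_id: {term: count}}, theta: {term: learned rate}.
    Score each document by log RV = sum over terms of log Pr(d_t|rel),
    then sort descending."""
    def log_rv(doc_id):
        return sum(poisson_log_pmf(c, theta[t]) for t, c in docs[doc_id].items())
    return sorted(docs, key=log_rv, reverse=True)

docs = {"d1": {"model": 3, "text": 1}, "d2": {"model": 0, "text": 0}}
theta = {"model": 3.0, "text": 1.0}
print(rank_documents(docs, theta))  # ['d1', 'd2']
```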
Slide 15: Using a Naïve Bayesian Model
1. Define model
2. Learn parameters from query
3. Rank documents: for each document, find RV = ∏_(words t) Pr(d_t|rel); sort documents by RV
- Which step goes wrong?
Slide 16: Using a Naïve Bayesian Model
1. Define model
2. Learn parameters from query
3. Rank documents
Slide 17: Using a Naïve Bayesian Model
1. Define model: Pr(d_t|rel) = θ^d_t e^-θ / d_t!
2. Learn parameters from query
3. Rank documents
Slide 18: How Good is the Model?
[Plot: Poisson fit with θ = 0.0006, Pr(d_t|rel) vs. "term occurs exactly d_t times"; an observed data point at 15 occurrences is marked]
Slide 19: How Good is the Model?
[Same plot: the observed point at 15 occurrences lies far above the Poisson prediction] Misfit!
Slide 20: Hyper-learning a Better Fit Through Textual Analysis Using an Exponential Framework
Slide 21: Hyper-Learning Framework
- Need a framework for hyper-learning
- Candidates: Bernoulli, Poisson, Normal, mixtures
Slide 22: Hyper-Learning Framework
- Need a framework for hyper-learning
- Goal: same benefits as the Poisson Model: one parameter, easy to work with (e.g., priors)
- One-parameter exponential families cover Bernoulli, Poisson, and Normal (mixtures fall outside)
Slide 23: Exponential Framework
- Pr(d_t|rel) = f(d_t) g(θ) e^(θ·h(d_t))
- Functions f(d_t) and h(d_t) specify the family; e.g., Poisson: f(d_t) = (d_t!)^-1, h(d_t) = d_t
- The parameter θ specifies the term's distribution
- Well understood; learning is easy [Bernardo & Smith, 1994], [Gous, 1998]
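A quick sanity check that the Poisson really is a member of this family: with f(d_t) = 1/d_t! and h(d_t) = d_t, the form above forces θ to be the log of the Poisson rate, and the normalizer g(θ) can be computed numerically (a sketch; the truncation bound is an arbitrary choice):

```python
import math

def expfam_pmf(d_t, theta, f, h, max_count=100):
    """Pr(d_t|rel) = f(d_t) g(theta) e^(theta * h(d_t)), with the normalizer
    g(theta) obtained by summing over counts 0..max_count (truncation)."""
    z = sum(f(k) * math.exp(theta * h(k)) for k in range(max_count + 1))
    return f(d_t) * math.exp(theta * h(d_t)) / z

# Poisson instance: f(d_t) = 1/d_t!, h(d_t) = d_t. Since
# rate^d_t = e^(d_t * log(rate)), the parameter is theta = log(rate).
rate = 2.0
p = expfam_pmf(3, math.log(rate),
               f=lambda k: 1.0 / math.factorial(k), h=lambda k: k)
print(p)  # ≈ 0.1804 = 2^3 e^-2 / 3!
```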
Slide 24: Using a Hyper-learned Model
1. Define model
2. Learn parameters from query
3. Rank documents
Slide 25: Using a Hyper-learned Model
1. Hyper-learn model
2. Learn parameters from query
3. Rank documents
Slide 26: Using a Hyper-learned Model
1. Hyper-learn model: want the "best" f(d_t) and h(d_t)
2. Learn parameters from query
3. Rank documents
- Iterative hill climbing (finds a local maximum), with the Poisson as the starting point
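The hill-climbing step can be sketched as greedy perturbation of a tabulated h. This is only an illustrative caricature, not the authors' actual procedure: the slide does not spell out the objective or parameterization, and f is folded away here for brevity:

```python
import math, random

def log_likelihood(observed, h, theta=1.0):
    """Log-likelihood of observed counts k under Pr(k) ∝ e^(theta * h[k])
    (f(d_t) dropped for simplicity -- an illustrative shortcut)."""
    z = sum(math.exp(theta * hk) for hk in h)
    return sum(theta * h[k] - math.log(z) for k in observed)

def hill_climb_h(observed, h0, steps=2000, step_size=0.05, seed=0):
    """Greedy hill climbing: nudge one entry of the table h, keep the
    change only if the training likelihood improves. Finds a local
    maximum, so the (Poisson-like) starting point h0 matters."""
    rng = random.Random(seed)
    h, best = list(h0), log_likelihood(observed, h0)
    for _ in range(steps):
        trial = list(h)
        trial[rng.randrange(len(h))] += rng.choice([-step_size, step_size])
        ll = log_likelihood(observed, trial)
        if ll > best:
            h, best = trial, ll
    return h

observed = [0] * 40 + [1] * 8 + [2] * 2      # heavy-tailed toy counts
h = hill_climb_h(observed, h0=[0, 1, 2, 3, 4])  # Poisson-like start: h(k) = k
```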
Slide 27: Using a Hyper-learned Model
1. Hyper-learn model
2. Learn parameters from query
3. Rank documents
- Data: TREC query result sets; past queries used to learn about future queries
- Hyper-learn and test with different sets
Slide 28: Recall the Poisson Distribution
[Plot: Pr(d_t|rel) vs. "term occurs exactly d_t times"; observed data point at 15 occurrences]
Slide 29: Poisson Starting Point: h(d_t)
- Pr(d_t|rel) = f(d_t) g(θ) e^(θ·h(d_t))
[Plot: h(d_t) vs. d_t for the Poisson starting point]
30
h(dt)h(dt) dtdt Hyper-learned Model - h(d t ) + Pr(d t |rel) = f(d t ) g( θ ) e θ h(dt)θ h(dt)
Slide 31: Poisson Distribution
[Plot: Pr(d_t|rel) vs. "term occurs exactly d_t times"; observed data point at 15 occurrences]
Slide 32: Hyper-learned Distribution
[Plot: Pr(d_t|rel) vs. "term occurs exactly d_t times"; observed data point at 15 occurrences]
Slide 33: Hyper-learned Distribution
[Same plot: observed data point at 5 occurrences]
Slide 34: Hyper-learned Distribution
[Same plot: observed data point at 30 occurrences]
Slide 35: Hyper-learned Distribution
[Same plot: observed data point at 300 occurrences]
Slide 36: Performing Retrieval
1. Hyper-learn model
2. Learn parameters from query
3. Rank documents
Slide 37: Performing Retrieval
1. Hyper-learn model
2. Learn parameters from query: learn θ for each term from labeled documents, with Pr(d_t|rel) = f(d_t) g(θ) e^(θ·h(d_t))
3. Rank documents
Slide 38: Learning θ
- Sufficient statistics summarize all observed data:
  - τ_1: # of observations
  - τ_2: Σ h(d_t) over the observations
- Incorporating a prior is easy
- Map τ_1 and τ_2 → θ
- Experiments use 20 labeled documents
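The two sufficient statistics on this slide are trivial to accumulate; a minimal sketch:

```python
def sufficient_stats(observed_counts, h):
    """tau1 = number of observations; tau2 = sum of h(d_t) over the
    observations. These two numbers summarize everything the exponential
    model needs from the labeled documents."""
    tau1 = len(observed_counts)
    tau2 = sum(h(d) for d in observed_counts)
    return tau1, tau2

# For the Poisson member (h(d_t) = d_t), tau2 / tau1 recovers the
# familiar ML rate estimate: the term's average occurrence count.
tau1, tau2 = sufficient_stats([0, 1, 0, 2, 3], h=lambda d: d)
print(tau1, tau2, tau2 / tau1)  # 5 6 1.2
```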
Slide 39: Performing Retrieval
1. Hyper-learn model
2. Learn parameters from query
3. Rank documents
Slide 40: Results: Labeled Documents
[Recall-precision plot]
Slide 41: Results: Labeled Documents
[Recall-precision plot]
Slide 42: Performing Retrieval
1. Hyper-learn model
2. Learn parameters from query: short query
3. Rank documents
Slide 43: Retrieval: Query
- Query = single labeled document
- Vector-space-like equation: RV = Σ_(t in doc) a(t, d) + Σ_(q in query) b(q, d)
- Problem: the document part dominates
- Solution: use only the query portion
- Another solution: normalize
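The decomposition on this slide can be sketched abstractly. Here a(t, d) and b(q, d) stand in for the model-derived per-term scores, whose exact forms the slide does not give, so the toy scoring functions below are purely hypothetical:

```python
def rv(doc_terms, query_terms, a, b, query_only=False):
    """Vector-space-like relevance value: a sum over all document terms
    plus a sum over query terms. The document-wide part dominates when
    the query is short, so query_only=True keeps only the query portion
    (the slide's first fix; normalizing is the alternative)."""
    doc_part = 0.0 if query_only else sum(a(t, doc_terms) for t in doc_terms)
    query_part = sum(b(q, doc_terms) for q in query_terms)
    return doc_part + query_part

# Toy scoring functions (illustrative assumptions, not the paper's):
a = lambda t, d: 1.0
b = lambda q, d: 2.0 if q in d else 0.0
doc = {"model": 1, "text": 1}
print(rv(doc, ["model", "ranking"], a, b))                   # 2.0 + 2.0 = 4.0
print(rv(doc, ["model", "ranking"], a, b, query_only=True))  # 2.0
```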
Slide 44: Retrieval: Query
[Recall-precision plot]
Slide 45: Retrieval: Query
[Recall-precision plot]
Slide 46: Retrieval: Query
[Recall-precision plot]
Slide 47: Conclusion
- Probabilistic models; example: the Poisson Model (easy to work with, but a bad text model: text is heavy tailed!)
- Hyper-learning the model in the exponential framework learned a better model
- Investigated retrieval performance
Slide 48: Future Work
- Use the model better: correct for document length; hyper-learn the model better
- Use it for other applications: other IR applications, classification
- Hyper-learn on different corpora: test whether the learned model generalizes; does it differ by genre? Language? People?
Slide 49: Questions?
Contact us with questions:
Jaime Teevan, teevan@ai.mit.edu
David Karger, karger@theory.lcs.mit.edu