Slide 1: A Hidden Markov Model Information Retrieval System
Mahboob Alam Khalid
Language Model based Information Retrieval, University of Saarland
Slide 2: Overview
- Motivation
- Hidden Markov Model (introduction)
- HMM for information retrieval
  - Probability model
  - Baseline system
  - Experiments
- HMM refinements
  - Blind feedback
  - Bigrams
  - Query section weighting
  - Document priors
- Conclusion
Slide 3: Motivation
- Hidden Markov models have been applied successfully to:
  - Speech recognition
  - Named entity finding
  - Optical character recognition
  - Topic identification
  - Ad hoc information retrieval (now)
Slide 4: Hidden Markov Model (Introduction)
- You see a sequence of observations (words), but not the sequence of states that generated them; an HMM is a model for exactly this situation.
- Two kinds of probabilities are involved (a toy example follows):
  - Transition probabilities: jumping from one state to the others, summing to 1 per state.
  - Emission probabilities: producing observations from a state, summing to 1 per state.
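As a minimal illustration (toy numbers, not from the slides), here are the two kinds of distributions for a two-state model of the sort used later for retrieval:

```python
# Toy HMM parameters: two states, three output words. The numbers are
# illustrative only; nothing here is estimated from data.
transition = {
    "Document":       {"Document": 0.7, "GeneralEnglish": 0.3},
    "GeneralEnglish": {"Document": 0.4, "GeneralEnglish": 0.6},
}
emission = {
    "Document":       {"nixon": 0.5, "house": 0.3, "very": 0.2},
    "GeneralEnglish": {"nixon": 0.1, "house": 0.2, "very": 0.7},
}

# Each state's transition probabilities and each state's emission
# probabilities must sum to 1.
for state in transition:
    assert abs(sum(transition[state].values()) - 1.0) < 1e-9
    assert abs(sum(emission[state].values()) - 1.0) < 1e-9
```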
Slide 5: A discrete HMM
- Components:
  - A set of output symbols
  - A set of states
  - A set of transitions between states
  - A probability distribution over output symbols for each state
- Observed sampling process (sketched below):
  1. Start from some initial state.
  2. Transition from it to another state.
  3. Sample from the output distribution at that state.
  4. Repeat these steps.
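A sketch of this sampling loop, reusing the toy `transition` and `emission` tables from the previous example (the start state and step count are arbitrary):

```python
import random

def sample_hmm(transition, emission, start_state, n_steps, seed=0):
    """Generate an observation sequence: transition, emit, repeat."""
    rng = random.Random(seed)
    state, output = start_state, []
    for _ in range(n_steps):
        # Transition from the current state to the next one.
        next_states, probs = zip(*transition[state].items())
        state = rng.choices(next_states, weights=probs)[0]
        # Sample one output symbol from that state's distribution.
        words, word_probs = zip(*emission[state].items())
        output.append(rng.choices(words, weights=word_probs)[0])
    return output

print(sample_hmm(transition, emission, "GeneralEnglish", 5))
```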
Slide 6: HMM for Information Retrieval
- Observed data: the query Q.
- Unknown key: the relevant document D.
- Noisy channel: the mind of the user, which transforms an imagined notion of the document into the text of Q.
- The quantity of interest is P(D is R | Q): the probability that D is relevant in the user's mind, given that Q was the query produced.
Slide 7: Probability Model
- By Bayes' rule:

    P(D is R | Q) = P(Q | D is R) · P(D is R) / P(Q)

  where P(D is R) is the prior probability of relevance.
- Output symbols: the union of all words in the corpus.
- States: the mechanisms of query word generation:
  - Document
  - General English
- The model structure is identical for all documents.
Slide 8: A simple two-state HMM
- [Figure: from "query start", transition a_0 leads to the General English state, which emits with P(q | GE), and a_1 leads to the Document state, which emits with P(q | D); the process loops until "query end".]
- The choice of which kind of word to generate next is independent of the previous such choice.
Slide 9: Why simplify the parameters?
- In principle there is one HMM per document, and EM could compute its parameters, but that would need training samples: documents paired with training queries, which are not available.
- Instead, the emission distributions are estimated directly:

    P(q | D_k) = (number of times q appears in D_k) / (length of D_k)

    P(q | GE) = (Σ_k number of times q appears in D_k) / (Σ_k length of D_k)

- Documents are then ranked by the query likelihood (a scoring sketch follows):

    P(Q | D_k is R) = ∏_{q ∈ Q} ( a_0 · P(q | GE) + a_1 · P(q | D_k) )
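A minimal sketch of this scorer, assuming the maximum-likelihood counts above; the mixture weight a_1 = 0.3 is a placeholder, since the slides leave it to be estimated later:

```python
import math
from collections import Counter

def score_document(query, doc_words, corpus_counts, corpus_len, a1=0.3):
    """log P(Q | D_k is R) under the two-state mixture. Assumes every
    query word occurs somewhere in the corpus, so P(q | GE) > 0."""
    a0 = 1.0 - a1
    doc_counts, doc_len = Counter(doc_words), len(doc_words)
    log_p = 0.0
    for q in query:
        p_ge = corpus_counts[q] / corpus_len   # P(q | GE)
        p_dk = doc_counts[q] / doc_len         # P(q | D_k)
        log_p += math.log(a0 * p_ge + a1 * p_dk)
    return log_p

corpus = ["nixon", "visited", "the", "white", "house", "the", "press", "slept"]
print(score_document(["nixon", "house"], ["nixon", "white", "house"],
                     Counter(corpus), len(corpus)))
```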
Slide 10: Baseline System Performance
- 50 test queries; TREC-6 and TREC-7 test collections:
  - TREC-6: 556,077 documents (news and government agencies); queries average 26.5 unique terms.
  - TREC-7: 528,155 documents; queries average 17.6 unique terms.
- An inverted index is created, storing term frequency (tf) values.
- Preprocessing (sketched below):
  - Case is ignored.
  - Porter stemmer.
  - 397 stop words are replaced with the special token *STOP*.
  - Similarly, 4-digit strings are replaced by *YEAR* and other digit strings by *NUMBER*.
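A sketch of that preprocessing, assuming NLTK's Porter stemmer and a stand-in stop list (the actual list has 397 entries):

```python
import re
from nltk.stem import PorterStemmer  # any Porter stemmer implementation works

STOP_WORDS = {"the", "of", "and", "a", "to"}  # stand-in for the 397-word list
stemmer = PorterStemmer()

def normalize(token):
    token = token.lower()                  # ignore case
    if re.fullmatch(r"\d{4}", token):
        return "*YEAR*"                    # 4-digit strings
    if re.fullmatch(r"\d+", token):
        return "*NUMBER*"                  # other digit strings
    if token in STOP_WORDS:
        return "*STOP*"                    # all stop words map to one token
    return stemmer.stem(token)             # Porter stemming

print([normalize(t) for t in "The 1972 election saw 47 states".split()])
# ['*STOP*', '*YEAR*', 'elect', 'saw', '*NUMBER*', 'state']
```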
Slide 11: TF.IDF model
- [Figure: the tf.idf baseline model.]
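The slide itself is a figure. For reference, one common tf.idf weighting (a generic stand-in; the slides do not preserve the paper's exact baseline formula):

```python
import math

def tfidf_score(query, doc_counts, doc_freq, n_docs):
    """Sum over query words of tf(q, D) * log(N / df(q)). This is one
    standard variant, not necessarily the exact baseline in the paper."""
    score = 0.0
    for q in query:
        df = doc_freq.get(q, 0)
        if df == 0:
            continue                       # unseen terms contribute nothing
        score += doc_counts.get(q, 0) * math.log(n_docs / df)
    return score
```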
Slide 12: Non-interpolated average precision
- [Results figure.]
Slide 13: HMM Refinements
- Blind feedback: a well-known technique for enhancing performance.
- Bigrams: some words have a distinctive meaning when used in the context of another word, e.g. "white house", "Pope John Paul II".
- Query section weighting: some portions of the query are more important than others.
- Document priors: longer documents are more informative than short ones.
Slide 14: Blind Feedback
- Construct a new query based on the top-ranked documents, as in the Rocchio algorithm (a sketch follows).
- Intuition: if a word occurs in 90% of the top N retrieved documents, a word like "very" is still not informative, while a word like "Nixon" is highly informative.
- a_0 and a_1 can be estimated with the EM algorithm from training queries.
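A minimal Rocchio-style expansion, with the negative (non-relevant) term dropped as blind feedback usually does; the weights alpha and beta are conventional placeholders:

```python
from collections import Counter

def rocchio_expand(query_vec, top_docs, alpha=1.0, beta=0.75):
    """Move the query vector toward the centroid of the top-ranked
    documents, which blind feedback simply assumes to be relevant."""
    expanded = Counter({t: alpha * w for t, w in query_vec.items()})
    for doc_vec in top_docs:
        for term, w in doc_vec.items():
            expanded[term] += beta * w / len(top_docs)
    return expanded

# "berlin" enters the expanded query; bland terms get small weight.
print(rocchio_expand({"germany": 1.0},
                     [{"berlin": 2.0, "germany": 1.0, "very": 0.1}]).most_common(3))
```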
Slide 15: Estimating a_1
- Notation in equation (5) of the paper:
  - Q′ = a general query; q′ = a general query word (the presenter was unsure of this reading)
  - Q_i = one training query; Q = the set of available training queries
  - I_{m,Q_i} = the top m documents retrieved for Q_i
  - df(w) = the document frequency of w
- Negative values are avoided by taking the floor of the estimate.
- Example: for Q_i = "Germany", a word like "Berlin" occurs unusually often in the top-ranked documents I_{1,Q_i}, …, I_{m,Q_i}. (A guess at the estimator's shape is sketched below.)
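The equation itself is not preserved in this transcript. Purely as a guess at its shape implied by the bullets above (excess occurrence in the top-m documents over the background rate, floored to avoid negative values), and explicitly not the paper's equation (5):

```python
def estimate_a1(word, top_docs, df, n_docs, floor=0.0):
    """Hypothetical reconstruction: how much more often `word` occurs in
    the top-m retrieved documents than its corpus document frequency
    predicts. Floored, as the slide notes, to avoid negative estimates."""
    m = len(top_docs)
    observed = sum(1 for doc in top_docs if word in doc) / m  # rate in top m
    expected = df[word] / n_docs                              # background rate
    return max(observed - expected, floor)
```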
Slide 16: Performance gained
- [Results figure.]
Slide 17: Bigrams
- [Figure: the bigram-extended model.]
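The slide's content is a figure. As a sketch of the usual extension (the exact mixture form here is an assumption), a third term a_2 · P(q_n | q_{n-1}, D_k) is added for word pairs seen in the document:

```python
import math
from collections import Counter

def bigram_score(query, doc_words, corpus_counts, corpus_len,
                 a0=0.3, a1=0.5, a2=0.2):
    """Two-state mixture plus a document bigram term. The weights are
    placeholders; assumes every query word occurs in the corpus."""
    unigrams = Counter(doc_words)
    bigrams = Counter(zip(doc_words, doc_words[1:]))
    log_p, prev = 0.0, None
    for q in query:
        p = a0 * corpus_counts[q] / corpus_len + a1 * unigrams[q] / len(doc_words)
        if prev is not None and unigrams[prev] > 0:
            p += a2 * bigrams[(prev, q)] / unigrams[prev]  # P(q | prev, D_k)
        log_p += math.log(p)
        prev = q
    return log_p
```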
Slide 18: Query Section Weighting
- In the TREC evaluation, the title section of a topic is more important than the others.
- v_s(q) = the weight for the section of the query in which q occurs: v_desc = 1.2, v_narr = 1.9, v_title = 5.7 (applied as sketched below).
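A sketch of one natural way to fold these weights in: raise each word's mixture probability to the power v_s(q), i.e. scale its log term. Whether the paper uses exactly this form is an assumption:

```python
import math

SECTION_WEIGHT = {"title": 5.7, "desc": 1.2, "narr": 1.9}  # from the slide

def weighted_log_score(query, p_mix):
    """`query` is a list of (word, section) pairs; `p_mix(word)` is the
    word's two-state mixture probability. Each log term is scaled by the
    weight of the section the word came from."""
    return sum(SECTION_WEIGHT[sec] * math.log(p_mix(w)) for w, sec in query)

# With equal mixture probabilities, a title word counts 5.7/1.2 ≈ 4.75x
# as much as a description word.
print(weighted_log_score([("nixon", "title"), ("career", "desc")],
                         lambda w: 0.05))
```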
Slide 19: Document Priors
- A refereed journal may be more informative than a supermarket tabloid.
- Most predictive features:
  - Source
  - Length
  - Average word length
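A hypothetical sketch of how such a prior could enter the ranking; the log-linear form and the weights are illustrative assumptions built from the three features the slide names, not the paper's model:

```python
import math

def log_prior(source_weight, doc_len, avg_word_len, w_len=0.1, w_awl=-0.05):
    """Hypothetical log P(D is R) from source, length, and average word
    length. Feature weights here are made up for illustration."""
    return source_weight + w_len * math.log(doc_len) + w_awl * avg_word_len

def ranked_score(log_likelihood, prior):
    # By Bayes' rule, log P(Q | D is R) + log P(D is R) ranks documents
    # the same way as log P(D is R | Q).
    return log_likelihood + prior
```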
Slide 20: Conclusion
- A novel method in IR using HMMs, which offer a rich setting for incorporating new and familiar techniques.
- Experiments with a system that implements:
  - Blind feedback
  - Bigram modeling
  - Query section weighting
  - Document priors
- Future work: the HMM can be extended to accommodate:
  - Passage retrieval
  - Explicit synonym modeling
  - Concept modeling
Slide 21: Resources
- D. Miller, T. Leek, R. Schwartz. "A Hidden Markov Model Information Retrieval System." SIGIR '99, Berkeley, CA, USA.
- L. Rabiner. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE 77(2), pp. 257-286.
Slide 22: Thank you very much! Questions?