IR Challenges and Language Modeling


1 IR Challenges and Language Modeling

2 IR Achievements
Search engines
  - Meta-search
  - Cross-lingual search
  - Factoid question answering
  - Filtering
Statistical approach to language
Evaluation methodology
  - Effectiveness and efficiency
The importance of users

3 Current Status
Everyone is a Web or language technology person today...
  - SIGMOD, VLDB
  - WWW
  - ACL, EMNLP
  - ICML
  - KDD
  - IJCAI, AAAI, ...
Funding agencies have declared some problems “solved”

4 Defining the Research Challenges
What are the driving forces?
What should we work on?
What are the grand challenges?
What should be funded?
cf. Asilomar Report produced by the database community

5 Language Modeling
One challenge: Defining the formal basis for IR
  - retrieval models
  - indexing models
Lots of papers, any consensus? Relationship to real systems?
Language models are an attempt to provide a different perspective for retrieval models
  - shown promise in describing a range of IR “tasks”
  - potential for better integration with other language technologies

6 Why retrieval models?
“Why do we need new retrieval models now that we have Google?”
  - Web search ≠ IR
  - Typical web queries ≠ information needs
Google shows that, for some types of queries, effective ranking can be obtained by combining an AND query with a number of other features
  - effect of scale: ranking within the top group
  - features such as links, anchor text, tagging used
Retrieval models provide frameworks for improving effectiveness in more general contexts

7 LM for IR
What is a language model?
Query-likelihood and document models
Document-likelihood and query models
KL divergence comparison of models
Other models
Applications

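8 © Victor Lavrenko, Aug. 2002 What is a Language Model?
A statistical model for generating text
  - Probability distribution over strings in a given language
For a string $w_1 w_2 \dots w_n$ generated by model $M$:
  $P(w_1 w_2 \dots w_n \mid M) = P(w_1 \mid M)\, P(w_2 \mid M, w_1) \cdots P(w_n \mid M, w_1 \dots w_{n-1})$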

9 © Victor Lavrenko, Aug. 2002 Unigram and higher-order models
Unigram Language Models
  $P(w_1 w_2 \dots w_n) = P(w_1)\, P(w_2) \cdots P(w_n)$
N-gram Language Models (e.g. bigram)
  $P(w_1 w_2 \dots w_n) = P(w_1)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$
Other Language Models
  - Grammar-based models, etc.

10 © Victor Lavrenko, Aug. 2002 The fundamental problem of LMs
Usually we don’t know the model M
  - But have a sample of text representative of that model
Estimate a language model from a sample
Then compute the observation probability
  $P(\text{observation} \mid M(\text{sample}))$
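The estimate-then-score idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is a unigram model estimated by maximum likelihood, and the function names and toy sample are hypothetical, not taken from the slides.

```python
from collections import Counter


def estimate_unigram(sample_tokens):
    """Maximum-likelihood unigram model: P(w | M) = count(w) / sample length."""
    counts = Counter(sample_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def observation_probability(tokens, model):
    """P(tokens | M), assuming words are generated independently (unigram)."""
    prob = 1.0
    for w in tokens:
        # Without smoothing, a word absent from the sample has probability zero.
        prob *= model.get(w, 0.0)
    return prob


# Toy sample (an assumption for illustration only).
sample = "a b a b a c".split()
model = estimate_unigram(sample)
p_seen = observation_probability(["a", "b"], model)
p_unseen = observation_probability(["a", "d"], model)
```

Note that any observation containing a word missing from the sample receives probability zero, which is why smoothing matters when such models are used for retrieval.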

11 Models of Text Generation
Searcher: Query Model -> Query
Writer: Doc Model -> Doc
Is this the same model?

12 Retrieval Using Language Models
Query Model -> Query
Doc Model -> Doc
Retrieval:
  - (1) Query likelihood
  - (2) Document likelihood
  - (3) Model comparison

13 Query Likelihood
P(Q|D_m)
Major issue is estimating document model
  - i.e. smoothing techniques instead of tf.idf weights
cf. Van Rijsbergen’s P(D → Q) and InQuery’s P(I|D)
Good retrieval results
  - e.g. UMass, BBN, Twente, CMU
Problems dealing with relevance feedback, query expansion, structured queries
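A minimal sketch of query-likelihood ranking follows. The slide does not say which smoothing technique is meant, so this example assumes Jelinek-Mercer (linear interpolation) smoothing with the collection model; the function name, the `lam` parameter, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection_counts, collection_len, lam=0.5):
    """log P(Q | D_m) with Jelinek-Mercer smoothing (an assumed choice).

    P(w | D_m) = (1 - lam) * tf(w, D) / |D| + lam * cf(w) / |C|
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for w in query:
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0
        p_coll = collection_counts[w] / collection_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The word occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection (an assumption for illustration only).
docs = {
    "d1": "language models for retrieval".split(),
    "d2": "web search engines rank pages".split(),
}
collection_counts = Counter(w for d in docs.values() for w in d)
collection_len = sum(collection_counts.values())

query = "language retrieval".split()
scores = {
    name: query_log_likelihood(query, doc, collection_counts, collection_len)
    for name, doc in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Thanks to the collection term, a document that contains none of the query words still gets a finite score, so it is ranked low rather than excluded.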

14 Document Likelihood
Rank by likelihood ratio P(D|R)/P(D|N)
  - treat as a generation problem
  - P(w|R) is estimated by P(w|Q_m)
  - Q_m is the query or relevance model
  - P(w|N) is estimated by collection probabilities P(w)
Issue is estimation of query model
  - Treat query as generated by mixture of topic and background
  - Estimate relevance model from related documents (query expansion)
  - Relevance feedback is easily incorporated
Good retrieval results
  - e.g. UMass at SIGIR 01
  - inconsistent with heterogeneous document collections
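Written out, the likelihood-ratio ranking might look like the following. This is a sketch that assumes the words of D are generated independently (a unigram model), which the slide implies but does not state; $n(w, D)$, the count of $w$ in $D$, is notation introduced here.

```latex
\frac{P(D \mid R)}{P(D \mid N)}
  = \prod_{w \in D} \left( \frac{P(w \mid R)}{P(w \mid N)} \right)^{n(w, D)}
  \approx \prod_{w \in D} \left( \frac{P(w \mid Q_m)}{P(w)} \right)^{n(w, D)}
\qquad\Longrightarrow\qquad
\log \frac{P(D \mid R)}{P(D \mid N)}
  \approx \sum_{w \in D} n(w, D) \, \log \frac{P(w \mid Q_m)}{P(w)}
```

In this form, words that are more probable under the query model than in the collection raise a document's score, and words that are less probable lower it.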

15 Model Comparison
Estimate query and document models and compare
Obvious measure is KL divergence D(Q_m || D_m)
  - equivalent to query-likelihood approach if simple empirical distribution used for query model
More general risk minimization framework has been proposed
  - Zhai and Lafferty
Consistently better results than query-likelihood or document-likelihood approaches
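The model-comparison approach, and the equivalence noted on this slide, can be illustrated with a short sketch. The document models below are hypothetical, already-smoothed distributions chosen for illustration; with the empirical query model, $-D(Q_m \| D_m) = \frac{1}{|Q|} \sum_{q \in Q} \log P(q \mid D_m) + H(Q_m)$, and the entropy term $H(Q_m)$ does not depend on the document, so both rankings agree.

```python
import math
from collections import Counter


def empirical_model(tokens):
    """Simple empirical distribution: relative frequency of each token."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in counts.items()}


def kl_divergence(query_model, doc_model):
    """D(Q_m || D_m) = sum_w P(w | Q_m) * log(P(w | Q_m) / P(w | D_m))."""
    total = 0.0
    for w, p_q in query_model.items():
        if p_q == 0.0:
            continue
        p_d = doc_model.get(w, 0.0)
        if p_d == 0.0:
            return float("inf")  # the document model must be smoothed
        total += p_q * math.log(p_q / p_d)
    return total


query = ["language", "retrieval"]
query_model = empirical_model(query)

# Hypothetical smoothed document models (illustrative numbers only).
doc_models = {
    "d1": {"language": 0.3, "retrieval": 0.3, "web": 0.4},
    "d2": {"language": 0.1, "retrieval": 0.1, "web": 0.8},
}

# Smaller divergence means a better match.
kl_ranking = sorted(doc_models, key=lambda d: kl_divergence(query_model, doc_models[d]))

# Query-likelihood ranking over the same models, for comparison.
ql_ranking = sorted(
    doc_models,
    key=lambda d: sum(math.log(doc_models[d][w]) for w in query),
    reverse=True,
)
```

The two rankings can differ once the query model is something richer than the empirical distribution, for example a model expanded with terms from related documents.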

16 Other Approaches
HMMs (BBN)
Probabilistic Latent Semantic Indexing (Hofmann)
  - assume documents are generated by a mixture of “aspect” models
  - estimation more difficult
Translation model (Berger and Lafferty)

17 Applications
CLIR
TDT
Novelty and redundancy
Links
Distributed retrieval
QA
Filtering
Summarization

18 The Future of IR and LM

