1 Towards Information Retrieval with More Inferential Power
Jian-Yun Nie
Department of Computer Science, University of Montreal
nie@iro.umontreal.ca
2 Background
IR goal: retrieve relevant information from a large collection of documents to satisfy the user's information need.
Traditional relevance: for a query Q and a document D in a given corpus, Score(Q, D)
- User-independent
- Knowledge-independent
- Independent of all contextual factors
Expected relevance: also depends on the user (U) and the context (C): Score(Q, D, U, C)
- Requires reasoning with contextual information
- Several existing approaches in IR can be viewed as simple inference
- We have to consider more complex inference
3 Overview
Introduction: Current Approaches to IR
Inference using term relations
A General Model to Integrate Contextual Factors
Constructing and Using Domain Models
Conclusion and Future Work
4 Traditional Methods in IR
Each query term t matches a list of documents: t → {…, D, …}
The final answer list is obtained by combining the lists of all query terms, e.g.:
- Vector space model: Score(Q, D) = Σ_t w(t, Q) · w(t, D)
- Language model: Score(Q, D) = P(Q|D) = Π_{t ∈ Q} P(t|D)
Two implicit assumptions:
- The information need is fully specified by the query terms
- Query terms are independent
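The query-likelihood score above can be made concrete with a small sketch; the Jelinek-Mercer smoothing weight and the token-list inputs are illustrative assumptions, not details given on the slide:

```python
import math
from collections import Counter

def lm_score(query_terms, doc_terms, coll_terms, lam=0.5):
    """log P(Q|D): each query term is generated independently from the document
    model, interpolated with the collection model (Jelinek-Mercer smoothing)."""
    doc_tf, coll_tf = Counter(doc_terms), Counter(coll_terms)
    doc_len, coll_len = len(doc_terms), len(coll_terms)
    score = 0.0
    for t in query_terms:
        p_d = doc_tf[t] / doc_len if doc_len else 0.0
        p_c = coll_tf[t] / coll_len if coll_len else 0.0
        # The two implicit assumptions above: only the query terms matter, and
        # they are independent, so their (smoothed) probabilities simply multiply.
        score += math.log(lam * p_d + (1 - lam) * p_c + 1e-12)
    return score
```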
5 Reality
A term is only one of the possible expressions of a meaning: synonyms, related terms.
A query is only a partial specification of the user's information need.
Many words are omitted from the query, e.g. "Java hotel": hotel booking on Java island, …
How can we make the query more complete?
6 Dealing with relations between terms
Previous methods try to enhance the query:
- Query expansion (add related terms)
  - Thesauri: WordNet, HowNet
  - Statistical co-occurrence: two terms that often co-occur in the same context
  - Pseudo-relevance feedback: top-ranked documents retrieved with the original query
- User profile, background, preferences … (a set of background terms)
  - Used to re-rank the documents
  - Equivalent to a query expansion
7 Question
Are these related to inference?
How to perform inference in IR in general?
LM as a tool for implementing logical IR
8 Overview
Introduction: Current Approaches to IR
Inference using term relations
A General Model to Integrate Contextual Factors
Constructing and Using Domain Models
Conclusion and Future Work
9 What is logical IR?
Key: inference – infer the query from the document
D: Tsunami
Q: natural disaster
D → Q ?
10 Using knowledge to make inference in IR
K | D → Q  (infer D → Q given the knowledge K)
K: general knowledge
- No knowledge
- Thesauri
- Co-occurrence
- …
K: user knowledge
- Characterizes the knowledge of a particular user
11 Simple inference – the core of logical IR
Logical deduction: (A → B) ∧ (B → C) ⇒ (A → C)
In IR:
- (D → Q') ∧ (Q' → Q) ⇒ (D → Q)   [document matching + inference on the query]
- (D → D') ∧ (D' → Q) ⇒ (D → Q)   [inference on the document + document matching]
12 Is language modeling a reasonable framework?
1. Basic generative model: P(Q|D) ~ P(D → Q)
Current smoothing: e.g. for D = Tsunami, P_ML(natural disaster|D) = 0 is changed to P(natural disaster|D) > 0
But this is not inference: smoothing also gives P(computer|D) > 0, just as it gives P(natural disaster|D) > 0
13 Effect of smoothing?
Doc: Tsunami, ocean, Asia, …
Smoothing ≠ inference: probability mass is redistributed uniformly or according to the collection, also to unrelated terms.
(Figure: smoothed probabilities over Tsunami, ocean, Asia, computer, nat. disaster, …)
14 Expected effect
Using the relation Tsunami → natural disaster: knowledge-based smoothing
(Figure: probabilities over Tsunami, ocean, Asia, computer, nat. disaster, … with mass directed to the related terms rather than to unrelated ones)
15 Inference: Translation model (Berger & Lafferty 99)
The model adds an inference (translation) component on top of the traditional LM.
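The formula itself is lost in this transcript; the standard form of the Berger & Lafferty (1999) translation model, in which the translation step plays the role of inference combined with the traditional LM, is:

```latex
P(Q \mid D) \;=\; \prod_{q_i \in Q} \sum_{w} P_t(q_i \mid w)\, P(w \mid D)
```

Here P(w|D) is the traditional document language model and P_t(q_i|w) is the translation (inference) component.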
16 Using more types of knowledge for document expansion (Cao et al. 05)
Different ways to satisfy a query (term):
- Directly, through the unigram model
- Indirectly (by inference), through WordNet relations
- Indirectly, through co-occurrence relations
- …
D → t_i if D →_UG t_i or D →_WN t_i or D →_CO t_i
17 Inference using different types of knowledge (Cao et al. 05)
(Diagram: a query term q_i is generated from the document through three weighted paths – the UG model (λ1), the WN model via P_WN(q_i|w) (λ2), and the CO model via P_CO(q_i|w) (λ3), where w_1 … w_n are document words.)
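A minimal sketch of the three-path mixture in this diagram, assuming the component distributions have already been estimated (the function signature and the example weights are illustrative, not taken from Cao et al. 05):

```python
def p_term_given_doc(q, p_w_given_d, p_wn, p_co, lambdas=(0.4, 0.3, 0.3)):
    """P(q|D) as a mixture of three generation paths.

    p_w_given_d : dict w -> P(w|D), the document's (smoothed) unigram model
    p_wn, p_co  : dicts (q, w) -> P(q|w), WordNet and co-occurrence relations
    lambdas     : mixture weights (lambda1, lambda2, lambda3) for the UG, WN, CO paths
    """
    l_ug, l_wn, l_co = lambdas
    direct = p_w_given_d.get(q, 0.0)                                                # D ->UG q
    via_wn = sum(p_wn.get((q, w), 0.0) * p_wd for w, p_wd in p_w_given_d.items())   # D -> w ->WN q
    via_co = sum(p_co.get((q, w), 0.0) * p_wd for w, p_wd in p_w_given_d.items())   # D -> w ->CO q
    return l_ug * direct + l_wn * via_wn + l_co * via_co
```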
18 Experiments (Cao et al. 05)
Different combinations of the unigram model (UM), link model (LM) and co-occurrence model (CM):

Model    | WSJ AvgP | WSJ Rec.  | AP AvgP | AP Rec.   | SJM AvgP | SJM Rec.
UM       | 0.2466   | 1659/2172 | 0.1925  | 3289/6101 | 0.2045   | 1417/2322
CM       | 0.2205   | 1700/2172 | 0.2033  | 3530/6101 | 0.1863   | 1515/2322
LM       | 0.2202   | 1502/2172 | 0.1795  | 3275/6101 | 0.1661   | 1309/2322
UM+CM    | 0.2527   | 1700/2172 | 0.2085  | 3533/6101 | 0.2111   | 1521/2322
UM+LM    | 0.2542   | 1690/2172 | 0.1939  | 3342/6101 | 0.2103   | 1558/2322
UM+CM+LM | 0.2597   | 1706/2172 | 0.2128  | 3523/6101 | 0.2142   | 1572/2322

Integrating more types of relations is useful.
19 Query expansion in LM
KL-divergence ranking between a query model and a smoothed document model.
With no query expansion, this is equivalent to the generative (query-likelihood) model.
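The ranking formula referred to here, reconstructed from the standard LM literature since the slide's equation is lost in the transcript: documents are ranked by the negative KL-divergence between the query model and the smoothed document model,

```latex
Score(Q, D) \;=\; -\,KL(\theta_Q \,\|\, \theta_D)
\;\overset{rank}{=}\; \sum_{t} P(t \mid \theta_Q)\, \log P(t \mid \theta_D)
```

When θ_Q is simply the empirical distribution of the query terms, this ranking is equivalent to the generative model.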
20 Expanding the query model
The expanded query model combines the classical LM (original query terms) with a relation model.
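A common way to write the expanded query model, consistent with the description on this slide but reconstructed rather than copied from it: the classical LM part keeps the original query terms, and the relation model propagates them to related terms,

```latex
P(t \mid \theta_Q') \;=\; \lambda\, P(t \mid \theta_Q)
\;+\; (1-\lambda) \sum_{t_i \in Q} P(t \mid t_i)\, P(t_i \mid \theta_Q)
```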
21 Possible relation models:
- Using co-occurrence information
- Using an external knowledge base (e.g. WordNet)
- Pseudo-relevance feedback
- Other term relationships
- …
22 Using the co-occurrence relation
Use the term co-occurrence relationship: terms that often co-occur in the same window are related.
Window size: 10 words
Unigram relationship (w_j → w_i)
Query expansion
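A minimal sketch of how such window-based relations P(w_i|w_j) could be estimated; the sliding-window counting and the unsmoothed maximum-likelihood estimate are illustrative choices, not necessarily the exact procedure used in the experiments:

```python
from collections import Counter, defaultdict

def cooccurrence_model(docs, window=10):
    """Estimate P(w_i | w_j) from co-occurrences inside fixed-size text windows.
    docs is a list of token lists; the result maps w_j -> {w_i: probability}."""
    pair_counts = defaultdict(Counter)
    for tokens in docs:
        for start in range(max(1, len(tokens) - window + 1)):
            terms = set(tokens[start:start + window])
            for wj in terms:
                for wi in terms:
                    if wi != wj:
                        pair_counts[wj][wi] += 1
    model = {}
    for wj, counts in pair_counts.items():
        total = sum(counts.values())
        model[wj] = {wi: c / total for wi, c in counts.items()}
    return model   # model[wj][wi] ~ P(w_i | w_j)
```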
23 Problem with co-occurrence relations: ambiguity
The relationship holds between two single words, e.g. "Java → programming".
There is no information to determine the appropriate context: nothing prevents the query "Java travel" from being expanded by "programming".
Solution: add some context information into the term relationship.
24 Overview
Introduction: Current Approaches to IR
Inference using term relations
  Extracting context-dependent term relations
A General Model to Integrate Contextual Factors
Constructing and Using Domain Models
Conclusion and Future Work
25 General Idea (Bai et al. 06)
Use (t_1, t_2, t_3, …) → t instead of t_1 → t, e.g. "(Java, computer, language) → programming"
Problems with an arbitrary number of terms in the condition:
- Complexity with many words in the condition part
- Difficult to obtain reliable relations
Our solution: limit the condition part to 2 words, e.g. "(Java, computer) → programming", "(Java, travel) → island"
One word specifies the context for the other.
26 Hypotheses
Hypothesis 1: most words can be disambiguated with one useful context word, e.g. "Java + computer", "Java + travel", "Java + taste"
Hypothesis 2: users often choose useful related words to form their queries – a word in the query provides useful information to disambiguate another word
- Typical queries: e.g. "windows version", "doors and windows"
- Rare case: users do not express their need clearly, e.g. "windows installation"?
27 Context-dependent co-occurrences (Bai et al. 06)
(w_i, w_j) → w_k
New relation model
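Extending the previous sketch to the two-word conditions used here, a possible maximum-likelihood estimate of P(w_k | w_i, w_j) from windowed co-occurrences (an illustrative reconstruction, not the exact estimator of Bai et al. 06):

```python
from collections import Counter, defaultdict
from itertools import combinations

def biterm_relation_model(docs, window=10):
    """Estimate P(w_k | w_i, w_j): how often w_k appears in windows that
    also contain both condition words w_i and w_j."""
    triple_counts = defaultdict(Counter)
    for tokens in docs:
        for start in range(max(1, len(tokens) - window + 1)):
            terms = set(tokens[start:start + window])
            for wi, wj in combinations(sorted(terms), 2):
                for wk in terms:
                    if wk not in (wi, wj):
                        triple_counts[(wi, wj)][wk] += 1
    model = {}
    for pair, counts in triple_counts.items():
        total = sum(counts.values())
        model[pair] = {wk: c / total for wk, c in counts.items()}
    return model   # model[(wi, wj)][wk] ~ P(w_k | w_i, w_j)
```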
28 Experimental Results (Average Precision)

Coll. | UM     | t1 → t2         | (t1, t2) → t3
AP    | 0.2767 | 0.2891* (+4%)   | 0.3280** (+19%)
SJM   | 0.2017 | 0.2175** (+8%)  | 0.2456** (+22%)
WSJ   | 0.2373 | 0.2390 (+1%)    | 0.2564 (+8%)
FR    | 0.1966 | 0.2057 (+5%)    | 0.2331 (+19%)

* and ** indicate that the difference is statistically significant by t-test (*: p-value < 0.05, **: p-value < 0.01)
29 Experimental Analysis (example)
Query #55: "Insider trading"
Unigram relationships P(*|insider) or P(*|trading):
stock:0.014177, market:0.0113156, US:0.0112784, year:0.010224, exchang:0.0101797, trade:0.00922486, report:0.00825644, price:0.00764028, dollar:0.00714267, 1:0.00691906, govern:0.00669295, state:0.00659957, futur:0.00619518, million:0.00614666, dai:0.00605674, offici:0.00597034, peopl:0.0059315, york:0.00579298, issu:0.00571347, nation:0.00563911
Bi-term relationships P(*|insider, trading):
secur:0.0161779, charg:0.0158751, stock:0.0137123, scandal:0.0128471, boeski:0.0125011, inform:0.011982, street:0.0113332, wall:0.0112034, case:0.0106411, year:0.00908383, million:0.00869452, investig:0.00826196, exchang:0.00804568, govern:0.00778614, sec:0.00778614, drexel:0.00756986, fraud:0.00718055, law:0.00631543, ivan:0.00609914, profit:0.00566658
=> Expansion terms determined by biterm query expansion (BQE) are more relevant than those from unigram query expansion (UQE).
30 Logical point of view of the extensions
(Diagram: the document D implies intermediate terms t_j, …, which imply the query term t_i through term relations.)
31 Overview
Introduction: Current Approaches to IR
Inference using term relations
A General Model to Integrate Contextual Factors
Constructing and Using Domain Models
Conclusion and Future Work
32 LM for context-dependent IR?
Context (X) = background knowledge of the user, the domain of interest, …
Document model smoothed by the context model: X | D → Q  ≡  | (X+D) → Q
- Similar to document expansion approaches
Query smoothed by the context model: X | D → Q  ≡  | D → (Q+X)
- Similar to (Lau et al. 04) and to query expansion approaches
Uses of context:
- Domain knowledge (e.g. "Java programming" only in computer science)
- Specification of the area of interest (e.g. science): background terms
- Characteristics of the collection
33 Contexts and Utilization (1)
General term relations (knowledge)
Traditional term relations are context-independent: e.g. "Java → programming", Prob(programming|Java)
Context-dependent term relations: add some context words into the relations
e.g. "{Java, computer} → programming" ("programming" is only derived to expand a query containing both "Java" and "computer")
"{Java, computer} →" identifies a better context than "Java →" to determine expansion terms
34 Contexts and Utilization (2)
Topic domains of the query (domain background)
Consider the topic domain as specifying a set of background terms frequently used in the domain.
However, these terms are often omitted from the queries.
e.g. in the Computer Science domain, the term "computer" is implied by queries in this domain but usually omitted: any query → "computer", …
35 Example: "bus services in Java"
Among the retrieved documents, 99 concern the "Java language" and only one is related to "transportation" (but it is irrelevant to the query).
Reason: the retrieval context is not considered – the user is preparing a trip.
36 Example: "bus services in Java" + "transportation, hotel, flight"
12 among the top 20 results are related to "transportation".
Reason: the additional terms specify the appropriate context and make the query less ambiguous.
37 Contexts and Utilization (3)
Query-specific collection characteristics (feedback model)
What terms are useful to retrieve relevant documents in a particular corpus?
~ What other topics are often described together with the query topic in the corpus?
e.g. in a corpus, "terrorism" is often described together with "9-11, air hijacking, World Trade Center, …"
Expand the query with related terms.
Feedback model: captures the query-related collection context.
38 Enhanced Query Model
Basic idea for query expansion: combine the original query model with an expansion model.
Generalized model: 3 expansion models from 3 contextual factors:
- θ_Q^0: original query model
- θ_Q^K: knowledge model
- θ_Q^Dom: domain model
- θ_Q^FB: feedback model
where X = {0, K, Dom, FB} is the set of all component models and α_X is the mixture weight.
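Written out from the slide's own description of the components (the α notation for the mixture weights is assumed):

```latex
P(t \mid \theta_Q) \;=\; \sum_{X \in \{0,\, K,\, Dom,\, FB\}} \alpha_X \, P(t \mid \theta_Q^{X}),
\qquad \sum_{X} \alpha_X = 1
```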
39 Illustration: Expanded Query Model
A term t can be derived from the query model through several inference paths.
Once a path is selected, the corresponding component LM is used to generate the term t.
40 Overview
Introduction: Current Approaches to IR
Inference using term relations
A General Model to Integrate Contextual Factors
Constructing and Using Domain Models
Conclusion and Future Work
41 Creating Domain Models
Assumption: each topic domain comes with a set of example (in-domain) documents.
Extract domain-specific terms from them, using the EM algorithm to keep only the domain-specific terms.
Assume each in-domain document is generated from a mixture of the domain model and the collection model (λ_Dom = 0.5).
The domain model is extracted by EM so as to maximize P(Dom | θ'_Dom), the likelihood of the in-domain documents.
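A minimal sketch of this EM procedure under the usual two-component mixture assumption (each in-domain word is drawn either from the domain model or from the collection model); λ_Dom = 0.5 is from the slide, everything else is illustrative:

```python
from collections import Counter

def train_domain_model(domain_docs, collection_probs, lam=0.5, iters=12):
    """EM estimate of P(w | theta_Dom) from in-domain documents (token lists),
    with a fixed background collection model collection_probs[w]."""
    counts = Counter(w for doc in domain_docs for w in doc)
    total = sum(counts.values())
    theta = {w: c / total for w, c in counts.items()}        # start from the ML estimate
    for _ in range(iters):
        # E-step: probability that each occurrence of w was generated by the domain model
        post = {}
        for w in counts:
            p_dom = lam * theta[w]
            p_bg = (1 - lam) * collection_probs.get(w, 1e-9)
            post[w] = p_dom / (p_dom + p_bg)
        # M-step: re-estimate the domain model from the expected domain counts
        expected = {w: counts[w] * post[w] for w in counts}
        norm = sum(expected.values())
        theta = {w: e / norm for w, e in expected.items()}
    return theta
```

Common words that are already well explained by the collection model receive low posteriors and are driven toward zero, which is the before/after effect shown on the next slide.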
42 Effect of the EM Process
Term probabilities in the "Environment" domain before/after EM (12 iterations):

Term        | Initial | Final   | Change || Term      | Initial | Final   | Change
air         | 0.00358 | 0.00558 | +56%   || year      | 0.00357 | 0.00052 | -86%
environment | 0.00213 | 0.00340 | +60%   || system    | 0.00212 | 7.13e-6 | -99%
rain        | 0.00197 | 0.00336 | +71%   || program   | 0.00189 | 0.00040 | -79%
pollution   | 0.00177 | 0.00301 | +70%   || million   | 0.00131 | 5.80e-6 | -99%
storm       | 0.00176 | 0.00302 | +72%   || make      | 0.00108 | 5.79e-5 | -95%
flood       | 0.00164 | 0.00281 | +71%   || company   | 0.00099 | 8.52e-8 | -99%
tornado     | 0.00072 | 0.00125 | +74%   || president | 0.00077 | 2.71e-6 | -99%
greenhouse  | 0.00034 | 0.00058 | +72%   || month     | 0.00073 | 3.88e-5 | -95%

=> EM extracts domain-specific terms while filtering out common words.
43 How to Gather In-domain Documents
Existing directories: ODP, Yahoo! Directory
We assume the user defines his own domains and assigns a domain to each of his queries (during the training phase):
- (C1) Gather the relevant documents of the queries (by the user's relevance judgments)
- (C2) Simply collect the top-ranked documents (without relevance judgments) – this strategy is used in order to test on TREC data
44 How to Determine the Domain of a New Query
Two strategies to assign a domain to the query:
- Manually (U1)
- Automatically (U2)
Automatic query classification by LM:
- Similar to text classification, but a query is much shorter than a text document
- Select the domain with the lowest KL-divergence score for the query
- This is an extension of Naïve Bayes classification [Peng et al. 2003]
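A minimal sketch of the KL-based domain assignment (the maximum-likelihood query model and the floor probability for unseen terms are illustrative assumptions):

```python
import math

def assign_domain(query_terms, domain_models, eps=1e-9):
    """Return the domain whose model theta_Dom has the lowest KL(theta_Q || theta_Dom)."""
    q_model = {t: query_terms.count(t) / len(query_terms) for t in set(query_terms)}
    best, best_kl = None, float("inf")
    for name, p_dom in domain_models.items():
        kl = sum(p_q * math.log(p_q / max(p_dom.get(t, 0.0), eps))
                 for t, p_q in q_model.items())
        if kl < best_kl:
            best, best_kl = name, kl
    return best
```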
45 Overview
Introduction: Current Approaches to IR
Inference using term relations
A General Model to Integrate Contextual Factors
Constructing and Using Domain Models
Experiments
Conclusion and Future Work
46 Experimental Setting
Text collection statistics (TREC):

Coll.     | Description | Size (GB) | Vocab.  | # of Doc. | Queries
Training  | Disk 2      | 0.86      | 350,085 | 231,219   | 1-50
Disks 1-3 | Disks 1-3   | 3.10      | 785,932 | 1,078,166 | 51-150
TREC7     | Disks 4-5   | 1.85      | 630,383 | 528,155   | 351-400
TREC8     | Disks 4-5   | 1.85      | 630,383 | 528,155   | 401-450

The training collection is used to determine the parameter values (mixture weights).
47 An Example: Query with a Manually Assigned Domain
Tipster Topic Description
Number: 055
Domain: International Economics
Topic: Insider Trading (only the title is used as the query)
Description: Document discusses an insider-trading case. …
(Figure: distribution of the queries among 13 domains in TREC)
48 Baseline Methods
Document model: Jelinek-Mercer smoothing

Collection | Measure         | Unigram model, without FB | Unigram model, with FB
Disks 1-3  | AvgP            | 0.1570                    | 0.2344 (+49.30%)**
           | Recall / 48,355 | 15,711                    | 19,513
           | P@10            | 0.4050                    | 0.5010
TREC7      | AvgP            | 0.1656                    | 0.2176 (+31.40%)**
           | Recall / 4,674  | 2,237                     | 2,777
           | P@10            | 0.3420                    | 0.3860
TREC8      | AvgP            | 0.2387                    | 0.2909 (+21.87%)**
           | Recall / 4,728  | 2,764                     | 3,237
           | P@10            | 0.4340                    | 0.4860
49 Constructing and Using Domain Models
Two strategies to create domain models (the current test query is excluded from domain model construction):
- (C1) With the relevant documents of in-domain queries: the user judges which documents are relevant to the domain – similar to the manual construction of directories
- (C2) With the top-100 documents retrieved by in-domain queries: the user specifies a domain for the queries without judging relevant documents; the system gathers in-domain documents from the user's search history
Once the domain models are constructed, two strategies to use them:
- (U1) The domain is assigned to a new query manually by the user
- (U2) The domain is determined automatically by the system using query classification
50 Creating Domain Models
C1 (constructed with relevant documents) vs. C2 (with top-100 documents):

Collection     | Measure         | C1 without FB     | C1 with FB        | C2 without FB     | C2 with FB
Disks 1-3 (U1) | AvgP            | 0.1700 (+8.28%)++ | 0.2454 (+4.69%)** | 0.1718 (+9.43%)++ | 0.2456 (+4.78%)**
               | Recall / 48,355 | 16,517            | 20,141            | 16,558            | 20,131
               | P@10            | 0.4370            | 0.5130            | 0.4300            | 0.5140
TREC7 (U2)     | AvgP            | 0.1715 (+3.56%)++ | 0.2389 (+9.79%)*  | 0.1765 (+6.58%)++ | 0.2395 (+10.06%)**
               | Recall / 4,674  | 2,270             | 2,965             | 2,319             | 2,969
               | P@10            | 0.3720            | 0.3740            | 0.3780            | 0.3820
TREC8 (U2)     | AvgP            | 0.2442 (+2.30%)   | 0.2957 (+1.65%)   | 0.2434 (+1.97%)   | 0.2949 (+1.38%)
               | Recall / 4,728  | 2,796             | 3,308             | 2,772             | 3,318
               | P@10            | 0.4420            | 0.5000            | 0.4380            | 0.4960
51 Determining the Query Domain Automatically
U2 (automatic domain assignment) vs. U1 (manual domain assignment):

Collection     | Measure         | C1 without FB     | C1 with FB        | C2 without FB     | C2 with FB
Disks 1-3 (U1) | AvgP            | 0.1700 (+8.28%)++ | 0.2454 (+4.69%)** | 0.1718 (+9.43%)++ | 0.2456 (+4.78%)**
               | Recall / 48,355 | 16,517            | 20,141            | 16,558            | 20,131
               | P@10            | 0.4370            | 0.5130            | 0.4300            | 0.5140
Disks 1-3 (U2) | AvgP            | 0.1650 (+5.10%)++ | 0.2444 (+4.27%)** | 0.1670 (+6.37%)++ | 0.2449 (+4.48%)**
               | Recall / 48,355 | 16,343            | 20,061            | 16,414            | 20,090
               | P@10            | 0.4270            | 0.5100            | 0.4090            | 0.5140
52 Complete Models
Integrating the original query model, knowledge model, domain model and feedback model:

Collection | Measure         | Manual domain id. (U1)        | Automatic domain id. (U2)
Disks 1-3  | AvgP            | 0.2501 (+59.30%)++ (+6.70%)** | 0.2489 (+58.54%)++ (+6.19%)**
           | Recall / 48,355 | 20,514                        | 20,367
           | P@10            | 0.5200                        | 0.5230
TREC7      | AvgP            | N/A                           | 0.2462 (+48.67%)++ (+13.14%)**
           | Recall / 4,674  | N/A                           | 3,014
           | P@10            | N/A                           | 0.3960
TREC8      | AvgP            | N/A                           | 0.3029 (+26.90%)++ (+4.13%)**
           | Recall / 4,728  | N/A                           | 3,321
           | P@10            | N/A                           | 0.5020

(The two percentages per cell are the improvements over the unigram baselines without and with feedback, respectively.)
53 Overview
Introduction: Current Approaches to IR
Inference using term relations
A General Model to Integrate Contextual Factors
Constructing and Using Domain Models
Conclusion and Future Work
54 Conclusions
Document/query expansion is useful for IR.
It can be viewed as an inference process: deduce whether a document is related to a query, D → Q.
It is useful to consider several types of context during the inference:
- Relations between terms (expansion terms)
- Query context: domain
- Collection characteristics: feedback model
Context | D → Q
Good experimental results:
- Different contextual factors are complementary
- The complete query model that integrates all contextual factors performs best
- Language modeling can be extended for this purpose
Future users will not be content with word matching. They will require some inference: determining the intent, suggesting related words, selecting appropriate vertical search engines, …
55 Future Work
- Integrate other contextual factors (user preferences). How to extract them? Query logs?
- Other ways to extract term relations than co-occurrence (e.g. association rules)
- Term independence in the final model: relax the term independence assumption
  - Create a model that integrates compound terms [Bai, Nie and Cao, Web Intelligence'05]
  - Dependence language model [Gao et al., SIGIR'04]
- Do different contextual factors act independently? Can we use simple smoothing to combine them?
- Is there a better formalism than language modeling for inference? Inference networks (Turtle & Croft 1990)? Non-classical logic?
56 Thanks!