
1 Introduction to Information Retrieval (Part 2) By Evren Ermis

2 Introduction to Information Retrieval
Retrieval models
  - Vector-space model
  - Probabilistic model
  - Relevance feedback
Evaluation
  - Performance evaluation
  - Retrieval performance evaluation
    - Reference collections
    - Evaluation measures

3 Vector-space model
Binary weights are too limiting. Assign non-binary weights to index terms:
  - in queries
  - in documents
Compute the degree of similarity between each document and the query. Sorting in order of similarity allows considering documents which match the query only partially.

4 Vector-space model
Consider every document (and the query) as a vector of index term weights. Similarity is measured by the correlation between the two vectors.
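The similarity measure itself appeared as a formula image and is missing from the transcript; the classic choice in the vector-space model (cf. Baeza-Yates, 1999) is the cosine of the angle between the document and query vectors, reconstructed here:

  \mathrm{sim}(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}

where w_{i,j} and w_{i,q} are the weights of index term k_i in document d_j and in query q.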

5 Vector-space model
The model does not predict whether a document is relevant or not; it ranks documents according to their similarity to the query. A document can be retrieved even though it matches the query only partially. A threshold d can be used to filter out documents with similarity < d.

6 Vector-space model
Index term weights should capture two effects:
  - features that better describe the sought documents: intra-cluster similarity
  - features that distinguish the sought documents from the rest: inter-cluster dissimilarity

7 Vector-space model
Index term weights:
  - intra-cluster similarity is quantified by the term frequency (tf) factor
  - inter-cluster dissimilarity is quantified by the inverse document frequency (idf) factor
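The tf and idf formulas on this slide were lost in transcription; the standard normalized forms (following Baeza-Yates, 1999) are:

  \mathrm{tf}_{i,j} = \frac{\mathrm{freq}_{i,j}}{\max_l \mathrm{freq}_{l,j}}, \qquad \mathrm{idf}_i = \log \frac{N}{n_i}

where freq_{i,j} is the raw frequency of term k_i in document d_j, N is the total number of documents, and n_i is the number of documents in which k_i appears.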

8 Vector-space model
Index term weights: the weight of a term in a document is then calculated as the product of the tf factor and the idf factor; a similar, slightly modified weight is used for the query.
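The weight formulas were likewise images; the usual reconstruction is the tf-idf product for documents and the Salton-Buckley suggestion for query weights:

  w_{i,j} = \mathrm{tf}_{i,j} \times \log \frac{N}{n_i}, \qquad w_{i,q} = \left( 0.5 + \frac{0.5 \, \mathrm{freq}_{i,q}}{\max_l \mathrm{freq}_{l,q}} \right) \times \log \frac{N}{n_i}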

9 Vector-space model
Advantages:
  - improves retrieval performance
  - partial matching is allowed
  - documents are sorted according to similarity
Disadvantages:
  - assumes that index terms are independent

10 Probabilistic model
Assume there is a set of documents containing exactly the relevant documents and no others (the ideal answer set). The problem is that we do not know that set's properties. Index terms are used to characterize these properties. At query time, an initial guess provides a probabilistic description of the ideal answer set, which is used to retrieve a first set of documents. Interaction with the user then improves the probabilistic description of the ideal answer set.

11 Probabilistic model
Instead of relying on interaction with the user to improve the probabilistic description of the ideal answer set, the probabilistic approach is to model the description in probabilistic terms without the user. Problem: we do not know how to compute the probabilities of relevance.

12 Probabilistic model
How to compute the probabilities of relevance? As a measure of similarity, take
  P(d_j relevant-to q) / P(d_j non-relevant-to q),
the odds of document d_j being relevant to query q. This leads to the similarity function below.
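The similarity function was shown as an image; assuming independent, binary-weighted index terms, the standard derivation (cf. Robertson & Sparck Jones, 1976) yields:

  \mathrm{sim}(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)

where R is the (unknown) set of relevant documents, \bar{R} its complement, and P(k_i \mid R) the probability that index term k_i appears in a document of R.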

13 Probabilistic model
Problem: we do not have the set R at the beginning, so it is necessary to find initial probabilities. Make two assumptions:
  - P(k_i | R) is constant for all index terms
  - the distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all documents

14 Probabilistic model
So we get the initial estimates below. Now we can retrieve documents containing query terms and provide an initial probabilistic ranking for them.
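The estimates themselves were an image; the customary choices corresponding to the two assumptions are:

  P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{n_i}{N}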

15 Probabilistic model
Now we can use these retrieved documents to improve our assumed probabilities. Let V be a subset of the retrieved documents and V_i the subset of V containing the i-th index term. Then the probabilities are re-estimated as shown below.
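The update formulas were images; the standard re-estimation is:

  P(k_i \mid R) = \frac{|V_i|}{|V|}, \qquad P(k_i \mid \bar{R}) = \frac{n_i - |V_i|}{N - |V|}

(small adjustment factors, e.g. 0.5 in the numerator and 1 in the denominator, are commonly added to avoid problems when V and V_i are small).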

16 Probabilistic model
Advantages:
  - documents are ranked in decreasing order of their probability of being relevant
Disadvantages:
  - needs a guess for the initial separation of relevant and non-relevant documents
  - does not consider the frequency of occurrences of an index term within a document

17 Relevance feedback
A query reformulation strategy:
  - the user picks relevant documents out of the retrieved set
  - the method selects important terms attached to the user-identified documents
  - the newly gained information is incorporated into a new query formulation and a reweighting of the terms

18 Relevance feedback for the vector model
Vectors of relevant documents have similarity among themselves, while non-relevant documents have vectors that are dissimilar to the relevant ones. The idea is to reformulate the query such that it moves closer to the relevant documents in the term-weight vector space.

19 Relevance feedback for the vector model
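This slide carried only the reformulation formula, which was lost in transcription; the classic Rocchio (1971) formulation is:

  \vec{q}_{new} = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j

where D_r and D_n are the sets of relevant and non-relevant documents identified by the user, and \alpha, \beta, \gamma are tuning constants.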

20 Relevance feedback for the probabilistic model
Replace V by D_r and V_i by D_{r,i}, where D_r is the set of documents the user has judged relevant, and D_{r,i} is the subset of D_r containing the index term k_i.

21 Relevance feedback for the probabilistic model
Using this replacement and rewriting the similarity function for the probabilistic model, we get the expression below. Note that this reweights the index terms already in the query; the query is not expanded by new index terms.
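The rewritten similarity function was an image; substituting the D_r-based estimates into the formula from slide 12 gives, as a reconstruction:

  \mathrm{sim}(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \left( \log \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} + \log \frac{N - |D_r| - n_i + |D_{r,i}|}{n_i - |D_{r,i}|} \right)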

22 Relevance feedback for the probabilistic model
Advantages:
  - the feedback is directly related to the derivation of new weights
  - the reweighting is optimal under the assumptions of term independence and binary document indexing
Disadvantages:
  - document term weights are not regarded in the feedback loop
  - previous term weights in the query are disregarded
  - no query expansion, hence not as effective as the vector modification method

23 Evaluation
Types of evaluation:
  - performance of the system (time and space)
  - functional analysis, in which the specified system functionalities are tested
  - retrieval performance: how precise is the answer set? This requires
    - a reference collection
    - an evaluation measure

24 Performance evaluation
  - performance of the indexing structures
  - interaction with the operating system
  - delays in communication channels
  - overheads introduced by the many software layers

25 Retrieval performance evaluation
A reference collection consists of:
  - a collection of documents
  - a set of example information requests
  - a set of relevant documents for each request
An evaluation measure:
  - uses the reference collection
  - quantifies the similarity between the documents retrieved by a retrieval strategy and the provided set of relevant documents

26 Reference collections
Several different reference collections exist:
  - TIPSTER/TREC
  - CACM
  - CISI
  - Cystic Fibrosis
  - etc.
We choose TIPSTER/TREC for further discussion.

27 TIPSTER/TREC conference
The "Text REtrieval Conference", built under the TIPSTER program. It provides a large test collection (over 1 million documents). For each conference a set of reference experiments is designed; research groups use these to compare their retrieval systems.

28 Evaluation measures
Several different evaluation measures exist:
  - recall and precision
  - average precision
  - interpolated precision
  - harmonic mean (F-measure)
  - E-measure
  - satisfaction, frustration, etc.
We choose recall and precision, the most widely used, for further discussion.

29 Recall and precision

30 Recall and precision
Definitions:
  - Recall is the fraction of the relevant documents which has been retrieved.
  - Precision is the fraction of the retrieved documents which is relevant.
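In symbols (reconstructed from the usual presentation: R is the set of relevant documents, A the answer set, and R_a = R \cap A their intersection):

  \mathrm{Recall} = \frac{|R_a|}{|R|}, \qquad \mathrm{Precision} = \frac{|R_a|}{|A|}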

31 Precision vs. Recall
Assume that all documents in A have been examined. In practice, however, the user is not confronted with all documents at once; instead they are sorted according to relevance, and recall and precision vary as the user proceeds with the examination. A proper evaluation therefore requires a precision vs. recall curve.

32 Precision vs. Recall

33 Average precision
The example figure is for one query. To evaluate a retrieval algorithm we have to run several distinct queries, which yields distinct precision vs. recall curves. We then average the precision figures at each recall level.
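The averaging formula was an image; the standard form, with N_q the number of queries and P_i(r) the precision of the i-th query at recall level r, is:

  \bar{P}(r) = \sum_{i=1}^{N_q} \frac{P_i(r)}{N_q}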

34 Interpolated precision
The recall levels of each query are in general distinct from the 11 standard recall levels (0%, 10%, …, 100%), so an interpolation procedure is necessary. Let r_j be the j-th standard recall level, with j = 0, 1, …, 10. The interpolation rule is given below.
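The rule itself was an image; the usual definition (cf. Baeza-Yates, 1999) takes, at each standard recall level, the maximum known precision at any recall between that level and the next:

  P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)

A minimal Python sketch of the closely related TREC-style variant, which takes the maximum precision at any recall at or above r_j (function name and example data are illustrative, not from the slides):

    def interpolated_precision(points):
        """points: list of (recall, precision) pairs observed for one query."""
        levels = [j / 10 for j in range(11)]  # the 11 standard recall levels
        interp = []
        for r_j in levels:
            # precision values observed at recall r_j or higher
            candidates = [p for r, p in points if r >= r_j]
            interp.append(max(candidates) if candidates else 0.0)
        return interp

    # Example: 3 relevant documents retrieved at ranks 1, 3 and 6
    points = [(1/3, 1.0), (2/3, 2/3), (1.0, 0.5)]
    print(interpolated_precision(points))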

35 Interpolated precision

36 Example figures

37 Harmonic mean (F-measure)
The harmonic mean of recall and precision is defined below. F is high only if both recall and precision are high; therefore the maximum of F can be interpreted as the best compromise between recall and precision.
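The definition was an image; after examining the first j documents of the ranking, with recall r(j) and precision P(j), the harmonic mean is:

  F(j) = \frac{2}{\frac{1}{r(j)} + \frac{1}{P(j)}} = \frac{2 \, r(j) \, P(j)}{r(j) + P(j)}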

38 E-measure
The user specifies whether he is more interested in recall or in precision. The E-measure is defined below, where b is user-specified and reflects the relative importance of recall and precision.
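The definition was an image; the usual form (van Rijsbergen's E-measure as presented in Baeza-Yates, 1999) is:

  E(j) = 1 - \frac{1 + b^2}{\frac{b^2}{r(j)} + \frac{1}{P(j)}}

For b = 1 this reduces to the complement of the harmonic mean, E(j) = 1 - F(j).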

39 Conclusion
We introduced the two most popular models for information retrieval:
  - the vector-space model
  - the probabilistic model
We also introduced evaluation methods to quantify the performance of information retrieval systems (recall and precision, …).

40 References
  - R. Baeza-Yates, B. Ribeiro-Neto: "Modern Information Retrieval" (1999)
  - G. Salton: "The SMART Retrieval System - Experiments in Automatic Document Processing" (1971)
  - S. E. Robertson, K. Sparck Jones: "Relevance Weighting of Search Terms", Journal of the American Society for Information Science (1976)
  - N. Fuhr: "Probabilistic Models in Information Retrieval" (1992)
  - TREC NIST website: http://trec.nist.gov
  - J. J. Rocchio: "Relevance Feedback in Information Retrieval" (1971)

