
1 CS 430: Information Discovery
Lecture 26 Extending the Boolean Model

2 Course Administration
Final examination:
Date: Tuesday, 15 May, 3:00 to 5:00 p.m.
Room: Kimball Hall B11
Early examination:
Date: Thursday, May 10th, 1:00 to 3:00 p.m.
Room: Upson Hall 5130
If you wish to take the early examination, send email to
Laptops: Return them before the examination and bring the receipt to the examination.

3 Laptop returns
Dates:
Tuesday, May 8th, 9:00 - 11:00 a.m.
Monday, May 14th, 1:00 - 3:00 p.m.
Tuesday, May 15th, 9: :00 a.m.
Place: Upson Hall 5130
Receipts: Bring a copy of your receipt to the examination.

4 Discussion 11, Question 7: Overall
Informedia had the following objectives:
(a) retrieval performance in the presence of inaccuracy and ambiguity
(b) approximate match in meaning and visualization
(c) presentation and reuse of video content as a new data type with space and time constraints
(d) interoperability in the presence of restricted-use intellectual property and the absence of data and protocol standards
How well has each been achieved?

5 Problems with the Boolean model
Counter-intuitive results:
Query q = A and B and C and D and E
Document d has terms A, B, C and D, but not E.
Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.
Query q = A or B or C or D or E
Document d1 has terms A, B, C, D and E.
Document d2 has term A, but not B, C, D or E.
Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.
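Both counter-intuitive results above can be reproduced in a few lines of Python (a minimal sketch; the term sets and function names are illustrative):

```python
# Strict Boolean matching over term sets: all-or-nothing, no ranking.
def boolean_and(doc_terms, query_terms):
    """Every query term must be present in the document."""
    return query_terms.issubset(doc_terms)

def boolean_or(doc_terms, query_terms):
    """Any single query term suffices; how many matched is ignored."""
    return bool(doc_terms & query_terms)

q = {"A", "B", "C", "D", "E"}
d = {"A", "B", "C", "D"}           # matches 4 of the 5 query terms
d1 = {"A", "B", "C", "D", "E"}     # matches all 5 terms
d2 = {"A"}                         # matches only 1 term

print(boolean_and(d, q))                       # False: d is rejected outright
print(boolean_or(d1, q), boolean_or(d2, q))    # True True: ranked as equal
```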

6 Problems with the Boolean model (continued)
• The Boolean model has no way to rank documents.
• The Boolean model allows no uncertainty in assigning index terms to documents.
• The Boolean model has no provision for weighting the importance of query terms.
Boolean retrieval is all or nothing.

7 Boolean model as sets
d and q are either in the set A or not in A. There is no halfway!
[Figure: Venn diagram showing query q and document d relative to set A]

8 Extending the Boolean model
Term weighting
• Give weights to terms in documents and/or queries.
• Combine standard Boolean retrieval with vector ranking of results.
Fuzzy sets
• Relax the boundaries of the sets used in Boolean retrieval.

9 Ranking methods in Boolean systems
SIRE (Syracuse Information Retrieval Experiment)
Term weights
• Add term weights to documents. Weights are calculated by the standard method of term frequency * inverse document frequency.
Ranking
• Calculate the results set by standard Boolean methods.
• Rank the results by vector distances.
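The two-stage approach can be sketched as follows (a toy collection with illustrative names, not SIRE's actual implementation; similarity here is a simple inner product against a unit-weight query vector):

```python
import math

# Toy collection: document id -> list of index terms (with repeats).
docs = {
    "d1": ["boolean", "model", "ranking", "ranking"],
    "d2": ["boolean", "retrieval", "fuzzy"],
    "d3": ["vector", "ranking", "model"],
}

def tf_idf(docs):
    """Weight each term in each document by tf * idf."""
    n = len(docs)
    df = {}                                   # document frequency per term
    for terms in docs.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    return {
        d: {t: terms.count(t) * math.log(n / df[t]) for t in set(terms)}
        for d, terms in docs.items()
    }

def boolean_and_filter(docs, query_terms):
    """Results set by strict Boolean AND: all query terms present."""
    return [d for d, terms in docs.items() if set(query_terms) <= set(terms)]

def rank(docs, query_terms):
    """Select with Boolean retrieval first, then rank by vector score."""
    weights = tf_idf(docs)
    results = boolean_and_filter(docs, query_terms)
    return sorted(
        results,
        key=lambda d: sum(weights[d].get(t, 0.0) for t in query_terms),
        reverse=True,
    )

print(rank(docs, ["boolean", "model"]))   # ['d1']
```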

10 Relevance feedback in SIRE
SIRE (Syracuse Information Retrieval Experiment)
Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded.
• The results set is created by standard Boolean retrieval.
• The user selects one document from the results set.
• Other documents in the collection are ranked by vector distance from this document.
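The feedback step above can be sketched as follows (cosine similarity between term-weight vectors is assumed; the collection and names are illustrative, not SIRE's actual code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

collection = {
    "d1": {"boolean": 0.9, "model": 0.7},
    "d2": {"boolean": 0.8, "model": 0.6, "fuzzy": 0.2},
    "d3": {"vector": 0.9, "ranking": 0.8},
}

selected = "d1"   # the document the user marked as relevant
ranked = sorted(
    (d for d in collection if d != selected),
    key=lambda d: cosine(collection[selected], collection[d]),
    reverse=True,
)
print(ranked)   # ['d2', 'd3']: d2 is closer to the selected document
```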

11 Boolean model as fuzzy sets
q is more or less in A. There is a halfway!
[Figure: fuzzy-set diagram showing query q and document d with partial membership in set A]

12 Basic concept
• A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document.
• Term weights are in the range [0, 1]. (In the standard Boolean model, all weights are either 0 or 1.)
• For a given query, calculate the similarity between the query and each document in the collection.
• This calculation is needed for every document that has a non-zero weight for any of the terms in the query.

13 MMM: Mixed Min and Max model
Fuzzy set theory
dA is the degree of membership of an element in set A.
intersection (and): dAB = min(dA, dB)
union (or): dAB = max(dA, dB)
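The fuzzy connectives are one line each; note that with weights restricted to {0, 1} they reduce to the ordinary Boolean operators:

```python
# Fuzzy-set connectives: membership degrees in [0, 1],
# intersection = min, union = max.
def fuzzy_and(dA, dB):
    return min(dA, dB)

def fuzzy_or(dA, dB):
    return max(dA, dB)

print(fuzzy_and(0.6, 0.3))   # 0.3
print(fuzzy_or(0.6, 0.3))    # 0.6
print(fuzzy_and(1, 0), fuzzy_or(1, 0))   # 0 1, the classical Boolean case
```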

14 MMM: Mixed Min and Max model
[Table: example values of dA and dB with the resulting dAB under "and" and "or", comparing standard set theory with fuzzy set theory]

15 MMM: Mixed Min and Max model
Terms: A1, A2, ..., An
Document D, with index-term weights: dA1, dA2, ..., dAn
Qor = (A1 or A2 or ... or An)
Query-document similarity:
S(Qor, D) = Cor1 * max(dA1, dA2, ..., dAn) + Cor2 * min(dA1, dA2, ..., dAn)
where Cor1 + Cor2 = 1

16 MMM: Mixed Min and Max model
Terms: A1, A2, ..., An
Document D, with index-term weights: dA1, dA2, ..., dAn
Qand = (A1 and A2 and ... and An)
Query-document similarity:
S(Qand, D) = Cand1 * min(dA1, ..., dAn) + Cand2 * max(dA1, ..., dAn)
where Cand1 + Cand2 = 1
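The two MMM similarity formulas translate directly into code (a minimal sketch; the example weights are illustrative, and Cor2 = 1 - Cor1, Cand2 = 1 - Cand1):

```python
# MMM similarities: a soft mix of the max and min document weights.
def s_or(weights, c_or1):
    """S(Qor, D) = Cor1 * max + Cor2 * min, with Cor2 = 1 - Cor1."""
    return c_or1 * max(weights) + (1 - c_or1) * min(weights)

def s_and(weights, c_and1):
    """S(Qand, D) = Cand1 * min + Cand2 * max, with Cand2 = 1 - Cand1."""
    return c_and1 * min(weights) + (1 - c_and1) * max(weights)

d = [0.9, 0.4, 0.7]      # dA1, dA2, dA3 for one document
print(s_or(d, 0.8))      # 0.8*0.9 + 0.2*0.4 = 0.80
print(s_and(d, 0.7))     # 0.7*0.4 + 0.3*0.9 = 0.55
```

With Cor1 = Cand1 = 1 these reduce to the pure fuzzy union (max) and intersection (min) of the previous slide.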

17 MMM: Mixed Min and Max model
Experimental values:
Cand1 in the range [0.5, 0.8]
Cor1 > 0.2
Computational cost is low. Retrieval performance is much improved.

18 Paice Model
The Paice model is a relative of the MMM model.
The MMM model considers only the maximum and minimum document weights; the Paice model takes all of the document weights into account.
Computational cost is higher than for MMM. Retrieval performance is improved.
See Frakes, Chapter 15, for more details.

19 P-norm model
Terms: A1, A2, ..., An
Document D, with term weights: dA1, dA2, ..., dAn
Query terms are given weights, a1, a2, ..., an, which indicate their relative importance.
Operators have coefficients that indicate their degree of strictness.
Query-document similarity is calculated by considering each document and query as a point in n-space.
See Frakes, Chapter 15, for details.

20 Test data
Percentage improvement over the standard Boolean model (average best precision), Lee and Fox, 1988:

Model     CISI   CACM   INSPEC
P-norm      79    106      210
Paice       77    104      206
MMM          -      -        -

21 Readings
S. Wartik, "Boolean Operators", Frakes, Chapter 12. Algorithms for high-speed evaluation of Boolean operators.
E. Fox, S. Betrabet, M. Koushik, W. Lee, "Extended Boolean Models", Frakes, Chapter 15. Methods based on fuzzy set concepts.
D. Harman, "Relevance Feedback and Other Query Modification Techniques", Frakes, Chapter 11, section on relevance feedback in Boolean methods.

