SLIDE 1 IS 240 – Spring 2011
Principles of Information Retrieval
Lecture 6: Boolean to Vector
Prof. Ray Larson, University of California, Berkeley, School of Information

SLIDE 2 IS 240 – Spring 2011
Today
Review:
– IR Models
– The Boolean Model
– Boolean implementation issues
– Extended Boolean Approaches
Vector Representation
Term Weights
Vector Matching

SLIDE 3 IS 240 – Spring 2011 Review: IR Models, Extended Boolean

SLIDE 4 IS 240 – Spring 2011
IR Models
Set Theoretic Models:
– Boolean
– Fuzzy
– Extended Boolean
Vector Models (Algebraic)
Probabilistic Models

SLIDE 5 IS 240 – Spring 2011 Boolean Logic [Venn diagram of sets A and B]

SLIDE 6 IS 240 – Spring 2011
Parse Result (Query Tree) for Z39.50 queries…
Query: Title XXX and Subject YYY
Op: AND
– left Operand: Index = Title, Value = XXX
– right Operand: Index = Subject, Value = YYY

SLIDE 7 IS 240 – Spring 2011
Parse Results
Query: Subject XXX and (Title YYY and Author ZZZ)
Op: AND
– Operand: Index: Subject, Value: XXX
– Op: AND
  – Operand: Index: Title, Value: YYY
  – Operand: Index: Author, Value: ZZZ

SLIDE 8 IS 240 – Spring 2011 Boolean AND Algorithm [diagram: merging two sorted postings lists with AND; see the sketch after slide 10]

SLIDE 9 IS 240 – Spring 2011 Boolean OR Algorithm [diagram: merging two sorted postings lists with OR]

SLIDE 10 IS 240 – Spring 2011 Boolean AND NOT Algorithm [diagram: merging two sorted postings lists with AND NOT]
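The three merge diagrams above were images in the original slides. A minimal Python sketch (illustrative, not the lecture's code) of the standard sorted-postings merges they depict:

```python
# Sketch of sorted-postings merges. Each postings list is a sorted
# list of document IDs, as produced from an inverted file.

def and_merge(p1, p2):
    """Intersect two sorted postings lists (Boolean AND)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def or_merge(p1, p2):
    """Union of two sorted postings lists (Boolean OR)."""
    i = j = 0
    out = []
    while i < len(p1) or j < len(p2):
        if j >= len(p2) or (i < len(p1) and p1[i] < p2[j]):
            out.append(p1[i]); i += 1
        elif i >= len(p1) or p2[j] < p1[i]:
            out.append(p2[j]); j += 1
        else:                          # equal doc ID: emit once
            out.append(p1[i]); i += 1; j += 1
    return out

def and_not_merge(p1, p2):
    """Doc IDs in p1 but not in p2 (Boolean AND NOT)."""
    exclude = set(p2)
    return [d for d in p1 if d not in exclude]

print(and_merge([1, 3, 5, 9], [3, 4, 5]))   # [3, 5]
print(or_merge([1, 3, 5], [3, 4]))          # [1, 3, 4, 5]
print(and_not_merge([1, 3, 5], [3, 4]))     # [1, 5]
```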

SLIDE 11 IS 240 – Spring 2011
Boolean Summary
Advantages:
– simple queries are easy to understand
– relatively easy to implement
Disadvantages:
– difficult to specify what is wanted, particularly in complex situations
– too much returned, or too little
– ordering not well determined
Dominant IR model in commercial systems until the WWW

SLIDE 12 IS 240 – Spring 2011
Basic Concepts for Extended Boolean
Instead of binary values, terms in documents and queries have a weight (importance or some other statistical property)
Instead of binary set membership, sets are “fuzzy” and the weights are used to determine degree of membership
Degree of set membership can be used to rank the results of a query

SLIDE 13 IS 240 – Spring 2011
Fuzzy Sets
Introduced by Zadeh in 1965.
If set {A} has value v(A) and {B} has value v(B), where 0 ≤ v ≤ 1:
v(A ∩ B) = min(v(A), v(B))
v(A ∪ B) = max(v(A), v(B))
v(~A) = 1 − v(A)

SLIDE 14 IS 240 – Spring 2011
Fuzzy Sets
If we have three documents and three terms…
D1 = (.4, .2, 1), D2 = (0, 0, .8), D3 = (.7, .4, 0)
For the search t1 ∨ t2 ∨ t3 (OR):
v(D1) = max(.4, .2, 1) = 1
v(D2) = max(0, 0, .8) = .8
v(D3) = max(.7, .4, 0) = .7
For the search t1 ∧ t2 ∧ t3 (AND):
v(D1) = min(.4, .2, 1) = .2
v(D2) = min(0, 0, .8) = 0
v(D3) = min(.7, .4, 0) = 0

SLIDE 15 IS 240 – Spring 2011
Fuzzy Sets
Fuzzy set membership of a term in a document is f(A) ∈ [0, 1]
D1 = {(mesons, .8), (scattering, .4)}
D2 = {(mesons, .5), (scattering, .6)}
Query = MESONS AND SCATTERING
RSV(D1) = MIN(.8, .4) = .4
RSV(D2) = MIN(.5, .6) = .5
D2 is ranked before D1 in the result set.

SLIDE 16 IS 240 – Spring 2011 Fuzzy Sets The set membership function can be, for example, the relative term frequency within a document, the IDF, or any other function providing weights to terms. This means that fuzzy methods use the same or similar term-weighting criteria as other ranked retrieval methods (e.g., vector and probabilistic methods).

SLIDE 17 IS 240 – Spring 2011
Robertson’s Critique of Fuzzy Sets
D1 = {(mesons, .4), (scattering, .4)}
D2 = {(mesons, .39), (scattering, .99)}
Query = MESONS AND SCATTERING
RSV(D1) = MIN(.4, .4) = .4
RSV(D2) = MIN(.39, .99) = .39
D1 is ranked above D2, even though D2 matches SCATTERING far better and MESONS nearly as well.
However, this is consistent with the Boolean model:
– Query = t1 ∧ t2 ∧ t3 ∧ … ∧ t100
– If D is not indexed by t1 then it fails, even if D is indexed by t2, …, t100

SLIDE 18 IS 240 – Spring 2011 Robertson’s Critique of Fuzzy Fuzzy sets suffer from almost the same lack of discrimination among retrieval results as standard Boolean: the rank of a document depends entirely on the lowest-weighted term in an AND or the highest-weighted term in an OR.

SLIDE 19 IS 240 – Spring 2011 Other Fuzzy Approaches As described in the Modern Information Retrieval text (optional), a keyword correlation matrix can be used to determine set membership values, and algebraic sums and products can be used in place of MAX and MIN. It is not clear how this approach works in real applications (or in tests like TREC), because it has only been tested on a small scale.

SLIDE 20 IS 240 – Spring 2011
Extended Boolean (P-Norm)
Ed Fox’s dissertation work with Salton.
Basic notion: terms in a Boolean query, and the Boolean operators themselves, can have weights assigned to them.
Binary weights mean queries behave like standard Boolean; weights strictly between 0 and 1 mean queries behave like a ranking system.
The system requires similarity measures (standard p-norm forms below).
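The p-norm similarity measures were shown as images on the original slides. The standard unweighted forms (after Salton, Fox & Wu, 1983) are:

$$ \mathrm{sim}(q_{\lor}, d) = \left( \frac{w_1^p + \cdots + w_n^p}{n} \right)^{1/p} \qquad \mathrm{sim}(q_{\land}, d) = 1 - \left( \frac{(1-w_1)^p + \cdots + (1-w_n)^p}{n} \right)^{1/p} $$

With p = 1 these behave like a vector-space average; as p → ∞ they reduce to the fuzzy MAX/MIN operators.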

SLIDE 21 IS 240 – Spring 2011 Probabilistic Inclusion of Boolean Most probabilistic models attempt to predict the probability that, given a particular query Q and document D, the searcher would find D relevant. If we assume that Boolean criteria are to be ANDed with a probabilistic query…

SLIDE 22 IS 240 – Spring 2011
RUBRIC – Extended Boolean
Scans the full text of documents and stores them
User develops a hierarchy of concepts, which becomes the query
Leaf nodes of the hierarchy are combinations of text patterns
A “fuzzy calculus” is used to propagate values obtained at the leaves up through the hierarchy to obtain a single retrieval status value (or “relevance” value)
RUBRIC returns a ranked list of documents in descending order of “relevance” values

SLIDE 23 IS 240 – Spring 2011
RUBRIC Rules for Concepts & Weights
Team | event => World_Series
St._Louis_Cardinals | Milwaukee_Brewers => Team
“Cardinals” => St._Louis_Cardinals (0.7)
Cardinals_full_name => St._Louis_Cardinals (0.9)
Saint & “Louis” & “Cardinals” => Cardinals_full_name
“St.” => Saint (0.9)
“Saint” => Saint
“Brewers” => Milwaukee_Brewers (0.5)

SLIDE 24 IS 240 – Spring 2011
RUBRIC Rules for Concepts & Weights (continued)
“Milwaukee Brewers” => Milwaukee_Brewers (0.9)
“World Series” => event
Baseball_championship => event (0.9)
Baseball & Championship => Baseball_championship
“ball” => Baseball (0.5)
“baseball” => Baseball
“championship” => Championship (0.7)

SLIDE 25 IS 240 – Spring 2011
RUBRIC Combination Methods
V(V1 or V2) = MAX(V1, V2)
V(V1 and V2) = MIN(V1, V2)
i.e., classic fuzzy matching, but with the addition of rule weights:
V(level n) = C_n × V(level n−1)

SLIDE 26 IS 240 – Spring 2011 Rule Evaluation Tree [concept tree for the World_Series query, built from the rules on slides 23–24; all node values initialized to 0]

SLIDE 27 IS 240 – Spring 2011 Rule Evaluation Tree [a document containing “ball”, “baseball” & “championship” matches those leaf patterns, setting each to 1.0; all other nodes remain 0]

SLIDE 28 IS 240 – Spring 2011 Rule Evaluation Tree [leaf values propagate up: Baseball = max(0.5 × 1.0, 1.0) = 1.0; Championship = 0.7 × 1.0 = 0.7]

SLIDE 29 IS 240 – Spring 2011 Rule Evaluation Tree [Baseball_championship = min(1.0, 0.7) = 0.7]

SLIDE 30 IS 240 – Spring 2011 Rule Evaluation Tree [Event = 0.9 × 0.7 = 0.63; “World Series” is absent, so the OR takes the rule-weighted value]

SLIDE 31 IS 240 – Spring 2011 Rule Evaluation Tree [World_Series = max(0, 0.63) = 0.63 – the document’s final retrieval status value]
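A minimal Python sketch (hypothetical code, not RUBRIC itself) of the propagation traced in the trees above, using the rules from slides 23–24:

```python
# Fuzzy rule evaluation for a document containing the words
# "ball", "baseball" and "championship".

def OR(*values):                 # fuzzy OR = MAX
    return max(values)

def AND(*values):                # fuzzy AND = MIN
    return min(values)

def weighted(c_n, value):        # rule weight C_n applied moving up a level
    return c_n * value

# Leaf pattern matches for this document:
ball = baseball_word = championship_word = 1.0

# Propagate up the concept hierarchy:
Baseball = OR(weighted(0.5, ball), baseball_word)        # max(0.5, 1.0) = 1.0
Championship = weighted(0.7, championship_word)          # 0.7
Baseball_championship = AND(Baseball, Championship)      # min(1.0, 0.7) = 0.7
Event = OR(0.0, weighted(0.9, Baseball_championship))    # "World Series" absent
World_Series = OR(0.0, Event)                            # Team branch absent

print(round(World_Series, 2))    # 0.63
```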

SLIDE 32 IS 240 – Spring 2011
Today
Review:
– IR Models
– The Boolean Model
– Boolean implementation issues
– Extended Boolean Approaches
Vector Representation
Term Weights
Vector Matching

SLIDE 33 IS 240 – Spring 2011 Non-Boolean IR We need to measure some similarity between the query and the document. The basic notion is that documents that are somehow similar to a query are likely to be relevant responses for that query. We will revisit this notion and see how the Language Modelling approach to IR has taken it to a new level.

SLIDE 34 IS 240 – Spring 2011
Similarity Measures (Set-based)
Assuming that Q and D are the sets of terms associated with a query and a document:
– Simple matching (coordination level match)
– Dice’s Coefficient
– Jaccard’s Coefficient
– Cosine Coefficient
– Overlap Coefficient
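The coefficient formulas were images on the original slide; the standard set-based definitions are:

$$ \text{Simple matching} = |Q \cap D| \qquad \text{Dice} = \frac{2|Q \cap D|}{|Q| + |D|} \qquad \text{Jaccard} = \frac{|Q \cap D|}{|Q \cup D|} $$
$$ \text{Cosine} = \frac{|Q \cap D|}{\sqrt{|Q|}\,\sqrt{|D|}} \qquad \text{Overlap} = \frac{|Q \cap D|}{\min(|Q|, |D|)} $$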

SLIDE 35 IS 240 – Spring 2011
Document Vectors
Documents are represented as “bags of words”
Represented as vectors when used computationally:
– A vector is like an array of floating point numbers
– It has direction and magnitude
– Each vector holds a place for every term in the collection
– Therefore, most vectors are sparse

SLIDE 36 IS 240 – Spring 2011
Vector Space Model
Documents are represented as vectors in “term space”
– Terms are usually stems
– Documents are represented by binary or weighted vectors of terms
Queries are represented the same way as documents
Query and document weights are based on the length and direction of their vectors
A vector distance measure between the query and documents is used to rank retrieved documents

SLIDE 37 IS 240 – Spring 2011 Vector Representation Documents and queries are represented as vectors. Position 1 corresponds to term 1, position 2 to term 2, …, position t to term t. The weight of each term is stored in its position.

SLIDE 38 IS 240 – Spring 2011 Document Vectors + Frequency: “Nova” occurs 10 times in text A; “Galaxy” occurs 5 times in text A; “Heat” occurs 3 times in text A. (Blank means 0 occurrences.) [term-by-document frequency table]

SLIDE 39 IS 240 – Spring 2011 Document Vectors + Frequency: “Hollywood” occurs 7 times in text I; “Film” occurs 5 times in text I; “Diet” occurs 1 time in text I; “Fur” occurs 3 times in text I. [term-by-document frequency table]

SLIDE 40 IS 240 – Spring 2011 Document Vectors + Frequency [full term-by-document frequency table]

SLIDE 41 IS 240 – Spring 2011 We Can Plot the Vectors [plot with axes “Star” and “Diet”: a doc about astronomy, a doc about movie stars, and a doc about mammal behavior]

SLIDE 42 IS 240 – Spring 2011 Documents in 3D Space [3-D plot of document vectors] Primary assumption of the Vector Space Model: documents that are “close together” in space are similar in meaning.

SLIDE 43 IS 240 – Spring 2011 Vector Space Documents and Queries [diagram: documents D1–D11 plotted on term axes t1, t2, t3, showing Boolean term combinations; Q is a query, also represented as a vector]

SLIDE 44 IS 240 – Spring 2011 Documents in Vector Space [diagram: documents D1–D11 plotted on term axes t1, t2, t3]

SLIDE 45 IS 240 – Spring 2011 Document Space has High Dimensionality What happens beyond 2 or 3 dimensions? Similarity still has to do with how many tokens are shared in common, but the more terms there are, the harder it is to understand which subsets of words are shared among similar documents. We will look at ranking methods in detail. Approaches to handling high dimensionality: clustering and LSI (later).

SLIDE 46 IS 240 – Spring 2011
Today
Review:
– IR Models
– The Boolean Model
– Boolean implementation issues
– Extended Boolean Approaches
Vector Representation
Term Weights
Vector Matching

SLIDE 47 IS 240 – Spring 2011
Assigning Weights to Terms
Binary weights
Raw term frequency
tf*idf
– Recall the Zipf distribution
– We want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole
Automatically derived thesaurus terms

SLIDE 48 IS 240 – Spring 2011 Binary Weights Only the presence (1) or absence (0) of a term is included in the vector.

SLIDE 49 IS 240 – Spring 2011 Raw Term Weights The frequency of occurrence of the term in each document is included in the vector.

SLIDE 50 IS 240 – Spring 2011
Assigning Weights
tf*idf measure:
– Term frequency (tf)
– Inverse document frequency (idf)
A way to deal with some of the problems of the Zipf distribution
Goal: assign a tf*idf weight to each term in each document

SLIDE 51 IS 240 – Spring 2011 Simple tf*idf [formula below]
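The formula was an image on the original slide; the simple tf*idf weight, in the notation used later in this deck, is:

$$ w_{ik} = tf_{ik} \cdot \log\!\left(\frac{N}{n_k}\right) $$

where tf_ik is the frequency of term k in document i, N the number of documents in the collection, and n_k the number of documents containing term k.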

SLIDE 52 IS 240 – Spring 2011 Inverse Document Frequency IDF provides high values for rare words and low values for common words. For a collection of documents (N = 10000): [example values below]
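The example table was an image. Assuming idf_k = log10(N / n_k) (the base-10 log is an assumption here) with N = 10000, representative values are:

n_k = 1 → idf = 4.0
n_k = 100 → idf = 2.0
n_k = 10000 → idf = 0.0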

SLIDE 53 IS 240 – Spring 2011 Word Frequency vs. Resolving Power: the most frequent words are not the most descriptive. (From van Rijsbergen, 1979.) [figure: word frequency vs. resolving power curves]

SLIDE 54 IS 240 – Spring 2011
Weighting Schemes
We have seen something of:
– Binary
– Raw term weights
– tf*idf
There are many other possibilities:
– IDF alone
– Normalized term frequency
– etc.

SLIDE 55 IS 240 – Spring 2011 tf × idf Normalization Normalize the term weights so that longer documents are not unfairly given more weight. To normalize usually means to force all values to fall within a certain range, usually between 0 and 1 inclusive. [formula below]
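The normalized-weight formula was an image; the classic cosine-normalized tf*idf is:

$$ w_{ik} = \frac{tf_{ik} \cdot \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 \,[\log(N / n_k)]^2}} $$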

SLIDE 56 IS 240 – Spring 2011 Vector Space Similarity Use the weights to compare the documents.

SLIDE 57 IS 240 – Spring 2011 Vector Space Similarity Measure Combine tf × idf into a similarity measure. [measure below]
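The measure itself was an image; the usual combination is the cosine of the angle between the query and document vectors:

$$ \mathrm{sim}(Q, D_i) = \frac{\sum_{k=1}^{t} w_{qk}\, w_{ik}}{\sqrt{\sum_{k=1}^{t} w_{qk}^2}\,\sqrt{\sum_{k=1}^{t} w_{ik}^2}} $$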

SLIDE 58 IS 240 – Spring 2011
Weighting Schemes
We have seen something of:
– Binary
– Raw term weights
– tf*idf
There are many other possibilities:
– IDF alone
– Normalized term frequency

SLIDE 59 IS 240 – Spring 2011
Term Weights in SMART
SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell.
Designed for laboratory experiments in IR:
– easy to mix and match different weighting methods
– really terrible user interface
– intended for use by code hackers

SLIDE 60 IS 240 – Spring 2011 Term Weights in SMART In SMART, weights are decomposed into three factors: [decomposition below]
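The decomposition itself was an image; schematically, the SMART weight is

$$ w_{ik} = \underbrace{\mathrm{freq}_{ik}}_{\text{frequency component}} \times \underbrace{\mathrm{collect}_k}_{\text{collection component}} \times \underbrace{\mathrm{norm}_i}_{\text{normalization component}} $$

with the choices for each factor listed on the next three slides.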

SLIDE 61 IS 240 – Spring 2011 SMART Frequency Components: binary, maxnorm, augmented, log [definitions below]
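The definitions were images; the standard SMART frequency components (after Salton & Buckley, 1988) are:

$$ \text{binary: } 1 \text{ if } tf > 0 \qquad \text{maxnorm: } \frac{tf}{\max tf} \qquad \text{augmented: } 0.5 + 0.5\cdot\frac{tf}{\max tf} \qquad \text{log: } \ln(tf) + 1 $$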

SLIDE 62 IS 240 – Spring 2011 Collection Weighting in SMART: inverse, squared, probabilistic, frequency [definitions below]
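The definitions were images. The “inverse” and “probabilistic” forms are standard; the “squared” and “frequency” forms below are plausible reconstructions, not confirmed from the slide:

$$ \text{inverse: } \log\frac{N}{n_k} \qquad \text{squared: } \left[\log\frac{N}{n_k}\right]^2 \qquad \text{probabilistic: } \log\frac{N - n_k}{n_k} \qquad \text{frequency: the raw collection frequency} $$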

SLIDE 63 IS 240 – Spring 2011 Term Normalization in SMART: sum, cosine, fourth, max [definitions below]
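The definitions were images; plausible reconstructions (assumptions, not confirmed from the slide) divide each weight by:

$$ \text{sum: } \sum_k w_{ik} \qquad \text{cosine: } \sqrt{\sum_k w_{ik}^2} \qquad \text{fourth: } \sum_k w_{ik}^4 \qquad \text{max: } \max_k w_{ik} $$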

SLIDE 64 IS 240 – Spring 2011 Lucene Algorithm The open-source Lucene system is a vector-based system that differs from SMART-like systems in the ways the TF*IDF measures are normalized.

SLIDE 65 IS 240 – Spring 2011
Lucene
The basic Lucene algorithm is shown below, where:
– the query factor is the length-normalized query
– norm_{d,t} is the term normalization (the square root of the number of tokens in the same document field as t)
– overlap(q,d) is the proportion of query terms matched in the document
– boost_t is a user-specified term weight enhancement
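The scoring formula itself was an image. A reconstruction from the classic (pre-BM25) Lucene documentation, mapped onto the slide's terminology (treat this as an approximation, not the exact slide content):

$$ \mathrm{score}(q,d) = \mathrm{overlap}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}_t \cdot \frac{1}{\mathrm{norm}_{d,t}} $$

with tf(t,d) = sqrt(frequency of t in d) and idf(t) = 1 + log(N / (n_t + 1)), and queryNorm(q) the length normalization of the query.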

SLIDE 66 IS 240 – Spring 2011
How to Process a Vector Query
Assume that the database contains an inverted file like the one we discussed earlier…
– Why an inverted file?
– Why not a REAL vector file?
What information should be stored about each document/term pair?
– As we have seen, SMART gives you choices about this

SLIDE 67 IS 240 – Spring 2011 Simple Example System Collection frequency is stored in the dictionary. Raw term frequency is stored in the inverted file postings list. Formula for term ranking: [formula below]
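The ranking formula was an image. Given the stored statistics, it is presumably the inner product of tf*idf weights:

$$ \mathrm{sim}(Q, D_i) = \sum_{k=1}^{t} w_{qk} \cdot w_{ik}, \qquad w = tf \cdot \log(N / n_k) $$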

SLIDE 68 IS 240 – Spring 2011
Processing a Query
For each term in the query:
– Count the number of times the term occurs – this is the tf for the query term
– Find the term in the inverted dictionary file and get:
  n_k: the number of documents in the collection with this term
  Loc: the location of the postings list in the inverted file
– Calculate the query weight w_qk
– Retrieve the n_k entries starting at Loc in the postings file

SLIDE 69 IS 240 – Spring 2011
Processing a Query
Alternative strategies…
– First retrieve all of the dictionary entries before getting any postings information. Why?
– Just process each term in sequence. How can we tell how many results there will be?
– It is possible to put a limitation on the number of items returned. How might this be done?

SLIDE 70 IS 240 – Spring 2011
Processing a Query
Like hashed Boolean OR:
– Put each document ID from each postings list into a hash table (if it already matches, optionally increment a counter)
– If this is the first posting for the document, set its WeightSUM variable to 0
– Calculate the document weight w_ik for the current term
– Multiply the query weight and document weight and add the product to WeightSUM
Scan the hash table contents and add them to a new list, including document ID and WeightSUM
Sort by WeightSUM and present in sorted order
(a sketch of this procedure follows)
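A minimal Python sketch of this procedure, assuming in-memory dicts stand in for the dictionary and postings file (names like `postings` and `doc_freq` are illustrative, not from the lecture) and the tf · log(N/n_k) weighting from slide 67:

```python
import math
from collections import Counter, defaultdict

N = 10000                                  # documents in the collection
doc_freq = {"nova": 10, "galaxy": 50}      # n_k, from the dictionary
postings = {                               # term -> {doc_id: raw tf}
    "nova":   {1: 10, 7: 2},
    "galaxy": {1: 5, 3: 4, 7: 1},
}

def idf(term):
    return math.log(N / doc_freq[term])

def process_query(query_terms):
    scores = defaultdict(float)            # the hash table of accumulators
    for term, qtf in Counter(query_terms).items():
        if term not in doc_freq:
            continue
        w_qk = qtf * idf(term)             # query term weight
        for doc_id, dtf in postings[term].items():
            w_ik = dtf * idf(term)         # document term weight
            scores[doc_id] += w_qk * w_ik  # add partial inner product
    # scan accumulators and present in descending WeightSUM order
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

print(process_query(["nova", "galaxy"]))
```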

SLIDE 71 IS 240 – Spring 2011 Next time: More on Vector Space; Document Clustering