SIMS 202 Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000
Midterm Review (Most slides taken from earlier lectures)
Structure of an IR System [Diagram, adapted from Soergel, p. 19: a search line (interest profiles and queries, formulated in terms of descriptors and stored as profiles/search requests) and a storage line (documents and data, given descriptive and subject indexing and stored as document representations) meet in a comparison/matching step that yields potentially relevant documents; the “rules of the game” are the rules for subject indexing plus a thesaurus consisting of a lead-in vocabulary and an indexing language]
Search is an Iterative Process [Diagram: the searcher iterates between Goals, a Workspace, and Repositories]
Cognitive (Human) Aspects of Information Access and Retrieval l “Finding Out About” (FOA) –types of information needs –specifying information needs (queries) –the process of information access –search strategies –“sensemaking” l Relevance l Modeling the User
Retrieval Models l Boolean Retrieval l Ranked Retrieval l Vector Space Model l Probabilistic Models
Boolean Queries l (Cat OR Dog) AND (Collar OR Leash) –Each of the following combinations satisfies this statement: »Cat AND Collar »Cat AND Leash »Dog AND Collar »Dog AND Leash »(and any combination containing at least one of Cat/Dog plus at least one of Collar/Leash)
Boolean Queries l (Cat OR Dog) AND (Collar OR Leash) –None of the following combinations work: »Cat alone »Dog alone »Collar alone »Leash alone »Cat AND Dog (no Collar or Leash) »Collar AND Leash (no Cat or Dog)
Boolean Queries –Usually expressed as INFIX operators in IR »((a AND b) OR (c AND b)) –NOT is UNARY PREFIX operator »((a AND b) OR (c AND (NOT b))) –AND and OR can be n-ary operators »(a AND b AND c AND d) –Some rules - (De Morgan revisited) »NOT(a) AND NOT(b) = NOT(a OR b) »NOT(a) OR NOT(b)= NOT(a AND b) »NOT(NOT(a)) = a
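A minimal sketch (not from the slides) of how these infix operators behave once each term is mapped to the set of documents containing it; the postings sets and document IDs below are invented for illustration, and NOT is taken as the complement with respect to the whole collection.

```python
# Hypothetical postings: term -> set of document IDs containing it (invented data).
postings = {
    "a": {1, 2, 3},
    "b": {2, 3, 4},
    "c": {3, 5},
}
all_docs = {1, 2, 3, 4, 5}

def AND(x, y): return x & y          # set intersection
def OR(x, y):  return x | y          # set union
def NOT(x):    return all_docs - x   # complement w.r.t. the collection

# ((a AND b) OR (c AND (NOT b)))
print(OR(AND(postings["a"], postings["b"]),
         AND(postings["c"], NOT(postings["b"]))))   # {2, 3, 5}

# De Morgan: NOT(a) AND NOT(b) == NOT(a OR b)
assert AND(NOT(postings["a"]), NOT(postings["b"])) == NOT(OR(postings["a"], postings["b"]))
```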
Boolean Searching l Information need: “Measurement of the width of cracks in prestressed concrete beams” l Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete l Relaxed Query (any three of the four concepts Cracks, Beams, Width measurement, Prestressed concrete): (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
Boolean Logic [Venn diagram: three terms t1, t2, t3 divide documents D1–D11 into eight regions m1–m8, each region (minterm) being a conjunction of the three terms and/or their negations]
Precedence Ordering l In what order do we evaluate the components of the Boolean expression? –Parentheses are evaluated first »(a or b) and (c or d) »(a or (b and c) or d) –Usually start from the left and work right (in case of ties) –Usually (if there are no parentheses) »NOT before AND »AND before OR
Boolean Problems l Disjunctive (OR) queries lead to information overload l Conjunctive (AND) queries lead to reduced, and commonly zero, result sets l Conjunctive queries imply a reduction in Recall
Vector Space Model l Documents are represented as vectors in term space –Terms are usually stems –Documents represented by binary vectors of terms l Queries represented the same as documents l Query and Document weights are based on length and direction of their vector l A vector distance measure between the query and documents is used to rank retrieved documents
Vector Space with Term Weights and Cosine Matching [Diagram: query Q and documents D1, D2 plotted as vectors in the space of Term A and Term B] D_i = (d_i1, w_di1; d_i2, w_di2; …; d_it, w_dit); Q = (q_i1, w_qi1; q_i2, w_qi2; …; q_it, w_qit); example weights: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
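A small sketch of the cosine match on the example weights above, assuming the standard cosine formula (the slide itself only shows the picture); D2 ends up ranked above D1 because its vector points in nearly the same direction as Q.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D1), 2))   # ~0.73
print(round(cosine(Q, D2), 2))   # ~0.98, so D2 is ranked first
```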
Assigning Weights to Terms l Binary Weights l Raw term frequency l tf x idf –Recall the Zipf distribution –Want to weight terms highly if they are »frequent in relevant documents … BUT »infrequent in the collection as a whole l Automatically derived thesaurus terms
Binary Weights l Only the presence (1) or absence (0) of a term is included in the vector
Raw Term Weights l The frequency of occurrence for the term in each document is included in the vector
Assigning Weights l tf x idf measure: –term frequency (tf) –inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution l Goal: assign a tf * idf weight to each term in each document
tf x idf
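The formula itself is not shown here; one standard formulation of the tf x idf weight, with tf_ik the frequency of term k in document i, N the number of documents in the collection, and n_k the number of documents containing term k, is:

```latex
w_{ik} = \mathrm{tf}_{ik} \cdot \log\!\left(\frac{N}{n_k}\right)
```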
Inverse Document Frequency l IDF provides high values for rare words and low values for common words
tf x idf normalization l Normalize the term weights (so longer documents are not unfairly given more weight) –normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive.
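A minimal sketch of tf x idf weighting with length normalization, assuming the log-based idf above; the three toy documents are invented.

```python
import math
from collections import Counter

docs = ["new home sales top forecasts",          # invented toy documents
        "home sales rise in july",
        "increase in home sales in july"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency n_k: how many documents contain term k.
df = Counter(term for toks in tokenized for term in set(toks))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    weights = {t: tf[t] * math.log(N / df[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: round(w / norm, 3) for t, w in weights.items()}   # unit-length vector

for toks in tokenized:
    print(tfidf_vector(toks))
```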
Vector space similarity (use the weights to compare the documents)
Vector Space Similarity Measure combine tf x idf into a similarity measure
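The measure itself is not reproduced here; the usual choice is the cosine of the angle between the query and document vectors, computed over their tf x idf weights w_qk and w_ik:

```latex
\mathrm{sim}(Q, D_i) =
  \frac{\sum_{k=1}^{t} w_{qk}\, w_{ik}}
       {\sqrt{\sum_{k=1}^{t} w_{qk}^{2}}\;\sqrt{\sum_{k=1}^{t} w_{ik}^{2}}}
```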
Relevance Feedback –aka query modification –aka “more like this”
Query Modification l Problem: how to reformulate the query? –Thesaurus expansion: »Suggest terms similar to query terms –Relevance feedback: »Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant
Relevance Feedback l Main Idea: –Modify existing query based on relevance judgements »Extract terms from relevant documents and add them to the query »and/or re-weight the terms already in the query –Two main approaches: »Automatic (pseudo-relevance feedback) »Users select relevant documents –Users/system select terms from an automatically-generated list
Relevance Feedback l Usually do both: –expand query with new terms –re-weight terms in query l There are many variations –usually positive weights for terms from relevant docs –sometimes negative weights for terms from non-relevant docs
Rocchio Method
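The formula is not reproduced on this slide; the standard Rocchio reformulation, with Q_0 the original query vector, R the set of documents judged relevant, S the set judged non-relevant, and alpha, beta, gamma tuning weights, is:

```latex
Q' = \alpha\, Q_{0}
   + \frac{\beta}{|R|} \sum_{D_i \in R} D_i
   - \frac{\gamma}{|S|} \sum_{D_j \in S} D_j
```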
l Rocchio automatically –re-weights terms –adds in new terms (from relevant docs) »have to be careful when using negative terms »Rocchio is not a machine learning algorithm l Most methods perform similarly –results heavily dependent on test collection l Machine learning methods are proving to work better than standard IR approaches like Rocchio
Using Relevance Feedback l Known to improve results –in TREC-like conditions (no user involved) l What about with a user in the loop? –How might you measure this? –Let’s examine a user study of relevance feedback by Koenemann & Belkin 1996.
Content Analysis l Automated transformation of raw text into a form that represents some aspect(s) of its meaning l Including, but not limited to: –Automated Thesaurus Generation –Phrase Detection –Categorization –Clustering –Summarization
Techniques for Content Analysis l Statistical –Single Document –Full Collection l Linguistic –Syntactic –Semantic –Pragmatic l Knowledge-Based (Artificial Intelligence) l Hybrid (Combinations)
Text Processing l Standard Steps: –Recognize document structure »titles, sections, paragraphs, etc. –Break into tokens »usually space and punctuation delineated »special issues with Asian languages –Stemming/morphological analysis –Store in inverted index (to be discussed later)
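A minimal sketch of the tokenizing and stemming steps; the regular-expression tokenizer and the deliberately naive suffix-stripping stemmer are stand-ins for illustration, not the course's recommended tools.

```python
import re

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit
    (works only for space-delimited languages, per the caveat above)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(token):
    """Very rough stand-in for morphological analysis: strip common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "Measurement of the width of cracks in prestressed concrete beams"
print([naive_stem(t) for t in tokenize(text)])
```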
Document Processing Steps [Figure from Baeza-Yates & Ribeiro-Neto]
Statistical Properties of Text l Token occurrences in text are not uniformly distributed l They are also not normally distributed l They do exhibit a Zipf distribution
Plotting Word Frequency by Rank l Main idea: count –How many tokens occur 1 time –How many tokens occur 2 times –How many tokens occur 3 times … l Now rank these according to how often they occur. This is called the rank.
Plotting Word Frequency by Rank l Say for a text with 100 tokens l Count –How many tokens occur 1 time (50) –How many tokens occur 2 times (20) … –How many tokens occur 7 times (10) … –How many tokens occur 12 times (1) –How many tokens occur 14 times (1) l So things that occur the most times have the highest rank (rank 1). l Things that occur the fewest times have the lowest rank (rank n).
Observation: MANY phenomena can be characterized this way. l Words in a text collection l Library book checkout patterns l Incoming Web Page Requests (Nielsen) l Outgoing Web Page Requests (Cunha & Crovella) l Document Size on Web (Cunha & Crovella)
Zipf Distribution (linear and log scale) [Illustration by Jacob Nielsen]
Zipf Distribution l The product of the frequency of words (f) and their rank (r) is approximately constant –Rank = order of words’ frequency of occurrence l Another way to state this is with an approximately correct rule of thumb: –Say the most common term occurs C times –The second most common occurs C/2 times –The third most common occurs C/3 times –…
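A tiny sketch of the rule of thumb: predicted frequencies C/r keep the product frequency x rank roughly constant (the value of C is invented; real corpus counts would only follow this approximately).

```python
C = 12000                      # invented frequency of the most common term
for rank in range(1, 6):
    freq = C / rank            # Zipf rule of thumb: frequency ~ C / rank
    print(rank, int(freq), int(freq * rank))   # freq * rank stays ~ C
```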
Zipf Distribution l The Important Points: –a few elements occur very frequently –a medium number of elements have medium frequency –many elements occur very infrequently
Word Frequency vs. Resolving Power (from van Rijsbergen 79) The most frequent words are not the most descriptive.
Consequences of Zipf l There are always a few very frequent tokens that are not good discriminators. –Called “stop words” in IR –Usually correspond to linguistic notion of “closed-class” words »English examples: to, from, on, and, the,... »Grammatical classes that don’t take on new members. l There are always a large number of tokens that occur only once and can mess up algorithms. l Medium-frequency words are the most descriptive
Inverted Indexes We have seen “Vector files” conceptually. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows
How Are Inverted Files Created l Documents are parsed to extract tokens. These are saved with the Document ID. »Doc 1: “Now is the time for all good men to come to the aid of their country” »Doc 2: “It was a dark and stormy night in the country manor. The time was past midnight”
How Inverted Files are Created l After all documents have been parsed the inverted file is sorted alphabetically.
How Inverted Files are Created l Multiple term entries for a single document are merged. l Within-document term frequency information is compiled.
How Inverted Files are Created l Then the file can be split into –A Dictionary file and –A Postings file
How Inverted Files are Created [Figure: the resulting Dictionary file (terms with document counts) and Postings file (document IDs with term frequencies)]
Inverted indexes l Permit fast search for individual terms l For each term, you get a list consisting of: –document ID –frequency of term in doc (optional) –position of term in doc (optional) l These lists can be used to solve Boolean queries: »country -> d1, d2 »manor -> d2 »country AND manor -> d2 l Also used for statistical ranking algorithms
How Inverted Files are Used [Figure: Dictionary and Postings files] Query on “time” AND “dark”: 2 docs with “time” in dictionary -> IDs 1 and 2 from the postings file; 1 doc with “dark” in dictionary -> ID 2 from the postings file. Therefore, only doc 2 satisfies the query.
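A minimal sketch of the whole pipeline, built from the two example documents: parse into tokens with document IDs, merge duplicate term entries while recording within-document frequencies, derive the dictionary, and intersect postings to answer the “time” AND “dark” query.

```python
from collections import Counter, defaultdict

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

# Postings: term -> {doc_id: within-document term frequency}
postings = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        postings[term][doc_id] = tf

# Dictionary: term -> number of documents containing it
dictionary = {term: len(entry) for term, entry in sorted(postings.items())}

print(dictionary["time"], sorted(postings["time"]))   # 2 [1, 2]
print(dictionary["dark"], sorted(postings["dark"]))   # 1 [2]

# Boolean AND = intersection of the postings lists
print(set(postings["time"]) & set(postings["dark"]))  # {2}
```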
Document Vectors l Documents are represented as “bags of words” l Represented as vectors when used computationally –A vector is like an array of floating point numbers –Has direction and magnitude –Each vector holds a place for every term in the collection –Therefore, most vectors are sparse
Document Vectors [Table: documents A–I represented as term-weight vectors over the terms nova, galaxy, heat, h’wood, film, role, diet, fur; document ids label the rows]
Text Clustering [Scatter plot: documents plotted by Term 1 and Term 2 weights, forming visible groups] Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw
Document/Document Matrix
Agglomerative Clustering [Figure sequence: documents A–I merged bottom-up into a dendrogram, joining the closest pair of clusters at each step]
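A minimal sketch of bottom-up clustering using SciPy's hierarchical clustering routines; the 2-D coordinates standing in for documents A–I and the choice of average-link merging are illustrative assumptions, not taken from the lecture.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented 2-D positions standing in for documents A..I
points = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.30],   # one group
                   [0.80, 0.90], [0.90, 0.80], [0.85, 0.95],   # another group
                   [0.50, 0.50], [0.55, 0.45], [0.45, 0.55]])  # a third group

Z = linkage(points, method="average")             # repeatedly merge the closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)
```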
Evaluation l Why Evaluate? l What to Evaluate? l How to Evaluate?
Why Evaluate? l Determine if the system is desirable l Make comparative assessments l Others?
What to Evaluate? l How much of the information need is satisfied. l How much was learned about a topic. l Incidental learning: –How much was learned about the collection. –How much was learned about other topics. l How inviting the system is.
Relevance l In what ways can a document be relevant to a query? –Answer precise question precisely. –Partially answer question. –Suggest a source for more information. –Give background information. –Remind the user of other knowledge. –Others...
Relevance l How relevant is the document –for this user for this information need. l Subjective, but l Measurable to some extent –How often do people agree a document is relevant to a query l How well does it answer the question? –Complete answer? Partial? –Background Information? –Hints for further exploration?
What to Evaluate? What can be measured that reflects users’ ability to use the system? (Cleverdon 66) –Coverage of Information –Form of Presentation –Effort required/Ease of Use –Time and Space Efficiency –Recall »proportion of relevant material actually retrieved –Precision »proportion of retrieved material actually relevant (Recall and Precision together measure retrieval effectiveness)
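A small sketch of the two effectiveness measures computed on invented data; the relevance judgements and the retrieved set are made up.

```python
relevant  = {1, 3, 5, 7, 9}     # invented relevance judgements
retrieved = {1, 2, 3, 4, 5}     # invented system output

hits = relevant & retrieved
precision = len(hits) / len(retrieved)   # proportion of retrieved that is relevant
recall    = len(hits) / len(relevant)    # proportion of relevant that is retrieved
print(precision, recall)                 # 0.6 0.6
```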
Relevant vs. Retrieved [Venn diagram: the Relevant set and the Retrieved set overlap within All docs]
Precision vs. Recall [Venn diagram as above: precision = |relevant AND retrieved| / |retrieved|; recall = |relevant AND retrieved| / |relevant|]
Why Precision and Recall? Get as much good stuff while at the same time getting as little junk as possible.
Retrieved vs. Relevant Documents [Diagram: a small retrieved set lying entirely inside the Relevant set: very high precision, very low recall]
Precision/Recall Curves l There is a tradeoff between Precision and Recall l So measure Precision at different levels of Recall l Note: this is an AVERAGE over MANY queries [Plot: precision (y-axis) against recall (x-axis) with measured points along a curve]
Precision/Recall Curves l Difficult to determine which of these two hypothetical results is better: [Plot: two precision/recall curves]
Precision/Recall Curves
Document Cutoff Levels l Another way to evaluate: –Fix the number of documents retrieved at several levels: »top 5 »top 10 »top 20 »top 50 »top 100 »top 500 –Measure precision at each of these levels –Take (weighted) average over results l This is a way to focus on how well the system ranks the first k documents.
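A small sketch of precision measured at fixed cutoff levels over one invented ranking; averaging these values over many queries gives the cutoff-level evaluation described above.

```python
ranking  = [4, 1, 8, 3, 11, 5, 2, 9, 6, 7]   # invented ranked document IDs
relevant = {1, 3, 5, 7, 9}                   # invented relevance judgements

for k in (5, 10):
    p_at_k = sum(1 for d in ranking[:k] if d in relevant) / k
    print(f"precision@{k} = {p_at_k}")       # 0.4 at k=5, 0.5 at k=10
```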
Problems with Precision/Recall l Can’t know true recall value –except in small collections l Precision/Recall are related –A combined measure sometimes more appropriate l Assumes batch mode –Interactive IR is more important –We will touch on this in the UI section l Assumes a strict rank ordering matters.