SIMS 202 Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000

Midterm Review (Most slides taken from earlier lectures)

Structure of an IR System (adapted from Soergel, p. 19)
[Diagram of an information storage and retrieval system with two parallel paths: a search line (interest profiles & queries -> formulating the query in terms of descriptors -> storage of profiles -> Store 1: profiles/search requests) and a storage line (documents & data -> indexing, descriptive and subject -> storage of documents -> Store 2: document representations). The two stores meet in a comparison/matching step that yields potentially relevant documents. The "rules of the game" are the rules for subject indexing plus a thesaurus, which consists of a lead-in vocabulary and an indexing language.]

Search is an Iterative Process
[Diagram: the searcher moves iteratively between goals, a workspace, and repositories.]

Cognitive (Human) Aspects of Information Access and Retrieval
- "Finding Out About" (FOA)
  - types of information needs
  - specifying information needs (queries)
  - the process of information access
  - search strategies
  - "sensemaking"
- Relevance
- Modeling the User

Retrieval Models
- Boolean Retrieval
- Ranked Retrieval
- Vector Space Model
- Probabilistic Models

Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
  - Each of the following combinations satisfies this statement:
  [Table of term combinations (rows: Cat, Dog, Collar, Leash; x marks term presence); every satisfying combination contains at least one of Cat/Dog and at least one of Collar/Leash.]

Boolean Queries
- (Cat OR Dog) AND (Collar OR Leash)
  - None of the following combinations work:
  [Table of failing combinations; each is missing a term from one of the two OR groups (no Cat/Dog, or no Collar/Leash).]

Boolean Queries
- Usually expressed as INFIX operators in IR
  - ((a AND b) OR (c AND b))
- NOT is a UNARY PREFIX operator
  - ((a AND b) OR (c AND (NOT b)))
- AND and OR can be n-ary operators
  - (a AND b AND c AND d)
- Some rules (De Morgan revisited)
  - NOT(a) AND NOT(b) = NOT(a OR b)
  - NOT(a) OR NOT(b) = NOT(a AND b)
  - NOT(NOT(a)) = a
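A minimal sketch, not from the original slides, of how such infix Boolean queries can be evaluated over documents represented as term sets; the documents and helper names are invented for illustration, and the final assertion checks one of the De Morgan rules above.

```python
# Minimal sketch: evaluating Boolean queries over documents represented as term sets.
# The documents and helper names here are illustrative, not from the lecture.

docs = {
    1: {"cat", "collar", "bell"},
    2: {"dog", "leash", "park"},
    3: {"cat", "dog", "fish"},
    4: {"collar", "leash"},
}

def matches(terms, query):
    """query is a nested tuple: ('AND', q1, q2, ...), ('OR', ...), ('NOT', q), or a bare term."""
    if isinstance(query, str):
        return query in terms
    op, *args = query
    if op == "AND":
        return all(matches(terms, q) for q in args)
    if op == "OR":
        return any(matches(terms, q) for q in args)
    if op == "NOT":
        return not matches(terms, args[0])
    raise ValueError(op)

# (cat OR dog) AND (collar OR leash)
q = ("AND", ("OR", "cat", "dog"), ("OR", "collar", "leash"))
print([d for d, t in docs.items() if matches(t, q)])   # -> [1, 2]

# De Morgan: NOT(a) AND NOT(b) == NOT(a OR b) for every document
q1 = ("AND", ("NOT", "cat"), ("NOT", "dog"))
q2 = ("NOT", ("OR", "cat", "dog"))
assert all(matches(t, q1) == matches(t, q2) for t in docs.values())
```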

Boolean Searching
- Information need: "Measurement of the width of cracks in prestressed concrete beams"
- Concepts: Cracks, Beams, Width measurement, Prestressed concrete
- Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
- Relaxed Query (any three of the four concepts): (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)

Boolean Logic
[Venn diagram over three index terms t1, t2, t3: documents D1-D11 fall into the eight regions m1-m8, each region (minterm) corresponding to one combination of t1, t2, t3 and their negations.]

Precedence Ordering
- In what order do we evaluate the components of a Boolean expression?
  - Parentheses are evaluated first
    - (a or b) and (c or d)
    - (a or (b and c) or d)
  - Usually start from the left and work right (in case of ties)
  - Usually (if there are no parentheses)
    - NOT before AND
    - AND before OR

Boolean Problems
- Disjunctive (OR) queries lead to information overload
- Conjunctive (AND) queries lead to reduced, and commonly zero, result sets
- Conjunctive queries imply a reduction in Recall

Vector Space Model
- Documents are represented as vectors in term space
  - Terms are usually stems
  - Documents represented by binary vectors of terms
- Queries represented the same as documents
- Query and document weights are based on the length and direction of their vectors
- A vector distance measure between the query and documents is used to rank retrieved documents

Vector Space with Term Weights and Cosine Matching
[2-D illustration: query Q and documents D1, D2 plotted against axes Term A and Term B; the document whose vector makes the smaller angle with Q matches best.]
- D_i = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit)
- Q = (q_i1, w_qi1; q_i2, w_qi2; ...; q_it, w_qit)
- Example: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
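As a sanity check on the picture above, a small sketch (not part of the lecture) computes the cosine of the angle between Q and each document using the example weights; D2, which points in nearly the same direction as Q, scores higher.

```python
# Cosine matching for the slide's 2-D example: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7).
import math

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D1), 3))  # ~0.733
print(round(cosine(Q, D2), 3))  # ~0.983, so D2 is ranked above D1
```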

Assigning Weights to Terms
- Binary weights
- Raw term frequency
- tf x idf
  - Recall the Zipf distribution
  - Want to weight terms highly if they are
    - frequent in relevant documents ... BUT
    - infrequent in the collection as a whole
- Automatically derived thesaurus terms

Binary Weights
- Only the presence (1) or absence (0) of a term is included in the vector

Raw Term Weights
- The frequency of occurrence for the term in each document is included in the vector

Assigning Weights
- tf x idf measure:
  - term frequency (tf)
  - inverse document frequency (idf): a way to deal with the problems of the Zipf distribution
- Goal: assign a tf x idf weight to each term in each document

tf x idf
- w_ik = tf_ik x log(N / n_k)
  - tf_ik = frequency of term t_k in document D_i
  - N = total number of documents in the collection
  - n_k = number of documents that contain term t_k

Inverse Document Frequency
- IDF provides high values for rare words and low values for common words

tf x idf normalization
- Normalize the term weights (so longer documents are not unfairly given more weight)
  - "Normalize" usually means forcing all values to fall within a certain range, usually between 0 and 1 inclusive

Vector space similarity (use the weights to compare the documents)

Vector Space Similarity Measure
- Combine the tf x idf weights into a similarity measure between the query vector and each document vector (e.g., inner product or cosine)
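A compact sketch that puts the last few slides together: raw term frequencies, idf = log(N / n_k), tf x idf weights normalized to unit length, and cosine similarity between a query and each document. The three tiny documents and the query are invented for illustration.

```python
# Sketch: tf x idf weighting with cosine-normalized vectors and cosine ranking.
# The three tiny "documents" and the query are invented for illustration.
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "dogs and cats make good pets".split(),
}
N = len(docs)
vocab = sorted({t for toks in docs.values() for t in toks})

# document frequency n_k: number of documents containing term k
df = {t: sum(t in toks for toks in docs.values()) for t in vocab}
idf = {t: math.log(N / df[t]) for t in vocab}

def weights(tokens):
    tf = Counter(tokens)
    w = {t: tf[t] * idf.get(t, 0.0) for t in tf}      # tf x idf
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}        # unit-length vector

doc_vecs = {d: weights(toks) for d, toks in docs.items()}
q_vec = weights("cat mat".split())

def cosine(a, b):
    return sum(a[t] * b.get(t, 0.0) for t in a)       # both vectors already normalized

for d, v in sorted(doc_vecs.items(), key=lambda kv: -cosine(q_vec, kv[1])):
    print(d, round(cosine(q_vec, v), 3))              # d1 ranks first for this query
```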

Relevance Feedback
- aka query modification
- aka "more like this"

Query Modification
- Problem: how to reformulate the query?
  - Thesaurus expansion: suggest terms similar to query terms
  - Relevance feedback: suggest terms (and documents) similar to retrieved documents that have been judged to be relevant

Relevance Feedback
- Main idea: modify the existing query based on relevance judgements
  - Extract terms from relevant documents and add them to the query
  - and/or re-weight the terms already in the query
- Two main approaches:
  - Automatic (pseudo-relevance feedback)
  - Users select relevant documents
    - Users/system select terms from an automatically generated list

Relevance Feedback
- Usually do both:
  - expand the query with new terms
  - re-weight terms in the query
- There are many variations
  - usually positive weights for terms from relevant docs
  - sometimes negative weights for terms from non-relevant docs

Rocchio Method

- Rocchio automatically
  - re-weights terms
  - adds in new terms (from relevant docs)
    - have to be careful when using negative terms
    - Rocchio is not a machine learning algorithm
- Most methods perform similarly
  - results heavily dependent on the test collection
- Machine learning methods are proving to work better than standard IR approaches like Rocchio
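The standard Rocchio formulation moves the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant ones: Q' = alpha*Q + (beta/|R|)*sum(R) - (gamma/|S|)*sum(S). A minimal sketch, with invented document vectors and one commonly used choice of alpha, beta, gamma:

```python
# Sketch of Rocchio query modification: move the query toward relevant document
# vectors and away from non-relevant ones. Vectors and parameter values are illustrative.
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query) | {t for d in relevant + nonrelevant for t in d}
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_q[t] = max(w, 0.0)   # negative weights are usually clipped to zero
    return new_q

q = {"crack": 1.0, "beam": 1.0}
rel = [{"crack": 0.8, "beam": 0.5, "concrete": 0.9}]
nonrel = [{"beam": 0.2, "laser": 0.9}]
print(rocchio(q, rel, nonrel))
# 'concrete' enters the query with positive weight; 'laser' is clipped to 0.0
```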

Using Relevance Feedback
- Known to improve results
  - in TREC-like conditions (no user involved)
- What about with a user in the loop?
  - How might you measure this?
  - Let's examine a user study of relevance feedback by Koenemann & Belkin, 1996.

Content Analysis
- Automated transformation of raw text into a form that represents some aspect(s) of its meaning
- Including, but not limited to:
  - Automated Thesaurus Generation
  - Phrase Detection
  - Categorization
  - Clustering
  - Summarization

Techniques for Content Analysis
- Statistical
  - Single Document
  - Full Collection
- Linguistic
  - Syntactic
  - Semantic
  - Pragmatic
- Knowledge-Based (Artificial Intelligence)
- Hybrid (Combinations)

Text Processing
- Standard steps:
  - Recognize document structure (titles, sections, paragraphs, etc.)
  - Break into tokens
    - usually space and punctuation delineated
    - special issues with Asian languages
  - Stemming/morphological analysis
  - Store in an inverted index (to be discussed later)
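A toy sketch of the tokenization and stemming steps; the regular expression and the crude suffix-stripping rules are illustrative stand-ins for a real morphological analyzer such as the Porter stemmer.

```python
# Toy text-processing pipeline: tokenize, lowercase, crude suffix stripping.
# The suffix rules are a simplistic stand-in for a real stemmer (e.g. Porter).
import re

def tokenize(text):
    return re.findall(r"[A-Za-z]+", text.lower())

def crude_stem(token):
    for suffix in ("ing", "ies", "es", "s", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "Measurement of the width of cracks in prestressed concrete beams"
print([crude_stem(t) for t in tokenize(text)])
# ['measurement', 'of', 'the', 'width', 'of', 'crack', 'in', 'prestress', 'concrete', 'beam']
```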

Document Processing Steps
[Figure from Baeza-Yates & Ribeiro-Neto showing the steps of document processing.]

Statistical Properties of Text
- Token occurrences in text are not uniformly distributed
- They are also not normally distributed
- They do exhibit a Zipf distribution

Plotting Word Frequency by Rank
- Main idea: count
  - how many tokens occur 1 time
  - how many tokens occur 2 times
  - how many tokens occur 3 times ...
- Now rank these according to how often they occur. This is called the rank.

Plotting Word Frequency by Rank
- Say for a text with 100 tokens, count:
  - how many tokens occur 1 time (50)
  - how many tokens occur 2 times (20) ...
  - how many tokens occur 7 times (10) ...
  - how many tokens occur 12 times (1)
  - how many tokens occur 14 times (1)
- So things that occur the most times have the highest rank (rank 1).
- Things that occur the fewest times have the lowest rank (rank n).

Observation: MANY phenomena can be characterized this way
- Words in a text collection
- Library book checkout patterns
- Incoming Web Page Requests (Nielsen)
- Outgoing Web Page Requests (Cunha & Crovella)
- Document Size on Web (Cunha & Crovella)

Zipf Distribution (linear and log scale)
[Illustration by Jacob Nielsen.]

Zipf Distribution
- The product of the frequency of words (f) and their rank (r) is approximately constant
  - Rank = order of words' frequency of occurrence
- Another way to state this is with an approximately correct rule of thumb:
  - Say the most common term occurs C times
  - The second most common occurs C/2 times
  - The third most common occurs C/3 times
  - ...
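A quick sketch that checks the rule of thumb by printing rank x frequency for the top-ranked words; the short sample string is invented, and a convincing check needs a much larger text.

```python
# Sketch: check Zipf's rule of thumb (rank * frequency roughly constant).
# zipf_table() works on any string; for a convincing check, pass in a large
# text (e.g. the contents of a book) rather than the short sample below.
import re
from collections import Counter

def zipf_table(text, top=10):
    tokens = re.findall(r"[a-z]+", text.lower())
    for rank, (word, freq) in enumerate(Counter(tokens).most_common(top), start=1):
        print(f"{rank:2d}  {word:12s} f={freq:4d}  r*f={rank * freq}")

sample = ("the cat sat on the mat and the dog sat on the rug while the cat "
          "watched the dog and the dog watched the cat")
zipf_table(sample, top=5)
```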

Zipf Distribution
- The important points:
  - a few elements occur very frequently
  - a medium number of elements have medium frequency
  - many elements occur very infrequently

Word Frequency vs. Resolving Power (from van Rijsbergen 79) The most frequent words are not the most descriptive.

Consequences of Zipf
- There are always a few very frequent tokens that are not good discriminators.
  - Called "stop words" in IR
  - Usually correspond to the linguistic notion of "closed-class" words
    - English examples: to, from, on, and, the, ...
    - Grammatical classes that don't take on new members
- There are always a large number of tokens that occur almost once and can mess up algorithms.
- Medium-frequency words are the most descriptive.

Inverted Indexes
- We have seen "vector files" conceptually. An inverted file is a vector file "inverted" so that rows become columns and columns become rows.

How Are Inverted Files Created
- Documents are parsed to extract tokens. These are saved with the Document ID.
  - Doc 1: "Now is the time for all good men to come to the aid of their country"
  - Doc 2: "It was a dark and stormy night in the country manor. The time was past midnight"

How Inverted Files are Created
- After all documents have been parsed, the inverted file is sorted alphabetically.

How Inverted Files are Created
- Multiple term entries for a single document are merged.
- Within-document term frequency information is compiled.

How Inverted Files are Created
- Then the file can be split into
  - a Dictionary file, and
  - a Postings file

How Inverted Files are Created
[The resulting Dictionary and Postings files for the two example documents.]

Inverted indexes
- Permit fast search for individual terms
- For each term, you get a list consisting of:
  - document ID
  - frequency of term in doc (optional)
  - position of term in doc (optional)
- These lists can be used to solve Boolean queries:
  - country -> d1, d2
  - manor -> d2
  - country AND manor -> d2
- Also used for statistical ranking algorithms

How Inverted Files are Used
[Dictionary and Postings tables from the running example.]
- Query on "time" AND "dark":
  - 2 docs with "time" in the dictionary -> IDs 1 and 2 from the postings file
  - 1 doc with "dark" in the dictionary -> ID 2 from the postings file
  - Therefore, only doc 2 satisfies the query.
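A small sketch that builds a dictionary/postings structure from the two example documents and answers the conjunctive queries above by intersecting postings lists; tokenization is simplified (lowercasing only, no stemming or stop-word removal).

```python
# Sketch: build an inverted index from the two example documents and answer
# AND queries by intersecting postings lists. Tokenization is simplified.
import re
from collections import defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

index = defaultdict(dict)            # term -> {doc_id: within-doc frequency}
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z]+", text.lower()):
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

def boolean_and(*terms):
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(boolean_and("time", "dark"))        # [2]
print(boolean_and("country", "manor"))    # [2]
print(index["time"])                      # {1: 1, 2: 1}
```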

Document Vectors
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
  - A vector is like an array of floating point numbers
  - Has direction and magnitude
  - Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse

Document Vectors
[Example term-document matrix: document ids A-I as rows and terms (nova, galaxy, heat, h'wood, film, role, diet, fur) as columns; each cell holds the term's weight in that document.]

Text Clustering
[Scatter plot of documents in a space of two terms (Term 1, Term 2), showing groups of nearby documents.]
- Clustering is "The art of finding groups in data." (Kaufmann and Rousseeuw)

Document/Document Matrix

Agglomerative Clustering
[Successive frames of a dendrogram over documents A-I: the closest pair of documents or clusters is merged at each step until a single hierarchy remains.]
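A bare-bones sketch of single-link agglomerative clustering on a few invented 2-D document vectors; a real system would cluster full term-weight vectors, typically with a cosine-based distance.

```python
# Sketch: single-link agglomerative clustering of a few 2-D "document" points.
# Points and names are invented; distance is plain Euclidean for readability.
import math

points = {"A": (0.1, 0.2), "B": (0.15, 0.25), "C": (0.8, 0.9),
          "D": (0.85, 0.8), "E": (0.5, 0.1)}

def dist(p, q):
    return math.dist(points[p], points[q])

clusters = [{name} for name in points]           # start: every doc is its own cluster
while len(clusters) > 2:
    # find the two clusters whose closest members are nearest (single link)
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: min(dist(p, q) for p in clusters[ij[0]] for q in clusters[ij[1]]),
    )
    clusters[i] |= clusters[j]
    del clusters[j]
    print(clusters)
# Ends with two clusters, roughly {A, B, E} and {C, D}.
```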

Evaluation
- Why evaluate?
- What to evaluate?
- How to evaluate?

Why Evaluate?
- Determine if the system is desirable
- Make comparative assessments
- Others?

What to Evaluate?
- How much of the information need is satisfied
- How much was learned about a topic
- Incidental learning:
  - How much was learned about the collection
  - How much was learned about other topics
- How inviting the system is

Relevance
- In what ways can a document be relevant to a query?
  - Answer a precise question precisely
  - Partially answer the question
  - Suggest a source for more information
  - Give background information
  - Remind the user of other knowledge
  - Others...

Relevance
- How relevant is the document
  - for this user, for this information need
- Subjective, but
- Measurable to some extent
  - How often do people agree that a document is relevant to a query?
- How well does it answer the question?
  - Complete answer? Partial?
  - Background information?
  - Hints for further exploration?

What to Evaluate?
- What can be measured that reflects users' ability to use the system? (Cleverdon 66)
  - Coverage of information
  - Form of presentation
  - Effort required / ease of use
  - Time and space efficiency
  - Recall: proportion of relevant material actually retrieved
  - Precision: proportion of retrieved material actually relevant
  - Recall and precision together are referred to as effectiveness

Relevant vs. Retrieved
[Venn diagram: within the set of all docs, the retrieved set and the relevant set overlap.]

Precision vs. Recall
[Same Venn diagram, annotated: Precision = |relevant AND retrieved| / |retrieved|; Recall = |relevant AND retrieved| / |relevant|.]

Why Precision and Recall?
- Get as much good stuff as possible while at the same time getting as little junk as possible.
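A small sketch, with invented document IDs and relevance judgements, computing precision and recall for one query from the set definitions above:

```python
# Sketch: precision and recall for one query, given (invented) relevance
# judgements and a retrieved set.
relevant  = {"d1", "d3", "d5", "d7", "d9"}      # judged relevant to the query
retrieved = {"d1", "d2", "d3", "d4"}            # what the system returned

hits = relevant & retrieved                     # relevant AND retrieved
precision = len(hits) / len(retrieved)          # 2/4 = 0.50
recall    = len(hits) / len(relevant)           # 2/5 = 0.40
print(f"precision={precision:.2f} recall={recall:.2f}")
```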

Retrieved vs. Relevant Documents
[Venn diagram: a small retrieved set lying entirely within the relevant set gives very high precision, very low recall.]

Precision/Recall Curves
- There is a tradeoff between Precision and Recall
- So measure Precision at different levels of Recall
- Note: this is an AVERAGE over MANY queries
[Plot: precision on the y-axis vs. recall on the x-axis, with measured points marked x.]

Precision/Recall Curves
- Difficult to determine which of these two hypothetical results is better:
[Plot: two hypothetical precision/recall curves.]

Precision/Recall Curves

Document Cutoff Levels
- Another way to evaluate:
  - Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100, top 500
  - Measure precision at each of these levels
  - Take a (weighted) average over results
- This is a way to focus on how well the system ranks the first k documents.
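A sketch of precision at fixed cutoff levels for a single ranked result list; the ranking and the relevance judgements are invented.

```python
# Sketch: precision at fixed document cutoff levels for one ranked result list.
# The ranking and the relevance judgements are invented for illustration.
ranked   = ["d3", "d9", "d2", "d1", "d8", "d5", "d4", "d7", "d6", "d10"]
relevant = {"d1", "d3", "d5", "d7", "d9"}

for k in (5, 10):
    top_k = ranked[:k]
    p_at_k = sum(d in relevant for d in top_k) / k
    print(f"P@{k} = {p_at_k:.2f}")
# P@5 = 0.60 (d3, d9, d1 relevant in the top 5); P@10 = 0.50
```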

Problems with Precision/Recall
- Can't know the true recall value
  - except in small collections
- Precision and Recall are related
  - A combined measure is sometimes more appropriate
- Assumes batch mode
  - Interactive IR is more important
  - We will touch on this in the UI section
- Assumes a strict rank ordering matters.