
SLIDE 1 IS 240 – Spring 2007: Principles of Information Retrieval, Lecture 6: Boolean to Vector. Prof. Ray Larson, University of California, Berkeley School of Information. Tuesday and Thursday 10:30 am - 12:00 pm, Spring 2007.

SLIDE 2 IS 240 – Spring 2007: Today: IR Models; The Boolean Model; Boolean implementation issues.

SLIDE 3 IS 240 – Spring 2007: Review: IR Models; Extended Boolean.

SLIDE 4 IS 240 – Spring 2007: IR Models: Set Theoretic Models – Boolean, Fuzzy, Extended Boolean; Vector Models (algebraic); Probabilistic Models.

SLIDE 5 IS 240 – Spring 2007: Boolean Logic [Venn diagram of sets A and B].

SLIDE 6 IS 240 – Spring 2007: Parse Result (Query Tree). Z39.50 queries: the query "Title XXX and Subject YYY" parses to an AND operator whose left operand is (Index = Title, Value = XXX) and whose right operand is (Index = Subject, Value = YYY).

SLIDE 7 IS 240 – Spring 2007: Parse Results. The query "Subject XXX and (Title YYY and Author ZZZ)" parses to an AND operator over the operand (Index: Subject, Value: XXX) and a nested AND over (Index: Title, Value: YYY) and (Index: Author, Value: ZZZ).

SLIDE 8 IS 240 – Spring 2007: Boolean AND Algorithm [diagram: merging two sorted posting lists to produce their intersection].

SLIDE 9 IS 240 – Spring 2007: Boolean OR Algorithm [diagram: merging two sorted posting lists to produce their union].

SLIDE 10 IS 240 – Spring 2007: Boolean AND NOT Algorithm [diagram: filtering one sorted posting list against another].
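The three merge algorithms these slides depict operate on sorted posting lists of document IDs. A minimal sketch in Python (the list contents are hypothetical; the slides' own worked examples were diagrams that did not survive the transcript):

```python
def boolean_and(p1, p2):
    """Intersect two sorted posting lists by walking both in step."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def boolean_or(p1, p2):
    """Union of two sorted posting lists, preserving sorted order."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            result.append(p1[i]); i += 1
        else:
            result.append(p2[j]); j += 1
    result.extend(p1[i:])   # append whatever remains of either list
    result.extend(p2[j:])
    return result

def boolean_and_not(p1, p2):
    """Documents in p1 that do not appear in p2."""
    result, i, j = [], 0, 0
    while i < len(p1):
        if j >= len(p2) or p1[i] < p2[j]:
            result.append(p1[i]); i += 1
        elif p1[i] == p2[j]:
            i += 1; j += 1
        else:
            j += 1
    return result

# Hypothetical posting lists:
print(boolean_and([1, 3, 5], [3, 4, 5]))      # [3, 5]
print(boolean_or([1, 3, 5], [3, 4, 5]))       # [1, 3, 4, 5]
print(boolean_and_not([1, 3, 5], [3, 4, 5]))  # [1]
```

Each routine makes a single pass over both lists, which is why Boolean queries over inverted files are fast even for large collections.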

SLIDE 11 IS 240 – Spring 2007: Basic Concepts for Extended Boolean. Instead of binary values, terms in documents and queries have a weight (importance or some other statistical property). Instead of binary set membership, sets are "fuzzy" and the weights are used to determine degree of membership. Degree of set membership can be used to rank the results of a query.

SLIDE 12 IS 240 – Spring 2007: Fuzzy Sets. Introduced by Zadeh in 1965. If set {A} has value v(A) and {B} has value v(B), where 0 ≤ v ≤ 1: v(A ∩ B) = min(v(A), v(B)); v(A ∪ B) = max(v(A), v(B)); v(~A) = 1 - v(A).

SLIDE 13 IS 240 – Spring 2007: Rule Evaluation Tree [RUBRIC-style evaluation tree for the rule World_Series (0.63): Event (0.63) combines "World Series" with Baseball_championship (0.7); Baseball (1.0) and Championship (0.7) are grounded in the terms "baseball" (1.0), "championship" (1.0), and "ball" (1.0); the St._Louis_Cardinals (0) and Milwaukee_Brewers (0) branches – Team, "Cardinals", Cardinals_full_name, Saint, "Saint", "St.", "Louis", "Milwaukee Brewers", "Brewers" – all evaluate to 0].

SLIDE 14 IS 240 – Spring 2007: Boolean Limitations. Advantages: simple queries are easy to understand; relatively easy to implement. Disadvantages: difficult to specify what is wanted, particularly in complex situations (e.g., RUBRIC queries); too much returned, or too little; ordering not well determined in traditional Boolean; ordering may be problematic in extended Boolean (Robertson's critique); weighting is based only on the query, or some undefined weighting scheme must be used for the documents.

SLIDE 15 IS 240 – Spring 2007: Lecture Overview: Statistical Properties of Text – Zipf Distribution, Statistical Dependence; Indexing and Inverted Files; Vector Representation; Term Weights. Credit for some of the slides in this lecture goes to Marti Hearst.

SLIDE 16 IS 240 – Spring 2007: Lecture Overview: Statistical Properties of Text – Zipf Distribution, Statistical Dependence; Indexing and Inverted Files; Vector Representation; Term Weights; Vector Matching. Credit for some of the slides in this lecture goes to Marti Hearst.

SLIDE 17 IS 240 – Spring 2007: A Small Collection (Stems)
Rank   Freq  Term
1      37    system
2      32    knowledg
3      24    base
4      20    problem
5      18    abstract
6      15    model
7      15    languag
8      15    implem
9      13    reason
10-18  –     inform, expert, analysi, rule, program, oper, evalu, comput, case
19     9     gener
20     9     form
Lower-frequency stems: enhanc, energi, emphasi, detect, desir, date, critic, content, consider, concern, compon, compar, commerci, clause, aspect, area, aim, affect.

SLIDE 18 IS 240 – Spring 2007: The Corresponding Zipf Curve [rank/frequency plot of the stem counts from the previous slide].

SLIDE 19 IS 240 – Spring 2007: Zipf Distribution. The important points: a few elements occur very frequently; a medium number of elements have medium frequency; many elements occur very infrequently.
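Stated as a formula (the standard form of Zipf's law these bullet points summarize, not shown on the slide itself): the frequency of the element of rank r falls off roughly as the inverse of the rank,

$$ f(r) \approx \frac{C}{r}, \qquad \text{equivalently} \qquad r \cdot f(r) \approx C, $$

for a corpus-dependent constant C. This is why a log-log plot of rank against frequency (next slide) is close to a straight line.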

SLIDE 20: Zipf Distribution [the same distribution plotted on a linear scale and on a logarithmic scale].

SLIDE 21 IS 240 – Spring 2007: Related Distributions/"Laws": Bradford's Law of Scattering; Lotka's Law of Productivity; De Solla Price's Urn Model for "Cumulative Advantage Processes" [urn diagram: pick, replace +1; ½ = 50%, 2/3 = 66%, ¾ = 75%].

SLIDE 22 IS 240 – Spring 2007: Frequent Words on the WWW: the, a, to, of, and, in, s, for, on, this, is, by, with, or, at, all, are, from, e, you, be, that, not, an, as, home, it, i, have, if, new, t, your, page, about, com, information, will, can, more, has, no, other, one, c, d, m, was, copyright, us (see

SLIDE 23 IS 240 – Spring 2007: Word Frequency vs. Resolving Power. The most frequent words are not the most descriptive (from van Rijsbergen 79).

SLIDE 24 IS 240 – Spring 2007: Statistical Independence. Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together.
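In symbols, reconstructing the formula from the definition just given:

$$ P(x \wedge y) = P(x)\,P(y). $$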

SLIDE 25 IS 240 – Spring 2007: Lexical Associations. Subjects write the first word that comes to mind – doctor/nurse; black/white (Palermo & Jenkins 64). Text corpora can yield similar associations. One measure: Mutual Information (Church and Hanks 89). If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection).
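The measure itself was an image on the slide; the pointwise mutual information of Church and Hanks (1989) is

$$ I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}, $$

so if x and y occurred independently the ratio inside the log would be 1 and I(x, y) would be 0. This is the comparison of numerator and denominator the slide refers to.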

SLIDE 26 IS 240 – Spring 2007: Interesting Associations with "Doctor" (AP Corpus, N = 15 million, Church & Hanks 89).

SLIDE 27 IS 240 – Spring 2007: Un-Interesting Associations with "Doctor" (AP Corpus, N = 15 million, Church & Hanks 89). These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.

SLIDE 28 IS 240 – Spring 2007: Content Analysis Summary. Content analysis: transforming raw text into more computationally useful forms. Words in text collections exhibit interesting statistical properties – word frequencies have a Zipf distribution; word co-occurrences exhibit dependencies.

SLIDE 29 IS 240 – Spring 2007: Lecture Overview: Statistical Properties of Text – Zipf Distribution, Statistical Dependence; Indexing and Inverted Files; Vector Representation; Term Weights; Vector Matching. Credit for some of the slides in this lecture goes to Marti Hearst.

SLIDE 30 IS 240 – Spring 2007: Inverted Indexes. We have seen "vector files" conceptually – an inverted file is a vector file "inverted" so that rows become columns and columns become rows.

SLIDE 31 IS 240 – Spring 2007: Inverted File Structure [diagram: dictionary file and postings file].

SLIDE 32 IS 240 – Spring 2007: Inverted Indexes. Permit fast search for individual terms. For each term, you get a list consisting of: document ID; frequency of term in doc (optional); position of term in doc (optional). These lists can be used to solve Boolean queries: country -> d1, d2; manor -> d2; country AND manor -> d2. Also used for statistical ranking algorithms.
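A minimal sketch of this structure in Python, reproducing the slide's country/manor query (the document texts themselves are hypothetical):

```python
from collections import defaultdict

# Toy collection; only the term -> doc-ID mapping matters here.
docs = {
    "d1": "a cottage in the country",
    "d2": "the manor house in the country",
}

# Build the inverted index: each term maps to the set of docs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(sorted(index["country"]))                   # ['d1', 'd2']
print(sorted(index["manor"]))                     # ['d2']
print(sorted(index["country"] & index["manor"]))  # ['d2']  (Boolean AND)
```

Real systems store frequencies and positions alongside each document ID, as the slide notes, rather than bare sets.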

SLIDE 33 IS 240 – Spring 2007: How Inverted Files Are Used [diagram: dictionary and postings]. Query on "time" AND "dark": 2 docs with "time" in the dictionary -> IDs 1 and 2 from the postings file; 1 doc with "dark" in the dictionary -> ID 2 from the postings file. Therefore, only doc 2 satisfies the query.

SLIDE 34 IS 240 – Spring 2007: Lecture Overview: Review – Boolean Searching, Content Analysis; Statistical Properties of Text – Zipf Distribution, Statistical Dependence; Indexing and Inverted Files; Vector Representation; Term Weights; Vector Matching. Credit for some of the slides in this lecture goes to Marti Hearst.

SLIDE 35 IS 240 – Spring 2007: Document Vectors. Documents are represented as "bags of words", and as vectors when used computationally. A vector is like an array of floating-point numbers; it has direction and magnitude; each vector holds a place for every term in the collection; therefore, most vectors are sparse.

SLIDE 36 IS 240 – Spring 2007: Vector Space Model. Documents are represented as vectors in term space – terms are usually stems; documents are represented by binary or weighted vectors of terms. Queries are represented the same way as documents. Query and document weights are based on the length and direction of their vectors. A vector distance measure between the query and documents is used to rank retrieved documents.
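A minimal sketch of this ranking step using cosine similarity over sparse term-weight vectors (cosine is one common choice of vector measure; the example weights reuse the raw frequencies from the slides that follow):

```python
import math

def cosine(q, d):
    """Cosine of the angle between two sparse vectors (term -> weight)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

docs = {
    "A": {"nova": 10, "galaxy": 5, "heat": 3},              # astronomy
    "I": {"hollywood": 7, "film": 5, "diet": 1, "fur": 3},  # movie stars
}
query = {"nova": 1, "film": 1}

# Rank documents by decreasing similarity to the query.
for doc_id in sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True):
    print(doc_id, round(cosine(query, docs[doc_id]), 3))
```

Because the vectors are sparse, only the terms the query and document share contribute to the dot product.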

SLIDE 37 IS 240 – Spring 2007: Vector Representation. Documents and queries are represented as vectors: position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t. The weight of the term is stored in each position.

SLIDE 38 IS 240 – Spring 2007: Document Vectors + Frequency. "Nova" occurs 10 times in text A; "Galaxy" occurs 5 times in text A; "Heat" occurs 3 times in text A. (Blank means 0 occurrences.)

SLIDE 39 IS 240 – Spring 2007: Document Vectors + Frequency. "Hollywood" occurs 7 times in text I; "Film" occurs 5 times in text I; "Diet" occurs 1 time in text I; "Fur" occurs 3 times in text I.

SLIDE 40 IS 240 – Spring 2007: Document Vectors + Frequency [term-frequency table for the example documents].

SLIDE 41 IS 240 – Spring 2007: We Can Plot the Vectors [2-D plot with axes "Star" and "Diet": a doc about astronomy, a doc about movie stars, and a doc about mammal behavior occupy different regions].

SLIDE 42 IS 240 – Spring 2007: Documents in 3D Space. Primary assumption of the vector space model: documents that are "close together" in space are similar in meaning.

SLIDE 43 IS 240 – Spring 2007: Vector Space Documents and Queries [plot of documents D1–D11 in the term space t1, t2, t3, with Boolean term combinations marked]. Q is a query – also represented as a vector.

SLIDE 44 IS 240 – Spring 2007: Documents in Vector Space [plot of documents D1–D11 in the term space t1, t2, t3].

SLIDE 45 IS 240 – Spring 2007: Document Space Has High Dimensionality. What happens beyond 2 or 3 dimensions? Similarity still has to do with how many tokens are shared in common, but with more terms it becomes harder to understand which subsets of words are shared among similar documents. We will look in detail at ranking methods. Approaches to handling high dimensionality: clustering and LSI (later).

SLIDE 46 IS 240 – Spring 2007: Lecture Overview: Statistical Properties of Text – Zipf Distribution, Statistical Dependence; Indexing and Inverted Files; Vector Representation; Term Weights; Vector Matching. Credit for some of the slides in this lecture goes to Marti Hearst.

SLIDE 47 IS 240 – Spring 2007: Assigning Weights to Terms. Options: binary weights; raw term frequency; tf*idf – recall the Zipf distribution: we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole; automatically derived thesaurus terms.

SLIDE 48 IS 240 – Spring 2007: Binary Weights. Only the presence (1) or absence (0) of a term is included in the vector.

SLIDE 49 IS 240 – Spring 2007: Raw Term Weights. The frequency of occurrence of the term in each document is included in the vector.

SLIDE 50 IS 240 – Spring 2007: Assigning Weights. The tf*idf measure combines term frequency (tf) and inverse document frequency (idf). It is a way to deal with some of the problems of the Zipf distribution. Goal: assign a tf*idf weight to each term in each document.

SLIDE 51 IS 240 – Spring 2007: Simple tf*idf [formula].
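The formula itself was an image on the original slide; a standard "simple tf*idf" weighting consistent with the surrounding slides (a reconstruction, not confirmed by the transcript) is

$$ w_{ij} = tf_{ij} \cdot \log\frac{N}{n_j}, $$

where $tf_{ij}$ is the frequency of term $j$ in document $i$, $N$ is the number of documents in the collection, and $n_j$ is the number of documents containing term $j$.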

SLIDE 52 IS 240 – Spring 2007: Inverse Document Frequency. IDF provides high values for rare words and low values for common words. For a collection of documents (N = 10000).
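The example values were an image on the slide; an illustrative reconstruction with $N = 10000$ and $idf_j = \log_2(N/n_j)$ (the base-2 logarithm is an assumption) gives

$$ \log_2\tfrac{10000}{10000} = 0, \quad \log_2\tfrac{10000}{5000} = 1, \quad \log_2\tfrac{10000}{20} \approx 8.97, \quad \log_2\tfrac{10000}{1} \approx 13.29, $$

so a term appearing in every document contributes nothing, while a term appearing in a single document is weighted most heavily.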

SLIDE 53 IS 240 – Spring 2007: Non-Boolean IR. We need to measure some similarity between the query and the document. The basic notion is that documents that are somehow similar to a query are likely to be relevant responses for that query. We will revisit this notion and see how the language-modelling approach to IR has taken it to a new level.

SLIDE 54 IS 240 – Spring 2007: Non-Boolean? To measure similarity we need to consider the characteristics of the document and the query, and make the assumption that similarity of language use between the query and the document implies similarity of topic and hence potential relevance.

SLIDE 55 IS 240 – Spring 2007: Similarity Measures (Set-based). Assuming that Q and D are the sets of terms associated with a query and a document: simple matching (coordination level match); Dice's coefficient; Jaccard's coefficient; cosine coefficient; overlap coefficient.
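The coefficient formulas were images on the slide; their standard set-based definitions are

$$
\begin{aligned}
\text{Simple matching} &= |Q \cap D| \\
\text{Dice} &= \frac{2\,|Q \cap D|}{|Q| + |D|} \\
\text{Jaccard} &= \frac{|Q \cap D|}{|Q \cup D|} \\
\text{Cosine} &= \frac{|Q \cap D|}{\sqrt{|Q|\,|D|}} \\
\text{Overlap} &= \frac{|Q \cap D|}{\min(|Q|, |D|)}
\end{aligned}
$$

All five count shared terms and differ only in how they normalize for query and document size.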

SLIDE 56 IS 240 – Spring 2007: What form should these take? Each of the queries and documents might be considered as: a set of terms (Boolean approach) – "index terms", "words", stems, etc.; or some other form?

SLIDE 57 IS 240 – Spring 2007: Weighting Schemes. We have seen something of: binary weights; raw term weights; tf*idf. There are many other possibilities: idf alone; normalized term frequency; etc.
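As one illustration of these variants, "normalized term frequency" commonly divides a term's raw frequency by the largest raw frequency in the same document (one common formulation; the slide does not specify which variant is intended):

$$ ntf_{ij} = \frac{tf_{ij}}{\max_k\, tf_{ik}}. $$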

SLIDE 58 IS 240 – Spring 2007: Next Week. More on the vector space model; probabilistic models and retrieval.