2011.02.09 – IS 240, Spring 2011 – Prof. Ray Larson, University of California, Berkeley School of Information – Principles of Information Retrieval


SLIDE 1IS 240 – Spring 2011 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval Lecture 7: Vector (cont.)

SLIDE 2 IS 240 – Spring 2011 Mini-TREC
Need to start thinking about groups – today. Systems:
– SMART (not recommended…) ftp://ftp.cs.cornell.edu/pub/smart
– MG (we have a special version if interested)
– Cheshire II & 3 (II = ; 3 = )
– Zprise (older search system from NIST)
– IRF (new Java-based IR framework from NIST)
– Lemur
– Lucene (Java-based text search engine)
– Others?? (See )

SLIDE 3 IS 240 – Spring 2011 Mini-TREC Proposed Schedule
– February 9 – database and previous queries
– March 2 – report on system acquisition and setup
– March 9 – new queries for testing…
– April 18 – results due
– April 20 – results and system rankings
– April 27 – group reports and discussion

SLIDE 4IS 240 – Spring 2011 Review IR Models Vector Space Introduction

SLIDE 5 IS 240 – Spring 2011 IR Models Set Theoretic Models – Boolean, Fuzzy, Extended Boolean; Vector Models (algebraic); Probabilistic Models

SLIDE 6IS 240 – Spring 2011 Vector Space Model Documents are represented as vectors in term space –Terms are usually stems –Documents represented by binary or weighted vectors of terms Queries represented the same as documents Query and Document weights are based on length and direction of their vector A vector distance measure between the query and documents is used to rank retrieved documents
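A minimal Python sketch of this idea (added here, not part of the slides): documents and the query become weighted term vectors, and a vector similarity measure (cosine, which later slides develop) ranks the documents. The toy texts and the whitespace "stemming" are stand-ins for a real tokenizer and stemmer.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a text as a raw term-frequency vector (term -> count)."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

docs = {
    "A": "nova galaxy heat nova nova",   # astronomy-flavoured toy text
    "B": "film star diet fur",           # unrelated toy text
}
query = vectorize("nova galaxy")
ranked = sorted(docs, key=lambda d: cosine(query, vectorize(docs[d])), reverse=True)
print(ranked)   # document A, which shares more terms with the query, ranks first
```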

SLIDE 7 IS 240 – Spring 2011 Document Vectors + Frequency [Table: documents as term-frequency vectors] “Nova” occurs 10 times in text A, “Galaxy” occurs 5 times in text A, “Heat” occurs 3 times in text A (a blank cell means 0 occurrences)

SLIDE 8 IS 240 – Spring 2011 We Can Plot the Vectors [Figure: documents plotted in a two-term space with axes “Star” and “Diet”: a doc about astronomy, a doc about movie stars, and a doc about mammal behavior fall in different regions]

SLIDE 9IS 240 – Spring 2011 Documents in 3D Space Primary assumption of the Vector Space Model: Documents that are “close together” in space are similar in meaning

SLIDE 10 IS 240 – Spring 2011 Document Space has High Dimensionality What happens beyond 2 or 3 dimensions? Similarity still has to do with how many tokens are shared in common. More terms -> harder to understand which subsets of words are shared among similar documents. We will look in detail at ranking methods. Approaches to handling high dimensionality: Clustering and LSI (later)

SLIDE 11 IS 240 – Spring 2011 Assigning Weights to Terms Options: binary weights; raw term frequency; tf*idf – recall the Zipf distribution: we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole; and automatically derived thesaurus terms

SLIDE 12IS 240 – Spring 2011 Binary Weights Only the presence (1) or absence (0) of a term is included in the vector

SLIDE 13IS 240 – Spring 2011 Raw Term Weights The frequency of occurrence for the term in each document is included in the vector

SLIDE 14IS 240 – Spring 2011 Assigning Weights tf*idf measure: –Term frequency (tf) –Inverse document frequency (idf) A way to deal with some of the problems of the Zipf distribution Goal: Assign a tf*idf weight to each term in each document

SLIDE 15IS 240 – Spring 2011 Simple tf*idf
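The "simple tf*idf" equation on this slide is an image and did not survive the transcript; it is normally written as below (the log base is a presentational choice that only scales all weights):

```latex
w_{ik} = tf_{ik}\cdot \log\frac{N}{n_k}
```

where tf_ik is the frequency of term k in document i, N is the number of documents in the collection, and n_k is the number of documents containing term k.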

SLIDE 16IS 240 – Spring 2011 Inverse Document Frequency IDF provides high values for rare words and low values for common words For a collection of documents (N = 10000)
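A small sketch of the values this gives for the slide's N = 10000 collection, assuming idf_k = log2(N / n_k) (the base-2 choice is an assumption; the slide's own table is not in the transcript):

```python
import math

N = 10_000                        # documents in the collection (from the slide)
for n_k in (1, 10, 100, 1_000, 10_000):
    idf = math.log2(N / n_k)      # rare terms get high idf, ubiquitous terms get ~0
    print(f"term in {n_k:>6} docs -> idf = {idf:.2f}")
# term in      1 docs -> idf = 13.29
# term in  10000 docs -> idf = 0.00
```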

SLIDE 17IS 240 – Spring 2011 Word Frequency vs. Resolving Power The most frequent words are not the most descriptive. (from van Rijsbergen 79)

SLIDE 18 IS 240 – Spring 2011 Non-Boolean IR Need to measure some similarity between the query and the document. The basic notion is that documents that are somehow similar to a query are likely to be relevant responses for that query. We will revisit this notion and see how the Language Modelling approach to IR has taken it to a new level

SLIDE 19IS 240 – Spring 2011 Non-Boolean? To measure similarity we… –Need to consider the characteristics of the document and the query –Make the assumption that similarity of language use between the query and the document implies similarity of topic and hence, potential relevance.

SLIDE 20IS 240 – Spring 2011 Similarity Measures (Set-based) Simple matching (coordination level match) Dice’s Coefficient Jaccard’s Coefficient Cosine Coefficient Overlap Coefficient Assuming that Q and D are the sets of terms associated with a Query and Document:
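The coefficient definitions on this slide were images; a sketch of the usual set-based forms, treating Q and D simply as the sets of terms associated with the query and the document (as the slide assumes):

```python
import math

def coefficients(Q: set, D: set) -> dict:
    """Classic set-based association measures between a query and a document."""
    inter = len(Q & D)
    return {
        "simple_match": inter,                             # coordination level match
        "dice":    2 * inter / (len(Q) + len(D)),
        "jaccard": inter / len(Q | D),
        "cosine":  inter / math.sqrt(len(Q) * len(D)),
        "overlap": inter / min(len(Q), len(D)),
    }

print(coefficients({"nova", "galaxy"}, {"nova", "galaxy", "heat"}))
```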

SLIDE 21IS 240 – Spring 2011 tf x idf normalization Normalize the term weights (so longer documents are not unfairly given more weight) –normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive.
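The slide's equation is an image; the usual cosine-normalized form of the tf x idf weight (a standard reconstruction, not the slide's own notation) divides each weight by the length of the document's weight vector, so all weights fall between 0 and 1:

```latex
w_{ik} = \frac{tf_{ik}\cdot \log(N/n_k)}
              {\sqrt{\sum_{j=1}^{t}\bigl(tf_{ij}\cdot \log(N/n_j)\bigr)^{2}}}
```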

SLIDE 22IS 240 – Spring 2011 Vector space similarity Use the weights to compare the documents

SLIDE 23 IS 240 – Spring 2011 Vector Space Similarity Measure Combine tf x idf into a single similarity measure
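Combined with the normalized weights above, the measure is normally the cosine of the angle between the query and document weight vectors (again a standard reconstruction of the missing equation):

```latex
\operatorname{sim}(Q,D_i) =
\frac{\sum_{k=1}^{t} w_{qk}\, w_{ik}}
     {\sqrt{\sum_{k=1}^{t} w_{qk}^{2}}\cdot\sqrt{\sum_{k=1}^{t} w_{ik}^{2}}}
```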

SLIDE 24IS 240 – Spring 2011 Weighting schemes We have seen something of –Binary –Raw term weights –TF*IDF There are many other possibilities –IDF alone –Normalized term frequency

SLIDE 25IS 240 – Spring 2011 Term Weights in SMART SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell. Designed for laboratory experiments in IR –Easy to mix and match different weighting methods –Really terrible user interface –Intended for use by code hackers

SLIDE 26IS 240 – Spring 2011 Term Weights in SMART In SMART weights are decomposed into three factors:

SLIDE 27IS 240 – Spring 2011 SMART Freq Components Binary maxnorm augmented log

SLIDE 28IS 240 – Spring 2011 Collection Weighting in SMART Inverse squared probabilistic frequency

SLIDE 29IS 240 – Spring 2011 Term Normalization in SMART sum cosine fourth max
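The actual formulas on slides 27 through 29 were images and are not in the transcript. As a rough guide (reconstructed from the standard SMART weighting literature, so the details here are assumptions rather than the slides' own definitions), a SMART weight is the product of a frequency component, a collection component, and a normalization component, with the named options commonly given as:

```latex
% frequency components
f_{\text{binary}} = 1,\quad
f_{\text{maxnorm}} = \frac{tf}{\max tf},\quad
f_{\text{augmented}} = 0.5 + 0.5\,\frac{tf}{\max tf},\quad
f_{\text{log}} = 1 + \ln tf
% collection components
c_{\text{inverse}} = \log\frac{N}{n_k},\quad
c_{\text{squared}} = \Bigl(\log\frac{N}{n_k}\Bigr)^{2},\quad
c_{\text{probabilistic}} = \log\frac{N - n_k}{n_k}
% normalization: divide each weight by one of
\sum_k w_k \;(\text{sum}),\quad
\sqrt{\textstyle\sum_k w_k^{2}} \;(\text{cosine}),\quad
\sum_k w_k^{4} \;(\text{fourth}),\quad
\max_k w_k \;(\text{max})
```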

SLIDE 30 IS 240 – Spring 2011 Lucene Algorithm The open-source Lucene system is a vector-based system that differs from SMART-like systems in the ways the TF*IDF measures are normalized

SLIDE 31 IS 240 – Spring 2011 Lucene The basic Lucene algorithm combines: a length-normalized query weight; norm_{d,t}, the term normalization (square root of the number of tokens in the same document field as t); overlap(q,d), the proportion of query terms matched in the document; and boost_t, a user-specified term weight enhancement
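For reference, the classic Lucene TF-IDF ("practical") scoring formula is usually documented roughly as follows. This is taken from general Lucene documentation rather than from the slide, so treat it as indicative of the shape of the computation rather than as the slide's exact equation:

```latex
\text{score}(q,d) = \text{coord}(q,d)\cdot \text{queryNorm}(q)\cdot
\sum_{t\in q} tf(t,d)\cdot idf(t)^{2}\cdot boost_t\cdot \text{norm}(t,d)
```

Here tf(t,d) is typically the square root of the raw frequency, idf(t) = 1 + ln(N / (df_t + 1)), norm(t,d) is the 1/sqrt(field length) factor described above, and coord(q,d) is the overlap proportion.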

SLIDE 32IS 240 – Spring 2011 How To Process a Vector Query Assume that the database contains an inverted file like the one we discussed earlier… –Why an inverted file? –Why not a REAL vector file? What information should be stored about each document/term pair? –As we have seen SMART gives you choices about this…

SLIDE 33IS 240 – Spring 2011 Simple Example System Collection frequency is stored in the dictionary Raw term frequency is stored in the inverted file postings list Formula for term ranking

SLIDE 34 IS 240 – Spring 2011 Processing a Query For each term in the query:
– Count the number of times the term occurs – this is the tf for the query term
– Find the term in the inverted dictionary file and get: n_k, the number of documents in the collection with this term; Loc, the location of the postings list in the inverted file
– Calculate the query weight w_qk
– Retrieve n_k entries starting at Loc in the postings file

SLIDE 35 IS 240 – Spring 2011 Processing a Query Alternative strategies…
– First retrieve all of the dictionary entries before getting any postings information (why?)
– Just process each term in sequence (how can we tell how many results there will be?)
– It is possible to put a limitation on the number of items returned (how might this be done?)

SLIDE 36 IS 240 – Spring 2011 Processing a Query Like a hashed Boolean OR (see the sketch below):
– Put each document ID from each postings list into a hash table; if it is already there (a match), optionally increment a counter
– If this is the first posting for the document, set a WeightSUM variable to 0
– Calculate the document weight w_ik for the current term
– Multiply the query weight and the document weight and add the product to WeightSUM
– Scan the hash table contents and add them to a new list, including document ID and WeightSUM
– Sort by WeightSUM and present in sorted order
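A compact Python sketch of this accumulator strategy, assuming a simple in-memory inverted file of the sort described above and tf*idf weights (the weighting formula itself is an assumption, since the slide's ranking formula is not in the transcript):

```python
import math
from collections import defaultdict

# toy inverted file: term -> postings list of (doc_id, raw term frequency)
postings = {
    "nova":   [(1, 10), (3, 2)],
    "galaxy": [(1, 5),  (2, 7)],
    "heat":   [(1, 3)],
}
N = 3  # documents in the collection

def idf(term):
    n_k = len(postings.get(term, []))     # document frequency from the "dictionary"
    return math.log2(N / n_k) if n_k else 0.0

def run_query(query_terms):
    weight_sum = defaultdict(float)        # the per-document WeightSUM accumulator
    q_tf = defaultdict(int)
    for t in query_terms:
        q_tf[t] += 1                       # tf for the query term
    for t, tf_q in q_tf.items():
        w_qk = tf_q * idf(t)               # query weight for this term
        for doc_id, tf_d in postings.get(t, []):
            w_ik = tf_d * idf(t)           # document weight for this term
            weight_sum[doc_id] += w_qk * w_ik   # multiply and add to WeightSUM
    # scan the accumulators and present in sorted order
    return sorted(weight_sum.items(), key=lambda x: x[1], reverse=True)

print(run_query(["nova", "galaxy"]))
```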

SLIDE 37IS 240 – Spring 2011 Today More on Cosine Similarity

SLIDE 38IS 240 – Spring 2011 Computing Cosine Similarity Scores

SLIDE 39IS 240 – Spring 2011 What’s Cosine anyway? One of the basic trigonometric functions encountered in trigonometry. Let theta be an angle measured counterclockwise from the x-axis along the arc of the unit circle. Then cos(theta) is the horizontal coordinate of the arc endpoint. As a result of this definition, the cosine function is periodic with period 2pi. From

SLIDE 40IS 240 – Spring 2011 Cosine Detail (degrees)

SLIDE 41IS 240 – Spring 2011 Computing a similarity score

SLIDE 42 IS 240 – Spring 2011 Vector Space with Term Weights and Cosine Matching [Figure: query Q and documents D1, D2 plotted against Term A and Term B] D_i = (d_{i1}, w_{di1}; d_{i2}, w_{di2}; …; d_{it}, w_{dit}), Q = (q_{i1}, w_{qi1}; q_{i2}, w_{qi2}; …; q_{it}, w_{qit}); example: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
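Plugging the slide's example vectors into the cosine measure (a worked check added here, not part of the original slide):

```latex
\cos(Q,D_1) = \frac{0.4\cdot 0.8 + 0.8\cdot 0.3}
                   {\sqrt{0.4^2+0.8^2}\,\sqrt{0.8^2+0.3^2}}
            = \frac{0.56}{0.894\cdot 0.854} \approx 0.73
\qquad
\cos(Q,D_2) = \frac{0.4\cdot 0.2 + 0.8\cdot 0.7}
                   {\sqrt{0.4^2+0.8^2}\,\sqrt{0.2^2+0.7^2}}
            = \frac{0.64}{0.894\cdot 0.728} \approx 0.98
```

So D2, which points in nearly the same direction as Q, ranks above D1 even though D1 has the larger Term A weight.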

SLIDE 43 IS 240 – Spring 2011 Problems with Vector Space There is no real theoretical basis for the assumption of a term space – it is more for visualization than having any real basis – most similarity measures work about the same regardless of model. Terms are not really orthogonal dimensions – terms are not independent of all other terms

SLIDE 44IS 240 – Spring 2011 Vector Space Refinements As we saw earlier, the SMART system included a variety of weighting methods that could be combined into a single vector model algorithm Vector space has proven very effective in most IR evaluations (or used to be) Salton in a short article in SIGIR Forum (Fall 1981) outlined a “Blueprint” for automatic indexing and retrieval using vector space that has been, to a large extent, followed by everyone doing vector IR

SLIDE 45 IS 240 – Spring 2011 Vector Space Refinements Amit Singhal (one of Salton’s students) found that the document length normalization usually performed in the “standard tfidf” tended to overemphasize short documents. He and Chris Buckley came up with the idea of adjusting the document length normalization to better correspond to observed relevance patterns. The resulting “Pivoted Document Length Normalization” provided a valuable enhancement to the performance of vector space systems

SLIDE 46 IS 240 – Spring 2011 Pivoted Normalization [Figure: probability of retrieval and probability of relevance plotted against document length; the point where the two curves cross is the pivot]

SLIDE 47 IS 240 – Spring 2011 Pivoted Normalization [Figure: new normalization vs. old normalization; the old normalization factor is tilted about the pivot to give the final normalization factor]

SLIDE 48IS 240 – Spring 2011 Pivoted Normalization Using pivoted normalization the new tfidf weight for a document can be written as:
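The equation itself is missing from the transcript; in Singhal and Buckley's formulation (an outside reconstruction, not the slide's own notation) the old normalization factor in the denominator is replaced by one rotated about the pivot:

```latex
\text{pivoted norm} = (1 - slope)\cdot pivot + slope\cdot \text{old norm}
\qquad
w_{ik} = \frac{tf_{ik}\cdot \log(N/n_k)}
              {(1 - slope)\cdot pivot + slope\cdot \text{old norm}_i}
```

With the pivot set to the average old normalization factor over the collection, the denominator can equivalently be written as (1 - slope) + slope * (old norm_i / average old norm).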

SLIDE 49 IS 240 – Spring 2011 Pivoted Normalization Training from past relevance data, and assuming that the slope is going to be consistent with new results, we can adjust the slope and pivot to better fit the relevance curve for document size normalization

SLIDE 50IS 240 – Spring 2011 Today Clustering Automatic Classification Cluster-enhanced search

SLIDE 51IS 240 – Spring 2011 Overview Introduction to Automatic Classification and Clustering Classification of Classification Methods Classification Clusters and Information Retrieval in Cheshire II DARPA Unfamiliar Metadata Project

SLIDE 52IS 240 – Spring 2011 Classification The grouping together of items (including documents or their representations) which are then treated as a unit. The groupings may be predefined or generated algorithmically. The process itself may be manual or automated. In document classification the items are grouped together because they are likely to be wanted together –For example, items about the same topic.

SLIDE 53IS 240 – Spring 2011 Automatic Indexing and Classification Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words. More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document. Automatic classification attempts to automatically group similar documents using either: –A fully automatic clustering method. –An established classification scheme and set of documents already indexed by that scheme.

SLIDE 54 IS 240 – Spring 2011 Background and Origins Early suggestion by Fairthorne – “The Mathematics of Classification”. Early experiments by Maron (1961) and Borko and Bernick (1963). Work in numerical taxonomy and its application to information retrieval: Jardine, Sibson, van Rijsbergen, Salton (1970s). Early IR clustering work was more concerned with efficiency issues than semantic issues.

SLIDE 55IS 240 – Spring 2011 Document Space has High Dimensionality What happens beyond three dimensions? Similarity still has to do with how many tokens are shared in common. More terms -> harder to understand which subsets of words are shared among similar documents. One approach to handling high dimensionality: Clustering

SLIDE 56IS 240 – Spring 2011 Vector Space Visualization

SLIDE 57IS 240 – Spring 2011 Cluster Hypothesis The basic notion behind the use of classification and clustering methods: “Closely associated documents tend to be relevant to the same requests.” –C.J. van Rijsbergen

SLIDE 58 IS 240 – Spring 2011 Classification of Classification Methods Class structure:
– Intellectually formulated: manual assignment (e.g. library classification) or automatic assignment (e.g. Cheshire Classification Mapping)
– Automatically derived from a collection of items: hierarchic clustering methods (e.g. single link), agglomerative clustering methods (e.g. Dattola), hybrid methods (e.g. query clustering)

SLIDE 59IS 240 – Spring 2011 Classification of Classification Methods Relationship between properties and classes –monothetic –polythetic Relation between objects and classes –exclusive –overlapping Relation between classes and classes –ordered –unordered Adapted from Sparck Jones

SLIDE 60IS 240 – Spring 2011 Properties and Classes Monothetic –Class defined by a set of properties that are both necessary and sufficient for membership in the class Polythetic –Class defined by a set of properties such that to be a member of the class some individual must have some number (usually large) of those properties, and that a large number of individuals in the class possess some of those properties, and no individual possesses all of the properties.

SLIDE 61 IS 240 – Spring 2011 Monothetic vs. Polythetic [Figure: table of individuals against properties A–H contrasting a polythetic grouping, in which no single property is shared by every member, with a monothetic grouping; adapted from van Rijsbergen, ‘79]

SLIDE 62 IS 240 – Spring 2011 Exclusive vs. Overlapping An item can belong exclusively to a single class, or items can belong to many classes, sometimes with a “membership weight”

SLIDE 63IS 240 – Spring 2011 Ordered Vs. Unordered Ordered classes have some sort of structure imposed on them –Hierarchies are typical of ordered classes Unordered classes have no imposed precedence or structure and each class is considered on the same “level” –Typical in agglomerative methods

SLIDE 64 IS 240 – Spring 2011 Text Clustering Clustering is “the art of finding groups in data.” – Kaufman and Rousseeuw [Figure: documents plotted as points in a two-term space, axes Term 1 and Term 2]

SLIDE 65 IS 240 – Spring 2011 Text Clustering Clustering is “the art of finding groups in data.” – Kaufman and Rousseeuw [Figure: the same points in the Term 1 / Term 2 space, grouped into clusters]

SLIDE 66IS 240 – Spring 2011 Text Clustering Finds overall similarities among groups of documents Finds overall similarities among groups of tokens Picks out some themes, ignores others

SLIDE 67IS 240 – Spring 2011 Coefficients of Association Simple Dice’s coefficient Jaccard’s coefficient Cosine coefficient Overlap coefficient

SLIDE 68IS 240 – Spring 2011 Pair-wise Document Similarity How to compute document similarity?

SLIDE 69IS 240 – Spring 2011 Pair-wise Document Similarity (no normalization for simplicity)

SLIDE 70IS 240 – Spring 2011 Pair-wise Document Similarity cosine normalization
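A sketch of the pair-wise computation on toy vectors, first with the raw inner product (no normalization) and then with cosine normalization; the vectors here are made up for illustration:

```python
import math

docs = {
    "D1": {"nova": 10, "galaxy": 5, "heat": 3},
    "D2": {"nova": 5,  "galaxy": 10},
    "D3": {"fur": 9,   "diet": 10},
}

def dot(a, b):
    """Raw inner product over shared terms (no normalization)."""
    return sum(a[t] * b[t] for t in a if t in b)

def cos(a, b):
    """Inner product divided by the two vector lengths (cosine normalization)."""
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot(a, b) / (na * nb) if na and nb else 0.0

names = list(docs)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        print(x, y, "raw:", dot(docs[x], docs[y]),
              "cosine:", round(cos(docs[x], docs[y]), 3))
# D1 and D2 share vocabulary and get a high cosine; D3 shares nothing and scores 0
```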

SLIDE 71IS 240 – Spring 2011 Document/Document Matrix

SLIDE 72IS 240 – Spring 2011 Clustering Methods Hierarchical Agglomerative Hybrid Automatic Class Assignment

SLIDE 73 IS 240 – Spring 2011 Hierarchic Agglomerative Clustering Basic method (see the sketch below):
1) Calculate all of the interdocument similarity coefficients
2) Assign each document to its own cluster
3) Fuse the most similar pair of current clusters
4) Update the similarity matrix by deleting the rows for the fused clusters and calculating entries for the row and column representing the new cluster (centroid)
5) Return to step 3 if there is more than one cluster left
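A compact Python sketch of these five steps, using cosine similarity between vectors and representing each merged cluster by its centroid (single-link or group-average variants would differ only in how step 4 scores the new cluster); the document vectors are made up:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def hac(doc_vectors):
    # step 2: each document starts in its own cluster
    clusters = [{"members": [i], "vectors": [v]} for i, v in enumerate(doc_vectors)]
    merges = []
    while len(clusters) > 1:                      # step 5: repeat until one cluster remains
        # steps 1/3: find and fuse the most similar pair of current clusters
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: cosine(centroid(clusters[p[0]]["vectors"]),
                                 centroid(clusters[p[1]]["vectors"])),
        )
        merged = {"members": clusters[i]["members"] + clusters[j]["members"],
                  "vectors": clusters[i]["vectors"] + clusters[j]["vectors"]}
        merges.append(tuple(merged["members"]))
        # step 4: replace the two fused clusters with the new one (its centroid is
        # recomputed from the member vectors on the next pass)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

print(hac([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 1], [0, 0.9, 1.2]]))
```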

SLIDE 74 IS 240 – Spring 2011 Hierarchic Agglomerative Clustering [Figure: dendrogram over documents A–I, built up across this and the next two slides]

SLIDE 75 IS 240 – Spring 2011 Hierarchic Agglomerative Clustering [Figure: dendrogram over documents A–I, continued]

SLIDE 76 IS 240 – Spring 2011 Hierarchic Agglomerative Clustering [Figure: dendrogram over documents A–I, continued]

SLIDE 77IS 240 – Spring 2011 Next Week More on Clustering, Automatic classification, etc.