9/13/2001 Information Organization and Retrieval: Vector Representation, Term Weights and Clustering. Ray Larson & Warren Sack, University of California, Berkeley.


Vector Representation, Term Weights and Clustering. Ray Larson & Warren Sack, University of California, Berkeley, School of Information Management and Systems. SIMS 202: Information Organization and Retrieval. Lecture authors: Marti Hearst, Ray Larson & Warren Sack.

Last Time. Content Analysis: transformation of raw text into more computationally useful forms. Words in text collections exhibit interesting statistical properties: –Zipf distribution –Word co-occurrences are non-independent.

Document Processing Steps. (Figure.)

Zipf Distribution. Rank = a word's position when words are ordered by frequency of occurrence. The product of a word's frequency (f) and its rank (r) is approximately constant.
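A quick way to see the rank-frequency relationship is to tabulate it. This is an illustrative sketch only: the toy corpus below is invented, and real Zipf behavior only emerges on large collections, where the f x r products flatten toward a constant.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (word, rank, freq, rank*freq) tuples, most frequent first."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda wf: -wf[1])
    return [(w, r, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

tokens = "the cat sat on the mat the cat ran".split()
for word, rank, freq, product in rank_frequency(tokens):
    print(word, rank, freq, product)
```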

Zipf Distribution. The important points: –a few elements occur very frequently –a medium number of elements have medium frequency –many elements occur very infrequently.

Zipf Distribution. (Figure: the same curve on linear and log scale.)

Statistical Independence. Two events x and y are statistically independent if the product of the probabilities of each happening individually equals the probability of their happening together: P(x)P(y) = P(x, y).
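A tiny worked check of the definition, on an invented set of documents (here the two words happen to come out independent; word co-occurrences in real text usually do not):

```python
# Check whether two words co-occur independently across a toy document set.
docs = [{"star", "galaxy"}, {"star", "film"}, {"galaxy"}, {"film", "diet"}]
n = len(docs)
p_star = sum("star" in d for d in docs) / n      # P(x)
p_galaxy = sum("galaxy" in d for d in docs) / n  # P(y)
p_both = sum("star" in d and "galaxy" in d for d in docs) / n  # P(x, y)
# Independent iff P(x)P(y) equals P(x, y).
print(p_star * p_galaxy, p_both)
```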

Today: –Document Vectors –Inverted Files –Vector Space Model –Term Weighting –Clustering.

Document Vectors. Documents are represented as "bags of words", and as vectors when used computationally: –A vector is like an array of (floating-point) numbers –It has direction and magnitude –Each vector holds a place for every term in the collection –Therefore, most vectors are sparse.

Document Vectors: one location for each word. (Table: rows are document ids A through I; columns are the terms nova, galaxy, heat, h'wood, film, role, diet, fur; a blank cell means 0 occurrences.) Examples from the table: "nova" occurs 10 times in text A, "galaxy" 5 times in A, "heat" 3 times in A; "Hollywood" occurs 7 times in text I, "film" 5 times in I, "diet" once in I, and "fur" 3 times in I.
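The sparse vectors above can be sketched as term-to-count maps; only the counts stated on the slide are filled in (all other cells were blank, i.e., zero):

```python
from collections import Counter

# Sparse document vectors: one entry per term that actually occurs.
# Counts are the examples recoverable from the slide's table.
doc_vectors = {
    "A": Counter({"nova": 10, "galaxy": 5, "heat": 3}),
    "I": Counter({"hollywood": 7, "film": 5, "diet": 1, "fur": 3}),
}
# A missing key behaves as a 0 entry, which is what "blank means 0" requires.
print(doc_vectors["A"]["nova"], doc_vectors["A"]["film"])  # 10 0
```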

We Can Plot the Vectors. (Figure: a 2-D plot with axes "star" and "diet", showing a doc about astronomy, a doc about movie stars, and a doc about mammal behavior.)

Documents in 3D Space. (Figure.)

Content Analysis Summary. Content analysis: transforming raw text into more computationally useful forms. Words in text collections exhibit interesting statistical properties: –Word frequencies have a Zipf distribution –Word co-occurrences exhibit dependencies. Text documents are transformed into vectors: –Pre-processing includes tokenization, stemming, collocations/phrases –Documents occupy a multi-dimensional space.

(Diagram: a query is formed by text input from an information need, then parsed and pre-processed; collections are likewise parsed and pre-processed to build the index; the index is used to rank documents against the query.) How is the index constructed?

Inverted Index. This is the primary data structure for text indexes. Main idea: invert documents into a big index. Basic steps: –Make a "dictionary" of all the tokens in the collection –For each token, list all the docs it occurs in –Do a few things to reduce redundancy in the data structure.

Inverted Indexes. We have seen "vector files" conceptually. An inverted file is a vector file "inverted" so that rows become columns and columns become rows.

How Inverted Files Are Created. Documents are parsed to extract tokens, which are saved with the document ID. Doc 1: "Now is the time for all good men to come to the aid of their country." Doc 2: "It was a dark and stormy night in the country manor. The time was past midnight."

How Inverted Files Are Created. After all documents have been parsed, the inverted file is sorted alphabetically.

How Inverted Files Are Created. Multiple term entries for a single document are merged, and within-document term frequency information is compiled.

How Inverted Files Are Created. Then the file can be split into: –a dictionary file and –a postings file.

How Inverted Files Are Created. (Figure: the dictionary file and the postings file.)

Inverted indexes permit fast search for individual terms. For each term, you get a list consisting of: –document ID –frequency of term in doc (optional) –position of term in doc (optional). These lists can be used to solve Boolean queries: country -> d1, d2; manor -> d2; country AND manor -> d2. Also used for statistical ranking algorithms.

How Inverted Files Are Used. (Figure: dictionary and postings.) Boolean query on "time" AND "dark": 2 docs with "time" in the dictionary -> IDs 1 and 2 from the postings file; 1 doc with "dark" in the dictionary -> ID 2 from the postings file. Therefore, only doc 2 satisfies the query.

Vector Space Model. Documents are represented as vectors in term space: –Terms are usually stems –Documents represented by binary vectors of terms. Queries are represented the same as documents. Query and document weights are based on the length and direction of their vectors. A vector distance measure between the query and documents is used to rank retrieved documents. This makes partial matching possible.

Documents in 3D Space. Assumption: documents that are "close together" in space are similar in meaning.

Vector Space Documents and Queries. (Figure: documents D1 through D11 plotted in the space of terms t1, t2, t3, with Boolean term combinations.) Q is a query, also represented as a vector.

Documents in Vector Space. (Figure: documents D1 through D11 in the t1, t2, t3 term space.)

Assigning Weights to Terms: –Binary weights –Raw term frequency –tf x idf. Recall the Zipf distribution: we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole.

Binary Weights. Only the presence (1) or absence (0) of a term is included in the vector.

Raw Term Weights. The frequency of occurrence of the term in each document is included in the vector.

Assigning Weights. The tf x idf measure: –term frequency (tf) –inverse document frequency (idf), a way to deal with the problems of the Zipf distribution. Goal: assign a tf x idf weight to each term in each document.

tf x idf. (Formula slide.)

Inverse Document Frequency. IDF provides high values for rare words and low values for common words in a collection of documents. (Formula slide.)
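A minimal sketch of tf x idf weighting, assuming the common formulation w(t, d) = tf(t, d) * log(N / df(t)) (the slide's exact formula was a figure, so this formulation is an assumption; the three example documents are invented):

```python
import math
from collections import Counter

def tfidf(docs):
    """tf x idf weights: raw term frequency times log(N / document frequency)."""
    N = len(docs)
    tfs = {d: Counter(text.lower().split()) for d, text in docs.items()}
    df = Counter(t for tf in tfs.values() for t in tf)
    return {d: {t: f * math.log(N / df[t]) for t, f in tf.items()}
            for d, tf in tfs.items()}

docs = {"d1": "star galaxy star", "d2": "star film", "d3": "film diet"}
weights = tfidf(docs)
# "galaxy" appears in 1 of 3 docs, so its idf is log(3); "star" in 2, log(3/2).
print(round(weights["d1"]["galaxy"], 3))
```

Note how the idf factor does exactly what the slide asks: "galaxy" (rare) gets a higher weight per occurrence than "star" (common).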

Similarity Measures: –Simple matching (coordination-level match) –Dice's coefficient –Jaccard's coefficient –Cosine coefficient –Overlap coefficient.
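For binary term vectors, these coefficients have simple set-based forms. A sketch using common textbook formulations (the slide's own formulas were a figure, and the example terms below are invented):

```python
import math

# X and Y are binary term vectors modeled as sets of terms.
def simple_match(X, Y): return len(X & Y)                            # |X n Y|
def dice(X, Y):         return 2 * len(X & Y) / (len(X) + len(Y))
def jaccard(X, Y):      return len(X & Y) / len(X | Y)
def cosine(X, Y):       return len(X & Y) / math.sqrt(len(X) * len(Y))
def overlap(X, Y):      return len(X & Y) / min(len(X), len(Y))

X = {"star", "galaxy", "film"}
Y = {"star", "film", "diet", "fur"}
print(simple_match(X, Y), round(jaccard(X, Y), 3))  # 2 0.4
```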

Computing Similarity Scores – Preview. (Figure.)

Document Space Has High Dimensionality. What happens beyond 2 or 3 dimensions? Similarity still has to do with how many tokens are shared in common, but with more terms it is harder to understand which subsets of words are shared among similar documents. One approach to handling high dimensionality: clustering. Next time we will look at ranking methods in detail.

Vector Space Visualization. (Figure.)

Text Clustering: –Finds overall similarities among groups of documents –Finds overall similarities among groups of tokens –Picks out some themes, ignores others.

Text Clustering. Clustering is "the art of finding groups in data." (Kaufmann and Rousseeuw, Finding Groups in Data, 1990.) (Figure: points in a Term 1 x Term 2 plane gathered into clusters.)


Types of Clustering: –Hierarchical vs. flat –Hard vs. soft vs. disjunctive (set vs. uncertain vs. multiple assignment).

Pair-wise Document Similarity. (Table: documents A through D over the terms nova, galaxy, heat, h'wood, film, role, diet, fur.) How to compute document similarity?

Pair-wise Document Similarity (no normalization, for simplicity). (Formula slide.)

Pair-wise Document Similarity (cosine normalization). (Formula slide.)
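A sketch of both variants, assuming the standard formulations (unnormalized: the inner product of the term-count vectors; cosine: the inner product divided by the product of the vector lengths — the slide's own formulas were figures). The counts for documents A and I come from the earlier document-vector table.

```python
import math

def dot(u, v):
    """Inner product of two sparse term-count vectors."""
    return sum(u[t] * v.get(t, 0) for t in u)

def cosine(u, v):
    """Inner product normalized by the product of vector lengths."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

A = {"nova": 10, "galaxy": 5, "heat": 3}
I = {"hollywood": 7, "film": 5, "diet": 1, "fur": 3}
print(dot(A, I))               # 0: the two documents share no terms
print(round(cosine(A, A), 1))  # 1.0: identical direction
```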

Document/Document Matrix. (Figure.)

Hierarchical Clustering (Agglomerative). (Figure sequence over three slides: documents A through I gathered step by step into a dendrogram, merging the closest clusters first.)


Types of Hierarchical Clustering: –Top-down vs. bottom-up; O(n^2) vs. O(n^3) –Single-link vs. complete-link (local coherence vs. global coherence).
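A minimal single-link agglomerative sketch on 1-D points, illustrating the "merge the closest clusters first" idea (this is an invented example, not the lecture's own algorithm; single-link measures cluster distance by the closest pair of members):

```python
def single_link(points, target_k):
    """Agglomerate 1-D points into target_k clusters, single-link distance."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: distance of the closest pair across two clusters
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters

print(single_link([1.0, 1.2, 5.0, 5.1, 9.0], 2))
```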


Flat Clustering: –K-Means (hard, O(n)) –EM (a soft version of K-Means).

K-Means Clustering: 1. Create a pair-wise similarity measure. 2. Find K centers. 3. Assign each document to the nearest center, forming new clusters. 4. Repeat step 3 as necessary.
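The steps above can be sketched on 1-D points. This version also recomputes each center as its cluster's mean between iterations — a standard refinement of step 4; the points and starting centers below are invented for illustration.

```python
def kmeans(points, centers, iters=10):
    """Hard K-Means on 1-D points: assign to nearest center, recompute means."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # step 3: assign each point to its nearest center, forming new clusters
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # refinement: recompute each center as the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([1.0, 1.2, 5.0, 5.1, 9.0], [0.0, 10.0])
print(centers)
```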


Scatter/Gather (Cutting, Pedersen, Tukey & Karger 1992, 1993; Hearst & Pedersen 1995). Clusters sets of documents into general "themes", like a table of contents. Displays the contents of the clusters by showing topical terms and typical titles. The user chooses subsets of the clusters and re-clusters the documents within; the resulting new groups have different "themes".

S/G Example: query on "star" (Encyclopedia text). Clusters: 14 sports; 8 symbols; 47 film, tv; 68 film, tv (p); 7 music; 97 astrophysics; 67 astronomy (p); 12 stellar phenomena; 10 flora/fauna; 49 galaxies, stars; 29 constellations; 7 miscellaneous. Clustering and re-clustering is entirely automated.

Another Use of Clustering. Use clustering to map the entire huge multidimensional document space into a large number of small clusters, then "project" these onto a 2-D graphical representation.

Clustering Multi-Dimensional Document Space. Wise, Thomas, Pennock, Lantrip, Pottier, Schur & Crow, "Visualizing the Non-Visual: Spatial Analysis and Interaction with Information from Text Documents," 1995.

Clustering Multi-Dimensional Document Space. (Figure; Wise et al., 1995.)

Concept "Landscapes": browsing without search. (Figure: regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals.) E.g., Xia Lin, "Visualization for the Document Space," 1992; based on Kohonen feature maps.

More examples of information visualization: Stuart Card, Jock Mackinlay & Ben Shneiderman (eds.), Readings in Information Visualization (San Francisco: Morgan Kaufmann, 1999); Martin Dodge,

Clustering. Advantage: –See some main themes. Disadvantage: –Many ways documents could group together are hidden. Thinking point: what is the relationship to classification systems and faceted queries? E.g., f1: (osteoporosis OR 'bone loss'); f2: (drugs OR pharmaceuticals); f3: (prevention OR cure).

More information on content analysis and clustering: Christopher Manning & Hinrich Schütze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999); Daniel Jurafsky & James Martin, Speech and Language Processing (Upper Saddle River, NJ: Prentice Hall, 2000).

Next Time: –Vector Space Ranking –Probabilistic Models and Ranking.