Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.

Today
- Document ranking
  - term weights
  - similarity measures
  - vector space model
  - probabilistic models
- Multi-dimensional spaces
- Clustering

Finding Out About
- Three phases:
  - asking of a question
  - construction of an answer
  - assessment of the answer
- Part of an iterative process

Ranking Algorithms
- Assign weights to the terms in the query.
- Assign weights to the terms in the documents.
- Compare the weighted query terms to the weighted document terms.
- Rank order the results.

[Diagram: the retrieval process. An information need is expressed as a text query, which is parsed and pre-processed; the document collections are likewise pre-processed into an index; the query is matched against the index and the results are ranked.]

Vector Representation (revisited; see the Salton article in Science)
- Documents and queries are represented as vectors.
- Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t.
- The weight of the term is stored in each position.
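
A minimal sketch of this representation in Python; the vocabulary and the weights below are invented for illustration:

    vocabulary = ["nova", "galaxy", "heat", "film"]  # position k holds term k

    # Each document (and each query) is a vector of term weights:
    # position k stores the weight of vocabulary[k].
    doc   = [0.8, 0.3, 0.0, 0.5]
    query = [1.0, 0.0, 0.0, 1.0]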

Assigning Weights to Terms
- Raw term frequency
- tf x idf
  - Recall the Zipf distribution
  - Want to weight terms highly if they are:
    - frequent in relevant documents ... BUT
    - infrequent in the collection as a whole
- Automatically derived thesaurus terms

Assigning Weights
- tf x idf measure:
  - term frequency (tf)
  - inverse document frequency (idf)
- Goal: assign a tf x idf weight to each term in each document

tf x idf
- w_ik = tf_ik * log(N / n_k), where:
  - tf_ik = frequency of term T_k in document D_i
  - idf_k = log(N / n_k), the inverse document frequency of term T_k
  - N = total number of documents in the collection
  - n_k = number of documents that contain T_k
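
A minimal sketch of computing these weights; the toy documents are invented for illustration:

    import math

    docs = [
        "nova galaxy heat nova",
        "galaxy heat galaxy",
        "film role diet film",
    ]
    tokenized = [d.split() for d in docs]
    N = len(tokenized)

    def tf_idf(term, doc_tokens):
        """Weight = raw term frequency * log(N / n_k)."""
        tf = doc_tokens.count(term)
        n_k = sum(1 for d in tokenized if term in d)  # docs containing term
        return tf * math.log(N / n_k) if n_k else 0.0

    print(tf_idf("nova", tokenized[0]))    # frequent here, rare overall -> high
    print(tf_idf("galaxy", tokenized[0]))  # appears in more docs -> lower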

tf x idf normalization
- Normalize the term weights (so longer documents are not unfairly given more weight).
- "Normalize" usually means forcing all values to fall within a certain range, usually between 0 and 1, inclusive.
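
A minimal sketch of one common normalization, dividing by the vector's Euclidean length so all weights fall between 0 and 1 (the weights below are invented for illustration):

    import math

    weights = [2.2, 0.4, 1.1]

    length = math.sqrt(sum(w * w for w in weights))
    normalized = [w / length for w in weights]
    print(normalized)  # each value now falls between 0 and 1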

Vector space similarity (use the weights to compare the documents)

Vector Space Similarity Measure
- Combine tf x idf into a similarity measure, e.g. the cosine of the angle between query and document vectors:
  sim(Q, D_i) = (sum_k w_qk * w_ik) / (sqrt(sum_k w_qk^2) * sqrt(sum_k w_ik^2))
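
A minimal sketch of the cosine measure over weighted term vectors:

    import math

    def cosine(q, d):
        """Cosine of the angle between query vector q and document vector d."""
        dot = sum(qk * dk for qk, dk in zip(q, d))
        q_len = math.sqrt(sum(qk * qk for qk in q))
        d_len = math.sqrt(sum(dk * dk for dk in d))
        return dot / (q_len * d_len) if q_len and d_len else 0.0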

To Think About
- How does this ranking algorithm behave?
  - Make a set of hypothetical documents consisting of terms and their weights.
  - Create some hypothetical queries.
  - How are the documents ranked, depending on the weights of their terms and the queries' terms?

Computing Similarity Scores
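
A worked example, with hypothetical weights, using the cosine() sketch above to rank two toy documents against a query:

    vocabulary = ["nova", "galaxy", "heat", "film"]
    documents = {
        "D1": [0.8, 0.3, 0.2, 0.0],
        "D2": [0.0, 0.1, 0.0, 0.9],
    }
    query = [1.0, 0.5, 0.0, 0.0]

    for name, vec in sorted(documents.items(),
                            key=lambda kv: cosine(query, kv[1]),
                            reverse=True):
        print(name, round(cosine(query, vec), 3))
    # D1 scores highest: it shares the heavily weighted query terms.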

Other Major Ranking Schemes
- Probabilistic Ranking
  - Attempts to be more theoretically sound than the vector space model
    - tries to predict the probability that a document is relevant, given the query
  - There are many variations
    - usually more complicated to compute than the vector space model
    - usually many approximations are required
  - Usually can't beat the vector space model reliably using standard evaluation measures
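
The slide does not single out one scheme; a minimal sketch of one classic choice, the binary independence model, follows, with a toy collection invented for illustration. With no relevance judgments available, its term weight reduces to an idf-like quantity:

    import math

    docs = [{"nova", "galaxy"}, {"galaxy", "heat"},
            {"film", "role"}, {"film", "diet"}]
    N = len(docs)

    def bim_score(query_terms, doc_terms):
        """Sum log((N - n_k) / n_k) over query terms present in the doc."""
        score = 0.0
        for t in query_terms & doc_terms:
            n_k = sum(1 for d in docs if t in d)
            if 0 < n_k < N:
                score += math.log((N - n_k) / n_k)
        return score

    print(bim_score({"nova", "galaxy"}, docs[0]))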

Other Major Ranking Schemes
- Staged Logistic Regression
  - A variation on probabilistic ranking
  - Used successfully here at Berkeley in the Cheshire II system

Staged Logistic Regression
- Pick a set of X feature types, e.g.:
  - x1: sum of frequencies of all terms in the query
  - x2: sum of frequencies of all query terms in the document
  - x3: query length
  - x4: document length
  - x5: sum of idfs for all terms in the query
- Determine weights, c, to indicate how important each feature type is (use training examples).
- To assign a score to the document, add up the feature weight times the term weight for each feature and each term in the query (see the sketch below).
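
A hedged sketch of the scoring step: one learned coefficient per feature, combined linearly and mapped to a probability of relevance. The coefficients and feature values below are invented for illustration; in practice they come from training on relevance judgments.

    import math

    coefficients = [-3.5, 0.1, 0.8, -0.2, -0.001, 0.3]  # c0 (intercept), c1..c5

    def relevance_score(features):
        """Log-odds of relevance = c0 + sum_i c_i * x_i, then a sigmoid."""
        log_odds = coefficients[0] + sum(
            c * x for c, x in zip(coefficients[1:], features))
        return 1.0 / (1.0 + math.exp(-log_odds))  # probability of relevance

    x = [3.0, 7.0, 2.0, 250.0, 12.4]  # x1..x5 for one query-document pair
    print(relevance_score(x))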

Multi-Dimensional Space
- Documents exist in a multi-dimensional space.
- What does this mean?
  - Consider a set of objects with features:
    - different shapes
    - different sizes
    - different colors
  - In what ways can they be grouped?
  - The features define an abstract space that the objects can reside in.
- Generalize this to terms in documents.
  - There are more than three kinds of terms!

Text Clustering
- Clustering is "the art of finding groups in data." -- Kaufman and Rousseeuw
[Figure: documents plotted as points on axes Term 1 and Term 2, grouped into clusters]
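
The slides do not name a specific algorithm; below is a minimal sketch of one common choice, k-means, over documents represented as two-dimensional term-weight vectors (the data are invented for illustration):

    import random

    def kmeans(points, k, iterations=20):
        centroids = random.sample(points, k)
        for _ in range(iterations):
            # Assign each point to its nearest centroid (squared distance).
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: sum(
                    (a - b) ** 2 for a, b in zip(p, centroids[i])))
                clusters[nearest].append(p)
            # Move each centroid to the mean of its cluster.
            for i, members in enumerate(clusters):
                if members:
                    centroids[i] = [sum(dim) / len(members)
                                    for dim in zip(*members)]
        return centroids, clusters

    docs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]  # Term 1, Term 2
    centroids, clusters = kmeans(docs, k=2)
    print(centroids)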

Pair-wise Document Similarity
[Table: documents A-D with weights for the terms nova, galaxy, heat, h'wood, film, role, diet, and fur; the numeric weights did not survive transcription]
- How do we compute document similarity? (The follow-up slide computes it with no normalization, for simplicity.)
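
A sketch of the computation the slide performs: pair-wise similarity as a plain inner product, with no normalization. Since the original table's numbers did not survive, the weights below are invented.

    from itertools import combinations

    # columns: nova, galaxy, heat, h'wood, film, role, diet, fur
    docs = {
        "A": [1, 3, 1, 0, 0, 0, 0, 0],
        "B": [0, 5, 2, 0, 0, 0, 0, 0],
        "C": [0, 0, 0, 2, 1, 5, 0, 0],
        "D": [0, 0, 0, 4, 1, 0, 1, 3],
    }

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    for (n1, v1), (n2, v2) in combinations(sorted(docs.items()), 2):
        print(f"sim({n1},{n2}) = {dot(v1, v2)}")
    # A and B share space terms; C and D share film terms; cross pairs score 0.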

Using Clustering
- Cluster the entire collection.
- Find the cluster centroid that best matches the query.
- This has been explored extensively:
  - it is expensive
  - it doesn't work well

Using Clustering
- Alternative (Scatter/Gather):
  - cluster the top-ranked documents
  - show cluster summaries to the user
- Seems useful:
  - experiments show relevant docs tend to end up in the same cluster
  - users seem able to interpret and use the cluster summaries some of the time
- More computationally feasible

Clustering
- Advantage:
  - See some main themes
- Disadvantage:
  - Many ways documents could group together are hidden

Using Clustering
- Another alternative:
  - cluster the entire collection
  - force the results into a 2D space
  - display them graphically to give an overview
- Looks neat, but hasn't been shown to be useful.
- Kohonen feature maps can be used instead of clustering to produce a display of documents in 2D regions.

Clustering Multi-Dimensional Document Space (image from Wise et al. 95)

Concept "Landscapes" from Kohonen Feature Maps (X. Lin and H. Chen)
[Map regions labeled: Pharmacology, Anatomy, Legal, Disease, Hospitals]

Graphical Depictions of Clusters
- Problems:
  - Either too many concepts, or too coarse
  - Only one concept per document
  - Hard to view titles
  - Browsing without search

Another Approach to Term Weighting: Latent Semantic Indexing
- Try to find words that are similar in meaning to other words by:
  - computing a document-by-term matrix
    - a matrix is a two-dimensional array of numbers
  - processing the matrix to pull out the main themes
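
A hedged sketch of the LSI idea using the singular value decomposition, the standard way of "pulling out the main themes"; the tiny matrix is invented for illustration.

    import numpy as np

    # rows = documents; columns = terms (nova, galaxy, film, role)
    A = np.array([
        [2.0, 3.0, 0.0, 0.0],
        [1.0, 2.0, 0.0, 0.0],
        [0.0, 0.0, 3.0, 1.0],
        [0.0, 0.0, 2.0, 2.0],
    ])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                          # number of latent "themes" to keep
    doc_themes = U[:, :k] * s[:k]  # documents in the reduced theme space
    print(doc_themes.round(2))     # the two blocks separate into two themes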

Document/Term Matrix

Finding Similar Tokens
- Two terms are considered similar if they co-occur often in many documents.
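
A small sketch of that idea as matrix arithmetic, assuming a binary document-by-term matrix A: entry (j, k) of A-transpose times A counts the documents in which terms j and k co-occur. The matrix is invented for illustration.

    import numpy as np

    A = np.array([
        [1, 1, 0, 0],  # doc 1 contains terms 0 and 1
        [1, 1, 1, 0],
        [0, 0, 1, 1],
    ])
    cooccurrence = A.T @ A
    print(cooccurrence)  # off-diagonal entries = co-occurrence counts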

Document/Term Matrix
- This approach doesn't work well.
- Problems:
  - Word contexts too large
  - Polysemy
- Alternative approaches:
  - Use smaller contexts
    - machine-readable dictionaries
    - local syntactic structure
  - LSI (Latent Semantic Indexing)
    - find main themes within the matrix