9/18/2001 Information Organization and Retrieval: Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

Presentation transcript:

9/18/2001 Information Organization and Retrieval
Vector Representation, Term Weights and Clustering (continued)
Ray Larson & Warren Sack
University of California, Berkeley, School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Lecture authors: Marti Hearst & Ray Larson & Warren Sack

Last Time
Document Vectors
Inverted Files
Vector Space Model
Term Weighting
Clustering

Document Vectors
[Table: one row per document id (A through I), one column per term (nova, galaxy, heat, h’wood, film, role, diet, fur); cell values not recovered]

We Can Plot the Vectors
[Figure: document vectors plotted on the axes Star and Diet, showing a doc about astronomy, a doc about movie stars, and a doc about mammal behavior]

Inverted Index
This is the primary data structure for text indexes
Main Idea:
– Invert documents into a big index
Basic steps:
– Make a “dictionary” of all the tokens in the collection
– For each token, list all the docs it occurs in
– Do a few things to reduce redundancy in the data structure
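
The first two basic steps can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function name `build_inverted_index`, the whitespace tokenizer, and the three sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of document ids it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase and split on whitespace.
        for token in text.lower().split():
            index[token].add(doc_id)  # a set keeps each doc id once per token
    return {token: sorted(ids) for token, ids in index.items()}

# Hypothetical mini-collection, not taken from the slides.
docs = {
    "A": "nova galaxy heat",
    "B": "film role galaxy",
    "C": "diet fur",
}
index = build_inverted_index(docs)
print(index["galaxy"])  # ['A', 'B']
```

Storing doc ids in a set is one simple way to avoid listing a document twice for the same token, which is a small instance of the redundancy reduction mentioned in the last step.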

Inverted Indexes
We have seen “Vector files” conceptually.
An Inverted File is a vector file “inverted” so that rows become columns and columns become rows
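
Read literally, "rows become columns" is a matrix transpose. The sketch below assumes a tiny binary vector file with hypothetical documents A, B, C and three terms; the variable names are invented for the illustration.

```python
# Hypothetical vector file: one row per document, one column per term.
terms = ["nova", "galaxy", "film"]
doc_ids = ["A", "B", "C"]
vector_file = [
    [1, 1, 0],  # document A
    [0, 1, 1],  # document B
    [0, 0, 1],  # document C
]

# Transpose: one row per term, one column per document.
inverted_file = [list(column) for column in zip(*vector_file)]

# Reading each term row as a list of documents gives the postings.
postings = {
    term: [doc for doc, present in zip(doc_ids, row) if present]
    for term, row in zip(terms, inverted_file)
}
print(postings["galaxy"])  # ['A', 'B']
```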

Vector Space Model
Documents are represented as vectors in term space
– Terms are usually stems
– Documents represented by binary vectors of terms
Queries represented the same as documents
Query and Document weights are based on length and direction of their vector
A vector distance measure between the query and documents is used to rank retrieved documents
This makes partial matching possible
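
A small sketch of ranking with partial matching follows. It assumes the cosine of the angle between vectors as the vector measure and represents each binary vector as a dict from term to weight; the document ids D1 to D3 and their terms are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, doc_id) pairs, best match first."""
    scored = [(cosine(query, vector), doc_id) for doc_id, vector in docs.items()]
    return sorted(scored, reverse=True)

# Hypothetical binary document vectors.
docs = {
    "D1": {"star": 1, "galaxy": 1, "nova": 1},
    "D2": {"star": 1, "diet": 1, "film": 1},
    "D3": {"fur": 1},
}
query = {"star": 1, "diet": 1}

for score, doc_id in rank(query, docs):
    print(doc_id, round(score, 3))
```

D1 contains only one of the two query terms but still receives a nonzero score and is ranked above D3, which is the partial matching the slide refers to.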

Documents in 3D Space
Assumption: Documents that are “close together” in space are similar in meaning.

Assigning Weights
tf x idf measure:
– term frequency (tf)
– inverse document frequency (idf): a way to deal with the problems of the Zipf distribution
Goal: assign a tf x idf weight to each term in each document
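
The slide does not give a formula, so the sketch below assumes one common variant: weight = tf * log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: tf * log(N / df)}} for a dict of raw texts."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    weights = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # raw term frequency within this document
        weights[doc_id] = {
            term: freq * math.log(n_docs / df[term]) for term, freq in tf.items()
        }
    return weights

# Hypothetical collection: "star" occurs in every document.
docs = {
    "A": "star star galaxy",
    "B": "star film",
    "C": "star diet diet",
}
weights = tf_idf(docs)
print(weights["A"])
```

Under this variant a term that occurs in every document, such as "star" here, gets weight zero no matter how often it appears, while a term concentrated in one document is weighted up.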

Similarity Measures
Simple matching (coordination level match)
Dice’s Coefficient
Jaccard’s Coefficient
Cosine Coefficient
Overlap Coefficient
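
The formulas did not survive in this transcript, so the sketch below assumes the usual set-based (binary term) forms of the five measures, with X and Y the term sets of the query and the document. The function name and the sample sets are hypothetical.

```python
import math

def similarity_measures(x, y):
    """Set-based forms of the five measures for term sets x and y."""
    x, y = set(x), set(y)
    if not x or not y:
        return {name: 0.0 for name in
                ("simple_matching", "dice", "jaccard", "cosine", "overlap")}
    common = len(x & y)
    return {
        "simple_matching": common,                       # |X n Y|
        "dice": 2 * common / (len(x) + len(y)),          # 2|X n Y| / (|X| + |Y|)
        "jaccard": common / len(x | y),                  # |X n Y| / |X u Y|
        "cosine": common / math.sqrt(len(x) * len(y)),   # |X n Y| / sqrt(|X||Y|)
        "overlap": common / min(len(x), len(y)),         # |X n Y| / min(|X|, |Y|)
    }

query = {"star", "diet"}
document = {"star", "galaxy", "nova", "diet"}
scores = similarity_measures(query, document)
print(scores)
```

Simple matching only counts shared terms; the other four normalize that count in different ways, so a long document is not favored merely for containing more terms.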

Computing Similarity Scores

Text Clustering
Clustering is “The art of finding groups in data.” (Kaufman and Rousseeuw, Finding Groups in Data, 1990)
[Figure: documents plotted on the axes Term 1 and Term 2]

Types of Clustering
Hierarchical vs. Flat
Hard vs. Soft vs. Disjunctive (set vs. uncertain vs. multiple assignment)
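
One way to picture the three assignment types is to start from a document's distances to each cluster center. The sketch below is only illustrative: the function names, the exponential weighting used for the soft case, and the distance threshold used for the disjunctive case are all assumptions.

```python
import math

def hard_assignment(distances):
    """Hard: the document belongs to exactly one cluster, the nearest."""
    return min(range(len(distances)), key=lambda c: distances[c])

def soft_assignment(distances):
    """Soft: a degree of membership in every cluster, summing to 1."""
    weights = [math.exp(-d) for d in distances]  # assumed weighting scheme
    total = sum(weights)
    return [w / total for w in weights]

def disjunctive_assignment(distances, threshold):
    """Disjunctive: full membership in every cluster that is close enough."""
    return [c for c, d in enumerate(distances) if d <= threshold]

# Hypothetical distances from one document to three cluster centers.
distances = [0.5, 0.7, 3.0]
print(hard_assignment(distances))              # 0
print(soft_assignment(distances))
print(disjunctive_assignment(distances, 1.0))  # [0, 1]
```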

Flat Clustering
K-Means
– Hard
– O(n)
EM (soft version of K-Means)

K-Means Clustering
1 Create a pair-wise similarity measure
2 Find K centers
3 Assign each document to nearest center, forming new clusters
4 Repeat 3 as necessary
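
The steps above can be sketched as follows. Several choices here are assumptions rather than anything stated on the slide: squared Euclidean distance stands in for the similarity measure, the first K points serve as initial centers, and each round recomputes the centers as cluster means before reassigning, which is the usual reading of "repeat". The sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance, used here as the dissimilarity measure."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))

def k_means(points, k, max_iters=100):
    """Return (assignment, centers); assignment[i] is the cluster of points[i]."""
    centers = [tuple(p) for p in points[:k]]  # assumed initialization
    assignment = None
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so stop repeating
        assignment = new_assignment
        # Recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean(members)
    return assignment, centers

# Hypothetical 2D document vectors forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.9, 1.1), (4.9, 5.1)]
assignment, centers = k_means(points, k=2)
print(assignment)  # [0, 1, 0, 1, 0, 1]
```

The result depends on the initial centers: if both starting centers fall in the same group, this procedure can settle on a poor split, so practical implementations usually try several initializations.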

Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95
Cluster sets of documents into general “themes”, like a table of contents
Display the contents of the clusters by showing topical terms and typical titles
User chooses subsets of the clusters and re-clusters the documents within
Resulting new groups have different “themes”

Scatter/Gather
Example: query on “star”
Encyclopedia text
14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous
Clustering and re-clustering is entirely automated

Another use of clustering
Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
“Project” these onto a 2D graphical representation:

Clustering Multi-Dimensional Document Space
Wise, Thomas, Pennock, Lantrip, Pottier, Schur, Crow, “Visualizing the Non-Visual: Spatial analysis and interaction with Information from text documents,” 1995

Clustering Multi-Dimensional Document Space
Wise et al., 1995

Concept “Landscapes”
Browsing without search
[Figure: concept landscape with regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals]
(e.g., Xia Lin, “Visualization for the Document Space,” 1992)
Based on Kohonen feature maps; See

More examples of information visualization
Stuart Card, Jock Mackinlay, Ben Shneiderman (eds.), Readings in Information Visualization (San Francisco: Morgan Kaufmann, 1999)
Martin Dodge,

Clustering
Advantages:
– See some main themes
Disadvantage:
– Many ways documents could group together are hidden
Thinking point: what is the relationship to classification systems and faceted queries?
e.g.,
f1: (osteoporosis OR ‘bone loss’)
f2: (drugs OR pharmaceuticals)
f3: (prevention OR cure)
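
To make the faceted query concrete, the sketch below evaluates the three facets against documents represented as sets of index terms. The slide does not say how the facets are combined; requiring every facet to match (an AND across facets) is an assumption, as are the function names and the two sample documents.

```python
def matches_facet(doc_terms, facet):
    """A facet is an OR: any one of its terms is enough."""
    return any(term in doc_terms for term in facet)

def matches_query(doc_terms, facets):
    """Assumed combination: a document must satisfy every facet (AND)."""
    return all(matches_facet(doc_terms, facet) for facet in facets)

facets = [
    {"osteoporosis", "bone loss"},   # f1
    {"drugs", "pharmaceuticals"},    # f2
    {"prevention", "cure"},          # f3
]

# Hypothetical documents, each reduced to a set of index terms and phrases.
doc_1 = {"osteoporosis", "drugs", "prevention", "calcium"}
doc_2 = {"bone loss", "cure", "exercise"}

print(matches_query(doc_1, facets))  # True
print(matches_query(doc_2, facets))  # False: nothing matches f2
```

Unlike a cluster, each facet is chosen by the searcher, so the grouping dimensions are explicit rather than discovered from the data.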

More information on content analysis and clustering
Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999)
Daniel Jurafsky and James Martin, Speech and Language Processing (Upper Saddle River, NJ: Prentice Hall, 2000)

And now on to…
Vector Space Ranking
Probabilistic Models and Ranking