Content Analysis
Prof. Marti Hearst, SIMS 202, Lecture 15

Review: Content Analysis
- Content analysis: transformation of raw text into more computationally useful forms
- Words in text collections exhibit interesting statistical properties
  - Zipf distribution
  - Word co-occurrences are non-independent
- Text documents are transformed to vectors
  - Pre-processing
  - Vectors represent a multi-dimensional space

Zipf Distribution
- Rank = a word's position when words are ordered by frequency of occurrence
- The product of a word's frequency (f) and its rank (r) is approximately constant: f * r ≈ C
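
As a rough check of the rank-frequency relationship, one can sort a corpus's words by frequency and watch f * r stay roughly constant down the list. A minimal sketch, assuming a plain-text file named corpus.txt (the filename and the simple tokenizer are placeholders):

```python
from collections import Counter
import re

with open("corpus.txt") as fh:
    tokens = re.findall(r"[a-z']+", fh.read().lower())

counts = Counter(tokens)
# Rank words by descending frequency; if Zipf's law holds,
# f * r stays in the same ballpark down the list.
for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    print(f"rank={rank:2d}  word={word:12s}  f={freq:6d}  f*r={freq * rank}")
```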

Consequences of Zipf
- There are always a few very frequent tokens that are not good discriminators
  - Called "stop words" in IR
  - Usually correspond to the linguistic notion of "closed-class" words
  - English examples: to, from, on, and, the, ...
  - Grammatical classes that don't take on new members
- There are always a large number of tokens that occur only once (or nearly so) and can mess up algorithms
- Medium-frequency words are the most descriptive
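
A hedged sketch of the frequency-based filtering this suggests: drop the most frequent tokens (stop-word-like) and the rarest ones, keeping the medium-frequency band. The thresholds here are illustrative, not prescribed by the lecture:

```python
from collections import Counter

def medium_frequency_vocab(tokens, top_k=100, min_count=2):
    """Keep the medium-frequency band: drop the top_k most frequent
    tokens (stop-word-like) and tokens seen fewer than min_count
    times. Both cutoffs are illustrative defaults."""
    counts = Counter(tokens)
    frequent = {w for w, _ in counts.most_common(top_k)}
    return {w for w, c in counts.items()
            if c >= min_count and w not in frequent}
```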

Word Frequency vs. Resolving Power (from van Rijsbergen 79)
[Figure: resolving power peaks at medium frequencies; the most frequent words are not the most descriptive.]

Statistical Independence vs. Dependence
- How likely is token W to appear, given that we've seen token V?
- Non-independence implies that tokens that co-occur may be related in some meaningful way
- Very simple corpus-processing algorithms can produce meaningful results

Interesting Associations with "Doctor" (AP Corpus, N=15 million, Church & Hanks 89)

Un-Interesting Associations with "Doctor" (AP Corpus, N=15 million, Church & Hanks 89)

Computing Co-occurrence
- Compute counts over a sliding window of words
[Figure: a fixed-size window sliding over a word sequence w1 ... w21.]
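
One way to realize this: count, for every token, the tokens appearing within a fixed window, then rank the pairs by an association score. A minimal sketch using pointwise mutual information in the spirit of Church & Hanks; the window size and the unsmoothed, simplified score are illustrative choices:

```python
import math
from collections import Counter

def cooccurrence(tokens, window=5):
    """Count ordered pairs of tokens that appear within `window`
    positions of each other (window size is an illustrative choice)."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            pairs[(w, v)] += 1
    return pairs

def association(pairs, tokens):
    """Pointwise mutual information, the association measure used by
    Church & Hanks, in a simplified, unsmoothed form."""
    unigrams, n = Counter(tokens), len(tokens)
    return {
        (w, v): math.log2((c / n) / ((unigrams[w] / n) * (unigrams[v] / n)))
        for (w, v), c in pairs.items()
    }
```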

Document Vectors
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
  - A vector is like an array of floating-point numbers
  - Has direction and magnitude
- Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse

Document Vectors (example)
[Table: term counts for the terms nova, galaxy, heat, h'wood, film, role, diet, fur across document ids A through I.]
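
A minimal sketch of such sparse vectors, using invented toy documents that echo the slide's terms; only the nonzero entries are stored, since a full vector would hold a slot for every term in the collection:

```python
from collections import Counter

# Toy documents; the texts are invented for illustration.
docs = {
    "A": "nova galaxy nova heat",
    "B": "film role film h'wood",
    "C": "diet fur diet fur",
}

# Sparse bag-of-words vectors: {term: count} per document.
vectors = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
vocabulary = sorted(set().union(*vectors.values()))
```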

Topics for Today
- Multiple-dimensionality of document space
- Automatic methods for:
  - Clustering
  - Creating thesaurus terms
- Review and sample questions for midterm

Documents in 3D Space
Assumption: documents that are "close together" in space are similar in meaning.

Document Similarity
[Venn diagram over the terms diet, hot, star, fur; numbers represent how many documents share the indicated subset of terms.]
- How to represent similarity among five terms? Six?

Document Space has High Dimensionality
- What happens beyond three dimensions?
- Similarity still has to do with how many tokens are shared in common
- With more terms, it is harder to understand which subsets of words are shared among similar documents
- One approach to handling high dimensionality: clustering

Text Clustering
- Finds overall similarities among groups of documents
- Finds overall similarities among groups of tokens
- Picks out some themes, ignores others

Text Clustering
Clustering is "the art of finding groups in data." (Kaufman and Rousseeuw)
[Scatter plot of documents along two term axes, Term 1 and Term 2.]

Pair-wise Document Similarity
How to compute document similarity?
[Table: term counts for nova, galaxy, heat, h'wood, film, role, diet, fur across documents A through D.]

Pair-wise Document Similarity (no normalization, for simplicity)
[Same table; similarity is computed as the inner product of the two documents' term-count vectors.]

Pair-wise Document Similarity (cosine normalization)
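
Without normalization, the similarity of two documents is the inner product of their term vectors, sum over t of w_i,t * w_j,t; cosine normalization divides that sum by the product of the two vectors' lengths, so long documents don't dominate. A minimal sketch over sparse Counter-style vectors like those built earlier:

```python
import math

def dot(u, v):
    # Sum over the terms the two sparse vectors share.
    return sum(weight * v.get(term, 0) for term, weight in u.items())

def cosine(u, v):
    # Normalize by vector magnitudes so document length cancels out.
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

# e.g. cosine(vectors["A"], vectors["B"]) with the vectors built above
```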

Document/Document Matrix
[Matrix of pairwise document similarities.]

Agglomerative Clustering
[Animated dendrogram: documents A through I are merged bottom-up into successively larger clusters.]
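
A minimal sketch of the bottom-up procedure, assuming some pairwise similarity function sim; single-link merging (similarity of the closest pair of members) is an illustrative choice, and other link functions work too:

```python
def agglomerate(items, sim, k):
    """Bottom-up clustering: start with singleton clusters and
    repeatedly merge the two most similar clusters until k remain."""
    clusters = [[x] for x in items]
    while len(clusters) > k:
        best_i, best_j, best_s = 0, 1, float("-inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link similarity between clusters i and j.
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s > best_s:
                    best_i, best_j, best_s = i, j, s
        clusters[best_i].extend(clusters.pop(best_j))
    return clusters
```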

K-Means Clustering
1. Create a pair-wise similarity measure
2. Find K centers using agglomerative clustering
   - Take a small sample
   - Group bottom-up until K groups are found
3. Assign each document to the nearest center, forming new clusters
4. Repeat step 3 as necessary
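
This recipe, seeding K-means by agglomerating a small sample, matches the Buckshot idea of Cutting et al. A hedged sketch, assuming dense list vectors, a similarity function over them, and the agglomerate function from the previous sketch:

```python
import random

def mean(vectors):
    # Component-wise average of dense list vectors.
    return [sum(xs) / len(vectors) for xs in zip(*vectors)]

def kmeans(docs, k, sim, iterations=10, sample_size=50):
    # Steps 1-2: seed the k centers by agglomerating a small sample.
    sample = random.sample(docs, min(sample_size, len(docs)))
    centers = [mean(c) for c in agglomerate(sample, sim, k)]
    for _ in range(iterations):
        # Step 3: assign each document to its nearest center.
        clusters = [[] for _ in range(k)]
        for d in docs:
            clusters[max(range(k), key=lambda i: sim(d, centers[i]))].append(d)
        # Step 4: recompute centers from the new clusters and repeat.
        centers = [mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters
```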

Scatter/Gather (Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95)
- Cluster sets of documents into general "themes", like a table of contents
- Display the contents of the clusters by showing topical terms and typical titles
- User chooses subsets of the clusters and re-clusters the documents within
- Resulting new groups have different "themes"

S/G Example: query on "star" (encyclopedia text)
- 14 sports
- 8 symbols
- 47 film, tv
- 68 film, tv (p)
- 7 music
- 97 astrophysics
- 67 astronomy (p)
- 12 stellar phenomena
- 10 flora/fauna
- 49 galaxies, stars
- 29 constellations
- 7 miscellaneous
Clustering and re-clustering is entirely automated.

Another Use of Clustering
- Use clustering to map the entire huge multi-dimensional document space into a huge number of small clusters
- "Project" these onto a 2D graphical representation
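
A hedged sketch of one standard way to do such a projection, via truncated SVD onto the top two principal axes; the slide does not commit to a particular projection method:

```python
import numpy as np

def project_2d(points):
    """Project high-dimensional points (e.g., cluster centers) onto
    their top two principal axes for 2D plotting."""
    X = np.asarray(points, dtype=float)
    X = X - X.mean(axis=0)            # center the cloud
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T               # 2D coordinates for plotting
```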

Clustering Multi-Dimensional Document Space
[Images from Wise et al. 95]

Concept "Landscapes" (e.g., Lin, Chen, Wise et al.)
[Figure: a document landscape with regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals.]
Limitations:
- Too many concepts, or too coarse
- Single concept per document
- No titles
- Browsing without search

Clustering
- Advantage:
  - See some main themes
- Disadvantage:
  - Many ways documents could group together are hidden
- Thinking point: what is the relationship to classification systems and facets?