
Document Clustering
Carl Staelin
Lecture 7, Information Retrieval and Digital Libraries

Motivation
It is hard to rapidly understand a big bucket of documents. Humans look for patterns, and are good at pattern matching, but "random" collections of documents don't have a recognizable structure. Clustering documents into recognizable groups makes it easier to see patterns, and lets the reader rapidly eliminate irrelevant clusters.

Basic Idea
Choose a document similarity measure.
Choose a cluster cost or similarity criterion.
Group like documents into clusters with minimal cluster cost.
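For concreteness, here is a minimal sketch of that recipe's first step in Python, assuming TF-IDF vectors with cosine similarity as the document similarity measure (one common choice; the slides do not fix one) and scikit-learn as the toolkit:

```python
# Minimal sketch: TF-IDF document vectors + cosine similarity.
# scikit-learn is an assumed toolkit; the docs list is toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "clustering groups similar documents together",
    "documents in the same cluster should be similar",
    "the weather today is sunny and warm",
]

vectors = TfidfVectorizer().fit_transform(docs)  # one TF-IDF vector per document
sim = cosine_similarity(vectors)                 # pairwise similarity matrix
print(sim.round(2))  # docs 0 and 1 score higher with each other than with doc 2
```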

Cluster Cost Criteria
Sum-of-squared-error: $\mathrm{Cost} = \sum_i \|x_i - \bar{x}\|^2$, where $\bar{x}$ is the cluster mean.
Average squared distance: $\mathrm{Cost} = \frac{1}{n^2} \sum_i \sum_j \|x_i - x_j\|^2$
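Written out in code, the two criteria might look like this; a NumPy sketch for a single cluster given as an (n, d) array of points:

```python
import numpy as np

def sum_of_squared_error(X):
    """Cost = sum_i ||x_i - mean||^2."""
    return np.sum(np.linalg.norm(X - X.mean(axis=0), axis=1) ** 2)

def average_squared_distance(X):
    """Cost = (1/n^2) * sum_i sum_j ||x_i - x_j||^2."""
    n = len(X)
    diffs = X[:, None, :] - X[None, :, :]   # (n, n, d) pairwise differences
    return np.sum(np.linalg.norm(diffs, axis=2) ** 2) / n**2

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(sum_of_squared_error(X), average_squared_distance(X))
```

Note that the two criteria are closely related: the average squared distance equals 2/n times the sum-of-squared-error, so minimizing one minimizes the other.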

Cluster Similarity Measure
Measures the similarity of two clusters $C_i$, $C_j$ (with $n_i$, $n_j$ members):
1. $d_{\min}(C_i, C_j) = \min_{x_i \in C_i,\, x_j \in C_j} \|x_i - x_j\|$
2. $d_{\max}(C_i, C_j) = \max_{x_i \in C_i,\, x_j \in C_j} \|x_i - x_j\|$
3. $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x_i \in C_i} \sum_{x_j \in C_j} \|x_i - x_j\|$
4. $d_{\mathrm{mean}}(C_i, C_j) = \bigl\| \frac{1}{n_i} \sum_{x_i \in C_i} x_i - \frac{1}{n_j} \sum_{x_j \in C_j} x_j \bigr\|$
5. …
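A hedged NumPy sketch of the four measures, with Ci and Cj as (n_i, d) and (n_j, d) arrays of points; the helper name pairwise is illustrative:

```python
import numpy as np

def pairwise(Ci, Cj):
    # (n_i, n_j) matrix of distances between every pair of points
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

def d_min(Ci, Cj):  return pairwise(Ci, Cj).min()    # single link
def d_max(Ci, Cj):  return pairwise(Ci, Cj).max()    # complete link
def d_avg(Ci, Cj):  return pairwise(Ci, Cj).mean()   # average link
def d_mean(Ci, Cj):                                  # distance between centroids
    return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))
```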

Iterative Clustering
Assign points to initial k clusters (often by random assignment).
Until done:
    Select a candidate point x in cluster c.
    Find the "best" cluster c' for x.
    If c ≠ c', move x to c'.
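A minimal runnable sketch of this loop, assuming the "best" cluster for x is the one with the nearest centroid (a k-means-style choice; the slide leaves the criterion open):

```python
import numpy as np

def iterative_cluster(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))   # random initial assignment
    for _ in range(iters):
        # recompute centroids (assumes no cluster empties out;
        # a real implementation would guard against that)
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # move every point to its nearest centroid
        best = np.argmin(
            np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
        if np.array_equal(best, labels):    # done: no point moved
            break
        labels = best
    return labels
```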

Iterative Clustering
The user must pre-select the number of clusters k, and often the "correct" number is not known in advance. The quality of the outcome also depends heavily on the quality of the initial assignment. One possibility is to use some other algorithm to create a good initial assignment (the hybrid approach below).

Hierarchical Agglomerative Clustering
Create N single-document clusters.
For i in 1..N−1 (or until the desired number of clusters remains):
    Merge the two clusters with greatest similarity.
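A naive O(n³) sketch of the algorithm under single-link (d_min) similarity, stopping when k clusters remain rather than building the full hierarchy:

```python
import numpy as np

def hac(X, k):
    clusters = [[i] for i in range(len(X))]   # N single-document clusters
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link distance between clusters a and b
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)        # merge the closest pair
    return clusters                           # list of lists of point indices
```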

Hierarchical Agglomerative Clustering
Hierarchical agglomerative clustering gives a hierarchy of clusters. This makes it easier to explore the set of possible k values and choose the best number of clusters. (Figure: the same hierarchy cut into 3, 4, and 5 clusters.)
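In practice one usually builds the hierarchy once and then "cuts" it at several levels to compare candidate k values; a sketch using SciPy's hierarchical-clustering routines (an assumed toolkit, not one named in the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((20, 5))   # toy data: 20 points in 5-D
Z = linkage(X, method="average")               # full merge tree, built once
for k in (3, 4, 5):                            # the cuts sketched on the slide
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, labels)
```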

High density variations
(Figure: a dataset whose clusters vary in density, with the intuitively "correct" clustering marked.)

High density variations
(Figure: the intuitively "correct" clustering side by side with the HAC-generated clusters.)

Hybrid
Combine HAC and iterative clustering:
Assign points to initial clusters using HAC.
Until done:
    Select a candidate point x in cluster c.
    Find the "best" cluster c' for x.
    If c ≠ c', move x to c'.
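A sketch of the hybrid, reusing the illustrative hac() helper from the HAC sketch above to seed the iterative refinement in place of a random assignment:

```python
import numpy as np

def hybrid(X, k, iters=100):
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(hac(X, k)):   # initial clusters from HAC
        labels[members] = c
    for _ in range(iters):                    # then refine iteratively
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        best = np.argmin(
            np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
        if np.array_equal(best, labels):      # converged: no point moved
            break
        labels = best
    return labels
```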

Other Algorithms
Support Vector Clustering
Information Bottleneck
…
