DOCUMENT CLUSTERING

Clustering  Automatically group related documents into clusters.  Example  Medical documents  Legal documents  Financial documents

Uses of Clustering
If a collection is well clustered, we can search only the cluster that will contain the relevant documents. Searching a smaller collection should improve both effectiveness and efficiency.

Clustering Algorithms
– Hierarchical Agglomerative Clustering: requires a pre-computed doc-doc similarity matrix.
– Clustering without a pre-computed doc-doc similarity matrix (e.g. one-pass and Buckshot, covered later).

Hierarchical Clustering Algorithms
– Single Link
– Complete Linkage
– Group Average
– Ward's Method

Hierarchical Agglomerative
1. Create an NxN doc-doc similarity matrix.
2. Each document starts as a cluster of size one.
3. Do until there is only one cluster:
   – combine the two clusters with the greatest similarity
   – update the doc-doc matrix
A code sketch of this loop follows.
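As a concrete illustration, here is a minimal Python sketch of the loop above. It is a sketch under assumptions, not the slides' own code: the pair-keyed similarity dictionary and the linkage parameter are our scaffolding (min gives complete linkage, max gives single link).

    def hac(sim, linkage=min):
        """Agglomerative clustering over a precomputed doc-doc similarity
        matrix sim (list of lists). Returns the merge history, which is
        exactly the dendrogram."""
        n = len(sim)
        active = set(range(n))                       # every doc starts as its own cluster
        S = {(i, j): sim[i][j] for i in range(n) for j in range(i + 1, n)}
        merges = []
        while len(active) > 1:
            a, b = max(S, key=S.get)                 # most similar pair of clusters
            merges.append((a, b, S.pop((a, b))))
            active.remove(b)                         # fold cluster b into cluster a
            for c in active - {a}:                   # update the merged cluster's row
                new = (min(a, c), max(a, c))
                old = (min(b, c), max(b, c))
                S[new] = linkage(S[new], S.pop(old))
        return merges

With five documents, as in the example below, the loop performs 5 - 1 = 4 merges.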

Example
Consider A, B, C, D, E as documents with pairwise similarities given in a 5x5 doc-doc matrix (the individual matrix values are not reproduced in this transcript). The highest-similarity pair is B-E = 14.

Example
So let's cluster E and B. We now have the structure: the cluster BE plus the singletons A, C, and D.

Example
Now we update the doc-doc matrix; its rows and columns are now A, BE, C, D.
Note: to compute the entries for BE, given SC(A,B) = 2 and SC(A,E) = 4:
– SC(A,BE) = 4 if we are using single link (take the max)
– SC(A,BE) = 2 if we are using complete linkage (take the min)
– SC(A,BE) = 3 if we are using group average (take the average)
Note: C-BE is now the highest entry in the matrix.
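The three update rules are easy to check with a small helper (a sketch; the function name merged_similarity is ours):

    def merged_similarity(sims, method):
        """Similarity between a cluster X and a newly merged cluster,
        given X's similarities to each of the merged members."""
        if method == "single":        # single link: take the max
            return max(sims)
        if method == "complete":      # complete linkage: take the min
            return min(sims)
        if method == "average":      # group average: take the average
            return sum(sims) / len(sims)
        raise ValueError(f"unknown method: {method}")

    # The slide's numbers: SC(A,B) = 2, SC(A,E) = 4
    print(merged_similarity([2, 4], "single"))    # -> 4
    print(merged_similarity([2, 4], "complete"))  # -> 2
    print(merged_similarity([2, 4], "average"))   # -> 3.0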

Example
So let's cluster BE and C. We now have the structure: the cluster BCE plus the singletons A and D.

Example
Now we update the doc-doc matrix; its rows and columns are now A, BCE, D. Using complete linkage:
– To compute SC(A,BCE): SC(A,BE) = 2 and SC(A,C) = 7, so SC(A,BCE) = 2.
– To compute SC(D,BCE): SC(D,BE) = 2 and SC(D,C) = 4, so SC(D,BCE) = 2.
SC(D,A) = 9, which is greater than either SC(A,BCE) or SC(D,BCE), so we next cluster A and D.

Example
So let's cluster A and D. At this point only two clusters remain, AD and BCE, and the final step merges them.

Example
Now we have clustered everything.

Analysis  Hierarchical clustering requires:  Number of steps = number of document -1.  For each clustering step we must re-compute the DOC-DOC matrix.  We conclude that its very expensive and slow.

Other Clustering Algorithms
– One-pass
– Buckshot

One-pass Clustering
1. Choose a document and declare it to be a cluster of size one.
2. Compute the distance from this cluster to all remaining documents.
3. Add the closest document to the cluster.
4. If no document is really close (within some threshold), start a new cluster between the two closest remaining documents.
A code sketch of one common variant follows.
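Here is a Python sketch of a common one-pass variant (the cosine similarity, the mean-vector centroid, and the rule of starting a new singleton cluster on a threshold miss are our assumptions; the slides leave these details open):

    import numpy as np

    def one_pass(docs, threshold):
        """One-pass clustering sketch. docs: list of document vectors.
        Each cluster is represented by the centroid of its members; a
        document joins the most similar cluster if that similarity clears
        the threshold, otherwise it starts a new cluster."""
        def cos(u, v):
            return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

        centroids, members = [], []
        for i, d in enumerate(docs):
            if centroids:
                sims = [cos(d, c) for c in centroids]
                best = int(np.argmax(sims))
                if sims[best] >= threshold:
                    members[best].append(i)
                    # recompute the centroid as the mean of the member vectors
                    centroids[best] = np.mean([docs[j] for j in members[best]], axis=0)
                    continue
            centroids.append(np.asarray(d, dtype=float))  # start a new cluster
            members.append([i])
        return members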

Example
Choose node A as the first cluster. Now compute SC(A,B), SC(A,C), and SC(A,D). B is the closest, so we now cluster B with A.

Example
Now we compute the centroid of the cluster just formed (AB) and use it to compute the SC between the cluster and all remaining documents. Let's assume E, D, and C are all too far from AB (outside the threshold). So we choose one of these non-clustered elements and place it in a new cluster; let's choose E.

Example
Now we compute the distance from E to D and from E to C. E to D is closer, so we form a cluster of E and D.

Example
Now we compute the centroid of D and E and use it to compute SC(DE, C). It is within the threshold, so we now include C in this cluster.

Effect of Document Order
– With hierarchical clustering we get the same clusters every time.
– With one-pass clustering we get different clusters depending on the order in which we process the documents (demonstrated below).
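Using the one_pass sketch from above, the order sensitivity is easy to demonstrate (the three toy vectors are made up for illustration):

    import numpy as np

    docs = [np.array([1.0, 0.0]), np.array([0.8, 0.6]), np.array([0.6, 0.8])]
    print(one_pass(docs, threshold=0.75))                  # [[0, 1, 2]]  - one cluster
    print(one_pass(list(reversed(docs)), threshold=0.75))  # [[0, 1], [2]] - two clusters

In the first order, each new document is close enough to the growing cluster's centroid; reversed, the last document falls outside the threshold and ends up alone.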

Buckshot Clustering
A relatively recent technique (Cutting et al., 1992, where it was introduced for the Scatter/Gather system). The goal is to reduce run time to O(kn) instead of O(n^2), where k is the number of clusters and n is the number of documents. The idea: run HAC on a random sample of sqrt(kn) documents to obtain k seed clusters, then assign every remaining document to the nearest seed.
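A sketch of that idea in Python (cluster_sample stands in for any quadratic HAC routine, such as the hac sketch above adapted to return k groups; the cosine assignment step is our assumption):

    import math, random
    import numpy as np

    def buckshot(docs, k, cluster_sample):
        """Buckshot sketch. docs: list of document vectors.
        cluster_sample(sample, k) is assumed to return k groups of
        indices into sample. Because the sample holds only sqrt(k*n)
        documents, the quadratic HAC step costs O(kn) overall."""
        n = len(docs)
        sample_ids = random.sample(range(n), int(math.sqrt(k * n)))
        sample = [docs[i] for i in sample_ids]
        groups = cluster_sample(sample, k)       # O(sqrt(kn)^2) = O(kn)
        seeds = [np.mean([sample[i] for i in g], axis=0) for g in groups]
        clusters = [[] for _ in range(k)]
        for i, d in enumerate(docs):             # one linear pass over all docs
            sims = [float(d @ s) / (np.linalg.norm(d) * np.linalg.norm(s))
                    for s in seeds]
            clusters[int(np.argmax(sims))].append(i)
        return clusters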

Summary
Pros:
– Clustering can give users an overview of the contents of a document collection.
– It can also reduce the search space.
Cons:
– Expensive.
– Difficult to identify which cluster or clusters should be searched.
