
Incremental Clustering
Previous clustering algorithms worked in “batch” mode: all points were processed at essentially the same time. Some IR applications must cluster an incoming document stream (e.g., topic tracking). For these applications, we need incremental clustering algorithms.

Incremental Clustering Issues
- How to be efficient? Should all documents be cached?
- How to handle concept drift?
- How to reduce sensitivity to ordering?
Goals:
- minimize the maximum cluster diameter
- minimize the number of clusters, given a fixed maximum diameter

Incremental Clustering Model [Charikar et al. 1997]
Extends HAC as follows. Incremental clustering: “for an update sequence of n points in M, maintain a collection of k clusters such that as each one is presented, either it is assigned to one of the current k clusters or it starts off a new cluster while two existing clusters are merged into one.” Maintains a HAC over the points added up to the current time.
M. Charikar, C. Chekuri, T. Feder, R. Motwani. “Incremental Clustering and Dynamic Information Retrieval”, Proc. 29th Annual ACM Symposium on Theory of Computing, 1997.

Doubling Algorithm (α = β = 2)
1. Assign the first k+1 points to k+1 clusters, with each point as its own centroid; d1 = distance between the closest two points.
2. While more points remain:
   1. d_{t+1} = α · d_t
   2. Merge clusters until every old cluster is contained in some new cluster:
      1. Pick an arbitrary cluster; merge with it all clusters whose centers are within d_{t+1}.
      2. Remove the selected clusters from the old clusters.
      3. Calculate the centroid for the new cluster.
   3. Update clusters while the number of clusters <= k:
      1. Assign each new point to the closest cluster if it is within β · d_{t+1} of the center; otherwise create a new cluster.
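The steps above can be sketched in Python. This is a minimal sketch under assumed details not fixed by the slide: 2-D points, Euclidean distance, and (for simplicity) keeping each merge group's seed center rather than recomputing a true centroid as step 2.2.3 does.

```python
import math

def doubling_cluster(points, k, alpha=2.0, beta=2.0):
    """Minimal sketch of the doubling algorithm for incremental
    k-clustering. Clusters are (center, members) pairs; `points`
    is a sequence of 2-D tuples."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    stream = iter(points)
    # Step 1: seed k+1 singleton clusters; d1 = closest-pair distance.
    seeds = [next(stream) for _ in range(k + 1)]
    clusters = [(c, [c]) for c in seeds]
    d = min(dist(a, b) for i, a in enumerate(seeds) for b in seeds[i + 1:])

    for p in stream:
        # Step 2: while we hold too many clusters, raise the threshold
        # and greedily merge everything within it.
        while len(clusters) > k:
            d *= alpha                          # d_{t+1} = alpha * d_t
            remaining, merged = list(clusters), []
            while remaining:
                seed = remaining.pop(0)         # arbitrary cluster
                near = [c for c in remaining if dist(seed[0], c[0]) <= d]
                near_ids = {id(c) for c in near}
                remaining = [c for c in remaining if id(c) not in near_ids]
                members = seed[1] + [m for _, ms in near for m in ms]
                merged.append((seed[0], members))  # seed center kept
            clusters = merged
        # Step 3: assign the new point if close enough, else open a cluster.
        center, members = min(clusters, key=lambda c: dist(c[0], p))
        if dist(center, p) <= beta * d:
            members.append(p)
        else:
            clusters.append((p, [p]))
    return clusters
```

Note that, as in the slide, the result can temporarily hold k+1 clusters; the merge phase only runs when the next point arrives.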

Example: Plot -- Incremental [figure omitted]

Example: Doubling Merge (d2) [figure omitted]

Example: Doubling Update (d2) [three figures omitted, one per update step]

Example: Doubling Solution [figure omitted]

Clique Partition Background
A clique in G = (V, E) is a subset V′ of V such that every two vertices in V′ are joined by an edge in E. A clique partition of G is a partition of V into disjoint subsets V1, …, Vk such that, for 1 <= i <= k, the subgraph induced by Vi is a complete graph.

Clique Partition Algorithm
1. Assign the first k+1 points to k+1 clusters, with each point as its own centroid; d1 = distance between the closest two points.
2. While more points remain:
   1. d_{t+1} = α · d_t
   2. Merge clusters:
      1. Compute a minimum clique partition of the d_{t+1} threshold graph.
      2. Merge the clusters in each clique.
      3. In each new cluster, arbitrarily designate one of the existing centers as the center of the new cluster.
   3. Update clusters while the number of clusters <= k:
      1. Assign each new point to a cluster if it is within d_{t+1} of its center or sub-clusters; otherwise create a new cluster.
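Minimum clique partition is NP-hard, so the merge step needs a heuristic in practice. A minimal greedy sketch (the function name and greedy strategy are illustrative choices, not the paper's):

```python
def greedy_clique_partition(centers, threshold, dist):
    """Greedy stand-in for the minimum-clique-partition step: build
    the threshold graph on cluster centers (edge iff distance <=
    threshold), then peel off cliques, growing each one from the
    lowest-indexed unassigned vertex."""
    n = len(centers)
    adj = [{j for j in range(n)
            if j != i and dist(centers[i], centers[j]) <= threshold}
           for i in range(n)]
    unassigned = set(range(n))
    cliques = []
    while unassigned:
        v = min(unassigned)
        clique = {v}
        for u in sorted(unassigned - {v}):
            # u joins only if adjacent to every current clique member
            if all(u in adj[w] for w in clique):
                clique.add(u)
        unassigned -= clique
        cliques.append(sorted(clique))
    return cliques
```

For example, centers at 0, 1, 2, and 10 on a line with threshold 1.5 split into the cliques {0, 1}, {2}, {10}: vertex 2 cannot join the first clique because it is not adjacent to vertex 0.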

Example: CP Merge (d1) [figure omitted]

Example: CP Update (d2) [figure omitted]

Web Document Clustering Applications
- Organizing search engine retrieval results
  - Meta-search engine that hierarchically clusters results: Vivisimo
  - Meta-search engine that graphically displays clusters of results: Kartoo
- Detecting redundancy (e.g., mirror sites, or moved or re-formatted documents)
- User interest profiles (aka filtering)

Vivisimo: Result Organization

Kartoo: Visual Clustering

Detecting Mirrors/Subsumed Web Documents
Resemblance assesses the similarity between two documents A and B. Containment assesses the extent to which A is contained in B.
A.Z. Broder, S.C. Glassman, M.S. Manasse, G. Zweig, “Syntactic Clustering of the Web”, Proceedings of WWW6, 1997.

Computing R and C
S(D, w) is the set of all unique contiguous subsequences (shingles) of length w in document D; S(D) denotes S(D, w) for a fixed w. To reduce storage and computation, we can sample the shingles of each document:
- First s shingles: MIN_s(W)
- Every mth shingle: MOD_m(W)
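A minimal sketch of shingling, resemblance r, containment c, and MOD_m sampling. Word-level shingles and Python's built-in `hash` are stand-ins for the paper's fingerprinted shingles; real deployments use a fixed fingerprint function such as Rabin fingerprints.

```python
def shingles(text, w=4):
    """S(D, w): the set of all unique contiguous word subsequences
    of length w in document D (whitespace tokenization assumed)."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b):
    """r(A, B) = |S(A) & S(B)| / |S(A) | S(B)|: Jaccard similarity
    of the two shingle sets."""
    return len(a & b) / len(a | b)

def containment(a, b):
    """c(A, B) = |S(A) & S(B)| / |S(A)|: fraction of A's shingles
    that also appear in B."""
    return len(a & b) / len(a)

def mod_m_sample(shingle_set, m=25):
    """MOD_m sampling: keep only shingles whose hash is 0 mod m
    (built-in hash is randomized across runs; a fixed fingerprint
    would be used in practice)."""
    return {s for s in shingle_set if hash(s) % m == 0}
```

With w = 3, two nine-word sentences differing in one word share 4 of 10 distinct shingles, so r = 0.4, while a document always has containment 1.0 in itself.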

Estimating R & C from a Portion of a Document
Keep a sketch of each document D, consisting of F(D) and/or V(D).

Web Clustering with R & C
Parameters: w = 10, m = 25, s = 50?, threshold = .5
Pre-process documents:
1. For each doc, calculate a sketch.
2. Sort the (shingle, doc ID) pairs, removing lexically-equivalent and shingle-equivalent docs.
3. Compute the list of doc pairs together with their number of shared shingles, ignoring very common shingles.
4. Generate clusters:
   1. If r(A, B) > threshold, then add a link A-B.
   2. Produce the connected components using union-find.
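The final clustering step above can be sketched with a small union-find over doc IDs (hypothetical function name; `similar_pairs` holds the A-B links whose resemblance exceeded the threshold):

```python
def cluster_pairs(similar_pairs, docs):
    """Union-find sketch: each above-threshold link A-B merges the
    components of A and B; the surviving components are the clusters."""
    parent = {d: d for d in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b in similar_pairs:
        union(a, b)
    clusters = {}
    for d in docs:
        clusters.setdefault(find(d), set()).add(d)
    return list(clusters.values())
```

For example, links a-b and b-c over docs {a, b, c, d} yield the two clusters {a, b, c} and {d}: connectivity is transitive even when a and c were never directly compared.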

Web Clustering Results
- M web pages, 150 GBytes
- 600M shingles
- 3.6M clusters of 12.3M docs
- 2.1M clusters of 5.3M identical docs
- Took 10.5 CPU days to compute

Web Applications of Resemblance Clusters
- “Find URL similar to …”: relies on a fixed threshold and requires the URLs to have been processed.
- WWW Lost and Found: requires keeping some historical sketch info.
- Remove similar docs from search results.