Overview Of Clustering Techniques D. Gunopulos, UCR.


Clustering Data Clustering Algorithms –K-means and K-medoids algorithms –Density Based algorithms –Density Approximation Spatial Association Rules (Koperski et al, 95) Statistical techniques (Wang et al, 1997) Finding proximity relationships (Knorr et al, 96, 97)

Clustering Data The clustering problem: Given a set of objects, find groups of similar objects What is similar? Define an appropriate metric Applications in marketing, image processing, biology

Clustering Methods K-Means and K-medoids algorithms: –CLARANS, [Ng and Han, VLDB 1994] Hierarchical algorithms –CURE, [Guha et al, SIGMOD 1998] –BIRCH, [Zhang et al, SIGMOD 1996] –CHAMELEON, [Karypis et al, IEEE Computer, 32(8), 1999] Density based algorithms –DENCLUE, [Hinneburg, Keim, KDD 1998] –DBSCAN, [Ester et al, KDD 96] Clustering with obstacles, [Tung et al, ICDE 2001] Excellent survey: [Han et al., 2000]

K-means and K-medoids algorithms Minimize the sum of squared distances of points to their cluster representative Efficient iterative algorithms (each iteration is linear in the number of points)
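The iteration behind K-means can be sketched as follows. This is a minimal pure-Python illustration of Lloyd's algorithm, not the implementation referenced in the slides; the function name and the fixed iteration count are illustrative choices:

```python
import random

def kmeans(points, k, iters=20):
    """Lloyd's algorithm: alternate point assignment and centroid update.
    Each iteration is linear in the number of points (for fixed k and d)."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign p to the nearest center (squared Euclidean distance)
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        for i, c in enumerate(clusters):
            if c:  # recompute each non-empty cluster's centroid
                centers[i] = tuple(sum(x) / len(c) for x in zip(*c))
    return centers, clusters
```

A production version would add a convergence test and repeated restarts, since the result depends on the random initial centers.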

Problems with K-means type algorithms Clusters are approximately spherical High dimensionality is a problem The value of K is an input parameter

Hierarchical Clustering Quadratic (O(n^2)) algorithms Running time can be improved using sampling [Guha et al, SIGMOD 1998] [Kollios et al, ICDE 2001]

Density Based Algorithms Clusters are regions of space which have a high density of points Clusters can have arbitrary shapes
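The density-based idea can be sketched in the style of DBSCAN [Ester et al, KDD 96]. This simplified version scans all points for every neighborhood query, so it is quadratic, unlike implementations backed by a spatial index; `eps` and `min_pts` are the usual neighborhood radius and density threshold:

```python
def region_query(points, p, eps):
    """Indices of all points within distance eps of points[p] (including p)."""
    return [i for i, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[p], q)) <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # noise (may later be relabeled as a border point)
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:  # grow the cluster from its core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is itself a core point
                seeds.extend(j_neighbors)
        cluster += 1
    return labels
```

Because clusters are grown by chaining dense neighborhoods rather than by distance to a single center, they can take arbitrary shapes.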

Dimensionality Reduction Reduce the dimensionality of the space, while preserving distances Many techniques (SVD, MDS) May or may not help
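As one simple instance of such a technique, here is a random-projection sketch (SVD and MDS need a linear-algebra library, so this pure-Python example uses a random Gaussian projection, which approximately preserves pairwise distances by the Johnson-Lindenstrauss lemma; the function name is an illustrative choice):

```python
import math
import random

def random_projection(points, k):
    """Project d-dimensional points down to k dimensions with a random
    Gaussian matrix scaled by 1/sqrt(k); pairwise Euclidean distances
    are approximately preserved (Johnson-Lindenstrauss)."""
    d = len(points[0])
    R = [[random.gauss(0, 1) / math.sqrt(k) for _ in range(d)]
         for _ in range(k)]
    return [[sum(row[i] * p[i] for i in range(d)) for row in R]
            for p in points]
```

Unlike SVD, the projection is data-independent, which is exactly why it can only preserve distances approximately, never recover structure that lies in a specific low-dimensional subspace.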

Dimensionality Reduction Dimensionality reduction does not always work

Speeding up the clustering algorithms: Data Reduction Data Reduction: –approximate the original dataset using a small representation –ideally, the representation should fit in main memory –summarization, compression The accuracy loss must be as small as possible Run the clustering algorithms on the approximated dataset

Random Sampling as a Data Reduction Method Random Sampling is used as a data reduction method Idea: Draw a random sample of the dataset and run the clustering algorithm over the sample Used for clustering and association rule detection [Ng and Han 94] [Toivonen 96] [Guha et al 98] But: –For datasets that contain clusters with different densities, we may miss some sparse ones –For datasets with noise, we may include a significant amount of noise in our sample

A better idea: Biased Sampling Use biased sampling instead of random sampling In biased sampling, the probability that a point is included in the sample depends on the local density We can oversample or undersample regions of the dataset depending on the data mining task at hand

Example: NorthEast Dataset The NorthEast dataset: 130K postal addresses in the North-Eastern USA

Random Sample Random Sampling fails to find the clusters

Biased Sampling Biased Sampling finds the clusters

The Biased Sampling Technique Basic idea: –First compute an approximation of the density function of the dataset –Use the density function to define the bias for each point and perform the sampling [Kollios et al, ICDE 2001] [Domeniconi and Gunopulos, ICML 2001] [Palmer and Faloutsos, SIGMOD 2000]

Density Estimation We use kernels to approximate the probability density function (pdf) We scan the dataset once to draw an initial random sample and compute the standard deviation For each sample point we place a kernel; the approximate pdf is the sum of all kernels
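The kernel estimate described above can be sketched in one dimension with Gaussian kernels. The bandwidth `h` is hand-picked here; a real implementation would derive it from the sample's standard deviation, as the slide suggests:

```python
import math

def kde(sample, h):
    """Approximate a 1-d pdf as the normalized sum of Gaussian kernels,
    one centered at each sample point, with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2 * math.pi)
    def f(x):
        return sum(math.exp(-((x - s) / h) ** 2 / 2) for s in sample) / norm
    return f
```

Evaluating `f` near the sample's mass gives high density estimates, and far from it near-zero ones, which is exactly the signal the biased-sampling step needs.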

Kernel Estimator [figure: example of a kernel estimator]

The sampling step Let f(p) be the pdf value at the point p = (x1, x2, ..., xd) We define L(p) = f(p)^α, where α is a parameter We compute the normalization factor k in one scan: k = Σp f(p)^α

The sampling step (cont.) The sampling bias for a point p is proportional to: Bias(p) = b · f(p)^α / k, where b is the size of the sample and k the normalization factor In another scan we perform the sampling (two scans in total) We can combine the above two steps into one scan

The variable α If α = 0 then we have uniform random sampling: Bias(p) = b / N If α > 0 then regions with higher density are sampled at a higher rate If α < 0 then regions with higher density are sampled at a lower rate We can show that if α > -1, relative densities are preserved in the sample
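Putting the two scans together, the biased sampling step can be sketched as follows. The helper name `biased_sample` and its `pdf` argument are illustrative; the sketch also ignores that the inclusion probability b · f(p)^α / k can exceed 1 for extreme α, which a real implementation would clamp:

```python
import random

def biased_sample(points, pdf, alpha, b):
    """Draw a biased sample of expected size b: include point p with
    probability proportional to pdf(p) ** alpha.
    alpha = 0 reduces to uniform random sampling."""
    weights = [pdf(p) ** alpha for p in points]  # L(p) = f(p)^alpha
    k = sum(weights)                             # normalization factor (scan 1)
    # scan 2: include p with probability Bias(p) = b * f(p)^alpha / k
    return [p for p, w in zip(points, weights) if random.random() < b * w / k]
```

With a negative α, dense regions are thinned and sparse clusters survive in the sample, which is the behavior illustrated on the NorthEast dataset above.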

Biased vs Uniform Random Sampling [figures: the dataset (5 clusters); a 1000-point uniform random sample; a 1000-point biased sample with α = -0.5]