Presented by: GROUP 7 Gayathri Gandhamuneni & Yumeng Wang.

Presentation transcript:

AGENDA
- Problem Statement
- Motivation / Novelty
- Related Work & Our Contributions
- Proposed Approach
- Key Concepts
- Challenges
- Validation
- Results
- Conclusions
- Future Work

PROBLEM STATEMENT
Input:
- Two different clustering algorithms (DBSCAN, SaTScan)
- Same input dataset
- Criteria of comparison
Output:
- Result of comparison (data / graph)
Constraints:
- DBSCAN: no data about efficiency
- SaTScan software: 1 pre-defined shape
Objective:
- Usage scenarios: which algorithm can be used where?

MOTIVATION/NOVELTY
- Different clustering algorithms, categorized into different types
- Existing comparisons: algorithms of the same category
- No systematic way of comparison; biased comparisons
- No situation-based comparison: which to use where?
- No comparison between DBSCAN and SaTScan

RELATED WORK
Comparison of clustering algorithms
Same type of algorithms:
- Density based: DBSCAN & OPTICS
- Density based: DBSCAN & SNN
Different types of algorithms:
- DBSCAN vs. K-Means
- K-Means (centroid based) vs. Hierarchical, Expectation-Maximization (distance based)
- Our work: DBSCAN vs. SaTScan

PROPOSED APPROACH
Our approach:
- Two different types of clustering algorithms: DBSCAN and SaTScan
- Unbiased comparison
- Systematic: 3 factors and the same input datasets
  - Shape of the cluster
  - Statistical significance
  - Scalability

KEY CONCEPTS - 1
Clustering
- Task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups.
- Used in data mining, statistical analysis and many more fields.
Real-world applications:
- Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones
- Field robotics: robotic situational awareness, to track objects and detect outliers in sensor data

KEY CONCEPTS - 2
Types of clustering algorithms
- Connectivity based / Hierarchical. Core idea: objects are more related to nearby objects (distance) than to objects farther away.
- Centroid based. Core idea: clusters are represented by a central vector, which may not necessarily be a member of the data set. Ex: K-Means
- Distribution based. Core idea: clusters can be defined as objects belonging most likely to the same distribution.
- Density based. Core idea: clusters are areas of higher density than the remainder of the data set.

KEY CONCEPTS - DBSCAN
Density-based clustering
Arguments:
- Minimum number of points: MinPts
- Radius: Eps
Density = number of points within the specified radius (Eps)
Three types of points:
- Core point: at least MinPts points within Eps
- Border point: fewer than MinPts points within Eps, but in the neighborhood of a core point
- Noise point: neither a core point nor a border point
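The three point types can be illustrated with a minimal Python sketch. It is not the presenters' code: the function name classify_points, the sample coordinates, and the convention that a point counts as its own neighbor are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (hypothetical helper)."""
    # Indices of all points within eps of each point.
    # Assumption: the point itself is included in its own neighborhood.
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]

    labels = []
    for i, n in enumerate(neighbors):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in n):
            # Not dense enough itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


# Made-up sample data, not one of the datasets from the slides.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
print(classify_points(points, eps=1.0, min_pts=4))
```

With these values only (1, 1) has four points within Eps, so it is the single core point; its three direct neighbors are border points, and (0, 0) and (10, 10) are noise.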

EXAMPLE - DBSCAN Dataset 1 :

DBSCAN RESULTS - 1
DBSCAN output on dataset 1: Min-Neighbors = 3, Radius = 5. Number of clusters = 36

DBSCAN RESULTS - 2
DBSCAN output on dataset 1: Min-Neighbors = 7, Radius = 1. Number of clusters = 0

DBSCAN RESULTS - 3
DBSCAN output on dataset 1: Min-Neighbors = 20, Radius = 20. Number of clusters = 4

KEY CONCEPTS - 3
SaTScan: Spatial Scan Statistics
Input:
- Dataset
- Null hypothesis model
Procedure:
- Scanning window of a pre-defined shape
- Varying size of the window
- Calculate likelihood ratio => most likely clusters
- Test statistical significance (Monte Carlo sampling, 1000 runs)
Output:
- Clusters with p-value
  - Significant / primary
  - Insignificant / secondary
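The likelihood-ratio and significance steps can be sketched in Python as follows. This is not the SaTScan software. It assumes a Poisson null model (the slide only says "null hypothesis model") and uses Kulldorff's log-likelihood ratio for a single window; the function names poisson_llr and monte_carlo_p_value are hypothetical.

```python
from math import log


def poisson_llr(c, e, total):
    """Log-likelihood ratio of one scanning window under a Poisson model.

    c     -- observed number of points inside the window
    e     -- expected number inside the window under the null hypothesis
    total -- total number of points in the study area
    """
    if c <= e:
        return 0.0  # only windows with an excess of points are candidates
    inside = c * log(c / e)
    # The outside term vanishes when every point falls inside the window.
    outside = (total - c) * log((total - c) / (total - e)) if c < total else 0.0
    return inside + outside


def monte_carlo_p_value(observed, simulated):
    """Rank of the observed maximum LLR among the simulated maxima."""
    rank = 1 + sum(1 for s in simulated if s >= observed)
    return rank / (len(simulated) + 1)


# A window holding 30 of 100 points where 10 were expected.
print(poisson_llr(30, 10.0, 100))
# Observed statistic compared with maxima from three simulated null datasets.
print(monte_carlo_p_value(5.0, [1.0, 2.0, 3.0]))
```

In a full scan, the window with the largest ratio over all positions and sizes is the most likely cluster, and the same maximum is recomputed on each dataset simulated under the null hypothesis. With this ranking rule, the smallest p-value obtainable from 999 simulated datasets is 0.001.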

CHALLENGES
Tuning parameters (DBSCAN):
- Manual tuning to detect clusters
- Hard to set correct parameters
Design of appropriate datasets:
- To demonstrate the criteria of comparison

VALIDATION
Experiment:
- Assumptions based on theory
- Designing datasets and running experiments
- Able to validate the assumptions with results

CLUSTER SHAPE - DBSCAN Vs SatScan

STATISTICAL SIGNIFICANCE CSR Dataset points

RUNTIME – Number of Points - DBScan

RUNTIME – Number of Points - SATScan

RUNTIME – Number of Clusters – DB Vs SAT Datasets: 3000 points

RUNTIME – Number of Clusters – DBScan

RUNTIME – Number of Clusters – SATScan

CONCLUSIONS

S.No | Factor of comparison | DBSCAN | SaTScan
1 | Number of clusters not known beforehand | Yes | Yes
2 | Shape: data has different shaped clusters | Yes | No: only 1 shape of clusters (circle, ellipse, rectangle, ...)
3 | Runtime: how much time to form clusters? | Less runtime | More runtime: iterative approach to detect clusters, plus Monte Carlo sampling
4 | Scalability: how well it scales when data size is increased | Still manageable runtime; curse of dimensionality | Runtime ∝ size, number of clusters
5 | Statistical significance: how significant are the clusters detected? | No significance factor | Significance is at the core
6 | Noise: is noise allowed, or should all points be in a cluster? | Yes | Yes

FUTURE WORK
- Same project with real-world datasets
- Run more instances of the experiments
- Control over parameters
- Compare with other types of clustering algorithms

QUESTIONS?

BACKUP SLIDE - 1
DBSCAN requires two parameters: epsilon (eps) and minimum points (minPts). It starts with an arbitrary starting point that has not been visited and finds all the neighbor points within distance eps of it. If the number of neighbors is greater than or equal to minPts, a cluster is formed: the starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively. If the number of neighbors is less than minPts, the point is marked as noise; it can still join a cluster later as a border point if it lies within eps of a core point. Once a cluster is fully expanded (all points within reach are visited), the algorithm proceeds to iterate through the remaining unvisited points in the dataset.
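The procedure described above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the experiments: the function name dbscan, the brute-force neighbor search, and the sample points are assumptions, and an explicit work list replaces the recursion.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; NOISE (-1) marks noise points."""
    labels = [None] * len(points)  # None means "not visited yet"

    def region_query(i):
        # Brute-force search: all indices within eps of point i (itself included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled as a border point later
            continue
        labels[i] = cluster_id
        queue = list(neighbors)  # points still to be examined for this cluster
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # noise within reach of a core point: border
            if labels[j] is not None:
                continue  # already assigned; do not expand again
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster_id += 1
    return labels


# Made-up sample data: two small squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (20, 20)]
print(dbscan(points, eps=1.5, min_pts=3))
```

On this sample the two squares come out as clusters 0 and 1 and the isolated point is labeled -1. The brute-force region_query makes the sketch quadratic in the number of points; a spatial index would be the usual way to speed it up.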

BACKUP SLIDE 2 - CONCLUSIONS
DBSCAN works when:
- Clusters have the same density
- The number of clusters is not known beforehand
- Clusters have different shapes
- Not all points need to be in clusters (the concept of noise is present)
DBSCAN doesn't work when:
- Clusters have varying density
- The data is high dimensional: the quality of DBSCAN depends on Epsilon, and with Euclidean distance it suffers from the curse of dimensionality
TO DO

CLUSTER SHAPE - DBSCAN Results

SHAPE - SATSCAN RESULTS p-value: 0.001