Presented by: GROUP 7 Gayathri Gandhamuneni & Yumeng Wang.

1 Presented by: GROUP 7 Gayathri Gandhamuneni & Yumeng Wang

2 AGENDA Problem Statement Motivation / Novelty Related Work & Our Contributions Proposed Approach Key Concepts Validation Results Conclusions Future Work

4 PROBLEM STATEMENT
Input: Two different clustering algorithms (DBSCAN, SaTScan); the same input dataset; criteria of comparison
Output: Result of comparison (data / graph)
Constraints: DBSCAN: no data about efficiency; SaTScan software: 1 predefined shape
Objective: Usage scenarios: which algorithm can be used where?

6 MOTIVATION/NOVELTY
Different clustering algorithms, categorized into different types
Existing comparisons: algorithms of the same category
No systematic way of comparison; biased comparisons
No situation-based comparison: which to use where?
No comparison between DBSCAN and SaTScan

8 RELATED WORK
Comparison of clustering algorithms
Same type of algorithms:
  Density based: DBSCAN & OPTICS
  Density based: DBSCAN & SNN
Different types of algorithms:
  DBSCAN vs K-Means
  K-Means (centroid based) vs hierarchical, Expectation-Maximization (distance based)
  Our work: DBSCAN vs SaTScan

10 PROPOSED APPROACH
Our approach: two different types of clustering algorithms (DBSCAN, SaTScan)
Unbiased comparison
Systematic: 3 factors and the same input datasets
  Shape of the cluster
  Statistical significance
  Scalability

11 AGENDA Problem Statement Motivation / Novelty Related Work & Our Contributions Proposed Approach Key Concepts Challenges Validation Results Conclusions Future Work

12 KEY CONCEPTS - 1
Clustering: the task of grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other than to those in other groups.
Used in data mining, statistical analysis, and many other fields.
Real-world applications:
  Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones
  Field robotics: robotic situational awareness, to track objects and detect outliers in sensor data

13 KEY CONCEPTS - 2
Types of clustering algorithms:
  Connectivity based / hierarchical. Core idea: objects are more related to nearby objects (by distance) than to objects farther away
  Centroid based. Core idea: clusters are represented by a central vector, which is not necessarily a member of the data set. Ex: K-Means
  Distribution based. Core idea: clusters are defined as objects most likely belonging to the same distribution
  Density based. Core idea: clusters are areas of higher density than the remainder of the data set
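A small illustration of the centroid-based idea: the central vector of a cluster can be taken as the coordinate-wise mean of its points, and that mean need not coincide with any data point. The function name `centroid` and the sample points are made up for this sketch.

```python
def centroid(cluster):
    """Coordinate-wise mean of a list of equal-length point tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


# Hypothetical three-point cluster, chosen only for illustration.
cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
center = centroid(cluster)

print(center)             # the central vector of the cluster
print(center in cluster)  # whether it is an actual data point
```

K-Means alternates between computing such centroids and reassigning each point to its nearest centroid; only the centroid step is shown here.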

14 KEY CONCEPTS - DBSCAN
Density-based clustering
Arguments: minimum number of points (MinPts); radius (Eps)
Density = number of points within the specified radius (Eps)
Three types of points:
  Core point: at least MinPts points within Eps
  Border point: fewer than MinPts points within Eps, but in the neighborhood of a core point
  Noise point: neither a core point nor a border point
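The three point types can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code used in the project. It assumes Euclidean distance, counts a point as part of its own Eps-neighborhood, and treats a point as core when that neighborhood holds at least MinPts points (the convention used in the backup-slide description). The function name `classify_points` and the sample data are hypothetical.

```python
import math


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    # Eps-neighborhood of each point (assumption: the point itself is included).
    neighborhoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nbrs in enumerate(neighborhoods) if len(nbrs) >= min_pts}

    labels = []
    for i, nbrs in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            labels.append("border")   # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels


# Hypothetical toy data: a small group of points plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(points, eps=1.0, min_pts=4)
print(labels)
```

With these values only (1, 1) has four points within distance 1, so it is the single core point. Its direct neighbors become border points, while (0, 0) and (10, 10) have no core point within Eps and are noise.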

15 EXAMPLE - DBSCAN Dataset 1 :

16 DBSCAN RESULTS - 1 DBSCAN output on Dataset 1: Min-Neighbors = 3, Radius = 5; number of clusters = 36

17 DBSCAN RESULTS - 2 DBSCAN output on Dataset 1: Min-Neighbors = 7, Radius = 1; number of clusters = 0

18 DBSCAN RESULTS - 3 DBSCAN output on Dataset 1: Min-Neighbors = 20, Radius = 20; number of clusters = 4

19 KEY CONCEPTS - 3
SaTScan: spatial scan statistics
Input: dataset; null hypothesis model
Procedure:
  Scanning window of a predefined shape
  Vary the size of the window
  Calculate the likelihood ratio => most likely clusters
  Test statistical significance (Monte Carlo sampling, 1000 runs)
Output: clusters with p-values
  Significant / primary
  Insignificant / secondary
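The procedure can be sketched roughly as follows. This is not SaTScan's implementation, only a simplified illustration under stated assumptions: points lie in the unit square, the null hypothesis is uniform randomness, windows are circles centered on data points, the likelihood ratio follows a Poisson model, and edge effects are ignored. The names `max_llr` and `scan_p_value`, the radii, and the use of 99 Monte Carlo runs (instead of 1000, to keep the example fast) are choices made for this sketch.

```python
import math
import random


def max_llr(points, radii):
    """Largest Poisson log-likelihood ratio over all circular windows."""
    n_total = len(points)
    best = 0.0
    for center in points:
        for r in radii:
            inside = sum(1 for q in points if math.dist(center, q) <= r)
            # Expected count under the null (unit-square area is 1; edges ignored).
            expected = n_total * min(1.0, math.pi * r * r)
            if inside <= expected:
                continue  # only windows with an excess of points are of interest
            outside = n_total - inside
            llr = inside * math.log(inside / expected)
            if outside:
                llr += outside * math.log(outside / (n_total - expected))
            best = max(best, llr)
    return best


def scan_p_value(points, radii, runs=99, seed=0):
    """Monte Carlo p-value for the most likely cluster."""
    rng = random.Random(seed)
    observed = max_llr(points, radii)
    exceed = 0
    for _ in range(runs):
        # Simulate a dataset of the same size under the null hypothesis.
        simulated = [(rng.random(), rng.random()) for _ in points]
        if max_llr(simulated, radii) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (runs + 1)


# Hypothetical data: 40 background points plus a tight group of 20 points.
rng = random.Random(1)
background = [(rng.random(), rng.random()) for _ in range(40)]
dense = [(0.5 + rng.uniform(-0.02, 0.02), 0.5 + rng.uniform(-0.02, 0.02))
         for _ in range(20)]
observed, p_value = scan_p_value(background + dense, radii=[0.05, 0.1])
print(observed, p_value)
```

The p-value is the rank of the observed statistic among the simulated ones, so with `runs` replications the smallest reportable value is 1 / (runs + 1).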

21 CHALLENGES
Tuning parameters (DBSCAN): manual tuning to detect clusters; hard to set correct parameters
Design of appropriate datasets: to demonstrate the criteria of comparison

23 VALIDATION
Experiment
Assumptions based on theory
Designed datasets and ran the experiments
Able to validate the assumptions with the results

25 CLUSTER SHAPE - DBSCAN Vs SatScan

27 STATISTICAL SIGNIFICANCE CSR Dataset -1000 points

28 STATISTICAL SIGNIFICANCE CSR Dataset - 2000 points

29 RUNTIME – Number of Points - DBScan

30 RUNTIME – Number of Points - SATScan

31 RUNTIME – Number of Clusters – DB Vs SAT Datasets: 3000 points

32 RUNTIME – Number of Clusters – DBScan

33 RUNTIME – Number of Clusters – SATScan

35 CONCLUSIONS
S.No | Factor of comparison | DBSCAN | SaTScan
1 | Number of clusters not known beforehand | Yes | Yes
2 | Shape: data has differently shaped clusters | Yes | No: only 1 shape of clusters (circle, ellipse, rectangle, ...)
3 | Runtime: how much time to form clusters? | Less runtime | More runtime: iterative approach to detect clusters, plus Monte Carlo sampling
4 | Scalability: how well does it scale when data size is increased? | Still manageable runtime; curse of dimensionality | Runtime ∝ size, number of clusters
5 | Statistical significance: how significant are the detected clusters? | No significance factor | Significance is at the core
6 | Noise: is noise allowed, or must all points be in a cluster? | Yes | Yes

37 FUTURE WORK
Same project on real-world datasets
Run more instances of the experiments
Control over parameters
Compare with other types of clustering algorithms

38 QUESTIONS?

40 BACKUP SLIDE - 1 DBSCAN requires two parameters: epsilon (eps) and minimum points (minPts). It starts with an arbitrary starting point that has not been visited. It then finds all the neighbor points within distance eps of the starting point. If the number of neighbors is greater than or equal to minPts, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively. If the number of neighbors is less than minPts, the point is marked as noise. If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points in the dataset.
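The description above can be turned into a compact plain-Python sketch. It is illustrative only, not the implementation used for the experiments. It assumes Euclidean distance and a brute-force neighbor search, and it replaces the recursion with an explicit work list. The names `dbscan`, `region_query`, and `NOISE`, as well as the sample points, are hypothetical.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"

    def region_query(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster_id += 1        # start a new cluster from this core point
        labels[i] = cluster_id
        work = list(neighbors)
        while work:
            j = work.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # earlier "noise" turns out to be a border point
            if labels[j] is not None:
                continue                 # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                work.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Hypothetical data: two square groups of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (4, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The brute-force `region_query` makes this sketch quadratic in the number of points. A spatial index would be the usual way to speed it up, but that is outside the scope of this illustration.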

41 BACKUP SLIDE 2 - CONCLUSIONS
DBSCAN works:
  Clusters of the same density
  Number of clusters not known beforehand
  Differently shaped clusters
  Not all points need to be in clusters (noise concept is present)
DBSCAN doesn't work:
  Clusters of varying density
  Quality of DBSCAN depends on Epsilon; with Euclidean distance, high-dimensional data runs into the curse of dimensionality
TO DO

42 CLUSTER SHAPE - DBSCAN Results

43 SHAPE - SATSCAN RESULTS p-value: 0.001

