MR-DBSCAN: An Efficient Parallel Density-based Clustering Algorithm using MapReduce Yaobin He, Haoyu Tan, Wuman Luo, Huajian Mao, Di Ma, Shengzhong Feng,


MR-DBSCAN: An Efficient Parallel Density-based Clustering Algorithm using MapReduce Yaobin He, Haoyu Tan, Wuman Luo, Huajian Mao, Di Ma, Shengzhong Feng, Jianping Fan

INTRODUCTION
This paper focuses on parallel density-based data clustering in a shared-nothing cluster environment.
- Data clustering is an essential data mining technique that reveals macroscopic patterns in data.
- Because of the size of today's datasets, parallel data clustering algorithms are needed.
In this paper, the authors:
- Propose a parallel density-based clustering algorithm and implement it as a 4-stage MapReduce pipeline
- Adopt a quick partitioning strategy for large-scale non-indexed data
- Study the criteria for merging bordering partitions, along with optimizations
- Evaluate on a real large-scale dataset (approx. 1.9 billion GPS log records)

Introduction : Clustering techniques
Pros of DBSCAN
- Divides data into clusters of arbitrary shape
- Does not require the number of clusters a priori
- Insensitive to the order of the points in the dataset
Cons of DBSCAN
- Datasets are growing so large that they cannot be held on a single machine
- Much higher computational complexity than K-means
=> PARALLELIZE using MapReduce!! (what a simple..)

Background : DBSCAN
DBSCAN (Martin Ester et al., KDD, 1996)
The key idea of density-based clustering is that, for each point of a cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of points (MinPts).
- Directly density-reachable (DDR): o is DDR p (that is, p is directly density-reachable from o) if $p \in N_{Eps}(o)$ and $\mathrm{Card}(N_{Eps}(o)) \ge MinPts$.
- Density-reachable (DR): if there is a chain of points $\{p_i \mid i = 0, \ldots, n\}$ such that each $p_i$ is DDR $p_{i+1}$, then $p_i$ is DR t for every $t \in \{p_j \mid j = i + 1, \ldots, n\}$. (canonical extension)
- Density-connected (DC): if o is DR p and o is DR q, then p is DC q. (symmetric version)
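
The DDR definition above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names, the brute-force neighborhood search and the toy points are assumptions made for the example, and the core condition used is the standard DBSCAN one (the Eps-neighborhood holds at least MinPts points).

```python
import math

def eps_neighborhood(points, o, eps):
    """Indices of all points within distance eps of points[o] (o itself included)."""
    return [i for i, p in enumerate(points) if math.dist(points[o], p) <= eps]

def directly_density_reachable(points, o, p, eps, min_pts):
    """True if p is directly density-reachable from o ("o is DDR p")."""
    nbrs = eps_neighborhood(points, o, eps)
    # p must lie in the Eps-neighborhood of o, and o must be a core point.
    return p in nbrs and len(nbrs) >= min_pts

# Toy data (hypothetical): four points on a line.
points = [(0, 0), (1, 0), (-1, 0), (2, 0)]

print(directly_density_reachable(points, 0, 2, eps=1.5, min_pts=3))  # True
print(directly_density_reachable(points, 2, 0, eps=1.5, min_pts=3))  # False
```

The second call returns False because point 2 has only two points in its neighborhood, so it is not a core point. This shows that DDR is not symmetric, which is why the symmetric density-connected relation is defined separately.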

Background : DBSCAN
Class of point:
- Unclassified
- Core
- Border
- Noise
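
As a rough illustration of the point classes, the sketch below labels every point as core, border or noise with a brute-force scan ("unclassified" is simply the state before a point has been visited). The function name and the toy data are hypothetical, and the quadratic neighbor search is chosen for brevity, not efficiency.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise' (brute force, O(n^2))."""
    n = len(points)
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs[i]):
            # Not dense enough itself, but inside a core point's neighborhood.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Toy data (hypothetical): a short line of points plus one far-away outlier.
points = [(0, 0), (1, 0), (-1, 0), (2, 0), (10, 10)]
print(classify_points(points, eps=1.5, min_pts=3))
# ['core', 'core', 'border', 'border', 'noise']
```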

Background : MapReduce
Borrows from functional programming.
Users should implement two primary methods:
- Map: (k1, v1) → list(k2, v2)
- Reduce: (k2, list(v2)) → list(k3, v3)
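
To make the two signatures concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps. It is not Hadoop code: `map_reduce`, `count_mapper` and `sum_reducer` are hypothetical names, and the word-count job is just a familiar stand-in workload.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper over (k1, v1) records, group by k2, then run reducer per key."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):      # Map: (k1, v1) -> list(k2, v2)
            groups[k2].append(v2)          # shuffle: group values by key
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))  # Reduce: (k2, list(v2)) -> list(k3, v3)
    return output

def count_mapper(_line_no, line):
    for word in line.split():
        yield word, 1

def sum_reducer(word, counts):
    yield word, sum(counts)

result = map_reduce([(0, "a b a"), (1, "b a")], count_mapper, sum_reducer)
print(result)  # [('a', 3), ('b', 2)]
```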

Design And Implementation
Problem Statement
Given a set of d-dimensional points $DB = \{p_1, p_2, \ldots, p_n\}$, a minimal density of clusters defined by Eps and MinPts, and a set of computers $CP = \{C_1, C_2, \ldots, C_n\}$ managed by a MapReduce platform, find the density-based clusters with respect to the given Eps and MinPts values.
Overall Framework

Stage 1 : Preprocessing
Summarize the spatial distribution, then generate a grid-based partition.
Main challenges for a partitioning strategy:
1) Load balancing
2) Minimized communication
- One possible solution is to build an efficient spatial index.
- However, the authors do not use well-known indexing methods such as R-Tree or KD-Tree, because iterated recursion to build a hierarchical structure is not practical in the MapReduce paradigm.
- Instead, the authors use a partitioning algorithm on MapReduce adapted from the grid file.

Stage 1 : Preprocessing
Raw data -> bucket counting (in the example, 10 buckets created with an interval of 0.1) -> compute the spatial distribution for each dimension -> partitioning
- Proposed metrics: avg, m
[Figure: histogram of Count per Bucket ID]
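
The bucket-counting step can be sketched as follows, assuming coordinates already normalized into [0, 1) and an interval of 0.1 as in the slide's example. The `balanced_cut` rule (cut where the running count reaches half of the total) is only an assumed placeholder for illustration; the slides do not spell out how the authors' metrics choose the split.

```python
from collections import Counter

def bucket_counts(points, interval=0.1):
    """Count points per (dimension, bucket) for coordinates in [0, 1)."""
    num_buckets = round(1 / interval)
    counts = Counter()
    for p in points:
        for dim, x in enumerate(p):
            bucket = min(int(x * num_buckets), num_buckets - 1)
            counts[(dim, bucket)] += 1
    return counts

def balanced_cut(counts, dim, num_buckets=10):
    """Hypothetical split rule: cut after the first bucket where the
    running count reaches half of all points in this dimension."""
    per_bucket = [counts[(dim, b)] for b in range(num_buckets)]
    total = sum(per_bucket)
    running = 0
    for b, c in enumerate(per_bucket):
        running += c
        if 2 * running >= total:
            return b + 1  # the cut lies at coordinate (b + 1) * interval
    return num_buckets

# Toy data (hypothetical), already normalized into [0, 1).
points = [(0.05, 0.95), (0.15, 0.95), (0.25, 0.15), (0.85, 0.15)]
counts = bucket_counts(points)
print(balanced_cut(counts, dim=0))  # 2, i.e. cut at x = 0.2
```

In a MapReduce setting the counting maps naturally onto the two methods: the mapper emits ((dimension, bucket), 1) per coordinate and the reducer sums the ones.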

Stage 1 : Preprocessing
Shape of the partition: the need for access to remote data
- For a given Eps and MinPts = 5, if remote data cannot be accessed, the neighborhood of object p1 would contain only 3 points, which is fewer than MinPts, so p1 would not be classified as a core point.
- Therefore, to obtain correct clustering results, a "view" over the border of partitions is necessary.
- So the shape of a partition is S + halo.
[Figure labels: S1 (or Si), S2 (or Si+1), halo, outer halo, inner halo, Eps]
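
One way to picture "S + halo" is to replicate each point into every partition whose box, grown by Eps on all sides, contains it. The sketch below assumes axis-aligned, half-open grid cells; the function name, the partition layout and the tags "own" and "outer_halo" are hypothetical choices for this example.

```python
def assign_with_halo(point, partitions, eps):
    """Return (partition_id, role) pairs for one point.

    partitions maps an id to (lows, highs), a half-open box [low, high)
    per dimension. A point is 'own' inside the box and 'outer_halo' when
    it lies outside the box but within eps of it along every dimension.
    """
    owners = []
    for pid, (lows, highs) in partitions.items():
        in_grown_box = all(lo - eps <= x < hi + eps
                           for x, lo, hi in zip(point, lows, highs))
        if not in_grown_box:
            continue
        inside = all(lo <= x < hi for x, lo, hi in zip(point, lows, highs))
        owners.append((pid, "own" if inside else "outer_halo"))
    return owners

# Two bordering partitions of the unit square (hypothetical layout).
partitions = {
    "S1": ((0.0, 0.0), (0.5, 1.0)),
    "S2": ((0.5, 0.0), (1.0, 1.0)),
}
print(assign_with_halo((0.45, 0.5), partitions, eps=0.1))
# [('S1', 'own'), ('S2', 'outer_halo')]
print(assign_with_halo((0.2, 0.5), partitions, eps=0.1))
# [('S1', 'own')]
```

With this replication, a node processing S2 sees every point within Eps of its border, so the neighborhood count of a point near the border is no longer cut short.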

Stage 2 : Local DBSCAN
The algorithm in Local DBSCAN is very similar to DBSCAN. The difference is:
- For a non-noise point q in the outer halo, we do not know at this stage whether q is a core point or a border point (because the computing nodes are in a shared-nothing environment).
- Such points are given the "Onqueue" status and put into the MergeCandidates (MC) set.

Stage 3 : Find Merging Mapping
Characteristics of the MC set
- The composition of the MC set
- The completeness of the MC set
[Figure labels: q is not in halo; q is core point; more than one neighbor is on halo; O is core point or border point on halo]

Stage 3 : Find Merging Mapping
Decide whether clusters of adjacent spaces need to be merged or not.

Stage 3 : Find Merging Mapping
Theorem 1: Let MC1(C1, S1) = AP1 ∪ BP1, where AP1 is the set of core points and BP1 is the set of border points w.r.t. space constraint S1. Let MC2(C2, S2) = AP2 ∪ BP2, where AP2 is the set of core points and BP2 is the set of border points w.r.t. space constraint S2. If S1 and S2 are bordering ...

Stage 4 : Merge
Build Global Mapping -> Merge and Relabel
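
A global mapping of this kind is commonly built with a union-find (disjoint-set) structure, sketched below. This is an assumed approach for illustration, not necessarily how the authors implement it: the input is taken to be a list of local cluster ids plus the pairs of clusters that Stage 3 decided to merge, and all names are hypothetical.

```python
def build_global_mapping(local_clusters, merge_pairs):
    """Map every local cluster id to one representative global id."""
    parent = {c: c for c in local_clusters}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in merge_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as representative so the result is deterministic.
            parent[max(ra, rb)] = min(ra, rb)
    return {c: find(c) for c in local_clusters}

def relabel(point_labels, mapping):
    """Replace each point's local cluster id with its global id."""
    return {pt: mapping[c] for pt, c in point_labels.items()}

# Local cluster ids are (partition, local id) pairs; the data is hypothetical.
clusters = [("S1", 0), ("S1", 1), ("S2", 0), ("S3", 0)]
pairs = [(("S1", 0), ("S2", 0)), (("S2", 0), ("S3", 0))]
mapping = build_global_mapping(clusters, pairs)
print(relabel({"p1": ("S3", 0), "p2": ("S1", 1)}, mapping))
# {'p1': ('S1', 0), 'p2': ('S1', 1)}
```

Merging is transitive here: S1's cluster 0 and S3's cluster 0 end up with the same global id even though they were never paired directly.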

Evaluation
Experiment environment
- 13-node cluster
- Each node has a 3.0GHz i7 950 (quad-core), 8GB RAM, 2TB HDD
- Ubuntu, Hadoop
- Block size: 64MB
Data set
- Shanghai taxi GPS logs

Evaluation
Each location point is normalized into the range [0, 1).
Two DBSCAN configurations:
- WL-1 Eps : 0.002, MinPts : 1,000
- WL-2 Eps : , MinPts : ds-4

Evaluation
[Figure: WL-1, SPD by node for datasets ds1, ds2, ds3 and ds4; labels (2/12), (4/12), (6/12)]

Conclusions
- This paper implements an efficient parallel DBSCAN algorithm as a 4-stage MapReduce pipeline.
- We analyze and propose a practical data partitioning strategy for large-scale non-indexed spatial data.
- We apply our work to a real-world spatial dataset that contains over 1.9 billion raw GPS records, and run our experiments on a lab-size 13-node cluster. The experimental results show very efficient speedup and scale-up.
- We observe that roadmap-based spatial data is highly skewed along the road network. If a main road happens to lie in the replication area after partitioning, computation and data replication increase dramatically.
- One direction for future work is to make the partitioning strategy aware of this observation and to minimize the size of the MC sets. The challenge is that its performance is still highly restricted by the distribution of the raw spatial data.