MAFIA: Adaptive Grids for Clustering Massive Data Sets. Harsha Nagesh, Sanjay Goil, Alok Choudhary. Presented by Udeepta Bordoloi.

Clustering Algorithms
- BIRCH
- ROCK
- CLIQUE
  – Inputs: grid size and density threshold
  – Prunes subspaces
- MAFIA
  – Adaptive grid size
  – Input: density threshold
  – No pruning of subspaces

Grids: the CLIQUE way
Along each dimension:
- Divide the whole range into intervals (windows) of a size given by the user.
- Threshold the number of points in each interval against the user-input density to get the clusters.
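A minimal sketch of this fixed, one-dimensional grid step. The function name, window size, and threshold values are illustrative, not from the paper:

```python
import numpy as np

def clique_1d_windows(values, window_size, density_threshold):
    """Fixed-size 1-D grid: bin the values into equal intervals and keep
    the intervals whose point count exceeds the density threshold."""
    lo = float(values.min())
    n_bins = int(np.ceil((float(values.max()) - lo) / window_size))
    counts, edges = np.histogram(
        values, bins=n_bins, range=(lo, lo + n_bins * window_size))
    return [(float(edges[i]), float(edges[i + 1]))
            for i, c in enumerate(counts) if c > density_threshold]

data = np.array([1, 1, 1, 3, 3, 3, 3, 3, 9, 14])
print(clique_1d_windows(data, 3.0, 4))  # → [(1.0, 4.0)]
```

Only the interval [1, 4) holds more than 4 points, so it is the sole dense window here.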

Grids: the MAFIA way
Along each dimension:
- Divide the whole range into many small windows and compute a histogram (assuming discrete data here).
  – E.g., we can divide the range of natural numbers 1-15 into 5 windows (1-3, 4-6, …, 13-15).
- Value of a window = max(histogram value within the window).
  – E.g., if there are three 1s, zero 2s, and five 3s, then the value of the first window (1-3) = max(3, 0, 5) = five.
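This histogram-and-window step can be sketched as follows for discrete data; `mafia_window_values` and its `domain` argument are illustrative names, and the example reproduces the slide's 1-15 domain:

```python
from collections import Counter

def mafia_window_values(values, window_size, domain):
    """Fine 1-D histogram over a discrete domain; the value of each small
    window is the maximum histogram count inside it."""
    hist = Counter(values)
    lo, hi = domain
    windows = []
    start = lo
    while start <= hi:
        end = min(start + window_size - 1, hi)
        windows.append(((start, end),
                        max(hist.get(v, 0) for v in range(start, end + 1))))
        start = end + 1
    return windows

# Slide example: three 1s, zero 2s, five 3s -> first window (1-3) = 5.
vals = [1, 1, 1] + [3] * 5
print(mafia_window_values(vals, 3, (1, 15)))
# → [((1, 3), 5), ((4, 6), 0), ((7, 9), 0), ((10, 12), 0), ((13, 15), 0)]
```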

Grids: the MAFIA way (contd.)
Along each dimension:
- From left to right, merge adjacent windows whose values differ by less than a threshold β.
  – β could be made a user input, but the authors hard-coded it (25-75%).
- What if no partition can be detected?
  – Divide the range equally.
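A sketch of the left-to-right merging step. The slide does not spell out the comparison, so treating β as a relative (percentage) difference between adjacent window values is an assumption, in keeping with the hard-coded 25-75% range:

```python
def merge_windows(window_values, beta):
    """Left-to-right merge of adjacent windows. Assumption: two windows
    merge when their values are within a fraction beta of each other;
    the merged window's value is the max of the two (consistent with
    'value = max histogram count inside the window')."""
    merged = [window_values[0]]
    for rng, val in window_values[1:]:
        prev_rng, prev_val = merged[-1]
        if abs(val - prev_val) <= beta * max(prev_val, val, 1):
            # extend the previous window across this one
            merged[-1] = ((prev_rng[0], rng[1]), max(prev_val, val))
        else:
            merged.append((rng, val))
    return merged

windows = [((1, 3), 5), ((4, 6), 5), ((7, 9), 0), ((10, 12), 0), ((13, 15), 4)]
print(merge_windows(windows, 0.25))
# → [((1, 6), 5), ((7, 12), 0), ((13, 15), 4)]
```

The five fine windows collapse into three adaptive ones: a dense region, an empty region, and a second dense region.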

Compare…
(Figure: CLIQUE's fixed, uniform windows vs. MAFIA's adaptive windows along one dimension.)

Which windows are cluster candidates?
- CLIQUE: use the user-input threshold directly.
- MAFIA: use the user-input threshold, normalized to the window size.
  – Cluster dominance factor: α
  – Clusters are reported as DNF expressions.
  – Cluster candidates are henceforth referred to as candidate dense units (CDUs).
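One way to read "threshold normalized to window size" is that a window is dense when it holds α times more points than a uniform spread would predict; the exact formula below is an assumption, not taken from the paper:

```python
def is_dense(count, n_points, window_width, domain_width, alpha=1.5):
    """Density test sketch: a window is a candidate dense unit when its
    point count exceeds alpha times the count expected if the points were
    spread uniformly over the dimension. The threshold therefore scales
    with the (adaptive) window width."""
    expected = n_points * (window_width / domain_width)
    return count > alpha * expected

# 100 points on a domain of width 15: a merged window of width 6 expects
# 40 points under uniformity, so with alpha = 1.5 it needs more than 60.
print(is_dense(65, 100, 6, 15))  # True
print(is_dense(55, 100, 6, 15))  # False
```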

Algorithm: Initialization
Let B = the number of records that fit into memory.
1. Read the data in chunks of B records and build a histogram for each dimension.
2. Determine the adaptive windows for each dimension, and the normalized threshold for each window.
3. Get the candidate windows in each dimension.
4. Set the working dimension, k = 1.

Main Loop
Repeat:
1. k++;
2. Find candidate dense units (by combining dimensions).
3. Read through the data to find how many points lie in each of these CDUs.
4. Find the true dense units.
Until no more dense units are found.
Report the true dense units as clusters.
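The loop above can be sketched as follows, representing each dense unit as a frozenset of (dimension, window) pairs. The flat `min_count` stands in for MAFIA's per-window normalized thresholds, so this is a simplified sketch rather than the authors' implementation:

```python
from itertools import combinations

def point_in_unit(point, unit):
    """A point lies in a unit when every window of the unit contains the
    point's value in that dimension."""
    return all(lo <= point[dim] <= hi for dim, (lo, hi) in unit)

def mafia_main_loop(data, dense_units_1d, min_count):
    """Level-wise search: combine (k-1)-dim dense units into k-dim
    candidates, scan the data to count points in each candidate, and keep
    the true dense units until no new ones appear."""
    dense = set(dense_units_1d)
    current = set(dense_units_1d)
    while True:
        candidates = set()
        for u, v in combinations(current, 2):
            dims = {d for d, _ in u} | {d for d, _ in v}
            # combinable: identical windows on shared dims, one new dim each
            if len(u | v) == len(u) + 1 and len(dims) == len(u) + 1:
                candidates.add(u | v)
        current = {c for c in candidates
                   if sum(point_in_unit(p, c) for p in data) >= min_count}
        if not current:
            return dense
        dense |= current

u0 = frozenset({(0, (1, 2))})          # dense window [1, 2] on dimension 0
u1 = frozenset({(1, (1, 2))})          # dense window [1, 2] on dimension 1
data = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9)]
print(len(mafia_main_loop(data, [u0, u1], min_count=3)))  # → 3
```

The two 1D units combine into one 2D unit covering four of the five points, so three dense units survive in total.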

Building CDUs
CDUs in k dimensions:
- Merge two dense units of (k-1) dimensions
- such that they share some (k-2) dimensions.
- Each (k-1)-dimensional dense unit has to be compared with every other dense unit.
- This can lead to duplicate CDUs, so every CDU must be compared with every other CDU.
Dense units which cannot be combined further are a potential cluster (in a subspace).

Building CDU example (2D → 3D)
We can get repeated CDUs, so two passes are required:
1. Combine two 2D units into one 3D unit.
2. Eliminate repeated CDUs.
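The pair-wise combination plus duplicate elimination can be sketched with a set playing the role of the second pass; the (dimension, window-label) representation is illustrative:

```python
from itertools import combinations

def build_cdus(dense_units):
    """Combine (k-1)-dimensional dense units (frozensets of
    (dimension, window) pairs) that agree on k-2 dimensions into
    k-dimensional CDUs; the set discards the duplicates that arise when
    the same CDU is produced by more than one pair."""
    cdus = set()
    for u, v in combinations(dense_units, 2):
        dims = {d for d, _ in u} | {d for d, _ in v}
        # combinable: identical windows on the shared dimensions, and
        # exactly one new dimension contributed by each side
        if len(u | v) == len(u) + 1 and len(dims) == len(u) + 1:
            cdus.add(u | v)
    return cdus

# Three 2D units over dims 0, 1, 2: every pair yields the same 3D CDU,
# which is stored only once.
a = frozenset({(0, "w1"), (1, "w2")})
b = frozenset({(0, "w1"), (2, "w3")})
c = frozenset({(1, "w2"), (2, "w3")})
print(len(build_cdus([a, b, c])))  # → 1
```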

Variables (Recap)
- Cluster dominance factor α:
  – High α, strong clusters, and vice versa.
  – Usual value: 1.5
- Window-merging threshold β:
  – High β, fine windows, and vice versa.

MAFIA vs. CLIQUE (speedup)
CLIQUE was used:
- without pruning,
- with 10 bins for each dimension,
- with different thresholds ?

MAFIA vs. CLIQUE (number of CDUs computed)
- A single 7D cluster in a 10D data space.
- CLIQUE: 75 6D clusters, 546 7D clusters.

MAFIA vs. CLIQUE (quality)
- Two 4D clusters in a 10D data space.
- CLIQUE: the cluster boundary is very unreliable.
  – On using a variable number of (fixed-size) bins in each dimension (how?), it misses one cluster.

MAFIA (scalability)