Chapter 12 Clustering: Large Databases. Written by Farial Shahnaz; presented by Zhao Xinyou (AI, UEC, Japan), 5/29/2008. Data Mining Technology.

Contents
Introduction
Ideas behind the three major approaches to scalable clustering: divide-and-conquer, incremental, parallel
Three algorithms for scalable clustering: BIRCH, DBSCAN, CURE
Applications

Introduction: the common method
The common method for clustering visits all data in the database and analyzes every record:
Time: computational complexity O(n*n)
Memory: all data must be loaded into main memory
For a huge database (millions of records), both time and memory costs grow quickly. (p. 133)

Motivation: clustering for large databases
[Figure: the goal is a method whose time/memory cost grows as O(n) in the data size, rather than O(n*n).] (p. 134)

Requirements for clustering a large database (p. 134):
No more (preferably less) than one scan of the database
Process each record only once
Work with limited main memory
Be able to suspend, stop, and resume
Be able to update the results when new data is inserted or removed
Be able to use different techniques to scan the database
During execution, provide status and a current 'best' answer

Major approaches for scalable clustering (p. 135):
Divide-and-conquer approach
Parallel clustering approach
Incremental clustering approach

Divide-and-conquer approach
Definition: divide-and-conquer is a problem-solving approach in which we divide the problem into sub-problems, recursively conquer (solve) each sub-problem, and then combine the sub-problem solutions to obtain a solution to the original problem. (p. 135)
Key assumptions:
1. The problem's solution can be constructed from the sub-problem solutions.
2. The sub-problem solutions are independent of one another.
Example: a 9x9 Sudoku puzzle.
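As a concrete illustration of the divide/conquer/combine pattern described above, here is merge sort, one of the classic applications the chapter mentions:

```python
def merge_sort(a):
    # Divide: split the list in half until single elements remain.
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    # Conquer: recursively sort each half (the sub-problems are independent).
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    # Combine: merge the two sorted halves into one sorted list.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```

The combine step does O(n) work per level over O(log n) levels, which is where the O(n log n) total cost comes from.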

Parallel clustering approach
Idea: divide the data into small sets and then run each set on a different machine (derived from divide-and-conquer).

A note on divide-and-conquer: the divide step and the conquer step are each algorithms in their own right; different methods choose different ways to split the data and to solve each piece.

Applications of divide-and-conquer (p. 135):
Sorting: quicksort and merge sort
Fast Fourier transforms
The Tower of Hanoi puzzle
Matrix multiplication
…

CURE as divide-and-conquer
1. Get the size n of the data set D and partition D into p groups (each containing about n/p elements).
2. Cluster each group p_i into k partial clusters, using a heap and a k-d tree.
3. Delete outlier nodes (points with no close neighbors) from the heap and k-d tree.
4. Cluster the partial clusters to get the final clusters.
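The partition/pre-cluster/merge flow of steps 1, 2, and 4 can be sketched as below. This is a much-simplified stand-in: real CURE uses representative points, a heap, and a k-d tree for efficiency, whereas here plain centroids and brute-force nearest-pair search (`centroid`, `agglomerate`, `cure_like` are illustrative names) take their place:

```python
import math

def centroid(pts):
    # Per-dimension mean of a list of points.
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def agglomerate(clusters, k):
    # Repeatedly merge the two nearest clusters (by centroid distance)
    # until only k remain. CURE would use a heap + k-d tree here.
    clusters = [list(c) for c in clusters]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

def cure_like(points, p, k):
    # Step 1: partition the data into p groups.
    size = math.ceil(len(points) / p)
    parts = [points[i:i + size] for i in range(0, len(points), size)]
    # Step 2: pre-cluster each partition into k partial clusters.
    partial = []
    for part in parts:
        partial.extend(agglomerate([[pt] for pt in part], k))
    # Step 4: cluster the partial clusters into the final k clusters.
    return agglomerate(partial, k)
```

Because each partition is clustered independently, step 2 is exactly the piece that the parallel approach would farm out to separate machines.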

Heap: a tree-shaped priority-queue structure; CURE uses it to efficiently retrieve the pair of clusters that are currently closest to each other.

k-d tree: technically, the letter k refers to the number of dimensions (e.g., a 2-dimensional k-d tree). It is a binary tree that splits the space along one coordinate axis at each level, supporting fast nearest-neighbor queries.

CURE as divide-and-conquer
[Figure: the partial clusters are combined by repeatedly finding the nearest pair and merging them.]

Incremental clustering approach
Idea: scan the data in the database once; compare each record with the existing clusters; if a similar cluster is found, assign the record to it, otherwise create a new cluster. Continue until no data remains.
Steps:
1. S = {};  // set of clusters, initially empty
2. do {
3.   read one record d;
4.   r = find_similar_cluster(d, S);
5.   if (r exists)
6.     assign d to cluster r;
7.   else
8.     add_cluster(d, S);
9. } until (no record remains in the database);
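The steps above can be sketched as a single-pass routine. This is only one way to fill in the unspecified similarity test: here "similar" means the record lies within a distance threshold of a cluster's centroid (`incremental_cluster` and `threshold` are illustrative choices, not from the chapter):

```python
import math

def incremental_cluster(records, threshold):
    # One pass over the data: assign each record to the nearest
    # existing cluster if it is close enough, otherwise open a new one.
    clusters = []  # each cluster is a list of records
    for d in records:
        best, best_dist = None, None
        for c in clusters:
            center = tuple(sum(x) / len(c) for x in zip(*c))
            dist = math.dist(d, center)
            if best_dist is None or dist < best_dist:
                best, best_dist = c, dist
        if best is not None and best_dist <= threshold:
            best.append(d)        # similar cluster found: assign d to it
        else:
            clusters.append([d])  # no similar cluster: create a new one
    return clusters
```

Note that each record is processed exactly once and only the cluster summaries stay in memory, matching the large-database requirements listed earlier.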

Applications of the incremental clustering approach:
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
DBSCAN: Density-Based Spatial Clustering of Applications with Noise

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
Based on a distance measurement, BIRCH computes the similarity between each record and the existing clusters and produces the clusters, keeping intra-cluster distances small and inter-cluster distances large.

Related definitions
Cluster: {x_i}, where i = 1, 2, …, N.
CF (Clustering Feature): a triple (N, LS, SS), where N is the number of data points, LS is the linear sum of the N data points, and SS is the square sum of the N data points.
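A small sketch of the CF triple and its key property, additivity: merging two clusters only requires adding their triples component-wise, so BIRCH never has to revisit the raw points (`cf` and `cf_merge` are illustrative names):

```python
def cf(points):
    # Clustering Feature of a set of d-dimensional points:
    # N (count), LS (per-dimension linear sum), SS (scalar square sum).
    n = len(points)
    ls = tuple(sum(p[i] for p in points) for i in range(len(points[0])))
    ss = sum(x * x for p in points for x in p)
    return (n, ls, ss)

def cf_merge(a, b):
    # CF triples are additive: CF(A ∪ B) = CF(A) + CF(B) component-wise.
    n = a[0] + b[0]
    ls = tuple(x + y for x, y in zip(a[1], b[1]))
    return (n, ls, a[2] + b[2])
```

From a CF alone one can recover the centroid (LS/N) and the cluster radius, which is all the distance computations in BIRCH need.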

Related definitions
CF tree = (B, T). Each internal node holds entries (CF_i, child_i); each leaf node holds entries (CF_i, prev, next), where prev and next chain the leaf nodes together. B is the branching factor (maximum entries per node); T is a threshold on every leaf entry, whose cluster must satisfy mean distance D < T.

Algorithm for BIRCH
Phase 1: scan the database and build an initial in-memory CF tree.
Phase 2 (optional): condense the CF tree by rebuilding it with a larger threshold T.
Phase 3: apply a global clustering algorithm to the leaf entries.
Phase 4 (optional): refine the clusters with additional passes over the data.
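The core of Phase 1 is the threshold test: a point is absorbed into a leaf entry only if the entry's radius would stay below T. A simplified sketch, using a flat list of leaf entries in place of a real CF tree (`radius_after_absorb` and `birch_phase1` are illustrative names, and the B-bounded tree structure is omitted):

```python
import math

def radius_after_absorb(cf, p):
    # Radius of the cluster if point p were absorbed into CF = (N, LS, SS):
    # R = sqrt(SS/N - ||LS/N||^2).
    n = cf[0] + 1
    ls = tuple(x + y for x, y in zip(cf[1], p))
    ss = cf[2] + sum(x * x for x in p)
    c = [x / n for x in ls]
    return math.sqrt(max(ss / n - sum(x * x for x in c), 0.0))

def birch_phase1(points, T):
    # Absorb each point into the nearest leaf entry if the resulting
    # radius stays below the threshold T; otherwise start a new entry.
    leaves = []  # each entry is a CF triple (N, LS, SS)
    for p in points:
        best = min(leaves,
                   key=lambda cf: math.dist([x / cf[0] for x in cf[1]], p),
                   default=None)
        if best is not None and radius_after_absorb(best, p) <= T:
            i = leaves.index(best)
            leaves[i] = (best[0] + 1,
                         tuple(x + y for x, y in zip(best[1], p)),
                         best[2] + sum(x * x for x in p))
        else:
            leaves.append((1, tuple(p), float(sum(x * x for x in p))))
    return leaves
```

Raising T in Phase 2 makes entries absorb more points, which is how BIRCH shrinks the tree when memory runs out.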

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Example: from a spatial photo, pick out the houses that lie along a river.

Definitions for DBSCAN
Eps-neighborhood of a point: the Eps-neighborhood of a point p, denoted N_Eps(p), is defined by N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}.
Minimum number (MinPts): the minimum number of data points required in any cluster.

Definitions for DBSCAN
Directly density-reachable: a point p is directly density-reachable from a point q with respect to Eps and MinPts if
1) p ∈ N_Eps(q), and
2) |N_Eps(q)| ≥ MinPts.

Definitions for DBSCAN
Density-reachable: a point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points p_1, p_2, …, p_n with p_1 = q and p_n = p such that p_(i+1) is directly density-reachable from p_i.

Algorithm of DBSCAN
Input: D = {t_1, t_2, …, t_n}, MinPts, Eps
Output: clusters K = K_1, K_2, …, K_k
k = 0;
for i = 1 to n do
  if t_i is not yet in a cluster then
    X = {t_j | t_j is density-reachable from t_i};
    if X is a valid cluster then
      k = k + 1;
      K_k = X;
    end if
  end if
end for
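The pseudocode above can be made concrete as follows. This is a straightforward (brute-force neighborhood) rendering of the definitions, labeling each point with a cluster id and marking noise as -1 (`dbscan` and the label convention are illustrative choices):

```python
import math

def dbscan(D, eps, min_pts):
    # D: list of points. Returns {point index: cluster id}, noise = -1.
    labels = {}
    cid = 0

    def neighbors(i):
        # Brute-force Eps-neighborhood N_Eps(D[i]), including i itself.
        return [j for j in range(len(D)) if math.dist(D[i], D[j]) <= eps]

    for i in range(len(D)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # not a core point: provisionally noise
            continue
        labels[i] = cid             # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:                # expand via density-reachability
            j = queue.pop()
            if labels.get(j, -1) == -1:
                labels[j] = cid     # unvisited or previously-noise point
                js = neighbors(j)
                if len(js) >= min_pts:   # j is also core: expand further
                    queue.extend(k for k in js if labels.get(k, -1) == -1)
        cid += 1
    return labels
```

A point first marked noise can later be relabeled as a border point of a cluster, which matches the definitions: border points are density-reachable from a core point but are not core points themselves.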