5/29/2008, AI Lab @ UEC in Japan
Chapter 12 Clustering: Large Databases
Written by Farial Shahnaz
Presented by Zhao Xinyou
Data Mining Technology

Contents
  Introduction
  Ideas behind the three major approaches for scalable clustering {Divide-and-Conquer, Incremental, Parallel}
  Three algorithms for scalable clustering {BIRCH, DBSCAN, CURE}
  Application

Introduction: common method (p. 133)
The common method for clustering visits all the data in the database and analyzes it:
  Time: the computational complexity is O(n*n).
  Memory: all the data must be loaded into main memory.
The number of records can be huge (millions).
[Figure: time/memory plotted against data size]

Motivation: clustering for large databases (p. 134)
[Figure: two plots of time/memory against data size, one with f(x): O(n*n) and one with f(x): O(n), linked by "Method ???"]

Requirements: clustering for large databases (p. 134)
  Scan the database no more than once, and preferably less.
  Process each record only once.
  Work with limited memory.
  Can suspend, stop, and resume.
  Can update the results when data is inserted or removed.
  Can use different techniques to scan the database.
  During execution, the method should provide its status and the 'best' answer so far.

Major approaches for scalable clustering (p. 135)
  Divide-and-conquer approach
  Parallel clustering approach
  Incremental clustering approach

Divide-and-conquer approach (p. 135)
Definition: divide-and-conquer is a problem-solving approach in which we
  divide the problem into sub-problems,
  recursively conquer (solve) each sub-problem, and then
  combine the sub-problem solutions to obtain a solution to the original problem.
Key assumptions
  1. Problem solutions can be constructed from sub-problem solutions.
  2. Sub-problem solutions are independent of one another.
9*9 Sudoku

Parallel clustering approach (pp. 136-137)
Idea: divide the data into small sets and then process each small set on a different machine (this follows from divide-and-conquer).

Explanation of divide-and-conquer
  The divide step is itself an algorithm.
  The conquer step is itself an algorithm.

Applications (p. 135)
  Sorting: quicksort and merge sort
  Fast Fourier transforms
  Tower of Hanoi puzzle
  Matrix multiplication
  ...
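Merge sort, one of the applications listed above, shows the three divide-and-conquer steps in a few lines. This is a minimal sketch; the function name `merge_sort` is chosen here for illustration.

```python
def merge_sort(items):
    """Sort a list using divide-and-conquer (merge sort)."""
    # Base case: a list of 0 or 1 elements is already sorted.
    if len(items) <= 1:
        return list(items)

    # Divide: split the problem into two independent sub-problems.
    mid = len(items) // 2
    # Conquer: solve each sub-problem recursively.
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Combine: merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

The two recursive calls never look at each other's data, which is the independence assumption that also lets the sub-problems run on different machines.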

CURE: divide-and-conquer (pp. 140-141)
  1. Get the size n of the data set D and partition D into p groups (each containing n/p elements).
  2. Cluster each group p_i into k groups, using a heap and a k-d tree.
  3. Delete nodes in the heap and k-d tree that are unrelated to any cluster.
  4. Cluster the partial clusters to get the final clusters.
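The partition-then-combine structure of these steps can be sketched as follows. This is a simplified illustration, not CURE itself: it merges the clusters with the nearest centroids by brute force and leaves out the heap, the k-d tree, and the deletion step. The names `merge_nearest`, `partitioned_clustering`, `k_partial`, and `k_final` are hypothetical.

```python
import math


def centroid(cluster):
    """Mean point of a cluster given as a list of coordinate tuples."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))


def merge_nearest(clusters, k):
    """Repeatedly merge the two clusters with the nearest centroids until k remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


def partitioned_clustering(data, p, k_partial, k_final):
    """Partition data into p groups, cluster each group, then cluster the partial clusters."""
    size = math.ceil(len(data) / p)
    partial = []
    for start in range(0, len(data), size):
        part = data[start:start + size]
        # Step 2: cluster one partition, starting from singleton clusters.
        partial.extend(merge_nearest([[pt] for pt in part], k_partial))
    # Step 4: cluster the partial clusters to obtain the final clusters.
    return merge_nearest(partial, k_final)


data = [(0, 0), (0, 1), (10, 10), (10, 11), (1, 0), (11, 10)]
result = partitioned_clustering(data, p=2, k_partial=2, k_final=2)
```

Each partition is clustered without reference to the others, so only the much smaller set of partial clusters has to be handled together in the last step.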

Heap PP140-141

k-d tree (pp. 140-141)
Technically, the letter k refers to the number of dimensions.
[Figure: a 3-dimensional k-d tree]
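A minimal sketch of building a k-d tree, assuming the common construction that splits on the median point and cycles through the k axes by depth. The slide does not specify a construction; `KDNode` and `build_kdtree` are names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class KDNode:
    point: Tuple[float, ...]          # the point stored at this node
    axis: int                         # dimension used to split at this node
    left: Optional["KDNode"] = None   # points with a smaller value on `axis`
    right: Optional["KDNode"] = None  # points with a larger or equal value on `axis`


def build_kdtree(points, depth=0):
    """Build a k-d tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    k = len(points[0])                # k = number of dimensions
    axis = depth % k
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(
        point=ordered[mid],
        axis=axis,
        left=build_kdtree(ordered[:mid], depth + 1),
        right=build_kdtree(ordered[mid + 1:], depth + 1),
    )


# A 3-dimensional example (k = 3).
points = [(2, 3, 1), (5, 4, 7), (9, 6, 2), (4, 7, 9), (8, 1, 5), (7, 2, 6)]
root = build_kdtree(points)
```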

K-D Tree PP140-141

CURE: divide-and-conquer (pp. 140-141)
[Figure: find the nearest clusters and merge them, repeated]

Incremental clustering approach (pp. 135-136)
Idea: scan the records in the database one at a time and compare each with the existing clusters. If a similar cluster is found, assign the record to that cluster; otherwise, create a new cluster. Continue until no records remain.
Steps:
  S = {};  // start with no clusters
  do {
    read one record d;
    r = find_similarity_cluster(d, S);
    if (r exists)
      assign d to the cluster r;
    else
      Add_cluster(d, S);
  } until (no record in database);
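The steps above can be sketched in Python as follows. The slide does not say how similarity is measured, so this example assumes a record is similar to a cluster when its Euclidean distance to the cluster centroid is at most `threshold`. That rule, and the names `incremental_clustering` and `threshold`, are assumptions made for illustration.

```python
import math


def incremental_clustering(records, threshold):
    """Single-pass clustering: each record is read once and never revisited."""
    clusters = []  # each cluster is {"centroid": tuple, "members": list}
    for d in records:
        # Find the existing cluster whose centroid is nearest to d.
        nearest, best = None, math.inf
        for c in clusters:
            dist = math.dist(d, c["centroid"])
            if dist < best:
                nearest, best = c, dist

        if nearest is not None and best <= threshold:
            # A similar cluster exists: assign d to it and update its centroid.
            nearest["members"].append(d)
            n = len(nearest["members"])
            nearest["centroid"] = tuple(
                sum(m[i] for m in nearest["members"]) / n for i in range(len(d))
            )
        else:
            # No similar cluster: create a new cluster containing only d.
            clusters.append({"centroid": tuple(d), "members": [d]})
    return clusters


records = [(0, 0), (0.5, 0), (10, 10), (10, 10.5), (0, 0.5)]
clusters = incremental_clustering(records, threshold=2.0)
```

Because a record is never reassigned, the result depends on the order in which the records arrive.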

Applications of the incremental clustering approach
  BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
  DBSCAN: Density-Based Spatial Clustering of Applications with Noise

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) (pp. 137-138)
Using a distance measure, BIRCH computes the similarity between a record and a cluster and produces the clusters.
[Figure labels: distance within a cluster (inner cluster), distance between clusters (among clusters)]

Related definitions
Cluster: {x_i}, where i = 1, 2, ..., N
CF (Clustering Feature): a triple (N, LS, SS), where
  N: number of data points
  LS: linear sum of the N data points
  SS: square sum of the N data points
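A minimal sketch of the CF triple, assuming SS is kept as a single scalar (the sum of the squared norms of the points). The class name `CF` and its methods are chosen for this example. It shows why the triple is convenient: two CFs are merged by adding their components, without revisiting the data points.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CF:
    n: int                   # N: number of data points
    ls: Tuple[float, ...]    # LS: linear sum of the points, per dimension
    ss: float                # SS: square sum (assumed scalar here)

    @classmethod
    def from_point(cls, x):
        """CF of a cluster that contains the single point x."""
        return cls(1, tuple(x), sum(v * v for v in x))

    def __add__(self, other):
        """CF of the union of two disjoint clusters: add component-wise."""
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self):
        """Cluster centroid, recovered from the summary as LS / N."""
        return tuple(v / self.n for v in self.ls)


cf = CF.from_point((1, 2)) + CF.from_point((3, 4))
```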

Related definitions
CF tree = (B, T)
  B = (CF_i, child_i) if the node is an internal node
  B = (CF_i, prev, next) if the node is an external (leaf) node
  T: threshold for all leaf nodes; each must satisfy mean distance D < T
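The threshold test can be sketched on a flat list of leaf entries. This simplification leaves out the tree structure (child, prev, and next pointers) and assumes that D is the cluster radius, the root-mean-square distance of the points to the centroid, computed from the CF alone. The function names and the choice of radius for D are assumptions made for illustration.

```python
import math


def radius(cf):
    """Radius of a cluster computed from its CF triple (n, ls, ss)."""
    n, ls, ss = cf
    centroid_sq = sum((v / n) ** 2 for v in ls)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(ss / n - centroid_sq, 0.0))


def insert_point(entries, x, T):
    """Absorb x into the nearest leaf entry if that keeps D < T, else add a new entry."""
    x = tuple(x)
    point_cf = (1, x, sum(v * v for v in x))
    if entries:
        def dist_to(idx):
            n, ls, _ = entries[idx]
            return math.dist(x, tuple(v / n for v in ls))

        nearest = min(range(len(entries)), key=dist_to)
        n, ls, ss = entries[nearest]
        merged = (n + 1, tuple(a + b for a, b in zip(ls, x)), ss + point_cf[2])
        if radius(merged) < T:
            entries[nearest] = merged   # threshold still satisfied: absorb x
            return
    entries.append(point_cf)            # otherwise x starts a new entry


entries = []
for x in [(0, 0), (0, 1), (10, 10)]:
    insert_point(entries, x, T=1.0)
```

A smaller T produces more, tighter leaf entries; a larger T produces fewer, coarser ones.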

Algorithm for BIRCH

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Ex1: We want to classify the houses along a river in a spatial photo.
Ex2:

Definitions for DBSCAN
Eps-neighborhood of a point: the Eps-neighborhood of a point p, denoted by N_Eps(p), is defined by
  N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
Minimum number (MinPts): MinPts is the minimum number of data points in any cluster.
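The Eps-neighborhood definition translates directly into code. This sketch assumes Euclidean distance for dist and scans the whole of D; the function name `eps_neighborhood` is chosen here.

```python
import math


def eps_neighborhood(D, p, eps):
    """Return N_Eps(p): every point q in D with dist(p, q) <= eps."""
    return [q for q in D if math.dist(p, q) <= eps]


D = [(0, 0), (1, 0), (0, 1), (3, 3)]
neighbors = eps_neighborhood(D, (0, 0), eps=1.0)
```

By this definition p belongs to its own neighborhood, since dist(p, p) = 0.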

Definitions for DBSCAN
Directly density-reachable: a point p is directly density-reachable from a point q with respect to Eps and MinPts if
  1) p ∈ N_Eps(q);
  2) |N_Eps(q)| ≥ MinPts.

Definitions for DBSCAN
Density-reachable: a point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points p_1, p_2, ..., p_n, with p_1 = q and p_n = p, such that p_(i+1) is directly density-reachable from p_i.
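Density-reachability can be checked with a breadth-first search that starts at q and expands only through points whose Eps-neighborhood holds at least MinPts points. This sketch assumes Euclidean distance and points given as tuples; the function names are chosen for this example.

```python
import math
from collections import deque


def neighbors(D, q, eps):
    """N_Eps(q): all points of D within distance eps of q."""
    return [r for r in D if math.dist(q, r) <= eps]


def directly_density_reachable(D, p, q, eps, min_pts):
    """True if p is in N_Eps(q) and |N_Eps(q)| >= min_pts."""
    n = neighbors(D, q, eps)
    return p in n and len(n) >= min_pts


def density_reachable(D, p, q, eps, min_pts):
    """True if p can be reached from q by a chain of directly density-reachable steps."""
    visited = {q}
    queue = deque([q])
    while queue:
        current = queue.popleft()
        nbrs = neighbors(D, current, eps)
        if len(nbrs) < min_pts:
            continue  # the chain cannot be extended from this point
        for r in nbrs:
            if r == p:
                return True
            if r not in visited:
                visited.add(r)
                queue.append(r)
    return False


D = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
forward = density_reachable(D, p=(3, 0), q=(1, 0), eps=1.0, min_pts=3)
backward = density_reachable(D, p=(1, 0), q=(3, 0), eps=1.0, min_pts=3)
```

In this data `forward` is true and `backward` is false, because (3, 0) has only two points in its neighborhood: the relation is not symmetric.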

Algorithm of DBSCAN
Input: D = {t_1, t_2, ..., t_n}, MinPts, Eps
Output: K = {K_1, K_2, ..., K_k}
  k = 0;
  for i = 1 to n do
    if t_i is not in a cluster then
      X = {t_j | t_j is density-reachable from t_i};
      if X is a valid cluster then
        k = k + 1;
        K_k = X;
      end if
    end if
  end for
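A minimal Python sketch of this loop, assuming Euclidean distance and a brute-force neighborhood search. X is treated as a valid cluster when t_i has at least MinPts points in its Eps-neighborhood. Clusters are numbered from 0 here, and points left with the label `None` are noise. The function name `dbscan` and the label representation are choices made for this example.

```python
import math
from collections import deque


def dbscan(D, eps, min_pts):
    """Return one cluster label per point of D (None marks noise)."""
    def neighbors(i):
        return [j for j in range(len(D)) if math.dist(D[i], D[j]) <= eps]

    labels = [None] * len(D)
    k = 0
    for i in range(len(D)):
        if labels[i] is not None:
            continue                      # t_i is already in a cluster
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            continue                      # no valid cluster can start at t_i
        # Collect every point that is density-reachable from t_i.
        labels[i] = k
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] is not None:
                continue
            labels[j] = k
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:      # only dense points extend the chain
                queue.extend(nbrs)
        k += 1
    return labels


D = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(D, eps=1.5, min_pts=3)
```

The brute-force neighborhood search makes this O(n*n); a spatial index such as a k-d tree is the usual way to speed up the neighborhood queries.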
