Published by Cameron Rodgers. Modified over 9 years ago.
1
CDS-Tree: An Effective Index for Clustering Arbitrary Shapes in Data Streams
Huanliang Sun, Ge Yu, Yubin Bao, Faxin Zhao, Daling Wang
RIDE-SDMA’05
Advisor: Jia-Ling Koh
Speaker: Tsui-Feng Yen
2
Introduction
Partitioning
- k-means and k-medians algorithms do not focus on finding clusters of arbitrary shape in data streams
Density-based
- DBSCAN can find clusters of arbitrary shape, but it needs to scan the database more than once
Cell-based (grid-based)
- CLIQUE has three problems:
  - high complexity
  - high memory usage
  - poor accuracy under limited memory when the data stream changes
3
Problem Definition
- Domain: let A = {A1, A2, …, Ak} be a set of domains, and let S = A1 × A2 × … × Ak be a k-dimensional numerical space
- A1, A2, …, Ak are the dimensions (attributes) of S
- A k-dimensional data stream X = {x1, x2, …, xn} is an ordered set of objects at time point t, where xi = (xi1, xi2, …, xik) and xij, the j-th component of xi, is drawn from domain Aj
4
Definition
Sliding window model on data stream X
- B1 is the most recent bucket, and Bu is the oldest
- The window slides by creating a new bucket and discarding the oldest one
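The bucket-based window above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `SlidingWindow`, the method `slide`, and the choice to store each bucket as a list of points are hypothetical and do not come from the slides.

```python
from collections import deque


class SlidingWindow:
    """Keeps the u most recent buckets; index 0 is B1 (newest), index -1 is Bu (oldest)."""

    def __init__(self, num_buckets):
        # num_buckets plays the role of u and is assumed to be at least 1.
        self.buckets = deque(maxlen=num_buckets)

    def slide(self, new_bucket):
        """Insert new_bucket as B1 and return the discarded oldest bucket, if any."""
        discarded = None
        if len(self.buckets) == self.buckets.maxlen:
            discarded = self.buckets.pop()  # drop Bu, the oldest bucket
        self.buckets.appendleft(list(new_bucket))
        return discarded


window = SlidingWindow(num_buckets=2)
window.slide([(2, 3)])
window.slide([(5, 4)])
dropped = window.slide([(6, 5)])  # the window is full, so the oldest bucket leaves
```

Returning the discarded bucket lets a caller subtract its points from any per-cell counts it maintains.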
5
Definition cont.
Partition P of data stream X
- P is a set of non-overlapping rectangular cells, obtained by partitioning every dimension of X into intervals of equal length
- Each cell C is the intersection of one interval from each dimension. It is represented in the form {c1, c2, …, ck}
- A cell can also be denoted as (cNO1, cNO2, …, cNOk), called the coordinate of the cell, where cNOi is the interval number of the cell on the i-th dimension
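As a rough illustration of the cell coordinate (cNO1, …, cNOk), the sketch below maps a point to its interval number on each dimension. The function name `cell_coordinate`, the explicit per-dimension bounds `lows` and `highs`, the single `intervals` count shared by all dimensions, and interval numbering starting at 0 are all assumptions made for the example.

```python
def cell_coordinate(point, lows, highs, intervals):
    """Return the interval number of `point` on every dimension.

    Each dimension [lo, hi] is split into `intervals` equal-length intervals,
    numbered from 0 (the numbering origin is an assumption of this sketch).
    """
    coord = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / intervals
        idx = int((x - lo) // width)
        # Clamp so that a point lying on the upper bound falls in the last interval.
        idx = min(max(idx, 0), intervals - 1)
        coord.append(idx)
    return tuple(coord)


# Two dimensions, each covering [0, 8] and split into 4 intervals of length 2.
coords = [cell_coordinate(p, (0, 0), (8, 8), 4) for p in [(2, 3), (5, 4), (6, 5)]]
```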
6
Definition cont.
Selectivity pc of cell C
- The selectivity pc of cell C is the number of points that belong to C
Cell-based clustering of data stream X in a sliding window
- If the selectivity of a cell is larger than a threshold τ, the cell is called dense
- A cluster is the largest set of cells that are adjacent and dense
- Two cells C1 and C2 are connected when they are neighboring, or when there exists a cell C3 such that C1 and C3 are neighboring and C2 and C3 are neighboring
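The definitions above can be sketched as a small batch procedure: count the selectivity of each cell, keep the dense ones, and group them into clusters. This is a minimal sketch, not the CDS-Tree algorithm from the paper. It assumes that two cells are neighboring when their coordinates differ by exactly 1 on a single dimension, and it treats connectivity as transitive (clusters are connected components of dense cells). The function names are hypothetical.

```python
from collections import Counter, deque


def dense_cells(cell_coords, tau):
    """Cells whose selectivity (number of points) is larger than tau."""
    counts = Counter(cell_coords)
    return {cell for cell, n in counts.items() if n > tau}


def neighbors(cell):
    """Assumed neighborhood: cells that differ by 1 on exactly one dimension."""
    for i in range(len(cell)):
        for step in (-1, 1):
            yield cell[:i] + (cell[i] + step,) + cell[i + 1:]


def clusters(dense):
    """Group dense cells into connected components with a breadth-first search."""
    seen, result = set(), []
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            cell = queue.popleft()
            component.append(cell)
            for nb in neighbors(cell):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        result.append(sorted(component))
    return result


# One cell coordinate per point in the window.
points_as_cells = [(1, 1)] * 3 + [(1, 2)] * 2 + [(4, 4)] * 2 + [(2, 2)]
found = clusters(dense_cells(points_as_cells, tau=1))
```

With τ = 1, cell (2, 2) holds a single point and is not dense, so it does not join the cluster formed by (1, 1) and (1, 2).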
7
CDS-Tree
[Figure: CDS-Tree for the incoming data stream (2,3), (5,4), (6,5). Labeled parts: root-node, mid, leaf, total-num-list]
8
Related Algorithms of CDS-Tree
CDS-Tree building algorithm
9
Related Algorithms of CDS-Tree
Clustering algorithm based on CDS-Tree
10
Granularity Adjustment
- The finer the partition, the higher the accuracy, but the more cells are created
- If the current memory cost Mp is far less than Mmax, we can use a finer-granularity partition for higher accuracy
- If the current memory cost Mp is close to Mmax, we should use a coarser partition to avoid memory overflow
11
Granularity Adjustment cont.
Safety factors (to guard against exhausting memory)
- λ: used to keep the required memory from exceeding the memory limit Mmax when the granularity becomes finer; it is set larger than 1
- η: decides the time point at which to adjust the granularity, where η is less than 1. For example, if η is set to 0.1, then when the remaining memory is less than 10% of Mmax, the algorithm makes the granularity coarser to save memory
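One possible reading of how the two safety factors interact is sketched below. The slides do not give the exact rule, so everything here is an assumption: the function name `granularity_action`, the default λ = 1.2, and in particular the argument `finer_cost`, a hypothetical estimate of the memory a finer partition would need, which the caller has to supply.

```python
def granularity_action(mp, mmax, eta=0.1, lam=1.2, finer_cost=None):
    """Decide whether to make the partition coarser, finer, or leave it as is.

    mp         -- current memory cost Mp
    mmax       -- memory limit Mmax
    eta        -- safety factor below 1: go coarser when free memory < eta * Mmax
    lam        -- safety factor above 1: go finer only if lam * finer_cost fits
    finer_cost -- hypothetical estimate of the memory a finer partition needs
    """
    if mmax - mp < eta * mmax:
        return "coarser"  # remaining memory is below the eta threshold
    if finer_cost is not None and lam * finer_cost <= mmax:
        return "finer"    # refinement still fits even with the lam margin
    return "keep"


near_limit = granularity_action(mp=95, mmax=100)
plenty_left = granularity_action(mp=30, mmax=100, finer_cost=60)
too_risky = granularity_action(mp=60, mmax=100, finer_cost=90)
```

Checking the coarsening condition first reflects that avoiding memory overflow takes priority over gaining accuracy.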
12
Granularity Adjustment Algorithm
13
Experimental Results
- OS: Microsoft Windows 2000
- CPU: 2.5 GHz
- RAM: 512 MB
Two datasets:
- KDD-CUP-99 Network Intrusion Detection stream dataset
- Image Fourier Coefficient dataset
14
Experimental Results