CDS-Tree: An Effective Index for Clustering Arbitrary Shapes in Data Streams. Huanliang Sun, Ge Yu, Yubin Bao, Faxin Zhao, Daling Wang. RIDE-SDMA’05.

CDS-Tree: An Effective Index for Clustering Arbitrary Shapes in Data Streams
Huanliang Sun, Ge Yu, Yubin Bao, Faxin Zhao, Daling Wang
RIDE-SDMA’05
Advisor: Jia-Ling Koh
Speaker: Tsui-Feng Yen

Introduction
- Partitioning: the k-means and k-medians algorithms are not designed to find arbitrarily shaped clusters in data streams.
- Density-based: DBSCAN can find arbitrarily shaped clusters, but it needs to scan the database more than once.
- Cell-based (grid-based): CLIQUE has three problems:
  - high complexity
  - high memory usage
  - poor accuracy when memory is limited and the data stream keeps changing

Problem Definition
- Domain: A = {A1, A2, …, Ak}; S = A1 × A2 × … × Ak is a k-dimensional numerical space.
- A1, A2, …, Ak are the dimensions (attributes) of S.
- A k-dimensional data stream X = {x1, x2, …, xn} is a set of ordered objects at time point t, where xi = (xi1, xi2, …, xik) and xij, the j-th component of xi, is drawn from domain Aj.

Definition
Sliding window model on data stream X
- B1 is the most recent bucket, and Bu is the oldest.
- The window slides by creating a new bucket and discarding the oldest one.
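The bucket-based window can be sketched as follows. This is a minimal illustration, not the authors' code: the class name `SlidingWindow`, the parameters `num_buckets` and `bucket_size`, and the choice to close a bucket after a fixed number of points (rather than after a time span) are all assumptions.

```python
from collections import deque


class SlidingWindow:
    """Keeps the u most recent buckets; buckets[0] is B1 (newest), buckets[-1] is Bu (oldest)."""

    def __init__(self, num_buckets, bucket_size):
        self.bucket_size = bucket_size  # assumed: a bucket closes after this many points
        # A bounded deque drops the item at the far end when a new one is pushed
        # onto a full deque, which matches "discard the oldest bucket".
        self.buckets = deque(maxlen=num_buckets)

    def add(self, point):
        # Create a new bucket when none exists yet or the newest one is full.
        if not self.buckets or len(self.buckets[0]) >= self.bucket_size:
            self.buckets.appendleft([])
        self.buckets[0].append(point)

    def points(self):
        # All points currently inside the window, newest bucket first.
        return [p for bucket in self.buckets for p in bucket]
```

With `num_buckets=2` and `bucket_size=2`, adding five points leaves only the last three in the window, because the first bucket has been discarded.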

Definition cont.
Partition P of data stream X
- P is a set of non-overlapping rectangular cells, obtained by partitioning every dimension of X into intervals of equal length.
- Each cell C is the intersection of one interval from each dimension. It is represented in the form {c1, c2, …, ck}.
- A cell can also be denoted as (cNO1, cNO2, …, cNOk), called the coordinate of the cell, where cNOi is the interval number of the cell on the i-th dimension.
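As a rough illustration of the cell coordinate, the sketch below maps a point to its interval numbers under an equal-length partition. The function name `cell_coordinate`, the per-dimension bounds `lows` and `highs`, and the 0-based interval numbering are assumptions made for the example; the slides do not say how intervals are numbered.

```python
def cell_coordinate(point, lows, highs, intervals):
    """Return (cNO1, ..., cNOk) for a point, using `intervals` equal-length
    intervals per dimension. Interval numbers start at 0 (an assumption)."""
    coord = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / intervals
        idx = int((x - lo) // width)
        # Clamp so that a value on the upper bound falls into the last interval.
        idx = min(max(idx, 0), intervals - 1)
        coord.append(idx)
    return tuple(coord)
```

For a 2-dimensional space bounded by 0 and 8 in each dimension and split into 4 intervals, the point (2, 3) falls into the cell with coordinate (1, 1).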

Definition cont.
Selectivity pc of cell C
- The number of points that belong to C defines the selectivity pc of cell C.
Clustering based on cells, for data stream X in a sliding window
- If the selectivity of a cell is larger than a threshold τ, the cell is called dense.
- A cluster is the largest set of cells that are adjacent and dense.
- Two cells C1 and C2 are connective when they are neighboring, or when there exists a cell C3 such that C1 and C3 are neighboring and C2 and C3 are neighboring.
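The definitions above can be sketched as a connected-components search over dense cells. This is only an illustration under two assumptions that the slides do not spell out: "neighboring" is taken to mean that two cells differ by exactly 1 in a single dimension, and connectivity is read as transitive, so a cluster is grown through chains of neighboring dense cells. The function names are hypothetical.

```python
from collections import Counter, deque


def dense_cells(cell_coords, tau):
    """Cells whose selectivity (number of points) is larger than tau."""
    counts = Counter(cell_coords)
    return {cell for cell, n in counts.items() if n > tau}


def neighbors(cell):
    """Assumed neighborhood: cells that differ by 1 in exactly one dimension."""
    for i in range(len(cell)):
        for step in (-1, 1):
            yield cell[:i] + (cell[i] + step,) + cell[i + 1:]


def clusters(dense):
    """Group dense cells into clusters by breadth-first search."""
    seen, result = set(), []
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        group, queue = [], deque([start])
        while queue:
            cell = queue.popleft()
            group.append(cell)
            for nb in neighbors(cell):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        result.append(sorted(group))
    return result
```

The input to `dense_cells` is one cell coordinate per point in the window; cells that are not dense are simply ignored when clusters are grown.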

CDS-Tree
[Figure: CDS-Tree built for the incoming data stream (2,3), (5,4), (6,5). Labelled parts: root-node, mid, leaf, total-num-list.]
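The figure for this slide is not part of the transcript, so the exact node layout is unknown. The sketch below is therefore a hypothetical stand-in, not the CDS-Tree itself: it assumes one tree level per dimension keyed by interval number, stores only non-empty cells, and keeps a per-bucket count list at each leaf (a guess at what the "total-num-list" label might hold). The names `CellTree`, `insert`, and `selectivity` are invented for the example.

```python
class CellTree:
    """Hypothetical cell index: nested dicts, one level per dimension."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        self.root = {}  # root node: interval number on dimension 1 -> child

    def insert(self, coord, bucket_index):
        """Count one point for the cell `coord` in the given bucket."""
        node = self.root
        for interval in coord[:-1]:
            node = node.setdefault(interval, {})  # intermediate level
        # Leaf entry: assumed list of point counts, one slot per bucket.
        counts = node.setdefault(coord[-1], [0] * self.num_buckets)
        counts[bucket_index] += 1

    def selectivity(self, coord):
        """Total number of points in the cell across all buckets (0 if absent)."""
        node = self.root
        for interval in coord:
            node = node.get(interval) if isinstance(node, dict) else None
            if node is None:
                return 0
        return sum(node)
```

Because empty cells never get an entry, memory grows with the number of occupied cells rather than with the full grid size, which is the property a cell index for streams needs.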

Related Algorithms of CDS-Tree CDS-Tree building algorithm

Related Algorithms of CDS-Tree Clustering algorithm based on CDS-Tree.

Granularity Adjustment
- The finer the partition is, the higher the accuracy is, but the more cells are created.
- If the current memory cost Mp is far less than Mmax, we can use a finer-granularity partition for higher accuracy.
- If the current memory cost Mp is close to Mmax, we should use a coarser partition to avoid memory overflow.

Granularity Adjustment cont.
Safety factors (in case memory is exhausted)
- λ: used to keep the required memory from exceeding the limited memory Mmax when the granularity turns finer; it is set larger than 1.
- η: decides the time point at which the granularity is adjusted; η is less than 1. For example, η = 0.1 means that when the remaining memory is less than 10% of Mmax, the algorithm turns the granularity coarser to save memory.
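One way to read the two safety factors is as a simple decision rule, sketched below. The algorithm slide itself is not in the transcript, so this is an assumption-laden illustration: the function name `adjust_granularity`, the argument `est_finer` (an estimate of the memory a finer partition would need, however it is obtained), and the default values are all hypothetical.

```python
def adjust_granularity(m_p, m_max, est_finer, lam=1.2, eta=0.1):
    """Decide whether to make the partition coarser, finer, or keep it.

    m_p       -- current memory cost Mp
    m_max     -- memory limit Mmax
    est_finer -- assumed estimate of the memory cost after refining
    lam       -- safety factor lambda (> 1) applied to the estimate
    eta       -- trigger fraction eta (< 1) of Mmax
    """
    # Remaining memory below eta * Mmax: coarsen to avoid overflow.
    if m_max - m_p < eta * m_max:
        return "coarser"
    # Refine only if the estimate, inflated by lambda, still fits in Mmax.
    if lam * est_finer <= m_max:
        return "finer"
    return "keep"
```

For example, with Mmax = 100 and Mp = 95 the remaining 5 units are below the 10% trigger, so the rule returns "coarser"; with Mp = 20 and an estimated finer cost of 80 it returns "finer".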

Granularity Adjustment Algorithm

Experimental Results
- OS: Microsoft Windows 2000
- CPU: 2.5GHz
- RAM: 512MB
- Two databases:
  - KDD-CUP-99 Network Intrusion Detection stream dataset
  - Image Fourier Coefficient dataset

Experimental Results