Presented by Niwan Wattanakitrungroj

Presentation transcript:

A Framework for Clustering Evolving Data Streams. Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu. Proceedings of the 29th VLDB Conference, 2003. Presented by Niwan Wattanakitrungroj, 22 June 2010.

Outline: Introduction; The Stream Clustering Framework; Online Micro-cluster Maintenance; Macro-Cluster Creation; Experimental Results; Conclusions.

Introduction. Clustering problem: to partition a set of data points into one or more groups of similar objects. Traditional clustering algorithms are not efficient for clustering a data stream. A data stream may grow at an unlimited rate and may evolve over time. A data stream cannot be revisited over the course of the computation.

Introduction (cont.) Previous work: STREAM (O’Callaghan et al., 200). They implemented a continuous version of the k-means algorithm. It is unsafe for an evolving data stream, because k-means is highly sensitive to the order of arrival of the data points. If two clusters are merged, there is no way to split them again when the evolution of the stream later requires it.

The Stream Clustering Framework. CluStream (proposed) has two components. The online component (micro-cluster maintenance) periodically stores summary statistics. The offline component (macro-cluster creation), which is utilized by the analyst, uses only these summary statistics.

The Stream Clustering Framework (cont.) Definition 1: A micro-cluster for a set of d-dimensional points $X_{i_1}, \ldots, X_{i_n}$ with time stamps $T_{i_1}, \ldots, T_{i_n}$ is defined as the $(2 \cdot d + 3)$ tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$, where:
$\overline{CF2^x}$ is a vector of d values; its p-th entry is the sum of the squares of the data values in dimension p, i.e., $\sum_{j=1}^{n} (x_{i_j}^p)^2$.
$\overline{CF1^x}$ is a vector of d values; its p-th entry is the sum of the data values in dimension p, i.e., $\sum_{j=1}^{n} x_{i_j}^p$.
$CF2^t$ is the sum of the squares of the time stamps, $\sum_{j=1}^{n} T_{i_j}^2$.
$CF1^t$ is the sum of the time stamps, $\sum_{j=1}^{n} T_{i_j}$.
$n$ is the number of data points.

The Stream Clustering Framework (cont.) $CFT(C)$: the cluster feature vector of the micro-cluster for a set of points C.
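The tuple in Definition 1 can be held in a small record that is updated in constant time per point. The following is a minimal sketch, not the authors' implementation; the class name `MicroCluster`, its field names, and the `insert`, `merge` and `centroid` methods are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MicroCluster:
    # Field names are illustrative; they mirror the five parts of Definition 1.
    cf2x: List[float]   # per-dimension sum of squared data values
    cf1x: List[float]   # per-dimension sum of data values
    cf2t: float = 0.0   # sum of squared time stamps
    cf1t: float = 0.0   # sum of time stamps
    n: int = 0          # number of data points

    @classmethod
    def empty(cls, d: int) -> "MicroCluster":
        """Create an empty micro-cluster for d-dimensional points."""
        return cls([0.0] * d, [0.0] * d)

    def insert(self, point: Sequence[float], timestamp: float) -> None:
        """Add one point; only the running sums change."""
        for p, x in enumerate(point):
            self.cf2x[p] += x * x
            self.cf1x[p] += x
        self.cf2t += timestamp * timestamp
        self.cf1t += timestamp
        self.n += 1

    def merge(self, other: "MicroCluster") -> "MicroCluster":
        """Combine two micro-clusters by adding their statistics."""
        return MicroCluster(
            [a + b for a, b in zip(self.cf2x, other.cf2x)],
            [a + b for a, b in zip(self.cf1x, other.cf1x)],
            self.cf2t + other.cf2t,
            self.cf1t + other.cf1t,
            self.n + other.n,
        )

    def centroid(self) -> List[float]:
        """Mean of the points, recovered from the linear sums."""
        return [s / self.n for s in self.cf1x]
```

Because every field is a plain sum, two micro-clusters can be combined without revisiting any data point, which is what makes the summary suitable for a stream.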

The Stream Clustering Framework (cont.) To find the clusters over a history of length h, use the subtractive property of micro-clusters: subtract the snapshot taken at time $t_c - h$ from the snapshot taken at the current time $t_c$. How many snapshots should be stored?
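The subtraction of two snapshots can be sketched as follows. This is an illustration under simplifying assumptions: each snapshot is represented as a dictionary from a micro-cluster id to a `(cf2x, cf1x, cf2t, cf1t, n)` tuple, ids are matched one to one, and merged clusters (which would carry a list of ids) are not handled. The function name `subtract_snapshots` is hypothetical.

```python
def subtract_snapshots(current, past):
    """Return the statistics of the points that arrived after `past` was taken.

    Both arguments map a micro-cluster id to (cf2x, cf1x, cf2t, cf1t, n).
    """
    window = {}
    for cid, (cf2x, cf1x, cf2t, cf1t, n) in current.items():
        if cid in past:
            p2x, p1x, p2t, p1t, pn = past[cid]
            # Every component is a sum, so the older part can be subtracted out.
            cf2x = [a - b for a, b in zip(cf2x, p2x)]
            cf1x = [a - b for a, b in zip(cf1x, p1x)]
            cf2t -= p2t
            cf1t -= p1t
            n -= pn
        if n > 0:
            # Keep only micro-clusters that received points inside the horizon.
            window[cid] = (cf2x, cf1x, cf2t, cf1t, n)
    return window
```

Micro-clusters that exist only in the current snapshot are kept unchanged, since all of their points arrived inside the horizon.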

The Stream Clustering Framework (cont.) Pyramidal time frame. Snapshots are classified into orders ranging from 0 to $\log_\alpha(T)$. Snapshots of the i-th order occur at time intervals of $\alpha^i$, where $\alpha \geq 1$ is an integer: a snapshot of order i is taken at a moment in time t when t is exactly divisible by $\alpha^i$. Only the last $\alpha^l + 1$ snapshots of order i are stored ($l \geq 1$). The maximum number of snapshots at any moment is $(\alpha^l + 1) \cdot \log_\alpha(T)$. All the snapshots of order i whose clock times are not divisible by $\alpha^{i+1}$ are non-redundant.

Example with α = 2 and l = 2 (current clock time 55):

Order of snapshots   Clock times (last 5 snapshots)
0                    55 54 53 52 51
1                    54 52 50 48 46
2                    52 48 44 40 36
3                    48 40 32 24 16
4                    48 32 16
5                    32
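The snapshot table for α = 2 and l = 2 can be reproduced with a few lines of code. This is a sketch that assumes clock times start at 1 and that α is at least 2 so the loop terminates; the function name `pyramidal_snapshots` and its parameters are hypothetical.

```python
def pyramidal_snapshots(current_time, alpha=2, l=2):
    """Map each snapshot order to the clock times retained at `current_time`."""
    if alpha < 2:
        raise ValueError("this sketch assumes alpha >= 2")
    keep = alpha ** l + 1            # snapshots retained per order
    frame = {}
    order = 0
    while alpha ** order <= current_time:
        step = alpha ** order        # order-i snapshots fall on multiples of alpha^i
        latest = (current_time // step) * step
        times = [latest - k * step for k in range(keep)]
        frame[order] = [t for t in times if t >= 1]
        order += 1
    return frame


for order, times in pyramidal_snapshots(55).items():
    print(order, times)
```

With `current_time=55` the rows match the clock times listed for orders 0 to 5. The same time can appear under several orders (for example 48), which is the redundancy that the divisibility rule removes.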

Online Micro-cluster Maintenance. Initialization: create the initial q micro-clusters by applying a standard k-means algorithm. Online update: each new data point is either absorbed by an existing micro-cluster or placed in a new micro-cluster of its own.

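Online Micro-cluster Maintenance (cont.) For a new data point, find the closest micro-cluster. The maximum boundary of a micro-cluster is defined as a factor t of the RMS deviation of its data points from the centroid. If the new point falls within the maximum boundary of the closest micro-cluster, it is absorbed by that micro-cluster; otherwise, a new micro-cluster is created.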
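The absorb-or-create decision can be sketched directly from the summary statistics. This is an illustration only: each micro-cluster is reduced to a `(cf2x, cf1x, n)` tuple, time stamps are left out, and the function names and the default `t=2.0` are assumptions. A single-point micro-cluster has zero RMS deviation, so a practical version needs a fallback boundary, which this sketch omits.

```python
import math


def centroid(cf1x, n):
    """Mean of the points, from the per-dimension sums."""
    return [s / n for s in cf1x]


def rms_deviation(cf2x, cf1x, n):
    """RMS distance of the points from the centroid, from the sums alone."""
    variance = sum(max(s2 / n - (s1 / n) ** 2, 0.0) for s2, s1 in zip(cf2x, cf1x))
    return math.sqrt(variance)


def absorb_or_create(point, clusters, t=2.0):
    """Update `clusters` in place and return the index that received the point."""
    # Find the micro-cluster whose centroid is closest to the new point.
    i = min(
        range(len(clusters)),
        key=lambda j: math.dist(point, centroid(clusters[j][1], clusters[j][2])),
    )
    cf2x, cf1x, n = clusters[i]
    boundary = t * rms_deviation(cf2x, cf1x, n)
    if math.dist(point, centroid(cf1x, n)) <= boundary:
        # Inside the maximum boundary: absorb by updating the sums.
        clusters[i] = (
            [a + x * x for a, x in zip(cf2x, point)],
            [b + x for b, x in zip(cf1x, point)],
            n + 1,
        )
        return i
    # Outside the boundary: start a new micro-cluster holding only this point.
    clusters.append(([x * x for x in point], list(point), 1))
    return len(clusters) - 1
```

In the full algorithm, creating a new micro-cluster is followed by deleting or merging an existing one so that the total number stays fixed.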

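Online Micro-cluster Maintenance (cont.) When a new micro-cluster is created, it is assigned a new id, and the number of micro-clusters must then be reduced by one. Find the relevance stamp of each micro-cluster M: calculate the mean and standard deviation of its time stamps from CF1t and CF2t; the relevance stamp is the time of arrival of the m/(2*n)-th percentile of the points in M. If the least relevance stamp is below the threshold δ, delete that micro-cluster. Otherwise, join the two closest micro-clusters and create an idlist that is the union of the ids of the merged micro-clusters.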
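The relevance stamp can be estimated from CF1t, CF2t and n without storing individual time stamps. The sketch below assumes the time stamps are normally distributed and reads the m/(2*n)-th percentile as the point with a fraction m/(2*n) of the arrivals after it; both are assumptions of this sketch, as is the fallback to the mean when the micro-cluster holds no more than m points. The function name `relevance_stamp` is hypothetical.

```python
import math
from statistics import NormalDist


def relevance_stamp(cf1t, cf2t, n, m):
    """Approximate the average arrival time of the last m points of a micro-cluster."""
    mean = cf1t / n
    if n <= m:
        # Too few points to look at a tail; use the mean time stamp (assumption).
        return mean
    variance = max(cf2t / n - mean ** 2, 0.0)   # guard against rounding below zero
    sigma = math.sqrt(variance)
    # Time stamp with a fraction m / (2 * n) of the points arriving after it.
    z = NormalDist().inv_cdf(1.0 - m / (2.0 * n))
    return mean + z * sigma
```

A micro-cluster whose relevance stamp falls below the threshold δ has seen little recent activity and is the candidate for deletion.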

Macro-Cluster Creation. Macro-clusters are built from the compactly stored summary statistics of the micro-clusters. Inputs from the analyst: the time horizon h and the number of higher-level clusters k. A modified k-means algorithm is applied, in which the micro-clusters are treated as pseudo-points.
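One way to treat micro-clusters as pseudo-points is a weighted k-means over their centroids, with each centroid weighted by its point count. The sketch below illustrates that idea and is not necessarily the exact modification used in CluStream; the function name `macro_cluster`, its parameters, and the weighted seeding are assumptions.

```python
import math
import random


def macro_cluster(centroids, weights, k, iterations=20, seeds=None, rng=None):
    """Weighted k-means over micro-cluster centroids treated as pseudo-points."""
    rng = rng or random.Random(0)
    if seeds is None:
        # Pick k distinct pseudo-points, favouring those that summarize more data.
        pool = list(range(len(centroids)))
        seeds = []
        for _ in range(k):
            pick = rng.choices(pool, weights=[weights[i] for i in pool])[0]
            pool.remove(pick)
            seeds.append(list(centroids[pick]))
    else:
        seeds = [list(s) for s in seeds]
    assignment = [0] * len(centroids)
    for _ in range(iterations):
        # Assign each pseudo-point to its nearest seed.
        assignment = [
            min(range(k), key=lambda j: math.dist(c, seeds[j])) for c in centroids
        ]
        # Move each seed to the weighted centroid of its pseudo-points.
        for j in range(k):
            members = [i for i, a in enumerate(assignment) if a == j]
            total = sum(weights[i] for i in members)
            if total == 0:
                continue  # leave an empty cluster's seed where it is
            dim = len(seeds[j])
            seeds[j] = [
                sum(weights[i] * centroids[i][d] for i in members) / total
                for d in range(dim)
            ]
    return seeds, assignment
```

The inputs would come from the micro-clusters that remain after restricting the stream to the horizon h: `centroids` from their linear sums divided by n, and `weights` from their point counts n.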

Experimental Results. Test environment and data sets. Comparison: CluStream (proposed) vs. STREAM (O’Callaghan et al.). Data sets: KDD-CUP’99 Network Intrusion Detection (33 attributes) and KDD-CUP’98 Charitable Donation (56 attributes). Quality of clustering: measured by the sum of squared distances (SSQ). Parameter setting:

Experimental Results (cont.) Figures: results on the Network Intrusion dataset and the Charitable Donation dataset. Panels: horizon = 1, stream speed = 2000; horizon = 4, stream speed = 200; horizon = 256, stream speed = 200; horizon = 16, stream speed = 200.

Experimental Results (cont.) Figures: stream processing rate. Panels: Charitable Donation dataset, stream speed = 2000; Network Intrusion dataset, stream speed = 2000.

Conclusions. CluStream is a clustering method for large evolving data streams. It views the stream as a process that changes over time, and it gives the analyst flexibility in a real-time, evolving environment.

Thank you Q & A