
SCLOPE: An Algorithm for Clustering Data Streams of Categorical Attributes
K.L. Ong, W. Li, W.K. Ng, and E.P. Lim
Proc. of the 6th Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK'04), Zaragoza, Spain, September 2004
Presenter: Wu Chien-Liang

Outline
- Clustering a Data Stream of Categorical Attributes
- SCLOPE Algorithm
- CluStream Framework
- CLOPE Algorithm
- FP-Tree-Like Structure
- Experimental Results

Clustering a Data Stream of Categorical Attributes
Technical challenges:
- High dimensionality
- Sparsity of categorical datasets
Additional stream constraints:
- One-pass I/O
- Low CPU consumption

SCLOPE Algorithm
Adopts two aspects of the CluStream framework:
- Pyramidal time frame: stores summary statistics at different time granularities
- Separation of the clustering process into an online micro-clustering component and an offline macro-clustering component
SCLOPE:
- Online: pyramidal time frame, FP-tree-like structure
- Offline: CLOPE clustering algorithm

CLOPE Clustering Algorithm
Cluster quality measure: the cluster histogram. A larger height-to-width ratio means better intra-cluster similarity: H = S / W, where S is the histogram size (total item occurrences) and W its width (number of distinct items).
Example transactions:
Tid  Items
1    ab
2    abc
3    acd
4    de
5    def
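To make the measure concrete, here is a minimal Python sketch (mine, not the paper's code), with a cluster represented as a list of transactions, each a set of items:

```python
from collections import Counter

def histogram(cluster):
    # Cluster histogram: occurrence count of every item across the cluster.
    return Counter(item for txn in cluster for item in txn)

def width(cluster):
    return len(histogram(cluster))            # W: number of distinct items

def size(cluster):
    return sum(histogram(cluster).values())   # S: total item occurrences

def height(cluster):
    return size(cluster) / width(cluster)     # H = S / W
```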

CLOPE Clustering Algorithm (cont.)
Suppose a clustering C = {C1, C2, …, Ck}.
Height-to-width ratio (gradient): G(Ci) = H(Ci) / W(Ci) = S(Ci) / W(Ci)^2
Criterion (profit) function: Profit_r(C) = (Σi S(Ci) / W(Ci)^r × |Ci|) / Σi |Ci|
r is the repulsion, which controls the level of intra-cluster similarity.
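A sketch of the criterion, building on the helpers above; the size-weighted normalization follows the profit formula:

```python
def profit(clustering, r=2.0):
    # Profit_r(C) = sum_i S(Ci)/W(Ci)^r * |Ci|  /  sum_i |Ci|
    total = sum(len(c) for c in clustering)
    return sum(size(c) / width(c) ** r * len(c) for c in clustering) / total
```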

CLOPE Clustering Algorithm (cont.)
Initial phase (the final assignment is shown in the Clus column):
Tid  Items  Clus
1    ab     C1
2    abc    C1
3    acd    C1
4    de     C2
5    def    C2
- C1 = {T1} starts with items a, b.
- T2 is tentatively added to C1 (profit = 0.55) or placed in a new cluster C2 = {T2} (profit = 0.41); it joins C1, so C1 = {T1, T2} with items a, b, c.
- T3 is tentatively added to C1 (profit = 0.5) or placed in a new cluster C2 = {T3} (profit = 0.41); it joins C1, so C1 = {T1, T2, T3} with items a, b, c, d.
Final result: C1 = {T1, T2, T3} (items a, b, c, d) and C2 = {T4, T5} (items d, e, f).
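The T2 decision can be reproduced with the sketch above; r = 2 is an assumption, inferred from matching the slide's numbers (which truncate 0.556 and 0.417 to 0.55 and 0.41):

```python
T1, T2 = {'a', 'b'}, {'a', 'b', 'c'}

print(profit([[T1, T2]]))    # T2 joins C1:           5/3^2 * 2 / 2  ~ 0.56
print(profit([[T1], [T2]]))  # T2 starts its own C2:  (2/4 + 3/9) / 2 ~ 0.42
```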

CLOPE Clustering Algorithm (cont.)
Iteration phase:
repeat
  moved = false
  for all transactions t in the database
    move t to the existing or new cluster Cj that maximizes profit
    if Ci ≠ Cj then write <t, j> and set moved = true
until not moved
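A runnable sketch of this loop, assuming the profit function above; the cluster bookkeeping here is a simplification of mine, not the paper's:

```python
def best_cluster(t, clusters, r=2.0):
    # Index of the existing or brand-new cluster that maximizes profit for t.
    best_i, best_p = 0, float('-inf')
    for i in range(len(clusters) + 1):          # i == len(clusters): new cluster
        trial = [list(c) for c in clusters] + [[]]
        trial[i].append(t)
        p = profit([c for c in trial if c], r)
        if p > best_p:
            best_i, best_p = i, p
    return best_i

def iteration_phase(transactions, clusters, r=2.0):
    moved = True
    while moved:
        moved = False
        for t in transactions:
            cur = next(i for i, c in enumerate(clusters) if t in c)
            clusters[cur].remove(t)
            j = best_cluster(t, clusters, r)
            if j == len(clusters):
                clusters.append([])
            clusters[j].append(t)
            if j != cur:
                moved = True
        clusters[:] = [c for c in clusters if c]   # drop emptied clusters
    return clusters
```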

Maintain Summary Statistics
Data stream: a set of records R1, …, Ri, … arriving at time stamps t1, …, ti, …; each record R contains attributes A = {a1, a2, …, aj}.
A micro-cluster within a time window [tp, tq] is defined as a tuple of
- a vector of record identifiers, and
- a cluster histogram, with
  - width W: number of distinct attribute values,
  - size S: total occurrence count,
  - height H = S / W: size-to-width ratio.
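A possible in-memory representation of one micro-cluster (the field names are mine, not the paper's):

```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class MicroCluster:
    record_ids: list = field(default_factory=list)   # vector of record identifiers
    hist: Counter = field(default_factory=Counter)   # cluster histogram

    @property
    def W(self):                  # width: number of distinct attribute values
        return len(self.hist)

    @property
    def S(self):                  # size: total occurrence count
        return sum(self.hist.values())

    @property
    def H(self):                  # height: size-to-width ratio
        return self.S / self.W

    def add(self, rid, record):   # fold one record into the summary
        self.record_ids.append(rid)
        self.hist.update(record)
```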

FP-Tree-Like Structure
Drawbacks of CLOPE:
- Multiple scans of the dataset
- Multiple evaluations of the criterion function for each record
The FP-tree-like structure requires only two scans of the dataset:
- First scan: determine the singleton frequencies
- Second scan: insert each record into the FP-tree after arranging its attributes in descending order of singleton frequency
Records that share common prefixes fall into the same paths, without the need to compute the clustering criterion.

Construct FP-Tree-Like Structure
Scan the database once to obtain singleton frequencies: a:3, d:3, b:2, c:2, e:2, f:1.
Arrange each transaction in descending frequency order:
Tid  Items  Arranged
1    ab     ab
2    abc    abc
3    acd    adc
4    de     de
5    def    def
Resulting tree paths (node:count):
null → a:3 → b:2 → c:1
null → a:3 → d:1 → c:1
null → d:2 → e:2 → f:1
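A minimal sketch of the two-scan construction on this example (the node layout is an assumption; the paper's structure also carries per-path histograms):

```python
from collections import Counter

def build_tree(transactions):
    freq = Counter(i for t in transactions for i in t)    # scan 1: singletons
    root = {}
    for t in transactions:                                # scan 2: insert records
        node = root
        for i in sorted(t, key=lambda i: (-freq[i], i)):  # descending frequency
            entry = node.setdefault(i, [0, {}])           # node: [count, children]
            entry[0] += 1
            node = entry[1]
    return root

txns = [set('ab'), set('abc'), set('acd'), set('de'), set('def')]
tree = build_tree(txns)   # yields the paths a-b-c, a-d-c, d-e-f shown above
```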

FP-Tree-Like Structure (cont.)
Each path from the root to a leaf node is a micro-cluster. The number of micro-clusters depends on the available memory space.
Merge strategy:
- Select the node with the longest common prefix
- Select any two paths passing through that node
- Merge their corresponding micro-clusters and cluster histograms

Online Micro-clustering Component of SCLOPE
On beginning of window wi do
1: if (i = 0) then Q′ ← {a random order of v1, …, v|A|}
2: T ← new FP-tree and Q ← Q′
3: for all incoming records R do
4:   order R according to Q and update Q′
5:   if R can be inserted completely along an existing path Pi in T then
6:     update the cluster histogram of Pi
7:   else
8:     Pj ← new path in T and a new cluster histogram for Pj
9:     insert R into Pj and update its histogram

Online Micro-clustering Component of SCLOPE (cont.)
On end of window wi do
10: L ← {<n, height(n)>: n is a node in T with ≥ 2 children}
11: order L according to height(n)
12: while more micro-clusters remain than memory allows do
13:   select <n, height(n)> with the largest value
14:   select paths Pi, Pj that pass through n
15:   merge Pi and Pj, along with their micro-clusters and histograms
16:   delete the merged-away path
17: output micro-clusters and cluster histograms
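The merge in line 15 amounts to adding the two summaries together; a sketch using the MicroCluster class above:

```python
def merge(mc_i, mc_j):
    # Merge two micro-clusters whose paths share a common prefix:
    # concatenate the record-id vectors and add the cluster histograms.
    mc_i.record_ids.extend(mc_j.record_ids)
    mc_i.hist.update(mc_j.hist)
    return mc_i
```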

Offline Macro-clustering Component of SCLOPE
Inputs: time horizon h and repulsion r
- h spans one or more windows
- r controls the intra-cluster similarity
The profit function is the clustering criterion. Each micro-cluster is treated as a pseudo-record; since there are far fewer micro-clusters than physical records, it takes less time to converge on the clustering criterion.
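One way to realize "micro-cluster as pseudo-record" is to feed the histograms into a weighted variant of the profit function; this is my interpretation, not the paper's exact formulation:

```python
from collections import Counter

def weighted_profit(clustering, r=2.0):
    # clustering: list of clusters, each a list of micro-cluster histograms.
    total = sum(len(c) for c in clustering)
    p = 0.0
    for c in clustering:
        merged = Counter()
        for hist in c:
            merged.update(hist)          # histograms add, exactly as in a merge
        S, W = sum(merged.values()), len(merged)
        p += S / W ** r * len(c)
    return p / total
```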

Experimental Results
Environment: Pentium-4 2 GHz CPU, 1 GB RAM, Windows 2000
Aspects: performance, scalability, cluster accuracy
Datasets: real-world and synthetic data

Performance and Scalability
Real-life data from the FIMI repository (http://fimi.cs.helsinki.fi/data/).

Performance and Scalability (cont.)
Synthetic data from the IBM synthetic data generator. Dataset: 50k records with 1000 attributes, in three settings: (a) 50 clusters, (b) 100 clusters, (c) 500 clusters.

Cluster Accuracy
Mushroom dataset: 117 distinct attribute values and 8124 records, in two classes (4208 edible, 3916 poisonous).
Purity metric: the average percentage of the dominant class label in each cluster.
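Purity is easy to state in code; a sketch of the unweighted average over clusters, which is how the slide reads:

```python
from collections import Counter

def purity(clusters, labels):
    # clusters: list of lists of record ids; labels: record id -> class label.
    fractions = []
    for c in clusters:
        counts = Counter(labels[rid] for rid in c)
        fractions.append(counts.most_common(1)[0][1] / len(c))  # dominant share
    return sum(fractions) / len(fractions)
```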

Cluster Accuracy (cont.)
- Purity of the online micro-clustering component of SCLOPE
- SCLOPE vs. CLOPE