Finding Maximal Frequent Itemsets over Online Data Streams Adaptively

Slides:



Advertisements
Similar presentations
Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science.
Advertisements

Chapter 4: Trees Part II - AVL Tree
Resource-oriented Approximation for Frequent Itemset Mining from Bursty Data Streams SIGMOD’14 Toshitaka Yamamoto, Koji Iwanuma, Shoshi Fukuda.
Mining Frequent Patterns in Data Streams at Multiple Time Granularities CS525 Paper Presentation Presented by: Pei Zhang, Jiahua Liu, Pengfei Geng and.
FP-Growth algorithm Vasiljevic Vladica,
1 Finding Recent Frequent Itemsets Adaptively over Online Data Streams J. H, Chang and W.S. Lee, in Proc. Of the 9th ACM International Conference on Knowledge.
Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by: Amit Mahajan Chaitra Venus.
AA Trees another alternative to AVL trees. Balanced Binary Search Trees A Binary Search Tree (BST) of N nodes is balanced if height is in O(log N) A balanced.
Data Mining Association Analysis: Basic Concepts and Algorithms
Balanced Search Trees. 2-3 Trees Trees Red-Black Trees AVL Trees.
Lecture04 Data Compression.
Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,
Continuous Data Stream Processing  Music Virtual Channel – extensions  Data Stream Monitoring – tree pattern mining  Continuous Query Processing – sequence.
2-dimensional indexing structure
1 Mining Frequent Patterns Without Candidate Generation Apriori-like algorithm suffers from long patterns or quite low minimum support thresholds. Two.
CPSC 231 B-Trees (D.H.)1 LEARNING OBJECTIVES Problems with simple indexing. Multilevel indexing: B-Tree. –B-Tree creation: insertion and deletion of nodes.
B + -Trees Same structure as B-trees. Dictionary pairs are in leaves only. Leaves form a doubly-linked list. Remaining nodes have following structure:
BST Data Structure A BST node contains: A BST contains
B + -Trees (Part 1) Lecture 20 COMP171 Fall 2006.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.
R-Trees 2-dimensional indexing structure. R-trees 2-dimensional version of the B-tree: B-tree of maximum degree 8; degree between 3 and 8 Internal nodes.
Balanced Trees. Binary Search tree with a balance condition Why? For every node in the tree, the height of its left and right subtrees must differ by.
Indexing (cont.). Insertion in a B+ Tree Another B+ Tree
B + -Trees COMP171 Fall AVL Trees / Slide 2 Dictionary for Secondary storage * The AVL tree is an excellent dictionary structure when the entire.
B-Trees (continued) Analysis of worst-case and average number of disk accesses for an insert. Delete and analysis. Structure for B-tree node.
By : Budi Arifitama Pertemuan ke Objectives Upon completion you will be able to: Create and implement binary search trees Understand the operation.
CPSC 335 BTrees Dr. Marina Gavrilova Computer Science University of Calgary Canada.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
EFFICIENT ITEMSET EXTRACTION USING IMINE INDEX By By U.P.Pushpavalli U.P.Pushpavalli II Year ME(CSE) II Year ME(CSE)
Data : The Small Forwarding Table(SFT), In general, The small forwarding table is the compressed version of a trie. Since SFT organizes.
Heaps, Heapsort, Priority Queues. Sorting So Far Heap: Data structure and associated algorithms, Not garbage collection context.
MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007 Date: 5 June 2008 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling.
1 Balanced Trees There are several ways to define balance Examples: –Force the subtrees of each node to have almost equal heights –Place upper and lower.
Lossless Compression CIS 465 Multimedia. Compression Compression: the process of coding that will effectively reduce the total number of bits needed to.
Parallel Mining Frequent Patterns: A Sampling-based Approach Shengnan Cong.
Lecture1 introductions and Tree Data Structures 11/12/20151.
Reporter : Yu Shing Li 1.  Introduction  Querying and update in the cloud  Multi-dimensional index R-Tree and KD-tree Basic Structure Pruning Irrelevant.
CanTree: a tree structure for efficient incremental mining of frequent patterns Carson Kai-Sang Leung, Quamrul I. Khan, Tariqul Hoque ICDM ’ 05 報告者:林靜怡.
Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery.
1 Chapter 7 Objectives Upon completion you will be able to: Create and implement binary search trees Understand the operation of the binary search tree.
Spatial Indexing Techniques Introduction to Spatial Computing CSE 5ISC Some slides adapted from Spatial Databases: A Tour by Shashi Shekhar Prentice Hall.
R-Trees: A Dynamic Index Structure For Spatial Searching Antonin Guttman.
Balanced Search Trees Chapter 19 Data Structures and Problem Solving with C++: Walls and Mirrors, Carrano and Henry, © 2013.
BATON A Balanced Tree Structure for Peer-to-Peer Networks H. V. Jagadish, Beng Chin Ooi, Quang Hieu Vu.
Database Management Systems, R. Ramakrishnan 1 Algorithms for clustering large datasets in arbitrary metric spaces.
1 CSIS 7101: CSIS 7101: Spatial Data (Part 1) The R*-tree : An Efficient and Robust Access Method for Points and Rectangles Rollo Chan Chu Chung Man Mak.
Data Structures: A Pseudocode Approach with C, Second Edition 1 Chapter 7 Objectives Create and implement binary search trees Understand the operation.
Indexing and B+-Trees By Kenneth Cheung CS 157B TR 07:30-08:45 Professor Lee.
Packet Classification Using Dynamically Generated Decision Trees
1 Online Mining (Recently) Maximal Frequent Itemsets over Data Streams Hua-Fu Li, Suh-Yin Lee, Man Kwan Shan RIDE-SDMA ’ 05 speaker :董原賓 Advisor :柯佳伶.
CFI-Stream: Mining Closed Frequent Itemsets in Data Streams
Balanced Search Trees 2-3 Trees AVL Trees Red-Black Trees
Lecture Trees Professor Sin-Min Lee.
Trees Chapter 15.
AA Trees.
Multiway Search Trees Data may not fit into main memory
B+-Trees j a0 k1 a1 k2 a2 … kj aj j = number of keys in node.
New ideas on FP-Growth and batch incremental mining with FP-Tree
Online Frequent Episode Mining
B+ Trees What are B+ Trees used for What is a B Tree What is a B+ Tree
Section 8.1 Trees.
COP3530- Data Structures B Trees
B-Trees (continued) Analysis of worst-case and average number of disk accesses for an insert. Delete and analysis. Structure for B-tree node.
COMP5331 FP-Tree Prepared by Raymond Wong Presented by Raymond Wong
Approximate Frequency Counts over Data Streams
Chapter 20: Binary Trees.
B+-Trees j a0 k1 a1 k2 a2 … kj aj j = number of keys in node.
Maintaining Frequent Itemsets over High-Speed Data Streams
Finding Frequent Itemsets by Transaction Mapping
Presentation transcript:

Finding Maximal Frequent Itemsets over Online Data Streams Adaptively Daesu Lee,Wonsuk Lee IEEE,ICDM’05 報告者:林靜怡 2006/05/05

Introduction Confine the memory usage of a data mining process estDec - prefix tree estDec+ - CP-tree

CP-tree Compressed-prefix tree Effectively used in finding frequent or maximal frequent itemsets a node of a CP-tree can maintain the information of several itemsets together size of a CP-tree can be flexibly controlled by merging or splitting nodes

CP-tree(Conti.) Two consecutive nodes by a prefix tree are merged in a CP-tree when the current support difference between their corresponding itemsets is less than or equal to a merging gap threshold δ (0,1)

Definition :a prefix tree S :a subtree of :the itemset represented by the root of S :an itemset represented by a node of S δ:merging gap threshold |S|:number of nodes in S :the total number of transactions in the current data stream subtree S is a mergeable subtree and compressed into a node of a CP-tree that is equivalent to

CP-node structure A node m of CP-tree maintains four entries m(τ, π, , ) Item-listτ: - m.τ[1]:root node of S,the shortest itemset of the node m and denoted by - m.τ[|S|]:the right-most leaf node in the lowest level of S,the longest itemset of the node m and denoted by

CP-node structure Parent-index list π - y’s parent is x - Largest count - the current count of the shortest itemset Smallest count - the current count of the longest itemset

CP-tree |10-9|/10 = 0.1 0.1 <= 0.2 |10-5|/10 = 0.5 0.5 > 0.2

Merged-count Estimation Given the item-list of a node m in a CP-tree :the itemset represented by (1<=j<=n) the current counts of the remaining itemsets can be estimated by a formula f(m, j):a count estimation function that can model the count in terms of the counts and of the shortest and longest itemsets

CP-tree Maintenance :the parent node of m Node-merge - - a new significant itemset e is identified by the inserting-count estimation process, so that a new node for the itemset needs to be inserted as a child of the node m. Node-split

Finding Maximal Frequent Itemsets estDec+ Method Adaptive Memory Utilization

estDec+ Method Parameter updating phase Count updating & node restructuring phase Itemset insertion phase Maximal frequent itemset selection phase

Parameter updating phase The total number of transactions in the current data stream is updated. Count updating & node restructuring phase :m’s parent prune - split - merge -

Itemset insertion phase insert any new significant itemset which has not been maintained in any insignificant item whose current support is less than is filtered out in the transaction => Traversed to find out any new significant itemset induced by the items in

Maximal frequent itemset selection phase retrieves all the currently frequent or maximal frequent itemsets by traversing the monitoring tree Force-pruning - all the nodes whose largest counts are less than - performed periodically

Adaptive Memory Utilization In order to minimize the estimation error caused by the merged-count estimation, it is very important to keep the value ofδas small as possible. The size of a CP-tree is inversely proportional to the value of δ. the value of should be dynamically changed in the parameter update phase

Adaptive Memory Utilization :the upper bound of desired memory usage :the lower bound of desired :Confined memory space :current memory usage

Experiment Data sets:T10.I4.D1000K and WebLog 1.8 GHz Pentium PC 512 MB Memory Linux 7.3 = 0.001 Count estimation function f(m,j)

Experiment

Experiment

Experiment