Mining Frequent Patterns from Data Streams

Mining Frequent Patterns from Data Streams Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Charu Aggarwal, Jiawei Han, and Jure Leskovec

OUTLINE
Data Streams
- Characteristics of Data Streams
- Key Challenges in Stream Data
Frequent Pattern Mining over Data Streams
- Counting Itemsets
- Lossy Counting
- Extensions

Data Stream What is a data stream? A data stream is an ordered sequence of instances that, in many applications of data stream mining, can be read only once or a small number of times, using limited computing and storage capabilities.

Data Streams
Traditional DBMS: data stored in finite, persistent data sets
Data streams: continuous, ordered, changing, fast, huge in volume; managed by a Data Stream Management System (DSMS)

DBMS versus DSMS
DBMS                                   | DSMS
Persistent relations                   | Transient streams
One-time queries                       | Continuous queries
Random access                          | Sequential access
"Unbounded" disk store                 | Bounded main memory
Only current state matters             | Historical data is important
No real-time services                  | Real-time requirements
Relatively low update rate             | Possibly multi-GB arrival rate
Data at any granularity                | Data at fine granularity
Assume precise data                    | Data stale/imprecise
Access plan determined by query processor and physical DB design | Unpredictable/variable data arrival and characteristics

Data Streams
Data streams: continuous, ordered, changing, fast, huge in volume
Traditional DBMS: data stored in finite, persistent data sets
Characteristics of data streams: fast changing, requiring fast, real-time response

Characteristics of Data Streams
Data streams: continuous, ordered, changing, fast, huge in volume
Traditional DBMS: data stored in finite, persistent data sets
Characteristics:
- Fast changing; requires fast, real-time response
- Huge volumes of continuous data, possibly infinite
- The stream model captures many of today's data processing needs

Stream Data Applications
- Telecommunication: calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering & industrial processes: power supply & manufacturing
- Sensor, monitoring & surveillance: video streams, RFIDs
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even if saved, random access is too expensive)

Characteristics of Data Streams
Data streams: continuous, ordered, changing, fast, huge in volume
Traditional DBMS: data stored in finite, persistent data sets
Characteristics:
- Fast changing; requires fast, real-time response
- Huge volumes of continuous data, possibly infinite
- The stream model captures many of today's data processing needs
- Random access is expensive: single-scan algorithms (can only have one look)
- Store only a summary of the data seen so far
- Most stream data is low-level or multi-dimensional in nature, and needs multi-level, multi-dimensional processing

Key Challenges in Stream Data
Mining precise frequent patterns in stream data is unrealistic. Key challenges:
- Infinite length
- Concept-drift
- Concept-evolution
- Feature evolution

Key Challenges: Infinite Length In many data mining situations, we do not know the entire data set in advance. Stream management is important when the input rate is controlled externally Examples: Google queries, Twitter or Facebook status updates

Key Challenges: Infinite Length Infinite length: it is impractical to store and use all historical data, which would require infinite storage and running time.

Key Challenges: Concept-Drift
[Figure: a data chunk of positive and negative instances; the decision hyperplane shifts from its previous position to the current one, and instances near the old boundary become victims of concept-drift.]

Key Challenges: Concept-Evolution
[Figure: a chunk of two-dimensional data points with classes + and -, split by thresholds x1, y1, y2 into regions A, B, C, D; a novel class x later appears inside the existing regions.]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Concept-evolution occurs when a new class arrives in the stream. In this example, we again see a data chunk of two-dimensional data points with two classes, + and -. Suppose we train a rule-based classifier using this chunk, and a new class x arrives in the stream in the next chunk. If we use the same classification rules, all novel-class instances will be misclassified as either + or -.
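The misclassification can be seen directly by coding rules R1/R2. This is a minimal illustrative sketch; the threshold values X1, Y1, Y2 are arbitrary stand-ins, since the slide's figure does not give numeric values.

```python
# Rule-based classifier R1/R2 from the slide; thresholds are illustrative.
X1, Y1, Y2 = 5.0, 3.0, 7.0

def classify(x, y):
    """R1: (x > x1 and y < y2) or (x < x1 and y < y1)  -> '+'
       R2: everything else                              -> '-'"""
    if (x > X1 and y < Y2) or (x < X1 and y < Y1):
        return '+'
    return '-'

# Any "novel class" instance is still forced into + or -:
print(classify(6.0, 2.0))   # lands in the + region
print(classify(6.0, 9.0))   # lands in the - region
```

Whatever region a novel-class point falls into, the fixed rules assign it one of the two known labels, which is exactly the failure mode the slide describes.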

Key Challenges: Dynamic Features
Why do new features evolve?
- Infinite data stream: the global feature set is normally unknown, and new features may appear
- Concept drift: as concepts drift, new features may appear
- Concept evolution: a new class normally brings a new set of features

Key Challenges: Dynamic Features
[Figure: the ith chunk (features: runway, ground, ramp) and the (i+1)st chunk (features: runway, climb; runway, clear, ramp) have different feature sets; each passes through feature extraction & selection, feature space conversion, classification & novel class detection, and training of a new model.]
Existing classification models need a complete, fixed feature set that applies to all chunks, but global features are difficult to predict. One solution is to use all English words and generate a vector, but the dimension of that vector would be too high.

Frequent Pattern Mining over Data Streams
- Item Counting
- Lossy Counting
- Extensions

Item Counting

Counting Bits – (1) Problem: given a stream of 0's and 1's, be prepared to answer queries of the form "how many 1's in the last k bits?" where k ≤ N. Obvious solution: store the most recent N bits. When a new bit comes in, discard the (N+1)st bit.

Counting Bits – (2) You can’t get an exact answer without storing the entire window. Real Problem: what if we cannot afford to store N bits? E.g., we’re processing 1 billion streams and N = 1 billion But we’re happy with an approximate answer.
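The obvious exact solution from the previous slide can be sketched in a few lines; the class name is illustrative. It stores all N bits, which is precisely the cost DGIM avoids.

```python
from collections import deque

class ExactWindowCounter:
    """Exact count of 1's in the last N bits: stores the whole window."""
    def __init__(self, n):
        self.window = deque(maxlen=n)   # the (N+1)st-oldest bit drops automatically

    def add(self, bit):
        self.window.append(bit)

    def count_ones(self, k=None):
        """How many 1's in the last k bits (k <= N)? Default: the whole window."""
        if k is None:
            return sum(self.window)
        return sum(list(self.window)[-k:])

c = ExactWindowCounter(8)
for b in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    c.add(b)
print(c.count_ones())    # 6: ones in the last 8 bits
print(c.count_ones(4))   # 3: ones in the last 4 bits
```

With a billion streams of a billion bits each, this per-stream storage is exactly what we cannot afford, motivating the approximate method below.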

DGIM* Method Store O(log² N) bits per stream. Gives an approximate answer, never off by more than 50%. The error factor can be reduced to any fraction > 0, with a more complicated algorithm and proportionally more stored bits. *Datar, Gionis, Indyk, and Motwani

Something That Doesn’t (Quite) Work Summarize exponentially increasing regions of the stream, looking backward. Drop small regions if they begin at the same point as a larger region.

Key Idea Summarize blocks of stream with specific numbers of 1’s. Block sizes (number of 1’s) increase exponentially as we go back in time

Example: Bucketized Stream
1001010110001011010101010101011010101010101110101010111010100010110010
Window of size N. Buckets, oldest to newest: at least 1 of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1.

Timestamps Each bit in the stream has a timestamp, starting from 1, 2, … Record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log N) bits.

Buckets
A bucket in the DGIM method is a record consisting of:
1. The timestamp of its end [O(log N) bits].
2. The number of 1's between its beginning and end [O(log log N) bits].
Constraint on buckets: the number of 1's must be a power of 2. That explains the O(log log N) bits in (2).

Representing a Stream by Buckets
- Either one or two buckets with the same power-of-2 number of 1's
- Buckets do not overlap in timestamps
- Buckets are sorted by size: earlier buckets are not smaller than later buckets
- Buckets disappear when their end-time is > N time units in the past

Updating Buckets – (1) When a new bit comes in, drop the last (oldest) bucket if its end-time is prior to N time units before the current time. If the current bit is 0, no other changes are needed.

Updating Buckets – (2) If the current bit is 1: Create a new bucket of size 1, for just this bit. End timestamp = current time. If there are now three buckets of size 1, combine the oldest two into a bucket of size 2. If there are now three buckets of size 2, combine the oldest two into a bucket of size 4. And so on …

Example
Successive states of the stream as new bits arrive (each line shifts in one new bit; buckets are created and merged as described above):
1001010110001011010101010101011010101010101110101010111010100010110010
0010101100010110101010101010110101010101011101010101110101000101100101
0101100010110101010101010110101010101011101010101110101000101100101101

Querying To estimate the number of 1’s in the most recent N bits: Sum the sizes of all buckets but the last. Add half the size of the last bucket. Remember: we don’t know how many 1’s of the last bucket are still within the window.
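The bucket maintenance and query rules above fit in one small class. This is a simplified sketch: timestamps are kept as absolute integers rather than modulo N, and "half the last bucket" is rounded down; the class and method names are illustrative.

```python
class DGIM:
    """Sketch of DGIM: buckets are [end_timestamp, size] pairs, newest first;
    sizes are powers of 2, with at most two buckets of each size."""
    def __init__(self, n):
        self.n = n          # window size
        self.t = 0          # current timestamp
        self.buckets = []   # newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its end-time leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.n:
            self.buckets.pop()
        if bit == 0:
            return
        self.buckets.insert(0, [self.t, 1])      # new bucket of size 1
        # If three buckets share a size, merge the two oldest of them;
        # the merge may cascade to larger sizes.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                self.buckets[i + 1][1] *= 2      # combined size
                del self.buckets[i + 2]          # absorbed (older) bucket
            else:
                i += 1

    def estimate(self):
        """Sum all buckets but the oldest, plus half the oldest bucket."""
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets[:-1]) + self.buckets[-1][1] // 2

d = DGIM(8)
for _ in range(10):
    d.add(1)
print(d.estimate())   # 8: happens to match the exact count here
```

Feeding a mix of 0's and 1's and comparing against an exact count shows the estimate staying within the 50% bound derived on the next slides.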

Example: Bucketized Stream
1001010110001011010101010101011010101010101110101010111010100010110010
Window of size N. Buckets, oldest to newest: at least 1 of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1.

Error Bound Suppose the last bucket has size 2^k. Then by assuming 2^(k-1) of its 1's are still within the window, we make an error of at most 2^(k-1). Since there is at least one bucket of each of the sizes less than 2^k, the true sum is at least 1 + 2 + … + 2^(k-1) = 2^k - 1. Thus, the error is at most 50%.

Frequent Pattern Mining over Data Streams
- Item Counting
- Lossy Counting
- Extensions

LOSSY COUNTING

Mining Approximate Frequent Patterns
Mining precise frequent patterns in stream data is unrealistic, even if we store them in a compressed form such as an FP-tree. Approximate answers are often sufficient (e.g., for trend/pattern analysis).
Example: a router is interested in all flows whose frequency is at least 1% (σ) of the entire traffic stream seen so far, and considers an error of 1/10 of σ (ε = 0.1%) acceptable.
How to mine frequent patterns with good approximation? The Lossy Counting Algorithm (Manku & Motwani, VLDB'02). Major idea: do not track an item until it could be frequent.
Advantage: guaranteed error bound. Disadvantage: keeps a large set of candidate counts.

Lossy Counting for Frequent Single Items
Divide the stream into buckets (bucket size is 1/ε, e.g. 1000 for ε = 0.1%). [Figure: the stream split into Bucket 1, Bucket 2, Bucket 3, …]

First Bucket of the Stream
[Figure: the first bucket's item counts are added to an initially empty summary.] At the bucket boundary, decrease all counters by 1.

Next Bucket of the Stream
[Figure: the next bucket's item counts are added to the summary.] At the bucket boundary, decrease all counters by 1 again.

Approximation Guarantee
Given: (1) support threshold σ, (2) error threshold ε, and (3) stream length N.
Output: items with frequency counts exceeding (σ - ε)N.
How much do we undercount? If the stream length seen so far is N and the bucket size is 1/ε, then the frequency count error ≤ #buckets = N / bucket-size = N / (1/ε) = εN.
Approximation guarantee:
- No false negatives
- False positives have true frequency count at least (σ - ε)N
- Frequency counts are underestimated by at most εN
Intuition: an item's counter is decreased once per bucket boundary after its insertion, so at most #buckets times in total; items arriving later are decreased even fewer times. No false negatives: if an item is truly frequent, its maintained count must still exceed the output threshold.
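The single-item procedure from the last three slides, including the decrement-at-boundary step, can be sketched as follows. This follows the simplified slide version (decrease every counter by 1 at each boundary and drop zeroed counters), not the delta-based bookkeeping of the original paper; the function name is illustrative.

```python
import math

def lossy_count(stream, epsilon):
    """Lossy Counting for single items, per the slides:
    bucket width 1/epsilon; at each bucket boundary every counter
    is decremented by 1 and zeroed counters are dropped."""
    width = math.ceil(1 / epsilon)
    counts = {}
    n = 0
    for item in stream:
        n += 1
        counts[item] = counts.get(item, 0) + 1
        if n % width == 0:                 # bucket boundary
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts, n

counts, n = lossy_count(['a', 'a', 'b', 'a', 'a', 'c', 'a', 'a'], 0.25)
print(counts, n)   # {'a': 4} 8 -- true count of 'a' is 6, undercounted by 2 <= eps*N
```

Items with a maintained count of at least (σ - ε)N are then reported; since each counter loses at most one per boundary, the undercount is bounded by εN.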

Lossy Counting for Frequent Itemsets
Divide the stream into buckets as for frequent items, but fill as many buckets as possible into main memory at one time. [Figure: Bucket 1, Bucket 2, Bucket 3 loaded together.] If we put 3 buckets of data into main memory at one time, then decrease each frequency count by 3.

Update of Summary Data Structure
[Figure: with 3 buckets of data in memory, the itemset counts found in the buffer are added to the summary counts and then decreased by 3.] An itemset whose count drops to zero is deleted. That is why we choose a large number of buckets: more itemsets get deleted at each update.

Pruning Itemsets – the Apriori Rule
[Figure: 3 buckets of data in memory; candidate itemsets are checked against the summary.] If an itemset is found to be infrequent, we need not consider any of its supersets.
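The itemset extension described on the last three slides can be sketched as below. This is a deliberately simplified illustration: it enumerates itemsets up to `max_size` by brute force, whereas a real implementation mines the in-memory buffer with a trie and Apriori pruning; all names and the default parameters are illustrative.

```python
import math
from itertools import combinations

def lossy_count_itemsets(transactions, epsilon, beta=3, max_size=2):
    """Lossy Counting extended to itemsets, per the slides: read beta
    buckets of transactions into memory at a time, count the itemsets
    occurring there, then decrement every summary counter by beta and
    drop non-positive entries."""
    width = math.ceil(1 / epsilon)        # transactions per bucket
    chunk = width * beta                  # beta buckets in memory at once
    summary = {}
    for start in range(0, len(transactions), chunk):
        for txn in transactions[start:start + chunk]:
            items = sorted(set(txn))
            for k in range(1, max_size + 1):
                for itemset in combinations(items, k):
                    summary[itemset] = summary.get(itemset, 0) + 1
        for key in list(summary):         # boundary: decrement by beta
            summary[key] -= beta
            if summary[key] <= 0:
                del summary[key]
    return summary

txns = [['a','b'], ['a'], ['a','b'], ['a'], ['b'], ['a'], ['a','b'], ['a']]
print(lossy_count_itemsets(txns, 0.5, beta=2))   # only ('a',) survives
```

Batching beta buckets means each counter is decremented by beta at once, which is what lets infrequent itemsets (and, by the Apriori rule, their supersets) be pruned aggressively.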

Summary of Lossy Counting
Strengths: a simple idea that can be extended to frequent itemsets.
Weaknesses: the space bound is not good; for frequent itemsets, each record is scanned many times; and the output is based on all previous data, while sometimes we are interested only in recent data.
A space-saving method for stream frequent item mining: Metwally, Agrawal, and El Abbadi, ICDT'05.

Extensions

Extensions
Lossy Counting Algorithm (Manku & Motwani, VLDB'02): mines approximate frequent patterns; keeps only currently frequent patterns, so no changes can be detected.
FP-Stream (C. Giannella, J. Han, X. Yan, P. S. Yu, 2003): uses a tilted-time window frame; mines the evolution and dramatic changes of frequent patterns.
Moment (Y. Chi, ICDM '04): very similar to an FP-tree, except that it keeps a dynamic set of items; maintains closed frequent itemsets over a stream sliding window.

Lossy Counting versus FP-Stream
Lossy Counting (Manku & Motwani, VLDB'02): keeps only currently frequent patterns, so no changes can be detected.
FP-Stream (Giannella, Han, Yan, Yu, 2003): mines the evolution and dramatic changes of frequent patterns; uses a tilted-time window frame and a compressed form to store significant (approximate) frequent patterns and their time-dependent traces.

Summary of FP-Stream
Mines frequent itemsets at multiple time granularities. Based on FP-Growth; maintains a pattern tree with tilted-time windows.
Advantages: can answer time-sensitive queries; keeps finer-grained information about recent data.
Drawback: time and memory complexity.
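The tilted-time window idea can be illustrated with a small logarithmic structure: recent history at unit granularity, older history merged into exponentially coarser counts. This is a generic sketch of the concept, not FP-Stream's exact scheme; the class name and the two-slots-per-level rule are illustrative choices.

```python
class TiltedTimeWindow:
    """Logarithmic tilted-time window sketch: levels[i] holds at most
    two counts, each spanning 2**i time units; when a level overflows,
    its two oldest counts merge into one count at the next level."""
    def __init__(self):
        self.levels = []

    def add(self, count):
        carry, i = count, 0
        while carry is not None:
            if len(self.levels) <= i:
                self.levels.append([])
            self.levels[i].insert(0, carry)       # newest first
            if len(self.levels[i]) > 2:
                # merge the two oldest counts upward (sums are preserved)
                carry = self.levels[i].pop() + self.levels[i].pop()
                i += 1
            else:
                carry = None

    def total(self):
        return sum(sum(level) for level in self.levels)

t = TiltedTimeWindow()
for _ in range(8):
    t.add(1)          # one unit-time count per tick
print(t.levels)       # [[1, 1], [2], [4]]: fine recent, coarse old
```

With n ticks, only O(log n) counts are stored per pattern, yet any query over recent time ranges can be answered at the granularity of the matching level, which is how time-sensitive queries stay cheap.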

Moment
Two naive approaches:
- Regenerate frequent itemsets from the entire window whenever a transaction enters or leaves it. Drawback: mining each window from scratch is too expensive, since subsequent windows have many frequent patterns in common.
- Store every itemset, frequent or not, in a traditional data structure such as a prefix tree, and update its support whenever a transaction enters or leaves the window. Drawback: updating every pattern on each new tuple is also too expensive.

Summary of Moment
Computes closed frequent itemsets in a sliding window using a Closed Enumeration Tree with four types of nodes: closed nodes, intermediate nodes, unpromising gateway nodes, and infrequent gateway nodes.
When adding transactions, closed itemsets remain closed; when removing transactions, infrequent itemsets remain infrequent.

References
[Agrawal'94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, 1994.
[Cheung'03] W. Cheung and O. R. Zaiane. Incremental mining of frequent patterns without candidate generation or support constraint. In IDEAS, 2003.
[Chi'04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In ICDM, November 2004.
[Evfimievski'03] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, 2003.
[Han'00] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, 2000.
[Koh'04] J. Koh and S. Shieh. An efficient approach for maintaining association rules based on adjusting FP-tree structures. In DASFAA, 2004.
[Leung'05] C.-S. Leung, Q. Khan, and T. Hoque. CanTree: A tree structure for efficient incremental mining of frequent patterns. In ICDM, 2005.
[Toivonen'96] H. Toivonen. Sampling large databases for association rules. In VLDB, pages 134–145, 1996.
[Mozafari'08] B. Mozafari, H. Thakkar, and C. Zaniolo. Verifying and mining frequent patterns from large windows over data streams. In ICDE, pages 179–188, 2008.
[Thakkar'09] H. Thakkar, B. Mozafari, and C. Zaniolo. Continuous post-mining of association rules in a data stream management system. Chapter VII in Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, Y. Zhao, C. Zhang, and L. Cao (eds.), ISBN 978-1-60566-404-0.