COMP5331 Data Stream Prepared by Raymond Wong Presented by Raymond Wong

Similar presentations
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.

Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.
Evaluating Classifiers
Header, Specification, Body Input Parameter List Output Parameter List
CS 410/510 Data Streams, Lecture 16 (3/13/2012): Data-Stream Sampling: Basic Techniques and Results. Kristin Tufte, David Maier.
Lecture 11 oct 6 Goals: hashing hash functions chaining closed hashing application of hashing.
Adaptive Frequency Counting over Bursty Data Streams Bill Lin, Wai-Shing Ho, Ben Kao and Chun-Kit Chui Form CIDM07.
Mining Data Streams.
Resource-oriented Approximation for Frequent Itemset Mining from Bursty Data Streams SIGMOD’14 Toshitaka Yamamoto, Koji Iwanuma, Shoshi Fukuda.
Mining Frequent Patterns in Data Streams at Multiple Time Granularities CS525 Paper Presentation Presented by: Pei Zhang, Jiahua Liu, Pengfei Geng and.
Data Mining Association Analysis: Basic Concepts and Algorithms
Adaptive Load Shedding for Mining Frequent Patterns from Data Streams Xuan Hong Dang, Wee-Keong Ng, and Kok-Leong Ong (DaWaK 2006) 2008/3/191Yi-Chun Chen.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Association Analysis (2). Example: TID, list of item IDs: T1: I1, I2, I5; T2: I2, I4; T3: I2, I3; T4: I1, I2, I4; T5: I1, I3; T6: I2, I3; T7: I1, I3; T8: I1, I2, I3, I5; T9: I1, I2,
Heavy hitter computation over data stream
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku (Stanford) Rajeev Motwani (Stanford) Presented by Michal Spivak November, 2003.
Mining Frequent Patterns Without Candidate Generation Apriori-like algorithm suffers from long patterns or quite low minimum support thresholds. Two.
Tutorial 6 & 7 Symbol Table
Maintenance of Discovered Association Rules S.D.LeeDavid W.Cheung Presentation : Pablo Gazmuri.
What's Hot and What's Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishnan Rutgers University ACM Principles of Database Systems.
SAMPLING DISTRIBUTIONS. SAMPLING VARIABILITY
A survey on stream data mining
Continuous Data Stream Processing MAKE Lab Date: 2006/03/07 Post-Excellence Project Subproject 6.
Lecture 11 oct 7 Goals: hashing hash functions chaining closed hashing application of hashing.
Mining Association Rules
An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets.
Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman.
NGDM'02 Efficient Data-Reduction Methods for On-line Association Rule Mining H. Bronnimann, B. Chen, M. Dash, Y. Qiao, P. Scheuermann, P. Haas Polytechnic.
Mining frequency counts from sensor set data Loo Kin Kong 25 th June 2003.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Stanford University VLDB 2002.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007 Date: 5 June 2008 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling.
False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying.
EXAM REVIEW MIS2502 Data Analytics. Exam What Tool to Use? Evaluating Decision Trees Association Rules Clustering.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
August 21, 2002, VLDB. Gurmeet Singh Manku, Stanford University, USA. Frequency Counts over Data Streams.
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Research issues on association rule mining Loo Kin Kong 26 th February, 2003.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang
Data Structures & Algorithms
COMP53311 Knowledge Discovery in Databases Overview Prepared by Raymond Wong Presented by Raymond Wong
Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery.
The Misra Gries Algorithm. Motivation Espionage The rest we monitor.
Security in Outsourced Association Rule Mining. Agenda  Introduction  Approximate randomized technique  Encryption  Summary and future work.
M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining ARM: Improvements March 10, 2009 Slide.
Other Clustering Techniques
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Online Mining (Recently) Maximal Frequent Itemsets over Data Streams Hua-Fu Li, Suh-Yin Lee, Man Kwan Shan RIDE-SDMA'05 Speaker: 董原賓 Advisor: 柯佳伶
COMP53311 Association Rule Mining Prepared by Raymond Wong Presented by Raymond Wong
Efficient Data Reduction Methods for Online Association Rule Discovery - NGDM'02 Herve Bronnimann, Bin Chen, Manoranjan Dash, Peter Haas, Yi Qiao, Peter.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Mining Data Streams (Part 1)
CFI-Stream: Mining Closed Frequent Itemsets in Data Streams
Frequency Counts over Data Streams
The Stream Model Sliding Windows Counting 1’s
False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams Jeffrey Xu Yu, Zhihong Chong (崇志宏), Hongjun Lu.
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
COMP5331 FP-Tree Prepared by Raymond Wong Presented by Raymond Wong
Approximate Frequency Counts over Data Streams
CSCI B609: “Foundations of Data Science”
Maintaining Frequent Itemsets over High-Speed Data Streams
Approximation and Load Shedding Sampling Methods
Presentation transcript:

COMP5331 Data Stream
Prepared by Raymond Wong
Presented by Raymond Wong

COMP5331 Data Mining over Static Data
1. Association
2. Clustering
3. Classification
Static Data -> Output (Data Mining Results)

COMP5331 Data Mining over Data Streams
1. Association
2. Clustering
3. Classification
… Unbounded Data -> Output (Data Mining Results); real-time processing

COMP5331 Data Streams
1 2 … (less recent -> more recent); each point: a transaction

COMP5331 Data Streams

             Traditional Data Mining        Data Stream Mining
Data Type    Static data of limited size    Dynamic data of unlimited size (which arrives at high speed)
Memory       Limited                        Limited -> more challenging
Efficiency   Time-consuming                 Efficient
Output       Exact answer                   Approximate (or exact) answer

COMP5331 Entire Data Streams
1 2 … (less recent -> more recent); each point: a transaction
Obtain the data mining results from all data points read so far

COMP5331 Data Streams with Sliding Window
1 2 … (less recent -> more recent); each point: a transaction
Obtain the data mining results over a sliding window

COMP5331 Data Streams
- Entire Data Streams
- Data Streams with Sliding Window

COMP5331 Entire Data Streams
- Association -> frequent pattern/item
- Clustering
- Classification

COMP5331 Frequent Item over Data Streams
Let N be the length of the data stream.
Let s be the support threshold (in fraction, e.g., 20%).
Problem: we want to find all items with frequency >= sN.
1 2 … (less recent -> more recent); each point: a transaction
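To make the problem concrete, here is a minimal exact baseline (a sketch of my own, not from the slides): count every item and report those with frequency >= sN. Its memory grows with the number of distinct items, which is exactly what the streaming algorithms below are designed to avoid.

```python
from collections import Counter

def exact_frequent_items(stream, s):
    """Exact baseline: count every distinct item and report those whose
    frequency is at least s*N. Memory grows with the number of distinct
    items, which is infeasible for an unbounded stream."""
    counts = Counter()
    n = 0
    for item in stream:
        counts[item] += 1
        n += 1
    return {e for e, f in counts.items() if f >= s * n}

print(exact_frequent_items(["I1", "I2", "I1", "I3", "I1"], s=0.4))  # {'I1'}
```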

COMP5331 Data Streams

             Traditional Data Mining        Data Stream Mining
Data Type    Static data of limited size    Dynamic data of unlimited size (which arrives at high speed)
Memory       Limited                        Limited -> more challenging
Efficiency   Time-consuming                 Efficient
Output       Exact answer                   Approximate (or exact) answer

COMP5331 Data Streams
Static Data -> Output (Data Mining Results): frequent item: I1; infrequent items: I2, I3
… Unbounded Data -> Output (Data Mining Results): frequent items: I1, I3; infrequent item: I2

COMP5331 False Positive/Negative
E.g.
Expected Output: frequent item: I1; infrequent items: I2, I3
Algorithm Output: frequent items: I1, I3; infrequent item: I2
False positive: an item that is classified as a frequent item but is, in fact, infrequent.
Which item is one of the false positives? I3. More? No. So the no. of false positives = 1.
If we say "the algorithm has no false positives", then all truly infrequent items are classified as infrequent items in the algorithm output.

COMP5331 False Positive/Negative
E.g.
Expected Output: frequent items: I1, I3; infrequent item: I2
Algorithm Output: frequent item: I1; infrequent items: I2, I3
False negative: an item that is classified as an infrequent item but is, in fact, frequent.
Which item is one of the false negatives? I3. More? No. So the no. of false negatives = 1, and the no. of false positives = 0.
If we say "the algorithm has no false negatives", then all truly frequent items are classified as frequent items in the algorithm output.
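The false-positive and false-negative counts in the last two slides are plain set differences between the expected and reported frequent-item sets. A small illustrative sketch (function names are mine):

```python
def false_positives(expected_frequent, reported_frequent):
    # reported as frequent, but actually infrequent
    return reported_frequent - expected_frequent

def false_negatives(expected_frequent, reported_frequent):
    # reported as infrequent, but actually frequent
    return expected_frequent - reported_frequent

# False-positive slide: I3 is reported frequent but is truly infrequent
print(false_positives({"I1"}, {"I1", "I3"}))  # {'I3'}
# False-negative slide: I3 is truly frequent but is reported infrequent
print(false_negatives({"I1", "I3"}, {"I1"}))  # {'I3'}
```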

COMP5331 Data Streams

             Traditional Data Mining        Data Stream Mining
Data Type    Static data of limited size    Dynamic data of unlimited size (which arrives at high speed)
Memory       Limited                        Limited -> more challenging
Efficiency   Time-consuming                 Efficient
Output       Exact answer                   Approximate (or exact) answer

We need to introduce an input error parameter ε.

COMP5331 Data Streams
Static Data -> Output (Data Mining Results): frequent item: I1; infrequent items: I2, I3
… Unbounded Data -> Output (Data Mining Results): frequent items: I1, I3; infrequent item: I2

COMP5331 Data Streams
Static Data: store the statistics of all items: I1: 10, I2: 8, I3: 12
… Unbounded Data: estimate the statistics of all items: I1: 10, I2: 4, I3: 10

Item   True Frequency   Estimated Frequency   Diff. D
I1     10               10                    0
I2     8                4                     4
I3     12               10                    2

N: total no. of occurrences of items; here N = 20, ε = 0.2, so εN = 4
Is D <= εN for every item? Yes

COMP5331 ε-deficient synopsis
Let N be the current length of the stream (i.e., the total no. of occurrences of items).
Let ε be an input parameter (a real number from 0 to 1).
An algorithm maintains an ε-deficient synopsis if its output satisfies the following properties:
Condition 1: There is no false negative, i.e., all true frequent items are classified as frequent items in the algorithm output.
Condition 2: The difference between the estimated frequency and the true frequency is at most εN.
Condition 3: All items whose true frequencies are less than (s-ε)N are classified as infrequent items in the algorithm output.
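The three conditions translate directly into executable checks. A minimal sketch, assuming dictionaries mapping each item to its true and estimated frequencies and a set of items the algorithm reported as frequent; all names are illustrative, and the s = 0.5 in the demo is an assumed value (the earlier slide only fixes ε and N):

```python
def is_epsilon_deficient(true_freq, est_freq, reported_frequent, s, eps, n):
    """Check the three conditions of an epsilon-deficient synopsis."""
    # Condition 1: no false negatives -- every truly frequent item is reported
    cond1 = all(e in reported_frequent
                for e, f in true_freq.items() if f >= s * n)
    # Condition 2: |estimated frequency - true frequency| <= eps * N
    cond2 = all(abs(est_freq.get(e, 0) - f) <= eps * n
                for e, f in true_freq.items())
    # Condition 3: items with true frequency below (s - eps) * N are not reported
    cond3 = all(e not in reported_frequent
                for e, f in true_freq.items() if f < (s - eps) * n)
    return cond1 and cond2 and cond3

# Numbers from the previous slide: N = 20, eps = 0.2; assume s = 0.5
print(is_epsilon_deficient({"I1": 10, "I2": 8, "I3": 12},
                           {"I1": 10, "I2": 4, "I3": 10},
                           {"I1", "I3"}, s=0.5, eps=0.2, n=20))  # True
```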

COMP5331 Frequent Pattern Mining over Entire Data Streams
Algorithms:
- Sticky Sampling Algorithm
- Lossy Counting Algorithm
- Space-Saving Algorithm

COMP5331 Sticky Sampling Algorithm
Inputs: s (support threshold), ε (error parameter), δ (confidence parameter)
… Unbounded Data -> Sticky Sampling -> statistics of items (stored in the memory) -> Output: frequent items / infrequent items

COMP5331 Sticky Sampling Algorithm
The sampling rate r varies over the lifetime of a stream.
δ: confidence parameter (a small real number).
Let t = ⌈(1/ε) ln(1/(sδ))⌉.

Data No.      r (sampling rate)
1 ~ 2t        1
2t+1 ~ 4t     2
4t+1 ~ 8t     4
…             …

COMP5331 Sticky Sampling Algorithm
E.g. s = 0.02, ε = 0.01, δ = 0.1, so t = ⌈100 ln 500⌉ = 622

Data No.        r (sampling rate)
1 ~ 1244        1     (1 ~ 2t)
1245 ~ 2488     2     (2t+1 ~ 4t)
2489 ~ 4976     4     (4t+1 ~ 8t)

COMP5331 Sticky Sampling Algorithm
E.g. s = 0.5, ε = 0.35, δ = 0.5, so t = ⌈(1/0.35) ln 4⌉ = 4

Data No.        r (sampling rate)
1 ~ 8           1     (1 ~ 2t)
9 ~ 16          2     (2t+1 ~ 4t)
17 ~ 32         4     (4t+1 ~ 8t)
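Both worked examples can be reproduced in a few lines; this sketch (function names are mine) computes t and the sampling rate in effect for the i-th stream element under the schedule above:

```python
import math

def sticky_t(s, eps, delta):
    """t = ceil((1/eps) * ln(1/(s*delta))), as defined on the slides."""
    return math.ceil((1.0 / eps) * math.log(1.0 / (s * delta)))

def sampling_rate(i, t):
    """Sampling rate for the i-th element (1-indexed): 1 for 1..2t,
    2 for 2t+1..4t, 4 for 4t+1..8t, and so on."""
    r = 1
    while i > 2 * t * r:
        r *= 2
    return r

print(sticky_t(0.02, 0.01, 0.1))  # 622, matching the first example
print(sticky_t(0.5, 0.35, 0.5))   # 4, matching the second example
print(sampling_rate(1244, 622))   # 1 (still within 1 ~ 2t)
print(sampling_rate(1245, 622))   # 2 (within 2t+1 ~ 4t)
print(sampling_rate(2489, 622))   # 4 (within 4t+1 ~ 8t)
```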

COMP5331 Sticky Sampling Algorithm
1. S: empty list; it will contain entries of the form (e, f), where f is the estimated frequency of element e.
2. When data e arrives:
   - if e exists in S, increment f in (e, f)
   - if e does not exist in S, add entry (e, 1) with probability 1/r (where r is the current sampling rate)
3. Just after r changes: for each entry (e, f), repeatedly toss a coin with P(head) = 1/r until the outcome is a head; for each tail, decrement f in (e, f); if f reaches 0, delete the entry (e, f).
4. [Output] Get a list of items where f + εN >= sN.
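A runnable sketch of the four steps, assuming the doubling rate schedule from the previous slides; class and variable names are mine, and since the algorithm is randomized, the retained entries vary from run to run:

```python
import math
import random

class StickySampling:
    """Sketch of Sticky Sampling: entries map element e to its estimated
    frequency f; new elements are admitted with probability 1/r, where
    r is the current sampling rate."""

    def __init__(self, s, eps, delta):
        self.s, self.eps = s, eps
        self.t = math.ceil((1 / eps) * math.log(1 / (s * delta)))
        self.S = {}                 # e -> f
        self.r = 1                  # current sampling rate
        self.n = 0                  # stream length so far
        self.limit = 2 * self.t     # point at which r next doubles

    def process(self, e):
        self.n += 1
        if self.n > self.limit:     # r doubles: 1~2t, 2t+1~4t, 4t+1~8t, ...
            self.r *= 2
            self.limit = 2 * self.t * self.r
            self._adjust()          # step 3: run just after r changes
        if e in self.S:
            self.S[e] += 1          # step 2: existing entry
        elif random.random() < 1.0 / self.r:
            self.S[e] = 1           # step 2: admit with probability 1/r

    def _adjust(self):
        # For each entry, toss a coin with P(head) = 1/r until a head;
        # decrement f once per tail, deleting the entry if f reaches 0.
        for e in list(self.S):
            while random.random() >= 1.0 / self.r:
                self.S[e] -= 1
                if self.S[e] == 0:
                    del self.S[e]
                    break

    def frequent_items(self):
        # step 4: report e if f + eps*N >= s*N
        return {e for e, f in self.S.items()
                if f + self.eps * self.n >= self.s * self.n}
```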

COMP5331 Analysis
ε-deficient synopsis: Sticky Sampling computes an ε-deficient synopsis with probability at least 1-δ.
Memory consumption: Sticky Sampling occupies at most ⌈(2/ε) ln(1/(sδ))⌉ entries on average.

COMP5331 Frequent Pattern Mining over Entire Data Streams
Algorithms:
- Sticky Sampling Algorithm
- Lossy Counting Algorithm
- Space-Saving Algorithm

COMP5331 Lossy Counting Algorithm
Inputs: s (support threshold), ε (error parameter)
… Unbounded Data -> Lossy Counting -> statistics of items (stored in the memory) -> Output: frequent items / infrequent items

COMP5331 Lossy Counting Algorithm
1 2 … (less recent -> more recent); each point: a transaction
The stream is divided into buckets: Bucket 1, Bucket 2, Bucket 3, …, Bucket b_current
N: current length of stream
Bucket width w = ⌈1/ε⌉, and b_current = ⌈N/w⌉

COMP5331 Lossy Counting Algorithm
1. D: empty set; it will contain entries of the form (e, f, Δ), where f is the frequency of element e since this entry was inserted into D, and Δ is the maximum possible error in f.
2. When data e arrives:
   - if e exists in D, increment f in (e, f, Δ)
   - if e does not exist in D, add entry (e, 1, b_current - 1)
3. Remove some entries in D whenever N ≡ 0 (mod w), i.e., whenever the stream reaches a bucket boundary. The rule of deletion: (e, f, Δ) is deleted if f + Δ <= b_current.
4. [Output] Get a list of items where f + εN >= sN.
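A runnable sketch of the four steps (class and variable names are mine), using w = ⌈1/ε⌉ and b_current = ⌈N/w⌉ from the previous slide:

```python
import math

class LossyCounting:
    """Sketch of Lossy Counting: entries map element e to [f, delta],
    where f counts occurrences since insertion and delta bounds the
    occurrences that may have been missed before insertion."""

    def __init__(self, s, eps):
        self.s, self.eps = s, eps
        self.w = math.ceil(1 / eps)   # bucket width
        self.D = {}                   # e -> [f, delta]
        self.n = 0                    # stream length so far

    def process(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if e in self.D:
            self.D[e][0] += 1                    # step 2: existing entry
        else:
            self.D[e] = [1, b_current - 1]       # step 2: new entry
        if self.n % self.w == 0:                 # step 3: bucket boundary
            for x in list(self.D):
                f, delta = self.D[x]
                if f + delta <= b_current:       # deletion rule
                    del self.D[x]

    def frequent_items(self):
        # step 4: report e if f + eps*N >= s*N
        return {e for e, (f, delta) in self.D.items()
                if f + self.eps * self.n >= self.s * self.n}
```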

COMP5331 Lossy Counting Algorithm
ε-deficient synopsis: Lossy Counting computes an ε-deficient synopsis.
Memory consumption: Lossy Counting occupies at most ⌈(1/ε) log(εN)⌉ entries.

COMP5331 Comparison

                  ε-deficient synopsis   Memory Consumption
Sticky Sampling   1-δ confidence         ⌈(2/ε) ln(1/(sδ))⌉
Lossy Counting    100% confidence        ⌈(1/ε) log(εN)⌉

E.g. s = 0.02, ε = 0.01, δ = 0.1, N = 1000:
Sticky Sampling memory = 1243; Lossy Counting memory = 231

COMP5331 Comparison

                  ε-deficient synopsis   Memory Consumption
Sticky Sampling   1-δ confidence         ⌈(2/ε) ln(1/(sδ))⌉
Lossy Counting    100% confidence        ⌈(1/ε) log(εN)⌉

E.g. s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000:
Sticky Sampling memory = 1243; Lossy Counting memory = 922

COMP5331 Comparison

                  ε-deficient synopsis   Memory Consumption
Sticky Sampling   1-δ confidence         ⌈(2/ε) ln(1/(sδ))⌉
Lossy Counting    100% confidence        ⌈(1/ε) log(εN)⌉

E.g. s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000,000:
Sticky Sampling memory = 1243; Lossy Counting memory = 1612
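The memory figures on these three comparison slides can be reproduced directly; the quoted numbers come out only if the log in the Lossy Counting bound is taken as the natural logarithm. A small sketch (function names are mine):

```python
import math

def sticky_memory(s, eps, delta):
    # at most ceil((2/eps) * ln(1/(s*delta))) entries on average
    return math.ceil((2 / eps) * math.log(1 / (s * delta)))

def lossy_memory(eps, n):
    # at most ceil((1/eps) * ln(eps*N)) entries
    return math.ceil((1 / eps) * math.log(eps * n))

s, eps, delta = 0.02, 0.01, 0.1
print(sticky_memory(s, eps, delta))       # 1243 (independent of N)
print(lossy_memory(eps, 1000))            # 231
print(lossy_memory(eps, 1000000))         # 922
print(lossy_memory(eps, 1000000000))      # 1612
```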

COMP5331 Frequent Pattern Mining over Entire Data Streams
Algorithms:
- Sticky Sampling Algorithm
- Lossy Counting Algorithm
- Space-Saving Algorithm

COMP5331 Sticky Sampling Algorithm
Inputs: s (support threshold), ε (error parameter), δ (confidence parameter)
… Unbounded Data -> Sticky Sampling -> statistics of items (stored in the memory) -> Output: frequent items / infrequent items

COMP5331 Lossy Counting Algorithm
Inputs: s (support threshold), ε (error parameter)
… Unbounded Data -> Lossy Counting -> statistics of items (stored in the memory) -> Output: frequent items / infrequent items

COMP5331 Space-Saving Algorithm
Inputs: s (support threshold), M (memory parameter)
… Unbounded Data -> Space-Saving -> statistics of items (stored in the memory) -> Output: frequent items / infrequent items

COMP5331 Space-Saving
M: the maximum number of entries that can be stored in the memory

COMP5331 Space-Saving
1. D: empty set; it will contain entries of the form (e, f, Δ), where f is the frequency of element e since this entry was inserted into D, and Δ is the maximum possible error in f.
2. p_e = 0
3. When data e arrives:
   - if e exists in D, increment f in (e, f, Δ)
   - if e does not exist in D:
     - if the size of D = M, set p_e = min over all entries in D of (f + Δ), and remove all entries e where f + Δ <= p_e
     - add entry (e, 1, p_e)
4. [Output] Get a list of items where f + Δ >= sN.
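A runnable sketch of the steps exactly as stated on this slide; note that this variant evicts every entry attaining the minimum f + Δ when the memory is full, whereas the classic Space-Saving formulation overwrites a single minimum-count entry. Class and variable names are mine:

```python
class SpaceSaving:
    """Sketch of the Space-Saving variant above: at most M entries of
    the form e -> [f, delta]; when full, evict the entries with the
    smallest f + delta and reuse that value as the new entry's error."""

    def __init__(self, s, M):
        self.s, self.M = s, M
        self.D = {}      # e -> [f, delta]
        self.n = 0       # stream length so far
        self.p = 0       # p_e on the slide

    def process(self, e):
        self.n += 1
        if e in self.D:
            self.D[e][0] += 1                    # step 3: existing entry
        else:
            if len(self.D) == self.M:            # memory full
                self.p = min(f + d for f, d in self.D.values())
                for x in list(self.D):           # evict minimal entries
                    f, d = self.D[x]
                    if f + d <= self.p:
                        del self.D[x]
            self.D[e] = [1, self.p]              # step 3: new entry

    def frequent_items(self):
        # step 4: report e if f + delta >= s*N
        return {e for e, (f, d) in self.D.items()
                if f + d >= self.s * self.n}
```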

COMP5331 Space-Saving
Greatest error: let E be the greatest error in any estimated frequency, expressed as a fraction of N. Then E <= 1/M.
ε-deficient synopsis: Space-Saving computes an ε-deficient synopsis if E <= ε.

COMP5331 Comparison

                  ε-deficient synopsis            Memory Consumption
Sticky Sampling   1-δ confidence                  ⌈(2/ε) ln(1/(sδ))⌉
Lossy Counting    100% confidence                 ⌈(1/ε) log(εN)⌉
Space-Saving      100% confidence, where E <= ε   M

E.g. s = 0.02, ε = 0.01, δ = 0.1, N = 1,000,000,000:
Sticky Sampling memory = 1243; Lossy Counting memory = 1612
Space-Saving: the memory M can be very large (e.g., 4,000,000); since E <= 1/M, the error is then very small.

COMP5331 Data Streams
- Entire Data Streams
- Data Streams with Sliding Window

COMP5331 Data Streams with Sliding Window
- Association -> frequent pattern/itemset
- Clustering
- Classification

COMP5331 Sliding Window
Mining frequent itemsets in a sliding window.
E.g. t1: I1, I2; t2: I1, I3, I4; …
Goal: find the frequent itemsets among the transactions t1, t2, … that fall inside the sliding window.

COMP5331 Sliding Window
The stream is divided into batches B1, B2, B3, B4, …
Sliding window = the last 4 batches; storage keeps the statistics of each batch in the window.

COMP5331 Sliding Window
When a new batch B5 arrives, the window slides forward: the whole oldest batch (B1) is removed from storage and B5 is added, so the window again covers the last 4 batches.
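A runnable sketch of this batch-based window maintenance (class and variable names are mine; per-item counts stand in for the itemset statistics the slides discuss):

```python
from collections import Counter, deque

class BatchSlidingWindow:
    """Keep per-batch statistics for the last k batches; when a new
    batch arrives and the window is full, the whole oldest batch is
    removed, exactly as B1 is dropped when B5 arrives with k = 4."""

    def __init__(self, k):
        self.k = k
        self.batches = deque()    # (item counts, #transactions) per batch
        self.counts = Counter()   # item counts over the whole window
        self.n_txn = 0            # transactions currently in the window

    def add_batch(self, transactions):
        batch = Counter()
        for txn in transactions:          # txn: a set of items, e.g. {"I1", "I2"}
            batch.update(txn)
        self.batches.append((batch, len(transactions)))
        self.counts.update(batch)
        self.n_txn += len(transactions)
        if len(self.batches) > self.k:    # window slides forward
            old_counts, old_n = self.batches.popleft()
            self.counts.subtract(old_counts)   # remove the whole batch
            self.n_txn -= old_n

    def frequent_items(self, s):
        # items contained in at least an s fraction of window transactions
        return {e for e, f in self.counts.items()
                if f > 0 and f >= s * self.n_txn}
```

Keeping one Counter per batch is what makes the whole-batch removal cheap: expiring B1 is a single subtraction rather than a rescan of the remaining window.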