Continuous Data Stream Processing MAKE Lab Date: 2006/03/07 Post-Excellence Project Subproject 6.



Slide 2: Music Virtual Channel
[Architecture diagram. Labeled components: music collections, Internet, music metadata, MusicXML database, profile database, clustering engine, filtering engine, XML filtering engine, music channel simulator, V.C. player, interface, profile monitor, channel monitor, cluster monitor, cluster coordinator, favorite channel, peer search engine; elements numbered 1, 2, …, N.]

Slide 3: Research Directions
Streaming Data Management
- Areas: Mining, Filtering, Temporal Query Processing, Spatial Query Processing, Aggregate Query Processing
- Topics: Frequent Tree Pattern Mining, Frequent Itemset Mining (sliding window), Sequence Query Matching, Episode Query Matching, Range Search, KNN Search, Top-K Search, Closed Tree Pattern Mining, Frequent Itemset Mining (landmark model)

Hash-based synopsis with memory consideration for mining frequent itemsets over data streams

Slide 5: Landmark model

Slide 6: Lossy Counting
Step 1: Divide the stream into "buckets" (bucket 1, bucket 2, bucket 3, ...).
bucket size = 1/ε, where ε = 10% of the support s

Slide 7: Lossy Counting in Action
Start with an empty set of counters. At each bucket boundary, decrement all counters by 1.

Slide 8: Lossy Counting continued...
Add the next bucket to the counters. At each bucket boundary, decrement all counters by 1.
Output: elements with counter values exceeding sN - εN
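The three Lossy Counting slides can be sketched in a few lines of Python. This is a minimal sketch of the simplified variant shown on the slides (one counter per item, all counters decremented at each bucket boundary), not a definitive implementation. The function name `lossy_count` and the example stream are assumptions made for illustration, and the output test is strict because the slide says "exceeding".

```python
import math
from collections import Counter


def lossy_count(stream, s, eps):
    """Simplified Lossy Counting as described on the slides.

    s is the support threshold and eps the error bound (both fractions).
    Returns the items whose counter exceeds s*N - eps*N.
    """
    bucket_size = math.ceil(1 / eps)  # bucket size = 1/eps
    counters = Counter()
    n = 0  # number of stream elements seen so far (N on the slides)
    for item in stream:
        n += 1
        counters[item] += 1
        if n % bucket_size == 0:  # bucket boundary
            for key in list(counters):
                counters[key] -= 1  # decrement all counters by 1
                if counters[key] == 0:
                    del counters[key]  # drop counters that reach zero
    threshold = s * n - eps * n
    return {item: c for item, c in counters.items() if c > threshold}


# Hypothetical stream: ten buckets, each with 6 x "a", 3 x "b", 1 unique item.
stream = []
for i in range(10):
    stream += ["a"] * 6 + ["b"] * 3 + [f"x{i}"]

print(lossy_count(stream, s=0.5, eps=0.1))
```

Each rare item is created and deleted within its own bucket, so memory stays small as long as the stream is dominated by a few heavy items.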

Slide 9: Drawbacks of Lossy Counting
[Figure: frequency axis from 0 to 1 with ε and s marked.]
- Lossy Counting keeps every item whose frequency exceeds the threshold marked in the figure.
- Applied to mining frequent itemsets, the space requirement may grow exponentially.

Slide 10: hCount
Stream: ⟨……, 9, ……⟩
[Figure: a hash table of width m; the arriving item 9 is mapped by h1(9) mod m, h2(9) mod m, h3(9) mod m, and h4(9) mod m.]
For each item, hash the item into its buckets, take the minimum count among them, and return the item if its minimum count ≥ sN.
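The hashing step on this slide can be illustrated with a small sketch. It assumes four hash functions of the form ((a*x + b) mod P) mod m with randomly drawn a and b; the slide only shows "h_i(9) mod m", so that form, the class name `HashCountSketch`, and all parameter values are assumptions rather than the authors' definitions.

```python
import random


class HashCountSketch:
    """One row of m counters per hash function; estimate = minimum over rows."""

    PRIME = 2_147_483_647  # assumed modulus for the illustrative hash family

    def __init__(self, num_hashes=4, m=64, seed=0):
        rng = random.Random(seed)
        self.m = m
        self.params = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(num_hashes)
        ]
        self.table = [[0] * m for _ in range(num_hashes)]
        self.n = 0  # stream length N

    def _buckets(self, item):
        # One bucket index per hash function: h_i(item) mod m.
        return [((a * item + b) % self.PRIME) % self.m for a, b in self.params]

    def add(self, item):
        self.n += 1
        for row, col in enumerate(self._buckets(item)):
            self.table[row][col] += 1

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best bound.
        return min(self.table[row][col]
                   for row, col in enumerate(self._buckets(item)))

    def frequent(self, candidates, s):
        # Return the candidates whose minimum count is at least s*N.
        return [x for x in candidates if self.estimate(x) >= s * self.n]


sketch = HashCountSketch()
for _ in range(50):
    sketch.add(9)          # the item from the slide, made frequent
for x in range(100, 150):
    sketch.add(x)          # fifty hypothetical background items
print(sketch.estimate(9), sketch.frequent([9], s=0.4))
```

Because every counter an item maps to is incremented on each arrival, the estimate never falls below the true count; it can only overshoot when other items collide in all rows.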

Slide 11: Hash-based
- Transaction {1, 2, 3}
- Subsets of {1, 2, 3}: {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}
- ……
[Figure: a list with fields Itemset, True_Count, Surplus_Estimate, N_last_access, and Total_Access; marks in source order: × ○ ○ ○ × × ×.]
How to compute the Surplus_Estimate?
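The subset enumeration on this slide can be reproduced with the standard library. The helper name `nonempty_subsets` is an assumption for illustration; the sketch only shows how many itemsets a single transaction contributes, which is why the space concern raised for Lossy Counting matters.

```python
from itertools import combinations


def nonempty_subsets(transaction):
    """List every non-empty subset of a transaction, smallest itemsets first."""
    items = sorted(transaction)
    return [
        subset
        for size in range(1, len(items) + 1)
        for subset in combinations(items, size)
    ]


subsets = nonempty_subsets({1, 2, 3})
print(subsets)
```

A transaction with t items yields 2^t - 1 itemsets, so 3 items give 7 subsets while 20 items already give more than a million.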

Slide 12: Compute the Surplus_Estimate for an Itemset
- Two variables
  - n: number of different itemsets in the bucket but not in the list
  - c: sensible counts to be divided between itemsets which are not in the list
- If c = [3, 5], n = [3, ?] → Surplus_Estimate = 3, (3, 1, 1)
- Decrement Surplus_Estimate until Surplus_Estimate / N_last_access < minSup

Slide 13: Determine c and n
[Table: list entries for itemsets {1} and {2} with fields Itemset, True_Count, Surplus_Estimate, N_last_access, and Total_Access.]
{2, 3, 5}, N = 4, minSup = 0.4
{2} is hashed into the bucket.
Boundary of c: 4 - (2 + SE) ≤ c ≤ 4 - 2
Boundary of n: c = 2, n = 2 → (1, 1) → Surplus_Estimate = 1

Monitoring Constrained k-Nearest Neighbor over Moving Objects with Different Values

Slide 15: Motivation (Cont.)
- Example: a user wants to find the k places to buy new shoes where the total cost is lowest.
  Cost = Price($) + Traffic Cost($)
[Figure: 2-NN query example. Surviving price labels: $90, $100, $200; traffic costs use multipliers 1, 2, 3, and 5; the one surviving total is 590.]

Slide 16: Motivation
Objects with different values in a spatial database.
- Find the k places to buy something where the costs are the lowest. Cost = Price($) + Traffic Cost($)
- A taxi driver wants to find the k places to gain the most profit. Profit = Gain($) - Traffic Cost($)
- A taxi driver wants to find the k places to gain the most profit. Profit = Gain($) / Time
- Virtual Channel: age * profile distance; listen hours / profile distance
- Market Survey: $consumption (or income, age, …) / profile distance
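The first scenario, Cost = Price + Traffic Cost, can be sketched as a brute-force k-NN ranking by total cost. This is only an illustration of the cost function, not the monitoring method of the talk; the function name `cheapest_k`, the per-unit traffic rate, and the shop data are all hypothetical.

```python
import heapq
from math import hypot


def cheapest_k(query, places, k, cost_per_unit):
    """Return the k places with the lowest price + distance-based traffic cost.

    places is a list of (name, (x, y), price) tuples; query is an (x, y) point.
    """
    def total_cost(place):
        _name, (x, y), price = place
        distance = hypot(x - query[0], y - query[1])
        return price + cost_per_unit * distance

    return heapq.nsmallest(k, places, key=total_cost)


# Hypothetical shops: the nearest one is not among the cheapest overall.
shops = [
    ("A", (1, 0), 200),  # total 200 + 30 * 1 = 230
    ("B", (2, 0), 100),  # total 100 + 30 * 2 = 160
    ("C", (3, 0), 90),   # total  90 + 30 * 3 = 180
]
best = cheapest_k((0, 0), shops, k=2, cost_per_unit=30)
print([name for name, _pos, _price in best])
```

With these numbers a plain 2-NN query by distance would return A and B, while ranking by total cost returns B and C, which is the gap the slide's example points at.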

Slide 17: Challenges
- Efficiency
  - Search space reduction
  - Query processing enhancement
- Effectiveness
  - Previous result reuse

Slide 18: Framework
Initialization
- Step 1: Find k-candidates to restrict the search region.
- Step 2: Run Pruning Ring on the remaining candidates to determine the actual answer.
Handling updates (positions, values)
- Incrementally update positions or values for objects and queries.
- Computation is necessary only for an affected query q.

Querying Episodes over Event Stream

Slide 20: Motivation
- Knowledge Discovery from Telecommunication Network Alarm Databases [ICDE96]
  - If an alarm of type A occurs, then an alarm of type B occurs within 30 seconds with probability 0.8.
  - If alarms of types A and B occur within 5 seconds, then an alarm of type C occurs within 60 seconds with probability 0.7.
  - If an alarm of type A precedes an alarm of type B, and C precedes D, all within 15 seconds, then E will follow within 4 minutes with probability 0.6.
[Figure: episode diagrams for alarms A, B, C, and D with 5-second and 15-second windows.]

Slide 21: Challenges
- Efficiency
- Index impaction
- Partial result sharing
- Load shedding

Slide 22: Framework
[Figure: queries Q1, Q2, and Q3 over events A, B, C, and D, decomposed into partial patterns p1 to p5; numeric labels 5, 7, and 3.]
- Joining events B and C: p1, p5
- Q3 is composed of p5 and p4

Slide 23
[Figure: data structures for the queries Q1, Q2, and Q3 and partial patterns p1 to p5 of the previous slide.]
PQueue entries:
- P1: M.Q. = Q1, E.I. = -1
- P2: M.Q. = Q2, E.I. = 2
- P3: M.Q. = Q2, E.I. = 2
- P4: M.Q. = Q3, E.I. = 6
- P5: M.Q. = Q3, E.I. = 6
Event entries for A, B, C, and D each carry an EQueue and a TLink; the E.I. values 5, 2, and 4 appear beside them. Queue elements are labeled (time) and (S_t, E_t).