Mining Data Streams with Periodically Changing Distributions. Yingying Tao, Tamer Ozsu, CIKM'09. Supervisor: Dr. Koh. Speaker: Nonhlanhla Shongwe. April 26, 2010

Preview: Introduction, Challenge, Method, DMM Framework, Distance Function Selection, Experiments, Conclusion

Introduction Mining data streams for knowledge, such as clustering, classification, and frequent pattern discovery, has become important. An important characteristic of unbounded data streams is that the underlying distributions can change significantly over time, leading to dynamic data streams.

Challenge The problem of mining dynamic data streams is to balance accuracy with efficiency: highly accurate mining techniques are generally computationally expensive. Some questions to ask about dynamic data streams: Are the distribution changes entirely random and unpredictable? Is it possible for the distribution changes to follow certain patterns?

Method Propose a method for mining dynamic data streams, where important observed distribution patterns are stored and newly detected changes are compared with these patterns.

Method Two streams with the same distribution should yield the same mining results, such as the list of all frequent items/itemsets for frequent pattern discovery, or the set of clusters/classes for clustering and classification. If a distribution change is detected and a match with an archived distribution is found, it is possible to skip the re-mining process and directly output the mining results of the archived distribution. This is called the match-and-reuse strategy.

Method Issues to be resolved: Pattern selection, i.e. selecting and storing important distributions that have a high probability of occurring in the future. Pattern representation, i.e. storing each pattern succinctly. Matching, i.e. an efficient matching procedure that keeps up with rapid data streams while maintaining high accuracy. Memory restriction, since the selected patterns have to be stored.

DMM framework DMM stands for Detect, Match and Mine. It consists of four processes: choosing a representative set, change detection, pattern matching, and stream mining. All processes are independent.

DMM framework Window model Generate reference window (choosing representative set) Change detection Distribution matching (Pattern matching) Choosing important distribution

Window model Two windows are maintained on stream S. Time-based window: defines the time intervals, denoted by Wt and called the observation window. It is implemented as a tumbling window that moves forward at each clock tick.

Window model (cont'd) Count-based window: contains a substream with a fixed number of elements, denoted by Wr and called the reference window. The size of the reference window |Wr| and the time interval of Wt are predefined values. |Wr| has to be small due to memory limitations, but the smaller |Wr| is, the less accurately it can represent a distribution.
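
A minimal sketch (assumed names, not the paper's code) of the two-window model: a time-based observation window Wt that tumbles at each clock tick, and a count-based reference window Wr with a fixed capacity.

    from collections import deque

    def tumbling_windows(stream, interval):
        """Yield time-based observation windows Wt from a stream of (timestamp, value)
        pairs; the window tumbles (is emptied) every `interval` time units."""
        window, window_end = [], None
        for t, v in stream:
            if window_end is None:
                window_end = t + interval
            if t >= window_end:                  # clock tick: current Wt is complete
                yield window
                window, window_end = [], t + interval
            window.append((t, v))
        if window:
            yield window

    # Count-based reference window Wr with fixed capacity |Wr|.
    W_R_SIZE = 200
    w_r = deque(maxlen=W_R_SIZE)

    # Example usage on a synthetic stream of (timestamp, value) pairs.
    stream = ((t, (t * 7) % 13) for t in range(1000))
    for w_t in tumbling_windows(stream, interval=100):
        pass  # change detection and merge-and-select would run here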

DMM framework Window model Generate reference window (choosing representative set) Change detection Distribution matching (Pattern matching) Choosing important distribution

Generate reference window (choosing representative set) Wr stores a set of data elements that represents the current distribution of S. Its size needs to be small due to memory limitations, but a small data set represents a complicated distribution only inaccurately. To address this problem, a dynamic reference window Wr is used: although the size of the representative set has to be limited due to memory and efficiency concerns, if, instead of blindly using the first group of data that arrives after a new distribution is detected, we carefully choose a sample set of the same size from a larger vault of samples, the distribution of this selected set will be much closer to the true distribution.

Generate reference window (choosing representative set) Merge-and-select process for the dynamic reference window Wr: merge Wt and Wr to obtain a larger substream of size |Wr| + |Wt|, then select |Wr| elements from the merged window (Wr + Wt) and replace the content of Wr with this new set. The merge-and-select process is triggered every time Wt tumbles.
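
A minimal sketch of the merge-and-select mechanics. The function and variable names are assumptions, and the uniform random sampler is only a placeholder for the density-guided two-step sampling described on the following slides.

    import random

    def merge_and_select(w_r, w_t, w_r_size, sampler=None):
        """Merge Wr and Wt, then keep |Wr| elements as the new reference window."""
        merged = list(w_r) + list(w_t)           # substream of size |Wr| + |Wt|
        if len(merged) <= w_r_size:
            return merged
        if sampler is None:                      # placeholder: uniform random sampling
            return random.sample(merged, w_r_size)
        return sampler(merged, w_r_size)         # e.g., density-guided two-step sampling

    # Triggered every time the observation window Wt tumbles:
    # w_r = merge_and_select(w_r, w_t, W_R_SIZE)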

Generate reference window (choosing representative set) Selecting the representative set: two-step sampling approach. First step: estimate the density function of Wr + Wt, where K is the kernel function, h is the smoothing parameter (bandwidth), and si is a data element in Wr + Wt. We want to find the data set W'r in Wr + Wt such that its distribution is the closest to the true distribution of S.
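
The slide's formula did not survive the transcript. Assuming it is the standard kernel density estimate over the n = |Wr| + |Wt| elements si, it has the form:

    \hat{p}(v) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left( \frac{v - v(s_i)}{h} \right)

where v(si) denotes the value of element si.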

Generate reference window (choosing representative set) Selecting the representative set: two-step sampling approach. K is set to the standard Gaussian function (mean = 0, variance = 1), which is substituted into the density function, and the bandwidth h is set to a value between 0 and 1.
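
With the standard Gaussian kernel (a textbook definition, restated here because the slide's formula is missing), the estimate becomes:

    K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}
    \qquad\Longrightarrow\qquad
    \hat{p}(v) = \frac{1}{n h \sqrt{2\pi}} \sum_{i=1}^{n} \exp\!\left( -\frac{(v - v(s_i))^2}{2 h^2} \right)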

Generate reference window (choosing representative set) Selecting the representative set: two-step sampling approach. With the density function we are able to estimate the "shape" of the current distribution: the x-axis is the value v(s) of the data s in Wr + Wt, and the y-axis is the probability p(v) for each data value. Higher probabilities indicate that those values occur more frequently in the stream, and thus more elements s with value v(s) should be selected into the new reference window.

Generate reference window (choosing representative set) Selecting the representative set: two-step sampling approach. Second step: first calculate the start and end values of each partition of the value range, then select elements from each partition in proportion to its estimated density.
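
A minimal sketch of how such density-guided selection could look. The kernel density estimate follows the standard Gaussian form above; the number of partitions and the proportional allocation are assumptions for illustration, not the paper's exact procedure.

    import math
    import random

    def gaussian_kde(values, h):
        """Standard Gaussian kernel density estimate over the merged window Wr + Wt."""
        n = len(values)
        def p(v):
            total = sum(math.exp(-((v - vi) ** 2) / (2 * h * h)) for vi in values)
            return total / (n * h * math.sqrt(2 * math.pi))
        return p

    def density_guided_select(merged, w_r_size, h=0.5, n_partitions=10):
        """Two-step sampling sketch: partition the value range, then draw from each
        partition in proportion to its estimated density mass."""
        p = gaussian_kde(merged, h)
        lo, hi = min(merged), max(merged)
        width = (hi - lo) / n_partitions or 1.0
        # Assign each element to a partition (start/end values: lo + i*width, lo + (i+1)*width).
        buckets = [[] for _ in range(n_partitions)]
        for v in merged:
            buckets[min(int((v - lo) / width), n_partitions - 1)].append(v)
        # Approximate density mass of each partition from its center value.
        mass = [p(lo + (i + 0.5) * width) * width for i in range(n_partitions)]
        total = sum(mass) or 1.0
        selected = []
        for bucket, m in zip(buckets, mass):
            k = min(len(bucket), round(w_r_size * m / total))
            selected.extend(random.sample(bucket, k))
        return selected[:w_r_size]

    # new_w_r = density_guided_select(old_w_r_values + w_t_values, W_R_SIZE)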

DMM framework Window model Generate reference window (choosing representative set) Change detection Distribution matching (Pattern matching) Choosing important distribution

Change detection An online change detection technique that is not restricted to a specific stream processing application. Every time Wt tumbles, the change detection procedure is triggered: the distributions of the substreams in the Wr and Wt windows are compared, and if the distance between them is greater than the predefined maximum matching distance, a distribution change is flagged.
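
A minimal sketch of the detection step. The paper evaluates several distance functions (DTW, LCSS, EDR, RD, discussed under Distance Function Selection below); the Kolmogorov-Smirnov style statistic here is only an illustrative stand-in, and the function names are assumptions.

    def ks_distance(a, b):
        """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
        empirical CDFs of the two value sequences."""
        a, b = sorted(a), sorted(b)
        values = sorted(set(a) | set(b))
        def cdf(xs, v):
            return sum(1 for x in xs if x <= v) / len(xs)   # fraction of elements <= v
        return max(abs(cdf(a, v) - cdf(b, v)) for v in values)

    def detect_change(w_r_values, w_t_values, max_matching_distance, distance=ks_distance):
        """Flag a distribution change when Wr and Wt differ by more than the threshold."""
        return distance(w_r_values, w_t_values) > max_matching_distance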

DMM framework Window model Generate reference window (choosing representative set) Change detection Distribution matching (Pattern matching) Choosing important distribution

Distribution matching (pattern matching) The chosen distance measure is used to check the similarity between the new distribution and the preserved patterns. If a match is found, the preserved mining results are output directly. The predefined maximum matching distance is important: a smaller value implies higher accuracy, while a larger value increases the possibility that a new distribution matches a pattern in the preserved set.
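
A minimal sketch of match-and-reuse against the archive of preserved distributions. The archive layout and names are assumptions; any of the distance functions discussed below could be plugged in as `distance`.

    def find_match(w_t_values, archive, max_matching_distance, distance):
        """Return the closest archived (sample, mining_results) pair within the
        maximum matching distance, or None if no archived pattern matches."""
        best, best_dist = None, max_matching_distance
        for sample, mining_results in archive:
            d = distance(w_t_values, sample)
            if d <= best_dist:
                best, best_dist = (sample, mining_results), d
        return best

    # Match-and-reuse: if a match exists, output its stored mining results and skip re-mining.
    # match = find_match(w_t_values, archive, MAX_MATCHING_DISTANCE, ks_distance)
    # results = match[1] if match else run_stream_mining(w_t_values)  # run_stream_mining is hypothetical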

DMM framework Window model Generate reference window (choosing representative set) Change detection Distribution matching (Pattern matching) Choosing important distribution

Choosing important distributions Heuristic rules are used: a distribution that has occurred more times in the stream is more important; the longer a distribution lasts within the stream's lifespan, the more important it is; and a distribution whose mining results have higher accuracy is more important than one with less accurate mining results. For each archived distribution, a counter indicates the number of times that distribution has occurred.
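
A minimal sketch of scoring archived distributions by these heuristics and evicting the least important one when the archive is full. The linear weighting is an assumption for illustration, not the paper's formula.

    def importance(entry, w_count=1.0, w_span=1.0, w_acc=1.0):
        """Combine the three heuristics: occurrence count, lifespan in the stream,
        and accuracy of the preserved mining results."""
        return (w_count * entry["count"]
                + w_span * entry["lifespan"]
                + w_acc * entry["accuracy"])

    def archive_add(archive, entry, capacity):
        """Add a new archived distribution; keep only the most important ones."""
        archive.append(entry)
        if len(archive) > capacity:
            archive.remove(min(archive, key=importance))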

Distance Function Selection Candidates considered: Dynamic Time Warping (DTW), Longest Common Subsequence (LCSS), Edit Distance on Real Sequence (EDR), and Relativized Discrepancy (RD). A proper distance function for DMM must be efficient and able to handle stretching.
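
To illustrate the "ability to stretch", a classic dynamic time warping implementation is sketched below; elements can align many-to-one, so sequences of different lengths or speeds can still match. This is the textbook DTW, not necessarily the exact variant used in the paper.

    def dtw(a, b):
        """Classic dynamic time warping distance between two value sequences."""
        INF = float("inf")
        n, m = len(a), len(b)
        d = [[INF] * (m + 1) for _ in range(n + 1)]
        d[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                d[i][j] = cost + min(d[i - 1][j],      # stretch a
                                     d[i][j - 1],      # stretch b
                                     d[i - 1][j - 1])  # one-to-one match
        return d[n][m]

    # Example: dtw([1, 2, 3], [1, 1, 2, 2, 3]) stays small despite the different lengths.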

Experiments Change detection: kernel density approach (KD) vs. distance function-based approach (DF).

Experiments Distribution matching evaluation. Data from the Tropical Atmosphere Ocean project: sea surface temperatures, 12,218 streams, each with a length of 962.

Experiments Efficiency with and without DMM. Adopt VFDT, a popular decision tree-based technique and one of the best decision tree generators for dynamic data streams, to mine the temperature streams. With DMM, mining time is reduced by 31.3%.

Conclusion Introduced the DMM framework for mining dynamic data streams: window model, generating the reference window (choosing the representative set), change detection, distribution matching (pattern matching), and choosing important distributions. Experiments show that DMM performs better.

Thank you for your attention