Di Yang, Elke A. Rundensteiner and Matthew O. Ward Worcester Polytechnic Institute VLDB 2009, Lyon, France 1 A Shared Execution Strategy for Multiple Pattern.

Slides:

Advertisements

Similar presentations

ADAPTIVE FASTEST PATH COMPUTATION ON A ROAD NETWORK: A TRAFFIC MINING APPROACH Hector Gonzalez, Jiawei Han, Xiaolei Li, Margaret Myslinska, John Paul Sondag.

Advertisements

Analysis of : Operator Scheduling in a Data Stream Manager CS561 – Advanced Database Systems By Eric Bloom.

Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.

Mining Compressed Frequent- Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Department of Computer Science University of Illinois at Urbana-Champaign.

Cluster Analysis: Basic Concepts and Algorithms

1 CSE 980: Data Mining Lecture 16: Hierarchical Clustering.

Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree-like diagram that.

Hierarchical Clustering, DBSCAN The EM Algorithm

Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.

PARTITIONAL CLUSTERING

A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad.

Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.

Using Parallel Genetic Algorithm in a Predictive Job Scheduling

Fast Algorithms For Hierarchical Range Histogram Constructions

Yasuhiro Fujiwara (NTT Cyber Space Labs)

Data Mining Cluster Analysis: Advanced Concepts and Algorithms

More on Clustering Hierarchical Clustering to be discussed in Clustering Part2 DBSCAN will be used in programming project.

Dynamic Bayesian Networks (DBNs)

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.

1 This work partially funded by NSF Grants IIS , IRIS and IIS Matthew O. Ward, Elke A. Rundensteiner, Jing Yang, Punit Doshi, Geraldine.

1 Learning to Detect Objects in Images via a Sparse, Part-Based Representation S. Agarwal, A. Awan and D. Roth IEEE Transactions on Pattern Analysis and.

1 An Adaptive Nearest Neighbor Classification Algorithm for Data Streams Yan-Nei Law & Carlo Zaniolo University of California, Los Angeles PKDD, Porto,

Overview Of Clustering Techniques D. Gunopulos, UCR.

VLDB Revisiting Pipelined Parallelism in Multi-Join Query Processing Bin Liu and Elke A. Rundensteiner Worcester Polytechnic Institute

Locality Optimizations in OceanStore Patrick R. Eaton Dennis Geels An introduction to introspective techniques for exploiting locality in wide area storage.

A Strategy Selection Framework for Adaptive Prefetching in Visual Exploration Punit R. Doshi, Geraldine E. Rosario, Elke A. Rundensteiner, and Matthew.

Probabilistic Skyline Operator over sliding Windows Wan Qian HKUST DB Group.

An Adaptive Multi-Objective Scheduling Selection Framework For Continuous Query Processing Timothy M. Sutherland Bradford Pielech Yali Zhu Luping Ding.

© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.

Prefetching for Visual Data Exploration Punit R. Doshi, Elke A. Rundensteiner, Matthew O. Ward Computer Science Department Worcester Polytechnic Institute.

Query Planning for Searching Inter- Dependent Deep-Web Databases Fan Wang 1, Gagan Agrawal 1, Ruoming Jin 2 1 Department of Computer.

CHAMELEON : A Hierarchical Clustering Algorithm Using Dynamic Modeling

1 On Querying Historical Evolving Graph Sequences Chenghui Ren $, Eric Lo *, Ben Kao $, Xinjie Zhu $, Reynold Cheng $ $ The University of Hong Kong $ {chren,

1 A Bayesian Method for Guessing the Extreme Values in a Data Set Mingxi Wu, Chris Jermaine University of Florida September 2007.

Index Tuning for Adaptive Multi-Route Data Stream Systems Karen Works, Elke A. Rundensteiner, and Emmanuel Agu Database Systems Research.

An Efficient Approach to Clustering in Large Multimedia Databases with Noise Alexander Hinneburg and Daniel A. Keim.

1 ENTROPY-BASED CONCEPT SHIFT DETECTION PETER VORBURGER, ABRAHAM BERNSTEIN IEEE ICDM 2006 Speaker: Li HueiJyun Advisor: Koh JiaLing Date:2007/11/6 1.

Time Series Data Analysis - I Yaji Sripada. Dept. of Computing Science, University of Aberdeen2 In this lecture you learn What are Time Series? How to.

On Graph Query Optimization in Large Networks Alice Leung ICS 624 4/14/2011.

ECO-DNS: Expected Consistency Optimization for DNS Chen Stephanos Matsumoto Adrian Perrig © 2013 Stephanos Matsumoto1.

Computer Science and Engineering Efficiently Monitoring Top-k Pairs over Sliding Windows Presented By: Zhitao Shen 1 Joint work with Muhammad Aamir Cheema.

Marina Drosou, Evaggelia Pitoura Computer Science Department

Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.

Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Z. Joseph, CSE-UT Arlington.

1 Elke. A. Rundensteiner Worcester Polytechnic Institute Elisa Bertino Purdue University 1 Rimma V. Nehme Microsoft.

August 30, 2004STDBM 2004 at Toronto Extracting Mobility Statistics from Indexed Spatio-Temporal Datasets Yoshiharu Ishikawa Yuichi Tsukamoto Hiroyuki.

Ch. Eick: Introduction to Hierarchical Clustering and DBSCAN 1 Remaining Lectures in Advanced Clustering and Outlier Detection 2.Advanced Classification.

Di Yang, Zhengyu Guo, Elke A. Rundensteiner and Matthew O. Ward Worcester Polytechnic Institute EDBT 2010, Submitted 1 A Unified Framework Supporting Interactive.

Chapter 8 Physical Database Design. Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization.

D-skyline and T-skyline Methods for Similarity Search Query in Streaming Environment Ling Wang 1, Tie Hua Zhou 1, Kyung Ah Kim 2, Eun Jong Cha 2, and Keun.

Graph Data Management Lab, School of Computer Science Branch Code: A Labeling Scheme for Efficient Query Answering on Tree

November 4, 2003Applied Research Laboratory, Washington University in St. Louis APOC 2003 Wuhan, China Cost Efficient Routing in Ad Hoc Mobile Wireless.

Chapter 9: Web Services and Databases Title: NiagaraCQ: A Scalable Continuous Query System for Internet Databases Authors: Jianjun Chen, David J. DeWitt,

Introduction to Machine Learning, its potential usage in network area,

Data Mining: Basic Cluster Analysis

More on Clustering in COSC 4335

Data Mining Soongsil University

Updating SF-Tree Speaker: Ho Wai Shing.

A paper on Join Synopses for Approximate Query Answering

Data Mining K-means Algorithm

Parallel Density-based Hybrid Clustering

RE-Tree: An Efficient Index Structure for Regular Expressions

Query in Streaming Environment

A Framework for Automatic Resource and Accuracy Management in A Cloud Environment Smita Vijayakumar.

CSE572, CBS598: Data Mining by H. Liu

IPOG: A General Strategy for T-Way Software Testing

An Adaptive Nearest Neighbor Classification Algorithm for Data Streams

Online Analytical Processing Stream Data: Is It Feasible?

CSE572: Data Mining by H. Liu

Presentation transcript:

Di Yang, Elke A. Rundensteiner and Matthew O. Ward Worcester Polytechnic Institute VLDB 2009, Lyon, France 1 A Shared Execution Strategy for Multiple Pattern Mining Requests Over Streaming Data This work is supported under NSF grants CCF , IIS , IIS Chandi

Motivation: data streams are everywhere 2 transaction info patterns Stock Market Are there any patterns in transactions over past hour? Battlefield position info Stock Analysts Commander Where are the main clusters formed by enemy warcraft? patterns 2

Motivation: pattern mining requests tend to be parameterized 3 Example 1: give me the stocks that dropped significantly in the most recent transactions. Example 2: give me the major clusters formed by enemy warcraft. 10%, 30% or 50% to the original price with in last 10,30, or 60 minutes. size: n war-crafts density: m war-crafts / mi 2

Motivation: best parameters settings are hard to determine I need info for any cluster sized 5 or higher Clusters formed by fighter planes need to be updated every 5 seconds I only care about the clusters that are formed by more than 20 warcraft Clusters formed by boom carriers need to be updated every 10 seconds Multiple analysts may raise multiple queries with different parameter settings. Parameter settings? I probably know. But, can I try different combinations of them? ? A single analyst may raise multiple queries with different parameter settings. 4 Problem: A lot of similar queries, yet with different parameter settings, how to answer them efficiently.

Outline 1. Motivation 2. Problem Definition & State-of-Art 3. Basic: Single Query Processing Strategies. 4. Proposed Solution: Multi-Query Sharing. 5. Experimental Study 6. Conclusion 5 5

Target Pattern Type Core Object: has no less than neighbors in distance from it. Edge Object: not core object but a neighbor of a core object. Noise: not core object and not a neighbor of any core object. θ range θ cnt A Density-Based Cluster (DB-Cluster) is a maximum group of connected core objects and the edge objects attached to them Why: popular and well know, arbitrary shapes, allow unclassified (noise), deterministic…

Clustering in Sliding Windows Over Stream W W2 7 Applications include: monitoring congestion (cluster) in traffic looking for intensive transaction areas (cluster) in stock trades identifying malicious attacks (cluster) in network

Problem Definition Input: a query group QG with multiple density-based clustering queries querying on the same input stream but with arbitrary parameter settings. Goal: to minimize both the average processing time and the peak memory space needed by the system. 8 Template Density-Based Clustering Query Over Sliding Windows Pattern-specific Window-specific

State-Of-Art 9 Previous work on multiple query sharing concentrates on traditional SPJ and aggregation operators [Arasu04][Hammad03][Wang06]. No existing solution for multiple query optimization for complex pattern mining requests, such as clustering, in streaming environments. Existing algorithms for single density-based clustering query over sliding windows include Incremental DBSCAN, Exact-N, Abstract-C and Extra-N [Ester98] [Yang09]. Executing algorithms above independently for each query (naive solution ) is not scalable.

Outline 1. Motivation 2. Problem Definition & State-of-Art 3. Basic: Single Query Processing Strategies. 4. Proposed Solution: Multi-Query Sharing. 5. Experimental Study 6. Conclusion 10

Density-Based Clustering on Data Streams 11 In highly dynamic streaming environments: Re-computation. Incremental cluster maintenance. Extra-N proposed a hybrid neighbor relationship (neighborship) mechanism to represent cluster structure. Maintain “Exact Neighborships” (neighbor lists) for none-core objects. Maintain “Abstract Neighborships” (cluster memberships) for core objects. A general concept of “Predicted View” is applied to efficiently update the cluster structure. —key: a compact and easy-maintainable cluster representation.

Concept of Predicted Views Current View of W 0 window size=16, slide size=4, time=1 Predicted View of W Predicted View of W Predicted View of W W0W0 W1W1 W2W2 W3W3 12

Update Predicted Views Current View of W 1 Predicted View of W Predicted View of W Predicted View of W W1W1 W2W2 W3W3 W4W New Data Points window size=16, slide size=4, time=1 13 Expired View of W 0

Outline 1. Motivation 2. Problem Definition & State-of-Art 3. Basic: Single Query Processing Strategies. 4. Proposed Solution: Multiple Query Sharing. 5. Experimental Study 6. Conclusion 14

Win Slide Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary θ cnt θ range Pattern-Specific ParametersWindow-Specific Parameters 15

Arbitrary Pattern-Specific Parameter Case -- arbitrary, fixed 16 θ range θ cnt Any relationship between the cluster sets identified by them?

“Growth Property” among DB-cluster Sets 17 Independent Cluster Structure StorageHierarchical Cluster Structure Storage Grow If any cluster Ci in Clu_Set1 is “contained” by one cluster in Clu_Set2, Clu_Set2 is a “Growth” of Clu_Set1. c6c5c4 c6c5c4

Benefits of Hierarchical Cluster Structure 18 Benefits for Memory Resources: Memory space needed by storing cluster sets identified by multiple queries in QG is independent from |QG|. Benefits for Computational Resources: Multiple cluster sets stored in the hierarchical cluster structure (which are usually similar) can be maintained incrementally, rather than independently.

Arbitrary Pattern-Specific Parameter Case -- arbitrary, fixed 19 θ range θ cnt θ θ range Growth property transitively holds among the cluster sets identified by multiple queries with arbitrary and same.

Arbitrary Pattern-Specific Parameter Case -- arbitrary, fixed Idea: Growth property transitively holds. Solution: A single integrated representation of predicted views: θ range θ cnt θ IntView_

Win Slide Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary θ cnt θ range Pattern-Specific ParametersWindow-Specific Parameters 21

Arbitrary Pattern-Specific Parameter Case -- Fixed, Arbitrary Growth property transitively holds among the cluster sets identified by multiple queries with arbitrary and same. θ cnt θ range θ θ cnt 22

Arbitrary Pattern-Specific Parameter Case -- arbitrary, fixed θ cnt θ range Idea: Growth property transitively holds. Solution: A single integrated representation of predicted views: θ range IntView_

Win Slide Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary θ cnt θ range Pattern-Specific ParametersWindow-Specific Parameters 24

25 Growth property holds,if ≥ and ≤. An “Predicted View Tree” for all queries. Arbitrary Pattern-Specific Parameter Case -- arbitrary, arbitrary θ cnt θ range θ Qi. θ range Qj. θ cnt Qi. θ cnt Qj. θ IntView_ Cluster set by Q3 Cluster set by Q5 Cluster set by Q2 Cluster set by Q1 Cluster set by Q4 grow

Win Slide Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary θ cnt θ range Pattern-Specific ParametersWindow-Specific Parameters 26

Arbitrary Window-Specific Parameter Case -- arbitrary win, fixed slide In this case, maintaining a single query will be sufficient to answer all. The predicted views for Qi with largest win cover all needed W0W0 W1W1 W3W3 Q1.win=16, slide=4 Q2.win=12, slide=4 W0W0 W2W2 Shared Time Answer for Q1 at T=16Answer for Q2 at T=16 W0W0 Q3.win=8, slide=4 W0W0 Q4.win=4, slide=4 Answer for Q3 at T=16 Answer for Q4 at T=16

Win Slide Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary θ cnt θ range Pattern-Specific ParametersWindow-Specific Parameters 28

Arbitrary Window-Specific Parameter Case -- arbitrary slide, arbitrary win Use a single meta query with largest window size and adaptive slide size to represent queries. 29 Grow

Win Slide Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary Fixed Arbitrary θ cnt θ range Pattern-Specific ParametersWindow-Specific Parameters 30

General Case -- arbitrary all four parameters Our proposed techniques for arbitrary pattern parameter cases (intra-window-optimization) for arbitrary window parameter cases (inter-window-optimization) are orthogonal to each other. Final integrated structure for QG. 31 θ IntView_ IntView_W IntView

Experimental Study 32 Alternative Methods: 1. Incremental DBSCAN [Ester98] 2. Incremental DBSCAN with rqs (range query search sharing) 3. Extra-N [Yang09] 4. Extra-N with rqs (range query search sharing) 5. Chandi (our solution) Real Streaming Data: 1. GMTI data recording information about moving vehicles [Mitre08]. 2. STT data recording stock transactions from NYSE [INETATS08]. Measurements: 1. Average processing time for each tuple. 2. Memory footprint. Chandi

Evaluation for Performance 33 Arbitrary Pattern Parameter Case Arbitrary Window Parameter Case

Evaluation for Performance 34 Arbitrary All Four Parameter Cases

35 Conclusion: First effort of applying multi-query optimization techniques on complex mining requests. General principles proposed, such as incremental pattern representation and meta query strategy, applicable to other pattern types. First full-sharing framework, called Chandi, for multiple density-based clustering queries over streaming windows, Experimental study shows Chandi has excellent efficiency and scalability. Future Work: Other pattern types: other cluster types, outliers, association rules… Pattern management. Visualization and interaction.

The End Thanks 36 Chandi