Di Yang, Zhengyu Guo, Elke A. Rundensteiner and Matthew O. Ward Worcester Polytechnic Institute EDBT 2010, Submitted 1 A Unified Framework Supporting Interactive.

Presentation transcript:

Di Yang, Zhengyu Guo, Elke A. Rundensteiner and Matthew O. Ward, Worcester Polytechnic Institute. EDBT 2010, Submitted. A Unified Framework Supporting Interactive Exploration of Density-Based Clusters in Streaming Windows. This work is supported under NSF grants CCF , IIS , IIS

What are Density-Based Clusters? Clusters that are defined by individual data points (tuples) and their local “neighborhood”. How are they different from K-median style clustering? Figure labels: Cluster 1, Cluster 2; Cluster 1, Cluster 2, Cluster 3, Cluster 4.

Formal Definition. Core Object: has more than θ cnt neighbors within distance θ range from it. Edge Object: not a core object but a neighbor of a core object. Noise: not a core object and not a neighbor of any core object. A Density-Based Cluster (DB-Cluster) is a maximal group of connected core objects and the edge objects attached to them.
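The definitions above can be sketched in a few lines of Python. This is a minimal, brute-force illustration, not the authors' implementation. The function names `classify` and `db_clusters` are hypothetical, Euclidean distance is assumed, and an object is not counted as its own neighbor.

```python
from math import dist


def classify(points, theta_range, theta_cnt):
    """Label each point as 'core', 'edge' or 'noise' (brute force, O(n^2))."""
    n = len(points)
    # Assumption: Euclidean distance, and a point is not its own neighbor.
    neighbors = [
        [j for j in range(n)
         if j != i and dist(points[i], points[j]) <= theta_range]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) > theta_cnt}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("edge")
        else:
            labels.append("noise")
    return neighbors, labels


def db_clusters(points, theta_range, theta_cnt):
    """Group connected core objects and attach their edge objects."""
    neighbors, labels = classify(points, theta_range, theta_cnt)
    clusters, seen = [], set()
    for start in range(len(points)):
        if labels[start] != "core" or start in seen:
            continue
        members, stack = set(), [start]
        seen.add(start)
        while stack:
            i = stack.pop()
            members.add(i)
            for j in neighbors[i]:
                if labels[j] == "core":
                    if j not in seen:
                        seen.add(j)
                        stack.append(j)
                else:
                    # A non-core neighbor of a core object is an edge object.
                    members.add(j)
        clusters.append(members)
    return clusters
```

Clusters are returned as sets of point indices; noise points belong to no cluster.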

Cluster Detection in Sliding Windows. Template: Density-Based Clustering Query Over Sliding Windows, with a pattern-specific part and a window-specific part. Figure: consecutive sliding windows over the stream.
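As a rough sketch of the window-specific part of such a query, the generator below emits the content of each window over a stream. It assumes count-based windows (the slides may equally intend time-based ones), and the name `sliding_windows` and the parameters `win` and `slide` are illustrative.

```python
from collections import deque


def sliding_windows(stream, win, slide):
    """Yield the content of each count-based window of size `win`,
    advancing by `slide` tuples between consecutive windows."""
    window = deque(maxlen=win)  # the oldest tuples expire automatically
    for count, item in enumerate(stream, start=1):
        window.append(item)
        if count >= win and (count - win) % slide == 0:
            yield list(window)
```

A clustering routine with the pattern-specific parameters (θ range, θ cnt) would then be applied to each emitted window.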

Application Examples. Stock market: transaction information streams to stock analysts, who ask: are there intensive-transaction areas in the last hour's transactions? Battlefield: position information streams to a commander, who asks: where are the main clusters formed by enemy warcraft?

State of the Art. Existing algorithms for density-based clustering queries over sliding windows include Incremental DBSCAN, Exact-N, Abstract-C and Extra-N [Ester98] [Yang09]. Extra-N becomes inefficient as the win/slide ratio increases. No evolution semantics are defined for how density-based clusters change over time. No existing system allows interactive exploration of density-based clusters in streaming windows.

Goals: 1. A more efficient density-based clustering algorithm over streams. 2. An evolution semantics that intuitively explains cluster changes. 3. A visualized pattern space allowing interactive exploration of clusters.

Review of the existing algorithm: Extra-N. In highly dynamic streaming environments, the options are re-computation or incremental cluster maintenance. Extra-N [Yang09] proposed a hybrid neighbor relationship (neighborship) mechanism to represent the cluster structure: maintain “Exact Neighborships” (neighbor lists) for non-core objects, and maintain “Abstract Neighborships” (cluster memberships) for core objects. A general concept of “Predicted View” is applied to update the cluster structure efficiently. Key: a compact and easily maintainable cluster representation.

Concept of Predicted Views (window size = 16, slide size = 4, time = 1). Figure: the current view of W0 and the predicted views of W1, W2 and W3.

Update Predicted Views (window size = 16, slide size = 4, time = 1). Figure: new data points arrive, the view of W0 expires, W1 becomes the current view, and W2, W3 and W4 are the predicted views.
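The bookkeeping behind predicted views can be sketched as follows. It assumes that the predicted view of a future window holds the points of the current window that will still be alive in that future window, and that windows are count-based. The slides only show this as a figure, so the function names and list-based representation are illustrative, and a real system would maintain cluster structures per view rather than raw point lists.

```python
from math import ceil


def predicted_views(window, win, slide):
    """Return the current view followed by the predicted views.

    `window` lists the tuples of the current window in arrival order.
    View k contains the tuples that survive k further slides.
    """
    return [window[k * slide:] for k in range(ceil(win / slide))]


def slide_views(views, new_points):
    """Advance by one slide: drop the expired view, add the new points to
    every remaining view, and open a new predicted view for the farthest
    future window (so far it holds only the new points)."""
    updated = [view + list(new_points) for view in views[1:]]
    updated.append(list(new_points))
    return updated
```

With window size 16 and slide size 4 this yields four views, matching W0 through W3 in the figure; after one slide the views correspond to W1 through W4.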

Inefficiency of Extra-N. When the win/slide ratio increases (for example win = 10000, slide = 10), a large number of predicted views need to be maintained independently. This is a heavy burden on both CPU and memory resources.
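A back-of-the-envelope calculation shows the scale of the problem. The sketch assumes one view per slide position inside the window, that is ceil(win / slide) views, and, purely for illustration, that each independently maintained view keeps one membership entry per point it contains. The function names are hypothetical.

```python
from math import ceil


def view_count(win, slide):
    """Number of views (current plus predicted) alive at any time."""
    return ceil(win / slide)


def independent_entries(win, slide):
    """Membership entries if every view stores its own points separately."""
    return sum(win - k * slide for k in range(view_count(win, slide)))


print(view_count(10000, 10))           # 1000 views
print(independent_entries(10000, 10))  # 5005000 entries for 10000 points
```

For the figure's setting (win = 16, slide = 4) the same functions give 4 views and 40 entries.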

Proposed Solution: IWIN. Is there any relationship among the cluster sets identified?

“Growth Property” among DB-Cluster Sets. If every cluster Ci in Clu_Set1 is “contained” by one cluster in Clu_Set2, then Clu_Set2 is a “Growth” of Clu_Set1. Figure: independent cluster structure storage versus hierarchical cluster structure storage for clusters c4, c5 and c6.
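The growth relation can be checked directly from this definition. The sketch below represents each cluster as a Python set of object identifiers and reads “contained” as the subset relation; both choices are assumptions, and `is_growth` is a hypothetical name.

```python
def is_growth(clu_set1, clu_set2):
    """True if every cluster in clu_set1 is contained in some cluster
    of clu_set2, i.e. clu_set2 is a "growth" of clu_set1."""
    return all(any(c1 <= c2 for c2 in clu_set2) for c1 in clu_set1)


smaller = [{4, 5}, {6}]
larger = [{4, 5, 6, 7}]
print(is_growth(smaller, larger))  # True
print(is_growth(larger, smaller))  # False
```

When this relation holds between the cluster sets of successive views, the sets can be stored as one hierarchy instead of as independent copies.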

Integrated vs. Independent Maintenance of Predicted Views. IWIN: integrated maintenance. Extra-N: independent maintenance.

Benefits of Integrated Maintenance. Benefits for memory resources: the memory space needed to store the cluster sets identified by multiple queries in QG is independent of |QG|. Benefits for computational resources: the multiple cluster sets stored in the hierarchical cluster structure (which are usually similar) can be maintained incrementally rather than independently. IWIN outperforms Extra-N in both CPU and memory utilization.

Goals: 1. A more efficient density-based clustering algorithm over streams. 2. An evolution semantics that intuitively explains cluster changes. 3. A visualized pattern space allowing interactive exploration of clusters.

Why do we need evolution semantics? Analysts need to know how clusters change over time. This is hard to observe by looking at the clusters only (even with visualization). The commander's questions. History: did any clusters merge? Now: are there any new clusters? Future: is any cluster about to break up?

Proposed Semantics. Single-step evolutions: birth, termination, split, merge, preserve/expand/shrink. Multi-step evolutions: split-expand, split-merge, shrink-split.

How to Compute. Extract the predicted evolution (before the window slides), then update the evolution (after the window slides). Figure labels: preserve, split, preserve, shrink.
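The slides do not spell out formal definitions for the single-step evolutions, so the following is only one plausible reading, stated under explicit assumptions: clusters are sets of object identifiers, two clusters are related when they share at least one object, and expand, shrink and preserve are decided by comparing sizes. The function `single_step_evolutions` is hypothetical and is not the IWIN procedure, which derives evolutions during cluster maintenance.

```python
def single_step_evolutions(old, new):
    """Compare two lists of clusters (sets of ids) and report events as
    (kind, old_indices, new_indices) tuples. Overlap-based heuristic."""
    over_old = [[j for j, n in enumerate(new) if o & n] for o in old]
    over_new = [[i for i, o in enumerate(old) if o & n] for n in new]
    events = []
    for i, js in enumerate(over_old):
        if not js:
            events.append(("termination", (i,), ()))
        elif len(js) > 1:
            events.append(("split", (i,), tuple(js)))
    for j, olds in enumerate(over_new):
        if not olds:
            events.append(("birth", (), (j,)))
        elif len(olds) > 1:
            events.append(("merge", tuple(olds), (j,)))
        elif len(over_old[olds[0]]) == 1:
            # One-to-one match: assumed to be decided by size change.
            i = olds[0]
            diff = len(new[j]) - len(old[i])
            kind = "expand" if diff > 0 else "shrink" if diff < 0 else "preserve"
            events.append((kind, (i,), (j,)))
    return events
```

Multi-step evolutions such as split-expand could then be read off by chaining the events of consecutive windows.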

Conclusion for Proposed Semantics: 1. Intuitively describes cluster evolution over time. 2. Easily maintainable: can be computed on the fly during cluster maintenance.

Goals: 1. A more efficient density-based clustering algorithm over streams. 2. An evolution semantics that intuitively explains cluster changes. 3. A visualized pattern space allowing interactive exploration of clusters.

Outline: 1. What is Neighbor-Based Pattern Detection 2. State of the Art 3. Potential Solutions & Their Inefficiency 4. Proposed Solution: Extra-N 5. Experimental Study 6. Conclusion

Why is it needed? Analysts need to navigate along the time axis to learn the current state, review the history, and predict the near future. Example: how are the two clusters in the current window related to those detected 30 minutes earlier? Analysts also need to study the clusters and their evolution at different abstraction levels. Example: for routine traffic monitoring, only the positions of major clusters need to be reported; when an accident happens, specific information about cluster members needs to be reported.

Proposed Pattern Space

Evaluation for IWIN. Alternative methods: 1. Incremental DBSCAN [Ester98] 2. Extra-N [Yang09] 3. IWIN. Real streaming data: 1. GMTI data recording information about moving vehicles [Mitre08]. 2. STT data recording stock transactions from NYSE [INETATS08]. Measurements: 1. Average processing time for each tuple. 2. Memory footprint.
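The two measurements can be collected with a small harness like the one below. This is a generic sketch using Python's standard `time` and `tracemalloc` modules, not the authors' experimental setup. The name `measure` and the `process` callback are hypothetical, and the peak figure only covers memory allocated through Python while tracing is on.

```python
import time
import tracemalloc


def measure(process, tuples):
    """Return (average seconds per tuple, peak traced memory in bytes)
    for feeding `tuples` one by one into the callable `process`."""
    tracemalloc.start()
    start = time.perf_counter()
    count = 0
    for item in tuples:
        process(item)
        count += 1
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (elapsed / count if count else 0.0), peak
```

Running the same stream through each alternative method with this harness gives directly comparable per-tuple times and memory peaks.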

Evaluation for IWIN (continued)

Case Study 1

Case Study 2

Conclusion: 1. Presented the first unified framework supporting interactive exploration of density-based clusters in streaming windows. 2. Designed a more efficient density-based clustering algorithm, IWIN. 3. Defined the first evolution semantics for density-based clusters. 4. Our experimental study confirms both the efficiency and the effectiveness of the proposed framework.

Future Work. Support multiple queries. Support other pattern types, such as outliers and association rules. Support pattern storage and matching. More?

The End. Thanks.