Mining Unusual Patterns in Data Streams in Multi-Dimensional Space

Presentation transcript:

Mining Unusual Patterns in Data Streams in Multi-Dimensional Space
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
November 21, 2018

Outline
- Characteristics of data streams
- Mining unusual patterns in data streams
- Multi-dimensional regression analysis of data streams
- Stream cubing and stream OLAP methods
- Mining other kinds of patterns in data streams
- Research problems

- Data streams: continuous, ordered, changing, fast, huge amount
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics
  - Huge volumes of continuous data, possibly infinite
  - Fast changing, requiring fast, real-time response
  - Data streams nicely capture today's data processing needs
  - Random access is expensive: single linear scan algorithms (can only have one look)
  - Store only a summary of the data seen thus far
  - Most stream data are at a fairly low level or multi-dimensional in nature, and need multi-level and multi-dimensional processing
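
The "single linear scan" and "store only the summary" points can be illustrated with a constant-size running summary. The sketch below is only an illustration, not a method from the slides: it uses Welford's well-known online update for mean and variance, and the names `RunningSummary` and `summarize` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """Constant-size summary of a numeric stream (count, mean, variance)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, value: float) -> None:
        # Welford's update: each element is seen once and then discarded.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of everything seen so far.
        return self.m2 / self.count if self.count else 0.0


def summarize(stream):
    """Consume the stream in a single pass, keeping only the summary."""
    summary = RunningSummary()
    for value in stream:  # one look at each element, no random access
        summary.update(value)
    return summary


summary = summarize(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(summary.count, summary.mean, summary.variance)
```

Memory use stays the same no matter how long the stream runs, which is the property the slide asks for.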

Stream Data Applications
- Telecommunication calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering & industrial processes: power supply & manufacturing
- Sensor, monitoring & surveillance: video streams
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even when saved, random access is too expensive)

Challenges of Stream Data Mining
- Multiple, continuous, rapid, time-varying, ordered streams
- Main memory computation
- Mining queries are either continuous or ad hoc
- Mining queries are often complex
  - Involving multiple streams, large amounts of data, and history
  - Finding patterns, models, anomalies, differences, …
- Mining dynamics (changes, trends and evolutions) of data streams
- Multi-level/multi-dimensional processing and data mining
  - Most stream data are at a fairly low level or multi-dimensional in nature

Stream Data Mining Tasks
- Multi-dimensional (on-line) analysis of streams
- Clustering data streams
- Classification of data streams
- Mining frequent patterns in data streams
- Mining sequential patterns in data streams
- Mining partial periodicity in data streams
- Mining notable gradients in data streams
- Mining outliers and unusual patterns in data streams
- …

Multi-Dimensional Stream Analysis: Examples
- Analysis of Web click streams
  - Raw data at low levels: seconds, web page addresses, user IP addresses, …
  - Analysts want: changes, trends, unusual patterns, at reasonable levels of detail
  - E.g., “Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours.”
- Analysis of power consumption streams
  - Raw data: power consumption flow for every household, every minute
  - Patterns one may find: average hourly power consumption for manufacturing companies in Chicago in the last 2 hours today is 30% higher than on the same day a week ago
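
A statement such as "traffic in the last 15 minutes is 40% higher than in the last 24 hours" compares averages over two windows of different length. The sketch below shows one simple way to compute such a ratio from per-minute counts. It assumes the stream has already been rolled up to one count per minute; `WindowAverage` and `relative_change` are hypothetical names, not part of the original material.

```python
from collections import deque


class WindowAverage:
    """Average of the most recent `size` per-minute counts."""

    def __init__(self, size):
        self.size = size
        self.values = deque()
        self.total = 0

    def add(self, count):
        self.values.append(count)
        self.total += count
        if len(self.values) > self.size:
            self.total -= self.values.popleft()  # expire the oldest minute

    def average(self):
        return self.total / len(self.values) if self.values else 0.0


def relative_change(per_minute_counts, short=15, long=24 * 60):
    """Return how much the short-window average exceeds the long-window one."""
    recent, baseline = WindowAverage(short), WindowAverage(long)
    for count in per_minute_counts:
        recent.add(count)
        baseline.add(count)
    base = baseline.average()
    return recent.average() / base - 1.0 if base else 0.0


# A quiet day followed by a 15-minute surge (synthetic data).
counts = [100] * (24 * 60 - 15) + [140] * 15
print(f"{relative_change(counts):.1%}")
```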

A Key Step: Stream Data Reduction
- Challenges of OLAPing stream data
  - Raw data cannot be stored
  - Simple aggregates are not powerful enough
  - History shape and patterns at different levels are desirable: multi-dimensional regression analysis
- Proposal
  - A scalable multi-dimensional stream “data cube” that can aggregate regression models of stream data efficiently without accessing the raw data
- Stream data compression
  - Compress the stream data to support memory- and time-efficient multi-dimensional regression analysis

Regression Cube for Time-Series
- Initially, one time-series per base cell
  - Too costly to store all these time-series
  - Too costly to compute regression in multi-dimensional space
- Regression cube
  - Base cube: only store regression parameters of base cells (e.g., 2 points vs. 1000 points)
  - All the upper-level cuboids can be computed precisely for linear regression on both standard dimensions and time dimensions
  - For quadratic regression, we need 5 points
  - In general, the number of points needed is a function of k (the formula is missing here), where k = 2 for quadratic

Basics of General Linear Regression
- n tuples in one cell: (x_i, y_i), i = 1..n, where y_i is the measure attribute to be analyzed
- For sample i, u_i is a vector of k user-defined predictors
- The linear regression model has η, a k × 1 vector of regression parameters
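
One standard way to write the model described on this slide is sketched below. It assumes predictors indexed from 0 to k-1, as on the following slide. The least-squares estimate in the last line is the textbook normal-equation solution and is included only for orientation; it is not stated on the slide.

```latex
\mathbf{u}_i = \bigl(u_0(\mathbf{x}_i),\, u_1(\mathbf{x}_i),\, \dots,\, u_{k-1}(\mathbf{x}_i)\bigr)^{T}

E(y_i \mid \mathbf{u}_i) = \boldsymbol{\eta}^{T}\mathbf{u}_i
  = \eta_0\,u_0(\mathbf{x}_i) + \dots + \eta_{k-1}\,u_{k-1}(\mathbf{x}_i)

\hat{\boldsymbol{\eta}} = \Bigl(\sum_{i=1}^{n}\mathbf{u}_i\mathbf{u}_i^{T}\Bigr)^{-1}
  \sum_{i=1}^{n}\mathbf{u}_i\,y_i
```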

Linearly Compressed Representation (LCR)
- Stream data compression for multi-dimensional regression analysis
- The LCR of one cell is built from quantities defined for i, j = 0, …, k-1
- Size of the LCR of one cell: quadratic in k, independent of the number of tuples n in the cell
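
The sketch below shows one plausible reading of such a compressed representation. It is an assumption, not a definition taken from the slide: the cell keeps the least-squares estimate `eta_hat` and the matrix `theta` with `theta[i][j]` equal to the sum over tuples of `u_i(x) * u_j(x)`. Under that assumption, storage is k values plus the k(k+1)/2 distinct entries of a symmetric k × k matrix, which is quadratic in k and independent of n. The function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def build_lcr(tuples, predictors):
    """Compress one cell into (eta_hat, theta) in a single pass.

    tuples:     iterable of (x, y) pairs
    predictors: list of k functions u_j(x) (assumed form of the predictors)
    """
    k = len(predictors)
    theta = [[0.0] * k for _ in range(k)]  # theta[i][j] = sum of u_i * u_j
    uty = [0.0] * k                        # sum of u_i * y
    for x, y in tuples:
        u = [f(x) for f in predictors]
        for i in range(k):
            uty[i] += u[i] * y
            for j in range(k):
                theta[i][j] += u[i] * u[j]
    eta_hat = solve(theta, uty)            # normal equations: theta * eta = U^T y
    return eta_hat, theta


# 1000 points on the line y = 3 + 2t, with predictors (1, t).
predictors = [lambda t: 1.0, lambda t: float(t)]
eta_hat, theta = build_lcr(((t, 3.0 + 2.0 * t) for t in range(1000)), predictors)
print(eta_hat)
```

The 1000 input tuples are never stored; only `eta_hat` and `theta` survive the pass.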

Stock Price Example: Aggregation in Standard Dimensions
- Simple linear regression on time series data
- [Figures: cells of two companies; the cell after aggregation]
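
The slide's own numbers are not available here, so the following sketch uses made-up series. It assumes that aggregating two companies means summing their measures at each time tick. Because least-squares fitting is linear in y, the regression line of the summed series should then equal the sum of the two individual lines, so the aggregated cell can be derived from the stored parameters alone. `fit_line` is a hypothetical helper.

```python
def fit_line(ts, ys):
    """Least-squares fit y = base + slope * t; returns (base, slope)."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


ticks = list(range(10))
# Synthetic series for two companies (illustrative values only).
company_a = [10.0 + 0.5 * t + (0.3 if t % 2 else -0.3) for t in ticks]
company_b = [20.0 - 0.2 * t + (0.1 if t % 3 else -0.2) for t in ticks]

base_a, slope_a = fit_line(ticks, company_a)
base_b, slope_b = fit_line(ticks, company_b)

# Aggregated cell: fit the summed series directly, for comparison.
total = [a + b for a, b in zip(company_a, company_b)]
base_sum, slope_sum = fit_line(ticks, total)

print(base_a + base_b, base_sum)
print(slope_a + slope_b, slope_sum)
```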

Stock Price Example: Aggregation in Time Dimension
- [Figures: cells of two adjacent time intervals; the cell after aggregation]
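
For two adjacent time intervals the parameters cannot simply be added, but the merged fit can still be obtained without the raw points. The sketch below assumes each cell stores its estimate `eta` for the model y = eta[0] + eta[1] * t together with the 2 × 2 matrix `theta` of predictor cross-products. It relies on the normal equations (theta * eta = U^T y), so the right-hand sides of disjoint intervals add up. All names and data are illustrative.

```python
def solve2(a, b):
    """Solve a 2x2 linear system with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def matvec(a, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [a[0][0] * v[0] + a[0][1] * v[1],
            a[1][0] * v[0] + a[1][1] * v[1]]


def cell_summary(ts, ys):
    """Compress one interval into (eta, theta)."""
    n, st, stt = len(ts), sum(ts), sum(t * t for t in ts)
    sy, sty = sum(ys), sum(t * y for t, y in zip(ts, ys))
    theta = [[n, st], [st, stt]]
    return solve2(theta, [sy, sty]), theta


def merge_time(cell1, cell2):
    """Combine summaries of two disjoint time intervals."""
    (eta1, th1), (eta2, th2) = cell1, cell2
    theta = [[th1[i][j] + th2[i][j] for j in range(2)] for i in range(2)]
    r1, r2 = matvec(th1, eta1), matvec(th2, eta2)  # recovers U^T y per cell
    return solve2(theta, [r1[0] + r2[0], r1[1] + r2[1]]), theta


ts1, ts2 = list(range(0, 10)), list(range(10, 20))
ys1 = [5.0 + 0.4 * t + (0.2 if t % 2 else -0.1) for t in ts1]
ys2 = [7.0 + 0.1 * t + (0.3 if t % 3 else -0.2) for t in ts2]

merged_eta, _ = merge_time(cell_summary(ts1, ys1), cell_summary(ts2, ys2))
direct_eta, _ = cell_summary(ts1 + ts2, ys1 + ys2)
print(merged_eta, direct_eta)
```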

A Stream Cube Architecture
- A tilted time frame
  - Different time granularities: second, minute, quarter, hour, day, week, …
- Critical layers
  - Minimum interest layer (m-layer)
  - Observation layer (o-layer)
  - User: watches at the o-layer and occasionally needs to drill down to the m-layer
- Partial materialization of stream cubes
  - Full materialization: too space- and time-consuming
  - No materialization: slow response at query time
  - Partial materialization: what do we mean by “partial”?

A Tilted Time-Frame Model
- [Figure: three time axes ending at "Now"]
  - Up to 7 days: segments labelled 7 days, 24 hrs, 4 qtrs, 15 minutes, 25 sec.
  - Up to a year: segments labelled 12 months, 31 days, 24 hours, 4 qtrs
  - Logarithmic (exponential) scale: segments labelled 16t, 8t, 4t, 4t, 2t, 1t
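
The "up to a year" frame can be sketched as a cascade of slot lists in which a full level is rolled up into one slot of the next coarser level. This is a simplified illustration under stated assumptions: the measure is a plain sum, a level is cleared when it is promoted, and the class and method names are invented for this example.

```python
class TiltedTimeFrame:
    """Keeps recent data at fine granularity and older data at coarser granularity."""

    # (level name, slots accumulated before rolling up into the next level)
    LEVELS = [("quarter", 4), ("hour", 24), ("day", 31), ("month", 12)]

    def __init__(self):
        self.slots = [[] for _ in self.LEVELS]  # newest value last at each level

    def add_quarter(self, value):
        """Record the aggregate for one finished 15-minute quarter."""
        self._push(0, value)

    def _push(self, level, value):
        slots = self.slots[level]
        slots.append(value)
        capacity = self.LEVELS[level][1]
        if len(slots) < capacity:
            return
        if level + 1 < len(self.LEVELS):
            total = sum(slots)      # roll the full level up into one coarser slot
            slots.clear()
            self._push(level + 1, total)
        elif len(slots) > capacity:
            slots.pop(0)            # coarsest level: forget the oldest slot

    def slot_count(self):
        return sum(len(s) for s in self.slots)


frame = TiltedTimeFrame()
for _ in range(197):                # two days, one hour and one quarter of data
    frame.add_quarter(1)
print(frame.slots, frame.slot_count())
```

After 197 quarters the frame holds only four numbers, while the total count is still recoverable as the sum of all slots.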

Two Critical Layers in the Stream Cube
- o-layer (observation): (*, theme, quarter)
- m-layer (minimal interest): (user-group, URL-group, minute)
- (primitive) stream data layer: (individual-user, URL, second)
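
The relationship between the three layers can be sketched as two roll-up mappings applied to each incoming tuple. The concept hierarchies below (`USER_GROUP`, `URL_GROUP`) and the click data are hypothetical and exist only to make the example runnable. The measure is a simple click count.

```python
from collections import Counter

# Hypothetical concept hierarchies (not from the slides).
USER_GROUP = {"jeff": "uiuc", "mary": "uiuc", "jim": "uic"}
URL_GROUP = {
    "/nba/scores": "sports/nba",
    "/nfl/news": "sports/nfl",
    "/senate": "politics/us",
}


def to_m_layer(user, url, second):
    """(individual-user, URL, second) -> (user-group, URL-group, minute)."""
    return (USER_GROUP[user], URL_GROUP[url], second // 60)


def to_o_layer(m_cell):
    """(user-group, URL-group, minute) -> (*, theme, quarter)."""
    _user_group, url_group, minute = m_cell
    theme = url_group.split("/")[0]
    return ("*", theme, minute // 15)


def roll_up(clicks):
    """Aggregate raw click tuples into m-layer and o-layer counts."""
    m_layer, o_layer = Counter(), Counter()
    for user, url, second in clicks:  # raw tuples are consumed once, never stored
        m_cell = to_m_layer(user, url, second)
        m_layer[m_cell] += 1
        o_layer[to_o_layer(m_cell)] += 1
    return m_layer, o_layer


clicks = [
    ("jeff", "/nba/scores", 5),
    ("mary", "/nba/scores", 42),
    ("jim", "/nfl/news", 61),
    ("jeff", "/senate", 950),
]
m_layer, o_layer = roll_up(clicks)
print(dict(m_layer))
print(dict(o_layer))
```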

On-Line Materialization vs. On-Line Computation
- Materialization takes precious resources and time
  - Only incremental materialization (with a sliding window)
  - Only materialize “cuboids” of the critical layers?
    - Some intermediate cells should be materialized
  - Popular path approach vs. exception cell approach
    - Materialize intermediate cells along the popular paths
    - Exception cells: how to set up exception thresholds?
    - Notice that exceptions do not have monotonic behaviour
- Computation problem
  - How to compute and store stream cubes efficiently?
  - How to discover unusual cells between the critical layers?

Stream Cube Structure: from m-layer to o-layer
- [Figure: lattice of cells between the two layers, including (A1, *, C1), (A1, *, C2), (A2, *, C2), (A1, B1, C2), (A1, B2, C1), (A1, B2, C2), (A2, B1, C1), (A2, B1, C2), (A2, B2, C1), (A2, B2, C2)]

Stream Cube Computation
- Cube structure from m-layer to o-layer
- Three approaches
  - All cuboids approach
    - Materializing all cells (too much in both space and time)
  - Exceptional cells approach
    - Materializing only exceptional cells (saves space but not computation time, and the definition of exception is not flexible)
  - Popular path approach
    - Computing and materializing cells only along a popular path
    - Using an H-tree structure to store computed cells (which form the stream cube, a selectively materialized cube)
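
The popular path idea can be sketched as a chain of roll-ups in which each cuboid is computed from its predecessor rather than from the base data. This is a simplified illustration: each dimension has a single level that generalizes straight to `*`, the measure is a sum, and the function names are hypothetical.

```python
from collections import defaultdict


def roll_up(cuboid, dim):
    """Aggregate a cuboid by replacing dimension `dim` with '*'."""
    result = defaultdict(float)
    for cell, measure in cuboid.items():
        key = cell[:dim] + ("*",) + cell[dim + 1:]
        result[key] += measure
    return dict(result)


def materialize_popular_path(m_layer, path):
    """Compute only the cuboids along one roll-up path, each from its predecessor."""
    cuboids = [m_layer]
    for dim in path:
        cuboids.append(roll_up(cuboids[-1], dim))
    return cuboids


m_layer = {
    ("A1", "B1", "C1"): 1.0,
    ("A1", "B2", "C1"): 2.0,
    ("A2", "B1", "C1"): 3.0,
    ("A1", "B1", "C2"): 4.0,
}
# Popular path: generalize B first, then A.
for cuboid in materialize_popular_path(m_layer, path=[1, 0]):
    print(cuboid)
```

Cuboids off the path, such as (*, B, C) here, are not stored and would have to be computed on demand from a materialized one.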

An H-Tree Cubing Structure
- [Figure: H-tree with a root; observation-layer nodes entertainment, sports, politics; child nodes uic and uiuc; minimal-interest-layer nodes jeff, Jim, mary; nodes carry quant-info (Q.I.) such as regression, sum (xxxx) and count (yyyy)]

Partial Materialization Using H-Tree
- Introduced for computing data cubes and iceberg cubes
  - J. Han, J. Pei, G. Dong, and K. Wang, “Efficient Computation of Iceberg Cubes with Complex Measures”, SIGMOD'01
- Compressed database, fast cubing, and space preserving in cube computation
- Using H-tree for partial stream cubing
  - Space preserving: intermediate aggregates can be computed incrementally and saved in tree nodes
  - Facilitates computing other cells and multi-dimensional analysis
  - An H-tree with computed cells can be viewed as a “stream cube”
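
The idea of saving intermediate aggregates in tree nodes can be sketched with a plain prefix tree whose levels follow the popular path. This is not the published H-tree: any header tables or auxiliary links that structure may use are omitted, and the names `HTreeNode`, `insert` and `lookup` are made up for this example. Only a count and a sum are kept per node.

```python
class HTreeNode:
    """Tree node holding the aggregates of all tuples that pass through it."""

    def __init__(self):
        self.children = {}
        self.count = 0
        self.total = 0.0


def insert(root, path, value):
    """Add `value` along `path`, updating the aggregates at every node on the way."""
    node = root
    node.count += 1
    node.total += value
    for label in path:
        node = node.children.setdefault(label, HTreeNode())
        node.count += 1
        node.total += value


def lookup(root, prefix):
    """Return (count, sum) of the cell identified by a path prefix."""
    node = root
    for label in prefix:
        node = node.children[label]
    return node.count, node.total


root = HTreeNode()
# Path order (theme, user-group, user) is an assumed popular path.
insert(root, ("sports", "uiuc", "jeff"), 2.0)
insert(root, ("sports", "uiuc", "mary"), 3.0)
insert(root, ("sports", "uic", "jim"), 1.0)
insert(root, ("politics", "uiuc", "jeff"), 4.0)

print(lookup(root, ("sports",)))
print(lookup(root, ("sports", "uiuc")))
```

Each insert updates every ancestor incrementally, so cells along the path are always available without rescanning the stream.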

Time and Space vs. Number of Tuples at the m-Layer (Dataset D3L3C10T400K)
- [Figures: a) Time vs. m-layer size; b) Space vs. m-layer size]

Time and Space vs. the Number of Levels
- [Figures: a) Time vs. # levels; b) Space vs. # levels]

Mining Other Unusual Patterns in Stream Data
- Clustering and outlier analysis for stream mining
  - Clustering data streams (Guha, Motwani et al. 2000-2002)
  - History-sensitive, high-quality incremental clustering
- Classification of stream data
  - Evolution of decision trees: Domingos et al. (2000, 2001)
  - Incremental integration of new streams in decision-tree induction
- Frequent pattern analysis
  - Approximate frequent patterns (Manku & Motwani VLDB’02)
  - Evolution and dramatic changes of frequent patterns
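
As an illustration of approximate frequent pattern counting, the sketch below implements lossy counting, one of the algorithms associated with the Manku & Motwani line of work. It is a simplification under stated assumptions: it counts single items rather than itemsets, and the function names and test stream are invented for this example.

```python
import math


def lossy_count(stream, epsilon):
    """One-pass approximate counts; each count is low by at most epsilon * n."""
    width = math.ceil(1 / epsilon)      # bucket width
    entries = {}                        # item -> [count, max possible undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)   # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            entries[item] = [1, bucket - 1]
        if n % width == 0:              # bucket boundary: prune rare items
            entries = {k: v for k, v in entries.items() if v[0] + v[1] > bucket}
    return entries, n


def frequent_items(entries, n, support, epsilon):
    """Items whose true frequency may reach `support`; no such item is missed."""
    return {k for k, (count, _delta) in entries.items()
            if count >= (support - epsilon) * n}


def make_stream():
    # Synthetic stream: "a" is 50% of items, "b" is 30%, the rest are unique.
    for i in range(100):
        if i % 2 == 0:
            yield "a"
        elif i % 10 in (1, 3, 5):
            yield "b"
        else:
            yield f"x{i}"


entries, n = lossy_count(make_stream(), epsilon=0.1)
print(frequent_items(entries, n, support=0.25, epsilon=0.1))
```

The unique items are pruned at bucket boundaries, so the table stays small while the heavy items keep their counts.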

Conclusions
- Stream data mining: a rich and largely unexplored field
  - Current research focus in the database community: DSMS system architecture, continuous query processing, supporting mechanisms
- Stream data mining and stream OLAP analysis
  - Powerful tools for finding general and unusual patterns
  - Effectiveness, efficiency and scalability: lots of open problems
- Our philosophy: a multi-dimensional stream analysis framework
  - Time is a special dimension: tilted time frame
  - What to compute and what to save? Critical layers
  - Very partial materialization/precomputation: popular path approach
  - Mining dynamics of stream data

References
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems”, PODS'02 (tutorial).
- S. Babu and J. Widom, “Continuous queries over data streams”, SIGMOD Record, 30:109-120, 2001.
- Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang, “Online analytical processing stream data: Is it feasible?”, DMKD'02.
- Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional regression analysis of time-series data streams”, VLDB'02.
- P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00.
- M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and mining data streams: You only get one look”, SIGMOD'02 (tutorial).
- J. Gehrke, F. Korn, and D. Srivastava, “On computing correlated aggregates over continuous data streams”, SIGMOD'01.
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams”, FOCS'00.
- G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams”, KDD'01.

www.cs.uiuc.edu/~hanj
Thank you!