Published by Earl Norton. Modified over 6 years ago.
1
Mining Unusual Patterns in Data Streams in Multi-Dimensional Space
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
November 21, 2018
2
Outline
- Characteristics of data streams
- Mining unusual patterns in data streams
- Multi-dimensional regression analysis of data streams
- Stream cubing and stream OLAP methods
- Mining other kinds of patterns in data streams
- Research problems
November 21, 2018 Mining Unusual Patterns in Data Streams
3
Characteristics of Data Streams
- Data streams: continuous, ordered, changing, fast, and huge in volume
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics
  - Huge volumes of continuous data, possibly infinite
  - Fast changing, requiring fast, real-time response
  - The data stream model captures many of today's data processing needs
  - Random access is expensive, so algorithms make a single linear scan (only one look at the data)
  - Only a summary of the data seen so far is stored
  - Most stream data are low-level or multi-dimensional in nature and need multi-level, multi-dimensional processing
4
Stream Data Applications
- Telecommunication calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering and industrial processes: power supply and manufacturing
- Sensor, monitoring, and surveillance: video streams
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even when saved, random access is too expensive)
5
Challenges of Stream Data Mining
- Multiple, continuous, rapid, time-varying, ordered streams
- Main-memory computation
- Mining queries are either continuous or ad hoc
- Mining queries are often complex
  - They involve multiple streams, large amounts of data, and history
  - They look for patterns, models, anomalies, differences, …
- Mining the dynamics (changes, trends, and evolution) of data streams
- Multi-level/multi-dimensional processing and data mining
  - Most stream data are low-level or multi-dimensional in nature
6
Stream Data Mining Tasks
- Multi-dimensional (on-line) analysis of streams
- Clustering data streams
- Classification of data streams
- Mining frequent patterns in data streams
- Mining sequential patterns in data streams
- Mining partial periodicity in data streams
- Mining notable gradients in data streams
- Mining outliers and unusual patterns in data streams
- …
7
Multi-Dimensional Stream Analysis: Examples
- Analysis of Web click streams
  - Raw data at low levels: seconds, Web page addresses, user IP addresses, …
  - Analysts want changes, trends, and unusual patterns at reasonable levels of detail
  - E.g., “Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours.”
- Analysis of power consumption streams
  - Raw data: power consumption flow for every household, every minute
  - Patterns one may find: average hourly power consumption for manufacturing companies in Chicago over the last 2 hours today is 30% higher than on the same day a week ago
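The click-stream example compares a short recent window with a longer one. A minimal Python sketch of that comparison follows. It assumes the stream has already been reduced to one count per minute for a single cell (say, North America and sports); the class name `WindowAverage`, the function `surge`, and the 40% threshold default are illustrative choices, not part of the original slides.

```python
from collections import deque


class WindowAverage:
    """Running average over the most recent `size` per-minute counts."""

    def __init__(self, size):
        self.size = size
        self.values = deque()
        self.total = 0.0

    def add(self, value):
        # Add the newest count and evict the oldest once the window is full.
        self.values.append(value)
        self.total += value
        if len(self.values) > self.size:
            self.total -= self.values.popleft()

    def average(self):
        return self.total / len(self.values) if self.values else 0.0


def surge(short, long, threshold=0.4):
    """True if the short-window average exceeds the long-window one by `threshold`."""
    base = long.average()
    return base > 0 and short.average() >= base * (1 + threshold)


last_15_min = WindowAverage(15)
last_24_h = WindowAverage(24 * 60)
for clicks in [100] * 1425 + [200] * 15:  # hypothetical per-minute counts
    last_15_min.add(clicks)
    last_24_h.add(clicks)
print(surge(last_15_min, last_24_h))
```

The sketch keeps every per-minute count of the long window in memory, which is acceptable for one cell but is exactly what the later slides try to avoid across many cells.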
8
A Key Step—Stream Data Reduction
- Challenges of OLAPing stream data
  - Raw data cannot be stored
  - Simple aggregates are not powerful enough
  - History shape and patterns at different levels are desirable: multi-dimensional regression analysis
- Proposal
  - A scalable multi-dimensional stream “data cube” that can aggregate regression models of stream data efficiently without accessing the raw data
- Stream data compression
  - Compress the stream data to support memory- and time-efficient multi-dimensional regression analysis
9
Regression Cube for Time-Series
- Initially, one time series per base cell
  - Too costly to store all these time series
  - Too costly to compute regression in multi-dimensional space
- Regression cube
  - Base cube: store only the regression parameters of base cells (e.g., 2 points rather than every point of the series)
  - All upper-level cuboids can be computed precisely for linear regression on both standard dimensions and time dimensions
  - For quadratic regression, we need 5 points
  - In general, the number needed is a function of k, where k = 2 for quadratic (the formula itself is not preserved in this transcript)
10
Basics of General Linear Regression
- n tuples in one cell: $(x_i, y_i)$, $i = 1..n$, where $y_i$ is the measure attribute to be analyzed
- For sample i, a vector of k user-defined predictors: $u_i = (u_{i0}, \ldots, u_{i,k-1})^T$
- The linear regression model: $E(y_i \mid u_i) = \eta^T u_i$, where $\eta$ is a $k \times 1$ vector of regression parameters
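To make the general linear regression setup concrete, here is a minimal pure-Python sketch that fits the parameter vector for one cell by ordinary least squares through the normal equations. The helper names `solve` and `fit`, and the choice of predictors (1, t, t²), are assumptions made for illustration; the slides do not prescribe a particular solver or predictor set.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit(U, y):
    """Least-squares estimate of eta for predictor rows U and measures y."""
    k = len(U[0])
    # Normal equations: (U^T U) eta = U^T y
    utu = [[sum(u[i] * u[j] for u in U) for j in range(k)] for i in range(k)]
    uty = [sum(u[i] * yi for u, yi in zip(U, y)) for i in range(k)]
    return solve(utu, uty)


# One cell with n = 6 tuples; predictors u_i = (1, t, t^2) are user-defined.
ts = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
U = [[1.0, t, t * t] for t in ts]
y = [1.0 + 2.0 * t + 0.5 * t * t for t in ts]
print(fit(U, y))
```

Because the model is linear in the parameters, a quadratic trend in time is still handled by the same machinery: only the predictor vector changes.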
11
Linearly Compressed Representation (LCR)
- Stream data compression for multi-dimensional regression analysis
- Define, for i, j = 0, …, k-1: (definition not preserved in this transcript)
- The linearly compressed representation (LCR) of one cell: (formula not preserved in this transcript)
- Size of the LCR of one cell: quadratic in k, independent of the number of tuples n in the cell
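The formulas on this slide did not survive, so the following Python sketch rests on a stated assumption: that the compressed form of a cell keeps the fitted parameters together with the sums of predictor products, $\theta_{ij} = \sum_h u_{hi} u_{hj}$ for $i \le j$. That choice is consistent with the size claim (quadratic in k, independent of n) but should be read as an illustration, not as the slide's exact definition. The function name `lcr` and the restriction to the two-predictor model y = η0 + η1·t are also illustrative.

```python
def lcr(ts, ys):
    """Compressed summary of one cell for the model y = eta0 + eta1 * t.

    Returns (eta, theta): the fitted parameters and the upper triangle of
    U^T U for predictor rows u = (1, t). This layout is an assumption.
    """
    n = len(ts)
    s_t = sum(ts)
    s_tt = sum(t * t for t in ts)
    s_y = sum(ys)
    s_ty = sum(t * y for t, y in zip(ts, ys))
    det = n * s_tt - s_t * s_t
    eta1 = (n * s_ty - s_t * s_y) / det
    eta0 = (s_y - eta1 * s_t) / n
    theta = (float(n), float(s_t), float(s_tt))  # theta_00, theta_01, theta_11
    return (eta0, eta1), theta


small = lcr([0, 1, 2, 3], [2.0, 4.1, 5.9, 8.0])
large = lcr(list(range(1000)), [2.0 + 2.0 * t for t in range(1000)])
# Both summaries hold k + k(k+1)/2 = 5 numbers for k = 2, whatever n is.
print(len(small[0]) + len(small[1]), len(large[0]) + len(large[1]))
```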
12
Stock Price Example—Aggregation in Standard Dimensions
Simple linear regression on time-series data. Figure: the cells of two companies, and the resulting cell after aggregation.
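The figure for this slide is lost, so here is a hedged Python sketch of what aggregation over a standard dimension can look like. It assumes both cells are observed at the same time ticks and that the aggregated measure is the sum of the two series; under those assumptions the least-squares parameters simply add. The prices, the cell layout, and the names `fit_line` and `aggregate_standard` are hypothetical.

```python
def fit_line(ts, ys):
    """Least-squares (intercept, slope) for y = eta0 + eta1 * t."""
    n = len(ts)
    s_t, s_y = sum(ts), sum(ys)
    s_tt = sum(t * t for t in ts)
    s_ty = sum(t * y for t, y in zip(ts, ys))
    eta1 = (n * s_ty - s_t * s_y) / (n * s_tt - s_t * s_t)
    return ((s_y - eta1 * s_t) / n, eta1)


def aggregate_standard(etas):
    """Combine cells that share time ticks when the measure is summed."""
    return (sum(e[0] for e in etas), sum(e[1] for e in etas))


ts = [0, 1, 2, 3]
company_a = [10.0, 12.0, 13.0, 15.0]  # hypothetical prices
company_b = [5.0, 4.0, 6.0, 7.0]

# Aggregate from the stored parameters only, without the raw series.
from_cells = aggregate_standard([fit_line(ts, company_a), fit_line(ts, company_b)])
# Reference: refit on the summed raw series.
direct = fit_line(ts, [a + b for a, b in zip(company_a, company_b)])
print(from_cells, direct)
```

The two results agree up to rounding, which is the sense in which upper-level cells can be computed precisely from base-cell parameters.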
13
Stock Price Example—Aggregation in Time Dimension
Figure: the cells of two adjacent time intervals, and the resulting cell after aggregation.
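Aggregating along time is less direct than along a standard dimension, because the two cells cover different ticks. The Python sketch below assumes each cell stores its fitted parameters plus the sums n, Σt, Σt² (an assumed summary layout, with t measured on one shared clock). The merged fit then follows from the normal equations: the sums add, and each cell's right-hand side is recovered as its sum matrix times its parameters. All names and data are hypothetical.

```python
def summarize(ts, ys):
    """Return ((eta0, eta1), (n, sum_t, sum_tt)) for y = eta0 + eta1 * t."""
    n = len(ts)
    s_t, s_y = sum(ts), sum(ys)
    s_tt = sum(t * t for t in ts)
    s_ty = sum(t * y for t, y in zip(ts, ys))
    eta1 = (n * s_ty - s_t * s_y) / (n * s_tt - s_t * s_t)
    return ((s_y - eta1 * s_t) / n, eta1), (float(n), float(s_t), float(s_tt))


def aggregate_time(cells):
    """Merge cells from disjoint time intervals using only their summaries."""
    n = sum(th[0] for _, th in cells)
    s_t = sum(th[1] for _, th in cells)
    s_tt = sum(th[2] for _, th in cells)
    # Each cell's U^T y equals (U^T U) eta, so it can be rebuilt and summed.
    r0 = sum(th[0] * e[0] + th[1] * e[1] for e, th in cells)
    r1 = sum(th[1] * e[0] + th[2] * e[1] for e, th in cells)
    det = n * s_tt - s_t * s_t
    eta0 = (s_tt * r0 - s_t * r1) / det
    eta1 = (n * r1 - s_t * r0) / det
    return (eta0, eta1), (n, s_t, s_tt)


first = summarize([0, 1, 2, 3], [1.0, 3.0, 2.0, 5.0])
second = summarize([4, 5, 6, 7], [6.0, 5.0, 8.0, 9.0])
merged = aggregate_time([first, second])
direct = summarize(list(range(8)), [1.0, 3.0, 2.0, 5.0, 6.0, 5.0, 8.0, 9.0])
print(merged[0], direct[0])
```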
14
A Stream Cube Architecture
- A tilted time frame
  - Different time granularities: second, minute, quarter, hour, day, week, …
- Critical layers
  - Minimum interest layer (m-layer)
  - Observation layer (o-layer)
  - User: watches at the o-layer and occasionally needs to drill down to the m-layer
- Partial materialization of stream cubes
  - Full materialization: too space- and time-consuming
  - No materialization: slow response at query time
  - Partial materialization: what do we mean by “partial”?
15
A Tilted Time-Frame Model
Figure: three tilted time-frame layouts, each drawn on a time axis ending at “Now”.
- Up to 7 days: 7 days, 24 hrs, 4 qtrs, 15 minutes (the figure also carries a “25sec.” label)
- Up to a year: 12 months, 31 days, 24 hours, 4 qtrs
- Logarithmic (exponential) scale: 16t, 8t, 4t, 4t, 2t, 1t
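A minimal Python sketch of the “up to a year” layout follows: recent data is kept at quarter-hour granularity and older data is rolled up into hours, days, and months. It assumes the measure is a simple sum and, as in the figure, treats a month as 31 days; the class name `TiltedTimeFrame` and its methods are illustrative.

```python
from collections import deque


class TiltedTimeFrame:
    """Keep fine-grained slots for recent time and coarse slots for older time."""

    # (level name, slots kept, finer units that make up one slot of this level)
    LEVELS = [("quarter", 4, None), ("hour", 24, 4), ("day", 31, 24), ("month", 12, 31)]

    def __init__(self):
        self.slots = [deque(maxlen=cap) for _, cap, _ in self.LEVELS]
        # pending[i] collects finer-level values until one slot of level i is complete
        self.pending = [[] for _ in self.LEVELS]

    def add_quarter(self, value):
        self._add(0, value)

    def _add(self, level, value):
        self.slots[level].append(value)  # the oldest slot falls off when full
        nxt = level + 1
        if nxt < len(self.LEVELS):
            self.pending[nxt].append(value)
            if len(self.pending[nxt]) == self.LEVELS[nxt][2]:
                total = sum(self.pending[nxt])
                self.pending[nxt].clear()
                self._add(nxt, total)

    def snapshot(self):
        return {name: list(s) for (name, _, _), s in zip(self.LEVELS, self.slots)}


frame = TiltedTimeFrame()
for _ in range(4 * 24 * 2):  # two days of quarter-hour counts, each equal to 1
    frame.add_quarter(1.0)
print({name: len(vals) for name, vals in frame.snapshot().items()})
```

At most 4 + 24 + 31 + 12 = 71 slots are ever held per cell, compared with roughly 35,000 quarter-hour values for a full year.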
16
Two Critical Layers in the Stream Cube
- o-layer (observation): (*, theme, quarter)
- m-layer (minimal interest): (user-group, URL-group, minute)
- (primitive) stream data layer: (individual-user, URL, second)
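The three layers can be illustrated with a short Python sketch that rolls primitive click records up to the m-layer and then to the o-layer. The concept hierarchies (`USER_GROUP`, `URL_GROUP`, `THEME`), the sample records, and the use of a click count as the measure are all hypothetical; only the layer definitions come from the slide.

```python
from collections import Counter

# Hypothetical concept hierarchies, made up for this example.
USER_GROUP = {"jeff": "uiuc", "mary": "uiuc", "jim": "uic"}
URL_GROUP = {"/nba/scores": "nba", "/nfl/news": "nfl", "/senate/vote": "congress"}
THEME = {"nba": "sports", "nfl": "sports", "congress": "politics"}


def to_m_layer(records):
    """(individual-user, URL, second) -> (user-group, URL-group, minute)."""
    cells = Counter()
    for user, url, second, clicks in records:
        cells[(USER_GROUP[user], URL_GROUP[url], second // 60)] += clicks
    return cells


def to_o_layer(m_cells):
    """(user-group, URL-group, minute) -> (*, theme, quarter)."""
    cells = Counter()
    for (_group, url_group, minute), clicks in m_cells.items():
        cells[("*", THEME[url_group], minute // 15)] += clicks
    return cells


records = [
    ("jeff", "/nba/scores", 5, 1),
    ("mary", "/nfl/news", 70, 2),
    ("jim", "/senate/vote", 100, 1),
    ("jeff", "/nba/scores", 59, 3),
]
m_cells = to_m_layer(records)
o_cells = to_o_layer(m_cells)
print(dict(m_cells))
print(dict(o_cells))
```

Note that the o-layer is computed from the m-layer, not from the raw records, which is why the primitive layer never needs to be stored.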
17
On-Line Materialization vs. On-Line Computation
- Materialization takes precious resources and time
  - Only incremental materialization (with a sliding window)
  - Only materialize “cuboids” of the critical layers?
    - Some intermediate cells should also be materialized
  - Popular path approach vs. exception cell approach
    - Materialize intermediate cells along the popular paths
    - Exception cells: how to set up exception thresholds?
    - Notice that exceptions do not have monotonic behaviour
- Computation problem
  - How to compute and store stream cubes efficiently?
  - How to discover unusual cells between the critical layers?
18
Stream Cube Structure: from m-layer to o-layer
Figure: lattice of cells between the m-layer and the o-layer. Cells shown with a wildcard: (A1, *, C1), (A1, *, C2), (A2, *, C2). Fully specified cells shown: (A1, B1, C2), (A1, B2, C1), (A1, B2, C2), (A2, B1, C1), (A2, B1, C2), (A2, B2, C1), (A2, B2, C2).
19
Stream Cube Computation
- Cube structure from m-layer to o-layer
- Three approaches
  - All cuboids approach
    - Materialize all cells (too costly in both space and time)
  - Exceptional cells approach
    - Materialize only exceptional cells (saves space but not computation time, and the definition of an exception is not flexible)
  - Popular path approach
    - Compute and materialize cells only along a popular path
    - Use an H-tree structure to store the computed cells (which form the stream cube, a selectively materialized cube)
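The popular path approach can be sketched in a few lines of Python: starting from the m-layer cuboid, roll up one dimension at a time in a fixed order and keep only the cuboids met along the way. The three-dimensional cells, the sum measure, and the names `roll_up` and `materialize_popular_path` are assumptions made for illustration; how the path itself is chosen is outside this sketch.

```python
from collections import defaultdict


def roll_up(cuboid, dim_index):
    """Aggregate one dimension away (replace it with '*'), summing the measure."""
    out = defaultdict(float)
    for cell, measure in cuboid.items():
        key = cell[:dim_index] + ("*",) + cell[dim_index + 1:]
        out[key] += measure
    return dict(out)


def materialize_popular_path(m_layer, path):
    """Materialize only the cuboids along one roll-up order."""
    cuboids = [m_layer]
    for dim_index in path:
        cuboids.append(roll_up(cuboids[-1], dim_index))
    return cuboids


m_layer = {
    ("A1", "B1", "C1"): 1.0,
    ("A1", "B2", "C1"): 2.0,
    ("A2", "B1", "C1"): 3.0,
    ("A1", "B1", "C2"): 4.0,
}
# Hypothetical popular path: drop B first, then C.
cuboids = materialize_popular_path(m_layer, path=[1, 2])
for cuboid in cuboids:
    print(cuboid)
```

With three dimensions the full lattice has 8 cuboids, while this path materializes 3; cells off the path would have to be computed at query time from the nearest materialized cuboid.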
20
An H-Tree Cubing Structure
Figure: an H-tree. Below the root, the observation layer has the nodes entertainment, sports, and politics; the next level has the nodes uic and uiuc; the minimal interest layer has the nodes jeff, jim, and mary. Each node carries quantitative information (Q.I.): Regression, Sum, Cnt.
21
Partial Materialization Using H-Tree
- Introduced for computing data cubes and iceberg cubes
  - J. Han, J. Pei, G. Dong, and K. Wang, “Efficient Computation of Iceberg Cubes with Complex Measures”, SIGMOD'01
  - Compressed database, fast cubing, and space preserving in cube computation
- Using H-tree for partial stream cubing
  - Space preserving: intermediate aggregates can be computed incrementally and saved in tree nodes
  - Facilitates computing other cells and multi-dimensional analysis
  - An H-tree with computed cells can be viewed as a “stream cube”
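As a rough illustration of saving intermediate aggregates in tree nodes, the Python sketch below builds a prefix tree whose levels follow one roll-up path and whose nodes hold a running sum and count. It is a simplification: the header tables and side links of the published H-tree are omitted, the regression summary is replaced by sum and count, and the names `HNode` and `HTree` and the sample data are hypothetical.

```python
class HNode:
    """One tree node holding the aggregate of all tuples below it."""

    def __init__(self):
        self.children = {}
        self.total = 0.0
        self.count = 0


class HTree:
    def __init__(self):
        self.root = HNode()

    def insert(self, path, measure):
        """Add one tuple; every node on its path is updated incrementally."""
        node = self.root
        node.total += measure
        node.count += 1
        for value in path:
            node = node.children.setdefault(value, HNode())
            node.total += measure
            node.count += 1

    def lookup(self, prefix):
        """Return (sum, count) for the cell named by `prefix`, or None."""
        node = self.root
        for value in prefix:
            node = node.children.get(value)
            if node is None:
                return None
        return node.total, node.count


tree = HTree()
# Levels: theme (observation layer) -> user group -> user (minimal interest layer)
tree.insert(("sports", "uiuc", "jeff"), 3.0)
tree.insert(("sports", "uiuc", "mary"), 2.0)
tree.insert(("politics", "uic", "jim"), 5.0)
print(tree.lookup(("sports",)), tree.lookup(("sports", "uiuc", "jeff")))
```

Each insertion touches one node per level, so cells along the path are always up to date without rescanning earlier tuples.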
22
Time and Space vs. Number of Tuples at the m-Layer
Figure (dataset D3L3C10T400K): (a) time vs. m-layer size; (b) space vs. m-layer size.
23
Time and Space vs. the Number of Levels
Figure: (a) time vs. number of levels; (b) space vs. number of levels.
24
Mining Other Unusual Patterns in Stream Data
- Clustering and outlier analysis for stream mining
  - Clustering data streams (Guha, Motwani et al.)
  - History-sensitive, high-quality incremental clustering
- Classification of stream data
  - Evolution of decision trees: Domingos et al. (2000, 2001)
  - Incremental integration of new streams in decision-tree induction
- Frequent pattern analysis
  - Approximate frequent patterns (Manku & Motwani, VLDB’02)
  - Evolution and dramatic changes of frequent patterns
25
Conclusions
- Stream data mining: a rich and largely unexplored field
  - Current research focus in the database community: DSMS system architecture, continuous query processing, supporting mechanisms
- Stream data mining and stream OLAP analysis
  - Powerful tools for finding general and unusual patterns
  - Effectiveness, efficiency, and scalability: lots of open problems
- Our philosophy: a multi-dimensional stream analysis framework
  - Time is a special dimension: tilted time frame
  - What to compute and what to save? Critical layers
  - Very partial materialization/precomputation: popular path approach
  - Mining dynamics of stream data
26
References
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems”, PODS'02 (tutorial).
- S. Babu and J. Widom, “Continuous queries over data streams”, SIGMOD Record, 30, 2001.
- Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang, “Online analytical processing stream data: Is it feasible?”, DMKD'02.
- Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional regression analysis of time-series data streams”, VLDB'02.
- P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00.
- M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and mining data streams: You only get one look”, SIGMOD'02 (tutorial).
- J. Gehrke, F. Korn, and D. Srivastava, “On computing correlated aggregates over continuous data streams”, SIGMOD'01.
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams”, FOCS'00.
- G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams”, KDD'01.
27
Thank you!