CMU SCS Sensor data mining and forecasting Christos Faloutsos CMU

Presentation transcript:

CMU SCS Sensor data mining and forecasting Christos Faloutsos CMU

CMU SCS, Telcordia 2003, C. Faloutsos

Outline
- Problem definition - motivation
- Linear forecasting - AR and AWSOM
- Coevolving series - MUSCLES
- Fractal forecasting - F4
- Other projects
  - graph modeling, outliers etc.

Problem definition
- Given: one or more sequences $x_1, x_2, \dots, x_t, \dots$ ($y_1, y_2, \dots, y_t, \dots$; ...)
- Find
  - forecasts; patterns
  - clusters; outliers

Motivation - Applications
- Financial, sales, economic series
- Medical
  - ECGs+; blood pressure etc. monitoring
  - reactions to new drugs
  - elderly care

Motivation - Applications (cont’d)
- ‘Smart house’
  - sensors monitor temperature, humidity, air quality
- video surveillance

Motivation - Applications (cont’d)
- civil/automobile infrastructure
  - bridge vibrations [Oppenheim+02]
  - road conditions / traffic monitoring

Stream Data: automobile traffic
[Figure: automobile traffic, # cars vs. time]

Motivation - Applications (cont’d)
- Weather, environment/anti-pollution
  - volcano monitoring
  - air/water pollutant monitoring

Stream Data: Sunspots
[Figure: # sunspots per month vs. time]

Motivation - Applications (cont’d)
- Computer systems
  - ‘Active Disks’ (buffering, prefetching)
  - web servers (ditto)
  - network traffic monitoring
  - ...

Stream Data: Disk accesses
[Figure: # bytes vs. time]

Settings & Applications
- One or more sensors, collecting time-series data
- Each sensor collects data ($x_1, x_2, \dots, x_t, \dots$)
- Sensors ‘report’ to a central site
- Problem #1: Finding patterns in a single time sequence
- Problem #2: Finding patterns in many time sequences

Problem #1
- Goal: given a signal (e.g., #packets over time)
- Find: patterns, periodicities, and/or compress
[Figure: lynx caught per year, count vs. year (similarly: packets per day; temperature per day)]

Problem #1’: Forecast
Given $x_{t-1}, x_{t-2}, \dots$, forecast $x_t$
[Figure: number of packets sent vs. time tick; the next value is marked ‘??’]

Problem #2
- Given: a set of correlated time sequences
- Forecast ‘Sent(t)’

Differences from DSP/Stat
- Semi-infinite streams
  - we need on-line, ‘any-time’ algorithms
- Cannot afford human intervention
  - need automatic methods
- Sensors have limited memory / processing / transmitting power
  - need for (lossy) compression

Important observations
Patterns, rules, compression and forecasting are closely related:
- To do forecasting, we need
  - to find patterns/rules
- Good rules help us compress
- To find outliers, we need to have forecasts
  - (outlier = too far away from our forecast)

Pictorial outline of the talk

Outline: Problem definition - motivation; Linear forecasting (AR, AWSOM); Coevolving series - MUSCLES; Fractal forecasting - F4; Other projects (graph modeling, outliers etc.)

Mini intro to A.R.

Forecasting
"Prediction is very difficult, especially about the future." - Niels Bohr
(source: ...houghts.html)

Problem #1’: Forecast
Example: given $x_{t-1}, x_{t-2}, \dots$, forecast $x_t$
[Figure: number of packets sent vs. time tick; the next value is marked ‘??’]

Forecasting: Preprocessing
MANUALLY:
- remove trends
- spot periodicities
[Figure: series vs. time, with a period of 7 days marked]

Linear Regression: idea
Express what we don’t know (= ‘dependent variable’) as a linear function of what we know (= ‘indep. variable(s)’)
[Figure: body weight vs. body height]

Linear Auto Regression

Problem #1’: Forecast
Solution: try to express $x_t$ as a linear function of the past: $x_{t-1}, x_{t-2}, \dots$ (up to a window of $w$)
Formally: $x_t \approx a_1 x_{t-1} + \dots + a_w x_{t-w} + \text{noise}$
[Figure: values vs. time tick; the next value is marked ‘??’]

Linear Auto Regression
- lag w = 1
- Dependent variable = # of packets sent (S[t])
- Independent variable = # of packets sent (S[t-1])
[Figure: ‘lag-plot’ of number of packets sent (t) vs. number of packets sent (t-1)]

More details:
- Q1: Can it work with window w > 1?
- A1: YES! (we’ll fit a hyper-plane, then!)
[Figure: $x_t$ plotted against $x_{t-1}$ and $x_{t-2}$]
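A minimal sketch of the auto-regression idea above, assuming a plain batch least-squares fit through the normal equations. The helper names (`solve_linear`, `fit_ar`, `forecast_next`) are illustrative and do not come from the talk; the solver is a small pure-Python Gaussian elimination so the example has no dependencies.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if A is singular (e.g. a constant series).
    """
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ar(series, w):
    """Least-squares fit of x_t ~ a_1 x_{t-1} + ... + a_w x_{t-w}."""
    rows = [[series[t - i] for i in range(1, w + 1)]
            for t in range(w, len(series))]
    targets = series[w:]
    # Normal equations: (X^T X) a = X^T y
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(w)]
           for i in range(w)]
    Xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(w)]
    return solve_linear(XtX, Xty)


def forecast_next(series, coeffs):
    """One-step forecast from the last len(coeffs) values."""
    return sum(a * series[-i] for i, a in enumerate(coeffs, start=1))
```

With w = 1 this fits the line in the lag plot; with w > 1 it fits the hyper-plane mentioned in A1.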

Even more details
- Q2: Can we estimate the coefficients $a$ incrementally?
- A2: Yes, with the brilliant, classic method of ‘Recursive Least Squares’ (RLS) (see, e.g., [Chen+94], or [Yi+00], for details)
- Q3: Can we ‘down-weight’ older samples?
- A3: Yes (RLS does that easily!)
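A sketch of the textbook RLS update with an exponential forgetting factor, to illustrate Q2 and Q3. The class name, the `forgetting` parameter and the initial value `delta` are assumptions made for this example; see [Chen+94] or [Yi+00] for the formulation the talk relies on.

```python
class RecursiveLeastSquares:
    """Incremental least squares; forgetting < 1 down-weights old samples."""

    def __init__(self, n_features, forgetting=1.0, delta=1e6):
        self.lam = forgetting
        self.theta = [0.0] * n_features
        # P approximates the inverse of the (weighted) X^T X matrix.
        self.P = [[delta if i == j else 0.0 for j in range(n_features)]
                  for i in range(n_features)]

    def predict(self, x):
        return sum(t * xi for t, xi in zip(self.theta, x))

    def update(self, x, y):
        """Absorb one sample (x, y); returns the error before the update."""
        n = len(x)
        Px = [sum(self.P[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(x[i] * Px[i] for i in range(n))
        gain = [v / denom for v in Px]
        err = y - self.predict(x)
        self.theta = [t + g * err for t, g in zip(self.theta, gain)]
        # P is symmetric, so x^T P equals (P x)^T.
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.lam
                   for j in range(n)] for i in range(n)]
        return err
```

Each update costs O(k^2) for k coefficients and needs no stored history, which is what makes it suitable for semi-infinite streams.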

How to choose ‘w’?
Goal: capture arbitrary periodicities with NO human intervention on a semi-infinite stream

Outline: Problem definition - motivation; Linear forecasting (AR, AWSOM); Coevolving series - MUSCLES; Fractal forecasting - F4; Other projects (graph modeling, outliers etc.)

Problem: in a train of spikes (128 ticks apart), any AR with window w < 128 will fail
What to do, then?

Answer (intuition)
- Do a Wavelet transform (~ short window DFT)
- look for patterns in every frequency

Intuition
- Why NOT use the short window Fourier transform (SWFT)?
- A: how short should the window be?
[Figure: time-frequency plane tiled with a fixed window w’]

Wavelets
Main idea: variable-length window!
[Figure: time-frequency (t, f) plane tiled with variable-length windows]

Advantages of Wavelets
- Better compression (better RMSE with same number of coefficients - used in JPEG-2000)
- fast to compute (usually: O(n)!)
- very good for ‘spikes’
- mammalian eye and ear: Gabor wavelets
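To make the O(n) claim concrete, here is a small sketch of an orthonormal Haar transform, the simplest wavelet. The choice of Haar and the function name `haar_dwt` are assumptions of this example; the talk does not say which wavelet family is used.

```python
import math


def haar_dwt(signal):
    """Full Haar decomposition of a signal whose length is a power of two.

    Returns (details, smooth): details[0] holds the finest-scale
    coefficients, details[-1] the coarsest; smooth is the final average
    coefficient. Total work is n/2 + n/4 + ... = O(n).
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    smooth = list(signal)
    details = []
    s = math.sqrt(2.0)
    while len(smooth) > 1:
        pairs = [(smooth[i], smooth[i + 1]) for i in range(0, len(smooth), 2)]
        details.append([(a - b) / s for a, b in pairs])  # local differences
        smooth = [(a + b) / s for a, b in pairs]         # local averages
    return details, smooth[0]
```

A single spike produces only one non-zero coefficient per level, which is one way to see why wavelets handle ‘spikes’ well.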

Wavelets - intuition:
Q: baritone/silence/soprano - DWT?
[Figure: value vs. time for the signal, and its time-frequency (t, f) plane]

Wavelets - intuition:
Q: baritone/soprano - DWT?
[Figure: value vs. time for the signal, and its time-frequency (t, f) plane]

AWSOM
[Figure: the signal $x_t$ and its wavelet coefficients $W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1}, V_{4,1}$, arranged on a time-frequency grid]

AWSOM - idea
$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$
$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$

Wavelets - example:
Q: weekly + daily periodicity - DWT?
[Figure: time-frequency (t, f) plane]

More details...
- Update of wavelet coefficients (incremental)
- Update of linear models (incremental; RLS)
- Feature selection (single-pass)
  - Not all correlations are significant
  - Throw away the insignificant ones (“noise”)

Results - Synthetic data
- Datasets: triangle pulse; mix (sine + square)
- AR captures wrong trend (or none)
- Seasonal AR estimation fails
[Figure: forecasts by AWSOM, AR and Seasonal AR]

Results - Real data
Automobile traffic
- Daily periodicity
- Bursty “noise” at smaller scales
- AR fails to capture any trend
- Seasonal AR estimation fails

Results - Real data
Sunspot intensity
- Slightly time-varying “period”
- AR captures wrong trend
- Seasonal ARIMA: wrong downward trend, despite help by human!

Complexity
Model update
- Space: $O(\lg N + m k^2) \approx O(\lg N)$
- Time: $O(k^2) \approx O(1)$
Where
- $N$: number of points (so far)
- $k$: number of regression coefficients; fixed
- $m$: number of linear models; $O(\lg N)$

Conclusions
AWSOM: Automatic, ‘hands-off’ traffic modeling (first of its kind!)

Outline: Problem definition - motivation; Linear forecasting (AR, AWSOM); Coevolving series - MUSCLES; Fractal forecasting - F4; Other projects (graph modeling, outliers etc.)

Co-Evolving Time Sequences
- Given: a set of correlated time sequences
- Forecast ‘Repeated(t)’
[Figure: the correlated sequences over time; the next value of ‘Repeated’ is marked ‘??’]

Solution:
Q: what should we do?

Solution:
Least Squares, with
- Dep. Variable: Repeated(t)
- Indep. Variables: Sent(t-1) ... Sent(t-w); Lost(t-1) ... Lost(t-w); Repeated(t-1), ...
- (named: ‘MUSCLES’ [Yi+00])
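A sketch of how the regression rows for this setup could be assembled, assuming the sequences are equal-length Python lists. The function name `muscles_design` and the dictionary layout are hypothetical; only the choice of dependent and independent variables follows the slide.

```python
def muscles_design(target, others, w):
    """Build (rows, y) for regressing target[t] on the last w values
    of every other sequence and of the target itself.

    target: list of numbers, e.g. Repeated
    others: dict name -> list, e.g. {"sent": [...], "lost": [...]}
    """
    names = sorted(others)  # fixed column order
    rows, y = [], []
    for t in range(w, len(target)):
        row = []
        for name in names:
            row.extend(others[name][t - lag] for lag in range(1, w + 1))
        row.extend(target[t - lag] for lag in range(1, w + 1))
        rows.append(row)
        y.append(target[t])
    return rows, y
```

The resulting rows can be fed to any least-squares solver, batch or recursive (RLS), one row per time tick.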

Examples - Experiments
- Datasets
  - Modem pool traffic (14 modems, 1500 time-ticks; #packets per time unit)
  - AT&T WorldNet internet usage (several data streams; 980 time-ticks)
- Measures of success
  - Accuracy: Root Mean Square Error (RMSE)

Accuracy - “Modem”
MUSCLES outperforms AR & “yesterday”

Accuracy - “Internet”
MUSCLES consistently outperforms AR & “yesterday”
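For reference, a short sketch of the accuracy measure and of the “yesterday” baseline. Reading “yesterday” as “predict that the next value equals the current one” is an assumption of this example, as are the function names.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def yesterday_rmse(series):
    """RMSE of the naive forecast x_t = x_{t-1} (assumed baseline)."""
    return rmse(series[1:], series[:-1])
```

Any forecaster worth using should at least beat this baseline on the same series.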

Outline: Problem definition - motivation; Linear forecasting (AR, AWSOM); Coevolving series - MUSCLES; Fractal forecasting - F4; Other projects (graph modeling, outliers etc.)

Detailed Outline
Non-linear forecasting
- Problem
- Idea
- How-to
- Experiments
- Conclusions

Recall: Problem #1
Given a time series $\{x_t\}$, predict its future course, that is, $x_{t+1}, x_{t+2}, \dots$
[Figure: value vs. time]

How to forecast?
- ARIMA - but: linearity assumption
- ANSWER: ‘Delayed Coordinate Embedding’ = Lag Plots [Sauer94]

General Intuition (Lag Plot)
- Lag = 1, k = 4 NN
- Find the 4 nearest neighbors (4-NN) of the new point
- Interpolate these to get the final prediction
[Figure: lag plot of $x_t$ vs. $x_{t-1}$, showing the new point and its 4-NN]

Questions:
- Q1: How to choose lag L?
- Q2: How to choose k (the # of NN)?
- Q3: How to interpolate?
- Q4: Why should this work at all?

Q1: Choosing lag L
- Manually (16, in award winning system by [Sauer94])
- Our proposal: choose L such that the ‘intrinsic dimension’ in the lag plot stabilizes [Chakrabarti+02]

Fractal Dimensions
FD = intrinsic dimensionality
[Figure: example with embedding dimensionality = 3 and intrinsic dimensionality = 1]

Fractal Dimensions
FD = intrinsic dimensionality
[Figure: log(# pairs) vs. log(r)]
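One common way to estimate such a dimension is the slope of log(# pairs within distance r) against log(r), often called the correlation dimension. The sketch below assumes that definition and a brute-force O(n^2) pair count; the function names and the choice of radii are illustrative, and F4 itself may compute the dimension differently.

```python
import math


def pair_count(points, r):
    """Number of point pairs at Euclidean distance <= r (brute force)."""
    n = len(points)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= r:
                count += 1
    return count


def correlation_dimension(points, radii):
    """Least-squares slope of log(# pairs) vs. log(r).

    Every radius must capture at least one pair, otherwise log(0) fails.
    """
    xs = [math.log(r) for r in radii]
    ys = [math.log(pair_count(points, r)) for r in radii]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

For points lying on a line embedded in 3-d space the slope comes out close to 1, matching the ‘embedding dimensionality = 3, intrinsic dimensionality = 1’ example.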

Intuition
The Logistic Parabola: $x_t = a x_{t-1}(1 - x_{t-1}) + \text{noise}$
[Figure: x(t) vs. time, and its lag plot for lag = 1, X(t) vs. X(t-1)]

Intuition
[Figure: lag plots of x(t) vs. x(t-1), x(t) vs. x(t-2), and x(t) vs. both x(t-1) and x(t-2)]

Intuition
- The FD vs L plot does flatten out
- L(opt) = 1
[Figure: fractal dimension vs. lag]

Proposed Method
Use Fractal Dimensions to find the optimal lag length L(opt)
[Figure: fractal dimension vs. lag (L), annotated ‘Choose this’ and ‘epsilon’]

Q2: Choosing number of neighbors k
Manually (typically ~ 1-10)

Q3: How to interpolate?
How do we interpolate between the k nearest neighbors?
- A3.1: Average
- A3.2: Weighted average (weights drop with distance - how?)
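A sketch of lag-plot forecasting with the two simple interpolation options, A3.1 and A3.2. The slide leaves the weighting scheme open, so the inverse-distance weights here are only one possible choice, and the function names are made up for this example.

```python
def delay_vectors(series, lag):
    """Pairs (delay vector, next value): ([x_{t-lag}, ..., x_{t-1}], x_t)."""
    return [(series[t - lag:t], series[t]) for t in range(lag, len(series))]


def knn_forecast(series, lag=1, k=4, weighted=False):
    """Forecast the next value from the k nearest delay vectors."""
    query = series[-lag:]
    history = delay_vectors(series, lag)

    def dist(vec):
        return sum((a - b) ** 2 for a, b in zip(vec, query)) ** 0.5

    nearest = sorted(history, key=lambda pair: dist(pair[0]))[:k]
    if not weighted:
        # A3.1: plain average of the neighbors' successors
        return sum(nxt for _, nxt in nearest) / len(nearest)
    # A3.2: inverse-distance weights (an assumed weighting scheme)
    weights = [1.0 / (dist(vec) + 1e-9) for vec, _ in nearest]
    total = sum(weights)
    return sum(w * nxt for w, (_, nxt) in zip(weights, nearest)) / total
```

The linear scan over all delay vectors keeps the sketch short; a spatial index would be the natural replacement on long streams.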

Q3: How to interpolate?
A3.3: Using SVD - seems to perform best ([Sauer94] - first place in the Santa Fe forecasting competition)
[Figure: lag plot of $x_t$ vs. $x_{t-1}$]

Q4: Any theory behind it?
A4: YES!

Theoretical foundation
Based on “Takens’ Theorem” [Takens81], which says that long enough delay vectors can do prediction, even if there are unobserved variables in the dynamical system (= diff. equations)

Theoretical foundation
Example: Lotka-Volterra equations
- $dH/dt = rH - aHP$
- $dP/dt = bHP - mP$
- $H$ is count of prey (e.g., hare)
- $P$ is count of predators (e.g., lynx)
- Suppose only $P(t)$ is observed ($t = 1, 2, \dots$)
[Figure: P vs. H]

Theoretical foundation
- But the delay vector space is a faithful reconstruction of the internal system state
- So prediction in delay vector space is as good as prediction in state space
[Figure: P vs. H, next to P(t) vs. P(t-1)]
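To illustrate the setup, the sketch below integrates the Lotka-Volterra equations with a simple Euler step and builds delay vectors from the predator series alone. The parameter values, step size and function names are arbitrary choices for this example, not values from the talk.

```python
def simulate_lotka_volterra(h0, p0, r, a, b, m, dt, steps):
    """Euler integration of dH/dt = rH - aHP, dP/dt = bHP - mP."""
    h, p = h0, p0
    hares, lynx = [h], [p]
    for _ in range(steps):
        dh = (r * h - a * h * p) * dt
        dp = (b * h * p - m * p) * dt
        h, p = h + dh, p + dp
        hares.append(h)
        lynx.append(p)
    return hares, lynx


def delay_embed(series, dim, step=1):
    """Delay vectors (x_t, x_{t-step}, ..., x_{t-(dim-1)*step})."""
    span = (dim - 1) * step
    return [tuple(series[t - i * step] for i in range(dim))
            for t in range(span, len(series))]


# Observe only the predators, sampled every 100 integration steps.
hares, lynx = simulate_lotka_volterra(
    h0=40.0, p0=9.0, r=1.0, a=0.1, b=0.02, m=0.5, dt=0.001, steps=20000)
observed = lynx[::100]
vectors = delay_embed(observed, dim=2)  # points (P(t), P(t-1))
```

Plotting `vectors` gives the P(t) vs. P(t-1) picture: a closed curve that mirrors the cycle in the hidden (H, P) state space.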

Detailed Outline
Non-linear forecasting
- Problem
- Idea
- How-to
- Experiments
- Conclusions

Datasets
Logistic Parabola: $x_t = a x_{t-1}(1 - x_{t-1}) + \text{noise}$
- Models population of flies [R. May/1976]
- ARIMA: fails
[Figure: x(t) vs. time, and the lag-plot]

Logistic Parabola
[Figure: value vs. timesteps, marked ‘Our Prediction from here’]

Logistic Parabola
[Figure: value vs. timesteps; comparison of prediction to correct values]

Datasets
LORENZ: Models convection currents in the air
- $dx/dt = a(y - x)$
- $dy/dt = x(b - z) - y$
- $dz/dt = xy - cz$
[Figure: value vs. time]
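A sketch for generating a Lorenz test series from the three equations above. The parameter values a = 10, b = 28, c = 8/3, the starting point and the Euler step are conventional choices assumed for this example; the talk does not state which values were used.

```python
def simulate_lorenz(steps, dt=0.01, a=10.0, b=28.0, c=8.0 / 3.0,
                    start=(0.0, 1.0, 1.05)):
    """Euler integration of the Lorenz system; returns the x-coordinate.

    Only x is returned, mimicking a sensor that observes one variable
    of a system with hidden state (y, z).
    """
    x, y, z = start
    xs = [x]
    for _ in range(steps):
        dx = a * (y - x)
        dy = x * (b - z) - y
        dz = x * y - c * z
        x, y, z = x + dx * dt, y + dy * dt, z + dz * dt
        xs.append(x)
    return xs
```

The returned series can be used directly as input to a lag-plot forecaster.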

LORENZ
[Figure: value vs. timesteps; comparison of prediction to correct values]

Datasets
LASER: fluctuations in a Laser over time (used in Santa Fe competition)
[Figure: value vs. time]

Laser
[Figure: value vs. timesteps; comparison of prediction to correct values]

Conclusions
- Lag plots for non-linear forecasting (Takens’ theorem)
- suitable for ‘chaotic’ signals

Additional projects at CMU
- Graph/Network mining
- spatio-temporal mining - outliers

Graph/network mining
- Internet; web; gnutella P2P networks
- Q: Any pattern?
- Q: how to generate ‘realistic’ topologies?
- Q: how to define/verify realism?

Patterns?
- avg degree is, say 3.3
- pick a node at random - what is the degree you expect it to have?
- A: 1!!
[Figure: count vs. degree, with the average (3.3) marked]

Patterns?
A: Power laws!
[Figure: log(count) vs. log{(out) degree}]
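A sketch of how a power-law exponent could be read off the log-log plot: count how many nodes have each degree, then fit a line to log(count) vs. log(degree). A plain least-squares fit is assumed here for simplicity, and the function names are illustrative.

```python
import math


def degree_counts(degrees):
    """Map each degree value to the number of nodes having it."""
    counts = {}
    for d in degrees:
        counts[d] = counts.get(d, 0) + 1
    return counts


def power_law_exponent(counts):
    """Slope of log(count) vs. log(degree); negative for a power law."""
    pts = [(math.log(d), math.log(c))
           for d, c in counts.items() if d > 0 and c > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    return num / den
```

With such a skewed distribution most nodes have degree 1 even though the average is higher, which is the point of the ‘A: 1!!’ slide.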

Other ‘laws’?
- Count vs Outdegree
- Count vs Indegree
- Hop-plot
- Eigenvalue vs Rank
- “Network value”
- Stress
- Effective Diameter

RMAT, to generate realistic graphs
- Count vs Outdegree
- Count vs Indegree
- Hop-plot
- Eigenvalue vs Rank
- “Network value”
- Stress
- Effective Diameter

Epidemic threshold?
On a real graph, will a (computer / biological) virus die out? Given
- beta: probability that an infected node will infect its neighbor, and
- delta: probability that an infected node will recover
NO / MAYBE / YES?
A: depends on largest eigenvalue of adjacency matrix! [Wang+03]
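A sketch of the eigenvalue test. The slide only says the answer depends on the largest eigenvalue; the specific condition used below, that the virus dies out when beta/delta < 1/lambda_1, is the threshold reported in [Wang+03] and should be checked against that paper. The power-iteration helper and its names are assumptions of this example, and the graph is assumed undirected, connected, and given as a 0/1 adjacency matrix with at least one edge.

```python
def largest_eigenvalue(adj, iters=200):
    """Largest eigenvalue of a symmetric 0/1 adjacency matrix.

    Power iteration on (A + I): the shift avoids the oscillation that
    plain power iteration shows on bipartite graphs.
    """
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    Ax = [sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(xi * ai for xi, ai in zip(x, Ax))  # Rayleigh quotient


def dies_out(adj, beta, delta):
    """True if beta/delta is below the threshold 1/lambda_1."""
    return beta / delta < 1.0 / largest_eigenvalue(adj)
```

For a 4-clique the largest eigenvalue is 3, so an infection with beta = 0.1 and delta = 0.5 falls below the threshold, while beta = 0.3 with the same delta does not.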

Additional projects
- Graph mining
- spatio-temporal mining - outliers

Outliers - ‘LOCI’
Finds outliers quickly, with no human intervention

Conclusions
- AWSOM for automatic, linear forecasting
- MUSCLES for co-evolving sequences
- F4 for non-linear forecasting
- Graph/Network topology: power laws and generators; epidemic threshold
- LOCI for outlier detection

Conclusions
Overarching theme: automatic discovery of patterns (outliers/rules) in
- time sequences (sensors/streams)
- graphs (computer/social networks)
- multimedia (video, motion capture data etc.)

Books
- William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT)
- C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996. (Introduction to DFT, DWT)
- George E.P. Box, Gwilym M. Jenkins and Gregory C. Reinsel: Time Series Analysis: Forecasting and Control, Prentice Hall, 1994, 3rd Edition. (The classic book on ARIMA)
- P. J. Brockwell and R. A. Davis: Time Series: Theory and Methods, Springer Verlag, New York, 1987.

Resources: software and urls
- MUSCLES: Prof. Byoung-Kee Yi
- AWSOM & LOCI:
- F4, RMAT:

Additional Reading
- [Chakrabarti+02] Deepayan Chakrabarti and Christos Faloutsos: F4: Large-Scale Automated Forecasting using Fractals. CIKM 2002, Washington DC, Nov.
- [Chen+94] Chung-Min Chen and Nick Roussopoulos: Adaptive Selectivity Estimation Using Query Feedback. SIGMOD Conference 1994.
- [Gilbert+01] Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan and Martin Strauss: Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. VLDB 2001.
- Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos: Adaptive, Hands-Off Stream Mining. VLDB 2003, Berlin, Germany, Sept.
- Spiros Papadimitriou, Hiroyuki Kitagawa, Phil Gibbons and Christos Faloutsos: LOCI: Fast Outlier Detection Using the Local Correlation Integral. ICDE 2003, Bangalore, India, March 5 - March 8.
- [Sauer94] T. Sauer (1994): Time series prediction using delay coordinate embedding. In the book by Weigend and Gershenfeld (below), Addison-Wesley.
- [Takens81] F. Takens (1981): Detecting strange attractors in fluid turbulence. Dynamical Systems and Turbulence. Berlin: Springer-Verlag.
- [Wang+03] Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos: Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint. 22nd Symposium on Reliable Distributed Computing (SRDS2003), Florence, Italy, Oct. 6-8, 2003.
- A. S. Weigend and N. A. Gershenfeld (1994): Time Series Prediction: Forecasting the Future and Understanding the Past, Addison Wesley. (Excellent collection of papers on chaotic/non-linear forecasting, describing the algorithms behind the winners of the Santa Fe competition.)
- [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE. (Describes MUSCLES and Recursive Least Squares)