CMU SCS Sensor data mining and forecasting Christos Faloutsos CMU

Presentation on theme: "CMU SCS Sensor data mining and forecasting Christos Faloutsos CMU"— Presentation transcript:

1 CMU SCS Sensor data mining and forecasting Christos Faloutsos CMU christos@cs.cmu.edu

2 CMU SCS Telcordia 2003C. Faloutsos2 Outline Problem definition - motivation Linear forecasting - AR and AWSOM Coevolving series - MUSCLES Fractal forecasting - F4 Other projects –graph modeling, outliers etc

3 CMU SCS Telcordia 2003C. Faloutsos3 Problem definition Given: one or more sequences x_1, x_2, …, x_t, … (y_1, y_2, …, y_t, …) Find –forecasts; patterns –clusters; outliers

4 CMU SCS Telcordia 2003C. Faloutsos4 Motivation - Applications Financial, sales, economic series Medical –ECGs +; blood pressure etc monitoring –reactions to new drugs –elderly care

5 CMU SCS Telcordia 2003C. Faloutsos5 Motivation - Applications (cont’d) ‘Smart house’ –sensors monitor temperature, humidity, air quality video surveillance

6 CMU SCS Telcordia 2003C. Faloutsos6 Motivation - Applications (cont’d) civil/automobile infrastructure –bridge vibrations [Oppenheim+02] – road conditions / traffic monitoring

7 CMU SCS Telcordia 2003C. Faloutsos7 Stream Data: automobile traffic [Figure: automobile traffic; axes: time, # cars]

8 CMU SCS Telcordia 2003C. Faloutsos8 Motivation - Applications (cont’d) Weather, environment/anti-pollution –volcano monitoring –air/water pollutant monitoring

9 CMU SCS Telcordia 2003C. Faloutsos9 Stream Data: Sunspots time #sunspots per month

10 CMU SCS Telcordia 2003C. Faloutsos10 Motivation - Applications (cont’d) Computer systems –‘Active Disks’ (buffering, prefetching) –web servers (ditto) –network traffic monitoring –...

11 CMU SCS Telcordia 2003C. Faloutsos11 Stream Data: Disk accesses time #bytes

12 CMU SCS Telcordia 2003C. Faloutsos12 Settings & Applications One or more sensors, collecting time-series data

13 CMU SCS Telcordia 2003C. Faloutsos13 Settings & Applications Each sensor collects data (x_1, x_2, …, x_t, …)

14 CMU SCS Telcordia 2003C. Faloutsos14 Settings & Applications Sensors ‘report’ to a central site

15 CMU SCS Telcordia 2003C. Faloutsos15 Settings & Applications Problem #1: Finding patterns in a single time sequence

16 CMU SCS Telcordia 2003C. Faloutsos16 Settings & Applications Problem #2: Finding patterns in many time sequences

17 CMU SCS Telcordia 2003C. Faloutsos17 Problem #1: Goal: given a signal (e.g., #packets over time) Find: patterns, periodicities, and/or compress [Figure: lynx caught per year; axes: year, count. Similar signals: packets per day; temperature per day]

18 CMU SCS Telcordia 2003C. Faloutsos18 Problem#1’: Forecast Given x_t, x_{t-1}, …, forecast x_{t+1} [Figure: number of packets sent per time tick; the next value is marked ‘??’]

19 CMU SCS Telcordia 2003C. Faloutsos19 Problem #2: Given: A set of correlated time sequences Forecast ‘Sent(t)’

20 CMU SCS Telcordia 2003C. Faloutsos20 Differences from DSP/Stat Semi-infinite streams –we need on-line, ‘any-time’ algorithms Cannot afford human intervention –need automatic methods Sensors have limited memory / processing / transmitting power –need for (lossy) compression

21 CMU SCS Telcordia 2003C. Faloutsos21 Important observations Patterns, rules, compression and forecasting are closely related: To do forecasting, we need –to find patterns/rules good rules help us compress to find outliers, we need to have forecasts –(outlier = too far away from our forecast)

22 CMU SCS Telcordia 2003C. Faloutsos22 Pictorial outline of the talk

23 CMU SCS Telcordia 2003C. Faloutsos23 Outline Problem definition - motivation Linear forecasting –AR –AWSOM Coevolving series - MUSCLES Fractal forecasting - F4 Other projects –graph modeling, outliers etc

24 CMU SCS Telcordia 2003C. Faloutsos24 Mini intro to A.R.

25 CMU SCS Telcordia 2003C. Faloutsos25 Forecasting "Prediction is very difficult, especially about the future." - Niels Bohr http://www.hfac.uh.edu/MediaFutures/thoughts.html

26 CMU SCS Telcordia 2003C. Faloutsos26 Problem#1’: Forecast Example: given x_{t-1}, x_{t-2}, …, forecast x_t [Figure: number of packets sent per time tick; the next value is marked ‘??’]

27 CMU SCS Telcordia 2003C. Faloutsos27 Forecasting: Preprocessing MANUALLY: remove trends spot periodicities time 7 days

28 CMU SCS Telcordia 2003C. Faloutsos28 Linear Regression: idea express what we don’t know (= ‘dependent variable’) as a linear function of what we know (= ‘indep. variable(s)’) [Figure: scatter plot relating body weight and body height]

29 CMU SCS Telcordia 2003C. Faloutsos29 Linear Auto Regression:

30 CMU SCS Telcordia 2003C. Faloutsos30 Problem#1’: Forecast Solution: try to express x_t as a linear function of the past: x_{t-1}, x_{t-2}, …, (up to a window of w) Formally: x_t \approx a_1 x_{t-1} + … + a_w x_{t-w} + noise [Figure: number of packets sent per time tick; the next value is marked ‘??’]

31 CMU SCS Telcordia 2003C. Faloutsos31 Linear Auto Regression: lag w=1 Dependent variable = # of packets sent (S[t]) Independent variable = # of packets sent (S[t-1]) [Figure: ‘lag-plot’; x-axis: number of packets sent (t-1), y-axis: number of packets sent (t)]

32 CMU SCS Telcordia 2003C. Faloutsos32 More details: Q1: Can it work with window w>1? A1: YES! [Figure: 3-d lag plot; axes: x_{t-2}, x_{t-1}, x_t]

33 CMU SCS Telcordia 2003C. Faloutsos33 More details: Q1: Can it work with window w>1? A1: YES! (we’ll fit a hyper-plane, then!) [Figure: 3-d lag plot; axes: x_{t-2}, x_{t-1}, x_t]

34 CMU SCS Telcordia 2003C. Faloutsos34 More details: Q1: Can it work with window w>1? A1: YES! (we’ll fit a hyper-plane, then!) [Figure: 3-d lag plot with the fitted hyper-plane; axes: x_{t-2}, x_{t-1}, x_t]
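
A minimal Python sketch of the auto-regression idea on these slides: each time tick contributes one row of w past values, and the coefficients come from solving the least-squares normal equations (a line for w=1, a hyper-plane for w>1). The function names and the toy series are illustrative assumptions, not code from the talk.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough independent rows")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - rest) / M[i][i]
    return x


def fit_ar(series, w):
    """Return a[0..w-1] minimizing the squared error of
    x_t ~ a[0]*x_{t-1} + ... + a[w-1]*x_{t-w}."""
    rows = [[series[t - 1 - i] for i in range(w)] for t in range(w, len(series))]
    targets = [series[t] for t in range(w, len(series))]
    # Normal equations: (X^T X) a = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(w)] for i in range(w)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(w)]
    return solve_linear_system(xtx, xty)


def forecast_next(series, coeffs):
    """One-step-ahead forecast from the last len(coeffs) values."""
    return sum(c * series[-1 - i] for i, c in enumerate(coeffs))
```

This batch version re-solves from scratch; it does not yet address the on-line, semi-infinite stream setting.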

35 CMU SCS Telcordia 2003C. Faloutsos35 Even more details Q2: Can we estimate a incrementally? A2: Yes, with the brilliant, classic method of ‘Recursive Least Squares’ (RLS) (see, e.g., [Chen+94], or [Yi+00], for details) Q3: can we ‘down-weight’ older samples? A3: yes (RLS does that easily!)
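
A hedged sketch of Recursive Least Squares with a forgetting factor, to illustrate Q2 and Q3: coefficients are updated per sample without re-solving, and a forgetting factor below 1 down-weights older samples. This is the textbook RLS recursion; the class name, the `init_gain` default and the parameter names are assumptions made for illustration.

```python
class RecursiveLeastSquares:
    """Textbook RLS: keeps coefficients `a` and a gain matrix `P`."""

    def __init__(self, n_coeffs, forgetting=1.0, init_gain=1e6):
        self.lam = forgetting  # 1.0 = no forgetting; <1 down-weights old samples
        self.a = [0.0] * n_coeffs
        # Large initial P means "no prior knowledge" about the coefficients.
        self.P = [[init_gain if i == j else 0.0 for j in range(n_coeffs)]
                  for i in range(n_coeffs)]

    def predict(self, x):
        return sum(ai * xi for ai, xi in zip(self.a, x))

    def update(self, x, y):
        """Absorb one sample (x = independent variables, y = observed value)."""
        n = len(x)
        Px = [sum(self.P[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(x[i] * Px[i] for i in range(n))
        gain = [v / denom for v in Px]
        error = y - self.predict(x)  # error of the forecast made before updating
        self.a = [ai + g * error for ai, g in zip(self.a, gain)]
        # P is symmetric, so x^T P equals (P x)^T.
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.lam for j in range(n)]
                  for i in range(n)]
        return error
```

Each update costs O(k^2) for k coefficients, independent of how many samples have been seen so far.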

36 CMU SCS Telcordia 2003C. Faloutsos36 Mini intro to A.R.

37 CMU SCS Telcordia 2003C. Faloutsos37 How to choose ‘w’? goal: capture arbitrary periodicities with NO human intervention on a semi-infinite stream

38 CMU SCS Telcordia 2003C. Faloutsos38 Outline Problem definition - motivation Linear forecasting –AR –AWSOM Coevolving series - MUSCLES Fractal forecasting - F4 Other projects –graph modeling, outliers etc

39 CMU SCS Telcordia 2003C. Faloutsos39 Problem: in a train of spikes (128 ticks apart) any AR with window w < 128 will fail What to do, then?

40 CMU SCS Telcordia 2003C. Faloutsos40 Answer (intuition) Do a Wavelet transform (~ short window DFT) look for patterns in every frequency

41 CMU SCS Telcordia 2003C. Faloutsos41 Intuition Why NOT use the short window Fourier transform (SWFT)? A: how short should the window be? [Figure: time-frequency plane tiled with a fixed window width w’]

42 CMU SCS Telcordia 2003C. Faloutsos42 wavelets t f main idea: variable-length window!

43 CMU SCS Telcordia 2003C. Faloutsos43 Advantages of Wavelets Better compression (better RMSE with same number of coefficients - used in JPEG-2000) fast to compute (usually: O(n)!) very good for ‘spikes’ mammalian eye and ear: Gabor wavelets
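
To make the O(n) claim concrete, here is a small sketch of the Haar transform, the simplest discrete wavelet transform. Choosing Haar is an assumption for illustration (the slides do not say which wavelet is used), and `haar_dwt` is a hypothetical name.

```python
import math


def haar_dwt(signal):
    """Orthonormal Haar DWT of a signal whose length is a power of two.

    Returns (approximation, details) where details[0] is the finest level.
    Total work is n + n/2 + n/4 + ... = O(n).
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(signal)
    details = []
    root2 = math.sqrt(2.0)
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        smooth = [(approx[i] + approx[i + 1]) / root2 for i in pairs]
        detail = [(approx[i] - approx[i + 1]) / root2 for i in pairs]
        details.append(detail)
        approx = smooth
    return approx[0], details
```

A single spike produces only one non-zero detail coefficient per level, which is one way to see why wavelets handle spikes well.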

44 CMU SCS Telcordia 2003C. Faloutsos44 Wavelets - intuition: t f Q: baritone/silence/ soprano - DWT? time value

45 CMU SCS Telcordia 2003C. Faloutsos45 Wavelets - intuition: Q: baritone/soprano - DWT? t f time value

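46 CMU SCS Telcordia 2003C. Faloutsos46 AWSOM [Figure: the signal x_t and its decomposition over the time-frequency plane into coefficients W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1}, V_{4,1}]

47 CMU SCS Telcordia 2003C. Faloutsos47 AWSOM [Figure: the signal x_t and its decomposition over the time-frequency plane into coefficients W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1}, V_{4,1}]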

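48 CMU SCS Telcordia 2003C. Faloutsos48 AWSOM - idea W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + … and, at another level, W_{l’,t’} \approx \beta_{l’,1} W_{l’,t’-1} + \beta_{l’,2} W_{l’,t’-2} + … [Figure: coefficients W_{l,t}, W_{l,t-1}, W_{l,t-2} and W_{l’,t’}, W_{l’,t’-1}, W_{l’,t’-2} highlighted in the time-frequency plane]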
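
A deliberately simplified sketch of the idea on this slide: fit a separate linear model per wavelet level, regressing each coefficient on its predecessor at the same level. Only one lag is used for brevity, and this is not the AWSOM implementation; the function names and the input layout (a dict from level to its coefficient list) are assumptions.

```python
def fit_level_ar1(coeffs):
    """Least-squares beta for W_t ~ beta * W_{t-1} within one wavelet level."""
    num = sum(cur * prev for prev, cur in zip(coeffs, coeffs[1:]))
    den = sum(prev * prev for prev in coeffs[:-1])
    return num / den if den else 0.0


def fit_per_level(levels):
    """Fit one independent model per level: {level: beta}."""
    return {level: fit_level_ar1(coeffs) for level, coeffs in levels.items()}
```

Because coarser levels have fewer coefficients, the number of models grows only with the number of levels.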

49 CMU SCS Telcordia 2003C. Faloutsos49 Wavelets - example: Q: weekly + daily periodicity - DWT? t f

50 CMU SCS Telcordia 2003C. Faloutsos50 Wavelets - example: Q: weekly + daily periodicity - DWT? t f

51 CMU SCS Telcordia 2003C. Faloutsos51 Wavelets - Example: Q: weekly + daily periodicity - DWT? t f

52 CMU SCS Telcordia 2003C. Faloutsos52 More details… Update of wavelet coefficients Update of linear models Feature selection –Not all correlations are significant –Throw away the insignificant ones (“noise”) (incremental) (incremental; RLS) (single-pass)

53 CMU SCS Telcordia 2003C. Faloutsos53 Results - Synthetic data Triangle pulse Mix (sine + square) AR captures wrong trend (or none) Seasonal AR estimation fails [Figure panels: AWSOM, AR, Seasonal AR]

54 CMU SCS Telcordia 2003C. Faloutsos54 Results - Real data Automobile traffic –Daily periodicity –Bursty “noise” at smaller scales AR fails to capture any trend Seasonal AR estimation fails

55 CMU SCS Telcordia 2003C. Faloutsos55 Results - real data Sunspot intensity –Slightly time-varying “period” AR captures wrong trend Seasonal ARIMA –wrong downward trend, despite help by human!

56 CMU SCS Telcordia 2003C. Faloutsos56 Complexity Model update Space: O(lg N + m k^2) \approx O(lg N) Time: O(k^2) \approx O(1) Where –N: number of points (so far) –k: number of regression coefficients; fixed –m: number of linear models; O(lg N)

57 CMU SCS Telcordia 2003C. Faloutsos57 Conclusions AWSOM: Automatic, ‘hands-off’ traffic modeling (first of its kind!)

58 CMU SCS Telcordia 2003C. Faloutsos58 Outline Problem definition - motivation Linear forecasting –AR –AWSOM Coevolving series - MUSCLES Fractal forecasting - F4 Other projects –graph modeling, outliers etc

59 CMU SCS Telcordia 2003C. Faloutsos59 Co-Evolving Time Sequences Given: A set of correlated time sequences Forecast ‘Repeated(t)’ ??

60 CMU SCS Telcordia 2003C. Faloutsos60 Solution: Q: what should we do?

61 CMU SCS Telcordia 2003C. Faloutsos61 Solution: Least Squares, with Dep. Variable: Repeated(t) Indep. Variables: Sent(t-1) … Sent(t-w); Lost(t-1) …Lost(t-w); Repeated(t-1),... (named: ‘MUSCLES’ [Yi+00])
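
A small sketch of how the regression rows listed on this slide could be assembled: for each time tick t, the independent variables are the last w values of every sequence, and the dependent variable is the target sequence at t. The function name and the dict-of-lists input are assumptions for illustration; the coefficients would then be fitted with least squares (for example, RLS).

```python
def build_regression_rows(sequences, target, w):
    """Build (rows, targets) for least squares over co-evolving sequences.

    sequences: dict mapping a name (e.g. "sent") to a list of equal length.
    Each row holds name(t-1) ... name(t-w) for every sequence, in dict order.
    """
    names = list(sequences)
    n = len(sequences[target])
    rows, targets = [], []
    for t in range(w, n):
        row = []
        for name in names:
            row.extend(sequences[name][t - lag] for lag in range(1, w + 1))
        rows.append(row)
        targets.append(sequences[target][t])
    return rows, targets
```

With v sequences and window w, each row has v*w independent variables.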

62 CMU SCS Telcordia 2003C. Faloutsos62 Examples - Experiments Datasets –Modem pool traffic (14 modems, 1500 time- ticks; #packets per time unit) –AT&T WorldNet internet usage (several data streams; 980 time-ticks) Measures of success –Accuracy : Root Mean Square Error (RMSE)

63 CMU SCS Telcordia 2003C. Faloutsos63 Accuracy - “Modem” MUSCLES outperforms AR & “yesterday”

64 CMU SCS Telcordia 2003C. Faloutsos64 Accuracy - “Internet” MUSCLES consistently outperforms AR & “yesterday”
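
For reference, a short sketch of the accuracy measure and of the “yesterday” baseline used in these comparisons, under the assumption that “yesterday” means forecasting each value by the previous one. Function names are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def yesterday_rmse(series):
    """RMSE of the naive forecast x_t = x_{t-1}."""
    return rmse(series[1:], series[:-1])
```

Any forecasting method worth using should reach a lower RMSE than this baseline on the same series.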

65 CMU SCS Telcordia 2003C. Faloutsos65 Outline Problem definition - motivation Linear forecasting –AR –AWSOM Coevolving series - MUSCLES Fractal forecasting - F4 Other projects –graph modeling, outliers etc

66 CMU SCS Telcordia 2003C. Faloutsos66 Detailed Outline Non-linear forecasting –Problem –Idea –How-to –Experiments –Conclusions

67 CMU SCS Telcordia 2003C. Faloutsos67 Recall: Problem #1 Given a time series {x_t}, predict its future course, that is, x_{t+1}, x_{t+2}, ... [Figure: value over time]

68 CMU SCS Telcordia 2003C. Faloutsos68 How to forecast? ARIMA - but: linearity assumption ANSWER: ‘Delayed Coordinate Embedding’ = Lag Plots [Sauer92]

69 CMU SCS Telcordia 2003C. Faloutsos69 General Intuition (Lag Plot) Lag = 1, k = 4 NN: find the 4 nearest neighbors (4-NN) of the new point and interpolate these to get the final prediction [Figure: lag plot; x-axis: x_{t-1}, y-axis: x_t]
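
The lag-plot intuition can be sketched in a few lines of Python: form delay vectors of the last L values, find the k past vectors closest to the current one, and interpolate their successors. Plain averaging is used as the interpolation here, which is only the simplest option; `knn_forecast` and its parameters are hypothetical names.

```python
def knn_forecast(series, L, k):
    """Forecast the next value from the k nearest delay vectors of length L."""
    query = series[-L:]
    candidates = []
    # Every past delay vector that has a known successor is a candidate.
    for t in range(L - 1, len(series) - 1):
        vector = series[t - L + 1 : t + 1]
        dist2 = sum((a - b) ** 2 for a, b in zip(vector, query))
        candidates.append((dist2, series[t + 1]))
    if not candidates:
        raise ValueError("series too short for this lag")
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    return sum(successor for _, successor in nearest) / len(nearest)
```

The linear scan keeps the sketch short; a spatial index would be the natural replacement on long series.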

70 CMU SCS Telcordia 2003C. Faloutsos70 Questions: Q1: How to choose lag L? Q2: How to choose k (the # of NN)? Q3: How to interpolate? Q4: why should this work at all?

71 CMU SCS Telcordia 2003C. Faloutsos71 Q1: Choosing lag L Manually (16, in award winning system by [Sauer94]) Our proposal: choose L such that the ‘intrinsic dimension’ in the lag plot stabilizes [Chakrabarti+02]

72 CMU SCS Telcordia 2003C. Faloutsos72 Fractal Dimensions FD = intrinsic dimensionality Embedding dimensionality = 3 Intrinsic dimensionality = 1

73 CMU SCS Telcordia 2003C. Faloutsos73 Fractal Dimensions FD = intrinsic dimensionality log(r) log( # pairs)
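
A rough sketch of how the slope in the log(# pairs) versus log(r) plot can be estimated: count the pairs of points within each radius and fit a line in log-log space. The brute-force O(N^2) pair count, the function names and the choice of radii are assumptions for illustration, not the method used in the talk.

```python
import math


def fit_slope(xs, ys):
    """Least-squares slope of y against x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)


def correlation_dimension(points, radii):
    """Slope of log(#pairs within r) vs log(r) over the given radii."""
    sq_dists = [
        sum((a - b) ** 2 for a, b in zip(p, q))
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]
    log_r, log_pairs = [], []
    for r in radii:
        pairs = sum(1 for d in sq_dists if d <= r * r)
        if pairs > 0:  # log(0) is undefined; skip radii that are too small
            log_r.append(math.log(r))
            log_pairs.append(math.log(pairs))
    return fit_slope(log_r, log_pairs)
```

Points lying on a line give a slope close to 1 even when they are embedded in 3-d space, matching the intrinsic versus embedding dimensionality example.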

74 CMU SCS Telcordia 2003C. Faloutsos74 Intuition The Logistic Parabola x_t = a x_{t-1} (1 - x_{t-1}) + noise [Figures: x(t) over time; its lag plot for lag = 1, X(t) against X(t-1)]

75 CMU SCS Telcordia 2003C. Faloutsos75 Intuition [Figures: lag plots with axes x(t-1), x(t); x(t-2), x(t); and x(t-2), x(t-1), x(t)]

76 CMU SCS Telcordia 2003C. Faloutsos76 Intuition The FD vs L plot does flatten out L(opt) = 1 Lag Fractal dimension

77 CMU SCS Telcordia 2003C. Faloutsos77 Proposed Method Use Fractal Dimensions to find the optimal lag length L(opt) Lag (L) Fractal Dimension Choose this epsilon

78 CMU SCS Telcordia 2003C. Faloutsos78 Q2: Choosing number of neighbors k Manually (typically ~ 1-10)

79 CMU SCS Telcordia 2003C. Faloutsos79 Q3: How to interpolate? How do we interpolate between the k nearest neighbors? A3.1: Average A3.2: Weighted average (weights drop with distance - how?)

80 CMU SCS Telcordia 2003C. Faloutsos80 Q3: How to interpolate? A3.3: Using SVD - seems to perform best ([Sauer94] - first place in the Santa Fe forecasting competition) [Figure: lag plot; x-axis: x_{t-1}, y-axis: x_t]

81 CMU SCS Telcordia 2003C. Faloutsos81 Q4: Any theory behind it? A4: YES!

82 CMU SCS Telcordia 2003C. Faloutsos82 Theoretical foundation Based on the “Takens’ Theorem” [Takens81] which says that long enough delay vectors can do prediction, even if there are unobserved variables in the dynamical system (= diff. equations)

83 CMU SCS Telcordia 2003C. Faloutsos83 Theoretical foundation Example: Lotka-Volterra equations dH/dt = r H - a H P; dP/dt = b H P - m P, where H is count of prey (e.g., hare) and P is count of predators (e.g., lynx). Suppose only P(t) is observed (t=1, 2, …). [Figure: trajectory in the (H, P) plane]

84 CMU SCS Telcordia 2003C. Faloutsos84 Theoretical foundation But the delay vector space is a faithful reconstruction of the internal system state So prediction in delay vector space is as good as prediction in state space [Figures: state space (H, P) next to delay vector space (P(t-1), P(t))]

85 CMU SCS Telcordia 2003C. Faloutsos85 Detailed Outline Non-linear forecasting –Problem –Idea –How-to –Experiments –Conclusions

86 CMU SCS Telcordia 2003C. Faloutsos86 Datasets Logistic Parabola: x_t = a x_{t-1} (1 - x_{t-1}) + noise Models population of flies [R. May/1976] [Figures: x(t) over time; lag-plot]

87 CMU SCS Telcordia 2003C. Faloutsos87 Datasets Logistic Parabola: x_t = a x_{t-1} (1 - x_{t-1}) + noise Models population of flies [R. May/1976] [Figures: x(t) over time; lag-plot] ARIMA: fails
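
A minimal generator for the Logistic Parabola dataset, useful for reproducing the lag plot. The Gaussian noise model, the clamping to [0, 1] and all parameter names are assumptions; the slides only say “+ noise”.

```python
import random


def logistic_parabola(a, x0, n, noise_std=0.0, seed=0):
    """Generate n values of x_t = a * x_{t-1} * (1 - x_{t-1}) + noise."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(n - 1):
        nxt = a * xs[-1] * (1 - xs[-1]) + rng.gauss(0.0, noise_std)
        # Assumption: clamp so that noise cannot push the series out of [0, 1].
        xs.append(min(max(nxt, 0.0), 1.0))
    return xs


def lag_plot_points(xs, lag=1):
    """Pairs (x_{t-lag}, x_t), i.e. the points drawn in a lag plot."""
    return list(zip(xs[:-lag], xs[lag:]))
```

Without noise every lag-plot point falls exactly on the parabola, even though the series itself looks erratic over time for values of `a` near 4.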

88 CMU SCS Telcordia 2003C. Faloutsos88 Logistic Parabola Timesteps Value Our Prediction from here

89 CMU SCS Telcordia 2003C. Faloutsos89 Logistic Parabola Timesteps Value Comparison of prediction to correct values

90 CMU SCS Telcordia 2003C. Faloutsos90 Datasets LORENZ: Models convection currents in the air dx/dt = a (y - x); dy/dt = x (b - z) - y; dz/dt = x y - c z [Figure: value over time]
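
A sketch of how a LORENZ series could be generated from these equations with a fourth-order Runge-Kutta integrator. The parameter values a=10, b=28, c=8/3 are the classic choice and are an assumption here, as are the step size, the start state and the function names; the slides do not state which settings were used.

```python
def lorenz_derivative(state, a=10.0, b=28.0, c=8.0 / 3.0):
    """Right-hand side of the Lorenz equations as written on the slide."""
    x, y, z = state
    return (a * (y - x), x * (b - z) - y, x * y - c * z)


def rk4_step(state, dt, f):
    """One classical Runge-Kutta step for d(state)/dt = f(state)."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (p + 2 * q + 2 * r + w)
        for s, p, q, r, w in zip(state, k1, k2, k3, k4)
    )


def lorenz_series(n, dt=0.01, start=(1.0, 1.0, 1.0)):
    """Return n samples of the x coordinate, the only observed variable."""
    state, xs = start, []
    for _ in range(n):
        xs.append(state[0])
        state = rk4_step(state, dt, lorenz_derivative)
    return xs
```

Observing only x while y and z stay hidden mirrors the delay-embedding setting, where unobserved variables are recovered through lagged values.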

91 CMU SCS Telcordia 2003C. Faloutsos91 LORENZ Timesteps Value Comparison of prediction to correct values

92 CMU SCS Telcordia 2003C. Faloutsos92 Datasets Time Value LASER: fluctuations in a Laser over time (used in Santa Fe competition)

93 CMU SCS Telcordia 2003C. Faloutsos93 Laser Timesteps Value Comparison of prediction to correct values

94 CMU SCS Telcordia 2003C. Faloutsos94 Conclusions Lag plots for non-linear forecasting (Takens’ theorem) suitable for ‘chaotic’ signals

95 CMU SCS Telcordia 2003C. Faloutsos95 Additional projects at CMU Graph/Network mining spatio-temporal mining - outliers

96 CMU SCS Telcordia 2003C. Faloutsos96 Graph/network mining Internet; web; gnutella P2P networks Q: Any pattern? Q: how to generate ‘realistic’ topologies? Q: how to define/verify realism?

97 CMU SCS Telcordia 2003C. Faloutsos97 Patterns? avg degree is, say 3.3 pick a node at random - what is the degree you expect it to have? degree count avg: 3.3

98 CMU SCS Telcordia 2003C. Faloutsos98 Patterns? avg degree is, say 3.3 pick a node at random - what is the degree you expect it to have? A: 1!! degree count avg: 3.3

99 CMU SCS Telcordia 2003C. Faloutsos99 Patterns? avg degree is, say 3.3 pick a node at random - what is the degree you expect it to have? A: 1!! degree count avg: 3.3

100 CMU SCS Telcordia 2003C. Faloutsos100 Patterns? A: Power laws! log {(out) degree} log(count)
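
A small sketch of how the log-log plot on this slide can be checked numerically: build the out-degree histogram from an edge list and fit a line to log(count) against log(degree). A plain least-squares fit is the simplest estimator and is an assumption here, as are the function names.

```python
import math
from collections import Counter


def degree_counts(edges):
    """Map each out-degree to the number of nodes having it.

    edges: iterable of (source, destination) pairs; nodes with
    out-degree 0 are ignored because log(0) is undefined.
    """
    out_degree = Counter(src for src, _ in edges)
    return Counter(out_degree.values())


def power_law_exponent(counts):
    """Slope of log(count) vs log(degree); needs >= 2 distinct degrees."""
    xs = [math.log(d) for d in counts]
    ys = [math.log(counts[d]) for d in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)
```

A clearly negative slope with points close to the fitted line is what “power law” means in this plot, and it explains why the most common degree is 1 even when the average is 3.3.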

101 CMU SCS Telcordia 2003C. Faloutsos101 Other ‘laws’? Count vs Outdegree Count vs Indegree Hop-plot Eigenvalue vs Rank “Network value” Stress Effective Diameter

102 CMU SCS Telcordia 2003C. Faloutsos102 RMAT, to generate realistic graphs Count vs Outdegree Count vs Indegree Hop-plot Eigenvalue vs Rank “Network value” Stress Effective Diameter

103 CMU SCS Telcordia 2003C. Faloutsos103 Epidemic threshold? On a real graph, will a (computer / biological) virus die out? (given –beta: probability that an infected node will infect its neighbor and –delta: probability that an infected node will recover) NO / MAYBE / YES

104 CMU SCS Telcordia 2003C. Faloutsos104 Epidemic threshold? On a real graph, will a (computer / biological) virus die out? (given –beta: probability that an infected node will infect its neighbor and –delta: probability that an infected node will recover) A: depends on largest eigenvalue of adjacency matrix! [Wang+03]
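
A hedged sketch of the eigenvalue test, assuming the condition from [Wang+03] in the form beta/delta < 1/lambda_1, where lambda_1 is the largest eigenvalue of the adjacency matrix. The power iteration on A + I (shifted so that bipartite graphs also converge), the adjacency-list input and the function names are illustrative choices for an undirected, connected graph.

```python
import math


def largest_eigenvalue(adj, iterations=200):
    """Largest adjacency eigenvalue of an undirected graph.

    adj[i] is the list of neighbors of node i. Power iteration runs on
    A + I, then the Rayleigh quotient of A is read off the final vector.
    """
    n = len(adj)
    x = [1.0] * n
    for _ in range(iterations):
        y = [x[i] + sum(x[j] for j in adj[i]) for i in range(n)]  # (A + I) x
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    ax = [sum(x[j] for j in adj[i]) for i in range(n)]
    return sum(xi * axi for xi, axi in zip(x, ax))  # x has unit length


def virus_dies_out(adj, beta, delta):
    """True when beta / delta is below the threshold 1 / lambda_1."""
    lam = largest_eigenvalue(adj)
    if lam <= 0.0:  # no edges: nothing can spread
        return True
    return beta / delta < 1.0 / lam
```

For a complete graph on 4 nodes lambda_1 is 3, so the virus is predicted to die out whenever beta/delta is below 1/3.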

105 CMU SCS Telcordia 2003C. Faloutsos105 Additional projects Graph mining spatio-temporal mining - outliers

106 CMU SCS Telcordia 2003C. Faloutsos106 Outliers - ‘LOCI’

107 CMU SCS Telcordia 2003C. Faloutsos107 Outliers - ‘LOCI’ finds outliers quickly, with no human intervention

108 CMU SCS Telcordia 2003C. Faloutsos108 Conclusions AWSOM for automatic, linear forecasting MUSCLES for co-evolving sequences F4 for non-linear forecasting Graph/Network topology: power laws and generators; epidemic threshold LOCI for outlier detection

109 CMU SCS Telcordia 2003C. Faloutsos109 Conclusions Overarching theme: automatic discovery of patterns (outliers/rules) in –time sequences (sensors/streams) –graphs (computer/social networks) –multimedia (video, motion capture data etc) www.cs.cmu.edu/~christos christos@cs.cmu.edu

110 CMU SCS Telcordia 2003C. Faloutsos110 Books William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT) C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to DFT, DWT)

111 CMU SCS Telcordia 2003C. Faloutsos111 Books George E.P. Box and Gwilym M. Jenkins and Gregory C. Reinsel, Time Series Analysis: Forecasting and Control, Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.) Brockwell, P. J. and R. A. Davis (1987). Time Series: Theory and Methods. New York, Springer Verlag.

112 CMU SCS Telcordia 2003C. Faloutsos112 Resources: software and urls MUSCLES: Prof. Byoung-Kee Yi: http://www.postech.ac.kr/~bkyi/ or christos@cs.cmu.edu AWSOM & LOCI: spapadim@cs.cmu.edu F4, RMAT: deepay@cs.cmu.edu

113 CMU SCS Telcordia 2003C. Faloutsos113 Additional Reading [Chakrabarti+02] Deepay Chakrabarti and Christos Faloutsos F4: Large-Scale Automated Forecasting using Fractals CIKM 2002, Washington DC, Nov. 2002. [Chen+94] Chung-Min Chen, Nick Roussopoulos: Adaptive Selectivity Estimation Using Query Feedback. SIGMOD Conference 1994:161-172 [Gilbert+01] Anna C. Gilbert, Yannis Kotidis and S. Muthukrishnan and Martin Strauss, Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, VLDB 2001

114 CMU SCS Telcordia 2003C. Faloutsos114 Additional Reading Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos Adaptive, Hands-Off Stream Mining VLDB 2003, Berlin, Germany, Sept. 2003 Spiros Papadimitriou, Hiroyuki Kitagawa, Phil Gibbons and Christos Faloutsos LOCI: Fast Outlier Detection Using the Local Correlation Integral ICDE 2003, Bangalore, India, March 5 - March 8, 2003. Sauer, T. (1994). Time series prediction using delay coordinate embedding. (in book by Weigend and Gershenfeld, below) Addison-Wesley.

115 CMU SCS Telcordia 2003C. Faloutsos115 Additional Reading Takens, F. (1981). Detecting strange attractors in fluid turbulence. Dynamical Systems and Turbulence. Berlin: Springer-Verlag. Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint 22nd Symposium on Reliable Distributed Computing (SRDS2003) Florence, Italy, Oct. 6-8, 2003

116 CMU SCS Telcordia 2003C. Faloutsos116 Additional Reading Weigend, A. S. and N. A. Gershenfeld (1994). Time Series Prediction: Forecasting the Future and Understanding the Past, Addison Wesley. (Excellent collection of papers on chaotic/non-linear forecasting, describing the algorithms behind the winners of the Santa Fe competition.) [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)

