CMU SCS Data Mining on Streams Christos Faloutsos CMU.

Slides:



Advertisements
Similar presentations
High Performance Discovery from Time Series Streams
Advertisements

Indexing Time Series Based on original slides by Prof. Dimitrios Gunopulos and Prof. Christos Faloutsos with some slides from tutorials by Prof. Eamonn.
CMU SCS : Multimedia Databases and Data Mining Lecture #23: DSP tools – Fourier and Wavelets C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #19: SVD - part II (case studies) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
FUNNEL: Automatic Mining of Spatially Coevolving Epidemics Yasuko Matsubara, Yasushi Sakurai (Kumamoto University) Willem G. van Panhuis (University of.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture#5: Multi-key and Spatial Access Methods - II C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #22: DSP tools – Fourier and Wavelets C. Faloutsos.
Deepayan ChakrabartiCIKM F4: Large Scale Automated Forecasting Using Fractals -Deepayan Chakrabarti -Christos Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #25: Multimedia indexing C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #25: Time series mining and forecasting Christos Faloutsos.
School of Computer Science Carnegie Mellon Sensor and Graph Mining Christos Faloutsos Carnegie Mellon University & IBM
Carnegie Mellon DB/IR '06C. Faloutsos#1 Data Mining on Streams Christos Faloutsos CMU.
Efficient Anomaly Monitoring over Moving Object Trajectory Streams joint work with Lei Chen (HKUST) Ada Wai-Chee Fu (CUHK) Dawei Liu (CUHK) Yingyi Bu (Microsoft)
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
15-826: Multimedia Databases and Data Mining
Streaming Pattern Discovery in Multiple Time-Series Spiros Papadimitriou Jimeng Sun Christos Faloutsos Carnegie Mellon University VLDB 2005, Trondheim,
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals - case studies Part III (regions, quadtrees, knn queries) C. Faloutsos.
Time Series Indexing II. Time Series Data
Indexing Time Series. Time Series Databases A time series is a sequence of real numbers, representing the measurements of a real variable at equal time.
Hongtao Cheng 1 Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences Author:Flip Korn H. V. Jagadish Christos Faloutsos From ACM.
Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)
Efficient Similarity Search in Sequence Databases Rakesh Agrawal, Christos Faloutsos and Arun Swami Leila Kaghazian.
CMU SCS Data Mining Meets Systems: Tools and Case Studies Christos Faloutsos SCS CMU.
Data Mining: Concepts and Techniques Mining time-series data.
CMU SCS Graph and stream mining Christos Faloutsos CMU.
1 ISI’02 Multidimensional Databases Challenge: representation for efficient storage, indexing & querying Examples (time-series, images) New multidimensional.
Based on Slides by D. Gunopulos (UCR)
Spatial and Temporal Data Mining
CMU SCS Sensor data mining and forecasting Christos Faloutsos CMU
Traffic Analysis: Tools for Mining Time Series
Multimedia Databases Text II. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
Multimedia Databases LSI and SVD. Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and.
E.G.M. PetrakisDimensionality Reduction1  Given N vectors in n dims, find the k most important axes to project them  k is user defined (k < n)  Applications:
A Multiresolution Symbolic Representation of Time Series
Spatial and Temporal Databases Efficiently Time Series Matching by Wavelets (ICDE 98) Kin-pong Chan and Ada Wai-chee Fu.
Dimensionality Reduction. Multimedia DBs Many multimedia applications require efficient indexing in high-dimensions (time-series, images and videos, etc)
Indexing Time Series.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Conclusions C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #27: Time series mining and forecasting Christos Faloutsos.
Fast Subsequence Matching in Time-Series Databases Christos Faloutsos M. Ranganathan Yannis Manolopoulos Department of Computer Science and ISR University.
Pattern Matching with Acceleration Data Pramod Vemulapalli.
Building Efficient Time Series Similarity Search Operator Mijung Kim Summer Internship 2013 at HP Labs.
CMU SCS : Multimedia Databases and Data Mining Lecture #10: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
Multimedia and Time-series Data
BRAID: Discovering Lag Correlations in Multiple Streams Yasushi Sakurai (NTT Cyber Space Labs) Spiros Papadimitriou (Carnegie Mellon Univ.) Christos Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #12: Fractals - case studies Part III (quadtrees, knn queries) C. Faloutsos.
Fast Subsequence Matching in Time-Series Databases Author: Christos Faloutsos etc. Speaker: Weijun He.
E.G.M. PetrakisSearching Signals and Patterns1  Given a query Q and a collection of N objects O 1,O 2,…O N search exactly or approximately  The ideal.
Stream Monitoring under the Time Warping Distance Yasushi Sakurai (NTT Cyber Space Labs) Christos Faloutsos (Carnegie Mellon Univ.) Masashi Yamamuro (NTT.
Presented by Ho Wai Shing
Indexing Time Series. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image and.
Streaming Pattern Discovery in Multiple Time-Series Jimeng Sun Spiros Papadimitrou Christos Faloutsos PARALLEL DATA LABORATORY Carnegie Mellon University.
CMU SCS : Multimedia Databases and Data Mining Lecture #18: SVD - part I (definitions) C. Faloutsos.
Indexing Time Series. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Time Series databases Text databases.
FastMap : Algorithm for Indexing, Data- Mining and Visualization of Traditional and Multimedia Datasets.
WHAT IS DATA MINING?  The process of automatically extracting useful information from large amounts of data.  Uses traditional data analysis techniques.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
Keogh, E. , Chakrabarti, K. , Pazzani, M. & Mehrotra, S. (2001)
Time Series Indexing II
Fast Subsequence Matching in Time-Series Databases.
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Data Mining: Concepts and Techniques — Chapter 8 — 8
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Data Mining: Concepts and Techniques — Chapter 8 — 8
Data Mining: Concepts and Techniques — Chapter 8 — 8
Presentation transcript:

CMU SCS Data Mining on Streams Christos Faloutsos CMU

CMU SCS LLNL'06(c) C. Faloutsos, Thanks Dr. Deepay Chakrabarti (Yahoo) Dr. Spiros Papadimitriou (IBM) Prof. Byoung-Kee Yi (Pohang U.) Prof. Dimitris Gunopulos (UCR) Dr. Mengzhi Wang (Google)

CMU SCS LLNL'06(c) C. Faloutsos, For more info: 3h tutorial, at: BT04-tut/faloutsos-edbt04.pdf

CMU SCS LLNL'06(c) C. Faloutsos, Outline Motivation Similarity Search and Indexing DSP (Digital Signal Processing) Linear Forecasting Bursty traffic - fractals and multifractals Non-linear forecasting Conclusions

CMU SCS LLNL'06(c) C. Faloutsos, Problem definition Given: one or more sequences x 1, x 2, …, x t, … (y 1, y 2, …, y t, … … ) Find –similar sequences; forecasts –patterns; clusters; outliers

CMU SCS LLNL'06(c) C. Faloutsos, Motivation - Applications Financial, sales, economic series Medical –ECGs +; blood pressure etc monitoring –reactions to new drugs –elder care

CMU SCS LLNL'06(c) C. Faloutsos, Motivation - Applications (cont’d) ‘Smart house’ –sensors monitor temperature, humidity, air quality video surveillance

CMU SCS LLNL'06(c) C. Faloutsos, Motivation - Applications (cont’d) civil/automobile infrastructure –bridge vibrations [Oppenheim+02] – road conditions / traffic monitoring

CMU SCS LLNL'06(c) C. Faloutsos, Motivation - Applications (cont’d) Weather, environment/anti-pollution –volcano monitoring –air/water pollutant monitoring

CMU SCS LLNL'06(c) C. Faloutsos, Motivation - Applications (cont’d) Computer systems –‘Active Disks’ (buffering, prefetching) –web servers (ditto) –network traffic monitoring –...

CMU SCS LLNL'06(c) C. Faloutsos, Problem #1: Goal: given a signal (e.g.., #packets over time) Find: patterns, periodicities, and/or compress year count lynx caught per year (packets per day; temperature per day)

CMU SCS LLNL'06(c) C. Faloutsos, Problem#2: Forecast Given x t, x t-1, …, forecast x t Time Tick Number of packets sent ??

CMU SCS LLNL'06(c) C. Faloutsos, Problem#2’: Similarity search E.g.., Find a 3-tick pattern, similar to the last one Time Tick Number of packets sent ??

CMU SCS LLNL'06(c) C. Faloutsos, Differences from DSP/Stat Semi-infinite streams –we need on-line, ‘any-time’ algorithms Can not afford human intervention –need automatic methods sensors have limited memory / processing / transmitting power –need for (lossy) compression

CMU SCS LLNL'06(c) C. Faloutsos, Important observations Patterns, rules, forecasting and similarity indexing are closely related: To do forecasting, we need –to find patterns/rules –to find similar settings in the past to find outliers, we need to have forecasts –(outlier = too far away from our forecast)

CMU SCS LLNL'06(c) C. Faloutsos, Important topics NOT in this tutorial: Continuous queries –[Babu+Widom ] [Gehrke+] [Madden+] Categorical data streams –[Hatonen+96] Outlier detection (discontinuities) –[Breunig+00]

CMU SCS LLNL'06(c) C. Faloutsos, Outline Motivation Similarity Search and Indexing DSP Linear Forecasting Bursty traffic - fractals and multifractals Non-linear forecasting Conclusions

CMU SCS LLNL'06(c) C. Faloutsos, Outline Motivation Similarity Search and Indexing –distance functions: Euclidean;Time-warping –indexing –feature extraction DSP...

CMU SCS LLNL'06(c) C. Faloutsos, Euclidean and Lp... L 1 : city-block = Manhattan L 2 = Euclidean L 

CMU SCS LLNL'06(c) C. Faloutsos, day $price 1365 day $price 1365 day $price 1365 distance function: by expert

CMU SCS LLNL'06(c) C. Faloutsos, Idea: ‘GEMINI’ E.g.., ‘find stocks similar to MSFT’ Seq. scanning: too slow How to accelerate the search? [Faloutsos96]

CMU SCS LLNL'06(c) C. Faloutsos, day 1365 day 1365 S1 Sn F(S1) F(Sn) ‘GEMINI’ - Pictorially eg, avg eg,. std

CMU SCS LLNL'06(c) C. Faloutsos, GEMINI Solution: Quick-and-dirty' filter: extract n features (numbers, eg., avg., etc.) map into a point in n-d feature space organize points with off-the-shelf spatial access method (‘SAM’) discard false alarms

CMU SCS LLNL'06(c) C. Faloutsos, Examples of GEMINI Time sequences: DFT (up to 100 times faster) [SIGMOD94]; [Kanellakis+], [Mendelzon+]

CMU SCS LLNL'06(c) C. Faloutsos, Examples of GEMINI Even on other-than-sequence data: Images (QBIC) [JIIS94] tumor-like shapes [VLDB96] video [Informedia + S-R-trees] automobile part shapes [Kriegel+97]

CMU SCS LLNL'06(c) C. Faloutsos, Indexing - SAMs Q: How do Spatial Access Methods (SAMs) work? A: they group nearby points (or regions) together, on nearby disk pages, and answer spatial queries quickly (‘range queries’, ‘nearest neighbor’ queries etc) For example:

CMU SCS LLNL'06(c) C. Faloutsos, R-trees [Guttman84] eg., w/ fanout 4: group nearby rectangles to parent MBRs; each group -> disk page A B C D E F G H I J Skip

CMU SCS LLNL'06(c) C. Faloutsos, R-trees eg., w/ fanout 4: A B C D E F G H I J P1 P2 P3 P4 FGDEHIJABC Skip

CMU SCS LLNL'06(c) C. Faloutsos, R-trees eg., w/ fanout 4: A B C D E F G H I J P1 P2 P3 P4 P1P2P3P4 FGDEHIJABC Skip

CMU SCS LLNL'06(c) C. Faloutsos, R-trees - range search? A B C D E F G H I J P1 P2 P3 P4 P1P2P3P4 FGDEHIJABC Skip

CMU SCS LLNL'06(c) C. Faloutsos, R-trees - range search? A B C D E F G H I J P1 P2 P3 P4 P1P2P3P4 FGDEHIJABC Skip

CMU SCS LLNL'06(c) C. Faloutsos, Conclusions Fast indexing: through GEMINI –feature extraction and –(off the shelf) Spatial Access Methods [Gaede+98]

CMU SCS LLNL'06(c) C. Faloutsos, Outline Motivation Similarity Search and Indexing –distance functions –indexing –feature extraction DSP...

CMU SCS LLNL'06(c) C. Faloutsos, Outline Motivation Similarity Search and Indexing –distance functions –indexing –feature extraction DFT, DWT, DCT (data independent) SVD, etc (data dependent) MDS, FastMap

CMU SCS LLNL'06(c) C. Faloutsos, DFT and cousins very good for compressing real signals more details on DFT/DCT/DWT: later

CMU SCS LLNL'06(c) C. Faloutsos, Feature extraction SVD (finds hidden/latent variables) Random projections (works surprisingly well!)

CMU SCS LLNL'06(c) C. Faloutsos, Conclusions - Practitioner’s guide Similarity search in time sequences 1) establish/choose distance (Euclidean, time- warping,…) 2) extract features (SVD, DWT, MDS), and use a SAM (R-tree/variant) or a Metric Tree (M-tree) 2’) for high intrinsic dimensionalities, consider sequential scan (it might win…)

CMU SCS LLNL'06(c) C. Faloutsos, Books William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for SVD) C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to SVD, and GEMINI)

CMU SCS LLNL'06(c) C. Faloutsos, References Agrawal, R., K.-I. Lin, et al. (Sept. 1995). Fast Similarity Search in the Presence of Noise, Scaling and Translation in Time-Series Databases. Proc. of VLDB, Zurich, Switzerland. Babu, S. and J. Widom (2001). “Continuous Queries over Data Streams.” SIGMOD Record 30(3): Breunig, M. M., H.-P. Kriegel, et al. (2000). LOF: Identifying Density-Based Local Outliers. SIGMOD Conference, Dallas, TX. Berry, Michael:

CMU SCS LLNL'06(c) C. Faloutsos, References Ciaccia, P., M. Patella, et al. (1997). M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB. Foltz, P. W. and S. T. Dumais (Dec. 1992). “Personalized Information Delivery: An Analysis of Information Filtering Methods.” Comm. of ACM (CACM) 35(12): Guttman, A. (June 1984). R-Trees: A Dynamic Index Structure for Spatial Searching. Proc. ACM SIGMOD, Boston, Mass.

CMU SCS LLNL'06(c) C. Faloutsos, References Gaede, V. and O. Guenther (1998). “Multidimensional Access Methods.” Computing Surveys 30(2): Gehrke, J. E., F. Korn, et al. (May 2001). On Computing Correlated Aggregates Over Continual Data Streams. ACM Sigmod, Santa Barbara, California.

CMU SCS LLNL'06(c) C. Faloutsos, References Gunopulos, D. and G. Das (2001). Time Series Similarity Measures and Time Series Indexing. SIGMOD Conference, Santa Barbara, CA. Hatonen, K., M. Klemettinen, et al. (1996). Knowledge Discovery from Telecommunication Network Alarm Databases. ICDE, New Orleans, Louisiana. Jolliffe, I. T. (1986). Principal Component Analysis, Springer Verlag.

CMU SCS LLNL'06(c) C. Faloutsos, References Keogh, E. J., K. Chakrabarti, et al. (2001). Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. SIGMOD Conference, Santa Barbara, CA. Eamonn J. Keogh, Stefano Lonardi, Chotirat (Ann) Ratanamahatana: Towards parameter-free data mining. KDD 2004: Kobla, V., D. S. Doermann, et al. (Nov. 1997). VideoTrails: Representing and Visualizing Structure in Video Sequences. ACM Multimedia 97, Seattle, WA.

CMU SCS LLNL'06(c) C. Faloutsos, References Oppenheim, I. J., A. Jain, et al. (March 2002). A MEMS Ultrasonic Transducer for Resident Monitoring of Steel Structures. SPIE Smart Structures Conference SS05, San Diego. Papadimitriou, C. H., P. Raghavan, et al. (1998). Latent Semantic Indexing: A Probabilistic Analysis. PODS, Seattle, WA. Rabiner, L. and B.-H. Juang (1993). Fundamentals of Speech Recognition, Prentice Hall.

CMU SCS LLNL'06(c) C. Faloutsos, References Traina, C., A. Traina, et al. (October 2000). Fast feature selection using the fractal dimension,. XV Brazilian Symposium on Databases (SBBD), Paraiba, Brazil.

CMU SCS LLNL'06(c) C. Faloutsos, References Dennis Shasha and Yunyue Zhu High Performance Discovery in Time Series: Techniques and Case Studies Springer 2004 Yunyue Zhu, Dennis Shasha ``StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time'‘ VLDB, August, pp Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. The Design of an Acquisitional Query Processor for Sensor Networks. SIGMOD, June 2003, San Diego, CA.

CMU SCS LLNL'06(c) C. Faloutsos,

CMU SCS LLNL'06(c) C. Faloutsos, Outline Motivation Similarity Search and Indexing DSP (DFT, DWT) Linear Forecasting Bursty traffic - fractals and multifractals Non-linear forecasting Conclusions

CMU SCS LLNL'06(c) C. Faloutsos, Outline DFT –Definition of DFT and properties –how to read the DFT spectrum DWT –Definition of DWT and properties –how to read the DWT scalogram

CMU SCS LLNL'06(c) C. Faloutsos, Introduction - Problem#1 Goal: given a signal (eg., packets over time) Find: patterns and/or compress year count lynx caught per year (packets per day; automobiles per hour)

CMU SCS LLNL'06(c) C. Faloutsos, DFT: Amplitude spectrum year count Freq. Ampl. freq=12 freq=0 Amplitude:

CMU SCS LLNL'06(c) C. Faloutsos, DFT: Amplitude spectrum year count Freq. Ampl. freq=12 freq=0

CMU SCS LLNL'06(c) C. Faloutsos, DFT: Amplitude spectrum year count Freq. Ampl. freq=12 freq=0

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - DWT DFT is great - but, how about compressing a spike? value time

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - DWT DFT is great - but, how about compressing a spike? A: Terrible - all DFT coefficients needed! Freq. Ampl. value time

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - DWT DFT is great - but, how about compressing a spike? A: Terrible - all DFT coefficients needed! value time

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - DWT Similarly, DFT suffers on short-duration waves (eg., baritone, silence, soprano) time value

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - DWT Solution#1: Short window Fourier transform (SWFT) But: how short should be the window? time freq time value

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - DWT Answer: multiple window sizes! -> DWT time freq Time domain DFT SWFT DWT

CMU SCS LLNL'06(c) C. Faloutsos, Haar Wavelets subtract sum of left half from right half repeat recursively for quarters, eight-ths,...

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - construction x0 x1 x2 x3 x4 x5 x6 x7 Skip

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - construction x0 x1 x2 x3 x4 x5 x6 x7 s1,0 + - d1,0 s1,1d1, level 1 Skip

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - construction d2,0 x0 x1 x2 x3 x4 x5 x6 x7 s1,0 + - d1,0 s1,1d1, s2,0 level 2 Skip

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - construction d2,0 x0 x1 x2 x3 x4 x5 x6 x7 s1,0 + - d1,0 s1,1d1, s2,0 etc... Skip

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - construction d2,0 x0 x1 x2 x3 x4 x5 x6 x7 s1,0 + - d1,0 s1,1d1, s2,0 Q: map each coefficient on the time-freq. plane t f Skip

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - construction d2,0 x0 x1 x2 x3 x4 x5 x6 x7 s1,0 + - d1,0 s1,1d1, s2,0 Q: map each coefficient on the time-freq. plane t f Skip

CMU SCS LLNL'06(c) C. Faloutsos, Haar wavelets - code #!/usr/bin/perl5 # expects a file with numbers # and prints the dwt transform # The number of time-ticks should be a power of 2 # USAGE # haar.pl # the smooth component of the signal # the high-freq. component # collect the values into the = split ); } my $len = my $half = int($len/2); while($half >= 1 ){ for(my $i=0; $i< $half; $i++){ $diff [$i] = ($vals[2*$i] - $vals[2*$i + 1] )/ sqrt(2); print "\t", $diff[$i]; $smooth [$i] = ($vals[2*$i] + $vals[2*$i + 1] )/ sqrt(2); } print $half = int($half/2); } print "\t", $vals[0], "\n" ; # the final, smooth component

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - construction Observation1: ‘+’ can be some weighted addition ‘-’ is the corresponding weighted difference (‘Quadrature mirror filters’) Observation2: unlike DFT/DCT, there are *many* wavelet bases: Haar, Daubechies- 4, Daubechies-6, Coifman, Morlet, Gabor,...

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - how do they look like? E.g., Daubechies-4

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - how do they look like? E.g., Daubechies-4 ? ?

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - how do they look like? E.g., Daubechies-4

CMU SCS LLNL'06(c) C. Faloutsos, Outline Motivation Similarity Search and Indexing DSP –DFT –DWT Definition of DWT and properties how to read the DWT scalogram

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - Drill#1: t f Q: baritone/silence/soprano - DWT? time value

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - Drill#1: t f Q: baritone/soprano - DWT? time value

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - Drill#2: Q: spike - DWT? t f

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - Drill#2: t f Q: spike - DWT?

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - Drill#3: Q: weekly + daily periodicity, + spike - DWT? t f

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - Drill#3: Q: weekly + daily periodicity, + spike - DWT? t f

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - Drill#3: Q: weekly + daily periodicity, + spike - DWT? t f

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - Drill#3: Q: weekly + daily periodicity, + spike - DWT? t f

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - Drill#3: Q: weekly + daily periodicity, + spike - DWT? t f

CMU SCS LLNL'06(c) C. Faloutsos, Wavelets - Drill#3: Q: DFT? t f t f DWT DFT

CMU SCS LLNL'06(c) C. Faloutsos, Advantages of Wavelets Better compression (better RMSE with same number of coefficients - used in JPEG-2000) fast to compute (usually: O(n)!) very good for ‘spikes’ mammalian eye and ear: Gabor wavelets

CMU SCS LLNL'06(c) C. Faloutsos, Overall Conclusions DFT, DCT spot periodicities DWT : multi-resolution - matches processing of mammalian ear/eye better All three: powerful tools for compression, pattern detection in real signals All three: included in math packages –(matlab, ‘R’, mathematica, … - often in spreadsheets!)

CMU SCS LLNL'06(c) C. Faloutsos, Overall Conclusions DWT : very suitable for self-similar traffic DWT: used for summarization of streams [Gilbert+01], db histograms etc

CMU SCS LLNL'06(c) C. Faloutsos, Resources - software and urls FFTSpectrumAnalyser.html : Nice java applets for FFThttp:// FFTSpectrumAnalyser.html voice frequency analyzer (needs microphone)

CMU SCS LLNL'06(c) C. Faloutsos, Resources: software and urls xwpl: open source wavelet package from Yale, with excellent GUI /waveletDemos.html : wavelets and scalogramshttp://monet.me.ic.ac.uk/people/gavin/java /waveletDemos.html

CMU SCS LLNL'06(c) C. Faloutsos, Books William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT) C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to DFT, DWT)

CMU SCS LLNL'06(c) C. Faloutsos, Additional Reading [Gilbert+01] Anna C. Gilbert, Yannis Kotidis and S. Muthukrishnan and Martin Strauss, Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, VLDB 2001

CMU SCS LLNL'06(c) C. Faloutsos, skip to end

CMU SCS LLNL'06(c) C. Faloutsos, Outline Motivation Similarity Search and Indexing DSP Linear Forecasting Bursty traffic - fractals and multifractals Non-linear forecasting Conclusions

CMU SCS LLNL'06(c) C. Faloutsos, Forecasting "Prediction is very difficult, especially about the future." - Nils Bohr houghts.html

CMU SCS LLNL'06(c) C. Faloutsos, Outline Motivation... Linear Forecasting –Auto-regression: Least Squares; RLS –Co-evolving time sequences –Examples –Conclusions

CMU SCS LLNL'06(c) C. Faloutsos, Problem#2: Forecast Example: give x t-1, x t-2, …, forecast x t Time Tick Number of packets sent ??

CMU SCS LLNL'06(c) C. Faloutsos, Forecasting: Preprocessing MANUALLY: remove trends spot periodicities time 7 days

CMU SCS LLNL'06(c) C. Faloutsos, Problem#2: Forecast Solution: try to express x t as a linear function of the past: x t-2, x t-2, …, (up to a window of w) Formally: Time Tick ??

CMU SCS LLNL'06(c) C. Faloutsos, Linear Auto Regression: Number of packets sent (t-1) Number of packets sent (t) lag w=1 Dependent variable = # of packets sent (S [t]) Independent variable = # of packets sent (S[t-1]) ‘lag-plot’

CMU SCS LLNL'06(c) C. Faloutsos, More details: Q1: Can it work with window w>1? A1: YES! (we’ll fit a hyper-plane, then!) x t-2 xtxt x t-1

CMU SCS LLNL'06(c) C. Faloutsos, How to choose ‘w’? goal: capture arbitrary periodicities with NO human intervention on a semi-infinite stream

CMU SCS LLNL'06(c) C. Faloutsos, Answer: ‘AWSOM’ (Arbitrary Window Stream fOrecasting Method) [Papadimitriou+, vldb2003] idea: do AR on each wavelet level in detail:

CMU SCS LLNL'06(c) C. Faloutsos, AWSOM xtxt t t W 1,1 t W 1,2 t W 1,3 t W 1,4 t W 2,1 t W 2,2 t W 3,1 t V 4,1 time frequency =

CMU SCS LLNL'06(c) C. Faloutsos, AWSOM xtxt t t W 1,1 t W 1,2 t W 1,3 t W 1,4 t W 2,1 t W 2,2 t W 3,1 t V 4,1 time frequency

CMU SCS LLNL'06(c) C. Faloutsos, AWSOM - idea W l,t W l,t-1 W l,t-2 W l,t   l,1 W l,t-1   l,2 W l,t-2  … W l’,t’-1 W l’,t’-2 W l’,t’ W l’,t’   l’,1 W l’,t’-1   l’,2 W l’,t’-2  …

CMU SCS LLNL'06(c) C. Faloutsos, More details… Update of wavelet coefficients Update of linear models Feature selection –Not all correlations are significant –Throw away the insignificant ones (“noise”) (incremental) (incremental; RLS) (single-pass)

CMU SCS LLNL'06(c) C. Faloutsos, Results - Synthetic data Triangle pulse Mix (sine + square) AR captures wrong trend (or none) Seasonal AR estimation fails AWSOMARSeasonal AR

CMU SCS LLNL'06(c) C. Faloutsos, Results - Real data Automobile traffic –Daily periodicity –Bursty “noise” at smaller scales AR fails to capture any trend Seasonal AR estimation fails

CMU SCS LLNL'06(c) C. Faloutsos, Results - real data Sunspot intensity –Slightly time-varying “period” AR captures wrong trend Seasonal ARIMA –wrong downward trend, despite help by human!

CMU SCS LLNL'06(c) C. Faloutsos, Complexity Model update Space: O  lgN + mk 2   O  lgN  Time: O  k 2   O  1  Where –N: number of points (so far) –k:number of regression coefficients; fixed –m:number of linear models; O  lgN  Skip

CMU SCS LLNL'06(c) C. Faloutsos, Conclusions - Practitioner’s guide AR(IMA) methodology: prevailing method for linear forecasting Brilliant method of Recursive Least Squares for fast, incremental estimation. See [Box-Jenkins] recently: AWSOM (no human intervention)

CMU SCS LLNL'06(c) C. Faloutsos, Resources: software and urls MUSCLES: Prof. Byoung-Kee Yi: or free-ware: ‘R’ for stat. analysis (clone of Splus)

CMU SCS LLNL'06(c) C. Faloutsos, Books George E.P. Box and Gwilym M. Jenkins and Gregory C. Reinsel, Time Series Analysis: Forecasting and Control, Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.) Brockwell, P. J. and R. A. Davis (1987). Time Series: Theory and Methods. New York, Springer Verlag.

CMU SCS LLNL'06(c) C. Faloutsos, Additional Reading [Papadimitriou+ vldb2003] Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos Adaptive, Hands-Off Stream Mining VLDB 2003, Berlin, Germany, Sept [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE (Describes MUSCLES and Recursive Least Squares)

CMU SCS LLNL'06(c) C. Faloutsos, Outline Motivation Similarity Search and Indexing DSP (Digital Signal Processing) Linear Forecasting Bursty traffic - fractals and multifractals Non-linear forecasting On-going projects and Conclusions

CMU SCS LLNL'06(c) C. Faloutsos, On-going projects Lag correlations (BRAID, [SIGMOD’05]) Streaming SVD (SPIRIT, [VLDB’05]) tensor analysis ([KDD’06]) IP-from IP-to t=0

CMU SCS LLNL'06(c) C. Faloutsos, On-going projects Lag correlations (BRAID, [SIGMOD’05]) Streaming SVD (SPIRIT, [VLDB’05]) tensor analysis ([KDD’06]) t=0 t=1 t=2

CMU SCS LLNL'06(c) C. Faloutsos, Ongoing projects - ref’s [BRAID] Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos: BRAID: Stream Mining through Group Lag Correlations. SIGMOD 2005: , Baltimore, MD, USA. [SPIRIT] Spiros Papadimitriou, Jimeng Sun, Christos Faloutsos: Streaming Pattern Discovery in Multiple Time- Series. VLDB 2005: , Trodheim, Norway. [Tensors] Jimeng Sun Dacheng Tao Christos Faloutsos Beyond Streams and Graphs: Dynamic Tensor Analysis KDD 2006, Philadelphia, PA, USA.

CMU SCS LLNL'06(c) C. Faloutsos, Overall conclusions Similarity search: Euclidean/time-warping; feature extraction and SAMs

CMU SCS LLNL'06(c) C. Faloutsos, Overall conclusions Similarity search: Euclidean/time-warping; feature extraction and SAMs Signal processing: DWT is a powerful tool

CMU SCS LLNL'06(c) C. Faloutsos, Overall conclusions Similarity search: Euclidean/time-warping; feature extraction and SAMs Signal processing: DWT is a powerful tool Linear Forecasting: AR (Box-Jenkins) methodology; AWSOM

CMU SCS LLNL'06(c) C. Faloutsos, Overall conclusions Similarity search: Euclidean/time-warping; feature extraction and SAMs Signal processing: DWT is a powerful tool Linear Forecasting: AR (Box-Jenkins) methodology; AWSOM Bursty traffic: multifractals (80-20 ‘law’)

CMU SCS LLNL'06(c) C. Faloutsos, Overall conclusions Similarity search: Euclidean/time-warping; feature extraction and SAMs Signal processing: DWT is a powerful tool Linear Forecasting: AR (Box-Jenkins) methodology; AWSOM Bursty traffic: multifractals (80-20 ‘law’) Non-linear forecasting: lag-plots (Takens)

CMU SCS LLNL'06(c) C. Faloutsos, ‘Take home’ messages Hard, but desirable query for sensor data: ‘find patterns / outliers’ We need fast, automated such tools –Many great tools exist (DWT, ARIMA, …) –some are readily usable; others need to be made scalable / single pass/ automatic

CMU SCS LLNL'06(c) C. Faloutsos, For code, papers, questions etc: christos cs.cmu.edu