Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL Bing Hu Thanawin Rakthanmanon Yuan Hao Scott Evans 1 Stefano Lonardi.

Slides:



Advertisements
Similar presentations
Panos Ipeirotis Stern School of Business
Advertisements

STATISTICS HYPOTHESES TEST (II) One-sample tests on the mean and variance Professor Ke-Sheng Cheng Department of Bioenvironmental Systems Engineering National.
Sales Forecasting using Dynamic Bayesian Networks Steve Djajasaputra SNN Nijmegen The Netherlands.
Using Text Effectively in the Biology Classroom
0 - 0.
DIVIDING INTEGERS 1. IF THE SIGNS ARE THE SAME THE ANSWER IS POSITIVE 2. IF THE SIGNS ARE DIFFERENT THE ANSWER IS NEGATIVE.
ENV Envisioning Information Lecture 2 Simple Graphs and Charts Ken Brodlie School of Computing University of Leeds.
EC220 - Introduction to econometrics (chapter 2)
The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 4: An Introduction To Data Mining (II) Johannes Gehrke
Learning Objectives for Section 3.2
Copyright Jiawei Han, modified by Charles Ling for CS411a
EC220 - Introduction to econometrics (chapter 1)
A probabilistic model for retrospective news event detection
SAX: a Novel Symbolic Representation of Time Series
Building a Conceptual Understanding of Algebra with Algebra Tiles
Configuration management
Machine Learning Math Essentials Part 2
CHAPTER 13: Alpaydin: Kernel Machines
Discrete Controller Design
S-Curves & the Zero Bug Bounce:
5-1 Chapter 5 Theory & Problems of Probability & Statistics Murray R. Spiegel Sampling Theory.
5.9 + = 10 a)3.6 b)4.1 c)5.3 Question 1: Good Answer!! Well Done!! = 10 Question 1:
Christopher Dougherty EC220 - Introduction to econometrics (review chapter) Slideshow: the central limit theorem Original citation: Dougherty, C. (2012)
EC220 - Introduction to econometrics (review chapter)
Christopher Dougherty EC220 - Introduction to econometrics (chapter 9) Slideshow: two-stage least squares Original citation: Dougherty, C. (2012) EC220.
Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data Thanawin Rakthanmanon Eamonn Keogh Stefano Lonardi Scott Evans.
Adding and Subtracting Integers
Christopher Dougherty EC220 - Introduction to econometrics (chapter 11) Slideshow: model c assumptions Original citation: Dougherty, C. (2012) EC220 -
Christopher Dougherty EC220 - Introduction to econometrics (chapter 8) Slideshow: model b: properties of the regression coefficients Original citation:
Christopher Dougherty EC220 - Introduction to econometrics (review chapter) Slideshow: one-sided t tests Original citation: Dougherty, C. (2012) EC220.
EC220 - Introduction to econometrics (chapter 1)
1 MAXIMUM LIKELIHOOD ESTIMATION OF REGRESSION COEFFICIENTS X Y XiXi 11  1  +  2 X i Y =  1  +  2 X We will now apply the maximum likelihood principle.
MODELS WITH A LAGGED DEPENDENT VARIABLE
EC220 - Introduction to econometrics (chapter 3)
EC220 - Introduction to econometrics (chapter 4)
This, that, these, those Number your paper from 1-10.
Dept of Biomedical Engineering, Medical Informatics Linköpings universitet, Linköping, Sweden A Data Pre-processing Method to Increase.
Addition 1’s to 20.
25 seconds left…...
Determining How Costs Behave
Week 1.
Christopher Dougherty EC220 - Introduction to econometrics (review chapter) Slideshow: asymptotic properties of estimators: the use of simulation Original.
CIS Based on text book: F.M. Dekking, C. Kraaikamp, H.P.Lopulaa, L.E.Meester. A Modern Introduction to Probability and Statistics Understanding.
DETC ASME Computers and Information in Engineering Conference ONE AND TWO DIMENSIONAL DATA ANALYSIS USING BEZIER FUNCTIONS P.Venkataraman Rochester.
Simple Linear Regression Analysis
Multiple Regression and Model Building
K-MEANS ALGORITHM Jelena Vukovic 53/07
1 Functions and Applications
Commonly Used Distributions
1 Zi Yang, Wei Li, Jie Tang, and Juanzi Li Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University, China {yangzi,
1 ECE 776 Project Information-theoretic Approaches for Sensor Selection and Placement in Sensor Networks for Target Localization and Tracking Renita Machado.
Relevance Feedback Retrieval of Time Series Data Eamonn J. Keogh & Michael J. Pazzani Prepared By/ Fahad Al-jutaily Supervisor/ Dr. Mourad Ykhlef IS531.
FTP Biostatistics II Model parameter estimations: Confronting models with measurements.
Locally Constraint Support Vector Clustering
Jessica Lin, Eamonn Keogh, Stefano Loardi
Time Series Bitmap Experiments This file contains full color, large scale versions of the experiments shown in the paper, and additional experiments which.
Time Series Bitmap Experiments This file contains full color, large scale versions of the experiments shown in the paper, and additional experiments which.
Time Series Anomaly Detection Experiments This file contains full color, large scale versions of the experiments shown in the paper, and additional experiments.
Time Series Data Analysis - II
Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL BING HU THANAWIN RAKTHANMANON YUAN HAO SCOTT EVANS1 STEFANO LONARDI EAMONN.
Quantitative Skills 1: Graphing
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. 1 Part 4 Curve Fitting.
Review of Building Multiple Regression Models Generalization of univariate linear regression models. One unit of data with a value of dependent variable.
Virtual Vector Machine for Bayesian Online Classification Yuan (Alan) Qi CS & Statistics Purdue June, 2009 Joint work with T.P. Minka and R. Xiang.
11/23/2015Slide 1 Using a combination of tables and plots from SPSS plus spreadsheets from Excel, we will show the linkage between correlation and linear.
NSF Career Award IIS University of California Riverside Eamonn Keogh Efficient Discovery of Previously Unknown Patterns and Relationships.
Global predictors of regression fidelity A single number to characterize the overall quality of the surrogate. Equivalence measures –Coefficient of multiple.
Mustafa Gokce Baydogan, George Runger and Eugene Tuv INFORMS Annual Meeting 2011, Charlotte A Bag-of-Features Framework for Time Series Classification.
Intelligent and Adaptive Systems Research Group A Novel Method of Estimating the Number of Clusters in a Dataset Reza Zafarani and Ali A. Ghorbani Faculty.
From: The evolution of star formation activity in galaxy groups
Presentation transcript:

Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL Bing Hu Thanawin Rakthanmanon Yuan Hao Scott Evans 1 Stefano Lonardi Eamonn Keogh {bhu002, rakthant, 1 {stelo, Department of Computer Science & Engineering University of California, Riverside 1 GE Global Research

Outline Introduction Modeling of MDL Application Examples All the figures in this slides can be generated by running our code. All the data can be downloaded from the paper webpage

Introduction Figure 1. Three unrelated industrial time series with low intrinsic cardinality. I) Evaporator (channel one). II) Winding (channel five). III) Dryer (channel one). The number of unique values in each time series is, from top to bottom, 14, 500 and 62. However, we might reasonably claim, that the intrinsic alphabet cardinality is instead 2, 2, and 12 respectively.

MDL Modeling of Time Series T = Figure 3. A sample time series T Figure 4. Time series T (blue/fine), approximated by a one-dimensional APCA approximation H 1 (red/bold). The error for this model is represented by the vertical lines The errors e 1 is: e 1 = Figure 5. Time series T (blue/fine), approximated by a two-dimensional APCA approximation H 2 (red/bold). Vertical lines represent the error for this model. the error e 2 is: e 2 = Clearly, we have not done yet ………..

MDL Modeling of Time Series we should also test H 3, H 4, H 5, etc., corresponding to 3, 4, 5, etc. piecewise constant segments. Moreover, we can also test alterative models corresponding to different levels of DFT or PLA representation. Figure 6. A time series T shown in bold/blue and three different models of it shown in fine/red: from left to right: DFT, APCA, and PLA. DFT APCA PLA

Application Example 1. Donoho-Johnstone Benchmark Figure 7. A version of the Donoho-Johnstone block benchmark created ten years ago and download from ftp://ftp.sas.com/pub/neural/dojo/dojo.html. Also, this data can be downloaded from our paper webpage.ftp://ftp.sas.com/pub/neural/dojo/dojo.html Donoho-Johnstone Benchmark

Application Example 1. Donoho-Johnstone Benchmark Figure 9. The description length of the Donoho-Johnstone block benchmark time series is minimized at a dimensionality corresponding to twelve piecewise constant segments, which is the correct answer Figure 10. The description length of the Donoho-Johnstone block benchmark time series is minimized with a cardinality of ten, which is the true cardinality The correct dimensionality and cardinality of this famous data are found correctly by using MDL!

Application Example 1. Donoho-Johnstone Benchmark We also tested several other methods, including a recent Bayesian Information Criterion based method that we found predicted a too coarse four-segment model We beat other methods! The following figure is by using L method, they found the dimensionality at 10,which is wrong [21] Salvador, S. and Chan, P Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms. International Conference on Tools with Artifical Intelligence, Figure 8. The knee finding L-Method top) A residual error vs. size-of-model curve (bold/blue) is modeled by all possible pairs of regression lines (light/red). Here just one possibility is shown. bottom) The location that minimizes the summed residual error of the two regression lines is given as the optimal knee.

Application Example 1. Donoho-Johnstone Benchmark Figure 11. The description length of the Donoho-Johnstone block benchmark time series is minimized with a piecewise constant model (APCA), not a piecewise linear model (PLA) or Fourier representation (DFT). APCA is the correct one! Which model is the intrinsic model?

Application Example 2. An Example in Physiology Figure 12. top) An excerpt from the Muscle dataset[16]. bottom) A zoomed-in section of the Muscle dataset which had its model, dimensionality and cardinality set by MDL [16] Mörchen, F. and Ultsch, A Optimizing time series discretization for knowledge discovery. KDD,

Application Example 2. An Example in Physiology Figure 13. left) The description length of the muscle activation time series is minimized with a cardinality of three, which is the correct answer. right) The Persist algorithm, using the code from [16] predicts a value of four. [16] Mörchen, F. and Ultsch, A Optimizing time series discretization for knowledge discovery. KDD,

Application Example 2. An Example in Physiology The dataset is from paper [16]Mörchen, F. and Ultsch, A Optimizing time series discretization for knowledge discovery. KDD, We download the dataset from the author webpage You can also download the data from our paper webpage.

Application Example 3. An Example Application in Astronomy: outlier detection Figure 14. top) The distribution of intrinsic dimensionalities of star light curves, estimated over 5,327 human-annotated examples. bottom) Three typical examples of the class RRL, and a high intrinsic dimensionality example, labeled as an outlier by [19]. [19]Protopapas, P., Giammarco, J. M., Faccioli, L., Struble, M. F., Dave, R., and Alcock, C Finding outlier light-curves in catalogs of periodic variable stars. Monthly Notices of the Royal Astronomical Society, 369 (2006), 677–696 We measured the intrinsic DFT dimensionally of all its members, finding one anomalous data that had a value of 31

Application Example 3. An Example Application in Astronomy: outlier detection The star light curve data can be downloaded from the paper webpage.

Application Example 4. An Example Application in Cardiologyanomalous data detection Application Example 4. An Example Application in Cardiology : anomalous data detection Figure 15. top) The distribution of intrinsic dimensionalities of individual heartbeats, estimated over the 200 normal examples the proceeded time zero in record 108 of the MIT BIH Arrhythmia Database (bottom). We measured the intrinsic DFT dimensionally of every heartbeat, finding the anomalous ECG that had a value of 31. Another advantage of MDL for anomalous data detection : B has significantly more noise than heartbeat A. MDL is essentially invariant to the noise heartbeat, which poises great difficulties for distance-based and density-based outlier detection methods. The original ECG charter from PhysioATM

Application Example 4. An Example Application in Cardiologyanomalous ECG detection Application Example 4. An Example Application in Cardiology : anomalous ECG detection The data is downloaded from in record 108 of the MIT BIH Arrhythmia Database. Also, the data can be downloaded from the paper webpage.