Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL. Bing Hu, Thanawin Rakthanmanon, Yuan Hao, Scott Evans, Stefano Lonardi, Eamonn Keogh. Department of Computer Science & Engineering. Reported by Wang Yawen.
Outline: Introduction; Definitions and Notation; MDL Modeling of Time Series; Algorithm; Experimental Evaluation; Complexity; Conclusion.
Introduction: Choose the best representation and abstraction level. Discover the natural intrinsic representation model, dimensionality, and alphabet cardinality of a time series. Select the best parameters for particular algorithms. An important subroutine in algorithms for classification, clustering, and outlier discovery. Minimum Description Length (MDL) framework.
Introduction: Dimension reduction: Discrete Fourier Transform (DFT), Discrete Wavelet Transform (DWT), Adaptive Piecewise Constant Approximation (APCA), Piecewise Linear Approximation (PLA). Choose the best abstraction level and/or representation of the data for a given task/dataset. Useful in its own right to understand/describe the data, and an important subroutine in algorithms for classification, clustering, and outlier discovery.
Introduction: Actual cardinality: 14, 500, 62. Intrinsic cardinality: 2, 2, 12.
Introduction: Objective. Not simply to save memory. There is increasing interest in using specialized hardware for data mining, but the complexity of implementing data mining algorithms in hardware typically grows superlinearly with the cardinality of the alphabet. Some data mining algorithms benefit from having the data represented in the lowest meaningful cardinality.
Introduction: Objective. Most time series indexing algorithms critically depend on the ability to reduce the dimensionality or the cardinality of the time series and to search over the compacted representation in main memory. Remove the spurious precision induced by a cardinality/dimensionality that is too high for resource-limited devices. Create very simple outlier detection models.
Introduction: MDL framework. Automatically discover the parameters that reflect the intrinsic model/cardinality/dimensionality of the data, without requiring external information or an expensive cross-validation search.
Definitions and Notations: MDL is defined for discrete values. Reduce the original number of possible values to a manageable amount. The quantization makes no perceptible difference.
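The quantization step can be illustrated with a minimal sketch. The min-max scaling, the default of 256 levels, and the function name `quantize` are assumptions made for illustration; the slides do not specify the exact discretization formula.

```python
def quantize(series, cardinality=256):
    """Map real values onto the integers 0 .. cardinality-1.

    Assumption: plain min-max scaling followed by rounding.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant series carries no amplitude information.
        return [0] * len(series)
    scale = (cardinality - 1) / (hi - lo)
    return [round((x - lo) * scale) for x in series]


if __name__ == "__main__":
    print(quantize([-1.0, 0.2, 1.0], cardinality=256))
```

After this step every value is a small integer, so bit counts can be computed for the series.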
Definitions and Notations: How many bits it takes to represent a time series T.
Definitions and Notations: Convert a given time series to another representation or model: DFT, APCA, PLA.
Definitions and Notations: DL(H): model cost. DL(T|H): correction cost (description cost or error term). DL(T|H) = DL(T-H).
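A minimal sketch of the two-part cost, DL(H) plus DL(T-H), follows. It assumes that a sequence is charged at its empirical entropy, which is one common way to count bits and may differ from the coding scheme used in the original work. The names `entropy_bits` and `two_part_cost` and the toy series are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Bits needed to code `values` at their empirical entropy (assumed scheme)."""
    m = len(values)
    return sum(n * math.log2(m / n) for n in Counter(values).values())


def two_part_cost(series, hypothesis, model_bits):
    """Model cost DL(H) plus correction cost DL(T - H)."""
    residual = [t - h for t, h in zip(series, hypothesis)]
    return model_bits + entropy_bits(residual)


# Toy data: a constant hypothesis at value 8, with 16 possible values,
# so the model itself costs log2(16) = 4 bits.
T = [7, 8, 9, 8, 7, 8, 9, 8]
H = [8] * len(T)
cost = two_part_cost(T, H, model_bits=math.log2(16))
```

A hypothesis is only worth adopting when this total is lower than the cost of the alternatives being compared.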
MDL Modeling of Time Series: APCA: mean = 8; 16 possible values, so DL(H) = 4 bits.
MDL Modeling of Time Series
Algorithm: Discover the intrinsic cardinality and dimensionality of an input time series. Find the right model or data representation for the given time series.
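The search for an intrinsic cardinality can be sketched as a loop over candidate cardinalities that keeps the one with the lowest total cost. This is an illustrative sketch, not the authors' pseudocode. It assumes the input is already an integer series in 0..255, that the model at cardinality c costs m * log2(c) bits, and that the residual is charged at its empirical entropy. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Bits needed to code `values` at their empirical entropy (assumed scheme)."""
    m = len(values)
    return sum(n * math.log2(m / n) for n in Counter(values).values())


def requantize(series, c, full=256):
    """Snap each value to one of c evenly spaced levels on the original scale."""
    top = full - 1
    levels = [round(t * (c - 1) / top) for t in series]
    return [round(v * top / (c - 1)) for v in levels]


def intrinsic_cardinality(series, full=256):
    """Return (c, cost) minimizing model bits plus residual bits."""
    m = len(series)
    best_c, best_cost = None, float("inf")
    for c in range(2, full + 1):
        hypothesis = requantize(series, c, full)
        residual = [t - h for t, h in zip(series, hypothesis)]
        cost = m * math.log2(c) + entropy_bits(residual)
        if cost < best_cost:
            best_c, best_cost = c, cost
    return best_c, best_cost


# A clean two-level signal: two levels describe it exactly.
c, cost = intrinsic_cardinality([0] * 10 + [255] * 10)
```

The same pattern extends to dimensionality by adding a second loop over the number of segments or coefficients and comparing the resulting totals.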
Algorithm: APCA. Constant lines. Dimensionality: m/2. d constant segments, and d-1 pointers to indicate the offset of the end of each segment.
Algorithm: PLA. Each segment is described by a starting value, an ending value, and an ending offset.
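The model costs implied by the APCA and PLA descriptions can be sketched as below. The bit widths are assumptions: log2(c) bits per stored value at cardinality c, and log2(m) bits per offset for a series of length m. The PLA sketch also assumes that the last segment's ending offset is implied by the series length, so only d-1 offsets are stored.

```python
import math


def apca_model_bits(d, c, m):
    """d constant-segment values plus d-1 end-of-segment offsets."""
    return d * math.log2(c) + (d - 1) * math.log2(m)


def pla_model_bits(d, c, m):
    """Per segment: a starting and an ending value; plus d-1 ending offsets."""
    return 2 * d * math.log2(c) + (d - 1) * math.log2(m)


if __name__ == "__main__":
    # One constant segment at cardinality 16 costs 4 bits,
    # matching the DL(H) = 4 example for a single mean value.
    print(apca_model_bits(1, 16, 24))
```

Under these assumptions a PLA model with the same number of segments always costs more than the APCA model, so it must reduce the correction cost enough to pay for the extra values.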
Algorithm: DFT. A linear combination of sine waves. Use half of the set of all coefficients, and subsets of that half to approximately regenerate T: sort by absolute value, keep the top-d coefficients, and apply the inverse DFT. Constant bits (32 bits) for the max and min value of the real parts and of the imaginary parts. Hence
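The top-d reconstruction described for DFT can be sketched with NumPy's real-input FFT, which returns only the non-redundant half of the coefficients. The function name `dft_top_d` is hypothetical, and the sketch covers only the approximation step, not the bit accounting.

```python
import numpy as np


def dft_top_d(series, d):
    """Approximate a real series from its d largest-magnitude DFT coefficients."""
    x = np.asarray(series, dtype=float)
    coeffs = np.fft.rfft(x)                          # half set of coefficients
    keep = np.argsort(np.abs(coeffs))[::-1][:d]      # sort by absolute value
    truncated = np.zeros_like(coeffs)
    truncated[keep] = coeffs[keep]                   # keep only the top d
    return np.fft.irfft(truncated, n=len(x))         # inverse DFT


# A pure sine wave is recovered from a single coefficient.
t = np.arange(64)
x = np.sin(2 * np.pi * 3 * t / 64)
approx = dft_top_d(x, 1)
```

The difference between the series and this reconstruction is the residual whose cost is added to the model cost.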
Experimental Evaluation: A detailed example on a famous problem. Baselines: L-Method (explains the residual error vs. size-of-model curve using all possible pairs of two regression lines) 10; Bayesian Information Criterion based method 4.
Experimental Evaluation: An example application in physiology.
Experimental Evaluation: An example application in astronomy: anomaly detector.
Experimental Evaluation: An example application in cardiology.
Experimental Evaluation: An example application in geosciences.
Complexity: Space complexity: linear in the size of the original data. Time complexity: O(m log2 m).
Conclusion: A simple methodology based on MDL. Robustly specifies the intrinsic model, cardinality, and dimensionality of time series data from a wide variety of domains. General and parameter-free.