Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL. Bing Hu, Thanawin Rakthanmanon, Yuan Hao, Scott Evans, Stefano Lonardi, Eamonn Keogh.


Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL. Bing Hu, Thanawin Rakthanmanon, Yuan Hao, Scott Evans, Stefano Lonardi, Eamonn Keogh. Department of Computer Science & Engineering. Reported by Wang Yawen.

Outline  Introduction  Definitions and Notation  MDL Modeling of Time Series  Algorithm  Experimental Evaluation  Complexity  Conclusion

Introduction  Choose the best representation and abstraction level  Discover the natural intrinsic representation model, dimensionality and alphabet cardinality of a time series  Select the best parameters for particular algorithms  An important sub-routine in algorithms for classification, clustering and outlier discovery  Minimal Description Length(MDL) fame work

Introduction  Dimension reduction  Discrete Fourier Transform(DFT)  Discrete Wavelet Transform(DWT)  Adaptive Piecewise Constant Approximation(APCA)  Piecewise Linear Approximation(PLA)  Choose the best abstraction level and/or representation of the data for a given task/dataset  Useful in its own right to understand/describe the data and an important sub-routine in algorithms for classification, clustering and outlier discovery

Introduction  Actual cardinality: 14, 500, 62  Intrinsic cardinality: 2, 2, 12

Introduction  Objective  Not simply save memory  Increasing interest in using specialized hardware for data mining, but the complexity of implementing data mining algorithms in hardware typically grows super linearly with the cardinality of the alphabet  Some data mining benefit from having the data represented in the lowest meaningful cardinality

Introduction  Objective  Most time series indexing algorithms critically depend on the ability to reduce the dimensionality or the cardinality of the time series, and searching over the compacted representation in main memory  Remove the spurious precision induced by a cardinality/dimensionally that is too high in resource- limited devices  Create very simple outlier detection models

Introduction  MDL framework  Automatically discover the parameters that reflect the intrinsic model/cardinality/dimensionally of the data  Without requiring external information or expensive cross validation search

Definitions and Notations  MDL is defined for discrete values  Reduce the original number of possible values to a manageable amount  The quantization makes no perceptible difference

Definitions and Notations  Description length DL(T): how many bits it takes to represent a time series T

Definitions and Notations  Convert a given time series to other representation or model  DFT, APCA, PLA

Definitions and Notations  DL(H): model cost  DL(T|H): correction cost(description cost or error term)  DL(T|H) = DL(T-H)

MDL Modeling of Time Series

 APCA  Mean 8  16 possible values, DL(H) = 4

MDL Modeling of Time Series

Algorithm  Discover the intrinsic cardinality and dimensionality of an input time series  Find the right model or data representation for the given time series

Algorithm

 APCA  Constant lines  Dimensionality: m/2  d constant segments  d-1 pointers to Indicate the offset of the end of each segment

Algorithm  PLA  Starting value  Ending value  Ending offset

Algorithm  DFT  Linear combination of sine waves  Half set of all coefficients  Subsets of half coef to approximately regenerate T  Sort by absolute value  Use top-d coefficients  inverseDFT  Constant bits(32 bits) for max and min value of the real parts and of the imaginary parts Hence

Experimental Evaluation  A detailed example on a famous problem  Baseline  L-Method: explain the residual error vs. size-of-model curve using all possible pairs of two regression lines  10  Bayesian Information Criterion based method  4

Experimental Evaluation  An example application in physiology

Experimental Evaluation  An example application in astronomy  Anomaly detector

Experimental Evaluation  An example application in cardiology

Experimental Evaluation  An example application in geosciences

Complexity  Space complexity  Linear in the size of the original data  Time complexity  O(mlog 2 m)

Conclusion  Simple methodology based on MDL  Robustly specify the intrinsic model, cardinality and dimensionality of time series data from a wide variety of domains  General and parameter-free