DTW-D: Time Series Semi-Supervised Learning from a Single Example
Yanping Chen

Outline
Introduction
The proposed method
– The key idea
– When the idea works
Experiment

Introduction
Most research assumes there are large amounts of labeled training data. In reality, labeled data is often difficult or costly to obtain, whereas the acquisition of unlabeled data is trivial.
Example: a sleep study produces 40,000 heartbeats, but it requires cardiologists to label the individual heartbeats.

Introduction
The obvious solution is Semi-Supervised Learning (SSL). However, direct application of off-the-shelf SSL algorithms does not typically work well for time series.

Our Contribution
1. Explain why semi-supervised learning algorithms typically fail for time series problems.
2. Introduce a simple but very effective fix.

Outline
Introduction
The proposed method
– The key idea
– When the idea works
Experiment

SSL: Self-Training
Self-training algorithm:
1. Train the classifier on the labeled data P.
2. Use the classifier to classify the unlabeled data U.
3. Add the most confidently classified unlabeled points to the training set.
4. Retrain the classifier, and repeat until a stopping criterion is met.
Evaluation: the classifier is evaluated on a holdout dataset.
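The four steps above can be sketched as a toy self-training loop around a 1-NN classifier (for 1-NN, "retraining" is simply growing the labeled set; the `dist` helper and the list-of-sequences data layout are illustrative assumptions, not the authors' code):

```python
def dist(a, b):
    # Euclidean distance between equal-length sequences
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def self_train(labeled, labels, unlabeled, n_rounds=10):
    """Repeatedly move the unlabeled object nearest to the labeled set
    (the most confident 1-NN prediction) into the training set."""
    labeled, labels, unlabeled = list(labeled), list(labels), list(unlabeled)
    for _ in range(min(n_rounds, len(unlabeled))):
        # For each unlabeled object, find its nearest labeled neighbor.
        d, pred, i = min(
            (dist(u, p), labels[j], i)
            for i, u in enumerate(unlabeled)
            for j, p in enumerate(labeled))
        labeled.append(unlabeled.pop(i))  # step 3: add most confident point
        labels.append(pred)               # ...with its predicted label
    return labeled, labels
```

Note that whichever object is moved first keeps influencing every later round, which is why a single early false positive is so damaging to self-training.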

Two Conclusions from the Community
1) The most suitable classifier is the nearest neighbor (NN) classifier.
2) As a distance measure, DTW is exceptionally difficult to beat [1].
In time series SSL, we therefore use the NN classifier with the DTW distance. For simplicity, we consider one-class classification: a positive class and a negative class.
[1] Hui Ding, Goce Trajcevski, Peter Scheuermann, Xiaoyue Wang and Eamonn Keogh (2008). Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures. VLDB.
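As a reference point for the DTW distance used throughout, here is a textbook O(n·m) dynamic-programming implementation (squared point cost, no warping window; a sketch, not the authors' optimized code):

```python
import math

def dtw(a, b):
    """Dynamic Time Warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = cost of the best alignment of a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],      # stretch a
                                 D[i][j - 1],      # stretch b
                                 D[i - 1][j - 1])  # one-to-one match
    return math.sqrt(D[n][m])
```

Unlike ED, a time-stretched copy of a sequence is distance 0 from the original, e.g. `dtw([1, 2, 3, 4], [1, 1, 2, 3, 4, 4]) == 0.0`.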

Our Observation
1. Under certain assumptions, unlabeled negative objects can be closer to the labeled dataset than the unlabeled positive objects (d_neg < d_pos).
2. Nevertheless, unlabeled positive objects tend to benefit more from using DTW than unlabeled negative objects do.
3. The amount of benefit from DTW over ED is therefore a feature to be exploited.
The next four slides explain this.

Our Observation
Example: the labeled dataset P contains a single positive object, P1. The unlabeled dataset U contains two objects: U1, from the negative class, and U2, from the positive class.

Our Observation
Ask any SSL algorithm to choose one object from U to add to P using the Euclidean distance:
ED(P1, U1) = 6.2
ED(P1, U2) = 11
Since ED(P1, U1) < ED(P1, U2), SSL would pick the wrong one (the negative object U1). This is not surprising: as is well known, ED is brittle to warping [1].
[1] Keogh, E. (2002). Exact indexing of dynamic time warping. 28th International Conference on Very Large Data Bases (VLDB), Hong Kong.

Our Observation
What about replacing ED with the DTW distance?
DTW(P1, U1) = 5.8
DTW(P1, U2) = 6.1
DTW helps significantly, but still picks the wrong one. Why does DTW fail? Besides warping, there are other differences between P1 and U2; e.g., the first and last peaks have different heights. DTW cannot mitigate this.

Our Observation
ED: ED(P1, U1) = 6.2, ED(P1, U2) = 11
DTW: DTW(P1, U1) = 5.8, DTW(P1, U2) = 6.1
Under the DTW-Delta ratio r = DTW / ED:
r(P1, U1) = 5.8 / 6.2 ≈ 0.94
r(P1, U2) = 6.1 / 11 ≈ 0.55
The true positive U2 now has the smaller distance, so it is correctly selected.
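Plugging the slide's four distances into the ratio shows the selection flipping to the true positive (the ε guard is an assumption added to avoid dividing by zero when two sequences are identical):

```python
EPS = 1e-10  # assumed guard: ED can be 0 for identical sequences

def dtw_delta_ratio(dtw_dist, ed_dist):
    """r = DTW / ED: the smaller r, the more the object
    benefited from warping-invariant matching."""
    return dtw_dist / (ed_dist + EPS)

r_u1 = dtw_delta_ratio(5.8, 6.2)   # negative object U1: r ~ 0.94
r_u2 = dtw_delta_ratio(6.1, 11.0)  # positive object U2: r ~ 0.55
# Under r, U2 is now the nearest unlabeled object, so SSL picks it.
picked = "U2" if r_u2 < r_u1 else "U1"
```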

Why does DTW-D work?
Objects from the same class differ only by warping and noise:
ED = dis(warping) + dis(noise); DTW = dis(noise)
Objects from different classes also differ in shape:
ED = dis(shape difference) + dis(warping) + dis(noise); DTW = dis(shape difference) + dis(noise)

DTW-D Distance
DTW-D: the amount of benefit from using DTW over ED, i.e., the ratio DTW(x, y) / ED(x, y).

Outline
Introduction
The proposed method
– The key idea
– When the idea works
Experiment

When does DTW-D help? Two assumptions:
Assumption 1: The positive class contains warped versions of some platonic ideal, possibly with other types of noise/distortions.
Assumption 2: The negative class is diverse, and occasionally produces objects close to a member of the positive class, even under DTW.
Our claim: if the two assumptions are true for a given problem, DTW-D will be better than either ED or DTW.

When are our assumptions true?
Observation 1: The advantage conferred by Assumption 1 is mitigated by large amounts of labeled data.
Setup: U contains 1 positive object and 200 negative objects (random walks). We vary the number of objects in P from 1 to 10 and compute the probability that the selected unlabeled object is a true positive.
Result: when |P| is small, DTW-D is much better than DTW and ED; this advantage shrinks as |P| grows.

When are our assumptions true?
Observation 2: Assumption 2 is compounded by a large negative dataset.
Setup: P contains 1 positive object. U contains 1 positive object plus a negative dataset whose size we vary.
Result: when the negative dataset is large, DTW-D is much better than DTW and ED.

When are our assumptions true?
Observation 3: Assumption 2 is compounded by low-complexity negative data.
Setup: P contains 1 positive object. U contains 1 positive object plus negative data whose complexity we vary (e.g., random data with only 5 non-zero DFT coefficients vs. 20 non-zero DFT coefficients) [1].
Result: when the negative data are of low complexity, DTW-D is better than DTW and ED.
[1] Gustavo Batista, Xiaoyue Wang and Eamonn J. Keogh (2011). A Complexity-Invariant Distance Measure for Time Series. SDM.
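The "low-complexity" negatives can be emulated by keeping only the first k DFT coefficients of a random spectrum (a sketch under assumed generator details; the slides do not specify their exact generator):

```python
import numpy as np

def low_complexity_series(n, k, rng):
    """Random length-n series with only k non-zero (positive-frequency)
    DFT coefficients, z-normalized."""
    half = np.zeros(n // 2 + 1, dtype=complex)   # one-sided spectrum
    half[1:k + 1] = rng.standard_normal(k) + 1j * rng.standard_normal(k)
    x = np.fft.irfft(half, n)                    # back to a real series
    return (x - x.mean()) / x.std()              # z-normalize

rng = np.random.default_rng(0)
smooth = low_complexity_series(256, 5, rng)      # 5 non-zero coefficients
rough = low_complexity_series(256, 20, rng)      # 20 non-zero coefficients
```

With unit variance enforced, the 5-coefficient series varies far more slowly than the 20-coefficient one, which is exactly the "low complexity" that the observation exploits.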

Summary of assumptions. Check the given problem for:
– Positive class: warping; small amounts of labeled data
– Negative class: a large dataset, and/or low-complexity data

DTW-D and Classification
DTW-D helps SSL because:
– there are small amounts of labeled data
– the negative class is typically diverse and contains low-complexity data
DTW-D is not expected to help the classic classification problem, where:
– there is a large set of labeled training data
– no class has much lower complexity data, and/or much higher diversity, than the other class

Outline
Introduction
The proposed method
– The key idea
– When the idea works
Experiment

Experiments
Initial P:
– a single training example
– multiple runs, each time with a different training example
– report average accuracy
Evaluation: the classifier is evaluated on a holdout set for each size of |P|.
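The protocol above amounts to a simple averaging harness (`run_once`, which trains from one seed example and scores on the holdout set, is a hypothetical stand-in for the full SSL pipeline):

```python
import random

def average_accuracy(positive_pool, n_runs, run_once, seed=0):
    """Seed each run with a different single positive example and
    average the resulting holdout accuracies."""
    rng = random.Random(seed)
    seeds = rng.sample(positive_pool, n_runs)  # distinct seed examples
    accs = [run_once(s) for s in seeds]
    return sum(accs) / len(accs)
```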

Experiments: Insect Wingbeat Sound Detection
Positive class: Culex quinquefasciatus ♀ (1,000 objects)
Negative class: an unstructured audio stream (4,000 objects)
(Figure: accuracy of ED, DTW and DTW-D vs. the number of labeled objects in P, with two positive and two negative example signals shown.)

Comparison to Rival Methods
Our DTW-D starts with a single labeled example; both rivals, Wei's method [1] and Ratana's method [2], start with 51 labeled examples. The grey curve marks where each algorithm stops adding objects to the labeled set.
(Figure: accuracy of the classifier vs. the number of objects added to P.)
[1] L. Wei, E. Keogh. Semi-supervised time series classification. ACM SIGKDD, 2006.
[2] C. A. Ratanamahatana, D. Wanichsan. Stopping Criterion Selection for Efficient Semi-supervised Time Series Classification. SNPD: 1-14.

Experiments: Historical Manuscript Mining
Positive class: Fugger shields (64 objects)
Negative class: other image patches (1,200 objects)
(Figure: accuracy of ED, DTW and DTW-D vs. the number of labeled objects in P.)

Experiments: Activity Recognition
Dataset: PAMAP dataset [1] (9 subjects performing 18 activities)
Positive class: vacuum cleaning
Negative class: other activities
(Figure: accuracy of ED, DTW and DTW-D vs. the number of labeled objects in P.)
[1] PAMAP: Physical Activity Monitoring for Aging People.

Conclusions
We have introduced a simple idea that dramatically improves the quality of SSL in time series domains.
Advantages:
– parameter-free
– allows use of existing SSL algorithms; only a single line of code needs to be changed
Future work:
– revisiting the stopping-criterion issue
– considering other avenues where DTW-D may be useful

Thank you! Questions?
Contact author: Yanping Chen

DTW-D and Classification
DTW-D helps SSL because:
– there are small amounts of labeled data
– the negative class is typically diverse and contains low-complexity data
– even an occasional false positive is lethal for any SSL algorithm
DTW-D is not expected to help the classic classification problem, where:
– there is a large set of labeled training data
– no class has much lower complexity data, and/or much higher diversity, than the other class
– a few misclassifications do NOT hurt classification accuracy much

Are the assumptions commonly true in the real world?
Assumption 1: there is warping in the positive class.
Most real-world signals are not perfectly aligned in time. Walking patterns: one may walk slowly or quickly, and accelerate or decelerate within a single observation, so we do not expect two observations to be perfectly aligned. These variations in timing and speed result in warping in the data.
Assumption 2: the negative class is diverse.
Walking patterns: there are limited ways to 'walk', but unlimited ways to be 'not walking'. Audio: there are limited ways to sound like a mosquito wingbeat, but unlimited ways to not sound like one.

Why does DTW-D work?
1. Intra-class comparison:
– Objects come from the same model, so without noise they would be identical.
– In reality, they are corrupted by warping + noise.
– ED = dis(warping) + dis(noise); DTW = dis(noise).
– dis(warping) occupies a big portion, so DTW << ED.
2. Inter-class comparison:
– ED = dis(model difference) + dis(warping) + dis(noise); DTW = dis(model difference) + dis(noise).
– DTW < ED, but not <<, so the distance does not change much.

Why is DTW-D better?
(top) DTW-D helps training: use DTW-D for selecting training objects, but the same distance measure, DTW (or ED), for testing.
(bottom) DTW-D helps testing: use the same distance measure, DTW (or ED), for selecting training objects, but use DTW-D for testing.
(Figures: accuracy vs. the number of labeled objects in P for each setting.)

ED and DTW
