F AST A PPROXIMATE C ORRELATION FOR M ASSIVE T IME - SERIES D ATA SIGMOD’10 Abdullah Mueen, Suman Nath, Jie Liu 1.

Slides:



Advertisements
Similar presentations
GAMPS COMPRESSING MULTI SENSOR DATA BY GROUPING & AMPLITUDE SCALING
Advertisements

Correlation Search in Graph Databases Yiping Ke James Cheng Wilfred Ng Presented By Phani Yarlagadda.
Fast Algorithms For Hierarchical Range Histogram Constructions
A CTION R ECOGNITION FROM V IDEO U SING F EATURE C OVARIANCE M ATRICES Kai Guo, Prakash Ishwar, Senior Member, IEEE, and Janusz Konrad, Fellow, IEEE.
FTP Biostatistics II Model parameter estimations: Confronting models with measurements.
DFT/FFT and Wavelets ● Additive Synthesis demonstration (wave addition) ● Standard Definitions ● Computing the DFT and FFT ● Sine and cosine wave multiplication.
Dual-domain Hierarchical Classification of Phonetic Time Series Hossein Hamooni, Abdullah Mueen University of New Mexico Department of Computer Science.
Exact or stable image\signal reconstruction from incomplete information Project guide: Dr. Pradeep Sen UNM (Abq) Submitted by: Nitesh Agarwal IIT Roorkee.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
Chapter 8: The Discrete Fourier Transform
Deterministic Wavelet Thresholding for Maximum-Error Metrics Minos Garofalakis Bell Laboratories Lucent Technologies 600 Mountain Avenue Murray Hill, NJ.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
TRADING OFF PREDICTION ACCURACY AND POWER CONSUMPTION FOR CONTEXT- AWARE WEARABLE COMPUTING Presented By: Jeff Khoshgozaran.
Efficient Similarity Search in Sequence Databases Rakesh Agrawal, Christos Faloutsos and Arun Swami Leila Kaghazian.
L16: Micro-array analysis Dimension reduction Unsupervised clustering.
Abdullah Mueen UC Riverside Suman Nath Microsoft Research Jie Liu Microsoft Research.
Elastic Burst Detection: Applications Discovering intervals with an unusually large numbers of events. –In astrophysics, the sky is constantly observed.
Probabilistic Similarity Search for Uncertain Time Series Presented by CAO Chen 21 st Feb, 2011.
Visual Querying By Color Perceptive Regions Alberto del Bimbo, M. Mugnaini, P. Pala, and F. Turco University of Florence, Italy Pattern Recognition, 1998.
Basic Concepts and Definitions Vector and Function Space. A finite or an infinite dimensional linear vector/function space described with set of non-unique.
Finding Time Series Motifs on Disk-Resident Data
Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining Petros Drineas Rensselaer Polytechnic Institute (joint.
Spatial and Temporal Databases Efficiently Time Series Matching by Wavelets (ICDE 98) Kin-pong Chan and Ada Wai-chee Fu.
Random Projections of Signal Manifolds Michael Wakin and Richard Baraniuk Random Projections for Manifold Learning Chinmay Hegde, Michael Wakin and Richard.
Right Protection via Watermarking with Provable Preservation of Distance-based Mining Spyros Zoumpoulis Joint work with Michalis Vlachos, Nick Freris and.
Chapter 9 Numerical Integration Numerical Integration Application: Normal Distributions Copyright © The McGraw-Hill Companies, Inc. Permission required.
On Error Preserving Encryption Algorithms for Wireless Video Transmission Ali Saman Tosun and Wu-Chi Feng The Ohio State University Department of Computer.
Exact Indexing of Dynamic Time Warping
1 Chapter 8 The Discrete Fourier Transform 2 Introduction  In Chapters 2 and 3 we discussed the representation of sequences and LTI systems in terms.
Studies in Big Data 4 Weng-Long Chang Athanasios V. Vasilakos MolecularComputing Towards a Novel Computing Architecture for Complex Problem Solving.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
CSCE350 Algorithms and Data Structure Lecture 17 Jianjun Hu Department of Computer Science and Engineering University of South Carolina
Kaihua Zhang Lei Zhang (PolyU, Hong Kong) Ming-Hsuan Yang (UC Merced, California, U.S.A. ) Real-Time Compressive Tracking.
RESOURCES, TRADE-OFFS, AND LIMITATIONS Group 5 8/27/2014.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
Analysis of Algorithms CSCI Previous Evaluations of Programs Correctness – does the algorithm do what it is supposed to do? Generality – does it.
Fast Subsequence Matching in Time-Series Databases Author: Christos Faloutsos etc. Speaker: Weijun He.
Zhongguo Liu_Biomedical Engineering_Shandong Univ. Chapter 8 The Discrete Fourier Transform Zhongguo Liu Biomedical Engineering School of Control.
Spatio-temporal Pattern Queries M. Hadjieleftheriou G. Kollios P. Bakalov V. J. Tsotras.
k-Shape: Efficient and Accurate Clustering of Time Series
Mohamed Hefeeda 1 School of Computing Science Simon Fraser University, Canada Efficient k-Coverage Algorithms for Wireless Sensor Networks Mohamed Hefeeda.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
2005/12/021 Content-Based Image Retrieval Using Grey Relational Analysis Dept. of Computer Engineering Tatung University Presenter: Tienwei Tsai ( 蔡殿偉.
Event retrieval in large video collections with circulant temporal encoding CVPR 2013 Oral.
2005/12/021 Fast Image Retrieval Using Low Frequency DCT Coefficients Dept. of Computer Engineering Tatung University Presenter: Yo-Ping Huang ( 黃有評 )
Exact indexing of Dynamic Time Warping
A B C D E F A ABSTRACT A novel, efficient, robust, feature-based algorithm is presented for intramodality and multimodality medical image registration.
Stream Monitoring under the Time Warping Distance Yasushi Sakurai (NTT Cyber Space Labs) Christos Faloutsos (Carnegie Mellon Univ.) Masashi Yamamuro (NTT.
University of Macau Discovering Longest-lasting Correlation in Sequence Databases Yuhong Li Department of Computer and Information Science.
Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy Nurjahan BegumLiudmila Ulanova Jun Wang 1 Eamonn Keogh University.
D-skyline and T-skyline Methods for Similarity Search Query in Streaming Environment Ling Wang 1, Tie Hua Zhou 1, Kyung Ah Kim 2, Eun Jong Cha 2, and Keun.
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Resource Allocation in Hospital Networks Based on Green Cognitive Radios 王冉茵
DTFT continue (c.f. Shenoi, 2006)  We have introduced DTFT and showed some of its properties. We will investigate them in more detail by showing the associated.
Mesh Resampling Wolfgang Knoll, Reinhard Russ, Cornelia Hasil 1 Institute of Computer Graphics and Algorithms Vienna University of Technology.
Clustering Data Streams A presentation by George Toderici.
Dense-Region Based Compact Data Cube
Singular Value Decomposition and its applications
Computing and Compressive Sensing in Wireless Sensor Networks
Lecture 22: Linearity Testing Sparse Fourier Transform
Location Cloaking for Location Safety Protection of Ad Hoc Networks
Spatio-temporal Pattern Queries
Finding Fastest Paths on A Road Network with Speed Patterns
Chapter 8 The Discrete Fourier Transform
Example: Academic Search
Chapter 8 The Discrete Fourier Transform
 = N  N matrix multiplication N = 3 matrix N = 3 matrix N = 3 matrix
Chapter 8 The Discrete Fourier Transform
Presentation transcript:

F AST A PPROXIMATE C ORRELATION FOR M ASSIVE T IME - SERIES D ATA SIGMOD’10 Abdullah Mueen, Suman Nath, Jie Liu 1

O UTLINE Introduce Motivation Reducing I/O Costs Reducing Computational Cost Evaluation Conclusion 2

I NTRODUCE The increasing instrumentation of physical and computing processes has given us unprecedented capabilities to collect massive volumes of data. Data center management,Environmental monitoring,financial engineering Stream warehousing systems (SWS) Minimizing response times of ad-hoc queries in an SWS is very important for effective user interaction. 3

(C ONT.) A typical SWS, such as DataGarage, monitors multiple data centers, each of which may contain tens of thousands of servers. In this paper, we consider a more complex query of computing a correlation matrix. 4 (I)n signals of equal length m (II)compute a nxn matrix C (III) such that C[i; j] ; 1 <= i, j <= n is the correlation coefficient corr(i, j) of signals i and j

P EARSON ’ S CORRELATION COEFFICIENTS 5

(a) Correlation Matrix C Figure 1: Correlation matrices. Darker pixels represent higher correlation coefficients (e.g., a black pixel represents a corr. coeff. of 1). 6

(C ONT.) Fast computation of a large correlation matrix in this setting is challenging because of its high I/O and CPU overhead. High I/O costs High CPU costs To address the above challenges, we consider a slightly relaxed, yet almost equally useful, version of the problem 7

(C ONT.) Threshold Correlation Matrix (I)n signals of equal length m (II)compute a nxn matrix C (III) such that C[i; j] ; 1 <= i, j <= n is the correlation coefficient corr(i, j) of signals i and j. (I)n signals of equal length m (II)compute a nxn matrix C (III) such that C[i; j] ; 1 <= i, j <= n is the correlation coefficient corr(i, j) of signals i and j. 8

(b) Threshold Corr. Matrix CT (a) Correlation Matrix C Figure 1: Correlation matrices. Darker pixels represent higher correlation coefficients (e.g., a black pixel represents a corr. coeff. of 1). 9

M OTIVATION 10

R EDUCING I/O C OSTS To reduce I/O costs we propose a novel data partitioning algorithm that divides a given dataset into batches such that each batch fits in the available memory 11

R EDUCING I/O C OSTS A naïve algorithm would require computing correlation of all pairs of n signals and have an O( ) I/O cost. We can reduce the cost in at least two ways. Pruning : How does the algorithm know which signals are correlated? Intelligent Caching : How can one find a good caching strategy to minimize I/O costs? 12

( CONT.)I DENTIFYING C ORRELATED P AIRS We use Discrete Fourier Transform (DFT) to identify correlated signal pairs in an I/O efficient manner. As the following lemma suggests, the correlation coefficient of signals can be reduced to the Euclidean distance between their normalized series. 13

(C ONT.)L EMMA 1 & L EMMA 2 14 Lemma 1 : Lemma 2(higher than threshold) :

(C ONT.) By ignoring such pairs, we will get a set of likely correlated signal pairs. This is a superset of the correlated signal pairs, but there will be no false negatives. In real-world, the first few low frequency DFT coefficients are sufficient to capture the overall shape of a signal. Since k << m, we can maintain the coefficients in cache. 15

16

(C ONT.)C ACHING S TRATEGY 17

(C ONT.) 18

R EDUCING CPU C OSTS Two novel approximation algorithms. First approximation algorithm computes correlation coefficients within a given error bound Second algorithm identifies signal pairs with correlation above a threshold without any false positives or false negatives. 19

(C ONT.)T WO F ACTS It is often sufficient to report correlation coefficients within a small error bound. It may often be sufficient to only identify all and only the pairs with correlation above a given threshold; knowing the exact correlation coefficients is not strictly necessary. 20

(C ONT.) A PPROXIMATE T HRESHOLD C ORRELATION M ATRIX 21 (I)n signals of equal length m,a threshold T,and an error bound (II)compute a nxn matrix C (III) such that C[i; j] ; 1 <= i, j <= n is the correlation coefficient corr(i, j ) of signals i and j

(C ONT.) 22 Figure 1(b) compare

(C ONT.) A PPROXIMATION WITH P REFIX D ISTANCE Thus, one way to approximate corr(x; y) is to use an approximate value for d(bx;by ); e.g., dk(bx;by), for k < m. For significant savings in computational complexity, it is important that k << m. Unfortunately such approximation does not work well in the time domain. 23

(C ONT.) We therefore approximate distance in the frequency domain. Since DFT is a linear transformation, the Euclidean distance is preserved under the transformation. Therefore This allows us to rephrase Lemma 1 as follows: 24

(C ONT.) Thus, to minimize overall computation cost while guaranteeing a given approximation error bound, we need to compute the smallest number k of DFT coefficients that ensure the target accuracy. Our main result in this section shows a relationship between the approximation error in a correlation coefficient and the number of leading DFT coefficients used for approximating the coefficient 25

(C ONT.)E NERGY OF A SIGNAL The energy of a signal x is defined as Lemma 3 26

(C ONT.)L EMMA 4 27

(C ONT.) 28

T HRESHOLD B OOLEAN C ORRELATION M ATRIX 29 (I)n signals of equal length m,a threshold T (II)Compute a nxn matrix C The second type of approximation we consider is to identify if a pair of signals has correlation above a given threshold T.

(C ONT.) Figure 1(b) compare 30

(C ONT.)L EMMA 5 31 The proof of the lemma follows Lemma 2. Obviously, the tighter the upper and the lower bounds are, the more likely it is to determine the correlation status of a pair. how do we efficiently find good bounds on distances of two signals?

(C ONT.)L EMMA 6 32 Unfortunately, apart from being computationally expensive, the resulting upper bounds are not tight and are practically useless because a random reference signal is expected to be equally far from all of the signals in a very high (= m) dimensional space.

(C ONT.) A D YNAMIC P ROGRAMMING A LGORITHM Ideally, for a pair of signal x and y, we want a third signal v (called a verifier) that is either very close to both x and y implying that x and y are likely to be close, and hence correlated to each other very close to x but not to y (or vice versa), implying that x and y are not correlated. 33

E VALUATION 34

35

C ONCLUSION We have proposed novel algorithms, based on Discrete Fourier Transform (DFT) and graph partitioning, to reduce the end-to-end response time of an all-pair correlation query. 36