Building Efficient Time Series Similarity Search Operator
Mijung Kim
Summer Internship 2013 at HP Labs
Overview
The internship project is part of a larger project that:
◦ builds a scalable analytics framework and
◦ constructs a set of analytic operators within the framework
Trade off performance against available resources
◦ Multiple implementations with different trade-offs for each operator
◦ A mechanism to choose an implementation given constraints
My goal is to build a time series similarity search operator
◦ Parallel data processing
◦ Alternative implementations of the time series similarity search
What is Time Series?
Time series data is a sequence of data points measured repeatedly over time
[Figure: example time series; image from Wikipedia]
Time Series Similarity Search
Given a time series database (T) and a query pattern (P), find the k nearest neighbors of the query in the database
◦ Time series database: T
◦ Query pattern: P, with query length m
◦ Time series segment: (T_i(j), …, T_i(j+m-1))
◦ Distance: computed between the query pattern and each time series segment
Cost: O(N_t * n * m), where N_t is the number of time series, n the time series length, and m the query length
◦ Linear in the query length, so inefficient for long queries!
Use cases: targeted marketing, anomaly detection, and many more…
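The naïve search described on this slide can be sketched in a few lines. This is only an illustration: the slide does not name the distance measure, so Euclidean distance is assumed here, and the function name `naive_knn` and the toy data are hypothetical.

```python
import heapq
import math


def naive_knn(database, query, k):
    """Return the k closest segments as (distance, series_index, offset) tuples.

    Assumes Euclidean distance; the slide only says "distance".
    """
    m = len(query)
    candidates = []
    for i, series in enumerate(database):
        # Slide a window of length m over the series.
        for j in range(len(series) - m + 1):
            # Point-by-point comparison: m operations per segment,
            # which is where the O(N_t * n * m) cost comes from.
            d = math.sqrt(sum((series[j + t] - query[t]) ** 2 for t in range(m)))
            candidates.append((d, i, j))
    return heapq.nsmallest(k, candidates)


# Toy database with two short series (made-up values).
database = [
    [0.0, 1.0, 2.0, 3.0, 2.0, 1.0],
    [5.0, 5.0, 1.0, 2.0, 3.0, 9.0],
]
query = [1.0, 2.0, 3.0]
top2 = naive_knn(database, query, k=2)
```

Every segment is compared against the full query, so doubling the query length roughly doubles the work.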
FFT (Fast Fourier Transform) based Search
Time series data in the time domain can be transformed to the frequency domain
◦ The distance can then be computed without a point-by-point comparison of every time series segment in the time domain
[Figure: convolution; image from Wikipedia]
Cost: O(N_t * n * log n), where N_t is the number of time series and n the time series length
◦ Independent of the query length
◦ The FFT of each time series can be precomputed and reused for every time series segment!
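One common way to realize this idea, not necessarily the exact method used in the project, is to expand the squared Euclidean distance as d²(j) = Σ T(j+t)² − 2 Σ T(j+t)P(t) + Σ P(t)² and obtain all sliding dot products Σ T(j+t)P(t) at once from a convolution computed with the FFT. The sketch below assumes Euclidean distance and uses a small textbook radix-2 FFT so that it needs only the standard library; all function names are hypothetical.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.

    The inverse transform is left unnormalized (the caller divides by len(x)).
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def next_pow2(n):
    size = 1
    while size < n:
        size *= 2
    return size


def precompute_series_fft(series):
    """FFT of the zero-padded series; computed once and reusable for any query with m <= n."""
    size = next_pow2(2 * len(series))
    return fft([complex(v) for v in series] + [0j] * (size - len(series)))


def sliding_distances(series, query, series_fft):
    """Euclidean distance between the query and every length-m segment of the series."""
    n, m = len(series), len(query)
    size = len(series_fft)
    # Convolving with the reversed query yields the sliding dot products.
    rev_q = [complex(v) for v in reversed(query)] + [0j] * (size - m)
    conv = fft([a * b for a, b in zip(series_fft, fft(rev_q))], inverse=True)
    dots = [conv[j + m - 1].real / size for j in range(n - m + 1)]
    sq = [v * v for v in series]
    window = sum(sq[:m])              # running sum of squares of the current segment
    qq = sum(v * v for v in query)
    dists = []
    for j in range(n - m + 1):
        d2 = window - 2.0 * dots[j] + qq
        dists.append(math.sqrt(max(d2, 0.0)))  # clamp tiny negative rounding errors
        if j + m < n:
            window += sq[j + m] - sq[j]
    return dists


series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
query = [1.0, 2.0, 3.0]
series_fft = precompute_series_fft(series)   # reusable across queries
dists = sliding_distances(series, query, series_fft)
```

Padding to a power of two of at least 2n is a design choice made here so that the precomputed transform does not depend on the query length.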
Time Series Search with MapReduce
[Diagram: the time series database is split into Time Series Partition_1 … Time Series Partition_n; Map_1 … Map_n each process one partition together with the query pattern; a Reducer produces the Top-K query result]
◦ Horizontally partitioned time series database
◦ Each mapper computes the distance between the query and each time series segment in its partition
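The data flow in this diagram can be imitated on a single machine with plain functions standing in for the mappers and the reducer. This is a sketch, not Hadoop code: the function names are hypothetical, Euclidean distance is assumed, and having each mapper emit only its local top-k is an assumption made here to keep the reducer input small.

```python
import heapq
import math


def distance(segment, query):
    # Euclidean distance (assumed; the slide does not name the measure).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(segment, query)))


def map_partition(partition, query, k):
    """Mapper: scan one partition and emit its local top-k (distance, series_id, offset)."""
    m = len(query)
    local = []
    for series_id, series in partition:
        for j in range(len(series) - m + 1):
            local.append((distance(series[j:j + m], query), series_id, j))
    return heapq.nsmallest(k, local)


def reduce_top_k(mapper_outputs, k):
    """Reducer: merge the local top-k lists into the global top-k."""
    return heapq.nsmallest(k, (item for out in mapper_outputs for item in out))


# Horizontally partitioned toy database: each partition holds whole series.
partitions = [
    [("a", [0.0, 1.0, 2.0, 3.0]), ("b", [9.0, 9.0, 9.0, 9.0])],
    [("c", [1.0, 2.0, 4.0, 8.0]), ("d", [1.0, 2.0, 3.0, 3.0])],
]
query = [1.0, 2.0, 3.0]
k = 2
result = reduce_top_k([map_partition(p, query, k) for p in partitions], k)
```

Because the global top-k is always contained in the union of the local top-k lists, the reducer sees at most k items per mapper instead of every distance.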
FFT-based vs. Naïve Search
FFT-based search cost is independent of the query length
◦ FFT-based search is more efficient for longer queries, but naïve search is better for shorter queries
- We can develop query plans based on the query length!
Single machine vs. cluster
◦ e.g., >15X gain in cluster mode
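A query planner along these lines could compare the two asymptotic costs directly. The sketch below is purely illustrative: the cost formulas are the O(n*m) and O(n*log n) terms from the earlier slides (the N_t factor is common to both and cancels), and `fft_constant` is a hypothetical tuning knob that would have to be measured on real hardware.

```python
import math


def estimated_cost(plan, n, m, fft_constant=1.0):
    """Rough per-series cost; constants are placeholders, not measurements."""
    if plan == "naive":
        return n * m
    if plan == "fft":
        return fft_constant * n * math.log2(n)
    raise ValueError("unknown plan: " + plan)


def choose_plan(n, m, fft_constant=1.0):
    """Pick the plan with the lower estimated cost (ties go to naive)."""
    return min(("naive", "fft"),
               key=lambda plan: estimated_cost(plan, n, m, fft_constant))


short_query_plan = choose_plan(n=1024, m=4)
long_query_plan = choose_plan(n=1024, m=64)
```

Under this toy model the crossover sits near m ≈ fft_constant * log2(n); the real crossover depends on the implementations and should be determined empirically.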
Lessons so far
- FFT proved efficient for the time series similarity search operation, but
◦ there are other, (theoretically) more efficient techniques for the time series similarity search operator, e.g., LSH (locality-sensitive hashing)
- Parallel data processing with MapReduce in a cluster environment helps, but
◦ it lacks the rich data analytic algorithms commonly supported by statistical software such as MATLAB and R
We investigate frameworks that support R with MapReduce as a general analytic operation framework
Why R + MapReduce?
Rich Data Analytics Algorithms and Graphics
Parallel Processing On Cluster Environment
- R is free software and a widely used programming language and environment for statistical computation, data analysis, and graphics
- R provides a wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible
In-memory computation in R is impractical for large-scale data analysis!
Parallel R (Split-apply-combine)
[Diagram: the input is split into partitions; R functions are applied to each partition; an aggregate function combines the results]
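The split-apply-combine pattern can be illustrated outside of R as well. In the sketch below, worker threads stand in for the parallel R instances, and the per-partition function (a partial sum and count, combined into a global mean) is a made-up example; none of the names come from the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_partitions):
    """Split: cut the input into roughly equal contiguous partitions."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def apply_stats(partition):
    """Apply: compute a partial result on one partition independently."""
    return (sum(partition), len(partition))


def combine(partials):
    """Combine: aggregate the partial results into the final answer."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


data = [float(v) for v in range(1, 11)]
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(apply_stats, split(data, 3)))
global_mean = combine(partials)
```

The pattern works when the applied function needs only its own partition and the aggregate can be computed from the partial results alone.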
Examples (R+MapReduce)
◦ Forecasting: inputs with different training periods → one R instance (forecast) per input → measure error
◦ Per-customer modeling: movie ratings of each customer → R function (ARIMA) → ARIMA (Autoregressive Integrated Moving Average) model of each customer
[Google parallelism, Stokely et al., JSM '11] [IBM Ricardo, Das et al., SIGMOD '10]
Time Series Search on RHIPE
[Diagram: the time series database is split into Time Series Partition_1 … Time Series Partition_n, processed by Map_1 … Map_n; a Reducer produces the Top-K query result for the query pattern. Each map task runs the FFT R function; labels along the data path between Java code and R code: rJava (R Java), Java BytesWritable, protocol buffer, R array]
RHIPE
- Open-source R package
- Provides an abstraction layer that allows users to formulate MapReduce jobs in R scripts
Summary
Data Analysis (R, MATLAB, C/C++)
Parallel Processing On Cluster Environment (Hadoop)
My Role
- Built a time series similarity operator for a scalable data analytics framework
- Worked with mentors: Jun Li (System) and Krishnamurthy Viswanathan (Data scientist)
- Served as a bridge between the parallel system and data analysis: designing parallel processing for data analytic algorithms and implementing the algorithms in a cluster environment
Conclusion (What I gained…)
Internship work
Research work (+ industry experience)
- Time series data analysis
- Mathematical techniques (FFT/LSH)
- Hadoop, JNI, …
- Parallel data processing
- Relational database
- Java, MATLAB, C/C++, R, …
- Machine learning algorithms
What's more…
- An invention disclosure regarding the time series similarity search filed at HP
- Networking with leading researchers in my research area