Building Efficient Time Series Similarity Search Operator Mijung Kim Summer Internship 2013 at HP Labs.


Overview
The internship project is part of a larger project that:
◦ builds a scalable analytics framework, and
◦ constructs a set of analytic operators within the framework.
Trade off performance against available resources:
◦ Multiple implementations with different trade-offs for each operator
◦ A mechanism to choose an implementation given constraints
My goal is to build a time series similarity search operator:
◦ Parallel data processing
◦ Alternative implementations of the time series similarity search

What is Time Series?
Time series data is a sequence of data points measured repeatedly over time.
Example: [figure omitted] (Image from Wikipedia, http://en.wikipedia.org/wiki/Time_series)

Time Series Similarity Search
Given a time series database (T) and a query pattern (P) of length m, find the k nearest neighbors of the query in the database.
[Diagram: the query pattern (P) is compared, by distance, against each time series segment (T_i(j), …, T_i(j+m-1)) in the time series database (T).]
Cost: O(N_t * n * m), where N_t is the number of time series, n the time series length, and m the query length.
Linear in the query length, so inefficient for large query lengths!
Use cases: targeted marketing, anomaly detection, and many more…
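The naïve search described on this slide can be sketched as follows. This is a minimal illustration, not the operator built during the internship: the slides do not name the distance measure, so Euclidean distance is assumed here, and the function name `naive_knn` is hypothetical.

```python
import heapq
import math


def naive_knn(database, query, k):
    """Return the k closest segments as (distance, series_index, offset) tuples.

    Assumption: Euclidean distance (the slides do not specify the measure).
    """
    m = len(query)
    candidates = []
    for i, series in enumerate(database):        # N_t time series
        for j in range(len(series) - m + 1):     # roughly n offsets per series
            # m point-by-point comparisons per segment
            sq = sum((series[j + t] - query[t]) ** 2 for t in range(m))
            candidates.append((math.sqrt(sq), i, j))
    return heapq.nsmallest(k, candidates)


database = [
    [0.0, 1.0, 2.0, 3.0, 2.0, 1.0],
    [5.0, 5.0, 1.0, 2.0, 3.0, 9.0],
]
query = [1.0, 2.0, 3.0]
result = naive_knn(database, query, k=2)
```

The three nested levels (series, offsets, query points) are where the O(N_t * n * m) cost comes from.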

FFT (Fast Fourier Transform) based Search
Time series data in the time domain can be transformed to the frequency domain.
◦ The distance can then be computed without a point-by-point comparison of each time series segment in the time domain.
(Image from Wikipedia, http://en.wikipedia.org/wiki/Convolution)
Cost: O(N_t * n * log n), where N_t is the number of time series and n the time series length.
Independent of the query length.
The FFT of each time series can be pre-computed and reused for every segment of that series!
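One way to realize this idea, assuming squared Euclidean distance (an assumption; the slides do not state the measure), is to expand the distance for a segment starting at offset j as sum(T²) − 2·corr(j) + sum(P²) and obtain the cross-correlation term for all offsets at once through the FFT. The sketch below uses a small pure-Python radix-2 FFT so that it is self-contained; all function names are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. Inverse is unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def series_spectrum(series):
    """Pre-process one series: zero-pad to a power of two and transform it once."""
    size = 1
    while size < len(series):
        size *= 2
    return fft(list(series) + [0.0] * (size - len(series)))


def fft_distances(series, spectrum, query):
    """Squared Euclidean distance from query to every length-m segment of series."""
    n, m, size = len(series), len(query), len(spectrum)
    q_spec = fft(list(query) + [0.0] * (size - m))
    # Correlation theorem: corr[j] = sum_t series[j + t] * query[t]
    prod = [s * q.conjugate() for s, q in zip(spectrum, q_spec)]
    corr = [v.real / size for v in fft(prod, inverse=True)]
    # Sliding sum of squares over each segment, via prefix sums
    prefix = [0.0]
    for v in series:
        prefix.append(prefix[-1] + v * v)
    q_energy = sum(v * v for v in query)
    return [prefix[j + m] - prefix[j] - 2.0 * corr[j] + q_energy
            for j in range(n - m + 1)]


series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
spectrum = series_spectrum(series)      # computed once, reusable across queries
dists = fft_distances(series, spectrum, [1.0, 2.0, 3.0])
best = min(range(len(dists)), key=dists.__getitem__)
```

The per-series work is dominated by the transforms, which is why the cost does not grow with the query length; `series_spectrum` corresponds to the pre-computed FFT that is reused.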

Time Series Search with MapReduce
[Diagram: the time series database is split into Time Series Partition_1 … Time Series Partition_n; each partition, together with the query pattern, goes to one mapper (Map_1 … Map_n); a Reducer produces the Top-K query result.]
◦ Horizontally partitioned time series database
◦ Compute the distance between each time series segment in the partition and the query
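The data flow in this diagram can be imitated in a single process, as sketched below. This is not Hadoop code: the mappers run sequentially here, the names `map_partition` and `reduce_top_k` are hypothetical, and squared Euclidean distance is an assumed measure.

```python
import heapq


def segment_distance(segment, query):
    """Squared Euclidean distance (assumed measure)."""
    return sum((a - b) ** 2 for a, b in zip(segment, query))


def map_partition(partition, query, k):
    """Mapper: scan one partition, emit its local top-k (distance, series_id, offset)."""
    m = len(query)
    local = []
    for series_id, series in partition:
        for j in range(len(series) - m + 1):
            local.append((segment_distance(series[j:j + m], query), series_id, j))
    return heapq.nsmallest(k, local)


def reduce_top_k(mapper_outputs, k):
    """Reducer: merge the local top-k lists into the global top-k."""
    return heapq.nsmallest(k, (item for out in mapper_outputs for item in out))


# Two horizontal partitions of a toy database; each entry is (series_id, values).
partitions = [
    [("a", [0.0, 1.0, 2.0, 3.0])],
    [("b", [9.0, 1.0, 2.0, 4.0]), ("c", [1.0, 2.0, 3.0, 7.0])],
]
query = [1.0, 2.0, 3.0]
top = reduce_top_k([map_partition(p, query, 2) for p in partitions], 2)
```

Emitting only a local top-k from each mapper keeps the data sent to the reducer small, since no segment outside a partition's local top-k can appear in the global top-k.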

FFT-based vs. Naïve Search
- FFT-based search cost is independent of the query length: it is efficient for larger query lengths, while naïve search is better for smaller query lengths
- We can develop query plans based on the query length!
- Single machine vs. cluster (e.g., >15X gain in cluster mode)
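A query plan chooser of this kind could compare the two asymptotic costs from the earlier slides, n * m for naïve search and n * log n for FFT-based search. The sketch below is only an illustration: the function `choose_plan` and the constant `fft_constant` are hypothetical, and a real crossover point would have to be measured.

```python
import math


def choose_plan(query_length, series_length, fft_constant=4.0):
    """Pick the cheaper plan under a simple per-series cost model.

    fft_constant is a hypothetical tuning factor standing in for the
    constant hidden by the O(n log n) bound; calibrate it by measurement.
    """
    naive_cost = series_length * query_length
    fft_cost = fft_constant * series_length * math.log2(series_length)
    return "naive" if naive_cost <= fft_cost else "fft"


short_plan = choose_plan(query_length=8, series_length=1024)
long_plan = choose_plan(query_length=512, series_length=1024)
```

With these toy numbers the short query is routed to naïve search and the long query to FFT-based search.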

Lessons so far
- FFT proved efficient for the time series similarity search operation, but
  → there are other, (theoretically) more efficient techniques for the time series similarity search operator, e.g., LSH (locality-sensitive hashing)
- Parallel data processing with MapReduce in a cluster environment helps, but
  → it lacks the rich data analytic algorithms commonly supported by statistical software such as MATLAB and R
We investigate frameworks that support R with MapReduce as a general analytic operation framework.

Why R + MapReduce?
Rich data analytics algorithms and graphics (R) + parallel processing on a cluster environment (MapReduce)
- R is free software and a widely used programming language and environment for statistical computing, data analysis, and graphics
- R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible
In-memory computation in R is impractical for large-scale data analysis!

Parallel R (Split-apply-combine)
[Diagram: the input is split into partitions; R functions are applied to each partition; an aggregate function combines the results.]
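The split-apply-combine pattern itself is language-independent. The sketch below shows it in Python rather than R, on invented toy data; the helper `split_apply_combine` is hypothetical, and the apply step runs sequentially here, whereas in a parallel setting each partition would be processed by a separate worker.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(records, key, apply, combine):
    """Split records by key, apply a function to each partition, combine the results."""
    partitions = defaultdict(list)
    for record in records:                                          # split
        partitions[key(record)].append(record)
    partials = {k: apply(part) for k, part in partitions.items()}   # apply
    return combine(partials)                                        # combine


# Toy data: (customer, rating) pairs.
ratings = [("alice", 5), ("bob", 2), ("alice", 3), ("bob", 3)]
best_customer = split_apply_combine(
    ratings,
    key=lambda r: r[0],
    apply=lambda part: mean(score for _, score in part),
    combine=lambda partials: max(partials, key=partials.get),
)
```

Because the apply function only sees one partition at a time, the partitions can be processed independently, which is what makes the pattern a natural fit for MapReduce.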

Examples (R+MapReduce)
[Diagram: inputs with different training periods are each sent to an R instance (forecast); the error is then measured.]
[Diagram: the movie ratings of each customer are passed to an R function (ARIMA), producing an ARIMA (Autoregressive Integrated Moving Average) model of each customer.]
[Googleparallelism, Stokely et al. JSM '11] [IBM Ricardo, Das et al. SIGMOD '10]

Time Series Search on RHIPE
[Diagram: the time series database is split into Time Series Partition_1 … Time Series Partition_n; each mapper (Map_1 … Map_n) takes its partition and the query pattern and calls an FFT R function; a Reducer produces the Top-K query result. Labels on the Java/R boundary: Java code, Java BytesWritable, protocol buffer, rJava (R Java), R array, R code.]
RHIPE (www.rhipe.org)
- Open-source R package
- Provides an abstraction layer that allows users to formulate MapReduce jobs in R scripts

Summary
[Diagram: "My Role" sits between Data Analysis (R, Matlab, C/C++) and Parallel Processing On Cluster Environment (Hadoop).]
- Built a time series similarity operator for a scalable data analytic framework
- Worked with mentors: Jun Li (System) and Krishnamurthy Viswanathan (Data scientist)
- Acted as a bridge between the parallel system and data analysis: designing parallel processing for data analytic algorithms and implementing the algorithms in a cluster environment

Conclusion (What I gained…)
[Diagram: Internship work and Research work (+ industry experience)]
- Time series data analysis
- Mathematical techniques (FFT/LSH)
- Hadoop, JNI, …
- Parallel data processing
- Relational database
- Java, MATLAB, C/C++, R, …
- Machine learning algorithms
What’s more…
- An invention disclosure on the time series similarity search, filed at HP
- A network of leading researchers in my research area
