Building Efficient Time Series Similarity Search Operator Mijung Kim Summer Internship 2013 at HP Labs.

Presentation transcript:

Building Efficient Time Series Similarity Search Operator Mijung Kim Summer Internship 2013 at HP Labs

Overview
The internship project is part of a larger project that:
◦ builds a scalable analytics framework, and
◦ constructs a set of analytic operators within the framework
Trade off performance against available resources:
◦ Multiple implementations with different trade-offs for each operator
◦ A mechanism to choose an implementation given constraints
My goal is to build a time series similarity search operator:
◦ Parallel data processing
◦ Alternative implementations of the time series similarity search

What is a Time Series?
A time series is a sequence of data points measured repeatedly over time.
Example: [image from Wikipedia]

Time Series Similarity Search
Given a time series database (T) and a query pattern (P), find the k nearest neighbors of the query in the database.
[Figure labels: time series database (T), time series segment (T_i(j), …, T_i(j+m)), query pattern (P), query length (m), distance]
Cost: O(N_t * n * m), where N_t is the number of time series, n the time series length, and m the query length
Linear in the query length: inefficient for large query lengths!
Use cases: targeted marketing, anomaly detection, many more…
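The naïve search described on this slide can be sketched as follows. This is a minimal illustration, not the operator built during the internship: the squared Euclidean distance, the function name `naive_knn`, and the in-memory list-of-lists database are all assumptions, since the slide does not name the distance measure or the storage layout.

```python
import heapq


def naive_knn(database, query, k):
    """Return the k closest segments as (distance, series_index, offset).

    Assumes squared Euclidean distance; the slide only says "distance".
    Cost is O(N_t * n * m): every segment is compared point by point.
    """
    m = len(query)
    candidates = []
    for i, series in enumerate(database):
        # Slide a window of length m over the series.
        for j in range(len(series) - m + 1):
            d = sum((series[j + t] - query[t]) ** 2 for t in range(m))
            candidates.append((d, i, j))
    return heapq.nsmallest(k, candidates)


# Hypothetical toy data, for illustration only.
toy_db = [[1, 2, 3, 4, 5], [0, 2, 3, 9]]
nearest = naive_knn(toy_db, [2, 3], k=2)
```

The inner `sum` is the step whose cost grows with the query length m, which is what the FFT-based variant avoids.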

FFT (Fast Fourier Transform) based Search
Time series data in the time domain can be transformed to the frequency domain
◦ We can compute the distance without a point-by-point comparison of each time series segment in the time domain.
[Image from Wikipedia]
Cost: O(N_t * n * log n), where N_t is the number of time series and n the time series length
Independent of the query length
The FFT of each time series can be precomputed and reused for every segment of that series!
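One common way to realize this idea is sketched below, assuming squared Euclidean distance (the slides do not name the measure). The distance of every window expands into a window energy, a query energy, and a cross term, and the cross term for all offsets comes out of a single FFT-based convolution. The helper names `fft`, `sliding_dot`, and `sliding_sq_dist` are hypothetical, and the pure-Python radix-2 FFT is there only to keep the sketch dependency-free.

```python
import cmath


def fft(x, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], invert)
    odd = fft(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def sliding_dot(series, query):
    """Dot product of the query with every length-m window, via FFT."""
    n, m = len(series), len(query)
    size = 1
    while size < n + m:  # zero-pad so circular convolution equals linear
        size *= 2
    a = fft(list(series) + [0.0] * (size - n))
    b = fft(list(reversed(query)) + [0.0] * (size - m))
    prod = fft([u * v for u, v in zip(a, b)], invert=True)
    # Index m-1+j of the convolution holds sum_t series[j+t] * query[t].
    return [prod[m - 1 + j].real / size for j in range(n - m + 1)]


def sliding_sq_dist(series, query):
    """Squared Euclidean distance between the query and every window."""
    n, m = len(series), len(query)
    dots = sliding_dot(series, query)
    q_energy = sum(v * v for v in query)
    prefix = [0.0]  # prefix sums of squares give each window's energy
    for v in series:
        prefix.append(prefix[-1] + v * v)
    return [prefix[j + m] - prefix[j] - 2.0 * dots[j] + q_energy
            for j in range(n - m + 1)]


distances = sliding_sq_dist([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0])
```

In this sketch the transform of the padded series (`a`) does not depend on the query values, so with a fixed padding size it could be computed once per series and cached, which is in the spirit of the precomputation noted on the slide.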

Time Series Search with MapReduce
[Diagram: time series database, horizontally partitioned into Time Series Partition_1, Time Series Partition_2, …, Time Series Partition_n → Map_1, Map_2, …, Map_n (with the query pattern) → Reducer (Top-K) → query result]
Compute the distance between each time series segment in the partition and the query
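The map and reduce roles in this diagram can be imitated in plain Python. The sketch below does not use Hadoop. It assumes squared Euclidean distance and assumes each map task emits only its local top-k candidates (an optimization the slide does not state). `map_partition` and `reduce_top_k` are hypothetical names.

```python
import heapq


def map_partition(partition, query, k):
    """Map step: score every segment in one partition, keep the local top-k.

    `partition` is a list of (series_id, values) pairs.
    """
    m = len(query)
    scored = []
    for series_id, values in partition:
        for j in range(len(values) - m + 1):
            d = sum((values[j + t] - query[t]) ** 2 for t in range(m))
            scored.append((d, series_id, j))
    return heapq.nsmallest(k, scored)


def reduce_top_k(map_outputs, k):
    """Reduce step: merge the local candidates into the global top-k."""
    merged = (item for output in map_outputs for item in output)
    return heapq.nsmallest(k, merged)


# Hypothetical horizontally partitioned toy database.
partitions = [
    [("a", [1, 2, 3, 4, 5])],
    [("b", [0, 2, 3, 9])],
]
query = [2, 3]
outputs = [map_partition(p, query, k=2) for p in partitions]
result = reduce_top_k(outputs, k=2)
```

Emitting only k candidates per map task keeps the data sent to the single reducer small without changing the final answer, because any global top-k item is also in the top-k of its own partition.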

FFT-based vs. Naïve Search
FFT-based search cost is independent of the query length: it is efficient for larger query lengths, while naïve search is better for smaller query lengths
- We can develop query plans based on the query length!
Single machine vs. cluster (e.g., >15X gain in cluster mode)
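A query plan choice based on the query length could be as simple as comparing the two asymptotic costs quoted in the slides, O(N_t * n * m) and O(N_t * n * log n). The sketch below is only an assumption about how such a rule might look: the function `choose_plan` and the tuning constant `fft_overhead` are hypothetical, and a real planner would calibrate the crossover point from measurements.

```python
import math


def choose_plan(n, m, fft_overhead=1.0):
    """Pick "naive" or "fft" for series length n and query length m.

    N_t multiplies both costs, so it cancels out of the comparison.
    `fft_overhead` is a hypothetical constant standing in for the
    larger per-element cost of the FFT-based implementation.
    """
    naive_cost = n * m
    fft_cost = fft_overhead * n * math.log2(n)
    return "naive" if naive_cost <= fft_cost else "fft"


short_query_plan = choose_plan(n=1024, m=4)
long_query_plan = choose_plan(n=1024, m=256)
```

With these toy numbers the short query is routed to the naïve search and the long one to the FFT-based search, matching the trade-off stated on the slide.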

Lessons so far
- FFT has proven efficient for the time series similarity search operation, but
◦ there are other, (theoretically) more efficient techniques for the time series similarity search operator, e.g., LSH
- Parallel data processing with MapReduce in a cluster environment helps, but
◦ it lacks the rich data analytic algorithms commonly supported by statistical software such as MATLAB and R
We investigate frameworks that support R with MapReduce as a general analytic operation framework

Why R + MapReduce?
Rich Data Analytics Algorithms and Graphics + Parallel Processing On Cluster Environment
- R is free software and a widely used programming language and environment for statistical computing, data analysis, and graphics
- R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
In-memory computation in R is impractical for large-scale data analysis!

Parallel R (Split-apply-combine)
[Diagram: input → split into partitions → apply R functions to each partition → combine with an aggregate function]
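The split-apply-combine pattern is independent of R. A minimal sketch in Python, with hypothetical helper names and a mean computation chosen purely for illustration, looks like this:

```python
def split(data, n_parts):
    """Split the input into n_parts partitions (strided, for simplicity)."""
    return [data[i::n_parts] for i in range(n_parts)]


def split_apply_combine(data, n_parts, apply_fn, combine_fn):
    """Run apply_fn on each partition, then aggregate with combine_fn.

    On a cluster the apply step would run in parallel; here it is a
    sequential loop so the sketch stays self-contained.
    """
    partial_results = [apply_fn(part) for part in split(data, n_parts)]
    return combine_fn(partial_results)


def partial_sum_count(part):
    """Apply step: emit a (sum, count) pair for one partition."""
    return (sum(part), len(part))


def combine_mean(partials):
    """Combine step: aggregate the pairs into a global mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


mean = split_apply_combine(list(range(1, 11)), 3,
                           partial_sum_count, combine_mean)
```

The apply step returns (sum, count) pairs rather than per-partition means so that the combine step stays correct when partitions have different sizes.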

Examples (R+MapReduce)
[Diagram 1: input with different training periods → R instances (forecast) → measure error]
[Diagram 2: input (movie ratings of each customer) → R function (ARIMA) → ARIMA (Autoregressive Integrated Moving Average) model of each customer]
References: [Googleparallelism, Stokely et al. JSM ‘11] [IBM Ricardo, Das et al. SIGMOD ‘10]

Time Series Search on RHIPE
[Diagram: time series database (Time Series Partition_1, Time Series Partition_2, …, Time Series Partition_n) → Map_1, Map_2, …, Map_n (with the query pattern) → Reducer (Top-K) → query result]
[Labels inside the map tasks: Java code, Java BytesWritable, protocol buffer, rJava (R Java), R code, R array, FFT R function]
RHIPE
- Open-source R package
- Provides an abstraction layer that allows users to formulate MapReduce jobs in R scripts

Summary
[Diagram labels: Data Analysis (R, Matlab, C/C++); Parallel Processing On Cluster Environment (Hadoop)]
My Role
- Built a time series similarity operator for a scalable data analytic framework
- Worked with mentors Jun Li (System) and Krishnamurthy Viswanathan (Data scientist)
- Acted as a bridge between the parallel system and data analysis: designing parallel processing for data analytic algorithms and implementing the algorithms in a cluster environment

Conclusion (What I gained…)
Internship work + Research work (+ industry experience)
- Time series data analysis
- Mathematical techniques (FFT/LSH)
- Hadoop, JNI, …
- Parallel data processing
- Relational database
- Java, MATLAB, C/C++, R, …
- Machine learning algorithms
What’s more…
- An invention disclosure regarding the time series similarity search, filed at HP
- Networking with leading researchers in my research area