Fast Approximate Query Answering over Sensor Data with Deterministic Error Guarantees Chunbin Lin Joint with Etienne Boursier, Jacque Brito, Yannis Katsis,

Presentation transcript:

Fast Approximate Query Answering over Sensor Data with Deterministic Error Guarantees Chunbin Lin Joint with Etienne Boursier, Jacque Brito, Yannis Katsis, Yannis Papakonstantinou

Challenge & Motivation
  Sensor data is big: GB per minute.
  The DELPHI project collects a large amount of environmental and biomedical (spatiotemporal) data.
  Public health analysts need online analytical queries, e.g., cross-correlation between heart rate and temperature (for various participant subsets).
  Performance is too slow.
  Approximate answers with bounded errors are acceptable.
    Errors should be bounded by user-provided budgets, e.g., budget = 10%.
    Computation of the approximate answer should be efficient.

PlatoDB Overview
  PlatoDB provides efficient approximate query processing with deterministic error guarantees.
  Offline
    Construct a segment tree data structure. Each node is a segment represented by a linear scalable function that minimizes the Euclidean distance.
    Store error statistics for each segment.
  Online
    Use the error statistics to get a tight estimated error for each operator.
    Calculate query errors by combining the estimated operator errors.
    Navigate the tree structures and incrementally update the estimated error.

Data model
  Time series data: a sequence of (timestamp, value) pairs
    (2013-01-01 00:00:00, 15.72), (2013-01-01 00:00:01, 15.74), ..., (2016-12-30 23:59:59, 2.32)
  Assume fixed intervals and omit timestamps
    (15.72, 15.74, ..., 2.32)
  *Dina Q. Goldin, Paris C. Kanellakis: On Similarity Queries for Time-Series Data: Constraint Specification and Implementation. CP 1995: 137-153

Query
  Basic time series operators
    Serialize: Serialize(1.2, 5) creates a time series with 5 points whose values are all 1.2
    Shift: Shift(T, 2) moves the i-th point to the (i+2)-th position
    Plus: Plus(T1, T2) creates a new time series where d_i = d_i^(1) + d_i^(2)
    Minus: Minus(T1, T2) creates a new time series where d_i = d_i^(1) - d_i^(2)
    Times: Times(T1, T2) creates a new time series where d_i = d_i^(1) * d_i^(2)
  Aggregation operator
    Sum: Sum(T, a, b) returns the sum of the values d_i for i from a to b
  Arithmetic manipulation operators (e.g., +, -, x, /)
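
The operators above can be sketched in a few lines of Python. This is only an illustration, not PlatoDB's implementation: the function names are hypothetical, `shift` is assumed to pad the vacated positions with 0.0, and `sum_range` is assumed to use 1-based inclusive bounds.

```python
from typing import List

Series = List[float]


def serialize(value: float, n: int) -> Series:
    """Create a series of n points that all equal `value`."""
    return [value] * n


def shift(t: Series, k: int) -> Series:
    """Move the i-th point to position i+k (assumption: pad the front with 0.0)."""
    return [0.0] * k + list(t)


def _check_same_length(t1: Series, t2: Series) -> None:
    if len(t1) != len(t2):
        raise ValueError("series must have the same length")


def plus(t1: Series, t2: Series) -> Series:
    """Pointwise sum: d_i = d_i^(1) + d_i^(2)."""
    _check_same_length(t1, t2)
    return [a + b for a, b in zip(t1, t2)]


def minus(t1: Series, t2: Series) -> Series:
    """Pointwise difference: d_i = d_i^(1) - d_i^(2)."""
    _check_same_length(t1, t2)
    return [a - b for a, b in zip(t1, t2)]


def times(t1: Series, t2: Series) -> Series:
    """Pointwise product: d_i = d_i^(1) * d_i^(2)."""
    _check_same_length(t1, t2)
    return [a * b for a, b in zip(t1, t2)]


def sum_range(t: Series, a: int, b: int) -> float:
    """Sum of d_i for a <= i <= b (assumption: 1-based, inclusive bounds)."""
    return sum(t[a - 1:b])
```

With these helpers, a query such as Sum(Times(T1, T2)) over the whole series is `sum_range(times(t1, t2), 1, len(t1))`.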

Example Queries
  [Figure: example queries annotated with their time series operators, aggregation operators, and arithmetic manipulation.]

PlatoDB Architecture
  [Figure: architecture diagram.]
  Pre-Processing (offline): Sensor Data, Noise Removal (clean data), Segment Tree Generator, Time Series Segment Trees (compressed data, error statistics).
  Query Processing (online): Query Q and an error budget go to the Query Processor, which reads the segment trees for T1 and T2.
  Example tree: time series T1 has segment S1 at the root, with children S1.1 and S1.2; S1.1 has children S1.1.1 and S1.1.2.

Segment tree
  One tree structure for each time series
    The tree may not be balanced
    Each node represents one segment
    Lower-level nodes represent smaller segments
  How to build such a tree? Existing time series segmentation algorithms:
    Top-down
    Bottom-up
    Sliding-window
  How to represent each segment? Existing time series representation algorithms:
    Piecewise Aggregate Approximation (PAA)
    Adaptive Piecewise Constant Approximation (APCA)
    Piecewise Linear Representation (PLR)
    Discrete Wavelet Transform (DWT)
  *Fu, Tak-chung. "A review on time series data mining." Engineering Applications of Artificial Intelligence 24.1 (2011): 164-181.

Segment tree
  One tree structure for each time series
    The tree may not be balanced
    Each node represents one segment
    Lower-level nodes represent smaller segments
  Linear scalable family: a family of functions is linear scalable if, for every segment S, every function f of the family on S, and every subsegment S1 of S, the restriction f|S1 is still a function in the family.
    The polynomial function family is a linear scalable family.
  PlatoDB chooses a linear scalable function f to represent a segment, where f minimizes the Euclidean distance.
  Contribution 1: PlatoDB provides an O(n) algorithm to build the tree structure.
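
As an illustration of a segment represented by a function that minimizes the Euclidean distance, the sketch below fits a least-squares line to each segment and splits segments top-down at the midpoint. It is a naive construction, not the O(n) algorithm of Contribution 1; the `Node` class, the midpoint split, and the `min_len` and `tol` parameters are assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    start: int                      # inclusive index into the series
    end: int                        # exclusive index into the series
    slope: float                    # fitted line: intercept + slope * local_position
    intercept: float
    l2_error: float                 # Euclidean norm of the residual on this segment
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def fit_line(values: List[float], start: int, end: int) -> Tuple[float, float, float]:
    """Least-squares line over values[start:end], using local positions 0..n-1."""
    n = end - start
    mean_x = (n - 1) / 2
    mean_y = sum(values[start:end]) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (values[start + x] - mean_y) for x in range(n))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    squared = sum((values[start + x] - (intercept + slope * x)) ** 2 for x in range(n))
    return slope, intercept, math.sqrt(squared)


def build(values: List[float], start: int, end: int,
          min_len: int = 2, tol: float = 1e-9) -> Node:
    """Top-down construction: split at the midpoint while the fit is not exact."""
    slope, intercept, err = fit_line(values, start, end)
    node = Node(start, end, slope, intercept, err)
    if end - start > min_len and err > tol:
        mid = (start + end) // 2
        node.left = build(values, start, mid, min_len, tol)
        node.right = build(values, mid, end, min_len, tol)
    return node
```

For example, `build([0, 1, 2, 3, 10, 12, 14, 16], 0, 8)` yields a root with a nonzero error and two leaf children whose lines fit their halves exactly.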

Segment tree
  One tree structure for each time series
    The tree may not be balanced
    Each node represents one segment
    Lower-level nodes represent smaller segments
  Each node stores two kinds of values
    Coefficients of the estimation function in the orthonormal basis
    Error statistics
  PlatoDB stores by setting p = q = 2

Query Processing
  Contribution 2: PlatoDB provides a top-down query processing algorithm
    Update one node at a time and update the current error
    Terminate when the current error < error budget
  Consider query = (Sum(Times(T1, T2)), 10%)
  [Figure: segment trees for time series T1 and time series T2.]
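
The top-down loop described above can be sketched as follows for Sum(Times(T1, T2)). This is a simplified illustration under several assumptions that are not in the slides: each segment is approximated by its mean (a constant fit) rather than a general linear scalable function, both series are split at the same midpoints so that segments stay aligned, the budget is an absolute error rather than a percentage, and `stats` recomputes from the raw data what an offline tree would store per node. The per-segment bound uses the product of the two residual norms.

```python
import heapq
import math
from typing import List, Tuple


def stats(values: List[float], start: int, end: int) -> Tuple[float, float]:
    """Mean and Euclidean residual norm of values[start:end] under a constant fit."""
    n = end - start
    mean = sum(values[start:end]) / n
    err = math.sqrt(sum((v - mean) ** 2 for v in values[start:end]))
    return mean, err


def approx_sum_product(t1: List[float], t2: List[float],
                       budget: float) -> Tuple[float, float]:
    """Return (estimate, error bound) for sum(t1[i] * t2[i])."""
    if len(t1) != len(t2) or not t1:
        raise ValueError("series must be non-empty and of equal length")

    def make(start: int, end: int) -> Tuple[float, int, int, float]:
        m1, e1 = stats(t1, start, end)
        m2, e2 = stats(t2, start, end)
        # Heap entry: (negated bound, start, end, estimate on this segment).
        return (-(e1 * e2), start, end, (end - start) * m1 * m2)

    heap = [make(0, len(t1))]            # start from the root segment
    bound, estimate = -heap[0][0], heap[0][3]
    while bound > budget:
        neg_b, start, end, est = heapq.heappop(heap)  # segment with the largest bound
        if end - start == 1:
            break                        # single points are exact; nothing left to refine
        mid = (start + end) // 2
        left, right = make(start, mid), make(mid, end)
        heapq.heappush(heap, left)
        heapq.heappush(heap, right)
        # Incrementally update the running estimate and error bound.
        estimate += left[3] + right[3] - est
        bound += neg_b - left[0] - right[0]
    return estimate, max(bound, 0.0)
```

Refining the segment with the largest bound first is a greedy choice made for this sketch; other refinement orders would also terminate, since single-point segments have zero error.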

Error Estimation
  Contribution 3: PlatoDB provides minimal error estimation for each time series operator by using error statistics
    Error statistics are stored for each segment
  Aligned segments (same length and same start point)
    In PlatoDB, based on the orthogonal projection property
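
One way to see why the orthogonal projection property helps for aligned segments is the short derivation below, written for the product sum used in Sum(Times(T1, T2)). The notation is an assumption introduced here, not taken from the slides: d^(1) and d^(2) are the data vectors of the two aligned segments, f^(1) and f^(2) are their orthogonal projections onto the same function subspace, and e^(k) = d^(k) - f^(k) are the residuals.

```latex
\begin{align*}
\langle d^{(1)}, d^{(2)} \rangle
  &= \langle f^{(1)} + e^{(1)},\; f^{(2)} + e^{(2)} \rangle \\
  &= \langle f^{(1)}, f^{(2)} \rangle
   + \langle f^{(1)}, e^{(2)} \rangle
   + \langle e^{(1)}, f^{(2)} \rangle
   + \langle e^{(1)}, e^{(2)} \rangle \\
  &= \langle f^{(1)}, f^{(2)} \rangle + \langle e^{(1)}, e^{(2)} \rangle
  \qquad \text{(each residual is orthogonal to the subspace)}, \\
\bigl| \langle d^{(1)}, d^{(2)} \rangle - \langle f^{(1)}, f^{(2)} \rangle \bigr|
  &= \bigl| \langle e^{(1)}, e^{(2)} \rangle \bigr|
  \le \lVert e^{(1)} \rVert_2 \, \lVert e^{(2)} \rVert_2
  \qquad \text{(Cauchy--Schwarz)}.
\end{align*}
```

Under these assumptions, the Euclidean norm of the residual on each segment is enough to bound the error of the product sum deterministically. The exact statistics that PlatoDB stores per segment are not shown in this transcript.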

Error Estimation
  Contribution 3: PlatoDB provides minimal error estimation for each time series operator by using error statistics
    Error statistics are stored for each segment
  Unaligned segments (different lengths or different start points)
    Symbolic computation
  Contribution 4: PlatoDB provides an optimal algorithm in O(K1+K2) to find the segmentation with minimal error

Further Optimizations
  Performance-wise optimization
    Contribution 5: PlatoDB provides an incremental update segmentation algorithm that gives ratio compared with the optimal one. 1+ 2
  Space-wise optimization
    Contribution 6: PlatoDB can avoid storing the coefficients for the right nodes.
      Only red nodes store coefficients
      Coefficients can be deduced from the parent node and the left sibling node via an inverse basis matrix
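
The space-wise idea can be illustrated in its simplest special case, where each node stores only the mean of its segment (a constant fit). This is an assumption made for the example: the slide describes the general case through an inverse basis matrix, which is not reproduced here, and the function name is hypothetical.

```python
def right_mean(parent_mean: float, parent_len: int,
               left_mean: float, left_len: int) -> float:
    """Recover the right child's mean from the parent and left sibling.

    Uses the identity: parent_len * parent_mean
        = left_len * left_mean + right_len * right_mean.
    """
    right_len = parent_len - left_len
    if right_len <= 0:
        raise ValueError("left child must be strictly shorter than the parent")
    return (parent_mean * parent_len - left_mean * left_len) / right_len
```

For a parent of length 4 with mean 3.0 and a left child of length 2 with mean 2.0, the right child's mean is (12 - 4) / 2 = 4.0, so it does not need to be stored.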

Preliminary Experiment
  Data: 1 billion temperature and ozone data points.
  Query: correlation of T1 and T2

Techniques orthogonal to PlatoDB
  Synopses
    E.g., wavelets, histograms, sketches
    Tightly tied to specific classes of queries
    Queries are assumed to be known in advance
  Samples
    E.g., STRAT (single stratified sample), SciBORQ (biased samples), AQUA (single stratified sample), BlinkDB (multiple samples)

Sampling vs Compression
  Sampling: a subset of the original data
  Compression: use functions to model data