Fast Approximate Query Answering over Sensor Data with Deterministic Error Guarantees
Chunbin Lin
Joint with Etienne Boursier, Jacque Brito, Yannis Katsis, Yannis Papakonstantinou
Challenge & Motivation
Sensor data is big: GB per minute
  The DELPHI project collects a large amount of environmental and biomedical (spatiotemporal) data
Public health analysts need online analytical queries
  E.g., cross-correlation between heart rate and temperature (for various participant subsets)
  Performance is too slow
Approximate answers with bounded errors are acceptable
  Errors should be bounded by user-provided budgets, e.g., budget = 10%
  Computation of the approximate answer should be efficient
PlatoDB Overview
PlatoDB provides efficient approximate query processing and deterministic error guarantees
Offline
  Construct a segment tree data structure. Each node is a segment that is represented by a linear scalable function minimizing the Euclidean distance
  Store error statistics for each segment
Online
  Use the error statistics to get a tight estimated error for each operator
  Calculate query errors by combining the estimated operator errors
  Navigate the tree structures and incrementally update the estimated error
Data model
Time series data: a sequence of (timestamp, value) pairs
Assume fixed intervals and omit timestamps
  (2013-01-01 00:00:00, 15.72), (2013-01-01 00:00:01, 15.74), ..., (2016-12-30 23:59:59, 2.32)
  becomes (15.72, 15.74, ..., 2.32)
*Dina Q. Goldin, Paris C. Kanellakis: On Similarity Queries for Time-Series Data: Constraint Specification and Implementation. CP 1995: 137-153
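Because the intervals are fixed, a timestamp can be recovered from a point's position, which is why dropping timestamps loses no information. A minimal sketch, assuming a known start time and sampling interval; `timestamp_at` is a hypothetical helper introduced here for illustration, not part of PlatoDB:

```python
from datetime import datetime, timedelta

def timestamp_at(start: datetime, interval: timedelta, i: int) -> datetime:
    """Recover the timestamp of the i-th point (0-indexed) of a fixed-interval series."""
    return start + i * interval

# Values only; timestamps are implied by position (assumed 1-second interval).
values = [15.72, 15.74]
start = datetime(2013, 1, 1, 0, 0, 0)
step = timedelta(seconds=1)

# Rebuild the (timestamp, value) pairs on demand.
pairs = [(timestamp_at(start, step, i), v) for i, v in enumerate(values)]
```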
Query
Basic time series operators
  Serialize: Serialize(1.2, 5) creates a time series with 5 points whose values are all 1.2
  Shift: Shift(T, 2) moves the i-th point to the (i+2)-th position
  Plus: Plus(T1, T2) creates a new time series where d_i = d_i^(1) + d_i^(2)
  Minus: Minus(T1, T2) creates a new time series where d_i = d_i^(1) - d_i^(2)
  Times: Times(T1, T2) creates a new time series where d_i = d_i^(1) * d_i^(2)
Aggregation operator
  Sum: Sum(T, a, b) returns the value of d_a + d_(a+1) + ... + d_b
Arithmetic manipulation operators (e.g., +, -, x, /)
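The operators above can be sketched over plain Python lists. This is an exact (non-approximate) reference version for illustration only, not PlatoDB's implementation. Assumptions not stated on the slide: `shift` keeps the series length, fills vacated leading positions with 0.0, and takes a non-negative offset; binary operators take equal-length inputs; `sum_range` uses 1-indexed, inclusive bounds.

```python
from typing import List

Series = List[float]

def serialize(value: float, n: int) -> Series:
    """Serialize(v, n): a series of n points, all equal to v."""
    return [value] * n

def shift(t: Series, k: int) -> Series:
    """Shift(T, k): move the i-th point to position i+k (k >= 0).
    Assumption: length is preserved and vacated positions are 0.0."""
    if k >= len(t):
        return [0.0] * len(t)
    return [0.0] * k + t[:len(t) - k]

def plus(t1: Series, t2: Series) -> Series:
    return [x + y for x, y in zip(t1, t2)]

def minus(t1: Series, t2: Series) -> Series:
    return [x - y for x, y in zip(t1, t2)]

def times(t1: Series, t2: Series) -> Series:
    return [x * y for x, y in zip(t1, t2)]

def sum_range(t: Series, a: int, b: int) -> float:
    """Sum(T, a, b): sum of points a..b (assumed 1-indexed, inclusive)."""
    return sum(t[a - 1:b])

# A query such as Sum(Times(T1, T2)) is a composition of these operators.
t1 = [1.0, 2.0, 3.0, 4.0]
t2 = [2.0, 0.0, 1.0, 3.0]
dot = sum_range(times(t1, t2), 1, 4)
```

Statistics such as correlation can then be expressed as arithmetic over a few of these `Sum` results.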
Example Queries
[Figure: example queries annotated with their time series operators, aggregation operators, and arithmetic manipulation]
PlatoDB Architecture
[Figure: PlatoDB architecture. Offline pre-processing: sensor data, noise removal, clean data, segment tree generator, time series segment trees (compressed data and error statistics). Online query processing: query Q and error budget, query processor, segment trees. Example: time series T1 and its segment tree with nodes S1, S1.1, S1.2, S1.1.1, S1.1.2; segment trees for T1 and T2.]
Segment tree
One tree structure for each time series
  Each node represents one segment
  Lower-level nodes represent smaller segments
  The tree may not be balanced
How to build such a tree? Existing time series segmentation algorithms
  Top-down
  Bottom-up
  Sliding-window
How to represent each segment? Existing time series representation algorithms
  Piecewise Aggregate Approximation (PAA)
  Adaptive Piecewise Constant Approximation (APCA)
  Piecewise Linear Representation (PLR)
  Discrete Wavelet Transform (DWT)
*Fu, Tak-chung. "A review on time series data mining." Engineering Applications of Artificial Intelligence 24.1 (2011): 164-181.
Segment tree
One tree structure for each time series
  Each node represents one segment
  Lower-level nodes represent smaller segments
  The tree may not be balanced
Linear scalable family: a function family is linear scalable if, for every segment S, every function f of the family on S, and every subsegment S1 of S, the restriction f|S1 is still a function in the family
  The polynomial function family is a linear scalable family
PlatoDB chooses a linear scalable function f to represent a segment, where f minimizes the Euclidean distance
Contribution 1: PlatoDB provides an O(n) algorithm to build the tree structure.
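To make "a function f that minimizes the Euclidean distance" concrete, here is a minimal sketch that fits a degree-1 polynomial to one segment by ordinary least squares and reports the resulting Euclidean error. It illustrates the representation of a single segment only; it is not PlatoDB's O(n) tree-building algorithm, and the names `fit_line` and `l2_error` are hypothetical.

```python
import math
from typing import List, Tuple

def fit_line(values: List[float]) -> Tuple[float, float]:
    """Least-squares line y = slope * x + intercept over x = 0..n-1."""
    n = len(values)
    if n == 1:
        return 0.0, values[0]
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def l2_error(values: List[float], slope: float, intercept: float) -> float:
    """Euclidean distance between the segment and its fitted line."""
    return math.sqrt(sum((y - (slope * x + intercept)) ** 2
                         for x, y in enumerate(values)))

segment = [1.0, 3.0, 5.0, 7.0]           # perfectly linear data
slope, intercept = fit_line(segment)
error = l2_error(segment, slope, intercept)
```

Restricting the fitted line to a subsegment still gives a line, which is the linear scalable property for polynomials of degree 1.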
Segment tree
One tree structure for each time series
  Each node represents one segment
  Lower-level nodes represent smaller segments
  The tree may not be balanced
Each node stores two kinds of values
  Coefficients of the estimation function in the orthonormal basis
  Error statistics
PlatoDB stores by setting p = q = 2
Query Processing
Contribution 2: PlatoDB provides a top-down query processing algorithm
  Update one node at a time and update the current error
  Terminate when the current error < error budget
Consider query = (Sum(Times(T1, T2)), 10%)
[Figure: time series T1 and time series T2]
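The top-down navigation can be sketched as follows. This is a simplified illustration under stated assumptions, not PlatoDB's algorithm: it handles a single tree rather than the two trees of Sum(Times(T1, T2)), each node carries a precomputed estimate and error bound (the `Node` fields are hypothetical), the budget is absolute rather than a percentage, and the node with the largest error is refined first.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    estimate: float                 # approximate answer over this segment
    error: float                    # error bound for this segment
    children: List["Node"] = field(default_factory=list)

def answer(root: Node, budget: float) -> Tuple[float, float]:
    """Refine one node at a time until the current error is below the budget."""
    heap = [(-root.error, 0, root)]  # max-heap on error; counter breaks ties
    counter = 1
    total_est, total_err = root.estimate, root.error
    while total_err >= budget and heap:
        _, _, node = heapq.heappop(heap)
        if not node.children:
            continue                 # leaf: cannot be refined further
        # Replace the node's contribution with that of its children.
        total_est -= node.estimate
        total_err -= node.error
        for child in node.children:
            total_est += child.estimate
            total_err += child.error
            heapq.heappush(heap, (-child.error, counter, child))
            counter += 1
    return total_est, total_err

left = Node(48.0, 5.0, [Node(20.0, 1.0), Node(27.0, 1.0)])
root = Node(100.0, 12.0, [left, Node(53.0, 3.0)])
result = answer(root, budget=6.0)
```

With this toy tree the root is expanded first (error 8), then the left child (error 5), at which point the error is below the budget and navigation stops without visiting the remaining nodes.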
Error Estimation
Contribution 3: PlatoDB provides minimal error estimation for each time series operator by using error statistics
Store for each segment
Aligned segments (same length and same start point)
  Based on the orthogonal projection property
In PlatoDB
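One way the orthogonal projection property can yield a deterministic bound for aligned segments: if both segments are replaced by least-squares fits f1 and f2 from the same function family, the residuals e1 and e2 are orthogonal to every function in that family, so Sum(d1 * d2) - Sum(f1 * f2) = Sum(e1 * e2), and by the Cauchy-Schwarz inequality its absolute value is at most ||e1|| * ||e2||. The sketch below checks this numerically for degree-1 fits. It is an illustration of the general principle; the exact statistics PlatoDB stores are not shown on the slide, and all names here are hypothetical.

```python
import math
from typing import List, Tuple

def fit_line(values: List[float]) -> Tuple[float, float]:
    """Least-squares line y = slope * x + intercept over x = 0..n-1."""
    n = len(values)
    if n == 1:
        return 0.0, values[0]
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def norm(v: List[float]) -> float:
    return math.sqrt(sum(e * e for e in v))

# Two aligned segments (same start point, same length).
d1 = [1.0, 2.5, 2.0, 4.0]
d2 = [3.0, 1.0, 2.0, 0.5]

a1, b1 = fit_line(d1)
a2, b2 = fit_line(d2)
f1 = [a1 * x + b1 for x in range(len(d1))]
f2 = [a2 * x + b2 for x in range(len(d2))]
e1 = [y - f for y, f in zip(d1, f1)]     # residuals of segment 1
e2 = [y - f for y, f in zip(d2, f2)]     # residuals of segment 2

exact = sum(x * y for x, y in zip(d1, d2))    # true Sum(Times(d1, d2))
approx = sum(x * y for x, y in zip(f1, f2))   # answer from the fits only
bound = norm(e1) * norm(e2)                   # needs only per-segment error norms
cross = sum(f * e for f, e in zip(f1, e2))    # vanishes by orthogonality
```

The bound uses only one number per segment (the residual norm), which is the kind of quantity that can be precomputed offline.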
Error Estimation
Contribution 3: PlatoDB provides minimal error estimation for each time series operator by using error statistics
Store for each segment
Unaligned segments (different lengths or different start points)
  Symbolic computation
Contribution 4: PlatoDB provides an optimal O(K1+K2) algorithm to find the segmentation with minimal error
Further Optimizations
Performance-wise optimization
  Contribution 5: PlatoDB provides an incremental update segmentation algorithm that gives a ratio compared with the optimal one. 1+ 2
Space-wise optimization
  Contribution 6: PlatoDB can avoid storing the coefficients for the right nodes.
  Only red nodes store coefficients
  Coefficients can be deduced from the parent node and the left sibling node via an inverse basis matrix
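The space-wise idea can be illustrated in the simplest possible case. Assuming each segment is represented by a single constant (its mean), the parent's sum equals the left child's sum plus the right child's sum, so the right child's coefficient follows from the other two and need not be stored. This degree-0 sketch is a simplification chosen for illustration: the matrix inversion mentioned on the slide reduces to a scalar division here, and `right_mean` is a hypothetical helper.

```python
def right_mean(parent_mean: float, parent_len: int,
               left_mean: float, left_len: int) -> float:
    """Deduce the right child's mean from the parent and the left sibling.
    Assumes the right child is non-empty (left_len < parent_len)."""
    right_len = parent_len - left_len
    return (parent_mean * parent_len - left_mean * left_len) / right_len

# Parent covers [1, 3, 3, 5] (mean 3.0); left child covers [1, 3] (mean 2.0).
deduced = right_mean(parent_mean=3.0, parent_len=4, left_mean=2.0, left_len=2)
```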
Preliminary Experiment
Data: 1 billion temperature and ozone data points.
Query: Correlation of T1 and T2
Thank you for your time! Q & A
Techniques orthogonal to PlatoDB
Synopses
  E.g., wavelets, histograms, sketches
  Tightly tied to specific classes of queries
  Queries are assumed to be known in advance
Samples
  E.g., STRAT (single stratified sample), SciBORQ (biased samples), AQUA (single stratified sample), BlinkDB (multiple samples)
Sampling vs Compression
Sampling: a subset of the original data
Compression: use functions to model data