GAMPS COMPRESSING MULTI SENSOR DATA BY GROUPING & AMPLITUDE SCALING

Presentation transcript:

GAMPS: COMPRESSING MULTI SENSOR DATA BY GROUPING & AMPLITUDE SCALING
Sorabh Gandhi, UC Santa Barbara; Suman Nath, Microsoft Research; Subhash Suri, UC Santa Barbara; Jie Liu, Microsoft Research.
GAMPS bigger, remaining smaller. Problem, current solution, what we are trying.

Fine Grained Sensing & Data Glut
Advances in sensing technology enable fine grained, ubiquitous sensing of the environment. There are many applications, but the issue is data glut. Automated data center cooling [MSFT DCGenome project]: physical parameters, e.g. humidity, temperature; 1000s of sensors, 10 bytes/sensor/sec → 10s of GBs/day. Server performance monitoring [MSFT server farm monitoring]: performance counters, e.g. CPU utilization, memory usage; 100s of counters, 1000s of servers, a few bytes/counter/sec → TBs/day.
Recent advances in sensing technologies have made possible, both technologically and economically, the deployment of densely distributed sensor networks. These networks provide fine grained, ubiquitous sensing of the environment. Large-scale phenomena with lots of dynamic elements lead to a data glut, which poses challenges in data management.

Focus and Objectives
Data archival + (reliable and fast) query processing, in a centralized setting. Point query: report the value for sensor x at time t. Similarity query: report sensors ‘similar’ to sensor x in a time range. Obvious solution: compression; the data is a set of time series. Initial idea: approximate every time series individually. Many approximation techniques are known, e.g. DFT, DCT, piecewise linear. Focus: L∞ error [guarantee on point queries]; example techniques: wavelets, piecewise constant/linear approximations. Compression is not enough! It gives up to an order of magnitude improvement; we want more.
Our focus is data archival, for historical and trend analysis. We also want to archive this data in a format that enables fast query processing, where the query could be a point query or a similarity query.

Signals are Correlated!
Server dataset: 40 signals, 1 day, sampled once every 30 seconds; counter: # of connected users. [Figure: # connected users vs. time, annotated with shifted/scaled groups, dynamic groups, and similar signals in a group.]
Example dataset: Server, from the server performance monitoring application. The signals are certainly correlated across time, but they also seem correlated in space with each other.

Contributions
We propose GAMPS, which exploits linear correlations among multiple signals while compressing them together, and gives L∞ guarantees. Compression happens both along time and across signals. We propose an index structure for the compressed data that gives fast responses to many relevant queries. Through simulations on real data, we show that on large datasets GAMPS can achieve up to an order of magnitude improvement over state of the art compression techniques.

State of the art: Single Signal
Optimal L∞ approximations. Problem: given a time series S and an input parameter ε, approximate S with piecewise constant segments such that the L∞ error is <= ε. Greedy algorithm: PCGreedy(S, ε) [Lazaridis et al., ICDE’03]. [Figure: original time series and its approximation, with a bar of height 2ε.]
The algorithm divides the time series into contiguous disjoint parts, called buckets, and approximates each bucket by a segment. Let the magnitude of 2ε be as shown by the segment on the left-hand part of the slide. The algorithm starts with an empty bucket and processes the data points one by one. It maintains the maximum and minimum values seen for the current bucket, and the moment the difference of these values exceeds 2ε, the ... Is there anything more we can hope to do?
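The speaker notes break off before saying what happens when the max-min spread exceeds 2ε. The sketch below assumes the usual greedy step: close the current bucket, represent it by the midpoint of its min and max, and start a new bucket at the offending point. The function name `pc_greedy` and the `(start, end, value)` segment layout are illustrative choices, not taken from the talk.

```python
from typing import List, Tuple

# (start index, end index inclusive, constant value) -- hypothetical layout
Segment = Tuple[int, int, float]

def pc_greedy(series: List[float], eps: float) -> List[Segment]:
    """Greedy piecewise-constant approximation with max error <= eps."""
    segments: List[Segment] = []
    if not series:
        return segments
    start = 0
    lo = hi = series[0]
    for i in range(1, len(series)):
        new_lo, new_hi = min(lo, series[i]), max(hi, series[i])
        if new_hi - new_lo > 2 * eps:
            # Adding point i would break the bound: close the bucket
            # and represent it by the midpoint of its range.
            segments.append((start, i - 1, (lo + hi) / 2))
            start, lo, hi = i, series[i], series[i]
        else:
            lo, hi = new_lo, new_hi
    segments.append((start, len(series) - 1, (lo + hi) / 2))
    return segments

example = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9, 5.0]
example_segments = pc_greedy(example, eps=0.2)
```

Because every bucket spans at most 2ε, its midpoint is within ε of every point in it, which is what gives the per-point guarantee.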

GAMPS Overview
GAMPS takes as input the set of time series and the approximation parameter ε. Compression has three phases. Partition phase: partitions the data into contiguous time intervals. Group phase: divides a given partition into groups of similar signals. Amplitude scaling phase: compression happens with sharing of representations. [Figure: Data → COMPRESSION (Partition Phase, Grouping Phase, Amplitude Scaling Phase) → Compressed Data → INDEXING (Index Structure).]

Compression by Amplitude Scaling
Given a group of k ‘similar’ signals, let the signals be denoted by the set X = {X1, X2, …, Xk}. Key idea: express every signal Xi as a scaled function of some signal Xj: Xi = AiXj. Ai is the ratio (amplitude) signal and Xj is the base signal. If signal Xi is a perfectly scaled version of Xj, then Ai is constant, and to reconstruct Xi we only need to store the constant and Xj. In reality, there is no perfect correlation. However, we found that if there are enough linearly correlated signals, smartly approximating the Ai and Xj can give very good compression factors!
If our premise is true, that is, if signal Xi is a perfectly scaled/shifted/overlapping version of Xj, then we can achieve big compression gains. For instance, if Xi is a perfectly scaled version of Xj, then Ai is constant. We found experimentally that if there are enough linearly correlated signals, by smartly approximating Ai and Xj with PCGreedy() such that the reconstruction error in Xi is less than the target epsilon, we can achieve …. Let us have a look at an example of this with the help of a small part of our datacenter dataset.
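A toy illustration of why a constant ratio signal is cheap to store. The data is made up, the tolerances (0.5 for the signals, 0.01 for the ratio) are arbitrary and not the talk's error split, and `count_segments` is a hypothetical helper that counts greedy piecewise-constant buckets.

```python
from typing import List

def count_segments(series: List[float], eps: float) -> int:
    """Number of greedy buckets whose (max - min) stays <= 2 * eps."""
    count, lo, hi = 1, series[0], series[0]
    for v in series[1:]:
        new_lo, new_hi = min(lo, v), max(hi, v)
        if new_hi - new_lo > 2 * eps:
            count += 1          # start a new bucket at v
            lo = hi = v
        else:
            lo, hi = new_lo, new_hi
    return count

# Made-up base signal X_j and a perfectly scaled copy X_i = 1.1 * X_j.
base = [50.0, 52.0, 55.0, 53.0, 50.0, 48.0]
scaled = [1.1 * x for x in base]

# Ratio (amplitude) signal A_i = X_i / X_j, computed point by point.
ratio = [xi / xj for xi, xj in zip(scaled, base)]

individual = count_segments(base, 0.5) + count_segments(scaled, 0.5)
shared = count_segments(base, 0.5) + count_segments(ratio, 0.01)
```

Here the ratio signal collapses to a single segment, so storing base plus ratio takes fewer segments than approximating both signals separately.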

Illustration: Amplitude Scaling on Real Dataset
DataCenter dataset: 6 signals shown for ~3 days each; parameter: relative humidity. Input: X = {X1, X2, …, X6}, ε = 1%. We need to choose the base signal and divide ε among the base signal approximation (εb) and the ratio signal approximations (εr). Oracle: X4 is the base signal; the oracle also provides the values εb and εr. Run PCGreedy(X4, εb), and PCGreedy(Ai, εr) for the signals other than the base signal. [Figure: DataCenter dataset.]
Let the target error be 1% of the data value. Again, we will concentrate for now on the signal A_i, which we call the ratio signal, as it gives much better compression results for all our datasets than Bi. Call X_j the base signal. Say for now an oracle tells us that the base signal is X_4 and that it should be approximated with 40% of the total error. Using this, one can determine εr such that the reconstruction error in the approximations of X1, X2, X3, X5, X6 is less than the target error ε. So we construct the ratio signal approximations.
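The notes say that the ratio-signal error can be determined from the base-signal error but do not show how. One sufficient condition, derived here from the triangle inequality and not taken from the talk, follows from writing the reconstruction as the product of the two approximations. It treats ε, εb and εr as absolute pointwise bounds, whereas the slides state them as relative errors, and the hatted symbols are notation introduced for this sketch.

```latex
\begin{align*}
\hat{X}_i - X_i &= \hat{A}_i \hat{X}_j - A_i X_j
                 = (\hat{A}_i - A_i)\,\hat{X}_j + A_i\,(\hat{X}_j - X_j), \\
|\hat{X}_i - X_i| &\le \varepsilon_r\,|\hat{X}_j| + |A_i|\,\varepsilon_b .
\end{align*}
```

Under these assumptions the target ε is met at every time step whenever εr <= (ε − |A_i| εb) / |X̂_j|, provided X̂_j is nonzero and the right-hand side is positive.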

Illustration: Amplitude Scaling on Real Dataset
Leftmost figure: all signals use PCGreedy() with ε = 1.0%. Middle figure: higher fidelity base signal, εb = 0.4%. Rightmost figure: ratio signals, which are very sparse (a small number of segments is enough to represent them). [Figure: three panels. Individual approx (y-axis: relative humidity); base signal approx (y-axis: relative humidity); ratio signal approx (y-axis: ratio).]
The results for the technique mentioned on the previous slide are shown here. The leftmost figure shows individual approximations with the optimal single signal approximation algorithm with error 1%. The middle figure shows the approximation of the base signal with 0.4ε. The right figure shows the ratio signal approximations such that the reconstruction error is less than ε. The most interesting thing to note here is that the ratio signals are very sparse, i.e. they need a lot less memory than the corresponding individual approximations in the leftmost figure. So, even though the base signal approximation takes more segments, overall compression seems better than with individual approximations.

Quantitative Comparison for Amplitude Scaling
Compression factor = M1/M2, where M1 = number of segments in the individual signal approximations and M2 = number of segments in the (base signal + ratio signal) approximations. For this illustrative dataset, the compression factor (at 1% error) is 1.9. The comparison is with optimal individual approximations.

Grouping and Amplitude Scaling by Facility Location
Facility location problem: the problem is modeled as a graph G(V, E). Opening a facility at node i costs c(i). Serving a demand point j using facility i costs w(i, j). The objective is to choose F ⊆ V to minimize Σ_{i ∈ F} c(i) + Σ_{j ∈ V} w(f(j), j), where f(j) ∈ F is the facility serving j. Grouping & amplitude scaling is modeled as facility location on a complete graph in which every signal is represented by a node. Cost of opening a facility: # segments needed to represent the base signal. Cost of serving a demand point: # segments needed to represent the ratio signal. [Figure: graph.]
Suppose the partitioning part provides us with a batch of data. We solve our grouping and compression problem by using algorithms known for the facility location problem. Let us first understand the facility location problem. It consists of a set of potential facility sites, represented by nodes V, where a facility can be opened, and a set of demand points, also represented by V, that must be serviced. The goal is to pick a subset F of facilities to open so as to minimize the sum of the costs of serving every demand point by one facility, plus the sum of the opening costs of the facilities.
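A minimal sketch of the facility-location view of grouping, using exhaustive search over facility sets on a tiny made-up instance. This is a stand-in for illustration only: the talk reports solving an integer linear program, and exhaustive search is exponential in the number of signals. The cost numbers, the function name, and the assumption that a base signal serves itself at no extra cost are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

def solve_facility_location(
    open_cost: List[float], serve_cost: List[List[float]]
) -> Tuple[float, Dict[int, int]]:
    """Return (total cost, assignment) with assignment[j] = base signal of j.

    open_cost[i]     : segments needed to store signal i as a base signal
    serve_cost[i][j] : segments needed to store the ratio signal of j over i
    """
    n = len(open_cost)
    best_cost, best_assign = float("inf"), {}
    for k in range(1, n + 1):
        for facilities in combinations(range(n), k):
            cost = sum(open_cost[i] for i in facilities)
            assign: Dict[int, int] = {}
            for j in range(n):
                if j in facilities:
                    assign[j] = j  # assumption: a base serves itself for free
                    continue
                i = min(facilities, key=lambda f: serve_cost[f][j])
                assign[j] = i
                cost += serve_cost[i][j]
            if cost < best_cost:
                best_cost, best_assign = cost, assign
    return best_cost, best_assign

# Made-up costs: signals 0 and 1 resemble each other, as do 2 and 3.
open_cost = [10, 12, 11, 9]
serve_cost = [
    [0, 2, 9, 9],
    [2, 0, 9, 9],
    [9, 9, 0, 3],
    [9, 9, 3, 0],
]
total, groups = solve_facility_location(open_cost, serve_cost)
```

On this instance the cheapest choice opens signals 0 and 3 as base signals, which yields two groups, {0, 1} and {2, 3}.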

Implementation Setup
We set εb = 0.4ε [error allocation for the base signal]. Facility location is NP-hard. We show results with the exact solution (integer linear program); approximation solutions are within 90% of the results shown. The time taken to solve the linear program is <= a few seconds. We use three different datasets. Server dataset: 240 signals, 1 day of data [CPU utilization counter]. DataCenter dataset: 24 signals, 3 days of data [humidity sensors]. IBT dataset: 45 signals, 1 day of data [temperature sensors in a building in Berkeley].
I will only show results with compression. For all the experiments shown …. It is certainly a parameter which can be tuned, but we find that we get good results even with this fixed value.

Quantitative Evaluation: GAMPS
The figure on the left shows the compression factor over raw data: for 1.5% error, 300 for the Server data and 50 for the other two. The figure on the right shows the compression factor over individual approximations: for 1.5% error, between a factor of 2 and 10. The compression factor is high for the Server dataset, whose average group size is the highest (60, as compared to 4.5 & 6).

Scaling versus Group size
We extracted 60 signals in the same group from the Server dataset. The compression factor (versus individual approximations) increases as the group size increases.

Advantage of Grouping
We demonstrate the advantage of having multiple groups on the IBT and Server datasets. Hybrid: an algorithm which allows only 1 group; every signal is either in the group or approximated individually. For both datasets, for all errors, grouping gives a great advantage. Compression factor: 1.5 (IBT) to 9 (Server) [error 1.5%].

Grouping: Geographical Locality
IBT dataset, 1 day, error = 1.5%. GAMPS runs the grouping on the entire day's data. The picture on the left shows the sensor layout in the Intel Berkeley lab. Hexagons are sensor positions, crosses are sensors without data for the one day, and rectangles are outliers (individual approximations). Simple region boundaries conform to our intuition. The grouping algorithm has no information about geographical locations. [Figure: sensor layout; group layout.]

Indexing Compressed Data
We propose a skip-list-based index structure. Point query: log(n). Range query: log(n) + range. Similarity query: log(n) + # groups in range. [Figure: skip-list of groups; pointer to base signal; skip-list of approx. lines for ratio signal.]
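A hedged sketch of how a point query could be answered from compressed data. It swaps the talk's skip list for a sorted array searched with `bisect`, which gives the same logarithmic lookup for static data; the class name, the `(start_time, value)` segment layout, and the sample numbers are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

class SegmentIndex:
    """Piecewise-constant segments, looked up by binary search on start time."""

    def __init__(self, segments: List[Tuple[int, float]]):
        # segments: (start_time, value) pairs sorted by start_time
        self.starts = [s for s, _ in segments]
        self.values = [v for _, v in segments]

    def value_at(self, t: int) -> float:
        pos = bisect_right(self.starts, t) - 1  # last segment starting <= t
        if pos < 0:
            raise KeyError(f"time {t} precedes the first segment")
        return self.values[pos]

def point_query(base: SegmentIndex, ratio: SegmentIndex, t: int) -> float:
    """Reconstruct X_i(t) = A_i(t) * X_j(t) from the two approximations."""
    return ratio.value_at(t) * base.value_at(t)

base_idx = SegmentIndex([(0, 50.0), (10, 54.0), (25, 51.0)])
ratio_idx = SegmentIndex([(0, 1.1), (20, 1.2)])
value = point_query(base_idx, ratio_idx, 12)  # uses 1.1 and 54.0
```

A skip list would be the natural choice if segments are appended over time, since a sorted array needs shifting on insert.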

Future Work
How to distribute the error among base and ratio signals? How about generic linear transformations? We use only the ratio signal (scaling): Xi = AiXj. Maybe we can get much better compression by using Xi = AiXj + Bi. How about piecewise linear signals? The underlying algorithm is not so trivial (convex hulls). Can we apply this technique to 2D signals? Consider a video: every pixel value in time → a time series, and every pixel time series is correlated with neighboring pixel time series.

Thanks for your attention

Example Query: Similarity Query
Based on the grouping, we can define a similarity coefficient for a given time range (t1, t2) = 1, if signals Si and Sj are in the same group at time t. [Figure: part of IBT dataset; similarity query.]

Compression by Interval Sharing
Key idea: if two sensors have nearly overlapping time series, they can share a part of the approximation. Let the number of signals be k and the desired error be ε. (α, β) approximation algorithm: for a given error ε, say the optimal algorithm takes OPT segments; an (α, β) algorithm has error no more than αε and uses no more than β·OPT segments. We propose a polynomial time (5, log k + log OPT) approximation algorithm for approximation with PC segments using interval sharing. [Figure: Signal 1 and Signal 2; the representation can be shared.]

Multiple Correlated Signals: Example 1
Instant messaging service (Server dataset): 240 servers, 2 weeks, >= 100 performance counters. 40 signals shown (normalized) for one day; counter: # connected users; sampling rate: once every 30 seconds. The signals are correlated (almost overlapping) with each other; can we exploit this in compression? [Figure: Server dataset.]
The hope is that many signals are related and, if so, we want a technique which can exploit it.

Multiple Correlated Signals: Example 2
Data center monitoring: 24 sensors, 2 years, 2 parameters (humidity, temperature). 6 signals shown for ~3 days each; parameter: relative humidity; sampling rate: once every 30 seconds. The signals are not overlapping, but still correlated; shifting or scaling may help. Question: can we exploit this correlation? We propose a technique to compress multiple signals both along time and across signals. [Figure: DataCenter dataset.]
GAMPS overview: grouping and compression (linear transform); in practice, scaling is pretty effective.

Partition Determination
Use the double-half-same size heuristic. Start with some initial batch size (say 100 data points). For the next batch, run group and compress with 200, 100 & 50 data points. For 200, compare with two batches of size 100; whichever takes less memory is chosen. Similarly for 50, compare two batches of size 50 with one batch of size 100. Memory taken = # segments + cluster delta. Cluster delta: every time the clusters change, we need to update the base signals and the base-ratio signal relationships.
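The heuristic above can be sketched as follows. The memory cost (# segments + cluster delta) is passed in as a callback, so `toy_cost` below is only a made-up stand-in. The slide does not say which option wins when both doubling and halving look better; checking the double first is an assumption of this sketch, as are the function and parameter names.

```python
from typing import Callable, Sequence

def next_batch_size(
    data: Sequence[float],
    start: int,
    size: int,
    cost: Callable[[Sequence[float]], float],
) -> int:
    """Pick double, half, or the same batch size for the batch at `start`."""
    def span(lo: int, hi: int) -> float:
        return cost(data[lo:hi])

    # One batch of 2*size versus two consecutive batches of size.
    if start + 2 * size <= len(data):
        two_same = span(start, start + size) + span(start + size, start + 2 * size)
        if span(start, start + 2 * size) < two_same:
            return 2 * size

    # Two batches of about size/2 versus one batch of size.
    half = size // 2
    if half >= 1 and start + size <= len(data):
        two_halves = span(start, start + half) + span(start + half, start + size)
        if two_halves < span(start, start + size):
            return half

    return size

def toy_cost(chunk: Sequence[float]) -> float:
    # Stand-in for "# segments + cluster delta": a fixed per-batch overhead
    # plus a term that grows quickly with the variety inside the batch.
    return 1 + len(set(chunk)) ** 2

grow = next_batch_size([1] * 8, 0, 2, toy_cost)
shrink = next_batch_size([1, 1, 2, 2], 0, 4, toy_cost)
keep = next_batch_size([1, 1, 1, 1, 2, 2, 2, 2], 0, 4, toy_cost)
```

With this toy cost, uniform data favors larger batches (less per-batch overhead) while data that changes midway favors splitting at the change.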

[Figure: GAMPS illustration. Partition; grouping (similar signals together); select base and ratio signals.]

[Figure: GAMPS compression illustration. Partition (to overcome varying correlations); grouping (similar signals together); compress by amplitude scaling.]