
1 Approximate Aggregation Techniques for Sensor Databases John Byers Department of Computer Science Boston University Joint work with Jeffrey Considine, George Kollios, Feifei Li

2 Sensor Network Model Large set of sensors distributed in a sensor field. Communication via a wireless ad-hoc network. Nodes and links are failure-prone. Sensors are resource-constrained –Limited memory, battery-powered, messaging is costly.

3 Sensor Databases Treat the sensor field as a distributed database –But: data is gathered on demand, not stored. Perform standard queries over the sensor field: –COUNT, SUM, GROUP-BY Exemplified by work such as TAG and Cougar For this talk: –One-shot queries –Continuous queries are a natural extension.

4 Tiny Aggregation (TAG) Approach [Madden, Franklin, Hellerstein, Hong] Aggregation component of TinyDB –Follows database approach –Uses simple SQL-like language for queries –Power-aware, in-network query processing –Optimizations are transparent to end-user. TAG supports COUNT, SUM, AVG, MIN, MAX and others

5 TAG (continued) Queries proceed in two phases: –Phase 1: Sink broadcasts desire to compute an aggregate. Nodes create a routing tree with the sink as the root. –Phase 2: Nodes start sending back partial results. Each node receives the partial results of its children and computes a new partial result. Then it forwards the new partial result to its parent. Can compute any decomposable function –f(v1, v2, …, vn) = g(f(v1, …, vk), f(vk+1, …, vn))
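The decomposability requirement can be illustrated with a small, self-contained Python sketch (the function names are mine, not TAG's):

```python
# Decomposable aggregate: f(v1..vn) = g(f(v1..vk), f(vk+1..vn)).
# For SUM, both f and g are plain addition; for AVG, the partial
# state must carry a (sum, count) pair so that g can merge correctly.

def partial_sum(values):            # f for SUM
    return sum(values)

def merge_sum(a, b):                # g for SUM
    return a + b

def partial_avg(values):            # f for AVG: (sum, count) pair
    return (sum(values), len(values))

def merge_avg(a, b):                # g for AVG
    return (a[0] + b[0], a[1] + b[1])

values = [2, 1, 4, 3, 2, 9, 3, 4]
left, right = values[:4], values[4:]
assert merge_sum(partial_sum(left), partial_sum(right)) == sum(values)
s, c = merge_avg(partial_avg(left), partial_avg(right))
assert s / c == sum(values) / len(values)
```

Any split point k gives the same answer, which is why partial results can be merged at every interior node of the routing tree.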

6 Example for SUM [figure: sensors with their values form a spanning tree rooted at the sink; partial sums accumulate up the tree, giving a total of 20 at the sink] Sink initiates the query Nodes form a spanning tree Each node sends its partial result to its parent Sink computes the total sum

7 Classification of Aggregates TAG classifies aggregates according to –Size of partial state –Monotonicity –Exemplary vs. summary –Duplicate-sensitivity MIN/MAX (cheap and easy) –Small state, monotone, exemplary, duplicate-insensitive COUNT/SUM (considerably harder) –Small state and monotone, BUT duplicate-sensitive –Cheap if aggregating over tree without losses –Expensive with multiple paths and losses

8 Basic approaches to computing SUM 1. Separate, reliable delivery of every value to the sink –Extremely costly in messaging and energy consumption 2. Aggregate values back to the sink along a tree –A single fault eliminates the values of an entire subtree 3. “Split” values and route fractions separately –Send (value / k) to each of k parents –Better variance, but same expectation as approach (2) 4. Send values along multiple paths –Duplicates need to be handled. –<id, value> pairs permit only limited in-network aggregation.

9 Design Objectives for Robust SUM Admit in-network aggregation of partial values Let aggregates be both order-insensitive and duplicate-insensitive Be agnostic to routing protocol –Trust routing protocol to be best-effort. –Routing and aggregation logically decoupled [NG ’03]. –Some routing algorithms better than others.

10 Design Objectives (cont) The final aggregate is exact if at least one representative from each leaf survives to reach the sink. In practice, this won’t always happen, so it is reasonable to hope only for approximate results. We argue that it is then also reasonable to use aggregation methods that are themselves approximate.

11 Outline Motivation for sensor databases and aggregation. COUNT aggregation via Flajolet-Martin SUM aggregation Experimental evaluation

12 Flajolet / Martin sketches [JCSS ’85] Goal: Estimate N from a small-space representation of a set. Sketch of a union of items is the OR of their sketches –Insertion order and duplicates don’t matter! Prerequisite: Let h be a random, binary hash function. Sketch of an item –For each unique item with ID x, for each integer 1 ≤ i ≤ k in turn, compute h(x, i). Stop when h(x, i) = 1, and set bit i. Example: sketch(X) = 00100, sketch(Z) = 10000, so sketch(X ∪ Z) = 00100 OR 10000 = 10100
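A minimal Python sketch of FM insertion and union; deriving the binary hash h(x, i) from SHA-256 and the sketch length K are my illustrative choices (the method only requires a random binary hash):

```python
import hashlib

K = 16  # sketch length in bits (illustrative)

def coin(x, i):
    """Fair binary hash h(x, i), derived deterministically from SHA-256."""
    return hashlib.sha256(f"{x}:{i}".encode()).digest()[0] & 1

def fm_insert(sketch, x):
    """Compute h(x, 1), h(x, 2), ...; stop at the first 1 and set that bit.
    Bit i is therefore set with probability 2**-i."""
    for i in range(1, K + 1):
        if coin(x, i):
            return sketch | (1 << (i - 1))
    return sketch

def fm_union(a, b):
    """Union of sketches is bitwise OR: order- and duplicate-insensitive."""
    return a | b

# Duplicates have no effect, and union commutes with insertion.
assert fm_insert(fm_insert(0, "X"), "X") == fm_insert(0, "X")
assert fm_union(fm_insert(0, "X"), fm_insert(0, "Z")) == \
       fm_insert(fm_insert(0, "X"), "Z")
```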

13 Flajolet / Martin sketches (cont) Estimating COUNT –Take the sketch of a set of N items. –Let j be the position of the leftmost zero in the sketch. –j is an estimator of log2(0.77 N) Example: sketch S = 1101…, leftmost zero at j = 3, best guess COUNT ≈ 2^3 / 0.77 ≈ 10 Fixable drawbacks: –Estimate has a slight bias –Variance in the estimate is large.
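The leftmost-zero estimator written out directly (the bit-list representation and the precise constant are illustrative):

```python
PHI = 0.77351  # Flajolet-Martin correction constant

def leftmost_zero(bits):
    """1-indexed position j of the first 0 bit in the sketch."""
    for pos, b in enumerate(bits, start=1):
        if b == 0:
            return pos
    return len(bits) + 1

def fm_estimate(bits):
    # j estimates log2(PHI * N), so N is estimated by 2**j / PHI.
    return 2 ** leftmost_zero(bits) / PHI

# Sketch 1101...: leftmost zero at j = 3, estimate 2**3 / 0.77 ~ 10.3
est = fm_estimate([1, 1, 0, 1])
assert 10 <= est <= 11
```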

14 Flajolet / Martin sketches (cont) Standard variance reduction methods apply. –Compute m independent sketches in parallel. –Compute m independent estimates of N. –Take the mean of the estimates. Provable tradeoffs between m and variance of the estimator

15 Application to COUNT Each sensor computes k independent sketches of itself (using unique ID x) –Coming next: sensor computes a sketch of its value. Use a robust routing algorithm to route sketches up to the sink. Aggregate the k sketches via union en-route. The sink then estimates the count.

16 Multipath Routing Braided Paths: Two paths from the source to the sink that differ in at least two nodes

17 Routing Methodologies Considerable work on reliable delivery via multipath routing –Directed diffusion [IGE ’00] –“Braided” diffusion [GGSE ’01] –GRAdient Broadcast (GRAB) [YZLZ ’02] Broadcast intermediate results along a gradient back to the source Can dynamically control the width of the broadcast Trades off fault tolerance against transmission costs Our approach is similar to GRAB: –Broadcast; receivers grab the result if upstream, ignore it if downstream Common goal: try to get at least one copy to the sink

18 Simple Upstream Routing Via expanding ring search, nodes can compute their hop distance from the sink. –Refer to nodes at distance i as level i. At level i, gather aggregates from level i+1, then broadcast aggregates to level i–1 neighbors. Ignore downstream and sidestream aggregates.
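The level assignment amounts to a breadth-first search from the sink; a toy Python stand-in (the real system computes this distributively via expanding ring search over the radio network; the adjacency-dict topology is illustrative):

```python
from collections import deque

def hop_levels(adjacency, sink):
    """Assign each node its hop distance ("level") from the sink via BFS."""
    level = {sink: 0}
    queue = deque([sink])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level

# Toy topology: node 0 is the sink.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
assert hop_levels(adj, sink=0) == {0: 0, 1: 1, 2: 1, 3: 2}
```

A node at level i then accepts aggregates only from level i+1 neighbors and broadcasts its merged aggregate toward level i–1.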

19 Extending Flajolet / Martin Sketches Also interested in approximating SUM FM sketches can handle this (albeit clumsily): –To insert a value of 500, perform 500 distinct item insertions Our observation: We can simulate a large number of insertions into an FM sketch more efficiently. Sensor-net restrictions –No floating point operations –Must keep memory usage and CPU time to a minimum

20 Simulating a set of insertions Set all the low-order bits in the “safe” region. –First S = log c – 2 log log c bits are set to 1 w.h.p. Statistically estimate the number of trials going beyond the “safe” region –Probability of a trial doing so is simply 2^–S –Number of trials ~ B(c, 2^–S). [Mean = O(log² c)] For trials landing outside the “safe” region, set those bits manually. –Running time is O(1) for each outlying trial. Expected running time: O(log c) + time to draw from B(c, 2^–S) + O(log² c)
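A hedged Python sketch of the scheme. For clarity, the escaper count here is drawn by summing c Bernoulli trials, which is O(c); the actual method draws it from B(c, 2^–S) in O(1) using Walker's method from the next slides. The guard for very small c is my own addition:

```python
import math
import random

K = 32  # sketch length in bits (illustrative)

def fm_insert_many(sketch, c, rng=random):
    """Simulate c distinct-item insertions into an FM sketch.
    The first S = log c - 2 log log c bits are set w.h.p., so set them
    directly; only trials escaping past bit S are simulated explicitly."""
    if c < 4:
        S = 0                                    # safe region is empty
    else:
        lg = math.log2(c)
        S = max(0, int(lg - 2 * math.log2(lg)))
    sketch |= (1 << S) - 1                       # "safe" prefix of ones
    # Escapers ~ B(c, 2**-S); drawn naively here, O(1) via Walker in the paper.
    escapers = sum(rng.random() < 2.0 ** -S for _ in range(c)) if S else c
    for _ in range(escapers):
        i = S + 1
        while i <= K and rng.random() < 0.5:     # tails: keep flipping
            i += 1
        if i <= K:
            sketch |= 1 << (i - 1)               # heads landed on bit i
    return sketch
```

With S = 0 this degenerates to plain FM insertion, one trial per item.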

21 Fast sampling from discrete pdf’s We need to generate samples from B(n, p). General problem: sampling from a discrete pdf. Assume we can draw uniformly at random from [0, 1]. With an event space of size N: –O(log N) lookups are immediate: represent the cdf in an array of size N, draw from [0, 1], and binary search. –Cleverer methods achieve O(log log N), O(log* N) time Amazingly, this can be done in constant time!

22 Constant Time Sampling Theorem [Walker ’77]: For any discrete pdf D over a sample space of size N, a table of size O(N) can be constructed in O(N) time that enables random variables to be drawn from D using at most two table lookups.

23 Sampling in O(1) time [Walker ’77] Start with a discrete pdf over {A, B, C, D, E}: {0.40, 0.30, 0.15, 0.10, 0.05} Construct a table of 2N entries, one column per outcome:

i:    A    B    C     D    E
p_i:  0.5  1    0.75  0.5  0.25
Q_i:  B    –    A     A    A

In the table above: Pr[B] = 1 · 0.2 + 0.5 · 0.2 = 0.3, Pr[C] = 0.75 · 0.2 = 0.15 Algorithm: Pick a column i at random. Pick x uniformly from [0, 1]. If x < p_i, output i. Else output Q_i

24 Methods of [Walker ’77] (cont.) Ok, but how do you construct the table? Table construction: –Take a “below-average” outcome i (x_i < 1/N). –Choose p_i to satisfy x_i = p_i / N. –Set the j with the largest remaining x_j as Q_i, and reduce x_j accordingly. –Repeat. Starting from the pdf {0.40, 0.30, 0.15, 0.10, 0.05}, this yields the table on the previous slide. Linear time construction.
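Walker's construction and O(1) sampling written out in Python (a Vose-style formulation; the (p_i, Q_i) table it produces can differ from the one on slide 23 depending on the order in which pairs are processed, but it induces the same distribution):

```python
import random

def build_alias(pdf):
    """Walker's alias method: O(N) construction, two lookups per sample.
    Returns parallel lists (prob, alias) = (p_i, Q_i)."""
    n = len(pdf)
    scaled = [p * n for p in pdf]                 # x_i * N
    small = [i for i, x in enumerate(scaled) if x < 1.0]
    large = [i for i, x in enumerate(scaled) if x >= 1.0]
    prob = [1.0] * n
    alias = list(range(n))
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]            # below-average column keeps p_s
        alias[s] = l                   # its leftover mass maps to donor l
        scaled[l] -= 1.0 - scaled[s]   # reduce the donor's mass
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def sample(prob, alias, rng=random):
    i = rng.randrange(len(prob))       # pick a column at random
    return i if rng.random() < prob[i] else alias[i]

# The pdf from slide 23: {A: 0.40, B: 0.30, C: 0.15, D: 0.10, E: 0.05}
prob, alias = build_alias([0.40, 0.30, 0.15, 0.10, 0.05])
```

Each sample costs one random column pick plus one comparison, independent of N.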

25 Back to extending FM sketches We need to sample from B(c, 2 -S ) for various values of S. Using Walker’s method, we can sample from B(c, 2 -S ) in O(1) time and O(c) space, assuming tables are pre-computed offline.

26 Back to extending FM sketches (cont) With more cleverness, we can trade off space for time. Recall that, –Running time = time to sample from B + O(log² c) –Sampling in O(log² c) time leads to O(c / log² c) space. –With a max sensor value of 2^16, saving a log² c term is a 256-fold space savings. Tables for S = 1, 2, …, 16 together take 4600 bytes (without this optimization, tables would be >1MB)

27 Intermission FM sketches require more work initially. –Need k bits to represent a single value! But: –Sketched values can easily be aggregated. –Aggregation operation (OR) is both order-insensitive and duplicate-insensitive. –Result is a natural fit with sensor aggregation.

28 Outline Sensor database motivation COUNT aggregation via Flajolet-Martin SUM aggregation Experimental evaluation

29 Experimental Results We employ the publicly available TAG simulator. Basic topologies: grid (2-D lattice) and random Can modulate: –Grid size [default: 30 by 30] –Node, packet, or link loss rate [default: 5% link loss rate] –Number of bitmaps [default: twenty 16-bit sketches]. –Transmission radius [default: 8 neighbors on the grid]

30 Experimental Results We consider four main methods. –TAG: transmit aggregates up a single tree –DAG-based TAG: send a 1/k fraction of the aggregated values to each of k parents. –SKETCH: broadcast an aggregated sketch to all neighbors at level i–1 –LIST: explicitly enumerate all <id, value> pairs and broadcast to all neighbors at level i–1. LIST vs. SKETCH measures the penalty associated with approximate values.

31 Message Comparison TAG: transmit aggregates up a single tree –1 message transmitted per node. –1 message received per node (on average). –Message size: 16 bits. SKETCH: broadcast a sketch up the tree –1 message transmitted per node. –Fanout of k receivers per transmission (constant k). –Message size: 20 16-bit sketches = 320 bits.

32 COUNT vs Link Loss (Grid)

33

34 COUNT vs Network Diameter (Grid)

35 COUNT vs Link Loss (Random)

36 SUM vs Link Loss

37 Compressibility FM sketches are amenable to compression. We employ a very basic method: –Run-length encode the initial prefix of ones. –Run-length encode the suffix of zeroes. –Represent the middle explicitly. The method can be applied to a group of sketches. This alone buys about a factor of 3. Better methods exist.
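The prefix/suffix run-length scheme in a few lines of Python (the bit-list representation is illustrative):

```python
def compress_sketch(bits):
    """Run-length encode the leading ones and trailing zeroes of a sketch;
    only the middle bits are stored explicitly."""
    n = len(bits)
    p = 0
    while p < n and bits[p] == 1:
        p += 1                      # length of the all-ones prefix
    s = 0
    while s < n - p and bits[n - 1 - s] == 0:
        s += 1                      # length of the all-zeroes suffix
    return p, s, bits[p:n - s]

def decompress_sketch(p, s, middle):
    return [1] * p + middle + [0] * s

sketch = [1, 1, 1, 1, 0, 1] + [0] * 10   # typical shape of a 16-bit FM sketch
p, s, mid = compress_sketch(sketch)
assert (p, s, mid) == (4, 10, [0, 1])
assert decompress_sketch(p, s, mid) == sketch
```

The scheme works well because an FM sketch is, with high probability, a run of ones, a short noisy middle, and a run of zeroes.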

38 Compression

39 Space Usage

40 Future Directions Spatio-temporal queries –Restrict queries to specific regions of space, time, or space-time. Other aggregates –What else needs to be computed or approximated? Better aggregation methods –FM sketches have rather high variance. –Many other sketching methods can potentially be used.

