An Effective Coreset Compression Algorithm for Large Scale Sensor Networks
Dan Feldman, Andrew Sugaya, Daniela Rus
MIT
=Data
How much data?
1 GPS Packet = 100 bytes (latitude, longitude, time), every 10 seconds
~40 MB / hour or ~1 GB / day
per device
~300 million smart phones sold in
For 100 million devices
~ 100 petabytes per day
~ 100 thousand terabytes per day
2 terabytes each
x50000 / day
A lot of data.
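The fleet-wide figures follow from the per-device rate by direct multiplication. The worked arithmetic below reads the per-device rate as about 1 gigabyte per day and assumes decimal units (1 PB = 10^6 GB, 1 TB = 10^3 GB); "unit" stands for whatever 2-terabyte storage item the slide depicts.

```latex
\begin{align*}
10^{8}\ \text{devices} \times 1\ \frac{\text{GB}}{\text{device}\cdot\text{day}}
  &= 10^{8}\ \frac{\text{GB}}{\text{day}}
   = 100\ \frac{\text{PB}}{\text{day}}
   = 10^{5}\ \frac{\text{TB}}{\text{day}}, \\[4pt]
\frac{10^{5}\ \text{TB/day}}{2\ \text{TB/unit}}
  &= 50{,}000\ \frac{\text{units}}{\text{day}}.
\end{align*}
```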
GPS-points Data
iPhones can collect high-frequency GPS traces
GPS-point = (latitude, longitude, time)
[Table of sample GPS points: latitude, longitude, time]
Example
3-D Visualization
Challenges
Storing data on the iPhone is expensive
Transmitting data is expensive
Raw data is hard to interpret
Data is dynamic and streams in real time
Key Insight: Identify Critical Points
Approximate the n points by k << n semantically meaningful connected segments
Our Approach
Central Expy, Singapore
Ayer Rajah Expy, Singapore
Chin Swee Rd, Singapore
261 Outram Rd, Singapore
St Andrew's Rd, Singapore
A Havelock Rd, Singapore
A Raffles Ave, Singapore
Raffles Blvd, Singapore
N Buona Vista Rd, Singapore
5 Lower Kent Ridge Rd, Singapore
4 Medical Dr, Singapore
Leonie Hill, Singapore
113 Devonshire Rd, Singapore
Devonshire Rd, Singapore
Grange Rd, Singapore
27 Grange Rd, Singapore
Natl Youth Council, Singapore
25K Paterson Rd, Singapore
Orchard Rd, Singapore
Orchard Rd, Singapore
[Table of GPS points: time, latitude, longitude]
Solution overview
Semantically compress data points
– Use coresets
Fit lines to the semantic points
– Use splines on coreset
Reverse geocode to get directions
Problem Statement
Input: a set P of n data points in R^d and an integer k
Output: an optimal k-spline for P that provides semantic compression for the large data set P
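One way to make "optimal" concrete is to score a candidate polyline by the sum of squared distances from each point to its nearest segment. The slides do not state the cost function, so this choice is an assumption, and the helper names `point_segment_dist2` and `polyline_cost` are hypothetical. The sketch only evaluates a given k-segment fit; it does not search for the best one.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def point_segment_dist2(p: Point, a: Point, b: Point) -> float:
    """Squared Euclidean distance from p to the segment a-b (any dimension)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    # Degenerate segment (a == b): fall back to the distance to a.
    t = 0.0 if denom == 0 else sum(x * y for x, y in zip(ap, ab)) / denom
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return sum((pi - ci) ** 2 for pi, ci in zip(p, closest))


def polyline_cost(points: Sequence[Point], vertices: Sequence[Point]) -> float:
    """Assumed cost: sum over points of squared distance to the nearest segment.

    `vertices` must contain at least two points (k = len(vertices) - 1 segments).
    """
    segments: List[Tuple[Point, Point]] = list(zip(vertices, vertices[1:]))
    return sum(
        min(point_segment_dist2(p, a, b) for a, b in segments) for p in points
    )


# Five points that turn a corner: two segments fit far better than one.
P = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0), (2.0, 1.0), (2.1, 2.0)]
two_segments = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]  # k = 2
one_segment = [(0.0, 0.0), (2.1, 2.0)]               # k = 1

cost_k2 = polyline_cost(P, two_segments)
cost_k1 = polyline_cost(P, one_segment)
```

Comparing `cost_k2` with `cost_k1` shows why k matters: the single segment cuts the corner and pays for it, while the two-segment polyline passes within 0.1 of every point.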
Related Work
Our Main Compression Theorem Example application
Streaming and Parallel Computation
Previous Work for streaming
Streaming Compression using merge & reduce
[Figure: stream of points p1, ..., p16]
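The merge & reduce idea can be sketched as a binary tree built on the fly: fill a leaf, reduce it to a small summary, and whenever two summaries sit at the same level, merge them and reduce again. The sketch below is a minimal illustration under stated assumptions. The reducer `reduce_points` simply keeps evenly spaced points and is a placeholder, not the coreset construction from the talk; the class name `MergeReduce` and the parameter `m` (summary size) are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]  # (latitude, longitude, time)


def reduce_points(points: Sequence[Point], m: int) -> List[Point]:
    """Placeholder reducer (assumption): keep m evenly spaced points, m >= 2."""
    n = len(points)
    if n <= m:
        return list(points)
    return [points[i * (n - 1) // (m - 1)] for i in range(m)]


class MergeReduce:
    """Streaming summary that stores at most m points per tree level."""

    def __init__(self, m: int) -> None:
        self.m = m
        self.buffer: List[Point] = []  # leaf currently being filled
        # levels[i] holds the summary of 2**i leaves, or None if the slot is free.
        self.levels: List[Optional[List[Point]]] = []

    def add(self, p: Point) -> None:
        self.buffer.append(p)
        if len(self.buffer) == 2 * self.m:  # leaf is full: reduce and push up
            self._push(reduce_points(self.buffer, self.m))
            self.buffer = []

    def _push(self, summary: List[Point]) -> None:
        level = 0
        while level < len(self.levels) and self.levels[level] is not None:
            # The stored summary is older, so it goes first to keep time order.
            summary = reduce_points(self.levels[level] + summary, self.m)
            self.levels[level] = None
            level += 1
        if level == len(self.levels):
            self.levels.append(None)
        self.levels[level] = summary

    def result(self) -> List[Point]:
        # Higher levels cover older data, so walk them from the top down.
        parts = [s for s in reversed(self.levels) if s is not None]
        merged = [p for s in parts for p in s] + self.buffer
        return reduce_points(merged, self.m)


# Sixteen points p1..p16, as in the figure, with summaries of size m = 2.
mr = MergeReduce(m=2)
for i in range(1, 17):
    mr.add((float(i), float(i), float(i)))
summary = mr.result()
occupied_levels = sum(s is not None for s in mr.levels)
```

With 16 points and leaves of 4, the tree collapses into a single summary at the top level; in general only one summary per level is held at a time, which is where the logarithmic memory footprint of this scheme comes from.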
Our Main Streaming Theorem
Parallel computation
[Figure: points p1, ..., p16]
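The same merge & reduce structure lends itself to parallel execution: split the points into contiguous chunks, compress each chunk independently, then merge the partial summaries and compress once more. The sketch below illustrates only that structure. The reducer is again an evenly spaced placeholder rather than the talk's coreset construction, the names `parallel_compress` and `workers` are hypothetical, and a thread pool is used for simplicity (in CPython a process pool or separate machines would be needed for an actual speedup on CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (latitude, longitude, time)


def reduce_points(points: Sequence[Point], m: int) -> List[Point]:
    """Placeholder reducer (assumption): keep m evenly spaced points, m >= 2."""
    n = len(points)
    if n <= m:
        return list(points)
    return [points[i * (n - 1) // (m - 1)] for i in range(m)]


def parallel_compress(points: Sequence[Point], m: int, workers: int = 4) -> List[Point]:
    """Compress contiguous chunks independently, then merge and compress again."""
    if not points:
        return []
    size = -(-len(points) // workers)  # ceiling division: chunk length
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, so time order is kept.
        partial = list(pool.map(lambda chunk: reduce_points(chunk, m), chunks))
    merged = [p for part in partial for p in part]
    return reduce_points(merged, m)


# Sixteen points p1..p16 split across four workers, summaries of size m = 2.
points = [(float(i), float(i), float(i)) for i in range(1, 17)]
compressed = parallel_compress(points, m=2, workers=4)
```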
Summary
[Same figure as in Our Approach: GPS points (time, latitude, longitude) alongside the resulting list of Singapore street addresses]
5000 points 300 points
Running time
Space
Tested Data sets
Name                         No. of Users   Time Extent   Data Size ~   Source
Subject in Singapore         1              2 Days        300k          Probe device and iPhone application
Taxi-Cabs in San-Francisco   500            4 Months      300MB         Public data ("CRAWDAD")
Taxi-Cabs in Boston          25             4 Years       15GB          MIT
The Experiment
Experiments: Subject in Singapore
[Plot axes: Compression Ratio, Error Ratio]
Experiments: 500 San-Francisco Taxi-cabs
Website
Coreset Display
Data Display
Visualization of Result of Algorithm - A Coreset
Contribution
Semantic compression of data from sensors
Line simplification using
– One pass over data
– Logarithmic space (for massive data sets)
– Linear time
– Provable bounded error