Physical Data Storage
Stephen Dawson-Haggerty
Data Sources
[Architecture diagram: sMAP data sources feed StreamFS and Hadoop HDFS, which back applications: data exploration/visualization, control loops, demand response, analytics, mobile feedback, and fault detection.]
Time-Series Databases
– Expected workload
– Related work
– Server architecture
– API
– Performance
– Future directions
sMAP Write Workload
[Photo: Dent circuit meter running sMAP]
sMAP sources
– HTTP/REST protocol for exposing physical information
– Data trickles in as it's generated
– Typical data rates: 1 reading per 1-60 s
Bulk imports
– Existing databases
– Migrations
Read Workload
Plotting engine
MATLAB & Python adaptors for analysis
Mobile apps
Batch analysis
Dominated by range queries
Latency is important for interactive data exploration
Server Architecture
[Architecture diagram: readingdb. An RPC layer exposes insert, resample, aggregate, and query operations over a time-series interface with bucketing and compression; beneath it sits a key-value store with a page cache, lock manager, and storage allocator. A streaming pipeline and a SQL storage mapper backed by MySQL sit alongside.]
Time-Series Interface
db_open() – open a connection
db_query(streamid, start, end) – query points in a range
db_next(streamid, ref), db_prev(...) – query points near a reference time
db_add(streamid, vector) – insert points into the database
db_avail(streamid) – retrieve the storage map
db_close() – close the connection
All data is part of a stream, identified only by streamid. A stream is a series of tuples: (timestamp, sequence, value, min, max).
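A minimal usage sketch in Python, mirroring the calls above. The module name readingdb and the exact signatures are assumptions for illustration; only the call names and the tuple layout come from the slide.

    # Hedged sketch of the time-series interface; the module name and
    # return types are assumptions, the call names are from the slide.
    import readingdb

    readingdb.db_open()
    # Insert one point into stream 42; a point follows the stream model:
    # (timestamp, sequence, value, min, max).
    readingdb.db_add(42, [(1262304000, 0, 21.5, 21.5, 21.5)])
    # Range query: every point in stream 42 over one day.
    points = readingdb.db_query(42, 1262304000, 1262390400)
    # Step to the point just after a reference time.
    nxt = readingdb.db_next(42, 1262304000)
    # Which (start, end, n) ranges hold data for this stream?
    coverage = readingdb.db_avail(42)
    readingdb.db_close()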
Storage Manager: BDB
Berkeley DB: an embedded key-value store
Stores binary blobs using B+ trees
Very mature: around since 1992; supports transactions, free threading, and replication
We use version 4
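The same key-value model is reachable from Python through the bsddb3 bindings; a hedged sketch (the file name and key bytes are invented, and readingdb itself talks to Berkeley DB through the C API):

    # Hedged sketch: binary blobs in a Berkeley DB B+ tree via bsddb3.
    from bsddb3 import db

    d = db.DB()
    d.open('buckets.db', dbtype=db.DB_BTREE, flags=db.DB_CREATE)
    d.put(b'stream42:1262304000', b'<compressed bucket blob>')   # store a blob
    blob = d.get(b'stream42:1262304000')                         # fetch it back
    d.close()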
RPC Evolution
First: shared memory
– Low latency
Move to threaded TCP
Google protocol buffers
– Zig-zag integer representation, multiple language bindings
– Extensible across protocol versions
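For reference, zig-zag encoding is the protocol-buffers trick that maps signed integers to unsigned ones, so values near zero of either sign become short varints. A self-contained sketch:

    # Zig-zag encoding as used by protocol buffers: interleaves signed
    # values (0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4) so small-magnitude
    # numbers of either sign encode into few varint bytes.
    def zigzag_encode(n: int) -> int:
        return (n << 1) ^ (n >> 63)   # arithmetic shift, 64-bit values

    def zigzag_decode(z: int) -> int:
        return (z >> 1) ^ -(z & 1)

    assert [zigzag_encode(n) for n in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
    assert all(zigzag_decode(zigzag_encode(n)) == n for n in (-3, 0, 7))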
On-Disk Format
All data stores perform poorly with one key per reading
– Index size is high
– And it is unnecessary
Solution: bucket readings under a (streamid, timestamp) key
Excellent locality of reference with B+ tree indexes
– Data sorted by streamid and timestamp
– Range queries translate into mostly large sequential I/Os
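One way to get this sort order from a byte-comparing B+ tree is to pack bucket keys big-endian, so lexicographic byte order equals numeric (streamid, timestamp) order. A sketch; the field widths and bucket size are assumptions:

    # Hypothetical bucket-key layout: big-endian packing makes byte-wise
    # B+ tree order equal (streamid, timestamp) order, so range queries
    # walk sequential leaf pages. Widths and bucket size are assumptions.
    import struct

    BUCKET_SECONDS = 300   # assumed bucket width

    def bucket_key(streamid: int, timestamp: int) -> bytes:
        bucket_start = timestamp - (timestamp % BUCKET_SECONDS)
        return struct.pack('>II', streamid, bucket_start)

    # Nearby readings from one stream land in the same bucket record:
    assert bucket_key(42, 1000) == bucket_key(42, 1100)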
On-Disk Format (continued)
Represent in memory with a materialized structure
– 32 B/record
– Inefficient on disk: lots of repeated data, missing fields
Solution: compression
– First: delta-encode each bucket into a protocol buffer
– Second: Huffman tree or run-length encoding (zlib)
Combined compression is 2x better than gzip or either stage alone
1M rec/second compress/decompress on modest hardware
[Diagram: bucket → compress → BDB page]
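A hedged sketch of the two-stage idea: delta-encode timestamps within a bucket (the deltas are small and repetitive), then let zlib compress the residue. The packing below stands in for the actual protocol-buffer encoding:

    # Illustrative two-stage bucket compression: delta-encode, then zlib.
    # The real on-disk format uses protocol buffers; this only shows why
    # deltas make the data compressible.
    import struct, zlib

    def compress_bucket(timestamps):
        deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
        return zlib.compress(struct.pack(f'>{len(deltas)}q', *deltas))

    def decompress_bucket(blob):
        raw = zlib.decompress(blob)
        deltas = struct.unpack(f'>{len(raw) // 8}q', raw)
        out = [deltas[0]]
        for d in deltas[1:]:
            out.append(out[-1] + d)
        return out

    ts = list(range(1262304000, 1262307000, 30))   # one reading every 30 s
    assert decompress_bucket(compress_bucket(ts)) == ts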
Other Services: Storage Mapping
What is in the database?
– Compute a set of tuples (start, end, n)
The desired interpretation is "the data source was alive"
Different data sources have different ways of maintaining this information and of establishing confidence in it
– Sometimes you have to infer it from the data
– Sometimes data sources give you liveness/presence guarantees: "I haven't heard from you in an hour, but I'm still alive!"
Dead or alive?
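A sketch of the inference case: split a sorted timestamp sequence into (start, end, n) tuples wherever the gap exceeds a threshold. The threshold is an assumption; real sources differ:

    # Hedged sketch: infer (start, end, n) liveness ranges from the data
    # by splitting on gaps longer than max_gap (an assumed threshold).
    def storage_map(timestamps, max_gap=120):
        ranges, start, prev, n = [], None, None, 0
        for t in sorted(timestamps):
            if prev is not None and t - prev > max_gap:
                ranges.append((start, prev, n))   # close the current range
                start, n = t, 0
            if start is None:
                start = t
            n += 1
            prev = t
        if start is not None:
            ranges.append((start, prev, n))
        return ranges

    # Two alive periods separated by a ten-minute outage:
    assert storage_map([0, 30, 60, 90, 700, 730, 760]) == [(0, 90, 4), (700, 760, 3)]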
readingdb6
Up since December, supporting Cory Hall, SDH Hall, and most other LoCal deployments
– Behind www.openbms.org
More than 2 billion points in 10k streams
– 12 GB on disk ≈ 5 B/record including the index
– So... we fit in memory!
Imports run at around 300k points/sec
– We maxed out the NIC
Low Latency RPC
Compression ratios
Write Load
Importing old data: 150k points/sec
Continuous write load: 300-500 points/sec
Future Thoughts
A component of a cloud storage stack for physical data
Hadoop adaptor: improve MapReduce performance over the HBase solution
The data is small: 2 billion points in 12 GB
– We can go a long time without distributing it very much
– Distribution is probably necessary for reasons other than performance
THE END