Physical Data Storage
Stephen Dawson-Haggerty
Data Sources
[Architecture diagram: sMAP data sources feed StreamFS and Hadoop HDFS, which back applications: data exploration/visualization, control loops, demand response, analytics, mobile feedback, and fault detection.]
Time-Series Databases
– Expected workload
– Related work
– Server architecture
– API
– Performance
– Future directions
sMAP Write Workload
[Photo: Dent circuit meter running sMAP]
sMAP sources
– HTTP/REST protocol for exposing physical information
– Data trickles in as it's generated
– Typical data rates: 1 reading per 1-60 s
Bulk imports
– Existing databases
– Migrations
Read Workload
Plotting engine
MATLAB & Python adaptors for analysis
Mobile apps
Batch analysis
Dominated by range queries
Latency is important for interactive data exploration
Server Architecture
[Architecture diagram: readingdb. An RPC layer exposes insert, resample, aggregate, and query operations over a time-series interface with bucketing and compression; beneath it sits a key-value store with a page cache, lock manager, and storage allocator. A streaming pipeline and a SQL storage mapper backed by MySQL sit alongside.]
Time-Series Interface
db_open() – open a connection
db_query(streamid, start, end) – query points in a range
db_next(streamid, ref), db_prev(...) – query points near a reference time
db_add(streamid, vector) – insert points into the database
db_avail(streamid) – retrieve the storage map
db_close() – close the connection
All data is part of a stream, identified only by streamid. A stream is a series of tuples: (timestamp, sequence, value, min, max).
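A minimal usage sketch in Python, mirroring the calls above. The module name readingdb and the exact signatures are assumptions for illustration; only the call names and the tuple layout come from the slide.

    # Hedged sketch of the time-series interface; the module name and
    # return types are assumptions, the call names are from the slide.
    import readingdb

    readingdb.db_open()
    # Insert one point into stream 42; a point follows the stream model:
    # (timestamp, sequence, value, min, max).
    readingdb.db_add(42, [(1262304000, 0, 21.5, 21.5, 21.5)])
    # Range query: every point in stream 42 over one day.
    points = readingdb.db_query(42, 1262304000, 1262390400)
    # Step to the point just after a reference time.
    nxt = readingdb.db_next(42, 1262304000)
    # Which (start, end, n) ranges hold data for this stream?
    coverage = readingdb.db_avail(42)
    readingdb.db_close()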
Storage Manager: BDB
Berkeley DB: an embedded key-value store
Stores binary blobs using B+ trees
Very mature: around since 1992; supports transactions, free threading, and replication
We use version 4
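The same key-value model is reachable from Python through the bsddb3 bindings; a hedged sketch (the file name and key bytes are invented, and readingdb itself talks to Berkeley DB through the C API):

    # Hedged sketch: binary blobs in a Berkeley DB B+ tree via bsddb3.
    from bsddb3 import db

    d = db.DB()
    d.open('buckets.db', dbtype=db.DB_BTREE, flags=db.DB_CREATE)
    d.put(b'stream42:1262304000', b'<compressed bucket blob>')   # store a blob
    blob = d.get(b'stream42:1262304000')                         # fetch it back
    d.close()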
RPC Evolution
First: shared memory
– Low latency
Move to threaded TCP
Google protocol buffers
– Zig-zag integer representation, multiple language bindings
– Extensible across protocol versions
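For reference, zig-zag encoding is the protocol-buffers trick that maps signed integers to unsigned ones, so values near zero of either sign become short varints. A self-contained sketch:

    # Zig-zag encoding as used by protocol buffers: interleaves signed
    # values (0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4) so small-magnitude
    # numbers of either sign encode into few varint bytes.
    def zigzag_encode(n: int) -> int:
        return (n << 1) ^ (n >> 63)   # arithmetic shift, 64-bit values

    def zigzag_decode(z: int) -> int:
        return (z >> 1) ^ -(z & 1)

    assert [zigzag_encode(n) for n in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
    assert all(zigzag_decode(zigzag_encode(n)) == n for n in (-3, 0, 7))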
On-Disk Format
All data stores perform poorly with one key per reading
– Index size is high
– And it is unnecessary
Solution: bucket readings under a (streamid, timestamp) key
Excellent locality of reference with B+ tree indexes
– Data sorted by streamid and timestamp
– Range queries translate into mostly large sequential I/Os
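One way to get this sort order from a byte-comparing B+ tree is to pack bucket keys big-endian, so lexicographic byte order equals numeric (streamid, timestamp) order. A sketch; the field widths and bucket size are assumptions:

    # Hypothetical bucket-key layout: big-endian packing makes byte-wise
    # B+ tree order equal (streamid, timestamp) order, so range queries
    # walk sequential leaf pages. Widths and bucket size are assumptions.
    import struct

    BUCKET_SECONDS = 300   # assumed bucket width

    def bucket_key(streamid: int, timestamp: int) -> bytes:
        bucket_start = timestamp - (timestamp % BUCKET_SECONDS)
        return struct.pack('>II', streamid, bucket_start)

    # Nearby readings from one stream land in the same bucket record:
    assert bucket_key(42, 1000) == bucket_key(42, 1100)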
On-Disk Format (continued)
Represent in memory with a materialized structure
– 32 B/record
– Inefficient on disk: lots of repeated data, missing fields
Solution: compression
– First: delta-encode each bucket into a protocol buffer
– Second: Huffman tree or run-length encoding (zlib)
Combined compression is 2x better than gzip or either stage alone
1M rec/second compress/decompress on modest hardware
[Diagram: bucket → compress → BDB page]
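A hedged sketch of the two-stage idea: delta-encode timestamps within a bucket (the deltas are small and repetitive), then let zlib compress the residue. The packing below stands in for the actual protocol-buffer encoding:

    # Illustrative two-stage bucket compression: delta-encode, then zlib.
    # The real on-disk format uses protocol buffers; this only shows why
    # deltas make the data compressible.
    import struct, zlib

    def compress_bucket(timestamps):
        deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
        return zlib.compress(struct.pack(f'>{len(deltas)}q', *deltas))

    def decompress_bucket(blob):
        raw = zlib.decompress(blob)
        deltas = struct.unpack(f'>{len(raw) // 8}q', raw)
        out = [deltas[0]]
        for d in deltas[1:]:
            out.append(out[-1] + d)
        return out

    ts = list(range(1262304000, 1262307000, 30))   # one reading every 30 s
    assert decompress_bucket(compress_bucket(ts)) == ts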
Other Services: Storage Mapping
What is in the database?
– Compute a set of tuples (start, end, n)
The desired interpretation is "the data source was alive"
Different data sources have different ways of maintaining this information and of establishing confidence in it
– Sometimes you have to infer it from the data
– Sometimes data sources give you liveness/presence guarantees: "I haven't heard from you in an hour, but I'm still alive!"
Dead or alive?
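A sketch of the inference case: split a sorted timestamp sequence into (start, end, n) tuples wherever the gap exceeds a threshold. The threshold is an assumption; real sources differ:

    # Hedged sketch: infer (start, end, n) liveness ranges from the data
    # by splitting on gaps longer than max_gap (an assumed threshold).
    def storage_map(timestamps, max_gap=120):
        ranges, start, prev, n = [], None, None, 0
        for t in sorted(timestamps):
            if prev is not None and t - prev > max_gap:
                ranges.append((start, prev, n))   # close the current range
                start, n = t, 0
            if start is None:
                start = t
            n += 1
            prev = t
        if start is not None:
            ranges.append((start, prev, n))
        return ranges

    # Two alive periods separated by a ten-minute outage:
    assert storage_map([0, 30, 60, 90, 700, 730, 760]) == [(0, 90, 4), (700, 760, 3)]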
readingdb6
Up since December, supporting Cory Hall, SDH Hall, and most other LoCal deployments
– Behind www.openbms.org
More than 2 billion points in 10k streams
– 12 GB on disk ≈ 5 B/record including the index
– So... we fit in memory!
Imports run at around 300k points/sec
– We maxed out the NIC
Low Latency RPC
Compression ratios
Write Load
Importing old data: 150k points/sec
Continuous write load: 300-500 points/sec
Future Thoughts
A component of a cloud storage stack for physical data
Hadoop adaptor: improve MapReduce performance over the HBase solution
The data is small: 2 billion points in 12 GB
– We can go a long time without distributing it very much
– Distribution is probably necessary for reasons other than performance
THE END