Published by Kevin Brewer. Modified over 11 years ago.
1
DataGarage: Warehousing Massive Performance Data on Commodity Servers
Charles Loboz, Slawek Smyl, Suman Nath
Microsoft Corporation
2
Monitoring Large DataCenters
Management tasks: monitoring, planning, historical analysis
Performance data: CPU, memory, disk utilization, …; response time, queue length, …
3
Monitoring Data Management
100K servers = 1 TB of data per day!
Storage challenge
  Store data over many months, years
  Petabytes of data
Query challenge
  Hours to run simple queries
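The figures on this slide can be checked with a quick back-of-envelope calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes), which the slide does not specify; the function names are illustrative only.

```python
# Back-of-envelope check of the slide's storage figures.
TB = 10**12  # assumption: decimal terabyte; the slide does not say which unit is meant

def per_server_bytes_per_day(total_bytes_per_day: float, servers: int) -> float:
    """Average daily volume contributed by one server."""
    return total_bytes_per_day / servers

def days_to_reach(target_bytes: float, bytes_per_day: float) -> float:
    """Days of retention needed before the archive reaches target_bytes."""
    return target_bytes / bytes_per_day

per_server = per_server_bytes_per_day(1 * TB, 100_000)
print(f"{per_server / 10**6:.0f} MB per server per day")             # 10 MB
print(f"{days_to_reach(1000 * TB, 1 * TB):.0f} days to reach 1 PB")  # 1000 days
```

At 1 TB per day, a petabyte accumulates in under three years, which is why retention over "many months, years" lands in petabyte territory.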
4
DataGarage
Performance data warehousing system
  Storage, query processing
  Efficient, scalable, cheap
Performance data: CPU, memory, disk utilization, …; response time, queue length, …
5
Outline
  Context
  Performance data characteristics
  Design goals
  DataGarage design
  Query Processing
  Evaluation
  Conclusion
6
Performance Data Collection
Monitoring process: CPU utilization, memory usage, disk space, SQL queue length, app response time, cache hit rate, network bandwidth, …

Time   CPU  Mem  Jobs  Disk  …
10:00  48   37   3     134
10:01  52   39         342
10:02  58   45   2     324

Our deployment
  Sampling period: 15 seconds
  counters/server
  5-100 MB/server/day
  0.01% CPU time
7
Performance Data Characteristics
Heterogeneous counter sets
  30K different counters in total; the set collected differs per server
Numeric, read-only, possibly dirty
  Dirty data is retained; it may be ignored at query time
Hierarchical queries
  Selection, projection, aggregation, data mining
  Example: fraction of hotmail.com servers in a given rack with CPU utilization > 50%
  Example: average memory utilization trend of hotmail servers
8
DataGarage Design Goals
Small storage footprint
  Reduces storage and communication cost
  Small pay-as-you-go cost for Cloud systems
Cheap
  Commodity hardware and off-the-shelf software
Fast and robust query processing
  Allows fast decisions
  Tolerates faulty and slow hardware
Simple and flexible query interface (SQL + UDF)
  Fast query writing
9
Outline
  Context
  Performance data characteristics
  Design goals
  DataGarage design
  Query Processing
  Evaluation
  Conclusion
10
Options
TableStore: relational tables
  DB engine: single-node DBMS, parallel DBMS
  MapReduce: HadoopDB [Abouzeid et al., VLDB’09]
FileStore: files
  MapReduce: Hadoop, Dryad [Isard et al., EuroSys’07]
11
Trade-offs
Dimensions: performance, fault-tolerance, cost, storage footprint
Systems compared
  TableStore + parallel DB engine (DBMS-X)
  TableStore + MR + single-node DB (HadoopDB)
  FileStore + MapReduce (Hadoop, Dryad)
  TableStore in files (DataGarage)
12
Storage Inefficiency: TableStore
Key problem: heterogeneous counter sets
  Total 30,000 unique counters, <1000/server
Wide table: Machine id | Timestamps | Counter 1 | Counter 2 | Counter n (all possible counters)
  Too many columns
  >95% sparse
Narrow table (key-value store): Machine id | Timestamps | Counter id | Value
  Redundant keys (4x more expensive than raw data)
  Expensive joins needed
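The two figures on this slide can be related to the counter counts with simple arithmetic. The sketch below is one plausible reading, not the authors' calculation: it assumes the 4x overhead comes from storing three key fields alongside every value, with all fields the same size.

```python
def wide_table_sparsity(total_counters: int, counters_per_server: int) -> float:
    """Fraction of empty cells in a row that has one column per possible counter."""
    return 1 - counters_per_server / total_counters

def narrow_table_overhead(key_fields: int = 3, value_fields: int = 1) -> float:
    """Fields stored per useful value in a (machine id, timestamp, counter id, value) row.

    Assumption: every field occupies the same space.
    """
    return (key_fields + value_fields) / value_fields

# Upper bound of 1000 counters per server out of 30,000 possible.
print(f"{wide_table_sparsity(30_000, 1_000):.1%} of wide-table cells are empty")
print(f"{narrow_table_overhead():.0f}x fields stored per value in the narrow table")
```

With fewer than 1000 counters per server, the wide table is at least about 96.7% empty, consistent with the ">95% sparse" figure.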
13
Storage Inefficiency: FileStore
Heterogeneous counter sets
  Files need to maintain schema for each server
No structure in data
  Compression cannot exploit data correlation
14
Our Solution
One wide table per server
  Benefits of TableStore, without sparseness/redundancy
Each wide table in an embedded database file
  Benefits of FileStore
  SQLite, MS SQL Server Compact Edition
Diagram: each server's counters (c1 c2 c3; c1 c4 c6 c7 c8; c2 c4 c5 c8) are stored in their own .sdf file on the file system and accessed through the Microsoft SQL Server Compact Edition library
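A minimal sketch of the one-wide-table-per-server idea, using Python's built-in sqlite3 module in place of SQL Server Compact Edition. The table name `samples`, the file names, and the helper functions are hypothetical; the counter names c1..c8 are borrowed from the slide's diagram.

```python
import os
import sqlite3
import tempfile

def create_server_db(path, counters):
    """Create one embedded database file whose wide table has only this server's counters."""
    cols = ", ".join(f'"{c}" REAL' for c in counters)
    con = sqlite3.connect(path)
    con.execute(f"CREATE TABLE samples (ts TEXT PRIMARY KEY, {cols})")
    con.commit()
    return con

def insert_sample(con, ts, values):
    """Insert one timestamped row; `values` maps counter name to reading."""
    names = ", ".join(f'"{c}"' for c in values)
    marks = ", ".join("?" for _ in values)
    con.execute(f"INSERT INTO samples (ts, {names}) VALUES (?, {marks})",
                (ts, *values.values()))
    con.commit()

root = tempfile.mkdtemp()
# Each server gets its own file and its own schema, so no column is ever empty.
server_a = create_server_db(os.path.join(root, "server_a.db"), ["c1", "c2", "c3"])
server_b = create_server_db(os.path.join(root, "server_b.db"), ["c1", "c4", "c6", "c7", "c8"])
insert_sample(server_a, "10:00", {"c1": 48, "c2": 37, "c3": 3})

columns_a = [row[1] for row in server_a.execute("PRAGMA table_info(samples)")]
print(columns_a)  # ['ts', 'c1', 'c2', 'c3']
```

Because each file is self-describing, it can be copied, backed up, or queried on its own without consulting a global schema.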
15
DataGarage Architecture
Components shown in the diagram:
  Data collectors
  Embedded database
  Distributed file system
  Controller (query dissemination)
  Summary database
  Data analysis tools (query)
16
Data Compression
Zipping files with PKZip is not effective
Compress one column at a time
  Exploit strong correlation
RLE, delta encoding not very effective
Our idea: bit-truncation + byte-interleaving
  42 AE 91 83 2B 39 A0 E4 38 C4 …
  42 AE 91 83 2B 39 …
  42 AE .. 91 83 …
  if lossy <1%
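The slide gives only a byte-level picture of the idea, so the following is a rough sketch under stated assumptions rather than the authors' algorithm. It assumes 32-bit big-endian unsigned values, reads "<1%" as a bound on relative error, keeps the 8 most significant bits of each value (which bounds the error below 2^-7), and uses zlib as a stand-in general-purpose compressor. The sample column is made up.

```python
import zlib

def truncate_bits(value: int, keep: int = 8) -> int:
    """Zero all but the `keep` most significant bits of a non-negative integer."""
    drop = max(value.bit_length() - keep, 0)
    return (value >> drop) << drop

def interleave(values, width: int = 4) -> bytes:
    """Emit byte 0 of every value, then byte 1 of every value, and so on."""
    raw = [v.to_bytes(width, "big") for v in values]
    return bytes(chunk[i] for i in range(width) for chunk in raw)

def deinterleave(data: bytes, width: int = 4):
    """Invert interleave(): rebuild each value from its scattered bytes."""
    n = len(data) // width
    return [int.from_bytes(bytes(data[i * n + j] for i in range(width)), "big")
            for j in range(n)]

# Hypothetical counter column: a slowly growing series (made-up data).
column = [1_000_000 + 37 * i for i in range(1000)]
truncated = [truncate_bits(v) for v in column]
packed = interleave(truncated)

plain_size = len(zlib.compress(b"".join(v.to_bytes(4, "big") for v in column)))
packed_size = len(zlib.compress(packed))
print(plain_size, packed_size)  # compressed sizes before and after the transform
```

Interleaving places the slowly changing high-order bytes of neighbouring samples next to each other, and truncation turns the noisy low-order bytes into zeros, so a byte-oriented compressor sees long repetitive runs.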
17
Storage Efficiency
18
Outline
  Context
  Performance data characteristics
  Design goals
  DataGarage design
  Query Processing
  Evaluation
  Conclusion
19
DataGarage Query
A DataGarage query has three components:
  On: a file system path, e.g. /hotmail/dc1/*.sdf
  Apply: a SQL query run on individual database files
  Combine: a SQL query to compute the final result
Enables MapReduce-style execution
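The three components map naturally onto a small driver loop. The sketch below is a single-process illustration using Python's sqlite3 module and .db files in place of .sdf files; `run_query`, the `partial` table, the `samples` schema, and the sample data are all hypothetical. It computes the fraction of servers whose average CPU utilization exceeds 50%.

```python
import glob
import os
import sqlite3
import tempfile

def run_query(on_pattern, apply_sql, combine_sql, partial_columns):
    """Run apply_sql in every file matching on_pattern, then combine_sql over the pooled rows."""
    combiner = sqlite3.connect(":memory:")
    combiner.execute(f"CREATE TABLE partial ({', '.join(partial_columns)})")
    marks = ", ".join("?" for _ in partial_columns)
    for path in sorted(glob.glob(on_pattern)):           # On: select database files
        con = sqlite3.connect(path)
        rows = con.execute(apply_sql).fetchall()         # Apply: per-file SQL
        con.close()
        combiner.executemany(f"INSERT INTO partial VALUES ({marks})", rows)
    return combiner.execute(combine_sql).fetchall()      # Combine: final SQL

# Made-up data: four servers under a hypothetical hotmail/dc1 directory.
root = tempfile.mkdtemp()
dc1 = os.path.join(root, "hotmail", "dc1")
os.makedirs(dc1)
cpu_by_server = {"s1": [60, 70], "s2": [40, 30], "s3": [55, 65], "s4": [10, 20]}
for name, readings in cpu_by_server.items():
    con = sqlite3.connect(os.path.join(dc1, name + ".db"))
    con.execute("CREATE TABLE samples (cpu REAL)")
    con.executemany("INSERT INTO samples VALUES (?)", [(r,) for r in readings])
    con.commit()
    con.close()

result = run_query(
    on_pattern=os.path.join(dc1, "*.db"),
    apply_sql="SELECT AVG(cpu) > 50 FROM samples",
    combine_sql="SELECT AVG(busy) FROM partial",
    partial_columns=["busy"],
)
print(result)  # [(0.5,)]
```

Since each Apply query touches exactly one file, the loop body could be farmed out to separate machines without changing the query itself.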
20
Query Execution
Diagram labels:
  Controller node: dissemination, Combine, result
  Execution nodes: Apply, Apply, …
  Distributed file system: On
  Temporary
21
Query Execution Time
22
Fault Tolerance
DataGarage key technology:
  Decoupling of execution and storage
  Fine-grained data partitioning
Data is replicated by the file system
Slow execution nodes
  Assigned smaller jobs
  Faster nodes take additional load after they finish
Execution node failures
  New nodes work on the remaining jobs of failed nodes
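One simple way to obtain the behaviours listed on this slide is pull-based scheduling over many small tasks. The simulation below is an illustrative sketch, not DataGarage's scheduler: the `simulate` function, node names, and speeds are made up, a failure is only noticed when the lost task would have finished, and a returned task is handed to a node that is idle at that moment.

```python
import heapq

def simulate(num_tasks, speeds, fail_at=None):
    """Simulate nodes pulling one small task at a time from a shared pool.

    speeds:  {node: tasks per time unit}
    fail_at: {node: time at which the node dies}; its unfinished task returns to the pool.
    Returns {node: [task ids completed]}.
    """
    fail_at = fail_at or {}
    pending = list(range(num_tasks))
    done = {node: [] for node in speeds}
    events = []   # heap of (finish_time, node, task)
    idle = []     # live nodes waiting for work

    def give_work(node, now):
        if pending:
            heapq.heappush(events, (now + 1.0 / speeds[node], node, pending.pop(0)))
        else:
            idle.append(node)

    for node in speeds:
        give_work(node, 0.0)
    while events:
        now, node, task = heapq.heappop(events)
        if now < fail_at.get(node, float("inf")):
            done[node].append(task)
            give_work(node, now)          # a finished node immediately pulls more work
        else:
            pending.append(task)          # the failed node's task goes back to the pool
            if idle:
                give_work(idle.pop(), now)
    return done

uneven = simulate(10, {"fast": 4.0, "slow": 1.0})
print({node: len(tasks) for node, tasks in uneven.items()})  # {'fast': 8, 'slow': 2}

with_failure = simulate(6, {"a": 1.0, "b": 1.0}, fail_at={"b": 2.5})
print(with_failure)  # {'a': [0, 2, 4, 5], 'b': [1, 3]}
```

Because storage is decoupled from execution and each task is small, no node owns any data, so a slow or failed node costs at most one small task of lost work.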
23
Goals Revisited
High performance: queries are pushed inside the embedded database
Storage efficient: compression
Fault tolerant: fine partitioning of data and query processing, aggressive restarting, speculative execution
Hierarchical queries: file system paths
Simple interface: SQL queries
Cheap: off-the-shelf tools, commodity machines
24
Outline
  Context
  Performance data characteristics
  Design goals
  DataGarage design
  Query Processing
  Experience
  Conclusion
25
Operational Experience
Has been in operation for more than 1 year
  Warehousing data from Microsoft data centers
Fine-grained partitioning + compression is the key to storing massive data
  Previous implementation with a narrow table: 30K server-days per 1 TB disk; slow queries
  Current implementation: 1-3 million server-days/TB; orders-of-magnitude faster queries
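The density figures above imply the following improvement factors and per-server-day footprints. This is plain arithmetic on the slide's numbers, assuming 1 TB = 10^12 bytes; the function names are illustrative.

```python
TB = 10**12  # assumption: decimal terabyte

def density_gain(old_server_days_per_tb: float, new_server_days_per_tb: float) -> float:
    """How many times more server-days fit in the same storage."""
    return new_server_days_per_tb / old_server_days_per_tb

def bytes_per_server_day(server_days_per_tb: float) -> float:
    """Average stored size of one server's data for one day."""
    return TB / server_days_per_tb

print(f"{density_gain(30_000, 1_000_000):.1f}x to {density_gain(30_000, 3_000_000):.0f}x denser")
print(f"old: {bytes_per_server_day(30_000) / 10**6:.1f} MB per server-day")
print(f"new: {bytes_per_server_day(3_000_000) / 10**6:.2f} to "
      f"{bytes_per_server_day(1_000_000) / 10**6:.0f} MB per server-day")
```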
26
Operational Experience
Embedded database files give flexibility
  Placement and backup are simplified
  Scavenge available storage on the fly
Simple design helps
  Several thousand lines of C# code to glue together existing tools (FS, embedded DB, R, …)
  Defer features until necessary: parallel Combine
Good fit with the Cloud computing model
  Data and/or computation can be on the Cloud
  Cheap: only file storage needed, small footprint
27
Conclusion
Existing solutions are not efficient for warehousing performance data
DataGarage: performance data warehouse
  Cheap, scalable, fault tolerant
  Combines benefits of DB, MapReduce, file systems
Operational experience shows the benefits
Questions?
28
Compression Overhead
29
Related Work
HadoopDB
  DataGarage has finer data partitioning
    Improves fault tolerance and storage efficiency
  DataGarage uses embedded databases
    Cheap, enables using a hierarchical file system
  DataGarage uses data compression
30
Query Processing
Diagram labels:
  Controller (query dissemination): <target>, <apply_script>, <combine_script>
  Distributed file system: <apply_script> run against each embedded database
  Temporary table: <combine_script>
  Result