DataGarage: Warehousing Massive Performance Data on Commodity Servers

Presentation transcript:

DataGarage: Warehousing Massive Performance Data on Commodity Servers
Charles Loboz, Slawek Smyl, Suman Nath
Microsoft Corporation

Monitoring Large DataCenters
- Management tasks: monitoring, planning, historical analysis
- Performance data: CPU, memory, disk utilization, …; response time, queue length, …

Monitoring Data Management
- 100K servers = 1 TB of data per day!
- Storage challenge
  - Store data over many months, years
  - Petabytes of data
- Query challenge
  - Hours to run simple queries

DataGarage
- Performance data warehousing system
- Storage, query processing
- Efficient, scalable, cheap
- Performance data: CPU, memory, disk utilization, …; response time, queue length, …

Outline
- Context
- Performance data characteristics
- Design goals
- DataGarage design
- Query processing
- Evaluation
- Conclusion

Performance Data Collection
- Monitoring process: CPU utilization, memory usage, disk space, SQL queue length, app response time, cache hit rate, network bandwidth, …
- Sample of collected data:

    Time   CPU  Mem  Jobs  Disk  …
    10:00  48   37   3     134
    10:01  52   39         342
    10:02  58   45   2     324

- Our deployment
  - Sampling period: 15 seconds
  - 100-1000 counters/server
  - 5-100 MB/server/day
  - 0.01% CPU time
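As a rough sanity check on these figures, the sketch below multiplies fleet size by per-server volume. The 100K-server fleet and the 5-100 MB/server/day range come from the slides; decimal units and the three-year retention window are assumptions made only for illustration.

```python
# Back-of-the-envelope data volume, using figures quoted on the slides.
SERVERS = 100_000        # fleet size from the "1 TB data per day" slide
MB_PER_TB = 1_000_000    # assumption: decimal units
TB_PER_PB = 1_000        # assumption: decimal units


def daily_volume_tb(mb_per_server_per_day: float, servers: int = SERVERS) -> float:
    """Total data collected per day, in TB."""
    return servers * mb_per_server_per_day / MB_PER_TB


low = daily_volume_tb(5)        # lower end of the 5-100 MB/server/day range
high = daily_volume_tb(100)     # upper end of the range
typical = daily_volume_tb(10)   # the rate that yields the quoted 1 TB/day

retention_days = 3 * 365        # assumption: three years of history
total_pb = typical * retention_days / TB_PER_PB

print(f"{low} to {high} TB/day; {typical} TB/day at 10 MB/server/day")
print(f"about {total_pb:.2f} PB over {retention_days} days")
```

Under these assumptions, a 10 MB/server/day average reproduces the quoted 1 TB per day, and a few years of retention reaches the petabyte range.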

Performance Data Characteristics
- Heterogeneous counter sets
  - 30K different counters, 100-1000 per server
- Numeric, read-only, possibly dirty
  - Dirty data retained, may be ignored for query
- Hierarchical queries
  - Selection, projection, aggregation, data mining
  - Example: fraction of hotmail.com servers in a given rack with CPU utilization > 50%
  - Example: average memory utilization trend of hotmail servers

DataGarage Design Goals
- Small storage footprint
  - Reduces storage and communication cost
  - Small pay-as-you-go cost for Cloud systems
- Cheap
  - Commodity hardware and off-the-shelf software
- Fast and robust query processing
  - Allows fast decisions
  - Tolerates faulty and slow hardware
- Simple and flexible query interface (SQL + UDF)
  - Fast query writing

Options
- TableStore: relational table
  - DB engine: single-node DBMS, parallel DBMS
  - MapReduce: HadoopDB [Abouzeid et al., VLDB’09]
- FileStore: files
  - MapReduce: Hadoop, Dryad [Isard et al., EuroSys’07]

Trade-offs
- Dimensions: performance, fault tolerance, cost, storage footprint
- Options compared:
  - TableStore + parallel DB engine (DBMS-X)
  - TableStore + MR + single-node DB (HadoopDB)
  - FileStore + MapReduce (Hadoop, Dryad)
  - TableStore in files (DataGarage)

Storage Inefficiency: TableStore
- Key problem: heterogeneous counter sets
  - 30,000 unique counters in total, <1000 per server
- Wide table: Machine id | Timestamp | Counter 1 | Counter 2 | … | Counter n (all possible counters)
  - Too many columns
  - >95% sparse
- Narrow table (key-value store): Machine id | Timestamp | Counter id | Value
  - Redundant keys (4x more expensive than raw data)
  - Expensive joins needed
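The sparsity and redundancy figures can be checked with simple arithmetic. The sketch below uses the counter totals from the slide. The equal-width-field model for the narrow table is a hypothetical simplification, offered as one plausible reading of the 4x figure rather than the authors' own accounting.

```python
TOTAL_COUNTERS = 30_000   # unique counters across all servers (from the slide)
MAX_PER_SERVER = 1_000    # upper bound on counters per server (from the slide)


def wide_table_sparsity(populated: int, columns: int = TOTAL_COUNTERS) -> float:
    """Fraction of empty cells in one row of a table with every possible counter."""
    return 1 - populated / columns


def narrow_row_overhead(key_fields: int = 3, value_fields: int = 1) -> float:
    """Row size relative to the raw value.

    Assumption (hypothetical): machine id, timestamp, counter id and value
    all occupy fields of the same width.
    """
    return (key_fields + value_fields) / value_fields


print(f"wide table: at least {wide_table_sparsity(MAX_PER_SERVER):.1%} empty cells")
print(f"narrow table: {narrow_row_overhead():.0f}x the raw value size")
```

Even a server reporting the maximum of 1000 counters leaves about 96.7% of a 30,000-column row empty, which is consistent with the ">95% sparse" claim.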

Storage Inefficiency: FileStore
- Heterogeneous counter sets
  - Files need to maintain schema for each server
- No structure in data
  - Compression cannot exploit data correlation

Our Solution
- One wide table per server
  - Benefits of TableStore, without sparseness/redundancy
- Each wide table in an embedded database file
  - Benefits of FileStore
  - SQLite, MS SQL Server Compact Edition
- [Figure: .sdf files in the file system, each holding only its own server's counters (c1 c2 c3; c1 c4 c6 c7 c8; c2 c4 c5 c8), accessed through the Microsoft SQL Server Compact Edition library]
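A minimal sketch of the one-wide-table-per-server idea, using Python's built-in sqlite3 module as the embedded database. The table name `perf`, the `.db` file naming and the sample values are hypothetical; the deployment described in the slides uses .sdf files with SQL Server Compact Edition.

```python
import os
import sqlite3
import tempfile


def write_server_file(directory, server, counters, rows):
    """Create one embedded database file holding one wide table for one server."""
    path = os.path.join(directory, f"{server}.db")
    cols = ", ".join(f'"{c}" REAL' for c in counters)
    con = sqlite3.connect(path)
    with con:  # commits on success
        con.execute(f"CREATE TABLE perf (ts TEXT, {cols})")
        marks = ", ".join("?" for _ in range(len(counters) + 1))
        con.executemany(f"INSERT INTO perf VALUES ({marks})", rows)
    con.close()
    return path


def columns_of(path):
    """Return the column names of the perf table stored in one file."""
    con = sqlite3.connect(path)
    names = [row[1] for row in con.execute("PRAGMA table_info(perf)")]
    con.close()
    return names


with tempfile.TemporaryDirectory() as d:
    # Each server's file carries only the counters that server reports.
    a = write_server_file(d, "server_a", ["c1", "c2", "c3"],
                          [("10:00", 48, 37, 3)])
    b = write_server_file(d, "server_b", ["c2", "c4", "c5", "c8"],
                          [("10:00", 52, 39, 1, 342)])
    schema_a, schema_b = columns_of(a), columns_of(b)

print(schema_a)
print(schema_b)
```

Because every file has its own schema, no row carries empty columns for counters the server never reports, and no per-value key is repeated.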

DataGarage Architecture
[Figure: architecture diagram. Components: data collectors, embedded databases, distributed file system, summary database, controller (query dissemination), data analysis tools issuing queries]

Data Compression
- Zipping files with PKZip is not effective
- Compress one column at a time
  - Exploit strong correlation
  - RLE, delta encoding not very effective
- Our idea: bit truncation + byte interleaving
  - 42 AE 91 83 2B 39 A0 E4 38 C4 … -> 42 AE 91 83 2B 39 … -> 42 AE .. 91 83 …
  - if lossy <1%
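The slide names the two techniques without spelling them out, so the following is only one plausible interpretation, not the authors' implementation. Here bit truncation zeroes low-order bits of each value, byte interleaving stores the first byte of every value, then the second byte of every value, and so on, and zlib stands in for the final compressor. The 4-byte width, the synthetic column and all function names are assumptions.

```python
import zlib


def truncate_bits(values, drop_bits):
    """Zero the lowest `drop_bits` bits of each non-negative integer (lossy)."""
    mask = ~((1 << drop_bits) - 1)
    return [v & mask for v in values]


def interleave_bytes(values, width=4):
    """Emit byte 0 of every value, then byte 1 of every value, and so on."""
    raw = [v.to_bytes(width, "big") for v in values]
    return bytes(row[i] for i in range(width) for row in raw)


def deinterleave_bytes(data, width=4):
    """Invert interleave_bytes."""
    n = len(data) // width
    return [
        int.from_bytes(bytes(data[i * n + j] for i in range(width)), "big")
        for j in range(n)
    ]


# Synthetic column of slowly varying counter readings (hypothetical data).
column = [50_000 + (i * 37) % 200 for i in range(2_000)]
truncated = truncate_bits(column, drop_bits=4)
max_rel_error = max((a - b) / a for a, b in zip(column, truncated))

plain = b"".join(v.to_bytes(4, "big") for v in truncated)
shuffled = interleave_bytes(truncated, width=4)

print("max relative error:", max_rel_error)
print("compressed sizes (plain, interleaved):",
      len(zlib.compress(plain)), len(zlib.compress(shuffled)))
```

Interleaving groups the rarely changing high-order bytes together, which is the kind of within-column correlation a general-purpose compressor can then exploit. The actual gain depends on the data.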

Storage Efficiency
[Figure: storage efficiency results]

DataGarage Query
- A DataGarage query has three components:
  - On: a file system path, e.g. /hotmail/dc1/*.10-.-2009.sdf
  - Apply: a SQL query run on individual database files
  - Combine: a SQL query to compute the final result
- Enables map-reduce style execution
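The On/Apply/Combine structure can be sketched in a few lines with sqlite3 and glob. This is a single-process illustration under stated assumptions, not the DataGarage engine: the function `run_query`, the `partial` table, the `.db` files and the sample CPU values are all hypothetical. The query computes the fraction of servers whose average CPU utilization exceeds 50%.

```python
import glob
import os
import sqlite3
import tempfile


def run_query(on, apply_sql, combine_sql, temp_schema):
    """Run apply_sql on each file matching `on`, then combine_sql on the results."""
    combined = sqlite3.connect(":memory:")
    combined.execute(f"CREATE TABLE partial ({temp_schema})")
    for path in sorted(glob.glob(on)):          # On: select files by path
        con = sqlite3.connect(path)
        rows = con.execute(apply_sql).fetchall()  # Apply: per-file SQL
        con.close()
        if rows:
            marks = ", ".join("?" for _ in rows[0])
            combined.executemany(f"INSERT INTO partial VALUES ({marks})", rows)
    result = combined.execute(combine_sql).fetchall()  # Combine: final SQL
    combined.close()
    return result


def make_file(path, cpu_samples):
    """Write a tiny per-server database with a perf(ts, cpu) table."""
    con = sqlite3.connect(path)
    with con:
        con.execute("CREATE TABLE perf (ts INTEGER, cpu REAL)")
        con.executemany("INSERT INTO perf VALUES (?, ?)",
                        list(enumerate(cpu_samples)))
    con.close()


with tempfile.TemporaryDirectory() as root:
    dc = os.path.join(root, "hotmail", "dc1")
    os.makedirs(dc)
    make_file(os.path.join(dc, "s1.db"), [40, 60])  # average 50
    make_file(os.path.join(dc, "s2.db"), [70, 90])  # average 80
    make_file(os.path.join(dc, "s3.db"), [10, 30])  # average 20
    result = run_query(
        on=os.path.join(dc, "*.db"),
        apply_sql="SELECT AVG(cpu) FROM perf",
        combine_sql="SELECT 1.0 * SUM(avg_cpu > 50) / COUNT(*) FROM partial",
        temp_schema="avg_cpu REAL",
    )

print(result)
```

Apply plays the role of the map step and Combine the reduce step. Since each Apply touches exactly one file, the per-file work can be distributed across nodes without coordination.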

Query Execution
[Figure: query execution. The controller node disseminates the query; execution nodes run Apply on the files selected by On in the distributed file system; Apply results go to a temporary table; the controller runs Combine to produce the result]

Query Execution Time
[Figure: query execution time results]

Fault Tolerance
- DataGarage key technology:
  - Decoupling of execution and storage
  - Fine-grained data partitioning
- Data is replicated by the file system
- Slow execution nodes
  - Assigned smaller jobs
  - Faster nodes take additional load after finishing
- Execution node failures
  - New nodes work on the remaining jobs of failed nodes
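To illustrate why fine-grained partitioning helps with slow and failed nodes, here is a small deterministic simulation of pull-based scheduling. It is a hypothetical model, not the DataGarage scheduler: each idle node takes the next unprocessed file, and a file held by a node that dies is put back in the queue. Node names, speeds and failure times are invented for the example.

```python
import heapq


def schedule(files, speeds, fail_at=None):
    """Simulate pull-based assignment of per-file jobs to execution nodes.

    speeds:  node -> seconds needed per file
    fail_at: node -> time at which the node dies (optional)
    Returns a dict mapping each finished file to the node that processed it.
    """
    fail_at = fail_at or {}
    pending = list(files)
    done = {}
    # Min-heap of (time at which the node becomes idle, node name).
    idle = [(0.0, node) for node in sorted(speeds)]
    heapq.heapify(idle)
    while pending and idle:
        now, node = heapq.heappop(idle)
        job = pending.pop(0)
        finish = now + speeds[node]
        if node in fail_at and finish > fail_at[node]:
            pending.append(job)  # node died mid-job: requeue, node never returns
            continue
        done[job] = node
        heapq.heappush(idle, (finish, node))
    return done


files = [f"server{i}.sdf" for i in range(12)]
healthy = schedule(files, {"fast": 1.0, "slow": 3.0})
with_failure = schedule(files, {"fast": 1.0, "slow": 3.0}, fail_at={"slow": 4.0})

print(sum(1 for n in healthy.values() if n == "fast"), "files on the fast node")
print(len(with_failure), "files finished despite the failure")
```

Because jobs are small and storage is decoupled from execution, the faster node simply absorbs more files, and a lost node costs only the one file it was processing.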

Goals Revisited
- High performance: queries are pushed inside the embedded database
- Storage efficient: compression
- Fault tolerant: fine partitioning of data and query processing, aggressive restarting, speculative execution
- Hierarchical queries: file system paths
- Simple interface: SQL queries
- Cheap: off-the-shelf tools, commodity machines

Outline
- Context
- Performance data characteristics
- Design goals
- DataGarage design
- Query processing
- Experience
- Conclusion

Operational Experience
- Has been in operation for more than 1 year
  - Warehousing data from Microsoft data centers
- Partitioning with fine granularity + compression is the key to storing massive data
- Previous implementation with narrow table
  - 30K server-days in 1 TB of disk
  - Slow queries
- Current implementation
  - 1-3 million server-days/TB
  - Orders of magnitude faster queries
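The density figures translate into per-server-day storage as follows. The server-day counts come from the slide; the decimal terabyte is an assumption.

```python
BYTES_PER_TB = 10**12   # assumption: decimal terabyte
BYTES_PER_MB = 10**6


def mb_per_server_day(server_days_per_tb: float) -> float:
    """Average storage consumed by one server-day of data, in MB."""
    return BYTES_PER_TB / server_days_per_tb / BYTES_PER_MB


old = mb_per_server_day(30_000)          # previous narrow-table implementation
new_low = mb_per_server_day(1_000_000)   # current implementation, lower density
new_high = mb_per_server_day(3_000_000)  # current implementation, higher density

gain_low = 1_000_000 / 30_000
gain_high = 3_000_000 / 30_000

print(f"previous: {old:.1f} MB per server-day")
print(f"current:  {new_high:.2f} to {new_low:.2f} MB per server-day")
print(f"density gain: {gain_low:.0f}x to {gain_high:.0f}x")
```

Under this reading, the current implementation stores roughly 33 to 100 times more server-days per terabyte than the narrow-table version.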

Operational Experience
- Embedded database files give flexibility
  - Placement, backup simplified
  - Scavenge available storage on the fly
- Simple design helps
  - Several thousand lines of C# code to glue together existing tools (FS, embedded DB, R, …)
  - Defer features until necessary: parallel Combine
- Good fit with the Cloud computing model
  - Data and/or computation can be on the Cloud
  - Cheap: only file storage needed, small footprint

Conclusion
- Existing solutions are not efficient for warehousing performance data
- DataGarage: performance data warehouse
  - Cheap, scalable, fault tolerant
  - Combines benefits of DB, MapReduce, file systems
- Operational experience shows the benefits
Questions?

Compression Overhead
[Figure: compression overhead results]

Related Work
- HadoopDB
  - DataGarage has finer data partitioning
    - Improves fault tolerance and storage efficiency
  - DataGarage uses embedded databases
    - Cheap, enables using hierarchical file system
  - DataGarage uses data compression

Query Processing
[Figure: query processing. Labels: controller (query dissemination), <target>, <apply_script>, <combine_script>, embedded database, distributed file system, temporary table, result]