Adaptive Storage Management for Modern Data Centers
Imranul Hoque

Applications in Modern Data Centers
Facebook alone stores more than 20 PB of data (over 260 billion files) just in its photo application, serves more than 500 million active users, and has over 2.5 million websites integrated with it.

Application Characteristics
Type of data – networked data (integration between content and social sites).
Volume of data / scale of system – TB to PB (photos, videos, news articles, etc.).
End-user performance requirement – low latency (the real-time web).

New Classes of Storage Systems
Networked data – traditional storage systems fall short on representation and on joins (e.g., friends of friends); the answer is graph databases (example: Twitter uses FlockDB).
Large data volume and massive system scale – traditional systems favor consistency over performance and availability, and target small but frequent reads/writes plus batch transactions with rare writes rather than heavy read/write workloads; the answer is key-value storage (example: Digg uses Cassandra).
Low latency and high throughput – the disk is the bottleneck; the answer is in-memory storage (example: Craigslist uses Redis).
The absence of adaptive techniques for storage management causes these new storage systems to perform sub-optimally.

Adaptive Storage Management
Graph database (Bondhu) – use cases: online social networks, Amazon-style recommendation engines, hierarchical data sets; scope of performance improvement: data layout on disk.
Key-value storage (Shomota) – use cases: status updates, image tags, click streams; scope of performance improvement: load distribution across servers.
In-memory storage (Sreeti) – use cases: status updates, financial tick data, online gaming stats; scope of performance improvement: construction of the working set.
How should these adaptive techniques be designed?

Hypothesis Statement
"Adaptive techniques which leverage the underlying heterogeneity of the system as a first-class citizen can improve the performance of these new classes of storage systems significantly."

Bondhu: Leveraging Heterogeneity
[Figure: placement techniques map the social graph onto the hard disk drive.]
Exploit heterogeneity in the social graph to make better data placement decisions.

Shomota: Leveraging Heterogeneity
[Figure: a table is split into tablets (here, days of the week) spread across Server 1 through Server 4.]
Mitigate load heterogeneity across servers to alleviate hot spots via adaptive load-balancing techniques.

Sreeti: Leveraging Heterogeneity
[Figure: users access data; prefetching and swapping strategies move data between main memory and persistent storage.]
Exploit heterogeneity in user access patterns to design prefetching and swapping techniques for better performance.

Contribution
Bondhu – graph storage; leverages heterogeneity in the social graph (exploit); technique: data layout on disk; status: mature.
Shomota – key-value storage; leverages heterogeneity in load (mitigate); technique: load balancing across servers; status: on-going.
Sreeti – in-memory storage; leverages heterogeneity in request patterns (exploit); technique: prefetching and swapping; status: future work.

Bondhu: A Social Network-Aware Disk Manager for Graph Databases

Visualization of Blocks Accessed
Facebook New Orleans network [Viswanath2009]; Neo4j graph database with 400 KB of property blocks per user; the blktrace tool traces which blocks a getProperty() call touches.
[Figure: the accessed blocks are scattered across the disk.]

Sequential vs. Random Access
How bad is random access? Measurements with the fio benchmarking tool show sequential throughput roughly two orders of magnitude above random throughput (e.g., 150 MB/s vs. 1 MB/s, and 98 MB/s vs. 0.8 MB/s).

Social Network-Aware Disk Manager
Approaches in other systems – popularity-based placement in multimedia file systems [Wong1983]; tracking block access patterns [Li2004][Bhadkamkar2009].
Properties of online social networks [Mislove2007] – strong community structure; small-world phenomenon.
Exploit heterogeneity in the social graph – keep related users' data close together on disk to reduce seek time, rotational latency, and the number of seeks.

The Bondhu System
A novel framework for disk layout algorithms based on community detection.
Integrated into Neo4j, a widely used open-source graph database.
Experiments on a real social graph show a 48% response-time improvement over the default Neo4j layout.

Problem Definition
The logical block addressing (LBA) scheme gives a one-dimensional view of the disk, so laying out the graph means assigning each user's vertex to a block position.
[Figure: two layouts of vertices V1–V7 onto blocks L1–L7; reordering the vertices reduces the layout cost from 18 to 14.]
Finding the minimum-cost layout is NP-hard, so Bondhu uses a fast multi-level heuristic approach.
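The slide does not spell out the cost metric behind the 18-vs-14 example; the sketch below assumes the usual linear-arrangement formulation, where the cost of a layout is the weight-scaled sum of distances between connected vertices' block positions. The edge list is hypothetical, not the slide's exact graph.

```python
# Sketch: layout cost on a one-dimensional LBA space.
# Assumption: cost = sum over edges (u, v, w) of w * |position(u) - position(v)|.

def layout_cost(weighted_edges, layout):
    """weighted_edges: iterable of (u, v, weight); layout: vertices in disk order."""
    position = {v: i for i, v in enumerate(layout)}
    return sum(w * abs(position[u] - position[v]) for u, v, w in weighted_edges)

# Illustrative 7-vertex graph (unit weights).
edges = [("V1", "V2", 1), ("V1", "V3", 1), ("V2", "V3", 1), ("V3", "V5", 1),
         ("V4", "V6", 1), ("V4", "V7", 1), ("V6", "V7", 1)]

print(layout_cost(edges, ["V1", "V2", "V3", "V4", "V5", "V6", "V7"]))  # default order
print(layout_cost(edges, ["V5", "V3", "V1", "V2", "V4", "V6", "V7"]))  # community-aware order
```

Placing tightly connected vertices next to each other lowers this cost, which is exactly what the multi-level heuristic aims for.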

Disk Layout Algorithm
[Figure: the layout pipeline, consisting of the community detection, intra-community layout, and inter-community layout modules described on the following slides.]

Community Detection Module
Goal: organize the users of the social graph into clusters.
Based on existing community detection algorithms – graph-partition driven (ParCom) [Karypis1998] and modularity-optimization driven (ModCom) [Blondel2008].
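For illustration only, a sketch of this module's interface; networkx's built-in greedy modularity heuristic is used here as a stand-in for the ParCom and ModCom algorithms the slide cites.

```python
# Sketch: cluster the users of the social graph into communities.
# networkx's greedy modularity heuristic is a stand-in; the thesis cites a
# partition-driven approach [Karypis1998] and Louvain-style modularity
# optimization [Blondel2008].
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def detect_communities(weighted_edges):
    """weighted_edges: iterable of (u, v, weight) triples; returns a list of vertex sets."""
    graph = nx.Graph()
    graph.add_weighted_edges_from(weighted_edges)
    return [set(c) for c in greedy_modularity_communities(graph, weight="weight")]
```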

Intra-community Layout Module
Goal: create a disk layout for each community.
[Figure: the vertices of a community are placed into adjacent blocks, guided by their edge weights.]
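The exact intra-community heuristic is not given on the slide; the sketch below is a hypothetical greedy ordering under the assumption that each community's members should be arranged so that heavily weighted neighbors land in nearby blocks.

```python
# Sketch: greedy intra-community ordering (hypothetical heuristic).
# Start from the member with the highest weighted degree and repeatedly append
# the unplaced member most strongly connected to the members already placed.
from collections import defaultdict

def order_community(members, weighted_edges):
    members = set(members)
    weight = defaultdict(float)          # weights of edges inside the community
    degree = defaultdict(float)
    for u, v, w in weighted_edges:
        if u in members and v in members:
            weight[(u, v)] = weight[(v, u)] = w
            degree[u] += w
            degree[v] += w
    order = [max(members, key=lambda v: degree[v])]
    placed = set(order)
    while len(placed) < len(members):
        nxt = max(members - placed,
                  key=lambda v: sum(weight[(v, p)] for p in placed))
        order.append(nxt)
        placed.add(nxt)
    return order
```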

Inter-community Layout Module
[Figure: communities are collapsed into super-vertices (V_A, V_B) whose connecting edge aggregates the cross-community weight; the communities are then ordered on disk.]
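A companion sketch for this module, assuming (as the figure's super-vertices V_A and V_B suggest) that the weight between two communities is the total weight of the edges crossing between them, and that the super-vertices are ordered with the same greedy routine sketched above.

```python
# Sketch: order whole communities on disk by collapsing them into super-vertices.
# Assumption: inter-community weight = summed weight of all crossing edges;
# order_community() is the routine from the previous sketch.
from collections import defaultdict

def inter_community_order(communities, weighted_edges):
    community_of = {v: i for i, members in enumerate(communities) for v in members}
    super_weight = defaultdict(float)
    for u, v, w in weighted_edges:
        cu, cv = community_of[u], community_of[v]
        if cu != cv:
            super_weight[(min(cu, cv), max(cu, cv))] += w
    super_edges = [(a, b, w) for (a, b), w in super_weight.items()]
    # The final disk layout concatenates each community's internal order,
    # following the order of the super-vertices returned here.
    return order_community(range(len(communities)), super_edges)
```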

Modeling OSN Dynamics
Uniform model – assign every edge equal weight.
Preferential model – the weight of edge (Vi, Vj) is proportional to edge degree; we use [edge_degree(Vi) + edge_degree(Vj)] / 2.
Overlap model – the weight is proportional to the number of common friends; we use (c + 1), where c is the number of common friends.
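The three models translate directly into code. In this small sketch the social graph is assumed to be a dict mapping each user to the set of their friends; that representation is an assumption, not part of the slide.

```python
# Sketch: edge weights under the uniform, preferential, and overlap models.
# adj: dict mapping each vertex to the set of its friends (assumed representation).

def uniform_weight(adj, u, v):
    return 1.0                                   # every edge gets equal weight

def preferential_weight(adj, u, v):
    # Slide: use the mean of the endpoints' edge degrees.
    return (len(adj[u]) + len(adj[v])) / 2.0

def overlap_weight(adj, u, v):
    # Slide: use (c + 1), where c = number of common friends of u and v.
    return len(adj[u] & adj[v]) + 1.0
```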

Implementation and Evaluation
Implementation: modified the PropertyStore of Neo4j.
Dataset: Facebook New Orleans network [Viswanath2009] (users and friendship links), with edge weights assigned according to the uniform, preferential, and overlap models.
Workload: a sample social-network application running the 'list all friends' operation for 1500 random users, 6 times per user.
Metrics: cost (defined earlier) and response time, i.e., the time to fetch the data blocks of all friends of a random user.
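A sketch of the workload structure just described; `list_all_friends` is a hypothetical hook standing in for the actual graph-database call (in the thesis, a Neo4j traversal), and only the sampling and timing scaffolding is shown.

```python
# Sketch: 'list all friends' workload over random users, timed per request.
import random
import time

def run_workload(all_users, list_all_friends, n_users=1500, repeats=6, seed=0):
    rng = random.Random(seed)
    sampled = rng.sample(all_users, n_users)     # 1500 random users
    timings = []
    for user in sampled:
        for _ in range(repeats):                 # 6 requests per user
            start = time.perf_counter()
            list_all_friends(user)               # fetch the data blocks of every friend
            timings.append(time.perf_counter() - start)
    return timings                               # response-time distribution
```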

Visualization using Bondhu
[Figure: with the default layout, disk accesses are scattered; with Bondhu, accesses are clustered.]

Effect of Block Size
The file system reads data in chunks of 4 KB, so a 4 B, 40 B, or 400 B block size packs the data of 1024, 102, or 10 users per chunk, respectively.
Default layout – the expected number of friends per chunk drops by roughly 10x each time the block size grows from 4 B to 40 B to 400 B.
Bondhu layout – growing from 4 B to 40 B causes little decrease in the expected number of friends per chunk; from 40 B to 400 B the decrease is rapid.
[Figure: results shown with caching disabled and with caching enabled (data cached in memory).]

Response Time Metric vs. Cost Metric
The improvement in response time is due to better placement decisions.
[Figure: results for block sizes of 40 B and 400 KB.]

Effect of Different Models
Models = {preferential, overlap, uniform, default}; workloads = {random, preferential, overlap}; 1000 random users, 1000 requests, 10 measurements.
Result: the more complex models provide little added benefit.

Effect of OSN Evolution
Even when the layout is computed on an older snapshot of the network with 33% fewer nodes, it is still better than the default layout by 72%.

Summary
Adaptive disk layout techniques that exploit the heterogeneity (community structure) of the social graph.
Implemented in the Neo4j graph database and evaluated through extensive trace-driven experimentation.
48% improvement in median response time; little additional benefit from the more complex models; re-organization is needed only infrequently.

Shomota: An Adaptive Load Balancer for Distributed Key-Value Storage Systems

Load Heterogeneity in K-V Storage
Hash-partitioned vs. range-partitioned data – range partitioning enables efficient range scans and searches, while hash partitioning helps distribute data evenly.
Range partitioning causes uneven space distribution; the solution is to partition the table into tablets and move them around.
A few very popular records can also concentrate load on individual servers.
[Figure: tablets of a table distributed unevenly across Server 1 through Server 4.]

The Shomota System
Goal: mitigate load heterogeneity, where load covers both space and bandwidth.
Algorithms for the load balancing problem – evenly distribute the spare capacity, use a distributed rather than a centralized algorithm, and reduce the number of tablet moves.
Previous solutions are one-dimensional, redistribute the key space, or target bulk loading [Stoica2001, Byers2003, Karger2004, Rao2003, Godfrey2005, Silberstein2008].

System Modeling and Assumptions
A table is split into tablets; tablet i carries a bandwidth load B_i and a space load S_i, and each server's load (e.g., B_A, S_A for Server A) aggregates the loads of the tablets it hosts.
Assumptions: (1) each tablet's normalized load is at most 0.01 in both dimensions; (2) the number of tablets is much larger than the number of nodes.
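A sketch of this load model as plain data structures; the field names and the normalization note are assumptions read off the figure labels, not Shomota's actual code.

```python
# Sketch: tablets carry (bandwidth, space) loads; a server's load is the sum
# over the tablets it hosts.
from dataclasses import dataclass, field

@dataclass
class Tablet:
    bandwidth: float   # B_i, assumed normalized so no tablet exceeds 0.01
    space: float       # S_i, same normalization in the space dimension

@dataclass
class Server:
    tablets: list = field(default_factory=list)

    def load(self):
        return (sum(t.bandwidth for t in self.tablets),
                sum(t.space for t in self.tablets))
```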

System State
[Figure: each server is a point in the two-dimensional load space spanned by bandwidth (B) and space (S); a target point with a surrounding target zone helps achieve convergence.]
Goal: move tablets around so that every server falls within the target zone.
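A minimal sketch of the target-zone test, assuming the target point is the per-dimension average load over all servers and the zone is a fixed slack band around it; the slide does not define the zone's exact shape or size, so both are assumptions. The `server_loads` pairs could come from the `Server.load()` sketch above.

```python
# Sketch: is a server inside the target zone?
# Assumptions: loads are (bandwidth, space) pairs; the target point is the
# per-dimension average; the zone is +/- `slack` around it in each dimension.

def target_point(server_loads):
    n = len(server_loads)
    return (sum(b for b, _ in server_loads) / n,
            sum(s for _, s in server_loads) / n)

def in_target_zone(load, target, slack=0.05):
    (b, s), (tb, ts) = load, target
    return abs(b - tb) <= slack and abs(s - ts) <= slack

# Goal: keep transferring tablets until in_target_zone() holds for every server.
```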

Load Balancing Algorithms
Phase 1 – global averaging phase: servers estimate the global average load; the variance of the approximation of the average decreases exponentially fast [Kempe2003].
Phase 2 – gossip phase: a point selection strategy (midpoint or greedy) picks a target point, and a tablet transfer strategy moves tablets toward the selected point at minimum cost (space transferred).
[Figure: phases 1 and 2 alternate over time.]
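Phase 1 can be realized with push-sum-style gossip in the spirit of [Kempe2003]; the synchronous sketch below is one possible realization, not necessarily the exact protocol used in Shomota.

```python
# Sketch: push-sum gossip so every server approximates the global average load.
# Each node keeps (value, weight); each round it sends half of both to a random
# peer and keeps the other half. value/weight converges to the true average.
import random

def push_sum(initial_loads, rounds=50, seed=0):
    rng = random.Random(seed)
    n = len(initial_loads)
    value = list(initial_loads)
    weight = [1.0] * n
    for _ in range(rounds):
        inbox_v = [0.0] * n
        inbox_w = [0.0] * n
        for i in range(n):
            j = rng.randrange(n)                       # random gossip partner
            inbox_v[i] += value[i] / 2; inbox_w[i] += weight[i] / 2
            inbox_v[j] += value[i] / 2; inbox_w[j] += weight[i] / 2
        value, weight = inbox_v, inbox_w
        # The error of value/weight around the true average shrinks
        # exponentially fast with the number of rounds.
    return [v / w for v, w in zip(value, weight)]
```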

Summary
Distributed load balancing techniques for key-value storage systems that mitigate both space and throughput heterogeneity across servers.
Evaluated with a PeerSim-based simulation; integration into Voldemort is on-going.
Simulation results exhibit fast convergence while keeping data movement low.

Sreeti: Access Pattern-Aware Memory Management

In-memory Storage Systems
Growth of the Internet population – search engines, social networking, blogging, e-commerce, media sharing.
User expectation – fast response time plus high availability.
Serving a large number of users in real time – option 1: SSDs; option 2: memory.
Emerging trends – memory caching systems (Memcached) and in-memory storage systems (Redis, VoltDB, etc.).

Motivation
Existing in-memory storage systems assume there is enough RAM to fit all data in memory; a counterexample is when the values associated with keys are large.
Existing systems (Redis, Memcached) fall back to LRU for swapping.
The performance of in-memory storage systems can be improved further if heterogeneity in user request patterns is leveraged.

Proposal for the Sreeti System
Adaptive techniques for prefetching, caching, and swapping that exploit heterogeneity in user request patterns.
[Figure: users drive an application backed by main memory and persistent storage; association rule mining guides what is fetched into memory and what is swapped out.]
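Since Sreeti is future work, the sketch below is purely illustrative: a crude co-occurrence miner stands in for association rule mining, and `get_with_prefetch` shows where such rules would plug into the fetch path. All names and thresholds are hypothetical.

```python
# Sketch: prefetch keys that frequently follow the requested key.
# `mine_rules` is a toy stand-in for association rule mining over an access log;
# `store` is the persistent layer and `cache` the in-memory layer (plain dicts).
from collections import defaultdict, deque

def mine_rules(access_log, window=3, min_support=2):
    counts = defaultdict(lambda: defaultdict(int))
    recent = deque(maxlen=window)
    for key in access_log:
        for prev in recent:                      # key was accessed shortly after prev
            counts[prev][key] += 1
        recent.append(key)
    return {k: {f for f, c in following.items() if c >= min_support}
            for k, following in counts.items()}

def get_with_prefetch(key, cache, store, rules):
    if key not in cache:
        cache[key] = store[key]                  # demand fetch
    for related in rules.get(key, ()):           # speculative prefetch
        if related not in cache and related in store:
            cache[related] = store[related]
    return cache[key]
```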

Hypothesis Statement
"Adaptive techniques which leverage the underlying heterogeneity of the system as a first-class citizen can improve the performance of these new classes of storage systems significantly."
[Figure: graph databases exploit heterogeneity via disk layout, key-value stores mitigate it via load balancing, and in-memory stores exploit it via prefetching, caching, and swapping.]

Summary
Bondhu – graph storage; exploits heterogeneity in the social graph; proposed technique: data layout on disk.
Shomota – key-value storage; mitigates heterogeneity in load; proposed technique: load balancing across servers.
Sreeti – in-memory storage; exploits heterogeneity in request patterns; proposed technique: prefetching and swapping.