SPANStore: Cost-Effective Geo-Replicated Storage Spanning Multiple Cloud Services
Zhe Wu, Michael Butkiewicz, Dorian Perkins, Ethan Katz-Bassett, Harsha V. Madhyastha
UC Riverside and USC

Geo-distributed Services for Low Latency

Cloud Services Simplify Geo-distribution

Need for Geo-Replication
- Data uploaded by a user may be viewed/edited by users in other locations (social networking: Facebook, Twitter; file sharing: Dropbox, Google Docs) → geo-replication of data is necessary.
- Storage services are isolated in each cloud data center → the application needs to handle replication itself.

Geo-replication on Cloud Services
- Lots of recent work on enabling geo-replication: Walter (SOSP'11), COPS (SOSP'11), Spanner (OSDI'12), Gemini (OSDI'12), Eiger (NSDI'13)… targeting faster performance or stronger consistency.
- Added consideration on cloud services: minimizing cost.

Outline
- Problem and motivation
- SPANStore overview
- Techniques for reducing cost
- Evaluation

SPANStore
- Key-value store (GET/PUT interface) spanning cloud storage services.
- Main objective: minimize cost while satisfying application requirements:
  - Latency SLOs
  - Consistency (eventual vs. sequential)
  - Fault tolerance
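For concreteness, here is a minimal sketch of the GET/PUT interface such a store exposes, written as an in-memory stand-in; the class and key names are illustrative, not SPANStore's actual library API:

```python
# Minimal sketch of the GET/PUT interface described on this slide.
# In-memory stand-in for illustration only, not SPANStore's actual library.
from typing import Dict, Optional

class KeyValueStore:
    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        # In SPANStore, a GET is served from a replica chosen to meet the
        # latency SLO at minimum cost; here we just read local memory.
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        # In SPANStore, a PUT is propagated to the replicas chosen by the
        # replication policy; here we just write local memory.
        self._data[key] = value

store = KeyValueStore()
store.put("photo:123", b"jpeg bytes")
print(store.get("photo:123"))
```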

SPANStore Overview
[Architecture diagram: in each data center, the application issues requests through a SPANStore client library, which performs metadata lookups and reads/writes data across data centers A–D based on the optimal replication policy, returning data/ACKs to the application.]

SPANStore Overview
[Diagram: a central Placement Manager computes the replication policy used by the SPANStore libraries in data centers A–D. Inputs from the application: its workload and its latency, consistency, and fault-tolerance requirements. Inputs from SPANStore's own characterization: inter-DC latencies and pricing policies.]
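To make the diagram's inputs and output concrete, here is a hypothetical sketch of the data the Placement Manager consumes and produces; the field names are assumptions for illustration, not the paper's actual data model:

```python
# Hypothetical sketch of the Placement Manager's inputs and output.
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

@dataclass
class Requirements:                     # provided by the application
    get_slo_ms: int
    put_slo_ms: int
    consistency: str                    # "eventual" or "sequential"
    failures_tolerated: int

@dataclass
class Characterization:                 # measured by SPANStore itself
    inter_dc_latency_ms: Dict[Tuple[str, str], float]
    storage_price_per_gb: Dict[str, float]
    transfer_price_per_gb: Dict[Tuple[str, str], float]
    get_price_per_request: Dict[str, float]
    put_price_per_request: Dict[str, float]

@dataclass
class ReplicationPolicy:                # output, computed per access set
    access_set: FrozenSet[str]          # data centers that access these objects
    replicas: List[str]                 # where copies are stored
    get_from: Dict[str, str]            # per accessing DC: which replica to read
    put_to: Dict[str, List[str]]        # per accessing DC: which replicas to write
```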

Outline
- Problem and motivation
- SPANStore overview
- Techniques for reducing cost
- Evaluation

Questions to be addressed for every object:
- Where to store replicas
- How to execute PUTs and GETs

Cloud Storage Service Cost
Storage service cost = storage cost (the amount of data stored) + request cost (the number of PUT and GET requests issued) + data transfer cost (the amount of data transferred out of the data center).
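As a back-of-the-envelope sketch of this cost model (prices are passed in as parameters; example rates appear in the pricing table a few slides below):

```python
# Simple sketch of the per-data-center cost model on this slide.
def storage_service_cost(stored_gb: float, num_gets: int, num_puts: int,
                         transfer_out_gb: float, storage_price_per_gb: float,
                         get_price_per_10k: float, put_price_per_1k: float,
                         transfer_price_per_gb: float) -> float:
    storage_cost = stored_gb * storage_price_per_gb
    request_cost = (num_gets / 10_000) * get_price_per_10k \
                 + (num_puts / 1_000) * put_price_per_1k
    transfer_cost = transfer_out_gb * transfer_price_per_gb
    return storage_cost + request_cost + transfer_cost
```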

Low Latency SLO Requires High Replication in Single Cloud Deployment
[Map: with a 100ms latency bound and only AWS regions available, four replicas are needed to cover all client locations.]

Technique 1: Harness Multiple Clouds
[Map: candidate replica locations drawn from multiple cloud providers, not just AWS regions, under the same 100ms latency bound.]

Price Discrepancies across Clouds

Cloud region  | Storage price ($/GB) | Data transfer price ($/GB) | GET request price ($/10,000 requests) | PUT request price ($/1,000 requests)
S3 US West    | 0.095                | 0.12                       | 0.004                                 | 0.005
Azure Zone 2  | 0.095                | 0.19                       | 0.001                                 | 0.0001
GCS           | 0.085                | 0.12                       | 0.01                                  | …
…             | …                    | …                          | …                                     | …

Leveraging these discrepancies judiciously can reduce cost.
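For example, plugging the S3 US West and Azure Zone 2 rows into the cost model above for an invented, request-heavy monthly workload shows how the cheapest provider depends on the workload mix:

```python
# Illustrative comparison using the table's rates; the workload numbers
# (10 GB stored, 5M GETs, 1M PUTs, 2 GB egress) are invented.
providers = {
    "S3 US West":   dict(storage=0.095, get_per_10k=0.004, put_per_1k=0.005,  egress=0.12),
    "Azure Zone 2": dict(storage=0.095, get_per_10k=0.001, put_per_1k=0.0001, egress=0.19),
}
stored_gb, gets, puts, egress_gb = 10, 5_000_000, 1_000_000, 2

for name, p in providers.items():
    cost = (stored_gb * p["storage"]
            + gets / 10_000 * p["get_per_10k"]
            + puts / 1_000 * p["put_per_1k"]
            + egress_gb * p["egress"])
    print(f"{name}: ${cost:.2f}")
# Request-heavy objects favor Azure's cheaper request pricing; transfer-heavy
# objects favor S3/GCS's cheaper egress.
```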

Range of Candidate Replication Policies
Strategy 1: a single replica in the cheapest storage cloud → high latencies.

Range of Candidate Replication Policies
Strategy 2: a few replicas to reduce latencies → high data transfer cost.

Range of Candidate Replication Policies
Strategy 3: replicated everywhere → high latencies and cost of PUTs, high storage cost.
The optimal replication policy depends on: 1. application requirements, 2. workload properties.

High Variability of Individual Objects
- Analyze the predictability of the Twitter workload: estimate each object's workload based on the same hour in the previous week.
- 60% of hours have error higher than 50%; 20% of hours have error higher than 100%; error can be as high as 1000%.

Technique 2: Aggregate Workload Prediction per Access Set
- Observation: stability in the aggregate workload (diurnal and weekly patterns).
- Classify objects by access set: the set of data centers from which an object is accessed.
- Leverage application knowledge of sharing patterns: Dropbox/Google Docs know the users that share a file; Facebook controls every user's news feed.
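A minimal sketch of the classification step, assuming the application can report which data centers access each object (the data shapes and numbers below are invented):

```python
# Group objects by access set and aggregate their predicted workloads;
# the aggregate per access set is what the Placement Manager plans against.
from collections import defaultdict
from typing import Dict, FrozenSet

# accesses[object_id] = data centers from which the object is accessed
accesses: Dict[str, FrozenSet[str]] = {
    "doc:1": frozenset({"us-west", "eu-west"}),
    "doc:2": frozenset({"us-west", "eu-west"}),
    "doc:3": frozenset({"ap-south"}),
}
# Noisy per-object predictions (e.g., GETs expected next hour)
per_object_gets: Dict[str, int] = {"doc:1": 40, "doc:2": 7, "doc:3": 300}

aggregate_gets: Dict[FrozenSet[str], int] = defaultdict(int)
for obj, access_set in accesses.items():
    aggregate_gets[access_set] += per_object_gets[obj]

for access_set, gets in aggregate_gets.items():
    print(sorted(access_set), gets)
```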

Technique 2: Aggregate Workload Prediction per Access Set
Estimating workload from the same hour in the previous week, the aggregate workload per access set is more stable and predictable than individual objects' workloads.

Optimizing Cost for GETs and PUTs
[Diagram: GETs are served from, and PUTs sent to, replicas in data centers that are cheap in terms of (request + data transfer) cost.]

Technique 3: Relay Propagation
[Diagram: a PUT must be propagated to all replicas. With asynchronous propagation (no latency constraint), updates can be relayed through cheaper intermediate data centers instead of being sent over expensive direct links; the example links are priced between $0.12 and $0.25 per GB.]

Technique 3: Relay Propagation
[Diagram: synchronous propagation is bounded by the PUT latency SLO, so the cheapest relay path may violate the SLO and cannot be used; asynchronous propagation (no latency constraint) is free to take the cheapest relay path.]
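A toy sketch of this propagation decision, assuming invented per-link prices and latencies and at most one relay hop; the synchronous case must respect the PUT SLO, while the asynchronous case need not:

```python
# Choose how to propagate a PUT from the writer to one replica: directly or
# via a single relay data center, picking the cheapest option that (for
# synchronous propagation) stays within the latency SLO.
from typing import List, Optional, Tuple

price = {("A", "B"): 0.25, ("A", "C"): 0.12, ("C", "B"): 0.08}   # $/GB (invented)
latency = {("A", "B"): 40, ("A", "C"): 60, ("C", "B"): 70}       # ms (invented)

def cheapest_path(src: str, dst: str, dcs: List[str],
                  slo_ms: Optional[float]) -> Tuple[Optional[List[str]], float]:
    best_path, best_cost = None, float("inf")
    candidates = [[src, dst]] + [[src, r, dst] for r in dcs if r not in (src, dst)]
    for path in candidates:
        hops = list(zip(path, path[1:]))
        if any(h not in price for h in hops):
            continue
        cost = sum(price[h] for h in hops)
        if slo_ms is not None and sum(latency[h] for h in hops) > slo_ms:
            continue   # synchronous propagation must respect the PUT SLO
        if cost < best_cost:
            best_path, best_cost = path, cost
    return best_path, best_cost

print(cheapest_path("A", "B", ["A", "B", "C"], slo_ms=None))  # async: relay via C
print(cheapest_path("A", "B", ["A", "B", "C"], slo_ms=100))   # sync: direct path
```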

Summary
- Insights to reduce cost: multi-cloud deployment; aggregate workload per access set; relay propagation.
- The Placement Manager uses an ILP to combine these insights.
- Other techniques: metadata management, a two-phase locking protocol, asymmetric quorum sets.
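The paper formulates replica placement as an ILP over the predicted workload, prices, and latencies; for a toy instance, exhaustive search over replica sets illustrates the same constraints. All data centers, prices, and latencies below are invented, and only storage cost is modeled:

```python
# Toy stand-in for the Placement Manager's placement decision: enumerate
# replica sets and keep the cheapest one that meets the GET latency SLO for
# every accessing data center and the fault-tolerance requirement.
from itertools import combinations

DCS = ["s3-us-west", "azure-west-eu", "gcs-asia"]
ACCESSING = ["s3-us-west", "azure-west-eu"]        # the objects' access set
STORAGE = {"s3-us-west": 0.095, "azure-west-eu": 0.095, "gcs-asia": 0.085}  # $/GB
LATENCY = {("s3-us-west", "s3-us-west"): 5, ("s3-us-west", "azure-west-eu"): 150,
           ("s3-us-west", "gcs-asia"): 120, ("azure-west-eu", "azure-west-eu"): 5,
           ("azure-west-eu", "s3-us-west"): 150, ("azure-west-eu", "gcs-asia"): 250}
GET_SLO_MS, FAILURES_TOLERATED, OBJECT_GB = 100, 1, 1.0

best = None
for k in range(FAILURES_TOLERATED + 1, len(DCS) + 1):
    for replicas in combinations(DCS, k):
        # Every accessing data center needs some replica within the GET SLO.
        if not all(any(LATENCY[(src, r)] <= GET_SLO_MS for r in replicas)
                   for src in ACCESSING):
            continue
        cost = sum(STORAGE[r] * OBJECT_GB for r in replicas)
        if best is None or cost < best[0]:
            best = (cost, replicas)

print(best)   # cheapest replica set meeting the SLO and fault-tolerance needs
```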

Outline
- Problem and motivation
- SPANStore overview
- Techniques for reducing cost
- Evaluation

Evaluation Scenario
- The application is deployed on EC2; SPANStore is deployed across S3, Azure, and GCS.
- Simulations to evaluate cost savings.
- Deployment to verify application requirements: Retwis and ShareJS.

Simulation Settings
- Compare SPANStore against: replicate everywhere, single replica, single-cloud deployment.
- Application requirements: sequential consistency; PUT SLO set to the minimum SLO that replicate-everywhere satisfies; GET SLO set to the minimum SLO that a single replica satisfies.

SPANStore Enables Cost Savings across Disparate Workloads
- #1: big objects, more GETs (lots of data transfers from replicas) → savings by reducing data transfer.
- #2: big objects, more PUTs (lots of data transfers to replicas) → savings by relay propagation.
- #3: small objects, more GETs (lots of GET requests) → savings from the price discrepancy of GET requests.
- #4: small objects, more PUTs (lots of PUT requests) → savings from the price discrepancy of PUT requests.

Deployment Settings
- Retwis: scaled-down Twitter workload.
  - GET: read timeline
  - PUT: make post
  - Insert: read a follower's timeline and append the post to it
- Requirements: eventual consistency; 90th-percentile PUT/GET SLO = 100ms.
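As an illustration of how Retwis's insert operation maps onto the GET/PUT interface (the key naming, encoding, and store object here are hypothetical):

```python
# Hypothetical sketch of Retwis's "insert": read a follower's timeline,
# append the post, and write the timeline back through GET/PUT.
import json

def insert_post(store, follower_id: str, post: dict) -> None:
    key = f"timeline:{follower_id}"
    raw = store.get(key)                            # GET the follower's timeline
    timeline = json.loads(raw) if raw else []
    timeline.append(post)                           # append the new post
    store.put(key, json.dumps(timeline).encode())   # PUT the updated timeline
```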

SPANStore Meets SLOs
[Plot: measured request latencies against the 90th-percentile SLO and the insert SLO; SPANStore meets both.]

Conclusions
SPANStore:
- Minimizes cost while satisfying latency, consistency, and fault-tolerance requirements.
- Uses multiple cloud providers for greater data center density and pricing discrepancies.
- Judiciously determines the replication policy based on workload properties and application needs.