Project Voldemort: Distributed Key-value Storage
Alex Feinberg

The Plan
- What is it?
  - Motivation
  - Inspiration
- Design
  - Core concepts
  - Trade-offs
- Implementation
- In production
  - Use cases and challenges
- What's next

What is it?

Distributed Key-value Storage
- The basics:
  - Simple APIs:
    - get(key)
    - put(key, value)
    - getAll(key1…keyN)
    - delete(key)
  - Distributed
    - Single namespace, transparent partitioning
    - Symmetric
    - Scalable
  - Stable storage
    - Shared-nothing disk persistence
    - Adequate performance even when the data doesn't fit entirely in RAM
- Open sourced in January 2009
  - Spread beyond LinkedIn: job listings now mention Voldemort!
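Below is a short sketch of what using the Java client looks like, based on the project's published quickstart example; the package names are as I recall them, and the store name "test" and bootstrap URL are illustrative and may differ in your setup:

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class ClientExample {
    public static void main(String[] args) {
        // Bootstrap from any node in the cluster; the client fetches
        // cluster and store metadata from it.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test");

        client.put("some-key", "some-value");              // quorum write
        Versioned<String> value = client.get("some-key");  // quorum read
        System.out.println(value.getValue());
        client.delete("some-key");
    }
}
```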

Motivation
- LinkedIn's Search, Networks and Analytics (SNA) team
  - Search
  - Recommendation engine
  - Data-intensive features
    - People You May Know
    - Who's Viewed My Profile
    - History Service
- Services and functional/vertical partitioning
- Simple queries
  - A side effect of the modular architecture
  - A necessity when federation is impossible

Inspiration: Specialized Systems
- Specialized systems within the SNA group
  - Search infrastructure
    - Real time
    - Distributed
  - Social graph
  - Data infrastructure
    - Publish/subscribe
    - Offline systems

Inspiration: Fast Key-value Storage
- Memcached
  - Scalable
  - High throughput, low latency
  - Proven to work well
- Amazon's Dynamo
  - Multiple datacenters
  - Commodity hardware
  - Eventual consistency
  - Variable SLAs
  - Feasible to implement

Design (So you want to build a distributed key/value store?)

Design
- Key-value data model
- Consistent hashing for data distribution
- Fault tolerance through replication
- Versioning
- Variable SLAs

Request Routing with Consistent Hashing
- Calculate the "master" partition for a key
- Preference list
  - The master partition plus the next adjacent partitions on the ring that belong to different nodes (N replicas in total)
- Assign each node to multiple places on the hash ring
  - Load balancing
  - Ability to migrate partitions
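A toy sketch of the idea (illustrative only, not Voldemort's actual routing code): hash each node to several ring positions, find the key's "master" position, then walk the ring collecting distinct nodes to form the preference list.

```java
import java.util.*;

// Illustrative only: a toy consistent-hash ring that maps a key to its
// "master" position and then walks the ring to collect n replicas on
// distinct nodes, mirroring the preference-list idea described above.
public class ToyRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>(); // hash -> node

    public void addNode(String node, int virtualNodes) {
        // Multiple positions per node give load balancing and easy migration.
        for (int i = 0; i < virtualNodes; i++)
            ring.put((node + "#" + i).hashCode(), node);
    }

    public List<String> preferenceList(String key, int n) {
        List<String> nodes = new ArrayList<>();
        if (ring.isEmpty()) return nodes;
        int distinct = new HashSet<>(ring.values()).size();
        Integer h = ring.ceilingKey(key.hashCode());          // master position
        if (h == null) h = ring.firstKey();                   // wrap around
        Iterator<String> it = ring.tailMap(h).values().iterator();
        while (nodes.size() < Math.min(n, distinct)) {
            if (!it.hasNext()) it = ring.values().iterator(); // wrap around
            String node = it.next();
            if (!nodes.contains(node)) nodes.add(node);       // distinct nodes only
        }
        return nodes;
    }
}
```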

Replication
- Replication
  - Fault tolerance and high availability
  - Disaster recovery
  - Multiple datacenters
- Operation transfer
  - Each node starts in the same state
  - If each node receives the same operations, all nodes will end in the same state (consistent with each other)
  - How do you send the same operations?

Consistency
- Strong consistency
  - 2PC
  - 3PC
- Eventual consistency
  - Weak eventual consistency
  - "Read-your-writes" consistency
- Other eventually consistent systems
  - DNS
  - Usenet ("writes-follow-reads" consistency)
  - See "Optimistic Replication", Saito and Shapiro [2003]
  - In other words: very common, not a new or unique concept!

Trade-offs
- CAP theorem
  - Consistency, Availability, (network) Partition tolerance
- Network partitions: "splits" in the network
- Can only guarantee two out of the three
  - Tunable knobs, not binary switches
  - Decrease one to increase the other two
- Why eventual consistency (i.e., "AP")?
  - Allows multi-datacenter operation
  - Network partitions may occur even within the same datacenter
  - Good performance for both reads and writes
  - Easier to implement

Versioning
- Timestamps
  - Clock skew
- Logical clocks
  - Establish a "happened-before" relation
  - Lamport timestamps
  - "X caused Y" implies "X happened before Y"
  - Vector clocks
    - Partial ordering
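An illustrative sketch of the partial order that vector clocks give (a toy class, not Voldemort's own VectorClock implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: vector clocks give a partial order, so two versions
// can be BEFORE, AFTER, EQUAL, or CONCURRENT (conflicting, needing
// reconciliation by the client).
public class ToyVectorClock {
    enum Occurred { BEFORE, AFTER, CONCURRENT, EQUAL }

    private final Map<String, Long> counters = new HashMap<>();

    // Record a write handled by the given node.
    public void increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    public Occurred compare(ToyVectorClock other) {
        boolean thisBigger = false, otherBigger = false;
        Map<String, Long> union = new HashMap<>(counters);
        other.counters.forEach((node, count) -> union.putIfAbsent(node, 0L));
        for (String node : union.keySet()) {
            long a = counters.getOrDefault(node, 0L);
            long b = other.counters.getOrDefault(node, 0L);
            if (a > b) thisBigger = true;
            if (b > a) otherBigger = true;
        }
        if (thisBigger && otherBigger) return Occurred.CONCURRENT;
        if (thisBigger) return Occurred.AFTER;
        if (otherBigger) return Occurred.BEFORE;
        return Occurred.EQUAL;
    }
}
```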

Quorums and SLAs
- Quorums
  - N replicas in total (the preference list)
  - Quorum reads
    - Read from the first R available replicas in the preference list
    - Return the latest version, repair the obsolete versions
    - Allow for client-side reconciliation if causality can't be determined
  - Quorum writes
    - Synchronously write to W replicas in the preference list
    - Asynchronously write to the rest
  - If the quorum for an operation isn't met, the operation is considered a failure
  - If R + W > N, we have "read-your-writes" consistency
- SLAs
  - Different applications have different requirements
  - Allow different R, W, N per application
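A small illustration of the R + W > N rule; the class name and the numbers are hypothetical, only the arithmetic matters:

```java
// Illustrative only: with N replicas, any W-replica write set and any
// R-replica read set must overlap when R + W > N, so a read sees at
// least one copy of the latest successful write.
public class QuorumOverlap {
    public static void main(String[] args) {
        int n = 3, r = 2, w = 2;                       // a common configuration
        // Worst case: the read picks the R replicas sharing as few
        // members as possible with the W replicas that took the write.
        int minOverlap = Math.max(0, r + w - n);
        System.out.printf("N=%d R=%d W=%d -> read/write sets share at least %d replica(s)%n",
                n, r, w, minOverlap);
        // minOverlap >= 1 exactly when R + W > N, which is what gives
        // "read-your-writes" behaviour.
    }
}
```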

An Observation
- Distribution model vs. the query model
  - Consistency, versioning, and quorums aren't specific to key-value storage
  - Other systems with state can be built upon the Dynamo model!
  - Think of scalability, availability and consistency requirements
  - Adjust the application to the query model

Implementation

Architecture
- Layered design
- One interface down all the layers
- Four APIs
  - get
  - put
  - delete
  - getAll

Storage Basics
- A cluster may serve multiple stores
- Each store has a unique key space and a store definition
- Store definition
  - Serialization: method and schema
  - SLA parameters (R, W, N, preferred-reads, preferred-writes)
  - Storage engine used
  - Compression (gzip, lzf)
- Serialization
  - Can be separate for keys and values
  - Pluggable: binary JSON, Protocol Buffers, (new!) Avro
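A hedged sketch of a store definition as it might appear in the cluster's stores.xml; the element names are reproduced from memory and the values (store name, engine, R/W/N, serializers) are illustrative, so check the project documentation for the exact schema:

```xml
<stores>
  <store>
    <name>user-history</name>                     <!-- illustrative store name -->
    <persistence>bdb</persistence>                <!-- storage engine -->
    <routing>client</routing>
    <replication-factor>3</replication-factor>   <!-- N -->
    <required-reads>2</required-reads>           <!-- R -->
    <required-writes>2</required-writes>         <!-- W -->
    <key-serializer>
      <type>string</type>
    </key-serializer>
    <value-serializer>
      <type>json</type>
      <schema-info>"string"</schema-info>
    </value-serializer>
  </store>
</stores>
```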

Storage Engines
- Pluggable
- One size doesn't fit all
  - Is the load write-heavy? Read-heavy?
  - Is the amount of data per node significantly larger than the node's memory?
- BerkeleyDB JE is the most popular
  - Log-structured B+tree (great write performance)
  - Many configuration options
- A MySQL storage engine is available
  - Hasn't been extensively tested/tuned; potential for great performance

Read-Only Stores
- Data cycle at LinkedIn
  - Events gathered from multiple sources
  - Offline computation (Hadoop/MapReduce)
  - Results are used in data-intensive applications
  - How do you make the data available for real-time serving?
- Read-only storage engine
  - Heavily optimized for read-only data
  - Build the stores using MapReduce
  - Parallel-fetch the pre-built stores from HDFS
  - Transfers are throttled to protect live serving
  - Atomically swap the stores in

Read Only Store Swap Process

Store Server
- Socket server
  - Most frequently used
  - Multiple wire protocols (different versions of a native protocol, Protocol Buffers)
  - Blocking I/O, thread-pool implementation
  - Event-driven, non-blocking I/O (NIO) implementation
    - Tricky to get high performance
    - Multiple threads available to parallelize CPU tasks (e.g., to take advantage of multiple cores)
- HTTP server available
  - Performance is lower than the socket server's
  - Doesn't implement REST

Store Client
- "Thick client"
  - Performs routing and failure detection
  - Available in the Java and C++ implementations
- "Thin client"
  - Delegates routing to the server
  - Designed for easy implementation
    - E.g., if the failure detection algorithm changes in the thick clients, thin clients don't need to update theirs
  - Python and Ruby implementations
- An HTTP client is also available

Monitoring/Operations
- JMX
  - Easy to create new metrics and operations
  - Widely used standard
  - Exposed both on the server and on the (Java) client
- Metrics exposed
  - Per-store performance statistics
  - Aggregate performance statistics
  - Failure detector statistics
  - Storage engine statistics
- Operations available
  - Recovering from replicas
  - Stopping/starting services
  - Managing asynchronous operations

Failure Detection
- Based on requests rather than heartbeats
- Recently overhauled
- Pluggable, configurable layer
- Two implementations
  - Bannage-period failure detector (older option)
    - If we see a certain number of failures, ban the node for a time period
    - Once the time period has expired, assume the node is healthy and try again
  - Threshold failure detector (new!)
    - Looks at the number of successes and failures within a time interval
    - If a node responds very slowly, don't count it as a success
    - When a node is marked down, keep retrying it asynchronously; mark it available once it has been successfully reached
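An illustrative sketch of the threshold idea (a toy detector, not the actual Voldemort implementation; the ratio, latency cutoff, and window handling are simplified assumptions):

```java
// Illustrative only: track recent request outcomes for a node and mark it
// down when its success ratio over the current interval drops below a
// threshold; slow responses are recorded as failures.
public class ToyThresholdDetector {
    private final double minSuccessRatio;   // e.g. 0.95
    private final long maxLatencyMs;        // slower responses don't count as successes
    private final long intervalMs;          // length of the counting window

    private long windowStart = System.currentTimeMillis();
    private int successes = 0, failures = 0;
    private volatile boolean available = true;

    public ToyThresholdDetector(double minSuccessRatio, long maxLatencyMs, long intervalMs) {
        this.minSuccessRatio = minSuccessRatio;
        this.maxLatencyMs = maxLatencyMs;
        this.intervalMs = intervalMs;
    }

    public synchronized void recordRequest(boolean succeeded, long latencyMs) {
        maybeResetWindow();
        if (succeeded && latencyMs <= maxLatencyMs) successes++; else failures++;
        int total = successes + failures;
        available = total == 0 || (double) successes / total >= minSuccessRatio;
    }

    public boolean isAvailable() { return available; }

    private synchronized void maybeResetWindow() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= intervalMs) {
            windowStart = now;
            successes = 0;
            failures = 0;
        }
    }
}
```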

Admin Client
- Needed functionality that shouldn't be used by applications
  - Streaming data to and from a node
  - Manipulating metadata
  - Asynchronous operations
- Uses
  - Migrating partitions between nodes
  - Retrieving, deleting, and updating partitions on a node
  - Extraction, transformation, loading
  - Changing cluster membership information

Rebalancing
- Dynamic node addition and removal
- Live requests (including writes) can be served while rebalancing proceeds
- Introduced in release 0.70 (January 2010)
- Procedure:
  - Initially, new nodes have no partitions assigned to them
  - Create a new cluster configuration and invoke the command-line tool

Rebalancing
- Algorithm
  - A node (the "stealer") receives a command to rebalance to a specified cluster layout
  - Cluster metadata is updated
  - The stealer fetches the partitions from the "donor" node
  - If data has not yet been migrated, requests are proxied to the donor
  - If a rebalancing task fails, the cluster metadata is reverted
  - If any nodes did not receive the updated metadata, they may synchronize it via the gossip protocol
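An illustrative sketch of the proxying step (hypothetical classes, not the real rebalancing code): while a partition has not yet been migrated, the stealer falls back to the donor on a miss.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: during rebalancing the "stealer" owns a partition in
// the new layout but may not have copied its data yet, so reads for
// not-yet-migrated partitions are proxied to the "donor" node.
public class ToyStealerStore {
    public interface RemoteStore { byte[] get(String key); }  // hypothetical donor handle

    private final Map<String, byte[]> localData = new ConcurrentHashMap<>();
    private final Set<Integer> migratedPartitions;            // partitions fully copied so far
    private final RemoteStore donor;

    public ToyStealerStore(Set<Integer> migratedPartitions, RemoteStore donor) {
        this.migratedPartitions = migratedPartitions;
        this.donor = donor;
    }

    public byte[] get(String key, int partitionId) {
        byte[] value = localData.get(key);
        if (value != null || migratedPartitions.contains(partitionId)) {
            return value;          // partition already migrated: local answer is authoritative
        }
        return donor.get(key);     // not migrated yet: proxy the request to the donor
    }
}
```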

(Experimental) Views
- Inspired by CouchDB
- Move computation close to the data (to the server)
- Example:
  - We're storing a list as a value and want to append a new element
  - Regular way:
    - Retrieve, deserialize, mutate, serialize, store
  - Problem: unnecessary transfers
  - With views:
    - The client sends only the element it wishes to append
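A hedged sketch contrasting the two approaches; both interfaces below are hypothetical stand-ins, not the actual Voldemort view API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: client-side read-modify-write versus a server-side
// "view" that accepts just the delta. Both interfaces are hypothetical.
public class ViewsExample {
    interface PlainStore {
        List<String> get(String key);
        void put(String key, List<String> value);
    }

    interface AppendView {
        // The server applies the append next to the data; only the new
        // element crosses the network.
        void append(String key, String element);
    }

    // Regular way: the whole list travels over the wire twice.
    static void appendWithoutView(PlainStore store, String key, String element) {
        List<String> list = new ArrayList<>(store.get(key)); // retrieve + deserialize
        list.add(element);                                   // mutate
        store.put(key, list);                                // serialize + store
    }

    // With a view: only the element is sent.
    static void appendWithView(AppendView view, String key, String element) {
        view.append(key, element);
    }
}
```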

Client/Server Performance
- Single-node max throughput (1 client / 1 server)
  - 19,384 reads/second
  - 16,556 writes/second
  - (Mostly in-memory dataset)
- Larger-value performance test
  - 6 nodes, ~50,000,000 keys, 8,192-byte values
  - Production-like key request distribution
  - Two clients
  - ~6,000 queries/second per client
- In production ("data platform" cluster)
  - 7,000 client operations/second
  - 14,000 server operations/second
  - Peak Monday-morning load, on six servers

Open Source!
- Open sourced in January 2009
- Enthusiastic community
  - Mailing list
- Roughly equal amounts contributed from inside and outside LinkedIn
- Available on GitHub

Testing and Release Cycle
- Regular release cycle established
  - So far monthly, around the 15th of the month
- Extensive unit testing
- Continuous integration through Hudson
  - Snapshot builds available
- Automated testing of complex features on EC2
  - Distributed systems require tests that exercise the entire cluster
  - EC2 allows nodes to be provisioned, deployed, and started programmatically
  - Easy to simulate failures programmatically: shutting down and rebooting instances

In Production

- At LinkedIn: multiple clusters, multiple teams
  - 32 GB of RAM, 8 cores (very low CPU usage)
- SNA team
  - Read/write cluster (12 nodes, to be expanded soon)
  - Read-only cluster
  - Recommendation engine cluster
- Other clusters
- Some uses
  - Data-driven features: People You May Know, Who's Viewed My Profile
  - Recommendation engine
  - Rate limiting, crawler detection
  - News processing
  - system
  - UI settings
  - Some communications features
  - More coming

Challenges of Production Use
- Putting a custom storage system into production
  - Different from a stateless service
  - Backup and restore
  - Monitoring
  - Capacity planning
- Performance tuning
  - Performance is deceptively high when the data fits in RAM
  - Need realistic tests: production-like data and load
- Operational advantages
  - No single point of failure
  - Predictable query performance

Case Study: KaChing
- Personal investment start-up
- Using Voldemort for six months
- Stock market data, user history, analytics
- Six-node cluster
- Challenges: high traffic volume, large data sets on low-end hardware
- Experiments with SSDs: see "Voldemort in the Wild"

Case Study: eHarmony
- Online matchmaking
- Using Voldemort since April 2009
- Data keyed off a unique ID; doesn't require ACID
- Three production clusters: ten, seven, and three nodes
- Challenges: identifying SLA outliers

Case Study: Gilt Groupe
- Premium shopping site
- Using Voldemort since August 2009
- Load spikes during sales events
  - Has to remain up and responsive during the load spikes
  - Has to remain transactionally healthy even if machines die
- Uses:
  - Shopping cart
  - Two separate stores for order processing
- Three clusters, four nodes each; more coming
- "Last Thursday we lost a server and no one noticed"

Nokia
- Contributing to Voldemort
- Plans involve 10+ TB of data (not counting replication)
  - Many nodes
  - MySQL storage engine
- Evaluated other options
  - Found Voldemort the best fit for their environment and performance profile

Gilt: Load Spikes

What’s Next

The Roadmap
- Performance investigation
- Multiple-datacenter support
- Additional consistency mechanisms
  - Merkle trees
  - Finishing hinted handoff
- Publish/subscribe mechanism
- NIO client
- Storage engine work?

Shameless Plug
- All contributions are welcome
  - Not just code:
    - Documentation
    - Bug reports
- We're hiring!
  - Open source projects
    - More than just Voldemort:
      - Search: real-time search, elastic search, faceted search
      - Cluster management (Norbert)
      - More…
  - Positions and technologies
    - Search relevance, machine learning, and data products
    - Distributed systems
      - Distributed social graph
      - Data infrastructure (Voldemort, Hadoop, pub/sub)
    - Hadoop, Lucene, ZooKeeper, Netty, Scala, and more…
- Q&A