1
Project Voldemort Bhupesh Bansal & Jay Kreps
Thanks all, excited to talk to you.
2
The Plan Motivation Core Concepts Implementation In Practice Results
3
Who Are We? LinkedIn’s Data & Analytics Team Project Voldemort Hadoop system Recommendation Engine Relevance and Analysis Data intensive features People you may know Who viewed my profile User history service
4
Motivation Why write a database?
How did this come about -- a personal project that found real uses.
5
The Idea of the Relational Database
Stonebraker: the one-size-fits-all database is "an idea whose time has come and gone."
6
The Reality of a Modern Web Site
7
Why did this happen? Specialized systems are efficient (10-100x). Search: inverted index. Offline: Hadoop, Teradata, Oracle DWH. Memcached and in-memory systems (social graph). Specialized systems are scalable. New data and problems: graphs, sequences, and text. 10-100x faster. Scaling. Once you have 4 mandatory specialized systems, what is left for the RDBMS?
8
Services and Scale Break Relational DBs
No joins. Lots of denormalization. ORM is pointless. Constraints, triggers, etc. disappear. Natural operations are not captured easily. Caching => key/value model. Latency is key. No joins: across data domains due to APIs, within data domains due to performance. Natural operation: getAll(id…). Latency: if you want to call 30 services on your main pages, they had better be quick (30 * 20ms = 600ms).
9
Two Cheers For Relational Databases
The relational model is a triumph of computer science: General Concise Well understood But then again: SQL is a pain Hard to build re-usable data structures Don’t hide the memory hierarchy! Good: Filesystem API Bad: SQL, some RPCs Exposes seek/hides seek
10
Other Considerations Who is responsible for performance (engineers? DBA? site operations?) Can you do capacity planning? Can you simulate the problem early in the design phase? How do you do upgrades? Can you mock your database?
11
Some domain constraints
This is a real-time system Data set is large and persistent Cannot be all in memory Partitioning is key 80% of caching tiers are fixing problems that shouldn’t exist Need control over system availability and data durability Must replicate data on multiple machines Cost of scalability can’t be too high Must support diverse usages Explain about ratio of memory to disk
12
Inspired By Amazon Dynamo & Memcached
Amazon’s Dynamo storage system Works across data centers Eventual consistency Commodity hardware Not too hard to build Memcached Actually works Really fast Really simple Decisions: Multiple reads/writes Consistent hashing for data distribution Key-Value model Data versioning
13
Priorities Performance and scalability Actually works Community
Data consistency. Flexible & extensible. Everything else. - If you haven't solved a performance problem you haven't done anything -- MySQL is really quite good for < 2,000 requests/sec and smaller data sizes. Storage is the Achilles' heel of services; it can bring down the site. Don't want to have the wrong stuff. Very diverse use cases.
14
Why Is This Hard? Failures in a distributed system are much more complicated A can talk to B does not imply B can talk to A A can talk to B does not imply C can talk to B Getting a consistent view of the cluster is as hard as getting a consistent view of the data Nodes will fail and come back to life with stale data I/O has high request latency variance I/O on commodity disks is even worse Intermittent failures are common User must be isolated from these problems There are fundamental trade-offs between availability and consistency
15
Core Concepts Switch to Bhup
16
Core Concepts - I: ACID, CAP Theorem, Consistency Models
ACID: great for a single centralized server. CAP Theorem: Consistency (strict), Availability, Partition tolerance. Impossible to achieve all three at the same time in a distributed system; you can choose 2 out of 3. Dynamo chooses high availability and partition tolerance by relaxing strict consistency to eventual consistency. Consistency models: strict consistency (2-phase commit; Paxos, a distributed algorithm to ensure quorum for consistency) vs. eventual consistency (different nodes can have different views of a value; in a steady state the system will return the last written value, but it can offer much stronger guarantees). Strong consistency: all clients see the same view, even in the presence of updates. High availability: all clients can find some replica of the data, even in the presence of failures. Partition tolerance: the system properties hold even when the system is partitioned. High availability is the mantra for websites: better to deal with inconsistencies, because their primary need is to scale well for a smooth user experience.
17
Core Concept - II: Consistent Hashing. Why do we need hashing? Basic problem: clients need to know which data lives where. Many ways of solving it: central configuration, or hashing. Simple linear hashing works; the issue is a dynamic cluster: when you add new slots, the key-hash-to-node mapping changes for a lot of entries. Consistent hashing preserves the key-to-node mapping for most keys and changes only the minimal amount needed. How to do it: the key space is partitioned into many small partitions (a few hundred to a few thousand); each node is allocated many partitions (better load balancing and fault tolerance). The key-to-partition mapping is fixed; only the ownership of partitions can change. Replication: each partition is stored by N nodes. Node failures are transient (short term) or long term; long-term failures need faster bootstrapping.
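A minimal sketch of the partitioning idea described above, assuming a fixed partition count and a naive node assignment; this is illustrative code, not Voldemort's actual routing implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Voldemort's code): keys hash to a fixed set of
// partitions, and each partition is owned by a node. Rebalancing reassigns
// partition ownership; the key-to-partition mapping never changes.
public class PartitionedRouter {

    private final int numPartitions;            // fixed for the life of the cluster
    private final List<Integer> partitionOwner; // partition id -> node id

    public PartitionedRouter(int numPartitions, int numNodes) {
        this.numPartitions = numPartitions;
        this.partitionOwner = new ArrayList<>(numPartitions);
        for (int p = 0; p < numPartitions; p++)
            partitionOwner.add(p % numNodes);   // naive initial assignment
    }

    // Depends only on the key hash and the constant number of partitions.
    public int partitionForKey(byte[] key) {
        int hash = java.util.Arrays.hashCode(key);
        return Math.abs(hash % numPartitions);
    }

    // Replicas: the owning partition's node plus owners of the next partitions
    // on the ring, skipping duplicates, until we have N distinct nodes.
    public List<Integer> replicasForKey(byte[] key, int replicationFactor) {
        List<Integer> nodes = new ArrayList<>();
        int p = partitionForKey(key);
        for (int i = 0; i < numPartitions && nodes.size() < replicationFactor; i++) {
            int owner = partitionOwner.get((p + i) % numPartitions);
            if (!nodes.contains(owner))
                nodes.add(owner);
        }
        return nodes;
    }

    // Rebalancing changes only partition ownership, never the key mapping.
    public void reassignPartition(int partition, int newNode) {
        partitionOwner.set(partition, newNode);
    }
}
```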
18
Core Concept - III: N - the replication factor; R - the number of blocking reads; W - the number of blocking writes. If R + W > N then we have a quorum-like algorithm: it guarantees that we will read the latest writes OR fail. R, W, and N can be tuned for different use cases: W = 1 for highly available writes, R = 1 for read-intensive workloads. Knobs to tune performance, durability, and availability.
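A small sketch of the R/W/N rule stated above; this is not Voldemort's routing code, just the arithmetic of the quorum guarantee with a couple of the tunings from the slide.

```java
import java.util.List;

// With R + W > N, every successful read overlaps at least one replica that
// acknowledged the latest successful write, so the read sees it (or the
// operation fails loudly instead of returning stale data silently).
public class QuorumConfig {
    final int n; // replication factor
    final int r; // blocking reads
    final int w; // blocking writes

    QuorumConfig(int n, int r, int w) {
        this.n = n;
        this.r = r;
        this.w = w;
    }

    boolean isQuorumLike() {
        return r + w > n;
    }

    // A request either reaches the required number of replicas or fails.
    <T> List<T> requireResponses(List<T> responses, int required) {
        if (responses.size() < required)
            throw new IllegalStateException("Only " + responses.size()
                    + " replicas responded, needed " + required);
        return responses;
    }

    public static void main(String[] args) {
        System.out.println(new QuorumConfig(3, 2, 2).isQuorumLike()); // true: read-latest-writes
        System.out.println(new QuorumConfig(3, 1, 1).isQuorumLike()); // false: fast but weaker
    }
}
```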
19
Core Concepts - IV: Vector clocks [Lamport] provide a way to order events in a distributed system. A vector clock is a tuple {t1, t2, ..., tn} of counters. Each value update has a master node; when data is written with master node i, it increments ti. All the replicas will receive the same version. This helps resolve conflicts between writes on multiple replicas. If you get network partitions, you can have a case where two vector clocks are not comparable; in this case Voldemort returns both values to clients for conflict resolution. A fancy way of doing optimistic locking.
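An illustrative vector clock, not Voldemort's implementation, showing the increment-on-write and comparison rules described above; node ids and counter types are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

// Each master node increments its own counter on a write. Two clocks are
// ordered only if one dominates the other in every position; otherwise they
// are concurrent and both values go back to the client for resolution.
public class VectorClock {
    private final Map<Integer, Long> counters = new HashMap<>();

    // Called when node `nodeId` masters a write.
    public void increment(int nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    public enum Occurred { BEFORE, AFTER, CONCURRENT }

    public Occurred compare(VectorClock other) {
        boolean thisBigger = false, otherBigger = false;
        for (Integer node : union(counters, other.counters)) {
            long a = counters.getOrDefault(node, 0L);
            long b = other.counters.getOrDefault(node, 0L);
            if (a > b) thisBigger = true;
            if (b > a) otherBigger = true;
        }
        if (thisBigger && !otherBigger) return Occurred.AFTER;
        if (otherBigger && !thisBigger) return Occurred.BEFORE;
        // Diverged (or identical) clocks are treated as concurrent here; in the
        // diverged case the store hands both versions to the client.
        return Occurred.CONCURRENT;
    }

    private static java.util.Set<Integer> union(Map<Integer, Long> a, Map<Integer, Long> b) {
        java.util.Set<Integer> keys = new java.util.HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        return keys;
    }
}
```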
20
Implementation
21
Voldemort Design. Will discuss each layer in more detail. Layered design: one interface for all layers (put/get/delete); each layer decorates the next; very flexible; easy to test. Client API: very basic, just provides the raw interface to the user. Conflict resolution layer: handles all the versioning issues and provides hooks to supply a custom conflict-resolution strategy. Serialization: objects <=> data. Network layer: depending on configuration can fit either here or below; its main job is to handle the network interface, socket pooling, and other performance-related optimizations. Routing layer: handles and hides many details from the client/user: the hashing schema, failed nodes, replication, required reads/required writes. Storage engine: handles disk persistence.
22
Client API. Data is organized into "stores", i.e. tables. Key-value only, but values can be arbitrarily rich or complex: maps, lists, nested combinations… Four operations: PUT(Key k, Value v), GET(Key k), MULTI-GET(Iterator<Key> keys), DELETE(Key k) / DELETE(Key k, Version ver). No range scans. A very simple API: no iterator on the key set / entry set (very hard to fix performance); we have plans to provide such an iterator.
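A minimal sketch of what a client call looks like, assuming the Voldemort Java client classes of this era (SocketStoreClientFactory, StoreClient, Versioned); the store name, bootstrap URL, and keys are made up for illustration and the store's serializers are assumed to accept strings.

```java
import java.util.Arrays;
import java.util.Map;

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class ClientApiExample {
    public static void main(String[] args) {
        // Bootstrap from any node; the client learns the cluster metadata from it.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test-store");

        client.put("member:42", "{\"firstName\":\"Jay\"}");        // PUT
        Versioned<String> value = client.get("member:42");         // GET (value + vector clock)
        Map<String, Versioned<String>> many =
                client.getAll(Arrays.asList("member:42", "member:43")); // MULTI-GET
        client.delete("member:42");                                 // DELETE

        System.out.println(value.getValue() + " / " + many.size());
    }
}
```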
23
Versioning & Conflict Resolution
Eventual consistency allows multiple versions of a value. Need a way to understand which value is latest, and a way to say values are not comparable. Solutions: timestamps, or vector clocks. Provides ordering with no locking or blocking necessary. Give an example of reads and writes with vector clocks; pros and cons vs. Paxos and 2PC. The user can supply a strategy for handling cases where v1 and v2 are not comparable.
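A tiny sketch of what a user-supplied resolution strategy can look like when two versions are not comparable; this is application-level illustration, not Voldemort's actual resolver hook, and the set-of-ids value type is an assumption.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// When a read returns concurrent versions, the application decides how to
// merge them. Here the value is a set of item ids and the merge is a union,
// so no concurrent write is silently dropped.
public class UnionResolver {
    public Set<String> resolve(List<Set<String>> concurrentVersions) {
        Set<String> merged = new TreeSet<>();
        for (Set<String> version : concurrentVersions)
            merged.addAll(version);
        return merged;
    }
}
```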
24
Serialization. Really important. A few considerations: Schema free? Backward/forward compatible? Real-life data structures. Bytes <=> objects <=> strings? Size (no XML). Many ways to do it -- we allow anything: compressed JSON, Protocol Buffers, Thrift, Voldemort custom serialization. Avro is good, but new and unreleased. Storing data is really different from an RPC IDL: data is forever (some data), e.g. an inverted index or a threaded comment tree. Don't hardcode serialization. Don't just do byte[] -- checks are good, and many features depend on knowing what the data is. XML profile.
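A self-contained sketch of a pluggable serializer; Voldemort's own plugin interface is similar in spirit (bytes in, typed objects out), but the exact names and signatures here are illustrative, not the library's.

```java
import java.nio.charset.StandardCharsets;

// Pluggable serialization: the store knows what the data is rather than
// treating it as opaque byte[], which is what many features depend on.
interface Serializer<T> {
    byte[] toBytes(T object);
    T toObject(byte[] bytes);
}

// A trivial string serializer as a baseline implementation.
class StringSerializer implements Serializer<String> {
    public byte[] toBytes(String object) {
        return object.getBytes(StandardCharsets.UTF_8);
    }
    public String toObject(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```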
25
Routing. The routing layer hides a lot of complexity: the hashing schema, replication (N, R, W), failures, read repair (online repair mechanism), hinted handoff (long-term recovery mechanism). Easy to add domain-specific strategies, e.g. only do synchronous operations on nodes in the local data center. Client side / server side / hybrid.
26
Voldemort Physical Deployment
Explain about partitions. Make things fast by removing slow things, not by tuning. The HTTP client is not performant. Separate caching layer.
27
Routing With Failures. Failure detection requirements: needs to be very, very fast; the view of server state may be inconsistent (A can talk to B but C cannot; A can talk to C, and B can talk to A but not to C). Currently done by the routing layer (request timeouts), which periodically retries failed nodes. All requests must have hard SLAs. Other possible solutions: a central server, or a gossip protocol; need to look more into this. Client vs. server: client is better, but harder.
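A sketch of the timeout-based failure detection described above, not the actual routing-layer code; the ban interval and node-id bookkeeping are assumptions, and the point is that each client keeps its own (possibly inconsistent) view of which nodes are up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A node that times out is marked unavailable and retried after a ban
// interval; a successful request clears the ban.
public class FailureDetector {
    private final long banMillis;
    private final Map<Integer, Long> bannedUntil = new ConcurrentHashMap<>();

    public FailureDetector(long banMillis) {
        this.banMillis = banMillis;
    }

    public boolean isAvailable(int nodeId) {
        Long until = bannedUntil.get(nodeId);
        return until == null || System.currentTimeMillis() >= until; // retry once the ban expires
    }

    public void recordTimeout(int nodeId) {
        bannedUntil.put(nodeId, System.currentTimeMillis() + banMillis);
    }

    public void recordSuccess(int nodeId) {
        bannedUntil.remove(nodeId);
    }
}
```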
28
Repair Mechanism. Read repair: an online repair mechanism. The routing client receives values from multiple nodes and notifies a node if it sees an old value. Only works for keys which are read after failures. Hinted handoff: if a write fails, write it to any random node and mark it as a special write; each node periodically tries to get rid of all special entries. Bootstrapping mechanism (we don't have it yet): if a node was down for a long time, hinted handoff can generate a ton of traffic; we need a better way to bootstrap and clear the hinted-handoff tables.
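A conceptual sketch of read repair as described above, not the production code; it simplifies versions to a single comparable stamp (real repair compares vector clocks), and the Replica interface is an assumption for illustration.

```java
import java.util.Map;

// After a read fans out to several replicas, any replica that returned an
// older version than the freshest one gets the newer value pushed back to it.
public class ReadRepair {

    interface Replica {
        void put(String key, String value, long version);
    }

    public void repair(String key,
                       Map<Replica, Long> observedVersions, // replica -> version it returned
                       Replica freshest, String freshValue, long freshVersion) {
        for (Map.Entry<Replica, Long> e : observedVersions.entrySet()) {
            if (e.getKey() != freshest && e.getValue() < freshVersion) {
                // Notify the stale node of the newer value it missed.
                e.getKey().put(key, freshValue, freshVersion);
            }
        }
    }
}
```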
29
Network Layer. The network is the major bottleneck in many uses. Client performance turns out to be harder than server (the client must wait!). Lots of issues with socket buffer size and socket pooling. The server is also a client. Two implementations: HTTP + servlet container, and a simple socket protocol + custom server. The HTTP server is great, but the HTTP client is 5-10x slower. The socket protocol is what we use in production. Recently added a non-blocking version of the server.
30
Persistence. Single-machine key-value storage is a commodity. Plugins are better than tying yourself to a single strategy: different use cases optimize reads or writes, large vs. small values; SSDs may completely change this layer. A couple of different options: BDB, MySQL, and mmap'd file implementations; Berkeley DB is the most popular; an in-memory plugin for testing. B-trees are still the best all-purpose structure. No flush on write is a huge, huge win. You can write an implementation very easily; we support plugins so you can run your backend without being in the main code base.
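A sketch of how small a storage-engine plugin can be; the interface below is a simplification of the real plugin API (which also carries versions), written here only to show why "you can write an implementation very easily".

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified plugin interface: just key-value bytes.
interface SimpleStorageEngine {
    byte[] get(byte[] key);
    void put(byte[] key, byte[] value);
    void delete(byte[] key);
}

// An in-memory engine of the kind used for testing.
class InMemoryStorageEngine implements SimpleStorageEngine {
    // Wrap byte[] so hashCode/equals compare content rather than identity.
    private final Map<ByteBuffer, byte[]> map = new ConcurrentHashMap<>();

    public byte[] get(byte[] key) {
        return map.get(ByteBuffer.wrap(key));
    }
    public void put(byte[] key, byte[] value) {
        map.put(ByteBuffer.wrap(key), value);
    }
    public void delete(byte[] key) {
        map.remove(ByteBuffer.wrap(key));
    }
}
```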
31
In Practice
32
LinkedIn problems we wanted to solve
Application Examples: People You May Know, item-item recommendations, typeahead selection, member and company derived data, user network statistics, Who Viewed My Profile?, abuse detection, user history service, relevance data, crawler detection. Many others have come up since. Some data is batch-computed and served as read only; some data has a very high write load. Latency is key. Not discussing these because they are too LinkedIn-specific.
33
Example Use Cases - I: User data lookup service. Returns user data for a user. Create a single Voldemort store: key = user ID, value = user settings. Treat Voldemort as a simple hash table with extra benefits: scalability, fault tolerance, persistence (data size > RAM), and control over the data/RAM ratio for better performance. There are "surprisingly" a lot of use cases which can be served by this simple model: "provide best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalog" [Dynamo paper].
34
Example Use Cases - II: User comments service. Return the latest N comments for a topic. Create two Voldemort stores. Store 1: key = topic ID, value = a tree of comment IDs, reverse-sorted by timestamp. Store 2: key = comment ID, value = comment data. Insert time: get the comment-ID tree from store 1, update the tree, and put it back in store 1; insert the new key/value pair in store 2. Query time: query store 1 for the last N comment IDs, then multi-get the comments from store 2.
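A hedged sketch of the two-store pattern above using the Voldemort client classes; the store contents, the use of a newest-first List<String> of comment IDs as the "tree", and the method names on this service class are assumptions for illustration, not the real LinkedIn service.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import voldemort.client.StoreClient;
import voldemort.versioning.Versioned;

public class CommentService {
    private final StoreClient<String, List<String>> topicStore; // topic id -> newest-first comment ids
    private final StoreClient<String, String> commentStore;     // comment id -> comment body

    public CommentService(StoreClient<String, List<String>> topicStore,
                          StoreClient<String, String> commentStore) {
        this.topicStore = topicStore;
        this.commentStore = commentStore;
    }

    public void addComment(String topicId, String commentId, String body) {
        commentStore.put(commentId, body);                  // store 2: the comment itself
        Versioned<List<String>> ids = topicStore.get(topicId);
        List<String> updated = new ArrayList<>();
        updated.add(commentId);                             // newest first
        if (ids != null)
            updated.addAll(ids.getValue());
        topicStore.put(topicId, updated);                   // store 1: read-modify-write of the id list
    }

    public List<String> latestComments(String topicId, int n) {
        Versioned<List<String>> ids = topicStore.get(topicId);
        if (ids == null)
            return new ArrayList<>();
        List<String> wanted = ids.getValue().subList(0, Math.min(n, ids.getValue().size()));
        Map<String, Versioned<String>> rows = commentStore.getAll(wanted); // multi-get from store 2
        List<String> result = new ArrayList<>();
        for (String id : wanted) {
            Versioned<String> row = rows.get(id);
            if (row != null)
                result.add(row.getValue());
        }
        return result;
    }
}
```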
35
Example Use Cases - III: Inbox search service. Search for a keyword phrase in a member's inbox data. Create a Voldemort store: key = user ID, value = the search index for that user. Insert time: get the search index for the user given the user ID, update the index with the new data, and put it back in the store. Query time: get the right index using the user ID, then use the search index to query for the keyword and return results. Try to co-locate the Voldemort and Inbox Search services to avoid network hops.
36
Examples which do not work
Financial transactions: please do not use *any* alpha system to save credit card details. Transactional requirements between 2 stores: write key1 to store A iff the write of key2 to store B succeeded. Range scans. Switch to Jay!
37
Example V: Hadoop and Voldemort sitting in a tree…
Hadoop can generate a lot of data. Bottleneck 1: getting the data out of Hadoop. Bottleneck 2: transfer to the DB. Bottleneck 3: index building. We had a critical process where this took a DBA a week to run! Index building is a batch operation. Fixing a bottleneck in one system creates bottlenecks elsewhere.
39
Read-only storage engine
Throughput vs. latency. Index building done in Hadoop. Fully parallel transfer. Very efficient on-disk structure. Heavy reliance on the OS page cache. Rollback! Transfer time: 30 minutes. Can max out a gigabit network, so be careful.
40
Voldemort At LinkedIn 4 Clusters, 4 teams Wide variety of data sizes, clients, needs My team: 12 machines Nice servers 500M operations/day ~4 billion events in 10 stores (one per event type) Peak load > 10k operations / second Other teams: news article data, related data, UI settings Bigger than core db!
41
Gilt Groupe Inc. Thanks to Geir Magnusson for sharing numbers. 3 clusters (another 3 coming in 3 weeks), 4 machines each. Replication 2 required. BDB storage engine. Values average about 5K in size. QPS: up to 2700/sec. Latency: 96% < 20 ms, 99.7% < 40 ms. Production issues seen: the load balancer started caching, causing writes to fail.
42
Results State of the project Performance What is left to do Etc.
43
Some performance numbers
Production stats: median 0.1 ms; 99.9th percentile GET: 3 ms. Single-node max throughput (1 client node, 1 server node): 19,384 reads/sec, 16,559 writes/sec. These numbers are for mostly in-memory problems; client bound.
44
Glaring Weaknesses Not nearly enough documentation
Need rigorous performance and multi-machine failure tests running NIGHTLY. No online cluster expansion (without reduced guarantees). Need more clients in other languages (Java, Python, and C++ currently). Better tools for cluster-wide control and monitoring.
45
State of the Project Active mailing list
4-5 regular committers outside LinkedIn Lots of contributors Equal contribution from in and out of LinkedIn Project basics IRC Some documentation Lots more to do > 300 unit tests that run on every checkin (and pass) Pretty clean code Moved to GitHub (by popular demand) Production usage at a half dozen companies Not just a LinkedIn project anymore But LinkedIn is really committed to it (and we are hiring to work on it) - Open source n00b - Very impressed by all the friendly smart people out there who love to hack on things
46
Index build runs 100% in Hadoop
MapReduce job outputs Voldemort Stores to HDFS Nodes all download their pre-built stores in parallel Atomic swap to make the data live Heavily optimized storage engine for read-only data I/O Throttling on the transfer to protect the live servers
47
Some new & upcoming things
Python client Non-blocking socket server Alpha round on online cluster expansion Read-only store and Hadoop integration Improved monitoring stats Compression Future Publish/Subscribe model to track changes Great performance and integration tests Improved failure detection
48
Shameless promotion. Check it out: project-voldemort.com. We love getting patches. We kind of love getting bug reports. LinkedIn is hiring, so you can work on this full time. Contact me if interested.
49
The End Questions, comments, etc