1
pg_shard: Shard and Scale Out PostgreSQL
Jason Petersen, Sumedh Pathak, Ozgun Erdogan. Ozgun: one of the founders at Citus Data. Prior to Citus, I was a software developer in DSE. Today, I'm going to talk about pg_shard: a sharding and scaling extension for PostgreSQL. I have about 35 slides. This is a technical talk; if you have questions, please feel free to interrupt. Before we start, I have one slide to put things into context. (Speak slowly.)
2
What is CitusDB?
CitusDB is a scalable analytics database that extends PostgreSQL.
pg_shard targets the short read/write use case.
Citus shards and replicates your data in the same way pg_shard does.
Citus parallelizes your queries for analytics.
Citus isn't a fork of Postgres; rather, it hooks into the planner and executor for distributed query execution.
Prior to this talk, how many of you had heard of CitusDB? How many have heard of pg_shard? Just to clarify, CitusDB and pg_shard are two separate products that complement each other. pg_shard targets the real-time reads and writes use case; in other words, it targets the NoSQL use case. CitusDB is more applicable when you have big data sets and want to analyze that data in real-time. You can think of CitusDB as your massively parallel processing database. (First sub-bullet point.) That's why the two products are compatible.
3
Talk Outline
Why use pg_shard?
Data: Scaling and failure handling
Computation: Query routing and failure handling
Random PostgreSQL coolness
I'm going to start this talk by motivating pg_shard and the use cases where it's applicable. Data: how pg_shard lays out the data in the cluster, and how the cluster dynamically scales and replicates data. Next, the execution logic: what happens when you send a query to the pg_shard cluster? How do we route the query? What happens when there are failures? Last, I'll conclude with a few slides on cool PostgreSQL features and extensions. That's the talk outline; let's start with the motivation.
4
#1 Requested Feature from Citus
Real-time analytics calls for real-time data ingest. We had one customer who built real-time inserts on their own; then two other customers did the same thing. We also talked to PostgreSQL users. Some considered application-level sharding, or migrating to NoSQL solutions. We initially started out by saying Citus only supports batch data loads (no real-time ingest path). One customer said: "I'd like to insert data in real-time. Isn't this PostgreSQL? Can't I just write my own library to do this?" By the third customer, we got the hint. We then took a step back to talk to existing PostgreSQL users; you could say we did a survey. We saw two themes. One was doing sharding at the application level. That's a lot of effort for the end user: you need to understand distributed systems and think through what happens when you have failures.
5
Customer Interviews
Dynamically scale a cluster as new machines are added or old ones are retired.
Magically handle failures.
Simple to set up and use; works natively on PostgreSQL.
Then we asked: if you had the ideal PostgreSQL scaling solution, what are your top 2-3 wishes? Of course, the top 10 asks make up a long list; we were curious what mattered most. "I don't want to think about how to balance my cluster when I add new machines." "I don't want to set up multiple components and configure them. If I'm using PostgreSQL 9.3, it should just work on Postgres 9.3. If I want JSONB out of 9.4, it should just work with that."
6
Architectural Decisions
Dynamic Scaling: Use logical shards to facilitate easy rebalancing as cluster membership changes.
Simple to Use: Works natively with PostgreSQL by augmenting its planner and executor logic for real-time data ingest and querying.
We took all of that in and tied it to our architectural decisions. The first decision, dynamic scaling, is what I'm going to talk about next. There, we used the concept of logical shards to make scaling out and failure handling easy. For the simplicity decision, we leveraged PostgreSQL's extension APIs. If you look at PostgreSQL's planner and executor, they are built to read data from disk or memory; in other words, they operate by pulling data. If you're building a distributed database, you want to push your computations to where the data is. So your query planner and executor need to be fundamentally different, yet still fully cooperate with PostgreSQL's logic. We'll cover this part after logical sharding.
7
Traditional Partitioning
- Let’s start by looking at scaling out a cluster. First, I’m going to talk about how partitioning has been done in the past.
8
[Diagram: Node #1, Node #2, and Node #3 (each PostgreSQL), each holding a 4 TB click_events_2012 partition.]
You have three nodes, and you partition your data set across them: one third goes to node #1, and so on. The partitioning dimension here is time, but it could really be anything. The idea is, say you have 12 TB of data; you have a table that holds 4 TB on node #1. Any ideas how this could introduce problems when you want to scale out? Any guesses?
9
[Diagram: a fourth node, Node #4, joins the cluster; Nodes #1-#3 each still hold a 4 TB click_events_2012 partition.]
Let's say you add a new machine into the cluster.
10
[Diagram: rebalancing after adding Node #4; 1 TB moves from each of Nodes #1-#3 to Node #4.]
Now that you've added a new machine, you need to rebalance your cluster, which means transferring large data sets. You're transferring 1 TB over a Gigabit network, and that transfer alone is going to take hours. You need to coordinate the transfers from nodes #1, #2, and so on, and you may have failures along the way. Now that we've seen the scaling issues, let's also take a look at how failures are handled in "traditional partitioning."
11
[Diagram: six nodes, where Nodes #4-#6 are exact replicas of Nodes #1-#3; each pair holds a click_events_2012 partition.]
We now introduce replication into the picture, using exact replicas in this setup: node #4 is an exact replica of node #1, and so on. Let's see what happens when there's a failure.
12
[Diagram: Node #1 fails; its exact replica, Node #4, takes over all of its load.]
When you have a temporary failure in node #1, node #4 will take on twice the load. In this case, your cluster's latency and throughput are bottlenecked on node #4. In a cluster of 6 nodes, this isn't a big deal, but imagine a use case with 100 machines. Even when you lose a single machine, you're bottlenecked on the failed machine's replica; you don't get much out of having 100 machines in the cluster. Also, if node #1 doesn't come back up, you need to re-replicate 4 TB of data from node #4. Any ideas on how to resolve this issue?
13
Logical Sharding - Here comes logical sharding
14
[Diagram: logical shards of 512 MB each. Node #1 holds shards 1, 3, 4, 6, 7, 9, …; Node #2 holds 1, 2, 4, 5, 7, 8, …; Node #3 holds 2, 3, 5, 6, 8, 9, …; Node #4 is newly added and empty.]
The Hadoop Distributed File System (HDFS) was the first solution to introduce this at the system level. In this diagram, we have the dynamic scaling case: we introduce a new node, and the shard rebalancer moves shard #4 from node #1, shard #5 from node #2, and so on (restarting operations on failures). The rebalancing operations we want to do become much more flexible.
15
[Diagram: six worker nodes with shards replicated round-robin, e.g. Node #1 holds shards 1, 6, 7, …; Node #4 holds 3, 4, 8, …; Node #5 holds 4, 5, 9, …; Node #6 holds 5, 6, 9, ….]
The second benefit comes with the replication case. pg_shard replicates your shards using a round-robin policy, so no two nodes are exact replicas of each other.
16
[Diagram: the same round-robin layout, with Node #1 failed; its shards' load spreads across the surviving nodes.]
Shard 1's load goes to node #2, shard 6's load goes to node #6, and so on. With a temporary failure, you get even load distribution; with a permanent failure, re-replication becomes much easier. Again, imagine that we had 100 machines in the cluster. Quick question: how many of you knew about these problems with traditional partitioning before?
17
Example Data Distribution in a pg_shard Cluster
[Diagram: a metadata server (PostgreSQL + pg_shard) holds shard and shard placement metadata; Worker Node #1 holds shards 1, 3, 4, 6, 7, 9, …; Worker Node #2 holds 1, 2, 4, 5, 7, 8, …; Worker Node #3 holds 2, 3, 5, 6, 8, 9, ….]
Here's how data and metadata are represented in an example pg_shard cluster. The worker nodes are plain PostgreSQL databases; there's nothing special about them. Then we have the metadata server: that's where you create your distributed table. The metadata server can also be called the coordinator or master node. It holds authoritative state on the metadata.
18
Metadata Server (Master Node)
The metadata server holds authoritative state on shards and their placements in the cluster.
This state is minimal: 1 row per distributed table, 1 row per shard, and 1 row per shard placement, kept in 3 Postgres tables.
To handle metadata node failures, you can create a streaming replica, or reconstruct the metadata from the workers on the fly.
The metadata server is the authority, and this metadata is tiny. The state changes only when you (a) create new shards, (b) rebalance the shards, or (c) fail in writing to shards. The challenge is to keep this metadata consistent in the face of failures. An illustrative look at these tables follows.
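As a minimal sketch of inspecting that state: the shard table name appears on the next slide, while the partition and shard_placement table names here are assumptions inferred from the "3 Postgres tables" above.

SELECT * FROM pgs_distribution_metadata.partition;        -- 1 row per distributed table (assumed name)
SELECT * FROM pgs_distribution_metadata.shard;            -- 1 row per shard
SELECT * FROM pgs_distribution_metadata.shard_placement;  -- 1 row per shard placement (assumed name)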
19
Handling Master Failure
1. Use streaming replication, and fail over to a secondary node.
2. On the cloud, use EBS volumes (the metadata size is small).
3. Restore metadata from pg_dump, etc.
4. Reconstruct metadata from the worker nodes.
The immediate question is what happens when you have a metadata server failure. I've ordered the ways you can handle master node failures from the most effort and forward planning to the least. In one case, we had a customer who hadn't yet done any of these and wrote a script to reconstruct the metadata. We don't recommend option #4, but it still works.
20
Metadata and Hash Partitioning

postgres=# SELECT * FROM pgs_distribution_metadata.shard;
 id | relation_id | storage | min_value | max_value
----+-------------+---------+-----------+-----------
  … |           … | t       |         … |         …
  … |           … | t       |         … |         …
 ...

To make this metadata concrete, here's a psql example of how the metadata is laid out. We have shards represented by ids, and each shard corresponds to a hash token range. For example, say a query comes into the system for a table partitioned on customer_id. pg_shard hashes the customer_id value using Postgres' hash functions and gets a hash value. It then looks up the hash token range containing that value in this metadata. A hand-written version of that lookup follows.
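Written by hand, that lookup would look roughly like this; a sketch of what pg_shard does internally during planning, assuming relation_id stores the table's OID.

-- find the shard whose hash token range covers this partition key value;
-- hashtext() is the built-in hash function used here for text columns
SELECT id
FROM pgs_distribution_metadata.shard
WHERE relation_id = 'customer_reviews'::regclass
  AND hashtext('HN892') BETWEEN min_value AND max_value;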
21
Worker Nodes
1 shard placement = 1 PostgreSQL table
Names of tables, indexes, and constraints are extended with their shard identifier; e.g. click_events_1001 holds data for shard 1001.
Index and constraint definitions are propagated when the shards are first created.
Worker nodes are regular PostgreSQL instances. If you log into a worker node and do a \d, you'll see multiple tables; pg_shard automatically extends the table names behind the covers. One thing to keep in mind is that you create your index and constraint definitions before distributing your table, as in the sketch below. That wraps up the part on how we lay out the data in pg_shard.
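A minimal sketch of that ordering, with illustrative table and index names; the DDL is defined before distribution so that it propagates to every shard.

-- define schema and index first; distribution replays this DDL on the workers,
-- creating e.g. click_events_1001 with a matching shard-extended index
CREATE TABLE click_events (event_id BIGINT NOT NULL, page TEXT, occurred_at TIMESTAMPTZ);
CREATE INDEX click_events_occurred_at_idx ON click_events (occurred_at);
SELECT master_create_distributed_table('click_events', 'event_id');
SELECT master_create_worker_shards('click_events', 16, 2);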
22
Worker Node Failure
User-defined function to reconstruct the shard:
Replay DDL commands for the table and constraints
Copy data from a good worker node
Update metadata on the master node
Concurrent modifications during rebuild: allow them, or lock the shard completely during the rebuild?
Alternative: manually set up streaming replicas.
What happens to a worker node's data when it fails? We follow the simplest approach: you call a user-defined function to repair the shards. (This was checked in yesterday!) There is a question of what happens when you're repairing your shards and concurrent modifications are coming into the shard. We have three alternative approaches there, and this is an area where we're looking for your feedback; if you have a particular use case in mind, please talk to us after the talk. A sketch of the repair call follows. With that, we're now switching gears to usability and query handling logic.
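An invocation of that repair function might look roughly like this. This is a sketch only: the function name and signature follow what later shipped as master_copy_shard_placement in Citus, the hostnames are placeholders, and the exact UDF in the pg_shard release may differ.

-- rebuild the failed placement of shard 6 on worker node #3
-- by copying from the healthy placement on worker node #1 (hypothetical hosts)
SELECT master_copy_shard_placement(6, 'worker-1.example.com', 5432, 'worker-3.example.com', 5432);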
23
What about the logic?
Drop-in PostgreSQL extension
Supports a subset of SQL
Doesn't use any special functions
It's still just PostgreSQL
So let's talk about the logic. The way we think about pg_shard is that it strikes a balance between usability and SQL functionality coverage. After talking to PostgreSQL users, we found that they just wanted a simple component to get them rolling, and pg_shard is a drop-in PostgreSQL extension: you say CREATE EXTENSION, and it's there. You then start running SQL commands against your distributed tables. In that sense, you don't have to use special functions to query your data. A lot of KV stores have the notion of distribution built into them at the API level, whereas SQL doesn't have the concept of distribution. That's the balance we want to strike: supporting a subset of SQL without any changes to your application.
24
Users: Making Scaling SQL Easy

CREATE EXTENSION pg_shard;

-- create a regular PostgreSQL table:
CREATE TABLE customer_reviews (customer_id TEXT NOT NULL, review_date DATE, ...);

-- distribute the table on the given partition key:
SELECT master_create_distributed_table('customer_reviews', 'customer_id');

-- create 16 logical shards with 2 placements on workers:
SELECT master_create_worker_shards('customer_reviews', 16, 2);

Here's a simple example that sets up pg_shard and a distributed table. First, create a normal table; there's nothing special about it, though it will soon become a shell table on the master. Then invoke master_create_distributed_table, which designates the table as distributed and specifies customer_id as the partition column. Finally, call master_create_worker_shards to create the shards and shard replicas. This function connects to the remote worker nodes, replays the table schema, index, and constraint DDL, and creates the shards on the worker nodes. Its two final arguments are the number of shards and the replication factor; with a replication factor of 2, the table stays available even if one replica fails.
25
PostgreSQL Secret Sauce: Hooks
Full control over command lifetime
Specific to needs: Planning, Execution (Start, Run, Finish, End), Utility
I don't know how many of you are familiar with the internals of PostgreSQL, or the internal C APIs it provides. One of the wonderful things it gives us is hooks; there are about a dozen. You can take over any part of the command lifetime, at a very granular level. Hooks can also be chained, so multiple extensions can layer on top of one another. They are very flexible: there is no strict contract about what you have to do. In summary, the hook APIs give us a first-class way of plugging into the PostgreSQL system.
26
Planning Phase
Determine if the table is distributed; fall through to PostgreSQL if not
Using the partition key, find the involved shards
Deparse shard-specific SQL
Here's an overview of our planning phase. You can use regular and distributed tables in the same database. Taking a step back: by the time a query for a distributed table reaches us, it has already been parsed. We find the partition key, apply partition pruning, and find the shards involved in the query. We then take the parsed query or queries and use the same deparse logic as PostgreSQL proper to generate SQL statements back from them.
27
Planning Example

INSERT INTO customer_reviews (customer_id, rating) VALUES ('HN892', 5);

Determine partition key clauses: customer_id = 'HN892'
Find shards from metadata tables: hashtext('HN892') BETWEEN min_value AND max_value
Produce shard-specific SQL:

INSERT INTO customer_reviews_16 (customer_id, rating) VALUES ('HN892', 5);

Here's a detailed example of planning in pg_shard. An INSERT comes in, and PostgreSQL parses the query. We separate out the clauses on the hash partition column and find the shard whose range satisfies their hash values; this uses the same constraint exclusion mechanism as, e.g., PostgreSQL's partitioning implementation. When done, we have a shard-specific command to forward to a worker. This matters because each worker node has many shard placements, so we need to identify which one to write to. Any questions about how the planning works?
28
Execute Distributed Modify
Locks enforce safe commutation
Replicas visited in predictable order
Per-session connection pool uses libpq
If a replica errors out, mark it as inactive
And then there is the actual execution. You have to be a bit careful here, because this is a distributed system with parallel requests going on, so we have to think about safe commutation for the set of operations being performed. For example, INSERTs commute with SELECTs, and you can use that property if you're willing to relax isolation. We do have full consistency on writes, which means if you write something, you're going to see it when you read it back. There is none of the potential to read from a stale shard and get back stale results, as you do in document stores. Replicas need to be visited in order for the constraints to work properly. If any of the replica writes succeed during the modification, we consider the query successful; if a replica write fails, its metadata is marked as inactive.
29
Single-shard INSERT: replication factor 2

INSERT INTO customer_reviews ...

[Diagram: the master routes the INSERT to the two placements of shard 6, on Worker Node #1 (shards 1, 3, 4, 6, 7, 9, …) and Worker Node #3 (shards 2, 3, 5, 6, 8, 9, …); Worker Node #2 holds shards 1, 2, 4, 5, 7, 8, ….]
A visual representation of our cluster. We determined that shard 6 is the target shard for this query; in this case the replication factor is 2, so we touch two nodes. Connections are possibly already open (they're session-bound). We send the query to all replicas.
30
Single-shard INSERT: one replica fails

INSERT INTO customer_reviews ...

[Diagram: the same INSERT, but the placement of shard 6 on Worker Node #3 fails while the one on Worker Node #1 succeeds.]
Here worker node #3 has failed, but node #1 succeeded, so the query is a success.
31
Single-shard INSERT: master marks inactive
[Diagram: the master sets the placement of shard 6 on node 3 to inactive status.]
The master marks shard 6 on node 3 as needing repair; the master node has to do a bit of bookkeeping. It then returns success to the client. Node 3 must be repaired to restore the replication factor for shard 6, via the user-defined function (currently manual intervention) that copies the data from node #1.
32
Modification Semantics
Consistent (read your own writes)
Safety comes from commutativity rules
SELECTs and INSERTs can be reordered
UPDATEs and DELETEs cannot
Constraints require a predictable visit order
As stated: consistent. The lax requirements give us the ability to queue rows later, etc. SELECTs and INSERTs can commute safely; UPDATEs and DELETEs cannot. With uniqueness constraints, INSERTs to a given shard must visit the replicas in order.
33
Single-shard SELECT
Fetch the entire result from a single shard
Fail over to another replica on error
Do not modify state if a failure happens
Common key-value access pattern
This finds the shard that contains the result and fetches it. It can dynamically fail over to other replicas if the first fails. Read failures do not currently modify shard state. This is a common pattern (key-value store); maybe you have a JSON blob.
34
Single-shard SELECT: try first placement

SELECT * FROM customer_reviews
WHERE customer_id = 'HN892';

[Diagram: the master routes the query to the placement of shard 3 on Worker Node #1 (shards 1, 3, 4, 6, 7, 9, …); Worker Node #2 holds 1, 2, 4, 5, 7, 8, …; Worker Node #3 holds 2, 3, 5, 6, 8, 9, ….]
The master looks at the query's WHERE clauses and determines where to route it. Here it found that the result will be in shard 3.
35
Single-shard SELECT: encounter error

SELECT * FROM customer_reviews
WHERE customer_id = 'HN892';

[Diagram: the placement of shard 3 on Worker Node #1 fails to respond.]
Worker node #1 has an intermittent failure. The master finds another worker with shard 3.
36
Single-shard SELECT: try next placement

SELECT * FROM customer_reviews
WHERE customer_id = 'HN892';

[Diagram: the master retries against the placement of shard 3 on Worker Node #3.]
Still in the context of the original query from the client, the master fails over to worker #3, gets the result, and sends it to the client. One difference from INSERT is that a failed read does not modify worker state in the metadata. This covers the single-shard SELECT use case.
37
Multi-Shard SELECT
pg_shard: Engaged when no partition key constraint is present. Pulls data to the master and performs a final pass locally (aggregates, etc.).
CitusDB: Pushes computations to the worker nodes themselves for fully distributed query logic.
We even included a multi-shard mode for ease of use. When the master finds that a query is not restricted to a single shard, it pulls partial results from all involved shards and does the aggregation, transformations, etc. on the master. This is CitusDB's bread and butter: it creates a fully distributed plan to exploit all computational resources, shares the work among all worker nodes, and even does distributed JOINs. The two are compatible: do modifications using pg_shard and read using CitusDB. A sketch follows.
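For example, a query with no clause on the partition column fans out to every shard (an illustrative sketch using the customer_reviews table from earlier):

-- no customer_id filter, so pg_shard queries all 16 shards,
-- pulls the partial counts to the master, and combines them there
SELECT count(*) FROM customer_reviews;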
38
Limitations
Transactions cannot involve multiple shards or span multiple statements
No JOIN support (CitusDB provides this)
Cross-shard constraints unenforced
Transaction support is very limited: a transaction cannot span multiple shards or multiple statements. There is no JOIN support (as stated, this is where CitusDB comes in). Cross-shard constraints are currently unenforced and undetected. An example of the transaction limitation follows.
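Concretely, a block like this one falls outside what pg_shard supports, since it is multi-statement and its rows likely land on different shards (an illustrative sketch; the second customer_id is a made-up value):

BEGIN;
INSERT INTO customer_reviews (customer_id, rating) VALUES ('HN892', 5);
-- a second statement, likely targeting a different shard (hypothetical key):
INSERT INTO customer_reviews (customer_id, rating) VALUES ('ZQ104', 4);
COMMIT;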
39
What's Next?
pg_shard v1.1 releases this week. It includes the shard repair UDF, easier integration with CitusDB, COPY support through triggers, SELECT projection push-downs, faster INSERTs, and bug fixes.
pg_shard v1.2 will have more real-time analytics functionality.
Anything else? What's keeping you from using pg_shard? More SQL coverage? Rebalancing shards across new machines? Multi-master for stricter uptime guarantees? Range partitioning, which enables many more cases? Less hands-on recovery?
40
Summary
Sharding extension for PostgreSQL
Many small logical shards
Implemented with standard PostgreSQL hooks
Load extension… create table… distribute
JSONB + pg_shard as an alternative to NoSQL
Simple sharding for PostgreSQL. Many small logical shards make moving data around faster, which gets new machines online quicker and spreads load better. Standard SQL, standard PostgreSQL extension; easy to get started. Goes great with JSONB.
41
PostgreSQL Renaissance
Normally, this would have been the end of the talk. I just have four more slides on random cool things about PostgreSQL.
42
PostgreSQL Usage Trends
What database does your company use? Hacker News 2010 vs. Hacker News 2014
The first slide is on PostgreSQL's usage statistics. This is a survey from a website called Hacker News; how many of you are familiar with it? They run this survey every year, asking what database your company uses. On the left-hand side, you see the survey's results in 2010: there, MySQL has more usage than PostgreSQL and MongoDB combined. Fast forward the same survey four years, and you see two differences. First, PostgreSQL's popularity has tripled over time, far exceeding any other database. Second, PostgreSQL now has more usage than MySQL and MongoDB combined.
43
Why is PostgreSQL popular?
Oraclization of MySQL
Reliable and robust database
New extension framework: Is the monolithic SQL database dying? If so, long live Postgres
The natural follow-up question is: why has PostgreSQL become so popular? The first reason is the Oraclization of MySQL. Second, when you talk to users and ask them why they love Postgres, they say it's robust, it's reliable, and it does what they expect it to do; there are no surprises. Third, Postgres made architectural decisions early on that enable it to be extended as a database. As a result, with the new extension framework, you can implement your own logic for your own use case. In fact, PostgreSQL already has 200 extensions available.
44
PostgreSQL Extensions #1
HyperLogLog extension for real-time unique counts over varying time intervals
JSONB and GIN indexes for semi-structured data
Here are a few example extensions that we've seen our customers use and love. The first one implements an algorithm to calculate approximate distinct counts over large data sets. For example, if you have unique counts for day 1 and unique counts for day 2, and you want to merge them together in real-time, you want to use the HyperLogLog extension. CloudFlare and Neustar are two customers using this extension to power their real-time analytics dashboards. The second adds a JSONB data type to PostgreSQL, so you can store your relational and semi-structured data within the same table and query them together. A sketch of both follows.
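A quick sketch of both in use (illustrative; the table and column names are assumptions, and the hll functions come from the postgresql-hll extension):

CREATE EXTENSION hll;

-- roll page visits up into one mergeable HyperLogLog sketch per day
-- (page_visits is a hypothetical source table)
CREATE TABLE daily_uniques (day DATE, visitors hll);
INSERT INTO daily_uniques
  SELECT visit_date, hll_add_agg(hll_hash_text(customer_id))
  FROM page_visits
  GROUP BY visit_date;

-- merge day 1 and day 2 in real-time for a combined unique count
SELECT hll_cardinality(hll_union_agg(visitors))
FROM daily_uniques
WHERE day IN ('2015-01-01', '2015-01-02');

-- JSONB with a GIN index for semi-structured data alongside relational data
CREATE TABLE events (id BIGSERIAL PRIMARY KEY, payload JSONB);
CREATE INDEX events_payload_idx ON events USING GIN (payload);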
45
PostgreSQL Extensions #2
Real-time funnel and cohort queries
Dynamic funnels to look at people who visited A, then B, and finally C
Heatmap video of vehicles on a map
The third extension enables you to answer dynamic funnel questions in real-time. For example, we have one customer who is a billion-dollar retailer in Europe; think of them as a local Walmart. This customer has trucks that go around countries delivering goods, and they want to look at the trucks that visited location A, then B, then C, and compare their fuel consumption. For that, they're using an extension for real-time funnels. (They also want to visualize these trucks as a heatmap overlaid on a map, and I'll show a quick video of that.)
46
Contact Jason: Sumedh: Ozgun: General: groups.google.com/forum/#!forum/pg_shard-users
47
Questions