pg_shard: Shard and Scale Out PostgreSQL

Presentation transcript:

pg_shard: Shard and Scale Out PostgreSQL Jason Petersen Sumedh Pathak Ozgun Erdogan Ozgun .. I'm one of the founders at Citus Data. Prior to Citus, I was a software developer in DSE. Today, I'm going to talk about pg_shard: a sharding and scaling extension for PostgreSQL. I have about 35 slides. This is a technical talk, so if you have questions, please feel free to interrupt. Before we start, I have one slide to put things into context. Speak slowly.

What is CitusDB? CitusDB is a scalable analytics database that extends PostgreSQL. pg_shard targets the short read/write use case. Citus shards and replicates your data in the same way pg_shard does. Citus parallelizes your queries for analytics. Citus isn't a fork of Postgres. Rather, it hooks onto the planner and executor for distributed query execution. Prior to this talk, how many of you had heard of CitusDB? How many have heard of pg_shard? Just to clarify, CitusDB and pg_shard are two separate products that complement each other. pg_shard targets the real-time read and write use case; in other words, it targets the NoSQL use case. CitusDB is more applicable when you have big data sets and want to analyze that data in real-time. You can think of CitusDB as your massively parallel processing database. First sub-bullet point .. that's why the two products are compatible.

Talk Outline Why use pg_shard? Data: Scaling and failure handling Computation: Query routing and failure handling Random PostgreSQL coolness - I’m going to start this talk by motivating pg_shard. I’m going to first talk about the use-cases that are applicable Data: How pg_shard lays out the data in the cluster. How does the cluster dynamically scale and replicate data? Next, I’m going to talk about the execution logic. What happens when you send a query to the pg_shard cluster? How do we route the query? What happens when there are failures? Last, I’m going to conclude with a few slides on cool PostgreSQL features and extensions. This is the talk outline, and let’s start with the motivation.

#1 Requested Feature from Citus Real-time analytics calls for real-time data ingest. We had one customer who built real-time inserts on their own. Then two other customers did the same thing. We also talked to PostgreSQL users. Some considered application level sharding, or migrating to NoSQL solutions. We initially started out by saying Citus only supports batch data loads. (no real-time ingest path) One customer: I’d like to insert data in real-time. Isn’t this PostgreSQL? Can’t I just write my own library to do this? By the third customer, we got the hint We then took a step back to talk to existing PostgreSQL users. You could say we did a survey. We saw two themes. One was doing sharding at the application level. This is a lot of effort for the end user. You need to understand distributed systems, and think through what happens when you have failures.

Customer Interviews Dynamically scale a cluster as new machines are added or old ones are retired. Magically handle failures. Simple to set up and use. Works natively on PostgreSQL. Then we asked: if you had the ideal PostgreSQL scaling solution, what are your top 2-3 wishes? Of course, the top 10 asks make up a long list; we were curious about what matters most to you. I don't want to think about how to balance my cluster when I add in new machines. I don't want to set up multiple components and configure them. If I'm using PostgreSQL 9.3, it should just work on Postgres 9.3. If I want JSONB out of 9.4, it should just work with that.

Architectural Decisions Dynamic Scaling: Use logical shards to facilitate easy rebalancing as cluster membership changes Simple to Use: Works natively with PostgreSQL by augmenting its planner and executor logic for real-time data ingest and querying We took all of that in and tied it to the architectural decisions The first decision, dynamic scaling is what I’m going to talk about next. There, we used the concept of logical shards to make scaling out and failure handling easy. For the simplicity decision, we leveraged PostgreSQL’s extension APIs. If you look at PostgreSQL’s planner and executor, they are built to read data from disk or memory. In other words, they operate by pulling data. If you’re building a distributed database, you want to push your computations to where the data is. So, your query planner and executor need to be fundamentally different, and also fully cooperate with PostgreSQL’s logic. We’ll cover this part after logical sharding.

Traditional Partitioning - Let’s start by looking at scaling out a cluster. First, I’m going to talk about how partitioning has been done in the past.

[Diagram: three PostgreSQL nodes, each holding a 4 TB partition of click_events_2012] You have three nodes, and you partition your data set into those three nodes. 1/3 goes to node #1. The partitioning dimension here is time. But it could really be anything. The idea is, let's say you have 12 TB of data. You have a table that holds 4 TB on node #1. Any ideas how this could introduce problems when you want to scale out? Any guesses?

[Diagram: a fourth node is added alongside the three 4 TB partitions] - Let's say you add a new machine into the cluster.

[Diagram: rebalancing moves roughly 1 TB from each existing node to node #4] Now that you added a new machine, you need to rebalance your cluster. What you're going to do is transfer large data sets. You're transferring 1 TB over a Gigabit network. This transfer itself is going to take hours. You need to coordinate the transfer from node #1, #2, .. You may have failures. Now that we've seen the scaling issues, let's also take a look at how failures are handled in "traditional partitioning".

[Diagram: six nodes, with nodes #4-#6 as exact replicas of nodes #1-#3] We now introduce replication into the picture. We use exact replicas in this setup: node #4 is an exact replica of node #1, etc. Let's see what happens when there's a failure.

[Diagram: node #1 fails; its replica, node #4, takes over] When you have a temporary failure in node #1, node #4 will take on twice the load. In this case, your cluster's latency and throughput are bottlenecked on node #4. In a cluster of 6 nodes, this isn't a big deal, but imagine the use case where you had 100 machines. Even when you lose a single machine, you're bottlenecked on the failed machine's replica, so you don't get much out of having 100 machines in the cluster. Also, if node #1 doesn't come back up, you need to re-replicate 4 TB of data from node #4. Any ideas on how to resolve this issue?

Logical Sharding - Here comes logical sharding

[Diagram: four nodes holding many ~512 MB logical shards; node #1 has shards 1, 3, 4, 6, 7, 9, …; node #2 has 1, 2, 4, 5, 7, 8, …; node #3 has 2, 3, 5, 6, 8, 9, …; node #4 is newly added] Hadoop Distributed File System (HDFS) was the first solution to introduce this at the system level. In this diagram, we have the dynamic scaling case. We introduce a new node … Your shard rebalancer moves shard #4 from node #1, shard #5 from node #2, … (restart operations on failures). The rebalancing operations that we want to do become much more flexible.

[Diagram: six nodes, with logical shards replicated round-robin so that no two nodes hold identical shard sets] The second benefit comes with the replication case. pg_shard now replicates your shards using a round-robin policy, so no two nodes are exact replicas of each other.

[Diagram: same six-node cluster; when node #1 fails, its shards' load spreads across the surviving replicas] Shard #1's load goes to node #2, shard #6's load goes to node #6, .. Temporary failure: even load distribution. Permanent failure: re-replication becomes much easier. Again, imagine that we had 100 machines in the cluster. Quick question: how many of you knew about these problems with traditional partitioning before?

Example Data Distribution in pg_shard Cluster [Diagram: a metadata server (PostgreSQL + pg_shard) holding shard and shard placement metadata; worker node #1 holds shards 1, 3, 4, 6, 7, 9, …; worker node #2 holds 1, 2, 4, 5, 7, 8, …; worker node #3 holds 2, 3, 5, 6, 8, 9, …] Here's how data and metadata are represented in an example pg_shard cluster. These worker nodes are plain PostgreSQL databases; nothing special about the worker nodes. We have the metadata server, which is where you create your distributed table. The metadata server can also be called the coordinator or master node. It holds authoritative state on the metadata.

Metadata Server (Master Node) The metadata server holds authoritative state on shards and their placements in the cluster. This state is minimal: 1 row per distributed table, 1 row per shard, and 1 row per shard placement, kept in 3 Postgres tables. To handle metadata node failures, you can create a streaming replica, or reconstruct metadata from workers on the fly. The metadata server is the authority, and this metadata is tiny. This state changes only when you (a) create new shards, (b) rebalance the shards, or (c) fail in writing to shards. The challenge is to keep this metadata consistent in the face of failures.
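
For reference, these metadata tables can be queried directly. A minimal sketch, assuming pg_shard's pgs_distribution_metadata schema (the shard table is shown on a later slide; the partition and shard_placement table names are assumptions here and may differ by version):

SELECT * FROM pgs_distribution_metadata.partition;        -- 1 row per distributed table
SELECT * FROM pgs_distribution_metadata.shard;            -- 1 row per shard, with its hash range
SELECT * FROM pgs_distribution_metadata.shard_placement;  -- 1 row per shard placement on a worker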

Handling Master Failure Use streaming replication, and fail over to a secondary node. On the cloud, use EBS volumes (the metadata size is small). Restore metadata from pg_dump, etc. Reconstruct metadata from worker nodes. The immediate question is what happens when you have a metadata server failure. I ordered these options from the "most amount of effort and future thinking" to the "least amount of effort". In one case, we had a customer who hadn't yet done any of these and wrote a script to reconstruct the metadata. We don't recommend option #4, but it still works.

Metadata and Hash Partitioning

postgres=# SELECT * FROM pgs_distribution_metadata.shard;
  id   | relation_id | storage |  min_value  |  max_value
-------+-------------+---------+-------------+-------------
 10004 |      177880 | t       | -2147483648 | -1879048194
 10005 |      177880 | t       | -1879048193 | -1610612739
 10006 |      177880 | t       | -1610612738 | -1342177284
 10007 |      177880 | t       | -1342177283 | -1073741829
 10008 |      177880 | t       | -1073741828 |  -805306374
 10009 |      177880 | t       |  -805306373 |  -536870919
   ... |         ... | ...     |         ... |         ...

To make this metadata example concrete, here's a psql example. This is how the metadata is laid out: shards are represented with ids, and each shard corresponds to a hash token range. For example, a query comes into the system, say for a table partitioned on customer_id. pg_shard hashes the customer_id using Postgres' hash functions and gets a hash value. It then looks up this metadata to find the hash token range that corresponds to the value.
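
As a sketch of that lookup, you can run the same range check by hand against the metadata (hashtext is the built-in PostgreSQL hash function; the casts are shown in case min_value and max_value are stored as text):

SELECT id, min_value, max_value
FROM pgs_distribution_metadata.shard
WHERE hashtext('HN892') BETWEEN min_value::int AND max_value::int;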

Worker Nodes 1 shard placement = 1 PostgreSQL table. Names of tables, indexes, and constraints are extended with their shard identifier, e.g. click_events_1001 holds data for shard 1001. Index and constraint definitions are propagated when shards are first created. Worker nodes are regular PostgreSQL instances: if you log into a worker node and do a \d, you'll see multiple tables. pg_shard automatically extends table names behind the covers. One thing to keep in mind is that you create your index and constraint definitions before distributing your table. So that wraps up the part on how we lay out the data in pg_shard.
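
A quick way to see those shard placements on a worker (a sketch; the shard identifiers are hypothetical):

-- run on a worker node: each placement is just a regular table
SELECT relname FROM pg_class WHERE relname LIKE 'click_events_%' ORDER BY relname;
-- e.g. click_events_10004, click_events_10005, ...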

Worker Node Failure User-defined function to reconstruct a shard: replay DDL commands for the table and constraints, copy data from a good worker node, update metadata on the master node. Concurrent modifications during rebuild: allow them, or lock the shard completely during rebuild? Alternative: manually set up streaming replicas. What happens to a worker node's data when it fails? We follow the simplest approach: you call a user-defined function to repair the shards (this was checked in yesterday!). There is a question of what happens when you're repairing your shards and you have concurrent modifications coming into the shard. We have three alternative approaches there, and this is an area where we're looking for your feedback. If you have a particular use case in mind, please talk to us after the talk. With that, we're now switching gears to usability and query handling logic.
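
A sketch of the repair call (the function name and signature shown here, master_copy_shard_placement, are assumptions and may differ from what ships with pg_shard):

-- copy shard 10006 from its healthy placement on worker-1 to the failed placement on worker-3
SELECT master_copy_shard_placement(10006, 'worker-1', 5432, 'worker-3', 5432);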

What about the logic? Drop-in PostgreSQL extension. Supports a subset of SQL. Doesn't use any special functions. It's still just PostgreSQL. So let's talk about the logic. The way we're thinking about pg_shard is that it strikes a balance between usability and SQL functionality coverage. After talking to PostgreSQL users, we found that they just wanted a simple component to get them rolling, and pg_shard is a drop-in PostgreSQL extension: you say CREATE EXTENSION, and it's there. You then start running SQL commands against your distributed tables. In that sense, you don't have to use special functions to query your data. A lot of KV stores have the notion of distribution built into them at the API level, whereas SQL doesn't have the concept of distribution. The balance we want to strike here is to support a subset of SQL without any changes to your application.

Users: Making Scaling SQL Easy

CREATE EXTENSION pg_shard;
-- create a regular PostgreSQL table:
CREATE TABLE customer_reviews (customer_id TEXT NOT NULL, review_date DATE, ...);
-- distribute the table on the given partition key:
SELECT master_create_distributed_table('customer_reviews', 'customer_id');
-- create 16 logical shards with 2 placements on workers:
SELECT master_create_worker_shards('customer_reviews', 16, 2);

Here's a simple example that sets up pg_shard and a distributed table. Create a normal table – nothing special about the created table; it will soon become a shell table on the master. Invoke master_create_distributed_table, which designates this table as distributed and specifies that customer_id is the partition column. Then you call master_create_worker_shards to create shards and shard replicas. This function connects to remote worker nodes, replays table schema, index, and constraint DDLs, and creates the shards on worker nodes. The two final arguments to the function are the number of shards and the replication factor … if you have 2, in case one of the replicas fails, that table is still available.
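
After that, the application keeps issuing ordinary SQL against the same table name. A sketch:

-- each statement is routed to a single shard based on the hash of customer_id
INSERT INTO customer_reviews (customer_id, review_date) VALUES ('HN892', '2015-01-30');
SELECT * FROM customer_reviews WHERE customer_id = 'HN892';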

PostgreSQL Secret Sauce: Hooks Full control over command lifetime. Specific to needs: planning, execution (start, run, finish, end), utility. I don't know how many of you are familiar with the internals of PostgreSQL, or the internal C APIs it provides. One of the wonderful things it gives us is hooks – there are about a dozen. You can take over any part of the command lifetime – very granular. Hooks can be chained so that more than one extension can use them – one extension calls through to the other. They are very flexible: there is no strict contract about what you have to do. In summary, hook APIs give us a first-class way of hooking into the PostgreSQL system.

Planning Phase Determine if the table is distributed; fall through to PostgreSQL if not. Using the partition key, find the involved shards. Deparse shard-specific SQL. Here's the overview of our planning phase. You can use regular and distributed tables in the same database. If the query is for a distributed table, the query will have already been parsed by the time we see it. We find the partition key, apply partition pruning, and find the shards involved in the query. We then take the parsed query or queries and use the same deparse logic as PostgreSQL proper to generate SQL statements back.

Planning Example

INSERT INTO customer_reviews (customer_id, rating) VALUES ('HN892', 5);

Determine partition key clauses: customer_id = 'HN892'. Find shards from the metadata tables: hashtext('HN892') BETWEEN min_value AND max_value. Produce shard-specific SQL:

INSERT INTO customer_reviews_16 (customer_id, rating) VALUES ('HN892', 5);

This is a detailed example of planning in pg_shard. The INSERT comes into pg_shard, and PostgreSQL parses the query. We separate out the clauses on the hash partition column and find the shard whose hash token range satisfies the hash value. This uses the same constraint exclusion mechanism used in e.g. PostgreSQL's partitioning implementation. When done, we have a shard-specific command to forward to a worker. This is important because each worker node has many shard placements, so we need to identify which one to write to. Any questions about how the planning works?

Execute Distributed Modify Locks enforce safe commutation. Replicas are visited in a predictable order. A per-session connection pool uses libpq. If a replica errors out, it is marked as inactive. And then there is the part of the actual execution. You have to be a bit careful here because we have a distributed system, and parallel requests are going on. We have to think about what safe commutation is on the set of operations that can be performed. So for example, INSERTs commute with SELECTs, and you can use that property if you're willing to relax isolation. We do have full consistency on writes – which means if you write something, you're going to see it when you read it back. There is none of the potential to read from a stale shard and get back stale results, as you do in document stores. Replicas need to be visited in order for the constraints to work properly. If any of the replica writes succeed during the modification, we consider the query as successful. If a write to a replica has failed, its metadata is marked as inactive.

Single-shard INSERT (replication factor: 2) [Diagram: the master routes INSERT INTO customer_reviews ... to the two workers holding shard 6] This is a visual representation of our cluster. The master determined shard 6 is the target shard for this query – in this case the replication factor is 2, so we touch two nodes. Connections are possibly already open (session-bound). Send the query to all replicas.

Single-shard INSERT: one replica fails [Diagram: same cluster; the write to worker node #3 fails while worker node #1 succeeds] Here worker node #3 has failed, but #1 succeeded, so the query is a success.

Single-shard INSERT: master marks the placement inactive [Diagram: the master sets shard 6 on node 3 to inactive status] The master marks shard 6 on node 3 as needing repair; the master node needs to do a bit of bookkeeping, and returns success to the client. Node 3 must be repaired to restore the replication factor for shard 6 – via the user-defined function (currently manual intervention), which copies data from node #1.

Modification Semantics Consistent (read your own writes). Safety comes from commutativity rules: SELECTs and INSERTs can be reordered, UPDATEs and DELETEs cannot. Constraints require a predictable visit order. As stated: consistent. Lax requirements give us the ability to queue rows later, etc. SELECTs/INSERTs can commute safely; UPDATE/DELETE cannot. With uniqueness constraints, an INSERT to a given shard must visit replicas in order.

Single Shard SELECT Fetch entire result from single shard Failover to another replica on error Do not modify state if failure happens Common key-value access pattern Finds shard which contains result, gets result Can dynamically fail over to other replicas if first fails Read failures do not currently modify shard state Common pattern (key-value store). Maybe you have a JSON blob

Single-shard SELECT: try the first placement [Diagram: the master routes SELECT * FROM customer_reviews WHERE customer_id = 'HN892'; to worker node #1] The master looks at the query's WHERE clauses and determines where to route the query. Here it found that the result will be in shard 3.

Single-shard SELECT: encounter an error [Diagram: same query; worker node #1 has an intermittent failure] Worker one has an intermittent failure, and the master finds another worker with shard 3.

Single-shard SELECT: try the next placement [Diagram: the master retries the same query against worker node #3] Still in the context of the original query from the client, the master fails over to worker 3, gets the result, and sends it to the client. One difference with INSERT is that this does not modify the workers' state in the metadata. This covers the single-shard SELECT use case.

Multi-Shard SELECT pg_shard: engaged when no partition key constraint is present; pulls data to the master and performs the final pass locally (aggregates, etc.). CitusDB: pushes computations to the worker nodes themselves for fully distributed query logic. We even included a multi-shard mode for ease of use. When the master finds the query is not restricted to a single shard, it pulls partial results from all involved shards and does aggregation, transformations, etc. on the master. This is CitusDB's bread and butter: it creates a fully distributed plan to exploit all computational resources, shares work among all worker nodes, and even does distributed JOINs. Compatible with pg_shard: do modifications using pg_shard and read using CitusDB.
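
A sketch of a query that pg_shard treats as multi-shard (no customer_id filter, so every shard is involved and the final aggregation happens on the master):

SELECT review_date, count(*)
FROM customer_reviews
GROUP BY review_date;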

Limitations Transactions cannot involve multiple shards or span multiple statements. No JOIN support (CitusDB provides this). Cross-shard constraints are unenforced. Transaction support is very limited: transactions cannot span multiple shards or multiple statements. There is no JOIN support (as stated, this is where CitusDB comes in). Cross-shard constraints are currently unenforced and undetected.
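
A hedged illustration of the transaction limitation – this is the kind of statement sequence pg_shard cannot run as one atomic unit, since it spans multiple statements and, most likely, multiple shards:

BEGIN;
INSERT INTO customer_reviews (customer_id) VALUES ('HN892');  -- one shard
INSERT INTO customer_reviews (customer_id) VALUES ('ZK101');  -- likely a different shard
COMMIT;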

What's Next? pg_shard v1.1 releases this week. It includes the shard repair UDF, easier integration with CitusDB, COPY support through triggers, SELECT projection push-downs, faster INSERTs, and bug fixes. pg_shard v1.2 will have more real-time analytics functionality. Anything else? What's keeping you from using pg_shard? More SQL coverage? Rebalancing shards across new machines? Multi-master for stricter uptime guarantees? Range partitioning (which enables many more cases)? Less hands-on recovery?

Summary Sharding extension for PostgreSQL: https://github.com/citusdata/pg_shard. Many small logical shards. Implemented with standard PostgreSQL hooks. Load the extension… create a table… distribute it. JSONB + pg_shard as an alternative to NoSQL. Simple sharding for PostgreSQL. Many small logical shards make moving data around faster, get new machines online quicker, and spread load better. Standard SQL, standard PostgreSQL extension; easy to get started. Goes great with JSONB.
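
A sketch of the JSONB-as-NoSQL combination (the table and column names are illustrative):

CREATE TABLE events (device_id text NOT NULL, payload jsonb);
SELECT master_create_distributed_table('events', 'device_id');
SELECT master_create_worker_shards('events', 16, 2);
-- key-value style access, routed to a single shard
SELECT payload->>'status' FROM events WHERE device_id = 'device-42';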

PostgreSQL Renaissance Normally, this would have been the end of the talk. I just have four more slides on random cool things about PostgreSQL.

PostgreSQL Usage Trends What database does your company use? Hacker News 2010 Hacker News 2014 The first one is on PostgreSQL’s usage statistics. This is a survey from a website called Hacker News. How many of you are familiar with this website? They have this survey on Hacker News every year, asking what database does your company use? On the left hand side, you see this survey’s results in 2010. There, MySQL has more usage than PostgreSQL and MongoDB combined. Fast forward the same survey four years, and you see two differences. First, PostgreSQL’s popularity has tripled over time, far exceeding any other database. Second, PostgreSQL now has more usage than MySQL and MongoDB combined.

Why is PostgreSQL popular? Oraclization of MySQL. Reliable and robust database. New extension framework: is the monolithic SQL database dying? If so, long live Postgres. The natural follow-up question is, "why has PostgreSQL become so popular?" The first reason is the Oraclization of MySQL. Second, when you talk to users and ask them why they love Postgres, they say "it's robust, it's reliable, and it does what I expect it to do. There are no surprises." Third, Postgres made architectural decisions early on that enable it to be extended as a database. As a result, with the new extension framework, you can implement your own logic for your own use case. In fact, PostgreSQL already has 200 extensions available.

PostgreSQL Extensions #1 HyperLogLog extension for real-time unique counts over varying time intervals. JSONB and GIN indexes for semi-structured data. Here are a few example extensions that we've seen our customers use and love. The first one implements an algorithm to calculate distinct count approximations over large data sets. For example, if you have unique counts for day 1 and unique counts for day 2, and you want to merge them together in real-time, you want to use the HyperLogLog extension. CloudFlare and Neustar are two customers that are using this extension to power their real-time analytics dashboards. The second extension adds a JSONB data type to PostgreSQL. This way, you can store your relational and semi-structured data within the same table and query them together.
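
Two small sketches (the hll functions follow the postgresql-hll extension, the table and column names are illustrative, and the JSONB GIN index is standard PostgreSQL 9.4):

CREATE EXTENSION hll;
-- approximate distinct visitors per day; the per-day sketches can be merged later
SELECT visit_date, hll_cardinality(hll_add_agg(hll_hash_text(visitor_id)))
FROM page_visits GROUP BY visit_date;
-- index semi-structured JSONB data for fast containment queries
CREATE INDEX page_visits_payload_idx ON page_visits USING GIN (payload jsonb_path_ops);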

PostgreSQL Extensions #2 Real-time funnel and cohort queries. Dynamic funnels to look at people who visited A, then B, and finally J. Heatmap video of vehicles on a map. The third extension enables you to answer dynamic funnel questions in real-time. For example, we have one customer who is a billion-dollar retailer in Europe; think of them as a local Walmart. This customer has trucks that go around countries delivering goods, and they want to look at the trucks that visited location A, B, and then C, and compare their fuel consumption. For that, they're using an extension for real-time funnels. (They also want to visualize these trucks as a heatmap overlaid on a map, and I'll show a quick video that does that.)

Contact Jason: jason@citusdata.com Sumedh: sumedh@citusdata.com Ozgun: ozgun@citusdata.com General: groups.google.com/forum/#!forum/pg_shard-users

Questions