© Continuent 6/13/2015 PostgreSQL replication strategies Understanding High Availability and choosing the right solution Slides available at

1 © Continuent What Drives Database Replication?
/ Availability – Ensure applications remain up and running when there are hardware/software failures, as well as during scheduled maintenance on database hosts
/ Read Scaling – Distribute queries, reports, and I/O-intensive operations such as backup, e.g., on media or forum web sites
/ Write Scaling – Distribute updates across multiple databases, for example to support telco message processing or document/web indexing
/ Super Durable Commit – Ensure that valuable transactions such as financial or medical data commit to multiple databases to avoid loss
/ Disaster Recovery – Maintain data and processing resources in a remote location to ensure business continuity
/ Geo-cluster – Allow users in different geographic locations to use a local database for processing, with automatic synchronization to other hosts

2 © Continuent High availability
/ The magic nines

  Percent uptime   Downtime/month   Downtime/year
  99.0%            7.2 hours        3.65 days
  99.9%            43.2 minutes     8.76 hours
  99.99%           4.32 minutes     52.56 minutes
  99.999%          0.43 minutes     5.26 minutes
  99.9999%         2.6 seconds      31 seconds
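These figures follow directly from the unavailability fraction (100% minus uptime). A minimal sketch that reproduces the table above; the class name is ours and a 30-day month is assumed:

    // Reproduces the "magic nines" downtime table.
    public class DowntimeCalculator {
        public static void main(String[] args) {
            double[] uptimes = {99.0, 99.9, 99.99, 99.999, 99.9999};
            double minPerMonth = 30 * 24 * 60;   // 43,200 minutes (30-day month)
            double minPerYear = 365 * 24 * 60;   // 525,600 minutes
            for (double up : uptimes) {
                double down = (100.0 - up) / 100.0;    // unavailability fraction
                System.out.printf("%-9s %10.2f min/month %12.2f min/year%n",
                        up + "%", down * minPerMonth, down * minPerYear);
            }
        }
    }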

3 © Continuent A few definitions
/ MTBF – Mean Time Between Failures
  The total MTBF of a cluster must combine the MTBFs of its individual components
  Also consider mean time between system abort (MTBSA) or mean time between critical failure (MTBCF)
/ MTTR – Mean Time To Repair
  How is the failure detected? How is it notified? Where are the spare parts for the hardware? What does your support contract say?
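MTBF and MTTR combine into steady-state availability, A = MTBF / (MTBF + MTTR). A hedged sketch with illustrative numbers (the class, method and values are ours, not from the slides):

    public class Availability {
        // Steady-state availability = MTBF / (MTBF + MTTR)
        static double availability(double mtbfHours, double mttrHours) {
            return mtbfHours / (mtbfHours + mttrHours);
        }
        public static void main(String[] args) {
            // e.g. MTBF = 1000 h, MTTR = 1 h  ->  ~99.90% ("three nines")
            System.out.printf("%.4f%%%n", 100.0 * availability(1000, 1));
        }
    }

Shrinking MTTR (fast detection, notification and repair) buys as much availability as increasing MTBF, which is why the questions about detection, spare parts and support contracts matter.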

4 © Continuent Outline /Database replication strategies /PostgreSQL replication solutions /Building HA solutions /Management issues in production

5 © Continuent Problem: the database is the weakest link
/ Clients connect to the application server
/ The application server builds web pages with data coming from the database
/ Application server clustering solves application server failure
/ A database outage causes an overall system outage
(Diagram: Internet → application servers → database → disk)

6 © Continuent Disk replication/clustering
/ Eliminates the single point of failure (SPOF) on the disk
/ A disk failure does not cause a database outage
/ The database outage problem is still not solved
(Diagram: Internet → application servers → database → replicated database disks)

7 © Continuent Database clustering with shared disk
/ Multiple database instances share the same disk
/ The disk can be replicated to prevent a SPOF on the disk
/ No dynamic load balancing
/ Database failure is not transparent to users (partial outage)
/ Manual failover + manual cleanup needed
(Diagram: Internet → application servers → databases → shared database disks)

8 © Continuent Master/slave replication
/ Lazy replication at the disk or database level (log shipping, hot standby)
/ No scalability
/ Data lost at failure time
/ System outage during failover to the slave
/ Failover requires client reconfiguration
(Diagram: Internet → application servers → master and slave databases with their disks, replicated by log shipping / hot standby)

9 © Continuent Scaling the database tier – master/slave replication
/ Pros
  Good solution for disaster recovery with remote slaves
/ Cons
  Failover time / data loss on master failure
  Read inconsistencies
  Master scalability
(Diagram: Internet → web frontend → app. server → master database replicated to slaves)

10 © Continuent Scaling the database tier – atomic broadcast
/ Pros
  Consistency provided by multi-master replication
/ Cons
  Atomic broadcast scalability
  No client-side load balancing
  Heavy modifications of the database engine
(Diagram: Internet → application servers → multi-master databases synchronized by atomic broadcast)

11 © Continuent Scaling the database tier – SMP
/ Pros
  Performance
/ Cons
  Scalability limit
  Limited reliability
  Cost
(Diagram: Internet → web frontend → app. server → a single large SMP machine from "well-known hardware + database vendors")

12 © Continuent Middleware-based replication
/ Pros
  No client application modification
  Database vendor independent
  Heterogeneity support
  Pluggable replication algorithm
  Possible caching
/ Cons
  Latency overhead
  Might introduce new deadlocks
(Diagram: Internet → application servers → replication middleware → databases)

13 © Continuent Transparent failover
/ Failures can happen in any component, at any time during request execution, in any context (transactional, autocommit)
/ Transparent failover masks all failures at any time to the client; it performs automatic retries and preserves consistency
(Diagram: Internet → application servers → Sequoia → databases)

14 © Continuent Outline /Database replication strategies /PostgreSQL replication solutions /Building HA solutions /Management issues in production

15 © Continuent PostgreSQL replication solutions compared
Columns, left to right: pgpool-I | pgpool-II | PGcluster-I | PGcluster-II | Slony-I | Sequoia (some cells span several solutions)
/ Replication type: Hot standby | Multi-master | Shared disk | Master/Slave | Multi-master
/ Commodity hardware: Yes | No | Yes
/ Application modifications: No | Yes if reading from slaves | No | Yes if reading from slaves | Client driver update
/ Database modifications: No | Yes | No
/ PG support: >= 7.4, Unix | 7.3.9, 7.4.6, Unix | 8.?, Unix only? | >= 7.3.3 | All versions
/ Data loss on failure: Yes | Yes? | No? | No | Yes | No
/ Failover on DB failure: Yes | No if due to disk | Yes
/ Transparent failover: No | Yes
/ Disaster recovery: Yes | No if disk | Yes
/ Queries load balancing: No | Yes

16 © Continuent PostgreSQL replication solutions compared (continued)
Columns, left to right: pgpool-I | pgpool-II | PGcluster-I | PGcluster-II | Slony-I | Sequoia (some cells span several solutions)
/ Read scalability: Yes
/ Write scalability: No | Yes | No
/ Query parallelization: No | Yes | No | No? | No
/ Replicas: 2 | up to 128 | LB or replicator limit | SAN limit | unlimited
/ Super durable commit: No | Yes | No | Yes | No | Yes
/ Add node on the fly: No | Yes | Yes (slave) | Yes
/ Online upgrades: No | Yes | No | Yes (small downtime) | Yes
/ Heterogeneous clusters: PG >= 7.4, Unix only | PG | PG >= 7.3.3 | Yes
/ Geo-cluster support: No | Possible but don't use | No | Yes

17 © Continuent Performance vs. scalability
/ Performance: latency is different from throughput
/ Most solutions do not provide parallel query execution
  No parallelization of the query execution plan
  Queries do not go faster when the database is not loaded
/ What a perfect load distribution buys you
  Constant response time when the load increases
  Better throughput when the load surpasses the capacity of a single database

18 © Continuent Understanding scalability (1/2)
(Chart: 20 users, single DB vs. Sequoia)

19 © Continuent Understanding scalability (2/2)
(Chart: 90 users, single DB vs. Sequoia)

20 © Continuent RAIDb concept: Redundant Array of Inexpensive Databases
/ RAIDb controller – creates a single virtual database, balances the load
/ RAIDb-0, 1, 2: various performance / fault-tolerance trade-offs
  RAIDb-0: SQL partitioning (whole tables), no duplication, no fault tolerance, at least 2 nodes
  RAIDb-1: mirroring of the full database, performance bounded by the write broadcast, at least 2 nodes (uni/cluster certifies only RAIDb-1)
  RAIDb-2: partial replication, at least 2 copies of each table for fault tolerance, at least 3 nodes
/ New combinations easy to implement
(Diagrams: a RAIDb controller receiving SQL and distributing it to backends hosting whole tables, full databases, or a mix of both)

21 © Continuent Sequoia architectural overview
/ Middleware implementing RAIDb
  100% Java implementation
  Open source (Apache v2 license)
/ Two components
  Sequoia driver (JDBC, ODBC, native lib)
  Sequoia controller
/ Database neutral
(Diagram: application JVM with the Sequoia JDBC driver → Sequoia controller → PostgreSQL JDBC driver → PostgreSQL)
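A minimal sketch of what the two-component model means for application code: swap the PostgreSQL driver for the Sequoia one and keep plain JDBC. The driver class name is an assumption based on typical Sequoia setups; the URL format is the one shown later in the deck.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SequoiaClient {
        public static void main(String[] args) throws Exception {
            // Assumed Sequoia driver class; only this line and the URL differ
            // from a regular PostgreSQL JDBC client.
            Class.forName("org.continuent.sequoia.driver.Driver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:sequoia://node1,node2/myDB", "login", "password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM t")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }

Listing both controllers in the URL is what allows the driver to fail over between them, as described on the following slides.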

22 © Continuent Sequoia read request
/ Client: connect to myDB with login/password, execute SELECT * FROM t
/ Controller: request ordering, backend selection (RR, WRR, LPRF, …; see the sketch below), get a connection from the pool, execute, update the cache (if available)
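To make the "RR" policy concrete, a simplified round-robin backend selector; this is an illustration under our own names, not Sequoia's scheduler code (WRR would weight the rotation, and LPRF would pick the backend with the fewest pending requests).

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    class RoundRobinScheduler {
        private final List<String> backends;           // e.g. backend names or URLs
        private final AtomicInteger next = new AtomicInteger();

        RoundRobinScheduler(List<String> backends) {
            this.backends = backends;
        }

        // Each read request is sent to the next backend in turn.
        String pickBackend() {
            int i = Math.floorMod(next.getAndIncrement(), backends.size());
            return backends.get(i);
        }
    }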

23 © Continuent Sequoia write request
/ The driver URL lists both controllers: jdbc:sequoia://node1,node2/myDB
/ Writes are delivered to all controllers using a total-order reliable multicast

24 © Continuent Alternative replication algorithms
/ GORDA API
  European consortium defining an API for pluggable replication algorithms
/ Sequoia 3.0
  GORDA-compliant prototype for PostgreSQL
  Uses triggers to compute write-sets
  Certifies transactions at commit time
  Propagates write-sets to the other nodes
/ Tashkent/Tashkent+
  Research prototype developed at EPFL
  Uses workload information for improved load balancing
/ More information

25 © Continuent PostgreSQL-specific issues
/ Non-deterministic queries
  Macros in queries (now(), current_timestamp, rand(), …)
  Stored procedures, triggers, …
  SELECT … LIMIT can create non-deterministic results in UPDATE statements if the SELECT does not have an ORDER BY on a unique index (a deterministic rewrite is sketched below):
  UPDATE FOO SET KEYVALUE='x' WHERE ID IN (SELECT ID FROM FOO WHERE KEYVALUE IS NULL LIMIT 10)
/ Sequences
  setval() and nextval() are not rolled back
  nextval() can also be called within SELECT
/ Serial type
/ Large objects and OIDs
/ Schema changes
/ User access control
  Not stored in the database (pg_hba.conf)
  Host-based control might be fooled by a proxy
  Backup/restore with respect to user rights
/ VACUUM
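A hedged sketch of the rewrite hinted at above: adding ORDER BY on the unique ID column makes the inner SELECT … LIMIT pick the same rows on every replica. Table and column names come from the slide; the wrapper class and method are ours.

    import java.sql.Connection;
    import java.sql.Statement;

    public class DeterministicLimitUpdate {
        // ORDER BY on a unique key makes SELECT ... LIMIT deterministic,
        // so every replica updates the same 10 rows.
        static int fillNullKeys(Connection conn) throws Exception {
            try (Statement stmt = conn.createStatement()) {
                return stmt.executeUpdate(
                        "UPDATE FOO SET KEYVALUE = 'x' " +
                        "WHERE ID IN (SELECT ID FROM FOO " +
                        "             WHERE KEYVALUE IS NULL " +
                        "             ORDER BY ID " +
                        "             LIMIT 10)");
            }
        }
    }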

26 © Continuent Outline /Database replication strategies /PostgreSQL replication solutions /Building HA solutions /Management issues in production

27 © Continuent Simple hot-standby solution (1/3) /Virtual IP address + Heartbeat for failover /Slony-I for replication

28 © Continuent Simple hot-standby solution (2/3)
/ Virtual IP address + Heartbeat for failover
/ Linux DRBD for replication
/ Only 1 node serving requests
(Diagram: client applications → virtual IP → two nodes, each running Postgres, Linux, DRBD and Heartbeat, replicating /dev/drbd0)

29 © Continuent Simple hot-standby solution (3/3)
/ pgpool for failover
/ The proxy might become a bottleneck
  Requires 3 sockets per client connection
  Increased latency
/ Only 1 node serving requests
(Diagram: client applications → pgpool → Postgres 1 and Postgres 2)

30 © Continuent Highly available web site
/ Apache clustering
  L4 switch, RR-DNS, One-IP techniques, LVS, Linux-HA, …
/ Web tier clustering
  mod_jk (T4), mod_proxy/mod_rewrite (T5), session replication
/ PostgreSQL multi-master clustering solution
(Diagram: Internet → RR-DNS → Apache front ends → mod_jk → application servers → PostgreSQL cluster)

31 © Continuent Highly available web applications
/ Consider the MTBF (mean time between failures) of every hardware and software component
/ Take the MTTR (mean time to repair) into account to prevent long outages
/ Tune accordingly to prevent thrashing
(Diagram: Internet → web and application tiers → Sequoia → databases)
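Considering every component matters because, for components that must all be up for the service to work, availabilities multiply (assuming independent failures). A small sketch with made-up numbers:

    public class EndToEndAvailability {
        public static void main(String[] args) {
            // Availability of components in series (all must be up):
            // load balancer, app server tier, database tier (illustrative values).
            double[] components = {0.9999, 0.999, 0.999};
            double total = 1.0;
            for (double a : components) {
                total *= a;
            }
            System.out.printf("end-to-end availability: %.4f%%%n", 100.0 * total);
            // ~99.79%, worse than any single component, hence the need to
            // examine MTBF/MTTR for every element of the chain.
        }
    }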

32 © Continuent Building Geo-Clusters
(Diagram: geo-clusters spanning America, Europe and Asia; in each cluster one region holds the master and the other regions hold slaves; sites are linked by asynchronous WAN replication)

33 © Continuent Split-brain problem (1/2)
/ This is what you should NOT do:
  At least 2 network adapters in the controllers
  Use a dedicated network for controller communication
(Diagram: client servers, controllers and databases connected through a network switch, with the controllers using separate eth0/eth1/eth2 interfaces)

34 © Continuent Split-brain problem (2/2)
/ When the controllers lose connectivity, clients may update each half of the cluster inconsistently
/ No way to detect this scenario (each half thinks that the other half has simply failed)
(Diagram: same setup as the previous slide, with the link between the controllers broken)

35 © Continuent Avoiding network failure and split-brain
/ Collocate all network traffic using Linux bonding
/ Replicate all network components (mirror the network configuration)
/ Various configuration options available for bonding (active-backup or trunking)
(Diagram: every client server, controller and database aggregates eth0 and eth1 into a bond0 interface)

36 © Continuent Synchronous GeoClusters /Multi-master replication requires group communication optimized for WAN environments /Split-brain issues will happen unless expensive reliable dedicated links are used /Reconciliation procedures are application dependent

37 © Continuent Outline /Database replication strategies /PostgreSQL replication solutions /Building HA solutions /Management issues in production

38 © Continuent Managing a cluster in production
/ Diagnosing cluster status reliably
/ Getting proper notifications/alarms when something goes wrong
  Standard or SNMP traps
  Logging is key for diagnosis
/ Minimizing downtime
  Migrating from a single database to a cluster
  Expanding the cluster
  A staging environment is key for testing
/ Planned maintenance operations
  Vacuum
  Backup
  Software maintenance (DB, replication software, …)
  Node maintenance (reboot, power cycle, …)
  Site maintenance (in the GeoCluster case)

39 © Continuent Dealing with failures
/ Software vs. hardware failures
  Client application, database, replication software, OS, VM, …
  Power outage, node, disk, network, Byzantine failure, …
  Admission control to prevent thrashing
/ Detecting failures requires proper timeout settings
/ Automated failover procedures
  Client and cluster reconfiguration
  Dealing with multiple simultaneous failures
  Coordination required between different tiers or admin scripts
/ Automatic database resynchronization / node repair
/ Operator errors
  Automation to prevent manual intervention
  Always keep backups and try procedures on a staging environment first
/ Disaster recovery
  Minimize data loss but preserve consistency
  Provisioning and planning are key
/ Split brain or GeoCluster failover
  Requires organization-wide coordination
  Manual diagnosis/reconfiguration often required

40 © Continuent Summary /Different replication strategies for different needs /Performance ≠ Scalability /Manageability becomes THE major issue in production

41 © Continuent Links
/ pgpool:
/ PGcluster:
/ Slony:
/ Sequoia:
/ GORDA:
/ Slides:

© Continuent 6/13/2015 Bonus slides

43 © Continuent RAIDb-2 for scalability
/ Limit replication of heavily written tables to a subset of nodes
/ Dynamic replication of temporary tables
/ Reduces disk space requirements

44 © Continuent RAIDb-2 for heterogeneous clustering /Migrating from MySQL to Oracle /Migrating from Oracle x to Oracle x+1

45 © Continuent Server farms with master/slave DB replication
/ No need for group communication between controllers
/ Admin. operations broadcast to all controllers

46 © Continuent Composing Sequoia controllers
/ A Sequoia controller is viewed as a single database by its client (an application or another Sequoia controller)
/ No technical limit on composition depth
/ Backends cannot be shared by multiple controllers
/ Can be expanded dynamically