High Availability in SQL Server 2012: Techniques to Reduce Downtime
Eric Peterson
Agenda
- Overview of SQL Server methods of high availability
  - From hardware through methods
  - Not intended to teach how to implement
  - Tips given, tips accepted
- Review of methods, terms, features
  - Log Shipping
  - Replication
  - Mirroring
  - Clustering
  - Always On Availability Groups
- Other things to think about
- Comparison of methods
Speaker Background
- 30+ years professional experience
- 70s: Mainframe
- 80s: Database (IDMS, IDSII, Oracle, Sybase, DB2 v1.6, DB2 2.0)
- 90s: Sybase Pro Serve (SQL Server, ODBC, PowerDesigner; design and beta teams)
- 00s: Independent Consultant - MS SQL Server
- Current: BCD Travel
- Eric.Peterson@BCDTravel.com
High Availability: It Is All About the 9s
Availability as a fraction of the period, for a single outage of the given length:

Downtime  Seconds  Day       Week      Year
½ min     30       0.999653  0.999950  0.999999
1 min     60       0.999306  0.999901  0.999998
5 min     300      0.996528  0.999504  0.999990
10 min    600      0.993056  0.999008  0.999981
15 min    900      0.989583  0.998512  0.999971
30 min    1800     0.979167  0.997024  0.999943
1 hour    3600     0.958333  0.994048  0.999886
2 hours   7200     0.916667  0.988095  0.999772
8 hours   28800    0.666667  0.952381  0.999087
1 day     86400    0.000000  0.857143  0.997260
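The figures in the 9s table follow from one formula: availability = 1 - (outage seconds / seconds in the period). A minimal sketch that reproduces a few rows, assuming a 365-day year; the function and dictionary names are illustrative, not from the deck.

```python
# Seconds in each reporting period (the year is assumed to be 365 days).
SECONDS_PER = {
    "day": 24 * 60 * 60,
    "week": 7 * 24 * 60 * 60,
    "year": 365 * 24 * 60 * 60,
}


def availability(outage_seconds, period):
    """Fraction of the period the system was up, floored at zero."""
    return max(0.0, 1.0 - outage_seconds / SECONDS_PER[period])


# Print three sample rows in the same Day / Week / Year order as the table.
for label, outage in [("30 sec", 30), ("1 hour", 3600), ("1 day", 86400)]:
    row = [f"{availability(outage, p):.6f}" for p in ("day", "week", "year")]
    print(label, *row)
```

The same outage looks very different depending on the period it is measured against, which is why the period has to be part of any availability target.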
High Availability (HA) Terms
- High Availability: keeping the system up
  - Is not Disaster Recovery
- Disaster Recovery: recovering from when bad things happen
- Latency: the delay time it takes to synchronize between two systems
- Temperature
  - Hot: always up, always in sync
  - Warm: close, but has defined latency
  - Cold: manual intervention, defined loss
Methods
Technologies that have an impact on HA:
- Maintenance & Backups
- Replication
- Log Shipping
- Mirroring (Sync)
- Mirroring (Async)
- Clustering
- Always On Availability
- Third-party software
In the Beginning
- PCs in the post-mainframe world
- Departmental apps
- Single points of failure
  - Local disk drives
  - System board, memory
  - Disk controller
  - Power supply
- Backups? Maybe
- Loss of infrastructure: network, etc.
Resolving Single Points of Failure
- Redundant
  - Power, power supplies
  - Battery/generator backup
  - Network cards
  - Device controllers
  - Disks
- SANs
- Fault-tolerant disks
  - RAID: Redundant Array of Independent Disks
RAID Levels (Redundant Array of Independent Disks)
- RAID 0: block-level striping without parity
- RAID 1: mirroring without parity or striping
RAID Levels
- RAID 5: block-level striping with distributed parity
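RAID 5 parity is commonly an XOR across the data blocks of a stripe, which is what lets a lost block be rebuilt from the survivors. A minimal sketch of that idea for a single stripe; the block contents are made up, and a real array also rotates the parity block across disks (the "distributed" part), which is not modeled here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks from one stripe (illustrative values).
data = [b"SQL1", b"SQL2", b"SQL3"]
parity = xor_blocks(data)

# Simulate losing the second disk: rebuild its block from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

Every write has to update the parity block as well as the data block, which is the "parity write" cost that RAID 10 avoids.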
RAID Levels
- RAID 10, also known as RAID 1+0
  - A combination of RAID 1 and RAID 0: mirroring + striping
  - No parity write: faster for inserts
  - Double the space!
  - Hot swap
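The "double the space" cost of RAID 10 is easier to see next to the other levels. A rough sketch of usable capacity per level, assuming identical disks and the textbook layouts; the function name and the 500 GB disk size are illustrative only.

```python
def usable_capacity_gb(level, disks, disk_size_gb):
    """Usable capacity for textbook RAID layouts built from identical disks."""
    if level == "0":
        return disks * disk_size_gb          # striping only, no redundancy
    if level == "1":
        return disk_size_gb                  # every disk holds the same copy
    if level == "5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * disk_size_gb    # one disk's worth goes to parity
    if level == "10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        return disks // 2 * disk_size_gb     # half the disks are mirrors
    raise ValueError(f"unsupported RAID level: {level}")


for level in ("0", "5", "10"):
    print("RAID", level, usable_capacity_gb(level, disks=4, disk_size_gb=500), "GB usable")
```

With four disks, RAID 10 gives up half the raw space while RAID 5 gives up a quarter; whether the write performance is worth it is the benchmarking question.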
RAID: Which Is Best?
- BENCHMARK!
- If all things are equal: “MPF” RAID 10
- SAN cache
  - If there is enough cache, the parity write may not impact performance
- Device controller dependent
- Single point of IO
  - If one disk corrupts then the other corrupts
More Redundancy Is Better
- Hardware
  - SAN replication, mirroring, copy, etc.
- Software redundancy
  - Old school “BAR”
  - Replication
  - Log Shipping
  - Mirroring
Backup and Recover “BAR”
- Manual or home-grown process
  - Backup
  - Copy file
  - Restore backup
- Ability to query/develop against non-production
- Scrub production data
- Latency defined: 24 hours, 1 week, month, quarter
- Does not work as well for VLDB
  - Apply differentials
Replication
- Reads the log
- Distribution database
- Publisher
- Distributor: keeps track
- Publisher and Distributor can be on the same SQL Server
- Subscriber: one to many subscribers
- Process can run in near real time
  - Can be scheduled as well
Replication Types
- Types: Snapshot, Transactional Replication, Merge
- Snapshot
  - Snapshot Agent
  - Schema and data
- Transactional Replication
  - One way
  - Publish to subscribers
  - Incremental changes
  - Continuous or scheduled
- Merge
  - Merge Agent for conflict resolution
  - Every row is given a unique identifier
  - Detection and resolution
Replication Features
- Can publish
  - To non-SQL Server subscribers
  - To multiple subscribers
  - Table by table
  - Stored procs
- Monitoring
- Alerts
Log Shipping
- Provides a backup of the current DB in a secondary DB
  - Read-only copy
- Transaction logs “shipped”
- Automated “BAR”
- SQL Agent jobs
- Latency: schedule
- Script: dis/enable
- One to many DBs
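Because log shipping latency is set by job schedules, a rough upper bound on how far the secondary can lag is the sum of the backup, copy, and restore intervals. A simplified sketch under that assumption: it ignores how long each job actually runs and any network delay, and the 15-minute schedules are illustrative values, not a recommendation.

```python
from datetime import timedelta


def worst_case_lag(backup_every, copy_every, restore_every):
    """Upper bound on secondary staleness if a change just misses each job in turn."""
    return backup_every + copy_every + restore_every


def worst_case_data_loss(backup_every):
    """Changes made since the last log backup are the ones at risk if the primary is lost."""
    return backup_every


schedule = timedelta(minutes=15)
print("secondary may lag by up to", worst_case_lag(schedule, schedule, schedule))
print("data at risk on primary loss:", worst_case_data_loss(schedule))
```

Shortening any one schedule lowers the bound, but more frequent jobs also mean more frequent interruptions on a read-only secondary.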
Log Shipping Issues
- Failover is not automatic; you have to reset everything
- Read-only DB can be used in manual failover
  - Requires application changes
- When a log is being applied, connections to the secondary read-only DB are dropped
- Network dependent
  - Log backup size
- Job schedule default times
  - Spread the schedule
Mirroring
- Dual write for a single DB on different SQL Servers
  - Asynchronous (High Performance)
  - Synchronous (High Safety mode)
- Good redundancy
- Manual or automatic failover
- Rolling update
- 2008 Enterprise: resolution of page errors
Mirroring Issues
- Managing multiple databases so that if one fails over they all fail over is difficult, if not impossible
- Synchronous needs to be close: local or dark fiber
- There is only one mirror of the database
- The mirror is not directly usable; it just sits there unless you are prepared to work with snapshots
- There is no mirror after the failover; the mirroring state is DISCONNECTED and the principal is exposed
- A SQL Server native client is needed to use mirroring
Clustering Overview
[Diagram labels: Cluster; Node 1 ... Node N; Windows Server; SQL Server; SQL Server Instance; Virtual Server Group; Heartbeat; Failover; SAN; Quorum Drive]
Clustering Terms
- Node
- Service / Group (old)
- Resources
- Heartbeat
- Quorum
- SAN
Cluster Setup
- Cluster is a logical grouping of resources
- Failover Cluster Manager (2008+), Cluster Manager
- Each machine (node)
  - Windows Server
  - IP address
  - Local and SAN drives
- Each (service/group) instance
  - VM, Windows Server
  - SQL
  - Dedicated resources
Cluster Failover
- Service/Group (VM) moves from one node to another
  - Best available or directed
  - Usually takes under 30 seconds
- Automatic or manual
- Heartbeat monitor
- Application independent
- Failover tracking
  - Failover notification proc
  - SQL Server log
  - Cluster
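The heartbeat and "best available or directed" ideas can be illustrated with a toy model. This is only a sketch of the concept, not how Windows Server Failover Clustering is implemented; the class, the function, the node names, and the 5-second timeout are all hypothetical.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node (times are plain seconds)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def beat(self, node, now):
        """Record that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


def pick_target(nodes, failed, directed=None):
    """Directed failover if that node is healthy, else the first healthy node."""
    healthy = [n for n in nodes if n not in failed]
    if directed in healthy:
        return directed
    return healthy[0] if healthy else None


monitor = HeartbeatMonitor(timeout_seconds=5)
for name in ("node1", "node2", "node3"):
    monitor.beat(name, now=0)
monitor.beat("node2", now=8)   # node1 goes silent after time 0
monitor.beat("node3", now=8)

failed = monitor.failed_nodes(now=10)
print(failed, pick_target(["node1", "node2", "node3"], failed))
```

The point of the model is that failover is decided outside the application: the cluster notices the missed heartbeats and moves the service, which is why the slide calls it application independent.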
Clustering Issues
- Failover is usually fast
  - But it can take several minutes to recover DBs
- Failover issues
  - Connections drop
  - Transactions stop
  - Failure to connect
- Cluster can fail
  - SAN disk failure
  - Memory / resources
- Keep a spreadsheet
AlwaysOn Group/Cluster
AlwaysOn Terms
- Windows Server Failover Clustering
- Type
  - Always On Failover Cluster Instance
  - Always On Availability Group (cluster change!)
  - Can be either or both
- Listener: IP address and DNS name
  - Logical instance that programs attach to
- Replica: SQL Server mirror copy of DBs
AlwaysOn Availability Groups
- One database or a group of databases
- Advanced mirroring
  - Multiple secondary DBs
  - Multiple synchronous DBs
  - Automatic page repair
- Active secondary
  - Offloading workloads
  - Backup/log from secondary
- Multiple groups
AlwaysOn Availability
- Database group failover
  - Automatic or manual
- Management Studio
  - Management of groups in Management Studio
  - Dashboard
- No shared SAN
  - Locally attached disks
  - Ability to repair from mirror
  - Change RAID level?
Management Studio
[Screenshot labels: Primary; Secondary; Listener]
AlwaysOn Availability Restrictions
- All servers must be in the same domain
  - Can be different data centers/cloud
- Up to 3 replicas can be synchronous
  - Local, or dark fiber
  - Up to 2 of them can be used for automatic failover
- All servers must use the same service account
  - If using Kerberos
- Both AlwaysOn Availability types (Group & Cluster) rely on Windows Server Failover Clustering infrastructure & Windows Cluster
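The replica limits on this slide are easy to check mechanically when planning a group. A small sketch that encodes only the two counts stated above, plus the fact that automatic failover requires synchronous commit; the `Replica` class, the replica names, and `check_group` are hypothetical planning helpers, not a SQL Server API.

```python
from dataclasses import dataclass

MAX_SYNCHRONOUS = 3         # from the slide: up to 3 synchronous replicas
MAX_AUTOMATIC_FAILOVER = 2  # from the slide: up to 2 of them for automatic failover


@dataclass
class Replica:
    name: str
    synchronous: bool
    automatic_failover: bool


def check_group(replicas):
    """Return a list of problems with a planned availability group (empty if none)."""
    problems = []
    sync = [r for r in replicas if r.synchronous]
    auto = [r for r in replicas if r.automatic_failover]
    if len(sync) > MAX_SYNCHRONOUS:
        problems.append(f"{len(sync)} synchronous replicas, limit is {MAX_SYNCHRONOUS}")
    if len(auto) > MAX_AUTOMATIC_FAILOVER:
        problems.append(f"{len(auto)} automatic-failover replicas, limit is {MAX_AUTOMATIC_FAILOVER}")
    for r in auto:
        if not r.synchronous:
            problems.append(f"{r.name}: automatic failover requires synchronous commit")
    return problems


plan = [
    Replica("SQLA", synchronous=True, automatic_failover=True),
    Replica("SQLB", synchronous=True, automatic_failover=True),
    Replica("SQLC", synchronous=False, automatic_failover=False),
]
print(check_group(plan))
```

An empty list means the plan fits the limits listed on the slide; it says nothing about the domain, network distance, or service account requirements.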
AlwaysOn Failover Cluster Instance
- Failover of the instance rather than at the DB level
- New features
  - Multi-site clustering across subnets for improved site protection
  - Flexible failover policy for better control over instance failover
  - Improved diagnostics for faster troubleshooting
  - TempDB on local drive allows better query performance
2012 Always On Downside
- Failover time
  - Volume dependent
  - 30 seconds to 30 minutes
- Two machines, two deploys for:
  - Security: must be the same SID for SQL ids
  - SQL Agent jobs: need to be “primary” aware
- Secondary must be up (bug)
- Corruption
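A "primary aware" job is one that is deployed on every replica but does its work only where the database is currently primary. A minimal sketch of that guard; `run_if_primary` and the job name are hypothetical, and the role check is passed in as a plain function because how it is obtained (for example, a query against a DMV such as sys.dm_hadr_availability_replica_states) is outside this sketch.

```python
def run_if_primary(job, is_primary):
    """Run job() only when is_primary() reports True; return whether it ran."""
    if not is_primary():
        return False   # this replica is a secondary: skip quietly
    job()
    return True


ran = []

# Same job deployed on two machines; only the one that is primary does the work.
run_if_primary(lambda: ran.append("nightly_load"), is_primary=lambda: True)
run_if_primary(lambda: ran.append("nightly_load"), is_primary=lambda: False)
print(ran)
```

Keeping the check at the start of every job step means the job definitions can be identical on both machines, which is the "two deploys" burden the slide describes.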
SIOS Software Solution Overview
[Diagram labels: Cluster; Node 1 ... Node N; Windows Server; SQL Server; SQL Server Instance; Virtual Server Group; VM; Heartbeat; Failover; SAN; Quorum Drive]
Comparison of Methods
Columns: Type, Latency, Temp, 9s, HA, DR
- Replication: asynchronous - scheduled; warm; manual, application must handle; very good
- Backup: daily; cold; OK
- Log Shipping: scheduled; cold - warm; good to excellent
- Mirroring (Sync): synchronous; hot; good; monitor server; very good
- Mirroring (Async): depends on volume; manual
- Clustering: N/A; excellent; excellent to poor for SAN failure
- Always On (Sync)
- Always On (Async): warm +
* Maintenance - use most current versions
Maintenance
- Beyond Patch Tuesday
  - Backups
  - Index/table fragmentation
  - Clustering
- Backups
  - Tape drive speed
- Index/table fragmentation
  - Online index
- Old data removal
  - Size matters, smaller is better
What's Best for Your Environment?
- If you are afraid to fail over, you do not have a valid system
- Define the 9s through Legal
- Cheating
  - Change the calculation (yes/no): remove scheduled maintenance time from calc time
  - Change the definition: SQL > Cluster
- Benchmark
- Knowledge
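Removing scheduled maintenance from the calculation changes the reported number without changing what users experienced. A small sketch of both ways of counting, using made-up figures (a 720-hour month with 4 hours of planned and 1 hour of unplanned downtime); the function name and parameters are illustrative.

```python
def availability(period_hours, unplanned_hours, planned_hours=0.0, exclude_planned=False):
    """Availability fraction, optionally leaving planned maintenance out of the calculation."""
    if exclude_planned:
        # Planned time is removed from both the downtime and the measured period.
        base = period_hours - planned_hours
        down = unplanned_hours
    else:
        base = period_hours
        down = unplanned_hours + planned_hours
    return 1.0 - down / base


month = 30 * 24  # hours in a 30-day month (assumed)
print(f"counting maintenance:  {availability(month, 1, 4):.6f}")
print(f"excluding maintenance: {availability(month, 1, 4, exclude_planned=True):.6f}")
```

Which of the two numbers is "the 9s" is a definition to agree on up front, not something to discover after an outage.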