Download presentation
Presentation is loading. Please wait.
Published byChristopher Bradford Modified over 9 years ago
1
Meeting the Availability Challenge Don Vilen Program Manager SQL Server Microsoft Corporation
2
Causes of Data Unavailability What part of your database system causes the most data unavailability? Hardware? SQL Server and Windows? Application software? Your people?
3
People, Planning, and Procedures In several studies the major cause of data unavailability is people Operators and DBAs Users SQL Server 2005 has many enhancements to improve availability, but not all problems can be solved by DBMS technology Training your people, planning your operations, and documenting your procedures can have the greatest impact on availability
4
Availability Enhancements Database Snapshots Database Mirroring Partial Database Availability Table and Index Partitioning Row-level Versioning Snapshot Isolation Read Committed Snapshot Isolation Failover Clustering enhancements Fast Database Recovery Peer-to-Peer Replication Scaleable Shared Database Checksum on Data Pages Instant File Initialization Backup/Restore Checksum on Backups Backup Media Mirroring Restore Read-Only Filegroups without Log Page-level Restore Piecemeal Restore Data Backups don’t block Log Backups Dedicated Admin Connection (DAC) Dynamic Configuration CPU Affinity, AWE memory Attach/Detach
5
Areas of Availability Operator, DBA, and User Error High Availability and the SQL Server process Availability inside the SQL Server process
6
Operator, DBA, User Error People are the cause of most unavailability Protect – Permissions: Restrict as necessary Prevent / Avoid – Triggers on DDL and DML can rollback an operation even if they do have permission Detect – Triggers and Events can be used to detect and record who, what, when Delay – Log Shipping can delay the error from reaching the standby server Recover – Database Snapshot can save a copy of a database at a point in time Can RESTORE from the Snapshot to recover
7
Database Snapshot Read-only Snapshot of an entire database at a point in time Created instantly; no data is copied Snapshot must be created before the error Base database continues to change Database Snapshot does not restrict the base database Multiple Snapshots are allowed Database Snapshots can exist forever Constrained by resources
8
Database Snapshot Syntax Examples To create a database snapshot CREATE DATABASE mydbSnap0600 ON ( ) AS SNAPSHOT OF mydb To drop a database snapshot DROP DATABASE mydbSnap0600 To restore a database to a snapshot RESTORE DATABASE mydb FROM DATABASE_SNAPSHOT = 'mydbsnap0600'
9
Database Snapshot How it really works mydbSnap – Read-Only Database Snapshot USE mydbSnap SELECT (pages 4, 6, 9, 10, 14) 1 Page 2345678910111213141516 CREATE DATABASE mydbSnap AS SNAPSHOT OF mydb mydb – Database USE mydb UPDATE (pages 4, 9, 10) 4 910 12345678910111213141516
10
Areas of Availability Operator, DBA, and User Error High Availability and the SQL Server process Availability inside the SQL Server process
11
High Availability and the SQL Server Process Instance Maintenance Instance and Database Startup Ensuring High Availability for the SQL Server Process
12
Instance Maintenance Install and Upgrade More reliable Install / Setup process Faster Upgrade Minimal downtime during upgrade System Stored Procedures are in the ‘resource database’ Read-only, invisible/transparent to users Allows for fast, side-by-side upgrades No scripts to run to upgrade system procs master is upgraded during the Upgrade process Other databases are upgraded as part of database startup – once master is complete Upgrade also occurs: When a down-level database is attached When a down-level database is restored
13
Instance Maintenance Configuration More dynamic configuration No instance restart required to change: AWE Memory configuration Hot-add memory CPU Affinity configuration Surface Area Configuration Manager
14
Instance and Database Startup Faster Instance Startup Memory, even AWE memory, is allocated as needed Faster Database Startup Fast Recovery User databases start up in parallel
15
Fast Recovery on Database Restart SQL Server 2000 Database is available after Undo completes SQL Server 2005, Enterprise Edition Database is available when Undo begins UndoRedoAvailable UndoRedoAvailable Time
16
Ensuring High Availability for the SQL Server Process Failover Clustering Database Mirroring Peer-to-Peer Transactional Replication Others Log Shipping Backup/Restore and Attach/Detach 3 rd Party Solutions Geographically-Dipersed Clusters
17
Hot Standby Failover Solutions Failover Clustering and Database Mirroring Both Provide Automatic detection Automatic, fast failover Manual failover Transparent client redirect Zero work loss Database Mirroring Failover Cluster
18
Hot Standby – Automatic failover Built on Microsoft Server Clusters (MSCS) Multiple nodes provide availability, transparent to client Maximum of 2, 4, or 8 nodes depending on OS edition Automatic detection and failover Requires certified hardware; see Windows Catalog: Clustered Supports many scenarios: Multiple Active Instances, N+1, N+I Failover Clustering N+1: N Active, 1 Inactive Instances Inst2 * * Inst1 Multiple Active Instances Inst2 * * Inst1 Inst3 * N+I: N Active, I Inactive Instances Failover Cluster * Inst1
19
Zero work loss, zero impact on throughput Instance Failover – entire instance works as a unit Single copy of instance databases Available since SQL Server 7.0 Setup is performed at install time Clients connect to ‘virtual’ server-name / IP Failover Clustering Hot Standby Instance Failover Cluster * Inst1
20
Failover Clustering SQL Server 2005 More nodes Match operating system limits Two-node Failover Clustering is available in SQL Server 2005 Standard Edition Unattended setup Support for mounted volumes (Mount Points) Full support for Majority Node Set quorum Full-Text Search and Analysis Services are now cluster aware Database Engine, Agent have been since 7.0 Analysis Services now has multiple instances Failover Cluster
21
Database Mirroring New for SQL Server 2005 Hot Standby Database Database Failover Very fast failover Just seconds in most cases Zero data loss Automatic or manual failover Automatic re-sync after failover Automatic, transparent client redirect Database Mirroring
22
Hardware Works with standard computers, storage, and networks No shared storage components, virtually no distance limitations Impact to transaction throughput Zero to minimal, depending on environment / workload / network Database Mirroring
23
Fault Tolerant Virtual Database Principal Clients Witness Mirror
24
Witness and Quorum Sole purpose of the Witness is to provide automatic failover To survive the loss of one server you must have at least three Prevents “split brain” Does a lost connection mean the partner is down or is the network down? To become the Principal, a server must talk to at least one other server The Witness does not direct the Mirror to become Principal It simply answers the question “Who do you see?” Witness
25
Witness Witness is an instance of SQL Server 2005 Perhaps even SQL Server Express Can be witness for multiple sessions Consumes very little resources Not a single point of failure Partners can form quorum on their own Witness
26
Database Mirroring How it works Mirror Principal Witness Log ApplicationSQL Server 2 2 4 5 1 Data Log 3>2>3 Mirror is always redoing – it remains current Commit
27
Safety / Performance There is a trade-off between performance and safety Database Mirroring has two safety levels FULL – commit when logged on Mirror Allows automatic failover No data loss OFF – commit when logged on Principal System does its best to keep up Prevents failover; to make mirror available Must ‘force’ service Or terminate Database Mirroring session
28
Database Mirroring Modes High-Availability Mode Safety Full; Synchronous operation Database is available whenever a quorum exists Automatic failover High-Protection Mode Safety Full; Synchronous operation No witness Principal continues servicing the database even if it loses connection to mirror Manual failover only; no automatic failover A transition mode; should not be in this mode for long High-Performance Mode Safety Off; Asynchronous operation Manual failover only Supports only one form of role switching: forced service (with possible data loss)
29
Transparent Client Redirect No changes to application code Client automatically redirected if session is dropped Client library is aware of Principal and Mirror servers Upon initial connect to Principal, library caches Mirror name When client attempts to reconnect If Principal is available, connects If not, client library automatically redirects connection to Mirror
30
Peer-to-Peer Transactional Replication New in SQL Server 2005 Transactional replication but.. All participants are peers Schema is identical on all sites Publish the updates made on ‘their’ data Subscribe to others to pick up their changes No hierarchy as in ‘normal’ transactional replication A given set of data can be updated at only one site at a time Data ‘ownership’ is purely logical; does not prevent conflicts SQL Server prevents a change from round-tripping Enables load-balancing and high availability Warm / hot standby Small possibility of data loss on failure Peer-to-Peer Transactional Replication
31
Scalable Shared Database New in SQL Server 2005 EE Multiple instances sharing the same database files Volume must be Read-only Database access is Read-only Enables load-balancing and scalability Some availability increase but it depends on shared disk system Data files can be mirrors periodically split off from an updateable database Third-party functionality Read- Only Scalable Shared Database
32
Warm Standby Solutions Replication and Log Shipping Both Provide Multiple copies and Manual failover Replication – since SQL Server 6.0 Primarily used where availability is required in conjunction with scale out of read activity Failover possible; a custom solution Not limited to entire database; Can define subset of source database or tables Copy of database is continuously accessible for read activity Latency between source and copy can be as low as seconds Log Shipping Basic idea: Backup, Copy, Restore Log will always be supported But no more investment in the scripts Database scope Database accessible but read-only Users must exit for next log to be applied Data backups no longer block log backups Log Shipping Replication
33
Both Provide Manual detection and failover Potential for some work loss Whole-database scope Standard servers Limited reporting on standby Duplicate copy of database Client must know where to re-connect Slowest failover – Most downtime Backup / Restore Smaller size – only used pages are copied Log backups allow restore to point in time Longer restore time Detach / Copy / Attach Copies entire files No possibility of rolling forward subsequent logs Backup / Restore Detach / Copy / Attach Cold Standby Solutions Backup / Restore and Detach / Copy / Attach
34
Third-Party Availability Solutions Many solutions by many companies with many different names Most replicate data and log changes by one of these methods Looking into SQL Server’s log files and translating log records into logical operations Intercepting IOs and duplicating them, either to another disk set or simply recording them Replicating the writes at the storage level Local copies / mirrors for snapshot backups, etc. Remote copies / mirrors for disaster recovery and Geographically-Dispersed Clusters
35
Geographically-Dispersed Clusters and Storage-level Replication Solutions from many third party vendors replicate the local I/Os on a remote system Hardware and software methods Synchronous and asynchronous solutions Solutions must meet Core I/O Requirements Many solutions use MSCS to provide automatic failover See Windows Catalog, Geographically-dispersed Clusters Other solutions are primarily to duplicate the data at a remote site, often with an independent SQL Server Similar to Log Shipping or Detach / Attach Geographical Failover Cluster
36
Complementary Technologies Technologies can be Combined Maximize availability for Scale out Offload primary data platform Heavy reporting Mobile/disconnected users Autonomous business units that share data Maximize availability of critical systems Designed for failover Fast, automatic Zero data loss Transactionally current Masks planned and unplanned downtime Replication Failover Solutions
37
Example Fault-Tolerant Publisher and Distributor Subscribers Complementary Technologies Failover Solution + Replication Distributor Publisher
38
Combining HA Technologies Principal Server can be a Failover Cluster Failover to mirror will occur before failover within the cluster So Principal will come back up as the Mirror Mirror can be a Failover Cluster as well PrincipalMirror Failover Cluster
39
Comparing HA Technologies How Do You Compare the Alternatives? There are many dimensions to consider when evaluating an HA / business continuance solution Detection: Automatic or Manual Failover: Automatic or Manual Time to Fail Over Number of Failures it can survive Data Currency / Loss Cost of redundant system(s) Additional hardware Additional management Granularity of Data Safety: Instance, Database, Table, Row Complexity Data Consistency Transparency to Clients Privileges Required to Setup Remote DR Site Impact on Performance There is no “one size fits all” Everyone’s requirements are different
40
Areas of Availability Operator, DBA, and User Error High Availability and the SQL Server process Availability inside the SQL Server process
41
Availability Inside the SQL Server Process Memory and CPUs I/O and Media/Disks Concurrency, Avoiding Deadlocks Diagnosing an Issue Tuning and Fragmentation DBCC Operations
42
Memory and CPUs Memory Dynamic Configuration applies to AWE memory as well as regular memory AWE allocation is dynamic Checkpoint – more well behaved for large memory systems Sniffer – Checksums periodically evaluated in cache CPUs CPU Affinity is Dynamically Configurable NUMA-awareness Scheduling on a CPU is at the batch level rather than the session level
43
I/O and Media/Disks I/O retries, diagnosis/transparency, and better messages SQL Server’s basic I/O requirements RAID and storage is usually reliable, but what happens if there are issues with: Data Files Log Files Full-text Catalogs Backup Files tempdb
44
Microsoft SQL Server Core I/O Requirements I/O- and Storage-Level Replication See whitepaper SQL Server 2000 I/O Basics on TechNet http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIOb asics.mspx http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIOb asics.mspx http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIOb asics.mspx To ensure correct operation the underlying I/O and storage system must meet the Core I/O Requirements Stable Media – once ‘write’ is acknowledged, must be guaranteed Write Ordering –writes must preserve ordering across devices Torn I/O Prevention – writes must be done as a unit; not split Applies to both local storage and any I/O-level or storage-level replication scheme, synchronous or asynchronous Guarantees that what appears at secondary site is an image that actually appeared at some point in time at the primary Asynchronous replication allows potential loss of committed transactions
45
Data Files Database Page Checksums Detect disk I/O errors not reported by the hardware or operating system DBCC Check* now use Database Snapshots for improved scalability Partial Database Availability Database is available if the Primary filegroup is available Fine-Grained Online Repair Online Restore – Database remains online; Only data being restored is offline Piecemeal Restore – Online restore of filegroups by priority Filegroup / File backups allowed on Simple recovery model databases Page-level Restore – Can restore individual pages to repair errors found by page checksum or torn pages Instant File Initialization Skips file zeroing, fast DB create / restore Restore read-only filegroups without applying logs
46
Log Files and Full-text Catalogs Log Files Checksum Full-text Catalogs Backup / Restore includes Full-text data Detach / Attach includes Full-text data
47
Backup Files RESTORE VERIFYONLY now checks everything it can short of writing the data Backup Media Mirroring Extra copies for archival or disaster recovery Backup and Database Page Checksums RESTORE can detect disk I/O errors not reported by the hardware or operating system Can continue past errors – repair later
48
tempdb Scalability Scalability of tempdb has been enhanced Caching of initial pages for temporary tables and table variables Improved allocation page latching protocol Reduced logging overhead for tempdb so there is less I/O bandwidth for tempdb log file More efficient allocation algorithm for pages from ‘mixed’ extents in tempdb
49
tempdb Best Practices We still recommend the following if you see latch contention on tempdb allocation or system catalog pages: Avoid auto grow Pre-allocate space for tempdb files Make as many tempdb files as you have CPUs Account for any affinity mask settings File sizes of equal amounts
50
Increased Concurrency Avoiding Blocking and Deadlocks Database Snapshots Standby / Reporting server with Replication Log Shipping Database Mirroring Scalable Shared Databases – Read-only Row-level versioning and new Snapshot isolation levels
51
Increased Concurrency Row-level Versioning Row-level versioning keeps a copy of each row prior to an update Rather than wait for an update to complete, you can see the data that was committed at the time: Your transaction began: Snapshot Isolation Your statement began: Read-Committed Snapshot Isolation You only see committed data, and it is time-consistent and transactionally consistent But it might not be the most recent, so not appropriate for some OLTP applications If you can work with this snapshot data Readers don’t wait for locks, reducing blocking and deadlocks You have a view of the data at a single point in time Triggers now use row-level versions rather than the log
52
Diagnosing an Issue Diagnosis, Runaway query DMVs Events DDL Triggers Deadlock graphs in XML Dedicated Administration Connection – DAC
53
Tuning and Fragmentation Database Tuning Advisor Understands indexes and partitions Online Index Operations Indexes remain online during reorganize and rebuild Table and Index Partitioning Rows can be placed in filegroups depending on a column value Partial availability, defragmentation can occur at the partition/filegroup level Partitions can be moved among tables for fast load, data movement
54
DBCC Operations Shrink – includes BLOBs Defrag – new ALTER INDEX REORGANIZE | REBUILD Check* uses Database Snapshots for improved scalability, performance In all editions! Progress Reporting for Shrink and Defrag
55
Areas of Availability Summary Operator, DBA, and User Error High Availability and the SQL Server process Availability inside the SQL Server process
56
For More Information SQL Server Books Online Whitepapers Failover Clustering Database Mirroring Peer-to-Peer Replication MSDN and TechNet webcasts
57
© 2006 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.