Published by Merryl Atkins. Modified over 9 years ago.
1
MS CLOUD DB - AZURE SQL DB Fault Tolerance
by Subha Vasudevan and Christina Burnett
2
Windows AZURE Cloud Services
3
AZURE Storage Services
● Blob
● Table
● Queue
● File Storage
4
Azure SQL Database
Database as a Service
● Predictable performance
● Scalability
● Business continuity
● Data protection
● Zero administration
5
Azure DB
6
Fault Tolerance and Failure
Why is it so important?
● Supports concurrency control
● Provides transactional guarantee
● ACID
Why does it fail?
● Inevitable software/hardware failure
● Human errors
7
Fault Tolerant SQL Database
● Redundant computers rather than redundant components.
● Fault tolerance at the highest level of the stack: a fault tolerant DB rather than fault tolerant DB servers.
● Database replication across fault zones.
● Failure detection and failover.
8
Fault Zones/Domains
Each fault zone is a fully independent physical subsystem with its own server racks and network routers.
9
Assigning Storage to a Fault Domain
Proximity vs. Isolation
● Proximity of replicas affects network latency
● Isolation helps ensure availability of replicas in the event of a failure
Selection of replica location
● MDS codes
● (N, K) coding
(Banerjee, Das, Mazumder, Derakhshandeh, & Sen, 2014)
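As a rough illustration of (N, K) coding, the sketch below uses a single XOR parity fragment, which is a (3, 2) MDS code: the data is split into K = 2 fragments, one parity fragment is added, and any 2 of the N = 3 fragments recover the original. The function names `encode` and `decode` are hypothetical, and this is not claimed to be the scheme analyzed by Banerjee et al. or the one used by Azure.

```python
def encode(data):
    """Split data into 2 equal halves and add an XOR parity fragment (N=3, K=2)."""
    if len(data) % 2:
        raise ValueError("this sketch expects an even number of bytes")
    half = len(data) // 2
    a, b = data[:half], data[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]


def decode(fragments):
    """Rebuild the data from any 2 of the 3 fragments; a lost fragment is None."""
    if sum(f is None for f in fragments) > 1:
        raise ValueError("need at least K = 2 fragments")
    a, b, parity = fragments
    if a is None:
        # a = b XOR parity
        a = bytes(x ^ y for x, y in zip(b, parity))
    elif b is None:
        # b = a XOR parity
        b = bytes(x ^ y for x, y in zip(a, parity))
    return a + b
```

With each fragment placed in a different fault zone, this layout survives the loss of N - K = 1 zone at a storage cost of N / K = 1.5 times the data size, compared with 3 times for three full copies.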
10
Database Replication
There are three copies of each DB: one primary and two secondary replicas. The primary database performs the transactions and sends the updates and DDL to the replicas.
11
Database Replication
Each replica is stored in a different fault zone.
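The placement rule on this slide can be sketched as follows. This is a minimal illustration, not Azure's placement logic: the function `place_replicas`, the zone and server names, and the random choice of zone are all assumptions made for the example.

```python
import random


def place_replicas(servers_by_zone, copies=3, seed=None):
    """Pick one server from each of `copies` distinct fault zones.

    Returns a list of (zone, server) pairs. No two pairs share a zone,
    so a single zone failure removes at most one copy.
    """
    zones = [zone for zone, servers in servers_by_zone.items() if servers]
    if len(zones) < copies:
        raise ValueError("not enough fault zones for the requested copies")
    rng = random.Random(seed)  # seed is only here to make the sketch repeatable
    chosen = rng.sample(zones, copies)
    return [(zone, rng.choice(servers_by_zone[zone])) for zone in chosen]
```

A real placement would also weigh proximity against isolation, as the earlier slide notes; this sketch only enforces isolation.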
12
Quorum-Based Commit
● At least two copies required.
● Data must be written to the primary and at least one secondary before it is considered committed.
13
PRIMARY FAILS
When the server containing the primary database fails, one of the secondary replicas is promoted to primary.
Dynamic Quorum
14
SECONDARY FAILS
When a server that contains secondary replicas fails, new replicas are created.
Dynamic Quorum
15
Transactional Consistency
● Updates are persisted in log
● Primary DB streams updates to secondaries
● Secondaries are asked to commit first
● Secondaries return acknowledgement
● Primary commits after quorum
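The five steps on this slide can be sketched as a small in-memory simulation. The `Replica` class, the `commit` function, and the `acks_needed` parameter are hypothetical names for illustration; the real engine streams log records over the network rather than calling methods directly.

```python
class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []        # updates persisted in the log
        self.committed = []  # updates that have been committed

    def apply(self, update):
        """Persist and commit an update; the return value is the acknowledgement."""
        if not self.healthy:
            return False     # an unreachable secondary never acknowledges
        self.log.append(update)
        self.committed.append(update)
        return True


def commit(primary, secondaries, update, acks_needed=1):
    """Commit on the primary only after enough secondaries acknowledge."""
    primary.log.append(update)                         # update persisted in log
    acks = sum(s.apply(update) for s in secondaries)   # stream, commit, acknowledge
    if acks >= acks_needed:                            # quorum reached
        primary.committed.append(update)
        return True
    return False
```

With three copies and `acks_needed=1`, a commit succeeds once the primary and at least one secondary hold the update, which matches the two-copy requirement stated on the quorum slide.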
16
Recovering Transactions
If a secondary fails, on restart it checks with the primary for transactions it may have missed.
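One way to picture the catch-up step is shown below. It assumes each log entry carries an increasing sequence number, which the slides do not state; `missed_transactions` and `catch_up` are hypothetical helper names.

```python
def missed_transactions(primary_log, last_applied):
    """Return primary log entries with a sequence number above last_applied."""
    return [(seq, update) for seq, update in primary_log if seq > last_applied]


def catch_up(secondary_log, primary_log):
    """Append to secondary_log whatever the restarted secondary missed."""
    # Assumption: entries are (sequence_number, update) pairs in increasing order.
    last_applied = secondary_log[-1][0] if secondary_log else 0
    secondary_log.extend(missed_transactions(primary_log, last_applied))
    return secondary_log
```

A secondary that was down only briefly fetches a short tail of the log instead of a full copy of the database.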
17
Failure Detection
● The database is paired with the SQL Engine to detect failures in the neighborhood.
● Distributed failure detection: every node is monitored by several neighbors.
● Efficient, localized and fast.
● Prevents ping storms and avoids delayed failure detection.
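A minimal sketch of neighbor-based monitoring follows. The ring layout, the parameter `k`, and the rule that a node is suspected only when all of its monitors report a missed heartbeat are assumptions for illustration; the slides say only that each node is watched by several neighbors.

```python
def ring_monitors(nodes, k=2):
    """Map each node to the k nodes that follow it on a ring and monitor it."""
    n = len(nodes)
    if k >= n:
        raise ValueError("k must be smaller than the number of nodes")
    return {
        node: [nodes[(i + offset) % n] for offset in range(1, k + 1)]
        for i, node in enumerate(nodes)
    }


def suspected_failed(monitors, missed_heartbeat):
    """Return nodes whose monitors all report a missed heartbeat.

    missed_heartbeat is a set of (monitor, target) pairs.
    """
    return sorted(
        node
        for node, watchers in monitors.items()
        if all((w, node) in missed_heartbeat for w in watchers)
    )
```

With n nodes and k monitors each, a monitoring round costs n * k heartbeats instead of the n * (n - 1) needed when every node pings every other node, which is one way to read the "prevents ping storms" bullet.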
18
Failover
● If the primary node fails unexpectedly, a standby backup node automatically assumes the role of primary.
● Managed by the GPM (Global Partition Manager).
● Distributed fabric maintains a global map.
● GPM maintains the health, state and location of every DB.
● Fabric informs GPM of any node failure.
● GPM reconfigures the assignment of primary and secondary DBs on the failed node.
[Figure: Client, Gateway Processes, and nodes each holding a mix of primary (p) and secondary (s) databases]
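The reconfiguration described above can be sketched with a toy partition map. The class below is hypothetical and borrows only the name of the GPM; the `spare_nodes` argument and the choice to add a replacement secondary after a promotion are assumptions, and fault zones are ignored for brevity.

```python
class GlobalPartitionManager:
    def __init__(self):
        # database name -> {"primary": node, "secondaries": [nodes]}
        self.partition_map = {}

    def register(self, db, primary, secondaries):
        self.partition_map[db] = {"primary": primary, "secondaries": list(secondaries)}

    def on_node_failure(self, failed, spare_nodes):
        """Reconfigure every database that had a replica on the failed node."""
        spares = list(spare_nodes)
        for roles in self.partition_map.values():
            if roles["primary"] == failed:
                # Promote a surviving secondary to primary.
                roles["primary"] = roles["secondaries"].pop(0)
            elif failed in roles["secondaries"]:
                roles["secondaries"].remove(failed)
            else:
                continue  # this database had no copy on the failed node
            # Replace the lost copy on a node that holds no copy of this database.
            used = {roles["primary"], *roles["secondaries"]}
            replacement = next((n for n in spares if n not in used), None)
            if replacement is not None:
                roles["secondaries"].append(replacement)
```

Because a node usually hosts replicas of many databases in different roles, a single failure triggers a promotion for some databases and only a replica rebuild for others.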
19
Fault Tolerance in Application Design
Data Failure
● application specific
● catastrophic consequences
● not addressed by Azure
Computational Failure
● addressed by Azure
● controlled by application
Monitoring and Logging
● diagnosis
● debugging
(Jie Li et al., 2010)
20
References
Fault-tolerance in Windows Azure SQL Database. [Online]. Available: http://azure.microsoft.com/blog/2012/07/30/fault-tolerance-in-windows-azure-sql-database/
Banerjee, S., Das, A., Mazumder, A., Derakhshandeh, Z., & Sen, A. (2014). On the impact of coding parameters on storage requirement of region-based fault tolerant distributed file system design. Paper presented at the Computing, Networking and Communications (ICNC), 2014 International Conference On, 78-82. doi:10.1109/ICCNC.2014.6785309
Jie Li, Humphrey, M., You-Wei Cheah, Youngryel Ryu, Agarwal, D., Jackson, K., & van Ingen, C. (2010). Fault tolerance and scaling in e-science cloud applications: Observations from the continuing development of MODIS Azure. Paper presented at the E-Science (E-Science), 2010 IEEE Sixth International Conference On, 246-253. doi:10.1109/eScience.2010.47
Rajan, D., Canino, A., Izaguirre, J. A., & Thain, D. (2011). Converting a high performance application to an elastic cloud application. Paper presented at the Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference On, 383-390. doi:10.1109/CloudCom.2011.58
21
QUESTIONS?