Ben Vandiver, Hari Balakrishnan, Barbara Liskov, and Sam Madden (CSAIL, MIT). Tolerating Byzantine Faults in Database Systems using Commit Barrier Scheduling. Sponsors: Quanta Computer Inc, NSF
Non-crash faults in Databases Over 50% of reported bugs were non-crash faults Incorrect answers, data or index corruption, etc. Previous focus on fail-stop faults Better model: Byzantine faults [Table: reported bugs by category (DBMS crash, non-crash faults) for DB2 2/03-8/06, Oracle 7/06-11/06, and MySQL 8/06-11/06]
Failure Independence Heterogeneous replicas Different implementations / versions Easiest with non-invasive solution Requires standard interface SQL is moderately standard
Client Interaction Organized into Transactions Query, Query, …, Commit / Rollback Interactive Strong consistency Single-copy serializable
Database Functionality Each Database provides Serializable isolation Strict (rigorous) 2-phase locking Databases don’t execute in issue-order Limited control over execution order [Diagram: statements issued as S1, S2; Replica 1 executes S1, S2; Replica 2 executes S2, S1]
Replica Coordination BFT well known solution 3f+1 replicas Globally order client requests Replicas execute in order Exhibits no concurrency Goal: mechanism to extract concurrency in database context
Architecture [Diagram: Client sends SQL to the Shepherd, which sits in front of replicas DB1, DB2, DB3]
Architecture [Diagram: Client sends SQL to the Shepherd; replicas DB1, DB2, DB3 return results; the Shepherd votes and returns the Result to the Client] Need f+1 matching votes
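The f+1 voting rule on this slide can be sketched as follows. This is a minimal illustration, not HRDB's code: the function name `vote` and the representation of a replica answer as a hashable value (for example a tuple of rows) are assumptions made for the example.

```python
from collections import Counter

def vote(results, f):
    """Return the answer reported by at least f+1 replicas, or None.

    results: one hashable answer per replica that has replied (assumed format).
    f: maximum number of faulty replicas tolerated.
    """
    if not results:
        return None
    # Most frequently reported answer and how many replicas reported it.
    answer, count = Counter(results).most_common(1)[0]
    return answer if count >= f + 1 else None

# Hypothetical answers from three replicas, one of them wrong.
agreed = vote([("alice", 3), ("alice", 3), ("alice", 4)], f=1)
```

With at most f faulty replicas, any answer backed by f+1 matching votes was reported by at least one non-faulty replica, which is why f+1 is the threshold.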
How to extract concurrency? Just issue statements to replicas Likely to get stuck Solution: pre-determine which statements conflict Inspecting SQL is very hard
Commit Barrier Scheduling Primary / Secondary Scheme Run transactions first on the primary Duplicate primary’s ordering on the secondaries Works best when primary is Sufficiently Blocking Required for performance, not correctness
Commit Barrier Scheduling [Diagram: Client sends SQL to the Shepherd; the Shepherd runs it on the Primary DB and receives the Result]
Correct Execution Statement Ordering Rule Execute statements of transaction in order Commit Ordering Rule All replicas commit transactions in the same order Order determined by Shepherd
Execution Trace on Primary [Diagram: timeline of transactions T1 and T2 on the primary, with statements SX, SY, SZ and commits C]
Extracting Conflict Info Don’t Conflict! [Diagram: primary execution trace of T1 and T2, with statements SX, SY, SZ and commits C]
Avoiding Conflicts Might Conflict! Transaction-Ordering Rule: A query from transaction T2 that was executed by the primary after the COMMIT of transaction T1 can be sent to a secondary only after that secondary has processed all queries of T1. [Diagram: primary execution trace of T1 and T2, with statements SX, SY, SZ and commits C]
Commit Barrier Scheduling Maintain barrier for each replica Mark statements and transactions with barriers Issue statements and commits when replica’s barrier reaches appropriate value Simple to implement
Analysis of CBS: Non-faulty primary Full concurrency on the Primary Deadlocks detected and resolved locally Ample concurrency on Secondaries allows many statements to run in parallel Secondaries hardly ever block Latency increase
Early Return [Diagram: the Shepherd returns the Primary's Result to the Client, which then sends its Next SQL Stmt] Pipelined Execution!
Early Return Analysis Cut latency in half Must vote at Commit If a wrong answer was sent, abort the transaction Correctness Condition Clients receive correct answers for all transactions that commit
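The vote-at-commit check can be sketched as below. It is an illustration under stated assumptions: the name `safe_to_commit`, the per-statement data layout, and the premise that all secondary answers have already arrived are hypothetical and not taken from HRDB.

```python
def safe_to_commit(statements, f):
    """Decide at commit time whether every early-returned answer was correct.

    statements: list of (primary_answer, secondary_answers) pairs, one per
                statement of the transaction (assumed layout).
    f: maximum number of faulty replicas tolerated.
    """
    for primary_answer, secondary_answers in statements:
        # The primary's answer already went to the client; it counts as one
        # vote, so f agreeing secondaries give the required f+1 matches.
        agreeing = sum(1 for a in secondary_answers if a == primary_answer)
        if agreeing + 1 < f + 1:
            return False  # a wrong answer was sent: abort the transaction
    return True
```

If the function returns False, the transaction aborts, so the client never sees a committed transaction that contained a wrong answer.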
Masking Faults Faulty Secondary not a problem Voting resolves wrong answers Faulty Primary is a problem Generates invalid schedule Goal: correct execution
Faulty Primary Scenario T1, T2 – Increment A by 1, return A A initially 0, should end up 2 [Table columns: Faulty Primary, Replica R1, Replica R2; cells as extracted: T1: A = 1, T1: waiting, T2: A = 1, T2: waiting, T2: A = 1] f+1 matching votes for both answers!
Other Issues Mechanics Replica Repair Shepherd crashes Heterogeneity & SQL
Implementation Prototype called HRDB Implemented in Java About 3500 semicolon-lines of code JDBC interface to clients and databases Works with MySQL, DB2, Derby, and SQLServer
Performance 17%
Heterogeneous Replication Ran 2f+1=3 replica system, heterogeneous vendors MySQL, DB2, Commercial DB X Sufficiently Blocking holds in practice System runs at slowest of f+1 fastest replicas, or primary
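The last point on this slide can be read as a small formula: a statement finishes once the primary has answered and f+1 replicas in total have answered. A sketch, assuming per-replica response times are known; the function name and the numbers in the usage line are hypothetical.

```python
def effective_latency(primary, secondaries, f):
    """Time at which the primary and f+1 replicas in total have answered.

    primary: response time of the primary.
    secondaries: response times of the secondaries.
    f: maximum number of faulty replicas tolerated.
    """
    all_times = sorted([primary] + list(secondaries))
    # all_times[f] is the slowest of the f+1 fastest replicas.
    return max(primary, all_times[f])

# Hypothetical times (ms) for a 2f+1 = 3 replica system with f = 1.
latency = effective_latency(primary=30, secondaries=[5, 10], f=1)
```

In this reading a slow secondary does not hold the system back, but a slow primary always does.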
Fail-Stop Faults
Bugs and HRDB Successfully masked bugs Heterogeneous vendors & heterogeneous versions Found a new bug in MySQL While running TPC-C Present since October 2001 Patched in recent release Starting to look for bugs actively with HRDB
Conclusion First practical Byzantine Fault Tolerant Database Failure independence by supporting heterogeneous replicas Novel concurrency extraction scheme Tool for finding new bugs in databases
Backup Slides
Snapshot Isolation Allows read-after-write hazards Converts fail-stop to Byzantine faults Need write-sets to implement Scheme called Snapshot Barrier Scheduling
Implement with Barriers [Diagram: primary trace of transactions T1, T2, T3 with statements SW, SJ, SK, SX, SY, SZ and commits C, with barrier values B=0 through B=3] Primary: S – Annotate with current barrier upon completion; C – Increment barrier before issue. Secondary: S – Issue when replica barrier is at least the value of the annotation; C – Increment replica barrier after completion.
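The barrier rules on this slide can be sketched as a pair of small classes. This is a simplified illustration, not HRDB's implementation: the class names and tuple layout are hypothetical, and the condition under which a secondary may issue a commit (replica barrier exactly one below the commit's value) is an assumption, since the slide only states when statements may be issued. The sketch also leaves out the rule that a transaction's statements execute in order.

```python
class PrimaryTracker:
    """Annotates operations with barrier values as the primary executes them."""

    def __init__(self):
        self.barrier = 0

    def statement_completed(self, txn, sql):
        # S: annotate with the current barrier upon completion.
        return ("S", txn, sql, self.barrier)

    def commit_issued(self, txn):
        # C: increment the barrier before issuing the commit.
        self.barrier += 1
        return ("C", txn, None, self.barrier)


class SecondaryScheduler:
    """Decides when an annotated operation may be issued to one secondary."""

    def __init__(self):
        self.barrier = 0

    def can_issue(self, op):
        kind, _txn, _sql, value = op
        if kind == "S":
            # S: issue when the replica barrier is at least the annotation.
            return self.barrier >= value
        # ASSUMPTION: commits are issued strictly in barrier order.
        return self.barrier == value - 1

    def completed(self, op):
        # C: increment the replica barrier after the commit completes.
        if op[0] == "C":
            self.barrier += 1


primary = PrimaryTracker()
sx = primary.statement_completed("T1", "SX")  # runs before T1 commits
c1 = primary.commit_issued("T1")
sy = primary.statement_completed("T2", "SY")  # runs after T1 commits

secondary = SecondaryScheduler()
blocked_before = not secondary.can_issue(sy)  # SY must wait for T1's commit
secondary.completed(c1)
released_after = secondary.can_issue(sy)
```

Statements that completed on the primary under the same barrier value may run concurrently on a secondary, while a statement annotated after a commit waits until that commit has completed there.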
Heterogeneity Issues Non-determinism in answers Result set ordering Non-deterministic functions in queries Database-assigned row IDs Query Rewriting SQL incompatibility Translation Engine SQL hiding – Views and Stored Procedures
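For the result set ordering issue, one common way to compare answers across replicas is to put them in a canonical form first. The slide does not say how HRDB does this; the sketch below is an assumption, with a hypothetical `canonical` helper and an `ordered` flag standing in for "the query has an ORDER BY clause".

```python
def canonical(rows, ordered=False):
    """Return a hashable, order-normalized form of a result set.

    rows: iterable of row sequences as returned by a replica (assumed format).
    ordered: True if the query fixes the row order, so order must be kept.
    """
    as_tuples = [tuple(row) for row in rows]
    if ordered:
        return tuple(as_tuples)
    # Sort by repr so rows with None or mixed types still sort consistently.
    return tuple(sorted(as_tuples, key=repr))

# Two replicas returning the same rows in different orders.
same = canonical([(2, "b"), (1, "a")]) == canonical([(1, "a"), (2, "b")])
```

The canonical tuples are hashable, so they can be counted directly when voting on answers.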
Future Work Replicating the Shepherd Efficient Replica Repair Finding Bugs
Replica Recovery Replicas Fail-stop crashes – Shepherd replays missing transactions Uses transaction log table in database to discover which transactions to replay Byzantine faults – Shepherd repairs faulty state, then replays Efficient repair mechanism under development Shepherd Fail-stop crashes - Maintains a write-ahead log
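Discovering which transactions to replay after a fail-stop crash can be sketched as a comparison between the Shepherd's commit order and the contents of the replica's transaction log table. The function name and the use of plain transaction IDs are assumptions for illustration; the slide does not describe the table's schema.

```python
def transactions_to_replay(shepherd_commit_order, replica_logged_ids):
    """Return the committed transactions a recovering replica is missing.

    shepherd_commit_order: transaction IDs in the Shepherd's commit order.
    replica_logged_ids: IDs found in the replica's transaction log table.
    """
    already_applied = set(replica_logged_ids)
    # Keep the Shepherd's order so replay commits in the same order.
    return [t for t in shepherd_commit_order if t not in already_applied]

# Hypothetical IDs: the replica crashed after applying T1 and T2.
missing = transactions_to_replay(["T1", "T2", "T3", "T4"], ["T1", "T2"])
```

Preserving the Shepherd's order during replay keeps the recovering replica consistent with the rule that all replicas commit transactions in the same order.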
Faulty Primary Wrong answers result in transaction abort Concurrency Faults Can result in secondaries being unable to make progress System is back to “Correct but Slow” solution Same case as when primary is not sufficiently blocking Can be hard to tell if primary is faulty Replace primary by doing a view change