Software Testing Doesn't Scale James Hamilton Microsoft SQL Server.

Software Testing Doesn't Scale James Hamilton JamesRH@microsoft.com Microsoft SQL Server

2 Overview
- The Problem:
  - S/W size & complexity inevitable
  - Short cycles reduce S/W reliability
  - S/W testing is the real issue
  - Testing doesn't scale
    - trading complexity for quality
- Cluster-based solution:
  - The Inktomi lesson
  - Shared-nothing cluster architecture
  - Redundant data & metadata
  - Fault isolation domains

3 S/W Size & Complexity Inevitable
- Successful S/W products grow large
  - # features used by a given user is small
  - But the union of per-user feature sets is huge
- Reality of commodity, high-volume S/W
  - Large feature sets
  - Same trend as consumer electronics
- Example mid-tier & server-side S/W stack:
  - SAP: ~47 mloc
  - DB: ~2 mloc
  - NT: ~50 mloc
- Testing all feature interactions impossible
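A rough sketch of why testing all feature interactions is impossible: the number of feature combinations grows far faster than the feature count. The feature counts below are hypothetical round numbers chosen for illustration, not figures from the slides.

```python
from math import comb


def interaction_counts(n_features: int) -> dict:
    """Count the feature combinations an exhaustive test plan would cover."""
    return {
        "pairs": comb(n_features, 2),      # every two-feature interaction
        "triples": comb(n_features, 3),    # every three-feature interaction
        "all_subsets": 2 ** n_features,    # every possible feature mix
    }


if __name__ == "__main__":
    # Hypothetical feature counts, for illustration only.
    for n in (10, 100, 1000):
        counts = interaction_counts(n)
        digits = len(str(counts["all_subsets"]))
        print(f"{n} features: {counts['pairs']} pairs, "
              f"{counts['triples']} triples, "
              f"all subsets is a {digits}-digit number")
```

Even restricting the plan to pairs gives quadratic growth, and the full set of feature mixes doubles with every feature added.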

4 Short Cycles Reduce S/W Reliability
- Reliable TP systems typically evolve slowly & conservatively
- Modern ERP systems can go through 6+ minor revisions/year
- Many e-commerce sites change even faster
  - Fast revisions a competitive advantage
- Current testing and release methodology:
  - As much testing as dev time
  - Significant additional beta-cycle time
- Unacceptable choice:
  - reliable but slow evolving, or fast changing yet unstable and brittle

5 Testing the Real Issue
- 15 yrs ago test teams were a tiny fraction of the dev group
  - Now test teams are of similar size to dev & growing rapidly
- Current test methodology improving incrementally:
  - Random grammar-driven test case generation
  - Fault injection
  - Code path coverage tools
- Testing remains effective at feature testing
  - Ineffective at finding inter-feature interactions
  - Only a tiny fraction of Heisenbugs found in testing (www.research.microsoft.com/~gray/Talks/ISAT_Gray_FT_Avialiability_talk.ppt)
  - Beta testing because test known to be inadequate
- Test team growth scales exponentially with system complexity
- Test and beta cycles already intolerably long

6 The Inktomi Lesson
- Inktomi web search engine (SIGMOD98)
- Quickly evolving software:
  - Memory leaks, race conditions, etc. considered normal
  - Don't attempt to test & beta until quality high
- System availability of paramount importance
  - Individual node availability unimportant
- Shared-nothing cluster
- Exploit ability to fail individual nodes:
  - Automatic reboots avoid memory leaks
  - Automatic restart of failed nodes
  - Fail fast: fail & restart when redundant checks fail
  - Replace failed hardware weekly (mostly disks)
- Dark machine room
  - No panic midnight calls to admins
- Mask failures rather than futile attempt to avoid
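The fail-fast bullet above can be sketched as follows. This is a minimal illustration, not Inktomi's implementation: the `Node` class, its redundant row counter, and the `supervise` loop are all hypothetical names invented for the example.

```python
class NodeFault(Exception):
    """Raised when a redundant check fails; the node stops instead of repairing."""


class Node:
    def __init__(self, name):
        self.name = name
        self.restarts = 0
        self.rows = []
        self.row_count = 0  # redundant copy of len(self.rows)

    def insert(self, row):
        self.rows.append(row)
        self.row_count += 1
        self.check()

    def check(self):
        # Redundant check: two independently maintained values must agree.
        if self.row_count != len(self.rows):
            raise NodeFault(f"{self.name}: row count mismatch")

    def restart(self):
        # A restart throws away all in-memory state, leaked or corrupt.
        self.rows = []
        self.row_count = 0
        self.restarts += 1


def supervise(node, work):
    """Run each work item; on a fault, restart the node rather than repair it."""
    for item in work:
        try:
            item(node)
        except NodeFault:
            node.restart()


def good(node):
    node.insert("row")


def corrupting(node):
    node.rows.append("stray")  # simulated bug: bypasses the counter
    node.check()


if __name__ == "__main__":
    n = Node("n1")
    supervise(n, [good, corrupting, good])
    print(n.restarts, n.row_count)  # prints "1 1"
```

The supervisor never inspects or patches the damaged state. It only restarts, which is what lets memory leaks and similar faults be treated as normal.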

7 Apply to High Value TP Data?
- Inktomi model:
  - Scales to 100s of nodes
  - S/W evolves quickly
  - Low testing costs and no-beta requirement
- Exploits ability to lose individual node without impacting system availability
  - Ability to temporarily lose some data w/o significantly impacting query quality
- Can't lose data availability in most TP systems
  - Redundant data allows node loss w/o loss of data availability
- Inktomi model with redundant data & metadata a solution to exploding test problem

8 Connection Model/Architecture
[Diagram: Client, Server Node, Server Cloud]
- All data & metadata multiply redundant
- Shared nothing
- Single system image
- Symmetric server nodes
- Any client connects to any server
- All nodes SAN-connected

9 Compilation & Execution Model
[Diagram: Client, Server Cloud, Server Thread; stages: Lex analyze, Parse, Normalize, Optimize, Code generate, Query execute]
- Query execution on many subthreads synchronized by root thread

10 Node Loss/Rejoin
[Diagram: Client, Server Cloud]
- Execution in progress
- Lose node
  - Recompile
  - Re-execute
- Rejoin:
  - Node local recovery
  - Rejoin cluster
  - Recover global data at rejoining node
  - Rejoin cluster
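The node-loss path can be sketched as a retry around compile and execute. This is a toy model under loud assumptions: the `Cluster` class, the dictionary "plan", and the `fail_node` parameter used to simulate a failure are all hypothetical, and `rejoin` only stands in for the node-local and global recovery steps.

```python
class NodeLost(Exception):
    """Signals that a node disappeared while a plan was executing."""


class Cluster:
    def __init__(self, nodes):
        self.live = set(nodes)

    def compile(self, query):
        # A "plan" here is just the query plus the nodes it will run on.
        return {"query": query, "nodes": sorted(self.live)}

    def execute(self, plan, fail_node=None):
        for node in plan["nodes"]:
            if node == fail_node:
                # Simulated failure in the middle of execution.
                self.live.discard(node)
                raise NodeLost(node)
        return f"{plan['query']} ran on {plan['nodes']}"

    def rejoin(self, node):
        # Stands in for node-local recovery plus global data catch-up.
        self.live.add(node)


def run(cluster, query, fail_node=None):
    plan = cluster.compile(query)
    try:
        return cluster.execute(plan, fail_node)
    except NodeLost:
        # Recompile against the surviving nodes, then re-execute.
        plan = cluster.compile(query)
        return cluster.execute(plan)


if __name__ == "__main__":
    c = Cluster(["a", "b", "c"])
    print(run(c, "q1", fail_node="b"))  # re-executes on the survivors
    c.rejoin("b")
    print(run(c, "q2"))
```

Discarding the whole plan and recompiling keeps the failure path close to the normal path, so little recovery-only code is needed.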

11 Redundant Data Update Model
[Diagram: Client, Server Cloud]
- Updates are standard parallel plans
- Optimizer knows all redundant data paths
- Generated plan updates all
- No significant new technology
  - Like materialized view & index updates today
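A minimal sketch of a plan that updates every redundant copy, assuming a catalog that records which nodes hold each table. The `Catalog` class, `plan_update`, and the nested-dictionary storage are hypothetical stand-ins, not SQL Server structures, and the steps run sequentially here although the slide describes parallel plans.

```python
class Catalog:
    """Maps each logical table to every node holding a copy of it."""

    def __init__(self):
        self.copies = {}

    def register(self, table, node):
        self.copies.setdefault(table, []).append(node)


def plan_update(catalog, table, key, value):
    # One plan step per redundant copy, much as an index or
    # materialized-view update adds steps to an ordinary update plan.
    return [(node, table, key, value) for node in catalog.copies[table]]


def execute(plan, storage):
    # Applied one step at a time for simplicity; a real plan would run in parallel.
    for node, table, key, value in plan:
        storage[node].setdefault(table, {})[key] = value


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register("orders", "n1")
    catalog.register("orders", "n2")
    storage = {"n1": {}, "n2": {}, "n3": {}}
    execute(plan_update(catalog, "orders", 1, "shipped"), storage)
    print(storage)
```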

12 Fault Isolation Domains
- Trade single-node perf for redundant data checks:
  - Fairly common…but complex error recovery is even more likely to be wrong than original forward processing code
  - Many of the best redundant checks are compiled out of retail versions when shipped (when needed most)
- Fail fast rather than attempting to repair:
  - Bring down node for mem-based data structure faults
  - Never patch inconsistent data…other copies keep system available
- If anything goes wrong, fire the node and continue:
  - Attempt node restart
  - Auto-reinstall O/S, DB and recreate DB partition
  - Mark node dead for later replacement
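The "fire the node and continue" steps read as an escalation ladder, sketched below. The fault names and the rules for which action clears which fault are invented for illustration. Only the order of the steps (restart, reinstall, mark dead) comes from the slide.

```python
def recover_node(node, actions):
    """Try each recovery action in order; return the node's resulting state."""
    for name, action in actions:
        if action(node):
            return f"recovered by {name}"
    # Nothing worked: leave the node out of the cluster for later replacement.
    return "dead"


# Hypothetical actions: each returns True if it cleared the node's fault.
ACTIONS = [
    ("restart", lambda node: node["fault"] == "leak"),
    ("reinstall", lambda node: node["fault"] in ("leak", "corrupt install")),
]


if __name__ == "__main__":
    for fault in ("leak", "corrupt install", "bad disk"):
        print(fault, "->", recover_node({"fault": fault}, ACTIONS))
```

No step tries to diagnose or patch the fault. Each one discards more state than the last, and the redundant copies elsewhere keep the system available throughout.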

13 Summary
- 100 MLOC of server-side code and growing:
  - Can't fight it & can't test it … quality will continue to decline if we don't do something different
  - Can't afford 2 to 3 year dev cycle
- '60s large-system mentality still prevails:
  - Optimizing precious machine resources is false economy
- Continuing focus on single-system perf dead wrong:
  - Scalability & system perf rather than individual node performance
- Why are we still incrementally attacking an exponential problem?
- Any reasonable alternatives to clusters?
