Automatic Physical Design Tuning: Workload as a Sequence Sanjay Agrawal, Microsoft Research Eric Chu, University of Wisconsin-Madison Vivek Narasayya, Microsoft Research
Automatic Physical Design Tuning. DB applications are becoming more complex and varied, and considerable time is spent on tuning. Goal: reduce the cost of ownership of an RDBMS by automatically recommending a physical design. Supported by DB vendors: Database Engine Tuning Advisor (Microsoft), Design Advisor (IBM), SQL Access Advisor (Oracle). 11/21/2018 SIGMOD 2006
Microsoft Database Engine Tuning Advisor. Applications produce a workload (set of queries, updates), which is the input to the Database Engine Tuning Advisor. The advisor uses the extended query optimizer of Microsoft SQL Server 2005 through its “what-if” interface and outputs a recommendation (set of indexes, materialized views, horizontal partitions).
Workload as a Sequence: Motivation. Data warehousing: query by day, update at night. Set: no index is recommended when update costs outweigh benefits. Sequence: may exploit the benefits of indexes without incurring update costs, by inserting “create” and “drop” of indexes into the workload and exploiting the order of statements. [Figure: timeline of create indexes, queries (day), drop indexes, updates (night).]
Set vs. Sequence. Set-based: the recommendation is robust to changes in the order of statement arrival, but it can miss good recommendations compared to the sequence-based approach. Outputs are different. Set: what indexes to create or drop? Sequence: what indexes to create or drop, and where? [Figure: a sequence of queries and updates with “create indexes” and “drop indexes” inserted.]
Model Workload as a Sequence: Motivation, Problem Definition, Optimal Algorithm, Disjoint Sequences, Greedy-SEQ, Experiments.
Problem Setting. Workload: $S = [S_1, S_2, \ldots, S_N]$, where $S_i \in \{\text{Select}, \text{Insert}, \text{Delete}, \text{Update}\}$. $Cost(S_i, C_i)$: cost of executing $S_i$ with configuration $C_i$. $TC(C_1, C_2)$: transition cost from $C_1$ to $C_2$. Sequence execution cost: $\sum_{k=1}^{N} \big( Cost(S_k, C_k) + TC(C_{k-1}, C_k) \big) + TC(C_N, C_{N+1})$. [Figure: chain of configurations $C_0, C_1, \ldots, C_N, C_{N+1}$, with statement $S_i$ executed under $C_i$.]
Problem Definition. Given: database $D$, workload $W = [S_1, \ldots, S_N]$, initial configuration $C_0$, and storage bound $M$. Find configurations $C_1, C_2, \ldots, C_{N+1}$ that minimize the sequence execution cost $\sum_{k=1}^{N} \big( Cost(S_k, C_k) + TC(C_{k-1}, C_k) \big) + TC(C_N, C_{N+1})$, subject to the storage of $C_i \le M$ for all $i$.
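The objective above can be evaluated directly once the two cost functions are available. The following is a minimal sketch, assuming configurations are modeled as sets of index names; `toy_cost`, `toy_tc`, and all numbers are invented for illustration and do not come from the talk.

```python
from typing import Callable, FrozenSet, Sequence

Config = FrozenSet[str]  # assumption: a configuration is a set of index names


def sequence_execution_cost(
    statements: Sequence[str],
    configs: Sequence[Config],
    cost: Callable[[str, Config], float],
    tc: Callable[[Config, Config], float],
) -> float:
    """Statement plus transition costs; configs is [C_0, C_1, ..., C_N, C_{N+1}]."""
    n = len(statements)
    if len(configs) != n + 2:
        raise ValueError("expected N + 2 configurations: C_0 .. C_{N+1}")
    total = 0.0
    for k in range(1, n + 1):
        total += cost(statements[k - 1], configs[k]) + tc(configs[k - 1], configs[k])
    return total + tc(configs[n], configs[n + 1])


# Hypothetical toy workload: one index "I", two queries "Q", one update "U".
def toy_cost(stmt: str, config: Config) -> float:
    if stmt == "Q":
        return 2.0 if "I" in config else 10.0  # the index speeds up queries
    return 20.0 if "I" in config else 1.0      # and slows down updates


def toy_tc(old: Config, new: Config) -> float:
    return 5.0 * len(new - old) + 1.0 * len(old - new)  # create = 5, drop = 1


none, with_i = frozenset(), frozenset({"I"})
workload = ["Q", "Q", "U"]
static_cost = sequence_execution_cost(workload, [none] * 5, toy_cost, toy_tc)
sequence_cost = sequence_execution_cost(
    workload, [none, with_i, with_i, none, none], toy_cost, toy_tc
)
```

With these toy numbers, creating the index before the queries and dropping it before the update is cheaper than never building it, which is the effect the sequence model is meant to capture.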
Search Space. Given N statements and M indexes. Sequence-based tuning: $2^M$ distinct configurations for each statement, so $2^{M(N+1)}$ possible execution sequences. Set-based tuning: $2^M$ configurations.
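To get a feel for how quickly the two search spaces diverge, the counts above can be computed for small inputs. This is only an arithmetic illustration; the function names are made up here.

```python
def set_search_space(m: int) -> int:
    """Number of configurations over m candidate indexes."""
    return 2 ** m


def sequence_search_space(n: int, m: int) -> int:
    """Number of execution sequences: one of 2^m configurations for each of C_1 .. C_{N+1}."""
    return 2 ** (m * (n + 1))


small_set = set_search_space(3)
small_sequence = sequence_search_space(2, 3)
```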
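Optimal Algorithm for Single-Index Case. DAG for a single index I and N statements: each statement $S_i$ has two nodes, { } and {I}, placed between a SOURCE and a DESTINATION node. Node costs: $Cost(S_i, \{\})$ and $Cost(S_i, \{I\})$. Edge costs: 0, $I_c$ (create I), and $I_d$ (drop I). The cost of the shortest path includes node and edge costs. [Figure: DAG from SOURCE through the { } and {I} nodes of S1, S2, ..., SN to DESTINATION.]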
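Because the single-index DAG has only two nodes per statement, its shortest path can be found with a small dynamic program. The sketch below is one way to write it, assuming at least one statement and per-statement costs given as two lists; the function name, the optional `final_has_index` parameter, and the toy numbers are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def solve_single_index(
    cost_without: List[float],
    cost_with: List[float],
    create_cost: float,
    drop_cost: float,
    initial_has_index: bool = False,
    final_has_index: Optional[bool] = None,  # None: final configuration is free
) -> Tuple[float, List[bool]]:
    """Shortest path over the two-state DAG; returns (cost, has_index per statement)."""

    def tc(a: bool, b: bool) -> float:
        if a == b:
            return 0.0
        return create_cost if b else drop_cost

    def node(i: int, s: bool) -> float:
        return cost_with[i] if s else cost_without[i]

    states = (False, True)
    n = len(cost_without)  # assumed >= 1
    best = {s: tc(initial_has_index, s) + node(0, s) for s in states}
    back = []  # back[i - 1][s] = best predecessor state of s at stage i
    for i in range(1, n):
        new_best, choice = {}, {}
        for s in states:
            prev = min(states, key=lambda p: best[p] + tc(p, s))
            new_best[s] = best[prev] + tc(prev, s) + node(i, s)
            choice[s] = prev
        best = new_best
        back.append(choice)

    ends = states if final_has_index is None else (final_has_index,)

    def tail(s: bool) -> float:
        return min(tc(s, e) for e in ends)  # edge into DESTINATION

    last = min(states, key=lambda s: best[s] + tail(s))
    total = best[last] + tail(last)
    path = [last]
    for choice in reversed(back):
        path.append(choice[path[-1]])
    path.reverse()
    return total, path


# Hypothetical numbers: two queries that benefit from I, then an update that suffers.
plan_cost, plan = solve_single_index([10, 10, 1], [2, 2, 20], create_cost=5, drop_cost=1)
```

Here the plan keeps the index for the two queries and drops it before the update.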
General Case: Multiple Indexes (EXHAUSTIVE). At each stage, enumerate all possible configurations from the set of indexes. The algorithm is linear in the number of nodes and edges of the DAG. However, the number of nodes in the DAG is exponential in the number of indexes: M indexes => $O(N \cdot 2^M)$ nodes and $O(N \cdot 2^M)$ edges. [Figure: DAG from $C_0$ to $C_{N+1}$ with one node per configuration at each of the N stages.]
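A brute-force version of this idea can be sketched as a stage-by-stage dynamic program over every configuration that fits the storage bound. This is an illustrative sketch, not the authors' implementation: it assumes at least one statement, that TC(C, C) = 0 so the final configuration can simply equal the last one, and it compares every pair of configurations in consecutive stages. The function names and toy costs are invented.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Config = FrozenSet[str]


def all_configs(indexes: Sequence[str], size: Dict[str, float], bound: float) -> List[Config]:
    """Every subset of the candidate indexes whose total size fits the storage bound."""
    out = []
    for r in range(len(indexes) + 1):
        for combo in combinations(indexes, r):
            if sum(size[i] for i in combo) <= bound:
                out.append(frozenset(combo))
    return out


def exhaustive(
    statements: Sequence[str],
    indexes: Sequence[str],
    cost: Callable[[str, Config], float],
    tc: Callable[[Config, Config], float],
    size: Dict[str, float],
    bound: float,
    c0: Config = frozenset(),
) -> Tuple[float, List[Config]]:
    configs = all_configs(indexes, size, bound)
    # best[c] = (cheapest cost of executing the statements so far and ending in c, path)
    best = {c: (tc(c0, c) + cost(statements[0], c), [c]) for c in configs}
    for stmt in statements[1:]:
        nxt = {}
        for c in configs:
            p = min(configs, key=lambda q: best[q][0] + tc(q, c))
            nxt[c] = (best[p][0] + tc(p, c) + cost(stmt, c), best[p][1] + [c])
        best = nxt
    return min(best.values(), key=lambda v: v[0])


# Hypothetical toy workload: Q1 benefits from I1, Q2 from I2, U pays for every index.
def toy_cost(stmt: str, c: Config) -> float:
    if stmt == "Q1":
        return 2 if "I1" in c else 10
    if stmt == "Q2":
        return 3 if "I2" in c else 10
    return 1 + 20 * len(c)


def toy_tc(old: Config, new: Config) -> float:
    return 4 * len(new - old) + 1 * len(old - new)  # create = 4, drop = 1


sizes = {"I1": 1, "I2": 1}
tight = exhaustive(["Q1", "Q2", "U"], ["I1", "I2"], toy_cost, toy_tc, sizes, bound=1)
loose = exhaustive(["Q1", "Q2", "U"], ["I1", "I2"], toy_cost, toy_tc, sizes, bound=2)
```

Storing a full path per configuration keeps the sketch short; a back-pointer table would be the more economical choice for long workloads.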
Optimal Solution. Input: sequence, constraints, and candidate set of structures. Solve the sequence using EXHAUSTIVE. Output: recommendation.
Search-Space Pruning. Techniques to reduce the number of nodes. Cost-based pruning: leverages shortest-path solutions of individual indexes; prunes configurations at each stage without loss of optimality. Disjoint sequences: divide-and-conquer approach; splits the input sequence and candidate index set. Greedy-SEQ: guarantees a polynomial number of nodes.
Exploiting Disjoint Sequences. Two sequences X and Y are disjoint if they share neither statements nor indexes. Disjoint sequences are common, e.g. when a server hosts multiple applications that touch different databases. Approach: split the workload into disjoint sequences, solve each sequence independently, and merge to get the final solution. Idea: the DAG for each disjoint sequence has fewer nodes.
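One plausible way to implement the split step is to group statements that are transitively connected through shared candidate indexes. The sketch below uses a small union-find for this; the input format (one set of relevant index names per statement) and the function name are assumptions, and statements with no relevant index are simply skipped here because no configuration change affects them.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def split_disjoint(relevant: List[Set[str]]) -> List[Tuple[List[int], Set[str]]]:
    """Group statement positions into sequences that share no candidate indexes."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Indexes relevant to the same statement must end up in the same group.
    for idxs in relevant:
        ordered = sorted(idxs)
        for other in ordered[1:]:
            parent[find(ordered[0])] = find(other)

    groups = defaultdict(lambda: ([], set()))
    for pos, idxs in enumerate(relevant):
        if not idxs:
            continue  # unaffected by any candidate index
        root = find(next(iter(idxs)))
        groups[root][0].append(pos)
        groups[root][1].update(idxs)
    # Order the sequences by their first statement for a stable result.
    return sorted(groups.values(), key=lambda g: g[0][0])


# Seven statements, each touching one of three indexes (positions are 0-based).
parts = split_disjoint([{"I1"}, {"I2"}, {"I1"}, {"I1"}, {"I2"}, {"I2"}, {"I3"}])
```

Each returned pair holds the statement positions of one disjoint sequence, in workload order, together with its candidate indexes.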
Efficiency Gain with Disjoint Sequences. W with candidate indexes {I1,I2,I3}: 8 nodes at each stage. After splitting into W1 = [S1,S3,S4] with {I1}, W2 = [S2,S5,S6] with {I2}, and W3 = [S7] with {I3}: 2 nodes at each stage for each sequence.
Merge solutions of W1, W2, and W3: No storage violations. W1 = [S1,S3,S4], W2 = [S2,S5,S6], and W3 = [S7] are each solved from SRC to DEST over their own index (create and drop edges such as I1c, I1d, I2c, I2d, I3c). The merged path Pu is optimal when there are no storage violations. [Figure: the three per-sequence paths and the merged path SRC, S1 {I1}, S2 and S3 {I1,I2}, S4 {I2}, S5 and S6 { }, S7 {I3}, DEST.]
Merge in the presence of storage violation. Suppose the storage bound allows only 1 index. Pu is not a valid solution, as it has configurations with storage violations. Merge P1, P2, and P3 to get a valid solution Pu’. Note that the cost of Pu is a lower bound on the cost of any valid solution. [Figure: Pu, which uses {I1,I2}, next to Pu’, in which no configuration holds more than one index.]
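The first half of the merge, building Pu and detecting where it violates the storage bound, can be sketched as follows. This covers only the union and the check, not the repair that produces a valid solution. It assumes each sub-solution keeps its most recent configuration in effect until its next statement (and holds nothing before its first statement); the function names and the sample paths, loosely modeled on the W1, W2, W3 example, are illustrative.

```python
from typing import Dict, FrozenSet, List

Config = FrozenSet[str]


def union_path(n: int, sub_paths: List[Dict[int, Config]]) -> List[Config]:
    """Union, at every workload position, of the configuration each sub-solution holds."""
    merged = []
    for t in range(n):
        current = set()
        for path in sub_paths:
            earlier = [p for p in path if p <= t]
            if earlier:  # assumption: nothing is held before the first statement
                current |= path[max(earlier)]
        merged.append(frozenset(current))
    return merged


def storage_violations(path: List[Config], size: Dict[str, float], bound: float) -> List[int]:
    """Positions whose configuration exceeds the storage bound."""
    return [t for t, c in enumerate(path) if sum(size[i] for i in c) > bound]


e, i1, i2, i3 = frozenset(), frozenset({"I1"}), frozenset({"I2"}), frozenset({"I3"})
p1 = {0: i1, 2: i1, 3: e}  # W1 = [S1, S3, S4] at 0-based positions 0, 2, 3
p2 = {1: i2, 4: e, 5: e}   # W2 = [S2, S5, S6]
p3 = {6: i3}               # W3 = [S7]
pu = union_path(7, [p1, p2, p3])
bad = storage_violations(pu, {"I1": 1, "I2": 1, "I3": 1}, bound=1)
```

With a bound of one index, the positions reported in `bad` are exactly the stages a valid merged solution would have to change.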
Solution with Split and Merge. Input: sequence, constraints, and candidate set of structures. (1) Apply the Split operator to get disjoint sequences. (2) Solve each sequence independently using EXHAUSTIVE or GREEDY-SEQ. (3) Merge the results of the disjoint sequences. Output: recommendation.
Greedy Approach. Goal: explore a polynomial number of good configurations, then run shortest path over the DAG constructed with these configurations, giving a solution close to optimal. Greedy-SEQ: adaptation of an existing greedy technique to the sequence model.
Greedy-SEQ. Steps of Greedy-SEQ:
1. Get the optimal solution for each index. Record configurations.
2. Initialize current best to be the lowest-cost solution seen so far.
3. Improve current best by combining it with other solutions and resetting current best. Record new configurations of current best.
4. Repeat step 3 until no more improvement.
5. Run shortest-path over the configurations collected.
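The steps of Greedy-SEQ can be sketched in a few dozen lines. This is a simplified reading of the slide, not the authors' algorithm: "combining" two solutions is interpreted here as running shortest path over the configurations of both paths plus their stage-wise unions, the same candidate set is allowed at every stage, at least one index is assumed to fit the storage bound, and `fits`, `toy_cost`, and `toy_tc` are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, List, Sequence, Tuple

Config = FrozenSet[str]
Solution = Tuple[float, List[Config]]


def shortest_path(statements: Sequence[str], configs: Iterable[Config],
                  cost: Callable[[str, Config], float],
                  tc: Callable[[Config, Config], float],
                  c0: Config = frozenset()) -> Solution:
    """Cheapest path that uses only the given configurations at every stage."""
    configs = list(configs)
    best = {c: (tc(c0, c) + cost(statements[0], c), [c]) for c in configs}
    for stmt in statements[1:]:
        nxt = {}
        for c in configs:
            p = min(configs, key=lambda q: best[q][0] + tc(q, c))
            nxt[c] = (best[p][0] + tc(p, c) + cost(stmt, c), best[p][1] + [c])
        best = nxt
    return min(best.values(), key=lambda v: v[0])


def greedy_seq(statements: Sequence[str], indexes: Sequence[str],
               cost: Callable[[str, Config], float],
               tc: Callable[[Config, Config], float],
               fits: Callable[[Config], bool]) -> Solution:
    empty: Config = frozenset()
    # Step 1: optimal solution for each index on its own; record configurations.
    singles = [shortest_path(statements, {empty, frozenset([i])}, cost, tc)
               for i in indexes if fits(frozenset([i]))]
    seen = {empty}
    for _, path in singles:
        seen.update(path)
    # Step 2: current best is the cheapest single-index solution.
    best = min(singles, key=lambda s: s[0])
    # Steps 3 and 4: combine current best with other solutions until nothing improves.
    improved = True
    while improved:
        improved = False
        for _, other in singles:
            candidates = set(best[1]) | set(other)
            candidates |= {a | b for a, b in zip(best[1], other) if fits(a | b)}
            trial = shortest_path(statements, candidates, cost, tc)
            if trial[0] < best[0]:
                best, improved = trial, True
                seen.update(trial[1])
    # Step 5: shortest path over every configuration collected along the way.
    return shortest_path(statements, seen, cost, tc)


# Hypothetical toy workload: Q1 benefits from I1, Q2 from I2, U pays for every index.
def toy_cost(stmt: str, c: Config) -> float:
    if stmt == "Q1":
        return 2 if "I1" in c else 10
    if stmt == "Q2":
        return 3 if "I2" in c else 10
    return 1 + 20 * len(c)


def toy_tc(old: Config, new: Config) -> float:
    return 4 * len(new - old) + 1 * len(old - new)


greedy_cost, greedy_path = greedy_seq(["Q1", "Q2", "U"], ["I1", "I2"],
                                      toy_cost, toy_tc, fits=lambda c: len(c) <= 2)
```

The loop terminates because the cost of the current best strictly decreases on every accepted combination.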
Combining Two Single-Index Solutions. [Figure: the single-index solutions for I1 and I2 laid out over statements S0, ..., SK, ..., SL, ..., SN, SN+1; the combined DAG contains the configurations { }, {I1}, {I2}, and {I1,I2}.]
End-to-End Solution. Input: sequence, constraints, and candidate set of structures. (1) Apply the split operator to get disjoint sequences. (2) Apply cost-based pruning on each sequence. (3) Solve each sequence independently using EXHAUSTIVE or GREEDY-SEQ. (4) Merge the results of the disjoint sequences. Output: recommendation.
Sequence vs. Set-based approaches. % improvement relative to the optimal set-based solution. Sequence is better in the presence of updates and/or when the storage bound is low.
Workload            M = 1.2 GB   M = 3 GB
TPCH-22             19%          0%
TPCH-22-I-10-MID    22%          16%
TPCH-22-I-10-END    25%          28%
Greedy-SEQ vs. Exhaustive. Greedy-SEQ is much faster with minimal degradation in quality.
Workload      % reduction in running time                  % reduction in quality
TPCH-3        50%                                          <1%
TPCH-5-M-5    98.4%                                        2.3%
TPCH-22       Exhaustive was terminated after 24 hours     Not available
Effectiveness of Split and Merge. With split and merge (SPMR) vs. without (WO-SPMR).
Workload      % reduction in running time compared to WO-SPMR   % reduction in quality compared to WO-SPMR
TPCH-22       <0.1%                                             0%
WKLD1         89.9%
WKLD1-LOW     71.4%                                             3.0%
Conclusion. The sequence model allows more optimization opportunities than the set model. The problem is modeled as finding the shortest path over a DAG. Heuristics give nearly optimal solutions with much better performance.