Optimizing Big-Data Queries using Program Synthesis

Presentation on theme: "Optimizing Big-Data Queries using Program Synthesis"— Presentation transcript:

1 Optimizing Big-Data Queries using Program Synthesis
Kaushik Rajan, Matthias Schlaipfer, Akash Lal, Malavika Samak (Microsoft Research, India)

2 Why is query optimization important?
Many big-data jobs are SQL: millions of SCOPE jobs a day, billions of compute minutes
~80% of all jobs on Azure's Spark and Hadoop offerings are SQL
Massive clusters deployed, huge cost and resource consumption
Better query optimization can lead to significant savings

3 Big-Data query optimization
Rules rewrite SQL queries into other equivalent SQL queries
Rules have limited applicability, and miss optimization opportunities that involve non-SQL operators
(Diagram: Query script -> rule-driven query optimization -> DAG of Map/Reduce stages)

4 Big-Data query optimization
Rules rewrite SQL queries into other equivalent SQL queries
Rules have limited applicability, and miss optimization opportunities that involve non-SQL operators
This work adds program synthesis: synthesize query-specific non-SQL operators on the fly, and generate query plans with fewer stages
(Diagram: Query script -> rule-driven query optimization + program synthesis -> DAG of Map/Reduce stages)

5 Example
A production query over a StoreLog of requests with schema <id, use, dt, blobBytes, pageBytes, ...>.
Query intent: for each (use, dt) pair, count the number of request ids that access more than 100 MB of blob data, and count the number of request ids that access more than 100 MB of page data.
(Query DAG: StoreLog -> GROUP BY use, dt, id computing sum(blobBytes) as bSum, ..., sum(pageBytes) as pSum -> two branches, WHERE bSum > 100 GROUP BY use, dt count(id) as cnt tagged "blob" and WHERE pSum > 100 GROUP BY use, dt count(id) as cnt tagged "page" -> UNION -> output <use, dt, cnt, type>)
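As a concrete reading of the query intent, here is a minimal sketch (not from the talk) of the same logic in plain Python over an in-memory list of rows; the dictionary-based row format and the assumption that the byte columns are already expressed in MB are illustrative choices.

```python
# Illustrative reference semantics for the production query (hypothetical code).
from collections import defaultdict

def reference_query(rows):
    # rows: iterable of dicts with keys id, use, dt, blobBytes, pageBytes.
    # Stage 1: GROUP BY use, dt, id with sum(blobBytes) as bSum, sum(pageBytes) as pSum.
    sums = defaultdict(lambda: [0, 0])
    for r in rows:
        key = (r["use"], r["dt"], r["id"])
        sums[key][0] += r["blobBytes"]
        sums[key][1] += r["pageBytes"]

    # Stage 2: per (use, dt), count ids with bSum > 100 and ids with pSum > 100.
    blob_cnt, page_cnt = defaultdict(int), defaultdict(int)
    for (use, dt, _id), (bSum, pSum) in sums.items():
        if bSum > 100:
            blob_cnt[(use, dt)] += 1
        if pSum > 100:
            page_cnt[(use, dt)] += 1

    # Stage 3: UNION of the two branches, tagged with a type column.
    out = [(use, dt, cnt, "blob") for (use, dt), cnt in blob_cnt.items()]
    out += [(use, dt, cnt, "page") for (use, dt), cnt in page_cnt.items()]
    return out
```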

6 Example
The rule-driven optimizer compiles the production query into a five-stage execution plan.
(Execution plan diagram: SV1 Extract, 1000 vertices, data read 3 TB; SV2 Agg, 1000 vertices, data read 1 TB; SV3 Agg, 500 vertices; SV4 Extract, 500 vertices; SV5 Extract, 250 vertices, data read 400 GB; shuffle volumes between stages range from roughly 160 GB to 1 TB.)

7 Example
Same execution plan as the previous slide.
Observation: the plan has many stages but is dominated by a single stage, and the shuffles inhibit parallelism and performance.

8 Example
Same execution plan as the previous slide.
Observation: every row in the output is influenced by only a few rows from the input; here the input can be partitioned on (use, dt).

9 Example
Can we replace the production query with a simpler one that partitions the input and applies a custom operator equivalent to the rest of the query?
Partial rewrite (query template): StoreLog -> PARTITION BY use, dt SORT BY ?? -> <use, dt, List<id, blobBytes, pageBytes>> -> ??? -> final output <use, dt, cnt, type>
The sort key (??) and the operator (???) are holes to be filled.

10 Example
Program synthesis fills the holes in the template: it chooses the sort key and synthesizes a user-defined operator (udo) that, applied per (use, dt) partition, is equivalent to the rest of the query.
Partial rewrite (query template): StoreLog -> PARTITION BY use, dt SORT BY ?? -> <use, dt, List<id, blobBytes, pageBytes>> -> udo -> final output <use, dt, cnt, type>

11 Example
Optimized execution plan: SV1 Extract, 1000 vertices, reads 3 TB of StoreLog and performs PARTITION BY use, dt SORT BY id, shuffling about 1 TB; SV5 Extract, 250 vertices, reads about 1 TB and applies the udo to produce the final <use, dt, cnt, type> output.
The optimized plan is 3.5x faster, shuffles 1/3rd the data, and requires only half the cumulative CPU time.

12 Rest of the talk
Program synthesis
Scaling synthesis via program analysis
Evaluation

13 Program Synthesis
Given a specification and a partial program, generate a complete program that satisfies the specification
Uses SAT/SMT solvers to find satisfying assignments
Possible because of advances in program reasoning, verification and SAT solving
Hard to scale synthesis to large programs (it scales only to 10s of lines)
Typical reducers are about this size, so there is hope
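To make the setup concrete, here is a minimal, purely illustrative sketch of synthesis by enumeration (not the system from the talk): a partial program with one hole, a tiny expression grammar, and a specification given as input-output examples. A real synthesizer would hand this search to a SAT/SMT solver rather than brute-force enumeration.

```python
# Toy program synthesis by enumeration (hypothetical example for intuition only).
# Partial program: f(x, y) = ??, where ?? is drawn from a tiny grammar.

GRAMMAR = [
    ("x + y", lambda x, y: x + y),
    ("x - y", lambda x, y: x - y),
    ("x * y", lambda x, y: x * y),
    ("max(x, y)", lambda x, y: max(x, y)),
]

# Specification as input-output examples the completed program must satisfy.
SPEC = [((2, 3), 6), ((4, 5), 20), ((1, 7), 7)]

def synthesize(grammar, spec):
    # Return the first candidate expression consistent with every example.
    for text, fn in grammar:
        if all(fn(*inputs) == expected for inputs, expected in spec):
            return text
    return None

print(synthesize(GRAMMAR, SPEC))  # -> "x * y"
```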

14 Program synthesis for query optimization
(Pipeline diagram: the query goes through query simplification; generic template generation produces a partial program; a specification is built from the query and standard SQL operator implementations; grammar extraction produces the candidate grammar; operator synthesis then completes the partial program.)

15 Program synthesis for query optimization
Same pipeline as the previous slide, now showing the generic template.
Input stage: PARTITION BY <c> SORT BY ?? over the input table.
Partial program (linear complexity: a single pass over the partition with guarded statements, followed by an output pass):
UDO(use, dt, List rows) {
  foreach (row in partition)
    if (pred1) expr1;
    ...
    if (predn) exprn;
  foreach (row) ...
}
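A minimal Python rendering of that template (my own sketch, not code from the talk): the predicates and expressions are left as parameters, i.e. the holes a synthesizer would fill with candidates from the extracted grammar.

```python
# Sketch of the generic UDO template with holes (assumed structure, for illustration).

def make_udo(preds, exprs, emit):
    # Holes filled by synthesis:
    #   preds[i](state, row) -> bool      guard for the i-th statement
    #   exprs[i](state, row) -> None      statement body, updates the state dict
    #   emit(state, use, dt) -> iterable  output pass over the accumulated state
    def udo(use, dt, rows):
        state = {}
        # Single linear pass over the (use, dt) partition; rows arrive pre-sorted
        # on the synthesized sort key.
        for row in rows:
            for pred, expr in zip(preds, exprs):
                if pred(state, row):
                    expr(state, row)
        # Output pass.
        return list(emit(state, use, dt))
    return udo
```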

16 Program synthesis for query optimization
Same pipeline and template as the previous slide.
The sort columns, predicates and expressions are extracted via static analysis of the query.

17 Program synthesis for query optimization
Extract a grammar from the production query (shown alongside the query DAG from the example):
sort columns: id, blobBytes, pageBytes
expr_i: pSum = 0 | pSum += pageBytes | bSum = 0 | bSum += blobBytes | cnt1 = 0 | cnt1++ | cnt2 = 0 | cnt2++ | out(row) | prevRow = row | flag_i = true/false
pred_i: bSum > 100 | pSum > 100 | !pred | pred AND/OR pred | flag | prevRow[c] == row[c]
This grammar is sufficient to emulate the aggregations in the query, and can do more, such as multiple aggregations in the same loop.

18 Program synthesis for query optimization
Same grammar as the previous slide.
The grammar also gives additional power beyond what the query uses, for example it allows some flag variables.
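One way to picture the extracted grammar (purely illustrative; this encoding is not shown in the talk) is as explicit candidate lists over a mutable state dictionary and the current row, which the operator synthesis step then searches:

```python
# Hypothetical encoding of the extracted grammar as candidate lists (illustration only).

SORT_COLUMNS = ["id", "blobBytes", "pageBytes"]

EXPRS = [
    ("bSum = 0",          lambda s, r: s.update(bSum=0)),
    ("bSum += blobBytes", lambda s, r: s.update(bSum=s.get("bSum", 0) + r["blobBytes"])),
    ("pSum = 0",          lambda s, r: s.update(pSum=0)),
    ("pSum += pageBytes", lambda s, r: s.update(pSum=s.get("pSum", 0) + r["pageBytes"])),
    ("cnt1++",            lambda s, r: s.update(cnt1=s.get("cnt1", 0) + 1)),
    ("cnt2++",            lambda s, r: s.update(cnt2=s.get("cnt2", 0) + 1)),
    ("prevRow = row",     lambda s, r: s.update(prevRow=dict(r))),
]

PREDS = [
    ("bSum > 100",             lambda s, r: s.get("bSum", 0) > 100),
    ("pSum > 100",             lambda s, r: s.get("pSum", 0) > 100),
    ("prevRow[id] == row[id]", lambda s, r: s.get("prevRow", {}).get("id") == r["id"]),
    ("true",                   lambda s, r: True),
]
```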

19 PARTITION BY <c> SORT BY ??
The synthesis problem instance for the example:
Input: PARTITION BY <c> SORT BY ??, where the sort key is drawn from id, blobBytes, pageBytes
Partial program:
UDO(use, dt, rows) {
  foreach (row in rows)
    if (pred1) expr1;
    ...
    if (predn) exprn;
  foreach (row) ...
}
Grammar:
expr_i: pSum = 0 | pSum += pageBytes | bSum = 0 | bSum += blobBytes | cnt1 = 0 | cnt1++ | cnt2 = 0 | cnt2++ | out(row) | prevRow = row | flag_i = true/false
pred_i: bSum > 100 | pSum > 100 | !pred | pred AND/OR pred | flag | prevRow[c] == row[c]
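For intuition, here is one plausible completion of that template written out in Python (my own sketch; the transcript does not show the synthesized operator): rows of a (use, dt) partition arrive sorted by id, the operator re-computes bSum and pSum per id in a single pass, and increments the two counters whenever an id group ends above the 100 threshold.

```python
# Hypothetical completed UDO (illustrative, not the talk's actual output).
# rows: the (use, dt) partition, already sorted by id by the PARTITION/SORT stage.

def udo(use, dt, rows):
    cnt_blob = cnt_page = 0   # cnt1 / cnt2 in the grammar
    bSum = pSum = 0
    prev_id = None

    def close_group():
        # End of an id group: apply the bSum > 100 / pSum > 100 predicates.
        nonlocal cnt_blob, cnt_page
        if prev_id is not None:
            if bSum > 100:
                cnt_blob += 1
            if pSum > 100:
                cnt_page += 1

    for row in rows:
        if row["id"] != prev_id:          # the prevRow[c] == row[c] guard
            close_group()
            bSum = pSum = 0
            prev_id = row["id"]
        bSum += row["blobBytes"]
        pSum += row["pageBytes"]
    close_group()

    # Output pass: at most one row per branch of the original UNION.
    out = []
    if cnt_blob > 0:
        out.append((use, dt, cnt_blob, "blob"))
    if cnt_page > 0:
        out.append((use, dt, cnt_page, "page"))
    return out
```

On the same rows, grouping the input by (use, dt), sorting each group by id, and concatenating the udo outputs should reproduce the result of the reference_query sketch from the example slide.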

20 Scaling synthesis
The example query still does not work out of the box; it is one of our more complex examples
There are too many predicates and expressions in the grammar
The operator is still quite large and needs many guarded statements
Reasoning about n-ary (n > 3) operators like joins and unions leads to a blow-up in the size of the specification (the formulae to be solved)
(Pipeline diagram: query -> query simplification -> specification and partial program -> operator synthesis)

21 Query simplification
Synthesize the UDO incrementally, one part of the query at a time
Splitting breaks the query at the UNION into per-branch sub-queries, e.g. the "blob" branch: StoreLog -> GROUP BY use, dt, id with sum(blobBytes), sum(pageBytes) -> WHERE bSum > 100 GROUP BY use, dt count(id) as cnt -> <use, dt, cnt, "blob">
Taint analysis removes columns that do not influence the branch's output
Redundant column analysis removes dt, as it always occurs together with use
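As a rough illustration of why this helps (my own sketch; the exact simplified sub-problems are assumptions based on the analyses above): the blob branch on its own needs only one accumulator, one predicate and one counter, so its synthesis problem is much smaller than that of the full operator.

```python
# Hypothetical UDO for the "blob" branch alone, after query simplification (illustration).
# rows: a partition keyed on use (dt dropped by redundant column analysis), sorted by id;
# pageBytes/pSum dropped by taint analysis since they do not influence this branch.

def udo_blob(use, rows):
    cnt = 0
    bSum = 0
    prev_id = None
    for row in rows:
        if row["id"] != prev_id:
            if prev_id is not None and bSum > 100:
                cnt += 1
            bSum = 0
            prev_id = row["id"]
        bSum += row["blobBytes"]
    if prev_id is not None and bSum > 100:
        cnt += 1
    return [(use, cnt, "blob")] if cnt > 0 else []
```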

22 Query simplification
(Same content as the previous slide.)

23 Query simplification
Reassemble the full operator from the pieces synthesized for each sub-query

24 Rest of the talk
Program synthesis
Scaling synthesis via program analysis
Evaluation: 20 long-running, hourly-repeating SCOPE queries

25 Execution time
Synthesis fails on one query and succeeds on the other 19 within 10 minutes

26 Resource savings (SCOPE)
(Charts: task time and shuffled data)

27 Open challenge
Synthesis relies on bounded model checking, which only guarantees partial soundness and so requires manual verification of the UDO
Our experience has been very positive (no erroneous UDOs); the small-world assumption holds in practice
Applicability: the tool identifies potential problems and suggests a likely-correct fix
It applies readily to repeat queries: optimize them over time by involving users, and amortize the verification cost over repeated runs
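A rough sketch of what bounded checking of a synthesized operator could look like (purely illustrative; the real system reasons symbolically rather than enumerating inputs): compare the operator against the original query on all inputs up to a small size, which is exactly where the small-world assumption enters. The reference_query and udo names refer to the earlier sketches.

```python
# Illustrative bounded equivalence check (hypothetical). Passing it is evidence of
# correctness up to the bound, not a proof, hence "partial soundness".
from itertools import product

def bounded_check(reference_query, udo, max_rows=3):
    # A tiny input domain: one (use, dt) partition, two ids, byte values around
    # the 100 threshold.
    candidates = [
        {"use": "u", "dt": "d", "id": i, "blobBytes": b, "pageBytes": p}
        for i, b, p in product(["a", "b"], [0, 60, 120], [0, 60, 120])
    ]
    for n in range(max_rows + 1):
        for rows in product(candidates, repeat=n):
            rows = sorted(rows, key=lambda r: r["id"])   # partition arrives sorted by id
            if sorted(reference_query(rows)) != sorted(udo("u", "d", rows)):
                return rows                              # counterexample within the bound
    return None                                          # no counterexample up to the bound
```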

28 Summary
New technique for optimizing big-data queries, with significant speedups on the evaluated benchmarks
Demonstrates that program synthesis can be applied to an important and non-standard setting
New analyses for SQL queries to scale synthesis
Several open problems remain: further scaling of synthesis, full verification, and other query optimization problems

29 Thank You

