Materialized View Selection for XQuery Workloads
Asterios Katsifodimos 1, Ioana Manolescu 1 & Vasilis Vassalos 2
1 Inria Saclay & Université Paris-Sud, 2 Athens University of Economics and Business
View selection in XML databases
Problem definition: find a set of materialized views that minimizes the workload evaluation cost without exceeding a space budget.
Contributions
- View selection for multiple-view XQuery rewriting
  - Rich subset of XQuery: tree patterns with multiple return nodes and value joins
- We provide:
  - Candidate view pruning methods
  - View selection algorithms: Utility-Driven Greedy (UDG) and the Reduce-Optimize Algorithm (ROA)
- Extensive experimental evaluation
  - Outperforming and extending state-of-the-art works
Outline
- The View Selection Problem
- View Language & Candidate Views
- View Selection Algorithms
- Related Work & Experimentation
Query and view language: anatomy of a query
Example query: return the ID of every book along with its text and author, if the book's author has a paper in the SIGMOD conference.
- ID: identifier of a node (here, the ID of book)
- cont: subtree of a node (here, of the text element)
- Value join: connects the two tree patterns of the query
[Figure: two tree patterns joined by a value join: book (ID) with text (cont) and author (val); paper with author and conference [= "SIGMOD"].]
Candidate Views
Candidate views: views that can participate in a rewriting of a query.
Property: candidate views are exactly those embeddable in a query.
Example:
- Query: book, author (val), text (cont)
- Candidate views: v1 = author (ID, val); v2 = book (ID), text (ID, cont)
- Rewriting: PROJECT [text cont, author val] over JOIN [v1.author ID > v2.book ID] over SCAN(v1) and SCAN(v2)
Candidate Views: number of candidate views
- For a query of m value joins and k tree patterns: [formula]
- Early pruning is needed
Rules of thumb for pruning:
- Drop all views that can be replaced by others
- Views should not store anything extraneous
Challenge: remove the maximum number of views while preserving low-cost and/or small-size rewriting possibilities.
Candidate Views: pruning techniques
1. Annotate all nodes with ID
   - Maximizes rewriting opportunities
2. Do not store unnecessary data, i.e. useless cont, val or //-axis
   - Avoids expensive rewritings
   - Saves space
Example:
- Query: book, author (val), text (cont)
- Candidate views: v1 = book, author (ID); v2 = author (ID, cont); v3 = book (ID), text (ID, cont)
- Transformed views: v1' = book (ID), author (ID); v2' = author (ID, val); v3' = book (ID), text (ID, cont)
Materializing a set of views: view set benefit
Benefit of materializing a set of views V, for a workload Q over documents D:
  benefit(V, Q) = (cost of evaluating Q over D) - (cost of evaluating Q over V)
- Computing the benefit requires invoking the rewriting algorithm: expensive!
Space occupancy of a view set V: total size (in bytes)
View Selection Algorithms: knapsack-inspired view selection
High similarity with the classic 0-1 knapsack problem:
  Knapsack    View selection
  Weight      View size
  Profit      Benefit (evaluation cost savings)
Typical element of the greedy algorithms for knapsack:
  utility(v, Q) = benefit({v} ∪ V, Q) / size(v)
View Selection Algorithms: Utility-Driven Greedy (UDG) algorithm
U = utility (= benefit/size), S = space occupancy
1. Enumerate candidate views
2. Compute view utilities
3. Order views by utility
4. Select the view of largest utility that fits in the budget
5. Repeat 2-4 until the budget is exhausted
Example (space budget S=12, four candidate views, three successive snapshots):
  Snapshot   View 1      View 2      View 3      View 4
  1          U=10 S=7    U=60 S=5    U=50 S=4    U=8 S=2
  2          U=12 S=7    U=40 S=5    U=64 S=4    U=9 S=2
  3          U=13 S=7    U=10 S=5    U=64 S=4    U=4 S=2
Greedy algorithms for knapsack are not a perfect fit for our problem:
- The utility of a view may change after every round
- It depends on the other views already selected
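The five UDG steps can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the `benefit` callback stands in for the (expensive) rewriting-based benefit computation, and `toy_benefit`, the view names and the overlap penalty are hypothetical. Utilities follow the formula utility(v, Q) = benefit({v} ∪ V, Q) / size(v) and are recomputed in every round, since they depend on the views already selected.

```python
from typing import Callable, Dict, FrozenSet, Set


def utility_driven_greedy(
    sizes: Dict[str, int],
    benefit: Callable[[FrozenSet[str]], float],
    budget: int,
) -> Set[str]:
    """Greedy selection; sizes must be positive, benefit is evaluated on view sets."""
    selected: Set[str] = set()
    remaining = budget
    candidates = set(sizes)
    while True:
        # Only views that still fit in the remaining budget are eligible.
        fitting = [v for v in candidates if sizes[v] <= remaining]
        if not fitting:
            break

        # Utilities are recomputed each round against the current selection.
        def utility(v: str) -> float:
            return benefit(frozenset(selected | {v})) / sizes[v]

        best = max(fitting, key=utility)
        selected.add(best)
        candidates.remove(best)
        remaining -= sizes[best]
    return selected


# Hypothetical toy workload: standalone benefits, with views "b" and "c"
# partially redundant (assumption made up for this example).
STANDALONE = {"a": 70.0, "b": 300.0, "c": 200.0, "d": 16.0}


def toy_benefit(views: FrozenSet[str]) -> float:
    total = sum(STANDALONE[v] for v in views)
    if "b" in views and "c" in views:
        total -= 150.0
    return total


sizes = {"a": 7, "b": 5, "c": 4, "d": 2}
chosen = utility_driven_greedy(sizes, toy_benefit, budget=12)
```

With these toy numbers the first round sees utilities 10, 60, 50 and 8, so "b" is picked first; the later rounds pick "d" and then "c", for a total size of 11 out of 12.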
View Selection Algorithms: state space search (state = candidate view set)
- Initial state: the query workload
- Best state: largest benefit under the space budget
[Figure: graph of states S1 ... S16 connected by transitions such as transform(S1).]
View Selection Algorithms: state transformations (Break, Join, Generalize, Adapt)
Break: break a view into smaller parts
- Reveals common sub-expressions of views
- Can reduce or increase space occupancy
- Increases query evaluation costs
[Figure: example pattern: book, text (cont), author (val); paper, author, conference [= "SIGMOD"].]
View Selection Algorithms: state transformations (Break, Join, Generalize, Adapt)
Join: the opposite of Break, joins two views into one
- Reduces evaluation costs
- Joined views can be smaller in size
[Figure: example pattern: book (ID), text (cont), author (val); paper, author (val, ID), conference [= "SIGMOD"].]
View Selection Algorithms: state transformations (Break, Join, Generalize, Adapt)
Generalize: generalization/relaxation of a view
- Reveals common sub-expressions of views
- Can reduce or increase space occupancy
- Increases query evaluation costs
[Figure: example pattern: book, text (cont), author (val); paper, author, conference (val) [= "SIGMOD"].]
View Selection Algorithms: state transformations (Break, Join, Generalize, Adapt)
Adapt: specialization of views by
1. Conversion of //-axis to /-axis
2. Addition of existential nodes
- Reduces evaluation costs
- "Adapted" views can be smaller in size
[Figure: example pattern: book, text (cont), author; paper, author, conference (val) [= "SIGMOD"].]
Break, Join, Generalize and Adapt:
- Allow generating all states
- Are guaranteed not to generate pruned views
View Selection Algorithms: the Reduce-Optimize algorithm (ROA)
- Huge number of states
- The rewriting algorithm is called after every state transition
- Need for heuristics
Proposal: ROA, a heuristic three-phase algorithm (Reduce, Optimize, Jump)
View Selection Algorithms: the Reduce-Optimize algorithm (ROA)
[Figure: space occupancy and benefit over time, relative to the space budget, across the phase sequence Reduce, Optimize, Jump, Reduce, Optimize, Reduce, ... Legend: solution, best state, revisited state, intermediary state.]
View Selection Algorithms: reducing ROA search time (heuristics)
1. Some transitions may apply several transformations at once
2. Stop the rewriting algorithm early:
   - after k rewritings have been found, or
   - at a timeout
3. Consider only the lowest-cost rewritings
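Heuristics 2 and 3 can be illustrated with a small Python helper. This is a hedged sketch under assumptions, not the ViP2P code: `best_rewriting`, its parameters and the `(name, cost)` tuples are hypothetical, and the rewriting algorithm is modeled as any iterable that yields rewritings one at a time.

```python
import time
from typing import Callable, Iterable, List, Optional, TypeVar

R = TypeVar("R")


def best_rewriting(
    rewritings: Iterable[R],
    cost: Callable[[R], float],
    k: int,
    timeout_s: float,
) -> Optional[R]:
    """Consume rewritings until k are found or the deadline passes,
    then keep only the lowest-cost one (None if nothing was found)."""
    deadline = time.monotonic() + timeout_s
    found: List[R] = []
    for rewriting in rewritings:
        found.append(rewriting)
        # The deadline is only checked between rewritings: a producer that
        # blocks while computing the next rewriting is not interrupted.
        if len(found) >= k or time.monotonic() >= deadline:
            break
    return min(found, key=cost) if found else None


# Hypothetical rewritings as (name, estimated cost) pairs.
candidates = [("r1", 30.0), ("r2", 12.0), ("r3", 20.0), ("r4", 5.0)]
picked = best_rewriting(iter(candidates), cost=lambda r: r[1], k=3, timeout_s=60.0)
```

With k=3 the search stops before reaching "r4", so the cheaper rewriting is missed: this is the trade-off the early-stop heuristic accepts in exchange for shorter search time.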
Related Work
  Algorithm                   Rewriting power
  [Mandhani, Suciu VLDB05]    1-view rewritings
  [Tang et al. DASFAA09]      1-view rewritings
  Utility-Driven Greedy       Multiple-view rewritings
  Reduce-Optimize             Multiple-view rewritings
Experimental Evaluation: settings
Queries:
- Workloads
  - Tree patterns: Q1 (14 queries), Q2 (50), Q3 (100)
  - Tree patterns + joins: Q4 (50), 20% joins
- Query selectivity: ⅓ low, ⅓ medium, ⅓ high
Database: 1GB XMark (10 x 100MB documents)
Space budget: S = size(Q); tested space budgets: S, S/2, S/4, S/6
Algorithms:
- UDG and ROA
- Competitors: [Mandhani & Suciu VLDB05], [Tang et al. DASFAA09]
Implementation: ViP2P, Java
Experimental Evaluation: workload evaluation time of Q1 (14 queries)
[Figure: evaluation time versus docs, and hit ratio. Series: Reduce-Optimize (ROA); Utility-Driven Greedy (UDG); Space/Time Greedy [Tang et al. DASFAA09]; Space Optimal [Tang et al. DASFAA09]; Set-Cover Greedy [Mandhani & Suciu VLDB05].]
Experimental Evaluation: evaluation time & hit ratio for Q3 (100 queries)
[Figure: evaluation time versus docs, and hit ratio. Series: Reduce-Optimize (ROA); Set-Cover Greedy [Mandhani & Suciu VLDB05].]
Experimental Evaluation: ROA evaluation for Q4 (50 queries, 20% value-joined)
[Figure: % of evaluation time vs. documents, and hit ratio.]
Conclusions
- Automatic selection of XQuery views for multiple-view rewritings
- Reduction of candidate views by orders of magnitude
- ROA performs better than related work
  - Scales and finds good solutions relatively fast
  - 80% of the benefits attained in ~2 minutes
  - Maximum benefit attained within 25 minutes
- Algorithms of [Tang et al. DASFAA09] did not scale beyond 14 queries
- Utility-Driven Greedy (UDG) did not scale beyond 50 queries
Thank you. Questions?
BACKUP
Cost of algebraic plans: estimating the evaluation cost of a rewriting
Algebraic plan cost:
- The execution cost of an operator has a CPU execution cost and an IO cost; both depend on the input
- The evaluation cost of a plan is calculated bottom-up
Data statistics:
- DataGuide of every document, enriched with:
  - number of instances of a path
  - average path val size (bytes)
  - average path cont size (bytes)
  - distinct values of each path
- Used to estimate the cardinality and size of a view
Cost of algebraic plans: cost estimation example
  View   Size     Cardinality
  v1     500KB    50
  v2     100KB    10
Plan (costs listed bottom-up):
  Operator                           IO     CPU        OUTPUT
  SCAN(v2)                           100    10         10
  SELECT [conference="SIGMOD"]       100    10+10      5
  SCAN(v1)                           500    50         50
  JOIN [v1.author=v2.author]                70+50*5    25 (50*5*0.1)
  PROJECT [text cont, author val]    600               25
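The bottom-up calculation in this example can be sketched in Python. The cost rules below are assumptions inferred from the legible numbers on the slide (scan IO equals the view size in KB, select and join add their input cardinalities or their product to the CPU cost); the CPU charge of PROJECT is not legible in the source and the rule used for it here is a guess. The names `Cost`, `scan`, `select`, `join` and `project` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cost:
    io: float      # accumulated IO cost of the sub-plan
    cpu: float     # accumulated CPU cost of the sub-plan
    output: float  # estimated output cardinality


def scan(size_kb: float, cardinality: float) -> Cost:
    # Assumption: reading a view costs its size in IO and one CPU unit per tuple.
    return Cost(io=size_kb, cpu=cardinality, output=cardinality)


def select(child: Cost, selectivity: float) -> Cost:
    # Assumption: one CPU unit per input tuple on top of the child's cost.
    return Cost(child.io, child.cpu + child.output, child.output * selectivity)


def join(left: Cost, right: Cost, selectivity: float) -> Cost:
    # Assumption: CPU grows with the product of the input cardinalities.
    pairs = left.output * right.output
    return Cost(left.io + right.io, left.cpu + right.cpu + pairs, pairs * selectivity)


def project(child: Cost) -> Cost:
    # Guess: one CPU unit per input tuple (this value is missing on the slide).
    return Cost(child.io, child.cpu + child.output, child.output)


v1 = scan(size_kb=500, cardinality=50)
v2 = scan(size_kb=100, cardinality=10)
selected = select(v2, selectivity=0.5)       # 10 tuples in, 5 out
joined = join(v1, selected, selectivity=0.1)  # CPU = 70 + 50*5, output = 50*5*0.1
plan = project(joined)
```

Under these assumed rules the join step reproduces the slide's CPU=70+50*5 and OUTPUT=25, and the accumulated IO reaches 600 at the top of the plan.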
Experimental Evaluation
[Figure: ROA time to attain increasing benefits (minutes).]
Experimental Evaluation: candidate views pruning
- CS0 max: maximum estimated number of candidate views
- CS0 min: minimum estimated number of candidate views
- CS1: pruned candidate view set
- CS2: pruned candidate view set, only linear path candidates
Candidate Views: size of the set of candidate views for a tree pattern
The cardinality of the set of candidate views of a tree pattern query q of |q| nodes is bounded by: [formula]
Terms of the bound:
- Combinations of nodes of q: ({a}, {b}, {c}, {a,b}, {a,c}, {a,b,c})
- Edge combinations: how to connect nodes with (/, //), e.g. /a/b, //a/b, /a//b, //a//b
- Return node variations: there are 12 for each node in a pattern, e.g. (a ID,cont; a val; a ID,val; ...)
Example: q = /a/b val /c
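The edge-combination term can be illustrated with a short Python sketch. It only enumerates the /-versus-// choices for a linear path of labels, which is a simplification: it does not reproduce the full bound, the node combinations or the return node variations. The function name `axis_variants` is hypothetical.

```python
from itertools import product
from typing import List


def axis_variants(labels: List[str]) -> List[str]:
    """All ways to connect the labels of a linear path with / or // axes."""
    variants = []
    for axes in product(["/", "//"], repeat=len(labels)):
        # Pair each label with the axis that leads to it, e.g. "//" + "a".
        variants.append("".join(axis + label for axis, label in zip(axes, labels)))
    return variants


two_nodes = axis_variants(["a", "b"])
three_nodes = axis_variants(["a", "b", "c"])
```

For the two-node path this yields the four variants listed on the slide (/a/b, /a//b, //a/b, //a//b), and the count doubles with every additional node, which is one reason early pruning matters.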
Candidate Views: size of the set of candidate views for a joined pattern
Given a joined pattern q with k tree patterns and m value joins, the candidate view set size of q is bounded by: [formula]
Terms of the bound:
- Value join combinations
- Number of views resulting from all possible cartesian products of the k tree patterns
View Selection Algorithms: benefit of materializing a set of views
The benefit of materializing a view set V is the difference between the cost of evaluating the workload from the documents and the cost of evaluating it over V: [formula]
Terms of the formula:
- Cost of evaluating query q given the set of materialized views V
- Cost of evaluating query q from the documents
- Frequency of query q
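The formula itself is not legible in this copy of the slide. One plausible reconstruction from the three labeled terms, consistent with the earlier definition benefit(V, Q) = (cost of evaluating Q over D) - (cost of evaluating Q over V), is a frequency-weighted sum over the workload. The symbols freq, cost and D are assumed notation, not necessarily the authors':

```latex
\[
  \mathit{benefit}(V, Q) \;=\; \sum_{q \in Q} \mathit{freq}(q)\,
  \bigl( \mathit{cost}(q, D) - \mathit{cost}(q, V) \bigr)
\]
% freq(q):    frequency of query q in the workload Q
% cost(q, D): cost of evaluating q from the documents D
% cost(q, V): cost of evaluating q given the materialized views V
```

Under this reading, a view set that leaves a frequent query's cost unchanged contributes nothing for that query, while savings on frequent queries weigh proportionally more.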
Tree pattern query of |q| nodes
Joined pattern query with m value joins & k tree patterns