Slide 1: Adaptively Processing Remote Data and Learning Source Mappings
Zachary G. Ives, University of Pennsylvania
CIS 650 – Database & Information Systems
March 14, 2005
LSD slides courtesy of AnHai Doan
Slide 2: Administrivia
- Midterm due 3/16
- 5-10 pages (single-spaced, 10-12 pt)
- If you haven't told me which topic, please do so now!
Slide 3: Today's Trivia Question
Slide 4: Many Motivations for Adaptive Query Processing
- Many domains where cost-based query optimization fails:
  - Complex queries in traditional databases: estimation error grows exponentially with the number of joins [IC91]; the focus of [KD98], [M+04]
  - Querying over the Internet: unpredictable access rates, delays
  - Querying external data sources: limited information available about the properties of the source
- Monitor real-world conditions, adapt the processing strategy in response
Slide 5: Generalizing Adaptive Query Processing
- We've seen a range of different adaptive techniques
- How do they fit together? Can we choose points between eddies and mid-query re-optimization?
- Can we exploit other kinds of query optimization "tricks"?
Slide 6: Popular Types of Adaptive Query Processing
- Adaptive scheduling (query scrambling [UF98]; dynamic rescheduling [UF01]; pipelined hash join [UF00][RS86][I+99][HH99]): changes CPU scheduling to improve feedback or reduce delays, but can't reduce total work
- Redundant computation (competitive execution [AZ96]): compare two or more ways of executing the query; need to identify a few promising plans
- Plan partitioning (INGRES [S+76]; mid-query re-optimization [KD98][I+99][M+04]): break the plan into stages and re-optimize future stages as necessary; coarse granularity, breaks pipelining
- Are these the only options?
Slide 7: Two More Forms of Adaptivity
- Adaptive data partitioning ([AH00][R+03][DH04][I+04]):
  - Break the data into subsets; use a different plan for each subset
  - Generalizes intra-plan reordering in an SPJGU query
  - The only way to reduce overall computation with fine granularity
  - The only previous implementation has been eddies [AH00][R+03][DH04]
- Adaptive information passing:
  - Extends sideways information passing ("magic sets") to an adaptive context, both intra- and inter-plan
  - Reduces the computation and space devoted to non-productive tuples
Slide 8: Eddies Combine Adaptive Scheduling and Data Partitioning Decisions
- Intuitively, each tuple gets its own query plan: route it to the next operator based on the speed and selectivity of each operator (a routing sketch follows below)
- Elegant and simple to implement
- But performing a join creates subresults at the next level! Local, greedy choices may result in state that needs to join with all future data!
- Consider the long-term effects of decisions before making them – separate CPU scheduling from plan selection
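To make the per-tuple routing concrete, here is a minimal Python sketch of an eddy-style router restricted to selection operators. The greedy lowest-pass-rate policy, the operator names, and the data are assumptions for illustration only; the original eddies work uses lottery scheduling, and this is not the Telegraph implementation.

class Operator:
    """A selection operator that tracks its own pass rate for routing."""
    def __init__(self, name, predicate):
        self.name, self.predicate = name, predicate
        self.seen, self.passed = 1, 1          # smoothed running counts

    def pass_rate(self):
        return self.passed / self.seen

    def process(self, tup):
        self.seen += 1
        ok = self.predicate(tup)
        self.passed += ok
        return ok

def eddy(tuples, operators):
    """Route each tuple through every operator, greedily visiting the
    operator with the lowest observed pass rate first, so unpromising
    tuples are dropped as early as possible."""
    for tup in tuples:
        remaining, alive = list(operators), True
        while alive and remaining:
            op = min(remaining, key=Operator.pass_rate)
            remaining.remove(op)
            alive = op.process(tup)
        if alive:
            yield tup

ops = [Operator("even", lambda t: t % 2 == 0),
       Operator("rare", lambda t: t % 97 == 0)]
print(list(eddy(range(10_000), ops)))    # tuples passing both predicates

Because the statistics update as tuples flow, the router converges on running the more selective predicate first without ever consulting an optimizer.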
Slide 9: Focusing Purely on Adaptive Data Partitioning
- Use adaptively scheduled operators to "fill CPU cycles"
- Now a query optimizer problem: choose a plan that minimizes long-term cost (in CPU cycles)
- To allow multiple plans, distribute union through join (and select, project, etc.). If R1 = R1¹ ∪ R1² and R2 = R2¹ ∪ R2², then:
  R1 ⋈ R2 = (R1¹ ∪ R1²) ⋈ (R2¹ ∪ R2²)
          = (R1¹ ⋈ R2¹) ∪ (R1² ⋈ R2²) ∪ (R1¹ ⋈ R2²) ∪ (R1² ⋈ R2¹)
- [Figure: R1 and R2 each split into two partitions]
- This generalizes to n joins and to other SPJ + GU operators… (checked in the sketch below)
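A quick runnable check of the union-through-join identity on toy data; the relations and the way they are partitioned are invented for illustration.

def hash_join(r, s):
    """Join tuples (k, x) from r with (k, y) from s on the key k."""
    index = {}
    for k, x in r:
        index.setdefault(k, []).append(x)
    return {(k, x, y) for k, y in s for x in index.get(k, [])}

R = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
S = [(1, "p"), (2, "q"), (4, "r")]

R1, R2 = R[:2], R[2:]          # arbitrary partitions of R
S1, S2 = S[:1], S[1:]          # arbitrary partitions of S

whole = hash_join(R, S)
pieces = (hash_join(R1, S1) | hash_join(R1, S2) |
          hash_join(R2, S1) | hash_join(R2, S2))
assert whole == pieces          # the four partial joins cover the full join
print(sorted(whole))

Since the identity holds for any partitioning, the system is free to route each incoming subset of the data through whichever plan currently looks cheapest.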
Slide 10: Adaptive Data Partitioning: Routing Data across Different Plans
- Query: R ⋈ S ⋈ T
- [Diagram: source tuples from R, S, and T are routed first into one plan (producing partitions R0, S0, T0) and later into a second plan (R1, S1, T1); each plan excludes combinations the other already covers (e.g., "exclude R0 S0" and "exclude R0 S0 T0, R1 S1 T1"), leaving cross-partition results such as R0 S0 ⋈ T1 and S1 ⋈ T0 to be computed separately]
- Options for combining across phases:
  - New results always injected into the old plan
  - Old results injected into the new plan
  - Wait until the end – a "stitch-up" plan based on the best statistics
Slide 11: Special Architectural Features for ADP
- A monitoring and re-optimization thread runs alongside execution:
  - System-R-like optimizer with aggregation support; uses the most current selectivity estimates
  - Periodic monitoring and re-optimization revises selectivity estimates and recomputes expected costs
- Query execution with "smart router" operators
- Special support for efficient stitch-up plans:
  - Uses intermediate results from previous plans (a special case of answering queries using views [H01])
  - A join-over-union ("stitch-up join") operator that excludes certain results
Slide 12: ADP Application 1: Correcting Cost Mis-estimates
- Goal: react to plans that are obviously bad
  - Don't spend cycles searching for a slightly better plan
  - Try to avoid paths that are unlikely to be promising
- The monitor/reoptimizer thread watches the cardinalities of subresults
  - Re-estimates plan cost and compares it to the projected costs of alternatives, using several techniques and heuristics (see paper)
  - Our experiments: re-estimate every 1 sec.
- The "smart router" operator does the following (sketched below):
  - Waits for the monitor/reoptimizer to suggest a replacement plan
  - Re-routes source data into the new plan
  - The new plan's output is unioned with the output of the previous plan; this is fed into any final aggregation operations
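A drastically simplified, runnable sketch of the monitor-plus-router loop. Instead of switching among full join plans, it re-orders two selection predicates as observed selectivities replace bad initial estimates; the predicates, estimates, and re-optimization interval are all invented for illustration.

preds = {"p": lambda t: t % 2 == 0,        # true selectivity 0.5
         "q": lambda t: t % 100 < 5}       # true selectivity 0.05
est = {"p": 0.01, "q": 0.90}               # deliberately wrong estimates
stats = {"p": [1, 2], "q": [1, 2]}         # [passed, evaluated], smoothed

order = sorted(preds, key=lambda n: est[n])    # initial (bad) plan: p first
survivors = []
for i, tup in enumerate(range(100_000)):
    alive = True
    for name in order:
        stats[name][1] += 1
        alive = preds[name](tup)
        stats[name][0] += alive
        if not alive:
            break                          # later predicates never evaluated
    if alive:
        survivors.append(tup)
    if i % 1000 == 999:                    # periodic re-optimization
        est = {n: stats[n][0] / stats[n][1] for n in preds}
        order = sorted(preds, key=lambda n: est[n])   # may switch plans

print(len(survivors), "tuples passed; final order:", order)

The monitor's revised estimates quickly demote the unselective predicate, mirroring how the real system re-routes the remaining source data into a cheaper plan and unions the two plans' outputs.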
Slide 13: Correcting for Unexpected Selectivities
[Results chart; experimental setup: Pentium IV 3.06 GHz, Windows XP]
Slide 14: ADP Application 2: Optimizing for Order
- The most general ADP approach: pre-generate plans for the general case and for each "interesting order"; a "smart router" sends each tuple to the plan whose ordering constraint that tuple follows; but with multiple joins, MANY plans
- Instead: do ADP at the operator level
- A "complementary join pair": does its own stitch-up internally; easier to optimize for!
- Can also do "partial sorting" at the router (priority queue) – see the sketch below
- [Diagram: routers feed R and S tuples either to a merge join (for data arriving in order) or to a hash join on h(R), h(S)]
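A minimal sketch of the router-side "partial sorting" idea: a bounded priority queue turns a nearly sorted stream into a fully sorted one that the merge-join half of the pair can consume. The window size and data are made up; this is not the paper's implementation.

import heapq
import random

def partial_sort(stream, window=1024):
    """Hold up to `window` tuples in a priority queue, always emitting the
    smallest; if no tuple is displaced by more than `window` positions,
    the output is fully sorted."""
    heap = []
    for tup in stream:
        heapq.heappush(heap, tup)
        if len(heap) > window:
            yield heapq.heappop(heap)
    while heap:
        yield heapq.heappop(heap)

random.seed(0)
nearly_sorted = [i + random.randint(-100, 100) for i in range(10_000)]
out = list(partial_sort(nearly_sorted))
print(out == sorted(nearly_sorted))     # True: displacement < window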
Slide 15: Exploiting Partial Order in the Data (1024-tuple window)
[Results chart; experimental setup: Pentium IV 3.06 GHz, Windows XP]
Slide 16: ADP Over "Windows": Optimizing for Aggregation
- Group-by optimization [CS94]: we may be able to "pre-aggregate" some tuples before joining; why: aggregates can be applied over union
- But once we insert pre-aggregation, we're stuck (and it's not pipelined)
- Our solution: "adjustable window pre-aggregation" (sketched below)
  - Change the window size depending on how effectively we can aggregate
  - Also allows data to propagate through the plan – better information for adaptivity, early answers
- [Diagram: SUM(T.y) GROUP BY T.x over T ⋈ R, versus pre-aggregating T with SUM(T.y) GROUP BY T.x, T.joinAttrib before the join, then taking SUM of the partial sums GROUP BY T.x]
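A runnable sketch of the adjustable-window idea. The growth/shrink policy, thresholds, and data are invented; the system's actual policy may differ. Partial SUMs are buffered per (T.x, T.joinAttrib) group and flushed when the window fills, growing the window when it absorbs many duplicates and shrinking it when it doesn't, so tuples keep flowing downstream.

def pre_aggregate(tuples, window=1024, grow=2.0, shrink=0.5):
    """Buffer partial SUMs of y grouped by (x, join_attr); flush when the
    buffer holds `window` groups, then adjust the window size based on
    how much the last window actually aggregated."""
    groups, absorbed = {}, 0
    for x, join_attr, y in tuples:
        key = (x, join_attr)
        if key in groups:
            absorbed += 1                  # merged into an existing group
        groups[key] = groups.get(key, 0) + y
        if len(groups) >= window:
            ratio = absorbed / (absorbed + len(groups))
            window = max(1, int(window * (grow if ratio > 0.5 else shrink)))
            yield from groups.items()      # emit partial sums downstream
            groups, absorbed = {}, 0
    yield from groups.items()              # final flush

# Each emitted ((x, joinAttrib), partial_sum) joins with R; a final
# SUM of the partial sums GROUP BY x finishes the aggregate.
data = [(i % 10, i % 3, 1) for i in range(100_000)]
print(sum(1 for _ in pre_aggregate(data, window=16)))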
Slide 17: Pre-Aggregation Comparison
Slide 18: The State of the Union (Join and Agg)
- "Useless" intermediate state is perhaps the biggest concern in ADP-based (or even plan-partitioning) approaches
  - It is very easy to create large intermediate state before switching away from a plan
  - Results in significant additional computation – "the burden of history" [DH04]
- Also the major bottleneck in computing queries with correlated subqueries: we only want to compute the parts of a subquery that will contribute to final answers
- Local DB solution: magic sets rewritings [M+90][CR91][MP94][S+96]
Slide 19: Intuition behind Magic Sets Rewritings
- Observations:
  - Computing a subquery once for every iteration of the outer query is repetitive and inefficient
  - Computing the subquery in its entirety is also frequently inefficient
- So "pass in" information about specifically which tuples from the inner query might join with the outer query
- A "filter set" – generally a projection of a portion of the outer query results (see the sketch below)
  - Anything that joins with the parent block must join with the filter set
  - False positives are OK
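In miniature, with invented tables that anticipate the seller example on the next slides: the outer block's qualifying seller IDs become the filter set, and the inner aggregation touches only tuples that can possibly join. A false positive in the filter set would merely waste a little work, never change the answer.

from collections import defaultdict

# Outer block: sellers that are recommended and highly rated (invented data).
qualifying_sellers = {"s1", "s2", "s9"}

# Filter set = projection of the outer block onto the join column.
filter_set = qualifying_sellers

# Inner block: aggregate sales, but only for sellers in the filter set.
sale_items = [("s1", 100), ("s2", 250), ("s3", 80), ("s1", 40)]
totals = defaultdict(int)
for seller_id, price in sale_items:
    if seller_id in filter_set:        # semijoin with the filter set
        totals[seller_id] += price

print(dict(totals))                    # {'s1': 140, 's2': 250}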
Slide 20: Example Query

CREATE VIEW TotalSales(SellerID, Sales, ItemsSold) AS
  SELECT SL.SellerID, sum(salePrice) AS Sales, count(*) AS ItemsSold
  FROM SellerList SL, SaleItem S
  WHERE SL.SellerID = S.SellerID
  GROUP BY SL.SellerID

SELECT SellerID, Sales, ItemsSold
FROM TotalSales TS, Recommended REC, Ratings RAT
WHERE REC.SellerID = TS.SellerID AND RAT.SellerID = TS.SellerID
  AND RAT.Rating > 4 AND ItemsSold > 50
Slide 21: Query with Magic Set [S+96]

CREATE VIEW Magic(SellerID) AS
  SELECT DISTINCT REC.SellerID
  FROM Recommended REC, Ratings RAT
  WHERE REC.SellerID = RAT.SellerID AND RAT.Rating > 4

CREATE VIEW TotalSales(SellerID, Sales, ItemsSold) AS
  SELECT SL.SellerID, sum(salePrice) AS Sales, count(*) AS ItemsSold
  FROM Magic M, SellerList SL, SaleItem S
  WHERE M.SellerID = SL.SellerID AND SL.SellerID = S.SellerID
  GROUP BY SL.SellerID

SELECT SellerID, Sales, ItemsSold
FROM TotalSales TS, Recommended REC, Ratings RAT
WHERE REC.SellerID = TS.SellerID AND RAT.SellerID = TS.SellerID
  AND RAT.Rating > 4 AND ItemsSold > 50
Slide 22: Magic in Data Integration
- In data integration:
  - It is difficult to determine when to do sideways information passing/magic in a cost-based way
  - Magic optimization destroys some potential parallelism – the outer block must be computed first
- Opportunities:
  - Pipelined hash joins give us complete state for every intermediate result
  - We use bushy trees
- Our idea: do information passing out-of-band
  - Consider a plan as if it were a relational calculus expression – every tuple must satisfy the constraints
  - The plan dataflow enforces this…
  - … but we can also pass information across the plan, outside the normal dataflow
- [Diagram: a bushy plan over A, B, and C with information passed laterally between subtrees]
Slide 23: Adaptive Information Passing
- Cost-based strategy:
  1. Execute all blocks in parallel (up to the maximum pipelineable size)
  2. Whenever a subresult is completely computed, feed it elsewhere in the query plan as a filter set
     - Anywhere with a shared predicate is an eligible target
     - Use our ability to estimate the remaining cost of query execution to decide whether the semijoin will improve performance
  3. Can always inject a "more precise" filter set (one that checks more predicates), or remove a filter set
- A filter set is a performance/space optimization, not necessary for correctness
- We use Bloom filters rather than hash tables (our VLDB05 submission has a detailed performance comparison; a minimal Bloom-filter sketch follows below)
- Also compared against a naive strategy that generates filter sets at every operator; when complete, they are used as filters by downstream operators
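A minimal Bloom filter in Python, standing in for the filter-set representation the slide mentions; the sizes, hash construction, and keys are invented, and this is not the system's actual code. A Bloom filter can answer "maybe present" (false positives, harmless here per the previous slides) but never "absent" for a key it has seen, and it is far smaller than a hash table of the keys.

import hashlib

class BloomFilter:
    """Minimal Bloom filter: k bit positions per key, derived from slices
    of a SHA-256 digest."""
    def __init__(self, m=8192, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4*i:4*i+4], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# Once the outer block finishes, publish its join keys as a Bloom filter.
bf = BloomFilter()
for key in ("s%d" % i for i in range(1000)):
    bf.add(key)

# Downstream operators drop tuples whose key cannot be in the filter set.
probe = ["s1", "s42", "x-not-there"]
print([k for k in probe if k in bf])   # usually ['s1', 's42']; a rare
                                       # false positive is possible by design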
Slide 24: Tuples Created – TPC-H, 1 GB
[Results chart: ~67% savings in Q2; also savings in Q5, not shown]
Slide 25: Adaptive QP in Summary
- A variety of different techniques, focusing on:
  - Scheduling
  - Comparison & competition
  - Data + plan partitioning
  - Information passing
- A field that is still fairly open – missing:
  - Effective exploration methods
  - A true theory! What's possible? What kinds of queries make sense to adapt?
  - Guarantees of optimality and convergence (perhaps under certain assumptions)
Slide 26: Switching from Low-Level to High-Level
- We've talked about:
  - Query reformulation (composing queries with mappings)
  - Query optimization + execution
- But how did we ever get the mappings in the first place? This is one of the most tedious tasks
- Answer: LSD (and not the kind that makes you high!)
- Slides courtesy of AnHai Doan, UIUC
Slide 27: Semantic Mappings between Schemas
- Mediated & source schemas = XML DTDs
- [Diagram: a mediated-schema tree (house → location, contact → name, phone) matched to a source tree (house → address, num-baths → full-baths, half-baths, contact-info → agent-name, agent-phone), illustrating a 1-1 mapping such as location ↔ address and a non-1-1 mapping such as num-baths ↔ full-baths + half-baths]
Slide 28: The LSD (Learning Source Descriptions) Approach
Suppose a user wants to integrate 100 data sources:
1. The user manually creates mappings for a few sources, say 3, and shows LSD these mappings
2. LSD learns from the mappings
   - "Multi-strategy" learning incorporates many types of information in a general way
   - Knowledge of constraints helps further
3. LSD proposes mappings for the remaining 97 sources
Slide 29: Example
- [Diagram: realestate.com, with schema (location, listed-price, phone, comments) and data values such as "Miami, FL" / "Boston, MA", "$250,000" / "$110,000", "(305) 729 0831" / "(617) 253 1429", "Fantastic house" / "Great location", mapped to the mediated schema (address, price, agent-phone, description); a second source, homes.com, has values such as "$550,000", "(278) 345 7215", "Beautiful yard"]
- Learned hypotheses:
  - If "fantastic" & "great" occur frequently in data values => description
  - If "phone" occurs in the name => agent-phone
Slide 30: LSD's Multi-Strategy Learning
- Use a set of base learners, each of which exploits certain types of information well
- To match the schema elements of a new source: apply the base learners, then combine their predictions using a meta-learner (see the sketch below)
- The meta-learner uses the training sources to measure each base learner's accuracy, and weighs each learner accordingly
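A sketch of the stacking-style combination step: each base learner returns a {label: confidence} map, and the meta-learner takes a weighted sum using per-learner accuracy weights learned from the training sources. The learner names and the equal weights here are assumptions; the confidences are borrowed from the "Applying the Learners" slide below.

def combine(predictions, weights):
    """Weighted combination of base-learner predictions, renormalized."""
    scores = {}
    for learner, pred in predictions.items():
        for label, conf in pred.items():
            scores[label] = scores.get(label, 0.0) + weights[learner] * conf
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

preds = {"name":  {"address": 0.8, "description": 0.2},
         "bayes": {"address": 0.6, "description": 0.4}}
print(combine(preds, weights={"name": 0.5, "bayes": 0.5}))
# -> {'address': 0.7, 'description': 0.3}, matching the meta-learner's
#    output for the "area" field on the later slide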
Slide 31: Base Learners
- Input:
  - Schema information: name, proximity, structure, …
  - Data information: value, format, …
- Output: predictions weighted by confidence scores
- Examples (a toy Naive Bayes learner is sketched below):
  - Name learner: agent-name => (name, 0.7), (phone, 0.3)
  - Naive Bayes learner: "Kent, WA" => (address, 0.8), (name, 0.2); "Great location" => (description, 0.9), (address, 0.1)
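A toy word-level Naive Bayes base learner: trained on (data value, mediated-schema element) pairs, it scores a new value against each element. The tiny training set is invented; LSD's real learner is trained on far more data and features.

from collections import Counter
import math

class NaiveBayesValueLearner:
    def __init__(self):
        self.word_counts = {}          # label -> Counter of words
        self.label_counts = Counter()  # label -> number of training values
        self.vocab = set()

    def train(self, pairs):
        for value, label in pairs:
            words = value.lower().split()
            self.label_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(words)
            self.vocab.update(words)

    def predict(self, value):
        logp = {}
        for label, count in self.label_counts.items():
            wc = self.word_counts[label]
            n = sum(wc.values())
            score = math.log(count / sum(self.label_counts.values()))
            for w in value.lower().split():
                score += math.log((wc[w] + 1) / (n + len(self.vocab)))  # Laplace
            logp[label] = score
        mx = max(logp.values())                  # normalize to confidences
        exp = {l: math.exp(s - mx) for l, s in logp.items()}
        z = sum(exp.values())
        return {l: round(e / z, 2) for l, e in exp.items()}

nb = NaiveBayesValueLearner()
nb.train([("Miami, FL", "address"), ("Boston, MA", "address"),
          ("Fantastic house", "description"), ("Great location", "description")])
print(nb.predict("Great beach"))   # leans toward description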
Slide 32: Training the Learners
- [Diagram: realestate.com's schema and data, with their mediated-schema mappings, yield two training sets. The Name Learner trains on (source name, mediated element) pairs: (location, address), (listed-price, price), (phone, agent-phone), (comments, description), … The Naive Bayes learner trains on (data value, mediated element) pairs: ("Miami, FL", address), ("$250,000", price), ("(305) 729 0831", agent-phone), ("Fantastic house", description), …]
Slide 33: Applying the Learners
- [Diagram: for homes.com's fields (area, day-phone, extra-info) with values such as "Seattle, WA", "Kent, WA", "Austin, TX", "(278) 345 7215", "Beautiful yard", "Great beach": the Name Learner and Naive Bayes learner each emit predictions – e.g., for area, (address, 0.8), (description, 0.2) and (address, 0.6), (description, 0.4) – and the Meta-Learner combines them into (address, 0.7), (description, 0.3); similarly day-phone → (agent-phone, 0.9), (description, 0.1) and extra-info → (address, 0.6), (description, 0.4)]
Slide 34: Domain Constraints
- Impose semantic regularities on sources; verified using schema or data
- Examples:
  - a = address & b = address => a = b
  - a = house-id => a is a key
  - a = agent-info & b = agent-name => b is nested in a
- Can be specified up front when creating the mediated schema, independently of any actual source schema
Slide 35: The Constraint Handler
- Predictions from the Meta-Learner:
  - area: (address, 0.7), (description, 0.3)
  - contact-phone: (agent-phone, 0.9), (description, 0.1)
  - extra-info: (address, 0.6), (description, 0.4)
- Domain constraint: a = address & b = address => a = b
- Candidate assignments, scored by the product of confidences (search sketched below):
  - area: address, contact-phone: agent-phone, extra-info: description → 0.7 × 0.9 × 0.4 = 0.252 (best assignment satisfying the constraint)
  - area: address, contact-phone: agent-phone, extra-info: address → 0.7 × 0.9 × 0.6 = 0.378 (higher score, but violates the constraint)
  - area: description, contact-phone: description, extra-info: description → 0.3 × 0.1 × 0.4 = 0.012
- Can specify arbitrary constraints; user feedback = a domain constraint (e.g., ad-id = house-id)
- Extended to handle domain heuristics, e.g., a = agent-phone & b = agent-name => a & b are usually close to each other
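Reproducing the slide's search as a sketch: enumerate label assignments, score each by the product of meta-learner confidences, and discard assignments in which two distinct source elements both map to address. The exhaustive enumeration is for illustration; a real handler would search more cleverly.

from itertools import product
from math import prod

predictions = {
    "area":          {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info":    {"address": 0.6, "description": 0.4},
}

def satisfies_constraints(assignment):
    # a = address & b = address => a = b: at most one element maps to address
    return list(assignment.values()).count("address") <= 1

fields = list(predictions)
candidates = (dict(zip(fields, labels))
              for labels in product(*(predictions[f] for f in fields)))
best = max((a for a in candidates if satisfies_constraints(a)),
           key=lambda a: prod(predictions[f][a[f]] for f in fields))
print(best)
# -> {'area': 'address', 'contact-phone': 'agent-phone',
#     'extra-info': 'description'}: the 0.378 assignment is pruned by the
#     constraint, so the 0.7 x 0.9 x 0.4 = 0.252 assignment wins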
Slide 36: Putting It All Together: The LSD System
- [Architecture diagram: in the training phase, the mediated schema, source schemas, and data listings produce training data for base learners L1, L2, …, Lk; in the matching phase, the base learners' predictions flow through mapping combination and the constraint handler, guided by domain constraints and user feedback]
- Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
- Meta-learner:
  - Uses stacking [Ting&Witten99, Wolpert92]
  - Returns a linear weighted combination of the base learners' predictions
Slide 37: Empirical Evaluation
- Four domains: Real Estate I & II, Course Offerings, Faculty Listings
- For each domain:
  - Create the mediated DTD & domain constraints
  - Choose five sources
  - Extract & convert data listings into XML
  - Mediated DTDs: 14-66 elements; source DTDs: 13-48
- Ten runs for each experiment; in each run:
  - Manually provide 1-1 mappings for 3 sources
  - Ask LSD to propose mappings for the remaining 2 sources
  - Accuracy = % of 1-1 mappings correctly identified
Slide 38: LSD Matching Accuracy
- LSD's accuracy: 71-92%
- Best single base learner: 42-72%
- + Meta-learner: +5-22%
- + Constraint handler: +7-13%
- + XML learner: +0.8-6%
- [Chart: average matching accuracy (%)]
Slide 39: LSD Summary
- Applies machine learning to schema matching
  - Use of multi-strategy learning
  - Domain & user-specified constraints
- Probably the most flexible means of doing schema matching in a semi-automated way today
- Complementary project: CLIO (IBM Almaden) uses key and foreign-key constraints to help the user build mappings
Slide 40: Jumping Up a Level
- We've now seen how distributed data makes a huge difference…
- … in heterogeneity and the need for relating different kinds of attributes:
  - Mapping languages
  - Mapping tools
  - Query reformulation
- … and in query processing:
  - Adaptive query processing