Evaluating Probabilistic Queries over Uncertain Matching
IEEE Intl. Conference on Data Engineering 2012
Reynold Cheng, Jian Gong, David Cheung, and Jiefeng Cheng

Background: Hidden Databases
A source DB sits behind a query interface (e.g., a web form). A target query such as Location = Central, Price < 5M, Size > 700 ft is issued through that interface. The DB instances are hidden from users.

Background: Schema Matching
Source schema S: (pname, email-addr, permanent-addr, current-addr)
Target schema T: (name, email, mailing-addr, home-addr, office-addr)
A schema matching (e.g., from COMA++) consists of correspondences, each linking a source attribute to a target attribute. The target query is posed against the target schema.

Background: Schema Mapping
S: (pname, email-addr, permanent-addr, current-addr)
T: (name, email, mailing-addr, home-addr, office-addr)
A mapping is a subset of the matching. It is used to rewrite a target query into a source query.
Many different mappings are possible, so it is better if we know their confidence.

Probabilistic Mappings
A set of h pairs (M_i, Pr(M_i)), where Pr(M_i) is the probability that mapping M_i exists [Gal06, DHY07, CGC10].
The mappings are obtained by bipartite matching on the similarity scores of correspondences.
Querying on these mappings produces answers with confidence.

Basic Query Solution: Example
Target query: SELECT phone FROM Person WHERE addr='aaa'
m1: the source query is SELECT ophone FROM Person WHERE oaddr='aaa', which returns ("123", 0.3) and ("456", 0.3).
m2: the answers are ("123", 0.2) and ("456", 0.2).
m1 and m2 combined: ("123", 0.5) and ("456", 0.5).

Variants of Basic Solutions
Enhanced basic (e-basic): groups identical source queries and evaluates only the distinct ones. Much better than basic.
e-MQO: attempts to improve e-basic by applying multi-query optimization [ZLFL07] to the distinct source queries. Experimentally worse than e-basic, since generating a good multi-query plan for many mappings is expensive.
We use e-basic as the baseline for comparison with our new algorithms.
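The grouping step of e-basic can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `e_basic`, the stub `run_query`, the canned results, and mapping m3 with its `hphone` query are all assumptions made for the example. Only the probabilities 0.3 and 0.2 for m1 and m2 come from the slides.

```python
def e_basic(source_queries, probs, run_query):
    """Evaluate each distinct source query once and sum mapping probabilities.

    source_queries: mapping name -> rewritten source query text
    probs:          mapping name -> Pr(mapping)
    run_query:      callable that executes a source query (hypothetical stub)
    """
    # Group mappings whose rewritten source queries are textually identical.
    weight_of = {}
    for mapping, query in source_queries.items():
        weight_of[query] = weight_of.get(query, 0.0) + probs[mapping]

    # Run every distinct query once; each answer inherits the group weight.
    answers = {}
    for query, weight in weight_of.items():
        for t in set(run_query(query)):
            answers[t] = answers.get(t, 0.0) + weight
    return answers


# Hypothetical canned results standing in for a real source database.
canned = {
    "SELECT ophone FROM Person WHERE oaddr='aaa'": ["123", "456"],
    "SELECT hphone FROM Person WHERE oaddr='aaa'": ["789"],
}
calls = []


def run_query(query):
    calls.append(query)  # record how many queries are actually issued
    return canned[query]


source_queries = {
    "m1": "SELECT ophone FROM Person WHERE oaddr='aaa'",
    "m2": "SELECT ophone FROM Person WHERE oaddr='aaa'",
    "m3": "SELECT hphone FROM Person WHERE oaddr='aaa'",  # hypothetical mapping
}
probs = {"m1": 0.3, "m2": 0.2, "m3": 0.5}
answers = e_basic(source_queries, probs, run_query)
```

Here three mappings lead to two issued queries, because m1 and m2 rewrite to the same source query.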

Correspondence Overlap
Probabilistic mappings can have many common correspondences.
Q-sharing and o-sharing use this overlap to improve query efficiency.

Query-Level Sharing (Q-Sharing)
If the source queries for mappings m1 and m2 are identical, only one query needs to be issued.
Target query: SELECT addr FROM Person WHERE phone='123'
Source query for both m1 and m2: SELECT oaddr FROM Customer WHERE ophone='123'

Q-Sharing: Example
Target query: SELECT pname FROM Person WHERE addr='abc'
Partition the mappings with a partition tree: P1: {m1, m2}, P2: {m3, m4}, P3: {m5}.
Representative mappings: {m1, m3, m5}. Only 3 out of 5 mappings are used for query reformulation.

Problem of Q-Sharing
Given a target query, two mappings may share only some query operators, but not all. Their source queries then differ, and q-sharing does not work.
Target query: SELECT addr FROM Person WHERE phone='123'

O-Sharing
Share the evaluation of a query operator between two mappings that have the same correspondence.
Target query: SELECT addr FROM Person WHERE phone='123'
m2 and m3 share the selection condition:
1. Obtain tuples with ophone = '123' for m2 and m3.
2. For m2, retrieve oaddr; for m3, retrieve haddr.

O-Sharing: Example
(Figure: the target query and the probabilistic mappings.)
An execution unit (e-unit) u1 captures the current status of a target query: 1) query plan, 2) mapping set, 3) next-op.
Execution of e-unit u1: select the next operator (details later). For m1 and m2, addr → oaddr, so m1 and m2 are processed in a batch. For m3, m4, and m5, addr → haddr, so m3-m5 are processed in a batch.
The mapping set of u1 is partitioned, intermediate results are generated, and new e-units u2 and u3 are created. The process goes on until no more e-units are produced.

Operator Selection
Method 1: Randomly select the next operator.
Method 2: SNF (Smallest Number of Partitions First) chooses the target operator that leads to the fewest mapping partitions. In the example, one operator's attribute is mapped to 3 source attributes, i.e., 3 mapping partitions; the other leads to 4 mapping partitions.
Method 3: SEF (Smallest Entropy First) chooses the target operator that leads to the lowest entropy. The example compares addr and phone.
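The two non-random strategies can be sketched as below. The slides do not give the entropy formula, so this sketch assumes Shannon entropy over the fraction of mappings that falls into each partition. The function names, the phone correspondences, and the source attribute `hphone` are hypothetical. The addr correspondences follow the o-sharing example (m1, m2 → oaddr; m3-m5 → haddr).

```python
import math


def partition_by_attr(mappings, target_attr):
    """Group mapping names by the source attribute that target_attr maps to."""
    parts = {}
    for name, corr in mappings.items():
        parts.setdefault(corr.get(target_attr), []).append(name)
    return list(parts.values())


def entropy(parts):
    """Shannon entropy of partition sizes (an assumed definition)."""
    total = sum(len(p) for p in parts)
    return -sum((len(p) / total) * math.log2(len(p) / total) for p in parts)


def choose_snf(mappings, attrs):
    """Smallest Number of Partitions First."""
    return min(attrs, key=lambda a: len(partition_by_attr(mappings, a)))


def choose_sef(mappings, attrs):
    """Smallest Entropy First."""
    return min(attrs, key=lambda a: entropy(partition_by_attr(mappings, a)))


mappings = {
    "m1": {"addr": "oaddr", "phone": "ophone"},
    "m2": {"addr": "oaddr", "phone": "ophone"},
    "m3": {"addr": "haddr", "phone": "ophone"},
    "m4": {"addr": "haddr", "phone": "ophone"},
    "m5": {"addr": "haddr", "phone": "hphone"},  # hypothetical correspondence
}
snf_choice = choose_snf(mappings, ["addr", "phone"])
sef_choice = choose_sef(mappings, ["addr", "phone"])
```

Both attributes yield two partitions here, so SNF cannot distinguish them and keeps the first candidate. SEF prefers phone, whose split (4 versus 1) is more skewed than the split for addr (2 versus 3), so more mappings share the executed operator.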

Advantages of O-Sharing
Interleaves query rewriting and operator execution.
May not have to consider the whole target query for every mapping, because an empty intermediate result ends the evaluation early.
The current o-sharing solution supports selection, projection, join, MIN, MAX, and SUM operators.

Probabilistic Top-k Queries
Query semantics: return the k tuples whose probabilities are the highest, among those with non-zero probabilities.
Our new algorithm can prune non-answer tuples and avoid evaluating the actual probabilities of all answer tuples.
This is done by partially expanding the e-units.

Experimental Setup
Schemas and data are about purchase orders.
Source schema: TPC-H 100MB database with 1M tuples; 46 attributes, 8 relations.
3 target schemas provided by COMA++: Excel, Noris, and Paragon, with 48, 66, and 69 attributes.
Schema matcher: COMA++.
10 target queries: selection, projection, join, COUNT, and SUM.
100 probabilistic mappings.
SEF is used for o-sharing.

Query Performance (figure)

Effect of Query Size (figure)

Operator Selection Strategies
SNF is much better than Random, and SEF further improves on SNF.

Top-k Query Performance
Top-k processing can improve query performance, especially when the query returns a large set of results.

Related Work
Schema matching: uncertainty is not considered in most existing work.
Probabilistic schema mapping [Gal06, DHY07].
Uncertain XML schema matching [CGC10, GCC11]: computing and storing probabilistic XML mappings, and evaluating probabilistic XML queries.

Conclusions
Probabilistic mappings can be used to handle the uncertainty of schema matching.
To efficiently handle by-table semantics, we examine q-sharing and o-sharing. They exploit the correspondences of mappings that share a query or its query operators.
We plan to study the use of o-sharing on other queries (e.g., set difference and recursive queries).

Thank You!
Reynold Cheng (HKU)
URL: http://www.cs.hku.hk/~ckcheng
Email: ckcheng@cs.hku.hk

References
[CGC10] R. Cheng, J. Gong, and D. Cheung, "Managing uncertainty in XML schema matching", in ICDE, 2010.
[GCC11] J. Gong, R. Cheng, and D. Cheung, "Efficient Management of Uncertainty in XML Schema Matching", in VLDBJ, 2011.
[Len02] Lenzerini, "Data integration: a theoretical perspective", in PODS, 2002.
[YP04] Yu et al., "Constraint-based XML query rewriting for data integration", in SIGMOD, 2004.
[DR02] Do et al., "COMA: a system for flexible combination of schema matching approaches", in VLDB, 2002.
[Gal06] Gal, "Managing uncertainty in schema matching with top-k schema mappings", in J. Data Semantics VI, 2006.
[DHY07] Dong et al., "Data integration with uncertainty", in VLDB, 2007.
[QYD07] Qin et al., "TwigList: make twig pattern matching fast", in DASFAA, 2007.
[Murty86] Murty, "An algorithm for ranking all the assignment in increasing order of cost", Operations Research, vol 16, 1986.
[RB01] Rahm et al., "A survey of approaches to automatic schema matching", VLDB J, vol 10, 2001.
[KYS08] Kimelfeld et al., "Query efficiency in probabilistic XML models", in SIGMOD, 2008.
[ZLFL07] J. Zhou, P. Larson, J. Freytag, and W. Lehner, "Efficient exploitation of similar subexpressions for query processing", in SIGMOD, 2007.

Probabilistic Mappings
We assume that the schema matching is represented by h probabilistic mappings.
The probability of each mapping is obtained by using a bipartite matching algorithm on the similarity scores of correspondences [CGC10].

Generating the Top-h Mappings
Use an h-maximum bipartite matching algorithm to find the h mappings with the highest scores; see [CGC10].
Image elements are inserted to model the absence of a correspondence.
We use approach 3.

Probabilistic Mappings
Find the h mappings with the highest scores, using a bipartite matching algorithm [CGC10].
For each M_i, obtain Pr(M_i) by normalizing M_i's score by the sum of the scores of the h mappings.
Example (score / total): 30/100, 20/100, 10/100.
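The normalization step is simple enough to write out. The sketch below assumes the scores of the h mappings are already available; the function name `normalize_scores` and the example scores are hypothetical and are not the values from the slide.

```python
def normalize_scores(scores):
    """Turn mapping scores into probabilities that sum to 1.

    scores: mapping name -> non-negative score of that mapping
    """
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {name: score / total for name, score in scores.items()}


# Hypothetical scores for h = 3 mappings.
mapping_probs = normalize_scores({"m1": 5.0, "m2": 3.0, "m3": 2.0})
```

With these scores, m1 receives half of the total probability mass because its score is half of the total score.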

Target Queries (figure)

Basic Solutions
e-basic is the best among the simple solutions. We thus compare it with q-sharing and o-sharing.

Overlap of Mappings
Overlap is measured as the number of common correspondences divided by the number of distinct correspondences.
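One way to compute such a ratio is sketched below. The slide does not say whether "common" is taken over all mappings or pairwise; this sketch assumes a correspondence is common only if every mapping contains it. The function name `overlap_ratio` and the example mappings are hypothetical.

```python
def overlap_ratio(mappings):
    """Common correspondences divided by distinct correspondences.

    mappings: list of dicts, each target attribute -> source attribute.
    Assumption: "common" means present in every mapping in the list.
    """
    if not mappings:
        return 0.0
    sets = [set(m.items()) for m in mappings]
    common = set.intersection(*sets)
    distinct = set.union(*sets)
    return len(common) / len(distinct) if distinct else 0.0


ratio = overlap_ratio([
    {"addr": "oaddr", "phone": "ophone"},
    {"addr": "haddr", "phone": "ophone"},
])
```

The two mappings agree only on phone → ophone, and together they contain three distinct correspondences, so the ratio is one third.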

Operator Selection Strategies (figure)

Probabilistic Query Evaluation
There are 2 ways to reformulate and evaluate a target query.
By-table semantics: all tuples in the source tables use the same possible mapping.
By-tuple semantics: each tuple in the source tables may use a different possible mapping.
Details in Appendix B.

By-Table Semantics
All tuples in the source tables use the same possible mapping.
The query answers from mapping M_i have probability Pr(M_i).
If duplicate removal is enforced, then a tuple t returned by both M_1 and M_2 has probability Pr(t) = Pr(M_1) + Pr(M_2).

By-Table Semantics: Example
Target query: SELECT mailing-addr FROM T
When m1 is considered, the query answer is: (Sunnyvale, 0.5).
When m2 is considered, the query answer is: (Sunnyvale, 0.4), (Mountain View, 0.4).
When m3 is considered, the query answer is: (alice@, 0.1), (bob@, 0.1).
Final query answer (with duplicates removed): (Sunnyvale, 0.9), (Mountain View, 0.4), (alice@, 0.1), (bob@, 0.1).
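The aggregation in this example can be reproduced with a few lines. The sketch below takes the per-mapping answers and probabilities from the example above as given; the function name `combine_by_table` is hypothetical.

```python
from collections import defaultdict


def combine_by_table(answers_per_mapping, mapping_prob):
    """Sum Pr(M_i) over all mappings M_i that return a given tuple."""
    combined = defaultdict(float)
    for mapping, answers in answers_per_mapping.items():
        # set() enforces duplicate removal within a single mapping's answer.
        for t in set(answers):
            combined[t] += mapping_prob[mapping]
    return dict(combined)


mapping_prob = {"m1": 0.5, "m2": 0.4, "m3": 0.1}
answers_per_mapping = {
    "m1": ["Sunnyvale"],
    "m2": ["Sunnyvale", "Mountain View"],
    "m3": ["alice@", "bob@"],
}
final_answer = combine_by_table(answers_per_mapping, mapping_prob)
```

Sunnyvale is returned under both m1 and m2, so its probability is the sum of the two mapping probabilities.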

Basic Solution
Evaluate the target query for every possible mapping M_i.
The query answers from mapping M_i have probability Pr(M_i).
If duplicate removal is enforced, then a tuple t returned by both M_1 and M_2 has probability Pr(t) = Pr(M_1) + Pr(M_2).
Very expensive if the number of mappings, |M|, is huge.


5 Algorithms for Comparison
Basic: considers each possible mapping separately.
e-basic: first clusters the identical source queries, then evaluates the set of distinct source queries.
e-MQO: improves e-basic by applying multi-query optimization to the set of distinct source queries.
Our solutions: q-sharing and o-sharing.

Target Query Evaluation
5 algorithms for querying probabilistic mappings: basic, e-basic, e-MQO, q-sharing, and o-sharing.

Q-Sharing
Source query: SELECT oaddr FROM Customer WHERE ophone='123'
Query answer for m1 and m2: (aaa, 0.5)

Q-Sharing: Algorithm
Partition the possible mappings, and find representative mappings.
Evaluate the basic solution on the representative mappings.
Compute the probability of each query answer evaluated by a representative mapping.

Efficient Mapping Partitioning
Partitioning is needed for every possible mapping.
A partition tree supports q-sharing by efficiently classifying possible mappings:
A non-leaf node is a target attribute.
An edge is a source attribute.
A leaf node is a partition of mappings.

Partition Tree: Example
Target query: SELECT pname FROM Person WHERE addr='abc'
The tree is built incrementally (figures): initial state; after m1 is processed; after m2 is processed; after m3 and m4 are processed; final state.
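A partition tree of this kind can be sketched with nested dictionaries: each level corresponds to one target attribute of the query, each edge is labeled with a source attribute, and each leaf holds a partition of mappings. The sketch is an illustration under stated assumptions, not the authors' data structure. The function names and the source attributes `cname` and `nickname` are hypothetical, chosen so that the partitions match {m1, m2}, {m3, m4}, {m5} from the q-sharing example.

```python
def build_partition_tree(mappings, query_attrs):
    """Insert each mapping into a tree keyed by its correspondences.

    mappings:    mapping name -> {target attribute: source attribute}
    query_attrs: non-empty list of target attributes used by the query
    """
    root = {}
    for name, corr in mappings.items():
        node = root
        # Inner levels: one dict per target attribute, edges are source attrs.
        for attr in query_attrs[:-1]:
            node = node.setdefault(corr.get(attr), {})
        # Last level: the leaf is the list of mappings in this partition.
        node.setdefault(corr.get(query_attrs[-1]), []).append(name)
    return root


def leaves(tree):
    """Yield every leaf partition of the tree, left to right."""
    if isinstance(tree, list):
        yield tree
    else:
        for child in tree.values():
            yield from leaves(child)


mappings = {
    "m1": {"pname": "cname", "addr": "oaddr"},
    "m2": {"pname": "cname", "addr": "oaddr"},
    "m3": {"pname": "cname", "addr": "haddr"},
    "m4": {"pname": "cname", "addr": "haddr"},
    "m5": {"pname": "nickname", "addr": "haddr"},
}
tree = build_partition_tree(mappings, ["pname", "addr"])
partitions = list(leaves(tree))
representatives = [p[0] for p in partitions]
```

Each mapping is classified in one pass over the query's target attributes, and the first member of every leaf serves as the representative mapping.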

O-Sharing Algorithm
Repeat:
  Select one query operator from the target query, chosen under an operator selection strategy.
  Reformulate the operator to a source operator and execute it.
Until all target query operators are consumed.
Our current solution supports selection, projection, join, MIN, MAX, and SUM operators.

O-Sharing: Framework
An e-unit (execution unit) captures the current status of a target query. It contains:
Query plan: organizes the query operators not yet executed and the intermediate query results.
Mapping set: the mappings that are used to answer the query.
Next-op: the query operator in the e-unit that will be executed in the next step.

O-Sharing: Framework
U-trace: a tree of e-units that have not yet been considered.
Start with the initial e-unit u1. After executing the next-op of u1 with m1-m2, an empty result is returned. Another e-unit u3 is generated with intermediate answer R3.
After executing the next-op of u3 with m3-m4, u4 is generated.
u4 contains only one operator. After execution, two sets of results R6 and R7 are returned. u3's next-op is executed over m5, which leads to e-unit u5.
u5's next-op is executed and returns an empty result. All e-units are executed. The query evaluation is complete.

O-Sharing: Detail
The operator selection strategy has two concerns:
Correctness: not all operators are allowed to be chosen, e.g., a selection operator with one attribute.
Effectiveness: reduce the overall query evaluation cost by maximizing the sharing of operator computation.

Probabilistic Top-k Query: Example
Assume the following probabilities, with k = 1:

Node  Prob.
u2    0.5
u6    0.2
u7    0.2
u5    0.1

Heap status during the query evaluation:

Node  Prob.  Heap                                    LB   UB
u2    0.5    -                                       0
u6    0.2    ta(0.2,0.5)                             0.2  0.3
u7    0.2    ta(0.4,0.5), tb(0.2,0.3), tc(0.2,0.3)   0.4  0.1
u5    0.1    -                                       -    -

LB: the lower bound probability of the tuple with the k-th highest probability in the heap.
UB: the maximal probability of any tuple not in the heap.
Each tuple in the heap has a lower and an upper bound on its probability.
ta can be returned as the top-1 answer without visiting u5, since: 1) the upper bounds of tb and tc are lower than ta's lower bound, and 2) UB < ta's lower bound.
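The stopping rule in this example can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes the nodes are visited in the listed order, that a tuple's lower bound is the probability mass of the visited nodes returning it, and that its upper bound adds the mass of all unvisited nodes. The tuple sets for u2, u6, and u7 are inferred from the bounds in the table, and the tuple `td` for u5 is hypothetical. The function name `top_k` is also hypothetical.

```python
def top_k(nodes, k):
    """Return the top-k tuples and the number of nodes actually visited.

    nodes: list of (node name, probability, tuples returned by that node),
           in visiting order.
    """
    remaining = sum(prob for _, prob, _ in nodes)  # mass of unvisited nodes
    lower = {}                                     # tuple -> lower bound
    visited = 0
    for _, prob, tuples in nodes:
        remaining -= prob
        visited += 1
        for t in tuples:
            lower[t] = lower.get(t, 0.0) + prob
        ranked = sorted(lower, key=lower.get, reverse=True)
        if len(ranked) < k:
            continue
        kth_lower = lower[ranked[k - 1]]
        # Best any competitor can reach: a seen tuple outside the top k,
        # or an unseen tuple, whose probability is at most `remaining` (UB).
        best_other = max([lower[t] + remaining for t in ranked[k:]] + [remaining])
        if kth_lower > best_other:
            break  # no other tuple can overtake the current top k
    ranked = sorted(lower, key=lower.get, reverse=True)
    return ranked[:k], visited


nodes = [
    ("u2", 0.5, []),                  # empty result
    ("u6", 0.2, ["ta"]),
    ("u7", 0.2, ["ta", "tb", "tc"]),
    ("u5", 0.1, ["td"]),              # hypothetical content, never visited
]
answer, visited = top_k(nodes, k=1)
```

After u7 is visited, ta has lower bound 0.4 while tb, tc, and any unseen tuple can reach at most 0.3, so the loop stops before u5.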

O-Sharing: Detail
The o-sharing algorithm:
1) Find representative mappings, and initialize the u-trace.
2) Evaluate the query with the u-trace.
3) Aggregate the query results and return.

O-Sharing: Detail
The o-sharing algorithm, for each e-unit:
Case 1: no more operators; return the query answers.
Case 2: an empty intermediate result is found; return empty query answers.
Case 3: no early stop:
a. find the next-op
b. partition the mapping set
c. for each subset of mappings: compute the next-op, generate a new e-unit, and recursively process the e-unit
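The three cases can be sketched as a recursive function over a single in-memory table. This is a heavily simplified illustration under stated assumptions, not the authors' implementation: it supports only selections followed by one final projection, it always runs selections first and picks the one with the fewest partitions, and it represents an e-unit implicitly by the arguments of each call. All names, the rows, the mapping m1, the attribute `hphone`, and the probabilities are hypothetical.

```python
from collections import defaultdict


def choose_next_op(ops, names, mappings):
    """Selections first; among candidates, fewest mapping partitions."""
    selects = [o for o in ops if o[0] == "select"]
    candidates = selects or ops
    return min(candidates, key=lambda o: len({mappings[m][o[1]] for m in names}))


def process(rows, ops, names, mappings, probs, answers):
    if not ops:                                   # Case 1: plan is finished
        weight = sum(probs[m] for m in names)
        for t in set(rows):
            answers[t] += weight
        return
    if not rows:                                  # Case 2: empty intermediate result
        return
    op = choose_next_op(ops, names, mappings)     # Case 3a: find next-op
    rest = [o for o in ops if o is not op]
    parts = {}                                    # Case 3b: partition mapping set
    for m in names:
        parts.setdefault(mappings[m][op[1]], []).append(m)
    for src_attr, subset in parts.items():        # Case 3c: one e-unit per subset
        if op[0] == "select":
            out = [r for r in rows if r.get(src_attr) == op[2]]
        else:                                     # projection yields plain values
            out = [r[src_attr] for r in rows]
        process(out, rest, subset, mappings, probs, answers)


# Target query: SELECT addr FROM Person WHERE phone='123'
ops = [("select", "phone", "123"), ("project", "addr")]
mappings = {
    "m1": {"addr": "oaddr", "phone": "hphone"},
    "m2": {"addr": "oaddr", "phone": "ophone"},
    "m3": {"addr": "haddr", "phone": "ophone"},
}
probs = {"m1": 0.2, "m2": 0.5, "m3": 0.3}
rows = [
    {"oaddr": "aaa", "haddr": "bbb", "ophone": "123", "hphone": "999"},
    {"oaddr": "ccc", "haddr": "ddd", "ophone": "456", "hphone": "123"},
]
answers = defaultdict(float)
process(rows, ops, list(mappings), mappings, probs, answers)
```

The selection on ophone is executed once for m2 and m3 together, and only the projection is evaluated separately for each of them.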

Future Work
How to handle complex and aggregate queries in o-sharing (e.g., set difference, recursive queries, subqueries)?
Can we do better if we also consider the selectivity information of operators?
How about other kinds of schemas (e.g., XML, XMARK)?
