Background: Hidden Databases
Figure: a target query (Location = Central, Price < 5M, Size > 700 ft) is posed through a query interface (e.g., a web form) over the source DB. The DB instances are hidden from users.
Background: Schema Matching
Source schema S: (pname, -addr, permanent-addr, current-addr)
Target schema T: (name, , mailing-addr, home-addr, office-addr)
A schema matching (e.g., from COMA++) consists of correspondences, each linking a source attribute to a target attribute. The target query is posed on the target schema.
Background: Schema Mapping
S: (pname, -addr, permanent-addr, current-addr)
T: (name, , mailing-addr, home-addr, office-addr)
Mapping: a subset of the matching, used to translate a target query into a source query.
Many different mappings are possible, so it is better if we know their confidence.
Probabilistic Mappings
A set of h pairs (M_i, Pr(M_i)), where Pr(M_i) is the probability that mapping M_i exists [Gal06, DHY07, CGC10].
Figure: the mappings are obtained from similarity scores by bipartite matching on those scores.
Querying on these mappings produces answers with confidence.
Basic Query Solution: Example
Target query: SELECT phone FROM Person WHERE addr='aaa'
m1: source query: SELECT ophone FROM Person WHERE oaddr='aaa'; answers: ("123", 0.3), ("456", 0.3)
m2: answers: ("123", 0.2), ("456", 0.2)
m1 and m2 combined: ("123", 0.5), ("456", 0.5)
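The basic solution in this example can be sketched as follows: rewrite the target query once per mapping, evaluate it, and add each mapping's probability to every answer it returns. This is only an illustration. The in-memory rows, the `haddr` attribute, and the correspondences assumed for m2 are hypothetical, since the slides do not show them.

```python
from collections import defaultdict

# Hypothetical source rows (not from the slides).
SOURCE_ROWS = [
    {"ophone": "123", "oaddr": "aaa", "haddr": "aaa"},
    {"ophone": "456", "oaddr": "aaa", "haddr": "aaa"},
    {"ophone": "999", "oaddr": "bbb", "haddr": "ccc"},
]

# (target attribute -> source attribute, probability).
# m1 follows the slide; m2's correspondences are assumed.
MAPPINGS = [
    ({"phone": "ophone", "addr": "oaddr"}, 0.3),  # m1
    ({"phone": "ophone", "addr": "haddr"}, 0.2),  # m2 (assumed)
]

def basic_answers(select_attr, where_attr, where_value, mappings, rows):
    """Evaluate SELECT select_attr WHERE where_attr = where_value per mapping."""
    answers = defaultdict(float)
    for correspondences, prob in mappings:
        src_select = correspondences[select_attr]
        src_where = correspondences[where_attr]
        # One source query per mapping: this is the cost that sharing avoids.
        result = {r[src_select] for r in rows if r[src_where] == where_value}
        for value in result:
            answers[value] += prob
    return dict(answers)

answers = basic_answers("phone", "addr", "aaa", MAPPINGS, SOURCE_ROWS)
```

With these assumed rows, both "123" and "456" end up with probability 0.5, as in the example.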
Variants of Basic Solutions
Enhanced basic (or e-basic): groups identical source queries and evaluates only the distinct ones. Much better than basic!
e-MQO: attempts to improve e-basic by applying multi-query optimization [ZLFL07] on the distinct source queries. Experimentally worse than e-basic, since generating a good multi-query plan for many mappings is expensive.
We use e-basic as the baseline for comparison with our new algorithms.
Correspondence Overlap
Probabilistic mappings can have many common correspondences.
Q-sharing and o-sharing use this overlap to improve query efficiency.
Query-Level Sharing (Q-Sharing)
If the source queries for mappings m1 and m2 are identical, only one query needs to be issued.
Target query: SELECT addr FROM Person WHERE phone='123'
Source query (m1 and m2): SELECT oaddr FROM Customer WHERE ophone='123'
Q-Sharing: Example
Target query: SELECT pname FROM Person WHERE addr='abc'
Partition the mappings (partition tree): P1: {m1, m2}, P2: {m3, m4}, P3: {m5}
Representative mappings: {m1, m3, m5}
Only 3 out of 5 mappings are used for query reformulation.
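A minimal sketch of this partitioning step, assuming that two mappings fall into the same partition exactly when they agree on every target attribute the query uses. It groups mappings with a plain dictionary rather than the partition tree named on the slide, and the correspondences and probabilities below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical mappings: name -> (target attr -> source attr, probability).
MAPPINGS = {
    "m1": ({"pname": "name", "addr": "oaddr", "phone": "ophone"}, 0.3),
    "m2": ({"pname": "name", "addr": "oaddr", "phone": "hphone"}, 0.2),
    "m3": ({"pname": "name", "addr": "haddr", "phone": "ophone"}, 0.2),
    "m4": ({"pname": "name", "addr": "haddr", "phone": "hphone"}, 0.2),
    "m5": ({"pname": "nickname", "addr": "haddr", "phone": "ophone"}, 0.1),
}

def q_sharing_partitions(query_attrs, mappings):
    """Group mappings that rewrite the query attributes identically."""
    groups = defaultdict(list)
    for name, (correspondences, _prob) in mappings.items():
        key = tuple(correspondences[a] for a in query_attrs)
        groups[key].append(name)
    return list(groups.values())

def representatives(partitions, mappings):
    """Pick one mapping per partition and give it the partition's total probability."""
    return [(group[0], sum(mappings[n][1] for n in group)) for group in partitions]

# The query SELECT pname ... WHERE addr='abc' only touches pname and addr.
partitions = q_sharing_partitions(["pname", "addr"], MAPPINGS)
reps = representatives(partitions, MAPPINGS)
```

Only the representative of each partition needs to be reformulated and evaluated; its answers carry the summed probability of the whole partition.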
Problem of Q-Sharing
Given a target query, two mappings may share only some query operators, but not all. In that case, q-sharing does not work!
Target query: SELECT addr FROM Person WHERE phone='123'
O-Sharing
Share query operator evaluation between mappings with the same correspondence.
Target query: SELECT addr FROM Person WHERE phone='123'
m2 and m3 share the selection condition:
1. Obtain tuples with ophone = 123 for m2 and m3
2. For m2, retrieve oaddr; for m3, retrieve haddr
O-Sharing: Example
Figure: the target query and the probabilistic mappings.
An execution unit (e-unit) u1 captures the current status of a target query: 1) query plan, 2) mapping set, 3) next-op.
Execution of e-unit u1: select the next operator (details later). For m1 and m2, addr maps to oaddr, so m1 and m2 are processed in a batch. For m3, m4, and m5, addr maps to haddr, so m3-m5 are processed in a batch.
The mapping set of u1 is partitioned, intermediate results are generated, and new e-units u2 and u3 are produced. The process goes on until no more e-units are produced.
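The e-unit loop can be sketched roughly as below. This is a simplified illustration, not the authors' implementation: it handles only selection and projection on one table, runs the operators in a fixed order instead of choosing the next operator, and uses hypothetical rows, correspondences, and probabilities.

```python
from collections import defaultdict

# Hypothetical source rows and mappings (not from the slides).
ROWS = [
    {"ophone": "123", "hphone": "555", "oaddr": "A1", "haddr": "B1"},
    {"ophone": "777", "hphone": "123", "oaddr": "A2", "haddr": "B2"},
]
MAPPINGS = {
    "m1": ({"phone": "ophone", "addr": "oaddr"}, 0.3),
    "m2": ({"phone": "ophone", "addr": "oaddr"}, 0.2),
    "m3": ({"phone": "ophone", "addr": "haddr"}, 0.2),
    "m4": ({"phone": "hphone", "addr": "haddr"}, 0.2),
    "m5": ({"phone": "hphone", "addr": "haddr"}, 0.1),
}

def o_sharing(plan, mappings, rows):
    """plan: ("select", attr, value) operators followed by one final ("project", attr)."""
    answers = defaultdict(float)
    # An e-unit here is (index of next operator, mapping names, intermediate result).
    units = [(0, list(mappings), rows)]
    while units:
        next_op, names, data = units.pop()
        if next_op == len(plan):
            # Plan finished: every distinct value gets the mapping set's probability.
            prob = sum(mappings[n][1] for n in names)
            for value in set(data):
                answers[value] += prob
            continue
        op = plan[next_op]
        # Partition the mapping set by the source attribute that op's attribute maps to.
        groups = defaultdict(list)
        for n in names:
            groups[mappings[n][0][op[1]]].append(n)
        for src_attr, group in groups.items():
            # The operator runs once per partition, not once per mapping.
            if op[0] == "select":
                out = [r for r in data if r[src_attr] == op[2]]
            else:
                out = [r[src_attr] for r in data]
            if out:  # an empty intermediate result produces no new e-unit
                units.append((next_op + 1, group, out))
    return dict(answers)

plan = [("select", "phone", "123"), ("project", "addr")]
answers = o_sharing(plan, MAPPINGS, ROWS)
```

In this run the selection is evaluated twice (once for ophone, once for hphone) instead of five times, and each resulting e-unit is split again only where the addr correspondences differ.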
Operator Selection
Method 1: Randomly select the next operator.
Operator Selection
Method 2: SNF (or Smallest Number of Partition First) chooses the target operator that leads to the fewest mapping partitions.
Figure: one operator's attribute is mapped to 3 source attributes, i.e., 3 mapping partitions; the other operator leads to 4 mapping partitions.
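A small sketch of SNF, assuming each candidate operator refers to a single target attribute and that the number of partitions equals the number of distinct source attributes that attribute maps to. The operator names and correspondences are hypothetical, chosen so that the counts are 3 and 4 as in the figure.

```python
# Hypothetical mapping set: each dict maps target attributes to source attributes.
MAPPINGS = [
    {"addr": "oaddr", "phone": "ophone"},
    {"addr": "oaddr", "phone": "hphone"},
    {"addr": "haddr", "phone": "mphone"},
    {"addr": "haddr", "phone": "fax"},
    {"addr": "maddr", "phone": "ophone"},
]

# Hypothetical candidate operators: operator name -> target attribute it uses.
OPERATORS = {"select_phone": "phone", "project_addr": "addr"}

def num_partitions(attr, mappings):
    """Number of mapping partitions induced by attr (distinct source attributes)."""
    return len({m[attr] for m in mappings})

def snf_next_operator(operators, mappings):
    """Pick the operator whose attribute splits the mapping set into the fewest parts."""
    return min(operators, key=lambda op: num_partitions(operators[op], mappings))

chosen = snf_next_operator(OPERATORS, MAPPINGS)
```

Here addr induces 3 partitions and phone induces 4, so the operator on addr is chosen first.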
Operator Selection
Method 3: SEF (or Smallest Entropy First) chooses the target operator that leads to the lowest entropy.
Figure: the mapping partitions induced by addr and by phone.
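The slides do not give the entropy formula, so the sketch below assumes one plausible reading: the Shannon entropy of the partition sizes, where each partition's weight is its share of the mappings. Under this assumption a skewed split (one large partition) scores lower than an even split with the same number of partitions. The mappings and operator names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mapping set: addr splits 4/1, phone splits 3/2.
MAPPINGS = [
    {"addr": "oaddr", "phone": "ophone"},
    {"addr": "oaddr", "phone": "ophone"},
    {"addr": "oaddr", "phone": "ophone"},
    {"addr": "oaddr", "phone": "hphone"},
    {"addr": "haddr", "phone": "hphone"},
]

# Hypothetical candidate operators: operator name -> target attribute it uses.
OPERATORS = {"select_phone": "phone", "project_addr": "addr"}

def partition_entropy(attr, mappings):
    """Shannon entropy (assumed definition) of the partition sizes induced by attr."""
    sizes = Counter(m[attr] for m in mappings)
    total = sum(sizes.values())
    return -sum((c / total) * math.log2(c / total) for c in sizes.values())

def sef_next_operator(operators, mappings):
    """Pick the operator whose attribute induces the lowest-entropy partitioning."""
    return min(operators, key=lambda op: partition_entropy(operators[op], mappings))

chosen = sef_next_operator(OPERATORS, MAPPINGS)
```

Both attributes induce two partitions here, so SNF cannot tell them apart, while this entropy measure prefers addr because most mappings stay together in one batch.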
Advantages of O-Sharing
Interleaves query rewriting and operator execution.
May not have to consider the whole target query for every mapping, because an intermediate result may be empty.
The current o-sharing solution supports selection, projection, join, MIN, MAX, and SUM operators.
Probabilistic Top-k Queries
Query semantics: returns the k tuples whose probabilities are the highest, among those with non-zero probabilities.
Our new algorithm can prune non-answer tuples and avoid evaluating the actual probabilities of all answer tuples.
This is done by partially expanding the e-units.
Experimental Setup
Schemas and data are about purchase orders.
Source schema: TPC-H, 100MB database with 1M tuples; 46 attributes, 8 relations.
3 target schemas provided by COMA++: Excel, Noris, and Paragon, with 48, 66, and 69 attributes.
Schema matcher: COMA++.
10 target queries: selection, projection, join, COUNT, and SUM.
100 probabilistic mappings.
SEF is used for o-sharing.
Operator Selection Strategies
SNF is much better than Random, and SEF further improves on SNF.
Top-k Query Performance
Top-k query processing can improve query performance, especially when the query returns a large set of results.
Related Work
Schema matching: uncertainty is not considered in most existing work.
Probabilistic schema mapping [Gal06, DHY07].
Uncertain XML schema matching [CGC10, GCC11]: computing and storing probabilistic XML mappings; evaluating probabilistic XML queries.
Conclusions
Probabilistic mappings can be used to handle the uncertainty of schema matching.
To efficiently handle table semantics, we examine q-sharing and o-sharing. They exploit the correspondences of mappings that share a query or its query operators.
We plan to study the use of o-sharing on other queries (e.g., set difference and recursive queries).
Thank You!
Reynold Cheng (HKU) URL:
Basic Solutions
e-basic is the best among the simple solutions. We thus compare it with q-sharing and o-sharing.
Fraction of no. of common correspondences over no. of distinct correspondences
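One possible reading of this fraction, sketched under assumptions: a correspondence is "common" if it appears in every mapping of the set, and the "distinct" correspondences are all those that appear in at least one mapping. The slides may intend a different definition (for example a pairwise one), and the example mappings are hypothetical.

```python
def overlap_ratio(mappings):
    """Common correspondences divided by distinct correspondences (assumed definition)."""
    sets = [set(m.items()) for m in mappings]
    if not sets:
        return 0.0
    common = set.intersection(*sets)   # correspondences present in every mapping
    distinct = set.union(*sets)        # correspondences present in any mapping
    return len(common) / len(distinct) if distinct else 0.0

# Hypothetical mappings: they share phone -> ophone but differ on addr.
m1 = {"phone": "ophone", "addr": "oaddr"}
m2 = {"phone": "ophone", "addr": "haddr"}
ratio = overlap_ratio([m1, m2])
```

For these two mappings there is 1 common correspondence out of 3 distinct ones, so the ratio is 1/3.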
