MANAGING UNCERTAINTY OF XML SCHEMA MATCHING
Reynold Cheng, Jian Gong, David W. Cheung
ICDE’2010

THE DATA INTEGRATION PROBLEM
Querying the source data through a target query interface.
E.g., querying multiple data sources through a mediated query interface.
[Figure: data sources, each with a source schema, linked by schema mappings to the target schema behind the query interface]

SCHEMA MATCHING & MAPPING
Schema matching: finding element correspondences, with similarities, between schemas.
Schema mapping: a set of one-to-one correspondences between two schemas.
Generation: pick the best correspondences.
Sample mapping: Order - ORDER, BP - IP, BCN - ICN, ……

SCHEMA MAPPING AND UNCERTAINTY
The mapping between schemas can be uncertain. Which one is correct?
Compute Pr(M_i) by: 1) aggregating the similarities of correspondences, and 2) normalizing the probabilities of the top-k mappings.
Example: purchase order schemas.
Uncertain mappings:
M_1: Order-ORDER, …, BCN-ICN, …
M_2: Order-ORDER, …, RCN-ICN, …
…
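The two-step computation of Pr(M_i) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the aggregate is the sum of similarities (the score used on the mapping generation slide), and the similarity values and the function names `mapping_score` and `mapping_probabilities` are made up for the example.

```python
from typing import Dict, List, Tuple

Correspondence = Tuple[str, str]  # (source element, target element)


def mapping_score(mapping: List[Correspondence],
                  similarity: Dict[Correspondence, float]) -> float:
    """Step 1: aggregate the similarities of a mapping's correspondences (sum)."""
    return sum(similarity[c] for c in mapping)


def mapping_probabilities(mappings: List[List[Correspondence]],
                          similarity: Dict[Correspondence, float]) -> List[float]:
    """Step 2: normalize the scores of the top-k mappings so they sum to 1."""
    scores = [mapping_score(m, similarity) for m in mappings]
    total = sum(scores)
    if total == 0:
        raise ValueError("cannot normalize: all mapping scores are zero")
    return [s / total for s in scores]


# Illustrative similarities (assumed values, not taken from the slides).
similarity = {
    ("Order", "ORDER"): 0.9,
    ("BP", "IP"): 0.8,
    ("BCN", "ICN"): 0.7,
    ("RCN", "ICN"): 0.5,
}
m1 = [("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN")]
m2 = [("Order", "ORDER"), ("BP", "IP"), ("RCN", "ICN")]
probs = mapping_probabilities([m1, m2], similarity)
```

With these assumed similarities, M_1 scores higher than M_2, so it receives the larger probability.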

DATA INTEGRATION RELOADED
Managing uncertainty of XML schema matching.
Issues: mapping generation and storage, query evaluation, etc.
[Figure: data sources, each with a source schema, linked by uncertain schema mappings to the mediated schema behind the query interface]

OBSERVATION
Sharing among uncertain mappings. Overlapping:
“Order~ORDER” is shared by m_1-m_5
“BP~IP” is shared by m_1, m_2, m_4, m_5
“BCN~ICN” is shared by m_1, m_2
…

OBSERVATION
How much overlap is there in real-world schema mappings?
Overlapping ratio (o-ratio): the average overlap of the top-100 possible schema mappings.
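The slide does not spell out how the overlap of two mappings is measured. As one plausible reading, the sketch below assumes the overlap of a pair is the Jaccard ratio of their correspondence sets and averages it over all pairs; the function name `o_ratio` and the element names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

Correspondence = Tuple[str, str]


def o_ratio(mappings: List[FrozenSet[Correspondence]]) -> float:
    """Average pairwise overlap (assumed: Jaccard ratio) of a list of mappings."""
    pairs = list(combinations(mappings, 2))
    if not pairs:
        raise ValueError("need at least two mappings to measure overlap")
    overlaps = [len(a & b) / len(a | b) for a, b in pairs]
    return sum(overlaps) / len(overlaps)


# Three toy mappings over made-up element names.
ma = frozenset({("A", "a"), ("B", "b"), ("C", "c")})
mb = frozenset({("A", "a"), ("B", "b"), ("D", "c")})
mc = frozenset({("A", "a"), ("E", "b"), ("F", "c")})
ratio = o_ratio([ma, mb, mc])  # pairwise overlaps: 2/4, 1/5, 1/5
```

In the setting of the slide, the input would be the top-100 mappings produced by the matcher.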

OUR CONTRIBUTION
Propose the block tree, a novel data structure to represent a set of mappings: definition; efficient generation.
Propose the probabilistic twig query (PTQ): definition; efficient evaluation with the block tree; top-k PTQ and its computation issues.
Improve the possible-mapping generation process: a divide-and-conquer approach.
Conduct experiments on real data to validate our methods.

RELATED WORK
Schema matching: approaches and tools [RB01]; COMA [DR02].
Managing uncertainty in schema matching: top-k schema mappings [Gal06]; generating top-k mappings [Murty86].
Query evaluation in data integration: theoretical foundation [Len02]; data integration with uncertainty [DHY07]; XML query rewriting for data integration [YP04].
XML query evaluation: twig query [QYD07]; querying probabilistic XML documents [KYS08].

OUTLINE
Introduction; Problem (data model, query model); Techniques; Results; Conclusion

DATA MODEL
XML schema and document [QYD07]: a node-labeled tree; document nodes may carry text values.
Schema mapping [DHY07]: one-to-one mapping.
[Figure: schema and document]
Uncertain mappings:
M_1: Order-ORDER, …, BCN-ICN, …
M_2: Order-ORDER, …, RCN-ICN, …
…

QUERY MODEL (SINGLE MAPPING)
Twig query through a target schema [YP04].
Step 1: rewrite the target query into a source query, based on the schema mapping.
M_1: Order-ORDER, BP-IP, BCN-ICN, …
[Figure: target schema and target query, rewritten into source schema and source query]

QUERY MODEL (SINGLE MAPPING)
Twig query through a target schema [YP04].
Step 1: rewrite the target query into a source query, based on the schema mapping.
Step 2: evaluate the source query on the source document.
[Figure: source query and source document]

QUERY MODEL (UNCERTAIN MAPPINGS)
Query evaluation with uncertain mappings [DHY07].
Mappings: pM = {(M_1, Pr(M_1)), …, (M_h, Pr(M_h))}.
The query answers from mapping M_i have probability Pr(M_i).
[Figure: the target query Q_T is rewritten under each mapping (M_i, Pr(M_i)) into a source query Q_Si; evaluating Q_Si yields the answers (R_i, Pr(M_i)), for i = 1, …, h]
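The rewrite-then-evaluate pipeline over a set of uncertain mappings can be sketched as below. To stay short, the sketch assumes a single-element query such as //ICN and models the source document as a flat list of (element, text) pairs; the text values, probabilities, and function names are all illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple

Mapping = Dict[str, str]          # target element -> source element
Document = List[Tuple[str, str]]  # (source element, text value)


def rewrite(target_element: str, mapping: Mapping) -> Optional[str]:
    """Rewrite a single-element target query into a source query."""
    return mapping.get(target_element)


def evaluate(source_element: str, document: Document) -> List[str]:
    """Evaluate the source query: collect the text of matching nodes."""
    return [text for element, text in document if element == source_element]


def answer(target_element: str,
           p_mappings: List[Tuple[Mapping, float]],
           document: Document) -> List[Tuple[List[str], float]]:
    """Return (R_i, Pr(M_i)) for every mapping M_i, evaluated one by one."""
    results = []
    for mapping, prob in p_mappings:
        source_element = rewrite(target_element, mapping)
        answers = evaluate(source_element, document) if source_element else []
        results.append((answers, prob))
    return results


# Toy source document and probabilities (assumed values).
document = [("BCN", "Alice"), ("RCN", "Bob")]
p_mappings = [({"ICN": "BCN"}, 0.6), ({"ICN": "RCN"}, 0.4)]
results = answer("ICN", p_mappings, document)
```

Here the two mappings return different contact names for the same target query, each carrying the probability of the mapping that produced it.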

OUTLINE
Introduction; Problem; Techniques (block tree, query evaluation, mapping generation); Results; Conclusion

THE BLOCK
Each block, which is attached to a target schema element, consists of:
C: a set of correspondences
M: a set of mappings
Semantics: the mappings in M share the correspondences in C.
Drawback: an exponential number of blocks to handle.

THE C-BLOCK
A c-block (constrained block) is a block that:
contains correspondences for all elements in its sub-tree (so that it is more useful for query evaluation);
contains more shared mappings than a threshold (otherwise it is not worth storing).
Example: |pM| = 5, threshold = 0.4.
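The two c-block conditions can be checked with a few lines of code. This is a hedged sketch: it assumes the threshold is a fraction of |pM| and that the boundary is inclusive (the slide only says "more than a threshold"), and it assumes ICN lies in the sub-tree of IP. The `Block` class and `is_c_block` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Block:
    target_element: str                          # element the block is attached to
    correspondences: FrozenSet[Tuple[str, str]]  # C: (source, target) pairs
    mappings: FrozenSet[int]                     # M: ids of mappings sharing C


def is_c_block(block: Block, subtree_elements: Iterable[str],
               total_mappings: int, threshold: float) -> bool:
    """Check both c-block conditions for a block."""
    covered = {target for _, target in block.correspondences}
    covers_subtree = set(subtree_elements) <= covered
    # Assumption: the threshold is a fraction of |pM|, compared inclusively.
    frequent = len(block.mappings) / total_mappings >= threshold
    return covers_subtree and frequent


# "BP~IP" and "BCN~ICN" are both shared by m_1 and m_2 (see the overlap slide).
block = Block("IP", frozenset({("BP", "IP"), ("BCN", "ICN")}), frozenset({1, 2}))
ok = is_c_block(block, ["IP", "ICN"], total_mappings=5, threshold=0.4)
```

With |pM| = 5 and a threshold of 0.4, a block needs at least two sharing mappings under this reading.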

THE BLOCK TREE
Creation of the block tree: follows the structure of the target schema; a bottom-up method.
Lemma 1 (informal): the c-blocks for an element can be created from the c-blocks of its children.
Lemma 2 (informal): if an element has no c-block, then its parent (if any) has no c-block.

THE BLOCK TREE
Reducing the storage cost of uncertain mappings: if part of a mapping is in the block tree, replace it with a link.

OUTLINE
Introduction; Problem; Techniques (block tree, query evaluation, mapping generation); Results; Conclusion

QUERY EVALUATION AND UNCERTAINTY
The uncertainty in mappings may affect query answers.
Uncertain mappings:
M_1: Order-ORDER, …, BCN-ICN, …
M_2: Order-ORDER, …, RCN-ICN, …
…
Target query Q: //ICN, which finds all ICNs (contact names of invoice parties) in the purchase order.
Example: a source document.
[Figure: source document, marking the nodes returned by M_1 and the nodes returned by M_2]

THE BASELINE APPROACH
Evaluate Q_T with each mapping in pM separately.
Drawback: when the mapping M_i is large, or h is large, the computation is expensive.
[Figure: Q_T is rewritten under each mapping (M_i, Pr(M_i)) into Q_Si, and each Q_Si is evaluated on the source document D_S to yield (R_i, Pr(M_i))]

QUERY EVALUATION WITH BLOCK TREE
Consider the root of a query.
Case 1: the root is found in the block tree; use the blocks to evaluate the whole query.

QUERY EVALUATION WITH BLOCK TREE
Case 1: the root is found in the block tree; use the blocks to evaluate the whole query.
Only one mapping in the block is used.
Deal with the remaining mappings.

QUERY EVALUATION WITH BLOCK TREE
Consider the root of a query.
Case 1: the root is found in the block tree; use the blocks to evaluate the whole query.
Case 2: the root is not found; decompose the query (if possible), invoke recursion, and join the partial answers.

QUERY EVALUATION WITH BLOCK TREE
Case 2: the root is not found; decompose the query (if possible), invoke recursion, and join the partial answers.
[Figure: partial answers from a direct query, a recursion, and a direct query are joined]

OUTLINE
Introduction; Problem (data model, query model); Techniques (block tree, query evaluation, mapping generation); Results; Conclusion

MAPPING GENERATION
A mapping m of a schema S to another schema T contains a set of correspondences (e_s, e_t).
e_t may be EMPTY, i.e., e_s matches no element in T.
Each element in S occurs exactly once in m.
Each element in T occurs at most once in m.
The score of m is the sum of the similarities of its correspondences.
Problem definition
Given: two schemas S and T, and a set of correspondences (e_s, e_t) with similarities (the schema matching results).
Return: h mappings m_1, …, m_h whose scores are the highest.
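To make the problem definition concrete, the sketch below enumerates every valid mapping of a tiny schema pair by brute force and keeps the h best. It is exponential and only meant to illustrate the constraints, not the method used in the talk; `top_h_mappings` and the similarity values are assumptions, and EMPTY is modeled as `None`.

```python
import heapq
from itertools import product
from typing import Dict, List, Optional, Tuple

EMPTY = None  # e_t = EMPTY: the source element matches nothing in T
Mapping = Tuple[Tuple[str, Optional[str]], ...]


def top_h_mappings(source: List[str], target: List[str],
                   similarity: Dict[Tuple[str, str], float],
                   h: int) -> List[Tuple[float, Mapping]]:
    """Enumerate all valid mappings and return the h highest-scoring ones."""
    # Each source element may use any matched target element, or EMPTY.
    options = [[t for t in target if (s, t) in similarity] + [EMPTY]
               for s in source]
    candidates = []
    for assignment in product(*options):
        used = [t for t in assignment if t is not EMPTY]
        if len(used) != len(set(used)):
            continue  # a target element may occur at most once
        score = sum(similarity[(s, t)]
                    for s, t in zip(source, assignment) if t is not EMPTY)
        candidates.append((score, tuple(zip(source, assignment))))
    return heapq.nlargest(h, candidates, key=lambda c: c[0])


# Two source elements compete for one target element (assumed similarities).
similarity = {("BCN", "ICN"): 0.7, ("RCN", "ICN"): 0.5}
top = top_h_mappings(["BCN", "RCN"], ["ICN"], similarity, h=2)
```

Because ICN may be used at most once, the best mapping pairs it with BCN and leaves RCN unmatched; the runner-up does the opposite.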

MAPPING GENERATION
Baseline solution: find the h maximum bipartite matchings (min-cost flow).
Polynomial in the size of the bipartite graph.

MAPPING GENERATION
Observation: XML schema matching is usually sparse.
Improvement: a divide-and-conquer approach.
Derive the partitions (maximal connected sub-graphs) of the bipartite graph.
Find the top-h partial mappings from each partition.
Merge.
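The partition and merge steps can be sketched as follows. The sketch assumes each partition's top-h partial mappings are already available as (score, correspondences) pairs from some solver; the function names `partitions` and `merge_top_h` and the scores are illustrative. Merging only the per-partition top-h lists is sufficient because scores are additive across disjoint partitions.

```python
import heapq
from collections import defaultdict
from itertools import product
from typing import List, Tuple

Correspondence = Tuple[str, str]
Partial = Tuple[float, Tuple[Correspondence, ...]]  # (score, correspondences)


def partitions(correspondences: List[Correspondence]) -> List[List[Correspondence]]:
    """Split the bipartite graph into its maximal connected sub-graphs."""
    adjacency = defaultdict(set)
    for s, t in correspondences:
        adjacency[("S", s)].add(("T", t))
        adjacency[("T", t)].add(("S", s))
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, nodes = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node not in nodes:
                nodes.add(node)
                stack.extend(adjacency[node] - nodes)
        seen |= nodes
        components.append(sorted(c for c in correspondences
                                 if ("S", c[0]) in nodes))
    return components


def merge_top_h(left: List[Partial], right: List[Partial], h: int) -> List[Partial]:
    """Combine the top-h lists of two partitions into a joint top-h list."""
    combined = [(ls + rs, lm + rm) for (ls, lm), (rs, rm) in product(left, right)]
    return heapq.nlargest(h, combined, key=lambda c: c[0])


parts = partitions([("Order", "ORDER"), ("BCN", "ICN"), ("RCN", "ICN")])
left = [(0.9, (("Order", "ORDER"),))]                           # assumed scores
right = [(0.7, (("BCN", "ICN"),)), (0.5, (("RCN", "ICN"),))]
merged = merge_top_h(left, right, h=2)
```

With more than two partitions, `merge_top_h` would be applied repeatedly, folding one partition's list into the running result at a time.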

OUTLINE
Introduction; Problem; Techniques; Results; Conclusion

DATASET AND RESULTS
XML schemas and documents: 7 schemas for purchase orders, obtained from various e-commerce standards (e.g., XCBL, OpenTrans), with accompanying sample XML documents.
Schema matching: the tool is COMA++, with different schema matching methods; 10 datasets, each a (source schema, target schema, matching method) triple.
Target queries: 10 hand-written queries.

RESULTS
Do uncertain mappings really overlap?

RESULTS
How much space does the block tree save for storing uncertain mappings, and why?

RESULTS
Is the block tree effective? Intuitively, larger blocks tend to be more useful.

RESULTS
The block tree can be created efficiently: fast and controllable.

RESULTS
Can the block tree really help improve query performance? Vary the total number of mappings.

RESULTS
Can it scale? Probabilistic twig query and top-k query.

RESULTS
Top-h mapping generation: performance gain of partitioning.

CONCLUSION
We study the problem of handling uncertainty in XML schema matching.
Observations: overlapping mappings, sparse bipartite graphs, etc.
Approach: the block tree; query evaluation with the block tree; generating uncertain mappings more efficiently.
Future work: other types of queries, probabilistic documents, index updates, the relational scenario, etc.

THANKS!
Q & A

REFERENCES
[Len02] Lenzerini, “Data integration: a theoretical perspective”, in PODS, 2002
[YP04] Yu et al, “Constraint-based XML query rewriting for data integration”, in SIGMOD, 2004
[DR02] Do et al, “COMA: a system for flexible combination of schema matching approaches”, in VLDB, 2002
[Gal06] Gal, “Managing uncertainty in schema matching with top-k schema mappings”, in J. Data Semantics VI, 2006
[DHY07] Dong et al, “Data integration with uncertainty”, in VLDB, 2007
[QYD07] Qin et al, “TwigList: make twig pattern matching fast”, in DASFAA, 2007
[Murty86] Murty, “An algorithm for ranking all the assignment in increasing order of cost”, Operations Research, vol 16, 1986
[RB01] Rahm et al, “A survey of approaches to automatic schema matching”, VLDB J, vol 10, 2001
[KYS08] Kimelfeld et al, “Query efficiency in probabilistic XML models”, in SIGMOD, 2008
…

QUERY REWRITING
Given: a target twig query Q_T; a schema mapping m between S and T, which is a set of correspondences (e_s, e_t).
Mapping semantics: for each sub-tree in the source document D_S that contains a set of source elements in m, there exists a sub-tree in the target document D_T that contains the corresponding target elements.
Procedure: for each element in Q_T, replace it with a source element; connect all the source elements.
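The rewriting procedure can be sketched on a small twig. This is a simplified illustration: it assumes the rewritten source elements are connected in the same shape as the target twig, whereas the rewriting in [YP04] connects them according to the source schema. The `TwigNode` class, the `rewrite_twig` function, and the ORDER/IP/ICN nesting are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TwigNode:
    label: str
    children: List["TwigNode"] = field(default_factory=list)


def rewrite_twig(node: TwigNode,
                 target_to_source: Dict[str, str]) -> Optional[TwigNode]:
    """Replace every target element by its source element, keeping the shape.

    Returns None when some query element has no correspondence in the mapping.
    """
    if node.label not in target_to_source:
        return None
    children = []
    for child in node.children:
        rewritten = rewrite_twig(child, target_to_source)
        if rewritten is None:
            return None
        children.append(rewritten)
    return TwigNode(target_to_source[node.label], children)


# Target twig ORDER/IP/ICN rewritten under M_1 (the nesting is an assumption).
target_query = TwigNode("ORDER", [TwigNode("IP", [TwigNode("ICN")])])
m1 = {"ORDER": "Order", "IP": "BP", "ICN": "BCN"}
source_query = rewrite_twig(target_query, m1)
```

Under M_2, which maps ICN to RCN instead, the same target twig would be rewritten into a different source query.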

LEMMA 1
An example.
Lemma 1 (conceptually): the c-blocks for a schema element t can be created from the c-blocks of t’s children.

RESULTS
What kinds of queries did we use?
