Managing Uncertainty of XML Schema Matching
Reynold Cheng, Jian Gong, David W. Cheung
ICDE’2010


The data integration problem
Querying the source data through a target query interface.
E.g., querying multiple data sources through a mediated query interface.
[Figure: data sources with source schemas, linked by schema mappings to the target schema behind the query interface]

Schema matching & mapping
Schema matching: finding element correspondences, with similarities, between schemas.
Schema mapping: a set of one-to-one correspondences between two schemas.
Generation: pick the best correspondences.
Sample mapping: Order - ORDER, BP - IP, BCN - ICN, ……

Schema mapping and uncertainty
The mapping between schemas can be uncertain: which one is correct?
Compute Pr(M_i) by: 1) aggregating the similarities of correspondences, and 2) normalizing the probabilities of the top-k mappings.
Example: purchase order schemas.
Uncertain mappings:
  M_1: Order-ORDER, …, BCN-ICN, …
  M_2: Order-ORDER, …, RCN-ICN, …
  …
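
The two steps for computing Pr(M_i) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "aggregating" means summing the similarities (the score used on the mapping generation slide), and the similarity values and the third mapping are made up for the example.

```python
def mapping_probabilities(mappings, k):
    """mappings: name -> {correspondence: similarity}. Returns name -> probability."""
    # Step 1: aggregate the similarities of each mapping (assumed here: a plain sum).
    scores = {name: sum(sims.values()) for name, sims in mappings.items()}
    # Step 2: keep the k best mappings and normalize their scores to sum to 1.
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    total = sum(scores[name] for name in top)
    return {name: scores[name] / total for name in top}

# Hypothetical similarity values, for illustration only.
mappings = {
    "M1": {("Order", "ORDER"): 0.9, ("BCN", "ICN"): 0.6},
    "M2": {("Order", "ORDER"): 0.9, ("RCN", "ICN"): 0.5},
    "M3": {("Order", "ORDER"): 0.9, ("BP", "ICN"): 0.1},
}
probs = mapping_probabilities(mappings, k=2)
```

With k = 2, only M1 and M2 survive, and their probabilities are their scores divided by the sum of the two scores.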

Data integration reloaded
Managing uncertainty of XML schema matching.
Issues: mapping generation and storage, query evaluation, etc.
[Figure: data sources with source schemas, linked by uncertain schema mappings to the mediated schema behind the query interface]

Observation
Sharing among uncertain mappings. Overlapping:
  “Order~ORDER” is shared by m_1 to m_5
  “BP~IP” is shared by m_1, m_2, m_4, m_5
  “BCN~ICN” is shared by m_1, m_2
  …
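
The sharing pattern listed above can be made explicit by indexing each correspondence by the mappings that contain it. The sketch below is illustrative only: the five toy mappings reproduce the sharing stated on the slide, and the elements "SP", "DATE", "D1" and "D2" are hypothetical fillers that keep the mappings distinct.

```python
from collections import defaultdict

def sharing_index(mappings):
    """Map each correspondence to the set of mappings that contain it."""
    index = defaultdict(set)
    for name, correspondences in mappings.items():
        for c in correspondences:
            index[c].add(name)
    return dict(index)

# Toy mappings; correspondences are (source element, target element) pairs.
mappings = {
    "m1": {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN"), ("D1", "DATE")},
    "m2": {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN"), ("D2", "DATE")},
    "m3": {("Order", "ORDER"), ("SP", "IP"), ("RCN", "ICN"), ("D1", "DATE")},
    "m4": {("Order", "ORDER"), ("BP", "IP"), ("RCN", "ICN"), ("D1", "DATE")},
    "m5": {("Order", "ORDER"), ("BP", "IP"), ("RCN", "ICN"), ("D2", "DATE")},
}
index = sharing_index(mappings)
```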

Observation
How much overlap is there in real-world schema mappings?
Overlapping ratio (o-ratio): the average overlap of the top-100 possible schema mappings.
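
The slide does not spell out how the overlap of two mappings is measured. As one possible reading, the sketch below assumes the overlap of a pair is the number of shared correspondences divided by the number of correspondences in their union, and averages it over all pairs. The function name `o_ratio` and this pairwise definition are assumptions, not taken from the slides.

```python
from itertools import combinations

def o_ratio(mappings):
    """Average pairwise overlap of a list of mappings (each a non-empty set)."""
    pairs = list(combinations(mappings, 2))
    # Assumed overlap of a pair: shared correspondences over all correspondences.
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

identical = [{("Order", "ORDER")}, {("Order", "ORDER")}]
disjoint = [{("BCN", "ICN")}, {("RCN", "ICN")}]
```

Under this reading, identical mappings give an o-ratio of 1 and mappings with no common correspondence give 0.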

Our contribution
Propose the block tree: a novel data structure to represent a set of mappings
  Definition
  Efficient generation
Propose the probabilistic twig query (PTQ)
  Definition
  Efficient evaluation with the block tree
  Top-k PTQ, and its computation issue
Improve the possible-mapping generation process
  A divide-and-conquer approach
Conduct experiments on real data to validate our methods

Related work
Schema matching
  Approaches and tools [RB01]
  COMA [DR02]
Managing uncertainty in schema matching
  Top-k schema mappings [Gal06]
  Generating top-k mappings [Murty86]
Query evaluation in data integration
  Theoretical foundation [Len02]
  Data integration with uncertainty [DHY07]
  XML query rewriting for data integration [YP04]
XML query evaluation
  Twig query [QYD07]
  Querying probabilistic XML documents [KYS08]

Outline
Introduction
Problem
  Data model
  Query model
Techniques
Results
Conclusion

Data model
XML schema and document [QYD07]
  Node-labeled tree
  Document nodes may carry text values
Schema mapping [DHY07]
  One-to-one mapping
Uncertain mappings:
  M_1: Order-ORDER, …, BCN-ICN, …
  M_2: Order-ORDER, …, RCN-ICN, …
  …
[Figure: an example schema and document]

Query model (single mapping)
Twig query through a target schema [YP04]
Step 1: rewrite the target query into a source query, based on the schema mapping.
  M_1: Order-ORDER, BP-IP, BCN-ICN, …
[Figure: the target schema and target query, rewritten into the source schema and source query]

Query model (single mapping)
Twig query through a target schema [YP04]
Step 1: rewrite the target query into a source query, based on the schema mapping.
Step 2: evaluate the source query on the source document.
[Figure: the source query and the source document]

Query model (uncertain mappings)
Query evaluation with uncertain mappings [DHY07]
Mappings: pM = {(M_1, Pr(M_1)), …, (M_h, Pr(M_h))}
The query answers from mapping M_i have probability Pr(M_i).
[Figure: the target query Q_T is rewritten under each mapping (M_i, Pr(M_i)) into a source query Q_Si, whose evaluation gives the answers (R_i, Pr(M_i)), for i = 1, …, h]
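
The rewrite-then-evaluate loop over pM can be sketched as below. To stay short, the sketch makes strong simplifying assumptions: the document is a flat list of (label, text) nodes, the query is a single `//label` step, and each mapping is a dictionary from target element to source element. The node texts and the probabilities are invented for the example.

```python
def evaluate(label, document):
    """Return the text of every node carrying the given label (a '//label' query)."""
    return [text for node_label, text in document if node_label == label]

def evaluate_uncertain(target_label, p_mappings, document):
    """Return one (answers, probability) pair per mapping in pM."""
    results = []
    for mapping, prob in p_mappings:
        source_label = mapping.get(target_label)       # rewriting
        answers = evaluate(source_label, document)     # evaluation
        results.append((answers, prob))                # answers carry Pr(M_i)
    return results

# Hypothetical source document and probabilities.
document = [("BCN", "Cathy"), ("RCN", "Bob")]
p_mappings = [
    ({"ORDER": "Order", "ICN": "BCN"}, 0.6),   # M_1
    ({"ORDER": "Order", "ICN": "RCN"}, 0.4),   # M_2
]
results = evaluate_uncertain("ICN", p_mappings, document)
```

Here the target query //ICN returns a different node under each mapping, and each answer set inherits the probability of the mapping that produced it.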

Outline
Introduction
Problem
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion

The block
Each block, which is attached to a target schema element, consists of:
  C: a set of correspondences
  M: a set of mappings
Semantics: the mappings in M share the correspondences in C.
Drawback: an exponential number of blocks to handle.

The c-block
A c-block (constrained block) is a block which:
  Contains correspondences for all elements in its sub-tree (so that it is more useful for query evaluation)
  Is shared by more mappings than a threshold (otherwise it is not worth storing)
Example: |pM| = 5, threshold = 0.4
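
The two conditions can be checked directly from the definition, as in the sketch below. It assumes the threshold is a fraction of |pM| and that a block qualifies when at least threshold × |pM| mappings share it; the slide does not say whether the comparison is strict. The target schema shape and the elements "SP", "DATE", "D1" and "D2" are hypothetical.

```python
from collections import defaultdict

def subtree(children, t):
    """All target schema elements in the sub-tree rooted at t."""
    elems = [t]
    for child in children.get(t, []):
        elems.extend(subtree(children, child))
    return elems

def c_blocks(children, t, mappings, threshold):
    """Return {C: M} for the c-blocks of t; mappings are target -> source dicts."""
    elems = subtree(children, t)
    groups = defaultdict(set)
    for name, m in mappings.items():
        if all(e in m for e in elems):      # condition 1: the whole sub-tree is covered
            groups[frozenset((m[e], e) for e in elems)].add(name)
    need = threshold * len(mappings)        # condition 2: enough sharing mappings (assumed >=)
    return {c: ms for c, ms in groups.items() if len(ms) >= need}

children = {"ORDER": ["IP", "DATE"], "IP": ["ICN"]}
mappings = {
    "m1": {"ORDER": "Order", "IP": "BP", "ICN": "BCN", "DATE": "D1"},
    "m2": {"ORDER": "Order", "IP": "BP", "ICN": "BCN", "DATE": "D2"},
    "m3": {"ORDER": "Order", "IP": "SP", "ICN": "RCN", "DATE": "D1"},
    "m4": {"ORDER": "Order", "IP": "BP", "ICN": "RCN", "DATE": "D1"},
    "m5": {"ORDER": "Order", "IP": "BP", "ICN": "RCN", "DATE": "D2"},
}
ip_blocks = c_blocks(children, "IP", mappings, threshold=0.4)
order_blocks = c_blocks(children, "ORDER", mappings, threshold=0.4)
```

With |pM| = 5 and a threshold of 0.4, a block needs at least two mappings. In this toy data the element IP gets two c-blocks, while ORDER gets none because no two mappings agree on the whole schema.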

The block tree
Creation of the block tree
  Follows the structure of the target schema
  A bottom-up method
Lemma 1 (informal): the c-blocks for an element can be created from the c-blocks of its children.
Lemma 2 (informal): if an element has no c-block, then its parent (if any) has no c-block.
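
One plausible way to turn the two lemmas into a bottom-up construction is sketched below. It is an interpretation, not the authors' algorithm: for each element it combines one c-block per child, intersects their mapping sets, splits the result by the element's own correspondence, and keeps the groups that reach the threshold. The data layout, the `>=` comparison, and the elements "SP", "DATE", "D1" and "D2" are assumptions.

```python
from collections import defaultdict
from itertools import product

def build_block_tree(children, root, mappings, threshold):
    """Return {element: {C: M}}; mappings are target -> source dicts."""
    need = threshold * len(mappings)
    tree = {}

    def visit(t):
        kids = children.get(t, [])
        for child in kids:                   # bottom-up: children first
            visit(child)
        if any(not tree[c] for c in kids):   # Lemma 2: prune when a child has no c-block
            tree[t] = {}
            return
        blocks = {}
        # Lemma 1: combine one c-block per child (a leaf has a single empty combination).
        for combo in product(*(tree[c].items() for c in kids)):
            corrs = frozenset().union(*(c for c, _ in combo))
            shared = set(mappings)
            for _, members in combo:
                shared &= members
            by_source = defaultdict(set)     # split by the correspondence of t itself
            for name in shared:
                if t in mappings[name]:
                    by_source[mappings[name][t]].add(name)
            for source, members in by_source.items():
                if len(members) >= need:
                    blocks[corrs | {(source, t)}] = members
        tree[t] = blocks

    visit(root)
    return tree

children = {"ORDER": ["IP", "DATE"], "IP": ["ICN"]}
mappings = {
    "m1": {"ORDER": "Order", "IP": "BP", "ICN": "BCN", "DATE": "D1"},
    "m2": {"ORDER": "Order", "IP": "BP", "ICN": "BCN", "DATE": "D2"},
    "m3": {"ORDER": "Order", "IP": "SP", "ICN": "RCN", "DATE": "D1"},
    "m4": {"ORDER": "Order", "IP": "BP", "ICN": "RCN", "DATE": "D1"},
    "m5": {"ORDER": "Order", "IP": "BP", "ICN": "RCN", "DATE": "D2"},
}
tree = build_block_tree(children, "ORDER", mappings, threshold=0.4)
```

The pruning step is where Lemma 2 pays off: once an element has no c-block, none of its ancestors needs any combination work.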

The block tree
Reducing the storage cost of uncertain mappings
If part of a mapping is in the block tree, then replace it with a link.
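
The link replacement can be pictured as follows. This is only a sketch of the idea: the block identifiers, the choice to try larger blocks first, and the `compress` function itself are assumptions made for the illustration.

```python
def compress(name, correspondences, blocks):
    """Replace the parts of a mapping stored in blocks by links (block ids).

    blocks: block id -> (set of correspondences C, set of mapping names M).
    Returns (links, correspondences still stored with the mapping).
    """
    remaining = set(correspondences)
    links = []
    # Try larger blocks first so that one link replaces as much as possible.
    by_size = sorted(blocks.items(), key=lambda kv: len(kv[1][0]), reverse=True)
    for block_id, (block_corrs, members) in by_size:
        if name in members and block_corrs <= remaining:
            links.append(block_id)
            remaining -= block_corrs
    return links, remaining

# Hypothetical blocks, keyed by made-up identifiers.
blocks = {
    "b_IP": (frozenset({("BP", "IP"), ("BCN", "ICN")}), {"m1", "m2"}),
    "b_ICN": (frozenset({("BCN", "ICN")}), {"m1", "m2"}),
}
m1 = {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN")}
links, remaining = compress("m1", m1, blocks)
```

In this example the mapping m1 keeps only "Order-ORDER" plus one link, instead of three correspondences.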

Outline
Introduction
Problem
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion

Query evaluation and uncertainty
The uncertainty in mappings may affect query answers.
Uncertain mappings:
  M_1: Order-ORDER, …, BCN-ICN, …
  M_2: Order-ORDER, …, RCN-ICN, …
  …
Target query Q: //ICN, which finds all ICNs (contact names of invoice parties) in the purchase order.
[Figure: an example source document, marking the nodes returned by M_1 and by M_2]

The baseline approach
Evaluate Q_T with each mapping in pM separately.
Drawback: when the mapping M_i is large, or h is large, the computation is expensive.
[Figure: the target query Q_T is rewritten under each mapping (M_i, Pr(M_i)) into a source query Q_Si, which is evaluated on the source document D_S to give (R_i, Pr(M_i)), for i = 1, …, h]

Query evaluation with block tree
Consider the root of a query.
Case 1): the root is found in the block tree; then use the blocks to evaluate the whole query.

Query evaluation with block tree
Case 1): the root is found in the block tree; then use the blocks to evaluate the whole query.
  Only one mapping in the block is used
  Deal with the remaining mappings
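
The saving in case 1 comes from evaluating the query once for all mappings of a block, since they rewrite it identically. The sketch below shows only that idea, under heavy simplifications: a flat list of (label, text) nodes as the document, a single `//label` query, and one block given by hand. It is not the block tree algorithm itself, and the names, node texts and probabilities are hypothetical.

```python
def evaluate(label, document):
    """Return the text of every node carrying the given label."""
    return [text for node_label, text in document if node_label == label]

def evaluate_with_block(target_label, block, p_mappings, document):
    """block: (target -> source dict C, set of mapping names M).

    Returns ({mapping name: (answers, probability)}, number of evaluations).
    """
    block_corrs, members = block
    results, evaluations = {}, 0
    if target_label in block_corrs:
        # One evaluation serves every mapping in the block.
        shared = evaluate(block_corrs[target_label], document)
        evaluations += 1
        for name in members:
            results[name] = (shared, p_mappings[name][1])
    # The remaining mappings are handled one by one.
    for name, (mapping, prob) in p_mappings.items():
        if name not in results:
            results[name] = (evaluate(mapping.get(target_label), document), prob)
            evaluations += 1
    return results, evaluations

document = [("BCN", "Cathy"), ("RCN", "Bob")]
p_mappings = {
    "m1": ({"ICN": "BCN", "DATE": "D1"}, 0.4),
    "m2": ({"ICN": "BCN", "DATE": "D2"}, 0.35),
    "m3": ({"ICN": "RCN", "DATE": "D1"}, 0.25),
}
block = ({"ICN": "BCN"}, {"m1", "m2"})
results, evaluations = evaluate_with_block("ICN", block, p_mappings, document)
```

Three mappings are answered with two evaluations here; the gap widens as more mappings share a block.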

Query evaluation with block tree
Consider the root of a query.
Case 1): the root is found in the block tree; then use the blocks to evaluate the whole query.
Case 2): the root is not found; decompose the query (if possible), invoke recursion, and join the partial answers.

Query evaluation with block tree
Case 2): the root is not found; decompose the query (if possible), invoke recursion, and join the partial answers.
[Figure: a decomposed query whose parts are answered by a direct query, by recursion, and by a direct query, and then joined]

Outline
Introduction
Problem
  Data model
  Query model
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion

Mapping generation
A mapping m for a schema S with another schema T contains a set of correspondences (e_s, e_t)
  e_t may be EMPTY, i.e., e_s matches no element in T
  Each element in S occurs exactly once in m
  Each element in T occurs at most once in m
  The score of m is the sum of the similarities of its correspondences
Problem definition
  Given: two schemas S and T, and a set of correspondences (e_s, e_t) with similarities (the schema matching results)
  Return: h mappings m_1, …, m_h with the highest scores
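
The constraints and the score above translate directly into code. In this sketch a mapping is a list of (e_s, e_t) pairs, EMPTY is represented by `None`, and the similarity values are invented; these representation choices are assumptions made for the example.

```python
EMPTY = None  # stands for "e_s matches no element in T"

def is_valid_mapping(m, source_elements):
    """Check the two occurrence constraints on a list of (e_s, e_t) pairs."""
    sources = [s for s, _ in m]
    targets = [t for _, t in m if t is not EMPTY]
    exactly_once = sorted(sources) == sorted(source_elements)   # each element of S once
    at_most_once = len(targets) == len(set(targets))            # each element of T at most once
    return exactly_once and at_most_once

def score(m, similarity):
    """Sum of the similarities of the correspondences; EMPTY contributes nothing."""
    return sum(similarity[c] for c in m if c[1] is not EMPTY)

# Hypothetical similarities from a schema matcher.
similarity = {("Order", "ORDER"): 0.9, ("BCN", "ICN"): 0.6, ("RCN", "ICN"): 0.5}
S = ["Order", "BCN", "RCN"]
m_ok = [("Order", "ORDER"), ("BCN", "ICN"), ("RCN", EMPTY)]
m_bad = [("Order", "ORDER"), ("BCN", "ICN"), ("RCN", "ICN")]   # ICN is used twice
```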

Mapping generation
Baseline solution
  Finding the h maximum bipartite matchings (min-cost flow)
  Polynomial in the size of the bipartite graph

Mapping generation
Observation: XML schema matching is usually sparse.
Improvement: a divide-and-conquer approach
  Derive the partitions (maximal connected sub-graphs) of the bipartite graph
  Find the top-h partial mappings from each partition
  Merge
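
The three steps can be sketched as below. Two things are swapped in for brevity and should be read as assumptions: the per-partition top-h search uses brute-force enumeration instead of a ranked-assignment or min-cost-flow algorithm, so it only suits small partitions, and source elements without any correspondence are left out (they would map to EMPTY). The similarity values are invented.

```python
import heapq
from itertools import product

def partitions(edges):
    """Split {(s, t): similarity} into maximal connected sub-graphs (union-find)."""
    parent = {}
    def find(x):
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for s, t in edges:
        parent[find(("S", s))] = find(("T", t))
    groups = {}
    for (s, t), sim in edges.items():
        groups.setdefault(find(("S", s)), {})[(s, t)] = sim
    return list(groups.values())

def top_h_partial(part, h):
    """Brute-force top-h partial mappings of one partition (None means EMPTY)."""
    sources = sorted({s for s, _ in part})
    options = [[(s, None)] + [c for c in part if c[0] == s] for s in sources]
    candidates = []
    for choice in product(*options):
        targets = [t for _, t in choice if t is not None]
        if len(targets) == len(set(targets)):   # each target used at most once
            candidates.append((sum(part.get(c, 0.0) for c in choice), choice))
    return heapq.nlargest(h, candidates, key=lambda x: x[0])

def top_h_mappings(edges, h):
    """Merge the per-partition results, keeping only the h best combinations."""
    merged = [(0.0, ())]
    for part in partitions(edges):
        combos = [(a + b, ma + mb)
                  for (a, ma), (b, mb) in product(merged, top_h_partial(part, h))]
        merged = heapq.nlargest(h, combos, key=lambda x: x[0])
    return merged

edges = {("Order", "ORDER"): 0.9, ("BP", "IP"): 0.8,
         ("BCN", "ICN"): 0.6, ("RCN", "ICN"): 0.5}
best = top_h_mappings(edges, h=2)
```

Keeping only h candidates per partition is enough because any partial mapping outside a partition's top h is beaten by h better choices for that partition, so it cannot appear in a global top-h mapping.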

Outline
Introduction
Problem
Techniques
Results
Conclusion

Dataset and results
XML schemas and documents
  7 schemas for purchase orders, obtained from various e-commerce standards (e.g., XCBL, OpenTrans)
  Accompanying sample XML documents
Schema matching
  Tool: COMA++, with different schema matching methods
  10 datasets: (source-schema, target-schema, matching-method)
Target queries
  10 hand-written queries

Results
Uncertain mappings: do they really overlap?

Results
How much space does the block tree save for storing uncertain mappings? And why?

Results
Is the block tree effective?
Intuitively, larger blocks tend to be more useful.

Results
The block tree can be created efficiently.
Fast, and controllable.

Results
Can the block tree really help to improve query performance?
Varying the total number of mappings.

Results
Can it scale?
Probabilistic twig query and top-k query.

Results
Top-h mapping generation
Performance gain of partitioning.

Conclusion
We study the problem of handling uncertainty in XML schema matching.
Observation
  Overlapping mappings, sparse bipartite graphs, etc.
Approach
  The block tree
  Query evaluation with the block tree
  Generating uncertain mappings more efficiently
Future work
  Other types of queries, probabilistic documents, index update, the relational scenario, etc.

Thanks!
Q & A

References
[Len02] Lenzerini, “Data integration: a theoretical perspective”, in PODS, 2002
[YP04] Yu et al, “Constraint-based XML query rewriting for data integration”, in SIGMOD, 2004
[DR02] Do et al, “COMA: a system for flexible combination of schema matching approaches”, in VLDB, 2002
[Gal06] Gal, “Managing uncertainty in schema matching with top-k schema mappings”, in J. Data Semantics VI, 2006
[DHY07] Dong et al, “Data integration with uncertainty”, in VLDB, 2007
[QYD07] Qin et al, “TwigList: make twig pattern matching fast”, in DASFAA, 2007
[Murty86] Murty, “An algorithm for ranking all the assignment in increasing order of cost”, Operations Research, vol 16, 1986
[RB01] Rahm et al, “A survey of approaches to automatic schema matching”, VLDB J, vol 10, 2001
[KYS08] Kimelfeld et al, “Query efficiency in probabilistic XML models”, in SIGMOD, 2008
…

Query rewriting
Given
  A target twig query Q_T
  A schema mapping m between S and T, which is a set of correspondences (e_s, e_t)
Mapping semantics
  For each sub-tree in the source document D_S which contains a set of source elements in m, there exists a sub-tree in the target document D_T which contains the corresponding target elements
Procedure
  For each element in Q_T, replace it with a source element
  Connect all the source elements
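
The procedure can be sketched as a recursive label replacement. The sketch assumes a twig is a nested (label, children) tuple and that the source elements are connected in the same shape as the target query; how elements are connected in general depends on the source schema, which this illustration ignores. The function returns `None` when some query element has no correspondence.

```python
def rewrite(twig, target_to_source):
    """Rewrite a target twig (label, [children]) into a source twig, or None."""
    label, children = twig
    if label not in target_to_source:
        return None                       # no correspondence: the query cannot be rewritten
    new_children = []
    for child in children:
        rewritten = rewrite(child, target_to_source)
        if rewritten is None:
            return None
        new_children.append(rewritten)
    # Replace the element and keep it connected to its rewritten children.
    return (target_to_source[label], new_children)

m1 = {"ORDER": "Order", "IP": "BP", "ICN": "BCN"}
target_query = ("ORDER", [("IP", [("ICN", [])])])
source_query = rewrite(target_query, m1)
```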

Lemma 1
An example
Lemma 1 (conceptually): the c-blocks for a schema element t can be created from the c-blocks of t’s children.

Results
What kind of queries did we use?