Computing Structural Similarity of Source XML Schemas against Domain XML Schema Jianxin Li 1 Chengfei Liu 1 Jeffrey Xu Yu 2 Jixue Liu 3 Guoren Wang 4 Chi.

Presentation transcript:

Computing Structural Similarity of Source XML Schemas against Domain XML Schema
Jianxin Li (1), Chengfei Liu (1), Jeffrey Xu Yu (2), Jixue Liu (3), Guoren Wang (4), Chi Yang (1)
(1) Swinburne University of Technology, (2) Chinese University of Hong Kong, (3) University of South Australia, (4) Northeastern University of China

2 Outline
• Motivation
• Related Work
• Problem Statement
• Structural Similarity Model
• Algorithms
• Experiments
• Conclusions and Future Work

3 Motivation
XML has become the standard for representing, exchanging and integrating data on the web. Different source providers may define different schemas for their data, depending on their applications. When no exact results exist, users still expect approximate results to be returned.
Fig. 1 Schema of the 1st source, S1. Fig. 2 Schema of the 2nd source, S2.

4 Motivation
• Users may issue queries based on their common understanding of the domain, i.e., a domain schema. For example:
Fig. 3 Domain schema T
Brief XPath queries: Q1: uni[swin]/dept[ICT]/prof; Q2: uni[swin]/lib[./cname[Hawthorn]]/book; …
• The domain schema does not match either source schema. To return approximate results efficiently, the system should determine which source schema is more similar to the domain schema.
How can the similarity between the domain schema and the source schemas be computed?

5 Related Work
Measuring the similarity between XML documents, in order to cluster XML documents:
• Edit distance: detects the changes required to turn one XML document into another, such as re-labeling, deleting, and inserting.
• Binary tree: similar to edit distance. XML documents are represented as tree-structured data, and the similarity is obtained by comparing the corresponding binary trees.
• Time series: each occurrence of a tag corresponds to a given impulse. Analyzing the frequencies gives the degree of similarity between documents.
Measuring the similarity between XML schemas, in order to derive schema matching, schema mapping or schema integration:
• Cupid, XClust and Similarity Flooding: structural match algorithms that emphasize only the name and data type similarities at the leaf level.
• COMA: the similarity between elements is computed recursively from the similarity between their respective children, using a leaf-level matcher.
In summary, all of the above methods compute similarity in a symmetric way.

6 Related Work
Example of the binary tree model BiBranch, where the smaller the BiB value, the more similar the corresponding pair of trees.
Fig. 4 Example of the BiBranch model
According to this computation, T2 is more similar to T0 than the others, which gives the sorted list T2 > T1 = T3 = T4. However, this ranking is not correct for query applications.
The symmetric similarity model cannot satisfy query needs!

7 Problem Statement
Given a domain schema tree T0 = (V0, E0, v_r0, Card) and a source schema tree T = (V, E, v_r, Card), we need to compute their structural similarity distance SSD(T0, T).
An XML schema tree is defined as T = (V, E, v_r, Card), where
• V is a finite set of nodes, representing the elements and attributes of the schema.
• E is a set of directed edges.
• v_r ∈ V is the root node of tree T.
• Card: V → {"1", "*"}.
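The tuple T = (V, E, v_r, Card) can be held in a small data structure. The Python sketch below is only an illustration: the class name SchemaTree, its fields and methods, and the sample element names are assumptions and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaTree:
    """Hypothetical container for a schema tree T = (V, E, v_r, Card)."""
    root: str                                  # v_r, the root node
    edges: set = field(default_factory=set)    # E: directed (parent, child) pairs
    card: dict = field(default_factory=dict)   # Card: node -> "1" or "*"

    @property
    def nodes(self):
        """V: every node that has a cardinality entry."""
        return set(self.card)

    def add(self, parent, child, cardinality="1"):
        """Add a directed edge parent -> child and record the child's cardinality."""
        if cardinality not in ("1", "*"):
            raise ValueError("Card maps nodes to '1' or '*'")
        self.edges.add((parent, child))
        self.card[child] = cardinality

    def children(self, node):
        """Children of a node, sorted by name for a stable order."""
        return sorted(c for p, c in self.edges if p == node)


def build_example():
    """Build a small made-up schema tree rooted at 'uni'."""
    t = SchemaTree(root="uni", card={"uni": "1"})
    t.add("uni", "dept", "*")
    t.add("dept", "prof", "*")
    t.add("uni", "lib", "1")
    t.add("lib", "book", "*")
    return t
```

Storing E as a set of (parent, child) pairs keeps the sketch close to the definition; a real implementation would more likely keep an adjacency list per node.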

8 Problem Statement
This work focuses on several different aspects:
• The purpose of the similarity computation is to choose a similar data source for queries.
• The similarity computation is asymmetric: the schema that users' queries conform to is taken as the domain schema.
• We consider the parent-child (PC) and ancestor-descendant (AD) relationships rather than sibling order, because those relationships are important in formulating a query.
• We take into account the cardinality of schema elements.
• An index based on an encoding scheme is provided to improve the efficiency of the computation.

9 Structural Similarity Model
The model takes into account three factors: element coverage, consistency of element-pair relationships, and the difference in element cardinality.
• Ratio of Interesting Object:
• Cardinality similarity of node pairs:
where V' = V ∩ V0 is the set of interesting nodes in V.
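The set of interesting nodes V' = V ∩ V0 is a plain set intersection. The Python sketch below computes only that set, since the ratio and cardinality formulas are not reproduced in this transcript. The function name and the sample element names are hypothetical, and matching nodes by identical label is an assumption.

```python
def interesting_nodes(source_nodes, domain_nodes):
    """Return V' = V ∩ V0: source nodes that also occur in the domain schema.

    Nodes are compared by label, which is an assumption of this sketch.
    """
    return set(source_nodes) & set(domain_nodes)


# Made-up node sets for a domain schema (V0) and a source schema (V).
domain = {"uni", "dept", "prof", "lib", "book", "cname"}
source = {"uni", "dept", "prof", "course", "student"}

print(sorted(interesting_nodes(source, domain)))  # ['dept', 'prof', 'uni']
```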

10 Structural Similarity Model
• Similarity of node pairs: SNP(v1, v2, v01, v02)
• Similarity of source schema w.r.t. domain schema: SSD(T0, T)

11 Structural Similarity Model
Comparison of the SSD and BiBranch models:
BiBranch model: T2 > T1 = T3 = T4
SSD model: T1 = T4 > T3 > T2
Fig. 5 Example of the SSD model
The results match our expectation!

12 Algorithms
Techniques:
• Trimming rules: root node, leaf node, internal node.
• Numbering scheme as index: pre (preorder), post (postorder), C (cardinality), P (parent), RD (rightmost descendant's preorder).
Algorithms:
• Basic Algorithm (BA): conducts pairwise comparisons.
• Improved Algorithm (IA): reduces the number of similarity comparisons.
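The numbering scheme above can be sketched with a single depth-first traversal. This Python sketch is not the authors' implementation: the function names, the dictionary layout, and the sample tree are assumptions, and it further assumes that a leaf's RD equals its own preorder number. With these labels, PC and AD relationships can be tested without walking the tree.

```python
def number_tree(root, children, card):
    """Label every node with pre, post, C, P and RD.

    children: dict mapping a node to its ordered list of children.
    card:     dict mapping a node to "1" or "*".
    """
    index = {}
    counters = {"pre": 0, "post": 0}

    def visit(node, parent):
        counters["pre"] += 1
        entry = {"pre": counters["pre"], "C": card[node], "P": parent}
        index[node] = entry
        for child in children.get(node, []):
            visit(child, node)
        # The last preorder number handed out inside this subtree belongs
        # to the rightmost descendant (or to the node itself for a leaf).
        entry["RD"] = counters["pre"]
        counters["post"] += 1
        entry["post"] = counters["post"]

    visit(root, None)
    return index


def is_ancestor(index, a, d):
    """AD test: d lies strictly inside the preorder interval of a."""
    return index[a]["pre"] < index[d]["pre"] <= index[a]["RD"]


def is_parent(index, p, c):
    """PC test: p is recorded as the parent of c."""
    return index[c]["P"] == p


# Made-up schema tree: uni -> (dept -> prof), (lib -> book)
children = {"uni": ["dept", "lib"], "dept": ["prof"], "lib": ["book"]}
card = {"uni": "1", "dept": "*", "prof": "*", "lib": "1", "book": "*"}
index = number_tree("uni", children, card)

print(is_ancestor(index, "uni", "prof"))   # True
print(is_ancestor(index, "dept", "book"))  # False
print(is_parent(index, "dept", "prof"))    # True
```

The recursive traversal is fine for schema trees of the sizes used in the experiments (up to 128 nodes); a very deep tree would need an explicit stack instead.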

13 Experiments
Response Time vs. Similarity Degree
Fig. 6 The schema size is set to 20, 40, 60 and 80 nodes, and for each size the similarity degree is set to 25%, 50%, 75% and 100%. Panels: (a) schema size = 20 nodes, (b) schema size = 40 nodes, (c) schema size = 60 nodes, (d) schema size = 80 nodes.

14 Experiments
Response Time vs. Nested Level
Fig. 7 The schema size is 128 nodes and the level varies from 4 to 16.
Speedup vs. Fanout
Fig. 8 The schema size is set to 128 nodes and the fanout varies from 2 to 5.

15 Experiments
Response Time vs. Schema Size
Fig. 9 The schema size is set to 20, 40, 60 and 80 nodes.
Fig. 10 The three public datasets: TPC-H-nested.xsd (17), genexml.xsd (85) and mondial-3.0.xsd (120).

16 Conclusions and Future Work
Contributions:
• Proposed the structural similarity problem for query applications;
• Designed a concise structural similarity model and discussed its effectiveness;
• Implemented the relevant algorithms and demonstrated their efficiency with synthetic and real data sets.
Future work:
• Improve the similarity model to make it more accurate;
• Apply the similarity model to improve query evaluation.

17 Thanks & Questions