Integrating XML Data Sources Using Approximate Joins

Presentation transcript:

Integrating XML Data Sources Using Approximate Joins
Authors: Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, Ting Yu
Source: SIGMOD02, ICDE03, ACM TODS06 (March)
Presenter: Jim
9/20/2018, Dept. of Computer Science

Outline
- Motivation
- Proposed approach
- Result
- Conclusion

Motivation: data integration
- A popular application on the Internet
- Query different data sources through one interface, e.g. an integrated search engine
- Figure: a query is sent to the Global Query Interface, which forwards it to DBLP, CiteSeer and GoogleScholar and returns the result

Motivation: data integration
- Identify the same data object in different data sources, e.g. when querying papers through the global query interface
- The same paper has to be identified across the query results
- Figure: Global Query Interface over DBLP, CiteSeer and GoogleScholar

Motivation: XML data integration
- Identify the same data object in different schemas
- General case: find the pairs of matching objects from two object sets (e.g. DBLP and CiteSeer)

Problem: approximately matching XML documents
- Given:
  - two sets of XML documents, each set conforming to a schema
  - a distance metric Dist between XML documents and a distance threshold t
- Result: all pairs of documents (Ti, Tj) such that Dist(Ti, Tj) <= t

Distance metric: string edit distance
- Minimum number of edit operations (insert, delete, change) needed to change S1 into S2
- e.g. 1: ed("abcde", "bCdef") = 3; edit operations: delete 'a', change 'c' -> 'C', insert 'f'
- e.g. 2: ed("hilton hotel", "hilson hotel") = 1; edit operation: change 't' -> 's'

Distance metric: string edit distance
- Time complexity: O(|S1|*|S2|) in the worst case
- Dynamic programming style
- Recursion step: C[i,j] is the cost (edit distance) of matching the first i characters of S with the first j characters of T
  - C[i,j] = C[i-1,j-1] if S[i] = T[j]
  - C[i,j] = 1 + min(C[i-1,j], C[i,j-1], C[i-1,j-1]) otherwise
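The recursion above can be sketched in Python as follows. This is a minimal illustration with unit costs, not code from the presentation; the function name edit_distance is an assumption, and the base cases C[i,0] = i and C[0,j] = j are the usual ones, which the slide does not spell out.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost string edit distance (insert, delete, change)."""
    # prev[j] holds C[i-1][j]; cur[j] holds C[i][j].
    prev = list(range(len(t) + 1))      # C[0][j] = j insertions
    for i, s_ch in enumerate(s, start=1):
        cur = [i]                       # C[i][0] = i deletions
        for j, t_ch in enumerate(t, start=1):
            if s_ch == t_ch:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]


print(edit_distance("abcde", "bCdef"))
print(edit_distance("hilton hotel", "hilson hotel"))
```

Keeping only two rows of the table leaves the running time at O(|S1|*|S2|) while reducing memory to O(|S2|).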

Tree edit distance
- Generalization of string edit distance
- Minimum-cost sequence of tree edit operations that transforms T1 into T2
- Operations: insertion, deletion and relabeling of nodes
- Method:
  - dynamic programming style
  - compute a minimum-cost mapping between the nodes of T1 and T2

Tree edit distance: example
- Mappings ('e' means empty): (A,A), (B,e), (D,D), (E,E), (e,H), (C,I), (F,F), (G,G)
- The mapping does not depend on the position of a node
- Cost (edit operations): delete 'B', insert 'H', relabel 'C' -> 'I'
- TreeEditDist(T1,T2) = 3

Tree edit distance: basic constraints
For any pairs (i1,j1), (i2,j2) in the mapping:
- i1 = i2 iff j1 = j2 (one-to-one)
- i1 is to the left of i2 iff j1 is to the left of j2 (sibling order preserving)
- i1 is an ancestor of i2 iff j1 is an ancestor of j2 (ancestor order preserving)

Tree edit distance: algorithms
- Dynamic programming style
- Recursion step: v and w are the rightmost nodes of the forests F1 and F2
- |F1|*|F2| subproblems for computing dist(F1,F2)
- Overall worst-case complexity: O(|T1|^2*|T2|^2)
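The recursion formula itself did not survive in the transcript. A minimal sketch of the standard rightmost-root forest recursion with unit costs is given below. The tree encoding (label, children), the function names, and the shape of T2 (reconstructed from the example mapping and the pre-order string ADHEIFG) are assumptions made for illustration. The memoized version is meant to show the recursion, not to meet the complexity bound quoted on the slide.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees.


def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(kids) for _, kids in forest)


@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    if not f1:
        return size(f2)                 # insert every node of f2
    if not f2:
        return size(f1)                 # delete every node of f1
    (v_label, v_kids), (w_label, w_kids) = f1[-1], f2[-1]
    # Delete v: its children take its place in the forest.
    delete_v = forest_dist(f1[:-1] + v_kids, f2) + 1
    # Insert w: symmetric to deletion.
    insert_w = forest_dist(f1, f2[:-1] + w_kids) + 1
    # Map v to w: match the remaining forests and the two child forests.
    match = (forest_dist(f1[:-1], f2[:-1])
             + forest_dist(v_kids, w_kids)
             + (v_label != w_label))
    return min(delete_v, insert_w, match)


def tree_edit_dist(t1, t2):
    return forest_dist((t1,), (t2,))


leaf = lambda label: (label, ())
T1 = ("A", (("B", (leaf("D"), leaf("E"))), ("C", (leaf("F"), leaf("G")))))
T2 = ("A", (leaf("D"), ("H", (leaf("E"),)), ("I", (leaf("F"), leaf("G")))))
print(tree_edit_dist(T1, T2))
```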

Tree edit distance
- Modelling XML: ordered, labelled trees
- Distance between XML documents: tree edit distance

Tree edit distance: problem
- High time complexity of TreeEditDist()
  - worst case: O(|T1|^2*|T2|^2)
  - improved: O(|T1|*|T2|*depth(T1)*depth(T2))
  - not practical for large XML documents
- How can an approximate join using TreeEditDist() be performed efficiently?

Proposed approaches
- Without indexing:
  - bounding TreeEditDist(): lower and upper bounds used as filters
  - sampling-based bounding: sample the data set and pre-compute distances to use as a filter
- With indexing:
  - R-tree indexing

Lower bound
- Idea:
  - transform a tree into a string
  - compute the string edit distance as a lower bound
- Method: use the pre-order and post-order traversals of a tree
- Rationale: every edit operation on the tree corresponds to one edit operation on the traversal string, but not vice versa, so dist(T1,T2) >= dist(S1,S2)

Lower bound: example
- pre-order(T1): ABDECFG, pre-order(T2): ADHEIFG
- ed(s1,s2) = 3, ed(T1,T2) = 3
- For the strict case '<', i.e. dist(s1,s2) < dist(T1,T2), suppose all the nodes have the same label: the strings then differ only in length, while the tree structures can still differ

Lower bound theorem
- max(StrEditDist(pre(T1),pre(T2)), StrEditDist(post(T1),post(T2))) <= TreeEditDist(T1,T2)
- Complexity: O(|T1|*|T2|) using string edit distance
- Cheaper than TreeEditDist(), which is O(|T1|^2*|T2|^2)
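A minimal Python sketch of this lower bound follows. The tree encoding (label, children), the function names, and the shape of T2 (reconstructed from the pre-order string ADHEIFG and the example mapping) are assumptions for illustration only.

```python
def preorder(tree):
    label, kids = tree
    return [label] + [x for k in kids for x in preorder(k)]


def postorder(tree):
    label, kids = tree
    return [x for k in kids for x in postorder(k)] + [label]


def edit_distance(s, t):
    """Unit-cost edit distance between two label sequences."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            if a == b:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]


def lower_bound(t1, t2):
    """max of the pre-order and post-order string edit distances."""
    return max(edit_distance(preorder(t1), preorder(t2)),
               edit_distance(postorder(t1), postorder(t2)))


leaf = lambda label: (label, ())
T1 = ("A", (("B", (leaf("D"), leaf("E"))), ("C", (leaf("F"), leaf("G")))))
T2 = ("A", (leaf("D"), ("H", (leaf("E"),)), ("I", (leaf("F"), leaf("G")))))
print("".join(preorder(T1)), "".join(preorder(T2)), lower_bound(T1, T2))
```

For this pair the pre-order strings give the larger of the two string distances, which is why the theorem takes the maximum over both traversals.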

Upper bound
Basic constraints of the TreeEditDist() mapping:
- one-to-one
- sibling order preserving
- ancestor order preserving

Upper bound: add one more constraint
- Intuition: two distinct subtrees of T1 are mapped to two distinct subtrees of T2
- For any pairs (i1,j1), (i2,j2), (i3,j3): lca(i1,i2) is a proper ancestor of i3 iff lca(j1,j2) is a proper ancestor of j3 (lca order preserving)

Upper bound: LCA constraint example
- (E,E) is no longer a valid mapping, because D and E are in the same subtree of A in T1 but in two distinct subtrees of A in T2
- TreeEditDist_UB(T1,T2) = 5: delete 'B', insert 'H', relabel 'C' -> 'I', delete 'E', insert 'E'

Upper bound: computing the upper bound
- The extra constraint reduces the number of subproblems
- Time complexity is reduced to O(|T1|*|T2|)

Summary of bounding
- Bounding methods:
  - lower bound: transform the tree to a string using a traversal
  - upper bound: add the LCA constraint to the tree edit distance
- Filtering with the bounds, given (T1,T2) and threshold t:
  - if LBDist(T1,T2) > t, then prune (T1,T2)
  - if UBDist(T1,T2) <= t, then admit (T1,T2)
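The filtering rule can be sketched as a three-way decision. The function name and the "verify" outcome (a pair that neither bound decides and that still needs the exact TreeEditDist()) are naming assumptions for illustration.

```python
def filter_pair(lb: float, ub: float, t: float) -> str:
    """Classify a document pair from its lower and upper distance bounds."""
    if lb > t:
        return "prune"    # the true distance is at least lb, so it exceeds t
    if ub <= t:
        return "admit"    # the true distance is at most ub, so it is within t
    return "verify"       # undecided: compute the exact distance


print(filter_pair(5, 9, 4), filter_pair(1, 3, 4), filter_pair(2, 6, 4))
```

Only pairs classified as "verify" pay the full cost of the exact distance computation.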

Sampling-based bounding
- TreeEditDist(), or simply Dist(), is a metric
- Triangle inequality: |Dist(di,dr) - Dist(dr,dj)| <= Dist(di,dj) <= Dist(di,dr) + Dist(dr,dj)
- Reference set: a set of K sample documents (the sampling method is described later)
- Figure: documents di and dj are compared through a reference document dr taken from the reference set {dr1, dr2, …, drk}

Reference set
- For each XML document di, pre-compute Dist() between di and every document in the reference set
- Distance vector vi = (vi1, vi2, …, vik), |vi| = K

Reference set: reference-set-based bounds
- Given an XML document pair (di,dj):
  - LB_RS = max(|vil - vjl|), 1<=l<=k
  - UB_RS = min(vil + vjl), 1<=l<=k
- Using the bounds:
  - if LB_RS > t, then prune (di,dj)
  - if UB_RS <= t, then admit (di,dj)
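A minimal Python sketch of these two bounds, assuming the distance vectors have already been pre-computed and are non-empty. The function name rs_bounds and the sample numbers are illustrative only.

```python
from typing import Sequence, Tuple


def rs_bounds(vi: Sequence[float], vj: Sequence[float]) -> Tuple[float, float]:
    """Triangle-inequality bounds on Dist(di, dj) from distance vectors."""
    lb = max(abs(a - b) for a, b in zip(vi, vj))   # tightest lower bound
    ub = min(a + b for a, b in zip(vi, vj))        # tightest upper bound
    return lb, ub


# Two documents and their distances to two reference documents.
print(rs_bounds([2, 7], [5, 3]))
```

Each reference document yields one valid lower and one valid upper bound; taking the max and the min keeps the tightest of each.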

Reference set: choosing the reference set (or simply, RS)
- Assumption: the XML data set is well separated into clusters
- Goal: make the RS decisive while keeping it small
- Optimal RS: one XML file from each cluster

Reference set: choosing the RS
- Randomly sample the whole XML data set
- Cluster the sampled XML data
  - based on TreeEditDist()
  - using CURE or BIRCH for clustering
- Take one sample from each cluster

Approximate join
- Problem statement again
  - Given:
    - two sets of XML documents
    - a distance metric Dist between XML documents
    - a distance threshold t
  - Result: all pairs of documents (Ti,Tj) such that Dist(Ti,Tj) <= t
- Approximate join algorithm
  - uses bounding and the reference set
  - no indexing

Approximate join
- Baseline algorithm: nested loop join with TreeEditDist()
- Using lower and upper bounds ONLY: nested loop join with bounds
  - if LBDist(T1,T2) > t, then prune (T1,T2)
  - if UBDist(T1,T2) <= t, then admit (T1,T2)
- Using the reference set ONLY:
  - if LB_RS(T1,T2) > t, then prune (T1,T2)
  - if UB_RS(T1,T2) <= t, then admit (T1,T2)

Approximate join: apply both bounding and sampling (1)
In sequence:
- Filter using the reference set
- Filter using the lower and upper bounds
- Apply TreeEditDist()
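The sequential scheme can be sketched as a nested loop join that tries the cheapest filter first. This is an illustrative sketch, not the authors' implementation: the function name and signature are assumptions, the distance and bound functions are passed in as parameters, the reference set is assumed non-empty, and the toy metric in the usage lines (absolute difference of integers) merely stands in for TreeEditDist().

```python
def approximate_join(left, right, refs, t, dist, lb, ub):
    """Return (matching pairs, number of exact distance computations)."""
    # Pre-compute the distance vector of every document to the reference set.
    vec_l = [[dist(d, r) for r in refs] for d in left]
    vec_r = [[dist(d, r) for r in refs] for d in right]
    result, exact = [], 0
    for a, va in zip(left, vec_l):
        for b, vb in zip(right, vec_r):
            # Step 1: reference-set filter (vector arithmetic only).
            if max(abs(x - y) for x, y in zip(va, vb)) > t:
                continue
            if min(x + y for x, y in zip(va, vb)) <= t:
                result.append((a, b))
                continue
            # Step 2: lower and upper bound filter.
            if lb(a, b) > t:
                continue
            if ub(a, b) <= t:
                result.append((a, b))
                continue
            # Step 3: exact distance for the pairs still undecided.
            exact += 1
            if dist(a, b) <= t:
                result.append((a, b))
    return result, exact


# Toy stand-ins: integers as "documents", |a - b| as the metric.
toy_dist = lambda a, b: abs(a - b)
toy_lb = lambda a, b: abs(a - b) // 2      # never above the true distance
toy_ub = lambda a, b: 2 * abs(a - b)       # never below the true distance
pairs, exact = approximate_join([0, 3, 10], [1, 4, 20], [0, 10], 2,
                                toy_dist, toy_lb, toy_ub)
print(pairs, exact)
```

In the toy run only one of the nine candidate pairs reaches the exact distance computation; the others are decided by the two filters.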

Approximate join: apply both bounding and sampling (2)
In combination:
- Compute the lower and upper bounds between each XML document and the RS, but do not compute TreeEditDist()
- Each XML document then has two vectors, vl (lower bounds) and vu (upper bounds)
- Change the bounding functions; given an XML document pair (di,dj), with vl_i[m] and vu_i[m] the m-th entries of di's vectors:
  - LB_RSC = max(max(vl_i[m] - vu_j[m], vl_j[m] - vu_i[m])), 1<=m<=k
  - UB_RSC = min(vu_i[m] + vu_j[m]), 1<=m<=k
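A minimal sketch of the combined bounds. It uses the form that follows directly from the triangle inequality: Dist(di,dj) >= Dist(di,dr) - Dist(dr,dj) >= vl_i[m] - vu_j[m] (and symmetrically with i and j swapped), and Dist(di,dj) <= vu_i[m] + vu_j[m]. The function name, the argument order and the clamping of the lower bound at 0 are assumptions made for illustration.

```python
def rsc_bounds(vl_i, vu_i, vl_j, vu_j):
    """Bounds on Dist(di, dj) when only bounds to the references are known."""
    lb = max(max(li - uj, lj - ui, 0)
             for li, ui, lj, uj in zip(vl_i, vu_i, vl_j, vu_j))
    ub = min(ui + uj for ui, uj in zip(vu_i, vu_j))
    return lb, ub


# di is within [1, 3] and [6, 8] of two references; dj within [4, 6] and [2, 3].
print(rsc_bounds([1, 6], [3, 8], [4, 2], [6, 3]))
```

The bounds are looser than those built from exact distances to the reference set, but building the vectors needs no TreeEditDist() call at all.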

R-tree indexing
- Use an R-tree index to improve performance
- Construct an R-tree index for both data sets
- A leaf node consists of the document distance vectors (vu and vl for each of its documents D1, D2, …, Dm)
- A DIR node consists of MBRs
- MBR of a leaf: max(viu) and min(vil) in each of the K dimensions

R-tree indexing: range query
- Use the index to make further improvement
- Range query (given a document Di with distance vector v, and an MBR node):
  - if there exists j (1<=j<=d) with max(vju) + vj <= t, admit Di with all leaf nodes in the MBR
  - if there exists j (1<=j<=d) with min(vjl) - vj > t or vj - max(vju) > t, prune Di with all leaf nodes in the MBR
- Figure: an R-tree whose ROOT points to MBRs, each holding max(viu) and min(vil) over the vu and vl vectors of its leaf
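The point-against-MBR test can be sketched as follows, assuming the query document's exact distance vector v and, per dimension, the MBR's smallest lower bound and largest upper bound. The function name, the parameter names and the "descend" outcome (visit the child nodes) are assumptions. The prune test compares against the largest upper bound, which is the form the triangle inequality supports for every document inside the MBR.

```python
def classify_mbr(v, mbr_min_lower, mbr_max_upper, t):
    """Decide a whole MBR at once for one query document."""
    dims = list(zip(v, mbr_min_lower, mbr_max_upper))
    # Admit: in some dimension, even the farthest document is within t.
    if any(hi + q <= t for q, _, hi in dims):
        return "admit"
    # Prune: in some dimension, even the closest document is beyond t.
    if any(lo - q > t or q - hi > t for q, lo, hi in dims):
        return "prune"
    return "descend"


print(classify_mbr([1, 5], [0, 4], [1, 6], 2))
print(classify_mbr([10, 5], [0, 4], [3, 6], 2))
print(classify_mbr([3, 5], [2, 4], [6, 7], 2))
```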

R-tree indexing: R-tree join
- Use the index to make further improvement
- R-tree join (given two sets of documents):
  - based on the range query
  - slightly modify the bounding function from (point, MBR) to (MBR, MBR)

Summary
- Baseline approximate join algorithm: nested loop join with TreeEditDist() (Naïve)
- Proposed algorithms:
  - lower and upper bound (B)
  - reference set (RS)
  - bound + reference set, in sequence (RSB)
  - bound + reference set, in combination (RSC)
  - R-tree indexing support

Experimental result: setup
- Data sets:
  - Data Set A: IBM XML data generator
  - Data Set B: A arranged into 8 clusters
  - Data Set C: DBLP data
- Test query: approximate self-join on each data set

Experimental result: tightness of the lower and upper bounds
- x-axis: ratio of bound to TreeEditDist()
- y-axis: number of pairs

Experimental result: processing time (on data set A)
- y-axis: ratio to naïve (logarithmic scale)
- x-axis: reference set size

Experimental result: processing time (on data set B)
- Data set B has 8 clusters

Experimental result: components of the computing time (on data set B)
- Optimal: |RS| = 8

Experimental result: processing time on data set A (no clusters)
- x-axis: distance threshold
- Larger threshold: the upper bound does most of the filtering
- Smaller threshold: the lower bound does most of the filtering

Experimental result: using the R-tree index
- x-axis: distance threshold
- y-axis: vector computation ratio to RSB

Conclusion
- Motivation:
  - data integration in the context of XML
  - approximate join on XML documents
- Approaches:
  - lower and upper bounds
  - filtering with a reference set
  - R-tree indexing

Q & A
Happy New Year!