Integrating XML Data Sources Using Approximate Joins. Authors: Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, Ting Yu. Source: SIGMOD02, ICDE03, ACM TODS06 March. Presenter: Jim. 9/20/2018, Dept. of Computer Science
Outline: Motivation; Proposed approach; Result; Conclusion.
Motivation: data integration is a popular application on the Internet. One interface queries the different data sources, e.g. an integration of search engines. (Figure: a query is sent through the Global Query Interface to DBLP, CiteSeer and GoogleScholar, and the result is returned.)
Motivation: data integration must identify the same data object in different data sources, e.g. when querying papers from the global query interface, the same paper needs to be identified in the query results. (Figure: a query is sent through the Global Query Interface to DBLP, CiteSeer and GoogleScholar, and the result is returned.)
Motivation: XML data integration must identify the same data object under different schemas. General case: find the matching pairs of objects from two object sets (e.g. DBLP and CiteSeer).
Problem: approximately matching XML documents. Given: two sets of XML documents, each set conforming to a schema; a distance metric Dist between XML documents; and a distance threshold t. Result: all pairs of documents (Ti,Tj) such that Dist(Ti,Tj) <= t.
Distance metric, string edit distance: the minimum number of edit operations (insert, delete, change) needed to change S1 into S2. E.g. 1: ed("abcde", "bCdef") = 3, with edit operations: delete 'a', change 'c' -> 'C', insert 'f'. E.g. 2: ed("hilton hotel", "hilson hotel") = 1, with edit operation: change 't' -> 's'.
Distance metric, string edit distance: worst-case time complexity $O(|S_1| \cdot |S_2|)$, computed by dynamic programming. Recursion step: $C_{i,j}$ is the cost (edit distance) of matching the first $i$ characters of $S$ with the first $j$ characters of $T$; $C_{i,j} = C_{i-1,j-1}$ if $S[i] = T[j]$, and $C_{i,j} = 1 + \min(C_{i-1,j},\ C_{i,j-1},\ C_{i-1,j-1})$ otherwise.
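The recursion step above can be sketched in a few lines of Python. This is a minimal illustration with unit costs, not code from the paper; the function name `edit_distance` is chosen here for illustration, and it keeps only two rows of the table to save memory.

```python
def edit_distance(s, t):
    """Unit-cost string edit distance (insert, delete, change)."""
    # prev[j] holds C[i-1][j]; cur[j] holds C[i][j].
    prev = list(range(len(t) + 1))  # C[0][j] = j insertions
    for i in range(1, len(s) + 1):
        cur = [i]  # C[i][0] = i deletions
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur.append(prev[j - 1])  # characters match, no extra cost
            else:
                cur.append(1 + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]
```

Applied to the two examples from the previous slide, it gives 3 for ("abcde", "bCdef") and 1 for ("hilton hotel", "hilson hotel").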
Tree edit distance: a generalization of string edit distance, defined as the minimum-cost sequence of tree edit operations that transforms T1 into T2. Operations: insertion, deletion and re-labelling of nodes. Method: dynamic programming, computing a mapping between the nodes of T1 and T2 with minimum cost.
Tree edit distance, example: mappings ('e' means empty): (A,A), (B,e), (D,D), (E,E), (e,H), (C,I), (F,F), (G,G). Node position is NOT taken into account. Cost (edit operations): delete 'B', insert 'H', re-label 'C' -> 'I', so TreeEditDist(T1,T2) = 3.
Tree edit distance, basic constraints: for any pairs (i1,j1), (i2,j2) in the mapping: i1 = i2 iff j1 = j2 (one-to-one); i1 is to the left of i2 iff j1 is to the left of j2 (sibling order preserving); i1 is an ancestor of i2 iff j1 is an ancestor of j2 (ancestor order preserving).
Tree edit distance, algorithms: dynamic programming. Recursion step: v and w are the rightmost nodes of the forests F1 and F2; there are $|F_1| \cdot |F_2|$ sub-problems for computing dist(F1,F2). Overall worst-case complexity: $O(|T_1|^2 \cdot |T_2|^2)$.
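As a rough illustration of the forest recursion, the sketch below uses the textbook unit-cost formulation over the rightmost roots v and w, memoized with `lru_cache`. It is not the optimized algorithm from the paper. The tuple encoding, the helper `node`, and the shapes of T1 and T2 are assumptions, with the trees reconstructed from the mapping and pre-order strings given on the example slides.

```python
from functools import lru_cache

def node(label, *kids):
    """A tree is (label, tuple_of_child_trees); a forest is a tuple of trees."""
    return (label, tuple(kids))

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(kids) for _, kids in forest)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    if not f1:
        return size(f2)          # insert everything in f2
    if not f2:
        return size(f1)          # delete everything in f1
    (v_label, v_kids), (w_label, w_kids) = f1[-1], f2[-1]
    # Delete v: its children take its place as rightmost roots.
    delete_v = forest_dist(f1[:-1] + v_kids, f2) + 1
    # Insert w: symmetric case on the second forest.
    insert_w = forest_dist(f1, f2[:-1] + w_kids) + 1
    # Map v to w: match their subtrees and the remaining forests separately.
    match = (forest_dist(f1[:-1], f2[:-1])
             + forest_dist(v_kids, w_kids)
             + int(v_label != w_label))
    return min(delete_v, insert_w, match)

def tree_edit_dist(t1, t2):
    return forest_dist((t1,), (t2,))

# Assumed shapes of the example trees (the slide figure is not available).
T1 = node("A", node("B", node("D"), node("E")), node("C", node("F"), node("G")))
T2 = node("A", node("D"), node("H", node("E")), node("I", node("F"), node("G")))
```

With these assumed trees, `tree_edit_dist(T1, T2)` agrees with the distance of 3 stated in the example (delete B, insert H, re-label C to I).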
Tree edit distance: XML documents are modelled as ordered, labelled trees; the distance between XML documents is the tree edit distance.
Tree edit distance, problem: TreeEditDist() has high time complexity. Worst case: $O(|T_1|^2 \cdot |T_2|^2)$; improved: $O(|T_1| \cdot |T_2| \cdot \mathrm{depth}(T_1) \cdot \mathrm{depth}(T_2))$. This is not practical for large XML documents. How can an approximate join using TreeEditDist() be performed efficiently?
Proposed approaches. Without indexing: bounding TreeEditDist(), using a lower and an upper bound as filters; sampling-based bounding, sampling the data set and pre-computing distances to use as a filter. With indexing: R-tree indexing.
Lower bound. Idea: transform a tree into a string and use the string edit distance as a lower bound. Method: use the pre-order and post-order traversals of the tree. Rationale: every edit operation on the tree corresponds to an edit operation on the string, but NOT vice versa, so dist(T1,T2) >= dist(S1,S2).
Lower bound, example: pre-order(T1) = ABDECFG, pre-order(T2) = ADHEIFG; ed(s1,s2) = 3 and ed(T1,T2) = 3. For the strict case '<', i.e. dist(s1,s2) < dist(T1,T2), suppose all the nodes have the same label.
Lower bound theorem: $\max(\mathrm{StrEditDist}(\mathrm{pre}(T_1), \mathrm{pre}(T_2)),\ \mathrm{StrEditDist}(\mathrm{post}(T_1), \mathrm{post}(T_2))) \le \mathrm{TreeEditDist}(T_1, T_2)$. Complexity: $O(|T_1| \cdot |T_2|)$ using string edit distance, which is cheaper than the $O(|T_1|^2 \cdot |T_2|^2)$ of TreeEditDist().
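A minimal sketch of the lower bound: traverse each tree in pre-order and post-order, then take the larger of the two string edit distances. The tuple encoding of trees, the helper `node`, and the shapes of T1 and T2 are assumptions reconstructed from the pre-order strings on the example slide. Label sequences are kept as lists so labels longer than one character would also work.

```python
def node(label, *kids):
    """A tree is (label, tuple_of_child_trees)."""
    return (label, tuple(kids))

def preorder(tree):
    label, kids = tree
    return [label] + [x for k in kids for x in preorder(k)]

def postorder(tree):
    label, kids = tree
    return [x for k in kids for x in postorder(k)] + [label]

def edit_distance(s, t):
    """Unit-cost edit distance between two sequences of labels."""
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        cur = [i]
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]

def lower_bound(t1, t2):
    """max of the pre-order and post-order string edit distances."""
    return max(edit_distance(preorder(t1), preorder(t2)),
               edit_distance(postorder(t1), postorder(t2)))

# Assumed shapes of the example trees (the slide figure is not available).
T1 = node("A", node("B", node("D"), node("E")), node("C", node("F"), node("G")))
T2 = node("A", node("D"), node("H", node("E")), node("I", node("F"), node("G")))
```

For these trees the pre-order strings are ABDECFG and ADHEIFG, and the bound evaluates to 3, which matches the tree edit distance stated in the example.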
Upper bound: the basic constraints of TreeEditDist() are one-to-one, sibling order preserving and ancestor order preserving.
Upper bound: add one more constraint. Intuition: two distinct subtrees of T1 are mapped to two distinct subtrees of T2. For any pairs (i1,j1), (i2,j2), (i3,j3): lca(i1,i2) is a proper ancestor of i3 iff lca(j1,j2) is a proper ancestor of j3 (LCA order preserving).
Upper bound, LCA constraint example: (E,E) is no longer a valid mapping, as D and E are in the same subtree of A in T1 but in two distinct subtrees of A in T2. TreeEditDist_UB(T1,T2) = 5: delete 'B', insert 'H', re-label 'C' -> 'I', delete 'E', insert 'E'.
Upper bound, computing the upper bound: the number of sub-problems is reduced, which reduces the time complexity to $O(|T_1| \cdot |T_2|)$.
Summary of bounding. Lower bound: transform the tree into a string using a traversal. Upper bound: add the LCA constraint to the tree edit distance. Filtering with the bounds, given (T1,T2) and threshold t: if LBDist(T1,T2) > t, then prune (T1,T2); if UBDist(T1,T2) <= t, then admit (T1,T2).
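The filtering rule can be written as a three-way decision. This is a small sketch in which the function name `filter_pair` and the label "verify" (meaning the pair still needs the exact TreeEditDist()) are illustrative choices, not terms from the slides.

```python
def filter_pair(lower, upper, t):
    """Decide the fate of a document pair from its distance bounds.

    lower and upper are assumed to satisfy lower <= true distance <= upper.
    """
    if lower > t:
        return "prune"   # the true distance must exceed t
    if upper <= t:
        return "admit"   # the true distance cannot exceed t
    return "verify"      # bounds are inconclusive; compute the exact distance
```

For the running example with a lower bound of 3 and an upper bound of 5, a threshold of 2 prunes the pair, a threshold of 5 admits it, and a threshold of 4 leaves it for exact verification.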
Sampling-based bounding: TreeEditDist(), or simply Dist(), is a metric, so the triangle inequality holds: $|\mathrm{Dist}(d_i, d_r) - \mathrm{Dist}(d_r, d_j)| \le \mathrm{Dist}(d_i, d_j) \le \mathrm{Dist}(d_i, d_r) + \mathrm{Dist}(d_r, d_j)$. Reference set: a set of K sample documents (the sampling method is described later). (Figure: documents di and dj and a reference document dr taken from the reference set dr1, dr2, ..., drk.)
Reference set: for each XML document di, pre-compute Dist() between di and every document in the reference set. This gives a distance vector $v_i = (v_{i1}, v_{i2}, \dots, v_{ik})$ with $|v_i| = K$.
Reference set, reference-set-based bounds: given an XML document pair (di,dj), $\mathrm{LB\_RS} = \max_{1 \le l \le k} |v_{il} - v_{jl}|$ and $\mathrm{UB\_RS} = \min_{1 \le l \le k} (v_{il} + v_{jl})$. Using the bounds: if LB_RS > t, then prune (di,dj); if UB_RS <= t, then admit (di,dj).
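A short sketch of the reference-set bounds. To keep the example runnable without a tree edit distance implementation, it uses integers as stand-in documents and the absolute difference as a stand-in metric; both are assumptions for illustration only, as are the function names.

```python
def distance_vector(doc, reference_set, dist):
    """Pre-computed distances from one document to every reference document."""
    return [dist(doc, r) for r in reference_set]

def rs_bounds(vi, vj):
    """Lower and upper bounds on Dist(di, dj) from the triangle inequality."""
    lb = max(abs(a - b) for a, b in zip(vi, vj))
    ub = min(a + b for a, b in zip(vi, vj))
    return lb, ub

# Stand-in metric and data (hypothetical): documents are points on a line.
toy_dist = lambda a, b: abs(a - b)
reference_set = [0, 10]
vi = distance_vector(3, reference_set, toy_dist)   # [3, 7]
vj = distance_vector(5, reference_set, toy_dist)   # [5, 5]
bounds = rs_bounds(vi, vj)                          # (2, 8)
```

Here the true distance between the two stand-in documents is 2, which lies inside the computed interval, as the triangle inequality guarantees for any metric.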
Reference set, choosing the reference set (or simply RS). Assumption: the XML data set is well separated. Goal: make the RS decisive while keeping it small. Optimal RS: one XML file from each cluster. Recall $\mathrm{LB\_RS} = \max_{1 \le l \le k} |v_{il} - v_{jl}|$ and $\mathrm{UB\_RS} = \min_{1 \le l \le k} (v_{il} + v_{jl})$.
Reference set, choosing the RS: randomly sample the whole XML data set; cluster the sampled XML data based on TreeEditDist(), using CURE or BIRCH; take one sample from each cluster as the reference set.
Approximate join, problem statement again. Given: two sets of XML documents, a distance metric Dist between XML documents and a distance threshold t. Result: all pairs of documents (Ti,Tj) such that Dist(Ti,Tj) <= t. Approximate join algorithms: using bounding and the reference set, with no indexing.
Approximate join. Baseline algorithm: nested loop join with TreeEditDist(). Using lower and upper bounds ONLY: nested loop join with bounds; if LBDist(T1,T2) > t, then prune (T1,T2); if UBDist(T1,T2) <= t, then admit (T1,T2). Using the reference set ONLY: if LB_RS(T1,T2) > t, then prune (T1,T2); if UB_RS(T1,T2) <= t, then admit (T1,T2).
Approximate join, applying both bounding and sampling (1), in sequence: first filter using the reference set, then filter using the lower and upper bounds, then apply TreeEditDist() to the remaining pairs.
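The sequential scheme can be sketched as a nested loop join that tries each filter in order and falls back to the exact distance only when no filter decides the pair. Everything below the function is a stand-in chosen so the example runs on its own: integers as documents, absolute difference as the metric, and a hypothetical `loose_bounds` in place of the string-based lower bound and LCA-constrained upper bound.

```python
def approximate_join(left, right, t, dist, filters):
    """Nested loop join; filters is an ordered list of (a, b) -> (lower, upper)."""
    result, exact_calls = [], 0
    for a in left:
        for b in right:
            decision = None
            for bounds in filters:
                lower, upper = bounds(a, b)
                if lower > t:
                    decision = False   # pruned by this filter
                    break
                if upper <= t:
                    decision = True    # admitted by this filter
                    break
            if decision is None:       # no filter was decisive
                exact_calls += 1
                decision = dist(a, b) <= t
            if decision:
                result.append((a, b))
    return result, exact_calls

# Stand-in metric, reference set and filters (all hypothetical).
toy_dist = lambda a, b: abs(a - b)
REFS = [0, 100]

def rs_filter(a, b):
    va = [toy_dist(a, r) for r in REFS]
    vb = [toy_dist(b, r) for r in REFS]
    return (max(abs(x - y) for x, y in zip(va, vb)),
            min(x + y for x, y in zip(va, vb)))

def loose_bounds(a, b):
    d = toy_dist(a, b)
    return d - 1, d + 1

pairs, exact_calls = approximate_join(
    [1, 5, 40], [2, 30, 41], 3, toy_dist, [rs_filter, loose_bounds])
```

Because every filter returns valid bounds, the join result is the same as the one the baseline nested loop would produce; only the number of exact distance computations changes.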
Approximate join, applying both bounding and sampling (2), in combination: compute only the lower and upper bounds between each XML document and the RS, NOT TreeEditDist(). Each XML document then has two vectors, $v^l$ and $v^u$. The bounding functions change accordingly. Given an XML document pair (di,dj): $\mathrm{LB\_RSC} = \max_{1 \le m \le k} \max(v^l_{im} - v^u_{jm},\ v^l_{jm} - v^u_{im})$ and $\mathrm{UB\_RSC} = \min_{1 \le m \le k} (v^u_{im} + v^u_{jm})$.
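A sketch of the combined bounds when each document stores only lower-bound and upper-bound vectors to the reference set. The lower bound is written here in the form that follows directly from the triangle inequality (the lower bound of one document minus the upper bound of the other, in both directions); treating that as the intended formula is an assumption. The function name and sample vectors are illustrative.

```python
def rsc_bounds(vl_i, vu_i, vl_j, vu_j):
    """Bounds on Dist(di, dj) from per-reference lower/upper bound vectors.

    vl_x[m] <= Dist(dx, r_m) <= vu_x[m] is assumed for every reference r_m.
    """
    lb = max(max(li - uj, lj - ui)
             for li, ui, lj, uj in zip(vl_i, vu_i, vl_j, vu_j))
    ub = min(ui + uj for ui, uj in zip(vu_i, vu_j))
    return max(lb, 0), ub   # a distance is never negative

# Hypothetical vectors for two documents and two reference documents.
bounds = rsc_bounds([1, 8], [2, 9], [6, 2], [7, 3])   # (5, 9)
```

The trade-off is that these bounds are looser than the ones built from exact distances to the reference set, but the pre-computation avoids TreeEditDist() entirely.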
R-tree indexing: use an R-tree index to make improvement. Construct an R-tree index for both data sets. A leaf node consists of document distance vectors; a DIR node consists of MBRs. (Figure: an R-tree leaf node holding documents D1, D2, ..., Dm, each with vectors vu and vl of length K; the MBR of the leaf is given by max(viu) and min(vil).)
R-tree indexing: use the index to make further improvement. Range query (given a document Di with distance vector v and an MBR node): if there exists j (1 <= j <= d) with $\max(v^u_j) + v_j \le t$, admit Di with all leaf nodes in the MBR; if there exists j (1 <= j <= d) with $\min(v^l_j) - v_j > t$ or $v_j - \max(v^u_j) > t$, prune Di with all leaf nodes in the MBR. (Figure: an R-tree whose nodes store max(viu) and min(vil) over the vu and vl vectors of their children.)
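The range-query test against one MBR can be sketched as below. The MBR is represented by two plain lists, the per-dimension minimum of the lower-bound vectors and the per-dimension maximum of the upper-bound vectors, which is an assumed encoding. The second pruning test uses the MBR's max(vu), the value the MBR stores, which is an assumption about the intended condition. The label "descend" (visit the children) is an illustrative name.

```python
def classify_against_mbr(v, mbr_min_lower, mbr_max_upper, t):
    """Compare a query document's distance vector v with one MBR.

    v[j] is the query's distance to reference j; every document d under
    the MBR satisfies mbr_min_lower[j] <= Dist(d, r_j) <= mbr_max_upper[j].
    """
    # Admit: some reference document is close to the query and to all entries.
    if any(u + x <= t for x, u in zip(v, mbr_max_upper)):
        return "admit"
    # Prune: some reference document separates the query from all entries.
    if any(l - x > t or x - u > t
           for x, l, u in zip(v, mbr_min_lower, mbr_max_upper)):
        return "prune"
    return "descend"   # inconclusive: examine the children of this node
```

For example, with t = 3, the vector [1, 9] against an MBR with minimum lower bounds [1, 1] and maximum upper bounds [2, 3] is admitted, because 1 + 2 <= 3 in the first dimension.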
R-tree indexing: use the index to make further improvement. R-tree join (given two sets of documents): based on the range query, with the bounding function slightly modified from (point, MBR) to (MBR, MBR). (Figure: an R-tree whose nodes store max(viu) and min(vil) over the vu and vl vectors of their children.)
Summary. Baseline approximate join algorithm: nested loop join with TreeEditDist() (Naïve). Proposed algorithms: lower and upper bound (B); reference set (RS); bound + reference set, either in sequence (RSB) or in combination (RSC); R-tree indexing support.
Experimental result, data sets. Data Set A: IBM XML data generator. Data Set B: A arranged into 8 clusters. Data Set C: DBLP data. Test query: approximate self-join on each data set.
Experimental result: tightness of the lower and upper bounds. (Figure: x-axis: ratio of bound to TreeEditDist(); y-axis: number of pairs.)
Experimental result: processing time (on data set A). (Figure: x-axis: reference set size; y-axis: ratio to naïve, logarithmic scale.)
Experimental result: processing time (on data set B). Data set B has 8 clusters.
Experimental result: components of computing time (on data set B). Optimal: |RS| = 8.
Experimental result: processing time on data set A (no clusters). (Figure: x-axis: distance threshold.) Larger threshold: upper bound; smaller threshold: lower bound.
Experimental result: using the R-tree index. (Figure: x-axis: distance threshold; y-axis: vector computation ratio to RSB.)
Conclusion. Motivation: data integration in the context of XML; approximate join on XML documents. Approaches: lower and upper bounds; using a reference set for filtering; R-tree indexing.
Q & A. Happy New Year!