Published by Hendra Irawan. Modified over 6 years ago.
1
Integrating XML Data Sources Using Approximate Joins
Authors: Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, Ting Yu
Source: SIGMOD02, ICDE03, ACM TODS06 March
Presenter: Jim
9/20/2018, Dept. of Computer Science
2
Outline
- Motivation
- Proposed approach
- Result
- Conclusion
3
Motivation
- Data integration
  - A popular application on the Internet
  - Query different data sources through one interface, e.g. an integration of search engines
[Figure: a Global Query Interface forwards a query to DBLP, CiteSeer, and Google Scholar and returns the result]
4
Motivation
- Data integration
  - Identify the same data object in different data sources
  - E.g. when querying papers through the global query interface, the same paper must be identified across the query results
[Figure: a Global Query Interface forwards a query to DBLP, CiteSeer, and Google Scholar and returns the result]
5
Motivation
- XML data integration
  - Identify the same data object under different schemas
- General case
  - Find pairs of matching objects from two object sets
[Figure: two object sets, DBLP and CiteSeer]
6
Problem
- Approximately matching XML documents
- Given
  - Two sets of XML documents, each set conforming to a schema
  - A distance metric Dist between XML documents and a distance threshold t
- Result
  - All pairs of documents (Ti,Tj) such that Dist(Ti,Tj) <= t
7
Distance metric
- String edit distance
  - Minimum number of edit operations (insert, delete, change) needed to transform S1 into S2
  - Example 1: ed("abcde", "bCdef") = 3; edit operations: delete 'a', change 'c' -> 'C', insert 'f'
  - Example 2: ed("hilton hotel", "hilson hotel") = 1; edit operation: change 't' -> 's'
8
Distance metric
- String edit distance
  - Time complexity: O(|S1|*|S2|) in the worst case
  - Computed by dynamic programming
  - Recursion step: C_{i,j} is the cost of matching the first i characters of S against the first j characters of T (their edit distance)

$$C_{i,j} = \begin{cases} C_{i-1,j-1} & \text{if } S[i] = T[j] \\ 1 + \min\left(C_{i-1,j},\; C_{i,j-1},\; C_{i-1,j-1}\right) & \text{otherwise} \end{cases}$$
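The recursion can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function name `edit_distance` and the full-table layout are assumptions, and every operation is assumed to cost 1.

```python
def edit_distance(s: str, t: str) -> int:
    """String edit distance with unit-cost insert, delete, and change."""
    m, n = len(s), len(t)
    # c[i][j] holds the edit distance between s[:i] and t[:j].
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        c[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        c[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                # Matching characters cost nothing.
                c[i][j] = c[i - 1][j - 1]
            else:
                # One operation plus the cheapest of delete, insert, change.
                c[i][j] = 1 + min(c[i - 1][j], c[i][j - 1], c[i - 1][j - 1])
    return c[m][n]


print(edit_distance("abcde", "bCdef"))                # 3
print(edit_distance("hilton hotel", "hilson hotel"))  # 1
```

The two calls reproduce the examples from the previous slide. The table has (|S|+1)*(|T|+1) cells and each cell takes constant work, which gives the O(|S1|*|S2|) bound.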
9
Tree edit distance
- Tree edit distance
  - Generalization of string edit distance
  - Minimum-cost sequence of tree edit operations that transforms T1 into T2
  - Operations: insertion, deletion, and relabeling of nodes
- Method
  - Dynamic programming
  - Compute a minimum-cost mapping between the nodes of T1 and T2
10
Tree edit distance: example
- Mapping ('e' means empty): (A,A), (B,e), (D,D), (E,E), (e,H), (C,I), (F,F), (G,G)
- The position of a node does NOT matter
- Cost (edit operations): delete 'B', insert 'H', relabel 'C' -> 'I'
- TreeEditDist(T1,T2) = 3
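The cost of a given mapping can be read off mechanically. The sketch below assumes unit cost per operation and identifies nodes by their labels, which works for this example only because all labels are distinct. `mapping_cost` is a hypothetical helper name, and `None` stands in for the empty node 'e'.

```python
from typing import List, Optional, Tuple

# A mapping pair; None plays the role of the empty node 'e'.
Pair = Tuple[Optional[str], Optional[str]]


def mapping_cost(mapping: List[Pair]) -> int:
    """Count the unit-cost edit operations implied by a node mapping."""
    cost = 0
    for left, right in mapping:
        if left is None or right is None:
            cost += 1  # (x, e) is a deletion, (e, y) is an insertion
        elif left != right:
            cost += 1  # differing labels mean a relabel
    return cost


example = [("A", "A"), ("B", None), ("D", "D"), ("E", "E"),
           (None, "H"), ("C", "I"), ("F", "F"), ("G", "G")]
print(mapping_cost(example))  # 3
```

This only prices a mapping that is already known. Finding the minimum-cost mapping is the expensive part.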
11
Tree edit distance: basic constraints
- For any pairs (i1,j1), (i2,j2) in the mapping:
  - i1 = i2 iff j1 = j2 (one-to-one)
  - i1 is to the left of i2 iff j1 is to the left of j2 (sibling order preserving)
  - i1 is an ancestor of i2 iff j1 is an ancestor of j2 (ancestor order preserving)
12
Tree edit distance: algorithms
- Dynamic programming
- Recursion step: v and w are the rightmost nodes of F1 and F2
- |F1|*|F2| sub-problems for computing dist(F1,F2)
- Overall worst-case complexity: O(|T1|^2 * |T2|^2)
13
Tree edit distance
- Modelling XML: ordered, labelled trees
- Distance between XML documents: tree edit distance
14
Tree edit distance
- Problem: high time complexity of TreeEditDist()
  - Worst case: O(|T1|^2 * |T2|^2)
  - Improved: O(|T1|*|T2|*depth(T1)*depth(T2))
  - Not practical for large XML documents
- How can an approximate join be performed efficiently using TreeEditDist()?
15
Proposed approaches
- Without indexing
  - Bounding TreeEditDist(): use lower and upper bounds as filters
  - Sampling-based bounding: sample the data set and pre-compute distances to use as a filter
- With indexing
  - R-tree indexing
16
Lower bound
- Idea
  - Transform a tree into a string
  - Compute the string edit distance as a lower bound
- Method
  - Use the pre-order and post-order traversals of the tree
- Rationale
  - Every edit operation on the tree corresponds to an edit operation on the traversal string, but NOT vice versa
  - dist(T1,T2) >= dist(S1,S2)
17
Lower bound: example
- pre-order(T1): ABDECFG, pre-order(T2): ADHEIFG
- ed(s1,s2) = 3, ed(T1,T2) = 3
- For the strict case '<', i.e. dist(s1,s2) < dist(T1,T2), suppose all the nodes have the same label
18
Lower bound theorem
- max(StrEditDist(pre(T1),pre(T2)), StrEditDist(post(T1),post(T2))) <= TreeEditDist(T1,T2)
- Complexity: O(|T1|*|T2|) using string edit distance
  - Note: O(|T1|^2 * |T2|^2) for tree edit distance
  - Cheaper than TreeEditDist()
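The theorem translates directly into code. In the sketch below, the tree shapes for T1 and T2 are assumptions reconstructed from the pre-order strings and the mapping in the earlier example, since the figure itself is not in the transcript. The `(label, children)` tuple encoding, the single-character labels, and all function names are also illustrative choices.

```python
from typing import List, Tuple

# A tree is (label, list of child trees); labels are single characters here.
Tree = Tuple[str, List["Tree"]]


def preorder(tree: Tree) -> str:
    label, children = tree
    return label + "".join(preorder(c) for c in children)


def postorder(tree: Tree) -> str:
    label, children = tree
    return "".join(postorder(c) for c in children) + label


def edit_distance(s: str, t: str) -> int:
    """Unit-cost string edit distance, keeping only one table row at a time."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            if a == b:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]


def tree_distance_lower_bound(t1: Tree, t2: Tree) -> int:
    """max of the pre-order and post-order string edit distances."""
    return max(edit_distance(preorder(t1), preorder(t2)),
               edit_distance(postorder(t1), postorder(t2)))


# Assumed shapes: T1 = A(B(D,E), C(F,G)), T2 = A(D, H(E), I(F,G)).
T1: Tree = ("A", [("B", [("D", []), ("E", [])]), ("C", [("F", []), ("G", [])])])
T2: Tree = ("A", [("D", []), ("H", [("E", [])]), ("I", [("F", []), ("G", [])])])

print(preorder(T1), preorder(T2))         # ABDECFG ADHEIFG
print(tree_distance_lower_bound(T1, T2))  # 3
```

For these assumed trees the pre-order strings give 3 and the post-order strings give 2, so the bound is 3, which here equals the tree edit distance stated in the example.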
19
Upper bound
- Basic constraints of TreeEditDist()
  - one-to-one
  - sibling order preserving
  - ancestor order preserving
20
Upper bound
- Add one more constraint
- Intuition: two distinct subtrees of T1 are mapped to two distinct subtrees of T2
- For any pairs (i1,j1), (i2,j2), (i3,j3):
  - lca(i1,i2) is a proper ancestor of i3 iff lca(j1,j2) is a proper ancestor of j3 (lca order preserving)
21
Upper bound: LCA constraint example
- (E,E) is no longer a valid mapping, because D and E are in the same subtree of A in T1 but in two distinct subtrees of A in T2
- TreeEditDist_UB(T1,T2) = 5
  - delete 'B', insert 'H'
  - relabel 'C' -> 'I'
  - delete 'E', insert 'E'
22
Upper bound
- Computing the upper bound
  - Reduces the number of sub-problems
  - Reduces the time complexity to O(|T1|*|T2|)
23
Summary of bounding
- Bounding methods
  - Lower bound: transform the tree into a string using traversals
  - Upper bound: add the LCA constraint to the tree edit distance
- Filtering with bounds: given (T1,T2) and threshold t
  - if LBDist(T1,T2) > t, then prune (T1,T2)
  - if UBDist(T1,T2) <= t, then admit (T1,T2)
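The filtering rule leaves three possible outcomes for a pair. A minimal sketch, assuming the bounds have already been computed; `classify_pair` and the outcome strings are hypothetical names.

```python
def classify_pair(lb: float, ub: float, t: float) -> str:
    """Decide what to do with a pair given its distance bounds and threshold t."""
    if lb > t:
        return "prune"   # the true distance is at least lb, so it exceeds t
    if ub <= t:
        return "admit"   # the true distance is at most ub, so it is within t
    return "verify"      # bounds are inconclusive; TreeEditDist() is still needed


print(classify_pair(lb=4, ub=6, t=3))  # prune
print(classify_pair(lb=1, ub=2, t=3))  # admit
print(classify_pair(lb=2, ub=5, t=3))  # verify
```

Only the pairs in the third group need the expensive exact computation, so the tighter the bounds, the fewer calls to TreeEditDist() remain.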
24
Sampling based bounding
- TreeEditDist(), or simply Dist(), is a metric
- Triangle inequality: |Dist(di,dr) - Dist(dr,dj)| <= Dist(di,dj) <= Dist(di,dr) + Dist(dr,dj)
- Reference set
  - A set of K sample documents
  - The sampling method is described later
[Figure: documents di and dj, a reference document dr, and the reference set dr1, dr2, ..., drk]
25
Reference set
- For each XML document di
  - Pre-compute Dist() between di and every document in the reference set
  - Distance vector vi = (vi1, vi2, ..., vik), with |vi| = K
26
Reference set
- Reference-set-based bounds: given an XML document pair (di,dj)
  - $LB\_RS = \max_{1 \le l \le k} |v_{il} - v_{jl}|$
  - $UB\_RS = \min_{1 \le l \le k} (v_{il} + v_{jl})$
- Using the bounds
  - if LB_RS > t, then prune (di,dj)
  - if UB_RS <= t, then admit (di,dj)
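Both bounds follow from applying the triangle inequality once per reference document. A short sketch, assuming each document's distance vector is already available as a list of numbers; `reference_set_bounds` is a hypothetical name and the vectors are made-up values for illustration.

```python
from typing import Sequence, Tuple


def reference_set_bounds(vi: Sequence[float],
                         vj: Sequence[float]) -> Tuple[float, float]:
    """Return (LB_RS, UB_RS) for two documents from their distance vectors."""
    # Each reference gives |v_il - v_jl| <= Dist(di,dj) <= v_il + v_jl,
    # so the tightest bounds are the largest lower and the smallest upper one.
    lb = max(abs(a - b) for a, b in zip(vi, vj))
    ub = min(a + b for a, b in zip(vi, vj))
    return lb, ub


# Illustrative vectors for k = 3 reference documents (not from the slides).
vi = [2, 7, 5]
vj = [3, 4, 9]
print(reference_set_bounds(vi, vj))  # (4, 5)
```

With threshold t = 3 this pair would be pruned (4 > 3), and with t = 5 it would be admitted (5 <= 5), in both cases without computing TreeEditDist() for the pair itself.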
27
Reference set
- Choosing the reference set (or simply, RS)
  - Assumption: the XML data set is well separated
  - Goal: make the RS decisive with a small size
  - Optimal RS: one XML file from each cluster
- Bounds, for reference
  - $LB\_RS = \max_{1 \le l \le k} |v_{il} - v_{jl}|$
  - $UB\_RS = \min_{1 \le l \le k} (v_{il} + v_{jl})$
28
Reference set
- Choosing the RS
  - Randomly sample the whole XML data set
  - Cluster the sampled XML data
    - Based on TreeEditDist()
    - Use CURE or BIRCH for clustering
  - Take one sample from each cluster
29
Approximate join
- Problem statement again
  - Given
    - Two sets of XML documents
    - A distance metric Dist between XML documents
    - A distance threshold t
  - Result
    - All pairs of documents (Ti,Tj) such that Dist(Ti,Tj) <= t
- Approximate join algorithm
  - Uses bounding and the reference set
  - No indexing
30
Approximate join
- Baseline algorithm
  - Nested loop join with TreeEditDist()
- Using lower and upper bounds ONLY
  - Nested loop join with bounds
  - if LBDist(T1,T2) > t, then prune (T1,T2)
  - if UBDist(T1,T2) <= t, then admit (T1,T2)
- Using the reference set ONLY
  - if LB_RS(T1,T2) > t, then prune (T1,T2)
  - if UB_RS(T1,T2) <= t, then admit (T1,T2)
31
Approximate join
- Apply both bounding and sampling (1): in sequence
  1. Filter using the reference set
  2. Filter using the lower and upper bounds
  3. Apply TreeEditDist()
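The sequential scheme is a nested loop that tries the cheap filters in order before the exact distance. The sketch below is a toy: integers stand in for XML documents and `abs(a - b)` stands in for TreeEditDist(), purely so the example runs. `approximate_join`, `rs_bounds`, and `loose_bounds` are hypothetical names, and `loose_bounds` is an artificial placeholder for LBDist/UBDist.

```python
from typing import Callable, List, Sequence, Tuple

Bounds = Tuple[float, float]


def approximate_join(left: Sequence[int], right: Sequence[int], t: float,
                     filters: List[Callable[[int, int], Bounds]],
                     exact_dist: Callable[[int, int], float]):
    """Nested loop join; each filter returns (lower bound, upper bound)."""
    result, exact_calls = [], 0
    for a in left:
        for b in right:
            decided = False
            for bounds in filters:
                lb, ub = bounds(a, b)
                if lb > t:          # prune
                    decided = True
                    break
                if ub <= t:         # admit
                    result.append((a, b))
                    decided = True
                    break
            if not decided:         # fall back to the exact distance
                exact_calls += 1
                if exact_dist(a, b) <= t:
                    result.append((a, b))
    return result, exact_calls


def toy_dist(a: int, b: int) -> int:
    return abs(a - b)  # stand-in metric, NOT a tree edit distance


reference_set = [0, 10]  # assumed reference "documents"


def rs_bounds(a: int, b: int) -> Bounds:
    va = [toy_dist(a, r) for r in reference_set]
    vb = [toy_dist(b, r) for r in reference_set]
    return (max(abs(x - y) for x, y in zip(va, vb)),
            min(x + y for x, y in zip(va, vb)))


def loose_bounds(a: int, b: int) -> Bounds:
    d = toy_dist(a, b)
    return max(d - 1, 0), d + 1  # artificial slack, for illustration only


pairs, exact_calls = approximate_join([1, 4, 9], [2, 8], 2,
                                      [rs_bounds, loose_bounds], toy_dist)
print(pairs)  # [(1, 2), (4, 2), (9, 8)]
```

In a real setting the distance vectors would be pre-computed once per document rather than recomputed inside `rs_bounds` for every pair.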
32
Approximate join
- Apply both bounding and sampling (2): in combination
  - Compute the lower and upper bounds between each XML document and the RS, but do NOT compute TreeEditDist()
  - Each XML document then has two vectors, vl and vu
- Changed bounding functions: given an XML document pair (di,dj)
  - $LB\_RSC = \max_{1 \le m \le k} |v^{u}_{im} - v^{l}_{jm}|$
  - $UB\_RSC = \min_{1 \le m \le k} (v^{u}_{im} + v^{u}_{jm})$
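A sketch of the combined bounds, assuming each document carries a vector of lower bounds and a vector of upper bounds on its distances to the reference documents. The upper bound follows the slide. The lower bound is written in a conservative interval form, max(l_i - u_j, l_j - u_i, 0), which is an assumption of this sketch chosen so that the result stays a valid lower bound when only bounds on the reference distances are known; it is not copied from the slide. `combined_bounds` and the sample vectors are hypothetical.

```python
from typing import Sequence, Tuple


def combined_bounds(lo_i: Sequence[float], up_i: Sequence[float],
                    lo_j: Sequence[float], up_j: Sequence[float]
                    ) -> Tuple[float, float]:
    """Bounds on Dist(di,dj) when only bounds to each reference are known."""
    # If lo <= Dist(d, r) <= up for both documents, then
    # Dist(di,dj) >= Dist(di,r) - Dist(dj,r) >= lo_i - up_j (and symmetrically).
    lb = max(max(li - uj, lj - ui, 0)
             for li, ui, lj, uj in zip(lo_i, up_i, lo_j, up_j))
    # Dist(di,dj) <= Dist(di,r) + Dist(dj,r) <= up_i + up_j.
    ub = min(ui + uj for ui, uj in zip(up_i, up_j))
    return lb, ub


# Illustrative bound vectors for k = 2 reference documents (made-up values).
lo_i, up_i = [2, 6], [3, 8]
lo_j, up_j = [5, 1], [7, 2]
print(combined_bounds(lo_i, up_i, lo_j, up_j))  # (4, 10)
```

These bounds are looser than LB_RS and UB_RS, but building the vectors needs only the cheap lower and upper bound computations.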
33
R-tree indexing
- Use an R-tree index to improve performance
  - Construct an R-tree index for both data sets
  - A leaf node consists of document distance vectors
  - A DIR (directory) node consists of MBRs (minimum bounding rectangles)
[Figure: R-tree leaf node holding the K-dimensional vectors vu and vl of documents D1, D2, ..., Dm; the MBR of the leaf is bounded by max(viu) and min(vil)]
34
R-tree indexing
- Use the index for further improvement
- Range query (given a document Di and an MBR node)
  - if there exists j (1 <= j <= d) with max(vju) + vj <= t, admit Di with all leaf nodes in the MBR
  - if there exists j (1 <= j <= d) with min(vjl) - vj > t or vj - min(vju) > t, prune Di with all leaf nodes
[Figure: R-tree with a ROOT node, directory entries bounded by max(viu) and min(vil), and leaf entries holding the vectors vu and vl]
35
R-tree indexing
- Use the index for further improvement
- R-tree join (given two sets of documents)
  - Based on the range query
  - Slightly modify the bounding function from (point, MBR) to (MBR, MBR)
[Figure: R-tree with a ROOT node, directory entries bounded by max(viu) and min(vil), and leaf entries holding the vectors vu and vl]
36
Summary
- Baseline approximate join algorithm
  - Nested loop join with TreeEditDist() (Naïve)
- Proposed algorithms
  - Lower and upper bounds (B)
  - Reference set (RS)
  - Bounds + reference set
    - In sequence (RSB)
    - In combination (RSC)
  - R-tree indexing support
37
Experimental result
- Data sets
  - Data set A: generated with the IBM XML data generator
  - Data set B: A arranged into 8 clusters
  - Data set C: DBLP data
- Test query
  - Approximate self-join on each data set
38
Experimental result
- Tightness of the lower and upper bounds
  - x-axis: ratio of bound to TreeEditDist()
  - y-axis: number of pairs
39
Experimental result
- Processing time (on data set A)
  - y-axis: ratio to naïve (logarithmic scale)
  - x-axis: reference set size
40
Experimental result
- Processing time (on data set B)
  - Data set B has 8 clusters
41
Experimental result
- Components of computing time (on data set B)
  - Optimal: |RS| = 8
42
Experimental result
- Processing time on data set A (no clusters)
  - x-axis: distance threshold
  - Larger threshold: the upper bound matters more
  - Smaller threshold: the lower bound matters more
43
Experimental result
- Using the R-tree index
  - x-axis: distance threshold
  - y-axis: vector computation ratio to RSB
44
Conclusion
- Motivation
  - Data integration in the context of XML
  - Approximate join on XML documents
- Approaches
  - Lower and upper bounds
  - Reference set used for filtering
  - R-tree indexing
45
Q & A
Happy New Year!