On Efficient Matching of Streaming XML Documents and Queries


1 On Efficient Matching of Streaming XML Documents and Queries. Laks V.S. Lakshmanan (1), P. Sailaja (2). (1) University of British Columbia, Canada; (2) Indian Inst. of Tech., Bombay, India (work performed while visiting IIT-Bombay).

2 Outline (UBC, Canada; EDBT 2002, Prague). I. Motivating Applications; II. Problem; III. Dual Index; IV. Algorithms; V. Experiments; VI. Summary & Future Work.

3 Motivating Application 1: information dissemination in the large. Numerous data sources on the web. Traditional means: search and browse. Alternative: publish and subscribe. The system matches (new) data to subscribers' interests and sends periodic notifications.

4 Motivating Application 2: supply chain automation. Suppliers provide catalogs of products and services (data). Manufacturing units register sets of requirements (subscriptions). Relevant consumers are notified upon arrival of new data. Other applications include electronic auctioning, online shopping, etc.

5 Problem: efficiently match specifications (of products, services, etc.) to requirements (subscriptions). Specs are akin to data; requirements are akin to queries. Data may stream through. Quickly determine which subscribers/users a piece of data is relevant to.

6 Problem. Traditional setting: a large DB and one (at most a few) query at a time. Our problem: a small DB (a tuple, an XML doc, etc.) and a large number of queries, which is the dual of the traditional problem. Focus of this paper: data = XML docs; queries = a fragment of XPath.

7 Problem (Formalized). Given an XML document and a large number of XPath queries, determine which queries are answered by each element (formalized using matching). Query labeling: label each node with the set of queries answered by the subtree rooted there. The naïve approach doesn't scale with the number of queries. Main challenge: a small number (1 or 2) of passes over the data tree.

8 An Example Query: FOR $p IN document(“catalog.xml”)//part, $b IN $p/brand, $q IN $p//part WHERE A2D IN $q/name AND AMD IN $q/brand RETURN $p

9 Problem (An Example) [figure not captured in the transcript]

10 Dual Index. Traditional index: quickly localize the search for data matching a query pattern. Dual index: for each primitive pattern, determine the (sub)queries to which it is relevant. The choice of primitive patterns depends on the type of data (e.g., XML vs. relational) and on the classes of queries considered (e.g., chains vs. trees).

11 Tree Dual Index. Primitive “access path” questions to be answered: For a constant c, what are its leaf appearances? For a tag t, what are its non-leaf appearances? What query nodes are its pc- (parent-child) and ad- (ancestor-descendant) children? Example: [Figure: query trees P (nodes a1, b2, c3, a4, b5, c6) and Q (nodes b1, c2, a6, c3, b5, a4).] Index entries for a: DI(a)[L]: (P, 3, {}), (Q, 6, {4, 6}). DI(a)[N]: (P, 1, F, {2, 3}, {}), (P, 4, T, {6}, {5}).
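To make the two kinds of index entries concrete, here is a minimal Python sketch of how such a dual index might be built from query trees. The encoding of P (node id mapped to tag, pc-children, ad-children) is hypothetical, and attaching node 4 below node 3 by an ad-edge is an inference from the later worked examples, not something the slide states. The F/T flag of the DI(a)[N] entries is omitted because the slide does not define it, and unlike the slide this sketch records no leaf entry for a query with no leaf of a given tag.

```python
from collections import defaultdict

# Hypothetical encoding of query tree P: node id -> (tag, pc-children, ad-children).
# ASSUMPTION: node 4 hangs below node 3 via an ad-edge (inferred, not on the slide).
P_NODES = {
    1: ("a", {2, 3}, set()),
    2: ("b", set(), set()),
    3: ("c", set(), {4}),
    4: ("a", {6}, {5}),
    5: ("b", set(), set()),
    6: ("c", set(), set()),
}

def build_dual_index(queries):
    """queries maps query id -> (distinguished node id, node table)."""
    leaf = defaultdict(list)     # tag -> [(qid, dn, leaf node ids with that tag)]
    nonleaf = defaultdict(list)  # tag -> [(qid, node id, pc-children, ad-children)]
    for qid, (dn, nodes) in queries.items():
        leaves_by_tag = defaultdict(set)
        for node_id, (tag, pc, ad) in sorted(nodes.items()):
            if pc or ad:
                nonleaf[tag].append((qid, node_id, frozenset(pc), frozenset(ad)))
            else:
                leaves_by_tag[tag].add(node_id)
        for tag, ids in leaves_by_tag.items():
            leaf[tag].append((qid, dn, frozenset(ids)))
    return leaf, nonleaf

leaf_index, nonleaf_index = build_dual_index({"P": (3, P_NODES)})
```

Under these assumptions, `nonleaf_index["a"]` holds the two non-leaf appearances of tag a in P (nodes 1 and 4) with the same child sets as the slide's DI(a)[N] entries.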

12 Tree Labeling Algorithm: 3 Lists. [Figure: data tree with nodes a1, b2, c3, c4, a5, d6, b7, a8, c9, d10, a11, c12, d13, b14, b15.] Three lists (conceptually) for each data node u. TML(u): entries (Query, query node, DN, ans-node). PL(u): entries (P, l, m, x):rel. QL(u): query ids.

13 Tree Labeling Algo.: TML base case. [Figure: same data tree as on slide 12.] (P, m, {v1, …, vk}) ∈ DI(t)[L] ⇒ (P, v1, m, ?), …, (P, vk, m, ?) ∈ TML(u), whenever u.tag = t. E.g., DI(a)[L] has (Q, 6, {4, 6}), so add (Q, 4, 6, ?) and (Q, 6, 6, ?) to TML(i); (Q, 6, 6, ?) becomes (Q, 6, 6, i), for i = 1, 5, 8, 11. If vi = m, ? is set to u.
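The base case can be sketched in Python as follows. This is an illustrative reading of the slide's rule, not the authors' implementation: the function name `tml_base`, the tuple layout, and the use of `None` for the unbound answer "?" are all assumptions.

```python
def tml_base(u, tag, leaf_index):
    """Base case: TML entries for data node u (with the given tag) from DI(tag)[L].

    leaf_index maps tag -> [(qid, dn, leaf query nodes)]; an entry
    (qid, v, dn, answer) uses None for the slide's unbound '?'.
    """
    entries = set()
    for qid, dn, leaf_nodes in leaf_index.get(tag, []):
        for v in leaf_nodes:
            # If the matched query leaf is the distinguished node, u is the answer.
            answer = u if v == dn else None
            entries.add((qid, v, dn, answer))
    return entries

# DI(a)[L] as given on the slide.
DI_LEAF = {"a": [("P", 3, frozenset()), ("Q", 6, frozenset({4, 6}))]}
tml_5 = tml_base(5, "a", DI_LEAF)
```

For data node 5 (tagged a), this produces the two entries the slide describes: one for query node 4 with the answer still unbound, and one for query node 6 with the answer bound to node 5.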

14 Tree Labeling Algo.: TML → PL. [Figure: same data tree as on slide 12.] (P, l, m, x) ∈ TML(u) ⇒ (P, l, m, x):child ∈ PL(parent(u)) and (P, l, m, x):desc ∈ PL(anc(u)). E.g., (Q, 4, 6, ?) ∈ TML(5), so (Q, 4, 6, ?):child ∈ PL(4) and (Q, 4, 6, ?):desc ∈ PL(i), i = 3, 2, 1. Optimizations are possible, but suppressed here.
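A small Python sketch of this propagation step, under the reading suggested by the slide's example (the entry is tagged child at the parent and desc at every higher ancestor). The function name, the `parent` map, and the flattened tuple layout are hypothetical.

```python
from collections import defaultdict

def propagate_to_pl(tml, parent):
    """Copy every TML(u) entry into PL of u's parent (as 'child')
    and into PL of each higher ancestor (as 'desc')."""
    pl = defaultdict(set)
    for u, entries in tml.items():
        rel = "child"
        anc = parent.get(u)
        while anc is not None:
            for qid, l, m, x in entries:
                pl[anc].add((qid, l, m, x, rel))
            rel = "desc"            # beyond the parent, the relation is 'desc'
            anc = parent.get(anc)
    return pl

# Path 1 -> 2 -> 3 -> 4 -> 5 of the data tree, as implied by the slide's example.
PARENT = {2: 1, 3: 2, 4: 3, 5: 4}
pl = propagate_to_pl({5: {("Q", 4, 6, None)}}, PARENT)
```

The slide notes that optimizations are possible; this sketch makes no attempt at them and simply walks the whole ancestor chain for every node.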

15 Tree Labeling Algo.: TML inductive case. [Figure: same data tree as on slide 12.] (P, l, B, C, D) ∈ DI(t)[N] & ∀c ∈ C: (P, c, m, y):child ∈ PL(u) & ∀d ∈ D: (P, d, m, y):rel ∈ PL(u) ⇒ (P, l, m, x) ∈ TML(u). If l = m, x is set to u. E.g., (P, 4, T, {6}, {5}) ∈ DI(a)[N]. (P, 6, 3, ?) ∈ TML(12), so (P, 6, 3, ?):child ∈ PL(11). Similarly, (P, 5, 3, ?):desc ∈ PL(11). So (P, 4, 3, ?) ∈ TML(11).
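The inductive rule might be sketched in Python as below. Several things here are assumptions rather than the paper's method: the function name, the `dn_of` map from query id to distinguished node, `None` standing for "?", and the omission of the B flag, whose meaning the slide does not give. An answer bound further down the tree is carried upward unchanged.

```python
def tml_inductive(u, tag, nonleaf_index, pl_u, dn_of):
    """Inductive case: derive TML(u) entries from DI(tag)[N] and PL(u).

    nonleaf_index: tag -> [(qid, l, pc-children, ad-children)]
    pl_u: set of (qid, query node, dn, answer, rel), rel in {"child", "desc"}
    """
    entries = set()
    for qid, l, pc_children, ad_children in nonleaf_index.get(tag, []):
        m = dn_of[qid]
        # pc-children must be matched at a child of u; ad-children at any descendant.
        required = [(c, {"child"}) for c in pc_children]
        required += [(d, {"child", "desc"}) for d in ad_children]
        answers, matched = set(), True
        for node, allowed in required:
            hits = [e for e in pl_u
                    if e[0] == qid and e[1] == node and e[4] in allowed]
            if not hits:
                matched = False
                break
            answers.update(e[3] for e in hits if e[3] is not None)
        if not matched:
            continue
        if l == m:                  # u itself matches the distinguished node
            answers = {u}
        for x in (answers or {None}):
            entries.add((qid, l, m, x))
    return entries

DI_NONLEAF = {"a": [("P", 1, frozenset({2, 3}), frozenset()),
                    ("P", 4, frozenset({6}), frozenset({5}))]}
PL_11 = {("P", 6, 3, None, "child"), ("P", 5, 3, None, "desc")}
tml_11 = tml_inductive(11, "a", DI_NONLEAF, PL_11, {"P": 3})
```

With the slide's inputs for data node 11, only query node 4 of P is matched, and its answer stays unbound because the distinguished node 3 lies outside node 4's subtree.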

16 Tree Labeling Algo.: QL. [Figure: same data tree as on slide 12.] TML and PL feed each other. QL is a special case of TML: P ∈ QL(x) iff (P, 1, m, x) ∈ TML(u) for some data node u. E.g., (P, 1, 3, 9) ∈ TML(1), so P ∈ QL(9); and (Q, 1, 6, 5) ∈ TML(2), so Q ∈ QL(5).
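Following the slide's two worked examples, where the query is placed in the QL list of the answer node, the derivation of QL from TML can be sketched in Python like this. The function name and the `root_of` map are hypothetical; the slide simply uses 1 as the root query node.

```python
from collections import defaultdict

def query_labels(tml, root_of):
    """A query labels answer node x when its root query node is matched
    somewhere in the data tree with the answer bound to x."""
    ql = defaultdict(set)
    for entries in tml.values():
        for qid, l, m, x in entries:
            if l == root_of[qid] and x is not None:
                ql[x].add(qid)
    return dict(ql)

# TML entries taken from the slide's examples, plus one non-root entry.
TML = {1: {("P", 1, 3, 9)}, 2: {("Q", 1, 6, 5)}, 11: {("P", 4, 3, None)}}
ql = query_labels(TML, {"P": 1, "Q": 1})
```

Entries for non-root query nodes, such as the one at data node 11, contribute nothing to QL.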

17 Tree Labeling: Summary. Labeling is completed in two passes: pass 1 computes TML/PL (bottom-up); pass 2 computes QL (top-down). The number of I/O invocations is 2 × the number of data tree nodes. Other algorithms in the paper: chain labeling; chain split labeling of trees.
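As a rough end-to-end illustration, the Python sketch below runs a simplified bottom-up labeling for a single query on the slides' data tree. It is not the paper's algorithm: the tree shape and the structure of P are inferred from the worked examples and should be treated as assumptions, the in-memory `index` only mimics a tag-keyed dual index, leaf and non-leaf query nodes share one rule, and the answers are read off by a final scan rather than by a separate top-down pass.

```python
from collections import defaultdict

# ASSUMED data tree shape: node id -> (tag, parent id).
DATA = {1: ("a", None), 2: ("b", 1), 3: ("c", 2), 4: ("c", 3), 5: ("a", 4),
        6: ("d", 4), 7: ("b", 5), 8: ("a", 2), 9: ("c", 1), 10: ("d", 9),
        11: ("a", 10), 12: ("c", 11), 13: ("d", 11), 14: ("b", 13), 15: ("b", 10)}
# ASSUMED query P: node id -> (tag, pc-children, ad-children).
P = {1: ("a", {2, 3}, set()), 2: ("b", set(), set()), 3: ("c", set(), {4}),
     4: ("a", {6}, {5}), 5: ("b", set(), set()), 6: ("c", set(), set())}
DN, ROOT = 3, 1  # distinguished node and root of P

def label(data, query, dn, root):
    index = defaultdict(list)              # tag -> query nodes carrying that tag
    for l, (tag, pc, ad) in query.items():
        index[tag].append((l, pc, ad))
    tml, pl = defaultdict(set), defaultdict(set)
    # Every child id exceeds its parent's id in DATA, so descending id order
    # visits each node only after all of its descendants (bottom-up).
    for u in sorted(data, reverse=True):
        tag, parent = data[u]
        for l, pc, ad in index[tag]:
            need = [(c, {"child"}) for c in pc] + [(d, {"child", "desc"}) for d in ad]
            answers, ok = set(), True
            for node, rels in need:
                hits = [e for e in pl[u] if e[0] == node and e[2] in rels]
                ok = ok and bool(hits)
                answers.update(e[1] for e in hits if e[1] is not None)
            if ok:
                if l == dn:
                    answers = {u}          # u matches the distinguished node
                for x in (answers or {None}):
                    tml[u].add((l, x))
        rel, anc = "child", parent         # push TML(u) to the ancestors' PL lists
        while anc is not None:
            pl[anc].update((l, x, rel) for l, x in tml[u])
            rel, anc = "desc", data[anc][1]
    return {x for es in tml.values() for l, x in es if l == root and x is not None}

answers = label(DATA, P, DN, ROOT)
```

Under these assumptions the only answer found for P is data node 9, which agrees with the slide's example (P, 1, 3, 9) ∈ TML(1).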

18 Experiments. matchMaker implementation: JDK 1.3 and C++. Storage: BerkeleyDB 3.17. The dual index is stored on disk; lists are manipulated in memory. Platform: Intel PIII, 1GB RAM, 512K cache, Linux 7.0. Data sets: generated using IBM's XML Gen tool, conforming to the GEDCOM DTD (genealogical data; about 120 elements).

19 Experiments. Document depth 10; average fanout in [2, 5]. The chain labeling algorithm is at least 5 times faster than the query-at-a-time approach. For tree labeling, query-at-a-time doesn't produce results in reasonable time! Focus of experiments (for trees): direct tree labeling algorithm vs. chain split algorithm (not discussed).

20-22 Experiments (three slides whose content was not captured in the transcript)

23 Related Work. Matching documents to user profiles (IR). Standing queries have a long history: e.g., Tapestry, TriggerMan, NiagaraCQ, etc. Publish-and-subscribe: Fabret et al. 00, 01. Patterns: Boolean combinations of relOp comparisons with values. XFilter 00, 01: only determines whether a doc contains an answer; multiple answers in one doc are not considered.

24 Related Work. XTrie approach: decompose the query tree into ad-free chains; index using a trie; determine only whether a doc contains an answer. Main distinguishing features of matchMaker: answers are located; multiple answers per doc are handled; all proposed algorithms have guaranteed resource bounds (e.g., #passes, I/O).

25 Summary & Future Work. Matching a large number of queries to XML data trees (as they stream through). Dual to usual query processing. Dual index (chains vs. trees). Algorithms for query labeling of data trees. Future work: making the algorithms more efficient (single-pass algorithm for chains: done); expanding the classes of queries handled; an algebra for this dual query processing problem?

