1 Holistic Twig Joins: Optimal XML Pattern Matching ACM SIGMOD 2002.

2 In this lecture
- The Problem
- Idea
- Preliminaries
- PathStack Algorithm
- TwigStack Algorithm
- Conclusions

3 The problem
Find semantically connected data in an XML document in an efficient way. Many intermediate results are produced that do not participate in the final answers.

4 The problem (example)
For example, take this XQuery expression:
    book[title = 'XML']//author[fn = 'jane' and ln = 'doe']
We can translate it into a twig (small tree) pattern:
    book
      title
        XML
      author
        fn
          jane
        ln
          doe

5 The problem (example)
In order to solve this problem we have to:
- Find all binary relationships, like (book, title) and (author, fn)
- Connect all the matches we have found into the complete answer
The problem is that every book has a title, but only some of them have the title 'XML', so we produce many intermediate answers that do not participate in the final answer.

7 Idea
The main idea of the paper is to store intermediate results in a compact way, and to develop an algorithm that is independent of the size of the intermediate results. A family of stack-based algorithms was invented for this purpose.

9 Representing position of elements
Every node in the XML document is represented as:
- Leaf: 3-tuple (DocId, LeftPos, LevelNum)
- Node: 3-tuple (DocId, LeftPos:RightPos, LevelNum)

10 Representing position of elements
For example:
- book (1,1:31,1)
  - title (1,2:4,2)
    - XML (1,3,3)
  - authors (1,5:30,2)
    - author (1,6:13,2): fn (1,7:9,3) jane (1,8,4), ln (1,10:12,3) poe (1,11,4)
    - author (1,14:21,2): fn (1,15:17,3) john (1,16,4), ln (1,18:20,3) doe (1,19,4)
    - author (1,22:29,2): fn (1,23:25,3) jane (1,24,2), ln (1,26:28,2) doe (1,27,2)

11 Representing position of elements For example

12 Representing position of elements
Benefit: structural relationships are easy to determine.
- Ancestor-descendant relationship: a node n1 (D1, L1:R1, N1) is a descendant of node n2 (D2, L2:R2, N2) iff D1 = D2, L2 < L1 and R1 < R2
- Parent-child relationship: a node n1 (D1, L1:R1, N1) is the parent of node n2 (D2, L2:R2, N2) iff D1 = D2, L1 < L2, R2 < R1 and N1 + 1 = N2
Example nodes: book (1,1:31,1), fn (1,7:9,3), ln (1,10:12,3), poe (1,11,4)
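The two containment tests can be written down almost literally. The following is a minimal sketch, assuming a hypothetical `Pos` tuple in which a leaf stores its single position as both `left` and `right`; the function names are illustrative, and the sample nodes are the ones shown on this slide.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    """Positional encoding (DocId, LeftPos:RightPos, LevelNum)."""
    doc: int
    left: int
    right: int   # assumption: a leaf uses right == left
    level: int


def is_ancestor(a: Pos, d: Pos) -> bool:
    """True if a strictly contains d within the same document."""
    return a.doc == d.doc and a.left < d.left and d.right < a.right


def is_parent(p: Pos, c: Pos) -> bool:
    """True if p contains c and c is exactly one level below p."""
    return is_ancestor(p, c) and p.level + 1 == c.level


# Sample nodes taken from the slide.
book = Pos(1, 1, 31, 1)
fn = Pos(1, 7, 9, 3)
ln = Pos(1, 10, 12, 3)
poe = Pos(1, 11, 11, 4)

print(is_ancestor(book, fn), is_parent(book, fn), is_parent(ln, poe))
```

Both tests are constant-time comparisons, which is what lets the later algorithms decide structural relationships without walking the tree.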

13 Representing position of elements Available cases:

14 Matching stream
A stream T_q contains the positional representations of the database nodes that match the query node q. The nodes in the stream are sorted by (DocId, LeftPos).

15 Matching stream (example)
For the example document:
- T_author: author (1,6:13,2), author (1,14:21,2), author (1,22:29,2)
- T_jane: jane (1,8,4), jane (1,24,2)
The operations available on the streams:
- eof, advance, next, nextL, nextR
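A matching stream can be modelled as a sorted list with a cursor. The sketch below is only an illustration: the operation names come from the slide, but their exact semantics here (for instance, that `next` peeks without advancing) are assumptions, and nodes are plain `(doc, left, right, level)` tuples.

```python
class Stream:
    """Cursor over the nodes matching one query node, sorted by (DocId, LeftPos)."""

    def __init__(self, nodes):
        self._nodes = sorted(nodes, key=lambda n: (n[0], n[1]))
        self._i = 0

    def eof(self):
        """True when every node has been consumed."""
        return self._i >= len(self._nodes)

    def advance(self):
        """Move the cursor to the following node."""
        self._i += 1

    def next(self):
        """Return the current node without consuming it (assumed semantics)."""
        return self._nodes[self._i]

    def next_l(self):
        """LeftPos of the current node (nextL on the slide)."""
        return self.next()[1]

    def next_r(self):
        """RightPos of the current node (nextR on the slide)."""
        return self.next()[2]


# T_author from the slide, given out of order to show the sorting.
t_author = Stream([(1, 22, 29, 2), (1, 6, 13, 2), (1, 14, 21, 2)])
lefts = []
while not t_author.eof():
    lefts.append(t_author.next_l())
    t_author.advance()
print(lefts)
```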

16 Linked stacks
Idea:
- Repeatedly construct stacks that contain partial and total answers
- Remove partial answers that cannot be extended to total answers

17 Linked stacks (example)
Data: A1, B1, A2, B2, C1
Query (path): A, B, C
Stack encoding: A1, B1, A2, B2, C1 (one stack per query node)
Query results:
- A1 B1 C1
- A2 B2 C1
- A1 B2 C1

19 Stack based algorithms
The stack-based algorithms use a chain of linked stacks to compactly represent partial and full results.

20 PathStack algorithm
Data: A1 (1:9), B1 (2:8), A2 (3:7), B2 (4:6), C1 (5)
Query (path): A, B, C
Streams:
- T_A: A1 (1:9), A2 (3:7)
- T_B: B1 (2:8), B2 (4:6)
- T_C: C1 (5)

21 PathStack algorithm
Data: A1 (1:9), B1 (2:8), A2 (3:7), B2 (4:6), C1 (5)
Query (path): A, B, C
Streams: T_A, T_B, T_C. Always take the element with the smallest LeftPos.
Stack encoding:
- S_A: A1 (1:9), A2 (3:7)
- S_B: B1 (2:8), B2 (4:6)
- S_C: C1 (5)
Query results:
- A1 B1 C1
- A1 B2 C1
- A2 B2 C1

22 PathStack algorithm
Data: A1 (1:10), B1 (2:9), A2 (3:7), B2 (4:6), C1 (5), C2 (8)
Query (path): A, B, C
Stack encoding before C2 is added:
- S_A: A1 (1:10), A2 (3:7)
- S_B: B1 (2:9), B2 (4:6)
Add C2 here: stack elements whose RightPos < LeftPos of the new element are removed first.
Query results:
- A1 B1 C1
- A1 B2 C1
- A2 B2 C1
- A1 B1 C2
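The walk-through on these slides can be reproduced with a short program. This is a minimal sketch of the PathStack idea for a path query with ancestor-descendant edges only, not the paper's full pseudocode: the function name `path_stack`, the `(left, right)` tuples (a leaf uses `left == right`) and the list-based streams are assumptions made for illustration.

```python
def path_stack(streams):
    """streams[i] holds the (left, right) matches of the i-th query node,
    ordered from query root to query leaf and sorted by left."""
    n = len(streams)
    cursor = [0] * n
    stacks = [[] for _ in range(n)]   # entries: (left, right, index into parent stack)
    results = []

    def expand(q, idx):
        """All root-to-q paths that end at stacks[q][idx]."""
        left, right, ptr = stacks[q][idx]
        if q == 0:
            return [[(left, right)]]
        paths = []
        for j in range(ptr + 1):      # every entry at or below the pointer is an ancestor
            for prefix in expand(q - 1, j):
                paths.append(prefix + [(left, right)])
        return paths

    while cursor[-1] < len(streams[-1]):          # stop when the leaf stream is exhausted
        live = [q for q in range(n) if cursor[q] < len(streams[q])]
        qmin = min(live, key=lambda q: streams[q][cursor[q]][0])   # smallest LeftPos
        left, right = streams[qmin][cursor[qmin]]
        for stack in stacks:                      # pop entries with RightPos < LeftPos
            while stack and stack[-1][1] < left:
                stack.pop()
        ptr = len(stacks[qmin - 1]) - 1 if qmin > 0 else -1
        stacks[qmin].append((left, right, ptr))
        cursor[qmin] += 1
        if qmin == n - 1:                         # a leaf match: emit its solutions
            results.extend(expand(qmin, len(stacks[qmin]) - 1))
            stacks[qmin].pop()
    return results


# Data from the slide: A1 (1:10), A2 (3:7), B1 (2:9), B2 (4:6), C1 (5), C2 (8).
T_A = [(1, 10), (3, 7)]
T_B = [(2, 9), (4, 6)]
T_C = [(5, 5), (8, 8)]
for path in path_stack([T_A, T_B, T_C]):
    print(path)
```

On this input the sketch reports the four results listed on the slide: when C2 arrives, A2 and B2 have already been popped, so only A1 B1 C2 is added.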

23 PathStack algorithm problems
To find a twig we have to divide it into many paths and then join the path results.
- Again we have intermediate results that do not participate in the final result.
Data:
- authors (5:30)
  - author (6:13): fn (7:9) jane (8), ln (10:12) poe (11)
  - author (14:21): fn (15:17) john (16), ln (18:20) doe (19)
  - author (22:29): fn (23:25) jane (24), ln (26:28) doe (27)
Query: author with fn jane and ln doe

25 TwigStack Algorithm Idea
- Before adding a node to the stack, check that it has descendants that satisfy the twig pattern. When the descendants are checked, their own descendants are checked too.
- Now we can be sure that every path result is joinable with at least one other path result and participates in at least one full answer.

26 TwigStack Algorithm
Data:
- authors (5:30)
  - author (6:13): fn (7:9) jane (8), ln (10:12) poe (11)
  - author (14:21): fn (15:17) john (16), ln (18:20) doe (19)
  - author (22:29): fn (23:25) jane (24), ln (26:28) doe (27)
Query: author with fn jane and ln doe
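The effect of checking the whole twig before accepting a node can be illustrated on this data. The sketch below is a deliberately simplified, non-streaming version of that check (a recursive scan over all nodes), not the streaming procedure of the paper; the tuple layout `(tag, left, right)`, the nested-tuple query format and the function names are assumptions made for illustration.

```python
# Nodes of the example as (tag, left, right); leaves use left == right.
NODES = [
    ("authors", 5, 30),
    ("author", 6, 13), ("fn", 7, 9), ("jane", 8, 8), ("ln", 10, 12), ("poe", 11, 11),
    ("author", 14, 21), ("fn", 15, 17), ("john", 16, 16), ("ln", 18, 20), ("doe", 19, 19),
    ("author", 22, 29), ("fn", 23, 25), ("jane", 24, 24), ("ln", 26, 28), ("doe", 27, 27),
]

# Twig query as (tag, [child queries]); every edge is ancestor-descendant.
TWIG = ("author", [("fn", [("jane", [])]), ("ln", [("doe", [])])])
PATH_FN = ("author", [("fn", [("jane", [])])])
PATH_LN = ("author", [("ln", [("doe", [])])])


def contains(a, d):
    """True if node a strictly contains node d."""
    return a[1] < d[1] and d[2] < a[2]


def has_twig_match(node, query, nodes):
    """True if node matches the query root and every query child is matched
    by some descendant of node, checked recursively."""
    tag, child_queries = query
    if node[0] != tag:
        return False
    return all(
        any(contains(node, d) and has_twig_match(d, cq, nodes) for d in nodes)
        for cq in child_queries
    )


def survivors(query):
    return [n for n in NODES if has_twig_match(n, query, NODES)]


print("path author-fn-jane:", survivors(PATH_FN))
print("path author-ln-doe: ", survivors(PATH_LN))
print("whole twig:         ", survivors(TWIG))
```

Evaluated path by path, the two paths each accept two authors, and only author (22:29) appears in both; checking the whole twig up front accepts that one author directly.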

28 Conclusions
The PathStack and TwigStack algorithms are effective in terms of the amount of intermediate results.
But:
- They are only effective for finding ancestor-descendant relationships. If the twig also contains parent-child relationships, then not all nodes that are inserted into the stacks participate in the final result.

29 Break?

30 Querying Structured Text in an XML Database ACM SIGMOD 2003

31 In this lecture
- Abstract
- Introduction
- Motivation
- Algebra
- Access methods
- Conclusions

32 Abstract
- XML documents often contain structured text.
- It is important to integrate "information retrieval"-style query evaluation.
- This is well studied for natural-language documents.
- But in the case of XML the data could reside in element descendants.

34 Introduction
Boolean-style queries (XQuery)
- Useful when users are aware of the underlying schema
But
- Users often don't know the schema
- Collections of XML documents are frequently heterogeneous

35 Introduction
So we have to use relevance ranking in order to define IR on XML.
Problem: traditional IR is "document-centric".
XML IR should:
- Be much more fine-grained
- Take document structure into account
- Allow more complex analysis than determination of relevance

37 Motivation
We have the following XML document named article.xml:
article #a1
  article-title #a2: Internet Technologies
  author #a3
    fname #a4: Jane
    sname #a5: Doe
  chapter #a6
    ct #a7: Caching and Replication
  chapter #a10
    ct #a11: Search and Retrieval
    section #a12
      section-title #a13: Search Engine
      …
    section #a14
      section-title #a15: Information Retrieval
      …
    section #a16
      section-title #a17: Examples
      p #a18: … Here are some IR based Search Engines: …
      p #a19: …search engine NewSearch uses a new information retrieval technology
      p #a20: semantic information retrieval techniques are also being incorporated into some search engines

38 Motivation
Consider the query:
- Find document components in articles.xml that are about "search engine". Relevance to "internet" and "information retrieval" is desirable but not necessary.
Using AND and OR predicates will not give us the desired result.

39 Motivation
The XML document article.xml from slide 37 is shown again.

40 Motivation
Illustrating the granularity problem: which elements should be ranked?
- If we rank articles, the user sees the whole article, while the relevant information is concentrated only in the third chapter.
- If we rank paragraphs, the paragraphs of the last section are returned separately. The semantic linkage is broken and has to be reconstructed by the user.

41 Motivation
IR-style XML queries don't have to stand alone. If the user knows the structure of the XML document, he can add some structural constraints and limit the number of uninteresting results.

43 Algebra We want to fold into a database framework the notion of relevance scoring and ranking

44 Algebra
Scored Trees
- Scored Data Tree
- Scored Pattern Tree

45 Algebra
Scored Data Tree
- Definition: a rooted ordered tree in which each node has attribute-value pairs, including at least a tag and a real-valued score. The score of a tree is the score of its root node.
- Example: article[3.6] #a1 with author #a3, sname #a5 and section[3.6] #a16
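As a data structure, a scored data tree needs little more than a tag, a score and an ordered list of children per node. The following is a minimal sketch under assumptions: the class name `ScoredNode`, the `node_id` field and the default score of 0.0 for nodes shown without a score are illustrative choices, not part of the definition, and the nesting of sname under author is assumed from the document example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScoredNode:
    tag: str
    node_id: str
    score: float = 0.0   # assumption: nodes shown without a score default to 0.0
    children: List["ScoredNode"] = field(default_factory=list)   # ordered children


def tree_score(root: ScoredNode) -> float:
    """The score of a tree is the score of its root node."""
    return root.score


# The example from the slide: article[3.6] #a1 with author, sname and section[3.6].
article = ScoredNode("article", "#a1", 3.6, [
    ScoredNode("author", "#a3", children=[ScoredNode("sname", "#a5")]),
    ScoredNode("section", "#a16", 3.6),
])

print(tree_score(article))
```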

46 Algebra
Scored Pattern Tree
- Definition: P = (T, F, S)
  - T: a node-labeled and edge-labeled tree
  - F: a formula, a boolean combination of predicates applicable to nodes
  - S: a set of scoring functions

47 Algebra
Scored Pattern Tree
- Example, Query2: Find document components in article.xml that are part of an article written by an author with last name "Doe" and are about "search engine". Relevance to "internet" and "information retrieval" is desirable but not necessary.
  - T: nodes $1, $2, $3, $4; edge labels pc and ad*
  - F: $1.tag = article & $2.tag = author & $3.tag = sname & $3.content = "Doe"
  - S: $4.score = { ScoreFoo({"search engine"}, {"internet", "information retrieval"}) }, $1.score = $4.score

48 Algebra
Common operators
- Selection => Scored Selection
- Projection => Scored Projection
- Join => Scored Join
New operators
- Threshold
- Pick

49 Algebra (New Operators)
Threshold
- T: nodes $1, $2, $3, $4; edge labels pc and ad*
- F: $1.tag = article & $2.tag = author & $3.tag = sname & $3.content = "Doe"
- S: $4.score = { ScoreFoo({"search engine"}, {"internet", "information retrieval"}) }, $1.score = $4.score
- TC: %a >...
Trees shown in the example:
- article[3.6] #a1: author #a3, sname #a5, section[3.6] #a16
- article[3.6] #a3: author #a23, sname #a25, section[3.6] #a36

50 Algebra (New Operators)
Pick
Trees shown before the operator:
- article[3.6] #a1: author #a3, sname #a5, section[3.6] #a16
- article[3.6] #a2: author #a13, sname #a15, section[3.6] #a26
- article[3.6] #a3: author #a23, sname #a25, section[3.6] #a36
Pattern:
- T: nodes $1, $2, $3, $4; edge labels pc and ad*
- F: $1.tag = article & $2.tag = author & $3.tag = sname & $3.content = "Doe"
- S: $4.score = { ScoreFoo({"search engine"}, {"internet", "information retrieval"}) }, $1.score = $4.score
- PC
Trees shown after the operator:
- article[3.6] #a1: author #a3, sname #a5, section[3.6] #a16
- article[3.6] #a3: author #a23, sname #a25, section[3.6] #a36

51 Algebra (New Operators)
Pick example
Data tree nodes: article[5.6] #a1, title[0.6] #a2, sname #a5, chapter[5.0] #a10, section[0.8] #a12, title[0.8] #a13, section[0.6] #a14, title[0.6] #a15, section[3.6] #a16, p[0.8] #a18, p[1.4] #a19, p[1.4] #a20
Pick condition. Data is relevant if:
1. score > 0.8
2. more than 50% of children are relevant
3. its direct parent node is not picked

52 Translating to XQuery
Query1
- Find document components in articles.xml that are about "search engine". Relevance to "internet" and "information retrieval" is desirable but not necessary.
XQuery
    For $a in document("articles.xml")//article/descendant-or-self::*
    Score $a using ScoreFoo($a, {"search engine"}, {"internet", "information retrieval"})
    Pick $a using PickFoo($a)
    Return $a
    Sortby(score)
    Threshold @score >= 4 stop after 5

54 Access Methods
Score-Generating Methods
- TermJoin
Score-Utilizing Methods
- Pick

55 Score-Generating Methods
How do we give an initial score to the data tree? The score of every node should be computed according to the number of occurrences of the search terms in the node or its descendants.

56 Naïve algorithm
For every node, recompute the scores of all its ancestors. The running time is poor.

57 TermJoin
Stack-based algorithm
- Use a stack to store the ancestors of every node
- Now all ancestors are affected by the node

58 TermJoin
Phrase: "a"
Example: elements at positions (1:9), (2:7) and (3:5), with text nodes at positions (4), (6) and (8). The stack holds the encoding of the open elements together with their counts for the term.
If the phrase has more than one word, several matching streams are processed simultaneously.
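The stack idea can be illustrated with a simple occurrence counter. This is a sketch under assumptions rather than the TermJoin algorithm itself: it handles a single term, counts occurrences instead of applying a scoring function, and the function name `term_counts` and the input layout are hypothetical. Each occurrence is added only to the innermost open element, and a count is passed to the parent when an element is popped, so ancestors are not revisited per occurrence.

```python
def term_counts(elements, term_positions):
    """elements: (left, right) pairs sorted by left.
    term_positions: sorted positions of the text nodes that contain the term.
    Returns {(left, right): number of occurrences in that element's subtree}."""
    counts = {}
    stack = []   # open elements as [left, right, running count]

    def close_top():
        left, right, count = stack.pop()
        counts[(left, right)] = count
        if stack:
            stack[-1][2] += count   # hand the subtree total to the parent

    i = j = 0
    while i < len(elements) or j < len(term_positions):
        # Merge the two inputs by position.
        take_element = j >= len(term_positions) or (
            i < len(elements) and elements[i][0] < term_positions[j]
        )
        pos = elements[i][0] if take_element else term_positions[j]
        while stack and stack[-1][1] < pos:   # close elements that ended before pos
            close_top()
        if take_element:
            stack.append([elements[i][0], elements[i][1], 0])
            i += 1
        else:
            if stack:
                stack[-1][2] += 1             # count only at the innermost element
            j += 1
    while stack:
        close_top()
    return counts


# Positions from the slide; assumed here: the term occurs in the text at 4 and 8.
print(term_counts([(1, 9), (2, 7), (3, 5)], [4, 8]))
```

With these assumed occurrences, the element (3:5) and the element (2:7) each end with a count of 1, and the root (1:9) ends with 2.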

59 Score-Utilizing Methods
Methods that help us filter the data according to their scores. Two such methods are:
- Threshold
- Pick
Pick can be quite a challenge to implement.

60 Score-Utilizing Methods
Pick algorithm
- The most complex part of the algorithm is removing redundancy.
- There is vertical (parent-child) and horizontal (among the siblings, e.g. return the first author from the relevant article) redundancy.
- The problem is solved with a stack-based algorithm.

61 Pick algorithm
Example tree: 1 chapter; 2 title ("Search and retrieval"); 3 section; 4 p; 5 p; 6 section; 7 section
Text fragments shown: "… IR …", "Search engine …", "Search engine", "retrieval of syntactic information"
Scores shown: 1, 2, 0, 4, 5, 0
Pick condition: score >= 2, percentage >= 50%
Two stacks are used:
- Ancestor stack: contains elements not yet fully explored
- Main stack: contains elements that cannot yet be eliminated

62 Algebra (New Operators)
Pick
- T: nodes $1, $4; edge label ad*
- F: $1.tag = article
- S: $4.score = { ScoreFoo({"search engine"}) }, $1.score = $4.score
- PC: score >= 2, percentage >= 50%
Input tree: 1 chapter; 2 title ("Search and retrieval"); 3 section; 4 p; 5 p; 6 section; 7 section
Result: section with children p, p; text shown: … IR … Search engine … Search engine retrieval of syntactic information

64 Conclusion
- Stack-based algorithms are used for an efficient implementation of the new ideas
- A usable algebra is presented that deals with scoring and relevance in XML keyword search
- Possible extension of XQuery
