Querying Streaming XML Data
Layout of the presentation: Introduction; Common problems faced; Solution proposed; Basic building blocks of the solution; How to build up a solution to a given query; Features of the system.
Streaming XML: XML is a standard for information exchange. Some XML documents are only available in streaming format. Streaming is like reading data from a tape drive (sequential, in a single pass). It is used for stock market data, news, and network statistics. Predecessor systems were used to filter documents.
Structure of an XPath Query: A query consists of a location path and an output expression (name). The location path consists of a closure axis (//), a node test (book), and a predicate (year>2000). Example: //book[year>2000]/name
Features of our Approach: Efficient. Easy-to-understand design. Designing the BPDT (basic pushdown transducer) is tricky.
Common Problems faced
Query: /pub[year=2002]/book[price<11]/author
[Example XML document; surviving labels: First, 6. A, Second, 12. A, 13. B]
1. Element satisfies the path.
2. Failure??
3. Test passed. But year=2002?
4. Buffer both A & B.
5. Failed price<11. Remove.
6. Test passed. Output.
Problems caused by closure axis
Query: //pub[year=2002]//book[author]//name
[Example XML document; surviving labels: X, 5. A, Y, Z, 12. B]
pub[year=2002]    book[author]
Line 2: True      Line 7: False
Line 2: True      Line 10: True
Line 9: False     Line 10: True
Fails year=2002. Passes year=2002.
Problems caused by closure axis
Query: //pub[year=2002]//book[author]//name
[Example XML document with an author added; surviving labels: X, 5. A, Y, 9. B, Z, 13. B]
Let's add an author. Result?
Handling XML Stream
Input: a well-formed XML stream, parsed with the SAX API.
Events belong to one of three sets, where a is the element tag, attrs its attributes, and d its depth:
Begin = {(a, attrs, d)}
End = {(/a, d)}
Text = {(a, text(), d)}
XML stream: {e_1, e_2, …, e_i, …} | e_i ∈ Begin ∪ End ∪ Text
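The event model above can be sketched with Python's standard xml.sax module. This is a minimal illustration, not the system's parser: the class name EventCollector, the helper parse_events, the sample document, and the choice of depth 1 for the root element are all assumptions made for the example.

```python
import xml.sax


class EventCollector(xml.sax.ContentHandler):
    """Turns SAX callbacks into Begin, Text and End tuples (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []  # the event stream e_1, e_2, ...
        self.path = []    # currently open elements; len(path) is the depth d

    def startElement(self, name, attrs):
        self.path.append(name)
        # Begin event: (a, attrs, d)
        self.events.append((name, dict(attrs), len(self.path)))

    def characters(self, content):
        text = content.strip()
        if text:
            # Text event: (a, text(), d), where a is the enclosing element
            self.events.append((self.path[-1], text, len(self.path)))

    def endElement(self, name):
        # End event: (/a, d)
        self.events.append(("/" + name, len(self.path)))
        self.path.pop()


def parse_events(xml_text):
    handler = EventCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


# Assumed sample document, not the one from the slides.
events = parse_events('<pub><book id="1"><name>First</name></book></pub>')
for event in events:
    print(event)
```

A SAX parser may deliver one text node in several characters() calls, so a fuller version would concatenate chunks before emitting a Text event.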
Grammar for XPath Queries
Q → N+ [/O]
N → [/ | //] tag [F]
F → "[" FO [OP constant] "]"
FO → | tag | text()
O → | text()
OP → > | ≥ | = | < | ≤ | ≠ | contains
An XPath query has the form N_1 N_2 … N_n /O.
Cannot handle reverse axes or positional functions.
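To make the grammar concrete, here is a rough sketch that splits a query into its location steps N_1 … N_n and an optional output expression. It is an illustration only: parse_query is a hypothetical name, the ASCII spellings >=, <= and != stand in for ≥, ≤ and ≠, and only the tag and text() alternatives shown on the slide are handled.

```python
import re

# One location step N: axis, node test, optional predicate F.
STEP = re.compile(
    r"(?P<axis>//|/)"              # closure axis (//) or child axis (/)
    r"(?P<tag>[A-Za-z_][\w.-]*)"   # node test
    r"(?:\[(?P<pred>[^\]]*)\])?"   # optional predicate in square brackets
)

# Predicate body: FO [OP constant]
PRED = re.compile(
    r"^(?P<fo>text\(\)|[A-Za-z_][\w.-]*)"
    r"(?:\s*(?P<op>>=|<=|!=|=|>|<|\bcontains\b)\s*(?P<const>.+))?$"
)


def parse_query(query):
    """Return (steps, output); each step is (axis, tag, predicate or None)."""
    steps, pos = [], 0
    while pos < len(query):
        # A trailing /text() is the output expression O.
        if query.startswith("/text()", pos) and pos + len("/text()") == len(query):
            break
        m = STEP.match(query, pos)
        if not m:
            raise ValueError(f"cannot parse {query!r} at offset {pos}")
        pred = None
        if m.group("pred") is not None:
            p = PRED.match(m.group("pred").strip())
            if not p:
                raise ValueError(f"unsupported predicate {m.group('pred')!r}")
            pred = (p.group("fo"), p.group("op"), p.group("const"))
        steps.append((m.group("axis"), m.group("tag"), pred))
        pos = m.end()
    if not steps:
        raise ValueError("a query needs at least one location step")
    output = "text()" if pos < len(query) else None
    return steps, output


steps, output = parse_query("//pub[year>2000]//book[author]//name/text()")
for step in steps:
    print(step)
print("output:", output)
```

Each step tuple carries the three parts named on the earlier slide: axis, node test, and predicate.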
Solution to Query
Query: /pub[year=2002]/book[price<11]/author
[Figure labels: PDA, PDT]
Basic PushDown Transducer (BPDT): Similar to a pushdown automaton, with actions defined on the transition arcs. It has a finite set of states, a start state, a set of final states, a set of input symbols, and a set of stack symbols.
Building a BPDT
Query: /pub[year>2000]/book[author]/name/text()
Consider the location step /book[author]. For book and its child author:
Buffer for future: begin event of author.
Remove from buffer: end event of book.
Output result if predicates are true: begin event of author.
Basic Building Blocks XPath Expression: /tag[child]
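One possible reading of the /tag[child] building block, instantiated for /book[author] with the text of name as the result, is sketched below. The state names START, NA and TRUE, the tagged event tuples, and the function run_bpdt are assumptions made for this illustration, and depth checks are left out, so it only fits flat book elements.

```python
def run_bpdt(events):
    """Sketch of a transducer for /book[author], emitting name text."""
    state, buffer, output = "START", [], []
    for event in events:
        kind, tag = event[0], event[1]
        if state == "START":
            if kind == "begin" and tag == "book":
                state = "NA"                 # predicate [author] not yet known
        elif state == "NA":
            if kind == "text" and tag == "name":
                buffer.append(event[2])      # potential result: buffer it
            elif kind == "begin" and tag == "author":
                output.extend(buffer)        # predicate holds: flush the buffer
                buffer.clear()
                state = "TRUE"
            elif kind == "end" and tag == "book":
                buffer.clear()               # book closed without author: discard
                state = "START"
        elif state == "TRUE":
            if kind == "text" and tag == "name":
                output.append(event[2])      # predicate already true: emit directly
            elif kind == "end" and tag == "book":
                state = "START"
    return output


# Assumed sample stream: three books, the second one without an author.
sample = [
    ("begin", "book"), ("begin", "name"), ("text", "name", "X"), ("end", "name"),
    ("begin", "author"), ("text", "author", "A"), ("end", "author"), ("end", "book"),
    ("begin", "book"), ("begin", "name"), ("text", "name", "Y"), ("end", "name"),
    ("end", "book"),
    ("begin", "book"), ("begin", "author"), ("text", "author", "B"), ("end", "author"),
    ("begin", "name"), ("text", "name", "Z"), ("end", "name"), ("end", "book"),
]
result = run_bpdt(sample)
print(result)
```

The second book's name is buffered and then discarded when the book ends without an author, which is the case that makes buffering necessary.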
Buffer Operations needed
Enqueue(x): Adds x to the end of the queue.
Clear(): Removes all items from the queue.
Flush(): Outputs all items in the queue in FIFO order.
Upload(): Moves all items to the end of the queue of the parent BPDT.
No Dequeue operation is needed.
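The four operations map naturally onto a small queue class. The sketch below is illustrative: the class name BpdtBuffer, the parent attribute, and the decision that Flush() empties the queue after writing it out are assumptions, not details given on the slide.

```python
from collections import deque


class BpdtBuffer:
    """FIFO buffer of potential results held by one BPDT (hypothetical class)."""

    def __init__(self, parent=None):
        self.items = deque()
        self.parent = parent  # buffer of the parent BPDT, if any

    def enqueue(self, x):
        self.items.append(x)          # add x to the end of the queue

    def clear(self):
        self.items.clear()            # drop all buffered items

    def flush(self, out):
        out.extend(self.items)        # output in FIFO order
        self.items.clear()            # assumption: flushed items leave the queue

    def upload(self):
        if self.parent is None:
            raise ValueError("no parent buffer to upload to")
        self.parent.items.extend(self.items)  # append to the parent's queue
        self.items.clear()


parent = BpdtBuffer()
child = BpdtBuffer(parent=parent)
child.enqueue("A")
child.enqueue("B")
child.upload()        # A and B now wait in the parent buffer
parent.enqueue("C")
out = []
parent.flush(out)
print(out)
```

Items only ever leave the queue all at once (flushed, cleared, or uploaded), which is why no Dequeue operation appears in the list.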
Basic Building Blocks XPath Expression:
Basic Building Blocks XPath Expression: /tag[text()=val]
Basic Building Blocks XPath Expression:
Basic Building Blocks XPath Expression: /tag[child=val]
A sample BPDT Query: /pub[year>2000]
Building a solution: HPDT (hierarchical pushdown transducer) for the query //pub[year>2000]//book[author]//name/text()
HPDT Structure
Each BPDT in the HPDT has:
Position (l, k): l is the depth of the BPDT in the HPDT; k is its sequence number, counted from right to left.
The BPDT at position (i-1, k) has a right child at position (i, 2k), connected to its NA state.
The BPDT at position (i-1, k) has a left child at position (i, 2k+1), connected to its TRUE state.
A BPDT at position (i, 2^i - 1) means that the predicates in all higher-level BPDTs have evaluated to true.
Buffer: holds potential results.
Stack: holds the stack of element (SAX) events.
Depth vector.
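The position scheme can be checked with a few lines of arithmetic. In this sketch the function names are hypothetical and the root BPDT is assumed to sit at position (0, 0); read in binary, k then records for each ancestor whether the BPDT was reached through a TRUE state (bit 1) or an NA state (bit 0).

```python
def children(l, k):
    """Positions of the two children of the BPDT at (l, k)."""
    return {
        "NA": (l + 1, 2 * k),        # right child, attached to the NA state
        "TRUE": (l + 1, 2 * k + 1),  # left child, attached to the TRUE state
    }


def all_predicates_true(l, k):
    """True when every higher-level predicate is known to hold."""
    return k == 2 ** l - 1


def predicate_flags(l, k):
    """Per level, whether the path from the root went through a TRUE state."""
    return [bool((k >> (l - 1 - j)) & 1) for j in range(l)]


root = (0, 0)  # assumption: the root BPDT has position (0, 0)
print(children(*root))
print(all_predicates_true(2, 3), predicate_flags(2, 3))
print(all_predicates_true(2, 2), predicate_flags(2, 2))
```

For example, (2, 2) is the NA child of (1, 1): the first predicate is known to be true while the second is still undecided, so results reaching that BPDT have to be buffered.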
Example Query
Query: //pub[year=2002]//book[author]//name
[Example XML document; surviving labels: X, 5. A, Y, Z, 12. B]
[Figure labels: root, pub, book, name; paths from $1 to $14]
System Features
Reference: Feng Peng and Sudarshan S. Chawathe. XPath Queries on Streaming Data. In SIGMOD 2003.
Thank You. Questions?