Dan Suciu, XML Toolkit

XMLTK: An XML Toolkit for Scalable XML Stream Processing
I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu

Motivation
Lots of data sits in large text files
– ad hoc data formats
"Queried" with Unix command line tools
– grep, sort, tail, etc.
Would be nice to XML-ize it...
...but then the Unix command line tools won't work any more.

Example
In the old Unix world…
Text file (columns: score, decision, paperID, title):
6 accept P054 "Theory of XML parsing"
3 reject P021 "Experience with an XML optimizer"
7 accept P069 "Towards a unified theory of data models"
...
Find the top ten rejected papers (in score order):
grep "reject" papers.txt | sort -n | tail -n 10

Example (cont'd)
In the new XML world…
<submissions>
  <paper><score>6</score><decision>accept</decision><paperID>P054</paperID><title>Theory of XML parsing</title></paper>
  <paper><score>3</score><decision>reject</decision><paperID>P021</paperID><title>Experience with an XML optimizer</title></paper>
  ...
</submissions>
… can't use those tools anymore

Example (cont'd)
Doing it with the XML Toolkit:
xsort -c /submissions -e paper[decision/text()="reject"] -k score/text() papers.xml | xtail -c /submissions -e paper -n 10
Finds the top ten rejected <paper>s, in order

Goals of the XML Toolkit
Simple, scalable tools for XML processing
Provides a service: there are people who need this
Provides a research platform: for XML stream processing

Outline
The tools
The XPath processing engine
Conclusions

The Tools
Current tools:
xsort, xagg, xnest, xflatten, xdelete, xpair, xhead, xtail, file2xml, xmill
Will talk only about xsort
May look like plenty, but actually still incomplete...

XSort: Definition
General form:
xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)*
-c = the context, i.e. where to sort
-e = the item, i.e. what to sort
-k = the key, i.e. what to sort on

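XSort: Definition
[Figure: xsort applied to a document with three contexts c. Items in the input, in document order: e1 e2 e3 e4 e5 e6 e7 e8 e9. Items in the output: e4 e1 e3 e2 e6 e7 e5 e9 e8. The contexts stay in place; the items are reordered.]
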
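A minimal in-memory sketch of what -c, -e and -k select, assuming a single -e with a single -k and items that are direct children of the context. It is not the toolkit's implementation, which works on a stream; the function name sort_items and the use of Python's xml.etree.ElementTree are assumptions made for illustration. Children of the context that -e does not select are dropped from the output.

```python
import xml.etree.ElementTree as ET


def sort_items(xml_text, context_path, item_path, key_path):
    """Sort the items found under each context by the text of their key.

    Paths use ElementTree's limited XPath subset and are relative to the
    document root, so "." stands for a context such as /bib when bib is
    the root element. A key such as title/text() is written "title" here.
    """
    root = ET.fromstring(xml_text)
    for ctx in root.findall(context_path):
        items = ctx.findall(item_path)
        # list.sort is stable: items with equal keys keep document order.
        items.sort(key=lambda item: item.findtext(key_path, default=""))
        # Keep only the selected items, now in key order.
        for child in list(ctx):
            ctx.remove(child)
        ctx.extend(items)
    return ET.tostring(root, encoding="unicode")


doc = (
    "<bib>"
    "<book><title>B</title></book>"
    "<paper><title>Zeta</title></paper>"
    "<paper><title>Alpha</title></paper>"
    "</bib>"
)
print(sort_items(doc, ".", "paper", "title"))
```

The call corresponds roughly to xsort -c /bib -e paper -k title/text(): the papers come out ordered by title and the book is gone.
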
XSort Examples
Examples illustrated on data like this:
Elliotte Rusty Harold W. Scott Means XML in a Nutshell O'Reilly
Sylvain Devillers XML and XSLT Modeling for Multimedia Bitstream Manipulation WWW Posters db/conf/www/www2001p.html#Devillers

XSort: Examples
xsort -c /bib -e paper -k title/text()
Sorts the <paper>s, by <title>
The <book>s are dropped from the output
Compare to…
xsort -c /bib -e * -k title/text()
xsort -c /bib -e paper -k title/text() -e book -k title/text()

XSort: Examples
xsort -c /bib -e paper/author -k lastName/text() -k firstName/text()
Sorts the <author>s, by <lastName>, then <firstName>

XSort: Examples
xsort -c /bib -e paper -e article -e book -e *
<paper>s first, then <article>s, then <book>s, then all the rest

XSort: Examples
xsort -c /bib/* -e author -e title -e year -e *
Normalize all entries: <author>s first, then <title>s, then <year>s, then all the other elements
xsort -c /bib/paper -e author -e * -c /bib/book -e title -e *
In <paper>s list the <author>s first; in <book>s list the <title> first; leave other entries unchanged

XSort: Implementation
Sorts one context at a time, copies the rest
For each context:
– Create a "global key" for each item
– Sort items, with a two-pass, multiway merge sort
Quote from Databases 101 (news from the trenches):
– with disk blocks of 4KB and 128MB of main memory, one can sort files up to 4TB in two passes!

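The 4TB figure follows from simple arithmetic: pass one writes sorted runs of about one memory-load each, and pass two can merge as many runs as there are block-sized buffers in memory, so the capacity is roughly (memory / block) × memory. The sketch below checks that number and shows a generic two-pass merge sort over text records. It is an illustration under assumptions, not xsort's code: the names two_pass_capacity and external_sort are made up here, records are plain strings without newlines, and the output buffer is ignored in the capacity estimate.

```python
import heapq
import tempfile


def two_pass_capacity(block_bytes, memory_bytes):
    """Largest input (bytes) sortable in two passes, to a first approximation."""
    fan_in = memory_bytes // block_bytes  # one block-sized buffer per run
    return fan_in * memory_bytes          # each run is about one memory-load


def external_sort(records, run_size):
    """Yield string records in sorted order using sorted runs on disk."""
    runs, buf = [], []

    def flush():
        # Pass 1: sort one memory-load and write it out as a run.
        buf.sort()
        run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        run.writelines(r + "\n" for r in buf)
        run.seek(0)
        runs.append(run)
        buf.clear()

    for record in records:
        buf.append(record)
        if len(buf) >= run_size:
            flush()
    if buf:
        flush()

    # Pass 2: one multiway merge over all runs.
    streams = ((line.rstrip("\n") for line in run) for run in runs)
    for record in heapq.merge(*streams):
        yield record
    for run in runs:
        run.close()


KB, MB, TB = 1024, 1024**2, 1024**4
print(two_pass_capacity(4 * KB, 128 * MB) // TB)
print(list(external_sort(["d", "a", "c", "b", "e"], run_size=2)))
```

With 4KB blocks and 128MB of memory the fan-in is 32768 runs of 128MB each, which is where the 4TB comes from.
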
XSort: Performance
[Chart: running time in seconds of Xalan and xsort, plotted against input size in KB]
xsort -c /dblp -e * -k title/text()
1GB ! 8 minutes

Outline
The tools
The XPath processing engine
Conclusions

The XPath Processor
Common to all tools is the following problem:
Given:
– Set of correlated XPath expressions
– Stream of SAX events
Decide:
– When are the expressions true
Output: variable events

Tree pattern: Example
xsort -c /bib -e paper -k publisher -e book -k title -e *
Tree pattern:
$r -bib-> $c
$c -paper-> $e1, $c -book-> $e2, $c -*-> $e3
$e1 -publisher-> $k1, $e2 -title-> $k2
Input data:
Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995
Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998
Variable events: $r $c $e2 $k2

The XPath Processor
How we did it:
All XPath expressions are compiled into a Deterministic Finite Automaton
– Restriction: no predicates yet (current work...)
Does this scale to many, many XPath expressions?
– Yes, if we compute the DFA lazily (upcoming ICDT'2003 paper)
Evaluation time = parsing time
Can do even better with a Stream IndeX (next)

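A small sketch of the lazy idea, under stated assumptions: paths use only child steps with element names or *, and there are no predicates. A DFA state is the set of (expression, position) pairs still alive, and a transition is computed only the first time a (state, tag) pair is seen, then cached. The class and function names are hypothetical, and Python's ElementTree iterparse stands in for a SAX parser; this is not the toolkit's engine.

```python
import io
import xml.etree.ElementTree as ET


class LazyDFA:
    def __init__(self, paths):
        # "/bib/paper/title" -> ["bib", "paper", "title"]
        self.steps = [p.strip("/").split("/") for p in paths]
        self.start = frozenset((i, 0) for i in range(len(self.steps)))
        self.table = {}  # (state, tag) -> state, filled in on demand

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:  # build this transition lazily
            self.table[key] = frozenset(
                (i, pos + 1)
                for i, pos in state
                if pos < len(self.steps[i]) and self.steps[i][pos] in (tag, "*")
            )
        return self.table[key]

    def accepting(self, state):
        return sorted(i for i, pos in state if pos == len(self.steps[i]))


def match_stream(paths, events):
    """Return (path, tag) for every start event at which a path matches."""
    dfa = LazyDFA(paths)
    stack, hits = [dfa.start], []
    for kind, tag in events:
        if kind == "start":
            state = dfa.step(stack[-1], tag)
            stack.append(state)
            hits.extend((paths[i], tag) for i in dfa.accepting(state))
        else:  # "end": return to the parent's state
            stack.pop()
    return hits


def events_of(xml_bytes):
    """Turn a document into a stream of (kind, tag) start/end events."""
    for kind, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        yield kind, elem.tag


doc = (b"<bib><paper><title>T1</title><publisher>P</publisher></paper>"
       b"<book><title>T2</title></book></bib>")
paths = ["/bib/paper/title", "/bib/*", "/bib/book/title"]
for hit in match_stream(paths, events_of(doc)):
    print(hit)
```

Each event costs one dictionary lookup once its transition is cached, regardless of how many expressions are loaded, which is the intuition behind evaluation time tracking parsing time.
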
Stream IndeX (SIX)
News: The parser is the main bottleneck in XPath stream processing!
Solution: "Index" the XML stream, parse only partially
Definition: The SIX = a table of (start, end) offsets

Stream IndeX (SIX): Construction
XML data:
Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995
Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998
[Figure: the XML document next to its SIX, a table with columns start and end and one row per element: bib, book, publisher, author, author, ...]

Stream IndeX (SIX): Skip Parsing
XML data:
Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995
Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems
XPath: /bib/paper/title...
[Figure: the XPath expression and the XML data feed into Skip Parsing]

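A toy sketch of both steps, assuming a very restricted XML subset (no comments, CDATA or processing instructions) that a regular expression can tokenize. build_six records one (start, end) character offset pair per element; skip_select evaluates a simple child-only path and, whenever an element cannot match, jumps straight to its end offset instead of tokenizing its content. The function names, the tokenizer and the element names in the sample document are assumptions for illustration; the real SIX format and parser are not described on these slides.

```python
import re

# Start tag, end tag or empty-element tag of the restricted subset.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*>")


def build_six(xml):
    """Return a sorted list of (start, end) offsets, one pair per element."""
    six, open_starts = [], []
    for m in TAG.finditer(xml):
        if m.group(1):                       # end tag closes the latest element
            six.append((open_starts.pop(), m.end()))
        elif m.group(0).endswith("/>"):      # empty element
            six.append((m.start(), m.end()))
        else:
            open_starts.append(m.start())
    return sorted(six)


def skip_select(xml, six, path):
    """Return (matching elements, number of tags actually tokenized)."""
    steps = path.strip("/").split("/")
    end_of = dict(six)                       # element start -> element end
    hits, depth, pos, tags_seen = [], 0, 0, 0
    while True:
        m = TAG.search(xml, pos)
        if m is None:
            return hits, tags_seen
        tags_seen += 1
        if m.group(1):                       # end tag of an element we entered
            depth -= 1
            pos = m.end()
            continue
        start, end = m.start(), end_of[m.start()]
        if steps[depth] != m.group(2):
            pos = end                        # cannot match: skip the subtree
        elif depth + 1 == len(steps):
            hits.append(xml[start:end])      # full match: take it whole
            pos = end
        elif end == m.end():
            pos = end                        # empty element, nothing inside
        else:
            depth += 1                       # partial match: descend
            pos = m.end()


doc = ("<bib><book><publisher>AW</publisher><title>FoD</title></book>"
       "<paper><title>X</title></paper></bib>")
six = build_six(doc)
print(skip_select(doc, six, "/bib/paper/title"))
```

On this document the query touches 6 of the 12 tags: the whole book subtree is skipped with a single offset lookup.
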
Stream IndeX (SIX) in XML Stream Processing
The SIX stream is about 6% of the data stream
And can be made MUCH smaller
[Figure: the SIX stream and the XML stream are delivered together (e.g. DIME)]

Outline
The tools
The XPath processing engine
Conclusions

Conclusions
The toolkit is already available
What it does so far it does very well:
– Sorting, aggregation, nest/unnest
But doesn't do too much:
– Restricted selections, no projections, no restructurings yet
– Volunteers welcome!
Can one process XML data without parsing it completely?
– SIX