
From Searching Text to Querying XML Streams (XML Toolkit). Dan Suciu, www.cs.washington.edu/homes/suciu


1 From Searching Text to Querying XML Streams
Dan Suciu
www.cs.washington.edu/homes/suciu

2 About Me
Born 1957, Romania
BS: Bucharest, PhD: University of Pennsylvania
Now: University of Washington (Seattle)
My work is on semistructured data
Book: Data on the Web: From Relations to Semistructured Data and XML
Past/present projects:
–XML-QL = precursor of XQuery
–XMill = the XML compressor
–XML Toolkit

3 Motivation
Text databases
–Studied over the past 15 years
–Traditional client/server model
–Struggled with lack of standard text syntax
Recently, new standard: XML
–Traditional client/server: in today's DBMS
–New applications: stream processing
This talk: processing streaming XML data
–My motivation: work on the XML Toolkit project

4 Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

5 Background: Relational Databases
Structured, stored in tables
Schema separate from data
Queries: precise, refer to schema and data (SQL)

BOOKS:
ISBN        Title                     Year  Publisher
0201537710  Foundations of Databases  1995  AW
155860622X  Data on the Web           1999  MK

AUTHOR:
AID  Name       Country
44   Abiteboul  FR
06   Buneman    UK
62   Hull       USA
12   Suciu      USA
29   Vianu      USA

WROTE:
ISBN        AID
0201537710  44
0201537710  62
0201537710  29
155860622X  44
155860622X  06
155860622X  12

Hard to publish, easy to query precisely
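
To make "precise queries that refer to schema and data" concrete, here is a minimal sketch that loads the three tables above into an in-memory SQLite database and asks for the titles written by a given author. The lowercase table and column names and the helper `titles_by_author` are assumptions made for this illustration; the slides do not show any SQL.

```python
import sqlite3

# In-memory database holding the BOOKS, AUTHOR and WROTE tables from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books  (isbn TEXT PRIMARY KEY, title TEXT, year INTEGER, publisher TEXT);
CREATE TABLE author (aid TEXT PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE wrote  (isbn TEXT, aid TEXT);
""")
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?)", [
    ("0201537710", "Foundations of Databases", 1995, "AW"),
    ("155860622X", "Data on the Web", 1999, "MK"),
])
conn.executemany("INSERT INTO author VALUES (?, ?, ?)", [
    ("44", "Abiteboul", "FR"), ("06", "Buneman", "UK"), ("62", "Hull", "USA"),
    ("12", "Suciu", "USA"), ("29", "Vianu", "USA"),
])
conn.executemany("INSERT INTO wrote VALUES (?, ?)", [
    ("0201537710", "44"), ("0201537710", "62"), ("0201537710", "29"),
    ("155860622X", "44"), ("155860622X", "06"), ("155860622X", "12"),
])

def titles_by_author(name):
    """Titles of all books written by `name`, oldest first (hypothetical helper)."""
    rows = conn.execute("""
        SELECT b.title
        FROM books b
        JOIN wrote  w ON w.isbn = b.isbn
        JOIN author a ON a.aid  = w.aid
        WHERE a.name = ?
        ORDER BY b.year
    """, (name,))
    return [title for (title,) in rows]

print(titles_by_author("Abiteboul"))
```

The query names tables and columns explicitly, which is what makes it precise, and also why the data must first be forced into a fixed schema.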

6 Background: Text Databases
Unstructured, stored in documents
No schema, only data
Queries: imprecise, refer to data only (keywords)
Foundations of Databases, Abiteboul (FR), Hull (USA), Vianu (USA) Addison Wesley, 1995
Data on the Web, Abiteboul (FR), Buneman (UK), Suciu (USA) Morgan Kaufmann, 1999
Easy to publish, hard to query precisely

7 Background: XML Data
Semistructured
Schema and data are together: self-describing
Queries: precise, refer to schema and data (XPath, XQuery)
Foundations… Abiteboul FR Hull USA Vianu USA Addison Wesley 1995 …
XML: Easier to publish, easy to query precisely

8 Background: XML Data
[Figure: an XML document drawn as a tree. Root: bib. Node labels: book, paper, title, author, publisher, journal, name, country. Leaf values: Data on the Web, Abiteboul, FR, Buneman, UK, Addison Wesley]
Data model = tree

9 Background: XML Data
Querying with XPath (and XQuery)
This talk: XPath queries restricted to: tag, /, //, *, [ ], path="constant"

10 Background: XPath in One Slide
/bib/book[author/name="Abiteboul"]
/bib/book[year="1995" and author[name="Abiteboul" and country="FR"]]
/bib/book/author/name
/bib/book//name/*/zip
Operators: tag, /, //, *, [ ]
Navigate partially known structure
Conjunctive queries a la SQL
This is precisely the "region algebra". E.g. use proximal nodes [Navarro&Baeza-Yates'97]
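
As a rough illustration of what these expressions select, the sketch below evaluates three of them on a small in-memory document using Python's `xml.etree.ElementTree`, which supports only a subset of XPath. The sample document and the helper names are assumptions made for this example; nested-path predicates such as `[author/name=...]` are written as explicit filters because ElementTree predicates accept only a direct child tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document shaped like the bibliography used in the talk.
DOC = """<bib>
  <book><title>Foundations of Databases</title><year>1995</year>
    <author><name>Abiteboul</name><country>FR</country></author>
    <author><name>Hull</name><country>USA</country></author>
  </book>
  <book><title>Data on the Web</title><year>1999</year>
    <author><name>Abiteboul</name><country>FR</country></author>
    <author><name>Suciu</name><country>USA</country></author>
  </book>
</bib>"""
bib = ET.fromstring(DOC)  # the root element is <bib>, so paths below start at its children

def author_names():
    """/bib/book/author/name"""
    return [n.text for n in bib.findall("book/author/name")]

def books_by(name):
    """/bib/book[author/name="..."], returned as a list of titles"""
    return [b.findtext("title") for b in bib.findall("book")
            if any(n.text == name for n in b.findall("author/name"))]

def books_1995_by_abiteboul_fr():
    """/bib/book[year="1995" and author[name="Abiteboul" and country="FR"]]"""
    return [b.findtext("title") for b in bib.findall("book[year='1995']")
            if b.findall("author[name='Abiteboul'][country='FR']")]

print(author_names())
print(books_by("Suciu"))
print(books_1995_by_abiteboul_fr())
```

This evaluates over a fully parsed tree, which is exactly what a streaming processor cannot afford; the automata in the rest of the talk answer the same kind of question over SAX events.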

11 Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

12 Main Application: XML Packet Routing
Selective Dissemination of Information [Altinel&Franklin'00, Chan et al.'02]
XML content routing [Snoeren et al.'01]
SOAP message routing in application servers

13 XML Packet Routing
[Figure: XML packet routing diagram]

14 [Figure: an input XML stream is matched against a set of XPath expressions and routed to output XML streams]
XPath expressions:
/bib/book/publisher="MK"
/bib/book[category="recent"]/title="Web"
/bib/book//address//*/zip="123"
/bib/book//address//*="Galaxy"
/bib/book/category="recent"
/bib/book/address="123"
/bib/book/address/field="567"
...

15 The XML Stream Processing Problem
Given:
–A set of XPath expressions
–An incoming stream of XML documents
Decide: for each document, which expressions it matches
Hard:
–Large number of XPath expressions, e.g. 10^3 to 10^6
–Streaming XML data, high throughput, e.g. 5MB/s
Easy:
–Shallow XML data, e.g. depth=20
–Short XPath expressions

16 The Approaches
Basic techniques
NFA plus optimizations:
–XFilter/YFilter [Altinel&Franklin'00]
–XTrie [Chan et al.'02]
DFA:
–XML Toolkit
Beyond the obvious
–Stream indexes (XML Toolkit)
–Stream views

17 Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions

18 From XPath to NFA
/catalog/product[category="tools"][*/price = 200]/quantity
//price
[Figure: NFA with transitions labeled catalog, product, category, "tools", *, price, 200, quantity for the first expression, and *, price for the second]
Extra processing needed to combine branches (not in this talk)

19 Basic NFA Evaluation
[Figure: the set of XPath expressions (e.g. /bib/book/publisher="MK", /bib/book[category="recent"]/title, /bib/book//address//*/zip="123", /bib/book//address//*="Galaxy", ...) is compiled into one NFA. SAX events drive the NFA; a STACK holds sets of current states, e.g. {3, 66, 102, 4534, ...}, {2, 3, 543, 43, 254}, {1, 55, 99, ...}]

20 Basic NFA Evaluation
Properties:
–Space = linear in the number of XPath expressions
–Throughput = decreases linearly as the number of XPath expressions grows
Systems: XFilter [Altinel&Franklin'00], YFilter, XTrie [Chan et al.'02]
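
The stack-of-state-sets evaluation can be sketched in a few lines. This is a simplified illustration written for this transcript, not the XFilter, YFilter or XTrie implementation: it handles only linear paths built from tags, `/`, `//` and `*`, with no predicates or value tests, and the names `compile_path`, `NFAFilter` and `matching_queries` are hypothetical.

```python
import re
import xml.sax

def compile_path(xpath):
    """'/a//b/*' -> [('/', 'a'), ('//', 'b'), ('/', '*')]"""
    return re.findall(r"(//|/)([^/]+)", xpath)

class NFAFilter(xml.sax.ContentHandler):
    """Runs all queries at once; an NFA state is (query index, steps matched)."""

    def __init__(self, xpaths):
        super().__init__()
        self.xpaths = list(xpaths)
        self.queries = [compile_path(x) for x in self.xpaths]
        self.current = {(q, 0) for q in range(len(self.queries))}
        self.stack = []        # one saved state set per open element
        self.matched = set()

    def startElement(self, tag, attrs):
        self.stack.append(self.current)
        nxt = set()
        for q, i in self.current:          # cost grows with the number of active states
            steps = self.queries[q]
            if i == len(steps):
                continue                   # accepting state, no outgoing transitions
            axis, name = steps[i]
            if axis == "//":
                nxt.add((q, i))            # '//' may skip over this element
            if name in ("*", tag):
                nxt.add((q, i + 1))
                if i + 1 == len(steps):
                    self.matched.add(self.xpaths[q])
        self.current = nxt

    def endElement(self, tag):
        self.current = self.stack.pop()    # restore the parent's state set

def matching_queries(document, xpaths):
    handler = NFAFilter(xpaths)
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.matched

doc = "<bib><book><address><x><zip>123</zip></x></address><title>Web</title></book></bib>"
print(matching_queries(doc, ["/bib/book/title", "/bib/book//address//*/zip", "/bib/paper"]))
```

The loop over `self.current` on every start tag is where the linear slowdown comes from: more queries mean more active states to advance per SAX event.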

21 Basic DFA Evaluation
[Figure: the same set of XPath expressions is compiled into a DFA. SAX events drive the DFA; the STACK now holds a single current state per level, e.g. 399, 552, 1]

22 Basic DFA Evaluation
Properties:
–Throughput = constant (independent of the number of XPath expressions)!
–Space = GOOD QUESTION
System: XML Toolkit [University of Washington] http://xmltk.sourceforge.net

23 XMLTK: An XML Toolkit for Scalable XML Stream Processing
I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu

24 Motivation
Lots of data sits in large text files
–ad hoc data formats
"Queried" with Unix command line tools
–grep, sort, tail, etc.
Would be nice to XML-ize it...
...but then the Unix command line tools won't work any more.

25 Example
In the old Unix world…
Text file (columns: score, decision, paperID, title):
6 accept P054 "Theory of XML parsing"
3 reject P021 "Experience with an XML optimizer"
7 accept P069 "Towards a unified theory of data models"
...
Find the top ten rejected papers (in score order):
grep "reject" papers.txt | sort -n | tail -n 10

26 Example (cont'd)
In the new XML world…
6 accept P054 Theory of XML parsing
3 reject P021 Experience with an XML optimizer
...
… can't use those tools anymore

27 Example (cont'd)
Doing it with the XML Toolkit:
xsort -c /submissions -e 'paper[decision/text()="reject"]' -k 'score/text()' papers.xml | xtail -c /submissions -e paper -n 10
Finds the top ten rejected <paper>s, in score order
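
For readers without the toolkit at hand, the effect of this pipeline can be approximated in memory. The sketch below is not how xsort or xtail are implemented (they stream and sort externally); it only mirrors the result. The element names `paperID` and `title`, the numeric comparison of scores, and the helper `top_rejected` are assumptions, since the original XML markup is not preserved in this transcript.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML-ized version of the text file from the previous slides.
PAPERS_XML = """<submissions>
  <paper><score>6</score><decision>accept</decision><paperID>P054</paperID><title>Theory of XML parsing</title></paper>
  <paper><score>3</score><decision>reject</decision><paperID>P021</paperID><title>Experience with an XML optimizer</title></paper>
  <paper><score>7</score><decision>accept</decision><paperID>P069</paperID><title>Towards a unified theory of data models</title></paper>
  <paper><score>5</score><decision>reject</decision><paperID>P100</paperID><title>An invented extra row</title></paper>
</submissions>"""

def top_rejected(xml_text, n=10):
    """Rejected papers sorted by score, keeping the last n (like xsort | xtail)."""
    root = ET.fromstring(xml_text)                       # <submissions>
    rejected = [p for p in root.findall("paper")
                if p.findtext("decision") == "reject"]   # paper[decision/text()="reject"]
    rejected.sort(key=lambda p: int(p.findtext("score")))  # assumption: numeric key
    return [(p.findtext("paperID"), int(p.findtext("score")))
            for p in rejected[-n:]]                      # xtail -n

print(top_rejected(PAPERS_XML))
```

Unlike `grep`, the selection here tests the `decision` element specifically, so a paper whose title merely contains the word "reject" would not be picked up.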

28 Goals of the XML Toolkit
Simple, scalable tools for XML processing
Provides a service: there are people who need this
Provides a research platform: for XML stream processing

29 Outline
The tools
The XPath processing engine
Conclusions

30 The Tools
Current tools: xsort, xagg, xnest, xflatten, xdelete, xpair, xhead, xtail, file2xml, xmill
Will talk only about xsort
May look like plenty, but the set is actually still incomplete...

31 XSort: Definition
General form:
xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)*
-c = the context, i.e. where to sort
-e = the item, i.e. what to sort
-k = the key, i.e. what to sort on

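32 XSort: Definition
[Figure: XSort applied to a tree with three contexts c holding the items e1 ... e9. Item order in the input: e1 e2 e3 e4 e5 e6 e7 e8 e9. Item order in the output: e4 e1 e3 e2 e6 e7 e5 e9 e8]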

33 XSort Examples
Examples illustrated on data like this:
Elliotte Rusty Harold W. Scott Means XML in a Nutshell O'Reilly 2001 0-596-00058-8
Sylvain Devillers XML and XSLT Modeling for Multimedia Bitstream Manipulation. 2001 WWW Posters http://www10.org/cdrom/posters/1112.pdf db/conf/www/www2001p.html#Devillers01
.....

34 XSort: Examples
xsort -c /bib -e paper -k 'title/text()'
Sorts the <paper>s, by <title>
The <book>s are dropped from the output
Compare to…
xsort -c /bib -e '*' -k 'title/text()'
xsort -c /bib -e paper -k 'title/text()' -e book -k 'title/text()'

35 XSort: Examples
xsort -c /bib -e paper/author -k 'lastName/text()' -k 'firstName/text()'
Sorts the <author>s, by <lastName> then <firstName>

36 XSort: Examples
xsort -c /bib -e paper -e article -e book -e '*'
<paper>s first, then <article>s, then <book>s, then all the rest
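
The grouping behaviour of several -e patterns without keys can be mimicked with a stable sort on "index of the first pattern that matches". This is an in-memory sketch of the semantics as described on these slides, not the toolkit's code: patterns are limited to a tag name or `*`, children matching no pattern are dropped (as slide 34 states for items not selected by any -e), and `reorder_children` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def reorder_children(context, patterns):
    """Reorder the children of `context` by the first -e pattern they match."""
    def rank(elem):
        for i, pat in enumerate(patterns):
            if pat == "*" or pat == elem.tag:
                return i
        return None                       # matches no pattern

    kept = [child for child in context if rank(child) is not None]
    kept.sort(key=rank)                   # list.sort is stable: ties keep document order
    context[:] = kept                     # replace the children in place
    return [child.tag for child in context]

bib = ET.fromstring("<bib><book/><paper/><article/><misc/><paper/></bib>")
print(reorder_children(bib, ["paper", "article", "book", "*"]))
```

Because `*` matches everything, putting it last acts as a catch-all that keeps the remaining elements after the named groups.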

37 XSort: Examples
xsort -c '/bib/*' -e author -e title -e year -e '*'
Normalize all entries: <author>s first, then <title>s, then <year>s, then all the other elements
xsort -c /bib/paper -e author -e '*' -c /bib/book -e title -e '*'
In <paper>s list the <author>s first; in <book>s list the <title> first; leave other entries unchanged

38 XSort: Implementation
Sorts one context at a time, copies the rest
For each context:
–Create a "global key" for each item
–Sort items, with a two-pass, multiway merge sort
Quote from Databases 101 (news from the trenches):
–with disk blocks of 4KB and 128MB of main memory, one can sort files up to 4TB in two passes!
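
The 4TB figure follows from the usual back-of-the-envelope bound for a two-pass multiway merge sort: pass one produces sorted runs of at most one memory load each, and pass two can merge as many runs as there are block-sized buffers in memory. The sketch below just redoes that arithmetic; it ignores the buffer reserved for output, and the function name is hypothetical.

```python
def max_two_pass_sort_bytes(memory_bytes, block_bytes):
    """Approximate largest file sortable in two passes with the given memory."""
    buffers = memory_bytes // block_bytes   # number of runs that can be merged at once
    return buffers * memory_bytes           # each run is at most one memory load

KB, MB, TB = 2**10, 2**20, 2**40
limit = max_two_pass_sort_bytes(128 * MB, 4 * KB)
print(limit // TB, "TB")                    # 32768 runs of 128MB each
```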

39 XSort: Performance
xsort -c /dblp -e '*' -k 'title/text()'

Size (KB)    Xalan (sec)  Xsort (sec)
0.41         0.08         0.00
4.91         0.09         0.00
76.22        0.27         0.02
991.79       2.52         0.26
9671.79      27.45        2.85
100964.43    -            43.97
1009643.71   -            461.36

The largest input is about 1GB and is sorted in about 8 minutes

40 Outline
The tools
The XPath processing engine
Conclusions

41 The XPath Processor
Common to all tools is the following problem:
Given:
–Set of correlated XPath expressions
–Stream of SAX events
Decide: when the expressions are true, reported as variable events

42 Tree pattern: Example
xsort -c /bib -e paper -k publisher -e book -k title -e '*'
[Figure: tree pattern with variables $r (root), $c (bib), $e1 (paper), $e2 (book), $e3 (*), $k1 (publisher), $k2 (title)]
Sample data: Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995 Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998
Variable events: $r, $c, $e2, $k2

43 The XPath Processor
How we did it:
All XPath expressions are compiled into a single Deterministic Finite Automaton
–Restriction: no predicates yet (current work...)
Does this scale to many, many XPath expressions?
–Yes, if we compute the DFA lazily (upcoming ICDT'2003 paper)
Evaluation time is about equal to parsing time
Can do even better with a Stream IndeX (next)
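
One way to picture "computing the DFA lazily" is a subset construction that is driven by the input: a DFA state is a set of NFA states, and a transition is computed only the first time the stream actually needs it, then cached. The sketch below is an illustration under that reading, not the XML Toolkit's code; it covers only linear paths with tags, `/`, `//` and `*`, and the names `LazyDFA`, `DFARunner` and `run` are hypothetical.

```python
import re
import xml.sax

class LazyDFA:
    """DFA over sets of NFA states (query index, steps matched), built on demand."""

    def __init__(self, xpaths):
        self.queries = [re.findall(r"(//|/)([^/]+)", x) for x in xpaths]
        self.start = frozenset((q, 0) for q in range(len(self.queries)))
        self.table = {}     # (state, tag) -> state, filled lazily
        self.accepts = {}   # state -> indexes of the queries it accepts

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:           # first time this transition is needed
            nxt = set()
            for q, i in state:
                steps = self.queries[q]
                if i == len(steps):
                    continue
                axis, name = steps[i]
                if axis == "//":
                    nxt.add((q, i))
                if name in ("*", tag):
                    nxt.add((q, i + 1))
            nxt = frozenset(nxt)
            self.table[key] = nxt
            self.accepts[nxt] = frozenset(
                q for q, i in nxt if i == len(self.queries[q]))
        return self.table[key]              # afterwards: a single dictionary lookup

class DFARunner(xml.sax.ContentHandler):
    def __init__(self, dfa):
        super().__init__()
        self.dfa, self.state, self.stack, self.matched = dfa, dfa.start, [], set()

    def startElement(self, tag, attrs):
        self.stack.append(self.state)       # one state per open element
        self.state = self.dfa.step(self.state, tag)
        self.matched |= self.dfa.accepts[self.state]

    def endElement(self, tag):
        self.state = self.stack.pop()

def run(dfa, document):
    runner = DFARunner(dfa)
    xml.sax.parseString(document.encode("utf-8"), runner)
    return runner.matched

dfa = LazyDFA(["/bib/book/title", "/bib/paper/title", "//title"])
doc = "<bib><book><title>A</title></book><book><title>B</title></book></bib>"
print(run(dfa, doc), len(dfa.table))
```

On this input only three transitions are ever materialized, because the second `<book>` reuses the entries created by the first; states for tag sequences that never occur in the data are never built.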

44 Stream IndeX (SIX)
News: The parser is the main bottleneck in XPath stream processing!
Solution: "Index" the XML stream, parse only partially
Definition: The SIX = a table of (start, end) offsets

45 Stream IndeX (SIX): Construction
XML: Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995 Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998
SIX (start, end): bib01490124 book3409023 publisher12423 author426879 author978...
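
A table of (start, end) offsets can be produced in a single scan over the bytes of a document. The sketch below is a toy construction for illustration, not the SIX format used by the toolkit: it finds tags with a regular expression, assumes the input contains no comments, CDATA sections or `>` characters inside attribute values, and records `end` as the offset just past the closing tag. The name `build_six` is hypothetical.

```python
import re

# Matches start tags, end tags and empty-element tags; skips <?...?> and <!...>.
TAG = re.compile(rb"<(/?)([A-Za-z_][\w.\-]*)[^<>]*>")

def build_six(data):
    """Return [(tag, start, end)], one row per element, in document order."""
    rows, open_rows = [], []
    for m in TAG.finditer(data):
        if m.group(1):                          # end tag: close the innermost open row
            i = open_rows.pop()
            rows[i] = (rows[i][0], rows[i][1], m.end())
        else:
            rows.append((m.group(2).decode(), m.start(), None))
            if m.group(0).endswith(b"/>"):      # empty element opens and closes at once
                rows[-1] = (rows[-1][0], m.start(), m.end())
            else:
                open_rows.append(len(rows) - 1)
    return rows

data = b"<bib><book><title>T</title></book></bib>"
for tag, start, end in build_six(data):
    print(tag, start, end, data[start:end])
```

Each row lets a consumer jump from `start` straight to `end` without tokenizing anything in between.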

46 Stream IndeX (SIX): Skip Parsing
XPath: /bib/paper/title
XML: Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995 Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998 ...
[Figure: elements that cannot match the XPath expression are skipped rather than parsed]
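
The idea of skipping can be sketched with a toy offset table: when an element's tag does not fit the next step of the path, the processor jumps directly to that element's end offset and never looks inside it. Everything here is an assumption made for illustration (the regex-based index, no empty-element tags, comments or CDATA, a path made only of child steps, and the names `build_six` and `select`); it is not the toolkit's algorithm.

```python
import re
from bisect import bisect_left

TAG = re.compile(rb"<(/?)([A-Za-z_][\w.\-]*)[^<>]*>")

def build_six(data):
    """Toy index: [tag, start, end] per element, in document order."""
    rows, open_rows = [], []
    for m in TAG.finditer(data):
        if m.group(1):
            rows[open_rows.pop()][2] = m.end()
        else:
            open_rows.append(len(rows))
            rows.append([m.group(2).decode(), m.start(), None])
    return rows

def select(data, rows, path):
    """Evaluate a child-only path such as ['bib', 'paper', 'title'] with skipping."""
    starts = [r[1] for r in rows]
    hits, visited, ends, i = [], 0, [], 0
    while i < len(rows):
        tag, start, end = rows[i]
        while ends and ends[-1] <= start:     # leave ancestors that are already closed
            ends.pop()
        depth = len(ends)
        visited += 1
        if tag == path[depth] and depth < len(path) - 1:
            ends.append(end)                  # descend into a matching prefix
            i += 1
        else:
            if tag == path[depth]:
                hits.append(data[start:end])  # full match: copy the bytes out
            i = bisect_left(starts, end)      # skip the whole subtree
    return hits, visited

data = (b"<bib><book><title>A</title><author>X</author></book>"
        b"<paper><title>B</title></paper></bib>")
rows = build_six(data)
print(select(data, rows, ["bib", "paper", "title"]), len(rows))
```

Here the `<book>` subtree is rejected at its start tag, so its `<title>` and `<author>` rows are never visited.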

47 Stream IndeX (SIX) in XML Stream Processing
[Figure: the SIX is shipped as a second stream (e.g. using DIME) alongside the XML stream; SIX entries shown: (0, 205), (30, 66), (72, 188), (90, 110), (95, 98)]
The SIX stream is about 6% of the data stream
And can be made MUCH smaller


50 Outline
The tools
The XPath processing engine
Conclusions

51 Conclusions
The toolkit is already available:
–http://www.cs.washington.edu/homes/suciu/XMLTK
–http://xmltk.sourceforge.net
What it does so far it does very well:
–Sorting, aggregation, nest/unnest
But doesn't do too much:
–Restricted selections, no projections, no restructurings yet
–Volunteers welcome!
Can one process XML data without parsing it completely?
–SIX

