Published by Monica Dickerson. Modified over 9 years ago.
1
From Searching Text to Querying XML Streams
Dan Suciu (XML Toolkit)
www.cs.washington.edu/homes/suciu
2
About Me
Born 1957, Romania
BS: Bucharest; PhD: University of Pennsylvania
Now: University of Washington (Seattle)
My work is on semistructured data
Book: Data on the Web: From Relations to Semistructured Data and XML
Past/present projects:
– XML-QL = precursor of XQuery
– XMill = the XML compressor
– XML Toolkit
3
Motivation
Text databases
– Studied over the past 15 years
– Traditional client/server model
– Struggled with the lack of a standard text syntax
Recently, a new standard: XML
– Traditional client/server: in today’s DBMSs
– New applications: stream processing
This talk: processing streaming XML data
– My motivation: work on the XML Toolkit project
4
Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions
5
Background: Relational Databases
Structured, stored in tables
Schema separate from data
Queries: precise, refer to schema and data (SQL)

BOOKS:
ISBN        Title                     Year  Publisher
0201537710  Foundations of Databases  1995  AW
155860622X  Data on the Web           1999  MK

AUTHOR:
AID  Name       Country
44   Abiteboul  FR
06   Buneman    UK
62   Hull       USA
12   Suciu      USA
29   Vianu      USA

WROTE:
ISBN        AID
0201537710  44
0201537710  62
0201537710  29
155860622X  44
155860622X  06
155860622X  12

Hard to publish, easy to query precisely
6
Background: Text Databases
Unstructured, stored in documents
No schema, only data
Queries: imprecise, refer to data only (keywords)

Foundations of Databases, Abiteboul (FR), Hull (USA), Vianu (USA), Addison Wesley, 1995
Data on the Web, Abiteboul (FR), Buneman (UK), Suciu (USA), Morgan Kaufmann, 1999

Easy to publish, hard to query precisely
7
Background: XML Data
Semistructured
Schema and data are together: self-describing
Queries: precise, refer to schema and data (SQL)

Foundations… Abiteboul FR Hull USA Vianu USA Addison Wesley 1995 …

XML: Easier to publish, easy to query precisely
8
Background: XML Data
Data model = tree
(Figure: a tree rooted at bib with book and paper children; inner nodes include title, author, publisher, journal, name and country; leaf values include "Data on the Web", "Abiteboul", "FR", "Buneman", "UK" and "Addison Wesley".)
9
Background: XML Data
Querying with XPath (and XQuery)
This talk: XPath queries restricted to:
– tag, /, //, *, [ ]
– path="constant"
10
Background: XPath in One Slide
Constructs: tag, /, //, *, [ ]
/bib/book/author/name
/bib/book//name/*/zip
/bib/book[author/name="Abiteboul"]
/bib/book[year="1995" and author[name="Abiteboul" and country="FR"]]
Navigate partially known structure
Conjunctive queries a la SQL
This is precisely the "region algebra", e.g. use proximal nodes [Navarro&Baeza-Yates’97]
11
Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions
12
Main Application: XML Packet Routing
Selective Dissemination of Information [Altinel&Franklin’00, Chan et al.’02]
XML content routing [Snoeren et al.’01]
SOAP message routing in application servers
13
XML Packet Routing
(Figure; only repeated "value" labels survive in the transcript.)
14
(Figure: an input XML stream is matched against a set of XPath expressions and split into output XML streams.)
XPath expressions:
/bib/book/publisher="MK"
/bib/book[category="recent"]/title="Web"
/bib/book//address//*/zip="123"
/bib/book//address//*="Galaxy"
/bib/book/category="recent"
/bib/book/address="123"
/bib/book/address/field="567"
...
15
The XML Stream Processing Problem
Given:
– A set of XPath expressions
– An incoming stream of XML documents
Decide:
– For each document, which expressions it matches
Hard:
– Large number of XPath expressions, e.g. 10^3 to 10^6
– Streaming XML data at high throughput, e.g. 5 MB/s
Easy:
– Shallow XML data, e.g. depth = 20
– Short XPath expressions
16
The Approaches
Basic techniques
NFA plus optimizations:
– XFilter/YFilter [Altinel&Franklin’00]
– XTrie [Chan et al.’02]
DFA:
– XML Toolkit
Beyond the obvious
– Stream indexes (XML Toolkit)
– Stream views
17
Outline
Background
The XML stream processing problem
Basic XML processing with automata
Adapting automata to XML
Stream indexes
Conclusions
18
From XPath to NFA
/catalog/product[category="tools"][*/price = 200]/quantity
//price
(Figure: the NFA for these expressions, with transitions labelled catalog, product, category, "tools", *, price, 200, quantity, and a second branch labelled * and price.)
Extra processing is needed to combine branches (not in this talk)
19
Basic NFA Evaluation
(Figure: a large set of XPath expressions, e.g. /bib/book/publisher="MK", /bib/book[category="recent"]/title, /bib/book//address//*/zip="123", /bib/book//address//*="Galaxy", ..., is compiled into one NFA. SAX events drive the NFA, and a stack holds one set of current states per level, e.g. {3, 66, 102, 4534, ...}, {2, 3, 543, 43, 254}, {1, 55, 99, ...}.)
20
Basic NFA Evaluation
Properties:
– Space = linear in the total size of the XPath expressions
– Throughput = decreases linearly as the number of XPath expressions grows
Systems:
– XFilter [Altinel&Franklin’00], YFilter
– XTrie [Chan et al.’02]
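The evaluation scheme on the last two slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of XFilter, YFilter or XTrie: it handles linear paths built from tags, /, // and * (no predicates), and the names parse_path, sax_events and match_stream are made up for the example. Each NFA state is a pair (query, steps matched so far), and the stack keeps one set of active states per open element.

```python
import io
import re
import xml.etree.ElementTree as ET

def parse_path(path):
    """Split a linear XPath such as '/bib//name/*' into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", path)

def sax_events(xml_text):
    """Yield ('start', tag) and ('end', tag) events, SAX style."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    for kind, elem in ET.iterparse(source, events=("start", "end")):
        yield kind, elem.tag

def match_stream(paths, events):
    """Return the subset of `paths` matched by at least one element."""
    queries = [parse_path(p) for p in paths]
    # NFA state (q, i): the first i steps of query q are matched.
    stack = [{(q, 0) for q in range(len(queries))}]
    matched = set()
    for kind, tag in events:
        if kind == "end":
            stack.pop()            # restore the state set of the parent
            continue
        active = set()
        for q, i in stack[-1]:     # work grows with the number of active states
            steps = queries[q]
            if i == len(steps):
                continue           # accepting states have no outgoing edges
            axis, name = steps[i]
            if name in ("*", tag):
                active.add((q, i + 1))
                if i + 1 == len(steps):
                    matched.add(paths[q])
            if axis == "//":
                active.add((q, i))  # self-loop: the step may match deeper down
        stack.append(active)
    return matched

doc = "<bib><book><author><name>Suciu</name></author><title>T</title></book></bib>"
paths = ["/bib/book/author/name", "//name", "/bib/book/name", "/bib/*/title"]
print(sorted(match_stream(paths, sax_events(doc))))
```

The inner loop visits every active state on every start tag, which is where the linear slowdown in the number of expressions comes from.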
21
Basic DFA Evaluation
(Figure: the same set of XPath expressions is compiled into DFAs. SAX events drive the automaton, and the stack holds a single current state per level, e.g. 399, 552, 1.)
22
Basic DFA Evaluation
Properties:
– Throughput = constant !
– Space = GOOD QUESTION
System:
– XML Toolkit [University of Washington], http://xmltk.sourceforge.net
23
XMLTK: An XML Toolkit for Scalable XML Stream Processing
I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu
24
Motivation
Lots of data sits in large text files
– Ad hoc data formats
"Queried" with Unix command line tools
– grep, sort, tail, etc.
Would be nice to XML-ize it...
...but then the Unix command line tools won’t work any more.
25
Example
In the old Unix world…
Text file (papers.txt):
score  decision  paperID  title
6      accept    P054     "Theory of XML parsing"
3      reject    P021     "Experience with an XML optimizer"
7      accept    P069     "Towards a unified theory of data models"
...
Find the top ten rejected papers (in score order):
grep "reject" papers.txt | sort -n | tail -n 10
26
Example (cont’d)
In the new XML world…
6 accept P054 Theory of XML parsing 3 reject P021 Experience with an XML optimizer .....
… can’t use those tools anymore
27
Example (cont’d)
Doing it with the XML Toolkit:
xsort -c /submissions -e paper[decision/text()="reject"] -k score/text() papers.xml | xtail -c /submissions -e paper -n 10
Finds the top ten rejected <paper>s, in score order
28
Goals of the XML Toolkit
Simple, scalable tools for XML processing
Provides a service: there are people who need this
Provides a research platform: for XML stream processing
29
Outline
The tools
The XPath processing engine
Conclusions
30
The Tools
Current tools:
– xsort (the only tool covered in this talk)
– xagg
– xnest
– xflatten
– xdelete
– xpair
– xhead
– xtail
– file2xml
– xmill
May look like plenty, but the set is still incomplete...
31
XSort: Definition
General form:
xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)*
-c = the context, i.e. where to sort
-e = the item, i.e. what to sort
-k = the key, i.e. what to sort on
32
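XSort: Definition
(Figure: XSort takes a tree with three context nodes c whose items e1 ... e9 are in document order and returns the same contexts with the items reordered within each context, e.g. e4 e1 e3 e2, e6 e7 e5, e9 e8.)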
33
XSort Examples
Examples illustrated on data like this:
Elliotte Rusty Harold W. Scott Means XML in a Nutshell O'Reilly 2001 0-596-00058-8
Sylvain Devillers XML and XSLT Modeling for Multimedia Bitstream Manipulation. 2001 WWW Posters http://www10.org/cdrom/posters/1112.pdf db/conf/www/www2001p.html#Devillers01
.....
34
XSort: Examples
xsort -c /bib -e paper -k title/text()
Sorts the <paper>s by <title>
The other elements (e.g. the <book>s) are dropped from the output
Compare to…
xsort -c /bib -e * -k title/text()
xsort -c /bib -e paper -k title/text() -e book -k title/text()
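The behaviour of the first command can be mimicked in memory with Python's standard library. This is a sketch of the semantics only, not the xmltk implementation: the function name xsort_in_memory is made up, it supports a single item and key, items must be direct children of the context, and paths follow ElementTree's relative XPath subset (so the context /bib is written "." when bib is the root). It assumes, as the slide states, that elements not selected as items are dropped.

```python
import xml.etree.ElementTree as ET

def xsort_in_memory(root, context, item, key):
    """Sort the `item` children of every `context` element by the text of `key`."""
    for ctx in root.findall(context):
        items = ctx.findall(item)                   # what to sort
        items.sort(key=lambda e: e.findtext(key, default=""))
        for child in list(ctx):                     # drop everything else
            ctx.remove(child)
        ctx.extend(items)                           # re-attach in sorted order
    return root

bib = ET.fromstring(
    "<bib>"
    "<paper><title>XML and XSLT Modeling</title></paper>"
    "<book><title>XML in a Nutshell</title></book>"
    "<paper><title>Data on the Web</title></paper>"
    "</bib>"
)
xsort_in_memory(bib, ".", "paper", "title")
print(ET.tostring(bib, encoding="unicode"))
```

Unlike this sketch, the real tool works on streams and has to cope with contexts that do not fit in memory.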
35
XSort: Examples
xsort -c /bib -e paper/author -k lastName/text() -k firstName/text()
Sorts the <author>s by <lastName>, then by <firstName>
36
XSort: Examples
xsort -c /bib -e paper -e article -e book -e *
<paper>s first, then <article>s, then <book>s, then all the rest
37
XSort: Examples
xsort -c /bib/* -e author -e title -e year -e *
Normalize all entries: <author>s first, then <title>s, then <year>s, then all the other elements
xsort -c /bib/paper -e author -e * -c /bib/book -e title -e *
In <paper>s list the <author>s first; in <book>s list the <title> first; leave other entries unchanged
38
XSort: Implementation
Sorts one context at a time, copies the rest
For each context:
– Create a "global key" for each item
– Sort items with a two-pass, multiway merge sort
Quote from Databases 101 (news from the trenches):
– With disk blocks of 4KB and 128MB of main memory, one can sort files up to 4TB in two passes !
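The 4TB figure follows from the usual textbook estimate for a two-pass multiway merge sort, which the short calculation below reproduces. The function name is made up for the example, the units are taken as binary (KiB, MiB, TiB), and the estimate ignores the output buffer and any per-item overhead.

```python
def two_pass_sort_capacity(memory_bytes, block_bytes):
    """Largest file a two-pass multiway merge sort can handle (rough estimate)."""
    # Pass 1: read memory-sized chunks, sort each in memory, write it as a run.
    run_bytes = memory_bytes
    # Pass 2: merge all runs at once, with one block-sized buffer per run.
    max_runs = memory_bytes // block_bytes
    return max_runs * run_bytes

capacity = two_pass_sort_capacity(128 * 2**20, 4 * 2**10)
print(capacity // 2**40, "TiB")   # 32768 runs of 128 MiB each = 4 TiB
```

In other words the capacity is M^2 / B for memory size M and block size B, so doubling the memory quadruples the largest file that can be sorted in two passes.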
39
XSort: Performance
xsort -c /dblp -e * -k title/text()

Size (KB)    Xalan (sec)  Xsort (sec)
0.41         0.08         0.00
4.91         0.09         0.00
76.22        0.27         0.02
991.79       2.52         0.26
9671.79      27.45        2.85
100964.43    -            43.97
1009643.71   -            461.36

1GB ! 8 minutes
40
Outline
The tools
The XPath processing engine
Conclusions
41
The XPath Processor
Common to all tools is the following problem:
Given:
– Set of correlated XPath expressions
– Stream of SAX events
Decide:
– When the expressions are true
Output: variable events
42
Tree pattern: Example
xsort -c /bib -e paper -k publisher -e book -k title -e *
(Figure: the tree pattern for this command, with variables $r, $c, $e1, $e2, $e3, $k1, $k2 on nodes labelled bib, paper, book, *, publisher, title.)
Sample data: Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995 Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998
Variable events: $r $c $e2 $k2
43
The XPath Processor
How we did it:
All XPath expressions → one Deterministic Finite Automaton
– Restriction: no predicates yet (current work...)
Does this scale to many, many XPath expressions ?
– Yes, if we compute the DFA lazily (upcoming ICDT’2003 paper)
Evaluation time = parsing time
Can do even better with a Stream IndeX (next)
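A lazily computed DFA can be sketched as a memoized subset construction: each DFA state is a set of NFA states, and a transition is computed the first time the input needs it. The class below is an illustration under assumptions, not the xmltk engine: the name LazyDFA and its methods are made up, and only linear paths with tags, /, // and * are handled, in line with the "no predicates" restriction.

```python
import re

class LazyDFA:
    def __init__(self, paths):
        self.paths = list(paths)
        self.queries = [re.findall(r"(//|/)([^/]+)", p) for p in self.paths]
        # A DFA state is a frozenset of NFA states (query, steps matched).
        self.start = frozenset((q, 0) for q in range(len(self.queries)))
        self.table = {}      # (dfa_state, tag) -> dfa_state, filled on demand
        self.accepted = {}   # dfa_state -> set of matched paths, filled on demand

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:            # first visit: do the NFA work once
            nxt = set()
            for q, i in state:
                steps = self.queries[q]
                if i == len(steps):
                    continue
                axis, name = steps[i]
                if name in ("*", tag):
                    nxt.add((q, i + 1))
                if axis == "//":
                    nxt.add((q, i))
            self.table[key] = frozenset(nxt)
        return self.table[key]               # later visits: one dictionary lookup

    def matches(self, state):
        if state not in self.accepted:
            self.accepted[state] = {
                self.paths[q] for q, i in state if i == len(self.queries[q])
            }
        return self.accepted[state]

    def run(self, events):
        stack, matched = [self.start], set()
        for kind, tag in events:
            if kind == "start":
                stack.append(self.step(stack[-1], tag))
                matched |= self.matches(stack[-1])
            else:
                stack.pop()
        return matched

events = [("start", "bib"), ("start", "book"), ("start", "title"), ("end", "title"),
          ("end", "book"), ("start", "book"), ("start", "title"), ("end", "title"),
          ("end", "book"), ("end", "bib")]
dfa = LazyDFA(["/bib/book/title", "//title"])
print(sorted(dfa.run(events)), len(dfa.table))
```

Only the transitions that the data actually exercises are materialized, so the second book in the example adds no new table entries.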
44
Stream IndeX (SIX)
News: The parser is the main bottleneck in XPath stream processing !
Solution: "Index" the XML stream, parse only partially
Definition: The SIX = a table of (start, end) offsets
45
Stream IndeX (SIX): Construction
XML: Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995 Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998
SIX (columns: start, end): bib01490124 book3409023 publisher12423 author426879 author978...
46
Stream IndeX (SIX): Skip Parsing
XPath: /bib/paper/title
XML: Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995 Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998 ......
(Figure: subtrees that cannot match the XPath expression are skipped rather than parsed.)
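The idea of a table of (start, end) offsets, and of using it to jump over irrelevant subtrees, can be sketched as follows. This is a toy illustration, not the SIX format used by the XML Toolkit: the names TAG, build_six and select_text are made up, the index is a plain dictionary from start offset to end offset, the query is a simple child path such as /bib/paper/title, and the regular expression assumes well-formed input without comments, CDATA sections or ">" inside attribute values.

```python
import re

# Matches a start tag, an end tag or an empty-element tag.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def build_six(xml):
    """Map the start offset of every element to the offset just past its end."""
    six, open_starts = {}, []
    for m in TAG.finditer(xml):
        closing, _, empty = m.groups()
        if closing:
            six[open_starts.pop()] = m.end()
        elif empty:
            six[m.start()] = m.end()
        else:
            open_starts.append(m.start())
    return six

def select_text(xml, six, path):
    """Return (texts of elements matching `path`, number of tags tokenized)."""
    steps = path.strip("/").split("/")
    texts, depth, pos, tags_seen = [], 0, 0, 0
    while True:
        m = TAG.search(xml, pos)
        if m is None:
            return texts, tags_seen
        tags_seen += 1
        closing, name, empty = m.groups()
        if closing:
            depth -= 1
            pos = m.end()
        elif empty:
            pos = m.end()                    # no content, nothing to descend into
        elif name == steps[depth]:
            if depth + 1 == len(steps):      # full match: take the text, move on
                end = six[m.start()]
                texts.append(xml[m.end():xml.rfind("</", m.end(), end)])
                pos = end
            else:
                depth += 1                   # partial match: descend
                pos = m.end()
        else:
            pos = six[m.start()]             # irrelevant subtree: skip it unparsed

doc = ("<bib><book><title>T1</title><author>A</author></book>"
       "<paper><title>T2</title><year>1999</year></paper></bib>")
six = build_six(doc)
print(select_text(doc, six, "/bib/paper/title"))
```

In this example the whole book subtree is skipped after its start tag is read, so only half of the 14 tags in the document are tokenized.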
47
Stream IndeX (SIX) in XML Stream Processing
(Figure: the SIX stream of offset pairs travels alongside the XML stream, e.g. in DIME.)
The SIX stream is about 6% of the data stream
And can be made MUCH smaller
50
Outline
The tools
The XPath processing engine
Conclusions
51
Conclusions
The toolkit is already available:
– http://www.cs.washington.edu/homes/suciu/XMLTK
– http://xmltk.sourceforge.net
What it does so far, it does very well:
– Sorting, aggregation, nest/unnest
But it doesn’t do much yet:
– Restricted selections, no projections, no restructurings yet
– Volunteers welcome !
Can one process XML data without parsing it completely ?
– SIX