XMLTK: An XML Toolkit for Scalable XML Stream Processing
I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu
Dan Suciu, XML Toolkit

Presentation transcript:

Slide 1: XMLTK: An XML Toolkit for Scalable XML Stream Processing
I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu

Slide 2: Motivation
Lots of data sits in large text files
- ad hoc data formats
"Queried" with Unix command line tools
- grep, sort, tail, etc.
It would be nice to XML-ize it...
...but then the Unix command line tools won't work any more.

Slide 3: Example
In the old Unix world...

Text file (papers.txt):
score  decision  paperID  title
6      accept    P054     "Theory of XML parsing"
3      reject    P021     "Experience with an XML optimizer"
7      accept    P069     "Towards a unified theory of data models"
...

Find the top ten rejected papers (in score order):
grep "reject" papers.txt | sort -n | tail -n 10

Slide 4: Example (cont'd)
In the new XML world, each record becomes a <paper> element under <submissions>, with children such as <score> and <decision>:
6 accept P054 Theory of XML parsing
3 reject P021 Experience with an XML optimizer
...
... and we can't use those tools anymore.

Slide 5: Example (cont'd)
Doing it with the XML Toolkit:
xsort -c /submissions -e paper[decision/text()="reject"] -k score/text() papers.xml | xtail -c /submissions -e paper -n 10
Finds the top ten rejected <paper>s, in order.
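To make the intended result of the xsort | xtail pipeline concrete, here is a minimal in-memory sketch in Python. It is not how the toolkit works (the toolkit processes streams); it only reproduces the expected output on a tiny document. The <title> element name, the numeric ordering of scores, and the fourth sample row are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny sample document. The <submissions>, <paper>, <score>, and <decision>
# names come from the xsort command; <title> and the last row are assumed.
SAMPLE = """<submissions>
  <paper><score>6</score><decision>accept</decision><title>Theory of XML parsing</title></paper>
  <paper><score>3</score><decision>reject</decision><title>Experience with an XML optimizer</title></paper>
  <paper><score>7</score><decision>accept</decision><title>Towards a unified theory of data models</title></paper>
  <paper><score>5</score><decision>reject</decision><title>A hypothetical extra row</title></paper>
</submissions>"""


def top_rejected(xml_text, n=10):
    """Return the last n rejected <paper> elements in ascending score order."""
    root = ET.fromstring(xml_text)  # <submissions> plays the role of the -c context
    rejected = [p for p in root.findall("paper")
                if p.findtext("decision") == "reject"]
    # Assumption: scores compare numerically.
    rejected.sort(key=lambda p: int(p.findtext("score")))
    return rejected[-n:]  # same role as xtail -n


titles = [p.findtext("title") for p in top_rejected(SAMPLE)]
```

Unlike this sketch, which loads the whole document into memory, the toolkit is designed to work on files too large for that.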

Slide 6: Goals of the XML Toolkit
Simple, scalable tools for XML processing
Provides a service: there are people who need this
Provides a research platform for XML stream processing

Slide 7: Outline
The tools
The XPath processing engine
Conclusions

Slide 8: The Tools
Current tools: xsort, xagg, xnest, xflatten, xdelete, xpair, xhead, xtail, file2xml, xmill
This talk covers only xsort
May look like plenty, but the set is actually still incomplete...

Slide 9: XSort: Definition
General form:
xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)*
-c = the context, i.e. where to sort
-e = the item, i.e. what to sort
-k = the key, i.e. what to sort on

Slide 10: XSort: Definition
Diagram: XSort reorders the items inside each context c
Input:  c: e1 e2 e3 e4 | c: e5 e6 e7 | c: e8 e9
Output: c: e4 e1 e3 e2 | c: e6 e7 e5 | c: e9 e8

Slide 11: XSort Examples
Examples illustrated on data like this:
- Elliotte Rusty Harold, W. Scott Means: XML in a Nutshell. O'Reilly
- Sylvain Devillers: XML and XSLT Modeling for Multimedia Bitstream Manipulation. WWW Posters. db/conf/www/www2001p.html#Devillers

Slide 12: XSort: Examples
xsort -c /bib -e paper -k title/text()
Sorts the <paper>s by <title>. Elements that are not <paper>s are dropped from the output.
Compare to...
xsort -c /bib -e * -k title/text()
xsort -c /bib -e paper -k title/text() -e book -k title/text()

Slide 13: XSort: Examples
xsort -c /bib -e paper/author -k lastName/text() -k firstName/text()
Sorts the <author>s by <lastName>, then by <firstName>

Slide 14: XSort: Examples
xsort -c /bib -e paper -e article -e book -e *
<paper>s first, then <article>s, then <book>s, then all the rest
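The ordering rules in the examples above (items grouped in -e order, sorted by key within a group, unmatched items dropped) can be sketched in a few lines of Python. This is an illustration of the semantics as the slides describe them, not the toolkit's code. The function name xsort_context, the use of plain child tag names instead of XPath expressions, first-match-wins for overlapping -e patterns, and document order for ties are all assumptions.

```python
import xml.etree.ElementTree as ET


def xsort_context(context, item_specs):
    """Reorder the children of `context` in place.

    item_specs is a list of (tag_or_star, key_child_or_None) pairs given in
    -e order. Both parts are simplified stand-ins for XPath expressions.
    """
    def sort_key(elem):
        for rank, (tag, key_child) in enumerate(item_specs):
            if tag in ("*", elem.tag):  # assumed: the first matching -e wins
                key = elem.findtext(key_child, default="") if key_child else ""
                return (rank, key)
        return None  # matched no -e pattern: the element is dropped

    keyed = [(sort_key(e), e) for e in context]
    kept = [(k, e) for k, e in keyed if k is not None]
    kept.sort(key=lambda pair: pair[0])  # stable: ties keep document order
    context[:] = [e for _, e in kept]


bib = ET.fromstring(
    "<bib><book><title>B</title></book><paper><title>Z</title></paper>"
    "<misc/><paper><title>A</title></paper></bib>")
xsort_context(bib, [("paper", "title"), ("book", "title")])
order = [(e.tag, e.findtext("title")) for e in bib]
```

Adding a final ("*", None) spec keeps the <misc/> element and places it after the papers and books, mirroring the trailing -e * in the command.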

Slide 15: XSort: Examples
xsort -c /bib/* -e author -e title -e year -e *
Normalizes all entries: <author>s first, then <title>s, then <year>s, then all the other elements
xsort -c /bib/paper -e author -e * -c /bib/book -e title -e *
In <paper>s list the <author>s first; in <book>s list the <title> first; leave other entries unchanged

Slide 16: XSort: Implementation
Sorts one context at a time, copies the rest
For each context:
- Create a "global key" for each item
- Sort items with a two-pass, multiway merge sort
Quote from Databases 101 (news from the trenches):
- with disk blocks of 4KB and 128MB of main memory, one can sort files up to 4TB in two passes!
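The 4TB figure follows from the standard two-pass argument: pass 1 writes sorted runs that each fit in memory, and pass 2 merges as many runs as there are block-sized buffers in memory. The Python sketch below checks that arithmetic and shows the two passes on an in-memory list. It assumes binary units (1KB = 2^10 bytes) and ignores the output buffer. The function names are hypothetical, and the lists stand in for on-disk runs.

```python
import heapq


def two_pass_capacity(memory_bytes, block_bytes):
    """Largest input (in bytes) sortable in two passes.

    Pass 1 produces runs of memory_bytes each. Pass 2 can merge one run per
    in-memory block buffer, i.e. memory_bytes // block_bytes runs.
    """
    mergeable_runs = memory_bytes // block_bytes
    return memory_bytes * mergeable_runs


def two_pass_sort(items, run_size, key=None):
    """Sort `items` with run generation followed by one multiway merge."""
    # Pass 1: cut the input into runs of at most run_size items and sort each.
    runs = [sorted(items[i:i + run_size], key=key)
            for i in range(0, len(items), run_size)]
    # Pass 2: merge all runs at once; heapq.merge reads the runs lazily.
    return list(heapq.merge(*runs, key=key))


capacity = two_pass_capacity(128 * 2**20, 4 * 2**10)  # 128MB memory, 4KB blocks
```

With 128MB of memory and 4KB blocks, 32768 runs of 128MB each can be merged, which gives 2^42 bytes, i.e. 4TB.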

Slide 17: XSort: Performance
xsort -c /dblp -e * -k title/text()
Table columns: Size (KB), Xalan (sec), Xsort (sec)
1GB: 8 minutes!

Slide 18: Outline
The tools
The XPath processing engine
Conclusions

Slide 19: The XPath Processor
Common to all tools is the following problem:
Given: a set of correlated XPath expressions and a stream of SAX events
Decide: when the expressions are true → variable events

Slide 20: Tree pattern: Example
xsort -c /bib -e paper -k publisher -e book -k title -e *
Tree pattern (variable: node label):
$r: root
  $c: bib
    $e1: paper
      $k1: publisher
    $e2: book
      $k2: title
    $e3: *
Input data:
- Addison-Wesley; Serge Abiteboul, Rick Hull, Victor Vianu; Foundations of Databases; 1995
- Freeman; Jeffrey D. Ullman; Principles of Database and Knowledge Base Systems; 1998
Variable events: $r, $c, $e2, $k2

Slide 21: The XPath Processor
How we did it:
All XPath expressions → Deterministic Finite Automaton (DFA)
- Restriction: no predicates yet (current work...)
Does this scale to many, many XPath expressions?
- Yes, if we compute the DFA lazily (upcoming ICDT'2003 paper)
Evaluation time = parsing time
Can do even better with a Stream IndeX (next)
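The idea of running many path expressions through one automaton whose states are built on demand can be sketched with Python's standard SAX interface. This is a simplified illustration, not the toolkit's engine or the algorithm of the ICDT paper: it handles only child steps and the * wildcard (no //, no predicates), and the class name LazyPathDFA is hypothetical. Each DFA state is a set of (expression, position) pairs, and a transition is computed the first time it is needed and then cached.

```python
import xml.sax


class LazyPathDFA(xml.sax.ContentHandler):
    """Match several child-only paths against a SAX stream in one pass."""

    def __init__(self, paths):
        super().__init__()
        self.paths = [p.strip("/").split("/") for p in paths]
        start = frozenset((i, 0) for i in range(len(self.paths)))
        self.stack = [start]  # one DFA state per open element
        self.table = {}       # (state, tag) -> state, filled lazily
        self.matches = []     # (path index, tag) in document order

    def _step(self, state, tag):
        key = (state, tag)
        if key not in self.table:  # compute this transition only once
            self.table[key] = frozenset(
                (i, pos + 1) for i, pos in state
                if pos < len(self.paths[i]) and self.paths[i][pos] in (tag, "*"))
        return self.table[key]

    def startElement(self, name, attrs):
        state = self._step(self.stack[-1], name)
        self.stack.append(state)
        for i, pos in sorted(state):
            if pos == len(self.paths[i]):  # path i is fully matched here
                self.matches.append((i, name))

    def endElement(self, name):
        self.stack.pop()


DOC = (b"<bib><book><publisher>Addison-Wesley</publisher>"
       b"<title>Foundations of Databases</title></book>"
       b"<book><publisher>Freeman</publisher>"
       b"<title>Principles of Database and Knowledge Base Systems</title></book></bib>")
handler = LazyPathDFA(["/bib", "/bib/paper", "/bib/book", "/bib/book/title", "/bib/*"])
xml.sax.parseString(DOC, handler)
```

After the run, handler.table holds only the four transitions the data actually exercised, even though the document contains two books. Building transitions only for the tag sequences that occur is what keeps the automaton small when the number of expressions is large.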

Slide 22: Stream IndeX (SIX)
News: the parser is the main bottleneck in XPath stream processing!
Solution: "index" the XML stream, parse only partially
Definition: the SIX = a table of (start, end) offsets

Slide 23: Stream IndeX (SIX): Construction
XML data:
- Addison-Wesley; Serge Abiteboul, Rick Hull, Victor Vianu; Foundations of Databases; 1995
- Freeman; Jeffrey D. Ullman; Principles of Database and Knowledge Base Systems; 1998
SIX: a table with columns start and end, one row per element (bib, book, publisher, author, author, ...)

Slide 24: Stream IndeX (SIX): Skip Parsing
XPath: /bib/paper/title
XML data:
- Addison-Wesley; Serge Abiteboul, Rick Hull, Victor Vianu; Foundations of Databases; 1995
- Freeman; Jeffrey D. Ullman; Principles of Database and Knowledge Base Systems
...
Skip parsing
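A small Python sketch may help show why a table of (start, end) offsets lets a path query avoid most of the parsing work. It is a toy under stated assumptions, not the toolkit's SIX format: the input is well-formed XML without attributes, comments, or CDATA, tags are found with a regular expression, and the function names build_six and select are hypothetical. The <paper> element in the sample is also made up for the demonstration.

```python
import re

# Assumption: tags carry no attributes, so this pattern finds every tag.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>")


def build_six(xml):
    """Return {start offset: end offset} for every element in xml."""
    six, stack = {}, []
    for m in TAG.finditer(xml):
        if m.group(1):  # end tag closes the most recently opened element
            six[stack.pop()] = m.end()
        else:
            stack.append(m.start())
    return six


def select(xml, path, six):
    """Return (texts of matching elements, number of tags actually read)."""
    steps = path.strip("/").split("/")
    results, tags_read, depth, pos = [], 0, 0, 0
    while True:
        m = TAG.search(xml, pos)
        if m is None:
            break
        tags_read += 1
        if m.group(1):  # end tag of an element we descended into
            depth -= 1
            pos = m.end()
        elif m.group(2) != steps[depth]:
            pos = six[m.start()]  # skip the whole subtree without reading it
        elif depth == len(steps) - 1:
            end = six[m.start()]
            results.append(xml[m.end():end - len(m.group(2)) - 3])
            pos = end
        else:
            depth += 1
            pos = m.end()
    return results, tags_read


XML = ("<bib><book><publisher>Addison-Wesley</publisher>"
       "<author>Serge Abiteboul</author><title>Foundations of Databases</title></book>"
       "<paper><title>T1</title><author>A</author></paper></bib>")
six = build_six(XML)
titles, tags_read = select(XML, "/bib/paper/title", six)
```

On this sample the query reads 7 of the 16 tags, because the <book> subtree and the unrelated <author> are skipped by offset. Building the index still requires one full scan, so the saving comes when the index is produced once and reused.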

Slide 25: Stream IndeX (SIX) in XML Stream Processing
The SIX stream is about 6% of the data stream, and can be made MUCH smaller
Diagram labels: SIX, XML (e.g. DIME)


Slide 28: Outline
The tools
The XPath processing engine
Conclusions

Slide 29: Conclusions
The toolkit is already available
What it does so far it does very well:
- Sorting, aggregation, nest/unnest
But it doesn't do too much:
- Restricted selections, no projections, no restructurings yet
- Volunteers welcome!
Can one process XML data without parsing it completely?
- SIX