Streaming XPath / XQuery Evaluation and Course Wrap-Up Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems December.


Streaming XPath / XQuery Evaluation and Course Wrap-Up Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems December 2, 2008

Administrivia
• Recall that the final project is due, with a write-up and a 10-minute demo presentation, on Tuesday 12/16, 9-11 AM
• Also: course evaluations (at end)

XML: Its Roles
• Perhaps used as a superset of HTML for documents, but…
• Most successful as a transport format for sending data between systems
  • SOAP, WSDL, etc.
  • Data interchange formats like ebXML, MAGE-ML, …
• So why would we want to store it in a database to query it, when we could query over XML as it streams across the network?
  • (Note: not infinite streams, as in DSMSs, and it’s hierarchical)

Streaming XPaths and XQueries
• Suppose I give an XPath expression (which is a restricted form of regular path expression)
  • Can I match it against the parse tree of the data?
• An XQuery takes multiple XPaths in the FOR clause, and iterates over the elements matched by each XPath (binding the variable to each)
  • FOR $i in doc("abc")/xyz, $j in $i/def
• We can think of an XQuery as doing tree matching, which returns a tuple ($i, $j) for each combination of subtrees matching $i and $j
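As a rough illustration of the tuple-of-bindings view (not a streaming evaluator), the sketch below mimics the FOR clause above with Python's standard xml.etree.ElementTree over an in-memory tree. The sample document and the function name bind_tuples are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for doc("abc"); its root element is <xyz>.
DOC = "<xyz><def>a</def><def>b</def><other>c</other></xyz>"

def bind_tuples(xml_text):
    """Yield one ($i, $j) pair per match, like nested FOR clauses."""
    root = ET.fromstring(xml_text)
    # $i in doc("abc")/xyz : the document's root element, if it is named xyz
    xyz_matches = [root] if root.tag == "xyz" else []
    for i in xyz_matches:
        # $j in $i/def : each child element of $i named def
        for j in i.findall("def"):
            yield (i, j)

# Each tuple holds two subtrees; print a compact view of them.
for i, j in bind_tuples(DOC):
    print(i.tag, j.text)
```

Each yielded pair is a "tuple of trees": the downstream operators only see the bound subtrees, not the rest of the document.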

Where This Leads
• An XQuery can be broken into two operations:
  • A parsing / tree matching stage (FOR and also LET)
    • Finds matches to the variables
    • Returns a tuple of trees
  • A (mostly) pipelined SPJ / union / group by / order by engine (WHERE, ORDER BY, nesting in RETURN)
    • Like a regular relational engine extended with an XML tree datatype!
• The first engine to put these things together: Tukwila (Ives+ 2000, 2002)
• IBM DB2 was built upon a nearly identical model: TurboXPath (Josifovski 2004)

The Key: SAX (Simple API for XML)
• If we are to match XPaths in streaming fashion, we need a stream of data items
• The original parser model: DOM (Document Object Model)
  • Builds an entire object hierarchy in memory, which is traversable
  • Not incremental! (Until later versions)
• SAX: a series of event notifications
  • open-tag, close-tag, character data
• Idea: build a state machine (or similar mechanism) to match on the events!
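A minimal sketch of the state-machine idea, using Python's standard xml.sax module. It handles only child-axis paths such as /orders/order/amount, with no predicates or descendant steps. The class name PathMatcher, the helper match_path, and the sample document are hypothetical; this is not the algorithm of any particular system.

```python
import xml.sax

class PathMatcher(xml.sax.ContentHandler):
    """Match a child-axis-only path against SAX events."""

    def __init__(self, path):
        super().__init__()
        self.steps = path.strip("/").split("/")
        self.stack = [0]     # state = number of steps matched so far; -1 = dead
        self.buf = None      # text buffer, non-None only inside a full match
        self.matches = []

    def startElement(self, name, attrs):
        state = self.stack[-1]
        if 0 <= state < len(self.steps) and self.steps[state] == name:
            nxt = state + 1          # open-tag advances the automaton
        else:
            nxt = -1                 # nothing below this element can match
        self.stack.append(nxt)
        if nxt == len(self.steps):
            self.buf = []            # full match: start collecting text

    def characters(self, content):
        if self.buf is not None:
            self.buf.append(content)

    def endElement(self, name):
        # close-tag restores the parent's state by popping the stack
        if self.stack.pop() == len(self.steps):
            self.matches.append("".join(self.buf))
            self.buf = None

def match_path(xml_bytes, path):
    handler = PathMatcher(path)
    xml.sax.parseString(xml_bytes, handler)
    return handler.matches

DOC = (b"<orders><order><amount>10</amount></order>"
       b"<order><note>x</note><amount>25</amount></order>"
       b"<amount>99</amount></orders>")
print(match_path(DOC, "/orders/order/amount"))
```

The stack is what separates this from a plain finite automaton over tag names: it lets the matcher return to the parent's state at each close-tag, so only memory proportional to the document depth is needed.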

Different Options
• Many different “streaming XPath” matching algorithms were developed, with some differences:
  • What to match with (DFA, NFA, lazy DFA, PDA, proprietary format)
  • Complexity of the path language (regular path expressions, XPath), axes (downwards, upwards, sideways), internal references (IDREFs, foreign keys), recursive patterns
  • Which operations can be pushed into the operator (selection predicates, joins, position indices)
• We’ll consider TurboXPath (highlighted in red on the original slide; Tukwila’s x-scan was highlighted in green)

From XPath Patterns to Tuples and a Normal Query Plan

Query:
    for $c in doc("d1")//customer
    for $p in doc("d2")//profiles[cid/text() = $c/cid/text()]
    for $o in $c/order[date = '12/12/01']
    return <result>{$c/name} {$p/status} {$o/amount}</result>

Query plan (bottom-up):
• TurboXPath over “d1” produces tuples ($c/cid/text(), $c/name, $o/amount)
• TurboXPath over “d2” produces tuples ($p/status, $p/cid/text())
• Pipelined join (⋈) of the two tuple streams produces ($c/name, $p/status, $o/amount)
• XML tagger (add “result”)
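The slide does not say which pipelined join algorithm the plan uses. One common choice is a symmetric hash join, sketched below under that assumption. The tuple layouts follow the plan above, but the sample tuples and function name are made up for illustration, and alternating between the two inputs merely simulates tuples arriving from two streams.

```python
from collections import defaultdict
from itertools import zip_longest

_END = object()  # sentinel marking an exhausted input

def symmetric_hash_join(left, right, left_key, right_key):
    """Yield (left, right) pairs as soon as both halves of a match have arrived."""
    left_table, right_table = defaultdict(list), defaultdict(list)
    for l, r in zip_longest(left, right, fillvalue=_END):
        if l is not _END:
            k = left_key(l)
            left_table[k].append(l)        # build on the left side
            for m in right_table[k]:       # probe the right side
                yield (l, m)
        if r is not _END:
            k = right_key(r)
            right_table[k].append(r)       # build on the right side
            for m in left_table[k]:        # probe the left side
                yield (m, r)

# Hypothetical output of the two XPath matchers:
# ($c/cid/text(), $c/name, $o/amount) and ($p/status, $p/cid/text())
customers = [("c1", "Ann", "50"), ("c2", "Bob", "75")]
profiles = [("gold", "c2"), ("silver", "c1"), ("new", "c3")]

joined = symmetric_hash_join(customers, profiles,
                             left_key=lambda c: c[0],
                             right_key=lambda p: p[1])
# Project to ($c/name, $p/status, $o/amount) before tagging.
result = [(c[1], p[0], c[2]) for c, p in joined]
print(result)
```

Because each arriving tuple is inserted into its own table and immediately probes the other, neither input has to be read completely before the first result is produced.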

XPath Processing in TurboXPath

Performance Issues
• Predicate pushdown
  • Similar to “sargable predicates”: reduces the internal state that must be run through a cross-product to produce tuples
• “Smart” memory management
  • Want to deallocate space from partial pattern matches as early as possible
• Parser efficiency
  • We found that Xerces-C (validating C++ parser used by TurboXPath) was 10x slower than expat (non-validating C parser)
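The first two points can be pictured with a small sketch that uses Python's standard ElementTree.iterparse: the date predicate is tested as soon as each order's close tag is seen, and the order's subtree is released right afterwards. This is only an analogy for what a streaming matcher does with its buffers, not a reproduction of TurboXPath's memory management. The function name and sample data are assumptions.

```python
import io
import xml.etree.ElementTree as ET

def amounts_on(xml_bytes, date):
    """Return the amounts of orders whose date child equals the given date."""
    kept = []
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "order":
            # Predicate evaluated inside the scan: non-matching orders
            # never reach the tuple-producing stage.
            if elem.findtext("date") == date:
                kept.append(elem.findtext("amount"))
            # Release this order's children and text as early as possible.
            elem.clear()
    return kept

DOC = (b"<customer>"
       b"<order><date>12/12/01</date><amount>40</amount></order>"
       b"<order><date>01/05/02</date><amount>15</amount></order>"
       b"</customer>")
print(amounts_on(DOC, "12/12/01"))
```

Filtering before tuples are formed keeps the later cross-product small, and clearing each subtree once it can no longer contribute keeps memory roughly proportional to one order rather than the whole document.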

Wrapping up…
• This semester has been a whirlwind tour of many different aspects of the “data ecosystem”
  • Storage
  • Concurrency control
  • Query processing
  • Data distribution and streams
  • Heterogeneity, mappings, and reformulation (and the limitations thereof)
  • Many styles of data integration
  • XML processing
• I hope I’ve been able to convey some of what makes this field both relevant and, I think, cool…

Where There Is Room for More Work (Among Many Topics)
• Storage: rows versus columns
• Concurrency control
• Query processing
  • Is there a theory of adaptivity, and an optimal scheme?
• Data distribution, networks, and streams
  • How do we distribute to 10,000 nodes? What is the relationship between network communication and query processing?
• Data integration, better support for collaboration
  • How can we make it less human-intensive?
• “Lightweight databases”
• Probabilistic databases
• Visualization and interfaces
• Databases meets machine learning and info retrieval

A Sampler of Some of the Systems Work by (Some) Major DB Groups
• Washington: Mystiq – probabilistic databases; distrib. streams
• Stanford: Trio – probabilities and “lineage” meets databases
• Cornell: databases meets games; probabilistic databases
• Wisconsin: Cimple; database support for monitoring clusters
• MIT: Sensor query processing; signal processing; column stores
• Berkeley: Data management for sensors and networks
• Maryland: Querying data models; learning and probabilities meets databases
• Penn: Orchestra; data and workflow provenance; keyword querying with learned ranks over databases; lightweight data integration; networking meets databases; sensor integration

Thanks!!!
• I had a great time this semester. I hope you learned a lot and found it to be enjoyable
• I’m looking forward to seeing your projects!