Database Management Systems, R. Ramakrishnan. Introduction to Semistructured Data and XML, Chapter 27.


Slide 1: Introduction to Semistructured Data and XML (Chapter 27)

Slide 2: How the Web is Today
• HTML documents: often generated by applications; consumed by humans only; easy access across platforms and across organizations
• No application interoperability: HTML is not understood by applications; database technology is client-server

Slide 3: New Universal Data Exchange Format: XML
A recommendation from the W3C
• XML = data
• XML generated by applications
• XML consumed by applications
• Easy access: across platforms, organizations

Slide 4: Paradigm Shift on the Web
• From documents (HTML) to data (XML)
• From information retrieval to data management
• For databases, also a paradigm shift:
  from the relational model to semistructured data
  from data processing to data/query translation
  from storage to transport

Slide 5: HTML
• HTML is widely used for formatting and structuring Web documents.
• Designed to describe how a Web browser should arrange text, images and push-buttons on a page.
• Easy to learn, but does not convey the structure and meaning of the data in Web pages.
• Fixed tag set.
[Figure: an HTML fragment ("Welcome to the XML course", "Introduction") annotated with its parts: opening tag, text (PCDATA), closing tag, “bachelor” tag, attribute name, attribute value]

Slide 6: Semistructured Data
1. Information integration: an important new application that motivates what follows.
2. Semistructured data: a new data model designed to cope with the problems of information integration.
3. XML (Extensible Markup Language): a new Web standard that is essentially semistructured data.
4. XQuery: an emerging standard query language for XML data.

Slide 7: Information Integration
Problem: related data exists in many places. The sources talk about the same things, but differ in model, schema and conventions (e.g., terminology).
Example: in the real world, every bar has its own database.
• Some may have relations like beer-price; others have a Microsoft Word file from which the menu is printed.
• Some keep the phone numbers of manufacturers but not their addresses.
• Some distinguish beers and ales; others do not.

Slide 8: The Semistructured Data Model
Object Exchange Model (OEM)
[Figure: a bibliography drawn as a labeled graph. Complex objects (&o1, &o12, &o24, &o29, &o43, &96, &243, &206, &25) are connected by edges labeled Bib, paper, book, references, author, title, year, http, publisher, page, firstname, lastname, first, last. Atomic objects hold values such as “Serge”, “Abiteboul”, 1997, “Victor”, “Vianu”.]

Slide 9: Characteristics of Semistructured Data
• Missing or additional attributes
• Multiple attributes
• Different types in different objects
• Heterogeneous collections
Self-describing, irregular data, no a priori structure

Slide 10: Comparison with Relational Data
Semistructured form:
{ row: { name: “John”, phone: 3634 },
  row: { name: “Sue”, phone: 6343 },
  row: { name: “Dick”, phone: 6363 } }
Relational form:
name     phone
“John”   3634
“Sue”    6343
“Dick”   6363
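A minimal Python sketch of the same comparison: each tuple of the relation becomes a self-describing row element. The root tag `persons` and the helper name `rows_to_xml` are assumptions made for this illustration, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# The three tuples of the (name, phone) relation shown on the slide.
rows = [("John", 3634), ("Sue", 6343), ("Dick", 6363)]

def rows_to_xml(rows):
    """Encode each tuple as a <row> element with one child per column."""
    root = ET.Element("persons")  # hypothetical root tag
    for name, phone in rows:
        row = ET.SubElement(root, "row")
        ET.SubElement(row, "name").text = name
        ET.SubElement(row, "phone").text = str(phone)  # XML values are text
    return root

persons = rows_to_xml(rows)
xml_text = ET.tostring(persons, encoding="unicode")
print(xml_text)
```

Unlike the table, every row carries its own column names, so a row with a missing or extra field would still be representable.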

Slide 11: XML (Extensible Markup Language)
• A W3C standard to complement HTML
• Origins: structured text (SGML); large-scale electronic publishing; data exchange on the web
• Motivation: HTML describes presentation; XML describes content

Slide 12: From HTML to XML
HTML describes the presentation

Slide 13: HTML
Bibliography
  Foundations of Databases. Abiteboul, Hull, Vianu. Addison Wesley, 1995
  Data on the Web. Abiteboul, Buneman, Suciu. Morgan Kaufmann, 1999

Slide 14: XML
<book>
  <title> Foundations… </title>
  <author> Abiteboul </author>
  <author> Hull </author>
  <author> Vianu </author>
  <publisher> Addison Wesley </publisher>
  <year> 1995 </year>
</book>
…
XML describes the content

Slide 15: Why are we DB’ers interested?
• It’s data. That’s us.
• Database issues:
  How are we going to model XML? (graphs)
  How are we going to query XML? (XQuery)
  How are we going to store XML? (in a relational database? object-oriented? native?)
  How are we going to process XML efficiently? (many interesting research questions!)

Slide 16: XML Terminology
• Tags: book, title, author, …
  start tag: <book>, end tag: </book>
• Elements: <book>…</book>, <author>…</author>
  elements can be nested
  empty element: an element with no content (can be abbreviated to a single tag ending in "/>")
• XML document: has a single root element
• Well-formed XML document: has matching tags
• Valid XML document: conforms to a schema

Slide 17: Well-Formed XML
1. Declaration = <? … ?>. The normal declaration is <?xml version="1.0" standalone="yes"?>. “Standalone” means that there is no DTD specified.
2. The root tag surrounds the entire balance of the document. An opening tag <X> is balanced by a closing tag </X>, as in HTML.
3. Any balanced structure of tags is OK. There is also an option of tags that do not require balance, as with some tags in HTML.
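One way to try out these rules is to hand candidate documents to a parser, which rejects anything that is not well-formed. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample documents and the helper name `is_well_formed` are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Balanced tags under a single root: well-formed.
balanced = "<bib><book><title>Data on the Web</title></book></bib>"
# <title> is never closed before </book>: not well-formed.
unbalanced = "<bib><book><title>Data on the Web</book></bib>"
# Two top-level elements, so there is no single root: not well-formed.
two_roots = "<book/><book/>"

for doc in (balanced, unbalanced, two_roots):
    print(is_well_formed(doc), doc)
```

Note that this checks well-formedness only; validity against a DTD is a separate question that ElementTree does not answer.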

Slide 18: XML: An Example
A book list with three entries:
Author            Title                           Year
Richard Feynman   The Character of Physical Law   1980
R.K. Narayan      Waiting for the Mahatma         1981
R.K. Narayan      The English Teacher             1980

Slide 19: XML: Elements
• XML is case and space sensitive
• Element opening and closing tag names must be identical
• Opening tags: "<" + element name + ">"
• Closing tags: "</" + element name + ">"
• Empty elements have no data and no closing tag: they begin with a "<" and end with a "/>"
[Figure: an example element annotated with open tag, element name, attribute, attribute value, data, closing tag]

Slide 20: XML: Attributes
• Attributes provide additional information for element tags.
• There can be zero or more attributes in every element; each one has the form attribute_name='attribute_value'
  - By convention there is no space between the name and the "='" (XML itself permits optional whitespace around "=")
  - Attribute values must be surrounded by " or ' characters
• Multiple attributes are separated by white space (one or more spaces or tabs).
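A short Python illustration of how a parser exposes attributes. The element and its attribute names (`price`, `currency`) are invented for this sketch; only the quoting and separation rules come from the slide.

```python
import xml.etree.ElementTree as ET

# Two attributes separated by white space; either quote character is accepted.
book = ET.fromstring(
    """<book price="55" currency='USD'><title>Data on the Web</title></book>"""
)

print(book.attrib)            # all attributes as a dict of strings
print(book.get("price"))      # a single attribute value (always text)
print(book.get("language"))   # None: the attribute is absent
```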

Slide 21: Elements
The segment of an XML document between an opening tag and the corresponding closing tag is called an element.
[Figure: an XML fragment with the data "Malcolm Atchison" and "(215)", annotated to show a span that is an element, a span that is not an element, and an element that is a sub-element of another]

Slide 22: XML: Data and Comments
• XML data is any information between an opening and a closing tag
• XML data must not contain a literal '<' or '&' character; write the entities &lt; and &amp; instead
• Comments: <!-- comment -->

Slide 23: XML text
XML has only one “basic” type: text. It is bounded by tags, e.g. <title> The Big Sleep </title>; a value that looks like a number is still text.
XML text is called PCDATA (for parsed character data). Its characters are Unicode; every XML processor must accept the UTF-8 and UTF-16 encodings.

Slide 24: XML: Nesting & Hierarchy
• XML tags can be nested in a tree hierarchy
• XML documents can have only one root tag
• Between an opening and a closing tag you can insert:
  1. Data
  2. More elements
  3. A combination of data and elements
[Example with the data "Some Text" and "More"]
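The third case, a combination of data and elements, can be inspected with a parser. In the Python sketch below the tags `a` and `b` and the sample text are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Mixed content: text, then a nested element, then more text.
elem = ET.fromstring("<a>Some text <b>bold</b> more text</a>")

child = elem[0]
print(repr(elem.text))    # data before the first child
print(child.tag, repr(child.text))
print(repr(child.tail))   # data that follows the child inside <a>

# All character data of the subtree, in document order.
full_text = "".join(elem.itertext())
print(full_text)
```

ElementTree stores the text after a child in that child's `tail`, which is why the trailing data is read from `child` rather than from `elem`.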

Slide 25: Representing relational DBs: Two ways
projects(title, budget, managedBy)
employees(name, ssn, age)

Slide 26: Project and Employee relations in XML
Projects and employees are intermixed.
[XML listing: project "Pattern recognition" (managed by Joe), employee Joe, employee Sandra, project "Auto guided vehicle" (managed by Sandra), …]

Slide 27: Project and Employee relations in XML (cont’d)
Employees follow projects.
[XML listing: first the projects "Pattern recognition" (managed by Joe) and "Auto guided vehicles" (managed by Sandra), …; then the employees Joe, Sandra, …]

Slide 28: More XML: Oids and References
[XML listing: three elements, for Jane, Mary and John, that carry object identifiers and refer to one another]
Oids and references in XML are just syntax.
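Because oids and references are just syntax, the application has to resolve them itself. The Python sketch below assumes one possible markup (`id` attributes and a `children` element with a space-separated `idref` list); the tag names, attribute names and oid values are hypothetical, and only the names Jane, Mary and John come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: ids and idrefs are plain attribute strings.
doc = ET.fromstring("""
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>""")

# Index every person by its id so that references can be followed.
by_id = {p.get("id"): p for p in doc.iter("person")}

def children_of(person):
    """Follow the idref list of a person and return the referenced names."""
    ref = person.find("children")
    if ref is None:
        return []
    return [by_id[oid].findtext("name") for oid in ref.get("idref").split()]

mary = by_id["o456"]
john = by_id["o123"]
print(children_of(mary))
print(by_id[john.get("mother")].findtext("name"))
```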

Slide 29: XML Data Model (Graph)

Slide 30: Document Type Definitions (DTDs)
• Sort of like a schema, but not really.
• Inherited from the SGML DTD standard
• A BNF grammar establishing constraints on element structure and content
• Definitions of entities

Slide 31: DTD: An Example

Slide 32: DTD: !ELEMENT
• !ELEMENT declares an element name and what child elements it should have
• Content types:
  Other elements
  #PCDATA (parsed character data)
  EMPTY (no content)
  ANY (no checking inside this structure)
  A regular expression
[Figure: an !ELEMENT declaration annotated with its Name and Children parts]

Slide 33: DTD: !ELEMENT (Contd.)
• A regular expression has the following structure:
  exp1, exp2, exp3, …, expk: a list (sequence) of regular expressions
  exp*: an optional expression with zero or more occurrences
  exp+: an expression with one or more occurrences
  exp1 | exp2 | … | expk: a disjunction of expressions

Slide 34: DTD: !ATTLIST
<!ATTLIST Orange location CDATA #REQUIRED color CDATA 'orange'>
• !ATTLIST defines a list of attributes for an element
• Attributes can be of different types, can be required or not required, and can have default values.
[Figure: the declaration annotated with its Element, Attribute, Type and Flag parts]

Slide 35: DTD: Well-Formed and Valid
[Figure: three example documents, labeled Well-Formed and Valid, Not Well-Formed, and Well-Formed but Invalid]

Slide 36: Example: An Address Book
Data of one entry: MacNiel, John; Dr. John MacNiel; 1234 Huron Street; Rome, OH; (321)
Annotations:
• Exactly one name
• At most one greeting
• As many address lines as needed (in order)
• Mixed telephones and faxes
• As many as needed

Slide 37: Specifying the structure
• name: to specify a name element
• greet?: to specify an optional (0 or 1) greet element
• name,greet?: to specify a name followed by an optional greet

Slide 38: Specifying the structure (cont)
• addr*: to specify 0 or more address lines
• tel | fax: a tel or a fax element
• (tel | fax)*: 0 or more repeats of tel or fax
• …*: 0 or more … elements

Slide 39: A DTD for the address book
<!DOCTYPE addressbook [
  <!ELEMENT person (name, greet?, address*, (fax | tel)*, …*)>
]>

Slide 40: DTD for the example relational DB
<!DOCTYPE db [... ]>

Slide 41: Summary of XML regular expressions
• Each element name is a tag.
• Its components are the tags that appear nested within, in the order specified.
  A          the tag A occurs
  e1,e2      the expression e1 followed by e2
  e*         0 or more occurrences of e
  e?         optional: 0 or 1 occurrences
  e+         1 or more occurrences
  e1 | e2    either e1 or e2
  (e)        grouping
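Since DTD content models are regular expressions over tag names, a rough check can be sketched by translating a model into an ordinary Python regex and matching it against the sequence of child tags. This is an illustrative simplification, not how a validating parser is implemented: the helper names are hypothetical, and #PCDATA, EMPTY and ANY are not handled. The model string follows the address book slides, using only the element names that survive in the transcript.

```python
import re
import xml.etree.ElementTree as ET

NAME = re.compile(r"[A-Za-z_][\w.-]*")

def content_model_to_regex(model):
    """Translate a DTD content model into a regex over 'tag;' tokens."""
    # Each element name becomes one token; ?, *, +, | and () keep their meaning.
    tokens = NAME.sub(lambda m: "(?:" + re.escape(m.group()) + ";)", model)
    # The sequence operator ',' is plain concatenation in a regex.
    return re.compile(tokens.replace(",", "").replace(" ", ""))

def conforms(element, model):
    """Check whether the children of element match the content model."""
    child_tags = "".join(child.tag + ";" for child in element)
    return content_model_to_regex(model).fullmatch(child_tags) is not None

PERSON_MODEL = "name, greet?, addr*, (tel | fax)*"

ok = ET.fromstring(
    "<person><name/><greet/><addr/><addr/><tel/><fax/><tel/></person>"
)
bad_order = ET.fromstring("<person><greet/><name/></person>")
two_greets = ET.fromstring("<person><name/><greet/><greet/></person>")

for person in (ok, bad_order, two_greets):
    print(conforms(person, PERSON_MODEL))
```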

Slide 42: XML Querying
Path expressions:
• Bib.paper
• Bib.book.publisher
• Bib.paper.author.lastname
Given an OEM instance, the value of a path expression p is a set of objects.

Slide 43: Path Expressions
Examples:
[Figure: the OEM bibliography graph DB, rooted at &o1 under the edge Bib, with objects &o12, &o24, &o29, &o43 to &o52, &o70, &o71, &96, &243, &206, &25 and the edge labels paper, book, references, author, title, year, http, publisher, page, firstname, lastname, first, last]
Bib.paper = {&o12, &o29}
Bib.book.publisher = {&o51}
Bib.paper.author.lastname = {&o71, &206}
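A minimal Python sketch of path-expression evaluation over an OEM-style graph. The graph is an assumption: it reuses the oids that appear in the three results, but the intermediate author nodes (`&o52`, `&243`) and the artificial root `DB` are guesses, since the flattened figure does not show which edges connect which objects.

```python
# Each object maps to its outgoing (label, target) edges.
graph = {
    "DB":   [("Bib", "&o1")],
    "&o1":  [("paper", "&o12"), ("book", "&o24"), ("paper", "&o29")],
    "&o12": [("author", "&o52")],        # assumed intermediate node
    "&o24": [("publisher", "&o51")],
    "&o29": [("author", "&243")],        # assumed intermediate node
    "&o52": [("lastname", "&o71")],
    "&243": [("lastname", "&206")],
}

def eval_path(graph, root, path):
    """Return the set of objects reached from root along the dotted path."""
    current = {root}
    for label in path.split("."):
        current = {
            target
            for node in current
            for (edge_label, target) in graph.get(node, [])
            if edge_label == label
        }
    return current

for p in ("Bib.paper", "Bib.book.publisher", "Bib.paper.author.lastname"):
    print(p, "=", sorted(eval_path(graph, "DB", p)))
```

The value is a set, so an object reached along two different routes is reported once.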

Slide 44: XQuery
Emerging standard for querying XML documents.
Basic form: FOR … WHERE … RETURN …;
• Sets of elements are described by paths, consisting of:
  1. A URL, if necessary.
  2. Element names forming a path in the semistructured data graph, e.g., //BAR/NAME = “start at any BAR node and go to a NAME child.”
  3. An ending condition of the form [condition]

Slide 45: XQuery Overview
• FOR-LET-WHERE-ORDERBY-RETURN = FLWOR
[Figure: FOR/LET clauses → list of tuples → WHERE clause → list of tuples → ORDERBY/RETURN clause → instance of the XQuery data model]

Slide 46: XQuery
• FOR $x IN expr: binds $x to each value in the list expr
• LET $x := expr: binds $x to the entire list expr
  Useful for common subexpressions and for aggregations

Slide 47: FOR vs. LET
FOR $x IN document("bib.xml")/bib/book
RETURN $x
Returns: ... (FOR binds $x to each book in turn, so RETURN is evaluated once per book)

LET $x := document("bib.xml")/bib/book
RETURN $x
Returns: ... (LET binds $x to the whole list of books, so RETURN is evaluated once)
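The difference can be mimicked in Python with ElementTree standing in for an XQuery engine. The sample document, the `result` wrapper tag and the helper `wrap` are assumptions made so that the number of RETURN evaluations becomes visible.

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring(
    "<bib><book><title>abc</title></book><book><title>def</title></book></bib>"
)
books = bib.findall("book")

def wrap(items):
    """Stand-in for the RETURN clause: put the bound value in one <result>."""
    result = ET.Element("result")  # hypothetical wrapper tag
    result.extend(items)
    return result

# FOR: $x is bound to each book in turn, so RETURN runs once per book.
for_results = [wrap([x]) for x in books]

# LET: $x is bound to the whole list once, so RETURN runs exactly once.
let_results = [wrap(books)]

print(len(for_results), "results from FOR")
print(len(let_results), "result from LET, holding", len(let_results[0]), "books")
```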

Slide 48: XQuery
Find all book titles published after 1995:
FOR $x IN document("bib.xml")/bib/book
WHERE $x/year > 1995
RETURN $x/title
Result: abc def ghi
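For readers without an XQuery engine at hand, the same selection can be approximated in Python. The inline document is invented sample data (the titles follow the slide's abc/def/ghi placeholders, the years are made up), and `titles_after` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring("""<bib>
  <book><title>abc</title><year>1996</year></book>
  <book><title>old</title><year>1990</year></book>
  <book><title>def</title><year>1999</year></book>
  <book><title>ghi</title><year>2001</year></book>
</bib>""")

def titles_after(bib, year):
    """FOR each book, WHERE year > given year, RETURN the title."""
    return [
        book.findtext("title")
        for book in bib.findall("book")          # FOR $x IN /bib/book
        if int(book.findtext("year")) > year     # WHERE $x/year > year
    ]

print(titles_after(bib, 1995))
```

The explicit `int(...)` conversion is needed because XML values are text, as noted on the XML text slide.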

Slide 49: XQuery
For each author of a book by Morgan Kaufmann, list all books s/he published:
FOR $a IN distinct(document("bib.xml")/bib/book[publisher="Morgan Kaufmann"]/author)
RETURN $a,
       FOR $t IN document("bib.xml")/bib/book[author=$a]/title
       RETURN $t
distinct = a function that eliminates duplicates

Slide 50: XQuery
Result:
Jones: abc, def
Smith: ghi

Slide 51: XQuery
count = an (aggregate) function that returns the number of elements
FOR $p IN distinct(document("bib.xml")//publisher)
LET $b := document("bib.xml")//book[publisher = $p]
WHERE count($b) > 100
RETURN $p
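The grouping that LET and count perform here can be sketched in Python with a counter per publisher. The three-book document and the helper name are illustrative assumptions, and the threshold is lowered from 100 so that the tiny sample produces a result.

```python
from collections import Counter
import xml.etree.ElementTree as ET

bib = ET.fromstring("""<bib>
  <book><title>abc</title><publisher>Morgan Kaufmann</publisher></book>
  <book><title>def</title><publisher>Morgan Kaufmann</publisher></book>
  <book><title>ghi</title><publisher>Addison Wesley</publisher></book>
</bib>""")

def publishers_with_more_than(bib, threshold):
    """Return publishers whose number of books exceeds the threshold."""
    # One counter entry per distinct publisher plays the role of
    # FOR $p IN distinct(...) LET $b := ...; count($b).
    counts = Counter(book.findtext("publisher") for book in bib.iter("book"))
    return sorted(p for p, n in counts.items() if n > threshold)

print(publishers_with_more_than(bib, 1))
```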

Slide 52: XQuery
Find books whose price is larger than average:
LET $a := avg(document("bib.xml")/bib/book/price)
FOR $b IN document("bib.xml")/bib/book
WHERE $b/price > $a
RETURN $b
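The same two-step shape, an aggregate bound once and then a per-book filter, looks as follows in Python. The prices and titles are made-up sample data and `above_average` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring("""<bib>
  <book><title>abc</title><price>30</price></book>
  <book><title>def</title><price>50</price></book>
  <book><title>ghi</title><price>100</price></book>
</bib>""")

def above_average(bib):
    """Return the titles of books priced above the average price."""
    books = bib.findall("book")
    # LET $a := avg(...): computed once over the whole list.
    average = sum(float(b.findtext("price")) for b in books) / len(books)
    # FOR $b ... WHERE $b/price > $a: evaluated per book.
    return [b.findtext("title") for b in books if float(b.findtext("price")) > average]

print(above_average(bib))
```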

Slide 53: Examples for XQuery queries
• FOR $x IN doc(…)//employee[employeeSalary gt 70000]/employeeName
  RETURN $x/firstName, $x/lastName
• FOR $x IN doc(…)…
  WHERE $x/employeeSalary gt …
  RETURN $x/employeeName/firstName, $x/employeeName/lastName
• FOR $x IN doc(…)/project[projectNumber = 5]/projectWorker,
      $y IN doc(…)…
  WHERE $x/hours gt 20.0 AND $y/ssn = $x/ssn
  RETURN $y/employeeName/firstName, $y/employeeName/lastName, $x/hours