Management of XML and Semistructured Data

Management of XML and Semistructured Data Lecture 2, Wednesday, 4/4/2001

Outline Semistructured data: simulation, bisimulation. XML: syntax, data model.

The Semistructured Data Model: Object Exchange Model (OEM) [Figure: edge-labeled graph rooted at Bib (&o1, a complex object). Edges paper, paper, book lead to &o12, &o24, &o29; further edges are labeled references, author, title, year, http, publisher, page, firstname, lastname, first, last; atomic objects hold values such as 1997, “Serge”, “Abiteboul”, “Victor”, “Vianu”, 122, 133.]

Syntax for Semistructured Data May omit oid’s: { paper: { author: “Abiteboul”, author: { firstname: “Victor”, lastname: “Vianu” }, title: “Regular path queries …”, page: { first: 122, last: 133 } } }

Set Semantics for Trees Want to say that {a, a, b} = {a, b}. Define equality for trees first, then for graphs. Definition: two trees t, t’ are equal, t = t’, if either (1) they are both atomic values with the same value, or (2) t = {t1, ..., tm}, t’ = {t1’, ..., tn’} and: ∀ i=1,...,m, ∃ j=1,...,n s.t. ti = tj’, and ∀ j=1,...,n, ∃ i=1,...,m s.t. ti = tj’
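The definition of tree equality can be sketched directly as a recursive function. This is only an illustration under an assumed encoding that is not part of the slides: an atomic value is any non-list Python value, and a set-valued tree {t1, ..., tm} is a Python list of subtrees. The name tree_equal is hypothetical.

```python
def tree_equal(t, u):
    """Equality of trees under set semantics.

    Assumed encoding: a list is a set-valued node, anything else is atomic.
    """
    t_is_set = isinstance(t, list)
    u_is_set = isinstance(u, list)
    if not t_is_set and not u_is_set:
        # Case 1: both atomic, compare the values.
        return t == u
    if t_is_set != u_is_set:
        # An atomic value is never equal to a set-valued node.
        return False
    # Case 2: every child of t has an equal child in u, and vice versa.
    every_t_in_u = all(any(tree_equal(ti, uj) for uj in u) for ti in t)
    every_u_in_t = all(any(tree_equal(ti, uj) for ti in t) for uj in u)
    return every_t_in_u and every_u_in_t


same = tree_equal(["a", "a", "b"], ["a", "b"])        # {a, a, b} vs {a, b}
nested = tree_equal(["a", ["b", "b"]], [["b"], "a"])  # duplicates at depth 2
different = tree_equal(["a"], ["a", "b"])
```

Duplicates and ordering are ignored at every level, which is what makes {a, a, b} and {a, b} equal. The recursion only terminates on trees, which is why graphs with cycles need a different definition.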

Set Semantics: Example [Figure: two trees with edges labeled a, b, c, d, e and leaf values 1, 2, 3 that are equal under set semantics.]

Set Semantics for Graphs Previous definition does not apply directly to graphs with cycles. Need to adapt it → bisimulation. First, we will define a simulation.

Graph Simulation Definition: given two edge-labeled graphs G1, G2, a simulation is a relation R between nodes such that: if (x1, x2) ∈ R and (x1, a, y1) ∈ G1, then there exists (x2, a, y2) ∈ G2 (same label) s.t. (y1, y2) ∈ R. [Figure: x1 in G1 is related by R to x2 in G2; their a-successors y1 and y2 are again related by R.] Note: if we insist that R be a function ⇒ graph homomorphism

Graph Bisimulation Definition: given two edge-labeled graphs G1, G2, a bisimulation is a relation R between nodes s.t. both R and R⁻¹ are simulations

Set Semantics for Semistructured Data Definition: two rooted graphs G1, G2 are equal if there exists a bisimulation R from G1 to G2 such that (root(G1), root(G2)) ∈ R. Notation: G1 ≈ G2. For trees, this is precisely our earlier definition.

Examples of Bisimilar Graphs [Figure: pairs of bisimilar graphs with edges labeled a and c.]

Examples of non-Bisimilar Graphs [Figure: two graphs G1, G2 related by R; visible edge labels b and c.] This is a simulation but not a bisimulation. Why? Notice: G1, G2 have the same sets of paths.

Examples of Simulation Simulation acts like “subset”: {a, b} ≤ {a, b, c}; {a, b:{c}} ≤ {d, a:{e,f}, b:{c,g}}. Question: if DB1 ≤ DB2 and DB2 ≤ DB1, then DB1 ≈ DB2 ? [Figure: the two pairs of trees drawn as edge-labeled graphs.]

Facts About a (Bi)Simulation The empty set is always a (bi)simulation. If R, R’ are (bi)simulations, so is R ∪ R’. Hence, there always exists a maximal (bi)simulation. Checking if DB1 = DB2: compute the maximal bisimulation R, then test (root(DB1), root(DB2)) ∈ R

Computing a (Bi)Simulation Computing the maximal (bi)simulation: start with R = nodes(G1) × nodes(G2). While there exists (x1, x2) ∈ R that violates the definition, remove (x1, x2) from R. This runs in polynomial time! Better: O((m+n) log(m+n)) for bisimulation, O(m n) for simulation. Compare to finding a graph homomorphism!
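The refinement loop above can be sketched as follows. This is the naive polynomial-time version, not the faster algorithms whose bounds are quoted on the slide. The graph encoding (a dict mapping each node to a list of (label, target) edges, with every node present as a key) and the function names are assumptions made for the illustration.

```python
def maximal_relation(g1, g2, both_ways=False):
    """Largest simulation from g1 to g2, or largest bisimulation if both_ways."""
    rel = {(x1, x2) for x1 in g1 for x2 in g2}  # start with all pairs

    def ok(x1, x2):
        # Every edge x1 -a-> y1 must be matched by some x2 -a-> y2 with (y1, y2) in rel.
        forward = all(
            any(b == a and (y1, y2) in rel for b, y2 in g2[x2])
            for a, y1 in g1[x1]
        )
        if not both_ways:
            return forward
        # For a bisimulation the symmetric condition must hold as well.
        backward = all(
            any(a == b and (y1, y2) in rel for a, y1 in g1[x1])
            for b, y2 in g2[x2]
        )
        return forward and backward

    changed = True
    while changed:  # remove violating pairs until nothing changes
        changed = False
        for pair in list(rel):
            if not ok(*pair):
                rel.discard(pair)
                changed = True
    return rel


def bisimilar(g1, root1, g2, root2):
    return (root1, root2) in maximal_relation(g1, g2, both_ways=True)


# Assumed example: G1 = a.(b + c), G2 = a.b + a.c. Same paths, different branching.
G1 = {"r": [("a", "n")], "n": [("b", "p"), ("c", "q")], "p": [], "q": []}
G2 = {"r": [("a", "n1"), ("a", "n2")], "n1": [("b", "p")], "n2": [("c", "q")],
      "p": [], "q": []}

g2_simulated_by_g1 = ("r", "r") in maximal_relation(G2, G1)
g1_simulated_by_g2 = ("r", "r") in maximal_relation(G1, G2)
equal_as_sets = bisimilar(G1, "r", G2, "r")
```

In this example G1 simulates G2 but not the other way around, so the two graphs are not bisimilar even though they have the same sets of paths. Because the loop only ever removes pairs, it also terminates on graphs with cycles.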

XML a W3C standard to complement HTML; origins: structured text, SGML; motivation: HTML describes presentation, XML describes content. http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, second edition, 10/2000)

From HTML to XML HTML describes the presentation

HTML <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999

XML XML describes the content <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography>
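Because the tags name the content rather than its layout, a program can pull out fields by name. A minimal sketch using Python's standard xml.etree.ElementTree; the document below is an assumption that spells out one book from the slide in full, with the elided parts left out.

```python
import xml.etree.ElementTree as ET

doc = """<bibliography>
  <book>
    <title> Foundations of Databases </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>"""

root = ET.fromstring(doc)

# Address the content by element name; no knowledge of fonts or line breaks needed.
books = [
    {
        "title": book.findtext("title").strip(),
        "authors": [a.text.strip() for a in book.findall("author")],
        "year": int(book.findtext("year")),
    }
    for book in root.findall("book")
]
```

Doing the same against the HTML version would mean guessing that italic text is a title and that the text before the line break is the author list.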

XML Terminology tags: book, title, author, …; start tag: <book>, end tag: </book>; elements: <book>…</book>, <author>…</author>; elements are nested; empty element: <red></red>, abbreviated <red/>; an XML document: single root element; well formed XML document: if it has matching tags
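Well-formedness can be checked mechanically by handing the text to a parser. A small sketch, assuming Python's standard xml.etree.ElementTree as the parser; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the parser accepts text as a single, properly nested root element."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


nested_ok = is_well_formed("<book><title>x</title></book>")
crossed_tags = is_well_formed("<book><title>x</book></title>")
unclosed = is_well_formed("<book>x<book>")
two_roots = is_well_formed("<a/><b/>")

# The long and abbreviated forms of an empty element parse to the same thing.
same_empty = ET.tostring(ET.fromstring("<red></red>")) == ET.tostring(ET.fromstring("<red/>"))
```

The parser rejects crossed or unclosed tags and a second root element, which covers both the matching-tags and the single-root conditions.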

More XML: Attributes <book price="55" currency="USD"> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> </book> attributes are alternative ways to represent data

More XML: Oids and References <person id="o555"> <name> Jane </name> </person> <person id="o456"> <name> Mary </name> <children idref="o123 o555"/> </person> <person id="o123" mother="o456"> <name> John </name> </person> oids and references in XML are just syntax
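Since oids and references are just attribute values, following a reference is left to the application. A sketch of one way to do it with Python's standard xml.etree.ElementTree; the enclosing people element and the helper name_of are assumptions added so the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

doc = """<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>"""

root = ET.fromstring(doc)

# Build the oid -> element index ourselves; the parser treats id as an ordinary attribute.
by_id = {p.get("id"): p for p in root.iter("person")}


def name_of(oid):
    return by_id[oid].findtext("name").strip()


mary = by_id["o456"]
children_of_mary = [name_of(ref) for ref in mary.find("children").get("idref").split()]
mother_of_john = name_of(by_id["o123"].get("mother"))
```

The references o123 and o555 resolve only because the code looks them up in its own index, which is the sense in which they are "just syntax".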

More XML: CDATA Section Syntax: <![CDATA[ .....any text here...]]> Example: <example> <![CDATA[ some text here </notAtag> <>]]> </example>
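A quick check of what a CDATA section does, sketched with Python's standard xml.etree.ElementTree: markup characters inside the section reach the application as plain text.

```python
import xml.etree.ElementTree as ET

example = ET.fromstring(
    "<example><![CDATA[ some text here </notAtag> <>]]></example>"
)

# The unmatched end tag and the stray angle brackets are not parsed as markup.
cdata_text = example.text
child_count = len(list(example))
```

The element ends up with no children and a text value that contains the literal characters </notAtag> and <>.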

More XML: Entity References Syntax: &entityname; Example: <element> this is less than &lt; </element> Some entities: &lt; is <, &gt; is >, &amp; is &, &apos; is ', &quot; is ", &#38; is a numeric reference to a Unicode char (here &)
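The predefined entities are expanded by the parser on the way in and have to be put back on the way out. A sketch using Python's standard library (xml.etree.ElementTree for parsing, xml.sax.saxutils.escape for writing):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing: the entity reference arrives as the character it names.
element = ET.fromstring("<element> this is less than &lt; </element>")
parsed_text = element.text

# A numeric character reference works the same way (38 is the code point of &).
ampersand = ET.fromstring("<e>&#38;</e>").text

# Writing: escape() turns the special characters back into entity references.
escaped = escape("a < b & c")
```

By default escape() handles only &, < and >; quotes inside attribute values need the extra mapping argument it accepts.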

More XML: Processing Instructions Syntax: <?target argument?> Example: <product> <name> Alarm Clock </name> <?ringBell 20?> <price> 19.99 </price> </product> What do they mean ?

More XML: Comments Syntax: <!-- .... comment text ... --> Yes, they are part of the data model !!!

XML Namespaces http://www.w3.org/TR/REC-xml-names (1/99) name ::= [prefix:]localpart <book xmlns:isbn="www.isbn-org.org/def"> <title> … </title> <number> 15 </number> <isbn:number> …. </isbn:number> </book>

XML Namespaces syntactic: <number>, <isbn:number>; semantic: provide URL for schema. <tag xmlns:mystyle="http://…"> … <mystyle:title> … </mystyle:title> <mystyle:number> … </mystyle:number> </tag> (the prefix mystyle is defined in the xmlns:mystyle attribute of the start tag)
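How a namespace-aware parser sees the two number elements can be sketched with Python's standard xml.etree.ElementTree. The namespace string comes from the slide; the element contents are placeholders, since the slide elides them.

```python
import xml.etree.ElementTree as ET

doc = """<book xmlns:isbn="www.isbn-org.org/def">
  <title>placeholder title</title>
  <number>15</number>
  <isbn:number>placeholder isbn</isbn:number>
</book>"""

root = ET.fromstring(doc)

# The prefix is replaced by the namespace name, written {name}localpart.
tags = [child.tag for child in root]

# Lookups supply their own prefix-to-name mapping; the prefix itself is arbitrary.
ns = {"i": "www.isbn-org.org/def"}
plain_number = root.find("number").text
isbn_number = root.find("i:number", ns).text
```

The two elements share the local part number but are distinct names, and the query finds the qualified one even though it uses a different prefix than the document.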

XML Data Model Several competing models: Document Object Model (DOM): http://www.w3.org/TR/2001/WD-DOM-Level-3-CMLS-20010209/ (2/2001); class hierarchy (node, element, attribute, …); objects have behavior; defines API to inspect/modify the document. XSL data model. Infoset. PSV (post schema validation). XML Query data model (next).

XML Query Data Model http://www.w3.org/TR/query-datamodel/ 2/2001 Describes XML as a tree, specialized nodes Uses a functional-style notation (think ML)

XML Query Data Model Node ::= DocNode | ElemNode | ValueNode | AttrNode | NSNode | PINode | CommentNode | InfoItemNode | RefNode

XML Query Data Model Element node (simplified definition): elemNode : (QNameValue, {AttrNode}, [ElemNode | ValueNode]) → ElemNode. QNameValue = means “a tag name”; {...} = means “set of ...”; [...] = means “list of ...”

XML Query Data Model Reads: “give me a tag, a set of attributes, a list of elements/values, and I will return an element”

XML Query Data Model Example book1 = elemNode(book, {price2, currency3}, [title4, author5, author6, author7, year8]) price2 = attrNode(…) /* next */ currency3 = attrNode(…) title4 = elemNode(title, string9) … <book price="55" currency="USD"> <title> Foundations … </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <year> 1995 </year> </book>

XML Query Data Model Attribute node: attrNode : (QNameValue, ValueNode) → AttrNode

XML Query Data Model Example price2 = attrNode(price, string10) string10 = valueNode(…) /* next */ currency3 = attrNode(currency, string11) string11 = valueNode(…) <book price="55" currency="USD"> <title> Foundations … </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <year> 1995 </year> </book>

XML Query Data Model Value node: ValueNode = StringValue | BoolValue | FloatValue …; stringValue : string → StringValue; boolValue : boolean → BoolValue; floatValue : float → FloatValue

XML Query Data Model Example price2 = attrNode(price, string10) string10 = valueNode(stringValue("55")) currency3 = attrNode(currency, string11) string11 = valueNode(stringValue("USD")) title4 = elemNode(title, string9) string9 = valueNode(stringValue("Foundations…")) <book price="55" currency="USD"> <title> Foundations … </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <year> 1995 </year> </book>
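The constructor-style notation of the preceding slides maps naturally onto immutable record types. The sketch below is a Python rendering, not the W3C draft's own notation: the class and function names are assumptions, attributes are held in a frozenset ("set of") and children in a tuple ("list of").

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class ValueNode:
    value: Union[str, bool, float]


@dataclass(frozen=True)
class AttrNode:
    name: str          # plays the role of QNameValue
    value: ValueNode


@dataclass(frozen=True)
class ElemNode:
    tag: str                                            # QNameValue
    attrs: FrozenSet[AttrNode]                          # {AttrNode}: unordered
    children: Tuple[Union["ElemNode", ValueNode], ...]  # [ElemNode | ValueNode]: ordered


def elem_node(tag, attrs, children):
    """Give me a tag, a set of attributes, a list of elements/values; get an element."""
    return ElemNode(tag, frozenset(attrs), tuple(children))


price2 = AttrNode("price", ValueNode("55"))
currency3 = AttrNode("currency", ValueNode("USD"))
title4 = elem_node("title", [], [ValueNode("Foundations …")])
author5 = elem_node("author", [], [ValueNode("Abiteboul")])
author6 = elem_node("author", [], [ValueNode("Hull")])
author7 = elem_node("author", [], [ValueNode("Vianu")])
year8 = elem_node("year", [], [ValueNode("1995")])

book1 = elem_node("book", [price2, currency3],
                  [title4, author5, author6, author7, year8])
```

With this encoding, swapping the two attributes yields an equal element, while swapping two children does not, which mirrors the set versus list distinction in the definition.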

XML vs. Semistructured Data both described best by a graph; both are schema-less, self-describing

Similarities and Differences <person id="o123"> <name> Alan </name> <age> 42 </age> <email> ab@com </email> </person> corresponds to { person: &o123 { name: “Alan”, age: 42, email: “ab@com” } }. <person father="o123"> … </person> corresponds to { person: { father: &o123 … } }. [Figure: the person object drawn as a graph with edges name, age, email to the values Alan, 42, ab@com, and a father edge pointing to it.] Similar on trees, different on graphs.

More Differences XML is ordered, ssd is not. XML can mix text and elements: <talk> Making Java easier to type and easier to type <speaker> Phil Wadler </speaker> </talk>. XML has lots of other stuff: entities, processing instructions, comments. Very important: these differences make XML data management harder.
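Mixed content is one place where the extra bookkeeping shows up. A sketch with Python's standard xml.etree.ElementTree, which stores text that precedes the first child in .text and text that follows a child in that child's .tail:

```python
import xml.etree.ElementTree as ET

talk = ET.fromstring(
    "<talk> Making Java easier to type and easier to type "
    "<speaker> Phil Wadler </speaker> </talk>"
)
speaker = talk.find("speaker")

# Text and elements are interleaved, so their relative order must be preserved.
before_child = talk.text
inside_child = speaker.text
after_child = speaker.tail
```

A set-of-labeled-edges model has no natural slot for the loose text around speaker, nor for the order in which the pieces occur.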

Summary of Data Models semistructured data, XML: data is self-describing, irregular; schema embedded with the data