Management of XML and Semistructured Data


Management of XML and Semistructured Data Lecture 9: Schemas Friday, April 27, 2001

Outline: schemas for semistructured data (unordered); schemas for XML.

Schemas for Relational vs. SS Data. Schemas in relational data: defined before the data; strictly enforced. Schemas in SS data: defined after the data; may not be enforced (some data doesn’t have a schema).

When the Schema is Created. Created by the user: before or after the data; does not need to be unique. Extracted from the data: unheard of in relational databases. Extracted from the query: like type inference in programming languages. Different schema formalisms for each task!

Schemas for Unordered Data: OEM; cycles; the schema itself will be a graph.

Schemas: An Example. Some database: [Figure: OEM graph rooted at &r1 with company objects &c1, &c2 (edges name, address, url; values “Widget”, “Trenton”, “Gadget”, “www.gp.fr”, “Paris”), person objects &p1, &p2, &p3 (edges name, position, phone; values “Smith”, “Manager”, “Jones”, “5552121”, “Dupont”, “Sales”), connected by employee, manages, c.e.o. and works-for edges, and description objects &a1 to &a7 (edges procurement, salesrep, contact, task, eval, 1997, 1998; values “on target”, “below target”).]

Graph Schemas. [Figure: an upper bound schema drawn as a graph, with nodes including Root, Employee, string and Any, and edges labeled person, company, works-for, managed-by, c.e.o. | employee, name | address | url, name | phone | position, description and *.]

The Two Questions to Ask. Conformance: does the data conform to this schema? Classification: if so, which objects belong to which classes?

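Graph Simulation. Definition: given two edge-labeled graphs G1, G2, a simulation is a relation R between their nodes such that if (x1, x2) ∈ R and (x1, a, y1) ∈ G1, then there exists (x2, a, y2) ∈ G2 (same label a) such that (y1, y2) ∈ R. [Figure: x1 in G1 related by R to x2 in G2; an a-edge from x1 to y1 is matched by an a-edge from x2 to y2, with y1 related by R to y2.]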

Using Simulation. Data graph D, schema S. Conformance: find the maximal simulation R from D to S; notation: D ⪯ S. Classification: check whether (x, c) ∈ R; notation: x ⪯ c.
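The conformance test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: graphs are represented as sets of (source, label, target) triples, the function name `maximal_simulation` is made up for this sketch, and atomic values are treated as plain leaf nodes (their types are not checked). It starts from all node pairs and removes pairs that violate the definition until nothing changes.

```python
def maximal_simulation(g1, g2):
    """Return the largest simulation from graph g1 to graph g2.

    Each graph is a set of (source, label, target) edges (an assumed encoding).
    """
    nodes1 = {n for (x, _, y) in g1 for n in (x, y)}
    nodes2 = {n for (x, _, y) in g2 for n in (x, y)}
    # Start with every pair and discard the pairs that break the definition.
    rel = {(x1, x2) for x1 in nodes1 for x2 in nodes2}
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            for (s1, a, y1) in g1:
                if s1 != x1:
                    continue
                # Some edge (x2, a, y2) in g2 must have (y1, y2) still related.
                if not any(s2 == x2 and b == a and (y1, y2) in rel
                           for (s2, b, y2) in g2):
                    rel.discard((x1, x2))
                    changed = True
                    break
    return rel


# Hypothetical toy data graph D and schema graph S.
D = {("r", "person", "p1"), ("p1", "name", "s1")}
S = {("Root", "person", "Person"),
     ("Person", "name", "String"),
     ("Person", "phone", "String")}
R = maximal_simulation(D, S)
print(("r", "Root") in R)     # conformance: are the two roots related?
print(("p1", "Person") in R)  # classification of node p1
```

The loop is the naive refinement; it is polynomial but makes no attempt to reach the best known running time.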

Example. [Figure: the example database graph (rooted at &r1) shown side by side with the graph schema (classes Root, Company, Employee, string and Any).]

Formally. A graph schema S is a graph such that: nodes are called classes; edges are labeled with unary predicates, p(x). Examples: person = “x = person”; name | address = “x = name ∨ x = address”; * = “true”; int = “x ∈ int”; ¬name = “x ≠ name”.
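As a rough illustration of edges labeled with unary predicates, the sketch below encodes each predicate as a Python function on labels and checks conformance by the same pair-refinement idea as simulation. All names here (`label_is`, `label_not`, `any_label`, `conforms`) are hypothetical, and the types of atomic values such as string or int are not checked in this simplified version.

```python
def label_is(*names):
    """Predicate for `name | address | ...`: true for any listed label."""
    return lambda label: label in names


def label_not(*names):
    """Predicate for a negated label set: true for any label not listed."""
    return lambda label: label not in names


def any_label(label):
    """Predicate for `*`: true for every label."""
    return True


def conforms(data, data_root, schema, schema_root):
    """Check whether data_root is simulated by schema_root.

    data:   set of (node, label, node) edges
    schema: set of (class, predicate, class) edges
    """
    nodes = {n for (x, _, y) in data for n in (x, y)} | {data_root}
    classes = {n for (c, _, d) in schema for n in (c, d)} | {schema_root}
    rel = {(x, c) for x in nodes for c in classes}
    changed = True
    while changed:
        changed = False
        for (x, c) in list(rel):
            for (src, label, y) in data:
                if src != x:
                    continue
                # A schema edge matches when its predicate accepts the label.
                if not any(cs == c and pred(label) and (y, ct) in rel
                           for (cs, pred, ct) in schema):
                    rel.discard((x, c))
                    changed = True
                    break
    return (data_root, schema_root) in rel


# A schema in the spirit of S2: persons with name, age and a free-form
# description that may not reuse the labels name and age.
schema = {
    ("Root", label_is("person"), "Person"),
    ("Person", label_is("name"), "String"),
    ("Person", label_is("age"), "Int"),
    ("Person", label_is("description"), "Any"),
    ("Any", label_not("name", "age"), "Any"),
}
data = {("r", "person", "p"), ("p", "name", "n"),
        ("p", "description", "d"), ("d", "hobby", "h")}
print(conforms(data, "r", schema, "Root"))
```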

Examples of Graph Schemas (What Do They Mean?)
S1 = there are persons; each can have several names and several ages, of types string and int respectively.
S2 = same as S1, plus persons can have description edges, under which we can have any edge with any label except name and age (predicate ¬name ∧ ¬age).
S3 = describes a hierarchy of parts and subparts, arbitrarily deep. Leaf subparts (or parts) may have names and prices.
S4 = describes ANY database: the “universal schema”, ST, whose edge label is *.
S5 = persons may have name and age of types string and int, and under description there may be any structure.
[Figure: graphs of S1 to S5, with edges labeled person, name, age, description, part, subpart, price and *, and leaf classes string and int.]

Example of D ⪯ S. [Figure: a database D with several person objects carrying name and age edges (e.g. Smith, 55), and a schema S with a person class whose name and age edges lead to string and int.]

Any database conforms to ST, the “universal schema” (*)!

Schemas in SS Data vs. Relational Data. Relational data: each data instance has exactly one schema. Semistructured data: one data instance can have several schemas.

The Classification Problem. [Figure: a database D with person objects (edges name, phone; values John, 1234, Mary) and a schema with person classes over name, phone and email edges leading to string.] The schema is nondeterministic: it creates ambiguous classifications.

The Classification Problem. Definition: a schema S is deterministic if for every class c and every label a, there is at most one outgoing edge labeled a from c. Fact: if S is deterministic and D is a tree, then each node is uniquely classified. (When D is not a tree, this is not true.)
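The determinism condition is easy to test mechanically. The following is a minimal sketch, assuming a schema with plain labels on its edges (not general predicates) stored as a set of (class, label, class) triples; the function name `is_deterministic` is chosen only for this example.

```python
from collections import Counter


def is_deterministic(schema):
    """True if no class has two outgoing edges carrying the same label."""
    counts = Counter((cls, label) for (cls, label, _) in schema)
    return all(n <= 1 for n in counts.values())


# Hypothetical schema with two different person edges leaving Root.
ambiguous = {("Root", "person", "P1"), ("Root", "person", "P2"),
             ("P1", "name", "String"), ("P2", "name", "String")}
print(is_deterministic(ambiguous))
```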

Deterministic Upper Bound Schemas. Given a schema S, we can always construct a deterministic approximation, Sd. [Figure: a nondeterministic schema S with two person edges and its deterministic approximation Sd with a single person edge; the classes carry name, email and phone edges to string.] In general, Sd is obtained by the powerset construction, which is expensive.
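A sketch of the powerset construction mentioned above, under the assumption that schema edges carry plain labels and the schema is a set of (class, label, class) triples. Each class of the result is a set of original classes; `determinize` is a name introduced here for illustration. In the worst case the number of such sets is exponential in the number of classes, which is why the construction is expensive.

```python
def determinize(schema, root):
    """Powerset construction: classes of the result are sets of classes."""
    start = frozenset([root])
    det_edges = set()
    seen = {start}
    todo = [start]
    while todo:
        state = todo.pop()
        labels = {a for (c, a, _) in schema if c in state}
        for a in labels:
            # Merge all targets reachable by label a into a single class.
            target = frozenset(t for (c, b, t) in schema
                               if c in state and b == a)
            det_edges.add((state, a, target))
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return det_edges, start


# Hypothetical nondeterministic schema with two kinds of person.
S = {("Root", "person", "P1"), ("Root", "person", "P2"),
     ("P1", "name", "Str"), ("P1", "email", "Str"),
     ("P2", "name", "Str"), ("P2", "phone", "Str")}
edges, start = determinize(S, "Root")
for (src, label, dst) in sorted(edges, key=lambda e: (sorted(e[0]), e[1])):
    print(sorted(src), label, sorted(dst))
```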

Lower Bound Schemas Introduced (under a different name and formalism) in: Nestorov, Abiteboul, Motwani, Extracting Schema from Semistructured Data, SIGMOD 98 Goal: extract some “schema” from the data We will see later why these are “lower bound schemas”

Lower Bound Schemas. Schema = datalog program with a special form. Classes = predicates.

  Company(x) :- link(x,"c.e.o.",y), Employee(y), link(x,"name",z), String(z), link(x,"address",u), String(u)
  Employee(x) :- link(x,"works-for",y), Company(y), link(x,"managed-by",z), Employee(z), link(x,"name",u), String(u), link(x,"address",v), String(v)
  Root(x) :- link(x,"company",y), Company(y), link(x,"person",z), Employee(z)

Maximal fixpoint, rather than minimal fixpoint! (next)

Datalog Minimal Fixpoint. Standard datalog semantics. Transform the program to:

  Company(x) ⇔ ∃y.∃z.∃u.(link(x,"c.e.o.",y), Employee(y), link(x,"name",z), String(z), link(x,"address",u), String(u))
  Employee(x) ⇔ ∃y.∃z.∃u.∃v.(link(x,"works-for",y), Company(y), link(x,"managed-by",z), Employee(z), link(x,"name",u), String(u), link(x,"address",v), String(v))
  Root(x) ⇔ ∃y.∃z.(link(x,"company",y), Company(y), link(x,"person",z), Employee(z))

Compute the minimal model. What is it?

Datalog Minimal Fixpoint (continued). Answer: the minimal model of this program is the empty set! Every rule for Company, Employee and Root depends on another class predicate, so starting from nothing, nothing is ever derived. Compute the maximal model instead.
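The difference between the two fixpoints can be seen on a toy instance. The sketch below is an illustration only: the rule encoding `RULES`, the helper names `step` and `fixpoint`, and the tiny `links` database are all assumptions made for this example. Iterating the rules from the empty interpretation gives the minimal fixpoint; iterating from "every node is in every class" gives the maximal one.

```python
# Each class lists the (edge label, class of target) pairs it requires.
RULES = {
    "Company": [("c.e.o.", "Employee"), ("name", "String"),
                ("address", "String")],
    "Employee": [("works-for", "Company"), ("managed-by", "Employee"),
                 ("name", "String"), ("address", "String")],
    "Root": [("company", "Company"), ("person", "Employee")],
}


def step(links, strings, interp):
    """Apply every rule once to the current interpretation."""
    nodes = {n for (x, _, y) in links for n in (x, y)}

    def in_class(node, cls):
        return node in strings if cls == "String" else node in interp[cls]

    return {
        head: {x for x in nodes
               if all(any(src == x and a == label and in_class(y, cls)
                          for (src, a, y) in links)
                      for (label, cls) in body)}
        for head, body in RULES.items()
    }


def fixpoint(links, strings, start):
    """Iterate step() from the given interpretation until it is stable."""
    interp = start
    while True:
        nxt = step(links, strings, interp)
        if nxt == interp:
            return interp
        interp = nxt


# Hypothetical database: one company c, one employee e, root r.
links = {("r", "company", "c"), ("r", "person", "e"),
         ("c", "c.e.o.", "e"), ("c", "name", "s1"), ("c", "address", "s2"),
         ("e", "works-for", "c"), ("e", "managed-by", "e"),
         ("e", "name", "s3"), ("e", "address", "s4")}
strings = {"s1", "s2", "s3", "s4"}
nodes = {n for (x, _, y) in links for n in (x, y)}

least = fixpoint(links, strings, {c: set() for c in RULES})
greatest = fixpoint(links, strings, {c: set(nodes) for c in RULES})
print(least)
print(greatest)
```

Because the rules never negate a class predicate, one application of them is monotone, so the downward iteration from the full interpretation settles on the largest fixpoint.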

Lower-Bound Schemas. Equivalent representation of the schema: [Figure: the datalog program drawn as a graph with classes Root, Company, Employee and string, and edges labeled person, company, works-for, managed-by, c.e.o., address and name.]

Simulation Strikes Back. [Figure: the example database graph (rooted at &r1) shown side by side with the lower bound schema graph (classes Root, Company, Employee and string).] A model of program P is precisely a simulation (and vice versa).

Summary: Lower vs. Upper Bound Schemas. Upper bound schemas tell us what edges are allowed. Lower bound schemas tell us what edges are required.

Schema Extraction (From Data). Problem statement: given a data instance D, find the “most specific” schema S for D. In practice S is too large and needs to be relaxed [Nestorov, Abiteboul, Motwani 1998].

Schema Extraction: Sample Data. Example database D = [Figure: root &r with employee edges to &p1 through &p8 and a company edge to &c; manages and managedby edges between employees; worksfor edges from the employees to &c.]

Lower Bound Schema Extraction. [NAM’98] approach: start with the schema given by the data (S = D), where each node = a predicate = a class; compute the maximal fixpoint (PTIME); declare two classes equal iff they are equal sets. E.g. p4 = {&p1,&p4,&p6} and p6 = {&p1,&p4,&p6}, hence p4 = p6. Equivalently, p = p’ iff p(&p’) and p’(&p).

  . . .
  p4(x) :- link(x, manages, y), p5(y), link(x, worksfor, z), c(z)
  p5(x) :- link(x, managedby, y), p4(y), link(x, worksfor, z), c(z)

Lower Bound Schema Extraction. Result S = [Figure: classes Root {&r}, Bosses {&p1,&p4,&p6}, Regulars {&p2,&p3,&p5,&p7,&p8} and Company {&c}, connected by edges labeled employee, company, manages, managedby and worksfor.]

Lower Bound Schema Extraction. Equivalently: compute the maximal simulation D ⪯ D; this can be done in time O(m^2). Two nodes x, x’ are equivalent iff x ⪯ x’ and x’ ⪯ x. The schema consists of the equivalence classes. Remark: one could use the bisimulation relation instead (perhaps even better).
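The extraction step can be sketched as follows, assuming the data is a set of (source, label, target) triples. The function names and the small employee database are made up for illustration, and the naive refinement loop used here does not try to meet the running time quoted on the slide. Nodes that simulate each other in both directions end up in the same class.

```python
def maximal_self_simulation(edges):
    """Largest simulation from the graph to itself (naive refinement)."""
    nodes = {n for (x, _, y) in edges for n in (x, y)}
    rel = {(x, y) for x in nodes for y in nodes}
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            # Every edge leaving x1 must be matched by an edge leaving x2.
            ok = all(any(s2 == x2 and b == a and (y1, y2) in rel
                         for (s2, b, y2) in edges)
                     for (s1, a, y1) in edges if s1 == x1)
            if not ok:
                rel.discard((x1, x2))
                changed = True
    return rel


def extract_classes(edges):
    """Group nodes that simulate each other in both directions."""
    rel = maximal_self_simulation(edges)
    nodes = {n for (x, _, y) in edges for n in (x, y)}
    return {frozenset(y for y in nodes if (x, y) in rel and (y, x) in rel)
            for x in nodes}


# Hypothetical miniature of the sample data: one boss, two regulars.
D = {("r", "employee", "p1"), ("r", "employee", "p2"),
     ("r", "employee", "p3"), ("r", "company", "c"),
     ("p1", "manages", "p2"), ("p1", "manages", "p3"),
     ("p2", "managedby", "p1"), ("p3", "managedby", "p1"),
     ("p1", "worksfor", "c"), ("p2", "worksfor", "c"),
     ("p3", "worksfor", "c")}
for cls in sorted(extract_classes(D), key=sorted):
    print(sorted(cls))
```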

Upper Bound Schema Extraction. The extracted lower bound schema S is also an upper bound schema! But it is nondeterministic. Convert S → Sd. Alternatively, convert directly D → Dd = Sd. These are data guides [McHugh and Widom].

Upper Bound Schema Extraction. Result Sd = [Figure: classes Root {&r}, Employees {&p1,&p2,&p3,&p4,&p5,&p6,&p7,&p8}, Bosses {&p1,&p4,&p6}, Regulars {&p2,&p3,&p5,&p7,&p8} and Company {&c}, connected by edges labeled employee, company, manages, managedby and worksfor.]

XML Document Type Definitions. Part of the original XML specification. An XML document may have a DTD. Terminology for XML: well-formed, if tags are correctly closed; valid, if it has a DTD and conforms to it. Validation is useful in data exchange.
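The well-formed half of this terminology can be checked with Python's standard library, whose `xml.etree.ElementTree` parser rejects documents with badly nested or unclosed tags but does not validate against a DTD. The helper name `is_well_formed` is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the text parses as XML (tags properly nested and closed)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<paper><section></section></paper>"))
print(is_well_formed("<paper><section></paper>"))
```

Checking validity needs a validating parser or a hand-written check of each element's content model.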

DTDs as Grammars

<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title,section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>

<paper>
  <section> <text> </text> </section>
  <section>
    <title> </title>
    <section> … </section>
    <section> … </section>
  </section>
</paper>
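Reading the DTD as a grammar suggests a simple validity check: each content model is a regular expression over the sequence of child tags. The sketch below hand-translates the four declarations of the paper DTD into Python regexes (the dictionary `CONTENT_MODELS` and the function `is_valid` are assumptions of this example, and nothing here parses DTD syntax). Each child tag is written as "tag," so a child sequence becomes a string such as "title,section,".

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models of the paper DTD (an assumed encoding).
CONTENT_MODELS = {
    "paper": r"(section,)*",
    "section": r"(title,(section,)*|text,)",
    "title": r"",  # #PCDATA: no element children
    "text": r"",   # #PCDATA: no element children
}


def is_valid(elem):
    """Check an element and its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return False  # undeclared element
    children = "".join(child.tag + "," for child in elem)
    if re.fullmatch(model, children) is None:
        return False
    return all(is_valid(child) for child in elem)


doc = ET.fromstring(
    "<paper>"
    "<section><text>intro</text></section>"
    "<section><title>body</title>"
    "<section><text>part</text></section></section>"
    "</paper>"
)
print(is_valid(doc))
```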

DTDs as Schemas. Not so well suited: they impose unwanted constraints on order, e.g. <!ELEMENT person (name,phone)>; references cannot be constrained; they can be too vague, e.g. <!ELEMENT person ((name|phone|email)*)>.
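Both complaints can be made concrete by treating the two content models as regular expressions over the child tags of a person element. This is only a sketch: the regexes are hand-written translations, and `matches`, `STRICT` and `VAGUE` are names invented for this example.

```python
import re

STRICT = r"name,phone,"            # (name,phone)
VAGUE = r"(name,|phone,|email,)*"  # ((name|phone|email)*)


def matches(model, child_tags):
    """True if the sequence of child tags fits the content model."""
    sequence = "".join(tag + "," for tag in child_tags)
    return re.fullmatch(model, sequence) is not None


print(matches(STRICT, ["name", "phone"]))   # the declared order
print(matches(STRICT, ["phone", "name"]))   # same data, other order
print(matches(VAGUE, []))                   # a person with no fields
print(matches(VAGUE, ["name", "name", "name"]))
```

The strict model turns down a reordering of the same two fields, while the vague model takes an empty person as readily as one with three names.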