Managing XML and Semistructured Data


Managing XML and Semistructured Data Lecture 10: Schema Prof. Dan Suciu Spring 2001

In this lecture
- Schema for unordered data
- Conformance and classification using simulation
- Upper bound and lower bound schemas

Resources
- MSL: A Model for W3C XML Schema, by Brown, Fuchs, Robie, Wadler, in WWW10, 2001.
- Subsumption for XML Types, by Kuper and Simeon, ICDT 2001.
- Data on the Web, by Abiteboul, Buneman, Suciu: chapter 7

Schemas for Relational vs. SS Data
Schemas in relational data:
- Defined before the data
- Strictly enforced
Schemas in SS data:
- Defined after the data
- May not be enforced (some data doesn't have a schema)

When the Schema is Created
- Created by the user
  - Before or after the data
  - Does not need to be unique
- Extracted from the data
  - Unheard of in relational databases
- Extracted from the query
  - Like type inference in programming languages
Different schema formalisms for each task!

Schemas for Unordered Data
- Data model: OEM
- The data may contain cycles
- The schema itself will be a graph

Schemas: An Example
Some database:
[Figure: database graph rooted at &r1, with nodes &c1, &c2 (companies), &p1, &p2, &p3 (persons), &s0 to &s10, and &a1 to &a7. Edge labels: company, person, name, address, url, position, phone, employee, manages, c.e.o., works-for, description, procurement, salesrep, contact, task, eval, 1997, 1998. Values: "Widget", "Trenton", "Gadget", "www.gp.fr", "Paris", "Smith", "Manager", "Jones", "5552121", "Dupont", "Sales", "on target", "below target".]

Graph Schemas
Upper Bound Schema
[Figure: upper bound schema graph. Labels appearing in the figure: Root, person, company, works-for, managed-by, Employee, c.e.o. | employee, name | address | url, name | phone | position, string, description, Any, *.]

The Two Questions to Ask
- Conformance: does that data conform to this schema?
- Classification: if so, then which objects belong to what classes?

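Graph Simulation
Definition: given two edge-labeled graphs G1, G2, a simulation is a relation R between their nodes such that:
  if (x1, x2) in R and (x1, a, y1) in G1, then there exists (x2, a, y2) in G2 (same label) such that (y1, y2) in R.
[Figure: x1 in G1 related by R to x2 in G2; a-labeled edges x1 → y1 and x2 → y2; y1 related by R to y2.]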
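As a minimal sketch of the definition above, the largest simulation between two small graphs can be computed by starting from all node pairs and discarding every pair that violates the condition, until nothing changes. The function name, the edge-triple encoding, and the toy graphs are assumptions made for illustration; they are not taken from the lecture.

```python
def maximal_simulation(g1_nodes, g1_edges, g2_nodes, g2_edges):
    """Return the largest simulation from G1 to G2.

    Edges are (source, label, target) triples (an assumed encoding).
    """
    # Start from the full relation and refine it downward.
    rel = {(x1, x2) for x1 in g1_nodes for x2 in g2_nodes}
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            for (src, a, y1) in g1_edges:
                if src != x1:
                    continue
                # Need some edge (x2, a, y2) in G2 with (y1, y2) still in rel.
                matched = any(
                    s == x2 and b == a and (y1, y2) in rel
                    for (s, b, y2) in g2_edges
                )
                if not matched:
                    rel.discard((x1, x2))
                    changed = True
                    break
    return rel


# Toy graphs (hypothetical): G1 has x -a-> y; G2 has u -a-> v and an isolated w.
g1_nodes, g1_edges = {"x", "y"}, {("x", "a", "y")}
g2_nodes, g2_edges = {"u", "v", "w"}, {("u", "a", "v")}
sim = maximal_simulation(g1_nodes, g1_edges, g2_nodes, g2_edges)
```

Here x can only be simulated by u, the one node of G2 with an outgoing a-edge, while y has no outgoing edges and is therefore simulated by every node of G2.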

Using Simulation
Data graph D, schema S
- Conformance: find the maximal simulation R from D to S. Notation: D ≼ S
- Classification: check if (x, c) in R. Notation: x ≼ c

Example
[Figure: the database graph from the earlier example (captioned "Database") shown next to the graph schema (captioned "Graph Schema"), whose labels include Root, Company, Employee, string, and Any.]

Formally
A graph schema S is a graph such that:
- Nodes are called classes
- Edges are labeled with unary predicates, p(x)
Examples:
  person = "x = person"
  name | address = "x = name ∨ x = address"
  * = "true"
  int = "x ∈ int"
  ¬name = "x ≠ name"
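With predicates on schema edges, the simulation condition changes only slightly: a data edge labeled a is matched by a schema edge whose predicate accepts a. The sketch below illustrates conformance and classification under two assumptions that the slides do not spell out: both graphs have a designated root, and conformance means the two roots are related by the maximal simulation. Typing of atomic values (string, int) is left out for brevity, and all names and the toy data are hypothetical.

```python
def label(name):
    """Predicate "x = name"."""
    return lambda a: a == name

def anything(a):
    """Predicate "true", written * on the slides."""
    return True

def maximal_simulation(data_nodes, data_edges, classes, schema_edges):
    """Largest simulation from data to schema; schema edges carry predicates."""
    rel = {(x, c) for x in data_nodes for c in classes}
    changed = True
    while changed:
        changed = False
        for (x, c) in list(rel):
            for (src, a, y) in data_edges:
                if src != x:
                    continue
                # Some schema edge out of c must accept label a and lead
                # to a class that still simulates the target y.
                ok = any(c1 == c and pred(a) and (y, c2) in rel
                         for (c1, pred, c2) in schema_edges)
                if not ok:
                    rel.discard((x, c))
                    changed = True
                    break
    return rel

# Hypothetical data graph: a root with one person that has a name and a description.
data_nodes = {"r", "p1", "n1", "d1", "e1"}
data_edges = {("r", "person", "p1"), ("p1", "name", "n1"),
              ("p1", "description", "d1"), ("d1", "eval", "e1")}

# Hypothetical schema: under description, any structure is allowed.
classes = {"Root", "Person", "Leaf", "Any"}
schema_edges = [("Root", label("person"), "Person"),
                ("Person", label("name"), "Leaf"),
                ("Person", label("description"), "Any"),
                ("Any", anything, "Any")]

rel = maximal_simulation(data_nodes, data_edges, classes, schema_edges)
conforms = ("r", "Root") in rel                      # conformance
classes_of_p1 = {c for (x, c) in rel if x == "p1"}   # classification
```

In this toy example the data conforms, and p1 is classified both as Person and as Any, which shows that one object can belong to several classes.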

Examples of Graph Schemas (What Do They Mean?)
[Figure: five graph schemas S1 to S5, built from the labels person, name, age, description, part, subpart, price, string, int, ¬name ∧ ¬age, and *. S4 is captioned "Universal schema", ST.]
- S1 = there are persons; each can have several names and several ages, of types string and int respectively.
- S2 = same as S1, plus persons can have description edges, under which we can have any edge with any label, except name and age.
- S3 = describes a hierarchy of parts and subparts, arbitrarily deep. Leaf subparts (or parts) may have names and prices.
- S4 = describes ANY database.
- S5 = persons may have name and age of types string and int, and under description there may be any structure.

[Figure: a data graph D (person nodes with name and age edges, with values such as Smith and 55) next to a schema S (person, with name of type string and age of type int), illustrating D ≼ S.]

Any database conforms to ST, the "universal schema"!
[Figure: the universal schema ST, drawn with a * edge.]

Schemas in SS Data vs. Relational Data
- Relational data: each data instance has exactly one schema
- Semistructured data: one data instance has several schemas

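The Classification Problem
[Figure: data graph D with person nodes carrying name, phone, and email edges (values such as John, 1234, Mary), next to a schema with two person edges whose targets have name, phone, and email edges of type string.]
The schema is nondeterministic: it creates ambiguous classifications.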

The Classification Problem
Definition: a schema S is deterministic if for every class c and every label a, there is at most one outgoing edge labeled a from c.
Fact: if S is deterministic and D is a tree, then each node is uniquely classified. (When D is not a tree, this is not true.)
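The determinism condition is easy to check mechanically when edge labels are plain names rather than general predicates. The following sketch assumes schema edges are given as (class, label, target class) triples; the function name and the two toy schemas are illustrative only.

```python
from collections import Counter

def is_deterministic(schema_edges):
    """True if no class has two outgoing edges with the same label."""
    # Count outgoing edges per (class, label) pair; duplicates are ignored.
    counts = Counter((c, a) for (c, a, _target) in set(schema_edges))
    return all(n <= 1 for n in counts.values())

# Hypothetical schema with two person edges leaving Root: nondeterministic.
s = [("Root", "person", "P1"), ("Root", "person", "P2"),
     ("P1", "name", "Str"), ("P1", "phone", "Str"),
     ("P2", "name", "Str"), ("P2", "email", "Str")]

# A variant with a single person edge: deterministic.
s_single = [("Root", "person", "P"),
            ("P", "name", "Str"), ("P", "phone", "Str"), ("P", "email", "Str")]

print(is_deterministic(s), is_deterministic(s_single))
```

With general predicate labels the check would instead have to decide whether two predicates on edges leaving the same class can accept the same label, which this sketch does not attempt.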

Deterministic Upper Bound Schemas
Given a schema S, we can always construct a deterministic approximation, Sd.
[Figure: schema S with two person edges, whose targets have name, phone, and email edges of type string, next to Sd with a single person edge leading to name, phone, and email edges of type string.]
In general, Sd is obtained by the powerset construction, which is expensive.
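The powerset construction mentioned above can be sketched in the same way as the subset construction for finite automata: each class of Sd is a set of classes of S. The code assumes plain edge labels (not general predicates) and a designated root class; the function name and the toy schema are assumptions for illustration.

```python
def determinize(root, schema_edges):
    """Subset construction: classes of the result are frozensets of classes."""
    start = frozenset([root])
    new_edges = set()
    seen = {start}
    work = [start]
    while work:
        state = work.pop()
        # Group the targets of all edges leaving any class in this state by label.
        by_label = {}
        for (c, a, target) in schema_edges:
            if c in state:
                by_label.setdefault(a, set()).add(target)
        for a, targets in by_label.items():
            nxt = frozenset(targets)
            new_edges.add((state, a, nxt))
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return start, new_edges

# Hypothetical nondeterministic schema: two person edges leave Root.
s = [("Root", "person", "P1"), ("Root", "person", "P2"),
     ("P1", "name", "Str"), ("P1", "phone", "Str"),
     ("P2", "name", "Str"), ("P2", "email", "Str")]

sd_root, sd_edges = determinize("Root", s)
```

In this example the two person classes merge into one class {P1, P2} with name, phone, and email edges. The result has one class per reachable subset, so in the worst case it is exponentially larger than S, which is the cost the slide refers to.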

Lower Bound Schemas
Introduced (under a different name and formalism) in: Nestorov, Abiteboul, Motwani, Extracting Schema from Semistructured Data, SIGMOD 98
- Goal: extract some "schema" from the data
- We will see later why these are "lower bound schemas"

Lower Bound Schemas
Schema = datalog program with special form
Classes = predicates

  Company(x) :- link(x,"c.e.o.",y), Employee(y),
                link(x,"name",z), String(z),
                link(x,"address",u), String(u)
  Employee(x) :- link(x,"works-for",y), Company(y),
                 link(x,"managed-by",z), Employee(z),
                 link(x,"name",u), String(u),
                 link(x,"address",v), String(v)
  Root(x) :- link(x,"company",y), Company(y),
             link(x,"person",z), Employee(z)

Maximal fixpoint, rather than minimal fixpoint! (next)

Datalog Maximal Fixpoint
Standard datalog semantics. Transform the program to:

  Company(x) ⇐ ∃y.∃z.∃u.(link(x,"c.e.o.",y), Employee(y),
                          link(x,"name",z), String(z),
                          link(x,"address",u), String(u))
  Employee(x) ⇐ ∃y.∃z.∃u.∃v.(link(x,"works-for",y), Company(y),
                              link(x,"managed-by",z), Employee(z),
                              link(x,"name",u), String(u),
                              link(x,"address",v), String(v))
  Root(x) ⇐ ∃y.∃z.(link(x,"company",y), Company(y),
                    link(x,"person",z), Employee(z))

Compute the minimal model. What is it? The empty set.

Datalog Minimal Fixpoint
Under the standard datalog semantics, the minimal model of the transformed program is the empty set: every rule for Company, Employee, and Root requires another Company or Employee fact, so nothing is ever derived. Compute the maximal model instead.
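The contrast between the two models can be sketched on a reduced version of the program. The code below is an illustration under stated assumptions: the rules are simplified to two classes with two required edges each, String is treated as a base predicate given by a set of nodes, and the data graph is hypothetical. Iterating the rules from empty class extents gives the minimal model, and iterating from full extents gives the maximal one.

```python
def consequences(rules, links, strings, nodes, interp):
    """Apply every rule once: nodes whose required edges all exist under interp."""
    def holds(y, cls):
        return y in strings if cls == "String" else y in interp[cls]
    return {
        cls: {x for x in nodes
              if all(any(s == x and l == lab and holds(y, target)
                         for (s, l, y) in links)
                     for (lab, target) in body)}
        for cls, body in rules.items()
    }

def fixpoint(rules, links, strings, nodes, start):
    """Iterate the rules from a starting interpretation until nothing changes."""
    interp = start
    while True:
        nxt = consequences(rules, links, strings, nodes, interp)
        if nxt == interp:
            return interp
        interp = nxt

# Simplified rules (assumption): class -> list of (edge label, target class).
rules = {
    "Company": [("c.e.o.", "Employee"), ("name", "String")],
    "Employee": [("works-for", "Company"), ("name", "String")],
}

# Hypothetical data: p2 has a name but no works-for edge.
nodes = {"c1", "p1", "p2", "s1", "s2", "s3"}
strings = {"s1", "s2", "s3"}
links = {("c1", "c.e.o.", "p1"), ("c1", "name", "s1"),
         ("p1", "works-for", "c1"), ("p1", "name", "s2"),
         ("p2", "name", "s3")}

bottom = {cls: set() for cls in rules}
top = {cls: set(nodes) for cls in rules}
least = fixpoint(rules, links, strings, nodes, bottom)
greatest = fixpoint(rules, links, strings, nodes, top)
```

Starting from empty extents nothing can be derived, because each class needs a member of the other class. Starting from full extents, nodes are removed until only c1 remains a Company and p1 an Employee, which is the classification one would want.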

Lower-Bound Schemas
Equivalent representation of the schema:
[Figure: lower bound schema graph. Labels appearing in the figure: Root, person, company, works-for, managed-by, Employee, Company, c.e.o., address, name, string.]

Simulation Strikes Back
[Figure: the example database graph (captioned "Database") shown next to the lower bound schema (captioned "Lower Bound"), whose labels include Root, Company, Employee, and string.]
A model of program P is precisely a simulation (and vice versa).

Summary: Lower vs. Upper Bound Schemas
- Upper bound schemas tell us what edges are allowed
- Lower bound schemas tell us what edges are required