Typing Semistructured Data By, Keshava Reddy Kottapally Goutham Chinnapolamada Source: Serge Abiteboul, Dan Suciu, Peter Buneman, Data on the web: From.

Presentation transcript:

Typing Semistructured Data By, Keshava Reddy Kottapally Goutham Chinnapolamada Source: Serge Abiteboul, Dan Suciu, Peter Buneman, Data on the web: From relations to semistructured data and XML, Morgan Kaufmann Series, ISBN X, 1999

Typing Semistructured Data
Introduction: schemas for semistructured data
Motivation for typing semistructured data
Schema formalisms:
  – First-order logic
  – Datalog
  – Graph simulations
Extracting schemas from data
Inferring schemas from queries
Path constraints

What is semistructured data?
Semistructured data has some structure, but is difficult to describe with a predefined, rigid schema:
  – Irregularity
  – Continual evolution
  – Structure that is implicit or unknown to the user

What is typing?
Typing is about finding the structure of semistructured data.
Structuring semistructured data is still an active research area.
Typing involves finding methods to provide schemas for semistructured data.
Typing for SSD differs from typing for relational or object-oriented data and hence needs separate methods.

Uses of typing SSD
To optimize query evaluation. Example:
  – Original query: select X.title from biblio._ X where X.*.zip = "12345"
  – Optimized form: select X.title from biblio.book X where X.address.zip = "12345"
With the schema known, the wildcard _ and the path X.*.zip can be replaced by concrete labels.

[Figure: schema graph for the biblio example, with classes C1 to C5 and edge labels biblio, book, paper, title, author, first name, last name, address, street, city, zip, journal, and year; leaves are of type string.]
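As a rough illustration of how a schema supports this kind of rewrite, the sketch below enumerates every label path in a schema that ends in zip. The SCHEMA dictionary is a hypothetical reconstruction of the biblio example (the class names Biblio, Book, Paper, Author, and Address are assumptions, as is the helper paths_to), and the sketch assumes the schema is acyclic.

```python
# Hypothetical schema for the biblio example: class -> {edge label: target class}.
# The class names and the exact edges are assumptions made for illustration.
SCHEMA = {
    "Biblio": {"book": "Book", "paper": "Paper"},
    "Book": {"title": "String", "author": "Author", "address": "Address"},
    "Paper": {"title": "String", "journal": "String", "year": "String"},
    "Author": {"firstname": "String", "lastname": "String"},
    "Address": {"street": "String", "city": "String", "zip": "String"},
}


def paths_to(schema, start, target_label, prefix=()):
    """Return every label path from `start` whose last label is `target_label`.

    Assumes the schema graph is acyclic; a cyclic schema would need a visited set.
    """
    found = []
    for label, child in schema.get(start, {}).items():
        path = prefix + (label,)
        if label == target_label:
            found.append(path)
        # Keep descending: the target label may also occur deeper in the schema.
        found.extend(paths_to(schema, child, target_label, path))
    return found


zip_paths = paths_to(SCHEMA, "Biblio", "zip")
print(zip_paths)
```

If the only path found is book.address.zip, an optimizer can safely replace the wildcard step and the X.*.zip path by these labels.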

Uses of typing, continued
To facilitate the integration of several data sources
To improve storage
  – Better clustering may reduce the number of page fetches, thus improving query performance
To construct indexes
To describe the database content to users and facilitate query formulation
To proscribe certain updates

Two ways of typing
Schema extraction
  – Given one particular data instance, find the most specific schema for it
  – With semistructured data, the type may be specified after the database is populated
  – A data instance may have more than one type
Schema inference
  – Find the most specific schema by analyzing the query
  – This process is similar to type inference in programming languages

The problem
Given a database and a type:
  – Does the database conform to this type?
Classification of objects:
  – Which objects belong to each class?
Typing involves describing the structure of each class and its relationships with other classes.

Difference between typing SSD and object databases
Classes are defined less precisely. As a consequence, objects may belong to several classes.
Some objects may not belong to any class, or may have properties that do not pertain to any class.
The typing may be approximate. For example, a class may accept an object that does not quite conform to the specification of that class.

Schema formalisms
  – First-order logic
  – Datalog
  – Simulation

First-order logic
Example: consider three kinds of objects in the database.
Root object(s) have:
  – Outgoing edges labeled company to company objects and person to person objects
Person objects have:
  – Outgoing edges labeled name and position to string objects
  – Outgoing edges labeled worksfor to company objects
  – Incoming edges labeled manager and employee from company objects
Company objects have:
  – Outgoing edges labeled name and address to string objects
  – Outgoing edges labeled manager and employee to person objects
  – Incoming edges labeled worksfor from person objects

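If:
  – If an object has a-edges to strings and b-edges from c' objects, then it is a c-object.
  – $\exists Y, Z\,(\mathit{ref}(X,a,Y) \land \mathit{string}(Y) \land c'(Z) \land \mathit{ref}(Z,b,X)) \Rightarrow c(X)$
Only-if:
  – Any c-object has some a-edges to strings and some b-edges from c' objects.
  – $\exists Y, Z\,(\mathit{ref}(X,a,Y) \land \mathit{string}(Y) \land c'(Z) \land \mathit{ref}(Z,b,X)) \Leftarrow c(X)$
If and only if:
  – $\exists Y, Z\,(\mathit{ref}(X,a,Y) \land \mathit{string}(Y) \land c'(Z) \land \mathit{ref}(Z,b,X)) \Leftrightarrow c(X)$
Consequence:
  – $c(X) \land \mathit{ref}(Z,b,X) \Rightarrow c'(Z)$
  – $c(X) \land \mathit{ref}(X,a,Y) \Rightarrow \mathit{string}(Y)$
  – $c(X) \land \mathit{ref}(X,L,Y) \land L \neq a \land L \neq b \Rightarrow \mathit{false}$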
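The "if" and "only-if" directions can be checked mechanically on a small graph. The following is a minimal sketch, assuming edges are stored as (source, label, target) triples; the helper names body_holds and check_typing and the toy objects are hypothetical and not part of the original slides.

```python
def body_holds(x, ref, string, c_prime, a="a", b="b"):
    """True if x has an a-edge to a string and a b-edge coming from a c' object."""
    has_a = any(s == x and l == a and t in string for (s, l, t) in ref)
    has_b = any(t == x and l == b and s in c_prime for (s, l, t) in ref)
    return has_a and has_b


def check_typing(objects, ref, string, c, c_prime):
    """Return (if_ok, only_if_ok) for a proposed extent c of the class."""
    # If: every object satisfying the body must be classified as c.
    if_ok = all(x in c for x in objects if body_holds(x, ref, string, c_prime))
    # Only-if: every object classified as c must satisfy the body.
    only_if_ok = all(body_holds(x, ref, string, c_prime) for x in c)
    return if_ok, only_if_ok


# Toy graph (an assumption): x1 has both required edges, x2 lacks the incoming b-edge.
objects = {"z", "x1", "x2", "s1", "s2"}
ref = {("x1", "a", "s1"), ("z", "b", "x1"), ("x2", "a", "s2")}
string = {"s1", "s2"}
c_prime = {"z"}

print(check_typing(objects, ref, string, {"x1"}, c_prime))
print(check_typing(objects, ref, string, {"x1", "x2"}, c_prime))
print(check_typing(objects, ref, string, set(), c_prime))
```

Only the extent {x1} satisfies both directions, which is what the "if and only if" formula requires.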

Problem definition with first-order logic
The previous questions on typing can be restated in terms of first-order logic:
  – Does D satisfy T (written D |= T)? That is, is there a model of T that coincides with D over the extensional predicates?
  – If D |= T, what classification is induced?
First-order logic leads to very general typings, probably too general for what semistructured data needs.
It can also lead to undecidability or intractability.

Datalog: a rule-based language
Datalog allows us to state that if a conjunction of facts holds, then some new fact can be derived.
Datalog rules allow us to define classes by specifying which incoming and outgoing edges are required.
Example:
  – r(X) :- ref(X, person, Y), p(Y), ref(X, company, Z), c(Z)
  – p(X) :- c(Y), ref(Y, manager, X), c(Z), ref(Z, employee, X), ref(X, worksfor, U), c(U), ref(X, name, N), string(N), ref(X, position, P), string(P)
  – c(X) :- p(Z), ref(Z, worksfor, X), ref(X, manager, M), p(M), ref(X, employee, E), p(E), ref(X, name, N), string(N), ref(X, address, A), string(A)

Fixpoint semantics
Least fixpoint semantics
  – Starting from an empty set of typing facts, nothing can be derived, because every rule body requires at least one typing fact. Hence the empty set of facts is the least fixpoint for this program.
Greatest fixpoint semantics
  – Types the largest possible set of objects.
  – The goal is to find the greatest fixpoint for a given data graph D: the desired model is the greatest fixpoint containing D.

Consider the following data graph D:
&o1 { company: &o2 { name: &o5 "o2", address: &o6 "Versailles", manager: &o3, employee: &o3, employee: &o4 },
      person: &o3 { name: &o7 "Francois", position: &o8 "CEO", worksfor: &o2 },
      person: &o4 { name: &o9 "Lucien", position: &o10 "programmer", worksfor: &o2 } }
As facts: ref(&o1, company, &o2), ref(&o2, name, &o5), etc.; string(&o5), string(&o6), etc.
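To make the "etc." concrete, the sketch below flattens the nested description of D into ref and string facts. The tuple encoding and the helper flatten are assumptions chosen for illustration: an object is written as (oid, value), and a bare oid string stands for a reference to an object defined elsewhere.

```python
# Data graph D: (oid, value), where value is a string (atomic object)
# or a list of (label, child) pairs; a bare "&..." string is a reference.
D = ("&o1", [
    ("company", ("&o2", [
        ("name", ("&o5", "o2")),
        ("address", ("&o6", "Versailles")),
        ("manager", "&o3"),
        ("employee", "&o3"),
        ("employee", "&o4"),
    ])),
    ("person", ("&o3", [
        ("name", ("&o7", "Francois")),
        ("position", ("&o8", "CEO")),
        ("worksfor", "&o2"),
    ])),
    ("person", ("&o4", [
        ("name", ("&o9", "Lucien")),
        ("position", ("&o10", "programmer")),
        ("worksfor", "&o2"),
    ])),
])


def flatten(obj, ref, string):
    """Collect ref(source, label, target) and string(oid) facts; return obj's oid."""
    if isinstance(obj, str):        # bare oid: reference to an existing object
        return obj
    oid, value = obj
    if isinstance(value, str):      # atomic object holding a string value
        string.add(oid)
    else:                           # complex object: one ref fact per outgoing edge
        for label, child in value:
            ref.add((oid, label, flatten(child, ref, string)))
    return oid


ref, string = set(), set()
flatten(D, ref, string)
for fact in sorted(ref):
    print("ref%s" % (fact,))
```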

Deriving the greatest fixpoint
The desired model M can be derived by starting from a model containing D and all possible typing facts:
J0 = D ∪ { r(&o1), r(&o2), r(&o3), r(&o4), p(&o1), p(&o2), p(&o3), p(&o4), c(&o1), c(&o2), c(&o3), c(&o4) }
Applying the rules to J0 repeatedly until a fixpoint is reached yields the desired model:
M = J2 = J1 = D ∪ { r(&o1), c(&o2), p(&o3), p(&o4) }
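The iteration can be sketched in a few lines of Python. This is not the Datalog program from the slides: as a labeled simplification, the rules below check outgoing edges only (r needs a person and a company edge; p needs name, position, and worksfor; c needs name, address, manager, and employee). The names T, iterate, REF, STRING, and OBJECTS are assumptions introduced here.

```python
# ref and string facts of the data graph D.
REF = {
    ("&o1", "company", "&o2"), ("&o1", "person", "&o3"), ("&o1", "person", "&o4"),
    ("&o2", "name", "&o5"), ("&o2", "address", "&o6"), ("&o2", "manager", "&o3"),
    ("&o2", "employee", "&o3"), ("&o2", "employee", "&o4"),
    ("&o3", "name", "&o7"), ("&o3", "position", "&o8"), ("&o3", "worksfor", "&o2"),
    ("&o4", "name", "&o9"), ("&o4", "position", "&o10"), ("&o4", "worksfor", "&o2"),
}
STRING = {"&o5", "&o6", "&o7", "&o8", "&o9", "&o10"}
OBJECTS = {"&o1", "&o2", "&o3", "&o4"}  # complex objects to classify


def out(x, label):
    """Targets of the edges labeled `label` leaving x."""
    return {t for (s, l, t) in REF if s == x and l == label}


def T(J):
    """One rule application (simplified rules, outgoing edges only)."""
    r, p, c = J
    new_r = frozenset(x for x in OBJECTS
                      if out(x, "person") & p and out(x, "company") & c)
    new_p = frozenset(x for x in OBJECTS
                      if out(x, "name") & STRING and out(x, "position") & STRING
                      and out(x, "worksfor") & c)
    new_c = frozenset(x for x in OBJECTS
                      if out(x, "name") & STRING and out(x, "address") & STRING
                      and out(x, "manager") & p and out(x, "employee") & p)
    return (new_r, new_p, new_c)


def iterate(J):
    """Apply T until nothing changes."""
    while True:
        nxt = T(J)
        if nxt == J:
            return J
        J = nxt


FULL = (frozenset(OBJECTS),) * 3   # J0: every object has every type
EMPTY = (frozenset(),) * 3         # start of the least-fixpoint iteration

greatest = iterate(FULL)
least = iterate(EMPTY)
print("greatest (r, p, c):", [sorted(s) for s in greatest])
print("least (r, p, c):", [sorted(s) for s in least])
```

Starting from all typing facts, the iteration stabilizes after one step; starting from no typing facts, nothing is ever derived.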

Simulation
The aim is to describe a data graph by a schema graph whose semantics amount to a listing of all permitted labels.
A schema graph is similar to a data graph, with the following changes:
  – Labels can be alternations (like address | name | url) or an underscore, which matches any label
  – Atomic values are type names, like string, int, float, etc.
  – Oids of complex objects are called classes, like Person, Company, etc.

[Figure: example data graph with root &r1, person objects &p1 to &p3, company objects &c1 and &c2, string nodes &s0 to &s10 (values include "Smith", "Mgr", "Widget", "Trent", "Joe"), and nodes &a1 to &a7; edge labels include person, company, manager, mgr, emp, worksfor, name, position, addr, phone, url, description, procurement, salesrep, contact, task, and performance.]

Schema graph
[Figure: schema graph with nodes Root, Person, Company, String, and Any; edge labels: company, person, employee, manager, worksfor, name|address|url, name|phone|position, description, and _.]

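Simulation is defined as follows. Given graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, a relation $R \subseteq V_1 \times V_2$ is a simulation if it satisfies
$\forall l \in L\ \forall x_1, y_1 \in V_1\ \forall x_2 \in V_2\ \big( x_1 [l] y_1 \land x_1 R x_2 \Rightarrow \exists y_2 \in V_2\ ( y_1 R y_2 \land x_2 [l] y_2 ) \big)$
where $x [l] y$ denotes an edge labeled $l$ from $x$ to $y$.
The rule says that every edge in $G_1$ must have a "corresponding" edge in $G_2$ under the simulation.
[Figure: an edge x1 [l] y1 in G1 and the corresponding edge x2 [l] y2 in G2, with x1 R x2 and y1 R y2.]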

To define a simulation between a semistructured data instance and a schema graph, we add the following requirements:
  – The roots must be in the simulation: r R r'. We say the simulation is rooted.
  – Whenever x R y, if y is an atomic type (like string, int), then x must be an atomic node too and have a value of that type. We say the simulation is typed.
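A checker for these conditions might look like the sketch below. It assumes edges are (source, label, target) triples, that schema labels may be alternations such as name|phone|position or the wildcard _, and that atomic values live in a separate dictionary; the function names and the toy graphs are hypothetical.

```python
def label_matches(label, schema_label):
    """A schema label is a wildcard "_" or an alternation like "name|phone"."""
    return schema_label == "_" or label in schema_label.split("|")


def is_rooted_typed_simulation(R, data_edges, schema_edges,
                               data_root, schema_root, values, atomic_types):
    # Rooted: the two roots must be related.
    if (data_root, schema_root) not in R:
        return False
    for (x, y) in R:
        # Typed: an atomic schema type only relates to atomic data of that type.
        if y in atomic_types:
            if x not in values or not isinstance(values[x], atomic_types[y]):
                return False
        # Simulation: every data edge from x needs a matching schema edge from y.
        for (s, l, t) in data_edges:
            if s != x:
                continue
            if not any(s2 == y and label_matches(l, l2) and (t, t2) in R
                       for (s2, l2, t2) in schema_edges):
                return False
    return True


# Toy data and schema graphs (assumptions, loosely modeled on the example).
data_edges = {("r1", "person", "p1"), ("r1", "company", "c1"),
              ("p1", "name", "s1"), ("p1", "worksfor", "c1"),
              ("c1", "name", "s2")}
values = {"s1": "Smith", "s2": "Widget"}
schema_edges = {("Root", "person", "Person"), ("Root", "company", "Company"),
                ("Person", "name|phone|position", "string"),
                ("Person", "worksfor", "Company"),
                ("Company", "name|address|url", "string")}
atomic_types = {"string": str}

R = {("r1", "Root"), ("p1", "Person"), ("c1", "Company"),
     ("s1", "string"), ("s2", "string")}
ok = is_rooted_typed_simulation(R, data_edges, schema_edges,
                                "r1", "Root", values, atomic_types)
broken = is_rooted_typed_simulation(R - {("c1", "Company")}, data_edges,
                                    schema_edges, "r1", "Root",
                                    values, atomic_types)
print(ok, broken)
```

Dropping the pair (c1, Company) breaks the relation, because the company and worksfor edges no longer have a related target.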

Data node                    Schema node
&r1                          Root
&c1, &c2                     Company
&p1, &p2, &p3                Person
&s0, &s1, &s2, &s3, ...      string
&a1, &a2, &a3, &a4, ...      Any

The relation R defined by the example data graph and the given schema graph is a simulation.

Back to the typing problem
When does a data graph D conform to a schema graph S?
  – When there exists a rooted, typed simulation between the data and the schema.
Which objects belong to each class?
  – The principle is that an oid o should belong to a class c if o R c. In this way, a rooted simulation R will always classify all objects.
  – However, the classification need not be unique, which leads to the search for a maximal simulation.

[Figure: a data graph D (root &o, book objects &b1 and &b2) and a schema graph S; edge labels are book, title, author, publisher, and year, with leaves of type string.]

Maximal simulation
Notation: $G_1 \leq_R G_2$ means that R is a simulation from $G_1$ to $G_2$.
Facts:
  – If $G_1 \leq_{R_1} G_2$ and $G_1 \leq_{R_2} G_2$, then $G_1 \leq_{R_1 \cup R_2} G_2$.
  – For any data graph D conforming to some schema graph S, there is always a maximal simulation from D to S.
Back to the problem: which objects belong to each class?
  – An object o belongs to a class c if o R c, where R is the maximal simulation between the OEM data graph and the schema graph.
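Because simulations are closed under union, the maximal one can be computed by starting from all pairs and discarding those that violate the simulation condition. The sketch below does this for a toy book example; the graphs, the function name maximal_simulation, and the use of exact label matching without typing are assumptions made to keep it short.

```python
def maximal_simulation(data_edges, schema_edges):
    """Greatest relation R such that every data edge is matched in the schema."""
    data_nodes = {n for (s, _, t) in data_edges for n in (s, t)}
    schema_nodes = {n for (s, _, t) in schema_edges for n in (s, t)}
    # Start from all pairs and remove violating pairs until none is left.
    R = {(x, y) for x in data_nodes for y in schema_nodes}
    changed = True
    while changed:
        changed = False
        for (x, y) in list(R):
            for (s, l, t) in data_edges:
                if s == x and not any(s2 == y and l2 == l and (t, t2) in R
                                      for (s2, l2, t2) in schema_edges):
                    R.discard((x, y))
                    changed = True
                    break
    return R


# Toy graphs (assumptions): two books under a root, one schema class Book.
DATA = {("o", "book", "b1"), ("o", "book", "b2"),
        ("b1", "title", "t1"), ("b1", "author", "a1"),
        ("b2", "title", "t2")}
SCHEMA = {("Root", "book", "Book"),
          ("Book", "title", "String"), ("Book", "author", "String")}

R = maximal_simulation(DATA, SCHEMA)
conforms = ("o", "Root") in R  # rooted: the data root is related to the schema root
classes = {x: sorted(y for (x2, y) in R if x2 == x) for x in ("o", "b1", "b2")}
print(conforms, classes)
```

Since typing is ignored here, leaf nodes without outgoing edges are related to every schema node; adding the typed condition would narrow their classification.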