Lore: A Database Management System for Semistructured Data.



Lore: Lightweight Object Repository

Lore - Motivation
Data may be irregular and thus not conform to a rigid schema.
The relational model has null values, and OO models have inheritance and complex objects, but both have difficulty accommodating irregular data in a schema.
It may be difficult to decide in advance on a single, correct schema.
The structure of the data may evolve rapidly: data elements may change types, or data not conforming to the previous structure may be added.

Thus, there is a need for management of semistructured data.
The Lore system manages semistructured data. The data it manages is not confined to a schema and may be irregular or incomplete.
OEM (Object Exchange Model) is Lore's data model: a graph-based, self-describing object instance model in which nodes are objects, edges are labeled with attribute names, and leaf nodes hold atomic values.
Lore is a lightweight object repository, and Lorel is Lore's query language.

Path expression queries
Automatic type coercion
Use of DataGuides: a structural summary of the database

Objects with oids
Atomic objects: have no outgoing edges and contain a value from one of the basic atomic types, such as integer, real, string, gif, java, audio, etc.
Complex objects: may have outgoing edges.
Names: special labels that serve as aliases for objects and as entry points to the database.

Object Exchange Model - OEM
Motivation: information exchange and extraction.
Each value exchanged is given an explicit label.
In the object <temp-in-Fahrenheit, integer, 80>, "temp-in-Fahrenheit" is the label.
Each object is self-describing, with a label, type, and value.
<set-of-temps, set, {cmpnt1, cmpnt2}>
cmpnt1 is <temp-in-Fahrenheit, integer, 80>
cmpnt2 is <temp-in-Celsius, integer, 20>

Labels
A label plays two roles:
– identifying an object (component)
– identifying the meaning of an object (component)
person-name both identifies cmpnt1 and conveys its meaning.
<person-record, set, {cmpnt1, cmpnt2, cmpnt3}>
cmpnt1 is <person-name, string, "Fred">
cmpnt2 is <office-num-in-bldg-5, integer, 333>
cmpnt3 is <department, string, "toy">
In relational data this corresponds to ….

Labels - Issues
What does a label mean?
– Database of labels
– Ontology of labels, within each source
Labels are relative (more specific) to the source of the data object.
Similar labels from different sources need to be resolved.
Labels provide flexibility in representing object structure.

Self-describing data models
They have been in existence for a long time. Why the additional interest now?
To use the self-describing nature of the data model for information exchange, and to extend the model to include object nesting.
To provide an appropriate object request language (query facility).

OEM - Specification
Each object in OEM has the following structure:
– Label: a variable-length character string describing what the object represents.
– Type: the data type of the object's value. Each type is either an atomic type or the type set.
– Value: a variable-length value for the object.
– Object-ID: a unique variable-length identifier for the object, or null.
Label | Type | Value | Object-ID
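The four-part structure above can be sketched as a small Python record. This is only an illustration: the class name `OEMObject`, the field names, and the `describe` helper are assumptions made for this example, not part of OEM or Lore.

```python
from dataclasses import dataclass
from typing import Optional, Union


# Hypothetical record type; the fields mirror Label, Type, Value, Object-ID.
@dataclass
class OEMObject:
    label: str
    type: str                             # an atomic type name, or "set"
    value: Union[int, float, str, list]   # a list of OEMObject when type == "set"
    oid: Optional[str] = None             # unique identifier, or None for "null"


def describe(obj: OEMObject) -> str:
    """Render an object in the <label, type, value> notation."""
    if obj.type == "set":
        inner = ", ".join(describe(child) for child in obj.value)
        return f"<{obj.label}, set, {{{inner}}}>"
    return f"<{obj.label}, {obj.type}, {obj.value!r}>"


cmpnt1 = OEMObject("temp-in-Fahrenheit", "integer", 80)
cmpnt2 = OEMObject("temp-in-Celsius", "integer", 20)
temps = OEMObject("set-of-temps", "set", [cmpnt1, cmpnt2])
print(describe(temps))
```

Because every object carries its own label and type, a client can walk `temps` without knowing any schema in advance.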

OEM - Summary
OEM is an information exchange model. It does not specify how objects are stored at the source.
OEM does specify how objects are received at a client, but after objects are received they can be stored in any way the client likes.
Note that the schema-less nature of OEM is particularly useful when a client does not know in advance the labels or structure of OEM objects.
Each source has a distinguished object with the lexical identifier "root".

Example doc1 is auths1 is auth11 is topic1 is call-no1 is doc2 is auths2 is auth21 is auth22 is auth23 is topic2 is call-no1 is docn is authsn is topic1 is call-no1 is biblio is the root object.

OEM - QL
SELECT Fetch-Expression FROM Object WHERE Condition
The result of this query is itself an object, with the special label "answer": <answer, set, {obj1, obj2, …, objn}>
Each returned obji is a component of the object specified in the FROM clause of the query, where the component is located by the Fetch-Expression and satisfies the Condition.

Path
The notion of a path is used both in the Fetch-Expression in the SELECT clause and in the Condition in the WHERE clause.
A path describes a traversal through an object using subobject structure and labels. Example: "biblio.doc.auth"
Paths are used in the Fetch-Expression to specify which components are returned in the answer object.
Paths are used in the Condition to qualify the fetched objects or other (related) components in the same object structure.

Queries - Simple
Retrieve the topic of each document for which "Ullman" is one of the authors:
  SELECT biblio.doc.topic
  FROM root
  WHERE biblio.doc.auth-set.auth-ln = "Ullman"
Intuitively, the query's WHERE clause finds all paths through the subobject structure with the sequence of labels [biblio, doc, auth-set, auth-ln] such that the object at the end of the path has value "Ullman".
obj1 is … obj2 is …

Queries - "wild-cards"
Retrieve the topics of all documents that have an internal call number:
  SELECT biblio.?.topic
  FROM root
  WHERE biblio.?.internal-call-no
The "?" label matches any label. For this query, the doc labels can be replaced by any other strings and the query would produce the same result.
By convention, two occurrences of ? in the same query must match the same label unless variables are used.
obj1 is …

Queries - "wild-paths"
Retrieve the topics of all documents that have an internal call number:
  SELECT *.topic
  FROM root
  WHERE *.internal-call-no
The symbol "*" matches any path of length one or more.
Using * followed by a single label is a convenient and common way to locate objects with a certain label in a complex structure.
As with ?, two occurrences of * in the same query must match the same sequence of labels, unless variables are used.
obj1 is …
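The matching rules for plain labels, "?" and "*" can be illustrated with a short recursive matcher. This is a sketch under stated assumptions: objects are modeled as nested Python lists of (label, child) pairs, the sample data is invented, the structure is assumed to be a tree (no cycles), and the convention that repeated wild-cards must bind to the same labels is not enforced.

```python
def match(pattern, node):
    """Yield every object reached from node by a label path matching pattern.

    pattern is a list of labels; "?" matches any one label and "*" matches
    any sequence of one or more labels.
    """
    if not pattern:
        yield node
        return
    if not isinstance(node, list):   # atomic object: no outgoing edges
        return
    head, rest = pattern[0], pattern[1:]
    for label, child in node:
        if head == "*":
            # "*" consumes this edge, then either stops or keeps consuming.
            yield from match(rest, child)
            yield from match(pattern, child)
        elif head == "?" or head == label:
            yield from match(rest, child)


# Invented sample data: a complex object is a list of (label, child) pairs.
root = [("biblio", [
    ("doc", [("topic", "Databases"), ("internal-call-no", 7)]),
    ("doc", [("topic", "Compilers")]),
])]

print(list(match(["biblio", "doc", "topic"], root)))
print(list(match(["biblio", "?", "internal-call-no"], root)))
print(list(match(["*", "topic"], root)))
```

A full evaluator would additionally tie the path found in the WHERE clause to the one used in the SELECT clause; the sketch only shows how a single path pattern is resolved.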

Queries - variables
Retrieve each document for which both "Hopcroft" and "Aho" are co-authors:
  SELECT biblio.doc
  FROM root
  WHERE biblio.doc.auth-set.auth-ln(a1) = "Aho"
    and biblio.doc.auth-set.auth-ln(a2) = "Hopcroft"
Here, the query finds all paths with structure [biblio, doc, auth-set] that have two distinct path completions with label auth-ln, with values "Aho" and "Hopcroft".
obj1 is the complete doc2

OEM (Cont.)
Examples:
– Object &3 is complex, and its subobjects are &8, &9, &10, and &11.
– Object &7 is atomic and has value "Clark".
DBGroup is a name that denotes object &1. (Names are entry points into the database.)

Figure: An OEM Database. The name DBGroup denotes object &1; the graph contains objects &1 through &20. Edge labels: Member, Project, Name, Age, Office, Room, Building. Atomic values: "Clark", "Smith", 46, "Gates 252", "Lore", "Tsimmis", "Jones", 28, "CIS", "411", "CIS", 252.

Lorel Queries - Simple Path Expression
Retrieve the offices of members with age greater than 30 years:
Query:
  SELECT DBGroup.Member.Office
  WHERE DBGroup.Member.Age > 30
Result:
  Office "Gates 252"
  Office
    Building "CIS"
    Room "411"

Lorel Query Rewrite
The previous query is rewritten to:
  select O
  from DBGroup.Member M, M.Office O
  where exists y in M.Age : y > 30
The comparison on Age is transformed into an existential condition, since all properties are set-valued in OEM.
A user can ask DBGroup.Member.Age > 30 regardless of whether Age is single-valued, set-valued, or unknown.

Lorel Query Rewrite - Why?
– Breaking the query into simple path expressions is necessary for query optimization.
– Coercion must be handled explicitly:
  Atomic objects and values: 0.5 < "0.9" should return true.
  Objects and sets of objects: DBGroup.Member.Age is a set of objects.
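The two concerns above, existential semantics over set-valued properties and coercion between strings and numbers, can be sketched together. The function names `to_number`, `compare`, and `exists` are assumptions for this example, and the coercion rule is a simplification: mixed string and number operands are compared as reals, and anything that cannot be coerced simply fails the comparison.

```python
import operator


def to_number(value):
    """Return value as a float if it is numeric or a numeric string, else None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def compare(a, op, b):
    """Apply op after coercion; operands that cannot be coerced compare as False."""
    if isinstance(a, str) and isinstance(b, str):
        return op(a, b)              # string vs. string: no coercion needed
    x, y = to_number(a), to_number(b)
    return x is not None and y is not None and op(x, y)


def exists(values, op, bound):
    """Evaluate 'exists y in values : y op bound' over a set-valued property."""
    return any(compare(v, op, bound) for v in values)


print(compare(0.5, operator.lt, "0.9"))      # the coercion case named above
print(exists([28, "46"], operator.gt, 30))   # Age may hold several values
print(exists([], operator.gt, 30))           # a member with no Age at all
```

Writing the condition as `exists` is what lets the same query text work whether Age is missing, single-valued, or set-valued.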

Queries - General Path Expression
Query:
  SELECT DBGroup.Member.Name
  WHERE DBGroup.Member.Office(.Room%|.Cubicle)? like "%252"
Result:
  Name "Jones"
  Name "Smith"
Room% matches all labels starting with Room, such as Room68.
"|" stands for disjunction.
"?" indicates that the label pattern is optional.
like "%252" specifies that the data value should end with the string "252".
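One way to read the pattern syntax above is as a regular expression over dotted label paths. The sketch below is an analogy using Python's `re` module, not Lore's implementation; the helper name `like` and the hand-translated `OFFICE_PATH` pattern are assumptions, and matching is case-sensitive here.

```python
import re


def like(text, pattern):
    """SQL-style match: "%" in pattern stands for any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


# Hand translation of Office(.Room%|.Cubicle)? into a regex over dotted paths.
# "%" inside a label becomes [^.]* so that it cannot cross into the next label.
OFFICE_PATH = re.compile(r"Office(\.Room[^.]*|\.Cubicle)?")

for path in ["Office", "Office.Room68", "Office.Cubicle", "Office.Desk"]:
    print(path, OFFICE_PATH.fullmatch(path) is not None)

print(like("Gates 252", "%252"))
```

The optional group is what lets the query match both an atomic Office value and an Office with a Room or Cubicle subobject.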

Queries - Subqueries
Retrieve Lore project members who work on other projects:
Query:
  SELECT M.Name,
    ( SELECT M.Project.Title
      WHERE M.Project.Title != "Lore" )
  FROM DBGroup.Member M
  WHERE M.Project.Title = "Lore"
Result:
  Member
    Name "Jones"
    Title "Tsimmis"

DataGuides
A DataGuide is a concise and accurate summary of the structure of an OEM database. It is itself stored as an OEM database, somewhat like a system catalog.
Why?
– With no explicit schema, how do we formulate meaningful queries?
– Large databases: users cannot just view the graph structure.
– Evaluating a path expression that does not exist in the database is wasted work.
Each possible path expression is encoded exactly once.


DataGuides As Histograms
Each object in the DataGuide can have a link to its corresponding target set.
– A target set is the set of oids reachable by that path. The target set of DBGroup.Member.Age is {9, 13}.
– This is a path index: it finds the set of objects reachable by a particular path.
– Statistics can be stored in the DataGuide, for example the number of atomic objects of each type reachable by a path p.
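A DataGuide with target sets can be built much like a subset construction: each DataGuide node stands for the set of oids reached by some label path. The sketch below assumes the database is given as a dict from oid to (label, child oid) pairs; the tiny edge list is an invented fragment, chosen only so that the Age target set agrees with the {9, 13} quoted above, and the function names are hypothetical.

```python
def build_dataguide(edges, root):
    """Return (start, guide), where guide maps a target set to {label: target set}.

    edges maps an oid to a list of (label, child_oid) pairs.
    """
    start = frozenset([root])
    guide = {}
    todo = [start]
    while todo:
        targets = todo.pop()
        if targets in guide:         # already expanded; also stops on cycles
            continue
        by_label = {}
        for oid in targets:
            for label, child in edges.get(oid, []):
                by_label.setdefault(label, set()).add(child)
        guide[targets] = {lab: frozenset(kids) for lab, kids in by_label.items()}
        todo.extend(guide[targets].values())
    return start, guide


def target_set(start, guide, path):
    """Follow a label path through the DataGuide; empty set if the path is absent."""
    node = start
    for label in path:
        node = guide[node].get(label)
        if node is None:
            return frozenset()
    return node


# Invented fragment; oid 1 plays the role of the object named DBGroup.
edges = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Name", 8), ("Age", 9)],
    3: [("Name", 12), ("Age", 13)],
}
start, guide = build_dataguide(edges, 1)
print(sorted(target_set(start, guide, ["Member", "Age"])))
print(sorted(target_set(start, guide, ["Member", "Office"])))
```

Checking a path against the guide before touching the data is how a query on a nonexistent path can be rejected cheaply, and the size of each target set is the kind of statistic the slide mentions.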

Conclusions
Lore takes advantage of structure where it exists.
It handles lack of structure well (data type coercion, general path expressions).
The query language allows users to get and update data from semistructured sources.
– The DataGuide allows users to determine which paths exist and gives useful statistical information.
Lore facilitates queries and updates on semistructured databases.

OEM vs. XML
OEM objects correspond to elements in XML.
Subelements in XML are inherently ordered.
XML elements may optionally include a list of attribute-value pairs.
Graph structure (multiple incoming edges) is specified in XML with references (ID and IDREF attributes), e.g. the Project attribute.

OEM to XML Example: – Jones 46 gates 252 This corresponds to rightmost member in the example OEM, where project is an attribute.

Query Optimization for Semistructured Data

External Data Manager
Enables retrieval of information from other data sources, transparently to the user.
An external object in Lore is a "placeholder" for the external data and specifies how Lore interacts with an external data source.
The specification for an external object includes:
– the location of a wrapper program that fetches the data and converts it to OEM,
– the time interval until fetched information becomes stale, and
– a set of arguments used to limit the information fetched from the external source.

Layer 0
Access to Lore: API, applications
Parser: takes a textual query as input and produces a parse tree
Preprocessor: translates the parse tree into an OQL-like query
Query plan generator and query optimizer (indexes, etc.)

Layer 1
Object manager: translation layer between OEM and the lower-level files; also compares objects
Query operators: execute the query plan, perform simple coercion, and iterate over the subobjects of a complex object

QUERY
  select M.Name,
    ( select M.Project.Title
      where M.Project.Title != "Lore" )
  from DBGroup.Member M
  where M.Project.Title = "Lore"
RESULT
  Member
    Name "Jones"
    Title "Tsimmis"

  update P.Member +=
    ( select DBGroup.Member
      where DBGroup.Member.Name = "Clark" )
  from DBGroup.Project P
  where P.Title = "Lore" or P.Title = "Tsimmis"
P.Member += specifies that Member edges are added between P and every object returned by the subquery.

Query Execution Plan
Query processing uses an iterator approach.
Execution begins at the top of the query plan, with each node in the plan requesting one tuple at a time from its children and performing some operation on the tuple.
After a node completes its operation, it passes a resulting tuple to its parent.

Object assignment (OA)
An OA is a simple data structure containing slots that correspond to the range variables in the query, plus some additional slots.
Each slot within an OA holds the oid of a vertex on a data path currently being considered by the query engine.
Example: if OA1 holds the oid for member "Smith", then OA2 and OA3 can hold the oids for one of Smith's Office subobjects and one of his Age subobjects, respectively.

Query Operators
Each operator takes a number of arguments, with the last argument being the OA slot that will contain the result of the operation. Select and Project have no target slot.
Scan returns all oids that are subobjects of a given object: Scan(starting OA slot, path expression, target OA slot).
Scan continues until no more subobjects satisfy the path expression.
Scan(OA1, "Office", OA2) places into slot OA2, one at a time, all Office subobjects of the object in slot OA1.
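The one-at-a-time behaviour of Scan maps naturally onto Python generators. In this sketch the OA is a plain dict from slot name to oid, the path expression is restricted to a single label, and the edge data and the `member_offices` plan are invented for illustration.

```python
def scan(edges, oa, source_slot, label, target_slot):
    """Scan(source, label, target): bind target_slot to each matching subobject.

    Yields the same OA once per binding, so the caller must use each
    binding before requesting the next one.
    """
    for edge_label, child in edges.get(oa[source_slot], []):
        if edge_label == label:
            oa[target_slot] = child
            yield oa


def member_offices(edges, root):
    """Nested scans for: from DBGroup.Member M, M.Office O."""
    oa = {"OA0": root}
    for _ in scan(edges, oa, "OA0", "Member", "OA1"):
        for _ in scan(edges, oa, "OA1", "Office", "OA2"):
            yield oa["OA1"], oa["OA2"]


# Invented edges: oid -> list of (label, child oid).
edges = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Office", 10)],
    3: [("Office", 11), ("Office", 14)],
}
print(list(member_offices(edges, 1)))
```

Each `for` loop plays the role of a parent node pulling tuples from its child, which is the pull-based execution described for the query plan.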

Join, Select, Project: nearly identical to their RDBMS counterparts.
Project limits which objects are returned.
Select applies a predicate to the object identified by the oid in the specified OA slot.
Aggregate implements quantification and aggregation: the node calls its child exhaustively, storing the results temporarily or computing the aggregate. A new object is created when there are no more valid OAs.

select O from DBGroup.Member M, M.Office O where exists A in M.Age : A > 30

Primary Operators: SetOp, ArithOp, CreateSet, and GroupBy
SetOp handles union, intersect, and except.
ArithOp handles arithmetic such as addition and multiplication.
CreateSet packages the results of an arbitrary subquery before proceeding: it calls its child exhaustively, storing each oid returned as part of a newly created complex object, and then stores the oid of the new set object in the target slot.
GroupBy handles subqueries that include a group-by expression.

  select M.Name, count(M.Publication)
  from DBGroup.Member M
  where M.Dept = "CS"
is rewritten to:
  select (select N from M.Name N), count(select P from M.Publication P)
  from DBGroup.Member M
  where exists D in M.Dept : D = "CS"

Indexing
Vindex: a value index. Lindex: a link (edge) index.
Lindex(oid, label) returns the oids of all parents of the object via the specified label (it provides parent pointers).
Vindex(label, operator, value) returns all atomic objects that have an incoming edge with the specified label and a value satisfying the specified operator (for example, <).

Index Query Plan
– Bottom-up
– Locate all objects with the desired values and appropriately labeled incoming edges via the Vindex.
– Using the Lindex, traverse up from these objects to match the path expression.

New Query Operators
1. Vindex
2. Lindex
3. Once
4. Named-obj

Indexes
Vindex(l, op, value, x) places into x all atomic objects that satisfy the "op value" condition and have an incoming edge labeled l.
– Vindex("Age", <, 30, y) places into y the objects with Age < 30.
Lindex(x, l, y) places into x all objects that are parents of y via an edge labeled l.
– Lindex(x, "Age", y) places into x all parents of y via the label "Age".

Indexes (cont.)
Bindex(l, x, y) finds all parent-child object pairs connected by an edge labeled l.
– Bindex("Age", x, y) locates all parent-child pairs connected by the label Age.
Pindex(PathExpression, x) places into x all objects reachable via the path expression.
– Pindex("A.B x, x.C y", y) places into y all objects reachable by going from A to B to C.
– Uses the DataGuide.

Vindex: implemented as B+-trees. Lindex: implemented with linear hashing.
String Vindex: index entries for all string-based atomic values.
Real Vindex: index entries for all numeric-based atomic values.
String-coerced-to-real Vindex: contains all string values that can be coerced into an integer or real.

If the value is of type string:
– Look it up in the string Vindex.
– If the value can be coerced to a real, also look up the coerced value in the real Vindex.
If the value is of type real (or integer):
– Look it up in the real Vindex.
– Also look it up in the string-coerced-to-real Vindex.
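The lookup rules above can be sketched with three in-memory indexes. This is an illustration only: plain Python lists stand in for the B+-trees, and the class name `ValueIndexes`, its methods, and the sample entries are assumptions made for the example.

```python
import operator


def to_real(text):
    """Return text parsed as a float, or None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


class ValueIndexes:
    """Lists of (value, oid) pairs standing in for the three Vindexes."""

    def __init__(self):
        self.string = []    # every string value
        self.real = []      # every integer or real value, stored as float
        self.coerced = []   # string values that can be coerced to a real

    def add(self, oid, value):
        if isinstance(value, str):
            self.string.append((value, oid))
            number = to_real(value)
            if number is not None:
                self.coerced.append((number, oid))
        else:
            self.real.append((float(value), oid))

    def lookup(self, op, value):
        """Return the oids whose stored value v satisfies op(v, value)."""
        hits = set()
        if isinstance(value, str):
            hits |= {oid for v, oid in self.string if op(v, value)}
            number = to_real(value)
            if number is not None:
                hits |= {oid for v, oid in self.real if op(v, number)}
        else:
            number = float(value)
            hits |= {oid for v, oid in self.real if op(v, number)}
            hits |= {oid for v, oid in self.coerced if op(v, number)}
        return hits


index = ValueIndexes()
for oid, value in [(1, 28), (2, "46"), (3, "Smith"), (4, 46.0)]:
    index.add(oid, value)
print(sorted(index.lookup(operator.eq, 46)))
print(sorted(index.lookup(operator.eq, "46")))
```

Keeping a separate string-coerced-to-real index is what lets a numeric probe find the string "46" without scanning every string in the database.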

Simple Query
  select O
  from DBGroup.Member M, M.Office O
  where exists y in M.Age : y < 30
Possible plans:
– Top-down (similar to pointer-chasing, nested-loops join).
– Bottom-up: use the Vindex to check y < 30, then traverse backwards from child to parent using the Lindex.
– Hybrid: both top-down and bottom-up, meeting in the middle.
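The bottom-up plan for this query can be sketched with dictionary-based stand-ins for the Vindex and Lindex. The edge list, the values, and the function names are invented for illustration, and a real Vindex would answer the range condition from a B+-tree rather than by filtering a list.

```python
def build_indexes(edges, values):
    """Build a value index and a link index from an edge list.

    edges: list of (parent_oid, label, child_oid); values: {oid: atomic value}.
    """
    vindex = {}   # label -> [(value, oid)] for atomic children
    lindex = {}   # (child_oid, label) -> [parent_oid]
    for parent, label, child in edges:
        lindex.setdefault((child, label), []).append(parent)
        if child in values:
            vindex.setdefault(label, []).append((values[child], child))
    return vindex, lindex


def offices_of_young_members(edges, values, root):
    """Bottom-up plan: Vindex on Age, Lindex up to the Member, then down to Office."""
    vindex, lindex = build_indexes(edges, values)
    offices = []
    for value, y in vindex.get("Age", []):           # Vindex("Age", <, 30, y)
        if not value < 30:
            continue
        for m in lindex.get((y, "Age"), []):         # Lindex(M, "Age", y)
            if root in lindex.get((m, "Member"), []):  # M must be a Member of root
                offices.extend(c for p, lab, c in edges if p == m and lab == "Office")
    return offices


# Invented data: oid 1 plays the role of DBGroup.
edges = [
    (1, "Member", 2), (1, "Member", 3),
    (2, "Age", 9), (2, "Office", 10),
    (3, "Age", 13), (3, "Office", 11),
]
values = {9: 46, 13: 28, 10: "Gates 252", 11: "CIS 411"}
print(offices_of_young_members(edges, values, 1))
```

Whether this beats the top-down plan depends on selectivity: bottom-up pays off when few Age values satisfy the predicate compared with the number of members.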

Coercion for basic operators
Arg1 \ Arg2   String           Real             Int
String        -                String -> Real   both -> Real
Real          String -> Real   -                Int -> Real
Int           both -> Real     Int -> Real      -