Lore: A Database Management System for Semistructured Data.

Why? Although data may exhibit some structure it may be too varied or irregular to map to a fixed schema. –Relational DBMS might use null values in this case. May be difficult to decide in advance on a specific schema. –Data elements may change types. –Structure changes a lot (lots of schema modifications).

Semistructured Data Examples: –Data from the web. Overall site structure may change often. It would be nice to be able to query a web site. –Data integrated from multiple, heterogeneous data sources. Information sources change, or new sources are added.

Object Exchange Model (OEM) Data in this model can be thought of as a labeled directed graph. –Schema-less and self-describing. Vertices in the graph are objects. –Each object has a unique object identifier (oid), such as &5. –Atomic objects have no outgoing edges and hold a value of a type such as int, real, string, gif, java, etc. –All other objects, which have outgoing edges, are called complex objects.

OEM (Cont.) Examples: –Object &3 is complex, and its subobjects are &8, &9, &10, and &11. –Object &7 is atomic and has value “Clark”. DBGroup is a name that denotes object &1. (Names are entry points into the database).
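The OEM description above can be sketched as a small in-memory graph. This is only an illustration: the dictionary layout and helper names are assumptions, and apart from the facts stated on the slide (&3 has subobjects &8 to &11, &7 is "Clark", DBGroup names &1) the labels and values are made up.

```python
# Minimal OEM-style store: a labeled directed graph keyed by oid.
# Hypothetical data: only the subobjects of &3, the value of &7 and the
# DBGroup name come from the slides; every label and other value is invented.
atoms = {7: "Clark", 8: "Jones", 9: 46, 10: "Gates 252", 11: "Lore"}
edges = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Name", 7)],
    3: [("Name", 8), ("Age", 9), ("Office", 10), ("Project", 11)],
}
names = {"DBGroup": 1}  # names are entry points into the database


def is_atomic(oid):
    """Atomic objects carry a value and have no outgoing edges."""
    return oid in atoms


def subobjects(oid):
    """Return the oids reachable from oid over one edge, in stored order."""
    return [child for _label, child in edges.get(oid, [])]
```

Keeping values and edges in separate maps makes the atomic/complex distinction a simple membership test.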

OEM to XML Example: – Jones 46 gates 252 This corresponds to rightmost member in the example OEM, where project is an attribute.

Lorel Query Language Need a query language that supports path expressions for traversing graph data and handles ‘typeless’ data. A simple path expression is a name followed by a sequence of labels. –DBGroup.Member.Office. –The set of objects that can be reached starting from the DBGroup object, following edges labeled Member and then Office.

Lorel (cont.) Example: –select DBGroup.Member.Office where DBGroup.Member.Age < 30 Result: –Office “Gates 252” –Office Building “CIS” Room “411”
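A simple path expression can be evaluated by following one label at a time from the named object. The sketch below assumes a hypothetical adjacency-list graph and a helper called eval_path; it is not Lore's implementation.

```python
# Hypothetical graph: oid -> list of (label, child oid).
edges = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Office", 10)],
    3: [("Office", 11)],
    11: [("Building", 12), ("Room", 13)],
}
names = {"DBGroup": 1}


def eval_path(path):
    """Return the set of oids reachable via a path such as 'DBGroup.Member.Office'."""
    name, *labels = path.split(".")
    frontier = {names[name]}
    for label in labels:
        # Keep only children reached over an edge carrying the current label.
        frontier = {
            child
            for oid in frontier
            for edge_label, child in edges.get(oid, [])
            if edge_label == label
        }
    return frontier
```

The result is a set, so an Office that is atomic (10) and one that is complex (11) are returned side by side, as in the slide's example result.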

Lorel Query Rewrite Previous query rewritten to: –select O from DBGroup.Member M, M.Office O where exists y in M.Age : y < 30 Comparison on age transformed to existential condition. –Since all properties are set-valued in OEM. –A user can ask DBGroup.Member.Age < 30 regardless of whether Age is single valued, set valued, or unknown.

Lorel Query Rewrite Why? –Breaking query into simple path expressions necessary for query optimization. –Need to explicitly handle coercion. Atomic objects and values. 0.5 < “0.9” should return true Comparing objects and sets of objects. DBGroup.Member.Age is a set of objects.
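The two points above, existential comparison over a set and coercion between atomic values, can be sketched together. The coercion rule used here (try to read strings as reals, treat incomparable values as false) is an assumption for illustration, not a statement of Lorel's exact rules.

```python
def coerce(value):
    """Turn numbers and numeric strings into floats; leave other values alone."""
    if isinstance(value, (int, float)):
        return float(value)
    try:
        return float(value)
    except (TypeError, ValueError):
        return value


def less_than(a, b):
    """Compare after coercion; mismatched types yield False rather than an error."""
    a, b = coerce(a), coerce(b)
    if isinstance(a, float) and isinstance(b, float):
        return a < b
    if isinstance(a, str) and isinstance(b, str):
        return a < b
    return False


def exists_less(values, bound):
    """'exists y in values : y < bound' over a set-valued property."""
    return any(less_than(y, bound) for y in values)
```

With this shape, a predicate over Age behaves the same whether Age holds zero, one, or many values.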

Lorel (cont.) General path expressions are loosely specified patterns for labels in the database. (‘|’ disjunction, ‘?’ label pattern optional) Example: –select DBGroup.Member.Name where DBGroup.Member.Office(.Room%|.Cubicle)? like “%252” Result: –Name “Jones” Name “Smith”

Query and Update Processing Query is parsed. Parse tree is preprocessed and translated to a new OQL-like query. Query plan is constructed. Query optimization. Optimized query plan is executed.

System Architecture

Iterators and Object Assignments Use recursive iterator approach: –execution begins at top of query plan –each node in the plan requests a tuple at a time from its children and performs some operation on the tuple(s). –pass result tuples up to parent.
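The pull-based iterator model maps naturally onto Python generators. The operators and sample rows below are hypothetical; the sketch only shows that a parent requests one tuple at a time from its child.

```python
pulled = []  # records which rows the leaf has actually produced


def source(rows):
    """Leaf operator: hands out one row per request."""
    for row in rows:
        pulled.append(row["name"])
        yield row


def select(child, predicate):
    """Passes up only the rows that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row


def project(child, field):
    """Passes up a single field of each row."""
    for row in child:
        yield row[field]


rows = [
    {"name": "Smith", "age": 28},
    {"name": "Jones", "age": 46},
    {"name": "Clark", "age": 25},
]
plan = project(select(source(rows), lambda r: r["age"] < 30), "name")

first = next(plan)                # execution starts at the top of the plan
pulled_after_first = list(pulled)  # only one row has left the leaf so far
rest = list(plan)                 # drain the remaining results
```

Nothing is materialized between operators, which is the main appeal of the approach.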

Object Assignments (OAs) An OA is a data structure containing slots for range variables, with additional slots depending on the query. Each slot within an OA holds the oid of a vertex on a path being considered by the query engine. Example: if OA1 holds the oid for “Smith”, then OA2 and OA3 can hold the oids for one of Smith's Office objects and Age objects.

Query Operators For example, the Scan operator returns all oids that are subobjects of a given object following a specified path expression. –Scan (StartingOASlot, Path_expression, TargetOASlot) For each oid in StartingOASlot, check to see if object satisfies path_expression and place oid into TargetOASlot. Other operators include Join, Project, Select, Aggregation, etc. Join node like nested-loop join in relational DBMS.
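A Scan over object assignments can be sketched as a generator that fills one slot from another. This is a simplification under stated assumptions: the path expression is reduced to a single label, an OA is a plain dict, and the graph is hypothetical.

```python
# Hypothetical graph: oid -> list of (label, child oid).
edges = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Office", 10)],
    3: [("Office", 11), ("Office", 12)],
}


def root(oid):
    """Produce the initial OA with the named object bound."""
    yield {"DBGroup": oid}


def scan(child, start_slot, label, target_slot):
    """For each incoming OA, bind target_slot to every label-child of start_slot."""
    for oa in child:
        for edge_label, oid in edges.get(oa[start_slot], []):
            if edge_label == label:
                yield {**oa, target_slot: oid}


# from DBGroup.Member M, M.Office O
plan = scan(scan(root(1), "DBGroup", "Member", "M"), "M", "Office", "O")
results = [(oa["M"], oa["O"]) for oa in plan]
```

Stacking two scans this way behaves like a nested-loop join: the outer scan binds M and the inner scan is re-evaluated for each binding.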

Query Optimization Does only a few optimizations: –Push selection operators down the query tree. –Eliminate or combine redundant query operators. Explores query plans that use indexes where possible. Two kinds of indexes: –Lindex (link index) provides parent pointers, implemented with hashing. –Vindex (value index) is implemented as B+-trees.

Indexes Because of the non-strict typing system, there are a String Vindex, a Real Vindex, and a String-coerced-to-real Vindex. Separate B+-trees are constructed for each type. When using a Vindex for a comparison (e.g. Age < 30), consider the following: –If the type is string, do a lookup in the String Vindex. –If it can be converted to real, then do a lookup in the String-coerced-to-real Vindex. –If the type is real?
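One plausible reading of the real-valued case is to consult both the Real Vindex and the String-coerced-to-real Vindex and union the results. That reading is an assumption, not something the slide states, and the sketch below uses sorted lists with bisect as a stand-in for B+-trees.

```python
import bisect


class SortedIndex:
    """Stand-in for a B+-tree: sorted (key, oid) pairs with a range lookup."""

    def __init__(self, pairs):
        self.pairs = sorted(pairs)
        self.keys = [key for key, _oid in self.pairs]

    def less_than(self, value):
        cut = bisect.bisect_left(self.keys, value)
        return {oid for _key, oid in self.pairs[:cut]}


def build_vindexes(atoms):
    """Split atomic values (oid -> value) into the three per-type indexes."""
    strings, reals, coerced = [], [], []
    for oid, value in atoms.items():
        if isinstance(value, str):
            strings.append((value, oid))
            try:
                coerced.append((float(value), oid))
            except ValueError:
                pass  # not numeric: only the String Vindex holds it
        else:
            reals.append((float(value), oid))
    return SortedIndex(strings), SortedIndex(reals), SortedIndex(coerced)


def real_less_than(atoms, bound):
    """Assumed handling of a real comparison value: check both numeric indexes."""
    _strings, reals, coerced = build_vindexes(atoms)
    return reals.less_than(float(bound)) | coerced.less_than(float(bound))


# Hypothetical Age values: a mix of numbers and strings.
ages = {9: 46, 13: "28", 20: "unknown", 21: 25.5}
young = real_less_than(ages, 30)
```

The index is rebuilt on every call only to keep the sketch short; a real system would build it once and maintain it on updates.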

Other issues Update query operator example: –Update(Create_Edge, OA1, OA5, “Member”) –Create edge from results in OA1 to OA5 labeled “Member”. Lore arranges objects in physical disk pages, each page with a number of slots with a single object in each slot. –Objects placed according to first-fit algorithm. –Supports large objects spanning multiple pages. –Objects clustered in depth-first manner (since Scan traverses depth-first). –Garbage collector removes unreachable objects.
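First-fit placement, mentioned above, can be sketched in a few lines. The function name, byte sizes and page capacity are hypothetical, and objects larger than a page (which Lore supports by spanning pages) are not modeled.

```python
def first_fit(sizes, page_capacity):
    """Place each object on the first page with enough free space.

    Returns the page number chosen for each object, in input order.
    Assumes every size is at most page_capacity.
    """
    free = []       # free[i] is the space still available on page i
    placement = []
    for size in sizes:
        for page, space in enumerate(free):
            if size <= space:
                free[page] -= size
                placement.append(page)
                break
        else:
            # No existing page fits: open a new one.
            free.append(page_capacity - size)
            placement.append(len(free) - 1)
    return placement


pages = first_fit([60, 50, 30, 40], page_capacity=100)
```

Here the third object goes back to page 0 because it is the first page with 30 units free, even though page 1 would also fit.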

External Data Manager Enables retrieval of information from other data sources, transparent to the user. An external object in Lore is a “placeholder” for the external data and specifies how Lore interacts with an external data source. The spec for an external object includes: –Location of a wrapper program to fetch and convert data to OEM, time interval until fetched information becomes stale, and a set of arguments used to limit the information fetched from the external source.

DataGuides A DataGuide is a concise and accurate summary of the structure of an OEM database (stored as an OEM database itself, somewhat like the system catalog). Why? –With no explicit schema, how do we formulate meaningful queries? –Large databases (can’t just view the graph structure). –If a path expression doesn’t exist in the database, evaluating it is wasted work. Each possible path expression is encoded once.

DataGuides As Histograms Each object in the dataguide can have a link to its corresponding target set. –A target set is a set of oids reachable by that path. TS of DBGroup.Member.Age is {9, 13}. –This is a path index. Can find set of objects reachable by a particular path. –Can store statistics in DataGuide (more in next paper). For example, the # of atomic objects of each type reachable by p.
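A DataGuide with target sets can be sketched as a subset construction over the data graph: each guide node is the set of oids reachable by some label path. The graph below is hypothetical apart from the target set {9, 13} quoted on the slide, and the function names are assumptions.

```python
# Hypothetical graph rooted at the DBGroup object (oid 1).
edges = {
    1: [("Member", 3), ("Member", 4)],
    3: [("Age", 9), ("Name", 8)],
    4: [("Age", 13)],
}


def build_dataguide(root):
    """Map each target set to {label: child target set}; each label path appears once."""
    start = frozenset([root])
    guide = {start: {}}
    work = [start]
    while work:
        target = work.pop()
        by_label = {}
        for oid in target:
            for label, child in edges.get(oid, []):
                by_label.setdefault(label, set()).add(child)
        for label, children in by_label.items():
            child_target = frozenset(children)
            guide[target][label] = child_target
            if child_target not in guide:  # also guarantees termination on cycles
                guide[child_target] = {}
                work.append(child_target)
    return start, guide


def target_set(start, guide, labels):
    """Follow labels through the guide; an empty set means the path does not exist."""
    current = start
    for label in labels:
        current = guide[current].get(label)
        if current is None:
            return frozenset()
    return current


start, guide = build_dataguide(1)
```

Looking up a path in the guide touches one node per label, so a nonexistent path is rejected without scanning the data.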

Conclusions Takes advantage of structure where it exists. Handles lack of structure well (data type coercion, general path expressions). Query language allows users to get and update data from semistructured sources. –DataGuide allows users to determine what paths exist, and gives useful statistical information.

Query Optimization for Semistructured Data

OEM vs. XML OEM’s objects correspond to elements in XML. Sub-elements in XML are inherently ordered. XML elements may optionally include a list of attribute-value pairs. Graph structure with multiple incoming edges is specified in XML with references (ID, IDREF attributes), e.g. the Project attribute.

Indexes Vindex(l, op, value, x) places into x all atomic objects that satisfy the “op value” condition and have an incoming edge labeled l. –Vindex(“Age”, <, 30, y) places into y objects with age < 30. Lindex(x, l, y) places into x all objects that are parents of y via an edge labeled l. –Lindex(x, “Age”, y) places into x all parents of y via label “Age”.

Indexes (cont.) Bindex(l, x, y) finds all parent-child object pairs connected by a label l. –Bindex(“Age”, x, y) locates all parent-child pairs with label Age. Pindex(PathExpression, x) places into x all objects reachable via the path expression. –Pindex(“A.B x, x.C y”, y) places into y all objects reachable by going from A to B to C. –Uses DataGuide.
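The link and edge indexes can be sketched as hash maps built in one pass over the edges. The data layout and the name build_indexes are assumptions; the sketch only mirrors the lookups described on the slides.

```python
from collections import defaultdict

# Hypothetical graph: oid -> list of (label, child oid).
edges = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Age", 9)],
    3: [("Age", 13)],
}


def build_indexes(graph):
    """Build an Lindex-like and a Bindex-like structure from the edge list."""
    lindex = defaultdict(set)  # (child oid, label) -> parent oids
    bindex = defaultdict(set)  # label -> (parent oid, child oid) pairs
    for parent, out_edges in graph.items():
        for label, child in out_edges:
            lindex[(child, label)].add(parent)
            bindex[label].add((parent, child))
    return lindex, bindex


lindex, bindex = build_indexes(edges)
parents_of_9 = lindex[(9, "Age")]  # like Lindex(x, "Age", y) with y bound to 9
age_pairs = bindex["Age"]          # like Bindex("Age", x, y)
```

The Lindex is what makes bottom-up plans possible, since OEM edges are otherwise only navigable from parent to child.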

Simple Query select O from DBGroup.Member M, M.Office O where exists y in M.Age : y < 30 Possible plans: –Top-down (similar to pointer-chasing, nested-loops join) –Use Vindex to check y < 30, traverse backwards from child to parent using Lindex (bottom-up). –Hybrid, both top down and bottom up. Meet in middle.
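The top-down and bottom-up strategies for this query can be compared on a toy graph. The data and helper functions are hypothetical, and the index lookups are simulated with scans; the point is only that both orders produce the same answer while touching the data in a different sequence.

```python
# Hypothetical data: two members, one of them younger than 30.
atoms = {20: 25, 21: 46, 30: "Gates 252", 31: "Gates 100"}
edges = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Age", 20), ("Office", 30)],
    3: [("Age", 21), ("Office", 31)],
}
ROOT = 1  # the object named DBGroup


def children(oid, label):
    return [c for l, c in edges.get(oid, []) if l == label]


def parents(oid, label):
    # Stand-in for an Lindex lookup; a real system would not scan all edges.
    return [p for p, out in edges.items() for l, c in out if l == label and c == oid]


def top_down():
    """Pointer chasing from DBGroup: bind M, test Age, then fetch Office."""
    result = set()
    for m in children(ROOT, "Member"):
        if any(atoms[y] < 30 for y in children(m, "Age")):
            result.update(children(m, "Office"))
    return result


def bottom_up():
    """Start from Age values < 30, climb to the member, verify it hangs off DBGroup."""
    result = set()
    # Stand-in for a Vindex lookup on label "Age" with condition < 30.
    young = [y for _p, out in edges.items() for l, y in out
             if l == "Age" and isinstance(atoms.get(y), (int, float)) and atoms[y] < 30]
    for y in young:
        for m in parents(y, "Age"):
            if ROOT in parents(m, "Member"):
                result.update(children(m, "Office"))
    return result
```

When the predicate is selective, the bottom-up plan visits far fewer members; when it is not, top-down avoids the extra parent lookups.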

Select x From A.B x Where exists y in x.C: y = 5

Query Plan Generation (Overview) The logical query plan generator creates a high-level execution strategy. The physical query plan enumerator uses statistics and a cost model to transform the logical query plan into the estimated best physical plan within its search space.

Logical Query Plans (cont.) Glue node represents a ‘rotation point’ that has as its children two independent subplans. –Rotating the order between independent components yields different plans. –Marks place where execution order is not fixed. Discover node chooses best way to bind variables x and y. Chain node chooses best evaluation of a path expression.

Logical query plan for: Select x From DBGroup.Member x Where exists y in x.Age: y<30 [Figure: logical plan tree with one subplan for the from clause and one for the where clause]

Physical Query Plans

Physical Query Plans (cont.) Scan(x, l, y) places into y all objects that are subobjects of x via an edge labeled l. –Top-down (pointer chasing). Lindex plan is the bottom-up approach. Bindex: locates edges whose label appears infrequently in the database. NLJ: left subplan passes variables to right subplan.

Statistics I/O metric uses estimated # of objects fetched. For every label subpath p of length <= k:
–# of atomic objects of each type reachable by p
–Min and max values of all atomic objects of each type reachable by p
–# of instances of path p, denoted |p|
–# of distinct objects reachable by p, denoted |p|_d
–# of l-labeled subobjects of all objects reachable by p
–# of incoming l-labeled edges to any instance of p, denoted |p^l|
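Two of the statistics above, the number of instances of a path and the number of distinct objects it reaches, differ whenever objects are shared. The sketch below computes both for a hypothetical graph by propagating instance counts label by label; the function name is an assumption.

```python
from collections import Counter

# Hypothetical graph in which two members share one office object.
edges = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Office", 10)],
    3: [("Office", 10)],
}


def path_stats(root, labels):
    """Return (instances of the path, distinct objects reached) from root."""
    counts = Counter({root: 1})  # oid -> number of path instances ending there
    for label in labels:
        reached = Counter()
        for oid, n in counts.items():
            for edge_label, child in edges.get(oid, []):
                if edge_label == label:
                    reached[child] += n
        counts = reached
    return sum(counts.values()), len(counts)


instances, distinct = path_stats(1, ["Member", "Office"])
```

Here there are two instances of Member.Office but only one distinct object, which is the kind of gap a cost model needs to know about when estimating objects fetched.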

Plan Enumeration Doesn’t consider joining two simple path expressions together unless they share a common variable. Pindex is used only when path expression begins with a name and no variable except the last is used in the query. Select clause always executes last. Doesn’t try to reorder multiple independent path expressions.

Results Used XML database about movies. Database graph contained 62,256 nodes and 130,402 edges. Experiment 1: Select DB.Movie.Title –Best plan is Pindex, followed by top-down –Worst plan is Bindex, with hash joins.

Results (cont.) Experiment 2: All Movies with a Genre of “Comedy” –Where clause is very selective, bottom-up does a Vindex for “Comedy” with incoming edge Genre

Results (cont.) Experiment 3: Query with two existentially quantified variables in the where clause. Errors due to bad estimates of atomic value distributions and set operation costs.

Results (cont.) Experiment 4: Select movies with certain quality rating. Quality ratings uncommon in database so optimizer chooses to find all ratings via Bindex, and then work bottom-up.

Conclusions Cost estimates are accurate and select the best plan most of the time. Execution times of the best and worst plans for a given query can differ by many orders of magnitude. The best strategy is highly dependent upon the query and the database (query optimization pays off for XML data).