Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach Charles Yu Database Architect Elance Inc. Elance Inc.

Slides:



Advertisements
Similar presentations
CpSc 3220 File and Database Processing Lecture 17 Indexed Files.
Advertisements

The Hierarchical Model
Hierarchies & Trees in SQL by Joe Celko copyright 2008.
TREES Chapter 6. Trees - Introduction  All previous data organizations we've studied are linear—each element can have only one predecessor and successor.
Database Systems: Design, Implementation, and Management Tenth Edition
Introduction to Structured Query Language (SQL)
Introduction to Structured Query Language (SQL)
1 Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Amnon Shochot.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part B Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Storing and Querying Ordered XML Using a Relational Database System By Khang Nguyen Based on the paper of Igor Tatarinov and Statis Viglas.
1 Indexing Structures for Files. 2 Basic Concepts  Indexing mechanisms used to speed up access to desired data without having to scan entire.
XML –Query Languages, Extracting from Relational Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria.
Introduction to Structured Query Language (SQL)
Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.
P2P Course, Structured systems 1 Introduction (26/10/05)
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 7 Introduction to Structured Query Language (SQL)
Chapter 8 Physical Database Design. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Outline Overview of Physical Database.
1 CS 728 Advanced Database Systems Chapter 17 Database File Indexing Techniques, B- Trees, and B + -Trees.
CS4432: Database Systems II
Binary Trees Chapter 6.
IST Databases and DBMSs Todd S. Bacastow January 2005.
Database Management 8. course. Query types Equality query – Each field has to be equal to a constant Range query – Not all the fields have to be equal.
AL-MAAREFA COLLEGE FOR SCIENCE AND TECHNOLOGY INFO 232: DATABASE SYSTEMS CHAPTER 7 INTRODUCTION TO STRUCTURED QUERY LANGUAGE (SQL) Instructor Ms. Arwa.
The main mathematical concepts that are used in this research are presented in this section. Definition 1: XML tree is composed of many subtrees of different.
MIS 301 Information Systems in Organizations Dave Salisbury ( )
CODD’s 12 RULES OF RELATIONAL DATABASE
Lecture 10 Trees –Definiton of trees –Uses of trees –Operations on a tree.
1 Index Structures. 2 Chapter : Objectives Types of Single-level Ordered Indexes Primary Indexes Clustering Indexes Secondary Indexes Multilevel Indexes.
Chapter 6 Binary Trees. 6.1 Trees, Binary Trees, and Binary Search Trees Linked lists usually are more flexible than arrays, but it is difficult to use.
7 1 Chapter 7 Introduction to Structured Query Language (SQL) Database Systems: Design, Implementation, and Management, Seventh Edition, Rob and Coronel.
12.1 Chapter 12: Indexing and Hashing Spring 2009 Sections , , Problems , 12.7, 12.8, 12.13, 12.15,
6 1 Lecture 8: Introduction to Structured Query Language (SQL) J. S. Chou, P.E., Ph.D.
Starting at Binary Trees
Database Systems Design, Implementation, and Management Coronel | Morris 11e ©2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or.
Indexing and Hashing By Dr.S.Sridhar, Ph.D.(JNUD), RACI(Paris, NICE), RMR(USA), RZFM(Germany) DIRECTOR ARUNAI ENGINEERING COLLEGE TIRUVANNAMALAI.
Database Fundamental & Design by A.Surasit Samaisut Copyrights : All Rights Reserved.
XML and Database.
AL-MAAREFA COLLEGE FOR SCIENCE AND TECHNOLOGY INFO 232: DATABASE SYSTEMS CHAPTER 7 (Part II) INTRODUCTION TO STRUCTURED QUERY LANGUAGE (SQL) Instructor.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
1 CPS216: Advanced Database Systems Notes 05: Operators for Data Access (contd.) Shivnath Babu.
Chapter 8 Physical Database Design. Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization.
Binary Search Trees (BSTs) 18 February Binary Search Tree (BST) An important special kind of binary tree is the BST Each node stores some information.
Rooted Tree a b d ef i j g h c k root parent node (self) child descendent leaf (no children) e, i, k, g, h are leaves internal node (not a leaf) sibling.
Indexing Database Management Systems. Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files File Organization 2.
CS 440 Database Management Systems Lecture 6: Data storage & access methods 1.
Hierarchical Retrieval Fresher Learning Program December, 2011.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Content based on Chapter 10 Database Management Systems, (3 rd.
7 1 Database Systems: Design, Implementation, & Management, 7 th Edition, Rob & Coronel 7.6 Advanced Select Queries SQL provides useful functions that.
CS4432: Database Systems II
LM 5 Introduction to SQL MISM 4135 Instructor: Dr. Lei Li.
CS 405G: Introduction to Database Systems 12. Index.
Chapter 11 Indexing And Hashing (1) Yonsei University 1 st Semester, 2016 Sanghyun Park.
SQL Basics Review Reviewing what we’ve learned so far…….
Database Applications (15-415) DBMS Internals- Part III Lecture 13, March 06, 2016 Mohammad Hammoud.
10/3/2017 Chapter 6 Index Structures.
Indexing Structures for Files and Physical Database Design
Index An index is a performance-tuning method of allowing faster retrieval of records. An index creates an entry for each value that appears in the indexed.
CS 540 Database Management Systems
Indexing and hashing.
Multiway Search Trees Data may not fit into main memory
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
Database.
File Systems and Databases
B+-Trees and Static Hashing
Indexing and Hashing Basic Concepts Ordered Indices
B+Trees The slides for this text are organized into chapters. This lecture covers Chapter 9. Chapter 1: Introduction to Database Systems Chapter 2: The.
Data Model.
CPS216: Advanced Database Systems
Chapter 17 Designing Databases
Tree-Structured Indexes
Presentation transcript:

Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach Charles Yu Database Architect Elance Inc. Elance Inc

Background  Node-Uniform Hierarchical (NUH for short) data can be visualized as a tree or forest graph where every node has the same set of attributes.  NUH data can be naturally represented in RDBMS by recursive tables where the parent-child relationship is implemented in a way that if record x is a child of record y, then the value of x’s parent_id column is the same as the value of the id column of y’s.  Standard SQL does not support for general query on NUH data in basic recursive tables.  Oracle comes with a native mechanism for general query on NUH data and beyond, known as connect-by. For all its elegancy and usefulness, it is short in two accounts: it is slow in cases due to the fact that the parent-child relationship cannot be directly indexed; and it is Oracle dependent in that SQL queries using connect-by cannot be easily adapted to RDBMS of other vendors.

Basic Recursive Table Design Basic Columns  Xid--system assigned unique id  Parent_xid--xid of parent of this entry  Entry_code--content unique identifier of the entry  Normal_stuff--one or more such columns for content values Some variant: Use a separate table to store the hierarchical relationship consisting essentially of two columns: xid/child_xid and parent_xid; and use FK to link the table to the main data table

Basic Recursive Table Query Mechanisms  Oracle-native Connect by  K-way self outer join (for up to level k depth)  Other??

Basic Idea of Path Code Approach  A node of a tree is fully determined by the path from the root to itself.  Path code as full representation of the path can be very compact in length, in the order of logarithmic of total size of the tree.  Path Code can be maintained dynamically feasibly.  Path code permits direct indexing.

Path-code enhanced recursive table design Basic Columns  xid  parent_xid  path_code--code of the path for the node (detail later)  entry_level--level of the record in the tree the entry belongs to  sibling_no--sequence no of the child entry_code with respect to the parent  is_leaf --1/0 for being a leaf/not a leaf  Entry_code --content unique identifier of the entry  normal_stuff --one or more

Value Setting for H columns (I)  Parent_xidset as usual  Sibling_no can be set according to any ordering, e.g. according to entry_code, starting at 1 for each parent; the sibling_no of root entries are set as if those roots were children of a super root;  Entry_level can be set from top down, having entry_level=0 for all root entries; and X.entry_level=k+1 if X has parent Y and Y.entry_level=k;  Is_leaf =0/1 if there is child of the node/not so

Value Setting for H columns (II)  Path_code –for root entries X: X.path_code = to_char(X.sibling_no, ‘00’) –for non-root entries X with X.parent_xid=Y.xid: X.path_code = Y.path_code||to_char(X.sibling_no,’00’) Path_code of a node N at level k has k+1 sections; level j section is left-zero padded string conversion of sibling_no of N or N’s parent at level j; For convenience, the last (rightmost) section is called the base section, the concatenation of all the non-last sections is called the ancestor section.

Example of H-value setting LOGO EC for entry_code EL for entry_level XID for xid PC for path_code Explanation Path_code is in the uniform format Path_code order is based on entry_code order but not on XID order. It could be otherwise. Path_code of a child is the path_code of its parent plus its base section code. Sibling_no is not shown but assumed to be in accordance with entry_code. Entry_code and xid value settings can be independent of each other. parent_xid, sibling_no, is_leaf and other fields are not shown. format assumption Node uniform (see next) Section length =2 String expression in format (Same assumption for later code examples)

Variants of path_code pattern (advanced topic)  node uniform: every section of all path codes has equal length (a simplest; and it is used in the previous example)  Level uniform: every section of the same level of all path_codes has equal length  Parent uniform: every child node of any parent node has equal path_code length  Dot (or delimiter) uniform: use the same delimiter character (e.g. dot) to separate all sections of all path_codes  Min uniform: the length of base section of the path_code is always maintained to be minimum  String/Binary/hex/ in expression and interpretation, sorting relevant, etc.  Sparse uniform: path_code sections each allows more values than actually and currently needed, for easing subsequent node insertions.

Query Patterns  Get all children of a parent P select * from T where path_code like P.path_code||’%’  Get all ancestors if a child C select * from T where C.path_code like path_code||’%’  Get all siblings of a node N s elect * from T where parent_xid = N.parent_xid

DML Patterns (insert at end)  (Insert record with path_code and sibling_no as null) insert into T(xid,parent_xid,entry_level, entry_code, normal_stuff) values c.xid,p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff;  (Update sibling_no) update T set sibling_no = (select max(sibling_no)+1 from T where parent_xid = p.xid) where xid = c.xid;  (Update path_code) update T set path_code = p.path_code || to_char(sibling_no, '00') where xid = c.xid;  (reset is_leaf for p, detail omit)

DML Patterns (insert in middle)  (Insert record with path_code and sibling_no as null) insert into T(xid,parent_xid,entry_level, entry_code, normal_stuff) values c.xid,p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff;  (Update sibling_no for those siblings elder than c) update T set sibling_no = sibling_no + 1 where parent_xid = p.xid and entry_code >c.entry_code;  (Update sibling_no for c) update T set sibling_no = (select max(sibling_no)+1 from T where parent_xid = p.xid and entry_code < c.entry_code;  (Update path_code for c) update T set path_code = p.path_code || to_char(sibling_no, '00') where xid = c.xid;  (Update path_code for those siblings elder than c and all decendents of those elder siblings, pcs_length stands for path_code section length) update T set path_code = substr(path_code,1, pcs_length*entry_level) || to_char(sibling_no, '00')||substr(path_code, pcs_length*(entry_level+1)+1) where path_code like p.path_code||’%’ and path_code > (select path_code from T where xid = c.xid)

DML patterns (delete)  (Delete node C and all its decendents) delete from T where path_code like C.path_code||’%’;  (Shifting sibling_no for those siblings elder than C) Update T set sibling_no = sibling_no -1 where parent_xid = P.parent_id and sibling_no >C.sibling_no;  (Shifting path_code for those siblings elder than C and their decendents) Update T set path_code = substr(path_code,1, pcs_length*C.entry_level) || to_char(sibling_no, '00')||substr(path_code, pcs_length*(C.entry_level+1)+1) where path_code like P.path_code||’%’ and path_code > C.path_code;  (reset is_leaf of the parent P of C according to whether P has other children)

Complexity Analysis  Space –length of path_code increases in order logarithmic of the total number of rows in the table (for non-degenerated hierarchical data). –e.g. length of c*20 vs 1M rows  Time – queries execute very much based on index range scan, usually the fastest available. –Inserts/deletes may involve sub-tree processing for delete or update.

Comparison with Connect-By categoryConnect-By Path code approach Data pattern General directed graph Strictly hierarchical Table design Recursive basic Recursive basic + extra H-columns Disk space cost Minimum Overhead of up to \log n Time Efficiency* Incapable of direct indexing or unknown Capable of direct indexing RDBMS Vendor independence* NoYes Query complexity* It depends DML complexity Minimum Substantial but in a small set of patterns

A stretched idea on RDBMS design  Make entry_id and parent_entry_id relationship declarative;  Enforce hierarchy constraint to the effect that each node can only have zero or one parent node;  Create and maintain path_code, entry_level, etc by RDBMS like creating and maintaining functional indexes;  Add syntax to SQL similar to Oracle’s connect-by, but with the extra of taking advantage of the hidden indexes.

Additional Questions and References  Whether and how to generalize the design for node non-uniform hierarchical data?  To see latest alternative approaches, e.g. David.html (Using ANSI SQL as a Conceptual Hierarchical Data Modeling and Processing Language for XML, by Michael M David) David.html David.html