Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach Charles Yu Database Architect Elance Inc. Elance Inc.

Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach Charles Yu Database Architect Elance Inc. Elance Inc. cyu@elance.com charles.yu@acm.org 2005-08

Background  Node-Uniform Hierarchical (NUH for short) data can be visualized as a tree or forest graph where every node has the same set of attributes.  NUH data can be naturally represented in RDBMS by recursive tables where the parent-child relationship is implemented in a way that if record x is a child of record y, then the value of x’s parent_id column is the same as the value of the id column of y’s.  Standard SQL does not support for general query on NUH data in basic recursive tables.  Oracle comes with a native mechanism for general query on NUH data and beyond, known as connect-by. For all its elegancy and usefulness, it is short in two accounts: it is slow in cases due to the fact that the parent-child relationship cannot be directly indexed; and it is Oracle dependent in that SQL queries using connect-by cannot be easily adapted to RDBMS of other vendors.

Basic Recursive Table Design Basic Columns  Xid--system assigned unique id  Parent_xid--xid of parent of this entry  Entry_code--content unique identifier of the entry  Normal_stuff--one or more such columns for content values Some variant: Use a separate table to store the hierarchical relationship consisting essentially of two columns: xid/child_xid and parent_xid; and use FK to link the table to the main data table

Basic Recursive Table Query Mechanisms  Oracle-native Connect by  K-way self outer join (for up to level k depth)  Other??

Basic Idea of Path Code Approach  A node of a tree is fully determined by the path from the root to itself.  Path code as full representation of the path can be very compact in length, in the order of logarithmic of total size of the tree.  Path Code can be maintained dynamically feasibly.  Path code permits direct indexing.

Path-code enhanced recursive table design Basic Columns  xid  parent_xid  path_code--code of the path for the node (detail later)  entry_level--level of the record in the tree the entry belongs to  sibling_no--sequence no of the child entry_code with respect to the parent  is_leaf --1/0 for being a leaf/not a leaf  Entry_code --content unique identifier of the entry  normal_stuff --one or more

Value Setting for H columns (I)  Parent_xidset as usual  Sibling_no can be set according to any ordering, e.g. according to entry_code, starting at 1 for each parent; the sibling_no of root entries are set as if those roots were children of a super root;  Entry_level can be set from top down, having entry_level=0 for all root entries; and X.entry_level=k+1 if X has parent Y and Y.entry_level=k;  Is_leaf =0/1 if there is child of the node/not so

Value Setting for H columns (II)  Path_code –for root entries X: X.path_code = to_char(X.sibling_no, ‘00’) –for non-root entries X with X.parent_xid=Y.xid: X.path_code = Y.path_code||to_char(X.sibling_no,’00’) Path_code of a node N at level k has k+1 sections; level j section is left-zero padded string conversion of sibling_no of N or N’s parent at level j; For convenience, the last (rightmost) section is called the base section, the concatenation of all the non-last sections is called the ancestor section.

Example of H-value setting LOGO EC for entry_code EL for entry_level XID for xid PC for path_code Explanation Path_code is in the uniform format Path_code order is based on entry_code order but not on XID order. It could be otherwise. Path_code of a child is the path_code of its parent plus its base section code. Sibling_no is not shown but assumed to be in accordance with entry_code. Entry_code and xid value settings can be independent of each other. parent_xid, sibling_no, is_leaf and other fields are not shown. format assumption Node uniform (see next) Section length =2 String expression in format (Same assumption for later code examples)

Variants of path_code pattern (advanced topic)  node uniform: every section of all path codes has equal length (a simplest; and it is used in the previous example)  Level uniform: every section of the same level of all path_codes has equal length  Parent uniform: every child node of any parent node has equal path_code length  Dot (or delimiter) uniform: use the same delimiter character (e.g. dot) to separate all sections of all path_codes  Min uniform: the length of base section of the path_code is always maintained to be minimum  String/Binary/hex/ in expression and interpretation, sorting relevant, etc.  Sparse uniform: path_code sections each allows more values than actually and currently needed, for easing subsequent node insertions.

Query Patterns  Get all children of a parent P select * from T where path_code like P.path_code||’%’  Get all ancestors if a child C select * from T where C.path_code like path_code||’%’  Get all siblings of a node N s elect * from T where parent_xid = N.parent_xid

DML Patterns (insert at end)  (Insert record with path_code and sibling_no as null) insert into T(xid,parent_xid,entry_level, entry_code, normal_stuff) values c.xid,p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff;  (Update sibling_no) update T set sibling_no = (select max(sibling_no)+1 from T where parent_xid = p.xid) where xid = c.xid;  (Update path_code) update T set path_code = p.path_code || to_char(sibling_no, '00') where xid = c.xid;  (reset is_leaf for p, detail omit)

DML Patterns (insert in middle)  (Insert record with path_code and sibling_no as null) insert into T(xid,parent_xid,entry_level, entry_code, normal_stuff) values c.xid,p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff;  (Update sibling_no for those siblings elder than c) update T set sibling_no = sibling_no + 1 where parent_xid = p.xid and entry_code >c.entry_code;  (Update sibling_no for c) update T set sibling_no = (select max(sibling_no)+1 from T where parent_xid = p.xid and entry_code < c.entry_code;  (Update path_code for c) update T set path_code = p.path_code || to_char(sibling_no, '00') where xid = c.xid;  (Update path_code for those siblings elder than c and all decendents of those elder siblings, pcs_length stands for path_code section length) update T set path_code = substr(path_code,1, pcs_length*entry_level) || to_char(sibling_no, '00')||substr(path_code, pcs_length*(entry_level+1)+1) where path_code like p.path_code||’%’ and path_code > (select path_code from T where xid = c.xid)

DML patterns (delete)  (Delete node C and all its decendents) delete from T where path_code like C.path_code||’%’;  (Shifting sibling_no for those siblings elder than C) Update T set sibling_no = sibling_no -1 where parent_xid = P.parent_id and sibling_no >C.sibling_no;  (Shifting path_code for those siblings elder than C and their decendents) Update T set path_code = substr(path_code,1, pcs_length*C.entry_level) || to_char(sibling_no, '00')||substr(path_code, pcs_length*(C.entry_level+1)+1) where path_code like P.path_code||’%’ and path_code > C.path_code;  (reset is_leaf of the parent P of C according to whether P has other children)

Complexity Analysis  Space –length of path_code increases in order logarithmic of the total number of rows in the table (for non-degenerated hierarchical data). –e.g. length of c*20 vs 1M rows  Time – queries execute very much based on index range scan, usually the fastest available. –Inserts/deletes may involve sub-tree processing for delete or update.

Comparison with Connect-By categoryConnect-By Path code approach Data pattern General directed graph Strictly hierarchical Table design Recursive basic Recursive basic + extra H-columns Disk space cost Minimum Overhead of up to \log n Time Efficiency* Incapable of direct indexing or unknown Capable of direct indexing RDBMS Vendor independence* NoYes Query complexity* It depends DML complexity Minimum Substantial but in a small set of patterns

A stretched idea on RDBMS design  Make entry_id and parent_entry_id relationship declarative;  Enforce hierarchy constraint to the effect that each node can only have zero or one parent node;  Create and maintain path_code, entry_level, etc by RDBMS like creating and maintaining functional indexes;  Add syntax to SQL similar to Oracle’s connect-by, but with the extra of taking advantage of the hidden indexes.

Additional Questions and References  Whether and how to generalize the design for node non-uniform hierarchical data?  To see latest alternative approaches, e.g. http://www.inconcept.com/JCM/May2005/ David.html (Using ANSI SQL as a Conceptual Hierarchical Data Modeling and Processing Language for XML, by Michael M David) http://www.inconcept.com/JCM/May2005/ David.html http://www.inconcept.com/JCM/May2005/ David.html

Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach Charles Yu Database Architect Elance Inc. Elance Inc.

Similar presentations

Presentation on theme: "Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach Charles Yu Database Architect Elance Inc. Elance Inc."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach Charles Yu Database Architect Elance Inc. Elance Inc.

Similar presentations

Presentation on theme: "Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach Charles Yu Database Architect Elance Inc. Elance Inc."— Presentation transcript:

Similar presentations

About project

Feedback