ArchIS: An Efficient Transaction-Time Temporal Database System Built on Relational Databases and XML Fusheng Wang University of California, Los Angeles.

ArchIS: An Efficient Transaction-Time Temporal Database System Built on Relational Databases and XML Fusheng Wang University of California, Los Angeles

Motivation: Temporal Applications  Financial applications  Record-keeping applications  Scheduling applications  Scientific applications Most database applications are temporal in nature:

Temporal Databases: the Reality  Over 40 temporal data models and query languages have been proposed in the past  A long struggle to get around the limitations of RDBMS  No DBMS vendors have moved aggressively to extend SQL with temporal support

What’s Needed?  Expressive temporal representations and data models with minimal or no extension  Powerful languages for temporal queries with minimal or no extension  Indexing, clustering and query optimization techniques for efficient query support  Architectures that bring these together A temporal database system that provides:

Outline  Motivation  Viewing Relation History in XML  Temporal Queries with XQuery  The ArchIS System  Performance Study  Database Compression  Conclusion

Background: Publishing Relational Database as XML  Publishing relational DBs as XML  as actual XML documents: SQL/XML  as XML views: SilkRoute, XPeranto

Viewing Relation History in XML  Our proposal: view the history of relational DBs as XML documents:  Such history can be naturally represented in XML, without any extension to the data model  Temporal queries can be expressed in XQuery as is—without any extension to the language  Amenable for efficiently implementations

Temporal Grouping in XML  Temporal data models can be classified as:  Temporally ungrouped  Temporally grouped  Temporally grouped data models have more expressive power and are more natural for users  It is difficult to fit temporally grouped models into RDBMS  Temporally grouped data model can be represented well in XML

Example: Transaction-Time History of Tables  Timestamped tuple snapshots (temporally ungrouped) nameempnosalarytitle deptno DOBstartend Bob1000360000Engineerd011945-04-091995-01-011995-05-31 Bob1000370000Engineerd011945-04-091995-06-011995-09-30 Bob1000370000Sr Engineerd021945-04-091995-10-011996-01-31 Bob1000370000Tech Leaderd021945-04-091996-02-011996-12-31 nameempnosalarytitledeptnoDOB Bob 1995-01-01:1996-12-31 10003 1995-01-01:1996-12-31 60000 1995-01-01:1996-05-31 Engineer 1995-01-01:1995-09-30 d01 1995-01-01:1995-09-30 1945-04-09 1995-01-01:1996-12-31 70000 1995-06-01:1996-12-31 Sr Engineer 1995-10-01:1996-01-31 d02 1995-10-01:1996-12-31 Tech Leader 1996-02-01:1996-12-31  Temporally grouped history of employees

XML Representation of DB History 10003 Bob 60000 70000 Engineer Sr Engineer Tech Leader d01 d02 1945-04-09 10003 Bob 60000 70000 Engineer Sr Engineer Tech Leader d01 d02 1945-04-09

Advantages of XML Representations  The attribute value history is grouped, and can be queried directly  The H-document has a well-defined schema generated from the current table  The interval constraints are maintained in the updates

Temporal Queries with XQuery  XQuery: the coming standard query language for XML  With XQuery, we can specify temporal queries without any extension:  Temporal projection, snapshot queries, temporal joins, interval queries  Complex queries: A SINCE B, continuous periods, period containment

Temporal Queries with XQuery  Temporal projection: retrieve the salary history of “Bob”: element salary_history { for $s in doc("employees.xml")/ employees/employee/[name=“Bob”]/salary return $s } element salary_history { for $s in doc("employees.xml")/ employees/employee/[name=“Bob”]/salary return $s }  Snapshot queries: retrieve the departments on 1996-01-31: for $d in doc("depts.xml")/depts/dept [tstart(.) = "1996-01-31"] let $n := $d/name[tstart(.) ="1996-01-31"] let $m := $d/manager[tstart(.) = "1996-01-31"] return( element dept{$n,$m } ) for $d in doc("depts.xml")/depts/dept [tstart(.) = "1996-01-31"] let $n := $d/name[tstart(.) ="1996-01-31"] let $m := $d/manager[tstart(.) = "1996-01-31"] return( element dept{$n,$m } )

Temporal Functions  Shield the user from the low-level details used in representing time, e.g., “now”  Eliminate the need for the user to write complex functions, e.g., coalescing  Predefined functions:  Restructuring: coalese($l)  Period comparison : toverlaps, tprecedes, tcontains, tequals, tmeets  Duration and date/time: tstart($e), tend($e), timespan($e)  telement(Ts, Te): constructs an empty element element timestamped as tstart=Ts, tend=Te

Support for ‘now’  ‘now’: no change until now  Internally, “end of time” values are used to denote ‘now’, e.g., 9999-12-31  Intervals are only accessed through built-in functions: tstart() returns the start of an interval, tend() returns the end or CURRENT_DATE if it’s different from 9999-12-31  In the output, tend value can be:  “9999-12-31”  CURRENT_DATE by using rtend($e) that recursively replaces all the occurrence of 9999-12-31 with the current date,  “now”, using externalnow($e) that recursively replaces all the occurrence of \9999-12-31" with the string \now".

The ArchIS System  Two approaches are possible for storing and querying H- documents (H-views)  Native XML database approach: store H-documents directly into XML DB  XML-enabled RDBMS. Design issues include:  mapping (shredding) the XML views representing the H- documents into tables (H-tables)  translation of queries from the XML views to the H-tables  indexing, clustering and query mapping techniques  ArchIS: Archival Information System

The ArchIS System: Architecture H-tables Relational Data Current Database Active Rules/ update logs Temporal XML Data SQL Queries Temporal XML Queries H-views (H-documents) ARCHIS

H-tables  Assumptions  Each entity or relation has a unique key ( or composite keys) to identify it which will not change along the history. e.g., employee: empno  H-tables:  attribute history table: store history of each attribute  key table: built for the key  global relation table: record the history of relations  e.g.: current database:  employee(empno, name, sex, DOB, deptno, salary, title)

H-tables (cont’d) current table H-tables employeeglobal relation table relations(relationname, tstart, tend) empnokey tableemployee_id(id, tstart, tend) nameattribute history table employee_name(id, name, tstart, tend) …… salaryemployee_salary(id, salary, tstart, tend) titleemployee_title(id, title, tstart, tend)

H-tables (cont’d)  Sample contents of employee_salary: ID SALARY TSTART TEND ======= ======= ========== ========== 100022 58805 02/04/1985 02/04/1986 100022 61118 02/05/1986 02/04/1987 100022 65103 02/05/1987 02/04/1988 100022 64712 02/05/1988 02/03/1989 100022 65245 02/04/1989 02/03/1990 100023 43162 07/13/1988 07/13/1989...

Updating Table Histories  Changes in the current database can be tracked with either update logs or triggers  DB2: triggers  ArchIS: update logs

Query Mapping  General purpose query mapping: XPeranto  In ArchIS, we have well-defined mapping between H-documents (or H-views) and H- tables  We map temporal XQuery queries into SQL, utilizing SQL/XML  SQL/XML is a new standard to map between RDBMS and XML  Both tag-binding and structure construction is pushed inside the relational engine, thus be very efficient

SQL/XML Publishing Functions  XMLElement and XMLAttribute  XMLAgg select XMLElement (Name "dept", XMLAttributes (tstart as "tstart", tend as "tend"), deptname) from dept where deptname = ‘Sales’ Sales Sales select XMLElement (Name as "new_employees", XMLAttributes ("02/04/2003" as "Since") XMLAgg (XMLElement (Name as "employee", e.name)) from employee_name as e where e.tstart >= ‘02/04/2003’ Bob Jack Bob Jack

XQuery Mapping to SQL with SQL/XML select XMLElement (Name "salaryhistory", XMLAgg (XMLElement (Name as "salary", XMLAttributes (S.tstart as tstart, S.tend as "tend"), S.salary))) from employee_salary as S, employee_name as N where N.id = S.id and N.name = 'Bob' group by N.id select XMLElement (Name "salaryhistory", XMLAgg (XMLElement (Name as "salary", XMLAttributes (S.tstart as tstart, S.tend as "tend"), S.salary))) from employee_salary as S, employee_name as N where N.id = S.id and N.name = 'Bob' group by N.id  Temporal projection: retrieve the salary history of “Bob”: element salary_history { for $s in doc("employees.xml")/ employees/employee/[name=“Bob”]/salary return $s } element salary_history { for $s in doc("employees.xml")/ employees/employee/[name=“Bob”]/salary return $s }

XQuery Mapping to SQL with SQL/XML: Steps  Identification of variable range  Map variables in FOR/LET clause into underlying H- tables  Generation of join conditions  There is a join condition any pair of distinct tuple variables: join them by ids  Translation of built in functions  Map built-in temporal functions in XQuery into functions in ArchIS  Output generation  use XMLElement and XMLAgg constructs

Temporal Clustering and Indexing  Tuples in H-tables are stored in the order of updates, thus neither temporally clustered nor clustered by objects  Traditional indexes such as B+ Tree will not help on snapshot queries, and better temporal clustering is needed  For every segment, usefulness: U = N live /N all  At the beginning, U =100%, and it decreases with updates  The minimum tolerable usefulness: U min

Live Segment-based Clustering Scheme Segment 1 All Segment 2 All Live Segment 3 All segstart1segend1segstart2segend2segstart3segend3 tstart tuple <= segend SEG tend tuple >= segstart SEG tstart tuple <= segend SEG tend tuple >= segstart SEG

Segment-based Clustering Scheme  Initially all tuples for an attribute history table are archived in a live segment SEG live with usefulness U =100%. With updates, when U drops below U min : 1. A new segment is allocated; 2. The interval of this segment is recorded in the table segment(segno, segstart, segend); 3. All tuples in SEGlive are copied into a new segment Si sorted by id; 4. All live tuples in SEG live are copied into a new live segment SEG live', and the old live segment is dropped; After that, the new segment SEG live’ becomes the new starting segment for updates

Segment-based Clustering Scheme (cont’d)  Sample segments: Segment1 (01/01/1985 - 10/17/1991): ID SALARY TSTART TEND 100002 40000 02/20/198802/19/1989 100002 42010 02/20/1989 02/19/1990 100002 42525 02/20/1990 02/19/1991 100002 42727 02/20/1991 12/31/9999... Segment2 (10/18/1991 - 07/08/1995): ID SALARY TSTART TEND 100002 42727 02/20/1991 02/19/1992 100002 45237 02/20/1992 02/18/1993 100002 46465 02/19/1993 02/18/1994 100002 47418 02/19/1994 02/18/1995 100002 47273 02/19/1995 12/31/9999...

Advantages of Segment-based Clustering Scheme  The current live segment always has a high usefulness, assuring efficient updates;  Records are globally temporally clustered on segments;  For snapshot queries, only one segment is used; for interval queries, only segments involved are used;  Flexibility to control the number of redundant tuples in segments with U min

Storage Usage of Segment-based Clustering Relative storage size with different U min N seg <= N 0/ (1-U min ) NSNS

Query Performance on Temporal Data with Segment-based Clustering Queries: Point: Q1 Snapshot: Q2 Interval: Q5 History: Q3, Q4, Q6

Performance Study: Experimental Setup  Systems: Tamino, DB2, and ArchIS  ArchIS uses BerkeleyDB as its storage manager, and it builds on top of it a SQL query engine  Temporal data set: the history of 300,024 employees over 17 years  The simulation models real world salary increases, changes of titles, and changes of departments  The size of the XML data is 334MB  The single large XML document is cut into a collection of 15,000 small XML documents with around 25KB each  Machine: Pentium IV 2.4GHz PC with RedHat 8.0

Performance Study: Query Performance snapshot query Q2 on ArchIS is 137 times faster than that on Tamino; interval query Q5 is 91 times faster; history Q6 is 25 times faster; Q4 4 times faster, and Q3 near 3 times faster. Tamino with clustering: snapshot Q2 is 3.3 times faster than without clustering ( still 41 times slower than archIS); interval query Q5 is 2.9 times faster than without clustering ( still 31 times slower than on ArchIS); history queries are much slower DB2 and ArchIS: with clustering Tamino: without clustering

Storage Utilization

Database Compression  The disparity between CPU/memory and disk speeds is becoming larger and larger  Cost to read one IDE disk page: 14ms  Cost to uncompress one page: 1.1ms(500MHz CPU)  0.26ms(2.4GHz CPU)  Cost to retrieve one compressed page: 14ms + 0.26ms = 14.3ms  Cost to retrieve uncompressed pages (3.6 pages):  14ms x 3.6 = 50.4ms

Page-based Compression: PageZIP  Traditional data compression tools: compress a file as a whole  PageZIP: page-based compression and uncompression at the granularity of a page  Based on gzip library: zlib  Benefit: save space; point, snapshot or interval queries only retrieve a small fraction of the history, and can be efficient

PageZIP Segment 1 Segment n page 1 ID: 1001 - 1100 page 2 ID: 1100 - 1203 page 3 ID: 1203 - 1331 … …

Storage Utilization with Compression  For each attribute history table, we compress it as a sequence of pages and store each page as a BLOB in a RDBMS employee_salary (sid, salary, tstart, tend) => employee_salary_blob(pageno, startsid, endsid, pageblob)

Query Performance with Compression

Update Performance  For RDBMS, only the current segment is used for updates. For Tamino, current data and historical data are clustered together  Update an employee’s salary:  DB2: 0.29 seconds; Tamino: 1.2 seconds  Assume that every employee gets updated once a year: about 1/260 of the total employee get updated every day on average  DB2: 1.52 seconds; Tamino: 15 seconds  In the worse case for segment-based archiving: 39 seconds for copying segments and 36 segments for compression: but only once

Summary  We built a transaction time temporal database on RDBMS and XML, with:  XML to support temporally grouped (virtual) representations of the database history  XQuery to express powerful temporal queries on such views  temporal clustering for managing the actual historical data in a RDBMS  SQL/XML for executing the queries on the XML views as equivalent queries on the relational DB  compression as option for efficient storage  ArchIS provides a unified solution for a wide spectrum of temporal application problems

Future Work  Friendly temporal query interfaces based on temporally grouped models  Other clustering and indexing techniques to be investigated  Other efficient data compression techniques proposed for XML data to be investigated  Apply the approach to valid-time DB and bi- temporal DB  Apply the approach to OODBMS and semi- structured data model

ArchIS: An Efficient Transaction-Time Temporal Database System Built on Relational Databases and XML Fusheng Wang University of California, Los Angeles.

Similar presentations

Presentation on theme: "ArchIS: An Efficient Transaction-Time Temporal Database System Built on Relational Databases and XML Fusheng Wang University of California, Los Angeles."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

ArchIS: An Efficient Transaction-Time Temporal Database System Built on Relational Databases and XML Fusheng Wang University of California, Los Angeles.

Similar presentations

Presentation on theme: "ArchIS: An Efficient Transaction-Time Temporal Database System Built on Relational Databases and XML Fusheng Wang University of California, Los Angeles."— Presentation transcript:

Similar presentations

About project

Feedback