Presentation is loading. Please wait.

Presentation is loading. Please wait.

GUS 3.0: Implementation and Dependencies June 19, 2002 Jonathan Crabtree

Similar presentations


Presentation on theme: "GUS 3.0: Implementation and Dependencies June 19, 2002 Jonathan Crabtree"— Presentation transcript:

1 GUS 3.0: Implementation and Dependencies June 19, 2002 Jonathan Crabtree crabtree@pcbi.upenn.edu

2 Outline " Schema "implementation" – what's done, what's not " Dependencies – data migration and testing – other tasks " Database implementation details – design decisions and implications – production & development dbs " Future work/current schema issues

3 Implementation " Implemented so far: – The schema itself (Pinney) – Updated Perl object layer (Brunk) – Revised GUS Application (GA) code (Schug) – Preliminary version of allgenes interface (Fischer) " Not yet implemented: – Extensive testing of schema, interface, objects, etc. – Data migration from GUSdev to GUS 3.0

4 Migration Dependencies " Instantiate/finalize GUS 3.0 schema (Pinney) " Upgrade database server operating system " Install and configure new RAID device " Write scripts to migrate existing data – Resolve any remaining inconsistencies " Freeze access to database – Annotator's interface (Diskin, Mazzarelli) – Current allgenes update (Pinney, Fischer)

5 Migration Dependencies cont. " Run scripts to migrate all existing data – Fix any problems that arise " Begin to "certify" plugins as 3.0-compliant " Discuss: how much does the GUS 3.0 schema "implementation" depend on our data migration? " In other words, the 3.0 schema can be viewed as implemented but untested. " Conflict with PlasmoDB final release date?

6 Migration Highlights " Two "namespaces" (Oracle schemas) to five: – GUSdev,RADdev => Core,DoTS,SRes,RAD3,TESS – Certain tables are now shared in Core, SRes – Avoid primary key conflicts by reloading RAD3 " Restructuring of DoTS "central dogma" tables: – GeneInstance, RNAInstance, ProteinInstance – Also GO terms, new LOE and Complex tables " Other pervasive changes: – e.g. ExternalDatabase => ExternalDatabaseRelease

7 Other Tasks " Script(s) to automate schema creation: – schemas (in the Oracle sense) – tables – sequences (to generate primary key values) – views – "bootstrap" rows – populate other tables as desired? (Anatomy, etc.) – constraints – indexes – GRANT permissions as desired

8 Other Tasks II " Complete schema documentation " Convert plugins to new schema as needed – Remove site-specific dependencies or standardize " e.g., hard-coded references to specific external_db_ids – Particularly for data loading plugins " make it easier to load and display sample dataset " Formalize schema development process – Which changes are "major" or "minor"? – Automatically determine which plugins are affected?

9 Database Design Decisions " GUS vs. other "plain" relational databases: – 1. subclassing (extra views) – 2. [blame] tracking/access control (extra columns) – 3. versioning (extra tables) " Minimal reliance on database-specific features – no stored procedures – no server-side Java – no object-relational tables " Generic links and naming conventions

10 1. Subclassing With Views " Advantages – conceptual clarity – straightforward to query the superclasses – schema evolution; views are easier to change " Implications – large tables (number of columns and rows) – complicates query optimization (number of rows) – slows row accesses (number of columns)

11 Subclassing cont. " Query optimization issues – Cost-based query optimization requires statistics – Confounded by coexistence of subclasses in table – Bigger tables make the worst case worse " Physical I/O issues – Any row access must read the entire row, including a potentially large set of irrelevant column values – Also increases the likelihood of chaining

12 Subclassing - alternatives " Use views for the superclass not the subclasses? – Isolates subclasses from one another more – Requires changing tables rather than views – Superclass view will be a large SQL UNION – Queries likely less efficient over superclasses " Keep existing system, but use partitions to specify physical placement of subclass rows – Solves some, but not all of the problems

13 "Large" Tables I " GUS: indexes=25G tables=~100G – NASequenceImp = 11G – AssemblySequenceVer = 8.6 G – SimilaritySpan = 8G / 74 million rows – Similarity = 4.8 G / 38 million rows – AssemblySequence = 3.6G – Evidence = 3G / 36 million rows – SimilarityVer = 2.8G " Approximately 10-20 quite large tables

14 "Large" Tables II " Tables with largest average row length: – GeneMapVer = 811 bytes – NASequenceImp = 747 bytes – AssemblySequence = 521 bytes " Tables with the most chained rows: – NASequenceImp = 384,524 rows – AssemblySequenceVer = 56,873 rows – AssemblySequence = 21,458 rows

15 2. Tracking/Access Control " Advantages: – Enables DBA to disburse wrath appropriately – Aids in correcting errors " Disadvantages: – Extra columns have foreign key constraints – Several small tables become bottlenecks for certain DDL and database update operations – Access controls not fully implemented " where and how should they be implemented?

16 Tracking II

17 3. Versioning " Advantages: – required for complete tracking " Disadvantages: – space overhead, results in slower updates – requires application-level code to implement – may be unnecessary in some DBMSs – currently not used uniformly " Different versions coexisting e.g., PlasmoDB

18

19 Development => Production " nemesis/8i (GUS) and erebus/9i (GUSdev) " Release cycle based on whole-database copy " Uses Oracle IMPORT and EXPORT utilities: – EXPORT over network to flat files – change owner/schema name – change physical placement of tables, indexes " Alternatives: – transportable tablespaces, SQL-based copy

20 Future Work " Issues with current schema from PlasmoDB – free text searching (and use of CLOB values) – more sophisticated schema for tracking session- oriented data (more on this tomorrow) – supporting queries for genome browser(s)


Download ppt "GUS 3.0: Implementation and Dependencies June 19, 2002 Jonathan Crabtree"

Similar presentations


Ads by Google