Download presentation
Presentation is loading. Please wait.
Published byShon Martin Modified over 9 years ago
1
GUS 3.0: Implementation and Dependencies June 19, 2002 Jonathan Crabtree crabtree@pcbi.upenn.edu
2
Outline " Schema "implementation" what's done, what's not " Dependencies data migration and testing other tasks " Database implementation details design decisions and implications production & development dbs " Future work/current schema issues
3
Implementation " Implemented so far: The schema itself (Pinney) Updated Perl object layer (Brunk) Revised GUS Application (GA) code (Schug) Preliminary version of allgenes interface (Fischer) " Not yet implemented: Extensive testing of schema, interface, objects, etc. Data migration from GUSdev to GUS 3.0
4
Migration Dependencies " Instantiate/finalize GUS 3.0 schema (Pinney) " Upgrade database server operating system " Install and configure new RAID device " Write scripts to migrate existing data Resolve any remaining inconsistencies " Freeze access to database Annotator's interface (Diskin, Mazzarelli) Current allgenes update (Pinney, Fischer)
5
Migration Dependencies cont. " Run scripts to migrate all existing data Fix any problems that arise " Begin to "certify" plugins as 3.0-compliant " Discuss: how much does the GUS 3.0 schema "implementation" depend on our data migration? " In other words, the 3.0 schema can be viewed as implemented but untested. " Conflict with PlasmoDB final release date?
6
Migration Highlights " Two "namespaces" (Oracle schemas) to five: GUSdev,RADdev => Core,DoTS,SRes,RAD3,TESS Certain tables are now shared in Core, SRes Avoid primary key conflicts by reloading RAD3 " Restructuring of DoTS "central dogma" tables: GeneInstance, RNAInstance, ProteinInstance Also GO terms, new LOE and Complex tables " Other pervasive changes: e.g. ExternalDatabase => ExternalDatabaseRelease
7
Other Tasks " Script(s) to automate schema creation: schemas (in the Oracle sense) tables sequences (to generate primary key values) views "bootstrap" rows populate other tables as desired? (Anatomy, etc.) constraints indexes GRANT permissions as desired
8
Other Tasks II " Complete schema documentation " Convert plugins to new schema as needed Remove site-specific dependencies or standardize " e.g., hard-coded references to specific external_db_ids Particularly for data loading plugins " make it easier to load and display sample dataset " Formalize schema development process Which changes are "major" or "minor"? Automatically determine which plugins are affected?
9
Database Design Decisions " GUS vs. other "plain" relational databases: 1. subclassing (extra views) 2. [blame] tracking/access control (extra columns) 3. versioning (extra tables) " Minimal reliance on database-specific features no stored procedures no server-side Java no object-relational tables " Generic links and naming conventions
10
1. Subclassing With Views " Advantages conceptual clarity straightforward to query the superclasses schema evolution; views are easier to change " Implications large tables (number of columns and rows) complicates query optimization (number of rows) slows row accesses (number of columns)
11
Subclassing cont. " Query optimization issues Cost-based query optimization requires statistics Confounded by coexistence of subclasses in table Bigger tables make the worst case worse " Physical I/O issues Any row access must read the entire row, including a potentially large set of irrelevant column values Also increases the likelihood of chaining
12
Subclassing - alternatives " Use views for the superclass not the subclasses? Isolates subclasses from one another more Requires changing tables rather than views Superclass view will be a large SQL UNION Queries likely less efficient over superclasses " Keep existing system, but use partitions to specify physical placement of subclass rows Solves some, but not all of the problems
13
"Large" Tables I " GUS: indexes=25G tables=~100G NASequenceImp = 11G AssemblySequenceVer = 8.6 G SimilaritySpan = 8G / 74 million rows Similarity = 4.8 G / 38 million rows AssemblySequence = 3.6G Evidence = 3G / 36 million rows SimilarityVer = 2.8G " Approximately 10-20 quite large tables
14
"Large" Tables II " Tables with largest average row length: GeneMapVer = 811 bytes NASequenceImp = 747 bytes AssemblySequence = 521 bytes " Tables with the most chained rows: NASequenceImp = 384,524 rows AssemblySequenceVer = 56,873 rows AssemblySequence = 21,458 rows
15
2. Tracking/Access Control " Advantages: Enables DBA to disburse wrath appropriately Aids in correcting errors " Disadvantages: Extra columns have foreign key constraints Several small tables become bottlenecks for certain DDL and database update operations Access controls not fully implemented " where and how should they be implemented?
16
Tracking II
17
3. Versioning " Advantages: required for complete tracking " Disadvantages: space overhead, results in slower updates requires application-level code to implement may be unnecessary in some DBMSs currently not used uniformly " Different versions coexisting e.g., PlasmoDB
19
Development => Production " nemesis/8i (GUS) and erebus/9i (GUSdev) " Release cycle based on whole-database copy " Uses Oracle IMPORT and EXPORT utilities: EXPORT over network to flat files change owner/schema name change physical placement of tables, indexes " Alternatives: transportable tablespaces, SQL-based copy
20
Future Work " Issues with current schema from PlasmoDB free text searching (and use of CLOB values) more sophisticated schema for tracking session- oriented data (more on this tomorrow) supporting queries for genome browser(s)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.