Published by Edward Booker. Modified over 9 years ago.
1
GUS: The Genomics Unified Schema
A Platform for Genomics Databases
V. Babenko, B. Brunk, J. Crabtree, S. Diskin, S. Fischer, G. Grant, Y. Kondrahkin, L. Li, J. Liu, J. Mazzarelli, D. Pinney, A. Pizarro, E. Manduchi, S. McWeeney, J. Schug, C. Stoeckert
Center for Bioinformatics, University of Pennsylvania
stevef,stoeckrt@pcbi.upenn.edu
2
Overview
3
Abstract
The Genomics Unified Schema (GUS) is a strongly typed relational database schema and accompanying portable object-based software platform used for integration, analysis, curation, mining and presentation of sequence-based genomics information. The schema is organized into five domains: a detailed model of the central dogma (gene, RNA, protein), including DNA, assembled RNA and protein sequence and a diversity of sequence annotation (DoTS); an MGED-compliant warehouse of transcript expression experiments (RAD); a catalogue of grammars describing regulatory regions (TESS); a wide range of controlled vocabularies and ontologies (SRes); and a detailed representation of data provenance (Core). (A sixth domain for protein expression is in progress.) GUS's normalized relational structure and the extent of its integrated data enable powerful queries that are not viable in many other genomics systems. The platform facilitates maintenance of the warehouse and its use in web and data mining applications.
4
Goals of GUS
- Generic platform for model organism or disease specific databases
- Freely available at www.gusdev.org and www.cbil.upenn.edu
- Integration of genome, transcript and protein data, including:
  - Sequence
  - Function
  - Expression
  - Interaction
  - Regulation
  - Orthologs and paralogs
- Support for:
  - automated annotation and integration
  - manual curation
  - data mining/analysis and sophisticated queries
  - web access
5
GUS Powers Multiple Genomics DBs
[Diagram]
- Databases built on GUS: AllGenes, PlasmoDB, EPConDB, other sites, other projects
- Schema domains: Core, SRes, TESS, RAD, DoTS
- Platform: Oracle RDBMS, Object Layer for Data Loading, Java Servlets
6
Components of GUS
- Relational database schema
- Lightweight object layer
- Application frameworks
  - Data access
  - Pipeline/workflow
  - Web (servlets)
- Applications
  - Annotator's interface
  - Parsers and exporters (using standards)
  - Annotation and analysis programs
  - Schema browser
- Utilizes Oracle 9i
7
Architecture of GUS
[Diagram]
- Inputs: Genomic Sequence; GSSs & ESTs; microarray & SAGE Experiments; Mapping Data; GenBank, InterPro, GO, etc; Annotation; QTL, POP, SNP, Clinical
- Processing: Automated Analysis & Integration; Annotator's Interface
- Storage: Oracle/SQL holding the DoTS, RAD, Core, SRes and TESS domains, with the Object Layer
- Access: WWW queries, browsing, & download (Java Servlets & Perl CGI); Mining Applications
8
Usage of GUS
- Annotation
  - Of genomes: gene models, sequence features
  - Of genes: function, expression, regulation
- Integration
  - From sequence to expression
  - Map identifiers to/from external databases
- Data mining, creating curated datasets
  - Algorithm-based: GO function prediction
  - Genome-wide querying: find all pancreas-specific transcripts
  - PANCchip: non-redundant genes expressed in pancreas found using ESTs, microarrays and cDNA libraries
9
GUS Schema
10
Schema features
- Extensive integrated genomics schema (300 tables)
- Divided into 5 distinct domains
- Highly normalized
- Strongly typed
  - Controlled vocabularies used extensively
  - Avoid using name-value pairs
- Subclassing
  - Use views of superclass to define subclasses
  - Useful for mapping into the object layer
- Warehousing
  - Include databases such as GenBank, GO terms, ProDom, CDD
  - Facilitates management of value-added annotation across updates
- Cross references to external databases
- Tracking and versioning
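The "views of superclass" idea can be sketched as follows. This is a minimal illustration that uses SQLite in place of Oracle; the table name `NAFeatureImp`, the discriminator column `subclass_view` and the generic columns `string1` and `number1` are assumptions made for the example, not definitions taken from the GUS schema.

```python
import sqlite3


def build_feature_db():
    """Create one superclass table and two subclass views (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Superclass table: shared columns plus generic, reusable columns.
        CREATE TABLE NAFeatureImp (
            na_feature_id  INTEGER PRIMARY KEY,
            subclass_view  TEXT NOT NULL,      -- which subclass this row belongs to
            na_sequence_id INTEGER NOT NULL,
            name           TEXT,
            string1        TEXT,               -- meaning depends on the subclass
            number1        INTEGER             -- meaning depends on the subclass
        );
        -- Each subclass is a view that filters rows and renames generic columns.
        CREATE VIEW GeneFeature AS
            SELECT na_feature_id, na_sequence_id, name,
                   string1 AS gene_type,
                   number1 AS number_of_exons
            FROM NAFeatureImp
            WHERE subclass_view = 'GeneFeature';
        CREATE VIEW RNAFeature AS
            SELECT na_feature_id, na_sequence_id, name,
                   string1 AS rna_type
            FROM NAFeatureImp
            WHERE subclass_view = 'RNAFeature';
    """)
    conn.executemany(
        "INSERT INTO NAFeatureImp "
        "(subclass_view, na_sequence_id, name, string1, number1) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            ("GeneFeature", 10, "geneA", "protein_coding", 4),
            ("RNAFeature", 10, "rnaA", "mRNA", None),
        ],
    )
    return conn


def gene_features(conn):
    """Query the subclass view as if it were a strongly typed table."""
    return conn.execute(
        "SELECT name, gene_type, number_of_exons FROM GeneFeature ORDER BY name"
    ).fetchall()
```

Because every subclass is a named view with typed, named columns, an object layer can map one class to each view without resorting to name-value pairs.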
11
Five domains
GUS is divided into 5 domains* (separate name spaces)
Namespace | Domain | Highlights
DoTS (DB of Transcribed Seqs) | Sequence and annotation | Central dogma
RAD (RNA Abundance DB) | Gene expression | MIAME/MAGE
TESS (Trans Elem Search Site) | Gene regulation | Grammars
Core | Data Provenance | Evidence
SRes (Shared Resources) | Shared Resources | Ontologies
* Protein interaction domain underway
12
Querying across the domains
- Core (Data Provenance): Ownership, Protection, Algorithms, Versioning, Workflows
- SRes (Ontologies): GO, Species, Anatomy/Tissue, Developmental stage, Disease state
- DoTS
  - Genomic Sequence: Genes, gene models; STSs, repeats, etc; Cross-species analysis
  - Transcribed Sequence: Characterize transcripts; RH mapping; Library analysis; Cross-species analysis; DoTS assemblies
  - Protein Sequence: Domains; Function; Structure; Cross-species analysis
- RAD (Transcript Expression): Arrays, SAGE, Conditions
- TESS (Gene Regulation): Binding Sites, Patterns, Grammars
Example query spanning DoTS, RAD, SRes, Core and TESS: "Transcription factors upregulated in acute myeloid leukemia with sequence similarity to c-fos and common promoter motifs"
13
DoTS central dogma schema
- Gene Instance; Gene Feature (isa NA Feature); Genomic Sequence (isa NA Sequence)
- RNA Instance; RNA Feature (isa NA Feature); RNA Sequence (isa NA Sequence)
- Protein Instance; Protein Feature (isa AA Feature); Protein Sequence (isa AA Sequence)
14
RAD schema uses MAGE/MIAME
- MAGE: Experiment; Array; BioMaterial; BioAssay; BioAssayData; Protocol, Descr.; HigherLevelAnalysis
- MIAME: Experimental Design; Array design; Samples; Hybridization, Measure; Normalization
15
TESS schema
[Diagram]
- TESS.Model: ModelString, ModelConsensusString, ModelPositionalWeightMatrix, ModelGrammar
- TESS.Activity: ActivityProteinDnaBinding, ActivityTissueSpecificity
- TESS.Moiety: Moiety, MoietyMultimer, MoietyHeterodimer, MoietyComplex
- TESS.FootprintInstance
- DoTS.NaFeature: BindingSite, Promoter...
- DoTS.NaSequence
- TESS.TrainingSet
- TESS.ParameterGroup
- TESS.Note
16
Ontologies and vocabularies
- Ontologies
  - Gene Ontology (GO)
  - Sequence Ontology (SO) (sequence features)
  - Phenotype and Trait Ontology (PATO)
  - Taxon (NCBI)
  - Anatomy (Penn)
  - Disease (ICD9)
  - Developmental stage (multiple sources)
- And vocabularies
  - External database names
  - Genetic codes
  - Review status
17
Evidence trail
- Evidence and tracking
  - Data tables have columns for user, date, project, algorithm invocation
  - Tables dedicated to algorithm, algorithm version and parameters
  - 176 algorithms, including public and in-house
  - Tracks automated and manual annotation, similarity and integration
- Versioning
  - All updated or deleted rows are copied to version table
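The versioning rule above (copy every updated or deleted row to a version table) can be sketched with database triggers. This is only one way to get that behaviour, shown in SQLite; the poster does not say how GUS implements it, and the names `Gene`, `GeneVer`, `row_user` and `row_alg_invocation_id` are assumptions made for the example.

```python
import sqlite3


def make_versioned_db():
    """Data table with tracking columns, plus a version table fed by triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Gene (
            gene_id               INTEGER PRIMARY KEY,
            name                  TEXT NOT NULL,
            row_user              TEXT NOT NULL,       -- who wrote the row
            row_alg_invocation_id INTEGER NOT NULL,    -- which program run wrote it
            modification_date     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE GeneVer (
            gene_id               INTEGER,
            name                  TEXT,
            row_user              TEXT,
            row_alg_invocation_id INTEGER,
            modification_date     TEXT,
            version_date          TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        -- Copy the old row before it is overwritten.
        CREATE TRIGGER gene_version_on_update BEFORE UPDATE ON Gene
        BEGIN
            INSERT INTO GeneVer (gene_id, name, row_user,
                                 row_alg_invocation_id, modification_date)
            VALUES (OLD.gene_id, OLD.name, OLD.row_user,
                    OLD.row_alg_invocation_id, OLD.modification_date);
        END;
        -- Copy the old row before it is removed.
        CREATE TRIGGER gene_version_on_delete BEFORE DELETE ON Gene
        BEGIN
            INSERT INTO GeneVer (gene_id, name, row_user,
                                 row_alg_invocation_id, modification_date)
            VALUES (OLD.gene_id, OLD.name, OLD.row_user,
                    OLD.row_alg_invocation_id, OLD.modification_date);
        END;
    """)
    return conn


def demo_history():
    """Insert, update and delete one row, then return its version history."""
    conn = make_versioned_db()
    conn.execute(
        "INSERT INTO Gene (gene_id, name, row_user, row_alg_invocation_id) "
        "VALUES (1, 'geneA', 'annotator1', 7)"
    )
    conn.execute(
        "UPDATE Gene SET name = 'geneA2', row_user = 'annotator2' WHERE gene_id = 1"
    )
    conn.execute("DELETE FROM Gene WHERE gene_id = 1")
    return conn.execute(
        "SELECT name, row_user FROM GeneVer ORDER BY rowid"
    ).fetchall()
```

After the update and the delete, the version table holds both earlier states of the row, so the annotation history survives even though the live table is empty.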
18
Sophisticated queries
Sample queries from three projects that utilize GUS's data integration and analysis
- www.allgenes.org: "Is my cDNA similar to any mouse genes that are predicted to encode transcription factors and have been localized to mouse chromosome 5?"
- http://plasmodb.org: "List all genes whose proteins are predicted to contain a signal peptide and for which there is evidence that they are expressed in Plasmodium falciparum's late schizont stage"
- www.cbil.upenn.edu/EPConDB: "Which genes on chromosome 2 are expressed in pancreas and are involved in signal transduction based on GO function assignments?"
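To make the EPConDB question concrete, here is a hedged sketch of how such a query might join data from several domains. It uses SQLite and a toy schema invented for this example (`dots_gene`, `dots_gene_go`, `rad_expression`, `sres_anatomy`, `sres_go_term`); the real GUS tables and the gene names are not shown on the poster, so all of them are placeholders.

```python
import sqlite3

# Hypothetical query: one join path per domain (DoTS, RAD, SRes).
CROSS_DOMAIN_SQL = """
    SELECT DISTINCT g.name
    FROM dots_gene g
    JOIN rad_expression e ON e.gene_id = g.gene_id
    JOIN sres_anatomy a   ON a.anatomy_id = e.anatomy_id
    JOIN dots_gene_go gg  ON gg.gene_id = g.gene_id
    JOIN sres_go_term t   ON t.go_term_id = gg.go_term_id
    WHERE g.chromosome = ?
      AND a.name = ?
      AND t.name = ?
    ORDER BY g.name
"""


def build_toy_warehouse():
    """Create a tiny stand-in for three GUS domains with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dots_gene (gene_id INTEGER PRIMARY KEY, name TEXT, chromosome TEXT);
        CREATE TABLE sres_anatomy (anatomy_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE sres_go_term (go_term_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE rad_expression (gene_id INTEGER, anatomy_id INTEGER);
        CREATE TABLE dots_gene_go (gene_id INTEGER, go_term_id INTEGER);
    """)
    conn.executemany("INSERT INTO dots_gene VALUES (?, ?, ?)", [
        (1, "geneA", "2"), (2, "geneB", "2"), (3, "geneC", "5"), (4, "geneD", "2"),
    ])
    conn.executemany("INSERT INTO sres_anatomy VALUES (?, ?)",
                     [(1, "pancreas"), (2, "liver")])
    conn.executemany("INSERT INTO sres_go_term VALUES (?, ?)",
                     [(1, "signal transduction"), (2, "transport")])
    conn.executemany("INSERT INTO rad_expression VALUES (?, ?)",
                     [(1, 1), (2, 1), (3, 1), (4, 2)])
    conn.executemany("INSERT INTO dots_gene_go VALUES (?, ?)",
                     [(1, 1), (2, 2), (3, 1), (4, 1)])
    return conn


def genes_matching(conn, chromosome, tissue, go_function):
    """Return gene names that satisfy all three cross-domain conditions."""
    rows = conn.execute(CROSS_DOMAIN_SQL, (chromosome, tissue, go_function))
    return [name for (name,) in rows]
```

Calling `genes_matching(build_toy_warehouse(), "2", "pancreas", "signal transduction")` keeps only the gene that passes the location, expression and function filters together, which is the kind of question a single integrated schema can answer in one statement.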
19
Application Frameworks
20
GUS Object layer
- Lightweight
- Perl implementation
- Java on the way
- One object per table
- Parent/child relationships
- Cascading delete
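The parent/child and cascading-delete behaviour can be sketched as follows. The GUS object layer itself is written in Perl; this Python class, its name `GusRow` and its methods are hypothetical and only illustrate the idea of deleting children before their parent so that foreign-key constraints are not violated.

```python
class GusRow:
    """Hypothetical stand-in for an object that wraps one table row."""

    def __init__(self, table, row_id):
        self.table = table
        self.row_id = row_id
        self.parent = None
        self.children = []

    def add_child(self, child):
        """Link a child row (one holding a foreign key to this row)."""
        child.parent = self
        self.children.append(child)
        return child

    def cascade_delete(self):
        """Return (table, id) pairs in a safe delete order: children first."""
        order = []
        for child in self.children:
            order.extend(child.cascade_delete())
        order.append((self.table, self.row_id))
        return order


def example_delete_order():
    """Gene -> RNA -> Protein chain; deleting the gene removes all three."""
    gene = GusRow("Gene", 1)
    rna = gene.add_child(GusRow("RNA", 2))
    rna.add_child(GusRow("Protein", 3))
    return gene.cascade_delete()
```

In this sketch `example_delete_order()` lists the protein row first and the gene row last, mirroring the order in which a database would accept the deletes.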
21
Data input
The GusApplication program manages inserts and updates to GUS, handling tracking and versioning. Specific tasks are implemented as plugins. Plugins use either GUS objects or SQL access. Low-level database access is provided by DBI classes.
[Diagram labels: GusApplication; Plugin; Object; SuperClasses; SQL; DBI; RAD, TESS, DoTS, Core, SRes]
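The division of labour described above, where a driver program handles tracking and a plugin does the task-specific work, might look like the sketch below. It is written in Python with SQLite for illustration only; `Plugin`, `InsertGeneNames`, `run_plugin` and the two tables are hypothetical names, not the GusApplication API.

```python
import json
import sqlite3


class Plugin:
    """Hypothetical base class: a plugin implements one loading task."""
    name = "Plugin"
    version = "0"

    def run(self, conn, invocation_id, args):
        raise NotImplementedError


class InsertGeneNames(Plugin):
    """Example plugin that uses plain SQL access to insert rows."""
    name = "InsertGeneNames"
    version = "1.0"

    def run(self, conn, invocation_id, args):
        rows = [(gene_name, invocation_id) for gene_name in args["names"]]
        conn.executemany(
            "INSERT INTO Gene (name, row_alg_invocation_id) VALUES (?, ?)", rows
        )
        return len(rows)


def open_db():
    """Create the tracking table and one data table (made-up layout)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE AlgorithmInvocation (
            algorithm_invocation_id INTEGER PRIMARY KEY,
            algorithm  TEXT NOT NULL,
            version    TEXT NOT NULL,
            parameters TEXT NOT NULL,
            row_user   TEXT NOT NULL
        );
        CREATE TABLE Gene (
            gene_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            row_alg_invocation_id INTEGER NOT NULL
        );
    """)
    return conn


def run_plugin(conn, plugin, args, user):
    """Driver: record the invocation, then run the plugin in one transaction."""
    with conn:  # commits on success, rolls back if the plugin raises
        cursor = conn.execute(
            "INSERT INTO AlgorithmInvocation "
            "(algorithm, version, parameters, row_user) VALUES (?, ?, ?, ?)",
            (plugin.name, plugin.version, json.dumps(args, sort_keys=True), user),
        )
        invocation_id = cursor.lastrowid
        row_count = plugin.run(conn, invocation_id, args)
    return invocation_id, row_count
```

Because the driver creates the invocation record and passes its id to the plugin, every row a plugin writes can be traced back to the program, version and parameters that produced it.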
22
Pipeline
- Perl API for defining annotation pipelines
- Supports sequential protocols
- Distributes compute-intensive work to compute cluster
- Used for 90-stage pipeline to build DoTS transcript index
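A sequential pipeline with one distributed stage can be sketched as below. The actual API is in Perl and targets a compute cluster; this Python version uses a local thread pool as a stand-in for the cluster, and the names `run_pipeline`, `distributed` and the example stages are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(steps, data):
    """Run named steps strictly in order, feeding each step's output to the next."""
    completed = []
    for name, step in steps:
        data = step(data)
        completed.append(name)
    return data, completed


def distributed(task, workers=4):
    """Wrap a per-item task so its items are processed in parallel.

    A thread pool stands in for the compute cluster in this sketch.
    """
    def step(items):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(task, items))  # map preserves input order
    return step


def drop_short(sequences, min_length=4):
    """Cheap sequential stage: discard sequences shorter than min_length."""
    return [seq for seq in sequences if len(seq) >= min_length]


def gc_fraction(sequence):
    """Stand-in for a compute-intensive per-sequence analysis."""
    return sum(base in "GC" for base in sequence) / len(sequence)


EXAMPLE_STEPS = [
    ("drop_short", drop_short),
    ("gc_fraction", distributed(gc_fraction)),
]
```

Running `run_pipeline(EXAMPLE_STEPS, ["GGCC", "AT", "ATGC"])` filters the input first and then fans the remaining sequences out to the workers, which is the same shape as a long annotation pipeline in miniature.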
23
Web
- Servlets and CGI based design (JSP on the way)
- Automatic generation of HTML FORMs
  - Automated input checking
  - Integrated help features
  - INPUT elements populated from the database
- Query history facility
  - Boolean queries (AND, OR, SUBTRACT)
- Declarative configuration file
- Base system is relatively independent of GUS
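Boolean queries over a query history amount to set operations on earlier result sets. The sketch below assumes each history entry is a set of identifiers; the function name `combine` and the history layout are hypothetical, not taken from the GUS web framework.

```python
def combine(history, left, op, right):
    """Combine two earlier results and append the outcome to the history.

    history: list of sets of identifiers, indexed by query number.
    Returns the index of the newly stored result.
    """
    operations = {
        "AND": set.intersection,
        "OR": set.union,
        "SUBTRACT": set.difference,
    }
    if op not in operations:
        raise ValueError(f"unsupported operator: {op}")
    history.append(operations[op](history[left], history[right]))
    return len(history) - 1


def example_history():
    """Two stored queries, then one Boolean combination of each kind."""
    history = [
        {"geneA", "geneB", "geneC"},   # query 0: expressed in pancreas
        {"geneB", "geneC", "geneD"},   # query 1: on chromosome 2
    ]
    combine(history, 0, "AND", 1)
    combine(history, 0, "OR", 1)
    combine(history, 0, "SUBTRACT", 1)
    return history
```

Storing each combined result back into the history lets a user build up a complex query step by step, reusing any earlier result as an operand.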
24
Provided Applications
25
Annotator's interface
[Screenshot; labelled tasks]
- Assign Gene Name/Symbol
- Assign Gene Description
- Assign Gene Synonym(s)
- Evidence
26
Parsing & exporting
- Parsing
  - Sequence DBs: GenBank (main, dbEST, NRDB), SWISS-PROT, TIGR
  - Protein Motifs: CDD, ProDom, InterPro
  - Expression: MAGE
  - Ontologies: GO, SO, PATO
  - Mapping data: RH maps
  - Gene predictors: GLIMMER, Genscan, PHAT, GeneFinder
  - Similarity: BLAST, BLAT, Sim4
  - CAP4
- Exporting
  - FASTA
  - MAGE
  - Table dumps
  - DoTS Assemblies
27
Analysis & annotation
- GO functional assignment
- Expression analysis (PaGE)
- Anatomy classification
- Library distribution
- Genes from BLAT of DoTS against genome
- DoTS assembly and annotation
  - Refresh warehouse
  - Cluster and assemble mRNAs/ESTs into putative transcripts
  - Annotate transcripts through similarity, GO function and markers
  - Integrate previously existing manual curation
28
DoTS Pipeline
[Flow diagram; labels as extracted]
- Inputs: Genomic Sequence; mRNA/EST Sequence
- Clustering and Assembly; DoTS consensus Sequences
- Gene/RNA cluster assignment (SIM4 or BLAT); Gene predictions (GenScan/HMMer, PHAT); Predicted Genes; Merge Genes; Gene Index
- RNAs; Proteins (translation, framefinder)
- Functional predictions: GO Functions; Protein Motifs (PFAM, Smart, ProDom); BLAST Similarities (BLASTP, BLASTX)
- Other computed annotation (EPCR, AssemblyAnatomyPercent, Index Key Words, SNP analysis)
- Annotate DoTS; Manual Annotation Tasks
29
References & Acknowledgements
References
- Scearce, L. Marie, Brestelli, John E., McWeeney, Shannon K., Lee, Catherine S., Mazzarelli, Joan, Pinney, Deborah F., Pizarro, Angel, Stoeckert, C. J. Jr., Clifton, Sandra, Permutt, M. Alan, Brown, Juliana, Melton, Douglas A., Kaestner, Klaus H. (2002) Functional Genomics of the Endocrine Pancreas: The Pancreas Clone Set and PancChip, New Resources for Diabetes Research. Diabetes 51: 1997-2004.
- Schug, J., Diskin, S., Mazzarelli, J., Brunk, Brian P., Stoeckert, C.J. (2002) Predicting Gene Ontology Functions from ProDom and CDD Protein Domains. Genome Res. 12: 648-655.
- Bahl, A., Brunk, B., Coppel, R.L., Crabtree, J., Diskin, S.J., Fraunholz, M.J., Grant, G.R., Gupta, D., Huestis, R.L., Kissinger, J.C., Labo, P., Li, L., McWeeney, S.K., Milgram, A.J., Roos, D.S., Schug, J., Stoeckert, C.J. (2002) PlasmoDB: The Plasmodium Genome Resource. An integrated database providing tools for accessing and analyzing mapping, expression and sequence data (both finished and unfinished). Nucleic Acids Res. 30: 87-90.
- Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C.P., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., Vingron, M. (2001) Minimum Information About a Microarray Experiment (MIAME): Toward Standards for Microarray Data. Nature Genetics 29: 365-371.
- Manduchi, E., Pizarro, A., Stoeckert, C. (2001) RAD (RNA Abundance Database): an infrastructure for array data analysis. Proc. SPIE, vol 4266, pp. 68-78.
- Davidson, S.B., Crabtree, J., Brunk, Brian P., Schug, J., Tannen, V., Overton, G.C., Stoeckert, C.J. Jr. (2001) K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources. IBM Systems Journal 40(2): 512-531.
- Crabtree, J., Wiltshire, T., Brunk, B., Zhao, S., Schug, J., Stoeckert, C., Bucan, M. (2001) High-resolution BAC-based Map of the Central Portion of Mouse Chromosome 5. Genome Res. 11: 1746-1757.
Acknowledgements
- NIH grant RO1-HG-01539-03
- DOE grant DE-FG02-00ER62893
- Burroughs Wellcome Fund
- NIDDK 56947 and 56954 with cosponsorship from the JDFI
30
Related posters
- 114A. Web-Based Biological Discovery using the GUS Integrated Database.
- 170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars
- 148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?