Rationale for GUS
Answer queries:
  ‘Identify all the “spots” on an array that represent genes on chromosome 1 that are predicted to be transcription factors’
  ‘Identify tissues that “express” at least half the components of the interleukin-6 pathway’
  ‘Identify image clones that represent genes for which there is evidence for expression in the pancreas’
Facilitate gene expression and pathway/network analyses.
Data mining of diverse genomics data.
Gene index (gene-centric view of biological data).
Pragmatic: combine CBIL databases and thus unify effort.
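As an illustration of the second example query (tissues that express at least half the components of a pathway), here is a minimal sketch using an in-memory SQLite database. The table names `pathway_component` and `expression`, their columns, and all rows are hypothetical toy data invented for this example; they are not the GUS schema and not real expression results.

```python
import sqlite3


def build_demo_db():
    """Create a toy in-memory database (hypothetical schema, made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE pathway_component (pathway TEXT, gene TEXT);
        CREATE TABLE expression (gene TEXT, tissue TEXT);
        INSERT INTO pathway_component VALUES
            ('demo_pathway', 'g1'), ('demo_pathway', 'g2'),
            ('demo_pathway', 'g3'), ('demo_pathway', 'g4');
        INSERT INTO expression VALUES
            ('g1', 'liver'), ('g2', 'liver'), ('g3', 'liver'),
            ('g1', 'kidney'), ('g2', 'kidney'),
            ('g1', 'brain');
        """
    )
    return conn


def tissues_expressing_half(conn, pathway):
    """Return tissues that express at least half of the pathway's genes."""
    total = conn.execute(
        "SELECT COUNT(DISTINCT gene) FROM pathway_component WHERE pathway = ?",
        (pathway,),
    ).fetchone()[0]
    rows = conn.execute(
        """
        SELECT e.tissue
        FROM expression e
        JOIN pathway_component p ON p.gene = e.gene
        WHERE p.pathway = ?
        GROUP BY e.tissue
        HAVING COUNT(DISTINCT e.gene) * 2 >= ?
        ORDER BY e.tissue
        """,
        (pathway, total),
    ).fetchall()
    return [row[0] for row in rows]
```

With the toy rows above, `tissues_expressing_half(build_demo_db(), "demo_pathway")` keeps the tissues that express two or more of the four genes. The point is only that such a question becomes a single join once expression and pathway data share gene identifiers in one database.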
GUS: Genomics Unified Schema
Genomic Sequence: Genes, gene models; STSs, repeats, etc; Cross-species analysis
Ontologies / free text: GO, Species, Tissue, Dev. Stage
RAD (RNA Abundance DB), Transcript Expression: Arrays, SAGE, Conditions
DOTS (Transcribed Sequence): Characterize transcripts, RH mapping, Library analysis, Cross-species analysis
Special Features: Ownership, Protection, Algorithm, Evidence, Similarity, Versioning
Protein Sequence: Domains, Function, Structure, Cross-species analysis
Pathways, Networks: Representation, Reconstruction
under development
GUS Data
Public sequence data: GenBank, SwissProt, ProDom, PFAM, UCSC Golden Path, TransFac..
Public mapping data: Radiation hybrid
Annotation:
  DoTS (assembled transcribed sequences)
  Gene predictions (via Plasmodium consortium collaborations)
  Functional predictions (GOFunction)
  Some comparative genomics (mouse/human)
  Transcription factor binding site and promoter analyses
Data sets (some proprietary) from collaborators (primarily in RAD)
GUS system
External Datasources; Data Integration; Computational Annotation; Validation; Light weight PERL object layer; Data Warehouse; Annotators interface; Browser & bioWidgets; Java Servlet (views)
High Level Flow Diagram of GUS Annotation
[Diagram labels: Genomic Sequence; mRNA/EST Sequence; Clustering and Assembly; ORNL Gene predictions; GRAIL/GenScan; BLAST/SIM4; Predicted Genes; DOTS consensus Sequences; Merge Genes; Gene/RNA cluster assignment; Gene Index; Gene families, Orthologs; Assign Gene Name, Manual Annotation..; Predicted RNAs; Predicted Proteins; framefinder / DIANA; BLASTX; PFAM, SignalP, TMPred, ProDom, etc; BLASTP; Other Annotation (EPCR, AssemblyAnatomyPercent, Index Key Words, SNP analysis); BLAST Similarities; Protein Features/Motifs; Algorithms for functional predictions; GO Functions]
Incremental Updates of DoTS Sequences
Incoming Sequences (EST/mRNA)
Make Quality (remove vector, polyA, NNNs): “Quality” sequences (AssemblySequence)
Block with RepeatMasker: Blocked sequences
Assign to DOTS consensus sequences (blastn at 40 bp length, 92% identity)
Cluster incoming sequences that are not covered by consensus sequence: “Unassembled” clusters (consensus sequences and new)
Assemble DOTS consensus sequences and incoming sequences with CAP4 - initially reassemble CAP4 assemblies
Calculate new DOTS consensus sequence using weighted consensus sequence(s) and new CAP4 assembly: New Consensus sequences
Update GUS database
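The assignment step can be sketched as a filter over blastn hits. This is a minimal illustration only: it assumes the 40 bp and 92% figures on the slide are minimum thresholds, that hits arrive as plain tuples, and that each incoming sequence keeps just its single best hit. The function name and tuple layout are hypothetical and are not taken from the GUS code.

```python
from collections import defaultdict

MIN_ALIGN_LEN = 40    # bp, figure taken from the slide (assumed to be a minimum)
MIN_IDENTITY = 92.0   # percent identity, from the slide (assumed to be a minimum)


def assign_incoming(hits, incoming_ids):
    """Assign incoming sequences to existing consensus sequences.

    hits: iterable of (incoming_id, consensus_id, align_len, pct_identity).
    Returns (assigned, unassigned), where assigned maps each consensus id
    to the incoming ids placed with it, and unassigned lists the incoming
    ids with no qualifying hit (candidates for clustering among themselves).
    """
    best = {}
    for seq_id, cons_id, align_len, identity in hits:
        if align_len < MIN_ALIGN_LEN or identity < MIN_IDENTITY:
            continue  # hit does not meet the thresholds
        score = (identity, align_len)
        # Simplification: keep only the best-scoring hit per incoming sequence.
        if seq_id not in best or score > best[seq_id][0]:
            best[seq_id] = (score, cons_id)

    assigned = defaultdict(list)
    unassigned = []
    for seq_id in incoming_ids:
        if seq_id in best:
            assigned[best[seq_id][1]].append(seq_id)
        else:
            unassigned.append(seq_id)
    return dict(assigned), unassigned
```

In this sketch, the groups in `assigned` would go to CAP4 together with their consensus sequence, and the `unassigned` sequences would be clustered separately. A sequence that matches several consensus sequences may need different handling in practice; the sketch does not model that.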
Different Views of GUS
Focused annotation of specific organisms and biological systems:
Organisms: Human, Mouse, Plasmodium falciparum
Biological systems: Endocrine pancreas, CNS, Hematopoiesis
*not drawn to scale*
WWW.ALLGENES.ORG
Summary of AllGenes.org content
Boolean Query - Finding Surface Antigens
Can combine any of the ‘primitive’ queries on the genes page.
Query forms are ‘bookmarkable’.
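One way to picture combining primitive queries is as set operations on the gene identifiers each query returns, with the whole query encoded in a URL so it can be bookmarked. The sketch below is an assumption about how this could work, not a description of the AllGenes.org implementation; the primitive names, gene identifiers, and base URL are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def combine(op, left, right):
    """Combine two primitive query results (sets of gene ids)."""
    if op == "AND":
        return left & right
    if op == "OR":
        return left | right
    if op == "NOT":
        return left - right  # read as "left AND NOT right"
    raise ValueError(f"unknown operator: {op}")


def bookmark_url(base, op, left_name, right_name):
    """Encode a Boolean query in the URL so the form can be bookmarked."""
    return base + "?" + urlencode({"op": op, "left": left_name, "right": right_name})


def run_bookmark(url, primitives):
    """Re-run a bookmarked query against a dict of primitive results."""
    params = parse_qs(urlsplit(url).query)
    left = primitives[params["left"][0]]
    right = primitives[params["right"][0]]
    return combine(params["op"][0], left, right)


# Hypothetical primitive results; the gene ids are made up.
PRIMITIVES = {
    "signal_peptide": {"gene1", "gene2", "gene3"},
    "transmembrane": {"gene2", "gene3", "gene4"},
}
```

For example, `bookmark_url("http://example.org/query", "AND", "signal_peptide", "transmembrane")` yields a URL that `run_bookmark` can replay later against the current primitive results.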
StemCellDB-GUS website
Assembly Validation
Alignment to genomic sequence via BLAST/sim4: preliminary data look good.
Assembly consistency
RAD, GUS, TESS
EST clustering and assembly; genomic alignment and comparative sequence analysis; identify shared TF binding sites
Acknowledgements
CBIL: Chris Overton, Brian Brunk, Jonathan Crabtree, Sharon Diskin, Steve Fischer (Doubletwist), Mark Gibson (GeneLogic), Greg Grant, Elisabetta Manduchi, Joan Mazzarelli, Debbie Pinney, Angel Pizarro, Jonathan Schug, Jian Wang (Celera), Chris Stoeckert
PlasmoDB collaborators: David Roos, Martin Fraunholz, Jesse Kissinger, Jules Milgram, Dan Lawson (Sanger), Ross Koppel (Monash U.), Malaria Genome Consortium
Allgenes.org collaborators: Ed Uberbacher, ORNL; Doug Hyatt, ORNL
EPConDB collaborators: Klaus Kaestner; Marie Scearce; John Doug Melton, Harvard; Alan Permutt, Wash. U
Comparative Sequence Analysis collaborators: Maja Bucan; Tim Wiltshire; A. Lengeling, L. Tarantino, S. Kanes; Whitehead/MIT Center for Genome Research
StemCellDB Architecture
All sequences are entered into the flat file db first, with an efficient mechanism for filtering public vs private.
Sequences marked public are incorporated into GUS and, at regular intervals, also submitted to dbEST (GenBank).
Private sequences, and ones not yet dealt with, stay in the flat file db (current system).
Public sequences will be removed from the flat file db to decrease overhead and query times.
StemCell static pages should be maintained by Princeton.
Automated annotation is applied by CBIL; manual annotation in the StemCell flat files is moved over to GUS by semi-automatic methods.
CBIL annotators may prioritize StemCellDB entries for annotation.
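The public/private routing described on this slide can be sketched as a simple partition of flat-file entries. The `FlatFileEntry` record and its `status` values are hypothetical names chosen for illustration; the slide does not specify how the flag is stored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FlatFileEntry:
    """A sequence as first entered in the flat file db (hypothetical record)."""
    seq_id: str
    status: Optional[str]  # "public", "private", or None if not yet dealt with


def route(entries: List[FlatFileEntry]) -> Tuple[List[str], List[str]]:
    """Split entries into those moved to GUS and those kept in the flat file db.

    Public entries go to GUS (and would later be submitted to dbEST and
    removed from the flat file db). Private and unreviewed entries stay.
    """
    to_gus: List[str] = []
    stay_in_flat_file: List[str] = []
    for entry in entries:
        if entry.status == "public":
            to_gus.append(entry.seq_id)
        else:
            stay_in_flat_file.append(entry.seq_id)
    return to_gus, stay_in_flat_file
```

Treating unreviewed entries the same as private ones is the conservative default here: nothing reaches GUS or dbEST unless it has been explicitly marked public.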