From EpoDB to EPConDB: Adventures in Gene Expression Databases


From EpoDB to EPConDB: Adventures in Gene Expression Databases Chris Stoeckert, Ph.D. Computational Biology and Informatics Laboratory

EpoDB: A Prototype Database for the Analysis of Genes Expressed During Vertebrate Erythropoiesis
Example queries:
- Retrieve all adult, mammalian β-globin gene transcription units.
- Retrieve the proximal promoter regions from all genes with a significant change in expression between the BFU stage and erythroblast stage.
- Display the predicted transcription factor binding sites in the proximal promoter region of the mouse β-globin gene.
Stoeckert, Salas, Brunk, Overton (1999) Nucl. Acids Res. 26:288

EpoDB: A Data Warehouse Formed by Information Integration
Central entity: TRANSCRIPTION UNIT (EpoDB)
- GENE STRUCTURE: GenBank
- GENE FUNCTION: Swiss-Prot
- GENE REGULATION: Transfac, TRRD
- GENE EXPRESSION: GERD

EpoDB Transcription Unit

Populating EpoDB
Database     Total   Yes
GenBank      7819    3715
Swiss-Prot   2381    1241
TRRD          171      80
GERD           65      65

EpoDB Highlights
- Reference sequences
- Controlled vocabularies “gene ontology”
- Specified sequence retrieval
- All vertebrates
- Controlled vocabularies: gene names and family names; experiment descriptions
- Gene warehouse: GenBank, SWISS-PROT, Transfac, literature
- But no ESTs!

http://www.cbil.upenn.edu/EpoDB

EpoDB Gene Landmark Query

EpoDB limitations
- Not scalable
  - Manual triage of keyword-selected entries
  - Manual selection of reference genes
- Did not handle data from high throughput technologies
  - No ESTs
  - Could not represent microarrays, SAGE
  - Could not make use of unannotated genomic sequence
- Difficult to administer
  - Prolog-based

Chronology of CBIL Systems (timeline figure, 1995 to 2001)
- Tracks: Gene Expression; Sequence Annotation
- Systems shown: EpoDB, GAIA, ParaDB, DoTS, GUS, RAD, EPConDB, PlasmoDB, AllGenes
- Relational DB (Oracle) with Perl object layer

CBIL Project Architecture
- GUS: sequence & annotation; gene index (ESTs and mRNAs)
- RAD: microarray expression data; experimental annotation
- Relational DB (Oracle) with Perl object layer

GUS: Genomics Unified Schema (diagram)
- Genomic Sequence: genes, gene models; STSs, repeats, etc.; cross-species analysis
- DOTS (Transcribed Sequence): characterize transcripts; RH mapping; library analysis; cross-species analysis
- RAD (RNA Abundance DB), Transcript Expression: arrays; SAGE; conditions
- Protein Sequence: domains; function; structure; cross-species analysis
- Pathways: networks; representation; reconstruction
- Controlled vocabs. (replacing free text): GO, species, tissue, dev. stage
- Special Features: ownership, protection, algorithm, evidence, similarity, versioning
- Part of the diagram is labeled “under development”

GUS Object View (diagram)
- Gene: Genomic Sequence, Gene Instance, Gene Feature
- RNA: RNA Sequence, RNA Instance, RNA Feature
- Protein: Protein Sequence, Protein Instance, Protein Feature
- Other labels: NA, AA Sequence, AA Feature

Clusters vs. Contig Assemblies
- UniGene: BLAST gives clusters of ESTs & mRNAs
- Transcribed Sequences (DOTS): CAP4 (Paracel) gives consensus sequences
  - Alternative splicing
  - Paralogs
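To make the "cluster" side of this comparison concrete, here is a minimal sketch of single-linkage clustering over pairwise similarity hits. It is an illustration only: the slide does not say how UniGene or DoTS actually form clusters, and the function name, sequence ids, and hit pairs below are all hypothetical.

```python
from collections import defaultdict


def cluster_by_hits(seq_ids, hits):
    """Single-linkage clustering: sequences connected by any chain of hits share a cluster.

    seq_ids: iterable of sequence identifiers (hypothetical names).
    hits: iterable of (id_a, id_b) pairs that passed some similarity filter.
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for s in seq_ids:
        groups[find(s)].add(s)
    # Return clusters as sorted lists, ordered by their first member.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical input: five sequences and three pairwise hits.
example_ids = ["est1", "est2", "est3", "est4", "mrna1"]
example_hits = [("est1", "est2"), ("est2", "mrna1"), ("est3", "est4")]
clusters = cluster_by_hits(example_ids, example_hits)
print(clusters)
```

A cluster built this way only says which sequences belong together. A contig assembly goes further and produces a consensus sequence per group, which is where alternative splicing and paralogs can be separated.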

Incremental Updates of DoTS Sequences
1. Incoming sequences (EST/mRNA)
2. Make quality (remove vector, polyA, NNNs), giving “quality” sequences (AssemblySequence)
3. Block with RepeatMasker, giving blocked sequences
4. Assign to DOTS consensus sequences (blastn at 40 bp length, 92% identity)
5. Cluster incoming sequences that are not covered by a consensus sequence, giving “unassembled” clusters (consensus sequences and new)
6. Assemble DOTS consensus sequences and incoming sequences with CAP4 (initially reassemble), giving CAP4 assemblies
7. Calculate new DOTS consensus sequence using weighted consensus sequence(s) and new CAP4 assembly, giving new consensus sequences
8. Update GUS database
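The assignment step in this pipeline (blastn at 40 bp length, 92% identity) can be sketched as a simple filter over alignment hits. This is a sketch under stated assumptions, not the DoTS code: the `Hit` record, the `assign_incoming` function, the `DT.n` identifiers, and the choice to treat both thresholds as inclusive are all hypothetical; only the two threshold values come from the slide.

```python
from dataclasses import dataclass

MIN_ALIGN_LEN = 40     # bp, from the slide
MIN_IDENTITY = 92.0    # percent, from the slide


@dataclass
class Hit:
    """One alignment between an incoming sequence and an existing consensus (hypothetical record)."""
    query_id: str        # incoming EST/mRNA
    subject_id: str      # existing consensus sequence
    align_len: int       # aligned length in bp
    pct_identity: float  # percent identity over the alignment


def assign_incoming(incoming_ids, hits):
    """Split incoming sequences into those covered by a consensus and those left over.

    Assumption: a hit counts if it meets both thresholds (inclusive).
    Returns (assigned, unassigned), where assigned maps query id to a set of consensus ids.
    """
    assigned = {}
    for h in hits:
        if h.align_len >= MIN_ALIGN_LEN and h.pct_identity >= MIN_IDENTITY:
            assigned.setdefault(h.query_id, set()).add(h.subject_id)
    unassigned = [q for q in incoming_ids if q not in assigned]
    return assigned, unassigned


# Hypothetical example data.
incoming = ["est1", "est2", "est3", "est4", "est5"]
hits = [
    Hit("est1", "DT.1", 120, 98.5),  # passes both thresholds
    Hit("est2", "DT.1", 35, 99.0),   # alignment too short
    Hit("est3", "DT.2", 200, 90.0),  # identity too low
    Hit("est4", "DT.2", 60, 92.0),   # exactly at the identity threshold
]
assigned, unassigned = assign_incoming(incoming, hits)
print(assigned)
print(unassigned)
```

In the pipeline described on the slide, the unassigned sequences would then be clustered among themselves, and both groups would be passed on to CAP4 for assembly.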

Assembled Transcripts
Human:
- About 3 million human EST and mRNA sequences used
- Combined into 797,028 assemblies
- Cluster into 150,006 “genes”
- Can identify a protein for 76,771 genes
- And predict a function for 24,127 genes
Mouse:
- About 2 million mouse EST and mRNA sequences used
- Combined into 355,770 assemblies
- Cluster into 74,024 “genes”
- Can identify a protein for 34,008 genes
- And predict a function for 15,403 genes
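The counts on this slide are easier to compare as ratios. The short calculation below derives assemblies per "gene" and the fraction of "genes" with an identified protein or a predicted function. The counts are taken from the slide; the dictionary layout and the `summarize` helper are illustrative names, not part of any CBIL tool.

```python
# Counts as reported on the slide.
stats = {
    "human": {"assemblies": 797_028, "genes": 150_006,
              "with_protein": 76_771, "with_function": 24_127},
    "mouse": {"assemblies": 355_770, "genes": 74_024,
              "with_protein": 34_008, "with_function": 15_403},
}


def summarize(counts):
    """Derive per-gene ratios from the raw counts (hypothetical helper)."""
    return {
        "assemblies_per_gene": counts["assemblies"] / counts["genes"],
        "protein_fraction": counts["with_protein"] / counts["genes"],
        "function_fraction": counts["with_function"] / counts["genes"],
    }


summary = {organism: summarize(counts) for organism, counts in stats.items()}
for organism, row in summary.items():
    print(
        f"{organism}: {row['assemblies_per_gene']:.1f} assemblies per gene, "
        f"{row['protein_fraction']:.1%} with a protein, "
        f"{row['function_fraction']:.1%} with a predicted function"
    )
```

On these numbers, about 51% of human and about 46% of mouse "genes" have an identified protein, while predicted functions cover roughly 16% and 21% respectively.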

Assembly Validation
- Alignment to genomic sequence via BLAST/sim4: preliminary data look good
- Assembly consistency (assemblies provide potential SNPs)
- Add BLAST sim4 figure

Bridging Fingerprint Contigs and RH Maps on Mouse Chromosome 5
Crabtree et al. Genome Research 2001
[Figure: Fingerprint Map; Chr. 5 RH Map]

Predicting Gene Ontology Functions

RAD
- Multiple labs
- Multiple biological systems
- Multiple platforms
Questions: Expressed genes? Differentially-expressed genes? Co-regulated genes? Gene pathways?

RAD: RNA Abundance Database
- Experiment, Platform, Raw Data, Processed Data, Algorithm, Metadata
- Compliant with the MGED standards

Microarray Gene Expression Database group (MGED)
International effort on microarray data standards:
- Develop standards for storing and communicating microarray-based gene expression data
  - defining the minimal information required to ensure reproducibility and verifiability of results and to facilitate data exchange (MIAME, MAGEML-MAGEDOM)
  - collecting (and where needed creating) controlled vocabularies/ontologies
  - developing standards for data comparison and normalization
The RAD schema is compliant with the minimum annotations recommended by MGED.
- MIAME: Minimum Information About a Microarray Experiment (common set of concepts that need to be captured in a database to describe gene expression experiments adequately for interpretation, reproduction or critical assessment)
- MAML: MicroArray Mark-up Language (XML Document Type Definitions of the concepts)
http://www.mged.org

Query RAD by Sample or by Experiment
- Access by experiment groups
- Sample info, ontologies
- Image info

Different Views of GUS/RAD
Focused annotation of specific organisms and biological systems (diagram, *not drawn to scale*):
- Organisms: Human, Mouse, Plasmodium falciparum
- Biological systems: Endocrine pancreas, CNS, Hematopoiesis

AllGenes

AllGenes “Erythroblast” Query

AllGenes Enhancements: Annotated Entries

AllGenes Enhancements: Genomic Data

New site: http://plasmodb.org

Functional Genomics of the Developing Endocrine Pancreas
- cDNA libraries from pancreatic tissue: consortium libraries; relevant dbEST libraries; novel genes
- Microarray studies on pancreatic tissue: genome-wide survey for genes expressed; pancreas chip
- Validated sequences of interest; novel sequences from libraries

www.cbil.upenn.edu/EPConDB

EPConDB: Content and Features
- Pancreas clone sets
  - Panc Chip clone sets 1.0, 1.5, 2.0
  - Transcripts found in consortium libraries
  - Novel transcripts discovered from consortium libraries
- Microarray results
  - Using Incyte’s GEM (genome-wide survey)
  - Using Panc Chip
- Genes expressed in pancreas
  - AllGenes queries: function, chromosomal location, name, accession
  - Pathways

EPConDB Architecture
- GUS: sequence & annotation; gene index (ESTs and mRNAs)
- RAD: microarray expression data; experimental annotation
- Relational DB (Oracle) with Perl object layer

EPConDB Pathway query

EPConDB Boolean Query

EPConDB History Query

EPConDB: Future Developments
- Add more microarray results
- Provide tools for microarray analysis
- Provide genomic alignments
- Provide tools for analysis of (putative) promoters

Microarray Analysis: Xcluster
Xcluster provided by Gavin Sherlock

Microarray Analysis: R statistics
SMA R package from Terry Speed’s group

Microarray Analysis: PaGE

Future EPConDB Query Result

Microarray Analysis: Data download

RAD / GUS (diagram)
- EST clustering and assembly
- Genomic alignment and comparative sequence analysis
- Identify shared TF binding sites: TESS (Transcription Element Search Software), PROM-REC (Promoter recognition)

Summary
- EpoDB provides high quality genes for sequence analysis
  - But is limited in scope
- AllGenes provides the entire transcriptome for a wide variety of human and mouse tissues
  - Needs to provide high quality genes
- PlasmoDB provides the entire Plasmodium genome
  - Integrating EST, SAGE, and microarray data
- EPConDB provides integration of EST and microarray gene expression data for a specific system
  - Will provide microarray analysis

Acknowledgements
http://www.cbil.upenn.edu
CBIL: Chris Overton, Chris Stoeckert, Vladimir Babenko, Brian Brunk, Jonathan Crabtree, Sharon Diskin, Greg Grant, Yuri Kondrakhin, Georgi Kostov, Phil Le, Li Li, Junmin Liu, Elisabetta Manduchi, Joan Mazzarelli, Shannon McWeeney, Debbie Pinney, Angel Pizarro, Jonathan Schug, Fidel Salas, Juergen Haas
Annotation collaborators: Nikolay Kolchanov, Alexey Katohkin
EPConDB collaborators: Klaus Kaestner, Marie Scearce; Doug Melton, Harvard; Alan Permutt, Wash. U
Comparative Sequence Analysis Collaborators: Maja Bucan, Shaying Zhao; Whitehead/MIT Center for Genome Research
PlasmoDB collaborators: David Roos, Martin Fraunholz, Jesse Kissinger, Jules Milgram; Ross Koppel, Monash U.; Malarial Genome Sequencing Consortium (Sanger Centre, Stanford U., TIGR/NMRC)