From EpoDB to EPConDB: Adventures in Gene Expression Databases


1 From EpoDB to EPConDB: Adventures in Gene Expression Databases
Chris Stoeckert, Ph.D. Computational Biology and Informatics Laboratory

2 EpoDB: A Prototype Database for the Analysis of Genes Expressed During Vertebrate Erythropoiesis
Retrieve all adult, mammalian β-globin gene transcription units.
Retrieve the proximal promoter regions from all genes with a significant change in expression between the BFU stage and erythroblast stage.
Display the predicted transcription factor binding sites in the proximal promoter region of the mouse β-globin gene.
Stoeckert, Salas, Brunk, Overton (1999) Nucl. Acids Res. 26:288

3 EpoDB: A Data Warehouse Formed by Information Integration
TRANSCRIPTION UNIT (EpoDB)
GENE STRUCTURE: GenBank
GENE FUNCTION: Swiss-Prot
GENE REGULATION: Transfac, TRRD
GENE EXPRESSION: GERD
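The integration on this slide can be pictured as collecting per-source records around one transcription unit. The following is a minimal sketch of that idea only, not EpoDB's actual implementation (which the talk describes as Prolog-based). The class `TranscriptionUnit`, the function `integrate`, and the toy record values are all hypothetical; only the mapping of aspects to source databases comes from the slide.

```python
from dataclasses import dataclass, field

# Which source database feeds which aspect of a transcription unit (from the slide).
SOURCES_FOR_ASPECT = {
    "structure": ["GenBank"],
    "function": ["Swiss-Prot"],
    "regulation": ["Transfac", "TRRD"],
    "expression": ["GERD"],
}


@dataclass
class TranscriptionUnit:
    """Hypothetical container: one gene plus whatever each source knows about it."""
    gene: str
    annotations: dict = field(default_factory=dict)  # aspect -> list of (source, record)


def integrate(gene, records_by_source):
    """Collect the per-source records for one gene into a single transcription unit."""
    unit = TranscriptionUnit(gene=gene)
    for aspect, sources in SOURCES_FOR_ASPECT.items():
        hits = [(s, records_by_source[s]) for s in sources if s in records_by_source]
        if hits:
            unit.annotations[aspect] = hits
    return unit


# Toy input: placeholder strings, not real database entries.
example = integrate("beta-globin", {
    "GenBank": "exon/intron coordinates",
    "TRRD": "promoter binding sites",
})
print(example.annotations)
```

Aspects with no contributing source are simply left out of `annotations`, so a caller can see at a glance which kinds of information are still missing for a gene.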

4 EpoDB Transcription Unit

5 Populating EpoDB
[Table: entries per source database (GenBank, Swiss-Prot, TRRD, GERD), with rows "Total" and "Yes"; the counts are not preserved in this transcript]

6 EpoDB Highlights
Reference sequences
Controlled vocabularies
“gene ontology”
Specified sequence retrieval
All vertebrates
Controlled vocabularies
Gene names and family names
Experiment descriptions
Gene warehouse
GenBank, SWISS-PROT, Transfac, literature
But no ESTs!

7

8 EpoDB Gene Landmark Query

9 EpoDB limitations
Not scalable
  Manual triage of keyword-selected entries
  Manual selection of reference genes
Did not handle data from high throughput technologies
  No ESTs
  Could not represent microarrays, SAGE
  Could not make use of unannotated genomic sequence
Difficult to administer
  Prolog-based

10 Chronology of CBIL Systems
[Timeline, 1995 to 2001, with two tracks (Gene Expression; Sequence Annotation): EpoDB, GAIA, ParaDB, DoTS, GUS, RAD, EPConDB, PlasmoDB, AllGenes]
GUS, RAD: Relational DB (Oracle) with Perl object layer

11 CBIL Project Architecture
GUS: Sequence & annotation; Gene index (ESTs and mRNAs)
RAD: Microarray expression data; experimental annotation
Relational DB (Oracle) with Perl object layer

12 GUS: Genomics Unified Schema
Controlled vocabs. (GO, Species, Tissue, Dev. Stage); free text
Genomic Sequence: Genes, gene models; STSs, repeats, etc; Cross-species analysis
Transcribed Sequence (DOTS): Characterize transcripts; RH mapping; Library analysis; Cross-species analysis
Transcript Expression (RAD, RNA Abundance DB): Arrays; SAGE; Conditions
Special Features: Ownership; Protection; Algorithm; Evidence; Similarity; Versioning
Protein Sequence: Domains; Function; Structure; Cross-species analysis
Pathways, Networks (under development): Representation; Reconstruction

13 GUS Object View
Gene: Gene Instance, Gene Feature, Genomic Sequence
RNA: RNA Instance, RNA Feature, RNA Sequence
Protein: Protein Instance, Protein Feature, Protein Sequence
NA; AA Sequence, AA Feature

14 Clusters vs. Contig Assemblies
UniGene: BLAST: Clusters of ESTs & mRNAs
Transcribed Sequences (DOTS): CAP4 (Paracel): Consensus Sequences
  Alternative splicing
  Paralogs

15 Incremental Updates of DoTS Sequences
Incoming Sequences (EST/mRNA)
Make Quality (remove vector, polyA, NNNs) → “Quality” sequences (AssemblySequence)
Block with RepeatMasker → Blocked sequences
Assign to DOTS consensus sequences (blastn at 40 bp length, 92% identity)
Cluster incoming sequences that are not covered by consensus sequence → “Unassembled” clusters (consensus sequences and new)
Assemble DOTS consensus sequences and incoming sequences with CAP4 (initially reassemble) → CAP4 assemblies
Calculate new DOTS consensus sequence using weighted consensus sequence(s) and new CAP4 assembly → New Consensus sequences
Update GUS database
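The assignment step in this pipeline can be sketched as a simple filter over BLAST hits. This is an illustration under stated assumptions, not the DoTS code: it assumes "40 bp length, 92% identity" are minimum thresholds on the alignment, that hits have already been parsed into tuples, and the function name, tuple layout, and identifiers such as `DT.100` are hypothetical.

```python
MIN_ALIGN_LEN = 40    # bp, threshold quoted on the slide
MIN_IDENTITY = 92.0   # percent, threshold quoted on the slide


def assign_to_consensus(incoming_ids, hits):
    """Split incoming sequences into assigned and unassigned sets.

    hits: iterable of (query_id, consensus_id, align_len, pct_identity),
    a simplified stand-in for parsed blastn output (assumed layout).
    """
    assigned = {}  # query_id -> set of consensus ids it matched
    for query_id, consensus_id, align_len, pct_identity in hits:
        if align_len >= MIN_ALIGN_LEN and pct_identity >= MIN_IDENTITY:
            assigned.setdefault(query_id, set()).add(consensus_id)
    # Anything without a qualifying hit goes on to be clustered separately.
    unassigned = [q for q in incoming_ids if q not in assigned]
    return assigned, unassigned


incoming = ["est1", "est2", "est3"]
hits = [
    ("est1", "DT.100", 120, 98.5),
    ("est2", "DT.100", 35, 99.0),   # alignment too short
    ("est3", "DT.200", 300, 90.0),  # identity too low
]
assigned, unassigned = assign_to_consensus(incoming, hits)
print(assigned, unassigned)
```

In the flow described on the slide, the unassigned sequences would then be clustered among themselves, and both groups passed to CAP4 for assembly.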

16 Assembled Transcripts
Human:
  About 3 million human EST and mRNA sequences used
  Combined into 797,028 assemblies
  Cluster into 150,006 “genes”
  Can identify a protein for 76,771 genes
  And predict a function for 24,127 genes
Mouse:
  About 2 million mouse EST and mRNA sequences used
  Combined into 355,770 assemblies
  Cluster into 74,024 “genes”
  Can identify a protein for 34,008 genes
  And predict a function for 15,403 genes
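To put the counts on this slide in proportion, the ratios can be worked out directly. The short calculation below uses only the numbers quoted on the slide; the dictionary keys and the `summarize` helper are names chosen here for illustration.

```python
# Counts copied from the slide.
counts = {
    "human": {"assemblies": 797_028, "genes": 150_006,
              "with_protein": 76_771, "with_function": 24_127},
    "mouse": {"assemblies": 355_770, "genes": 74_024,
              "with_protein": 34_008, "with_function": 15_403},
}


def summarize(c):
    """Derive per-gene ratios from the raw counts for one species."""
    return {
        "assemblies_per_gene": c["assemblies"] / c["genes"],
        "pct_with_protein": 100 * c["with_protein"] / c["genes"],
        "pct_with_function": 100 * c["with_function"] / c["genes"],
    }


summary = {species: summarize(c) for species, c in counts.items()}
for species, s in summary.items():
    print(f"{species}: {s['assemblies_per_gene']:.1f} assemblies per gene, "
          f"{s['pct_with_protein']:.1f}% with a protein, "
          f"{s['pct_with_function']:.1f}% with a predicted function")
```

Roughly half of the human "genes" and a little under half of the mouse "genes" have an identifiable protein, and a predicted function is available for a much smaller share in both species.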

17 Assembly Validation
Alignment to genomic sequence via BLAST/sim4: preliminary data look good
Assembly consistency (assemblies provide potential SNPs)
Add BLAST sim4 figure

18 Bridging Fingerprint Contigs and RH Maps on Mouse Chromosome 5
Crabtree et al. Genome Research 2001
[Figure: Fingerprint Map; Chr. 5 RH Map]

19 Predicting Gene Ontology Functions

20 RAD
Multiple labs, multiple biological systems, multiple platforms
Expressed genes? Differentially-expressed genes? Co-regulated genes? Gene pathways?

21 RAD: RNA Abundance Database
Experiment, Platform, Raw Data, Processed Data, Algorithm, Metadata
Compliant with the MGED standards

22 Microarray Gene Expression Database group (MGED)
International effort on microarray data standards:
Develop standards for storing and communicating microarray-based gene expression data
  defining the minimal information required to ensure reproducibility and verifiability of results and to facilitate data exchange (MIAME, MAGEML-MAGEDOM)
  collecting (and where needed creating) controlled vocabularies/ontologies
  developing standards for data comparison and normalization
The schema is compliant with the minimum annotations recommended by MGED.
MIAME: Minimum Information About a Microarray Experiment (common set of concepts that need to be captured in a database to describe gene expression experiments adequately for interpretation, reproduction or critical assessment).
MAML: MicroArray Mark-up Language (XML Document Type Definitions of the concepts).
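The idea of a "minimum annotation" requirement can be illustrated with a small completeness check. This sketch is not the MIAME checklist and not RAD's schema: as an assumption, it borrows the component names from the RAD slide (Experiment, Platform, Raw Data, Processed Data, Algorithm) as required sections, and the function name and submission layout are hypothetical.

```python
# Assumed required sections, named after the RAD components listed in the talk.
REQUIRED_SECTIONS = ("experiment", "platform", "raw_data", "processed_data", "algorithm")


def missing_sections(submission):
    """Return the required sections that are absent or empty in a submission dict."""
    return [name for name in REQUIRED_SECTIONS if not submission.get(name)]


# Toy submission with placeholder values.
submission = {
    "experiment": {"description": "placeholder"},
    "platform": {"type": "cDNA microarray"},
    "raw_data": ["spot intensities"],
    "processed_data": [],   # present but empty
    "algorithm": None,      # not yet recorded
}
missing = missing_sections(submission)
print(missing)
```

A database enforcing minimum annotations would reject or flag a submission until such a list is empty, which is what makes independent interpretation and reproduction of the experiment possible.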

23 Query RAD by Sample or by Experiment
Access by Experiment groups Sample info ontologies Image info

24

25 Different Views of GUS/RAD
Focused annotation of specific organisms and biological systems
Organisms: Human, Mouse, Plasmodium falciparum
Biological systems: Endocrine pancreas, CNS, Hematopoiesis
[Diagram of views onto GUS; *not drawn to scale*]

26 AllGenes

27 AllGenes “Erythroblast” Query

28 AllGenes Enhancements: Annotated Entries

29 AllGenes Enhancements: Genomic Data

30 New site

31

32 Functional Genomics of the Developing Endocrine Pancreas
cDNA libraries from pancreatic tissue
  Consortium libraries: novel genes
  Relevant dbEST libraries
Microarray studies on pancreatic tissue
  Genome-wide survey for genes expressed
  Pancreas chip
    Validated sequences of interest
    Novel sequences from libraries

33

34 EPConDB: Content and Features
Pancreas clone sets
  Panc Chip Clone sets 1.0, 1.5, 2.0
  Transcripts found in consortium libraries
  Novel transcripts discovered from consortium libraries
Microarray results
  Using Incyte’s GEM (genome-wide survey)
  Using Panc Chip
Genes expressed in pancreas
  AllGenes queries: function, chromosomal location, name, accession
  Pathways

35 EPConDB Architecture
GUS: Sequence & annotation; Gene index (ESTs and mRNAs)
RAD: Microarray expression data; experimental annotation
Relational DB (Oracle) with Perl object layer

36 EPConDB Pathway query

37 EPConDB Boolean Query

38 EPConDB History Query

39 EPConDB: Future Developments
Add more microarray results
Provide tools for microarray analysis
Provide genomic alignments
Provide tools for analysis of (putative) promoters

40 Microarray Analysis: Xcluster
Xcluster provided by Gavin Sherlock

41 Microarray Analysis: R statistics
SMA R package from Terry Speed’s group

42 Microarray Analysis: PaGE

43 Future EPConDB Query Result

44 Microarray Analysis: Data download

45 [Diagram: RAD and GUS feeding promoter analysis]
EST clustering and assembly
Genomic alignment and comparative sequence analysis
TESS (Transcription Element Search Software)
PROM-REC (Promoter recognition)
Identify shared TF binding sites

46 Summary
EpoDB provides high quality genes for sequence analysis
  But is limited in scope
AllGenes provides the entire transcriptome for a wide variety of human and mouse tissues
  Needs to provide high quality genes
PlasmoDB provides the entire Plasmodium genome
  Integrating EST, SAGE, and microarray data
EPConDB provides integration of EST and microarray gene expression data for a specific system
  Will provide microarray analysis

47 Acknowledgements
http://www.cbil.upenn.edu
CBIL: Chris Overton, Chris Stoeckert, Vladimir Babenko, Brian Brunk, Jonathan Crabtree, Sharon Diskin, Greg Grant, Yuri Kondrakhin, Georgi Kostov, Phil Le, Li Li, Junmin Liu, Elisabetta Manduchi, Joan Mazzarelli, Shannon McWeeney, Debbie Pinney, Angel Pizarro, Jonathan Schug, Fidel Salas, Juergen Haas
Annotation collaborators: Nikolay Kolchanov, Alexey Katohkin
EPConDB collaborators: Klaus Kaestner, Marie Scearce; Doug Melton, Harvard; Alan Permutt, Wash. U
Comparative Sequence Analysis Collaborators: Maja Bucan, Shaying Zhao; Whitehead/MIT Center for Genome Research
PlasmoDB collaborators: David Roos, Martin Fraunholz, Jesse Kissinger, Jules Milgram; Ross Koppel, Monash U.; Malarial Genome Sequencing Consortium (Sanger Centre, Stanford U., TIGR/NMRC)

