Integrating Genomic Databases


1 Integrating Genomic Databases
Chris Stoeckert, Ph.D.
Computational Biology and Informatics Laboratory

2 Talk Outline
Challenge of integrating biological data
Federations vs warehouses
GUS/RAD - warehouse approach
K2 - connecting to other systems

3 Challenge of Integrating Biological Data
Many sources of different types
Different types of data: biological sequence (DNA, RNA, protein), gene expression, structure, etc.
Different representations of data: flat file, relational, object-oriented
Imposing semantics of biology: genes, RNAs, and proteins are related but may have different names; biology is context dependent

4 Examples of Different Sources and Types
[Schema diagram. Entities shown: Experiment, ExpGroups, Groups, Exp.ControlGenes, ControlGenes, Hybridization Conditions, Label, Sample, Treatment, Disease, Devel. Stage, ExperimentSample, Anatomy, Taxon]

5 Different Technologies for the Same Data Type

6 Why Bother to Integrate?
Remember the fable of the blind men and the elephant!

7 Federations vs Warehouses
Federations: link to everybody; always current; generally stuck with data as is
Warehouses: bring everything in house; can cleanse and add value to integrated data; staying up to date is the challenge
Davidson et al. IBM Systems Journal 2001
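The trade-off on this slide can be sketched in a few lines. This is a toy illustration, not code from GUS or K2: the two "remote sources" are plain Python dictionaries, and every name in it (`SOURCE_A`, `SOURCE_B`, `federated_lookup`, `Warehouse`) is hypothetical.

```python
# Toy contrast between a federation and a warehouse.
# Everything here is hypothetical: the "remote sources" are in-memory dicts.

# Two independent sources that name the same gene differently.
SOURCE_A = {"Ins1": {"species": "mouse", "product": "insulin I"}}
SOURCE_B = {"ins1 ": {"expression": "pancreas"}}  # note case and stray space


def federated_lookup(gene):
    """Federation: ask every source at query time and take the data as is."""
    hits = {}
    for name, source in (("A", SOURCE_A), ("B", SOURCE_B)):
        if gene in source:
            hits[name] = source[gene]
    return hits


class Warehouse:
    """Warehouse: copy everything in house, cleansing names on load."""

    def __init__(self):
        self.records = {}

    @staticmethod
    def clean(name):
        return name.strip().lower()

    def load(self, *sources):
        for source in sources:
            for name, fields in source.items():
                # Merge fields from all sources under one cleansed name.
                self.records.setdefault(self.clean(name), {}).update(fields)

    def lookup(self, gene):
        return self.records.get(self.clean(gene), {})


warehouse = Warehouse()
warehouse.load(SOURCE_A, SOURCE_B)

# A source changes after the warehouse was loaded.
SOURCE_A["Ins1"]["product"] = "insulin-1"

federated = federated_lookup("Ins1")
local = warehouse.lookup("Ins1")
```

In this sketch the federated lookup sees the updated product name but misses the record in the second source because the names do not match, while the warehouse has merged both records under one cleansed name but still holds the older product name until it is reloaded.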

8 View and Warehouse Integration

9 GUS/RAD - Warehouse Approach
Gene discovery: EST analysis, genomic sequence analysis
Gene regulation: microarray analysis, promoter/regulatory region analysis
Biological data representation: data integration, ontology

10 Computational Biology and Informatics Laboratory October, 2001

11 GUS: Genomics Unified Schema
Free text; controlled vocabs.: GO, species, tissue, dev. stage
Genomic Sequence: genes, gene models, STSs, repeats, etc.; cross-species analysis
Transcribed Sequence (DOTS): characterize transcripts, RH mapping, library analysis, cross-species analysis
Transcript Expression (RAD, RNA Abundance DB): arrays, SAGE, conditions
Special Features: ownership, protection, algorithm, evidence, similarity, versioning
Protein Sequence: domains, function, structure, cross-species analysis
Pathways, Networks: representation, reconstruction
[Diagram also carries the label "under development"]

12 RAD: RNA Abundance Database
Experiment, Platform, Raw Data, Processed Data, Algorithm, Metadata
Compliant with the MGED standards

13 Clusters vs. Contig Assemblies
UniGene: BLAST; clusters of ESTs & mRNAs
Transcribed Sequences (DOTS): CAP4 (Paracel); consensus sequences, alternative splicing, paralogs
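To make the "clusters" side of this comparison concrete, here is a minimal sketch of single-linkage clustering over pairwise similarity hits, the kind of grouping a BLAST-based pipeline could produce. It is an illustration only, not the UniGene or DOTS procedure; the sequence IDs and the `hits` list are invented, and the assembly step (building consensus sequences) is not shown.

```python
# Single-linkage clustering of sequences from pairwise similarity hits,
# using union-find. Sequence IDs and hits below are made up for illustration.

def cluster(ids, hits):
    """Group ids so that any two linked by a chain of hits share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    # Sort members and clusters so the result is deterministic.
    return sorted(sorted(members) for members in groups.values())


ids = ["est1", "est2", "est3", "mrna1", "est4"]
hits = [("est1", "est2"), ("est2", "mrna1"), ("est3", "est4")]
clusters = cluster(ids, hits)
print(clusters)
```

A cluster only records which sequences belong together; an assembler additionally aligns the members and derives consensus sequences from them.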

14 Crabtree et al. Genome Research 2001
Bridging Fingerprint Contigs and RH Maps on Mouse Chromosome 5
[Figure labels: Fingerprint Map, Chr. 5, RH Map]

15 RAD, GUS
[Diagram labels: EST clustering and assembly; genomic alignment and comparative sequence analysis; identify shared TF binding sites; TESS (Transcription Element Search Software); PROM-REC (Promoter recognition)]

16 Assembled Transcripts
Human: about 3 million EST and mRNA sequences used; combined into 797,028 assemblies; clustered into 150,006 “genes”; can identify a protein for 76,771 genes and predict a function for 24,127 genes
Mouse: about 2 million EST and mRNA sequences used; combined into 355,770 assemblies; clustered into 74,024 “genes”; can identify a protein for 34,008 genes and predict a function for 15,403 genes
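The counts on this slide are easier to compare as ratios. The short script below derives them directly from the slide's numbers; the dictionary keys and the `summarize` helper are names chosen for this sketch.

```python
# Ratios derived from the counts on the slide. Names are illustrative.

COUNTS = {
    "human": {"assemblies": 797_028, "genes": 150_006,
              "with_protein": 76_771, "with_function": 24_127},
    "mouse": {"assemblies": 355_770, "genes": 74_024,
              "with_protein": 34_008, "with_function": 15_403},
}


def summarize(c):
    """Return assemblies per gene and percentages of genes annotated."""
    return {
        "assemblies_per_gene": round(c["assemblies"] / c["genes"], 2),
        "pct_with_protein": round(100 * c["with_protein"] / c["genes"], 1),
        "pct_with_function": round(100 * c["with_function"] / c["genes"], 1),
    }


summary = {species: summarize(c) for species, c in COUNTS.items()}
for species, stats in summary.items():
    print(species, stats)
```

By these numbers, a protein can be identified for about 51.2% of the human "genes" and 45.9% of the mouse "genes", and a function can be predicted for about 16.1% and 20.8% respectively.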

17 CBIL Project Architecture
GUS: sequence & annotation; gene index (ESTs and mRNAs)
RAD: microarray expression data; experimental annotation
Relational DB (Oracle) with Perl object layer

18 AllGenes

19 AllGenes Enhancements: Genomic Data

20 New site

21

22 EPConDB Pathway query

23 View and Warehouse Integration

24 K2 - connecting to other systems

25 Linking GUS to Other Sources
Neurocartographer, K2, Medline
Example query: What papers have been published on genes that are expressed in this part of the brain?
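The example question is a join across three kinds of source: brain-region expression, gene records, and literature. The sketch below shows the shape of such a query with three in-memory stand-ins. It does not use K2, Neurocartographer, GUS, or Medline interfaces, and all data, field names, and the function `papers_for_region` are invented for illustration.

```python
# Shape of a cross-source query: region -> genes -> papers.
# All three "sources" are invented stand-ins, not real interfaces or data.

# Stand-in for an anatomy source: which genes are expressed in which region.
EXPRESSION = [
    {"region": "hippocampus", "gene_id": "g1"},
    {"region": "hippocampus", "gene_id": "g2"},
    {"region": "cerebellum", "gene_id": "g3"},
]

# Stand-in for a gene warehouse: gene IDs mapped to gene symbols.
GENES = {"g1": "GeneA", "g2": "GeneB", "g3": "GeneC"}

# Stand-in for a literature source: paper titles indexed by gene symbol.
PAPERS = {
    "GeneA": ["Paper 1", "Paper 2"],
    "GeneC": ["Paper 3"],
}


def papers_for_region(region):
    """Return {gene symbol: [paper titles]} for genes expressed in region."""
    result = {}
    for row in EXPRESSION:
        if row["region"] != region:
            continue
        symbol = GENES.get(row["gene_id"])
        if symbol is None:
            continue  # gene unknown to the gene source
        result[symbol] = PAPERS.get(symbol, [])
    return result


answer = papers_for_region("hippocampus")
print(answer)
```

The middle step matters: the anatomy source and the literature source share no identifier, so the gene records act as the bridge between them.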

26 Acknowledgements
http://www.cbil.upenn.edu
CBIL: Chris Stoeckert; Vladimir Babenko; Brian Brunk; Jonathan Crabtree; Sharon Diskin; Greg Grant; Yuri Kondrakhin; Georgi Kostov; Phil Le; Li Li; Junmin Liu; Elisabetta Manduchi; Joan Mazzarelli; Shannon McWeeney; Debbie Pinney; Angel Pizarro; Jonathan Schug
PlasmoDB collaborators: David Roos; Martin Fraunholz; Jesse Kissinger; Jules Milgram; Ross Koppel, Monash U.; Malarial Genome Sequencing Consortium (Sanger Centre, Stanford U., TIGR/NMRC)
EPConDB collaborators: Klaus Kaestner; Marie Scearce; Doug Melton, Harvard; Alan Permutt, Wash. U
Comparative Sequence Analysis collaborators: Maja Bucan; Shaying Zhao; Whitehead/MIT Center for Genome Research
K2/DARPA: Sue Davidson; Scott Harker; Jonathan Nissanov; Carl Gustafson

