Integrating Genomic Databases

Integrating Genomic Databases
Chris Stoeckert, Ph.D.
Computational Biology and Informatics Laboratory

Talk Outline
- Challenge of integrating biological data
- Federations vs warehouses
- GUS/RAD - warehouse approach
- K2 - connecting to other systems

Challenge of Integrating Biological Data
- Many sources of different types
- Different types of data
  - Biological sequence (DNA, RNA, protein)
  - Gene expression
  - Structure
  - Etc.
- Different representations of data
  - Flat file
  - Relational
  - Object-oriented
- Imposing semantics of biology
  - Genes, RNAs, and proteins are related
  - But may have different names
  - Biology is context dependent

Examples of Different Sources and Types
[Figure: schema diagram. Labels: Experiment, ExpGroups, Groups, Exp.ControlGenes, ControlGenes, Hybridization Conditions, Label, Sample, Treatment, Disease, Devel. Stage, ExperimentSample, Anatomy, Taxon]

Different Technologies for the Same Data Type

Why Bother to Integrate?
Remember the fable of the blind men and the elephant!
http://www.noogenesis.com/pineapple/blind_men_elephant.html

Federations vs Warehouses
Federations
- Link to everybody
- Always current
- Generally stuck with data as is
Warehouses
- Bring everything in house
- Can cleanse and add value to integrated data
- Must work to stay up to date
Davidson et al. IBM Systems Journal 2001
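
The trade-off on this slide can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Source`, `Federation`, and `Warehouse` classes, the gene identifiers, and the cleansing rule are illustrations and not part of GUS, K2, or any system named in the talk.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """A hypothetical remote data source: gene ID -> gene symbol."""
    name: str
    records: dict


class Federation:
    """Queries every source at lookup time: always current, data used as is."""

    def __init__(self, sources):
        self.sources = sources

    def lookup(self, gene_id):
        return {s.name: s.records[gene_id]
                for s in self.sources if gene_id in s.records}


class Warehouse:
    """Copies data in house and cleanses it on load; stale until reloaded."""

    def __init__(self):
        self.store = {}

    def load(self, sources):
        self.store.clear()  # rebuild the local snapshot from scratch
        for s in sources:
            for gene_id, symbol in s.records.items():
                # Illustrative cleansing step: normalise whitespace and case.
                cleaned = symbol.strip().upper()
                self.store.setdefault(gene_id, {})[s.name] = cleaned

    def lookup(self, gene_id):
        return self.store.get(gene_id, {})
```

If a source changes after `Warehouse.load` has run, `Federation.lookup` sees the change immediately but returns the raw value, while `Warehouse.lookup` keeps returning the cleansed snapshot until the next load.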

View and Warehouse Integration

GUS/RAD - Warehouse Approach
- Gene Discovery
  - EST analysis
  - Genomic sequence analysis
- Gene Regulation
  - Microarray analysis
  - Promoter/regulatory region analysis
- Biological data representation
  - Data integration
  - Ontology

Computational Biology and Informatics Laboratory October, 2001

GUS: Genomics Unified Schema
[Figure: schema overview. Labels: free text; Controlled vocabs. (GO, Species, Tissue, Dev. Stage); Genomic Sequence (Genes, gene models; STSs, repeats, etc; Cross-species analysis); RAD RNA Abundance DB; DOTS Transcribed Sequence (Characterize transcripts; RH mapping; Library analysis; Cross-species analysis); Special Features (Ownership, Protection, Algorithm, Evidence, Similarity, Versioning); Transcript Expression (Arrays, SAGE, Conditions); Protein Sequence (Domains, Function, Structure, Cross-species analysis); Pathways, Networks (Representation, Reconstruction); under development]

RAD: RNA Abundance Database
- Experiment
- Platform
- Raw Data
- Processed Data
- Algorithm
- Metadata
Compliant with the MGED standards

Clusters vs. Contig Assemblies
UniGene
- BLAST: Clusters of ESTs & mRNAs
Transcribed Sequences (DOTS)
- CAP4 (Paracel):
  - Consensus Sequences
  - Alternative splicing
  - Paralogs
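
To make the "clusters" half of this comparison concrete, here is a minimal sketch of single-linkage clustering over pairwise similarity hits, using a union-find structure. It is an illustration only and not the procedure UniGene or DOTS actually uses; the function name, the sequence IDs, and the hit list are hypothetical. The sketch groups sequences but does not build a consensus sequence, which is the extra step a contig assembler such as CAP4 performs.

```python
def cluster_by_similarity(seq_ids, hits):
    """Group sequences so that any two linked by a chain of hits share a cluster.

    seq_ids: iterable of sequence identifiers.
    hits: iterable of (id_a, id_b) pairs judged similar (e.g. by an alignment tool).
    """
    seq_ids = list(seq_ids)
    parent = {s: s for s in seq_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for s in seq_ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())
```

Because membership is transitive, one spurious hit can merge two unrelated clusters, so a real pipeline would have to filter hits before a step like this.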

Bridging Fingerprint Contigs and RH Maps on Mouse Chromosome 5
[Figure: Fingerprint Map and Chr. 5 RH Map]
Crabtree et al. Genome Research 2001

[Figure: analysis workflow around RAD and GUS. Labels: EST clustering and assembly; Genomic alignment and comparative sequence analysis; Identify shared TF binding sites; TESS (Transcription Element Search Software); PROM-REC (Promoter recognition)]

Assembled Transcripts
- About 3 million human EST and mRNA sequences used
  - Combined into 797,028 assemblies
  - Cluster into 150,006 “genes”
  - Can identify a protein for 76,771 genes
  - And predict a function for 24,127 genes
- About 2 million mouse EST and mRNA sequences used
  - Combined into 355,770 assemblies
  - Cluster into 74,024 “genes”
  - Can identify a protein for 34,008 genes
  - And predict a function for 15,403 genes
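
The counts on this slide are easier to compare as ratios. The short sketch below derives them; the numbers are copied from the slide, while the dictionary layout and the `summarize` helper are hypothetical names introduced for illustration.

```python
# Counts copied from the slide; the key names are illustrative.
COUNTS = {
    "human": {"assemblies": 797_028, "genes": 150_006,
              "with_protein": 76_771, "with_function": 24_127},
    "mouse": {"assemblies": 355_770, "genes": 74_024,
              "with_protein": 34_008, "with_function": 15_403},
}


def summarize(counts):
    """Return derived ratios for one species' counts."""
    genes = counts["genes"]
    return {
        "assemblies_per_gene": counts["assemblies"] / genes,
        "protein_fraction": counts["with_protein"] / genes,
        "function_fraction": counts["with_function"] / genes,
    }


if __name__ == "__main__":
    for species, counts in COUNTS.items():
        s = summarize(counts)
        print(f"{species}: {s['assemblies_per_gene']:.2f} assemblies per gene, "
              f"{s['protein_fraction']:.1%} with a protein, "
              f"{s['function_fraction']:.1%} with a predicted function")
```

By these figures, roughly half of the gene clusters in each species have an identifiable protein, and a smaller share have a predicted function.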

CBIL Project Architecture
- GUS
  - Sequence & annotation
  - Gene index (ESTs and mRNAs)
- RAD
  - Microarray expression data
  - Experimental annotation
- Relational DB (Oracle) with Perl object layer
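
The "relational DB with object layer" pattern named on this slide can be sketched as follows. This is an illustration under loud assumptions: it uses Python and an in-memory SQLite database in place of Perl and Oracle, and the `assembly` and `expression` tables, their columns, and the `Assembly` class are hypothetical and do not reproduce the GUS or RAD schema.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE assembly (
            assembly_id INTEGER PRIMARY KEY,
            description TEXT
        );
        CREATE TABLE expression (
            assembly_id INTEGER REFERENCES assembly,
            sample TEXT,
            level REAL
        );
    """)
    conn.execute("INSERT INTO assembly VALUES (1, 'example transcript')")
    conn.executemany(
        "INSERT INTO expression VALUES (?, ?, ?)",
        [(1, "liver", 2.5), (1, "brain", 0.5)],
    )
    return conn


class Assembly:
    """Object wrapper for one row of the hypothetical `assembly` table."""

    def __init__(self, conn, assembly_id):
        self.conn = conn
        self.assembly_id = assembly_id

    @property
    def description(self):
        row = self.conn.execute(
            "SELECT description FROM assembly WHERE assembly_id = ?",
            (self.assembly_id,),
        ).fetchone()
        return row[0] if row else None

    def expression(self):
        """Return {sample: level} for this assembly from the expression table."""
        rows = self.conn.execute(
            "SELECT sample, level FROM expression WHERE assembly_id = ?",
            (self.assembly_id,),
        )
        return dict(rows.fetchall())
```

Callers work with `Assembly` objects and never write SQL themselves, which is the point of putting an object layer over the sequence and expression tables.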

AllGenes

AllGenes Enhancements: Genomic Data

http://plasmodb.org New site

EPConDB Pathway query

View and Warehouse Integration

K2 - connecting to other systems

Linking GUS to Other Sources
[Figure: Neurocartographer, K2, Medline]
Example query: What papers have been published on genes that are expressed in this part of the brain?
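
The example query spans three kinds of source: a brain atlas, an expression database, and a literature index. The sketch below shows the shape of such a cross-source join using hand-made stand-in data. It does not use K2's query language or the real interfaces of Neurocartographer, GUS, or Medline; every identifier, record, and function name in it is hypothetical.

```python
# Hypothetical stand-ins for the three sources.
REGION_TO_ANATOMY = {"hippocampus": "ANAT:001"}   # atlas: region name -> anatomy term
EXPRESSION = [                                    # expression DB: (gene, anatomy term)
    ("geneA", "ANAT:001"),
    ("geneB", "ANAT:002"),
    ("geneC", "ANAT:001"),
]
PAPERS = [                                        # literature index: paper -> genes
    {"id": "paper-1", "genes": {"geneA"}},
    {"id": "paper-2", "genes": {"geneB"}},
    {"id": "paper-3", "genes": {"geneC", "geneB"}},
]


def papers_for_region(region):
    """Return IDs of papers about genes expressed in the given brain region."""
    anatomy_id = REGION_TO_ANATOMY.get(region)
    if anatomy_id is None:
        return []
    # Step 1: genes expressed in the selected anatomy term.
    genes = {gene for gene, anatomy in EXPRESSION if anatomy == anatomy_id}
    # Step 2: papers that mention at least one of those genes.
    return sorted(p["id"] for p in PAPERS if p["genes"] & genes)
```

Each step maps onto one source, so the join only works if the sources agree on anatomy terms and gene identifiers.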

Acknowledgements
http://www.cbil.upenn.edu
CBIL: Chris Stoeckert, Vladimir Babenko, Brian Brunk, Jonathan Crabtree, Sharon Diskin, Greg Grant, Yuri Kondrakhin, Georgi Kostov, Phil Le, Li Li, Junmin Liu, Elisabetta Manduchi, Joan Mazzarelli, Shannon McWeeney, Debbie Pinney, Angel Pizarro, Jonathan Schug
PlasmoDB collaborators: David Roos, Martin Fraunholz, Jesse Kissinger, Jules Milgram; Ross Koppel, Monash U.; Malarial Genome Sequencing Consortium (Sanger Centre, Stanford U., TIGR/NMRC)
EPConDB collaborators: Klaus Kaestner, Marie Scearce; Doug Melton, Harvard; Alan Permutt, Wash. U
Comparative Sequence Analysis Collaborators: Maja Bucan, Shaying Zhao, Whitehead/MIT Center for Genome Research
K2/DARPA: Sue Davidson, Scott Harker, Jonathan Nissanov, Carl Gustafson