The Computational Biology and Informatics Laboratory

An Informatics Framework for the Analysis of Gene Regulation and Pathways

Gene Regulation: tissue specificity; developmental regulation
Stem Cell Development: pancreatic islet cells (hematopoiesis / erythropoiesis / others)
Organismal Biology: Plasmodium falciparum; mouse chromosome 5 / neural crest

Examples of Systems Under Study
[Figure labels: Endoderm; Pancreatic anlage; Exocrine; Endocrine. Markers: PDX-1, p48, Pax4]

Knowledge Domains
Sequence / sequence annotation; gene expression experiment; proteomics, metabolomics; pathways / networks

GUS: Genomics Unified Schema
[Schema diagram labels: free text; controlled vocabs. (GO, species, tissue, dev. stage); Genomic Sequence (genes, gene models, STSs, repeats, etc.; cross-species analysis); RAD, RNA Abundance DB; DOTS, Transcribed Sequence (characterize transcripts, RH mapping, library analysis, cross-species analysis); Special Features; Transcript Expression (arrays, SAGE, conditions); ownership, protection, algorithm, evidence, similarity, versioning; under development; Protein Sequence (domains, function, structure, cross-species analysis); Pathways, Networks (representation, reconstruction)]

What is GUS?
A relational schema: over 180 tables, organized around the central dogma.
A Perl API and annotation subsystem: lightweight object layer with “plug-ins”; supports high-level programmatic access but allows SQL.
A generic user interface: Java Servlet-based (Apache JServ); supports browsing and also restricted ad-hoc queries.
A data warehouse: GenBank, dbEST, SWISS-PROT, UCSC “Golden Path”, others; Gene Ontology (GO) terms and assignments; controlled vocabularies (taxonomy, anatomy, disease state); DoTS, a database of assembled ESTs and mRNAs.

Clusters vs. Contig Assemblies
UniGene: BLAST yields clusters of ESTs and mRNAs.
Transcribed Sequences (DOTS): CAP4 yields consensus sequences (alternative splicing, paralogs).
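The clustering half of this comparison can be illustrated with a minimal sketch. It assumes pairwise similarity hits (for example from BLAST) are already available as ID pairs and groups sequences by single linkage. The function name and the sample IDs are hypothetical, and the CAP4 step that turns each cluster into consensus sequences is not reproduced here.

```python
from collections import defaultdict

def cluster_by_similarity(seq_ids, hits):
    """Group sequence IDs so that any chain of pairwise hits joins a cluster."""
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for s in seq_ids:
        clusters[find(s)].append(s)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical input: three ESTs/mRNAs linked by hits, two unlinked ESTs.
seq_ids = ["est1", "est2", "est3", "mrna1", "est4"]
hits = [("est1", "est2"), ("est2", "mrna1")]
clusters = cluster_by_similarity(seq_ids, hits)
singletons = [c for c in clusters if len(c) == 1]
print(clusters, len(singletons))
```

A cluster only says which sequences belong together. An assembler would additionally align the members and could split one cluster into several consensus sequences, which is where splice variants and paralogs become visible.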

Mouse Assemblies
Over 2 million mouse EST and mRNA sequences used (loaded into GUS as of June 1, 2001).
Combined into 367,525 assemblies.
71,602 assemblies had more than one sequence.
22 sequences on average per non-singleton assembly.
993 nt = average length of non-singleton assemblies.
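The singleton count is not stated on the slide but follows from the two assembly totals. A short check, using only the figures above:

```python
total_assemblies = 367_525
non_singletons = 71_602

# Assemblies built from exactly one sequence.
singletons = total_assemblies - non_singletons
singleton_fraction = singletons / total_assemblies

print(singletons, f"{singleton_fraction:.1%}")
```

This gives 295,923 singleton assemblies, about 80.5% of the total.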

Assembly Validation
Alignment to genomic sequence via BLAST/sim4: preliminary data look good.
Assembly consistency (assemblies provide potential SNPs).
Presenter note: add BLAST/sim4 figure.

Predicting Gene Ontology Functions

GUS Annotation Interface

Knowledge Domains
Sequence / sequence annotation; gene expression experiment; proteomics, metabolomics; pathways / networks

RAD
Multiple labs, multiple biological systems, multiple platforms.
Expressed genes? Differentially expressed genes? Co-regulated genes? Gene pathways?

RAD: RNA Abundance Database
Experiment, Platform, Raw Data, Processed Data, Algorithm, Metadata.
Compliant with the MGED standards.

Microarray Gene Expression Database group (MGED)
International effort on microarray data standards.
Develop standards for storing and communicating microarray-based gene expression data:
defining the minimal information required to ensure reproducibility and verifiability of results and to facilitate data exchange (MIAME, MAGEML-MAGEDOM);
collecting (and where needed creating) controlled vocabularies/ontologies;
developing standards for data comparison and normalization.
The schema is compliant with the minimum annotations recommended by MGED.
MIAME: Minimum Information About a Microarray Experiment (a common set of concepts that need to be captured in a database to describe gene expression experiments adequately for interpretation, reproduction, or critical assessment).
MAML: MicroArray Mark-up Language (XML Document Type Definitions of the concepts).

Experiment Tables
[Schema diagram; tables shown: Hybridization Conditions, Label, Sample, Treatment, Disease, Devel. Stage, ExperimentSample, Anatomy, Taxon, RelExperiments, Exp.ControlGenes, ControlGenes, Experiment, ExpGroups, Groups]

Query RAD by Sample or by Experiment
Access by Experiment groups Sample info ontologies Image info

Storing the Quantified Data is Just the Beginning
SpotResult/SpotFamilyResult tables: output of image analysis software; normalized data; selected data for analysis.
Analysis/Algorithm tables: analysis result (e.g., cluster #, is differentially expressed).

Different Views of GUS/RAD
Focused annotation of specific organisms and biological systems.
Organisms: Human, Mouse, Plasmodium falciparum.
Biological systems: Endocrine pancreas, CNS, Hematopoiesis.
*not drawn to scale*

New site

Contig View
OM Restriction Sites, Microsatellites, Self-BLAST, NRDB-BLAST, SAGE Tags, EST/GSS, FullPHAT, GeneFinder, GlimmerM, Annotation (chr2-TIGR)

Gene Page - I
Description, Notes, Protein Graphical View, Genomic Neighborhood GV, P. yoelii similarity, NRDB, ProDom

Protein Graphical View
BLASTP, Secondary Structure, Xmembrane, Motifs, Signal Peptides, Hydropathy

Boolean Queries

AllGenes

Assembly/RNA View

The Gabrg1-Gabra2-Gabrb1-Txk-Tec-Gsh2-Pdgfra-Kit-Kdr(Flk1)-Clock BAC contigs on Chr. 5
[Figure: BAC contig map; two regions labeled "Sequence available"]

DoTS Assemblies Can Provide A Bridge Between Radiation Hybrid and BAC Contig Maps
RH map markers: AV026557, AI848177, AI132477, AW490897, AI507113, AV038945, AV074028, C85052, AV364670, AW987574, AF026073, AF022894, AI586015, C80280, Kit

Annotation of Mouse BAC Draft Sequence: Localization of the mouse corin gene
Update?

Annotation of Kit draft sequence (232h18) Ordering and orienting pieces using conserved regions

Annotation of Kit draft sequence (232h18) Transcription Element Search System analysis
Searched entire human and mouse orthologous sequences with all TESS matrices.
Identified binding sites over- or under-represented in the conserved regions.
Conserved sites are dispersed over 150 kb.
Over-represented factors include AP2, Pax-6, S8, Oct-1, E2A, E2F-DRTF, TAL1-/E47, CdxA, Ubx, AbdB-r, Engrailed, Hairy, DFD.
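The slide does not say how over- or under-representation was scored. As one possible illustration only, the sketch below compares the share of a factor's predicted sites that fall in conserved regions with the share of sequence that is conserved, and attaches an exact binomial upper-tail probability. The function names and all counts are hypothetical, and this is not presented as the TESS procedure.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def enrichment(sites_in_conserved, sites_total, conserved_bp, total_bp):
    """Return (observed/expected ratio, upper-tail p-value) for one factor."""
    expected_fraction = conserved_bp / total_bp
    observed_fraction = sites_in_conserved / sites_total
    ratio = observed_fraction / expected_fraction
    p_over = binomial_upper_tail(sites_in_conserved, sites_total, expected_fraction)
    return ratio, p_over

# Hypothetical counts: 12 of 40 predicted sites lie in conserved regions
# that make up 15 kb of a 150 kb locus.
ratio, p_over = enrichment(12, 40, 15_000, 150_000)
print(f"ratio={ratio:.2f}, p={p_over:.2e}")
```

A ratio above 1 with a small upper-tail probability suggests over-representation; under-representation would use the lower tail instead. With many matrices tested, some multiple-testing correction would also be needed.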

Connecting Genes and Gene Expression to Pathways

The allgenes (GUS) index provides annotation of array elements in RAD
EST clustering and assembly:
Different representations of the same RNA are identified.
EST/mRNA annotations are combined.
The consensus sequence is annotated (e.g., gene function).

Creating a “pancreas chip”
Top 15% of clone signals in 2 mouse pancreas, 1 human islet, and 1 human insulinoma (GEM) array experiments, AND all the ESTs from 5 islet cDNA libraries.
Find mouse and human RNAs in GUS containing these clones/ESTs.
If human, BLAST against mouse RNAs to find the ortholog.
Output: non-redundant list of mouse RNAs; list of mouse IMAGE clone IDs.
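The selection steps above can be sketched as set operations. This is a minimal illustration under stated assumptions: the "AND" on the slide is read as combining both sources (a union), the ortholog lookup is a plain dictionary standing in for the BLAST search, and every name and identifier below is hypothetical rather than part of the GUS API.

```python
def top_fraction(signals, fraction=0.15):
    """Return the clone IDs whose signal ranks in the top `fraction` of one experiment."""
    ranked = sorted(signals, key=signals.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return set(ranked[:keep])

def build_chip_list(experiments, library_ests, clone_to_rna, rna_species, human_to_mouse):
    """Return a non-redundant, sorted list of mouse RNAs for the chip."""
    candidates = set(library_ests)
    for signals in experiments:
        candidates |= top_fraction(signals)

    mouse_rnas = set()
    for clone in candidates:
        rna = clone_to_rna.get(clone)
        if rna is None:
            continue  # clone/EST not contained in any RNA
        if rna_species[rna] == "human":
            rna = human_to_mouse.get(rna)  # assumption: stands in for the BLAST ortholog search
            if rna is None:
                continue
        mouse_rnas.add(rna)
    return sorted(mouse_rnas)

# Hypothetical toy data.
exp_mouse = {"m1": 950, "m2": 40, "m3": 30, "m4": 20, "m5": 10,
             "m6": 9, "m7": 8, "m8": 7, "m9": 6, "m10": 5}
exp_human = {"h1": 800, "h2": 50}
library_ests = ["e1", "m2"]
clone_to_rna = {"m1": "mouseRNA_A", "m2": "mouseRNA_A",
                "h1": "humanRNA_X", "e1": "mouseRNA_B"}
rna_species = {"mouseRNA_A": "mouse", "mouseRNA_B": "mouse", "humanRNA_X": "human"}
human_to_mouse = {"humanRNA_X": "mouseRNA_C"}

chip = build_chip_list([exp_mouse, exp_human], library_ests,
                       clone_to_rna, rna_species, human_to_mouse)
print(chip)
```

Collecting RNAs in a set is what makes the list non-redundant: clones `m1` and `m2` map to the same RNA and contribute one entry. The second output on the slide, the list of mouse IMAGE clone IDs, would need a further RNA-to-clone lookup that is not sketched here.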

[Diagram labels: RAD; GUS; EST clustering and assembly; TESS (Transcription Element Search Software); genomic alignment and comparative sequence analysis; identify shared TF binding sites]

Example of Systems Under Study: Pancreatic development
[Figure labels. Lineages and cells: common progenitor; endocrine progenitor; exocrine; exocrine cell; Beta/Delta; Alpha/PP; alpha cell; beta cell; delta cell; PP cell. Markers: PDX-1, Ngn3, Beta2 (NeuroD), Pax4, Nkx6.1, Isl1, Pax6, Nkx2.2, Brn 4, p48/PTF1. Products: glucagon, insulin, somatostatin, pancreatic polypeptide, amylase.]
Adapted from: Huang Tsai, J Biomed Sci 2000:7:27-34 and Jensen et al, Diabetes 2000:

Acknowledgements
CBIL: Chris Overton, Chris Stoeckert, Vladimir Babenko, Brian Brunk, Jonathan Crabtree, Sharon Diskin, Greg Grant, Yuri Kondrakhin, Georgi Kostov, Phil Le, Elisabetta Manduchi, Joan Mazzarelli, Shannon McWeeney, Debbie Pinney, Angel Pizarro, Jonathan Schug
PlasmoDB collaborators: David Roos, Martin Fraunholz, Jesse Kissinger, Jules Milgram, Ross Koppel (Monash U.), Malarial Genome Sequencing Consortium (Sanger Centre, Stanford U., TIGR/NMRC)
Allgenes.org collaborators: Ed Uberbacher (ORNL), Doug Hyatt (ORNL)
EPConDB collaborators: Klaus Kaestner, Marie Scearce, Doug Melton (Harvard), Alan Permutt (Wash. U)
Comparative Sequence Analysis Collaborators: Maja Bucan, Shaying Zhao, Whitehead/MIT Center for Genome Research
CAP4 provided by Paracel

WWW.CBIL.UPENN.EDU
“allgenes” human and mouse gene index
PlasmodiumDB
RAD, RNA Abundance Database
Endocrine Pancreas Consortium Database
TESS, Transcription Element Search System
PaGE, Patterns from Gene Expression
MGED

Summary
Genomics Unified Schema (GUS) integrates and adds value to genomic, transcribed, and protein sequence.
RNA Abundance Database (RAD) captures experiment, platform, data, and analysis from array and SAGE experiments.
RAD adds value through integration with GUS.
System-specific views are available for human and mouse, Plasmodium falciparum, and endocrine pancreas.
GUS and RAD can be used to design custom arrays, identify potential SNPs, annotate genomes, and perform comparative sequence analysis.
Tools such as TESS and PaGE have been developed to analyze the data in GUS and RAD.
Presenter notes: Include TESS? Add simulations result. Applying to microarray: issue of cleaning up and normalization.

EST libraries (wide coverage, low resolution)
Ontologies (anatomy, development, disease); expression patterns; microarrays (high resolution); expression rules; TFBS (promoter analysis); protein domains (splice forms)

Ontologies in Gene Expression Databases
Controlled vocabulary (ontologies not always needed): hierarchical; directed acyclic graphs.
Schema: concepts as objects or relational tables; attributes and data types provide specification; relationships specified through subclassing (objects) or foreign keys (relational tables).
Knowledge representation: link to other domains (gene sequence annotation, gene and protein roles, pathways); facilitate data exchange by mapping common concepts.
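To make the directed-acyclic-graph point concrete, here is a minimal sketch of a controlled vocabulary in which a term may have more than one parent, plus a query that returns samples annotated with a term or any of its descendants. The anatomy terms, sample names, and function names are illustrative assumptions and are not taken from the GUS anatomy vocabulary.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def annotated_under(samples, term, parents):
    """Return sample names annotated with `term` or with any term below it."""
    return sorted(
        name for name, annotation in samples.items()
        if annotation == term or term in ancestors(annotation, parents)
    )

# Hypothetical vocabulary: "endocrine pancreas" has two parents, so this is
# a DAG rather than a strict tree.
anatomy_parents = {
    "beta cell": ["islet of Langerhans"],
    "islet of Langerhans": ["endocrine pancreas"],
    "endocrine pancreas": ["pancreas", "endocrine system"],
    "pancreas": ["digestive system"],
}
samples = {"sample1": "beta cell", "sample2": "pancreas", "sample3": "endocrine system"}

print(annotated_under(samples, "pancreas", anatomy_parents))
print(annotated_under(samples, "endocrine system", anatomy_parents))
```

In a relational schema the `parents` mapping would typically be a table of (child, parent) rows linked by foreign keys, which is the design choice the slide contrasts with subclassing in an object model.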

GUS Object View
[Diagram labels: Gene; Gene Feature; Genomic Sequence; NA Sequence; RNA; Protein; Protein Feature; Protein Sequence; AA Sequence]

High Level Flow Diagram of GUS Annotation
[Flow diagram labels: Genomic Sequence; mRNA/EST Sequence; BLAST/SIM4; ORNL gene predictions (GRAIL/GenScan); Clustering and Assembly; Predicted Genes; DOTS consensus sequences; Merge Genes; Gene/RNA cluster assignment; Gene Index; Gene families, Orthologs; Assign Gene Name, Manual Annotation; Predicted RNAs; Predicted Proteins; Grail/Genscan, DIANA/framefinder; BLASTX; PFAM, SignalP, TMPred, ProDom, etc.; BLASTP; Algorithms for functional predictions; BLAST Similarities; Protein Features/Motifs; GO Functions; CellRoles]

Anatomy Hierarchy

Predicted GO Functions

Schema Browser

Summary of allgenes.org content
Update, add GO figure

Information to be captured
Figure from: David J. Duggan et al. (1999) Expression Profiling using cDNA microarrays. Nature Genetics 21: 10-14

The Purpose of GUS
Integrate >> Annotate >> Mine (and Track)
Integrate existing databases and tools: a single point of access to what is already known.
Provide an automated “lab notebook”: a permanent record of work in progress (e.g., similarity searches, array data, etc.).
And ultimately, support data mining: a potential source of novel discoveries.

Query “History” Feature

Critical Assessment of Microarray Data Analysis ‘00
Golub et al. (1999), Science, 286:
ALL-AML: heterogeneous groups: source (B-cells, T-cells, 4 AML types), sex, success, etc.
Focus on B-cells (37 replicates) vs. T-cells (9 replicates): combined the training and the test sets.
Affymetrix: single sample hybridization; each signal is a composite of hybridizations to probes in a set; absent calls.
Note that the data were already normalized.

Distribution Heterogeneity
From the B-cells in the Golub et al. dataset.

“Deterministic” differential expression
[Plot panels: B and T; B; T. Log scale.]
This gene is (nearly) deterministically differentially expressed: if not for the one absent call in the T-cell distribution, it would be deterministic.
Identifier: U23852, T-lymphocyte specific protein tyrosine kinase p56lck (lck) aberrant mRNA

“Non-deterministic” differential expression
[Plot panels: B and T; B; T. Log scale.]
Identifier: M23323, T-cell surface glycoprotein CD3 epsilon chain precursor

PaGE: Patterns from Gene Expression
PaGE assigns confidence measures to predictions of differential expression. Handles multiple testing in a nonparametric (and non-standard) way. Does not use t-statistic. Patterns are generated by comparison of groups of replicates to a reference group. See Manduchi et al., Bioinformatics 16: , 2000.

PaGE: outline
Find C (the upper cutratio) such that [expression not preserved in this transcript] is small (this is the false positive rate). Here i varies too.
This C gives a cutoff for making predictions about up-regulation.
Similarly for down-regulation: find an appropriate c (the lower cutratio) and reverse the above inequalities.

PaGE: approximations
[Expression not preserved in this transcript] is bounded above by [expression not preserved].
After shifting all intensities by an appropriate numerical constant, we approximate the unknown distribution of [expression not preserved] by that of [expression not preserved], where i varies over the gene tags and j varies over the replicates for group 1. Similarly for group 2.

B-cell vs. T-cell using PaGE
Here a confidence of 0.90 is used; the false positive rate is around [value not preserved in this transcript]. On the shift error tables, a shift of 5000 and a false positive rate of [value not preserved] is about optimal across all tables.
Column n,9 = fraction of times the gene is up-regulated in T-cells out of 100 comparisons between n randomly chosen B-cell experiments and all 9 T-cell experiments.

B-cell vs. T-cell using t-statistic
Column n,9 = fraction of times the gene is up-regulated in T-cells out of 100 comparisons between n randomly chosen B-cell experiments and all 9 T-cell experiments.
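The resampling behind a "Column n,9" entry can be sketched as follows. This is a minimal illustration, assuming Welch's form of the t-statistic and a fixed cutoff for calling a gene up-regulated. The slide specifies neither, so the cutoff, the function names, and the intensity values are all hypothetical.

```python
import random
from math import copysign, inf, sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t-statistic for (mean of group_b) minus (mean of group_a)."""
    diff = mean(group_b) - mean(group_a)
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0:
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se

def fraction_called_up(b_cells, t_cells, n, t_cutoff, trials=100, seed=0):
    """Fraction of trials in which the gene is called up-regulated in T-cells.

    Each trial draws n B-cell values at random (n >= 2) and compares them
    with all T-cell values.
    """
    rng = random.Random(seed)
    calls = 0
    for _ in range(trials):
        subset = rng.sample(b_cells, n)
        if welch_t(subset, t_cells) > t_cutoff:
            calls += 1
    return calls / trials

# Hypothetical log-scale intensities for one gene.
b_cells = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.9]
t_cells = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.3, 4.7, 5.0]

print(fraction_called_up(b_cells, t_cells, n=4, t_cutoff=2.0))
```

Repeating the call for several values of n fills one row of the table: a gene whose fraction stays near 1 as n shrinks is called consistently, while one whose fraction drops is sensitive to which replicates happen to be drawn.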
