Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts. Chris Stoeckert, Ph.D. Center for Bioinformatics & Dept. of.


Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts. Chris Stoeckert, Ph.D. Center for Bioinformatics & Dept. of Genetics University of Pennsylvania School of Medicine Nov. 15, 2005 University of Nebraska Medical Center

What is the code for determining where (and when) a gene is expressed?
[Figure labels: Expression; TFBS1, TFBS4, TFBS3, TFBS4, TFBS2, TFBS1]
TFBS = transcription factor binding site

Goal is to Identify Combinations of TFBS (cis-Regulatory Modules or CRMs) that Specify Tissue Expression From Wasserman & Sandelin, NRG 2004

A Genomics Unified Schema approach to understanding gene expression Jennifer Dommer, Steve Fischer, Thomas Gan, Greg Grant, John Iodice, Junmin Liu, Elisabetta Manduchi, Joan Mazzarelli, Debbie Pinney, Angel Pizarro, Mike Saffitz, Jonathan Schug, Chris Stoeckert, Trish Whetzel Computational Biology and Informatics Laboratory (CBIL), Penn Center for Bioinformatics

Cryptosporidium Database; Beta Cell Biology Consortium; Plasmodium Genome Resource; Phytophthora Soybean EST Database; GUS

GUS is an open source project. Sanger Institute U. Georgia Flora Centromere Database U. Chicago Kansas U. U. Penn U. Toronto Virginia Bioinformatics Institute

GUS Project Goals
Provide:
–A platform for broad genomics data integration
–An infrastructure system for functional genomics
Support:
–Websites with advanced query capabilities
–Research driven queries and mining

GUS Project Resources Website -- –News, Documentation, Distributable, GUS-based Projects

GUS Components
Schema
Application Framework
–Object/Relational Layer
–Plugin API
–Pipeline API
Plug-ins
Web Development Kit (WDK)

GUS 3.5 Schemas
Schemas  Domain                   Features
DoTS     Sequence and annotation  EST clusters, Gene models
RAD      Gene expression          MIAME
Prot     Protein expression       Mass spec, Mzdata/pepXML
Study    Experiments              FuGE
TESS     Gene Regulation          TFBS organization
SRes     Shared resources         Ontologies
Core     Administration           Documentation, Data Provenance

RAD EST clustering and assembly DoTS Genomic alignment and comparative sequence analysis Identify shared TF binding sites TESS BioMaterial annotation SRes, Study

DoTS integrates sequence annotation including where expressed

RAD Contains Detailed Expression Experiments Including Tissue Surveys

TESS Allows You to Find Potential TFBS. But there are too many potential sites!

Promoter Features Related to Tissue-Specificity as Measured by Shannon Entropy
Jonathan Schug 1, Winfried-Paul Schuller 2, Claudia Kappen 2, J. Michael Salbaum 2, Maja Bucan 1, Christian J. Stoeckert Jr. 1
1. Center for Bioinformatics, Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania
2. Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska
Genome Biology :R33

What is a Liver-Specific Gene? *

Assessing Tissue Specificity of Genes Using Shannon Entropy
Shannon entropy is a measure of the uniformity of a discrete probability distribution. Given a set of T tissues, H ranges from 0 for a gene expressed in a single tissue to lg T for a gene expressed uniformly in all T tissues. It works well as a measure of overall tissue-specificity.
To measure specificity to a particular tissue, we combine entropy H and the relative expression level in that tissue to get Q. Q = 0 for a tissue when the gene is expressed only in that tissue, and Q = 2 lg T for every tissue when expression is uniform.
(a) Very specific liver expression: H = 1.6 and Qliver = 2.2, 98612_at cytochrome p450
(b) Near uniform expression: H = 4.3 and Qliver = 10.2, _s_at Clcn7 chloride channel 7
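The slide does not spell out how H and the relative expression level are combined, so the following is only a sketch under an assumed form, Q = H - lg(p_t), where p_t is the gene's relative expression in tissue t. That form gives Q = 0 for a gene expressed in one tissue only and Q = 2 lg T for uniform expression across T tissues. The function names are illustrative, not taken from the talk.

```python
import math

def shannon_entropy(expression):
    """Entropy H (in bits) of one gene's expression levels across tissues."""
    total = sum(expression)
    probs = [x / total for x in expression if x > 0]
    return sum(-p * math.log2(p) for p in probs)

def tissue_q(expression, tissue_index):
    """Assumed form Q = H - log2(p_t); lower Q means more specific to that tissue."""
    total = sum(expression)
    p_t = expression[tissue_index] / total
    if p_t == 0:
        return math.inf  # gene not expressed in this tissue at all
    return shannon_entropy(expression) - math.log2(p_t)

# A gene seen only in tissue 2 of 4, and a gene expressed evenly in 16 tissues.
single = [0.0, 0.0, 5.0, 0.0]
uniform = [1.0] * 16
print(shannon_entropy(single), tissue_q(single, 2))
print(shannon_entropy(uniform), tissue_q(uniform, 0))
```

Ranking genes by increasing Q for one tissue then puts the most tissue-specific genes at the top of the list.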

Agreement between Microarrays and ESTs on Tissue Specificity

Specificity Characteristics of Tissues

CpG Islands are Associated with the Start Sites of Genes with Wide-Spread Expression
CpG island = minimum 200 bp, C+G > 0.6, obs./expect. >= 0.5
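As a rough illustration of the CpG island criteria, the sketch below tests a single sequence window against the three thresholds exactly as they appear on the slide. The observed/expected CpG ratio uses the common formula (CpG count x length) / (C count x G count); whether the talk used this exact formula or a sliding window is an assumption, and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (C+G fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(seq, min_len=200, min_gc=0.6, min_obs_exp=0.5):
    """Apply the slide's thresholds: >= 200 bp, C+G > 0.6, obs./expect. >= 0.5."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp >= min_obs_exp

print(looks_like_cpg_island("CG" * 100))  # CpG-dense 200 bp window
print(looks_like_cpg_island("AT" * 100))  # no C or G at all
```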

Tissue-Specific and Non-Specific Promoters Have Distinct Base Compositions
[Figure panels: CpG+, CpG-; Multi-Tissue H >= 4.4, Tissue Specific H <= 3.5]
Promoters based on DBTSS (

TATA Boxes are Associated with Tissue-Specific Genes

Functional relationships of promoter classes based on over-represented GO terms (EASE)

First Clues: TATA Box indicates Tissue Specific; CpG indicates Wide Spread Expression.
Additional clues: CpG-/TATA+ indicates high expression, secreted proteins, while CpG+/TATA- indicates cellular and mitochondrial proteins.

Expanding the Mammalian CArGome Qiang Sun 1, Guang Chen 2, Jeffrey W. Streb 1, Xiaochun Long 1, Yumei Yang 1, Christian J. Stoeckert, Jr. 2 and Joseph M. Miano 1 1.Cardiovascular Research Institute, University of Rochester School of Medicine, Rochester, New York 2.Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania Genome Research (in press)

Serum Response Factor (SRF) Target Genes

Finding Novel CArG elements
Expect 1 CArG element about every kb just by chance.
–CCWWWWWWGG with one mismatch allowed
Use conservation to reduce false positives.
–188 associated with 4362 orthologous genes
–116 had orthologous CArGs
–10/62 known genes found
–Repeated with 9169 orthologous genes: 489 predictions, 32/62 known genes found
60 of 83 predictions were experimentally validated
–Transfection assays
–Binding assays
–Knockdown assays
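The "about every kb by chance" figure can be checked with a short sketch. It scans a sequence for CCWWWWWWGG (W = A or T) allowing one mismatch, and computes the per-position match probability assuming equal, independent base frequencies, which is a simplification. Under that assumption the probability is 19/16384, roughly one site per 860 positions. Function names are illustrative only, and the conservation filter from the slide is not modeled.

```python
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "AT"}
CARG = "CCWWWWWWGG"

def mismatches(window, pattern=CARG):
    """Count positions where the window disagrees with the IUPAC pattern."""
    return sum(base not in IUPAC[p] for base, p in zip(window, pattern))

def find_carg(seq, max_mismatches=1):
    """Return (start, site, mismatch_count) for every window within the limit."""
    seq = seq.upper()
    k = len(CARG)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        m = mismatches(window)
        if m <= max_mismatches:
            hits.append((i, window, m))
    return hits

def chance_rate():
    """P(random 10-mer matches with <= 1 mismatch), equal base frequencies."""
    p_exact = (1 / 4) ** 4 * (1 / 2) ** 6
    # One mismatch at a C/G position (4 choices) or at a W position (6 choices).
    p_one = 4 * (3 / 4) * (1 / 4) ** 3 * (1 / 2) ** 6 \
        + 6 * (1 / 4) ** 4 * (1 / 2) * (1 / 2) ** 5
    return p_exact + p_one

print(find_carg("TTCCATATATGGTT"))
print(1 / chance_rate())  # expected spacing between chance matches, in bp
```

Because CCWWWWWWGG reads the same on both strands, scanning one strand is enough for the consensus itself.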

Serum Response Factor (SRF) Target Genes

More Clues: Human-mouse conservation enables identification of valid CArG elements. CArG elements associated with many cytoskeletal genes, suggesting role of SRF in cytoskeletal dynamics.

Using Bounded Collection Grammars to Identify cis-Regulatory Modules in Tissue Specific Genes
Jonathan Schug
Max Mintz (CIS, U Penn)

Bounded Collection Grammars Collection production rules for the GR response element in the PEPCK promoter

Rules are evaluated using the receiver operating characteristic (ROC). Each point is a different parameter setting for a rule applied to training sets. Typically use area under the curve (AUC) to rank rules.
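For readers unfamiliar with AUC, here is a minimal sketch of how a rule could be ranked. It assumes each rule yields one numeric score per promoter and computes AUC as the probability that a positive promoter outscores a negative one, with ties counted as half. The talk's ROC points come from rule parameter settings, so this score-based version is an illustrative stand-in, not the authors' procedure.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC via pairwise comparison of positive and negative scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical rule scores on a tiny training set.
print(roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2, 0.1]))
```

An AUC of 0.5 corresponds to a rule that does no better than chance at separating the positive and negative training sets.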

Rules are built by increasing complexity when AUC improves. Reduce search space by not pursuing unproductive paths, e.g., if (A,B) is not better than A or B then don’t need to look at (A,B,C) or (A,B,D) or (A,B,C,D).
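The pruning idea can be sketched as a level-wise search. In this sketch a motif set is kept only if its AUC beats every one of its immediate subsets, and a larger set is scored only if all of its immediate subsets were kept. The exact acceptance test used in the talk is not given, so both conditions are assumptions, and `grow_rules` and the `auc` callback are hypothetical names.

```python
def grow_rules(motifs, auc, max_size=3):
    """Return {motif_set: AUC} for singletons and every productive larger set."""
    kept = {frozenset([m]): auc(frozenset([m])) for m in motifs}
    frontier = list(kept)
    for _ in range(2, max_size + 1):
        # Extend each surviving set by one motif it does not already contain.
        candidates = {s | {m} for s in frontier for m in motifs if m not in s}
        frontier = []
        for cand in candidates:
            subsets = [cand - {m} for m in cand]
            if not all(sub in kept for sub in subsets):
                continue  # pruned: some subset was unproductive, never scored
            score = auc(cand)
            if score > max(kept[sub] for sub in subsets):
                kept[cand] = score
                frontier.append(cand)
    return kept

# Toy AUC table; (A,B,C) is deliberately absent because it should never be scored.
toy = {
    frozenset("A"): 0.60, frozenset("B"): 0.60, frozenset("C"): 0.55,
    frozenset("AB"): 0.70, frozenset("AC"): 0.50, frozenset("BC"): 0.58,
}
rules = grow_rules("ABC", toy.__getitem__)
print(sorted("".join(sorted(s)) for s in rules))
```

In the toy table (A,C) and (B,C) do not improve on their members, so the three-motif set is skipped without being evaluated.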

The 3-set rule for the PEPCK GR element Note improvements of 2-sets over solos and the 3-set over 2-sets.

Discovering regulatory modules by creating profiles for Gene Ontology Biological Processes based on tissue-specificity scores Elisabetta Manduchi, Jonathan Schug Klaus Kaestner (Genetics, U Penn)

If we focus on biological processes that are predominantly taking place in a given tissue, can we identify regulatory modules common to genes involved in these processes? Tissue Biological Process Genes

For a given tissue survey, we attach “tissue-specificity” profiles to gene sets defined by GO BPs, based on the ranked lists of genes in each tissue according to increasing Q. To this end, we use an Enrichment Score (ES) in the spirit of that described in Mootha et al. (2003), as a measure of tissue-specificity for that gene set. The ES turns out to be equivalent (i.e. equal up to a multiplicative constant) to a Kolmogorov-Smirnov statistic.
[Figure: profile across liver, muscle and brain for the GO BP steroid metabolism]
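A minimal sketch of an enrichment score of this kind follows. It walks down a gene list ranked by increasing Q, stepping up for genes in the set and down for the others, and reports the maximum of the running sum. The step sizes follow the unweighted form usually attributed to Mootha et al. (2003); the talk only says "in the spirit of", so the exact variant used is an assumption, and the gene names are made up.

```python
import math

def enrichment_score(ranked_genes, gene_set):
    """Maximum of a running sum over a ranked list (unweighted, KS-like)."""
    n = len(ranked_genes)
    hits = sum(g in gene_set for g in ranked_genes)
    if hits == 0 or hits == n:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    up = math.sqrt((n - hits) / hits)    # step for a gene in the set
    down = math.sqrt(hits / (n - hits))  # step for a gene outside the set
    running = best = 0.0
    for g in ranked_genes:
        running += up if g in gene_set else -down
        best = max(best, running)
    return best

ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]  # increasing Q
print(enrichment_score(ranked, {"g1", "g2"}))  # set concentrated at the top
print(enrichment_score(ranked, {"g7", "g8"}))  # set concentrated at the bottom
```

The step sizes are chosen so that the running sum returns to zero at the end of the list, which is what makes the maximum comparable to a Kolmogorov-Smirnov deviation.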

Application to Tissue Surveys
Have applied to two different Affymetrix-based datasets
–Shmueli et al. C R Biol GeneNote (human)
–Su et al. PNAS GEA2 (human and mouse)
We looked at ~2000 GO BPs that we could map to probe sets

GO BPs having significantly specific profiles for each tissue can be identified
[Figure labels: significant in liver; significant in heart and skeletal muscle]

Learning Tissue-Specific Promoter Motifs
[Workflow diagram labels: Mouse Tissue Survey; Human Tissue Survey; Tissue-specific GO BPs; Reduce and Intersect; Ortholog Pairs (Homologene); Training Set of Promoter Sequences (shown twice); UCSC conserved sequences; Mm-based consensus conserved sequences; Hs-based consensus conserved sequences; Positive Solos]
Example: Liver-specific Steroid metabolism, GEA2, 32 POS, 365 NEG, ROC area > 0.5

Common solos (31); Mm-based collections (30); Hs-based collections (83); Common collections (13)
Common solos: GATA, MYCMAC/USF, AIRE, CAAT, ER-LEFT, TCF11, TTAC/EFC/NCX/VBP, PBX, INI, AATC, SREBP, DBP, Forkhead, E4B, EBOX, CCAA, CREB/ATF, S8/CART1/CHX10/NKX25, G_AA/CEBP/HLF, TAACC, LXR, HNF1, ALX4, HNF4/TCF4/COUP/PPAR, PPAR-LEFT, ROAZ, AML/PEBP, BACH/NFE2/NRF2, ZTA, P53, GNCF/SF1
[Legend: Liver-specific from Krivan and Wasserman (2001); Known Liver TFs]

Learning Liver Specific CRMs for Steroid Metabolism
AIRE
P53
ER-LEFT
{CREB/ATF, GATA}
{GATA, GNCF/SF1}
{Forkhead, GATA}
{GATA, G_AA/CEBP/HLF}
{GATA, SREBP}
{GATA, TAACC}
{GATA, ZTA}
{AATC, SREBP}
{CAAT, SREBP}
{BACH/NFE2/NRF2, G_AA/CEBP/HLF}
Without imposing prior knowledge, we end up with rules that are highly enriched for TFs expected to play a role in liver-specific steroid metabolism.

Testing a learned CRM for liver steroid metabolism using a liver HNF3-beta (FoxA2) knockout mouse study
{Forkhead, GATA} set rule applies to HNF3-beta/FoxA2
Search promoters of genes down-regulated in liver as measured on PancChip microarray
–Pancreas-focused array with 7356 known genes. 52 (0.7%) map to steroid metabolism.
–71 genes down-regulated. 7 (10%) map to steroid metabolism.
Genes down-regulated by knockout of a forkhead protein (Hnf3-beta) are significantly enriched in steroid metabolism
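The slide does not say which test was used to call the enrichment significant. One common choice for counts like these is a one-sided hypergeometric test, sketched below with the numbers from the slide (7356 genes on the array, 52 in steroid metabolism, 71 down-regulated, 7 of those in steroid metabolism). The function name is hypothetical.

```python
from math import comb

def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing without replacement."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total

# Numbers taken from the slide; the choice of test is an assumption.
expected = 71 * 52 / 7356
p_value = hypergeom_tail(7356, 52, 71, 7)
print(f"expected by chance: {expected:.2f} genes, observed: 7, p = {p_value:.2e}")
```

By chance only about half a gene would be expected in the overlap, which is why seven observed genes stands out.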

More Clues: We can identify candidate CRMs from top-ranking GO Biological Processes for tissues. Tested a candidate CRM for liver steroid metabolism with a knock-out mouse. Support for role of one of the factors but not enough sensitivity for seeing both factors.

Future Directions
–Apply learning methods to many tissues and processes incorporating multiple surveys
–Add novel motifs to learning process
–Use ChIP and tissue-focused expression datasets to better evaluate
Our goal is to make inferences of the form: "The gene set G shows specificity for tissue T and is regulated by module M in this context".