Computational Analysis of Tissue Specificity: Decoding Promoters Chris Stoeckert, Ph.D. Center for Bioinformatics & Dept. of Genetics University of Pennsylvania.

Presentation transcript:

Computational Analysis of Tissue Specificity: Decoding Promoters Chris Stoeckert, Ph.D. Center for Bioinformatics & Dept. of Genetics University of Pennsylvania Nov. 17, 2004 Department of Physiology Seminar Series University of Kentucky

What is the code for determining where (and when) a gene is expressed? [Diagram: a promoter carrying binding sites TFBS1, TFBS2, TFBS3 and TFBS4 that drive expression.] TFBS = transcription factor binding site

Goal is to Identify Combinations of TFBS (cis-Regulatory Modules or CRMs) that Specify Tissue Expression. From Wasserman & Sandelin, NRG 2004.

A Genomics Unified Schema approach to understanding gene expression Dave Barkan, Jonathan Crabtree, Shailesh Date, Steve Fischer, Bindu Gajria, Thomas Gan, Greg Grant, Hongxian He, John Iodice, Li Li, Junmin Liu, Matt Mailman, Elisabetta Manduchi, Joan Mazzarelli, Debbie Pinney, Angel Pizarro, Mike Saffitz, Jonathan Schug, Chris Stoeckert, Trish Whetzel Computational Biology and Informatics Laboratory (CBIL), Penn Center for Bioinformatics

GUS: Stem Cell Gene Anatomy Project; Beta Cell Biology Consortium; Plasmodium Genome Resource; Allgenes (human and mouse DoTS)

Core, SRES, TESS, RAD, DoTS. Oracle RDBMS; Object Layer for Data Loading; Java Servlets. GUS is an open source project. Sites and projects shown: Sanger Institute; U. Georgia; Flora Centromere Database; U. Chicago; U. Penn; U. Toronto; Phytophthora sojae genome; Virginia Bioinformatics Institute.

GUS (Genomics Unified Schema)
Namespace: Domain (Features)
–RAD: Gene Expression (MIAME/MAGE-OM)
–DoTS: Sequence and annotation (EST clusters and gene models)
–Core: Data Provenance (Documentation)
–Sres: Shared Resources (Ontologies)
–TESS: Gene Regulation (TFBS organization)

RAD EST clustering and assembly DoTS Genomic alignment and comparative sequence analysis Identify shared TF binding sites TESS BioMaterial annotation SRES

DoTS integrates sequence annotation including where expressed

kidney, mammary gland, brain, liver, colon, lung, retina, spinal cord, rhabdomyosarcoma cell line brain, liver, kidney, lung, melanocyte embryo, fetus, kidney, limb, retina, salivary gland brain, rhabdomyosarcoma cell line, kidney Sorbs1: sorbin and SH3 domain containing 1 - GO molecular function - actin binding and protein kinase binding - GO cellular component – actin cytoskeletal stress fibers

RAD Contains Detailed Expression Experiments Including Tissue Surveys

TESS Allows You to Find Potential TFBS. But there are too many potential sites!

Promoter Features Related to Tissue Specificity as Measured by Shannon Entropy. Jonathan Schug 1, Winfried-Paul Schuller 2, Claudia Kappen 2, J. Michael Salbaum 2, Maja Bucan 3, Christian J. Stoeckert Jr. 1
1. Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA
2. Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska, 68198, USA
3. Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA

What is a Liver-Specific Gene? *

Assessing Tissue Specificity of Genes Using Shannon Entropy
Shannon entropy is a measure of the uniformity of a discrete probability distribution. Given a set of T tissues, H ranges from 0 for a gene expressed in a single tissue to lg T (log base 2 of T) for a gene expressed uniformly in all T tissues. It works well as a measure of overall tissue specificity. To measure specificity to a particular tissue, we combine the entropy H and the relative expression level in that tissue to get Q. Q = 0 for a tissue when the gene is expressed only in that tissue, and Q = 2 lg T for a typical tissue under uniform expression.
(a) Very specific liver expression: H = 1.6 and Qliver = 2.2, 98612_at cytochrome p450
(b) Near-uniform expression: H = 4.3 and Qliver = 10.2, _s_at Clcn7 chloride channel 7
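The two scores can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. The slide does not spell out how H and the relative expression level are combined, so the form Q = H - lg(p_tissue) used here is an assumption; it gives Q = 0 for a gene expressed in one tissue only and Q = 2 lg T under uniform expression. The function names and the dict-based input are hypothetical.

```python
import math


def shannon_entropy(expression):
    """Entropy H (in bits) of a gene's relative expression across tissues.

    `expression` maps tissue name -> non-negative expression level.
    """
    total = sum(expression.values())
    probs = [level / total for level in expression.values() if level > 0]
    return sum(p * math.log2(1 / p) for p in probs)


def tissue_q(expression, tissue):
    """Tissue-specific score, ASSUMED to be Q = H - log2(p_tissue).

    Low Q means the gene is specific to `tissue`; Q is infinite when the
    gene is not expressed in that tissue at all.
    """
    total = sum(expression.values())
    p = expression[tissue] / total
    if p == 0:
        return math.inf
    return shannon_entropy(expression) - math.log2(p)


# Hypothetical example: four tissues, uniform versus liver-only expression.
uniform = {"liver": 5.0, "brain": 5.0, "kidney": 5.0, "lung": 5.0}
liver_only = {"liver": 9.0, "brain": 0.0, "kidney": 0.0, "lung": 0.0}
print(shannon_entropy(uniform), tissue_q(uniform, "liver"))
print(shannon_entropy(liver_only), tissue_q(liver_only, "liver"))
```

Ranking genes by increasing Q for one tissue puts the most tissue-specific genes at the top of the list.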

Agreement between Microarrays and ESTs on Tissue Specificity

Specificity Characteristics of Tissues

CpG Islands are Associated with the Start Sites of Genes with Wide-Spread Expression CpG island = minimum 200 bp, C+G > 0.6, obs./expect. >=0.5
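The criteria on this slide can be checked for a single candidate window as follows. This is only a sketch: the thresholds are the ones stated on the slide, while the observed/expected ratio is computed with the commonly used formula (number of CG dinucleotides times window length, divided by the product of the C and G counts), which is an assumption about what the slide means. The function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Expected CpG count if C and G were placed independently (assumed formula).
    expected = c_count * g_count / n if n else 0.0
    obs_exp = cpg_count / expected if expected else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.6, min_obs_exp=0.5):
    """Apply the slide's thresholds to one candidate window."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp >= min_obs_exp


print(looks_like_cpg_island("CG" * 100))  # CpG-rich toy window
print(looks_like_cpg_island("AT" * 100))  # no C or G at all
```

A genome-wide scan would slide such a window along the sequence and merge overlapping hits; that step is left out here.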

Tissue-Specific and Non-Specific Promoters Have Distinct Base Compositions. [Figure panels: CpG+ and CpG-; Multi-Tissue (H >= 4.4) and Tissue Specific (H <= 3.5)]

TATA Boxes are Associated with Tissue-Specific Genes

Functional relationships of promoter classes based on over-represented GO terms (EASE)

First Clues: TATA Box indicates Tissue Specific; CpG indicates Wide-Spread Expression. Additional clues: CpG-/TATA+ indicates high expression and secreted proteins, while CpG+/TATA- indicates cellular and mitochondrial proteins.

Pattern Analysis of Pancreas Gene Promoters Guang (Gary) Chen, Jonathan Schug

Identifying TFBMs – Method Pipeline
Starting with a gene expression tissue survey, pancreas-specific genes with common TFBS and biological processes are identified.
[Pipeline diagram. Boxes: GNF Gene Expression Atlas; Shannon Entropy; Gene Lists with Tissue Specificity; DBTSS; Sequences around Transcription Start Sites; Teiresias; Patterns; Pattern Clustering; Pattern Clusters (PWM); Represent Seqs with PWMs; Gene Clusters; Gene Ontology (GO); GO Category Analysis; Comparative Genome Analysis; Tissue Specific Regulatory Modules Associated with GO Biological Process]

Methods & Resources (Cont.)
–DBTSS: Database of Transcriptional Start Sites. Based on 400,225 and 580,209 human and mouse full-length cDNA sequences, DBTSS contains the genomic positions of the transcriptional start sites and the adjacent promoters for 8,793 and 6,875 human and mouse genes, respectively. Yutaka Suzuki, Riu Yamashita, Kenta Nakai and Sumio Sugano (2002). DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res. 30:
–Pancreas genes are chosen based on efforts to understand pancreatic development and function (EPConDB). 500 bp upstream for preliminary study. 159 human (mouse) pancreas-specific genes (Qislet < 7, positive (p)) & 159 human (mouse) ubiquitous genes (Qislet > 10, negative (n)).
–This approach can be applied to any tissue to study the tissue specificity of transcription factor binding motifs (TFBMs) & modules.

Method - Pattern Discovery - Teiresias
Teiresias patterns: a pattern P is an <L, W> pattern (with L ≤ W) if P contains at least L residues and every subpattern of P containing L residues is at most W symbols in length.
Pattern: ACTGGCA. C. GT
*Rigoutsos, I. and A. Floratos, Combinatorial Pattern Discovery in Biological Sequences: the TEIRESIAS Algorithm. Bioinformatics, 14(1), January 1998.
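The density condition in this definition is easy to test for a single pattern. The sketch below checks only whether a given string satisfies the L and W constraint; it is not the TEIRESIAS discovery algorithm itself. It assumes that "." is the wildcard symbol and that a pattern must begin and end with a residue. The function name is hypothetical.

```python
def is_lw_pattern(pattern, L, W, wildcard="."):
    """Check the L/W density condition for one pattern.

    Every stretch that starts at a residue and ends at the (L-1)-th
    residue after it must span at most W symbols.
    """
    if not pattern or pattern[0] == wildcard or pattern[-1] == wildcard:
        return False
    residue_positions = [i for i, ch in enumerate(pattern) if ch != wildcard]
    if len(residue_positions) < L:
        return False
    # Pair each residue with the residue L-1 places later.
    for start, end in zip(residue_positions, residue_positions[L - 1:]):
        if end - start + 1 > W:
            return False
    return True


print(is_lw_pattern("A.CG..T", L=3, W=5))  # every 3-residue stretch fits in 5 symbols
print(is_lw_pattern("A.CG..T", L=3, W=4))  # the stretch "CG..T" is 5 symbols long
```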

Identifying TFBMs - Pattern Distribution
With 117 human pancreas-specific genes as the positive (p) set and ubiquitous genes (Q pancreas > 10) as the negative (n) set, roughly 90,000 patterns were discovered in the 1kb+/200bp- promoter region. Patterns with ∆ p-n > 20 (in blue box) are more likely to be pancreas specific.
[Plot 1: each point represents a pattern, with its occurrence in the positive data set on the y-axis and in the negative data set on the x-axis.]
[Plot 2: for each pattern (x-axis), the occurrence difference ∆ p-n (y-axis) between the positive and negative data sets.]
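The occurrence difference ∆ p-n can be illustrated with a short Python sketch. Several things here are assumptions rather than details from the slides: an occurrence is counted once per sequence that contains the pattern, the positive and negative sets are taken to be the same size so that raw counts are comparable, and the Teiresias wildcard "." is matched with the regular-expression wildcard. All names and the toy sequences are hypothetical.

```python
import re


def count_sequences_with(pattern, sequences):
    """Number of sequences containing at least one match to the pattern."""
    regex = re.compile(pattern)
    return sum(1 for seq in sequences if regex.search(seq))


def occurrence_difference(patterns, positive, negative):
    """Map each pattern to its count in the positive set minus the negative set."""
    return {
        p: count_sequences_with(p, positive) - count_sequences_with(p, negative)
        for p in patterns
    }


def select_patterns(patterns, positive, negative, min_delta=20):
    """Keep patterns whose difference exceeds min_delta (20 on the slide)."""
    deltas = occurrence_difference(patterns, positive, negative)
    return [p for p in patterns if deltas[p] > min_delta]


# Toy data; real input would be promoter sequences from the two gene sets.
positive = ["ACGTAC", "TTACGA", "GGGGTT"]
negative = ["ACGTTT", "TTTTTT", "CCCCCC"]
print(occurrence_difference(["AC.T", "GG", "AC"], positive, negative))
print(select_patterns(["AC.T", "GG", "AC"], positive, negative, min_delta=0))
```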

Method - Pattern Clustering
[Flow diagram. Boxes: Patterns; Smith-Waterman; Distance of pattern pair; Hierarchical; K-Median; Num of Cluster; Pattern Clusters (PWM)]

Results - Pattern Clustering Clustering Results (human, ∆ p-n >20, 72 patterns)

Identifying TFBMs
72 patterns (Human, ∆ p-n > 20) were clustered into 18 pattern clusters, and 6 of them were identified as known ones by searching TRANSFAC.
Identified known binding sites associated with human pancreas genes: AP2ALPHA, MEF2, SRY, NKX62, CAP_01, HOXA3

Identifying TFBM
By conducting comparative genomic analysis, we found that some discovered TFBMs are conserved between human & mouse pancreas orthologs: AP2ALPHA, MEF2, NKX62, CAP_01, HOXA3

Gene Clustering - Based on TFBMs
Pancreas-specific genes can be clustered according to the presence or absence of conserved promoter motifs. Upstream sequences can be characterized by pattern occurrences, which can then be used to calculate pairwise similarities between sequences. For simplicity, we used a boolean model that records only whether each of 7 conserved patterns appears. Centered Pearson correlation was used to calculate similarity, and 117 pancreas-specific genes (Q < 6.5) were grouped into 10 clusters with hierarchical clustering.
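The boolean model and the similarity measure can be sketched as follows. This is an illustrative example only: matching patterns with regular expressions is an assumption, the patterns and sequences are made up, and returning 0 for a constant profile is a choice made for this sketch. The hierarchical clustering step itself is not shown; it would take the pairwise similarities computed here as input.

```python
import math
import re


def boolean_profile(sequence, patterns):
    """1 if the pattern occurs in the upstream sequence, else 0, per pattern."""
    return [1 if re.search(p, sequence) else 0 for p in patterns]


def centered_pearson(x, y):
    """Pearson correlation after subtracting each vector's own mean."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: correlation undefined, treated as 0 here
    return cov / math.sqrt(var_x * var_y)


# Hypothetical patterns and upstream sequences.
patterns = ["AC.T", "TTT", "GG", "CAT"]
gene_a = boolean_profile("ACGTGGAA", patterns)
gene_b = boolean_profile("ACTTGGCC", patterns)
print(gene_a, gene_b, centered_pearson(gene_a, gene_b))
```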

Gene Clustering – GO Category
Assign gene clusters to GO categories. To interpret clustering results, we used EASE to find the significant biological features of a gene cluster of interest through the GO Biological Process.

More Clues: Known and novel TFBS found associated with genes expressed in the pancreas See conservation of sites between human and mouse Associated with digestion, catabolism, and response to stimulus GO biological processes

Discovering regulatory modules by creating profiles for Gene Ontology Biological Processes based on tissue-specificity scores Elisabetta Manduchi, Jonathan Schug

If we focus on biological processes that are predominantly taking place in a given tissue, can we identify regulatory modules common to genes involved in these processes? [Diagram: Tissue; Biological Process; Genes]

For a given tissue survey, we attach “tissue-specificity” profiles to gene sets defined by GO BPs, based on the ranked lists of genes in each tissue according to increasing Q. To this end, we use an Enrichment Score (ES) in the spirit of that described in Mootha et al. (2003), as a measure of tissue-specificity for that gene set. The ES turns out to be equivalent (i.e. equal up to a multiplicative constant) to a Kolmogorov-Smirnov statistic.
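A running-sum enrichment score of this kind can be sketched as below. The step sizes follow the form given in Mootha et al. (2003) as best recalled here, so treat them as an assumption: a gene in the set adds sqrt((N-G)/G) and a gene outside it subtracts sqrt(G/(N-G)), where N is the length of the ranked list and G is the number of set members in it. The function name and the toy gene names are hypothetical.

```python
import math


def enrichment_score(ranked_genes, gene_set):
    """Maximum of a running sum walked down the ranked gene list.

    `ranked_genes` is ordered by increasing Q for one tissue, so a large
    score means the set's genes are concentrated among the most specific genes.
    """
    n = len(ranked_genes)
    members = [gene in gene_set for gene in ranked_genes]
    g = sum(members)
    if g == 0 or g == n:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit_step = math.sqrt((n - g) / g)
    miss_step = -math.sqrt(g / (n - g))
    running = 0.0
    best = 0.0
    for is_member in members:
        running += hit_step if is_member else miss_step
        best = max(best, running)
    return best


ranked = ["g1", "g2", "g3", "g4"]           # most tissue-specific first
print(enrichment_score(ranked, {"g1", "g2"}))  # set at the top of the list
print(enrichment_score(ranked, {"g3", "g4"}))  # set at the bottom of the list
```

Computing this score for one GO BP in every tissue of the survey gives that gene set's tissue-specificity profile.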

Application to a Human Tissue Survey
The following results refer to the application of the methods described above to the GeneNote tissue survey:
–12 tissues in duplicate on the HGU95 Affymetrix chip set (Av2, B-E).
We looked at the 2316 GO BPs that we could map to probe sets (using version of the Bioconductor GO and hgu95XXX metadata R packages).

GO BPs having significantly specific profiles for each tissue can be identified. [Figure panels: significant in liver; significant in heart and skeletal muscle]

Excerpt of cluster of GO BPs based on their tissue-specificity profiles (up in spinal cord/brain)

Focusing on steroid metabolism
A. After mapping probe sets to RefSeqs and retrieving from DBTSS their upstream sequences, we assembled a set of 63 promoter sequences, which was our positive set.
B. We generated 5 negative sets, each consisting of 315 sequences, by randomly scrambling each of the positive set sequences.
C. We ranked each of 666 Transcription Factor Binding Sites (TFBSs) from TRANSFAC - represented by position matrices - in terms of their ability (measured by average ROC area) to discriminate between the positive set and the negative sets.

Focusing on steroid metabolism
D. We then selected high-ranking TFBSs from (C) and high-ranking TFBSs from an independent study focusing on liver specificity and formed all possible pairs between these two sets.
E. These pairs were ranked according to their discriminative ability and on the basis of the distance between their components in the positive hits. Optimal parameters (distance and individual TFBS match scores) were selected for each pair scoring at the top.
F. By assessing the performance over a test set composed of mouse promoter sequences, we found 2 candidate CRMs (involving 3 and, respectively, 4 TFBSs) with an over-representation of steroid metabolism genes.
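Steps B and C above can be illustrated with a small Python sketch. It makes several assumptions that the slides do not confirm: scrambling is done by shuffling the letters of a sequence, each sequence is summarized by a single score for a given TFBS (for example its best position-matrix match, not computed here), and the ROC area is obtained by direct pairwise comparison of positive and negative scores. All names and numbers are hypothetical.

```python
import random


def scramble(sequence, rng):
    """Shuffle the letters of a sequence, keeping its base composition."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def roc_area(positive_scores, negative_scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def average_roc_area(positive_scores, negative_score_sets):
    """Mean ROC area of one TFBS over several negative sets."""
    areas = [roc_area(positive_scores, neg) for neg in negative_score_sets]
    return sum(areas) / len(areas)


rng = random.Random(0)
print(scramble("ACGTACGTAA", rng))
# Hypothetical match scores for one TFBS on a positive set and two negative sets.
print(average_roc_area([0.9, 0.8, 0.7], [[0.6, 0.5, 0.85], [0.4, 0.3, 0.2]]))
```

Ranking the TFBSs then amounts to sorting them by this average area.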

Example of production hits to steroid metabolism mouse promoter sequences
No. mouse promoter sequences: Of these 50 belong to genes mapping to steroid metabolism.
No. production hits: 257. Of these 8 belong to genes mapping to steroid metabolism.
Production TFBSs: {FOXD3_01, GKLF_01, HFH1_01, MADSA_Q2}
Parameters: max distance=130 FOXD3_01 min score= GKLF_01 min score= HFH1_01 min score= MADSA_Q2 min score=
[Figure legend: TSS; green=forward strand, red=reverse strand; shading indicates strength]
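One way to read the parameters on this slide is that a promoter counts as a hit when every TFBS in the production has a match above its minimum score and all of those matches fall within the maximum distance of each other. The sketch below implements that reading, which is an assumption. The minimum scores are missing from the transcript, so the values used here are placeholders, as are the match positions and the function name.

```python
def module_hit(hits, min_scores, max_distance):
    """True if every TFBS has a qualifying match inside one window.

    `hits` maps TFBS name -> list of (position, score) matches in a promoter.
    The window spans at most `max_distance` bases from its leftmost site.
    """
    kept = {}
    for name, threshold in min_scores.items():
        positions = sorted(
            pos for pos, score in hits.get(name, []) if score >= threshold
        )
        if not positions:
            return False
        kept[name] = positions
    # Try each qualifying match as the leftmost site of the module.
    for left in sorted(pos for positions in kept.values() for pos in positions):
        right = left + max_distance
        if all(any(left <= p <= right for p in positions) for positions in kept.values()):
            return True
    return False


# Placeholder thresholds and matches; only the TFBS names and max distance come from the slide.
min_scores = {"FOXD3_01": 0.8, "GKLF_01": 0.8, "HFH1_01": 0.8, "MADSA_Q2": 0.8}
hits = {
    "FOXD3_01": [(10, 0.90)],
    "GKLF_01": [(100, 0.80)],
    "HFH1_01": [(60, 0.95)],
    "MADSA_Q2": [(130, 0.85)],
}
print(module_hit(hits, min_scores, max_distance=130))
```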

More Clues: We can identify candidate CRMs from top-ranking GO Biological Processes for tissues. Identified a candidate CRM for steroid metabolism.

Summary
–GUS is a functional genomics database system used by a growing number of sites for genome and expression projects.
–Using expression data in GUS and entropy-based metrics, we can rank genes according to their tissue-specificity, learn promoter properties, and associate functional roles.
–In addition to general properties of tissue-specific promoters, we are beginning to identify combinations of motifs (i.e., regulatory modules) associated with expression in specific tissues.

Future Directions
–Refine analysis from genes to transcripts
–Refine analysis from organs to cells
–Apply approach to splicing
–Apply approach to developmental stage and differentiation state
Our goal is to make inferences of the form: "The gene set G shows specificity for tissue T and is regulated by module M in this context".