Finding Transcription Factor Binding Sites BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG.


Why TFBS?
Protein transcription factors were among the first gene regulators identified
Many TFBS have distinct sequences, which made them suitable for bioinformatics analysis
Now seen as one layer of mammalian gene regulation, along with 3D structure of chromatin, histones, antisense lncRNAs, miRNAs, sequestration and transport

Outline
Transcription factors and DNA-binding proteins
Factors affecting TF binding
DNA motifs & PSWM

Transcription Factors and DNA-Binding Proteins
Several dozen distinct families of proteins have independently evolved binding to specific DNA configurations (sequences)
They perform a variety of functions in organizing DNA or regulating transcription
Usually several are involved in initiating gene transcription

Transcription Factor Biology
Most bind in the major groove of DNA
Many bind as dimers
Typically TFs are expressed at low levels
  – Small changes in expression have big effects
Mostly phosphorylated or otherwise activated by other proteins
  – Typically end-points of signalling cascades from receptors on cell surface or in cytoplasm

Transcription Factor Families
Over 40 known families
In mammals the majority of TFs come from three families:
  – C2H2 zinc-finger (675 TFs in human)
  – Homeodomain (257 TFs)
  – Helix–loop–helix (87 TFs)
From Nature Education

Transcription Factor Binding Motifs
Binding of most factors is very specific, and only significant at a small fraction of sites
For many factors much of that specificity is captured by the typical DNA sequence
Sequence specificity is often represented by a motif
  – Not all sites are well represented by a motif
Figure: CTCF binding is not captured well by its motif
Figure: NRF1 binding is well represented by its motif

Transcription Factor Binding Sites
ChIP-Seq experiments in the human genome typically find from several hundred to >20,000 locations where a particular TF is binding
Binding sites may be stronger or weaker
Figure: a typical set of ChIP-Seq reads for HNF4a (from the BayesPeak paper)

TFs Often Bind Cooperatively
Most TFBS occur in clusters in promoters or enhancers/silencers, together with several others of different kinds
Usually only a few of these are actually functional in any one cell type
Different clusters operate in different cell types

Dynamics of TF Binding
A TF comes on and off its DNA site, often cycling in minutes or seconds
Cooperative binding stabilizes the TF
Many TFs act in response to signals or stresses
  – Not captured systematically in most samples

TFBS Locations Often Evolve Rapidly
Most enhancer TFBS in human do not align to TFBS in mouse
From Odom et al, Nature Genetics 2007
From Schmidt et al, Science 2010

Factors Affecting TF Binding - I
Most TFs occupy less than a few percent of their consensus target sites in the genome
Chromatin state
  – Most TFs can only recognize their motif(s) if the DNA is relatively open
  – Some ‘pioneer’ factors bind to their sites in the 30 nm fiber and open up chromatin for others
Zaret & Carroll, Genes Dev, 2012

Factors Affecting TF Binding - II
Allosteric hindrance
  – Presence of another TF on the opposite side hinders binding by spreading the major groove
Cooperative binding
  – Some TFs provide binding sites or enhance binding of specific others to DNA
  – Promotes non-linear, step-like expression response to stimuli
Spitz et al, Nature Rev Genetics 2012
Kim et al, Science 2013

TFBS Motifs Are Stable Over Evolution
Most transcription factors favor almost the same motifs in humans and in mice (and in lizards … and often even in flies)
From Odom et al, Nature Genetics 2007

Position-Specific Weight Matrices Represent TFBS Better than Motifs
A PSWM represents the log of the probability of each base occurring at each position in the TFBS
Often used to scan along the genome, calculating a log-likelihood at each position
Figure: a composite PSWM scan for SP1 (from PEAKS webpage)
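
The scanning idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used for the SP1 figure: the example sites, the sequence, the uniform 0.25 background and the pseudocount of 1 are all made-up assumptions, and the weights are log-odds against background (a common variant of the plain log probability named on the slide).

```python
import math

BASES = "ACGT"

def build_pswm(sites, background=0.25, pseudocount=1.0):
    """Build a log2-odds weight matrix from aligned, equal-length sites."""
    width = len(sites[0])
    pswm = []
    for j in range(width):
        # Pseudocounts keep unseen bases from getting a weight of -infinity.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        pswm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pswm

def scan(sequence, pswm):
    """Return the summed log-odds score of every window along the sequence."""
    width = len(pswm)
    return [
        sum(pswm[j][sequence[i + j]] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]

# Hypothetical aligned binding sites and a hypothetical sequence to scan.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pswm = build_pswm(sites)
seq = "CCGTTGACTCAGGATC"
scores = scan(seq, pswm)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, seq[best:best + len(pswm)])
```

A real scan would also score the reverse strand and compare scores against a threshold calibrated on background sequence.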

TFBS Motif Databases
JASPAR
  – High-quality curated public data
TRANSFAC - international.com/product/transcription-factor-binding-sites
  – Commercial product with dated public version
Several research groups are doing genome-wide characterizations by various means

Finding TFBS and Motifs in Animals
Sequence-based methods
  – Scanning known TFBS motif
  – If have several co-regulated genes, use HMM or Gibbs sampler to identify common motif in them
Data-based methods
  – Use ChIP to identify locations of binding: needs good antibody; often picks up indirect binding
  – Compare promoters across genomes: need depth; miss enhancers and species-related changes
  – Look for DNase footprints
  – Use SELEX or DS-DNA microarray to profile TF’s DBD
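
As an illustration of the Gibbs sampler option for co-regulated genes, here is a minimal sketch of a site sampler that assumes exactly one site per sequence, a fixed motif width and a uniform background. The function names, parameters and toy sequences are hypothetical and do not come from the lecture; published samplers add refinements such as phase shifts and background models.

```python
import random

BASES = "ACGT"

def profile_from(sites, pseudocount=1.0):
    """Column-wise base probabilities for a list of equal-length sites."""
    width = len(sites[0])
    profile = []
    for j in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        profile.append({b: counts[b] / total for b in BASES})
    return profile

def gibbs_motif_search(seqs, width, n_iter=200, seed=0):
    """Sample one motif start per sequence; returns the final start positions."""
    rng = random.Random(seed)
    # Start from a random site in every sequence.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_iter):
        for i, seq in enumerate(seqs):
            # Build the profile from the current sites of all OTHER sequences.
            others = [s[p:p + width]
                      for k, (s, p) in enumerate(zip(seqs, pos)) if k != i]
            profile = profile_from(others)
            # Weight each candidate start by its likelihood ratio
            # against a uniform background (0.25 per base).
            weights = []
            for start in range(len(seq) - width + 1):
                w = 1.0
                for j in range(width):
                    w *= profile[j][seq[start + j]] / 0.25
                weights.append(w)
            # Resample this sequence's site in proportion to the weights.
            pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

# Toy upstream sequences (made up), each containing a copy of TGACTCA.
seqs = ["ACGTTGACTCAGG", "TTGACTCACCGTA", "GGCATTGACTCAT", "CATGACTCAGTTC"]
positions = gibbs_motif_search(seqs, width=7, n_iter=100, seed=1)
print(positions)
```

Because the procedure is stochastic, it is usually run from several seeds and the highest-scoring alignment is kept.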

Other Approaches to Finding TFBS
Systematic Evolution of Ligands by Exponential Enrichment (SELEX)
From Jolma et al, Cell, 2013
Generate a random DNA sequence library of moderate length. The sequences in the library are exposed to the target ligand, and those that do not bind the target are removed by affinity chromatography. The bound sequences are eluted and then amplified by PCR, and the process is run again under more stringent elution conditions to purify the tightest-binding sequences.

Other Approaches to Finding TFBS
Identify recurrent motifs under DNaseI footprints
From Neph et al, Nature, 2012

Integrated Approaches to Identifying TFBS
Combining scores and TF-specific ChIP-Seq
Combining information from scanning and PhastCons or PhyloP conservation
Combining information from DNase, conservation and histone marks
  – Integrating DGF

Finding TFBS Motif via TF-Specific ChIP-Seq
ChIP gives approximate (~200bp) TFBS locations
Sequence can identify loci more specifically within ChIP peaks
Use HMM or Gibbs
Indirect binding won’t be found
Weak binding can be accommodated
From Gelfond et al, Biometrics 2009

Finding Active TFBS in Tissues
Need Bayes model to integrate information from various sources
Easiest if have some PSWM for binding site
  – We will focus on this situation
Increasingly being done to discover novel motifs or PSWMs

Bayesian Hierarchical Models
Prior probability of binding site set very low or estimated from TF-specific ChIP data
In principle binding should be a continuous variable; we will treat it as ‘yes-no’
Need to estimate probability of various genomic features (conservation, DNase, histone marks) for TFBS and for background sequence

Bayes Model for Combining Scores and Conservation
How to estimate P(conserved | TFBS)?
Depends on depth of time for which conservation is used
  – For mammals ~40%; primates ~80%
  – Varies between promoter and enhancer
Background state can be estimated from genome-wide conservation (typically %)
Then combine by Bayes Formula
C and S are conditionally independent given B, so P(C&S|B) = P(C|B)P(S|B) (likewise for ~B)
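
The formula itself did not survive in the transcript. Under the conditional-independence assumption stated on the slide, and reading B as "site is bound", C as "site is conserved" and S as the PSWM score (an interpretation of the slide's symbols, not a quotation), the combination would take the standard form:

```latex
P(B \mid C, S) =
  \frac{P(C \mid B)\, P(S \mid B)\, P(B)}
       {P(C \mid B)\, P(S \mid B)\, P(B)
        + P(C \mid \lnot B)\, P(S \mid \lnot B)\, P(\lnot B)}
```

Here P(B) is the low prior probability of a binding site and P(\lnot B) = 1 - P(B).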

Bayes Model for Combining Scores and DNase Sensitivity
How to estimate P(DHS | TFBS)?
Almost all (~98%) of known TFBS occur in DHS
Background state can be estimated from genome-wide levels (typically 1 or 2%)
Then combine by Bayes Formula
D & S are conditionally independent given B, so P(D&S|B) = P(D|B)P(S|B); likewise P(D&S|~B) = P(D|~B)P(S|~B)
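
A small worked calculation may make the combination concrete. The sketch below uses the slide's figures for DHS (about 98% of TFBS in DHS, about 2% genome-wide), while the prior of 0.001 and the PSWM likelihood ratio of 100 are illustrative assumptions; the function name and its parameters are hypothetical.

```python
def posterior_bound(prior, score_lr, in_dhs,
                    p_dhs_given_bound=0.98, p_dhs_given_background=0.02):
    """Posterior P(B | D, S) assuming D and S are independent given B and given ~B.

    score_lr is the PSWM likelihood ratio P(S | B) / P(S | ~B).
    """
    if in_dhs:
        like_d_bound = p_dhs_given_bound
        like_d_background = p_dhs_given_background
    else:
        like_d_bound = 1.0 - p_dhs_given_bound
        like_d_background = 1.0 - p_dhs_given_background
    # Numerator and denominator are both divided by P(S | ~B),
    # so only the likelihood ratio for the score is needed.
    bound_term = score_lr * like_d_bound * prior
    background_term = like_d_background * (1.0 - prior)
    return bound_term / (bound_term + background_term)

# Assumed values: prior of 1 in 1000 candidate sites, PSWM likelihood ratio of 100.
inside = posterior_bound(prior=0.001, score_lr=100.0, in_dhs=True)
outside = posterior_bound(prior=0.001, score_lr=100.0, in_dhs=False)
print(f"in DHS: {inside:.3f}, outside DHS: {outside:.4f}")
```

With these assumed numbers the same PSWM hit is very likely bound inside a DHS and very unlikely outside one, which is the point of adding DNase data to sequence scores.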

What Information from Histone Marks?
By themselves histone marks, esp. H3K4me3, H3K4me1 and H3K27me3, can be very informative
After introducing DNase data, these marks do not add much direct information
Could be used to adjust probabilities for DHS and conservation (not yet done)

Chromia – A Method for Using Histone Marks and PSWM
Uses an HMM approach to integrate PSWM and histone marks (NB P300 ~ H3K27ac)

CENTIPEDE – A Method for Combining DNase, Conservation and PSWM Scores
Combines several kinds of genomic information with PSWM to identify putative TFBS
Confirmation by ChIP-Seq is quite good
Pique-Regi R et al. Genome Res. 2011;21:

CENTIPEDE – A Method for Combining DNase, Conservation and PSWM Scores
Pique-Regi R et al. Genome Res. 2011;21:
Figure: Model learned by the CENTIPEDE approach for the transcription factor NRSF. (A) Empirical density plots for key aspects of the data for sites inferred by CENTIPEDE to be bound (green lines, CENTIPEDE posterior probabilities > 0.95) and unbound (red lines, probabilities < 0.5).