Finding Transcription Factor Binding Sites BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG.

Presentation transcript:

Questions
If you know the motif (or sequence binding preferences), can you identify likely active TFBS?
If you have a TF, can you find its motif and binding sites?
Can you find motifs and binding sites for unknown TFs?

Finding TFBS and Motifs in Animals
Sequence-based methods
– If the sequence preference is known, scan the known TFBS motif across the genome
Data-based methods
– Use ChIP to identify locations of binding (needs a good antibody; often picks up indirect binding)
– Compare promoters across genomes (needs depth; misses enhancers and species-related changes)
– Look for DNase footprints
– Use SELEX or a double-stranded DNA microarray to profile the TF’s DNA-binding domain (DBD)
Ideally combine both kinds of methods

Outline
Bioinformatics approaches: PSWM
Experimental approaches to finding TFBS
Integrated approaches

Position-Specific Weight Matrices
Represent TFBS better than motifs
Represent the log of the probability of each base occurring at each position in the TFBS
Often used to scan along the genome, calculating the log-likelihood at each position
Figure: a composite PSWM scan for SP1 (from PEAKS webpage)

Standard Scoring Form of PSWM
Goal: compute the probability of a sequence under the distribution of sequences bound by the TF, relative to its probability under a random (background) distribution
Assume independence of bases to simplify
– Not bad for many TFs; bad for some
Log likelihood of a sequence is then the sum, over positions j, of the log-likelihood ratio for the base i found at position j: log_2(p_ij / b_i)
– p_ij is the proportion of occurrences of base i at position j
– b_i is the baseline proportion of base i
If the b_i differ a lot from uniform, the independence assumption is often invalid
– Many false positives from scan
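The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own implementation: the function names `build_pswm` and `scan`, the pseudocount, the uniform background and the four example sites are all assumptions made for the sketch. It scans the forward strand only.

```python
import math

BASES = "ACGT"

def build_pswm(sites, background=None, pseudocount=1.0):
    """Return one dict per motif position mapping base i -> log2(p_ij / b_i).

    `sites` are aligned binding sites of equal length. The pseudocount is an
    assumption added so that unseen bases do not produce log(0).
    """
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform baseline b_i
    width = len(sites[0])
    pswm = []
    for j in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        pswm.append(
            {b: math.log2((counts[b] / total) / background[b]) for b in BASES}
        )
    return pswm

def scan(sequence, pswm):
    """Score every window of the sequence; bases are treated as independent,
    so a window's score is the sum of its per-position log ratios."""
    width = len(pswm)
    return [
        (start, sum(pswm[j][sequence[start + j]] for j in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

# Made-up aligned sites, for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pswm = build_pswm(sites)
scores = scan("CCTGACTCAGG", pswm)
best_start, best_score = max(scores, key=lambda pair: pair[1])
```

Thresholding these window scores is where the false positives mentioned above arise, since most high-scoring windows in a genome are not bound.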

Experimental Approaches to Identifying TFBS and Motifs

ChIP-Seq Can Identify Many TFBS
Chromatin immunoprecipitation can identify where a TF binds to the genome
One can then try to identify sequences that occur in the bound regions more often than expected by chance, by a variety of methods
Caveat: indirect binding may yield the wrong motif
Figure from Rozowsky et al, Nature Biotech 2009
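One of the simplest ways to look for sequences that occur more often than chance is to compare k-mer frequencies in peak sequences against a background set. The sketch below is an assumption-laden illustration rather than any published method: the helper names, the pseudocount, the choice of k and the toy sequences are all invented here, and reverse complements are ignored.

```python
from collections import Counter
import math

def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def enrichment(peak_seqs, background_seqs, k=4, pseudocount=1.0):
    """Rank k-mers seen in peaks by log2(frequency in peaks / frequency in background)."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    n_kmers = 4 ** k  # number of possible DNA k-mers, used to smooth the frequencies
    scores = {}
    for kmer in fg:
        f = (fg[kmer] + pseudocount) / (fg_total + pseudocount * n_kmers)
        b = (bg[kmer] + pseudocount) / (bg_total + pseudocount * n_kmers)
        scores[kmer] = math.log2(f / b)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy sequences, made up for illustration: every "peak" contains GATA.
peaks = ["CCGATACC", "TTGATAGG", "AAGATATT"]
background = ["CCTTGGAA", "TTCCAAGG", "AACCGGTT"]
ranked = enrichment(peaks, background, k=4)
top_kmer, top_score = ranked[0]
```

Real motif finders go further (EM or Gibbs sampling over position-specific matrices), but the comparison against a background is the same underlying idea.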

Other Approaches to Finding TFBS
Systematic Evolution of Ligands by Exponential Enrichment (SELEX)
Generate a random DNA sequence library of moderate length.
The sequences in the library are exposed to the target ligand, and those that do not bind the target are removed by affinity chromatography.
The bound sequences are eluted and then amplified by PCR, and the process is run again under more stringent elution conditions to purify the tightest-binding sequences.
Figure from Jolma et al, Cell, 2013

Finding TFBS by DNase Footprints
Figure from Neph et al, Nature, 2012

Identifying TFBS by Novel Recurrent Motifs under DNaseI Footprints
Figure from Neph et al, Nature, 2012

Integrated Approaches to Identifying Active TFBS in Tissues

Integrated Approaches to Identifying TFBS
In this course we focus on binding sites for transcription factors with known motifs
Combining PWM scores and other genomic data
– PhastCons or PhyloP conservation
– DNase and histone marks
– Integrating digital genomic footprinting (DGF)
We will combine information using a Bayesian framework

Bayesian Hierarchical Model for Integrating Information
Diagram components: PSWM score distributions; conservation distribution; DNase distribution; prior probability of TFBS; posterior probabilities

Bayesian Hierarchical Models
Prior probability of a binding site is set very low, or estimated from TF-specific ChIP data
In principle binding should be a continuous variable; we will treat it as ‘yes-no’
Need to estimate the probability of various genomic features (conservation, DNase) for TFBS and for background sequence

What Information from Histone Marks?
By themselves histone marks, especially H3K4me3, H3K4me1 and H3K27me3, can be very informative
After introducing DNase data, these marks do not add much direct information
Could be used to adjust probabilities for DHS and conservation (not yet done)

Bayes Model for Combining PWM Scores and Conservation
How to estimate P(conserved | TFBS)?
Depends on the evolutionary depth over which conservation is measured
– For mammals ~40%; primates ~80%
– Varies between promoter and enhancer
Background state can be estimated from genome-wide conservation (typically %)
Then combine by Bayes' formula
C (conserved) and S (PWM score) are conditionally independent given B (bound site), so P(C&S|B) = P(C|B)P(S|B) (likewise for ~B)
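Written out in full, the combination step looks as follows. The reading of the symbols is an assumption, since the slide does not define them: B is taken to mean "a TFBS is present", C "the site is conserved", and S the observed PWM score. With the conditional independence stated above, Bayes' formula becomes:

```latex
P(B \mid C, S)
  = \frac{P(C \mid B)\, P(S \mid B)\, P(B)}
         {P(C \mid B)\, P(S \mid B)\, P(B)
          + P(C \mid \lnot B)\, P(S \mid \lnot B)\, P(\lnot B)}
```

Equivalently, in odds form, each feature contributes one likelihood ratio:

```latex
\frac{P(B \mid C, S)}{P(\lnot B \mid C, S)}
  = \frac{P(B)}{P(\lnot B)}
    \cdot \frac{P(C \mid B)}{P(C \mid \lnot B)}
    \cdot \frac{P(S \mid B)}{P(S \mid \lnot B)}
```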

Bayes Model for Combining Scores and DNase Sensitivity
How to estimate P(DHS | TFBS)?
Almost all (~98%) of known TFBS occur in DHS
Background state can be estimated from genome-wide levels (typically 1 or 2%)
Then combine by Bayes' formula
D (in a DHS) and S (PWM score) are conditionally independent given B (bound site), so P(D&S|B) = P(D|B)P(S|B)
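A small numerical sketch shows how strongly DNase evidence moves the posterior. The 98% and 2% figures come from the slide; the prior of 0.001 and the score likelihood ratio of 100 are made-up values chosen only for illustration, and `posterior_bound` is a hypothetical helper name.

```python
def posterior_bound(prior, likelihood_ratios):
    """Posterior P(B | evidence) from a prior and per-feature likelihood ratios.

    Each ratio is P(feature | B) / P(feature | not B); multiplying them assumes
    the features are conditionally independent given B and given not-B.
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

p_dhs_given_bound = 0.98        # from the slide: ~98% of known TFBS lie in DHS
p_dhs_given_background = 0.02   # from the slide: typically 1 or 2% genome-wide
prior = 0.001                   # assumption: "very low" prior for a candidate site
score_ratio = 100.0             # assumption: P(S | B) / P(S | ~B) for a strong match

# Candidate site that falls inside a DHS.
in_dhs = posterior_bound(
    prior, [score_ratio, p_dhs_given_bound / p_dhs_given_background]
)

# Same motif match outside any DHS.
outside_dhs = posterior_bound(
    prior,
    [score_ratio, (1 - p_dhs_given_bound) / (1 - p_dhs_given_background)],
)
```

Under these assumed numbers the same motif match is judged probably bound inside a DHS and very unlikely to be bound outside one, which is why DNase data carries so much of the weight in the integrated model.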

Chromia – A Method for Using Histone Marks and PSWM
Uses an HMM approach to integrate PSWM and histone marks (P300 marks enhancers)

CENTIPEDE – A Method for Combining DNase, Conservation and PSWM Scores
Combines several kinds of genomic information with PSWM to identify putative TFBS
Confirmation by ChIP-Seq is quite good
Pique-Regi R et al. Genome Res. 2011;21:

CENTIPEDE – A Method for Combining DNase, Conservation and PSWM Scores
Pique-Regi R et al. Genome Res. 2011;21:
Figure: model learned by the CENTIPEDE approach for the transcription factor NRSF. (A) Empirical density plots for key aspects of the data for sites inferred by CENTIPEDE to be bound (green lines, CENTIPEDE posterior probabilities > 0.95) and unbound (red lines, probabilities < 0.5).