Motif Finding Continued


Motif Finding Continued Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

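Scoring Motifs
Information Content (aka relative entropy)
Suppose you have x aligned segments s_1, ..., s_x for the motif. Compare the probability of each segment under the motif (mtf) with its probability under the background (bg):
$$\frac{p(s_1 \mid \text{mtf})}{p(s_1 \mid \text{bg})} \times \frac{p(s_2 \mid \text{mtf})}{p(s_2 \mid \text{bg})} \times \cdots \times \frac{p(s_x \mid \text{mtf})}{p(s_x \mid \text{bg})}$$
Sites used to build the motif matrix (Pos 12345678):
ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ACTGGATG
Segment ATGCAGCT:
$$\text{score} = \frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$$
where the background probability is $p_{A0} \times p_{T0} \times p_{G0} \times p_{C0} \times p_{A0} \times p_{G0} \times p_{C0} \times p_{T0}$.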
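The segment score on this slide can be sketched in a few lines of Python. This is only an illustration: the uniform background, the pseudocount of 1 (needed because some bases never occur at some positions in the nine sites), and the function names `motif_matrix` and `log_odds_score` are assumptions, not part of the lecture.

```python
import math

# The nine aligned sites shown on the slide.
SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ACTGGATG"]

# Assumption: uniform background; a real genome background would be estimated.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def motif_matrix(sites, pseudocount=1.0):
    """Per-position base probabilities, smoothed with a pseudocount (assumed)."""
    matrix = []
    for j in range(len(sites[0])):
        column = {base: pseudocount for base in "ACGT"}
        for site in sites:
            column[site[j]] += 1
        total = sum(column.values())
        matrix.append({base: count / total for base, count in column.items()})
    return matrix


def log_odds_score(segment, matrix, background):
    """log of p(segment | motif matrix) / p(segment | background)."""
    return sum(math.log(matrix[j][base] / background[base])
               for j, base in enumerate(segment))


matrix = motif_matrix(SITES)
print(log_odds_score("ATGCAGCT", matrix, BACKGROUND))  # the slide's segment
print(log_odds_score("ATGGCATG", matrix, BACKGROUND))  # first site, for comparison
```

Working in log space keeps the product of eight small ratios numerically stable; a positive score means the segment is more likely under the motif matrix than under the background.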

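Scoring Motifs
$$\frac{p(s_1 \mid \text{mtf})}{p(s_1 \mid \text{bg})} \times \cdots \times \frac{p(s_x \mid \text{mtf})}{p(s_x \mid \text{bg})} = \left(\frac{p_{A1}}{p_{A0}}\right)^{A_1} \left(\frac{p_{T1}}{p_{T0}}\right)^{T_1} \left(\frac{p_{T2}}{p_{T0}}\right)^{T_2} \left(\frac{p_{G2}}{p_{G0}}\right)^{G_2} \left(\frac{p_{C2}}{p_{C0}}\right)^{C_2} \cdots$$
Here $p_{A1}$ is the motif probability of A at position 1, $p_{A0}$ is the background probability of A, and $A_1$ is the number of aligned segments with A at position 1.
Take log of this:
$$= A_1 \log\frac{p_{A1}}{p_{A0}} + T_1 \log\frac{p_{T1}}{p_{T0}} + T_2 \log\frac{p_{T2}}{p_{T0}} + G_2 \log\frac{p_{G2}}{p_{G0}} + \cdots$$
Divide by the number of segments (if all the motifs have the same number of segments):
$$= p_{A1} \log\frac{p_{A1}}{p_{A0}} + p_{T1} \log\frac{p_{T1}}{p_{T0}} + p_{T2} \log\frac{p_{T2}}{p_{T0}} + \cdots$$
Sites (Pos 12345678):
ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ACTGGATG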
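The final expression on this slide is a sum, over positions and bases, of p log(p / p0). A minimal sketch of that computation for the nine sites follows. The uniform background, the use of log base 2 (so the result is in bits), and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

# The nine aligned sites shown on the slide.
SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ACTGGATG"]

# Assumption: uniform background probabilities.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def position_frequencies(sites):
    """Observed frequency of each base at each motif position."""
    freqs = []
    for j in range(len(sites[0])):
        counts = Counter(site[j] for site in sites)
        freqs.append({base: counts[base] / len(sites) for base in "ACGT"})
    return freqs


def information_content(sites, background):
    """Sum over positions and bases of p * log2(p / p_background)."""
    total = 0.0
    for column in position_frequencies(sites):
        for base, p in column.items():
            if p > 0:  # a base that never occurs contributes nothing
                total += p * math.log2(p / background[base])
    return total


print(position_frequencies(SITES)[0])  # position 1: A in 8 of 9 sites, T in 1
print(information_content(SITES, BACKGROUND))
```

With a uniform background, a perfectly conserved position contributes 2 bits, so an 8-base motif scores at most 16 bits.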

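Scoring Motifs
Original function: the likelihood ratio of the aligned segments, $\prod_{i=1}^{x} p(s_i \mid \text{mtf}) / p(s_i \mid \text{bg})$
Information Content = $\sum_{j} \sum_{b} p_{bj} \log (p_{bj} / p_{b0}) = \sum_{j} \sum_{b} p_{bj} \log p_{bj} - \sum_{j} \sum_{b} p_{bj} \log p_{b0}$, where b is a base and j a motif position
Motif Conservedness (the first term): How likely to see the current aligned segments from this motif model
Good: ATGCA ATGCC TTGCA ATGGA
Bad: AGGCA ATCCC GCGCA CGGTA TGCCA ATGGT TTGAA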

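Scoring Motifs
Original function: the likelihood ratio of the aligned segments, $\prod_{i=1}^{x} p(s_i \mid \text{mtf}) / p(s_i \mid \text{bg})$
Information Content = $\sum_{j} \sum_{b} p_{bj} \log (p_{bj} / p_{b0}) = \sum_{j} \sum_{b} p_{bj} \log p_{bj} - \sum_{j} \sum_{b} p_{bj} \log p_{b0}$, where b is a base and j a motif position
Motif Specificity (the second term): How likely to see the current aligned segments from background
Good: AGTCC
Bad: ATAAA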

Scoring Motifs
Original function and Information Content: which is better? (data = 8 seqs)
Motif 1: AGGCTAAC
Motif 2: AGGCTAAC AGGCTACC AGCCTAAC AGGCCAAC TGGCTAAC AGGCTTAC AGGGTAAC

Scoring Motifs
Motif scoring function: prefer conserved motifs with many sites that are not often seen in the genome background
Motif Signal:
Abundant
Positions Conserved
Specific (unlikely in genome background)

Markov Background Increases Motif Specificity
Prefers motif segments enriched only in data, but not so likely to occur in the background
Segment ATGTA score = p(generate ATGTA from θ) / p(generate ATGTA from θ0)
3rd order Markov dependency
TCAGC: .25 × .25 × .25 × .25 × .25 (uniform background) vs .3 × .18 × .16 × .22 × .24 (Markov background)
ATATA: .25 × .25 × .25 × .25 × .25 (uniform background) vs .3 × .41 × .38 × .42 × .30 (Markov background)
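The two products on this slide can be multiplied out to see why a Markov background helps. The sketch below simply reuses the slide's numbers. Reading the .25 products as a uniform background and the second products as Markov conditional probabilities is an interpretation, and the variable and function names are illustrative.

```python
import math

# Per-base probabilities copied from the slide.
UNIFORM_STEPS = [0.25] * 5
MARKOV_STEPS = {
    "TCAGC": [0.30, 0.18, 0.16, 0.22, 0.24],
    "ATATA": [0.30, 0.41, 0.38, 0.42, 0.30],
}


def background_probability(step_probabilities):
    """Probability of a segment as the product of its per-base probabilities."""
    return math.prod(step_probabilities)


uniform = background_probability(UNIFORM_STEPS)
for segment, steps in MARKOV_STEPS.items():
    markov = background_probability(steps)
    print(segment, "uniform:", uniform, "markov:", markov)
```

With these numbers the Markov background gives TCAGC a probability of about 0.00046 and ATATA about 0.0059, against about 0.00098 for either segment under the uniform model. Since the background probability is the denominator of the score, the AT-rich ATATA is penalized while TCAGC is rewarded.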

Position Weight Matrix Update
Advantage:
Can look for motifs of any width
Flexible with base substitutions
Disadvantage:
EM and Gibbs sampling: no guaranteed convergence time
No guaranteed global optimum
Break

Motif Finding in Lower Eukaryotes
Upstream sequences are longer (500-1000 bp), with some simple repeats
Motif width varies (5 to 17 bases)
Expression clusters provide input sequences of decent quality for TF motif finding
Motif combinations and redundancy appear, although single motifs are usually significant enough for identification

Yeast Promoter Architecture Co-occurring regulators suggest physical interaction between the regulators

Introducing Sequence Conservation Kellis, et al, Nat 2003

Motif Finding in Higher Eukaryotes
Regulatory sequences are very long, with repeats, and can be far from target genes (enhancers)
Motifs can be short or long (6-20 bases), and appear in combinations and clusters
Gene expression clusters are not good enough as input
Need:
Better known motifs: e.g. PBM
Comparative genomics: PhastCons score
Motif modules: motif clusters
ChIP-chip/seq

UCSC PhastCons Conservation
Functional regulatory sequences are under stronger evolutionary constraint
Align orthologous sequences together
PhastCons conservation score (0 to 1) for each nucleotide in the genome can be downloaded from UCSC

Ultra Conserved Elements
> 200bp, ultra conserved in vertebrates
Exonic: enriched in RNA processing
Non-exonic: enriched in TF binding sites for developmental genes
Bejerano et al, Science 2004

Conservation vs Functions
Non-conservation ≠ non-function
Human Accelerated Regions are enriched in neurodevelopment
Prabhakar et al, Science 2008

PreMod: motif clusters in conserved regions Blanchette et al, Genome Res 2006

ChIP-seq Break

Chromatin ImmunoPrecipitation (ChIP) There are, of course, lots of transcription factors floating around in the nucleus but not associated with the chromatin.

TF/DNA Crosslinking in vivo This is typically accomplished through the addition of formaldehyde. Note that there will also be protein-protein cross-linking, and the relative efficiency of protein-protein vs. protein-DNA links is still poorly understood.

Sonication (~500bp)

TF-specific Antibody

Immunoprecipitation

Reverse Crosslink and DNA Purification Per Buck and Lieb, “Low DNA yields from the IP reactions usually make DNA amplification a requirement for DNA microarray-based detection. Randomly-primed or ligation-mediated PCR-based methods have been most commonly used.” There are also newer and perhaps more accurate (consistent) methods currently being explored.

ChIP-Seq
Figure labels: ChIP-DNA, Noise
Sequence millions of 30-50 mer ends of fragments
Map 30-50 mers back to the genome
These are DNA fragments pulled down from different cells, all containing the binding site. When they are hybridized on the tiling array, probe signals will be high in the middle, and gradually decrease with distance. However, in the control, we should just see aggregate random noise and no peak. Therefore, by comparing ChIP with control, we can identify the factor’s binding locations.

MACS: Model-based Analysis for ChIP-Seq
Use confident peaks to model shift size

Peak Calls
Tag distribution along the genome ~ Poisson distribution (λBG = total tag / genome size)
ChIP-Seq shows local biases in the genome
Chromatin and sequencing bias

Peak Calls
Tag distribution along the genome ~ Poisson distribution (λBG = total tag / genome size)
ChIP-Seq shows local biases in the genome
Chromatin and sequencing bias
200-300bp control windows have too few tags
But can look further
Dynamic λlocal = max(λBG, [λctrl, λ1k,] λ5k, λ10k)
Figure labels: ChIP, Control, window sizes 300bp, 1kb, 5kb, 10kb
http://liulab.dfci.harvard.edu/MACS/
Zhang et al, Genome Bio, 2008
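The dynamic λ idea can be sketched as follows. This is not the MACS implementation. It is a minimal illustration that assumes the control library has already been scaled to the ChIP sequencing depth, and the function names, the 300 bp peak width, and the example counts are hypothetical.

```python
import math


def dynamic_lambda(total_tags, genome_size, peak_width, control_counts):
    """Expected tags in a candidate peak under the most conservative background.

    control_counts maps a control window size in bp (e.g. 5000, 10000) to the
    number of control tags in that window centred on the candidate peak.
    Assumption: control counts are already scaled to the ChIP depth.
    """
    lambda_bg = total_tags / genome_size * peak_width
    local_rates = [count * peak_width / window
                   for window, count in control_counts.items()]
    return max([lambda_bg] + local_rates)


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation of the pmf."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


# Hypothetical example: 3 million tags on a 3 Gb genome, a 300 bp candidate peak.
lam = dynamic_lambda(3_000_000, 3_000_000_000, 300, {5000: 50, 10000: 40})
print(lam)                          # local expectation, driven by the 5 kb window
print(poisson_upper_tail(10, lam))  # p-value for observing 10 or more ChIP tags
```

Taking the maximum over the genome-wide rate and the wider control windows makes the test conservative in regions where chromatin or sequencing bias inflates the local tag density.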

ChIP-seq QC Break

ChIP-seq QC
Read quality (FASTQC)
Read mapping % (the higher the better)
Library complexity (avoid PCR amplification bias): # locations with 1 unique read / # locations
Good to keep one read / location in peak calling (default MACS)
FDR adjustment?
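The library complexity ratio and the "one read per location" filter can be sketched as below. The read representation, a (chromosome, start, strand) tuple, and the function names are assumptions. "Locations" is taken to mean distinct positions covered by at least one read.

```python
from collections import Counter


def library_complexity(read_locations):
    """Fraction of covered locations that carry exactly one read."""
    counts = Counter(read_locations)
    if not counts:
        raise ValueError("no reads given")
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


def keep_one_read_per_location(read_locations):
    """Drop duplicate reads, keeping the first read seen at each location."""
    seen = set()
    kept = []
    for location in read_locations:
        if location not in seen:
            seen.add(location)
            kept.append(location)
    return kept


# Hypothetical reads as (chromosome, start, strand).
reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 250, "-"), ("chr2", 40, "+")]
print(library_complexity(reads))          # 2 of the 3 covered locations are singletons
print(keep_one_read_per_location(reads))  # the duplicate at chr1:100 is removed
```

A low ratio suggests that many reads are PCR duplicates of the same fragment, which is why collapsing to one read per location before peak calling is a reasonable default.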

ChIP-seq QC
Number of peaks with good FDR and fold change (2-5 fold cutoff)
FRiP score: fraction of reads in peaks, factor-dependent
Evolutionary conservation: majority of sites not conserved
Overlap with DNase peaks (later lectures)
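A FRiP score is just a ratio, which the sketch below computes naively. Reads are reduced to a single (chromosome, position) pair and peaks to half-open (chromosome, start, end) intervals. Both representations and the function name are assumptions, and a real pipeline would use an interval index rather than this linear scan.

```python
def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any called peak."""
    if not read_positions:
        raise ValueError("no reads given")
    peaks_by_chrom = {}
    for chrom, start, end in peaks:
        peaks_by_chrom.setdefault(chrom, []).append((start, end))
    in_peaks = sum(
        1
        for chrom, pos in read_positions
        if any(start <= pos < end for start, end in peaks_by_chrom.get(chrom, []))
    )
    return in_peaks / len(read_positions)


# Hypothetical reads and peaks.
reads = [("chr1", 150), ("chr1", 500), ("chr2", 20), ("chr2", 99)]
peaks = [("chr1", 100, 200), ("chr2", 0, 50)]
print(frip(reads, peaks))  # two of the four reads fall in peaks
```

Because the expected value depends on how many sites the factor binds, the score is best compared between experiments on the same factor.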

ChIP-seq Downstream Analysis Break

Human TF Binding Distribution
Most TF binding sites are outside promoters
How to assign targets?
Binding in promoter (how far)?
Nearest distance?
Number of binding sites?
Other knowledge?
Still an open question

Higher Order Chromatin Interactions
Interaction frequency roughly follows an exponential decay with distance
Lieberman-Aiden et al, Science 2009

How to Assign Targets for Enhancer Binding Transcription Factors?
Regulatory potential: sum of binding sites weighted by distance to TSS with exponential decay
Rank1 of genes based on binding

How to Assign Targets for Enhancer Binding Transcription Factors?
Regulatory potential: sum of binding sites weighted by distance to TSS with exponential decay
Rank1 of genes based on binding
Rank2 of genes based on expression:
Differential expression upon TF perturbation
Gene expression correlation in cohorts
Rank ordered list of targets: Rank1 * Rank2
Only a minority of binding sites are functional in a given condition
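The regulatory potential and rank product can be sketched as below. The slide gives no decay constant or exact weighting formula, so the `exp(-distance / decay_bp)` form, the 10 kb default, and all function and gene names here are assumptions for illustration only.

```python
import math


def regulatory_potential(tss, peak_positions, decay_bp=10_000.0):
    """Sum of binding sites, each weighted by exp(-distance to TSS / decay_bp)."""
    return sum(math.exp(-abs(pos - tss) / decay_bp) for pos in peak_positions)


def ranks(scores):
    """Rank genes by score, 1 = highest."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def rank_product_order(binding_scores, expression_scores):
    """Order genes by Rank1 * Rank2, most likely targets first."""
    rank1 = ranks(binding_scores)
    rank2 = ranks(expression_scores)
    return sorted(rank1, key=lambda gene: rank1[gene] * rank2[gene])


# Hypothetical genes: TSS coordinate and nearby peak positions on one chromosome.
binding = {
    "geneA": regulatory_potential(50_000, [49_000, 52_000, 80_000]),
    "geneB": regulatory_potential(200_000, [230_000]),
    "geneC": regulatory_potential(400_000, [399_500]),
}
# Hypothetical expression evidence, e.g. absolute log fold change on TF perturbation.
expression = {"geneA": 2.1, "geneB": 0.2, "geneC": 1.4}
print(rank_product_order(binding, expression))
```

Multiplying the two ranks favours genes that score well on both binding and expression, which reflects the slide's point that binding alone does not imply function in a given condition.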

BETA: Binding and Expression Target Analysis
Is the TF an activator, a repressor, or both?
Do genes with higher regulatory potential show more up/down expression than random genes?

BETA: Binding and Expression Target Analysis
Is the TF an activator, a repressor, or both?
Do genes with higher regulatory potential show more up/down expression than random genes?
Functional analysis of targets?
How can a factor be both activator and repressor?
Collaborating transcription factors
Motif analysis on the binding sites near up vs down genes separately
Figure labels: TF??, ER

Summary
ChIP-seq identifies genome-wide in vivo protein-DNA interaction sites
ChIP-seq peak calling: shift reads
ChIP-seq QC: FDR, fold, conservation, etc.
Functional analysis of ChIP-seq data:
Target identification
Activator / repressor function
Motif analysis