Motif Finding Continued


1 Motif Finding Continued
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

2 Scoring Motifs
Information content (aka relative entropy). Suppose you have x aligned segments for the motif. The likelihood ratio of the alignment is:
pb(s1 from mtf) / pb(s1 from bg) * pb(s2 from mtf) / pb(s2 from bg) * … * pb(sx from mtf) / pb(sx from bg)
Sites used to build the motif matrix: ATGGCATG, AGGGTGCG, ATCGCATG, TTGCCACG, ATGGTATT, ATTGCACG, AGGGCGTT, ATGACATG, ACTGGATG
Segment ATGCAGCT score = p(generate ATGCAGCT from motif matrix) / p(generate ATGCAGCT from background), where the background probability is $p_{A0} \times p_{T0} \times p_{G0} \times p_{C0} \times p_{A0} \times p_{G0} \times p_{C0} \times p_{T0}$.

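3 Scoring Motifs
pb(s1 from mtf) / pb(s1 from bg) * … * pb(sx from mtf) / pb(sx from bg)
= $(p_{A1}/p_{A0})^{A_1} (p_{T1}/p_{T0})^{T_1} (p_{T2}/p_{T0})^{T_2} (p_{G2}/p_{G0})^{G_2} (p_{C2}/p_{C0})^{C_2} \cdots$
Here $A_1$ is the number of segments with A at position 1, $p_{A1}$ is the frequency of A at position 1, and $p_{A0}$ is the background frequency of A.
Take the log of this:
= $A_1 \log(p_{A1}/p_{A0}) + T_1 \log(p_{T1}/p_{T0}) + T_2 \log(p_{T2}/p_{T0}) + G_2 \log(p_{G2}/p_{G0}) + \cdots$
Divide by the number of segments (if all the motifs have the same number of segments):
= $p_{A1} \log(p_{A1}/p_{A0}) + p_{T1} \log(p_{T1}/p_{T0}) + p_{T2} \log(p_{T2}/p_{T0}) + \cdots$
Sites: ATGGCATG, AGGGTGCG, ATCGCATG, TTGCCACG, ATGGTATT, ATTGCACG, AGGGCGTT, ATGACATG, ACTGGATG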
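The per-position sum on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: it assumes a uniform background (0.25 per base) and log base 2, neither of which the slide specifies, and it treats bases that never occur in a column as contributing zero.

```python
import math
from collections import Counter

# The nine aligned sites shown on the slide.
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ACTGGATG",
]

# Assumption: uniform background; the slide leaves the background unspecified.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def information_content(sites, background=BACKGROUND):
    """Sum over positions i and bases b of p_bi * log2(p_bi / p_b0)."""
    n = len(sites)
    width = len(sites[0])
    total = 0.0
    for i in range(width):
        counts = Counter(site[i] for site in sites)
        for base, count in counts.items():
            p = count / n  # observed frequency of this base in column i
            total += p * math.log2(p / background[base])
    return total


if __name__ == "__main__":
    print(f"IC = {information_content(SITES):.3f} bits")
```

With a uniform background, a fully conserved column contributes 2 bits and a column with all four bases equally frequent contributes 0, so the score rewards conservation position by position.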

4 Scoring Motifs
Original function: Information Content = $\sum_{i} \sum_{b} p_{bi} \log(p_{bi}/p_{b0})$, summing over motif positions i and bases b.
Motif conservedness: how likely are the current aligned segments to arise from this motif model?
Good: ATGCA ATGCC TTGCA ATGGA
Bad: AGGCA ATCCC GCGCA CGGTA TGCCA ATGGT TTGAA

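5 Scoring Motifs
Original function: Information Content = $\sum_{i} \sum_{b} p_{bi} \log(p_{bi}/p_{b0})$, summing over motif positions i and bases b.
Motif specificity: how likely are the current aligned segments to arise from the background?
Good: AGTCC
Bad: ATAAA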

6 Scoring Motifs
Original function: Information Content = $\sum_{i} \sum_{b} p_{bi} \log(p_{bi}/p_{b0})$, summing over motif positions i and bases b.
Which is better? (data = 8 seqs)
Motif 1: AGGCTAAC
Motif 2: AGGCTAAC, AGGCTACC, AGCCTAAC, AGGCCAAC, TGGCTAAC, AGGCTTAC, AGGGTAAC
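One way to make the comparison concrete is to score both motifs with information content alone. The sketch below assumes a uniform background and log base 2 (the slide gives neither). Under these assumptions the single-site Motif 1 gets the maximum possible score, even though only one of the 8 sequences supports it, which suggests why information content by itself may not be enough.

```python
import math
from collections import Counter

MOTIF_1 = ["AGGCTAAC"]
MOTIF_2 = [
    "AGGCTAAC", "AGGCTACC", "AGCCTAAC", "AGGCCAAC",
    "TGGCTAAC", "AGGCTTAC", "AGGGTAAC",
]


def information_content(sites, p0=0.25):
    """Information content in bits, assuming a uniform background p0."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):  # iterate over alignment columns
        for count in Counter(column).values():
            p = count / n
            total += p * math.log2(p / p0)
    return total


if __name__ == "__main__":
    print("Motif 1 (1 site): ", round(information_content(MOTIF_1), 2))
    print("Motif 2 (7 sites):", round(information_content(MOTIF_2), 2))
```

Motif 1 scores higher here despite having a single site, so a scoring function that also rewards the number of sites would rank the two differently.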

7 Scoring Motifs
Motif scoring function prefers conserved motifs with many sites that are not often seen in the genome background.
Motif signal: abundant, positions conserved, specific (unlikely in genome background).

8 Markov Background Increases Motif Specificity
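Prefers motif segments enriched only in the data, but not so likely to occur in the background.
Segment ATGTA score = p(generate ATGTA from θ) / p(generate ATGTA from θ0), where θ is the motif matrix and θ0 is the background model.
3rd order Markov dependency: the probability of each base is conditioned on the three preceding bases.
Background probability, uniform vs. Markov:
TCAGC = .25 × .25 × .25 × .25 × .25 vs. .3 × .18 × .16 × .22 × .24
ATATA = .25 × .25 × .25 × .25 × .25 vs. .3 × .41 × .38 × .42 × .30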
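The effect of the Markov background can be checked by multiplying out the per-base probabilities shown on the slide. The sketch below assumes the first row of numbers for each segment is a uniform background and the second row is the Markov background. The numbers are copied from the slide, not estimated from any genome.

```python
import math

# Per-base background probabilities copied from the slide.
UNIFORM = [0.25] * 5
MARKOV = {
    "TCAGC": [0.3, 0.18, 0.16, 0.22, 0.24],
    "ATATA": [0.3, 0.41, 0.38, 0.42, 0.30],
}


def background_probability(per_base_probs):
    """Probability of a segment as the product of its per-base probabilities."""
    return math.prod(per_base_probs)


if __name__ == "__main__":
    p_uniform = background_probability(UNIFORM)
    for segment, probs in MARKOV.items():
        p_markov = background_probability(probs)
        # A ratio above 1 means the segment is rarer under the Markov background,
        # so it would score as more specific there.
        print(segment, f"uniform={p_uniform:.6f}", f"markov={p_markov:.6f}",
              f"ratio={p_uniform / p_markov:.2f}")
```

Under these numbers TCAGC becomes less likely in the background (more specific), while the AT-rich ATATA becomes more likely (less specific).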

9 Position Weight Matrix Update
Advantages: can look for motifs of any width; flexible with base substitutions.
Disadvantages (EM and Gibbs sampling): no guaranteed convergence time; no guaranteed global optimum.
Break

10 Motif Finding in Lower Eukaryotes
Upstream sequences are longer ( bp), with some simple repeats. Motif width varies (5 – 17 bases). Expression clusters provide input sequences of decent quality for TF motif finding. Motif combination and redundancy appear, although single motifs are usually significant enough for identification.

11 Yeast Promoter Architecture
Co-occurring regulators suggest physical interaction between the regulators

12 Introducing Sequence Conservation
Kellis et al, Nature 2003

13 Motif Finding in Higher Eukaryotes
Regulatory sequences are very long, with repeats, and can be far from target genes (enhancers). Motifs can be short or long (6-20 bases), and appear in combinations and clusters. Gene expression clusters are not good enough as input.
Need: better known motifs (e.g. PBM); comparative genomics (phastCons score); motif modules (motif clusters); ChIP-chip/seq.

14 UCSC PhastCons Conservation
Functional regulatory sequences are under stronger evolutionary constraint. Align orthologous sequences together. The PhastCons conservation score (0 – 1) for each nucleotide in the genome can be downloaded from UCSC.

15 Ultra Conserved Elements
> 200bp, ultra conserved in vertebrates. Exonic elements are enriched in RNA processing. Non-exonic elements are enriched in TF binding sites for developmental genes. Bejerano et al, Science 2004

16 Conservation vs Functions
Non-conservation ≠ non-function. Human Accelerated Regions are enriched in neurodevelopment. Prabhakar et al, Science 2008

17 PreMod: motif clusters in conserved regions
Blanchette et al, Genome Res 2006

18 ChIP-seq Break

19 Chromatin ImmunoPrecipitation (ChIP)
There are, of course, lots of transcription factors floating around in the nucleus but not associated with the chromatin.

20 TF/DNA Crosslinking in vivo
This is typically accomplished through the addition of formaldehyde. Note that there will also be protein-protein cross-linking, and the relative efficiency of protein-protein vs. protein-DNA links is still poorly understood.

21 Sonication (~500bp)

22 TF-specific Antibody

23 Immunoprecipitation

24 Reverse Crosslink and DNA Purification
Per Buck and Lieb, “Low DNA yields from the IP reactions usually make DNA amplification a requirement for DNA microarray-based detection. Randomly-primed or ligation-mediated PCR-based methods have been most commonly used.” There are also newer and perhaps more accurate (consistent) methods currently being explored.

25 ChIP-Seq
Sequence millions of 30-50 mer ends of fragments. Map the 30-50 mers back to the genome. (Figure labels: ChIP-DNA, Noise.)
These are DNA fragments pulled down from different cells, all containing the binding site. When they are hybridized on the tiling array, probe signals will be high in the middle, and gradually decrease with distance. However, in the control, we should just see aggregate random noise and no peak. Therefore, by comparing ChIP with control, we can identify the factor’s binding locations.

26 MACS: Model-based Analysis for ChIP-Seq
Use confident peaks to model the shift size.
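The idea of shifting reads can be illustrated with a toy example. The sketch below is a simplification and not the MACS implementation: it estimates the offset d between plus-strand and minus-strand tag positions at a single peak using medians, then moves every tag by d/2 toward its 3' end so that both strands pile up over the presumed binding site. The function names and the toy coordinates are hypothetical.

```python
from statistics import median


def estimate_shift(plus_starts, minus_starts):
    """Half the offset between minus- and plus-strand tag positions."""
    d = median(minus_starts) - median(plus_starts)
    return d / 2


def shift_tags(plus_starts, minus_starts, shift):
    """Move plus-strand tags downstream and minus-strand tags upstream."""
    shifted_plus = [p + shift for p in plus_starts]
    shifted_minus = [m - shift for m in minus_starts]
    return shifted_plus, shifted_minus


if __name__ == "__main__":
    # Toy 5' tag positions around one binding site (hypothetical values).
    plus = [900, 950, 1000]
    minus = [1100, 1150, 1200]
    shift = estimate_shift(plus, minus)
    print("shift:", shift)
    print(shift_tags(plus, minus, shift))
```

After shifting, the plus-strand and minus-strand tags in this toy example land on the same coordinates, which is the behaviour the shift model aims for.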

27 Peak Calls
Tag distribution along the genome ~ Poisson distribution (λBG = total tags / genome size). ChIP-Seq shows local biases in the genome: chromatin and sequencing bias.

28 Peak Calls
Tag distribution along the genome ~ Poisson distribution (λBG = total tags / genome size). ChIP-Seq shows local biases in the genome: chromatin and sequencing bias.
bp control windows have too few tags, but one can look further.
Dynamic λlocal = max(λBG, [λctrl, λ1k,] λ5k, λ10k)
(Figure: ChIP and control tracks with 300bp, 1kb, 5kb and 10kb windows.) Zhang et al, Genome Bio, 2008
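The dynamic λ rule can be sketched as follows. This is an illustrative approximation rather than the MACS code: it assumes ChIP and control libraries have the same depth, expresses every rate as expected tags in a window of the candidate peak's width, and computes the Poisson upper tail by direct summation. The function names and the toy numbers are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by summing the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) computed from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def local_lambda(lambda_bg, control_counts, peak_width):
    """max(lambda_BG, control rates), each rate scaled to the peak width.

    control_counts maps a window width in bp to the number of control tags
    found in that window around the candidate peak.
    """
    scaled = [count * peak_width / width for width, count in control_counts.items()]
    return max([lambda_bg, *scaled])


if __name__ == "__main__":
    peak_width = 200
    # lambda_BG = total tags / genome size, scaled to the peak width.
    lambda_bg = 1_000_000 / 100_000_000 * peak_width
    lam = local_lambda(lambda_bg, {5_000: 100, 10_000: 150}, peak_width)
    print("lambda_BG:", lambda_bg, "lambda_local:", lam)
    print("p-value for 15 ChIP tags:", poisson_sf(15, lam))
```

Taking the maximum makes the test conservative: a region where the control is locally tag-rich needs more ChIP tags to be called a peak.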

29 ChIP-seq QC Break

30 ChIP-seq QC
Read quality (FastQC). Read mapping % (the higher the better). Library complexity (avoid PCR amplification bias): # locations with 1 unique read / # locations. Good to keep one read per location in peak calling (MACS default). FDR adjustment?
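The library complexity ratio can be computed directly from mapped read locations. In this sketch a location is a hypothetical (chromosome, start, strand) tuple. Real pipelines would read these from an alignment file, which is outside the scope of the example.

```python
from collections import Counter


def library_complexity(read_locations):
    """# locations with exactly one read / # distinct locations."""
    counts = Counter(read_locations)
    if not counts:
        raise ValueError("no reads given")
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


def keep_one_read_per_location(read_locations):
    """Collapse duplicate reads so each location is counted once."""
    return sorted(set(read_locations))


if __name__ == "__main__":
    reads = [
        ("chr1", 100, "+"), ("chr1", 100, "+"),  # duplicate location
        ("chr1", 250, "-"),
        ("chr2", 40, "+"),
    ]
    print("complexity:", library_complexity(reads))
    print("deduplicated:", keep_one_read_per_location(reads))
```

A ratio close to 1 means most locations are covered by a single read, which is what a library with little PCR duplication should look like.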

31 ChIP-seq QC
Number of peaks with good FDR and fold change (2-5 fold cutoff). FRiP score: fraction of reads in peaks, factor-dependent. Evolutionary conservation: majority of sites not conserved. Overlap with DNase peaks: later lectures.
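The FRiP score is simple enough to show as code. The sketch below assumes reads are reduced to a single (chromosome, position) point and peaks are half-open (chromosome, start, end) intervals. Both representations are choices made for the example, and the linear scan over peaks is only suitable for small inputs.

```python
def frip(reads, peaks):
    """Fraction of reads whose position falls inside any peak.

    reads: iterable of (chrom, position) tuples.
    peaks: iterable of (chrom, start, end) tuples, end exclusive.
    """
    reads = list(reads)
    if not reads:
        raise ValueError("no reads given")
    in_peaks = sum(
        1
        for chrom, pos in reads
        if any(c == chrom and start <= pos < end for c, start, end in peaks)
    )
    return in_peaks / len(reads)


if __name__ == "__main__":
    reads = [("chr1", 150), ("chr1", 500), ("chr2", 150), ("chr1", 199)]
    peaks = [("chr1", 100, 200)]
    print("FRiP:", frip(reads, peaks))
```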

32 ChIP-seq Downstream Analysis
Break

33 Human TF Binding Distribution
Most TF binding sites are outside promoters. How to assign targets? Binding in promoter (how far)? Nearest distance? Number of binding sites? Other knowledge? Still an open question.

34 Higher Order Chromatin Interactions
Interaction frequency approximately follows exponential decay with distance. Lieberman-Aiden et al, Science 2009

35 How to Assign Targets for Enhancer Binding Transcription Factors?
Regulatory potential: sum of binding sites weighted by distance to TSS with exponential decay. Rank1 of genes based on binding.

36 How to Assign Targets for Enhancer Binding Transcription Factors?
Regulatory potential: sum of binding sites weighted by distance to TSS with exponential decay. Rank1 of genes based on binding. Rank2 of genes based on expression: differential expression upon TF perturbation, or gene expression correlation in cohorts. Rank-ordered list of targets: Rank1 * Rank2. Only a minority of binding sites are functional in a given condition.
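The two ranking steps can be sketched as below. This is an illustration under stated assumptions, not the published method's exact formula: the decay length `decay_bp` is a hypothetical parameter, every binding site gets equal weight before decay, and ties in rank are broken by input order.

```python
import math


def regulatory_potential(tss, site_positions, decay_bp=10_000):
    """Sum of binding sites, each weighted by exp(-distance to TSS / decay_bp)."""
    return sum(math.exp(-abs(pos - tss) / decay_bp) for pos in site_positions)


def rank_descending(scores):
    """Map each gene to its rank, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def rank_product_order(binding_scores, expression_scores):
    """Order genes by Rank1 * Rank2, smallest product first."""
    r1 = rank_descending(binding_scores)
    r2 = rank_descending(expression_scores)
    return sorted(r1, key=lambda gene: r1[gene] * r2[gene])


if __name__ == "__main__":
    # Hypothetical genes: TSS coordinate and nearby binding site positions.
    binding = {
        "geneA": regulatory_potential(50_000, [50_500, 52_000, 70_000]),
        "geneB": regulatory_potential(200_000, [201_000, 230_000]),
        "geneC": regulatory_potential(400_000, [480_000]),
    }
    # Hypothetical expression evidence, e.g. absolute log fold change.
    expression = {"geneA": 0.5, "geneB": 2.0, "geneC": 1.0}
    print(rank_product_order(binding, expression))
```

Multiplying the ranks favours genes that score well on both binding and expression, in line with the point that only a minority of binding sites is functional in a given condition.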

37 BETA Binding expression target analysis
Is the TF an activator, a repressor, or both? Do genes with higher regulatory potential show more up/down expression than random genes?

38 BETA Binding expression target analysis
Is the TF an activator, a repressor, or both? Do genes with higher regulatory potential show more up/down expression than random genes? Functional analysis of targets? How can a factor be both an activator and a repressor? Collaborating transcription factors. Motif analysis on the binding sites near up vs down genes separately. (Figure labels: TF??, ER.)

39 Summary
ChIP-seq identifies genome-wide in vivo protein-DNA interaction sites. ChIP-seq peak calling: shift reads. ChIP-seq QC: FDR, fold, conservation, etc. Functional analysis of ChIP-seq data: target identification, activator / repressor function, motif analysis.

