1
De novo Motif Finding using ChIP-Seq
My Paper is rejected!
De novo Motif Finding using ChIP-Seq
Presenter: Zhizhuo Zhang
Supervisor: Wing-Kin Sung
2
Copyright 2009 @ Zhang ZhiZhuo
Outline
The Impact of ChIP-Seq's Properties on Motif Finding
Our proposed algorithm (Pomoda)
Problem 1
Solution
Problem 2
Comments from Reviewers
9/17/2018 Copyright Zhang ZhiZhuo
3
De novo Motif Finding
Input: A set of regulatory sequences that are likely bound by the same transcription factor, obtained by experimental techniques such as ChIP, from orthologous genes across various species, or from co-regulated genes identified by microarray analysis.
Aim: Use a computational algorithm to search for a set of recurring motif models (de novo) in the given input sequences.
4
Motif Modeling
Consensus-based motif: Consensus search can guarantee global optimality and is appropriate for short motifs. It is very fast when implemented with optimized data structures such as a suffix tree, mismatch tree, or lookup table.
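As a rough illustration of the lookup-table idea, the sketch below counts every k-mer once and then scores a consensus pattern by summing the counts of all k-mers within a small Hamming distance. The function names, the toy sequences, and the mismatch limit are assumptions made for this example; they are not taken from any particular tool.

```python
from collections import Counter

def kmer_table(sequences, k):
    """Build a lookup table mapping each k-mer to its number of occurrences."""
    table = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]] += 1
    return table

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def consensus_count(table, consensus, max_mismatch=1):
    """Occurrences of the consensus, allowing up to max_mismatch mismatches."""
    return sum(count for kmer, count in table.items()
               if hamming(kmer, consensus) <= max_mismatch)

# Toy input (hypothetical sequences, not real ChIP data).
sequences = ["AAGGTCAAA", "TTGGACATT"]
table = kmer_table(sequences, k=5)
exact = consensus_count(table, "GGTCA", max_mismatch=0)
fuzzy = consensus_count(table, "GGTCA", max_mismatch=1)
```

Because the table is built once, many candidate consensus patterns can be scored without rescanning the input sequences.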
5
Motif Modeling
PWM-based motif: Assumes that each position is independent of the others. It is a more precise approximation of the real motif than a consensus-based motif. (PWM: Position Weight Matrix)
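The independence assumption means a site can be scored by adding one term per position. A minimal sketch, assuming a uniform background of 0.25 per base and a hypothetical 4-column PWM for the pattern GGTC (neither comes from the slides):

```python
import math

BASES = "ACGT"
# Assumption: uniform background; a real analysis would estimate this.
BACKGROUND = {b: 0.25 for b in BASES}

# Hypothetical PWM for "GGTC": 0.97 for the preferred base, 0.01 otherwise.
PWM = [{b: (0.97 if b == c else 0.01) for b in BASES} for c in "GGTC"]

def log_odds_score(pwm, site, background=BACKGROUND):
    """Sum of per-position log2 odds; valid because columns are independent."""
    if len(site) != len(pwm):
        raise ValueError("site and PWM must have the same width")
    return sum(math.log2(col[b] / background[b]) for col, b in zip(pwm, site))

def best_match(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    w = len(pwm)
    return max((log_odds_score(pwm, sequence[i:i + w]), i)
               for i in range(len(sequence) - w + 1))

score, start = best_match(PWM, "AAGGTCAA")
```

Unlike a consensus string, the matrix still gives partial credit to a site such as GGTA, which differs from the preferred pattern at one position.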
6
Background Modeling
Organism-specific background: hard to capture the negative information in the background.
Position-specific background: reveals the biological context and makes it easier to capture the negative information.
7
ChIP-Seq Technique
Chromatin immunoprecipitation (ChIP) with specific antibodies against these TFs was used to enrich the DNA fragments bound by these TFs, followed by direct ultra-high-throughput sequencing with the Solexa Genome Analyzer platform. Genomic regions defined by multiple overlapping DNA fragments derived from the ChIP enrichments were considered as putative binding sites.
8
Comparison with Chip-Chip
9
What does ChIP-Seq mean to us?
Sequences -> Motif Finding Tools -> Motif models
More data: good news for data mining, but necessary for de novo motif finding
Higher resolution: the job becomes easier (localization)
10
How large is the data?
The definition of "large data" keeps changing!
10 years ago: tens of sequences (promoter sequences: MEME, AlignACE)
5 years ago: hundreds of sequences (ChIP-chip: Weeder)
2 years ago: thousands of sequences (higher-throughput ChIP-chip: Trawler, Amadeus)
Now: tens of thousands of sequences (ChIP-Seq: ?)
11
Higher Resolution Means?
It means that finding the main motif (the TF targeted by the antibody) becomes an easy job!
The main motif is strongly over-represented.
The peak range is only about 50 bp, so simply aligning all the peak regions gives a good motif.
Our focus may therefore shift from the main TF to the TFs that work with it.
12
Peak Oriented Motif Discovery
What peak information can be helpful? Peak intensity (PI) and peak location (PL).
Our targets: not only the main motif, but also the co-motifs sitting around the main motif.
PI: higher intensity means a higher chance of the main motif, which implies a higher chance of a co-motif.
PL: the surrounding region would be enriched.
13
POMODA: Peak Oriented Motif Discovery Algorithm
With the above discussion in mind, we look at an example. The colored dots are motif matches predicted by our scan, and the histogram of each motif can be studied.
[Figure: motif matches plotted relative to the ChIP-seq peak center. Labels: the main motif; a co-motif; a motif that should be noise, as it does not exhibit a distance preference to the main motif.]
14
How to Score a PWM
Instead of comparing against a sophisticated background model, we just look at the distribution from the center to the flanking region!
Traditionally, people first extract DNA sequences within a fixed window d around every peak (say, d = 400 bp); then, using some motif finder, they mine motifs that are over-represented in those sequences under some organism-specific background model.
There are two issues with this method. First, we need to specify the window size d. If d is too big, the motif signal is diluted. If d is too small, we may fail to capture motifs that occur in a window bigger than d. Second, we need to specify the background model. However, selecting a background model is still an art; there is no guideline for selecting the correct one.
15
Peak Enrichment Score
Given a PWM motif Ɵ and a set of input sequences of length L centered on the peak location, the peak enrichment score is defined as follows.
[Figure: histogram of motif occurrences along the sequence; H is the peak height and h is the noise level]
Signal-to-noise ratio = (H - h)/h
A simple score = H/h
Since we don't know the exact size of the active region, and it may vary between motifs, we define an odds-ratio score based on a dynamic window size.
We can use the peak intensity when calculating the occurrences.
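One way to read the H/h score with a dynamic window is sketched below. It is only an interpretation: H is taken as the occurrence density (per bp) inside a central window, h as the density in the remaining flanks, and the window is chosen from a list of candidates to maximize H/h. The pseudocount, the candidate windows, and the function name are assumptions for this example, and the peak-intensity weighting mentioned above is left out.

```python
def peak_enrichment(offsets, seq_len, half_windows, pseudo=1.0):
    """Return (best H/h ratio, half-window that achieves it).

    offsets      : motif occurrence positions relative to the peak center
    seq_len      : length L of each peak-centered input sequence
    half_windows : candidate half-widths of the central window (assumed list)
    pseudo       : pseudocount that avoids division by zero (assumption)
    """
    best = None
    for w in half_windows:
        if not 0 < 2 * w < seq_len:
            continue  # the window must leave some flanking region
        inside = sum(1 for o in offsets if abs(o) <= w)
        outside = len(offsets) - inside
        H = (inside + pseudo) / (2 * w)              # center density
        h = (outside + pseudo) / (seq_len - 2 * w)   # flanking (noise) density
        ratio = H / h
        if best is None or ratio > best[0]:
            best = (ratio, w)
    return best

# Toy offsets: five hits near the peak center, two far away in the flanks.
offsets = [-5, 0, 3, 10, -12, 400, -350]
ratio, window = peak_enrichment(offsets, seq_len=1000, half_windows=[25, 50, 100])
```

A motif with no positional preference gives a ratio close to 1 for every window, which is how such a score separates co-motifs from noise without an explicit background model.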
16
Problem 1
Although the scoring function is quite simple, an efficient optimization method is badly needed because of the huge size of the datasets and the many parameters in a PWM.
17
Solution
Hierarchically prune the search space! (Apply cheap tests to the large number of candidates and do the more complicated work only on the few that remain.)
Column-based PWM updating by comparing center components with flanking components!
18
Algorithm Overview
1. Seed Finding
2. PWM Extending & Refinement
3. Redundant Motifs Filtering
19
Seeds Finding
Enumerate all length-6 patterns:
GGTCAC CGGTCA GGGTCA AGGTCA … ATGACC CAGGTC AGGTCG CGTGAC CTGACC
[Table: example seed PWM (AACTTG); columns Po 1 to 6; rows A, C, G, T; entries 0.97 and 0.01]
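The enumeration step can be sketched as follows. The values 0.97 and 0.01 are the ones visible in the table; assigning 0.97 to the seed base and 0.01 to each of the other three bases is an assumption that makes every column sum to 1. The function names are illustrative.

```python
from itertools import product

BASES = "ACGT"

def enumerate_seeds(k=6):
    """Yield every DNA pattern of length k (4**k patterns in total)."""
    return ("".join(p) for p in product(BASES, repeat=k))

def seed_to_pwm(seed, hi=0.97, lo=0.01):
    """Turn a seed string into a PWM: one {base: probability} dict per position."""
    return [{b: (hi if b == c else lo) for b in BASES} for c in seed]

seeds = list(enumerate_seeds(6))
pwm = seed_to_pwm("GGTCAC")
```

With k = 6 there are only 4096 candidates, so each seed can be scored directly before the more expensive extension step.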
20
PWM Extending & Refinement
Encapsulate the core PWM into a wide PWM. For example, we implant the length-6 PWM into a length-26 PWM, as follows:
[Table: columns Po 1, 2, ……, 9 to 16, ……, 25, 26; rows A, C, G, T; flanking entries 0.25; Core PWM entries 0.97 and 0.01]
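A minimal sketch of the embedding, assuming the core sits in the middle of the wide PWM (positions 11 to 16 of 26) and that every flanking column starts as uniform 0.25. The placement and the function name are assumptions for this example.

```python
BASES = "ACGT"

def embed_core(core, width=26, uniform=0.25):
    """Place a core PWM in the middle of a wider PWM with uniform flanks."""
    pad = width - len(core)
    if pad < 0 or pad % 2:
        raise ValueError("width must exceed the core length by an even amount")
    def flank():
        return [{b: uniform for b in BASES} for _ in range(pad // 2)]
    # Copy the core columns so later updates do not modify the original seed.
    return flank() + [dict(col) for col in core] + flank()

# Hypothetical length-6 core built from the seed GGTCAC.
core = [{b: (0.97 if b == c else 0.01) for b in BASES} for c in "GGTCAC"]
wide = embed_core(core)
```

The uniform columns carry no information yet (the "trivial" columns), so the later update step is free to turn any of them into an informative position.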
21
PWM Extending & Refinement
Select the best column to update based on the Center PWM and the Flank PWM.
Center Instances:
A…A…GGTCA…C…C
T…G…GGTCA…C…G
……
C…T…GGTCA…C…A
Flank Instances:
A…A…GGTCA…C…C
T…G…GGTCA…A…G
G…A…GGTCA…T…T
T…G…GGTCA…G…G
……
C…T…GGTCA…T…A
The frequency of C in the highlighted column is 100% in the center instances and 20% in the flank instances, so if we take C at that column, we gain a 5 times higher score: GGTCANNNNC
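The column selection can be sketched by comparing per-column base frequencies between the two groups of instances and picking the column and base with the largest center-to-flank ratio. The toy instances are short aligned strings invented for this example, and the frequency floor that avoids division by zero is an assumption.

```python
from collections import Counter

BASES = "ACGT"

def column_frequencies(instances):
    """Per-column base frequencies for a list of equal-length aligned strings."""
    n = len(instances)
    freqs = []
    for column in zip(*instances):
        counts = Counter(column)
        freqs.append({b: counts[b] / n for b in BASES})
    return freqs

def best_column_update(center, flank, floor=0.05):
    """Return (ratio, column index, base) with the largest center/flank ratio."""
    best = (0.0, None, None)
    center_f = column_frequencies(center)
    flank_f = column_frequencies(flank)
    for j, (cf, ff) in enumerate(zip(center_f, flank_f)):
        for b in BASES:
            ratio = cf[b] / max(ff[b], floor)  # floor is a hypothetical safeguard
            if ratio > best[0]:
                best = (ratio, j, b)
    return best

# Toy aligned instances: column 1 is always C near the peak center,
# but C appears there in only 1 of 5 flank instances.
center = ["ACA", "TCG", "GCT", "CCA", "TCC"]
flank = ["ACA", "TAG", "GGT", "CTA", "TTC"]
ratio, column, base = best_column_update(center, flank)
```

In this toy data the result reproduces the 100% versus 20% case: taking C at column 1 gives a ratio of 5, while every other column has a ratio of 1 or less.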
22
23
More Details
Two cases:
1. Update a trivial column (extension)
2. Update a non-trivial column (refinement)
24
Update a trivial column (extension)
25
Update a non-trivial column (refinement)
26
Update a non-trivial column (refinement)
27
When the data is not large
The method above fails when the number of occurrences is small, which means the noise level is not high enough!
Fix: when a flanking PWM element is larger than the maximum A/C/G/T frequency in the flanking region, that element is set to that frequency.
A discriminative process then becomes an associational process!
28
Problem 2
There are many extended sub-patterns after phase 2; we need to generalize them and filter out the redundant ones.
How do we measure the similarity of two sub-patterns?
How do we cluster similar ones and merge them into a generalized pattern?
29
Problem 2
GGTCACHSTGAC
CMRGGTCAS
AGGTCASSCTGMCC
CAGGTCASNNTGMCC
CWGGGTCASNNTSNS
GGTCARGGTCA
GTCANCGT
RRGNYRNCCTGACC
AGGTCAAS
GGTSACCCWG
All these sub-patterns come from the same generalized pattern!
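The sub-patterns above use the standard IUPAC nucleotide codes (for example S = C or G, M = A or C, R = A or G, W = A or T, N = any base). A small sketch of how such a degenerate pattern can be matched against a sequence, which helps show that several sub-patterns hit the same sites; the function names and the test sequence are made up for this example.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches(pattern, site):
    """True if every base of the site is allowed by the pattern at that position."""
    return len(pattern) == len(site) and all(
        base in IUPAC[code] for code, base in zip(pattern, site))

def find_sites(pattern, sequence):
    """Start positions of all windows of the sequence that match the pattern."""
    w = len(pattern)
    return [i for i in range(len(sequence) - w + 1)
            if matches(pattern, sequence[i:i + w])]

# Hypothetical sequence containing one GGTCA half-site.
hits = find_sites("CMRGGTCAS", "TTCAGGGTCACTT")
```

Two sub-patterns whose hit lists overlap heavily are candidates for being the same underlying motif, which is the similarity question raised above.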
30
Redundant Motifs Filtering (old solution)
Two-level filtering:
Level 1: comparing to the accepted motifs
Positions overlap more than 2%
PWM divergence less than 0.18
Level 2: comparing to the filtered motifs
Positions overlap more than 20%
PWM divergence less than 0.14
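A single filtering level of this kind might look like the greedy sketch below. Several details are assumptions, since the slides do not define them: position overlap is taken as the shared fraction of occurrence positions, PWM divergence is replaced by a placeholder (mean Euclidean distance between same-width columns, with no shifting), and a motif counts as redundant only when both conditions hold. The thresholds are the Level 1 values.

```python
import math

BASES = "ACGT"

def positions_overlap(sites_a, sites_b):
    """Shared fraction of occurrence positions (assumed definition)."""
    a, b = set(sites_a), set(sites_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def pwm_divergence(p, q):
    """Placeholder divergence: mean Euclidean distance between columns."""
    if len(p) != len(q):
        raise ValueError("this sketch only compares PWMs of equal width")
    return sum(math.sqrt(sum((cp[b] - cq[b]) ** 2 for b in BASES))
               for cp, cq in zip(p, q)) / len(p)

def filter_redundant(motifs, min_overlap=0.02, max_divergence=0.18):
    """Keep motifs (ranked best first) that are not redundant with accepted ones."""
    accepted = []
    for m in motifs:
        redundant = any(
            positions_overlap(m["sites"], a["sites"]) > min_overlap
            and pwm_divergence(m["pwm"], a["pwm"]) < max_divergence
            for a in accepted)
        if not redundant:
            accepted.append(m)
    return accepted

def toy_pwm(seed):
    """Hypothetical helper: near-deterministic PWM for a seed string."""
    return [{b: (0.97 if b == c else 0.01) for b in BASES} for c in seed]

motifs = [
    {"name": "m1", "pwm": toy_pwm("GGTCA"), "sites": [1, 2, 3]},
    {"name": "m2", "pwm": toy_pwm("GGTCA"), "sites": [2, 3, 4]},
    {"name": "m3", "pwm": toy_pwm("TTGAC"), "sites": [10, 11]},
]
kept = [m["name"] for m in filter_redundant(motifs)]
```

Even this reduced version has two thresholds per level, which gives a sense of how many parameters the two-level scheme needs to tune.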
31
Filtering Only Is Better than Merging
People often suggest that I cluster the sub-patterns and perform a weighted merge. However, several problems make merging go wrong:
The similarity measure may be incorrect, introducing noise into the cluster.
The alignment between two PWMs may be incorrect, which blurs the column signals.
The weights may be incorrect, so the merged PWM is no longer optimized for the scoring function.
Solving any of the problems above requires state-of-the-art techniques.
32
Old solution
There are 4 parameters in the 2-level filtering, and there are still other parameters, such as the significance levels of other statistical tests like the binomial and hypergeometric tests. All these parameters are tuned to filter out the unwanted motifs in the top list. That's why you can always see a crazy guy tuning the program in the lab!
33
New Solution
1. Do one-level filtering (2 commonly used parameters: 5% and 0.24).
2. Use the accepted sub-patterns as the starting points for MEME. That is, we use MEME in zoops mode to generalize the pattern. (Only for the hot regions, which are covered by at least two sub-patterns.)
3. Do step 1 again, and rank the final motifs by Peak Enrichment Score.
34
Results – Comparison (old)
Datasets:
MCF7 dataset (ER), 4361 sequences
LNCAP dataset (AR), sequences
ES cell datasets (OCT4, C-myc, Sox2, Zfx, Smad, …, CTCF)
Evaluation: "PWM divergence" against Transfac motifs, as in Harbison et al. (2004) and Amadeus (2008).
Input: +/ bases from the peak (Pomoda), and +/- 200 bases from the peak for the other algorithms. Reason: 1. they can't handle such a large range; 2. their results get worse as the width increases.
Each motif finder reports its top 20 results.
35
36
The Reviewers' Points
The length-6 conserved seed assumption may not be true;
No complexity analysis is given;
The selection of co-motifs is biased toward my program;
The parameters in my program are tuned to fit the data, which is not fair to other programs;
Using a fixed input length for all other programs is not fair.
37
Merry Christmas!