De novo Motif Finding using ChIP-Seq

1 De novo Motif Finding using ChIP-Seq
Presenter: Zhizhuo Zhang Supervisor: Wing-Kin Sung

2 Outline
Introduction to ChIP-Seq data
The impact of ChIP-Seq's properties on motif finding
Our proposed algorithm (Pomoda)
Experimental results
Exploring the center distribution
Copyright 2009 @ Zhang ZhiZhuo
9/17/2018 Copyright Zhang ZhiZhuo

3 ChIP-Seq Technique
Chromatin immunoprecipitation (ChIP) with specific antibodies against these TFs was used to enrich the DNA fragments bound by these TFs, followed by direct ultra-high-throughput sequencing with the Solexa Genome Analyzer platform. Genomic regions defined by multiple overlapping DNA fragments derived from the ChIP enrichments were considered as putative binding sites.

4 Comparison with ChIP-chip

5 What does ChIP-Seq mean to us?
Sequences -> Motif finding tools -> Motif models
More data: good news for data mining, but necessary for de novo motif finding
Higher resolution: the job becomes easier; localization

6 How large is the data?
The definition of “large data” keeps changing!
10 years ago: tens of sequences (promoter sequences: MEME, AlignACE)
5 years ago: hundreds of sequences (ChIP-chip: Weeder)
2 years ago: thousands of sequences (higher-throughput ChIP-chip: Trawler, Amadeus)
Now: tens of thousands of sequences (ChIP-Seq: ?)

7 Higher Resolution Means?
Finding the main motif (that of the TF targeted by the antibody) becomes an easy job!
The main motif is strongly over-represented.
The peak range is only about 50 bp, so simply aligning all the peak regions yields a good motif.
Our focus may therefore shift from the main TF to the TFs that work with it.

8 Localization =? Over-Representation
Result of scanning with Transfac motifs:
If you use the center region as the input sequences, you will see over-representation (OR).
If you use the surrounding region as the background (bg), you will see localization.

9 Peak Oriented Motif Discovery
What information about a peak can be helpful?
Peak intensity (PI)
Peak location (PL)
Our targets: not only the main motif, but also the co-motifs sitting around the main motif.
PI: higher intensity means a higher chance of the main motif, which implies a higher chance of a co-motif.
PL: the surrounding region would be enriched.

10 POMODA
Peak Oriented Motif Discovery Algorithm
Centered on ChIP-seq peak of
With the above discussion in mind, we look into an example. The colored dots are motif matches predicted by our scan. The histogram of each of them can be studied.
Figure labels: the main motif; a co-motif; should be noise, as it does not exhibit a distance preference relative to the main motif

11 Motif Modeling
String motif: smaller search space; enables fast string-matching algorithms
PWM motif: a more precise approximation of the real motif; statistically sound
(PWM: Position Weight Matrix)
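To make the contrast between the two motif models concrete, here is a minimal Python sketch. It is not taken from the slides: the function names, the 0.97/0.01 column values used to turn a string into a PWM, and the uniform 0.25 background are all illustrative assumptions.

```python
import math

BASES = "ACGT"

def count_string_matches(seq, pattern):
    """String motif: count exact (possibly overlapping) occurrences."""
    k = len(pattern)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == pattern)

def pwm_from_pattern(pattern, hit=0.97, miss=0.01):
    """Build a toy PWM from a string (hit/miss values are assumptions)."""
    return [{b: (hit if b == c else miss) for b in BASES} for c in pattern]

def pwm_log_odds(window, pwm, background=0.25):
    """PWM motif: log-odds score of one window against a uniform background."""
    return sum(math.log(col[b] / background) for col, b in zip(pwm, window))

def best_pwm_hit(seq, pwm):
    """Return (score, start) of the best-scoring window in seq."""
    k = len(pwm)
    return max((pwm_log_odds(seq[i:i + k], pwm), i)
               for i in range(len(seq) - k + 1))

seq = "TTGGTCACCAGGTCGTT"
pwm = pwm_from_pattern("GGTCAC")
exact_hits = count_string_matches(seq, "GGTCAC")
best_score, best_start = best_pwm_hit(seq, pwm)
```

The string model only says whether a window matches, while the PWM gives a graded score, so a near match such as GGTCGT still scores above an unrelated window.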

12 Background Modeling
Organism-specific background: hard to capture the negative information in the background
Position-specific background: reveals the biological context and makes it easier to capture the negative information

13 Position-Specific Background
Given the peak position in ChIP-Seq, we identify not only the active position (center) of the master TF, but also the active region of its co-motifs.
Figure labels: active region; background region; peak in ChIP-Seq

14 Center Enrichment Score
We don't know the exact size of the active region, and it may vary between motifs. Hence, we define an odds-ratio score based on a dynamic window size.
We can use the peak intensity when calculating the occurrences.
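The slide does not give the exact formula, so the following Python sketch is only one plausible reading of a window-based enrichment score. It compares the density of motif hits inside a window around the peak center with the density outside it, tries several window sizes, and keeps the best one. The function name, the pseudocount, the use of a density ratio in place of the exact odds ratio, and the representation of hits as (offset, weight) pairs are assumptions; the weight is where a peak intensity could be plugged in.

```python
def center_enrichment(hits, seq_half_len, window_sizes, pseudo=1.0):
    """Return (best_score, best_window) over the candidate window sizes.

    hits: list of (offset_from_peak_center, weight) pairs; the weight could
          be the peak intensity of the sequence the hit came from.
    seq_half_len: each sequence spans [-seq_half_len, +seq_half_len].
    """
    total_len = 2 * seq_half_len + 1
    best_score, best_window = float("-inf"), None
    for w in window_sizes:
        in_len = 2 * w + 1
        out_len = total_len - in_len
        if out_len <= 0:
            continue  # window covers the whole sequence; nothing to compare
        inside = sum(wt for off, wt in hits if abs(off) <= w)
        outside = sum(wt for off, wt in hits if abs(off) > w)
        # Ratio of (weighted) hit densities, smoothed with a pseudocount.
        score = ((inside + pseudo) / in_len) / ((outside + pseudo) / out_len)
        if score > best_score:
            best_score, best_window = score, w
    return best_score, best_window

# Toy example: four hits near the center, one far away.
toy_hits = [(0, 1.0), (5, 1.0), (-3, 1.0), (10, 1.0), (150, 1.0)]
score, window = center_enrichment(toy_hits, seq_half_len=200,
                                  window_sizes=[10, 50, 100])
```

Scanning over window sizes is what makes the window "dynamic": a tightly localized motif is rewarded at a small window, a loosely localized co-motif at a larger one.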

15 Algorithm Overview
Seed finding
PWM extending & refinement
Redundant motif filtering

16 Seed Finding
Enumerate all length-6 patterns:
GGTCAC CGGTCA GGGTCA AGGTCA … ATGACC CAGGTC AGGTCG CGTGAC CTGACC
[Table: seed PWM; rows A, C, G, T; columns Po 1-6; entries 0.97 and 0.01; pattern AACTTG]
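The enumeration step can be sketched in a few lines of Python. This is an illustration rather than the Pomoda implementation: the function names are made up, and the assumption that each seed becomes a PWM with 0.97 on the seed base and 0.01 on the other three is inferred from the two values shown in the slide's table.

```python
from itertools import product

BASES = "ACGT"

def enumerate_seeds(k=6):
    """All 4**k DNA strings of length k, in lexicographic order."""
    return ["".join(p) for p in product(BASES, repeat=k)]

def seed_to_pwm(seed, hit=0.97, miss=0.01):
    """One column per position; hit/miss values are assumed from the slide."""
    return [{b: (hit if b == c else miss) for b in BASES} for c in seed]

def count_seed_occurrences(sequences, k=6):
    """Count every length-k window made only of A, C, G, T."""
    counts = {}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set(BASES):
                counts[kmer] = counts.get(kmer, 0) + 1
    return counts

seeds = enumerate_seeds()
pwm = seed_to_pwm("GGTCAC")
counts = count_seed_occurrences(["AGGTCACGGTCAC"])
```

With k = 6 there are only 4096 candidates, which is why exhaustive enumeration is cheap even for tens of thousands of sequences.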

17 PWM Extending & Refinement
Encapsulate the core PWM in a wider PWM. For example, we implant the length-6 PWM into a length-26 PWM as follows:
[Table: wide PWM; rows A, C, G, T; columns Po 1-26; flanking columns hold 0.25; the Core PWM columns hold 0.97 and 0.01]
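A minimal Python sketch of the implanting step follows. The function names are hypothetical, and placing the core exactly in the middle of the wide PWM (10 uniform columns on each side for a 6-wide core in a 26-wide PWM) is an assumption based on the column header in the slide's table.

```python
BASES = "ACGT"

def seed_to_pwm(seed, hit=0.97, miss=0.01):
    """Toy core PWM; hit/miss values follow the numbers shown on the slides."""
    return [{b: (hit if b == c else miss) for b in BASES} for c in seed]

def implant_core(core, width=26, flank=0.25):
    """Embed the core PWM in a wider PWM whose other columns are uniform."""
    if len(core) > width:
        raise ValueError("core PWM is wider than the target PWM")
    left = (width - len(core)) // 2  # assumed: core sits in the middle
    wide = [{b: flank for b in BASES} for _ in range(width)]
    for i, col in enumerate(core):
        wide[left + i] = dict(col)
    return wide

core = seed_to_pwm("GGTCAC")
wide = implant_core(core)
```

The uniform 0.25 columns carry no information, so the wide PWM initially scores sites exactly like the core; refinement can later replace individual flanking columns.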

18 PWM Extending & Refinement
Background instances:
A…A…GGTCA…C…C
T…G…GGTCA…A…G
G…A…GGTCA…T…T
T…G…GGTCA…G…G
……
C…T…GGTCA…T…A
Center instances:
A…A…GGTCA…C…C
T…G…GGTCA…C…G
……
C…T…GGTCA…C…A
Select the best column to update based on the center PWM and the background (Bg) PWM.
Extended motif: GGTCANNNNC
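The slide does not say how the "best column" is chosen, so the Python sketch below uses one common choice as a stand-in: for every column outside the core, build base frequencies from the center instances and from the background instances, and pick the column where the two differ most by Kullback-Leibler divergence. The function names, the pseudocount, and the toy instances are assumptions.

```python
import math

BASES = "ACGT"

def column_freqs(instances, col, pseudo=0.5):
    """Base frequencies at one column of aligned, equal-length instances."""
    counts = {b: pseudo for b in BASES}
    for inst in instances:
        counts[inst[col]] += 1
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

def kl_divergence(p, q):
    """KL(p || q) for two base-frequency dictionaries."""
    return sum(p[b] * math.log(p[b] / q[b]) for b in BASES)

def best_column_to_update(center, background, fixed_cols):
    """Column (outside fixed_cols) where center differs most from background."""
    best_col, best_score = None, float("-inf")
    for col in range(len(center[0])):
        if col in fixed_cols:
            continue
        score = kl_divergence(column_freqs(center, col),
                              column_freqs(background, col))
        if score > best_score:
            best_col, best_score = col, score
    return best_col, best_score

# Toy data: core GGTCA at columns 1-5; the last column is always C near the center.
center = ["AGGTCATC", "TGGTCAGC", "CGGTCAAC", "GGGTCACC"]
background = ["AGGTCATA", "TGGTCAGG", "CGGTCAAT", "GGGTCACC"]
col, score = best_column_to_update(center, background, fixed_cols={1, 2, 3, 4, 5})
```

In the toy data only the last column separates center from background, which mirrors how the slide's example grows GGTCA into a gapped motif ending in C.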

19 Redundant Motif Filtering
Positions overlap by more than 5%
PWM divergence is less than 0.18
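As a rough illustration of how these two thresholds could drive a filter, here is a greedy Python sketch. Several points are assumptions, not statements from the slides: that either criterion alone marks a motif as redundant, that overlap is measured relative to the smaller site set, and that divergence is the mean per-column total variation distance (a simple stand-in for the PWM divergence of Harbison et al., which the slide does not define). All names are hypothetical.

```python
BASES = "ACGT"

def overlap_fraction(sites_a, sites_b):
    """Shared sites relative to the smaller set (assumed definition)."""
    if not sites_a or not sites_b:
        return 0.0
    return len(sites_a & sites_b) / min(len(sites_a), len(sites_b))

def pwm_divergence(pwm_a, pwm_b):
    """Stand-in metric: mean total variation distance over aligned columns."""
    if len(pwm_a) != len(pwm_b):
        raise ValueError("this sketch only compares PWMs of equal width")
    dists = [0.5 * sum(abs(ca[b] - cb[b]) for b in BASES)
             for ca, cb in zip(pwm_a, pwm_b)]
    return sum(dists) / len(dists)

def filter_redundant(motifs, max_overlap=0.05, min_divergence=0.18):
    """Keep motifs in descending score order, dropping redundant ones."""
    kept = []
    for m in sorted(motifs, key=lambda m: m["score"], reverse=True):
        redundant = any(
            overlap_fraction(m["sites"], k["sites"]) > max_overlap
            or pwm_divergence(m["pwm"], k["pwm"]) < min_divergence
            for k in kept
        )
        if not redundant:
            kept.append(m)
    return kept

col_a = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
col_c = {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01}
motifs = [
    {"name": "m1", "score": 10, "sites": {(0, 1), (1, 5), (2, 7)}, "pwm": [col_a]},
    {"name": "m2", "score": 8, "sites": {(0, 1), (1, 5), (2, 7)}, "pwm": [col_a]},
    {"name": "m3", "score": 5, "sites": {(3, 2), (4, 9)}, "pwm": [col_c]},
]
kept_names = [m["name"] for m in filter_redundant(motifs)]
```

Sites are modelled as (sequence index, position) pairs, so two motifs that hit the same places, or whose matrices look alike, collapse into the higher-scoring one.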

20 Results – Comparison
Datasets:
MCF7 dataset (ER), 4361 sequences
LNCAP dataset (AR), sequences
Evaluation: “PWM divergence” from the Transfac motif, as in Harbison et al. (2004) and Amadeus (2008)
Input: +/ bases from the peak (Pomoda), and +/- 200 bases from the peak for the other algorithms
Reasons: 1. they can't handle such a large range; 2. their results get worse as the width increases.
Each motif finder reports its top 20 results.

21
Table columns: Cell, TF, Pomoda, Amadeus, Trawler, Weeder
MCF7: ER, HNF3, GATA, AP1, SP1, BACH1, E2F, OCT1, AP4
LNCAP: AR, NF1, OCT, ETS
Legend: <0.12, <0.18, <0.24

22 Comparison
Table columns: Pomoda, Amadeus, Trawler, Weeder
Background model: Position specific | Organism specific
Motif model: PWM (k-mer exact match) | PWM (k-mer with mismatches) | PWM (IUPAC string in initial scan) | k-mer with mismatches
Algorithm: Exhaustive search + PWM column updating | Add mismatches, merge (recursively), EM | Exhaustive search + clustering
Motif length: Various length | Fixed length | Semi-various length
Gap detection: Supported | Not supported
Localization: Center window size | Over-represented bins | Not supported
Sequence weighting
Average running time: 30 min | 93 min | >4 hours | >4 hours

23 Center Distribution
Mixture Model:

24 Thank You!

