De novo Motif Finding using ChIP-Seq

Presentation transcript:

My Paper is rejected! De novo Motif Finding using ChIP-Seq
Presenter: Zhizhuo Zhang
Supervisor: Wing-Kin Sung

Outline
The impact of ChIP-Seq's properties on motif finding
Our proposed algorithm (Pomoda)
Problem 1
Solution
Problem 2
Comments from the review
9/17/2018 Copyright 2009 @ Zhang ZhiZhuo

De novo Motif Finding
Input: a set of regulatory sequences that possibly bind the same transcription factor, obtained with experimental techniques such as ChIP, from orthologous genes across various species, or from co-regulated genes identified by microarray analysis.
Aim: use a computational algorithm to search the given input sequences for a set of recurring motif models (de novo).

Motif Modeling
Consensus-based motif: the search can guarantee global optimality, and the model is appropriate for short motifs. It is very fast when implemented with optimized data structures: suffix tree, mismatch tree, lookup table…
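To make the consensus model concrete, here is a minimal sketch that counts the windows matching a consensus string within a given number of mismatches. The function names and the `max_mismatches` parameter are illustrative assumptions, and the plain scan stands in for the suffix-tree or lookup-table implementations the slide mentions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_consensus_hits(consensus, sequences, max_mismatches=1):
    """Count windows that match the consensus with at most max_mismatches."""
    k = len(consensus)
    hits = 0
    for seq in sequences:
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            if hamming(consensus, seq[i:i + k]) <= max_mismatches:
                hits += 1
    return hits


if __name__ == "__main__":
    print(count_consensus_hits("GGTCA", ["AGGTCAT", "GGTTA"], max_mismatches=1))
```

A real consensus-based finder would enumerate every candidate pattern and index the input once, rather than rescanning the sequences for each pattern as this sketch does.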

Motif Modeling
PWM-based motif: assumes that each position is independent of the others. It is a more precise approximation of the real motif than a consensus-based motif. (PWM: Position Weight Matrix)
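The independence assumption means a candidate site's score is a sum of per-column terms. The sketch below scores a site as a log-odds sum against a uniform background; the list-of-dicts PWM representation, the function name and the uniform background are assumptions made for illustration, not the representation used by the authors.

```python
import math

BASES = "ACGT"


def pwm_log_odds(pwm, site, background=None):
    """Log2-odds score of a site under a PWM with independent columns.

    pwm is a list of dicts mapping each base to a strictly positive
    probability; background maps each base to its background probability.
    """
    if len(site) != len(pwm):
        raise ValueError("site and PWM must have the same length")
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform background
    # Independence lets the per-column log-odds simply add up.
    return sum(math.log2(col[base] / background[base])
               for col, base in zip(pwm, site))


if __name__ == "__main__":
    column_g = {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01}
    print(pwm_log_odds([column_g, column_g], "GG"))
```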

Background Modeling
Organism-specific background: hard to capture the negative information in the background.
Position-specific background: reveals the biological context, and makes it easier to capture the negative information.

ChIP-Seq Technique
Chromatin immunoprecipitation (ChIP) with specific antibodies against these TFs was used to enrich the DNA fragments bound by these TFs, followed by direct ultra-high-throughput sequencing with the Solexa Genome Analyzer platform. Genomic regions defined by multiple overlapping DNA fragments derived from the ChIP enrichments were considered as putative binding sites.

Comparison with ChIP-chip

What does ChIP-Seq mean to us?
Sequences → Motif finding tools → Motif models
More data: good news for data mining, but necessary for de novo motif finding
Higher resolution: the job becomes easier (localization)

How large is the data?
The definition of "large data" keeps changing!
10 years ago: tens of sequences (promoter sequences: MEME, AlignACE)
5 years ago: hundreds of sequences (ChIP-chip: Weeder)
2 years ago: thousands of sequences (higher-throughput ChIP-chip: Trawler, Amadeus)
Now: tens of thousands of sequences (ChIP-Seq: ?)

What does higher resolution mean?
It means that finding the main motif (the TF targeted by the antibody) becomes an easy job!
The main motif will be highly over-represented.
The peak range is only about 50 bp; by simply aligning all the peak regions, we can get a good motif.
It also means that our focus may shift from the main TF to the TFs that work with it.

Peak Oriented Motif Discovery
What information about a peak can be helpful?
Peak intensity (PI)
Peak location (PL)
Our targets: not only the main motif, but also the co-motifs sitting around the main motif.
PI: higher intensity means a higher chance of the main motif, which implies a higher chance of a co-motif.
PL: the surrounding region would be enriched.

POMODA: Peak Oriented Motif Discovery Algorithm
With the above discussion in mind, we look into an example. The colored dots are motif matches predicted by our scan. The histogram of each of them can be studied.
[Figure labels: "Centered on ChIP-seq peak of"; "The main motif"; "A co-motif"; "Should be noise as it does not exhibit distance preference to the main motif"]

How to Score a PWM
Instead of comparing against a sophisticated background model, we just look at the distribution from the center to the flanking region!
Traditionally, people first extract DNA sequences from every peak within a fixed window d (say, d = 400 bp); then, using some motif finder, they mine motifs that are over-represented in those sequences under some organism-specific background model. There are two issues with this method.
First, we need to specify the window size d. If d is too big, the motif signal is diluted. If d is too small, we may fail to capture motifs that occur in a window bigger than d.
Second, we need to specify the background model. However, selecting a background model is still an art; there is no guideline for selecting the correct background.

Peak Enrichment Score
Given a PWM motif θ and a set of input sequences of length L centered at the peak location, the peak enrichment score is defined as:
[Figure labels: H, h, Noise Level]
Signal-to-noise ratio = (H - h)/h
A simple score = H/h
Since we don't know the exact size of the active region, and it may vary between motifs, we define an odds-ratio score based on a dynamic window size.
We can utilize the peak intensity in calculating the occurrences.
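The exact formula did not survive in this transcript, so the following is only a sketch of one way the H/h idea with a dynamic window could be computed. It assumes that H is the density of motif occurrences inside a window around the peak center, that h is the density in the remaining flanks, and that the window is chosen from a small candidate list; the function name, the candidate windows and the optional intensity weights are all hypothetical.

```python
def peak_enrichment_score(offsets, seq_len, windows=(25, 50, 100, 200),
                          weights=None):
    """Return (best H/h ratio, half-window that achieves it).

    offsets are motif occurrence positions relative to the peak center,
    pooled over all input sequences; weights can carry peak intensities.
    """
    if weights is None:
        weights = [1.0] * len(offsets)
    best_score, best_window = 0.0, None
    for w in windows:
        if 2 * w >= seq_len:
            continue  # no flank would be left to estimate the noise level
        inside = sum(wt for off, wt in zip(offsets, weights) if abs(off) <= w)
        outside = sum(wt for off, wt in zip(offsets, weights) if abs(off) > w)
        if outside == 0:
            continue  # noise level cannot be estimated for this window
        big_h = inside / (2 * w)                # assumed H: density near the center
        small_h = outside / (seq_len - 2 * w)   # assumed h: density in the flanks
        score = big_h / small_h
        if score > best_score:
            best_score, best_window = score, w
    return best_score, best_window


if __name__ == "__main__":
    hits = [0, 5, -10, 20, -300, 400]
    print(peak_enrichment_score(hits, seq_len=1000, windows=(25, 50)))
```

Under these assumptions the signal-to-noise ratio (H - h)/h is simply the returned score minus one.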

Problem 1
Although the scoring function is quite simple, an efficient optimization method is badly needed, because the datasets are huge and a PWM has many parameters.

Solution
Hierarchically prune the search space! (Run cheap tests on the large number of candidates, and do the more complicated work only on the few that remain.)
Column-based PWM updating, by comparing center components and flanking components!

Algorithm Overview
Seed finding
PWM extending & refinement
Redundant motif filtering

Seed Finding
Enumerate all length-6 patterns:
GGTCAC CGGTCA GGGTCA AGGTCA … ATGACC CAGGTC AGGTCG CGTGAC CTGACC
[Table: seed PWM; columns Po 1 to 6; rows A, C, G, T; surviving entries 0.97 and 0.01; label AACTTG]
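A minimal sketch of this step: enumerate every 6-mer and turn each one into a sharply peaked PWM. The values 0.97 and 0.01 are the ones that survive in the slide's table; applying them to every column of the seed is an assumption, as are the function names.

```python
from itertools import product

BASES = "ACGT"


def enumerate_seeds(k=6):
    """All 4**k DNA patterns of length k."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def seed_to_pwm(seed, hit=0.97, miss=0.01):
    """One column per seed base: `hit` for that base, `miss` for the other three."""
    return [{b: (hit if b == base else miss) for b in BASES} for base in seed]


if __name__ == "__main__":
    seeds = enumerate_seeds(6)
    print(len(seeds))                  # number of candidate seeds
    print(seed_to_pwm("AACTTG")[0])    # first column of the example seed
```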

PWM Extending & Refinement
Encapsulate the core PWM in a wider PWM. For example, we implant the length-6 PWM into a length-26 PWM, as follows:
[Table: wide PWM; columns Po 1, 2, …, 9 to 16, …, 25, 26; rows A, C, G, T; surviving entries in row A: 0.25, 0.97, 0.01; the core PWM is marked]
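The implanting step can be sketched as follows. The flanking columns are set to 0.25 per base, matching the value that survives in the slide's table; placing the core in the middle of the wide PWM (columns 11 to 16 of 26) is an assumption, as is the function name.

```python
BASES = "ACGT"


def embed_core(core, width=26):
    """Place a core PWM in the middle of a wider PWM with uniform flanks."""
    k = len(core)
    if k > width:
        raise ValueError("core PWM is wider than the target width")
    left = (width - k) // 2  # assumed: the core sits in the center
    # Flank columns carry no preference yet ("trivial" columns).
    wide = [{b: 0.25 for b in BASES} for _ in range(width)]
    for i, col in enumerate(core):
        wide[left + i] = dict(col)
    return wide


if __name__ == "__main__":
    core = [{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01} for _ in range(6)]
    wide = embed_core(core)
    print(len(wide), wide[9], wide[10])
```

The uniform columns are the ones a later extension step can update, while the core columns are candidates for refinement.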

PWM Extending & Refinement
Flank instances:
A…A…GGTCA…C…C
T…G…GGTCA…A…G
G…A…GGTCA…T…T
T…G…GGTCA…G…G
……
C…T…GGTCA…T…A
Center instances:
A…A…GGTCA…C…C
T…G…GGTCA…C…G
……
C…T…GGTCA…C…A
Select the best column to update based on the center PWM and the flank PWM.
The frequency of C in the highlighted column is 100% in the center instances and 20% in the flank instances, so if we take C at that column we gain a 5 times higher score: GGTCANNNNC
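The column choice in this example can be sketched as a search for the (column, base) pair whose frequency among center instances most exceeds its frequency among flank instances. The toy instances below are condensed versions of the ones on the slide (the gaps are dropped); the function names and the small floor that avoids division by zero are assumptions.

```python
from collections import Counter

BASES = "ACGT"


def column_freqs(instances, col):
    """Relative frequency of each base at one column of aligned instances."""
    counts = Counter(inst[col] for inst in instances)
    return {b: counts[b] / len(instances) for b in BASES}


def best_column_update(center, flank, floor=0.01):
    """Return (ratio, column, base) with the largest center/flank frequency ratio."""
    best = (0.0, None, None)
    for col in range(len(center[0])):
        f_center = column_freqs(center, col)
        f_flank = column_freqs(flank, col)
        for b in BASES:
            # The floor keeps the ratio finite when a base is absent in the flank.
            ratio = f_center[b] / max(f_flank[b], floor)
            if ratio > best[0]:
                best = (ratio, col, b)
    return best


if __name__ == "__main__":
    flank = ["AAGGTCACC", "TGGGTCAAG", "GAGGTCATT", "TGGGTCAGG", "CTGGTCATA"]
    center = ["AAGGTCACC", "TGGGTCACG", "CTGGTCACA"]
    print(best_column_update(center, flank))
```

On these toy instances the winning update is C at the column right after GGTCA, with center frequency 100% against flank frequency 20%, which is the 5-fold gain quoted on the slide.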

More Details
Two cases:
1. Update a trivial column (extension)
2. Update a non-trivial column (refinement)

Update a trivial column (extension)

Update a non-trivial column (refinement)

When the data is not large
The method above fails when the number of occurrences is small, which means the noise level is not high enough!
Fix: when a flanking PWM element is larger than the maximum A/C/G/T frequency of the flanking region, that element is set to that frequency.
A discriminative process then becomes an associative process!

Problem 2
There are many extended sub-patterns after phase 2; we need to generalize them and filter out the redundant ones.
How do we measure the similarity of two sub-patterns?
How do we cluster similar ones and merge them into a generalized pattern?

Problem 2
GGTCACHSTGAC
CMRGGTCAS
AGGTCASSCTGMCC
CAGGTCASNNTGMCC
CWGGGTCASNNTSNS
GGTCARGGTCA
GTCANCGT
RRGNYRNCCTGACC
AGGTCAAS
GGTSACCCWG
All these sub-patterns come from the same generalized pattern!

Redundant Motifs Filtering (old solution)
Two-level filtering:
Level 1: comparing to the accepted motifs
Positions overlap more than 2%
PWM divergence less than 0.18
Level 2: comparing to the filtered motifs
Positions overlap more than 20%
PWM divergence less than 0.14
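A rough sketch of how such a two-level filter could be organized is given below. The thresholds are the ones on the slide; everything else is an assumption made for illustration: candidates are taken to be ranked best first, a candidate counts as redundant only when both the overlap and the divergence conditions hold, and `pwm_divergence` is a stand-in (mean normalized Euclidean distance between equal-width PWMs, without alignment) rather than the divergence measure used in the talk.

```python
import math

BASES = "ACGT"


def site_overlap(sites_a, sites_b):
    """Fraction of motif A's sites that are shared with motif B."""
    return len(sites_a & sites_b) / len(sites_a) if sites_a else 0.0


def pwm_divergence(p, q):
    """Stand-in divergence: mean normalized Euclidean distance per column."""
    dists = [math.sqrt(sum((cp[b] - cq[b]) ** 2 for b in BASES)) / math.sqrt(2)
             for cp, cq in zip(p, q)]
    return sum(dists) / len(dists)


def is_similar(cand, pool, min_overlap, max_divergence):
    """True if cand matches any motif in pool under both thresholds (assumed AND)."""
    return any(site_overlap(cand["sites"], m["sites"]) > min_overlap
               and pwm_divergence(cand["pwm"], m["pwm"]) < max_divergence
               for m in pool)


def two_level_filter(candidates, level1=(0.02, 0.18), level2=(0.20, 0.14)):
    """Split ranked candidates into accepted and filtered motifs."""
    accepted, filtered = [], []
    for cand in candidates:
        if (is_similar(cand, accepted, *level1)
                or is_similar(cand, filtered, *level2)):
            filtered.append(cand)
        else:
            accepted.append(cand)
    return accepted, filtered


if __name__ == "__main__":
    col_a = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
    col_t = {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97}
    motifs = [
        {"name": "m1", "pwm": [col_a], "sites": {("s1", 10), ("s2", 40)}},
        {"name": "m2", "pwm": [col_a], "sites": {("s1", 10), ("s2", 40)}},
        {"name": "m3", "pwm": [col_t], "sites": {("s3", 5)}},
    ]
    kept, dropped = two_level_filter(motifs)
    print([m["name"] for m in kept], [m["name"] for m in dropped])
```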

Filtering Only Is Better than Merging
People often suggest that I cluster the sub-patterns and perform a weighted merge. However, several problems make merging go wrong:
The similarity measure may be incorrect, introducing noise into the cluster.
The alignment between two PWMs may be incorrect, which makes the column signals diverge.
The weights may be incorrect, so the merged PWM is no longer optimized for the scoring function.
Solving any of these problems requires state-of-the-art techniques.

Old Solution
There are 4 parameters in the 2-level filtering, and there are further parameters such as the significance levels of other statistical tests (binomial and hypergeometric).
All these parameters are tuned to filter the unwanted motifs out of the top list.
That's why you can always see a crazy guy tuning his program in the lab!

New Solution
1. Do one-level filtering (2 commonly used parameters: 5% and 0.24).
2. Use the accepted sub-patterns as the starting points for MEME. That is, we use MEME in zoops mode to generalize the patterns (only for the hot regions, which are covered by at least two sub-patterns).
3. Do step 1 again, and rank the final motifs by Peak Enrichment Score.

Results – Comparison (old)
Datasets:
MCF7 dataset (ER), 4361 sequences
LNCAP dataset (AR), 10000 sequences
ES Cell datasets (OCT4, C-myc, Sox2, Zfx, Smad, …, CTCF)
Evaluate "PWM divergence" against Transfac motifs, as in Harbison et al. (2004) and Amadeus (2008).
Input: +/- 5000 bases from the peak for Pomoda, and +/- 200 bases from the peak for the other algorithms. Reason: 1. they can't handle such a large range; 2. their results get worse as the width increases.
Each motif finder reports its top 20 results.

The Reviewers' Points
The assumption of a conserved length-6 seed may not hold.
No complexity analysis is given.
The selection of co-motifs is biased toward my program.
The parameters of my program are tuned to fit the data, which is not fair to the other programs.
Using a fixed input length for all the other programs is not fair.

Merry Christmas!