De novo Motif Finding using ChIP-Seq


De novo Motif Finding using ChIP-Seq
Presenter: Zhizhuo Zhang
Supervisor: Wing-Kin Sung

Outline
- Introduction to ChIP-Seq data
- The impact of ChIP-Seq's properties on motif finding
- Our proposed algorithm (Pomoda)
- Experimental results
- Exploring the center distribution
9/17/2018 Copyright 2009 @ Zhang ZhiZhuo

ChIP-Seq Technique
Chromatin immunoprecipitation (ChIP) with specific antibodies against these TFs was used to enrich the DNA fragments bound by these TFs, followed by direct ultra-high-throughput sequencing with the Solexa Genome Analyzer platform. Genomic regions defined by multiple overlapping DNA fragments derived from the ChIP enrichments were considered as putative binding sites.

Comparison with ChIP-chip

What does ChIP-Seq mean to us?
Sequences -> Motif finding tools -> Motif models
- More data: good news for data mining, but necessary for de novo motif finding
- Higher resolution: the job becomes easier; localization

How large is the data?
The definition of "large data" keeps changing!
- 10 years ago: tens of sequences (promoter sequences: MEME, AlignACE)
- 5 years ago: hundreds of sequences (ChIP-chip: Weeder)
- 2 years ago: thousands of sequences (higher-throughput ChIP-chip: Trawler, Amadeus)
- Now: tens of thousands of sequences (ChIP-Seq: ?)

Higher Resolution Means?
- Finding the main motif (that of the TF targeted by the antibody) becomes an easy job!
- The main motif is heavily over-represented.
- The peak range is only about 50 bp; simply aligning all the peak regions gives a good motif.
- Our focus may therefore shift from the main TF to the TFs that work with it.

Localization =? Over-Representation
Result of scanning with Transfac motifs:
- If you use the center region as the input sequence, you will see over-representation (OR).
- If you use the surrounding region as background, you will see localization.

Peak Oriented Motif Discovery
- What information about a peak can be helpful?
  - Peak intensity (PI)
  - Peak location (PL)
- Our targets: not only the main motif, but also the co-motifs sitting around the main motif.
- PI: higher intensity means a higher chance of the main motif, which implies a higher chance of a co-motif.
- PL: the surrounding region would be enriched.

POMODA: Peak Oriented Motif Discovery Algorithm
With the above discussion in mind, we look into an example. The colored dots are motif matches predicted by our scan. The histogram of each of them can be studied.
Figure labels:
- Centered on ChIP-seq peak of
- The main motif
- A co-motif
- Should be noise, as it does not exhibit distance preference to the main motif

Motif Modeling
- String motif: smaller search space; enables fast string-matching algorithms.
- PWM motif: a more precise approximation of the real motif; statistically sound. (PWM: Position Weight Matrix)
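To make the contrast between the two motif models concrete, here is a minimal Python sketch. It is an illustration only, not Pomoda's implementation: the function names, the uniform 0.25 background, and the example PWM built from the string GGTCA are all assumptions.

```python
import math

BASES = "ACGT"

def count_string_matches(seq, pattern):
    """Count exact occurrences of a string motif in seq (overlaps allowed)."""
    k = len(pattern)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == pattern)

def pwm_log_odds(site, pwm, background=0.25):
    """Log2-odds score of one site under a PWM versus a uniform background."""
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, site))

def best_pwm_hit(seq, pwm):
    """Return (score, start) of the highest-scoring window in seq."""
    k = len(pwm)
    return max((pwm_log_odds(seq[i:i + k], pwm), i)
               for i in range(len(seq) - k + 1))

# Hypothetical PWM: 0.97 on the consensus base of GGTCA, 0.01 elsewhere.
example_pwm = [{b: (0.97 if b == c else 0.01) for b in BASES} for c in "GGTCA"]
```

The string model answers only "match or no match", which is why it is fast to search; the PWM gives every window a graded score, so a site with one mismatch still scores above an unrelated site.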

Background Modeling
- Organism-specific background: hard to capture the negative information in the background.
- Position-specific background: reveals the biological context, and makes it easier to capture the negative information.

Position-Specific Background
Given the peak position in ChIP-Seq, we identify not only the active position (center) of the master TF, but also the active region of its co-motifs.
Figure labels: Active region; Background region; Peak in ChIP-Seq

Center Enrichment Score
- We don't know the exact size of the active region, and it may vary between motifs. Hence, we define an odds-ratio score based on a dynamic window size.
- We can use the peak intensity when calculating the occurrences.
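The slide does not give the exact formula, so the following Python sketch is only one plausible reading of a dynamic-window enrichment score. It compares the weighted hit density inside a center window with the density in the surrounding region, and keeps the best window. The function name, the pseudocount, the candidate window list, and the density-ratio form of the score are all assumptions.

```python
def center_enrichment(positions, weights, half_range, windows, pseudo=1.0):
    """Score how strongly motif hits concentrate around the peak center.

    positions:  hit offsets relative to the peak center (bp)
    weights:    one weight per hit (e.g. peak intensity; 1.0 for plain counts)
    half_range: sequences span [-half_range, +half_range] around the peak
    windows:    candidate half-window sizes, each smaller than half_range
    Returns (best_score, best_half_window).
    """
    best_score, best_window = float("-inf"), None
    for w in windows:
        inside = sum(wt for p, wt in zip(positions, weights) if abs(p) <= w)
        outside = sum(wt for p, wt in zip(positions, weights)
                      if w < abs(p) <= half_range)
        # Weighted hits per base pair in each region; the pseudocount
        # (an assumption) avoids division by zero for empty regions.
        density_in = (inside + pseudo) / (2 * w + 1)
        density_out = (outside + pseudo) / (2 * (half_range - w))
        score = density_in / density_out
        if score > best_score:
            best_score, best_window = score, w
    return best_score, best_window
```

For example, with hits at offsets -5, 3, 0, 10, -8, 2000 and -3500, unit weights, `half_range=5000` and `windows=[50, 200, 1000]`, the smallest window wins because five of the seven hits fall within 50 bp of the center.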

Algorithm Overview
- Seed finding
- PWM extending and refinement
- Redundant motif filtering

Seed Finding
- Enumerate all length-6 patterns, e.g. GGTCAC, CGGTCA, GGGTCA, AGGTCA, …, ATGACC, CAGGTC, AGGTCG, CGTGAC, CTGACC
- PWM table: positions (Po) 1 to 6 as columns, bases A, C, G, T as rows, with entries such as 0.97 and 0.01
- Example pattern: AACTTG
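The seed step can be sketched in Python as follows. This is a hedged illustration rather than the author's code: the function names are hypothetical, and the conversion rule (0.97 on the seed base, 0.01 on each other base) is an assumption inferred from the two values visible in the slide's table. How Pomoda ranks the enumerated seeds is not stated here, so the sketch only counts occurrences.

```python
from collections import Counter
from itertools import product

def all_kmers(k=6, alphabet="ACGT"):
    """Yield every string of length k over the alphabet (4**k patterns)."""
    return ("".join(p) for p in product(alphabet, repeat=k))

def count_kmers(sequences, k=6):
    """Count occurrences of each k-mer across all input sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def seed_to_pwm(seed, hit=0.97, miss=0.01):
    """Turn a string seed into a PWM: `hit` on the seed base, `miss` elsewhere."""
    return [{b: (hit if b == c else miss) for b in "ACGT"} for c in seed]
```

With k = 6 there are only 4096 patterns, so exhaustive enumeration stays cheap even for tens of thousands of sequences.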

PWM Extending & Refinement
- Encapsulate the core PWM in a wider PWM.
- For example, we embed the length-6 PWM in a length-26 PWM, as follows:
  Po: 1, 2, …, 9, 10 | 11, 12, 13, 14, 15, 16 (core PWM) | …, 25, 26
  Rows A, C, G, T, with entries such as 0.25, 0.97 and 0.01
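A minimal Python sketch of this embedding step, under stated assumptions: the flanking columns are taken to be uniform (0.25 per base), the core is centered so that it occupies positions 11 to 16 of 26, and the function names are hypothetical.

```python
def uniform_column():
    """A non-informative PWM column: every base equally likely."""
    return {b: 0.25 for b in "ACGT"}

def extend_pwm(core, total_len=26):
    """Embed a core PWM in the middle of a wider PWM with uniform flanks."""
    flank = total_len - len(core)
    if flank < 0:
        raise ValueError("total_len must be at least the core length")
    left = flank // 2
    right = flank - left
    return ([uniform_column() for _ in range(left)]
            + [dict(col) for col in core]          # copy, do not alias
            + [uniform_column() for _ in range(right)])
```

Starting from uniform flanks means the wide PWM initially scores exactly like the core; the flanking columns only begin to matter once refinement replaces them with informative frequencies.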

PWM Extending & Refinement
Select the best column to update based on the center PWM and the background PWM.
Background instances:
  A…A…GGTCA…C…C
  T…G…GGTCA…A…G
  G…A…GGTCA…T…T
  T…G…GGTCA…G…G
  ……
  C…T…GGTCA…T…A
Center instances:
  A…A…GGTCA…C…C
  T…G…GGTCA…C…G
  ……
  C…T…GGTCA…C…A
Result: GGTCANNNNC
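The slide does not say how "best" is measured, so this Python sketch substitutes a simple criterion: pick the not-yet-fixed column whose base frequencies among center instances differ most from those among background instances, measured by Kullback-Leibler divergence. The KL criterion, the pseudocount, and all function names are assumptions.

```python
import math
from collections import Counter

def column_freqs(instances, col, pseudo=0.5):
    """Base frequencies at one column of aligned instances, with pseudocounts."""
    counts = Counter(s[col] for s in instances)
    total = len(instances) + 4 * pseudo
    return {b: (counts[b] + pseudo) / total for b in "ACGT"}

def kl_divergence(p, q):
    """KL divergence (in bits) between two base distributions."""
    return sum(p[b] * math.log2(p[b] / q[b]) for b in "ACGT")

def best_column(center, background, fixed):
    """Index of the free column that best separates center from background."""
    free = [c for c in range(len(center[0])) if c not in fixed]
    return max(free, key=lambda c: kl_divergence(column_freqs(center, c),
                                                 column_freqs(background, c)))

def update_best_column(pwm, center, background, fixed):
    """Replace the most discriminative free column with center frequencies."""
    col = best_column(center, background, fixed)
    new_pwm = [dict(c) for c in pwm]
    new_pwm[col] = column_freqs(center, col)
    return col, new_pwm
```

In the slide's example, the column that is C in every center instance but varies in the background instances would be selected, which is how GGTCA grows into GGTCANNNNC.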

Redundant Motifs Filtering
- Positions overlap by more than 5%
- PWM divergence less than 0.18
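A hedged Python sketch of how these two thresholds could be applied. Several details are assumptions, not taken from the slides: the divergence is computed as the average per-column Euclidean distance scaled to [0, 1] (the exact formula in the cited papers may differ), motifs are compared without shifting or reverse-complementing, overlap is measured against the smaller hit set, and a motif is treated as redundant if either criterion fires.

```python
import math

def pwm_divergence(a, b):
    """Average per-column Euclidean distance between two equal-length PWMs,
    divided by sqrt(2) so that each column contributes at most 1."""
    if len(a) != len(b):
        raise ValueError("PWMs must have the same length")
    total = sum(math.sqrt(sum((ca[x] - cb[x]) ** 2 for x in "ACGT")) / math.sqrt(2)
                for ca, cb in zip(a, b))
    return total / len(a)

def position_overlap(hits_a, hits_b):
    """Fraction of the smaller hit set that is shared with the other set."""
    a, b = set(hits_a), set(hits_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def filter_redundant(motifs, max_overlap=0.05, min_divergence=0.18):
    """Keep motifs in order (assumed best first), dropping redundant ones.

    Each motif is a dict with hypothetical keys "pwm" and "hits".
    """
    kept = []
    for m in motifs:
        redundant = any(
            position_overlap(m["hits"], k["hits"]) > max_overlap
            or (len(m["pwm"]) == len(k["pwm"])
                and pwm_divergence(m["pwm"], k["pwm"]) < min_divergence)
            for k in kept)
        if not redundant:
            kept.append(m)
    return kept
```

Because the list is processed best first, a later motif is always compared against the stronger motifs already kept, so the filter never discards a high-ranking motif in favor of a weaker duplicate.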

Results – Comparison
- Datasets:
  - MCF7 dataset (ER), 4361 sequences
  - LNCaP dataset (AR), 10000 sequences
- Evaluate "PWM divergence" against Transfac motifs, as in Harbison et al. (2004) and Amadeus (2008).
- Input: +/- 5000 bases from the peak for Pomoda, and +/- 200 bases from the peak for the other algorithms. Reasons: (1) they can't handle such a large range; (2) their results get worse as the width increases.
- Each motif finder reports its top 20 results.

Results table
Columns: Cell | TF | Pomoda | Amadeus | Trawler | Weeder
- MCF7: ER, HNF3, GATA, AP1, SP1, BACH1, E2F, OCT1, AP4
- LNCaP: AR, NF1, OCT, ETS
Legend: <0.12, <0.18, <0.24

Comparison
Tools compared: Pomoda, Amadeus, Trawler, Weeder
- Background model: Position-specific; Organism-specific
- Motif model: PWM (k-mer exact match); PWM (k-mer with mismatches); PWM (IUPAC string in initial scan); k-mer with mismatches
- Algorithm: Exhaustive search + PWM column updating; Add mismatches; Merge (recursively); EM; Exhaustive search + clustering
- Motif length: Various length; Fixed length; Semi-various length
- Gap detection: Supported; Not supported
- Localization: Center window size; Over-represented bins; Not supported
- Sequence weighting
- Average running time: 30 min; 93 min; >4 hours; >4 hours

Center Distribution
Mixture model:

Thank You!