Published by Felicity Gibbs. Modified over 9 years ago.
1
Starting Monday
M Oct 29 – Back to BLAST and Orthology (readings posted): will focus on the BLAST algorithm and the different types and applications of BLAST; in lab we will predict orthologs using reciprocal genome-scale BLAST searches
W Oct 31 – Phylogenetic Profiles (an example of unsupervised machine learning) and supervised machine learning approaches and applications
M Nov 5 – Phylogeny (Phylogeny Lab)
W Nov 7 – Metabolic reconstruction and modeling ***2-3 pg paper on preliminary results due***
Today: ChIP-chip and ChIP-seq analysis
2
Chromatin immunoprecipitation (ChIP)
1. Crosslink proteins to DNA in living cells (chemical or light-based crosslinking)
2. Shear DNA by sonication or digestion
3. Immunoprecipitate (IP) with a specific antibody (Ab) or an Ab against a protein tag
3
ChIP-on-chip (tiled genomic microarrays)
[Figure: signal intensity across array probes]
Peak resolution is a function of:
- shearing size
- probe resolution
- ChIP enrichment
4
ChIP-seq
[Figure: read counts]
5
6
1. Map reads to the reference genome
2. Convert to ‘tag’ counts: sequence coverage at each base pair in the genome
3. Find peaks of high tag count (using a fixed/sliding window with a count threshold) or based on the bimodal peak distribution
4. Convert bimodal peaks into summits (by shifting 3’ tag positions OR by extending the tag signal to the estimated size of the fragments)
5. Identify summits that represent fragment enrichment relative to control
6. Assign a confidence score (p-value, enrichment score, and/or FDR)
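Steps 2 and 3 above can be sketched in a few lines. This is a minimal illustration, not the method of any particular peak caller: the function names `coverage` and `call_peaks`, the fixed read length, and the window and threshold values are all assumptions chosen for the example, and reads are represented only by their mapped start positions on a single chromosome.

```python
from typing import List, Tuple


def coverage(read_starts: List[int], read_len: int, genome_len: int) -> List[int]:
    """Step 2: per-base tag counts (how many reads cover each position)."""
    cov = [0] * genome_len
    for start in read_starts:
        for pos in range(max(start, 0), min(start + read_len, genome_len)):
            cov[pos] += 1
    return cov


def call_peaks(cov: List[int], window: int, threshold: int) -> List[Tuple[int, int]]:
    """Step 3: slide a fixed-width window along the genome and report merged
    half-open (start, end) intervals whose summed tag count meets the threshold."""
    peaks: List[Tuple[int, int]] = []
    if len(cov) < window:
        return peaks
    total = sum(cov[:window])
    for start in range(len(cov) - window + 1):
        if start > 0:
            # Update the running sum: add the new base, drop the old one.
            total += cov[start + window - 1] - cov[start - 1]
        if total >= threshold:
            end = start + window
            if peaks and start <= peaks[-1][1]:
                # Overlaps or touches the previous passing window: merge.
                peaks[-1] = (peaks[-1][0], end)
            else:
                peaks.append((start, end))
    return peaks


# Toy data: five reads piled up near position 12, one isolated read at 40.
toy_cov = coverage([10, 11, 12, 12, 13, 40], read_len=5, genome_len=60)
toy_peaks = call_peaks(toy_cov, window=5, threshold=15)
print(toy_peaks)
```

The isolated read at position 40 never reaches the threshold, so only the pile-up is reported. A real analysis would go on to compare each candidate against the control sample (steps 5 and 6) before reporting it.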
7
Types of ‘control’ data for ChIP experiments
1. ‘Input’ DNA = sheared but no IP
2. No-antibody mock IP
3. Untagged strain
There is almost always some background in the mock IP; the hope is to have enrichment of IP material over background.
* Certain artifacts can give the appearance of real peaks in control experiments.
8
Pepke et al. 2009
The read count (tag) profile is generally smoothed before peak calling (e.g. with a running average), and the ‘summit’ is then inferred from the dual read peaks (one per strand).
* Using a method that incorporates a measured background model is probably very important.
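The smoothing and summit inference described above can be illustrated with a small sketch. It assumes per-base tag counts are already available separately for the forward and reverse strands, and it takes the summit to be the midpoint between the two smoothed strand peaks; the function names and the window width are hypothetical, and published peak callers use more careful estimates.

```python
from typing import List


def running_average(values: List[float], width: int) -> List[float]:
    """Centered running mean; the window is truncated at the ends of the profile."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def infer_summit(fwd: List[float], rev: List[float], width: int) -> int:
    """Smooth each strand's tag profile, locate the highest point of each,
    and return the midpoint between the two strand peaks."""
    fwd_smooth = running_average(fwd, width)
    rev_smooth = running_average(rev, width)
    fwd_mode = max(range(len(fwd_smooth)), key=fwd_smooth.__getitem__)
    rev_mode = max(range(len(rev_smooth)), key=rev_smooth.__getitem__)
    return (fwd_mode + rev_mode) // 2


# Toy profiles: forward-strand tags pile up around position 5,
# reverse-strand tags around position 15.
fwd_counts = [0.0] * 21
rev_counts = [0.0] * 21
fwd_counts[4:7] = [2.0, 6.0, 2.0]
rev_counts[14:17] = [2.0, 6.0, 2.0]
summit = infer_summit(fwd_counts, rev_counts, width=3)
print(summit)
```

The midpoint rule is the simplest possible choice; it only makes sense for sharp, bimodal peaks of the kind produced by site-specific factors.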
10
3 types of peaks
1. Sharp & narrow (100s of bp) (e.g. site-specific TF)
2. Broader but defined (kb) (e.g. RNA Polymerase)
3. Very broad (regional, 1000s of kb) (e.g. heterochromatin histone marks)
Methods that use bimodal peak profiles to identify summits work less well for biologically wider peaks/loci.
11
Hidden Markov Models for Identifying Bound Fragments
HMMs are trained on known data to recognize different states (e.g. bound vs. unbound fragments) and the probability of moving between those states.
Example: ChIP-chip data from a tiling microarray identifying regions bound to a transcription complex with a known 50bp binding sequence. You expect that a bound fragment will have high signal on the array and that the bound fragment will be 2-3 probes long.
Once trained, an HMM can be used to identify the ‘hidden’ states in an unknown dataset, based on the known characteristics of each state (‘emission probabilities’) and the probability of moving between states (‘transition probabilities’).
Example: Li, Meyer, Liu (2005). “A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences”
12
[Figure: HMM state diagram for the ChIP-chip example, with ‘Unbound 25mer’ and ‘Bound 25mer’ states]
I = intensity units > 10,000; i = intensity units < 10,000
Unbound 25mer state: P(I) = 0.2, P(i) = 0.8
Bound 25mer states (three in the diagram): P(I) = 0.8, P(i) = 0.2
Probabilities on the diagram arrows: 0.5, 1.0, 0, 0.7, 0.3, 1.0
13
[Figure: the same HMM state diagram, with the per-state values labeled ‘Emission Probabilities’ and the arrow values labeled ‘Transition Probabilities’]
Given the data, an HMM will consider many different models and give back the optimal model.
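One common way to recover the most probable sequence of hidden states from an HMM is the Viterbi algorithm, sketched below for the probe example. The emission probabilities are the ones on the slide; everything else is an assumption made for illustration. In particular, the slide's diagram uses several bound states to encode the 2-3 probe length, whereas this sketch collapses them into a single `bound` state, and the start and transition values here are hypothetical rather than read off the diagram.

```python
import math
from typing import Dict, List

STATES = ("unbound", "bound")

# Emission probabilities from the slide: I = intensity > 10,000, i = intensity < 10,000.
EMIT: Dict[str, Dict[str, float]] = {
    "unbound": {"I": 0.2, "i": 0.8},
    "bound": {"I": 0.8, "i": 0.2},
}

# ASSUMED values (not taken from the slide's diagram): start and transition probabilities.
START: Dict[str, float] = {"unbound": 0.5, "bound": 0.5}
TRANS: Dict[str, Dict[str, float]] = {
    "unbound": {"unbound": 0.9, "bound": 0.1},
    "bound": {"unbound": 0.3, "bound": 0.7},
}


def viterbi(observations: str) -> List[str]:
    """Return the most probable state path for a string of 'I'/'i' probe calls."""
    if not observations:
        return []
    # Log-probabilities avoid numerical underflow on long probe series.
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES}]
    back: List[Dict[str, str]] = []
    for obs in observations[1:]:
        prev = scores[-1]
        column: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this probe.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            column[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            pointers[s] = best
        scores.append(column)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


probe_calls = "iiIIIii"  # two low probes, three high probes, two low probes
decoded = viterbi(probe_calls)
print(decoded)
```

With these assumed parameters, the run of three high-intensity probes is decoded as bound and the flanking low-intensity probes as unbound. A single isolated high probe would be more likely to be absorbed into the unbound state, which is the behavior the "2-3 probes long" expectation is meant to capture.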
14
Evaluated 11 different peak-calling algorithms using 3 real datasets * and default parameters (mimicking “non-expert users”)
- Methods with smaller peak lists often return peaks that are also identified by other methods (i.e. they are more stringent)
“many programs call similar peaks, though default parameters are tuned to different levels of stringency”
15
16
Output: list of peak locations (start & stop) and p-values
The challenge is that peaks do not show precisely where the protein binds.
Different programs vary in the width of the identified peaks.
The same type of motif finding can be applied to a set of IP’d regions to identify motifs shared by those regions.
17
Other approaches
- ChIP-exo
- DNaseI hypersensitive sites
- Micrococcal nuclease sensitive sites (nucleosome mapping)
18
What can you do with the data?
1. Motif finding: look for motifs shared among bound regions (e.g. XX)
2. Association of bound loci with neighboring genes and elements
- functional enrichment of neighboring genes
- other non-random associations among neighboring genes, e.g. shared expression profiles, expression dependency on the factor in question
3. Locus distribution across the genome
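As a rough illustration of item 2, the sketch below assigns each peak summit to the nearest gene on the same chromosome. The data layout (tuples of name, chromosome, and transcription start site), the function name `assign_nearest_gene`, and the toy coordinates are all hypothetical; real annotations would also need strand and gene-boundary information.

```python
from typing import List, Optional, Tuple

Gene = Tuple[str, str, int]  # (name, chromosome, transcription start site)
Peak = Tuple[str, int]       # (chromosome, summit position)


def assign_nearest_gene(peaks: List[Peak], genes: List[Gene]) -> List[Optional[str]]:
    """For each peak, return the name of the gene whose start site is closest
    on the same chromosome, or None if that chromosome has no annotated gene."""
    assignments: List[Optional[str]] = []
    for chrom, summit in peaks:
        candidates = [g for g in genes if g[1] == chrom]
        if not candidates:
            assignments.append(None)
            continue
        nearest = min(candidates, key=lambda g: abs(g[2] - summit))
        assignments.append(nearest[0])
    return assignments


toy_genes = [("geneA", "chr1", 1000), ("geneB", "chr1", 5000), ("geneC", "chr2", 1200)]
toy_summits = [("chr1", 1400), ("chr1", 4000), ("chr2", 100), ("chr3", 50)]
nearest_genes = assign_nearest_gene(toy_summits, toy_genes)
print(nearest_genes)
```

The resulting gene list is the usual starting point for the functional enrichment and expression comparisons listed above.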