Published by Justyn Mangrum. Modified over 9 years ago.
1
ChIP-seq Olivier Elemento, PhD TA: Jenny Giannopoulou, PhD Institute for Computational Biomedicine CSHL High Throughput Data Analysis Workshop, June 2012
2
Plan
1. ChIP-seq
2. Quality Control of ChIP-seq data
3. ChIP-seq Peak detection
4. Peak Analysis and Interpretation
5. A few interesting ChIP-seq papers
3
1. ChIP-seq
4
ChIP-seq Illumina Transcription factor of interest (or histone modification) Antibody
5
Control: input DNA Illumina Can use IgG as additional control
6
ChIP-seq methodology
– Identify a ChIP-grade antibody and determine its specificity (Western, histone peptide array)
– Optimize conditions using single-locus ChIP-PCR (positive and negative controls)
– Sequence the ChIP product using 1 Illumina lane per sample (no TruSeq ChIP-seq), single end
– Sequence input/IgG as control
Figure: Assessing the specificity of a commercial H3K9me3 antibody using histone peptide arrays (K. Bunting & B. Swed, WCMC); Abcam H3K9Me3 rabbit polyclonal (ab8898)
7
ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGAACTGATTAGTGAATTC
TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTTGACTAATCACTTAAG
Average length ~ 170bp
8
ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGAACTGATTAGTGAATTC
TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTTGACTAATCACTTAAG
Average length ~ 170bp
40-100bp
10
BWA tutorial (for aligning single-end reads to the genome)
Get genome, e.g., from UCSC
– http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
Combine into 1 file
– tar zvfx chromFa.tar.gz
– cat *.fa > wg.fa
Index the genome
– bwa index -p hg19bwaidx -a bwtsw wg.fa
Align ChIP reads to reference genome
– bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa
Convert to SAM format
– bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam
Align input reads to same reference genome
– bwa aln -t 4 hg19bwaidx s_4_sequence.txt.gz > s_4_sequence.txt.bwa
Convert to SAM format
– bwa samse hg19bwaidx s_4_sequence.txt.bwa s_4_sequence.txt.gz > s_4_sequence.txt.sam
11
Reads can map to multiple locations/chromosomes Read 1 Read 2 Reference Human Genome (hg18)
12
Reads map to one strand or the other Read 1 Read 2 hg18
13
SAM format
DH1608P1_0130:6:1103:10579:166379#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG eb`XXYbZdadee^ceV]X][ccTcc^ebeeceeeeWbeeeeeeeceeaee XX:Z:NM_017871,32 NM:i:0 MD:Z:51
DH1608P1_0130:6:1102:3415:150915#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGGGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBac]bbbceedaeddeZceeea_ba_\_eeeeeeedaeeee XX:Z:NM_017871,32 NM:i:1 MD:Z:5T45
DH1608P1_0130:6:1102:13118:62644#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGTGCCTCGGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBBBBBBBBBBB`XTbSa`cffegdggeccbeeffdeggggg XX:Z:NM_017871,32 NM:i:2 MD:Z:7A3T39
DH1608P1_0130:6:1203:3012:157120#TTAGGC 16 chr1 1249826 25 51M * 0 0 AAGGCCGTGACTCTGATCTCAGCCCTCGTCTCCGCCGCGCTCCCGGACCCG BBBBBBBB^`QWZZ]UXYSZSTFRU]Z__SO[adcc[acdV\`Y]YWY][_ XX:Z:NM_017871,34 NM:i:3 MD:Z:4G17G1A26
DH1608P1_0130:6:2206:4445:12756#TTAGGC 16 chr1 1246336 25 1M3487N50M * 0 0 CCAAAGGGTGTGACTCTGATCTCGGGCATCGTCTCCGCCGCGCTCCCGGAC BBBBBBBBBBBBBBBBBBBBBBBB`YdddYdc\cacaNddddcdddaeeee XX:Z:NM_017871,37 NM:i:3 MD:Z:2C5C14A27
DH1608P1_0130:6:2203:7903:43788#TTAGGC 16 chr1 1246336 37 1M3487N50M * 0 0 CCCAAGGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGAC adbe[fbcbccb_cb^cb^^c^edgegggggdfggefffgfbfggggegeg XX:Z:NM_017871,37 NM:i:0 MD:Z:51
MD tag, e.g., MD:Z:4T46 = 4 matches, 1 mismatch (T is the reference base at that position), 46 matches
CIGAR string, e.g., 5M3487N46M = 5bp-long aligned block, 3487bp skipped in the reference (e.g., an intron), 46bp-long aligned block
XT tag, e.g., XT:A:U = unique mapper; XT:A:R = repeat (more than 1 high-scoring match)
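The CIGAR and MD conventions above can be checked with a few lines of Python. This is a minimal sketch using only regular expressions; the function names are illustrative and not part of any SAM library, and the MD parser assumes the value is passed without its `MD:Z:` prefix.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def reference_span(cigar):
    """Reference bases covered by the alignment.
    M, D, N, = and X consume the reference; I, S, H and P do not."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

def parse_md(md):
    """Split an MD value into match-run lengths, mismatched reference
    bases, and deleted reference bases (the latter prefixed with '^')."""
    return re.findall(r"\d+|\^[A-Z]+|[A-Z]", md)

print(parse_cigar("5M3487N46M"))     # blocks of the spliced example above
print(reference_span("5M3487N46M"))  # aligned blocks plus the skipped region
print(parse_md("4T46"))              # match run, reference base, match run
```

For a 51-nt read, the match runs and mismatches in the MD tag add up to the read length (4 + 1 + 46 = 51), which is a quick sanity check when reading these tags by eye.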
14
Quality Control
15
http://biowhat.ucsd.edu/homer/chipseq/qc.html Clonal reads
16
Fragment size analysis
18
http://biowhat.ucsd.edu/homer/chipseq/qc.html Fragment size analysis using opposite strand autocorrelation
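The idea behind opposite-strand correlation can be sketched as follows: reads on the + strand mark one end of a fragment and reads on the - strand mark the other, so the shift that best lines up the two sets of 5' ends approximates the fragment size. The code below is a toy illustration of that principle, not HOMER's implementation; the function name, the `max_shift` parameter and the input layout (lists of 5' end coordinates for one chromosome) are assumptions.

```python
from collections import Counter

def estimate_fragment_size(plus_starts, minus_ends, max_shift=400):
    """Return (best_shift, profile), where profile[d] counts pairs made of
    a + strand 5' end at p and a - strand 5' end at p + d."""
    plus_counts = Counter(plus_starts)
    minus_counts = Counter(minus_ends)
    profile = {}
    for d in range(max_shift + 1):
        profile[d] = sum(n * minus_counts.get(p + d, 0)
                         for p, n in plus_counts.items())
    best_shift = max(profile, key=profile.get)
    return best_shift, profile

# Toy data: four fragments whose two ends are 170 bp apart
plus = [100, 500, 900, 1300]
minus = [p + 170 for p in plus]
shift, _ = estimate_fragment_size(plus, minus)
print(shift)
```

On real data the profile is a broad hump rather than a single spike, and its maximum is read off as the estimated fragment length.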
19
http://biowhat.ucsd.edu/homer/chipseq/qc.html Fragment size analysis
20
GC-content analysis http://biowhat.ucsd.edu/homer/chipseq/qc.html
22
Other QC measures
Number of peaks:
– 0 or very few peaks, even at permissive peak-calling thresholds = bad experiment
Motif enrichment:
– is the expected motif enriched in peaks?
23
ChIP-seq peak calling
24
MACS
25
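The Poisson distribution (MACS)
λ = expected # of reads within an interval of 2d bp
Estimate d based on high-quality peaks
# in R: P(X>=5|λ=0.001) is ppois(4, 0.001, lower.tail = FALSE); the form 1-sum(dpois(0:4, 0.001)) loses a probability this small (about 8e-18) to floating-point rounding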
26
BayesPeak
27
BayesPeak (Bayesian Hidden Markov Models) Parameters estimated using Bayesian treatment Observed variable Hidden states
28
BayesPeak
29
Peak detection using ChIPseeqer http://icb.med.cornell.edu/wiki/index.php/Elementolab/ChIPseeqer_Tutorial (Elemento and Giannopoulou, 2011)
31
A nice peak
32
Not all peaks are that nice
33
Peak detection
– Calculate read count at each position (bp) in the genome (we don't use a sliding window)
– Determine if read count is greater than expected (at each position, bp)
34
Peak detection
We need to correct for input DNA reads (control):
– non-uniformly distributed (form peaks too)
– vastly different numbers of reads between ChIP and input
35
Expected read count = total number of reads * extended fragment length / chr length
Use Bioanalyzer for the fragment length (remove adapter lengths)
[Figure: read count and expected read count along the genome]
36
Is the observed read count at a given genomic position greater than expected?
x = observed read count
λ = expected read count
Read counts follow a Poisson distribution
[Figure: the Poisson distribution; x-axis read count, y-axis frequency]
37
Is the observed read count at a given genomic position greater than expected?
x = 10 reads (observed)
λ = 0.5 reads (expected)
P(X>=10) = 1.7 x 10^-10
log10 P(X>=10) = -9.77
-log10 P(X>=10) = 9.77
# in R: P(X>=10|λ=0.5) is 1-sum(dpois(0:9, 0.5))
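The Poisson test on this slide can be reproduced in Python with the standard library. The sketch below sums the upper tail directly instead of computing 1 minus the lower tail, so that very small probabilities are not lost to rounding. The function name is illustrative, and the loop assumes λ is modest (as for per-position expected read counts) so that exp(-λ) does not underflow.

```python
import math

def poisson_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam), summed term by term over the upper tail."""
    if x <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # First tail term P(X = x), computed in log space
    term = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    total, k = 0.0, x
    while True:
        total += term
        k += 1
        term *= lam / k  # P(X = k) from P(X = k - 1)
        # Stop once past the mode and the next term no longer changes the sum
        if k > lam and term <= total * 1e-17:
            break
    return min(total, 1.0)

p = poisson_tail(10, 0.5)      # slide example: x = 10 observed, lambda = 0.5 expected
print(p, -math.log10(p))
print(poisson_tail(5, 0.001))  # MACS slide example
```

The first call should agree with the slide's P(X>=10) of about 1.7 x 10^-10 and a -log10 value of about 9.77.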
38
Expected read count = total number of reads * extended frag len / chr len
[Figure tracks: read count, expected read count, -log(p)]
39
Expected read count = total number of reads * extended frag len / chr len
[Figure tracks: read count, expected read count, input reads, -log(p)]
40
[Figure: for ChIP and INPUT separately, tracks of read count, expected read count and -log(P) along genome positions (bp); the difference [-log(P_c)] - [-log(P_i)] is compared with a threshold]
41
Normalized peak score (at each bp)
R = -log10 [ P(X_ChIP) / P(X_input) ] = [-log10 P(X_ChIP)] - [-log10 P(X_input)]
Will detect peaks with high read counts in ChIP, low in input
Works when there is no input DNA (x = 0)
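As a small worked example of the normalized score, the sketch below takes the two Poisson p-values as inputs and returns the difference of their -log10 values. The function name, the default of 1.0 for the input p-value (P(X>=0) = 1 when there are no input reads) and the underflow floor are assumptions made for illustration.

```python
import math

def normalized_peak_score(p_chip, p_input=1.0, floor=1e-300):
    """R = [-log10 p_chip] - [-log10 p_input].
    p_input = 1.0 corresponds to no input reads at this position."""
    p_chip = max(p_chip, floor)    # guard against log10(0) after underflow
    p_input = max(p_input, floor)
    return -math.log10(p_chip) + math.log10(p_input)

print(normalized_peak_score(1e-20, 1e-8))  # strong ChIP signal, some input signal
print(normalized_peak_score(1e-20))        # same ChIP signal, no input reads
print(normalized_peak_score(1e-5, 1e-5))   # equally enriched in both: no peak
```

A position enriched to the same degree in ChIP and input scores near zero, which is how input peaks are discounted.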
42
Mappability
43
Non-mappable fraction of the genome
chromosome  non-mappable bp / total bp  non-mappable fraction
chr18  9369067/76117153  0.123087459668913 (=12%)
chr2  33849240/242951149  0.139325292921335
chr3  27854877/199501827  0.139622164963933
chr4  27090014/191273063  0.141630052737745
chr6  24330283/170899992  0.142365618132972
chr8  20932821/146274826  0.143106107677065
chr5  26029902/180857866  0.143924633059643
chr12  19382853/132349534  0.14645199279659
chr11  20039443/134452384  0.149044906485258
chr20  10017788/62435964  0.160449000194824
chr7  26182588/158821424  0.164855517225434
chr10  22968951/135374737  0.169669404417753
chr17  14496284/78774742  0.184021980040252
chrX  31269270/154913754  0.201849540099583
chr1  55186693/247249719  0.223202247602959
chr13  28668063/114142980  0.251159230291692
chr16  23552340/88827254  0.265147676410215
chr14  29689825/106368585  0.279122120502026
chrM  4628/16571  0.279283084907368
chr9  43125838/140273252  0.307441635415995
chr19  20251255/63811651  0.317359834491667
chr15  31877970/100338915  0.317702957023205
chr21  16867677/46944323  0.359312392256674
chr22  21176578/49691432  0.426161556382597
chrY  43209644/57772954  0.747921665906161 (=75%)
We enumerated all 30-mers, counted # occurrences, calculated non-unique fraction of genome
Unique/mappable fraction = 1 – non-unique fraction
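The k-mer enumeration described on this slide can be illustrated on a toy sequence. This is a sketch under simplifying assumptions: it looks at the forward strand only, holds every k-mer in memory (impractical for a whole genome with 30-mers), and the function name is hypothetical.

```python
from collections import Counter

def non_unique_fraction(sequence, k):
    """Fraction of k-mer start positions whose k-mer occurs more than once."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    repeated = sum(1 for kmer in kmers if counts[kmer] > 1)
    return repeated / len(kmers)

toy = "ACGTACGTTT"
frac = non_unique_fraction(toy, 4)  # ACGT occurs twice among the 7 4-mers
print(frac, 1 - frac)               # non-unique and unique/mappable fractions
```

The second printed value is the mappable fraction used to correct the expected read count.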
44
Expected read count = total number of reads * extended frag len / (chr len * mappable fraction)
[Figure tracks: read count, expected read count, -log(p)]
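The mappability-corrected formula is simple arithmetic, sketched below. The chromosome length and non-mappable count for chr18 come from the mappability slide; the read total (1,000,000 reads on that chromosome) and the 170 bp fragment length are made-up inputs for illustration, and treating the read total as per-chromosome is an assumption.

```python
def expected_read_count(total_reads, fragment_length, chrom_length,
                        mappable_fraction=1.0):
    """Expected reads covering one position:
    total_reads * fragment_length / (chrom_length * mappable_fraction)."""
    if not 0.0 < mappable_fraction <= 1.0:
        raise ValueError("mappable_fraction must be in (0, 1]")
    return total_reads * fragment_length / (chrom_length * mappable_fraction)

chr18_length = 76117153
chr18_mappable = 1 - 9369067 / 76117153  # 1 - non-mappable fraction

# Hypothetical inputs: 1,000,000 reads on chr18, 170 bp extended fragments
print(expected_read_count(1_000_000, 170, chr18_length))
print(expected_read_count(1_000_000, 170, chr18_length, chr18_mappable))
```

Dividing by the mappable fraction raises λ slightly, which makes the Poisson test a little more conservative on chromosomes with many repeats.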
45
Peak detection
– Determine all genomic regions with R>=15
– Merge peaks separated by less than 100bp
– Output all peaks with length >= 100bp
– Process 23M reads in <5 mins
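The three steps above (threshold, merge, length filter) can be sketched on a per-bp score array. This is an illustrative reimplementation of the logic as stated on the slide, not ChIPseeqer's code; the function and parameter names are assumptions, and intervals are returned as half-open (start, end) pairs.

```python
def call_peaks(scores, threshold=15.0, max_gap=100, min_length=100):
    """scores: per-bp normalized scores R for one chromosome."""
    # 1. Contiguous runs of positions with score >= threshold
    runs, start = [], None
    for pos, score in enumerate(scores):
        if score >= threshold and start is None:
            start = pos
        elif score < threshold and start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(scores)))
    # 2. Merge runs separated by fewer than max_gap bp
    merged = []
    for run_start, run_end in runs:
        if merged and run_start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], run_end)
        else:
            merged.append((run_start, run_end))
    # 3. Keep only peaks at least min_length bp long
    return [(s, e) for s, e in merged if e - s >= min_length]

# Toy track: two nearby runs that merge, and one short isolated run that is dropped
track = [0] * 50 + [20] * 60 + [0] * 30 + [20] * 60 + [0] * 300 + [20] * 40 + [0] * 10
print(call_peaks(track))
```

Merging before filtering matters: in the toy track neither 60 bp run would pass the length filter alone, but the merged 150 bp region does.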
46
BCL6 ChIP-seq
– Lymphoma cell line (OCI-Ly1)
– Illumina: 6 GA2x lanes for ChIP, 1 for input DNA, 1 for QC
– 36nt-long sequences
– 32 million reads
– Aligned/mapped to hg18 with BWA
– With Melnick lab at WCMC
47
[Figure tracks: ChIP reads, input reads, detected peaks]
BCL6: 18,814 peaks; 80% are within 20kb of a known gene
49
Loading peaks into GRange
library(IRanges)  # RangedData was provided by IRanges in older Bioconductor releases
# Prepare the ChIP and input reads
system("split_samfile s_1_sequence.txt.sam -outdir CHIP/")
system("split_samfile s_2_sequence.txt.sam -outdir INPUT/")
# Call peaks
system("ChIPseeqer.bin -chipdir CHIP -inputdir INPUT -t 15 -fold 2 -outfile peaks.txt")
# dataFolder must already be defined and end with a path separator
tpeaks = read.table(paste(dataFolder, "peaks.txt", sep = ""), header = F)
peaks = RangedData(ranges = IRanges(start = tpeaks[, 2], end = tpeaks[, 3]), space = tpeaks[, 1], summit = tpeaks[, 6], score = tpeaks[, 5])...
50
Other peak finders
51
Promoter-based analysis (not peak-based)
Maximum peak height in the 2kb promoter, for all TSS
[Figure: 2kb promoters with maximum heights h1, h2, h3, h4, h5, …]
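The promoter-based score can be sketched as a maximum over a per-bp read-count track. The slide does not say how the 2kb window is placed relative to the TSS; the code below assumes it lies entirely upstream of the TSS (strand-aware), and the function name and 0-based coordinates are illustrative choices.

```python
def max_promoter_height(heights, tss, strand, promoter_length=2000):
    """Maximum read height in the window upstream of a TSS.
    heights: per-bp read counts for one chromosome (0-based list).
    Assumption: the promoter is the promoter_length bp upstream of the TSS."""
    if strand == "+":
        start, end = tss - promoter_length, tss
    else:
        start, end = tss + 1, tss + 1 + promoter_length
    start, end = max(start, 0), min(end, len(heights))
    return max(heights[start:end], default=0)

# Toy chromosome with two pile-ups on either side of a TSS at position 3000
heights = [0] * 5000
heights[2500] = 7
heights[3500] = 9
print(max_promoter_height(heights, 3000, "+"))  # upstream of a + strand gene
print(max_promoter_height(heights, 3000, "-"))  # upstream of a - strand gene
```

Every TSS gets a score this way, including promoters with no called peak, which is the point of a promoter-based rather than peak-based analysis.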
52
4. Peak analysis and interpretation
53
Gene-based peak annotation
54
Integration of multiple peak lists
RangedData in R
55
Conservation analysis
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/phastCons17way/
fixedStep chrom=chr1 start=516994 step=1
0.005 0.009 0.023 0.036 0.048 0.059 0.068 0.077 0.084 0.091 0.097 0.102 0.107 0.110 0.113 0.115 0.116
fixedStep chrom=chr1 start=517900 step=1
0.114 0.112 0.109 0.105 0.101...
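Reading these conservation scores amounts to tracking the current position from each `fixedStep` header. Below is a minimal parser sketch, assuming the usual wiggle layout of one score per line after each header and 1-based start coordinates; the function name is illustrative, and the example lines reuse the first values shown above.

```python
def parse_fixed_step(lines):
    """Yield (chrom, position, score) from fixedStep wiggle lines."""
    chrom, pos, step = None, None, 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "#")):
            continue
        if line.startswith("fixedStep"):
            # Header fields look like chrom=chr1 start=516994 step=1
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields.get("step", 1))
        else:
            yield chrom, pos, float(line)
            pos += step

example = [
    "fixedStep chrom=chr1 start=516994 step=1",
    "0.005",
    "0.009",
    "0.023",
    "fixedStep chrom=chr1 start=517900 step=1",
    "0.114",
]
for record in parse_fixed_step(example):
    print(record)
```

Averaging the yielded scores over each peak interval gives a per-peak conservation value that can be compared with random regions.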
56
What is the cis-regulatory code of each factor? Does it require any co-factors?
[Figure: DNA, activation, repression]
57
http://meme.nbcr.net/meme4_6_1/intro.html
59
Discovering regulatory sequences associated with peak regions
True TF binding peak? Yes: target regions; No: random regions
[Figure: contingency tables of motif (absent/present) against true TF peak (no/yes), with values 0.40, 0.10, 0.33, 0.10, 0.40, 0.00]
Motif correlation is quantified using the mutual information
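Mutual information between motif presence and peak status can be computed from a 2x2 table of joint frequencies. The sketch below is a generic implementation, not FIRE's code; the function name, the use of log base 2 (bits) and the example table are assumptions made for illustration.

```python
import math

def mutual_information(joint):
    """Mutual information (in bits) of a table of joint counts or probabilities.
    Rows: motif absent/present; columns: non-peak/peak regions."""
    total = sum(sum(row) for row in joint)
    row_p = [sum(row) / total for row in joint]
    col_p = [sum(row[j] for row in joint) / total for j in range(len(joint[0]))]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, value in enumerate(row):
            p = value / total
            if p > 0:  # empty cells contribute nothing
                mi += p * math.log2(p / (row_p[i] * col_p[j]))
    return mi

# Hypothetical tables: motif mostly in peaks vs. motif unrelated to peaks
print(mutual_information([[0.40, 0.10], [0.10, 0.40]]))
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))
```

A motif distributed independently of peak status gives a value of zero, and the value grows as the motif concentrates in one class of regions.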
60
Motif Search Algorithm
k-mer  MI
(highly informative)
CTCATCG  0.0618
TCATCGC  0.0485
AAAATTT  0.0438
GATGAGC  0.0434
AAAAATT  0.0383
ATGAGCT  0.0334
TTGCCAC  0.0322
TGCCACC  0.0298
ATCTCAT  0.0265
...
ACGCGCG  0.0018
CGACGCG  0.0012
TACGCTA  0.0011
ACCCCCT  0.0010
CCACGGC  0.0009
TTCAAAA  0.0005
AGACGCG  0.0004
CGAGAGC  0.0003
CTTATTA  0.0002
(not informative)
[Figure: motifs with MI=0.081, MI=0.045, MI=0.040]
61
Motif co-occurrence analysis
[Figure: discovered motifs; color scale from depletion to enrichment]
FIRE automatically compares discovered motifs to known motifs in TRANSFAC and JASPAR
62
5. A few interesting papers
63
First ChIP-seq paper
64
Epigenetic modifications at enhancer regions
65
Chromatin states
66
Nucleosome localization
67
Whole-genome nucleosome location mapping in B cells Yanwen Jiang, PhD Principal Component Analysis of Nucleosome profiles