ChIP-seq Olivier Elemento, PhD TA: Jenny Giannopoulou, PhD Institute for Computational Biomedicine CSHL High Throughput Data Analysis Workshop, June 2012
Plan
1. ChIP-seq
2. Quality Control of ChIP-seq data
3. ChIP-seq Peak detection
4. Peak Analysis and Interpretation
5. A few interesting ChIP-seq papers
1. ChIP-seq
ChIP-seq Illumina Transcription factor of interest (or histone modification) Antibody
Control: input DNA Illumina Can use IgG as additional control
ChIP-seq methodology
- Identify ChIP-grade antibody, determine specificity (Western, histone peptide array)
- Optimize conditions using single-locus ChIP-PCR (positive and negative controls)
- Sequence ChIP product using 1 Illumina lane per sample (no TruSeq ChIP-seq), single end
- Sequence input/IgG as control
Figure: Assessing the specificity of a commercial H3K9me3 antibody using histone peptide arrays, K. Bunting & B. Swed, WCMC. Abcam H3K9me3 rabbit polyclonal (ab8898)
ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGA ACTGATTAGTGAATTC TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTT GACTAATCACTTAAG Average length ~ 170bp
BWA tutorial (for aligning single-end reads to a genome)
- Get the genome, e.g., from UCSC, and combine it into 1 file
    tar zvfx chromFa.tar.gz
    cat *.fa > wg.fa
- Index the genome
    bwa index -p hg19bwaidx -a bwtsw wg.fa
- Align ChIP reads to the reference genome
    bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa
- Convert to SAM format
    bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam
- Align input reads to the same reference genome
    bwa aln -t 4 hg19bwaidx s_4_sequence.txt.gz > s_4_sequence.txt.bwa
- Convert to SAM format
    bwa samse hg19bwaidx s_4_sequence.txt.bwa s_4_sequence.txt.gz > s_4_sequence.txt.sam
Reads can map to multiple locations/chromosomes Read 1 Read 2 Reference Human Genome (hg18)
Reads map to one strand or the other Read 1 Read 2 hg18
SAM format
DH1608P1_0130:6:1103:10579:166379#TTAGGC 16 chr M * 0 0 GGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG eb`XXYbZdadee^ceV]X][ccTcc^ebeeceeeeWbeeeeeeeceeaee XX:Z:NM_017871,32 NM:i:0 MD:Z:51
DH1608P1_0130:6:1102:3415:150915#TTAGGC 16 chr M * 0 0 GGGCGGGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBac]bbbceedaeddeZceeea_ba_\_eeeeeeedaeeee XX:Z:NM_017871,32 NM:i:1 MD:Z:5T45
DH1608P1_0130:6:1102:13118:62644#TTAGGC 16 chr M * 0 0 GGGCGTGCCTCGGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBBBBBBBBBBB`XTbSa`cffegdggeccbeeffdeggggg XX:Z:NM_017871,32 NM:i:2 MD:Z:7A3T39
DH1608P1_0130:6:1203:3012:157120#TTAGGC 16 chr M * 0 0 AAGGCCGTGACTCTGATCTCAGCCCTCGTCTCCGCCGCGCTCCCGGACCCG BBBBBBBB^`QWZZ]UXYSZSTFRU]Z__SO[adcc[acdV\`Y]YWY][_ XX:Z:NM_017871,34 NM:i:3 MD:Z:4G17G1A26
DH1608P1_0130:6:2206:4445:12756#TTAGGC 16 chr M3487N50M * 0 0 CCAAAGGGTGTGACTCTGATCTCGGGCATCGTCTCCGCCGCGCTCCCGGAC BBBBBBBBBBBBBBBBBBBBBBBB`YdddYdc\cacaNddddcdddaeeee XX:Z:NM_017871,37 NM:i:3 MD:Z:2C5C14A27
DH1608P1_0130:6:2203:7903:43788#TTAGGC 16 chr M3487N50M * 0 0 CCCAAGGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGAC adbe[fbcbccb_cb^cb^^c^edgegggggdfggefffgfbfggggegeg XX:Z:NM_017871,37 NM:i:0 MD:Z:51
- MD tag, e.g. MD:Z:4T46 = 4 matches, 1 mismatch (the reference base is T), 46 matches
- CIGAR string, e.g. 5M3487N46M = 5bp-long aligned block, 3487 bp skipped in the reference, 46bp-long aligned block
- XT tag, e.g. XT:A:U = unique mapper; XT:A:R = more than 1 high-scoring match
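The MD and CIGAR conventions summarized on this slide can be decoded mechanically. The following Python sketch is only an illustration: the helper names `parse_cigar` and `parse_md` are hypothetical (they are not part of BWA or any SAM library), and the MD parser handles only matches, mismatches, and deletions.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def parse_md(md):
    """Split an MD value (without the 'MD:Z:' prefix) into tokens.

    Integers are runs of matching bases, single letters are the
    reference bases at mismatched positions, and '^ACG'-style strings
    are reference bases deleted from the read.
    """
    tokens = []
    for num, deletion, base in re.findall(r"(\d+)|(\^[A-Z]+)|([A-Z])", md):
        if num:
            tokens.append(int(num))
        elif deletion:
            tokens.append(deletion)
        else:
            tokens.append(base)
    return tokens

# The two examples quoted on the slide
print(parse_cigar("5M3487N46M"))
print(parse_md("4T46"))
```

For a 51 nt read, the match runs plus the mismatches in the MD tag should add up to the read length, which is a quick sanity check on any parsed record.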
2. Quality Control
Clonal reads
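One common way to quantify clonal reads is to treat reads that share chromosome, 5' position, and strand as copies of the same fragment. The sketch below assumes that definition and uses hypothetical helper names (`clonal_fraction`, `collapse_clonal`); it is not the specific QC procedure used in the workshop.

```python
from collections import Counter

def clonal_fraction(reads):
    """Fraction of reads that are extra copies of an already-seen position.

    `reads` is a list of (chrom, position, strand) tuples.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    return 1 - len(counts) / total if total else 0.0

def collapse_clonal(reads, max_copies=1):
    """Keep at most `max_copies` reads per (chrom, position, strand)."""
    seen = Counter()
    kept = []
    for read in reads:
        if seen[read] < max_copies:
            kept.append(read)
        seen[read] += 1
    return kept

# Toy data: three identical reads plus one distinct read
reads = [("chr1", 100, "+")] * 3 + [("chr1", 200, "-")]
print(clonal_fraction(reads))
print(collapse_clonal(reads))
```

A high clonal fraction usually points to low library complexity or over-amplification, so it is worth checking before peak calling.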
Fragment size analysis
Fragment size analysis using opposite strand autocorrelation
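The idea behind opposite-strand correlation is that reads on the + strand mark one end of a fragment and reads on the - strand mark the other end, so the shift that best aligns the two sets of 5' ends approximates the fragment length. The Python sketch below is a minimal illustration under that assumption; the function name and the `max_shift` default are hypothetical.

```python
from collections import Counter

def estimate_fragment_size(plus_5p, minus_5p, max_shift=400):
    """Return the shift (bp) at which + strand 5' ends best line up
    with - strand 5' ends on the same chromosome."""
    plus = Counter(plus_5p)
    minus = Counter(minus_5p)
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        # Number of (+ read, - read) pairs separated by exactly `shift` bp
        score = sum(n * minus.get(pos + shift, 0) for pos, n in plus.items())
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift

# Toy data: every fragment spans 170 bp between its two 5' ends
plus_5p = [100, 500, 900]
minus_5p = [p + 170 for p in plus_5p]
print(estimate_fragment_size(plus_5p, minus_5p))
```

On real data the score curve is noisy, so it is usually inspected as a plot rather than reduced to a single maximum.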
Fragment size analysis
GC-content analysis
GC-content analysis
Other QC measures
- Number of peaks: 0 or very few peaks, even at permissive peak calling thresholds = bad experiment
- Motif enrichment: is the expected motif enriched in peaks?
3. ChIP-seq peak calling
MACS
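The Poisson distribution (MACS)
λ = expected # of reads within an interval of 2d bp
Estimate d based on high-quality peaks
# in R, P(X>=5|λ=0.001) is 1-sum(dpois(0:4, 0.001))
# for tail probabilities this small, ppois(4, 0.001, lower.tail = FALSE) is more accurate, because 1-sum(...) loses precision in floating point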
BayesPeak
BayesPeak (Bayesian Hidden Markov Models) Parameters estimated using Bayesian treatment Observed variable Hidden states
BayesPeak
Peak detection using ChIPseeqer (Elemento and Giannopoulou, 2011)
A nice peak
Not all peaks are that nice
Peak detection Calculate read count at each position (bp) in genome (we don’t use a sliding window) Determine if read count is greater than expected (at each position - bp)
Peak detection
We need to correct for input DNA reads (control)
- non-uniformly distributed (form peaks too)
- vastly different numbers of reads between ChIP and input
Expected read count
Expected read count = total number of reads * extended fragment length / chr length
Fragment length: use Bioanalyzer (remove adapter lengths)
[Figure: read count and expected read count along the genome]
Is the observed read count at a given genomic position greater than expected ? x = observed read count λ = expected read count The Poisson distribution Read count Frequency Read counts follow a Poisson distribution
Is the observed read count at a given genomic position greater than expected?
x = 10 reads (observed)
λ = 0.5 reads (expected)
The Poisson distribution
P(X>=10) = 1.7 x 10^-10
log10 P(X>=10) = -9.77
-log10 P(X>=10) = 9.77
# in R, P(X>=10|λ=0.5) is 1-sum(dpois(0:9, 0.5))
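The same tail probability can be reproduced outside R. The Python sketch below is an illustration, not ChIPseeqer code; the function names are hypothetical. It sums the upper tail directly when the tail is small, which avoids the loss of precision that 1 minus a number very close to 1 suffers in floating point.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_upper_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam)."""
    if x <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    if x <= lam:
        # The tail is large here, so 1 - P(X < x) is accurate enough
        return max(0.0, 1.0 - sum(poisson_pmf(k, lam) for k in range(x)))
    # The tail is small: add up P(X = x), P(X = x + 1), ... directly
    term = poisson_pmf(x, lam)
    total, k = 0.0, x
    while term > total * 1e-17:
        total += term
        k += 1
        term *= lam / k
    return total

p = poisson_upper_tail(10, 0.5)
print(p)                          # about 1.7e-10
print(round(-math.log10(p), 2))   # 9.77
```

With x = 5 and λ = 0.001 the direct sum still returns a positive value (on the order of 1e-17), whereas subtracting the lower tail from 1 would round to zero.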
[Figure: genome tracks of read count, expected read count, and -log(p)] Expected read count = total number of reads * extended frag len / chr len
[Figure: the same tracks with input reads added] Expected read count = total number of reads * extended frag len / chr len
[Figure: for ChIP and for input, tracks of read count, expected read count, and -log(P_c) or -log(P_i) along genome positions (bp); the difference [-log(P_c)] - [-log(P_i)] is compared with a threshold]
Normalized peak score (at each bp)
R = -log10 [ P(X_ChIP) / P(X_input) ] = [-log10 P(X_ChIP)] - [-log10 P(X_input)]
Will detect peaks with high read counts in ChIP, low in input
Works even with no input DNA reads (x=0)
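A minimal Python sketch of this score, assuming P(X) is the Poisson upper tail P(X >= x) used in the earlier slides. The helper names are hypothetical, and the tail is truncated after 200 terms, which is only adequate when λ is small compared with 200.

```python
import math

def log10_upper_tail(x, lam, n_terms=200):
    """log10 P(X >= x) for X ~ Poisson(lam), via a log-sum-exp over
    the first `n_terms` tail terms. Returns 0.0 when x == 0."""
    if x <= 0:
        return 0.0
    logs = [-lam + k * math.log(lam) - math.lgamma(k + 1)
            for k in range(x, x + n_terms)]
    top = max(logs)
    return (top + math.log(sum(math.exp(v - top) for v in logs))) / math.log(10)

def peak_score(chip_count, chip_lambda, input_count, input_lambda):
    """R = [-log10 P(X_ChIP)] - [-log10 P(X_input)] at one position."""
    return (-log10_upper_tail(chip_count, chip_lambda)
            + log10_upper_tail(input_count, input_lambda))

# High ChIP count, no input reads: the input term vanishes
print(round(peak_score(10, 0.5, 0, 0.5), 2))   # 9.77
# Same enrichment in ChIP and input: the score cancels out
print(round(peak_score(10, 0.5, 10, 0.5), 2))  # 0.0
```

Working in log space keeps the score finite even when the underlying probabilities are far too small to represent directly.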
Mappability
Non-mappable fraction of the genome
[Table: non-unique positions / chromosome length for each chromosome, e.g. 12% for the first chromosome listed and 74% for chrY]
We enumerated all 30-mers, counted # occurrences, calculated non-unique fraction of genome
Unique/mappable fraction = 1 - non-unique fraction
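The k-mer enumeration described here can be sketched on a toy sequence. This Python example is an illustration only: it holds every k-mer in memory, looks at a single sequence and a single strand, and the function name is hypothetical. A genome-scale calculation needs a more economical data structure.

```python
from collections import Counter

def non_unique_fraction(seq, k=30):
    """Fraction of k-mer start positions whose k-mer occurs more than
    once in `seq`."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    non_unique = sum(1 for kmer in kmers if counts[kmer] > 1)
    return non_unique / len(kmers)

# Toy example with 4-mers: ACGT occurs at two of the five positions
frac = non_unique_fraction("ACGTACGT", k=4)
print(frac)       # non-unique fraction
print(1 - frac)   # unique/mappable fraction
```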
Read count Expected read count -Log(p) Expected read count = total number of reads * extended frag len / ( chr len * mappable fraction)
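As a worked instance of this formula, the Python sketch below plugs in illustrative numbers. The read count, fragment length, chromosome length, and mappable fraction are assumptions chosen for easy arithmetic, not values from the BCL6 experiment.

```python
def expected_read_count(total_reads, frag_len, chr_len, mappable_fraction=1.0):
    """Expected number of extended reads covering one position.

    total_reads: reads mapped to the chromosome
    frag_len: extended fragment length in bp
    chr_len: chromosome length in bp
    mappable_fraction: unique/mappable fraction of the chromosome
    """
    return total_reads * frag_len / (chr_len * mappable_fraction)

# 1,000,000 reads of 170 bp on a 200 Mb chromosome that is 85% mappable
print(expected_read_count(1_000_000, 170, 200_000_000, 0.85))
# Ignoring mappability lowers the expectation, which inflates significance
print(expected_read_count(1_000_000, 170, 200_000_000))
```

Dividing by the mappable fraction raises λ, so the correction makes peak calls slightly more conservative.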
Peak detection Determine all genomic regions with R>=15 Merge peaks separated by less than 100bp Output all peaks with length >= 100bp Process 23M reads in <5mins
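The three steps on this slide (threshold, merge, filter) can be sketched as follows. This Python example assumes a per-bp list of scores R for one chromosome and reports half-open intervals; the function name and data layout are hypothetical, not ChIPseeqer's implementation.

```python
def call_peaks(scores, threshold=15.0, max_gap=100, min_len=100):
    """Return [start, end) regions from per-bp scores.

    1. Find contiguous runs with score >= threshold.
    2. Merge runs separated by fewer than `max_gap` bp.
    3. Keep merged regions at least `min_len` bp long.
    """
    runs, start = [], None
    for pos, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(scores)))

    merged = []
    for run_start, run_end in runs:
        if merged and run_start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], run_end)
        else:
            merged.append((run_start, run_end))

    return [(s, e) for s, e in merged if e - s >= min_len]

# Two 60 bp runs 30 bp apart are merged; an isolated 40 bp run is dropped
scores = [0] * 50 + [20] * 60 + [0] * 30 + [20] * 60 + [0] * 200 + [20] * 40 + [0] * 10
print(call_peaks(scores))
```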
BCL6 ChIP-seq Lymphoma cell line (OCI-Ly1) Illumina 6 GA2x lanes for ChIP, 1 for input DNA, 1 for QC 36nt long sequences 32 Million reads Aligned/mapped to hg18 with BWA With Melnick lab at WCMC
ChIP reads Input reads Detected Peaks BCL6: 18,814 peaks 80% are within <20kb of a known gene
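Loading peaks into GRange
library(IRanges)
# split the ChIP and input SAM files into per-chromosome files
system("split_samfile s_1_sequence.txt.sam -outdir CHIP/")
system("split_samfile s_2_sequence.txt.sam -outdir INPUT/")
# call peaks with score threshold 15 and fold change 2
system("ChIPseeqer.bin -chipdir CHIP -inputdir INPUT -t 15 -fold 2 -outfile peaks.txt")
# read the peak table: column 1 = chromosome, 2 = start, 3 = end, 5 = score, 6 = summit
tpeaks = read.table(paste(dataFolder, "peaks.txt", sep = ""), header = FALSE)
peaks = RangedData(ranges = IRanges(start = tpeaks[, 2], end = tpeaks[, 3]), space = tpeaks[, 1], summit = tpeaks[, 6], score = tpeaks[, 5])
...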
Other peak finders
Promoter-based analysis (not peak-based)
For all TSS: maximum peak height in the 2kb promoter
[Figure: peak heights h1, h2, h3, h4, h5, ... in 2kb promoters]
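A minimal Python sketch of the promoter-based score. It assumes the 2kb promoter is the 2 kb immediately upstream of the TSS (strand-aware) and that coverage is a per-bp list of read heights for one chromosome; both the function name and that promoter definition are assumptions made for illustration.

```python
def promoter_max_height(coverage, tss, strand="+", promoter_len=2000):
    """Maximum read height in the promoter window upstream of a TSS.

    coverage: per-bp read heights for one chromosome (0-based list)
    tss: 0-based TSS position
    strand: '+' or '-'; upstream is to the left for '+', right for '-'
    """
    if strand == "+":
        lo, hi = tss - promoter_len, tss
    else:
        lo, hi = tss + 1, tss + 1 + promoter_len
    lo, hi = max(lo, 0), min(hi, len(coverage))
    window = coverage[lo:hi]
    return max(window) if window else 0

# Toy chromosome with one pile-up on each side of a TSS at position 3000
coverage = [0] * 5000
coverage[2500] = 7
coverage[3500] = 9
print(promoter_max_height(coverage, 3000, "+"))
print(promoter_max_height(coverage, 3000, "-"))
```

Scoring every TSS this way gives each gene a value even when no peak passes the calling threshold.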
4. Peak analysis and interpretation
Gene-based peak annotation
Integration of multiple peak lists RangedData in R
Conservation analysis fixedStep chrom=chr1 start= step= fixedStep chrom=chr1 start= step=
What is the cis-regulatory code of each factor? Do they require any co-factors? DNA Activation Repression
Discovering regulatory sequences associated with peak regions
- Target regions: true TF binding peak? Yes
- Random regions: true TF binding peak? No
- For each candidate motif, record whether it is present or absent in each region
- Motif correlation is quantified using the mutual information
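The mutual information between motif presence and peak status can be computed from the joint counts of the two binary variables. The Python sketch below reports the value in bits; it is an illustration with a hypothetical function name, not the FIRE implementation.

```python
import math
from collections import Counter

def mutual_information(motif_present, is_peak):
    """Mutual information (bits) between two equal-length label lists,
    e.g. motif present/absent and target/random region."""
    n = len(motif_present)
    joint = Counter(zip(motif_present, is_peak))
    motif_counts = Counter(motif_present)
    peak_counts = Counter(is_peak)
    mi = 0.0
    for (m, p), count in joint.items():
        p_joint = count / n
        p_indep = (motif_counts[m] / n) * (peak_counts[p] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

# Motif present exactly in the target regions: maximally informative
print(mutual_information([1, 1, 0, 0], [1, 1, 0, 0]))
# Motif equally frequent in target and random regions: not informative
print(mutual_information([1, 0, 1, 0], [1, 1, 0, 0]))
```

A motif that is depleted in peaks also gets a high value, so the direction of the association has to be read from the counts.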
Motif Search Algorithm
Each k-mer is scored by mutual information (MI) and ranked from highly informative (MI=0.081, MI=0.045, MI=0.040, ...) to not informative.
k-mers shown: CTCATCG TCATCGC AAAATTT GATGAGC AAAAATT ATGAGCT TTGCCAC TGCCACC ATCTCAT ACGCGCG CGACGCG TACGCTA ACCCCCT CCACGGC TTCAAAA AGACGCG CGAGAGC CTTATTA
Enrichment Depletion Motif co-occurrence analysis Discovered Motifs FIRE automatically compares discovered motifs to known motifs in TRANSFAC and JASPAR
5. A few interesting papers
First ChIP-seq paper
Epigenetic modifications at enhancer regions
Chromatin states
Nucleosome localization
Whole-genome nucleosome location mapping in B cells Yanwen Jiang, PhD Principal Component Analysis of Nucleosome profiles