ChIP-seq Olivier Elemento, PhD TA: Jenny Giannopoulou, PhD Institute for Computational Biomedicine CSHL High Throughput Data Analysis Workshop, June 2012.


Plan
1. ChIP-seq
2. Quality control of ChIP-seq data
3. ChIP-seq peak detection
4. Peak analysis and interpretation
5. A few interesting ChIP-seq papers

1. ChIP-seq

ChIP-seq [Figure: an antibody targets the transcription factor of interest (or a histone modification); the immunoprecipitated DNA is sequenced on Illumina]

Control: input DNA, sequenced on Illumina. IgG can be used as an additional control.

ChIP-seq methodology
- Identify a ChIP-grade antibody and determine its specificity (Western blot, histone peptide array)
- Optimize conditions using single-locus ChIP-PCR (positive and negative controls)
- Sequence the ChIP product using 1 Illumina lane per sample (no TruSeq ChIP-seq), single end
- Sequence input/IgG as control
[Figure: assessing the specificity of a commercial H3K9me3 antibody using histone peptide arrays, K. Bunting & B. Swed, WCMC. Antibody: Abcam H3K9me3 rabbit polyclonal (ab8898)]


[Figure: a double-stranded ChIP fragment]
ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGA ACTGATTAGTGAATTC
TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTT GACTAATCACTTAAG
Average length ~170 bp; 40-100 bp

BWA tutorial (for aligning single-end reads to the genome)
Get the genome, e.g., from UCSC
- http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
Combine into 1 file
- tar zvfx chromFa.tar.gz
- cat *.fa > wg.fa
Index the genome
- bwa index -p hg19bwaidx -a bwtsw wg.fa
Align ChIP reads to the reference genome
- bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa
Convert to SAM format
- bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam
Align input reads to the same reference genome
- bwa aln -t 4 hg19bwaidx s_4_sequence.txt.gz > s_4_sequence.txt.bwa
Convert to SAM format
- bwa samse hg19bwaidx s_4_sequence.txt.bwa s_4_sequence.txt.gz > s_4_sequence.txt.sam

Reads can map to multiple locations/chromosomes [Figure: Read 1 and Read 2 aligned to the reference human genome (hg18)]

Reads map to one strand or the other [Figure: Read 1 and Read 2 on opposite strands of hg18]

SAM format
DH1608P1_0130:6:1103:10579:166379#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG eb`XXYbZdadee^ceV]X][ccTcc^ebeeceeeeWbeeeeeeeceeaee XX:Z:NM_017871,32 NM:i:0 MD:Z:51
DH1608P1_0130:6:1102:3415:150915#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGGGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBac]bbbceedaeddeZceeea_ba_\_eeeeeeedaeeee XX:Z:NM_017871,32 NM:i:1 MD:Z:5T45
DH1608P1_0130:6:1102:13118:62644#TTAGGC 16 chr1 1249828 37 51M * 0 0 GGGCGTGCCTCGGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBBBBBBBBBBB`XTbSa`cffegdggeccbeeffdeggggg XX:Z:NM_017871,32 NM:i:2 MD:Z:7A3T39
DH1608P1_0130:6:1203:3012:157120#TTAGGC 16 chr1 1249826 25 51M * 0 0 AAGGCCGTGACTCTGATCTCAGCCCTCGTCTCCGCCGCGCTCCCGGACCCG BBBBBBBB^`QWZZ]UXYSZSTFRU]Z__SO[adcc[acdV\`Y]YWY][_ XX:Z:NM_017871,34 NM:i:3 MD:Z:4G17G1A26
DH1608P1_0130:6:2206:4445:12756#TTAGGC 16 chr1 1246336 25 1M3487N50M * 0 0 CCAAAGGGTGTGACTCTGATCTCGGGCATCGTCTCCGCCGCGCTCCCGGAC BBBBBBBBBBBBBBBBBBBBBBBB`YdddYdc\cacaNddddcdddaeeee XX:Z:NM_017871,37 NM:i:3 MD:Z:2C5C14A27
DH1608P1_0130:6:2203:7903:43788#TTAGGC 16 chr1 1246336 37 1M3487N50M * 0 0 CCCAAGGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGAC adbe[fbcbccb_cb^cb^^c^edgegggggdfggefffgfbfggggegeg XX:Z:NM_017871,37 NM:i:0 MD:Z:51
- MD tag, e.g. MD:Z:4T46 = 4 matches, 1 mismatch (T is the reference base at the mismatched position), 46 matches
- CIGAR string, e.g. 5M3487N46M = 5 bp aligned block, 3487 bp skipped on the reference (e.g. an intron), 46 bp aligned block
- XT tag, e.g. XT:A:U = unique mapper; XT:A:R = repeat (more than 1 high-scoring match)
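The CIGAR and MD fields above can be decoded with a few lines of code. The following is a minimal Python sketch, not part of the original tutorial; the function names `parse_cigar`, `reference_span` and `parse_md` are hypothetical helpers introduced here for illustration.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def reference_span(cigar):
    """Number of reference bases between the first and last aligned base."""
    # M, D, N, = and X consume the reference; I, S, H and P do not.
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

def parse_md(md):
    """Split an MD value (the text after 'MD:Z:') into match runs and events."""
    tokens = []
    for run, deletion, mismatch in re.findall(r"(\d+)|(\^[A-Z]+)|([A-Z])", md):
        if run:
            tokens.append(int(run))                 # run of matching bases
        elif deletion:
            tokens.append(("del", deletion[1:]))    # bases deleted from the read
        else:
            tokens.append(("mismatch", mismatch))   # one mismatched position
    return tokens

print(parse_cigar("1M3487N50M"))
print(reference_span("1M3487N50M"))
print(parse_md("4G17G1A26"))
```

For the fifth record above, the 51 nt read spans 3538 bp of reference because of the 3487 bp skipped region.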

2. Quality Control

Clonal reads (figure source: http://biowhat.ucsd.edu/homer/chipseq/qc.html)

Fragment size analysis

Fragment size analysis using opposite strand autocorrelation (figure source: http://biowhat.ucsd.edu/homer/chipseq/qc.html)
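The idea behind opposite-strand correlation can be illustrated with a toy calculation: reads from the two ends of a fragment map to opposite strands, so the shift that best aligns plus-strand 5' ends with minus-strand 5' ends approximates the fragment length. This is a simplified Python sketch under that assumption, not HOMER's implementation; `estimate_fragment_size` and its parameters are hypothetical names.

```python
from collections import Counter

def estimate_fragment_size(plus_starts, minus_starts, max_shift=400):
    """Return the shift (bp) that maximizes plus/minus 5'-end co-occurrence.

    plus_starts:  5' positions of reads mapped to the + strand
    minus_starts: 5' positions of reads mapped to the - strand
    """
    minus_counts = Counter(minus_starts)
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        # Count plus-strand reads that have a minus-strand read `shift` bp downstream.
        score = sum(minus_counts[p + shift] for p in plus_starts)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift

# Toy data: four fragments whose two 5' ends are 169 bp apart (170 bp fragments).
plus = [100, 250, 900, 4000]
minus = [p + 169 for p in plus]
print(estimate_fragment_size(plus, minus))
```

On real data the score would be computed per chromosome over millions of reads, and the peak of the resulting curve is usually broad rather than a single value.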

Fragment size analysis (figure source: http://biowhat.ucsd.edu/homer/chipseq/qc.html)

GC-content analysis (figure source: http://biowhat.ucsd.edu/homer/chipseq/qc.html)

Other QC measures
- Number of peaks: 0 or very few peaks, even at permissive peak-calling thresholds = bad experiment
- Motif enrichment: is the expected motif enriched in peaks?

3. ChIP-seq peak calling

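The Poisson distribution (MACS)
- λ = expected # of reads within an interval of 2d bp
- Estimate d based on high-quality peaks
# in R, P(X>=5|λ=0.001) is 1-sum(dpois(0:4, 0.001)); for tail probabilities this small, ppois(4, 0.001, lower.tail = FALSE) computes the same quantity without the precision lost by subtracting from 1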

BayesPeak

BayesPeak (Bayesian Hidden Markov Models): parameters are estimated using a Bayesian treatment [Figure labels: observed variable, hidden states]

BayesPeak

Peak detection using ChIPseeqer http://icb.med.cornell.edu/wiki/index.php/Elementolab/ChIPseeqer_Tutorial (Elemento and Giannopoulou, 2011)

A nice peak

Not all peaks are that nice

Peak detection
- Calculate the read count at each position (bp) in the genome (we don’t use a sliding window)
- Determine whether the read count is greater than expected (at each position, bp)

Peak detection
We need to correct for input DNA reads (control), which are:
- non-uniformly distributed (they form peaks too)
- present in vastly different numbers between ChIP and input

[Figure: read count and expected read count along the genome]
Expected read count = total number of reads * extended fragment length / chr length
Fragment length: use the Bioanalyzer (remove adapter lengths)

Is the observed read count at a given genomic position greater than expected?
- x = observed read count
- λ = expected read count
Read counts follow a Poisson distribution [Figure: the Poisson distribution, frequency vs. read count]

Is the observed read count at a given genomic position greater than expected?
- x = 10 reads (observed)
- λ = 0.5 reads (expected)
The Poisson distribution gives:
- P(X>=10) = 1.7 x 10^-10
- log10 P(X>=10) = -9.77
- -log10 P(X>=10) = 9.77
# in R, P(X>=10|λ=0.5) is 1-sum(dpois(0:9, 0.5))
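The same tail probability can be reproduced outside R. Below is a minimal Python sketch that sums the upper tail directly instead of computing 1 minus the lower tail, which avoids cancellation when the p-value is tiny. The function name `poisson_upper_tail` and the fixed number of summed terms are assumptions made for this illustration; the truncation is adequate when λ is small relative to the number of terms.

```python
import math

def poisson_upper_tail(k, lam, terms=200):
    """P(X >= k) for X ~ Poisson(lam), lam > 0, summed over the upper tail."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    # First term P(X = k), computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    for i in range(k, k + terms):
        total += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return total

p = poisson_upper_tail(10, 0.5)
print(p, -math.log10(p))
```

With x = 10 and λ = 0.5 this gives a -log10 p-value of about 9.77, matching the slide.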

[Figure tracks: read count, expected read count, -log(p)]
Expected read count = total number of reads * extended fragment length / chr length

[Figure tracks: read count, expected read count, input reads, -log(p)]
Expected read count = total number of reads * extended fragment length / chr length

[Figure: along genome positions (bp), read count, expected read count and -log(P) tracks for ChIP (P_c) and input (P_i); the difference [-log(P_c)] - [-log(P_i)] is compared with a threshold]

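Normalized peak score (at each bp)
R = -log10 [ P(X_ChIP) / P(X_input) ] = [-log10 P(X_ChIP)] - [-log10 P(X_input)]
- Detects peaks with high read counts in ChIP and low read counts in input
- Also works when there is no input DNA (x = 0): P(X_input >= 0) = 1, so the input term is 0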

Mappability

Non-mappable fraction of the genome
chromosome  non-unique bp / chromosome length  non-unique fraction
chr18       9369067/76117153                   0.123087459668913 (=12%)
chr2        33849240/242951149                 0.139325292921335
chr3        27854877/199501827                 0.139622164963933
chr4        27090014/191273063                 0.141630052737745
chr6        24330283/170899992                 0.142365618132972
chr8        20932821/146274826                 0.143106107677065
chr5        26029902/180857866                 0.143924633059643
chr12       19382853/132349534                 0.14645199279659
chr11       20039443/134452384                 0.149044906485258
chr20       10017788/62435964                  0.160449000194824
chr7        26182588/158821424                 0.164855517225434
chr10       22968951/135374737                 0.169669404417753
chr17       14496284/78774742                  0.184021980040252
chrX        31269270/154913754                 0.201849540099583
chr1        55186693/247249719                 0.223202247602959
chr13       28668063/114142980                 0.251159230291692
chr16       23552340/88827254                  0.265147676410215
chr14       29689825/106368585                 0.279122120502026
chrM        4628/16571                         0.279283084907368
chr9        43125838/140273252                 0.307441635415995
chr19       20251255/63811651                  0.317359834491667
chr15       31877970/100338915                 0.317702957023205
chr21       16867677/46944323                  0.359312392256674
chr22       21176578/49691432                  0.426161556382597
chrY        43209644/57772954                  0.747921665906161 (≈75%)
We enumerated all 30-mers, counted the # of occurrences of each, and calculated the non-unique fraction of the genome.
Unique/mappable fraction = 1 - non-unique fraction
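The k-mer enumeration behind these numbers can be sketched on a toy sequence. This Python example is an illustration only: it counts k-mers on the forward strand of a single sequence held in memory, whereas a genome-scale calculation would need both strands and a far more memory-efficient approach. The function name `non_unique_fraction` is a hypothetical helper.

```python
from collections import Counter

def non_unique_fraction(sequence, k=30):
    """Fraction of k-mer start positions whose k-mer occurs more than once."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    non_unique = sum(1 for kmer in kmers if counts[kmer] > 1)
    return non_unique / len(kmers)

# Toy example with k = 4: ACGT occurs twice among the 7 four-mers.
frac = non_unique_fraction("ACGTACGTTT", k=4)
print(frac, 1 - frac)  # non-unique fraction, mappable fraction
```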

[Figure tracks: read count, expected read count, -log(p)]
Expected read count = total number of reads * extended fragment length / (chr length * mappable fraction)
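As a worked example of the mappability correction, the Python sketch below plugs the chr18 values from the table (length 76117153, non-unique fraction 0.123087459668913) into the formula. The read total and fragment length are made-up illustrative values, and the function name `expected_read_count` is hypothetical.

```python
def expected_read_count(total_reads, fragment_length, chr_length, mappable_fraction=1.0):
    """Expected reads covering one position, corrected for mappability."""
    return total_reads * fragment_length / (chr_length * mappable_fraction)

CHR18_LENGTH = 76117153
CHR18_MAPPABLE = 1 - 0.123087459668913  # unique fraction = 1 - non-unique fraction

# Assumed values for illustration: 1,000,000 reads on chr18, 170 bp fragments.
uncorrected = expected_read_count(1_000_000, 170, CHR18_LENGTH)
corrected = expected_read_count(1_000_000, 170, CHR18_LENGTH, CHR18_MAPPABLE)
print(uncorrected, corrected)
```

Because reads can only land in the mappable part of the chromosome, the corrected expectation is higher than the uncorrected one, which makes peak calls slightly more conservative.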

Peak detection
- Determine all genomic regions with R>=15
- Merge peaks separated by less than 100 bp
- Output all peaks with length >= 100 bp
- Processes 23M reads in <5 min
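The three steps above (threshold, merge, length filter) can be sketched as follows. This is a simplified Python illustration operating on a list of per-bp scores for one chromosome, not the ChIPseeqer implementation; `call_peaks` and its parameter names are hypothetical.

```python
def call_peaks(scores, threshold=15.0, max_gap=100, min_length=100):
    """Call peaks from per-bp normalized scores R (list index = position).

    Returns half-open (start, end) intervals.
    """
    # 1. Collect maximal runs of positions with R >= threshold.
    regions = []
    start = None
    for pos, r in enumerate(scores):
        if r >= threshold and start is None:
            start = pos
        elif r < threshold and start is not None:
            regions.append([start, pos])
            start = None
    if start is not None:
        regions.append([start, len(scores)])

    # 2. Merge regions separated by less than max_gap bp.
    merged = []
    for region in regions:
        if merged and region[0] - merged[-1][1] < max_gap:
            merged[-1][1] = region[1]
        else:
            merged.append(region)

    # 3. Keep only peaks of at least min_length bp.
    return [(s, e) for s, e in merged if e - s >= min_length]

# Two 60 bp regions 30 bp apart merge into one peak; an isolated 40 bp region is dropped.
toy = [0] * 50 + [20] * 60 + [0] * 30 + [20] * 60 + [0] * 200 + [20] * 40 + [0] * 10
print(call_peaks(toy))
```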

BCL6 ChIP-seq
- Lymphoma cell line (OCI-Ly1)
- Illumina GA2x: 6 lanes for ChIP, 1 for input DNA, 1 for QC
- 36 nt long sequences
- 32 million reads
- Aligned/mapped to hg18 with BWA
- With the Melnick lab at WCMC

[Figure tracks: ChIP reads, input reads, detected peaks]
BCL6: 18,814 peaks; 80% are within 20 kb of a known gene

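Loading peaks into GRange
library(IRanges)  # provides IRanges() and RangedData()
# Split the ChIP and input SAM files into CHIP/ and INPUT/, then call peaks
system("split_samfile s_1_sequence.txt.sam -outdir CHIP/")
system("split_samfile s_2_sequence.txt.sam -outdir INPUT/")
system("ChIPseeqer.bin -chipdir CHIP -inputdir INPUT -t 15 -fold 2 -outfile peaks.txt")
# dataFolder must be defined beforehand and end with "/"
tpeaks = read.table(paste(dataFolder, "peaks.txt", sep = ""), header = F)
# Columns used: 1 = chromosome, 2 = start, 3 = end, 5 = score, 6 = summit
peaks = RangedData(ranges = IRanges(start = tpeaks[, 2], end = tpeaks[, 3]), space = tpeaks[, 1], summit = tpeaks[, 6], score = tpeaks[, 5])...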

Other peak finders

Promoter-based analysis (not peak-based)
For all TSS: take the maximum peak height (h1, h2, h3, h4, h5, …) in the 2 kb promoter
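A promoter-based score of this kind can be computed directly from a per-bp coverage track. The Python sketch below assumes that the "2 kb promoter" means the 2 kb immediately upstream of the TSS, taking strand into account; the slide does not specify the window, so this is an assumption, and `max_promoter_height` is a hypothetical name.

```python
def max_promoter_height(coverage, tss, strand="+", promoter_length=2000):
    """Maximum read coverage in the promoter_length bp upstream of a TSS.

    coverage: per-bp read counts for one chromosome (list index = position)
    """
    if strand == "+":
        start, end = tss - promoter_length, tss          # upstream = lower coordinates
    else:
        start, end = tss + 1, tss + 1 + promoter_length  # upstream = higher coordinates
    start, end = max(start, 0), min(end, len(coverage))
    window = coverage[start:end]
    return max(window) if window else 0

# Toy coverage track with two pile-ups on either side of a TSS at position 3000.
cov = [0] * 5000
cov[1500] = 7
cov[3500] = 9
print(max_promoter_height(cov, 3000, "+"), max_promoter_height(cov, 3000, "-"))
```

Unlike peak calling, this assigns every gene a value, which is convenient when comparing binding with expression across all genes.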

4. Peak analysis and interpretation

Gene-based peak annotation

Integration of multiple peak lists: RangedData in R

Conservation analysis
Source: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/phastCons17way/
fixedStep chrom=chr1 start=516994 step=1
0.005 0.009 0.023 0.036 0.048 0.059 0.068 0.077 0.084 0.091 0.097 0.102 0.107 0.110 0.113 0.115 0.116
fixedStep chrom=chr1 start=517900 step=1
0.114 0.112 0.109 0.105 0.101...
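The fixedStep wiggle format shown above is simple to read: each header line sets the chromosome, start and step, and every following line is one score. The Python sketch below parses such lines and averages the conservation score over a region, e.g. a peak. It assumes one value per line, as in the downloaded files; `parse_fixed_step` and `mean_conservation` are hypothetical helper names.

```python
def parse_fixed_step(lines):
    """Yield (chrom, position, score) from fixedStep wiggle lines (1-based positions)."""
    chrom, pos, step = None, None, 1
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # Header fields look like chrom=chr1 start=516994 step=1
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields.get("step", 1))
        else:
            yield chrom, pos, float(line)
            pos += step

def mean_conservation(lines, chrom, start, end):
    """Average score over chrom:start-end (inclusive); None if no data."""
    values = [s for c, p, s in parse_fixed_step(lines) if c == chrom and start <= p <= end]
    return sum(values) / len(values) if values else None

wig = [
    "fixedStep chrom=chr1 start=516994 step=1",
    "0.005", "0.009", "0.023",
    "fixedStep chrom=chr1 start=517900 step=1",
    "0.114", "0.112",
]
print(mean_conservation(wig, "chr1", 516994, 516996))
```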

What is the cis-regulatory code of each factor? Do they require any co-factors? [Figure: factors bound to DNA, activation vs. repression]

http://meme.nbcr.net/meme4_6_1/intro.html

Discovering regulatory sequences associated with peak regions
- Target regions: true TF binding peak? Yes
- Random regions: true TF binding peak? No
- For each motif, tabulate motif (absent / present) against true TF peak (no / yes)
- Motif correlation is quantified using the mutual information
[Table values as extracted: 0.40 0.10 0.33 / 0.10 0.40 0.00]

Motif Search Algorithm
k-mer     MI
CTCATCG   0.0618   (highly informative)
TCATCGC   0.0485
AAAATTT   0.0438
GATGAGC   0.0434
AAAAATT   0.0383
ATGAGCT   0.0334
TTGCCAC   0.0322
TGCCACC   0.0298
ATCTCAT   0.0265
...
ACGCGCG   0.0018
CGACGCG   0.0012
TACGCTA   0.0011
ACCCCCT   0.0010
CCACGGC   0.0009
TTCAAAA   0.0005
AGACGCG   0.0004
CGAGAGC   0.0003
CTTATTA   0.0002   (not informative)
[Figure: motifs with MI=0.081, MI=0.045, MI=0.040]
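The ranking step can be illustrated with a short calculation: for each k-mer, record whether it is present in each sequence, and compute the mutual information between presence and the target/random label. The Python sketch below is a toy version of that idea and not the FIRE implementation (it ignores the reverse strand, motif optimization and significance testing); `mutual_information` and `rank_kmers` are hypothetical names.

```python
import math
from collections import Counter

def mutual_information(presence, labels):
    """Mutual information (bits) between two equal-length binary sequences."""
    n = len(labels)
    joint = Counter(zip(presence, labels))
    p_counts = Counter(presence)
    l_counts = Counter(labels)
    mi = 0.0
    for (p, l), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (p_counts[p] * l_counts[l]))
    return mi

def rank_kmers(sequences, labels, k=7):
    """Score every k-mer by MI between its presence and the label; best first."""
    kmers = {seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)}
    scored = [(mutual_information([km in seq for seq in sequences], labels), km)
              for km in kmers]
    return sorted(scored, reverse=True)

# Toy data: two target regions (label 1) and two random regions (label 0).
seqs = ["AACTCATCGAA", "TTCTCATCGTT", "GGGGGGGGGGG", "ACACACACACA"]
labels = [1, 1, 0, 0]
print(rank_kmers(seqs, labels)[0])
```

In this toy set, CTCATCG is the only 7-mer present in both target sequences and absent from both random ones, so it ranks first.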

Motif co-occurrence analysis of discovered motifs (enrichment / depletion)
FIRE automatically compares discovered motifs to known motifs in TRANSFAC and JASPAR

5. A few interesting papers

First ChIP-seq paper

Epigenetic modifications at enhancer regions

Chromatin states

Nucleosome localization

Whole-genome nucleosome location mapping in B cells (Yanwen Jiang, PhD): Principal Component Analysis of nucleosome profiles
