Published by Cameron James. Modified over 9 years ago.
1
Introduction to the ChIP-seq pipeline
Shigeki Nakagome
November 16th, 2015
Di Rienzo lab meeting
2
ChIP-seq
Park (2009) Nat Rev Genet
Step 1: Chromatin immunoprecipitation (IP)
▸ A target protein (e.g. VDR) binds to DNA in an open chromatin region
▸ Sonicate open chromatin regions
▸ Capture VDR bound to DNA fragments with a VDR antibody
▸ IPed DNA fragments are enriched for genomic regions bound by VDR
Step 2: Next-generation sequencing
▸ Make a library from the IPed DNA and sequence it on an Illumina platform (single-end reads; 50 bp)
3
A workflow of processing sequenced data
Leung et al. (2015) Nature
1. Mapping sequence reads
2. Checking the quality of IP
▸ Identifying TF binding sites
3. Calling peaks (using MACS2)
4. Annotating peaks (e.g. genomic context or the closest gene; using HOMER)
▸ Identifying SNPs associated with allelic imbalance
3. Correct mapping bias (using WASP)
4. Calling genotypes and testing allelic imbalance (using QuASAR)
5
1. Mapping sequence reads
i) Map sequence reads to the human reference genome using BWA
▸ Aligning sequence reads:
    bwa aln -n 2 -o 0 REFERENCE.fa SEQUENCED.fastq > SEQUENCED.fastq.sai
▸ Generating a SAM format file:
    bwa samse REFERENCE.fa SEQUENCED.fastq.sai SEQUENCED.fastq > SEQUENCED.fastq.sam
ii) Choose uniquely mapped reads based on the "XT:A:U" tag and the mapping quality
▸ Extracting uniquely mapped sequence reads based on the tag "XT:A:U" (header lines, which start with "@", are kept so that samtools can read the result):
    grep -E "^@|XT:A:U" SEQUENCED.fastq.sam > SEQUENCED.fastq.sam.tmp
▸ Filtering sequence reads with a mapping quality ≥ 30 and dropping unmapped reads:
    samtools view -bhS -q 30 -F 4 -o SEQUENCED.fastq.bam SEQUENCED.fastq.sam.tmp
iii) Remove PCR duplicates, i.e. sequence reads that have identical coordinates
▸ Running Picard:
    /group/../java -jar /group/../picard.jar MarkDuplicates INPUT=SEQUENCED.fastq.bam OUTPUT=SEQUENCED.fastq.bam.picard METRICS_FILE=rmdup.out REMOVE_DUPLICATES=true …
iv) Use SEQUENCED.fastq.bam.picard ("uniquely mapped" + "non-PCR duplicates") in downstream analyses
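The filtering logic of steps ii) and iii) can be illustrated with a minimal Python sketch on toy SAM records. This is not how samtools or Picard are implemented: the records, the function names (`keep_read`, `filter_and_dedup`) and the duplicate key (chromosome, leftmost position, strand, keeping the first read seen) are assumptions made for illustration. Picard itself works from the unclipped 5' coordinate and keeps the highest-quality read.

```python
# Toy, tab-separated SAM records (hypothetical data for illustration only).
SAM_LINES = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t37\t50M\t*\t0\t0\t*\t*\tXT:A:U\tNM:i:0",
    "r2\t0\tchr1\t100\t37\t50M\t*\t0\t0\t*\t*\tXT:A:U\tNM:i:1",   # same start and strand as r1
    "r3\t16\tchr1\t100\t37\t50M\t*\t0\t0\t*\t*\tXT:A:U\tNM:i:0",  # reverse strand
    "r4\t0\tchr1\t300\t0\t50M\t*\t0\t0\t*\t*\tXT:A:R\tNM:i:0",    # maps to a repeat
    "r5\t0\tchr1\t400\t20\t50M\t*\t0\t0\t*\t*\tXT:A:U\tNM:i:2",   # low mapping quality
    "r6\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",                            # unmapped
]


def keep_read(fields, min_mapq=30):
    """Mimic `grep XT:A:U` plus `samtools view -q 30 -F 4` on one record."""
    flag = int(fields[1])
    mapq = int(fields[4])
    if flag & 4:          # -F 4: drop unmapped reads
        return False
    if mapq < min_mapq:   # -q 30: keep MAPQ >= 30
        return False
    return "XT:A:U" in fields[11:]  # BWA tag for uniquely mapped reads


def filter_and_dedup(lines, min_mapq=30):
    """Return names of reads that pass the filters, one per (chrom, pos, strand)."""
    seen = set()
    kept = []
    for line in lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.split("\t")
        if not keep_read(fields, min_mapq):
            continue
        is_reverse = bool(int(fields[1]) & 16)
        key = (fields[2], int(fields[3]), is_reverse)
        if key in seen:           # simplified duplicate rule: first read wins
            continue
        seen.add(key)
        kept.append(fields[0])
    return kept


print(filter_and_dedup(SAM_LINES))
```

In this toy input only r1 and r3 survive: r2 duplicates r1, and r4, r5 and r6 fail the uniqueness, quality and mapping filters respectively.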
6
2. Checking the quality of IP
Park (2009) Nat Rev Genet
▸ Only 50 bp of each IPed DNA fragment is sequenced from the 5′ end, so the alignment produces two peaks, one on the positive strand and one on the negative strand
▸ If the IP works, the densities (i.e. the numbers of sequence reads) of the two peaks are correlated and separated by a characteristic distance (i.e. the length of each fragment)
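The idea that the two strand-specific peaks are correlated at a shift equal to the fragment length can be sketched in plain Python. This is a deterministic toy, not the method used by any particular tool: the coverage profile, binding-site positions, fragment length of 150 bp and all function names are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) * (a - mean_x) for a in x)
    var_y = sum((b - mean_y) * (b - mean_y) for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def strand_cross_correlation(plus, minus, max_shift):
    """Correlate plus-strand counts at position i with minus-strand counts at i + shift."""
    n = len(plus)
    return {s: pearson(plus[:n - s], minus[s:]) for s in range(max_shift + 1)}


def toy_coverage(genome_len=2000, fragment_len=150, sites=(300, 700, 1250, 1600)):
    """Place the same small read pile-up on each strand, one fragment length apart."""
    profile = [1, 2, 3, 4, 3, 2, 1]  # hypothetical read counts around a peak
    half = fragment_len // 2
    plus = [0] * genome_len
    minus = [0] * genome_len
    for site in sites:
        for k, count in enumerate(profile):
            plus[site - half + k - 3] += count   # 5' ends of forward reads
            minus[site + half + k - 3] += count  # 5' ends of reverse reads
    return plus, minus


plus, minus = toy_coverage()
cc = strand_cross_correlation(plus, minus, max_shift=300)
best_shift = max(cc, key=cc.get)
print(best_shift, round(cc[best_shift], 3))
```

In this idealized input the cross-correlation is maximal at a shift equal to the simulated fragment length. Real data are noisier and also show a second, artefactual peak at the read length.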
7
2. Checking the quality of IP
▸ Compute a strand cross-correlation (SCC) plot using an R program:
    Rscript /group/../run_spp_nodups.R -c=SEQUENCED.fastq.bam.picard -savp -out=SEQUENCED.fastq.bam.picard.spp.out
▸ X-axis: strand shift (i.e. distance between the peaks of the positive and negative strands)
▸ Y-axis: cross-correlation (CC) between the densities of the two peaks
▸ There are two peaks: one is noise (the phantom peak, at a shift corresponding to the read length of 50 bp, with cross-correlation P_cc) and the other is the IPed ChIP-seq peak (with cross-correlation ChIP_cc)
▸ Two statistics are defined, where min_cc is the minimum cross-correlation:
    Normalized strand cross-correlation coefficient (NSC): ChIP_cc / min_cc
    Relative strand cross-correlation coefficient (RSC): (ChIP_cc - min_cc) / (P_cc - min_cc)
▸ According to the ENCODE project, "NSC > 1.05" and "RSC > 0.8" are the thresholds for good IPed data
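As a worked example of the two statistics, the following Python sketch evaluates them for one set of cross-correlation values and applies the thresholds quoted on the slide. The three input values are hypothetical, and the function names are illustrative rather than part of run_spp_nodups.R.

```python
def nsc(chip_cc, min_cc):
    """Normalized strand cross-correlation coefficient: ChIP_cc / min_cc."""
    return chip_cc / min_cc


def rsc(chip_cc, phantom_cc, min_cc):
    """Relative strand cross-correlation coefficient: (ChIP_cc - min_cc) / (P_cc - min_cc)."""
    return (chip_cc - min_cc) / (phantom_cc - min_cc)


def passes_thresholds(chip_cc, phantom_cc, min_cc, nsc_min=1.05, rsc_min=0.8):
    """Apply the thresholds quoted on the slide (NSC > 1.05 and RSC > 0.8)."""
    return nsc(chip_cc, min_cc) > nsc_min and rsc(chip_cc, phantom_cc, min_cc) > rsc_min


# Hypothetical cross-correlation values read off an SCC plot.
chip_cc, phantom_cc, min_cc = 0.30, 0.24, 0.20
print(nsc(chip_cc, min_cc), rsc(chip_cc, phantom_cc, min_cc))
print(passes_thresholds(chip_cc, phantom_cc, min_cc))
```

With these inputs NSC is about 1.5 and RSC about 2.5, so the sample would pass both thresholds. A sample whose ChIP-seq peak barely rises above the background (e.g. ChIP_cc = 0.205 with the same other values) would fail the NSC criterion.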
8
3. Calling peaks (using MACS2)
Chr   Start     End       Length  -log10 p-value  Fold enrichment  -log10 q-value
chr1  13917095  13917284  190     25.95897        12.36291         20.51727
chr3  48264238  48264477  240     98.21003        31.35001         90.15012
9
4. Annotating peaks (e.g. genomic context or the closest gene; using HOMER)
▸ Convert the .bed file into a .peak file using bed2pos.pl packaged in HOMER:
    bed2pos.pl out_file_macs2_summits.bed > out_file_macs2_summits.peak
▸ Run findMotifsGenome.pl to find motifs in the peaks called by MACS2:
    findMotifsGenome.pl out_file_macs2_summits.peak hg19 out_file_homer -size 100 -len 8,10,12,14,16
▸ Run annotatePeaks.pl to annotate the peaks:
    annotatePeaks.pl out_file_macs2_summits.peak hg19 -size -100,100 -m homer_top10.motif > out_file_macs2_summits
▸ The output file includes the information on:
Peak ID  Chr   Start      End        Strand  Peak score  …  Detailed Annotation       Distance to TSS  …  Gene Name  …
XXX      chr3  48264247   48264447   +       90.15012    …  promoter-TSS (NM_004345)  -490             …  CAMP       …
YYY      chr5  139986717  139986917  +       49.46125    …  L1MB4|LINE|L1             26218            …  CD14       …
▸ Motifs found in the peaks:
Rank  P-value  Log(P-value)  % of Targets  Best Match/Details
1     1e-246   -5.687e+02    52.47%        MA0074.1_RXRA::VDR/Jaspar
2     1e-33    -7.605e+01    4.93%         VDR(NR),DR3/GM10855-VDR+vitD-ChIP-Seq(GSE22484)/Homer
3     1e-30    -6.924e+01    28.25%        MF0004.1_Nuclear_Receptor_class/Jaspar
11
3. Correct mapping bias (using WASP)
van de Geijn et al. (2015) Nature Methods
▸ WASP is a program to carefully map allele-specific reads, correct for incorrect heterozygous genotype calls, and model overdispersion of sequencing reads
▸ Figure: the algorithm implemented in WASP to overcome the mapping bias that favors reads carrying the reference allele
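The core of the WASP mapping filter is to flip the allele of each read that overlaps a SNP, remap it, and keep the read only if it maps back to the same location. The following Python sketch shows that idea on a toy reference. It is not WASP's code: the brute-force mapper, the 36 bp reference, the SNP table and the function names are all assumptions, and reads overlapping several SNPs are handled one SNP at a time rather than through all allele combinations.

```python
# Toy reference: two near-identical 8 bp segments (positions 0-7 and 14-21)
# that differ at one base, plus a unique segment at positions 28-35.
REFERENCE = "ACGTTGCA" + "GGGGGG" + "ACGTAGCA" + "CCCCCC" + "TCAGGTCT"

# Hypothetical SNPs: 0-based position -> (reference allele, alternative allele).
SNPS = {4: ("T", "A"), 31: ("G", "C")}


def map_read(read, reference, max_mismatches=1):
    """Brute-force mapper: unique best position with few mismatches, else None."""
    best_pos, best_mm, tie = None, None, False
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if best_mm is None or mm < best_mm:
            best_pos, best_mm, tie = pos, mm, False
        elif mm == best_mm:
            tie = True
    if best_mm is None or best_mm > max_mismatches or tie:
        return None
    return best_pos


def passes_remap_filter(read, reference, snps):
    """Keep a read only if every allele-flipped version maps to the same place."""
    pos = map_read(read, reference)
    if pos is None:
        return False
    for snp_pos, (ref, alt) in snps.items():
        offset = snp_pos - pos
        if not 0 <= offset < len(read):
            continue  # read does not overlap this SNP
        base = read[offset]
        if base == ref:
            other = alt
        elif base == alt:
            other = ref
        else:
            continue  # read matches neither allele at this SNP
        flipped = read[:offset] + other + read[offset + 1:]
        if map_read(flipped, reference) != pos:
            return False
    return True


print(passes_remap_filter("ACGTTGCA", REFERENCE, SNPS))  # read over the SNP at position 4
print(passes_remap_filter("TCAGGTCT", REFERENCE, SNPS))  # read over the SNP at position 31
```

The first read is discarded because its alternative-allele version maps perfectly to the paralogous segment at position 14, so only reference-allele reads would ever be counted at that SNP. The second read maps to position 28 with either allele and is kept.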
12
4. Calling genotypes and testing allelic imbalance (using QuASAR)
▸ Using the samtools mpileup command, create a pileup file from aligned reads:
    samtools mpileup -f /group/../hg19_all_contigs.fa -l /group/../1KG_SNPs_filt.bed /group/../input.bam | gzip > input.pileup.gz
▸ Convert the pileup file into bed format and use intersectBed to include the allele frequencies from a bed file:
    less input.pileup.gz |
      awk -v OFS='\t' '{ if ($4>0 && $5 !~ /[^\^][<>]/ && $5 !~ /\+[0-9]+[ACGTNacgtn]+/ && $5 !~ /-[0-9]+[ACGTNacgtn]+/ && $5 !~ /[^\^]\*/) print $1,$2-1,$2,$3,$4,$5,$6}' |
      sortBed -i stdin |
      intersectBed -a stdin -b /group/../1KG_SNPs_filt.bed -wo |
      cut -f 1-7,11-14 |
      gzip > input.pileup.bed.gz
▸ Generate an input file for QuASAR:
    R --vanilla --args input.pileup.bed.gz < /group/../convertPileupToQuasar.R
Chr   Start    End      Ref  Alt  SNP ID      Freq  #ref  #alt  #not mapped to either allele
chr1  1498376  1498377  C    T    rs11260611  0.12  4     0     0
chr1  5348913  5348914  C    T    rs12124941  0.43  2     1     0
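The three count columns in the QuASAR input (#ref, #alt, #not mapped to either allele) are derived from the base string in the fifth pileup column. A minimal Python sketch of that counting step follows. It is an illustration rather than the logic of convertPileupToQuasar.R: the function name is hypothetical, and it assumes that lines containing indels, deletions or reference skips were already removed, as the awk filter above does.

```python
import re


def count_alleles(bases, alt):
    """Count reference, alternative and other bases in a pileup base string.

    In the samtools pileup format '.' and ',' mark reference matches on the
    forward and reverse strand, letters mark mismatches, '^' plus one
    character marks a read start, and '$' marks a read end.
    """
    cleaned = re.sub(r"\^.", "", bases)  # drop read-start marker and its quality character
    cleaned = cleaned.replace("$", "")   # drop read-end markers
    n_ref = cleaned.count(".") + cleaned.count(",")
    n_alt = cleaned.count(alt.upper()) + cleaned.count(alt.lower())
    n_other = len(cleaned) - n_ref - n_alt
    return n_ref, n_alt, n_other


# Hypothetical base strings for a SNP whose alternative allele is T or G.
print(count_alleles("..,^F.tT$,", "T"))
print(count_alleles(".a,G", "G"))
```

The first string contains five reference matches and two reads carrying T. The second contains one read (`a`) that matches neither allele and therefore lands in the third count.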
13
4. Calling genotypes and testing allelic imbalance (using QuASAR)
▸ THP1 treated by VD; FAIRE-seq data: rs3738668 (A:21/C:2), P-value = 0.0062
▸ Monocytes treated by VD; VDR ChIP-seq data: rs11784276 (T:5/C:2), P-value = 0.2492109
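For intuition about what testing allelic imbalance means, the read counts above can be put through a plain two-sided binomial test against a 50:50 expectation. This is a deliberately simpler stand-in, not QuASAR's model, so it does not reproduce the P-values on the slide. The function name is hypothetical.

```python
from math import comb


def binomial_two_sided(n_ref, n_alt):
    """Two-sided exact binomial test of n_ref vs n_alt reads against p = 0.5."""
    n = n_ref + n_alt
    if n == 0:
        return 1.0
    k = min(n_ref, n_alt)
    # With p = 0.5 the distribution is symmetric, so double the smaller tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(binomial_two_sided(21, 2))  # counts for rs3738668 (A:21/C:2)
print(binomial_two_sided(5, 2))   # counts for rs11784276 (T:5/C:2)
```

The naive test gives roughly 6.6e-05 for 21 vs 2 reads and about 0.45 for 5 vs 2 reads. The ranking of the two SNPs agrees with the slide, but the values differ from the reported 0.0062 and 0.2492109 because QuASAR uses a different statistical model.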