Introduction of the ChIP-seq pipeline
Shigeki Nakagome
November 16th, 2015
Di Rienzo lab meeting


ChIP-seq (Park (2009) Nat Rev Genet)

Step 1: Chromatin immunoprecipitation (IP)
▸ A target protein (e.g. VDR) binds to DNA in an open chromatin region
▸ Sonicate open chromatin regions
▸ Capture VDR bound to DNA fragments with a VDR antibody
▸ IPed DNA fragments are enriched for genomic regions bound by VDR

Step 2: Next generation sequencing
▸ Make a library from the IPed DNA and sequence it on an Illumina instrument (single-end reads; 50 bp)

A workflow of processing sequenced data

1. Mapping sequence reads
2. Checking the quality of IP

▸ Identifying TF binding sites
3. Calling peaks (using MACS2)
4. Annotating peaks (e.g. genomic context or the closest gene; using HOMER)

▸ Identifying SNPs associated with allelic imbalance (Leung et al. (2015) Nature)
3. Correct mapping bias (using WASP)
4. Calling genotypes and testing allelic imbalance (using QuASAR)


1. Mapping sequence reads

i) Map sequence reads to the human reference genome using BWA
▸ Aligning sequence reads:
    bwa aln -n 2 -o 0 REFERENCE.fa SEQUENCED.fastq > SEQUENCED.fastq.sai
▸ Generating a SAM format file:
    bwa samse REFERENCE.fa SEQUENCED.fastq.sai SEQUENCED.fastq > SEQUENCED.fastq.sam

ii) Choose uniquely mapped reads based on the BWA "XT" tag and the mapping quality
▸ Extracting uniquely mapped sequence reads based on the tag "XT:A:U" (header lines starting with "@" are kept so that samtools can still parse the file):
    grep -e "^@" -e "XT:A:U" SEQUENCED.fastq.sam > SEQUENCED.fastq.sam.tmp
▸ Dropping unmapped reads (-F 4) and keeping reads with a mapping quality of at least 30 (-q 30):
    samtools view -bhS -q 30 -F 4 -o SEQUENCED.fastq.bam SEQUENCED.fastq.sam.tmp

iii) Remove PCR duplicates, i.e. sequence reads with identical coordinates
▸ Running Picard:
    /group/../java -jar /group/../picard.jar MarkDuplicates INPUT=SEQUENCED.fastq.bam OUTPUT=SEQUENCED.fastq.bam.picard METRICS_FILE=rmdup.out REMOVE_DUPLICATES=true …

iv) Use SEQUENCED.fastq.bam.picard ("uniquely mapped" + "non-PCR duplicates") in downstream analyses
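
The read filter in step ii) can be sketched in plain Python to make the three criteria explicit: the read is mapped (FLAG bit 4 is unset), BWA tagged it as unique (XT:A:U), and its mapping quality is at least 30. This is only an illustration of the logic, not a replacement for grep and samtools; the function name and the toy SAM records are hypothetical.

```python
def keep_sam_line(line, min_mapq=30):
    """Return True for header lines and for uniquely mapped, high-MAPQ reads."""
    if line.startswith("@"):
        return True  # keep the header so downstream tools can parse the file
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])   # SAM column 2: bitwise FLAG
    mapq = int(fields[4])   # SAM column 5: mapping quality
    is_unmapped = bool(flag & 4)
    is_unique = "XT:A:U" in fields[11:]  # optional tags start at column 12
    return (not is_unmapped) and is_unique and mapq >= min_mapq


# Toy records (hypothetical): one header line and four reads.
toy_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U\tNM:i:0",
    "r2\t0\tchr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:R",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t16\tchr1\t300\t25\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U",
]

kept = [line for line in toy_sam if keep_sam_line(line)]
kept_names = [line.split("\t")[0] for line in kept]
```

Here r2 is dropped because it is a repeat hit (XT:A:R), r3 because it is unmapped, and r4 because its mapping quality is below 30.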

2. Checking the quality of IP (Park (2009) Nat Rev Genet)
▸ Only 50 bp of each IPed DNA fragment is sequenced from the 5' end, so the alignment produces two peaks, one on the positive strand and one on the negative strand
▸ If the IP works, the densities (i.e. the numbers of sequence reads) of the two peaks are correlated and separated by a characteristic distance (i.e. the fragment length)
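
The idea of correlating the two strand profiles at different shifts can be illustrated with a small, self-contained Python sketch. It builds toy per-base read counts (hypothetical data: a plus-strand peak and an identical minus-strand peak 150 bp downstream) and computes the Pearson correlation at each strand shift. It is a simplified illustration, not the method implemented in the R script used in this pipeline.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def strand_cross_correlation(plus, minus, max_shift):
    """Correlate plus-strand counts with minus-strand counts shifted left by s."""
    n = len(plus)
    return {s: pearson(plus[: n - s], minus[s:]) for s in range(max_shift + 1)}


def toy_profiles(length=300, plus_start=100, fragment=150):
    """Hypothetical data: the same peak shape on both strands, 'fragment' bp apart."""
    shape = [1, 3, 5, 3, 1]
    plus = [0] * length
    minus = [0] * length
    for i, count in enumerate(shape):
        plus[plus_start + i] = count
        minus[plus_start + fragment + i] = count
    return plus, minus


plus, minus = toy_profiles()
cc = strand_cross_correlation(plus, minus, max_shift=200)
best_shift = max(cc, key=cc.get)  # shift with the highest cross-correlation
```

In this toy case the correlation is maximal at the shift that equals the simulated fragment length, which is what a successful IP is expected to show.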

2. Checking the quality of IP (continued)
▸ Compute a strand cross-correlation (SCC) plot using an R script:
    Rscript /group/../run_spp_nodups.R -c=SEQUENCED.fastq.bam.picard -savp -out=SEQUENCED.fastq.bam.picard.spp.out
▸ X-axis: strand shift (i.e. distance between the peaks of positive and negative strands)
▸ Y-axis: cross-correlation (CC) between the densities of the two peaks
▸ There are two peaks: one is noise (the phantom peak, P_cc, at a shift corresponding to the read length of 50 bp) and the other is the IP peak (the ChIP-seq peak, ChIP_cc)
▸ Two statistics are defined, where min_cc is the minimum cross-correlation:
    Normalized strand cross-correlation coefficient (NSC) = ChIP_cc / min_cc
    Relative strand cross-correlation coefficient (RSC) = (ChIP_cc - min_cc) / (P_cc - min_cc)
▸ According to the ENCODE project, "NSC > 1.05" and "RSC > 0.8" are thresholds for good IP data
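
As a worked example of the two ratios defined above (ENCODE refers to them as NSC and RSC), the sketch below computes them from a cross-correlation profile. The function names, the profile values and the choice to pass the fragment-length shift explicitly are assumptions made for illustration; the R script reports these statistics itself.

```python
def nsc_rsc(cc_by_shift, read_length, fragment_shift):
    """Compute (NSC, RSC) from a dict mapping strand shift -> cross-correlation."""
    min_cc = min(cc_by_shift.values())
    phantom_cc = cc_by_shift[read_length]   # P_cc: peak at the read length
    chip_cc = cc_by_shift[fragment_shift]   # ChIP_cc: peak at the fragment length
    if min_cc == 0 or phantom_cc == min_cc:
        raise ValueError("ratios are undefined for this profile")
    nsc = chip_cc / min_cc
    rsc = (chip_cc - min_cc) / (phantom_cc - min_cc)
    return nsc, rsc


def passes_encode_thresholds(nsc, rsc):
    """Thresholds quoted on the slide: NSC > 1.05 and RSC > 0.8."""
    return nsc > 1.05 and rsc > 0.8


# Hypothetical profile: minimum 0.20, phantom peak 0.25, ChIP peak 0.30.
toy_cc = {0: 0.20, 50: 0.25, 150: 0.30, 400: 0.20}
nsc, rsc = nsc_rsc(toy_cc, read_length=50, fragment_shift=150)
good_ip = passes_encode_thresholds(nsc, rsc)
```

With these toy values NSC is about 1.5 and RSC is about 2.0, so the data would pass both thresholds.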

3. Calling peaks (using MACS2)
▸ Columns of the peak table: Chr | Start | End | Length | -log10 p-value | Fold enrichment | -log10 q-value
    chr… | … | … | … | … | … | …
    chr… | … | … | … | … | … | …
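
Because the peak table reports significance on a -log10 scale, it can help to convert the values back before applying a cutoff. The sketch below shows that arithmetic on two invented peaks; the dictionary keys, the values and the 0.01 cutoff are all hypothetical and are not taken from the slide.

```python
def q_value_from_neg_log10(neg_log10_q):
    """Convert a -log10(q-value) entry back to a q-value."""
    return 10.0 ** (-neg_log10_q)


def significant_peaks(peaks, max_q=0.01):
    """Keep peaks whose q-value is at or below max_q (cutoff is an assumption)."""
    return [p for p in peaks if q_value_from_neg_log10(p["neg_log10_q"]) <= max_q]


# Invented example rows, for illustration only.
toy_peaks = [
    {"chr": "chr1", "start": 1000, "end": 1200, "neg_log10_q": 3.0, "fold": 8.5},
    {"chr": "chr2", "start": 500, "end": 650, "neg_log10_q": 1.0, "fold": 2.1},
]

kept_peaks = significant_peaks(toy_peaks)
```

A -log10 q-value of 3 corresponds to q = 0.001, so only the first toy peak survives a q ≤ 0.01 cutoff.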

4. Annotating peaks (e.g. genomic context or the closest gene; using HOMER)
▸ Convert the .bed file into a .peak file using bed2pos.pl, packaged in HOMER:
    bed2pos.pl out_file_macs2_summits.bed > out_file_macs2_summits.peak
▸ Run findMotifsGenome.pl to find motifs in the peaks called by MACS2:
    findMotifsGenome.pl out_file_macs2_summits.peak hg19 out_file_homer -size 100 -len 8,10,12,14,16
▸ Run annotatePeaks.pl to annotate the peaks:
    annotatePeaks.pl out_file_macs2_summits.peak hg19 -size -100,100 -m homer_top10.motif > out_file_macs2_summits
▸ The output file includes the information on:
    Peak ID | Chr | Start | End | Strand | Peak score | … | Detailed Annotation | Distance to TSS | … | Gene Name | …
    XXX | chr… | … | promoter-TSS (NM_004345) | -490 | … | CAMP | …
    YYY | chr… | … | L1MB4|LINE|L… | … | CD14 | …
▸ Motif results:
    Rank | P-value | Log(P-value) | % of Targets | Best Match/Details
    1 | 1e-… | … | …% | MA0074.1_RXRA::VDR/Jaspar
    2 | 1e-… | … | …% | VDR(NR),DR3/GM VDR+vitD-ChIP-Seq(GSE22484)/Homer
    3 | 1e-… | … | …% | MF0004.1_Nuclear_Receptor_class/Jaspar


3. Correct mapping bias (using WASP)
▸ WASP is a program that carefully maps allele-specific reads, corrects for incorrect heterozygous genotype calls, and models overdispersion of sequencing reads (van de Geijn et al. (2015) Nature Methods)
▸ Figure: the algorithm implemented in WASP to overcome the mapping bias that favors reads carrying the reference allele
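
The re-mapping idea behind WASP can be sketched as follows: for every read that overlaps a SNP, generate versions of the read with the other allele(s), map them again, and keep the read only if every version maps to the same place as the original. The Python below is a minimal sketch of that idea, not WASP's code; the function names and the injected `mapper` callable (sequence in, position or None out) are hypothetical.

```python
from itertools import product


def allele_swapped_versions(seq, start, snps):
    """Return all versions of a read with SNP alleles swapped, excluding the original.

    snps maps a 0-based genomic position to a tuple of alleles, e.g. {101: ("C", "T")}.
    """
    overlapping = [
        (pos - start, alleles)
        for pos, alleles in sorted(snps.items())
        if start <= pos < start + len(seq)
    ]
    versions = set()
    for combo in product(*[alleles for _, alleles in overlapping]):
        bases = list(seq)
        for (offset, _), allele in zip(overlapping, combo):
            bases[offset] = allele
        versions.add("".join(bases))
    versions.discard(seq)
    return sorted(versions)


def passes_remap_filter(seq, start, snps, mapper):
    """Keep a read only if all allele-swapped versions map where the original maps."""
    original_position = mapper(seq)
    if original_position is None:
        return False
    return all(
        mapper(version) == original_position
        for version in allele_swapped_versions(seq, start, snps)
    )


# Toy mappers (hypothetical): one unbiased, one that only places the reference read.
toy_snps = {101: ("C", "T")}
unbiased_mapper = lambda s: 100
biased_mapper = lambda s: 100 if s == "ACGT" else None

kept_unbiased = passes_remap_filter("ACGT", 100, toy_snps, unbiased_mapper)
kept_biased = passes_remap_filter("ACGT", 100, toy_snps, biased_mapper)
```

The read is kept under the unbiased mapper and discarded under the biased one, because there the alternate-allele version fails to map to the original position.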

4. Calling genotypes and testing allelic imbalance (using QuASAR)
▸ Using the samtools mpileup command, create a pileup file from aligned reads:
    samtools mpileup -f /group/../hg19_all_contigs.fa -l /group/../1KG_SNPs_filt.bed /group/../input.bam | gzip > input.pileup.gz
▸ Convert the pileup file into bed format and use intersectBed to include the allele frequencies from a bed file:
    less input.pileup.gz | awk -v OFS='\t' '{ if ($4>0 && $5 !~ /[^\^][<>]/ && $5 !~ /\+[0-9]+[ACGTNacgtn]+/ && $5 !~ /-[0-9]+[ACGTNacgtn]+/ && $5 !~ /[^\^]\*/) print $1,$2-1,$2,$3,$4,$5,$6}' | sortBed -i stdin | intersectBed -a stdin -b /group/../1KG_SNPs_filt.bed -wo | cut -f 1-7,11-14 | gzip > input.pileup.bed.gz
▸ Generate an input file for QuASAR:
    R --vanilla --args input.pileup.bed.gz < /group/../convertPileupToQuasar.R
▸ Columns of the QuASAR input file: Chr | Start | End | Ref | Alt | SNP ID | Freq | #ref | #alt | #not mapped to either allele
    chr… | … | … | C | T | rs… | … | … | … | …
    chr… | … | … | C | T | rs… | … | … | … | …
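
To show where the #ref, #alt and "not mapped to either allele" counts come from, the sketch below tallies one pileup base string. In samtools pileup output, "." and "," denote matches to the reference, letters denote mismatches, "^" plus one character marks a read start, and "$" marks a read end. This is a simplified illustration with a hypothetical function name, not the logic of convertPileupToQuasar.R, and it assumes indel-containing lines were already removed by the awk filter.

```python
def count_alleles(pileup_bases, alt):
    """Count (reference, alternate, other) bases in one pileup base string."""
    ref_count = alt_count = other_count = 0
    i = 0
    while i < len(pileup_bases):
        base = pileup_bases[i]
        if base == "^":
            i += 2  # skip the read-start marker and its mapping-quality character
            continue
        if base == "$":
            i += 1  # skip the read-end marker
            continue
        if base in ".,":
            ref_count += 1
        elif base.upper() == alt.upper():
            alt_count += 1
        else:
            other_count += 1
        i += 1
    return ref_count, alt_count, other_count


# Toy pileup string (hypothetical): 4 reference bases, 2 alternate T, 1 other base.
toy_counts = count_alleles("..,^F.Tt$a", "T")
```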

4. Calling genotypes and testing allelic imbalance (using QuASAR) (continued)
▸ Datasets: THP1 treated by VD (FAIRE-seq data); monocytes treated by VD (VDR ChIP-seq data)
▸ Example SNPs with allelic imbalance: rs… (A:21/C:2), P-value = …; rs… (T:5/C:2), P-value = …
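
For intuition about what an allelic imbalance test measures, the allele counts quoted above (21 vs 2 and 5 vs 2) can be run through a plain two-sided exact binomial test against a 50:50 expectation. This is a deliberately simpler, swapped-in test: QuASAR uses its own model that also accounts for genotype uncertainty and overdispersion, so the values below are not the P-values from the slide. The function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(count_a, count_b):
    """Two-sided exact binomial P-value for a 50:50 allelic ratio."""
    n = count_a + count_b
    if n == 0:
        return 1.0
    probabilities = [comb(n, k) / 2 ** n for k in range(n + 1)]
    observed = probabilities[count_a]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one.
    total = sum(p for p in probabilities if p <= observed * (1 + 1e-9))
    return min(1.0, total)


p_strong = binomial_two_sided_p(21, 2)  # counts A:21 / C:2
p_weak = binomial_two_sided_p(5, 2)     # counts T:5 / C:2
```

Under this simple test the 21:2 split is far less compatible with a balanced ratio than the 5:2 split, which illustrates why read depth matters when calling imbalance.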