ChIP-Seq Data Processing and QC

Slides:



Advertisements
Similar presentations
IMGS 2012 Bioinformatics Workshop: RNA Seq using Galaxy
Advertisements

Visualising and Exploring BS-Seq Data
Processing of miRNA samples and primary data analysis
ChIP-seq analysis Ecole de bioinformatique AVIESAN – Roscoff, Jan 2013.
DNAseq analysis Bioinformatics Analysis Team
Simon v2.3 RNA-Seq Analysis Simon v2.3.
SOLiD Sequencing & Data
SeqMonk tools for methylation analysis
Bioinformatics Analysis Team McGill University and Genome Quebec Innovation Center
NGS data format and General Quality Control. Data format “Flowchart” Sequencer raw data FastqSAM/BAM.
Before we start: Align sequence reads to the reference genome
NGS Analysis Using Galaxy
Whole Exome Sequencing for Variant Discovery and Prioritisation
Expression Analysis of RNA-seq Data
Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING.
NGS data analysis CCM Seminar series Michael Liang:
Next Generation DNA Sequencing
RNA-Seq Analysis Simon V4.1.
ChIP-seq hands-on Iros Barozzi, Campus IFOM-IEO (Milan) Saverio Minucci, Gioacchino Natoli Labs.
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
Trinity College Dublin, The University of Dublin GE3M25: Data Analysis, Class 4 Karsten Hokamp, PhD Genetics TCD, 07/12/2015
IGV tools. Pipeline Download genome from Ensembl bacteria database Export the mapping reads file (SAM) Map reads to genome by CLC Using the mapping.
Trinity College Dublin, The University of Dublin Data download: bioinf.gen.tcd.ie/GE3M25/project Get.fastq.gz file associated with your student ID
No reference available
QC and pre-assembly analyses
Moderní metody analýzy genomu - analýza Mgr. Nikola Tom Brno,
Introduction of the ChIP-seq pipeline Shigeki Nakagome November 16 th, 2015 Di Rienzo lab meeting.
Computing on TSCC Make a folder for the class and move into it –mkdir –p /oasis/tscc/scratch/username/biom262_harismendy –cd /oasis/tscc/scratch/username/biom262_harismendy.
Short Read Workshop Day 5: Mapping and Visualization
A brief guide to sequencing Dr Gavin Band Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015 Africa Centre for Health.
Introduction to Variant Analysis of Exome- and Amplicon sequencing data Lecture by: Date: Training: Extended version see: Dr. Christian Rausch 29 May 2015.
HOMER – a one stop shop for ChIP-Seq analysis
Using Galaxy to build and run data processing pipelines Jelle Scholtalbers / Charles Girardot GBCS Genome Biology Computational Support.
From Reads to Results Exome-seq analysis at CCBR
Misleading bioinformatics: Mistakes, Biases, Mis-interpretations and how to avoid them Festival of Genomics 2017 Course Exercise Material:
Differential Methylation Analysis
Simon v RNA-Seq Analysis Simon v
Using command line tools to process sequencing data
SeqMonk tools for methylation analysis
SeqMonk tools for methylation analysis
Placental Bioinformatics
Simon v1.0 Motif Searching Simon v1.0.
Cancer Genomics Core Lab
Next Generation Sequencing Analysis
Biases and their Effect on Biological Interpretation
Short Read Sequencing Analysis Workshop
Invest. Ophthalmol. Vis. Sci ;57(10): doi: /iovs Figure Legend:
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
GE3M25: Data Analysis, Class 4
GE3M25: Data Analysis, Class3
The FASTQ format and quality control
EMC Galaxy Course November 24-25, 2014
Analysing ChIP-Seq Data
Yonglan Zheng Galaxy Hands-on Demo Step-by-step Yonglan Zheng
Artefacts and Biases in Gene Set Analysis
Simon v ChIP-Seq Analysis Simon v
Quality control for Sequencing Experiments
Exploring and Understanding ChIP-Seq data
Visualising and Exploring BS-Seq Data
A critical evaluation of HTQC: a fast quality control toolkit for Illumina sequencing data Chandan Pal, PhD student Sahlgrenska Academy Institute of.
Epigenetics System Biology Workshop: Introduction
Maximize read usage through mapping strategies
Simon V Motif Searching Simon V
Artefacts and Biases in Gene Set Analysis
Garbage In, Garbage Out: Quality control on sequence data
BF528 - Sequence Analysis Fundamentals
Computational Pipeline Strategies
RNA-Seq Data Analysis UND Genomics Core.
Quality Control & Nascent Sequencing
The Variant Call Format
Presentation transcript:

ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2018-01

Data Creation and Processing Starting DNA Fragmented DNA ChIPped DNA Mapped BAM File FastQ Sequence File Sequence Library

A typical ChIP Library Potential technical problems ChIP Fragment Adapter Barcode Potential technical problems Adapter contamination PCR Duplication Potential biological problems Lack of enrichment Other selection biases

QC of raw sequence Base Call Quality

QC of raw sequence Sequence Composition

QC of raw sequence Sequence Composition

QC of raw sequence Adapter Contamination ChIP Fragment Adapter Barcode Read Trim Galore! Quality and Adapter Trimming

Mapping ChIP Data All regions should be linear genomic stretches Standard genomic aligners are fine Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/ BWA http://bio-bwa.sourceforge.net/

Example Bowtie2 Mapping Create Genome Index (once - slow!) Map a single FastQ file bowtie2-build yeast_genome.fa yeast_index bowtie \ -x yeast_index \ -U data.fastq.gz \ | samtools view \ -bS \ -o data.bam

Post Alignment QC Mapping Statistics sample1: 2264052 (14.66%) aligned exactly 1 time sample2: 2698005 (18.79%) aligned exactly 1 time sample3: 13434392 (67.08%) aligned exactly 1 time sample4: 1108477 (6.70%) aligned exactly 1 time sample5: 2143911 (17.58%) aligned exactly 1 time sample6: 2980154 (13.98%) aligned exactly 1 time

Post Alignment Processing MAPQ Filtering ChIP-Seq relates sequences to positions in a reference genome You need to be confident that the reported position is correct Filtering on MAPQ value (likelihood of reported position being incorrect) is an easy way to do this MAPQ filtering should be performed in most cases samtools view -q 20 -b -o filtered.bam data.bam

Post Alignment Processing Deduplication java -jar picard.jar SortSam \ INPUT=filtered.bam \ OUTPUT=sorted.bam \ SORT_ORDER=coordinate java -jar picard.jar MarkDuplicates \ INPUT=sorted.bam \ OUTPUT=dedup.bam \ METRICS_FILE=metrics.txt DO NOT DEDUPLICATE AS A MATTER OF COURSE! THINK FIRST!

To Deduplicate or Not? Deduplication can make enrichment visually clearer and help to spot truly enriched regions Why not just deduplicate everything? Quantitation compression Repeat enrichment

Deduplication

Quantitation Compression from Deduplication

Repeat Enrichment from Deduplication Real Mapped Deduplicated Peak callers (MACS for example) deduplicate internally, so you don’t have to consciously do this. Only avoided by using uniquely mapped reads. Repeat Repeat

Assessing Duplication

Assessing Duplication

Standard Processing Workflow MultiQC Report Mapping Stats FastQ File FastQC Report Trimmed FQ File BAM File Filtered BAM Visualisation and Assessment Trim Galore SAM Tools Bowtie BWA FastQC FastQC