ChIP-Seq Data Processing and QC

Slides:

Advertisements

Similar presentations

IMGS 2012 Bioinformatics Workshop: RNA Seq using Galaxy

Advertisements

Visualising and Exploring BS-Seq Data

Processing of miRNA samples and primary data analysis

ChIP-seq analysis Ecole de bioinformatique AVIESAN – Roscoff, Jan 2013.

DNAseq analysis Bioinformatics Analysis Team

Simon v2.3 RNA-Seq Analysis Simon v2.3.

SOLiD Sequencing & Data

SeqMonk tools for methylation analysis

Bioinformatics Analysis Team McGill University and Genome Quebec Innovation Center

NGS data format and General Quality Control. Data format “Flowchart” Sequencer raw data FastqSAM/BAM.

Before we start: Align sequence reads to the reference genome

NGS Analysis Using Galaxy

Whole Exome Sequencing for Variant Discovery and Prioritisation

Expression Analysis of RNA-seq Data

Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING.

NGS data analysis CCM Seminar series Michael Liang:

Next Generation DNA Sequencing

RNA-Seq Analysis Simon V4.1.

ChIP-seq hands-on Iros Barozzi, Campus IFOM-IEO (Milan) Saverio Minucci, Gioacchino Natoli Labs.

Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.

Trinity College Dublin, The University of Dublin GE3M25: Data Analysis, Class 4 Karsten Hokamp, PhD Genetics TCD, 07/12/2015

IGV tools. Pipeline Download genome from Ensembl bacteria database Export the mapping reads file (SAM) Map reads to genome by CLC Using the mapping.

Trinity College Dublin, The University of Dublin Data download: bioinf.gen.tcd.ie/GE3M25/project Get.fastq.gz file associated with your student ID

No reference available

QC and pre-assembly analyses

Moderní metody analýzy genomu - analýza Mgr. Nikola Tom Brno,

Introduction of the ChIP-seq pipeline Shigeki Nakagome November 16 th, 2015 Di Rienzo lab meeting.

Computing on TSCC Make a folder for the class and move into it –mkdir –p /oasis/tscc/scratch/username/biom262_harismendy –cd /oasis/tscc/scratch/username/biom262_harismendy.

Short Read Workshop Day 5: Mapping and Visualization

A brief guide to sequencing Dr Gavin Band Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015 Africa Centre for Health.

Introduction to Variant Analysis of Exome- and Amplicon sequencing data Lecture by: Date: Training: Extended version see: Dr. Christian Rausch 29 May 2015.

HOMER – a one stop shop for ChIP-Seq analysis

Using Galaxy to build and run data processing pipelines Jelle Scholtalbers / Charles Girardot GBCS Genome Biology Computational Support.

From Reads to Results Exome-seq analysis at CCBR

Misleading bioinformatics: Mistakes, Biases, Mis-interpretations and how to avoid them Festival of Genomics 2017 Course Exercise Material:

Differential Methylation Analysis

Simon v RNA-Seq Analysis Simon v

Using command line tools to process sequencing data

SeqMonk tools for methylation analysis

SeqMonk tools for methylation analysis

Placental Bioinformatics

Simon v1.0 Motif Searching Simon v1.0.

Cancer Genomics Core Lab

Next Generation Sequencing Analysis

Biases and their Effect on Biological Interpretation

Short Read Sequencing Analysis Workshop

Invest. Ophthalmol. Vis. Sci ;57(10): doi: /iovs Figure Legend:

S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.

GE3M25: Data Analysis, Class 4

GE3M25: Data Analysis, Class3

The FASTQ format and quality control

EMC Galaxy Course November 24-25, 2014

Analysing ChIP-Seq Data

Yonglan Zheng Galaxy Hands-on Demo Step-by-step Yonglan Zheng

Artefacts and Biases in Gene Set Analysis

Simon v ChIP-Seq Analysis Simon v

Quality control for Sequencing Experiments

Exploring and Understanding ChIP-Seq data

Visualising and Exploring BS-Seq Data

A critical evaluation of HTQC: a fast quality control toolkit for Illumina sequencing data Chandan Pal, PhD student Sahlgrenska Academy Institute of.

Epigenetics System Biology Workshop: Introduction

Maximize read usage through mapping strategies

Simon V Motif Searching Simon V

Artefacts and Biases in Gene Set Analysis

Garbage In, Garbage Out: Quality control on sequence data

BF528 - Sequence Analysis Fundamentals

Computational Pipeline Strategies

RNA-Seq Data Analysis UND Genomics Core.

Quality Control & Nascent Sequencing

The Variant Call Format

Presentation transcript:

ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2018-01

Data Creation and Processing Starting DNA Fragmented DNA ChIPped DNA Mapped BAM File FastQ Sequence File Sequence Library

A typical ChIP Library Potential technical problems ChIP Fragment Adapter Barcode Potential technical problems Adapter contamination PCR Duplication Potential biological problems Lack of enrichment Other selection biases

QC of raw sequence Base Call Quality

QC of raw sequence Sequence Composition

QC of raw sequence Sequence Composition

QC of raw sequence Adapter Contamination ChIP Fragment Adapter Barcode Read Trim Galore! Quality and Adapter Trimming

Mapping ChIP Data All regions should be linear genomic stretches Standard genomic aligners are fine Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/ BWA http://bio-bwa.sourceforge.net/

Example Bowtie2 Mapping Create Genome Index (once - slow!) Map a single FastQ file bowtie2-build yeast_genome.fa yeast_index bowtie \ -x yeast_index \ -U data.fastq.gz \ | samtools view \ -bS \ -o data.bam

Post Alignment QC Mapping Statistics sample1: 2264052 (14.66%) aligned exactly 1 time sample2: 2698005 (18.79%) aligned exactly 1 time sample3: 13434392 (67.08%) aligned exactly 1 time sample4: 1108477 (6.70%) aligned exactly 1 time sample5: 2143911 (17.58%) aligned exactly 1 time sample6: 2980154 (13.98%) aligned exactly 1 time

Post Alignment Processing MAPQ Filtering ChIP-Seq relates sequences to positions in a reference genome You need to be confident that the reported position is correct Filtering on MAPQ value (likelihood of reported position being incorrect) is an easy way to do this MAPQ filtering should be performed in most cases samtools view -q 20 -b -o filtered.bam data.bam

Post Alignment Processing Deduplication java -jar picard.jar SortSam \ INPUT=filtered.bam \ OUTPUT=sorted.bam \ SORT_ORDER=coordinate java -jar picard.jar MarkDuplicates \ INPUT=sorted.bam \ OUTPUT=dedup.bam \ METRICS_FILE=metrics.txt DO NOT DEDUPLICATE AS A MATTER OF COURSE! THINK FIRST!

To Deduplicate or Not? Deduplication can make enrichment visually clearer and help to spot truly enriched regions Why not just deduplicate everything? Quantitation compression Repeat enrichment

Deduplication

Quantitation Compression from Deduplication

Repeat Enrichment from Deduplication Real Mapped Deduplicated Peak callers (MACS for example) deduplicate internally, so you don’t have to consciously do this. Only avoided by using uniquely mapped reads. Repeat Repeat

Assessing Duplication

Assessing Duplication

Standard Processing Workflow MultiQC Report Mapping Stats FastQ File FastQC Report Trimmed FQ File BAM File Filtered BAM Visualisation and Assessment Trim Galore SAM Tools Bowtie BWA FastQC FastQC