Garbage In, Garbage Out: Quality control on sequence data


Key concepts of session: The quality of the data limits what you can confidently say about the data and how you can subsequently use it. An important component of quality control is visualization: you must actually LOOK at your data.

So you have reads off a sequencer … where do you start? With the FASTQ format. More on the file format and quality encoding: https://en.wikipedia.org/wiki/FASTQ_format
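To make the format concrete, here is a minimal Python sketch that reads four-line FASTQ records and decodes the quality string. It assumes the common Phred+33 encoding and unwrapped records (exactly four lines per read); the names `read_fastq` and `phred_scores` and the example read are made up for illustration, not part of the workshop scripts.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (a blank line is also treated as the end)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to Phred scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality]


# Hypothetical single-read example: 'I' encodes Q40, '#' encodes Q2.
example = io.StringIO("@read1\nGATTACA\n+\nIIII#II\n")
for record in read_fastq(example):
    print(record.name, phred_scores(record.quality))
    # prints: read1 [40, 40, 40, 40, 2, 40, 40]
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so Q40 is roughly 1 error in 10,000 bases while Q2 is close to a coin flip.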

Expectation

But the reality may be very different

So what? Why does QC matter? You are going to spend a LOT of time (and $) on this dataset. Downstream analysis software assumes well-behaved data!

How to assess a bag of reads. Pre-mapping: FastQC, GC content, read quality (Phred score). Post-mapping: read coverage (which regions, how much), complexity (# unique samples).
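The two pre-mapping metrics can be sketched in a few lines of Python. This is a simplified illustration, not how FastQC itself is implemented; it assumes Phred+33 quality strings, and the function names are made up for this example.

```python
from typing import Iterable, List


def gc_fraction(sequence: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C; N is ignored."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def mean_quality_by_position(qualities: Iterable[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position, allowing reads of unequal length."""
    totals: List[int] = []
    counts: List[int] = []
    for quality in qualities:
        for position, char in enumerate(quality):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


print(gc_fraction("GGCCAATT"))                  # prints: 0.5
print(mean_quality_by_position(["II", "I#I"]))  # prints: [40.0, 21.0, 40.0]
```

A per-position mean that drops toward the 3' end of the read is the typical pattern to look for before deciding on quality trimming.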

Protocol matters – how the experiment influences your QC. Mistakes in the protocol can result in abnormal distributions. Poor read quality = poor mapping = poor coverage.

WHY doesn’t it look like I wanted? Cell clustering – over-amplification. Low library complexity. Problems with amplification or size selection. Problems with adapters. See also: https://sequencing.qcfail.com/
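Low library complexity tends to show up as the same sequence appearing over and over. A rough way to check this before mapping is to count distinct read sequences, as in the Python sketch below. This is only a crude proxy under the assumption that identical sequences are duplicates; post-mapping tools typically use alignment positions instead, and the function name and toy reads are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def duplication_summary(sequences: Iterable[str]) -> Dict[str, object]:
    """Summarize how many read sequences are distinct versus repeated."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads supplied")
    distinct = len(counts)
    return {
        "total": total,
        "distinct": distinct,
        "distinct_fraction": distinct / total,
        "most_common": counts.most_common(1)[0],  # (sequence, count)
    }


# Toy example: three copies of one read and one unique read.
print(duplication_summary(["ACGT", "ACGT", "ACGT", "TTGA"]))
```

A very low distinct fraction, or one sequence dominating the counts, is a hint to look at over-amplification or adapter problems. What counts as "too low" depends on the assay and sequencing depth.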

But one person’s garbage is another’s treasure.

You can still obtain information. Even low-coverage samples can tell you which genes are being actively transcribed and which genes are differentially expressed (depending on depth and coverage).

Running FastQC – Pre-Trim. Determine which adapters are present if you are unsure of the protocol. Assess whether the sequencing/protocol is providing the expected results. Refine trimming options.

In this script, we will: flip reads (reverse complement), which is protocol dependent, and run FastQC. To run (after adjusting parameters in green box):
$ bash fastqc_pretrim.sh
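What "flipping" a read means can be shown in a short Python sketch. The contents of fastqc_pretrim.sh are not shown here, so this is an illustration of the operation rather than the script's actual method; the function names are made up. The key point is that the quality string has to be reversed along with the sequence so each score stays attached to its base.

```python
from typing import Tuple

# Complement table for the standard bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the order."""
    return sequence.translate(_COMPLEMENT)[::-1]


def flip_read(sequence: str, quality: str) -> Tuple[str, str]:
    """Reverse complement the bases and reverse the qualities to match."""
    return reverse_complement(sequence), quality[::-1]


print(flip_read("GATTACA", "IIII#II"))  # prints: ('TGTAATC', 'II#IIII')
```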

Open up our fastqc .html report

Trimming. Many different trimming programs are available. We will use “bbduk” – quick runtime, lots of trim options.
$ vi trim.sh

In this script, we will: trim for adapters (followed by length) and trim for quality. To run (after adjusting rootname/project):
$ bash trim.sh
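The order of operations (adapter, then length, then quality) can be sketched in Python. This is a deliberately naive stand-in, not bbduk's algorithm: it only finds exact, full-length adapter matches and trims the 3' tail base by base. The adapter sequence, thresholds, and function names in the example are assumptions chosen for illustration.

```python
from typing import Optional, Tuple


def trim_adapter(sequence: str, quality: str, adapter: str) -> Tuple[str, str]:
    """Cut the read at the first exact occurrence of the adapter, if any."""
    index = sequence.find(adapter)
    if index == -1:
        return sequence, quality
    return sequence[:index], quality[:index]


def trim_low_quality_tail(sequence: str, quality: str,
                          min_q: int = 10, offset: int = 33) -> Tuple[str, str]:
    """Drop 3' bases until the last remaining base has Phred >= min_q."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def trim_read(sequence: str, quality: str, adapter: str,
              min_q: int = 10, min_len: int = 25) -> Optional[Tuple[str, str]]:
    """Adapter trim, length filter, quality trim, length filter; None if discarded."""
    sequence, quality = trim_adapter(sequence, quality, adapter)
    if len(sequence) < min_len:
        return None
    sequence, quality = trim_low_quality_tail(sequence, quality, min_q)
    if len(sequence) < min_len:
        return None
    return sequence, quality


# Hypothetical read: 10 bases of insert, an example adapter, then 2 extra bases.
adapter = "AGATCGGAAGAGC"
read_seq = "ACGTACGTAC" + adapter + "TT"
read_qual = "IIIIIIII##" + "I" * 15
print(trim_read(read_seq, read_qual, adapter, min_q=10, min_len=5))
# prints: ('ACGTACGT', 'IIIIIIII')
```

Real trimmers also handle partial adapters at the read end, mismatches, and paired reads, which is why the workshop uses a dedicated tool rather than something like this.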

View trim stats
$ cd /home/user/hackcon/trimmed
$ ls
$ vim sample.stats
What can we learn from this report?

Running FastQC – Post-Trim. Determine which adapters are present if you are unsure of the protocol. Assess whether the sequencing/protocol is providing the expected results. Refine trimming options.

In this script, we will: assess our trimming parameters and determine if we need to re-trim or move forward with mapping. To run (after adjusting rootname/project):
$ bash fastqc_postrim.sh
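Alongside the FastQC report, a quick before/after comparison of read lengths can help with the re-trim decision. The Python sketch below is one possible summary, assuming you have already collected read lengths from the untrimmed and trimmed files; the function name and numbers are hypothetical, and no pass/fail thresholds are implied.

```python
from typing import Dict, Sequence


def trimming_summary(lengths_before: Sequence[int],
                     lengths_after: Sequence[int]) -> Dict[str, float]:
    """Compare read counts and total bases before and after trimming."""
    reads_before = len(lengths_before)
    bases_before = sum(lengths_before)
    if reads_before == 0 or bases_before == 0:
        raise ValueError("no input reads to compare against")
    reads_after = len(lengths_after)
    bases_after = sum(lengths_after)
    return {
        "reads_kept_fraction": reads_after / reads_before,
        "bases_kept_fraction": bases_after / bases_before,
        "mean_length_after": bases_after / reads_after if reads_after else 0.0,
    }


# Toy example: four 100 bp reads in, three reads out after trimming.
print(trimming_summary([100, 100, 100, 100], [90, 80, 100]))
```

If most reads are discarded or shortened drastically, the trimming parameters may be too aggressive, or the library itself may have a problem worth revisiting before mapping.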

Open up our fastqc .html report