A critical evaluation of HTQC: a fast quality control toolkit for Illumina sequencing data Chandan Pal, PhD student Sahlgrenska Academy Institute of.


A critical evaluation of HTQC: a fast quality control toolkit for Illumina sequencing data Chandan Pal, PhD student Sahlgrenska Academy Institute of Neuroscience and Physiology | Chandan Pal 19.03.2013

Components
- Why a quality control toolkit?
- Illumina data summary
- Sources of errors in Illumina data
- A bit on the QC pipeline
- Focus on the HTQC toolkit
- Focus on other QC tools
- Critical analysis of HTQC
- Quiz (2 questions: answer and get chocolate bars)

Why is quality control so important?
- Poor-quality data produce poor, biased, or incorrect results
- Conclusions depend on QC and filtering
- Genomes are used as a tool in diagnostics
- QC reduces the data volume that downstream software and tools have to handle
To solve it:
- Identify problems in the data (visualization can help)
- Remove problems from the data (using a filtering workflow)

Illumina short read characteristics
- Length: 35-250 bp; 100 bp is the most popular
- Attributes:
  - quality is high at the 5' start and drops toward the 3' end
  - indels and homopolymer errors are rare
- Library types:
  - single-end (less reliable)
  - paired-end (very reliable)
  - mate-pair (varies)
- Quality score (Q-score), stored as an ASCII character code with offset 64 or 33, where Qphred = -10 log10(P) and P is the error probability:
  - Q40: P = 0.0001 (1 in 10,000)
  - Q30: P = 0.001 (1 in 1,000)
  - Q20: P = 0.01 (1 in 100)
  - Q10: P = 0.1 (1 in 10)
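
The relation between Q-score, error probability, and the ASCII encoding can be sketched in a few lines of C++. This is a minimal illustration, not code from HTQC; the function names are made up for this example, and the offset (33 or 64) is passed in by the caller because it depends on which pipeline version wrote the file.

```cpp
#include <cassert>
#include <cmath>

// Phred score from an error probability P (0 < P <= 1): Q = -10 * log10(P).
double phred_from_error(double p) {
    assert(p > 0.0 && p <= 1.0);
    return -10.0 * std::log10(p);
}

// Error probability from a Phred score: P = 10^(-Q/10).
double error_from_phred(double q) {
    return std::pow(10.0, -q / 10.0);
}

// Decode one FASTQ quality character. The offset (33 or 64) is an input,
// since it cannot be read from a single character.
int phred_from_ascii(char c, int offset) {
    return static_cast<int>(c) - offset;
}
```

For example, the character 'h' decodes to Q40 with offset 64, and 'B' decodes to Q2 with the same offset.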

Meaning of an Illumina read
Header fields, in order: Run, Lane, Tile, X-position, Y-position, Index, Read Number
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAG
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBRTT\]][]ddddd^dddadd^BBBBBBBBBBB
The second line holds the base calls and the fourth line the base qualities.
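
Splitting the read header shown above into its fields can be sketched as follows. The struct and function names are hypothetical, and the parser assumes exactly the `@run:lane:tile:x:y#index/read` layout of this example; newer Illumina pipelines write a different header layout that this sketch does not handle.

```cpp
#include <cstddef>
#include <exception>
#include <string>

// Fields of an older-style Illumina read header (hypothetical struct).
struct ReadId {
    std::string run;
    int lane = 0;
    int tile = 0;
    int x = 0;
    int y = 0;
    std::string index;
    int read_number = 0;
};

// Returns true and fills `out` if `header` matches @run:lane:tile:x:y#index/read.
bool parse_read_id(const std::string& header, ReadId& out) {
    if (header.empty() || header[0] != '@') return false;
    // Locate the four ':' separators, then '#' and '/'.
    std::size_t pos[4];
    std::size_t from = 1;
    for (int i = 0; i < 4; ++i) {
        pos[i] = header.find(':', from);
        if (pos[i] == std::string::npos) return false;
        from = pos[i] + 1;
    }
    std::size_t hash = header.find('#', from);
    if (hash == std::string::npos) return false;
    std::size_t slash = header.find('/', hash + 1);
    if (slash == std::string::npos) return false;
    try {
        out.run = header.substr(1, pos[0] - 1);
        out.lane = std::stoi(header.substr(pos[0] + 1, pos[1] - pos[0] - 1));
        out.tile = std::stoi(header.substr(pos[1] + 1, pos[2] - pos[1] - 1));
        out.x = std::stoi(header.substr(pos[2] + 1, pos[3] - pos[2] - 1));
        out.y = std::stoi(header.substr(pos[3] + 1, hash - pos[3] - 1));
        out.index = header.substr(hash + 1, slash - hash - 1);
        out.read_number = std::stoi(header.substr(slash + 1));
    } catch (const std::exception&) {
        return false;  // a numeric field was missing or not a number
    }
    return true;
}
```

Having lane and tile as separate integers is what makes tile-level statistics and tile filtering possible later in a QC pipeline.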

How are quality scores generated?
- A model takes a set of quality predictor values (intensity, background signal) as inputs, measures base-call reliability, and produces quality scores as outputs.
- A quality score is assigned to each base call for every cluster in a given tile on a flow cell at a specific cycle.
- All quality scores are recorded in base call files (.bcl), which contain the base call and quality score per cycle.
- During analysis, quality scores are written to FASTQ files in ASCII-coded form.

Sources of error in Illumina data
- Non-random distribution of the reads from the sequenced sample over the reference (high coverage in regions of high %GC)
- Certain substitution errors (wrong base calls are frequently preceded by G or C)
- Frequencies of base substitutions vary (A to C is the most frequent)
- Region-specific sequencing artifacts
- Uncalled bases (encoded as 'B' in the quality string)
- Ambiguous bases (Velvet silently converts them to a random base; some software misbehaves, some crashes)
- Other tile-specific problems

Tile-specific problems
- Base calling (problems in a specific tile, chastity)

QC processing of raw data
- FASTQ files
- QC inspection
- Filtering out bad data

Investigation of read quality
- General:
  - base quality distribution
  - average base quality distribution
  - base counts by position
  - base quality by position
- Instrument-specific:
  - tile base quality
  - tile alignment quality
- Experiment-specific:
  - DNA-seq (alignment score, mapping quality, targeted region QC, read-pair status, insert-size distribution)
  - ChIP-Seq (duplicate reads)

Trimming reads
- End trimming: clip bases from the ends until the quality threshold is met (ignores low-quality bases in the middle of the read)
- Fixed trimming: clip x bases from the end of every read (e.g., if you know that something went wrong with the run after a certain cycle)
- Low-quality sequence should be trimmed, based on:
  - each bad base
  - a moving-window average (e.g. 4 to 5)
  - a minimum % of good bases per window
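
The two quality-based strategies can be sketched as follows. This is an illustration under stated assumptions, not the algorithm used by ht_trim: the function names are invented, only the 3' end is trimmed, and the sliding-window rule (cut at the start of the first window whose mean quality is too low) is one of several reasonable definitions.

```cpp
#include <cstddef>
#include <string>

// End trimming: drop bases from the 3' end until one meets min_q.
// Low-quality bases in the middle of the read are left untouched.
// Returns the number of bases to keep.
std::size_t trim_3prime(const std::string& qual, int offset, int min_q) {
    std::size_t keep = qual.size();
    while (keep > 0 && (qual[keep - 1] - offset) < min_q) --keep;
    return keep;
}

// Sliding-window trimming: scan 5' to 3' and cut at the start of the first
// window whose mean quality is below min_mean_q.
// Reads shorter than the window are returned untrimmed (an assumption).
std::size_t trim_sliding_window(const std::string& qual, int offset,
                                std::size_t window, double min_mean_q) {
    if (window == 0 || qual.size() < window) return qual.size();
    long sum = 0;
    for (std::size_t i = 0; i < window; ++i) sum += qual[i] - offset;
    for (std::size_t start = 0;; ++start) {
        if (static_cast<double>(sum) / window < min_mean_q) return start;
        if (start + window >= qual.size()) break;
        // Slide the window by one base: add the new base, drop the oldest.
        sum += (qual[start + window] - offset) - (qual[start] - offset);
    }
    return qual.size();
}
```

Both functions return a length, so the caller can cut the sequence and quality strings with `substr(0, keep)` and then apply a minimum-length filter.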

Effect of data removal
- Losing lots of data (up to 99.5%)
- Losing data in regions of the genome that already have very low coverage
- Wrongly removing duplicated sequence as a contaminant when it is actually biological (mitochondrial or chloroplast sequence, common repeats in the target genome)
- Problems for assembly in de novo projects (contig breaks)

Comparison of QC software

HTQC toolkit components
- ht_stat (QC report in text format)
- ht_stat_draw.pl (visualization)
- ht_tile_filter (tile-specific filtering)
- ht_trim (trims bases from both ends of the reads)
- ht_qual_filter (removes low-quality reads)
- ht_length_filter (removes short reads)

Read quality assessment


Read quality assessment

Run-time efficiency
A. Real: elapsed wall-clock time
B. User: CPU time spent in user mode
C. System: CPU time spent in system mode

Time efficiency of the HTQC tool
The HTQC tool claims to be:
- approx. 3 times faster than FastQC
- approx. 30 times faster than BIGpre
- approx. 40 times faster than SolexaQA
- much lighter on memory

Observation
Test data: 24 million reads, 40 bp long

Tool     real        user        sys
FastQC   2m25.152s   2m27.437s   0m2.023s
HTQC     5m58.007s   5m48.289s   3m16.106s

- HTQC was about 2 times slower than FastQC
- HTQC used 1.9 GB of memory (FastQC used 180 MB)

Why does it take more time?
A slow input loop:
    while ((curr_char = fgetc(handle)) != EOF) {
        if (curr_char == LINE_SEPARATOR) break;
        buffer.push_back(curr_char);
    }
This reads the huge FASTQ file character by character and appends each character to a std::vector.

Pros and cons of the HTQC toolkit
Drawbacks:
- ht_stat only gives the quality statistics; ht_stat_draw.pl has to be run separately to get the plots
- Gnuplot needs to be installed
- CMake needs to be installed
Pros:
- Generates plots and quality statistics
- Trimming and filtering by quality, length, or tile
- Written in C++, so faster than Perl-based software

Overall quality filtering stages
- B-tail trimming
- Removing reads containing adapter sequence prior to analysis
- Removing reads that have less than two-thirds of the bases with Q ≥ 30 in the first half of the read
- Removing reads not passing the chastity filter (specific to Illumina pipeline filtering: chastity >= 0.6, where chastity is the ratio of the highest of the four base-type intensities to the sum of the highest two)
- Removing reads containing at least one uncalled base

Summary & suggestions
- FastQC performed better in terms of time efficiency, though it has no filtering facilities.
- HTQC can do most of what a user expects from a QC toolkit, but individual commands have to be run for each step.
- The FASTX-Toolkit can be an alternative: individual commands have to be run for each step (same as HTQC), and it can't handle paired-end reads.
- Suggested workflow: initial investigation with FastQC, then move to the FASTX-Toolkit (single-end), the HTQC toolkit (paired-end), or the FASTX-Toolkit plus own script (paired-end).

Quiz
1. Illumina first came to the market in
   A. 2004  B. 2005  C. 2006  D. 2007
2. First author of today's paper
   A. Xi Yang  B. Feng Li  C. Jun Wu  D. Xue Xiao