Published by Arthur Johnson. Modified over 6 years ago.
1
A critical evaluation of HTQC: a fast quality control toolkit for Illumina sequencing data
Chandan Pal, PhD student
Sahlgrenska Academy
Institute of Neuroscience and Physiology | Chandan Pal
2
Components
- Why a quality control toolkit?
- Illumina data summary
- Sources of errors in Illumina data
- A bit on the QC pipeline
- Focus on the HTQC toolkit
- Focus on other QC tools
- Critical analysis of HTQC
- Quiz (2 questions: answer and get chocolate bars)
3
Why is quality control so important?
- Poor-quality data produce poor, biased, or incorrect results
- Conclusions depend on QC and filtering
- The genome is used as a tool in diagnostics
- QC reduces the data volume for the software and tools used in downstream analysis
To solve it:
- Identify problems in the data (visualization can help)
- Remove problems from the data (using a filtering workflow)
4
Illumina short reads characteristics
- Length in bp; 100 bp most popular
- Attributes
  - Quality is high at the 5' start and drops toward the 3' end
  - Indels and homopolymer errors are rare
- Library types
  - Single end (low reliability)
  - Paired-end (very reliable)
  - Mate-pair (varies)
- Quality score (Q-score), stored as an ASCII character code (offset 64 or 33), with Qphred = -10 log10(P), where P is the error probability
  - Q40 (error probability 0.0001, or 1 in 10000)
  - Q30 (error probability 0.001, or 1 in 1000)
  - Q20 (error probability 0.01, or 1 in 100)
  - Q10 (error probability 0.1, or 1 in 10)
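The Q-score relation above can be sketched in a few lines of C++. This is a minimal illustration, not code from HTQC or any other tool; the function names `phred_to_error_prob`, `error_prob_to_phred`, and `decode_qualities` are made up for this example, and the only inputs assumed are the formula and the two ASCII offsets named on the slide.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Phred score -> probability that the base call is wrong: P = 10^(-Q/10).
double phred_to_error_prob(int q) {
    return std::pow(10.0, -q / 10.0);
}

// Error probability -> Phred score: Q = -10 * log10(P).
double error_prob_to_phred(double p) {
    return -10.0 * std::log10(p);
}

// Decode a FASTQ quality string into integer Q-scores.
// `offset` is the ASCII offset named on the slide (33 or 64).
std::vector<int> decode_qualities(const std::string& quality, int offset) {
    assert(offset == 33 || offset == 64);
    std::vector<int> scores;
    scores.reserve(quality.size());
    for (char c : quality) {
        scores.push_back(static_cast<int>(c) - offset);
    }
    return scores;
}
```

For example, the character `I` (ASCII 73) decodes to Q40 with offset 33, and `h` (ASCII 104) decodes to Q40 with offset 64, so the offset has to be known before any threshold is applied.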
5
Meaning of an Illumina read
Header fields, in order: Run, Lane, Tile, X-position, Y-position, Index, Read Number

@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAG
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBRTT\]][]ddddd^dddadd^BBBBBBBBBBB

The second line holds the base calls; the fourth line holds the base qualities.
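As a rough sketch of how the header fields could be pulled apart, the C++ below splits an identifier of the form `@run:lane:tile:x:y#index/read`, which is the layout of the example read. The struct `ReadId` and the function `parse_read_id` are hypothetical names for this illustration; other Illumina header layouts exist and are simply rejected here.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Fields of the read identifier, named after the labels on the slide.
struct ReadId {
    std::string run;      // e.g. "HWI-EAS209_0006_FC706VJ"
    int lane = 0;
    int tile = 0;
    int x = 0;
    int y = 0;
    std::string index;    // index sequence after '#'
    int read_number = 0;  // number after '/'
};

// Parse "@run:lane:tile:x:y#index/read". Returns false if the layout differs.
bool parse_read_id(const std::string& header, ReadId& out) {
    if (header.empty() || header[0] != '@') return false;
    const std::size_t hash = header.find('#');
    const std::size_t slash = header.rfind('/');
    if (hash == std::string::npos || slash == std::string::npos || slash < hash) {
        return false;
    }

    // Split the part between '@' and '#' on ':'.
    std::vector<std::string> parts;
    std::istringstream fields(header.substr(1, hash - 1));
    std::string part;
    while (std::getline(fields, part, ':')) parts.push_back(part);
    if (parts.size() != 5) return false;

    try {
        out.run = parts[0];
        out.lane = std::stoi(parts[1]);
        out.tile = std::stoi(parts[2]);
        out.x = std::stoi(parts[3]);
        out.y = std::stoi(parts[4]);
        out.index = header.substr(hash + 1, slash - hash - 1);
        out.read_number = std::stoi(header.substr(slash + 1));
    } catch (const std::exception&) {
        return false;  // a numeric field did not contain a number
    }
    return true;
}
```

Having the tile number as a separate field is what makes tile-specific checks possible later in the QC workflow.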
6
How are quality scores generated?
- A model takes a set of quality predictor values (intensity, background signal) as inputs, measures base call reliability, and produces quality scores as outputs.
- A quality score is assigned to each base call for every cluster in a given tile on a flow cell at a specific cycle.
- All quality scores are recorded in base call files (.bcl), which contain the base call and quality score per cycle.
- During analysis, quality scores are written to FASTQ files in ASCII-coded form.
7
Sources of error in Illumina
- Non-random distribution of the reads in the sequenced sample over the reference (high coverage in areas of high %GC)
- Certain substitution errors (wrong base calls frequently occur after a G or C)
- Frequencies of base substitution vary (A to C is most frequent)
- Region-specific sequencing artifacts
- Uncalled bases (encoded as 'B' in the quality string)
- Ambiguous bases (Velvet silently converts them to a random base, some software misbehaves, some crashes)
- Other tile-specific problems
8
Tile-specific problem
- Base calling (problem in a specific tile, chastity)
9
QC processing of raw data
- FASTQ files
- QC inspection
- Filtering out bad data
10
Investigation on read quality
- General
  - Base quality distribution
  - Average base quality distribution
  - Base counts by position
  - Base quality by position
- Instrument specific
  - Tile base quality
  - Tile alignment quality
- Experiment specific
  - DNA-seq (alignment score, mapping quality, targeted region QC, read-pair status, insert-size distribution)
  - ChIP-seq (duplicate reads)
11
Trimming reads
- End trimming: clip bases from the ends until the quality threshold is met (ignores low-quality bases in the middle of the read)
- Fixed trimming: clip x bases from the end of every read (e.g., if you know that something went wrong with the run after a certain cycle)
- Should trim low-quality sequences by:
  - each bad base
  - moving-window average (e.g. 4 to 5)
  - minimum % good per window
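The two quality-based strategies above can be sketched as follows. This is an illustrative C++ sketch, not the algorithm used by `ht_trim` or any other tool; the function names `trim_ends` and `window_trim`, and the choice to cut at the start of the first failing window, are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// End trimming: clip bases from both ends until a base with quality >= min_q
// is met. Returns the half-open range [start, end) of bases to keep.
// Low-quality bases in the middle of the read are left alone.
std::pair<std::size_t, std::size_t> trim_ends(const std::vector<int>& quals, int min_q) {
    std::size_t start = 0;
    std::size_t end = quals.size();
    while (start < end && quals[start] < min_q) ++start;
    while (end > start && quals[end - 1] < min_q) --end;
    return std::make_pair(start, end);
}

// Moving-window trimming: scan from the 5' end and cut the read at the start
// of the first window whose mean quality is below min_mean.
// Returns the number of bases kept from the 5' end.
std::size_t window_trim(const std::vector<int>& quals, std::size_t window, double min_mean) {
    assert(window > 0);
    if (quals.size() < window) return quals.size();  // too short to evaluate
    for (std::size_t i = 0; i + window <= quals.size(); ++i) {
        long sum = 0;
        for (std::size_t j = i; j < i + window; ++j) sum += quals[j];
        if (static_cast<double>(sum) / static_cast<double>(window) < min_mean) return i;
    }
    return quals.size();
}
```

With qualities `{2, 2, 30, 35, 10, 38, 2}` and a threshold of 20, `trim_ends` keeps positions 2 to 5 and leaves the Q10 base in the middle untouched, which is exactly the limitation noted for end trimming.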
12
Effect of data removal
- Losing lots of data (up to 99.5%)
- Losing data in areas of the genome that already have very low coverage
- Wrongly removing duplicates as contaminants when they are actually biological (mitochondrial or chloroplast sequences as common repeats in the target genome)
- Problems in de novo assembly projects (contig breaks)
13
Comparison of QC software
14
HTQC toolkit components
- ht_stat (QC report in text format)
- ht_stat_draw.pl (visualization)
- ht_tile_filter (tile specific)
- ht_trim (trim bases from both ends of the reads)
- ht_qual_filter (remove low-quality reads)
- ht_length_filter (remove short reads)
15
Read quality assessment
16
17
Read quality assessment
18
Run-time efficiency
A. Real: elapsed wall-clock time
B. User: CPU cost in user mode
C. System: CPU cost in system mode
19
Time efficiency of the HTQC tool
HTQC tool claims:
- Approx. 3 times FASTER than FastQC
- Approx. 30 times FASTER than BIGpre
- Approx. 40 times FASTER than SolexaQA
- Uses a lot less memory
20
Observation
Test data: 24 million reads, 40 bp long

          real        user        sys
FastQC    2m25.152s   2m27.437s   0m2.023s
HTQC      5m58.007s   5m48.289s   3m16.106s

- HTQC was 2 times SLOWER than FastQC (about 2.5 times by elapsed time: 358 s vs 145 s)
- HTQC used 1.9 GB of memory (FastQC used 180 MB)
21
Why does it take so long? Bad input-output loop

    while ((curr_char = fgetc(handle)) != EOF) {
        if (curr_char == LINE_SEPARATOR) break;
        buffer.push_back(curr_char);
    }

This means it reads the huge FASTQ file character by character and appends each character to a std::vector.
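For contrast, here is a minimal sketch of a line-oriented read loop that lets the standard library fetch a whole line per call instead of one character per call. It is not a patch for HTQC and was not benchmarked in this talk; the names `FastqCounts` and `count_fastq` are hypothetical, and the sketch assumes a well-formed FASTQ stream with exactly four lines per record.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Simple counters collected while streaming through a FASTQ file.
struct FastqCounts {
    std::size_t reads = 0;
    std::size_t bases = 0;
};

// Read FASTQ records four lines at a time with std::getline.
// The four std::string buffers are reused across records, so their
// storage does not have to be reallocated for every read.
FastqCounts count_fastq(std::istream& in) {
    FastqCounts counts;
    std::string header, sequence, plus, quality;
    while (std::getline(in, header) && std::getline(in, sequence) &&
           std::getline(in, plus) && std::getline(in, quality)) {
        ++counts.reads;
        counts.bases += sequence.size();
    }
    return counts;  // a trailing incomplete record is ignored
}
```

The function takes any `std::istream`, so it can be fed from a `std::ifstream` for a real file or from a `std::istringstream` for a quick check. Whether this would close the gap with FastQC is an open question; the memory footprint would also need to be looked at separately.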
22
Pros and Cons of HTQC toolkit
Drawbacks
- ht_stat gives the quality statistics
- ht_stat_draw.pl must be run separately to get the plots
- gnuplot needs to be installed
- CMake needs to be installed
Pros
- Generates plots and quality statistics
- Trimming; filtering by quality, length, and tile
- Written in C++, so faster than Perl-based software
23
Overall quality filtering stages
- B-tail trimming
- Removing reads containing adapter sequence prior to analysis
- Removing reads that have less than two-thirds of the bases with Q ≥ 30 in the first half of the read
- Removing reads not passing the chastity filter (specific to Illumina pipeline filtering: CHASTITY >= 0.6, where chastity is the ratio of the highest of the four base-type intensities to the sum of the highest two)
- Removing reads containing at least one uncalled base
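Two of the filters above are simple enough to sketch directly from their description. The C++ below is an illustration of the stated rules only, not Illumina pipeline code; the function names are hypothetical, and how the pipeline applies the chastity threshold across cycles is not covered here.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <functional>
#include <vector>

// Chastity as described above: the highest of the four base-type
// intensities divided by the sum of the highest two.
double chastity(std::array<double, 4> intensities) {
    std::sort(intensities.begin(), intensities.end(), std::greater<double>());
    const double top_two = intensities[0] + intensities[1];
    return top_two > 0.0 ? intensities[0] / top_two : 0.0;
}

// A base call passes when its chastity reaches the threshold (0.6 above).
bool passes_chastity(const std::array<double, 4>& intensities, double threshold = 0.6) {
    return chastity(intensities) >= threshold;
}

// Keep a read only if at least two-thirds of the bases in its first half
// have Q >= 30. Integer arithmetic avoids rounding at the boundary.
bool passes_q30_first_half(const std::vector<int>& quals) {
    const std::size_t half = quals.size() / 2;
    if (half == 0) return false;  // too short to evaluate
    std::size_t good = 0;
    for (std::size_t i = 0; i < half; ++i) {
        if (quals[i] >= 30) ++good;
    }
    return 3 * good >= 2 * half;
}
```

For intensities 10, 5, 1, 0 the chastity is 10 / (10 + 5), about 0.67, so the call passes; for 6, 5, 1, 0 it is 6 / 11, about 0.55, so it fails.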
24
Summary & Suggestions
- FastQC performed better in terms of time efficiency, though it has no filtering facilities.
- HTQC can do most of what a user expects from a QC toolkit, but individual commands have to be run to do it.
- The FASTX-Toolkit can be an alternative: individual commands have to be run for each step (same as HTQC), and it can't handle paired-end reads.
- Suggested workflow: initial investigation with FastQC, then move to the FASTX-Toolkit (single end), the HTQC toolkit (paired-end), or the FASTX-Toolkit plus own script (paired-end).
25
Quiz
1. Illumina first came to the market in
   A. 2004
   B. 2005
   C. 2006
   D. 2007
2. First author of today's paper
   A. Xi Yang
   B. Feng Li
   C. Jun Wu
   D. Xue Xiao