Quality Control Hubert DENISE


1 Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

2 Metagenomics data analysis: quality control, diversity analysis, functional analysis
Image credits: (1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:171-199

3 QC rationale: why?
• Garbage in, garbage out.
• Base-call errors:
  - each base call has an associated quality score
  - each platform has its own specific error types
• Read quality decreases with read length.
• NGS generates duplicate reads (both artificial and genuine). Reducing duplication shortens analysis time and prevents analysis bias.
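To make the per-base quality score concrete, here is a minimal Python sketch of how a phred 33 quality string can be decoded. The function names are illustrative, not part of any tool used in this course. The example string is the first ten quality characters of the read shown on slide 10.

```python
def phred33_scores(quality_string):
    """Decode a phred 33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = phred33_scores("!!#$:(*1<=")
print(scores)  # [0, 0, 2, 3, 25, 7, 9, 16, 27, 28]
print(error_probability(20))  # a score of 20 means about 1 error in 100 calls
```

A score of 0 corresponds to an error probability of 1, which is why the first bases of that read are candidates for trimming.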

4 EBI Metagenomics: QC step by step
• Clipping: low-quality ends are trimmed and adapter sequences removed using the Biopython SeqIO package.
• Quality filtering: sequences with > 10% undetermined nucleotides are removed.
• Read length filtering: short sequences are removed (100 nt threshold).
• Duplicate sequence removal: reads are clustered at 99% identity (UCLUST v 1.1.579 for 454, QIIME prefix clustering for Illumina) and a representative sequence is chosen.
• Repeat masking: RepeatMasker (open-3.2.2); reads with 50% or more of their nucleotides masked are removed.
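The quality and length filters above can be sketched in a few lines of plain Python. This is an illustration only, not the EBI Metagenomics pipeline code: the function name is hypothetical, undetermined nucleotides are assumed to be written as N, and reads of exactly 100 nt are assumed to be kept.

```python
def passes_filters(seq, max_n_fraction=0.10, min_length=100):
    """Return True if a read survives the length and undetermined-base filters.

    Assumptions (not confirmed by the slides): undetermined bases are 'N',
    and a read of exactly min_length nucleotides is kept.
    """
    if not seq or len(seq) < min_length:
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction <= max_n_fraction


reads = ["A" * 120, "A" * 80, "A" * 85 + "N" * 15]
kept = [r for r in reads if passes_filters(r)]
print(len(kept))  # 1: the second read is too short, the third has 15% N
```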

5 EBI Metagenomics: QC consequences (figure panels: Roche 454, Illumina, Ion Torrent)

6 MG-RAST QC vs. EBI Metagenomics QC
MG-RAST QC:
• dereplication (first 50 bp)
• model organism screening (bowtie)
• length filtering (> 75 bp)
• ambiguous base filtering (< 5 bp)
• dynamic base filtering (phred score)
• analysis
EBI Metagenomics QC:
• duplicate sequence filtering (first 50 bp)
• repeat masking
• clipping (10%)
• quality filtering (phred score)
• read length filtering (> 100 bp)
• analysis
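Both pipelines identify duplicates from the first 50 bp of each read. The sketch below shows one simple way to do this in Python, keeping the first read seen for each prefix. It is an assumption-laden illustration: the function name is hypothetical, and the real tools may choose the representative read differently.

```python
def dereplicate_by_prefix(reads, prefix_length=50):
    """Keep the first read seen for each distinct sequence prefix.

    reads is an iterable of (name, sequence) pairs. Which read is kept as
    the representative is an assumption made for this sketch.
    """
    seen = set()
    kept = []
    for name, seq in reads:
        key = seq[:prefix_length]
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept


reads = [
    ("read1", "ACGT" * 15),          # 60 nt
    ("read2", "ACGT" * 15 + "TT"),   # same first 50 nt as read1
    ("read3", "TTGA" * 15),          # different prefix
]
print([name for name, _ in dereplicate_by_prefix(reads)])  # ['read1', 'read3']
```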

7 QC Tutorial: introduction to the exercise. Hubert Denise (hudenise@ebi.ac.uk)

8 QC Tutorial
Today we'll be investigating a dataset obtained from water sampled at varying depths in the Pacific Ocean: 25 m, 75 m, 125 m and 500 m.
First we will look at the "HOT_Station_ALOHA,_25m_depth" FASTQ sequence file using the FastQC software.
Then we will use the Trimmomatic package to perform quality and length trimming on this file.

9 Performing QC steps using Trimmomatic
All instructions are provided in the manual. Trimmomatic is written in Java, but you only need basic Unix knowledge to run it.
Trimmomatic functions:
- removal of Illumina adapters from reads
- quality filtering
- length trimming
- conversion of quality score format
In this tutorial we will only perform quality and length filtering.
More details at http://www.usadellab.org/cms/?page=trimmomatic

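10 Trimmomatic steps used in this tutorial: A - LEADING:8 TRAILING:8 (quality threshold of 8 at each end; quality scores are phred 33)
Read before trimming (5' and 3' ends shown):
@D8QSB6V1:140:HA62CADXX:1:1101:1343:2227_1:N:0:AGTTCC
TCGGTTTTTCATCCAATTGAGTCGTCCCGTTGATAGTGAACTGGTACGTCATCGACTGCA...TGCACGTTCGGATTGGTCACCTCAATCGCAATATCGTAGCGATTGTTACCCAGAGGAAATA
+
!!#$:(*1<="#HHA@IJIIJIHIJIJIJIIIJIGIBGIJJIIIFHGBHIIJIIIIIJJI...@CCFDFFFGHHHHIIIJIIJIHIJIJIJIIIJIGJHIJIIIFHGB2$'=IC5);=HA&&#%
Quality scores at the 5' end: 0 0 2 3 25 7 9 16 27 28 …
Quality scores at the 3' end: … 26 28 39 32 5 5 2 4
Trimmed sequence: the first four bases (scores 0 0 2 3) and the last four bases (scores 5 5 2 4) are below the threshold of 8 and are removed.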
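The LEADING and TRAILING steps can be mimicked with a short Python sketch: bases are removed from each end for as long as their quality is below the threshold. This is not Trimmomatic's own code, and the function name is hypothetical. The toy read joins the first ten and last eight bases of the read above, with the quality scores listed on the slide.

```python
def trim_leading_trailing(seq, quals, leading=8, trailing=8):
    """Remove low-quality bases from both ends of a read.

    Bases are dropped from the 5' end while their score is below `leading`,
    then from the 3' end while their score is below `trailing`.
    Low-quality bases inside the read are left untouched.
    """
    start = 0
    while start < len(quals) and quals[start] < leading:
        start += 1
    end = len(quals)
    while end > start and quals[end - 1] < trailing:
        end -= 1
    return seq[start:end], quals[start:end]


# Toy read: first ten and last eight bases of the slide's example read.
seq = "TCGGTTTTTC" + "AGGAAATA"
quals = [0, 0, 2, 3, 25, 7, 9, 16, 27, 28, 26, 28, 39, 32, 5, 5, 2, 4]

trimmed_seq, trimmed_quals = trim_leading_trailing(seq, quals)
print(trimmed_seq)    # TTTTTCAGGA
print(trimmed_quals)  # [25, 7, 9, 16, 27, 28, 26, 28, 39, 32]
```

Note that the base with score 7 survives: it is no longer at the end of the read once the first four bases are gone and a base with score 25 has been reached.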

11 Trimmomatic steps used in this tutorial: B - SLIDINGWINDOW:4:15 (window size 4, required average quality 15)
Read after step A:
@D8QSB6V1:140:HA62CADXX:1:1101:1343:2227_1:N:0:AGTTCC
TTTTTCATCCAATTGAGTCGTCCCGTTGATAG...CGTAGCGATTGTTACCCAGAGGA
+
:(*1<="#HHA@IJIIJIHIJIJIJIIIJIGI...JHIJIIIFHGB2$'=IC5);=HA
The window works in the 5' to 3' direction (the whole read is scanned). If the window average is ≥ 15 there is no trimming; if it is < 15 the read is trimmed.
Example windows:
• 7 9 16 25: sum = 57, avg = 14.25
• 28 1 2 27: sum = 58, avg = 14.5
• … 39 32 31: sum = 141, avg = 35.25 → no trimming
• 3 3 36 17: sum = 59, avg = 14.75 (< 15) → trimming
Example quality scores: … 40 40 7 32 5 5 2 4
The bases that remain after trimming form the final sequence.
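A simplified Python sketch of the sliding-window idea follows. It scans from the 5' end and cuts the read at the start of the first window whose average quality is below the threshold. This is an assumption made for illustration: Trimmomatic's exact cut position may differ, and the function names and the toy score list are hypothetical.

```python
def window_average(window):
    """Average quality of one window of scores."""
    return sum(window) / len(window)


def sliding_window_keep(quals, window=4, min_avg=15):
    """Return how many bases to keep from the 5' end.

    Simplified rule: cut at the start of the first window whose average
    quality is below min_avg. Reads shorter than the window are untouched.
    """
    for start in range(len(quals) - window + 1):
        if window_average(quals[start:start + window]) < min_avg:
            return start
    return len(quals)


print(window_average([7, 9, 16, 25]))   # 14.25
print(window_average([28, 1, 2, 27]))   # 14.5
print(window_average([3, 3, 36, 17]))   # 14.75

# Toy quality scores ending in the low-quality window from the slide.
quals = [40, 40, 39, 39, 32, 31, 3, 3, 36, 17]
print(sliding_window_keep(quals))  # 6: the window 3 3 36 17 averages below 15
```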

12 Hubert DENISE (hudenise@ebi.ac.uk)

