1
Quality Control & Preprocessing of Metagenomic Data
Robert Schmieder, SDSU
2
Need for automated approach
Metagenomic datasets contain 100,000s (454) or 1,000,000s (Illumina) of sequences
Illumina HiSeq 2000: currently 300 GB of data per run, soon 2,000 GB (≈33 human genomes at 20x coverage in a single sequencing run)
You cannot just read sequence by sequence to get an idea of your data
3
Basic data analysis: New dataset → Similarity search / Assembly
4
Bad data analysis
10
Good data analysis: New dataset → Quality control & Preprocessing → Similarity search / Assembly
14
3 Tools for metagenomic data
15
Quality control and data preprocessing
16
Number and Length of Sequences
17
Number/Length of sequences
Reads should be approx. the same length (same number of cycles); short reads are likely lower quality (example length distributions shown as good vs. bad)
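A quick way to check this without a dedicated tool is to tabulate the read-length distribution straight from the FASTQ file. A minimal sketch in plain Python; the file name reads.fastq is only a placeholder:

```python
from collections import Counter

def length_distribution(fastq_path):
    """Count how many reads have each length in a FASTQ file."""
    counts = Counter()
    with open(fastq_path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the second line of each 4-line FASTQ record is the sequence
                counts[len(line.rstrip("\n"))] += 1
    return counts

if __name__ == "__main__":
    for length, n in sorted(length_distribution("reads.fastq").items()):
        print(f"{length}\t{n}")
```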
18
Quality of Sequences
19
Linearly degrading quality across the read
Trim low-quality ends
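One common way to trim low-quality ends is to clip bases from the 3' end while they fall below a Phred cutoff. A minimal sketch, assuming Phred+33 encoded quality strings and an illustrative cutoff of 20 (the slides do not prescribe a specific threshold):

```python
def trim_3prime(seq, qual, cutoff=20):
    """Drop bases from the 3' end while their Phred score is below `cutoff`."""
    scores = [ord(c) - 33 for c in qual]  # Phred+33 ASCII offset assumed
    end = len(seq)
    while end > 0 and scores[end - 1] < cutoff:
        end -= 1
    return seq[:end], qual[:end]

# Example: the trailing '#' characters encode Q2 and are trimmed off
print(trim_3prime("ACGTACGT", "IIIII###"))  # -> ('ACGTA', 'IIIII')
```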
20
Quality filtering
Any region with a homopolymer will tend to have a lower quality score
Huse et al. found that sequences with an average score below 25 had more errors than those with higher averages
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
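The Huse et al. observation translates directly into a mean-quality filter. A minimal sketch, again assuming Phred+33 encoded quality strings:

```python
def passes_mean_quality(qual, threshold=25):
    """Keep a read only if its average Phred score reaches the threshold (25 per Huse et al.)."""
    scores = [ord(c) - 33 for c in qual]
    return sum(scores) / len(scores) >= threshold

print(passes_mean_quality("IIIIIIII"))  # Q40 throughout -> True
print(passes_mean_quality("########"))  # Q2 throughout  -> False
```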
21
Low quality sequence issue
Most assemblers and aligners do not take quality scores into account
Errors in reads complicate assembly, can cause misassembly, or make assembly impossible
22
What if quality scores are not available?
Alternative: Infer quality from the percentage of Ns found in the sequence
Remove regions with a high number of Ns
Huse et al. found that the presence of any ambiguous base calls was a sign of overall poor sequence quality
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
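When quality strings are missing, the same idea can be approximated by the fraction of Ns per read. A minimal sketch; the 5% cutoff is an illustrative choice, not a value from the slides:

```python
def too_many_ns(seq, max_fraction=0.05):
    """Flag a read whose fraction of ambiguous bases exceeds `max_fraction`."""
    return seq.upper().count("N") / len(seq) > max_fraction

print(too_many_ns("ACGTNNACGT"))            # 20% Ns -> True
print(too_many_ns("ACGTACGTACGTACGTACGN"))  # 5% Ns  -> False
```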
23
Ambiguous bases
If you can afford the loss, filter out all reads containing Ns
Assemblers (e.g. Velvet) and aligners (SSAHA2, BWA, …) use a 2-bit encoding system for nucleotides
Some replace Ns with a random base, some with a fixed base (e.g. SSAHA2 & Velvet use A)
2-bit example: 00 – A, 01 – C, 10 – G, 11 – T
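To see why Ns are a problem for these tools, here is a minimal sketch of the 2-bit encoding from the slide, replacing N with the fixed base A as the slide describes for SSAHA2 and Velvet:

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode_2bit(seq):
    """Pack a nucleotide sequence into an integer, two bits per base."""
    value = 0
    for base in seq.upper():
        if base == "N":
            base = "A"  # N has no 2-bit code; substitute a fixed base
        value = (value << 2) | CODES[base]
    return value

print(bin(encode_2bit("ACNT")))  # N becomes A, so ACAT encodes as 0b10011 (bit pairs 00 01 00 11)
```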
24
Sequence duplicates
25
Real or artificial duplicate?
Metagenomics = random sampling of genomic material
Why do reads start at the same position? Why do these reads have the same errors?
No specific pattern or location on the sequencing plate
11-35% of reads reported as artificial duplicates
Gomez-Alvarez et al.: Systematic artifacts in metagenomes from complex microbial communities. ISME (2009)
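A minimal sketch of detecting exact duplicates by keeping a set of already-seen sequences; dedicated preprocessing tools additionally collapse 5' prefix duplicates and near-identical copies, which this simple version ignores:

```python
def find_exact_duplicates(seqs):
    """Return the indices of reads that are exact copies of an earlier read."""
    seen = set()
    duplicates = []
    for i, seq in enumerate(seqs):
        if seq in seen:
            duplicates.append(i)
        else:
            seen.add(seq)
    return duplicates

print(find_exact_duplicates(["ACGT", "ACGT", "TTGA"]))  # -> [1]
```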
27
One micro-reactor – Many beads
Martine Yerle (Laboratory of Cellular Genetics, INRA, France)
28
Impacts of duplicates
False variant (SNP) calling
More computing resources required: similar database sequences are found repeatedly for the same query sequence, the assembly process takes longer, and memory requirements increase
Abundance or expression measures can be wrong
30
Depends on the experiment
In contrast, for Illumina reads with high coverage, eliminating singletons is an easy way of dramatically reducing the number of error-prone reads
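Under the reading that a "singleton" here is a read whose exact sequence occurs only once, this is a minimal sketch of that filter for a high-coverage Illumina dataset:

```python
from collections import Counter

def drop_singletons(seqs):
    """Keep only reads whose exact sequence is observed at least twice."""
    counts = Counter(seqs)
    return [s for s in seqs if counts[s] > 1]

print(drop_singletons(["ACGT", "ACGT", "TTGA"]))  # the singleton 'TTGA' is removed
```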
31
Tag Sequences
32
Examples: no tag, MID tag, WTA tags
33
Detect and remove tag sequences
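A minimal sketch of stripping a known 5' tag from a read; the tag sequence used here is only an example, and real tools also tolerate sequencing errors within the tag:

```python
def remove_tag(seq, qual, tag="ACGAGTGCGT"):
    """Strip an exact 5' tag (e.g. a MID barcode) from a read and its quality string."""
    if seq.startswith(tag):
        return seq[len(tag):], qual[len(tag):]
    return seq, qual

print(remove_tag("ACGAGTGCGTTTAGC", "IIIIIIIIIIIIIII"))  # -> ('TTAGC', 'IIIII')
```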
34
Fragment-to-fragment concatenations
35
Concatenated fragments in assembled contigs
37
Data upload / Tag sequence definition
38
Tag sequence prediction
39
Parameter definition / Download results
40
Sequence Contamination
41
Principal component analysis (PCA) of dinucleotide relative abundance
(Plot legend: microbial metagenomes vs. viral metagenomes)
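A minimal sketch of how such a PCA can be computed: per-sequence dinucleotide relative abundances (rho_XY = f_XY / (f_X * f_Y)) fed into a PCA. It assumes numpy and scikit-learn are available, and the toy sequences are made up:

```python
from itertools import product
import numpy as np
from sklearn.decomposition import PCA

BASES = "ACGT"
DINUCS = ["".join(p) for p in product(BASES, repeat=2)]

def relative_abundance(seq):
    """Dinucleotide odds ratios rho_XY = f_XY / (f_X * f_Y) for one sequence."""
    seq = seq.upper()
    n = len(seq)
    mono = {b: seq.count(b) / n for b in BASES}
    di = dict.fromkeys(DINUCS, 0)
    for i in range(n - 1):
        pair = seq[i:i + 2]
        if pair in di:
            di[pair] += 1
    total = max(n - 1, 1)
    return [di[d] / total / (mono[d[0]] * mono[d[1]]) if mono[d[0]] and mono[d[1]] else 0.0
            for d in DINUCS]

# Toy example with three made-up sequences; real input would be one profile per metagenome
profiles = np.array([relative_abundance(s) for s in ["ACGTACGTGG", "TTTTACGCGC", "GGGCCCATAT"]])
coords = PCA(n_components=2).fit_transform(profiles)
print(coords.shape)  # (3, 2): each profile projected onto the first two components
```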
42
Identification and removal of sequence contamination
43
Contaminant identification
Current methods have critical limitations
Dinucleotide relative abundance uses the information content of sequences, but cannot identify single contaminant sequences
Sequence similarity seems to be the only reliable option to identify single contaminant sequences
BLAST against the human reference genome is slow, and the reference lacks corresponding regions (gaps, variants, …)
Novel sequences are found in every new human genome sequenced*
* Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)
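A minimal sketch of the similarity-based screening idea using a standard aligner: map reads against a human reference and keep only the reads that do not align. It assumes `bwa` is installed and that `human_ref.fa` has been indexed with `bwa index`; the DeconSeq web tool shown on the following slides builds on this same similarity-search idea with its own reference databases.

```python
import subprocess

def non_human_read_ids(reads_fastq, reference="human_ref.fa", out_sam="screen.sam"):
    """Align reads to a reference and return the IDs of reads that did not map."""
    with open(out_sam, "w") as out:
        subprocess.run(["bwa", "mem", reference, reads_fastq], stdout=out, check=True)
    clean_ids = []
    with open(out_sam) as sam:
        for line in sam:
            if line.startswith("@"):
                continue  # skip SAM header lines
            fields = line.split("\t")
            if int(fields[1]) & 4:  # SAM flag bit 4 = read is unmapped
                clean_ids.append(fields[0])  # unmapped reads are kept as non-human
    return clean_ids
```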
44
DeconSeq web interface
Two types of reference databases: Remove and Retain
45
DeconSeq web interface (cont.)
46
Human DNA contamination identified in 145 out of 202 metagenomes
47
Conclusions
Quality control and data preprocessing are very important to increase the quality of downstream analysis
Preprocessing depends on the experiment