Preprocessing Data Rob Schmieder.


1 Preprocessing Data Rob Schmieder

2 Bad data analysis

3 Good data analysis New dataset Quality control & Preprocessing
Similarity search Assembly

4 3 Tools for metagenomic data

5 http://edwards.sdsu.edu/prinseq Quality control and data preprocessing
Rob Schmieder

6 Number and length of sequences
Reads should be approximately the same length (same number of cycles) → short reads are likely lower quality
[Figure panels: Bad, Good]

7 Linearly degrading quality across the read
Trim low quality ends
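
As a rough illustration of end trimming, the sketch below removes bases from both ends of a read until it reaches a base at or above a quality threshold. The function name `trim_low_quality_ends` and the threshold `min_q=20` are assumptions chosen for this example; they are not taken from the slides or from any particular tool.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Trim both ends of a read until a base with quality >= min_q is reached.

    seq   -- read sequence as a string
    quals -- list of integer quality scores, one per base
    min_q -- illustrative threshold (an assumption, not a recommended value)
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    start, end = 0, len(seq)
    # Walk inward from the 5' end while the quality is too low.
    while start < end and quals[start] < min_q:
        start += 1
    # Walk inward from the 3' end while the quality is too low.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


# A read whose quality degrades roughly linearly towards the 3' end.
read = "ACGTACGTAC"
quals = [35, 34, 33, 30, 28, 25, 21, 15, 9, 4]
trimmed_seq, trimmed_quals = trim_low_quality_ends(read, quals)
print(trimmed_seq)
```

Production trimmers often use sliding-window means rather than single-base thresholds; the single-base version is shown only because it is the simplest correct instance of the idea.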

8 High quality throughout the sequence
Good quality through the length of the sequence
Sequence quality falls off quickly → Bad sequence data

9 Ion quality scores

10 Low quality sequence issues
Most assemblers or aligners do not take quality scores into account
Errors in reads complicate assembly, might cause misassembly, or make assembly impossible

11 What if quality scores are not available?
Alternative: Infer quality from the percent of Ns found in the sequence
Removes regions with a high number of Ns
Huse et al. found that the presence of any ambiguous base calls was a sign of overall poor sequence quality
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
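
A minimal sketch of the N-based filter described above, assuming reads are plain strings. The helper names `n_fraction` and `filter_by_n_content` and the `max_n_fraction` parameter are hypothetical and only serve this example.

```python
def n_fraction(seq):
    """Return the fraction of bases in seq that are ambiguous (N)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def filter_by_n_content(reads, max_n_fraction=0.0):
    """Keep reads whose fraction of Ns does not exceed max_n_fraction.

    The default of 0.0 drops every read containing at least one N,
    which mirrors the "any ambiguous base call" observation above.
    """
    return [r for r in reads if n_fraction(r) <= max_n_fraction]


reads = ["ACGT", "ACNT", "NNNN"]
strict = filter_by_n_content(reads)          # no Ns allowed
lenient = filter_by_n_content(reads, 0.25)   # up to 25% Ns allowed
```

With the strict default only `"ACGT"` survives; raising the cutoff trades data loss against retained low-confidence reads.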


13 Ambiguous bases
If you can afford the loss, filter out all reads containing Ns
Assemblers (e.g. Velvet) and aligners (SSAHA2, BWA, …) use a 2-bit encoding system for nucleotides
Some replace Ns with a random base, some with a fixed base (e.g. SSAHA2 & Velvet = A)
2-bit example: 00 – A, 01 – C, 10 – G, 11 – T
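
To make the consequence of 2-bit encoding concrete, the sketch below packs a sequence using the mapping from the slide (00 = A, 01 = C, 10 = G, 11 = T) and substitutes a fixed base for anything else. The functions `encode_2bit` and `decode_2bit` are hypothetical helpers written for this example, not code from any assembler or aligner.

```python
# Mapping taken from the 2-bit example on the slide.
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_2bit(seq, n_replacement="A"):
    """Pack a DNA string into an integer, two bits per base.

    Two bits can only represent four symbols, so any other character
    (such as N) has to be replaced; here it becomes n_replacement.
    """
    value = 0
    for base in seq.upper():
        if base not in TWO_BIT:
            base = n_replacement
        value = (value << 2) | TWO_BIT[base]
    return value


def decode_2bit(value, length):
    """Unpack an integer produced by encode_2bit back into a DNA string."""
    bases = "ACGT"
    return "".join(bases[(value >> (2 * i)) & 0b11] for i in reversed(range(length)))


encoded = encode_2bit("ANGT")
decoded = decode_2bit(encoded, 4)
print(decoded)
```

The round trip returns `AAGT`: the N has silently become a real base call, which is why filtering reads with Ns beforehand can be the safer choice.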

14 Quality filtering
Any region with a homopolymer will tend to have a lower quality score
Huse et al. found that sequences with an average score below 25 had more errors than those with higher averages
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
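
A mean-quality filter along these lines could look like the sketch below. The cutoff of 25 comes from the Huse et al. observation quoted above; the function names and the assumption that qualities arrive as a Phred+33 encoded string (the common FASTQ convention) are mine.

```python
def phred33_to_scores(quality_string):
    """Convert a Phred+33 encoded quality string into integer scores.

    Phred+33 is an assumption here; older files may use other offsets.
    """
    return [ord(ch) - 33 for ch in quality_string]


def mean_quality(scores):
    """Arithmetic mean of the per-base quality scores (0.0 for empty reads)."""
    return sum(scores) / len(scores) if scores else 0.0


def passes_quality_filter(scores, min_mean=25):
    """Keep a read only if its mean quality is at least min_mean."""
    return mean_quality(scores) >= min_mean


good = phred33_to_scores("IIII")   # 'I' encodes a score of 40
poor = [30, 20, 20, 20]            # mean of 22.5
print(passes_quality_filter(good), passes_quality_filter(poor))
```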

15 Sequence duplicates

16 Real or artificial duplicate?
Metagenomics = random sampling of genomic material
Why do reads start at the same position?
Why do these reads have the same errors?
No specific pattern or location on sequencing plate
Artificial duplicates: 11-35%
Gomez-Alvarez et al.: Systematic artifacts in metagenomes from complex microbial communities. ISME Journal (2009)

17

18 One micro-reactor – Many beads
Martine Yerle (Laboratory of Cellular Genetics, INRA, France)

19 Impacts of duplicates
False variant (SNP) calling
Require more computing resources
Find similar database sequences for same query sequence
Assembly process takes longer
Increase in memory requirements
Abundance or expression measures can be wrong

20 Impacts of duplicates
False variant (SNP) calling
Require more computing resources
Find similar database sequences for same query sequence
Assembly process takes longer
Increase in memory requirements
Abundance or expression measures can be wrong
Reference: ...ACCACACGTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
                     GTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
                          GTGTACATGAACACAGTATATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
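
The effect of duplicates on variant support can be sketched with the seven reads from the example above. The code collapses exact duplicates only; the helper name `collapse_exact_duplicates` is hypothetical, and real duplicate filters typically also consider near-identical reads, which this sketch does not attempt.

```python
from collections import Counter


def collapse_exact_duplicates(reads):
    """Return (unique_reads, counts) for a list of read sequences.

    unique_reads keeps the first occurrence of each distinct sequence in
    input order; counts records how often each sequence was seen.
    """
    counts = Counter(reads)
    unique_reads = list(dict.fromkeys(reads))
    return unique_reads, counts


# Reads from the slide: two distinct reads plus five identical copies
# that all carry the same mismatch (C instead of A) against the reference.
variant_read = "TGAACACAGTCTATGAGCATACAGAT"
reads = [
    "GTGTTGTGTACATGAACACAGTATATGAGCATACAGAT",
    "GTGTACATGAACACAGTATATGAGCATACAGAT",
] + [variant_read] * 5

unique_reads, counts = collapse_exact_duplicates(reads)
print(len(reads), len(unique_reads), counts[variant_read])
```

Before collapsing, the mismatching base is supported by 5 of 7 reads and looks like a convincing SNP; afterwards it is supported by 1 of 3, which is much easier to recognise as a likely sequencing error.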


22 Detect and remove tag sequences

23 No tag MID tag WTA tags

24 Imperfect primer annealing

25 Fragment-to-fragment concatenations

26 Data upload Tag sequence definition

27 Tag sequence prediction

28 Parameter definition Download results

29 Identification and removal of sequence contamination

30 Contaminant identification
Previous methods had critical limitations
Dinucleotide relative abundance uses information content in sequences → cannot identify single contaminant sequences
Sequence similarity seems to be the only reliable option to identify single contaminant sequences
BLAST against the human reference genome is slow, and the reference lacks corresponding regions (gaps, variants, …)
Novel sequences in every new human genome sequenced*
* Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)
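
For readers unfamiliar with dinucleotide relative abundance, the sketch below computes the commonly used odds ratio, observed dinucleotide frequency divided by the product of the two mononucleotide frequencies, on a single strand. This is a simplified assumption for illustration: published variants often combine both strands, and the function name is hypothetical.

```python
from collections import Counter
from itertools import product


def dinucleotide_relative_abundance(seq):
    """Return {dinucleotide: f_xy / (f_x * f_y)} for all 16 dinucleotides.

    Frequencies are computed on the given strand only (a simplification).
    Dinucleotides whose expected frequency is zero are reported as 0.0.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in "ACGT")
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in "ACGT" and seq[i + 1] in "ACGT"
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    rho = {}
    for x, y in product("ACGT", repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = di[x + y] / n_di if n_di else 0.0
        rho[x + y] = observed / expected if expected > 0 else 0.0
    return rho


rho = dinucleotide_relative_abundance("ACGTACGTACGT")
print(round(rho["AC"], 3), rho["AA"])
```

On a read of only a few hundred bases most of the 16 values rest on a handful of counts, which illustrates why the signature is useful for whole datasets but cannot reliably flag a single contaminant sequence.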

31 Faster algorithms for Next-gen data

32 Principal component analysis (PCA) of dinucleotide relative abundance
Microbial metagenomes Viral metagenomes

33 Contaminant identification
Current methods have critical limitations
Dinucleotide relative abundance uses information content in sequences → cannot identify single contaminant sequences
Sequence similarity seems to be the only reliable option to identify single contaminant sequences
BLAST against the human reference genome is slow, and the reference lacks corresponding regions (gaps, variants, …)
Novel sequences in every new human genome sequenced*
* Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)

34 DeconSeq web interface
Two types of reference databases: Remove and Retain

35 DeconSeq web interface (cont.)

36 DeconSeq
Identity = How similar is the query sequence to the reference sequence
Coverage = How much of the query sequence is similar to the reference sequence
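
One plausible way to turn these two definitions into numbers is sketched below: identity as matching positions over alignment length, and coverage as aligned query bases over query length. The exact formulas used by DeconSeq are not stated on the slide, so both the formulas and the thresholds in `exceeds_thresholds` are assumptions made for illustration.

```python
def identity_and_coverage(matches, alignment_length, aligned_query_bases, query_length):
    """Return (identity %, coverage %) for one query-to-reference alignment.

    matches             -- identical positions in the alignment
    alignment_length    -- total alignment columns, including gaps
    aligned_query_bases -- query bases that take part in the alignment
    query_length        -- full length of the query read
    """
    if alignment_length <= 0 or query_length <= 0:
        raise ValueError("lengths must be positive")
    identity = 100.0 * matches / alignment_length
    coverage = 100.0 * aligned_query_bases / query_length
    return identity, coverage


def exceeds_thresholds(identity, coverage, min_identity=90.0, min_coverage=90.0):
    """True if a hit is both similar enough and covers enough of the query.

    The default thresholds are placeholders, not values from the slides.
    """
    return identity >= min_identity and coverage >= min_coverage


# A 200-base read where only half of the read aligns, at 95% identity.
identity, coverage = identity_and_coverage(95, 100, 100, 200)
print(identity, coverage, exceeds_thresholds(identity, coverage))
```

The example shows why both measures are needed: a hit can be highly similar over the aligned part yet cover too little of the read to count as a confident match.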

37 DeconSeq
Blue = More similar to “retain”
Red = More similar to “remove”

38 Human DNA contamination identified in 145 out of 202 metagenomes

39

