Sequencing Data Quality
Saulo Aflitos
Assembly - Concepts
- Read (≈100bp)
- Contig (≈2Kbp)
- Scaffold (≈2Mbp)
- Pseudo Molecule (Super Scaffold)
- Paired-End
- Mate-Pair
- Low Complexity Region
Scaffolding
- Scaffold (≈2Mbp)
- Pseudo Molecule (Super Scaffold)
- Paired-End
- Mate-Pair
- Low Complexity Region
Assembly
Scaffolding: Repeats?!
Depth of Coverage: Reality
[Figure, Goldberg SMD et al.: reads, contig and consensus, with depth labels 1x, 2x, 3x]
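Depth of coverage at a position is the number of reads that overlap it. A minimal sketch of that count, assuming reads are given as hypothetical (start, length) pairs on a contig; the toy reads below are made up for illustration and are not taken from the figure:

```python
def depth_of_coverage(reads, contig_length):
    """Return the per-base depth for reads given as 0-based (start, length) pairs."""
    depth = [0] * contig_length
    for start, length in reads:
        # Clip each read to the contig boundaries before counting it.
        for pos in range(max(start, 0), min(start + length, contig_length)):
            depth[pos] += 1
    return depth


# Hypothetical toy data: three 4 bp reads on a 10 bp contig.
reads = [(0, 4), (2, 4), (3, 4)]
depth = depth_of_coverage(reads, 10)
mean_depth = sum(depth) / len(depth)
```

Here `depth` varies along the contig (some bases are covered three times, some not at all), which is why a single average figure such as `mean_depth` can hide uneven coverage.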
Heterozygosity
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
NAAACGTACGTAAAANAAACGTACGTAAAA
A/C: A, C
95% ±5, 50% ±10
Consequences of Data Cleaning: Raw vs. Filtered
Sequencing: Shotgun, RNAseq
Sequencing: Paired End, Mate Pair
Sample Preparation
- Genome
- Shred: Ultrasound, Physical, RE
- Size Selection: Gel, Beads
- Adapter: ID, Binding to Surface, Circularization
- Sequencing: Illumina, 454, PacBio
Shredding
Size Selection
Sequencing: Illumina PE
- Read Length: 100bp
- Insert Size: 150bp-2Kbp
Sequencing: 454 MP
- Insert Size: 2K-20Kbp
- Read Length: 500bp, 150bp
Data
FastQ
- Machine Name
- Read ID (unique)
- Encoded Quality 0-40: chance of being wrong
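The encoded quality maps to the chance of a base being wrong through the Phred relation P = 10^(-Q/10), so Q=40 means 1 error in 10,000. A minimal sketch of the decoding, assuming the Phred+33 ASCII offset (older Illumina files used +64, so the offset is an assumption to check against your data); the function names are illustrative only:

```python
def decode_quality(quality_string, offset=33):
    """Convert an encoded quality string into integer Phred scores.

    The offset of 33 is an assumption (Sanger / recent Illumina encoding).
    """
    return [ord(ch) - offset for ch in quality_string]


def phred_to_error_probability(q):
    """Chance that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Hypothetical quality string for a 4 bp read.
scores = decode_quality("I5+!")
probabilities = [phred_to_error_probability(q) for q in scores]
```

With this encoding, `I` decodes to Q=40 and `!` to Q=0, the two ends of the 0-40 range.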
FastQ Format
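A FastQ record is four lines: a header starting with `@`, the sequence, a separator starting with `+`, and the encoded quality (one character per base). A minimal reader sketch, assuming unwrapped four-line records; the read names in the toy data are hypothetical:

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from an open four-line FastQ file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # End of file (or a blank line) ends the parse.
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FastQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Hypothetical two-read file, kept in memory for the example.
toy_file = io.StringIO(
    "@MACHINE:1:read1\nACGT\n+\nIIII\n"
    "@MACHINE:1:read2\nNACG\n+\n!5II\n"
)
records = list(read_fastq(toy_file))
```

The same generator works on a real file opened with `open("good_MiSeq_dataset.fq")`.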
FastQ Statistics (%)
Cleaning
FastQC Quality Checking Tool
- Per base sequence quality
- Per sequence quality
- Per base sequence content
- Per base GC content
- Per sequence GC content
- Per base N-content
- Sequence length distribution
- Sequence duplication
- Contamination screen (fastq screen)
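Two of the checks listed above are simple enough to sketch by hand. This is an illustration of the quantities, not FastQC's own implementation; the function names and toy reads are assumptions:

```python
def gc_percent(sequence):
    """Per-sequence GC content, as a percentage of the read length."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)


def per_base_n_percent(sequences):
    """Percentage of reads with an N at each position.

    Reads shorter than a given position are left out of that position's total.
    """
    longest = max((len(s) for s in sequences), default=0)
    result = []
    for pos in range(longest):
        covering = [s for s in sequences if len(s) > pos]
        n_count = sum(1 for s in covering if s[pos].upper() == "N")
        result.append(100.0 * n_count / len(covering))
    return result


# Hypothetical toy reads.
sequences = ["ACGT", "NCGT", "NNGG"]
gc_values = [gc_percent(s) for s in sequences]
n_values = per_base_n_percent(sequences)
```

A high N percentage at the first positions, as in this toy data, is the kind of pattern the per-base N-content plot makes visible.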
SolexaQA Cleaning Tool
Exercise
- Create “cleaning” folder: mkdir cleaning; cd cleaning
- Inside it, run: wget -O saulo.bash
- Run it with: bash saulo.bash
- This will download FastQC and SolexaQA
  - FASTQC HELP:
  - FASTQC TUTORIAL:
  - FASTQC MANUAL:
  - SolexaQA Help:
- Run FastQC: ./FastQC/fastqc &
- File > Open [Files of Type = FastQ files]
Exercise
- Verify the two .fq files (you can use less):
  - bad_MiSeq_dataset.fq
  - good_MiSeq_dataset.fq
- Clean the bad dataset with SolexaQA’s DynamicTrim.pl script:
  - perl SolexaQA_v.2.1/DynamicTrim.pl bad_MiSeq_dataset.fq -h 25
- Verify the improvement (or not) by opening:
  - bad_MiSeq_dataset.fq.trimmed
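To get a feel for what quality trimming with a Phred cutoff of 25 does, here is a simplified sketch that keeps the longest contiguous stretch of a read whose scores meet the cutoff. It is an illustration of the idea, not DynamicTrim.pl's code: the Phred+33 offset, the `>=` comparison, and the function names are all assumptions.

```python
def longest_segment_at_or_above(scores, cutoff):
    """Return (start, end) of the longest run of scores >= cutoff (end exclusive)."""
    best_start, best_len = 0, 0
    run_start = None
    for i, q in enumerate(scores):
        if q >= cutoff:
            if run_start is None:
                run_start = i
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None  # A low-quality base ends the current run.
    return best_start, best_start + best_len


def trim_read(sequence, quality, cutoff=25, offset=33):
    """Trim a read and its quality string to the best segment."""
    scores = [ord(ch) - offset for ch in quality]  # Assumes Phred+33 encoding.
    start, end = longest_segment_at_or_above(scores, cutoff)
    return sequence[start:end], quality[start:end]


# Hypothetical read: two bad bases at the start, one dip in the middle.
trimmed_seq, trimmed_qual = trim_read("ACGTACGTAC", "!!IIII5III")
```

In this toy read the dip to Q=20 splits the good bases into two runs, and only the longer one is kept, so trimming can shorten reads considerably when quality is uneven.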
?