Quality Control & Preprocessing of Metagenomic Data

Presentation transcript:

Quality Control & Preprocessing of Metagenomic Data EdwardsLab @ SDSU Robert Schmieder – rschmieder@gmail.com

Need for automated approach. Metagenomic datasets contain 100,000s (454) or 1,000,000s (Illumina) of sequences. Illumina HiSeq 2000: currently 300 GB of data, soon 2,000 GB (≈33 human genomes with 20x coverage in a single sequencing run). You cannot just read sequence by sequence to get an idea of your data.
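The "≈33 human genomes" figure can be checked with a few lines of arithmetic. This is only a sketch: the slide does not state the genome size it used, so the value of about 3 gigabases per human genome is an assumption, as is reading "GB" as gigabases.

```python
# Assumption (not stated on the slide): a human genome is about 3 gigabases,
# and the run yield "GB" is read as gigabases.
HUMAN_GENOME_GB = 3.0
COVERAGE = 20


def genomes_per_run(run_yield_gb, genome_gb=HUMAN_GENOME_GB, coverage=COVERAGE):
    """Number of genomes that one run can cover at the given depth."""
    return run_yield_gb / (genome_gb * coverage)


# 2,000 / (3 * 20) = 33.3, which rounds to the 33 quoted on the slide.
print(round(genomes_per_run(2000)))
```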

Basic data analysis Perform similarity search New dataset Assemble

Bad data analysis

Good data analysis New dataset Quality control & Preprocessing Similarity search Assembly

3 Tools for metagenomic data http://prinseq.sourceforge.net http://tagcleaner.sourceforge.net http://deconseq.sourceforge.net

Quality control and data preprocessing

Number and Length of Sequences

Number/Length of sequences. Reads should be approx. the same length (same number of cycles), so short reads are likely lower quality. (Figure panels: Bad, Good)
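A quick way to check the number and length of sequences is to summarize the length distribution before doing anything else. The sketch below is illustrative only; the function name `length_summary` is hypothetical and is not part of PRINSEQ.

```python
from statistics import mean, pstdev


def length_summary(seqs):
    """Return basic statistics on read lengths for a list of sequence strings."""
    lengths = [len(s) for s in seqs]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        # A large spread hints at many truncated reads.
        "stdev": pstdev(lengths),
    }


summary = length_summary(["ACGT", "ACGTAC", "AC"])
print(summary["count"], summary["min"], summary["max"], summary["mean"])
```

A dataset in which the minimum is far below the mean is a candidate for a minimum-length filter.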

Quality of Sequences

Linearly degrading quality across the read: trim low-quality ends
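Trimming low-quality ends can be sketched as follows. This is a minimal illustration, not the algorithm of any particular tool: the threshold of 20 and the Phred+33 encoding are assumptions, and the function names are hypothetical.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def trim_3prime(seq, quals, min_qual=20):
    """Drop bases from the 3' end until one reaches min_qual (an assumed cutoff)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    return seq[:end], quals[:end]


seq, quals = trim_3prime("ACGTAC", [30, 30, 28, 25, 12, 8])
print(seq, quals)
```

Only the trailing run of low scores is removed here; a windowed mean would be a more tolerant alternative when a single good base sits inside a poor tail.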

Quality filtering. Any region with a homopolymer will tend to have a lower quality score. Huse et al. found that sequences with an average score below 25 had more errors than those with higher averages. Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
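A mean-quality filter based on the cutoff of 25 reported by Huse et al. might look like the sketch below. The function names are hypothetical, and treating a mean of exactly 25 as passing is an assumption.

```python
def mean_quality(quals):
    """Arithmetic mean of a non-empty list of per-base quality scores."""
    return sum(quals) / len(quals)


def passes_mean_quality(quals, min_mean=25):
    """Keep a read only if it has scores and their mean is at least min_mean."""
    return bool(quals) and mean_quality(quals) >= min_mean


reads = {"r1": [30, 30, 20, 20], "r2": [30, 20, 20, 20]}
kept = [name for name, quals in reads.items() if passes_mean_quality(quals)]
print(kept)
```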

Low quality sequence issue. Most assemblers or aligners do not take quality scores into account. Errors in reads complicate assembly, might cause misassembly, or make assembly impossible.

What if quality scores are not available? Alternative: infer quality from the percent of Ns found in the sequence. This removes regions with a high number of Ns. Huse et al. found that the presence of any ambiguous base calls was a sign of overall poor sequence quality. Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
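When no quality scores exist, the percent of Ns can stand in for them. The sketch below uses hypothetical function names; the default of allowing no Ns at all follows the Huse et al. observation and can be relaxed if the loss of reads is too large.

```python
def percent_n(seq):
    """Percentage of positions in seq that are the ambiguous base N."""
    if not seq:
        return 0.0
    return 100.0 * sum(1 for base in seq.upper() if base == "N") / len(seq)


def passes_n_filter(seq, max_percent_n=0.0):
    """Keep a read only if its share of Ns does not exceed max_percent_n."""
    return percent_n(seq) <= max_percent_n


print(percent_n("ACGTNNACGT"))
print(passes_n_filter("ACGTNNACGT", max_percent_n=25.0))
```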

Ambiguous bases. If you can afford the loss, filter out all reads containing Ns. Assemblers (e.g. Velvet) and aligners (SSAHA2, BWA, …) use a 2-bit encoding system for nucleotides; some replace Ns with a random base, some with a fixed base (e.g. SSAHA2 & Velvet = A). 2-bit example: 00 – A, 01 – C, 10 – G, 11 – T
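The 2-bit scheme has no room for a fifth symbol, which is why an N has to become some real base. The following sketch packs a sequence into an integer using the mapping from the slide; it is an illustration of the idea, not the internal representation of Velvet, SSAHA2, or BWA, and the function names are hypothetical.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def encode_2bit(seq, n_replacement="A"):
    """Pack a DNA string into an integer, two bits per base.

    N has no code of its own, so it is replaced by a fixed base first.
    """
    value = 0
    for base in seq.upper():
        if base == "N":
            base = n_replacement
        value = (value << 2) | CODES[base]
    return value


def decode_2bit(value, length):
    """Unpack an integer produced by encode_2bit back into a DNA string."""
    return "".join(
        BASES[(value >> (2 * (length - 1 - i))) & 0b11] for i in range(length)
    )


# The N is silently turned into an A and cannot be recovered afterwards.
print(decode_2bit(encode_2bit("ACNT"), 4))
```

After encoding, a read containing Ns is indistinguishable from a read with real bases at those positions, which is the reason for filtering such reads beforehand.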

Sequence duplicates

Real or artificial duplicate? Metagenomics = random sampling of genomic material. Why do reads start at the same position? Why do these reads have the same errors? No specific pattern or location on sequencing plate. 11-35% Gomez-Alvarez et al.: Systematic artifacts in metagenomes from complex microbial communities. ISME Journal (2009)
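Exact duplicates, meaning reads that start at the same position and carry the same errors, can be found by grouping reads on their sequence. This is a minimal sketch with hypothetical function names; it handles only exact, same-strand duplicates and makes no attempt to decide whether a duplicate is real or artificial.

```python
from collections import defaultdict


def group_exact_duplicates(reads):
    """Map each sequence seen more than once to the list of read IDs sharing it.

    reads is an iterable of (read_id, sequence) pairs.
    """
    groups = defaultdict(list)
    for read_id, seq in reads:
        groups[seq.upper()].append(read_id)
    return {seq: ids for seq, ids in groups.items() if len(ids) > 1}


def duplicate_fraction(reads):
    """Fraction of reads that would be removed by keeping one copy per sequence."""
    reads = list(reads)
    if not reads:
        return 0.0
    unique = {seq.upper() for _, seq in reads}
    return 1 - len(unique) / len(reads)


example = [("r1", "ACGT"), ("r2", "acgt"), ("r3", "GGCC"), ("r4", "ACGT")]
print(group_exact_duplicates(example))
print(duplicate_fraction(example))
```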

One micro-reactor – Many beads Martine Yerle (Laboratory of Cellular Genetics, INRA, France)

Impacts of duplicates. False variant (SNP) calling. Require more computing resources: find similar database sequences for the same query sequence, assembly process takes longer, increase in memory requirements. Abundance or expression measures can be wrong.

Depends on the experiment. In contrast, for Illumina reads with high coverage, eliminating singletons is an easy way of dramatically reducing the number of error-prone reads.

Tag Sequences

No tag MID tag WTA tags

Detect and remove tag sequences
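Removing a known tag from the 5' end can be sketched as a prefix comparison that tolerates a few mismatches. This is not TagCleaner's algorithm: the function names, the example tag, and the mismatch limit of 1 are all hypothetical, and insertions or deletions inside the tag are not handled.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def trim_5prime_tag(seq, tag, max_mismatches=1):
    """Remove tag from the start of seq if it matches within max_mismatches."""
    if len(seq) < len(tag):
        return seq
    prefix = seq[: len(tag)].upper()
    if hamming(prefix, tag.upper()) <= max_mismatches:
        return seq[len(tag):]
    return seq


TAG = "ACGAGTGCGT"  # hypothetical example tag
print(trim_5prime_tag(TAG + "TTGACC", TAG))
```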

Fragment-to-fragment concatenations

Concatenated fragments in assembled contigs

Data upload Tag sequence definition

Tag sequence prediction

Parameter definition Download results

Sequence Contamination

Principal component analysis (PCA) of dinucleotide relative abundance Microbial metagenomes Viral metagenomes

Identification and removal of sequence contamination

Contaminant identification. Current methods have critical limitations. Dinucleotide relative abundance uses the information content in sequences, so it cannot identify single contaminant sequences. Sequence similarity seems to be the only reliable option to identify single contaminant sequences. BLAST against the human reference genome is slow, and the reference lacks corresponding regions (gaps, variants, …). Novel sequences are found in every new human genome sequenced* * Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)
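For reference, dinucleotide relative abundance compares the observed frequency of each dinucleotide with the frequency expected from its two bases. The slides do not give the formula, so the sketch below assumes the common odds-ratio form, observed f(XY) divided by f(X) times f(Y), computed on one strand only; the function name is hypothetical.

```python
from collections import Counter
from itertools import product


def dinucleotide_relative_abundance(seq):
    """Return {dinucleotide: f(XY) / (f(X) * f(Y))} for all 16 dinucleotides.

    Assumed single-strand odds-ratio form; positions with non-ACGT
    characters are skipped. Undefined ratios are reported as 0.0.
    """
    seq = seq.upper()
    mono = Counter(base for base in seq if base in "ACGT")
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in "ACGT" and seq[i + 1] in "ACGT"
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    rho = {}
    for x, y in product("ACGT", repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = di[x + y] / n_di if n_di else 0.0
        rho[x + y] = observed / expected if expected else 0.0
    return rho


profile = dinucleotide_relative_abundance("ACACGTGTACGT")
print(len(profile))
```

A profile like this is stable only over many sequences or long contigs, which illustrates why the method characterizes whole datasets but cannot flag a single short contaminant read.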

DeconSeq web interface Two types of reference databases Remove Retain

DeconSeq web interface (cont.)

Human DNA contamination identified in 145 out of 202 metagenomes

Conclusions. Quality control and data preprocessing are very important to increase the quality of downstream analysis. Preprocessing depends on the experiment.