Preprocessing Data Rob Schmieder.


Bad data analysis

Good data analysis: new dataset, then quality control & preprocessing, then similarity search and assembly

3 tools for metagenomic data: PRINSEQ (http://prinseq.sourceforge.net), TagCleaner (http://tagcleaner.sourceforge.net), DeconSeq (http://deconseq.sourceforge.net)

PRINSEQ: Quality control and data preprocessing (http://edwards.sdsu.edu/prinseq) Rob Schmieder

Number and length of sequences: Reads should be approximately the same length (same number of cycles) → short reads are likely lower quality. [Figure panels: Bad, Good]

Linearly degrading quality across the read: trim low-quality ends
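Trimming low-quality ends can be sketched as below. This is a minimal illustration, not the PRINSEQ implementation: the function name, the threshold of 20, and the simple "stop at the first good base from the 3' end" rule are all assumptions made for the example.

```python
def trim_low_quality_end(seq, quals, min_qual=20):
    """Trim bases from the 3' end while their quality is below min_qual.

    seq   -- read sequence as a string
    quals -- list of per-base quality scores, same length as seq
    min_qual is a hypothetical threshold chosen for illustration.
    """
    end = len(seq)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    return seq[:end], quals[:end]


seq = "ACGTACGTAC"
quals = [35, 34, 33, 30, 28, 25, 22, 15, 10, 5]
trimmed_seq, trimmed_quals = trim_low_quality_end(seq, quals)
print(trimmed_seq)  # ACGTACG
```

A windowed mean instead of a single-base test would be less sensitive to one isolated good base near the end; the single-base rule is used here only to keep the sketch short.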

High quality throughout the sequence: Good quality through the length of the sequence. Sequence quality falls off quickly → bad sequence data

Ion quality scores

Low quality sequence issues: Most assemblers or aligners do not take quality scores into account. Errors in reads complicate assembly, might cause misassembly, or make assembly impossible.

What if quality scores are not available? Alternative: infer quality from the percentage of Ns found in the sequence and remove regions with a high number of Ns. Huse et al. found that the presence of any ambiguous base calls was a sign of overall poor sequence quality. (Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007))
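The percent-of-Ns idea can be sketched as follows. The function names and the default cutoff (0%, meaning any read with an N is dropped) are assumptions for illustration, not settings taken from the slides or from a particular tool.

```python
def percent_n(seq):
    """Return the percentage of ambiguous bases (N) in a sequence."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)


def filter_by_n(reads, max_percent_n=0.0):
    """Keep only reads whose N content does not exceed max_percent_n."""
    return [read for read in reads if percent_n(read) <= max_percent_n]


reads = ["ACGTACGT", "ACGNNCGT", "NNNNACGT"]
print([percent_n(r) for r in reads])         # [0.0, 25.0, 50.0]
print(filter_by_n(reads))                    # ['ACGTACGT']
print(filter_by_n(reads, max_percent_n=25))  # ['ACGTACGT', 'ACGNNCGT']
```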

Ambiguous bases: If you can afford the loss, filter out all reads containing Ns. Assemblers (e.g. Velvet) and aligners (SSAHA2, BWA, …) use a 2-bit encoding system for nucleotides, which has no code for N: some replace Ns with a random base, some with a fixed base (e.g. SSAHA2 and Velvet use A). 2-bit example: 00 = A, 01 = C, 10 = G, 11 = T
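To see why an N cannot survive 2-bit encoding, here is a small sketch that packs a sequence into an integer using the codes from the slide. It is only an illustration of the encoding, not how Velvet, SSAHA2, or BWA store reads internally; the function names and the choice to replace every non-ACGT character with a fixed base are assumptions.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def encode_2bit(seq, n_replacement="A"):
    """Pack a sequence into an integer, 2 bits per base.

    Any character without a 2-bit code (such as N) is replaced by
    n_replacement, mirroring the 'fixed base' strategy.
    """
    value = 0
    for base in seq.upper():
        if base not in CODES:
            base = n_replacement
        value = (value << 2) | CODES[base]
    return value


def decode_2bit(value, length):
    """Unpack an integer produced by encode_2bit back into a string."""
    out = []
    for shift in range(2 * (length - 1), -1, -2):
        out.append(BASES[(value >> shift) & 0b11])
    return "".join(out)


packed = encode_2bit("ANGT")
print(decode_2bit(packed, 4))  # AAGT (the N has silently become an A)
```

The round trip shows the information loss: after encoding, a replaced N is indistinguishable from a real A, which is why filtering reads with Ns beforehand can be preferable.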

Quality filtering: Any region with a homopolymer will tend to have a lower quality score. Huse et al. found that sequences with an average score below 25 had more errors than those with higher averages. (Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007))
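A mean-quality filter using the cutoff of 25 reported by Huse et al. might look like this. The function names are hypothetical, and the sketch assumes quality scores are already available as a list of integers per read.

```python
def mean_quality(quals):
    """Return the mean of a list of per-base quality scores."""
    return sum(quals) / len(quals) if quals else 0.0


def passes_quality(quals, min_mean=25):
    """True if the read's mean quality is at least min_mean."""
    return mean_quality(quals) >= min_mean


reads = {
    "read1": [30, 32, 28, 27, 26],  # mean 28.6
    "read2": [30, 25, 20, 15, 10],  # mean 20.0
}
kept = [name for name, quals in reads.items() if passes_quality(quals)]
print(kept)  # ['read1']
```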

Sequence duplicates

Real or artificial duplicate? Metagenomics = random sampling of genomic material. Why do reads start at the same position? Why do these reads have the same errors? Duplicates show no specific pattern or location on the sequencing plate and made up 11-35% of sequences. (Gomez-Alvarez et al.: Systematic artifacts in metagenomes from complex microbial communities. ISME Journal (2009))

One micro-reactor – Many beads Martine Yerle (Laboratory of Cellular Genetics, INRA, France)

Impacts of duplicates: False variant (SNP) calling. More computing resources required: similar database sequences are found repeatedly for the same query sequence, the assembly process takes longer, and memory requirements increase. Abundance or expression measures can be wrong.

Impacts of duplicates (example: false SNP calling). Five identical reads carry a C where the reference has an A, so the duplicated error can look like a well-supported variant:
Reference: ...ACCACACGTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
                     GTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
                          GTGTACATGAACACAGTATATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
                                 TGAACACAGTCTATGAGCATACAGAT...
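Collapsing exact duplicates, using the reads from the example above, can be sketched as follows. This only handles exact duplicates and keeps the first copy; it cannot tell real duplicates from artificial ones, and the function name is an assumption rather than anything from PRINSEQ.

```python
from collections import Counter


def remove_exact_duplicates(reads):
    """Return reads with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


reads = [
    "GTGTTGTGTACATGAACACAGTATATGAGCATACAGAT",
    "GTGTACATGAACACAGTATATGAGCATACAGAT",
] + ["TGAACACAGTCTATGAGCATACAGAT"] * 5

counts = Counter(reads)
unique = remove_exact_duplicates(reads)
print(len(reads), len(unique))               # 7 3
print(counts["TGAACACAGTCTATGAGCATACAGAT"])  # 5
```

After deduplication the C variant is supported by one read instead of five, against two reads carrying the reference A.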

TagCleaner: Detect and remove tag sequences (http://edwards.sdsu.edu/tagcleaner)

No tag MID tag WTA tags

Imperfect primer annealing

Fragment-to-fragment concatenations

Data upload and tag sequence definition

Tag sequence prediction

Parameter definition and download of results

DeconSeq: Identification and removal of sequence contamination (http://edwards.sdsu.edu/deconseq)

Contaminant identification: Previous methods had critical limitations. Dinucleotide relative abundance uses the information content in sequences and therefore cannot identify single contaminant sequences. Sequence similarity seems to be the only reliable option to identify single contaminant sequences. BLAST against the human reference genome is slow, and the reference lacks corresponding regions (gaps, variants, …). Novel sequences are found in every new human genome sequenced. (Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010))
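For readers unfamiliar with dinucleotide relative abundance, the sketch below computes it with the common odds-ratio definition, rho(XY) = f(XY) / (f(X) * f(Y)). The slides do not state which variant was used, so the single-strand formula, the function name, and the handling of non-ACGT characters are assumptions. The comment on short sequences illustrates why a composition statistic is unreliable for a single read.

```python
from collections import Counter
from itertools import product


def dinucleotide_relative_abundance(seq):
    """Return rho(XY) = f(XY) / (f(X) * f(Y)) for all 16 dinucleotides.

    Positions involving characters other than A, C, G, T are skipped.
    Dinucleotides whose bases never occur get a value of 0.0.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in "ACGT")
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in "ACGT" and seq[i + 1] in "ACGT"
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    rho = {}
    for x, y in product("ACGT", repeat=2):
        fx = mono[x] / n_mono if n_mono else 0.0
        fy = mono[y] / n_mono if n_mono else 0.0
        fxy = di[x + y] / n_di if n_di else 0.0
        rho[x + y] = fxy / (fx * fy) if fx and fy else 0.0
    return rho


# On a very short sequence most dinucleotides are never observed, so the
# 16 values say little about where the sequence came from.
rho = dinucleotide_relative_abundance("ACGTACGT")
print(round(rho["AC"], 3), rho["AA"])
```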

Faster algorithms for Next-gen data

Principal component analysis (PCA) of dinucleotide relative abundance [Figure panels: Microbial metagenomes, Viral metagenomes]

DeconSeq web interface: two types of reference databases, “remove” and “retain”

DeconSeq web interface (cont.)

DeconSeq: Identity = how similar the query sequence is to the reference sequence. Coverage = how much of the query sequence is similar to the reference sequence.
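One way to turn these two definitions into numbers is sketched below, taking identity as matches over alignment length and coverage as aligned query bases over query length. These formulas are a common reading of the definitions above and are an assumption here; the threshold values and function names are hypothetical and are not presented as DeconSeq's settings.

```python
def identity_percent(matches, alignment_length):
    """Percentage of aligned positions where query and reference agree."""
    return 100.0 * matches / alignment_length


def coverage_percent(aligned_query_bases, query_length):
    """Percentage of the query sequence that is part of the alignment."""
    return 100.0 * aligned_query_bases / query_length


def looks_like_contaminant(matches, alignment_length, aligned_query_bases,
                           query_length, min_identity=94.0, min_coverage=90.0):
    """True if both identity and coverage reach the (hypothetical) thresholds."""
    return (identity_percent(matches, alignment_length) >= min_identity
            and coverage_percent(aligned_query_bases, query_length) >= min_coverage)


# A 200 bp read of which only 100 bp align, with 95 matching positions:
print(identity_percent(95, 100))                  # 95.0
print(coverage_percent(100, 200))                 # 50.0
print(looks_like_contaminant(95, 100, 100, 200))  # False
```

Requiring both values to be high avoids flagging a read as contamination when only a short stretch of it resembles the reference.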

DeconSeq: Blue = more similar to “retain”. Red = more similar to “remove”.

Human DNA contamination identified in 145 out of 202 metagenomes

http://prinseq.sourceforge.net/manual.html