Metagenomic dataset preprocessing – data reduction


Metagenomic dataset preprocessing – data reduction Konstantinos Mavrommatis KMavrommatis@lbl.gov

Complexity
The total metagenome is the result of a cell community. Cells belong to different organisms, ranging from strains to domains.
Who is there? (Phylogenetic content)
What does it do? (Functional content)
Why is it there? (Comparative study)
[Figure: example environments on a species complexity axis (1, 10, 100, 1000, 10000): Acid Mine Drainage, Sargasso Sea, Termite Hindgut, Cow rumen, Soil]

Dataset processing
[Workflow diagram, stages as labelled: Sample preparation; High throughput sequencing; QC; Assemble reads (?); Feature prediction; Analysis (Functional annotation and comparative analysis; Binning)]

Dataset processing (v 3.0a)
Submitted file: assembled contigs, 454 reads, or Illumina reads (fasta/fastq).
File QC: check character set and contig name. Remove trailing Ns.
Trimming: Q=20; Q=13 (two trimming boxes in the diagram). Output: fasta.
Low complexity: size of 80 bp.
Dereplication: prefix = 5, identity 95%.
Clustering: 100% identity.
Result: file for gene calling (fasta).
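The file QC step above (character set check, contig name check, removal of trailing Ns) could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function name `check_and_clean`, the accepted alphabet (IUPAC nucleotide codes), and the naming rule (non-empty, no whitespace) are all assumptions, since the slide does not spell them out.

```python
import re

# Assumption: IUPAC nucleotide codes are the accepted character set.
VALID_BASES = set("ACGTUNRYKMSWBDHV")

def check_and_clean(name, seq):
    """Validate one record and return its sequence without trailing Ns.

    Hypothetical helper: raises ValueError when the name or the
    sequence fails the (assumed) QC rules.
    """
    # Assumed naming rule: the contig name is non-empty and has no whitespace.
    if not name or re.search(r"\s", name):
        raise ValueError(f"invalid contig name: {name!r}")
    seq = seq.upper()
    # Any character outside the accepted alphabet fails the check.
    unexpected = set(seq) - VALID_BASES
    if unexpected:
        raise ValueError(f"unexpected characters in {name}: {sorted(unexpected)}")
    # Remove trailing Ns only; internal Ns are left untouched.
    return seq.rstrip("N")
```

For example, `check_and_clean("contig_1", "acgtnn")` returns `"ACGT"`, while a sequence containing `X` is rejected.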

Dataset processing: feature prediction pipeline (v 3.0a)
Input: file for gene calling (fasta).
CRISPR detection: crt / pilercr.
RNA detection: tRNAscan / hmmer / Blast / (isolates: Rfam).
CDS detection: isolates: prodigal; metagenomes: varies (unassembled reads + assembled contigs).
Conflict resolution.
Concatenation of all results. Creation of final output file (file for IMG).

Dataset processing: quality trimming
Remove low-quality sequence from the ends of the reads.
454 datasets: lucy.
Illumina datasets: longest high-quality string.
Courtesy Alex Copeland. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
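One reading of "longest high quality string" for Illumina reads is to keep the longest contiguous run of bases whose quality is at or above a threshold. The sketch below illustrates that idea only; it is not lucy's algorithm, and the function names and the default threshold of 20 are assumptions.

```python
def longest_high_quality_span(quals, min_q=20):
    """Return (start, end) of the longest run with quality >= min_q.

    `end` is exclusive. Returns (0, 0) when no base passes.
    """
    best_start, best_len = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i  # a new high-quality run begins here
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None  # a low-quality base ends the current run
    return best_start, best_start + best_len

def trim_read(seq, quals, min_q=20):
    """Trim a read and its qualities to the longest high-quality run."""
    start, end = longest_high_quality_span(quals, min_q)
    return seq[start:end], quals[start:end]
```

With qualities `[5, 30, 30, 10, 30, 30, 30, 2]`, the second run (three bases) is longer than the first (two bases), so only those three bases are kept.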

Dataset processing: low complexity filter
Examples of low complexity sequence: tatatatatatatatatat, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Identified using dust (NCBI).
Remove sequences with fewer than 80 informative bases.
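The effect of the filter can be illustrated with a simplified, DUST-like triplet score. This is not NCBI dust itself: the window size, step, score threshold, and all function names below are assumptions chosen for illustration, and only the 80-base cutoff comes from the slide.

```python
from collections import Counter

def dust_like_score(window):
    """Score a window by how often its overlapping triplets repeat."""
    triplets = [window[i:i + 3] for i in range(len(window) - 2)]
    if len(triplets) < 2:
        return 0.0
    counts = Counter(triplets)
    repeats = sum(c * (c - 1) / 2 for c in counts.values())
    return repeats / (len(triplets) - 1)

def informative_bases(seq, window=64, step=32, threshold=2.0):
    """Count bases that are neither N nor inside a low complexity window."""
    seq = seq.upper()
    masked = [False] * len(seq)
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        if dust_like_score(chunk) > threshold:
            # Mask every position covered by this low complexity window.
            for i in range(start, start + len(chunk)):
                masked[i] = True
    return sum(1 for base, m in zip(seq, masked) if not m and base != "N")

def passes_low_complexity_filter(seq, min_informative=80):
    """Keep a sequence only if it has at least `min_informative` bases left."""
    return informative_bases(seq) >= min_informative
```

Under these assumed parameters, a 120-base run of `ta` repeats or a homopolymer is rejected, while a sequence with varied triplet content is kept.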

Dataset processing Dereplication

Dataset processing: sequence dereplication
Example reads: atcccat, atc-cat, atcccat, atcccat, atcccat; gctacat, gctncat, gctacat; gctacat (labelled "Not dereplicated").
Using uclust: 95% identity (global alignment). Identical prefix (5 nt).
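The two criteria on this slide (an identical 5 nt prefix and 95% identity) can be sketched with a greedy pass over the reads. The code below is a stand-in for uclust, not a reimplementation of it: identity is approximated here as 1 minus the edit distance divided by the longer read length, which is an assumption, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]

def identity(a, b):
    """Approximate global identity (assumed definition, see lead-in)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def dereplicate(reads, prefix_len=5, min_identity=0.95):
    """Keep the first read of each group of near-identical reads.

    Two reads are treated as replicates only if they share the same
    prefix AND reach the identity threshold.
    """
    representatives = []
    by_prefix = {}
    for read in reads:
        reps = by_prefix.setdefault(read[:prefix_len], [])
        if any(identity(read, rep) >= min_identity for rep in reps):
            continue  # replicate of an earlier read with the same prefix
        reps.append(read)
        representatives.append(read)
    return representatives
```

Grouping by prefix first means the expensive comparison only runs between reads that start identically, and a read that is highly similar overall but differs within its first 5 nt is not collapsed.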

Dataset processing: evaluation of processing tools
Unassembled sequences, because of their small size, quality problems, and large number, need to be processed with efficient pipelines.
Simulated datasets:
Using sequences extracted from finished genomes (perfect sequences).
Using reads that have been used to assemble finished genomes (real errors).
Evaluation and development of new tools/wrappers.
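The "perfect sequences" kind of simulated dataset can be pictured as error-free fragments cut from a finished genome at random positions. The sketch below assumes uniform start positions, a fixed read length, and a single forward strand; the function name and parameters are hypothetical and are not taken from the talk.

```python
import random

def simulate_perfect_reads(genome, read_len, n_reads, seed=0):
    """Sample error-free reads from a genome string.

    Returns a list of (start, read) tuples so that predictions made on a
    read can later be compared with the annotation at its origin.
    """
    if read_len > len(genome):
        raise ValueError("read length exceeds genome length")
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append((start, genome[start:start + read_len]))
    return reads
```

Keeping the start coordinate alongside each read is what makes such a dataset useful for evaluation, because every read has a known true origin.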

Dataset processing: feature prediction
Available methods:
Ab initio: Metagene, MetaGeneMark, FragGeneScan, Prodigal.
Similarity based: Blastx, USEARCH.
[Figure labels: isolate, metagenome; CORRECT, MISSED, WRONG, NEW]

Method performance

Quality effect

Trimming

454 Ti (no errors)

454 Ti (with errors)

Illumina 115 bp

Illumina 74 bp

Contigs
[Figure labels: frameshift; Wrong prediction]

Why annotate unassembled reads?
Sample: total size 102,722,384 (2x150) reads.
Assembled contigs: 1,375,950 contigs; 5060 different pfams.
Assembled reads, mapped (by bwa): 11,778,925 reads.
Genes called on unassembled reads: 64,737,444 genes; 7481 different pfams.
Similar to genes on contigs: 8,373,641 (12%) genes.
Genes with similarity to isolate genomes: 40,778,854 genes.
Additional information about functions and phylogeny.
More accurate statistics based on unassembled + assembled.
[Figure labels: Assembled only; Unassembled + assembled; real metagenome]
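The proportions behind these counts can be worked out directly. In the sketch below the choice of denominators is an assumption: mapped reads are compared with the total read count, and both gene subsets are compared with the genes called on unassembled reads. The variable names are illustrative.

```python
# Counts as given on the slide.
total_reads = 102_722_384
mapped_reads = 11_778_925
genes_on_unassembled = 64_737_444
genes_similar_to_contigs = 8_373_641
genes_similar_to_isolates = 40_778_854

def percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole

# Assumed denominators (see lead-in).
mapped_pct = percent(mapped_reads, total_reads)
contig_like_pct = percent(genes_similar_to_contigs, genes_on_unassembled)
isolate_like_pct = percent(genes_similar_to_isolates, genes_on_unassembled)

print(f"reads mapped to contigs:          {mapped_pct:.1f}%")
print(f"genes similar to genes on contigs: {contig_like_pct:.1f}%")
print(f"genes similar to isolate genomes:  {isolate_like_pct:.1f}%")
```

Under these assumptions only a little over a tenth of the reads map to contigs, the second ratio is close to the 12% quoted on the slide, and well over half of the genes called on unassembled reads have similarity to isolate genomes, which is the argument for annotating unassembled reads.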

Processing time (metagenomes)
[Speaker note: Highlight metrics. Things that Show what I think should be the best metric for prediction for 2012]
Total submissions: 336
Processing time: 2.45 days (annotation); 24 days (integration)
Data size (bp): 174,719,855 (average); 58,006,992,092 (total)

Processing time (isolates)
Total submissions: 3630
Processing time: 10 hours (annotation); 12 days (integration)
Data size (bp): 1,658,242 (average); 4,114,099,773 (total)

Thank you for your attention