Metagenome analysis Natalia Ivanova MGM Workshop February 2, 2012.


1. Metagenome definitions: a refresher course

Metagenome definitions
A metagenome is the collective genome of a microbial community, AKA microbiome (native, enriched, sorted, etc.).
A metagenomic library (or libraries) is constructed from isolated DNA (native, enriched, etc.).
A metagenomic library can be single-end (AKA standard) or paired-end.

Metagenome definitions
A single-end (standard) metagenomic library will produce contigs upon assembly (i.e. longer sequences based on overlap between reads).
  Any Ns found in contigs correspond to low quality bases.
A paired-end metagenomic library will produce scaffolds upon assembly (non-contiguous joining of reads based on read pair information).
  Ns found in scaffolds correspond either to low quality bases or to gaps of unknown size.
Example (figure), contig, both strands:
  ATGCAAAGGCCGCATCCAGCAGGTT
  TACGTTTCCGGCGTAGGTCGTCCAA
Example (figure), scaffold, both strands, pieces joined across a gap:
  ATGCAAAGGCCGCATCC NNNNNN AGCAGGTT
  TACGTTTCCGGCGTAGG NNNNNN TCGTCCAA
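The contig/scaffold distinction above can be sketched in a few lines of Python. This is only an illustration, not part of any assembler or of IMG/M: the function name `split_scaffold` and the `min_gap` parameter are hypothetical. It splits a scaffold back into its contigs at runs of Ns, using the example sequence from the slide. Because Ns in a scaffold can also be low quality bases, `min_gap` lets short N runs stay inside a contig; the length of an N run says nothing reliable about the true gap size.

```python
import re

def split_scaffold(scaffold, min_gap=1):
    """Split a scaffold into contigs at runs of at least `min_gap` Ns.

    Returns (contigs, gap_lengths). `min_gap` is a hypothetical knob:
    N runs shorter than it are left inside the contig, on the assumption
    that they are low quality bases rather than scaffold gaps.
    """
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    contigs, gap_lengths = [], []
    pos = 0
    for match in gap_pattern.finditer(scaffold):
        if match.start() > pos:
            contigs.append(scaffold[pos:match.start()])
        gap_lengths.append(match.end() - match.start())
        pos = match.end()
    if pos < len(scaffold):
        contigs.append(scaffold[pos:])
    return contigs, gap_lengths

# Top strand of the scaffold example from the slide.
scaffold = "ATGCAAAGGCCGCATCC" + "NNNNNN" + "AGCAGGTT"
contigs, gap_lengths = split_scaffold(scaffold)
print(contigs, gap_lengths)
```

With `min_gap=1` a single low quality N also splits the sequence; raising `min_gap` is one simple way to treat isolated Ns as part of the contig.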

Amplified and Unamplified Libraries
[Figure: side-by-side workflows for the Amplified Library and the Unamplified Library. Steps named in the figure: Fragmentation (1 µg), Double SPRI, End repair / Phosphorylation, SPRI Clean, A-tailing with Klenow exo-, Heat Inactivation, DNA Chip, Adaptor Ligation, PCR 10-cycle Amplification, qPCR Quantification.]

Metagenome definitions (contd):
Unless the community has very low complexity (i.e. is dominated by one or a few clonal populations), assembly at 100% nucleotide identity will be very fragmented.
What to do with k-mer based assemblies? Use multiple k-mer settings, then combine the assemblies with an overlap-layout-consensus assembler such as minimus2, using a minimum identity of 95%.
There is a tradeoff between overlap length and % identity.
overlap = alignment of reads at x% identity
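To make "overlap = alignment of reads at x% identity" and the length/identity tradeoff concrete, here is a minimal Python sketch. It is a toy: it only considers ungapped suffix-prefix overlaps, and it is not how minimus2 or any real assembler finds overlaps. The function names and the example reads are assumptions made for illustration.

```python
def overlap_identity(a, b, overlap_len):
    """Percent identity between the last `overlap_len` bases of `a`
    and the first `overlap_len` bases of `b` (ungapped comparison)."""
    if overlap_len <= 0 or overlap_len > min(len(a), len(b)):
        raise ValueError("overlap_len must be between 1 and the shorter read length")
    tail, head = a[-overlap_len:], b[:overlap_len]
    matches = sum(1 for x, y in zip(tail, head) if x == y)
    return 100.0 * matches / overlap_len

def best_overlap(a, b, min_len, min_identity):
    """Longest suffix-prefix overlap of at least `min_len` bases whose
    identity is at least `min_identity` percent.
    Returns (length, identity) or None if no overlap qualifies."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        identity = overlap_identity(a, b, n)
        if identity >= min_identity:
            return n, identity
    return None

read_a = "ATGCAAAGGCCGCATCCAGC"
read_b = "GGCCGCATCCAGCAGGTT"          # overlaps read_a perfectly over 13 bases
read_b_variant = "GGCCGTATCCAGCAGGTT"  # same read with one substitution

print(best_overlap(read_a, read_b, 10, 100.0))
print(best_overlap(read_a, read_b_variant, 10, 100.0))
print(best_overlap(read_a, read_b_variant, 10, 90.0))
```

The variant read fails to join at 100% identity but joins once the threshold is relaxed, which is the reason assembly at 100% identity fragments when a population is not clonal.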

Reasoning behind combining multiple assemblies

A snapshot of an older (454-Illumina) metagenome assembly pipeline (Assembly Pipeline v.0.9)
Trimming does not appear to be ideal for this process.
CPU time intensive; no known metagenomic k-mer prediction algorithm.
Picking the best k-mer is a manual process.

Metagenome definitions (contd):
Assembly of sequences at less than 100% identity => population contigs and scaffolds representing a consensus sequence of a species population.
[Figure labels: isolate contig; species population contigs]
overlap = alignment of reads at x% identity

2 more important definitions
1. Sequence coverage (AKA read depth)
  How many times each base has been sequenced => needs to be considered when calculating protein family abundance
  Per-contig average coverage
  Per-base coverage => per-gene coverage
2. Bins
  Scaffolds, contigs and unassembled reads can be binned into sets of sequences (bins) that likely originated from the same species population or from a population of a broader taxonomic lineage
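The relation between per-base, per-contig and per-gene coverage can be sketched as follows. This is an illustrative Python example under stated assumptions: read alignments are given as hypothetical 0-based, half-open `(start, end)` intervals on one contig, and the function names are made up here; real pipelines take these values from the assembler or from a read aligner.

```python
def per_base_coverage(contig_len, alignments):
    """Depth at every base of a contig.

    `alignments` is a list of (start, end) read placements, 0-based and
    half-open (an assumed input format for this sketch).
    """
    diff = [0] * (contig_len + 1)
    for start, end in alignments:
        start, end = max(0, start), min(contig_len, end)
        if start < end:
            diff[start] += 1   # read begins covering here
            diff[end] -= 1     # read stops covering here
    coverage, depth = [], 0
    for i in range(contig_len):
        depth += diff[i]
        coverage.append(depth)
    return coverage

def mean_coverage(coverage, start=0, end=None):
    """Average depth over [start, end): the whole contig by default,
    or a gene's coordinates for per-gene coverage."""
    end = len(coverage) if end is None else end
    if end <= start:
        raise ValueError("empty interval")
    return sum(coverage[start:end]) / (end - start)

cov = per_base_coverage(10, [(0, 6), (2, 8), (4, 10)])
print(cov)                       # per-base coverage
print(mean_coverage(cov))        # per-contig average coverage
print(mean_coverage(cov, 2, 8))  # per-gene coverage for a gene at [2, 8)
```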

What IMG does and doesn’t do
Scaffolds and contigs are generated by assembly – not provided in IMG/M
Sequence coverage can be computed by the assembler based on alignments it generates (preferable) or can be added later by aligning reads to contigs – the latter can be provided in IMG/M
Bins are generated by binning software – not provided in IMG/M
Scaffolds, contigs and unassembled reads are annotated with non-coding RNAs, repeats (CRISPRs), and protein coding genes (CDSs); the latter are assigned to protein families (COGs, Pfams, TIGRfams, KEGG Orthology, EC numbers, internal clusters) – is provided in IMG/M

What’s the difference between IMG and MG-RAST, IMG and CAMERA?
We prefer to assemble the data
  longer sequences -> better quality of gene prediction and functional annotation
  longer sequences -> chromosomal context and binning -> population-level analysis
But we don’t provide assembly services except for metagenomes sequenced at the JGI
  we may be able to help with assembly of 454 data
  we’re not equipped to assemble massive amounts of Illumina data
  http://galaxy.jgi-psf.org
  Contact person: Ed Kirton, ESKirton@lbl.gov
IMG does not provide tools for analysis of 16S data from the metagenome itself
  we do assembly -> none of the assembled 16S sequences is reliable
  BLASTn of reads matching conserved regions is misleading
  we do pyrotags for every metagenome sequenced at the JGI
  http://pyrotagger.jgi-psf.org

2. IMG/M features: divide and conquer
(see also IMG/M -> Using IMG/M -> Using IMG/M -> IMG User Guide and IMG/M Addendum)
http://img.jgi.doe.gov/m
http://img.jgi.doe.gov/mer
username: public password: public

IMG/M User Interface Map
About IMG/M -> Using IMG/M -> User Interface Map

Dividing the contigs by GC content or length
Statistics: Microbiome Details -> Genome Statistics -> DNA Scaffolds
Search: Microbiome Details -> Scaffold Search
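Outside the IMG/M interface, the same kind of split by GC content or length can be sketched in Python. This is a minimal illustration, not IMG/M code: the function names, the thresholds and the toy scaffolds are all assumptions. Ns are excluded from the GC calculation, which is one common convention.

```python
def gc_content(seq):
    """Fraction of G+C among called bases (A, C, G, T); Ns are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0

def select_scaffolds(scaffolds, min_gc=0.0, max_gc=1.0, min_length=0):
    """Names of scaffolds whose GC fraction and length fall inside the
    requested ranges. `scaffolds` maps a name to its sequence."""
    return [
        name
        for name, seq in scaffolds.items()
        if len(seq) >= min_length and min_gc <= gc_content(seq) <= max_gc
    ]

# Toy sequences, made up for the example.
scaffolds = {
    "s1": "GGCCGGCCGC",
    "s2": "ATATATATTA",
    "s3": "ATGC",
    "s4": "GCGCNNATAT",
}
high_gc = select_scaffolds(scaffolds, min_gc=0.6, min_length=5)
low_gc = select_scaffolds(scaffolds, max_gc=0.5)
print(high_gc, low_gc)
```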

Dividing the genes phylogenetically: Phylogenetic Distribution of Genes
Microbiome Details -> Phylogenetic Distribution of Genes
Components: histograms, Protein Recruitment Plots, summary statistics tables, lists of genes
[Figure labels: histogram (phylum/class); summary statistics (family), (species); gene counts, gene lists, statistics, recruitment plots]

Dividing the contigs: Scaffold Cart
Lists of contigs or genes in Gene Cart
E.g. Microbiome Details -> Genome Statistics -> DNA Scaffolds -> scaffold counts
Scaffold Cart features:
  Scaffold Export
  Adding all genes to Gene Cart
  Function Profile (against functions in Function Cart)
  Histograms by GC content, length and gene count
  Phylogenetic Distribution

All Carts in IMG are interconnected
Gene Cart, Scaffold Cart, Function Cart

Dividing the genes by abundance / by function: Abundance Profiles
Compare Genomes -> Abundance Profiles Tools
Components:
Common parameters:
  Normalization (none / scale for size)
  Type of count (raw counts / estimated gene copies)
  Type of protein family (COG, Pfam, Enzyme, TIGRfam)
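The two "type of count" options and the normalization option can be illustrated with a small Python sketch. It rests on assumptions that should be checked against the IMG/M documentation: "estimated gene copies" is taken here to mean each gene weighted by the coverage of its contig, and "scale for size" is taken to mean dividing by the sample total. The function names and the toy gene list are hypothetical.

```python
from collections import defaultdict

def family_abundance(genes, use_coverage=False):
    """Abundance per protein family.

    `genes` is an iterable of (family, coverage) pairs, one per gene.
    use_coverage=False -> raw gene counts.
    use_coverage=True  -> assumed "estimated gene copies": each gene
                          contributes the coverage of its contig.
    """
    counts = defaultdict(float)
    for family, coverage in genes:
        counts[family] += coverage if use_coverage else 1.0
    return dict(counts)

def scale_for_size(counts):
    """Assumed normalization: express each family as a fraction of the
    sample total, so samples of different size can be compared."""
    total = sum(counts.values())
    return {family: value / total for family, value in counts.items()} if total else {}

# Toy data: two COG0001 genes on contigs with 10x and 2x coverage,
# one COG0002 gene on a contig with 4x coverage.
genes = [("COG0001", 10.0), ("COG0001", 2.0), ("COG0002", 4.0)]
raw = family_abundance(genes)
estimated = family_abundance(genes, use_coverage=True)
print(raw)
print(estimated)
print(scale_for_size(estimated))
```

On assembled data the raw count under-represents families whose genes sit on high-coverage contigs, which is why coverage has to be taken into account.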

Other tools
Phylogenetic Marker COGs: Find Functions -> Phylogenetic Marker COGs
SNP BLAST and SNP VISTA: Gene Page -> SNP BLAST -> SNP VISTA
IMG/M exercises: http://genomebiology.jgi-psf.org/Content/MGM-11.Feb2012/agenda.html
The first 3 pages are questions without answers; the rest is a cheat sheet

Life outside IMG: binning tools
Alignment-based tools
  MEGAN – BLAST+LCA http://www-ab.informatik.uni-tuebingen.de/software/megan
  MTR – BLAST+ MTR http://cs.ru.nl/gori/software/MTR.tar.gz
  SOrt-ITEMS – processed BLAST best hit http://metagenomics.atc.tcs.com/binning/SOrt-ITEMS
  CARMA and Web-CARMA – MSA + neighbor-joining tree http://webcarma.cebitec.uni-bielefeld.de
Compositional tools
  PhyloPythia – 6-mers, SVM http://cbcsrv.watson.ibm.com/phylopythia.html
  TACOA – 2-6 mers, k-nearest neighbor classifier http://www.cebitec.uni-bielefeld.de/brf/tacoa/tacoa.html
  Phymm and PhymmBL – Interpolated Markov models (IMMs) http://www.cbcb.umd.edu/software/phymm/
  ClaMS – DOR, DBC http://clams.jgi-psf.org
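As an illustration of the "BLAST+LCA" idea listed for alignment-based binning, here is a minimal Python sketch of a lowest common ancestor (LCA) assignment. It is a simplification and does not reproduce MEGAN: real tools first filter hits by score and other thresholds, which is omitted here. The function name and the example lineages are assumptions; each lineage is the root-to-leaf taxonomy of one hit for the same read.

```python
def lowest_common_ancestor(lineages):
    """Deepest taxonomic prefix shared by all lineages.

    Each lineage is a list of taxon names ordered from root to leaf,
    one lineage per hit of the read being classified.
    Returns the shared prefix (empty list if nothing is shared).
    """
    if not lineages:
        return []
    shared = []
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            shared.append(taxa_at_rank[0])
        else:
            break
    return shared

# Hypothetical hits for one read.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Pseudomonadales", "Pseudomonas"],
]
print(lowest_common_ancestor(hits))
```

A read whose hits disagree at genus level is therefore placed at the class they share, trading resolution for a lower chance of a wrong assignment.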

Life outside IMG: statistical analysis tools
Comparison of 2 samples
  MEGAN - http://www-ab.informatik.uni-tuebingen.de/software/megan
  STAMP - http://kiwi.cs.dal.ca/Software/STAMP
Comparison of sets of samples
  ShotgunFunctionalizeR – R package for statistical analysis - http://shotgun.zool.gu.se
  METAREP – package from JCVI, includes multidimensional scaling, hierarchical clustering, etc - http://www.jcvi.org/metarep
  METASTATS – package for analysis of paired samples with replicates - http://metastats.cbcb.umd.edu/
  LEfSe – package for comparison of multiple classes of samples with replicates - http://huttenhower.sph.harvard.edu/lefse/