Genboree Microbiome Workbench 16S Workshop, Part I. March 11th, 2014. Julia Cope, Emily Hollister, Kevin Riehle.


Genboree Workflow
– Create Group
– Create Database
– Create Project
– Upload Files
– Create Samples (Sample Import using metadata file)
– Link Samples to Sequence Files (Sample File Linker)
– QC and Attach Sequences (Sequence Import)
– QIIME or RDP

Data Analysis - QIIME
How to select samples for analysis
Chimera removal and why you should be thinking about it
Output
– downloading and organization
– making sense of the files

Data Analysis - QIIME How to select samples for analysis

Data Analysis - QIIME – Selecting samples for analysis
INPUT = One or more Sequence Import folders
– All should be of the same variable region, ideally produced with the same primer and sequencing direction
OUTPUT Targets = Your database (required), your project (optional)

Data Analysis - QIIME
Caveats:
All samples in your input folder will be analyzed
– This includes no-template controls and positive controls
– The % variation explained by your PCoA may be influenced by the inclusion of these samples
QIIME on Genboree is not currently set up to allow users to subsample their data
– This can be problematic if sequencing depth varies substantially across samples
– It does, however, perform a “rounding up” normalization step

A bit about sequencing depth
How deep should you go? There is no good answer.
Strong biological patterns can be detected with low sequencing depth
– 10s to 100s of sequences can sometimes be enough
– 1000s tend to be the norm
Subtle biological patterns tend to require greater sequencing depth for detection
Sequencing depth can be dictated by:
– Sample quality
– The number of samples placed on a run
– Project budget
Kuczynski et al Nature Methods 7:

Unequal sequencing depth
What’s the problem?
Being certain that you are seeing the full view of your communities (…or at least equivalent glimpses of them)

Unequal sequencing depth
What’s the problem?
[Figure: unequal depth. Avg Red = 5995 seqs, Avg Blue = seqs. Same data set, with samples colored by library size: Red ~4000, Orange ~5000, Yellow ~6000, Green 8,000-10,000, Blues 11,000-17,000.]

Unequal sequencing depth
What’s the problem?
[Figure: two panels. Unequal depth: Avg Red = 5995 seqs, Avg Blue = seqs. Equal depth: all libraries were sub-sampled to ~4000 reads.]
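Sub-sampling every library to a common depth, as in the equal-depth comparison, can be sketched in a few lines of Python. This is only an illustration of the idea and not something QIIME on Genboree exposes; the function name `rarefy`, the toy OTU counts, and the fixed random seed are assumptions made for the example.

```python
import random
from collections import Counter

def rarefy(otu_counts, depth, seed=0):
    """Randomly draw `depth` reads without replacement from one sample.

    otu_counts maps OTU id -> read count. Returns None when the sample
    has fewer than `depth` reads, since it cannot be sub-sampled that deep.
    """
    # Expand the counts into one label per read, then sample from them.
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if len(reads) < depth:
        return None
    rng = random.Random(seed)
    return dict(Counter(rng.sample(reads, depth)))

# Hypothetical libraries of unequal depth (not data from the workshop).
shallow = {"OTU_1": 2500, "OTU_2": 1200, "OTU_3": 300}   # 4000 reads
deep = {"OTU_1": 9000, "OTU_2": 5000, "OTU_3": 1000}     # 15000 reads

# Use the smallest library size as the common depth.
target = min(sum(shallow.values()), sum(deep.values()))
rarefied = {name: rarefy(counts, target)
            for name, counts in {"shallow": shallow, "deep": deep}.items()}
```

After this step both samples contain the same number of reads, so differences between them are less likely to reflect library size alone. Samples shallower than the chosen depth are dropped rather than padded.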

Data Analysis - QIIME
Chimera removal and why you should be thinking about it
– What is a chimeric sequence?
– How frequently do they occur?
– An example from real data
– Why should you think about chimeras?
– How to screen for chimeras using Genboree

What is a Chimeric Sequence?
– In Greek mythology: a creature that was an amalgam of multiple animals
  Body of a lion, head of a goat, tail resembling a snake
– In your sequence data: the combination of multiple sequences during PCR to create a hybrid
– In sequence databases: a not-so-small nightmare of junk data
  Mis-annotation
  Enhanced “discovery” of novel organisms
Chimera generation figure from: Haas et al. 2011, Genome Research 21:

How frequently do chimeras occur?
– Schloss et al 2011: with mock communities of known composition
  ~8% of raw sequences were chimeric
  Incidence increased with sequencing depth
– Approaches for detection:
  Multiple algorithms available
  Genboree uses ChimeraSlayer
– How it works:
  The ends of each read (~30% of total length) are compared to a chimera-free reference database
  Potential “parent” sequences are identified
  Identity of the potential chimera to an in silico chimera is evaluated
Schloss et al PLoS ONE 6(12):e27310
[Figure: two example alignments of a Query against Parent 1 and Parent 2, labeled “Likely Chimera” and “Non-chimera”. Sequences shown, in slide order: AATCGCGACCTGTTTAACCGTAGGTC, AAACGCTTACGGAGCTACACGAGTC; then AATCGCGACCTGTGCTACACGGGTA, AATCGCGACCTGTTTAACCGTAGGTC, AAACGCTTACGGAGCTACACGGGTA.]
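The end-comparison idea can be illustrated with a toy Python sketch. It is not ChimeraSlayer's algorithm, which aligns reads against a curated reference database and scores in silico chimeras; here the two ends of a query are simply compared, position by position and without gaps, to each candidate parent. The three sequences are the group of three shown together on the slide, assumed to be query, parent 1, and parent 2 in that order; the function names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def best_parent(fragment, references, from_end):
    """Name of the reference whose matching end is most similar to fragment."""
    n = len(fragment)
    scores = {}
    for name, seq in references.items():
        ref_fragment = seq[-n:] if from_end else seq[:n]
        scores[name] = identity(fragment, ref_fragment)
    return max(scores, key=scores.get)

def looks_chimeric(query, references, end_fraction=0.3):
    """Flag a query whose two ends are best explained by different parents."""
    n = max(1, int(len(query) * end_fraction))
    left = best_parent(query[:n], references, from_end=False)
    right = best_parent(query[-n:], references, from_end=True)
    return left != right, left, right

# Sequences from the slide; the query/parent assignment is an assumption.
query = "AATCGCGACCTGTGCTACACGGGTA"
parents = {
    "parent_1": "AATCGCGACCTGTTTAACCGTAGGTC",
    "parent_2": "AAACGCTTACGGAGCTACACGGGTA",
}

flagged, left_parent, right_parent = looks_chimeric(query, parents)
```

With these inputs the left end of the query matches parent 1 best and the right end matches parent 2 best, which is the pattern a chimera would produce. A real screen needs gapped alignment and a statistical score, so treat this purely as a picture of the logic.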

An example from real data
Alignment of chimeric sequences derived from Streptococcus (top, red) and Staphylococcus (bottom, black)
Sequences were generated from 4 replicate PCR reactions/454 runs of V3V5 sequence
Chimeric alignment from: Haas et al. 2011, Genome Research 21:

Why should you think about chimeras?
– Spurious results
  Chimeras artificially increase estimates of richness and diversity
  You may discover a “new” (but fake) species
– Should you trust all flagged chimeras?
  Most people do, but... buyer beware
  False-positive rates are in the 1-4% range
  Some taxa are poorly represented in reference databases
  Prevotella and Acinetobacter are known to produce false-positive results in ChimeraSlayer
– How to verify (digging into your QIIME output)
  Obtain representative sequence(s) and verify their identity (e.g., BLAST vs. NCBI nt database, RDP SeqMatch)
Sogin et al 2006 PNAS 103:
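Pulling the representative sequences you want to double-check out of a FASTA file, so they can be pasted into BLAST or RDP SeqMatch, might look like the Python sketch below. It assumes the representative-sequence file is ordinary FASTA whose header line starts with the OTU identifier; the identifiers and sequences here are made up for illustration.

```python
import io

def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the identifier.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# Made-up stand-in for a representative-sequence file.
example = io.StringIO(
    ">OTU_7 sampleA_12\nAATCGCGACC\nTGTGCTACAC\n"
    ">OTU_9 sampleB_3\nAAACGCTTAC\n"
)
flagged = {"OTU_7"}  # hypothetical OTU IDs you want to verify by hand

to_verify = {name: seq for name, seq in read_fasta(example) if name in flagged}
```

To run it on a real file, replace the `io.StringIO` object with `open("your_rep_seqs.fasta")`, where the file name is whatever you extracted from your own output.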

How to screen chimeras in Genboree
– Run a QIIME job
  INPUT = Sequence Import folder
  OUTPUT Targets = Your database (required), your project (optional)

How to screen chimeras in Genboree
– Select “Remove Chimeras” in the Tool Settings dialogue box
  Provide a study name
  Provide a job name (TIP: add chimeras_removed to your job name so that your output reflects that you selected this option)
  Click SUBMIT

Data Analysis - QIIME Output – downloading and organization – making sense of the files

How do I get my files out?
– Entire folders can be archived/downloaded
  INPUT = Folder to be archived
  OUTPUT = Database to house archive

How do I get my files out?
– Entire folders can be archived/downloaded
  Provide an archive name
  Choose your compression type
  Decide if you want the directory structure to be preserved
  SUBMIT

How do I get my files out?
– Single files, including archives, can be downloaded one by one
  Click on your file of interest in the DATA SELECTOR window
  Click on the “Click to Download File” link in the DETAILS window
  Save the file to your computer or storage drive
  Most file types will require decompression
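Decompressing a downloaded .tar.gz archive can be done with any archive utility; as one option, the Python sketch below uses the standard-library `tarfile` module. The helper name `extract_archive` and the throwaway demo archive are assumptions for illustration; point the function at your own downloaded file instead.

```python
import os
import tarfile
import tempfile

def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Newer Python versions offer extraction filters that block unsafe paths.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_dir, **kwargs)
    return names

# Demonstration on a throwaway archive built in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.txt")
    with open(src, "w") as fh:
        fh.write("placeholder\n")
    archive = os.path.join(tmp, "demo.result.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="example.txt")
    out_dir = os.path.join(tmp, "unpacked")
    extracted = extract_archive(archive, out_dir)
    unpacked_ok = os.path.exists(os.path.join(out_dir, "example.txt"))
```

For a real download the call would be something like `extract_archive("plots.result.tar.gz", "plots")`, run from the folder where you saved the file.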

QIIME – making sense of the files
– fasta.result.tar.gz
– jobFile.json
– mapping.txt
– otu.table
– phylogenetic.result.tar.gz
– plots.result.tar.gz
– raw.results.tar.gz
– repr_set.fasta.ignore
– sample.metadata
– settings.json
– taxonomy.result.tar.gz

QIIME – making sense of the files
– fasta.result.tar.gz: multiple sequence alignment of your representative sequences file. Rep seqs = representative sequence for each OTU.
– jobFile.json: a log of the settings used by Genboree to run your analysis
– mapping.txt: a QIIME-compatible metadata file, includes barcode information
– otu.table: a spreadsheet of OTU by sample distributions
– phylogenetic.result.tar.gz: a phylogenetic tree of your rep seqs, additional files required for iTOL
– plots.result.tar.gz: figures, html files for all PCoA plots produced in your QIIME run
– raw.results.tar.gz: mapping file, otu table, rep seqs file, distance matrices underlying all PCoA calculations
– repr_set.fasta.ignore: RDP classification (with confidence scores) of each rep seq
– sample.metadata: like the mapping.txt file, with additional file locations for Genboree
– settings.json: similar to the jobFile.json file
– taxonomy.result.tar.gz: taxonomic summaries (per sample, at the Kingdom, Phylum, Class, Order, Family, and Genus levels)

Genboree Workflow
– Create Group
– Create Database
– Create Project
– Upload Files
– Create Samples (Sample Import using metadata file)
– Link Samples to Sequence Files (Sample File Linker)
– QC and Attach Sequences (Sequence Import)
– QIIME or RDP

Data Analysis - RDP
How to select samples
Output
– downloading and organization
– making sense of the files

Data Analysis - RDP – Selecting samples for analysis
INPUT = One or more Sequence Import folders
– All should be of the same variable region, ideally produced with the same primer and sequencing direction
OUTPUT Targets = Your database (required), your project (optional)

Data Analysis - RDP
Caveats:
All samples in your input folder will be analyzed
– This includes no-template controls and positive controls
RDP on Genboree does not pre-filter for chimeric sequences
RDP on Genboree is not currently set up to allow users to subsample their data
– Depending on your application, this may be problematic if sequencing depth varies substantially across samples
– It does, however, perform a “rounding up” normalization step and presents data on a relative abundance basis
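Presenting data on a relative abundance basis means dividing each taxon's count by the total for its sample. A minimal Python sketch is shown below; the taxon counts are hypothetical, and the “rounding up” normalization step is not reproduced because the slides do not describe how it works.

```python
def relative_abundance(counts):
    """Convert raw per-taxon read counts for one sample into proportions."""
    total = sum(counts.values())
    if total == 0:
        # An empty sample has no meaningful proportions; report zeros.
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}

# Hypothetical genus-level counts for two samples of different depth.
samples = {
    "sample_A": {"Bacteroides": 3000, "Prevotella": 1000},
    "sample_B": {"Bacteroides": 9000, "Prevotella": 3000},
}
proportions = {name: relative_abundance(c) for name, c in samples.items()}
```

Both samples come out with the same proportions even though one is three times deeper. Proportions put samples on a common scale, but they do not recover rare taxa that a shallow library never sampled, which is why unequal depth can still matter.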

How do I get my files out?
– Entire folders can be archived/downloaded
  INPUT = Folder to be archived
  OUTPUT = Database to house archive

How do I get my files out?
– Entire folders can be archived/downloaded
  Provide an archive name
  Choose your compression type
  Decide if you want the directory structure to be preserved
  SUBMIT

How do I get my files out?
– Single files, including archives, can be downloaded one by one
  Click on your file of interest in the DATA SELECTOR window
  Click on the “Click to Download File” link in the DETAILS window
  Save the file to your computer or storage drive
  Most file types will require decompression

RDP – making sense of the files
– domain.result.tar.gz
– phylum.result.tar.gz
– class.result.tar.gz
– order.result.tar.gz
– family.result.tar.gz
– genus.result.tar.gz
– sample.metadata
– settings.json
– count.result.tar.gz
– count.xlsx
– count_normalized.xlsx
– weighted.xlsx
– weighted_normalized.xlsx
– png.result.tar.gz

RDP – making sense of the files
– domain.result.tar.gz, phylum.result.tar.gz, class.result.tar.gz, order.result.tar.gz, family.result.tar.gz, genus.result.tar.gz: per sample summaries at various taxonomic levels, including raw counts and weighted values
– sample.metadata
– settings.json
– count.xlsx, count_normalized.xlsx: per sample summaries at various taxonomic levels, raw counts or relative abundances (normalized)
– weighted.xlsx, weighted_normalized.xlsx: per sample summaries at various taxonomic levels, weighted by confidence of ID assignments (raw counts or normalized)
– png.result.tar.gz: all of the plots produced during your run (e.g., heatmaps, stacked bar graphs)

Individual Time
– Confirm user accounts are created
– Confirm users know where the mock data or their own data sets are