Practical Bioinformatics Community structure measures for meta-genomics István Albert Bioinformatics Consulting Center Penn State.


BioStar: Question and Answer site for Bioinformatics

Holistic data analysis: Try to lay out all the steps first and worry about parameter settings later, once you know what the data looks like. Too much planning is actually detrimental, and so is fretting about the details. Start with the end result: "I'd like to have my data laid out as a …"

Topic: metagenomics. The majority of microbial biodiversity cannot be captured by cultivation-based methods. Metagenomics is the study of genetic material recovered from environmental samples.

Increased complexity: We are interested in the bacterial species (membership) present in the communities, and also in the relative abundances of these species. To detect changes, we need to compare two bacterial communities.

16S ribosomal RNA: The 16S rRNA gene is highly conserved between different species of bacteria and archaea, whereas the rest of the genetic content varies greatly across species. 16S rRNA can therefore be used for taxonomic classification.

Two widely used approaches to classifying sequences: by their similarity to reference sequences (phylotyping), or by their similarity to other sequences in the sample (operational taxonomic units, OTUs).

Taxonomy: Placing a bacterium into a taxonomy is difficult. Several competing groups each maintain a separate taxonomic database. Three widely used curated taxonomy outlines contain significant conflicts with each other.

Greengenes

SILVA

RDP

NCBI taxonomy

One common phylotyping workflow: Run the BLAST aligners with the reads against the NCBI bacterial database (this can be very time consuming), then use MEGAN (MEtaGenome ANalyzer) to process the results.

For 16S rRNA there is a faster way: align against a hand-curated, prealigned representative selection (the NAST algorithm). This needs far fewer resources.

MEGAN Graphical user interface – nice visualizations

Pitfalls in phylotyping approaches: The advantage of phylotype-based methods is that they place a label onto each sequence. Yet the same species may have very different phenotypes, and the same phenotypes may actually belong to different lineages. Nonetheless, overall it works well for taxonomic classification.

OTU-based approaches: These are clustering based. Sequences are clustered by their similarity (they must come from conserved regions such as 16S rRNA). We (YOU) choose the percent similarity level, anywhere from 0 to 100%, at which to merge sequences.
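To make the role of the similarity threshold concrete, here is a minimal sketch of threshold-based clustering. It assumes the reads are already aligned to the same length, and it uses a simple greedy, seed-based rule chosen only for illustration; it is not the algorithm of any particular package. The names `percent_identity` and `greedy_otus` are hypothetical.

```python
def percent_identity(a, b):
    """Percent of positions at which two equal-length aligned sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def greedy_otus(seqs, threshold=97.0):
    """Put each sequence into the first OTU whose seed it matches at >= threshold.

    Illustration only: a greedy, order-dependent rule, assumed for this sketch.
    """
    otus = []  # list of (seed sequence, list of members)
    for s in seqs:
        for seed, members in otus:
            if percent_identity(seed, s) >= threshold:
                members.append(s)
                break
        else:
            # No existing seed is similar enough, so this read starts a new OTU.
            otus.append((s, [s]))
    return [members for _, members in otus]


# Toy reads: the third differs from the first at one of ten positions (90% identity).
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTACGGGG"]
print(len(greedy_otus(reads, threshold=90.0)))   # 2
print(len(greedy_otus(reads, threshold=100.0)))  # 3
```

Raising the threshold splits the data into more OTUs, which is why the chosen level has to be reported alongside any OTU count.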

Pitfalls of OTU-based approaches: There is no consistent method for converting between the thresholds used to define OTUs and taxonomic levels. The distances within a taxonomic group are not evenly distributed. Clustering is computationally intensive.

The methods are slowly merging: Phylotyping software gets more OTU-based functionality, and OTU-based software gets more phylotyping functionality.

The mothur package: Primarily OTU based, but it has phylotyping functionality built in. Exceedingly well documented, with binaries for every platform. Download it for your computer.

Comparing bacterial communities A and B: We always have an incomplete sample of a large community. What is the overlap between A and B? Is B a subset of A? If the memberships of A and B are identical, are their abundances the same? What if A was sampled at a higher rate than B?

OTU-based calculators for single communities: Each calculator gives us a small window into one particular property of the dataset. The properties are community richness, community evenness, community diversity, and OTU number extrapolation.

OTU-based calculators for multiple communities: shared community richness, similarity in community membership, and similarity in community structure.
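The difference between membership and structure can be sketched with two common measures. The choice of Jaccard (for membership) and Bray-Curtis (for structure) is an assumption made for this example, as are the function names and the toy communities; many other calculators exist.

```python
def jaccard_similarity(a, b):
    """Membership similarity: shared OTUs divided by all OTUs seen in either sample."""
    present_a = {otu for otu, n in a.items() if n > 0}
    present_b = {otu for otu, n in b.items() if n > 0}
    union = present_a | present_b
    if not union:
        return 1.0  # two empty samples are treated as identical (a convention)
    return len(present_a & present_b) / len(union)


def bray_curtis_dissimilarity(a, b):
    """Structure dissimilarity computed on raw abundances (0 = identical)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a.get(otu, 0), b.get(otu, 0)) for otu in set(a) | set(b))
    return 1.0 - 2.0 * shared / total


# Hypothetical OTU tables: B was sampled ten times less deeply than A.
community_a = {"otu1": 50, "otu2": 30, "otu3": 20}
community_b = {"otu1": 5, "otu2": 3, "otu4": 2}

print(jaccard_similarity(community_a, community_b))         # 0.5
print(bray_curtis_dissimilarity(community_a, community_b))
```

On raw counts the Bray-Curtis value is dominated by the difference in sampling depth, so in practice the samples are usually subsampled to a common size or converted to relative abundances first.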

Community richness (alpha diversity): The Chao estimator asks, based on what we see, how many OTUs are really there? Many other estimators exist: ACE, jackknife, etc.
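As a worked illustration of the Chao estimator, the sketch below computes Chao1 from a list of per-OTU counts. With S_obs observed OTUs, F1 singletons (OTUs seen once) and F2 doubletons (OTUs seen twice), the classic form is S_obs + F1^2 / (2 F2). The code uses the bias-corrected form S_obs + F1 (F1 - 1) / (2 (F2 + 1)), which stays defined when F2 is zero. The function name and the toy counts are assumptions for this example.

```python
def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate from per-OTU abundance counts."""
    counts = [n for n in otu_counts if n > 0]
    s_obs = len(counts)                        # OTUs actually observed
    f1 = sum(1 for n in counts if n == 1)      # singletons
    f2 = sum(1 for n in counts if n == 2)      # doubletons
    # Many rare OTUs (large f1 relative to f2) suggest many OTUs are still unseen.
    return s_obs + f1 * (f1 - 1) / (2.0 * (f2 + 1))


counts = [10, 5, 2, 2, 1, 1, 1]  # 7 observed OTUs, 3 singletons, 2 doubletons
print(chao1(counts))             # 8.0
```

A sample with no singletons returns exactly the observed richness, reflecting the idea that nothing rare remains to be discovered.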

Community diversity (dominance): The Berger-Parker index is the largest abundance divided by the total number of individuals. Many other estimators exist: Simpson, Shannon, etc.
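The three indices named above can be sketched in a few lines. This assumes the plain textbook definitions (Simpson as the sum of squared proportions, Shannon with the natural logarithm); individual packages may use finite-sample variants or a different log base, so the numbers need not match their output exactly.

```python
import math


def _proportions(counts):
    """Relative abundances of the OTUs that were actually observed."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one observed individual")
    return [n / total for n in counts if n > 0]


def berger_parker(counts):
    """Dominance: share of the sample taken by the most abundant OTU."""
    return max(_proportions(counts))


def simpson(counts):
    """Simpson index D = sum(p_i^2); higher means more dominated, less diverse."""
    return sum(p * p for p in _proportions(counts))


def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i); higher means more diverse."""
    return -sum(p * math.log(p) for p in _proportions(counts))


uneven = [50, 30, 20]
even = [10, 10, 10, 10]
print(berger_parker(uneven), berger_parker(even))  # 0.5 0.25
print(simpson(uneven), shannon(uneven))
print(simpson(even), shannon(even))
```

For the perfectly even community the Shannon index equals ln(4), the maximum possible for four OTUs.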

 Analysis Examples

Running through an example. Requirements: Use the datasets in day7/meta. The mothur software is installed in day7/meta/mothur. Run it as: ./mothur If you are having trouble accessing it, copy the mothur executable next to the data.

Example dataset: From the paper "Microbial diversity in the deep sea and the underexplored 'rare biosphere'", PNAS, 2006. The first to use pyrosequencing technology to sequence 16S rRNA gene tags.

mothur command line (a bit like R): You can type commands into mothur or, even better, create a text file with commands and run those commands with mothur.

All commands generate a log that you can look at later. This command redirects the log into a known file. Note the command structure.

The way mothur works: 1. Commands generate readable output to the screen. 2. The same output goes to the log file. 3. Each command usually creates one or more result files that we can either load in later stages or plot (analyze).

Quality filtering: We skip this step since the data is published and already quality filtered (and the authors did not provide the original sequence quality data). See the trim.seqs command.

Work on unique sequences: mothur will keep track of how many times each sequence was seen.
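The idea of collapsing a dataset to unique sequences while remembering their multiplicities can be sketched as follows. This only illustrates the bookkeeping; it does not reproduce mothur's file formats, and the function name and reads are made up for the example.

```python
from collections import Counter


def unique_sequences(reads):
    """Return (sequence, times seen) pairs, most abundant first."""
    counts = Counter(reads)
    # most_common() sorts by count; ties keep first-seen order.
    return counts.most_common()


reads = ["ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC"]
uniques = unique_sequences(reads)
print(len(reads), "reads collapse to", len(uniques), "unique sequences")
print(uniques)  # [('ACGT', 3), ('TTGA', 2), ('GGCC', 1)]
```

Later steps such as alignment and distance calculation then only need to run once per unique sequence, while the counts preserve the abundance information.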

compare to the value before

We are skipping a few steps that are not suited for an in-class exercise: running the alignments, distances and clustering. The following slides show the commands, but we should not run them in class as they may take a while. The dist folder contains this precomputed dataset. Check the content of commands1.txt and commands2.txt to see all the commands. We also renamed sogin.unique.filter.fasta to good.fasta.

For this you need the SILVA reference files (800 MB). This can take a bit longer, depending on your computer and the number of processors. With a very fast computer with 12 processors it took about five minutes.

As a result we compute the similarity matrix that interrelates all sequences
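What such a matrix contains can be sketched on a toy alignment. The sketch works with distances (one minus the fraction of agreeing positions) and makes two assumptions that are choices for this example, not a description of any tool: columns where both sequences have a gap are ignored, and a gap facing a base counts as a mismatch.

```python
def pairwise_distance(a, b, gap="-"):
    """Fraction of compared alignment columns at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue  # assumption: shared gaps carry no information
        compared += 1
        if x != y:
            mismatches += 1  # assumption: gap versus base is a mismatch
    return mismatches / compared if compared else 0.0


def distance_matrix(seqs):
    """Symmetric matrix of pairwise distances with zeros on the diagonal."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = pairwise_distance(seqs[i], seqs[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


aligned = ["ACGT-A", "ACGT-A", "ACCT-A", "TCCTGA"]
for row in distance_matrix(aligned):
    print(["%.2f" % d for d in row])
```

The number of pairs grows quadratically with the number of sequences, which is one reason the calculation is done on unique sequences only.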

We can now move on to the analysis.

Rarefaction curves

A separate output file for each group

Generate the rarefaction curves
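The principle behind a rarefaction curve can be sketched by repeated random subsampling: draw a given number of reads without replacement, count the distinct OTUs, and average over many draws. The function name, step size, number of repeats and toy counts below are assumptions for this illustration.

```python
import random


def rarefaction_curve(otu_counts, step=10, repeats=100, seed=0):
    """Mean number of OTUs observed at increasing subsample sizes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Expand the counts into one entry per read, labelled by OTU index.
    pool = [otu for otu, n in enumerate(otu_counts) for _ in range(n)]
    sizes = list(range(step, len(pool) + 1, step))
    if not sizes or sizes[-1] != len(pool):
        sizes.append(len(pool))  # always finish at the full sample
    curve = []
    for size in sizes:
        total = 0
        for _ in range(repeats):
            total += len(set(rng.sample(pool, size)))
        curve.append((size, total / repeats))
    return curve


counts = [50, 30, 10, 5, 3, 1, 1]  # 100 reads in 7 OTUs
for size, mean_otus in rarefaction_curve(counts, step=20):
    print(size, round(mean_otus, 2))
```

A curve that is still climbing at the full sample size indicates that more sequencing would reveal additional OTUs, while a plateau suggests the community has been sampled fairly completely.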

A different file for each group, with the file extension r_chao

Visualize the rarefaction curve – here shown in Excel - we’ll learn how to do this in R