Meta’omic functional profiling with ShortBRED

Slides:



Advertisements
Similar presentations
Meta’omic functional profiling with HUMAnN Curtis Huttenhower Harvard School of Public Health Department of Biostatistics U. Oregon META Center.
Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Meta’omic functional profiling with HUMAnN Curtis Huttenhower Harvard School of Public Health Department of Biostatistics U. Oregon.
1 ADVANCED MICROSOFT POWERPOINT Lesson 5 – Using Advanced Text Features Microsoft Office 2003: Advanced.
Amplicon functional profiling with PICRUSt
Scaffold Download free viewer:
NGS Analysis Using Galaxy
Metagenomic Analysis Using MEGAN4
1 Welcome to the GrameneMart Tutorial A tool for batch data sequence retrieval 1.Select a Gramene dataset to search against. 2.Add filters to the dataset.
Tweaking BLAST Although you normally see BLAST as a web page with boxes to place data in and tick boxes, etc., it is actually a command line program that.
Basic features for portal users. Agenda - Basic features Overview –features and navigation Browsing data –Files and Samples Gene Summary pages Performing.
Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics Lab v1 | Saurabh Sinha1 Powerpoint by Casey Hanson.
Copyright OpenHelix. No use or reproduction without express written consent1.
Microbial diversity and virulence probing of five different body sites Anu Rebbapragada, Pub. Health Ontario Central Lab. Canada Wei-Jen Lin, Cal State.
1 P6a Extra Discussion Slides Part 1. 2 Section A.
Meta’omic functional profiling with ShortBRED Galeb Abu-Ali Curtis Huttenhower Harvard T.H. Chan School of Public Health Department of Biostatistics.
Metagenomic Analysis Using MEGAN4 Peter R. Hoyt Director, OSU Bioinformatics Graduate Certificate Program Matthew Vaughn iPlant, University of Texas Super.
Meta’omic Analysis with MetaPhlAn, HUMAnN, and LEfSe Curtis Huttenhower Harvard School of Public Health Department of Biostatistics.
IPlant Collaborative Discovery Environment RNA-seq Basic Analysis Log in with your iPlant ID; three orange icons.
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics | Saurabh Sinha | PowerPoint by Casey Hanson.
Download all complete prokaryotic genomes from the NCBI RefSeq database Extract 16S rRNA sequences from each genome. Use UCLUST algorithm to cluster 16S.
Bioinformatics for biologists Dr. Habil Zare, PhD PI of Oncinfo Lab Assistant Professor, Department of Computer Science Texas State University Presented.
Meta’omic functional profiling with ShortBRED Curtis Huttenhower Harvard School of Public Health Department of Biostatistics U. Oregon.
Tweaking BLAST Although you normally see BLAST as a web page with boxes to place data in and tick boxes, etc., it is actually a command line program that.
SAGExplore web server tutorial. The SAGExplore server has three different modules …
The Treatment-Naive Microbiome in New-Onset Crohn’s Disease Dirk Gevers, Subra Kugathasan, Lee A. Denson, Yoshiki Vázquez-Baeza, Will Van Treuren, Boyu.
Welcome to the GrameneMart Tutorial A tool for batch data sequence retrieval 1.Select a Gramene dataset to search against. 2.Add filters to the dataset.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Advanced Taverna Aleksandra Pawlik University of Manchester materials by Katy Wolstencroft, Aleksandra Pawlik, Alan Williams
An Introduction to Meta’omic Analyses Curtis Huttenhower Galeb Abu-Ali Eric Franzosa Harvard T.H. Chan School of Public Health Department of Biostatistics.
Functional profiling with HUMAnN2
Using the bioBakery Curtis Huttenhower
Metagenomic Species Diversity.
AP CSP: Cleaning Data & Creating Summary Tables
Preprocessing Data Rob Schmieder.
Regulatory Genomics Lab
Strain profiling with StrainPhlAn and PanPhlAn
An Introduction to Meta’omic Analyses
Research in Computational Molecular Biology , Vol (2008)
Mirela Andronescu February 22, 2005 Lab 8.3 (c) 2005 CGDN.
Functional profiling with HUMAnN2
Taxonomic profiling with MetaPhlAn2
Identifying personal microbiomes using metagenomic codes
Systematic Characterization and Analysis of the Taxonomic Drivers of Functional Shifts in the Human Microbiome  Ohad Manor, Elhanan Borenstein  Cell Host.
Taxonomic profiling with MetaPhlAn2
Genome Center of Wisconsin, UW-Madison
Strain profiling with StrainPhlAn
Human Gut Microbiome: Function Matters
Curtis Huttenhower Galeb Abu-Ali Eric Franzosa
H = -Σpi log2 pi.
Similarity based on Shape and Appearance
Annotation Presentation
Basic Local Alignment Search Tool
Volume 20, Issue 5, Pages (November 2014)
Victor M. Markowitz, I-Min A. Chen, Ken Chu, Amrita Pati, Natalia N
Quantification of antibiotic resistance marker and virulence factor abundances on subway surfaces. Quantification of antibiotic resistance marker and virulence.
Sequence comparison: Multiple testing correction
Yating Liu July 2018 G-OnRamp workshop
Regulatory Genomics Lab
Gautam Dey, Tobias Meyer  Cell Systems 
Volume 20, Issue 5, Pages (November 2014)
A typical current computational meta'omic pipeline to analyze and contrast microbial communities. A typical current computational meta'omic pipeline to.
A Presentation by Regina Strelecki
Welcome - webinar instructions
Introduction to RNA-Seq & Transcriptome Analysis
Lab 3 – BLAST – Directed It’s a BLAST! (too easy?)
Regulatory Genomics Lab
Toward Accurate and Quantitative Comparative Metagenomics
Presentation transcript:

Meta’omic functional profiling with ShortBRED Curtis Huttenhower 08-15-14 Harvard School of Public Health Department of Biostatistics U. Oregon META Center

Who is there? What are they doing? The two big questions… Who is there? (taxonomic profiling) What are they doing? (functional profiling)

(What we mean by “function”)

HMP Unified Metabolic Analysis Network HUMAnN HMP Unified Metabolic Analysis Network Short reads + protein families Translated BLAST search A1 A2 A3 B1 B2 C1 C2 C3 Weight hits by significance Sum over families Adjust for sequence length Repeat for each metagenomic or metatranscriptomic sample

HMP Unified Metabolic Analysis Network HUMAnN HMP Unified Metabolic Analysis Network Millions of hits are collapsed into thousands of gene families (KOs) (still a large number) Map genes to KEGG pathways modules Use MinPath (Ye 2009) to find simplest pathway explanation for observed genes ? Remove pathways unlikely to be present due to low organismal abundance Smooth/fill gaps Collapsing KO abundance into KEGG module abundance (or presence/absence) yields a smaller, more tractable feature set

HUMAnN accuracy Validated against synthetic metagenome samples (similar to MetaPhlAn validation) Gene family abundance and pathway presence/absence calls beat naïve best-BLAST-hit strategy

HUMAnN in action Franzosa et al. PNAS 11:E2329-38 (2014)

Gene families in median HMP stool sample PICRUSt: Inferring community metagenomic potential from marker gene sequencing With Rob Knight, Rob Beiko Seq. genomes Gene families in median HMP stool sample One can recover general community function with reasonable accuracy from 16S profiles. http://picrust.github.com Metagenomic abundance 16S predicted abundance Pathways and modules Orthologous gene families Taxon abundances PI-CRUST has been an excellent collaboration among three groups, ours, Rob Knight’s, and Rob Beiko’s, with the goal being to relate 16S-based taxonomic profiles with metabolic or functional profiles as might be recovered from shotgun metagenomic data. If you consider a 16S profile to identify the relative abundances of organisms from without a phylogenetic tree, often an exact reference genome is available that can quite specifically identify the functional potential of an organism. Sometimes a nearby genome is available instead, allowing an educated guess, and sometimes a more complex ancestral state reconstruction is needed to determine the most likely functional profile of an organism. Given such ancestral state reconstructions with appropriate confidence intervals, however, each reconstruction’s gene catalog can then be “multiplied” by the corresponding organism’s 16S abundance to infer community gene abundances. I honestly never expected this to work well, but when comparing inferred and real gene abundance profiles in communities from the HMP where both are available, the reconstruction actually performs remarkably well, even when I don’t cherry pick just the best few examples to show you. Morgan Langille from Rob Beiko’s group has an excellent poster detailing this process, and what it means in our context of IBD is that we can reasonably accurately go from 16S organismal abundances to gene family abundances, which can then be reconstructed into pathways using the same HUMAnN tool mentioned earlier for metagenomes. Jesse Zaneveld HUMAnN Reconstructed “genomes” Morgan Langille Relative abundance

What’s there: ShortBRED Jim Kaminski ShortBRED is a tool for quantifying protein families in metagenomes Short Better REad Dataset Inputs: FASTA file of proteins of interest Large reference database of protein sequences (FASTA or blastdb) Metagenomes (FASTA/FASTQ nucleotide files) Outputs: Short, unique markers for protein families of interest (FASTA) Relative abundances of protein families of interest in each metagenome (text file, RPKM) Compared to BLAST (or HUMAnN), this is: Faster More specific

What’s there: ShortBRED algorithm Cluster proteins of interest into families Record consensus sequences Identify and common areas among proteins Compared against each other Compared against reference database Remove all of these Remaining subseqs. uniquely ID a family Record these as markers for that family

What’s there: ShortBRED marker identification Prots of interest Reference database Cluster into families Identify short, common regions True Marker Junction Marker Quasi Marker

What’s there: ShortBRED family quantification Metagenome reads Translated search for high ID hits Normalize relative abundances ShortBRED markers

What’s there: ShortBRED’s fast Six synthetic metagenomes from GemSim, spiked with known proteins of interest: ARDB = Antibiotic Resistance VFDB = Virulence Factors

What’s there: ShortBRED’s accurate Six synthetic metagenomes from GemSim, spiked with known proteins of interest: ARDB = Antibiotic Resistance VFDB = Virulence Factors

Setup notes reminder Slides with green titles or text include instructions not needed today, but useful for your own analyses Keep an eye out for red warnings of particular importance Command lines and program/file names appear in a monospaced font. Commands you should specifically copy/paste are in monospaced bold blue.

What’s there: ShortBRED ShortBRED is available athttp://huttenhower.sph.harvard.edu/shortbred You could download ShortBRED by clicking here

From the command line... But don’t! Instead, we’ve installed ShortBRED already for you You can create your own virtual copy by running: ln -s /class/stamps-software/biobakery/shortbred/ To see what you can do, run: module add python/2.7.5 module add cd-hit/3.1.1 module add muscle/3.8.425 module add usearch/7.0.1090-64 module add NCBI_Blast_Executables ./shortbred/shortbred_identify.py -h | less ./shortbred/shortbred_quantify.py -h | less

Getting some annotated protein sequences You could download the ARDB protein sequences here Go to http://ardb.cbcb.umd.edu

From the command line... But don’t! Take a look by running: Instead, we’ve downloaded the important file for you Take a look by running: less /class/stamps-shared/biobakery/data/resisGenes.pfasta

Getting some reference protein sequences Go to http://metaref.org You could download the MetaRef protein sequences here

Running ShortBRED-Identify But don’t! We’ll use an example mini reference database for speed Lets make some antibiotic resistance markers by running: ./shortbred/shortbred_identify.py --goi /class/stamps-shared/biobakery/data/resisGenes.pfasta --ref ./shortbred/example/ref_prots.faa --markers ardb_markers.faa less ardb_markers.faa This should take ~5 minutes If you get bored waiting, kill it and copy: /class/stamps-shared/biobakery/results/shortbred/ardb_markers.faa It will produce lots of status output as it runs

ShortBRED markers True Markers at the top

Junction/Quasi Markers at the bottom ShortBRED markers Junction/Quasi Markers at the bottom

Running ShortBRED-Quantify Using your existing HMP data subset, you can search for antibiotic resistance proteins in the oral cavity by running: ./shortbred/shortbred_quantify.py --markers ardb_markers.faa --wgs 763577454-SRS014472-Buccal_mucosa.fasta --results 763577454-SRS014472-Buccal_mucosa-ARDB.txt less 763577454-SRS014472-Buccal_mucosa-ARDB.txt This should just a few seconds It will again produce lots of status output as it runs

ShortBRED marker quantification RPKMs and raw hit count Other columns are family name and total AAs among all family makers

AR proteins in the human gut That’s boring! Let’s get some real data scp the file to your own computer: /class/stamps-shared/biobakery/data/shortbred_ardb_hmp_t2d.tsv This is the result of running: ShortBRED-Identify on the real ARDB + reference ShortBRED-Quantify on the real HMP + T2D data (Qin Nature 2014) Summing each sample’s RPKMs for families in each ARDB resistance class

AR proteins in the human gut

What it means: LEfSe Visit LEfSe at: http://huttenhower.sph.harvard.edu/lefse First click here

1. Click here, browse to shortbred_ardb_hmp_t2d.tsv What it means: LEfSe Then upload your formatted table After you upload, wait for the progress meter to turn green! 1. Click here, browse to shortbred_ardb_hmp_t2d.tsv 3. Then watch here 2. Then here

What it means: LEfSe Then tell LEfSe about your metadata: 1. Click here 2. Then select Dataset 3. Then Gender 4. Then SampleID 5. Then click here

What it means: LEfSe Leave all parameters on defaults, and run LEfSe! You can try playing around with these parameters if desired 1. Click here 2. Then GO!

What it means: LEfSe You can plot the results as a bar plot Again, lots of graphical parameters to modify if desired 1. Click here 2. Then here

What it means: LEfSe In Galaxy, view a result by clicking on its “eye” Click here

What it means: LEfSe

2. Then selected your formatted data here What it means: LEfSe There’s no really any reason to plot a cladogram Although it will work! But you can see the raw data for individual biomarkers These are generated as a zip file of individual plots 1. Click here 2. Then selected your formatted data here 3. Then here

What it means: LEfSe Click here In Galaxy, download a result by clicking on its “disk” Then here

What it means: LEfSe Tetracycline Efflux Pumps Tet. Ribosomal Blockers Aminoglycoside Acetyltransferases

Summary HUMAnN ShortBRED Quality-controlled metagenomic reads in Tab-delimited gene, module, and pathway relative abundances out ShortBRED Raw metagenomic reads, Proteins of interest, and Protein reference database in Tab-delimited gene family rel. abundances out

Interested? We’re recruiting postdoctoral fellows! Thanks! http://huttenhower.sph.harvard.edu Human Microbiome Project 2 Ramnik Xavier Dirk Gevers Lita Procter Jon Braun Dermot McGovern Subra Kugathasan Ted Denson Janet Jansson Bruce Birren Chad Nusbaum Clary Clish Joe Petrosino Thad Stappenbeck Alex Kostic Levi Waldron Xochitl Morgan Tim Tickle Daniela Boernigen Soumya Banerjee Human Microbiome Project Jane Peterson Sarah Highlander Barbara Methe Karen Nelson George Weinstock Owen White George Weingart Emma Schwager Eric Franzosa Boyu Ren Tiffany Hsu Ali Rahnavard Joseph Moon Jim Kaminski Regina Joice Koji Yasuda Kevin Oh Galeb Abu-Ali Morgan Langille Rob Beiko Rob Knight Greg Caporaso Jesse Zaneveld Interested? We’re recruiting postdoctoral fellows! Afrah Shafquat Randall Schwager Chengwei Luo Keith Bayer Moran Yassour Alexandra Sirota