Nidhi Shah University of Maryland

Slides:



Advertisements
Similar presentations
Computational discovery of gene modules and regulatory networks Ziv Bar-Joseph et al (2003) Presented By: Dan Baluta.
Advertisements

Using phylogenetic profiles to predict protein function and localization As discussed by Catherine Grasso.
Creating NCBI The late Senator Claude Pepper recognized the importance of computerized information processing methods for the conduct of biomedical research.
BLAST Sequence alignment, E-value & Extreme value distribution.
Metabarcoding 16S RNA targeted sequencing
 Aim in building a phylogenetic tree is to use a knowledge of the characters of organisms to build a tree that reflects the relationships between them.
Integrating Bayesian Networks and Simpson’s Paradox in Data Mining Alex Freitas University of Kent Ken McGarry University of Sunderland.
Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.
Multiple sequence alignment Conserved blocks are recognized Different degrees of similarity are marked.
Parallel Computation in Biological Sequence Analysis Xue Wu CMSC 838 Presentation.
Multiple sequence alignment Conserved blocks are recognized Different degrees of similarity are marked.
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
The Sorcerer II Global ocean sampling expedition Katrine Lekang Global Ocean Sampling project (GOS) Global Ocean Sampling project (GOS) CAMERA CAMERA METAREP.
Metagenomics Binning and Machine Learning
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Metagenomic Analysis Using MEGAN4
From Metagenomic Sample to Useful Visual Anna Shcherbina 01/10/ Anna Shcherbina Bioinformatics Challenge Day 02/02/2013 From Metagenomic Sample to.
H = -Σp i log 2 p i. SCOPI Each one of the many microbial communities has its own structure and ecosystem, depending on the body environment it exists.
Accurate estimation of microbial communities using 16S tags Julien Tremblay, PhD
Keyword Searching and Browsing in Databases using BANKS Seoyoung Ahn Mar 3, 2005 The University of Texas at Arlington.
Metagenomic Analysis Using MEGAN4 Peter R. Hoyt Director, OSU Bioinformatics Graduate Certificate Program Matthew Vaughn iPlant, University of Texas Super.
Wikipedia as Sense Inventory to Improve Diversity in Web Search Results Celina SantamariaJulio GonzaloJavier Artiles nlp.uned.es UNED,c/Juan del Rosal,
TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science.
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Data Mining and Decision Trees 1.Data Mining and Biological Information 2.Data Mining and Machine Learning Techniques 3.Decision trees and C5 4.Applications.
Analyzing Time Course Data: How can we pick the disappearing needle across multiple haystacks? IEEE-HPEC Bioinformatics Challenge Day Dr. C. Nicole Rosenzweig.
Accurate estimation of microbial communities using 16S tags
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.
Computational Biology Group. Class prediction of tumor samples Supervised Clustering Detection of Subgroups in a Class.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
TIPP: Taxon Identification using Phylogeny-Aware Profiles Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign.
Shruthi Prabhakara, Raj Acharya Department of Computer Science and Engineering, Pennsylvania State University We propose a two-pass semi-supervised fuzzy.
Canadian Bioinformatics Workshops
MEGAN analysis of metagenomic data Daniel H. Huson, Alexander F. Auch, Ji Qi, et al. Genome Res
Date of download: 6/23/2016 Copyright © 2016 McGraw-Hill Education. All rights reserved. Pipeline for culture-independent studies of a microbiota. (A)
Date of download: 7/7/2016 Copyright © 2016 McGraw-Hill Education. All rights reserved. Pipeline for culture-independent studies of a microbiota. (A) DNA.
Robert Edgar Independent scientist
Quantitative Phylogenetic Assessment of Microbial Communities in Diverse Environments Xinjun Zhang.
Canadian Bioinformatics Workshops
Metagenomic Species Diversity.
Introduction to Bioinformatics Resources for DNA Barcoding
Research Paper on BioInformatics
What Is Cluster Analysis?
Phylogeny - based on whole genome data
TIPP: Taxon Identification using Phylogeny-Aware Profiles
Research in Computational Molecular Biology , Vol (2008)
Unraveling the microbial profile of the rhizosphere of SDS-suppressive soils in Soybean fields Ali Y. Srour1, Jason Bond1, Leonor Leandro2, Dean Malvick3.
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Sequence comparison: Significance of similarity scores
An Efficient method to recommend research papers and highly influential authors. VIRAJITHA KARNATAPU.
CSc4730/6730 Scientific Visualization
A Two-Stage Microbial Association Mapping Framework with Advanced FDR Control Jiyuan Hu Ph.D. Division of Biostatistics New York University Langone Health.
H = -Σpi log2 pi.
TIPP and SEPP: Metagenomic Analysis using Phylogeny-Aware Profiles
Daniel A. Peterson, Daniel N. Frank, Norman R. Pace, Jeffrey I. Gordon 
BLAST.
TIPP: Taxon Identification using Phylogeny-Aware Profiles
Basic Local Alignment Search Tool
Volume 20, Issue 5, Pages (November 2014)
Inference of Environmental Factor-Microbe and Microbe-Microbe Associations from Metagenomic Data Using a Hierarchical Bayesian Statistical Model  Yuqing.
Basic Local Alignment Search Tool (BLAST)
Taxonomic identification and phylogenetic profiling
High-Throughput Identification and Quantification of Candida Species Using High Resolution Derivative Melt Analysis of Panfungal Amplicons  Tasneem Mandviwala,
Daniel A. Peterson, Daniel N. Frank, Norman R. Pace, Jeffrey I. Gordon 
Volume 20, Issue 5, Pages (November 2014)
Sequence alignment, E-value & Extreme value distribution
Volume 25, Issue 14, Pages R611-R613 (July 2015)
Overview of Shotgun Sequence Analysis
Toward Accurate and Quantitative Comparative Metagenomics
Presentation transcript:

Nidhi Shah University of Maryland Embracing ambiguity in the taxonomic classification of microbiome sequencing data Nidhi Shah University of Maryland Hi, Good morning. I first want to thank the organizers of this meeting for inviting me here. I am Nidhi Shah, a graduate student in the computer science department at the University of Maryland working with Dr. Mihai Pop, who is here too.

Microbes are everywhere... I don’t think I need to say it to this crowd, but really Microbes are everywhere. They are In on and around us. Over the last decade or so, with the advances in amplicon sequencing and whole metagenome sequencing has enabled us to study these communities in detail

What is in my microbial community sample? Taxonomy - Classification of organisms, typically arranged in hierarchical ranks (e.g. kingdom, phylum, class, order, genus, species) What is each fragment in my sequencing dataset? One of the most important task in any microbiome study is to figure out what organisms are present and in what proportion. This process is formally called Taxonomic classification Taxonomy is basically a system to classify and organize these organisms. It is typically arranged in a hierarchical ranks (from kingdom, phylum, class.. All the way to species and strains. Such a system helps organize the knowledge of known organisms in an interpretable and easy way

An overwhelming number of genomes being sequenced annually Number of genomes sequenced annually roughly doubles More and more genomes are being sequenced every year because of the decreased cost of sequencing For the same reason, there are also lot of microbial samples being sequenced and analyzed. The plot here was made couple years ago by Dan Nasko – it still shows the trend of upward growth in the sequences in our databases. Produced from NCBI GenBank files June 2016 - DJ Nasko

An overwhelming number of genomes being sequenced annually Number of genomes sequenced annually roughly doubles There is still a lot more microbial diversity that is unrepresented in our databases Biases in reference database Uncaptured diversity Overrepresentation of well-studied population Produced from NCBI GenBank files June 2016 - DJ Nasko

Google: "taxonomic annotation" Database of known pages Report all that contain keyword Ranking important (which of the thousands is most relevant) Ideally, construct a phylogenetic tree Too expensive Instead search against a database Think like a search engine from Mihai Pop

Our "keywords" Reference databases E.g. SILVA, Greengenes, NR, NT >F4BT0V001CZSIM rank=0000138 x=1110.0 y=2700.0 length=57 ACTGCTCTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACGTCTG >F4BT0V001BBJQS rank=0000155 x=424.0 y=1826.0 length=47 ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCTCCCTGCGCCATCAA Reference databases E.g. SILVA, Greengenes, NR, NT Look through database for sequence Report all organisms that contain it Rank list by ?? (most relevant to least relevant) Draw parallels to web search

Landscape of Taxonomic annotation Approaches runtime ~ sensitivity Phylogenetic pplacer TIPP Sequence Alignment blast usearch diamond k-mer stat model RDP k-mer counting kraken Assume taxonomically related sequences have similar k-mer frequencies Composition based methods run rapidly Sequence alignment are sensitive than composition based methods but have higher runtime Phylogenetic methods in theory provide accurate annotation but have prohibitive computational requirements Assume taxonomically related sequences are similar Simulate evolution

Phylogenetic placement Input: Backbone alignment and backbone tree on full-length sequences, and a set of query sequences (e.g. reads in a sample from the same gene) Output: Placement of query sequence on backbone tree Note: if the backbone tree is a Taxonomy, then the placement gives taxonomic information about the query sequences The two basic steps in phylogenetic placement are align and then place the fragment. align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. from Tandy Warnow

Phylogenetic placement The two basic steps in phylogenetic placement are align and then place the fragment. align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. from Tandy Warnow

Challenges to Taxonomic Assignment Resolution of different methods-- ie. certain hypervariable regions of the 16S gene cannot distinguish between E. coli and Shigella spp., Computational bottlenecks Computational reproducibility No ground truth Reference database coverage and completeness There are many challenges, we’ll focus on Reference database coverage and completeness

When does taxonomic annotation work well? When you have just one closest neighbor in the tree, then transferring annotation is easy

What if you have too many close relatives in the tree? But what if the read aligns perfectly well to many sequences -- Dan’s findings -- more strains of same species in the database → difficult to get species level assignment The reason ? Most methods use latest common ancestor approach Makes annotation coarse grained Wouldn’t it be nice to know exactly which species are plausible as opposed to just giving genus level assignment Clinical setting - Commensal vs. pathogen Nasko, Daniel J., et al. "RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification." Genome biology 19.1 (2018): 165.

What if I’ve sequenced something novel? No near neighbors -- only bet is to use phylogenetic methods Too expensive

Can we extract enough information from sequence alignment to achieve accuracy close to phylogenetic methods?

ATLAS: Ambiguous Taxonomy eLucidation by Apportionment of Sequences Step 1: Run BLAST Input: The input is a set of annotated known reference sequences and the uncharacterized reads from the sample. We use BLAST to align query sequences MENTION MARKER GENE SEQUENCING

BLAST can have multiple, valid top hits Statistics like E-value, bitscore, perc identity How to select which of these top hits are relevant?

ATLAS Step 2: Run BLAST outlier detection algorithm In brief, BLAST outlier detection method basically constructs a multiple sequence alignment with query sequence and all the hits using BLAST-generated pairwise alignments. It then uses the Bayesian integral log odds score to determine whether the multiple sequence alignment can be split into two groups that model the data better than the single group. This process identifies significant BLAST hits for the query sequences, without resorting to ad hoc cut-offs on percent identity, bit-score, and/or E-value. Modify add Shah, N., Altschul, S. F., & Pop, M. (2018). Outlier detection in BLAST hits. Algorithms for Molecular Biology, 13(1), 7.

ATLAS Step 3: Create a confusion graph and partition it Once you have the set of related database sequences for all query sequences, we create a confusion graph and partition it

ATLAS Step 3: Create a confusion graph and partition it Step 4: Assign reads to partition In this graph, nodes are reference database sequences and the edges are weighted by the number of query sequences for which the two database sequences are present together in the outlier set. We use Louvain community detection to find tightly knit sub-communities in this graph. These partitions of database sequences attempt to capture the most accurate granularity of taxonomic assignment without relying solely on the main taxonomic levels The last step is to assign reads to these partitions. A query sequence is assigned to a partition if a certain user-defined percentage of sequences in the outlier set are present in that partition.

The Global Enteric Multicenter Study (GEMS) was created to characterize important pediatric diarrheal pathogens We use data from the Global Enteric Multicenter Study or GEMS, a highly-powered case/control study funded by the Bill and Melinda Gates foundation to identify pathogens significantly contributing to diarrheal disease in children under 5. These samples came from children in west and east Africa, and southeast Asia. Pop, M., Walker, A. W., Paulson, J., Lindsay, B., Antonio, M., Anowar Hossain, M., et al. (2014). Diarrhea in young children from low-income countries leads to large-scale alterations in intestinal microbiota composition. Genome Biol. 15, R76

Diarrhea in young children from low-income countries leads to large-scale alterations in intestinal microbiota composition This is from the initial paper. Contains lots of diversity, 22.76% of OTUs had no phylum level classification

ATLAS achieves similar results as phylogenetic placement method (TIPP) achieves similar accuracy with higher computational efficiency.

ATLAS captures sub-genera level partitions LCA of most partitions are at the genus level The picture on the right shows the partitions in five abundant genera in the GEMS case/control study These sub-genus partitions can help understand biological groups and provide more resolution for downstream questions such as differential analysis We also looked at some of the partitions with LCA at higher taxonomic levels and found -- partitions in Clostridia class

ATLAS captures Clostrial class groups

Conclusion Taxonomic annotation of microbiome samples is challenging ATLAS provides a new way to capture ambiguity rather than relying on standard taxonomic levels and latest common ancestor strategy ATLAS generates similar annotations as phylogenetic placement method with higher computational efficiency ATLAS can be combined with phylogenetic methods to improve runtime by pruning tree to identified partitions

Thank you! Contact - nidhi@cs.umd.edu Software - https://github.com/shahnidhi/outlier_in_BLAST_hits

Results ATLAS partitions capture biologically meaningful groups