TIPP: Taxonomic Identification And Phylogenetic Profiling. Nam-phuong Nguyen, Computer Science And Engineering, University Of California, San Diego. I would like to first thank Tandy for introducing me and IGB for hosting this event. I would also like to thank everyone in attendance. Today I will talk about TIPP, a method I developed for what I call microbial forensics, and I will describe what I mean by that in this talk.
Precision Medicine. Personalized treatment based upon the patient’s phenotype and genotype. Precision Medicine Initiative launched with $215M in 2016. Many different aspects, including genomics, epigenetics, and the microbiome. Precision medicine is a new paradigm for healthcare: the idea is to create personal treatments for patients based upon their phenotype and genotype. This is a hot topic, and in 2016 the Precision Medicine Initiative launched with $215M in funding. Precision medicine takes many different characteristics of the patient into account, including genomics, epigenetics, and the microbiome. Image courtesy of gurdanhealth.com
Human Microbiome. 10 times more bacterial cells than human cells. Important role in regulating health. Disruption associated with risk factors for diseases. Analysis through metagenomics. We have more bacterial cells than human cells, so it's no surprise that bacteria play a very important role in regulating our health. Bacteria help us extract energy from our food and help maintain a healthy vaginal environment. Dysbiosis, or disruption of the microbiome, is often associated with risk factors for diseases, including bacterial vaginosis and diarrhea. Two of the key questions in understanding the microbiome are who is there and how much; we call the answer an abundance profile, and we answer these questions with metagenomics. Image courtesy of humanlongevity.com
Metagenomics. Analyzing DNA sequences from an environmental sample. Typical datasets contain millions of reads. I’m going to discuss this idea of microbial forensics under the framework of metagenomics.
Fundamental Questions. What is the identity of a read? What is the microbial profile of a sample? What genes/functions are present? I’m going to discuss this idea of microbial forensics under the framework of metagenomics.
Metagenomic Taxon Identification Objective: classify short reads in a metagenomic sample
Abundance Profiling. Objective: distribution of the species (or genera, or families, etc.) within the sample. For example, the distribution of a sample at the species level might be: Species A: 10%, Species B: 25%, Species C: 55%, Species D: 1%, Species E: 9%. The second, related problem is known as abundance profiling. The connection between abundance profiling and identification is that we will be using identification to solve profiling.
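As a minimal sketch of how identification feeds profiling, assume every read has already been given a species label. The function name and the read counts are made up for illustration; they are chosen only to reproduce the example distribution on this slide.

```python
from collections import Counter


def abundance_profile(read_labels):
    """Turn per-read taxon labels into relative abundances (fractions summing to 1)."""
    counts = Counter(read_labels)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical classified reads: 100 reads spread over five species.
reads = ["A"] * 10 + ["B"] * 25 + ["C"] * 55 + ["D"] * 1 + ["E"] * 9
profile = abundance_profile(reads)

for taxon in sorted(profile):
    print(f"Species {taxon}: {profile[taxon]:.0%}")
```

The same counting works at any taxonomic level: relabel each read with its genus or family before counting.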
Genome-based profiling. Population of 2 bacteria, A and B; B has a genome twice as large as A's. [Figure: two cells of A, one cell of B.] True profile: 67% A, 33% B. Profile estimated from reads: 50% A, 50% B. E. coli genome variation can be as large as 20%. We can try to take genome size into account by estimating abundances based upon coverage. However, bacterial genomes vary in size, and if the reads come from unsequenced organisms, this can be difficult.
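The arithmetic behind this slide can be checked with a small sketch. It assumes reads are sampled uniformly from all DNA in the sample, so a taxon's share of reads is proportional to cell count times genome size. The function names and the unit genome sizes are hypothetical.

```python
def read_based_profile(cell_counts, genome_sizes):
    """Expected read fractions when reads are drawn uniformly from all DNA."""
    dna = {t: cell_counts[t] * genome_sizes[t] for t in cell_counts}
    total = sum(dna.values())
    return {t: dna[t] / total for t in dna}


def coverage_corrected_profile(read_fractions, genome_sizes):
    """Divide read fractions by genome size (a proxy for coverage), then renormalize."""
    coverage = {t: read_fractions[t] / genome_sizes[t] for t in read_fractions}
    total = sum(coverage.values())
    return {t: coverage[t] / total for t in coverage}


cells = {"A": 2, "B": 1}        # two cells of A, one cell of B
sizes = {"A": 1.0, "B": 2.0}    # B's genome is twice as large (arbitrary units)

naive = read_based_profile(cells, sizes)
corrected = coverage_corrected_profile(naive, sizes)
print(naive)      # read fractions before correction
print(corrected)  # fractions after dividing by genome size
```

The correction recovers the true profile only because the genome sizes are known here; for reads from unsequenced organisms the sizes would have to be guessed.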
Single copy marker-based profiling. Population of 2 bacteria, A and B; B has a genome twice as large as A's. [Figure: two cells of A, one cell of B.] Each has a single copy of gene C. True profile: 67% A, 33% B. Profile estimated from reads: 67% A, 33% B. Our focus is on using phylogeny-based methods. Phylogeny-based methods try to infer the relationship of the query sequence to the known reference sequences using a phylogeny. This allows us to infer information about novel sequences.
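The single-copy marker example on this slide can be reproduced with a toy read pool. This is only a sketch: the genome sizes (measured in read-lengths), the region labels, and the helper names are all hypothetical, and every cell is assumed to contribute exactly one read from marker gene C.

```python
from collections import Counter


def fractions(labels):
    """Relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical toy sample: genome size is expressed in read-lengths.
GENOME_READS = {"A": 4, "B": 8}   # B's genome is twice as large as A's
CELLS = {"A": 2, "B": 1}          # two cells of A, one cell of B


def simulate_reads():
    """Each cell yields one read from marker gene C plus reads from the rest of its genome."""
    reads = []
    for taxon, n_cells in CELLS.items():
        for _ in range(n_cells):
            reads.append((taxon, "C"))
            reads.extend((taxon, "other") for _ in range(GENOME_READS[taxon] - 1))
    return reads


reads = simulate_reads()
all_profile = fractions(taxon for taxon, _ in reads)
marker_profile = fractions(taxon for taxon, region in reads if region == "C")
print(all_profile)     # skewed toward the larger genome
print(marker_profile)  # proportional to cell counts
```

Restricting to marker reads discards most of the data, but the remaining reads count cells rather than base pairs.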
TIPP: Taxonomic Identification And Phylogenetic Profiling. Input: fragmentary unknown reads for a gene (ACCG, CGAG, CGG, GGCT, …, ACCT). Reference: known full-length sequences for a gene, and an alignment and a tree: AGG...GCAT (species1), TAGC...CCA (species2), TAGA...CTT (species3), AGC...ACA (species4), ACT..TAGAA (species5). Method: ensemble of HMMs + statistics.
TIPP: Taxonomic Identification And Phylogenetic Profiling (Nguyen et al., Bioinformatics, 2014). Pipeline: reads, then assign reads to marker genes, then classify the reads for each marker gene, then compute the profile.
Abundance Profiling. Objective: distribution of the species (or genera, or families, etc.) within the sample. Leading techniques:
PhymmBL (Brady & Salzberg, Nature Methods 2009)
NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011)
MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop Lab at the University of Maryland
MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard
mOTU (Sunagawa et al., Nature Methods 2013), from the Bork Lab at EMBL
MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes). Make a diagram to emphasize differences between genome-based and marker-based.
“Hard” genome datasets (known genomes and high indel error). On the hard datasets, the reads come from known genomes (i.e., all methods have seen the genomes that the reads come from) but have high rates of sequencing error, and there was a large separation between the methods. What I'm showing is distance to the true profile on the y-axis, so lower is better, and on the x-axis is the error at the different taxonomic levels. This column is for long reads, and this one is for short reads. Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets. mOTU terminates with an error message on all the high indel datasets.
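The slide does not say which distance to the true profile is plotted. Purely as an illustrative assumption, the sketch below uses root mean squared error over the union of taxa; the function name and the example profiles are hypothetical.

```python
import math


def profile_rmse(estimated, truth):
    """Root mean squared error between two abundance profiles.

    Taxa missing from one profile are treated as having abundance 0.
    RMSE is an assumed choice of distance, not necessarily the one in the plots.
    """
    taxa = set(estimated) | set(truth)
    if not taxa:
        return 0.0
    squared = sum((estimated.get(t, 0.0) - truth.get(t, 0.0)) ** 2 for t in taxa)
    return math.sqrt(squared / len(taxa))


truth = {"A": 2 / 3, "B": 1 / 3}
estimate = {"A": 0.5, "B": 0.5}
print(profile_rmse(estimate, truth))  # lower is better; 0 means a perfect profile
```

Computing the distance separately at each taxonomic level (species, genus, family, and so on) gives one point per level, matching the layout described for the x-axis.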
“Novel” genome datasets. Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.
TIPP Compared To Other Profiling Methods TIPP is highly accurate, even in the presence of novel genomes and high sequencing error All other methods are less robust Accurate profiles can be estimated using only a portion of the reads
Do Individual Primates From The Same Species Have Personal Microbiomes? To answer this question, we need longitudinal data from many individuals, so we went ahead and did that
Humans have a personalized microbiome. Recent research has shown that individual humans have a personalized microbiome. In 2010, Fierer showed that you could identify who used which keyboard by comparing the residual contact microbiome on a keyboard with the skin microbiome of the user. Fierer et al., PNAS 2010, showed that you can identify who had previously used a keyboard via the residual contact microbiome (three individuals in study).
Experimental Design. Dataset (unpublished; in preparation): data collected by Patton’s Lab at U of Washington. Longitudinal study of the vaginal, rectal, and fecal microbiome in 39 female captive pigtailed macaques. Weekly matched paired samples taken over a period of a month from each individual. 16S rRNA amplicon sequencing. TIPP (Nguyen et al. 2014) used to generate profiles. Questions: How do the microbiomes differ by body site and individual? Can we identify an individual based upon the microbiome? Add picture of macaques.
Experimental Design Week 1 Week 2 Week 3 Which individual?
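The slides do not describe how a held-out sample is matched to an individual. As one simple possibility, and only as a sketch, a query profile can be assigned to the individual whose earlier profile is closest under Bray-Curtis dissimilarity. The individual names, taxon names, and abundances below are invented for illustration and are not study data.

```python
def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance profiles (0 = identical)."""
    taxa = set(p) | set(q)
    numerator = sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in taxa)
    denominator = sum(p.get(t, 0.0) + q.get(t, 0.0) for t in taxa)
    return numerator / denominator if denominator else 0.0


def identify(query, references):
    """Return the name of the reference profile closest to the query."""
    return min(references, key=lambda name: bray_curtis(query, references[name]))


# Hypothetical profiles from earlier weeks, one per individual.
references = {
    "macaque_1": {"taxon_1": 0.7, "taxon_2": 0.2, "taxon_3": 0.1},
    "macaque_2": {"taxon_1": 0.1, "taxon_2": 0.6, "taxon_3": 0.3},
}
# Hypothetical profile from a later week whose donor we want to guess.
query = {"taxon_1": 0.6, "taxon_2": 0.3, "taxon_3": 0.1}

print(identify(query, references))
```

Profiles from several body sites could be combined before matching, for example by prefixing each taxon name with its site so the sites do not collide.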
Identification Results
vaginal: 0.583
fecal: 0.744
rectal: 0.769
fecal+rectal: 0.859
fecal+vaginal: 0.846
rectal+vaginal: 0.846
fecal+rectal+vaginal: 0.917
2 matched paired samples very different from original donor
Future Directions Expanding the marker set, both in the number of species and genes Statistical approach to combining profiles from different marker genes Developing TIPP for virobiome Jigsaw analogy
Acknowledgements Illinois Tandy Warnow Rebecca Stumpf Bryan White Mike Nute Brenda Wilson UCSD Siavash Mirarab UMD Mihai Pop Bo Liu U of Copenhagen Alonzo Alfaro-Núñez Tom Hansen Anders Hansen Funding NSF 09-35347 NSF 08-20709 NSF 0733029 University of Alberta Double it
Questions? TIPP tutorial tomorrow at 10:00-11:00 in MR7. Instructions for downloading at https://github.com/smirarab/sepp/blob/master/README.TIPP.md Tutorial at https://github.com/smirarab/sepp/blob/master/tutorial/tipp-tutorial.md I am a computer scientist who works on developing algorithms for biology.