Enrique Garcia-Assad, Indresh Singh, Pratap Venepally, Jason Inman

Slides:



Advertisements
Similar presentations
System Development Life Cycle (SDLC)
Advertisements

SplitMEM: graphical pan-genome analysis with suffix skips Shoshana Marcus May 29, 2014.
The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools Stephen Tetreault Department of Mathematics and Computer Science.
We are developing a web database for plant comparative genomics, named Phytome, that, when complete, will integrate organismal phylogenies, genetic maps.
NGS Analysis Using Galaxy
Polymorphism and Variant Analysis Lab
Input for the Bayesian Phylogenetic Workflow All Input values could be loaded as text file or typing directly. Only for the multifasta file is advised.
From Metagenomic Sample to Useful Visual Anna Shcherbina 01/10/ Anna Shcherbina Bioinformatics Challenge Day 02/02/2013 From Metagenomic Sample to.
Development and Evaluation of a Comprehensive Functional Gene array for Environmental Studies Zhili He 1,2, C. W. Schadt 2, T. Gentry 2, J. Liebich 3,
Polymorphism & Variant Analysis Lab Saurabh Sinha Polymorphism and Variant Analysis Lab v1 | Saurabh Sinha 1 Powerpoint by Casey Hanson.
Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics Lab v1 | Saurabh Sinha1 Powerpoint by Casey Hanson.
© 2008, Renesas Technology America, Inc., All Rights Reserved 1 Introduction Purpose  The course describes the performance analysis and profiling tools.
Grid enabling phylogenetic inference on virus sequences using BEAST - a possibility? EUAsiaGrid Workshop 4-6 May 2010 Chanditha Hapuarachchi Environmental.
1 Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang The Ohio State University.
BRUDNO LAB: A WHIRLWIND TOUR Marc Fiume Department of Computer Science University of Toronto.
Copyright OpenHelix. No use or reproduction without express written consent1.
We obtained breast cancer tissues from the Breast Cancer Biospecimen Repository of Fred Hutchinson Cancer Research Center. We performed two rounds of next-gen.
Parallel & Distributed Systems and Algorithms for Inference of Large Phylogenetic Trees with Maximum Likelihood Alexandros Stamatakis LRR TU München Contact:
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics | Saurabh Sinha | PowerPoint by Casey Hanson.
IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Overview of the iPlant Discovery Environment.
From basic Concepts to Advanced applications Molecular Evolution & Phylogeny By Ofir Cohen The Bioinformatics Unit G.S. Wise Faculty of Life Science Tel.
Build an Automated Workflow Visual Workflow Creator Discovery Environment.
LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino.
David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.
Bayesian Evolutionary Analysis by Sampling Trees (BEAST) LEE KIM-SUNG Environmental Health Institute National Environment Agency.
HANDS-ON ConSurf! Web-Server: The ConSurf webserver.
1 A latent information function to extend domain attributes to improve the accuracy of small-data-set forecasting Reporter : Zhao-Wei Luo Che-Jung Chang,Der-Chiang.
VIEWS b.ppt-1 Managing Intelligent Decision Support Networks in Biosurveillance PHIN 2008, Session G1, August 27, 2008 Mohammad Hashemian, MS, Zaruhi.
Evolutionary Computation Evolving Neural Network Topologies.
GIAB: Genome reference material development resources for clinical sequencing Chunlin Xiao 1, Justin Zook 2, Shane Trask 1, Melissa Landrum 1, Marc Salit.
STEPS IN RESEARCH PROCESS 1. Identification of Research Problems This involves Identifying existing problems in an area of study (e.g. Home Economics),
Monitoring Windows Server 2012
2. Centers for Disease Control and Prevention (CDC), Atlanta, GA, USA
Lesson: Sequence processing
Phylogeny - based on whole genome data
Regulatory Genomics Lab
Rod Eyles1, John Juma1, Morag Ferguson1, Trushar Shah1 1 IITA, Nairobi
Tutorial for using Case It for bioinformatics analyses
Bioinformatics Capstone Project
A.R. Manges  Clinical Microbiology and Infection 
CLINICAL INFORMATION SYSTEM
Accurate genotyping of hepatitis C virus through nucleotide sequencing and identification of new HCV subtypes in China population  Y.-Q. Tong, B. Liu,
Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study  Dr Timothy M Walker, MRCP, Camilla LC.
S. Platt, B. Pichon, R. George, J. Green 
Genomics of medical importance
Acknowledgements and References
Renouncing Hotel’s Data Through Queries Using Hadoop
Code Analysis, Repository and Modelling for e-Neuroscience
Genome Biology & Applied Bioinformatics Mehmet Tevfik DORAK, MD PhD
Basic Local Alignment Search Tool
Introduction to Systems Analysis and Design Stefano Moshi Memorial University College System Analysis & Design BIT
Explore Evolution: Instrument for Analysis
Laura Bright David Maier Portland State University
Accurate genotyping of hepatitis C virus through nucleotide sequencing and identification of new HCV subtypes in China population  Y.-Q. Tong, B. Liu,
A. Papa, K. Dumaidi, F. Franzidou, A. Antoniadis 
Regulatory Genomics Lab
The mosaic genome structure and phylogeny of Shiga toxin-producing Escherichia coli O104:H4 is driven by short-term adaptation  K. Zhou, M. Ferdous, R.F.
Code Analysis, Repository and Modelling for e-Neuroscience
Injury epidemiology- Participatory action research and quantitative approaches in small populations Lorann Stallones, PhD Professor and Director, Colorado.
Detection of a new subgenotype of hepatitis B virus genotype A in Cameroon but not in neighbouring Nigeria  J.M. Hübschen, P.O. Mbah, J.C. Forbi, J.A.
Introduction to Producing Data
DESIGN OF EXPERIMENTS by R. C. Baker
Introduction To MATLAB
Bloodstream infections due to Peptoniphilus spp.: report of 15 cases
Regulatory Genomics Lab
Novel West Nile virus lineage 1a full genome sequences from human cases of infection in north-eastern Italy, 2011  L. Barzon  Clinical Microbiology and.
Soft System Stakeholder Analysis
Topology Optimization through Computer Aided Software
NatQuery An End-User Perspective On Using To Extract Data From ADABAS
Presentation transcript:

Analysis workflow for the evaluation of phylogenetic relationships among microbial genomes Enrique Garcia-Assad, Indresh Singh, Pratap Venepally, Jason Inman Abstract C Phylogenetic trees in Newick format Whole genome sequencing (WGS) is increasingly utilized as an investigative tool of choice for analyzing samples in clinical microbiology. It allows rapid identification of strains that are present in the samples and facilitates the formulation of a targeted treatment of the affected subjects, especially in disease outbreak events. A major bottleneck in such an approach currently is the lack of an established methodology which can be used to rapidly process a large number of microbial genomes. To address this, as a demonstration, we have outlined in this presentation one analysis method using commonly available software (programs). Run NASP Background Identification of microbial genomes in various environments and populations and understanding of their genetic relationship is critical in studies related to phylogeography, epidemiology, disease outbreaks and the development of diagnostic tools. In this presentation, we illustrate a method which is useful in such analyses involving large number of genomes assembled from high-throughput sequences. The analysis workflow uses NASP [1] in the first step to align the assembled Escherichia coli genomes obtained from Genbank to a reference for SNP identification. NASP can be used with multiple short-read aligners and SNP callers in addition to having the option of running it on the computer grid (network). In the second step, FastTree program [4] is used to cluster related genomes into a phylogenetic tree based on NASP output. Finally, the genotypic relationships among the input genomes are visualized by displaying them in graphical tree viewer programs FigTree [3] & CLC Workbench [2]. View tree in FigTree/CLC Workbench Run FastTree (bestsnp/miss-ingdata fasta) D Methods Fig 1. The schematic workflow of the analysis. The following analyses are performed in sequence. A). Alignment of genomes & SNP identification (NASP analysis): Two sets of genome sequences were used in the analysis. A smaller set of 20 random E. coli genomes (19 experimental and 1 reference)* were initially processed on the JCVI's Univa computer grid to optimize the run parameters for the program (not shown). The larger, second group of 50 genomes from the same strain (49 experimental – including 19 samples from the initial set - and 1 reference)** were processed subsequently using the parameters identical to the first run. As shown schematically in Fig 1, the genomes used in this study (see footnotes* & **) were downloaded from Genbank in FASTA format and placed in a 'run' directory. The genome designated as the reference was separated from the query (experimental) sequences by moving it into a separate directory. To ensure there was enough memory for a successful completion of the analysis with large number of genomes, the processing was executed on the high-memory (500 GB) Unix server. The NASP analysis was performed in the 'run' directory using parameters described in NAS SOP on the JCVI internal confluence page (http://confluence.jcvi.org/display/VISW/NASP+SOP). B). FastTree: The consensus SNP information generated by NASP in bestsnp.fasta (19 genomes) or missingdata.fasta (49 genomes) was used as input file to the FastTree program with default parameters to build phylogenetic trees (Fig 1). C). Evaluation of Phylogenetic trees: The tree files in Newick format generated by FastTree are imported into either FigTree or CLC Workbench GUI for evaluation of the phylogenetic relationship among the genomes (Fig 2, A-D). Results A Fig 2. The maximum-likelihood phylogenetic trees generated by FastTree. A). The 19 genomes clustered using consensus bestsnp.fasta & viewed in FigTree, B). The 49 genomes clustered using consensus missingdata.fasta & viewed in FigTree, C & D). same as in A & B but viewed in CLC Workbench. Separate reference & query genomes into different directories Obtain desired genomes Conclusions B The goal of the analysis was to demonstrate that NASP can process large genomic data sets for a rapid identification of SNPs which subsequently can be used to evaluate phylogentic relationships among the related microbial samples. Our study example shows that NASP indeed satisfies that requirement. The observation that a smaller set of 19 closely related genomes (conserved bestsnp.fasta) retain the same phylogenetic clustering relationships even when analyzed as a subsample of a larger variant 49 genome group (missingdata.fasta) suggests that NASP is not only fast but also produces accurate results. Acknowledgments Garcia-Assad, Enrique, would like to thank JCVI for the opportunity to take part in the summer internship program. The training and the knowledge he gained will help him in his future academic career. Log on to high-memory server Move into ‘run’ folder References [1]. Sahl, J.W., Lemmer, D., Travis, J., Schupp, J.M., Gillece, J.D., Aziz, M., Kiem, P. (2016). NASP: an accurate, rapid method for the identification of SNPs in WGS datasets that supports flexible input and output formats. [2]. CLC Workbench: https://www.qiagenbioinformatics.com/products/clc-genomics-workbench/ [3]. FigTree: http://tree.bio.ed.ac.uk/software/figtree/ [4]. Price, M.N., Dehal, P.S., Akrin, A.P. (2010). FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. * 19 Genomes Sequence IDs: 1 NZ_CP018948.1, 2 NZ_CP018965.1, 3 NZ_CP012870.1, 4 NZ_JRRX01000061.1,  5 NZ_JRRX01000061.1, 6 NZ_CP018976.1, 7 NZ_JMJD01000003.1, 8 NZ_CP018995.1, 9 NZ_CP020368.1, 10 NZ_MBNW01000001.1, 11 NZ_CP007393.1, 12 NZ_CP019008.1, 13 NZ_CP019020.1, 14 NZ_CP019015.1, 15 NZ_CP019012.1, 16 NZ_CP019000.1, 17 NZ_CP018953.1, 18NZ_CCRB01000147.1, 19 BA000007.2 ** 49 Genomes Sequence IDs:20 LN832404.1, 21 NZ_CP012869.1, 22 NZ_MPCO01000001.1, 23 NZ_JRRN01000033.1, 24 NZ_LXPG01000002.1, 25 NZ_MRDN01000001.1, 26 NZ_MPGR01000001.1, 27 NZ_MBNX01000003.1, 28 NZ_MBNU01000003.1, 29 NZ_CP019029.1, 30 NZ_CP018957.1, 31 NZ_CP018970.1, 32 NZ_MBNY01000001.1, 33 NZ_CP020835.1, 34 NZ_LUYD01000001.1, 35 NZ_JRSB01000213.1, 36 NZ_JMJB01000001.1, 37 NZ_CP018979.1, 38 NZ_CP018983.1, 39 NZ_MPGS01000001.1, 40 NZ_CP018962.1, 41 NZ_CP019005.1, 42 NZ_CP018991.1, 43 NZ_MPGT01000001.1, 44 NZ_LWDC01000001.1, 45 NZ_MIGD01000006.1, 46 NZ_LAZO01000001.1, 47 NZ_MPCP01000001.1, 48 NZ_JRRS01000180.1, 49 NZ_LRDF01000001.1 (first 19 same as 19 Genomes)