Automatic ssu-rRNA novelty ranking pipeline. ssu-rRNA sequences -> one ranking score for each sequence for phylogenetic novelty. Dongying Wu, 03/2015.



A tree is built from the query sequences together with SILVA references (reference sequences with defined phyla). The tree is cut at the phylum level into OTUs; query sequences that end up as singletons are from novel phyla.
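A minimal sketch of the singleton check, assuming the tree has already been cut and every sequence carries an OTU label. The names `find_novel_queries`, `otu_of`, and `reference_ids` are hypothetical and do not come from the pipeline; a query is flagged here when its OTU contains no reference sequence.

```python
from collections import defaultdict


def find_novel_queries(otu_of, reference_ids):
    """Return the queries whose OTU holds no reference sequence (sorted)."""
    members = defaultdict(list)
    for seq_id, otu in otu_of.items():
        members[otu].append(seq_id)

    novel = []
    for seqs in members.values():
        # An OTU without any reference has no phylum to inherit.
        if not any(s in reference_ids for s in seqs):
            novel.extend(seqs)
    return sorted(novel)


# Toy example (made-up IDs): query_2 sits alone in OTU 3.
otu_of = {"ref_A": 1, "ref_B": 2, "query_1": 1, "query_2": 3}
print(find_novel_queries(otu_of, {"ref_A", "ref_B"}))  # ['query_2']
```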

Number of representatives from each phylum (1950 Bacteria / 78 Archaea):

1 Archaea;Ancient Archaeal Group(AAG)
14 Archaea;Crenarchaeota
44 Archaea;Euryarchaeota
1 Archaea;Korarchaeota
2 Archaea;Marine Hydrothermal Vent Group 1(MHVG-1)
1 Archaea;Nanoarchaeota
15 Archaea;Thaumarchaeota
70 Bacteria;Acidobacteria
60 Bacteria;Actinobacteria
3 Bacteria;Aquificae
44 Bacteria;Armatimonadetes
23 Bacteria;BD1-5
11 Bacteria;BHI
? Bacteria;Bacteroidetes
7 Bacteria;CK-1C
? Bacteria;Caldiserica
311 Bacteria;Candidate division OD1
4 Bacteria;Chlamydiae
17 Bacteria;Chlorobi
125 Bacteria;Chloroflexi
1 Bacteria;Chrysiogenetes
128 Bacteria;Cyanobacteria
18 Bacteria;Deferribacteres
8 Bacteria;Deinococcus-Thermus
3 Bacteria;Dictyoglomi
22 Bacteria;Elusimicrobia
27 Bacteria;Fibrobacteres
217 Bacteria;Firmicutes
24 Bacteria;Fusobacteria
4 Bacteria;GAL08
5 Bacteria;GOUTA4
9 Bacteria;Gemmatimonadetes
6 Bacteria;Hyd
? Bacteria;JL-ETNP-Z39
9 Bacteria;Kazan-3B-28
2 Bacteria;LD1-PA38
22 Bacteria;Lentisphaerae
3 Bacteria;MVP-21
29 Bacteria;NPL-UPA2
22 Bacteria;Nitrospirae
3 Bacteria;OC31
116 Bacteria;Planctomycetes
256 Bacteria;Proteobacteria
8 Bacteria;RF3
1 Bacteria;RsaHF231
1 Bacteria;S2R-29
1 Bacteria;SBYG
? Bacteria;SM2F11
45 Bacteria;Spirochaetae
13 Bacteria;Synergistetes
55 Bacteria;TA06
15 Bacteria;TM6
13 Bacteria;Tenericutes
3 Bacteria;Thermodesulfobacteria
18 Bacteria;Thermotogae
24 Bacteria;Verrucomicrobia
6 Bacteria;WCHB1-60
6 Bacteria;aquifer1
3 Bacteria;aquifer2

Number of representatives from each phylum (564 Eukaryota):

38 Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta
29 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta
41 Eukaryota;Excavata;Discoba;Discicristata;Euglenozoa;Euglenida
7 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Acanthocephala
27 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Annelida
66 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Arthropoda
2 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Brachiopoda
4 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Bryozoa
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Chaetognatha
8 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Ctenophora
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cycliophora
4 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Echinodermata
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Entoprocta
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Gastrotricha
5 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Gnathostomulida
4 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Hemichordata
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Kinorhyncha
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Loricifera
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mesozoa;Orthonectida
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mesozoa;Rhombozoa
17 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mollusca
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Myzostomida
17 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nematoda
2 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nematomorpha
10 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nemertea
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Onychophora
27 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Platyhelminthes
2 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Priapulida
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Rotifera
4 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Tardigrada
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Xenoturbellida
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Porifera
11 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Chytridiomycota
55 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota
12 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota
11 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Kickxellomycotina;Glomeromycota
25 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Microsporidia
7 Eukaryota;Picozoa
95 Eukaryota;SAR;Alveolata;Apicomplexa
2 Eukaryota;SAR;Alveolata;Protalveolata;Chromerida
1 Eukaryota;SAR;Stramenopiles;Diatomea;Coscinodiscophytina;Fragilariales;Ctenophora
3 Eukaryota;SAR;Stramenopiles;Phaeophyceae
4 Eukaryota;SAR;Stramenopiles;Xanthophyceae

The selection process maximizes phylogenetic diversity. The core representative sequences include only those with PHYLUM assignments, so the representative data set has phylogenetic gaps. We therefore have to include close relatives of the query sequences from SILVA for tree building, for two reasons:
1. Filling phylogenetic gaps in the core references
2. Guiding short query sequences into the right positions in the tree
Workflow: query ssu-rRNA -> top 10 hits from SILVA by blat -> align query sequences and hits by sina -> build ML tree by FastTree together with the pre-aligned core reference sequences
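The close-relative selection step can be sketched as below. The slides do not say which blat output format the pipeline script reads or how it ranks hits, so this sketch assumes blat's default tab-separated PSL output (column 1 = matches, column 10 = query name, column 14 = target name) and ranks hits by number of matching bases. The function name `top_hits` is hypothetical.

```python
from collections import defaultdict


def top_hits(psl_lines, n=10):
    """Map each query name to its n best distinct targets, ranked by matches."""
    hits = defaultdict(list)
    for line in psl_lines:
        cols = line.rstrip("\n").split("\t")
        # PSL data rows have 21 columns and start with an integer;
        # header lines fail one of the two checks and are skipped.
        if len(cols) < 21 or not cols[0].isdigit():
            continue
        matches, q_name, t_name = int(cols[0]), cols[9], cols[13]
        hits[q_name].append((matches, t_name))

    result = {}
    for q_name, scored in hits.items():
        scored.sort(key=lambda hit: hit[0], reverse=True)
        targets = []
        for _, t_name in scored:
            if t_name not in targets:  # one target can yield several rows
                targets.append(t_name)
        result[q_name] = targets[:n]
    return result
```

The selected target IDs would then be pulled from the SILVA FASTA file and passed to the alignment step together with the queries.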

Jessica Jarett's test ssu-rRNA sequences, top hits, and core references (tree figure with clades labeled Eukaryota, Archaea, Bacteria)

Tree rooting to separate Eukaryota from Archaea/Bacteria (automatic). Figure labels: Eukaryota, Archaea, Bacteria.

Tree rooting for Archaea/Bacteria (automatic)

How to identify the cutoff line at the phylum level: cut the tree using different TreeOTU cutoffs, and compare the resulting OTUs with the phylum-level OTU standard defined by SILVA (query sequences are ignored during the comparison).
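The cutoff sweep can be sketched as follows. The slides report AMI (presumably adjusted mutual information) as the agreement score; to stay dependency-free, this sketch swaps in plain normalized mutual information, which omits AMI's correction for chance. It also assumes the OTU assignments for each cutoff have already been produced by TreeOTU. The names `nmi`, `best_cutoff`, `otus_by_cutoff`, and `phylum_of` are hypothetical.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings (stand-in for AMI)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum((nab / n) * math.log(nab * n / (count_a[a] * count_b[b]))
             for (a, b), nab in joint.items())
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a == 0.0 or h_b == 0.0:
        return 1.0 if h_a == h_b else 0.0
    return mi / math.sqrt(h_a * h_b)


def best_cutoff(otus_by_cutoff, phylum_of):
    """Return (cutoff, score) with the best agreement to the phylum standard.

    otus_by_cutoff: {cutoff: {sequence_id: otu_label}}
    phylum_of:      {reference_id: phylum}; queries are absent, hence ignored.
    """
    best = None
    for cutoff, otu_of in sorted(otus_by_cutoff.items()):
        refs = [s for s in otu_of if s in phylum_of]
        score = nmi([otu_of[s] for s in refs], [phylum_of[s] for s in refs])
        if best is None or score > best[1]:
            best = (cutoff, score)
    return best
```

With real data, an adjusted score such as scikit-learn's `adjusted_mutual_info_score` could replace `nmi` without changing the sweep.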

Plot of AMI (compared to the SILVA phylum-level definition, Bacteria/Archaea) against TreeOTU cutoff: 0.42 is the TreeOTU cutoff for phylum.
Is the query in an OTU with reference sequences? Yes / No
Query labels in the figure: 10-CSP1477__MDM2__DC4__Prim__02__M8__S11_B02_ and CSP1477__MDM2__DC4__SYBR__02__M20__S7_C03_027

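\[
\text{Novel score} =
\begin{cases}
\dfrac{\text{Cutoff}_{\text{query}} - \text{Cutoff}_{\text{phylum}}}{1 - \text{Cutoff}_{\text{phylum}}} & \text{if } \text{Cutoff}_{\text{query}} \ge \text{Cutoff}_{\text{phylum}} \\[2ex]
\dfrac{\text{Cutoff}_{\text{query}} - \text{Cutoff}_{\text{phylum}}}{\text{Cutoff}_{\text{phylum}}} & \text{if } \text{Cutoff}_{\text{query}} < \text{Cutoff}_{\text{phylum}}
\end{cases}
\]
Scale: 1 (root), 0 (phylum line), -1 (tip)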
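A small sketch of the score on this slide, reading it as (Cutoff query - Cutoff phylum) divided by (1 - Cutoff phylum) above the phylum line and by Cutoff phylum below it, which reproduces the stated scale of 1 at the root, 0 at the phylum line, and -1 at the tip. The function name `novelty_score` is hypothetical, and the assumption that both cutoffs lie on a 0 to 1 scale comes from that same scale.

```python
def novelty_score(cutoff_query, cutoff_phylum):
    """Rescale a query's cutoff so the phylum line maps to 0."""
    if not 0.0 < cutoff_phylum < 1.0:
        raise ValueError("cutoff_phylum must lie strictly between 0 and 1")
    if not 0.0 <= cutoff_query <= 1.0:
        raise ValueError("cutoff_query must lie between 0 and 1")

    diff = cutoff_query - cutoff_phylum
    if cutoff_query >= cutoff_phylum:
        # Above the phylum line: stretch [cutoff_phylum, 1] onto [0, 1].
        return diff / (1.0 - cutoff_phylum)
    # Below the phylum line: stretch [0, cutoff_phylum] onto [-1, 0].
    return diff / cutoff_phylum


for q in (1.0, 0.42, 0.0):
    print(q, novelty_score(q, 0.42))
```

Dividing by a different span on each side keeps the score within -1 to 1 wherever the phylum cutoff falls.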

Chlorobi Ranking value:

Aquificae Deinococcus-Thermus ? Archaea ?

The pipeline has been completed. Here is one command line example:

~/dwu_scripts/single_cell/prep_seq_with_close_relatives_from_silva.pl -db ../../SSURef_NR99_115_tax_silva_trunc.fasta -blat ~/bin/blat -input star16S18Sseq.txt -output star16S18Sseq.hit
nohup ~/dwu_scripts/single_cell/run_sina.pl -i star16S18Sseq.hit -o star16S18Sseq.ali -db ../../SSURef_NR99_115_SILVA_20_07_13_opt.arb -sina ~/bin/SINA/sina /sina &
cat star16S18Sseq.ali ../silva_phyla_rep.sina.fasta | ~/dwu_scripts/single_cell/trim_all_nt_gap.pl > star16S18Sseq.trim
nohup ~/bin/FastTree -nt star16S18Sseq.trim > star16S18Sseq.tre &
~/dwu_scripts/single_cell/phylum_novelty_ranking.pl -tree star16S18Sseq.tre -reftaxa ../silva_phyla.info -output star16S18Sseq.ranking

Example of the output:

##Bacteria && Archaea: standard reference TreerOTU cutoff:
##values: 1->root, 0->same_as_refrence_standard, -1->identical_seqeunces_in_the_references
CanI4_Uncultured_Thiothrix_sp___JX Bacteria;Proteobacteria FM
CanF1_Uncultured_Thiothrix_sp___JX Bacteria;Proteobacteria FM
CanI1_Uncultured_Thiothrix_sp___JX Bacteria;Proteobacteria FM
CanI3_Uncultured_Thiothrix_sp___JX Bacteria;Proteobacteria FM
CanI2_ -0.612 Bacteria;Proteobacteria FM
AerM5_Uncultured_Bacterium_KF Bacteria;Proteobacteria JX
AerM4_Uncultured_Bacterium_KF Bacteria;Proteobacteria JX
CanJ2_Uncultured_Bacterium_JF Bacteria;Proteobacteria DQ
….

##Eukaryota: standard reference TreerOTU cutoff:
##values: 1->root, 0->same_as_refrence_standard, -1->identical_seqeunces_in_the_references
AC1_Uncultured_archaeon_clone_DQ Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta Eukaryota_AF
__4_Saccharomyces_cerevisiae_YJM789_JQ Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota Eukaryota_BAEL
__2_Saccharomyces_cerevisiae_YJM993_CP Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria Eukaryota_ABRM
__2_Saccharomyces_cerevisiae_YJM789_JQ Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota Eukaryota_BAEL
__3_Uncultured_Fungus_KC Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota Eukaryota_GU
__1_Saccharomyces_cerevisiae_YJM789_JQ Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria Eukaryota_ABRM
__1_Saccharomyces_cerevisiae_YJM993_CP Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria Eukaryota_ABRM
__1_Uncultured_Eukaryote_EU Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota Eukaryota_BAEL

Novelty ranking: Star Diamonds sample from Ramunas Stepanauskas, Mariana Erasmus

Database problem: SILVA is required only in the aligning step (sina). We can modify the database used for close-relative selection, the taxonomic level information, and the core reference sequences for novelty ranking. (Figure label: Proteobacteria)

Firmicutes

Database and reference updating:
1. Add new phyla to the core reference sequences
2. Merge or split current phylum definitions
Only one rule must be observed: the classification must be exclusive. Each sequence in the database or core reference set must be in one taxonomic group only, and the taxonomic groups should be treated equally. E.g., if we break Proteobacteria into 4 groups, the definition of Proteobacteria needs to be removed. Proteobacteria sequences that cannot be assigned to the 4 sub-groups can stay in the database or core reference dataset. They can play roles in tree structure and novelty ranking, but play no role in taxonomic assignments or in identifying the novelty baseline TreeOTU cutoff.
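The exclusivity rule lends itself to an automated check before a modified reference set is used. The sketch below is an illustration only: it assumes group names are semicolon-separated lineages as in the SILVA-style labels above, and the function name `check_exclusive` and the sequence IDs are hypothetical. Sequences left without a group are simply omitted from the input.

```python
def check_exclusive(assignments):
    """Return a list of rule violations for (sequence_id, group) pairs."""
    pairs = list(assignments)
    problems = []

    # Rule 1: a sequence may belong to one taxonomic group only.
    group_of = {}
    for seq_id, group in pairs:
        previous = group_of.setdefault(seq_id, group)
        if previous != group:
            problems.append(f"{seq_id} is in both {previous} and {group}")

    # Rule 2: a group and one of its sub-groups must not coexist.
    groups = sorted({group for _, group in pairs})
    for parent in groups:
        for child in groups:
            if child.startswith(parent + ";"):
                problems.append(f"{parent} must be removed: it contains {child}")
    return problems


split = [
    ("seq1", "Bacteria;Proteobacteria"),
    ("seq2", "Bacteria;Proteobacteria;Alphaproteobacteria"),
]
print(check_exclusive(split))
```

An empty list means the set is exclusive; in the example the parent group Proteobacteria is reported because a sub-group is also defined.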

High novelty ranking sequence problem: M00954:45: A8ECK:1:1102:18894: Archaea;Euryarchaeota Archaea_AF

Batch job result consistency problem: in an ideal world, the pipeline would take one sequence at a time, and each sequence would have one novelty ranking score. But the aligning and tree-building steps are slow, so we have to combine queries to build a reasonable number of alignments and trees. How different bundling affects the novelty ranking score needs to be addressed.
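One way to start quantifying this is to score the same queries under several bundlings and look at how far each query's score moves. The slides do not propose a measure; the max-minus-min range used below and the name `score_spread` are assumptions made for illustration.

```python
def score_spread(runs):
    """Range of novelty scores per query across bundlings.

    runs: list of {query_id: novelty_score}, one dict per bundling.
    Queries that appear in only one run are skipped.
    """
    seen = {}
    for scores in runs:
        for query_id, score in scores.items():
            seen.setdefault(query_id, []).append(score)
    return {q: max(v) - min(v) for q, v in seen.items() if len(v) > 1}


# Made-up scores: q1 was bundled twice, q2 only once.
runs = [{"q1": 0.25, "q2": -0.5}, {"q1": 0.5}]
print(score_spread(runs))  # {'q1': 0.25}
```

A spread near 0 for every query would suggest that bundling has little effect; large spreads would point to queries whose placement depends on what else is in the tree.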

Team members at the current stage:
UC Davis: Dongying Wu, Guillaume Jospin, Jonathan A. Eisen
JGI: Jessica Jarett, Tanja Woyke
SCGC Bigelow: Ramunas Stepanauskas