Automatic ssu-rRNA novelty ranking pipeline
ssu-rRNA sequences -> one ranking score for each sequence for phylogenetic novelty
Dongying Wu, 03/2015
Build a tree of query sequences together with SILVA references (reference sequences with defined phyla). Cut the tree at the phylum level into OTUs; query sequences that remain singletons are from novel phyla.
Number of representatives from each phylum (1950 Bacteria / 78 Archaea)

Archaea:
1   Archaea;Ancient Archaeal Group(AAG)
14  Archaea;Crenarchaeota
44  Archaea;Euryarchaeota
1   Archaea;Korarchaeota
2   Archaea;Marine Hydrothermal Vent Group 1(MHVG-1)
1   Archaea;Nanoarchaeota
15  Archaea;Thaumarchaeota

Bacteria:
70Bacteria;Acidobacteria60Bacteria;Actinobacteria 3Bacteria;Aquificae44Bacteria;Armatimonadetes 23Bacteria;BD1-511Bacteria;BHI Bacteria;Bacteroidetes7Bacteria;CK-1C Bacteria;Caldiserica311Bacteria;Candidate division OD1 4Bacteria;Chlamydiae17Bacteria;Chlorobi 125Bacteria;Chloroflexi1Bacteria;Chrysiogenetes 128Bacteria;Cyanobacteria18Bacteria;Deferribacteres 8Bacteria;Deinococcus-Thermus3Bacteria;Dictyoglomi 22Bacteria;Elusimicrobia27Bacteria;Fibrobacteres 217Bacteria;Firmicutes24Bacteria;Fusobacteria 4Bacteria;GAL085Bacteria;GOUTA4 9Bacteria;Gemmatimonadetes6Bacteria;Hyd Bacteria;JL-ETNP-Z399Bacteria;Kazan-3B-28 2Bacteria;LD1-PA3822Bacteria;Lentisphaerae 3Bacteria;MVP-2129Bacteria;NPL-UPA2 22Bacteria;Nitrospirae3Bacteria;OC31 116Bacteria;Planctomycetes256Bacteria;Proteobacteria 8Bacteria;RF31Bacteria;RsaHF231 1Bacteria;S2R-291Bacteria;SBYG Bacteria;SM2F1145Bacteria;Spirochaetae 13Bacteria;Synergistetes55Bacteria;TA06 15Bacteria;TM613Bacteria;Tenericutes 3Bacteria;Thermodesulfobacteria18Bacteria;Thermotogae 24Bacteria;Verrucomicrobia6Bacteria;WCHB1-60 6Bacteria;aquifer13Bacteria;aquifer2
Number of representatives from each phylum (564 Eukaryota)

38  Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta
29  Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta
41  Eukaryota;Excavata;Discoba;Discicristata;Euglenozoa;Euglenida
7   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Acanthocephala
27  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Annelida
66  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Arthropoda
2   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Brachiopoda
4   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Bryozoa
1   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Chaetognatha
8   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria
3   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Ctenophora
1   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cycliophora
4   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Echinodermata
3   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Entoprocta
3   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Gastrotricha
5   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Gnathostomulida
4   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Hemichordata
1   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Kinorhyncha
1   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Loricifera
1   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mesozoa;Orthonectida
1   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mesozoa;Rhombozoa
17  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mollusca
1   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Myzostomida
17  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nematoda
2   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nematomorpha
10  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nemertea
1   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Onychophora
27  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Platyhelminthes
2   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Priapulida
3   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Rotifera
4   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Tardigrada
1   Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Xenoturbellida
3   Eukaryota;Opisthokonta;Holozoa;Metazoa;Porifera
11  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Chytridiomycota
55  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota
12  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota
11  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Kickxellomycotina;Glomeromycota
25  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Microsporidia
7   Eukaryota;Picozoa
95  Eukaryota;SAR;Alveolata;Apicomplexa
2   Eukaryota;SAR;Alveolata;Protalveolata;Chromerida
1   Eukaryota;SAR;Stramenopiles;Diatomea;Coscinodiscophytina;Fragilariales;Ctenophora
3   Eukaryota;SAR;Stramenopiles;Phaeophyceae
4   Eukaryota;SAR;Stramenopiles;Xanthophyceae
The selection process maximizes phylogenetic diversity. The core representative sequences include only those with PHYLUM assignments, so the representative data set has phylogenetic gaps. We therefore have to include close relatives of the query sequences from SILVA for tree building, in order to:
1. Fill phylogenetic gaps in the core references
2. Guide short query sequences into the right positions in the tree

Workflow:
Query ssu-rRNA -> top 10 hits from SILVA by blat -> align query sequences and hits by sina -> build ML tree by FastTree together with pre-aligned core reference sequences
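The close-relative selection step can be illustrated with a minimal sketch, assuming the blat results are available in the standard tab-separated PSL format and that hits are ranked by the number of matching bases. The function name top_hits_from_psl and the ranking criterion are assumptions made for illustration; the slides do not say how prep_seq_with_close_relatives_from_silva.pl ranks hits.

```python
from collections import defaultdict


def top_hits_from_psl(psl_lines, n_hits=10):
    """Return {query: [target, ...]} with up to n_hits distinct targets per query.

    Assumes standard PSL columns: 0 = matches, 9 = qName, 13 = tName.
    Ranking by match count is an illustrative choice, not the pipeline's
    documented criterion.
    """
    best = defaultdict(dict)  # query -> {target: best match count seen}
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the optional PSL header and anything that is not a 21-column record.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matches, query, target = int(fields[0]), fields[9], fields[13]
        if matches > best[query].get(target, -1):
            best[query][target] = matches
    return {
        query: [t for t, _ in sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:n_hits]]
        for query, hits in best.items()
    }
```

The returned target names would then be used to pull the matching SILVA sequences and pass them, together with the queries, to the alignment step.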
Jessica Jarett’s test ssu-rRNA sequences with top hits + core references
Figure labels: Eukaryota, Archaea, Bacteria
Tree rooting to separate Eukaryota and Archaea/Bacteria (automatic)
Figure labels: Eukaryota, Archaea, Bacteria
Tree rooting for Archaea/Bacteria (automatic)
How to identify the cutoff line at the phylum level
Cut the tree using different TreeOTU cutoffs, and compare the resulting OTUs with the phylum-level OTU standard defined by SILVA (query sequences are ignored during the comparison).
Figure axes: TreeOTU cutoff; AMI compared to SILVA phylum-level definition (Bacteria/Archaea)
0.42 is the TreeOTU cutoff for phylum.
Is the query in an OTU with reference sequences? Yes / No
10-CSP1477__MDM2__DC4__Prim__02__M8__S11_B02_
CSP1477__MDM2__DC4__SYBR__02__M20__S7_C03_027
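The cutoff scan can be sketched as follows, assuming AMI here denotes the adjusted mutual information between two partitions of the reference sequences (with the arithmetic-mean normalization). The function names and the toy partitions are hypothetical; in the real pipeline the partitions come from cutting the tree with TreeOTU, which is not reproduced here.

```python
from collections import Counter
from math import exp, lgamma, log


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_info(u, v):
    n = len(u)
    cu, cv, cuv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum((nij / n) * log(n * nij / (cu[a] * cv[b])) for (a, b), nij in cuv.items())


def expected_mutual_info(u, v):
    """Expected mutual information under random labelings with fixed cluster sizes."""
    n = len(u)
    emi = 0.0
    for a in Counter(u).values():
        for b in Counter(v).values():
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                # Log of the hypergeometric probability of an overlap of size nij.
                log_p = (lgamma(a + 1) + lgamma(b + 1) + lgamma(n - a + 1) + lgamma(n - b + 1)
                         - lgamma(n + 1) - lgamma(nij + 1) - lgamma(a - nij + 1)
                         - lgamma(b - nij + 1) - lgamma(n - a - b + nij + 1))
                emi += (nij / n) * log(n * nij / (a * b)) * exp(log_p)
    return emi


def ami(u, v):
    """Adjusted mutual information of two labelings of the same sequences."""
    emi = expected_mutual_info(u, v)
    denom = (entropy(u) + entropy(v)) / 2 - emi
    if abs(denom) < 1e-12:  # both labelings are trivial
        return 1.0
    return (mutual_info(u, v) - emi) / denom


def best_cutoff(partitions_by_cutoff, reference_labels):
    """Pick the cutoff whose OTUs agree best with the reference phylum labels."""
    return max(partitions_by_cutoff, key=lambda c: ami(partitions_by_cutoff[c], reference_labels))


# Toy data (hypothetical): six reference sequences from three phyla.
reference = ["P1", "P1", "P2", "P2", "P3", "P3"]
partitions = {
    0.20: [0, 1, 2, 3, 4, 5],  # every sequence is its own OTU
    0.42: [0, 0, 1, 1, 2, 2],  # OTUs coincide with the phyla
    0.80: [0, 0, 0, 0, 0, 0],  # everything in one OTU
}
chosen = best_cutoff(partitions, reference)
```

Query sequences would be dropped from both labelings before scoring, as stated on the slide.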
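Novelty score:

$$
\text{Novel score} =
\begin{cases}
\dfrac{\text{Cutoff}_{\text{phylum}} - \text{Cutoff}_{\text{query}}}{1 - \text{Cutoff}_{\text{phylum}}} & \text{if } \text{Cutoff}_{\text{query}} \ge \text{Cutoff}_{\text{phylum}} \\[2ex]
\dfrac{\text{Cutoff}_{\text{phylum}} - \text{Cutoff}_{\text{query}}}{\text{Cutoff}_{\text{phylum}}} & \text{if } \text{Cutoff}_{\text{query}} < \text{Cutoff}_{\text{phylum}}
\end{cases}
$$

Score scale: 1 (root), 0 (phylum line), -1 (tip)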
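A minimal sketch of the score, assuming the two fractions on this slide read (Cutoff_phylum - Cutoff_query) / (1 - Cutoff_phylum) and (Cutoff_phylum - Cutoff_query) / Cutoff_phylum, and that cutoffs run from 0 at the root to 1 at the tips. That reading reproduces the stated scale of 1 (root), 0 (phylum line), -1 (tip), but the orientation of the cutoff axis is an assumption. The function name and the 0.42 default (the Bacteria/Archaea phylum cutoff reported in this deck) are illustrative.

```python
def novelty_score(cutoff_query, cutoff_phylum=0.42):
    """Map a query's TreeOTU cutoff onto the [-1, 1] novelty scale.

    Assumes cutoffs lie in [0, 1] with 0 at the root and 1 at the tips,
    and that 0 < cutoff_phylum < 1.
    """
    if not 0.0 <= cutoff_query <= 1.0:
        raise ValueError("cutoff_query must lie in [0, 1]")
    if not 0.0 < cutoff_phylum < 1.0:
        raise ValueError("cutoff_phylum must lie strictly between 0 and 1")
    if cutoff_query >= cutoff_phylum:
        # At or below the phylum line: scores run from 0 down to -1 at the tip.
        return (cutoff_phylum - cutoff_query) / (1.0 - cutoff_phylum)
    # Above the phylum line: scores run from 0 up to 1 at the root.
    return (cutoff_phylum - cutoff_query) / cutoff_phylum
```

Under these assumptions a positive score means the query only joins reference sequences above the phylum line, and a negative score means it falls within a known phylum.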
Chlorobi Ranking value:
Aquificae Deinococcus-Thermus ? Archaea ?
The pipeline has been completed. Here is one command line example:

~/dwu_scripts/single_cell/prep_seq_with_close_relatives_from_silva.pl -db ../../SSURef_NR99_115_tax_silva_trunc.fasta -blat ~/bin/blat -input star16S18Sseq.txt -output star16S18Sseq.hit
nohup ~/dwu_scripts/single_cell/run_sina.pl -i star16S18Sseq.hit -o star16S18Sseq.ali -db ../../SSURef_NR99_115_SILVA_20_07_13_opt.arb -sina ~/bin/SINA/sina /sina &
cat star16S18Sseq.ali ../silva_phyla_rep.sina.fasta | ~/dwu_scripts/single_cell/trim_all_nt_gap.pl > star16S18Sseq.trim
nohup ~/bin/FastTree -nt star16S18Sseq.trim > star16S18Sseq.tre &
~/dwu_scripts/single_cell/phylum_novelty_ranking.pl -tree star16S18Sseq.tre -reftaxa ../silva_phyla.info -output star16S18Sseq.ranking

Example of the output

##Bacteria && Archaea: standard reference TreerOTU cutoff:
##values: 1->root, 0->same_as_refrence_standard, -1->identical_seqeunces_in_the_references
CanI4_Uncultured_Thiothrix_sp___JX Bacteria;ProteobacteriaFM
CanF1_Uncultured_Thiothrix_sp___JX Bacteria;ProteobacteriaFM
CanI1_Uncultured_Thiothrix_sp___JX Bacteria;ProteobacteriaFM
CanI3_Uncultured_Thiothrix_sp___JX Bacteria;ProteobacteriaFM
CanI2_-0.612Bacteria;ProteobacteriaFM
AerM5_Uncultured_Bacterium_KF Bacteria;ProteobacteriaJX
AerM4_Uncultured_Bacterium_KF Bacteria;ProteobacteriaJX
CanJ2_Uncultured_Bacterium_JF Bacteria;ProteobacteriaDQ
….
##Eukaryota: standard reference TreerOTU cutoff:
##values: 1->root, 0->same_as_refrence_standard, -1->identical_seqeunces_in_the_references
AC1_Uncultured_archaeon_clone_DQ Eukaryota;Archaeplastida;Chloroplastida;ChlorophytaEukaryota_AF
__4_Saccharomyces_cerevisiae_YJM789_JQ Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;AscomycotaEukaryota_BAEL
__2_Saccharomyces_cerevisiae_YJM993_CP Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;CnidariaEukaryota_ABRM
__2_Saccharomyces_cerevisiae_YJM789_JQ Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;AscomycotaEukaryota_BAEL
__3_Uncultured_Fungus_KC Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;AscomycotaEukaryota_GU
__1_Saccharomyces_cerevisiae_YJM789_JQ Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;CnidariaEukaryota_ABRM
__1_Saccharomyces_cerevisiae_YJM993_CP Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;CnidariaEukaryota_ABRM
__1_Uncultured_Eukaryote_EU Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;AscomycotaEukaryota_BAEL

Novelty ranking: Star Diamonds sample from Ramunas Stepanauskas and Mariana Erasmus
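The five steps of the command line example can be assembled for any input prefix with a small helper. This is only a sketch: the function name build_pipeline_commands is hypothetical, the script names, flags, and default paths are copied from the example, and the path to the sina executable is left as a required argument because it is not fully legible in the slide. The nohup ... & wrappers are omitted because each step consumes the previous step's output, so the commands are meant to run one after another.

```python
def build_pipeline_commands(
    prefix,
    sina_path,  # path to the sina executable; must be supplied by the user
    scripts="~/dwu_scripts/single_cell",
    silva_fasta="../../SSURef_NR99_115_tax_silva_trunc.fasta",
    silva_arb="../../SSURef_NR99_115_SILVA_20_07_13_opt.arb",
    core_alignment="../silva_phyla_rep.sina.fasta",
    ref_taxa="../silva_phyla.info",
    blat="~/bin/blat",
    fasttree="~/bin/FastTree",
):
    """Return the shell commands of the example, in execution order."""
    return [
        # 1. Collect close relatives of the queries from SILVA with blat.
        f"{scripts}/prep_seq_with_close_relatives_from_silva.pl -db {silva_fasta} "
        f"-blat {blat} -input {prefix}.txt -output {prefix}.hit",
        # 2. Align queries and hits with sina.
        f"{scripts}/run_sina.pl -i {prefix}.hit -o {prefix}.ali -db {silva_arb} -sina {sina_path}",
        # 3. Add the pre-aligned core references and trim all-gap columns.
        f"cat {prefix}.ali {core_alignment} | {scripts}/trim_all_nt_gap.pl > {prefix}.trim",
        # 4. Build the tree with FastTree.
        f"{fasttree} -nt {prefix}.trim > {prefix}.tre",
        # 5. Rank the queries by phylum-level novelty.
        f"{scripts}/phylum_novelty_ranking.pl -tree {prefix}.tre -reftaxa {ref_taxa} "
        f"-output {prefix}.ranking",
    ]
```

Printing the returned list for the prefix star16S18Sseq gives the same sequence of commands as the example, apart from the sina path and the background-job wrappers.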
Database problem
SILVA is required only in the alignment step (sina). We can modify the database used for close-relative selection, the taxonomic-level information, and the core reference sequences used for novelty ranking.
Figure label: Proteobacteria
Firmicutes
Database and reference updating:
1. Add new phyla to the core reference sequences
2. Merge or split current phylum definitions

Only one rule must be observed: the classification must be exclusive. Each sequence in the database or core reference set can be in only one taxonomic group, and the taxonomic groups should be treated equally. For example, if we break Proteobacteria into 4 groups, the definition of Proteobacteria needs to be removed. Proteobacteria sequences that cannot be assigned to the 4 sub-groups can stay in the database or core reference data set. They can play roles in tree structure and novelty ranking, but they play no role in taxonomic assignments or in identifying the novelty baseline TreeOTU cutoff.
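The exclusivity rule can be checked mechanically before a modified database is used. The sketch below assumes the taxonomy is available as (sequence ID, group) pairs, with groups written as semicolon-separated lineages as in the SILVA labels above, and with None marking sequences that are kept for tree building but left unassigned. The function name and input layout are hypothetical.

```python
def check_exclusive(assignments):
    """Return a list of rule violations (empty if the classification is exclusive).

    assignments: iterable of (sequence_id, group) pairs; group is a
    ';'-separated lineage string, or None for an unassigned sequence.
    """
    problems = []
    first_group = {}
    groups = set()
    for seq_id, group in assignments:
        if group is None:
            continue  # unassigned sequences take no part in the taxonomy
        groups.add(group)
        previous = first_group.setdefault(seq_id, group)
        if previous != group:
            problems.append(f"{seq_id} is in both {previous} and {group}")
    # A group must not coexist with one of its own sub-groups.
    for parent in sorted(groups):
        for child in sorted(groups):
            if child.startswith(parent + ";"):
                problems.append(f"{parent} overlaps its sub-group {child}")
    return problems
```

For example, a table that defines both Bacteria;Proteobacteria and Bacteria;Proteobacteria;Alphaproteobacteria as groups is reported, while one that leaves the unplaceable Proteobacteria sequences unassigned passes.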
High novelty ranking sequence problem: M00954:45: A8ECK:1:1102:18894: Archaea;EuryarchaeotaArchaea_AF
Batch job result consistency problem
In an ideal world, the pipeline takes one sequence at a time, and each sequence has one novelty ranking score. But the aligning and tree-building steps are slow, so we have to combine queries to build a reasonable number of alignments and trees. How different bundling affects the novelty ranking score needs to be addressed.
Team members at the current stage:
UC Davis: Dongying Wu, Guillaume Jospin, Jonathan A. Eisen
JGI: Jessica Jarett, Tanja Woyke
SCGC Bigelow: Ramunas Stepanauskas