Published by Theodore Gibbs. Modified over 9 years ago.
1
Automatic ssu-rRNA novelty ranking pipeline
Input: ssu-rRNA sequences
Output: one phylogenetic novelty ranking score for each sequence
Dongying Wu, 03/2015
2
A tree of query sequences with SILVA references (reference sequences with defined phyla).
Cut the tree at the phylum level into OTUs; query sequences that form singleton OTUs are from novel phyla.
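The idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the OTU assignment is taken as already computed, and the function name, sequence names, and OTU ids below are hypothetical, not part of the pipeline.

```python
from collections import defaultdict


def novel_phylum_queries(otu_of, references):
    """Return the queries whose phylum-level OTU contains no reference sequence.

    otu_of:     dict mapping sequence name -> OTU id (hypothetical input format)
    references: set of names that are reference sequences with defined phyla
    """
    members = defaultdict(set)
    for seq, otu in otu_of.items():
        members[otu].add(seq)

    novel = []
    for seqs in members.values():
        if not (seqs & references):  # the OTU is made of query sequences only
            novel.extend(seqs)
    return sorted(novel)


# Toy data: query_1 clusters with references, query_2 sits alone in OTU 3.
otu_of = {"ref_A1": 1, "ref_A2": 1, "query_1": 1, "ref_B1": 2, "query_2": 3}
references = {"ref_A1", "ref_A2", "ref_B1"}
novel = novel_phylum_queries(otu_of, references)
print(novel)
```

In this toy case only query_2 would be flagged as a candidate from a novel phylum.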
3
Number of representatives from each phylum (1950 bacteria/78 archaea)
  1  Archaea;Ancient Archaeal Group(AAG)
 14  Archaea;Crenarchaeota
 44  Archaea;Euryarchaeota
  1  Archaea;Korarchaeota
  2  Archaea;Marine Hydrothermal Vent Group 1(MHVG-1)
  1  Archaea;Nanoarchaeota
 15  Archaea;Thaumarchaeota
 70  Bacteria;Acidobacteria
 60  Bacteria;Actinobacteria
  3  Bacteria;Aquificae
 44  Bacteria;Armatimonadetes
 23  Bacteria;BD1-5
 11  Bacteria;BHI80-139
103  Bacteria;Bacteroidetes
  7  Bacteria;CK-1C4-19
 13  Bacteria;Caldiserica
311  Bacteria;Candidate division OD1
  4  Bacteria;Chlamydiae
 17  Bacteria;Chlorobi
125  Bacteria;Chloroflexi
  1  Bacteria;Chrysiogenetes
128  Bacteria;Cyanobacteria
 18  Bacteria;Deferribacteres
  8  Bacteria;Deinococcus-Thermus
  3  Bacteria;Dictyoglomi
 22  Bacteria;Elusimicrobia
 27  Bacteria;Fibrobacteres
217  Bacteria;Firmicutes
 24  Bacteria;Fusobacteria
  4  Bacteria;GAL08
  5  Bacteria;GOUTA4
  9  Bacteria;Gemmatimonadetes
  6  Bacteria;Hyd24-12
 10  Bacteria;JL-ETNP-Z39
  9  Bacteria;Kazan-3B-28
  2  Bacteria;LD1-PA38
 22  Bacteria;Lentisphaerae
  3  Bacteria;MVP-21
 29  Bacteria;NPL-UPA2
 22  Bacteria;Nitrospirae
  3  Bacteria;OC31
116  Bacteria;Planctomycetes
256  Bacteria;Proteobacteria
  8  Bacteria;RF3
  1  Bacteria;RsaHF231
  1  Bacteria;S2R-29
  1  Bacteria;SBYG-2791
  3  Bacteria;SM2F11
 45  Bacteria;Spirochaetae
 13  Bacteria;Synergistetes
 55  Bacteria;TA06
 15  Bacteria;TM6
 13  Bacteria;Tenericutes
  3  Bacteria;Thermodesulfobacteria
 18  Bacteria;Thermotogae
 24  Bacteria;Verrucomicrobia
  6  Bacteria;WCHB1-60
  6  Bacteria;aquifer1
  3  Bacteria;aquifer2
4
Number of representatives from each phylum (564 Eukaryota)
 38  Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta
 29  Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta
 41  Eukaryota;Excavata;Discoba;Discicristata;Euglenozoa;Euglenida
  7  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Acanthocephala
 27  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Annelida
 66  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Arthropoda
  2  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Brachiopoda
  4  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Bryozoa
  1  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Chaetognatha
  8  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria
  3  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Ctenophora
  1  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cycliophora
  4  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Echinodermata
  3  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Entoprocta
  3  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Gastrotricha
  5  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Gnathostomulida
  4  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Hemichordata
  1  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Kinorhyncha
  1  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Loricifera
  1  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mesozoa;Orthonectida
  1  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mesozoa;Rhombozoa
 17  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mollusca
  1  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Myzostomida
 17  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nematoda
  2  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nematomorpha
 10  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nemertea
  1  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Onychophora
 27  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Platyhelminthes
  2  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Priapulida
  3  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Rotifera
  4  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Tardigrada
  1  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Xenoturbellida
  3  Eukaryota;Opisthokonta;Holozoa;Metazoa;Porifera
 11  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Chytridiomycota
 55  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota
 12  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota
 11  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Kickxellomycotina;Glomeromycota
 25  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Microsporidia
  7  Eukaryota;Picozoa
 95  Eukaryota;SAR;Alveolata;Apicomplexa
  2  Eukaryota;SAR;Alveolata;Protalveolata;Chromerida
  1  Eukaryota;SAR;Stramenopiles;Diatomea;Coscinodiscophytina;Fragilariales;Ctenophora
  3  Eukaryota;SAR;Stramenopiles;Phaeophyceae
  4  Eukaryota;SAR;Stramenopiles;Xanthophyceae
5
The selection process maximizes phylogenetic diversity.
The core representative sequences include only those with PHYLUM assignments, so the representative data set has phylogenetic gaps.
For tree building, we therefore have to include close relatives of the query sequences from SILVA. They serve two purposes:
1. Filling phylogenetic gaps in the core references
2. Guiding short query sequences into the right positions in the tree
Workflow:
Query ssu-rRNA
-> Top 10 hits from SILVA by blat
-> Align query sequences and hits by sina
-> Build ML tree by FastTree together with pre-aligned core reference sequences
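The "top 10 hits" step can be illustrated with a small Python sketch. It assumes the search results have already been reduced to (query, target, score) tuples; how the actual script parses and ranks blat output is not shown on the slide, so the input format and the `top_hits` helper are hypothetical.

```python
import heapq
from collections import defaultdict


def top_hits(hits, n=10):
    """Keep the n best-scoring targets per query.

    hits: iterable of (query, target, score) tuples (assumed format).
    Returns a dict mapping query -> list of targets, best first.
    """
    by_query = defaultdict(list)
    for query, target, score in hits:
        by_query[query].append((score, target))
    return {
        query: [target for _, target in heapq.nlargest(n, pairs)]
        for query, pairs in by_query.items()
    }


# Toy hits with made-up names and scores.
hits = [("q1", "ref_a", 90), ("q1", "ref_b", 99), ("q1", "ref_c", 50), ("q2", "ref_a", 10)]
selected = top_hits(hits, n=2)

# The close relatives to add to the alignment are the union over all queries.
close_relatives = sorted({t for targets in selected.values() for t in targets})
print(selected, close_relatives)
```

Taking the union over all queries means a reference picked up by several queries is aligned only once.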
6
Jessica Jarett's test set: 28 ssu-rRNA sequences + 169 top hits + core references
[Tree figure; clades labeled Eukaryota, Archaea, Bacteria]
7
Tree rooting to separate Eukaryota and Archaea/Bacteria (automatic)
[Tree figure; clades labeled Eukaryota, Archaea, Bacteria]
8
Tree rooting for Archaea/Bacteria (automatic)
10
How to identify the cutoff line at the phylum level
Cut the tree using different TreeOTU cutoffs, and compare the resulting OTUs with the phylum-level OTU standard defined by SILVA (query sequences are ignored during the comparison).
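The cutoff scan can be sketched as follows. This is a minimal illustration, not the pipeline's code: it assumes the comparison measure is adjusted mutual information (AMI) with arithmetic-mean normalization, and that the OTU assignment of each reference at each cutoff has already been produced by the tree-cutting step. The `otus_by_cutoff` layout and all names are hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    n = len(u)
    cu, cv, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(nij / n * log(n * nij / (cu[a] * cv[b]))
               for (a, b), nij in joint.items())


def expected_mutual_information(u, v):
    """Expected MI of two random partitions with the same cluster sizes."""
    n = len(u)
    emi = 0.0
    for a in Counter(u).values():
        for b in Counter(v).values():
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                # log of the hypergeometric probability of overlap nij
                log_p = (lgamma(a + 1) + lgamma(b + 1)
                         + lgamma(n - a + 1) + lgamma(n - b + 1)
                         - lgamma(n + 1) - lgamma(nij + 1)
                         - lgamma(a - nij + 1) - lgamma(b - nij + 1)
                         - lgamma(n - a - b + nij + 1))
                emi += nij / n * log(n * nij / (a * b)) * exp(log_p)
    return emi


def ami(u, v):
    mi, emi = mutual_information(u, v), expected_mutual_information(u, v)
    denom = (entropy(u) + entropy(v)) / 2 - emi
    return 1.0 if abs(denom) < 1e-12 else (mi - emi) / denom


def best_cutoff(otus_by_cutoff, phylum_of):
    """Pick the cutoff whose OTUs agree best with the reference phyla.

    phylum_of holds reference sequences only, so queries are ignored.
    """
    refs = sorted(phylum_of)
    truth = [phylum_of[r] for r in refs]
    scores = {c: ami([otus[r] for r in refs], truth)
              for c, otus in otus_by_cutoff.items()}
    return max(scores, key=scores.get), scores


# Toy data: six references in three phyla, clustered at three cutoffs.
phylum_of = {"r1": "A", "r2": "A", "r3": "B", "r4": "B", "r5": "C", "r6": "C"}
otus_by_cutoff = {
    0.2: {"r1": 1, "r2": 2, "r3": 3, "r4": 4, "r5": 5, "r6": 6},  # too fine
    0.4: {"r1": 1, "r2": 1, "r3": 2, "r4": 2, "r5": 3, "r6": 3},  # matches phyla
    0.8: {"r1": 1, "r2": 1, "r3": 1, "r4": 1, "r5": 1, "r6": 1},  # too coarse
}
best, scores = best_cutoff(otus_by_cutoff, phylum_of)
print(best, scores)
```

AMI is corrected for chance, so neither the all-singletons nor the single-cluster extreme is rewarded; the best cutoff is the one whose OTUs line up with the phylum labels.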
11
[Plot: AMI compared to the SILVA phylum-level definition (Bacteria/Archaea), as a function of TreeOTU cutoff]
0.42 is the TreeOTU cutoff for phylum.
Is the query in an OTU with reference sequences? Yes / No
10-CSP1477__MDM2__DC4__Prim__02__M8__S11_B02_014
19-CSP1477__MDM2__DC4__SYBR__02__M20__S7_C03_027
12
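\[
\text{Novel score} =
\begin{cases}
\dfrac{\text{Cutoff}_{\text{query}} - \text{Cutoff}_{\text{phylum}}}{1 - \text{Cutoff}_{\text{phylum}}} & \text{if } \text{Cutoff}_{\text{query}} \ge \text{Cutoff}_{\text{phylum}} \\[2ex]
\dfrac{\text{Cutoff}_{\text{query}} - \text{Cutoff}_{\text{phylum}}}{\text{Cutoff}_{\text{phylum}}} & \text{if } \text{Cutoff}_{\text{query}} < \text{Cutoff}_{\text{phylum}}
\end{cases}
\]
Scale: 1 (root), 0 (phylum line), -1 (tip)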
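The piecewise scaling on this slide can be written as a short Python function. It is a sketch under the assumption that the score is the query's cutoff rescaled linearly so that the phylum cutoff maps to 0, the root (cutoff 1) to 1, and the tip (cutoff 0) to -1; the function and argument names are illustrative.

```python
def novelty_score(cutoff_query, cutoff_phylum):
    """Rescale a query's TreeOTU cutoff relative to the phylum-level cutoff.

    Above the phylum line the distance is scaled by (1 - cutoff_phylum),
    below it by cutoff_phylum, so the result always lies in [-1, 1].
    """
    if not 0.0 < cutoff_phylum < 1.0:
        raise ValueError("cutoff_phylum must lie strictly between 0 and 1")
    if cutoff_query >= cutoff_phylum:
        return (cutoff_query - cutoff_phylum) / (1.0 - cutoff_phylum)
    return (cutoff_query - cutoff_phylum) / cutoff_phylum


# Boundary values with the phylum cutoff of 0.42 from the previous slide.
for q in (1.0, 0.42, 0.0):
    print(q, novelty_score(q, 0.42))
```

Using two different denominators keeps both ends of the scale fixed at 1 and -1 no matter where the phylum line falls.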
13
Chlorobi Ranking value: -0.619
14
[Tree figure; labels: Aquificae, Deinococcus-Thermus ?, Archaea ?]
15
The pipeline has been completed. Here is one command line example:

~/dwu_scripts/single_cell/prep_seq_with_close_relatives_from_silva.pl -db ../../SSURef_NR99_115_tax_silva_trunc.fasta -blat ~/bin/blat -input star16S18Sseq.txt -output star16S18Sseq.hit
nohup ~/dwu_scripts/single_cell/run_sina.pl -i star16S18Sseq.hit -o star16S18Sseq.ali -db ../../SSURef_NR99_115_SILVA_20_07_13_opt.arb -sina ~/bin/SINA/sina-1.2.11/sina &
cat star16S18Sseq.ali ../silva_phyla_rep.sina.fasta | ~/dwu_scripts/single_cell/trim_all_nt_gap.pl > star16S18Sseq.trim
nohup ~/bin/FastTree -nt star16S18Sseq.trim > star16S18Sseq.tre &
~/dwu_scripts/single_cell/phylum_novelty_ranking.pl -tree star16S18Sseq.tre -reftaxa ../silva_phyla.info -output star16S18Sseq.ranking

Example of the output

##Bacteria && Archaea: standard reference TreerOTU cutoff: 0.490
##values: 1->root, 0->same_as_refrence_standard, -1->identical_seqeunces_in_the_references
CanI4_Uncultured_Thiothrix_sp___JX435593  -0.510  Bacteria;Proteobacteria  FM174326.1.1418  0.14918
CanF1_Uncultured_Thiothrix_sp___JX435593  -0.510  Bacteria;Proteobacteria  FM174326.1.1418  0.1617
CanI1_Uncultured_Thiothrix_sp___JX435593  -0.510  Bacteria;Proteobacteria  FM174326.1.1418  0.20465
CanI3_Uncultured_Thiothrix_sp___JX435593  -0.510  Bacteria;Proteobacteria  FM174326.1.1418  0.20612
CanI2_  -0.612  Bacteria;Proteobacteria  FM165230.1.1430  0.12171
AerM5_Uncultured_Bacterium_KF135895  -0.633  Bacteria;Proteobacteria  JX223534.1.1452  0.21316
AerM4_Uncultured_Bacterium_KF135901  -0.633  Bacteria;Proteobacteria  JX223534.1.1452  0.12817
CanJ2_Uncultured_Bacterium_JF232531  -0.735  Bacteria;Proteobacteria  DQ264409.1.1505  0.1243
…

##Eukaryota: standard reference TreerOTU cutoff: 0.370
##values: 1->root, 0->same_as_refrence_standard, -1->identical_seqeunces_in_the_references
AC1_Uncultured_archaeon_clone_DQ088777  0.984  Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta  Eukaryota_AF525614.1.1325  1.41585
6__4_Saccharomyces_cerevisiae_YJM789_JQ277730  -0.703  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota  Eukaryota_BAEL01000039.331389.332863  0.20688
2__2_Saccharomyces_cerevisiae_YJM993_CP006467  -0.703  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria  Eukaryota_ABRM01021940.10464.12058  0.35983
6__2_Saccharomyces_cerevisiae_YJM789_JQ277730  -0.703  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota  Eukaryota_BAEL01000039.331389.332863  0.20556
6__3_Uncultured_Fungus_KC337083  -0.703  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota  Eukaryota_GU324000.1.1794  0.33472
4__1_Saccharomyces_cerevisiae_YJM789_JQ277730  -0.703  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria  Eukaryota_ABRM01021940.10464.12058  0.35867
2__1_Saccharomyces_cerevisiae_YJM993_CP006467  -0.703  Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria  Eukaryota_ABRM01021940.10464.12058  0.35882
6__1_Uncultured_Eukaryote_EU326631  -0.703  Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota  Eukaryota_BAEL01000039.331389.332863  0.20425

Novelty ranking: Star Diamonds sample from Ramunas Stepanauskas, Mariana Erasmus
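For downstream use, a ranking file like the one above could be read and sorted with a short Python sketch. The file layout is an assumption here: the slide does not state the delimiter or name the columns, so the code assumes tab-separated fields with the query name first, the score second, and the taxonomy third, and treats lines starting with `##` as comments. The sample rows reuse values from the slide.

```python
def parse_ranking(lines):
    """Parse ranking output into records sorted from most to least novel.

    Assumed layout (not confirmed by the slide): tab-separated fields,
    query name, novelty score, taxonomy, then any further columns.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip anything that is not a data row
        records.append({
            "query": fields[0],
            "score": float(fields[1]),
            "taxonomy": fields[2],
        })
    return sorted(records, key=lambda r: r["score"], reverse=True)


sample = [
    "##Eukaryota: standard reference TreerOTU cutoff: 0.370",
    "6__3_Uncultured_Fungus_KC337083\t-0.703\tEukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota",
    "AC1_Uncultured_archaeon_clone_DQ088777\t0.984\tEukaryota;Archaeplastida;Chloroplastida;Chlorophyta",
]
ranked = parse_ranking(sample)
print(ranked[0]["query"], ranked[0]["score"])
```

Sorting in descending order puts the sequences closest to the root, and therefore the most novel candidates, at the top.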
16
Database Problem
SILVA is required only in the aligning step (sina). We can modify the database used for close-relative selection, the taxonomic-level information, and the core reference sequences for novelty ranking.
[Tree figure; label: Proteobacteria]
17
Firmicutes
18
Database and reference updating:
1. Add new phyla to the core reference sequences
2. Merge or split current phylum definitions
Only one rule must be observed: the classification must be exclusive. Each sequence in the database or core reference set must belong to exactly one taxonomic group, and the taxonomic groups should be treated equally.
E.g., if we break Proteobacteria into 4 groups, the definition of Proteobacteria needs to be removed. Proteobacteria sequences that cannot be assigned to the 4 sub-groups can stay in the database or core reference data set. They can play a role in tree structure and novelty ranking, but play no role in taxonomic assignments or in identifying the novelty baseline TreeOTU cutoff.
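The exclusivity rule lends itself to an automated check before a modified reference set is used. The sketch below is one possible way to do it, assuming taxonomy strings are semicolon-separated paths and that sequences left without a group carry `None`. The function name, input format, and the class names used as sub-groups are illustrative assumptions.

```python
from collections import defaultdict


def exclusivity_problems(assignments):
    """List violations of the 'one sequence, one group' rule.

    assignments: iterable of (sequence_id, taxon) pairs; taxon is None for
    sequences that are kept but not assigned to any group.
    """
    groups_of = defaultdict(set)
    for seq, taxon in assignments:
        if taxon is not None:
            groups_of[seq].add(taxon)

    # A sequence listed under more than one group is not exclusive.
    problems = [f"{seq} is in {len(groups)} groups"
                for seq, groups in sorted(groups_of.items()) if len(groups) > 1]

    # A group that still exists alongside its own sub-group overlaps it.
    taxa = sorted({t for groups in groups_of.values() for t in groups})
    for parent in taxa:
        for child in taxa:
            if child.startswith(parent + ";"):
                problems.append(f"{parent} overlaps its sub-group {child}")
    return problems


# Splitting Proteobacteria but forgetting to remove the parent definition.
bad = [
    ("s1", "Bacteria;Proteobacteria"),
    ("s2", "Bacteria;Proteobacteria;Alphaproteobacteria"),
    ("s3", None),  # unassignable sequence: allowed, plays no taxonomic role
]
good = [
    ("s1", "Bacteria;Proteobacteria;Alphaproteobacteria"),
    ("s2", "Bacteria;Proteobacteria;Gammaproteobacteria"),
    ("s3", None),
]
print(exclusivity_problems(bad))
print(exclusivity_problems(good))
```

Unassigned sequences pass the check untouched, which mirrors the slide's point that they may stay in the data set without contributing to taxonomic assignment.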
19
High novelty ranking sequence problem:
M00954:45:000000000-A8ECK:1:1102:18894:5578  0.981  Archaea;Euryarchaeota  Archaea_AF328210.1.1013  2.25351
20
Batch job result consistency problem
In an ideal world, the pipeline would take one sequence at a time, and each sequence would have one novelty ranking score. But the aligning and tree-building steps are slow, so we have to combine queries to build a reasonable number of alignments and trees. How different bundlings affect the novelty ranking score needs to be addressed.
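One simple way to start quantifying this is to run the same queries in different bundles and compare the scores each query receives. The Python sketch below assumes each run has been reduced to a dict of query name to novelty score; the function name and the toy values are hypothetical.

```python
def score_spread(runs):
    """Return max - min novelty score for each query seen in at least two runs.

    runs: list of dicts, one per bundling, mapping query name -> score.
    """
    seen = {}
    for run in runs:
        for query, score in run.items():
            seen.setdefault(query, []).append(score)
    return {q: max(s) - min(s) for q, s in seen.items() if len(s) > 1}


# Toy example: q1 was scored in two different bundles.
runs = [
    {"q1": -0.510, "q2": 0.980},
    {"q1": -0.530, "q3": 0.100},
]
spread = score_spread(runs)
print(spread)
```

A large spread for a query would indicate that its score depends on which other sequences were bundled with it.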
21
Team members at the current stage:
UC Davis: Dongying Wu, Guillaume Jospin, Jonathan A. Eisen
JGI: Jessica Jarett, Tanja Woyke
SCGC Bigelow: Ramunas Stepanauskas