GEBA Project Summary Dongying Wu
Phylogenetic Tree Building (Martin Wu): concatenate alignments of 31 marker genes; build a PHYML tree; 667 non-GEBA genomes, 53 GEBA genomes
Phylogenetic Distance (PD): PD = sum of all the branch lengths. Example: PD{A,B,C} = a + b + c + d. [Figure: tree with taxa A, B, C and branches a, b, c, d]
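The PD definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the tree is stored as a hypothetical child -> (parent, branch length) dictionary, the topology ((A:a, B:b):d, C:c) and the branch lengths are assumed for the example, and PD is taken to include every branch on the paths from the selected taxa up to the root.

```python
# Toy rooted tree stored as child -> (parent, branch length).
# Topology and lengths are illustrative assumptions: ((A:a, B:b):d, C:c).
def phylogenetic_distance(tree, taxa):
    """Sum the lengths of all branches on the paths from the given taxa to the root."""
    counted = set()  # nodes whose parent branch has already been added
    total = 0.0
    for taxon in taxa:
        node = taxon
        # Walk towards the root; stop at the root or at an already counted node.
        while node in tree and node not in counted:
            parent, length = tree[node]
            counted.add(node)
            total += length
            node = parent
    return total


a, b, c, d = 0.1, 0.2, 0.3, 0.4
tree = {"A": ("AB", a), "B": ("AB", b), "AB": ("root", d), "C": ("root", c)}
pd_abc = phylogenetic_distance(tree, ["A", "B", "C"])  # a + b + c + d
```

Each branch is counted once even when several taxa share it, which is why the shared branch d appears only once in PD{A,B,C}.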
Phylogenetic Distance Contribution of GEBA genomes: 53 random non-GEBA taxa (from a pool of 667) contribute 3.15 to the tree PD (standard deviation 0.68 over 100 samplings). The total tree PD is 88.8; GEBA genomes add 11.0 to the tree. The 26 GEBA actinobacteria add 4.29 to the total PD (actinobacteria as a whole add PD). 26 random non-GEBA actinobacteria (from a pool of 47) contribute 1.37 PD (standard deviation 0.28 over 100 samplings).
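One plausible way to obtain such numbers is sketched below. It assumes that the contribution of a set of taxa is the PD of the full tree minus the PD of the tree without them, and that the random baseline draws k taxa from a pool repeatedly and reports the mean and standard deviation. The tree encoding, function names, and toy branch lengths are all hypothetical.

```python
import random
import statistics


def pd(tree, taxa):
    """PD of a taxon set: sum of branch lengths on the paths to the root."""
    counted, total = set(), 0.0
    for node in taxa:
        while node in tree and node not in counted:
            parent, length = tree[node]
            counted.add(node)
            total += length
            node = parent
    return total


def pd_contribution(tree, all_taxa, added):
    """PD that the `added` taxa contribute: full PD minus PD without them."""
    added = set(added)
    rest = [t for t in all_taxa if t not in added]
    return pd(tree, all_taxa) - pd(tree, rest)


def random_contribution(tree, all_taxa, pool, k, n_samples=100, seed=0):
    """Mean and standard deviation of the contribution of k random pool taxa."""
    rng = random.Random(seed)
    values = [pd_contribution(tree, all_taxa, rng.sample(pool, k))
              for _ in range(n_samples)]
    return statistics.mean(values), statistics.stdev(values)


# Toy tree (child -> (parent, branch length)); values are illustrative only.
tree = {"A": ("AB", 1.0), "B": ("AB", 2.0), "AB": ("root", 0.5), "C": ("root", 3.0)}
taxa = ["A", "B", "C"]
contribution_a = pd_contribution(tree, taxa, ["A"])
mean_pd, sd_pd = random_contribution(tree, taxa, pool=["A", "B"], k=1)
```

Removing a taxon only removes the branches that no remaining taxon uses, so a taxon with close relatives in the tree contributes little, while an isolated lineage contributes a lot.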
Gene Family Classification: 227,562 genes from 56 genomes => 17,176,180 links. Blastp: E-value cutoff 1e-10, report hits. Only blastp hits that span 80% of the lengths of both genes are kept as links.
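The link filter can be sketched as follows. The example assumes BLAST tabular output (the standard 12-column `-outfmt 6` layout) and a separate dictionary of gene lengths; how the span was actually measured in the project is not stated, so using the aligned coordinate range on each gene is an assumption, and the gene names and hits are made up.

```python
def parse_hit(line):
    """Parse one line of BLAST tabular output (12 standard columns)."""
    f = line.rstrip("\n").split("\t")
    return {
        "query": f[0], "subject": f[1], "identity": float(f[2]),
        "qstart": int(f[6]), "qend": int(f[7]),
        "sstart": int(f[8]), "send": int(f[9]),
        "evalue": float(f[10]),
    }


def is_link(hit, gene_lengths, min_span=0.8, max_evalue=1e-10):
    """Keep a hit only if it covers >= min_span of BOTH genes and passes the E-value cutoff."""
    q_span = (abs(hit["qend"] - hit["qstart"]) + 1) / gene_lengths[hit["query"]]
    s_span = (abs(hit["send"] - hit["sstart"]) + 1) / gene_lengths[hit["subject"]]
    return hit["evalue"] <= max_evalue and q_span >= min_span and s_span >= min_span


# Hypothetical gene lengths (amino acids) and hits, for illustration only.
gene_lengths = {"g1": 100, "g2": 100, "g3": 60}
blast_lines = [
    "g1\tg2\t65.0\t90\t30\t1\t5\t94\t3\t92\t1e-30\t120",   # spans 90% of both genes
    "g1\tg3\t70.0\t50\t15\t0\t1\t50\t1\t50\t1e-20\t80",    # spans only 50% of g1
    "g2\tg3\t40.0\t86\t50\t2\t10\t95\t2\t59\t1e-05\t45",   # E-value too high
]
links = [(h["query"], h["subject"], h["identity"])
         for h in map(parse_hit, blast_lines) if is_link(h, gene_lengths)]
```

The surviving (query, subject, identity) triples are the kind of weighted links that a clustering step can take as input.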
MCL Clustering Algorithm: links (matrix of sequence identities) -> alternate expansion and inflation (I = 2) -> equilibrium state
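The expansion/inflation loop can be illustrated with a small pure-Python sketch. It is a teaching version of Markov clustering on a dense matrix, not the MCL program itself: the self-loop weight, the convergence tolerance, and the way clusters are read off the final matrix are simplifying assumptions, and the six-node graph is invented.

```python
def normalize_columns(m):
    """Rescale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def inflate(m, power):
    """Inflation: raise entries to `power`, then renormalize the columns."""
    return normalize_columns([[v ** power for v in row] for row in m])


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    n = len(weights)
    # Add self-loops (weight 1.0 is an assumption) and make columns stochastic.
    m = normalize_columns([[weights[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        new = inflate(expand(m), inflation)
        delta = max(abs(new[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = new
        if delta < tol:  # equilibrium state reached
            break
    # Each non-empty row is an attractor; its non-zero columns form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two tight groups of genes joined by one weak link (illustrative identities).
n = 6
weights = [[0.0] * n for _ in range(n)]
for i, j, w in [(0, 1, 0.9), (0, 2, 0.9), (1, 2, 0.9),
                (3, 4, 0.9), (3, 5, 0.9), (4, 5, 0.9), (2, 3, 0.1)]:
    weights[i][j] = weights[j][i] = w
clusters = mcl(weights)
```

Expansion spreads flow along paths, while inflation strengthens strong flows and weakens weak ones; a higher inflation value generally yields smaller, tighter clusters.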
[Figure: histogram; axes: family size (genes/genome), number of families]
Evenness estimation. Gene distribution ratio for family X, by genome: A 0.316, B 0.105, C 0.026, D 0, E 0.184, F 0.215, G 0.158. Median dist: Distance average = 0.087. Evenness = 100 × e^(-4 × dist). 0.031
Universality: ratio of genomes that a family appears in. Evenness: how evenly the members of a gene family are distributed across genomes. Size: number of members in a gene family.
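The three family measures can be sketched as below. Size and universality follow directly from the definitions; for evenness only the final formula from the evenness slide, Evenness = 100 × e^(-4 × dist), is implemented, and the function takes the average distance `dist` as an input because the slide text does not fully specify how that distance is computed. The data layout (a list of (gene, genome) pairs) and all names are hypothetical.

```python
import math


def family_size(members):
    """Size: number of member genes in the family."""
    return len(members)


def universality(members, n_genomes):
    """Universality: fraction of genomes in which the family appears."""
    return len({genome for _gene, genome in members}) / n_genomes


def evenness(dist):
    """Evenness = 100 * exp(-4 * dist); dist is the average distance from the slide."""
    return 100.0 * math.exp(-4.0 * dist)


# Hypothetical family: (gene id, genome id) pairs from 2 of 4 genomes.
family = [("gene1", "A"), ("gene2", "A"), ("gene3", "B")]
size = family_size(family)
univ = universality(family, n_genomes=4)
even = evenness(0.087)
```

A perfectly even family (dist = 0) scores 100, and the score decays exponentially as the distribution across genomes becomes more skewed.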
Family size
Large families (famID, size, function):
F (75/genome) ABC-type transport system ATP-binding protein
F (27/genome) multi-sensor hybrid histidine kinase
F (24/genome) short chain dehydrogenase
F (20/genome) acyl-CoA synthetase
F (14/genome) serine/threonine protein kinase
F (13/genome) two-component system response regulator (LuxR family)
F (13/genome) two-component system response regulator (winged helix family)
F (11/genome) drug resistance transporter
F (11/genome) transcriptional regulator, LacI family
F (10/genome) two-component system sensor histidine kinase
F (10/genome) sugar ABC transporter, permease component
Low universality large families (famID, size, organism number, family function, taxonomy):
F outer membrane protein: Bacteroidetes; Proteobacteria
F outer membrane protein: Bacteroidetes
F anti-sigma factor: Bacteroidetes; Proteobacteria
F transcriptional regulator, AraC family: Bacteroidetes; Proteobacteria
F RNA polymerase ECF-type sigma factor: Bacteroidetes (Sphingobacteriales)
F DNA-binding protein: Actinobacteria (Actinobacteridae)
F FtsX transmembrane transport protein: Bacteroidetes (Sphingobacteriales)
F hypothetical protein: Actinobacteria (Coriobacteriaceae)
3 out of 9 largest families have very low evenness value (< 5): short chain dehydrogenase; acyl-CoA synthetase; two-component system response regulator (LuxR)
Genomes (number, phylum or class, organism):
56 Halobacteria Halorhabdus_utahensis
55 Halobacteria Halomicrobium_mukohataei
54 Halobacteria Halogeometricum_borinquense
53 Aminanaerobia Thermanaerovibrio_acidaminovorans
52 Deferribacteres Dethiosulfovibrio_peptidovorans
51 Deinococci Meiothermus_silvanus
50 Deinococci Meiothermus_ruber
49 Chloroflexi Thermobaculum_terrenum
48 Chloroflexi Sphaerobacter_thermophilus
47 Actinobacteria Conexibacter_woesei
46 Actinobacteria Atopobium_parvulum
45 Actinobacteria Slackia_heliotrinireducens
44 Actinobacteria Eggerthella_lenta
43 Actinobacteria Cryptobacterium_curtum
42 Actinobacteria Acidimicrobium_ferrooxidans
41 Actinobacteria Kribbella_flavida
40 Actinobacteria Catenulispora_acidiphila
39 Actinobacteria Stackebrandtia_nassauensis
38 Actinobacteria Geodermatophilus_obscurus
37 Actinobacteria Nakamurella_multipartita
36 Actinobacteria Actinosynnema_mirum
35 Actinobacteria Saccharomonospora_viridis
34 Actinobacteria Tsukamurella_paurometabola
33 Actinobacteria Gordonia_bronchialis
32 Actinobacteria Streptosporangium_roseum
31 Actinobacteria Thermobispora_bispora
30 Actinobacteria Thermomonospora_curvata
29 Actinobacteria Nocardiopsis_dassonvillei
28 Actinobacteria Kytococcus_sedentarius
27 Actinobacteria Brachybacterium_faecium
26 Actinobacteria Beutenbergia_cavernae
25 Actinobacteria Cellulomonas_flavigena
24 Actinobacteria Xylanimonas_cellulosilytica
23 Actinobacteria Jonesia_denitrificans
22 Actinobacteria Sanguibacter_keddieii
21 Firmicutes Anaerococcus_prevotii
20 Firmicutes Alicyclobacillus_acidocaldarius
19 Firmicutes Veillonella_parvula
18 Firmicutes Desulfotomaculum_acetoxidans
17 Fusobacteria Sebaldella_termitidis
16 Fusobacteria Leptotrichia_buccalis
15 Fusobacteria Streptobacillus_moniliformis
14 Spirochaetes Brachyspira_murdochii
13 Bacteroidetes Planctomyces_limnophilus
12 Bacteroidetes Rhodothermus_marinus
11 Bacteroidetes Capnocytophaga_ochracea
10 Bacteroidetes Chitinophaga_pinensis
09 Bacteroidetes Pedobacter_heparinus
08 Bacteroidetes Spirosoma_linguale
07 Bacteroidetes Dyadobacter_fermentans
06 Epsilonproteobacteria Sulfurospirillum_deleyianum
05 Deferribacteres Denitrovibrio_acetiphilus
04 Deltaproteobacteria Haliangium_ochraceum
03 Deltaproteobacteria Desulfomicrobium_baculatum
02 Deltaproteobacteria Desulfohalobium_retbaense
01 Gammaproteobacteria Kangiella_koreensis
Phylum-specific family: 26/56 genomes are Actinobacteria. [Figure: axes: gene number, from Actinobacteria by chance]
712 families (size >= 10) are phylum specific. [Figure: axes: family size, organism number]
Phylum-specific families from more than two organisms. [Table: family size bins (10 <= x < ...) by phylum: Actinobacteria, Bacteroidetes, Deinococci, Firmicutes, Fusobacteria, Halobacteria]
The largest 6 phylum-specific families:
F2699 Bacteroidetes=303; outer membrane protein
*F2752 Actinobacteria=160; RNA polymerase, sigma-24 subunit, ECF family
F2772 Bacteroidetes=147; putative ECF-type RNA polymerase sigma factor
F2801 Actinobacteria=129; DNA-binding protein
F2827 Bacteroidetes=114; FtsX-related transmembrane transport protein
F2867 Actinobacteria=103; unknown functions
* From 15 organisms
Novel gene families: none of the genes in a family has a GenBank hit (E-value cutoff: 1e-5)
Streptococcus agalactiae “pan-genome”: Tettelin H. et al. PNAS 2005;102:
217,079 genes from 53 GEBA bacterial genomes; families. N genomes -> number of families with the selected genomes. A: N from 1 to 53. B: For every N, sample the families 100 times.
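The sampling procedure (steps A and B) can be sketched as an accumulation curve. The sketch assumes that, for each N, N genomes are drawn at random without replacement, the distinct families found in them are counted, and the counts are averaged over the repeated draws; the input layout (genome -> set of family ids) and the toy data are hypothetical.

```python
import random


def family_accumulation(genome_families, n_samples=100, seed=0):
    """Mean number of distinct families in N random genomes, for N = 1..total."""
    rng = random.Random(seed)
    genomes = sorted(genome_families)
    means = []
    for n in range(1, len(genomes) + 1):          # A: N from 1 to the genome count
        counts = []
        for _ in range(n_samples):                # B: repeat the sampling for every N
            chosen = rng.sample(genomes, n)
            families = set().union(*(genome_families[g] for g in chosen))
            counts.append(len(families))
        means.append(sum(counts) / n_samples)
    return means


# Toy input: three genomes sharing some families (illustrative only).
genome_families = {"g1": {1, 2}, "g2": {2, 3}, "g3": {3, 4}}
curve = family_accumulation(genome_families)
```

Plotting such a curve against N shows whether new genomes keep adding new families or whether the family count is levelling off.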
[Figure: Bacteria from GEBA project; axes: genome number, gene family number (including families with single members); inset axes: number of genomes, new genome families]
Data sets:
Actinobacteria: 73 genomes, including 26 GEBA genomes
Streptococcus agalactiae: 8 strains
Enterobacteriaceae (40 genomes): 9 Escherichia coli, 7 Yersinia pestis, 6 Salmonella enterica, 3 Shigella flexneri
Bacteria: 53 GEBA genomes
[Figure: Bacteria from GEBA project; axes: genome number, gene family number (including families with single members)]
[Figure: axes: genome number, total gene number]
[Figure: axes: total gene number, gene family number; series: S. agalactiae, Enterobacteriaceae, Actinobacteria, Bacteria from GEBA project]
Calculate the PD (Phylogenetic Diversity) of a sub-tree
[Figure: Bacteria from GEBA project; axes: genome number, phylogenetic diversity]
[Figure: Bacteria from GEBA project; axes: phylogenetic diversity, gene family number]
How far down the road GEBA has to go in terms of PD coverage
[Flowchart, all-representative tree] Bacterial/Archaeal ss-rRNA from Greengenes; clusters (MCL, 99% identity at 80% span); Greengenes Bacterial/Archaeal ss-rRNA; 667 Combo Bacterial ss-rRNA; 50 Combo Archaeal ss-rRNA; 56 GEBA ss-rRNA; retrieve alignments from Greengenes; QuickTree distance tree for all representatives; filter out ss-rRNA from genome projects (99% identity cutoff); filter out low-quality sequences (short sequences <= 1200 nt, low-quality sequences, duplicates, chimeras); trim by the Greengenes mask
[Flowchart, non-environmental tree] 74437 non-environmental Bacterial/Archaeal ss-rRNA from Greengenes; clusters (MCL, 99% identity at 80% span); 9946 Greengenes Bacterial/Archaeal ss-rRNA; 667 Combo Bacterial ss-rRNA; 50 Combo Archaeal ss-rRNA; 56 GEBA ss-rRNA; retrieve alignments from Greengenes; QuickTree distance tree for non-environmental representatives; filter out ss-rRNA from genome projects (99% identity cutoff); filter out low-quality sequences (short sequences <= 1200 nt, low-quality sequences, duplicates, chimeras); trim by the Greengenes mask
[Figure legend: GEBA, Pre-GEBA, Greengenes] * Start from Haemophilus influenzae Rd KW20. ** In each group, the taxa are sorted by their PD contributions in descending order.
[Figure: axes: organism numbers, phylogenetic diversity; series: GEBA genomes, pre-GEBA genomes, organisms from the Greengenes database (excluding environmental samples)]
The slopes of the linear regression lines represent the PD contribution of the genomes (each window contains 50 genomes).
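The windowed slopes can be sketched as follows, assuming the input is a cumulative PD curve (PD after adding genome 1, 2, 3, ...) and that the windows are consecutive, non-overlapping blocks of 50 genomes; whether the original windows overlap is not stated. The slope is an ordinary least-squares fit, and the toy curve is invented.

```python
def regression_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def window_slopes(cumulative_pd, window=50):
    """Slope of the cumulative PD curve in each consecutive window of genomes."""
    slopes = []
    for start in range(0, len(cumulative_pd) - window + 1, window):
        ys = cumulative_pd[start:start + window]
        xs = list(range(start + 1, start + window + 1))  # genome ranks, 1-based
        slopes.append(regression_slope(xs, ys))
    return slopes


# Toy curve: 50 genomes adding 0.2 PD each, then 50 genomes adding 0.05 PD each.
curve = [0.2 * i for i in range(1, 51)] + [10.0 + 0.05 * i for i in range(1, 51)]
slopes = window_slopes(curve)
```

Each slope is the average PD gained per genome within its window, so a falling sequence of slopes reflects diminishing PD returns from the later genomes.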
Non-environmental Tree: Only the top 150 PD contributors out of 717 pre-GEBA genomes have an average PD contribution greater than the GEBA genomes. The genome sequencing efforts have covered only 11.5% of the phylogenetic diversity to date in this study. We can pick an additional 550 organisms and still have an average PD contribution greater than or equal to the 56 GEBA genomes. To increase PD coverage to 50%, we need to sequence at least 1520 more genomes.
All-representative Tree: Current genome sequences cover only 2.2% of the PD. We can pick an additional 4400 organisms and still have an average PD contribution greater than or equal to the 56 GEBA genomes. To cover 50% of the phylogenetic diversity, we have to sequence 9218 more genomes.
rbcL
rbcL active sites: catalytic, RuBP binding
Calvin cycle. Metabolites: Glycerate-3-P, P-glyceroyl-P, GAP, DHAP, Fructose-1,6-P, Fructose-6-P, Xylulose-P, Ribulose-5-P, Ribulose-1,5-P, CO2. Genes: rbcL, pgk, gap, tpiA, glpX, tktA, rpe.
Calvin Cycle (x = gene present; Roman numeral = rbcL form; x marks are listed in slide order, so a single x may belong to either column of its pair)
Organism | Phylum | rpe, prk | rbcL | rbcS, pgk
Thermomonospora_curvata_DSM_43183 | Actinobacteria | x x | I | x x
Meiothermus_silvanus_DSM_0994 | Deinococci | x x | I,IV | x x
Acidimicrobium_ferrooxidans | Actinobacteria | x x | I | x x
*Halogeometricum_borinquense_DSM_11551 | Halobacteria | x | III | x
Halomicrobium_mukohataei_DSM_12286 | Halobacteria | x | III | x
Alicyclobacillus_acidocaldarius_subsp | Firmicutes | x x | IV | x
Meiothermus_ruber_DSM_01279 | Deinococci | x x | IV | x
Nakamurella_multipartita_DSM_44233 | Actinobacteria | x x | IV |
Planctomyces_limnophilus_DSM_03776 | Bacteroidetes | x | IV | x
Rhodothermus_marinus_DSM_4252 | Bacteroidetes | x x | IV | x
Veillonella_parvula_DSM_02008 | Firmicutes | x | IV | x
Geodermatophilus_obscurus_DSM_43160 | Actinobacteria | x x | V | x
Pedobacter_heparinus_DSM_02366 | Bacteroidetes | x x | V | x
Dyadobacter_fermentans_DSM_18053 | Bacteroidetes | x x | V | x
* Finished genome