1
The Original Question:
What is the best way to select genomes for sequencing so that we maximize gene discovery with our limited resources?
Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R. Curr Opin Genet Dev Dec;15(6): Epub 2005 Sep 26
2
GEBA figure 2
3
Next step: if we draw a gene discovery curve for every available taxonomy level (node):
1. What is the relationship between the number of gene families (gene space) and the genome number or phylogenetic diversity (PD) at that taxonomy level?
2. Can we estimate the gene space of all microbes on the planet?
5
We need:
1. Build gene families for all available genomes (Guillaume Jospin)
2. Build 16S trees for all greengenes 16S sequences and extract the subtree for the sequenced genomes
IMG: 1804 genomes with 6,529,377 genes; 281,549 families and 939,289 singletons
6
Tree Building for greengenes 16S Sequences
1. Start from 510,111 sequences from greengenes
2. Reduce redundancy (99% identity cutoff over 95% length span), leaving 162,365 sequences
3. Build the tree with FastTree
7
Genomes: 1804 (9 genomes are excluded because either their 16S sequences cannot be found in greengenes or they are incomplete genomes with <500 genes)
Tree: pruned 16S tree that includes only the 1804 IMG organisms
9
Sample gene families (excluding singletons) from the 1804 genomes
[Figure: family number vs. genome number]
10
For each node from the 16S tree
1. The genome number for a given node is N
2. Pick sample sizes X at an even interval from 3 to N [interval=(N-3)/200]. For example, if a node has 600 genomes, I sample the sizes (3, 6, 9, ..., 597, 600)
3. For each X, randomly pick X genomes from the pool of N genomes and calculate the PD of the subtree and the gene family number (repeat this step 100 times to get the average PD and family number for each X)
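The per-node sampling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the slides: the tree is a hypothetical pair of dictionaries (`parent`, `length`) keyed by node name, PD is assumed to be the sum of branch lengths from the sampled genomes up to the root of the node, and the interval is rounded to a whole number of genomes.

```python
import random

def subtree_pd(leaves, parent, length):
    """Sum branch lengths on the paths from the chosen leaves to the root.
    Assumption: PD includes the path up to the root of the node."""
    seen, total = set(), 0.0
    for leaf in leaves:
        node = leaf
        # Walk upward until we hit the root or a branch already counted.
        while node in parent and node not in seen:
            seen.add(node)
            total += length[node]
            node = parent[node]
    return total

def family_count(genomes, families):
    """Number of distinct gene families present in the chosen genomes."""
    return len(set().union(*(families[g] for g in genomes)))

def sample_sizes(n, steps=200, start=3):
    """Sample sizes from 3 to N at an interval of about (N-3)/200."""
    interval = max(1, round((n - start) / steps))
    return list(range(start, n + 1, interval))

def rarefy(genomes, parent, length, families, repeats=100, rng=None):
    """Average PD and family number for each sample size X."""
    rng = rng or random.Random(0)
    curve = []
    for x in sample_sizes(len(genomes)):
        pd_sum = fam_sum = 0.0
        for _ in range(repeats):
            picked = rng.sample(genomes, x)
            pd_sum += subtree_pd(picked, parent, length)
            fam_sum += family_count(picked, families)
        curve.append((x, pd_sum / repeats, fam_sum / repeats))
    return curve
```

With N = 600 the helper `sample_sizes` yields 3, 6, 9, ..., 600, matching the example on the slide.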
11
[Figure: family number vs. genome number]
12
[Figure: family number vs. genome number]
13
[Figure: family/genome]
14
Random sampling protocol
1. Pick a random number N ( )
2. Select N genomes from the 16S tree and keep only their subtree
3. Pick sample sizes X at an even interval from 3 to N [interval=(N-3)/200]
4. For each X, randomly pick X genomes from the pool of N genomes and calculate the PD of the subtree and the gene family number (repeat this step 100 times to get the average PD and family number for each X)
5. Repeat the above steps 1000 times
15
1000 random genomes from the 1804 genomes
[Figure: family number vs. genome number]
16
117.32 families/genome × 162,365 = 19,048,661 families
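The extrapolation on this slide is a plain linear scaling, which can be checked with a short snippet. The function name `extrapolate_families` is made up for illustration, and the calculation assumes the discovery rate stays constant across all 162,365 sequences, which is the assumption the later slides question.

```python
def extrapolate_families(families_per_unit, total_units):
    """Linear extrapolation: a constant discovery rate times the total size."""
    return families_per_unit * total_units

# Figures taken from the slide: 117.32 families per genome, 162,365 sequences.
estimate = extrapolate_families(117.32, 162_365)
print(f"{estimate:,.1f} families")
```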
17
[Figure: PD vs. genome number; family number vs. genome number]
18
[Figure: family number vs. PD]
19
[Figure: family/PD vs. PD]
20
1000 random genomes from the 1804 genomes
[Figure: family number vs. PD]
21
family/PD × 3644 = 12,340,260 families
[Figure: family/PD vs. PD]
22
Total PD of the greengenes 16S: 3644.21
According to the random-sample curve of the root node, the slope (family/PD) reaches 0 when PD is 225 (regression 0.75).
When PD increases beyond that, does family space stay constant???
23
Singletons: To count or not to count?
281,549 families, 939,289 singletons
Not to count: many singletons are the result of false gene predictions
To count: the singletons here are the result of a single E-value cutoff (not true singletons)
24
Sample only the 972 finished genomes and count all the singletons
[Figure: family number vs. genome number]
26
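[Figure: PD vs. genome number]
27
[Figure: family number vs. PD]
28
[Figure: family number vs. genome number]
29
[Figure: PD vs. genome number]
30
[Figure: family number vs. PD]
31
[Figure: family number/genome vs. total PD of subtree]
32
[Figure: family number/PD vs. total PD of subtree]
33
[Figure: family number/PD]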
34
Family comparison between Genome A and B
UA = families unique to genome A, UB = families unique to genome B, Overlap = families shared by A and B
Family Distance = (UA+UB)/(UA+UB+Overlap)
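The family distance above is one minus the Jaccard similarity of the two family sets. A minimal sketch, assuming each genome is represented as a set of family identifiers (the function name and the handling of two empty genomes are illustrative choices, not taken from the slides):

```python
def family_distance(families_a, families_b):
    """Family Distance = (UA + UB) / (UA + UB + Overlap)."""
    a, b = set(families_a), set(families_b)
    unique_a = len(a - b)   # UA: families found only in genome A
    unique_b = len(b - a)   # UB: families found only in genome B
    overlap = len(a & b)    # families shared by both genomes
    total = unique_a + unique_b + overlap
    if total == 0:
        return 0.0          # assumption: two empty genomes count as identical
    return (unique_a + unique_b) / total
```

Identical genomes give a distance of 0 and genomes with no shared family give 1, so the value can be plotted directly against the PD between the two genomes.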
35
[Figure: family distance vs. PD between two genomes]
36
Family Contribution of ONE genome vs its PD contribution
1. Randomly select 100 genomes from the tree
2. Randomly select 1 genome from the tree and calculate its family and PD contribution; do this 800 times
3. Divide the 800 data points into categories according to PD contribution: [0-0.1) [ ) [ ) ....
4. For each category, calculate the average family contribution and the standard deviation
5. Sample 100 times
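The binning step of this protocol could look like the sketch below. It assumes the data points are already available as (PD contribution, family contribution) pairs and that every category is 0.1 wide, which is inferred only from the first category [0-0.1); the population standard deviation is used, and the function name is hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

def bin_contributions(points, width=0.1):
    """Group (pd_gain, family_gain) pairs into half-open PD bins and
    return the mean and standard deviation of family_gain per bin."""
    bins = defaultdict(list)
    for pd_gain, fam_gain in points:
        bins[math.floor(pd_gain / width)].append(fam_gain)
    return {
        (round(k * width, 10), round((k + 1) * width, 10)): (mean(v), pstdev(v))
        for k, v in sorted(bins.items())
    }
```

Values that fall exactly on a bin edge are subject to floating-point rounding, so a real analysis may want integer-scaled PD values or a small tolerance.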
37
[Figure: family contribution vs. PD contribution of one additional genome]
38
[Figure: family contribution vs. PD contribution of one additional genome]
39
Family Contribution of ONE genome vs its PD contribution
Calculate one genome's contribution to a pool of genomes(10, )
40
Genome Number   Family Number   PD
10              18595±2551      3.74±0.63
20              32552±2579      6.08±0.80
50              67454±5068      11.20±0.85
100             116192±5768     17.17±0.98
200             197728±6778     25.30±1.11
500             393797±7616     41.07±1.05
[Figure: family contribution vs. PD contribution of one additional genome]
41
Novel families contributed by one genome are related to its PD contribution (non-linear)
Novel families contributed by one genome are related to the PD and families of the existing genome pool
Random samples from a genome pool can predict the PD and gene families of the pool (a Chao-estimator-like approach can be applied)
Current genome sequencing data cannot accurately predict the microbial protein space:
1. It is not a random sample
2. It is at an early stage, which can result in over-estimation even if we have a random sample (linear stage of the Chao estimator)
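The slides do not say which Chao-like estimator is meant. As one possible reading, the sketch below applies the incidence-based Chao2 estimator, treating each genome as a sampling unit and each gene family as a "species"; the function name and input layout are assumptions made for illustration.

```python
from collections import Counter

def chao2(genome_families):
    """Chao2 richness estimate from a list of per-genome family sets.
    S = S_obs + ((m-1)/m) * Q1^2 / (2*Q2), where Q1 and Q2 are the numbers
    of families seen in exactly one and exactly two genomes."""
    m = len(genome_families)
    incidence = Counter(f for fams in genome_families for f in set(fams))
    s_obs = len(incidence)
    q1 = sum(1 for c in incidence.values() if c == 1)
    q2 = sum(1 for c in incidence.values() if c == 2)
    scale = (m - 1) / m
    if q2 > 0:
        return s_obs + scale * q1 * q1 / (2 * q2)
    # Bias-corrected form, used when no family occurs in exactly two genomes.
    return s_obs + scale * q1 * (q1 - 1) / 2
```

Because the estimate is driven by Q1 and Q2, it depends heavily on whether singletons are counted and on the sample being random, which are the two caveats raised above.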
42
Greengenes 16S rRNA tree
[Figure: total PD vs. genome number]
44
Gene Family Building for All the Sequenced Genomes from IMG
1. Protein-encoding genes from all the genomes
2. All-vs-all BLASTP (E-value 1e-10, covering 80% of both query and hit)
3. Links between genes
4. MCL clustering
5. Gene families
IMG: 1813 genomes with 6,530,517 genes
BLAST and MCL clustering on this dataset would overwhelm our computing resources
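The link-building step between BLASTP and MCL can be sketched as a filter over parsed hits. The record layout (dictionaries with `query`, `subject`, `evalue`, start, end, and length fields) is hypothetical, and coverage is computed from a single alignment span, which is a simplification; clustering the resulting links is left to MCL and is not shown.

```python
def coverage(start, end, length):
    """Fraction of a sequence covered by one alignment span (1-based, inclusive)."""
    return (abs(end - start) + 1) / length

def keep_link(hit, max_evalue=1e-10, min_cov=0.8):
    """Apply the slide's thresholds: E-value 1e-10 and 80% of both sequences."""
    return (
        hit["evalue"] <= max_evalue
        and coverage(hit["q_start"], hit["q_end"], hit["q_len"]) >= min_cov
        and coverage(hit["s_start"], hit["s_end"], hit["s_len"]) >= min_cov
    )

def build_links(hits):
    """Return the set of undirected gene-gene links that pass the filter."""
    links = set()
    for hit in hits:
        if hit["query"] != hit["subject"] and keep_link(hit):
            links.add(tuple(sorted((hit["query"], hit["subject"]))))
    return links
```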
45
Identify Universally Distributed Families
100 representative genomes (85 bacterial and 15 archaeal genomes)
313,139 protein-encoding genes
28,710,015 links
17,232 gene families
Universality is the percentage of genomes that a family covers
[Figure: links, genes, and families (percentage) vs. universality cutoff]
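Universality as defined above can be computed directly from family membership. In this sketch the inputs are hypothetical: `family_members` maps a family to its gene identifiers and `genome_of` maps a gene to its genome; the default cutoff of 70 mirrors the threshold the deck applies when it selects its 502 families.

```python
def universality(family_members, genome_of, n_genomes):
    """Percent of genomes that contain at least one member of each family."""
    return {
        fam: 100.0 * len({genome_of[g] for g in genes}) / n_genomes
        for fam, genes in family_members.items()
    }

def universal_families(family_members, genome_of, n_genomes, cutoff=70.0):
    """Families whose universality meets or exceeds the cutoff."""
    scores = universality(family_members, genome_of, n_genomes)
    return sorted(f for f, u in scores.items() if u >= cutoff)
```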
46
Guillaume Jospin from Eisen's Lab
502 families with universality >= 70 (70%)
502 HMM profiles
Screen and build the 502 families for the 1813-genome IMG dataset
Gene family building pipeline
1,219,828 families (281,562 if singletons are excluded)