The Original Question:

What is the best way to select genomes to sequence so that we maximize gene discovery given our limited resources?
Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R. Curr Opin Genet Dev Dec;15(6): Epub 2005 Sep 26

GEBA figure 2

Next step: if we draw a gene discovery curve for every available taxonomic level (node):
1. What is the relationship between the number of gene families (gene space) and the genome number or the phylogenetic diversity (PD) at each taxonomic level?
2. Can we estimate the gene space of all microbes on the planet?

We need to:
1. Build gene families for all available genomes (Guillaume Jospin).
2. Build 16S trees for all greengenes 16S sequences and get the subtree for the sequenced genomes.
IMG: 1804 genomes with 6,529,377 genes; 281,549 families and 939,289 singletons.

Tree Building for greengenes 16S Sequences
1. Start from 510,111 sequences from greengenes.
2. Reduce redundancy (99% identity cutoff over a 95% length span), leaving 162,365 sequences.
3. Build the tree with FastTree.

Genomes: 1804 (9 genomes are excluded because either their 16S sequences cannot be found in greengenes or they are incomplete genomes with <500 genes).
Tree: the pruned 16S tree, which includes only the 1804 IMG organisms.

Figure: gene families (excluding singletons) sampled from the 1804 genomes. Axes: family number vs. genome number.

For each node of the 16S tree:
1. Let N be the number of genomes under the node.
2. Pick sample sizes X at an even interval from 3 to N [interval = (N-3)/200]. For example, if a node has 600 genomes, I sample the sizes 3, 6, 9, ..., 597, 600.
3. For each X, randomly pick X genomes from the pool of N genomes and calculate the PD of the subtree and the gene family number. Repeat this step 100 times to get the average PD and family number for each X.
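The per-node sampling can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used for the slides: `genome_families` (a dict mapping each genome name to its set of family IDs) is a hypothetical input, all function names are illustrative, and the PD calculation is left out because it needs the tree.

```python
import random

def sample_sizes(n, steps=200, start=3):
    """Evenly spaced sample sizes from `start` to n, interval = (n - start) / steps."""
    if n <= start:
        return [n]
    interval = (n - start) / steps
    # Rounding can map several steps to the same size on small nodes, so deduplicate.
    sizes = {round(start + i * interval) for i in range(steps + 1)}
    return sorted(sizes)

def mean_family_count(genome_families, x, repeats=100, rng=random):
    """Average number of distinct families over `repeats` random draws of x genomes."""
    genomes = list(genome_families)
    total = 0
    for _ in range(repeats):
        picked = rng.sample(genomes, x)
        total += len(set().union(*(genome_families[g] for g in picked)))
    return total / repeats

def discovery_curve(genome_families, repeats=100, rng=random):
    """List of (X, average family number) points for one node."""
    n = len(genome_families)
    return [(x, mean_family_count(genome_families, x, repeats, rng))
            for x in sample_sizes(n)]
```

Calling `discovery_curve` on the genomes under one node gives the points of that node's family number vs. genome number curve; averaging the PD per X would follow the same loop.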

Figure axes: family number vs. genome number.

Figure label: family/genome.

Random sampling protocol
1. Pick a random number N ( ).
2. Select N genomes from the 16S tree and keep only their subtree.
3. Pick sample sizes X at an even interval from 3 to N [interval = (N-3)/200].
4. For each X, randomly pick X genomes from the pool of N genomes and calculate the PD of the subtree and the gene family number. Repeat this step 100 times to get the average PD and family number for each X.
5. Repeat the above steps 1000 times.
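The sampling protocols rely on the PD of the subtree spanned by the picked genomes. One common reading of this (Faith's PD restricted to the minimal subtree connecting the tips) can be sketched as below. The tree representation is an assumption: `parent` is a hypothetical dict mapping each non-root node to a `(parent_node, branch_length)` pair, and the slides do not say whether the branch leading up to the root was included.

```python
def subtree_pd(parent, tips):
    """Sum of branch lengths in the minimal subtree connecting `tips`.

    `parent` maps each non-root node to (parent_node, branch_length).
    Branches above the most recent common ancestor are excluded.
    """
    tips = set(tips)
    visits = {}  # node -> number of selected tips whose path to the root uses its branch
    for tip in tips:
        node = tip
        while node in parent:
            visits[node] = visits.get(node, 0) + 1
            node = parent[node][0]
    # A branch used by every tip lies above the common ancestor, so skip it.
    return sum(parent[n][1] for n, count in visits.items() if count < len(tips))

# Toy tree: root -> X (1.0), root -> C (2.0), X -> A (0.5), X -> B (0.25)
toy_tree = {"X": ("root", 1.0), "C": ("root", 2.0),
            "A": ("X", 0.5), "B": ("X", 0.25)}
print(subtree_pd(toy_tree, ["A", "B"]))  # 0.75
```

With a single tip the function returns 0, since no branch separates a genome from itself.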

Figure: 1000 random genomes from the 1804 genomes. Axes: family number vs. genome number.

117.32 families/genome x 162,365 = 19,048,661 families
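The extrapolation is a plain linear scaling of the fitted slope by the number of non-redundant 16S sequences. A short check of the arithmetic, with an illustrative helper name that is not from the slides:

```python
def extrapolate(slope, total):
    """Linear extrapolation: slope (families per unit) times the total number of units."""
    return slope * total

# 117.32 families per genome, 162,365 non-redundant greengenes 16S sequences
estimate = extrapolate(117.32, 162_365)
print(f"{estimate:,.1f} families")
```

The product is about 19.05 million families. It assumes the per-genome slope stays constant all the way out to 162,365 genomes.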

Figure axes: PD vs. genome number; family number vs. genome number.

Figure axes: family number vs. PD.

Figure axes: family/PD vs. PD.

Figure: 1000 random genomes from the 1804 genomes. Axes: family number vs. PD.

family/PD x 3644 = 12,340,260 families
Figure axes: family/PD vs. PD.

Total PD of the greengenes 16S: 3644.21
According to the random sample curve of the root node:
When PD is 225, the slope (family/PD) reaches 0 (regression 0.75).
When PD increases, the family space stays constant ???

Singletons: To count or not to count?
281,549 families, 939,289 singletons
Not to count: many singletons are the result of false gene predictions.
To count: the singletons here are the result of a single E-value cutoff (not true singletons).

Only sample 972 finished genomes and count all the singletons

Figure axes: PD vs. genome number.

Figure axes: family number vs. PD.

Figure axes: family number/genome vs. total PD of subtree.

Figure axes: family number/PD vs. total PD of subtree.

Family comparison between Genome A and B
Venn diagram: families unique to A (UA), overlap, families unique to B (UB).
Family Distance = (UA + UB) / (UA + UB + Overlap)
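The family distance defined here is the Jaccard distance between the two genomes' family sets. A minimal sketch, assuming each genome is given as a collection of family IDs (the function name and input format are illustrative):

```python
def family_distance(families_a, families_b):
    """(UA + UB) / (UA + UB + Overlap) for two collections of family IDs."""
    a, b = set(families_a), set(families_b)
    unique_a = len(a - b)   # UA: families found only in genome A
    unique_b = len(b - a)   # UB: families found only in genome B
    overlap = len(a & b)    # families shared by both genomes
    total = unique_a + unique_b + overlap
    if total == 0:
        return 0.0          # two empty genomes: treat as identical
    return (unique_a + unique_b) / total

print(family_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```

The distance is 0 for identical family sets and 1 for genomes that share no families.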

Figure axes: family distance vs. PD between two genomes.

Family Contribution of ONE genome vs its PD contribution
1. Randomly select 100 genomes from the tree.
2. Randomly select 1 genome from the tree and calculate its family and PD contribution; do this 800 times.
3. Divide the 800 data points into categories according to PD contribution: [0-0.1) [ ) [ ) ....
4. For each category, calculate the average family contribution and the standard deviation.
5. Sample 100 times.
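The contribution and binning steps can be sketched as below. This is an illustration under assumptions: family sets are plain Python sets, the bin width of 0.1 is taken from the first category shown, the population standard deviation is used because the slides do not say which form was applied, and all names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

def family_contribution(pool, genome):
    """Number of families in `genome` that no genome in `pool` already has."""
    seen = set().union(*pool)
    return len(set(genome) - seen)

def bin_contributions(points, width=0.1):
    """Group (pd_contribution, family_contribution) points by PD contribution.

    Returns {bin_index: (mean, std)}, where bin k covers
    [k * width, (k + 1) * width). Values that sit exactly on a bin edge may
    land in the lower bin because of floating-point division.
    """
    bins = defaultdict(list)
    for pd_gain, family_gain in points:
        bins[math.floor(pd_gain / width)].append(family_gain)
    return {k: (mean(v), pstdev(v)) for k, v in sorted(bins.items())}

points = [(0.05, 100), (0.07, 200), (0.25, 400)]
print(bin_contributions(points))
```

Each of the 800 draws would supply one `(pd_contribution, family_contribution)` point, and the whole procedure would be repeated for the 100 samples.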

Figure axes: family contribution of one additional genome vs. PD contribution of one additional genome.

Family Contribution of ONE genome vs its PD contribution
Calculate one genome's contribution to a pool of genomes(10, )

Genome Number   Family Number   PD
10              18595±2551      3.74±0.63
20              32552±2579      6.08±0.80
50              67454±5068      11.20±0.85
100             116192±5768     17.17±0.98
200             197728±6778     25.30±1.11
500             393797±7616     41.07±1.05
Figure axes: family contribution of one additional genome vs. PD contribution of one additional genome.

Novel families contributed by one genome are related to its PD contribution (non-linear).
Novel families contributed by one genome are related to the PD and families of the existing genome pool.
Random samples from a genome pool can predict the PD and gene families of the pool (a Chao-estimator-like approach can be applied).
Current genome sequencing data cannot accurately predict the microbial protein space:
1. It is not a random sample.
2. It is at an early stage, which can result in over-estimation even if we have a random sample (the linear stage of the Chao estimator).
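As an illustration of what a Chao-like estimate looks like, the sketch below applies the bias-corrected Chao1 formula, S_obs + F1(F1 - 1) / (2(F2 + 1)), to family occurrence counts across genomes. This is an assumption about the variant: the slides do not say whether Chao1, Chao2, or another form was intended, and the input format (one set of family IDs per genome) is hypothetical.

```python
from collections import Counter

def chao1(genome_families):
    """Bias-corrected Chao1 richness estimate from per-genome family sets.

    F1 and F2 are the numbers of families seen in exactly one and exactly
    two genomes, respectively.
    """
    occurrences = Counter(f for fams in genome_families for f in set(fams))
    s_obs = len(occurrences)
    f1 = sum(1 for c in occurrences.values() if c == 1)
    f2 = sum(1 for c in occurrences.values() if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

genomes = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
print(chao1(genomes))  # 6.5
```

In the toy example five families are observed, three of them in only one genome and one in exactly two, so the estimate adds 1.5 unseen families.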

Figure: greengenes 16S rRNA tree. Axes: total PD vs. genome number.

Gene Family Building for All the Sequenced Genomes from IMG
1. Protein-encoding genes from all the genomes
2. All-vs-all BLASTP (E-value 1e-10, covering 80% of both query and hit)
3. Links between genes
4. MCL clustering
5. Gene families
IMG: 1813 genomes with 6,530,517 genes. BLAST and MCL clustering would overwhelm our computing resources.
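The link-filtering step between BLASTP and MCL can be sketched as a simple predicate on each hit. This is a sketch under assumptions: alignment coordinates are 1-based and inclusive, sequence lengths are available for both query and hit, and the thresholds are applied as "at most 1e-10" and "at least 80%". The function names are illustrative, and the clustering itself is not shown.

```python
def coverage(start, end, length):
    """Fraction of a sequence covered by an alignment (1-based, inclusive coordinates)."""
    return (abs(end - start) + 1) / length

def keep_link(evalue, qstart, qend, qlen, sstart, send, slen,
              max_evalue=1e-10, min_cov=0.80):
    """True if a hit passes the E-value cutoff and covers enough of both sequences."""
    return (evalue <= max_evalue
            and coverage(qstart, qend, qlen) >= min_cov
            and coverage(sstart, send, slen) >= min_cov)

print(keep_link(1e-20, 1, 90, 100, 5, 95, 100))  # True
```

Hits that pass the filter become the links handed to the clustering step.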

Identify Universally Distributed Families
100 representative genomes (85 bacterial and 15 archaeal genomes)
313,139 protein-encoding genes
28,710,015 links
17,232 gene families
Universality is the percentage of genomes that a family covers.
Figure: links, genes, and families (percentage) at different universality cutoffs.
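Universality as defined here can be computed directly from family membership. A minimal sketch, assuming hypothetical inputs `families` (family ID to list of gene IDs) and `gene_to_genome` (gene ID to genome name); the 70% default mirrors the cutoff quoted for the 502 families:

```python
def universality(members, gene_to_genome, n_genomes):
    """Percentage of the n_genomes genomes that have at least one gene in the family."""
    genomes = {gene_to_genome[g] for g in members}
    return 100.0 * len(genomes) / n_genomes

def universal_families(families, gene_to_genome, n_genomes, cutoff=70.0):
    """IDs of families whose universality is at least `cutoff` percent."""
    return [fid for fid, members in families.items()
            if universality(members, gene_to_genome, n_genomes) >= cutoff]

families = {"f1": ["a1", "b1", "c1"], "f2": ["a2"]}
gene_to_genome = {"a1": "A", "b1": "B", "c1": "C", "a2": "A"}
print(universal_families(families, gene_to_genome, n_genomes=4))  # ['f1']
```

In the toy data, `f1` covers 3 of 4 genomes (75%) and passes the cutoff, while `f2` covers 1 of 4 (25%) and does not.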

Guillaume Jospin from Eisen's Lab
502 families with universality >= 70 (70%)
502 HMM profiles
Screen and build the 502 families for the 1813 IMG dataset
Gene family building pipeline
1,219,828 families (281,562 if singletons are excluded)
