PANGENOMES
How Many Microbial Genes Are There In the World?
Nicholas P. Celms, James D. Nulton, Dr. Rob Edwards, Dr. Peter Salamon


THE CONCEPT
Imagine a shopper in a store, where each item is an independent probability test: the shopper either buys it or does not. Now imagine that the store organizes its aisles by the purchase probabilities of its items. The Pangenome concept is closely analogous: instead of items in a store, the focus is proteins in a genome, and instead of aisles, the proteins are grouped into pools. Each strain plays the role of a shopper. Figures 1 and 2 demonstrate this distributive process.

PANGENOME ANALYSIS
Pangenome analysis yields some immediate conclusions and some that are more subtle.
o Figure 2 shows clearly that roughly 3000 proteins are found in only one of the 45 strains of ESS. These are called "distinctive features". Almost 2000 proteins were found in all 45 strains; these form the "conserved" set.
o Summing the # Genes column of Figure 1 gives the predicted Pangenome size.
o With 12410 proteins in the Pangenome Matrix, the predicted Pangenome size for ESS is approximately 25006 genes.
o If 12410 genes are already sequenced and the predicted total is 25006, the predicted completeness of sequencing for ESS is 12410 / 25006 = 49.6%.

CLIQUES AND CLANS
See the glossary for term definitions. Since the proteins of a clique appear in unison across their clan, it is very likely that these genes are functionally related. One example of a functionally related clique is listed in Figure 6. All of these proteins are phage-related, indicating that these genes entered the strains by horizontal gene transfer. Analysis of clique results can help determine phylogenetic relationships, divergence events, and horizontally transferred genes. Some cliques are not statistically unlikely, while others show highly improbable clustering. A metric for statistical unlikelihood is currently under development; it will speed up deciding which cliques merit further investigation.

CONTACT INFORMATION
Nicholas Celms: nick.celms@gmail.com
Contact me if you'd like a copy of the paper or further information about the Pangenomes Project.

METHODS
The analysis process has two major parts, both of which rely on the Pangenome Matrix (see glossary).

Pangenome Methods
o The observed share spectrum is calculated (blue bars in Figure 2).
o The optimum number of pools is selected by Akaike Information Criterion scoring.
o The predicted share spectrum is calculated (red stars in Figure 2).
o The pool distribution predictions (see Figure 1) and probabilities result.
o The sum of # Genes gives the predicted total Pangenome size.

Clique and Clan Methods
o A protein's column in the Pangenome Matrix forms a binary string.
o The binary string has ones at the indices of the strains that have this protein.
o Proteins with identical binary strings form a clique.
o Cliques are identified and annotated using Perl and the NMPDR database.

Figure 3: For any given protein, the probability that it will be chosen by exactly k out of the n strains is given by this binomial expression.
Figure 4: This equation defines the expectations, based on the parameters of the model.

GLOSSARY
o Pangenome - the set of all distinct proteins found across all strains of an organism
o ESS - the combined dataset of Escherichia (22 strains), Shigella (8 strains), and Salmonella (15 strains)
o Pangenome Matrix - a matrix with the proteins of the Pangenome as columns and the strains of the organism as rows. The entry at index i,j is 1 if strain i has protein j, and 0 if it does not.
o The Pangenome Matrix for ESS is 45 strains x 12410 proteins.
o Clique - a set of proteins that occur in exactly the same strains
o Clan - the set of strains in which a given clique appears

POOL   # GENES      PROBABILITY
1      17578.296    0.008
2       2364.415    0.051
3       1178.608    0.139
4       1070.581    0.294
5        295.127    0.468
6        303.592    0.638
7        115.260    0.835
8        387.672    0.943
9       1712.322    0.997

Figure 1: The number of genes and the probability for each of the 9 pools for Escherichia, Shigella, and Salmonella.
Figure 2: The distributed share spectrum of ESS, which is the number of genes found in each given number of strains.
Figure 5: The number of cliques identified at each size of clan.
Figure 6: This table shows one of the cliques found in Escherichia, Shigella, and Salmonella. Its binary signature is: 0110011111010111011110110100100101101110110011

PROTEIN ANNOTATION
PEG: fig|344609.3.peg.988   FUNCTION: Phage minor tail protein #Phage minor tail protein T
PEG: fig|344609.3.peg.989   FUNCTION: Phage minor tail protein #Phage minor tail component
PEG: fig|344609.3.peg.992   FUNCTION: Phage minor tail protein #Phage minor tail protein U
PEG: fig|344609.3.peg.993   FUNCTION: Putative minor tail protein

FUNDING
Thanks to the National Science Foundation for funding the Undergraduate Bio Math Program at San Diego State University.
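
The size and completeness figures quoted under PANGENOME ANALYSIS can be re-derived from the Figure 1 pool table. The following is a minimal sketch of that arithmetic only; the names POOLS, OBSERVED_PROTEINS, predicted_pangenome_size, and sequencing_completeness are illustrative and do not come from the poster.

```python
# Pool table from Figure 1 as (number of genes, probability) pairs.
# The variable and function names here are illustrative assumptions.
POOLS = [
    (17578.296, 0.008),
    (2364.415, 0.051),
    (1178.608, 0.139),
    (1070.581, 0.294),
    (295.127, 0.468),
    (303.592, 0.638),
    (115.260, 0.835),
    (387.672, 0.943),
    (1712.322, 0.997),
]

# Number of proteins (columns) in the ESS Pangenome Matrix.
OBSERVED_PROTEINS = 12410


def predicted_pangenome_size(pools):
    """Sum the '# Genes' column to get the predicted Pangenome size."""
    return sum(size for size, _probability in pools)


def sequencing_completeness(observed, predicted):
    """Fraction of the predicted Pangenome that has already been observed."""
    return observed / predicted


predicted = predicted_pangenome_size(POOLS)
fraction = sequencing_completeness(OBSERVED_PROTEINS, predicted)
print(f"predicted Pangenome size: {predicted:.0f} genes")
print(f"predicted completeness:   {fraction:.1%}")
```

Rounded as in the text, this gives approximately 25006 genes and 49.6% completeness.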

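The share spectrum steps in Pangenome Methods can be sketched as follows. The equations of Figures 3 and 4 did not survive in this transcript, so this sketch assumes the standard binomial probability for Figure 3 and assumes that the expected spectrum is the pool-size-weighted sum of those binomials. That reading is consistent with the captions, but the poster does not confirm it. All function names and the toy matrix are hypothetical.

```python
from math import comb

# Pool table from Figure 1 as (number of genes, probability) pairs.
POOLS = [
    (17578.296, 0.008), (2364.415, 0.051), (1178.608, 0.139),
    (1070.581, 0.294), (295.127, 0.468), (303.592, 0.638),
    (115.260, 0.835), (387.672, 0.943), (1712.322, 0.997),
]
N_STRAINS = 45  # ESS: 22 Escherichia + 8 Shigella + 15 Salmonella


def binomial_pmf(k, n, p):
    """Probability that a protein with per-strain probability p
    is present in exactly k of n strains."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_share_spectrum(pools, n):
    """Expected number of genes found in exactly k strains, for k = 1..n.

    ASSUMPTION: each pool contributes (pool size) x (binomial probability),
    and the contributions are summed over the pools.
    """
    return [
        sum(size * binomial_pmf(k, n, p) for size, p in pools)
        for k in range(1, n + 1)
    ]


def observed_share_spectrum(matrix):
    """Count the proteins present in exactly k strains, for k = 1..n.

    `matrix` has strains as rows and proteins as columns (0/1 entries).
    """
    counts = [0] * len(matrix)
    for column in zip(*matrix):  # iterate over proteins
        k = sum(column)
        if k > 0:
            counts[k - 1] += 1
    return counts


# Hypothetical toy matrix: 4 strains (rows) x 5 proteins (columns).
toy_matrix = [
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
]
print(observed_share_spectrum(toy_matrix))

spectrum = expected_share_spectrum(POOLS, N_STRAINS)
print(f"expected genes seen in at least one strain: {sum(spectrum):.0f}")
```

Under this assumed model, the expected spectrum summed over k = 1..45 estimates how many distinct proteins 45 strains should reveal, which can be compared against the 12410 columns of the Pangenome Matrix.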

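The Clique and Clan Methods amount to grouping the columns of the Pangenome Matrix by their binary strings. The poster states that the original work used Perl and the NMPDR database; the sketch below is not that implementation. It is a small Python illustration on a hypothetical 4-strain, 5-protein matrix, with made-up strain and protein identifiers.

```python
from collections import defaultdict


def find_cliques(matrix, protein_ids):
    """Group proteins by the binary string of their matrix column.

    `matrix` has strains as rows and proteins as columns (0/1 entries).
    Returns a dict mapping each binary signature to the list of proteins
    that share it. Proteins with identical signatures form a clique.
    """
    groups = defaultdict(list)
    for j, protein in enumerate(protein_ids):
        signature = "".join(str(row[j]) for row in matrix)
        groups[signature].append(protein)
    return dict(groups)


def clan_of(signature, strain_ids):
    """Return the strains (the clan) marked with '1' in a signature."""
    return [strain for strain, bit in zip(strain_ids, signature) if bit == "1"]


# Hypothetical toy data; identifiers are placeholders, not real strains.
strain_ids = ["S1", "S2", "S3", "S4"]
protein_ids = ["pA", "pB", "pC", "pD", "pE"]
toy_matrix = [
    [1, 1, 0, 1, 0],  # S1
    [1, 1, 0, 0, 0],  # S2
    [1, 0, 1, 1, 1],  # S3
    [1, 1, 0, 0, 0],  # S4
]

cliques = find_cliques(toy_matrix, protein_ids)
for signature, members in cliques.items():
    if len(members) > 1:  # report only signatures shared by several proteins
        print(signature, members, "clan:", clan_of(signature, strain_ids))
```

In the toy data, pC and pE share the signature 0010, so they form a clique whose clan is the single strain S3. Whether a single protein counts as a clique is left open here; the filter on `len(members) > 1` is a design choice of this sketch.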