The Original Question:


The Original Question: What is the best way to select genomes for sequencing so that we maximize gene discovery with our limited resources? Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R. Curr Opin Genet Dev. 2005 Dec;15(6):589-94. Epub 2005 Sep 26.

GEBA figure 2

Next Step: If we draw gene discovery curves for all the available taxonomy levels (nodes):
1. What is the relationship between the number of gene families (gene space) and the genome number or the phylogenetic diversity (PD) at each taxonomy level?
2. Can we estimate the gene space of all microbes on the planet?


We need:
1. Build gene families for all available genomes (Guillaume Jospin)
2. Build 16S trees for all greengenes 16S sequences and get the subtree for the sequenced genomes
IMG: 1804 genomes with 6,529,377 genes; 281,549 families, 939,289 singletons

Tree Building for greengenes 16S Sequences
1. Start with 510,111 sequences from greengenes
2. Reduce redundancy (99% identity cutoff over 95% length span): 162,365 sequences
3. Build the tree with FastTree

Genomes: 1804. Tree: 9 genomes are excluded (their 16S sequences cannot be found in greengenes; incomplete genomes with <500 genes). Tree: the pruned 16S tree only includes the 1804 IMG organisms.


Sample gene families (excluding singletons) from the 1804 genomes. Figure axes: family number, genome number.

For each node of the 16S tree:
1. Let N be the number of genomes under the node.
2. Pick numbers X at an even interval from 3 to N [interval=(N-3)/200]. For example, if a node has 600 genomes, I sample the following numbers of genomes: (3, 6, 9, ..., 597, 600).
3. For each X, randomly pick X genomes from the pool of N genomes, and calculate the PD of the subtree and the gene family number. Repeat this step 100 times to get the average PD and family number for each X.
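The evenly spaced sample sizes can be sketched as below. This is only an illustration: the function name `sample_sizes` and the rounding to whole genomes are assumptions, since the slide gives only the interval formula (N-3)/200 and one example.

```python
def sample_sizes(n_genomes, start=3, steps=200):
    """Return genome counts spaced evenly from `start` to `n_genomes`.

    The interval is (n_genomes - start) / steps, as on the slide.
    Rounding to integers and dropping repeats are assumptions.
    """
    if n_genomes <= start:
        return [n_genomes]
    interval = (n_genomes - start) / steps
    sizes = {round(start + i * interval) for i in range(steps + 1)}
    return sorted(sizes)


sizes_600 = sample_sizes(600)
print(sizes_600[:4], sizes_600[-1])
```

For N = 600 the interval is 2.985, so the sizes follow the slide's (3, 6, 9, ...) pattern only approximately; the slide's example appears to round the interval to 3.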

Figure axes: family number, genome number

Figure axes: family number, genome number

Figure: family/genome

Random sampling protocol
1. Pick a random number N (10-1804).
2. Select N genomes from the 16S tree, and keep only the subtree.
3. Pick numbers X at an even interval from 3 to N [interval=(N-3)/200].
4. For each X, randomly pick X genomes from the pool of N genomes, and calculate the PD of the subtree and the gene family number. Repeat this step 100 times to get the average PD and family number for each X.
5. Repeat the above steps 1000 times.
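A minimal sketch of the inner sampling loop on a toy tree follows. Everything in it is illustrative: the tree, the family sets, and the names `phylogenetic_diversity` and `rarefy` are made up, and PD is assumed here to be the total branch length on the union of root-to-tip paths, which the slides do not spell out.

```python
import random

# Toy rooted tree: node -> (parent, branch length). Hypothetical data.
PARENT = {
    "A": ("n1", 0.5), "B": ("n1", 0.5),
    "C": ("n2", 1.0), "D": ("n2", 0.25),
    "n1": ("root", 1.0), "n2": ("root", 2.0),
}
# Toy gene family content per genome. Hypothetical data.
FAMILIES = {"A": {1, 2}, "B": {2, 3}, "C": {4}, "D": {1, 4, 5}}


def phylogenetic_diversity(tips, parent):
    """Sum branch lengths on the union of root-to-tip paths (assumed PD)."""
    seen = set()
    total = 0.0
    for node in tips:
        # Walk up until the root or an already counted branch is reached.
        while node in parent and node not in seen:
            seen.add(node)
            node, length = parent[node][0], parent[node][1]
            total += length
    return total


def rarefy(genomes, parent, families, sizes, repeats=100, rng=None):
    """Average PD and family count over `repeats` random draws per size."""
    rng = rng or random.Random(0)
    curve = []
    for x in sizes:
        pd_sum = 0.0
        family_sum = 0.0
        for _ in range(repeats):
            picked = rng.sample(genomes, x)
            pd_sum += phylogenetic_diversity(picked, parent)
            family_sum += len(set().union(*(families[g] for g in picked)))
        curve.append((x, pd_sum / repeats, family_sum / repeats))
    return curve


curve = rarefy(list(FAMILIES), PARENT, FAMILIES, sizes=[2, 3, 4])
for x, mean_pd, mean_families in curve:
    print(x, round(mean_pd, 2), round(mean_families, 2))
```

The outer steps of the protocol (drawing a random N and repeating 1000 times) would wrap `rarefy` in another loop over random genome pools.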

1000 random genomes from the 1804 genomes. Figure axes: family number, genome number.

117.32 families/genome × 162,365 non-redundant greengenes 16S sequences ≈ 19,048,661 families

Figure axes: PD, genome number; family number, genome number

Figure axes: family number, PD

Figure axes: family/PD, PD

1000 random genomes from the 1804 genomes. Figure axes: family number, PD.

3386.46 families/PD × 3644 = 12,340,260 families. Figure axes: family/PD, PD.
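The two linear extrapolations (by genome number and by PD) are simple products of a fitted slope and a total. A small check of the arithmetic, using only the numbers quoted on the slides; the function name `extrapolate` is made up for this sketch:

```python
def extrapolate(slope, total_units):
    """Linear extrapolation: families per unit times the number of units."""
    return slope * total_units


# Slope per genome times the non-redundant greengenes 16S sequence count.
families_from_genomes = extrapolate(117.32, 162_365)
# Slope per unit PD times the (rounded) total PD of the greengenes tree.
families_from_pd = extrapolate(3386.46, 3644)

print(f"{families_from_genomes:,.0f}")
print(f"{families_from_pd:,.0f}")
```

The two estimates come out at roughly 19.05 million and 12.34 million families, so they differ by about a third depending on whether genome count or PD is used as the scaling variable.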

Total PD of the greengenes 16S tree: 3644.21. According to the random-sample curve of the root node, when PD is 225, the slope (family/PD) reaches 0 (regression 0.75). When PD increases, the family space stays constant???

Singletons: To count or not to count?
281,549 families, 939,289 singletons
Not to count: Many singletons are the result of false gene predictions
To count: The singletons here are the result of a single E-value cutoff (not true singletons)

Only sample the 972 finished genomes and count all the singletons. Figure axes: family number, genome number.


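Figure axes: PD, genome number

Figure axes: family number, PD

Figure axes: family number, genome number

Figure axes: PD, genome number

Figure axes: family number, PD

Figure axes: family number/genome, total PD of subtree

Figure axes: family number/PD, total PD of subtree

Figure: family number/PD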

Family comparison between Genome A and B
UA: families unique to A; UB: families unique to B; Overlap: families shared by A and B
Family Distance = (UA+UB)/(UA+UB+Overlap)
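The family distance can be computed directly from two sets of family identifiers. The sketch below assumes each genome is represented as a set of family IDs; the function name and the choice to return 0.0 for two empty genomes are assumptions. Written this way, the formula is the Jaccard distance between the two sets.

```python
def family_distance(families_a, families_b):
    """(UA + UB) / (UA + UB + Overlap) for two sets of family IDs."""
    a, b = set(families_a), set(families_b)
    unique_a = len(a - b)   # families found only in genome A
    unique_b = len(b - a)   # families found only in genome B
    overlap = len(a & b)    # families shared by both genomes
    total = unique_a + unique_b + overlap
    if total == 0:
        return 0.0          # assumption: two empty genomes count as identical
    return (unique_a + unique_b) / total


print(family_distance({1, 2, 3}, {2, 3, 4}))  # UA=1, UB=1, Overlap=2 -> 0.5
```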

Figure axes: family distance, PD between two genomes

Family Contribution of ONE genome vs its PD contribution
1. Randomly select 100 genomes from the tree.
2. Randomly select 1 genome from the tree and calculate its family and PD contribution; do this 800 times.
3. Divide the 800 data points into categories according to PD contribution: [0-0.1), [0.1-0.2), [0.2-0.3), ...
4. For each category, calculate the average family contribution and the standard deviation.
5. Sample 100 times.
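The binning step can be sketched as follows, assuming each data point is a (PD contribution, family contribution) pair. The function name, the use of the sample standard deviation, and the 0.0 spread for single-point bins are assumptions, not details from the slides.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_pd_bin(points, width=0.1):
    """Group (pd_contribution, family_contribution) pairs into PD bins.

    Bin k covers [k * width, (k + 1) * width). Returns
    {k: (mean family contribution, standard deviation, point count)}.
    """
    bins = defaultdict(list)
    for pd_contribution, family_contribution in points:
        # round() guards against float artifacts such as 0.3 / 0.1 < 3.
        index = math.floor(round(pd_contribution / width, 9))
        bins[index].append(family_contribution)
    summary = {}
    for index in sorted(bins):
        values = bins[index]
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[index] = (mean(values), spread, len(values))
    return summary


example = summarize_by_pd_bin([(0.05, 10), (0.07, 20), (0.15, 30), (0.3, 40)])
for index, (avg, spread, count) in example.items():
    print(index, avg, round(spread, 2), count)
```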

Figure axes: PD contribution of one additional genome, family contribution of one additional genome

Figure axes: PD contribution of one additional genome, family contribution of one additional genome

Family Contribution of ONE genome vs its PD contribution: calculate one genome's contribution to a pool of genomes (10, 20, 50, 100, 200, 500)

Genome Number   Family Number   PD
10              18595±2551      3.74±0.63
20              32552±2579      6.08±0.80
50              67454±5068      11.20±0.85
100             116192±5768     17.17±0.98
200             197728±6778     25.30±1.11
500             393797±7616     41.07±1.05
Figure axes: PD contribution of one additional genome, family contribution of one additional genome

Novel families contributed by one genome are related to its PD contribution (non-linear)
Novel families contributed by one genome are related to the PD and families of the existing genome pool
Random samples from a genome pool can predict the PD and gene families of the pool (a Chao-estimator-like approach can be applied)
Current genome sequencing data cannot accurately predict the microbial protein space:
1. it is not a random sample
2. it is at an early stage, which can result in over-estimation even if we have a random sample (linear stage of the Chao estimator)
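For orientation, a sketch of the classic Chao1 richness estimator is given below. The slides only say that a "Chao estimator like approach" can be applied, so using Chao1 specifically, and treating each family's observation count as its abundance, are assumptions made for this example.

```python
from collections import Counter


def chao1(observation_counts):
    """Chao1 estimate: S_obs + F1^2 / (2 * F2).

    observation_counts: one positive integer per observed family,
    giving how many times that family was seen in the sample.
    F1 and F2 are the numbers of families seen exactly once and twice.
    When F2 is 0, the bias-corrected term F1 * (F1 - 1) / 2 is used.
    """
    counts = [c for c in observation_counts if c > 0]
    observed = len(counts)
    frequency = Counter(counts)
    f1, f2 = frequency[1], frequency[2]
    if f2 > 0:
        return observed + f1 * f1 / (2 * f2)
    return observed + f1 * (f1 - 1) / 2


print(chao1([1, 1, 1, 2, 2, 5]))  # 6 observed, F1=3, F2=2 -> 8.25
```

The estimate depends heavily on the counts of rarely seen families, so it is sensitive to how singletons are handled.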

Greengenes 16S rRNA tree. Figure axes: total PD, genome number.

2010.10

Gene Family Building for All the Sequenced Genomes from IMG
1. Protein-encoding genes from all the genomes
2. All vs all BLASTP (E-value cutoff 1e-10, covering 80% of both query and hit)
3. Links between genes
4. MCL clustering
5. Gene families
IMG: 1813 genomes with 6,530,517 genes. BLAST and MCL clustering would overwhelm our computer resources.
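The link-filtering step can be illustrated as below. The hit tuple layout (query, hit, E-value, query coverage, hit coverage) is a hypothetical format chosen for this sketch, not BLAST's native output. The grouping step uses plain connected components as a simple stand-in; it is not MCL, which would generally split loosely connected groups further.

```python
def filter_links(hits, max_evalue=1e-10, min_coverage=0.8):
    """Keep hits passing the E-value cutoff and covering both sequences."""
    return [
        (query, hit)
        for query, hit, evalue, query_cov, hit_cov in hits
        if query != hit
        and evalue <= max_evalue
        and query_cov >= min_coverage
        and hit_cov >= min_coverage
    ]


def connected_components(genes, links):
    """Group genes joined by links (union-find; a stand-in for MCL)."""
    parent = {gene: gene for gene in genes}

    def find(gene):
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in links:
        parent[find(a)] = find(b)
    groups = {}
    for gene in genes:
        groups.setdefault(find(gene), set()).add(gene)
    return list(groups.values())


# Hypothetical hits: only the first passes both filters.
hits = [
    ("g1", "g2", 1e-50, 0.90, 0.95),
    ("g2", "g3", 1e-5, 0.90, 0.90),   # E-value too weak
    ("g3", "g4", 1e-20, 0.50, 0.90),  # query coverage too low
]
links = filter_links(hits)
groups = connected_components(["g1", "g2", "g3", "g4"], links)
print(sorted(sorted(group) for group in groups))
```

Groups with a single gene correspond to the singletons counted separately in the family totals.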

Identify Universally Distributed Families
100 representative genomes (85 bacterial and 15 archaeal genomes): 313,139 protein-encoding genes, 28,710,015 links, 17,232 gene families
Universality is the percentage of genomes a family covers
Figure labels: links, genes, families; percentage; universality cutoffs
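Universality, as defined above, can be computed per family from a gene-to-genome mapping. The data layout and the names `universality` and `universal_families` are illustrative assumptions, and the default cutoff is only a placeholder parameter.

```python
def universality(family_genes, genome_of, n_genomes):
    """Percentage of genomes that contain at least one gene of the family."""
    covered = {genome_of[gene] for gene in family_genes}
    return 100.0 * len(covered) / n_genomes


def universal_families(families, genome_of, n_genomes, cutoff=70.0):
    """Return IDs of families whose universality meets the cutoff."""
    return [
        family_id
        for family_id, genes in families.items()
        if universality(genes, genome_of, n_genomes) >= cutoff
    ]


# Hypothetical toy data: 4 genomes (A-D), 2 families.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D"}
families = {"fam1": ["a1", "a2", "b1", "c1"], "fam2": ["d1"]}
print(universality(families["fam1"], genome_of, 4))  # 3 of 4 genomes -> 75.0
print(universal_families(families, genome_of, 4))
```

Paralogs within one genome (such as `a1` and `a2` here) count once, since the measure is based on genomes covered rather than genes.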

Guillaume Jospin from Eisen's Lab
502 families with universality >= 70 (70%), giving 502 HMM profiles
Screen and build the 502 families for the 1813 IMG dataset
Gene family building pipeline: 1,219,828 families (281,562 excluding singletons)