Gene Family Ancestral State Phylogenetic Profiling


1 Gene Family Ancestral State Phylogenetic Profiling
Dongying Wu

2 Phylogenetic profiling
Phylogenetic profiling studies the joint presence or absence of two traits across a phylogenetic tree to infer a meaningful biological connection (for example, biological pathway discovery or protein subunit identification).
[Table: presence/absence (1/0) of family 1 and family 2 across genome 1 to genome 5; the cell values are not recoverable from the transcript.]
Calculate the correlation of the genome distributions between all pairs of gene families (e.g. Pearson correlation). A value of 1 means the two families share the same distribution profile across the genomes; 0 means the two families are not correlated.

3 Naive Correlation
[Table: presence/absence (1/0) of Family A and Family B across genomes G1 to G9; the cell values are not recoverable from the transcript.]
The Pearson product-moment correlation coefficient between Family A and Family B in the table above assumes that joint presence/absence is phylogenetically independent. We call it the Naive Correlation.
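As a minimal sketch of the naive correlation, the snippet below computes the Pearson coefficient between two presence/absence profiles. The profile values for `family_a` and `family_b` are hypothetical (the slide's table values did not survive), and the helper name `pearson` is an assumption made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined when a family is present in all genomes or in none (zero variance).
    return cov / sqrt(var_x * var_y)

# Hypothetical presence (1) / absence (0) profiles over genomes G1..G9.
family_a = [1, 1, 0, 0, 1, 0, 1, 1, 0]
family_b = [1, 1, 0, 0, 1, 0, 1, 0, 0]

naive_r = pearson(family_a, family_b)
print(round(naive_r, 3))  # 0.8
```

Every genome counts as an independent observation here, which is exactly the assumption the following slides question.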

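4 Independent Contrasts
Felsenstein (1985) proposed the first general statistical method for incorporating phylogenetic information: it transforms the original tip data into values that are statistically independent and identically distributed.
[Figure: four-tip tree. Tips G1 and G2 join at node N1 through branches v1 and v2; tips G3 and G4 join at node N2 through branches v3 and v4; N1 and N2 join at the root N3 through branches v5 and v6. Each tip carries a value for Trait 1 (T1) and Trait 2 (T2).]
For Trait 1 (T1), the contrasts between the following pairs are independent of one another. Each difference is scaled by the square root of the summed branch lengths, as in Felsenstein (1985):
\[ C_{G1-G2} = \frac{T1_{G1} - T1_{G2}}{\sqrt{v_1 + v_2}}, \qquad C_{G3-G4} = \frac{T1_{G3} - T1_{G4}}{\sqrt{v_3 + v_4}} \]
\[ T1_{N1} = \frac{T1_{G1}/v_1 + T1_{G2}/v_2}{1/v_1 + 1/v_2}, \qquad T1_{N2} = \frac{T1_{G3}/v_3 + T1_{G4}/v_4}{1/v_3 + 1/v_4} \]
\[ v'_5 = v_5 + \frac{1}{1/v_1 + 1/v_2}, \qquad v'_6 = v_6 + \frac{1}{1/v_3 + 1/v_4} \]
\[ C_{N1-N2} = \frac{T1_{N1} - T1_{N2}}{\sqrt{v'_5 + v'_6}} \]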
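The contrast calculation for the four-tip tree can be sketched as follows. This is an illustration under stated assumptions rather than the presenter's code: the function names, the dictionary layout and the example values are hypothetical, and the division by the square root of the summed branch lengths follows the standardization in Felsenstein (1985).

```python
from math import sqrt

def weighted_node(x_a, v_a, x_b, v_b):
    """Branch-length-weighted node value and the extra length added to its stem."""
    inv = 1.0 / v_a + 1.0 / v_b
    return (x_a / v_a + x_b / v_b) / inv, 1.0 / inv

def contrasts_four_tips(t, v):
    """Standardized contrasts for the tree ((G1,G2)N1,(G3,G4)N2)N3.

    t maps tip names G1..G4 to trait values; v maps v1..v6 to branch lengths (> 0).
    """
    c_12 = (t["G1"] - t["G2"]) / sqrt(v["v1"] + v["v2"])
    c_34 = (t["G3"] - t["G4"]) / sqrt(v["v3"] + v["v4"])
    n1, extra_1 = weighted_node(t["G1"], v["v1"], t["G2"], v["v2"])
    n2, extra_2 = weighted_node(t["G3"], v["v3"], t["G4"], v["v4"])
    v5_adj = v["v5"] + extra_1  # v'5 on the slide
    v6_adj = v["v6"] + extra_2  # v'6 on the slide
    c_n = (n1 - n2) / sqrt(v5_adj + v6_adj)
    return [c_12, c_34, c_n]

# Hypothetical example: unit branch lengths, family present only in G1.
branches = {f"v{i}": 1.0 for i in range(1, 7)}
trait_1 = {"G1": 1, "G2": 0, "G3": 0, "G4": 0}
print(contrasts_four_tips(trait_1, branches))
```

Running the same function on a second trait gives a matching list of contrasts, and the two lists are what the correlation through the origin is computed on.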

5 "Correlation through origin"
[Figure: the four-tip tree (tips G1 to G4, nodes N1 to N3, branches v1 to v6) with T1 and T2 values at the tips, next to a scatter plot of the T2 contrasts (y axis) against the T1 contrasts (x axis). The points are labeled with both directions of each contrast: G1-G2 and G2-G1, G3-G4 and G4-G3, N1-N2 and N2-N1.]
Theodore Garland et al., Syst Biol 41(1): 18-32, 1992

6 The independent contrast approach has some issues:
1. Phylogeny-associated distribution is important; totally disregarding it might not be biologically sound (e.g. cyanobacteria).
2. Calculating the correlation over all the nodes, regardless of their phylogenetic depth, is not a sound approach.
3. Regression through the origin sometimes artificially increases the correlation.
Example:
Family A: G1=0, G2=0, G3=1, G4=1 (contrasts: 0, 0, 1)
Family B: G1=1, G2=1, G3=0, G4=0 (contrasts: 0, 0, 1)
Naive correlation = -1; independent contrast correlation = 1
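The arithmetic of this example can be checked with a short sketch. The contrast values are taken exactly as listed on the slide (the slide does not state how their signs were assigned), and the function name `correlation` and its `through_origin` flag are assumptions made for illustration.

```python
from math import sqrt

def correlation(x, y, through_origin=False):
    """Pearson correlation; with through_origin=True the means are fixed at zero."""
    n = len(x)
    mean_x = 0.0 if through_origin else sum(x) / n
    mean_y = 0.0 if through_origin else sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return num / den

# Tip profiles from the slide (G1..G4).
family_a = [0, 0, 1, 1]
family_b = [1, 1, 0, 0]
# Contrast values as listed on the slide for each family.
contrasts_a = [0, 0, 1]
contrasts_b = [0, 0, 1]

naive = correlation(family_a, family_b)
contrast_based = correlation(contrasts_a, contrasts_b, through_origin=True)
print(naive, contrast_based)  # -1.0 1.0
```

Only the single non-zero contrast at the deepest node contributes to the through-origin value, which is why one ancient event can dominate the result.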

7 One approach to address the phylogenetic dependence issue in phylogenetic profiling:
Estimate the ancestral states of each gene family at a certain level (e.g. phylum), and observe the relationship between two gene families at that level. TreeOTU cutoffs define the taxonomic levels for the comparisons.

8 Felsenstein (1985) ancestral state estimation
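[Figure: four-tip tree. Tips G1 and G2 join at node N1 through branches v1 and v2; tips G3 and G4 join at node N2 through branches v3 and v4; N1 and N2 join at the root N3 through branches v5 and v6. Each tip carries a value for Trait 1 (T1) and Trait 2 (T2).]
\[ T1_{N1} = \frac{T1_{G1}/v_1 + T1_{G2}/v_2}{1/v_1 + 1/v_2}, \qquad T1_{N2} = \frac{T1_{G3}/v_3 + T1_{G4}/v_4}{1/v_3 + 1/v_4} \]
\[ v'_5 = v_5 + \frac{1}{1/v_1 + 1/v_2}, \qquad v'_6 = v_6 + \frac{1}{1/v_3 + 1/v_4} \]
\[ T1_{N3} = \frac{T1_{N1}/v'_5 + T1_{N2}/v'_6}{1/v'_5 + 1/v'_6} \]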
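The same weighted-average rule can be applied recursively to a bifurcating tree of any size. The sketch below is one possible way to write it; the nested-tuple tree encoding, the function name `estimate` and the example values are assumptions made for illustration, not the presenter's implementation.

```python
def estimate(node):
    """Return (ancestral estimate, extra stem length) for a bifurcating tree.

    A tip is a number (1 = present, 0 = absent). An internal node is a pair
    ((left_child, left_branch_length), (right_child, right_branch_length)).
    Branch lengths are assumed to be strictly positive.
    """
    if not isinstance(node, tuple):
        return float(node), 0.0
    (left, v_left), (right, v_right) = node
    x_left, extra_left = estimate(left)
    x_right, extra_right = estimate(right)
    # Lengthen each branch by the uncertainty of the estimate below it (v' on the slide).
    v_l = v_left + extra_left
    v_r = v_right + extra_right
    inv = 1.0 / v_l + 1.0 / v_r
    return (x_left / v_l + x_right / v_r) / inv, 1.0 / inv

# Hypothetical example: unit branch lengths, family present in G1, G2 and G4.
n1 = ((1, 1.0), (1, 1.0))   # G1, G2
n2 = ((0, 1.0), (1, 1.0))   # G3, G4
n3 = ((n1, 1.0), (n2, 1.0))

root_state, _ = estimate(n3)
print(root_state)  # 0.75
```

The estimate at a node is a value between 0 and 1 rather than a hard presence/absence call, so it can be correlated directly between families.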

9 7850 genomes (7599 bacteria and 251 archaea)
An ML tree was built with FastTree2 under the WAG model from concatenated alignments of 38 PHYECO markers (the genomes were chosen because they have at least 35 of the 38 PHYECO markers).
9432 PfamA families were built from hmmsearch with the trusted cutoffs (each family has at least 5 members).
OTUs were generated by TreeOTU at cutoffs spaced at 0.01 intervals, and the ancestral states of presence/absence (1/0) were estimated for all the gene families and all the OTU groups.

10 PF01867: CRISPR associated protein Cas1

11 PF02154: Flagellar motor switch

12 PF00283: Cytochrome b559 (Photosystem II)

13 AMI (adjusted mutual information) based TreeOTU cutoffs for different taxonomic levels
(compared with NCBI taxa)
Species (AMI=0.9201)
Genus (AMI=0.8905)
Family (AMI=0.9271)
Order (AMI=0.9026)
Class (AMI=0.9315)
Phylum (AMI=0.8505)
At one taxonomic level (e.g. class), the ancestral states of the presence/absence of the 9432 PfamA gene families were estimated in all the OTUs, and Pearson-correlation-based distances between pairs of families were calculated.

14 Cytoscape tree view (genus level)

15 Cytoscape tree view (family level)

16 Cytoscape tree view (order level)

17 Two families have different Pearson correlation distances at different taxonomic levels
Can we estimate an overall distance? A simple average has a problem: the distances at different levels are not independent.
[Figure: scatter plot of the Pearson correlation between Pfam pairs at genus level against the Pearson correlation between Pfam pairs at species level; 27,758,497 pairs.]

18 Pearson correlation between Pfam pairs at class level
[Figure: scatter plot of the Pearson correlation between Pfam pairs at class level against the Pearson correlation between Pfam pairs at species level; 27,460,890 pairs.]

19 For each family pair:
The overall distance = (dist_strain + dist_family + dist_phylum)/3

20 SPICi: a fast clustering algorithm for large biological networks
Peng Jiang and Mona Singh (Lewis-Sigler Institute for Integrative Genomics and Department of Computer Science, Princeton University, Princeton, New Jersey, 08544, USA). Bioinformatics Advance Access published February 24, 2010.
-d (density): for each set of vertices S ⊂ V, the density is the sum of the weights of the edges among them, divided by the total number of possible edges.
-g (support): for each vertex u and set S ⊂ V, the support of u by S is the sum of the confidences of u's edges that are incident to vertices in S.
When expanding a cluster to include a vertex, the support of the vertex by the cluster must be >= density × cluster size × Ts (support cutoff).
Example: cluster {1,2}: d = 1/1 = 1. The support of 3 by cluster {1,2}: s = 1+1 = 2.
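The density and support definitions, and the expansion rule built on them, can be illustrated on the three-vertex example. This is a sketch of the definitions only, not SPICi's actual implementation; the function names, the `frozenset`-keyed edge map and the cutoff value `ts=0.5` are assumptions made for illustration.

```python
from itertools import combinations

def density(weights, cluster):
    """Sum of edge weights inside the cluster divided by the number of possible edges."""
    pairs = list(combinations(sorted(cluster), 2))
    if not pairs:  # a single vertex has no possible edges
        return 0.0
    return sum(weights.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def support(weights, vertex, cluster):
    """Sum of the weights of the vertex's edges that reach into the cluster."""
    return sum(weights.get(frozenset((vertex, u)), 0.0) for u in cluster)

def can_join(weights, vertex, cluster, ts):
    """Expansion rule from the slide: support >= density * cluster size * Ts."""
    return support(weights, vertex, cluster) >= density(weights, cluster) * len(cluster) * ts

# Toy network: three vertices, all edges with confidence 1.
weights = {
    frozenset((1, 2)): 1.0,
    frozenset((1, 3)): 1.0,
    frozenset((2, 3)): 1.0,
}
cluster = {1, 2}
print(density(weights, cluster))               # 1.0
print(support(weights, 3, cluster))            # 2.0
print(can_join(weights, 3, cluster, ts=0.5))   # True
```

With these values the threshold is 1 × 2 × 0.5 = 1, so a support of 2 lets vertex 3 join the cluster.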

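21 SPICi cluster size
Number of clusters in each size range, for three parameter settings:
Cluster size    d=0.5 g=0.5    d=0.6 g=0.6    d=0.7 g=0.7
1               2163           3104           4871
2-10            461            707            1284
11-20           47             35             4
21-50           27             19             6
51-100: only two values survive in the transcript (7, 5); their columns are not recoverable.
301-: only one value survives in the transcript (2); its column is not recoverable.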

22 Examples: (d=0.6, g=0.5)
PF04002_RadC_like_JAB_domain
PF01119_DNA_mismatch_repair_protein_C_terminal_domain
PF08676_MutL_C_terminal_dimerisation_domain
PF05190_MutS_family_domain_IV
PF01624_MutS_domain_I
PF05192_MutS_domain_III
PF05188_MutS_domain_II
PF00488_MutS_domain_V
PF12895_Anaphase_promoting_complex_cyclosome_subunit_3
PF13181_Tetratricopeptide_repeat
PF00515_Tetratricopeptide_repeat
PF13414_TPR_repeat
PF07719_Tetratricopeptide_repeat
PF13424_Tetratricopeptide_repeat
PF13176_Tetratricopeptide_repeat
PF13428_Tetratricopeptide_repeat
PF13432_Tetratricopeptide_repeat
PF13431_Tetratricopeptide_repeat
PF13174_Tetratricopeptide_repeat
PF14559_Tetratricopeptide_repeat
PF12700_HlyD_family_secretion_protein
PF13374_Tetratricopeptide_repeat
PF02491_SHS2_domain_inserted_in_FTSA
PF02899_Phage_integrase_N_terminal_SAM_like_domain
PF06429_Flagellar_basal_body_rod_FlgEFG_protein_C_terminal
PF00460_Flagella_basal_body_rod_protein
PF02049_Flagellar_hook_basal_body_complex_protein_FliE
PF03963_Flagellar_hook_capping_protein_N_terminal_region
PF03748_Flagellar_basal_body_associated_protein_FliL
PF14841_FliG_middle_domain
PF00700_Bacterial_flagellin_C_terminal_helical_region
PF02108_Flagellar_assembly_protein_FliH
PF08345_Flagellar_M_ring_protein_C_terminal
PF02154_Flagellar_motor_switch_protein_FliM
PF04347_Flagellar_biosynthesis_protein_FliO
PF02465_Flagellar_hook_associated_protein_2_N_terminus
PF05130_FlgN_protein
PF01584_CheW_like_domain
PF02050_Flagellar_FliJ_protein
PF02120_Flagellar_hook_length_control_protein_FliK
PF13144_SAF_like
PF07238_PilZ_domain
PF02561_Flagellar_protein_FliS
PF00015_Methyl_accepting_chemotaxis_protein_MCP_signalling_domain
PF02895_Signal_transducing_histidine_kinase_homodimeric_domain
PF01339_CheB_methylesterase
(130 PfamA members)

23 What’s next:
Maximum likelihood ancestral state estimation
Go beyond PfamA
Visualization

