1
HOGENOM: a phylogenomic database
Simon Penel, Pascal Calvat, Jean-Francois Dufayard, Vincent Daubin, Laurent Duret, Manolo Gouy, Dominique Guyot, Daniel Kahn, Vincent Miele, Vincent Navratil, Guy Perrière, Rémi Planel
2
Several phylogenomic databases developed at LBBE/PRABI
HOVERGEN
- Vertebrate proteins from UniProt
- Clustering with SiLiX
HOMOLENS
- Proteins from Ensembl complete genomes
- Clustering from Ensembl
- Trees calculated and annotated (S, D, L) with new methods (PhylDog, LBBE)
HOGENOM
- Proteins from all available complete genomes (Bacteria, Eukaryota, Archaea)
- Clustering with SiLiX and post-processing with HiFiX
- Trees will be annotated (S, D, L, T)
3
HOGENOM characteristics
- All complete genomes from the whole tree of life (not restricted to a particular phylum)
- Proposes « gene families »: full-length homologous sequences (as opposed to « domain families »)
4
Domain vs. gene families
Protein domain family
Families of homologous protein domains (ProDom):
- Evolution by domain shuffling (duplication, loss, translocation)
5
Domain vs. gene families
Homologous gene family
Families of homologous protein domains (ProDom):
- Evolution by domain shuffling (duplication, loss, translocation)
Homologous gene families (HOGENOM):
- Evolution of homologous genes by speciation, gene duplication, or horizontal transfer
- Sequences are homologous over their entire length (or almost)
6
Orthologs and paralogs in HOGENOM
HOGENOM is centered on phylogenetic trees of gene families.
Information on orthologs and paralogs can be deduced from gene trees:
- from the annotation of gene trees (Duplication, Speciation, Transfer)
- from query tools such as tree-pattern matching
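As a minimal sketch of how an annotated gene tree yields orthologs and paralogs: two genes are orthologs when their last common ancestor node is a speciation, and paralogs when it is a duplication. The `Node` class, the event labels "S" and "D", and the toy tree are illustrative assumptions, not the HOGENOM data model; transfer events are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Gene-tree node; `event` is "S" (speciation) or "D" (duplication) on internal nodes."""
    name: str = ""
    event: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaves(node):
    """Return the set of leaf names below `node`."""
    if not node.children:
        return {node.name}
    names = set()
    for child in node.children:
        names |= leaves(child)
    return names


def last_common_ancestor(root, a, b):
    """Return the deepest node whose subtree contains both leaves a and b."""
    node = root
    while True:
        for child in node.children:
            if {a, b} <= leaves(child):
                node = child
                break
        else:
            return node


def relationship(root, a, b):
    """Classify two genes from the event annotated on their last common ancestor."""
    event = last_common_ancestor(root, a, b).event
    return {"S": "orthologs", "D": "paralogs"}.get(event, "unknown")


# Toy family (hypothetical): a duplication that predates the human/mouse speciation.
tree = Node(event="D", children=[
    Node(event="S", children=[Node("human_A"), Node("mouse_A")]),
    Node(event="S", children=[Node("human_B"), Node("mouse_B")]),
])
```

In this toy tree, `relationship(tree, "human_A", "mouse_A")` gives orthologs, while `relationship(tree, "human_A", "mouse_B")` gives paralogs because the two genes are separated by the root duplication.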
7
Building
- Compare all proteins against each other (BLAST)
- Cluster homologous sequences into families (SiLiX + HiFiX)
- Compute multiple alignments for each family
- Compute phylogenetic trees for each family
- Annotate phylogenetic trees (gene duplications, losses, transfers)
8
Compare all proteins against each other
- Iterative BLAST calculation
- Use of a non-redundant protein sequence database …
- … (all known proteins, about 20,000,000 non-redundant sequences) …
- … associated with a resulting BLAST hits database (from which BLAST hits may be extracted)
- Cluster, grid and cloud computing
10
SiLiX 1st step: similarity search
Protein database → BLASTP (BLOSUM62, E ≤ 10^-4) → local pairwise alignments
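A small sketch of the E-value cut-off applied to similarity search results. It assumes the hits are available in the default BLAST+ tabular layout (`-outfmt 6`), which the slides do not state; the function name `filter_hits` and the two example hits are made up for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1e-4):
    """Yield hits (as dicts of strings) whose E-value is at most `max_evalue`."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        if float(hit["evalue"]) <= max_evalue:
            yield hit


# Two made-up hits: the first passes the threshold, the second does not.
example = io.StringIO(
    "P1\tP2\t62.5\t200\t75\t0\t1\t200\t5\t204\t3e-80\t250\n"
    "P1\tP3\t30.1\t60\t42\t1\t10\t69\t40\t99\t0.5\t28.1\n"
)
kept = list(filter_hits(example))
```

Streaming the hits one at a time keeps memory use flat, which matters for an all-against-all comparison of this size.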
11
SiLiX 2nd step: SiLiX clustering
Use the all-against-all BLAST hits
12
SiLiX : Selection of consistent HSPs
[Figure: HSPs between Seq. A and Seq. B, labelled S1’ and S2, with lengths lgHSP1 and lgHSP2 and offsets ∆lg1, ∆lg2, ∆lg3]
13
SiLiX : single linkage clustering
Linking criteria: HSP ≥ 80 % of the length, identity ≥ 35 %
[Figure: sequences A, B and C joined into a single cluster A, B, C]
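Single linkage clustering with these two thresholds can be sketched with a union-find structure. This is not the SiLiX implementation: the input format (pairs with precomputed coverage and identity as fractions) and the function name are assumptions, and the way SiLiX derives coverage from consistent HSPs is not reproduced here.

```python
def single_linkage(pairs, min_coverage=0.80, min_identity=0.35):
    """Cluster sequences: any pair passing both thresholds ends up in the same cluster.

    `pairs` is an iterable of (seq_a, seq_b, coverage, identity) tuples, with
    coverage and identity expressed as fractions in [0, 1].
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b, coverage, identity in pairs:
        root_a, root_b = find(a), find(b)  # registers both sequences
        if coverage >= min_coverage and identity >= min_identity and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for seq in list(parent):
        clusters.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(c) for c in clusters.values())


# A-B and B-C pass the thresholds, so A, B and C share a cluster even though
# A and C were never compared directly; C-D fails on coverage.
example = single_linkage([
    ("A", "B", 0.90, 0.50),
    ("B", "C", 0.85, 0.40),
    ("C", "D", 0.50, 0.90),
])
```

The transitive grouping of A with C through B is the defining property of single linkage.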
14
SiLiX
SiLiX: single linkage clustering with alignment coverage constraints (Miele et al. BMC Bioinformatics 2011)
Computing efficiency:
- Ultra-fast
- Memory efficient
- Scalable (parallel architecture)
Clustering quality:
- At least as good as the previously published methods
15
However …
- Because of over-extension of BLAST alignments, some sequences that share only partial homology may be clustered in the same family
- The risk of alignment over-extension is low, but becomes a problem for very large protein families
- Use more stringent clustering criteria? No: optimal clustering criteria are not the same for all families
16
HiFiX
- The mode and tempo of evolution are specific to each protein family
- A multiple alignment provides information about the specific pattern of evolution of a family
=> this can be used to decide whether or not a new sequence belongs to that family
17
HiFiX
- Step 1: rapid clustering (SiLiX) → pre-families
- Step 2: sub-clustering of pre-families into homogeneous protein clusters → sub-families
- Step 3: progressive merging of sub-families into families, with evaluation of multiple alignment quality at each step → families
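The idea behind Step 3 can be illustrated with a generic greedy merge guarded by a quality score. This is only a sketch under strong assumptions: `quality` is a caller-supplied stand-in, the toy `conserved_fraction` score (fraction of fully conserved columns in pre-aligned sequences) is not the alignment quality measure used by HiFiX, and the merge order used by HiFiX is not described in the slides.

```python
def progressive_merge(subfamilies, quality, min_quality):
    """Greedily merge groups while the best merged group still scores >= min_quality.

    `quality(group)` is a stand-in for the multiple-alignment quality
    evaluation; it is NOT HiFiX's actual scoring function.
    """
    families = [frozenset(s) for s in subfamilies]
    merged = True
    while merged:
        merged = False
        best = None
        for i in range(len(families)):
            for j in range(i + 1, len(families)):
                score = quality(families[i] | families[j])
                if score >= min_quality and (best is None or score > best[0]):
                    best = (score, i, j)
        if best is not None:
            _, i, j = best
            union = families[i] | families[j]
            families = [f for k, f in enumerate(families) if k not in (i, j)]
            families.append(union)
            merged = True
    return families


def conserved_fraction(group):
    """Toy quality score: fraction of columns where all (pre-aligned) sequences agree."""
    columns = list(zip(*group))
    return sum(len(set(col)) == 1 for col in columns) / len(columns)


# Hypothetical sub-families of already aligned toy sequences.
result = progressive_merge(
    [{"MKVL", "MKVI"}, {"MKIL"}, {"GGHW"}],
    quality=conserved_fraction,
    min_quality=0.5,
)
```

Here the first two sub-families merge (half of the columns stay conserved), while the unrelated third one remains a family of its own.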
18
HiFiX
21
Results of clustering
About 7,000,000 proteins clustered into 300,000 families
[Table: family size distribution (number of sequences vs. number of families)]
23
Compute multiple alignments
All alignments (~300,000) have been calculated with ClustalΩ
25
Compute phylogenetic tree
Question: what about alternative splicing?
26
Alternative splicing
In eukaryotes, due to alternative splicing, a single gene may be transcribed into several transcripts
27
Transcripts in HOGENOM6
We selected all the transcripts for each gene, because the longest transcript is not always the best!
28
Selection of a representative isoform in HOGENOM
Because:
- We don’t want several proteins for the same gene in a phylogenetic tree: they may be seen as a duplication
- We want 1 protein per gene for statistical comparison among organisms
29
Selection of a representative isoform : how ?
30
Selection of a representative isoform : how ?
- Eukarya: 1 or more transcripts per gene
- Archaea and bacteria: 1 transcript per gene
31
Selection of a representative isoform : how ?
[Figure: clustering of the sequences from Eukarya, Archaea and bacteria]
32
Selection of a representative isoform : how ?
First step: when a gene has isoforms in different families, choose a family for the gene
36
Selection of a representative isoform : how ?
- We select the family with the highest number of eukaryotic genes (and not proteins)
- If the numbers of eukaryotic genes are identical, we select the family with the highest number of eukaryotic proteins
- If the numbers of eukaryotic proteins are identical, we select the family with the highest number of proteins
[Figure: candidate families containing 2, 2 and 3 genes]
The « rejected » isoforms are called « ISOFORMEX »
SOME FAMILIES MAY FINALLY BE EMPTY AFTER THIS
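The three successive criteria amount to a lexicographic comparison, which can be sketched with tuple ordering. The function name, the family identifiers and the counts are hypothetical; the slides do not say what happens when all three counts are tied, so this sketch simply keeps the first such family.

```python
def choose_family(candidates):
    """Pick one family for a gene whose isoforms fall in several families.

    `candidates` maps a family id to a tuple
    (n_eukaryotic_genes, n_eukaryotic_proteins, n_proteins).
    Tuples compare element by element, which reproduces the successive tie-breaks.
    """
    return max(candidates, key=lambda family: candidates[family])


# Hypothetical counts: F1 and F2 tie on eukaryotic genes and eukaryotic
# proteins, so the total number of proteins decides.
chosen = choose_family({
    "F1": (2, 5, 9),
    "F2": (2, 5, 12),
    "F3": (2, 4, 30),
})
```

Isoforms of the gene that sit in the families not chosen would then be the ones tagged ISOFORMEX.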
37
Selection of a representative isoform : how ?
Second step: when a gene has isoforms in a family, choose a representative isoform for the gene
[Figure: families containing 2, 2 and 3 genes]
39
Selection of a representative isoform : how ?
We use the alignment
40
Selection of a representative isoform : how ?
We use the alignment
- Removal of ISOFORMEX
41
Selection of a representative isoform : how ?
We use the alignment
- Selection of positions with < 50% gaps
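Selecting the positions with fewer than 50% gaps can be sketched as follows, assuming the alignment is a list of equal-length strings with "-" as the gap character (both assumptions, as is the function name).

```python
def keep_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Return indices of alignment columns whose gap fraction is below the threshold."""
    n_seqs = len(alignment)
    return [i for i, column in enumerate(zip(*alignment))
            if column.count(gap) / n_seqs < max_gap_fraction]


# Toy alignment: column 2 has 75% gaps and column 3 exactly 50%, so both are dropped.
columns = keep_columns(["MK-LV", "MK--V", "M--LV", "MKA-V"])
```

The comparison is strict, so a column with exactly half of its sequences gapped is not retained.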
42
Selection of a representative isoform : how ?
For each isoform of a given gene, at each position, we count 1 each time the residue is identical to the residue in at least one of the isoforms of the other eukaryotic genes. The isoform with the highest total is selected; the other isoforms are tagged as ISOFORMIN.
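One possible reading of this scoring rule, as a sketch: at each retained position, an isoform earns one point per other gene that has at least one isoform carrying the same residue. The data layout, the function name, and the choice not to count gaps as matches are assumptions; the input is taken to contain only the eukaryotic genes of one family.

```python
def pick_representative(gene, isoforms_by_gene, columns):
    """Return (best_isoform_id, scores) for `gene`.

    `isoforms_by_gene` maps gene id -> {isoform id -> aligned sequence};
    `columns` lists the alignment positions to score.
    """
    scores = {}
    for iso_id, seq in isoforms_by_gene[gene].items():
        total = 0
        for pos in columns:
            if seq[pos] == "-":
                continue  # assumption: a gap never counts as a match
            for other, other_isoforms in isoforms_by_gene.items():
                if other == gene:
                    continue
                # One point per other gene with at least one matching isoform.
                if any(s[pos] == seq[pos] for s in other_isoforms.values()):
                    total += 1
        scores[iso_id] = total
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical family with three genes; gene g1 has two isoforms.
family = {
    "g1": {"g1.a": "MKLV", "g1.b": "MRLI"},
    "g2": {"g2.a": "MKLV"},
    "g3": {"g3.a": "MKIV", "g3.b": "MKLV"},
}
best, scores = pick_representative("g1", family, columns=[0, 1, 2, 3])
```

In this toy family `g1.a` matches both other genes at all four positions and is kept; `g1.b` would be the isoform tagged ISOFORMIN.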
45
Tree calculation
46
Tree calculation
[Figure: alignment of sequences a to g, with some sequences tagged ISOFORMIN or ISOFORMEX]
48
Tree calculation
Tools: Gblocks, PhyML, FastTree
[Figure: alignment with sequences tagged ISOFORMIN or ISOFORMEX, and the resulting tree]
50
Annotate phylogenetic trees
Several methods are currently being developed in the ANCESTROME project:
- Speciation, Duplication and Loss
- Speciation, Duplication, Transfer and Loss
See Vincent Daubin's talk tomorrow
51
Querying the database
ACNUC server (client-server application, R package, Python package, C API, Bio++ API)
52
Querying the database
Web interface on PRABI
55
Querying the database
Homologous families detected with HMM (D. Guyot)
56
Querying the database
New tools! (R. Planel, J.F. Dufayard)
57
Querying the database
Displaying the gene tree and the synteny context of the gene
59
Querying the database
Search for orthologous vertebrate genes between mouse and man
61
Thank you for your attention
Ancestrome: Integrative phylogenetic approaches for reconstructing ancestral "-omes"