HOGENOM a phylogenomic database

Name: HOGENOM a phylogenomic database
Uploaded: 2017-11-29T08:00:44+00:00
Duration: PTM17S57
Channel: Jeffry Lawson
Description: HOGENOM a phylogenomic database

HOGENOM a phylogenomic database
Simon Penel, Pascal Calvat, Jean-Francois Dufayard, Vincent Daubin, Laurent Duret, Manolo Gouy, Dominique Guyot, Daniel Kahn, Vincent Miele, Vincent Navratil, Guy Perrière, Rémi Planel

Several phylogenomic databases developed at LBBE/PRABI
- HOVERGEN: vertebrate proteins from UniProt; clustering with SiLiX
- HOMOLENS: proteins from Ensembl complete genomes; clustering from Ensembl; trees calculated and annotated (S, D, L) with new methods (PhylDog, LBBE)
- HOGENOM: proteins from all available complete genomes (Bacteria, Eukaryota, Archaea); clustering with SiLiX and post-processing with HiFiX; trees will be annotated (S, D, L, T)

HOGENOM characteristics
- All complete genomes from the whole tree of life (not restricted to a particular phylum)
- Proposes « gene families »: full-length homologous sequences (as distinct from « domain families »)

Domain vs. gene families
Families of homologous protein domains (ProDom):
 - Evolution by domain shuffling (duplication, loss, translocation)
Homologous gene families (HOGENOM):
 - Evolution of homologous genes by speciation, by gene duplication, or by horizontal transfer
 - Sequences are homologous over their entire length (or almost)

Orthologs and paralogs in HOGENOM
HOGENOM is centered on phylogenetic trees of gene families. Information on orthologs and paralogs can be deduced from gene trees:
 - from the annotation of gene trees (Duplication, Speciation, Transfer)
 - from query tools such as tree-pattern matching

Building
 - Compare all proteins against each other (BLAST)
 - Cluster homologous sequences into families (SiLiX + HiFiX)
 - Compute multiple alignments for each family
 - Compute phylogenetic trees for each family
 - Annotate phylogenetic trees (gene duplications, losses, transfers)

Compare all proteins against each other
 - Iterative BLAST calculation
 - Use of a non-redundant protein sequence database (all known proteins, about 20,000,000 non-redundant sequences) …
 - … associated with a database of the resulting BLAST hits (from which BLAST hits may be extracted)
 - Cluster, grid and cloud computing

SiLiX 1st step: similarity search
Protein database -> local pairwise alignments (BLASTP, BLOSUM62, E ≤ 10^-4)
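
A minimal sketch of the E-value filter, assuming the all-against-all hits are stored in BLAST's standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name, the sample hits and the choice to skip self-hits are assumptions made for this example, not part of SiLiX.

```python
import csv
import io

E_VALUE_MAX = 1e-4  # threshold quoted on the slide: E <= 10^-4


def filter_hits(handle, e_value_max=E_VALUE_MAX):
    """Yield (query, subject, percent_identity, e_value) for hits under the threshold."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        percent_identity, e_value = float(row[2]), float(row[10])
        # Assumption: self-hits carry no clustering information and are dropped.
        if query != subject and e_value <= e_value_max:
            yield query, subject, percent_identity, e_value


# Three made-up hits: one significant, one above the threshold, one self-hit.
sample = io.StringIO(
    "protA\tprotB\t62.5\t200\t75\t0\t1\t200\t5\t204\t1e-50\t180\n"
    "protA\tprotC\t28.0\t50\t36\t0\t10\t59\t1\t50\t0.5\t25.1\n"
    "protA\tprotA\t100.0\t210\t0\t0\t1\t210\t1\t210\t1e-120\t420\n"
)
kept = list(filter_hits(sample))
print(kept)
```
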

SiLiX 2nd step: SiLiX clustering
Use the all-against-all BLAST hits

SiLiX : Selection of consistent HSPs
[Figure: Seq. A and Seq. B joined by two HSPs, labelled S1’ and S2, with lengths lgHSP1 and lgHSP2 and offsets ∆lg1, ∆lg2, ∆lg3]

SiLiX : single linkage clustering
Linking criteria: HSP ≥ 80 % of the length, identity ≥ 35 %
[Figure: linked sequences A, B and C are grouped into a single cluster A, B, C]
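
The principle can be sketched in a few lines of Python: pairs that pass both thresholds become links, and families are the connected components of the resulting graph. This is a toy union-find version written for this transcript, not the SiLiX implementation; the input tuples (with coverage and identity given as fractions) and the function name are assumptions.

```python
def single_linkage(pairs, min_coverage=0.80, min_identity=0.35):
    """Group sequences into connected components of the graph of accepted pairs.

    pairs: iterable of (seq_a, seq_b, coverage, identity), both values in [0, 1].
    """
    parent = {}

    def find(x):
        """Return the representative of x's cluster (union-find with path halving)."""
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for seq_a, seq_b, coverage, identity in pairs:
        root_a, root_b = find(seq_a), find(seq_b)  # registers both sequences
        if coverage >= min_coverage and identity >= min_identity:
            parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for seq in parent:
        clusters.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(members) for members in clusters.values())


pairs = [
    ("A", "B", 0.90, 0.40),
    ("B", "C", 0.85, 0.36),
    ("A", "C", 0.50, 0.30),  # rejected, yet A and C still meet through B
    ("C", "D", 0.95, 0.20),  # identity too low: D stays on its own
]
families = single_linkage(pairs)
print(families)
```

The A/C pair shows the single-linkage property: two sequences can share a family without being directly linked, as long as a chain of accepted pairs connects them.
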

SiLiX: single linkage clustering with alignment coverage constraints (Miele et al. BMC Bioinformatics 2011)
Computing efficiency:
 - Ultra-fast
 - Memory efficient
 - Scalable (parallel architecture)
Clustering quality:
 - At least as good as the previously published methods

However …
 - Because of over-extension of BLAST alignments, some sequences that share only partial homology may be clustered into the same family
 - The risk of alignment over-extension is low, but it becomes a problem for very large protein families
 - Use more stringent clustering criteria? No: the optimal clustering criteria are not the same for all families

HiFiX
 - The mode and tempo of evolution are specific to each protein family
 - A multiple alignment provides information about the specific pattern of evolution of a family
 => this can be used to decide whether or not a new sequence belongs to that family

HiFiX
 - Step 1: rapid clustering (SiLiX) -> pre-families
 - Step 2: sub-clustering of pre-families into homogeneous protein clusters -> sub-families
 - Step 3: progressive merging of sub-families into families, with evaluation of multiple alignment quality at each step -> families

Results of clustering
About 7,000,000 proteins clustered into 300,000 families
[Table: family size distribution, number of sequences per family vs. number of families]

Compute multiple alignments
All alignments (~300,000) have been calculated with ClustalΩ

Compute phylogenetic tree
Question: what about the alternative splicing ?

Alternative splicing
In eukaryotes, due to alternative splicing, a single gene may be transcribed into several transcripts

Transcripts in HOGENOM6
We selected all the transcripts for each gene, because the longest transcript is not always the best!

Selection of a representative isoform in HOGENOM
Because:
 - We don’t want several proteins for the same gene in a phylogenetic tree: they may be seen as a duplication
 - We want 1 protein per gene for statistical comparison among organisms

Selection of a representative isoform: how?
 - Eukarya: 1 or more transcripts per gene
 - Archaea and bacteria: 1 transcript per gene

[Figure: clustering of Eukarya together with Archaea and bacteria]

First step: when a gene has isoforms in different families, choose a family for the gene

 - We select the family with the highest number of eukaryotic genes (and not proteins)
 - If the numbers of eukaryotic genes are identical, we select the family with the highest number of eukaryotic proteins
 - If the numbers of eukaryotic proteins are identical, we select the family with the highest number of proteins
 - The « rejected » isoforms are called « ISOFORMEX »
 - Some families may finally be empty after this
[Figure: isoforms of genes 1, 2 and 3 spread over three families containing 2, 2 and 3 genes]
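
The three tie-breaking rules amount to a lexicographic comparison, which a short Python sketch can make concrete. The `FamilyCounts` record, the function names and the toy counts are hypothetical and only illustrate the rule; the slides do not say what happens when all three counts are equal, so this sketch simply keeps the first candidate in that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyCounts:
    """Hypothetical summary of one candidate family."""
    name: str
    eukaryotic_genes: int
    eukaryotic_proteins: int
    proteins: int


def choose_family(candidates):
    """Pick the family by eukaryotic genes, then eukaryotic proteins, then all proteins."""
    # Tuples compare lexicographically, which encodes the successive tie-breaks.
    # On a complete tie max() returns the first candidate (not specified by the slides).
    return max(
        candidates,
        key=lambda f: (f.eukaryotic_genes, f.eukaryotic_proteins, f.proteins),
    )


def split_isoforms(isoform_to_family, chosen):
    """Return (kept isoforms, rejected isoforms tagged ISOFORMEX) for one gene."""
    kept = sorted(i for i, fam in isoform_to_family.items() if fam == chosen.name)
    isoformex = sorted(i for i, fam in isoform_to_family.items() if fam != chosen.name)
    return kept, isoformex


candidates = [
    FamilyCounts("fam1", eukaryotic_genes=2, eukaryotic_proteins=3, proteins=3),
    FamilyCounts("fam2", eukaryotic_genes=2, eukaryotic_proteins=3, proteins=5),
    FamilyCounts("fam3", eukaryotic_genes=1, eukaryotic_proteins=4, proteins=9),
]
chosen = choose_family(candidates)
kept, isoformex = split_isoforms(
    {"iso_a": "fam1", "iso_b": "fam2", "iso_c": "fam2", "iso_d": "fam3"}, chosen
)
print(chosen.name, kept, isoformex)
```
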

Second step: when a gene has isoforms in a family, choose a representative isoform for the gene
[Figure: the same three families (2, 2 and 3 genes), with the isoform to keep for each gene still undecided]

We use the alignment
 - Suppression of ISOFORMEX
 - Selection of positions with < 50% gaps
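
A minimal sketch of the column selection, assuming the alignment is held as a dictionary of equal-length gapped strings from which the ISOFORMEX sequences have already been removed. The function names, the "-" gap character and the toy sequences are assumptions made for this example.

```python
def keep_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Return the indices of columns whose gap fraction is strictly below the cutoff."""
    sequences = list(alignment.values())
    length = len(sequences[0])
    kept = []
    for col in range(length):
        gaps = sum(1 for seq in sequences if seq[col] == gap)
        if gaps / len(sequences) < max_gap_fraction:
            kept.append(col)
    return kept


def project(alignment, columns):
    """Restrict every aligned sequence to the selected columns."""
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


alignment = {
    "g1_iso1": "MK-LV",
    "g1_iso2": "MK--V",
    "g2_iso1": "MR-LV",
    "g3_iso1": "M--LI",
}
columns = keep_columns(alignment)
trimmed = project(alignment, columns)
print(columns, trimmed)
```
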

For each isoform of a given gene, at each position, we count 1 each time the residue is identical to the residue in at least one of the isoforms of another eukaryotic gene, over all other eukaryotic genes. The isoform with the highest total is selected; the other isoforms are tagged as ISOFORMIN.
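
The scoring rule can be sketched as follows. The wording on the slide leaves some room for interpretation; this sketch assumes that, at each position, an isoform earns one point per other eukaryotic gene that has at least one isoform carrying the same residue, and that gap positions earn nothing. The data layout and function names are hypothetical.

```python
def score_isoform(isoform_seq, other_genes, gap="-"):
    """Score one aligned isoform against the isoforms of all other genes.

    other_genes: dict mapping gene name -> list of aligned isoform sequences.
    """
    score = 0
    for pos, residue in enumerate(isoform_seq):
        if residue == gap:
            continue  # assumption: a gap never counts as a match
        for isoforms in other_genes.values():
            if any(seq[pos] == residue for seq in isoforms):
                score += 1
    return score


def pick_representative(gene, aligned):
    """Return (selected isoform, isoforms tagged ISOFORMIN, scores) for one gene.

    aligned: dict gene -> dict isoform name -> aligned sequence (eukaryotic genes only).
    """
    others = {g: list(isos.values()) for g, isos in aligned.items() if g != gene}
    scores = {name: score_isoform(seq, others) for name, seq in aligned[gene].items()}
    best = max(scores, key=scores.get)
    isoformin = sorted(name for name in scores if name != best)
    return best, isoformin, scores


aligned = {
    "gene1": {"g1_iso1": "MKLV", "g1_iso2": "MK-V"},
    "gene2": {"g2_iso1": "MRLV"},
    "gene3": {"g3_iso1": "MKLI", "g3_iso2": "MKLV"},
}
best, isoformin, scores = pick_representative("gene1", aligned)
print(best, isoformin, scores)
```
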

Tree calculation
Gblocks, then PhyML / FastTree
[Figure: alignment of sequences a to g, some of them tagged ISOFORMIN or ISOFORMEX, and the tree obtained from it]

Annotate phylogenetic trees
Several methods are currently being developed in the ANCESTROME project:
 - Speciation, Duplication and Loss
 - Speciation, Duplication, Transfer and Loss
See Vincent Daubin's talk tomorrow

Querying the database
ACNUC server (client-server application, R package, Python package, C API, Bio++ API)

Querying the database
Web interface on PRABI

Querying the database
Homologous families detected with HMM (D. Guyot)

Querying the database
New tools! (R. Planel, J.F. Dufayard)

Querying the database
Displaying the gene tree and the synteny context of the gene

Querying the database
Search for orthologous vertebrate genes between mouse and human

Thank you for your attention
Ancestrome: Integrative phylogenetic approaches for reconstructing ancestral "-omes"
