Molecular Phylogeny Fredj Tekaia Institut Pasteur tekaia@pasteur.fr.

Evolutionary processes
Large-scale comparative genome analysis has revealed significant evolutionary processes, which include:
• Phylogeny* (ancestor, species genome)
• Expansion* (duplication, genesis)
• Exchange* (HGT)
• Deletion* (loss)
• Selection*

Phylogeny analyses
• Starting point: a set of homologous, aligned DNA or protein sequences
• Result of the process: a tree describing the evolutionary relationships between the studied sequences, i.e. a phylogenetic tree

Plan: • Introduction; • Evolutionary processes; • Homologs - Paralogs - Orthologs; • Key features of phylogenetic trees; • Gene tree - species tree; • Multiple sequence alignment; • Methods for Phylogenetic tree construction; • Statistical evaluation of phylogenetic trees; • Introduction to Phylogenomy; • Introduction to Lateral Gene Transfer.

Within the field of phylogenetic reconstruction and taxonomy there have been two different ways and two different philosophies to the process of reconstructing a phylogeny. One approach is the phenetic approach. In this approach, a tree is constructed by considering the phenotypic similarities of the species without trying to understand the evolutionary pathways of the species. Since a tree constructed by this method does not necessarily reflect evolutionary relationships but rather is designed to represent phenotypic similarity, trees constructed via this method are called phenograms. A phylogenetic tree based on such information is often termed a dendrogram (a branching order that may or may not be the correct phylogeny). The second approach is called the cladistic approach. Via these methods, a tree is reconstructed by considering the various possible pathways of evolution and choosing from amongst these the best possible tree. Trees reconstructed via these methods are called cladograms. The phenetic philosophy as a way to do taxonomy is definitely incorrect. However, this does not mean that phenetic methods are necessarily poor estimates of the cladogram. For character data where ancestral forms are known and to construct a taxonomic classification the cladistic approach is almost certainly superior. However, the cladistic methods are often difficult to implement with assumptions that are not always satisfied with molecular data. The phenetic approaches are generally faster algorithms and often have nicer statistical properties for molecular data. Hence, there appears to be a place for both types of methods in the analysis of molecular sequence data.

Examples of phylogenetic trees

Pace (2001) described a tree of life based on small subunit rRNA sequences. Pace, N. R. (1997) Science 276. This tree shows the three main branches described by Woese and colleagues. It is referred to as the tree of life or the universal tree.

Chlamydiae Fig. 1. Phylogeny of chlamydiae. 16S rRNA-based neighbor-joining tree showing the affiliation of environmental and pathogenic chlamydiae with major bacterial phyla. Arrow, to outgroup. Scale bar, 10% estimated evolutionary distance. Science 304:

Eukaryotes (Baldauf et al., 2000)

[Figure: tree of five yeast species (S. cerevisiae, C. glabrata, K. lactis, D. hansenii, Y. lipolytica) with numbered events 1 to 4. Annotations: map dispersion; few duplicated blocks, many tandem repeats; few duplicated blocks; reductive evolution; duplicated gene loss; massive duplication; genome size control; MAT and centromeres.]
Dujon et al. Nature. 2004; 430:35-44.

Chen et al. NAR 34: D363-D368 (2006)

[Figure: original version versus actual (current) versions of a duplicated gene.]
Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.

Homologs - Paralogs - Orthologs
Homologs: A1, B1, A2, B2
Paralogs: A1 vs B1 and A2 vs B2
Orthologs: A1 vs A2 and B1 vs B2
[Figure: over time, an ancestral gene is duplicated into A and B; after speciation, Species-1 (S1) carries A1 and B1 and Species-2 (S2) carries A2 and B2. Sequence analysis compares the present-day genes.]

Molecular evolution

GACGACCATAGACCAGCATAG
GACTACCATAGA-CTGCAAAG
*** ******** * *** **

GACGACCATAGACCAGCATAG
GACTACCATAGACT-GCAAAG
*** *********  *** **

Two possible positions for the indel

Molecular Phylogenetic Analysis
Study of evolutionary relationships between genes and species. • The actual pattern of evolutionary history is the phylogeny or evolutionary tree which we try to estimate. • A tree is a mathematical structure which is used to model the actual evolutionary history of a group of sequences or organisms.

Molecular Phylogeny Analysis
• Specifying the history of gene evolution is one of the most important aims of the current study of molecular evolution; • Molecular phylogeny methods allow, from a given set of aligned sequences, the suggestion of phylogenetic trees (inferred trees) which aim at reconstructing the history of successive divergence which took place during the evolution, between the considered sequences and their common ancestor. These trees may not be the same as the true tree; • Reconstruction of phylogenetic trees is a statistical problem, and a reconstructed tree is an estimate of a true tree with a given topology and given branch length; • The accuracy of this estimation should be statistically established; • In practice, phylogenetic analyses usually generate phylogenetic trees with accurate parts and imprecise parts.

Nucleotide, amino-acid sequences
-GGAGCCATATTAGATAGA-
-GGAGCAATTTTTGATAGA-

Gly Ala Ile Leu Asp Arg
Gly Ala Ile Phe Asp Arg

• Three DNA positions differ but only one amino-acid position differs: two of the nucleotide substitutions are therefore synonymous and one is non-synonymous.
• DNA yields more phylogenetic information than proteins. The nucleotide sequences of a pair of homologous genes have a higher information content than the amino-acid sequences of the corresponding proteins, because mutations that result in synonymous changes alter the DNA sequence but do not affect the amino-acid sequence.
• Amino-acid sequences, on the other hand, are aligned more efficiently.
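The synonymous/non-synonymous count above can be checked with a short sketch. It assumes the standard genetic code and two in-frame sequences of equal length; the function name `count_codon_differences` is hypothetical, and classifying every nucleotide difference within a codon by whether the codon's amino acid changes is a simplification that is adequate here because each differing codon carries a single substitution.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_differences(seq1, seq2):
    """Return (synonymous, non_synonymous) nucleotide differences.

    Simplifying assumption: all differences inside one codon are counted
    as synonymous if the encoded amino acid is unchanged, otherwise as
    non-synonymous.
    """
    syn = nonsyn = 0
    for i in range(0, len(seq1) - len(seq1) % 3, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        ndiff = sum(a != b for a, b in zip(c1, c2))
        if ndiff == 0:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += ndiff
        else:
            nonsyn += ndiff
    return syn, nonsyn


# The two sequences from the slide (flanking dashes removed).
print(count_codon_differences("GGAGCCATATTAGATAGA", "GGAGCAATTTTTGATAGA"))  # (2, 1)
```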

Phenetics and Cladistics
Phenetics (Michener and Sokal, 1957): pheneticists argued that classifications should encompass as many variable characters as possible, these characters being analysed by rigorous mathematical methods. Such methods (e.g. distance-based methods) place greater emphasis on the relationships among data sets than on the paths they have taken to arrive at their current states.
Cladistics (Hennig, 1966): emphasizes the need for large data sets but differs from phenetics in that it does not give equal weight to all characters. Cladists are generally more interested in evolutionary pathways than in relationships (e.g. maximum parsimony).

Key features of phylogenetic trees
[Figure: an unrooted tree with external nodes A, B, C and D, internal nodes (hypothetical ancestors) and branches, followed by the five corresponding rooted trees, numbered 1 to 5.]
The unrooted tree is only an illustration of the relationships between A, B, C and D and does not tell us anything about the series of evolutionary events that led to these genes. Five evolutionary pathways are possible, each depicted by a different rooted tree. To distinguish between them, the phylogenetic analysis must include at least one outgroup, this being a homologous gene that we know is less closely related to A, B, C and D than these four genes are to each other. The outgroup enables the root of the tree to be located and the correct evolutionary pathway to be identified.

Rooted and Unrooted trees
• An important distinction in phylogenetics is between trees that make an inference about a common ancestor and the direction of evolution, and those that do not.
[Figure: two trees relating A, B, C and D.]
• In rooted trees a single node is designated as a common ancestor, and a unique path leads from it through evolutionary time to any other node.
• Unrooted trees only specify the relationships between nodes and say nothing about the direction in which evolution occurred.
• Roots can usually be assigned to unrooted trees through the use of an outgroup.

Key features of phylogenetic trees
The numbers of possible rooted ($N_R$) and unrooted ($N_U$) trees for $n$ sequences are given by:

$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$

$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

[Table: n, NR, NU]
• Note that only one of all the possible trees can be the true tree that represents the phylogenetic relationships among the sequences.
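The two formulas can be evaluated directly to see how quickly the number of candidate trees grows. This is a small sketch; the function names `n_rooted` and `n_unrooted` are illustrative and not part of the original slides.

```python
from math import factorial


def n_rooted(n):
    """Number of rooted bifurcating trees for n >= 2 sequences."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def n_unrooted(n):
    """Number of unrooted bifurcating trees for n >= 3 sequences."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


# Print n, NR and NU for small data sets.
for n in range(3, 11):
    print(n, n_rooted(n), n_unrooted(n))
```

For four sequences this gives 15 rooted and 3 unrooted trees; for ten sequences there are already more than two million unrooted trees, which is why exhaustive searches quickly become impractical.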

Gene tree - Species tree
[Figure: a species tree (Species A to E, branching at speciation events) next to a gene tree (Gene A to E, branching at mutation events).]
The two kinds of event, mutation and speciation, are not expected to occur at the same time, so a gene tree does not necessarily represent the species tree.

Gene tree - Species tree
[Figure: a species tree for A, B and C compared with a gene tree over time, showing a duplication event and speciation events. Source: Genomes, 2nd edition, T.A. Brown.]

Tree construction: how to proceed?
1. Consider the set of sequences to analyse;
2. Align these sequences "properly";
3. Apply phylogenetic tree-making methods;
4. Evaluate statistically the obtained phylogenetic tree.
Methodology:
1- Multiple alignment;
2- Bootstrapping;
3- Consensus tree construction and evaluation.

Alignment is an essential preliminary to tree construction

GACGACCATAGACCAGCATAG
GACTACCATAGA-CTGCAAAG
*** ******** * *** **

GACGACCATAGACCAGCATAG
GACTACCATAGACT-GCAAAG
*** *********  *** **

Two possible positions for the indel
• If errors in indel placement are made in a multiple alignment, then the tree reconstructed by phylogenetic analysis is unlikely to be correct.

Alignment and Gaps
• The quality of the alignment is essential: each column of the alignment (site) is supposed to contain homologous residues (nucleotides, amino acids) that derive from a common ancestor. ==> Unreliable parts of the alignment must be omitted from further phylogenetic analysis.
• Most methods take only substitutions into account; gaps (insertion/deletion events) are not used. ==> Gap-containing sites are ignored.
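Ignoring gap-containing sites amounts to deleting every alignment column in which at least one sequence has a gap. A minimal sketch, assuming the alignment is a list of equal-length strings and `-` is the gap symbol (the function name `drop_gap_columns` is hypothetical):

```python
def drop_gap_columns(alignment, gap="-"):
    """Return the alignment with every gap-containing column removed."""
    n_sites = len(alignment[0])
    # Keep only the column indices where no sequence has a gap.
    keep = [i for i in range(n_sites) if all(seq[i] != gap for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]


print(drop_gap_columns(["GACGACCATAGACCAGCATAG",
                        "GACTACCATAGA-CTGCAAAG"]))
```

Many programs apply this "complete deletion" by default or as an option; stricter cleaning of unreliable regions needs additional criteria that this sketch does not cover.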

Steps in Multiple Sequence Alignments
A common strategy of several popular multiple sequence alignment algorithms is to: 1- generate a pairwise distance matrix based on all possible pairwise alignments between the sequences being considered; 2- use a statistically based approach to construct an initial tree; 3- realign the sequences progressively in order of their relatedness according to the inferred tree; 4- construct a new tree from the pairwise distances obtained in the new multiple alignment; 5- repeat the process if the new tree is not the same as the previous one. Given that similar sequences can be aligned both more easily and with greater confidence, the alignment of multiple sequences should take into consideration the branching order of the sequences being studied. Sequences are generally added one at a time to the growing multiple alignment with the most related sequences being added first and the least related being added last. It is increasingly common, however, for analyses of the sequences themselves to be the way in which phylogenetic relationships are determined. In those cases, an integrated approach is generally adopted that simultaneously generates an alignment and a phylogeny. This approach typically requires many rounds of phylogenetic analysis and sequence alignment.

Procedure
• An efficient procedure consists of aligning the amino-acid sequences and using the resulting alignment as a template for the corresponding nucleotide sequences. Alignment is then guaranteed at the codon level.
1. Alignment of a family of protein sequences using clustalW;
2. Alignment of the corresponding DNA sequences, using as a template the amino-acid alignment obtained in step 1.
Note: clean the multiple alignment of gaps common to the majority of the sequences considered.

Example

SPO2.113.dna
>KLLA-IPF5339
ATGGCTCCACCTACGAAAATACTCGGTCTTGACACTCAGCAGAGAATGCTTCAACGTGGT
GAAAATTGCAGTTTAAAGTCTCTGGTACAGAATGAATGTGCTTTTAATGGTAATGACTAT
GTGTGTACGCCTTTCAAAAGACTATTTGAACAATGCATGGTGAAGGATGGACGTGTATTA
AACATTGAGGTAACAAATCTGAACACCAACAGATGA
>KLWA-IPF3854
ATGGCACCGCCCACAGTAGTGTTTGGCAAAGAGGAACTAGAGCCTCTCTTGCGCAATGTT
ATGGCGACGTGTATCTTCAAGTCTCTGACTCAAAGCGAATGCAACTTTGACGGTCATCAA
TATGTTTGTGTACCTTTCAAGAGGGTGTTCAAAGAATGCAAGGTGGATGGGAAATCAATC
AGAATAGAGGTGACAGATAGAAACACCAACAAGGCAAAAGCTGATGAGATGGTTGACAGT
TTCTGGAATTCCCGAAAGTCATTTACACGGAATTGA

SPO2.113.pep
>KLLA-IPF5339
MAPPTKILGLDTQQRMLQRGENCSLKSLVQNECAFNGNDYVCTPFKRLFEQCMVKDGRVL
NIEVTNLNTNR
>KLWA-IPF3854
MAPPTVVFGKEELEPLLRNVMATCIFKSLTQSECNFDGHQYVCVPFKRVFKECKVDGKSI
RIEVTDRNTNKAKADEMVDSFWNSRKSFTRN

Alignment of aa sequences:
>KLLA-IPF5339
MAPPTKILGLDTQQRMLQ-RGENCSLKSLVQNECAFNGNDYVCTPFKRLFEQCMVKDGRVLNIEVTNLNTNR
>KLWA-IPF3854
MAPPTVVFGKEELEPLLRNVMATCIFKSLTQSECNFDGHQYVCVPFKRVFKECKV-DGKSIRIEVTDRNTNKAKADEMVDSFWNSRKSFTRN

Using corresponding dna sequences:
>KLLA-IPF5339
ATGGCTCCACCTACGAAAATACTCGGTCTTGACACTCAGCAGAGAATGCTTCAACGTGGTGAAAATTGCAGTTTAAAGTCTCTGGTACAGAATGAATGTGCTTTTAATGGTAATGACTATGTGTGTACGCCTTTCAAAAGACTATTTGAACAATGCATGGTGAAGGATGGACGTGTATTAAACATTGAGGTAACAAATCTGAACACCAACAGATGA
>KLWA-IPF3854
ATGGCACCGCCCACAGTAGTGTTTGGCAAAGAGGAACTAGAGCCTCTCTTGCGCAATGTTATGGCGACGTGTATCTTCAAGTCTCTGACTCAAAGCGAATGCAACTTTGACGGTCATCAATATGTTTGTGTACCTTTCAAGAGGGTGTTCAAAGAATGCAAGGTGGATGGGAAATCAATCAGAATAGAGGTGACAGATAGAAACACCAACAAGGCAAAAGCTGATGAGATGGTTGACAGTTTCTGGAATTCCCGAAAGTCATTTACACGGAATTGA

Construct corresponding dna alignment:
>KLLA-IPF5339
ATGGCTCCACCTACGAAAATACTCGGTCTTGACACTCAGCAGAGAATGCTTCAA---CGTGGTGAAAATTGCAGTTTAAAGTCTCTGGTACAGAATGAATGTGCTTTTAATGGTAATGACTATGTGTGTACGCCTTTCAAAAGACTATTTGAACAATGCATGGTGAAGGATGGACGTGTATTAAACATTGAGGTAACAAATCTGAACACCAACAGA
>KLWA-IPF
ATGGCACCGCCCACAGTAGTGTTTGGCAAAGAGGAACTAGAGCCTCTCTTGCGCAATGTTATGGCGACGTGTATCTTCAAGTCTCTGACTCAAAGCGAATGCAACTTTGACGGTCATCAATATGTTTGTGTACCTTTCAAGAGGGTGTTCAAAGAATGCAAGGTG---GATGGGAAATCAATCAGAATAGAGGTGACAGATAGAAACACCAACAAGGCAAAAGCTGATGAGATGGTTGACAGTTTCTGGAATTCCCGAAAGTCATTTACACGGAAT
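The construction used in this example (threading the unaligned DNA onto the aligned protein, one codon per residue and three gap characters per protein gap) can be sketched as follows. The function name `thread_dna_onto_protein` is hypothetical; the sketch assumes the DNA starts in frame with the protein and simply drops any trailing codons (such as the stop codon) that have no residue in the protein alignment.

```python
def thread_dna_onto_protein(aligned_protein, dna, gap="-"):
    """Build a codon-level DNA alignment from an aligned protein sequence."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    out = []
    k = 0  # index of the next unused codon
    for residue in aligned_protein:
        if residue == gap:
            out.append(gap * 3)      # one protein gap = three nucleotide gaps
        else:
            out.append(codons[k])    # copy the codon encoding this residue
            k += 1
    return "".join(out)


# First 20 aligned residues and first 60 nucleotides of KLLA-IPF5339.
print(thread_dna_onto_protein(
    "MAPPTKILGLDTQQRMLQ-RG",
    "ATGGCTCCACCTACGAAAATACTCGGTCTTGACACTCAGCAGAGAATGCTTCAACGTGGT"))
```

A production tool would also verify that each codon actually translates to the aligned residue; that check is left out here for brevity.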

Phylogenetic tree construction methods
• A phylogenetic tree is characterised by its topology (form) and its length (sum of its branch lengths) ; • Each node of a tree is an estimation of the ancestor of the elements included in this node;

There are three main families of Methods: • Parsimony • Distance Methods • Maximum likelihood Methods

Methods directly based on sequences : • Maximum Parsimony : find a phylogenetic tree that explains the data, with as few evolutionary changes as possible. • Maximum likelihood : find a tree that maximizes the probability of the genetic data given the tree. Methods indirectly based on sequences : • Distance based methods (Neighbour Joining (NJ)): find a tree such that branch lengths of paths between sequences (species) fit a matrix of pairwise distances between sequences.

Parsimony The concept of parsimony is at the heart of all character-based methods of phylogenetic reconstruction. The 2 fundamental ideas of biological parsimony are: 1- mutations are exceedingly rare events (?) ; 2- the more unlikely events a model invokes, the less likely the model is to be correct. As a result, the relationship that requires the fewest number of mutations to explain the current state of the sequences being considered, is the relationship that is most likely to be correct.

Parsimony Informative and Uninformative Sites:
For a parsimony approach, the positions of a multiple sequence alignment fall into two categories in terms of their information content: those that carry information (informative) and those that do not (uninformative).
Example:

Position  1 2 3 4 5 6
seq 1     G G G G G G
seq 2     G G G A G T
seq 3     G G A T A G
seq 4     G A T C A T

• Position 1 is said to be invariant and is therefore uninformative, because all trees invoke the same number of mutations (0);
• Position 2 is uninformative because 1 mutation occurs in all three possible trees;
• Position 3 likewise, because 2 mutations occur in every tree;
• Position 4 requires 3 mutations in all possible trees;
• Positions 5 and 6 are informative, because one of the trees invokes only one mutation and the other 2 alternative trees both require 2 mutations.
In general, for a position to be informative regardless of how many sequences are aligned, it has to have at least 2 different nucleotides, and each of these nucleotides has to be present at least twice.
Krane & Raymer 2002
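The mutation counts quoted for the six positions can be reproduced with a small sketch. It applies the informative-site rule stated above and counts changes on each of the three possible four-sequence trees with a Fitch-style set intersection/union pass; the function names and the `TREES` dictionary are illustrative, not taken from Krane & Raymer.

```python
from collections import Counter


def is_parsimony_informative(column):
    """At least two different states, each present at least twice."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def quartet_cost(column, pairing):
    """Minimum number of changes for one four-sequence tree.

    pairing gives the two pairs of sequence indices that are neighbours,
    e.g. ((0, 1), (2, 3)) for the tree grouping seq 1 with seq 2.
    """
    cost = 0
    sets = []
    for i, j in pairing:
        a, b = {column[i]}, {column[j]}
        s = a & b
        if not s:            # neighbours disagree: one change on this side
            s = a | b
            cost += 1
        sets.append(s)
    if not (sets[0] & sets[1]):  # the two sides share no state: one more change
        cost += 1
    return cost


TREES = {
    "(1,2),(3,4)": ((0, 1), (2, 3)),
    "(1,3),(2,4)": ((0, 2), (1, 3)),
    "(1,4),(2,3)": ((0, 3), (1, 2)),
}
columns = ["GGGG", "GGGA", "GGAT", "GATC", "GGAA", "GTGT"]  # positions 1 to 6

for pos, col in enumerate(columns, start=1):
    costs = [quartet_cost(col, t) for t in TREES.values()]
    print(pos, costs, is_parsimony_informative(col))
```

Only positions 5 and 6 give different costs for different trees, which is exactly what makes them useful for choosing between topologies.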

[Figure: for each of the six alignment positions, the three possible unrooted trees for sequences 1 to 4, with the bases at the tips and the inferred mutations.]
1- Position 1 of the multiple alignment: all four sequences have the same base "G" and the position is said to be invariant. Invariant sites are uninformative, because each of the three possible trees invokes exactly the same number of mutations (0).
2- Position 2 is uninformative from the parsimony perspective because one mutation occurs in all three of the possible trees.
3- Position 3 is uninformative because all trees require two mutations.
4- Position 4 is uninformative because all three trees require three mutations.
In contrast, positions 5 and 6 are both informative because, for both of them, one of the three trees invokes only one mutation and the other two alternative trees both require two.
Informative sites are those that allow one of the trees to be distinguished from the other two on the basis of how many mutations they must invoke. In general, for a position to be informative regardless of how many sequences are aligned, it has to have at least two different nucleotides, and each of these nucleotides has to be present at least twice. All parsimony programs begin by applying this fairly simple rule to the data set being analysed. Notice that four of the six positions in the alignment shown are simply discarded and not considered any further in a parsimony analysis. All of those sites would have contributed to the pairwise similarity scores used by a distance-based approach, and this difference alone can generate substantial differences in the conclusions reached by the two types of approach. Krane & Raymer, 2002.

Maximum Parsimony (Fitch, 1977)
Parsimony criterion consists of determining the minimum number of changes (substitutions) required to transform a sequence to its nearest neighbor. The maximum parsimony algorithm searches for the minimum number of genetic events (nucleotide substitutions or amino-acid changes) to infer the most parsimonious tree from a set of sequences. The best tree is the one which needs the fewest changes. Problems : 1. within practical computational limits, this often leads to the generation of tens or more "equally most parsimonious trees" which makes it difficult to justify the choice of a particular tree ; 2. long computation time is needed to construct a tree.

Maximum Parsimony (Fitch, 1977),...
The Maximum parsimony method takes account of information pertaining to character variation in each position of the sequences multiple alignment, to recreate the series of nucleotide changes. The assumption, possibly erroneous, is that evolution follows the shortest possible route and that the correct phylogenetic tree is therefore the one that requires the minimum number of nucleotide changes to produce the observed differences between the sequences. Trees are therefore constructed at random and the nucleotide changes that they involve, calculated until all possible topologies have been examined and the one requiring the smallest number of steps identified. This is presented as the most likely inferred tree.

Distance Methods
• Each phylogenetic tree induces a matrix of distances between sequence pairs.
• Conversely, a given (additive) distance matrix corresponds to a single phylogenetic tree.

Constructing Phylogenetic trees using Distance Methods
a) Sequence alignment;
b) Matrix of evolutionary distances between pairs of sequences;
c) Distance methods fit a tree to this matrix.
Di,j = the distance between sequences i and j.
[Figure: tree with leaves i, j, k and m and branch lengths li, lj, lk, lm, lc and lr; d(i,m) = li + lc + lr + lj]

Constructing Phylogenetic trees using Distance Methods
$D_{ij}$ = the distance between sequences i and j; $d_{ij}$ = the sum of the branch lengths on the tree path from i to j.
• The phylogeny estimates the distance for each pair as the sum of the branch lengths on the path from one sequence to the other through the tree.
• A measure of how close the tree is to D is given by the least-squares criterion:

$\sum_{i,j} \frac{(D_{ij} - d_{ij})^2}{D_{ij}^2}$

• The phylogenetic tree topology is constructed using a cluster analysis method (such as the NJ method).
Advantages:
1. easy to perform;
2. fast calculation;
3. suitable for sequences with high similarity scores.
Drawbacks:
1. all sites are generally treated equally (differences in substitution rates are not taken into account);
2. not applicable to distantly related sequences;
3. some of the information is lost, particularly that pertaining to the identities of the ancestral and derived nucleotides at each position in the multiple alignment.
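The least-squares criterion is straightforward to compute once the observed distances D and the tree path lengths d are available. In this sketch both are dictionaries keyed by sequence pair; the function name and the numbers are made up for illustration only.

```python
def least_squares_fit(observed, tree_paths):
    """Sum over pairs of (D_ij - d_ij)^2 / D_ij^2; smaller means a better fit."""
    return sum((observed[pair] - tree_paths[pair]) ** 2 / observed[pair] ** 2
               for pair in observed)


# Hypothetical observed distances D and tree path lengths d for three sequences.
D = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}
d = {("A", "B"): 2.0, ("A", "C"): 3.0, ("B", "C"): 5.0}
print(least_squares_fit(D, d))  # 0.125
```

Dividing by the squared observed distance gives less weight to large distances, which are usually estimated with more error.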

Example of distance measure: Kimura's two-parameter distance (DNA)
• Hypotheses of the model:
a) All sites evolve independently and follow the same process.
b) Substitutions occur according to two probabilities: one for transitions, one for transversions. Transitions: G <-> A or C <-> T. Transversions: all other changes.
c) The base substitution process is constant in time.
• Quantification of the evolutionary distance (d) as a function of the fractions of observed differences (p: transitions, q: transversions):

$d = -\frac{1}{2}\ln\left[(1 - 2p - q)\sqrt{1 - 2q}\right]$

Kimura (1980) J. Mol. Evol. 16:111
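A minimal sketch of the distance calculation, assuming the standard form of the Kimura (1980) estimate, d = -(1/2) ln[(1 - 2p - q) sqrt(1 - 2q)], where p and q are the fractions of sites showing a transition and a transversion. The function name `k2p_distance` is hypothetical, and sites with a gap or an ambiguous base in either sequence are simply skipped.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    n = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue                      # skip gaps and ambiguous bases
        n += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1              # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    p, q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


print(k2p_distance("GACGACCATAGACCAGCATAG", "GACTACCATAGA-CTGCAAAG"))
```

The estimate is undefined when the sequences are so divergent that the argument of the logarithm is not positive; the sketch does not guard against that case.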

Neighbor-Joining method (Saitou & Nei 1987)
[Figure: a star-like tree joining sequences A to H, and the same star after two of the sequences have been attached to a second internal node.]
To begin the reconstruction, it is initially assumed that there is just one internal node from which branches leading to all the DNA sequences radiate in a star-like pattern. Next, a pair of sequences is chosen at random, removed from the star, and attached to a second internal node, connected by a branch to the center of the star. The distance matrix is used to calculate the total branch length in this new "tree". The sequences are then returned to their original positions and another pair is attached to the second internal node, and again the total branch length is calculated. This operation is repeated until all the possible pairs have been examined, enabling the combination that gives the tree with the shortest total branch length to be identified. This pair of sequences will be neighbors in the final tree; in the interim, they are combined into a single unit, creating a new star with one branch fewer than the original one. The whole process of pair selection and tree-length calculation is now repeated so that a second pair of neighboring sequences is identified, and then repeated again so that a third pair is located, and so on. The result is a complete reconstructed tree.
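The procedure described above can be sketched compactly. Instead of explicitly rebuilding the star for every pair, this version uses the usual Q-matrix form of the pair-selection criterion, which selects the same pair as minimizing the total branch length. The function name, the internal node labels (`node0`, `node1`, ...) and the five-sequence distance matrix are illustrative assumptions, not data from the slides.

```python
def neighbor_joining(names, dist):
    """Return the tree as a list of (node, neighbour, branch_length) edges.

    dist is a dict of dicts holding symmetric pairwise distances.
    """
    D = {a: dict(dist[a]) for a in names}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {a: sum(D[a][b] for b in nodes if b != a) for a in nodes}
        # Choose the pair minimizing Q(a, b) = (n - 2) D(a, b) - r(a) - r(b).
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = nodes[x], nodes[y]
                q = (n - 2) * D[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        la = 0.5 * D[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        lb = D[a][b] - la
        edges += [(a, u, la), (b, u, lb)]
        # Distances from the new node to every remaining node.
        D[u] = {}
        for c in nodes:
            if c not in (a, b):
                D[u][c] = D[c][u] = 0.5 * (D[a][c] + D[b][c] - D[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, D[a][b]))
    return edges


# Hypothetical additive distance matrix for five sequences.
example_names = ["a", "b", "c", "d", "e"]
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
         ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
example_dist = {x: {} for x in example_names}
for (x, y), v in pairs.items():
    example_dist[x][y] = example_dist[y][x] = float(v)

example_edges = neighbor_joining(example_names, example_dist)
for edge in example_edges:
    print(edge)
```

Because the example matrix is additive, the path lengths in the returned tree reproduce the input distances exactly; with real data the fit is only approximate.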

Maximum likelihood
This approach is a purely statistical method. Probabilities are considered for every individual nucleotide substitution in a sequence alignment.
[Figure: three aligned sequences carrying C, T and G at the same site.]
For example, since transitions (exchanging a purine for a purine or a pyrimidine for a pyrimidine) are observed roughly 3 times as often as transversions (exchanging a purine for a pyrimidine or vice versa), it can reasonably be argued that the sequences with C and T are more likely to be closely related to each other than either is to the sequence with G.
• Calculation of the probabilities is complicated by the fact that the sequence of the common ancestor of the sequences considered is unknown.
• Furthermore, multiple substitutions may have occurred at one or more sites, and all sites are not necessarily independent or equivalent.
Still, objective criteria can be applied to calculate the probability for every site and for every possible tree that describes the relationships of the sequences in a multiple alignment.

Maximum likelihood
According to this method, the bases (nucleotides or amino acids) of all sequences at each site are considered separately (as independent), and the log-likelihood of observing these bases is computed for a given topology using a particular probability model. The log-likelihoods are summed over all sites, and the sum is maximized to estimate the branch lengths of the tree. This procedure is repeated for all possible topologies, and the topology that shows the highest likelihood is chosen as the final tree.
Notes:
1. This is the best justified method from a theoretical viewpoint;
2. ML estimates the branch lengths of the final tree;
3. ML methods are usually consistent;
4. Sequence simulation experiments have shown that this method works better than all others in most cases.
Drawback: a long computation time is needed to construct a tree.
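The idea of summing per-site log-likelihoods and maximizing over a branch length can be illustrated on the smallest possible case: two sequences joined by a single branch. The sketch below assumes the Jukes-Cantor (JC69) substitution model, which is not mentioned in the slides and is chosen only because its probabilities have a simple closed form; the function names and the grid search are likewise illustrative.

```python
import math


def jc69_site_loglik(a, b, d):
    """Log-likelihood of one site under JC69 for branch length d."""
    e = math.exp(-4.0 * d / 3.0)
    # Probability that base a is replaced by base b over distance d.
    p_ab = 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e
    return math.log(0.25 * p_ab)  # 0.25 = equilibrium frequency of base a


def loglik(seq1, seq2, d):
    """Sites are treated as independent, so their log-likelihoods add up."""
    return sum(jc69_site_loglik(a, b, d) for a, b in zip(seq1, seq2))


def ml_distance(seq1, seq2, dmax=2.0, steps=2000):
    """Crude grid search for the branch length with the highest likelihood."""
    grid = [dmax * (i + 1) / steps for i in range(steps)]
    return max(grid, key=lambda d: loglik(seq1, seq2, d))


s1, s2 = "GGAGCCATATTAGATAGA", "GGAGCAATTTTTGATAGA"
print(ml_distance(s1, s2))
```

For this model the maximum can also be written analytically as -(3/4) ln(1 - 4p/3), with p the fraction of differing sites, so the grid search should land close to 0.188 for these two sequences. For real trees the same summation is done over all sites for each candidate topology, with numerical optimization of every branch length.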

The choice of the outgroup
• Most phylogenetic methods construct unrooted trees.
• It is best to root such trees on biological grounds.
• The most widely used technique consists of including in the data set to be analysed a sequence (the outgroup) that is related to the sequences considered without belonging to the same family.
• The aim is to normalize the branches of the unrooted tree relative to the length of the branch leading to the outgroup.

Evaluation of different methods
• None of the previous methods of phylogenetic reconstruction guarantees that it yields the one true tree that describes the evolutionary history of a set of aligned sequences.
• There is at present no statistical method allowing comparison of trees obtained from different phylogenetic methods; nevertheless, many attempts have been made to compare the relative consistency of the existing methods.
• The consistency depends on many factors, including the topology and branch lengths of the real tree, the transition/transversion rate and the variability of the substitution rates.
• In practice, one infers phylogeny between sequences that do not generally meet the specified hypotheses.
• One expects that, if sequences have strong phylogenetic relationships, different methods will result in the same phylogenetic tree.

Statistical evaluation of the obtained phylogenetic tree
• The accuracy depends on the multiple sequence alignment considered;
• ML estimates branch lengths, their degree of significance and their confidence limits;
• At present, only resampling techniques allow the topology of a phylogenetic tree to be tested.
Bootstrapping: it consists of drawing columns from the aligned sequences, with replacement, until one gets a data set of the same size as the original one (usually some columns are sampled several times and others are left out).

Bootstrapping
• Constructs a new multiple alignment at random from the real alignment, with the same size. Note that the same column can be sampled more than once, and consequently some columns are not sampled.

ATAGCCATA
ATACCCATG
ATACCCATA
ATCCCCCAT

TCAAATGCA
TCGAATCCA
TCAAATCCA
TCAACACCC
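Column resampling with replacement takes only a few lines. In this sketch the alignment is a list of equal-length strings; the function name `bootstrap_alignment` and the optional `rng` argument (used to make a run reproducible) are illustrative choices.

```python
import random


def bootstrap_alignment(alignment, rng=None):
    """Return one bootstrap replicate: columns drawn with replacement."""
    rng = rng or random.Random()
    n_sites = len(alignment[0])
    # Draw as many column indices as there are sites; repeats are allowed.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


alignment = ["ATAGCCATA", "ATACCCATG", "ATACCCATA", "ATCCCCCAT"]
for row in bootstrap_alignment(alignment, random.Random(1)):
    print(row)
```

In a full analysis this is repeated many times (for example 100 replicates), a tree is built from each replicate, and the support of an internal branch is the percentage of replicate trees that contain it.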

Properties of Bootstrap procedure
• Internal branches supported by ≥ 90% of replicates are considered statistically significant.
• The bootstrap procedure only detects whether the sequence length is sufficient to support a particular node.
• The bootstrap procedure does not help determine whether the tree-building method is good. A wrong tree can have 100% bootstrap support for all its branches!

Methodology
1. Consider the set of sequences to analyse;
2. Align these sequences "properly";
3. Apply phylogenetic tree-making methods;
4. Evaluate statistically the obtained phylogenetic tree.

1- Multiple alignment;
2- Bootstrapping (100 samples);
3- Consensus tree construction and evaluation.

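Tree and sequence simulation experiment
[Figure: comparison of tree-building programs on simulated data. Legend: P, PHYML; F, fastDNAml; L, NJML; D, DNAPARS; N, NJ. Settings: 5000 random trees; 40 taxa, 500 bases; no molecular clock; varying tree length; K2P, a = 2. Source: Manolo Gouy.]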

Introduction to Phylogenomy

• Species tree construction; • The phylogeny of single genes may be different from the phylogeny of their corresponding species; • Idea of considering many genes instead of only one gene to estimate species phylogeny; • Concatenation of the set of genes common to the considered species; • Estimate species phylogeny from the concatenated genes; • Some difficulties related to this procedure.
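The concatenation step can be sketched as follows, assuming each gene has already been aligned separately and is stored as a mapping from species name to aligned sequence. The function name, the gene and species labels and the toy sequences are all hypothetical.

```python
def concatenate_shared_genes(gene_alignments):
    """Concatenate the alignments of genes present in every species.

    gene_alignments maps gene name -> {species name: aligned sequence}.
    Returns (concatenated alignment per species, list of genes used).
    """
    all_species = set().union(*(aln.keys() for aln in gene_alignments.values()))
    # Keep only the genes found in all species, in a fixed (sorted) order.
    shared = sorted(g for g, aln in gene_alignments.items()
                    if set(aln) == all_species)
    concatenated = {sp: "".join(gene_alignments[g][sp] for g in shared)
                    for sp in sorted(all_species)}
    return concatenated, shared


genes = {
    "g1": {"S1": "ACG", "S2": "ACT", "S3": "AGT"},
    "g2": {"S1": "TT", "S2": "TA"},              # missing in S3: not shared
    "g3": {"S1": "GG", "S2": "GC", "S3": "CC"},
}
supermatrix, used = concatenate_shared_genes(genes)
print(used)
print(supermatrix)
```

The example also shows one of the difficulties of the approach: every gene that is missing from even a single species is discarded, so the shared set shrinks quickly as more species are added.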

Problems with species tree construction
• The main difficulties in species tree construction include extensive incongruence between alternative phylogenies generated from single-gene data sets. Genes do not evolve at the same rate or in the same way: the evolutionary history inferred from one gene, say an rRNA gene, may differ from what another gene appears to show.

• "phylogenomic tree" (based on concatenation of a gene sample common to the considered species S1 to Sn);
• genes don't evolve at the same rate or in the same way;
• a limited number of genes are shared among all species;

These methods suffer difficulties related to the phylogenetic tree construction:
• global sequence alignment (quality, gaps,...); • substitution variations between genes; • different evolutionary histories of genes; • substitution saturation;...

Horizontal/Lateral Gene Transfer (HGT/LGT)
• Genome comparisons (particularly bacterial) show that during evolution a significant number of genes were laterally transferred from one species to another;
• Transferred genes are very difficult to detect as such.

Lateral gene transfer and the nature of bacterial innovation
Howard Ochman, Jeffrey G. Lawrence and Eduardo A. Groisman (2000) Nature 405, Unlike eukaryotes, which evolve principally through the modification of existing genetic information, bacteria have obtained a significant proportion of their genetic diversity through the acquisition of sequences from distantly related organisms. Horizontal gene transfer produces extremely dynamic genomes in which substantial amounts of DNA are introduced into and deleted from the chromosome. These lateral transfers have effectively changed the ecological and pathogenic character of bacterial species.

Plus denotes the presence and minus the absence of a trait in more than 85% of strains. Evolutionary relationships among species are based on nucleotide sequence information. In many cases, genes acquired by horizontal transfer confer the species-specific traits.

Lengths of bars denote the amount of protein-coding DNA. For each bar, the native DNA is blue; foreign DNA identifiable as mobile elements, including transposons and bacteriophages, is yellow, and other foreign DNA is red. The percentage of foreign DNA is noted to the right of each bar. 'A' denotes an Archaeal genome.

Rujan & Martin: Trends Genet 2001 Mar;17(3):113-20.
Lateral gene transfer - what a problem for phylogenetics! How lateral gene transfer between prokaryotes subsequent to the origins of organelles can lead to erroneous inferences of eukaryotic gene origins. In the lower panel, a case of lateral gene transfer (LGT) is depicted as described in the text. The mechanism of LGT sketched here is intended to mean conjugation, but many mechanisms of lateral gene transfer are known and for the purposes of the figure, the mechanism is irrelevant. In the upper panel, the tree that would be constructed from those sequences is shown: although the plant obtained its gene from a cyanobacterium, LGT makes it look as though it came from Bacillus. As outlined in the text, there is a fine line that separates inferences drawn from phylogenetic data analysis and the evolutionary process itself. Pinning down the role of lateral gene transfer is a very tough problem.

Evolution by Domains/Motifs
Using MEME/MAST programs

Simple case: SPO5.11

SPO10.135
[Figure labels: Expansion / Degradation; 5, 7, 12; ancestor?]

[Figure labels: 3, 9, 2, 1; 5, 6, 8, 4; P5.2096; P5.2063; ancestral part?]

References
Phylogeny programs:
• MEGA
• PAML
Books:
• Fundamental Concepts of Bioinformatics. Dan E. Krane and Michael L. Raymer
• Genomes, 2nd edition. T.A. Brown
• Molecular Evolution: A Phylogenetic Approach. Page, RDM and Holmes, EC. Blackwell Science
• Manolo Gouy
