Bas E. Dutilh
Phylogenomics
Using complete genomes to determine the phylogeny of species

Tree of life
Bacteria, Archaea, Eukaryota

Evolution
What we can see are the present-day species
Offspring look like their parents
Mutations
–Phenotype
–Genotype
Nature selects: survival of the fittest

Phenotype
Which properties to compare?
Watanabe's Ugly Duckling Theorem: “All things have an infinite number of features. So any two things share an infinite number of features. Therefore two things cannot be of the same kind because they share more features than they do with things of a different kind.”

Evolution
Genotype
Genome sequence is finite and you do not have to choose
Genetic properties
–Word frequency
–Sequence (nt/aa)
–Gene content
–Gene order

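Word frequency, the first genetic property listed above, can be illustrated with a minimal sketch. The function name `word_frequencies` and the default word length `k = 3` are illustrative assumptions, not taken from the slides.

```python
from collections import Counter


def word_frequencies(seq, k=3):
    """Return the relative frequency of every overlapping k-letter word in seq."""
    seq = seq.upper()
    # Slide a window of width k over the sequence, one position at a time.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Toy nucleotide sequence (made up for illustration).
freqs = word_frequencies("ATGATGCC", k=3)
print(freqs)
```

Two genomes could then be compared through the difference between their frequency vectors, without any alignment.
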
Why sequence similarity works
Every residue (nt/aa) is a separate dimension
–Human: 3 billion nucleotides
Most mutations are …
Sequences never converge

Evolution: mutation and selection
Mutation is responsible for changes
Selection is responsible for continuity
The more differences, the more distantly related two sequences are
Contrary to structure or phenotype, sequences do not converge

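The idea that more differences mean a more distant relationship can be made concrete as the fraction of differing positions between two aligned sequences (often called the p-distance). This is a minimal sketch; the function name and the choice to skip gap columns are assumptions.

```python
def p_distance(a, b, gap="-"):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    diffs = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # ignore columns where either sequence has a gap
        compared += 1
        diffs += x != y
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return diffs / compared


# Toy aligned sequences (made up): one difference in five compared columns.
print(p_distance("ACGT-A", "ACCTTA"))  # 0.2
```

Computing this for every pair of sequences fills a distance matrix that a clustering method can turn into a tree.
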
Phylogenetics
Inferring the evolution of a gene
Distance matrix
–Hierarchical clustering
Maximum likelihood
–Evaluate likelihood of all possible trees

Substitution matrix
Describes the rate at which one character in a sequence changes to other character states
BLOck SUbstitution Matrix (BLOSUM) is based on substitutions observed in aligned blocks of related proteins; for BLOSUM62, sequences that are 62% or more identical are clustered and counted as one sequence

Neighbour joining
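
As an illustration of how neighbour joining chooses which two taxa to join, the sketch below carries out only the first step: it computes the Q criterion for every pair, picks the minimum, and derives the two branch lengths to the new node. The function name and the toy matrix are assumptions; a full implementation would then shrink the matrix and repeat.

```python
def nj_first_join(names, d):
    """Pick the first pair joined by neighbour joining.

    names: list of taxon labels; d: symmetric distance matrix (list of lists).
    Returns (taxon_i, taxon_j, branch_length_i, branch_length_j).
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least three taxa")
    totals = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q criterion: small for pairs that are close to each other
            # but far from everything else.
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    length_i = d[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    length_j = d[i][j] - length_i
    return names[i], names[j], length_i, length_j


# Toy additive distance matrix for five taxa (made up for illustration).
names = ["a", "b", "c", "d", "e"]
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_first_join(names, d))  # ('a', 'b', 2.0, 3.0)
```
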
Maximum likelihood
Make all possible trees
Calculate the likelihood of the alignment for each tree
The tree with the highest likelihood is the maximum likelihood tree
Very computer intensive
PhyML searches “around” starting tree (e.g. NJ)

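To see why evaluating all possible trees is so computer intensive, one can count them. For n taxa the number of unrooted, strictly bifurcating trees is the product of the odd numbers 3 · 5 · … · (2n − 5). The function name below is an assumption; the formula is the standard one.

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees for n_taxa labelled leaves."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Ten taxa already give more than two million topologies, which is why heuristic searches from a starting tree are used in practice.
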
Maximum parsimony
Parsimony is a special case of likelihood
The tree with the smallest number of mutations is the maximum parsimony tree

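Counting the minimum number of mutations a given tree requires can be sketched with Fitch's algorithm, shown here for a single alignment column. The slides do not name an algorithm, so Fitch is a swapped-in choice; the nested-tuple tree encoding and the species labels are assumptions.

```python
def fitch_score(tree, states):
    """Fitch parsimony for one alignment column on a rooted binary tree.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping leaf name -> character state in this column.
    Returns (possible_ancestral_states, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch_score(left, states)
    right_states, right_changes = fitch_score(right, states)
    common = left_states & right_states
    if common:
        # Children can agree on a state: no extra change needed here.
        return common, left_changes + right_changes
    # Children disagree: take the union and count one change.
    return left_states | right_states, left_changes + right_changes + 1


column = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(fitch_score(tree1, column)[1])  # 1
print(fitch_score(tree2, column)[1])  # 2
```

Summing the score over all columns and comparing trees gives the maximum parsimony tree; here tree1 explains the column with fewer mutations than tree2.
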
SSU rRNA
Present in all species
Constant function
Slowly evolving
Fox et al, Science 1980

SSU rRNA
Phylogeny of SSU rRNA revealed the three domains: Bacteria, Archaea, Eukaryota
Representative of the evolutionary history of species
Olsen et al, J Bacteriol 1994

Different genes tell different stories
Conflict between trees based on single genes
–Unrecognized paralogy
–Horizontal gene transfer
–Mutation saturation, biases, divergent rates
[Figure: gene tree from an ancestor to spec A, spec B and spec C, marking orthologs and paralogs]

Is a tree the right representation?
Genomes are chimeras with genes from different origins
–Endosymbiosis (mitochondrion, chloroplast)
–Horizontal gene transfer (many examples, often adaptations to environment)

More data = more consistent trees
Combine information from more genes to average out these anomalies
Complete genomes contain the maximum phylogenetic information

Fungi
Yeasts, filamentous and dimorphic fungi
Fungi are the eukaryotic clade with the largest number of completely sequenced genomes
S. cerevisiae is a well-studied model organism
Much consensus about phylogeny

Consensus phylogeny (literature) 19 target nodes
Orthology
Which genes to compare between species
–Homologs (originated “de novo”)
–Orthologs (originated at speciation)
Orthology has higher resolution
–Pairwise orthology
–Cluster orthology
–Tree-based orthology
[Figure: gene tree from an ancestor to spec A, spec B and spec C]

Pairwise orthology (Inparanoid)
Compare all proteins in species A to all proteins in species B to find homologs
Find bi-directional best hit
All proteins closer than the bi-directional best hit are (in-)paralogs

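The bi-directional best hit step can be sketched as follows. This is not the Inparanoid implementation: the input format (a dictionary of similarity scores per query/subject pair), the function names and the protein labels are all assumptions, and the in-paralog step is left out.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject protein.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy similarity scores between two proteins of species A and two of species B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b1"): 180, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 200, ("b1", "a2"): 180, ("b2", "a1"): 150, ("b2", "a2"): 90}
print(bidirectional_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1')]
```

In this toy example a2 and b2 both point at the a1/b1 pair but are not each other's best hit, so only a1 and b1 are called orthologs.
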
Cluster orthology (COG)
First group in-paralogs in every species
Find bi-directional best hits between in-paralogous groups
Join in-paralogs to orthologous groups
–Link all pairs of in-paralogous groups
–Only if link is confirmed by third species (triangle)

Tree-based orthology
Phylogenetic tree of homologs
Find gene duplication nodes
Two homologous genes are orthologs if their last common ancestor is not a duplication node but a speciation node

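The rule above can be sketched on a small gene tree. The slides do not say how duplication nodes are found; the sketch assumes the simple species-overlap heuristic (a node is a duplication if the same species occurs on both sides). The `species|gene` leaf labels and the function names are also assumptions.

```python
def species_of(leaf):
    """Species part of a leaf label written as 'species|gene'."""
    return leaf.split("|")[0]


def leaves(tree):
    """All leaf labels below a node (a leaf is a str, a node is a pair)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def ortholog_pairs(tree):
    """Gene pairs whose last common ancestor is a speciation node."""
    if isinstance(tree, str):
        return set()
    left, right = tree
    pairs = ortholog_pairs(left) | ortholog_pairs(right)
    left_leaves, right_leaves = leaves(left), leaves(right)
    left_species = {species_of(x) for x in left_leaves}
    right_species = {species_of(x) for x in right_leaves}
    # Species-overlap heuristic: shared species on both sides => duplication.
    is_duplication = bool(left_species & right_species)
    if not is_duplication:
        pairs |= {tuple(sorted((x, y))) for x in left_leaves for y in right_leaves}
    return pairs


# Toy gene tree: an ancient duplication (root) followed by speciations.
gene_tree = ((("A|g1", "B|g1"), "C|g1"), (("A|g2", "B|g2"), "C|g2"))
for pair in sorted(ortholog_pairs(gene_tree)):
    print(pair)
```

Genes g1 and g2 are separated by the duplication at the root, so no g1/g2 pair is reported; within each copy, all pairs across species are orthologs.
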
Gene content methods
Presence/absence matrix (0/1)
[Table: species in rows, orthologous groups OG1, OG2, OG3, OG4, … in columns]
Similarity: number of shared orthologous groups
–Genomes that share few OGs are distantly related
–Genomes that share many OGs are closely related
but…

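Counting shared orthologous groups from a presence/absence matrix can be sketched like this. The matrix values and species labels are made up for illustration, and the function name is an assumption.

```python
from itertools import combinations


def shared_og_counts(matrix):
    """Number of orthologous groups shared by every pair of species.

    matrix: dict mapping species -> list of 0/1 flags, one per OG,
    with the OGs in the same order for every species.
    """
    return {
        (a, b): sum(x and y for x, y in zip(matrix[a], matrix[b]))
        for a, b in combinations(sorted(matrix), 2)
    }


# Toy presence/absence matrix; columns are OG1..OG4.
matrix = {
    "sp1": [1, 1, 1, 0],
    "sp2": [1, 1, 0, 0],
    "sp3": [0, 1, 1, 1],
}
print(shared_og_counts(matrix))
```

These raw counts still depend on genome size, which is the "but…" that the correction on the next slide addresses.
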
Genome size correction
Large genomes have more genes, so they also share more genes
Divide number of shared genes by
–Average genome size
–Smallest of two genomes
–Weighted average genome size
[Figure: # shared genes against genome size; P. chrysosporium]
Korbel et al, Trends Genet 2002

Gene content methods
Similarity: corrected number of shared genes
Distance: (1 – similarity)
$$\mathrm{dist}(\mathrm{spA}, \mathrm{spB}) = 1 - \frac{\#\,\text{shared OGs}\,(\mathrm{spA}, \mathrm{spB})}{\text{weighted average size}\,(\mathrm{spA}, \mathrm{spB})}$$
[Table: distance matrix between sp1, sp2, sp3, sp4, …]
Neighbour joining

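A minimal sketch of the corrected gene content distance follows. Genome size is taken here as the number of orthologous groups present in a genome, which is an assumption. Only the "smallest" and "average" corrections are implemented, because the slides do not define the weighting behind the weighted average; the function and parameter names are hypothetical.

```python
def gene_content_distance(ogs_a, ogs_b, normalise="smallest"):
    """Distance = 1 - (# shared OGs / genome size correction).

    ogs_a, ogs_b: sets of orthologous group identifiers present in each genome.
    normalise: "smallest" (smaller of the two genomes) or "average".
    """
    shared = len(ogs_a & ogs_b)
    if normalise == "smallest":
        size = min(len(ogs_a), len(ogs_b))
    elif normalise == "average":
        size = (len(ogs_a) + len(ogs_b)) / 2
    else:
        raise ValueError("normalise must be 'smallest' or 'average'")
    if size == 0:
        raise ValueError("at least one genome has no orthologous groups")
    return 1 - shared / size


# Toy genomes (made up): a small one and a larger one sharing two OGs.
small = {"OG1", "OG2", "OG3", "OG4"}
large = {"OG3", "OG4", "OG5", "OG6", "OG7", "OG8"}
print(gene_content_distance(small, large, "smallest"))  # 0.5
print(gene_content_distance(small, large, "average"))
```

The resulting pairwise distances can be collected in a matrix and passed to neighbour joining.
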
Gene content methods
Dollo parsimony
–Gaining a complex character (gene) is rare and happens once
–Losing it is relatively easy
–Minimize the number of gene losses for maximum parsimony

Superalignment methods
Multiple alignment
Concatenate alignments (1:1:1)
A missing gene in a certain species (row) can be seen as a gap in the alignment

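Concatenating per-gene alignments, with a missing gene filled in as gaps, can be sketched as follows. The input format (one dictionary per gene family, mapping species to its aligned sequence) and the function name are assumptions.

```python
def concatenate_alignments(alignments, species):
    """Join per-gene alignments into one superalignment row per species.

    alignments: list of dicts mapping species -> aligned sequence;
    all sequences within one gene alignment must have the same length.
    species: list of all species that should appear in the result.
    """
    rows = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for sp in species:
            # A species lacking this gene contributes a run of gap characters.
            rows[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(parts) for sp, parts in rows.items()}


# Toy alignments (made up): species B lacks the second gene.
gene1 = {"A": "MKT", "B": "MRT", "C": "MKS"}
gene2 = {"A": "LL-V", "C": "LIAV"}
for sp, row in concatenate_alignments([gene1, gene2], ["A", "B", "C"]).items():
    print(sp, row)
```
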
Superdistance methods
Combine distance matrices from separate gene families, e.g. by averaging

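Averaging distance matrices across gene families can be sketched as below. The sketch assumes each matrix is a dictionary keyed by a sorted species pair and that a pair missing from one gene family is simply left out of that pair's average; both choices are assumptions.

```python
def average_distance_matrices(matrices):
    """Average pairwise distances over several gene families.

    matrices: list of dicts mapping (species1, species2) -> distance,
    with each key written in sorted order. Pairs absent from a gene
    family do not contribute to that pair's average.
    """
    totals = {}
    counts = {}
    for matrix in matrices:
        for pair, dist in matrix.items():
            totals[pair] = totals.get(pair, 0.0) + dist
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Toy distances (made up): the second gene family lacks species C.
family1 = {("A", "B"): 0.2, ("A", "C"): 0.4, ("B", "C"): 0.5}
family2 = {("A", "B"): 0.4}
print(average_distance_matrices([family1, family2]))
```
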
Supertree methods
Make phylogenetic trees for all gene families separately
Matrix Representation with Parsimony (MRP)

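The matrix that MRP builds can be sketched as follows: every clade in every gene tree becomes one binary character, scored 1 for species inside the clade, 0 for other species in that tree, and ? for species missing from that tree. The nested-tuple tree encoding and the function names are assumptions, and the parsimony search on the resulting matrix is not shown.

```python
def leaf_set(tree):
    """Set of leaf names below a node (a leaf is a str, a node is a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every internal node below the root."""
    if isinstance(tree, str):
        return
    for child in tree:
        if not isinstance(child, str):
            yield leaf_set(child)
            yield from clades(child)


def mrp_matrix(trees):
    """Return dict species -> string of 1/0/? characters, one per clade."""
    taxa = sorted(set().union(*(leaf_set(t) for t in trees)))
    rows = {taxon: "" for taxon in taxa}
    for tree in trees:
        present = leaf_set(tree)
        for clade in clades(tree):
            for taxon in taxa:
                if taxon in clade:
                    rows[taxon] += "1"
                elif taxon in present:
                    rows[taxon] += "0"
                else:
                    rows[taxon] += "?"  # species absent from this gene tree
    return rows


# Toy gene trees (made up): the second one lacks species D.
trees = [(("A", "B"), ("C", "D")), (("A", "C"), "B")]
for taxon, row in mrp_matrix(trees).items():
    print(taxon, row)
```
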
[Figure panels: 13 trees, 14 trees, 15 trees, 12 trees]

Gene content vs. sequence based
Gene content supertrees differ from sequence based supertrees

Gene content 10.38
Low-dimensional compared to genotype
Intermediate between genotype and phenotype
–Main dichotomy between yeasts and filamentous Fungi, not Ascomycota and Basidiomycota
–Dimorphic Basidiomycota exclude filamentous P. chrysosporium

Superalignment / Supertree
Sequence based trees agree better with literature
Literature is dominated by sequence based trees

Hyperthermophiles
Nanoarchaeum
–Nanoarchaeota: Waters et al. PNAS 2003; Di Giulio, J Theor Biol 2006
–Crenarchaeota: Ciccarelli et al. Science 2006
–Euryarchaeota: Brochier et al. Genome Biol 2005
[Figure: gene content tree; labels Cren, Eury]

Assignment
Make a gene content tree
Compare with other phylogenetic trees
Describe the differences
–Can you find literature that specifically studies these species?
–What do you think is going on? Why are the trees different?
Write a paper about some of your most interesting findings, include references