Bas E. Dutilh Phylogenomics Using complete genomes to determine the phylogeny of species.

1 Bas E. Dutilh Phylogenomics Using complete genomes to determine the phylogeny of species

2 Tree of life Bacteria Archaea Eukaryota

3 Evolution What we can see are the present-day species Offspring looks like its parents Mutations –Phenotype –Genotype Nature selects: survival of the fittest

4 Phenotype Which properties to compare? Watanabe's Ugly Duckling Theorem: “All things have an infinite number of features. So any two things share an infinite number of features. Therefore two things cannot be of the same kind because they share more features than they do with things of a different kind.”

5 Evolution

6 Genotype Genome sequence is finite and you do not have to choose Genetic properties –Word frequency –Sequence (nt/aa) –Gene content –Gene order

7 Why sequence similarity works Every residue (nt/aa) is a separate dimension –Human: 3 billion nucleotides Most mutations are …  Sequences never converge

8 Evolution: mutation and selection Mutation is responsible for changes Selection is responsible for continuity The more differences, the more distantly related two sequences are Contrary to structure or phenotype, sequences do not converge

9 Phylogenetics Inferring the evolution of a gene Distance matrix: hierarchical clustering Maximum likelihood: evaluate likelihood of all possible trees

10 Substitution matrix Describes the rate at which one character in a sequence changes to other character states BLOck SUbstitution Matrix (BLOSUM) is based on observed substitutions between proteins with e.g. >62% sequence identity

11 Neighbour joining
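A minimal sketch of neighbour joining may help here, since the slide only names the method. It follows the standard Saitou and Nei procedure: repeatedly join the pair that minimises the Q criterion, then replace the pair with a new internal node. The function name `neighbour_joining`, the `frozenset` keyed distance dictionary, and the five-taxon distances are illustrative assumptions, not taken from the slides.

```python
def neighbour_joining(names, dist):
    """Return (joins, last_edge) produced by neighbour joining.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Each join is (node_a, node_b, new_node, branch_a, branch_b).
    """
    d = dict(dist)
    active = list(names)
    joins = []
    while len(active) > 2:
        n = len(active)
        # Sum of distances from each active node to all other active nodes.
        total = {a: sum(d[frozenset((a, b))] for b in active if b != a)
                 for a in active}
        # Pick the pair that minimises the Q criterion.
        best = None
        for i, a in enumerate(active):
            for b in active[i + 1:]:
                q = (n - 2) * d[frozenset((a, b))] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        d_ab = d[frozenset((a, b))]
        branch_a = 0.5 * d_ab + (total[a] - total[b]) / (2 * (n - 2))
        branch_b = d_ab - branch_a
        new = f"node{len(joins)}"
        # Distances from the new internal node to every remaining node.
        for c in active:
            if c not in (a, b):
                d[frozenset((new, c))] = 0.5 * (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - d_ab)
        active = [c for c in active if c not in (a, b)] + [new]
        joins.append((a, b, new, branch_a, branch_b))
    # The two remaining nodes are connected by a single final edge.
    return joins, (active[0], active[1], d[frozenset(active)])


# Illustrative toy distances (assumption, not data from the presentation).
taxa = ["a", "b", "c", "d", "e"]
matrix = [[0, 5, 9, 9, 8],
          [5, 0, 10, 10, 9],
          [9, 10, 0, 8, 7],
          [9, 10, 8, 0, 3],
          [8, 9, 7, 3, 0]]
toy = {frozenset((taxa[i], taxa[j])): matrix[i][j]
       for i in range(len(taxa)) for j in range(i)}
joins, last_edge = neighbour_joining(taxa, toy)
```

With these toy distances the first join groups `a` and `b`. Ties in the Q criterion are broken by iteration order in this sketch, so later joins can differ between equally valid implementations.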

12 Maximum likelihood Make all possible trees Calculate the likelihood that the alignment evolved along each tree; the tree with the highest likelihood is the maximum likelihood tree Very computer intensive PhyML searches “around” a starting tree (e.g. NJ)

13 Maximum parsimony Parsimony is a special case of likelihood The tree with the smallest number of mutations is the maximum parsimony tree
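Counting the minimum number of mutations on a given tree can be sketched with Fitch's algorithm, one common way to score a tree under parsimony. The nested-tuple tree encoding, the function names and the four short sequences below are assumptions made for illustration only.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree:   a leaf name (str) or a 2-tuple of subtrees.
    states: dict mapping leaf name to the character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Children can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # Children disagree: one change is required on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Sum the Fitch cost over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites))


# Illustrative alignment (assumption, not data from the presentation).
alignment = {"A": "AAG", "B": "AAG", "C": "CTG", "D": "CTA"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
length_1 = parsimony_length(tree_1, alignment)
length_2 = parsimony_length(tree_2, alignment)
```

Here `tree_1` needs fewer changes than `tree_2`, so it would be preferred under maximum parsimony. A full search would score every candidate topology this way.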

14 SSU rRNA Present in all species Constant function Slowly evolving Fox et al, Science 1980

15 SSU rRNA The phylogeny of SSU rRNA revealed the three domains: Bacteria, Archaea, Eukaryota Representative for the evolutionary history of species Olsen et al, J Bacteriol 1994

16 Conflict between trees based on single genes Unrecognized paralogy Horizontal gene transfer Mutation saturation, biases, divergent rates Different genes tell different stories [Figure: gene tree from an ancestor to spec A, spec B and spec C, marking orthologs and paralogs]

17 Is a tree the right representation? Genomes are chimeras with genes from different origins –Endosymbiosis (mitochondrion, chloroplast) –Horizontal gene transfer (many examples, often adaptations to environment)

18 More data = more consistent trees Combine information from more genes to average out these anomalies Complete genomes contain the maximum phylogenetic information

19 Fungi Yeasts, filamentous and dimorphic fungi Fungi are the eukaryotic clade with largest number of completely sequenced genomes S. cerevisiae is a well studied model organism Much consensus about phylogeny

20 Consensus phylogeny (literature) 19 target nodes

21

22 Orthology Which genes to compare between species –Homologs (originated “de novo”) –Orthologs (originated at speciation) Orthology has higher resolution –Pairwise orthology –Cluster orthology –Tree-based orthology [Figure: gene tree from an ancestor to spec A, spec B and spec C]

23 Pairwise orthology (Inparanoid) Compare all proteins in species A to all proteins in species B to find homologs Find bi-directional best hit All proteins closer than bi-directional best hit are (in-) paralogs
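The bi-directional best hit step and the in-paralog rule can be sketched as follows. This is a simplified illustration of the idea on the slide, not the Inparanoid implementation; the score dictionaries, protein names and function names are hypothetical.

```python
def best_hits(scores):
    """Map each query protein to its highest scoring hit."""
    return {query: max(hits, key=hits.get)
            for query, hits in scores.items() if hits}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


def in_paralogs(seed, bbh_score, within_species):
    """Same-species proteins closer to the seed than its ortholog is."""
    return sorted(p for p, score in within_species.get(seed, {}).items()
                  if score > bbh_score)


# Hypothetical similarity scores (higher means more similar).
a_vs_b = {"a1": {"b1": 300, "b2": 120},
          "a2": {"b1": 250, "b2": 100},
          "a3": {"b2": 90}}
b_vs_a = {"b1": {"a1": 300, "a2": 250},
          "b2": {"a1": 120, "a3": 90}}
a_vs_a = {"a1": {"a2": 400, "a3": 50}}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
paralogs_of_a1 = in_paralogs("a1", a_vs_b["a1"]["b1"], a_vs_a)
```

In this toy input `a2` scores higher against `a1` than `a1` does against its ortholog `b1`, so `a2` is treated as an in-paralog of `a1`.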

24 Cluster orthology (COG) First group in-paralogs in every species Find bi-directional best hits between in-paralogous groups Join in-paralogs to orthologous groups –Link all pairs of in-paralogous groups –Only if link is confirmed by third species (triangle)

25 Tree based orthology Phylogenetic tree of homologs Find gene duplication nodes Two homologous genes are orthologs if last common ancestor is not a duplication node but a speciation node
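The rule on this slide can be sketched on a gene tree whose internal nodes are already labelled as duplication or speciation events. How those labels are obtained (for example by reconciliation with a species tree) is outside this sketch. The tuple encoding `(event, left, right)` and the gene names are assumptions for illustration.

```python
from itertools import product


def leaves(node):
    """List the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def ortholog_pairs(node):
    """Gene pairs whose last common ancestor is a speciation node."""
    if isinstance(node, str):
        return set()
    event, left, right = node
    pairs = ortholog_pairs(left) | ortholog_pairs(right)
    if event == "speciation":
        # Genes on opposite sides of this node have it as their last
        # common ancestor.
        pairs |= {frozenset(p) for p in product(leaves(left), leaves(right))}
    return pairs


# Hypothetical gene tree: an ancient duplication followed by speciations.
gene_tree = ("duplication",
             ("speciation", "A_g1", ("speciation", "B_g1", "C_g1")),
             ("speciation", "A_g2", "B_g2"))
orthologs = ortholog_pairs(gene_tree)
```

Pairs such as `A_g1` and `B_g2` are separated by the duplication node, so under this rule they count as paralogs rather than orthologs.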

26

27 Gene content methods Presence/absence matrix (0/1)

       OG1  OG2  OG3  OG4  …
sp1     1    1    0    1   …
sp2     0    1    0    0   …
sp3     0    0    1    1   …
…       …    …    …    …   …

Similarity: number of shared orthologous groups –Genomes that share few OGs are distantly related –Genomes that share many OGs are closely related but…
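Counting shared orthologous groups from a presence/absence matrix can be sketched as below. Only the four columns visible on the slide (OG1 to OG4) are used; the elided columns are left out, and the function name `shared_ogs` is an assumption.

```python
from itertools import combinations

# Rows from the slide, restricted to the four columns that are shown.
presence = {
    "sp1": [1, 1, 0, 1],
    "sp2": [0, 1, 0, 0],
    "sp3": [0, 0, 1, 1],
}


def shared_ogs(presence):
    """Number of orthologous groups present in both genomes of each pair."""
    return {(a, b): sum(x and y for x, y in zip(presence[a], presence[b]))
            for a, b in combinations(presence, 2)}


shared = shared_ogs(presence)
```

On these four columns sp1 shares one group with sp2 and one with sp3, while sp2 and sp3 share none.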

28 Genome size correction Large genomes have more genes, so they also share more genes Divide number of shared genes by –Average genome size –Smallest of two genomes –Weighted average genome size Korbel et al, Trends Genet 2002 [Figure: # shared genes plotted against genome size; P. chrysosporium labelled]
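The three normalisations listed on this slide can be compared with a small sketch. The slide does not state how the weighted average is weighted, so the `weight_a` parameter below is a hypothetical placeholder and not the weighting of Korbel et al. The gene counts in the example are also made up.

```python
def corrected_similarity(shared, size_a, size_b, method="smallest",
                         weight_a=0.5):
    """Shared gene count divided by a genome-size term.

    method: "average", "smallest" or "weighted".
    weight_a: hypothetical weight on genome A for the "weighted" option;
              the weighting used in the literature is not given on the slide.
    """
    if method == "average":
        denominator = (size_a + size_b) / 2
    elif method == "smallest":
        denominator = min(size_a, size_b)
    elif method == "weighted":
        denominator = weight_a * size_a + (1 - weight_a) * size_b
    else:
        raise ValueError(f"unknown method: {method}")
    return shared / denominator


# Made-up example: 1000 shared genes, genomes of 2000 and 6000 genes.
by_average = corrected_similarity(1000, 2000, 6000, method="average")
by_smallest = corrected_similarity(1000, 2000, 6000, method="smallest")
by_weighted = corrected_similarity(1000, 2000, 6000, method="weighted",
                                   weight_a=0.75)
```

Dividing by the smallest genome gives the highest similarity for a small genome nested inside a large one, which is why the choice of denominator matters for the resulting tree.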

29 Gene content methods Similarity: corrected number of shared genes Distance: (1 – similarity) Neighbour joining

dist(spA, spB) = 1 – (# shared OGs(spA, spB) / weighted average size(spA, spB))

Similarity matrix s:
       sp1  sp2  sp3  sp4  …
sp1     1   0.2  0.4  0.2  …
sp2          1   0.9  0.1  …
sp3               1   0.3  …
sp4                    1   …
…                          …

Distance matrix d:
       sp1  sp2  sp3  sp4
sp1     0
sp2    0.8   0
sp3    0.6  0.1   0
sp4    0.8  0.9  0.7   0
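The conversion from similarity to distance on this slide can be checked with a short sketch that uses the similarity values shown for sp1 to sp4. The function names are assumptions, and rounding to one decimal is only there to hide floating-point noise.

```python
# Pairwise similarities as shown on the slide.
similarity = {
    ("sp1", "sp2"): 0.2, ("sp1", "sp3"): 0.4, ("sp1", "sp4"): 0.2,
    ("sp2", "sp3"): 0.9, ("sp2", "sp4"): 0.1, ("sp3", "sp4"): 0.3,
}


def to_distance(similarity, ndigits=1):
    """Distance = 1 - similarity for every species pair."""
    return {pair: round(1 - s, ndigits) for pair, s in similarity.items()}


def gene_content_distance(shared_ogs, genome_size_term):
    """1 - (# shared OGs / genome-size term), as in the slide's formula."""
    return 1 - shared_ogs / genome_size_term


distance = to_distance(similarity)
```

The resulting values match the distance matrix on the slide, for example 0.8 for sp1 and sp2 and 0.1 for sp2 and sp3. A matrix of this kind is the input to neighbour joining.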

30 Dollo parsimony –Gaining a complex character (gene) is rare and happens once –Losing it is relatively easy –Minimize the number of gene losses for maximum parsimony Gene content methods
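For a single gene on a fixed tree, the Dollo idea can be sketched as: place the one gain at the last common ancestor of all species that have the gene, then count one loss for every maximal subtree below it that lacks the gene. The nested-tuple tree and the species names are illustrative assumptions.

```python
def count_present(tree, present):
    """Number of leaves below this node that carry the gene."""
    if isinstance(tree, str):
        return int(tree in present)
    return sum(count_present(child, present) for child in tree)


def count_losses(tree, present):
    """Losses needed below a node that is known to carry the gene."""
    if count_present(tree, present) == 0:
        return 1  # the whole subtree lacks the gene: one loss on its stem
    if isinstance(tree, str):
        return 0
    return sum(count_losses(child, present) for child in tree)


def dollo_losses(tree, present):
    """Losses implied by a single gain at the last common ancestor."""
    total = count_present(tree, present)
    if total == 0:
        return 0
    node = tree
    # Descend while one child still contains every species with the gene.
    while not isinstance(node, str):
        holders = [c for c in node if count_present(c, present) == total]
        if not holders:
            break
        node = holders[0]
    return count_losses(node, present)


# Hypothetical species tree.
species_tree = (("A", "B"), (("C", "D"), "E"))
losses_acd = dollo_losses(species_tree, {"A", "C", "D"})
losses_cd = dollo_losses(species_tree, {"C", "D"})
losses_ce = dollo_losses(species_tree, {"C", "E"})
```

Summing this count over all orthologous groups gives a score for one candidate tree; a Dollo parsimony search would prefer the tree with the lowest total.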

31 Superalignment methods Multiple alignment Concatenate alignments (1:1:1) A missing gene in a certain species (row) can be seen as a gap in the alignment
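Concatenation with gaps for missing genes can be sketched as follows. The input format (one dictionary per gene family, mapping species to an aligned sequence) and the short sequences are assumptions for illustration.

```python
def concatenate_alignments(gene_alignments, species):
    """Join per-gene alignments into one row per species.

    A species missing from a gene family gets a run of gap characters
    as long as that family's alignment.
    """
    rows = {sp: [] for sp in species}
    for alignment in gene_alignments:
        length = len(next(iter(alignment.values())))
        for sp in species:
            rows[sp].append(alignment.get(sp, "-" * length))
    return {sp: "".join(parts) for sp, parts in rows.items()}


# Made-up alignments for two gene families; sp2 lacks the second gene.
gene_1 = {"sp1": "MKT", "sp2": "MRT", "sp3": "MKS"}
gene_2 = {"sp1": "LLV-", "sp3": "LIVA"}
superalignment = concatenate_alignments([gene_1, gene_2],
                                        ["sp1", "sp2", "sp3"])
```

The row for sp2 ends in four gap characters, which is how a missing gene appears in the superalignment.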

32 Superdistance methods Combine distance matrices from separate gene families, e.g. average
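Averaging distance matrices across gene families can be sketched as below. A pair is averaged only over the families in which both species occur, which is one reasonable choice and an assumption of this sketch. The distances are made up.

```python
def average_distances(matrices):
    """Average each pairwise distance over the matrices that contain it."""
    sums, counts = {}, {}
    for matrix in matrices:
        for pair, value in matrix.items():
            key = frozenset(pair)
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


# Made-up distances from two gene families; species C lacks family 2's
# distance to A.
family_1 = {("A", "B"): 0.2, ("A", "C"): 0.6, ("B", "C"): 0.5}
family_2 = {("A", "B"): 0.4, ("B", "C"): 0.7}
superdistance = average_distances([family_1, family_2])
```

The combined matrix can then be passed to a distance-based method such as neighbour joining.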

33 Supertree methods Make phylogenetic trees for all gene families separately Matrix Representation using Parsimony (MRP)
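The matrix used by MRP can be sketched with the usual coding: every clade of every source tree becomes one binary character, scored 1 for species inside the clade, 0 for species elsewhere in that tree, and ? for species absent from that tree. The tree encoding and the two small source trees are assumptions; the parsimony search on the resulting matrix is not shown.

```python
def clades(tree):
    """Return (leaf set, clades of all internal nodes at or below tree)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaf_set, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaf_set |= child_leaves
        found += child_clades
    found.append(leaf_set)
    return leaf_set, found


def mrp_matrix(trees, species):
    """One character string per species, one column per source-tree clade."""
    columns = []
    for tree in trees:
        all_leaves, found = clades(tree)
        for clade in found:
            if clade == all_leaves:
                continue  # the root clade carries no grouping information
            columns.append({sp: "1" if sp in clade
                            else "0" if sp in all_leaves
                            else "?"
                            for sp in species})
    return {sp: "".join(column[sp] for column in columns) for sp in species}


# Two made-up gene-family trees with partly overlapping species sets.
source_trees = [(("A", "B"), ("C", "D")), (("A", "C"), "E")]
matrix = mrp_matrix(source_trees, ["A", "B", "C", "D", "E"])
```

The first two columns come from the clades (A, B) and (C, D) of the first tree and the third from (A, C) in the second, so conflicting groupings end up as conflicting characters for the parsimony step to weigh.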

34 [Figure panels: 13 trees, 14 trees, 15 trees, 12 trees]

35 Gene content vs. sequence based Gene content supertrees differ from sequence based supertrees

36 Consensus phylogeny (literature) 19 target nodes

37 Gene content 10.38 Low-dimensional compared to genotype Intermediate between genotype and phenotype –Main dichotomy between yeasts and filamentous Fungi, not Ascomycota and Basidiomycota –Dimorphic Basidiomycota exclude filamentous P. chrysosporium

38 Superalignment 18.21 Supertree 17.50 Sequence based trees agree better with literature Literature is dominated by sequence based trees

39 Hyperthermophiles

40 Nanoarchaeum Placed in Nanoarchaeota (Waters et al. PNAS 2003; Di Giulio, J Theor Biol 2006), Crenarchaeota (Ciccarelli et al. Science 2006) or Euryarchaeota (Brochier et al. Genome Biol 2005) [Figure: gene content tree with Cren and Eury clades]

41 Assignment Make a gene content tree Compare with other phylogenetic trees Describe the differences –Can you find literature that specifically studies these species? –What do you think is going on? Why are the trees different? Write a paper about some of your most interesting findings, include references www.cmbi.ru.nl/edu/seminars

