Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Why do trees?

Phylogeny 101
OTUs: operational taxonomic units (species, populations, individuals)
Nodes
- internal (often ancestors)
- external (terminal; often living species or individuals)
Branches
- length scaled (length proportional to evolutionary distance)
- length unscaled (nominal, arbitrary)
Outgroup: an OTU that is most distantly related to all the other OTUs in the study. Choose the outgroup carefully.

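Phylogeny 102
Trees rooted: $N = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
Trees unrooted: $N = \dfrac{(2n-5)!}{2^{n-3}\,(n-3)!}$
Table columns: OTUs, #rooted trees, #unrooted trees (most entries are missing from this transcript; the surviving fragments are "…*10^6" and "8*10^21")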
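The two counting formulas on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and n is the number of OTUs.

```python
from math import factorial

def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n OTUs (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n OTUs (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

# Print OTUs, #rooted trees, #unrooted trees for a few sizes
for n in (4, 5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

For 4 OTUs this gives 3 unrooted trees, the figure used in the parsimony example later in the slides. The counts grow so fast that trying every tree stops being practical well before 20 OTUs.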

Four key aspects of a tree
- Topology
- Branch lengths
- Root
- Confidence
[Figure: a basic tree of four OTUs (A, B, C, D), redrawn to illustrate each aspect]

Methods
Distance matrix
- UPGMA
- Neighbour joining (NJ)
Maximum parsimony (MP)
- tree requiring fewest changes
Maximum likelihood (ML)
- most likely tree
Bayesian: sort of ML
- samples a large number of “pretty good” trees

Trees: distance matrix (UPGMA and NJ)
UPGMA: Unweighted Pair Group Method with Arithmetic means
- assumes a constant rate of evolution (molecular clock): don’t publish UPGMA trees
Neighbor joining is very fast
- often a “good enough” tree
- embedded in ClustalW
- use in publications only if there are too many taxa to compute with MP or ML

Distances from sequence
Use Phylip Protdist or DNAdist
D = non-identical residues / total sequence length
Correction for multiple hits is necessary, because a site that has changed more than once still shows up as at most one difference
- Jukes-Cantor assumes all substitutions are equally likely
- Kimura: transition rate is not equal to transversion rate; Ts usually > Tv
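A minimal sketch of the two corrections, using the standard Jukes-Cantor and Kimura 2-parameter formulas rather than anything taken from the Phylip source. The function names are illustrative. The input 60/536 is the uncorrected distance quoted in the mammal example of these slides.

```python
from math import log, sqrt

def jukes_cantor(p: float) -> float:
    """Corrected distance from the proportion p of differing sites."""
    return -0.75 * log(1 - 4 * p / 3)

def kimura_2p(P: float, Q: float) -> float:
    """Corrected distance from the proportions of transitions (P)
    and transversions (Q) among the compared sites."""
    return -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))

p = 60 / 536  # 60 differences in 536 bp
print(f"uncorrected {100 * p:.1f}%  Jukes-Cantor {100 * jukes_cantor(p):.1f}%")
```

The corrected value is always larger than the raw proportion, and the gap widens as sequences diverge. When transversions are exactly twice as frequent as transitions (all substitutions equally likely), the Kimura distance reduces to the Jukes-Cantor one.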

UPGMA: pencil and paper trees
Two steps
1. Find the smallest distance in the matrix
- cluster these 2 OTUs
- branch length = half the distance between the OTUs
2. Construct a new distance matrix
- replace the 2 OTUs with the cluster
- recalculate distances as the average of the values compared (always use original matrix values)
Iterate until one distance remains

Steps in detail
The UPGMA method involves the successive clustering of the most closely related pairs of species (or groups of species). UPGMA assumes that sequences have evolved with a perfect molecular clock; because of this the tree is automatically rooted. A two-step procedure is repeated:
Step 1: Look through the matrix for the smallest pairwise distance value, and join these two species (or groups of species) into a cluster. Calculate the branch length from the common ancestor to each species as one half of the distance between the two species (or groups). In later rounds, internal branch lengths are calculated by subtraction.
Step 2: Construct a new pairwise distance matrix, in which the new cluster replaces the two species (or groups) within it. Calculate the distance values from this cluster to the other species (or groups) as the average of the values for the species being compared.
Now return to step one, and repeat until the distance matrix contains only one value. At that point you can draw the final tree.
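The two-step loop can be sketched in plain Python. This is an illustration, not the Phylip implementation: the function name, the taxa A to D and their distances are all hypothetical. Cluster-to-cluster distances are averaged over the original pairwise values, as the slide recommends.

```python
from itertools import combinations

def upgma(names, dist):
    """dist maps frozenset({a, b}) to the ORIGINAL pairwise distance."""
    # cluster (tuple of member names) -> (Newick text so far, node height)
    clusters = {(n,): (n, 0.0) for n in names}

    def avg(c1, c2):
        # Average of all original distances between members of c1 and c2
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Step 1: smallest distance, branch length = half of it
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        height = avg(c1, c2) / 2
        (t1, h1), (t2, h2) = clusters.pop(c1), clusters.pop(c2)
        # Internal branch lengths by subtraction of the child's height
        text = f"({t1}:{height - h1:.2f},{t2}:{height - h2:.2f})"
        # Step 2: the new cluster replaces the two it contains
        clusters[c1 + c2] = (text, height)
        merges.append((c1 + c2, height))
    text, _ = next(iter(clusters.values()))
    return text + ";", merges

# Hypothetical clock-like distances for four made-up taxa
d = {frozenset(k): v for k, v in {
    ("A", "B"): 4, ("A", "C"): 8, ("B", "C"): 8,
    ("A", "D"): 12, ("B", "D"): 12, ("C", "D"): 12}.items()}
tree, merges = upgma(["A", "B", "C", "D"], d)
print(tree)
print(merges)
```

With these made-up distances the clusters form at heights 2, 4 and 6. The loop ends when a single cluster is left, which corresponds to the point where the pencil-and-paper matrix has shrunk to one value.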

Mammal dataset
1> Spectacled bear, Tremarctos ornatus
2> Giant panda, Ailuropoda melanoleuca
3> Red panda, Ailurus fulgens
4> Raccoon, Procyon lotor
5> Ocelot, Felis pardalis
mtDNA 16S rRNA, 536 bp compared
Numbers of nucleotide differences (above the diagonal), and percentage differences per site after correction for multiple hits by Jukes & Cantor's method (below the diagonal).
Addresses an old taxonomic puzzle: is the red panda a bear or a raccoon?

Distance matrix
Rows and columns: Bear, Giant panda, Red panda, Raccoon, Ocelot (the numeric entries are missing from this transcript)
Round 1: cluster Bear and Giant panda at 6.1 (rounded up to 1 decimal place)
Uncorrected distance: 60/536 = 11.2%

Round 2
New matrix: e.g. Bear+G.panda vs Red panda = (Bear vs Red panda + G.panda vs Red panda)/2
Rows and columns: Bear+G.panda, Red panda, Raccoon, Ocelot (numeric entries missing)
Round 2: cluster Red panda and Raccoon at 7.5

Round 3
Rows and columns: Bear+G.panda, R.panda+Raccoon, Ocelot (numeric entries missing)
Round 3: cluster (Bear+Giant panda) and (Red panda+Raccoon) at 8.2

Round 4 (final)
Rows and columns: Be+Gp+Rp+Ra, Ocelot (numeric entries missing)
Round 4: cluster (Bear+Giant panda+Red panda+Raccoon) and Ocelot at 9.7

TaDAAA: the tree
Internal branches by subtraction: … - … = … (height of the parent cluster minus height of the child cluster)
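The subtraction step can be worked through from the four cluster heights quoted in rounds 1 to 4 (6.1, 7.5, 8.2, 9.7). This sketch assumes those values are the node heights and that the truncated round summaries mean Red panda joins Raccoon in round 2 and Ocelot joins last, as the round 3 and round 4 matrix headers indicate. Variable names are illustrative.

```python
# Node heights read off the four clustering rounds
h_bear_gp, h_rp_rac, h_four, h_root = 6.1, 7.5, 8.2, 9.7

# Internal branch = height of parent node minus height of child node
internal = {
    "Bear+Giant panda": round(h_four - h_bear_gp, 1),
    "Red panda+Raccoon": round(h_four - h_rp_rac, 1),
    "Bear+Giant panda+Red panda+Raccoon": round(h_root - h_four, 1),
}

# Assemble the rooted tree in Newick format
newick = (
    f"(((Bear:{h_bear_gp},Giant_panda:{h_bear_gp}):{internal['Bear+Giant panda']},"
    f"(Red_panda:{h_rp_rac},Raccoon:{h_rp_rac}):{internal['Red panda+Raccoon']})"
    f":{internal['Bear+Giant panda+Red panda+Raccoon']},Ocelot:{h_root});"
)
print(internal)
print(newick)
```

Under these assumptions the internal branches come out as 2.1, 0.7 and 1.5. The short 0.7 branch above Red panda+Raccoon is the part of the tree that bears on the bear-or-raccoon question.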

Trees: MP
Maximum parsimony
- minimum number of mutations to construct the tree
Better than NJ (information is lost in the distance matrix) but much slower
Sensitive to long-branch attraction
- long branches clustered together
No explicit evolutionary model
Protpars refuses to estimate branch lengths
Informative sites

Long-branch attraction
Rodents evolve faster than primates
[Figure: the true tree and the false “LBA” tree for MusHBA, MusHBB, HumHBA and HumHBB; in the false tree the long mouse branches are clustered together]

Trees: ML
Very CPU intensive
Requires an explicit model of evolution: rate and pattern of nucleotide substitution
- JC: Jukes/Cantor
- K2P: Kimura 2-parameter, transition/transversion
- F81: Felsenstein, base composition bias
- HKY85: merges K2P and F81
Explicit model -> preferred statistically
Assumes change is more likely on a long branch
- so no long-branch attraction
But wrong model -> wrong tree

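Models of sequence evolution: HKY85

$$
Q =
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & \cdot & \pi_C\,\beta & \pi_G\,\alpha & \pi_T\,\beta \\
C & \pi_A\,\beta & \cdot & \pi_G\,\beta & \pi_T\,\alpha \\
G & \pi_A\,\alpha & \pi_C\,\beta & \cdot & \pi_T\,\beta \\
T & \pi_A\,\beta & \pi_C\,\alpha & \pi_G\,\beta & \cdot
\end{array}
$$

Here $\alpha$ is the transition rate, $\beta$ the transversion rate and $\pi_X$ the frequency of base $X$.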
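To make the structure of the HKY85 matrix concrete, here is a minimal sketch that fills it in from base frequencies and the two rates. The function names and the parameter values are hypothetical; the diagonal is set so that each row sums to zero, which is the usual convention for a rate matrix.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def is_transition(x: str, y: str) -> bool:
    """A<->G and C<->T are transitions; everything else is a transversion."""
    return (x in PURINES) == (y in PURINES)

def hky85(pi: dict, alpha: float, beta: float) -> dict:
    """Rate matrix as nested dicts: q[from_base][to_base]."""
    q = {}
    for x in BASES:
        row = {y: (alpha if is_transition(x, y) else beta) * pi[y]
               for y in BASES if y != x}
        row[x] = -sum(row.values())  # diagonal makes the row sum to zero
        q[x] = row
    return q

# Hypothetical base frequencies and rates, for illustration only
pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
q = hky85(pi, alpha=2.0, beta=1.0)
for x in BASES:
    print(x, [round(q[x][y], 2) for y in BASES])
```

Setting all four frequencies to 0.25 gives K2P, setting alpha equal to beta gives F81, and doing both gives Jukes-Cantor. That is the sense in which HKY85 merges K2P and F81.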

Bayesian methods
ML is unsatisfactory because only the best tree is identified
Bayesian methods investigate a sample of highly likely trees
MrBayes is the program
Option to specify “prior probabilities” for
- tree topology (can force only “sensible” trees)
- branch lengths (usually equal lengths, but rodents are known to evolve faster than primates)
- rate matrix parameters

Maximum parsimony
Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
              *   *   *
Here we have a representative alignment. It is a good alignment, clearly aligning homologous sites without gaps. We want to determine the phylogenetic relationships among the OTUs.

There are 3 possible trees for 4 taxa (OTUs):
1       3      1       2      1       2
 \_____/        \_____/        \_____/
 /     \        /     \        /     \
2       4      3       4      4       3
Or (1,2)(3,4), (1,3)(2,4) and (1,4)(2,3)
Aim to identify (phylogenetically) informative sites and use these to determine which tree is most parsimonious.

Sites 1, 6 and 8 are identical in all four OTUs and so are useless for phylogenetic purposes.

Site 2 also useless: OTU1’s A could be grouped with any of the Gs.

Site 4 is uninformative, as each OTU has a different base there, UNLESS transitions are weighted, in which case it supports (1,4)(2,3).

For site 3 each tree can be made with (minimum) 2 mutations:

(1,2)(3,4)
G     A      G     A      G     A
 \   /        \   /        \   /
 G---A        C---A        A---A
 /   \        /   \        /   \
C     A      C     A      C     A

(1,3)(2,4)           can do worse:
G     C              G     C
 \   /                \   /
 A---A                G---A
 /   \                /   \
A     A              A     A

(1,4)(2,3)
G     C
 \   /
 A---A
 /   \
A     A
So site 3 is (counterintuitively) NOT informative.

Site 5, however, is informative because one tree is shortest.

(1,2)(3,4)   (1,3)(2,4)   (1,4)(2,3)
G     A      G     G      G     G
 \   /        \   /        \   /
 G---A        A---A        G---G
 /   \        /   \        /   \
G     A      A     A      A     A

Likewise sites 7 and 9. By majority rule the most parsimonious tree is (1,2)(3,4), supported by 2 of the 3 informative sites (sites 5 and 7; site 9 supports (1,3)(2,4)).
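The site-by-site reasoning can be reproduced with a small Fitch-style count on the nine-site alignment from these slides. This is a sketch for the four-taxon case only; the function and variable names are illustrative and this is not how Protpars is implemented.

```python
ALIGNMENT = {
    "OTU1": "AAGAGTGCA",
    "OTU2": "AGCCGTGCG",
    "OTU3": "AGATATCCA",
    "OTU4": "AGAGATCCG",
}
# Each unrooted tree is written as (left pair)(right pair)
TREES = {
    "(1,2)(3,4)": ("OTU1", "OTU2", "OTU3", "OTU4"),
    "(1,3)(2,4)": ("OTU1", "OTU3", "OTU2", "OTU4"),
    "(1,4)(2,3)": ("OTU1", "OTU4", "OTU2", "OTU3"),
}

def quartet_cost(a, b, c, d):
    """Minimum number of changes at one site on the tree (a,b)(c,d)."""
    cost = 0
    left = {a} & {b}
    if not left:              # the left pair disagrees: one change
        left, cost = {a, b}, cost + 1
    right = {c} & {d}
    if not right:             # the right pair disagrees: one change
        right, cost = {c, d}, cost + 1
    if not left & right:      # one more change on the internal branch
        cost += 1
    return cost

def site_costs(site):
    return {name: quartet_cost(*(ALIGNMENT[otu][site] for otu in otus))
            for name, otus in TREES.items()}

n_sites = len(ALIGNMENT["OTU1"])
# A site is informative if the three trees do not all cost the same
informative = [s + 1 for s in range(n_sites)
               if len(set(site_costs(s).values())) > 1]
totals = {name: sum(site_costs(s)[name] for s in range(n_sites))
          for name in TREES}
print("informative sites:", informative)
print("total changes per tree:", totals)
```

The count flags sites 5, 7 and 9 as informative and gives totals of 10, 11 and 12 changes for the three trees, so (1,2)(3,4) is the shortest, in agreement with the majority-rule argument.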

Protpars infile:
BRU  MSQNSLRLVE DNSV-DKTKA LDAALSQIER
RLR                V-DKSKA LEAALSQIER
NGR              MSD-DKSKA LAAALAQIEK
ECO             AIDE-NKQKA LAAALGQIEK
YPR           M AIDE-NKQKA LAAALGQIEK
PSE              MDD-NKKRA LAAALGQIER
TTH              MEE-NKRKS LENALKTIEK
ACD              MDEPGGKIE FSPAFMQIEG

Protpars treefile:
(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);

outfile:
One most parsimonious tree found:
[ASCII drawing of the tree with numbered internal nodes and the tips ACD, TTH, PSE, YPR, ECO, NGR, RLR, BRU; the layout is not recoverable from this transcript]
remember: this is an unrooted tree!
requires a total of … steps

Clustalw
****** PHYLOGENETIC TREE MENU ******
1. Input an alignment
2. Exclude positions with gaps? = ON
3. Correct for multiple substitutions? = ON
4. Draw tree now
5. Bootstrap tree
6. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu

ClustalW trees
Don’t use the .dnd file as a final tree
- it’s only a temporary pairwise dendrogram/tree
Always correct for multiple hits/substitutions
Usually toss all gaps

ClustalW NJ
(((ACD:…,TTH:…):…,((BRU:…,RLR:…):…,NGR:…):…):…,(ECO:…,YPR:…):…,PSE:…);
topologically the same as
(((ACD,TTH),((BRU,RLR),NGR)),(ECO,YPR),PSE);
and compare to Protpars:
(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);
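One way to compare two unrooted trees written in different orientations is to list the splits (bipartitions of the taxa) that each internal branch implies. The sketch below does this for the two topology-only Newick strings on this slide. The parser and function names are illustrative, and the parser only handles strings without branch lengths.

```python
import re

def parse_newick(text):
    """Topology-only Newick string -> nested tuples of leaf names."""
    tokens = re.findall(r"[(),;]|[^(),;\s]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while tokens[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
            return tuple(children)
        name = tokens[pos]
        pos += 1
        return name

    return node()

def splits(tree):
    """Set of non-trivial bipartitions implied by the internal branches."""
    all_leaves, clades = set(), []

    def leaves(n):
        if isinstance(n, str):
            all_leaves.add(n)
            return frozenset([n])
        clade = frozenset().union(*(leaves(child) for child in n))
        clades.append(clade)
        return clade

    leaves(tree)
    result = set()
    for clade in clades:
        rest = frozenset(all_leaves) - clade
        if len(clade) > 1 and len(rest) > 1:   # skip single-tip splits
            result.add(frozenset([clade, rest]))
    return result

nj = splits(parse_newick("(((ACD,TTH),((BRU,RLR),NGR)),(ECO,YPR),PSE);"))
pp = splits(parse_newick("(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);"))
print("shared splits:", len(nj & pp))
print("splits found in only one tree:", len(nj ^ pp))
```

For these two strings all five internal splits are shared, so the NJ and Protpars results describe the same unrooted branching order even though the Newick text looks quite different. Branch lengths are ignored by this check.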

NJ vs ProtPars

Dealing with CDSs
More info in DNA than in proteins
Systematic 3rd position changes can confuse
Use DNA directly only if the evolutionary distance is short
For distant relationships: blank 3rd positions
Translate into protein to align
- then copygaps back to DNA
Use dnadist with weights to investigate rates
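Two of these steps, copying protein-alignment gaps back onto the coding DNA and blanking third codon positions, can be sketched as follows. The function names are hypothetical and this is not the copygaps program itself. It assumes the CDS is in frame, has no stop codon, and contains exactly one codon per aligned residue; using "N" as the blanking character is also an assumption.

```python
def thread_gaps(protein_aln: str, cds: str) -> str:
    """Copy the gaps of an aligned protein back onto its coding DNA."""
    codons = iter(cds[i:i + 3] for i in range(0, len(cds), 3))
    # A gap in the protein becomes three gap characters in the DNA
    return "".join("---" if aa == "-" else next(codons) for aa in protein_aln)

def blank_third_positions(cds_aln: str, blank: str = "N") -> str:
    """Mask every third codon position, leaving gap characters alone."""
    return "".join(
        blank if i % 3 == 2 and base != "-" else base
        for i, base in enumerate(cds_aln)
    )

dna_aln = thread_gaps("M-K", "ATGAAA")
print(dna_aln)
print(blank_third_positions(dna_aln))
```

Aligning at the protein level keeps gaps in multiples of three, so the reading frame survives and the third positions can be picked out by simple counting.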

Trees
General guidelines, NOT rules
More data is better
Excellent alignment = few informative sites
Exclude unreliable data: toss all gaps?
Use seqs/sites evolving at an appropriate rate
- Phylip DISTANCE
- 3rd positions saturated
- 2nd positions invariant
- fast-evolving seqs for closely related taxa
- eliminate transitions (homoplasy)

Trees
Beware base composition bias in unrelated taxa
Are sites (hairpins?) independent?
Are substitution rates equal across the dataset?
Long branches are prone to error: remove them?
- choose the outgroup carefully