1
Phylogenetic Analysis
2
Review of Linux
ls, cd, mkdir, less, cp, mv, cat, pwd, > (output redirection)
3
Perl
Variables: $DNA = "A"; @DATA = ('A', 'B'); %TABLE = (A => 'A', N => '[AC]');
Statements: print, length, open, close, substr, push, pop, shift, unshift
4
#!/usr/bin/perl -w
$word = 'MNIDDKL';
if ($word eq 'QSTVSGE') {
    print "QSTVSGE\n";
} elsif ($word eq 'MRQQDMISHDEL') {
    print "MRQQDMISHDEL\n";
} elsif ($word eq 'MNIDDKL') {
    print "MNIDDKL-the magic word!\n";
} else {
    print "Is \"$word\" a peptide?\n";
}
exit;
5
$x = 10;
$y = -20;
if ($x <= 10) { print "1st true\n"; }
if ($x > 10) { print "2nd true\n"; }
if ($x <= 10 || $y > -21) { print "3rd true\n"; }
if ($x > 5 && $y < 0) { print "4th true\n"; }
if (($x > 5 && $y < 0) || $y > 5) { print "5th true\n"; }
6
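$position = 0;
while ( $position < length $DNA ) {
    $base = substr($DNA, $position, 1);
    if ( $base eq 'C' or $base eq 'G' ) {
        ++$count_of_CG;
    }
    $position++;
}
# The same loop written with for (use one form or the other, not both):
for ( $position = 0 ; $position < length $DNA ; ++$position ) {
    $base = substr($DNA, $position, 1);
    if ( $base eq 'C' or $base eq 'G' ) {
        ++$count_of_CG;
    }
}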
7
The Most Common Sequence Formats
8
Converting Formats
Don’t re-compute your MSA if it is not in the right format
Convert your file using one of the online conversion tools
The 3 most popular reformatting utilities:
  Fmtseq: the most complete
  READSEQ: very popular and robust
  SeqCheck: can clean FASTA sequences
10
Editing your MSA
If your MSA looks bad . . .
Don’t torture the online server
Edit the MSA yourself, locally
Never, ever, ever (ever) use a standard word processor
Always use a dedicated MSA editor
The most popular online tool is Jalview
You can get it at
12
MSA => LOGO Graph
A LOGO graph summarizes an MSA
Tall letters indicate highly conserved positions
Short letters indicate poorly conserved positions
LOGO graphs are ideal for identifying conserved patterns
weblogo.berkeley.edu/
13
Molecular Evolution and Phylogenetic Reconstruction
14
Evolutionary Tree of Bears and Raccoons
15
Human Evolutionary Tree (cont’d)
16
Human Migration Out of Africa
1. Yorubans
2. Western Pygmies
3. Eastern Pygmies
4. Hadza
5. !Kung
17
Reading Your Tree
There’s a lot of vocabulary in a tree
Nodes correspond to common ancestors
The root is the oldest ancestor
  Often artificial
  Only meaningful with a good outgroup
Trees can be unrooted
Branch lengths are only meaningful when the tree is scaled
  Cladograms are usually unscaled (they show only the branching order)
  Phylograms are scaled (branch lengths reflect the amount of change)
18
Rooted and Unrooted Trees
In an unrooted tree, the position of the root (“oldest ancestor”) is unknown. Otherwise, unrooted trees are like rooted trees.
19
Type of Trees (Cladogram)
20
Type of Trees (Phylogram)
21
Orthology and Paralogy
Orthologous (vertical) homology and paralogous (parallel) homology
Orthologous genes
  Separated by speciation
  Often have the same function
Paralogous genes
  Separated by duplications
  Can have different functions
In the graph:
  A is paralogous with B
  A1 is orthologous with A2
22
Which Sequences?
Orthologous sequences
  Produce a species tree
  Show how the considered species have diverged
Paralogous sequences
  Produce a gene tree
  Show the evolution of a protein family
23
Building the Right MSA
Your MSA should have as few gaps as possible.
In most cases you should remove the columns that contain gaps.
Some variability, but not too much!
Some conservation, but not too much!
24
Building the Right Tree
There are three types of tree-reconstruction methods:
  Distance-based methods
  Statistical methods
  Parsimony methods
Statistical methods are the most accurate:
  Maximum likelihood
  Bayesian methods
Statistical methods take more time:
  Limited to small datasets
25
Distance in Trees: an Example
d1,4 = 68, the sum of the edge lengths on the path between leaves 1 and 4
26
Compute a Distance Matrix
Evolutionary Distance: the number of substitutions per 100 amino acids (for proteins) or nucleotides (for DNA)
3 observed changes:
  ACTGTAGGAATCGC
  AATGAAAGAATCGC
6 actual changes:
  ACTGTAGGAATCGC
  ACTGCAGGAATAGC
  AATGAAAGAATCGC
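The gap between observed and actual changes can be checked by counting mismatched positions. The following is a minimal sketch in Perl; the sub name `observed_changes` and the variable names are illustrative assumptions, and the three sequences are the ones shown on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the positions at which two equal-length sequences differ.
sub observed_changes {
    my ($seq1, $seq2) = @_;
    die "sequences must have equal length\n"
        unless length($seq1) == length($seq2);
    my $count = 0;
    for my $pos (0 .. length($seq1) - 1) {
        ++$count if substr($seq1, $pos, 1) ne substr($seq2, $pos, 1);
    }
    return $count;
}

# Sequences from the slide (names are hypothetical).
my $first  = 'ACTGTAGGAATCGC';
my $middle = 'ACTGCAGGAATAGC';
my $last   = 'AATGAAAGAATCGC';

# Comparing only the end points hides the changes that were later overwritten.
my $observed = observed_changes($first, $last);
my $actual   = observed_changes($first, $middle)
             + observed_changes($middle, $last);

print "observed changes: $observed\n";
print "actual changes:   $actual\n";
```

Two of the six substitutions occur at positions that change twice, which is why a direct comparison of the first and last sequence undercounts.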
27
Edit Distance vs Tree Distance
d1,4 = 68 is the tree distance between leaves 1 and 4.
The edit distance D1,4 may be smaller than 68, as some changes may not be observed.
28
Fitting Distance Matrix
Given n species, we can compute the n x n distance matrix Dij.
The evolution of these genes is described by a tree that we don’t know.
We need an algorithm to construct a tree that best fits the distance matrix Dij.
29
Reconstructing a 3 Leaved Tree
Tree reconstruction for any 3x3 matrix is straightforward.
We have 3 leaves i, j, k and a center vertex c.
Observe:
  dic + djc = Dij
  dic + dkc = Dik
  djc + dkc = Djk
30
Reconstructing a 3 Leaved Tree (cont’d)
Adding the first two equations:
  dic + djc = Dij
  dic + dkc = Dik
  2dic + djc + dkc = Dij + Dik
Since djc + dkc = Djk:
  2dic + Djk = Dij + Dik
  dic = (Dij + Dik - Djk)/2
Similarly,
  djc = (Dij + Djk - Dik)/2
  dkc = (Dki + Dkj - Dij)/2
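The three formulas translate directly into code. This is a small sketch; the sub name `three_leaf_tree` and the input distances (5, 7, 8) are assumptions chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Edge lengths from leaves i, j, k to the center vertex c of a 3-leaf tree,
# given the pairwise distances Dij, Dik, Djk.
sub three_leaf_tree {
    my ($Dij, $Dik, $Djk) = @_;
    my $dic = ($Dij + $Dik - $Djk) / 2;
    my $djc = ($Dij + $Djk - $Dik) / 2;
    my $dkc = ($Dik + $Djk - $Dij) / 2;
    return ($dic, $djc, $dkc);
}

# Hypothetical distances: Dij = 5, Dik = 7, Djk = 8.
my ($dic, $djc, $dkc) = three_leaf_tree(5, 7, 8);
print "dic = $dic, djc = $djc, dkc = $dkc\n";

# Each pairwise distance is recovered as the sum of two edges.
print "dic + djc = ", $dic + $djc, "\n";
print "dic + dkc = ", $dic + $dkc, "\n";
print "djc + dkc = ", $djc + $dkc, "\n";
```

If any of the three results is negative, the input violates the triangle inequality and no tree with non-negative edge lengths fits it.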
31
Additive Distance Matrices
Matrix D is ADDITIVE if there exists a tree T with dij(T) = Dij, and NON-ADDITIVE otherwise.
32
The Four Point Condition (cont’d)
Compute:
  1. Dij + Dkl
  2. Dik + Djl
  3. Dil + Djk
Sums 2 and 3 represent the same number: the length of all edges plus the middle edge (the middle edge is counted twice).
Sum 1 represents a smaller number: the length of all edges minus the middle edge.
33
The Four Point Condition: Theorem
The four point condition for the quartet i, j, k, l is satisfied if two of these sums are the same, with the third sum smaller than these first two.
Theorem: An n x n matrix D is additive if and only if the four point condition holds for every quartet 1 ≤ i, j, k, l ≤ n.
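The theorem suggests a brute-force additivity test: check every quartet. The sketch below is one way to write it, not a definitive implementation. The sub names and both example matrices are assumptions; the first matrix was built from a four-leaf tree with leaf edges 2, 3, 4, 5 and a middle edge of 1. As an assumed relaxation, the code also accepts a quartet whose three sums are all equal (a middle edge of length 0).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if the quartet (i, j, k, l) satisfies the four point condition:
# after sorting the three sums, the two largest must be equal.
sub four_point_ok {
    my ($D, $i, $j, $k, $l) = @_;
    my @sums = sort { $a <=> $b } (
        $D->[$i][$j] + $D->[$k][$l],
        $D->[$i][$k] + $D->[$j][$l],
        $D->[$i][$l] + $D->[$j][$k],
    );
    return abs($sums[1] - $sums[2]) < 1e-9 ? 1 : 0;
}

# True if every quartet of distinct indices passes the check.
sub is_additive {
    my ($D) = @_;
    my $n = @$D;
    for my $i (0 .. $n - 1) {
        for my $j ($i + 1 .. $n - 1) {
            for my $k ($j + 1 .. $n - 1) {
                for my $l ($k + 1 .. $n - 1) {
                    return 0 unless four_point_ok($D, $i, $j, $k, $l);
                }
            }
        }
    }
    return 1;
}

# Hypothetical additive matrix (sums are 14, 16, 16).
my @additive = ([0, 5, 7, 8], [5, 0, 8, 9], [7, 8, 0, 9], [8, 9, 9, 0]);
# Same matrix with one distance changed (sums are 14, 16, 17).
my @broken   = ([0, 5, 7, 9], [5, 0, 8, 9], [7, 8, 0, 9], [9, 9, 9, 0]);

print "first matrix:  ", (is_additive(\@additive) ? "additive" : "non-additive"), "\n";
print "second matrix: ", (is_additive(\@broken)   ? "additive" : "non-additive"), "\n";
```

The check visits every quartet, so it takes time proportional to n^4; that is fine for small examples like these.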
34
Distance Based Phylogeny Problem
Goal: Reconstruct an evolutionary tree from a distance matrix
Input: n x n distance matrix Dij
Output: weighted tree T with n leaves fitting D
If D is additive, this problem has a solution and there is a simple algorithm to solve it.
35
Using Neighboring Leaves to Construct the Tree
Find neighboring leaves i and j with parent k
Remove the rows and columns of i and j
Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as:
  Dkm = (Dim + Djm - Dij)/2
Compress i and j into k, and iterate the algorithm for the rest of the tree
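The compression step can be sketched as a function that takes a distance table and returns the reduced table. Everything named here (`collapse_neighbors`, the leaf names `m1` and `m2`, and the example distances) is an assumption for illustration; the table is stored as a hash of hashes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace neighboring leaves $i and $j by their parent $k.
# $D is a hash of hashes: $D->{x}{y} is the distance between x and y.
sub collapse_neighbors {
    my ($D, $i, $j, $k) = @_;
    my @others = grep { $_ ne $i && $_ ne $j } sort keys %$D;
    my %new;
    for my $m (@others) {
        # Distances among the remaining leaves are unchanged.
        for my $other (@others) {
            $new{$m}{$other} = $D->{$m}{$other};
        }
        # Dkm = (Dim + Djm - Dij) / 2
        my $dkm = ($D->{$i}{$m} + $D->{$j}{$m} - $D->{$i}{$j}) / 2;
        $new{$k}{$m} = $new{$m}{$k} = $dkm;
    }
    $new{$k}{$k} = 0;
    return \%new;
}

# Store a symmetric distance.
sub set_dist {
    my ($D, $x, $z, $value) = @_;
    $D->{$x}{$z} = $D->{$z}{$x} = $value;
}

# Hypothetical additive distances in which i and j are neighbors.
my %D;
$D{$_}{$_} = 0 for qw(i j m1 m2);
set_dist(\%D, 'i',  'j',  5);
set_dist(\%D, 'i',  'm1', 7);
set_dist(\%D, 'i',  'm2', 8);
set_dist(\%D, 'j',  'm1', 8);
set_dist(\%D, 'j',  'm2', 9);
set_dist(\%D, 'm1', 'm2', 9);

my $reduced = collapse_neighbors(\%D, 'i', 'j', 'k');
print "Dk,m1 = $reduced->{k}{m1}\n";
print "Dk,m2 = $reduced->{k}{m2}\n";
```

The function leaves the input table untouched and returns a new one, which keeps each iteration easy to inspect.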
37
Finding Neighboring Leaves
To find neighboring leaves, can we simply select a pair of closest leaves? No, that is WRONG.
38
Finding Neighboring Leaves
Closest leaves aren’t necessarily neighbors.
i and j are neighbors, but (dij = 13) > (djk = 12).
Finding a pair of neighboring leaves is a nontrivial problem!
39
Neighbor Joining Algorithm
In 1987, Naruya Saitou and Masatoshi Nei developed a neighbor joining algorithm for phylogenetic tree reconstruction.
It finds a pair of leaves that are close to each other but far from other leaves, and so implicitly finds a pair of neighboring leaves.
Advantages: it works well for additive and for non-additive matrices, and it does not rely on the flawed molecular clock assumption.
40
Overview
1. Based on the current distance matrix, calculate the matrix Q (defined later).
2. Find the pair of taxa i, j for which Qij has its lowest value.
3. Add a new node to the tree, joining these taxa to the rest of the tree.
4. Calculate the distance from each of the taxa in the pair to this new node.
5. Calculate the distance from each of the taxa outside of this pair to the new node.
6. Start the algorithm again, replacing the pair of joined neighbors with the new node and using the distances calculated in the previous step.
41
Basic Algorithm
46
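Another Example
      A    B    C    D    E    F
A     0    5    4    7    6    8
B     5    0    7   10    9   11
C     4    7    0    7    6    8
D     7   10    7    0    5    9
E     6    9    6    5    0    8
F     8   11    8    9    8    0
47
Q(ij) = (N-2) d(ij) - [r(i) + r(j)]
N is the number of taxa (here N = 6) and r(i) is the sum of the distances from taxon i to all other taxa.
r(A)=30 r(B)=42 r(C)=32 r(D)=38 r(E)=34 r(F)=44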
48
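Q
      A    B    C    D    E    F
A     .  -52  -46  -40  -40  -42
B   -52    .  -46  -40  -40  -42
C   -46  -46    .  -42  -42  -44
D   -40  -40  -42    .  -52  -46
E   -40  -40  -42  -52    .  -46
F   -42  -42  -44  -46  -46    .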
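The Q values can be recomputed from the distance matrix with Q(ij) = (N-2) d(ij) - [r(i) + r(j)]. The sketch below makes two assumptions: the sub name `nj_q_matrix` is illustrative, and the full distance matrix is reconstructed from the values that survive on the slides (they are consistent with the slide's results Q = -52, D(AU) = 1 and d(CU) = 3).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Distance matrix of the six-taxon example (reconstructed, see note above).
my @taxa = qw(A B C D E F);
my @dist = (
    [0,  5, 4,  7, 6,  8],
    [5,  0, 7, 10, 9, 11],
    [4,  7, 0,  7, 6,  8],
    [7, 10, 7,  0, 5,  9],
    [6,  9, 6,  5, 0,  8],
    [8, 11, 8,  9, 8,  0],
);

# Q(ij) = (N-2) d(ij) - [r(i) + r(j)], where r(i) is the sum of row i.
sub nj_q_matrix {
    my ($d) = @_;
    my $n = @$d;
    my @r;
    for my $row (@$d) {
        my $sum = 0;
        $sum += $_ for @$row;
        push @r, $sum;
    }
    my @Q;
    for my $i (0 .. $n - 1) {
        for my $j (0 .. $n - 1) {
            # The diagonal is not used by the algorithm; it is stored as 0.
            $Q[$i][$j] = $i == $j
                ? 0
                : ($n - 2) * $d->[$i][$j] - ($r[$i] + $r[$j]);
        }
    }
    return \@Q;
}

my $Q = nj_q_matrix(\@dist);
for my $i (0 .. $#taxa) {
    my @cells;
    for my $j (0 .. $#taxa) {
        push @cells, $i == $j ? '.' : $Q->[$i][$j];
    }
    print join("\t", $taxa[$i], @cells), "\n";
}
```

In this example the pairs (A, B) and (D, E) tie at -52; the slides continue with A and B.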
49
Tree (So far)
A and B have the lowest Q value (-52) and are joined at a new node U.
D(AU) = d(AB)/2 + [r(A) - r(B)] / [2(N-2)] = 1
D(BU) = d(AB) - D(AU) = 4
50
d(CU) = [d(AC) + d(BC) - d(AB)] / 2 = 3
d(DU) = [d(AD) + d(BD) - d(AB)] / 2 = 6
d(EU) = [d(AE) + d(BE) - d(AB)] / 2 = 5
d(FU) = [d(AF) + d(BF) - d(AB)] / 2 = 7
New Matrix
      U    C    D    E    F
U     0    3    6    5    7
C     3    0    7    6    8
D     6    7    0    5    9
E     5    6    5    0    8
F     7    8    9    8    0
51
Q(ij) = (N-2) d(ij) - [r(i) + r(j)], now with N = 5
r(U) = 3 + 6 + 5 + 7 = 21
r(C)=24 r(D)=27 r(E)=24 r(F)=32
52
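Q
      U    C    D    E    F
U     .  -36  -30  -30  -32
C   -36    .  -30  -30  -32
D   -30  -30    .  -36  -32
E   -30  -30  -36    .  -32
F   -32  -32  -32  -32    .
U and C have the lowest Q value (-36) and are joined at a new node W.
D(UW) = d(UC)/2 + [r(U) - r(C)] / [2(N-2)] = 1
D(CW) = d(UC) - D(UW) = 2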
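One full neighbor joining iteration (pick the lowest Q, compute the two branch lengths, build the reduced matrix) can be sketched as follows. The sub names `make_table` and `nj_step` are assumptions, and the five-taxon matrix is reconstructed from the values on the slides. Ties in Q are broken by taking the first pair in the given taxon order, which is an arbitrary choice made here so that the result follows the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a symmetric distance table from the upper triangle, listed row by row.
sub make_table {
    my ($taxa, @values) = @_;
    my %d;
    for my $u (0 .. $#$taxa - 1) {
        for my $v ($u + 1 .. $#$taxa) {
            my $dist = shift @values;
            $d{ $taxa->[$u] }{ $taxa->[$v] } = $dist;
            $d{ $taxa->[$v] }{ $taxa->[$u] } = $dist;
        }
    }
    return \%d;
}

# One neighbor joining iteration. Returns the joined pair, their branch
# lengths to the new node, the new taxon list, and the reduced table.
sub nj_step {
    my ($taxa, $d, $new) = @_;
    my $n = @$taxa;
    my %r;
    for my $i (@$taxa) {
        $r{$i} = 0;
        $r{$i} += $d->{$i}{$_} for grep { $_ ne $i } @$taxa;
    }
    my ($best, $i, $j);
    for my $u (0 .. $n - 2) {
        for my $v ($u + 1 .. $n - 1) {
            my ($left, $right) = @$taxa[$u, $v];
            my $Q = ($n - 2) * $d->{$left}{$right} - ($r{$left} + $r{$right});
            ($best, $i, $j) = ($Q, $left, $right)
                if !defined $best || $Q < $best;
        }
    }
    my $len_i = $d->{$i}{$j} / 2 + ($r{$i} - $r{$j}) / (2 * ($n - 2));
    my $len_j = $d->{$i}{$j} - $len_i;
    my @rest = grep { $_ ne $i && $_ ne $j } @$taxa;
    my %next;
    for my $m (@rest) {
        $next{$m}{$_} = $d->{$m}{$_} for grep { $_ ne $m } @rest;
        $next{$new}{$m} = $next{$m}{$new}
            = ($d->{$i}{$m} + $d->{$j}{$m} - $d->{$i}{$j}) / 2;
    }
    return ($i, $j, $len_i, $len_j, [$new, @rest], \%next);
}

# Second iteration of the example: taxa U, C, D, E, F; new node W.
my @taxa = qw(U C D E F);
my $d = make_table(\@taxa, 3, 6, 5, 7,  7, 6, 8,  5, 9,  8);
my ($i, $j, $len_i, $len_j, $next_taxa, $next) = nj_step(\@taxa, $d, 'W');

print "join $i and $j at W: D(${i}W) = $len_i, D(${j}W) = $len_j\n";
print "d(W$_) = $next->{W}{$_}\n" for grep { $_ ne 'W' } @$next_taxa;
```

Calling `nj_step` repeatedly on the returned taxon list and table, until only two taxa remain, would give the rest of the tree; the final join of the last two nodes is left out of this sketch.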
53
Programs: BIONJ, WEIGHBOR, FastME
54
UPGMA: Unweighted Pair Group Method with Arithmetic Mean
UPGMA is a clustering algorithm that:
  computes the distance between clusters using average pairwise distance
  assigns a height to every vertex in the tree, effectively assuming the presence of a molecular clock and dating every vertex
55
UPGMA’s Weakness
The algorithm produces an ultrametric tree: the distance from the root to any leaf is the same.
UPGMA assumes a constant molecular clock: all species represented by the leaves in the tree are assumed to accumulate mutations (and thus evolve) at the same rate.
This is a major pitfall of UPGMA.
56
UPGMA’s Weakness: Example
Figure: the correct tree and the tree produced by UPGMA, for leaves 1 to 4
57
Clustering in UPGMA
Given two disjoint clusters Ci, Cj of sequences, the distance between them is the average pairwise distance:
$$ d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\ q \in C_j} d_{pq} $$
Note that if Ck = Ci ∪ Cj, then the distance to another cluster Cl is:
$$ d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|} $$
58
UPGMA Algorithm
Initialization:
  Assign each xi to its own cluster Ci
  Define one leaf per sequence, each at height 0
Iteration:
  Find two clusters Ci and Cj such that dij is minimal
  Let Ck = Ci ∪ Cj
  Add a vertex connecting Ci, Cj and place it at height dij/2
  Delete Ci and Cj
Termination:
  When a single cluster remains
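The steps above can be sketched in Perl as follows. This is a minimal illustration, not a reference implementation: the sub name `upgma`, the cluster naming scheme, and the four-taxon distance table are all assumptions. Distances to a merged cluster use the size-weighted average dkl = (dil |Ci| + djl |Cj|) / (|Ci| + |Cj|).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# UPGMA on a distance table (hash of hashes keyed by cluster name).
# Returns one record per merge: [new cluster, left, right, height].
sub upgma {
    my ($names, $dist) = @_;
    my %d;                       # working copy of the distances
    for my $x (@$names) {
        for my $z (@$names) {
            $d{$x}{$z} = $dist->{$x}{$z} if $x ne $z;
        }
    }
    my %size;
    $size{$_} = 1 for @$names;   # every sequence starts as its own cluster
    my @active = @$names;
    my @merges;
    while (@active > 1) {
        # Find the two closest clusters Ci and Cj.
        my ($best, $ci, $cj);
        for my $u (0 .. $#active - 1) {
            for my $v ($u + 1 .. $#active) {
                my $dij = $d{ $active[$u] }{ $active[$v] };
                ($best, $ci, $cj) = ($dij, $active[$u], $active[$v])
                    if !defined $best || $dij < $best;
            }
        }
        # Ck is the union of Ci and Cj; its vertex sits at height dij / 2.
        my $ck = "($ci,$cj)";
        push @merges, [$ck, $ci, $cj, $best / 2];
        @active = grep { $_ ne $ci && $_ ne $cj } @active;
        for my $l (@active) {
            $d{$ck}{$l} = $d{$l}{$ck}
                = ($d{$ci}{$l} * $size{$ci} + $d{$cj}{$l} * $size{$cj})
                / ($size{$ci} + $size{$cj});
        }
        $size{$ck} = $size{$ci} + $size{$cj};
        push @active, $ck;
    }
    return @merges;
}

# Hypothetical ultrametric distances between four sequences.
my @names = qw(a b c d);
my %dist = (
    a => { b => 2,  c => 6,  d => 10 },
    b => { a => 2,  c => 6,  d => 10 },
    c => { a => 6,  b => 6,  d => 10 },
    d => { a => 10, b => 10, c => 10 },
);

my @merges = upgma(\@names, \%dist);
printf "%s at height %g\n", $_->[0], $_->[3] for @merges;
```

Because the example distances are ultrametric, the heights agree with the input distances; on data that violate the molecular clock, the same code still returns a tree, but not necessarily the correct one.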
59
UPGMA Algorithm (cont’d)
Figure: tree built by UPGMA, with leaves in the order 1, 4, 3, 2, 5
60
UPGMA Building Phylogenetic Trees by UPGMA: Example:
The distance matrix
73
UPGMA: What is the distance between W and X? (Calculate.)
76
UPGMA: What is the distance between Y and C? (Calculate.)