Published by Ερρίκος Αργυριάδης. Modified over 6 years ago.
1
DNA variation in Ecology and Evolution IV- Clustering methods and Phylogenetic reconstruction
Maria Eugenia D’Amato
BCB 705: Biodiversity
This presentation focuses on clustering methods, which are ways of recognizing similarity between elements. The elements of our study are either individuals or populations, on which we collect characters such as alleles or DNA sequences.
2
Organization of the presentation
Phylogenetic reconstruction (Distance, ML, MP); Networks; Multivariate analysis
In this presentation we will see methods of phylogenetic reconstruction, concepts of clustering with discrete characters, multivariate analysis and networks. Among the methods for phylogenetic reconstruction we will see distance methods, Maximum Likelihood (ML) and Maximum Parsimony (MP).
3
Characters: independent, homologous
Continuous
Discrete: Binary, Multistate
Characters and other concepts
Experimental data are generated by a wide variety of techniques, some of which were introduced in Class II. The data generated are called characters. The main criteria for selecting characters are their independence and homology. Statistically, characters are called variables.
Characters can be discrete (qualitative) or continuous (quantitative). In this class we will focus only on discrete characters. These can be broadly grouped into multistate characters (3 or more possible states) and binary characters (only two possible states). Binary characters are exemplified by RAPD and AFLP data (see Class II, Slide 12) and by restriction enzyme recognition sites in RFLP analysis (see Class II, Slides 5 and 7): they are either present (1) or absent (0). Multistate characters are exemplified by multiallelic loci, with as many states as alleles, and by DNA sequences, where each site takes one of 4 states (A, T, C or G).
In this course we will not present quantitative characters. Examples of continuous characters are height in people, skin color, etc. Levels of gene expression in different environments could also be treated as quantitative data.
Characters are presented in matrices. The slide shows an example of a matrix of binary data with 5 elements and 6 variables or characters. This is a typical example of RAPD-type or RFLP-type data.
4
DNA sequence characters
Alignment = hypothesizing a homology relationship for each site
Sequence comparison: BLAST search in GenBank
Coding sequence: blastn, blastx
Non-coding DNA: blastn
The phylogenetic analysis of DNA sequences requires the identification of homology among genes and among positions (sites) of genes or alleles in different individuals or species. Aligning DNA sequences means hypothesizing a homology relationship for each nucleotide position (Mindell, 1991). Alignments are relatively easy for protein-coding genes. Non-protein-coding sequences such as the mtDNA control region (also called D-loop) and rDNA genes usually accumulate insertions or deletions because there is no selective pressure to maintain a reading frame (see Class I, Slide 29). Therefore, the alignment of these types of DNA sequences usually requires the addition of gaps indicating positions where an insertion or deletion event took place. These sites are called indels (see Class I, Slide 30).
In practice, when we obtain new DNA sequences for our species or populations, the first validation of a newly obtained sequence is its comparison to similar sequences deposited in a database, e.g. GenBank, using a BLAST search. This search engine recovers the most similar sequences in the database. If we are working on a protein-coding gene we can do either of two searches: blastn, which recovers matches between our DNA sequence (called the QUERY) and other DNA sequences in the database (the subjects); or blastx, which translates our DNA sequence and recovers matching protein sequences. If our sequence is non-coding, such as rDNA, tRNA, ITS or D-loop, we use the first option (blastn). The following slide shows an example of a GenBank BLAST search.
Reference: Mindell DP (1991). Aligning DNA sequences: homology and phylogenetic weighting. In: Phylogenetic Analysis of DNA Sequences, ed. by Miyamoto MM and Cracraft J. Oxford University Press.
5
BLAST search results
Sequences producing significant alignments:                       Score (Bits)   E Value
gi| |dbj|AB | Mantella baroni mitochondrial ND                                   e-18
gi|343991|dbj|D |FRGMTURF2 Rana catesbeiana mitochondri                          e-17
gi| |gb|AF |AF Rana sylvatica NADH dehydr                                        e-16
Slide callouts: GenBank accession numbers for the sequence; species that match the query; the lower the E-value, the better the alignment.
Blastn
This slide shows an example of a blastn search in GenBank using a mtDNA ND2 fragment of the frog Microbatrachella capensis. The list shows the species that align best to our DNA sequence. All the species are frogs, so we can be confident in our result and discard the possibility of contamination (alignment to, e.g., fly or human DNA would imply contamination). Clicking on the GenBank accession number gives the complete information for that sequence in the database. Clicking on the score gives the specific alignment to that species. This is shown in Slide 6.
6
BLAST search results
>gi| |dbj|AB | Mantella baroni mitochondrial ND5, ND1, ND2 genes for NADH dehydrogenase subunit 5, NADH dehydrogenase subunit 1, NADH dehydrogenase subunit 2, complete cds
Length=10814
Score = 101 bits (51), Expect = 3e-18
Identities = 99/115 (86%), Gaps = 0/115 (0%)
Strand=Plus/Minus

Query  TTAGTTGAGGATTAAATTTTAGGATAATAACTATTCAGCCGAGGTGGCTGATGGAAGAAA  510
       |||||||||||||||||||||  ||||||| ||||||||| ||||| |  |||||||| |
Sbjct  TTAGTTGAGGATTAAATTTTAAAATAATAAGTATTCAGCCCAGGTGACCAATGGAAGAGA

Query  AAGCTAAAATTTTACGTAGTTGTGTTTGGCTAATGCCGCCTCATCCGCCTACAAG  565
       | |||| ||||||||||||||| |||||| |||| || ||||| || ||||||||
Sbjct  AGGCTATAATTTTACGTAGTTGAGTTTGGTTAATACCCCCTCAACCTCCTACAAG

Slide callouts: description of the genes contained in the sequence with this accession number; strands aligned; 5’ end; alignment.
Blastn result
This is the view we get after clicking on the bit score. We see the alignment of our sequence to the sequence deposited in GenBank, with additional information on the matching fragment. The alignment shows the proportion of matches, the matches in a graphical view, and which strand of the coding sequence was aligned (or sequenced). In this case we see Plus/Minus, indicating that we sequenced the strand opposite to the one deposited in GenBank. The next step, if working with coding regions, is to obtain the reverse complement of this sequence and translate it, to confirm the protein-sequence identity and, if needed, the position of the mutations (whether synonymous or non-synonymous).
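The reverse-complement and translation step described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are hypothetical, and the codon table is assumed to be the vertebrate mitochondrial code (NCBI translation table 2), chosen because the example gene (ND2) is mitochondrial. The correct reading frame has to be found by inspection.

```python
# Vertebrate mitochondrial genetic code (NCBI translation table 2), in TCAG order.
# Assumption: this table suits the frog mtDNA ND2 example; nuclear genes need table 1.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, frame=0):
    """Translate complete codons starting at `frame` (0, 1 or 2); unknown codons give 'X'."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Usage: the start of the query from the slide, turned to the other strand
# and translated in each of the three frames.
fragment = reverse_complement("TTAGTTGAGGATTAAATTTTAGGATAATAAC")
for frame in range(3):
    print(frame, translate(fragment, frame))
```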
7
Phylogenetic reconstruction: Distance methods
[Diagram: a 5 x 7 data matrix (elements 1 to 5 by characters C1 to C7) is converted by a distance (similarity / dissimilarity) criterion into a 5 x 5 matrix, from which a dendrogram is built.]
Distance methods start from a matrix of individuals (or independent variables) x characters (or dependent variables). This matrix is, in turn, transformed into a matrix that contains information on the similarity or distance between the independent variables (individuals). Next, we will see a few examples of distance criteria for binary and DNA data.
8
Distance criteria for binary data
Jaccard’s coefficient: $J = \dfrac{a}{a + b + c}$
a = bands common to both samples
b = bands exclusive to the first sample
c = bands exclusive to the second sample
Manhattan distance between P1 (x1, y1) and P2 (x2, y2): $M = |x_1 - x_2| + |y_1 - y_2|$
Euclidean distance: $E = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
Distance methods for binary data
Among the most widely used phenetic measures for binary data are Jaccard’s coefficient and the Manhattan distance. The first is simply the proportion of shared elements over the total, computed for every pair of elements, $J = a / (a + b + c)$. As written, J measures similarity; the corresponding distance is $1 - J$.
Manhattan distance: the distance between two points P1 and P2 is the sum of the absolute differences between their coordinates (x1, y1 and x2, y2). The yellow, blue and red lines in the picture have the same length. The diagonal represents the Euclidean distance, which joins any two points in space along a straight line following the formula above. In the picture, the pink line represents the Euclidean distance between the elements P1 and P2.
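As a small worked example of the three measures on this slide, the sketch below computes Jaccard’s coefficient for two band profiles and the Manhattan and Euclidean distances for two points. The band profiles, the coordinates and the function names are hypothetical and chosen only for illustration.

```python
import math


def jaccard(x, y):
    """Jaccard's coefficient a / (a + b + c) for two 0/1 band profiles.

    Undefined (division by zero) when neither profile has any band.
    """
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # bands shared by both
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)  # bands only in x
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)  # bands only in y
    return a / (a + b + c)


def manhattan(p1, p2):
    """Sum of the absolute coordinate differences."""
    return sum(abs(u - v) for u, v in zip(p1, p2))


def euclidean(p1, p2):
    """Straight-line distance between two points."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(p1, p2)))


# Hypothetical RAPD-type profiles: a = 2, b = 1, c = 1
sample_1 = [1, 1, 0, 1, 0, 0]
sample_2 = [1, 0, 0, 1, 1, 0]
similarity = jaccard(sample_1, sample_2)
print(similarity, 1 - similarity)   # similarity and the derived distance

print(manhattan((1, 2), (4, 6)), euclidean((1, 2), (4, 6)))
```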
9
Distance criterion for DNA data: Models of DNA substitution
p = number of different nucleotides / total number of nucleotides
$$F_{xy} = \begin{pmatrix} f_{AA} & f_{AC} & f_{AG} & f_{AT} \\ f_{CA} & f_{CC} & f_{CG} & f_{CT} \\ f_{GA} & f_{GC} & f_{GG} & f_{GT} \\ f_{TA} & f_{TC} & f_{TG} & f_{TT} \end{pmatrix} = \begin{pmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \\ m & n & o & p \end{pmatrix}$$
Distances with nucleotide data
The simplest distance measure is p, the number of nucleotides that differ between 2 sequences divided by the total number of nucleotides compared. In this slide we introduce models of DNA base change. For this we use a 4 x 4 divergence matrix $F_{xy}$ that shows the relative frequency of each nucleotide (or amino acid) pair in a given alignment between two sequences X and Y. The entries are labeled a to p, row by row, for use in the formulas of the next slide.
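The divergence matrix and the p distance can be computed directly from a pairwise alignment. The following is a minimal sketch under stated assumptions: the two sequences are hypothetical, the function names are not from the slides, and positions containing gaps or ambiguous bases are simply skipped.

```python
BASES = "ACGT"


def divergence_matrix(x, y):
    """Relative frequency of each nucleotide pair (row: base in X, column: base in Y)."""
    pairs = [(a, b) for a, b in zip(x.upper(), y.upper()) if a in BASES and b in BASES]
    n = len(pairs)
    return {a: {b: sum(1 for p in pairs if p == (a, b)) / n for b in BASES} for a in BASES}


def p_distance(x, y):
    """Proportion of compared sites that differ: 1 minus the diagonal of F."""
    f = divergence_matrix(x, y)
    return 1.0 - sum(f[b][b] for b in BASES)


# Hypothetical aligned sequences differing at the last two sites
seq_x = "ACGTACGTAC"
seq_y = "ACGTACGTTT"
f = divergence_matrix(seq_x, seq_y)
print(f["A"]["T"], f["C"]["T"])   # the two off-diagonal entries that are non-zero
print(p_distance(seq_x, seq_y))
```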
10
Models of DNA substitution
Jukes and Cantor (equal rates): $D = 1 - (a + f + k + p)$, $d_{xy} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$
F81 (unequal base frequencies): $B = 1 - (\pi_A^2 + \pi_C^2 + \pi_G^2 + \pi_T^2)$, $d_{xy} = -B\ln\left(1 - \frac{D}{B}\right)$
K2P: $P = c + h + i + n$ (transitions), $Q = b + d + e + g + j + l + m + o$ (transversions), $d_{xy} = \frac{1}{2}\ln\frac{1}{1 - 2P - Q} + \frac{1}{4}\ln\frac{1}{1 - 2Q}$
Models of DNA substitution
Jukes and Cantor (J&C) model: assumes that the substitution rate is the same for all possible pairs of nucleotides. D is the proportion of sites that differ, obtained from the diagonal of the divergence matrix. The J&C distance is defined only while D is below 3/4, because the argument of the logarithm must remain positive; if this value is exceeded, the distance becomes undefined.
F81 relaxes the condition of equal base frequencies: B is computed from the frequencies $\pi_A, \pi_C, \pi_G, \pi_T$ of the four bases and replaces the constant 3/4 of the J&C formula.
The K2P model accounts for the difference between transition and transversion rates (see Class I, Slide 28). P is the proportion of sites showing a transition and Q the proportion showing a transversion.
There are other, more complex models of substitution: a variation of K2P that takes into account unequal base frequencies (the HKY model), models with unequal substitution rates for transitions between purines and between pyrimidines (e.g. Tamura and Nei 1993), and further models that derive from the basic ones presented here.
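The Jukes and Cantor and K2P corrections are short enough to evaluate by hand or in code. The sketch below is only an illustration: the function names and the input proportions are hypothetical, and a `ValueError` is raised when the logarithm would be undefined.

```python
import math


def jc69_distance(d):
    """Jukes-Cantor distance from the proportion of differing sites D."""
    if d >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for D >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


def k2p_distance(p, q):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    if 1.0 - 2.0 * p - q <= 0.0 or 1.0 - 2.0 * q <= 0.0:
        raise ValueError("K2P distance is undefined for these values of P and Q")
    return 0.5 * math.log(1.0 / (1.0 - 2.0 * p - q)) + 0.25 * math.log(1.0 / (1.0 - 2.0 * q))


# Hypothetical example: 10% of sites differ, half transitions and half transversions
print(jc69_distance(0.10))
print(k2p_distance(0.05, 0.05))
```

Both corrected distances come out slightly larger than the uncorrected proportion 0.10, which is the expected effect of accounting for multiple substitutions at the same site.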
11
Distance criteria for diploid data
Nei 1972: $J_x = \sum_i x_i^2$, $J_y = \sum_i y_i^2$, $J_{xy} = \sum_i x_i y_i$, $D_n = -\ln\dfrac{J_{xy}}{\sqrt{J_x J_y}}$
Cavalli-Sforza 1967: $D_{arc} = \sqrt{\dfrac{1}{L}\sum (2\theta/\pi)^2}$, with $\theta = \cos^{-1}\sum_i \sqrt{x_i y_i}$ and L the number of loci
Distances for diploid data
Nei’s 1972 distance is based on an infinite alleles model of mutation, in which there is a rate of neutral mutation and each mutation produces a completely new allele. It is assumed that all loci have the same rate of neutral mutation, and that the genetic variability initially in the population is at equilibrium between mutation and genetic drift, with the effective population size of each population remaining constant. It is described by the formula $D_n = -\ln I$, with $I = J_{xy} / \sqrt{J_x J_y}$, where $x_i$ and $y_i$ are the frequencies of allele i in taxa x and y. I varies between 0 and 1 and $D_n$ varies between 0 and infinity. This distance is influenced by within-taxon heterozygosity.
The Cavalli-Sforza distance is a Euclidean-type measure that assumes that there is no mutation and that all gene frequency changes are due to genetic drift alone. However, it does not assume that population sizes have remained constant and equal in all populations. It copes with changing population size by having expectations that rise linearly not with time, but with the sum over time of 1/N, where N is the effective population size. Thus, if population size doubles, genetic drift will take place more slowly, and the genetic distance is expected to rise only half as fast with respect to time. This measure overcomes the limitations of the Nei 1972 distance.
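A minimal sketch of Nei’s 1972 distance follows. It assumes, as is commonly done, that $J_x$, $J_y$ and $J_{xy}$ are averaged over loci when several loci are available; the allele frequencies and the function name are hypothetical.

```python
import math


def nei_1972(x, y):
    """Nei's standard genetic distance.

    `x` and `y` are lists of loci; each locus is a list of allele frequencies
    given in the same allele order for both populations.
    """
    n_loci = len(x)
    jx = sum(sum(f * f for f in locus) for locus in x) / n_loci
    jy = sum(sum(f * f for f in locus) for locus in y) / n_loci
    jxy = sum(sum(fx * fy for fx, fy in zip(lx, ly)) for lx, ly in zip(x, y)) / n_loci
    if jxy == 0.0:
        return math.inf           # no shared alleles: identity I = 0
    identity = jxy / math.sqrt(jx * jy)
    return -math.log(identity)


# Hypothetical populations scored at two loci
pop_x = [[1.0, 0.0], [0.6, 0.4]]
pop_y = [[0.5, 0.5], [0.6, 0.4]]
print(nei_1972(pop_x, pop_y))
```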
12
Phylogenetic reconstruction criteria for distance data
Ultrametric tree (UPGMA), taxa A, B, C, branches v1 to v4. Properties:
$d_{AB} = v_1 + v_2 + v_3$
$d_{AC} = v_1 + v_2 + v_4$
$d_{BC} = v_3 + v_4$
$v_3 = v_4$
$v_1 = v_2 + v_3 = v_2 + v_4$
Additive tree (NJ), taxa A, B, C, D, branches v1 to v5. Properties:
$d_{AB} = v_1 + v_2$
$d_{AC} = v_1 + v_3 + v_4$
$d_{AD} = v_1 + v_3 + v_5$
$d_{BC} = v_2 + v_3 + v_4$
$d_{BD} = v_2 + v_3 + v_5$
$d_{CD} = v_4 + v_5$
Phylogenetic reconstruction: criteria for distance data
We present only two main methods of phylogenetic reconstruction from distance matrices: one results in an ultrametric tree and the other in an additive tree. The first is exemplified by UPGMA (unweighted pair group method using arithmetic averages) and the second by neighbor-joining (NJ). UPGMA trees assume a root, equal distances from any two taxa to their common ancestor, and a constant rate of evolution. In NJ trees, the distance between two taxa is the sum of the branches that connect them, and no assumption is made about rooting. The lengths of the branches in additive trees represent evolutionary distances.
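Whether a distance matrix fits an ultrametric or an additive tree can be checked with two standard conditions, which the slides do not state explicitly: the three-point condition (in every triple the two largest distances are equal) and the four-point condition (in every quadruple the two largest of the three pairwise sums are equal). The sketch below applies them to distances built from hypothetical branch lengths on the two trees of this slide.

```python
from itertools import combinations


def dist(d, a, b):
    """Look up a symmetric distance stored under an unordered pair."""
    return d[frozenset((a, b))]


def is_ultrametric(d, taxa, tol=1e-9):
    """Three-point condition: the two largest distances of every triple are equal."""
    for a, b, c in combinations(taxa, 3):
        _, middle, largest = sorted([dist(d, a, b), dist(d, a, c), dist(d, b, c)])
        if largest - middle > tol:
            return False
    return True


def is_additive(d, taxa, tol=1e-9):
    """Four-point condition: the two largest pairwise sums of every quadruple are equal."""
    for a, b, c, e in combinations(taxa, 4):
        sums = sorted([
            dist(d, a, b) + dist(d, c, e),
            dist(d, a, c) + dist(d, b, e),
            dist(d, a, e) + dist(d, b, c),
        ])
        if sums[2] - sums[1] > tol:
            return False
    return True


# Ultrametric tree with hypothetical branches v1 = 2, v2 = 1, v3 = v4 = 1
ultra = {frozenset("AB"): 4, frozenset("AC"): 4, frozenset("BC"): 2}
# Additive tree with hypothetical branches v1..v5 = 1, 2, 3, 4, 5
addit = {frozenset("AB"): 3, frozenset("AC"): 8, frozenset("AD"): 9,
         frozenset("BC"): 9, frozenset("BD"): 10, frozenset("CD"): 9}

print(is_ultrametric(ultra, "ABC"))                              # True
print(is_additive(addit, "ABCD"), is_ultrametric(addit, "ABCD"))  # True False
```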
13
Maximum Likelihood
$L_D = \Pr(D \mid H)$
[Diagram: an unrooted tree of four sequences and the same tree after rooting at an internal node (nodes numbered 1 to 6). Site j is highlighted in the alignment:
1 C….GGACACGTTTA….C
2 C….AGACACCTCTA….C
3 C….GGATAAGTTAA….C
4 C….GGATAGCCTAG….C
$L_j$ = Prob(first reconstruction of ancestral states) + Prob(second reconstruction) + …]
$L = L_1 \times L_2 \times L_3 \times \dots \times L_N = \prod_j L_j$
$\ln L = \ln L_1 + \ln L_2 + \dots + \ln L_N = \sum_j \ln L_j$
Maximum likelihood methods
This method evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model of the evolutionary process (see Slide 10, this class) would give rise to the observed data. The result with the highest likelihood is the preferred one. More formally, given some data D and a hypothesis H, the likelihood of the data is $L_D = \Pr(D \mid H)$, the probability of obtaining D given H. The likelihood for a particular site is the sum of the probabilities of every possible reconstruction of ancestral states, given the model of substitution used. The likelihood of a tree is the product of the likelihoods at each site. In practice it is evaluated by summing the logs of the site likelihoods, and reported as the log likelihood of the tree.
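To make the sum over ancestral states and the sum of site log likelihoods concrete, the sketch below treats the smallest possible case: two sequences joined at one ancestral node, under the Jukes and Cantor model. The branch lengths (in expected substitutions per site) and the function names are hypothetical; real programs apply the same idea recursively over the whole tree.

```python
import math

BASES = "ACGT"


def jc_prob(i, j, t):
    """Jukes-Cantor probability of observing base j after time t, starting from base i."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def site_likelihood(x, y, t1, t2):
    """Sum over the four possible ancestral states at the node joining the two tips."""
    return sum(0.25 * jc_prob(anc, x, t1) * jc_prob(anc, y, t2) for anc in BASES)


def log_likelihood(seq1, seq2, t1, t2):
    """ln L is the sum of ln L_j over all sites."""
    return sum(math.log(site_likelihood(x, y, t1, t2)) for x, y in zip(seq1, seq2))


# The first two sequence fragments of the slide, with hypothetical branch lengths
print(log_likelihood("GGACACGTTTA", "AGACACCTCTA", 0.1, 0.2))
```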
14
Hypothesis testing: Likelihood ratio test
$\Delta = \log L_1 - \log L_0$
$2\Delta$ follows a $\chi^2$ distribution with d.f. = number of sequences in the tree - 2 (rate variation), or d.f. = difference in the number of parameters between H1 and H0 (appropriate substitution model)
Hypothesis testing: Likelihood Ratio Test (LRT)
We can test alternative hypotheses concerning the same data using the LRT. Because likelihoods are usually very small, we use log likelihoods. First, we define a null hypothesis H0 and an alternative hypothesis H1. Then $\Delta = \log L_1 - \log L_0$, where $L_1$ and $L_0$ are the maximum likelihoods for the alternative and the null hypothesis respectively. For nested hypotheses, $2\Delta$ approximately follows a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters between the two hypotheses.
The LRT can be used to test a model of substitution or rate variation. In the latter case, we test whether an ultrametric (clock-constrained) tree fits the data significantly worse than an unconstrained model-based tree. If it does not, the molecular clock hypothesis cannot be rejected. If it does, the sequences evolve at different rates: an ultrametric tree would be a poor representation of the relationships among sequences, and an additive tree would be more appropriate.
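The test can be sketched with the standard library alone. The log likelihood values below are hypothetical, the function names are not from the slides, and the $\chi^2$ tail probability is computed from a closed-form recurrence instead of a statistics package.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution with integer d.f."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # exact tail for 2 d.f.
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # exact tail for 1 d.f.
    while a < df / 2.0 - 1e-9:                 # each step adds 2 degrees of freedom
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return the statistic 2 * (ln L1 - ln L0) and its approximate p-value."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    return statistic, chi2_sf(statistic, df)


# Hypothetical nested models differing by one parameter
statistic, p_value = likelihood_ratio_test(-1234.5, -1230.0, df=1)
print(statistic, p_value)
```

With these made-up numbers the statistic is 9.0 and the p-value falls below 0.01, so the simpler (null) model would be rejected.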
15
Bootstrapping: How well supported are the groups?
After constructing the tree, we need to know the reliability of the reconstructed branches or groups. One of the most widely used methods to assess how well supported a particular branch is, is non-parametric bootstrapping (or just bootstrapping). It consists of resampling the characters of the matrix with replacement to create new matrices (pseudoreplicates) of the same size as the original. The proportion (or frequency) at which a branch is found upon analysis of the pseudoreplicates is called “bootstrap support”. We generally accept bootstrap support over 70%; lower values indicate poor support.
In the example, we see a phylogenetic reconstruction of the trumpetfish genus Aulostomus. All clades show high bootstrap support. The slide also shows a corresponding network superimposed on geography.
Image: trumpetfish
Reference: Bowen B. W., Bass A. L., Rocha L. A., Grant W. S., Robertson D. R. (2001) Phylogeography of the trumpetfishes (Aulostomus): ring species complex on a global scale. Evolution, –1039.
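The resampling step can be sketched as follows. This is a minimal illustration under stated assumptions: `has_group` is a hypothetical callable standing in for "build a tree from this pseudoreplicate and report whether the group of interest appears in it", which in practice is done by the phylogenetics program.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns (characters) with replacement, keeping the same size."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, has_group, n_replicates=100, seed=0):
    """Proportion of pseudoreplicates in which `has_group(replicate)` is true."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_replicates)
        if has_group(bootstrap_replicate(alignment, rng))
    )
    return hits / n_replicates


# Hypothetical alignment of four short sequences
alignment = ["ATATT", "ATCGT", "GCAGT", "GCCGT"]
print(bootstrap_replicate(alignment, random.Random(1)))
```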
16
Maximum Parsimony: Minimize tree length
To obtain rooted trees (and character polarity) use an outgroup. The ingroup is assumed to be monophyletic.
Sequences: 1 ATATT, 2 ATCGT, 3 GCAGT, 4 GCCGT
[Diagram: two reconstructions of the first site (A, A, G, G) on the tree ((1,2),(3,4)), one requiring 1 change and one requiring 5 changes.]
Maximum Parsimony
The basic concept of this method is that phylogenetic trees are reconstructed using the minimum number of steps, thus reducing the risk of homoplasy (equal state but unequal origin, e.g. an A at a site that originated from T in one individual and from C in another). By using an outgroup the tree is rooted, and we assume that the ingroup taxa are monophyletic. The outgroup gives polarity to the character changes in the ingroup. The example shows 4 sequences, the unrooted tree ((1,2),(3,4)), and two possible reconstructions of the evolution of the first site. Under the parsimony criterion, the reconstruction that requires 1 change is preferred over the one that requires 5 changes. Next we will see the steps required for the other sites.
17
Maximum Parsimony: example
[Diagram: reconstructions on the tree ((1,2),(3,4)) for site 2 (T, T, C, C), site 3 (A, C, A, C), site 4 (T, G, G, G) and site 5 (no changes).]
Tree length: $L = \sum_{i=1}^{k} l_i$
Maximum Parsimony: example
Here we see the continuation of the example of the previous slide, the tree ((1,2),(3,4)). Site 2 requires 1 step, whereas site 3 has two equally parsimonious reconstructions of two steps each. Site 4 requires 1 step. No changes are required for site 5. The total length of a tree is the sum of the changes required over all sites, that is, the sum of the lengths $l_i$ of all k sites.
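Counting the minimum number of changes per site can be automated with Fitch's algorithm, which is a standard way of doing it although the slides do not name it. The sketch below uses the four sequences of the example, numbered 1 to 4 in the order shown, and writes each topology as nested tuples; the function names are hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one site."""
    if not isinstance(tree, tuple):                 # a tip: its observed state, no change
        return {states[tree]}, 0
    left_states, left_changes = fitch(tree[0], states)
    right_states, right_changes = fitch(tree[1], states)
    shared = left_states & right_states
    if shared:                                      # children agree: no extra change
        return shared, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1


def tree_length(tree, seqs):
    """Sum of the minimum number of changes over all sites (L = sum of l_i)."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in seqs.items()})[1]
        for i in range(n_sites)
    )


seqs = {1: "ATATT", 2: "ATCGT", 3: "GCAGT", 4: "GCCGT"}
for topology in [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]:
    print(topology, tree_length(topology, seqs))
# ((1, 2), (3, 4)) 5
# ((1, 3), (2, 4)) 6
# ((1, 4), (2, 3)) 7
```

The per-site counts for ((1,2),(3,4)) are 1, 1, 2, 1 and 0, matching the steps described for the slide's example.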
18
Maximum parsimony: example
[Table: number of changes required at each site, and in total, for the three possible topologies ((1,2),(3,4)), ((1,3),(2,4)) and ((1,4),(2,3)); the phylogenetically informative sites are marked.]
Maximum Parsimony: example
Here we see which of all the possible tree topologies would be accepted by the MP criterion. The table shows the three possible topologies for 4 taxa and counts the total number of mutations (or steps) required by each hypothesised topology. The minimal length is recovered for the topology of our example, therefore the tree ((1,2),(3,4)) is accepted as the most parsimonious. Sites 4 and 5 require the same number of changes under every topology. Invariant sites, and sites where only one sequence has a different nucleotide, are phylogenetically uninformative. Phylogenetically informative sites (here sites 1 to 3) have at least two different nucleotides, each present in at least two sequences.
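The usual criterion for a parsimony-informative site (at least two different states, each present in at least two sequences) is easy to check column by column. A minimal sketch using the four sequences of the example follows; the function name is hypothetical.

```python
from collections import Counter


def is_informative(column):
    """True when at least two states each occur in at least two sequences."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


seqs = ["ATATT", "ATCGT", "GCAGT", "GCCGT"]
informative = [i + 1 for i, column in enumerate(zip(*seqs)) if is_informative(column)]
print(informative)   # [1, 2, 3]
```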
19
Networks
Phylogenetic representation allowing reticulation
More appropriate for intraspecific data
Ancestor is alive
Hybridization, recombination, horizontal transfer, polyploidization
[Diagram: a reticulate network with nodes numbered 1 to 7. The sequences agct and acat evolve independently from the root and combine into the hybrid sequence ac ct.]
Networks
Some biological phenomena are better represented by networks, which allow cycles and reticulation in the graphic representation. These phenomena include recombination events (see Class I, Slide 6), hybridisation between lineages, horizontal gene transfer (transfer of DNA between species) by retrotransposition (see Class I, Slide 22), and polyploidization (see Class I, Slide 32). Other causes of reticulation in some data are homoplasy and recurrent mutations. Networks can be more appropriate for describing intraspecific phylogenies, where the ancestor is still alive, and can present alternative evolutionary paths in the form of cycles. See examples in Class II, Slides 26 and 28, and Class III, Slide 8.
Among the most widely used networks are minimum spanning trees: the shortest subset of edges that keeps the graph in one connected component. They are used, e.g., in wiring by telephone companies. There are several algorithms available to connect the elements.
The diagram in this slide depicts the simplest possible reticulation that generates a new lineage by hybrid speciation. The nodes are numbered for easy reference. Here the otherwise independent lineages that were generated by a normal speciation event at the root of the tree (leading to the independently evolving black and green DNA sequences) have hybridized. At nodes 2 and 3 the lineages hybridize and combine their genetic information in node 4. The hybrid lineage continues in node 6 and contains two different, independent evolutionary histories.
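A minimum spanning tree can be built with Kruskal's algorithm, one of the several available algorithms mentioned above. The sketch below connects the three sequences shown in the diagram, using the number of differing sites as edge length; this choice of distance and the function names are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def minimum_spanning_tree(nodes, distance):
    """Kruskal's algorithm: add the shortest edges that do not close a cycle."""
    parent = {node: node for node in nodes}

    def find(node):                       # representative of the node's component
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    edges = sorted(
        (distance(a, b), a, b)
        for i, a in enumerate(nodes)
        for b in nodes[i + 1:]
    )
    tree = []
    for weight, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:              # joining two components never makes a cycle
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


haplotypes = ["agct", "acat", "acct"]
mst = minimum_spanning_tree(haplotypes, hamming)
print(mst)
```

In this small example the hybrid sequence acct sits one step from each parental sequence, so it becomes the central node of the tree.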
20
Multivariate clustering
[Diagram: a 5 x 7 data matrix (elements 1 to 5 by characters C1 to C7) is converted by a similarity criterion (correlations) into a 7 x 7 matrix. The eigenvectors with the highest eigenvalues are calculated and the data are projected onto the new axes (eigenvectors): X, 1st axis; Y, 2nd axis; Z, 3rd axis.]
Multivariate Analysis
Multivariate statistics, or multivariate statistical analysis, describes a collection of procedures that involve the observation and analysis of more than one statistical variable at a time. Multivariate methods aim to detect general patterns and to indicate potentially interesting relationships in data. The most valuable methods are those that display their results graphically, rather than just examining statistical properties derived from the data.
Among the most widely used methods of multivariate analysis are PCA (Principal Components Analysis) and MDS (Multidimensional Scaling). The basis of these methods is that a matrix of phylogenetic distances between n OTUs (operational taxonomic units, or independent variables) can be used to position the OTUs in an (n-1)-dimensional space. Ordination methods, such as PCA or MDS, group taxa with shared properties. The different coordinates usually correlate with different properties of the OTUs and, if the ordination was calculated from quantitative or multistate characters, such as sequences or alleles, it is easy to calculate the correlation between each dimension and the original data.
The major trends in the data set are expressed by the first few coordinates, while the finer details between closely related taxa are often hidden in minor coordinates. Over two-thirds of the information in a well-structured data set, such as the phylogenetic relationships of a group of OTUs with clear lineages, will be represented by the first three variables.
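The two steps named on the slide (calculate the eigenvector with the highest eigenvalue, then project the data onto it) can be sketched without external libraries. This is a simplified illustration: it uses the covariance matrix instead of correlations, finds only the first axis by power iteration, and runs on a hypothetical data set; statistical packages compute all axes at once.

```python
import math


def covariance_matrix(rows):
    """Center each variable and return (centered data, covariance matrix)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
        for i in range(m)
    ]
    return centered, cov


def first_component(cov, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


# Hypothetical data: 4 elements measured for 2 perfectly correlated variables
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered, cov = covariance_matrix(data)
axis = first_component(cov)
scores = [sum(c[j] * axis[j] for j in range(len(axis))) for c in centered]
print(axis)     # direction of the 1st axis
print(scores)   # position of each element along the 1st axis
```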