DNA variation in Ecology and Evolution
IV - Clustering methods and Phylogenetic reconstruction
Maria Eugenia D’Amato
BCB 705: Biodiversity
Organization of the presentation
  Phylogenetic reconstruction: Distance, ML, MP
  Networks
  Multivariate analysis
Characters
  Properties: independent, homologous
  Types: continuous; discrete (binary, multistate)
DNA sequence characters
  Alignment = a hypothesis of homology for each site
  Sequence comparison: BLAST search against GenBank
    Coding sequence: blastn, blastx
    Non-coding DNA: blastn
Blast search results
Sequences producing significant alignments:                  Score (Bits)  E Value
gi| |dbj|AB | Mantella baroni mitochondrial ND                             e-18
gi|343991|dbj|D |FRGMTURF2 Rana catesbeiana mitochondri                    e-17
gi| |gb|AF |AF Rana sylvatica NADH dehydr                                  e-16
Slide labels:
  The lower the E-value, the better the alignment
  GenBank accession numbers for the sequence
  Species that match the query
Blast search results
>gi| |dbj|AB | Mantella baroni mitochondrial ND5, ND1, ND2 genes for NADH dehydrogenase subunit 5, NADH dehydrogenase subunit 1, NADH dehydrogenase subunit 2, complete cds
Length=10814
Score = 101 bits (51), Expect = 3e-18
Identities = 99/115 (86%), Gaps = 0/115 (0%)
Strand=Plus/Minus

Query  451  TTAGTTGAGGATTAAATTTTAGGATAATAACTATTCAGCCGAGGTGGCTGATGGAAGAAA  510
            |||||||||||||||||||||  ||||||| ||||||||| ||||| |  |||||||| |
Sbjct       TTAGTTGAGGATTAAATTTTAAAATAATAAGTATTCAGCCCAGGTGACCAATGGAAGAGA

Query  511  AAGCTAAAATTTTACGTAGTTGTGTTTGGCTAATGCCGCCTCATCCGCCTACAAG  565
            | |||| ||||||||||||||| |||||| |||| || ||||| || ||||||||
Sbjct       AGGCTATAATTTTACGTAGTTGAGTTTGGTTAATACCCCCTCAACCTCCTACAAG

Slide labels: description of the genes contained in the sequence with this accession number (header line); strands aligned (Strand line); 5’ end; alignment (Query/Sbjct block).
Phylogenetic reconstruction: distance methods
  5 X 7 data matrix (characters C1, C2, ... in columns)
  -> distance criterion (similarity / dissimilarity criterion)
  -> 5 x 5 distance matrix
  -> dendrogram
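Distance criteria for binary data
  Jaccard's coefficient: $J = \frac{a}{a + b + c}$
    a = bands common to samples a and b
    b = bands exclusive to sample a
    c = bands exclusive to sample b
    J measures similarity; the corresponding distance is $1 - J$.
  Distances between two points $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$
    Manhattan distance: $M = |x_1 - x_2| + |y_1 - y_2|$
    Euclidean distance: $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$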
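The three criteria on this slide can be tried out on small inputs. The following is a minimal sketch in Python; the function names and the two 0/1 band profiles are hypothetical examples, not data from the lecture.

```python
import math

def jaccard_similarity(x, y):
    """Jaccard coefficient a / (a + b + c) for two 0/1 band profiles."""
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # bands in both
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)  # bands only in x
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)  # bands only in y
    if a + b + c == 0:
        raise ValueError("no bands present in either sample")
    return a / (a + b + c)

def manhattan(p1, p2):
    """Sum of absolute coordinate differences."""
    return sum(abs(u - v) for u, v in zip(p1, p2))

def euclidean(p1, p2):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(p1, p2)))

# Hypothetical band profiles (1 = band present, 0 = band absent).
sample_a = [1, 1, 0, 1, 0, 1, 0]
sample_b = [1, 0, 0, 1, 1, 1, 0]

j = jaccard_similarity(sample_a, sample_b)   # a = 3, b = 1, c = 1
jaccard_distance = 1 - j
```

Shared absences (0 in both samples) do not enter the Jaccard coefficient, which is why it is often preferred for presence/absence band data.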
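Distance criterion for DNA data: models of DNA substitution
  p = number of different nucleotides / total number of nucleotides
  Divergence matrix between sequences x and y:
$$
F_{xy} =
\begin{pmatrix}
f_{AA} & f_{AC} & f_{AG} & f_{AT} \\
f_{CA} & f_{CC} & f_{CG} & f_{CT} \\
f_{GA} & f_{GC} & f_{GG} & f_{GT} \\
f_{TA} & f_{TC} & f_{TG} & f_{TT}
\end{pmatrix}
=
\begin{pmatrix}
a & b & c & d \\
e & f & g & h \\
i & j & k & l \\
m & n & o & p
\end{pmatrix}
$$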
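As an illustration, the divergence matrix and the p distance can be computed from two aligned sequences as below. This is a sketch under assumptions: the function names and the two toy sequences are hypothetical, rows and columns follow the order A, C, G, T, and sites with gaps or ambiguous bases are simply skipped.

```python
BASES = "ACGT"

def divergence_matrix(seq_x, seq_y):
    """4 x 4 matrix of site-pattern proportions; rows = base in x, columns = base in y."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    counts = {(r, c): 0 for r in BASES for c in BASES}
    n = 0
    for r, c in zip(seq_x.upper(), seq_y.upper()):
        if r in BASES and c in BASES:      # skip gaps and ambiguity codes
            counts[(r, c)] += 1
            n += 1
    if n == 0:
        raise ValueError("no comparable sites")
    return [[counts[(r, c)] / n for c in BASES] for r in BASES]

def p_distance(seq_x, seq_y):
    """Proportion of sites that differ = 1 - sum of the diagonal of the matrix."""
    f = divergence_matrix(seq_x, seq_y)
    return 1.0 - sum(f[i][i] for i in range(4))

# Hypothetical aligned sequences: they differ at two of ten sites.
x = "ACGTACGTAC"
y = "ACGTATGTGC"
F = divergence_matrix(x, y)
p = p_distance(x, y)
```

In this toy pair one difference is a C/T change and the other an A/G change, so both off-diagonal entries that are non-zero sit in transition cells of the matrix.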
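Models of DNA substitution
  Jukes and Cantor
    $D = 1 - (a + f + k + p)$
    $d_{xy} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D\right)$
  F81 (equal rate, unequal base freqs)
    $B = 1 - (\pi_A^2 + \pi_C^2 + \pi_G^2 + \pi_T^2)$, where $\pi_i$ is the frequency of base i
    $d_{xy} = -B \ln\left(1 - \frac{D}{B}\right)$
  K2P
    $P = c + h + i + n$ (transitions)
    $Q = b + d + e + g + j + l + m + o$ (transversions)
    $d_{xy} = \frac{1}{2} \ln\left(\frac{1}{1 - 2P - Q}\right) + \frac{1}{4} \ln\left(\frac{1}{1 - 2Q}\right)$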
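The three corrections can be compared numerically. The sketch below uses the standard forms of the Jukes and Cantor, F81 and K2P distances; the function names and the input proportions are hypothetical, chosen only for illustration.

```python
import math

def jc69(D):
    """Jukes and Cantor distance from the proportion of differing sites D."""
    if not 0 <= D < 0.75:
        raise ValueError("D must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * D / 3)

def f81(D, base_freqs):
    """F81 distance: B replaces 3/4 when base frequencies are unequal."""
    B = 1 - sum(f * f for f in base_freqs)
    if not 0 <= D < B:
        raise ValueError("D must be in [0, B)")
    return -B * math.log(1 - D / B)

def k2p(P, Q):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("distance undefined for these proportions")
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))

# Hypothetical observed proportions.
d_jc = jc69(0.2)
d_f81_equal = f81(0.2, [0.25, 0.25, 0.25, 0.25])   # equal freqs: same as JC
d_k2p = k2p(0.1, 0.1)
```

With equal base frequencies B equals 3/4, so F81 reduces to Jukes and Cantor. All three corrected distances exceed the raw proportion of differences because they account for multiple substitutions at the same site.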
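Distance criteria for diploid data
  Nei 1972
    $D_n = -\ln I$, with $I = \frac{J_{xy}}{\sqrt{J_x J_y}}$
    $J_x = \sum x_i^2$, $J_y = \sum y_i^2$, $J_{xy} = \sum x_i y_i$
  Cavalli-Sforza 1967
    $D_{arc} = \sqrt{\frac{1}{L} \sum \left(\frac{2\theta}{\pi}\right)^2}$, with $\theta = \cos^{-1} \sum \sqrt{x_i y_i}$
  $x_i$ and $y_i$ are the frequencies of allele i in populations x and y; L is the number of loci.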
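Nei's 1972 distance is easy to compute from allele frequencies. The sketch below assumes each population is given as a list of loci, each locus being a list of allele frequencies in the same allele order; the function name and the frequencies are hypothetical.

```python
import math

def nei_1972(freqs_x, freqs_y):
    """Nei's standard genetic distance D = -ln(Jxy / sqrt(Jx * Jy)).

    freqs_x, freqs_y: lists of loci; each locus is a list of allele
    frequencies, with alleles in the same order in both populations.
    """
    jx = sum(x * x for locus in freqs_x for x in locus)
    jy = sum(y * y for locus in freqs_y for y in locus)
    jxy = sum(x * y
              for locus_x, locus_y in zip(freqs_x, freqs_y)
              for x, y in zip(locus_x, locus_y))
    if jxy == 0:
        return math.inf          # no shared alleles: identity is zero
    identity = jxy / math.sqrt(jx * jy)
    return -math.log(identity)

# Hypothetical single-locus, two-allele example.
pop_x = [[1.0, 0.0]]
pop_y = [[0.5, 0.5]]
d_xy = nei_1972(pop_x, pop_y)    # Jx = 1, Jy = 0.5, Jxy = 0.5
d_xx = nei_1972(pop_x, pop_x)    # identical populations
```

Summing over loci in numerator and denominator gives the same ratio as averaging over loci, since the number of loci cancels.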
Phylogenetic reconstruction criterion for distance data
  Additive tree (NJ): taxa A, B, C, D; branches v1 to v5
    dAB = v1 + v2
    dAC = v1 + v3 + v4
    dAD = v1 + v3 + v5
    dBC = v2 + v3 + v4
    dBD = v2 + v3 + v5
    dCD = v4 + v5
  Ultrametric tree (UPGMA): taxa A, B, C; branches v1 to v4
    dAB = v1 + v2 + v3
    dAC = v1 + v2 + v4
    dBC = v3 + v4
    Properties: v3 = v4; v1 = v2 + v3 = v2 + v4
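Whether a distance matrix fits an ultrametric or an additive tree can be checked with the three-point and four-point conditions. The sketch below assumes a four-taxon tree in which A, B, C and D hang on branches v1, v2, v4 and v5, joined by the internal branch v3, and a three-taxon ultrametric tree as on the slide; the branch lengths and function names are hypothetical.

```python
from itertools import combinations

def pair(d, a, b):
    """Distance between taxa a and b, regardless of order."""
    return d[frozenset((a, b))]

def is_ultrametric(d, taxa, tol=1e-9):
    """Three-point condition: in every triple the two largest distances are equal."""
    for a, b, c in combinations(taxa, 3):
        x, y, z = sorted([pair(d, a, b), pair(d, a, c), pair(d, b, c)])
        if z - y > tol:
            return False
    return True

def is_additive(d, taxa, tol=1e-9):
    """Four-point condition: in every quartet the two largest pairwise sums are equal."""
    for a, b, c, e in combinations(taxa, 4):
        sums = sorted([pair(d, a, b) + pair(d, c, e),
                       pair(d, a, c) + pair(d, b, e),
                       pair(d, a, e) + pair(d, b, c)])
        if sums[2] - sums[1] > tol:
            return False
    return True

# Additive tree with hypothetical branch lengths v1..v5.
v1, v2, v3, v4, v5 = 1.0, 2.0, 3.0, 4.0, 5.0
additive = {
    frozenset("AB"): v1 + v2,
    frozenset("AC"): v1 + v3 + v4,
    frozenset("AD"): v1 + v3 + v5,
    frozenset("BC"): v2 + v3 + v4,
    frozenset("BD"): v2 + v3 + v5,
    frozenset("CD"): v4 + v5,
}

# Ultrametric tree with hypothetical lengths: v3 = v4 and v1 = v2 + v3.
u1, u2, u3, u4 = 3.0, 1.0, 2.0, 2.0
ultrametric = {
    frozenset("AB"): u1 + u2 + u3,
    frozenset("AC"): u1 + u2 + u4,
    frozenset("BC"): u3 + u4,
}
```

Every ultrametric matrix is also additive, but not the reverse: the additive example above fails the three-point condition because its branches have unequal rates.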
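Maximum Likelihood
  Aligned sequences (sites 1 ... j ... n):
    (1) C….GGACACGTTTA….C
    (2) C….AGACACCTCTA….C
    (3) C….GGATAAGTTAA….C
    (4) C….GGATAGCCTAG….C
  [Figure: the states at site j shown on the unrooted tree and on the tree after rooting at an internal node]
  Likelihood at site j, summed over the possible states at the internal nodes:
    $L_j$ = Prob(internal states A, A) + Prob(internal states A, C) + Prob(...) + ...
  Likelihood of the tree over all sites:
    $L = L_1 \times L_2 \times L_3 \times \ldots \times L_N = \prod_j L_j$
    $\ln L = \ln L_1 + \ln L_2 + \ldots + \ln L_N = \sum_j \ln L_j$
  $L_D = \Pr(D \mid H)$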
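The site likelihood can be written out directly for four sequences. The sketch below is an illustration under assumptions: it uses the Jukes and Cantor model, the tree ((1,2),(3,4)) rooted at one internal node, and hypothetical branch lengths of 0.1; only the visible middle segments of the four sequences are used.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(i, j, t):
    """Jukes and Cantor probability of base i changing to base j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def site_likelihood(tips, t):
    """Likelihood of one site on the tree ((1,2),(3,4)).

    tips = (s1, s2, s3, s4); t = (t1, t2, t3, t4, t5), with t5 the internal branch.
    The sum runs over all states u, v at the two internal nodes.
    """
    s1, s2, s3, s4 = tips
    t1, t2, t3, t4, t5 = t
    total = 0.0
    for u, v in product(BASES, repeat=2):
        total += (0.25                      # equal base frequencies at the root node u
                  * jc_prob(u, s1, t1) * jc_prob(u, s2, t2)
                  * jc_prob(u, v, t5)
                  * jc_prob(v, s3, t3) * jc_prob(v, s4, t4))
    return total

def log_likelihood(seqs, t):
    """ln L = sum over sites of ln L_j."""
    return sum(math.log(site_likelihood(col, t)) for col in zip(*seqs))

seqs = ["GGACACGTTTA", "AGACACCTCTA", "GGATAAGTTAA", "GGATAGCCTAG"]
branch_lengths = (0.1, 0.1, 0.1, 0.1, 0.1)   # hypothetical values
ln_l = log_likelihood(seqs, branch_lengths)
```

A full analysis would optimise the branch lengths and repeat the calculation for each candidate tree; here they are fixed to keep the example short.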
Hypothesis testing
  Likelihood ratio test: $\Delta = \log L_1 - \log L_0$
  Used to test for rate variation and to choose the appropriate substitution model
  $2\Delta$ is compared with a $\chi^2$ distribution
  d.f. = number of sequences in the tree - 2; or d.f. = difference in the number of parameters between H1 and H0
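A likelihood ratio test reduces to a few lines once the two log-likelihoods are known. The sketch below compares the statistic with tabulated 5% critical values of the chi-square distribution for 1 to 5 degrees of freedom; the function name and the log-likelihood values are hypothetical.

```python
# Upper 5% critical values of the chi-square distribution, 1 to 5 d.f.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def likelihood_ratio_test(log_l0, log_l1, df):
    """Return (statistic, reject_h0) for nested models H0 and H1 at the 5% level."""
    if df not in CHI2_CRITICAL_05:
        raise ValueError("critical value not tabulated for this d.f.")
    statistic = 2.0 * (log_l1 - log_l0)
    return statistic, statistic > CHI2_CRITICAL_05[df]

# Hypothetical log-likelihoods: H1 has one more parameter than H0.
stat_1, reject_1 = likelihood_ratio_test(-1234.5, -1230.1, df=1)

# Same improvement in fit, but H1 has five extra parameters.
stat_5, reject_5 = likelihood_ratio_test(-1234.5, -1230.1, df=5)
```

The same gain in log-likelihood is significant with one extra parameter but not with five, which is the penalty for model complexity built into the test.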
Bootstrapping
  How well supported are the groups?
  [Figure: trumpet fish example]
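The idea behind the bootstrap is to resample alignment columns with replacement, rebuild the tree each time, and count how often a group reappears. The sketch below is a simplified illustration: the four five-site sequences are a toy alignment, and the tree-building step is a stand-in that picks the four-taxon topology with the smallest sum of pairwise differences (ties count as unresolved). The function names are hypothetical.

```python
import random

def resample_columns(seqs, rng):
    """Draw as many alignment columns as the original, with replacement."""
    n_sites = len(seqs[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(s[i] for i in picks) for s in seqs]

def hamming(a, b):
    """Number of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def best_quartet_tree(seqs):
    """Topology with the smallest sum of within-pair distances, or None on a tie."""
    s1, s2, s3, s4 = seqs
    sums = {
        "((1,2),(3,4))": hamming(s1, s2) + hamming(s3, s4),
        "((1,3),(2,4))": hamming(s1, s3) + hamming(s2, s4),
        "((1,4),(2,3))": hamming(s1, s4) + hamming(s2, s3),
    }
    smallest = min(sums.values())
    winners = [tree for tree, value in sums.items() if value == smallest]
    return winners[0] if len(winners) == 1 else None

def bootstrap_support(seqs, tree, replicates=1000, seed=1):
    """Proportion of bootstrap replicates in which `tree` is the unique best topology."""
    rng = random.Random(seed)
    hits = sum(best_quartet_tree(resample_columns(seqs, rng)) == tree
               for _ in range(replicates))
    return hits / replicates

alignment = ["ATATT", "ATCGT", "GCAGT", "GCCGT"]   # toy alignment
support = bootstrap_support(alignment, "((1,2),(3,4))")
```

With real data the resampled alignments would be passed to the same tree-building method used for the original analysis, and support would be reported per group on the tree.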
Maximum Parsimony
  Minimize tree length.
  To obtain rooted trees (and character polarity), use an outgroup. The ingroup is assumed to be monophyletic.
  Example sequences:
    1  ATATT
    2  ATCGT
    3  GCAGT
    4  GCCGT
  [Figure: tree with the states of the first site (A, A, G, G) mapped on it; labels: "change", "5 changes"]
Maximum Parsimony: example
  [Figure: states at sites 2, 3, 4 and 5 mapped on the tree; site 5 (T in all four sequences) requires no changes]
  Tree length: $L = \sum_{i=1}^{k} l_i$, where $l_i$ is the number of changes at site i and k is the number of sites
Maximum parsimony: example
  [Table: number of changes at each site and in total (columns: Sites, Total) for each tree (rows: ((1,2),(3,4)); ((1,3),(2,4)); ((1,4),(2,3))), with the phylogenetically informative sites marked]
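The tree lengths for the three possible four-taxon trees can be recomputed from the example sequences 1 ATATT, 2 ATCGT, 3 GCAGT, 4 GCCGT. The sketch below uses Fitch's set-intersection rule, which is one common way of counting changes; the function names are hypothetical.

```python
def fitch_pair(x, y):
    """Fitch step: intersect the state sets if possible (0 changes), else take the union (1 change)."""
    common = x & y
    return (common, 0) if common else (x | y, 1)

def site_length(states, tree):
    """Minimum number of changes at one site on the four-taxon tree ((a,b),(c,d))."""
    (a, b), (c, d) = tree
    left, cost_left = fitch_pair({states[a]}, {states[b]})
    right, cost_right = fitch_pair({states[c]}, {states[d]})
    _, cost_middle = fitch_pair(left, right)
    return cost_left + cost_right + cost_middle

seqs = {1: "ATATT", 2: "ATCGT", 3: "GCAGT", 4: "GCCGT"}
trees = {
    "((1,2),(3,4))": ((1, 2), (3, 4)),
    "((1,3),(2,4))": ((1, 3), (2, 4)),
    "((1,4),(2,3))": ((1, 4), (2, 3)),
}

n_sites = len(seqs[1])
per_site = {}   # tree name -> list of changes per site
for name, tree in trees.items():
    per_site[name] = [site_length({k: s[i] for k, s in seqs.items()}, tree)
                      for i in range(n_sites)]

lengths = {name: sum(changes) for name, changes in per_site.items()}

# A site is phylogenetically informative if it favours some trees over others.
informative = [i + 1 for i in range(n_sites)
               if len({per_site[name][i] for name in trees}) > 1]
```

Sites 4 and 5 cost the same on every tree (one change and no changes respectively), so only sites 1 to 3 discriminate between the trees, and ((1,2),(3,4)) is the shortest.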
Networks
  Phylogenetic representation allowing reticulation
  More appropriate for intraspecific data
  The ancestor is alive
  Hybridization, recombination, horizontal transfer, polyploidization
  [Figure: network connecting the sequences agct, acat and acct]
Multivariate clustering
  Data matrix with 7 characters (C1, C2, ...)
  -> similarity criterion: correlations
  -> 7 x 7 correlation matrix
  -> calculate the eigenvectors with the highest eigenvalues
  -> project the data onto the new axes (eigenvectors): X = 1st axis, Y = 2nd axis, Z = 3rd axis
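The steps on this slide can be followed on a small data set. The sketch below is an illustration under assumptions: the 5 x 3 data matrix is hypothetical, only the first axis is extracted, and power iteration stands in for a full eigen decomposition (a numerical library would normally be used). It also assumes no character is constant, since that would give a zero standard deviation.

```python
import math

def correlation_matrix(data):
    """Standardise each column, then return (z-scores, correlation matrix)."""
    n, m = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(m)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / (n - 1))
           for j in range(m)]
    z = [[(row[j] - means[j]) / sds[j] for j in range(m)] for row in data]
    corr = [[sum(z[k][i] * z[k][j] for k in range(n)) / (n - 1) for j in range(m)]
            for i in range(m)]
    return z, corr

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: eigenvector with the highest eigenvalue of a symmetric matrix."""
    m = len(matrix)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(m))
                     for i in range(m))
    return eigenvalue, v

# Hypothetical data: 5 samples (rows) x 3 characters (columns).
data = [
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 0.5],
    [4.0, 7.9, 2.0],
    [5.0, 10.1, 1.5],
    [6.0, 12.0, 0.8],
]

z, corr = correlation_matrix(data)
eigenvalue, axis1 = leading_eigenpair(corr)

# Project each sample onto the first axis.
scores = [sum(zi * vi for zi, vi in zip(row, axis1)) for row in z]
```

Further axes could be obtained by removing the first component from the correlation matrix and repeating the iteration; samples that lie close together along the leading axes form the clusters.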