Published by Krista Amsel. Modified over 6 years ago.
1
Multiple Alignment, Distance Estimation, and Phylogenetic Analysis
Database search (keyword, similarity)
Conserved regions
Multiple alignment
• find oligonucleotide primers for PCR
• predict secondary and tertiary structures of new sequences
• detect similarity between new sequences and existing sequence families
• find diagnostic patterns to characterize protein families
Distance estimation
Phylogenetic reconstruction
Function prediction
April 2, 2004 BIOS816/VBMS818
2
Distance estimation
The simplest measure: the proportion of differing sites (p), an uncorrected estimate of the number of nucleotide substitutions per site

ACTGTAGGAATCGC
:X::X:X:::::::
AATGAAAGAATCGC

p = nd / L
nd: the number of different nucleotides between the two sequences
L: the number of nucleotides compared
nd = 3, L = 14, p = 3/14 = 0.214
• It can be applied to both DNA and protein sequences
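The calculation of p can be sketched in a few lines of Python. The function name p_distance is an assumption made for illustration, and the sketch assumes two aligned sequences of equal length without gaps.

```python
def p_distance(seq1, seq2):
    """Return (nd, L, p) for two aligned, gap-free sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # nd: number of sites where the two sequences differ
    nd = sum(a != b for a, b in zip(seq1, seq2))
    L = len(seq1)
    return nd, L, nd / L


# The two sequences from the slide
nd, L, p = p_distance("ACTGTAGGAATCGC", "AATGAAAGAATCGC")
print(f"nd = {nd}, L = {L}, p = {p:.3f}")
```

The same function works for protein sequences, since it only compares characters site by site.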
3
Distance estimation
[Figure, built up over five slides: an ancestral sequence and the two observed sequences derived from it]
Ancestral: AATGTAGGAATCGC
Observed:  ACTGTAGGAATCGC
           AATGAAAGAATCGC
Each difference is a single substitution: 3 substitutions, no hidden substitution
8
Distance estimation
[Figure, built up over four slides: ancestral and observed sequences with intermediate states G and C marked on the lineages]
Ancestral: AATGTAGGAATCGC
Observed:  ACTGTAGGAATCGC
           AATGAAAGAATCGC
Intermediate states G and C: 2 substitutions on each of the two marked paths. Multiple hit!
Observed number of differences = 3 < Actual number of substitutions = 6
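A small simulation can illustrate why the observed count falls behind the actual count. This is only a sketch under simple assumptions: substitutions hit sites uniformly at random, every substitution changes the base, and the ancestral sequence is compared with one descendant (the slide compares two descendants). The names evolve and count_differences are hypothetical.

```python
import random


def evolve(ancestor, n_substitutions, rng):
    """Apply n_substitutions at random sites; each one changes the base."""
    seq = list(ancestor)
    for _ in range(n_substitutions):
        site = rng.randrange(len(seq))
        # Pick any base other than the current one, so every event is a real substitution
        seq[site] = rng.choice([b for b in "ACGT" if b != seq[site]])
    return "".join(seq)


def count_differences(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(1)  # fixed seed so the run is repeatable
ancestor = "AATGTAGGAATCGC"
for actual in (1, 3, 6, 12, 24):
    descendant = evolve(ancestor, actual, rng)
    observed = count_differences(ancestor, descendant)
    print(f"actual substitutions = {actual:2d}, observed differences = {observed}")
```

The observed count can never exceed the actual count, because a site that was hit several times still shows at most one difference.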
12
Effect of multiple substitutions (hits)
[Figure, built up over five slides: divergence plotted against time]
Axes: Time (x), Divergence (y)
Curves: Actual divergence (actual number of substitutions); Observed divergence
Regions along the time axis: Actual = Observed, then Actual > Observed, then Actual >> Observed
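The shape of the observed-divergence curve can be sketched numerically. The snippet assumes the Jukes-Cantor model, whose correction k = -3/4 ln(1 - 4p/3) appears later in this deck; solving that formula for p gives the expected observed proportion p = 3/4 (1 - exp(-4k/3)). The function name expected_observed_p is hypothetical.

```python
import math


def expected_observed_p(k):
    """Expected proportion of differing sites when k substitutions per site
    have actually occurred, under the Jukes-Cantor model (an assumption)."""
    return 0.75 * (1.0 - math.exp(-4.0 * k / 3.0))


for k in (0.01, 0.1, 0.5, 1.0, 2.0, 5.0):
    p = expected_observed_p(k)
    print(f"actual k = {k:4.2f}   expected observed p = {p:.3f}")
```

For small k the two values are nearly equal; as k grows, p levels off below 0.75 while k keeps increasing.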
17
Distance estimation with multiple-hit corrections
(nucleotide substitutions)
Jukes-Cantor method (one-parameter method)
• All substitutions are equally likely
• k = -(3/4) ln(1 - 4p/3)
  k: the expected number of nucleotide substitutions per site
  p: the proportion of nucleotide differences
Kimura’s 2-parameter method
• Transitions and transversions have different rates
• k = (1/2) ln[1/(1 - 2P - Q)] + (1/4) ln[1/(1 - 2Q)]
  P: the proportion of transitional (Ts) differences
  Q: the proportion of transversional (Tv) differences
[Figure: substitution matrix over A, C, G, T for each method]
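Both corrections can be sketched in Python. The function names are assumptions made for illustration, and the sketch assumes aligned sequences containing only A, C, G, T. Kimura's formula is coded in the equivalent form k = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q).

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq1, seq2):
    """Count transitional and transversional differences between aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Transition: purine <-> purine or pyrimidine <-> pyrimidine
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts, tv


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion of differences p."""
    if p >= 0.75:
        raise ValueError("JC distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(P, Q):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("K2P distance is undefined for these P and Q")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


s1, s2 = "ACTGTAGGAATCGC", "AATGAAAGAATCGC"
ts, tv = count_ts_tv(s1, s2)
L = len(s1)
k_jc = jukes_cantor((ts + tv) / L)
k_k2p = kimura_2p(ts / L, tv / L)
print(f"Ts = {ts}, Tv = {tv}, JC k = {k_jc:.4f}, K2P k = {k_k2p:.4f}")
```

For this pair there is one transition and two transversions, so P = Q/2 and the two corrections give the same value; with a stronger transition bias they would differ.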
18
Distance estimation with multiple-hit corrections
(nucleotide substitutions)
k = -(3/4) ln(1 - 4p/3)
If p ≥ 0.75, the JC distance cannot be estimated
[Figure: k (Jukes-Cantor distance) plotted against p (uncorrected nucleotide difference), with the line k = p for reference and p = 0.75 marked]
19
Distance estimation with multiple-hit corrections
(nucleotide substitutions)
There are many distance estimation methods based on different models.
1. More parameters:
   • 1-parameter (Jukes-Cantor method)
   • 2-parameter (Kimura’s 2-p method)
   • 3, 4, 6, ... up to 12 parameters!!
2. Variation in base composition (A C G T):
   • 1-p & base comp. (Tajima & Nei or F81 method)
   • 2-p & base comp. (HKY85 or F84 method), etc.
3. Rate variation among sites: approximated by a gamma distribution
   CV (coefficient of variation of the rate): smaller CV → less variation
   • 1-p & rate variation (Jin & Nei method), etc.
4. LogDet method:
   • No constraints on parameters; base composition can vary among sequences
   • No among-site rate variation can be considered
[Figure: substitution matrix over A, C, G, T]
20
Distance estimation with multiple-hit corrections
(nucleotide substitutions)
Which distance method should we choose?
Things to consider:
• Substitution pattern (e.g., Ts/Tv)
• Base composition bias
• Rate heterogeneity among sites
1. More parameters → more flexible, more realistic
2. More parameters → larger sampling errors (lower precision)
3. More parameters → more “undefined” distance problems (e.g., if p ≥ 0.75 in the JC method, k = -(3/4) ln(1 - 4p/3) becomes “undefined” or “infinite”)
21
Distance estimation with multiple-hit corrections
(amino acid substitutions)
1. Poisson distance: k = -ln(1 - p)
   k: the expected number of amino acid substitutions per site
   p: the proportion of amino acid differences
2. Kimura’s distance: k = -ln(1 - p - 0.2p²)
   • Approximation of the PAM distance below (accurate when p < 0.75)
   • Distance becomes infinite when 1 - p - 0.2p² ≤ 0 (p ≥ about 0.854)
3. PAM distance & JTT distance
   • Distance based on the PAM or JTT amino acid substitution matrix
   • The JTT matrix is newer and based on a much larger protein sample
NOTE for ClustalW/ClustalX
Kimura’s distance in ClustalW/ClustalX is a hybrid between Kimura’s and PAM distances:
   p ≤ 0.75: use Kimura’s correction
   0.75 < p ≤ 0.93: use a conversion table with 0.001 interval (.75, .751, ...)
   p > 0.93: k = 10.0
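The first two amino acid corrections are simple enough to sketch directly. The function names are hypothetical, and the sketch covers only the closed-form Poisson and Kimura formulas; it does not reproduce the ClustalW/ClustalX conversion table or the PAM/JTT matrix-based distances.

```python
import math


def poisson_distance(p):
    """Poisson-corrected distance: k = -ln(1 - p)."""
    if p >= 1.0:
        raise ValueError("Poisson distance is undefined for p >= 1")
    return -math.log(1.0 - p)


def kimura_protein_distance(p):
    """Kimura's empirical correction: k = -ln(1 - p - 0.2 p^2)."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        raise ValueError("Kimura distance is undefined when 1 - p - 0.2p^2 <= 0")
    return -math.log(inner)


for p in (0.05, 0.1, 0.3, 0.5, 0.7):
    print(f"p = {p:.2f}  Poisson k = {poisson_distance(p):.3f}  "
          f"Kimura k = {kimura_protein_distance(p):.3f}")
```

Kimura's value is always at least as large as the Poisson value for the same p, because the extra 0.2p² term makes the argument of the logarithm smaller.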
22
Phylogenetic reconstruction methods
Neighbor Joining (NJ)
   Data type: distance
   Optimality criterion: minimum evolution (shortest total branch length)*
Maximum Parsimony (MP)
   Data type: sequence (or other) data
   Optimality criterion: maximum parsimony (smallest number of evolutionary changes)
Maximum Likelihood (ML)
   Data type: sequence (or other) data
   Optimality criterion: maximum likelihood (highest probability of observing the data under a given tree and a given model of substitutions)
Speed: fastest (NJ) to slowest (ML)
*NJ does not search for the ME tree. NJ provides a simplified (approximate) algorithm to find the ME tree.
23
Phylogenetic reconstruction: distance matrix methods
UPGMA (unweighted pair-group method with arithmetic mean)
Example: a distance matrix for 5 sequences.

      B     C     D     E
A    .53   .99  1.02   .82
B          .80   .93   .73
C                .65   .81
D                      .94

The pair with the smallest distance is grouped, and the matrix is reduced, until all of the sequences are clustered in a tree. Composite groups on the slide: A/B, C/D, A/B/E.
A/B row after the first grouping: .90 (C), .98 (D), .78 (E). A further reduced value, .86, also appears on the slide.
• assumes all sequences evolve at the same rate
• generates a rooted tree
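The grouping procedure can be sketched as follows. The matrix values assume a row-wise reading of the flattened slide matrix (A-B .53, A-C .99, A-D 1.02, A-E .82, B-C .80, B-D .93, B-E .73, C-D .65, C-E .81, D-E .94); that reading, and the names upgma, NAMES and DIST, are assumptions. The sketch records only the merge order and merge distances, not branch lengths.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E"]
# Pairwise distances, assuming a row-wise reading of the slide's matrix
DIST = {
    frozenset(pair): d
    for pair, d in {
        ("A", "B"): 0.53, ("A", "C"): 0.99, ("A", "D"): 1.02, ("A", "E"): 0.82,
        ("B", "C"): 0.80, ("B", "D"): 0.93, ("B", "E"): 0.73,
        ("C", "D"): 0.65, ("C", "E"): 0.81,
        ("D", "E"): 0.94,
    }.items()
}


def upgma(names, dist):
    """Return the merge history as a list of (cluster1, cluster2, distance)."""
    clusters = [(n,) for n in names]  # each cluster is a tuple of leaf names

    def avg(c1, c2):
        # Arithmetic mean over all leaf pairs between the two clusters
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Group the pair of clusters with the smallest average distance
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((c1, c2, avg(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


merges = upgma(NAMES, DIST)
for c1, c2, d in merges:
    print("/".join(c1), "+", "/".join(c2), f"at distance {d:.3f}")
```

With these values A and B are grouped first, then C and D, then E joins A/B, which matches the composite groups A/B, C/D and A/B/E named on the slide.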
24
Phylogenetic reconstruction: distance matrix methods
NJ (neighbor joining method)
Example: a distance matrix for 5 sequences.

      B     C     D     E
A    .53   .99  1.02   .82
B          .80   .93   .73
C                .65   .81
D                      .94

1) Start with a star-like phylogeny.
2) The total length of the tree (the sum of the branch lengths) is estimated.
3) Find neighbors sequentially that minimize the total length of the tree.
• does not assume a constant rate
• generates an unrooted tree
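The first neighbor-finding step can be sketched with the widely used Q-criterion form of neighbor joining, Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r is the sum of distances from a sequence to all others; the pair with the smallest Q is joined. This formulation is not spelled out on the slide, so treat it as an assumption, as are the function name nj_first_join and the row-wise reading of the matrix values.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E"]
# Pairwise distances, assuming a row-wise reading of the slide's matrix
DIST = {
    frozenset(pair): d
    for pair, d in {
        ("A", "B"): 0.53, ("A", "C"): 0.99, ("A", "D"): 1.02, ("A", "E"): 0.82,
        ("B", "C"): 0.80, ("B", "D"): 0.93, ("B", "E"): 0.73,
        ("C", "D"): 0.65, ("C", "E"): 0.81,
        ("D", "E"): 0.94,
    }.items()
}


def nj_first_join(names, dist):
    """Return the first pair joined by NJ, its two branch lengths, and the Q values."""
    n = len(names)

    def d(x, y):
        return dist[frozenset((x, y))]

    # r[x]: sum of distances from x to every other sequence
    r = {x: sum(d(x, y) for y in names if y != x) for x in names}
    # Q criterion: the pair with the smallest Q gives the shortest total tree length
    q = {(x, y): (n - 2) * d(x, y) - r[x] - r[y] for x, y in combinations(names, 2)}
    f, g = min(q, key=q.get)
    # Branch lengths from f and g to the new internal node
    len_f = d(f, g) / 2 + (r[f] - r[g]) / (2 * (n - 2))
    len_g = d(f, g) - len_f
    return (f, g), (len_f, len_g), q


pair, lengths, q = nj_first_join(NAMES, DIST)
print("first pair joined:", pair)
print("branch lengths:", tuple(round(x, 4) for x in lengths))
```

With these values NJ joins C and D first even though A and B have the smallest raw distance, which shows how the criterion differs from UPGMA's smallest-distance rule. A full NJ run would replace the joined pair by the new node, recompute distances, and repeat.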
25
Rooted trees vs. unrooted trees
[Figure, built up over five slides: one unrooted tree of A, B, C, D and rooted trees drawn along a time axis]
• Rooted trees obtained by placing the Root on different branches of the unrooted tree (leaf orders shown: A B C D; B A C D; C D A B)
• Outgroup: used to place the root
30
Phylogenetic reconstruction: bootstrap analysis
used to estimate the confidence level of phylogenetic hypotheses
[Figure: multiple alignment of sequences S1 to S5 over sites 1 to 8]
Each column is an independent sample.
Many resamplings (~1000 replications) of the columns give pseudoreplicate alignments such as:

      Replicate 1   Replicate 2   Replicate 3   ...
S1    GGAGGTTA      CGCAGCAC      TTTTGGCG
S2    GGAGGTTA      CGCAGTAT      TTTTGGTG
S3    AAAAACTA      CACAACAC      CTCTAACA
S4    AAAAATCA      CACAACAC      TCTCAACA
S5    GGAGGTCG      TGTAGCGC      TCTCGGCG

A phylogeny is reconstructed from each pseudoreplicate.
Bootstrap support (%): [Figure: tree of S1 to S5 with support values 100, 100, and 40 on its branches]
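The resampling and the support calculation can be sketched as below. Columns are drawn with replacement, which is the standard bootstrap procedure. The small alignment is made up for illustration (it is not the alignment on the slide), and the function names are hypothetical. Tree reconstruction is left out; clade_support only counts how often each grouping appears among replicate trees that are supplied as sets of clades.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Build one pseudoreplicate by resampling alignment columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def clade_support(replicate_clades):
    """Percentage of replicate trees that contain each clade.

    replicate_clades holds one set of clades per replicate tree; each clade
    is a frozenset of taxon names.
    """
    counts = Counter(clade for clades in replicate_clades for clade in clades)
    n = len(replicate_clades)
    return {clade: 100.0 * c / n for clade, c in counts.items()}


# Hypothetical 5-sequence, 8-site alignment (illustration only)
alignment = {
    "S1": "GCAGTACT",
    "S2": "GTAGTACT",
    "S3": "ACAATCCA",
    "S4": "ACAATCTA",
    "S5": "GCGGTTCT",
}
rng = random.Random(42)  # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]
print(replicates[0])

# Hypothetical clades found in 5 replicate trees
s12, s45 = frozenset({"S1", "S2"}), frozenset({"S4", "S5"})
support = clade_support([{s12, s45}, {s12}, {s12, s45}, {s12}, {s12}])
print(support[s12], support[s45])
```

In a real analysis each pseudoreplicate would be passed through the same distance and tree-building steps as the original alignment, and the resulting trees would supply the clade sets.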
35
Phylogenetic reconstruction by ClustalX/W & Phylip
Phylogeny programs:
• ClustalW/ClustalX (DNA/protein distance, NJ)
• Phylip3.5 & Phylip3.6a3 (standalone, web-interface)
• PAUP (also included in GCG)
Visualization: Phylip (treegram, etc.), TreeView, PAUP, NJplot
More phylogeny programs
36
Phylip programs
Bootstrap: seqboot (sequence data)
DNA distance: dnadist (nucleotide sequence data)
Protein distance: protdist (amino acid sequence data)
Neighbor joining: neighbor (distance matrix)
Consensus tree: consense (tree file)
Tree drawing: drawgram, drawtree, retree (treefile)

Phylip3.5
   Input file: infile
   Output file: outfile, treefile
   dnadist: Kimura, Jin/Nei, ML (F84), JC
   protdist: Kimura, PAM (Dayhoff)
Phylip3.6a3
   Input file: infile, intree
   Output file: outfile, outtree
   dnadist: F84, Kimura, JC, LogDet (rate variation can be incorporated with all but LogDet)
   protdist: Kimura, PAM (Dayhoff), JTT
37
Phylogenetic reconstruction by ClustalW & Phylip
• Bioinformatics Core Facility Web server (Phylip3.5)
• Bioinformatics Web: IU Center for Genomics & Bioinformatics (Phylip3.5)
• Institut Pasteur, Biological Software list (Phylip3.6a3)
• Phylip download site (Windows, Macintosh, Linux/Unix)
  Phylip3.5:
  Phylip3.6b:
• TreeView download site (Windows, Macintosh, Linux/Unix)
• NJPlot download site (Windows, Macintosh, Linux/Unix)
38
Multiple Alignment by CLUSTALW
• Bioinformatics Core Facility Web server
• Bioinformatics Web: IU Center for Genomics & Bioinformatics
• Institut Pasteur, Biological Software list
• EMBL-EBI ClustalW Form
• ClustalX FTP site (Windows, Macintosh, Linux/Unix): ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/
39
ClustalW/Phylip Exercise
1. Download the two sample data files from the course web site:
   bglobin.seq - protein sequences
   Dloop.seq - DNA sequences
   [Use either the DOS-format or the non-DOS-format files, whichever works for you.]
   These sequences are in FASTA format:

>HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
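For readers who want to inspect such a file outside the web tools, a minimal FASTA reader can be sketched as follows. The function name parse_fasta is hypothetical, and the sketch assumes each record starts with a ">" header line followed by one or more sequence lines.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header line: text after ">" is the name
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence line of the current record
    return {n: "".join(parts) for n, parts in records.items()}


SAMPLE = """\
>HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
"""

records = parse_fasta(SAMPLE)
for name, seq in records.items():
    print(name, len(seq), seq[:10] + "...")
```

Reading a downloaded file would only require passing its contents, for example the text of bglobin.seq, to the same function.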
40
ClustalW/Phylip Exercise (continued)
2. Go to this ClustalW web site:
3. Enter your address.
4. Copy and paste the bglobin.seq data.
5. Check “Phylip alignment output format”.
6. Check the available options.
7. Click the “Run clustalw” button to start the alignment.
   Wait until the page changes to the results page...
8. Click the “infile.phy” link to open the multiple alignment in Phylip format.
9. From the pull-down menu, choose the “protdist” program.
41
ClustalW/Phylip Exercise (continued)
10. Click the “Run the selected program on infile.phy” button to start protdist.
11. Choose “Jones-Taylor-Thornton (JTT) matrix” as the distance model.
12. Check the “Perform a bootstrap ...” option.
    Enter a “Random number seed”.
    Enter 10 as the number of replicates. For a real analysis, use 500 or more replicates (~1000).
13. Click the “Run protdist” button to run the program.
    Wait until the page changes to the results page...
14. Click the “outfile” link to check the distance matrix file you generated.
15. From the pull-down menu, choose the “neighbor” program.
42
ClustalW/Phylip Exercise (continued)
16. Click the “Run the selected program on outfile” button to start neighbor.
17. Choose “Neighbor-joining” as the distance method.
18. Check “Randomize (jumble) input order”.
    Enter a “Random number seed”.
    Using the randomization option slows down the program, but it is usually better to use it to avoid artifacts of the input order.
19. Check “Analyze multiple data set”.
    If you are not doing a bootstrap analysis, you don’t have to check this option.
    Enter the number of data sets (10 for this example).
    Check “Compute a consense tree”.
20. Click “Run neighbor” to run the program.
    Wait until the page changes to the results page...
43
ClustalW/Phylip Exercise (continued)
21. Click “outfile.consense” to open the output file.
22. Click “outtree.consense” to open the output file.
    Note that the numbers after the taxon names are not branch lengths but bootstrap values.
23. Click “outtree” to open the output file.
    This file contains 10 trees based on the bootstrapped alignments.
    Save the first tree in a file to use it for the TreeView demonstration.
    We are using this tree only as an example. For a real analysis, you should build the NJ tree from the original multiple alignment, without bootstrap resampling. In step 12, uncheck the “Perform a bootstrap ...” option to generate the NJ tree without bootstrap analysis.
44
TreeView Exercise
24. Find the “TreeView” software on your machine and start the program.
25. From the File menu, open the tree file you saved.
26. Click the different tree icons to change the phylogeny format.
    Which format shows different branch lengths?
27. From the Tree menu, select “Define outgroup”.
    Choose one sequence as an outgroup.
28. From the Tree menu, select “Root with outgroup”.
29. From the Edit menu, select “Edit tree”.
    Check how you can edit your tree.
The assignment from my lectures (Assignment #4) is on the Blackboard Assignment page.