Download presentation
Presentation is loading. Please wait.
Published byHarjanti Vera Kusnadi Modified over 6 years ago
1
Reed A. Cartwright Department of Genetics University of Georgia
Barking up the wrong tree: gaps in current phylogenetic methodology and a dawg-gone good solution Reed A. Cartwright Department of Genetics University of Georgia
2
RA Cartwright rac@uga.edu - http://scit.us/
Phylogenies Phylogenies are not known. They are inferred. The accuracy of the inference is dependent on the quality of the data and quality of the methodology. RA Cartwright -
3
Estimating Phylogenies
Retrieve sequences from a database or a sequencer. Align sequences. Estimate the phylogeny from the alignment. However, bad alignments give bad phylogenies. RA Cartwright -
4
RA Cartwright rac@uga.edu - http://scit.us/
How do we know it works? Intuition Concordance Using rare instances of know phylogenies. Simulations RA Cartwright -
5
Why Simulate Phylogenies?
Techniques are often based on certain models of evolution. Simulating sequence evolution based on these models produces an ideal situation to test the techniques. Using other models can test how robust a technique is. RA Cartwright -
6
RA Cartwright rac@uga.edu - http://scit.us/
Testing Procedure 1. Start with a “known” tree. 3. Estimate the trees of the simulated data. 2. Simulate sequence sets based on the tree. 4. Compare estimated trees to the original tree. A B C D A AATTCTTTGAGTTAA B AATTCTTTGAGTTAA C AATTCTTAAAGTTAA D AATTCTTAAAGTTAA A B C D A B C D A AAAAGATAAAGCAAA--A B GAAAGATAAAGCAAA--A C GAAAGATAAAGAAAAACA D GAAAGATAAAGAAAAACA RA Cartwright -
7
Measuring the accuracy of estimates
What do you want to measure, topology or branch lengths? Simply binary system: correct or wrong. More flexible system: accuracy of clade estimation. RA Cartwright -
8
RA Cartwright rac@uga.edu - http://scit.us/
Clade Estimation Estimated Clade Yes No Actual Clade True Positive False Negative False Positive True Negative Sensitivity = TP/(TP+FN) Specificity = TN/(FP+TN) Positive Predictive Accuracy = TP/(TP+FP) Negative Predictive Accuracy = TN/(FN+TN) RA Cartwright -
9
How to best combine different clades into one metric?
Look at each clade separately? Lump all clades of the same size together? Lump all clades of different sizes together? How to do it? RA Cartwright -
10
RA Cartwright rac@uga.edu - http://scit.us/
Alignment Techniques The quality of a phylogeny depends on the quality of an alignment. There are two different classes of alignment techniques. Pairwise alignments Multiple alignments RA Cartwright -
11
RA Cartwright rac@uga.edu - http://scit.us/
Pairwise alignments Pairwise alignments align pairs of sequences. Typically use a spreadsheet like technique with penalties for mismatches and gaps. RA Cartwright -
12
RA Cartwright rac@uga.edu - http://scit.us/
Multiple alignments Align multiple sequences. Cannot directly use the techniques used for pairwise alignments. Typical implementations use a guide tree derived from sequence similarity scores of pairwise alignments. RA Cartwright -
13
RA Cartwright rac@uga.edu - http://scit.us/
Alignment Models The typical model of alignment is the affine gap model. In this model the cost of a gap is a linear function of the size of a gap: C(L) = O+E(L). This corresponds to a geometric-exponential model of gap sizes. RA Cartwright -
14
RA Cartwright rac@uga.edu - http://scit.us/
Power-Law The only problem with this is that indels do not obey this model. Several studies and some theory in nucleic acids and proteins have found that indels sizes obey a power law. The appropriate cost model for a power law is logarithmic: C(L) = O+E*Log(L). RA Cartwright -
15
RA Cartwright rac@uga.edu - http://scit.us/
So what? Only one program does logarithmic alignments, and it only works on protein pairs. Monotone by Richard Mott. Clustal uses the affine gap model. Above that Clustal uses a simple evolutionary model to estimate a guide tree for aligning multiple sequences. RA Cartwright -
16
RA Cartwright rac@uga.edu - http://scit.us/
Dealing with gaps Sequences are typically aligned before they are phylogenized. This is silly. We should estimate alignments at the same we estimate phylogenies. For now we are stuck doing it in pieces, but must be wary of introducing biases into our phylogenies when aligning. RA Cartwright -
17
RA Cartwright rac@uga.edu - http://scit.us/
Dealing with gaps Gaps contain phylogenetic signal which is usually ignored by researchers. Can look at how gaps influence phylogenetics? RA Cartwright -
18
RA Cartwright rac@uga.edu - http://scit.us/
Dealing with gaps To study how gaps can influence phylogenies we need a program that can simulate molecular evolution with indels. However, existing programs model indel formation rather poorly if they do at all. RA Cartwright -
19
Dawg is the solution to this gap
Dawg stands for “DNA Assembly With Gaps.” A portable and robust program for simulating molecular evolution. Development Website: RA Cartwright -
20
RA Cartwright rac@uga.edu - http://scit.us/
Comparing Software Feature Seq-Gen Evolver Rose Dawg Indels Yes Indel Parameter Estimator Recombination Substitution GTR PAM Rate Heterogeneity Γ+I Γ Input Format Switch File Unix Mac OS X Win32 RA Cartwright -
21
RA Cartwright rac@uga.edu - http://scit.us/
Parameters Tree phylogeny TreeScale coefficient to scale branch lengths by Sequence root sequences Length length of generated root sequences Rates rate of evolution of each root nucleotide Model model of evolution: GTR|JC|K2P|K3P|HKY|F81|F84|TN Freqs nucleotide (ACGT) frequencies Params parameters for the model of evolution Width block width for indels and recombination Scale block position scales Gamma coefficients of variance for rate heterogeneity Alpha shape parameters Iota proportions of invariant sites GapModel models of indel formation: NB|PL|US Lambda rates of indel formation GapParams parameter for the indel model Reps number of data sets to output File output file Format output format: Fasta|Nexus|Phylip|Clustal GapSingleChar output gaps as a single character GapPlus distinguish insertions from deletions in alignment LowerCase output sequences in lowercase Translate translate outputed sequences to amino acids NexusCode text or file to include between datasets in Nexus format Seed PRNG seed (integers) RA Cartwright -
22
RA Cartwright rac@uga.edu - http://scit.us/
Sample Input File # example.dawg Tree = ((AY727331: ,AY727330: ): , (AY727327: ,AY727326: ): ); Model = "GTR" Params = { , , , , , } Freqs = { , , , } Length = 300 Lambda = GapModel = "NB" GapParams = {1, } Format = "Clustal" File = "example.aln" Seed = 1981 RA Cartwright -
23
RA Cartwright rac@uga.edu - http://scit.us/
CLUSTAL multiple sequence alignment (Created by DAWG Version 1.0.0) AY TTCGAAAATATGTTAGTACTCAATATGAATTCTTTGAGTTAAAAAAGATAAAGCAAA--A AY TTCGAAAATATGTTAGTACTCAATATGAATTCTTTGAGTTAAGAAAGATAAAGCAAA--A AY TTCAAAAATATGCTAGGACTGAATATGAATTCTTAAAGTTAAGAAAGATAAAGAAAAACA AY TTCAAAAATATGCTAGGACTGAATATGAATTCTTAAAGTTAAGAAAGATAAAGAAAAACA AY ATACATAATGTGATTTCAATATTCCAATTACCTAACAATACGGCTATCAATTAAACGATT AY ATACATAATGTGATTTCAATATTCCAATTACCTAACAATACGGCTATCAATTAAACGATT AY GTACATAATGTAAA----TTATTGCAA AAAACGGCTAACAATTAGACGATT AY GTACATAATGTAAA----TTATTGCAA AAAACGGCTAACAATTAGACGATT AY TTAGGATTACACCGACAAATATTAGGCCGATATGAATTTAACATCATGTTGTATTTAGAT AY TTAGGATTACACCGACAAATATTAGGCCGATATGAATTTACCATCATGTTGTATTTAGAT AY TTAGGATTACGCTGACAAATATTAGGATGATATTAATTTA------TCTTGTATTTAGAT AY TTAGGATTACGCTGACAAATATTAGGATGATATTAATTTA------TCTTGTATTTAGAT AY GCTGTCTTTTATTAACATTCATCATTAAAT-TTGGAACCTTTTGCATTTAAGAAGTACAT AY GCTGTCTTTTATTAACATTCATCATTAAAT-TTGGAACCTTTTGTATTTAAGAAGTACAT AY GCTGTCTTTTATCAACATTCATCACTAGATATTGGAACCTATTGCATCTAAGAAGTACAT AY GCTGTCTTTTATCAACATTCATCACTAGATATTGGAACCTATTGCATCTAAGAAGTACAT AY GTTTAATAGTGTTTAAAA-TATATATGAAATTGATCATAAGGA---TCTATAAATGCGGT AY GTTTAATAGTGTTTATAA-TATATATGAAATTGATCGTAAGGA---TCTATAAATGCAGT AY GTTTAATAGGGTT-AAAACTATATATGAAGTCGATTATAAGGAATTTCTATAAATGTAGC AY GTTTAATAGGGTT-AAAACTATATATGAAGTCGATTATAAGGAATTTCTATAAATGTAGC AY TCTTCAATTTCTTG AY TCTTCAATTTCTTG AY TCTTCAATTTCCTA AY TCTTCAATTTCCTA RA Cartwright -
24
RA Cartwright rac@uga.edu - http://scit.us/
Estimating Indel Rate Dawg would be of little benefit if biologists could not estimate parameters of indel formation from real data. Dawg’s indel model allows such estimation, which is implemented in a Perl script, lambda.pl. RA Cartwright -
25
RA Cartwright rac@uga.edu - http://scit.us/
Lambda.pl Take an alignment and a phylogeny. The number of unique gaps in this alignment is approximately distributed as a Poisson with mean (λLT) λ = rate of indel formation L = average sequence length T = total branch length Therefore the rate of indel formation can be estimated as λ = N/(LT) N = number of unique gaps in alignment RA Cartwright -
26
Example Usage: Confidence Interval of Indel Rate
I aligned the sequences of chloroplast trnK introns from two Hibiscus and two Prunus species. Using Paup*, I estimated the phylogeny and substitution parameters. Using lambda.pl, I estimated the indel formation parameters. RA Cartwright -
27
RA Cartwright rac@uga.edu - http://scit.us/
Example Usage From these estimated parameters of evolution, I constructed an input file for Dawg. From the input file Dawg produced a thousand simulated sequence sets. The rate of indel formation was estimated for each of the simulated sequences. RA Cartwright -
28
RA Cartwright rac@uga.edu - http://scit.us/
Results The estimated rate of indel formation was Bootstrapping gave a 95% CI of to Biologically this is 8 to 21 indels per 100 substitutions. RA Cartwright -
29
RA Cartwright rac@uga.edu - http://scit.us/
Thanks RA Cartwright -
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.