Molecular evidence for endosymbiosis
- Perform blastp to investigate sequence similarity among the domains of life
- Yeast nuclear genes show more sequence similarity (closer in evolutionary time) to archaeal genes
- Yeast mitochondrial genes show more sequence similarity to eubacterial genes
t-test and significance
- A t-test determines whether two data sets come from the same population or differ significantly
- Calculate the mean and standard deviation of each data set, then derive a weighted (pooled) standard deviation to be used in the t-test
- Compare the resulting t value to the critical t value obtained from a t-table or software
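The calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming two independent samples with roughly equal variances; the function name `pooled_t_statistic` and the sample values in the usage note are made up for the example and are not from the lecture.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for two independent samples.

    Uses a pooled (weighted) variance, which assumes both samples
    have roughly the same underlying variance.
    """
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Weight each sample variance by its own degrees of freedom.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    # Standard error of the difference between the two means.
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / std_err, df
```

For example, `pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])` returns a t value and the degrees of freedom; the absolute value of t is then compared with the critical value for that many degrees of freedom.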
Origins of eukaryotic cells
Martin-Muller hypothesis
Evidence from phylogenetic relationships
M. leprae vs. M. tuberculosis
- The M. leprae genome (3.2 Mb) is ~50% coding, in contrast to 4.4 Mb and 91% coding for M. tuberculosis
- Comparing genomes using MUMmer: scripts/CMR2/webmum/mumplot
How MUMmer works
- Uses suffix trees to create an internal representation of a genome sequence
- Identifies maximal unique matches (MUMs); version 2.0 uses streaming, whereas version 1.0 adds sequence 2 to the suffix tree for sequence 1
- Alignment via Smith-Waterman
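To make the idea of a maximal unique match concrete, here is a brute-force sketch. It is not how MUMmer finds matches (MUMmer uses suffix trees, and this quadratic scan would be far too slow for genomes); it only shows what counts as a MUM. The function names and the `min_len` parameter are assumptions made for the example.

```python
def occurs_once(text, pattern):
    """True if pattern appears exactly once in text (overlaps included)."""
    first = text.find(pattern)
    return first != -1 and text.find(pattern, first + 1) == -1


def find_mums(a, b, min_len=3):
    """Return (start_in_a, start_in_b, length) for each maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left: not maximal.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two sequences agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            match = a[i:i + k]
            # Keep the match only if it is unique in both sequences.
            if k >= min_len and occurs_once(a, match) and occurs_once(b, match):
                mums.append((i, j, k))
    return mums
```

With the toy sequences `"GGACGTCC"` and `"TTACGTAA"`, the only match of length 3 or more is `ACGT`, starting at index 2 in both sequences.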
- Origin of species
- Mitochondrial DNA and human evolution
- Evolution of pathogens
Phylogeny: data mining by biologists
- Molecular phylogenetics uses clustering techniques to discern relationships between different biological sequences
Why phylogenetics?
- Understand evolutionary history
- Map pathogen strain diversity for vaccines
- Assist in epidemiology (dentist and HIV)
- Aid in prediction of function of novel genes
- Biodiversity
- Microbial ecology
Changes can occur
Observing differences in nucleotides
- The simplest measure of distance between two sequences is to count the number of sites at which the two sequences differ
- If not all sites are equally likely to change, the same site may undergo repeated substitutions
- As time goes by, the number of observed differences between two sequences becomes a less and less accurate estimator of the actual number of substitutions that have occurred
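The simple count of differing sites can be written as follows. This is a minimal sketch that assumes the two sequences are already aligned and of equal length; the function name `p_distance` is chosen for the example.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned, non-empty, and equal in length")
    # Count the sites where the two sequences carry different characters.
    differences = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return differences / len(seq1)
```

For `"ACGTACGT"` and `"ACGTTCGA"` two of the eight sites differ, giving 0.25. A site that has changed twice, or changed and changed back, still counts at most once, which is why this measure underestimates the true number of substitutions over long time spans.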
The relationship between time and substitutions is non-linear
Various models have been generated to more accurately estimate distance and evolution
- All use the following framework:
- Probability matrix: p_AC is the probability that a site starting with an A has a C at the end of time interval t, etc.
- Base composition of the sequence: f_A = frequency of A
Jukes-Cantor Model
- Distance between any two sequences is given by: d = -(3/4) ln(1 - (4/3)p)
- p is the proportion of nucleotides that differ between the two sequences
- All substitutions are equally probable
  - Each off-diagonal position in the matrix = α; each diagonal position = 1 - 3α
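The Jukes-Cantor correction is a one-line formula, sketched here as a function of the observed proportion of differing sites p. The function name and the range check are choices made for the example; the formula is undefined once p reaches 3/4, the value expected for two unrelated sequences.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor distance d = -(3/4) ln(1 - (4/3) p)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)
```

For p = 0.1 the corrected distance is about 0.107, slightly more than the observed proportion, and the gap widens as p grows because more substitutions are hidden by repeated changes at the same site.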
Kimura’s two-parameter model
- d = ½ ln[1/(1 - 2P - Q)] + ¼ ln[1/(1 - 2Q)]
- P and Q are the proportions of sites that differ between the two sequences due to transitions and transversions, respectively
- Accounts for transition bias in sequences (transversions are rarer)
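A minimal sketch of the two-parameter distance, assuming two aligned DNA sequences of equal length that contain only A, C, G, and T (no gaps or ambiguity codes). The function name and the purine lookup set are introduced for the example.

```python
import math

PURINES = {"A", "G"}  # C and T are the pyrimidines


def kimura_2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned, non-empty, and equal in length")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    P = transitions / n    # proportion of sites differing by a transition
    Q = transversions / n  # proportion of sites differing by a transversion
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))
```

For `"AAAAAAAAAA"` and `"GAAAAAAAAC"` there is one transition (A to G) and one transversion (A to C), so P = Q = 0.1 and the distance is about 0.234.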
Evolutionary models
Implementing models and building trees
Rooted vs. unrooted
- Root: ancestor of all taxa considered
- Unrooted: relationships without consideration of ancestry
- A root is often specified with an outgroup
  - Outgroup: a distantly related species (e.g., mammals and an archaeal species)
Tree building
- Get protein/RNA/DNA sequences
- Construct a multiple sequence alignment
- Compute pairwise distances (if necessary)
- Build the tree: topology and distances
- Estimate reliability
- Visualize
Distance methods
- UPGMA
- Neighbor joining
Unweighted pair-group method using arithmetic averages (UPGMA)
- Assumes a constant rate of gene substitution (evolution)
- Clustering algorithm that measures the distances between all sequences, merges the closest pair, recalculates the distances to that new node as averages, then merges the next closest pair, and repeats
- Usually gives a rooted tree
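The clustering loop described on this slide can be sketched as below. This is an illustrative implementation, not code from the lecture: the function name `upgma`, the nested-tuple tree format `(left, right, height)`, and the input layout (a list of names plus a symmetric distance matrix) are all assumptions made for the example.

```python
def upgma(names, dist):
    """Build a UPGMA tree from taxon names and a symmetric distance matrix.

    Returns nested tuples (left, right, height); leaves are plain names.
    """
    # Each cluster id maps to (subtree, number of taxa it contains).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller_id, larger_id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the new node to every other cluster is the
        # size-weighted average of the two merged clusters' distances.
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del d[(a, b)]
        # Under a constant rate, the node sits halfway between the pair.
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

With names `["A", "B", "C"]` and distances A-B = 2, A-C = 6, B-C = 6, the function first joins A and B at height 1.0 and then joins C to that pair at height 3.0. The last node created is the root, which is why the method usually yields a rooted tree.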
Testing the reliability of trees
- Interior branch test or bootstrap analysis
- Bootstrap analysis: resample the data (subsequences, or sequence deletion or replacement) and redraw the trees; how many times do you get the same branching?
- Bootstrap values of 70 (95) or greater are normally considered reliable
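The resampling idea can be sketched as follows. To keep the example short, a full tree builder is replaced by a stand-in: the "branching" being tested is simply which pair of sequences is closest. That simplification, the function names, and the default of 100 replicates are assumptions made for the example.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement to form one replicate."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def closest_pair(alignment):
    """Indices of the two sequences with the fewest mismatches.

    Stand-in for building a tree and reading off one grouping.
    """
    def mismatches(pair):
        i, j = pair
        return sum(x != y for x, y in zip(alignment[i], alignment[j]))
    return min(combinations(range(len(alignment)), 2), key=mismatches)


def bootstrap_support(alignment, replicates=100, seed=0):
    """Percentage of replicates that recover the original closest pair."""
    rng = random.Random(seed)
    observed = closest_pair(alignment)
    hits = sum(closest_pair(resample_columns(alignment, rng)) == observed
               for _ in range(replicates))
    return 100.0 * hits / replicates
```

In a real analysis each replicate would be passed through the same tree-building method as the original data, and the support value for a branch is the percentage of replicate trees that contain it.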
Homework due on 10/6
- Discovery questions in Chapter 2: 4, 25-27