EVOLUTIONARY CHANGE IN DNA SEQUENCES
- usually too slow to monitor directly…
… so use comparative analysis of 2 sequences which share a common ancestor
- determine number and nature of nt substitutions that have occurred (ie measure degree of divergence)
Spontaneous mutation rates?
- ~ 4 x … nt sub per site per year for mammalian nuclear DNA (regions not under functional constraint)
- much higher for viruses, eg … to … nt sub per site per generation
Potential pitfalls
1. Are all evolutionary changes being monitored?
- if closely-related, high probability only one change at any given site…
- but if distant, may have been multiple substitutions (“hits”) at a site
- can use algorithms to correct for this
2. If indels between two sequences, can they be aligned with confidence?
- algorithms with gap penalties
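One way to picture the multiple-hit correction in pitfall 1 is the Jukes-Cantor model. The slide does not name a specific algorithm, so this choice is an assumption, as are the function names. The sketch takes the observed proportion of differing sites p and estimates the number of substitutions per site, assuming equal base frequencies and equal rates for all substitution types.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Observed proportion of differing sites in two aligned, gap-free sequences."""
    if not seq1 or len(seq1) != len(seq2):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    diffs = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return diffs / len(seq1)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor estimate of substitutions per site from the observed p."""
    if p >= 0.75:
        # At p = 0.75 the sequences look like random pairs: sites are saturated.
        raise ValueError("p >= 0.75: distance cannot be estimated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for p in (0.01, 0.10, 0.30, 0.50):
        print(f"observed p = {p:.2f}  corrected d = {jukes_cantor(p):.3f}")
```

For closely related sequences the corrected value is almost the same as p; the gap widens as divergence increases, which is the "multiple hits" effect.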
Fig. 3.6: ancestral sequence and present-day sequences
Homoplasy: same nt, but not directly inherited from ancestral sequence (Page & Holmes Fig. 5.9)
(If comparing long stretches, highly unlikely they would have converged to the same sequence)
Nucleotide substitutions within protein-coding sequences
1. Synonymous vs. non-synonymous
Single step: Multiple steps: AATACT
Is one pathway more likely than another? (p.82)
2. Nomenclature related to “degeneracy”:
Non-degenerate - all possible changes at site are non-synonymous
2-fold degenerate - one of the 3 possible changes is synonymous
4-fold degenerate - all possible changes at site are synonymous
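The degeneracy classes above can be checked codon by codon. This is a minimal sketch assuming the standard nuclear genetic code; the function names are illustrative. It also reports the one case the three-class scheme does not cover: the third position of the isoleucine codons ATT, ATC and ATA, where two of the three changes are synonymous.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_changes(codon: str, pos: int) -> int:
    """Count how many of the 3 possible substitutions at pos (0, 1 or 2) are synonymous."""
    original = CODON_TABLE[codon]
    count = 0
    for base in BASES:
        if base == codon[pos]:
            continue
        mutant = codon[:pos] + base + codon[pos + 1:]
        if CODON_TABLE[mutant] == original:
            count += 1
    return count


def degeneracy(codon: str, pos: int) -> str:
    """Name the degeneracy class of one site within a codon."""
    labels = {
        0: "non-degenerate",
        1: "2-fold degenerate",
        2: "3-fold degenerate",  # only the third position of ATT/ATC/ATA (Ile)
        3: "4-fold degenerate",
    }
    return labels[synonymous_changes(codon, pos)]


if __name__ == "__main__":
    for codon in ("AAT", "GGT", "ATT", "CTA"):
        print(codon, [degeneracy(codon, pos) for pos in range(3)])
```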
ALIGNMENT OF SEQUENCES FOR COMPARATIVE ANALYSIS
1. By manual inspection
- if sequences very similar and no (or few) gaps
2. By sequence distance methods (often followed by “correction by visual inspection”)
- use algorithms which minimize mismatches and gaps
- gap penalty > mismatch penalty
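To make "gap penalty > mismatch penalty" concrete, here is a sketch of global alignment by dynamic programming (the Needleman-Wunsch approach). The slide does not say which algorithm is meant, and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen only so that a gap costs more than a mismatch.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    top, bottom, total = needleman_wunsch("GATTACA", "GATACA")
    print(top)
    print(bottom)
    print("score:", total)
```

Lowering the gap penalty toward zero lets the algorithm open gaps freely to force matches, which is the "no gap penalty" situation that produces biologically implausible alignments.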
Fig Alignment of human and chicken pancreatic hormone proteins
(a) no gap penalty
(b) with gap penalty
(c) alignment as in (b), with biochemically similar aa
Multiple sequence alignments - CLUSTALW
ww.ebi.ac.uk/clustalw (European Bioinformatics Institute)
CLUSTAL W (1.81) Multiple Sequence Alignments
Sequence 1: ArabidopsisAAG aa
Sequence 2: ArabidopsisAAC aa
Sequence 3: yeast 664 aa
Sequences (2:3) Aligned. Score: 23
Sequences (1:2) Aligned. Score: 93
Sequences (1:3) Aligned. Score: 22
ArabAAG52143 FIVDEADLLLDLGFRRDVEKIIDCLPRQR QSLLFSATIPKEVRRVS-QLVLKR 539
ArabAAC26676 FIVDEADLLLDLGFKRDVEKIIDCLPRQR QSLLFSATIPKEVRRVS-QLVLKR 586
yeast -VLDEADRLLEIGFRDDLETISGILNEKNSKSADNIKTLLFSATLDDKVQKLANNIMNKK 323
::**** **::**: *:*.*. *.:. ::******:.:*:::: ::: *:
Symbols used? * :.
Alignment of human α-globin and β-globin proteins (Avers Fig)
Human α-globin = 141 aa
Human β-globin = 146 aa
Was D-helix loss a neutral or adaptive mutation? (Nature 352: , 1991)
Reminder about definition of the word “homology”
In sequence comparisons, refer to nt (or aa) sequence relatedness as “… % identity” or “… % similarity” BUT NOT “… % homology”, because “homology” means “shares a common ancestor”
“Non-evolutionary biologists” Petsko Genome Biol. 2:1002, 2001
“Normalized alignment score”
NAS = (# identities x 10) + (# Cys identities x 20) – (# gaps x 25)
Doolittle, R. “URFs & ORFs” p.14
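The NAS formula is simple enough to compute by hand, but a small sketch shows how heavily gaps are penalized relative to identities. The function below implements the formula exactly as written on the slide and takes the three counts as inputs. Whether Cys identities are also included in the general identity count is not stated on the slide, so the caller decides how to count them; the function name and example counts are hypothetical.

```python
def normalized_alignment_score(identities: int, cys_identities: int, gaps: int) -> int:
    """NAS as written on the slide: identities x 10 + Cys identities x 20 - gaps x 25."""
    return identities * 10 + cys_identities * 20 - gaps * 25


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(normalized_alignment_score(identities=30, cys_identities=2, gaps=3))
    # The same alignment with five more gaps loses 125 points.
    print(normalized_alignment_score(identities=30, cys_identities=2, gaps=8))
```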
BLAST searches - to detect similarity between “sequence of interest” & databank entries
Query = yeast mt ribosomal protein L8 gene (1275 nt)
E-values: statistical measure of likelihood that sequences with this degree of similarity occur randomly, ie reflects number of hits expected by chance

Example of high score “hit” (red)
Score = 383 bits (193), Expect = 1e-102
Identities = 196/197 (99%), Gaps = 0/197 (0%)
Query AGCGTCAGGATAGCTCGCTCGATGTGGTCAGGCTAACACAATGAACAACGAGACTAGTG
      |||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||
Sbjct AGCGTCAGGATAGCTCGCTCGATGTGATCAGGCTAACACAATGAACAACGAGACTAGTG

Example of low score “hit” (blue or black) - “low complexity sequence”
Score = 40.1 bits (20), Expect = 3.6
Identities = 23/24 (95%), Gaps = 0/24 (0%)
Query GTTTTCTTAATATTTATTTAAAAA
      |||||||||||||||| |||||||
Sbjct GTTTTCTTAATATTTAATTAAAAA
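To see why a 383-bit hit and a 40-bit hit get such different E-values, the standard relation between bit score and expected chance hits, E = m x n x 2^(-S'), can be sketched as below, where m is the query length and n is the total database length. The database length used here is a hypothetical round number, not the size of the database behind the slide, and BLAST also applies length corrections that this sketch ignores, so the outputs are only rough and will not reproduce the reported E-values exactly.

```python
def expected_chance_hits(bit_score: float, query_len: int, db_len: float) -> float:
    """Expected number of chance hits with at least this bit score: m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    QUERY_LEN = 1275   # nt, from the slide
    DB_LEN = 3e9       # nt, hypothetical database size for illustration
    for bits in (383, 40.1):
        e = expected_chance_hits(bits, QUERY_LEN, DB_LEN)
        print(f"{bits} bits -> E ~ {e:.2g}")
```

Each additional bit of score halves the expected number of chance hits, so E falls off very steeply as the score rises.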
Why is “sequence complexity” important when judging whether two sequences are homologous?
Fig: Human DNA vs. Chimp DNA; Pu-rich region #1; Pu-rich region #2 (not homologous to #1); region of unbiased base composition (G=C=A=T)
How frequently is AAGAGGAG (8-nt sequence) expected to occur by chance in a DNA sequence?
If sequence A is of low complexity (or short length), high % identity with sequence B may not reflect shared evolutionary origin
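The chance frequency of AAGAGGAG can be worked out directly. With unbiased base composition each position matches with probability 1/4, so an 8-nt sequence is expected once per 4^8 = 65,536 positions on one strand. The sketch below also shows how a purine-rich composition raises that figure; the biased frequencies and function names are made up for illustration.

```python
def kmer_probability(kmer: str, base_freqs: dict = None) -> float:
    """Probability that a given position starts an exact copy of kmer."""
    freqs = base_freqs or {base: 0.25 for base in "ACGT"}
    p = 1.0
    for base in kmer:
        p *= freqs[base]
    return p


def expected_occurrences(kmer: str, seq_len: int, base_freqs: dict = None) -> float:
    """Expected number of copies of kmer on one strand of a random sequence."""
    positions = max(seq_len - len(kmer) + 1, 0)
    return positions * kmer_probability(kmer, base_freqs)


if __name__ == "__main__":
    kmer = "AAGAGGAG"
    pu_rich = {"A": 0.4, "G": 0.4, "C": 0.1, "T": 0.1}  # hypothetical Pu-rich region
    print("unbiased, per position:", kmer_probability(kmer))
    print("Pu-rich, per position: ", kmer_probability(kmer, pu_rich))
    print("expected copies in 1 Mb, unbiased:", expected_occurrences(kmer, 1_000_000))
```

In a purine-rich region the same 8-mer is many times more likely to turn up by chance, so a perfect match there says little about common ancestry.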
Advantages of using aa (rather than nt) sequences for identifying homologous genes among organisms?
- 20 amino acids vs. 4 nucleotides
- lower chance of “spurious” matches
- unrelated nt sequences (non-homologous) expected to show 25% identity by random chance (if unbiased base composition)
- for distantly related sequences – “saturation” of synonymous sites within codons (multiple hits)
- degeneracy of genetic code & different codon usage patterns (and G+C% of genomes) among organisms
But… for certain phylogenetic analyses, number of informative characters may be higher at DNA than protein level
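The 25% figure for unrelated nt sequences follows from a short calculation: when two random sequences are compared position by position without gaps, the chance that a site matches is the sum of the squared residue frequencies. The sketch below applies this to nucleotides and to amino acids. The uniform amino acid frequencies and the biased nt composition are simplifying assumptions for illustration; real proteins do not use the 20 amino acids equally.

```python
def chance_identity(freqs: dict) -> float:
    """Expected fraction of identical sites between two unrelated random sequences."""
    return sum(f * f for f in freqs.values())


if __name__ == "__main__":
    uniform_nt = {base: 0.25 for base in "ACGT"}
    at_rich_nt = {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15}  # hypothetical A+T-rich genome
    uniform_aa = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}

    print(f"nt, unbiased: {chance_identity(uniform_nt):.0%}")
    print(f"nt, A+T-rich: {chance_identity(at_rich_nt):.0%}")
    print(f"aa, uniform:  {chance_identity(uniform_aa):.0%}")
```

The lower background for amino acids is what makes weak but real similarity easier to spot at the protein level, and biased base composition pushes the nt background above 25%.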
What if BLAST search were done at protein (instead of nt) level? Query = yeast mitochondrial ribosomal protein L8 (238 aa) Fungal Bacterial
Dot matrix method for aligning sequences (Fig. 3.7)
- 2 sequences to be compared along X and Y axis of matrix
- dots put in matrix when nts in the 2 sequences are identical
- mismatch = “gap” (or break) in line
indel = shift in diagonal Fig. 3.7
Dot matrix method
- normally compare blocks rather than individual nts
- spurious matches (background noise) influenced by
1. window size – overlapping fixed-length windows whereby sequence 1 compared with seq 2
2. stringency – minimum threshold value (% identity) at each step to score as hit
- for coding regions, could use aa instead of nt sequences to reduce “noise”
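The effect of window size and stringency on background noise can be tried out with a small dot matrix sketch. The function names, the parameter names `window` and `min_identity`, and the example sequences are illustrative assumptions, not taken from any particular program.

```python
def dot_matrix(seq1: str, seq2: str, window: int = 1, min_identity: float = 1.0) -> set:
    """Return the (i, j) window start positions that score as hits.

    A hit means the window of seq1 starting at i and the window of seq2
    starting at j are identical at a fraction >= min_identity of positions.
    """
    hits = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(1 for k in range(window) if seq1[i + k] == seq2[j + k])
            if matches / window >= min_identity:
                hits.add((i, j))
    return hits


def render(seq1: str, seq2: str, hits: set) -> str:
    """Draw the matrix as text: seq1 along the top, seq2 down the side."""
    lines = ["  " + seq1]
    for j, base in enumerate(seq2):
        row = "".join("*" if (i, j) in hits else "." for i in range(len(seq1)))
        lines.append(base + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    s1 = "ACGTTGCAACGT"
    s2 = "ACGTGCAACGT"  # same as s1 with one T deleted (an indel)
    print(render(s1, s2, dot_matrix(s1, s2, window=1)))
    print()
    print(render(s1, s2, dot_matrix(s1, s2, window=4, min_identity=0.75)))
```

With `window=1` every chance identity gives a dot; with a longer window most of that noise drops out, and the deleted T shows up as a shift of the main diagonal.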
Comparison of human chromosome 7 “draft” sequence (2001) with “near-complete” sequence (2004)
Nature 431:935, 2004
Fig: 2004 sequence (fewer errors) vs. 2001 sequence; blowup of 500 kb region
How do you interpret the data in this figure?