Published by Corey Lewis. Modified over 9 years ago.
1
EVOLUTIONARY CHANGE IN DNA SEQUENCES
- usually too slow to monitor directly…
- … so use comparative analysis of 2 sequences which share a common ancestor
- determine number and nature of nt substitutions that have occurred (ie measure degree of divergence)
Spontaneous mutation rates? (p. 35-37)
- for mammalian nuclear DNA (regions not under functional constraint): ~ 4 x 10^-9 nt substitutions per site per year
- much higher for viruses, eg. 10^-6 to 10^-3 nt substitutions per site per generation
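As a rough illustration of "number and nature" of substitutions, the sketch below counts differences between two already-aligned sequences and splits them into transitions (purine to purine, or pyrimidine to pyrimidine) and transversions. The function name and the returned dictionary keys are hypothetical, and skipping gapped or ambiguous columns is an assumption, not something stated on the slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_substitutions(seq1, seq2):
    """Compare two aligned nt sequences column by column (hypothetical helper)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        # Assumption: columns with gaps or ambiguous symbols are ignored.
        if x not in "ACGT" or y not in "ACGT":
            continue
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    differences = transitions + transversions
    return {
        "compared": compared,
        "differences": differences,
        "transitions": transitions,
        "transversions": transversions,
        # Observed proportion of differing sites (uncorrected divergence).
        "p": differences / compared if compared else 0.0,
    }
```

For example, `count_substitutions("AGCT", "GGAT")` reports one transition (A/G) and one transversion (C/A) over four compared sites.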
2
Potential pitfalls
1. Are all evolutionary changes being monitored?
- if closely-related, high probability only one change at any given site…
- but if distant, may have been multiple substitutions (“hits”) at a site
- can use algorithms to correct for this
2. If indels between two sequences, can they be aligned with confidence?
- algorithms with gap penalties
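The slide does not name a specific correction for multiple hits. One standard choice, used here purely as an example, is the Jukes-Cantor one-parameter model, which converts the observed proportion of differing sites p into an estimated number of substitutions per site d:

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor correction: d = -(3/4) * ln(1 - (4/3) * p).

    p is the observed proportion of sites that differ between two
    aligned sequences. The model assumes equal base frequencies and
    equal rates for all substitution types.
    """
    if not 0.0 <= p < 0.75:
        # At p >= 0.75 the sequences look random under this model and
        # the logarithm is undefined.
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

For closely related sequences the correction is small (p = 0.10 gives d of about 0.107), but it grows quickly as p approaches saturation, which matches the point that distant sequences hide multiple substitutions at a site.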
3
Fig. 3.6: Ancestral sequence and present-day sequences
4
Homoplasy: same nt, but not directly inherited from ancestral sequence (Page & Holmes Fig. 5.9)
(If comparing long stretches, highly unlikely they would have converged to the same sequence)
5
Nucleotide substitutions within protein-coding sequences (p. 82)
1. Synonymous vs. non-synonymous
Single step: Multiple steps: AATACT
Is one pathway more likely than another?
6
2. Nomenclature related to “degeneracy”:
- Non-degenerate - all possible changes at site are non-synonymous
- 2-fold degenerate - one of the 3 possible changes is synonymous
- 4-fold degenerate - all possible changes at site are synonymous
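The definitions above can be checked mechanically. The sketch below assumes the standard nuclear genetic code and classifies one position of a codon by counting how many of the 3 possible substitutions leave the amino acid unchanged. The function name is hypothetical, positions are 0-based, and the "3-fold degenerate" label (two of three changes synonymous, as at the third position of the Ile codons) is an addition not listed on the slide.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def site_degeneracy(codon, position):
    """Classify codon position 0, 1 or 2 by its number of synonymous changes."""
    codon = codon.upper()
    original = CODON_TABLE[codon]
    synonymous = 0
    for base in BASES:
        if base == codon[position]:
            continue
        mutant = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[mutant] == original:
            synonymous += 1
    labels = {
        0: "non-degenerate",
        1: "2-fold degenerate",
        2: "3-fold degenerate",  # not on the slide; e.g. third position of ATT
        3: "4-fold degenerate",
    }
    return labels[synonymous]
```

For instance, the third position of GGT (Gly) is 4-fold degenerate, the third position of AAT (Asn) is 2-fold degenerate, and the second position of AAT is non-degenerate.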
7
ALIGNMENT OF SEQUENCES FOR COMPARATIVE ANALYSIS
1. By manual inspection
- if sequences very similar and no (or few) gaps
2. By sequence distance methods (often followed by “correction by visual inspection”)
- use algorithms which minimize mismatches and gaps
- gap penalty > mismatch penalty
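The slide does not say which algorithm is meant. As one well-known example, a minimal Needleman-Wunsch global alignment with a linear gap penalty is sketched below. The scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions chosen only so that a gap costs more than a mismatch.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, m + 1):
        score[0][j] = j * gap          # leading gaps in a

    def sub(x, y):
        return match if x == y else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # align two residues
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these settings, aligning `ACGT` against `AGT` places a single gap opposite the C. Production tools typically use affine gap penalties (separate costs for opening and extending a gap) and substitution matrices, which this sketch leaves out.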
8
Fig. 3.12 Alignment of human and chicken pancreatic hormone proteins
(a) no gap penalty
(b) with gap penalty
(c) alignment as in (b), with biochemically similar aa
9
Multiple sequence alignments - CLUSTALW
www.ebi.ac.uk/clustalw (European Bioinformatics Institute)
CLUSTAL W (1.81) Multiple Sequence Alignments
Sequence 1: ArabidopsisAAG52143 798 aa
Sequence 2: ArabidopsisAAC26676 845 aa
Sequence 3: yeast 664 aa
Sequences (2:3) Aligned. Score: 23
Sequences (1:2) Aligned. Score: 93
Sequences (1:3) Aligned. Score: 22
ArabAAG52143 FIVDEADLLLDLGFRRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR 539
ArabAAC26676 FIVDEADLLLDLGFKRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR 586
yeast        -VLDEADRLLEIGFRDDLETISGILNEKNSKSADNIKTLLFSATLDDKVQKLANNIMNKK 323
::**** **::**: *:*.*. *.:. ::******:.:*:::: ::: *:
Symbols used? * : .
10
Alignment of human α-globin and β-globin proteins (Avers Fig. 3.23)
Human α-globin = 141 aa
Human β-globin = 146 aa
Was D-helix loss neutral or adaptive mutation? (Nature 352: 349-51, 1991)
11
Reminder about definition of the word “homology”
In sequence comparisons, refer to nt (or aa) sequence relatedness as “… % identity” or “...% similarity” BUT NOT “ … % homology”, because “homology” means “shares a common ancestor”
“Non-evolutionary biologists” Petsko Genome Biol. 2:1002, 2001
12
“Normalized alignment score” NAS = (# identities x 10) + (# Cys identities x 20) – (# gaps x 25) Doolittle, R. “URFs & ORFs” p.14
13
BLAST searches www.ncbi.nlm.nih.gov/BLAST/
- to detect similarity between “sequence of interest” & databank entries
Query = yeast mt ribosomal protein L8 gene (1275 nt)
E-values: statistical measure of likelihood that sequences with this degree of similarity occur randomly, ie. reflects number of hits expected by chance
Example of high score “hit” (red)
Score = 383 bits (193), Expect = 1e-102
Identities = 196/197 (99%), Gaps = 0/197 (0%)
Query  AGCGTCAGGATAGCTCGCTCGATGTGGTCAGGCTAACACAATGAACAACGAGACTAGTG
       |||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||
Sbjct  AGCGTCAGGATAGCTCGCTCGATGTGATCAGGCTAACACAATGAACAACGAGACTAGTG
Example of low score “hit” (blue or black): “low complexity sequence”
Score = 40.1 bits (20), Expect = 3.6
Identities = 23/24 (95%), Gaps = 0/24 (0%)
Query  GTTTTCTTAATATTTATTTAAAAA
       |||||||||||||||| |||||||
Sbjct  GTTTTCTTAATATTTAATTAAAAA
14
Why is “sequence complexity” important when judging whether two sequences are homologous?
Figure labels: Human DNA vs. Chimp DNA; Pu-rich region #1; Pu-rich region #2 (not homologous to #1); region of unbiased base composition (G=C=A=T); example sequence AAGAGGAG
How frequently is AAGAGGAG (8-nt sequence) expected to occur by chance in a DNA sequence?
If sequence A is of low complexity (or short length), high % identity with sequence B may not reflect shared evolutionary origin
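One way to work through the question about AAGAGGAG: if bases occur independently, the chance of a given 8-nt sequence at any one position is the product of its base frequencies, which is (1/4)^8 = 1/65,536 for unbiased composition. The sketch below turns that into an expected count on one strand. The function name and the purine-rich frequencies used in the usage note are hypothetical illustration values, not data from the slide.

```python
def expected_chance_occurrences(motif, seq_length, base_freqs=None):
    """Expected number of exact matches to motif on one strand of a random sequence.

    Assumes each base is drawn independently with the given frequencies
    (unbiased composition, 0.25 each, when base_freqs is None).
    """
    if base_freqs is None:
        base_freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    prob = 1.0
    for base in motif.upper():
        prob *= base_freqs[base]
    # Number of positions at which the motif could start.
    positions = max(seq_length - len(motif) + 1, 0)
    return positions * prob
```

Under unbiased composition the motif is expected about once per 65.5 kb. In a hypothetical purine-rich region with `{"A": 0.4, "G": 0.4, "C": 0.1, "T": 0.1}` the same 8-mer becomes roughly 43 times more likely per position, which is why matches within low-complexity regions are weak evidence of homology.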
15
Advantages of using aa (rather than nt) sequences for identifying homologous genes among organisms?
- 20 amino acids vs. 4 nucleotides
  - lower chance of “spurious” matches
  - unrelated nt sequences (non-homologous) expected to show 25% identity by random chance (if unbiased base composition)
- for distantly related sequences – “saturation” of synonymous sites within codons (multiple hits)
- degeneracy of genetic code & different codon usage patterns (and G+C% of genomes) among organisms
But… for certain phylogenetic analyses, number of informative characters may be higher at DNA than protein level
16
What if BLAST search were done at protein (instead of nt) level? Query = yeast mitochondrial ribosomal protein L8 (238 aa) Fungal Bacterial
17
Dot matrix method for aligning sequences (Fig. 3.7)
- 2 sequences to be compared along X and Y axes of matrix
- dots put in matrix when nts in the 2 sequences are identical
- mismatch = “gap” (or break) in line
18
indel = shift in diagonal Fig. 3.7
19
Dot matrix method
- normally compare blocks rather than individual nts
- spurious matches (background noise) influenced by
1. window size – overlapping fixed-length windows whereby sequence 1 compared with seq 2
2. stringency – minimum threshold value (% identity) at each step to score as hit
- for coding regions, could use aa instead of nt sequences to reduce “noise”
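The window and stringency ideas can be sketched in a few lines. In this minimal version (function and parameter names are hypothetical), a dot is recorded at (i, j) when the window starting at position i of sequence 1 and the window starting at position j of sequence 2 share at least `stringency * window` identical positions. Here `stringency` is a fraction between 0 and 1 rather than a percentage.

```python
def dot_matrix(seq1, seq2, window=1, stringency=1.0):
    """Return the set of (i, j) window start positions that score as hits."""
    hits = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            # Count identical positions within the two windows.
            matches = sum(1 for k in range(window) if seq1[i + k] == seq2[j + k])
            if matches >= stringency * window:
                hits.add((i, j))
    return hits

def render(hits, len1, len2, window=1):
    """Draw the matrix as text: seq1 along X (columns), seq2 along Y (rows)."""
    rows = []
    for j in range(len2 - window + 1):
        rows.append("".join("*" if (i, j) in hits else "."
                            for i in range(len1 - window + 1)))
    return "\n".join(rows)
```

With `window=1` every chance identity produces a dot, so unrelated sequences show heavy background. Raising the window size or the stringency removes most isolated dots while keeping the diagonal runs that mark genuinely similar regions.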
20
Comparison of human chromosome 7 “draft” sequence (2001) with “near-complete” sequence (2004) (Nature 431:935, 2004)
Figure labels: 2001 sequence; 2004 sequence (fewer errors); blowup of 500 kb region
How do you interpret the data in this figure?