Published by Jeremy Spencer. Modified over 9 years ago.
1
Bioinformatics 2007-I
Prof. Mirko Zimic
Monday:
- Pairwise sequence alignment
- Local and global alignment
- Scoring matrices
- Dynamic programming algorithms
- Dot plot
Wednesday:
- Pairwise sequence alignment: using the programs Clustal and Macaw, and online servers
2
“Nothing in biology makes sense except in the light of evolution” T. Dobzhansky
3
“Align” = “Compare”
[Figure: finches of the Galápagos Islands observed by Charles Darwin on the voyage of HMS Beagle]
- Sequence alignment is similar to other types of comparative analysis
- It involves scoring similarities and differences among a group of related entities
4
Homology
“Homology is the central concept for all of biology. Whenever we say that a mammalian hormone is the ‘same’ hormone as a fish hormone, that a human gene sequence is the ‘same’ as a sequence in a chimp or a mouse, that a HOX gene is the ‘same’ in a mouse, a fruit fly, a frog and a human - even when we argue that discoveries about a worm, a fruit fly, a frog, a mouse, or a chimp have relevance to the human condition - we have made a bold and direct statement about homology. The aggressive confidence of modern biomedical science implies that we know what we are talking about.” David B. Wake
5
Similarity ≠ Homology
1) 25% similarity over ≥ 100 amino acids likely indicates homology
2) Homology is an evolutionary statement which means “descent from a common ancestor”
- common 3D structure
- usually common function
- all or nothing: one cannot say "50% homologous"
6
COMPARATIVE ANALYSIS
Alignment algorithms model evolutionary processes.
Derivation from a common ancestor through incremental change due to DNA replication errors, mutations, damage, or unequal crossing-over.
[Figure: tree of sequences derived from GATTACCA, with branches labeled insertion, deletion, and substitution. Sequences shown: GATGACCA, GATTACCA, GATCATCA, GATTGATCA, GATTATCA, and GATACCA (GATTACCA with one T deleted)]
7
COMPARATIVE ANALYSIS
Alignment algorithms model evolutionary processes.
Derivation from a common ancestor through incremental change.
Only extant sequences are known; ancestral sequences are postulated.
[Figure: the same tree, with the extant sequences GATCATCA, GATTGATCA, GATTACCA, and GATACCA picked out]
8
COMPARATIVE ANALYSIS
Alignment algorithms model evolutionary processes.
The term homology implies a common ancestry, which may be inferred from observations of sequence similarity.
Derivation from a common ancestor through incremental change. Mutations that do not kill the host may carry over to the population. Rarely are mutations kept/rejected by natural selection.
[Figure: the same tree of sequences derived from GATTACCA]
9
Sequence Alignments
Why align?
- Can delineate sequence elements that are functionally significant
- Illuminates phylogenetic relationships
Algorithms for sequence alignment:
- Dynamic programming
- Dot-matrix
- Word-based algorithms
- Bayesian methods
10
What is Meant by Alignment?
Identical nucleotide sequences (trivial example):
ATTCGGCATTCAGTGCTAGA
ATTCGGCATTCAGTGCTAGA
Score = 20 (20 × 1)
Imperfect match:
ATTCGGCATTCAGTGCTAGA
ATTCGGCATTGCTAGA
Score = 11
A better alignment:
ATTCGGCATTCAGTGCTAGA
ATTCGGCATT----GCTAGA
Score = 14 = 10 + 6 + 4 × (-0.5), where -0.5 is the gap penalty
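The three scores on this slide can be reproduced with a short sketch. It assumes a match scores 1, a gap position scores -0.5 (the gap penalty on the slide), and a mismatch scores 0. The mismatch value is not stated on the slide; it is inferred from the score of 11 for the ungapped comparison. The function name `score_alignment` is hypothetical.

```python
def score_alignment(row_a, row_b, match=1.0, mismatch=0.0, gap=-0.5):
    """Score two aligned rows column by column; '-' marks a gap position.

    mismatch=0.0 is an assumption inferred from the slide's scores.
    """
    total = 0.0
    # zip stops at the shorter row, which is how the ungapped comparison
    # of a 20-mer against a 16-mer is scored here.
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


reference = "ATTCGGCATTCAGTGCTAGA"
print(score_alignment(reference, reference))               # identical sequences
print(score_alignment(reference, "ATTCGGCATTGCTAGA"))       # ungapped, imperfect match
print(score_alignment(reference, "ATTCGGCATT----GCTAGA"))   # gapped alignment
```

Inserting the four-position gap costs 2 points but recovers 5 additional matches, which is why the gapped alignment scores higher than the ungapped one.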
11
Beware of aligning apples and oranges [and grapefruit]! Paralogous versus orthologous; genomic versus cDNA; mature versus precursor.
12
Alignments can be performed on DNA sequences as well as on protein sequences…
13
Why Do We Want To Compare Sequences?
wheat  --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
?????  KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA
Homology? If so, EXTRAPOLATE to the unknown sequence (??????) from the known one (SwissProt).
14
Why Does It Make Sense To Align Sequences?
- Evolution is our real tool.
- Nature is LAZY and keeps re-using stuff.
- Evolution is mostly DIVERGENT.
Same Sequence -> Same Ancestor
15
Why Does It Make Sense To Align Sequences?
Same Sequence -> Same Function, Same 3D Fold, Same Origin
Comparing Is Reconstructing Evolution
16
An Alignment is a STORY
ADKPKRPLSAYMLWLN
  Mutations + Selection
ADKPKRPKPRLSAYMLWLN
ADKPRRPLS-YMLWLN
17
An Alignment is a STORY
ADKPKRPLSAYMLWLN
  Mutations + Selection
ADKPKRPKPRLSAYMLWLN
ADKPRRPLS-YMLWLN
The resulting alignment, with its differences labeled Mutation, Insertion, and Deletion:
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
18
Evolution is NOT Always Divergent…
[Figure: two AFGP proteins with (ThrAlaAla)n repeats, one similar to trypsinogen and one NOT similar to trypsinogen; panels labeled N and S]
SIMILAR sequences BUT DIFFERENT origin
…But in MOST cases, you may assume it is.
19
How Do Sequences Evolve?
CONSTRAINED genome positions evolve SLOWLY.
EVERY protein family has its own level of constraint.

Family            Ks    Ka
Histone3          6.4   0
Insulin           4.0   0.1
Interleukin I     4.6   1.4
Globin            5.1   0.6
Apolipoprot. AI   4.5   1.6
Interferon G      8.6   2.8

Rates in substitutions/site/billion years, as measured on mouse vs. human (80 million years).
Ks: synonymous mutations; Ka: non-neutral.
20
How Do Sequences Evolve?
To make things worse, every residue has its own personality.
[Figure: the amino acid Venn diagram, with overlapping sets labeled Aliphatic, Aromatic, Hydrophobic, Polar, and Small]
21
How Do Sequences Evolve?
In a structure, each amino acid plays a special role.
[Figure: OmpR, C-terminal domain, with charged positions marked - - +]
- In the core, SIZE MATTERS
- On the surface, CHARGE MATTERS
22
How Do Sequences Evolve?
Accepted Mutations Depend on the Structure
- Big -> Big, Small -> Small: NO DELETION
- Charged -> Charged, Small -> Big or Small: DELETIONS
[Figure: structure with charged positions marked - - +]
23
How Can We Compare Sequences?
To compare two sequences, we need:
- Their function
- Their structure
We do not have them!!!
24
How Can We Compare Sequences?
We will need to replace structural information with sequence information.
Same Sequence -> Same Function, Same 3D Fold, Same Origin
It CANNOT work ALL THE TIME!!!
25
How Can We Compare Sequences?
To compare sequences, we need to compare residues.
We need to know how much it COSTS to SUBSTITUTE:
- an Alanine into an Isoleucine
- a Tryptophan into a Glycine
- …
The table that contains the costs for all the possible substitutions is called the SUBSTITUTION MATRIX.
How to derive that matrix?
26
How Can We Compare Sequences?
Making a Substitution Matrix
- Take 100 nice pairs of protein sequences, easy to align (80% identical).
- Align them…
- Count each mutation in the alignments:
  - 25 Tryptophans into Phenylalanine
  - 30 Isoleucines into Leucine
  - …
- For each mutation, set the substitution score to the log odds ratio:
  score = log(Observed / Expected by chance)
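The log odds ratio from the last step can be sketched as follows. This is a minimal illustration, not how any published matrix was built: the function name `log_odds_score`, the base-2 logarithm, and all the frequencies in the example are assumptions. It takes the chance expectation for a pair to be the product of the two background residue frequencies.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, base=2.0):
    """Log odds ratio: log(observed / expected by chance).

    Assumes the chance expectation of seeing residues a and b aligned
    is the product of their background frequencies.
    """
    expected_by_chance = freq_a * freq_b
    return math.log(observed_pair_freq / expected_by_chance, base)


# Made-up frequencies, for illustration only.
print(log_odds_score(0.02, 0.1, 0.1))    # seen twice as often as chance: positive
print(log_odds_score(0.005, 0.1, 0.1))   # seen half as often as chance: negative
```

A positive score marks a substitution seen more often than chance would predict (tolerated by evolution); a negative score marks one seen less often.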
27
How Can We Compare Sequences?
Making a Substitution Matrix
- The diagonal indicates how conserved a residue tends to be.
- W is VERY conserved.
- Some residues are easier to mutate into other, similar residues.
- Cysteines that make disulfide bridges and those that do not get averaged.
29
How Can We Compare Sequences?
Using a Substitution Matrix
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
(differences labeled Mutation, Insertion, and Deletion)
Given two sequences and a substitution matrix, we must compute the CHEAPEST alignment.
30
Scoring an Alignment
Most popular substitution matrices:
- PAM250
- Blosum62 (most widely used)
Raw score:
TPEA
APGA
Score = 1 + 6 + 0 + 2 = 9
Question: Is it possible to get such a good alignment by chance only?
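The raw score above is a column-by-column sum of substitution scores. A minimal sketch, using only the four pair scores quoted on the slide (the names `SLIDE_SCORES`, `pair_score`, and `raw_score` are hypothetical; a real matrix such as PAM250 or Blosum62 covers every pair of amino acids):

```python
# Only the four pair scores shown on the slide; not a complete matrix.
SLIDE_SCORES = {
    ("T", "A"): 1,
    ("P", "P"): 6,
    ("E", "G"): 0,
    ("A", "A"): 2,
}


def pair_score(a, b, matrix):
    """Look up a substitution score; the matrix is treated as symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def raw_score(row_a, row_b, matrix):
    """Sum the substitution scores over the columns of an ungapped alignment."""
    return sum(pair_score(a, b, matrix) for a, b in zip(row_a, row_b))


print(raw_score("TPEA", "APGA", SLIDE_SCORES))  # 1 + 6 + 0 + 2
```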
31
Insertions and Deletions
Gap Penalties
Opening a gap is more expensive than extending it.
Seq A  GARFIELDTHE----CAT
       |||||||||||    |||
Seq B  GARFIELDTHELASTCAT
[Figure labels on the gap: Gap Opening Penalty, Gap Extension Penalty]
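The effect of making gap opening more expensive than gap extension can be sketched with an affine gap cost. The convention used here (cost = opening penalty + extension penalty × gap length) is one of several in use, and the penalty values are placeholders; neither comes from the slides. The function name `affine_gap_cost` is hypothetical.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one gap of the given length.

    Convention assumed here: gap_open + gap_extend * length.
    The default penalty values are illustrative placeholders.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * length


# One gap of four positions, as in GARFIELDTHE----CAT ...
one_long_gap = affine_gap_cost(4)
# ... versus four separate gaps of one position each.
four_short_gaps = 4 * affine_gap_cost(1)
print(one_long_gap, four_short_gaps)
```

Under this scheme a single long gap is much cheaper than the same number of gap positions scattered along the alignment, which favors explaining a difference by one insertion or deletion event rather than several.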
32
How Can We Compare Sequences?
Limits of the Substitution Matrices
- They ignore non-local interactions and assume that identical residues are equal.
- They assume the evolution rate to be constant.
[Figure: ADKPKRPLSAYMLWLN diverging through Mutations + Selection into ADKPKRPKPRLSAYMLWLN and ADKPRRPLS-YMLWLN]
33
How Can We Compare Sequences?
Limits of the Substitution Matrices
Substitution matrices cannot work!!!
34
How Can We Compare Sequences?
Limits of the Substitution Matrices
I know… But at least, could I get some idea of when they are likely to do all right?
35
How Can We Compare Sequences?
The Twilight Zone
[Figure: % sequence identity plotted against length, with marks at 30% identity and at lengths 30 and 100. Regions labeled: "Similar Sequence, Similar Structure" (Same 3D Fold); "Twilight Zone"; "Different Sequence, Structure ????"]
36
How Can We Compare Sequences ? The Twilight Zone Substitution Matrices Work Reasonably Well on Sequences that have more than 30 % identity over more than 100 residues
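The rule of thumb above can be checked on any pairwise alignment. This sketch assumes percent identity is computed over the columns where both sequences have a residue; other definitions (for example, dividing by the full alignment length) are also in use. The function names are hypothetical.

```python
def percent_identity(row_a, row_b):
    """Return (% identity, number of aligned columns) for two aligned rows.

    Assumption: gap columns ('-') are excluded from the denominator.
    """
    aligned = identical = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        return 0.0, 0
    return 100.0 * identical / aligned, aligned


def above_twilight_zone(row_a, row_b, min_identity=30.0, min_length=100):
    """Rule of thumb: more than 30% identity over more than 100 residues."""
    identity, aligned = percent_identity(row_a, row_b)
    return identity > min_identity and aligned > min_length


identity, aligned = percent_identity("ADKPRRP---LS-YMLWLN", "ADKPKRPKPRLSAYMLWLN")
print(f"{identity:.1f}% identity over {aligned} aligned residues")
print(above_twilight_zone("ADKPRRP---LS-YMLWLN", "ADKPKRPKPRLSAYMLWLN"))
```

The short example alignment from the earlier slides is highly identical but far too short to satisfy the length condition, so the rule does not vouch for it.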
42
Major Differences between PAM and BLOSUM
43
How Can We Compare Sequences?
Which Matrix Shall I Use?
- PAM: for distant proteins, use a high index (PAM 350)
- BLOSUM: for distant proteins, use a low index (Blosum30)
- GONNET 250 > BLOSUM62 > PAM 250, but this will depend on:
  - the family
  - the program used and its tuning
Choosing the right matrix may be tricky…
Insertions, deletions?
44
HOW Can We Align Two Sequences?
- Dot matrices
- Global alignments
- Local alignments
46
Global Alignments
- Take 2 nice protein sequences
- A good substitution matrix (blosum)
- A gap opening penalty (GOP)
- A gap extension penalty (GEP)
Affine Gap Penalty
[Figure: gap cost plotted against gap length L, with GOP and GEP marked]
Parsimony: evolution takes the simplest path (so we think…)
48
Global Alignments
- Take 2 nice protein sequences
- A good substitution matrix (blosum)
- A gap opening penalty (GOP)
- A gap extension penalty (GEP)
- DYNAMIC PROGRAMMING
Input:
>Seq1
THEFATCAT
>Seq2
THEFASTCAT
Output of DYNAMIC PROGRAMMING:
THEFA-TCAT
THEFASTCAT
49
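Global Alignments
Aligning FAST with FAT.
Brute force enumeration: try every possible alignment, for example:
----FAT
FAST---

---FAT-
FAST---

--F-AT-
FAST---
Number of alignments to enumerate: (L1+L2)! / ((L1)! × (L2)!)
Alternative: DYNAMIC PROGRAMMING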
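To see why brute force enumeration is not an option, the slide's formula can be evaluated directly and compared with the number of lattice points a dynamic programming table has to fill. The function names are hypothetical; `math.comb(n, k)` is the standard-library binomial coefficient, which equals (L1+L2)! / (L1! × L2!) when called with n = L1+L2 and k = L1.

```python
import math


def brute_force_alignments(len_1, len_2):
    """Slide's formula: (L1 + L2)! / (L1! * L2!)."""
    return math.comb(len_1 + len_2, len_1)


def dynamic_programming_cells(len_1, len_2):
    """Lattice points in the dynamic programming table, borders included."""
    return (len_1 + 1) * (len_2 + 1)


for len_1, len_2 in [(3, 4), (10, 10), (100, 100)]:
    print(len_1, len_2,
          brute_force_alignments(len_1, len_2),
          dynamic_programming_cells(len_1, len_2))
```

For FAT against FAST the enumeration is still small, but the count grows explosively with sequence length while the table only grows with the product of the two lengths.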
50
DYNAMIC PROGRAMMING
Dynamic Programming Example
Construct an optimal alignment of these two sequences:
GATACTA
GATTACCA
Using these scoring rules:
- Match: +1
- Mismatch: -1
- Gap: -1
51
Arrange the sequence residues (GATACTA and GATTACCA) along a two-dimensional lattice.
Vertices of the lattice fall between letters.
52
The goal is to find the optimal path from one corner of the lattice to the opposite corner.
53
Each path corresponds to a unique alignment.
Which one is optimal?
54
The score for a path is the sum of its incremental edge scores.
A aligned with A: Match = +1
55
The score for a path is the sum of its incremental edge scores.
A aligned with T: Mismatch = -1
56
The score for a path is the sum of its incremental edge scores.
T aligned with NULL, or NULL aligned with T: Gap = -1
57
Incrementally extend the path.
[Figure: lattice for GATACTA and GATTACCA with the first scores filled in (0, +1)]
58
Incrementally extend the path.
Remember the best sub-path leading to each point on the lattice.
[Figure: lattice with a few more scores filled in]
59
Incrementally extend the path.
Remember the best sub-path leading to each point on the lattice.
[Figure: lattice, further filled in]
60
Incrementally extend the path.
Remember the best sub-path leading to each point on the lattice.
[Figure: lattice, further filled in]
61
Incrementally extend the path.
Remember the best sub-path leading to each point on the lattice.
[Figure: lattice, further filled in]
62
Incrementally extend the path.
Remember the best sub-path leading to each point on the lattice.
[Figure: lattice, mostly filled in]
63
Incrementally extend the path.
Remember the best sub-path leading to each point on the lattice.
[Figure: lattice, completely filled in]
64
Trace-back to get the optimal path and alignment.
[Figure: completely filled lattice]
65
Print out the alignment:
GAT-ACTA
GATTACCA
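The lattice-filling and trace-back steps walked through on the preceding slides can be sketched as a small global alignment routine. It uses the slides' scoring rules (match +1, mismatch -1, gap -1) with a simple per-position gap penalty rather than separate opening and extension penalties. The function name and the tie-breaking order during trace-back (diagonal first) are choices made for this sketch, so when several alignments share the optimal score it may print a different one than the slide.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Borders: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each lattice point keeps the score of the best sub-path leading to it.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in seq_b
                              score[i][j - 1] + gap)       # gap in seq_a

    # Trace back from the last corner to the first one.
    i, j = len(seq_a), len(seq_b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("GATACTA", "GATTACCA")
print(best)
print(row_a)
print(row_b)
```

For GATACTA and GATTACCA the optimal score is 4: six matches, one mismatch, and one gap.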
66
Global Alignments
DYNAMIC PROGRAMMING (Needleman and Wunsch)
Match = 1, MisMatch = -1, Gap = -1
Completed matrix for FAT (rows) against FAST (columns):

          F    A    S    T
     0   -1   -2   -3   -4
F   -1    1    0   -1   -2
A   -2    0    2    1    0
T   -3   -1    1    1    2

Trace-back from the last cell (score 2) gives:
FAST
FA-T
67
Local Alignments
[Figure: GLOBAL alignment compared with LOCAL alignment]
Smith and Waterman (SW) = LOCAL alignment
68
Two different types of Alignment
Global alignment methods:
- Needleman & Wunsch, J. Mol. Biol. (1970) 48, 443-453: the problem of finding the best path. Revelation: any partial sub-path that ends at a point along the true optimal path must itself be the optimal path leading to that point. This provides a method to create a matrix of path “scores”, each entry being the score of the best path leading to that point. Trace the optimal path from one end to the other of the two sequences.
Local alignment methods:
- Smith & Waterman, J. Mol. Biol. (1981) 147, 195-197: use Needleman & Wunsch, but report all non-overlapping paths, starting at the highest-scoring points in the path graph.
- FASTP: Lipman & Pearson (1985), Science 227, 1435-1441.
- BLAST: Altschul et al. (1990), J. Mol. Biol. 215, 403-410: do not report all overlapping paths, but only attempt to find paths where there are high-scoring words. This speeds up the alignments considerably.
69
Global vs. Local Alignment
[Figure: a high-scoring subsequence and a gap, shown under a global alignment and under a local alignment]
- Global alignment: the best overall alignment, independent of whether local high-scoring sequences are included
- Local alignment: alignments involving high-scoring sequences take precedence over global features
70
GLOBAL & LOCAL SIMILARITY
Implementations of dynamic programming for global and local similarities:
- Optimal global alignment: Needleman & Wunsch (1970). Sequences align essentially from end to end.
- Optimal local alignment: Smith & Waterman (1981). Sequences align only in small, isolated regions.
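The difference between the two approaches shows up in only a few lines of the dynamic programming routine. The sketch below is a simplified local alignment in the spirit of Smith & Waterman: scores are never allowed to drop below zero, and the trace-back starts at the highest-scoring point and stops at the first zero. It reports only the single best local alignment. The scoring values (match +1, mismatch -1, gap -1) are reused from the global example and are assumptions here, as are the function name and the test sequences.

```python
def smith_waterman(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best local alignment; returns (score, row_a, row_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # The zero lets a new local alignment start anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring point until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Made-up sequences sharing one region, surrounded by unrelated residues.
print(smith_waterman("XXXGATTACAYYY", "PPGATTACAQQ"))
```

Only the shared region is reported; the unrelated flanking residues, which a global alignment would be forced to pair up or gap, are simply left out.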
71
Filtering low complexity sequences
- Filters out short repeats and low complexity regions from the query sequences before searching the database
- Filtering helps to obtain statistically significant results and reduces the background noise resulting from matches with repeats and low complexity regions
- The output shows which regions of the query sequence were masked
72
Sequence Periodicities in Kinetoplast DNA Marini et al. Proc. Natl. Acad. Sci. USA 79, 7664-7668 (1982)
73
Local Alignments We now have a PairWise Comparison Algorithm, We are ready to search Databases
74
Database Search
[Figure: a QUERY (Q) is passed to a comparison engine (SW) that scans the database; the resulting hits are listed with E-values ranging from about 1e-100 to 10]
E-values: how many times do we expect such an alignment by chance?
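The slides do not say which statistic the comparison engine uses to produce its E-values. One widely used model is the Karlin-Altschul formula, E = K × m × n × exp(-λ × S), where S is the raw alignment score, m and n are the query and database lengths, and K and λ depend on the scoring scheme. The sketch below assumes that model; the values given to K and λ, the sequence lengths, and the function name are placeholders chosen only for illustration.

```python
import math


def expected_chance_hits(raw_score, query_len, database_len, k, lam):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * database_len * math.exp(-lam * raw_score)


# Placeholder parameters, NOT fitted to any real scoring matrix.
K_DEMO, LAMBDA_DEMO = 0.1, 0.3

for raw_score in (30, 60, 120):
    e_value = expected_chance_hits(raw_score, query_len=250,
                                   database_len=50_000_000,
                                   k=K_DEMO, lam=LAMBDA_DEMO)
    print(f"score {raw_score}: E = {e_value:.3g}")
```

The E-value falls exponentially with the score and grows with the size of the database, so the same alignment is less convincing when it is found in a larger database.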