1
Before we begin…
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…
|| ||  ||||| ||| || |   ||||||||||||||||||||
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG
TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG
GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT
GATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCA
GAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAG
GTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACA
ACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCC
TGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGT
CATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGC
ATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTT
TCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACA
ATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTT
TCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTA
CTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAG
GGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGG
TTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAAC
AAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGT
CTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAA
GGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCC
CTGGCTCACAAGTACCATTGA
2
Pairwise Sequence Alignment Lesson 2
3
What is sequence alignment?
Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
4
Why sequence alignment?
Predict characteristics of a protein: use the structure or function information available in databases for known proteins with similar sequences to predict the structure or function of an unknown protein.
Assumption: similar sequences produce similar proteins.
5
Local vs. Global
Global alignment: finds the best alignment across the whole of the two sequences.
Local alignment: finds regions of high similarity in parts of the sequences.
Global alignment (forces alignment in regions which differ):
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment (concentrates on regions of high similarity):
ADLG   CDRYFQ
||||   |||| |
ADLG   CDRYYQ
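The difference between the two modes can be sketched with a small dynamic-programming example. The slides do not name an algorithm; the sketch below uses the textbook Needleman-Wunsch (global) and Smith-Waterman (local) recurrences, and the scores (match +1, mismatch -2, gap -1) are assumptions chosen only for illustration.

```python
def global_score(a, b, match=1, mismatch=-2, gap=-1):
    """Best end-to-end alignment score (Needleman-Wunsch style recurrence)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against nothing costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-2, gap=-1):
    """Best score over all pairs of substrings (Smith-Waterman style recurrence)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets the alignment restart instead of carrying a negative score.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The two sequences from the slide, with the gap character removed.
seq1 = "ADLGAVFALCDRYFQ"
seq2 = "ADLGRTQNCDRYYQ"
print("global:", global_score(seq1, seq2))
print("local: ", local_score(seq1, seq2))
```

With these scores the global result is pulled down by the dissimilar middle region, while the local result reflects only the best-matching stretch.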
6
Sequence evolution
In the course of evolution, sequences change from the ancestral sequence by random mutations.
Three types of changes:
1. Insertion: one or more letters are inserted into the sequence. AAGA → AAGTA
7
Sequence evolution
In the course of evolution, sequences change from the ancestral sequence by random mutations.
Three types of changes:
1. Insertion: one or more letters are inserted into the sequence. AAGA → AAGTA
2. Deletion: one or more letters are deleted from the sequence. AAGA → AGA
8
Evolutionary changes in sequences
In the course of evolution, sequences change from the ancestral sequence by random mutations.
Three types of mutations:
1. Insertion: one or more letters are inserted into the sequence. AAGA → AAGTA
2. Deletion: one or more letters are deleted from the sequence. AAGA → AGA
3. Substitution: one or more letters are replaced by others. AAGA → AACA
Insertion + Deletion = Indel
9
Sequence alignment
Two sequences:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
This alignment includes: 2 mismatches, 4 indels (gaps), 10 perfect matches.
10
Choosing an alignment:
Many different alignments are possible for the same pair of sequences (AAGCTGAATTCGAA and AGGCTCATTTCTGA):
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
or
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Which alignment is better?
11
Scoring an alignment: example
Naïve scoring system: Match: +1, Mismatch: -2, Indel: -1
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)x10 + (-2)x2 + (-1)x4 = 2
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)x9 + (-2)x2 + (-1)x6 = -1
Higher score → better alignment
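The naïve scoring scheme on this slide can be reproduced in a few lines. This is a minimal sketch: the function name `score_alignment` is ours, and treating a column with a gap in either row as one indel is an assumption consistent with the counts on the slide.

```python
def score_alignment(top, bottom, match=1, mismatch=-2, indel=-1):
    """Score two already-aligned rows column by column.

    Returns (score, counts). A column with '-' in either row counts as an indel.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            counts["indel"] += 1
        elif x == y:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    score = (match * counts["match"]
             + mismatch * counts["mismatch"]
             + indel * counts["indel"])
    return score, counts


# The two candidate alignments from the slide.
print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))
print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))
```

Changing the three score parameters is enough to see that a different scoring system can rank the two alignments differently.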
12
Scoring system:
Different scoring systems can produce different optimal alignments.
Scoring systems implicitly represent a particular theory of similarity/dissimilarity between sequence characters: evolution based or physico-chemical properties based.
Some mismatches are more plausible than others:
Transition vs. transversion
Lys → Arg ≠ Lys → Cys
Gap extension vs. gap opening
13
Substitution matrices
Nucleic acids: transition-transversion
Amino acids:
Evolution (empirical data) based: PAM, BLOSUM
Physico-chemical properties based: Grantham, McLachlan
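For nucleic acids, a transition-transversion scheme can be sketched as follows. Transitions are purine-to-purine (A, G) or pyrimidine-to-pyrimidine (C, T) changes; everything else is a transversion. The numeric scores below are hypothetical values picked for illustration, not values from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_kind(x, y):
    """Classify a pair of DNA letters as match, transition or transversion."""
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA letter: {base!r}")
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score a pair of letters; the default scores are illustrative assumptions."""
    kind = substitution_kind(x, y)
    return {"match": match, "transition": transition, "transversion": transversion}[kind]


# Build the full 4x4 matrix as a dictionary keyed by letter pairs.
matrix = {(x, y): nucleotide_score(x, y) for x in "ACGT" for y in "ACGT"}
print(matrix[("A", "G")], matrix[("A", "C")])
```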
14
PAM matrices
A family of matrices: PAM80, PAM120, PAM250
The number attached to a PAM matrix represents evolutionary distance.
Greater numbers denote greater distances.
15
Which PAM matrix to use?
Low PAM numbers: strong similarities
High PAM numbers: weak similarities
PAM120 for general use (40% identity)
PAM60 for close relations (60% identity)
PAM250 for distant relations (20% identity)
If uncertain, try several different matrices: PAM40, PAM120, PAM250
16
PAM: limitations
Based on only one original dataset
Examines proteins with few differences (85% identity)
Based mainly on small globular proteins, so the matrix is biased
17
BLOSUM matrices
The different BLOSUMn matrices are calculated independently from the BLOCKS database.
BLOSUMn is based on sequences that share at least n percent identity.
BLOSUM62 represents closer sequences than BLOSUM45.
18
Example: BLOSUM62 is derived from blocks of sequences that share at least 62% identity
19
Which BLOSUM matrix to use?
Low BLOSUM numbers for distant sequences
High BLOSUM numbers for similar sequences
BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations
20
PAM vs. BLOSUM
PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45
Moving down the list: more distant sequences
21
Gap penalty
We expect to penalize gaps, with a different score for gap opening and for gap extension.
Insertions and deletions are rare in evolution, but once they occur, they are easy to extend.
Gap-extension penalty < gap-opening penalty
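A gap penalty with separate opening and extension costs (often called an affine gap penalty) can be sketched as below. The convention used here, where a gap of length k costs open + (k - 1) x extend, is one of several in use; the default values 10 and 1 are assumptions for illustration only.

```python
import re


def gap_cost(length, gap_open=10, gap_extend=1):
    """Cost of one gap of the given length: pay to open it, then less to extend it."""
    if length < 1:
        raise ValueError("a gap has length >= 1")
    return gap_open + (length - 1) * gap_extend


def total_gap_penalty(top, bottom, gap_open=10, gap_extend=1):
    """Sum the cost of every run of '-' in either row of an alignment."""
    runs = [m.group() for row in (top, bottom) for m in re.finditer(r"-+", row)]
    return sum(gap_cost(len(run), gap_open, gap_extend) for run in runs)


# One gap of length 3 is cheaper than three separate gaps of length 1.
print(gap_cost(3), 3 * gap_cost(1))
print(total_gap_penalty("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))
```

Setting `gap_open` equal to `gap_extend` recovers a flat per-position indel penalty.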
22
Web servers for pairwise alignment
23
BLAST 2 sequences (bl2seq) at NCBI
Produces a local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine.
BLAST does not use an exact algorithm but a heuristic.
24
Back to NCBI
25
BLAST – bl2seq
26
Bl2seq: query
blastn: nucleotide
blastp: protein
27
Bl2seq results
28
Match Dissimilarity Gaps Similarity Low complexity
29
Bl2seq results:
Bit score: a score for the alignment according to the number of similarities, identities, etc.
Expected score (E-value): the number of alignments with the same score one can "expect" to see by chance when searching a database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is real.
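The link between the two numbers can be sketched with the standard relation E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the database length. This is a simplified sketch: BLAST itself uses corrected "effective" lengths, so values computed this way will not exactly match its reports. The function name and the example numbers are ours.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value from a bit score and the raw search-space size."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 100-residue query against a 1,000,000-residue database.
for bits in (20, 30, 40, 50):
    print(bits, expected_hits(bits, 100, 1_000_000))
```

Each additional bit halves the E-value, and doubling the database size doubles it, which is why the same alignment is less significant in a larger database.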
30
BLAST – programs
Query: DNA or protein
Database: DNA or protein
31
BLAST – Blastp
32
Blastp - results
33
Blastp – results (cont’)
34
Blastp – acquiring sequences
35
blastp – acquiring sequences (cont’)
36
FASTA format – multiple sequences
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
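Reading a multi-sequence FASTA file mostly means splitting on the ">" header lines and joining the wrapped sequence lines. The parser below is a minimal sketch for illustration (the function name `parse_fasta` and the two toy records are ours); established libraries handle more edge cases.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Two short made-up records, wrapped over several lines like the slide's example.
example = """>seq1 toy record
MVHLTPEEK
TAVNALWGK
>seq2 toy record
MGHFTEEDK
ATITSLWGK
"""
for name, seq in parse_fasta(example):
    print(name, len(seq), seq)
```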
37
Searching for remote homologs
Sometimes BLAST isn't enough: in a large protein family, BLAST only finds close members, and we want more distant members.
Options:
PSI-BLAST
Profile HMMs (not discussed in this exercise)
38
PSI-BLAST: Position-Specific Iterated BLAST
1. Regular BLAST
2. Construct a profile from the BLAST results
3. BLAST search with the profile
4. Final results
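The "construct profile" step can be pictured as turning a set of aligned hits into per-position scores. The sketch below is a strong simplification, not PSI-BLAST's actual procedure: it assumes gap-free hits of equal length, a uniform background frequency, and a simple pseudocount, and all function names are ours.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Per-column log-odds scores from equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_against_profile(profile, seq):
    """Sum the column scores for a candidate of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, seq))


# Toy hits from a first-round search (made-up fragments).
hits = ["MVHL", "MVNL", "MVHF"]
profile = build_profile(hits)
print(score_against_profile(profile, "MVHL"))
print(score_against_profile(profile, "WWWW"))
```

A candidate that agrees with the conserved columns scores higher than an unrelated one, which is how a profile can pick up family members that a single query sequence would miss.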
39
PSI-BLAST
Advantage: PSI-BLAST looks for sequences that are close to the query and learns from them to extend the circle of friends.
Disadvantage: if we obtain a WRONG hit, the search will drift to unrelated sequences (contamination). This gets worse with each iteration.
40
BLAST – PSI-Blast
41
PSI-Blast - results