1
1 ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT GATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCA GAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAG GTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACA ACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCC TGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGT CATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGC ATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTT TCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACA ATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTT TCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTA CTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAG GGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGG TTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAAC AAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGT CTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAA GGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCC CTGGCTCACAAGTACCATTGA MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE… || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… Motivation
2
2 Lesson 2 Aligning sequences and searching databases
3
3 Homology and sequence alignment
4
4 Homology Homology = similarity between objects due to common ancestry.
5
5 Sequence homology VLSPAVKWAKVGAHAAGHG ||| || |||| | |||| VLSEAVLWAKVEADVAGHG Similarity between sequences as a result of common ancestry.
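The bar line between the two sequences above marks identical positions. A minimal sketch of how such a line, and the resulting percent identity, could be computed for two already-aligned sequences; the function names `match_line` and `percent_identity` are illustrative and not part of the original slides.

```python
def match_line(a, b):
    """Return the '|' line drawn between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # A bar marks a column with identical residues; gap columns never count.
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))


def percent_identity(a, b):
    """Share of alignment columns holding identical residues, in percent."""
    line = match_line(a, b)
    return 100.0 * line.count("|") / len(line)


top = "VLSPAVKWAKVGAHAAGHG"
bottom = "VLSEAVLWAKVEADVAGHG"
print(top)
print(match_line(top, bottom))
print(bottom)
print(f"{percent_identity(top, bottom):.1f}% identity")
```

For the pair on this slide, 14 of the 19 columns are identical. Identity alone does not prove homology; it is only the raw observation that the alignment summarises.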
6
6 Sequence alignment Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.
7
7 Why align?
VLSPAVKWAKV
||| || ||||
VLSEAVLWAKV
1. To detect whether two sequences are homologous. If so, homology may indicate similarity in function (and structure).
2. Required for evolutionary studies (e.g., tree reconstruction).
3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).
8
8 Insertions, deletions, and substitutions
9
9 Evolutionary changes in sequences
Three types of changes:
1. Substitution: a replacement of one (or more) sequence letters by another.
2. Insertion: an insertion of one or several letters into the sequence.
3. Deletion: the removal of one (or more) letters from the sequence.
Insertion + Deletion = Indel
[Figure: short example sequences illustrating each type of change]
10
10 Sequence alignment If two sequences share a common ancestor – for example human and armadillo hemoglobin, we can represent their evolutionary relationship using a tree VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAVLWAKV
11
11 Perfect match VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAVLWAKV A perfect match suggests that no change has occurred from the common ancestor (although this is not always the case). VLSEAVLWAKV
12
12 A substitution VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAVLWAKV A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it has occurred). VLSEAVLWAKV VLSPAVLWAKV
13
13 Indel VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV * Option 1: The ancestor had L and it was lost here *. In such a case, the event was a deletion. VLSEAVLWAKV *
14
14 Indel VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAV WAKV * Option 2: The ancestor was shorter and the L was inserted here *. In such a case, the event was an insertion. VLSEAVLWAKV L *
15
15 Indel VLSPAV - WAKV Normally, given two sequences we cannot tell whether it was an insertion or a deletion, so we term the event as an indel. VLSEAVLWAKV Deletion?Insertion?
16
16 Indels in protein coding genes Indels in protein-coding genes are often of length 3 bp, 6 bp, 9 bp, etc., because an indel whose length is a multiple of three does not shift the reading frame. Gene search: in fact, searching for indels of length 3K (K = 1, 2, 3, …) can help algorithms that search a genome for open reading frames (ORFs).
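A small sketch of why indel lengths of 3K are special in coding sequences. It uses the first 21 bases of the slide-1 sequence; the helpers `codons` and `delete` are hypothetical names introduced only for this illustration, and this is not an ORF-finding algorithm.

```python
def codons(seq):
    """Split a coding sequence into complete codons, dropping any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def delete(seq, start, length):
    """Return seq with `length` bases removed beginning at index `start`."""
    return seq[:start] + seq[start + length:]


gene = "ATGGTGAACCTGACCTCTGAC"  # first 21 bases of the sequence on slide 1

print(codons(gene))                 # original reading frame
print(codons(delete(gene, 3, 3)))   # 3-bp deletion: one codon lost, frame kept
print(codons(delete(gene, 3, 1)))   # 1-bp deletion: every downstream codon changes
```

In this example the 1-bp deletion happens to turn the second codon into TGA, a stop codon in the standard genetic code, which illustrates how damaging a frameshift can be.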
17
17 Global and Local pairwise alignments
18
18 Global vs. Local
Global alignment: finds the best alignment across the entire two sequences.
Local alignment: finds regions of similarity in parts of the sequences.
Global alignment forces alignment in regions which differ:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment will return only regions of good alignment:
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
19
19 Global alignment PTK2 protein tyrosine kinase 2 of human and rhesus monkey
20
20 Proteins are comprised of domains Domain B Protein tyrosine kinase domain Domain A Human PTK2 :
21
21 Protein tyrosine kinase domain In leukocytes, a different gene for tyrosine kinase is expressed. Domain X Protein tyrosine kinase domain Domain A
22
22 Domain X Protein tyrosine kinase domain Domain B Protein tyrosine kinase domain Domain A Leukocyte TK PTK2 The sequence similarity is restricted to a single domain
23
23 Global alignment of PTK and LTK
24
24 Local alignment of PTK and LTK
25
25 Conclusions Use global alignment when the two sequences share the same overall sequence arrangement. Use local alignment to detect regions of similarity.
26
26 How are alignment scores computed?
27
27 Pairwise alignment AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- One possible alignment:
28
28 AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- This alignment includes: 2 mismatches 4 indels (gap) 10 perfect matches
29
29 Choosing an alignment for a pair of sequences AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- Which alignment is better? Many different alignments are possible for 2 sequences:
30
30 Scoring system (naïve)
Perfect match: +1 Mismatch: -2 Indel (gap): -1
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)x10 + (-2)x2 + (-1)x4 = 2
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)x9 + (-2)x2 + (-1)x6 = -1
A higher score means a better alignment.
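The naive scoring above can be reproduced with a few lines of code. This is a minimal sketch, assuming the two sequences are already aligned (equal length, `-` for gaps); the function name `score_alignment` is illustrative.

```python
def score_alignment(a, b, match=1, mismatch=-2, gap=-1):
    """Count matches, mismatches and gap columns, and return them with the total score."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gaps += 1          # a column with a gap on either side is an indel
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    total = matches * match + mismatches * mismatch + gaps * gap
    return matches, mismatches, gaps, total


print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # (10, 2, 4, 2)
print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # (9, 2, 6, -1)
```

Changing the three parameters changes which of the two alignments wins, which is the point made on the following slides.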
31
31 Scoring systems
32
32 Scoring system In the example above, the choice of +1 for a match, -2 for a mismatch, and -1 for a gap is quite arbitrary. Different scoring systems yield different alignments. We want a good scoring system…
33
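33 Scoring matrix Representing the scoring system as a symmetric n × n table or matrix (n is the number of letters in the alphabet: n = 4 for nucleotides, n = 20 for amino acids).
[Table: nucleotide scoring matrix over A, G, C, T, drawn as a triangle because it is symmetric; legible entries are 2 on the diagonal and -6 off the diagonal]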
34
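34 DNA scoring matrices Uniform substitutions between all nucleotides: every match receives one score and every mismatch another.
[Table: From/To matrix over A, G, C, T; legible entries are match = 2 and mismatch = -6]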
35
35 DNA scoring matrices Can take into account biological phenomena such as: Transition-transversion
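A sketch of a DNA scoring matrix that distinguishes transitions (purine to purine, or pyrimidine to pyrimidine) from transversions. The scores 2, -4 and -6 are illustrative assumptions, not values taken from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=2, transition=-4, transversion=-6):
    """Score a pair of nucleotides, penalising transversions more than transitions."""
    if x == y:
        return match
    pair = {x, y}
    # A transition stays within one chemical class (A<->G or C<->T).
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition
    return transversion


# Build the full 4 x 4 matrix as a dictionary keyed by (from, to).
matrix = {(x, y): substitution_score(x, y) for x in "ACGT" for y in "ACGT"}

for x in "ACGT":
    print(x, [matrix[(x, y)] for y in "ACGT"])
```

The matrix stays symmetric, as required on the scoring-matrix slide, because the score depends only on the unordered pair of letters.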
36
36 Amino-acid scoring matrices Take into account physico- chemical properties
37
37 Amino-acid substitution matrices Actual substitutions: –Based on empirical data –Commonly used by many bioinformatics programs –PAM & BLOSUM
38
38 Protein matrices: actual substitutions
The idea: given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other.
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
In the 4th column, D/E is found in 7/8 of the cases (compared with 5/8 for D/Q and E/Q).
39
39 BLOSUM: Blocks Substitution Matrix
Based on the BLOCKS database: ~2000 blocks from 500 families of related proteins; families of proteins with identical function.
Blocks are short conserved patterns of 3-60 aa without gaps.
AABCDA----BBCDA
DABCDA----BBCBB
BBBCDA-AA-BCCAA
AAACDA-A--CBCDB
CCBADA---DBBDCC
AAACAA----BBCCC
40
40 BLOSUM Each block represents a sequence alignment with different identity percentage For each block the amino-acid substitution rates were calculated to create the BLOSUM matrix
41
41 BLOSUM Matrices BLOSUMn is based on sequences that share at least n percent identity BLOSUM62 represents closer sequences than BLOSUM45
42
42 Example : Blosum62 Derived from blocks where the sequences share at least 62% identity
43
43 Scoring gaps In advanced algorithms, two separate gaps of one amino acid each ( X-Y- ) are given a different score than one gap of two amino acids ( X--Y ). This is done by assigning a different penalty for “opening” a gap and for extending a gap: gap extension penalty < gap opening penalty.
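A minimal sketch of the opening/extension idea, scoring only the gaps in one aligned sequence. The penalty values (-10 to open, -1 to extend) are assumptions for illustration, and the convention used here (the first gap position pays the opening penalty, every further position pays the extension penalty) is one of several in use.

```python
import re


def gap_runs(aligned):
    """Return the lengths of the consecutive runs of '-' in an aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned)]


def affine_gap_score(aligned, gap_open=-10, gap_extend=-1):
    """Sum the gap cost: one opening penalty per run plus an extension per extra position."""
    return sum(gap_open + (n - 1) * gap_extend for n in gap_runs(aligned))


print(affine_gap_score("X-Y-"))  # two gaps of length 1: two opening penalties
print(affine_gap_score("X--Y"))  # one gap of length 2: one opening, one extension
```

With these values the two separate gaps cost -20 while the single longer gap costs -11, so the scoring favours fewer, longer gaps.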
44
44 Intermediate summary 1.Scoring system = substitution matrix + gap penalty. 2.Used for both global and local alignment 3.For amino acids, there are two types of substitution matrices: PAM and BLOSUM
45
45 Computational aspects
46
46 Many possible alignments AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCT-GAATT-C-GAA A-GGCT-CATTTCTGA- AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- AAG-CTGAATT-C-GAA AGGCT-CATTT-CTGA- It is not trivial (for most people) to figure out how to go over all possible pairwise alignments and find the one with the highest score.
47
47 Optimal alignment algorithms Needleman-Wunsch (global) [1970] Smith-Waterman (local) [1981] Both algorithms run in O(mn) time (m: length of sequence 1, n: length of sequence 2). Informally: doubling the length of one sequence doubles the computation time; doubling both quadruples it. For proteins of lengths < 1000 it takes much less than a second to compute the alignments.
48
48 Dynamic programming Solving a problem with many overlapping sub-problems. Example: Fibonacci sequence: 1, 1, 2, 3, 5, 8, 13, … F(1) = F(2) = 1; F(n) = F(n-1) + F(n-2)
49
49 Dynamic programming
F(1) = F(2) = 1; F(n) = F(n-1) + F(n-2)
Naïvely solving F(7):
F(7) = F(6) + F(5)
= F(5) + F(4) + F(4) + F(3)
= F(4) + F(3) + F(3) + F(2) + F(3) + F(2) + F(2) + F(1)
= F(3) + F(2) + F(2) + F(1) + F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(2) + F(1)
= F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(2) + F(1)
= 13
50
50 Dynamic programming F(7) using Dynamic programming: F(3) = F(2) + F(1) = 1 + 1 = 2 F(4) = F(3) + F(2) = 2 + 1 = 3 F(5) = F(4) + F(3) = 3 + 2 = 5 F(6) = F(5) + F(4) = 5 + 3 = 8 F(7) = F(6) + F(5) = 8 + 5 = 13
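The contrast between the naive expansion and the table-filling approach can be sketched as follows. The call counter is an addition for illustration only, to make the repeated work visible.

```python
def fib_naive(n, counter):
    """Recursive Fibonacci; counter[0] records how many calls were made."""
    counter[0] += 1
    if n <= 2:
        return 1
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_dp(n):
    """Bottom-up Fibonacci: each F(k) is computed once and stored in a table."""
    table = [0, 1, 1]  # table[k] holds F(k); index 0 is a placeholder
    for k in range(3, n + 1):
        table.append(table[k - 1] + table[k - 2])
    return table[n]


calls = [0]
print(fib_naive(7, calls), "computed with", calls[0], "recursive calls")
print(fib_dp(7), "computed with 5 additions")
```

For F(7) the naive version makes 25 calls, while the table needs only the five additions listed on the slide; the gap widens quickly as n grows.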
51
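51 Needleman-Wunsch (1970)
F(i,j) is the score of the best alignment of the first i characters of seq1 with the first j characters of seq2.
Base case:
$$F(i,0) = i \times \text{gap penalty}, \qquad F(0,j) = j \times \text{gap penalty}$$
Recursion rule:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) + \text{gap penalty} \\ F(i,j-1) + \text{gap penalty} \end{cases}$$
Here $s(x_i, y_j)$ is the substitution score for aligning character $x_i$ with character $y_j$, and the gap penalty is taken to be a negative score (such as the -1 of the naïve scoring system).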
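The recursion rule and base case can be sketched in code as follows. This is a minimal illustration, assuming the naive scores from earlier in the lesson (+1 match, -2 mismatch, -1 gap) and a simple linear gap penalty; the function name and the traceback tie-breaking order are choices made here, not part of the slides.

```python
def needleman_wunsch(x, y, match=1, mismatch=-2, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # base case: prefix of x against nothing
        F[i][0] = i * gap
    for j in range(1, n + 1):          # base case: prefix of y against nothing
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i - 1][j] + gap,     # x_i against a gap
                          F[i][j - 1] + gap)     # y_j against a gap
    # Traceback from the bottom-right cell, preferring the diagonal on ties.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("AAGCTGAATTCGAA", "AGGCTCATTTCTGA")
print(score)
print(top)
print(bottom)
```

The two nested loops fill an (m+1) × (n+1) table once, which is where the O(mn) running time comes from. Several alignments may share the optimal score; the traceback returns one of them.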
52
52 Needleman-Wunsch (1970) Recursion rule and base case as on the previous slide. Cool alignment applet: http://baba.sourceforge.net/
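For comparison, the local (Smith-Waterman) algorithm named earlier differs in two ways: every cell is floored at zero, and the traceback starts at the highest-scoring cell and stops at the first zero. A minimal sketch under the same assumed naive scores (+1, -2, -1), applied to the sequences of the Global vs. Local slide; the function name is illustrative.

```python
def smith_waterman(x, y, match=1, mismatch=-2, gap=-1):
    """Local alignment by dynamic programming; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay at zero
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,                      # start a fresh local alignment
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:                    # remember the first best cell
                best, best_pos = H[i][j], (i, j)
    # Traceback from the best cell until a zero-valued cell is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"))
```

Only a well-matching region is reported; the dissimilar middle part that a global alignment would be forced to align is simply left out.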
53
53 Searching databases
54
54 Searching a sequence database Idea: to find sequences homologous to a sequence of interest, compute its pairwise alignment against all known sequences in a database and detect the best-scoring significant homologs. The same idea in short: use your sequence as a query to find homologous sequences in a sequence database.
55
55 Some terminology Query sequence - the sequence with which we are searching Hit – a sequence found in the database, suspected as homologous
56
56 Protein or DNA search
57
57 Query sequence: DNA or protein? For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. Which is preferable?
58
58 Protein is better! Selection (and hence conservation) works (mostly) at the protein level: CTTTCA = Leu-Ser TTGAGT = Leu-Ser
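The example above can be checked directly. This sketch uses a deliberately partial codon table containing only the four codons on the slide (standard genetic code); the helper names are illustrative.

```python
# Partial codon table: only the codons that appear in the slide's example.
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table above."""
    return "-".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


first, second = "CTTTCA", "TTGAGT"
print(translate(first), translate(second))   # both give Leu-Ser
print(identity(first, second))               # only 1 of 6 nucleotides is identical
```

The two DNA strings share a single nucleotide out of six yet encode the same dipeptide, so a protein-level comparison sees a perfect match where a DNA-level comparison sees almost none.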
59
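59 Query type Nucleotides: 4-letter alphabet. Amino acids: 20-letter alphabet. Two random DNA sequences will, on average, have 25% identity (assuming all four nucleotides are equally frequent). Two random protein sequences will, on average, have 5% identity (assuming all 20 amino acids are equally frequent).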
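The 25% and 5% figures follow from the chance that two independently drawn letters coincide, which is the sum of the squared letter frequencies. A small sketch, assuming uniform frequencies as the slide implicitly does; the function name is illustrative.

```python
from fractions import Fraction


def expected_identity(frequencies):
    """Probability that two independent random letters are the same."""
    return sum(p * p for p in frequencies)


dna = [Fraction(1, 4)] * 4          # four equally likely nucleotides
protein = [Fraction(1, 20)] * 20    # twenty equally likely amino acids

print(expected_identity(dna))       # 1/4
print(expected_identity(protein))   # 1/20
```

With real, unequal letter frequencies the chance identity is somewhat higher than these values, but the background noise remains much lower for proteins than for DNA.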
60
60 Conclusion The amino-acid sequence is often preferable for homology search
61
61 Computation time
62
62 How do we search a database? If each pairwise alignment takes 1/10 of a second, and if the database contains 10^7 sequences, it will take 10^6 seconds ≈ 11.6 days to complete one search. 150,000 searches (at least!!) are performed per day. >82,000,000 sequence records in GenBank.
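The estimate above is simple arithmetic, sketched here with the slide's own numbers (10 alignments per second, 10^7 database sequences); the variable names are illustrative.

```python
alignments_per_second = 10      # each alignment takes 1/10 of a second
database_size = 10 ** 7         # sequences in the database
seconds_per_day = 24 * 60 * 60

total_seconds = database_size / alignments_per_second
days = total_seconds / seconds_per_day

print(f"{total_seconds:.0f} seconds = {days:.1f} days per search")
```

One exhaustive search would therefore take well over a week, which is incompatible with many thousands of searches per day.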
63
63 Conclusion Using the exact comparison pairwise alignment algorithm between the query and all DB entries – too slow
64
64 Heuristic Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (but gives one that is not too bad) while reducing the running time compared with the exact solution.
65
65 BLAST
66
66 BLAST BLAST: Basic Local Alignment Search Tool. A heuristic for searching a database for similar sequences. The heuristic is based on restrictions of the similarity search (such as using ungapped word matching instead of single-character matching).
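A toy sketch of the word-matching idea: index every word of length k in a database sequence, then look up the words of the query to find candidate regions worth aligning properly. This only illustrates exact word hits; it is not BLAST's actual implementation, which also scores similar words and extends hits. The function names and k = 3 are assumptions.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every length-k word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where the same k-letter word occurs."""
    index = index_words(subject, k)
    hits = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            hits.append((q, s))
    return hits


print(word_hits("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"))
```

Each lookup is a dictionary access rather than a full alignment, so most of the database can be discarded quickly and the expensive alignment is attempted only around the hits.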
67
67 DNA or Protein
Query: DNA or protein. Database: DNA or protein. All types of searches are possible:
blastn: nucleotide query vs. nucleotide database
blastp: protein query vs. protein database
blastx: translated nucleotide query vs. protein database
tblastn: protein query vs. translated nucleotide database
tblastx: translated nucleotide query vs. translated nucleotide database
68
68 E-value The number of alignments with a score ≥ Y that we expect to find by chance when searching a random sequence against a random database. Theoretically, we could trust any result with an E-value ≤ 1. In practice, BLAST uses estimations: E-values of 10^-4 and lower indicate a significant homology. E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous). E-values between 10^-2 and 1 do not indicate a good homology.
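The rule of thumb above can be written as a small helper for triaging hits. The function name and the handling of the exact boundary values are assumptions; the thresholds themselves are the ones given on the slide.

```python
def interpret_evalue(evalue):
    """Classify a BLAST hit by E-value using the slide's rule-of-thumb thresholds."""
    if evalue <= 1e-4:
        return "significant homology"
    if evalue <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    return "no good evidence of homology"


for e in (1e-30, 5e-3, 0.5):
    print(e, "->", interpret_evalue(e))
```

These cut-offs are guidelines rather than hard limits, so borderline hits are best inspected by looking at the alignment itself.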
69
69 Filtering low complexity Low complexity regions : e.g., Proline rich areas (in proteins), Alu repeats (in DNA) Regions of low complexity generate high scores of alignment, BUT – this does not indicate homology
70
70 Solution In BLAST there is an option to mask low-complexity regions in the query sequence (such regions are represented as XXXXX in the query).
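A toy sketch of what masking does to a query. It replaces only runs of a single repeated residue with X, which is a much cruder criterion than the filters BLAST actually applies; the function name and the minimum run length of 5 are assumptions.

```python
import re


def mask_homopolymer_runs(seq, min_run=5):
    """Replace every run of at least min_run identical letters with X characters."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "X" * len(m.group()), seq)


query = "MK" + "P" * 8 + "LV"          # a made-up proline-rich stretch
print(query)
print(mask_homopolymer_runs(query))
```

The masked positions no longer contribute matches, so a hit cannot score highly merely because both sequences contain a proline-rich stretch.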