1 ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT GATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCA GAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAG GTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACA ACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCC TGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGT CATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGC ATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTT TCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACA ATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTT TCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTA CTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAG GGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGG TTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAAC AAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGT CTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAA GGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCC CTGGCTCACAAGTACCATTGA MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE… || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… Motivation
2 Lesson 2 Aligning sequences and searching databases
3 Homology and sequence alignment
4 Homology Homology = similarity between objects due to common ancestry
5 Sequence homology VLSPAVKWAKVGAHAAGHG ||| || |||| | |||| VLSEAVLWAKVEADVAGHG Similarity between sequences as a result of common ancestry.
6 Sequence alignment Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.
7 Why align? VLSPAVKWAKV ||| || |||| VLSEAVLWAKV 1. To detect if two sequences are homologous. If so, homology may indicate similarity in function (and structure). 2. Required for evolutionary studies (e.g., tree reconstruction). 3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).
8 Insertions, deletions, and substitutions
9 Evolutionary changes in sequences Three types of changes: 1. Substitution – a replacement of one (or more) sequence letter by another. 2. Insertion – an insertion of a letter or several letters into the sequence. 3. Deletion – deleting a letter (or more) from the sequence. Insertion + Deletion = Indel
10 Sequence alignment If two sequences share a common ancestor – for example human and armadillo hemoglobin, we can represent their evolutionary relationship using a tree VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAVLWAKV
11 Perfect match VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAVLWAKV A perfect match suggests that no change has occurred from the common ancestor (although this is not always the case). VLSEAVLWAKV
12 A substitution VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAVLWAKV A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it has occurred). VLSEAVLWAKV VLSPAVLWAKV
13 Indel VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV * Option 1: The ancestor had L and it was lost here *. In such a case, the event was a deletion. VLSEAVLWAKV *
14 Indel VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAV WAKV * Option 2: The ancestor was shorter and the L was inserted here *. In such a case, the event was an insertion. VLSEAVLWAKV L *
15 Indel VLSPAV - WAKV Normally, given two sequences we cannot tell whether it was an insertion or a deletion, so we term the event as an indel. VLSEAVLWAKV Deletion?Insertion?
16 Indels in protein coding genes Indels in protein-coding genes are often 3 bp, 6 bp, 9 bp, etc. long, because an indel whose length is a multiple of 3 keeps the reading frame intact. Gene Search In fact, searching for indels of length 3K (K=1,2,3,…) can help algorithms that search a genome for open reading frames (ORFs).
17 Global and Local pairwise alignments
18 Global vs. Local Global alignment – finds the best alignment across the entire two sequences; it forces alignment even in regions which differ: ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN-CDRYYQ Local alignment – finds regions of similarity in parts of the sequences; it returns only the regions of good alignment: ADLG CDRYFQ |||| |||| | ADLG CDRYYQ
19 Global alignment PTK2 protein tyrosine kinase 2 of human and rhesus monkey
20 Proteins are comprised of domains Domain B Protein tyrosine kinase domain Domain A Human PTK2 :
21 Protein tyrosine kinase domain In leukocytes, a different gene for tyrosine kinase is expressed. Domain X Protein tyrosine kinase domain Domain A
22 Domain X Protein tyrosine kinase domain Domain B Protein tyrosine kinase domain Domain A Leukocyte TK PTK2 The sequence similarity is restricted to a single domain
23 Global alignment of PTK and LTK
24 Local alignment of PTK and LTK
25 Conclusions Use global alignment when the two sequences share the same overall sequence arrangement. Use local alignment to detect regions of similarity.
26 How are alignment scores computed?
27 Pairwise alignment Two sequences: AAGCTGAATTCGAA AGGCTCATTTCTGA One possible alignment: AAGCTGAATT-C-GAA AGGCT-CATTTCTGA-
28 AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- This alignment includes: 2 mismatches 4 indels (gap) 10 perfect matches
29 Choosing an alignment for a pair of sequences Many different alignments are possible for 2 sequences: AAGCTGAATTCGAA AGGCTCATTTCTGA Alignment 1: AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- Alignment 2: A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- Which alignment is better?
30 Scoring system (naïve) Perfect match: +1 Mismatch: -2 Indel (gap): -1 AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- Score = (+1)x10 + (-2)x2 + (-1)x4 = 2 A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- Score = (+1)x9 + (-2)x2 + (-1)x6 = -1 Higher score → better alignment
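The two scores above can be checked with a minimal sketch that walks the alignment column by column. The function name `score_alignment` and its parameter names are illustrative, not part of the lecture; the default values are the naive +1 / -2 / -1 scheme from the slide.

```python
def score_alignment(top, bottom, match=1, mismatch=-2, gap=-1):
    """Score a gapped pairwise alignment column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # indel column
        elif a == b:
            total += match      # perfect match column
        else:
            total += mismatch   # substitution column
    return total


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)  # 2 -1
```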
31 Scoring systems
32 Scoring system In the example above, the choice of +1 for match, -2 for mismatch, and -1 for gap is quite arbitrary. Different scoring systems → different alignments. We want a good scoring system…
33 Scoring matrix Representing the scoring system as a table or matrix of size n × n (n is the number of letters the alphabet contains: n=4 for nucleotides, n=20 for amino acids). The matrix is symmetric. [Table: nucleotide scoring matrix over A, G, C, T with 2 on the diagonal; the one off-diagonal entry that survives is -6]
34 DNA scoring matrices Uniform substitutions between all nucleotides: one score for every match and one score for every mismatch. [Table: From/To matrix over A, G, C, T; match (diagonal) entries are 2; the surviving mismatch entry is -6]
35 DNA scoring matrices Can take into account biological phenomena such as: Transition-transversion
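A minimal sketch of a scoring matrix that separates transitions (A↔G, C↔T) from transversions (all other substitutions). The match score of 2 and the -6 used here for transversions echo the uniform matrix above; the transition score of -4 is a purely hypothetical value chosen for illustration, as are the names `substitution_score` and `matrix`.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(a, b, match=2, transition=-4, transversion=-6):
    """Score one nucleotide pair; transitions are penalised less than transversions."""
    if a == b:
        return match
    pair = {a, b}
    # A transition stays within one chemical class (purine or pyrimidine).
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition
    return transversion


# Full 4 x 4 matrix as a dictionary keyed by (from, to).
matrix = {(a, b): substitution_score(a, b) for a in "AGCT" for b in "AGCT"}

for a in "AGCT":
    print(a, [matrix[(a, b)] for b in "AGCT"])
```

The resulting matrix is symmetric, as required of a scoring matrix.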
36 Amino-acid scoring matrices Take into account physicochemical properties
37 Amino-acid substitution matrices Actual substitutions: –Based on empirical data –Commonly used by many bioinformatics programs –PAM & BLOSUM
38 Protein matrices – actual substitutions The idea: Given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other. M G Y D E M G Y E E M G Y D E M G Y Q E M G Y D E M G Y E E In the 4th column, D/E is found in 7/8 of the cases (compared with 5/8 for D/Q and E/Q).
39 BLOSUM: BLOcks SUbstitution Matrix Based on the BLOCKS database – ~2000 blocks from 500 families of related proteins – Families of proteins with identical function Blocks are short conserved patterns of 3-60 aa without gaps AABCDA----BBCDA DABCDA----BBCBB BBBCDA-AA-BCCAA AAACDA-A--CBCDB CCBADA---DBBDCC AAACAA----BBCCC
40 BLOSUM Each block represents a sequence alignment with different identity percentage For each block the amino-acid substitution rates were calculated to create the BLOSUM matrix
41 BLOSUM Matrices BLOSUMn is based on sequences that share at least n percent identity BLOSUM62 represents closer sequences than BLOSUM45
42 Example: BLOSUM62 Derived from blocks where the sequences share at least 62% identity
43 Scoring gaps In advanced algorithms, two gaps of one amino acid each ( X-Y- ) are given a different score than one gap of two amino acids ( X--Y ). This is done by assigning different penalties for "opening" a gap and for extending a gap: Gap extension penalty < Gap opening penalty
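A minimal sketch of such an opening/extension (affine) gap score. The values -10 and -1 are hypothetical, and the convention that the first gapped position pays the opening penalty while every further position pays the extension penalty is one common choice, assumed here rather than taken from the slides.

```python
def affine_gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Score of one contiguous gap of `length` positions."""
    if length <= 0:
        return 0
    # First gapped position opens the gap, the rest extend it.
    return gap_open + (length - 1) * gap_extend


two_single_gaps = 2 * affine_gap_penalty(1)   # X-Y- : two gaps of length 1
one_double_gap = affine_gap_penalty(2)        # X--Y : one gap of length 2
print(two_single_gaps, one_double_gap)        # -20 -11
```

With these values one long gap is penalised less than two separate short gaps.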
44 Intermediate summary 1.Scoring system = substitution matrix + gap penalty. 2.Used for both global and local alignment 3.For amino acids, there are two types of substitution matrices: PAM and BLOSUM
45 Computational aspects
46 Many possible alignments AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCT-GAATT-C-GAA A-GGCT-CATTTCTGA- AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- AAG-CTGAATT-C-GAA AGGCT-CATTT-CTGA- It is not trivial (for most people) to figure out how to go over all possible pairwise alignments and find the one with the highest score.
47 Optimal alignment algorithms Needleman-Wunsch (global) [1970] Smith-Waterman (local) [1981] Both algorithms have time complexity O(mn) (m – length of sequence 1, n – length of sequence 2). Informally: doubling the length of one sequence doubles the computation time; doubling both quadruples it. For proteins of lengths < 1000 it takes much less than a second to compute the alignments.
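To make the O(mn) cost concrete, here is a minimal sketch of Smith-Waterman-style local scoring: it fills one table cell per pair of positions, so the work grows with m × n. It returns only the best local score, not the alignment itself. The function name is illustrative, a single linear gap score is assumed, and the scoring values are the naive +1 / -2 / -1 scheme used earlier.

```python
def smith_waterman_score(x, y, match=1, mismatch=-2, gap=-1):
    """Best local alignment score between x and y (score only)."""
    m, n = len(x), len(y)
    # H[i][j] = best score of a local alignment ending at x[i-1], y[j-1].
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(
                0,                     # start a new local alignment here
                H[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                H[i - 1][j] + gap,     # gap in y
                H[i][j - 1] + gap,     # gap in x
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"))
```

Resetting negative cells to 0 is what lets the alignment start and stop anywhere, which is the difference from the global case.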
48 Dynamic programming Solving a problem with many overlapping sub-problems Example: Fibonacci sequence: 1, 1, 2, 3, 5, 8, 13,… F(1) = F(2) = 1; F(n) = F(n-1) + F(n-2)
49 Dynamic programming Naïvely solving F(7): F(7) = F(6) + F(5) = F(5) + F(4) + F(4) + F(3) = F(4) + F(3) + F(3) + F(2) + F(3) + F(2) + F(2) + F(1) = F(3) + F(2) + F(2) + F(1) + F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(2) + F(1) = F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(2) + F(1) = 13 F(1) = F(2) = 1; F(n) = F(n-1) + F(n-2)
50 Dynamic programming F(7) using Dynamic programming: F(3) = F(2) + F(1) = 1 + 1 = 2 F(4) = F(3) + F(2) = 2 + 1 = 3 F(5) = F(4) + F(3) = 3 + 2 = 5 F(6) = F(5) + F(4) = 5 + 3 = 8 F(7) = F(6) + F(5) = 8 + 5 = 13
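The contrast between the naive expansion and the table-filling approach can be sketched as follows. The function names and the one-element list used as a call counter are illustrative choices.

```python
def fib_naive(n, counter):
    """Naive recursion; counter[0] records how many calls were made."""
    counter[0] += 1
    if n <= 2:
        return 1
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_dp(n):
    """Bottom-up dynamic programming: each F(k) is computed exactly once."""
    table = [0, 1, 1]  # table[k] holds F(k); index 0 is unused padding
    for k in range(3, n + 1):
        table.append(table[k - 1] + table[k - 2])
    return table[n]


calls = [0]
print(fib_naive(7, calls), "computed with", calls[0], "recursive calls")
print(fib_dp(7), "computed with", 7 - 2, "additions")
```

The naive version recomputes the same sub-problems many times, while the table version reuses each stored value.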
51 Needleman Wunsch (1970) F(i,j) is the score of the best alignment of the first i characters of seq1 with the first j characters of seq2.
Base case:
\[ F(i,0) = i \times \text{gap penalty}, \qquad F(0,j) = j \times \text{gap penalty} \]
Recursion rule:
\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) + \text{gap penalty} \\ F(i,j-1) + \text{gap penalty} \end{cases} \]
52 Needleman Wunsch (1970) Same base case and recursion rule as on the previous slide. Cool alignment applet:
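A minimal sketch of the Needleman-Wunsch recursion with a traceback, assuming the naive scores from earlier (+1 match, -2 mismatch, -1 gap) and treating the gap penalty as a negative score that is added. The function name and the tie-breaking order in the traceback (diagonal first, then a gap in the second sequence) are choices made for this illustration; other optimal alignments with the same score may exist.

```python
def needleman_wunsch(x, y, match=1, mismatch=-2, gap=-1):
    """Global alignment: returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # base case: F(i,0) = i * gap
        F[i][0] = i * gap
    for j in range(1, n + 1):          # base case: F(0,j) = j * gap
        F[0][j] = j * gap
    for i in range(1, m + 1):          # recursion rule
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Traceback from F(m,n) to F(0,0) to recover one optimal alignment.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                top.append(x[i - 1]); bottom.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, a, b = needleman_wunsch("VLSPAVWAKV", "VLSEAVLWAKV")
print(score)
print(a)
print(b)
```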
53 Searching databases
54 Searching a sequence database Idea: In order to find homologous sequences to a sequence of interest, one should compute its pairwise alignment against all known sequences in a database, and detect the best scoring significant homologs The same idea in short: Use your sequence as a query to find homologous sequences in a sequence database
55 Some terminology Query sequence - the sequence with which we are searching Hit – a sequence found in the database, suspected as homologous
56 Protein or DNA search
57 Query sequence: DNA or protein? For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. Which is preferable?
58 Protein is better! Selection (and hence conservation) works (mostly) at the protein level: CTTTCA = Leu-Ser TTGAGT = Leu-Ser
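A small sketch of the example above: the two DNA sequences differ at most positions, yet translate to the same dipeptide. Only the four codons needed for this example are included in the table (a deliberately partial codon table), and the function names are illustrative.

```python
# Partial codon table: just the codons that appear in the example.
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return "-".join(CODON_TABLE[c] for c in codons)


def identity(a, b):
    """Fraction of positions with identical letters (equal-length sequences)."""
    return sum(p == q for p, q in zip(a, b)) / len(a)


seq1, seq2 = "CTTTCA", "TTGAGT"
print(translate(seq1), translate(seq2))   # Leu-Ser Leu-Ser
print(identity(seq1, seq2))               # 1 of 6 DNA positions is identical
```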
59 Query type Nucleotides: 4 letter alphabet Amino acids: 20 letter alphabet Two random DNA sequences will, on average, have 25% identity Two random protein sequences will, on average, have 5% identity
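The two percentages follow from a simple calculation, under the simplifying assumption (not stated on the slide) that all n letters of the alphabet are equally frequent and positions are independent; p_a denotes the frequency of letter a:

```latex
\[
P(\text{identical letters at one position})
  = \sum_{a} p_a^{2}
  = n \cdot \left(\frac{1}{n}\right)^{2}
  = \frac{1}{n}
\]
\[
n = 4 \;\Rightarrow\; \tfrac{1}{4} = 25\% \ \text{(DNA)},
\qquad
n = 20 \;\Rightarrow\; \tfrac{1}{20} = 5\% \ \text{(protein)}
\]
```

Real sequences have unequal letter frequencies, so the actual background identity is somewhat higher than 1/n.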
60 Conclusion The amino-acid sequence is often preferable for homology search
61 Computation time
62 How do we search a database? If each pairwise alignment takes 1/10 of a second, and if the database contains 10^7 sequences, it will take 10^6 seconds ≈ 11.6 days (10^6 s / 86,400 s per day) to complete one search. 150,000 searches (at least!!) are performed per day. >82,000,000 sequence records in GenBank.
63 Conclusion Running the exact pairwise alignment algorithm between the query and every database entry is too slow
64 Heuristic Definition: a heuristic is a method for solving a problem that does not guarantee an exact (optimal) solution, but usually gives a good one in much less time than the exact algorithm
65 BLAST
66 BLAST BLAST - Basic Local Alignment Search Tool A heuristic for searching a database for similar sequences The heuristic is based on restricting the similarity search (such as using ungapped word matching instead of single character matching).
67 DNA or Protein Query: DNA or protein. Database: DNA or protein. All types of searches are possible: blastn – nucleotide query vs. nucleotide database; blastp – protein query vs. protein database; blastx – translated nucleotide query vs. protein database; tblastn – protein query vs. translated nucleotide database; tblastx – translated nucleotide query vs. translated nucleotide database
68 E-value The number of times we expect to find an alignment with a score ≥ Y when searching a random sequence against a random database. Theoretically, we could trust any result with an E-value ≤ 1. In practice, BLAST uses estimations: E-values of [value missing] and lower indicate significant homology; E-values between [value missing] and [value missing] should be checked (similar domains, maybe non-homologous); E-values between [value missing] and 1 do not indicate good homology
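The word-matching idea can be illustrated with a simplified sketch: index every word of length k in a database sequence, then look up each query word to find exact shared words ("seeds"). This is not the BLAST implementation; real BLAST does considerably more (for example, it extends seeds into longer alignments and scores them). The function names and the word length k=3 are assumptions made for this example.

```python
from collections import defaultdict


def index_words(sequence, k):
    """Map every word of length k to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos, word) for every exact shared word."""
    subject_index = index_words(subject, k)
    seeds = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for s_pos in subject_index.get(word, []):
            seeds.append((q_pos, s_pos, word))
    return seeds


for seed in find_seeds("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"):
    print(seed)
```

Looking words up in an index avoids filling a full alignment table for database sequences that share no word with the query.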
69 Filtering low complexity Low complexity regions : e.g., Proline rich areas (in proteins), Alu repeats (in DNA) Regions of low complexity generate high scores of alignment, BUT – this does not indicate homology
70 Solution In BLAST there is an option to mask low-complexity regions in the query sequence (such regions are represented as XXXXX in the query)
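As a toy illustration of masking, the sketch below replaces long runs of a single repeated residue with X. This is a deliberately crude stand-in: it is not the filter BLAST uses, and real low-complexity filters also catch biased regions that are not simple runs. The function name and the run-length threshold of 5 are hypothetical.

```python
import re


def mask_homopolymer_runs(sequence, min_run=5, mask_char="X"):
    """Replace every run of >= min_run identical letters with mask_char."""
    # (.) captures one letter; \1{min_run-1,} requires it to repeat.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), sequence)


print(mask_homopolymer_runs("MKTPPPPPPAVLG"))  # MKTXXXXXXAVLG
print(mask_homopolymer_runs("MKTPPAVLG"))      # MKTPPAVLG (run too short)
```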