Dot Plots, Path Matrices, Score Matrices Sequence A V I L S T R I V H V N S I L P S T N V I L S T R I Sequence B V I L P E F S T diagonal lines give equivalent residues
Sequence A V I L S T R I V H V N S I L P S T N V I L S T R Sequence B F S T identical residues score 1 highest scoring path across the matrix gives best alignment
Alignment from Dot Plot VILSLV ILPQRSLVVILSLVI LALTV STVILSLVNVILPQR ILSLVISLAL score = 20 sequence identity = 20/26 = 75%
Residue substitution matrix ALVKRH… 1 0…… 1 Path or Score Matrix 1 1 …HRKVLA 1 H C N I R Q L P M A K 1 Residue substitution matrix
Needleman & Wunsch A H C N I R Q C L C R P M A 1 I 1 C 1 1 1 I 1 N 1 R I 1 C 1 1 1 I 1 N 1 R 1 1 C 1 1 1 K C 1 1 1 R 1 1 H 1 P 1
Needleman & Wunsch Algorithm Accumulate the matrix by starting in the bottom row moving from right to left a row at a time adding score from highest scoring cell in ‘column or row, to right and below’ to current cell. find the highest scoring path in the matrix by: starting in the top left corner moving down across the matrix from cell to cell choosing the highest scoring cell at each move the path can not go back on itself or cross the same row or column twice
Accumulating the Matrix Add to the score in the cell the highest score from a cell in: ‘the column or row to right and below’ i,j i-1,j-1 i-n,j-1 i-1,j-m
Sequence A A H C N I R Q C L C R P M A 8 7 6 6 5 4 4 3 3 2 1 I 7 7 6 6 6 4 4 3 3 2 1 C 6 6 7 6 5 4 4 4 3 3 1 I 6 6 6 5 6 4 4 3 3 2 1 N 5 5 5 6 5 5 4 3 3 3 1 R 4 4 4 4 4 5 4 3 3 2 2 Sequence B C 3 3 4 3 3 3 3 4 3 3 1 K 3 3 3 3 3 3 3 3 3 2 1 C 2 2 3 2 2 2 2 3 2 3 1 R 2 1 1 1 1 2 1 1 1 1 2 H 1 2 1 1 1 1 1 1 1 1 1 P 1
Possible Moves in Finding a Path across the Matrix start in the leftmost or topmost row move to the highest scoring cell in: ‘the column or row to right and below’ i,j i-1,j-1 i-n,j-1 i-1,j-m
Sequence A A H C N I R Q C L C R P M A 8 7 6 6 5 4 4 3 3 2 1 I 7 7 6 6 6 4 4 3 3 2 1 C 6 6 7 6 5 4 4 4 3 3 1 I 6 6 6 5 6 4 4 3 3 2 1 N 5 5 5 6 5 5 4 3 3 3 1 Sequence B R 4 4 4 4 4 5 4 3 3 2 2 C 3 3 4 3 3 3 3 4 3 3 1 K 3 3 3 3 3 3 3 3 3 2 1 C 2 2 3 2 2 2 2 3 2 3 1 R 2 1 1 1 1 2 1 1 1 1 2 H 1 2 1 1 1 1 1 1 1 1 1 P 1
A H C N I - R Q C L C R - P M A I C - I N R - C K C R H P M Sequence A 8 7 6 5 4 3 2 1 Sequence B A H C N I - R Q C L C R - P M A I C - I N R - C K C R H P M
V I L S L V I L P Q R S L V V I L S L V I L A L T V N P Q A Sequence A Sequence B runs (tuples) of 3 residues 3 6 6 3 5 3 6 6 gap penalty = 3 SCORE = 20 - 9 = 11 3
BLAST Basic Local Alignment Tool Altschul et al (1990) A highest scoring segment pair (HSP) is found between two sequences the sequences may be related if HSP score > cutoff matches significant ‘words’ or segments and then extends these matches using local dynamic programming
For each sequence find the ‘words’ with significant scores BLAST Step 1: match significant words query sequence of length L For each sequence find the ‘words’ with significant scores
BLAST Step 2: compare the word list to the database and identify exact matches
BLAST Step 3: for each word match, extend the alignment using a PAM matrix and dynamic programming
BLAST searches for 2 non-overlapping segments on same diagonal must be within a certain distance of each other before extension is invoked can also allow gaps so that the method joins segments on different diagonals
Assessing the Significance of Sequence Match length - can get artificially high scores between small sequences composition - if sequences are rich in particular amino acid residues can get high scores for unrelated proteins to assess the significance of a match it is necessary to compare the score with that returned by random or unrelated sequences if the database is small or when considering a pair-wise comparison, the sequences can be shuffled to generate random sequences
Assessing the Significance of Scores Returned from a Database Scan S - m frequency s.d mean probe score S score Z score = score (S) - mean for unrelated (m) standard deviation (s.d) Z value > 3 s.d related sequences
BLAST results S - score for the pairwise alignment. BLAST best hit >gi|17472322|ref|XP_061555.1| (XM_061555) similar to orphan G protein-coupled receptor GPR26 [Homo sapiens] Length = 337 Score = 298 bits (762), Expect = 8e-80 Identities = 168/327 (51%) Query: 1 MGPGEALLAGLLVMVLAVALLSNALVLLCCAYSAELRTRASGVLLVNLSLGHLLLAALDM 60 M A LAGLLV + V+LLSNALVLLC +SA++R +A + +NL+ G+LL ++M Sbjct: 1 MNSWNAGLAGLLVGTIGVSLLSNALVLLCLLHSADIRRQAPALFTLNLTCGNLLCTVVNM 60 Query: 61 PFTLLGVMRGRTPSAPGACQVIGFLDTFLASNAALSVAALSADQWLAVGFPLRYAGRLRP 120 P TL GV+ R P+ C++ FLDTFLA+N+ LS+AALS D+W+AV FPL Y ++R Sbjct: 61 PLTLAGVVAQRQPAGDRLCRLAAFLDTFLAANSMLSMAALSIDRWVAVVFPLSYRAKMRL 120 Query: 121 RYAGLLLGCAWGQSLAFSGAALGCSWLGYSSAFASCSLRLPPEPERPRFAAFTATLHAVG 180 R A L++ W +L F AAL SWLG+ +ASC+L ER RFA FT HA+ Sbjct: 121 RDAALMVAYTWLHALTFPAAALALSWLGFHQLYASCTLCSRRPDERLRFAVFTGAFHALS 180 S - score for the pairwise alignment. E value - number of hits you would expect by chance with score S or higher given the size of the database and the length of the alignment Good Match < 1 X 10-50 Possible Match 1 X 10-50 to 1 X 10-2
Multiple Alignment direct extensions of the standard DP approach for the alignment of 2 sequences are computationally impossible for more than 3 sequences practical heuristic solutions are based on the idea that sequences are evolutionary related and can be aligned using an underlying phylogenetic tree this is known as progressive alignment
(1) Pairwise Alignment B 4 sequences A, B, C, D A D 6 pairwise comparisons then cluster analysis A B C C D (2) Multiple Alignment following the tree from 1 B Align most similar pair D gaps to optimise alignment A Align next most similar pair C A new gap to optimise alignment of BD with AC C B Align alignments - preserve gaps D
Multiple Alignment start by aligning the most closely related pairs using DP and gradually align these groups together keeping the gaps that appear in earlier alignments fixed alternatively can add sequences one at a time to a growing multiple alignment the heuristic approach is not guaranteed to find the optimum alignment - but it is soundly based, biologically
More recent resources: ClustalW Higgins, 1997 since the choice of parameters used can have significant effect on the alignment for very distant sequences, ClustalW addresses this problem by: position specific gap opening and extension penalties using different amino acid substitution matrices - one for close relatives, one for distant More recent resources: MAFFT MUSCLE JALVIEW
ClustalW Gap penalties where structure is known, one would want to increase the gap penalty within helices and strands and decrease it between them - forcing gaps to occur more frequently in loops if no structure known, can use simple rules which depends on the residues occurring and the frequencies of gaps e.g. use lower gap penalties where gaps already occur
for each position in the alignment using an entropy measure Identifying sequence conserved residue positions multiple sequence alignment of relatives from functional group 1 = highly conserved Structural model Score conservation for each position in the alignment using an entropy measure 0 = unconserved Putative functional site Scorecons - Thornton 27
Searching Sequence Databases Do fast scans using approximate methods e.g. BLAST or PSIBLAST Align proteins carefully using a dynamic programming method Needleman & Wunsch Smith & Waterman Scan against sequence profiles (or HMMs) in secondary databases e.g. Pfam, Gene3D, InterPro Align query sequence against family relatives using: ClustalW, Jalview, MUSCLE, MAFFT Can you inherit functional information?
Profile Based Sequence Search Methods by comparing related sequences within a protein family can identify patterns of conserved residues even the most distant members of the family should have these patterns of conserved residues can make a profile which encapsulates these patterns and use it to detect more distantly related sequences highly conserved positions usually correspond to the buried core or functional residues within the active site
Iterated Application of BLAST PSI-BLAST Altschul et al. (1997) first constructs a multiple alignment of all the related sequences identified by BLAST then estimates the residue frequencies at each position to construct a score matrix Position Specific Score Matrices (PSSM) also known as weight matrices or 1D profile
PSI-BLAST Altschul et al. (1997) UniProt Database query sequence aligns matched sequences and builds profile further iterations pull out more distant sequence relatives
Use the Multiple Alignment to Calculate Residue Frequencies PSI-BLAST Use the Multiple Alignment to Calculate Residue Frequencies the residue frequencies at each position are used to calculate the scores for aligning a query sequence against the pattern query relatives putative relative P1……... P5 P6…………... Pn…………... three times more powerful than BLAST!!
Position specific substitution matrix Path matrix or score matrix 10 20 70 90 . …HRVLA A 10 I C I Path matrix or score matrix N R 70 C K C R 70 H 90 P
Searching Protein Family Databases Secondary databases (as opposed to primary sequence databases) group proteins into related families Families are usually represented by a sequence profile or sequence model (Hidden Markov Model HMM) derived from a multiple sequence alignment of the relatives
HMMs for Protein Domain Family Recognition Pfam, SUPERFAMILY, Gene3D : Hidden Markov Models (HMMs) sequence is aligned using a probabilistic model of interconnecting match, delete or insert states contains statistical information on observed and expected positional variation - “fingerprint of a protein family” 5 times more powerful than BLAST B E Mi Di Ii
UniProt coverage PDB coverage Pfam-A Pfam-B Other Pfam-A 10,340 curated families with annotation Pfam-B 224,303 families derived from ADDA (50% clearly related to a Pfam-A) UniProt coverage 74% of sequences 51% of residues PDB coverage 94% of sequences 76% of residues 36 36
representative members Pfam : SEED alignment representative members Profile-HMM HMMer-2.0 Search UniProt FULL alignment Manually curated Automatically made 37 37
Protein Pfam classification Protein fold, etc. This summarises the way we classify peptidases in MEROPS. We use a hierarchical system with three main levels. The individual peptidases are first grouped into families by significant sequence relationships. For each family a type example is nominated (highlighted in yellow here). This is a well-characterised member, and ideally one for which there is a crystal structure. Every member of the family must be shown to be related to the type example. And then homologous families are grouped together in clans. The homologous families are recognised primarily by similar protein folds. (There is a good deal more to it than this: there are subfamilies and subclans, and transitive relationships, but this is the essence of it.) Protein 38 38
Pfam classification Family Protein Protein fold, etc. This summarises the way we classify peptidases in MEROPS. We use a hierarchical system with three main levels. The individual peptidases are first grouped into families by significant sequence relationships. For each family a type example is nominated (highlighted in yellow here). This is a well-characterised member, and ideally one for which there is a crystal structure. Every member of the family must be shown to be related to the type example. And then homologous families are grouped together in clans. The homologous families are recognised primarily by similar protein folds. (There is a good deal more to it than this: there are subfamilies and subclans, and transitive relationships, but this is the essence of it.) Protein 39 39
Pfam classification Clan Family Protein Protein fold, etc. This summarises the way we classify peptidases in MEROPS. We use a hierarchical system with three main levels. The individual peptidases are first grouped into families by significant sequence relationships. For each family a type example is nominated (highlighted in yellow here). This is a well-characterised member, and ideally one for which there is a crystal structure. Every member of the family must be shown to be related to the type example. And then homologous families are grouped together in clans. The homologous families are recognised primarily by similar protein folds. (There is a good deal more to it than this: there are subfamilies and subclans, and transitive relationships, but this is the essence of it.) Protein 40 40
41 41
42
43
44 44