Dot Plots, Path Matrices, Score Matrices


1 Dot Plots, Path Matrices, Score Matrices. [Figure: dot plot of Sequence A against Sequence B; diagonal lines give equivalent residues.]

2 Identical residues score 1; the highest scoring path across the matrix gives the best alignment. [Figure: dot plot of Sequence A against Sequence B.]
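
The scoring rule on this slide (identical residues score 1) can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names and the two short sequences are hypothetical, chosen only to keep the printed plot small.

```python
def dot_plot(seq_a, seq_b):
    """Return a matrix with 1 where residues are identical, else 0.

    Rows follow seq_b, columns follow seq_a.
    """
    return [[1 if a == b else 0 for a in seq_a] for b in seq_b]


def render(seq_a, seq_b, matrix):
    """Draw the matrix as text: 'x' for a match, '.' otherwise."""
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, matrix):
        lines.append(residue + " " + " ".join("x" if cell else "." for cell in row))
    return "\n".join(lines)


# Hypothetical sequences, for illustration only.
seq_a = "VILSTRIV"
seq_b = "VILPEFST"
matrix = dot_plot(seq_a, seq_b)
print(render(seq_a, seq_b, matrix))
```

Runs of `x` along a diagonal correspond to stretches of equivalent residues in the two sequences.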

3 Runs (tuples) of 3 residues. SCORE = 20 - 9 = 11, where 9 = 3 × gap penalty = 3 × 3. [Figure: dot plot of Sequence A against Sequence B, using the sequences VILSLVILPQRSLVVILSLVILALTV and STVILSLVRNVILPQRILSLVISLAL; figure labels 6, 6, 5, 6, 3, 3, 3, 6.]
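
Requiring runs (tuples) of 3 identical residues, as on this slide, filters isolated chance matches out of a dot plot. The sketch below shows one simple way to do that, assuming exact matching of every residue in the window; the function name and the short test sequences are hypothetical.

```python
def windowed_dot_plot(seq_a, seq_b, window=3):
    """Return (i, j) pairs where seq_b[i:i+window] equals seq_a[j:j+window]."""
    hits = []
    for i in range(len(seq_b) - window + 1):
        for j in range(len(seq_a) - window + 1):
            if seq_b[i:i + window] == seq_a[j:j + window]:
                hits.append((i, j))
    return hits


# Hypothetical fragments, for illustration only.
hits = windowed_dot_plot("VILSLV", "TVILSL", window=3)
print(hits)
```

Consecutive hits such as (1, 0), (2, 1), (3, 2) lie on one diagonal and together form a longer run.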

4 Alignment from Dot Plot: VILSLV ILPQRSLVVILSLVI LALTV / STVILSLVNVILPQR ILSLVISLAL. Score = 20; sequence identity = 20/26 ≈ 77%.
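
The arithmetic on slides 3 and 4 can be checked in a few lines. This sketch assumes that the 9 subtracted on slide 3 is three gaps at a penalty of 3 each; the function names are illustrative, not from the slides.

```python
def alignment_score(matches, gaps, gap_penalty):
    """Score = number of identical residues minus the total gap penalty."""
    return matches - gaps * gap_penalty


def percent_identity(matches, aligned_length):
    """Identical positions as a percentage of the aligned length."""
    return 100.0 * matches / aligned_length


score = alignment_score(matches=20, gaps=3, gap_penalty=3)
identity = round(percent_identity(20, 26), 1)
print(score)     # 11
print(identity)  # 76.9
```

With 20 identities over 26 aligned positions, the identity works out to roughly 77%.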

5 Path or Score Matrix. [Figure: matrix for the sequences ALVKRH… and …HRKVLA, with each cell scored (1 or 0) from a residue substitution matrix.]

6 Needleman & Wunsch. Path matrix: identical residues score 1, all other cells 0. Rows: AHCNIRQCLCRPM; columns: AICINRCKCRHP.
    A I C I N R C K C R H P
A   1 0 0 0 0 0 0 0 0 0 0 0
H   0 0 0 0 0 0 0 0 0 0 1 0
C   0 0 1 0 0 0 1 0 1 0 0 0
N   0 0 0 0 1 0 0 0 0 0 0 0
I   0 1 0 1 0 0 0 0 0 0 0 0
R   0 0 0 0 0 1 0 0 0 1 0 0
Q   0 0 0 0 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 1 0 1 0 0 0
L   0 0 0 0 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 1 0 1 0 0 0
R   0 0 0 0 0 1 0 0 0 1 0 0
P   0 0 0 0 0 0 0 0 0 0 0 1
M   0 0 0 0 0 0 0 0 0 0 0 0

7 Needleman & Wunsch Algorithm
- accumulate the matrix by adding to each cell the highest score in the column or row to the right and below it
- find the highest scoring path in the matrix by:
  - starting in the top left corner
  - moving down across the matrix from cell to cell
  - choosing the highest scoring cell at each move
- the path cannot go back on itself or cross the same row or column twice

8 Accumulating the Matrix. Add to the score in each cell the highest score from a cell in the row or column to the right and below. [Figure: cell i,j with candidate cells i-1,j-1; i-n,j-1; i-1,j-m.]
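
The accumulation step can be sketched as follows. This is a minimal reading of the rule as stated on the slide, working from the bottom right corner back to the top left. It assumes Sequence A is AHCNIRQCLCRPM (the order in which it appears in the alignment on slide 12), takes rows from Sequence A and columns from Sequence B, and uses hypothetical function names. It is not guaranteed to reproduce every cell of the slide's figure.

```python
def identity_matrix(seq_a, seq_b):
    """Rows follow seq_a, columns follow seq_b; identical residues score 1."""
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


def accumulate(scores):
    """Add to each cell the best score in the next row or next column.

    For cell (i, j) the candidates are row i+1 from column j+1 rightwards
    and column j+1 from row i+1 downwards. The last row and last column
    have no candidates and keep their own scores.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    acc = [row[:] for row in scores]
    for i in range(n_rows - 2, -1, -1):
        for j in range(n_cols - 2, -1, -1):
            next_row = acc[i + 1][j + 1:]
            next_col = [acc[k][j + 1] for k in range(i + 1, n_rows)]
            acc[i][j] += max(next_row + next_col)
    return acc


seq_a = "AHCNIRQCLCRPM"  # assumption: order taken from the slide 12 alignment
seq_b = "AICINRCKCRHP"
acc = accumulate(identity_matrix(seq_a, seq_b))
for residue, row in zip(seq_a, acc):
    print(residue, " ".join(str(value) for value in row))
```

The value in the top left cell is the score of the best path through the whole matrix, here the number of identities in the best alignment.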

9 Accumulated matrix. Rows: Sequence A (AHCNIRQCLCRPM); columns: Sequence B (AICINRCKCRHP).
    A I C I N R C K C R H P
A   8 7 6 6 5 4 3 3 2 2 1 0
H   7 7 6 6 5 4 3 3 2 1 2 0
C   6 6 7 6 5 4 4 3 3 1 1 0
N   6 6 6 5 6 4 3 3 2 1 1 0
I   5 6 5 6 5 4 3 3 2 1 1 0
R   4 4 4 4 5 5 3 3 2 2 1 0
Q   4 4 4 4 4 4 3 3 2 1 1 0
C   3 3 4 3 3 3 4 3 3 1 1 0
L   3 3 3 3 3 3 3 3 2 1 1 0
C   2 2 3 2 3 2 3 2 3 1 1 0
R   1 1 1 1 1 2 1 1 1 2 1 0
P   0 0 0 0 0 0 0 0 0 0 0 1
M   0 0 0 0 0 0 0 0 0 0 0 0

10 Possible Moves in Finding a Path across the Matrix
- start in the leftmost column or topmost row
- move to the highest scoring cell in the row or column to the right and below
[Figure: cell i,j with candidate cells i-1,j-1; i-n,j-1; i-1,j-m.]
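
The path-finding moves can be sketched as a greedy walk over an accumulated matrix. The block repeats a compact accumulation step so that it runs on its own. The tie-breaking order (next column before next row), the function names, and the choice of AHCNIRQCLCRPM for Sequence A are assumptions made for this illustration, not details given on the slides.

```python
def accumulate(seq_a, seq_b):
    """Identity scores (rows: seq_a, columns: seq_b), accumulated from the bottom right."""
    acc = [[1 if a == b else 0 for b in seq_b] for a in seq_a]
    for i in range(len(seq_a) - 2, -1, -1):
        for j in range(len(seq_b) - 2, -1, -1):
            next_col = [acc[k][j + 1] for k in range(i + 1, len(seq_a))]
            acc[i][j] += max(next_col + acc[i + 1][j + 1:])
    return acc


def best_path(acc):
    """Start at the best cell of the top row or left column, then move greedily."""
    n_rows, n_cols = len(acc), len(acc[0])
    starts = [(0, j) for j in range(n_cols)] + [(i, 0) for i in range(1, n_rows)]
    i, j = max(starts, key=lambda cell: acc[cell[0]][cell[1]])
    path = [(i, j)]
    while i + 1 < n_rows and j + 1 < n_cols:
        # Candidates: the next column (downwards), then the rest of the next row.
        candidates = [(k, j + 1) for k in range(i + 1, n_rows)]
        candidates += [(i + 1, k) for k in range(j + 2, n_cols)]
        i, j = max(candidates, key=lambda cell: acc[cell[0]][cell[1]])
        path.append((i, j))
    return path


def path_to_alignment(seq_a, seq_b, path):
    """Turn a path into two gapped strings; skipped residues face a gap."""
    top, bottom = [], []
    prev_i, prev_j = -1, -1
    for i, j in path + [(len(seq_a), len(seq_b))]:  # sentinel flushes the tails
        for k in range(prev_i + 1, i):
            top.append(seq_a[k])
            bottom.append("-")
        for k in range(prev_j + 1, j):
            top.append("-")
            bottom.append(seq_b[k])
        if i < len(seq_a) and j < len(seq_b):
            top.append(seq_a[i])
            bottom.append(seq_b[j])
        prev_i, prev_j = i, j
    return "".join(top), "".join(bottom)


seq_a, seq_b = "AHCNIRQCLCRPM", "AICINRCKCRHP"
path = best_path(accumulate(seq_a, seq_b))
top, bottom = path_to_alignment(seq_a, seq_b, path)
print(top)
print(bottom)
```

Where two candidate cells tie, a different tie-breaking order gives a different but equally high scoring alignment.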

11 [Figure: the accumulated matrix of slide 9 (Sequence A against Sequence B), shown again.]

12 Resulting alignment:
A H C N I - R Q C L C R - P M
A I C - I N R - C K C R H P M

13 Searching Sequence Databases
Can you inherit functional information?
- Do fast scans using approximate methods, e.g. BLAST or PSI-BLAST
- Align proteins carefully using a dynamic programming method: Needleman & Wunsch, Smith & Waterman
- Scan against sequence profiles (or HMMs) in secondary databases, e.g. Pfam, Gene3D, InterPro
- Align query sequence against family relatives using: ClustalW, Jalview, MUSCLE, MAFFT

14 Profile Based Sequence Search Methods
- by comparing related sequences within a protein family, one can identify patterns of conserved residues
- even the most distant members of the family should have these patterns of conserved residues
- one can make a profile which encapsulates these patterns and use it to detect more distantly related sequences
- highly conserved positions usually correspond to the buried core or to functional residues within the active site

15 Iterated Application of BLAST: PSI-BLAST (Altschul et al., 1997)
- first constructs a multiple alignment of all the related sequences identified by BLAST
- then estimates the residue frequencies at each position to construct a score matrix: Position Specific Score Matrices (PSSM), also known as weight matrices or profiles
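
Turning per-position residue frequencies into a score matrix can be sketched as below. This is a deliberately simplified illustration and not PSI-BLAST's actual procedure: it assumes a uniform background frequency, a fixed pseudocount, and no sequence weighting. The function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        # Gap characters are ignored when counting residues.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


# Toy alignment of four hypothetical relatives ('-' marks a gap).
alignment = ["VILS", "VILT", "VLLS", "V-LS"]
pssm = build_pssm(alignment)
print(round(pssm[0]["V"], 2), round(pssm[0]["A"], 2))
```

A fully conserved column gives its residue a positive score and every other residue a negative one, which is what lets the profile reward the family's conserved positions.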

16 PSI-BLAST (Altschul et al., 1997). [Figure: a query sequence is searched against the UniProt database; PSI-BLAST aligns the matched sequences and builds a profile; further iterations pull out more distant sequence relatives.]

17 Use the Multiple Alignment to Calculate Residue Frequencies (PSI-BLAST)
- the residue frequencies at each position are used to calculate the scores for aligning a query sequence against the pattern
- three times more powerful than BLAST
[Figure: the query aligned with its relatives and a putative relative; pattern positions P1 ... P5 P6 ... Pn.]
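
Scoring a candidate sequence against position-specific scores can be sketched like this. It is a minimal, ungapped illustration: the profile values are invented, the `default` score for unlisted residues is an assumption, and a real search would also allow gaps.

```python
def score_against_profile(query, profile, default=-1.0):
    """Sum position-specific scores for an ungapped query of matching length."""
    if len(query) != len(profile):
        raise ValueError("query and profile must have the same length")
    return sum(column.get(res, default) for res, column in zip(query, profile))


# Hypothetical three-position profile; the scores are made up.
profile = [
    {"V": 2.0, "I": 1.0},
    {"I": 2.0, "L": 1.5},
    {"L": 2.0},
]
print(score_against_profile("VIL", profile))  # 6.0
print(score_against_profile("ILL", profile))  # 4.5
print(score_against_profile("AAA", profile))  # -3.0
```

Unlike a single substitution matrix, the same residue can score differently at each position, so a putative relative is judged on whether it fits the family's pattern.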

18 Path matrix or score matrix. [Figure: the sequence AICINRCKCRHP scored against positions … HRVLA using a position specific substitution matrix; example cell values 10, 10, 20, 70, 90 and 10, 70, 70, 90.]

19 Multiple Alignment
- direct extensions of the standard DP approach for the alignment of 2 sequences are computationally impractical for more than about 3 sequences, because the matrix gains one dimension per sequence
- practical heuristic solutions are based on the idea that sequences are evolutionarily related and can be aligned using an underlying phylogenetic tree
- this is known as progressive alignment

20 Progressive alignment of 4 sequences A, B, C, D
(1) Pairwise Alignment: 6 pairwise comparisons, then cluster analysis to give a tree.
(2) Multiple Alignment, following the tree from (1):
- align the most similar pair, adding gaps to optimise the alignment
- align the next most similar pair
- align the alignments (here BD with AC), preserving existing gaps and adding a new gap to optimise the alignment
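
The first stage (all pairwise comparisons, then picking the most similar pairs) can be sketched as follows. To stay short, this sketch replaces the pairwise DP alignment with a crude position-by-position identity count, which is an assumption of the example. The sequences and function names are hypothetical.

```python
from itertools import combinations


def fraction_identical(seq_x, seq_y):
    """Crude similarity: identical positions over the shorter length (no alignment)."""
    n = min(len(seq_x), len(seq_y))
    return sum(x == y for x, y in zip(seq_x, seq_y)) / n


def rank_pairs(sequences):
    """Score every pair of sequences and return them most similar first."""
    pairs = combinations(sorted(sequences), 2)
    scored = [(fraction_identical(sequences[p], sequences[q]), p, q) for p, q in pairs]
    return sorted(scored, key=lambda item: -item[0])


# Hypothetical sequences, for illustration only.
sequences = {
    "A": "VILSTRIV",
    "B": "MKLPEFST",
    "C": "VILATKIV",
    "D": "MKLPEFSS",
}
ranked = rank_pairs(sequences)
for score, p, q in ranked:
    print(p, q, score)
```

Four sequences give six comparisons; the top-ranked pairs are the ones a progressive method would align first.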

21 Multiple Alignment
- start by aligning the most closely related pairs using DP, then gradually align these groups together, keeping the gaps that appear in earlier alignments fixed
- alternatively, sequences can be added one at a time to a growing multiple alignment
- the heuristic approach is not guaranteed to find the optimum alignment, but it is biologically well founded

22 ClustalW (Higgins, 1997)
The choice of parameters can have a significant effect on the alignment of very distant sequences. ClustalW addresses this problem by:
- position specific gap opening and extension penalties
- using different amino acid substitution matrices: one for close relatives, one for distant
More recent resources: MAFFT, MUSCLE, Jalview

23 ClustalW: Gap penalties
- where structure is known, one would want to increase the gap penalty within helices and strands and decrease it between them, forcing gaps to occur more frequently in loops
- if no structure is known, one can use simple rules which depend on the residues occurring and the frequencies of gaps, e.g. use lower gap penalties where gaps already occur
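
The two rules on this slide can be sketched as a per-column gap opening penalty. The numbers and parameter names below are invented for illustration and are not ClustalW's actual formulae. The secondary structure string is assumed to use H for helix, E for strand, and any other character for loop.

```python
def gap_opening_penalties(alignment, secondary_structure=None,
                          base=10.0, gap_factor=0.5, structure_factor=2.0):
    """Return one gap opening penalty per alignment column.

    Columns that already contain a gap get a lower penalty; columns inside
    a helix (H) or strand (E) get a higher one.
    """
    penalties = []
    for pos, column in enumerate(zip(*alignment)):
        penalty = base
        if "-" in column:
            penalty *= gap_factor
        if secondary_structure and secondary_structure[pos] in "HE":
            penalty *= structure_factor
        penalties.append(penalty)
    return penalties


# Hypothetical alignment and secondary structure assignment.
alignment = ["VIL-ST", "VILPST", "VI--ST"]
with_structure = gap_opening_penalties(alignment, "HHHCCE")
without_structure = gap_opening_penalties(alignment)
print(with_structure)
print(without_structure)
```

The effect is to steer new gaps towards loop positions and towards columns where a gap has already been accepted.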

24 Searching Protein Family Databases
- Secondary databases (as opposed to primary sequence databases) group proteins into related families
- Families are usually represented by a sequence profile or sequence model (Hidden Markov Model, HMM) derived from a multiple sequence alignment of the relatives

25 HMMs for Protein Domain Family Recognition
Pfam, SUPERFAMILY, Gene3D: Hidden Markov Models (HMMs)
- the sequence is aligned using a probabilistic model of interconnecting match, delete or insert states
- the model contains statistical information on observed and expected positional variation: the “fingerprint of a protein family”
[Figure: HMM architecture with states B and E and, at each position i, match (Mi), delete (Di) and insert (Ii) states.]
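
The idea of interconnected match, insert and delete states can be illustrated with a toy model that scores a sequence along one given state path. Everything here is an assumption made for the example: the two-column model, the probabilities, the flat insert emission, position-independent transitions, and treating the move to the end state as certain. Real tools such as HMMER estimate these values from the seed alignment and search over all paths.

```python
import math

# Hypothetical two-column profile HMM; all probabilities are invented.
match_emissions = [
    {"V": 0.7, "I": 0.2, "L": 0.1},
    {"L": 0.8, "I": 0.2},
]
insert_emission = 0.05  # assumption: flat emission for insert states
transitions = {
    ("B", "M"): 0.9, ("B", "D"): 0.1,
    ("M", "M"): 0.8, ("M", "I"): 0.1, ("M", "D"): 0.1,
    ("I", "M"): 0.6, ("I", "I"): 0.4,
    ("D", "M"): 0.9, ("D", "D"): 0.1,
    # assumption: the final move to the end state E is certain
    ("M", "E"): 1.0, ("I", "E"): 1.0, ("D", "E"): 1.0,
}


def path_log_prob(sequence, path):
    """Log-probability of emitting `sequence` along a given state path.

    `path` is a string over M (match), I (insert) and D (delete).
    """
    log_p, prev, column, pos = 0.0, "B", 0, 0
    for state in path:
        log_p += math.log(transitions[(prev, state)])
        if state == "M":
            log_p += math.log(match_emissions[column][sequence[pos]])
            column += 1
            pos += 1
        elif state == "I":
            log_p += math.log(insert_emission)
            pos += 1
        else:  # "D" skips a model column without emitting a residue
            column += 1
        prev = state
    return log_p + math.log(transitions[(prev, "E")])


print(math.exp(path_log_prob("VL", "MM")))    # both residues matched
print(math.exp(path_log_prob("VAL", "MIM")))  # one inserted residue
print(math.exp(path_log_prob("L", "DM")))     # first column deleted
```

A sequence that follows the match states with well-supported residues scores far higher than one that needs insertions or deletions, which is how the model acts as a family fingerprint.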

27 Pfam-A: 10,340 curated families with annotation. Pfam-B: 224,303 families derived from ADDA (50% clearly related to a Pfam-A). UniProt coverage: 74% of sequences, 51% of residues. PDB coverage: 94% of sequences, 76% of residues. [Figure: chart with segments Pfam-A, Pfam-B, Other.]

28 Pfam. [Figure: a manually curated SEED alignment of representative members is used to build a profile HMM (HMMer-2.0); the profile HMM is used to search UniProt, giving the automatically made FULL alignment.]

29 Pfam classification. [Figure labels: Protein; Protein fold, etc.]

30 Pfam classification. [Figure labels: Protein, Family; Protein fold, etc.]

31 Pfam classification. [Figure labels: Protein, Clan, Family; Protein fold, etc.]
