Bioinformatics Methods for Inheriting Structural and Functional Annotations for Gene Sequences


1 Bioinformatics Methods for Inheriting Structural and Functional Annotations for Gene Sequences
- If a related sequence has a known function, can you inherit its functional properties?
- If a related sequence has a known structure, can you model the unknown structure using the known one?
- Structural information can often provide additional clues as to the function
- What are the best methods to use?
- What thresholds should be used for safe inheritance of functional properties?

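2 Homologues are related sequences:
[Figure: an ancestral gene a duplicates to give genes a and b; speciation then gives species 1 and species 2, each carrying a and b. Genes related by duplication are paralogs; genes related by speciation are orthologs.]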

3 Protein Sequence and Structure Databases
- The GenBank sequence database in the United States has over 120 million sequences, some partial, and more than a million non-identical sequences
- DNA Data Bank of Japan (DDBJ)
- The UniProt (SWISS-PROT) database has more than a million non-identical sequences: validated gene sequences
- The Protein Data Bank (PDB in the United States, PDBe in the UK) has more than 70,000 entries

4 Web Based Public Resources containing Functional Annotations
- Protein family and function databases: Pfam, InterPro, PROSITE, PRINTS, PANTHER, SMART, SCOP, CATH, HOMSTRAD
- Databases of biochemical pathways and biological databases: KEGG, WIT, GO, FunCat, EC
- Databases of protein-ligand interactions: IntAct, MIPS, RELIBASE, BIND, DIP, iRefIndex
- Species databases: ENSEMBL, FlyBase, YPD, WORMDb, GenProtEC, EcoCyc

5 Evolution of Protein Sequences
- Substitutions due to single base mutations
- Insertions or deletions (indels) of residues, usually not in the secondary structures but in the connecting loops
- Indels can make it harder to compare sequences: you have to line up the equivalent regions and put gaps where there are indels

6 Evolution of Protein Sequences
[Figure: Sequence A compared with Sequence B]

7 Human Hemoglobin: Alpha and Beta Chains (compared without gaps)
alpha VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
beta  VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQ

alpha KTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPN
beta  RFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

alpha ALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEF
beta  DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

alpha TPAVHASLDKFLASVSTVLTSKYR
beta  FGKEFTPPVQAAYQKVVAGVANALAHKYH

8 Human Hemoglobin: Alpha and Beta Chains (aligned with gaps)
alpha VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
beta  VHLTPEEKSAVTALWGKV-NVDEVGGEALGRLLVVYPWT

alpha KTYFPHF-DLSH-----GSAQVKGHGKKVADALTNAVAHV
beta  QRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

alpha DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH
beta  DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

alpha LPAEFTPAVHASLDKFLASVSTVLTSKYR
beta  FGKEFTPPVQAAYQKVVAGVANALAHKYH

9 Percentage Sequence Identity
Percentage sequence identity = (number of identical residues x 100) / (number of residues in the smallest protein)
For the globin example: without gaps ~9%, with gaps ~41%
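The formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `percent_identity` and the convention of writing gaps as `-` are assumptions, and the example uses only the first few residues of the two globin chains.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Identical residues x 100 / residues in the smaller protein.

    row_a and row_b are compared position by position; '-' marks a gap
    (an assumed convention). Unpaired trailing positions are ignored.
    """
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    shorter = min(len(row_a.replace("-", "")), len(row_b.replace("-", "")))
    return 100.0 * identical / shorter


# First residues of the alpha and beta chains, without and with one gap.
ungapped = percent_identity("VLSPADK", "VHLTPEE")
gapped = percent_identity("V-LSPADK", "VHLTPEEK")
print(f"without gap: {ungapped:.1f}%  with gap: {gapped:.1f}%")
```

Even on this short fragment, inserting a single gap raises the identity, which is the same effect the slide reports for the full globin chains.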

10 Searching for Homologues with Related Functions
- How do you handle the evolutionary changes?
- How similar do the sequences need to be to inherit structural and functional properties?
- How do you cope with the volume of data, i.e. millions of sequences to search?

11 Searching Sequence Databases
Can you inherit functional information?
- Do fast scans using approximate methods, e.g. BLAST or PSI-BLAST
- Align proteins carefully using a dynamic programming method: Needleman & Wunsch, Smith & Waterman
- Scan against sequence profiles (or HMMs) in secondary databases, e.g. Pfam, InterPro, Gene3D
- Align the query sequence against family relatives using ClustalW, Jalview, MUSCLE, MAFFT

12 Dot Plots, Path Matrices, Score Matrices
[Figure: dot plot with Sequence A and Sequence B on the axes (VILSTRIVHVNSILPSTN and VILSTRIVILPEFST)]
Diagonal lines give equivalent residues

13 [Figure: dot plot with Sequence A and Sequence B on the axes (VILSTRIVHVNSILPSTN and VILSTRIVILPEFST)]
- Identical residues score 1
- The highest scoring path across the matrix gives the best alignment
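A dot plot of this kind is simple to build. The sketch below is illustrative only: the function names and the text rendering with `*` and `.` are assumptions, and it uses the two sequences shown on the slides.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[list[int]]:
    """Score matrix with 1 where residues are identical, 0 otherwise.

    Rows follow seq_b and columns follow seq_a.
    """
    return [[1 if a == b else 0 for a in seq_a] for b in seq_b]


def render(seq_a: str, seq_b: str) -> str:
    """Plain-text view: '*' marks identical residues."""
    lines = ["  " + seq_a]
    for residue, row in zip(seq_b, dot_plot(seq_a, seq_b)):
        lines.append(residue + " " + "".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


print(render("VILSTRIVHVNSILPSTN", "VILSTRIVILPEFST"))
```

Runs of `*` along a diagonal correspond to stretches of equivalent residues; a shift from one diagonal to another corresponds to an indel.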

14 [Figure: dot plot of Sequence A (VILSLVILPQRSLVVILSLVILALTV) against Sequence B (STVILSLVRNVILPQRILSLVISLAL), marking runs (tuples) of 3 residues]
Gap penalty = 3
Score = 20 - 9 = 11

15 Alignment from Dot Plot
Sequence A --VILSLV---ILPQRSLVVILSLVI-LALTV
Sequence B STVILSLVRNVILPQR----ILSLVISLAL--
score = 20
sequence identity = 20/26 = 77%
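The arithmetic on these two slides (20 identities, a penalty of 3 per gap, score 20 - 9 = 11) can be checked with a short sketch. The helper name `score_alignment` is an assumption, and so is the exact gap placement in the two rows: it is one reading of the alignment, written with `-` for gaps, that is consistent with the sequences and scores given. Gaps at the ends of a row are not penalised here, which is also an assumption.

```python
import re


def score_alignment(row_a: str, row_b: str, gap_penalty: int = 3):
    """Return (identities, internal gaps, identities - gap_penalty * gaps)."""
    identities = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    # Count each run of '-' once, ignoring runs at either end of a row.
    gaps = sum(len(re.findall(r"-+", row.strip("-"))) for row in (row_a, row_b))
    return identities, gaps, identities - gap_penalty * gaps


ROW_A = "--VILSLV---ILPQRSLVVILSLVI-LALTV"
ROW_B = "STVILSLVRNVILPQR----ILSLVISLAL--"
identities, gaps, score = score_alignment(ROW_A, ROW_B)
shorter = min(len(ROW_A.replace("-", "")), len(ROW_B.replace("-", "")))
print(identities, gaps, score, f"{100 * identities / shorter:.0f}%")
```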

16 Dynamic Programming Methods
- Global alignment: Needleman & Wunsch
- Local alignment: Smith & Waterman
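The Needleman & Wunsch method is walked through on later slides; for the local case, a minimal Smith & Waterman scoring sketch is given here. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, and the sketch returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(seq_a: str, seq_b: str,
                         match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score; cell values are never allowed below zero."""
    prev = [0] * (len(seq_b) + 1)
    best = 0
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


# The shared core VILSTRIV is found despite unrelated flanking residues.
print(smith_waterman_score("AAAVILSTRIVKKK", "GGGVILSTRIVPPP"))
```

Resetting negative cells to zero is what makes the method local: a poor region cannot drag down the score of a good segment that follows it.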

17 [Figure: Sequence A compared with Sequence B]

18 Significance of sequence similarity: length dependence
- Protein pairs having > 150 residues are homologous if the sequence identity is > 25%
- For short proteins/fragments of 20-40 residues, 30% sequence identity frequently occurs by chance
[Plot: sequence identity (%) against length, showing the region occupied by homologous pairs]

19 If proteins are homologous they are likely to have similar structures and functions
Sequence identity between homologues required for inheriting structure or function:
- Modelling a structure based on the structure of a homologue: >= 30%
- Inheriting functional properties from a homologue: >= 60%
The structures of proteins in a family tend to be much more highly conserved during evolution than the sequences (and, in some families, the function)
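The thresholds on the last two slides can be combined into a simple rule of thumb. The sketch below is one possible reading, not a method from the slides: the function name and the decision to require the length rule (> 150 residues, > 25% identity) before applying the 30% and 60% thresholds are assumptions.

```python
def inheritance_advice(identity_pct: float, aligned_length: int) -> dict:
    """Apply the slide thresholds to one pairwise comparison."""
    # Length rule: pairs of > 150 residues with > 25% identity.
    homology_supported = aligned_length > 150 and identity_pct > 25
    return {
        "homology_supported": homology_supported,
        "model_structure": homology_supported and identity_pct >= 30,
        "inherit_function": homology_supported and identity_pct >= 60,
    }


print(inheritance_advice(65, 300))  # long pair, high identity
print(inheritance_advice(35, 300))  # enough for modelling, not for function
print(inheritance_advice(30, 30))   # short fragment: 30% can occur by chance
```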

20 Residue Substitution Matrices
A substitution matrix is a 20 x 20 matrix which scores each possible comparison of residues
- Identity matrix: the simplest scoring scheme; amino acids are either identical (score 1) or non-identical (score 0)
- Physicochemical properties matrix: score residue pairs according to similarities in their physico-chemical properties, e.g. val->leu scores well, val->arg scores low
- Evolutionary matrices: score residue pairs according to how frequently the mutation is observed to occur in evolution, e.g. Dayhoff (PAM) and BLOSUM matrices
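The difference between the first two schemes can be shown with a small sketch. The residue groupings and the 0.5 score for a within-group substitution are hypothetical values invented for illustration; they are not taken from the slides or from any published matrix.

```python
# Hypothetical property groups, for illustration only.
PROPERTY_GROUPS = [set("AVLIM"), set("FWY"), set("KRH"),
                   set("DE"), set("STNQ"), set("CGP")]


def identity_score(a: str, b: str) -> float:
    """Identity matrix: 1 for identical residues, 0 otherwise."""
    return 1.0 if a == b else 0.0


def property_score(a: str, b: str) -> float:
    """Toy physicochemical matrix: partial credit within a property group."""
    if a == b:
        return 1.0
    same_group = any(a in group and b in group for group in PROPERTY_GROUPS)
    return 0.5 if same_group else 0.0


def score_ungapped(seq_a: str, seq_b: str, scorer) -> float:
    """Sum the pair scores of two sequences compared position by position."""
    return sum(scorer(a, b) for a, b in zip(seq_a, seq_b))


print(property_score("V", "L"), property_score("V", "R"))
print(score_ungapped("VLK", "LLR", identity_score),
      score_ungapped("VLK", "LLR", property_score))
```

With the toy groups, val->leu scores higher than val->arg, while the identity matrix treats both as plain mismatches.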

21 Dayhoff Matrix (PAM or MDM)
- Based on evolutionary relationships; it is derived by analysing the substitutions observed in closely related sequences (>80% identity)
- The method measures evolutionary distance by determining the number of point accepted mutations (PAM), where 1 PAM = a single point mutation every 100 residues
- For distant relatives in the twilight zone (<25% identity), generally use a 250 PAM matrix
- For database searches, generally use 120 PAMs
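In the Dayhoff approach, matrices for larger distances are extrapolated from the 1 PAM mutation probability matrix by repeated matrix multiplication. The sketch below illustrates that step only: the three-letter alphabet and the uniform probabilities in `TOY_PAM1` are hypothetical, chosen so that each residue stays unchanged with probability 0.99 (one accepted mutation per 100 residues).

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def pam_n(pam1, n):
    """Mutation probabilities after n PAMs: pam1 multiplied by itself n times."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Hypothetical 3-letter alphabet; each row sums to 1.
TOY_PAM1 = [[0.990, 0.005, 0.005],
            [0.005, 0.990, 0.005],
            [0.005, 0.005, 0.990]]

pam120 = pam_n(TOY_PAM1, 120)
pam250 = pam_n(TOY_PAM1, 250)
print(f"P(unchanged) at 120 PAMs: {pam120[0][0]:.3f}, at 250 PAMs: {pam250[0][0]:.3f}")
```

The probability of a residue remaining unchanged falls as the PAM distance grows, which is why a 250 PAM matrix is more tolerant of substitutions than a 120 PAM matrix.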

22 BLOSUM Substitution Matrices
Henikoff & Henikoff (1993)
- The matrix is derived from analysing substitution patterns in more distant relatives (i.e. < 85% identity)
- For clusters of related sequences (e.g. 60% ID, 80% ID), derive multiple alignments without gaps, for short regions of related sequences
- Use the alignments to calculate residue substitution frequencies
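The last step, turning an ungapped alignment block into substitution scores, can be sketched as a log-odds calculation. This is a simplified illustration: it skips the clustering of sequences by percentage identity, the four-sequence block is a made-up toy, and pairs that never occur in the block get no score at all.

```python
from collections import Counter
from itertools import combinations
from math import log2


def log_odds_scores(block):
    """Log-odds scores (in half-bits, rounded) from an ungapped alignment block."""
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue implied by the observed pairs.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        expected = background[a] * background[b] * (1 if a == b else 2)
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


TOY_BLOCK = ["AVL", "AVL", "AIL", "GIL"]  # hypothetical aligned segment
scores = log_odds_scores(TOY_BLOCK)
print(scores)
```

A positive score means the pair is seen more often in the block than expected from the residue frequencies alone.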

23 Which Matrix Should be Used?
- Matrices derived from observed substitution data (e.g. DAYHOFF, BLOSUM) are better than the identity matrix or those based on physical properties
- Various studies suggest that PAM250 gives the best result when aligning distant proteins using dynamic programming algorithms
- In database searching it may be better to use PAM120 or BLOSUM62

24 BLAST: Basic Local Alignment Search Tool
Altschul et al (1990)
- A highest scoring segment pair (HSP) is found between two sequences
- The sequences may be related if the HSP score > cutoff
- Matches significant ‘words’ or segments and then extends these matches using local dynamic programming

25 BLAST Step 1: match significant words
[Figure: query sequence of length L]
For each sequence, find the ‘words’ with significant scores

26 BLAST Step 2: compare the word list to the database and identify exact matches

27 BLAST Step 3: for each word match, extend the alignment using a PAM matrix and dynamic programming
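The three steps can be sketched as a toy seed-and-extend search. This is a simplified illustration under stated assumptions, not the BLAST implementation: it uses only exact 3-residue words (no scored neighbourhood words), scores +1/-1 instead of a PAM matrix, and extends without gaps until the score falls more than `x_drop` below the best seen. All function names are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, w=3):
    """Steps 1 and 2 (simplified): index the query words, look up exact matches."""
    index = defaultdict(list)
    for qi in range(len(query) - w + 1):
        index[query[qi:qi + w]].append(qi)
    return [(qi, si)
            for si in range(len(subject) - w + 1)
            for qi in index.get(subject[si:si + w], [])]


def _extend(pairs, x_drop):
    """Walk along residue pairs; return (best gain, number of pairs kept)."""
    score = best = kept = 0
    for n, (a, b) in enumerate(pairs, start=1):
        score += 1 if a == b else -1
        if score > best:
            best, kept = score, n
        elif best - score > x_drop:
            break
    return best, kept


def extend_seed(query, subject, qi, si, w=3, x_drop=2):
    """Step 3 (simplified): ungapped extension to the right and to the left."""
    gain_r, len_r = _extend(zip(query[qi + w:], subject[si + w:]), x_drop)
    gain_l, len_l = _extend(zip(reversed(query[:qi]), reversed(subject[:si])), x_drop)
    return (w + gain_l + gain_r,
            query[qi - len_l:qi + w + len_r],
            subject[si - len_l:si + w + len_r])


def best_hsp(query, subject, w=3):
    """Highest scoring segment pair over all seeds (raises if there are none)."""
    return max(extend_seed(query, subject, qi, si, w)
               for qi, si in find_seeds(query, subject, w))


print(best_hsp("VILSTRIVHVNSILPSTN", "AAVILSTRIVILPEFST"))
```

Indexing the words first is what makes the scan fast: the expensive extension is attempted only where a word match already exists.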

28 BLAST
- Searches for 2 non-overlapping segments on the same diagonal
- The segments must be within a certain distance of each other before extension is invoked
- Can also allow gaps, so that the method joins segments on different diagonals

29 Assessing the Significance of Sequence Match
- Length: can get artificially high scores between small sequences
- Composition: if sequences are rich in particular amino acid residues, can get high scores for unrelated proteins
- To assess the significance of a match it is necessary to compare the score with that returned by random or unrelated sequences
- If the database is small, or when considering a pair-wise comparison, the sequences can be shuffled to generate random sequences

30 Assessing the Significance of Scores Returned from a Database Scan
Z score = (score (S) - mean for unrelated (m)) / standard deviation (s.d.)
Z value > 3 s.d.: related sequences
[Plot: frequency against score for unrelated sequences, marking the mean, the s.d., and the probe score S at a distance S - m from the mean]
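For a pair-wise comparison, the shuffling approach and the Z score can be combined as in the sketch below. It is illustrative only: the scoring function (a count of identical positions without gaps), the number of shuffles, and the fixed random seed are all assumptions, and any alignment score could be substituted for `score_fn`.

```python
import random
import statistics


def ungapped_identities(seq_a: str, seq_b: str) -> int:
    """Toy score: identical residues when compared position by position."""
    return sum(1 for x, y in zip(seq_a, seq_b) if x == y)


def z_score(seq_a, seq_b, score_fn=ungapped_identities, n_shuffles=200, seed=0):
    """Z = (S - mean of shuffled scores) / s.d. of shuffled scores."""
    rng = random.Random(seed)
    real = score_fn(seq_a, seq_b)
    residues = list(seq_b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps length and composition, destroys order
        background.append(score_fn(seq_a, "".join(residues)))
    spread = statistics.stdev(background)
    if spread == 0:
        raise ValueError("shuffled scores show no variation")
    return (real - statistics.mean(background)) / spread


alpha_start = "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT"
print(f"Z = {z_score(alpha_start, alpha_start):.1f}")
```

Because shuffling preserves both length and composition, the background scores account for the two effects listed on the previous slide.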

31 BLAST results
BLAST best hit
>gi|17472322|ref|XP_061555.1| (XM_061555) similar to orphan G protein-coupled receptor GPR26 [Homo sapiens]
Length = 337
Score = 298 bits (762), Expect = 8e-80
Identities = 168/327 (51%)

Query:   1 MGPGEALLAGLLVMVLAVALLSNALVLLCCAYSAELRTRASGVLLVNLSLGHLLLAALDM 60
           M    A LAGLLV  + V+LLSNALVLLC  +SA++R +A  +  +NL+ G+LL   ++M
Sbjct:   1 MNSWNAGLAGLLVGTIGVSLLSNALVLLCLLHSADIRRQAPALFTLNLTCGNLLCTVVNM 60

Query:  61 PFTLLGVMRGRTPSAPGACQVIGFLDTFLASNAALSVAALSADQWLAVGFPLRYAGRLRP 120
           P TL GV+  R P+    C++  FLDTFLA+N+ LS+AALS D+W+AV FPL Y  ++R
Sbjct:  61 PLTLAGVVAQRQPAGDRLCRLAAFLDTFLAANSMLSMAALSIDRWVAVVFPLSYRAKMRL 120

Query: 121 RYAGLLLGCAWGQSLAFSGAALGCSWLGYSSAFASCSLRLPPEPERPRFAAFTATLHAVG 180
           R A L++   W  +L F  AAL  SWLG+   +ASC+L      ER RFA FT   HA+
Sbjct: 121 RDAALMVAYTWLHALTFPAAALALSWLGFHQLYASCTLCSRRPDERLRFAVFTGAFHALS 180

S: score for the pairwise alignment
E value: number of hits you would expect by chance with score S or higher, given the size of the database and the length of the alignment
- Good match: E < 1 x 10^-50
- Possible match: E between 1 x 10^-50 and 1 x 10^-2
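The E value bands quoted on this slide translate directly into a small helper. The function name is hypothetical, and the label returned for E values above 1 x 10^-2 is an assumption, since the slide does not name that band.

```python
def classify_evalue(e_value: float) -> str:
    """Label a BLAST E value using the bands quoted on the slide."""
    if e_value < 1e-50:
        return "good match"
    if e_value <= 1e-2:
        return "possible match"
    return "outside the bands given on the slide"  # assumed label


print(classify_evalue(8e-80))  # the best hit shown above
print(classify_evalue(1e-5))
```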

32 Needleman & Wunsch
Initial score matrix (identical residues score 1). Columns: Sequence A (AHCNIRQCLCRPM); rows: Sequence B (AICINRCKCRHP)
    A H C N I R Q C L C R P M
A   1 0 0 0 0 0 0 0 0 0 0 0 0
I   0 0 0 0 1 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
I   0 0 0 0 1 0 0 0 0 0 0 0 0
N   0 0 0 1 0 0 0 0 0 0 0 0 0
R   0 0 0 0 0 1 0 0 0 0 1 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
K   0 0 0 0 0 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
R   0 0 0 0 0 1 0 0 0 0 1 0 0
H   0 1 0 0 0 0 0 0 0 0 0 0 0
P   0 0 0 0 0 0 0 0 0 0 0 1 0

33 Needleman & Wunsch Algorithm
- Accumulate the matrix by adding to each cell the highest score in the column or row to the right and below it
- Find the highest scoring path in the matrix by:
  - starting in the top left corner
  - moving down across the matrix from cell to cell
  - choosing the highest scoring cell at each move
- The path cannot go back on itself or cross the same row or column twice

34 Accumulating the Matrix
Add to the score in the cell the highest score from a cell in the row or column to the right and below
[Figure: cell i,j with cells i-1,j-1; i-n,j-1; and i-1,j-m]

35 Accumulated matrix. Columns: Sequence A (AHCNIRQCLCRPM); rows: Sequence B (AICINRCKCRHP)
    A H C N I R Q C L C R P M
A   8 7 6 6 5 4 4 3 3 2 1 0 0
I   7 7 6 6 6 4 4 3 3 2 1 0 0
C   6 6 7 6 5 4 4 4 3 3 1 0 0
I   6 6 6 5 6 4 4 3 3 2 1 0 0
N   5 5 5 6 5 5 4 3 3 3 1 0 0
R   4 4 4 4 4 5 4 3 3 2 2 0 0
C   3 3 4 3 3 3 3 4 3 3 1 0 0
K   3 3 3 3 3 3 3 3 3 2 1 0 0
C   2 2 3 2 2 2 2 3 2 3 1 0 0
R   2 1 1 1 1 2 1 1 1 1 2 0 0
H   1 2 1 1 1 1 1 1 1 1 1 0 0
P   0 0 0 0 0 0 0 0 0 0 0 1 0

36 Possible Moves in Finding a Path across the Matrix
- Start in the leftmost or topmost row
- Move to the highest scoring cell in the row or column to the right and below
[Figure: cell i,j with cells i-1,j-1; i-n,j-1; and i-1,j-m]

37 [Figure: the accumulated matrix for Sequence A against Sequence B, as on slide 35]

38 Resulting alignment
A H C N I - R Q C L C R - P M
A I C - I N R - C K C R H P M
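The procedure on slides 32 to 38 can be sketched as follows. This is an illustrative reading of the slides, not a reference implementation: identical residues score 1, there is no gap penalty, and ties between equally good cells are broken in favour of the first candidate examined (an assumption). The function names are hypothetical, and Sequence B is used without its final M so that it matches the 12 residues scored in the matrix.

```python
def best_next(acc, i, j):
    """Best cell reachable from (i, j): row i+1 right of j, or column j+1 below i."""
    n, m = len(acc), len(acc[0])
    if i + 1 >= n or j + 1 >= m:
        return 0, None, None
    candidates = [(acc[i + 1][k], i + 1, k) for k in range(j + 1, m)]
    candidates += [(acc[k][j + 1], k, j + 1) for k in range(i + 2, n)]
    return max(candidates, key=lambda c: c[0])  # first maximum wins ties


def accumulate(seq_a, seq_b):
    """Fill the matrix from the bottom right; rows follow seq_b, columns seq_a."""
    n, m = len(seq_b), len(seq_a)
    acc = [[0] * m for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            match = 1 if seq_b[i] == seq_a[j] else 0
            acc[i][j] = match + best_next(acc, i, j)[0]
    return acc


def trace(seq_a, seq_b, acc):
    """Follow the highest scoring path from the top row or leftmost column."""
    n, m = len(seq_b), len(seq_a)
    starts = [(acc[0][j], 0, j) for j in range(m)]
    starts += [(acc[i][0], i, 0) for i in range(1, n)]
    _, i, j = max(starts, key=lambda c: c[0])
    path = [(i, j)]
    while True:
        _, i, j = best_next(acc, i, j)
        if i is None:
            break
        path.append((i, j))
    row_a, row_b, prev_i, prev_j = "", "", -1, -1
    for i, j in path + [(n, m)]:  # (n, m) flushes any trailing residues
        skipped_a, skipped_b = seq_a[prev_j + 1:j], seq_b[prev_i + 1:i]
        row_a += skipped_a + "-" * len(skipped_b)
        row_b += "-" * len(skipped_a) + skipped_b
        if i < n:
            row_a += seq_a[j]
            row_b += seq_b[i]
        prev_i, prev_j = i, j
    return row_a, row_b


SEQ_A, SEQ_B = "AHCNIRQCLCRPM", "AICINRCKCRHP"
acc = accumulate(SEQ_A, SEQ_B)
row_a, row_b = trace(SEQ_A, SEQ_B, acc)
print(acc[0][0])
print(row_a)
print(row_b)
```

The value in the top left cell equals the number of identical pairs on the best path, so the accumulated matrix gives the score before any traceback is done.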

