Bioinformatics Methods for Inheriting Structural and Functional Annotations for Gene Sequences

If a related sequence has a known function, can you inherit its functional properties?
If a related sequence has a known structure, can you model the unknown structure using the known one?
Structural information can often provide additional clues as to the function.
What are the best methods to use?
What thresholds should be used for safe inheritance of functional properties?

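Homologues are related sequences:
Paralogs arise by gene duplication within a species (gene a gives rise to genes a and b).
Orthologs arise by speciation (the corresponding genes a and b in species 1 and species 2).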

Protein Sequence and Structure Databases
The GenBank sequence database in the States has over 120 million sequences, some partial.
More than a million non-identical sequences.
DNA Data Bank of Japan (DDBJ).
The UniProt (SWISS-PROT) database has more than a million non-identical sequences, which are validated gene sequences.
The Protein Data Bank (PDB in the States, ePDB in the UK) has more than 70,000 entries.

Web Based Public Resources containing Functional Annotations
Protein family and function databases: Pfam, InterPro, PROSITE, PRINTS, PANTHER, SMART, SCOP, CATH, HOMSTRAD
Databases of biochemical pathways and biological databases: KEGG, WIT, GO, FunCat, EC
Databases of protein-ligand interactions: IntAct, MIPS, RELIBASE, BIND, DIP, IrefIndex
Species databases: ENSEMBL, FlyBase, YPD, WORMDb, GenProtEC, EcoCyc

Evolution of Protein Sequences
Substitutions arise from single base mutations.
Insertions or deletions (indels) of residues usually occur in the connecting loops, not in the secondary structures.
Indels can make it harder to compare sequences: you have to line up the equivalent regions and put gaps where there are indels.

Evolution of Protein Sequences
[Figure: Sequence A compared with Sequence B]

Human Hemoglobin: Alpha and Beta Chains (unaligned)
  alpha  VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
  beta   VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQ

  alpha  KTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPN
  beta   RFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

  alpha  ALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEF
  beta   DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

  alpha  TPAVHASLDKFLASVSTVLTSKYR
  beta   FGKEFTPPVQAAYQKVVAGVANALAHKYH

Human Hemoglobin: Alpha and Beta Chains (aligned; spaces mark the gaps shown on the slide, whose exact widths were lost in extraction)
  alpha  VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
  beta   VHLTPEEKSAVTALWGKV NVDEVGGEALGRLLVVYPWT

  alpha  KTYFPHF DLSH GSAQVKGHGKKVADALTNAVAHV
  beta   QRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

  alpha  DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH
  beta   DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

  alpha  LPAEFTPAVHASLDKFLASVSTVLTSKYR
  beta   FGKEFTPPVQAAYQKVVAGVANALAHKYH

Percentage Sequence Identity
Percentage sequence identity = (number of identical residues x 100) / (number of residues in the smallest protein)
For the globin example: without gaps ~9%, with gaps ~41%.
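
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the globin figures: the function name `percent_identity` and the use of `-` as the gap character are assumptions, and the two inputs are expected to be already aligned rows of equal length.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Identical residues x 100 / residues in the smaller protein."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    # Count columns where both rows carry the same residue (gaps never count).
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    # Protein lengths are the row lengths with gap characters removed.
    length_a = sum(1 for x in aligned_a if x != gap)
    length_b = sum(1 for y in aligned_b if y != gap)
    return 100.0 * identical / min(length_a, length_b)


# Toy example: 3 identical columns, smaller protein has 4 residues.
print(percent_identity("AC-DE", "ACQDF"))
```

An ungapped comparison of two sequences of different length can be run through the same function by padding the shorter one with gap characters at the end.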

Searching for Homologues with Related Functions
How do you handle the evolutionary changes?
How similar do the sequences need to be to inherit structural and functional properties?
How do you cope with the volume of data, i.e. millions of sequences to search?

Searching Sequence Databases
Can you inherit functional information?
Do fast scans using approximate methods, e.g. BLAST or PSI-BLAST.
Align proteins carefully using a dynamic programming method: Needleman & Wunsch, Smith & Waterman.
Scan against sequence profiles (or HMMs) in secondary databases, e.g. Pfam, InterPro, Gene3D.
Align the query sequence against family relatives using: ClustalW, Jalview, MUSCLE, MAFFT.

Dot Plots, Path Matrices, Score Matrices
[Figure: dot plot of Sequence A against Sequence B; the two sequences are VILSTRIVHVNSILPSTN and VILSTRIVILPEFST]
Diagonal lines give equivalent residues.

[Figure: score matrix for VILSTRIVHVNSILPSTN against VILSTRIVILPEFST]
Identical residues score 1.
The highest scoring path across the matrix gives the best alignment.
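
A dot plot of this kind is easy to sketch in Python. The block below is illustrative only: `dot_matrix` and `diagonal_runs` are hypothetical helper names, identical residues score 1 as on the slide, and the `min_len` cutoff for reporting a diagonal is an assumption.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list[list[int]]:
    """Score 1 where residues are identical, 0 otherwise (rows = seq_a)."""
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


def diagonal_runs(matrix: list[list[int]], min_len: int = 3) -> list[tuple[int, int, int]]:
    """Return (row, column, length) for each diagonal run of identities."""
    runs = []
    n_rows, n_cols = len(matrix), len(matrix[0])
    for i in range(n_rows):
        for j in range(n_cols):
            if not matrix[i][j]:
                continue
            # Only start counting at the top-left end of a diagonal.
            if i > 0 and j > 0 and matrix[i - 1][j - 1]:
                continue
            length = 0
            while i + length < n_rows and j + length < n_cols and matrix[i + length][j + length]:
                length += 1
            if length >= min_len:
                runs.append((i, j, length))
    return runs


matrix = dot_matrix("VILSTRIVHVNSILPSTN", "VILSTRIVILPEFST")
for row, col, length in diagonal_runs(matrix):
    print(f"diagonal starting at row {row}, column {col}, length {length}")
```

Each reported diagonal is a stretch of equivalent residues; joining diagonals that sit on different offsets is what introduces gaps into the alignment.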

[Figure: dot plot of VILSLVILPQRSLVVILSLVILALTV against STVILSLVRNVILPQRILSLVISLAL using runs (tuples) of 3 residues; numeric labels on the plot: 6 6 5 6 3 3 3 6]
Gap penalty = 3.
SCORE = 20 - 9 = 11

Alignment from Dot Plot
  VILSLV ILPQRSLVVILSLVI LALTV
  STVILSLVNVILPQR ILSLVISLAL
score = 20
sequence identity = 20/26 = 77%

Dynamic Programming Methods
Global alignment: Needleman & Wunsch
Local alignment: Smith & Waterman
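
To make the global/local distinction concrete, here is a minimal sketch of a Smith & Waterman style local alignment. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the linear gap penalty are assumptions chosen for illustration; a real search would normally use a substitution matrix such as PAM or BLOSUM.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # The floor at 0 is what makes the alignment local rather than global.
            H[i][j] = max(0, H[i - 1][j - 1] + pair, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("XXILPYY", "QQQILPWW"))
```

A global method would be forced to pair the unrelated flanking residues as well; the local method reports only the best-scoring shared segment.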


Significance of sequence similarity: length dependence
[Figure: sequence identity (%) plotted against length, with the region of homologous pairs marked]
Protein pairs having > 150 residues are homologous if the sequence identity is > 25%.
For short proteins or fragments of 20-40 residues, 30% sequence identity frequently occurs by chance.

If proteins are homologous they are likely to have similar structures and functions.
Sequence identity between homologues required for inheriting structure or function:
Modelling a structure based on the structure of a homologue: >= 30%
Inheriting functional properties from a homologue: >= 60%
The structures of proteins in a family tend to be much more highly conserved during evolution than the sequences (and, in some families, the function).
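
The rules of thumb on this slide and the previous one can be written down as a small helper. This is only a sketch of how the stated thresholds combine: the function name `inheritance_advice` and its return keys are hypothetical, and treating the structure and function thresholds as applying only once the length rule has established homology is an assumption.

```python
def inheritance_advice(percent_identity: float, aligned_length: int) -> dict:
    """Apply the slide thresholds to one pairwise comparison."""
    # Pairs longer than 150 residues with more than 25% identity count as homologous.
    homologous = aligned_length > 150 and percent_identity > 25
    return {
        "homologous_by_length_rule": homologous,
        # Structure can be modelled from a homologue at >= 30% identity.
        "model_structure": homologous and percent_identity >= 30,
        # Function is inherited only at >= 60% identity.
        "inherit_function": homologous and percent_identity >= 60,
    }


# A 51% identical match over 327 residues, and a short 30% identical fragment.
print(inheritance_advice(51, 327))
print(inheritance_advice(30, 30))
```

The second call illustrates the length dependence: 30% identity over a 30-residue fragment does not pass the homology rule, so nothing is inherited.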

Residue Substitution Matrices
A substitution matrix is a 20 x 20 matrix which scores each possible comparison of residues.
Identity matrix: the simplest scoring scheme; amino acids are either identical (score 1) or non-identical (score 0).
Physicochemical properties matrix: scores residue pairs according to similarities in their physico-chemical properties, e.g. val->leu scores well, val->arg scores low.
Evolutionary matrices: score residue pairs according to how frequently the mutation is observed to occur in evolution, e.g. Dayhoff (PAM) and BLOSUM matrices.

Dayhoff Matrix (PAM or MDM)
Based on evolutionary relationships, it is derived by analysing the substitutions observed in closely related sequences (>80% identity).
The method measures evolutionary distance by determining the number of point accepted mutations, where 1 PAM = one accepted point mutation per 100 residues.
For distant relatives in the twilight zone (<25% identity), generally use a PAM250 matrix.
For database searches, generally use PAM120.

BLOSUM Substitution Matrices
Henikoff & Henikoff (1993)
The matrix is derived from analysing substitution patterns in more distant relatives (i.e. < 85% identity).
For clusters of related sequences (e.g. 60% ID, 80% ID), derive multiple alignments without gaps, for short regions of related sequences.
Use the alignments to calculate residue substitution frequencies.
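
The last step, turning an ungapped block into substitution scores, can be sketched as a log-odds calculation. This is a simplified illustration rather than the published procedure: the function name `log_odds_scores` is hypothetical, the clustering of sequences by percentage identity is left out, and scores are reported unrounded in half-bit units.

```python
from collections import Counter
from itertools import combinations
from math import log2


def log_odds_scores(block: list[str]) -> dict:
    """Log-odds scores (half bits) from one ungapped alignment block."""
    # Count every residue pair seen within each column of the block.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}
    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (x, y), freq in observed.items():
        if x == y:
            background[x] += freq
        else:
            background[x] += freq / 2
            background[y] += freq / 2
    scores = {}
    for (x, y), freq in observed.items():
        # Expected frequency if residues paired at random.
        expected = background[x] ** 2 if x == y else 2 * background[x] * background[y]
        scores[(x, y)] = 2 * log2(freq / expected)
    return scores


for pair, score in sorted(log_odds_scores(["AV", "AV", "AV", "AL"]).items()):
    print(pair, round(score, 2))
```

Positive scores mark pairs seen more often than chance would predict; pairs never observed in the block (here A with V or L) get no entry at all in this sketch.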

Which Matrix Should be Used?
Matrices derived from observed substitution data (e.g. Dayhoff, BLOSUM) are better than the identity matrix or those based on physical properties.
Various studies suggest that PAM250 gives the best result when aligning distant proteins using dynamic programming algorithms.
In database searching it may be better to use PAM120 or BLOSUM62.

BLAST: Basic Local Alignment Search Tool
Altschul et al (1990)
A highest scoring segment pair (HSP) is found between two sequences.
The sequences may be related if the HSP score > cutoff.
The method matches significant 'words' or segments and then extends these matches using local dynamic programming.

BLAST Step 1: match significant words
[Figure: query sequence of length L]
For each sequence, find the 'words' with significant scores.

BLAST Step 2: compare the word list to the database and identify exact matches

BLAST Step 3: for each word match, extend the alignment using a PAM matrix and dynamic programming
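
The three steps can be illustrated with a toy seed-and-extend search. This sketch simplifies each step and should not be read as BLAST itself: it seeds on exact 3-letter words only (no scored word neighbourhood), scores with an assumed +1/-1 identity scheme instead of a PAM matrix, and extends without gaps, stopping once the score falls more than `x_drop` below the best seen. All function names are hypothetical.

```python
def word_index(seq: str, w: int = 3) -> dict:
    """Map every w-letter word in seq to the positions where it starts."""
    index = {}
    for pos in range(len(seq) - w + 1):
        index.setdefault(seq[pos:pos + w], []).append(pos)
    return index


def extend_one_way(query, subject, q, s, step, score, match, mismatch, x_drop):
    """Walk away from the seed in one direction; return (best score, residues kept)."""
    best, best_len, length = score, 0, 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score has dropped too far below the best; stop extending
        q += step
        s += step
    return best, best_len


def best_hsp(query, subject, w=3, match=1, mismatch=-1, x_drop=3):
    """Return (score, query start, subject start, length) of the best segment pair."""
    hits = []
    index = word_index(subject, w)              # step 2: look up exact word matches
    for q in range(len(query) - w + 1):         # step 1: words of the query
        for s in index.get(query[q:q + w], []):
            # step 3: extend the seed to the right, then to the left
            score, right = extend_one_way(query, subject, q + w, s + w, 1,
                                          w * match, match, mismatch, x_drop)
            score, left = extend_one_way(query, subject, q - 1, s - 1, -1,
                                         score, match, mismatch, x_drop)
            hits.append((score, q - left, s - left, w + left + right))
    return max(hits) if hits else None


print(best_hsp("VILSTRIVHVNSILPSTN", "VILSTRIVILPEFST"))
```

Indexing the words once is what makes the scan fast: only positions that share a word with the query are ever extended.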

BLAST
Searches for 2 non-overlapping segments on the same diagonal.
These must be within a certain distance of each other before extension is invoked.
Can also allow gaps so that the method joins segments on different diagonals.

Assessing the Significance of Sequence Match
Length: can get artificially high scores between small sequences.
Composition: if sequences are rich in particular amino acid residues, can get high scores for unrelated proteins.
To assess the significance of a match it is necessary to compare the score with that returned by random or unrelated sequences.
If the database is small, or when considering a pair-wise comparison, the sequences can be shuffled to generate random sequences.

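Assessing the Significance of Scores Returned from a Database Scan
[Figure: frequency distribution of scores, showing the mean (m) and standard deviation (s.d.) for unrelated sequences and the probe score S]
Z score = (S - m) / s.d., where S is the score, m is the mean score for unrelated sequences and s.d. is their standard deviation.
Z value > 3 s.d.: related sequences.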
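
For a pairwise comparison, the shuffling approach and the Z score can be sketched together. The block is illustrative only: `identity_score` is a deliberately crude stand-in for a real alignment score, and the number of shuffles and the fixed random seed are assumptions made so the example is repeatable.

```python
import random
from statistics import mean, stdev


def identity_score(a: str, b: str) -> int:
    """Stand-in score: identical residues at the same position, no gaps."""
    return sum(1 for x, y in zip(a, b) if x == y)


def z_score(query: str, subject: str, score_fn=identity_score,
            n_shuffles: int = 200, seed: int = 0) -> float:
    """Z = (S - m) / s.d., with m and s.d. taken from shuffled subjects."""
    rng = random.Random(seed)
    real = score_fn(query, subject)
    residues = list(subject)
    background = []
    for _ in range(n_shuffles):
        # Shuffling keeps length and composition but destroys the order.
        rng.shuffle(residues)
        background.append(score_fn(query, "".join(residues)))
    spread = stdev(background)
    if spread == 0:
        raise ValueError("shuffled scores do not vary; Z score is undefined")
    return (real - mean(background)) / spread


alpha = "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT"
print(z_score(alpha, alpha) > 3)
```

Because the shuffled sequences keep the same length and composition as the real one, both of the biases listed on the previous slide are built into the background distribution.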

BLAST results
BLAST best hit:
>gi|17472322|ref|XP_061555.1| (XM_061555) similar to orphan G protein-coupled receptor GPR26 [Homo sapiens]
Length = 337
Score = 298 bits (762), Expect = 8e-80
Identities = 168/327 (51%)

Query: 1   MGPGEALLAGLLVMVLAVALLSNALVLLCCAYSAELRTRASGVLLVNLSLGHLLLAALDM 60
           M A LAGLLV + V+LLSNALVLLC +SA++R +A + +NL+ G+LL ++M
Sbjct: 1   MNSWNAGLAGLLVGTIGVSLLSNALVLLCLLHSADIRRQAPALFTLNLTCGNLLCTVVNM 60

Query: 61  PFTLLGVMRGRTPSAPGACQVIGFLDTFLASNAALSVAALSADQWLAVGFPLRYAGRLRP 120
           P TL GV+ R P+ C++ FLDTFLA+N+ LS+AALS D+W+AV FPL Y ++R
Sbjct: 61  PLTLAGVVAQRQPAGDRLCRLAAFLDTFLAANSMLSMAALSIDRWVAVVFPLSYRAKMRL 120

Query: 121 RYAGLLLGCAWGQSLAFSGAALGCSWLGYSSAFASCSLRLPPEPERPRFAAFTATLHAVG 180
           R A L++ W +L F AAL SWLG+ +ASC+L ER RFA FT HA+
Sbjct: 121 RDAALMVAYTWLHALTFPAAALALSWLGFHQLYASCTLCSRRPDERLRFAVFTGAFHALS 180

S: score for the pairwise alignment.
E value: number of hits you would expect by chance with score S or higher, given the size of the database and the length of the alignment.
Good match: E < 1 x 10^-50
Possible match: E from 1 x 10^-50 to 1 x 10^-2

Needleman & Wunsch
Identity score matrix (1 = identical residues, 0 = non-identical), with AHCNIRQCLCRPM down the rows and AICINRCKCRHP across the columns:

     A I C I N R C K C R H P
  A  1 0 0 0 0 0 0 0 0 0 0 0
  H  0 0 0 0 0 0 0 0 0 0 1 0
  C  0 0 1 0 0 0 1 0 1 0 0 0
  N  0 0 0 0 1 0 0 0 0 0 0 0
  I  0 1 0 1 0 0 0 0 0 0 0 0
  R  0 0 0 0 0 1 0 0 0 1 0 0
  Q  0 0 0 0 0 0 0 0 0 0 0 0
  C  0 0 1 0 0 0 1 0 1 0 0 0
  L  0 0 0 0 0 0 0 0 0 0 0 0
  C  0 0 1 0 0 0 1 0 1 0 0 0
  R  0 0 0 0 0 1 0 0 0 1 0 0
  P  0 0 0 0 0 0 0 0 0 0 0 1
  M  0 0 0 0 0 0 0 0 0 0 0 0

Needleman & Wunsch Algorithm
Accumulate the matrix by adding to each cell the highest score in the column or row to the right and below it.
Find the highest scoring path in the matrix by:
starting in the top left corner;
moving down across the matrix from cell to cell;
choosing the highest scoring cell at each move.
The path cannot go back on itself or cross the same row or column twice.

Accumulating the Matrix
Add to the score in the cell the highest score from a cell in the row or column to the right and below.
[Figure: cell i,j and the cells labelled i-1,j-1; i-n,j-1; i-1,j-m]

Accumulated matrix, with AHCNIRQCLCRPM down the rows and AICINRCKCRHP across the columns:

     A I C I N R C K C R H P
  A  8 7 6 6 5 4 3 3 2 2 1 0
  H  7 7 6 6 5 4 3 3 2 1 2 0
  C  6 6 7 6 5 4 4 3 3 1 1 0
  N  6 6 6 5 6 4 3 3 2 1 1 0
  I  5 6 5 6 5 4 3 3 2 1 1 0
  R  4 4 4 4 5 5 3 3 2 2 1 0
  Q  4 4 4 4 4 4 3 3 2 1 1 0
  C  3 3 4 3 3 3 4 3 3 1 1 0
  L  3 3 3 3 3 3 3 3 2 1 1 0
  C  2 2 3 2 3 2 3 2 3 1 1 0
  R  1 1 1 1 1 2 1 1 1 2 1 0
  P  0 0 0 0 0 0 0 0 0 0 0 1
  M  0 0 0 0 0 0 0 0 0 0 0 0

Possible Moves in Finding a Path across the Matrix
Start in the leftmost column or topmost row.
Move to the highest scoring cell in the row or column to the right and below.
[Figure: cell i,j and the cells labelled i-1,j-1; i-n,j-1; i-1,j-m]


Resulting alignment:
  A H C N I - R Q C L C R - P M
  A I C - I N R - C K C R H P M
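
The accumulate-then-trace procedure described on these slides can be sketched as follows. This is a minimal reading of the slide's rule, not a reference implementation: it uses the identity score (1 or 0), applies no gap penalty, breaks ties in favour of the diagonal move, and assumes both sequences are non-empty. The function names are hypothetical.

```python
def successors(n_rows: int, n_cols: int, i: int, j: int) -> list[tuple[int, int]]:
    """Cells a path may move to from (i, j): the next row from column j+1
    onward, plus the next column from row i+2 onward."""
    cells = []
    if i + 1 < n_rows:
        cells += [(i + 1, k) for k in range(j + 1, n_cols)]
    if j + 1 < n_cols:
        cells += [(k, j + 1) for k in range(i + 2, n_rows)]
    return cells


def accumulate(seq_a: str, seq_b: str) -> list[list[int]]:
    """Fill the matrix from the bottom right: own score plus best successor."""
    n, m = len(seq_a), len(seq_b)
    H = [[0] * m for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best = max((H[r][c] for r, c in successors(n, m, i, j)), default=0)
            H[i][j] = (1 if seq_a[i] == seq_b[j] else 0) + best
    return H


def align(seq_a: str, seq_b: str, gap: str = "-") -> tuple[str, str]:
    """Trace the highest scoring path and return the two gapped rows."""
    n, m = len(seq_a), len(seq_b)
    H = accumulate(seq_a, seq_b)
    # Start at the best cell in the topmost row or leftmost column.
    starts = [(0, j) for j in range(m)] + [(i, 0) for i in range(1, n)]
    i, j = max(starts, key=lambda cell: H[cell[0]][cell[1]])
    out_a = [seq_a[:i] + gap * j]
    out_b = [gap * i + seq_b[:j]]
    while True:
        out_a.append(seq_a[i])
        out_b.append(seq_b[j])
        options = successors(n, m, i, j)
        if not options:
            break
        ni, nj = max(options, key=lambda cell: H[cell[0]][cell[1]])
        # Residues skipped in one sequence face gaps in the other.
        out_a.append(seq_a[i + 1:ni] + gap * (nj - j - 1))
        out_b.append(gap * (ni - i - 1) + seq_b[j + 1:nj])
        i, j = ni, nj
    # Whatever is left over at the end of either sequence faces gaps.
    out_a.append(seq_a[i + 1:] + gap * (m - j - 1))
    out_b.append(gap * (n - i - 1) + seq_b[j + 1:])
    return "".join(out_a), "".join(out_b)


print(align("ACD", "AD"))
```

Because the matrix is filled from the bottom right, every cell already holds the best score obtainable from that point onward, so the trace only ever has to pick the largest neighbouring value.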
