Bioinformatics Methods for Inheriting Structural and Functional Annotations for Gene Sequences

Presentation transcript:

Bioinformatics Methods for Inheriting Structural and Functional Annotations for Gene Sequences
• If a related sequence has a known function, can you inherit its functional properties?
• If a related sequence has a known structure, can you model the unknown structure using the known one?
• Structural information can often provide additional clues as to the function.
• What are the best methods to use?
• What thresholds should be used for safe inheritance of functional properties?

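Homologues are related sequences:
• Paralogs: genes related by duplication (gene a duplicates to give a and b)
• Orthologs: genes related by speciation (the same gene in species 1 and species 2)
[Figure: gene a duplicates to give a and b; after speciation, species 1 and species 2 each carry a and b]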

Protein Sequence and Structure Databases
• GenBank sequence database in the United States has over 120 million sequences, some partial; more than a million are non-identical
• DNA Data Bank of Japan (DDBJ)
• UniProt (SWISS-PROT) database has more than a million non-identical sequences (validated gene sequences)
• Protein Data Bank (PDB in the United States, PDBe in the UK) has more than 70,000 entries

Web Based Public Resources containing Functional Annotations
• Protein Family and Function databases: Pfam, InterPro, PROSITE, PRINTS, PANTHER, SMART, SCOP, CATH, HOMSTRAD
• Databases of biochemical pathways and biological databases: KEGG, WIT, GO, FunCat, EC
• Databases of Protein-Ligand Interactions: IntAct, MIPS, RELIBASE, BIND, DIP, IrefIndex
• Species Databases: ENSEMBL, FlyBase, YPD, WORMDb, GenProtEC, EcoCyc

Evolution of Protein Sequences
• Substitutions due to single base mutations
• Insertions or deletions (indels) of residues, usually not in the secondary structures but in the connecting loops
• Indels can make it harder to compare sequences: you have to line up the equivalent regions and put gaps where there are indels

Evolution of Protein Sequences
[Figure: Sequence A compared with Sequence B]

Human Hemoglobin: Alpha and Beta Chains (aligned without gaps)

alpha  VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
beta   VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQ

alpha  KTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPN
beta   RFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

alpha  ALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEF
beta   DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

alpha  TPAVHASLDKFLASVSTVLTSKYR
beta   FGKEFTPPVQAAYQKVVAGVANALAHKYH

Human Hemoglobin: Alpha and Beta Chains (aligned with gaps, shown as spaces)

alpha  VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
beta   VHLTPEEKSAVTALWGKV NVDEVGGEALGRLLVVYPWT

alpha  KTYFPHF DLSH GSAQVKGHGKKVADALTNAVAHV
beta   QRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

alpha  DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH
beta   DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

alpha  LPAEFTPAVHASLDKFLASVSTVLTSKYR
beta   FGKEFTPPVQAAYQKVVAGVANALAHKYH

Percentage Sequence Identity
Percentage Sequence Identity = (number of identical residues / number of residues in smallest protein) x 100
For the globin example: without gaps ~9%, with gaps ~41%
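The identity formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `percent_identity` and the use of `-` as the gap character are assumptions made for the example, and the two sequences must already be aligned (padded to equal length).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage identity of two pre-aligned sequences.

    Identical, non-gap positions are counted and divided by the number of
    residues in the shorter protein (gaps excluded), as in the slide formula.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    shortest = min(
        len(aligned_a.replace(gap, "")), len(aligned_b.replace(gap, ""))
    )
    if shortest == 0:
        raise ValueError("sequences contain no residues")
    return 100.0 * identical / shortest


if __name__ == "__main__":
    # Small made-up example with one gap in the second sequence.
    print(percent_identity("MHHNALQRRTVWVNAY", "MHH-ALQRRTVWVNAY"))
```

Dividing by the shorter protein means an indel in the longer sequence does not lower the identity, which is why inserting gaps can raise the reported value so sharply for the globins.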

Searching for Homologues with Related Functions
• How do you handle the evolutionary changes?
• How similar do the sequences need to be to inherit structural and functional properties?
• How do you cope with the volume of data, i.e. millions of sequences to search?

Searching Sequence Databases
Can you inherit functional information?
• Do fast scans using approximate methods, e.g. BLAST or PSI-BLAST
• Align proteins carefully using a dynamic programming method: Needleman & Wunsch, Smith & Waterman
• Scan against sequence profiles (or HMMs) in secondary databases, e.g. Pfam, InterPro, Gene3D
• Align query sequence against family relatives using: ClustalW, Jalview, MUSCLE, MAFFT

Dot Plots, Path Matrices, Score Matrices
[Figure: dot plot of Sequence A (VILSTRIVHVNSILPSTN) against Sequence B (VILSTRIVILPEFST); diagonal lines give equivalent residues]

[Figure: score matrix of Sequence A (VILSTRIVHVNSILPSTN) against Sequence B (VILSTRIVILPEFST)]
• identical residues score 1
• highest scoring path across the matrix gives best alignment
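A dot plot with identity scoring can be sketched as follows. The helper names (`dot_matrix`, `longest_diagonal`, `render`) are invented for this illustration, and finding the longest unbroken diagonal is only a simple stand-in for picking the highest scoring path; it does not handle gaps.

```python
def dot_matrix(seq_a, seq_b):
    """Rows follow seq_b, columns follow seq_a; 1 marks identical residues."""
    return [[1 if a == b else 0 for a in seq_a] for b in seq_b]


def longest_diagonal(matrix):
    """Return (length, row_start, col_start) of the longest diagonal run of 1s."""
    best = (0, 0, 0)
    prev = [0] * (len(matrix[0]) if matrix else 0)
    for i, row in enumerate(matrix):
        cur = [0] * len(row)
        for j, cell in enumerate(row):
            if cell:
                # Extend the run that ended one step up and to the left.
                cur[j] = (prev[j - 1] if j > 0 else 0) + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j] + 1, j - cur[j] + 1)
        prev = cur
    return best


def render(seq_a, seq_b):
    """Plain-text dot plot: '*' for identical residues, '.' otherwise."""
    lines = ["  " + seq_a]
    for b, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(b + " " + "".join("*" if c else "." for c in row))
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "VILSTRIVHVNSILPSTN", "VILSTRIVILPEFST"
    print(render(a, b))
    print(longest_diagonal(dot_matrix(a, b)))
```

Printing the rendered plot makes the diagonal runs visible by eye, which is the point of the figure; the shorter off-diagonal runs correspond to segments that would need a gap to be joined.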

[Figure: dot plot of Sequence A (VILSLVILPQRSLVVILSLVILALTV) against Sequence B (STVILSLVRNVILPQRILSLVISLAL), scored on runs (tuples) of 3 residues]
SCORE = = 11 3 gap penalty = 3 = 3

Alignment from Dot Plot
Sequence A: VILSLV ILPQRSLVVILSLVI LALTV
Sequence B: STVILSLVNVILPQR ILSLVISLAL
score = 20
sequence identity = 20/26 = 77%

Dynamic Programming Methods
• Global alignment: Needleman & Wunsch
• Local alignment: Smith & Waterman


Significance of sequence similarity: length dependence
• protein pairs having > 150 residues are homologous if the sequence identity is > 25%
• short proteins/fragments of residues - 30% sequence identity frequently occurs by chance
[Figure: sequence identity (%) plotted against length, with the region of homologous pairs marked]

If proteins are homologous they are likely to have similar structures and functions.
Sequence identity between homologues required for inheriting structure or function:
• Modelling a structure based on the structure of a homologue: >= 30%
• Inheriting functional properties from a homologue: >= 60%
The structures of proteins in a family tend to be much more highly conserved during evolution than the sequences (and, in some families, the function).

Residue Substitution Matrices
A substitution matrix is a 20 x 20 matrix which scores each possible comparison of residues.
• Identity Matrix: simplest scoring scheme; amino acids are either identical (score 1) or non-identical (score 0)
• Physicochemical Properties Matrix: score residue pairs according to similarities in their physico-chemical properties, e.g. val->leu scores well, val->arg scores low
• Evolutionary Matrices: score residue pairs according to how frequently the mutation is observed to occur in evolution, e.g. Dayhoff (PAM), BLOSUM matrices
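The difference between the first two scoring schemes can be shown with a small sketch. The residue groupings in `PROPERTY_GROUPS` and the scores 2/1/0 are toy assumptions chosen only to illustrate the idea; they are not a published matrix, and evolutionary matrices such as PAM or BLOSUM are not reproduced here.

```python
# Toy residue classes (an assumption for illustration, not a published scheme).
PROPERTY_GROUPS = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYGP"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def identity_score(x: str, y: str) -> int:
    """Identity matrix: 1 for identical residues, 0 otherwise."""
    return 1 if x == y else 0


def property_score(x: str, y: str) -> int:
    """Toy physicochemical scheme: 2 identical, 1 same class, 0 otherwise."""
    if x == y:
        return 2
    for members in PROPERTY_GROUPS.values():
        if x in members and y in members:
            return 1
    return 0


def score_ungapped(seq_a: str, seq_b: str, scorer) -> int:
    """Sum a per-residue score over two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(scorer(x, y) for x, y in zip(seq_a, seq_b))


if __name__ == "__main__":
    print(property_score("V", "L"), property_score("V", "R"))
    print(score_ungapped("VLSPAD", "VHLTPE", identity_score))
    print(score_ungapped("VLSPAD", "VHLTPE", property_score))
```

Under the identity scheme a conservative change such as val->leu is indistinguishable from val->arg, which is the limitation the property-based and evolutionary matrices address.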

Dayhoff Matrix (PAM or MDM)
• based on evolutionary relationships, it is derived by analysing the substitutions observed in closely related sequences (>80% identity)
• the method measures evolutionary distance by determining the number of point accepted mutations, where 1 PAM = a single point mutation every 100 residues
• for distant relatives in the twilight zone (<25% identity), generally use a 250 PAM matrix
• for database searches, generally use 120 PAMs

BLOSUM Substitution Matrices
• matrix is derived from analysing substitution patterns in more distant relatives (i.e. < 85% identity)
• for clusters of related sequences (e.g. 60% ID, 80% ID) derive multiple alignments without gaps, for short regions of related sequences
• use the alignments to calculate residue substitution frequencies
Henikoff & Henikoff (1993)

Which Matrix Should be Used?
• Matrices derived from observed substitution data (e.g. DAYHOFF, BLOSUM) are better than the identity matrix or those based on physical properties
• Various studies suggest that PAM250 gives the best result when aligning distant proteins using dynamic programming algorithms
• In database searching it may be better to use PAM120 or BLOSUM62

BLAST: Basic Local Alignment Search Tool
Altschul et al (1990)
• A highest scoring segment pair (HSP) is found between two sequences
• the sequences may be related if HSP score > cutoff
• matches significant ‘words’ or segments and then extends these matches using local dynamic programming

BLAST Step 1: match significant words query sequence of length L For each sequence find the ‘words’ with significant scores

BLAST Step 2: compare the word list to the database and identify exact matches

BLAST Step 3: for each word match, extend the alignment using a PAM matrix and dynamic programming
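The three steps can be sketched as a toy seed-and-extend search. This is a simplified illustration under stated assumptions, not the BLAST implementation: it indexes exact query words of length 3 instead of selecting words with significant scores, it scores extension with +1/-1 instead of a PAM matrix, and it extends without gaps. All function names and the `dropoff` parameter are invented for the example.

```python
def build_word_index(query, k=3):
    """Step 1 (simplified): index every k-residue word of the query."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    return index


def find_seeds(index, subject, k=3):
    """Step 2: exact word matches, as (query_pos, subject_pos) pairs."""
    seeds = []
    for s in range(len(subject) - k + 1):
        for q in index.get(subject[s:s + k], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query, subject, q, s, k=3, match=1, mismatch=-1, dropoff=2):
    """Step 3 (simplified): ungapped extension along the seed's diagonal."""
    offset = s - q
    best = k * match
    start, end = q, q + k  # half-open range in query coordinates
    score, j = best, end
    while j < len(query) and j + offset < len(subject):  # extend right
        score += match if query[j] == subject[j + offset] else mismatch
        j += 1
        if score > best:
            best, end = score, j
        elif best - score > dropoff:
            break
    score, j = best, start - 1
    while j >= 0 and j + offset >= 0:  # extend left
        score += match if query[j] == subject[j + offset] else mismatch
        if score > best:
            best, start = score, j
        elif best - score > dropoff:
            break
        j -= 1
    return best, start, end, offset


def best_hsp(query, subject, k=3):
    """Return (score, query_segment, subject_segment) or None if no seed."""
    best = None
    for q, s in find_seeds(build_word_index(query, k), subject, k):
        hit = extend_seed(query, subject, q, s, k)
        if best is None or hit[0] > best[0]:
            best = hit
    if best is None:
        return None
    score, start, end, offset = best
    return score, query[start:end], subject[start + offset:end + offset]


if __name__ == "__main__":
    print(best_hsp("VILSTRIVHVNSILPSTN", "VILSTRIVILPEFST"))
```

The speed of this style of search comes from step 2: only diagonals that contain a word hit are ever extended, so most of the comparison matrix is never examined.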

BLAST
• searches for 2 non-overlapping segments on the same diagonal
• the segments must be within a certain distance of each other before extension is invoked
• can also allow gaps so that the method joins segments on different diagonals

Assessing the Significance of Sequence Match
• length: can get artificially high scores between small sequences
• composition: if sequences are rich in particular amino acid residues, can get high scores for unrelated proteins
• to assess the significance of a match it is necessary to compare the score with that returned by random or unrelated sequences
• if the database is small or when considering a pair-wise comparison, the sequences can be shuffled to generate random sequences

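Assessing the Significance of Scores Returned from a Database Scan
Z score = (S - m) / s.d.
where S = score of the probe match, m = mean score for unrelated sequences, s.d. = standard deviation of the scores for unrelated sequences
Z value > 3 s.d.: related sequences
[Figure: frequency distribution of scores, with the probe score S lying above the mean m of the unrelated sequences]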
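For a pair-wise comparison, the Z score can be estimated by shuffling one sequence, as a rough sketch. The assumptions here are loud ones: the score is a simple count of identical positions in an ungapped comparison (a real analysis would re-align each shuffled sequence with the same alignment method), and the function names, the number of shuffles, and the fixed seed are all choices made for the example.

```python
import random
import statistics


def identity_count(seq_a, seq_b):
    """Score used in this sketch: identical positions, no gaps."""
    return sum(1 for x, y in zip(seq_a, seq_b) if x == y)


def shuffle_z_score(seq_a, seq_b, n_shuffles=200, seed=0):
    """Z = (S - m) / s.d., with m and s.d. taken from shuffled copies of seq_b."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = identity_count(seq_a, seq_b)
    residues = list(seq_b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps composition and length, destroys order
        null_scores.append(identity_count(seq_a, residues))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero spread")
    return (observed - mean) / sd


if __name__ == "__main__":
    a = "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT"
    b = "VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQ"
    print(shuffle_z_score(a, b))
```

Shuffling preserves both length and composition, so the resulting background distribution accounts for the two sources of artificially high scores listed above.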

BLAST results
BLAST best hit
>gi| |ref|XP_ | (XM_061555) similar to orphan G protein-coupled receptor GPR26 [Homo sapiens]
Length = 337
Score = 298 bits (762), Expect = 8e-80
Identities = 168/327 (51%)

Query: 1 MGPGEALLAGLLVMVLAVALLSNALVLLCCAYSAELRTRASGVLLVNLSLGHLLLAALDM 60
M A LAGLLV + V+LLSNALVLLC +SA++R +A + +NL+ G+LL ++M
Sbjct: 1 MNSWNAGLAGLLVGTIGVSLLSNALVLLCLLHSADIRRQAPALFTLNLTCGNLLCTVVNM 60

Query: 61 PFTLLGVMRGRTPSAPGACQVIGFLDTFLASNAALSVAALSADQWLAVGFPLRYAGRLRP 120
P TL GV+ R P+ C++ FLDTFLA+N+ LS+AALS D+W+AV FPL Y ++R
Sbjct: 61 PLTLAGVVAQRQPAGDRLCRLAAFLDTFLAANSMLSMAALSIDRWVAVVFPLSYRAKMRL 120

Query: 121 RYAGLLLGCAWGQSLAFSGAALGCSWLGYSSAFASCSLRLPPEPERPRFAAFTATLHAVG 180
R A L++ W +L F AAL SWLG+ +ASC+L ER RFA FT HA+
Sbjct: 121 RDAALMVAYTWLHALTFPAAALALSWLGFHQLYASCTLCSRRPDERLRFAVFTGAFHALS 180

S: score for the pairwise alignment.
E value: number of hits you would expect by chance with score S or higher, given the size of the database and the length of the alignment.
Good Match < 1 X Possible Match 1 X to 1 X X to 1 X 10-2
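The dependence of the E value on score and search size can be illustrated with the standard relation between a bit score S' and the expected number of chance hits, E = m x n x 2^(-S'), where m is the query length and n the database length. The sketch below is an approximation under stated assumptions: real BLAST uses edge-corrected effective lengths, and the lengths in the usage example are hypothetical values, not taken from the search shown above.

```python
def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """E = m * n * 2**(-S'), using raw (not effective) lengths."""
    return query_length * database_length * 2.0 ** (-bit_score)


def implied_search_space(e_value: float, bit_score: float) -> float:
    """Invert the relation: the m * n product implied by a reported E and S'."""
    return e_value * 2.0 ** bit_score


if __name__ == "__main__":
    # Hypothetical lengths: a 300-residue query against 10**8 database residues.
    for bits in (20, 40, 60, 298):
        print(bits, expected_hits(bits, 300, 10**8))
    # Search space implied by the reported hit (298 bits, Expect = 8e-80).
    print(implied_search_space(8e-80, 298))
```

Each additional bit of score halves the expected number of chance hits, while doubling the database doubles it, which is why E values from searches against different databases are not directly comparable.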

Needleman & Wunsch
[Figure: comparison matrix for Sequence A and Sequence B; axis residues as extracted: HCNIRQCLCRPMA and A I C I N R C K C R H P]

Needleman & Wunsch Algorithm
• Accumulate the matrix by adding to each cell the highest score in the column or row to the right and below it
• Find the highest scoring path in the matrix by:
  - starting in the top left corner
  - moving down across the matrix from cell to cell
  - choosing the highest scoring cell at each move
• The path cannot go back on itself or cross the same row or column twice

Accumulating the Matrix
Add to the score in the cell the highest score from a cell in the row or column to the right and below.
[Figure: cell i,j with candidate cells i-1,j-1; i-n,j-1; i-1,j-m]

[Figure: accumulated matrix for Sequence A and Sequence B; axis residues as extracted: HCNIRQCLCRPMA and A I C I N R C K C R H P]

Possible Moves in Finding a Path across the Matrix
• start in the leftmost or topmost row
• move to the highest scoring cell in the row or column to the right and below
[Figure: cell i,j with candidate cells i-1,j-1; i-n,j-1; i-1,j-m]

[Figure: highest scoring path across the matrix for Sequence A and Sequence B; axis residues as extracted: HCNIRQCLCRPMA and A I C I N R C K C R H P]

A H C N I - R Q C L C R - P M
A I C - I N R - C K C R H P M
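Global alignment by dynamic programming can be sketched as below. Note the swapped-in formulation: this is the common textbook recurrence that fills the matrix from the top left with a linear gap penalty and then traces back from the bottom right, rather than the accumulate-from-the-end scheme described on the slides. The scores (match +1, mismatch -1, gap -1) are assumptions for the example, not values from the presentation.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # The two sequences read from the final alignment above, gaps removed.
    s, x, y = needleman_wunsch("AHCNIRQCLCRPM", "AICINRCKCRHPM")
    print(s)
    print(x)
    print(y)
```

With these simple scores the resulting alignment may differ from the one on the slide, since the placement of gaps depends on the substitution scores and gap penalty chosen.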