Dot Plots, Path Matrices, Score Matrices


Dot plot of Sequence A (VILSTRIVHVNSILPSTN) against Sequence B (VILSTRIVILPEFST): diagonal lines give equivalent residues.
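A dot plot can be sketched in a few lines. The following is a minimal illustration, assuming simple identity matching (a dot wherever the two residues are the same); the function names `dot_plot` and `render` are hypothetical, and the two sequences are the ones read from the slide axes.

```python
def dot_plot(seq_a, seq_b):
    """Return a matrix with 1 where residues are identical, else 0.

    Rows follow seq_b and columns follow seq_a.
    """
    return [[1 if a == b else 0 for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the dot plot as text, one row per residue of seq_b."""
    matrix = dot_plot(seq_a, seq_b)
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, matrix):
        cells = " ".join("x" if cell else "." for cell in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    # Runs of "x" along a diagonal correspond to equivalent residues.
    print(render("VILSTRIVHVNSILPSTN", "VILSTRIVILPEFST"))
```

Printing the rendered plot makes the diagonal runs visible; a real tool would usually filter the plot with a window or a substitution matrix to reduce noise.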

Score matrix of Sequence A against Sequence B: identical residues score 1. The highest scoring path across the matrix gives the best alignment.

Alignment from Dot Plot: VILSLV ILPQRSLVVILSLVI LALTV STVILSLVNVILPQR ILSLVISLAL. Score = 20; sequence identity = 20/26 ≈ 77%.

Residue substitution matrix: a table over the residues A, L, V, K, R, H, … in which identical residues score 1 and different residues score 0. These scores are used to fill the path or score matrix for the two sequences.

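Needleman & Wunsch: path matrix for Sequence A (columns) against Sequence B (rows). Identical residues score 1; all other cells score 0.

    A H C N I R Q C L C R P M
A   1 0 0 0 0 0 0 0 0 0 0 0 0
I   0 0 0 0 1 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
I   0 0 0 0 1 0 0 0 0 0 0 0 0
N   0 0 0 1 0 0 0 0 0 0 0 0 0
R   0 0 0 0 0 1 0 0 0 0 1 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
K   0 0 0 0 0 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
R   0 0 0 0 0 1 0 0 0 0 1 0 0
H   0 1 0 0 0 0 0 0 0 0 0 0 0
P   0 0 0 0 0 0 0 0 0 0 0 1 0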

Needleman & Wunsch Algorithm. Accumulate the matrix by starting in the bottom row and moving from right to left, a row at a time, adding the score from the highest scoring cell in the 'column or row, to the right and below' to the current cell. Then find the highest scoring path in the matrix by starting in the top left corner and moving down across the matrix from cell to cell, choosing the highest scoring cell at each move. The path cannot go back on itself or cross the same row or column twice.

Accumulating the Matrix: add to the score in the cell the highest score from a cell in 'the column or row to the right and below'. [Figure: cell i,j with the contributing cells i-1,j-1, i-n,j-1 and i-1,j-m.]
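The accumulation rule can be sketched as follows. This is a minimal illustration, assuming identity scoring (1 for identical residues, 0 otherwise) and no gap penalty; the function names are hypothetical, and indices here count from the top left, so "to the right and below" means larger row and column indices.

```python
def identity_matrix(seq_a, seq_b):
    """Score 1 for identical residues, 0 otherwise (rows: seq_b, columns: seq_a)."""
    return [[1 if a == b else 0 for a in seq_a] for b in seq_b]


def accumulate(seq_a, seq_b):
    """Accumulate the matrix from the bottom row upwards, moving right to left."""
    rows, cols = len(seq_b), len(seq_a)
    m = identity_matrix(seq_a, seq_b)
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            best = 0
            if i + 1 < rows and j + 1 < cols:
                # Cells in the next row, to the right of column j.
                best = max(m[i + 1][j + 1:])
                # Cells in the next column, below row i.
                best = max(best, max(m[k][j + 1] for k in range(i + 1, rows)))
            m[i][j] += best
    return m


if __name__ == "__main__":
    for row in accumulate("AHCNIRQCLCRPM", "AICINRCKCRHP"):
        print(" ".join(str(v) for v in row))
```

The value in the top left cell is the number of matches on the best path through the whole matrix.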

Accumulated matrix for Sequence A (columns) against Sequence B (rows):

    A H C N I R Q C L C R P M
A   8 7 6 6 5 4 4 3 3 2 1 0 0
I   7 7 6 6 6 4 4 3 3 2 1 0 0
C   6 6 7 6 5 4 4 4 3 3 1 0 0
I   6 6 6 5 6 4 4 3 3 2 1 0 0
N   5 5 5 6 5 4 4 3 3 2 1 0 0
R   4 4 4 4 4 5 4 3 3 2 2 0 0
C   3 3 4 3 3 3 3 4 3 3 1 0 0
K   3 3 3 3 3 3 3 3 3 2 1 0 0
C   2 2 3 2 2 2 2 3 2 3 1 0 0
R   2 1 1 1 1 2 1 1 1 1 2 0 0
H   1 2 1 1 1 1 1 1 1 1 1 0 0
P   0 0 0 0 0 0 0 0 0 0 0 1 0

Possible Moves in Finding a Path across the Matrix: start in the leftmost column or topmost row, then move to the highest scoring cell in 'the column or row to the right and below'. [Figure: cell i,j with the candidate cells i-1,j-1, i-n,j-1 and i-1,j-m.]
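The path-finding moves can be sketched on top of an accumulated matrix. This is a minimal illustration under the same assumptions as the slides (identity scoring, no gap penalty); `accumulate` and `trace_path` are hypothetical names, and ties are broken by taking the first candidate, which is a choice made here rather than part of the method.

```python
def accumulate(seq_a, seq_b):
    """Identity scores accumulated from the bottom row upwards, right to left."""
    rows, cols = len(seq_b), len(seq_a)
    m = [[1 if a == b else 0 for a in seq_a] for b in seq_b]
    for i in range(rows - 2, -1, -1):
        for j in range(cols - 2, -1, -1):
            right = max(m[i + 1][j + 1:])
            below = max(m[k][j + 1] for k in range(i + 1, rows))
            m[i][j] += max(right, below)
    return m


def trace_path(m):
    """Follow the highest scoring cells from the top/left edge down across the matrix."""
    rows, cols = len(m), len(m[0])
    # Start in the highest scoring cell of the topmost row or leftmost column.
    starts = [(0, j) for j in range(cols)] + [(i, 0) for i in range(rows)]
    i, j = max(starts, key=lambda c: m[c[0]][c[1]])
    path = [(i, j)]
    while i + 1 < rows and j + 1 < cols:
        # Candidates: the next row to the right, or the next column below.
        candidates = [(i + 1, k) for k in range(j + 1, cols)]
        candidates += [(k, j + 1) for k in range(i + 1, rows)]
        i, j = max(candidates, key=lambda c: m[c[0]][c[1]])
        path.append((i, j))
    return path


if __name__ == "__main__":
    seq_a, seq_b = "AHCNIRQCLCRPM", "AICINRCKCRHP"
    for i, j in trace_path(accumulate(seq_a, seq_b)):
        print(seq_b[i], seq_a[j])
```

Because every move increases both the row and the column index, the path cannot go back on itself or visit a row or column twice; skipped rows or columns correspond to gaps.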

Resulting alignment (the accumulated scores along the path run 8 7 6 5 4 3 2 1):

Sequence A: A H C N I - R Q C L C R - P M
Sequence B: A I C - I N R - C K C R H P -

Dot plot of Sequence A against Sequence B using runs (tuples) of 3 residues (values shown on the plot: 3 6 6 3 5 3 6 6). Gap penalty = 3. SCORE = 20 - 9 = 11.

BLAST: Basic Local Alignment Search Tool, Altschul et al. (1990). A highest scoring segment pair (HSP) is found between two sequences; the sequences may be related if the HSP score > cutoff. BLAST matches significant 'words' or segments and then extends these matches using local dynamic programming.

BLAST Step 1: match significant words. For each query sequence of length L, find the 'words' with significant scores.

BLAST Step 2: compare the word list to the database and identify exact matches
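Steps 1 and 2 can be sketched with a simple word index. This is a simplified illustration, not BLAST's actual implementation: it assumes a word length of 3 and indexes only the query's own words, whereas BLAST also scores similar words against a threshold. The names `words` and `find_word_hits` are hypothetical.

```python
from collections import defaultdict


def words(seq, w=3):
    """Return (position, word) for every overlapping word of length w."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def find_word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact word match."""
    index = defaultdict(list)
    for pos, word in words(query, w):
        index[word].append(pos)
    hits = []
    for s_pos, word in words(subject, w):
        for q_pos in index.get(word, ()):
            hits.append((q_pos, s_pos))
    return hits


if __name__ == "__main__":
    print(find_word_hits("VILSTRIV", "AAVILPSTR"))
```

Indexing the query once and scanning each database sequence against the index is what makes the word-matching stage fast.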

BLAST Step 3: for each word match, extend the alignment using a PAM matrix and dynamic programming
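The extension step can be sketched as below. This is a deliberately simplified illustration: it extends a word match without gaps, uses +1/-1 identity scoring instead of a PAM matrix, and stops when the running score drops more than `x_drop` below the best seen. The function name and all parameter values are assumptions for the sketch.

```python
def extend_hit(query, subject, q_pos, s_pos, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit in both directions; return (score, q_start, s_start, length)."""

    def score(a, b):
        return match if a == b else mismatch

    def extend(q, s, step):
        best = total = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            total += score(query[q], subject[s])
            length += 1
            if total > best:
                best, best_len = total, length
            elif best - total > x_drop:
                break  # score has fallen too far below the best seen
            q += step
            s += step
        return best, best_len

    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    right, right_len = extend(q_pos + w, s_pos + w, 1)
    left, left_len = extend(q_pos - 1, s_pos - 1, -1)
    length = left_len + w + right_len
    return seed + left + right, q_pos - left_len, s_pos - left_len, length


if __name__ == "__main__":
    print(extend_hit("AAVILSTRKK", "GGVILSTRPP", 2, 2))
```

The returned start positions and length identify the extended segment pair; a substitution matrix would replace the `score` helper in a more realistic version.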

BLAST searches for 2 non-overlapping segments on the same diagonal; these must be within a certain distance of each other before extension is invoked. Gaps can also be allowed, so that the method joins segments on different diagonals.
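The two-segment condition can be sketched by grouping word hits by diagonal. This is a minimal illustration, assuming hits are given as (query_pos, subject_pos) pairs; the function name and the `window` value are hypothetical choices for the sketch.

```python
from collections import defaultdict


def two_hit_seeds(hits, w=3, window=40):
    """Return (diagonal, first_q_pos, second_q_pos) for qualifying pairs of hits.

    Two hits qualify when they lie on the same diagonal, do not overlap
    (at least w apart) and are within `window` residues of each other.
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for idx, first in enumerate(positions):
            for second in positions[idx + 1:]:
                distance = second - first
                if distance > window:
                    break
                if distance >= w:
                    seeds.append((diagonal, first, second))
                    break  # keep only the nearest partner for this hit
    return seeds


if __name__ == "__main__":
    print(two_hit_seeds([(0, 2), (3, 5), (10, 30)]))
```

Only the pairs returned here would be passed on to the extension step, which reduces the number of extensions compared with extending every single word hit.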

Assessing the Significance of Sequence Match. Length: can get artificially high scores between small sequences. Composition: if sequences are rich in particular amino acid residues, can get high scores for unrelated proteins. To assess the significance of a match it is necessary to compare the score with that returned by random or unrelated sequences. If the database is small, or when considering a pair-wise comparison, the sequences can be shuffled to generate random sequences.

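Assessing the Significance of Scores Returned from a Database Scan. [Figure: frequency plotted against score, showing the mean (m) and standard deviation (s.d.) of the scores for unrelated sequences and the probe score S.] Z score = (score (S) - mean for unrelated (m)) / standard deviation (s.d.). A Z value > 3 s.d. suggests related sequences.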
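The Z score and the shuffling approach can be sketched together. This is a minimal illustration: the scoring function (best count of identical residues along any diagonal) is a stand-in chosen for brevity, and the function names, trial count and seed are assumptions. Shuffling keeps the length and composition of the sequence while destroying its order.

```python
import random
import statistics


def best_diagonal_score(seq_a, seq_b):
    """Highest number of identical residues along any diagonal of the dot plot."""
    best = 0
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        matches = sum(
            1
            for j, b in enumerate(seq_b)
            if 0 <= j + offset < len(seq_a) and seq_a[j + offset] == b
        )
        best = max(best, matches)
    return best


def shuffled_scores(seq_a, seq_b, trials=100, seed=0):
    """Score seq_a against shuffled copies of seq_b (same length and composition)."""
    rng = random.Random(seed)
    residues = list(seq_b)
    scores = []
    for _ in range(trials):
        rng.shuffle(residues)
        scores.append(best_diagonal_score(seq_a, "".join(residues)))
    return scores


def z_score(score, unrelated_scores):
    """Z = (S - m) / s.d.; raises ZeroDivisionError if all unrelated scores are equal."""
    mean = statistics.mean(unrelated_scores)
    sd = statistics.stdev(unrelated_scores)
    return (score - mean) / sd


if __name__ == "__main__":
    a, b = "VILSTRIVHVNSILPSTN", "VILSTRIVILPEFST"
    print(z_score(best_diagonal_score(a, b), shuffled_scores(a, b)))
```

A Z value above 3 would then be read, as on the slide, as an indication that the two sequences are related.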

BLAST results

BLAST best hit:
>gi|17472322|ref|XP_061555.1| (XM_061555) similar to orphan G protein-coupled receptor GPR26 [Homo sapiens]
Length = 337
Score = 298 bits (762), Expect = 8e-80
Identities = 168/327 (51%)

Query: 1   MGPGEALLAGLLVMVLAVALLSNALVLLCCAYSAELRTRASGVLLVNLSLGHLLLAALDM 60
           M    A LAGLLV  + V+LLSNALVLLC  +SA++R +A  +  +NL+ G+LL   ++M
Sbjct: 1   MNSWNAGLAGLLVGTIGVSLLSNALVLLCLLHSADIRRQAPALFTLNLTCGNLLCTVVNM 60

Query: 61  PFTLLGVMRGRTPSAPGACQVIGFLDTFLASNAALSVAALSADQWLAVGFPLRYAGRLRP 120
           P TL GV+  R P+    C++  FLDTFLA+N+ LS+AALS D+W+AV FPL Y  ++R
Sbjct: 61  PLTLAGVVAQRQPAGDRLCRLAAFLDTFLAANSMLSMAALSIDRWVAVVFPLSYRAKMRL 120

Query: 121 RYAGLLLGCAWGQSLAFSGAALGCSWLGYSSAFASCSLRLPPEPERPRFAAFTATLHAVG 180
           R A L++   W  +L F  AAL  SWLG+   +ASC+L      ER RFA FT   HA+
Sbjct: 121 RDAALMVAYTWLHALTFPAAALALSWLGFHQLYASCTLCSRRPDERLRFAVFTGAFHALS 180

S: score for the pairwise alignment.
E value: number of hits you would expect by chance with score S or higher, given the size of the database and the length of the alignment.
Good match: E value < 1 x 10^-50. Possible match: E value from 1 x 10^-50 to 1 x 10^-2.

Multiple Alignment. Direct extensions of the standard DP approach for the alignment of 2 sequences are computationally impractical for more than 3 sequences. Practical heuristic solutions are based on the idea that sequences are evolutionarily related and can be aligned using an underlying phylogenetic tree. This is known as progressive alignment.

(1) Pairwise Alignment: for 4 sequences A, B, C, D, make 6 pairwise comparisons, then use cluster analysis to build a tree. (2) Multiple Alignment, following the tree from (1): align the most similar pair (B and D), adding gaps to optimise the alignment; align the next most similar pair (A and C); then align the alignments, preserving the existing gaps and adding a new gap to optimise the alignment of BD with AC.
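The first stage (all pairwise comparisons, then ranking pairs by similarity) can be sketched as follows. This is a toy illustration: the four sequences are made up for the example, similarity is plain fractional identity over equal-length ungapped sequences, and the function names are hypothetical. A real progressive aligner would cluster the scores into a tree rather than just rank the pairs.

```python
from itertools import combinations


def identity(seq_a, seq_b):
    """Fraction of identical positions; assumes equal-length, ungapped sequences."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / min(len(seq_a), len(seq_b))


def pairwise_scores(sequences):
    """Score every pair of named sequences: n(n-1)/2 comparisons."""
    return {
        (a, b): identity(sequences[a], sequences[b])
        for a, b in combinations(sorted(sequences), 2)
    }


def merge_order(sequences):
    """Order pairs from most to least similar, as a crude guide for joining."""
    scores = pairwise_scores(sequences)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical toy sequences, chosen so that B/D and A/C pair up.
    toy = {"A": "VILSTRIV", "B": "VILPEFST", "C": "VILSTKLV", "D": "VILPEFSS"}
    print(merge_order(toy))
```

With 4 sequences this gives the 6 pairwise comparisons mentioned above, and the top-ranked pairs are the ones that would be aligned first.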

Multiple Alignment. Start by aligning the most closely related pairs using DP and gradually align these groups together, keeping the gaps that appear in earlier alignments fixed. Alternatively, sequences can be added one at a time to a growing multiple alignment. The heuristic approach is not guaranteed to find the optimum alignment, but it is soundly based, biologically.

ClustalW (Higgins, 1997). Since the choice of parameters used can have a significant effect on the alignment for very distant sequences, ClustalW addresses this problem by using position specific gap opening and extension penalties, and by using different amino acid substitution matrices: one for close relatives, one for distant. More recent resources: MAFFT, MUSCLE, Jalview.

ClustalW Gap penalties. Where structure is known, one would want to increase the gap penalty within helices and strands and decrease it between them, forcing gaps to occur more frequently in loops. If no structure is known, simple rules can be used which depend on the residues occurring and the frequencies of gaps, e.g. use lower gap penalties where gaps already occur.

Identifying sequence conserved residue positions. Starting from a multiple sequence alignment of relatives from a functional group, score conservation for each position in the alignment using an entropy measure: 1 = highly conserved, 0 = unconserved. [Figure: structural model showing a putative functional site.] Scorecons - Thornton.
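An entropy-based conservation score on the 0 to 1 scale can be sketched as below. This is one simple normalisation (1 minus the Shannon entropy divided by the maximum entropy for 20 residue types), offered as an assumption for illustration; it is not claimed to be the formula Scorecons uses, and the function names are hypothetical.

```python
import math
from collections import Counter


def conservation(column, alphabet_size=20):
    """Return 1.0 for a fully conserved column, approaching 0.0 for maximal variety.

    Gap characters, if present, are simply counted as another symbol here.
    """
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return 1.0 - entropy / math.log(alphabet_size)


def conservation_profile(alignment):
    """Score every column of an alignment given as equal-length strings."""
    return [conservation(column) for column in zip(*alignment)]


if __name__ == "__main__":
    print(conservation_profile(["VIL", "VIL", "VAL"]))
```

Positions scoring close to 1 across a family are the candidates one would then inspect on a structure for a putative functional site.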

Searching Sequence Databases. Do fast scans using approximate methods, e.g. BLAST or PSI-BLAST. Align proteins carefully using a dynamic programming method: Needleman & Wunsch, Smith & Waterman. Scan against sequence profiles (or HMMs) in secondary databases, e.g. Pfam, Gene3D, InterPro. Align the query sequence against family relatives using ClustalW, Jalview, MUSCLE, MAFFT. Can you inherit functional information?

Profile Based Sequence Search Methods. By comparing related sequences within a protein family, one can identify patterns of conserved residues. Even the most distant members of the family should have these patterns of conserved residues. One can make a profile which encapsulates these patterns and use it to detect more distantly related sequences. Highly conserved positions usually correspond to the buried core or to functional residues within the active site.

Iterated Application of BLAST: PSI-BLAST, Altschul et al. (1997). PSI-BLAST first constructs a multiple alignment of all the related sequences identified by BLAST, then estimates the residue frequencies at each position to construct a score matrix. These Position Specific Score Matrices (PSSMs) are also known as weight matrices or 1D profiles.
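Turning per-position residue frequencies into a score matrix can be sketched as below. This is a simplified illustration, assuming a flat pseudocount and a uniform background frequency of 1/20; PSI-BLAST's own sequence weighting and pseudocount scheme are more elaborate and are not reproduced here. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        # Gaps and unknown symbols are ignored when counting.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_against_pssm(pssm, segment):
    """Sum the position specific scores for an ungapped segment."""
    return sum(column[res] for column, res in zip(pssm, segment))


if __name__ == "__main__":
    pssm = build_pssm(["VIL", "VIL", "VAL"])
    print(score_against_pssm(pssm, "VIL"), score_against_pssm(pssm, "WWW"))
```

Residues seen often at a position get positive scores there and unseen residues get negative ones, which is what makes the matrix position specific.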

PSI-BLAST, Altschul et al. (1997). The query sequence is searched against the UniProt database; PSI-BLAST aligns the matched sequences and builds a profile. Further iterations pull out more distant sequence relatives.

PSI-BLAST: Use the Multiple Alignment to Calculate Residue Frequencies. The residue frequencies at each position are used to calculate the scores for aligning a query sequence against the pattern. [Figure: alignment of the query, its relatives and a putative relative over positions P1 … P5, P6 … Pn.] Three times more powerful than BLAST!
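Scoring a sequence against such a pattern can be sketched by sliding the profile along the sequence without gaps. This is a minimal illustration: the three-position profile and its scores are made up for the example, residues missing from a column get a hypothetical default penalty, and `best_window` is a hypothetical name.

```python
def best_window(profile, sequence, default=-1.0):
    """Return (start, score) of the best ungapped match of the profile.

    `profile` is a list of dicts mapping residues to position specific scores.
    Returns (None, None) if the sequence is shorter than the profile.
    """
    width = len(profile)
    best_start, best_score = None, None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(profile, window))
        if best_score is None or score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


if __name__ == "__main__":
    # Hypothetical profile: V, then I (or L), then L.
    toy_profile = [{"V": 2.0}, {"I": 2.0, "L": 1.0}, {"L": 2.0}]
    print(best_window(toy_profile, "AAVLLKK"))
```

In an iterated search, sequences whose best window scores above a chosen cutoff would be added to the alignment before the profile is rebuilt.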

Position specific substitution matrix. [Figure: a path matrix or score matrix for the query (rows A I C I N R C K C R H P) in which the cells hold position specific scores such as 10, 20, 70 and 90 rather than identity scores.]

Searching Protein Family Databases. Secondary databases (as opposed to primary sequence databases) group proteins into related families. Families are usually represented by a sequence profile or sequence model (Hidden Markov Model, HMM) derived from a multiple sequence alignment of the relatives.

HMMs for Protein Domain Family Recognition. Pfam, SUPERFAMILY, Gene3D: Hidden Markov Models (HMMs). The sequence is aligned using a probabilistic model of interconnecting match, delete or insert states. The model contains statistical information on observed and expected positional variation: a "fingerprint of a protein family". 5 times more powerful than BLAST. [Figure: model with states B (begin), E (end), Mi (match), Di (delete) and Ii (insert).]

Pfam-A: 10,340 curated families with annotation. Pfam-B: 224,303 families derived from ADDA (50% clearly related to a Pfam-A). UniProt coverage: 74% of sequences, 51% of residues. PDB coverage: 94% of sequences, 76% of residues.

Pfam: a SEED alignment of representative members (manually curated) is used to build a profile-HMM with HMMer-2.0. The profile-HMM is used to search UniProt, giving the FULL alignment (automatically made).

Pfam classification: Clan, Family, Protein (protein fold, etc.).

This summarises the way we classify peptidases in MEROPS. We use a hierarchical system with three main levels. The individual peptidases are first grouped into families by significant sequence relationships. For each family a type example is nominated (highlighted in yellow here). This is a well-characterised member, and ideally one for which there is a crystal structure. Every member of the family must be shown to be related to the type example. And then homologous families are grouped together in clans. The homologous families are recognised primarily by similar protein folds. (There is a good deal more to it than this: there are subfamilies and subclans, and transitive relationships, but this is the essence of it.)
