Dot Plots, Path Matrices, Score Matrices


[Figure: dot plot of Sequence A against Sequence B (axis sequences VILSTRIVHVNSILPSTN and VILSTRIVILPEFST); diagonal lines give equivalent residues]

[Figure: dot plot of Sequence A against Sequence B (axis sequences VILSTRIVHVNSILPSTN and VILSTRIVILPEFST)]
- identical residues score 1
- the highest scoring path across the matrix gives the best alignment
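The identity scoring on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: the function names `dot_plot` and `render` are made up for the example, and which of the two axis sequences is Sequence A is an assumption.

```python
def dot_plot(seq_a, seq_b):
    """Return a matrix (rows: seq_b, columns: seq_a) holding 1 for identical residues, else 0."""
    return [[1 if a == b else 0 for a in seq_a] for b in seq_b]


def render(matrix, seq_a, seq_b):
    """Draw the matrix as text: '*' marks an identical pair, '.' anything else."""
    lines = ["  " + seq_a]
    for residue, row in zip(seq_b, matrix):
        lines.append(residue + " " + "".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Axis sequences taken from the slide; the A/B assignment is assumed.
    sequence_a = "VILSTRIVHVNSILPSTN"
    sequence_b = "VILSTRIVILPEFST"
    print(render(dot_plot(sequence_a, sequence_b), sequence_a, sequence_b))
```

Runs of `*` along a diagonal correspond to the diagonal lines of equivalent residues described above.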

[Figure: dot plot of Sequence A against Sequence B (axis sequences VILSLVILPQRSLVVILSLVILALTV and STVILSLVRNVILPQRILSLVISLAL)]
- runs (tuples) of 3 residues
- SCORE = ... = 11
- gap penalty = 3
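Filtering a dot plot down to runs (tuples) of identical residues might look like the sketch below. The function name `tuple_dot_plot` and the parameter `k` are illustrative, and exact matching of k consecutive residues is an assumed reading of "runs of 3 residues".

```python
def tuple_dot_plot(seq_a, seq_b, k=3):
    """Mark only cells that belong to a diagonal run of k identical residues.

    Rows index seq_b, columns index seq_a. Isolated single matches are dropped,
    which removes most of the background noise from the plot.
    """
    rows, cols = len(seq_b), len(seq_a)
    plot = [[0] * cols for _ in range(rows)]
    for i in range(rows - k + 1):
        for j in range(cols - k + 1):
            if seq_b[i:i + k] == seq_a[j:j + k]:
                # Mark every cell of the matching tuple along the diagonal.
                for d in range(k):
                    plot[i + d][j + d] = 1
    return plot


if __name__ == "__main__":
    # Axis sequences taken from the slide; the A/B assignment is assumed.
    a = "VILSLVILPQRSLVVILSLVILALTV"
    b = "STVILSLVRNVILPQRILSLVISLAL"
    for residue, row in zip(b, tuple_dot_plot(a, b)):
        print(residue, "".join("*" if cell else "." for cell in row))
```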

Alignment from Dot Plot
VILSLV ILPQRSLVVILSLVI LALTV
STVILSLVNVILPQR ILSLVISLAL
- score = 20
- sequence identity = 20/26 ≈ 77%
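The identity figure is a ratio of identical positions to compared positions. A small sketch of that arithmetic follows; dividing by the full alignment length (gaps included) is an assumption, since other conventions exist, and `percent_identity` is a name chosen for the example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    # Assumption: the denominator is the alignment length, gaps included.
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # 20 identical columns out of 26, as on the slide.
    print(round(percent_identity("A" * 20 + "C" * 6, "A" * 20 + "D" * 6), 1))
```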

Path or Score Matrix
[Figure: residue substitution matrix with residues A, L, V, K, R, H ... along both axes]

Needleman & Wunsch
[Figure: matrix of HCNIRQCLCRPMA against AICINRCKCRHP]

Needleman & Wunsch Algorithm
- accumulate the matrix by adding to each cell the highest score in the column or row to the right and below it
- find the highest scoring path in the matrix by:
  - starting in the top left corner
  - moving down across the matrix from cell to cell
  - choosing the highest scoring cell at each move
- the path cannot go back on itself or cross the same row or column twice

Accumulating the Matrix
- add to the score in the cell the highest score from a cell in the row or column to the right and below
[Figure: cell i,j with cells i-1,j-1; i-n,j-1; i-1,j-m]

[Figure: matrix of Sequence A (HCNIRQCLCRPMA) against Sequence B (AICINRCKCRHP)]

Possible Moves in Finding a Path across the Matrix
- start in the leftmost column or topmost row
- move to the highest scoring cell in the row or column to the right and below
[Figure: cell i,j with cells i-1,j-1; i-n,j-1; i-1,j-m]
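The accumulation and path-finding steps can be sketched as below. This is a simplified illustration under stated assumptions, not a reference implementation: identity scoring (match 1, mismatch 0), no gap penalty, rows for Sequence A and columns for Sequence B, and ties broken by taking the first best cell. The function names are invented for the example.

```python
def accumulate(seq_a, seq_b, match=1, mismatch=0):
    """Build the accumulated matrix, working from the bottom right corner."""
    n, m = len(seq_a), len(seq_b)
    acc = [[match if a == b else mismatch for b in seq_b] for a in seq_a]
    for i in range(n - 2, -1, -1):
        for j in range(m - 2, -1, -1):
            # Highest score in the next row (to the right) or the next column (below).
            best = max(max(acc[i + 1][j + 1:]),
                       max(acc[k][j + 1] for k in range(i + 1, n)))
            acc[i][j] += best
    return acc


def trace(acc):
    """Follow the highest scoring cells from the top row or left column downwards."""
    n, m = len(acc), len(acc[0])
    starts = [(0, j) for j in range(m)] + [(i, 0) for i in range(1, n)]
    i, j = max(starts, key=lambda c: acc[c[0]][c[1]])
    path = [(i, j)]
    while i + 1 < n and j + 1 < m:
        moves = [(i + 1, c) for c in range(j + 1, m)]
        moves += [(r, j + 1) for r in range(i + 2, n)]
        i, j = max(moves, key=lambda c: acc[c[0]][c[1]])
        path.append((i, j))
    return path


def path_to_alignment(path, seq_a, seq_b):
    """Turn the path into two gapped strings; skipped residues pair with '-'."""
    top, bottom = [], []
    prev_i, prev_j = -1, -1
    # The sentinel cell past the end flushes any trailing residues.
    for i, j in path + [(len(seq_a), len(seq_b))]:
        for r in seq_a[prev_i + 1:i]:
            top.append(r)
            bottom.append("-")
        for r in seq_b[prev_j + 1:j]:
            top.append("-")
            bottom.append(r)
        if i < len(seq_a) and j < len(seq_b):
            top.append(seq_a[i])
            bottom.append(seq_b[j])
        prev_i, prev_j = i, j
    return "".join(top), "".join(bottom)


if __name__ == "__main__":
    a, b = "HCNIRQCLCRPMA", "AICINRCKCRHP"  # sequences from the slides
    for row in path_to_alignment(trace(accumulate(a, b)), a, b):
        print(row)
```

Because each move only goes to the next row or the next column, the path never revisits a row or column, which is the constraint stated in the algorithm slide. Modern formulations usually fill the matrix from the top left and add explicit gap penalties; that variant is not shown here.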

[Figure: matrix of Sequence A (HCNIRQCLCRPMA) against Sequence B (AICINRCKCRHP)]

A H C N I - R Q C L C R - P M
A I C - I N R - C K C R H P M

Searching Sequence Databases
Can you inherit functional information?
- do fast scans using approximate methods, e.g. BLAST or PSI-BLAST
- align proteins carefully using a dynamic programming method: Needleman & Wunsch, Smith & Waterman
- scan against sequence profiles (or HMMs) in secondary databases, e.g. Pfam, Gene3D, InterPro
- align query sequence against family relatives using: ClustalW, Jalview, MUSCLE, MAFFT

Profile Based Sequence Search Methods
- comparing related sequences within a protein family can identify patterns of conserved residues
- even the most distant members of the family should have these patterns of conserved residues
- a profile can be made which encapsulates these patterns and used to detect more distantly related sequences
- highly conserved positions usually correspond to the buried core or to functional residues within the active site

Iterated Application of BLAST: PSI-BLAST
- first constructs a multiple alignment of all the related sequences identified by BLAST
- then estimates the residue frequencies at each position to construct a score matrix
- Position Specific Score Matrices (PSSM) are also known as weight matrices or profiles
Altschul et al. (1997)

PSI-BLAST
- query sequence is searched against the UniProt database
- aligns matched sequences and builds profile
- further iterations pull out more distant sequence relatives
Altschul et al. (1997)

Use the Multiple Alignment to Calculate Residue Frequencies (PSI-BLAST)
- the residue frequencies at each position are used to calculate the scores for aligning a query sequence against the pattern
- three times more powerful than BLAST
[Figure: multiple alignment of the query, its relatives and a putative relative over positions P1 ... P5, P6 ... Pn]
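Turning residue frequencies into position specific scores can be sketched as below. This is a toy version under loud assumptions: a uniform background frequency, a flat pseudocount, and no sequence weighting, none of which reflects how PSI-BLAST itself derives its matrices. The names `build_pssm` and `score` are invented for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency
    pssm = []
    for column in zip(*alignment):
        # Gap characters and unknown symbols are ignored when counting.
        counts = Counter(r for r in column if r in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, query):
    """Sum the position specific scores of an ungapped query of matching length."""
    return sum(column[residue] for column, residue in zip(pssm, query))


if __name__ == "__main__":
    relatives = ["VILS", "VILT", "VIMS"]  # hypothetical aligned relatives
    profile = build_pssm(relatives)
    print(score(profile, "VILS"), score(profile, "AAAA"))
```

A residue that is conserved in a column gets a positive score at that position only, which is what distinguishes a PSSM from a single residue substitution matrix applied everywhere.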

[Figure: path matrix or score matrix filled using a position specific substitution matrix; axis labels A I C I N R C K C R H P and ... H R V L A]

Multiple Alignment
- direct extensions of the standard DP approach for the alignment of 2 sequences are computationally impossible for more than 3 sequences
- practical heuristic solutions are based on the idea that sequences are evolutionarily related and can be aligned using an underlying phylogenetic tree
- this is known as progressive alignment

(1) Pairwise Alignment
- 4 sequences A, B, C, D
- 6 pairwise comparisons, then cluster analysis
(2) Multiple Alignment following the tree from (1)
- align most similar pair
- align next most similar pair
- align alignments, preserving gaps
[Figure: guide tree for A, B, C, D; gaps to optimise alignment; new gap to optimise alignment of BD with AC]
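The first stage (all pairwise comparisons, then cluster analysis) can be sketched as below. The distance used here is a placeholder, a mismatch fraction over unaligned positions, standing in for a real pairwise DP alignment score; average-linkage clustering and the example sequences are also assumptions made for illustration.

```python
from itertools import combinations


def mismatch_fraction(a, b):
    """Placeholder distance: fraction of differing positions (no alignment done)."""
    n = min(len(a), len(b))
    return sum(1 for x, y in zip(a, b) if x != y) / n


def guide_order(seqs):
    """Return the pairwise distances and the order in which clusters are merged."""
    names = list(seqs)
    dist = {frozenset(pair): mismatch_fraction(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(names, 2)}

    def average(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(dist[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2))
    return dist, merges


if __name__ == "__main__":
    # Hypothetical sequences chosen so that A pairs with C and B with D.
    seqs = {"A": "VILSTRIV", "B": "MKLPEFST", "C": "VILSTRLV", "D": "MKIPEYST"}
    dist, merges = guide_order(seqs)
    print(len(dist), "pairwise comparisons")
    for left, right in merges:
        print("align", "".join(left), "with", "".join(right))
```

For n sequences there are n(n-1)/2 pairwise comparisons, which gives the 6 comparisons for 4 sequences quoted on the slide.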

Multiple Alignment
- start by aligning the most closely related pairs using DP and gradually align these groups together, keeping the gaps that appear in earlier alignments fixed
- alternatively, sequences can be added one at a time to a growing multiple alignment
- the heuristic approach is not guaranteed to find the optimum alignment, but it is soundly based biologically

ClustalW
- the choice of parameters can have a significant effect on the alignment for very distant sequences; ClustalW addresses this problem by:
  - position specific gap opening and extension penalties
  - using different amino acid substitution matrices: one for close relatives, one for distant
Higgins, 1997
More recent resources: MAFFT, MUSCLE, JALVIEW

ClustalW: Gap penalties
- where structure is known, one would want to increase the gap penalty within helices and strands and decrease it between them, forcing gaps to occur more frequently in loops
- if no structure is known, simple rules can be used which depend on the residues occurring and the frequencies of gaps
- e.g. use lower gap penalties where gaps already occur
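Position specific gap opening penalties of this kind can be sketched as below. The base penalty and every scaling factor are made-up illustrative numbers, not ClustalW's actual parameters, and the secondary structure string (H for helix, E for strand, anything else for loop) is an assumed input format.

```python
def gap_open_penalties(alignment, base=10.0, secondary_structure=None):
    """Return one gap opening penalty per alignment column.

    Columns that already contain gaps get a lower penalty. If a secondary
    structure string is given, the penalty is raised in helices and strands
    and lowered in loops.
    """
    penalties = []
    for position, column in enumerate(zip(*alignment)):
        gap_fraction = column.count("-") / len(column)
        # Hypothetical rule: up to 50% reduction when every sequence has a gap here.
        penalty = base * (1.0 - 0.5 * gap_fraction)
        if secondary_structure is not None:
            if secondary_structure[position] in "HE":
                penalty *= 2.0  # hypothetical factor inside helices and strands
            else:
                penalty *= 0.5  # hypothetical factor in loops
        penalties.append(penalty)
    return penalties


if __name__ == "__main__":
    alignment = ["VIL-S", "VILTS", "VI--S"]  # hypothetical existing alignment
    print(gap_open_penalties(alignment))
    print(gap_open_penalties(alignment, secondary_structure="HHCCH"))
```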

Searching Protein Family Databases
- secondary databases (as opposed to primary sequence databases) group proteins into related families
- families are usually represented by a sequence profile or sequence model (Hidden Markov Model, HMM) derived from a multiple sequence alignment of the relatives

HMMs for Protein Domain Family Recognition
Pfam, SUPERFAMILY, Gene3D: Hidden Markov Models (HMMs)
- sequence is aligned using a probabilistic model of interconnecting match, delete or insert states
- contains statistical information on observed and expected positional variation: “fingerprint of a protein family”
[Figure: profile HMM with states B, E, Mi, Di, Ii]

- Pfam-A: 10,340 curated families with annotation
- Pfam-B: 224,303 families derived from ADDA (50% clearly related to a Pfam-A)
- UniProt coverage: 74% of sequences, 51% of residues
- PDB coverage: 94% of sequences, 76% of residues
[Figure: coverage chart with segments Pfam-A, Pfam-B, Other]

Pfam: Profile-HMM
- SEED alignment of representative members (manually curated)
- Profile-HMM built with HMMer-2.0
- search UniProt
- FULL alignment (automatically made)

Pfam classification
[Figure: classification levels Protein, Family, Clan, shown alongside protein fold, etc.]