Protein Sequence Alignment and Database Searching.


What is a protein sequence alignment? The equivalencing of residues in two different proteins. Alignment implies that the aligned residues perform similar roles in the two proteins. It is important to think of proteins as three-dimensional objects, not just strings of letters.

Barton, G. J. et al, (1992), "Human Platelet Derived Endothelial Cell Growth Factor is Homologous to E.coli Thymidine Phosphorylase", Prot. Sci., 1,

Immunoglobulin Variable Domains

Protein Sequence Alignment - How? Need scoring scheme for matching amino acid residues. Need to cope with insertions and deletions (gaps or indels). Need algorithm to find ‘best’ alignment. Need some way of judging if the alignment is likely to be correct.

Protein Scoring Schemes A table of scores for aligning each possible amino acid pair. The simplest scheme just scores 1 for identity and 0 for non-identity. Better schemes weight similarities in amino acid properties or observed substitutions, for example the BLOSUM and PAM series.
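A minimal sketch of the two kinds of scheme mentioned above: the identity scheme, and a lookup table of pair scores. The function names are illustrative, and the table holds only a handful of entries as they appear in BLOSUM62; a real table covers every amino acid pair.

```python
def identity_score(a, b):
    """Simplest scheme: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0

# A few pair scores as found in BLOSUM62 (deliberately incomplete).
PARTIAL_BLOSUM62 = {
    ("A", "A"): 4,
    ("W", "W"): 11,
    ("I", "L"): 2,
    ("D", "E"): 2,
    ("D", "L"): -4,
}

def table_score(a, b, table=PARTIAL_BLOSUM62):
    """Look up a pair score; the table is symmetric, so try both orders."""
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]  # raises KeyError if the pair is not in the table

def score_ungapped(seq1, seq2, score):
    """Sum pair scores position by position (no gaps considered)."""
    return sum(score(a, b) for a, b in zip(seq1, seq2))
```

Scoring "AIW" against "ALW" gives 2 with the identity scheme, while the table scheme also rewards the conservative I/L substitution.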

[Figure: BLOSUM62 Matrix. Rows and columns in the order A R N D C Q E G H I L K M F P S T W Y V B Z X *]

Finding the ‘best’ alignment The mathematically best alignment is the one that gives the highest score when the amino acids of the two proteins are aligned. This alignment is not necessarily the one that is biologically meaningful.

Dot-Plot comparison of Human Annexin I with itself. Four repeats (domains?) are visible. Sequence Analysis of Annexin Domains. Program: DOTTER

Gap Penalties Score for aligning a residue or residues in one protein to a gap in the other. Most usual form: penalty = ul + v where l is the length of the gap and u and v are constants. u is often called the gap extension penalty, v, the gap creation penalty.
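The formula penalty = ul + v can be written directly as a small function. The default values of u and v below are illustrative assumptions only; the slide does not give numbers.

```python
def gap_penalty(length, u=1.0, v=10.0):
    """Length-dependent gap penalty: u * length + v.

    u is the gap extension penalty, v the gap creation penalty.
    The defaults are placeholders, not recommended settings.
    """
    if length <= 0:
        return 0.0  # no gap, no penalty
    return u * length + v
```

With these placeholder values a gap of length 1 costs 11 and a gap of length 5 costs 15, so one long gap is cheaper than several short ones.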

Dynamic Programming Trick to avoid having to generate all possible alignments. First introduced in molecular biology by Needleman and Wunsch (1970). Many variations on the theme. Basis of (nearly) all sequence alignment programs. Finds the mathematically ‘best’ score for alignment of two sequences of length M and N in MN steps.
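A minimal sketch of global alignment scoring by dynamic programming, in the spirit of Needleman and Wunsch. To keep it short it assumes a simple match/mismatch score and a fixed cost per gap position rather than the ul + v form; those parameter values are illustrative. The table has (M+1) x (N+1) cells and each is filled once, which is where the MN step count comes from.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of s and t (score only, no traceback)."""
    m, n = len(s), len(t)
    # F[i][j] = best score for aligning s[:i] with t[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap  # s[:i] aligned entirely to gaps
    for j in range(1, n + 1):
        F[0][j] = j * gap  # t[:j] aligned entirely to gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                F[i - 1][j] + gap,      # s[i-1] against a gap
                F[i][j - 1] + gap,      # t[j-1] against a gap
            )
    return F[m][n]
```

For example, aligning "VILS" with "VLS" needs one gap, so the score is three matches minus one gap cost.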

Is the alignment correct? A randomisation test (Monte-Carlo) can suggest whether the sequences are similar enough to align accurately. A Z-score from the randomisation test > 6 suggests the alignment will be correct over most of its length.

What is a randomisation test? Align sequences by dynamic programming and record score S. Shuffle order of amino acids in the sequences and re-align the pair. Record the score for this alignment, repeat 100 times. Calculate mean and Standard Deviation (sd) of shuffled sequence comparison scores. Z= (S-mean)/sd
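The steps above can be sketched as follows. The scoring function is a small stand-in (match/mismatch with a fixed cost per gap position, all values assumed), the sample standard deviation is used, and the seed parameter exists only to make runs repeatable.

```python
import random
import statistics

def align_score(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming, kept row by row."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def shuffled(seq, rng):
    """Return seq with its residues in random order (same composition)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

def z_score(s, t, n_shuffles=100, seed=0):
    """Z = (S - mean) / sd, with mean and sd taken over shuffled pairs."""
    rng = random.Random(seed)
    real = align_score(s, t)
    scores = [
        align_score(shuffled(s, rng), shuffled(t, rng))
        for _ in range(n_shuffles)
    ]
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    if sd == 0:
        raise ValueError("shuffled scores do not vary; Z is undefined")
    return (real - mean) / sd
```

Shuffling keeps length and composition fixed, so the Z-score measures how much of the real score depends on residue order rather than on composition alone.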

Mean x (e.g. 0.0); Value V (e.g. 4.3); Standard Deviation σ (e.g. 1.8). Z-score = (Value − Mean) / (Standard Deviation) = (V − x) / σ, e.g. (4.3 − 0.0) / 1.8 = 2.39

Why perform multiple alignment? Can help improve alignment accuracy between any pair of sequences. Prediction of functionally important residues. Sub-family analysis (not this lecture.) Prediction of secondary structure and buried residues (not this lecture.)

Single sequence N Q L E V F M D G E L A... physico-chemical properties of amino acids

Multiple sequences N Q L E V F M D G E L E A... N D E K V Y M E G D I Q V...

Multiple sequences N Q L E V F M D G E L E A... N D E K V Y M E G D I Q V... N S S Q V K I K G Q V D L... N N T N V A M R G K M N T... conserved positions with conserved hydrophobics
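A small sketch of how conserved positions can be read off the four aligned sequences above. The hydrophobic residue set is one common grouping and is an assumption here, as are the function names; column indices are 0-based.

```python
# Assumed grouping of hydrophobic residues; other groupings exist.
HYDROPHOBIC = set("AVILMFWYC")

# The four aligned sequences from the slide, written without spaces.
ALIGNMENT = [
    "NQLEVFMDGELEA",
    "NDEKVYMEGDIQV",
    "NSSQVKIKGQVDL",
    "NNTNVAMRGKMNT",
]

def identical_columns(alignment):
    """Indices of columns where every sequence has the same residue."""
    return [i for i, col in enumerate(zip(*alignment)) if len(set(col)) == 1]

def hydrophobic_columns(alignment, hydrophobic=HYDROPHOBIC):
    """Indices of columns where every residue is in the hydrophobic set."""
    return [
        i
        for i, col in enumerate(zip(*alignment))
        if all(residue in hydrophobic for residue in col)
    ]
```

On this alignment the identical columns are the N, V and G positions, and the hydrophobic columns include two positions (M/M/I/M and L/I/V/M) that are conserved in character but not in identity.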

Multiple sequences help fit a sequence on a structure (threading) N Q L E V F M D G E L E A... N D E K V Y M E G D I Q V... N S S Q V K I K G Q V D L... N N T N V A M R G K M N T...

Multiple sequences help alignment itself N V A H G K M... N T N V I R G K M N T E V F D G E L... D E K V Y E G N I Q V

Multiple sequences help alignment itself (also pattern matching) E F M D L E A... K Y M E I Q V... Q K I V D L Q... N V A H G K M... Q L E V A D G E L E A D V K V L Y G D I Q V S V Q V K K G Q V D L N T N V I R G K M N T E V F D G E L... K V Y E G D I... Q V K K G Q V... N V A R G K M... Q L E F M D E W L E A D E K V Y E G N I Q V S S Q K I K Q A V D L N T N A M R K F M N T

Multiple sequences help alignment itself (also pattern matching) E F M D K Y M E Q K I V N V A H Q L E V A D G E L E A D V K V L Y G D I Q V S V Q V K K G Q V D L N T N V I R G K M N T E V F D G E L... K V Y E G D I... Q V K K G Q V... N V A R G K M... L E F M D E W L E A E K V Y E G N I Q V S Q K I K Q A V D L T N A M R K F M N T

Multiple Sequence Alignment How? Alignment of more than 2 sequences. Can’t directly extend dynamic programming to more than 3 sequences due to memory and CPU limitations. Corner cutting can allow alignments up to around 10 sequences. Practical multiple alignment methods are HIERARCHICAL.

Hierarchical multiple alignment Compare all pairs of sequences Generate a guide tree or dendrogram Follow tree from leaves to root, building the alignment as you go. Most popular program is CLUSTAL. Others are AMPS, MULTAL and PileUp.
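The first two steps (compare all pairs, then build a guide tree) can be sketched as below. This is not the method of any of the programs named above: as stated assumptions, the pairwise comparison is a simple fraction of differing positions between equal-length sequences (standing in for pairwise alignment scores), and the tree is built by average-linkage clustering. The third step, aligning along the tree, is not shown.

```python
from itertools import combinations

def distance(a, b):
    """Fraction of differing positions; assumes equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def leaves(tree):
    """Sequence names under a tree node (a name or a nested 2-tuple)."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def guide_tree(seqs):
    """Build a guide tree as nested tuples from a dict of name -> sequence."""
    clusters = list(seqs)  # start with one cluster per sequence

    def cluster_distance(c1, c2):
        # Average distance over all cross-cluster sequence pairs.
        pairs = [(a, b) for a in leaves(c1) for b in leaves(c2)]
        return sum(distance(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the closest pair of clusters first.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append((c1, c2))
    return clusters[0]
```

The nested tuples give the order of alignment: the innermost pairs are aligned first, and the resulting sub-alignments are then merged on the way to the root.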

Protein Sequence Database Searching Take a single sequence and look for similar sequences in a large database. For a database of 2,300,000 sequences, this needs 2,300,000 sequence comparisons. Needs good statistics to evaluate the quality of a match. Needs a local alignment method.

A protein may have multiple domains and so only match in some regions. Local alignment methods (algorithms) overcome this problem. Smith & Waterman algorithm
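A minimal sketch of local alignment scoring in the style of Smith and Waterman. The only change from global dynamic programming is that a cell may never fall below zero, so an alignment can start and stop anywhere. The match, mismatch and gap values are illustrative assumptions, and only the score is returned.

```python
def local_alignment_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between any substrings of s and t."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cell = max(
                0,                    # start a fresh local alignment here
                prev[j - 1] + sub,    # extend along the diagonal
                prev[j] + gap,        # gap in t
                cur[j - 1] + gap,     # gap in s
            )
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

Two sequences that share only the segment "VILSTR" inside unrelated flanking residues score 12 here, the same as the shared segment alone, which is the behaviour wanted for multi-domain proteins.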

Ranking the results list Want proteins that are similar to rank above those that are not! No method does this perfectly.

Black bars - proteins related to query sequence. White bars - proteins that are unrelated to query. (a) - no separation (b) - partial separation (c) - full separation. (c) is the goal of searching, but this rarely happens...

Expectation Value For a sequence pair that scores S in a database search, the E-value is the number of sequences that one would expect to see with a score at least as high as S in the database. E values are usually estimated from the Extreme Value Distribution (EVD)
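The slide does not give a formula. One widely used form, from Karlin-Altschul statistics for local alignment scores, is E = K * m * n * exp(-lambda * S), where m is the query length and n the total database length. The sketch below assumes that form; the default K and lambda are placeholders, not fitted parameters for any scoring matrix.

```python
import math

def expect_value(score, query_len, db_len, K=0.1, lam=0.3):
    """E = K * m * n * exp(-lam * S) under an extreme value distribution.

    K and lam depend on the scoring scheme and must be estimated for it;
    the defaults here are illustrative placeholders only.
    """
    return K * query_len * db_len * math.exp(-lam * score)
```

Two properties follow directly from the form: E falls exponentially as the score rises, and E grows in proportion to the size of the database searched.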

Expectation values If E=5 for a score of 200 in a database search, then one would expect to see 5 sequences with this score or higher by chance alone. If E= for a score of 750, then one would not expect to see sequence pairs with this score by chance alone, so the pair are probably related.

Database Searching Algorithms Can use dynamic programming to search. Slowest, but best method. Most commonly, HEURISTIC methods are used - e.g. BLAST and FASTA. These reduce the time for a search by taking shortcuts.

FASTA Algorithm Does fast lookup of identical matches Then looks for runs of identity Then builds alignment Then estimates significance
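A sketch of the first two steps only: a lookup table of short identical words, then counting hits per diagonal to find runs of identity. This illustrates the idea and is not the FASTA program's actual code; the word length k and the function names are assumptions.

```python
from collections import Counter, defaultdict

def word_index(seq, k=2):
    """Map every length-k word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def diagonal_hits(query, target, k=2):
    """Count identical word matches on each diagonal (query pos - target pos)."""
    index = word_index(query, k)
    hits = Counter()
    for j in range(len(target) - k + 1):
        # .get avoids adding empty entries to the defaultdict
        for i in index.get(target[j:j + k], ()):
            hits[i - j] += 1
    return hits
```

A diagonal with many hits marks a run of identity without gaps, so only the best diagonals need to be passed on to the slower alignment and significance steps.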

BLAST Algorithm Basic Local Alignment Search Tool. Applications to Protein-Protein, Protein-DNA, DNA-Protein and DNA-DNA comparisons.

More advanced searching Iterative searching - PSI-BLAST Profile searching Hidden Markov Models (HMMs) Combination of sequence information with other information.

Reading material for this lecture - look at BLAST service. - look at Tools, in particular SRS and CLUSTAL. Book chapter (online) Same information in PDF File:

The end