

Pairwise Sequence Alignment

The most important class of bioinformatics tools: pairwise alignment of DNA and protein sequences.

             Alignment 1    Alignment 2
    Seq. 1   ACGCTGA        ACGCTGA
    Seq. 2   A--CTGT        ACTGT--

An alignment method seeks alignments with high sequence identity and few mismatches and gaps.
Assumption: the observed identity between the sequences being aligned results either from chance or from a shared evolutionary origin.
Identity ≠ similarity.
Treating sequence identity as homology is a risky assumption: sequence identity ≠ homology.

Figure A: Common evolutionary events, including indels, and their effects on alignment.
The same true alignment can arise through different evolutionary events.
Scoring scheme: substitution = -1, indel = -5, match = 3.
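As a minimal sketch of how this scoring scheme turns an alignment into a number, the snippet below scores the two example alignments of ACGCTGA and ACTGT column by column. The function name `score_alignment` and its parameter names are illustrative choices, not part of the lecture.

```python
def score_alignment(row1, row2, match=3, mismatch=-1, indel=-5):
    """Sum the column scores of a gapped pairwise alignment ('-' marks an indel)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += indel        # a residue aligned against a gap
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total


# Alignment 1: 4 matches, 1 mismatch, 2 indels
print(score_alignment("ACGCTGA", "A--CTGT"))
# Alignment 2: 3 matches, 2 mismatches, 2 indels
print(score_alignment("ACGCTGA", "ACTGT--"))
```

Under this scheme the first alignment scores 1 and the second scores -3, so the first is the better guess.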

Finding the optimal score gives the best guess for the true alignment.
To find the optimal pairwise alignment of two sequences, insert gaps into one or both of them so as to maximize the total alignment score.
Dynamic programming (DP): Needleman and Wunsch (1970), Smith and Waterman (1981). This algorithm guarantees that we find all optimal alignments of two sequences of lengths m and n.
BLAST is based on DP, with heuristics that improve speed.
Photo: Prof. Waterman.

The score S(i,j) for the alignment of the first i residues of sequence 1 against the first j residues of sequence 2 is given by

    S(i,j) = max { S(i-1,j-1) + c(i,j), S(i-1,j) + c(i,-), S(i,j-1) + c(-,j) }

where c(i,j) = the score for aligning residues i and j, which takes the value 3 for a match or -1 for a mismatch, and c(i,-) = c(-,j) = the penalty for aligning a residue with a gap, which takes the value -5.

The entry for S(1,1) is the maximum of the following three events:

    S(0,0) + c(A,A) = 0 + 3 = 3       [c(A,A) = c(1,1)]
    S(0,1) + c(A,-) = -5 - 5 = -10    [c(A,-) = c(1,-)]
    S(1,0) + c(-,A) = -5 - 5 = -10    [c(-,A) = c(-,1)]

Similarly, one finds S(2,1) as the maximum of three values: (-5) - 1 = -6; 3 - 5 = -2; and (-10) - 5 = -15. The best entry is the addition of the C indel to the A-A match, for a score of -2 (see the next slide).

The alignment matrix of sequences 1 and 2 (figure).

    S(2,1) = max { S(1,0) + c(2,1), S(1,1) + c(2,-), S(2,0) + c(-,1) }
           = max { S(1,0) + c(C,A), S(1,1) + c(C,-), S(2,0) + c(-,A) }
           = max { -5 - 1, 3 - 5, -10 - 5 }
           = -2
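The matrix fill can be sketched in a few lines of Python. This is an illustrative implementation under the lecture's linear gap penalty (-5 per gap position), with S[i][j] holding the best score for the first i residues of sequence 1 against the first j residues of sequence 2; the function name `fill_matrix` is an assumption made for this example.

```python
def fill_matrix(seq1, seq2, match=3, mismatch=-1, gap=-5):
    """Fill the global-alignment score matrix S for seq1 (rows) vs seq2 (columns)."""
    m, n = len(seq1), len(seq2)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Borders: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + c,   # align residue i with residue j
                S[i - 1][j] + gap,     # residue i of seq1 against a gap
                S[i][j - 1] + gap,     # residue j of seq2 against a gap
            )
    return S


S = fill_matrix("ACGCTGA", "ACTGT")
print(S[1][1], S[2][1], S[7][5])
```

The three printed cells are S(1,1), S(2,1), and the final (7,5) cell, which can be checked against the hand calculations: 3, -2, and 1.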

Traceback: determine the actual alignment.
Start from the top right-hand corner, the (7,5) cell.
For example, the 1 in the (7,5) cell could only be reached by the addition of the mismatch A-T.

    ACGCTGA          ACGCTGA
    A--CTGT    or    AC--TGT

4 matches, 1 mismatch, 2 indels.
Ambiguity: it has to do with which C in seq. 1 aligns with the C in seq. 2.
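A traceback that follows every move consistent with the filled matrix recovers all optimal alignments, which makes the ambiguity explicit. The sketch below is one possible recursive formulation, not the lecture's own code; `all_optimal_alignments` and the helper names are hypothetical, and the recursion is only suitable for short sequences.

```python
def all_optimal_alignments(seq1, seq2, match=3, mismatch=-1, gap=-5):
    """Return (optimal score, list of all optimal global alignments)."""
    m, n = len(seq1), len(seq2)

    def c(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    # Fill the score matrix exactly as in the recurrence for S(i,j).
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + c(i, j),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    results = []

    def walk(i, j, top, bottom):
        # top/bottom are built back to front and reversed at the origin.
        if i == 0 and j == 0:
            results.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + c(i, j):
            walk(i - 1, j - 1, top + seq1[i - 1], bottom + seq2[j - 1])
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            walk(i - 1, j, top + seq1[i - 1], bottom + "-")
        if j > 0 and S[i][j] == S[i][j - 1] + gap:
            walk(i, j - 1, top + "-", bottom + seq2[j - 1])

    walk(m, n, "", "")
    return S[m][n], results


score, alignments = all_optimal_alignments("ACGCTGA", "ACTGT")
print(score)
for top, bottom in alignments:
    print(top)
    print(bottom)
    print()
```

For the lecture's sequences this yields the two alignments shown above, which differ only in which C of seq. 1 is paired with the C of seq. 2.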

Parameter settings: gap penalties
Default settings are the easiest to use, but they do not necessarily yield the correct alignment.
Constant penalty: independent of the length of the gap, A.
Proportional penalty: proportional to the length L of the gap, BL (this is what we used in this lecture).
Affine gap penalty: gap-opening penalty + gap-extension penalty = A + BL.
There is no rule for predicting the penalty that best suits an alignment. Optimal penalties vary from sequence to sequence; it is a matter of trial and error.
Usually A > B, since A is the cost of opening a gap (usually A/B ~ 10).
Hints: (1) When comparing distantly related sequences, a high A and a very low B often give the best results, so that gaps are penalized more for their existence than for their length. (2) When comparing closely related sequences, penalize both gap opening and gap extension.
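The three gap-penalty models can be compared side by side with a small sketch. The function names are illustrative, penalties are expressed as positive magnitudes to be subtracted from the score, and the values A = 10 and B = 1 are only an example of the A/B ~ 10 rule of thumb.

```python
def constant_penalty(length, A):
    """Constant model: any gap costs A, regardless of its length."""
    return A if length > 0 else 0


def proportional_penalty(length, B):
    """Proportional model: a gap of length L costs B * L."""
    return B * length


def affine_penalty(length, A, B):
    """Affine model as stated in the lecture: A + B * L for a gap of length L."""
    return A + B * length if length > 0 else 0


A, B = 10, 1  # example values only
for L in (1, 2, 5, 10):
    print(L, constant_penalty(L, A), proportional_penalty(L, B), affine_penalty(L, A, B))
```

With a high A and a low B, one long gap is much cheaper than several short ones, which is why this setting suits distantly related sequences.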

Exercise: Computing an optimal sequence alignment
Two scoring schemes:
(1) Gap penalty = -5, mismatch = -1, match = 3
(2) Gap penalty = -1, mismatch = -1, match = 3

Under scheme (1):
    First alignment score = 5*3 + 2*(-1) = 13
    Second/third alignment score = 6*3 + 2*(-5) = 8
Under scheme (2):
    First alignment score = 5*3 + 2*(-1) = 13
    Second/third alignment score = 6*3 + 2*(-1) = 16

A more serious problem: the choice of gap penalty can lead to identifying the wrong alignment.
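The arithmetic of the exercise can be replayed directly from the match, mismatch, and gap counts. This sketch assumes only the counts given above (5 matches and 2 mismatches for the first alignment; 6 matches and 2 gap positions for the second/third); the names `score_from_counts` and `candidates` are made up for the illustration.

```python
def score_from_counts(matches, mismatches, gaps, match=3, mismatch=-1, gap=-5):
    """Alignment score computed from column counts under a linear gap penalty."""
    return matches * match + mismatches * mismatch + gaps * gap


# (matches, mismatches, gap positions) for the alignments in the exercise
candidates = {"first": (5, 2, 0), "second/third": (6, 0, 2)}

for gap in (-5, -1):
    scores = {name: score_from_counts(*counts, gap=gap)
              for name, counts in candidates.items()}
    best = max(scores, key=scores.get)
    print(f"gap={gap}: {scores} -> best: {best}")
```

Changing only the gap penalty from -5 to -1 flips the winner from the first alignment (13 vs 8) to the second/third (16 vs 13).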

Exercise: Computing an optimal sequence alignment (figure panels: gap penalty = -5; gap penalty = -1)

BLAST (Scoring matrices)
Dynamic programming does not provide the user with a measure of statistical similarity when regions of local similarity are found.
Take into account not just the position-by-position overlap between two sequences but also the characteristics of the amino acids (a.a.) being aligned: define scoring matrices.
Protein scoring matrices take three major biological factors into account:
Conservation: the numbers within the scoring matrix represent which a.a. are capable of substituting for other a.a. (characteristics such as charge, size, hydrophobicity).
Frequency: a.a. cannot freely substitute for one another; the matrices need to reflect how often particular a.a. occur among all proteins.
Evolution: scoring matrices implicitly represent evolutionary patterns, and matrices can be adjusted to favor the detection of closely related or more distantly related proteins.

Scoring matrices and the log odds ratio

    score(i,j) = log [ q_ij / (p_i p_j) ]

where p_i [p_j] = the probability with which a.a. i [j] occurs among all proteins, and q_ij = how often the two a.a. i and j are seen to align with one another in multiple sequence alignments (MSA) of protein families or in sequences that are known to have a biological relationship.
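A log odds score compares how often two amino acids are observed aligned (q_ij) with how often they would pair up by chance (p_i times p_j). The sketch below assumes a base-2 logarithm and uses made-up frequencies purely for illustration; published matrices additionally scale and round the values.

```python
import math


def log_odds(q_ij, p_i, p_j, base=2.0):
    """Log odds ratio of the observed pair frequency to the chance expectation."""
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical frequencies, not taken from any real data set.
p_i, p_j = 0.05, 0.06
print(log_odds(0.009, p_i, p_j))   # pair seen more often than chance: positive
print(log_odds(0.001, p_i, p_j))   # pair seen less often than chance: negative
```

A positive score marks a substitution that occurs more often than expected by chance, a negative score one that occurs less often, and a score of zero means the pair aligns exactly as often as chance predicts.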

BLAST (PAM matrices)
Amino acid substitution matrices (PAM and BLOSUM)
Leave most adjustable parameters at their default values, except the scoring matrix.
Box 2.1: a simple scheme for scoring sequence matches and mismatches (all mismatches receive the same penalty).
A scoring matrix allows some mismatches to be penalized less than others: a leucine-isoleucine mismatch is penalized less than a leucine-tryptophan mismatch.
PAM (Point Accepted Mutation) scoring matrices: derived from closely related species (from an evolutionary point of view, this avoids the complications of unobserved multiple substitutions at a single position).
PAM is derived from the likelihood of amino acid substitution during the evolutionary process.
PAM matrices with a smaller number represent a shorter evolutionary distance.
PAM1: one a.a. change per 100 a.a., or roughly 1% divergence.

PAM (figure): Asp to Glu, 0.95%

BLAST (BLOSUM matrices)
BLOSUM (BLOcks SUbstitution Matrix): there is evidence that it outperforms PAM.
Block: a region in which proteins of the same family can be aligned without introducing a gap (a block refers to the aligned region, not to the individual sequences).
Any given protein can therefore contain one or more blocks, each corresponding to one of its functional or structural motifs.
With these protein blocks, it is possible to look for substitution patterns only in the most conserved regions of a protein; block substitution matrices are generated from them.
BLOSUM scoring matrices: based on data from distantly related sequences (default: BLOSUM62 for general use).
The most commonly used matrices are PAM120, PAM250, BLOSUM50, and BLOSUM62.
BLOSUM matrices with a smaller number represent a longer evolutionary distance.

The BLOSUM62 substitution matrix (figure). Values below zero indicate amino acid changes that are more likely to have a functional effect than values of zero and above.
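To illustrate how a substitution matrix penalizes some mismatches less than others, the sketch below scores ungapped peptide pairs with a three-residue excerpt. The values are quoted from memory of BLOSUM62 for isoleucine (I), leucine (L), and tryptophan (W) and should be checked against a published copy of the matrix; the names `BLOSUM62_EXCERPT`, `pair_score`, and `score_ungapped` are illustrative.

```python
# Excerpt of a symmetric substitution matrix (values as recalled from BLOSUM62;
# verify against a published table before relying on them).
BLOSUM62_EXCERPT = {
    ("I", "I"): 4, ("I", "L"): 2, ("I", "W"): -3,
    ("L", "L"): 4, ("L", "W"): -2,
    ("W", "W"): 11,
}


def pair_score(a, b, table=BLOSUM62_EXCERPT):
    """Look up a residue pair in either order, since the matrix is symmetric."""
    return table[(a, b)] if (a, b) in table else table[(b, a)]


def score_ungapped(seq1, seq2, table=BLOSUM62_EXCERPT):
    """Score two equal-length sequences position by position, without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b, table) for a, b in zip(seq1, seq2))


print(pair_score("L", "I"), pair_score("L", "W"))
print(score_ungapped("LLW", "LIW"), score_ungapped("LLW", "LWW"))
```

The leucine-isoleucine pair receives a positive score while leucine-tryptophan receives a negative one, so the conservative substitution costs an alignment far less.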

BLAST (relating PAM to BLOSUM)

    PAM250 is equivalent to BLOSUM45
    PAM160 is equivalent to BLOSUM62
    PAM120 is equivalent to BLOSUM80

Selecting an appropriate scoring matrix

    Matrix     Best use                                               Similarity (%)
    BLOSUM90   Short alignments that are highly similar
    BLOSUM80   Detecting members of a protein family
    BLOSUM62   Most effective in finding all potential similarities
    BLOSUM30   Longer alignments of more divergent sequences          <30

BLAST (Sensitivity and Specificity)