Pairwise alignments
- Introduction
- Why do alignments?
- Definitions
- Scoring alignments
- Alignment methods
- Significance of alignments

Definitions
An alignment is a mutual arrangement of sequences that shows where the sequences are similar and where they differ.
An optimal alignment is one that exhibits the most correspondences and the fewest differences. It is the alignment with the highest score. It may or may not be biologically meaningful.

Why do alignments?
Sequence alignment is useful for discovering structural, functional, and evolutionary information in biological sequences.

How to measure the similarity
Three kinds of changes can occur at any given position within a sequence:
- Mutation
- Insertion
- Deletion
Insertions and deletions (together: indels) have been found to occur in nature at a significantly lower frequency than mutations.

Alignment: 2 row representation
Given 2 DNA sequences v and w:
v: ATGTTAT (m = 7)
w: ATCGTAC (n = 7)
Without gaps: 4 matches, 3 mismatches.
With gaps: 5 matches, 2 insertions, 2 deletions.
[Figure: two-row alignment of the letters of v and the letters of w]
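The column counts on this slide can be checked with a short sketch. The gapped rows used below ("AT-GTTAT-" over "ATCGT-A-C") are an assumption: they are one alignment of v and w that yields the stated counts, not necessarily the rows drawn on the slide. The function name `count_columns` and the convention that a gap in the top row is an insertion are also illustrative choices.

```python
def count_columns(top, bottom):
    """Count column types in a two-row alignment; '-' marks a gap.

    Assumed convention: a gap in the top row is an insertion,
    a gap in the bottom row is a deletion.
    """
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    counts = {"matches": 0, "mismatches": 0, "insertions": 0, "deletions": 0}
    for a, b in zip(top, bottom):
        if a == "-":
            counts["insertions"] += 1
        elif b == "-":
            counts["deletions"] += 1
        elif a == b:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1
    return counts


# Ungapped comparison of v and w
ungapped = count_columns("ATGTTAT", "ATCGTAC")
# One gapped alignment of the same sequences (assumed rows)
gapped = count_columns("AT-GTTAT-", "ATCGT-A-C")
```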

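Aligning DNA Sequences
V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)
[Figure: alignment of V and W with each column labeled as a match, mismatch, insertion, or deletion; insertions and deletions together are indels. The counts of matches, mismatches, insertions, and deletions are not preserved.]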

Scoring Matrices for Aligning DNA Sequences
- Transitions: substitutions in which a purine (A/G) is replaced by another purine (A/G) or a pyrimidine (C/T) is replaced by another pyrimidine (C/T).
- Transversions: (A/G) ↔ (C/T)
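As a small illustration of the transition/transversion distinction, the following sketch classifies a pair of DNA bases. The function name `substitution_type` is hypothetical, and the sketch assumes upper-case input limited to A, C, G, T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(a, b):
    """Classify a pair of DNA bases as match, transition, or transversion."""
    if a == b:
        return "match"
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"
```

A DNA scoring matrix could, for example, penalize transversions more heavily than transitions; the slide does not give specific values.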

Scoring a Sequence Alignment
- Match score: +1
- Mismatch score: +0
- Gap penalty: -1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

- Matches: 18 × (+1)
- Mismatches: 2 × 0
- Gaps: 7 × (-1)

Score = 18 + 0 - 7 = +11
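The arithmetic on this slide can be reproduced with a minimal sketch that scores an already-aligned pair of rows column by column. The function name `score_alignment` is illustrative; the default parameters are the slide's scores.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=-1):
    """Sum per-column scores of a given two-row alignment ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


slide_score = score_alignment(
    "ACGTCTGATACGCCGTATAGTCTATCT",
    "----CTGATTCGC---ATCGTCTATCT",
)
```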

Aligning protein sequences
FFDGGLQMQMLKDKFPMEGGQKDPKQRI

Amino Acid Substitution Matrices
- PAM (point accepted mutation): based on global alignments [evolutionary model]
- BLOSUM (blocks substitution matrix): based on local alignments [similarity among conserved sequences]

Part of PAM 250 Matrix

      C    S    T    P    A    G
C    12
S     0    2
T    -2    1    3
P    -3    1    0    6
A
G

$$
\text{Log-odds} = \log\left(\frac{\text{chance to see pair in homologous proteins}}{\text{chance to see pair in unrelated proteins by chance}}\right)
$$
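A minimal sketch of the log-odds idea follows. The probabilities passed in are hypothetical inputs, not values from the PAM data, and the choice of base-10 logarithm without any scaling factor is an assumption made only for illustration.

```python
import math


def log_odds(p_homologous, p_chance):
    """Log of the ratio between the chance of seeing a residue pair in
    homologous proteins and the chance of seeing it in unrelated proteins."""
    return math.log10(p_homologous / p_chance)
```

A positive value means the pair is seen more often in homologous proteins than chance would predict; a negative value means it is seen less often.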

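PAM 250 Matrix
[Table: full PAM 250 matrix, rows and columns ordered C S T P A G N D E Q H R K M I L V F Y W. Only the first four rows are preserved: C: 12; S: 0, 2; T: -2, 1, 3; P: -3, 1, 0, 6.]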

Scoring Matrix: Example

     A    R    N    K
A    5   -2   -1   -1
R    -    7   -1    3
N    -    -    7    0
K    -    -    -    6

Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.

AKRANR
KAAANK
-1 + (-1) + (-2) + 5 + 7 + 3 = 11
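The sum for AKRANR over KAAANK can be checked with a short sketch. The matrix entries A/N = A/K = R/N = -1 are assumptions: they are not legible in the slide text and are chosen here so that the column scores add up to the stated total of 11. The names `SUBSET` and `ungapped_score` are illustrative.

```python
# Upper triangle of the 4 x 4 example matrix (A/N, A/K, R/N assumed to be -1)
SUBSET = {
    ("A", "A"): 5, ("A", "R"): -2, ("A", "N"): -1, ("A", "K"): -1,
    ("R", "R"): 7, ("R", "N"): -1, ("R", "K"): 3,
    ("N", "N"): 7, ("N", "K"): 0,
    ("K", "K"): 6,
}


def pair_score(a, b):
    """Look up a symmetric substitution score."""
    if (a, b) in SUBSET:
        return SUBSET[(a, b)]
    return SUBSET[(b, a)]


def ungapped_score(x, y):
    """Score two equal-length sequences position by position."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(x, y))


total = ungapped_score("AKRANR", "KAAANK")
```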

Sequence Alignment Problem
T C A T G
C A T T G

Elements of Dynamic Programming
The dynamic programming method is used to solve optimization problems whose optimal solutions depend on the optimal solutions to their subproblems. It involves:
- Characterize the structure of the optimal solutions
- Recursively define the score of an optimal solution in terms of the scores of optimal solutions of subproblems
- Compute the solution in a bottom-up fashion
- Trace back the optimal solution

Dynamic Programming
Consider two sequences:
AAAT
AGC
To find the optimal solution, if T is aligned with C, we have to find the best alignment between AAA and AG.
→ The best solution depends on the best solutions of the subproblems.

Dynamic Programming
Consider two sequences:
AAAT
AGC
To find the optimal solution, we have to find the best alignment between AAA and AG, AAA and AGC, or AAAT and AG.
→ The best solution depends on the best solutions of the subproblems.

Dynamic Programming
Optimal solutions for the subproblems have to be solved recursively.
Let n be the size of sequence s = AAAT and m be the size of sequence t = AGC.
Consider subproblems: matching the prefixes of s and t.
- s has n+1 possible prefixes, including the empty string
- t has m+1 possible prefixes, including the empty string

Dynamic Programming
We would like to match s[1…i] and t[1…j]:
- Align s[1…i] with t[1…j-1] and match a space with t[j]
- Align s[1…i-1] with t[1…j-1] and match s[i] with t[j]
- Align s[1…i-1] with t[1…j] and match a space with s[i]
Similarity between s and t:

$$
\mathrm{Score}(s[1 \ldots i], t[1 \ldots j]) = \max
\begin{cases}
\mathrm{Score}(s[1 \ldots i], t[1 \ldots j-1]) + \text{gap penalty} \\
\mathrm{Score}(s[1 \ldots i-1], t[1 \ldots j-1]) + \mathrm{score}(s[i], t[j]) \\
\mathrm{Score}(s[1 \ldots i-1], t[1 \ldots j]) + \text{gap penalty}
\end{cases}
$$

Definitions
- Global alignment: Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.
- Local alignment: Smith-Waterman (1981) is a modification of the dynamic programming algorithm that gives the highest-scoring local match between two sequences.

Example
Let gap = -2, match = 1, mismatch = -1.
[DP matrix with AAAC along the columns and AGC along the rows; cell values not preserved]
Alignment:
AAAC
A-GC
Complexity: O(mn)
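The global alignment procedure for this example can be sketched as follows. This is a minimal illustration with a linear gap penalty, using the slide's scores as defaults; the function name `needleman_wunsch` and the traceback tie-breaking order (diagonal first, then a gap in t, then a gap in s) are choices made here, so other equally scoring alignments may exist.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming with a linear gap penalty.

    Returns (score, aligned_s, aligned_t).
    """
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix with the empty string costs one gap per letter
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Bottom-up fill using the three-way recurrence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # s[i] aligned with t[j]
                score[i - 1][j] + gap,      # s[i] aligned with a space
                score[i][j - 1] + gap,      # t[j] aligned with a space
            )
    # Trace back from the bottom-right corner
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))


best, aligned_s, aligned_t = needleman_wunsch("AAAC", "AGC")
```

The table has (n+1)(m+1) cells and each cell takes constant time to fill, which is where the O(mn) running time comes from.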

Gap Penalty
Scoring Indels: Naive Approach
A fixed penalty σ is given to every indel:
- -σ for 1 indel
- -2σ for 2 consecutive indels
- -3σ for 3 consecutive indels, etc.
This can be too severe a penalty for a series of 100 consecutive indels.

Affine Gap Penalties
In nature, a series of k indels often comes as a single event rather than a series of k single nucleotide events.
[Figure: two alignments with the same number of indels, one labeled "This is more likely" and the other "This is less likely"]
Normal scoring would give the same score for both alignments.

Gap Penalty
- The gap opening penalty defines the cost for opening a gap in one of the sequences. The penalty must be tuned based on the default matrix.
- The gap extension penalty is an extra penalty proportional to the length of the gap. The gap extension penalty is typically lower than the gap opening penalty.
Optimal penalties vary from sequence to sequence, and finding the most adequate value is a matter of empirical trial and error.

Accounting for Gaps
- Gap: a contiguous sequence of spaces in one of the rows
- The score for a gap of length x is -(ρ + σx), where x is the length of the gap, ρ > 0 is the penalty for introducing a gap (gap opening penalty), and σ is the gap extension penalty
- ρ will be large relative to σ, because you do not want to add too much of a penalty for extending the gap

Affine Gap Penalties
Gap penalties:
- -ρ-σ when there is 1 indel
- -ρ-2σ when there are 2 indels
- -ρ-3σ when there are 3 indels, etc.
- -ρ - x·σ (-gap opening - x gap extensions)
Somewhat reduced penalties (as compared to naïve scoring) are given to runs of horizontal and vertical edges.
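The two gap scoring schemes can be compared with a short sketch. The function names and the numeric values in the example (σ = 5 for the naive scheme; ρ = 10, σ = 1 for the affine scheme) are hypothetical and chosen only to show how differently a long run of indels is scored.

```python
def naive_gap_score(x, sigma):
    """Fixed penalty sigma for each of x consecutive indels: -sigma * x."""
    return -sigma * x


def affine_gap_score(x, rho, sigma):
    """Affine score for one gap of length x: -(rho + sigma * x).

    rho is the gap opening penalty, sigma the gap extension penalty.
    A gap of length 0 is no gap at all and scores 0.
    """
    if x == 0:
        return 0
    return -(rho + sigma * x)


# A run of 100 consecutive indels under each scheme (illustrative values)
naive_100 = naive_gap_score(100, sigma=5)
affine_100 = affine_gap_score(100, rho=10, sigma=1)
```

With these values a single indel costs more under the affine scheme (-11 versus -5), while a long run costs far less (-110 versus -500), which matches the idea that a long gap is often one event.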

Affine Gap Penalties and Edit Graph
To reflect affine gap penalties we have to add "long" horizontal and vertical edges to the edit graph. Each such edge of length x should have weight -ρ - x·σ.

Adding "Affine Penalty" Edges to the Edit Graph
There are many such edges! Adding them to the graph increases the running time of the alignment algorithm by a factor of n (where n is the length of the sequence), so the complexity increases from O(n^2) to O(n^3).

Optimal alignment
- Optimal alignment: one that exhibits the most correspondences. It is the alignment with the highest score. It may or may not be biologically meaningful.
- Global alignment: Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.
- Local alignment: Smith-Waterman (1981) gives the highest-scoring local match between two sequences.

Local Alignment
- Problem first formulated: Smith and Waterman (1981)
- Problem: find an optimal alignment between a substring of s and a substring of t
- Algorithm: a variant of the basic algorithm for global alignment

Motivation
- Searching for unknown domains or motifs within proteins from different families
  - Proteins encoded by homeobox genes (conserved in only 1 region, the homeodomain, 60 amino acids long)
  - Identifying active sites of enzymes
- Comparing long stretches of anonymous DNA
- Querying databases where the query word is much smaller than the sequences in the database
- Analyzing repeated elements within a single sequence

Local Alignment
Let n be the size of sequence s = GATCACCT and m be the size of sequence t = GATACCC.
Consider subproblems: matching the suffixes of s and t.
- s has n+1 possible suffixes, including the empty string
- t has m+1 possible suffixes, including the empty string

DP for Local Alignment
Match the suffixes of s[1…i] and t[1…j]:
- Align suffixes of s[1…i] with t[1…j-1] and match a space with t[j]
- Align suffixes of s[1…i-1] with t[1…j-1] and match s[i] with t[j]
- Align suffixes of s[1…i-1] with t[1…j] and match a space with s[i]

$$
\mathrm{Score}(s[1 \ldots i], t[1 \ldots j]) = \max
\begin{cases}
\mathrm{Score}(s[1 \ldots i], t[1 \ldots j-1]) + \text{gap penalty} \\
\mathrm{Score}(s[1 \ldots i-1], t[1 \ldots j-1]) + \mathrm{score}(s[i], t[j]) \\
\mathrm{Score}(s[1 \ldots i-1], t[1 \ldots j]) + \text{gap penalty} \\
0
\end{cases}
$$

S_ij = Score(s[1…i], t[1…j]) is the highest score for an alignment between suffixes of the 2 prefixes s[1…i] and t[1…j], that is, between substrings ending at i and j. The 0 option corresponds to the empty alignment.

Local Alignment
Let gap = -2, match = 1, mismatch = -1.
s = GATCACCT, t = GATACCC
[DP matrix; cell values not preserved]
GATCACCT
GAT_ACCC

Local Alignment
Let gap = -2, match = 1, mismatch = -1.
[DP matrix with ACACACTA along the columns and AGCACACA along the rows; cell values not preserved]

Smith & Waterman
- Place each sequence along one axis
- Place score 0 at the upper-left corner
- Fill in the 1st row and column with 0s
- Fill in the matrix with the max of 4 possible values:
  - 0
  - Vertical move: Score + gap penalty
  - Horizontal move: Score + gap penalty
  - Diagonal move: Score + match/mismatch score
- The optimal alignment score is the max in the matrix
- To reconstruct the optimal alignment, trace back where the max at each step came from; stop when a zero is hit
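The steps above can be sketched in a few lines. This is a minimal illustration with a linear gap penalty and the scores used in the local alignment examples (gap = -2, match = 1, mismatch = -1); the function name `smith_waterman` and the tie-breaking order in the traceback are choices made here.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Local alignment by dynamic programming with a linear gap penalty.

    Returns (score, aligned_s, aligned_t) for one best-scoring local alignment.
    """
    n, m = len(s), len(t)
    # First row and first column stay at 0
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a fresh alignment here
                H[i - 1][j - 1] + sub,  # diagonal move
                H[i - 1][j] + gap,      # vertical move
                H[i][j - 1] + gap,      # horizontal move
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the maximum cell until a zero is hit
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


local_score, local_s, local_t = smith_waterman("GATCACCT", "GATACCC")
```

Because the traceback stops at the first zero, the returned rows cover only the best-scoring substrings rather than the full sequences.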

Semi-global Alignment
Example:

CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

CAGCACTTGGATTCTCGG
CAGC----G--T----GG

We like the first alignment much better. In semiglobal comparison, we score the alignments ignoring some of the end spaces.

Global Alignment
Example:
AAACCC
A--CCC
Prefer to see:
AAACCC
--ACCC
Do not want to penalize the end spaces.
[DP matrix with AAACCC along the columns and ACCC along the rows; cell values not preserved]

SemiGlobal Alignment
Example:
s = AAACCC
t = --ACCC
[DP matrix with s along the columns and t along the rows; only the start of row A is preserved: -2, 1, 1, 1]

SemiGlobal Alignment
Example:
s = AAACCCG
t = --ACCC-
[DP matrix with s along the columns and t along the rows; only the start of row A is preserved: -2, 1, 1, 1]

SemiGlobal Alignment
Summary of end space charging procedures, with the 1st sequence (s) along the columns and the 2nd sequence (t) along the rows of the matrix:

Place where spaces are not penalized    Action
Beginning of 2nd sequence               Initialize 1st row with zeros
End of 2nd sequence                     Look for max in last row
Beginning of 1st sequence               Initialize 1st column with zeros
End of 1st sequence                     Look for max in last column
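The four end-space options can be sketched as flags on a global alignment routine. This sketch assumes the matrix layout of the examples, with the 1st sequence s along the columns and the 2nd sequence t along the rows, so that free spaces at the beginning of t mean a zero first row and free spaces at the end of t mean taking the max over the last row. The function name `semiglobal_score` and its flag names are hypothetical; only the score is returned, without a traceback.

```python
def semiglobal_score(s, t, free_start_t=False, free_end_t=False,
                     free_start_s=False, free_end_s=False,
                     match=1, mismatch=-1, gap=-2):
    """Alignment score where selected end spaces are not penalized.

    Layout assumption: s runs along the columns, t along the rows.
    With all flags False this is the ordinary global alignment score.
    """
    rows, cols = len(t) + 1, len(s) + 1
    S = [[0] * cols for _ in range(rows)]
    # First row: prefixes of s against spaces at the beginning of t
    for j in range(1, cols):
        S[0][j] = 0 if free_start_t else S[0][j - 1] + gap
    # First column: prefixes of t against spaces at the beginning of s
    for i in range(1, rows):
        S[i][0] = 0 if free_start_s else S[i - 1][0] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if t[i - 1] == s[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    candidates = [S[-1][-1]]
    if free_end_t:
        # Last row: all of t is used, the rest of s faces trailing spaces in t
        candidates.extend(S[-1])
    if free_end_s:
        # Last column: all of s is used, the rest of t faces trailing spaces in s
        candidates.extend(row[-1] for row in S)
    return max(candidates)


global_score = semiglobal_score("AAACCC", "ACCC")
free_start = semiglobal_score("AAACCC", "ACCC", free_start_t=True)
free_both = semiglobal_score("AAACCCG", "ACCC", free_start_t=True, free_end_t=True)
```

For AAACCC and ACCC, the global score charges for the two leading spaces in t, while the semiglobal score with free leading spaces in t counts only the four matches.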

Global vs Local Demo

R
- R:
- R IDE:
- R manual:
- R & IDE downloads:
- Quick R:

R Demo
Bioconductor
    source("
    biocLite()
    library(Biostrings)  # load library
    pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede")
    ?pairwiseAlignment