Biology 162 Computational Genetics Todd Vision Fall 2004 31 Aug 2004

Presentation transcript:

Pairwise alignment Biology 162 Computational Genetics Todd Vision Fall 2004 31 Aug 2004

Preview General concepts in alignment How to read a dotplot Scoring matrices and gap penalties Basic dynamic programming algorithms Needleman-Wunsch Smith-Waterman Using more realistic gap penalties

Homology Two sequences are homologous if they are descended from a common ancestral sequence. Homology is either all or nothing Only similarity can be quantitative An alignment is a model of positional homology between nucleotide or amino acid residues

Some applications of sequence alignment Sequence/genome assembly Locating exons within genomic sequences Functional annotation by homology search against a database Identification of conserved signatures/motifs/domains Molecular evolution and phylogenetics Structural homology modeling

Alignments classified by: Span (global, encompassing full-length sequences; local, restricted to conserved segments). Number of sequences (pairwise, involving only two sequences; multiple, involving more than two). Algorithm (optimality guarantee; heuristic).

Amino acids versus DNA DNA sequences give much worse alignments than amino acid sequences Fewer letters Less realistic scoring matrices Some applications can align codons How large would that scoring matrix be? If that’s not possible Use aa alignment to guide DNA alignment of coding sequences Use conceptual translations (6 potential coding frames) for database searches

Dotplots: phage λ cI vs. P22 c2 repressor [Figure: dotplots at window sizes 1, 11, 25 and stringencies 1, 7, 15]
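The window and stringency settings can be illustrated with a minimal sketch. It assumes the usual convention that a dot is drawn at (i, j) when at least `stringency` of the `window` residues starting at those positions are identical; the function name `dotplot` is hypothetical and not taken from the slides.

```python
def dotplot(x, y, window=1, stringency=1):
    """Return the set of (i, j) window start positions that earn a dot.

    Assumed convention: a dot is placed when at least `stringency` of the
    `window` aligned residues starting at x[i] and y[j] are identical.
    """
    dots = set()
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            matches = sum(1 for k in range(window) if x[i + k] == y[j + k])
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(x, y, dots):
    """Draw the dotplot as text, with x along columns and y along rows."""
    rows = []
    for j in range(len(y)):
        rows.append("".join("*" if (i, j) in dots else "." for i in range(len(x))))
    return "\n".join(rows)


if __name__ == "__main__":
    print(render("ACGTACGT", "ACGTACGT", dotplot("ACGTACGT", "ACGTACGT", 3, 3)))
```

Raising the window size and stringency together suppresses isolated chance matches, which is why the larger settings give a cleaner diagonal.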

Internal repeats: human LDL protein Window size = 23 bp Stringency = 7 bp

Inversions

Hanging ends [Figure: sequences x and y shown overlapping and nested]

Twilight Zone [Figure: scale of percent amino acid identity from 100 down to 10, with the twilight zone marked]

Scoring an alignment Possible relationships at a position Match (identity) Mismatch (substitution) Gap (insertion/deletion, or indel) A scoring matrix is used for matches and mismatches Typically binary for nucleic acids PAM, BLOSUM, & others for amino acids Gap penalties must be “tacked on” The alignment score is the sum of the scores at each position in the alignment, including gaps
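The sum-over-positions rule can be sketched as follows. This is an illustration only: it assumes a binary match/mismatch matrix and a flat per-position gap score, with default values borrowed from the Needleman-Wunsch example later in the deck. The function name `score_alignment` is hypothetical.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Sum the per-position scores of two already-aligned strings.

    '-' marks a gap. The scores are illustrative defaults, not a
    recommendation; a real protein alignment would use PAM or BLOSUM.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            total += gap        # insertion/deletion (indel)
        elif p == q:
            total += match      # identity
        else:
            total += mismatch   # substitution
    return total
```

For example, `score_alignment("cgtgcg-t", "c-tgtgat")` sums five matches, one mismatch, and two gap positions.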

LOD scores Let $p_{ab}$ be the expected frequency of aligned residue pair a and b among all aligned residues. Let $q_a$ be the frequency of individual amino acid a.
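The formula itself appears to be missing from the transcript. Assuming the slide used the standard log-odds definition built from the two quantities above, the score for aligning a with b would be:

```latex
s(a,b) = \log \frac{p_{ab}}{q_a \, q_b}
```

Under this definition the score is positive when a and b are aligned more often than expected by chance, and negative when they are aligned less often.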

Point Accepted Mutation (PAM) matrices Trained on alignments of closely related proteins PAM1 implies 1 substitution per 100 amino acids PAM250 = (PAM1)^250 Training set strongly biased toward globular proteins (more suitable matrices are available for more specific protein classes)

Choosing the right PAM matrix Low PAM values discriminate among amino acids more dramatically As the exponent increases, values within a row converge on amino acid frequencies Choice of matrix typically depends on observed % identity Classic chicken and egg problem PAM250 corresponds to 20% identity Assumes substitution rate is equal among sites (Poisson model), which we know to be false
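The claim that rows converge on the residue frequencies as the exponent grows can be checked on a toy example. The sketch below uses a made-up two-letter alphabet and made-up substitution probabilities (not real PAM data), chosen so that the expected change per step is 1% and the equilibrium frequencies are 2/3 and 1/3.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1" for a two-letter alphabet; each row sums to 1.
# These numbers are invented for illustration only.
toy_pam1 = [[0.9925, 0.0075],
            [0.0150, 0.9850]]

toy_pam250 = mat_pow(toy_pam1, 250)

if __name__ == "__main__":
    for row in toy_pam250:
        print([round(v, 3) for v in row])
```

After 250 steps both rows are close to the equilibrium frequencies, so the starting residue carries almost no information, whereas the toy PAM1 rows are sharply different from each other.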

BLOSUM scoring matrices Trained on ungapped alignments (blocks) of divergent sequences to capture ‘long-term’ substitution patterns Named BLOSUMx, where x (from 0 to 100) is the percent identity threshold at or above which sequences within a block are clustered and counted as one, so x caps the similarity of the sequences that contribute. (The smaller the value of x, the more divergent the sequences). Note that numbers have opposite meanings for PAM and BLOSUM! BLOSUM62 is in wide use (e.g. it is the default in BLAST)

BLOSUM62 [Table: BLOSUM62 substitution matrix over A R N D C Q E G H I L K M F P S T W Y V B Z X *; only the final row (*) survived extraction: -4 against every residue and 1 against *]

Gap penalties Naïve (linear) score: each gap position receives an independent penalty of d. Affine scores: score depends on length of contiguous gap, with gap opening penalty d and gap extension penalty e.
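The two schemes can be compared with a minimal sketch. It assumes d and e are positive penalties that are subtracted from the score, which is the convention of the affine formula later in the deck. The function names and the example values d=8, e=1 are illustrative assumptions.

```python
def linear_gap_score(length, d):
    """Naive scheme: every gap position costs the same penalty d."""
    return -d * length


def affine_gap_score(length, d, e):
    """Affine scheme: d to open the gap, e for each further position."""
    if length <= 0:
        return 0
    return -d - (length - 1) * e


if __name__ == "__main__":
    for g in (1, 2, 5, 10):
        print(g, linear_gap_score(g, 8), affine_gap_score(g, 8, 1))
```

With e smaller than d, one long gap is cheaper than several short ones of the same total length, which is the behavior the affine scheme is meant to capture.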

Dynamic programming A problem solving technique that employs recursion to solve a larger problem by solving a nested set of similar subproblems

Application to pairwise alignment Imagine We know the score for the first i-1 and j-1 residues of sequences x and y i-1 and j-1 are aligned in the optimal alignment There are three possibilities for the next position in the alignment A gap in sequence x A gap in sequence y A match or mismatch between i and j The maximum scoring alignment among these has to be in the optimal global alignment

Overview of algorithm We can use this fact to recursively fill out a matrix containing the score F(i,j) of the optimal alignment for every pair of residues i and j We also store a pointer to one of three previously filled out cells in the matrix, forming a path graph The optimal global alignment must be a path within the path graph It can be found by performing a traceback from the final cell in the recursion

Path Graph Diagonal moves represent matches and mismatches Horizontal and vertical moves represent gaps (indels)

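Needleman-Wunsch recursion
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & x_i, y_j \text{ match (or mismatch)} \\ F(i,j-1) - d & y_j \text{ aligned to gap} \\ F(i-1,j) - d & x_i \text{ aligned to gap} \end{cases}$$
Example: let s(C,C)=1, s(C,G)=s(G,C)=-2, and a gap score of -1 (d=1). [Figure: worked example of filling one matrix cell from its three neighbors]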

Needleman-Wunsch: initialization. Sequences: cgtgcgt and ctgtgat. Initialize F(0,0) = 0; the first row and first column hold the cumulative gap scores -2, -4, -6, -8, -10, -12, -14. Use pointers to remember path. Match=+1, Mismatch=-1, Gap=-2. Ties among the three pointers are broken by an arbitrary order of precedence.

Needleman-Wunsch: completed path graph [Figure: filled score matrix with pointers for cgtgcgt vs. ctgtgat; first row and column run 0, -2, -4, ..., -14]

Needleman-Wunsch: traceback [Figure: traceback path through the completed matrix] Optimal global alignment:
cgtgcg-t
| || | |
c-tgtgat
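The fill and traceback steps can be sketched in a few lines. This is an illustration under stated assumptions, not the lecturer's code: `gap` is a negative score added per gap position (as in "Gap=-2"), and ties are broken in the order diagonal, up, left, which is one arbitrary choice of precedence. Because there may be several optimal alignments, a different tie-breaking order can return a different alignment with the same score.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]

    # Boundary: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        F[i][0], ptr[i][0] = i * gap, "up"
    for j in range(1, n + 1):
        F[0][j], ptr[0][j] = j * gap, "left"

    # Fill: each cell takes the best of its three neighbors.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "diag"),  # x_i aligned to y_j
                (F[i - 1][j] + gap, "up"),      # x_i aligned to gap
                (F[i][j - 1] + gap, "left"),    # y_j aligned to gap
            ]
            # max() keeps the first maximal entry: diag, then up, then left.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Traceback from the final cell to the origin.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, ax, ay = needleman_wunsch("cgtgcgt", "ctgtgat")
    print(score)
    print(ax)
    print(ay)
```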

Complexity of algorithm For sequences of length m and n We consider 3 options at each cell We store mn scores and pointers We trace back no more than m+n steps 3mn + m + n in time, 3mn in memory O(mn) in both time and memory If m=n, O(n^2)

Smith-Waterman algorithm Local pairwise alignment Cells with negative scores are set to zero Traceback from highest scoring cell Stop when 0 is encountered Also O(nm)

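Smith-Waterman recursion
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & x_i, y_j \text{ match (or mismatch)} \\ F(i,j-1) - d & y_j \text{ aligned to gap} \\ F(i-1,j) - d & x_i \text{ aligned to gap} \\ 0 & \text{score} \le 0 \end{cases}$$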

Smith-Waterman algorithm [Figure: score matrix for cgtgcgt vs. ctgtgat, with the best path scoring +1, +2, +3 along one diagonal] Optimal local alignment:
 cgtgcgt
  |||
ctgtgat
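A minimal sketch of the local variant follows, using the same illustrative scores as the global example (match +1, mismatch -1, gap score -2). The function name is hypothetical. When several cells tie for the highest score, this sketch keeps the first one encountered, which is an arbitrary choice.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (0, None),                      # restart: negative scores become 0
                (F[i - 1][j - 1] + s, "diag"),
                (F[i - 1][j] + gap, "up"),
                (F[i][j - 1] + gap, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback from the highest-scoring cell until a 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(smith_waterman("cgtgcgt", "ctgtgat"))
```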

Guaranteeing a local alignment Use of SW algorithm alone does not guarantee “local” behavior Sensitive to the scoring function (should be negative for random sequences) Use of LOD scores helps ensure this Gap penalties must also be chosen carefully If it is cheaper to insert a gap than to tolerate a mismatch, then gaps will be inserted where no alignment is possible

More realistic gap penalties General gap function $\gamma(g)$ Requires $O(n^3)$ operations

Affine gap penalties Gap score: $\gamma(g) = -d - (g-1)e$ Can be done in O(mn) We need to keep track of three scores (and pointers) at each cell
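The three scores per cell can be sketched as three matrices: one for alignments ending in a residue pair, and one each for alignments ending in a gap in either sequence. The sketch below computes only the optimal global score (no pointers or traceback) and is one common formulation, not necessarily the one shown in the lecture. The names `M`, `Ix`, `Iy` and the default penalties are assumptions.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, d=3, e=1):
    """Optimal global alignment score with gap score -d - (g - 1) * e."""
    m, n = len(x), len(y)
    # M: x_i aligned to y_j; Ix: x_i aligned to gap; Iy: y_j aligned to gap.
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]

    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = -d - (i - 1) * e   # leading gap of length i
    for j in range(1, n + 1):
        Iy[0][j] = -d - (j - 1) * e   # leading gap of length j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # Open a new gap (cost d) or extend the current one (cost e).
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e, Iy[i - 1][j] - d)
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e, Ix[i][j - 1] - d)

    return max(M[m][n], Ix[m][n], Iy[m][n])
```

Setting e equal to d reduces this to the linear scheme, so `affine_global_score("cgtgcgt", "ctgtgat", d=2, e=2)` gives the same score as the earlier Needleman-Wunsch example.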

A theme with variations Overlapping or nested sequences Do not penalize hanging ends Repeated sequences Asymmetric algorithm can find multiple local alignments of x in y or y in x The basic idea admits many variations
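The "do not penalize hanging ends" variation can be sketched by changing only the boundary conditions of the global algorithm: the first row and column start at 0, and the answer is the best score anywhere in the last row or last column. This score-only version is an illustration under those assumptions, and the function name is hypothetical.

```python
def overlap_score(x, y, match=1, mismatch=-1, gap=-2):
    """Best alignment score when unaligned hanging ends cost nothing."""
    m, n = len(x), len(y)
    # First row and column stay 0: leading hanging ends are free.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trailing hanging ends are free: take the best cell on the
    # bottom row or the rightmost column.
    return max(max(F[m]), max(row[n] for row in F))
```

The same code covers both cases named on the slide: an overlap such as `overlap_score("AAACGT", "CGTTTT")`, and a nested sequence such as `overlap_score("ACGT", "TTACGTTT")`.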

Things to keep in mind about pairwise alignments There may be multiple optima Optimality is only guaranteed with respect to the scoring function – the alignment may still be biologically wrong! O(mn) is still too big when n is the size of a major sequence database

Summary Dotplots are an excellent visual tool to decide whether and what kind of alignment is appropriate PAM and BLOSUM series matrices provide empirical LOD values for scoring alignments Different flavors of alignment are produced by variants on a basic dynamic programming algorithm Needleman-Wunsch for global alignments Smith-Waterman for local alignments Affine gap penalties balance biological realism with computational feasibility

Reading assignment Nicholas et al. (2002) Strategies for multiple sequence alignment. Biotechniques 32, 572-591 (handout)