CS 5263 Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.

Slides:



Advertisements
Similar presentations
Hidden Markov Models (1)  Brief review of discrete time finite Markov Chain  Hidden Markov Model  Examples of HMM in Bioinformatics  Estimations Basic.
Advertisements

Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Global Sequence Alignment.
BLAST Sequence alignment, E-value & Extreme value distribution.
Lecture 8 Alignment of pairs of sequence Local and global alignment
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Linear-Space Alignment. Subsequences and Substrings Definition A string x’ is a substring of a string x, if x = ux’v for some prefix string u and suffix.
Genomic Sequence Alignment. Overview Dynamic programming & the Needleman-Wunsch algorithm Local alignment—BLAST Fast global alignment Multiple sequence.
Lecture outline Database searches
Computational Genomics Lecture 1, Tuesday April 1, 2003.
Heuristic alignment algorithms and cost matrices
CS 5263 Bioinformatics Lecture 5: Affine Gap Penalties.
Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:
Sequence Alignment. CS262 Lecture 2, Win06, Batzoglou Complete DNA Sequences More than 300 complete genomes have been sequenced.
Linear-Space Alignment. Linear-space alignment Using 2 columns of space, we can compute for k = 1…M, F(M/2, k), F r (M/2, N – k) PLUS the backpointers.
Sequence Alignment Cont’d. Needleman-Wunsch with affine gaps Initialization:V(i, 0) = d + (i – 1)  e V(0, j) = d + (j – 1)  e Iteration: V(i, j) = max{
Sequence Alignment Cont’d. Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Sequence Alignment.
Sequence Alignment Lecture 2, Thursday April 3, 2003.
Sequence Alignment. Before we start, administrivia Instructor: Serafim Batzoglou, CS x Office hours: Monday 2:00-3:30 TA:
Alignments and Comparative Genomics. Welcome to CS374! Today: Serafim: Alignments and Comparative Genomics Omkar: Administrivia.
Sequence Alignment Cont’d. Evolution Scoring Function Sequence edits: AGGCCTC  Mutations AGGACTC  Insertions AGGGCCTC  Deletions AGG.CTC Scoring Function:
CS 6293 Advanced Topics: Current Bioinformatics Lectures 3-4: Pair-wise Sequence Alignment.
Sequence Alignment Cont’d. Linear-space alignment Iterate this procedure to the left and right! N-k * M/2 k*k*
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Sequence Alignment Slides courtesy of Serafim Batzoglou, Stanford Univ.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
CS262 Lecture 4, Win07, Batzoglou Heuristic Local Alignerers 1.The basic indexing & extension technique 2.Indexing: techniques to improve sensitivity Pairs.
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
Sequence Alignment Lecture 2, Thursday April 3, 2003.
Protein Sequence Comparison Patrice Koehl
Sequence Alignment. CS262 Lecture 3, Win06, Batzoglou Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition.
Sequence Alignment. -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N,
Sequence alignment, E-value & Extreme value distribution
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
An Introduction to Bioinformatics
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
Comp. Genomics Recitation 3 The statistics of database searching.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment.
CS 5263 Bioinformatics Lecture 6: Sequence Alignment Statistics.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Algorithms (BLAST)
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment.
CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment.
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Lectures 3-6: Pair-wise Sequence Alignment
Lecture 5: Local Sequence Alignment Algorithms
CS 6293 Advanced Topics: Translational Bioinformatics
Lecture 6: Sequence Alignment Statistics
Basic Local Alignment Search Tool
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

CS 5263 Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment

Outline Part I: Algorithms –Biological problem –Intro to dynamic programming –Global sequence alignment –Local sequence alignment –More efficient algorithms Part II: Biological issues –Model gaps more accurately –Alignment statistics Part III: BLAST

Evolution at the DNA level …ACGGTGCAGTCACCA… …ACGTTGC-GTCCACCA… C DNA evolutionary events (sequence edits): Mutation, deletion, insertion

Sequence conservation implies function OK X X Still OK? next generation

Why sequence alignment? Conserved regions are more likely to be functional –Can be used for finding genes, regulatory elements, etc. Similar sequences often have similar origin and function –Can be used to predict functions for new genes / proteins Sequence alignment is one of the most widely used computational tools in biology

Global Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition An alignment of two strings S, T is a pair of strings S ’, T ’ (with spaces) s.t. (1) |S ’ | = |T ’ |, and (|S| = “ length of S ” ) (2) removing all spaces in S ’, T ’ leaves S, T AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC S T S’ T’

What is a good alignment? Alignment: The “ best ” way to match the letters of one sequence with those of the other How do we define “ best ” ?

The score of aligning (characters or spaces) x & y is σ (x,y). Score of an alignment: An optimal alignment: one with max score S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Scoring Function Sequence edits: AGGCCTC –Mutations AGGACTC –InsertionsAGGGCCTC –DeletionsAGG-CTC Scoring Function: Match: +m~~~AAC~~~ Mismatch: -s~~~A-A~~~ Gap (indel):-d

Match = 2, mismatch = -1, gap = -1 Score = 3 x 2 – 2 x 1 – 1 x 1 = 3

More complex scoring function Substitution matrix –Similarity score of matching two letters a, b should reflect the probability of a, b derived from the same ancestor –It is usually defined by log likelihood ratio –Active research area. Especially for proteins. –Commonly used: PAM, BLOSUM

An example substitution matrix ACGT A3-2-2 C3 G3-2 T3

How to find an optimal alignment? A naïve algorithm: for all subseqs A of S, B of T s.t. |A| = |B| do align A[i] with B[i], 1 ≤i ≤|A| align all other chars to spaces compute its value retain the max end output the retained alignment S = abcd A = cd T = wxyz B = xz -abc-d a-bc-d w--xyz -w-xyz

Analysis Assume |S| = |T| = n Cost of evaluating one alignment: ≥n How many alignments are there: –pick n chars of S,T together –say k of them are in S –match these k to the k unpicked chars of T Total time: E.g., for n = 20, time is > 2 40 >10 12 operations

Intro to Dynamic Programming

Dynamic programming What is dynamic programming? –A method for solving problems exhibiting the properties of overlapping subproblems and optimal substructureoverlapping subproblemsoptimal substructure –Key idea: tabulating sub-problem solutions rather than re-computing them repeatedly Two simple examples: –Computing Fibonacci numbers –Find the special shortest path in a grid

Example 1: Fibonacci numbers 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, … F(0) = 1; F(1) = 1; F(n) = F(n-1) + f(n-2) How to compute F(n)?

A recursive algorithm function fib(n) if (n == 0 or n == 1) return 1; else return fib(n-1) + fib(n-2); F(9) F(8)F(7) F(6) F(5) F(6)F(5) F(4)F(5) F(4) F(3)

Time complexity: –Between 2 n/2 and 2 n –O(1.62 n ), i.e. exponential Why recursive Fib algorithm is inefficient? –Overlapping subproblems n/2 n

An iterative algorithm function fib(n) F[0] = 1;F[1] = 1; for i = 2 to n F[i] = F[i-1] + F[i-2]; Return F[n]; Time complexity: Time: O(n), space: O(n)

Example 2: shortest path in a grid S G m n Each edge has a length (cost). We need to get to G from S. Can only move right or down. Aim: find a path with the minimum total length

Optimal substructures Naïve algorithm: enumerate all possible paths and compare costs –Exponential number of paths Key observation: –If a path P(S, G) is the shortest from S to G, any of its sub-path P(S,x), where x is on P(S,G), is the shortest from S to x

Proof Proof by contradiction –If the path between P(S,x) is not the shortest, i.e., P’(S,x) < P(S,x) –Construct a new path P’(S,G) = P’(S,x) + P(x, G) –P’(S,G) P(S,G) is not the shortest –Contradiction –Therefore, P(S, x) is the shortest S G x

Recursive solution Index each intersection by two indices, (i, j) Let F(i, j) be the total length of the shortest path from (0, 0) to (i, j). Therefore, F(m, n) is the shortest path we wanted. To compute F(m, n), we need to compute both F(m-1, n) and F(m, n-1) m n (0,0) (m, n) F(m-1, n) + length((m-1, n), (m, n)) F(m, n) = min F(m, n-1) + length((m, n-1), (m, n))

Recursive solution But: if we use recursive call, many subpaths will be recomputed for many times Strategy: pre-compute F values starting from the upper-left corner. Fill in row by row (what other order will also do?) m n F(i-1, j) + length((i-1, j), (i, j)) F(i, j) = min F(i, j-1) + length((i, j-1), (i, j)) (0,0) (m, n) (i, j) (i-1, j) (i, j-1)

Dynamic programming illustration S G F(i-1, j) + length(i-1, j, i, j) F(i, j) = min F(i, j-1) + length(i, j-1, i, j)

Trackback

Elements of dynamic programming Optimal sub-structures –Optimal solutions to the original problem contains optimal solutions to sub-problems Overlapping sub-problems –Some sub-problems appear in many solutions Memorization and reuse –Carefully choose the order that sub-problems are solved

Dynamic Programming for sequence alignment Suppose we wish to align x 1 ……x M y 1 ……y N Let F(i,j) = optimal score of aligning x 1 ……x i y 1 ……y j Scoring Function: Match: +m Mismatch: -s Gap (indel):-d

Elements of dynamic programming Optimal sub-structures –Optimal solutions to the original problem contains optimal solutions to sub-problems Overlapping sub-problems –Some sub-problems appear in many solutions Memorization and reuse –Carefully choose the order that sub-problems are solved

Optimal substructure If x[i] is aligned to y[j] in the optimal alignment between x[1..M] and y[1..N], then The alignment between x[1..i] and y[1..j] is also optimal Easy to prove by contradiction... 12iM 12 j N x:x: y:y:

Recursive solution Notice three possible cases: 1.x M aligns to y N ~~~~~~~ x M ~~~~~~~ y N 2.x M aligns to a gap ~~~~~~~ x M ~~~~~~~  3.y N aligns to a gap ~~~~~~~  ~~~~~~~ y N m, if x M = y N F(M,N) = F(M-1, N-1) + -s, if not F(M,N) = F(M-1, N) - d F(M,N) = F(M, N-1) - d max

Recursive solution Generalize: F(i-1, j-1) +  (X i,Y j ) F(i,j) = max F(i-1, j) – d F(i, j-1) – d  (X i,Y j ) = m if X i = Y j, and –s otherwise Boundary conditions: –F(0, 0) = 0. –F(0, j) = ? –F(i, 0) = ? -jd: y[1..j] aligned to gaps. -id: x[1..i] aligned to gaps.

What order to fill? F(0,0) F(M,N) F(i, j)F(i, j-1) F(i-1, j)F(i-1, j-1) i j

What order to fill? F(0,0) F(M,N)

Example x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A T A F(i,j) i = j = F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

Example x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A T-2 A-3 j = F(i,j) i = F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

Example x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T A-3 j = F(i,j) i = F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

Example x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 j = F(i,j) i = F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

Example x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 02 j = Optimal Alignment: F(4,3) = 2 F(i,j) i = F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

Example x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 02 j = Optimal Alignment: F(4,3) = 2 This only tells us the best score F(i,j) i = F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

Trace-back x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 02 j = F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d F(i,j) i = A A

Trace-back AGTA A10 -2 T 0010 A-3 02 F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d x = AGTAm = 1 y = ATAs = 1 d = 1 j = F(i,j) i = TA TA

Trace-back x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 02 j = F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d F(i,j) i = GTA -TA

Trace-back x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 02 j = F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d F(i,j) i = AGTA A-TA

Trace-back x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 02 j = Optimal Alignment: F(4,3) = 2 AGTA A  TA F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d F(i,j) i =

Using trace-back pointers x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A T-2 A-3 j = F(i,j) i =

Using trace-back pointers x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T A-3 j = F(i,j) i =

Using trace-back pointers x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 j = F(i,j) i =

Using trace-back pointers x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 02 j = F(i,j) i =

Using trace-back pointers x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 02 j = F(i,j) i =

Using trace-back pointers x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 02 j = F(i,j) i =

Using trace-back pointers x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 02 j = F(i,j) i =

Using trace-back pointers x = AGTAm = 1 y = ATAs = 1 d = 1 AGTA A10 -2 T 0010 A-3 02 j = Optimal Alignment: F(4,3) = 2 AGTA A  TA F(i,j) i =

The Needleman-Wunsch Algorithm 1.Initialization. a.F(0, 0) = 0 b.F(0, j) = - j  d c.F(i, 0)= - i  d 2.Main Iteration. Filling in scores a.For each i = 1……M For each j = 1……N F(i-1,j) – d [case 1] F(i, j) = max F(i, j-1) – d [case 2] F(i-1, j-1) + σ(x i, y j ) [case 3] UP, if [case 1] Ptr(i,j)= LEFTif [case 2] DIAGif [case 3] 3.Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

Complexity Time: O(NM) Space: O(NM) Linear-space algorithms do exist (with the same time complexity)

Equivalent graph problem (0,0) (3,4) A G TA A A T S1 = S2 = Number of steps: length of the alignment Path length: alignment score Optimal alignment: find the longest path from (0, 0) to (3, 4) General longest path problem cannot be found with DP. Longest path on this graph can be found by DP since no cycle is possible.  : a gap in the 2 nd sequence  : a gap in the 1 st sequence : match / mismatch Value on vertical/horizontal line: -d Value on diagonal: m or -s 1

Question If we change the scoring scheme, will the optimal alignment be changed? –Old: Match = 1, mismatch = gap = -1 –New: match = 2, mismatch = gap = 0 –New: Match = 2, mismatch = gap = -2?

Question What kind of alignment is represented by these paths? A BCBC A BCBC A BCBC A BCBC A BCBC A- BC A-- -BC --A BC- -A- B-C -A BC Alternating gaps are impossible if –s > -2d

A variant of the basic algorithm Scoring scheme: m = s = d: 1 Seq1: CAGCA-CTTGGATTCTCGG || |:||| Seq2: ---CAGCGTGG Seq1: CAGCACTTGGATTCTCGG |||| | | || Seq2: CAGC-----G-T----GG The first alignment may be biologically more realistic in some cases (e.g. if we know s2 is a subsequence of s1) Score = -7 Score = -2

A variant of the basic algorithm Maybe it is OK to have an unlimited # of gaps in the beginning and end: CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG Then, we don ’ t want to penalize gaps in the ends

The Overlap Detection variant Changes: 1.Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2.Termination max i F(i, N) F OPT = max max j F(M, j) x 1 ……………………………… x M y N ……………………………… y 1

Different types of overlaps x y x y

The local alignment problem Given two strings X = x 1 ……x M, Y = y 1 ……y N Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. X = abcxdex X’ = cxde Y = xxxcde Y’ = c-de x y

Why local alignment Conserved regions may be a small part of the whole –Global alignment might miss them if flanking “junk” outweighs similar regions Genes are shuffled between genomes A A B CD BC D

Naïve algorithm for all substrings X’ of X and Y’ of Y Align X’ & Y’ via dynamic programming Retain pair with max value end ; Output the retained pair Time: O(n 2 ) choices for A, O(m 2 ) for B, O(nm) for DP, so O(n 3 m 3 ) total.

Reminder The overlap detection algorithm –We do not give penalty to gaps at either end Free gap

The local alignment idea Do not penalize the unaligned regions (gaps or mismatches) The alignment can start anywhere and ends anywhere Strategy: whenever we get to some low similarity region (negative score), we restart a new alignment –By resetting alignment score to zero

The Smith-Waterman algorithm Initialization: F(0, j) = F(i, 0) = 0 0 F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) +  (x i, y j ) Iteration: F(i, j) = max

The Smith-Waterman algorithm Termination: 1.If we want the best local alignment… F OPT = max i,j F(i, j) 2.If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back

xxxcde a0 b0 c0 x0 d0 e0 x0 Match: 2 Mismatch: -1 Gap: -1

xxxcde a b c0 x0 d0 e0 x0 Match: 2 Mismatch: -1 Gap: -1

xxxcde a b c x0 d0 e0 x0 Match: 2 Mismatch: -1 Gap: -1

xxxcde a b c x d0 e0 x0 Match: 2 Mismatch: -1 Gap: -1

xxxcde a b c x d e0 x0 Match: 2 Mismatch: -1 Gap: -1

xxxcde a b c x d e x0 Match: 2 Mismatch: -1 Gap: -1

xxxcde a b c x d e x Match: 2 Mismatch: -1 Gap: -1

Trace back xxxcde a b c x d e x Match: 2 Mismatch: -1 Gap: -1

Trace back xxxcde a b c x d e x Match: 2 Mismatch: -1 Gap: -1 cxde | || c-de x-de | || xcde

No negative values in local alignment DP array Optimal local alignment will never have a gap on either end Local alignment: “Smith-Waterman” Global alignment: “Needleman-Wunsch”

Analysis Time: –O(MN) for finding the best alignment –Time to report all alignments depends on the number of sub-opt alignments Memory: –O(MN) –O(M+N) possible

More efficient alignment algorithms

Given two sequences of length M, N Time: O(MN) –Ok, but still slow for long sequences Space: O(MN) –bad –1Mb seq x 1Mb seq = 1TB memory Can we do better?

Bounded alignment Good alignment should appear near the diagonal

Bounded Dynamic Programming If we know that x and y are very similar Assumption: # gaps(x, y) < k xixi Then,|implies | i – j | < k yj yj

Bounded Dynamic Programming Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+  (x i, y j ) F(i, j) = max F(i, j – 1) – d, if j > i – k F(i – 1, j) – d, if j < i + k Termination:same x 1 ………………………… x M y N ………………………… y 1 k

Analysis Time: O(kM) << O(MN) Space: O(kM) with some tricks 2k M => M

Given two sequences of length M, N Time: O(MN) –ok Space: O(MN) –bad –1mb seq x 1mb seq = 1TB memory Can we do better?

Linear space algorithm If all we need is the alignment score but not the alignment, easy! We only need to keep two rows (You only need one row, with a little trick) But how do we get the alignment?

Linear space algorithm When we finish, we know how we have aligned the ends of the sequences Naïve idea: Repeat on the smaller subproblem F(M-1, N-1) Time complexity: O((M+N)(MN)) XMYNXMYN

(0, 0) (M, N) M/2 Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.

Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k)) (0,0) (6,6) (3,2) (3,4)(3,6)(3,0)

Hirschberg’s idea Divide and conquer! M/2 F(M/2, k) represents the best alignment between x 1 x 2 …x M/2 and y 1 y 2 …y k Forward algorithm Align x 1 x 2 …x M/2 with Y X Y

Backward Algorithm M/2 B(M/2, k) represents the best alignment between reverse(x M/2+1 …x M ) and reverse(y k y k+1 …y N ) Backward algorithm Align reverse(x M/2+1 …x M ) with reverse(Y) Y X

Linear-space alignment Using 2 (4) rows of space, we can compute for k = 1…N, F(M/2, k), B(M/2, k) M/2

Linear-space alignment Now, we can find k * maximizing F(M/2, k) + B(M/2, k) Also, we can trace the path exiting column M/2 from k * Conclusion: In O(NM) time, O(N) space, we found optimal alignment path at row M/2

Linear-space alignment Iterate this procedure to the two sub-problems! N-k * M/2 k*k*

Analysis Memory: O(N) for computation, O(N+M) to store the optimal alignment Time: –MN for first iteration –k M/2 + (N-k) M/2 = MN/2 for second –… k N-k M/2

MNMN/2MN/4 MN/8 MN + MN/2 + MN/4 + MN/8 + … = MN (1 + ½ + ¼ + 1/8 + 1/16 + …) = 2MN = O(MN)

Outline Part I: Algorithms –Biological problem –Intro to dynamic programming –Global sequence alignment –Local sequence alignment –More efficient algorithms Part II: Biological issues –Model gaps more accurately –Alignment statistics Part III: BLAST

How to model gaps more accurately

What’s a better alignment? GACGCCGAACG ||||| ||| GACGC---ACG GACGCCGAACG |||| | | || GACG-C-A-CG Score = 8 x m – 3 x d However, gaps usually occur in bunches. -During evolution, chunks of DNA may be lost entirely -Aligning genomic sequences vs. cDNAs (reverse complimentary to mRNAs)

Model gaps more accurately Current model: –Gap of length n incurs penalty n  d General: –Convex function –E.g.  (n) = c * sqrt (n)   n n

General gap dynamic programming Initialization:same Iteration: F(i-1, j-1) + s(x i, y j ) F(i, j) = max max k=0…i-1 F(k,j) –  (i-k) max k=0…j-1 F(i,k) –  (j-k) Termination: same Running Time: O((M+N)MN)(cubic) Space: O(NM) (linear-space algorithm not applicable)

Compromise: affine gaps  (n) = d + (n – 1)  e | | gap open extension d e  (n) Match: 2 Gap open: -5 Gap extension: -1 GACGCCGAACG ||||| ||| GACGC---ACG GACGCCGAACG |||| | | || GACG-C-A-CG 8x2-5-2 = 98x2-3x5 = 1 We want to find the optimal alignment with affine gap penalty in O(MN) time O(MN) or better O(M+N) memory

Allowing affine gap penalties Still three cases –x i aligned with y j –x i aligned to a gap Are we continuing a gap in x? (if no, start is more expensive) –y j aligned to a gap Are we continuing a gap in y? (if no, start is more expensive) We can use a finite state machine to represent the three cases as three states –The machine has two heads, reading the chars on the two strings separately –At every step, each head reads 0 or 1 char from each sequence –Depending on what it reads, goes to a different state, and produces different scores

Finite State Machine F: have just read 1 char from each seq (x i aligned to y j ) Ix: have read 0 char from x. (y j aligned to a gap) Iy: have read 0 char from y ( x i aligned to a gap) F Ix Iy ? / ? Input Output State

F Ix Iy (x i,y j ) /  (x i,-) / d (x i,-) / e (-, y j ) / d (-, y j ) / e Input Output Start state Current stateInputOutputNext state F (x i,y j )  F F (-,y j )d Ix F (x i,-)d Iy Ix (-,y j )e Ix … …… …

AAC ACT F-F-F-FF-F-F-F AAC ||| ACT F-Iy-F-F-IxF-Iy-F-F-Ix AAC- || -ACT F-F-Iy-F-IxF-F-Iy-F-Ix AAC- | A-CT F Ix Iy (x i,y j ) /  (x i,-) / d (x i,-) / e (-, y j ) / d (-, y j ) / e start state Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM. Optimal alignment: find a state path to read the two sequences such that the total output score is the highest

Dynamic programming We encode this information in three different matrices For each element (i,j) we use three variables –F(i,j): best alignment (score) of x 1..x i & y 1..y j if x i aligns to y j –I x (i,j): best alignment of x 1..x i & y 1..y j if y j aligns to gap –I y (i,j): best alignment of x 1..x i & y 1..y j if x i aligns to gap xixi yjyj xixi yjyj xixi yjyj F(i, j) Ix(i, j) Iy(i, j)

F Ix Iy (x i,y j ) /  (x i,-) /d (x i,-)/e (-, y j ) /d (-, y j )/e F(i-1, j-1) +  (x i, y j ) F(i, j) = max Ix(i-1, j-1) +  (x i, y j ) Iy(i-1, j-1) +  (x i, y j ) xixi yjyj

F Ix Iy (x i,y j ) /  (x i,-) /d (x i,-)/e (-, y j ) /d (-, y j )/e F(i, j-1) + d Ix(i, j) = max Ix(i, j-1) + e xixi yjyj Ix(i, j)

F Ix Iy (x i,y j ) /  (x i,-) /d (x i,-)/e (-, y j ) /d (-, y j )/e F(i-1, j) + d Iy(i, j) = max Iy(i-1, j) + e xixi yjyj Iy(i, j)

F(i – 1, j – 1) F(i, j) =  (x i, y j ) + max I x (i – 1, j – 1) I y (i – 1, j – 1) F(i, j – 1) + d I x (i, j) = max I x (i, j – 1) + e F(i – 1, j) + d I y (i, j) = max I y (i – 1, j) + e Continuing alignment Closing gaps in x Closing gaps in y Opening a gap in x Gap extension in x Opening a gap in y Gap extension in y

Data dependency F IxIx IyIy i j i-1 j-1 i-1 j-1

Data dependency IyIy IxIx F i j i j i j

If we stack all three matrices –No cyclic dependency –Therefore, we can fill in all three matrices in order

Algorithm for i = 1:m –for j = 1:n Fill in F(i, j), I x (i, j), I y (i, j) –end end F(M, N) = max (F(M, N), I x (M, N), I y (M, N)) Time: O(MN) Space: O(MN) or O(N) when combined with the linear-space algorithm

Exercise x = GCAC y = GCC m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- -- -- -- -- -- -- -- - - -- -- -- F: aligned on bothIy: Insertion on y F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Ix: Insertion on x  (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- 2 -- -- -- -- -- -- -- - - -- -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1)  (xi, yj) = 2 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- 2-7 -- -- -- -- -- -- -- - - -- -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1)  (xi, yj) = -2 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -- -- -- -- -- -- - - -- -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1)  (xi, yj) = -2 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -- -- -- -- -- - -- -3 -- -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Ix(i,j) Ix(i,j-1) F(i,j-1) d = -5 e = -1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -- -- -- -- -- - -- - -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Ix(i,j) Ix(i,j-1) F(i,j-1) d = -5 e = -1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -- -- -- -- -- -5 -- -- -- - -- - -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Iy(i,j) Iy(i-1,j) F(i-1,j) d=-5 e=-1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -7 -- -- -- -- -- -5 -- -- -- - -- - -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1)  (xi, yj) = -2 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- -- -- -- -- -5 -- -- -- - -- - -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1)  (xi, yj) = 2 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- -- -- -- -- -5 -- -- -- - -- - -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1)  (xi, yj) = 2 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- -- -- -- -- -5 -- -- -- - -- - -- -12 -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Ix(i,j) Ix(i,j-1) F(i,j-1) d = -5 e = -1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- -- -- -- -- -5 -- -- -- - -- - -- -12 -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Iy(i,j) Iy(i-1,j) F(i-1,j) d=-5 e=-1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- -- -- -- -- -5 -- -- -- - -- - -- -12 -- -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j)  (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - - - -- -- -- -- -- -- - -- - -- -12 -- -- - FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j)  (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- - -- -- -- -- -- -- - -- - -- -12 -- -- - FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Iy(i,j) Iy(i-1,j) F(i-1,j) d=-5 e=-1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- - -- -- -- -- -- -- - -- - -- -12 -- -- - FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = Iy(i,j) Iy(i-1,j) F(i-1,j) d=-5 e=-1 m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- - - -- -- -5 -- -- -- - -- - -- -12 -- -- - -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j)  (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- - - -- -- -5 -- -- -- - -- - -- -12 -- -- - -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j)  (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

0 -- -- -- -- - -74 -- - - -- -- -5 -- -- -- - -- - -- -12 -- -- - -- FIy Ix G C C GCACGCAC GCACGCAC GCACGCAC GCAC || | GC-C x = y = x = y = x = y = x y GCACGCAC G C C x = y = m = 2 s = -2 d = -5 e = -1

Statistics of alignment Where does  (x i, y j ) come from? Are two aligned sequences actually related?

Probabilistic model of alignments We’ll first focus on protein alignments without gaps Given an alignment, we can consider two possible models –R: the sequences are related by evolution –U: the sequences are unrelated How can we distinguish these two models? How is this view related to amino-acid substitution matrix?

Model for unrelated sequences Assume each position of the alignment is independently sampled from some distribution of amino acids p s : probability of amino acid s in the sequences Probability of seeing an amino acid s aligned to an amino acid t by chance is –Pr(s, t | U) = p s * p t Probability of seeing an ungapped alignment between x = x 1 …x n and y = y 1 …y n randomly is i

Model for related sequences Assume each pair of aligned amino acids evolved from a common ancestor Let q st be the probability that amino acid s in one sequence is related to t in another sequence The probability of an alignment of x and y is give by

Probabilistic model of Alignments How can we decide which model (U or R) is more likely? One principled way is to consider the relative likelihood of the two models (the odds ratio) –A higher ratio means that R is more likely than U

Log odds ratio Taking logarithm, we get Recall that the score of an alignment is given by

Therefore, if we define We are actually defining the alignment score as the log odds ratio between the two models R and U

How to get the probabilities? p s can be counted from the available protein sequences But how do we get q st ? (the probability that s and t have a common ancestor) Counted from trusted alignments of related sequences

Protein Substitution Matrices Two popular sets of matrices for protein sequences –PAM matrices [Dayhoff et al, 1978] Better for aligning closely related sequences –BLOSUM matrices [Henikoff & Henikoff, 1992] For both closely or remotely related sequences

BLOSUM-N matrices Constructed from a database called BLOCKS Contain many closely related sequences –Conserved amino acids may be over-counted N = 62: the probabilities q st were computed using trusted alignments with no more than 62% identity –identity: % of matched columns Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%)

Positive for chemically similar substitution Common amino acids get low weights Rare amino acids get high weights : Scaling factor to convert score to integer. Important: when you are told that a scoring matrix is in half-bits => = ½ ln2

BLOSUM-N matrices If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with higher N. say BLOSUM75 On the other hand, if you want to detect remote homology, you may want to use lower N, say BLOSUM50 BLOSUM-62: good for most purposes Weak homologyStrong homology

For DNAs No database of trusted alignments to start with Specify the percentage identity you would like to detect You can then get the substitution matrix by some calculation

For example Suppose p A = p C = p T = p G = 0.25 We want 88% identity q AA = q CC = q TT = q GG = 0.22, the rest = 0.12/12 = 0.01  (A, A) =  (C, C) =  (G, G) =  (T, T) = log (0.22 / (0.25*0.25)) = 1.26  (s, t) = log (0.01 / (0.25*0.25)) = for s ≠ t.

Substitution matrix ACGT A C G T 1.26

Scale won’t change the alignment Multiply by 4 and then round off to get integers ACGT A5-7 C 5 G 5 T 5

Arbitrary substitution matrix Say you have a substitution matrix provided by someone It’s important to know what you are actually looking for when you use the matrix

What’s the difference? Which one should I use for my sequences? ACGT A1-2 C 1 G 1 T 1 ACGT A5-4 C 5 G 5 T 5 NCBI-BLASTWU-BLAST

We had Scale it, so that Reorganize:

Since all probabilities must sum to 1, We have Suppose again p s = 0.25 for any s We know  (s, t) from the substitution matrix We can solve the equation for λ Plug λ into to get q st

ACGT A1-2 C 1 G 1 T 1 ACGT A5-4 C 5 G 5 T 5 = 1.33 q st = 0.24 for s = t, and for s ≠ t Translate: 95% identity = 0.19 q st = 0.16 for s = t, and 0.03 for s ≠ t Translate: 65% identity NCBI-BLASTWU-BLAST

Details for solving Known:  (s,t) = 1 for s=t, and  (s,t) = -2 for s  t. Since and  s,t q st = 1, we have 12 * ¼ * ¼ * e * ¼ * ¼ * e = 1 Let e = x, we have ¾ x -2 + ¼ x = 1. Hence, x 3 – 4x = 0; X has three solutions: 3.8, 1, -0.8 Only the first solution leads to a positive = ln (3.8) = 1.33 ACGT A1-2 C 1 G 1 T 1

Statistics of alignment Where does  (x i, y j ) come from? Are two aligned sequences actually related?

Statistics of Alignment Scores Q: How do we assess whether an alignment provides good evidence for homology (i.e., the two sequences are evolutionarily related)? –Is a score 82 good? What about 180? A: determine how likely it is that such an alignment score would result from chance

P-value of alignment p-value –The probability that the alignment score can be obtained from aligning random sequences –Small p-value means the score is unlikely to happen by chance The most common thresholds are 0.01 and 0.05 –Also depend on purpose of comparison and cost of misclaim

Statistics of global seq alignment Theory only applies to local alignment For global alignment, your best bet is to do Monte-Carlo simulation –What’s the chance you can get a score as high as the real alignment by aligning two random sequences? Procedure –Given sequence X, Y –Compute a global alignment (score = S) –Randomly shuffle sequence X (or Y) N times, obtain X 1, X 2, …, X N –Align each X i with Y, (score = R i ) –P-value: the fraction of R i >= S

Human HEXA Fly HEXO1 Score = -74

-74 Distribution of the alignment scores between fly HEXO1 and 200 randomly shuffled human HEXA sequences There are 88 random sequences with alignment score >= -74. So: p-value = 88 / 200 = 0.44 => alignment is not significant

…………………………………………………… Mouse HEXA Human HEXA Score = 732

732 Distribution of the alignment scores between mouse HEXA and 200 randomly shuffled human HEXA sequences No random sequences with alignment score >= 732 –So: the P-value is less than 1 / 200 = 0.05 To get smaller p-value, have to align more random sequences –Very slow Unless we can fit a distribution (e.g. normal distribution) –Such distribution may not be generalizable –No theory exists for global alignment score distribution

Statistics for local alignment Elegant theory exists Score for ungapped local alignment follows extreme value distribution (Gumbel distribution) Normal distribution Extreme value distribution An example extreme value distribution: Randomly sample 100 numbers from a normal distribution, and compute max Repeat 100 times. The max values will follow extreme value distribution

Statistics for local alignment Given two unrelated sequences of lengths M, N Expected number of ungapped local alignments with score at least S can be calculated by –E(S) = KMN exp[- S] –Known as E-value – : scaling factor as computed in last lecture –K: empirical parameter ~ 0.1 Depend on sequence composition and substitution matrix

P-value for local alignment score P-value for a local alignment with score S when P is small.

Example You are aligning two sequences, each has 1000 bases m = 1, s = -1, d = -inf (ungapped alignment) You obtain a score 20 Is this score significant?

= ln3 = 1.1 (computed as discussed on slide #41) E(S) = K MN exp{- S} E(20) = 0.1 * 1000 * 1000 * = 3 x P-value = 3 x << 0.05 The alignment is significant 20 Distribution of 1000 random sequence pairs

Multiple-testing problem Searching a 1000-base sequence against a database of 10 6 sequences (each of length 1000) How significant is a score 20 now? You are essentially comparing 1000 bases with 1000x10 6 = 10 9 bases (ignore edge effect) E(20) = 0.1 * 1000 * 10 9 * = 30 By chance we would expect to see 30 matches –The P-value (probability of seeing at least one match with score >= 30) is 1 – e -30 = –The alignment is not significant –Caution: it does NOT mean that the two sequences are unrelated. Rather, it simply means that you have NO confidence to say whether the two sequences are related.

Score threshold to determine significance You want a p-value that is very small (even after taking into consideration multiple-testing) What S will guarantee you a significant p-value? E(S)  P(S) << 1 =>KMN exp[- S] << 1 => log(KMN) - S < 0 => S > T + log(MN) / (T = log(K) /, usually small)

Score threshold to determine significance In the previous example –m = 1, s = -1, d = -inf => = 1.1 Aligning 1000bp vs 1000bp S > log(10 6 ) / 1.1 = 13. So 20 is significant. Searching 1000bp against 10 6 x 1000bp S > log(10 12 ) / 1.1 = 25. so 20 is not significant.

Statistics for gapped local alignment Theory not well developed Extreme value distribution works well empirically Need to estimate K and empirically –Given the database and substitution matrix, generate some random sequence pairs –Do local alignment –Fit an extreme value distribution to obtain K and

Alignment statistics summary How to obtain a substitution matrix? –Obtain q st and p s from established alignments (for DNA: from your knowledge) –Computing score: How to understand arbitrary substitution matrix? –Solve function to obtain and target q st –Which tells you what percent identity you are expecting How to understand alignment score? –probability that a score can be expected from chance. –Global alignment: Monte-Carlo simulation –Local alignment: Extreme Value Distribution Estimate p-value from a score Determine a score threshold without computing a p-value

Part III: Heuristic Local Sequence Alignment: BLAST

State of biological databases Sequenced Genomes: Human 3  10 9 Yeast1.2  10 7 Mouse2.7  10 9 Rat2.6  10 9 Neurospora 4  10 7 Fugu fish3.3  10 8 Tetraodon3  10 8 Mosquito 2.8  10 8 Drosophila1.2  10 8 Worm 1.0  10 8 Rice1.0  10 9 Arabidopsis1.2  10 8 sea squirts 1.6  10 8 Current rate of sequencing (before new-generation sequencing): 4 big labs  3  10 9 bp /year/lab 10s small labs Private sectors With new-generation sequencing: Easily generating billions of reads daily

Some useful applications of alignments Given a newly discovered gene, - Does it occur in other species? Assume we try Smith-Waterman: The entire genomic database Our new gene May take several weeks!

Some useful applications of alignments Given a newly sequenced organism, - Which subregions align with other organisms? -Potential genes - Other functional units Assume we try Smith-Waterman: The entire genomic database Our newly sequenced mammal 3  > 1000 years ???

BLAST Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990 –The most widely used bioinformatics tool Which is better: long mediocre match or a few nearby, short, strong matches with the same total score? –Score-wise, exactly equivalent –Biologically, later may be more interesting, & is common –At least, if must miss some, rather miss the former BLAST is a heuristic algorithm emphasizing the later –speed/sensitivity tradeoff: BLAST may miss former, but gains greatly in speed

BLAST Available at NCBI (National Center for Biotechnology Information) for download and online use. Along with many sequence databases Main idea: 1.Construct a dictionary of all the words in the query 2.Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman query DB

BLAST  Original Version Dictionary: All words of length k (~11 for DNA, 3 for proteins) Alignment initiated between words of alignment score  T (typically T = k) Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold …… query DB query scan

BLAST  Original Version A C G A A G T A A G G T C C A G T C C C T T C C T G G A T T G C G A Example: k = 4, T = 4 The matching word GGTC initiates an alignment Extension to the left and right with no gaps until alignment falls < 50% Output: GTAAGGTCC GTTAGGTCC

Gapped BLAST A C G A A G T A A G G T C C A G T C T G A T C C T G G A T T G C G A Added features: Pairs of words can initiate alignment Extensions with gaps in a band around anchor Output: GTAAGGTCCAGT GTTAGGTC-AGT

Example Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins] Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences) 1,726,556 sequences; 8,074,398,388 total letters >gi| |gb|AC | Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence Length = Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plusgi| |gb|AC | Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: tacacccagattacaccccga Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: tacacccagattacaccccga >gi| |gb|AC | Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence Length = Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plusgi| |gb|AC | Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: 3891 tacacccagattacaccccga 3911

Example Query: Human atoh enhancer, 179 letters[1.5 min] Result: 57 blast hits 1. gi| |gb|AF |AF Homo sapiens ATOH1 enhanc e-95 gi| |gb|AF |AF gi| |gb|AC | Mus musculus Strain C57BL6/J ch e-68gi| |gb|AC |264 3.gi| |gb|AF |AF Mus musculus Atoh1 enhanc e-66gi| |gb|AF |AF gi| |gb|AF | Gallus gallus CATH1 (CATH1) gene e-12gi| |gb|AF |78 5.gi| |emb|AL | Zebrafish DNA sequence from clo e-05gi| |emb|AL |54 6.gi| |gb|AC | Oryza sativa chromosome 10 BAC O gi| |gb|AC |44 7.gi| |ref|NM_ | Mus musculus suppressor of Ty gi| |ref|NM_ |42 8.gi| |gb|BC | Mus musculus, Similar to suppres gi| |gb|BC |42 gi| |gb|AF |AF218258gi| |gb|AF |AF Mus musculus Atoh1 enhancer sequence Length = 1517 Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%), Gaps = 2/177 (1%) Strand = Plus / Plus Query: 3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62 ||||||||||||| ||||||||||||||||||| |||||||||||||||||||||||||| Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203 Query: 63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122 |||||||||||||||||||||||||| ||||||||| |||||||||||||||| ||||| Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262 Query: 123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179 ||||||||||||| || ||| |||||||||||||||||||| ||||||||||||||| Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318

BLAST Score: bit score vs raw score Bit score is converted from raw score by taking into account K and :  S’ = ( S – log K) / log 2 To compute E-value from bit score:  E = KM’N’ e - S = M’N’ 2 -S’ Critical score is now:  S* = log 2 (M’N’)  If S’ >> S*: significant  If S’ << S*: not significant (M’ ~ M, N’ ~ N)

Different types of BLAST blastn: search nucleic acid databases blastp: search protein databases blastx: you give a nucleic acid sequence, search protein databases tblastn: you give a protein sequence, search nucleic acid databases tblastx: you give a nucleic sequence, search nucleic acid database, implicitly translate both into protein sequences

BLAST cons and pros Advantages –Fast!!!! –A few minutes to search a database of bases Disadvantages –Sensitivity may be low –Often misses weak homologies New improvement –Make it even faster Mainly for aligning very similar sequences or really long sequences –E.g. whole genome vs whole genome –Make it more sensitive PSI-BLAST: iteratively add more homologous sequences PatternHunter: discontinuous seeds

Variants of BLAST NCBI-BLAST: most widely used version WU-BLAST: (Washington University BLAST): another popular version Optimized, added features MEGABLAST: Optimized to align very similar sequences. Linear gap penalty BLAT: Blast-Like Alignment Tool BlastZ: Optimized for aligning two genomes PSI-BLAST: BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated Sensitive for weak homologies Slower

Pattern hunter Instead of exact matches of consecutive matches of k-mer, we can look for discontinuous matches –My query sequence looks like: ACGTAGACTAGCAGTTAAG –Search for sequences in database that match AXGXAGXCTAXC X stands for don’t care Seed:

Pattern hunter A good seed may give you both a higher sensitivity and higher specificity You may think is the best seed –Because mutation in the third position of a codon often doesn’t change the amino acid –Best seed is actually Empirically determined How to design such seed is an open problem May combine multiple random seeds

Summary Part I: Algorithms –Global sequence alignment: Needleman-Wunsch –Local sequence alignment: Smith-Waterman –Improvement on space and time Part II: Biological issues –Model gaps more accurately: affine gap penalty –Alignment statistics Part III: Heuristic algorithms – BLAST family