Pairwise Sequence Alignment

Pairwise Sequence Alignment
BMI 877 Colin Dewey March 14, 2017

Overview What does it mean to align sequences?
How do we cast sequence alignment as a computational problem? What algorithms exist for solving this computational problem? How do we cast sequence alignment as a statistical problem?

What is sequence alignment?
Pattern matching ...CATCGATGACTATCCG... ATGACTGT Database searching . CATGCATGCTGCGTAC CATGGTTGCTCACAAGTAC CATGCCTGCTGCGTAA TACGTGCCTGACCTGCGTAC CATGCCGAATGCTG CATGCTTGCTGGCGTAAA suffix trees, locality-sensitive hashing,... BLAST Optimization problem Needleman-Wunsch, Smith-Waterman,... Statistical problem Pair HMMs, TKF, Karlin-Altschul statistic...

A specific motivating task: read alignment
reads sequencing alignment ? genome

DNA sequence edits Substitutions: ACGA AGGA Insertions: ACGA ACCGGAGA
Deletions: ACGGAGA AGA Transpositions: ACGGAGA AAGCGGA Inversions: ACGGAGA ACTCCGA Bold characters are involved in edit Note that inversions create reverse complement

How is a pairwise alignment represented?
matching of corresponding positions in two sequences positions in one sequence with no corresponding positions in the other are matched with a space ‘-’ A group of consecutive spaces is a gap CA--GATTCGAAT CGCCGATT---AT gap

Two main classes of pairwise alignment
Global: All positions are aligned Local: A (contiguous) subset of positions are aligned CA--GAGTCGAAT CGCCGA-TC-A-- semi-global: find best match without penalizing gaps on the ends of the alignment ..GAGTC.... ....GA-TC.

Pairwise Alignment: Task Definition
Given a pair of sequences (DNA or protein) a method for scoring a candidate alignment Do find an alignment for which the score is maximized

Scoring An Alignment: What Is Needed?
substitution matrix S(a,b) indicates score of aligning character a with character b gap penalty function w(k) indicates cost of a gap of length k

Linear Gap Penalty Function
different gap penalty functions require somewhat different algorithms the simplest case is when a linear gap function is used where s is a constant we'll start by considering this case

Scoring an Alignment the score of an alignment is the sum of the scores for pairs of aligned characters plus the scores for gaps example: given the following alignment VAHV---D--DMPNALSALSDLHAHKL AIQLQVTGVVVTDATLKNLGSVHVSKG we would score it by S(V,A) + S(A,I) + S(H,Q) + S(V,L) + 3s + S(D,G) + 2s …

Number of Possible Alignments
given sequences of length m and n assume we don’t count as distinct and we can have as few as 0 and as many as min{m, n} aligned pairs therefore the number of possible alignments is given by C- -G -C G- Two strings of length=100 No gaps: 2N-1 possible alignments 199 in this example Gaps: 10**77 in this case See p. 188 in Waterman for an analysis of this issue. There must be k aligned pairs, 0 < k < min{n,m} There are n-choose-k ways to select the residues in n, and m-choose-k ways to select the residues in m So we have sum_{k>=0} (n-choose-k)(m-choosek) = (n+m choose k) possible alignments (for a local)

Number of Possible Alignments
there are possible global alignments for 2 sequences of length n e.g. two sequences of length 100 have possible alignments but we can use dynamic programming to find an optimal alignment efficiently Stirling’s formula n! ~ sqrt{2\pin}(n/e)^n Two strings of length=100 No gaps: 2N-1 possible alignments 199 in this example Gaps: 10**77 in this case See p. 188 in Waterman for an analysis of this issue. There must be k aligned pairs, 0 < k < min{n,m} There are n-choose-k ways to select the residues in n, and m-choose-k ways to select the residues in m So we have sum_{k>=0} (n-choose-k)(m-choosek) = (n+m choose k) possible alignments (for a local)

Pairwise Alignment Via Dynamic Programming
first algorithm by Needleman & Wunsch, Journal of Molecular Biology, 1970 dynamic programming algorithm: determine best alignment of two sequences by determining best alignment of all prefixes of the sequences String edit distance algorithms have been invented many times - Needleman & Wunsch around the first

Dynamic Programming Idea
consider last step in computing alignment of AAAC with AGC three possible options; in each we’ll choose a different pairing for end of alignment, and add this to the best alignment of previous characters AAA C AG AAAC C AG - Write out scenarios more abstractly Consider last column X_i aligned to Y_j X_i gapped Y_j gapped AAA - AGC C consider best alignment of these prefixes score of aligning this pair +

DP Algorithm for Global Alignment with Linear Gap Penalty
Subproblem: F(i,j) = score of best alignment of the length i prefix of x and the length j prefix of y. Note base cases

Dynamic Programming Implementation
given an n-character sequence x, and an m-character sequence y construct an (n+1)  (m+1) matrix F F ( i, j ) = score of the best alignment of x[1…i ] with y[1…j ] A C G score of best alignment of AAA to AG

Initializing Matrix: Global Alignment with Linear Gap Penalty
s 2s C G 3s 4s

DP Algorithm Sketch: Global Alignment
initialize first row and column of matrix fill in rest of matrix from top to bottom, left to right for each F ( i, j ), save pointer(s) to cell(s) that resulted in best score F (m, n) holds the optimal alignment score; trace pointers back from F (m, n) to F (0, 0) to recover alignment Note third point: pointer to cell that *resulted* in best score, not having best score

Global Alignment Example
suppose we choose the following scoring scheme: +1 -1 s (penalty for aligning with a space) = -2

Global Alignment Example
-2 -4 C G -6 -8 1 -1 -3 -5 one optimal alignment x: A A A C y: A G - C Do a couple of example cells: one where there are multiple maxes, the other with best value coming from entry that does not have biggest value among the three Do example traceback TopHat question but there are three optimal alignments here (can you find them?)

More On Gap Penalty Functions
a gap of length k is more probable than k gaps of length 1 a gap may be due to a single mutational event that inserted/deleted a stretch of characters separated gaps are probably due to distinct mutational events a linear gap penalty function treats these cases the same it is more common to use gap penalty functions involving two terms a penalty g associated with opening a gap a smaller penalty s for extending the gap

Gap Penalty Functions linear affine

Dynamic Programming for the Affine Gap Penalty Case
to do in time, need 3 matrices instead of 1 best score given that x[i] is aligned to y[j] best score given that x[i] is aligned to a gap Not enough to simply know best score of alignment of prefix i and prefix j best score given that y[j] is aligned to a gap

Global Alignment DP for the Affine Gap Penalty Case
Why not I_x to I_y or vice versa? Valid if smallest mismatch score is greater than or equal to 2s

Read alignment in practice
O(n2) time is too slow for read vs. genome Heuristic: find short substring exact (or slightly inexact) matches between read and genome (alignment “seeds”) perform DP only for limited area around the seeds Fast (and memory efficient) methods for finding short exact matches exist Burrows-Wheeler index (used by Bowtie and BWA)

Statistical questions
How can we express the uncertainty of an alignment? How can we estimate the parameters for alignment?

Picking Alignments Alignment 1 Alignment 2
mel TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATA pse T----TTTGATGTTGATTTCTTTACGAGTTTGATAGAACTAAACCCATAAAGCATTCAATTCGTAGCATATAGCTCTCCTCTGC * * ******************** * ** ************************** ******** mel CATGTGAAAATCCCAGCGAGAACTCCTTATTAATCCAG CGCAGTCGGCGGCGGCGGC pse CATTCGGCATGTGAAAA TCCTTATTAATCCAGAACGTGTGCGCCAGCGTCAGCGCCAGCGCCGGCAGCAGC ********** *************** *** ** **** ** ** mel GCGCAGTCAGC GGTGGCAGCGCAGTATATAAATAAAGTCTTATAAGAAACTCGTGAGCG pse CGCAG-CAGCAAAACGGCACGCTGGCAGCGGAGTATATAAATAA--TCTTATAAGAAACTCGTGTGAGCGCAACCGGGCAGCG ***** **** * ******** ************* ****************** * * mel AAAGAGAGCG-TTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTGGCAGCGACTGCGAC pse GCCAAAGAGAGCGATTTTATTTATGTG ACTGCGCTGCCTG GTCCTCGGC ********** ************* ** **** * * * * * ** * Alignment 1 Alignment summary: 27 mismatches, 12 gaps, 116 spaces Alignment summary: 45 mismatches, 4 gaps, 214 spaces mel TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATACATGTGAAAATC pse T----TTTGATGTTGATTTCTTTACGAGTTTGATAGAACTAAACCCATAAAGCATTCAATTCGTAGCATATAGCTCTCCTCTGC * * ******************** * ** ************************** ******** * * * mel CCAGCGAGA------ACTCCTTATTAATCCAGCGCAGTCGGCGGCGGCGGCGCGCAGTCAGCGGTGGCAGCGCAGTATATAAAT pse CATTCGGCATGTGAAAATCCTTATTAATCCAGAAC * ** * * *************** * mel AAAGTCTTATAAGAAACTCGTGAGCGAAAGAGAGCGTTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTG pse GTGTGCGCCAGCGTCAGCGCCAGCGCCGGCAGCAGCCGCA ****** ******* ** ** * ** * **** mel GCAGCGA pse GCAGCAAAACGGCACGCTGGCAGCGGAGTATATAAATAATCTTATAAGAAACTCGTGTGAGCGCAACCGGGCAGCGGCCAAAGA ***** * mel CTGCGAC pse GAGCGATTTTATTTATGTGACTGCGCTGCCTGGTCCTCGGC * ** * Alignment 2

An Elusive Cis-Regulatory Element
Antp: antennapedia >chr3R: TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAAT TCAAAGCATATACATGTGAAAATCCCAGCGAGAACTCCTTATTAATCCAGCGCAGTCGGC GGCGGCGGCGCGCAGTCAGCGGTGGCAGCGCAGTATATAAATAAAGTCTTATAAGAAACT CGTGAGCGAAAGAGAGCGTTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCA GCACTGGCAGCGACTGCGAC Adf1→Antp:06447: Binding site for transcription factor Adf1 Drosophila melanogaster polytene chromosomes

The Conservation of Adf1→Antp:06447
TGTGCGTCAGCGTCGGCCGCAACAGCG TGTG ACTGCG **** ** *** 33% identity mel TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATA pse T----TTTGATGTTGATTTCTTTACGAGTTTGATAGAACTAAACCCATAAAGCATTCAATTCGTAGCATATAGCTCTCCTCTGC * * ******************** * ** ************************** ******** mel CATGTGAAAATCCCAGCGAGAACTCCTTATTAATCCAG CGCAGTCGGCGGCGGCGGC pse CATTCGGCATGTGAAAA TCCTTATTAATCCAGAACGTGTGCGCCAGCGTCAGCGCCAGCGCCGGCAGCAGC ********** *************** *** ** **** ** ** mel GCGCAGTCAGC GGTGGCAGCGCAGTATATAAATAAAGTCTTATAAGAAACTCGTGAGCG pse CGCAG-CAGCAAAACGGCACGCTGGCAGCGGAGTATATAAATAA--TCTTATAAGAAACTCGTGTGAGCGCAACCGGGCAGCG ***** **** * ******** ************* ****************** * * mel AAAGAGAGCG-TTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTGGCAGCGACTGCGAC pse GCCAAAGAGAGCGATTTTATTTATGTG ACTGCGCTGCCTG GTCCTCGGC ********** ************* ** **** * * * * * ** * Alignment 1 Alignment summary: 27 mismatches, 12 gaps, 116 spaces Alignment summary: 45 mismatches, 4 gaps, 214 spaces TGTGCGTCAGCGTCGGCCGCAACAGCG TGTGCGCCAGCGTCAGCGCCAGCGCCG ****** ******* ** ** * ** 74% identity mel TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATACATGTGAAAATC pse T----TTTGATGTTGATTTCTTTACGAGTTTGATAGAACTAAACCCATAAAGCATTCAATTCGTAGCATATAGCTCTCCTCTGC * * ******************** * ** ************************** ******** * * * mel CCAGCGAGA------ACTCCTTATTAATCCAGCGCAGTCGGCGGCGGCGGCGCGCAGTCAGCGGTGGCAGCGCAGTATATAAAT pse CATTCGGCATGTGAAAATCCTTATTAATCCAGAAC * ** * * *************** * mel AAAGTCTTATAAGAAACTCGTGAGCGAAAGAGAGCGTTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTG pse GTGTGCGCCAGCGTCAGCGCCAGCGCCGGCAGCAGCCGCA ****** ******* ** ** * ** * **** mel GCAGCGA pse GCAGCAAAACGGCACGCTGGCAGCGGAGTATATAAATAATCTTATAAGAAACTCGTGTGAGCGCAACCGGGCAGCGGCCAAAGA ***** * mel CTGCGAC pse GAGCGATTTTATTTATGTGACTGCGCTGCCTGGTCCTCGGC * ** * Alignment 2

The Polytope 364 Vertices 760 Ridges 398 Facets
Sequence lengths = 260bp, 280bp 364 Vertices 760 Ridges 398 Facets TGTGCGTCAGCGTCGGCCGCAACAGCG TGTGCGCCAGCGTCAGCGCCAGCGCCG ****** ******* ** ** * ** TGTG ACTGCG **** ** ***

Methodological machinery to be used
Hidden Markov models (HMMs) Viterbi and Forward algorithms Profile HMMs Pair HMMs Connections to classical sequence alignment

Hidden Markov models Generative probabilistic models of sequences
Explicitly models unobserved (hidden) states that “emit” the characters of the observed sequence Primary task of interest is to infer the hidden states given the observed sequence Alignment case: hidden states = alignment

Two HMM random variables
Observed sequence Hidden state sequence HMM: Markov chain over hidden sequence Dependence between

The Parameters of an HMM
as in Markov chain models, we have transition probabilities probability of a transition from state k to l represents a path (sequence of states) through the model since we’ve decoupled states and characters, we also have emission probabilities Explain probability notation probability of emitting character b in state k

A Simple HMM with Emission Parameters
probability of a transition from state 1 to state 3 probability of emitting character A in state 2 0.4 A 0.4 C 0.1 G 0.2 T 0.3 A 0.1 C 0.4 G 0.4 T 0.1 A 0.2 C 0.3 G 0.3 T 0.2 begin end 0.5 0.2 0.8 0.6 0.1 0.9 5 4 3 2 1 G 0.1 T 0.4 Give example path and emitted sequence 0.8

How Likely is a Given Path and Sequence?
the probability that the path is taken and the sequence is generated: (assuming begin/end are the only silent states on path) Note that we have both transitions and emissions

Three Important Questions
How likely is a given sequence? the Forward algorithm What is the most probable “path” (sequence of hidden states) for generating a given sequence? the Viterbi algorithm How can we learn the HMM parameters given a set of sequences? the Forward-Backward (Baum-Welch) algorithm

How Likely Is A Given Path and Sequence?
0.4 0.2 A 0.4 C 0.1 G 0.2 T 0.3 A 0.2 C 0.3 G 0.3 T 0.2 0.8 0.6 0.5 1 3 begin end 5 A 0.4 C 0.1 G 0.1 T 0.4 A 0.1 C 0.4 G 0.4 T 0.1 0.5 0.9 0.2 \pi = 2 4 0.1 0.8

How can we use HMMs for pairwise alignment?
What is the observed sequence? one of the two sequences? both sequences? What is the hidden path? the alignment

Profile HMM for pairwise alignment
Select one sequence to be observed (the query) The other sequence (the reference) defines the states of the HMM Three classes of states Match: corresponds to aligned positions Delete: positions of the reference that are deleted in the query Insert: positions on the query that are insertions relative to the reference

Profile HMMs i 2 3 1 d m start end Delete states are silent; they
Account for characters missing in some sequences i 2 3 1 d m start end Insert states account for extra characters in some sequences A 0.01 R 0.12 D 0.04 N 0.29 C 0.01 E 0.03 Q 0.02 G 0.01 Insert and match states have emission distributions over sequence characters Match states represent key conserved positions

Example Profile HMM insert states delete states (silent) match states
Src-family tyrosine kinases (important role in signal transduction and regulation of cell growth) SH3 domain involved in interactions with proline-rich regions of other proteins Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences match states

Profile HMM considerations
Odd asymmetry: have to pick one sequence as reference Models conditional probability P(X|Y) of query sequence (X) given reference sequence (Y) Is there something more natural here? Yes, Pair HMMs We will revisit Profile HMMs for multiple alignment a bit later

Pair Hidden Markov Models
each non-silent state emits one or a pair of characters H: homology (match) state I: insert state D: delete state

PHMM Paths = Alignments
sequence 1: AAGCGC sequence 2: ATGTC B H A H A T I G I C H G D T H C E hidden: observed:

Transition Probabilities
probabilities of moving between states at each step state i+1 B H I D E 1-2δ-τ δ τ 1-ε-τ ε state i The model in the book does not allow for transitions between I and D – this is not really necessary and, in fact, is probably not realistic – could have insertion and deletion following each other * issue is that viterbi path will generally not have insertion and deletion following each other

Emission Probabilities
Deletion (D) Insertion (I) Homology (H) A 0.3 C 0.2 G T A 0.1 C 0.4 G T A C G T 0.13 0.03 0.06 Note that homology emission matrix sums to one over all entries (rows/columns are not conditional probabilities) single character single character pairs of characters

Pair HMM Viterbi probability of most likely sequence of hidden states generating length i prefix of x and length j prefix of y, with the last state being: H I D note that the recurrence relations here allow ID and DI transitions

PHMM Alignment HIDHHDDIIHH...
calculate probability of most likely alignment traceback, as in Needleman-Wunsch (NW), to obtain sequence of state states giving highest probability HIDHHDDIIHH...

Correspondence with Needleman-Wunsch (NW)
NW values ≈ logarithms of Pair HMM Viterbi values Recall that in place of probabilities, we can keep track of log probabilities (monotonic function), product becomes sum The transition probabilities make it so that there is not an equivalence. However, another formulation of the PHMM can give an exact equivalence

Posterior Probabilities
there are similar recurrences for the Forward and Backward values from the Forward and Backward values, we can calculate the posterior probability of the event that the path passes through a certain state S, after generating length i and j prefixes

Uses for Posterior Probabilities
sampling of suboptimal alignments posterior probability of pairs of residues being homologous (aligned to each other) posterior probability of a residue being gapped training model parameters (EM)

Posterior Probabilities
plot of posterior probability of each alignment column

Parameter Training supervised training
given: sequences and correct alignments do: calculate parameter values that maximize joint likelihood of sequences and alignments unsupervised training given: sequence pairs, but no alignments do: calculate parameter values that maximize marginal likelihood of sequences (sum over all possible alignments)

Summary Pairwise sequence alignment ubiquitous in computational molecular biology Classical computational task Dynamic programming Heuristics for additional speed Statistical models for this task HMMs, Pair HMMs Captures uncertainty in alignment Allows for principled estimation of parameters Easily used in classification settings

Pairwise Sequence Alignment

Similar presentations

Presentation on theme: "Pairwise Sequence Alignment"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Pairwise Sequence Alignment

Similar presentations

Presentation on theme: "Pairwise Sequence Alignment"— Presentation transcript:

Similar presentations

About project

Feedback