Download presentation
Presentation is loading. Please wait.
Published byAlexandrina Barrett Modified over 9 years ago
1
Sequence Alignments and Dynamic Programming BIO/CS 471 – Algorithms for Bioinformatics
2
Sequence Alignments2 Module II: Sequence Alignments Problem: SequenceAlignment Input: Two or more strings of characters Output: The optimal alignment of the input strings, possibly including gaps, and an alignment score. Example Input: The calico cat was very sleepy. The cat was sleepy. Output: The calico cat was very sleepy. The ------ cat was ---- sleepy.
3
Sequence Alignments3 Biological Sequence Alignment Input: DNA or Amino Acid sequence Output: Optimal alignment Example Input: ACTATAGCTATAGCTATAGCGTATCG ACGATTACAGGTCGTGTCG Output: ACTATAGCTATAGCTATAGCGTATCG || || ||*|| | |||*||| ACGAT---TACAGGT----CGTGTCG Score: +8
4
Sequence Alignments4 Why align sequences? Why would we want to align two sequences? Compare two genes/proteins, e.g. to infer function Database searches Protein structure prediction Why would we want to align multiple sequences? Conservation analysis Molecular evolution
5
Sequence Alignments5 How do we decide on a scoring function? Objective: Sequences that are homologous or evolutionarily related, should align with high scores. Less related sequences should score lower. Question: How do homologous sequences arise?
6
Sequence Alignments6 Transcription The Central Dogma DNAtranscription RNAtranslation Proteins
7
Sequence Alignments7 DNA Replication Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set DNA polymerase is the enzyme that copies DNA Reads the old strand in the 3´ to 5´ direction
8
Over time, genes accumulate mutations Environmental factors Radiation Oxidation Mistakes in replication or repair Deletions, Duplications Insertions Inversions Point mutations
9
Sequence Alignments9 Codon deletion: ACG ATA GCG TAT GTA TAG CCG… Effect depends on the protein, position, etc. Almost always deleterious Sometimes lethal Frame shift mutation: ACG ATA GCG TAT GTA TAG CCG… ACG ATA GCG ATG TAT AGC CG?… Almost always lethal Deletions
10
Sequence Alignments10 Indels Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known: ACGTCTGATACGCCGTATCGTCTATCT ACGTCTGAT---CCGTATCGTCTATCT
11
Sequence Alignments11 The Genetic Code Substitutions Substitutions are mutations accepted by natural selection. Synonymous: CGC CGA Non-synonymous: GAU GAA
12
Sequence Alignments12 Two kinds of homologs Orthologs Species A has a particular gene Species A diverges into species B and C, each with the same gene The two genes accumulate substitutions independently
13
Sequence Alignments13 Orthologs and Paralogs Paralogs A gene duplication event within a species results in two copies of a gene One of the copies is now under less selective constraint The copies can accumulate substitutions independently, before and after speciation
14
Sequence Alignments14 Scoring a sequence alignment Match score:+1 Mismatch score:+0 Gap penalty:–1 ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Matches: 18 × (+1) Mismatches: 2 × 0 Gaps: 7 × (– 1) Score = +11
15
Sequence Alignments15 Origination and length penalties We want to find alignments that are evolutionarily likely. Which of the following alignments seems more likely to you? ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGAT-------ATAGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT AC-T-TGA--CG-CGT-TA-TCTATCT We can achieve this by penalizing more for a new gap, than for extending an existing gap
16
Sequence Alignments16 Scoring a sequence alignment (2) Match/mismatch score:+1/+0 Origination/length penalty:–2/–1 ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Matches: 18 × (+1) Mismatches: 2 × 0 Origination: 2 × (–2) Length: 7 × (–1) Score = +7
17
Sequence Alignments17 Optimal Substructure in Alignments Consider the alignment: ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Is it true that the alignment in the boxed region must be optimal?
18
Sequence Alignments18 A Greedy Strategy Consider this pair of sequences GAGC CAGC Greedy Approach: G or G or - C - G Leads to GAGC--- Better: GACG ---CAGC CACG GAP = 1 Match = +1 Mismatch = 2
19
Sequence Alignments19 Breaking apart the problem Suppose we are aligning: ACTCG ACAGTAG First position choices: A+1CTCG ACAGTAG A-1CTCG -ACAGTAG --1ACTCG ACAGTAG
20
Sequence Alignments20 A Recursive Approach to Alignment Choose the best alignment based on these three possibilities: align(seq1, seq2) { if (both sequences empty) {return 0;} if (one string empty) { return(gapscore * num chars in nonempty seq); else { score1 = score(firstchar(seq1),firstchar(seq2)) + align(tail(seq1), tail(seq2)); score2 = align(tail(seq1), seq2) + gapscore; score3 = align(seq1, tail(seq2) + gapscore; return(min(score1, score2, score3)); }
21
Sequence Alignments21 Time Complexity of RecurseAlign What is the recurrence equation for the time needed by RecurseAlign? 3 3 33 33 … n 3 9 27 3n3n
22
Sequence Alignments22 RecurseAlign repeats its work
23
Sequence Alignments23 How can we find an optimal alignment? Finding the alignment is computationally hard: ACGTCTGATACGCCGTATAGTCTATCT CTGAT---TCG—CATCGTC--T-ATCT C(27,7) gap positions = ~888,000 possibilities It’s possible, as long as we don’t repeat our work! Dynamic programming: The Needleman & Wunsch algorithm
24
Sequence Alignments24 What is the optimal alignment? ACTCG ACAGTAG Match: +1 Mismatch: 0 Gap: –1
25
Sequence Alignments25 Needleman-Wunsch: Step 1 Each sequence along one axis Mismatch penalty multiples in first row/column 0 in [1,1] (or [0,0] for the CS-minded)
26
Sequence Alignments26 Needleman-Wunsch: Step 2 Vertical/Horiz. move: Score + (simple) gap penalty Diagonal move: Score + match/mismatch score Take the MAX of the three possibilities
27
Sequence Alignments27 Needleman-Wunsch: Step 2 (cont’d) Fill out the rest of the table likewise…
28
Sequence Alignments28 Needleman-Wunsch: Step 2 (cont’d) Fill out the rest of the table likewise… The optimal alignment score is calculated in the lower-right corner
29
Sequence Alignments29 But what is the optimal alignment To reconstruct the optimal alignment, we must determine of where the MAX at each step came from…
30
Sequence Alignments30 A path corresponds to an alignment = GAP in top sequence = GAP in left sequence = ALIGN both positions One path from the previous table: Corresponding alignment (start at the end): AC--TCG ACAGTAG Score = +2
31
Sequence Alignments31 Practice Problem Find an optimal alignment for these two sequences: GCGGTT GCGT Match: +1 Mismatch: 0 Gap: –1
32
Sequence Alignments32 Practice Problem Find an optimal alignment for these two sequences: GCGGTT GCGT GCGGTT GCG-T- Score = +2
33
Sequence Alignments33 Semi-global alignment Suppose we are aligning: GCG GGCG Which do you prefer? G-CG-GCG GGCGGGCG Semi-global alignment allows gaps at the ends for free.
34
Sequence Alignments34 Semi-global alignment Semi-global alignment allows gaps at the ends for free. Initialize first row and column to all 0’s Allow free horizontal/vertical moves in last row and column
35
Sequence Alignments35 Local alignment Global alignments – score the entire alignment Semi-global alignments – allow unscored gaps at the beginning or end of either sequence Local alignment – find the best matching subsequence CGATG AAATGGA This is achieved by allowing a 4 th alternative at each position in the table: zero.
36
Sequence Alignments36 Local alignment Mismatch = –1 this time CGATG AAATGGA
37
Sequence Alignments37 Food for thought… What is the asymptotic time complexity of the Needleman & Wunsch algorithm? Is this good or bad? How about the space complexity? Why might this be a problem?
38
Sequence Alignments38 Saving Space Note that we can throw away the previous rows of the table as we fill it in: This row is based only on this one
39
Sequence Alignments39 Saving Space (2) Each row of the table contains the scores for aligning a prefix of the left-hand sequence with all prefixes of the top sequence: Scores for aligning aca with all prefixes of actcg
40
Sequence Alignments40 Divide and Conquer By using a recursive approach, we can use only two rows of the matrix at a time: Choose the middle character of the top sequence, i Find out where i aligns to the bottom sequence Needs two vectors of scores Recursively align the sequences before and after the fixed positions ACGCTATGCTCATAG CGACGCTCATCG i
41
Sequence Alignments41 Finding where i lines up Find out where i aligns to the bottom sequence Needs two vectors of scores Assuming i lines up with a character: alignscore = align( ACGCTAT, prefix(t)) + score( G, char from t) + align( CTCATAG, suffix(t)) Which character is best? Can quickly find out the score for aligning ACGCTAT with every prefix of t. s: ACGCTATGCTCATAG t: CGACGCTCATCG i
42
Sequence Alignments42 Finding where i lines up But, i may also line up with a gap Assuming i lines up with a gap: alignscore = align( ACGCTAT, prefix(t)) + gapscore + align( CTCATAG, suffix(t)) s: ACGCTATGCTCATAG t: CGACGCTCATCG i
43
Sequence Alignments43 Recursive Call Fix the best position for I Call align recursively for the prefixes and suffixes: s: ACGCTATGCTCATAG t: CGACGCTCATCG i
44
Sequence Alignments44 Complexity Let len(s) = m and len(t) = n Space: 2m Time: Each call to build similarity vector = m´n´ First call + recursive call: s: ACGCTATGCTCATAG t: CGACGCTCATCG i j
45
Sequence Alignments45 General Gap Penalties Suppose we are no longer using simple gap penalties: Origination = −2 Length = −1 Consider the last position of the alignment for ACGTA with ACG We can’t determine the score for unless we know the previous positions! G-G- -G-G or
46
Sequence Alignments46 Scoring Blocks Now we must score a block at a time A block is a pair of characters, or a maximal group of gaps paired with characters To score a position, we need to either start a new block or add it to a previous block A A C --- A TATCCG A C T AC A C T ACC T ------ C G C --
47
Sequence Alignments47 The Algorithm Three tables a – scores for alignments ending in char-char blocks b – scores for alignments ending in gaps in the top sequence (s) c – scores for alignments ending in gaps in the left sequence (t) Scores no longer depend on only three positions, because we can put any number of gaps into the last block
48
Sequence Alignments48 The Recurrences
49
Sequence Alignments49 The Optimal Alignment The optimal alignment is found by looking at the maximal value in the lower right of all three arrays The algorithm runs in O(n 3 ) time Uses O(n 2 ) space
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.