1
Sequence Alignments Introduction to Bioinformatics
2
Intro to Bioinformatics – Sequence Alignment
Sequence Alignments
- Cornerstone of bioinformatics
- What is a sequence?
  - Nucleotide sequence
  - Amino acid sequence
- Pairwise and multiple sequence alignments
  - We will focus on pairwise alignments
- What alignments can help with
  - Determining the function of a newly discovered gene sequence
  - Determining evolutionary relationships among genes, proteins, and species
  - Predicting the structure and function of proteins
Acknowledgement: These notes are adapted, with permission, from lecture notes of both Wright State University’s Bioinformatics Program and Professor Laurie Heyer of Davidson College.
3
DNA Replication
- Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set
- DNA polymerase is the enzyme that copies DNA
  - Reads the old strand in the 3´ to 5´ direction
4
Over time, genes accumulate mutations
- Environmental factors
  - Radiation
  - Oxidation
- Mistakes in replication or repair
- Deletions, duplications
- Insertions, inversions
- Translocations
- Point mutations
5
Deletions
Codon deletion:
ACG ATA GCG TAT GTA TAG CCG…
- Effect depends on the protein, position, etc.
- Almost always deleterious
- Sometimes lethal
Frame shift mutation:
ACG ATA GCG TAT GTA TAG CCG…
ACG ATA GCG ATG TAT AGC CG?…
- Almost always lethal
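The difference between the two kinds of deletion can be sketched in a few lines of Python. The helper names `codons` and `delete_bases` are illustrative only, and the choice of which codon to delete (TAT) is an assumption, since the slide's highlighting is lost. The single-base deletion reproduces the shifted reading frame shown above.

```python
def codons(seq):
    """Split a sequence into consecutive triplets (the last one may be partial)."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def delete_bases(seq, start, count):
    """Return seq with `count` bases removed, starting at 0-based index `start`."""
    return seq[:start] + seq[start + count:]


original = "ACGATAGCGTATGTATAGCCG"

# Assumption: the deleted codon is TAT (bases 9..11); the reading frame survives.
codon_deleted = delete_bases(original, 9, 3)

# Deleting a single base (the first T of TAT) shifts every later codon.
frame_shifted = delete_bases(original, 9, 1)

print(" ".join(codons(original)))
print(" ".join(codons(codon_deleted)))
print(" ".join(codons(frame_shifted)))
```

After the codon deletion the downstream codons are unchanged, while after the single-base deletion every codon from the deletion point onward is different.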
6
Indels
When comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless ancestry is known:
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT
7
Substitutions
- Substitutions are mutations accepted by natural selection.
- Synonymous: CGC → CGA (both code for arginine)
- Non-synonymous: GAU → GAA (aspartic acid becomes glutamic acid)
[Figure: The Genetic Code]
8
Comparing Two Sequences
Point mutations, easy:
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
Indels are difficult, must align sequences:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
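When two sequences have the same length and differ only by point mutations, a position-by-position comparison is enough. The sketch below (the function name `mismatch_positions` is illustrative) lists the 0-based positions where the first pair of sequences on this slide differ.

```python
def mismatch_positions(a, b):
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length; align them first")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


seq1 = "ACGTCTGATACGCCGTATAGTCTATCT"
seq2 = "ACGTCTGATTCGCCCTATCGTCTATCT"
print(mismatch_positions(seq1, seq2))  # [9, 14, 18]
```

The same function raises an error for the second pair (27 versus 20 bases), which is exactly the case where an alignment with gaps is needed.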
9
Why Align Sequences?
- The draft human genome is available
- Automated gene finding is possible
- Gene: AGTACGTATCGTATAGCGTAA
  - What does it do?
- One approach: Is there a similar gene in another species?
  - Align sequences with known genes
  - Find the gene with the “best” match
10
Gaps or No Gaps
Examples
11
Scoring a Sequence Alignment
Match score: +1
Mismatch score: 0
Gap penalty: –1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (–1)
Score = +11
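The arithmetic on this slide can be checked with a short function that walks the alignment column by column. This is a minimal sketch; the name `score_alignment` and its parameter names are illustrative, with defaults taken from the scores above.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score a finished alignment column by column with a simple gap penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap          # one penalty per gap character
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 18 matches - 7 gaps = 11
```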
12
Origination and Length Penalties
We want to find alignments that are evolutionarily likely.
Which of the following alignments seems more likely to you?
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT
We can favor the more likely alignment by penalizing the opening of a new gap more heavily than the extension of an existing gap.
13
Scoring a Sequence Alignment (2)
Match/mismatch score: +1/0
Origination/length penalty: –2/–1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Origination: 2 × (–2)
Length: 7 × (–1)
Score = +7
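The origination/length scheme can be sketched the same way: every gap character pays the length penalty, and the first character of each run of gaps additionally pays the origination penalty. The function name `score_affine` and its parameter names are illustrative; the default values are the ones on this slide.

```python
def score_affine(top, bottom, match=1, mismatch=0, origination=-2, length=-1):
    """Score an alignment with separate gap origination and gap length penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap_row = None  # which sequence held the gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gap_row = "top" if x == "-" else "bottom"
            if gap_row != previous_gap_row:
                score += origination   # a new gap starts here
            score += length            # every gap character pays the length penalty
            previous_gap_row = gap_row
        else:
            score += match if x == y else mismatch
            previous_gap_row = None
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_affine(top, bottom))  # 18 - 2*2 - 7*1 = 7
```

Under this scheme, an alignment with one long gap scores better than one with the same number of gap characters scattered across many short gaps.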
14
How can we find an optimal alignment?
Finding the optimal alignment by brute force is computationally expensive:
ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT
C(27,7) gap positions = ~888,000 possibilities
It’s possible, as long as we don’t repeat our work!
Dynamic programming: the Needleman & Wunsch algorithm
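The count on this slide is a binomial coefficient: the number of ways to choose which 7 of the 27 alignment columns hold a gap in the shorter sequence. A quick check with Python's standard library (the function name `gap_placements` is illustrative):

```python
from math import comb


def gap_placements(columns, gaps):
    """Number of ways to choose which alignment columns contain a gap."""
    return comb(columns, gaps)


print(gap_placements(27, 7))  # 888030
```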
15
Dynamic Programming
- A technique for solving optimization problems
- Find and store (memoize) solutions to subproblems
- Use those solutions to build solutions to larger subproblems
- Continue until the final solution is found
- A recursively defined cost function is computed in a non-recursive fashion
16
Global Sequence Alignment
Needleman-Wunsch algorithm
Suppose we are aligning sequence A with sequence B:
$$D_{i,j} = \max \{\, D_{i-1,j} + d(A_i, -),\; D_{i,j-1} + d(-, B_j),\; D_{i-1,j-1} + d(A_i, B_j) \,\}$$
[Figure: table cell (i, j) and its neighbors in row i-1 and column j-1]
17
Dynamic Programming (DP) Concept
Suppose we are aligning:
CACGA
CCGA
18
DP – Recursion Perspective
Suppose we are aligning:
ACTCG
ACAGTAG
Last position choices:
- G over G: +1, then align ACTC with ACAGTA
- G over -: -1, then align ACTC with ACAGTAG
- - over G: -1, then align ACTCG with ACAGTA
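The three last-position choices translate directly into a recursive function. The sketch below is one possible rendering, not the slides' own code: it uses `functools.lru_cache` to store each subproblem's answer so that no prefix pair is solved twice. The name `align_score_recursive` is illustrative, and the default scores (match +1, mismatch 0, gap -1) are the ones used throughout these notes.

```python
from functools import lru_cache


def align_score_recursive(a, b, match=1, mismatch=0, gap=-1):
    """Best global alignment score of a and b, computed top-down with memoization."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # best(i, j) is the best score for aligning a[:i] with b[:j].
        if i == 0:
            return j * gap          # only gaps remain in the first sequence
        if j == 0:
            return i * gap          # only gaps remain in the second sequence
        pair = match if a[i - 1] == b[j - 1] else mismatch
        return max(
            best(i - 1, j - 1) + pair,  # last letters aligned with each other
            best(i - 1, j) + gap,       # last letter of a over a gap
            best(i, j - 1) + gap,       # gap over last letter of b
        )

    return best(len(a), len(b))


print(align_score_recursive("ACTCG", "ACAGTAG"))  # 2
```

Without the cache the same prefixes would be re-solved exponentially many times; with it there are only (len(a)+1) × (len(b)+1) distinct subproblems.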
19
What is the optimal alignment?
ACTCG
ACAGTAG
Match: +1
Mismatch: 0
Gap: –1
20
Needleman-Wunsch: Step 1
- Each sequence along one axis
- Gap penalty multiples in first row/column
- 0 in [1,1] (or [0,0] for the CS-minded)
21
Needleman-Wunsch: Step 2
- Vertical/horizontal move: score + (simple) gap penalty
- Diagonal move: score + match/mismatch score
- Take the MAX of the three possibilities
22
Needleman-Wunsch: Step 2 (cont’d)
Fill out the rest of the table likewise…
23
Needleman-Wunsch: Step 2 (cont’d)
Fill out the rest of the table likewise…
The optimal alignment score ends up in the lower-right corner of the table.
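Steps 1 and 2 can be sketched as a table-filling routine. This is a minimal illustration under the scoring used in these notes (match +1, mismatch 0, simple gap penalty -1); the name `needleman_wunsch_table` is illustrative, and the code uses 0-based indices, so the slide's cell [1,1] is `table[0][0]` here.

```python
def needleman_wunsch_table(a, b, match=1, mismatch=0, gap=-1):
    """Fill the global alignment score table; a runs down the side, b along the top."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]

    # Step 1: multiples of the gap penalty in the first row and column.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap

    # Step 2: each cell is the best of a diagonal, vertical, or horizontal move.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # diagonal move
                table[i - 1][j] + gap,       # vertical move
                table[i][j - 1] + gap,       # horizontal move
            )
    return table


table = needleman_wunsch_table("ACTCG", "ACAGTAG")
for row in table:
    print(" ".join(f"{value:3d}" for value in row))
print("optimal score:", table[-1][-1])  # 2
```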
24
But what is the optimal alignment?
To reconstruct the optimal alignment, we must determine where the MAX at each step came from…
25
A path corresponds to an alignment
- Vertical move = GAP in top sequence
- Horizontal move = GAP in left sequence
- Diagonal move = ALIGN both positions
One path from the previous table gives the corresponding alignment (start at the end):
AC--TCG
ACAGTAG
Score = +2
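The traceback can be sketched by refilling the table and then walking back from the lower-right corner, at each cell asking which of the three moves produced its value. This is an illustrative implementation (the name `needleman_wunsch` is hypothetical). When several moves tie, it prefers the diagonal, so it returns one optimal alignment, which need not be the one drawn on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + pair,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)

    # Walk back from the lower-right corner to the upper-left corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if table[i][j] == table[i - 1][j - 1] + pair:
                out_a.append(a[i - 1])   # diagonal: align both letters
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + gap:
            out_a.append(a[i - 1])       # letter of a over a gap
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")            # gap over letter of b
            out_b.append(b[j - 1])
            j -= 1
    return table[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


score, aligned_a, aligned_b = needleman_wunsch("ACTCG", "ACAGTAG")
print(aligned_a)
print(aligned_b)
print("Score =", score)
```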
26
Algorithm Analysis
Brute force approach
- If the length of both sequences is $n$, the number of possibilities is $C(2n, n) = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{(\pi n)^{1/2}}$, using Stirling’s approximation $n! \approx (2\pi n)^{1/2} e^{-n} n^n$
- $O(4^n)$
Dynamic programming
- $O(mn)$, where the two sequence lengths are $m$ and $n$, respectively
- $O(n^2)$, if $m$ is on the order of $n$
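The gap between the two growth rates is easy to see numerically. The sketch below compares the exact count C(2n, n) with the Stirling-based estimate and with the n² table cells that dynamic programming fills; the function name `stirling_central_binomial` is illustrative.

```python
from math import comb, pi, sqrt


def stirling_central_binomial(n):
    """Approximate C(2n, n) by 4**n / sqrt(pi * n)."""
    return 4 ** n / sqrt(pi * n)


print(f"{'n':>4} {'C(2n, n)':>22} {'4^n/sqrt(pi n)':>22} {'n^2':>6}")
for n in (5, 10, 20, 30):
    exact = comb(2 * n, n)
    approx = stirling_central_binomial(n)
    print(f"{n:>4} {exact:>22} {approx:>22.3e} {n * n:>6}")
```

The estimate slightly overshoots the exact value, and the relative error shrinks as n grows.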
27
Practice Problem
Find an optimal alignment for these two sequences:
GCGGTT
GCGT
Match: +1
Mismatch: 0
Gap: –1
28
Practice Problem
Find an optimal alignment for these two sequences:
GCGGTT
GCGT
One optimal alignment:
GCGGTT
GCG-T-
Score = +2
29
Semi-global alignment
Suppose we are aligning:
GCG
GGCG
Which do you prefer?
G-CG        -GCG
GGCG        GGCG
Semi-global alignment allows gaps at the ends for free.
30
Semi-global alignment
Semi-global alignment allows gaps at the ends for free.
- Initialize first row and column to all 0’s
- Allow free horizontal/vertical moves in last row and column
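Both rules amount to small changes in the table-filling loop: the first row and column stay at 0, and moves along the last row or last column cost nothing. The following is a sketch under the scoring used in these notes (match +1, mismatch 0, gap -1); the function name `semiglobal_score` is illustrative.

```python
def semiglobal_score(a, b, match=1, mismatch=0, gap=-1):
    """Best alignment score of a and b when gaps at either end are free."""
    rows, cols = len(a) + 1, len(b) + 1
    # First row and column are left at 0: leading gaps are not penalized.
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Trailing gaps are free: vertical moves cost nothing in the last
            # column, and horizontal moves cost nothing in the last row.
            vertical_cost = 0 if j == cols - 1 else gap
            horizontal_cost = 0 if i == rows - 1 else gap
            table[i][j] = max(
                table[i - 1][j - 1] + pair,
                table[i - 1][j] + vertical_cost,
                table[i][j - 1] + horizontal_cost,
            )
    return table[-1][-1]


print(semiglobal_score("GCG", "GGCG"))  # 3: the leading gap in -GCG is free
```

For comparison, a global alignment of the same two sequences scores at most 2, because its single gap is always penalized.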
31
Local alignment
- Global alignments: score the entire alignment
- Semi-global alignments: allow unscored gaps at the beginning or end of either sequence
- Local alignment: find the best matching subsequence
CGATG
AAATGGA
This is achieved by allowing a fourth alternative at each position in the table: zero.
32
Local Sequence Alignment
Why local sequence alignment?
- Subsequence comparison between a DNA sequence and a genome
- Protein function domains
- Exon matching
Smith-Waterman algorithm
$$D_{i,j} = \max \{\, D_{i-1,j} + d(A_i, -),\; D_{i,j-1} + d(-, B_j),\; D_{i-1,j-1} + d(A_i, B_j),\; 0 \,\}$$
Initialization: $D_{1,j} = 0$, $D_{i,1} = 0$
33
Local alignment
Score: Match = 1, Mismatch = -1, Gap = -1
CGATG
AAATGGA
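A local alignment of these two sequences can be sketched by adding the zero alternative to the table-filling loop and starting the traceback at the largest cell instead of the lower-right corner. This is an illustrative implementation (the name `smith_waterman` is hypothetical), with defaults taken from the scores on this slide.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]  # first row and column are 0
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                0,                            # fourth alternative: start afresh
                table[i - 1][j - 1] + pair,
                table[i - 1][j] + gap,
                table[i][j - 1] + gap,
            )
            if table[i][j] > best:
                best, best_cell = table[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_cell
    while table[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if table[i][j] == table[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif table[i][j] == table[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("CGATG", "AAATGGA"))  # (3, 'ATG', 'ATG')
```

If several cells share the maximum value, this sketch reports only the first one it encounters.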
34
Local alignment
Another example
35
More Example
Align
ATGGCCTC
ACGGCTC
Mismatch = -3, Gap = -4 (the table entries are consistent with Match = +1)

        -    A    C    G    G    C    T    C
   -    0   -4   -8  -12  -16  -20  -24  -28
   A   -4    1   -3   -7  -11  -15  -19  -23
   T   -8   -3   -2   -6  -10  -14  -14  -18
   G  -12   -7   -6   -1   -5   -9  -13  -17
   G  -16  -11  -10   -5    0   -4   -8  -12
   C  -20  -15  -10   -9   -4    1   -3   -7
   C  -24  -19  -14  -13   -8   -3   -2   -2
   T  -28  -23  -18  -17  -12   -7   -2   -5
   C  -32  -27  -22  -21  -16  -11   -6   -1

Global Alignment (score -1, the lower-right cell):
ATGGCCTC
ACGGC-TC
36
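More Example
Same sequences; the table entries are consistent with the scoring of the global example (Match = +1, Mismatch = -3, Gap = -4).

        -    A    C    G    G    C    T    C
   -    0    0    0    0    0    0    0    0
   A    0    1    0    0    0    0    0    0
   T    0    0    0    0    0    0    1    0
   G    0    0    0    1    1    0    0    0
   G    0    0    0    1    2    0    0    0
   C    0    0    1    0    0    3    0    1
   C    0    0    1    0    0    1    0    1
   T    0    0    0    0    0    0    2    0
   C    0    0    1    0    0    1    0    3

Local Alignment (best score 3, reached in two cells: CTC with CTC, or GGC with GGC):
ATGGCCTC          ATGGCCTC
ACGG CTC    or    ACGGC TC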
37
Scoring Matrices for DNA Sequences
- Transition: A ↔ G, C ↔ T (a purine is replaced by a purine, or a pyrimidine by a pyrimidine)
- Transversion: a purine (A or G) is replaced by a pyrimidine (C or T) or vice versa
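The two definitions can be turned into a small classifier, which is the first step toward a DNA scoring matrix that penalizes transversions more than transitions. The function name `classify_substitution` is illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x, y):
    """Classify a pair of DNA bases as 'match', 'transition', or 'transversion'."""
    bases = PURINES | PYRIMIDINES
    if x not in bases or y not in bases:
        raise ValueError("expected DNA bases A, C, G, or T")
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


for pair in ("AG", "CT", "AC", "GT", "AA"):
    print(pair, classify_substitution(pair[0], pair[1]))
```

Each base has one possible transition and two possible transversions.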
38
Scoring Matrices for Protein Sequence
PAM (Percent Accepted Mutations) 250
[Figure: PAM 250 matrix]
39
Scoring Matrices for Protein Sequence
BLOSUM (BLOcks SUbstitution Matrix) 62
[Figure: BLOSUM 62 matrix]
40
Using Protein Scoring Matrices
Divergence:

BLOSUM 80            BLOSUM 62            BLOSUM 45
PAM 1                PAM 120              PAM 250
Closely related                           Distantly related
Less divergent                            More divergent
Less sensitive                            More sensitive

Looking for:
- Short similar sequences → use less sensitive matrix
- Long dissimilar sequences → use more sensitive matrix
- Unknown → use range of matrices
Comparison:
- PAM: designed to track evolutionary origin of proteins
- BLOSUM: designed to find conserved regions of proteins