1
Definitions
Optimal alignment - the one that exhibits the most correspondences. It is the alignment with the highest score. It may or may not be biologically meaningful.
Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.
Local alignment - Smith-Waterman (1981) gives the highest-scoring local match between two sequences.
2
Pairwise Global Alignment
Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.
Reasons for making a global alignment:
Checking minor differences between two sequences
Analyzing polymorphisms (e.g. SNPs) between closely related sequences
…
3
Pairwise Global Alignment
Computationally: Given: a pair of sequences (strings of characters) Output: an alignment that maximizes the similarity
4
How can we find an optimal alignment?
1                        27
ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT
How many possible alignments? C(27,7) gap positions ≈ 888,000 possibilities
Dynamic programming: the Needleman & Wunsch algorithm
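The figure quoted above can be checked with a few lines of Python. This is only a sketch, and it assumes the count means choosing which 7 of the 27 alignment columns carry the gaps of the second sequence; the variable names are illustrative.

```python
from math import comb

top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "CTGAT---TCG-CATCGTC--T-ATCT"

columns = len(bottom)        # number of alignment columns
gaps = bottom.count("-")     # gap characters in the second sequence

# Ways to choose which columns hold the gaps (assumed reading of C(27,7)).
placements = comb(columns, gaps)
print(columns, gaps, placements)
```

Even for this short pair the count is close to a million, which is why enumeration is replaced by dynamic programming.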
5
Time Complexity
Consider two sequences:
AAGT
AGTC
How many possible alignments do the two sequences have?
Stirling's formula: $n! = \sqrt{2\pi n}\,(n/e)^n + \dots$
$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} = \Theta\!\left(\frac{2^{2n}}{\sqrt{n}}\right) = \Omega(2^n)$
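To get a feel for how fast this count grows, the sketch below compares the exact central binomial coefficient with the leading-order estimate that follows from Stirling's formula. The function names are illustrative, and the estimate 2^(2n) / sqrt(pi n) is the standard asymptotic form rather than something stated on the slide.

```python
from math import comb, pi, sqrt

def central_binomial(n: int) -> int:
    """Exact value of C(2n, n) = (2n)! / (n!)^2."""
    return comb(2 * n, n)

def stirling_estimate(n: int) -> float:
    """Leading-order estimate 2^(2n) / sqrt(pi * n) from Stirling's formula."""
    return 4.0 ** n / sqrt(pi * n)

# n = 4 corresponds to the two length-4 sequences AAGT and AGTC.
for n in (4, 10, 20):
    print(n, central_binomial(n), round(stirling_estimate(n)))
```

The exact and estimated values agree more closely as n grows, and both increase exponentially with the sequence length.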
6
Scoring a sequence alignment
Match/mismatch score: +1/+0
Open/extension penalty: -2/-1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Open: 2 × (-2)
Extension: 5 × (-1)
Score = +9
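The arithmetic above can be reproduced with a small scoring routine. This is a sketch under one assumption read off the slide's numbers: the first column of each gap run costs the open penalty and every further column of that run costs the extension penalty. The function name and parameters are illustrative, and the sketch does not distinguish a gap in the top sequence from an adjacent gap in the bottom one.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score a gapped pairwise alignment column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            # First column of a gap run pays gap_open, later ones gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score

top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))
```

With 18 matches, 2 mismatches, 2 gap openings and 5 extensions, the routine returns the same total as the slide.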
7
Pairwise Global Alignment
Computationally: Given: a pair of sequences (strings of characters) Output: an alignment that maximizes the similarity
8
Needleman & Wunsch
Place each sequence along one axis
Place score 0 at the upper-left corner
Fill in the 1st row & column with gap penalty multiples
Fill in the matrix with the max value of 3 possible moves:
  Vertical move: Score + gap penalty
  Horizontal move: Score + gap penalty
  Diagonal move: Score + match/mismatch score
The optimal alignment score is in the lower-right corner
To reconstruct the optimal alignment, trace back where the max at each step came from; stop when the origin is reached.
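The fill steps listed above can be sketched as follows. The function name `nw_matrix` is illustrative, and the default scores (match = 1, mismatch = -1, gap = -2) are an assumption borrowed from the deck's worked example, with a single linear gap penalty.

```python
def nw_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Fill the Needleman-Wunsch score matrix with s on rows and t on columns."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row hold multiples of the gap penalty.
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j] + gap,      # vertical move
                F[i][j - 1] + gap,      # horizontal move
                F[i - 1][j - 1] + sub,  # diagonal move
            )
    return F

F = nw_matrix("AAAC", "AGC")
for row in F:
    print(row)
print("optimal score:", F[-1][-1])
```

The optimal score is read from the lower-right cell; recovering the alignment itself needs the traceback step.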
9
Example
Let gap = -2, match = 1, mismatch = -1.

          empty    A    G    C
empty       0     -2   -4   -6
A          -2      1   -1   -3
A          -4     -1    0   -2
A          -6     -3   -2   -1
C          -8     -5   -4   -1

Optimal alignments:
AAAC      AAAC
A-GC      -AGC
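A minimal sketch of the full procedure, including traceback, for the example AAAC versus AGC. The function name is illustrative, and the tie-breaking order (diagonal first, then vertical, then horizontal) is an arbitrary choice of this sketch, so it reports only one of the equally good alignments.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Trace back from the lower-right corner until the origin is reached.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))

score, top, bottom = needleman_wunsch("AAAC", "AGC")
print(score)
print(top)
print(bottom)
```

Changing the order of the three checks in the traceback loop would select a different alignment with the same score.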
10
Time Complexity Analysis
Initialize matrix values: O(n), O(m)
Filling in rest of matrix: O(nm)
Traceback: O(n+m)
If strings are same length, total time O(n^2)
11
Local Alignment
Problem first formulated: Smith and Waterman (1981)
Problem: Find an optimal alignment between a substring of s and a substring of t
Algorithm: a variant of the basic algorithm for global alignment
12
Motivation
Searching for unknown domains or motifs within proteins from different families
Proteins encoded by Homeobox genes (conserved in only 1 region, called the Homeo domain, 60 amino acids long)
Identifying active sites of enzymes
Comparing long stretches of anonymous DNA
Querying databases where the query word is much smaller than the sequences in the database
Analyzing repeated elements within a single sequence
13
Local Alignment
Let gap = -2, match = 1, mismatch = -1.
GATCACCT
GATACCC

        empty  G  A  T  A  C  C  C
empty     0    0  0  0  0  0  0  0
G         0    1  0  0  0  0  0  0
A         0    0  2  0  1  0  0  0
T         0    0  0  3  1  0  0  0
C         0    0  0  1  2  2  1  1
A         0    0  1  0  2  1  1  0
C         0    0  0  0  0  3  2  2
C         0    0  0  0  0  1  4  3
T         0    0  0  1  0  0  2  3

GATCACCT
GAT-ACCC
14
Smith & Waterman
Place each sequence along one axis
Place score 0 at the upper-left corner
Fill in the 1st row & column with 0s
Fill in the matrix with the max value of 4 possible values:
  0
  Vertical move: Score + gap penalty
  Horizontal move: Score + gap penalty
  Diagonal move: Score + match/mismatch score
The optimal alignment score is the max in the matrix
To reconstruct the optimal alignment, trace back where the max at each step came from; stop when a zero is hit.
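These steps can be sketched in Python as below. The function name is illustrative; the default scores (match = 1, mismatch = -1, gap = -2) are assumed from the deck's local alignment example, and ties in the traceback are broken diagonal first, which is a choice of this sketch.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one best local alignment."""
    n, m = len(s), len(t)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the maximum cell and stop when a zero is hit.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))

score, top, bottom = smith_waterman("GATCACCT", "GATACCC")
print(score)
print(top)
print(bottom)
```

Unlike the global version, the result covers only the best-scoring pair of substrings, so the reported strings can be shorter than the inputs.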
15
Exercise
Find the best local alignment:
CGATG
AAATGGA
Let: gap = -2, match = 1, mismatch = -1.
16
Semi-global Alignment
Example:
CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

CAGCACTTGGATTCTCGG
CAGC----G--T----GG
We like the first alignment much better. In semi-global comparison, we score the alignments ignoring some of the end spaces.
17
Global Alignment
Example: AAACCC and ACCC

          empty    A    A    A    C    C    C
empty       0     -2   -4   -6   -8  -10  -12
A          -2      1   -1   -3   -5   -7   -9

Prefer to see:
AAACCC
  ACCC
Do not want to penalize the end spaces
18
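SemiGlobal Alignment
Example: s = AAACCC, t = ACCC
Scoring as in the earlier examples: gap = -2, match = 1, mismatch = -1. The 1st row is initialized with zeros.

          empty    A    A    A    C    C    C
empty       0      0    0    0    0    0    0
A          -2      1    1    1   -1   -1   -1
C          -4     -1    0    0    2    0    0
C          -6     -3   -2   -1    1    3    1
C          -8     -5   -4   -3    0    2    4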
19
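SemiGlobal Alignment
Example: s = AAACCCG, t = ACCC
Scoring as in the earlier examples: gap = -2, match = 1, mismatch = -1. The 1st row is initialized with zeros.

          empty    A    A    A    C    C    C    G
empty       0      0    0    0    0    0    0    0
A          -2      1    1    1   -1   -1   -1   -1
C          -4     -1    0    0    2    0    0   -2
C          -6     -3   -2   -1    1    3    1   -1
C          -8     -5   -4   -3    0    2    4    2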
20
SemiGlobal Alignment
Summary of end space charging procedures:

Place where spaces are not penalized    Action
Beginning of 1st sequence               Initialize 1st row with zeros
End of 1st sequence                     Look for max in last row
Beginning of 2nd sequence               Initialize 1st column with zeros
End of 2nd sequence                     Look for max in last column
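The four rules in the summary can be folded into one scoring routine with four switches. This is a sketch, not a reference implementation: the function and parameter names are hypothetical, the 1st sequence is assumed to index the rows and the 2nd the columns, and the scores (match = 1, mismatch = -1, gap = -2) are assumed from the deck's examples.

```python
def semiglobal_score(first, second,
                     free_first_start=False, free_first_end=False,
                     free_second_start=False, free_second_end=False,
                     match=1, mismatch=-1, gap=-2):
    """Alignment score with optional free end spaces (first = rows, second = columns)."""
    n, m = len(first), len(second)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        # Free spaces at the beginning of the 1st sequence: 1st row of zeros.
        F[0][j] = 0 if free_first_start else j * gap
    for i in range(1, n + 1):
        # Free spaces at the beginning of the 2nd sequence: 1st column of zeros.
        F[i][0] = 0 if free_second_start else i * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if first[i - 1] == second[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)
    candidates = [F[n][m]]
    if free_first_end:
        candidates.extend(F[n])                  # max in last row
    if free_second_end:
        candidates.extend(row[m] for row in F)   # max in last column
    return max(candidates)

# ACCC on the rows, as in the matrices above, with both of its end spaces free.
print(semiglobal_score("ACCC", "AAACCC", free_first_start=True, free_first_end=True))
print(semiglobal_score("ACCC", "AAACCCG", free_first_start=True, free_first_end=True))
```

With all four switches off the routine reduces to the global score, which makes it easy to compare the two on the same pair of sequences.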
21
Pairwise Sequence Comparison over Internet
lalign (Global/Local): fasta.bioch.virginia.edu/fasta_www/plalign.htm
USC: www-hto.usc.edu/software/seqaln/seqaln-query.html
alion: fold.stanford.edu/alion
align: genome.cs.mtu.edu/align.html
xenAliTwo (Local for DNA)
blast2seqs (Local BLAST): web.umassmed.edu/cgi-bin/BLAST/blast2seqs
lalnview (Visualization)
prss (Evaluation): Fasta.bioch.virginia.edu/fasta/prss.htm
graph-align: Darwin.nmsu.edu/cgi-bin/graph_align.cgi
Source: Bioinformatics for Dummies
22
Significance of Sequence Alignment
Consider randomly generated sequences. What distribution do you think the best local alignment score of two sequences of the same length should follow?
Uniform distribution
Normal distribution
Binomial distribution (n Bernoulli trials)
Poisson distribution (large n, np = λ)
Others
Binomial distribution: in n Bernoulli trials (p for heads, q for tails), the probability of seeing k successes. The number of successes follows the binomial distribution.
The composition of the two random sequences is the same as that of the two test sequences.
23
Extreme Value Distribution
$Y_{ev} = \exp(-x - e^{-x})$
24
Extreme Value Distribution vs. Normal Distribution
P-value: the area of the shaded region under the curve
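For the standard extreme value density exp(-x - e^(-x)), the area to the right of x has the closed form 1 - exp(-e^(-x)). The sketch below evaluates it next to the standard normal tail. It assumes the unscaled standard forms of both distributions (no location or scale fitting, and no matching of mean or variance), and the function names are illustrative.

```python
from math import erfc, exp, expm1, sqrt

def gumbel_pdf(x: float) -> float:
    """Standard extreme value density exp(-x - e^(-x))."""
    return exp(-x - exp(-x))

def gumbel_pvalue(x: float) -> float:
    """Right-tail area P(X >= x) = 1 - exp(-e^(-x)), computed via expm1 for accuracy."""
    return -expm1(-exp(-x))

def normal_pvalue(z: float) -> float:
    """Right-tail area of the standard normal distribution."""
    return 0.5 * erfc(z / sqrt(2.0))

for x in (2.0, 4.0, 6.0):
    print(x, gumbel_pvalue(x), normal_pvalue(x))
```

The extreme value tail decays much more slowly than the normal tail, so a score that looks very unlikely under a normal assumption can be far less surprising under the extreme value distribution.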
25
“Twilight Zone”
Some proteins with less than 15% similarity have exactly the same 3-D structure, while some proteins with 20% similarity have different structures. Neither homology nor non-homology can be taken for granted in the twilight zone.