Download presentation
Presentation is loading. Please wait.
1
Pairwise alignment Computational Genomics and Proteomics
2
Genetic code
3
Searching for similarities What is the function of the new gene? The “lazy” investigation: –Find the set of similar proteins –Identify similarities and differences –For long proteins: identify domains
4
Is similarity really interesting?
5
Common ancestry is more interesting Makes it more likely that genes share the same function Homology: sharing a common ancestor –a binary property (yes/no) –It’s a nice tool: When (a known gene) G is homologous to (an unknown) X it means that we gain a lot of information on X Z XY
6
How to determine similarity Frequent evolutionary events: 1. Substitution 2. Insertion, deletion 3. Duplication 4. Inversion Evolution at work Z XY Common ancestor, usually extinct available We’ll use only these
7
Alignment - the problem Operations: –substitution, –Insertion –deletion GTACT--C GGA-TGAC algorithmsforgenomes |||||||||| algorithms4genomes Algorithmsforgenomes ||||||| algorithms4genomes algorithmsforgenomes ||||||||||? ||||||| algorithms4--genomes
8
Many ways to align Which alignment is better? –Reflecting evolution ~= –Most probable ~= –Simplest GTA GCA GT-A G-CA GTA--- ---GCA 1) 2) 3)
9
Scoring Should give reasonable alignments And have to assign scores to: –Substitution (or match/mismatch) –Gap penalty Linear: g(k)= k Affine: g(k)= + k Concave, e.g.: g(k)=log(k) The score for an alignment is the sum of scores of all columns GTA-- --ATG G-T-A -ATG- GTACT--C GGA-TGAC
10
Substitution matrices Define a score for match/mismatch of letters DNA –- Simple: –- Used in genome alignments
11
Substitution matrices for aa Amino acids are not equal: 1. Some are easily substituted, similar: biochemical properties structure 2. Some mutations occur more often due to similar codons The two above give us substitution matrices orange: nonpolar and hydrophobic. green: polar and hydrophilic magenta box are acidic light blue box are basic http://www.cimr.cam.ac.uk/links/codon.htm http://www.people.virginia.edu/~rjh9u/aminacid.html
12
BLOSUM 62 substitution matrix http://www.carverlab.org/testing/epp.html
13
Constant vs. affine gap scoring Seq1G T A - -G - T - A Seq2- - A T G- A T G - Const-2 –2 1 –2 –2 (SUM = -7)-2 –2 1 -2 –2 (SUM = -7) Affine –4 1 -4 (SUM = -7)-3 –3 1 -3 –3 (SUM = -11) -2 extensionintro Gap Scoring -3Affine -2Constant …and +1 for match
14
Global alignment. The Needleman-Wunsch algorithm Goal: find the maximal scoring alignment Scores: m match, -s mismatch, -g for insertion/deletion Dynamic programming –Solve smaller subproblem(s) –Iteratively extend the solution The best alignment for X[1…i] and Y[1…j] is called M[i, j] X 1 … X i-1 - - X i Y 1 … - Y j-1 Y j -
15
The algorithm Y[1…j - 1] Y[j] X[1…i - 1] X[i] Y[1…j-1]Y[j] X[1…i] - Y[1…j] - X[1…i-1] X[i] M[i, j-1] - gM[i-1, j] - g M[i-1, j-1] + m - s M[i,j] = Goal: find the maximal scoring alignment Scores: m match, -s mismatch, -g for insertion/deletion The best alignment for X[1…i] and Y[1…j] is called M[i, j] 3 ways to extend the alignment
16
The algorithm – final equation M[i, j] = M[i, j- 1 ] – g M[i- 1, j] – g M[i- 1, j- 1 ] + score(X[i],Y[j]) jj-1 i-1 i max * * -g X 1 …X i-1 X i Y 1 …Y j-1 Y j X 1 …X i - Y 1 …Y j-1 Y j X 1 …X i-1 X i Y 1 …Y j - Corresponds to:
17
Example: global alignment of two sequences Align two DNA sequences: –GAGTGA –GAGGCGA (note the length difference) Parameters of the algorithm: –Match: score(A,A) = 1 –Mismatch: score(A,T) = -1 –Gap: g = -2 M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max
18
The algorithm. Step 1: init Create the matrix Initiation –0 at [0,0] –Apply the equation… M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max
19
The algorithm. Step 1: init Initiation of the matrix: –0 at pos [0,0] –Fill in the first row using the “ ” rule –Fill in the first column using “ ” M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max j i
20
The algorithm. Step 2: fill in Continue filling in of the matrix, remembering from which cell the result comes (arrows) M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max j i
21
The algorithm. Step 2: fill in We are done… Where’s the result? M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max j i
22
The algorithm. Step 3: traceback Start at the last cell of the matrix Go against the direction of arrows Sometimes the value may be obtained from more than one cell (which one?) j i M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max
23
The algorithm. Step 3: traceback Extract the alignments j i GAGT-GA GAGGCGA GA-GTGA GAGGCGA a b a) b)
24
Affine gaps M[i, j] = I x [i-1, j- 1 ] I y [i- 1, j-1] M[i- 1, j- 1 ] max score(X[i], Y[j]) + I x [i, j] = M[i, j-1] + g o I x [i, j- 1 ] + g e max I y [i, j] = M[i-1, j] + g o I y [i-1, j] + g x max Penalties: g o - gap opening (e.g. -8) g e - gap extension (e.g. -2) X 1 … - Y 1 … Y j X 1 …X i - Y 1 …Y j-1 Y j X 1 … - - Y 1 …Y j-1 Y j X1… XiY1… YjX1… XiY1… Yj @ home: think of boundary values M[*,0], I[*,0] etc.
25
Variation on global alignment Global alignment: previous algorithms is called global alignment, because it uses all letters from both sequences. CAGCACTTGGATTCTCGG CAGC-----G-T----GG Semi-global alignment: don’t penalize for start/end gaps (omit the start/end gaps of sequences). CAGCA-CTTGGATTCTCGG ---CAGCGTGG-------- –Applications of semi-global: –Finding a gene in genome –Placing marker onto a chromosome –One sequence much longer than the other –Danger! – really bad alignments for divergent seqs seq X: seq Y:
26
Semi-global alignment Ignore start gaps –First row/column set to 0 Ignore end gaps –Read the result from last row/column T G A - GTGAG- 13000 -202 10 1 0 000000
27
a heuristics –Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, DN (1981) The Mind's Best Work
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.