Presentation is loading. Please wait.

Presentation is loading. Please wait.

Pairwise alignment Computational Genomics and Proteomics.

Similar presentations


Presentation on theme: "Pairwise alignment Computational Genomics and Proteomics."— Presentation transcript:

1 Pairwise alignment Computational Genomics and Proteomics

2 Genetic code

3 Searching for similarities  What is the function of the new gene?  The “lazy” investigation: –Find the set of similar proteins –Identify similarities and differences –For long proteins: identify domains

4 Is similarity really interesting?

5  Common ancestry is more interesting Makes it more likely that genes share the same function  Homology: sharing a common ancestor –a binary property (yes/no)‏ –It’s a nice tool: When (a known gene) G is homologous to (an unknown) X it means that we gain a lot of information on X Z XY

6 How to determine similarity Frequent evolutionary events: 1. Substitution 2. Insertion, deletion 3. Duplication 4. Inversion Evolution at work Z XY Common ancestor, usually extinct available We’ll use only these

7 Alignment - the problem Operations: –substitution, –Insertion –deletion GTACT--C GGA-TGAC algorithmsforgenomes |||||||||| algorithms4genomes Algorithmsforgenomes ||||||| algorithms4genomes algorithmsforgenomes ||||||||||? ||||||| algorithms4--genomes

8 Many ways to align  Which alignment is better? –Reflecting evolution ~= –Most probable ~= –Simplest GTA GCA GT-A G-CA GTA--- ---GCA 1)‏ 2)‏ 3)‏

9 Scoring  Should give reasonable alignments  And have to assign scores to: –Substitution (or match/mismatch)‏ –Gap penalty Linear: g(k)=  k Affine: g(k)=  +  k Concave, e.g.: g(k)=log(k)  The score for an alignment is the sum of scores of all columns GTA-- --ATG G-T-A -ATG- GTACT--C GGA-TGAC

10 Substitution matrices Define a score for match/mismatch of letters DNA –- Simple: –- Used in genome alignments

11 Substitution matrices for aa Amino acids are not equal: 1. Some are easily substituted, similar: biochemical properties structure 2. Some mutations occur more often due to similar codons The two above give us substitution matrices orange: nonpolar and hydrophobic. green: polar and hydrophilic magenta box are acidic light blue box are basic http://www.cimr.cam.ac.uk/links/codon.htm http://www.people.virginia.edu/~rjh9u/aminacid.html

12 BLOSUM 62 substitution matrix http://www.carverlab.org/testing/epp.html

13 Constant vs. affine gap scoring Seq1G T A - -G - T - A Seq2- - A T G- A T G - Const-2 –2 1 –2 –2 (SUM = -7)-2 –2 1 -2 –2 (SUM = -7) Affine –4 1 -4 (SUM = -7)-3 –3 1 -3 –3 (SUM = -11)‏ -2 extensionintro Gap Scoring -3Affine -2Constant …and +1 for match

14 Global alignment. The Needleman-Wunsch algorithm  Goal: find the maximal scoring alignment  Scores: m match, -s mismatch, -g for insertion/deletion  Dynamic programming –Solve smaller subproblem(s)‏ –Iteratively extend the solution  The best alignment for X[1…i] and Y[1…j] is called M[i, j] X 1 … X i-1 - - X i Y 1 … - Y j-1 Y j -

15 The algorithm Y[1…j - 1] Y[j] X[1…i - 1] X[i] Y[1…j-1]Y[j] X[1…i] - Y[1…j] - X[1…i-1] X[i] M[i, j-1] - gM[i-1, j] - g M[i-1, j-1] + m - s M[i,j] = Goal: find the maximal scoring alignment Scores: m match, -s mismatch, -g for insertion/deletion The best alignment for X[1…i] and Y[1…j] is called M[i, j] 3 ways to extend the alignment

16 The algorithm – final equation M[i, j] = M[i, j- 1 ] – g M[i- 1, j] – g M[i- 1, j- 1 ] + score(X[i],Y[j])‏ jj-1 i-1 i max * * -g X 1 …X i-1 X i Y 1 …Y j-1 Y j X 1 …X i - Y 1 …Y j-1 Y j X 1 …X i-1 X i Y 1 …Y j - Corresponds to:

17 Example: global alignment of two sequences  Align two DNA sequences: –GAGTGA –GAGGCGA (note the length difference)‏  Parameters of the algorithm: –Match: score(A,A) = 1 –Mismatch: score(A,T) = -1 –Gap: g = -2 M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max

18 The algorithm. Step 1: init  Create the matrix  Initiation –0 at [0,0] –Apply the equation… M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max

19 The algorithm. Step 1: init  Initiation of the matrix: –0 at pos [0,0] –Fill in the first row using the “  ” rule –Fill in the first column using “  ” M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max j i

20 The algorithm. Step 2: fill in  Continue filling in of the matrix, remembering from which cell the result comes (arrows)‏ M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max j i

21 The algorithm. Step 2: fill in  We are done…  Where’s the result? M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max j i

22 The algorithm. Step 3: traceback  Start at the last cell of the matrix  Go against the direction of arrows  Sometimes the value may be obtained from more than one cell (which one?)‏ j i M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max

23 The algorithm. Step 3: traceback  Extract the alignments j i GAGT-GA GAGGCGA GA-GTGA GAGGCGA a b a)‏ b)‏

24 Affine gaps M[i, j] = I x [i-1, j- 1 ] I y [i- 1, j-1] M[i- 1, j- 1 ] max score(X[i], Y[j]) + I x [i, j] = M[i, j-1] + g o I x [i, j- 1 ] + g e max I y [i, j] = M[i-1, j] + g o I y [i-1, j] + g x max Penalties: g o - gap opening (e.g. -8)‏ g e - gap extension (e.g. -2)‏ X 1 … - Y 1 … Y j X 1 …X i - Y 1 …Y j-1 Y j X 1 … - - Y 1 …Y j-1 Y j X1… XiY1… YjX1… XiY1… Yj @ home: think of boundary values M[*,0], I[*,0] etc.

25 Variation on global alignment  Global alignment: previous algorithms is called global alignment, because it uses all letters from both sequences. CAGCACTTGGATTCTCGG CAGC-----G-T----GG  Semi-global alignment: don’t penalize for start/end gaps (omit the start/end gaps of sequences). CAGCA-CTTGGATTCTCGG ---CAGCGTGG-------- –Applications of semi-global: –Finding a gene in genome –Placing marker onto a chromosome –One sequence much longer than the other –Danger! – really bad alignments for divergent seqs seq X: seq Y:

26 Semi-global alignment  Ignore start gaps –First row/column set to 0  Ignore end gaps –Read the result from last row/column T G A - GTGAG- 13000 -202 10 1 0 000000

27  a heuristics –Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, DN (1981) The Mind's Best Work


Download ppt "Pairwise alignment Computational Genomics and Proteomics."

Similar presentations


Ads by Google