CSE 5290: Algorithms for Bioinformatics Fall 2011

Suprakash Datta Office: CSEB 3043 Phone: ext 77875 Course page: 12/7/2018 CSE 5290, Fall 2011

Last time: Dynamic programming algorithms
Next: Sequence alignment
The following slides are based on slides by the authors of our text.

Sequence Alignment
More realistic sequence alignment algorithms. Types:
- Global Alignment
- Scoring Matrices
- Local Alignment
- Alignment with Affine Gap Penalties

From LCS to Alignment: Change the Scoring
The Longest Common Subsequence (LCS) problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches).
- In the LCS Problem, we scored 1 for matches and 0 for indels.
- Consider penalizing indels and mismatches with negative scores.
Simplest scoring schema:
- +1 : match premium
- -μ : mismatch penalty
- -σ : indel penalty

Simple Scoring
When mismatches are penalized by -μ, indels are penalized by -σ, and matches are rewarded with +1, the resulting score is:
#matches - μ(#mismatches) - σ(#indels)

The Global Alignment Problem
Find the best alignment between two strings under a given scoring schema.
Input: Strings v and w and a scoring schema
Output: Alignment of maximum score
Edge weights in the edit graph: vertical and horizontal edges (↑, →) = -σ; diagonal edges = +1 if match, -μ if mismatch.
s_{i,j} = max {
  s_{i-1,j-1} + 1    if v_i = w_j
  s_{i-1,j-1} - μ    if v_i ≠ w_j
  s_{i-1,j} - σ
  s_{i,j-1} - σ
}
μ : mismatch penalty
σ : indel penalty
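As a rough illustration of the recurrence above, the following Python sketch fills the score table row by row. The function name `global_alignment_score` and the default values of `mu` and `sigma` are assumptions made for this example, not part of the slides, and the sketch returns only the optimal score rather than the alignment itself.

```python
def global_alignment_score(v, w, mu=1, sigma=1):
    """Score of the best global alignment of v and w.

    Scoring (as on the slide): +1 per match, -mu per mismatch, -sigma per indel.
    """
    n, m = len(v), len(w)
    # s[i][j] = best score for aligning v[:i] with w[:j]
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] - sigma      # first column: only vertical edges
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] - sigma      # first row: only horizontal edges
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = 1 if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(
                s[i - 1][j - 1] + diagonal,  # match or mismatch
                s[i - 1][j] - sigma,         # indel (gap in w)
                s[i][j - 1] - sigma,         # indel (gap in v)
            )
    return s[n][m]


if __name__ == "__main__":
    print(global_alignment_score("GTCTGA", "GTCAGC"))
```

The table has (n+1)(m+1) entries and each is computed in constant time, which is where the O(nm) running time of global alignment comes from.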

Scoring Matrices
To generalize scoring, consider a (4+1) x (4+1) scoring matrix δ. For amino acid sequence alignment, the scoring matrix would be of size (20+1) x (20+1). The addition of 1 is to include the score for comparison with a gap character "-". This simplifies the recurrence as follows:
s_{i,j} = max {
  s_{i-1,j-1} + δ(v_i, w_j)
  s_{i-1,j} + δ(v_i, -)
  s_{i,j-1} + δ(-, w_j)
}

Measuring Similarity
Measuring the extent of similarity between two sequences:
- Based on percent sequence identity
- Based on conservation

Percent Sequence Identity
The extent to which two nucleotide or amino acid sequences are invariant.
A C C T G A G - A G
A C G T G - G C A G
The third column is a mismatch and the two gap columns are indels; 7 of the 10 columns match, so the sequences are 70% identical.
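A minimal sketch of the calculation for the example above. The function name `percent_identity` is hypothetical, and dividing by the number of alignment columns (rather than by the length of one of the sequences) is an assumption chosen to reproduce the 70% figure.

```python
def percent_identity(row1, row2):
    """Percentage of alignment columns in which both rows hold the same letter."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    # A column counts as identical only if the letters agree and are not gaps.
    matches = sum(1 for a, b in zip(row1, row2) if a == b and a != "-")
    return 100.0 * matches / len(row1)


if __name__ == "__main__":
    print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))
```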

Making a Scoring Matrix
Scoring matrices are created based on biological evidence. Alignments can be thought of as two sequences that differ due to mutations. Some of these mutations have little effect on the protein's function, so some penalties δ(v_i, w_j) will be less harsh than others.

Scoring Matrix: Example
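[Table: excerpt of an amino acid scoring matrix for the residues A, R, N and K; the layout is not preserved in this transcript.]
Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.
Example alignment:
AKRANR
KAAANK
Score: -1 + (-1) + (-2) + 5 + 7 + 3 = 11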

Conservation
Amino acid changes that tend to preserve the physico-chemical properties of the original residue:
- Polar to polar: aspartate → glutamate
- Nonpolar to nonpolar: alanine → valine
- Similarly behaving residues: leucine to isoleucine

Scoring matrices
Amino acid substitution matrices:
- PAM
- BLOSUM
DNA substitution matrices:
- DNA is less conserved than protein sequences
- Less effective to compare coding regions at nucleotide level

PAM
Point Accepted Mutation (Dayhoff et al.)
1 PAM = PAM1 = 1% average change of all amino acid positions.
After 100 PAMs of evolution, not every residue will have changed:
- some residues may have mutated several times
- some residues may have returned to their original state
- some residues may not have changed at all

PAMX
PAM_x = (PAM_1)^x
PAM_250 = (PAM_1)^250
PAM250 is a widely used scoring matrix.
[Table: the PAM250 matrix, with rows and columns labeled by amino acid (Ala/A, Arg/R, Asn/N, Asp/D, Cys/C, Gln/Q, Glu/E, Gly/G, His/H, Ile/I, Leu/L, Lys/K, ..., Trp/W, Tyr/Y, Val/V); the entries are not preserved in this transcript.]

BLOSUM Blocks Substitution Matrix
Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins.
- Matrix name indicates evolutionary distance
- BLOSUM62 was created using sequences sharing no more than 62% identity

The Blosum50 Scoring Matrix
[Figure: the BLOSUM50 scoring matrix.]

Local vs. Global Alignment
- The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph.
- The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i', j') in the edit graph.
- In an edit graph with negatively scored edges, a local alignment may score higher than the global alignment.

Local vs. Global Alignment (cont’d)
Local alignment is better at finding a conserved segment.
Global alignment:
--T---CC-C-AGT---TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C
Local alignment:
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

Local Alignment: Example
Compute a "mini" global alignment to get a local alignment.
[Figure: edit graph showing a local alignment path next to the global alignment path.]

Local Alignments: Why?
- Two genes in different species may be similar over short conserved regions and dissimilar over the remaining regions.
- Example: Homeobox genes have a short region called the homeodomain that is highly conserved between species.
- A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence.

The Local Alignment Problem
Goal: Find the best local alignment between two strings
Input: Strings v, w and scoring matrix δ
Output: Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings

The Problem with this Problem
Long run time O(n^4):
- In the grid of size n x n there are ~n^2 vertices (i,j) that may serve as a source.
- For each such vertex, computing alignments from (i,j) to (i',j') takes O(n^2) time.
This can be remedied by giving free rides.

Local Alignment: Free Rides
Yeah, a free ride!
[Figure: edit graph with dashed edges leaving vertex (0,0).]
The dashed edges represent the free rides from (0,0) to every other node.

The Local Alignment Recurrence
The largest value of s_{i,j} over the whole edit graph is the score of the best local alignment.
The recurrence:
s_{i,j} = max {
  0
  s_{i-1,j-1} + δ(v_i, w_j)
  s_{i-1,j} + δ(v_i, -)
  s_{i,j-1} + δ(-, w_j)
}
Power of ZERO: the 0 term is the only change from the original recurrence of a Global Alignment, since there is only one "free ride" edge entering into every vertex.
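The local alignment recurrence can be sketched in a few lines of Python. This is only an illustration: the name `local_alignment_score` and the simple scoring (+1 match, -1 mismatch, -1 indel) are assumptions standing in for a general scoring matrix δ, and only the best score is returned, not the aligned substrings.

```python
def local_alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Best local alignment score of v and w (maximum entry of the score table)."""
    best = 0
    previous = [0] * (len(w) + 1)          # row i-1 of the table
    for i in range(1, len(v) + 1):
        current = [0] * (len(w) + 1)       # row i of the table
        for j in range(1, len(w) + 1):
            diagonal = match if v[i - 1] == w[j - 1] else mismatch
            current[j] = max(
                0,                           # free ride from (0,0)
                previous[j - 1] + diagonal,  # match or mismatch
                previous[j] + indel,         # gap in w
                current[j - 1] + indel,      # gap in v
            )
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    v = "TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC"
    w = "AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC"
    print(local_alignment_score(v, w))
```

Because of the 0 term, no entry is ever negative, so the running time stays O(nm) instead of the O(n^4) of trying every source vertex separately.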

Scoring Indels: Naive Approach
A fixed penalty σ is given to every indel:
- -σ for 1 indel
- -2σ for 2 consecutive indels
- -3σ for 3 consecutive indels, etc.
This can be too severe a penalty for a series of 100 consecutive indels.

Affine Gap Penalties
In nature, a series of k indels often comes as a single event rather than a series of k single nucleotide events:
ATA__GC
ATATTGC
This is more likely.
ATAG_GC
AT_GTGC
This is less likely.
Normal scoring would give the same score for both alignments.

Accounting for Gaps
Gaps: contiguous sequences of spaces in one of the rows.
Score for a gap of length x is: -(ρ + σx)
- ρ > 0 is the gap opening penalty, charged for introducing a gap.
- σ is the gap extension penalty.
- ρ will be large relative to σ, because you do not want to add too much of a penalty for extending the gap.

Affine Gap Penalties
Gap penalties:
- -ρ-σ when there is 1 indel
- -ρ-2σ when there are 2 indels
- -ρ-3σ when there are 3 indels, etc.
- -ρ-x·σ in general (one gap opening plus x gap extensions)
Somewhat reduced penalties (as compared to naïve scoring) are given to runs of horizontal and vertical edges.

Affine Gap Penalties and Edit Graph
To reflect affine gap penalties, we have to add "long" horizontal and vertical edges to the edit graph. Each such edge of length x should have weight -ρ - x·σ.

Adding “Affine Penalty” Edges to the Edit Graph
There are many such edges! Adding them to the graph increases the running time of the alignment algorithm by a factor of n (where n is the length of the sequences), so the complexity increases from O(n^2) to O(n^3).

Manhattan in 3 Layers
[Figure: three-layer Manhattan grid with edges labeled ρ, σ and δ.]

Affine Gap Penalties and 3 Layer Manhattan Grid
The three recurrences for the scoring algorithm create a 3-layered graph.
- The top level creates/extends gaps in the sequence w.
- The bottom level creates/extends gaps in the sequence v.
- The middle level extends matches and mismatches.

Switching between 3 Layers
Levels:
- The main level is for diagonal edges
- The lower level is for horizontal edges
- The upper level is for vertical edges
A jumping penalty (-ρ - σ) is assigned to moving from the main level to either the upper level or the lower level.
There is a gap extension penalty (-σ) for each continuation on a level other than the main level.

The 3-leveled Manhattan Grid
[Figure: the three levels, labeled "Gaps in w" (upper), "Matches/Mismatches" (main) and "Gaps in v" (lower).]

Affine Gap Penalty Recurrences
Upper level (gaps in w, deletions):
↓s_{i,j} = max {
  ↓s_{i-1,j} - σ          continue gap in w (deletion)
  s_{i-1,j} - (ρ+σ)       start gap in w (deletion): from middle
}
Lower level (gaps in v, insertions):
→s_{i,j} = max {
  →s_{i,j-1} - σ          continue gap in v (insertion)
  s_{i,j-1} - (ρ+σ)       start gap in v (insertion): from middle
}
Main level:
s_{i,j} = max {
  s_{i-1,j-1} + δ(v_i, w_j)   match or mismatch
  ↓s_{i,j}                    end deletion: from top
  →s_{i,j}                    end insertion: from bottom
}
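The three-level recurrences can be sketched as three tables filled in the same pass. The function name `affine_gap_score`, the table names `main`, `upper` and `lower`, and the +1/-1 match/mismatch scores (used in place of a scoring matrix δ) are assumptions made for this illustration.

```python
NEG_INF = float("-inf")


def affine_gap_score(v, w, rho, sigma, match=1, mismatch=-1):
    """Best global alignment score when a gap of length x costs rho + sigma * x."""
    n, m = len(v), len(w)
    main = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # diagonal edges
    upper = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # vertical edges: gaps in w
    lower = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # horizontal edges: gaps in v
    main[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                upper[i][j] = max(upper[i - 1][j] - sigma,          # continue gap in w
                                  main[i - 1][j] - (rho + sigma))   # start gap in w
            if j > 0:
                lower[i][j] = max(lower[i][j - 1] - sigma,          # continue gap in v
                                  main[i][j - 1] - (rho + sigma))   # start gap in v
            if i > 0 and j > 0:
                diagonal = match if v[i - 1] == w[j - 1] else mismatch
                main[i][j] = main[i - 1][j - 1] + diagonal          # match or mismatch
            if i > 0 or j > 0:
                # returning to the main level from the upper or lower level is free
                main[i][j] = max(main[i][j], upper[i][j], lower[i][j])
    return main[n][m]


if __name__ == "__main__":
    # ATA__GC against ATATTGC: five matches and one gap of length 2
    print(affine_gap_score("ATAGC", "ATATTGC", rho=2, sigma=1))
```

Each of the three tables has O(nm) entries computed in constant time, so the running time stays O(nm) rather than the O(n^3) of adding explicit long edges. With `rho=0` the sketch reduces to the fixed per-indel penalty.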

Next: Multiple Alignment
- Dynamic Programming in 3-D
- Progressive Alignment
- Profile Progressive Alignment (ClustalW)
- Scoring Multiple Alignments
- Entropy
- Sum of Pairs Alignment

Multiple Alignment vs Pairwise Alignment
- Up until now we have only tried to align two sequences.
- What about more than two? And what for?
- A faint similarity between two sequences becomes significant if present in many.
- Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal.

Generalizing Pairwise Alignment
- Alignment of 2 sequences is represented as a 2-row matrix
- In a similar way, we represent alignment of 3 sequences as a 3-row matrix:
A T _ G C G _
A _ C G T _ A
A T C A C _ A
- Score: more conserved columns, better alignment

Alignments = Paths in…
Align 3 sequences: ATGC, AATC, ATGC

Alignment Paths
x coordinate:  0  1  1  2  3  4
                 A  -- T  G  C
y coordinate:  0  1  2  3  3  4
                 A  A  T  -- C
z coordinate:  0  0  1  2  3  4
                 -- A  T  G  C
Resulting path in (x,y,z) space:
(0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)

Aligning Three Sequences
Same strategy as aligning two sequences:
- Use a 3-D "Manhattan Cube", with each axis representing a sequence to align
- For global alignments, go from source to sink
[Figure: cube with the source and sink vertices marked.]

2-D vs 3-D Alignment Grid
[Figure: a 2-D edit graph with axes V and W next to a 3-D edit graph.]

2-D Cell versus 3-D Alignment Cell
- In 2-D, 3 edges in each unit square
- In 3-D, 7 edges in each unit cube

Architecture of 3-D Alignment Cell
[Figure: unit cube with vertices (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i-1,j,k), (i,j-1,k-1), (i,j-1,k), (i,j,k-1) and (i,j,k).]

Multiple Alignment: Dynamic Prog.
s_{i,j,k} = max {
  s_{i-1,j-1,k-1} + δ(v_i, w_j, u_k)    cube diagonal: no indels
  s_{i-1,j-1,k} + δ(v_i, w_j, _)        face diagonal: one indel
  s_{i-1,j,k-1} + δ(v_i, _, u_k)        face diagonal: one indel
  s_{i,j-1,k-1} + δ(_, w_j, u_k)        face diagonal: one indel
  s_{i-1,j,k} + δ(v_i, _, _)            edge diagonal: two indels
  s_{i,j-1,k} + δ(_, w_j, _)            edge diagonal: two indels
  s_{i,j,k-1} + δ(_, _, u_k)            edge diagonal: two indels
}
δ(x, y, z) is an entry in the 3-D scoring matrix.

Multiple Alignment: Running Time
- For 3 sequences of length n, the run time is 7n^3; O(n^3)
- For k sequences, build a k-dimensional Manhattan, with run time (2^k - 1)(n^k); O(2^k n^k)
- Conclusion: the dynamic programming approach for alignment between two sequences is easily extended to k sequences, but it is impractical due to exponential running time

Multiple Alignment Induces Pairwise Alignments
Every multiple alignment induces pairwise alignments:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
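Inducing a pairwise alignment amounts to keeping two rows of the multiple alignment and dropping every column in which both rows hold a gap. The sketch below illustrates this; the function name `induced_pairwise` is hypothetical.

```python
def induced_pairwise(rows, a, b):
    """Pairwise alignment of rows[a] and rows[b] induced by a multiple alignment."""
    # Keep only the columns where at least one of the two rows has a letter.
    columns = [(x, y) for x, y in zip(rows[a], rows[b])
               if not (x == "-" and y == "-")]
    first = "".join(x for x, _ in columns)
    second = "".join(y for _, y in columns)
    return first, second


if __name__ == "__main__":
    multiple = ["AC-GCGG-C",   # x
                "AC-GC-GAG",   # y
                "GCCGC-GAG"]   # z
    for a, b in [(0, 1), (0, 2), (1, 2)]:
        print(induced_pairwise(multiple, a, b))
```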

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments
Given 3 arbitrary pairwise alignments:
x: ACGCTGG-C    x: AC-GCTGG-C    y: AC-GC-GAG
y: ACGC--GAC    z: GCCGCA-GAG    z: GCCGCAGAG
can we construct a multiple alignment that induces them?
NOT ALWAYS. Pairwise alignments may be inconsistent.

Inferring Multiple Alignment from Pairwise Alignments
- From an optimal multiple alignment, we can infer pairwise alignments between all pairs of sequences, but they are not necessarily optimal
- It is difficult to infer a "good" multiple alignment from optimal pairwise alignments between all sequences

Combining Optimal Pairwise Alignments into Multiple Alignment
[Figure: two sets of pairwise alignments, one labeled "Can combine pairwise alignments into multiple alignment" and the other "Cannot combine pairwise alignments into multiple alignment".]

Profile Representation of Multiple Alignment
-  A  G  G  C  T  A  T  C  A  C  C  T  G
T  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  T  C  A  C  -  G  G
C  A  G  -  C  T  A  T  C  G  C  -  G  G
[Table: profile with rows A, C, G, T giving the frequency of each letter in each column; the entries are not preserved in this transcript.]
- In the past we were aligning a sequence against a sequence
- Can we align a sequence against a profile?
- Can we align a profile against a profile?
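A profile can be computed column by column from the rows of an alignment. The sketch below is one simple way to do it; the function name `build_profile` and the choice to track the gap character alongside A, C, G and T are assumptions made for this example, and the two input rows are just a small illustrative alignment.

```python
from collections import Counter


def build_profile(rows, alphabet="ACGT-"):
    """Frequency of each symbol of the alphabet in each column of an alignment."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        # Counter returns 0 for symbols that do not occur in the column.
        profile.append({symbol: counts[symbol] / len(rows) for symbol in alphabet})
    return profile


if __name__ == "__main__":
    for column in build_profile(["GTCTGA", "GTCAGC"]):
        print(column)
```

Aligning a sequence against a profile can then reuse the pairwise recurrences, with the score of a letter against a profile column defined, for instance, as the frequency-weighted average of the letter-against-letter scores.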

Aligning alignments Given two alignments, can we align them?
Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
Alignment 2:
w GGACGTACC--
v GGACCT-----

Aligning alignments Given two alignments, can we align them?
Hint: use alignment of corresponding profiles
Combined Alignment:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----

Multiple Alignment: Greedy Approach
- Choose the most similar pair of strings and combine them into a profile, thereby reducing alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.
- This is a heuristic greedy method.
Before (k sequences):
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG
After (k-1 sequences/profiles):
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…

Greedy Approach: Example
Consider these 4 sequences:
s1 GATTCA
s2 GTCTGA
s3 GATATT
s4 GTCAGC

Greedy Approach: Example (cont’d)
There are C(4,2) = 6 possible pairwise alignments:
s2 GTCTGA
s4 GTCAGC (score = 2)

s1 GAT-TCA
s2 G-TCTGA (score = 1)

s1 GAT-TCA
s3 GATAT-T (score = 1)

s1 GATTCA--
s4 G-T-CAGC (score = 0)

s2 G-TCTGA
s3 GATAT-T (score = -1)

s3 GAT-ATT
s4 G-TCAGC (score = -1)
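The six scores in this example are consistent with a match premium of +1 and a penalty of 1 for each mismatch and each indel. The slide does not state the scheme, so treat those values as an assumption. The sketch below scores each alignment and picks the closest pair, which is the first step of the greedy method; the names `score_alignment`, `pairs` and `closest` are hypothetical.

```python
def score_alignment(row1, row2, mu=1, sigma=1):
    """#matches - mu * #mismatches - sigma * #indels for a given pairwise alignment."""
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score -= sigma      # indel
        elif a == b:
            score += 1          # match
        else:
            score -= mu         # mismatch
    return score


# The six pairwise alignments of s1..s4 from the example.
pairs = {
    ("s2", "s4"): ("GTCTGA", "GTCAGC"),
    ("s1", "s2"): ("GAT-TCA", "G-TCTGA"),
    ("s1", "s3"): ("GAT-TCA", "GATAT-T"),
    ("s1", "s4"): ("GATTCA--", "G-T-CAGC"),
    ("s2", "s3"): ("G-TCTGA", "GATAT-T"),
    ("s3", "s4"): ("GAT-ATT", "G-TCAGC"),
}
scores = {names: score_alignment(*rows) for names, rows in pairs.items()}
closest = max(scores, key=scores.get)   # the pair the greedy method merges first

if __name__ == "__main__":
    print(scores)
    print(closest)
```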

Greedy Approach: Example (cont’d)
s2 and s4 are closest; combine:
s2 GTCTGA
s4 GTCAGC
s2,4 GTCt/aGa/c (profile)
New set of 3 sequences:
s1 GATTCA
s3 GATATT
s2,4 GTCt/aGa/c

Progressive Alignment
Progressive alignment is a variation of the greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments.
- Progressive alignment works well for close sequences, but deteriorates for distant sequences
- Gaps in the consensus string are permanent
- Use profiles to compare sequences

ClustalW
Popular multiple alignment tool today. 'W' stands for 'weighted' (different parts of alignment are weighted differently).
Three-step process:
1.) Construct pairwise alignments
2.) Build Guide Tree
3.) Progressive Alignment guided by the tree

Step 1: Pairwise Alignment
Aligns each sequence against each other, giving a similarity matrix.
Similarity = exact matches / sequence length (percent identity)
[Table: 4 x 4 similarity matrix over v1, v2, v3, v4; the entries are not preserved in this transcript. An entry of .17 means 17% identical.]

Step 2: Guide Tree
Create Guide Tree using the similarity matrix:
- ClustalW uses the neighbor-joining method
- Guide tree roughly reflects evolutionary relations

Step 2: Guide Tree (cont’d)
[Figure: guide tree over v1, v3, v4 and v2, built from the similarity matrix.]
Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)

Step 3: Progressive Alignment
- Start by aligning the two most similar sequences
- Following the guide tree, add in the next sequences, aligning to the existing alignment
- Insert gaps as necessary
FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP LPFQ
            . . : ** :.. *:.* * . * **:
Dots and stars show how well-conserved a column is.

Multiple Alignments: Scoring
- Number of matches (multiple longest common subsequence score)
- Entropy score
- Sum of pairs (SP-Score)

Multiple LCS Score
- A column is a "match" if all the letters in the column are the same
- Only good for very similar sequences
AAA
AAA
AAT
ATC

Entropy
Define frequencies for the occurrence of each letter in each column of the multiple alignment:
AAA
AAA
AAT
ATC
- pA = 1, pT = pG = pC = 0 (1st column)
- pA = 0.75, pT = 0.25, pG = pC = 0 (2nd column)
- pA = 0.50, pT = 0.25, pC = 0.25, pG = 0 (3rd column)
Compute the entropy of each column.

Entropy: Example
[Figure: entropy of a best-case column and of a worst-case column.]

Multiple Alignment: Entropy Score
Entropy for a multiple alignment is the sum of entropies of its columns:
Σ_{over all columns} ( -Σ_{X=A,T,G,C} p_X log p_X )

Entropy of an Alignment: Example
Column entropy over the letters A, C, G, T: -(pA log pA + pC log pC + pG log pG + pT log pT), with logarithms taken base 2 and 0*log0 taken to be 0.
Column 1 = -[1*log(1) + 0*log0 + 0*log0 + 0*log0] = 0
Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log0 + 0*log0] = -[(1/4)*(-2) + (3/4)*(-.415)] = +0.811
Column 3 = -[(1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4)] = 4 * -[(1/4)*(-2)] = +2.0
Alignment Entropy = 0 + 0.811 + 2.0 = +2.811
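The entropy score is straightforward to compute directly. In the sketch below, the function names `column_entropy` and `alignment_entropy` are hypothetical, and the four rows are an assumed alignment chosen only so that its columns have the frequencies used in the worked example (all A; one A and three C; one each of A, C, G, T).

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Entropy (base 2) of the letter frequencies in one alignment column."""
    n = len(column)
    # Letters with frequency 0 never appear in the Counter, so 0*log0 is skipped.
    return sum(-(count / n) * log2(count / n) for count in Counter(column).values())


def alignment_entropy(rows):
    """Sum of the column entropies of a multiple alignment."""
    return sum(column_entropy(column) for column in zip(*rows))


if __name__ == "__main__":
    rows = ["AAA", "ACC", "ACG", "ACT"]   # assumed example alignment
    print(round(alignment_entropy(rows), 3))
```

A lower entropy means more conserved columns, so under this score the goal is to minimize rather than maximize.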

Inferring Pairwise Alignments from Multiple Alignments
- From a multiple alignment, we can infer pairwise alignments between all sequences, but they are not necessarily optimal
- This is like projecting a 3-D multiple alignment path onto a 2-D face of the cube

Multiple Alignment Projections
A 3-D alignment can be projected onto the 2-D plane to represent an alignment between a pair of sequences.
[Figure: all 3 pairwise projections of the multiple alignment.]

Sum of Pairs Score(SP-Score)
- Consider the pairwise alignment of sequences a_i and a_j imposed by a multiple alignment of k sequences
- Denote the score of this (not necessarily optimal) pairwise alignment as s*(a_i, a_j)
- Sum up the pairwise scores for a multiple alignment:
s(a_1, …, a_k) = Σ_{i<j} s*(a_i, a_j)

Computing SP-Score
Aligning 4 sequences: 6 pairwise alignments
Given a1, a2, a3, a4:
s(a1…a4) = Σ s*(ai,aj) = s*(a1,a2) + s*(a1,a3) + s*(a1,a4) + s*(a2,a3) + s*(a2,a4) + s*(a3,a4)

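SP-Score: Example
a1  ATG-C-AAT
.   A-G-CATAT
ak  ATCCCATTT
To calculate each column, sum the scores of all pairs of sequences in it:
- Column 1 (A, A, A): each of the three pairs scores 1, so Score = 3
- Column 3 (G, G, C): the pair (G, G) scores 1 and each of the two pairs (G, C) scores -μ, so Score = 1 - 2μ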
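The column-by-column calculation can be sketched as follows. The function names `column_sp_score` and `sp_score` are hypothetical, the indel penalty `sigma` follows the simple scoring scheme from earlier in the lecture, and scoring a gap against a gap as 0 is an assumption that the slides do not address.

```python
from itertools import combinations


def column_sp_score(column, mu=1, sigma=1):
    """Sum of pairwise scores over all pairs of symbols in one column."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue            # gap against gap: scored 0 (assumption)
        if a == "-" or b == "-":
            total -= sigma      # indel
        elif a == b:
            total += 1          # match
        else:
            total -= mu         # mismatch
    return total


def sp_score(rows, mu=1, sigma=1):
    """SP-score of a multiple alignment: sum of its column scores."""
    return sum(column_sp_score(column, mu, sigma) for column in zip(*rows))


if __name__ == "__main__":
    rows = ["ATG-C-AAT", "A-G-CATAT", "ATCCCATTT"]
    columns = list(zip(*rows))
    print(column_sp_score(columns[0]))          # column 1: A, A, A
    print(column_sp_score(columns[2], mu=1))    # column 3: G, G, C
    print(sp_score(rows))
```

Summing over columns gives the same total as summing the scores of the induced pairwise alignments, as long as gap-against-gap columns score 0.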

Multiple Alignment: History
- 1975 Sankoff: Formulated multiple alignment problem and gave dynamic programming solution
- 1988 Carrillo-Lipman: Branch and Bound approach for MSA
- 1990 Feng-Doolittle: Progressive alignment
- 1994 Thompson-Higgins-Gibson, ClustalW: Most popular multiple alignment program
- 1998 Morgenstern et al., DIALIGN: Segment-based multiple alignment
- 2000 Notredame-Higgins-Heringa, T-coffee: Using the library of pairwise alignments
- 2004 MUSCLE
- What's next?

Problems with Multiple Alignment
- Multidomain proteins evolve not only through point mutations but also through domain duplications and domain recombinations
- Although MSA is a 30 year old problem, there were no MSA approaches for aligning rearranged sequences (i.e., multi-domain proteins with shuffled domains) prior to 2002
- Often impossible to align all protein sequences throughout their entire length

Next: Gene prediction
- Similarity-Based Approaches
- de Novo approaches
Note: HW2 is about a de novo (or ab initio) approach.
Some of the following slides are based on slides by the authors of our text.

Using Known Genes to Predict New Genes
- Some genomes may be very well-studied, with many genes having been experimentally verified.
- Closely-related organisms may have similar genes.
- Unknown genes in one species may be compared to genes in some closely-related species.

Similarity-Based Approach to Gene Prediction
- Genes in different organisms are similar
- The similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome
- Problem: Given a known gene and an unannotated genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene

Comparing Genes in Two Genomes
[Figure: comparison of two genomes, showing small islands of similarity corresponding to similarities between exons.]

Reverse Translation
- Given a known protein, find a gene in the genome which codes for it
- One might infer the coding DNA of the given protein by reversing the translation process
- Inexact: amino acids map to > 1 codon
- This problem is essentially reduced to an alignment problem

Reverse Translation (cont’d)
- This reverse translation problem can be modeled as traveling in a Manhattan grid with free horizontal jumps
- Complexity of Manhattan is n^3
- Every horizontal jump models an insertion of an intron
- Problem with this approach: it would match nucleotides pointwise and use horizontal jumps at every opportunity

Comparing Genomic DNA Against mRNA
[Figure: a portion of the genome (exon1, intron1, exon2, intron2, exon3) compared against the mRNA (codon sequence).]

Using Similarities to Find the Exon Structure
- The known frog gene is aligned to different locations in the human genome
- Find the "best" path to reveal the exon structure of the human gene
[Figure: frog gene (known) plotted against the human genome.]

Finding Local Alignments
Use local alignments to find all islands of similarity.
[Figure: frog genes (known) plotted against the human genome.]

Chaining Local Alignments
- Find substrings that match a given gene sequence (candidate exons)
- Define a candidate exon as (l, r, w): left, right, and weight, where the weight is the score of the local alignment
- Look for a maximum chain of substrings
- Chain: a set of non-overlapping, nonadjacent intervals

Exon Chaining Problem
[Figure: weighted intervals drawn over a coordinate axis.]
- Locate the beginning and end of each interval (2n points)
- Find the "best" path

Exon Chaining Problem: Formulation
Exon Chaining Problem: Given a set of putative exons, find a maximum set of non-overlapping putative exons
Input: a set of weighted intervals (putative exons)
Output: A maximum chain of intervals from this set
Would a greedy algorithm solve this problem?

Exon Chaining Problem: Graph Representation
This problem can be solved with dynamic programming in O(n) time.

Exon Chaining Algorithm
ExonChaining (G, n) // Graph, number of intervals
  for i ← 1 to 2n
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of the interval I
      w ← weight of the interval I
      s_i ← max {s_j + w, s_{i-1}}
    else
      s_i ← s_{i-1}
  return s_{2n}
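A runnable sketch of the same procedure in Python. The function name `exon_chaining` and the `(left, right, weight)` tuple layout are choices made for this example, and it assumes, as the 2n-point picture does, that all interval endpoints are distinct. Sorting the endpoints costs O(n log n); the O(n) bound applies to the scan once the endpoints are in order.

```python
def exon_chaining(intervals):
    """Maximum total weight of a chain of non-overlapping intervals.

    intervals: list of (left, right, weight) with all 2n endpoints distinct.
    """
    # Build the 2n endpoint "vertices" and sort them by position.
    events = []
    for index, (left, right, _weight) in enumerate(intervals):
        events.append((left, "L", index))
        events.append((right, "R", index))
    events.sort(key=lambda event: event[0])

    s = [0] * (len(events) + 1)    # s[i] = best chain weight using vertices 1..i
    left_vertex = {}               # interval index -> vertex number of its left end
    for i, (_position, kind, index) in enumerate(events, start=1):
        if kind == "L":
            left_vertex[index] = i
            s[i] = s[i - 1]
        else:
            j = left_vertex[index]
            weight = intervals[index][2]
            s[i] = max(s[j] + weight, s[i - 1])   # take the interval or skip it
    return s[-1]


if __name__ == "__main__":
    # Picking the heaviest interval first (weight 10) is not optimal here.
    print(exon_chaining([(1, 10, 10), (2, 5, 6), (6, 9, 6)]))
```

The small example in the `__main__` block bears on the greedy question: the single heaviest interval overlaps two lighter intervals whose combined weight is larger.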

Exon Chaining: Deficiencies
- Poor definition of the putative exon endpoints
- Optimal chain of intervals may not correspond to any valid alignment
  - First interval may correspond to a suffix, whereas second interval may correspond to a prefix
  - Combination of such intervals is not a valid alignment
