Learning to Align: a Statistical Approach. IDA 2007. Elisa Ricci, University of Perugia (IT); Tijl De Bie, Nello Cristianini, University of Bristol (UK).
Outline
- Sequence alignment
- Z-score as a function of the alignment parameters
- Z-score computation by dynamic programming
- Inverse Parametric Sequence Alignment Problem (IPSAP)
- Z-score maximization to solve IPSAP
- Experimental results: artificial data; PALI dataset of protein structure alignments
- Conclusions
Sequence Alignment. Definition: given two sequences S1, S2, a global alignment is an assignment of gaps so as to line up each letter in one sequence with either a gap or a letter in the other sequence. It is used to determine the similarity between biological sequences. Example: Σ = {A, T, G, C}, S1 = ATGCTTTC, S2 = CTGTCGCC; one possible alignment A is ATGCTTTC--- over ---CTGTCGCC.
Sequence Alignment. Score of the alignment: a linear function of the parameters. 3-parameter model: matches are rewarded with a_m, mismatches are penalized by a_s, gaps are weighted by a_g: f(S1, S2, A) = a_m·m + a_s·s + a_g·g = a^T x, with x^T = [m s g] = [#matches #mismatches #gaps] and a^T = [a_m a_s a_g]. Example: for the alignment above, f(S1, S2, A) = 4·a_m + a_s + 6·a_g. But how is the optimal alignment determined? A scoring scheme must be specified.
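Because the score is linear in the feature counts, evaluating f for a fixed alignment only requires extracting (m, s, g). A minimal sketch for the 3-parameter model (the helper name and the parameter values are illustrative, not taken from the slides):

```python
def features_3param(aligned1, aligned2):
    """Count (#matches, #mismatches, #gaps) in a gapped alignment."""
    m = s = g = 0
    for c1, c2 in zip(aligned1, aligned2):
        if c1 == '-' or c2 == '-':
            g += 1          # a gap in either sequence
        elif c1 == c2:
            m += 1          # match
        else:
            s += 1          # mismatch
    return (m, s, g)

# The alignment from the slide: 4 matches, 1 mismatch, 6 gaps.
x = features_3param("ATGCTTTC---", "---CTGTCGCC")
a = (2.0, -1.0, -0.5)       # example parameters (a_m, a_s, a_g)
score = sum(ai * xi for ai, xi in zip(a, x))
print(x, score)             # (4, 1, 6) and 4*a_m + 1*a_s + 6*a_g = 4.0
```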
Sequence Alignment. 4-parameter model: affine gap penalties, i.e. different costs depending on whether a gap starts at a given position (gap opening penalty a_o) or continues (gap extension penalty a_e). 211/212-parameter model: gap penalties plus a symmetric scoring matrix with elements a_yt, y, t ∈ Σ, Σ = {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V}; e.g. C paired with D contributes a_CD = a_DC.
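For the larger models only the feature vector changes: one count per unordered residue pair (210 symmetric entries for 20 amino acids) plus the gap terms. A rough sketch of feature extraction under a 212-parameter reading (substitution counts plus gap openings and gap extensions); the exact encoding and the gap-counting convention are my assumptions, not from the slides:

```python
from collections import Counter

def features_212(aligned1, aligned2):
    """Counts of unordered residue pairs, gap openings and gap extensions."""
    feats = Counter()
    in_gap = False
    for c1, c2 in zip(aligned1, aligned2):
        if c1 == '-' or c2 == '-':
            feats['gap_ext' if in_gap else 'gap_open'] += 1
            in_gap = True
        else:
            feats[tuple(sorted((c1, c2)))] += 1   # symmetric: a_CD == a_DC
            in_gap = False
    return feats

print(features_212("ATGCTTTC---", "---CTGTCGCC"))
# counts: ('C','C'): 2, ('T','T'): 2, ('G','T'): 1, gap_open: 2, gap_ext: 4
```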
Sequence Alignment. Optimal alignment: the highest-scoring alignment; optimality depends on the parameters a. The number of possible alignments N is exponential in the length of the sequences, but the optimal alignment can be computed by dynamic programming (DP) in O(nm) time [Needleman-Wunsch, 1970]. Alignments can be represented as paths from the upper-left to the lower-right corner of the alignment graph. [Figure: alignment graph with S1 = ATGCTTTC on one axis and S2 on the other.]
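For reference, a compact Needleman-Wunsch sketch for the 3-parameter model with a linear gap penalty; this is the textbook O(nm) recurrence rather than the authors' code, and the parameter values are placeholders:

```python
def needleman_wunsch(s1, s2, a_m=2.0, a_s=-1.0, a_g=-0.5):
    """Optimal global alignment score in O(n*m) time (3-parameter model)."""
    n, m = len(s1), len(s2)
    # D[i][j] = best score aligning the prefix s1[:i] with the prefix s2[:j]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * a_g                          # leading gaps in s2
    for j in range(1, m + 1):
        D[0][j] = j * a_g                          # leading gaps in s1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = a_m if s1[i - 1] == s2[j - 1] else a_s
            D[i][j] = max(D[i - 1][j - 1] + sub,   # diagonal: match/mismatch
                          D[i - 1][j] + a_g,       # gap in s2
                          D[i][j - 1] + a_g)       # gap in s1
    return D[n][m]

print(needleman_wunsch("ATGCTTTC", "CTGTCGCC"))
```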
Moments of the scores. The mean and the variance of the scores can be expressed as functions of the parameters a. Example: for the 3-parameter model, m(S1, S2) = a^T b and s²(S1, S2) = a^T C a, where b and C are the mean vector and the covariance matrix of the feature vectors x over all possible alignments.
The Z-score. Definition: let m(S1, S2) and s²(S1, S2) be the average score and the variance of the scores over all possible alignments between S1 and S2. Let Â be the optimal alignment between S1 and S2 for a given a and x̂ the associated feature vector. We define the Z-score Z(S1, S2) = (a^T x̂ − m(S1, S2)) / s(S1, S2), where a^T x̂ = f(S1, S2, Â) is the optimal score.
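On toy inputs the definition can be checked by brute force: enumerate every global alignment, compute its feature vector, and standardize the optimal score. The sketch below (my own helper, exponential in the sequence lengths) also verifies the relations m = a^T b and s² = a^T C a from the previous slide:

```python
import numpy as np

def all_feature_vectors(s1, s2):
    """Feature vectors (#matches, #mismatches, #gaps) of every global alignment."""
    def rec(i, j):
        if i == len(s1) and j == len(s2):
            return [np.zeros(3)]
        out = []
        if i < len(s1) and j < len(s2):             # align s1[i] with s2[j]
            step = np.array([1, 0, 0]) if s1[i] == s2[j] else np.array([0, 1, 0])
            out += [step + v for v in rec(i + 1, j + 1)]
        if i < len(s1):                             # s1[i] against a gap
            out += [np.array([0, 0, 1]) + v for v in rec(i + 1, j)]
        if j < len(s2):                             # s2[j] against a gap
            out += [np.array([0, 0, 1]) + v for v in rec(i, j + 1)]
        return out
    return np.array(rec(0, 0))

a = np.array([2.0, -1.0, -0.5])                     # example parameters
X = all_feature_vectors("ATGC", "AGC")              # toy pair: 129 alignments
scores = X @ a
b, C = X.mean(axis=0), np.cov(X.T, bias=True)       # moments of the feature vectors
z = (scores.max() - a @ b) / np.sqrt(a @ C @ a)     # Z-score of the optimal alignment
print(len(X), z, (scores.max() - scores.mean()) / scores.std())  # the two Z values agree
```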
Computing the Z-score. Given a parameter vector a, the Z-score can be computed with DP for the 3-, 4-, 211- and 212-parameter models. Example: for the 3-parameter model, 9 DP routines are required. [Figure: example DP table over the alignment graph.]
Computing the Z-score. Two DP tables: p and mm. Inductive assumption: p(i, j-1), p(i-1, j), p(i-1, j-1) hold the number of alignments ending at those cells, and mm(i, j-1), mm(i-1, j), mm(i-1, j-1) the corresponding mean numbers of matches. Each cell is filled with the following rules:
p(i, j) = p(i-1, j-1) + p(i, j-1) + p(i-1, j)
mm(i, j) · p(i, j) = [mm(i-1, j-1) + M] · p(i-1, j-1) + mm(i, j-1) · p(i, j-1) + mm(i-1, j) · p(i-1, j)
where M = 1 if S1(i) = S2(j) and M = 0 otherwise.
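These rules translate directly into two (n+1) x (m+1) tables filled row by row. A minimal sketch, with boundary conditions chosen so that the single all-gap alignment of each prefix is counted (the boundaries are my assumption, since the slide does not show them):

```python
def count_and_mean_matches(s1, s2):
    """Number of global alignments and their mean number of matches, by DP."""
    n, m = len(s1), len(s2)
    p = [[0.0] * (m + 1) for _ in range(n + 1)]    # number of alignments of prefixes
    mm = [[0.0] * (m + 1) for _ in range(n + 1)]   # mean #matches over those alignments
    p[0][0] = 1.0
    for i in range(1, n + 1):
        p[i][0] = 1.0                              # only the all-gap alignment
    for j in range(1, m + 1):
        p[0][j] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M = 1.0 if s1[i - 1] == s2[j - 1] else 0.0
            p[i][j] = p[i - 1][j - 1] + p[i][j - 1] + p[i - 1][j]
            total = ((mm[i - 1][j - 1] + M) * p[i - 1][j - 1]   # diagonal step
                     + mm[i][j - 1] * p[i][j - 1]               # gap step
                     + mm[i - 1][j] * p[i - 1][j])              # gap step
            mm[i][j] = total / p[i][j]
    return p[n][m], mm[n][m]

print(count_and_mean_matches("ATGC", "AGC"))   # 129 alignments for this toy pair
```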
Computing the Z-score. Basic principle: the moments of the feature counts are propagated through the DP table cell by cell. Mean values are obtained as above; variances are computed by centering the second-order moments, i.e. Var(x) = E[x²] − (E[x])².
IPSAP. Inverse Parametric Sequence Alignment Problem (IPSAP): given a training set of pairwise global alignments, learn the parameters a such that the given alignments have the best scores among all possible alignments. Training set T = {(S1^i, S2^i, A^i)}, i = 1…ℓ, with feature vectors x̂_i of the given alignments. Find a s.t. a^T x̂_i ≥ a^T x for every feasible feature vector x of the pair (S1^i, S2^i) and every i: an exponential number of linear constraints. Iterative approaches: linear programming [Kececioglu and Kim 06], max margin [Joachims et al. 05].
Z-score maximization. Idea: a global objective function, more naturally suited to non-separable cases. Maximizing the Z-score pushes the score of the given alignment as far above the mean as possible, i.e. it (approximately) minimizes the number of alignments with a score higher than the given one. [Figure: distribution of alignment scores with mean m and standard deviation s.]
Z-score maximization. Z-score of a training set: the per-pair moments are aggregated over the ℓ training pairs (b* = Σ b_i, C* = Σ C_i, cf. the algorithm below) and the resulting Z-score is maximized. This is a convex optimization problem, equivalent to a QP, and at its solution most of the linear constraints are already satisfied.
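The slides do not reproduce the optimization problem itself. One plausible reading, consistent with the Z-score definition and with the aggregates b*, C* used in the algorithm below, is to maximize a^T d / sqrt(a^T C* a) with d = Σ_i (x̂_i − b_i); that ratio is scale invariant, so its maximization is equivalent to the QP "minimize a^T C* a subject to a^T d = 1" and admits the closed-form solution a ∝ C*⁻¹ d. A small numerical sketch of that closed form, with made-up aggregate values:

```python
import numpy as np

# Toy aggregate quantities for the 3-parameter model (made-up numbers):
# d = sum_i (xhat_i - b_i), C = sum_i C_i.
d = np.array([3.0, -2.0, -4.0])
C = np.array([[2.0, 0.3, 0.1],
              [0.3, 1.5, 0.2],
              [0.1, 0.2, 3.0]])

# Unconstrained maximizer of the ratio a^T d / sqrt(a^T C a), up to scale.
a = np.linalg.solve(C, d)

def zscore(a, d, C):
    return (a @ d) / np.sqrt(a @ C @ a)

# The closed form beats randomly drawn parameter vectors.
rng = np.random.default_rng(0)
best_random = max(zscore(rng.standard_normal(3), d, C) for _ in range(10_000))
print(zscore(a, d, C), best_random)
```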
Iterative algorithm. Impose the violated constraints explicitly; each step is again a convex optimization problem, iterated until no new constraints appear. If needed, relax the constraints (e.g. add slack variables for non-separable problems).
Iterative algorithm INPUT: training set T 1: C ← ø 2: Compute bi, Ci for all i=1…ℓ 3: Compute b*=sum(bi), C*=sum(Ci) 4: Find a solving QP. 5: Repeat 6: for i=1…ℓ do 7: Compute xi’=argmaxx f (Si1, Si2, Ai) 8: if aTxi’> aT 9: C ← C U { aT ( -xi’)> 0 } 10: Find a solving QP s.t. C 11: endif 12: endfor 13: until C is not changed in during the current iteration. Moments computation Z-score maximization Constrained Identify the most violated constraint
Experimental results. [Figures: test error as a function of the training set size; distribution of correctly reconstructed alignments as a function of the number of additional constraints.]
Experimental results. Experiments with no added constraints. Test error as a function of the training set size:

Training set size |   5   |  10   |  20   |  50   |  100
Z-score           | 78.6  | 62.85 | 44.6  | 36.7  | 30.84
Generative        | 96.4  | 94.39 | 87.12 | 45.31 | 31.05

[Figure: given and computed substitution matrices.]
Experimental results. Real amino-acid sequences: 5 multiple alignments from the PALI database of structural protein alignments. Error rates and number of added constraints (in parentheses):

Dataset | 4-param Training error | 4-param Test error | 212-param Training error | 212-param Test error
nad     | 4.95 (5)               | 6.46               | 567.46 (21)              | 703.12
kun     | 1.46 (12)              | 0.95               | 386.46 (21)              | 457.3
box     | 1 (3)                  | 1.13               | 211.3 (12)               | 256.7
sir     | 1 (10)                 | 1.16               | 236 (36)                 | 301.44
pec     | 46.2 (8)               | 76.1               | 835.12 (31)              | 1054.12
Summary. A new method for IPSAP: accurate and fast (few additional constraints are required); easy to implement (DP for computing the moments plus a simple convex optimization problem); mean and variance computations are parallelizable for large training sets. Future work: approximate moment estimation with sampling techniques; possible extension to other problems, such as sequence labeling learning and sequence parse learning with context-free grammars.