Optimization of a New Score Function for the Detection of Remote Homologs, Kann et al.
Introduction
New method to calculate a score function, aiming to optimize the ability to discriminate between homologs and non-homologs
Existing software uses the following to compute an alignment score:
– Number of times AA i is aligned with AA j
– Number of gaps in the alignment
– Number of residues in each gap beyond one
Score function / substitution matrix:
– Contribution to score for AA match/mismatch
– Contribution to score for gap initialization
– Contribution to score for gap extension
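The way these counts and contributions combine can be sketched in a few lines of Python. This is a minimal illustration of the usual affine-gap alignment score, not code from the paper: the names `TOY_SUB`, `pair_score` and `alignment_score` are hypothetical, and the substitution values are made up for the example.

```python
# Toy substitution scores for illustration only (not BLOSUM or PAM values).
TOY_SUB = {("A", "A"): 4, ("G", "G"): 6, ("W", "W"): 11,
           ("A", "G"): 0, ("A", "W"): -3, ("G", "W"): -2}


def pair_score(sub, a, b):
    """Look up a symmetric substitution score for residues a and b."""
    return sub[(a, b)] if (a, b) in sub else sub[(b, a)]


def alignment_score(aligned_a, aligned_b, sub, gap_open, gap_extend):
    """Score a gapped alignment given as two equal-length strings ('-' = gap).

    The first residue of each gap costs gap_open (gap initialization);
    every further residue in the same gap costs gap_extend.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    prev_gap = None  # which sequence carried a gap in the previous column
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            score -= gap_extend if side == prev_gap else gap_open
            prev_gap = side
        else:
            score += pair_score(sub, a, b)
            prev_gap = None
    return score


# Three matches (A, G, A) and one gap of length two in the second sequence.
example = alignment_score("AGWWA", "AG--A", TOY_SUB, gap_open=10, gap_extend=1)
```

Here the gap penalties are passed as positive numbers and subtracted, which is one common convention; a score function could equally store them as negative contributions.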
Current Methods to Calculate Homology
p(S_r > x): probability that a random pair of proteins of the same length would score higher than x, the score of the given pair
E: expected number of random proteins in the database that would have at least that score
P: probability that there is at least one random pair with a higher score
As p(S_r > x), E, and P increase, the likelihood that the given pair is homologous decreases
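The relation between the three quantities can be illustrated with a short sketch. It rests on two assumptions that are standard in database searching but are not stated on the slide: E is approximated as the per-comparison probability times the number of comparisons, and the number of random hits is Poisson distributed, so P = 1 - exp(-E). The function names are hypothetical.

```python
import math


def expected_hits(p_random, n_comparisons):
    """E: expected number of random hits scoring above the threshold.

    Assumes every comparison has the same probability p_random.
    """
    return p_random * n_comparisons


def prob_at_least_one(e_value):
    """P: probability of at least one random hit, assuming Poisson counts."""
    # expm1 keeps precision when e_value is very small.
    return -math.expm1(-e_value)


e_value = expected_hits(1e-6, 100_000)
p_value = prob_at_least_one(e_value)
```

For small E the two values are nearly equal, which is why E and P are often used interchangeably for strong hits.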
Current Score Matrices
PAM (percent accepted mutations) – Dayhoff
GCB, JTT: PAM-style matrices built from larger sequence datasets
BLOSUM62 – Henikoff & Henikoff, constructed using a dataset of aligned sequence blocks
STR – derived from protein sequences aligned based on their observed structures
Limitations of Current Score Functions
Current score functions assume independent evolution of each location, overlooking correlations
Score functions are derived from a database of properly aligned proteins, not from alignments between random sequences
Gap penalties are set a priori
Theory
Z score for alignment:
– Characterize the significance of an alignment score by calculating the likelihood that this score or higher would be obtained by a random match
– Account for variations in E with the length of the proteins
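One common way to estimate a Z score is to compare the observed score with scores obtained after shuffling one of the sequences. The sketch below shows that generic recipe; it is not necessarily the definition used in the paper. The names `z_score`, `shuffled_scores` and the toy scoring function `identity_count` are hypothetical.

```python
import random
import statistics


def z_score(observed, random_scores):
    """Number of standard deviations the observed score lies above the random mean."""
    mean = statistics.mean(random_scores)
    spread = statistics.stdev(random_scores)  # needs at least two, non-identical scores
    return (observed - mean) / spread


def shuffled_scores(seq_a, seq_b, score_fn, n_shuffles=100, seed=0):
    """Score seq_a against shuffled copies of seq_b.

    Shuffling keeps the length and composition of seq_b, so the random
    background is matched to the proteins being compared.
    """
    rng = random.Random(seed)
    residues = list(seq_b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(score_fn(seq_a, "".join(residues)))
    return scores


def identity_count(seq_a, seq_b):
    """Toy stand-in for an alignment score: identical residues at equal positions."""
    return sum(a == b for a, b in zip(seq_a, seq_b))


background = shuffled_scores("MKVLAAGIVG", "MKVLSAGIVA", identity_count, n_shuffles=200)
z = z_score(identity_count("MKVLAAGIVG", "MKVLSAGIVA"), background)
```

In practice `score_fn` would be a full alignment routine rather than the toy position-by-position comparison used here.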
Theory
Score function optimized by maximizing the confidence over the training set
Avoids dependence on extreme E values (easily detected or overly distant homologies)
Eliminates contribution of falsely identified homologies (overly distant)
Database Preparation
Use a set of known homologs whose homology cannot be reliably determined with standard pairwise comparison, in order to optimize the score function for detection of distant homologs
Training set: 900 pairs of proteins in the same COG with < 25% sequence identity
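The identity cutoff can be illustrated with a small filter. This is a sketch under assumptions: identity is computed over aligned (non-gap) columns, which is only one of several definitions in use, and the names `percent_identity` and `select_distant_pairs` are hypothetical. The slide does not say how identity was measured.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned (non-gap) columns holding identical residues."""
    matches = 0
    aligned = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are ignored in this definition
        aligned += 1
        if a == b:
            matches += 1
    return 100.0 * matches / aligned if aligned else 0.0


def select_distant_pairs(aligned_pairs, max_identity=25.0):
    """Keep only the pairs whose sequence identity is below the cutoff."""
    return [pair for pair in aligned_pairs
            if percent_identity(pair[0], pair[1]) < max_identity]


pairs = [("AGWA", "ACWC"),          # 2 of 4 columns identical
         ("AGWAKLMN", "CDEAFGHI")]  # 1 of 8 columns identical
distant = select_distant_pairs(pairs)
```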
Optimization of Score Function
Align using the BLOSUM62 matrix
Calculate Z and C for each pair of homologs, then average over the pairs in the training set
Generate initial alignments using the gap penalties that yielded the highest C values
~10 cycles of optimization and realignment until the score function converged
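The alternation between realignment and optimization can be sketched as a generic loop. Everything here is an illustrative assumption: `align` and `refit` are hypothetical callables standing in for the real alignment and parameter-update steps, and the toy versions below only demonstrate that the loop stops once the parameters stop changing.

```python
def optimize_score_function(params, align, refit, max_cycles=10, tol=1e-6):
    """Alternate realignment and parameter refitting until the parameters settle.

    align(params) -> alignments produced with the current score function
    refit(params, alignments) -> (new_params, average_confidence)
    """
    history = []
    for _ in range(max_cycles):
        alignments = align(params)
        new_params, confidence = refit(params, alignments)
        history.append(confidence)
        change = max(abs(n - o) for n, o in zip(new_params, params))
        params = new_params
        if change < tol:
            break  # score function has converged
    return params, history


# Toy stand-ins: the "alignment" is just the parameters themselves, and
# refitting moves each parameter halfway towards a fixed target.
TARGET = [1.0, 2.0]


def toy_align(params):
    return list(params)


def toy_refit(params, alignments):
    new_params = [(p + t) / 2 for p, t in zip(alignments, TARGET)]
    confidence = -sum((p - t) ** 2 for p, t in zip(new_params, TARGET))
    return new_params, confidence


final_params, confidence_history = optimize_score_function(
    [0.0, 0.0], toy_align, toy_refit, max_cycles=50)
```

The fixed-alignment refit followed by realignment mirrors the cycle described on the slide; the convergence test on parameter change is one reasonable stopping rule among several.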
Results
Small changes in gap penalties: most of the improvement comes from refinements of the substitution matrix
OPTIMA: resulting score function
– has significantly improved average confidence value compared with other score matrices
– <p(S_r > x)>, significantly decreased
Summary
Aim: optimize score matrix to discriminate between homologs and non-homologs
OPTIMA score function: more successful at discriminating between homologs and non-homologs compared with standard score matrices
Gap penalties treated as additional parameters to be optimized