Protein Encoding Optimization. Student: Logan Everett. Mentor: Endre Boros. Funded by DIMACS REU 2004.
Project Overview: Model Biological Scoring Matrices; Weighted Binary Hamming Space; Optimize Using Linear Programming; Accurate Random Generation.
Scoring Matrices: [scoring matrix figure; residue labels A Q M K R H… and A R M I F L…; sample entries –3 –3 –3…]
Encode To Binary Strings: Hamming Distances Easy to Approximate on Binary Strings; Statistically Proven Methods; More Efficient. How Do Similarity and Distance Relate? Inverse Relationship. First Create “Real” Distance Vector: D.
Precise Problem: Distortion. D_ij (1 − ε) ≤ λ·h[i, j] ≤ D_ij (1 + ε) ∀ unique pairs i, j (nC2 of them), s.t. 0 ≤ ε ≤ 1 and λ ≥ 0.
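The distortion condition can be checked directly for a small encoding. The sketch below is illustrative only: the names `hamming`, `distortion`, `codes`, `dist`, and `scale` are assumptions made here, and the three-symbol data is invented. It returns the smallest distortion bound for which every scaled Hamming distance stays within the band around its target distance.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("encodings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def distortion(codes, dist, scale):
    """Smallest eps with D_ij(1 - eps) <= scale * h[i, j] <= D_ij(1 + eps).

    Assumes every target distance D_ij is strictly positive.
    """
    eps = 0.0
    for i, j in combinations(sorted(codes), 2):
        d = dist[(i, j)]
        h = scale * hamming(codes[i], codes[j])
        # Relative deviation of the scaled Hamming distance from the target.
        eps = max(eps, abs(h - d) / d)
    return eps


# Hypothetical toy data: three symbols, 2-bit codes, made-up target distances.
codes = {"A": "00", "B": "01", "C": "11"}
dist = {("A", "B"): 1.0, ("A", "C"): 2.0, ("B", "C"): 1.25}
print(distortion(codes, dist, scale=1.0))
```

Here only the pair (B, C) deviates from its target, so it alone determines the returned value.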
Encoding Scheme as Vector C = S = T = P = A = G = y2y1y2y1
Modified Inequality: D(1 − ε) ≤ Ax ≤ D(1 + ε), s.t. 0 ≤ ε ≤ 1 and λ ≥ 0. Let x = λy.
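One way to read the product Ax is as a vector of weighted Hamming distances, one per unique pair. The sketch below assumes (the slides do not spell this out) that A has a row per pair and a column per cross section, with a 1 wherever the cross section gives the two items different bits. That reading matches the 2^(n-1) variable count on the problem-size slide. The names `cut_matrix` and `weighted_distances` are hypothetical.

```python
from itertools import combinations


def cut_matrix(n):
    """Rows: unique pairs (i, j). Columns: the 2**(n-1) cross sections.

    A cross section assigns one bit to each of the n items. Item 0 is
    fixed to 0, because flipping every bit yields the same split.
    An entry is 1 when the cross section gives i and j different bits.
    """
    columns = []
    for mask in range(2 ** (n - 1)):
        # Bit k of mask is the bit assigned to item k + 1.
        columns.append([0] + [(mask >> k) & 1 for k in range(n - 1)])
    return [
        [int(col[i] != col[j]) for col in columns]
        for i, j in combinations(range(n), 2)
    ]


def weighted_distances(A, x):
    """Compute A x: the weighted Hamming distance for every pair."""
    return [sum(a * w for a, w in zip(row, x)) for row in A]


A = cut_matrix(3)  # 3 pairs, 4 cross sections
print(A)
print(weighted_distances(A, [0, 1, 2, 0.5]))
```

The first column is the trivial cross section that separates nothing, so its weight never affects any distance.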
Linear Programming Problem: Need All Linear Expressions. D(1 − ε) ≤ Ax and Ax ≤ D(1 + ε) become −Ax − εD ≤ −D and Ax − εD ≤ D. All x_i ≥ 0. Goal: Minimize ε. Solve with CPLEX.
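The two linear inequalities per pair can be assembled into rows of a standard "rows · variables ≤ rhs" system, with the distortion bound appended as the last variable. The sketch below only builds and checks the constraints; it does not solve the program (the slides use CPLEX for that). The helper names `lp_rows` and `is_feasible` and the toy values of `A` and `D` are assumptions for illustration.

```python
def lp_rows(A, D):
    """Return (rows, rhs) such that rows . (x_1, ..., x_m, eps) <= rhs."""
    rows, rhs = [], []
    for a, d in zip(A, D):
        rows.append([-v for v in a] + [-d])  # -Ax - eps*D <= -D
        rhs.append(-d)
        rows.append(list(a) + [-d])          #  Ax - eps*D <=  D
        rhs.append(d)
    return rows, rhs


def is_feasible(rows, rhs, point, tol=1e-9):
    """Check x >= 0, 0 <= eps <= 1, and every constraint row."""
    *x, eps = point
    if any(v < -tol for v in x) or not (-tol <= eps <= 1 + tol):
        return False
    return all(
        sum(c * v for c, v in zip(row, point)) <= b + tol
        for row, b in zip(rows, rhs)
    )


# Hypothetical toy instance: 3 items, 4 cross sections, made-up distances.
A = [[0, 1, 0, 1], [0, 0, 1, 1], [0, 1, 1, 0]]
D = [1.0, 2.0, 3.0]
rows, rhs = lp_rows(A, D)

print(is_feasible(rows, rhs, [0, 1, 2, 0, 0.0]))  # weights reproduce D exactly
print(is_feasible(rows, rhs, [0, 1, 1, 0, 0.0]))  # distances off, no slack
print(is_feasible(rows, rhs, [0, 1, 1, 0, 0.5]))  # same weights, eps = 0.5
```

Minimizing the last variable over this feasible region, with any LP solver, gives the minimum distortion for the instance.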
Problem Size: Number of Constraints (Rows): 2(nC2) = 380; Number of Variables (Columns): 2^(n−1) = 524,288; Total Size: approx. 2×10^8; CPLEX: approx. 1 Minute.
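The counts above follow from n = 20, the value implied by 380 rows and 524,288 columns. A short check of the arithmetic:

```python
from math import comb

n = 20                    # implied by the row and column counts above
rows = 2 * comb(n, 2)     # two inequalities per unique pair
cols = 2 ** (n - 1)       # one variable per cross section
print(rows, cols, rows * cols)  # 380 524288 199229440
```

The product, 199,229,440 matrix entries, is the "approx. 2×10^8" total size.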
Linear Programming Solution. Solution Contains: Min Value of ε; Scaled Weight Vector x; Non-Integral Values in x. Convert to p Vector: X = Σ x_i, p_i = x_i / X.
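The conversion to the p vector is a normalization: divide each weight by the sum of all weights so that the entries add up to 1. A minimal sketch, with a hypothetical helper name and made-up weights:

```python
def to_probabilities(x):
    """Return (X, p) where X is the sum of x and p[i] = x[i] / X."""
    total = sum(x)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return total, [v / total for v in x]


# Hypothetical LP weights for four cross sections.
X, p = to_probabilities([0.0, 1.0, 2.0, 1.0])
print(X, p)  # 4.0 [0.0, 0.25, 0.5, 0.25]
```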
Random Encodings: Randomly Select Cross Sections Based on Percent Weights; Can Scale For Any N-Length Encoding; Longer Encodings Should Approach Minimum Distortion.
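The random generation step can be sketched as weighted sampling with replacement: each of the N positions in the encoding is one randomly drawn cross section, and an item's bit at that position is the bit the cross section assigns to it. The function name `random_encoding`, the toy `columns`, and the weights `p` below are assumptions for illustration, not the project's actual data.

```python
import random


def random_encoding(columns, p, length, rng):
    """Draw `length` cross sections with probabilities p.

    columns[c][i] is the bit that cross section c assigns to item i.
    Returns one bit string of the requested length per item.
    """
    picks = rng.choices(columns, weights=p, k=length)
    n_items = len(columns[0])
    return ["".join(str(col[i]) for col in picks) for i in range(n_items)]


# Hypothetical toy setup: 3 items, 4 cross sections, made-up weights.
columns = [[0, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1]]
p = [0.0, 0.25, 0.5, 0.25]
codes = random_encoding(columns, p, 16, random.Random(0))
print(codes)
```

Under this sampling scheme, the expected Hamming distance between two items is N times the total probability of the cross sections that separate them, which is why longer encodings should track the weighted distances more closely.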
Results
Courtesy of DIMACS. Mentor: Endre Boros – RUTCOR. Logan Everett – DIMACS REU 2004.