Protein String Encoding Student: Logan Everett Mentor: Endre Boros Funded by DIMACS REU 2004.


1 Protein String Encoding
Student: Logan Everett
Mentor: Endre Boros
Funded by DIMACS REU 2004

2 Project Overview
Model Biological Scoring Matrices
Weighted Binary Hamming Space
Optimize Using Linear Programming

3 Biological Data
Gene Databases: 4-Character Alphabet {A,G,C,T}
Protein Databases: 20-Character Alphabet
Goal: Match Similar Strings

4 String Distance
Example strings: AGCTAGCT and ACCTAGTT
Simple: Hamming Distance, D = 2
Real: Scoring Matrix, (G,C) = 2 and (C,T) = 1, D = 3
Goal: Encode to Binary Strings
Hamming Distance Should Match Real
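The two distances on this slide can be reproduced with a short sketch. The `SCORES` table below is a hypothetical partial scoring matrix holding only the two pair costs the slide gives; the function names are illustrative, not from the presentation.

```python
def hamming_distance(s, t):
    """Count the positions where two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s, t))


# Hypothetical partial scoring matrix: only the pair costs shown on the slide.
SCORES = {frozenset("GC"): 2, frozenset("CT"): 1}


def scored_distance(s, t, scores):
    """Sum the scoring-matrix cost of every mismatched position."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(scores[frozenset((a, b))] for a, b in zip(s, t) if a != b)


s, t = "AGCTAGCT", "ACCTAGTT"
print(hamming_distance(s, t))         # 2
print(scored_distance(s, t, SCORES))  # 3
```

The strings differ at two positions (G/C and C/T), so the plain Hamming distance is 2 while the scored distance is 2 + 1 = 3.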

5 Building Encoding Model
Example: Genes
4 Letters to Encode
Same Length N (Arbitrary)
Arrange in Columns
Row = Cross Section
A = 10101010…
G = 01101010…
C = 00110100…
T = 10011100…
A G C T
1 0 0 1
0 1 0 0
1 1 1 0
0 0 1 1
1 1 0 1
0 0 1 1
1 1 0 0
0 0 0 0
… … … …

6 Equivalent Encoding Schemes
Simplify Encoding
Hamming Distances Based on Rows
Can Switch All Values in a Row
Can Ignore Uniform Rows
Can Reorder
AGCT    AGCT    AGCT
1001    0110    0001
0100    0100    0010
1110    0001    0011
0011    0011    0011
1101    0010    0011
0011    0011    0100
1100    0011    0110
0000    0000    …
…       …
(A,G) = 1
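A minimal sketch of the three simplification steps, checking that they leave every pairwise Hamming distance unchanged. It assumes one particular normalization (flip any row whose A bit is 1), which matches the tables on the slide; `simplify` and `pair_distances` are illustrative names.

```python
from itertools import combinations

LETTERS = "AGCT"
# Cross-section rows from the slide; each row holds one bit per letter (A, G, C, T).
ROWS = ["1001", "0100", "1110", "0011", "1101", "0011", "1100", "0000"]


def pair_distances(rows):
    """Hamming distance between every pair of letter codewords (columns)."""
    return {
        LETTERS[i] + LETTERS[j]: sum(r[i] != r[j] for r in rows)
        for i, j in combinations(range(len(LETTERS)), 2)
    }


def simplify(rows):
    """Flip rows so the A bit is 0, drop uniform rows, then sort."""
    flip = str.maketrans("01", "10")
    flipped = [r.translate(flip) if r[0] == "1" else r for r in rows]
    return sorted(r for r in flipped if len(set(r)) > 1)


simple = simplify(ROWS)
print(simple)
# ['0001', '0010', '0011', '0011', '0011', '0100', '0110']
print(pair_distances(ROWS) == pair_distances(simple))  # True
```

Flipping a row swaps 0 and 1 for every letter at once, so two letters that differed in that row still differ, and a uniform row never separates any pair.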

7 Weight Vector Representation
Count Identical Rows
Store Counts in Weight Vector
(Weighted Hamming Space)
AGCT
0001    y1 = 1
0010    y2 = 1
0011 }
0011 }  y3 = 3
0011 }
0100    y4 = 1
0110    y6 = 1
…
y = (y1, y2, y3, y4, y5, y6, y7)
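The counting step can be sketched as follows. The sketch assumes that the index k of each weight is the binary value of the row pattern (so 0011 maps to y3 and 0110 to y6), which is consistent with the counts shown on the slide; `weight_vector` is an illustrative name.

```python
from collections import Counter

# Simplified cross-section rows from the slide (bit order A, G, C, T; A bit is 0).
ROWS = ["0001", "0010", "0011", "0011", "0011", "0100", "0110"]


def weight_vector(rows, n_letters=4):
    """Count identical rows; entry k-1 is the count of the row whose binary value is k."""
    counts = Counter(int(r, 2) for r in rows)
    n_patterns = 2 ** (n_letters - 1) - 1  # 7 non-uniform patterns with a leading 0
    return [counts.get(k, 0) for k in range(1, n_patterns + 1)]


y = weight_vector(ROWS)
print(y)  # [1, 1, 3, 1, 0, 1, 0]
```

Patterns that never occur (here 0101 and 0111) simply get weight 0, so the vector always has 7 entries for a 4-letter alphabet.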

8 Hamming Distance Matrix
Rows: Unique Pairs
Columns: Encoding Cross Sections
Cells: 1 = Contributes to Hamming Distance, 0 = Does Not Contribute

9 Hamming Distance Matrix
Cross section TCGA = 1000: A and G are the Same, so the (A,G) cell is 0
Cross section TCGA = 1010: A and G are Different, so the (A,G) cell is 1
        1 2 3 4 5 6 7
(A,G)   0 0 0 1 1 1 1
(A,C)   0 1 1 0 0 1 1
(A,T)   1 0 1 0 1 0 1
(G,C)   0 1 1 1 1 0 0
(G,T)   1 0 1 1 0 1 0
(C,T)   1 1 0 0 1 1 0
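The matrix above can be generated rather than filled in by hand. This sketch assumes column k is the 4-bit binary pattern of k in A, G, C, T order, which reproduces the slide's matrix; the function name is illustrative.

```python
from itertools import combinations


def hamming_distance_matrix(letters="AGCT"):
    """Build B: one row per letter pair, one column per non-uniform row pattern."""
    n = len(letters)
    # Patterns 1 .. 2^(n-1) - 1 written as n-bit strings, so the first bit is always 0.
    patterns = [format(k, f"0{n}b") for k in range(1, 2 ** (n - 1))]
    pairs = list(combinations(range(n), 2))
    # A cell is 1 when the two letters get different bits in that pattern.
    B = [[int(p[i] != p[j]) for p in patterns] for i, j in pairs]
    labels = [letters[i] + letters[j] for i, j in pairs]
    return labels, B


labels, B = hamming_distance_matrix()
for label, row in zip(labels, B):
    print(label, "".join(map(str, row)))
# AG 0001111
# AC 0110011
# AT 1010101
# GC 0111100
# GT 1011010
# CT 1100110
```

For a 20-letter protein alphabet the same construction gives 190 rows and 2^19 - 1 columns, so the matrix grows quickly with alphabet size.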

10 Our Formula
B = Hamming Distance Matrix
y = Weight Vector
D = Real Distance Vector
By = Model Hamming Distances
Goal: λBy = D (λ = Scaling Factor)

11 Our Formula
Must Allow For Distortion
D(1 – ε) <= λBy <= D(1 + ε)
Scaled Weight Vector x = λy
D(1 – ε) <= B(λy) <= D(1 + ε)
D(1 – ε) <= Bx <= D(1 + ε)
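For a fixed scaled weight vector x, the smallest distortion that satisfies the two-sided bound can be computed directly. In this sketch the distortion is called `eps` (an assumed name), `B` is the 4-letter Hamming distance matrix from the earlier slide, and `D` is a hypothetical real distance vector invented for illustration, not data from the presentation.

```python
# Hamming distance matrix for pairs AG, AC, AT, GC, GT, CT (from the slides).
B = [[int(c) for c in row] for row in (
    "0001111", "0110011", "1010101", "0111100", "1011010", "1100110")]


def model_distances(B, x):
    """Compute Bx: the weighted Hamming distance for every letter pair."""
    return [sum(b * w for b, w in zip(row, x)) for row in B]


def min_distortion(B, x, D):
    """Smallest eps with D_i(1 - eps) <= (Bx)_i <= D_i(1 + eps); assumes every D_i > 0."""
    return max(abs(m - d) / d for m, d in zip(model_distances(B, x), D))


x = [1, 1, 3, 1, 0, 1, 0]   # weight vector from the slides, with scale factor 1
D = [2, 4, 4, 5, 6, 3]      # hypothetical real distances

print(model_distances(B, x))    # [2, 5, 4, 5, 6, 3]
print(min_distortion(B, x, D))  # 0.25
```

Only the (A,C) pair misses its target here (5 against 4), so it alone sets the distortion at 1/4.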

12 Linear Programming Problem
D(1 – ε) <= Bx and Bx <= D(1 + ε)
-Bx – Dε <= -D and Bx – Dε <= D
All x_i, ε >= 0
Goal: Minimize ε
Solve with CPLEX
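The constraints above can be assembled into the standard "minimize c·z subject to A z <= b, z >= 0" form that LP solvers accept. This is a minimal pure-Python sketch, not the presenters' CPLEX model: the variable vector z stacks the scaled weights with the distortion last, the names `c`, `A_ub`, `b_ub` and `eps` are assumptions, and `D` is a hypothetical distance vector used only for illustration.

```python
def build_lp(B, D):
    """Return (c, A_ub, b_ub) for variables z = (x_1, ..., x_n, eps)."""
    n = len(B[0])
    c = [0.0] * n + [1.0]               # objective: minimize eps only
    A_ub, b_ub = [], []
    for row, d in zip(B, D):
        A_ub.append([-v for v in row] + [-d])   # -Bx - D*eps <= -D
        b_ub.append(-d)
        A_ub.append(list(row) + [-d])           #  Bx - D*eps <=  D
        b_ub.append(d)
    return c, A_ub, b_ub


def is_feasible(A_ub, b_ub, z, tol=1e-9):
    """Check z >= 0 and A_ub z <= b_ub within a small tolerance."""
    if any(v < -tol for v in z):
        return False
    return all(
        sum(a * v for a, v in zip(row, z)) <= b + tol
        for row, b in zip(A_ub, b_ub)
    )


B = [[int(ch) for ch in row] for row in (
    "0001111", "0110011", "1010101", "0111100", "1011010", "1100110")]
D = [2, 4, 4, 5, 6, 3]                  # hypothetical real distances
c, A_ub, b_ub = build_lp(B, D)

x = [1, 1, 3, 1, 0, 1, 0]
print(is_feasible(A_ub, b_ub, x + [0.25]))  # True
print(is_feasible(A_ub, b_ub, x + [0.2]))   # False
```

The feasibility check only tests one candidate point; an LP solver would search over x and the distortion together and may find a smaller value than 0.25 for this D.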

13 Linear Programming Solution
Solution Contains:
Min Value of ε
Scaled Weight Vector x

14 Courtesy of DIMACS
Mentor: Endre Boros – RUTCOR
Logan Everett – DIMACS REU 2004

