Protein String Encoding
Student: Logan Everett
Mentor: Endre Boros
Funded by DIMACS REU 2004

Project Overview
- Model Biological Scoring Matrices
- Weighted Binary Hamming Space
- Optimize Using Linear Programming

Biological Data
- Gene Databases
  - 4-Character Alphabet: {A,G,C,T}
- Protein Databases
  - 20-Character Alphabet
- Goal: Match Similar Strings

String Distance
- Simple: Hamming Distance
    AGCTAGCT
    ACCTAGTT
    D = 2
- Real: Scoring Matrix
    (G,C) = 2
    (C,T) = 1
    D = 3
- Goal: Encode to Binary Strings
- Hamming Distance Should Match Real
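The two distances on this slide can be reproduced with a minimal sketch. It assumes that mismatch costs simply add up across positions, and the `SCORES` table is hypothetical: it holds only the two pairs quoted on the slide, not a full scoring matrix.

```python
# Hypothetical partial scoring matrix: only the two pairs quoted on the slide.
SCORES = {frozenset("GC"): 2, frozenset("CT"): 1}


def hamming_distance(s, t):
    """Count the positions where two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(s, t))


def scored_distance(s, t, scores):
    """Sum the scoring-matrix cost of every mismatched position."""
    if len(s) != len(t):
        raise ValueError("strings must have the same length")
    return sum(scores[frozenset((a, b))] for a, b in zip(s, t) if a != b)


s, t = "AGCTAGCT", "ACCTAGTT"
print(hamming_distance(s, t))         # 2
print(scored_distance(s, t, SCORES))  # 3
```

The strings differ at two positions, a (G,C) mismatch and a (C,T) mismatch, so the plain Hamming distance is 2 while the scored distance is 2 + 1 = 3.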

Building Encoding Model
- Example: Genes
- 4 Letters to Encode
- Same Length N (Arbitrary)
- Arrange in Columns
- Row = Cross Section
    A = 10101010…
    G = 01101010…
    C = 00110100…
    T = 10011100…
    AGCT
    1001
    0100
    1110
    0011
    1101
    0011
    1100
    0000
    ……
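Arranging the codes in columns and reading off rows amounts to a transpose. The sketch below uses only the first eight bits of each code shown on the slide; the function name `cross_sections` is an illustrative choice, not something from the presentation.

```python
# First eight bits of each letter's code, as shown on the slide.
CODES = {"A": "10101010", "G": "01101010", "C": "00110100", "T": "10011100"}


def cross_sections(codes, order="AGCT"):
    """Place the codes side by side as columns and return the rows."""
    columns = [codes[letter] for letter in order]
    return ["".join(bits) for bits in zip(*columns)]


rows = cross_sections(CODES)
print(rows)
# ['1001', '0100', '1110', '0011', '1101', '0011', '1100', '0000']
```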

Equivalent Encoding Schemes
- Simplify Encoding
- Hamming Distances Based on Rows
- Can Switch All Values in a Row
- Can Ignore Uniform Rows
- Can Reorder
    AGCT    AGCT    AGCT
    1001    0110    0001
    0100    0100    0010
    1110    0001    0011
    0011    0011    0011
    1101    0010    0011
    0011    0011    0100
    1100    0011    0110
    0000    0000    ……
    ……      ……
    (A,G) = 1
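The three simplifications can be sketched as one normalization pass. The convention of switching a row whenever its first bit (the A column) is 1 is an assumption here; it is consistent with the example tables on the slide, but the slide does not state it as a rule.

```python
def normalize_rows(rows):
    """Apply the three equivalences: switch, drop uniform rows, reorder."""
    normalized = []
    for row in rows:
        if row[0] == "1":
            # Switching every bit in a row leaves all pairwise distances unchanged.
            row = "".join("1" if bit == "0" else "0" for bit in row)
        if "1" in row:
            # After switching, a uniform row is all zeros and adds no distance.
            normalized.append(row)
    return sorted(normalized)


rows = ["1001", "0100", "1110", "0011", "1101", "0011", "1100", "0000"]
print(normalize_rows(rows))
# ['0001', '0010', '0011', '0011', '0011', '0100', '0110']
```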

Weight Vector Representation
- Count Identical Rows
- Store Counts in Weight Vector (Weighted Hamming Space)
    AGCT
    0001      y_1 = 1
    0010      y_2 = 1
    0011 }
    0011 }    y_3 = 3
    0011 }
    0100      y_4 = 1
    0110      y_6 = 1
    ……
    y = (y_1, y_2, y_3, y_4, y_5, y_6, y_7)
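Counting identical rows can be sketched as follows. The indexing is an assumption read off the slide's labels: a row is treated as a binary number in AGCT order, so 0011 counts toward y_3. The input is assumed to be already simplified (leading bit 0, no all-zero rows).

```python
from collections import Counter


def weight_vector(rows):
    """Count identical rows; entry k-1 holds the count of the row whose binary value is k."""
    width = len(rows[0])
    counts = Counter(int(row, 2) for row in rows)
    # With the leading bit fixed at 0 there are 2**(width - 1) - 1 nonzero patterns.
    return [counts[k] for k in range(1, 2 ** (width - 1))]


rows = ["0001", "0010", "0011", "0011", "0011", "0100", "0110"]
print(weight_vector(rows))
# [1, 1, 3, 1, 0, 1, 0]
```

Patterns that never occur among the rows shown, here 0101 and 0111, get weight 0.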

Hamming Distance Matrix
- Rows: Unique Pairs
- Columns: Encoding Cross Sections
- Cells: 1 = Contributes to Hamming Distance, 0 = Does Not Contribute

Hamming Distance Matrix
- Cross Section TCGA = 1000: A and G Same, (A,G) Cell = 0
- Cross Section TCGA = 1010: A and G Different, (A,G) Cell = 1
             1 2 3 4 5 6 7
    (A,G)    0 0 0 1 1 1 1
    (A,C)    0 1 1 0 0 1 1
    (A,T)    1 0 1 0 1 0 1
    (G,C)    0 1 1 1 1 0 0
    (G,T)    1 0 1 1 0 1 0
    (C,T)    1 1 0 0 1 1 0
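The matrix can be generated rather than filled in by hand. This sketch assumes column k is the cross section whose bits, read in AGCT order, form the binary number k, which reproduces the matrix on the slide. The function name `hamming_matrix` is illustrative.

```python
from itertools import combinations


def hamming_matrix(letters="AGCT"):
    """Return pair labels and the 0/1 matrix of which cross sections separate each pair."""
    n = len(letters)
    pairs = list(combinations(range(n), 2))
    # Cross sections with the first letter's bit fixed at 0, all-zero row excluded.
    patterns = [format(k, "0{}b".format(n)) for k in range(1, 2 ** (n - 1))]
    # A cell is 1 when the two letters get different bits in that cross section.
    matrix = [[int(p[i] != p[j]) for p in patterns] for i, j in pairs]
    labels = [(letters[i], letters[j]) for i, j in pairs]
    return labels, matrix


labels, B = hamming_matrix()
for label, row in zip(labels, B):
    print(label, "".join(map(str, row)))
# ('A', 'G') 0001111
# ('A', 'C') 0110011
# ('A', 'T') 1010101
# ('G', 'C') 0111100
# ('G', 'T') 1011010
# ('C', 'T') 1100110
```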

Our Formula
- B = Hamming Distance Matrix
- y = Weight Vector
- D = Real Distance Vector
- By = Model Hamming Distances
- Goal: λBy = D (λ = Scaling Factor)

Our Formula
- Must Allow For Distortion ε
    D(1 - ε) <= λBy <= D(1 + ε)
- Scaled Weight Vector x = λy
    D(1 - ε) <= B(λy) <= D(1 + ε)
    D(1 - ε) <= Bx <= D(1 + ε)
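For a fixed weight vector, the smallest distortion and the matching scaling factor have a closed form. This is a side calculation rather than the method on the slides, which optimize over the weights as well. The names `lam` and `eps` stand for the scaling factor and the distortion, and the real distances in the example are made up for illustration. Writing r_i for the ratio of model to real distance, the two-sided bound is tightest when the scaled smallest and largest ratios sit symmetrically around 1.

```python
def best_scaling(model, real):
    """Smallest distortion eps, and its scaling lam, for fixed model distances.

    Requires real[i] * (1 - eps) <= lam * model[i] <= real[i] * (1 + eps) for all i.
    """
    if any(d <= 0 for d in real):
        raise ValueError("real distances must be positive")
    ratios = [m / d for m, d in zip(model, real)]
    lo, hi = min(ratios), max(ratios)
    if hi <= 0:
        raise ValueError("model distances must not all be zero")
    lam = 2 / (lo + hi)           # centers lam*lo and lam*hi around 1
    eps = (hi - lo) / (hi + lo)   # half-width of that centered interval
    return lam, eps


model = [2, 5, 4, 5, 6, 3]  # By for y = (1, 1, 3, 1, 0, 1, 0)
real = [2, 5, 4, 5, 6, 4]   # hypothetical real distances
lam, eps = best_scaling(model, real)
print(lam, eps)
```

When every ratio is equal, the distortion is 0 and the scaled model distances match the real ones exactly.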

Linear Programming Problem
- D(1 - ε) <= Bx and Bx <= D(1 + ε)
- -Bx - Dε <= -D and Bx - Dε <= D
- All x_i, ε >= 0
- Goal: Minimize ε
- Solve with CPLEX
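The constraints can be laid out in the usual "minimize c·z subject to Az <= b, z >= 0" form, with the distortion appended to the scaled weights as the last variable. The sketch below only builds the arrays and checks a candidate point; it does not call CPLEX or any other solver. The distance vector `D` is hypothetical, chosen so that an exact encoding exists.

```python
def build_lp(B, D):
    """Return (c, A, b) for: minimize c.z subject to A z <= b, z >= 0.

    z = (x_1, ..., x_n, eps): scaled weights followed by the distortion.
    """
    n = len(B[0])
    c = [0.0] * n + [1.0]                    # objective: minimize eps only
    A, b = [], []
    for row, d in zip(B, D):
        A.append([-v for v in row] + [-d])   # -Bx - D*eps <= -D
        b.append(-d)
        A.append(list(row) + [-d])           #  Bx - D*eps <=  D
        b.append(d)
    return c, A, b


def is_feasible(A, b, z, tol=1e-9):
    """Check A z <= b and z >= 0 for a candidate point z."""
    rows_ok = all(sum(a * v for a, v in zip(row, z)) <= rhs + tol
                  for row, rhs in zip(A, b))
    return rows_ok and all(v >= -tol for v in z)


B = [[0, 0, 0, 1, 1, 1, 1], [0, 1, 1, 0, 0, 1, 1], [1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 1, 1, 0, 0], [1, 0, 1, 1, 0, 1, 0], [1, 1, 0, 0, 1, 1, 0]]
D = [2, 5, 4, 5, 6, 3]  # hypothetical real distances
c, A, b = build_lp(B, D)
print(is_feasible(A, b, [1, 1, 3, 1, 0, 1, 0, 0.0]))  # True
```

The point checked here is x = (1, 1, 3, 1, 0, 1, 0) with zero distortion, which satisfies every constraint because Bx equals D exactly for this choice of D. The arrays `c`, `A` and `b` are in a shape that general-purpose LP solvers typically accept.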

Linear Programming Solution
- Solution Contains:
  - Min Value of ε
  - Scaled Weight Vector x

Courtesy of DIMACS
Mentor: Endre Boros, RUTCOR
Logan Everett, DIMACS REU 2004
