Case Study
DNA
Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms and some viruses. The main role of DNA molecules is the long-term storage of information. The DNA segments that carry this genetic information are called genes.
DNA consists of two long polymers of simple units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds. The two strands run in opposite directions to each other and are therefore anti-parallel. Attached to each sugar is one of four types of molecules called heterocyclic bases. It is the sequence of these four bases along the backbone that encodes information.
The heterocyclic bases are adenine (A), guanine (G), cytosine (C) and thymine (T), linked together with sugar units and phosphates. One of the biggest puzzles was that, although the proportion of these bases varied from one DNA to another, the number of A was always found to equal the number of T, and the number of G to equal the number of C.[4]
The Longest Common Subsequence (LCS) in DNA
Finding the longest common subsequence (LCS) of molecular sequences is an interesting variation of the classical string matching problem: the task is to find a subsequence common to two molecular sequences that is as long as possible. There can be more than one LCS for the same pair of sequences. For example, consider the two short DNA sequences below:
S1 = T C C A T A G T C
S2 = A G C T A A T A G
'CAAT', 'ATAT' and 'ATAG' are common subsequences of length four. They are not the longest ones, however: 'CATAG' and 'TATAG' are common subsequences of length five, which is the LCS length for this pair.
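To make the notion of a common subsequence concrete, the following minimal Python sketch checks whether a candidate string occurs in both S1 and S2 in the same relative order. The helper name `is_subsequence` is hypothetical and chosen only for illustration; the sequences and candidates are the ones from the example, plus the five-symbol candidate 'CATAG'.

```python
def is_subsequence(candidate: str, sequence: str) -> bool:
    """Return True if the symbols of candidate appear in sequence in the same order."""
    remaining = iter(sequence)
    # `symbol in remaining` consumes the iterator up to the first match,
    # so each later symbol can only match at a later position.
    return all(symbol in remaining for symbol in candidate)


s1 = "TCCATAGTC"
s2 = "AGCTAATAG"

for candidate in ("CAAT", "ATAT", "ATAG", "CATAG"):
    common = is_subsequence(candidate, s1) and is_subsequence(candidate, s2)
    print(candidate, len(candidate), common)
```

The symbols do not need to be contiguous, which is why 'CAAT' counts as a subsequence of S1 even though S1 does not contain it as a substring.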
While determining the LCS, we look for symbols that appear in all the sequences in the same relative order, but not necessarily contiguously. In molecular biology it is important to estimate the similarity of two DNA or protein sequences. For pre-selection purposes especially, biological sequences can be treated as strings over an appropriate input alphabet. The degree of similarity can then be measured by the maximal number of identical symbols that occur in both input strings in the same order.
Collecting these identical symbols and concatenating them produces (one of) the longest common subsequence(s) of the strings, and its length numerically describes the similarity between the strings. In molecular biology, the LCS is therefore an appropriate measure of the similarity of biological sequences. To find out how homologous two DNA or protein sequences are, we can calculate the maximum number of identical symbols they share in the same order, which is exactly the length of an LCS.
How the algorithm works
In our case we consider only sequences of the same length, so the relation between them can be represented by a square matrix, which can be used to determine whether the relation has certain properties. We are given two character strings $X = x_0 x_1 x_2 \dots x_{n-1}$ and $Y = y_0 y_1 y_2 \dots y_{m-1}$ of lengths $n$ and $m$ respectively ($m = n$ in our case), and are asked to find a longest string $S$ that is a subsequence of both $X$ and $Y$.
There is more than one way to solve this problem, but a standard efficient approach is dynamic programming, where the solution to the global problem is composed of optimal subproblem solutions. The algorithm initializes an $(n+1) \times (m+1)$ array (matrix) $L$ and iteratively builds up the values $L[i,j]$, where $L[i,j]$ denotes the length of a longest string that is a subsequence of both $X[0..i] = x_0 x_1 x_2 \dots x_i$ and $Y[0..j] = y_0 y_1 y_2 \dots y_j$.
This allows us to rewrite $L[i,j]$ in terms of optimal subproblem solutions, depending on which of the following two cases we are in.
Case 1: $x_i = y_j$. Then $L[i,j] = L[i-1,j-1] + 1$.
Case 2: $x_i \neq y_j$. Then $L[i,j] = \max(L[i-1,j],\; L[i,j-1])$.
For both of these equations to make sense in the boundary cases $i = 0$ or $j = 0$, the algorithm assigns $L[i,-1] = 0$ for $i = -1, 0, 1, \dots, n-1$ and $L[-1,j] = 0$ for $j = -1, 0, 1, \dots, m-1$. The algorithm iterates until it has $L[n-1,m-1]$, the length of a longest common subsequence of $X$ and $Y$.
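The procedure described above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function names `lcs_table` and `lcs_length` are hypothetical, and the table is shifted by one so that row 0 and column 0 play the role of the index -1 boundary entries, which avoids negative indices.

```python
def lcs_table(x: str, y: str) -> list[list[int]]:
    """Build the (n+1) x (m+1) table of LCS lengths for all prefix pairs."""
    n, m = len(x), len(y)
    # Row 0 and column 0 stay 0: they stand for the empty prefix,
    # i.e. the L[i,-1] = 0 and L[-1,j] = 0 boundary of the text.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Matching symbols extend the LCS of the two shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last symbol of x or of y, whichever is better.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y."""
    return lcs_table(x, y)[len(x)][len(y)]


print(lcs_length("TCCATAGTC", "AGCTAATAG"))
```

Each of the (n+1)(m+1) entries is filled in constant time, so the sketch runs in time proportional to n times m.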
Implementation
Given two strands of DNA, X = AGCGA and Y = CAGAT, which could for example come from two individuals, we assume that the two strands have the same length so that we can determine whether the relation has certain properties. Applying the algorithm gives the following matrix.
         C  A  G  A  T
      0  0  0  0  0  0
   A  0  0  1  1  1  1
   G  0  0  1  2  2  2
   C  0  1  1  2  2  2
   G  0  1  1  2  2  2
   A  0  1  2  2  3  3
The LCS of the substring AG and the substring CAG is AG, and it is of length 2.
An LCS of the substring AGC and the substring CA is A, and it is of length 1.
The LCS of the string AGCGA and the string CAGAT is AGA, and it is of length 3.
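The values quoted for X = AGCGA and Y = CAGAT can be reproduced with a short Python sketch that fills the table and then walks back from the bottom-right corner to recover one LCS. The names `lcs_table` and `traceback` are hypothetical, and the tie-breaking rule in the walk (prefer moving up) is an arbitrary choice, so other equally long subsequences are possible.

```python
def lcs_table(x: str, y: str) -> list[list[int]]:
    """Table of LCS lengths; row 0 and column 0 are the all-zero boundary."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def traceback(table: list[list[int]], x: str, y: str) -> str:
    """Recover one LCS by walking from the bottom-right cell to the boundary."""
    i, j = len(x), len(y)
    symbols = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            symbols.append(x[i - 1])  # this symbol is part of the LCS
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1  # tie-break: move up
        else:
            j -= 1
    return "".join(reversed(symbols))


x, y = "AGCGA", "CAGAT"
table = lcs_table(x, y)
for label, row in zip(" " + x, table):
    print(label, *row)
print(traceback(table, x, y))
```

In this layout `table[2][3]` is the entry for the prefixes AG and CAG, `table[3][2]` the entry for AGC and CA, and `table[5][5]` the entry for the two full strings.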
Implementing the binary relations on the LCS matrix
We treat the matrix L as a relation between two distinct strands of DNA that come from two individuals, and we set L[i,j] as follows:
[Figure: 0/1 relation matrix with rows A G C G A and columns C A G A T; final row 1 1 1 1 1]
As we can see, the relation in our example is not reflexive, i.e. L[i,i] is not a member of the relation for every i. The relation is also not symmetric, because L[1,0] = 0 while L[0,1] = 1.
The relation is not transitive either, as the product of the relation matrix with itself shows.
[Figure: relation matrix × relation matrix = product matrix]
Now consider another two strands of DNA, X = CGAAT and Y = CGATT. Applying the LCS algorithm gives the following matrix.
         C  G  A  T  T
      0  0  0  0  0  0
   C  0  1  1  1  1  1
   G  0  1  2  2  2  2
   A  0  1  2  3  3  3
   A  0  1  2  3  3  3
   T  0  1  2  3  4  4
After converting the matrix to a 0/1 matrix and deleting the extra row and column, we can see that the binary relation between these two strands of DNA is an equivalence relation, since it is reflexive, symmetric and transitive.
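The property checks for both pairs of strands can be sketched in Python as below. The conversion rule is an assumption, since the text does not spell it out: an entry becomes 1 when the corresponding LCS length is at least 1 and 0 otherwise, after the boundary row and column are dropped. This assumption agrees with the values L[1,0] = 0 and L[0,1] = 1 quoted for the first pair. All function names are hypothetical.

```python
def lcs_table(x: str, y: str) -> list[list[int]]:
    """Table of LCS lengths; row 0 and column 0 are the all-zero boundary."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def relation_matrix(x: str, y: str) -> list[list[int]]:
    """0/1 matrix. ASSUMPTION: 1 where the LCS length is >= 1, else 0."""
    table = lcs_table(x, y)
    # Drop the extra boundary row and column, then threshold each entry.
    return [[1 if value > 0 else 0 for value in row[1:]] for row in table[1:]]


def is_reflexive(r: list[list[int]]) -> bool:
    return all(r[i][i] == 1 for i in range(len(r)))


def is_symmetric(r: list[list[int]]) -> bool:
    n = len(r)
    return all(r[i][j] == r[j][i] for i in range(n) for j in range(n))


def is_transitive(r: list[list[int]]) -> bool:
    n = len(r)
    # Whenever (i, j) and (j, k) are in the relation, (i, k) must be too.
    return all(
        r[i][k] == 1
        for i in range(n) for j in range(n) for k in range(n)
        if r[i][j] == 1 and r[j][k] == 1
    )


for x, y in (("AGCGA", "CAGAT"), ("CGAAT", "CGATT")):
    r = relation_matrix(x, y)
    print(x, y, is_reflexive(r), is_symmetric(r), is_transitive(r))
```

Under this assumed rule, the first pair fails transitivity because (0,2) and (2,0) are both in the relation while (0,0) is not.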
Conclusion
We presented a case study on applying the concepts of relations to two given strands of DNA through a specific text similarity problem, the Longest Common Subsequence problem.