1 FireμSat: An Algorithm to Detect Tandem Repeats in DNA
2 Introduction What are tandem repeats in DNA? How are we going to detect tandem repeats in DNA? Why would anybody want to detect tandem repeats in DNA?
3 Genetic sequences DNA consists of four different nucleotides, namely Adenine (A), Guanine (G), Cytosine (C) and Thymine (T). Genetic databanks, e.g. GenBank, Emboss and Entrez, store DNA sequences as concatenated single-letter codes in FASTA format.
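As a rough illustration of the FASTA layout mentioned on this slide, the sketch below joins the sequence lines that follow each ">" header into one string of single-letter codes. The function name `parse_fasta` and the record name in the example are hypothetical; real databank files may contain extra conventions that this sketch ignores.

```python
def parse_fasta(text):
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines belonging to the current record.
            records[header].append(line.upper())
    return {name: "".join(chunks) for name, chunks in records.items()}


# Hypothetical two-line record; the sequence is the PTR example from the slides.
example = ">seq1 hypothetical record\nACGACGACG\nACGACG\n"
print(parse_fasta(example))
```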
4 Tandem Repeats (TR’s) in genome sequences DNA molecules are subject to numerous mutational events. One consequence of these events that can be detected by computationally analyzing genome sequences is tandem duplication. A TR or TR-zone is a string of nucleotides characterized by a certain motif that introduces the string, contiguously followed by a number of ‘copies’ of the motif, e.g. ACGACGACGACGACG.
5 Tandem Repeats A perfect tandem repeat (PTR) is a TR whose copies are exact, e.g. ACGACGACGACGACG, which has five copies of the motif ACG. An approximate tandem repeat (ATR) is a TR in which the copies of the motif include non-exact copies, so that mutational events have most likely occurred, e.g. ACGACACGAGGACGAG. In the absence of further qualification, reference to a tandem repeat should be construed as a reference to either a PTR or an ATR.
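The PTR definition can be checked directly: a string is a PTR of a motif exactly when it equals the motif written p times in a row. The helper below is a minimal sketch; the name `count_perfect_copies` is hypothetical and not part of FireμSat.

```python
def count_perfect_copies(seq, motif):
    """Return p if seq is exactly motif repeated p times, otherwise 0."""
    if not motif or len(seq) % len(motif) != 0:
        return 0
    p = len(seq) // len(motif)
    return p if motif * p == seq else 0


print(count_perfect_copies("ACGACGACGACGACG", "ACG"))   # 5: the PTR example
print(count_perfect_copies("ACGACACGAGGACGAG", "ACG"))  # 0: the ATR example is not a PTR
```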
6 Tandem Repeat Elements A PTR element (PTRE) is a TR element that matches the motif. If the motif is for example ACG then the PTRE will also be ACG. An ATR element (ATRE) is a TR element similar to the motif but not an exact copy thereof. If the motif is ACG then an ATRE may for example be AC.
7 Microsatellites The length of PTRE’s may vary, which distinguishes satellites, minisatellites and microsatellites. Microsatellites are a subset of TR’s (conforming to Benson, Delgrange, Rivals & Abajian).
8 Formal problem statement A PTR whose motif ρ is repeated p times, where p ≥ 1, is denoted by ρ^p. An ATR u that is derived from this PTR ρ^p must always have the motif ρ as its prefix. It therefore has the form ρ u_2 … u_p, where each ATRE u_k (k = 2…p) is the result of at most ε mutations on ρ. Here ε is the so-called motif error. Besides the restrictions applicable to the motif error, threshold values are also introduced that constrain the attributes of the detected TR.
9 Tolerated error types
Errors regarding the motif or PTRE (motif errors):
- deletions
- mismatches
- insertions
Errors related to the detected TR (TR errors):
- the ratio between PTRE’s and ATRE’s
- the minimum number of TRE’s to be reported
- the maximum number of consecutive ATRE’s
10 Motif errors
Maximum of 50% error toleration:
- If |ρ| = 2 or |ρ| = 3 then ε = 0 or ε = 1 (default = 1)
- If |ρ| = 4 or |ρ| = 5 then ε = 0, ε = 1 or ε = 2 (default = 2)
Example: for the motif ACGTT, ACT is an ATRE in which two deletions have occurred.
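One way to read the 50% rule is that the largest tolerated motif error is half the motif length, rounded down. That reading is an assumption, but it reproduces the four cases listed on the slide. The function names below are hypothetical.

```python
def max_motif_error(motif):
    """Largest tolerated motif error: half the motif length, rounded down (assumed rule)."""
    return len(motif) // 2


def allowed_motif_errors(motif):
    """All values of the motif error that the 50% rule would permit."""
    return list(range(max_motif_error(motif) + 1))


for motif in ("AC", "ACG", "ACGT", "ACGTT"):
    print(len(motif), allowed_motif_errors(motif))
```

For motif lengths 2 and 3 this yields the values 0 and 1, and for lengths 4 and 5 the values 0, 1 and 2, with the largest value matching the stated default in each case.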
11 Motif errors: Types of Mutations
- Deletion: refers to the absence of a base pair in the motif.
- Insertion: yields an ATRE with up to ε base pairs inserted into any position of the PTRE.
- Mismatch: refers to the replacement of a base pair in the motif by another.
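To make the three mutation types concrete, the sketch below enumerates every word reachable from a motif by at most ε single-base deletions, mismatches or insertions. It is a brute-force illustration rather than the automaton construction used by FireμSat, and the names `single_edits` and `approximate_elements` are hypothetical.

```python
ALPHABET = "ACGT"


def single_edits(word):
    """All words obtained from word by exactly one deletion, mismatch or insertion."""
    results = set()
    for i in range(len(word)):
        results.add(word[:i] + word[i + 1:])  # deletion of position i
        for base in ALPHABET:
            if base != word[i]:
                results.add(word[:i] + base + word[i + 1:])  # mismatch at position i
    for i in range(len(word) + 1):
        for base in ALPHABET:
            results.add(word[:i] + base + word[i:])  # insertion before position i
    return results


def approximate_elements(motif, eps):
    """Candidate ATREs: words within eps edits of motif, excluding the motif itself."""
    seen = {motif}
    frontier = {motif}
    for _ in range(eps):
        frontier = {v for w in frontier for v in single_edits(w)} - seen
        seen |= frontier
    seen.discard(motif)
    seen.discard("")  # an empty element is not meaningful
    return seen


print("AC" in approximate_elements("ACG", 1))     # True: one deletion
print("ACT" in approximate_elements("ACGTT", 2))  # True: two deletions
print("ACT" in approximate_elements("ACGTT", 1))  # False: one edit is not enough
```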
12 Detected TR errors: the substring error
The substring error may not exceed the maximum substring error allowed, where
substring error = (n_d × p_d) + (n_i × p_i) + (n_m × p_m) − n_ptre
and
- n_d: number of deletions
- n_i: number of insertions
- n_m: number of mismatches
- p_d: penalty allocated to deletions
- p_i: penalty allocated to insertions
- p_m: penalty allocated to mismatches
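The substring error formula is plain arithmetic and can be evaluated as below. The penalty defaults of 1 and the counts in the examples are illustrative assumptions, not values given in the slides.

```python
def substring_error(n_d, n_i, n_m, n_ptre, p_d=1, p_i=1, p_m=1):
    """(n_d x p_d) + (n_i x p_i) + (n_m x p_m) - n_ptre, as on the slide.

    The default penalties of 1 are an assumption made for this sketch.
    """
    return n_d * p_d + n_i * p_i + n_m * p_m - n_ptre


# One deletion, two mismatches, two perfect elements, unit penalties: 1 + 0 + 2 - 2
print(substring_error(n_d=1, n_i=0, n_m=2, n_ptre=2))         # 1
# Same counts with deletions penalised twice as heavily: 2 + 0 + 2 - 2
print(substring_error(n_d=1, n_i=0, n_m=2, n_ptre=2, p_d=2))  # 2
```

Because n_ptre is subtracted, every perfect element read lowers the error, so a long run of exact copies can offset a few mutated ones.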
13 Detected TR errors: the minimum number of TRE’s tn_tre = tn_ptre + tn_atre. The default value for tn_tre is 2, to prevent the output of unwanted data.
14 Detected TR errors: the maximum number of consecutive ATRE’s
Counter: tn_atreC
- tn_atreC is incremented for every ATRE read
- tn_atreC is set to zero whenever a PTRE is read
- the default of tn_atreC is 0
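The counter rules can be traced on an already segmented repeat. The sketch below assumes the ATR example ACGACACGAGGACGAG has been split into the elements ACG, AC, ACG, AGG, ACG, AG; that segmentation and the function name `count_elements` are assumptions made for illustration.

```python
def count_elements(elements, motif):
    """Return (n_ptre, n_atre, longest run of consecutive ATREs)."""
    n_ptre = n_atre = 0
    run = longest_run = 0
    for element in elements:
        if element == motif:
            n_ptre += 1
            run = 0  # a PTRE resets the consecutive-ATRE counter
        else:
            n_atre += 1
            run += 1  # an ATRE increments the counter
            longest_run = max(longest_run, run)
    return n_ptre, n_atre, longest_run


elements = ["ACG", "AC", "ACG", "AGG", "ACG", "AG"]  # assumed segmentation
n_ptre, n_atre, longest_run = count_elements(elements, "ACG")
print(n_ptre + n_atre)  # 6 elements in total
print(longest_run)      # 1: no two ATREs are adjacent
```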
15 Deletion Refers to the absence of a base pair in the motif. Figure: FA_D(ACG,1).
16 Mismatch Refers to the replacement of a base pair in the motif by another. Figure: FA_m(ACG,1).
17 High-level Description of FireμSat
- generateWords(ρ,ε) generates a set of all words of length ρLength from the alphabet Σ = {A,C,G,T}.
- createFA_TR(ρ,ε) returns FA_TR(ρ,ε) as discussed.
- findIndices(gSeq, FA_TR, τ, α, β, p_m, p_d, p_i) returns a set of index pairs in gSeq of an identified TR. The TR is such that it complies with the constraints specified by τ, α, β. Various counters have to be updated to ensure correct output.
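The overall shape of these steps can be sketched for the simplest case only. The code below enumerates candidate motifs as generateWords does, but replaces the automaton FA_TR with direct string comparison and therefore finds perfect repeats only (ε = 0). The names `generate_words` and `find_perfect_repeats` and the parameter `min_elements` are hypothetical stand-ins, not the FireμSat routines themselves.

```python
from itertools import product

ALPHABET = "ACGT"


def generate_words(length):
    """All words of the given length over the alphabet {A, C, G, T}."""
    return {"".join(letters) for letters in product(ALPHABET, repeat=length)}


def find_perfect_repeats(g_seq, motif, min_elements=2):
    """Index pairs (start, end), end exclusive, of runs of at least min_elements exact copies."""
    pairs = []
    m = len(motif)
    i = 0
    while i + m <= len(g_seq):
        j = i
        while g_seq.startswith(motif, j):
            j += m  # consume one exact copy of the motif
        if (j - i) // m >= min_elements:
            pairs.append((i, j))
            i = j  # continue after the reported repeat
        else:
            i += 1
    return pairs


print(len(generate_words(3)))                         # 64 candidate motifs of length 3
print(find_perfect_repeats("TTACGACGACGTT", "ACG"))  # [(2, 11)]
```

Running the scan for every word in `generate_words(3)` would also report rotations of the same repeat (for example CGA inside ACGACGACG); this sketch does not attempt to merge them.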
18 Why does anybody want to detect TR’s in DNA? The cause of several human diseases can be traced to having too many copies of a certain nucleotide triplet. TR’s play a role in the development of immune system cells. TR’s serve as genetic markers in plant and animal species. Tandem repeats play a role in gene regulation and contribute to the breeding of disease-resistant cultivars.
19 Conclusion A new theoretical approach to detect TR’s in DNA has been introduced. The time complexity of FireμSat is linear in |gSeq|. The practical implementation of FireμSat is in progress. The following matters constitute a future research agenda:
- the performance of FireμSat
- the possibility of reducing FA_TR
If successful, the latter results could suggest ways of adapting FireμSat to detect minisatellites and satellites as well.