Presentation is loading. Please wait.

Presentation is loading. Please wait.

Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut.

Similar presentations


Presentation on theme: "Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut."— Presentation transcript:

1 Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

2 Overview  Background on universal tag arrays  Tag set design problem - c-h code formulation - Integer program and LP-approximation - Cycle packing formulation  Experimental results  Conclusions

3 Watson-Crick Complementarity Four nucleotide types: A,C,G,T A’s paired with T’s (2 hydrogen bonds) C’s paired with G’s (3 hydrogen bonds)

4 DNA Microarrays Exploit Watson-Crick complementarity to simultaneously perform a large number of substring tests Used in a variety of genomic analyses –Transcription (gene expression) analysis –Single Nucleotide Polymorphism (SNP) genotyping –Genomic-based microorganism identification Common microarray formats involve direct hybridization between labeled DNA/RNA sample and DNA probes attached to a glass slide

5 Direct Hybridization Experiment Images courtesy of Affymetrix. Labeled DNA/RNA sample hybridized to array of probes Laser activation of fluorescent labels Optical scanning used to identify probes with complements in the mixture

6 Limitations of Common Array Formats Arrays of cDNAs –Probes obtained by reverse transcription from Expressed Sequence Tags (ESTs) –Inexpensive, but can only be used for transcription analysis Oligonucleotide arrays –Short (20-60bp) synthetic DNA probes –Flexible, but expensive unless produced in large quantities

7 Universal Tag Arrays [Brenner 97, Morris et al. 98] “Programmable” oligonucleotide arrays –Array consists of application independent oligonucleotides called tags –Two-part “reporter” probes: aplication specific primers ligated to antitags –Detection carried by a sequence of reactions separately involving the primer and the antitag part of reporter probes

8 + Mix reporter probes with genomic DNA Universal Tag Array Experiment Solution phase hybridization Single-Base Extension Solid phase hybridization

9 Universal Tag Array Advantages Cost effective –Same array used in many analyses  economies of scale Easy to customize – Only need to synthesize new set of reporter probes Reliable –Solution phase hybridization better understood than hybridization on solid support

10 Tag Hybridization Constraints (H1) Antitags hybridize strongly to complementary tags (H2) No antitag hybridizes to a non-complementary tag (H3) Antitags do not cross-hybridize to each other t1 t2 t1t2t1 Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H3)

11 More Hybridization Constraints… Enforced during tag assignment by - Leaving some tags unassigned and distributing primers to multiple arrays [Ben-Dor et al. 03] - Exploiting availability of multiple primer candidates [MPT05] t1t2 t1

12 Hybridization Models: Stability Melting temperature models, e.g., nearest neighbor [SantaLucia 96] Tag weight  h [Ben-Dor et al. 00] –wt(A)=wt(T)=1, wt(C)=wt(G)=2 –Equivalent to “2-4 rule” melting temperature model Tag length = l [Affymetrix] –Combined with additional constraints on GC-content, etc.

13 Hamming distance model, e.g. [Marathe et al. 01] –Models rigid DNA strands LCS/edit distance model, e.g., [Torney et al. 03] –Models infinitely elastic DNA strands c-token model [Ben-Dor et al. 00]: –Derived from nucleation complex theory: duplex formation requires formation of nucleation complex between perfectly complementary substrings –Nucleation complex must have weight  c Hybridization Models: Non-Interaction

14 c-h Code Problem c-token: left-minimal DNA string of weight  c, i.e., –w(x)  c –w(x’) < c for every proper suffix x’ of x A set of tags is called c-h code if (C1) Every tag has weight  h (C2) Every c-token is used at most once c-h code problem: given c and h, find maximum cardinality c-h code

15 Previous Work [Ben-Dor et al.00] - Constructive upper-bound on c-h code size based on token tail-weight - Approximation algorithm based on DeBruijn sequences [MPT05] - Upper-bound for c-h codes including antitag-to-antitag hybridization constraints - Simple alphabetic tree search heuristic

16 Token Content of a Tag c=4 CCAGATT CC CCA CAG AGA GAT GATT Tag  sequence of c-tokens End pos: 2 3 4 5 6 7 c-token: CC  CCA  CAG  AGA  GAT  GATT

17 Layered c-token graph s t c1c1 cNcN ll-1 c/2(c/2)+1…

18 Integer Program Formulation Maximum integer flow problem w/ set capacity constraints O( lN)/ O( hN) constraints & variables, where N = #c-tokens

19 Number of c-tokens c# c-tokens 5208 6568 71552 84240 911584 1031648

20 Packing ILP Formulation

21 Garg-Konemann Algorithm 1. x  0; y   // y i are variables of the dual LP 2. Find min weight s-t path p, where weight(v) = y i for every v  V i 3. While weight(p) < 1 do M  max i |p  V i | x p  x p + 1/M For every i, y i  y i ( 1 +  * |p  V i |/M ) Find min weight s-t path p, where weight(v) = y i for v  V i 4. For every p, x p  x p / (1 - log 1+   ) [GK98] The algorithm computes a factor (1-  ) 2 approximation to the optimal LP solution with (N/  )* log 1+  N shortest path computations

22 LP Based c-h Code Construction 1.Run Garg-Konemann and store the minimum weight paths in a list 2.Traversing the list in reverse order, pick tags corresponding to paths if they are feasible and do not share c-tokens with already selected tags 3.Mark used c-tokens and run the alphabetic tree search algorithm to select additional tags

23 Experimental Results (h=15)

24 Experimental Results (h=28)

25 Periodic Tags Key observation: c-token uniqueness constraint in c-h code formulation is too strong –A c-token should not appear in two different tags, but can be repeated within a tag! A tag t is called periodic if it is the prefix of (  )  for some  –Periodic strings make better use of c-tokens (t uses at most |  | c-tokens)

26 c-token factor graph, c=4 (incomplete) CC AAG AAC AAAA AAAT

27 Cycle Packing Problem Vertex-Disjoint Cycle Packing Problem: Given directed graph G, find maximum number of vertex disjoint directed cycles in G Theorem 1: APX-hard even for regular directed graphs with in-degree and out-degree 2 [Salavatipour and Verstraete 05] For general graphs: –Quasi-NP-hard to approximate within  (log 1-  n) –O(n 1/2 ) approximation algorithm

28 Tag Set Design Algorithm 1.Construct c-token factor graph G 2.T  {} 3.For all cycles C defining periodic tags, in increasing order of cycle length, do Add to T the tag defined by C Remove C from G 4.Perform an alphabetic tree search and add to T tags with no c-tokens in common with T 5.Return T

29 Experimental Results h

30 Antitag-to-Antitag Hybridization Formalization in c-token hybridization model: (C3) No two (anti)tags contain complementary substrings of weight  c Cycle packing and tree search extend easily

31 Results w/ Extended Constraints

32 Herpes B Gene Expression Assay TmTm # pools Pool size 500 tags1000 tags2000 tags # arrays% Util.# arrays% Util.# arrays% Util. 601446 1494.06297.20172.30 5496.132100.00172.30 671560 1496.53298.70178.00 5498.00299.90178.00 701522 1496.73298.90176.10 5497.80299.80176.10 TmTm # pools Pool size 500 tags1000 tags2000 tags # arrays% Util.# arrays% Util.# arrays% Util. 601446 1482.26365.35257.05 5488.26370.95263.55 671560 1486.33369.70261.15 5491.86376.00267.20 701522 1488.46373.65265.40 5492.26291.10270.30 GenFlex Tags Periodic Tags

33 Conclusions Use of periodic tags yields significant increase in tag set size and enables higher multiplexing rates in tag assignment Algorithms extend to more accurate hybridization models –Monotonic melting temperature  c-tokens  factor graph Other applications of non-interacting DNA tag sets –Lab-on-chip, DNA-mediated assembly, DNA computing [Brenneman&Condon 02] Open problems –Settle approximation complexity of disjoint cycle packing –Improved tag set design algorithms?

34 Acknowledgments UCONN Research Foundation

35 Constraints on tag assignment If primer p hybridizes with tag t, then either p or t must be left un-assigned, unless p is assigned to t p t t’ p’

36 Characterization of Assignable Sets [Ben-Dor 04] Set P is assignable to T iff X+Y  |P|, where, in the hybridization graph induced by P+T –X = number of primers incident to a degree 1 tag –Y = number of degree 0 tags Y=2 X=1

37 MAPS Problem Maximum Assignable Primer Set (MAPS) Problem: given primer set P and tag set T, find maximum size assignable subset of P [Ben-Dor 04] Greedy deletion heuristic: repeatedly delete primer of maximum weight from P until it becomes assignable, where –Potential of tag t is 2 -|P(t)| –Potential of primer p is sum of potentials of conflicting tags

38 Universal Array Multiplexing Problem Multiplexing Problem: given primer set P and tag set T, find partition of P into minimum number of assignable sets [Ben-Dor 04] Repeatedly find approximate MAPS


Download ppt "Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut."

Similar presentations


Ads by Google