Algorithms for Biochip Design and Optimization Ion Mandoiu Computer Science & Engineering Department University of Connecticut.


Overview  Physical design of DNA arrays  DNA tag set design  Digital microfluidic biochip testing  Conclusions

Driver Biochip Applications Driver applications –Gene expression (transcription analysis) –SNP genotyping –CNP analysis –Genomic-based microorganism identification –Point-of-care diagnosis –healthcare, forensics, environmental monitoring,… As focus shifts from basic research to clinical applications, there are increasingly stringent design requirements on sensitivity, specificity, cost –Assay design and optimization become critical

Single Nucleotide Polymorphisms
Human Genome ≈ 3 × 10^9 base pairs
Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
Total #SNPs ≈ 1 × 10^7
Difference b/w any two individuals ≈ 3 × 10^6 SNPs (≈ 0.1% of entire genome)
… ataggtcc C tatttcgcgc C gtatacacggg T ctata …
… ataggtcc G tatttcgcgc A gtatacacggg A ctata …
… ataggtcc C tatttcgcgc C gtatacacggg T ctata …

Watson-Crick Complementarity Four nucleotide types: A,C,T,G A’s paired with T’s (2 hydrogen bonds) C’s paired with G’s (3 hydrogen bonds)
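The pairing rule above can be written down directly. The following is a minimal sketch; the function names are illustrative, and the reversal in `reverse_complement` relies on the standard fact that paired strands run antiparallel, which the slide does not state.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Base-by-base Watson-Crick complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Complement read in the opposite direction (antiparallel partner)."""
    return complement(strand)[::-1]


def hydrogen_bonds(strand: str) -> int:
    """Hydrogen bonds in the duplex with the perfect complement:
    2 per A-T pair, 3 per C-G pair."""
    return sum(2 if base in "AT" else 3 for base in strand)
```

For example, `hydrogen_bonds("ACTG")` counts two A-T pairs and two C-G pairs.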

Hybridization SNP genotyping via direct hybridization SNP1 with alleles T/G SNP2 with alleles A/G A CTC G A A CTC G A Array with 2 probes/SNP Optical scanning used to identify alleles present in the sample Labeled sample

In-Place Probe Synthesis CG AC CG AC ACG AG G C Probes to be synthesized A A A A A

In-Place Probe Synthesis CG AC CG AC ACG AG G C Probes to be synthesized A A A A A C C C C C C

In-Place Probe Synthesis CG AC CG AC ACG AG G C Probes to be synthesized A A A A A C C C C C C G G G G G

Simplified DNA Array Flow Probe Selection Array Manufacturing Hybridization Experiment Gene expression levels, SNP genotypes,… Analysis of Hybridization Intensities Mask Manufacturing Physical Design: Probe Placement & Embedding Design Manufacturing End User

Unwanted Illumination Effect Unintended illumination during manufacturing ⇒ synthesis of erroneous probes Effect gets worse with technology scaling

Border Length Minimization Objective Effects of unintended illumination ∝ border length [Figure: masks for A, C, G with their borders marked; probes to be synthesized: CG AC CG AC ACG AG G C]

Synchronous Synthesis Periodic deposition sequence, e.g., (ACTG)^k –Each probe grown by one nucleotide in each period ⇒ # border conflicts b/w adjacent probes = 2 x Hamming distance
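A small simulation can illustrate why the conflict count equals twice the Hamming distance. This is a sketch under one assumption not spelled out on the slide: a border conflict is counted at every deposition step in which exactly one of the two adjacent probes is exposed. Function names are illustrative.

```python
DEPOSITION_PERIOD = "ACTG"  # one period of the deposition sequence


def hamming(p: str, q: str) -> int:
    """Number of positions at which two equal-length probes differ."""
    if len(p) != len(q):
        raise ValueError("probes must have equal length")
    return sum(a != b for a, b in zip(p, q))


def simulated_conflicts(p: str, q: str, period: str = DEPOSITION_PERIOD) -> int:
    """Count deposition steps where exactly one of two adjacent probes is
    exposed, with the i-th nucleotide of each probe grown in the i-th period."""
    if len(p) != len(q):
        raise ValueError("probes must have equal length")
    conflicts = 0
    for a, b in zip(p, q):        # one period per probe position
        for nucleotide in period:  # one mask per nucleotide of the period
            if (a == nucleotide) != (b == nucleotide):
                conflicts += 1
    return conflicts
```

Each position where the probes differ contributes two conflicting steps (one for each probe's nucleotide), and matching positions contribute none.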

2D Placement Problem probe Edge cost = 2 x Hamming distance Find minimum cost mapping of the Hamming graph onto the grid graph Special case of the Quadratic Assignment Problem
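To make the objective concrete, the sketch below evaluates the cost of a placement as the sum of 2 x Hamming distance over horizontally and vertically adjacent cells, and finds an optimal placement by exhaustive search. The exhaustive search is only for illustration on tiny inputs and is not one of the heuristics discussed in the talk; all names are hypothetical.

```python
from itertools import permutations


def hamming(p: str, q: str) -> int:
    """Number of positions at which two equal-length probes differ."""
    return sum(a != b for a, b in zip(p, q))


def placement_cost(grid: list[list[str]]) -> int:
    """Total edge cost (2 x Hamming distance) over all grid-adjacent probe pairs."""
    rows, cols = len(grid), len(grid[0])
    cost = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # right neighbor
                cost += 2 * hamming(grid[r][c], grid[r][c + 1])
            if r + 1 < rows:  # bottom neighbor
                cost += 2 * hamming(grid[r][c], grid[r + 1][c])
    return cost


def brute_force_placement(probes: list[str], rows: int, cols: int) -> list[list[str]]:
    """Try every assignment of probes to cells; feasible only for a handful of probes."""
    if len(probes) != rows * cols:
        raise ValueError("need exactly rows * cols probes")
    best = None
    for perm in permutations(probes):
        grid = [list(perm[r * cols:(r + 1) * cols]) for r in range(rows)]
        if best is None or placement_cost(grid) < placement_cost(best):
            best = grid
    return best
```

The factorial growth of the search space is what motivates the scalable heuristics on the following slides.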

2D Placement: Sliding-Window Matching Select mutually nonadjacent probes from small window Re-assign optimally Slide window over entire chip Repeat fixed # of iterations (O(N) time for fixed window size), or until improvement drops below certain threshold Proposed by [Doll et al. ‘94] in VLSI context

2D Placement: Epitaxial Growth Proposed by [PreasL’88, ShahookarM’91] in VLSI context Simulates crystal growth Efficient “row” implementation Use lexicographical sorting for initial ordering of probes Fill cells row-by-row Bound number of candidate probes considered when filling each cell Constant # of lookahead rows ⇒ O(N^(3/2)) runtime, N = #probes

2D Placement: Recursive Partitioning Very effective in VLSI placement [AlpertK’95,Caldwell et al.’00] 4-way partition using linear time clustering Repeat until Row-Epitaxial can be applied

Asynchronous Synthesis A A A C C C T T T G G G A C T G A G T G T G A A Synchronous Embedding A G T A G G T A Deposition Sequence Probes G A A G T A G T ASAP Embedding G
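The two embeddings named on this slide can be sketched as follows. This is a minimal illustration assuming the periodic deposition sequence (ACTG)^k used earlier in the talk; function names and the 0-based step indices are choices made here, not taken from the slides.

```python
def synchronous_embedding(probe: str, period: str = "ACTG") -> list[int]:
    """Deposition steps used when the i-th nucleotide is grown in the i-th period."""
    return [i * len(period) + period.index(base) for i, base in enumerate(probe)]


def asap_embedding(probe: str, deposition: str) -> list[int]:
    """As-soon-as-possible embedding: each nucleotide is synthesized at the
    earliest deposition step after the previous nucleotide that matches it."""
    steps = []
    pos = 0
    for base in probe:
        while pos < len(deposition) and deposition[pos] != base:
            pos += 1
        if pos == len(deposition):
            raise ValueError("probe cannot be embedded in the deposition sequence")
        steps.append(pos)
        pos += 1
    return steps
```

With probe `AGT` and deposition sequence `ACTG` repeated three times, the synchronous embedding uses steps 0, 7, 10 while the ASAP embedding uses steps 0, 3, 6, so the probe finishes earlier.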

A C T ACG T ACGT Source Sink Efficient solution by dynamic programming Optimal Single-Probe Re-Embedding

In-Place Re-Embedding Algorithms
2D placement fixed, allow only probe embeddings to change
–Greedy: optimally re-embed probe with largest gain
–Chessboard: alternate re-embedding of black/white cells
–Sequential: re-embed probes row-by-row

Chip size   Greedy          Chessboard      Sequential
            %LB     CPU     %LB     CPU     %LB     CPU
100         125.7   40      120.5   54      119.9   64
500         127.1   943     121.4   1423    120.9   1535

Integration with Probe Selection
[Flow diagram labels: Probe Selection; Physical Design: Placement & Embedding; Probe Pools]

Pool Row-Epitaxial, chip size 100x100
Pool size   % Improv   CPU sec.
1           -          217
2           4.3        1040
4           8.2        1796
8           11.8       3645
16          15.2       7515

Universal Tag Arrays Brenner 97, Morris et al. 98 –Array consisting of application-independent tags –Two-part “reporter” probes: application-specific primers ligated to antitags –Detection carried out by a sequence of reactions separately involving the primer and the antitag part of reporter probes

Universal Tag Array Advantages Cost effective –Same tag array used for different analyses ⇒ can be mass-produced –Only need to synthesize new set of reporter probes More reliable! –Solution phase hybridization better understood than hybridization on solid support

SNP Genotyping with Tag Arrays 1. Mix reporter probes with unlabeled genomic DNA 2. Solution phase hybridization 3. Single-Base Extension (SBE) 4. Solid phase hybridization [Figure labels: Tag + Primer, antitag]

Tag Set Design Problem (H1) Tags hybridize strongly to complementary antitags (H2) No tag hybridizes to a non-complementary antitag Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H2)

Hybridization Models Melting temperature Tm: temperature at which 50% of duplexes are in hybridized state 2-4 rule: Tm = 2 × #(As and Ts) + 4 × #(Cs and Gs) More accurate models exist, e.g., the nearest-neighbor model
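The 2-4 rule is simple enough to compute directly. A minimal sketch follows; the function name is illustrative and the result is the rule's rough estimate, not a thermodynamically accurate melting temperature.

```python
def melting_temp_2_4(seq: str) -> int:
    """Estimate Tm with the 2-4 rule: 2 per A or T, 4 per C or G."""
    at = sum(base in "AT" for base in seq)
    cg = sum(base in "CG" for base in seq)
    if at + cg != len(seq):
        raise ValueError("sequence must contain only A, C, G, T")
    return 2 * at + 4 * cg
```

For the tag `CCAGATT` (three C/G bases, four A/T bases) the estimate is 2 × 4 + 4 × 3 = 20.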

Hybridization Models (contd.) Hamming distance model, e.g., [Marathe et al. 01] –Models rigid DNA strands LCS/edit distance model, e.g., [Torney et al. 03] –Models infinitely elastic DNA strands c-token model [Ben-Dor et al. 00]: –Duplex formation requires formation of nucleation complex between perfectly complementary substrings –Nucleation complex must have weight ≥ c, where wt(A)=wt(T)=1, wt(C)=wt(G)=2 (2-4 rule)

c-h Code Problem c-token: left-minimal DNA string of weight ≥ c, i.e., –w(x) ≥ c –w(x’) < c for every proper suffix x’ of x A set of tags is a c-h code if (C1) Every tag has weight ≥ h (C2) Every c-token is used at most once c-h Code Problem [Ben-Dor et al.00] Given c and h, find maximum cardinality c-h code

Algorithms for c-h Code Problem [Ben-Dor et al.00] approximation algorithm based on DeBruijn sequences Alphabetic tree search algorithm - Enumerate candidate tags in lexicographic order, save tags whose c-tokens are not used by previously selected tags - Easily modified to handle various combinations of constraints [MT 05, 06] Optimum c-h codes can be computed in practical time for small values of c by using integer programming Practical runtime using Garg-Konemann approximation and LP-rounding
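The lexicographic enumeration idea can be illustrated with a brute-force simplification. The sketch below is not the pruned tree search of the cited work: it assumes tags of one fixed length (the `length` parameter is introduced here for illustration) and simply scans all strings in lexicographic order, keeping a tag when it has weight at least h and none of its c-tokens is already used.

```python
from itertools import product

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def c_tokens(tag: str, c: int) -> list[str]:
    """All c-tokens of a tag: for each end position, the shortest substring
    ending there whose weight is at least c (if one exists)."""
    tokens = []
    for end in range(1, len(tag) + 1):
        start, w = end, 0
        while start > 0 and w < c:
            start -= 1
            w += WEIGHT[tag[start]]
        if w >= c:
            tokens.append(tag[start:end])
    return tokens


def greedy_lexicographic_code(c: int, h: int, length: int) -> list[str]:
    """Scan fixed-length strings in lexicographic order and keep each tag that
    satisfies (C1) weight >= h and (C2) no c-token used more than once."""
    used: set[str] = set()
    code = []
    for letters in product("ACGT", repeat=length):
        tag = "".join(letters)
        if sum(WEIGHT[b] for b in tag) < h:
            continue
        tokens = c_tokens(tag, c)
        if len(set(tokens)) != len(tokens):
            continue  # a c-token repeats inside the tag itself
        if used.isdisjoint(tokens):
            code.append(tag)
            used.update(tokens)
    return code
```

The result is a valid c-h code for the chosen length but carries no optimality guarantee; the scan is exponential in `length` and only meant for small examples.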

Token Content of a Tag (c=4)
Tag: CCAGATT
Tag → sequence of c-tokens
End pos:  2    3     4     5     6     7
c-token:  CC → CCA → CAG → AGA → GAT → GATT
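The token sequence of this slide can be reproduced mechanically from the c-token definition (left-minimal substring of weight at least c, with wt(A)=wt(T)=1 and wt(C)=wt(G)=2). The sketch below uses illustrative names and reports 1-based end positions to match the slide.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def token_content(tag: str, c: int) -> list[tuple[int, str]]:
    """Return (end position, c-token) pairs for a tag.

    For each end position, extend leftwards one base at a time until the
    weight reaches c; the first substring that does so is left-minimal,
    since every proper suffix of it has weight below c.
    """
    content = []
    for end in range(1, len(tag) + 1):
        start, w = end, 0
        while start > 0 and w < c:
            start -= 1
            w += WEIGHT[tag[start]]
        if w >= c:
            content.append((end, tag[start:end]))
    return content
```

Calling `token_content("CCAGATT", 4)` yields the six tokens listed above, ending at positions 2 through 7; position 1 has no token because the single base C has weight 2 < 4.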

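Layered c-token graph for length-l tags [Figure: source s and sink t connected through layers indexed c/2, (c/2)+1, …, l-1, l, each layer listing the c-tokens c_1, …, c_N]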

Integer Program Formulation [MPT05] Maximum integer flow problem w/ set capacity constraints O(hN) constraints & variables, where N = #c-tokens

Packing LP Formulation

Garg-Konemann Algorithm
1. x ← 0; y ← δ // y_i are variables of the dual LP
2. Find min weight s-t path p, where weight(v) = y_i for every v ∈ V_i
3. While weight(p) < 1 do
   M ← max_i |p ∩ V_i|
   x_p ← x_p + 1/M
   For every i, y_i ← y_i (1 + ε |p ∩ V_i| / M)
   Find min weight s-t path p, where weight(v) = y_i for v ∈ V_i
4. For every p, x_p ← x_p / (1 - log_{1+ε} δ)
[GK98] The algorithm computes a factor (1-ε)^2 approximation to the optimal LP solution with (N/ε) log_{1+ε} N shortest path computations

LP Based Tag Set Design 1.Run Garg-Konemann and store the minimum weight paths in a list 2.Traversing the list in reverse order, pick tags corresponding to paths if they are feasible and do not share c-tokens with already selected tags 3.Mark used c-tokens and run the alphabetic tree search algorithm to select additional tags

Periodic Tags [MT05] Key observation: c-token uniqueness constraint in c-h code formulation is too strong –A c-token should not appear in two different tags, but can be repeated in a tag A tag t is called periodic if it is the prefix of ααα… for some “period” α –Periodic strings make best use of c-tokens

c-token factor graph, c=4 (incomplete) CC AAG AAC AAAA AAAT

Vertex-disjoint Cycle Packing Problem Given directed graph G, find maximum number of vertex disjoint directed cycles in G [MT 05] APX-hard even for regular directed graphs with in-degree and out-degree 2 –h-c/2+1 approximation factor for tag set design problem [Salavatipour and Verstraete 05] –Quasi-NP-hard to approximate within Ω(log^(1-ε) n) –O(n^(1/2)) approximation algorithm

Cycle Packing Algorithm
1. Construct c-token factor graph G
2. T ← {}
3. For all cycles C defining periodic tags, in increasing order of cycle length,
   Add to T the tag defined by C
   Remove C from G
4. Perform an alphabetic tree search and add to T tags consisting of unused c-tokens
5. Return T

Experimental Results [Figure: plot with axis label h]

More Hybridization Constraints… Enforced during tag assignment by - Leaving some tags unassigned and distributing primers across multiple arrays [Ben-Dor et al. 03] - Exploiting availability of multiple primer candidates [MPT05]

Herpes B Gene Expression Assay

Tm   # pools   Pool size   500 tags             1000 tags            2000 tags
                           # arrays   % Util.   # arrays   % Util.   # arrays   % Util.
60   1446      1           4          94.06     2          97.20     1          72.30
               5           4          96.13     2          100.00    1          72.30
67   1560      1           4          96.53     2          98.70     1          78.00
               5           4          98.00     2          99.90     1          78.00
70   1522      1           4          96.73     2          98.90     1          76.10
               5           4          97.80     2          99.80     1          76.10

Tm   # pools   Pool size   500 tags             1000 tags            2000 tags
                           # arrays   % Util.   # arrays   % Util.   # arrays   % Util.
60   1446      1           4          82.26     3          65.35     2          57.05
               5           4          88.26     3          70.95     2          63.55
67   1560      1           4          86.33     3          69.70     2          61.15
               5           4          91.86     3          76.00     2          67.20
70   1522      1           4          88.46     3          73.65     2          65.40
               5           4          92.26     2          91.10     2          70.30

GenFlex Tags   Periodic Tags

Digital Microfluidic Biochips [Srinivasan et al. 04] [Su&Chakrabarty 06] I/O Cell Electrodes typically arranged in rectangular grid Droplets moved by applying voltage to adjacent cell Can be used for analyses of DNA, proteins, metabolites…

Design Challenges Testing –High electrode failure rate, but can re-configure around –Performed both after manufacturing and concurrent with chip operation –Main objective is minimization of completion time Module placement –Assay operations (mixing, amplification, etc.) can be mapped to overlapping areas of the chip if performed at different times Droplet routing –When multiple droplets are routed simultaneously must prevent accidental droplet merging or interference MergingInterference

Concurrent Testing Problem GIVEN: –Input/Output cells –Position of obstacles (cells in use by ongoing reactions) FIND: –Trajectories for test droplets such that Every non-blocked cell is visited by at least one test droplet Droplet trajectories meet non-merging and non-interference constraints Completion time is minimized Defect model: test droplet gets stuck at defective electrode [Su et al. 04] ILP-based solution for single test droplet case & heuristic for multiple input-output pairs with single test droplet/pair

ILP Formulation for Unconstrained Number of Droplets Each cell (i,j) visited at least once: Droplet conservation: No droplet merging: No droplet interference: Minimize completion time:

Special Case NxN Chip, I/O cells in Opposite Corners, No Obstacles: single droplet solution needs N^2 cycles

Lower Bound
Claim: Completion time is at least 4N – 4 cycles
Proof: In each cycle, each of the k droplets places 1 dollar in current cell
–3k(k-1)/2 dollars paid waiting to depart
–3k(k-1)/2 dollars paid waiting for last droplet
–k dollars in each diagonal
–1 dollar in each cell

Stripe Algorithm with N/3 Droplets Stripe algorithm has approximation factor of

Stripe Algorithm with Obstacles of width Q Divide array into vertical stripes of width Q+1 Use one droplet per stripe All droplets visit cells in assigned stripes in parallel In case of interference droplet on left stripe waits for droplet in right stripe
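The stripe partition itself is easy to sketch. The code below only computes which columns each droplet is responsible for; trajectories and interference handling are omitted. The function name is illustrative, and letting the last stripe be narrower when Q+1 does not divide the chip width is an assumption made here.

```python
def stripe_columns(n: int, q: int) -> list[range]:
    """Split the columns 0..n-1 of an n x n chip into vertical stripes of
    width q + 1, one stripe per test droplet."""
    if n <= 0 or q < 0:
        raise ValueError("need n > 0 and q >= 0")
    width = q + 1
    return [range(start, min(start + width, n)) for start in range(0, n, width)]
```

For a 120x120 chip with obstacles of width 2, this gives 120 / 3 = 40 stripes, i.e. 40 droplets.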

Results for 120x120 Chip, 2x2 Obstacles
~20x decrease in completion time by using multiple droplets

Obstacle   Average completion time (cycles)               k=40 vs. k=1
Area       k=1     k=12    k=20     k=30    k=40          speed-up
0%         14400   1412    944      710     593           24x
1%         14256   1420    953.4    715.2   598.8         24x
5%         13680   1473    982.8    725     596.2         23x
10%        12960   1490    1010.8   734.8   592.6         22x
15%        12240   1501    1025.8   730.8   588.2         21x
20%        11520   1501    1046.8   738.4   580.8         20x
25%        10800   1501    1071     736.6   570           19x

Conclusions Biochip design is a fertile area of applications –Combinatorial optimization techniques can yield significant improvements in assay quality/throughput Very dynamic area, driver applications and underlying technologies change rapidly

Acknowledgments Physical design of DNA arrays: A.B. Kahng, P. Pevzner, S. Reda, X. Xu, A. Zelikovsky Tag set design: D. Trinca Testing of digital microfluidic biochips: R. Garfinkel, B. Pasaniuc, A. Zelikovsky Financial support: UCONN Research Foundation, NSF awards 0546457 and 0543365

Questions?
