Engineering a Scalable Placement Heuristic for DNA Probe Arrays A.B. Kahng, I.I. Mandoiu, P. Pevzner, S. Reda (all UCSD), A. Zelikovsky (GSU)
Outline DNA probe arrays and unwanted illumination Synchronous array design (2-D placement) Asynchronous array design (3-D placement) Experimental results Extensions Conclusions
DNA Probe Arrays
Used in a wide range of genomic analyses
–Gene expression monitoring, SNP mapping, sequencing by hybridization, …
Arrays with up to 1000x1000 probes in commercial use, 10^8 probes envisioned for next-generation arrays
–Highly scalable algorithms required for array design
Simplified DNA Array Flow
[Figure: flow from gene sequences, position of SNPs, etc. → Probe Selection → Mask Design: Placement & Embedding → Mask Manufacturing → Array Manufacturing → Hybridization Experiment → Analysis of Hybridization Intensities; the steps are split between a Soft/Computational Domain and a Hard/Biochemistry Domain]
Array Manufacturing Process
Very Large-Scale Immobilized Polymer Synthesis:
1. Treat substrate with chemically protected “linker” molecules, creating rectangular array
–Site size = approx. 10x10 microns
2. Selectively expose array sites to light
–Light deprotects exposed molecules, activating further synthesis
3. Flush chip surface with solution of protected A, C, G, T
–Binding occurs at previously deprotected sites
4. Repeat steps 2 & 3 until desired probes are synthesized
Photo-Deprotection Step
Our concern: diffraction → unwanted illumination → yield decrease
Probe Synthesis
[Figure: nucleotide deposition sequence ACG with masks M1 (A), M2 (C), M3 (G); placed probes CG, AC, CG, AC, ACG, AG, G, C]
Measuring Unwanted Illumination
[Figure: nucleotide deposition sequence ACG with masks M1 (A), M2 (C), M3 (G) and placed probes CG, AC, CG, AC, ACG, AG, G, C; mask borders are marked, and unwanted illumination is measured by border length]
Synchronous vs. Asynchronous Synthesis
[Figure: (a) periodic deposition sequence, divided into 4-groups (b) synchronous embedding of CTG (c) asynchronous leftmost embedding of CTG (d) another asynchronous embedding]
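The two embedding styles can be sketched in a few lines. This is a minimal illustration, assuming a periodic deposition sequence with period ACTG (the example period used in the asynchronous problem formulation) and representing an embedding as the list of 0-based deposition steps at which each nucleotide is synthesized. The function names are hypothetical.

```python
def synchronous_embedding(probe, period="ACTG"):
    """The i-th nucleotide is synthesized inside the i-th period (4-group)."""
    return [i * len(period) + period.index(base) for i, base in enumerate(probe)]


def leftmost_embedding(probe, period="ACTG"):
    """Asynchronous leftmost embedding: each nucleotide takes the earliest usable step."""
    steps, step = [], 0
    for base in probe:
        # Advance until the deposited nucleotide matches the one we need.
        while period[step % len(period)] != base:
            step += 1
        steps.append(step)
        step += 1
    return steps


print(synchronous_embedding("CTG"))  # steps used under synchronous synthesis
print(leftmost_embedding("CTG"))     # steps used by the leftmost embedding
```

Under these assumptions the leftmost embedding finishes CTG within the first period, while the synchronous embedding spreads it over three periods.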
Problem Formulation (Synchronous Case)
Synchronous Array Design (2-D Placement) Problem:
–Minimize placement cost of Hamming graph H (vertices = probes, distance = Hamming)
–on 2-dimensional grid graph G2 (N x N array, edges between distance-1 neighbors)
[Figure: probes of H mapped to sites of G2]
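The placement cost in this formulation can be computed directly. The sketch below assumes equal-length probes stored as strings in a list-of-lists grid and counts each pair of adjacent sites once; `hamming` and `placement_cost` are illustrative names, not the authors' code.

```python
def hamming(a, b):
    """Number of positions at which two equal-length probes differ."""
    return sum(x != y for x, y in zip(a, b))


def placement_cost(grid):
    """Sum of Hamming distances over horizontally and vertically adjacent sites."""
    n_rows, n_cols = len(grid), len(grid[0])
    cost = 0
    for r in range(n_rows):
        for c in range(n_cols):
            if c + 1 < n_cols:
                cost += hamming(grid[r][c], grid[r][c + 1])
            if r + 1 < n_rows:
                cost += hamming(grid[r][c], grid[r + 1][c])
    return cost


good = [["AC", "AG"], ["TC", "TG"]]
bad = [["AC", "TG"], ["AG", "TC"]]
print(placement_cost(good), placement_cost(bad))
```

The same four probes give a lower cost when similar probes sit next to each other, which is what the placement heuristics try to exploit.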
2-D Placement Lower Bound
Sum, over all probes, of Hamming distances to the 4 closest neighbors in H, minus the weight of the 4N heaviest arcs
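A small sketch of this bound follows. It assumes the bound is compared against a cost that counts every adjacent pair of sites in both directions (one arc per probe per neighbor). If each pair is counted once instead, the value would need to be halved. The names are hypothetical.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def lower_bound_2d(probes, n):
    """Directed-arc lower bound for placing the probes on an n x n grid."""
    arcs = []
    for i, p in enumerate(probes):
        dists = sorted(hamming(p, q) for j, q in enumerate(probes) if j != i)
        arcs.extend(dists[:4])  # arcs from p to its 4 closest probes
    arcs.sort()
    # Border sites have fewer than 4 neighbors: 4n arcs are missing in total,
    # so drop the 4n heaviest arcs.
    return sum(arcs[:max(len(arcs) - 4 * n, 0)])


def directed_cost(grid):
    """Placement cost with every adjacent pair counted in both directions."""
    n_rows, n_cols = len(grid), len(grid[0])
    total = 0
    for r in range(n_rows):
        for c in range(n_cols):
            for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    total += hamming(grid[r][c], grid[rr][cc])
    return total


probes = ["AA", "AC", "AG", "CA", "CC", "CG", "GA", "GC", "GG"]
grid = [probes[0:3], probes[3:6], probes[6:9]]
print(lower_bound_2d(probes, 3), directed_cost(grid))
```

For this toy instance the row-major placement meets the bound, so the bound is tight here. In general it is only a lower bound.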
TSP + 1-Threading Placement
Hubbell, 1990s
–Find TSP tour/path over given probes w.r.t. Hamming distance
–Thread TSP path in the grid row by row
Hannenhalli, Hubbell, Lipshutz, Pevzner ’02
–Place the probes according to 1-Threading
–Further decreases total border by 20%
Lexicographical Sorting + 1-Threading
–Radix-sort the probes in lexicographical order
–Thread on the chip
[Figure: example probes sorted and threaded on the chip]
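A rough sketch of sort-then-thread is shown below. Two simplifications are assumed: Python's built-in `sorted` stands in for the radix sort, and a plain row-by-row snake path stands in for 1-threading, whose exact path is not specified on this slide. `sort_and_thread` is a hypothetical name.

```python
def sort_and_thread(probes, n_cols):
    """Sort probes lexicographically, then lay them out along a snake path."""
    ordered = sorted(probes)  # lexicographic order (stand-in for radix sort)
    rows = [ordered[i:i + n_cols] for i in range(0, len(ordered), n_cols)]
    # Reverse every other row so probes that are consecutive in sorted order
    # end up on adjacent sites.
    return [row if r % 2 == 0 else row[::-1] for r, row in enumerate(rows)]


for row in sort_and_thread(["TG", "AC", "CA", "AA"], 2):
    print(row)
```

Sorting groups probes that share a prefix, so neighbors along the path tend to have small Hamming distance.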
Matching-Based Probe Placement
–Select an independent (mutually nonadjacent) set of placed probes
–Re-embed using optimal perfect matching
–Total cost can only decrease or remain the same
–Runtime: roughly proportional to square of independent set size
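The re-embedding step can be illustrated on a tiny grid. In this sketch a brute-force search over permutations stands in for the optimal perfect matching, so it is only usable for very small independent sets. All function names are assumptions made for the example.

```python
from itertools import permutations


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def site_cost(grid, probe, r, c):
    """Cost of putting `probe` at site (r, c), given its current grid neighbors."""
    total = 0
    for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
        if 0 <= rr < len(grid) and 0 <= cc < len(grid[0]):
            total += hamming(probe, grid[rr][cc])
    return total


def total_cost(grid):
    # Every adjacent pair is seen from both of its sites, hence the halving.
    return sum(site_cost(grid, grid[r][c], r, c)
               for r in range(len(grid)) for c in range(len(grid[0]))) // 2


def reassign_independent_set(grid, sites):
    """Optimally permute the probes sitting on mutually nonadjacent `sites`."""
    probes = [grid[r][c] for r, c in sites]
    # Neighbors of independent sites never move, so site costs are independent.
    best = min(permutations(probes),
               key=lambda perm: sum(site_cost(grid, p, r, c)
                                    for p, (r, c) in zip(perm, sites)))
    for p, (r, c) in zip(best, sites):
        grid[r][c] = p
    return grid


grid = [["AA", "CC", "AC"], ["CA", "CC", "CA"], ["CC", "AC", "AA"]]
before = total_cost(grid)
reassign_independent_set(grid, [(0, 0), (0, 2), (2, 0), (2, 2)])
print(before, total_cost(grid))
```

Because keeping every probe in place is one of the candidate assignments, the cost after the step is never higher than before.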
Sliding Window Matching
–There is a trade-off between solution quality and size/overlap of windows
–Iterate SlidingWindowMatching over the chip until improvement drops below 0.1%
Effect of Window Size on Solution Quality Increased window size/overlap decreases number of conflicts, but increases runtime
Epitaxial Placement Algorithm
–Simulates crystal growth
–Start with an arbitrary probe placed at the center
–Maintain a best probe candidate (i.e., a probe with the minimum number of conflicts with the already placed neighbors) for each border site
–Iteratively fill the border site with minimum increase in border length, giving priority to sites with more neighbors filled
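A naive sketch of the crystal-growth idea follows. It recomputes all candidates at every step instead of maintaining a cached best candidate per border site, so it is far slower than what the slide describes. The priority rule is also an assumption: sites are compared by average conflicts per filled neighbor, with ties broken in favor of sites with more filled neighbors.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def epitaxial_placement(probes, n):
    """Greedy crystal-growth placement of n*n probes on an n x n grid."""
    grid = [[None] * n for _ in range(n)]
    unplaced = list(probes)
    grid[n // 2][n // 2] = unplaced.pop(0)  # arbitrary seed probe at the center

    def filled_neighbors(r, c):
        return [grid[rr][cc]
                for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
                if 0 <= rr < n and 0 <= cc < n and grid[rr][cc] is not None]

    while unplaced:
        best = None
        for r in range(n):
            for c in range(n):
                if grid[r][c] is not None:
                    continue
                nbrs = filled_neighbors(r, c)
                if not nbrs:
                    continue  # not a border site yet
                for p in unplaced:
                    cost = sum(hamming(p, q) for q in nbrs)
                    key = (cost / len(nbrs), -len(nbrs))  # assumed priority rule
                    if best is None or key < best[0]:
                        best = (key, r, c, p)
        _, r, c, p = best
        grid[r][c] = p
        unplaced.remove(p)
    return grid


probes = ["AA", "AC", "AG", "CA", "CC", "CG", "GA", "GC", "GG"]
for row in epitaxial_placement(probes, 3):
    print(row)
```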
Tile- and Row-Epitaxial
Tile-epitaxial
–Divide array into 100x100 tiles
–Run Epitaxial within each tile
–Take into account border of already placed tiles
Row-epitaxial
–Place probes by a fast method, e.g., sort + 1-thread
–Re-place probes row by row, sequentially filling sites within a row
–Assign to each site a probe with min number of conflicts among the unplaced probes from following K rows
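The row-epitaxial pass can be sketched as an in-place scan over an initial placement. This version assumes the candidate pool for a site is every not-yet-finalized site from the current one through row r + K (including the rest of the current row), and that conflicts are counted against the already finalized left and upper neighbors. `row_epitaxial` is a hypothetical name.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def row_epitaxial(grid, k):
    """Re-place probes row by row, looking ahead k rows for the best candidate."""
    n_rows, n_cols = len(grid), len(grid[0])
    for r in range(n_rows):
        for c in range(n_cols):
            # Neighbors that are already finalized: left site and site above.
            fixed = []
            if c > 0:
                fixed.append(grid[r][c - 1])
            if r > 0:
                fixed.append(grid[r - 1][c])
            best_rc, best_cost = (r, c), None
            for rr in range(r, min(r + k, n_rows - 1) + 1):
                for cc in range(c if rr == r else 0, n_cols):
                    cost = sum(hamming(grid[rr][cc], q) for q in fixed)
                    if best_cost is None or cost < best_cost:
                        best_rc, best_cost = (rr, cc), cost
            # Swap the best candidate into the current site.
            rr, cc = best_rc
            grid[r][c], grid[rr][cc] = grid[rr][cc], grid[r][c]
    return grid


print(row_epitaxial([["AA", "CC"], ["AA", "CC"]], 1))
```

Swapping rather than removing keeps the grid full at all times, so the set of probes on the chip is unchanged by the pass.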
2-D Placement Algorithm Comparison: Border Conflict
2-D Placement Algorithm Comparison: Runtime
Problem Formulation (Asynchronous Case)
Asynchronous synthesis:
–Periodic nucleotide deposition sequence, e.g., (ACTG)^p
–Every probe grows asynchronously
Border length = Hamming distance between embedded probes
Asynchronous Array (3-D Placement) Design Problem:
–Minimize placement cost of embedded-probe Hamming graph H (vertices = probes, distance = Hamming between embedded probes)
–on 2-dimensional grid graph G2 (N x N array, edges between neighbors)
[Figure: probes of H mapped to sites of G2]
Lower Bound
Sum of distances to 4 closest neighbors minus weight of 4N heaviest arcs
–Distance between two probes of length p = 2(p - |Longest Common Subsequence|)
Non-tight bound: example with LB = 8 and best placement cost = 10
[Figure: probes AC, CT, TG, GA and their optimum placement under nucleotide deposition sequence S = ACTGA]
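The probe-to-probe distance used in this bound can be computed with a standard longest-common-subsequence DP. Two embedded probes of length p differ at every deposition step used by exactly one of them, and they can share at most |LCS| steps, which leaves 2(p - |LCS|) differing steps. The sketch below assumes that reading, and it assumes the slide's example consists of the probes AC, CT, TG and GA.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_distance(a, b):
    """Fewest conflicts between embeddings of a and b over any deposition sequence."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


cycle = ["AC", "CT", "TG", "GA"]
edges = [lcs_distance(cycle[i], cycle[(i + 1) % 4]) for i in range(4)]
print(edges, sum(edges))
```

Placing the four probes around a 2x2 array in this order gives four adjacent pairs at distance 2 each, consistent with LB = 8. The best actual cost is higher because a single deposition sequence cannot realize all four best pairwise alignments at once.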
Optimal Probe Alignment
Find best alignment of probe w.r.t. embedded neighbors
Dynamic Programming:
–Source-sink paths correspond to feasible embeddings
–O[(probe length) x (deposition sequence length)]
Can be extended to simultaneous alignment of two adjacent probes (2x1), with runtime increased by a factor of O(probe length)
[Figure: DP graph from Source to Sink]
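One way to write this DP is sketched below. It assumes each neighbor's embedding is given as a set of 0-based deposition steps, and that the cost of a step is the number of neighbors whose use of that step differs from the probe's. The function name and interface are assumptions.

```python
def best_embedding(probe, deposition, neighbor_steps):
    """Return (cost, steps) of an embedding with minimum conflicts to neighbors."""
    m, n, k = len(probe), len(deposition), len(neighbor_steps)
    inf = float("inf")
    # used[j]: how many neighbors synthesize a nucleotide at deposition step j
    used = [sum(j in s for s in neighbor_steps) for j in range(n)]
    # cost[i][j]: best cost over the first j steps with i nucleotides embedded
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    take = [[False] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0
    for j in range(1, n + 1):
        for i in range(m + 1):
            # Skip step j-1: conflict with every neighbor that uses it.
            cost[i][j] = cost[i][j - 1] + used[j - 1]
            if i > 0 and probe[i - 1] == deposition[j - 1]:
                # Use step j-1: conflict with every neighbor that skips it.
                use = cost[i - 1][j - 1] + (k - used[j - 1])
                if use < cost[i][j]:
                    cost[i][j], take[i][j] = use, True
    if cost[m][n] == inf:
        raise ValueError("probe cannot be embedded in this deposition sequence")
    steps, i = [], m
    for j in range(n, 0, -1):  # backtrack from sink to source
        if take[i][j]:
            steps.append(j - 1)
            i -= 1
    return cost[m][n], steps[::-1]


print(best_embedding("A", "ACTGA", [{4}]))
```

The table has (probe length + 1) x (deposition length + 1) entries with constant work per entry once `used` is precomputed, which matches the stated complexity.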
3-D Placement Flows
–Simultaneous placement and alignment: asynchronous epitaxial (slow and low quality)
–Synchronous placement followed by in-place probe alignment (analogous to the standard partitioning of VLSI flows): uses the previous DP for in-place probe alignment
–Synchronous placement followed by probe alignment with reshuffling (analogous to feedback loops in VLSI flows): asynchronous sliding window matching
Algorithms for In-Place Probe Alignment
Asynchronous re-embedding after 2-dim placement
–Greedy Algorithm: while there exist probes to re-embed with gain, optimally re-embed the probe with the largest gain
–Batched greedy: speed-up by avoiding recalculations
–Chessboard Algorithm: while there is gain, re-embed probes in green sites, then re-embed probes in red sites
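The chessboard idea can be sketched compactly if the optimal re-embedding of a single probe is done by brute force. Here enumeration of all feasible embeddings stands in for the alignment DP, which is only practical for toy inputs. Embeddings are represented as frozensets of deposition steps, and all names are hypothetical.

```python
from itertools import combinations


def embeddings(probe, deposition):
    """All feasible embeddings of `probe`, as sets of deposition steps."""
    return [frozenset(steps)
            for steps in combinations(range(len(deposition)), len(probe))
            if all(deposition[s] == b for s, b in zip(steps, probe))]


def chessboard_realign(grid, emb, deposition):
    """Alternately re-embed probes on the two chessboard colors while there is gain."""
    n_rows, n_cols = len(grid), len(grid[0])

    def local_cost(r, c, e):
        total = 0
        for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= rr < n_rows and 0 <= cc < n_cols:
                total += len(e ^ emb[rr][cc])  # steps used by exactly one probe
        return total

    improved = True
    while improved:
        improved = False
        for color in (0, 1):  # same-colored sites are never adjacent
            for r in range(n_rows):
                for c in range(n_cols):
                    if (r + c) % 2 != color:
                        continue
                    best = min(embeddings(grid[r][c], deposition),
                               key=lambda e: local_cost(r, c, e))
                    if local_cost(r, c, best) < local_cost(r, c, emb[r][c]):
                        emb[r][c] = best
                        improved = True
    return emb


grid = [["A", "A"]]
emb = [[frozenset({0}), frozenset({4})]]
print(chessboard_realign(grid, emb, "ACTGA"))
```

Each accepted change strictly lowers the total conflict count, so the loop terminates. Probes on one color can be re-embedded independently because their neighbors all have the other color.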
Comparison of In-Place Probe Alignments
[Table: one row per chip size; columns LB, TSP+1Thr, and %LB and CPU for each of Greedy, Chessboard, and 2x1 Chessboard]
Post-placement LB = sum of distances to adjacent probes
–Distance between two probes of length p = 2(p - |LCS|)
–Useful for assessing quality of algorithms that change probe embeddings but do not change probe placement
3-D vs. 2-D Placement Results
[Table: one row per chip size; Cost and CPU reported for TSP+1Thr, TSP+1Thr+Chessboard, Epitaxial+Chessboard, SyncSWM+Chessboard, and AsyncSWM]
3-D Placement Algorithm Comparison: Border Conflict
3-D Placement Algorithm Comparison: Runtime
Practical Extensions
Distance-dependent border conflict weights
–Take into account conflicts between 2- and 3-hop neighbors rather than only immediate neighbors
Position-dependent border conflict weights
–In the alignment DP for two sequences, take into account the importance of conflicts in the middle of probes: the alignment cost has weights on conflicts that depend on conflict position
Polymorphic probes
–Chip contains SNPs, e.g., pairs of probes that differ in a single position; they should be placed together, and the alignment DP should align them simultaneously
Alignment DP for 2-SNP’s Optimal Embedding of A{C,T}T
Enhanced DNA Array Design Flow
[Figure: flow with Probe Selection; Mask Design: Placement & Embedding; Deposition Mask Design; Test/Control Structure Design; Conflict Map; Probe Pools; Design Rules & Parameters]
Summary
Contributions:
–Epitaxial placement reduces border conflicts by an extra 10% over the previously best known method
–Asynchronous placement problem formulation
–Post-placement improvement by extra %
–Lower bounds
–Scalable placements (1000x1000 in 20 min)
Ongoing work:
–Comparison on industrial benchmarks
–Experiments with algorithms for extended formulations (SNPs, distance-dependent weights, etc.)
Future directions:
–Design flow enhancements
–Nucleotide deposition sequence design
–Partitioning and integration for manufacturing cost reduction
Thank you!