APBC Improved Algorithms for Multiplex PCR Primer Set Selection with Amplification Length Constraints Kishori M. Konwar Ion I. Mandoiu Alexander C. Russell Alexander A. Shvartsman CS&E Dept., Univ. of Connecticut
APBC Combinatorial Optimization in Bioinformatics Fast growing number of applications –Sequence alignment –DNA sequencing –Haplotype inference –Pathogen identification –… –High-throughput assay design Microarray probe selection Microarray quality control Universal tag arrays … This talk: Multiplex PCR primer set selection
APBC Outline Background and problem formulation “Potential function” greedy algorithm Approximation guarantee Experimental results Conclusions
APBC The Polymerase Chain Reaction [Figure: PCR schematic; labels: target sequence, polymerase, primer 1, primer 2, repeat cycles]
APBC Primer Pair Selection Problem Given: genomic sequence around the amplification locus; primer length k; amplification upper bound L. Find: forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperature, secondary structure, mis-priming, etc.) [Figure: forward and reverse primers flanking the amplification locus; labels: L, L, 3', 5']
APBC PCR for SNP Genotyping Thousands of SNPs to be genotyped using hybridization methods (e.g., SBE) Selective PCR amplification needed to improve accuracy of detection steps –whole-genome amplification not appropriate Simultaneous amplification OK → Multiplex PCR
APBC Multiplex PCR How it works –Multiple DNA fragments amplified simultaneously –Each amplified fragment still defined by two primers –A primer may participate in amplification of multiple targets Primer set selection –Currently done by time-consuming trial and error –An important objective is to minimize the number of primers: reduced assay cost; higher effective concentration of primers → higher amplification efficiency; reduced unintended amplification
APBC Primer Set Selection Problem Given: Genomic sequences around n amplification loci Primer length k Amplification upper bound L Find: Minimum size set S of primers of length k such that, for each amplification locus, there are two primers in S hybridizing with the forward and reverse genomic sequences within a distance of L of each other
APBC Previous Work on Primer Selection Well-studied problem: [Pearson et al. 96], [Linhart & Shamir’02], [Souvenir et al.’03], etc. Almost all problem formulations decouple selection of forward and reverse primers –To enforce bound of L on amplification length, select only primers that hybridize within L/2 bases of desired target –In worst case, this method can increase the number of primers by a factor of O(n) compared to the optimum [Pearson et al. 96] Greedy set cover algorithm gives O(ln n) approximation factor for the “decoupled” formulation
APBC Previous Work (2) [Fernandes&Skiena’02] study primer set selection with uniqueness constraints Minimum Multi-Colored Subgraph Problem: –Vertices correspond to candidate primers –Edge colored by color i between u and v iff corresponding primers hybridize within a distance of L of each other around i-th amplification locus –Goal is to find minimum size set of vertices inducing edges of all colors
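The feasibility condition of the minimum multi-colored subgraph problem can be illustrated with a small Python sketch. The edge representation `(u, v, color)` and the function names are assumptions made for illustration; they do not come from the talk or from [Fernandes&Skiena’02].

```python
from typing import Iterable, Set, Tuple

# Hypothetical edge representation: (u, v, color), where u and v identify
# candidate primers and color is the index of the amplification locus
# around which the two primers hybridize within distance L.
Edge = Tuple[str, str, int]


def induced_colors(selected: Set[str], edges: Iterable[Edge]) -> Set[int]:
    """Return the colors of all edges with both endpoints in `selected`."""
    return {color for u, v, color in edges if u in selected and v in selected}


def is_feasible(selected: Set[str], edges: Iterable[Edge], num_colors: int) -> bool:
    """A vertex set is feasible when it induces an edge of every color."""
    return set(range(num_colors)) <= induced_colors(selected, edges)


if __name__ == "__main__":
    edges = [("a", "b", 0), ("b", "c", 1), ("c", "d", 1)]
    print(is_feasible({"a", "b", "c"}, edges, 2))
    print(is_feasible({"a", "b"}, edges, 2))
```

The optimization problem asks for the smallest `selected` for which `is_feasible` holds; the sketch only checks a given candidate set.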
APBC The Set Cover Problem Given: - Universal set $U$ with $n$ elements - Family of sets $(S_x,\ x \in X)$ covering all elements of $U$ Find: - Minimum size subset $X' \subseteq X$ s.t. $(S_x,\ x \in X')$ covers all elements of $U$
APBC Selection w/ Length Constraints “Simultaneous set covering” problem: - Ground set partitioned into $n$ disjoint sets $S_i$ (one for each target), each with $2L$ elements - Goal is to select the minimum number of sets (i.e., primers) covering at least 1/2 of the elements in each partition [Figure: SNP i flanked by two segments labeled L]
APBC Greedy Setcover Algorithm Classical result (Johnson’74, Lovasz’75, Chvatal’79): the greedy setcover algorithm has an approximation factor of $H(n) = 1 + 1/2 + 1/3 + \dots + 1/n < 1 + \ln n$ - The approximation factor is tight - Cannot be approximated within a factor of $(1-\varepsilon)\ln n$ unless $\mathrm{NP} \subseteq \mathrm{DTIME}(n^{O(\log\log n)})$ Greedy Algorithm: - Repeatedly pick the set with most uncovered elements
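The greedy rule "repeatedly pick the set with most uncovered elements" can be sketched in a few lines of Python. The function name and the dictionary-based input are illustrative assumptions, not part of the talk.

```python
from typing import Dict, Hashable, List, Set


def greedy_set_cover(universe: Set[int], family: Dict[Hashable, Set[int]]) -> List[Hashable]:
    """Pick, at each step, the set covering the most still-uncovered elements."""
    uncovered = set(universe)
    chosen: List[Hashable] = []
    while uncovered:
        # Ties are broken by dictionary order (an arbitrary choice for this sketch).
        best = max(family, key=lambda x: len(family[x] & uncovered))
        gain = family[best] & uncovered
        if not gain:
            raise ValueError("family does not cover the universe")
        chosen.append(best)
        uncovered -= gain
    return chosen


if __name__ == "__main__":
    family = {"A": {1, 2, 3}, "B": {2, 4}, "C": {3, 4}, "D": {4, 5}}
    print(greedy_set_cover({1, 2, 3, 4, 5}, family))
```

In this toy instance the first pick covers three elements and the second pick covers the remaining two.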
APBC Potential Functions Set cover: $\Phi$ = #uncovered elements; initially $\Phi = n$; for feasible solutions $\Phi = 0$. Primer selection with length constraints: $\Phi$ = minimum number of elements that must still be covered $= \sum_i \max\{0,\ L - \#\text{covered elements in } S_i\}$; initially $\Phi = nL$; for feasible solutions $\Phi = 0$
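A minimal Python sketch of the length-constrained potential, assuming the covered elements of each target are tracked as a set (the names `potential` and `covered_per_target` are illustrative, not from the talk). For each target it counts how many of the required L elements are still missing, so it starts at nL when nothing is covered and reaches 0 exactly when every target has at least L covered elements.

```python
from typing import List, Set


def potential(covered_per_target: List[Set[int]], L: int) -> int:
    """Sum over targets of the number of elements still needed to reach L covered."""
    return sum(max(0, L - len(covered)) for covered in covered_per_target)


if __name__ == "__main__":
    L = 3
    print(potential([set(), set()], L))               # nothing covered yet
    print(potential([{0, 1, 2}, {0}], L))             # second target still short
    print(potential([{0, 1, 2, 3}, {0, 1, 2}], L))    # both targets satisfied
```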
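APBC General setting Potential function $\Phi(X') \ge 0$ - $\Phi(\{\}) = \Phi_{\max}$ - $\Phi(X') = 0$ for all feasible solutions - $X'' \subseteq X' \Rightarrow \Phi(X'') \ge \Phi(X')$ - If $\Phi(X') > 0$, then there exists $x$ s.t. $\Phi(X'+x) < \Phi(X')$ - $X'' \subseteq X' \Rightarrow \Delta(x,X'') \ge \Delta(x,X')$ for every $x$, where $\Delta(x,X') := \Phi(X') - \Phi(X'+x)$ Objective: find minimum size set $X'$ with $\Phi(X') = 0$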
APBC Generic Greedy Algorithm Algorithm: $X' \leftarrow \{\}$; while $\Phi(X') > 0$: find $x$ with maximum $\Delta(x,X')$; $X' \leftarrow X' + x$. Theorem: The generic greedy algorithm has an approximation factor of $1 + \ln \Delta_{\max}$. Corollary: $1 + \ln(nL)$ approximation for PCR primer selection
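The generic greedy loop can be sketched in Python for any potential function supplied as a callable. This is an illustrative sketch, not the authors' implementation: the names `generic_greedy`, `candidates`, and `phi` are assumptions, and the potential is re-evaluated from scratch for every candidate, which is simple but not efficient.

```python
from typing import Callable, FrozenSet, Hashable, List, Sequence


def generic_greedy(candidates: Sequence[Hashable],
                   phi: Callable[[FrozenSet[Hashable]], int]) -> List[Hashable]:
    """Repeatedly add the candidate with the largest decrease in potential."""
    chosen: List[Hashable] = []
    current = phi(frozenset())
    while current > 0:
        base = frozenset(chosen)
        best, best_val = None, current
        for x in candidates:
            if x in base:
                continue
            val = phi(base | {x})
            if val < best_val:  # strictly larger decrease than the best so far
                best, best_val = x, val
        if best is None:
            raise ValueError("no candidate decreases the potential")
        chosen.append(best)
        current = best_val
    return chosen


if __name__ == "__main__":
    # Set cover as a special case: the potential is the number of uncovered elements.
    universe = {1, 2, 3, 4, 5}
    family = {"A": {1, 2, 3}, "B": {2, 4}, "C": {3, 4}, "D": {4, 5}}

    def uncovered_count(selection: FrozenSet[Hashable]) -> int:
        covered = set()
        for x in selection:
            covered |= family[x]
        return len(universe - covered)

    print(generic_greedy(list(family), uncovered_count))
```

Plugging in the number of uncovered elements as the potential recovers greedy set cover; plugging in the length-constrained potential gives a greedy of the kind the talk calls G-POT.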
APBC Proof Sketch (1) Let $x_1, x_2, \dots, x_g$ be the elements selected by greedy, in the order in which they are chosen. Let $x^*_1, x^*_2, \dots, x^*_k$ be the elements of an optimum solution. Charging scheme: $x_i$ charges to $x^*_j$ a cost of [formula missing], where $\Delta_{ij} = \Delta(x_i, \{x_1, \dots, x_{i-1}\} \cup \{x^*_1, \dots, x^*_j\})$. Fact 1: Each $x^*_j$ gets charged a total cost of at most $1 + \ln \Delta_{\max}$
APBC Proof Sketch (2) Fact 2: Each $x_i$ charges at least 1 unit of cost
APBC Experimental Setting Datasets extracted from NCBI databases, L=1000 Dell PowerEdge 2.8GHz Xeon Compared algorithms –G-FIX: greedy primer cover algorithm [Pearson et al.] –MIPS-PT: iterative beam-search heuristic [Souvenir et al.] Restrict primers to L/2 bases around amplification locus –G-VAR: naïve modification of G-FIX First selected primer can be up to L bases away Opposite sequence truncated after selecting first primer –G-POT: potential function driven greedy algorithm
APBC Experimental Results, NCBI tests [Table: for each number of targets and primer length k, #Primers and CPU sec are reported for G-FIX (Pearson et al.), G-VAR (G-FIX with dynamic truncation), MIPS-PT (Souvenir et al.), and G-POT (potential-function greedy); the table values are not preserved]
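APBC [Figure: #primers as a percentage of 2n, plotted against n, l=8]
APBC [Figure: #primers as a percentage of 2n, plotted against n, l=10]
APBC [Figure: #primers as a percentage of 2n, plotted against n, l=12]
APBC [Figure: CPU seconds, plotted against n, l=10]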
APBC Conclusions Numerous combinatorial optimization problems arising in the area of high-throughput assay design Theoretical insights such as approximation results can lead to significant practical improvements Choosing the proper problem model is critical to solution efficiency
APBC Ongoing Work & Open Problems Degenerate primers Accurate hybridization model (melting temperature, secondary structure, cross hybridization, …) –In-silico MP-PCR simulator Partition into multiple multiplexed PCR reactions (Aumann et al. WABI’03)
APBC Acknowledgments Financial support from UCONN’s Research Foundation
APBC Integer Program Formulation 0/1 variable $x_u$ for every vertex $u$; 0/1 variable $y_e$ for every edge $e$
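The objective and constraints of the integer program are not preserved in this transcript. One natural formulation for the minimum multi-colored subgraph problem, given here only as an assumption and not necessarily the one on the original slide, writes $E_i$ for the set of edges of color $i$ (a symbol introduced for this sketch):

```latex
\begin{align*}
\text{minimize}   \quad & \sum_{u} x_u \\
\text{subject to} \quad & y_e \le x_u, \quad y_e \le x_v
                          && \text{for every edge } e = (u,v), \\
                        & \sum_{e \in E_i} y_e \ge 1
                          && \text{for every color } i, \\
                        & x_u,\, y_e \in \{0,1\}.
\end{align*}
```

Under this reading, an edge can be counted only if both of its endpoints are selected, and every color needs at least one counted edge.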
APBC LP-Rounding Algorithm (1) Solve the linear programming relaxation (2) Select node $u$ with probability $x_u$ (3) Repeat step 2 $O(\ln n)$ times and return the selected nodes. Theorem [Konwar et al.’04]: The LP-rounding algorithm finds a feasible solution at most $O(m^{1/2} \ln n)$ times larger than the optimum, where $m$ is the maximum color class size and $n$ is the number of nodes. For primer selection, $m \le L^2$ ⇒ approximation factor is $O(L \ln n)$. Better approximation? - Unlikely for the minimum multi-colored subgraph problem
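Steps (2) and (3) of the rounding procedure can be sketched in Python, assuming a fractional solution `x` of the LP relaxation has already been computed by some LP solver (solving the LP is not shown). The function names, the constant `c` in the number of rounds, and the `seed` parameter are hypothetical choices for illustration; the talk states only that step 2 is repeated O(ln n) times.

```python
import math
import random
from typing import Dict, Hashable, Optional, Set


def num_rounds(n: int, c: float = 2.0) -> int:
    """Number of repetitions, c * ln(n) rounded up; c is a hypothetical constant."""
    return max(1, math.ceil(c * math.log(n)))


def round_lp_solution(x: Dict[Hashable, float], rounds: int,
                      seed: Optional[int] = None) -> Set[Hashable]:
    """In each round, select every node u independently with probability x[u]."""
    rng = random.Random(seed)
    selected: Set[Hashable] = set()
    for _ in range(rounds):
        for u, prob in x.items():
            if rng.random() < prob:
                selected.add(u)
    return selected


if __name__ == "__main__":
    # Hypothetical fractional LP values for four candidate primers.
    x = {"p1": 1.0, "p2": 0.5, "p3": 0.0, "p4": 0.25}
    print(round_lp_solution(x, num_rounds(len(x)), seed=0))
```

The returned set is the union of the nodes picked over all rounds; checking that it induces an edge of every color is a separate step not shown here.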