Ion Mandoiu Computer Science & Engineering Department


Greedy Approximation Algorithms for Covering Problems in Computational Biology Ion Mandoiu Computer Science & Engineering Department University of Connecticut

Why Approximation Algorithms? Most practical optimization problems are NP-hard Approximation algorithms offer the next best thing to an efficient exact algorithm Polynomial time Solutions guaranteed to be “close” to optimum α-approximation algorithm: solution cost within a multiplicative factor of α of optimum cost Practical relevance: insights needed to establish approximation guarantee often lead to fast, highly effective practical implementations

Why Computational Biology? Exploding multidisciplinary field at the intersection of computer science, biology, discrete mathematics, statistics, optimization, chemistry, physics, … Source of a fast growing number of combinatorial optimization applications: TSP and Euler paths in DNA sequencing Dynamic Programming in sequence alignment Integer Programming in Haplotype inference … This talk: two “covering” problems in computational biology (primer set selection and string barcoding)

Overview Potential function greedy algorithm - The set cover problem and the greedy algorithm - Potential function generalization Primer Set Selection for Multiplex PCR The String Barcoding Problem Conclusions

The Set Cover Problem Given: Universal set U with n elements; Family of sets (Sx, x ∈ X) covering all elements of U Find: Minimum size subset X’ of X s.t. (Sx, x ∈ X’) covers all elements of U Greedy Algorithm: Start with empty X’, and repeatedly add x such that Sx contains the most uncovered elements until U is covered
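The greedy rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation; the function name `greedy_set_cover` and the toy instance are assumptions made for the example.

```python
def greedy_set_cover(universe, family):
    """Return keys x whose sets family[x] together cover `universe`.

    `family` maps each x to a Python set S_x (hypothetical input format).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set that contains the most still-uncovered elements.
        best = max(family, key=lambda x: len(family[x] & uncovered))
        if not family[best] & uncovered:
            raise ValueError("family does not cover the universe")
        chosen.append(best)
        uncovered -= family[best]
    return chosen


# Toy instance (made up for illustration).
U = range(1, 7)
S = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}, "d": {5, 6}}
cover = greedy_set_cover(U, S)
print(cover)
```

On this instance the first pick is the four-element set and the second pick finishes the cover.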

Approximation Guarantee Classical result (Johnson’74, Lovasz’75, Chvatal’79): the greedy setcover algorithm has an approximation factor of H(n) = 1 + 1/2 + 1/3 + … + 1/n ≤ 1 + ln(n) The approximation factor is tight Cannot be approximated within a factor of (1-ε)ln(n) unless NP ⊆ DTIME(n^(O(log log n)))

General setting “Potential function” Φ(X’) ≥ 0; Φ({}) = Φmax; Φ(X’) = 0 for all feasible solutions; X’’ ⊆ X’ ⇒ Φ(X’’) ≥ Φ(X’); If Φ(X’) > 0, then there exists x s.t. Φ(X’+x) < Φ(X’); X’’ ⊆ X’ ⇒ ∆(x,X’’) ≥ ∆(x,X’) for every x, where ∆(x,X’) := Φ(X’) - Φ(X’+x) Problem: find minimum size set X’ with Φ(X’) = 0

Generic Greedy Algorithm X’ ← {} While Φ(X’) > 0: Find x with maximum ∆(x,X’); X’ ← X’ + x Theorem (Konwar et al.’05) The generic greedy algorithm has an approximation factor of 1+ln ∆max
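A minimal Python sketch of the generic greedy loop follows. It assumes the potential is supplied as a function `phi` on frozensets of candidates; the names `potential_greedy` and `uncovered_count` are hypothetical, and set cover is used only as a familiar instance where the potential is the number of uncovered elements.

```python
def potential_greedy(candidates, phi):
    """Greedy for: find a small set X' with phi(X') == 0.

    `phi` is assumed to be non-negative, zero exactly on feasible sets,
    and to have non-increasing gains, as required by the general setting.
    """
    selected = frozenset()
    order = []
    while phi(selected) > 0:
        remaining = [x for x in candidates if x not in selected]
        if not remaining:
            raise ValueError("no feasible solution among the candidates")
        # Delta(x, X') = phi(X') - phi(X' + x)
        current = phi(selected)
        best = max(remaining, key=lambda x: current - phi(selected | {x}))
        if current - phi(selected | {best}) <= 0:
            raise ValueError("no candidate decreases the potential")
        order.append(best)
        selected = selected | {best}
    return order


# Set cover as an instance: potential = number of uncovered elements.
S = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}, "d": {5, 6}}
universe = set().union(*S.values())

def uncovered_count(chosen):
    covered = set().union(*(S[x] for x in chosen))
    return len(universe - covered)

picked = potential_greedy(S, uncovered_count)
print(picked)
```

Re-evaluating `phi` from scratch for every candidate is the simplest choice, not the fastest; it keeps the sketch close to the pseudocode.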

Proof Idea Let x1, x2,…,xg be the elements selected by greedy, in the order in which they are chosen, and x*1, x*2,…,x*k be the elements of an optimum solution. Charging scheme: xi charges to x*j a cost of […], where δij = ∆(xi, {x1,…, xi-1} ∪ {x*1,…,x*j}) Fact 1: Each x*j gets charged a total cost of at most 1+ln ∆max

Proof of claim 2 Fact 2: Each xi charges at least 1 unit of cost

Overview Potential function greedy algorithm Primer set selection for multiplex PCR Motivation and problem formulation Greedy applied to primer set selection Experimental results The String Barcoding Problem Conclusions

DNA Structure Four nucleotide types: A,C,T,G Normally double stranded A’s paired with T’s C’s paired with G’s

The Polymerase Chain Reaction [Figure: Primer 1 and Primer 2, together with polymerase, amplify the target sequence; repeat 20-30 cycles]

Primer Pair Selection Problem [Figure: forward primer and reverse primer flanking the amplification locus, with 3' and 5' ends marked] Given: Genomic sequence around amplification locus; Primer length k; Amplification upper bound L Find: Forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperature, secondary structure, mis-priming, etc.)

Multiplex PCR Multiplex PCR (MP-PCR): Multiple DNA fragments amplified simultaneously; Each amplified fragment still defined by two primers; A primer may participate in amplification of multiple targets Primer set selection: Typically done by time-consuming trial and error; An important objective is to minimize number of primers: Reduced assay cost; Higher effective concentration of primers → higher amplification efficiency; Reduced unintended amplification

Primer Set Selection Problem Given: Genomic sequences around n amplification loci Primer length k Amplification upper bound L Find: Minimum size set S of primers of length k such that, for each amplification locus, there are two primers in S hybridizing with the forward and reverse genomic sequences within a distance of L of each other

Applications Single Nucleotide Polymorphism (SNP) genotyping: Up to thousands of SNPs genotyped simultaneously; Selective PCR amplification required for improved accuracy Spotted microarray synthesis [Fernandes&Skiena’02]: Primers can be used multiple times; For each target, need a pair of primers amplifying that target and only that target (amplification uniqueness constraint); Can still reduce #primers from 2n to O(n^(1/2))

Previous Work on Primer Selection Well-studied problem: [Pearson et al. 96], [Linhart & Shamir’02], [Souvenir et al.’03], etc. Almost all problem formulations decouple selection of forward and reverse primers To enforce bound of L on amplification length, select only primers that hybridize within L/2 bases of desired target In worst case, this method can increase the number of primers by a factor of O(n) compared to the optimum [Pearson et al. 96] Greedy set cover algorithm gives O(ln n) approximation factor for the “decoupled” formulation

Previous Work (contd.) [Fernandes&Skiena’02] study primer set selection with uniqueness constraints Minimum Multi-Colored Subgraph Problem: Vertices correspond to candidate primers; Edge colored by color i between u and v iff corresponding primers hybridize within a distance of L of each other around i-th amplification locus; Goal is to find minimum size set of vertices inducing edges of all colors Can capture amplification length constraints too

Integer Program Formulation 0/1 variable xu for every vertex 0/1 variable ye for every edge e
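The program itself did not survive in the transcript. A standard formulation consistent with the two variable families named on this slide and with the minimum multi-colored subgraph problem is sketched below; it is an assumption, not a copy of the original slide, and the symbols V, E and E_i (the set of edges of color i) are introduced here for illustration.

```latex
\begin{align*}
\min\ & \sum_{u \in V} x_u \\
\text{s.t.}\ & y_e \le x_u, \quad y_e \le x_v && \text{for every edge } e = (u,v) \in E \\
& \sum_{e \in E_i} y_e \ge 1 && \text{for every color } i = 1, \dots, n \\
& x_u,\ y_e \in \{0,1\}
\end{align*}
```

An edge can be counted only if both endpoints are selected, and every color must have at least one counted edge.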

LP-Rounding Algorithm (1) Solve linear programming relaxation (2) Select node u with probability xu (3) Repeat step 2 O(ln(n)) times and return selected nodes Theorem [Konwar et al.’04]: The LP-rounding algorithm finds a feasible solution at most O(m^(1/2) ln n) times larger than the optimum, where m is the maximum color class size, and n is the number of nodes For primer selection, m ≤ L^2 ⇒ approximation factor is O(L ln n) Better approximation? Unlikely for minimum multi-colored subgraph problem

Selection w/o Uniqueness Constraints Can be seen as a “simultaneous set covering” problem: - The ground set is partitioned into n disjoint sets Si (one for each target), each with 2L elements - The goal is to select a minimum number of sets (i.e., primers) that cover at least half of the elements in each partition [Figure: SNP i with L bases on each side]

Greedy Algorithm Potential function Φ = minimum number of elements that must still be covered = Σi max{0, L - #covered elements in Si} Initially, Φ = nL For feasible solutions, Φ = 0 ∆(Φ) ≤ nL (much smaller in practice) Theorem [Konwar et al.’05]: The number of primers selected by the greedy algorithm is at most a factor of 1+ln(nL) larger than the optimum
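The potential can be checked against the two boundary cases stated on the slide. The sketch below reads each summand as L minus the number of elements already covered in Si, which is the reading that gives Φ = nL initially and Φ = 0 once every Si is at least half covered; the function name `primer_potential` and the per-target counts are assumptions for the example.

```python
def primer_potential(covered_counts, L):
    """Sum over targets of how many more elements still need covering.

    covered_counts[i] is the number of covered elements in S_i (out of 2L).
    """
    return sum(max(0, L - c) for c in covered_counts)


n, L = 3, 1000
initial = primer_potential([0] * n, L)            # nothing covered yet
done = primer_potential([L, L + 50, 2 * L], L)    # every target at least half covered
print(initial, done)
```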

Experimental Setting Datasets extracted from NCBI databases, L=1000 Dell PowerEdge 2.8GHz Xeon Compared algorithms G-FIX: greedy primer cover algorithm [Pearson et al.] MIPS-PT: iterative beam-search heuristic [Souvenir et al.] Restrict primers to L/2 bases around amplification locus G-VAR: naïve modification of G-FIX First selected primer can be up to L bases away Opposite sequence truncated after selecting first primer G-POT: potential function driven greedy algorithm

Experimental Results, NCBI tests # Targets k G-FIX (Pearson et al.) G-VAR (G-FIX with dynamic truncation) MIPS-PT (Souvenir et al.) G-POT (Potential-function greedy) #Primers CPU sec 20 8 7 0.04 0.08 10 6 0.10 9 0.03 13 15 12 14 18 26 0.11 50 0.13 0.30 21 48 0.32 23 0.22 24 0.36 30 150 0.33 31 0.14 32 41 246 29 0.28 100 17 0.49 0.89 226 0.58 37 0.37 0.72 844 0.75 53 0.59 0.84 75 2601 42 0.61

#primers, as percentage of 2n (l=8)

#primers, as percentage of 2n (l=10)

#primers, as percentage of 2n (l=12)

CPU Seconds (l=10) as a function of n

Overview Potential function greedy algorithm Primer Set Selection for Multiplex PCR The String Barcoding Problem - Problem Formulation - Integer programming and greedy algorithms - Experimental results Conclusions

Motivation Rapid pathogen detection Given Pathogen with unknown identity Database of known pathogens Problem Identify unknown pathogen quickly Ideal solution: determine DNA sequence of unknown pathogen

Real World Not possible to quickly sequence an unknown pathogen Only have sequence for pathogens in database Can quickly test for presence of short substrings in unknown pathogen (substring tests) using hybridization String barcoding [Borneman et al.’01, Rash&Gusfield’02] Use substring tests that uniquely identify each pathogen in the database

String Barcoding Problem Given: Genomic sequences g1,…, gn Find: Minimum number of distinguisher strings t1,…,tk Such that: For every gi ≠ gj, there exists a string tl which is substring of gi or gj, but not of both At least log2 n distinguishers needed Fingerprints → n distinguishers

Example Given sequences: 1. cagtgc 2. cagttc 3. catgga Feasible set of distinguishers: {tg, atgga}
         tg   atgga
cagtgc   1    0
cagttc   0    0
catgga   1    1
Row vectors: unique barcodes for each pathogen
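The barcodes in this example can be recomputed directly from the definition. The sketch below is illustrative only; the helper names `barcodes` and `is_feasible` are not from the talk.

```python
from itertools import combinations


def barcodes(sequences, distinguishers):
    """Row i is the 0/1 vector of which distinguishers occur in sequence i."""
    return [tuple(int(t in g) for t in distinguishers) for g in sequences]


def is_feasible(sequences, distinguishers):
    """A distinguisher set is feasible when all barcodes are distinct."""
    rows = barcodes(sequences, distinguishers)
    return all(a != b for a, b in combinations(rows, 2))


seqs = ["cagtgc", "cagttc", "catgga"]
codes = barcodes(seqs, ["tg", "atgga"])
print(codes)
```

Dropping either distinguisher makes two barcodes collide, so neither string alone is feasible for these three sequences.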

Computational Complexity [Berman et al.’04] Cannot be approximated within a factor of (1-ε)ln(n) unless NP ⊆ DTIME(n^(O(log log n)))

Setcover Greedy Algorithm Distinguisher selection as setcover problem: Elements to be covered are the pairs of sequences; Each candidate distinguisher defines a set of pairs that it separates Another view: covering all edges of a complete graph with n vertices by the minimum number of given cuts For n sequences, largest set can have O(n^2) elements ⇒ The setcover greedy guarantees ln(n^2) = 2 ln n approximation
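A small sketch of this reduction: each candidate is turned into the set of sequence pairs it separates, and the usual set cover greedy runs over those sets. The candidate list and the function names are assumptions chosen for the example; real candidate generation is discussed later in the talk.

```python
from itertools import combinations


def separated_pairs(candidate, sequences):
    """Pairs (i, j) where the candidate occurs in exactly one of the two."""
    hits = [candidate in g for g in sequences]
    return {(i, j) for i, j in combinations(range(len(sequences)), 2)
            if hits[i] != hits[j]}


def greedy_distinguishers(sequences, candidates):
    todo = set(combinations(range(len(sequences)), 2))  # pairs still unseparated
    cuts = {t: separated_pairs(t, sequences) for t in candidates}
    chosen = []
    while todo:
        best = max(candidates, key=lambda t: len(cuts[t] & todo))
        if not cuts[best] & todo:
            raise ValueError("candidates cannot separate all pairs")
        chosen.append(best)
        todo -= cuts[best]
    return chosen


seqs = ["cagtgc", "cagttc", "catgga"]
picked = greedy_distinguishers(seqs, ["tg", "atgga", "ca", "gt"])
print(picked)
```

The candidate "ca" occurs in all three sequences and therefore separates nothing, which is why such candidates can be discarded up front.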

Integer Program Formulation 0/1 variable for each candidate distinguisher: 1 = candidate is selected; 0 = candidate is not selected Constraints: For each pair of sequences, at least one candidate separating them is selected Objective Function: Minimize #selected candidates

Practical Issues Quadratic # of constraints, huge # of variables Genome sizes range from thousands of bases for phage and viruses to millions for bacteria to billions for higher organisms Many variables can be removed: Candidates that appear in all sequences Sufficient to keep a single candidate among those that appear in the same set of sequences How to efficiently remove useless variables? Rash&Gusfield use suffix trees

Suffix Tree Example Strings: 1. cagtgc 2. cagttc 3. catgga v1 - {1,2,3} v2 - {1,2,3} v3 - {3} v4 - {1} v5 - {3} v6 - {1,2} v7 - {2} v8 - {1} v9 - {1,2,3} v10 - {1,2,3} v11 - {1,2} v12 - {1} v13 - {2} v14 - {3} v15 - {1,2,3} v16 - {2} v17 - {2} v18 - {1,3} v19 - {1} v20 - {3} v21 - {1,2,3} v22 - {3} v23 - {2} v24 - {1,2} v25 - {1}

Integer Program
Minimize
  V18 + V22 + V11 + V17 + V8        #objective function
Such that
  V18 + V17 + V8 >= 1               #constraint to cover pair 1,2
  V22 + V11 + V8 >= 1               #constraint to cover pair 1,3
  V18 + V22 + V11 + V17 >= 1        #constraint to cover pair 2,3
Binaries                            #all variables are 0/1
  V18 V22 V11 V17 V8
End
Solution:
         tg (V18)   atgga (V22)
cagtgc   1          0
cagttc   0          0
catgga   1          1
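For an instance this small, the integer program can be checked by exhaustive search instead of a solver. The sketch below rebuilds the three constraints from the occurrence sets listed in the suffix tree example (v18, v22, v11, v17, v8) and looks for a smallest selection; the function names are hypothetical.

```python
from itertools import combinations

# Which of the strings 1, 2, 3 contain each candidate (from the suffix tree example).
occurs = {"V18": {1, 3}, "V22": {3}, "V11": {1, 2}, "V17": {2}, "V8": {1}}
pairs = [(1, 2), (1, 3), (2, 3)]


def separates(v, pair):
    """A candidate separates a pair if it occurs in exactly one of the two strings."""
    a, b = pair
    return (a in occurs[v]) != (b in occurs[v])


def min_cover():
    """Try selections in order of increasing size; return the first feasible one."""
    names = list(occurs)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if all(any(separates(v, p) for v in subset) for p in pairs):
                return subset
    return None


best = min_cover()
print(best)
```

No single candidate separates all three pairs, so the optimum uses two variables, and the first pair found corresponds to the distinguishers tg and atgga.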

Limitations of Integer Program Method Works only for small instances 50-150 sequences Average length ~1000 characters Over 4 hours needed to come within 20% of optimum! Scalable Heuristics?

Distinguisher Induced Partition Key idea [Berman et al. 04]: Keep track of the partition defined by distinguishers selected so far [Figure: sequences 1, 2, 3, …, n-1, n split successively by Distinguisher 1 and Distinguisher 2]

Information Content Heuristic Φ = partition entropy = log2(#permutations compatible with current partition) Initial partition entropy = log2(n!) ≈ n log2 n For feasible distinguisher sets, partition entropy = 0 ∆(Φ) ≤ n: log2(n!) - log2(k!(n-k)!) < log2(2^n) = n Information content heuristic (ICH) = greedy driven by partition entropy Theorem [Berman et al.’04] ICH has an approximation factor of 1+ln(n)
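The quantities on this slide are easy to evaluate numerically. The sketch below assumes that the number of permutations compatible with a partition is the product of the factorials of its class sizes, which matches the stated initial value log2(n!) and the value 0 when every class is a singleton; the function name `partition_entropy` is hypothetical.

```python
from math import factorial, log2


def partition_entropy(class_sizes):
    """log2 of the product of s! over the class sizes s of the partition."""
    return sum(log2(factorial(s)) for s in class_sizes)


n, k = 8, 3
initial = partition_entropy([n])              # one class holding all n sequences
after_split = partition_entropy([k, n - k])   # a distinguisher matching k of them
gain = initial - after_split                  # equals log2 of "n choose k"
final = partition_entropy([1] * n)            # every sequence has its own barcode
print(round(initial, 3), round(gain, 3), final)
```

The gain of the first split is log2 of a binomial coefficient, which is why it stays below n.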

ICH Limitations Real genomic data has degenerate nucleotides: Ambiguous sequencing; Single nucleotide polymorphisms For sequences with degenerate nucleotides there are three possibilities for distinguisher hybridization: Sure hybridization; Sure mismatch; Uncertain hybridization ⇒ No partition to work with!

Practical Implementation ICH and setcover greedy give nearly identical results on data w/o degenerate bases Setcover greedy can also be extended to handle: degenerate bases in the sequences; redundancy requirements (each pair of sequences must be separated r times) Two main steps for both algorithms: Candidate generation; Greedy selection

Candidate Generation Can be done using suffix trees We use a simpler yet efficient incremental approach Candidates that match all or only one sequence are removed from consideration Solution quality is similar even when candidates are generated from a single sequence Equivalent to considering only distinguisher sets that assign a barcode of (1,1,…,1) to the source sequence

Candidate Selection Evaluate ∆(Φ) for all candidates and choose best Speed-up techniques: Efficient gain computation using partition data-structure; Lazy gain update: if old ∆(Φ) is lower than best so far, do not recompute
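Lazy gain updates rely on gains never increasing as more candidates are selected, so a stale gain is an upper bound on the current one. One common way to realize the idea, shown here on plain set cover and not necessarily how the authors implemented it, keeps candidates in a max-heap keyed by their last computed gain; the name `lazy_greedy` and the toy data are assumptions.

```python
import heapq


def lazy_greedy(sets, universe):
    """Set cover greedy that recomputes a gain only for the current heap top."""
    uncovered = set(universe)
    # Entries are (-stale_gain, name); heapq is a min-heap, hence the sign flip.
    heap = [(-len(s), name) for name, s in sets.items()]
    heapq.heapify(heap)
    chosen, evaluations = [], 0
    while uncovered and heap:
        _, name = heapq.heappop(heap)
        gain = len(sets[name] & uncovered)  # fresh gain for this candidate only
        evaluations += 1
        if gain == 0:
            continue  # can never become useful again
        if not heap or gain >= -heap[0][0]:
            # Fresh gain beats every stale upper bound, so it is the true best.
            chosen.append(name)
            uncovered -= sets[name]
        else:
            heapq.heappush(heap, (-gain, name))
    if uncovered:
        raise ValueError("sets do not cover the universe")
    return chosen, evaluations


S = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}, "d": {5, 6}}
chosen, evaluations = lazy_greedy(S, range(1, 7))
print(chosen, evaluations)
```

The result matches the plain greedy choice; the saving comes from candidates whose stale gain is already below the best fresh gain and are therefore never re-evaluated in that round.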

Experimental Results
n      mat      mat lazy   part    part lazy   # dist
100    35.4     22.1       2.2     1.4         8.0
200    221.6    125.2      8.8     4.6         10.0
500    2168.8   1144.4     53.0    18.7        12.3
1000   5600.4   2756.4     113.6   31.7        14.1
Averages over 10 testcases, sequence length = 10,000
Barcodes for 100 sequences of length 1,000,000 computed in less than 10 minutes

Overview Potential function greedy algorithm Primer Set Selection for Multiplex PCR The String Barcoding Problem Conclusions

Conclusions General potential function framework for designing and analyzing greedy covering algorithms Improved approximation guarantees and practical performance for two important optimization problems in computational biology: primer set selection for multiplex PCR, and distinguisher selection for string barcoding

Ongoing Work Primer Set Selection: Improved hybridization models; Degenerate primers; Partitioning into multiple multiplexed PCR reactions; Close approximation gap for minimum multi-colored subgraph String Barcoding: Probe mixtures as distinguishers; Beyond redundancy: error correcting; Simultaneous detection of multiple pathogens

Acknowledgments B. DasGupta, K. Konwar, A. Russell, A. Shvartsman UCONN Research Foundation