1 Combinatorial Optimization Methods for Reliable Genomic-Based Detection Systems Ion Mandoiu University of Connecticut Computer Science & Engineering.

Presentation transcript:

1 Combinatorial Optimization Methods for Reliable Genomic-Based Detection Systems Ion Mandoiu University of Connecticut Computer Science & Engineering Department

2 Motivation
- Early detection, early response: rapid identification of pathogens causing epidemic outbreaks enables faster containment
- Emerging large-scale systems for infectious agent detection:
  - BioWatch [DHS]
  - Human Virome project [Anderson et al. 03]
- Genomic-based assays are becoming the method of choice for early detection and identification
  - Sequence data increasingly available
  - Broad detection spectrum, fast, easy to automate
  - Reduced deployment and update overhead
- Besides resolving numerous technological challenges, novel bioinformatics tools will be needed to assist in assay design and optimization

3 Can Computer Scientists Really Help?
- They’ve done it before: BLAST, Human Genome assembly
- Computer virus detection
  - more than 68,000 viruses detected in real time
  - daily updates of computer virus signatures
  - techniques used by computer anti-virus programs can be used to speed up genomic-based detection assays

4 Overview
- Generic Detection System Architecture
- The String Barcoding Problem
- Primer Set Selection for Multiplex PCR
- Conclusions

5 Detection System Requirements
- Fast, highly specific pathogen detection and identification without compromising sensitivity (low false alarm rate)
- Ability to work with trace amounts of genetic material
- Fully automated operation: should require minimal human intervention
- Parallel detection of a large number of pathogens
- Discrimination between pathogens and non-pathogenic organisms
- Low operating cost
- Easy to upgrade

6 Key System Components
- Selection of distinguishing oligonucleotides based on available genomic sequences
- Selective amplification of distinguishing sequences from environmental sample
- Hybridization-based detection of present distinguishers
- Pathogen identification by comparison with stored signatures/barcodes of known pathogens

7 Generic System Architecture
[Figure: pipeline diagram. Labels: set of (degenerate) primers; mixtures of (degenerate) primers; multiplex PCR (PCR machine); sample containing minute traces of pathogen genetic material; amplified DNA sequences from sample; fluorescent nucleotides; probes obtained by ligating distinguisher reporters and anti-tags; single-base extension and hybridization with universal tag array; barcodes of pathogens present in sample]

8 SBE & Hybridization with Universal Tag Arrays
[Figure]

9 Overview
- Generic Detection System Architecture
- The String Barcoding Problem
  - Problem Formulation
  - Integer Program
  - Fast heuristics
- Primer Set Selection for Multiplex PCR
- Conclusions

10 Motivation
- Need for rapid virus detection
  - Given
    - Virus with unknown identity
    - Database of known viruses
  - Problem
    - Identify unknown virus quickly
  - Ideal solution
    - Have sequence of
      - Viruses in database
      - Unknown virus
- Solution
  - Use BLAST (or any sequence similarity program/algorithm)

11 Real World
- Only have sequences for pathogens in database
  - Not possible to quickly sequence an unknown virus
- Can quickly test for presence of short substrings in unknown virus (substring tests) using, e.g., hybridization + SBE
- New idea (Borneman et al.’01, Rash&Gusfield’02)
  - String Barcoding: use substring tests to uniquely identify each virus in the database

12 Problem Definition
Given: genomic sequences $g_1, \ldots, g_n$
Find: minimum number of distinguisher strings $t_1, \ldots, t_k$
Such that: for every $g_i \ne g_j$, there exists a string $t_l$ which is the Watson-Crick complement of a substring of $g_i$ or $g_j$, but not of both
- At least $\log_2 n$ distinguishers needed
- Fingerprints: $n$ distinguishers
- Far fewer than $n$ distinguishers needed in practice (close to $\log_2 n$)

13 Small Example
Given sequences:
1. cagtgc
2. cagttc
3. catgga
Feasible set of distinguishers: {tg, atgga}

          tg   atgga
cagtgc    1    0
cagttc    0    0
catgga    1    1

0/1 row vectors: unique barcode for each pathogen
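The barcode table in the small example can be reproduced with a few lines of Python. This is a minimal sketch that, like the slide, tests plain substring occurrence and ignores Watson-Crick complementation; the function names `barcode` and `is_feasible` are illustrative and do not come from the talk.

```python
def barcode(sequence, distinguishers):
    """Return the 0/1 row vector: 1 if the distinguisher occurs in the sequence."""
    return tuple(int(d in sequence) for d in distinguishers)


def is_feasible(sequences, distinguishers):
    """A distinguisher set is feasible if all barcodes are pairwise distinct."""
    codes = [barcode(s, distinguishers) for s in sequences]
    return len(set(codes)) == len(codes)


sequences = ["cagtgc", "cagttc", "catgga"]
distinguishers = ["tg", "atgga"]

# One barcode per sequence, in the order the sequences are listed.
codes = [barcode(s, distinguishers) for s in sequences]
for seq, code in zip(sequences, codes):
    print(seq, code)
```

Using "tg" alone would not be feasible, because cagtgc and catgga would then share the barcode (1).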

14 Computational Complexity
[Berman et al.’04] Cannot be approximated within a factor of $(1-\epsilon)\ln n$ unless $NP = DTIME(n^{\log\log n})$

15 Integer Program Formulation
- Basic idea (Rash&Gusfield’02)
  - Write problem as minimization of a linear function subject to linear constraints
  - Variables restricted to take 0/1 values
- For our problem
  - One variable for each candidate distinguisher
    - Value = 1: candidate is selected
    - Value = 0: candidate is not selected
  - One constraint for each pair of strings in S
    - At least one good distinguisher chosen for each pair
  - Objective function
    - Minimize sum of variables (#selected candidates)

16 Practical Implementation
- Key point: runtime needed to solve integer program depends on #variables
- Lots of variables can be removed:
  - Candidates that appear in all sequences
  - Sufficient to keep a single candidate among those that appear in the same set of strings
- How to remove useless variables?
  - Rash&Gusfield’s method: use suffix trees
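The variable reduction described above can be illustrated by brute force. The sketch below enumerates every substring directly, which takes quadratic time per sequence and is only meant for tiny inputs; it is not Rash&Gusfield's suffix-tree method. The function name `candidate_classes` and the choice of the shortest, lexicographically smallest representative are assumptions made for illustration.

```python
def candidate_classes(sequences):
    """Map each occurrence set (frozenset of 1-based sequence IDs)
    to a single representative substring."""
    occurrences = {}
    for idx, seq in enumerate(sequences, start=1):
        for i in range(len(seq)):
            for j in range(i + 1, len(seq) + 1):
                occurrences.setdefault(seq[i:j], set()).add(idx)

    everyone = frozenset(range(1, len(sequences) + 1))
    classes = {}
    for sub, ids in occurrences.items():
        key = frozenset(ids)
        if key == everyone:
            continue  # occurs in all sequences, so it separates no pair
        best = classes.get(key)
        # Keep one candidate per occurrence set (shortest, then alphabetical).
        if best is None or (len(sub), sub) < (len(best), best):
            classes[key] = sub
    return classes


sequences = ["cagtgc", "cagttc", "catgga"]
classes = candidate_classes(sequences)
for ids, sub in sorted(classes.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(ids), sub)
```

On the three example sequences this leaves five occurrence sets, so five 0/1 variables instead of one per distinct substring.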

17 Suffix Trees
Key properties of the suffix tree built for a set of strings S:
- Rooted tree with character sequences labeling edges
- Tree nodes labeled with a subset of the original string IDs
- Every substring of original input set appears as a tree walk from root exactly once

18 Suffix Tree Example
Strings: 1. cagtgc 2. cagttc 3. catgga
Node labels (string IDs):
v1 - {1,2,3}    v2 - {1,2,3}    v3 - {3}        v4 - {1}        v5 - {3}
v6 - {1,2}      v7 - {2}        v8 - {1}        v9 - {1,2,3}    v10 - {1,2,3}
v11 - {1,2}     v12 - {1}       v13 - {2}       v14 - {3}       v15 - {1,2,3}
v16 - {2}       v17 - {2}       v18 - {1,3}     v19 - {1}       v20 - {3}
v21 - {1,2,3}   v22 - {3}       v23 - {2}       v24 - {1,2}     v25 - {1}

19 Integer Program
Minimize
  V18 + V22 + V11 + V17 + V8        # objective function
Such that
  V18 + V17 + V8 >= 1               # constraint to cover pair 1,2
  V22 + V11 + V8 >= 1               # constraint to cover pair 1,3
  V18 + V22 + V11 + V17 >= 1        # constraint to cover pair 2,3
Binaries                            # all variables are 0/1
  V18 V22 V11 V17 V8
End

          tg (V18)   atgga (V22)
cagtgc    1          0
cagttc    0          0
catgga    1          1
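For an instance this small, the integer program can be checked by exhaustive search instead of an IP solver. The sketch below rebuilds the three covering constraints from the occurrence sets of the five candidate nodes and tries subsets in order of increasing size. The helper names are illustrative, and the exhaustive search is a stand-in for the solver used in the talk.

```python
from itertools import combinations

# Occurrence sets (string IDs) of the five candidate suffix-tree nodes.
candidates = {
    "V18": {1, 3},
    "V22": {3},
    "V11": {1, 2},
    "V17": {2},
    "V8": {1},
}


def separates(occ, i, j):
    """A candidate separates pair (i, j) if it occurs in exactly one of them."""
    return (i in occ) != (j in occ)


def covering_constraints(candidates, n):
    """For each pair, list the variables appearing in its '>= 1' constraint."""
    return {
        (i, j): sorted(v for v, occ in candidates.items() if separates(occ, i, j))
        for i, j in combinations(range(1, n + 1), 2)
    }


def min_distinguisher_set(candidates, n):
    """Smallest subset of candidates satisfying every pair constraint."""
    constraints = covering_constraints(candidates, n)
    names = sorted(candidates)
    for size in range(1, len(names) + 1):
        for chosen in combinations(names, size):
            if all(any(v in chosen for v in vs) for vs in constraints.values()):
                return list(chosen)
    return None


constraints = covering_constraints(candidates, 3)
solution = min_distinguisher_set(candidates, 3)
print(constraints)
print(solution)
```

No single candidate satisfies all three constraints, so the optimum has two distinguishers, matching the {tg, atgga} barcode table.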

20 Limitations of Integer Program Method Works only for moderately sized datasets sequences Average length ~1000 characters Over 4 hours needed to come within 20% of optimum Scalable Heuristics?

21 Information Content Heuristic [Berman et al. 2004]
- Keep track of the partition defined by distinguishers selected so far
[Figure: sequences 1 to n partitioned successively by Distinguisher 1 and Distinguisher 2]

22 Information Content Heuristic [Berman et al. 2004]
- Keep track of the partition defined by distinguishers selected so far
- In every step, choose candidate that reduces partition entropy by largest amount
  - Initial entropy = $\log_2(n!) \approx n \log_2 n$
  - Final entropy = 0

23 Information Content Heuristic [Berman et al. 2004]
- Keep track of the partition defined by distinguishers selected so far
- In every step, choose candidate that reduces partition entropy by largest amount
  - Initial entropy = $\log_2(n!) \approx n \log_2 n$
  - Final entropy = 0
- Theorem: the Information Content Heuristic always finds a number of distinguishers within a factor of $1 + \ln n$ of the optimum
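The heuristic can be sketched in Python as follows. The entropy used here, the sum of log2(|class|!) over the classes of the current partition, is an assumption chosen to agree with the slide's initial value log2(n!) and final value 0; the function names and the candidate dictionary (occurrence sets of the five candidates from the integer program example) are illustrative.

```python
import math


def partition_entropy(classes):
    """Assumed entropy: sum over classes of log2(|class|!)."""
    return sum(math.log2(math.factorial(len(c))) for c in classes)


def refine(classes, occ):
    """Split every class into the members that contain the candidate and the rest."""
    refined = []
    for c in classes:
        inside, outside = c & occ, c - occ
        refined.extend(part for part in (inside, outside) if part)
    return refined


def information_content_heuristic(n, candidates):
    """Greedily pick the candidate whose refinement has the lowest entropy."""
    classes = [frozenset(range(1, n + 1))]
    chosen = []
    while partition_entropy(classes) > 0:
        best = min(
            candidates,
            key=lambda name: partition_entropy(refine(classes, candidates[name])),
        )
        new_classes = refine(classes, candidates[best])
        if partition_entropy(new_classes) >= partition_entropy(classes):
            raise ValueError("no candidate separates the remaining sequences")
        classes = new_classes
        chosen.append(best)
    return chosen


candidates = {"V18": {1, 3}, "V22": {3}, "V11": {1, 2}, "V17": {2}, "V8": {1}}
chosen = information_content_heuristic(3, candidates)
print(chosen)
```

Each step splits some class, so the loop ends once every class is a singleton, at which point the selected candidates give every sequence a distinct barcode.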

24 Limitations of ICH
- Real genomic data has degenerate nucleotides
  - Ambiguous sequencing
  - Single nucleotide polymorphisms
- For sequences with degenerate nucleotides there are three possibilities for distinguisher hybridization
  - Sure hybridization
  - Sure mismatch
  - Uncertain hybridization
- No partition to work with!

25 Simpler Greedy Heuristic
- Setcover greedy:
  - In every step, choose candidate that distinguishes the largest number of not yet distinguished pairs
- Distinguisher selection as setcover problem:
  - Elements to be covered are the pairs of sequences
  - Each candidate distinguisher defines a set of pairs that it separates
  - Problem: find minimum number of sets that cover all elements
- By a classical result, setcover greedy gives a $2\ln n$ approximation; in practice it is as good as ICH
- Runtime is a few seconds for the Rash&Gusfield datasets
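A minimal sketch of the setcover greedy, assuming each candidate is described by the set of sequence IDs it is certain to hybridize with (uncertain hybridizations are left out for simplicity). The candidate names reuse the five nodes from the integer program example, and the function name is illustrative.

```python
from itertools import combinations


def greedy_setcover_distinguishers(n, candidates):
    """Repeatedly pick the candidate separating the most uncovered pairs."""
    uncovered = set(combinations(range(1, n + 1), 2))
    chosen = []
    while uncovered:
        def gain(name):
            occ = candidates[name]
            return sum((i in occ) != (j in occ) for i, j in uncovered)

        best = max(candidates, key=gain)
        if gain(best) == 0:
            raise ValueError("remaining pairs cannot be distinguished")
        occ = candidates[best]
        # Keep only the pairs that the chosen candidate fails to separate.
        uncovered = {(i, j) for i, j in uncovered if (i in occ) == (j in occ)}
        chosen.append(best)
    return chosen


candidates = {"V18": {1, 3}, "V22": {3}, "V11": {1, 2}, "V17": {2}, "V8": {1}}
chosen = greedy_setcover_distinguishers(3, candidates)
print(chosen)
```

Because the greedy only needs to know which pairs a candidate separates, it does not rely on the sequences forming a partition, which is what makes it usable when ICH is not.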

26 Overview
- Generic Detection System Architecture
- The String Barcoding Problem
- Primer Set Selection for Multiplex PCR
  - Problem formulation
  - Greedy and LP-rounding algorithm for primer set selection with uniqueness constraints
  - Experimental results
- Conclusions

27 The Polymerase Chain Reaction
[Figure: target sequence with Primer 1 and Primer 2 annealed to opposite strands (5’ and 3’ ends marked), extended by polymerase; cycles are repeated]

28 Primer Pair Selection Problem
Given:
- Genomic sequence around amplification locus
- Primer length k
- Amplification upper bound L
Find: forward and reverse primers of length k that hybridize within a distance of L of each other and optimize amplification efficiency (melting temperatures, secondary structure, cross hybridization, etc.)
[Figure: forward and reverse primers flanking the amplification locus, each within distance L, with 5' and 3' strand ends marked]

29 Multiplex PCR
- Multiplex PCR (MP-PCR)
  - Multiple DNA fragments amplified simultaneously
  - Boundaries of each amplification fragment still defined by two oligonucleotide primers
  - A primer may participate in the amplification of multiple targets
- Primer set selection
  - Typically done by time-consuming trial and error
  - An important objective is to minimize number of primers
    - Reduced assay cost
    - Higher effective concentration of primers, hence higher amplification efficiency
    - Reduced unintended amplification

30 Other Applications of Multiplex PCR
- Spotted microarray synthesis [Fernandes&Skiena’02]
  - Need unique pair of primers for each one of the n amplification products, but primers can be used multiple times
  - Potential to reduce #primers from $O(n)$ to $O(n^{1/2})$
- SNP Genotyping
  - Thousands of SNPs that must be genotyped using hybridization-based methods (e.g., single-base extension)
  - Selective PCR amplification needed to improve accuracy of detection steps (whole-genome amplification less appropriate)
  - No need for unique amplification!
  - Primer minimization is critical
    - Reduced cost
    - Fewer multiplex PCR reactions, less mispriming

31 Primer Set Selection Problem
Given:
- Genomic sequences around each amplification locus
- Primer length k
- Amplification upper bound L
Find: minimum size set of primers S of length k such that, for each amplification locus, there are two primers in S hybridizing to the forward and reverse sequences within a distance of L of each other
For applications requiring uniqueness: S should contain a unique pair of primers amplifying each locus

32 Previous Work
- Well-studied problem: [Pearson et al. 96], [Linhart & Shamir’02], [Souvenir et al.’03], etc.
- Almost all problem formulations decouple selection of forward and reverse primers
  - Cannot directly enforce constraints on amplification product length!
  - To enforce bound of L on amplification length, select only primers that hybridize within L/2 bases of desired target
  - In the worst case, this method can increase the number of primers by a factor of $O(n)$ compared to the optimum
- Greedy set cover algorithm gives $O(\ln n)$ approximation factor for the “decoupled” formulation
  - Cannot find better approximation unless P=NP

33 Previous Work (contd.)
- [Fernandes&Skiena’02] model primer set selection with uniqueness constraints as a minimum multicolored subgraph problem:
  - Vertices of the graph correspond to candidate primers
  - There is an edge colored by color i between primers u and v if they hybridize within a distance of L of each other around the i-th amplification locus
  - Goal is to find minimum size set of vertices inducing edges of all colors
- Can be used to model length amplification constraints [Lancia et al.’02]
- Trivial approximation algorithm: select 2 primers for each amplification target
  - $O(n^{1/2})$ approximation since at least $n^{1/2}$ primers are required by every feasible solution

34 Integer Program Formulation
- Variable $x_u$ for every vertex (candidate primer) u
  - $x_u$ set to 1 if u is selected, and to 0 otherwise
- Variable $y_e$ for every edge e
  - $y_e$ set to 1 if corresponding primer pair is selected to amplify corresponding target
- Objective: minimize sum of $x_u$’s
- Constraints:
  - for each i, sum of $y_e$’s over all e’s amplifying locus i is at least 1
  - $y_e \le x_u$ for every e incident to u
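Written out in full, the formulation above reads as follows. The symbols $V$ (set of candidate primers) and $E_i$ (set of edges of color $i$, i.e. primer pairs amplifying locus $i$) are notation introduced here for illustration and do not appear on the slide.

```latex
\begin{align*}
\text{minimize}   \quad & \sum_{u \in V} x_u \\
\text{subject to} \quad & \sum_{e \in E_i} y_e \ge 1
                          && \text{for every amplification locus } i, \\
                        & y_e \le x_u
                          && \text{for every edge } e \text{ and each endpoint } u \text{ of } e, \\
                        & x_u,\, y_e \in \{0, 1\}
                          && \text{for every vertex } u \text{ and edge } e.
\end{align*}
```

The constraint $y_e \le x_u$ ensures that a primer pair can be counted towards a locus only if both of its primers are selected.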

35 Linear Program Relaxation
- Integer program is hard to solve exactly
- The linear programming relaxation, in which variables are allowed to take fractional values, can still be solved efficiently

36 LP-Rounding Algorithm
- Theorem [Konwar et al.’04]: the LP-rounding algorithm finds a feasible solution at most $O(m^{1/2} \ln n)$ times larger than the optimum, where m is the maximum color class size and n is the number of nodes
- For primer selection, $m \le L^2$, so the approximation factor is $O(L \ln n)$
- Better approximation?
  - Unlikely for minimum multicolored subgraph problem
Algorithm:
(1) Solve linear programming relaxation
(2) Select node u with probability $x_u$
(3) Repeat step 2 $O(\ln n)$ times and return selected nodes
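Steps (2) and (3) can be sketched in Python as below. The sketch assumes the fractional values `x` have already been computed by some LP solver in step (1); the constant `c` in the number of rounds, the function names, and the feasibility check are illustrative assumptions rather than details from the talk.

```python
import math
import random


def lp_round(x, num_nodes, rng=None, c=2.0):
    """Randomized rounding: in each of about c*ln(n) rounds, pick every
    node u independently with probability x[u]; return the union."""
    rng = rng or random.Random()
    rounds = max(1, math.ceil(c * math.log(max(num_nodes, 2))))
    selected = set()
    for _ in range(rounds):
        for u, value in x.items():
            if rng.random() < value:
                selected.add(u)
    return selected


def covers_all_colors(selected, edges_by_color):
    """Feasible if, for every color (locus), some edge has both endpoints selected."""
    return all(
        any(u in selected and v in selected for u, v in edges)
        for edges in edges_by_color.values()
    )


# Hypothetical fractional LP solution for three candidate primers.
x = {"a": 1.0, "b": 1.0, "c": 0.0}
selected = lp_round(x, num_nodes=len(x), rng=random.Random(0))
print(selected, covers_all_colors(selected, {1: [("a", "b")]}))
```

Since the rounding is randomized, a practical implementation would typically check feasibility afterwards and rerun the rounding if some locus is left without a selected primer pair.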

37 Selection w/o Uniqueness Constraints
- Can be seen as a “simultaneous set covering” problem:
  - The ground set is partitioned into n disjoint sets, each with 2L elements
  - The goal is to select a minimum number of sets (i.e., primers) that cover at least half of the elements in each partition
- Naïve modifications of the greedy set cover algorithm do not work
- Key idea: use potential function $\Phi$ to measure progress towards feasibility
  - For primer selection, the potential function counts the total number of elements that remain to be covered
  - Initially, $\Phi = nL$
  - For feasible solutions, $\Phi = 0$

38 Greedy Approximation Algorithm
- Theorem: the greedy algorithm returns a feasible primer set whose size is at most $1 + \ln \Delta$ times the optimum, where $\Delta$ is the maximum potential value decrease caused by a single primer
- For primer selection, $\Delta$ is equal to nL in the worst case, and is much smaller in practice
  - The number of primers selected by the greedy algorithm is within a factor of $1 + \ln(nL)$ of the optimum
Potential-Function Driven Greedy Algorithm:
- Select a primer that decreases potential function $\Phi$ by the largest amount (breaking ties arbitrarily)
- Repeat until feasibility is achieved
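The potential-function greedy can be sketched on the abstract "simultaneous set covering" model, in which each locus needs at least L of its elements covered. The data layout (a dictionary from primer name to the set of (locus, element) pairs it covers), the function names, and the toy instance are assumptions for illustration.

```python
def potential(covered, loci, L):
    """Elements still to be covered: sum over loci of max(0, L - #covered)."""
    return sum(max(0, L - len(covered[i])) for i in loci)


def greedy_potential(primers, loci, L):
    """Pick, at every step, the primer that lowers the potential the most."""
    covered = {i: set() for i in loci}
    chosen = []
    phi = potential(covered, loci, L)
    while phi > 0:
        best, best_phi = None, phi
        for name, elems in primers.items():
            if name in chosen:
                continue
            # Evaluate the potential as if this primer were added.
            trial = {i: set(s) for i, s in covered.items()}
            for locus, element in elems:
                trial[locus].add(element)
            new_phi = potential(trial, loci, L)
            if new_phi < best_phi:
                best, best_phi = name, new_phi
        if best is None:
            raise ValueError("no primer decreases the potential; instance infeasible")
        for locus, element in primers[best]:
            covered[locus].add(element)
        chosen.append(best)
        phi = best_phi
    return chosen


# Toy instance: 2 loci, L = 2, so the initial potential is n * L = 4.
loci = [0, 1]
primers = {
    "p1": {(0, 0), (0, 1), (1, 0)},
    "p2": {(1, 1)},
    "p3": {(0, 2)},
    "p4": {(1, 0), (1, 2)},
}
chosen = greedy_potential(primers, loci, L=2)
print(chosen)
```

Capping each locus's contribution at L is what distinguishes this from plain greedy set cover: covering more than L elements of a locus yields no further decrease in potential.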

39 Experimental Setting
- Datasets
  - Extracted from NCBI databases
  - Randomly generated using uniform distribution
- Compared algorithms
  - G-FIX: greedy primer cover algorithm of Pearson et al.
    - Primers restricted to be within L/2 bases of amplification locus
  - G-VAR: naïve modification of G-FIX
    - For each locus, first selected primer can be up to L bases away
    - If first selected primer is $L_1$ bases away from amplification locus, opposite sequence is truncated to a length of $L - L_1$
  - MIPS-PT: iterative beam-search heuristic of Souvenir et al.
  - G-POT: potential-function driven greedy algorithm

40 Experimental Results, NCBI tests
[Table: columns # Targets, k, and #Primers / CPU sec for each of G-FIX (Pearson et al.), G-VAR (G-FIX with dynamic truncation), MIPS-PT (Souvenir et al.), and G-POT (potential-function greedy)]

41 [Plot: #primers as percentage of 2n, versus n (l=8)]

42 [Plot: #primers as percentage of 2n, versus n (l=10)]

43 [Plot: #primers as percentage of 2n, versus n (l=12)]

44 [Plot: CPU seconds versus n (l=10)]

45 Overview
- Generic Detection System Architecture
- The String Barcoding Problem
- Primer Set Selection for Multiplex PCR
- Conclusions

46 Conclusions
- Building the next generation of pathogen detection systems will require novel bioinformatics tools for genomic assay design, built around accurate mathematical models and powerful algorithmic techniques
- We have given improved algorithms for two critical optimizations: distinguisher selection for string barcoding, and primer set selection for multiplex PCR

47 Ongoing Work
- String Barcoding
  - Probe mixtures as distinguishers
  - Redundancy and error correcting properties
  - Simultaneous detection of multiple pathogens
- Primer Set Selection
  - Improved hybridization models
  - Practical validation
  - Degenerate primers
- Universal Tag array design
  - Tag selection (Ben-Dor’00)
  - Tag placement and embedding
  - Assignment of reporter probes to anti-tags
- Partitioning into multiple multiplexed PCR reactions and multiple Universal Tag array hybridizations (Aumann et al. WABI’03)

48 Acknowledgments B. DasGupta, K. Konwar, A. Russell, A. Shvartsman UCONN Research Foundation