Highly Scalable Algorithms for Robust String Barcoding. Bhaskar DasGupta, Kishori M. Konwar, Ion Mandoiu, Alex Shvartsman.

Presentation transcript:

1 Highly Scalable Algorithms for Robust String Barcoding
Bhaskar DasGupta*, Kishori M. Konwar, Ion Mandoiu, Alex Shvartsman
Computer Science & Engineering Department, University of Connecticut, Storrs, CT
* Department of Computer Science, University of Illinois at Chicago, Chicago, IL

2 Motivation
There are many critical situations in which one needs to rapidly identify an unknown genomic sequence from among a given set of known sequences:
- Rapid identification of pathogens in epidemic outbreaks
- Monitoring of microbial communities, e.g., in environmental studies
- Fast database search

3 Possible Approaches
Sequencing based: sequence the unknown DNA, then use a similarity search program such as BLAST to identify it among the pathogen sequences in a database
- Sequencing is prohibitively expensive and time consuming
Hybridization based: identify the unknown sequence by testing for the presence of certain subsequences
- Subsequence tests can be performed quickly and at low cost using a variety of hybridization-based methods

4 Sequence Fingerprints
For each sequence, find a subsequence that appears in that sequence and only in that sequence

           GTTGC  GTTC  CAT
CAGTTGC      1     0     0
CAGTTC       0     1     0
CATGGA       0     0     1

Sequence barcodes: 0/1 vectors
When using fingerprints, barcode length = #sequences

5 String Barcoding (Borneman et al. '01, Rash & Gusfield '02)
Unique occurrence of tested subsequences is not needed, as long as the 0/1 barcodes are unique

           TG  CAGT
CAGTTGC     1    1
CAGTTC      0    1
CATGGA      1    0

When using non-unique subsequences, barcode length can be much smaller than #sequences
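The two barcode tables above can be reproduced with a few lines of Python. This is a minimal sketch, assuming exact substring matching; the helper names `barcode` and `all_unique` are illustrative and not from the slides.

```python
def barcode(sequence, distinguishers):
    """Return the 0/1 barcode of `sequence`: one bit per tested subsequence."""
    return "".join("1" if d in sequence else "0" for d in distinguishers)

def all_unique(codes):
    """True if no two sequences share a barcode."""
    return len(set(codes)) == len(codes)

sequences = ["CAGTTGC", "CAGTTC", "CATGGA"]

# Fingerprints: one subsequence per sequence, so barcode length equals #sequences.
fingerprint_codes = [barcode(s, ["GTTGC", "GTTC", "CAT"]) for s in sequences]

# String barcoding: non-unique subsequences, only the barcodes must be unique.
string_codes = [barcode(s, ["TG", "CAGT"]) for s in sequences]

print(fingerprint_codes, all_unique(fingerprint_codes))
print(string_codes, all_unique(string_codes))
```

Two distinguishers suffice here where fingerprints need three.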

6 Overview
- Problem Formulation and Previous Work
- Greedy Setcover Algorithm
- Experimental Results
- Conclusions

7 Problem Definition
Given: genomic sequences g_1, …, g_n
Find: a minimum number of distinguisher strings t_1, …, t_k
Such that: for every pair g_i ≠ g_j there exists a distinguisher t_l that is a substring of g_i or g_j, but not of both
- At least log_2 n distinguishers are needed
- n distinguishers are always sufficient

8 Computational Complexity [Berman et al. '04]
Cannot be approximated within a factor of (1-ε) ln(n) unless NP = DTIME(n^{log log n})

9 Rash & Gusfield Integer Program
A non-redundant set of candidate distinguishers is generated using a suffix tree
One variable v_i for each candidate distinguisher x_i
- v_i = 1 if x_i is selected
- v_i = 0 if x_i is not selected

10 Integer Program Example

             TG  ATGGA
1. CAGTGC     1    0
2. CAGTTC     0    0
3. CCATGGA    1    1

Minimize
  V_TG + V_ATGGA + V_CAGT + V_TTC + V_GTGC     #objective function
Such that
  V_TG + V_TTC + V_GTGC >= 1                   #constraint to cover pair 1,2
  V_ATGGA + V_CAGT + V_GTGC >= 1               #constraint to cover pair 1,3
  V_TG + V_ATGGA + V_CAGT + V_TTC >= 1         #constraint to cover pair 2,3
Binaries                                       #all variables are 0/1
  V_TG V_ATGGA V_CAGT V_TTC V_GTGC
End
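The covering constraints of this example can be derived mechanically from the three sequences and the five candidates. The sketch below rebuilds them and then finds a smallest feasible selection by exhaustive search. The exhaustive search is only a stand-in for an integer programming solver on this toy instance, and the names `separates`, `constraints`, and `smallest_cover` are illustrative.

```python
from itertools import combinations

sequences = ["CAGTGC", "CAGTTC", "CCATGGA"]
candidates = ["TG", "ATGGA", "CAGT", "TTC", "GTGC"]

def separates(c, a, b):
    """A candidate separates a pair if it occurs in exactly one of the two sequences."""
    return (c in a) != (c in b)

# One covering constraint per pair of sequences (1-based labels as on the slide):
# the list holds the candidates whose variables appear in that constraint.
constraints = {
    (i + 1, j + 1): [c for c in candidates if separates(c, sequences[i], sequences[j])]
    for i, j in combinations(range(len(sequences)), 2)
}

def smallest_cover():
    """Try all subsets by increasing size; feasible only for tiny instances."""
    for k in range(1, len(candidates) + 1):
        for chosen in combinations(candidates, k):
            if all(any(c in chosen for c in cover) for cover in constraints.values()):
                return list(chosen)
    return None

print(constraints)
print(smallest_cover())
```

The search returns the pair TG, ATGGA, which matches the barcode table shown for this example.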

11 Limitations of Integer Program Method
Works only for moderately sized datasets
- [...] sequences
- Average sequence length ~1000 nucleotides
- Up to 4 hours needed to come within 20% of optimum

12 Information Content Heuristic [Berman et al. 2004]
- Keep track of the partition defined by the distinguishers selected so far
- In every step, choose the candidate that reduces the partition entropy by the largest amount
Theorem: the Information Content Heuristic always finds a number of distinguishers within a factor of 1+ln(n) of the optimum

13 Information Content Heuristic
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = { }
Entropy = log_2(5!)

14 Information Content Heuristic
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = { }, c = TCAG
New Entropy = log_2(3! 2!)
Change = log_2(5!) - log_2(3! 2!)

15 Information Content Heuristic
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = {TCAG}, c = AAT
New Entropy = log_2(2! 1! 1! 1!)
Change = log_2(3! 2!) - log_2(2!)

16 Information Content Heuristic
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = {TCAG, AAT}
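The entropy values in this walk-through can be checked numerically. The sketch below assumes the entropy of a partition is log_2 of the product of the factorials of its part sizes, as in the slides' example; the function names are illustrative.

```python
from math import factorial, log2

def partition(sequences, distinguishers):
    """Group sequences by their 0/1 barcode under the selected distinguishers."""
    groups = {}
    for s in sequences:
        key = tuple(d in s for d in distinguishers)
        groups.setdefault(key, []).append(s)
    return list(groups.values())

def entropy(sequences, distinguishers):
    """log2 of the product of |S_i|! over the parts S_i of the partition."""
    return sum(log2(factorial(len(g))) for g in partition(sequences, distinguishers))

seqs = ["CATCAGA", "TTCAGT", "TAT", "AATAG", "AATCAG"]
e0 = entropy(seqs, [])               # log2(5!)
e1 = entropy(seqs, ["TCAG"])         # log2(3! 2!)
e2 = entropy(seqs, ["TCAG", "AAT"])  # log2(2! 1! 1! 1!)
print(e0, e1, e2, e0 - e1, e1 - e2)
```

After selecting TCAG and AAT one part of size two remains (CATCAGA and TTCAGT), so the entropy is log_2(2!) = 1 rather than 0.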

17 Limitations of ICH
Real genomic sequences contain degenerate nucleotides (e.g., N for any of {A,T,C,G}) due to sequencing errors and known single nucleotide polymorphisms
Distinguisher-to-sequence matches:
- Perfect matches
- Perfect mismatches
- Uncertain matches
Information content cannot be defined in the presence of uncertain matches

Sequence ATCNAT:
  ATC  1
  CCC  0
  CCA  ?
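The three kinds of match can be illustrated with a small classifier. This is a sketch under simplifying assumptions: only the degenerate code N is handled, candidates themselves contain no degenerate bases, and the function name `match_status` is illustrative.

```python
def match_status(candidate, sequence):
    """Classify a candidate against a sequence that may contain N.

    Returns '1' for a perfect match, '0' if every position is a perfect
    mismatch, and '?' if some window could match depending on an N.
    """
    uncertain = False
    for start in range(len(sequence) - len(candidate) + 1):
        window = sequence[start:start + len(candidate)]
        if window == candidate:
            return "1"
        # The window is compatible if each base agrees or is the wildcard N.
        if all(w == c or w == "N" for w, c in zip(window, candidate)):
            uncertain = True
    return "?" if uncertain else "0"

for cand in ["ATC", "CCC", "CCA"]:
    print(cand, match_status(cand, "ATCNAT"))
```

For the sequence ATCNAT this gives 1 for ATC, 0 for CCC, and ? for CCA, because the window CNA matches CCA only if N stands for C.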

18 Other Heuristics
(Cazalis et al. 2004): greedy setcover, simulated annealing, and genetic algorithms for distinguisher selection
To achieve practical running time, only a small random subset (2000 candidates) of all candidate distinguishers is considered
- No data provided on the loss of solution quality due to this restriction

19 Overview
- Problem Formulation and Previous Work
- Greedy Setcover Algorithm
- Experimental Results
- Conclusions

20 Setcover Greedy Heuristic
Phase I: Candidate Generation
- Generate a representative set of candidate distinguishers from the source sequences
Phase II: Greedy Distinguisher Selection
- In every step, choose the candidate that distinguishes the largest number of not yet distinguished pairs

21 Candidate Generation
A set of candidate distinguishers guaranteed to contain an optimum solution is generated from the sequences
We do not generate certain redundant candidates
- A candidate is redundant if there is another candidate that appears in exactly the same set of sequences
- For every sequence we generate only one of the substrings that appear exclusively in that sequence

22 Efficient Candidate Generation
Our implementation uses simple array data structures
- We generate candidates in increasing order of length
- Exact match positions for candidates of length l-1 are used to generate the exact matches for candidates of length l
Candidates that do not satisfy the given biochemical constraints, such as minimum/maximum length, GC content, or melting temperature, are discarded
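The redundancy rule (keep one candidate per distinct set of sequences it occurs in) can be sketched with a naive enumeration. This is not the array-based incremental method the slides describe, and it applies no biochemical filters; `generate_candidates` and its parameters are hypothetical names used for illustration.

```python
def generate_candidates(sequences, min_len=1, max_len=None):
    """Return {occurrence set: one representative substring}, shortest first."""
    max_len = max_len or max(len(s) for s in sequences)
    representative = {}
    for length in range(min_len, max_len + 1):  # increasing order of length
        substrings = {s[i:i + length] for s in sequences
                      for i in range(len(s) - length + 1)}
        for sub in sorted(substrings):
            occurs_in = frozenset(i for i, s in enumerate(sequences) if sub in s)
            if len(occurs_in) == len(sequences):
                continue  # occurs everywhere, so it separates no pair
            # Keep only the first substring seen for each occurrence set.
            representative.setdefault(occurs_in, sub)
    return representative

example = generate_candidates(["CAGTGC", "CAGTTC", "CCATGGA"])
for occurs_in, sub in example.items():
    print(sorted(occurs_in), sub)
```

On this three-sequence instance only five candidates survive, one per occurrence set, even though the sequences contain many more distinct substrings.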

23 Setcover Greedy Heuristic
Phase I: Candidate Generation
- Generate a set of candidate distinguishers from the source sequences
Phase II: Greedy Distinguisher Selection
- In every step, choose the candidate that distinguishes the largest number of not yet distinguished pairs

24 Distinguisher Selection as Set Cover
Set Cover Problem: given a universal set U and a family of subsets, find a minimum number of subsets covering U
Distinguisher selection is a special case of set cover:
- Elements to be covered are the pairs of sequences (n(n-1)/2 pairs for n sequences)
- Each candidate distinguisher defines a set of pairs that it separates
By a classical result, the greedy algorithm has an approximation factor of 1+ln(|U|)
- Setcover greedy has an approximation factor of 2 ln(n) for distinguisher selection with n sequences

25 Distinguisher Selection
Start with an empty set D of distinguishers
While there are pairs of sequences not yet distinguished, do:
- Compute for each remaining candidate c its coverage gain Δ(c, D): the number of not yet distinguished pairs of sequences that are distinguished by c
- Add the candidate with maximum coverage gain to D
Return D
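The selection loop can be written directly from these steps. The sketch below recomputes every gain by scanning the open pairs, which is the simplest and not the fastest option. The candidate list, the tie-breaking rule (first candidate in list order), and the name `greedy_select` are assumptions made for illustration.

```python
from itertools import combinations

def greedy_select(sequences, candidates):
    """Setcover greedy: repeatedly add the candidate separating the most open pairs."""
    occurs = {c: [c in s for s in sequences] for c in candidates}
    open_pairs = set(combinations(range(len(sequences)), 2))
    selected = []

    def gain(c):
        # Coverage gain: open pairs in which c occurs in exactly one sequence.
        return sum(1 for i, j in open_pairs if occurs[c][i] != occurs[c][j])

    while open_pairs:
        best = max(candidates, key=gain)  # ties go to the earliest candidate
        if gain(best) == 0:
            break  # the remaining pairs cannot be separated by any candidate
        selected.append(best)
        open_pairs = {(i, j) for i, j in open_pairs
                      if occurs[best][i] == occurs[best][j]}
    return selected, open_pairs

seqs = ["CATCAGA", "TTCAGT", "TAT", "AATAG", "AATCAG"]
selected, unresolved = greedy_select(seqs, ["TCAG", "AAT", "CAT", "TAT"])
print(selected, unresolved)
```

With this candidate list the loop picks TCAG (gain 6), then AAT (gain 3), then CAT to split the last open pair.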

26 Computation of Δ(c, D)
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = { }

27 Computation of Δ(c, D)
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = { }, c = TCAG

28 Computation of Δ(c, D)
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = { }, c = TCAG
Δ(c, D) = 3 x (5-3) = 6

29 Computation of Δ(c, D)
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = {TCAG}

30 Computation of Δ(c, D)
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = {TCAG}, c = AAT

31 Computation of Δ(c, D)
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = {TCAG}, c = AAT
Δ(c, D) = 1 x (2-1) + 1 x (3-1) = 3

32 Computation of Δ(c, D)
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = {TCAG, AAT}

33 Computation of Δ(c, D)
S_1, S_2, …, S_k are the subsets in the partition defined by D
M_c is the set of matches of candidate c
Using simple data structures, the computation can be done in linear time (in the number of sequences)
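The worked examples, 3 x (5-3) = 6 and 1 x (2-1) + 1 x (3-1) = 3, are consistent with summing over the parts S_i the product of the number of sequences in S_i that match c and the number that do not, i.e. Δ(c, D) = Σ_i |S_i ∩ M_c| · |S_i \ M_c|. The sketch below assumes that reading; `partition_gain` is an illustrative name.

```python
def partition_gain(partition, matches):
    """Sum over parts of (#sequences matching c) * (#sequences not matching c)."""
    gain = 0
    for part in partition:
        hit = sum(1 for s in part if s in matches)
        gain += hit * (len(part) - hit)
    return gain

seqs = ["CATCAGA", "TTCAGT", "TAT", "AATAG", "AATCAG"]
m_tcag = {s for s in seqs if "TCAG" in s}  # matches of candidate TCAG
m_aat = {s for s in seqs if "AAT" in s}    # matches of candidate AAT

# D = { }: a single part containing all five sequences.
gain_first = partition_gain([seqs], m_tcag)

# D = {TCAG}: two parts, matching and non-matching sequences.
after_tcag = [[s for s in seqs if s in m_tcag],
              [s for s in seqs if s not in m_tcag]]
gain_second = partition_gain(after_tcag, m_aat)
print(gain_first, gain_second)
```

Each sequence is visited once per call and set membership is a constant-time lookup on average, so the work is linear in the number of sequences rather than in the number of pairs.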

34 Lazy Update of Gains
Coverage gains are monotonically non-increasing during the algorithm
Re-compute the coverage gain for a candidate only if its last saved gain is higher than the gain of the current best candidate
In practice this speeds up the selection algorithm by a factor of ~2
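One common way to realize lazy updates is a priority queue keyed by the last saved gain. Because gains never increase, a saved gain is an upper bound on the true gain, so a candidate whose fresh gain still tops the queue can be selected without touching the others. The heap-based variant below is an assumption about how this might be coded, not the authors' implementation.

```python
import heapq
from itertools import combinations

def lazy_greedy(sequences, candidates):
    """Greedy selection that recomputes a gain only when the candidate reaches the top."""
    occurs = {c: [c in s for s in sequences] for c in candidates}
    open_pairs = set(combinations(range(len(sequences)), 2))

    def gain(c):
        return sum(1 for i, j in open_pairs if occurs[c][i] != occurs[c][j])

    # Max-heap via negated gains; the index makes tie-breaking deterministic.
    heap = [(-gain(c), idx, c) for idx, c in enumerate(candidates)]
    heapq.heapify(heap)
    selected = []
    while open_pairs and heap:
        _, idx, c = heapq.heappop(heap)
        fresh = gain(c)
        if fresh == 0:
            continue  # gains never increase, so c can be dropped for good
        if heap and fresh < -heap[0][0]:
            heapq.heappush(heap, (-fresh, idx, c))  # stale: re-queue with updated gain
            continue
        selected.append(c)  # fresh gain is at least every other saved gain
        open_pairs = {(i, j) for i, j in open_pairs
                      if occurs[c][i] == occurs[c][j]}
    return selected

seqs = ["CATCAGA", "TTCAGT", "TAT", "AATAG", "AATCAG"]
lazy_result = lazy_greedy(seqs, ["TCAG", "AAT", "CAT", "TAT"])
print(lazy_result)
```

On this toy instance the result equals that of recomputing all gains in every step; the savings only become visible with many candidates.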

35 Algorithm Extensions
Degenerate bases
- A pair of sequences is separated by candidate c if
  - c has at least one perfect match with one of the sequences, and
  - c has perfect mismatches at all positions of the other sequence
- Gain computation done in O(n^2) time using a simple coverage matrix data structure
Redundancy r
- A pair of sequences is counted in the gain function until r distinguishers separate it
Distinguisher cross-hybridization
- Minimum edit distance, or maximum common substring weight, bound for every pair of selected distinguishers
- Candidates incompatible with a selected distinguisher are removed from the candidate list
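The redundancy extension changes only the gain function: a pair keeps counting until r selected distinguishers separate it. The sketch below tracks a per-pair counter, in the spirit of the coverage matrix mentioned above. It ignores degenerate bases and cross-hybridization constraints, and all names are illustrative.

```python
from itertools import combinations

def greedy_with_redundancy(sequences, candidates, r):
    """Greedy selection in which every pair should be separated r times."""
    occurs = {c: [c in s for s in sequences] for c in candidates}
    # times_separated[pair] = number of selected distinguishers separating the pair
    times_separated = {p: 0 for p in combinations(range(len(sequences)), 2)}
    remaining, selected = list(candidates), []

    def gain(c):
        # Count only pairs that still need more separating distinguishers.
        return sum(1 for (i, j), t in times_separated.items()
                   if t < r and occurs[c][i] != occurs[c][j])

    while remaining:
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break
        selected.append(best)
        remaining.remove(best)
        for (i, j) in times_separated:
            if occurs[best][i] != occurs[best][j]:
                times_separated[(i, j)] += 1
    return selected, times_separated

seqs = ["CATCAGA", "TTCAGT", "TAT", "AATAG", "AATCAG"]
cands = ["TCAG", "AAT", "CAT", "TAT"]
selected_r1, _ = greedy_with_redundancy(seqs, cands, r=1)
selected_r2, _ = greedy_with_redundancy(seqs, cands, r=2)
print(selected_r1, selected_r2)
```

With r=1 this reduces to the basic greedy selection; with r=2 the fourth candidate is also selected because some pairs are separated only once by the first three.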

36 Overview
- Problem Formulation and Previous Work
- Greedy Setcover Algorithm
- Experimental Results
- Conclusions

37 Testcases
Randomly generated instances
- Equal probabilities assigned to each of the four nucleotides
Microbial genomes extracted from NCBI databases
- Sequence lengths between 490 Kbases and 4.75 Mbases
- Small number of degenerate bases

38 Selection time, L=10k, r=1
- basic: O(n^2) computation of gains using matrix data structure
- partition: O(n) computation of gains using partition-based data structure

39 Candidate Sampling, n=1000, L=10k, r=1

40 Comparison to ICH, L=10k, r=1
[Table: Algo, n, ⌈log_2 n⌉, ICH, SGA]

41 Varying Redundancy, L=10k

42 NCBI testcase
20 NCBI microbial genomic sequences
- Distinguisher melting temperature range of [...] °C
- GC content range of 40-60%
- Max common subsequence weight bound of 5
  - weight(A)=weight(T)=1, weight(C)=weight(G)=2

43 NCBI testcase, r=1
[Table: Organism, Mb, Barcode; each distinguisher annotated with GC (%) and Tm (°C)]
Distinguishers:
- AACTGTCTCACGACGTTCTGAA
- GATTCGAACCCCCGA
- GTGGATGCCTTGGCA
- GGACTACCAGGGTATCTAATCCTG
- AAAGAAGATAGAGCAGCAGCT
- AAGCGCGTCGCAAA
- CACAAGGAGTGAGTGTTGC
- CGGTTTTGTGCTTCATGG
- CCATTGACAATTTCAACACC
Organisms:
- Nanoarchaeum equitans Kin4-M
- Mycobacterium tuberculosis CDC
- Brucella suis 1330 chromosome
- Leifsonia xyli subsp. xyli str. CTCB
- Mannheimia succiniciproducens MBEL55E
- Geobacter sulfurreducens PCA
- Rickettsia typhi str. Wilmington
- Picrophilus torridus DSM
- Mesoplasma florum L
- Methylococcus capsulatus str. Bath
- Propionibacterium acnes KPA
- Mycoplasma mobile 163K
- Mycoplasma hyopneumoniae
- Bacillus licheniformis DSM
- Legionella pneumophila subsp. pneumophila str. Philadelphia
- Onion yellows phytoplasma OY-M DNA
- Staphylococcus aureus subsp. aureus strain MRSA
- Staphylococcus aureus strain MSSA
- Burkholderia pseudomallei strain K96243 chromosome
- Bartonella henselae strain Houston

44 Results on 29 Microbial Sequences (76 Mb)
[Table: Redun, l_min, l_max, MinEdit, Select Time, #Distinguishers]

45 Overview
- Problem Formulation and Previous Work
- Greedy Setcover Algorithm
- Experimental Results
- Conclusions

46 Conclusions
We provided highly scalable algorithms for the robust string barcoding problem, capable of handling whole genomic sequences of up to bacterial size
Distinguisher selection based on whole genomic sequences results in a number of distinguishers nearly matching the information-theoretic lower bounds for the problem
The software can be used online at