1 Highly Scalable Algorithms for Robust String Barcoding Bhaskar DasGupta * Kishori M. Konwar Ion Mandoiu Alex Shavartsman Computer Science & Engineering.

1 Highly Scalable Algorithms for Robust String Barcoding
Bhaskar DasGupta*, Kishori M. Konwar, Ion Mandoiu, Alex Shavartsman
Computer Science & Engineering Department, University of Connecticut, Storrs, CT
* Department of Computer Science, University of Illinois at Chicago, Chicago, IL

2 Motivation
There are many critical situations when one needs to rapidly identify an unknown genomic sequence from among a given set of known sequences
- Rapid identification of pathogens in epidemic outbreaks
- Monitoring of microbial communities, e.g., in environmental studies
- Fast database search

3 Possible Approaches
Sequencing based: sequence the unknown DNA, then use a similarity search program such as BLAST to match it against the known pathogen sequences in a database
- Sequencing is prohibitively expensive and time consuming
Hybridization based: identify the unknown sequence by testing for the presence of certain subsequences
- Subsequence tests can be performed quickly and at low cost using a variety of hybridization-based methods

4 Sequence Fingerprints
For each sequence, find a subsequence that appears in that sequence and only in that sequence
Sequence   GTTGC  GTTC  CAT
CAGTTGC    1      0     0
CAGTTC     0      1     0
CATGGA     0      0     1
Sequence barcodes: 0/1 vectors
When using fingerprints, barcode length = #sequences

5 String Barcoding (Borneman et al.’01, Rash & Gusfield’02)
Unique occurrence of tested subsequences is not needed, as long as the 0/1 barcodes are unique
Sequence   TG  CAGT
CAGTTGC    1   1
CAGTTC     0   1
CATGGA     1   0
When using non-unique subsequences, barcode length can be much smaller than #sequences
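
The two barcode tables on slides 4 and 5 can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `barcode` and `barcodes_are_unique` are hypothetical, and plain substring search stands in for a hybridization test.

```python
def barcode(sequence, distinguishers):
    """Return the 0/1 barcode of a sequence, one bit per distinguisher."""
    return "".join("1" if d in sequence else "0" for d in distinguishers)


def barcodes_are_unique(sequences, distinguishers):
    """True if the distinguishers give every sequence a different barcode."""
    codes = [barcode(s, distinguishers) for s in sequences]
    return len(set(codes)) == len(codes)


sequences = ["CAGTTGC", "CAGTTC", "CATGGA"]
for s in sequences:
    # Two non-unique distinguishers suffice for these three sequences.
    print(s, barcode(s, ["TG", "CAGT"]))
```

With the fingerprints GTTGC, GTTC, CAT the same function yields barcodes of length 3, one bit per sequence.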

6 Overview
- Problem Formulation and Previous Work
- Greedy Setcover Algorithm
- Experimental Results
- Conclusions

7 Problem Definition
Given: Genomic sequences g_1, …, g_n
Find: Minimum number of distinguisher strings t_1, …, t_k
Such that: For every g_i ≠ g_j there exists a distinguisher t_l which is a substring of g_i or g_j but not of both
- At least log_2 n distinguishers needed
- n distinguishers are always sufficient

8 Computational Complexity [Berman et al.’04]
Cannot be approximated within a factor of (1-ε) ln(n) unless NP = DTIME(n^(log log n))

9 Rash & Gusfield Integer Program
A non-redundant set of candidate distinguishers is generated using a suffix tree
One variable v_i for each candidate distinguisher x_i
- v_i = 1 if x_i is selected
- v_i = 0 if x_i is not selected

10 Integer Program Example
Minimize
  V_TG + V_ATGGA + V_CAGT + V_TTC + V_GTGC     #objective function
Such that
  V_TG + V_TTC + V_GTGC >= 1                   #constraint to cover pair 1,2
  V_ATGGA + V_CAGT + V_GTGC >= 1               #constraint to cover pair 1,3
  V_TG + V_ATGGA + V_CAGT + V_TTC >= 1         #constraint to cover pair 2,3
Binaries                                       #all variables are 0/1
  V_TG V_ATGGA V_CAGT V_TTC V_GTGC
End
Sequence     TG  ATGGA
1. CAGTGC    1   0
2. CAGTTC    0   0
3. CCATGGA   1   1
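
As a sanity check on the example, the covering constraints can be derived mechanically: a candidate appears in the constraint for a pair exactly when it occurs in one of the two sequences but not the other. The sketch below assumes plain substring matching; `pair_constraints` is a hypothetical helper, not part of any solver.

```python
from itertools import combinations


def pair_constraints(sequences, candidates):
    """Map each pair (i, j), numbered from 1, to the candidates separating it."""
    constraints = {}
    for (i, s), (j, t) in combinations(enumerate(sequences, start=1), 2):
        # c separates the pair if it occurs in exactly one of the two sequences.
        constraints[(i, j)] = [c for c in candidates if (c in s) != (c in t)]
    return constraints


sequences = ["CAGTGC", "CAGTTC", "CCATGGA"]
candidates = ["TG", "ATGGA", "CAGT", "TTC", "GTGC"]
for pair, terms in pair_constraints(sequences, candidates).items():
    print(" + ".join("V_" + c for c in terms), ">= 1", "#pair", pair)
```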

11 Limitations of Integer Program Method
Works only for moderately sized datasets
- 50-150 sequences
- Average sequence length ~1000 nucleotides
- Up to 4 hours needed to come within 20% of optimum

12 Information Content Heuristic [Berman et al. 2004]
- Keep track of the partition defined by the distinguishers selected so far
- In every step, choose the candidate that reduces partition entropy by the largest amount
Theorem: The Information Content Heuristic always finds a number of distinguishers within a factor of 1+ln(n) of the optimum

13 Information Content Heuristic
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = { }
Entropy = log_2 5!

14 Information Content Heuristic
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = { }, c = TCAG
New Entropy = log_2 (3! 2!)
Change = log_2 5! - log_2 (3! 2!)

15 Information Content Heuristic
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = {TCAG}, c = AAT
New Entropy = log_2 (2! 1! 1! 1!)
Change = log_2 (3! 2!) - log_2 (2!)

16 Information Content Heuristic
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = {TCAG, AAT}
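
The entropy values in this example follow from the sizes of the partition classes: the entropy is log_2 of the product of the factorials of the class sizes. A minimal sketch, assuming exact substring matches and using the hypothetical helper names `partition` and `entropy`:

```python
from collections import defaultdict
from math import factorial, log2


def partition(sequences, distinguishers):
    """Group sequences that have the same barcode under the distinguishers."""
    classes = defaultdict(list)
    for s in sequences:
        classes[tuple(d in s for d in distinguishers)].append(s)
    return list(classes.values())


def entropy(sequences, distinguishers):
    """log_2 of the product of factorials of the partition class sizes."""
    return sum(log2(factorial(len(c))) for c in partition(sequences, distinguishers))


seqs = ["CATCAGA", "TTCAGT", "TAT", "AATAG", "AATCAG"]
for D in ([], ["TCAG"], ["TCAG", "AAT"]):
    print(D, round(entropy(seqs, D), 3))
```

The heuristic would pick, in each step, the candidate with the largest drop in this quantity; the entropy reaches 0 exactly when all barcodes are unique.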

17 Limitations of ICH
Real genomic sequences contain degenerate nucleotides (e.g., N for any of {A,T,C,G}) due to sequencing errors and known single nucleotide polymorphisms
Distinguisher-to-sequence matches:
- Perfect matches
- Perfect mismatches
- Uncertain matches
Information Content cannot be defined in the presence of uncertain matches
Example for sequence ATCNAT:
Candidate  Match
ATC        1
CCC        0
CCA        ?
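
The three match types can be illustrated with a small classifier. This sketch makes two assumptions not stated on the slide: N is the only degenerate symbol, and candidates contain only A, C, G, T. The function name `match_status` is hypothetical.

```python
def match_status(candidate, sequence):
    """Return '1' (perfect match), '0' (perfect mismatch) or '?' (uncertain)."""
    k = len(candidate)
    uncertain = False
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        if window == candidate:
            # Candidate has no N (assumption), so this occurrence is exact.
            return "1"
        if all(w == c or w == "N" for w, c in zip(window, candidate)):
            # Every definite base agrees, but an N leaves the match open.
            uncertain = True
    return "?" if uncertain else "0"


for cand in ("ATC", "CCC", "CCA"):
    print(cand, match_status(cand, "ATCNAT"))
```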

18 Other Heuristics (Cazalis et al. 2004)
Greedy setcover, simulated annealing, and genetic algorithms for distinguisher selection
To achieve practical running time, only a small random subset (2000 candidates) of all candidate distinguishers is considered
- No data provided on the loss of solution quality due to this restriction

19 Overview
- Problem Formulation and Previous Work
- Greedy Setcover Algorithm
- Experimental Results
- Conclusions

20 Setcover Greedy Heuristic
Phase I: Candidate Generation
- Generate a representative set of candidate distinguishers from the source sequences
Phase II: Greedy Distinguisher Selection
- In every step, choose the candidate that distinguishes the largest number of not yet distinguished pairs

21 Candidate Generation
A set of candidate distinguishers guaranteed to contain an optimum solution is generated from the sequences
We do not generate certain redundant candidates
- A candidate is redundant if there is another candidate that appears in exactly the same set of sequences
- For every sequence we generate only one of the substrings that appear exclusively in that sequence

22 Efficient Candidate Generation
Our implementation uses simple array data structures
- We generate candidates in increasing order of length
- Exact match positions for candidates of length l-1 are used to generate the exact matches for candidates of length l
Candidates that do not satisfy the given biochemical constraints, such as minimum/maximum length, GC content, or melting temperature, are discarded
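
The length-by-length generation can be sketched as follows. This is not the authors' array-based implementation: it uses Python dictionaries, handles only exact matches, applies only length bounds (no GC or melting temperature filter), and additionally drops candidates that occur in every sequence since they separate no pair. The name `generate_candidates` is hypothetical.

```python
def generate_candidates(sequences, min_len=1, max_len=None):
    """Return {candidate: frozenset of indices of sequences containing it},
    keeping one candidate per distinct set of sequences."""
    if max_len is None:
        max_len = max(len(s) for s in sequences)
    # Occurrences of the empty string: every position of every sequence.
    positions = {"": [(i, p) for i, s in enumerate(sequences) for p in range(len(s))]}
    seen, candidates = set(), {}
    for length in range(1, max_len + 1):
        extended = {}
        # Extend each occurrence of a (length-1)-mer by one base.
        for occurrences in positions.values():
            for i, p in occurrences:
                if p + length <= len(sequences[i]):
                    extended.setdefault(sequences[i][p:p + length], []).append((i, p))
        positions = extended
        if length < min_len:
            continue
        for cand, occurrences in positions.items():
            signature = frozenset(i for i, _ in occurrences)
            # Skip redundant candidates and those present in all sequences.
            if len(signature) == len(sequences) or signature in seen:
                continue
            seen.add(signature)
            candidates[cand] = signature
    return candidates


print(generate_candidates(["CAGTGC", "CAGTTC", "CCATGGA"]))
```

Because shorter candidates are visited first, the representative kept for each set of sequences is one of the shortest substrings with that occurrence pattern.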

23 Setcover Greedy Heuristic
Phase I: Candidate Generation
- Generate a set of candidate distinguishers from the source sequences
Phase II: Greedy Distinguisher Selection
- In every step, choose the candidate that distinguishes the largest number of not yet distinguished pairs

24 Distinguisher Selection as Set Cover
Set Cover Problem: given a universal set U and a family of subsets, find a minimum number of subsets covering U
Distinguisher selection is a special case of set cover:
- Elements to be covered are the pairs of sequences
- Each candidate distinguisher defines a set of pairs that it separates
By a classical result, the greedy algorithm has an approximation factor of 1+ln(|U|)
- Setcover greedy has an approximation factor of 2*ln(n) for distinguisher selection with n sequences

25 Distinguisher Selection
Start with an empty set D of distinguishers
While there are pairs of sequences not yet distinguished, do:
- Compute for each remaining candidate c its coverage gain Δ(c, D), the number of not yet distinguished pairs of sequences that are distinguished by c
- Add the candidate with maximum coverage gain to D
Return D
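
A direct, unoptimized transcription of this loop is sketched below. It stores the set of separated pairs for every candidate explicitly, which is quadratic in the number of sequences and only meant to make the steps concrete; `greedy_select` is a hypothetical name and exact substring matching is assumed.

```python
from itertools import combinations


def greedy_select(sequences, candidates):
    """Greedy set cover over pairs of sequences; returns the selected list D."""
    pairs = set(combinations(range(len(sequences)), 2))
    separates = {
        c: {(i, j) for i, j in pairs if (c in sequences[i]) != (c in sequences[j])}
        for c in candidates
    }
    selected, uncovered = [], set(pairs)
    while uncovered:
        # Coverage gain = number of still-uncovered pairs separated by c.
        best = max(candidates, key=lambda c: len(separates[c] & uncovered))
        if not separates[best] & uncovered:
            raise ValueError("remaining pairs cannot be distinguished")
        selected.append(best)
        uncovered -= separates[best]
    return selected


seqs = ["CATCAGA", "TTCAGT", "TAT", "AATAG", "AATCAG"]
print(greedy_select(seqs, ["TCAG", "AAT", "CAT"]))
```

The candidate CAT is added here only so that the last remaining pair (CATCAGA, TTCAGT) can be separated; it does not appear on the slides.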

26-28 Computation of Δ(c, D)
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = { }, c = TCAG
Δ(c, D) = 3 x (5-3) = 6

29-31 Computation of Δ(c, D)
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = {TCAG}, c = AAT
Δ(c, D) = 1 x (2-1) + 1 x (3-1) = 3

32 Computation of Δ(c, D)
Sequences: CATCAGA, TTCAGT, TAT, AATAG, AATCAG
D = {TCAG, AAT}

33 Computation of Δ(c, D)
S_1, S_2, …, S_k are the subsets in the partition defined by D
M_c is the set of matches of candidate c
Using simple data structures, the computation can be done in linear time (in the number of sequences)
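
The worked examples on the preceding slides (3 x (5-3) = 6 and 1 x (2-1) + 1 x (3-1) = 3) are consistent with summing, over the partition classes, the number of matched sequences times the number of unmatched sequences in that class: Δ(c, D) = Σ_i |M_c ∩ S_i| · |S_i \ M_c|. The sketch below assumes this reading; the names `coverage_gain` and `refine` are hypothetical, and sequences are represented by their indices.

```python
def coverage_gain(partition, matches):
    """Sum over classes of (#matched in class) x (#unmatched in class)."""
    gain = 0
    for block in partition:
        hit = len(block & matches)
        gain += hit * (len(block) - hit)
    return gain


def refine(partition, matches):
    """Split every class into its matched and unmatched parts."""
    return [part for block in partition
            for part in (block & matches, block - matches) if part]


seqs = ["CATCAGA", "TTCAGT", "TAT", "AATAG", "AATCAG"]


def matched(c):
    return {i for i, s in enumerate(seqs) if c in s}


P = [set(range(len(seqs)))]           # D = { }: a single class
print(coverage_gain(P, matched("TCAG")))
P = refine(P, matched("TCAG"))        # D = {TCAG}
print(coverage_gain(P, matched("AAT")))
```

Since the classes are disjoint, one pass over them touches each sequence a bounded number of times, which is where the linear running time comes from.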

34 Lazy Update of Gains
Coverage gains are monotonically non-increasing during the algorithm
Re-compute the coverage gain of a candidate only if its last saved gain is higher than the gain of the current best candidate
In practice this speeds up the selection algorithm by a factor of ~2
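
One common way to realize such lazy updates is a max-heap keyed by the last saved gain; the slide does not say which data structure the authors use, so the heap is an assumption of this sketch, as is the name `lazy_greedy`. A popped candidate is re-evaluated, and it is selected only if its fresh gain still beats every saved gain left in the heap, which is safe because saved gains can only overestimate.

```python
import heapq


def lazy_greedy(sequences, candidates):
    """Greedy selection that recomputes a gain only when it might be the best."""
    matches = {c: {i for i, s in enumerate(sequences) if c in s} for c in candidates}
    partition = [set(range(len(sequences)))]

    def gain(c):
        return sum(len(b & matches[c]) * len(b - matches[c]) for b in partition)

    # Heap entries: (negated saved gain, tie-breaker, candidate).
    heap = [(-gain(c), k, c) for k, c in enumerate(candidates)]
    heapq.heapify(heap)
    selected = []
    while heap and any(len(b) > 1 for b in partition):
        _, k, c = heapq.heappop(heap)
        fresh = gain(c)
        if heap and fresh < -heap[0][0]:
            # Saved gain was stale: push back with the updated value.
            heapq.heappush(heap, (-fresh, k, c))
            continue
        if fresh == 0:
            break  # no remaining candidate separates anything
        selected.append(c)
        partition = [p for b in partition
                     for p in (b & matches[c], b - matches[c]) if p]
    return selected


seqs = ["CATCAGA", "TTCAGT", "TAT", "AATAG", "AATCAG"]
print(lazy_greedy(seqs, ["TCAG", "AAT", "CAT"]))
```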

35 Algorithm Extensions
Degenerate bases
- A pair of sequences is separated by candidate c if
  - c has at least one perfect match with one of the sequences, and
  - c has perfect mismatches at all positions of the other sequence
- Gain computation done in O(n^2) time using a simple coverage matrix data structure
Redundancy r
- A pair of sequences is counted in the gain function until r distinguishers separate it
Distinguisher cross-hybridization
- Minimum edit distance, or maximum common substring weight, bound for every pair of selected distinguishers
- Candidates incompatible with a selected distinguisher are removed from the candidate list
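
The redundancy extension can be sketched with a per-pair counter playing the role of the coverage matrix: a pair keeps contributing to the gain until r selected distinguishers separate it. The code below is an illustration under simplifying assumptions (exact substring matching, no degenerate bases, no cross-hybridization filter); `select_with_redundancy` is a hypothetical name.

```python
from itertools import combinations


def select_with_redundancy(sequences, candidates, r):
    """Greedy selection where each pair should be separated r times."""
    pairs = list(combinations(range(len(sequences)), 2))
    separates = {
        c: [p for p in pairs if (c in sequences[p[0]]) != (c in sequences[p[1]])]
        for c in candidates
    }
    cover = dict.fromkeys(pairs, 0)  # how many selected distinguishers separate each pair
    remaining, selected = list(candidates), []

    def gain(c):
        # Only pairs separated fewer than r times still count.
        return sum(1 for p in separates[c] if cover[p] < r)

    while remaining and any(v < r for v in cover.values()):
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break  # redundancy r cannot be reached with these candidates
        remaining.remove(best)
        selected.append(best)
        for p in separates[best]:
            cover[p] += 1
    return selected, cover


seqs = ["CAGTGC", "CAGTTC", "CCATGGA"]
cands = ["TG", "ATGGA", "CAGT", "TTC", "GTGC"]
print(select_with_redundancy(seqs, cands, r=2))
```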

37 Testcases
Randomly generated instances
- Equal probabilities assigned to each of the four nucleotides
Microbial genomes extracted from NCBI databases
- Sequence lengths between 490 Kbases and 4.75 Mbases
- Small number of degenerate bases

38 Selection time, L=10k, r=1
- basic: O(n^2) computation of gains using matrix data structure
- partition: O(n) computation of gains using partition-based data structure

39 Candidate Sampling, n=1000, L=10k, r=1

40 Comparison to ICH, L=10k, r=1
Algo / n     10    20    50    100   200   500   1000
⌈log_2 n⌉    4     5     6     7     8     9     10
ICH          4.0   5.0   7.0   8.0   10.0  12.2  14.1
SGA          4.0   5.0   7.0   8.0   10.0  12.3  14.1

41 Varying Redundancy, L=10k

42 NCBI testcase
20 NCBI microbial genomic sequences
Distinguisher melting temperature range of 55-60 °C
GC content range of 40-60%
Max common subsequence weight bound of 5
- weight(A)=weight(T)=1, weight(C)=weight(G)=2
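
Of these constraints, the GC content range is the simplest to state in code. A minimal sketch, with hypothetical function names and the 40-60% range taken as inclusive (the slide does not say whether the endpoints are included):

```python
def gc_content(oligo):
    """Percentage of G and C bases in an oligonucleotide."""
    return 100.0 * sum(base in "GC" for base in oligo) / len(oligo)


def within_gc_range(oligo, low=40.0, high=60.0):
    """True if the GC percentage lies in [low, high]."""
    return low <= gc_content(oligo) <= high


print(round(gc_content("AAGCGCGTCGCAAA"), 1), within_gc_range("AAGCGCGTCGCAAA"))
```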

43 NCBI testcase, r=1
Distinguishers (in the order listed on the slide, which may differ from the barcode column order):
AACTGTCTCACGACGTTCTGAA, GATTCGAACCCCCGA, GTGGATGCCTTGGCA, GGACTACCAGGGTATCTAATCCTG, AAAGAAGATAGAGCAGCAGCT, AAGCGCGTCGCAAA, CACAAGGAGTGAGTGTTGC, CGGTTTTGTGCTTCATGG, CCATTGACAATTTCAACACC
Mb    Barcode            Organism
0.49  0 0 0 0 0 0 0 0 1  Nanoarchaeum equitans Kin4-M
4.40  0 0 0 0 0 0 1 0 0  Mycobacterium tuberculosis CDC1551
2.11  0 0 0 0 1 1 0 1 0  Brucella suis 1330 chromosome 1
2.58  0 0 0 0 0 0 1 0 1  Leifsonia xyli subsp. xyli str. CTCB07
2.31  0 0 0 0 1 1 1 0 0  Mannheimia succiniciproducens MBEL55E
3.81  0 0 0 1 0 0 0 0 0  Geobacter sulfurreducens PCA
1.11  0 0 0 0 0 1 1 0 1  Rickettsia typhi str. Wilmington
1.55  0 1 0 0 0 0 0 0 1  Picrophilus torridus DSM 9790
0.79  0 0 0 0 0 0 0 1 1  Mesoplasma florum L1
3.30  0 0 0 0 0 1 0 0 1  Methylococcus capsulatus str. Bath
2.56  0 0 0 0 0 0 1 1 0  Propionibacterium acnes KPA171202
0.78  0 0 0 0 0 1 0 1 1  Mycoplasma mobile 163K
0.89  1 0 0 0 0 1 0 1 1  Mycoplasma hyopneumoniae 232
4.22  0 0 0 0 0 1 1 1 0  Bacillus licheniformis DSM 13
3.40  0 0 0 0 0 1 1 0 0  Legionella pneumophila subsp. pneumophila str. Philadelphia 1
0.86  0 0 0 0 1 1 1 1 0  Onion yellows phytoplasma OY-M DNA
2.90  0 0 1 0 0 1 1 1 1  Staphylococcus aureus subsp. Aureus strain MRSA252
2.80  0 0 0 0 0 1 1 1 1  Staphylococcus aureus strain MSSA476
4.07  0 0 0 0 0 1 0 0 0  Burkholderia pseudomallei strain K96243 chromosome 1
1.93  0 0 0 0 0 1 0 1 0  Bartonella henselae strain Houston-1
Barcode column  1     2     3     4     5     6     7     8     9
GC (%)          60.0  45.5  60.0  50.0  57.1  50.0  52.6  42.9  40.0
Tm (°C)         55.6  59.6  55.4  59.3  56.9  58.6  55.1  55.4  56.3

44 Results on 29 Microbial Sequences (76 Mb)
Redun  l_min  l_max  MinEdit  Select Time  #Distinguishers
1      0      ∞      0        14.2         6.0
1      15     40     6        2.6          8.0
5      0      ∞      0        20.3         21.0
5      15     40     6        8.7          31.0
10     0      ∞      0        22.9         41.0
10     15     40     6        16.4         60.0
20     0      ∞      0        26.8         76.0
20     15     40     6        33.4         123.0

46 Conclusions
We provided highly scalable algorithms for the robust string barcoding problem, capable of handling whole genomic sequences of up to bacterial size
Distinguisher selection based on whole genomic sequences results in a number of distinguishers nearly matching the information-theoretic lower bounds for the problem
The software can be used online at http://dna.engr.uconn.edu/~software/DNA-BAR/
