Precomputing Edit-Distance Specificity of Short Oligonucleotides Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland,

Presentation transcript:

Precomputing Edit-Distance Specificity of Short Oligonucleotides Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

2 Polymerase Chain Reaction

3

4 Primer Specificity
Need to ensure that primers hybridize to a specific (specified) locus only
Require exactly one occurrence of specified sequence
Require no (potential) mis-hybridization loci
Bottleneck computation in primer design
Design / check iteration is problematic

5 k-unique 20-mers
Edit distance as a surrogate for mis-hybridization potential
k-unique loci:
All non-self genomic loci require more than k edits in (global) alignment
Closest non-self genomic locus requires (k+1) edits in (global) alignment
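The edit count in the k-unique definition can be illustrated with a standard unit-cost global alignment (Levenshtein) dynamic program. This is a minimal sketch, not the talk's implementation: the function names `edit_distance` and `within_k_edits` are hypothetical, and unit costs for substitutions, insertions, and deletions are an assumption. The two test pairs are the worst-case alignments shown on slide 8 (with the gap character removed).

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost global alignment (Levenshtein) distance between a and b."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]  # aligning a[:i] against the empty prefix of b costs i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def within_k_edits(mer: str, locus: str, k: int) -> bool:
    """True if locus would disqualify mer from being k-unique (assumed reading)."""
    return edit_distance(mer, locus) <= k


# One indel plus one substitution (first alignment on slide 8)
print(edit_distance("TCCCGCTAGATTGAGATCT", "TCCCGCCTAGATTTAGATCT"))
# Two substitutions (second alignment on slide 8)
print(edit_distance("ACTTGTCCACAGTGCTTAAG", "ACTTGTGCACAGTCCTTAAG"))
```

Under this reading, a 20-mer is k-unique only if `within_k_edits` is false for every non-self locus.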

6 Find all k-unique 20-mers
Naïve algorithm: O(n² km), quadratic in size of genome.
0-unique (exact match) 20-mers: (expected) linear time algorithm
Achieve expected linear time using a hybrid approach (blastn):
Use partial exact match to “seed” expensive dynamic programming alignment
Large chunks → Fast, but miss occurrences
Small chunks → Slow, but correct

7 Baeza-Yates Perleberg: Correct and O(n) for small k
At least 1 chunk is observed with no error.
Small k → Large chunks → Fast and correct
Largest correct chunk: floor(m/(k+1))
[Figure: inexact sequence match between query q and genome g, with chunks marked ≠, =, ≠]

8 Example worst case alignments

TCCCGC-TAGATTGAGATCT
||||||v||||||*||||||
TCCCGCCTAGATTTAGATCT

ACTTGTCCACAGTGCTTAAG
||||||*||||||*||||||
ACTTGTGCACAGTCCTTAAG
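The chunk-size bound floor(m/(k+1)) and the second worst-case alignment can be checked directly. The sketch below uses hypothetical helper names (`max_seed_length`, `shared_kmers`) and simply tests which exact substrings the two 20-mers have in common: the two substitutions split the 20-mer into three error-free runs of six bases, so a seed of length 6 is found while any longer seed is missed.

```python
def max_seed_length(m: int, k: int) -> int:
    """Largest chunk guaranteed to contain an error-free copy: floor(m / (k + 1))."""
    return m // (k + 1)


def shared_kmers(a: str, b: str, size: int) -> set:
    """Exact substrings of the given size that occur in both a and b."""
    kmers_a = {a[i:i + size] for i in range(len(a) - size + 1)}
    kmers_b = {b[j:j + size] for j in range(len(b) - size + 1)}
    return kmers_a & kmers_b


q = "ACTTGTCCACAGTGCTTAAG"
g = "ACTTGTGCACAGTCCTTAAG"

for k in (1, 2, 3, 4):
    print(k, max_seed_length(20, k))   # seed lengths for 20-mers

print(sorted(shared_kmers(q, g, 6)))   # length-6 seeds still hit
print(sorted(shared_kmers(q, g, 7)))   # length-7 seeds miss this alignment
```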

9 Brute-force approach
ACTTGTGCACAGTCCTTAAG
2-mer position table:
AA: 18
AC: 1, 9
AG: 11, 19
CA: 8, 10
CC: 14
CT: 2, 15
GC: 7
GT: 5, 12
TA: 17
TC: 13
TG: 4, 6
TT: 3, 16
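A position table like the one on this slide can be built in a single pass over the sequence. The following is a small sketch; the function name `position_table` is hypothetical, and positions are 1-based to match the slide.

```python
from collections import defaultdict


def position_table(seq: str, size: int) -> dict:
    """Map each substring of the given size to its 1-based start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - size + 1):
        table[seq[i:i + size]].append(i + 1)  # 1-based, as on the slide
    return dict(table)


table = position_table("ACTTGTGCACAGTCCTTAAG", 2)
for mer in sorted(table):
    print(mer, table[mer])
```

In a genome-scale setting the dictionary would typically be replaced by a hashed or sorted array index, but the lookup idea is the same.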

16 Brute-force approach
Divide the genome into 10 Mb blocks
For all pairs of blocks:
  For all l-mer matches:
    Do all pair-wise DPs containing match
    If ≤ k edits, mark position non-unique
300 x 300 pairs of blocks
For 20-mers: k=1 → l=10; k=2 → l=6; k=3 → l=5; k=4 → l=4.
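The loop on this slide can be sketched for a single pair of blocks. This is an illustration under stated assumptions, not the talk's code: the two blocks are treated as distinct sequences (`query` and `reference`), so the self-locus exclusion is not modeled; the verification step uses a free-end-gap dynamic program (minimum edit distance between the mer and any substring of a reference window) in place of whatever alignment routine the original used; and the names `best_match_distance` and `non_unique_positions` are hypothetical.

```python
from collections import defaultdict


def best_match_distance(pattern: str, text: str) -> int:
    """Minimum edit distance between pattern and any substring of text."""
    prev = [0] * (len(text) + 1)  # a match may start anywhere in text at no cost
    for i, pc in enumerate(pattern, start=1):
        cur = [i]
        for j, tc in enumerate(text, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (pc != tc)))
        prev = cur
    return min(prev)  # a match may end anywhere in text


def non_unique_positions(query: str, reference: str, m: int, k: int) -> set:
    """0-based starts of query m-mers that have a reference locus within k edits."""
    seed = m // (k + 1)  # pigeonhole seed length, as on the slide
    index = defaultdict(list)
    for j in range(len(reference) - seed + 1):
        index[reference[j:j + seed]].append(j)

    flagged = set()
    for qi in range(len(query) - seed + 1):
        for rj in index.get(query[qi:qi + seed], ()):
            # every m-mer whose span contains this seed occurrence
            first = max(0, qi + seed - m)
            last = min(qi, len(query) - m)
            for p in range(first, last + 1):
                if p in flagged:
                    continue  # already has a non-uniqueness certificate
                anchor = rj - (qi - p)  # where the mer would start with no indels
                window = reference[max(0, anchor - k):min(len(reference), anchor + m + k)]
                if best_match_distance(query[p:p + m], window) <= k:
                    flagged.add(p)
    return flagged


query = "ACTTGTCCACAGTGCTTAAG"
reference = "GGGG" + "ACTTGTGCACAGTCCTTAAG" + "GGGG"
print(non_unique_positions(query, reference, 20, 2))  # seed length 6
print(non_unique_positions(query, reference, 20, 1))  # seed length 10
```

Skipping positions that are already flagged mirrors the observation that positions eliminated early never need another dynamic program.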

17 Brute-force approach
Things are looking really, really bad:
Seeds are too short
90,000 pair-wise block comparisons
Actually quite good (seed size 12):
Non-uniqueness certificates are dense
Almost all positions eliminated early
Behaves more like linear time than quadratic

18 In practice (edit-dist 4)

19 In practice (edit-dist 4)

20 In practice (edit-dist 4)

21 In practice (edit-dist 3)

22 In practice (edit-dist 3)

23 In practice (edit-dist 4,3,2)

24 In practice (edit-dist 4,3,2)

25 Edit distance 2
After seed size 12: ~27K (0.288%) positions have no match
After seed size 8: ~3K (0.029%) positions have no match
Using seed size 6 is still too slow
Need a more sophisticated hashing strategy
6-mers match in too many places!

26 Spaced seed-set design problem
Given:
mer-size: m (= 20)
# errors: k (= 1, 2, 3)
# cares: l (= 10, 12, 14)
Find the smallest set of spaced seeds that will find all alignments.

27 Solution for (20,2,8)
[Figure: two spaced seed templates, not preserved in the transcript]

TCCCGCGTAGATTGAGATCT
||||||*||||||*||||||
TCCCGCCTAGATTTAGATCT

How can we find these spaced seed set solutions?
One/two table?
2 x false positives

28 Spaced seed set design: set-cover formulation
Set cover instance:
Ground set: all possible placements of the k errors (alignments)
Covering sets: all possible placements of the l care positions
For (m=20, k=2, l=10): 190 elements, 184,756 sets!
Need to reduce the number of sets!
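The instance sizes quoted on this slide follow from binomial coefficients, and the covering relation is easy to state in code. The sketch below assumes a substitution-only model in which a set of care positions finds an alignment exactly when none of its care positions is an error position; the names `covers` and `finds_all_alignments` are hypothetical.

```python
from itertools import combinations
from math import comb

m, k, l = 20, 2, 10
print(comb(m, k))  # ground set: placements of k errors among m positions
print(comb(m, l))  # covering sets: placements of l care positions


def covers(care_positions, error_positions) -> bool:
    """A seed finds an alignment if no care position falls on an error."""
    return not set(care_positions) & set(error_positions)


def finds_all_alignments(seed_set, m: int, k: int) -> bool:
    """True if every placement of k errors is covered by at least one seed."""
    return all(any(covers(seed, errors) for seed in seed_set)
               for errors in combinations(range(m), k))


# Three contiguous 6-mers cover every placement of 2 errors (pigeonhole)
print(finds_all_alignments([range(0, 6), range(6, 12), range(12, 18)], 20, 2))
# Two contiguous 10-mers do not: one error in each half defeats both
print(finds_all_alignments([range(0, 10), range(10, 20)], 20, 2))
```

A greedy or integer-programming solver would then search for the smallest seed set for which `finds_all_alignments` holds.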

29 Dirty secret of spaced seeds
Spaced seeds take O(# cares) to update!
Contiguous seeds are O(1) to update
[Figure: spaced seed (multiple steps to update) vs contiguous seed (1 step to update)]
Constant time update for spaced seeds?
Yes, if they have a certain structure(s)
Restrict spaced seeds to small update cost

30 O(1) spaced seed update
ACGTACGTACGTACGTACGT
1: A G A G
2: C T C T
...
Spaced seed can be updated in 1 step!

31 O(1) spaced seed update
ACGTACGTACGTACGTACGT
ACGTACG -> ACGACG
CGTACGT -> CGTCGT
...
Spaced seed can be updated in 1 step!
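The two examples above can be reproduced by applying a care/don't-care template to a sliding window. The templates used here are inferred from the slide text and are assumptions: `1010101` (every second base) gives the keys on slide 30, and `1110111` gives the keys on slide 31. The sketch also shows one way a periodic seed might be updated in a single step, by keeping one rolling key per phase; the names `apply_template` and `periodic_keys` are hypothetical and this is one reading of the slides, not the talk's implementation.

```python
def apply_template(window: str, template: str) -> str:
    """Keep the bases of window at the positions where template has a '1'."""
    return "".join(base for base, care in zip(window, template) if care == "1")


def periodic_keys(seq: str, period: int, weight: int) -> dict:
    """Keys of a seed with care positions 0, period, 2*period, ... (weight of them).

    The key at position i is derived from the key at position i - period by
    dropping its first base and appending one new base: a single update step.
    """
    span = period * (weight - 1) + 1
    keys = {}
    for i in range(len(seq) - span + 1):
        if i >= period:
            keys[i] = keys[i - period][1:] + seq[i + period * (weight - 1)]
        else:
            keys[i] = seq[i:i + span:period]  # initialise one key per phase
    return keys


seq = "ACGTACGTACGTACGTACGT"
print(apply_template(seq[0:7], "1110111"), apply_template(seq[1:8], "1110111"))
keys = periodic_keys(seq, period=2, weight=4)
print(keys[0], keys[1])
```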

32 O(1) spaced seed update
[Figure: update steps for “Period” and “Runs” seed structures]
Minimize the number of update steps
Weighted set-cover instance…

33 Edit-distance SS-SDP
Position of matching bases might shift!
Need ↓ to get CCGCTAGA
Need ↑ to get CCGCTAGA
Set cover formulation no longer works

TCCCGC-TAGATTGAGATCT
||||||v||||||*||||||
TCCCGCCTAGATTTAGATCT

34 Edit-Distance SS-SDP
Use a variation on set cover: q: ,r: covers:
Pay for query & reference update costs separately
Control size of problem by only enumerating templates with small update cost

r:TCCCGC-TAGATTGAGATCT
  ||||||v||||||*||||||
q:TCCCGCCTAGATTTAGATCT

35 Correct solutions for 1-unique 20-mers
10^7 random sequence x 10^7 random sequence
Seed: (Best single seed solution, weight 10) ~ 9.5 expensive dynamic programs per locus
Seed set: ; (weight 11) ~ 8.9 DP/locus
Seed set (weight 11) ~ 7.8 DP/locus
Seed set: ; (weight 12) ~ 2.2 DP/locus

36 Correct solutions for 1-unique 20-mers
10^7 random sequence x 10^7 random sequence
Seed set: ; (weight 12) ~ 2.5 DP/locus
Seed set (weight 12) ~ 1.8 DP/locus
Seed set: ; (weight 13) ~ 0.56 DP/locus (same specificity as contiguous seed weight 12)
Seed set (weight 14) ~ x

37 Correct solutions for 2-unique 20-mers
Seed: (Best single seed solution, weight 6) ~ 2439 DP/locus
Weight 10: 73 DP/locus (specificity of 8/9 contig seed)
Weight 12: 10 DP/locus (specificity of 10 contig seed)

38 k-unique human 20-mers
No 4-unique 20-mers
No 3-unique 20-mers
% of (forward) human 20-mers are 2-unique
In total, about 1 every 2638 bases
Fast 2-uniqueness oracle

39 Genome Browser Track
Edit Distance:
UCSC Track: 1-unique 20-mers
UCSC Track: 2-unique 20-mers

40 Integration with High Performance Computing
Break sequence into chunks of size
Remember which positions have been eliminated.
Integrated with (UMIACS) Condor
Too unreliable for very large sequences
NFS filesystem is unreliable
Simultaneous jobs capped at ~ 300
Integrated with Hadoop/MapReduce on 80 nodes (Edwards Lab)
Reliability improved, DFS solves (some) filesystem woes
Much better scalability (in theory, yet to be tested)
Explicit synchronization of reduce step is undesirable.

41 Other improvements
Other groupings are possible:
Species designation on FASTA defline can be any suitable partition
Constraints on the position of edits:
False positive due to mishybridization at 3’ end is unlikely to be observed with some technologies
Constraint on valid Tm range:
Computed as in Primer3
Can eliminate undesirable mers early

42 Conclusions
Precompute of human k-unique 20-mers is now feasible!
Faster for large edit-distance!
Need spaced seed-set designs
Constant time update for spaced seeds
Good integer programming formulation of SS-SDP
Limited template enumeration based on update cost
Work with integer programming experts to solve effectively