Precomputing Edit-Distance Specificity of Short Oligonucleotides

Precomputing Edit-Distance Specificity of Short Oligonucleotides Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

Polymerase Chain Reaction

Primer Specificity
Need to ensure that primers hybridize to a specific (specified) locus only.
Require exactly one occurrence of the specified sequence.
Require no (potential) mis-hybridization loci.
Bottleneck computation in primer design.
Design / check iteration is problematic.

k-unique 20-mers
Edit distance as a surrogate for mis-hybridization potential.
k-unique loci: all non-self genomic loci require more than k edits in (global) alignment.
Equivalently, the closest non-self genomic locus requires at least (k+1) edits in (global) alignment.
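The definition above can be made concrete with a minimal sketch. It assumes unit-cost (Levenshtein) edits and that the candidate non-self loci are already supplied as strings; the names `edit_distance` and `is_k_unique` and the toy sequences are hypothetical and do not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Global edit distance with unit costs for substitution, insertion, deletion."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def is_k_unique(mer: str, other_loci: list[str], k: int) -> bool:
    """True if every supplied non-self locus needs more than k edits."""
    return all(edit_distance(mer, locus) > k for locus in other_loci)


# Toy example (assumed data): one candidate locus is a single substitution away.
mer = "ACGTACGT"
others = ["ACGTTCGT", "TTTTTTTT"]
print(is_k_unique(mer, others, k=0))  # closest locus is 1 edit away
print(is_k_unique(mer, others, k=1))
```

Enumerating the non-self loci of a real genome is the expensive part and is not addressed by this sketch.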

Find all k-unique 20-mers
Naïve algorithm: O(n^2 km), quadratic in the size of the genome.
0-unique (exact match) 20-mers: (expected) linear time algorithm.
Achieve expected linear time using a hybrid approach (blastn): use partial exact match to “seed” expensive dynamic programming alignment.
Large chunks → fast, but miss occurrences.
Small chunks → slow, but correct.
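For the 0-unique (exact match) case, one way to get expected linear time is to count every m-mer in a hash table. The sketch below assumes forward-strand matching only and ignores reverse complements; `zero_unique_positions` and the toy genome are hypothetical names and data.

```python
from collections import Counter


def zero_unique_positions(genome: str, m: int) -> list[int]:
    """Return 0-based starts of m-mers that occur exactly once in `genome`."""
    n_mers = len(genome) - m + 1
    # One hash-table update per position: expected linear time for fixed m.
    counts = Counter(genome[i:i + m] for i in range(n_mers))
    return [i for i in range(n_mers) if counts[genome[i:i + m]] == 1]


# Toy example: the 3-mer ACG occurs twice, so positions 0 and 4 are dropped.
print(zero_unique_positions("ACGTACGA", 3))
```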

Inexact sequence match
Baeza-Yates Perleberg: correct and O(n) for small k.
At least 1 chunk is observed with no error.
Small k → large chunks → fast and correct.
Largest correct chunk: floor(m/(k+1)).
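The chunking argument is a pigeonhole bound: k edits can touch at most k of k+1 disjoint chunks, so at least one chunk survives exactly. A minimal sketch of the resulting seed filter follows; the function names and the flanked toy text are assumptions, and every hit would still need verification by dynamic programming.

```python
def largest_safe_chunk(m: int, k: int) -> int:
    """Largest chunk length that guarantees one error-free chunk: floor(m/(k+1))."""
    return m // (k + 1)


def exact_chunks(pattern: str, k: int) -> list[tuple[int, str]]:
    """Split `pattern` into k+1 disjoint chunks; return (offset, chunk) pairs."""
    size = largest_safe_chunk(len(pattern), k)
    return [(i * size, pattern[i * size:(i + 1) * size]) for i in range(k + 1)]


def seed_hits(text: str, pattern: str, k: int) -> list[tuple[int, int]]:
    """All (text position, chunk offset) pairs where a chunk occurs exactly."""
    hits = []
    for offset, chunk in exact_chunks(pattern, k):
        pos = text.find(chunk)
        while pos != -1:
            hits.append((pos, offset))
            pos = text.find(chunk, pos + 1)
    return sorted(hits)


pattern = "ACTTGTGCACAGTCCTTAAG"
text = "GG" + "ACTTGTCCACAGTGCTTAAG" + "GG"   # 2 substitutions, arbitrary flanks
print(seed_hits(text, pattern, k=2))          # only the first chunk survives
```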

Example worst case alignments
TCCCGC-TAGATTGAGATCT
||||||v||||||*||||||
TCCCGCCTAGATTTAGATCT

ACTTGTCCACAGTGCTTAAG
||||||*||||||*||||||
ACTTGTGCACAGTCCTTAAG
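These alignments are worst cases in the sense that the two edits are spread so that no error-free run is longer than floor(20/3) = 6. A short check, assuming the alignment rows are given as equal-length strings with "-" marking a gap (`match_runs` is a hypothetical helper):

```python
def match_runs(row_a: str, row_b: str) -> list[int]:
    """Lengths of maximal runs of identical columns in two aligned rows."""
    runs, current = [], 0
    for x, y in zip(row_a, row_b):
        if x == y:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


indel_case = match_runs("TCCCGC-TAGATTGAGATCT", "TCCCGCCTAGATTTAGATCT")
subst_case = match_runs("ACTTGTCCACAGTGCTTAAG", "ACTTGTGCACAGTCCTTAAG")
print(indel_case, subst_case)   # three error-free runs in each alignment
print(max(indel_case) == 20 // 3, max(subst_case) == 20 // 3)
```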

Brute-force approach ACTTGTGCACAGTCCTTAAG 2-mer position table AA:18
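A position table of this kind maps each l-mer to the positions where it starts. The sketch below assumes 1-based positions, which is consistent with the surviving entry AA:18; `position_table` is a hypothetical name.

```python
from collections import defaultdict


def position_table(seq: str, l: int) -> dict[str, list[int]]:
    """Map every l-mer of `seq` to the list of its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - l + 1):
        table[seq[i:i + l]].append(i + 1)   # store 1-based position
    return dict(table)


table = position_table("ACTTGTGCACAGTCCTTAAG", 2)
print(table["AA"])   # the entry shown on the slide
print(table["TG"])   # an l-mer that occurs more than once
```

Repeated l-mers such as TG are the seed matches that trigger the pair-wise dynamic programming step.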

Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG

Brute-force approach
Divide the genome into 10 Mb blocks.
For all pairs of blocks:
  For all l-mer matches:
    Do all pair-wise DPs containing the match.
    If ≤ k edits, mark position non-unique.
300 x 300 pairs of blocks.
For 20-mers: k=1 → l=10; k=2 → l=6; k=3 → l=5; k=4 → l=4.
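The loop above can be sketched on a toy scale as follows. This is not the author's implementation: it skips the 10 Mb blocking, it treats any locus starting within k bases of the query position as "self" (an assumption), and all function names and the toy genome are hypothetical.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Global edit distance with unit costs (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def has_close_locus(genome: str, p: int, q_centre: int, m: int, k: int) -> bool:
    """Try loci near q_centre (start and length each off by up to k)."""
    mer = genome[p:p + m]
    for q in range(q_centre - k, q_centre + k + 1):
        if q < 0 or abs(q - p) <= k:      # assumption: this counts as "self"
            continue
        for length in range(m - k, m + k + 1):
            if q + length <= len(genome) and edit_distance(mer, genome[q:q + length]) <= k:
                return True
    return False


def mark_non_unique(genome: str, m: int, k: int) -> list[bool]:
    """Flag m-mer start positions that have a non-self locus within k edits."""
    seed_len = m // (k + 1)
    n_mers = len(genome) - m + 1
    table = defaultdict(list)
    for i in range(len(genome) - seed_len + 1):
        table[genome[i:i + seed_len]].append(i)

    non_unique = [False] * n_mers
    for positions in table.values():
        for i in positions:               # seed occurrence inside the query m-mer
            for j in positions:           # matching seed occurrence elsewhere
                if i == j:
                    continue
                first = max(0, i - (m - seed_len))
                last = min(i, n_mers - 1)
                for p in range(first, last + 1):   # every m-mer containing seed i
                    if not non_unique[p]:
                        non_unique[p] = has_close_locus(genome, p, p + (j - i), m, k)
    return non_unique


a, b = "ACTTGTCCACAGTGCTTAAG", "ACTTGTGCACAGTCCTTAAG"   # 2 substitutions apart
genome = "GATC" + a + "CGATTGCA" + b + "TTGA"            # arbitrary flanks
flags = mark_non_unique(genome, m=20, k=2)
print(flags[genome.index(a)], flags[genome.index(b)])
```

Skipping positions that are already flagged mirrors the early elimination that makes the full-scale run tractable.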

Brute-force approach
Things are looking really, really bad:
  Seeds are too short.
  90,000 pair-wise block comparisons.
Actually quite good (seed size 12):
  Non-uniqueness certificates are dense.
  Almost all positions eliminated early.
  Behaves more like linear time than quadratic.

In practice (edit-dist 4)

In practice (edit-dist 3)

In practice (edit-dist 4,3,2)

Edit distance 2
After seed size 12: ~27K (0.288%) positions have no match.
After seed size 8: ~3K (0.029%) positions have no match.
Using seed size 6 is still too slow.
Need a more sophisticated hashing strategy.
6-mers match in too many places!

Spaced seed-set design problem
Given:
  mer-size: m (= 20)
  # errors: k (= 1, 2, 3)
  # cares: l (= 10, 12, 14)
Find the smallest set of spaced seeds that will find all alignments.

Solution for (20,2,8)
11111111, 111101111
TCCCGCGTAGATTGAGATCT
||||||*||||||*||||||
TCCCGCCTAGATTTAGATCT
How can we find these spaced seed set solutions?

Spaced seed set design: set-cover formulation
Set cover instance:
  Ground set: all possible placements of the k errors (alignments).
  Covering sets: all possible placements of the l care positions.
For (m=20, k=2, l=10): 190 elements, 184,756 sets!
Need to reduce the number of sets!
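The instance sizes quoted above are binomial coefficients: C(20,2) error placements and C(20,10) care-position placements. A small sketch that reproduces them, assuming substitution-only errors so that a care set covers an error placement exactly when the two are disjoint (`covers` is a hypothetical helper):

```python
from itertools import combinations
from math import comb

m, k, l = 20, 2, 10

# Ground set: every way to place k errors among the m positions.
error_placements = list(combinations(range(m), k))

# Covering sets: every way to choose l care positions (counted, not enumerated).
num_care_sets = comb(m, l)


def covers(care_positions, errors) -> bool:
    """A care set finds an alignment if no error lands on a care position."""
    return set(care_positions).isdisjoint(errors)


contiguous = tuple(range(l))   # care positions 0..9, i.e. 1111111111 at the left end
covered = sum(covers(contiguous, e) for e in error_placements)
print(len(error_placements), num_care_sets, covered)
```

A single care set covers only C(10,2) = 45 of the 190 placements, which is why several seeds are needed.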

Dirty secret of spaced seeds
Spaced seeds take O(# cares) to update!
Contiguous seeds are O(1) to update.
101010101010101 vs 11111111: 8 steps to update vs 1 step to update.
Constant time update for spaced seeds? Yes, if they have a certain structure.

O(1) spaced seed update
ACGTACGTACGTACGTACGT
A G A G
 C T C T
  G A G A
   T C T C
...
Spaced seed 1010101 can be updated in 1 step!

O(1) spaced seed update
“Periodic” spaced seeds can be updated in “constant” time:
  11011011011: 2 steps
  11001100110011: 2 steps
  1000010000100001: 1 step
Need to minimize the number of update steps, not the number of templates.
11111111, 111101111 has update cost 5.
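One way to get a single-step update for a seed with one care per period (such as 1010101 or 1000010000100001) is to keep one rolling key per residue class of the position modulo the period. The sketch below assumes a 2-bit base encoding; the function names are hypothetical, and seeds with several cares per period would need one such step per care.

```python
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def periodic_seed_keys(seq: str, period: int, cares: int) -> list[tuple[int, int]]:
    """(start, key) for each window of the seed 1 0..0 1 0..0 ... 1, one shift-and-add per base."""
    span = period * (cares - 1) + 1
    mask = (1 << (2 * cares)) - 1
    state = [0] * period                  # one rolling key per residue class
    keys = []
    for i, base in enumerate(seq):
        r = i % period
        state[r] = ((state[r] << 2) | ENC[base]) & mask   # the single update step
        start = i - span + 1
        if start >= 0:
            keys.append((start, state[r]))
    return keys


def direct_key(seq: str, start: int, period: int, cares: int) -> int:
    """Reference computation: read all care positions, O(# cares) per window."""
    key = 0
    for c in range(cares):
        key = (key << 2) | ENC[seq[start + c * period]]
    return key


def decode(key: int, cares: int) -> str:
    """Turn a 2-bit-per-base key back into its care bases."""
    return "".join("ACGT"[(key >> (2 * s)) & 3] for s in range(cares - 1, -1, -1))


seq = "ACGTACGTACGTACGTACGT"
keys = periodic_seed_keys(seq, period=2, cares=4)      # seed 1010101
print([decode(key, 4) for _, key in keys[:4]])
```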

Edit-distance SS-SDP
TCCCGC-TAGATTGAGATCT
||||||v||||||*||||||
TCCCGCCTAGATTTAGATCT
Position of matching bases might shift!
Need 11111111 ↓ to get CCGCTAGA.
Need 111101111 ↑ to get CCGCTAGA.
Set cover formulation no longer works.

Edit-Distance SS-SDP
r:TCCCGC-TAGATTGAGATCT
  ||||||v||||||*||||||
q:TCCCGCCTAGATTTAGATCT
Use a variation on set cover: the pair q:111101111, r:11111111 covers this alignment.
Pay for query & reference update costs separately.
Control size of problem by only enumerating templates with small update cost.
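The pairing of a query template with a different reference template can be checked directly on this alignment. In the sketch below, `apply_template` is a hypothetical helper and the 0-based offset 2 is chosen by inspection of the alignment; the ungapped sequences are taken from the two rows.

```python
def apply_template(seq: str, offset: int, template: str) -> str:
    """Read the bases of `seq` under the '1' (care) positions of `template`."""
    window = seq[offset:offset + len(template)]
    if len(window) < len(template):
        raise ValueError("template runs past the end of the sequence")
    return "".join(base for base, bit in zip(window, template) if bit == "1")


r = "TCCCGCTAGATTGAGATCT"    # reference row with the gap removed (19 bases)
q = "TCCCGCCTAGATTTAGATCT"   # query row (20 bases)

r_key = apply_template(r, 2, "11111111")    # contiguous seed on the reference
q_key = apply_template(q, 2, "111101111")   # gapped seed skips the inserted C
print(r_key, q_key, r_key == q_key)
```

The don't-care position in the query template absorbs the inserted base, so both sides hash to the same key even though the matching bases have shifted.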

Solution for (20,2,10)
Query Templates:
   1: 11111111110000000000  Cost: 1
   2: 11111011111000000000  Cost: 5
  27: 11111000001111100000  Cost: 5
  42: 11111000000001111100  Cost: 5
Text Templates:
  32: 11111000000111110000  Cost: 5
  37: 11111000000011111000  Cost: 5
Pairs of templates:
   1: 11111111110000000000   1: 11111111110000000000  Covers: 1274
   2: 11111011111000000000   1: 11111111110000000000  Covers: 260
   2: 11111011111000000000   2: 11111011111000000000  Covers: 1218
   1: 11111111110000000000   2: 11111011111000000000  Covers: 309
  42: 11111000000001111100  32: 11111000000111110000  Covers: 42
  27: 11111000001111100000  32: 11111000000111110000  Covers: 319
  42: 11111000000001111100  37: 11111000000011111000  Covers: 186
  27: 11111000001111100000  37: 11111000000011111000  Covers: 51
  42: 11111000000001111100  42: 11111000000001111100  Covers: 287

k-unique human 20-mers
No 4-unique 20-mers.
No 3-unique 20-mers.
0.038% of (forward) human 20-mers are 2-unique: 1088322 in total, about 1 every 2638 bases.
Fast 2-uniqueness oracle.

F. tularensis 20-mer signatures
Exact match in all six strains.
No match to bacterial background at edit-distance k.
No 3-unique 20-mer signatures.
263 2-unique 20-mer signatures (0.013%).
1.3M 20-mer signatures (no background check).
1.2M 0-unique 20-mer signatures.
580K 1-unique 20-mer signatures.

Conclusions
Precompute of human k-unique 20-mers is now feasible!
Faster for large edit-distance!
Need spaced seed-set designs.
Constant time update for spaced seeds.
Good integer programming formulation of SS-SDP.
Limited template enumeration based on update cost.
Work with integer programming experts to solve effectively.

Next Steps
Publish!
Adapt for Tm and/or hybridization model.
Convert to native BOINC application.
Integrate with primer-design software.