November 2018 Deep Sequencing Seminar Avia Efrat, Tomer Ronen


Minimizers: Reducing Storage Requirements for Biological Sequence Comparison
M. Roberts et al., 2004

Previously on “Algorithms for Deep Sequencing”
Last week: find identical parts between documents while reducing the storage the algorithm needs.
Similarities to last week: we still compare documents, and we still trade a small performance hit for a large storage reduction.
Differences from last week:
- The domain is biology: our “documents” are sequences of nucleic acids (ACGT) and proteins.
- There are different options for choosing the “representative k-mers”.
- We look a bit at similarities, not just identical strings.

Problem
Given two long sequences, find their common substrings and the locations of those substrings.
“Common” means not necessarily identical, but “similar”. The similarity is a function of the differences between the two strings, and the user decides how much difference is acceptable.
[Figure: two pairs of ACGT strings, one pair labeled “Not similar” and one labeled “Similar enough”]

Motivation
DNA assembly: find common patterns at the ends of DNA fragments.

Motivation
Homology: similar parts may suggest common ancestry or a similar “function”.

k-mers
k-mer: a substring of length k. A string of length L contains L-k+1 k-mers.
Intuition: each of the first L-k positions starts a k-mer, which gives L-k k-mers, and the last k positions leave room for exactly one more.
[Figure: the 3-mers of a 5-letter example string, positions 1 to 5]
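The counting argument can be checked with a few lines of Python. This is a minimal sketch; the function name `kmers` and the example string `ACGGT` are illustrative assumptions, not taken from the slides.

```python
def kmers(s, k):
    """Return all k-mers of s, in order of their starting position."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

seq = "ACGGT"  # hypothetical example string, L = 5
k = 3
found = kmers(seq, k)
print(found)                             # ['ACG', 'CGG', 'GGT']
print(len(found) == len(seq) - k + 1)    # True: L - k + 1 k-mers
```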

“Seed and Extend”
We want to find common substrings between two strings. (What about more than two strings? And similarity? Later!)
We take k-mers from both strings. These are the “seeds”. How do we choose these seeds? Good question!
Each seed is represented by a 3-tuple <s, i, p>:
- s: the letters of the k-mer
- i: the index of the string (in our case, ‘1’ or ‘2’)
- p: the starting position of the k-mer in the string

Finding Common k-mers
Sort your list of k-mers (your “seeds”). Identical seeds now sit next to each other, which makes it easy to find the strings they came from and to try to extend the seed matches.
The ability to recognize matches as soon as the database is sorted is called the “collection criterion”.
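A minimal sketch of the sorting step, assuming all k-mers are used as seeds and positions are 0-based. The helper names `seeds` and `common_seeds` are hypothetical.

```python
from itertools import groupby

def seeds(s, i, k):
    """All seeds of string number i as (k-mer, string index, position) tuples."""
    return [(s[p:p + k], i, p) for p in range(len(s) - k + 1)]

def common_seeds(s1, s2, k):
    """Sort the seeds of both strings; identical k-mers become adjacent."""
    all_seeds = sorted(seeds(s1, 1, k) + seeds(s2, 2, k))
    matches = []
    for kmer, group in groupby(all_seeds, key=lambda t: t[0]):
        group = list(group)
        pos1 = [p for _, i, p in group if i == 1]
        pos2 = [p for _, i, p in group if i == 2]
        # Every pairing of a position in string 1 with one in string 2 is a seed match.
        matches.extend((kmer, p1, p2) for p1 in pos1 for p2 in pos2)
    return matches

print(common_seeds("ACGTA", "TTCGT", 3))  # [('CGT', 1, 2)]
```

Sorting costs O(n log n) in the number of seeds, after which one linear pass collects every seed match.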

Seed and Extend - Simple Process
[Figure: two strings sharing an exact match of length 5, with the chosen seed (a 3-mer) marked inside the match]
To find the match of length 5, a 3-mer inside that “window” had to be chosen as a seed. Exact matches shorter than k cannot be found this way.
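The extension step can be sketched as follows for exact matches only. The function name `extend`, the 0-based positions, and the two example strings are assumptions made for illustration.

```python
def extend(s1, p1, s2, p2, k):
    """Grow an exact k-mer match at s1[p1:] and s2[p2:] into a maximal exact match.

    Returns (start in s1, start in s2, match length).
    """
    start1, start2 = p1, p2
    # Extend to the left while the preceding letters agree.
    while start1 > 0 and start2 > 0 and s1[start1 - 1] == s2[start2 - 1]:
        start1 -= 1
        start2 -= 1
    end1, end2 = p1 + k, p2 + k
    # Extend to the right while the following letters agree.
    while end1 < len(s1) and end2 < len(s2) and s1[end1] == s2[end2]:
        end1 += 1
        end2 += 1
    return start1, start2, end1 - start1

# The seed "ACG" sits at position 3 in both hypothetical strings.
print(extend("TTGACGTAA", 3, "CCGACGTCC", 3, 3))  # (2, 2, 5): the match "GACGT"
```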

Storage Cost: Naïve Choice of k-mers
If we don’t want to miss any match, we have to store all k-mers. For simplicity, assume $k \ll L$; then a string of length L has about L k-mers.
In 2004, the genome assembly of a common rat used about $33 \times 10^6$ sequences with an average length of 600 letters each. That gives about $2 \times 10^{10}$ k-mers. A typical choice of k was 20, so we get to $4 \times 10^{11}$ letters.
Even at 2 bits per letter, the letters of one k-mer need 5 bytes. And don’t forget 3 more bytes for the string index (i) and 2 more bytes for the position of the k-mer in the string (p), for 10 bytes per seed.
The total is about 200 GB, and other tasks can be more demanding than the genome assembly of a rat.
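The estimate can be reproduced directly from the numbers on the slide. The variable names are illustrative, and GB is taken as $10^9$ bytes.

```python
reads = 33 * 10**6          # sequences in the rat assembly
avg_len = 600               # average letters per sequence
k = 20                      # typical seed length

kmers_total = reads * avg_len        # about L k-mers per string when k << L
bytes_letters = (2 * k + 7) // 8     # 2 bits per letter, rounded up to whole bytes
bytes_index, bytes_pos = 3, 2        # string index (i) and position (p), as on the slide
bytes_per_seed = bytes_letters + bytes_index + bytes_pos

total_gb = kmers_total * bytes_per_seed / 10**9
print(bytes_per_seed, round(total_gb))  # 10 198
```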

Reducing Storage Requirements
Store fewer k-mers as seeds. But which ones?
Simplest option: store every G-th k-mer (for some G). Problem: two strings that overlap at an offset that is not a multiple of G may share no stored k-mer, so the overlap can be missed completely.
Better option: minimizers. From each group of adjacent k-mers, choose a representative k-mer. “Representative” means that two strings with a significant overlap choose the same k-mer.
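A small sketch of why fixed-stride sampling fails. The strings and the name `sampled_kmers` are made up for illustration: the second string is the first one shifted by a single letter, and with G = 2 the two sampled sets have nothing in common.

```python
def sampled_kmers(s, k, g):
    """Keep only the k-mers that start at positions 0, g, 2g, ..."""
    return {s[p:p + k] for p in range(0, len(s) - k + 1, g)}

shared = "ACGTTGCAAGCT"   # hypothetical shared region
s1 = shared
s2 = "T" + shared          # same region, offset by one letter
k, g = 4, 2

print(sampled_kmers(s1, k, g) & sampled_kmers(s2, k, g))       # set(): overlap missed
print(len(sampled_kmers(s1, k, 1) & sampled_kmers(s2, k, 1)))  # 9 with all k-mers
```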

Minimizers
Minimizers: a special set of representative k-mers.
The representation property (Property 1): if two strings have a significant exact match, then at least one of the minimizers chosen from one will also be chosen from the other.
[Figure: strings annotated with minimizers m1, m2, m3]

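Window of k-mers
- k: seed length (the k in k-mer).
- w: window size, the number of adjacent k-mers in a k-mer group.
- w+k-1: window coverage, the length of the substring covered by all w k-mers.
Example: k=3 and w=5 give a window coverage of w+k-1=7 letters.
[Figure: a window of w k-mers spanning positions 1 to w+k-1, with minimizers m1 and m2 marked]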

Interior Minimizers
(w,k) minimizer: given a window of w consecutive k-mers, the minimizer is the smallest k-mer in the window.
This tactic requires an ordering of the k-mers. Simplest: lexicographic, e.g. “AAAA” < “ABAA” < “CGCG”. Better orderings: later.
The representation property is satisfied (Property 1’): if two strings have a substring of length w+k-1 in common, they have a (w,k) minimizer in common.
[Figure: an example window with k=3, w=5]
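A minimal sketch of interior minimizer selection under lexicographic order, keeping every tied k-mer. The function name `interior_minimizers` and the example strings are assumptions; positions are 0-based.

```python
def interior_minimizers(s, w, k):
    """Return the (k-mer, position) pairs chosen as (w, k) minimizers of s."""
    kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        smallest = min(window)               # lexicographic order
        for offset, kmer in enumerate(window):
            if kmer == smallest:             # on a tie, keep all tied k-mers
                chosen.add((kmer, start + offset))
    return chosen

# Two hypothetical strings sharing "GATTACA", which has length w + k - 1 = 7.
w, k = 5, 3
m1 = {kmer for kmer, _ in interior_minimizers("CCGATTACAGG", w, k)}
m2 = {kmer for kmer, _ in interior_minimizers("TTTGATTACAA", w, k)}
print("ACA" in m1 and "ACA" in m2)  # True: Property 1' holds for this pair
```

The naive loop costs O(w) per window; a sliding-window minimum would bring this down, but the simple form mirrors the definition more directly.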

Interior Minimizers
Example: k=3, w=4, |S|=15, w+k-1=6.
All k-mers: |S|-k+1=13 seeds.
Interior minimizers: 4 seeds.

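Gaps Between Minimizers
Not every letter in the string has to be covered by some minimizer.
- Maximal gap size: (w+k)-2k = w-k
- Complete coverage: w ≤ k (w=k is common)
- Sparse minimizers: w ≫ k
[Figure: minimizer m1 covering positions 1 to k and minimizer m2 ending at position w+k, with a gap between them]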

End Minimizers
Interior minimizers don’t guarantee coverage at the ends of a string: at most w-1 letters at each end might be uncovered.
(u,k) end-minimizer: a (u,k) minimizer chosen from a window of size u that is anchored to one end of the string.
If for some v we build the set of all (u,k) end-minimizers for u ∈ {1,…,v}, we satisfy the end-representation property (Property 2): if the ends of two strings have an exact overlap of at least k letters and at most k+v-1 letters, then they share at least one (u,k) end-minimizer.
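End minimizers can be sketched in the same style, assuming lexicographic order and keeping ties. The name `end_minimizers`, the `end` parameter, and the example strings are hypothetical.

```python
def end_minimizers(s, v, k, end="left"):
    """Return the (k-mer, position) pairs chosen as (u, k) end-minimizers, u = 1..v."""
    n = len(s) - k + 1                      # number of k-mers in s
    chosen = set()
    for u in range(1, min(v, n) + 1):
        # The window of u k-mers is anchored to the chosen end of the string.
        positions = range(0, u) if end == "left" else range(n - u, n)
        window = [(s[p:p + k], p) for p in positions]
        smallest = min(kmer for kmer, _ in window)
        chosen.update(item for item in window if item[0] == smallest)
    return chosen

# Hypothetical pair: the last 5 letters of s1 equal the first 5 letters of s2.
k, v = 3, 4                                  # overlaps of 3 to 6 letters are covered
right = {kmer for kmer, _ in end_minimizers("CCCCGATTA", v, k, end="right")}
left = {kmer for kmer, _ in end_minimizers("GATTAGGGG", v, k, end="left")}
print(right & left)  # {'ATT'}
```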

End Minimizers
Example: k=3, l=16, u ∈ {1,…,l-k+1} = {1,…,14}

Mixed Strategy
Ensure complete letter coverage by using both:
- (w,k) interior minimizers, with w ≤ k
- (u,k) end-minimizers at both ends, for every u ∈ {1,…,w-1}
Every letter in the string is then covered by at least one minimizer.
Example: w=k=3, u ∈ {1,2}.

Ordering - Effect on Storage
Last week we saw that, on average, about $2/(w+1)$ of the positions give a new minimal value in our window (i.e. a new “minimizer”), which is one every $(w+1)/2$ letters.
Until now we dealt with numbers. Let’s switch to nucleic acids (ACGT). We choose minimizers by lexicographic order (i.e. A is the first letter). In case of a tie, we choose all that are tied.
What if we encounter a k-mer of A’s several times in our window? A k-mer of A’s is the first in our order, so we would have to choose all of those k-mers!
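The $2/(w+1)$ density can be checked empirically. The sketch below is an illustration under stated assumptions: the sequence is uniformly random, and k-mers are ordered by an MD5 digest to stand in for the “numbers” of last week (neither choice comes from the slides).

```python
import hashlib
import random

def minimizer_density(s, w, k):
    """Fraction of k-mer positions selected as (w, k) minimizers under a hash ordering."""
    n = len(s) - k + 1
    # One pseudo-random rank per k-mer position.
    rank = [hashlib.md5(s[i:i + k].encode()).hexdigest() for i in range(n)]
    chosen = set()
    for start in range(n - w + 1):
        chosen.add(min(range(start, start + w), key=rank.__getitem__))
    return len(chosen) / n

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(20000))
w, k = 9, 12
density = minimizer_density(seq, w, k)
print(density, 2 / (w + 1))   # the two values should be close
```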

Ordering - Effect on Storage
The same could be said of any run of a single letter: if our “first” letter were G, runs of G’s would hurt us in the same way.
But DNA sequences are not completely random. A is more common than C or G. So if A is our “first” letter, there are more runs of A’s than of C’s or G’s, we get a new minimizer at more than $2/(w+1)$ of the positions (on average), and we store more k-mers than expected.
If we choose the “first” letter (or k-mer) to be a rare one, e.g. CGCG, this mitigates the problem!
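One way to make a rare k-mer such as CGCG come first is a position-dependent letter ranking. The ranking below is a hypothetical choice for illustration, not necessarily the ordering used in the talk.

```python
# Hypothetical letter ranks: C < A < T < G at even (0-based) positions and the
# reverse at odd positions, so that alternating "CGCG..." is the smallest k-mer.
EVEN = {"C": 0, "A": 1, "T": 2, "G": 3}
ODD = {"G": 0, "T": 1, "A": 2, "C": 3}

def rare_first_key(kmer):
    """Sort key that ranks CG-alternating k-mers before runs of a single letter."""
    return tuple((EVEN if i % 2 == 0 else ODD)[c] for i, c in enumerate(kmer))

kmers = ["AAAA", "CGCG", "ACGT", "TTTT"]
print(sorted(kmers))                      # ['AAAA', 'ACGT', 'CGCG', 'TTTT']
print(sorted(kmers, key=rare_first_key))  # ['CGCG', 'AAAA', 'ACGT', 'TTTT']
```

Passing `key=rare_first_key` to `min` in a minimizer routine would swap the ordering without changing anything else.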

Ordering - Effect on Match Significance
Ordering by “rarity” does more than reduce storage: we also want our matches to be significant.
If we want to see whether two articles are similar, the word “protein” appearing in both is more indicative of a resemblance than multiple co-occurrences of the phrase “this is a”. The same holds for genes: a match of CGCG is more significant than a match of AAAA.
The order can therefore affect both the storage requirements and the statistical significance of the matches found. The latter is important when minimizers are sparse (not covering the whole string).

A Bit on Similarity
Like matches, not all mismatches are the same.
BLAST: a family of algorithms for sequence matching.
- Seed & Extend: it starts from seeds and tries to extend from there.
- It can look for similarities, not just exact matches.
- Similarities are determined by a similarity score matrix.

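How Can Minimizer Ordering Help Find Similarities
One possible feature of BLAST is to extend until the similarity score drops below some threshold.
Assume the seed size (k) is 4, the threshold is 7, and the score matrix is the one in the figure.
[Figure: similarity score matrix over A, C, G, T with entries 2, -2, -3, 1, 5, -1, 3; the cell layout is not recoverable]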

Case Study
Faux dataset: a computationally shattered C. elegans genome.
Dataset stats:
- Total genome length: 100 Mbp = $10^8$ base pairs
- About $10^6$ reads with lengths distributed as $N(\mu=537, \sigma=90)$
- 5.7-fold coverage of the genome
- Artificial base errors were inserted, with probabilities taken from actual reads of the human genome

Case Study
The goal: finding overlaps of at least 40 base pairs.
Algorithmic pipeline:
- Seeds were created using minimizers with different (w,k) values, including all k-mers (w=1).
- The Seed & Extend algorithm was executed, detecting overlaps using these seeds.
- In some cases, Symmetrizer was applied to find additional overlaps.

Case Study
The goal: finding overlaps of at least 40 base pairs.
Measurements:
- Run time: in hours, on an average desktop computer.
- T ratio (also recall or sensitivity): the percentage of true overlaps found.
- F/T: false positives divided by true positives.
- Precision: $\frac{TP}{TP + FP} = \frac{1}{1 + F/T}$, where TP and FP are the numbers of true and false positives.
False positives are common due to repeated regions in the genome that match locally.
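The identity between precision and the F/T ratio is easy to confirm numerically. The counts below are made up for illustration and are not results from the case study.

```python
def precision(tp, fp):
    """Precision from raw counts of true and false positives."""
    return tp / (tp + fp)

def precision_from_ratio(f_over_t):
    """Precision from the F/T ratio: 1 / (1 + F/T)."""
    return 1 / (1 + f_over_t)

tp, fp = 40, 60   # hypothetical counts
print(precision(tp, fp), precision_from_ratio(fp / tp))  # both are 0.4
```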

Symmetrizer
A technique for finding missing overlaps: if read X plausibly overlaps reads Y and Z, and the offsets suggest that Y and Z overlap, then Y and Z are sent to the Extender part of the algorithm.
In the example, the overlap between R and B is too short to reliably produce a minimizer, but their offsets relative to G suggest that they do in fact overlap.
[Figure: three staggered reads, R (positions 10 to 80), G (positions 10 to 100), and B (positions 10 to 90)]

Results
[Table: precision values 36%, 38%, 42%, 48%, 45%, 43%, 28%, 29%, 33%; the row and column labels are not recoverable]

Recall-Precision Tradeoff
As w increases, minimizers become less common.
w=20 vs. w=1:
- Recall drops somewhat
- Precision increases significantly
- Run time shortens dramatically

Minimizers vs. All k-mers
w=k=20 vs. w=1, k=30:
- Similar recall
- Minimizers have better precision
- Minimizers have much better run time
Notice the recall-precision tradeoff among the all-k-mers runs with different values of k.

With & Without Symmetrizer
In general, Symmetrizer improves recall.
Sym, w=20 vs. NoSym, w=3:
- Similar precision
- Better recall
- Better run time
Window size has a minor effect on recall (notice the scale!) and a big effect on run time, although this might be because the recall is already so high.

Sym Minimizers vs. All k-mers
For high-recall needs, minimizers with Symmetrizer provide:
- Better recall
- Similar precision
- Significantly better run time
Sym minimizers might not perform well in a high-precision scenario (they tend to find false positives), but the data is insufficient for a solid conclusion.

Recall Equivalence
About $\frac{2}{w+1}$ of the k-mers are (w,k) minimizers, and a string of length $l$ has $l-k+1$ k-mers in total. So a string of length $l$ is expected to have about $\frac{2(l-k+1)}{w+1}$ minimizers.
When is a string expected to have 1 minimizer? Solving for $l$: $l_1 = k + \frac{w-1}{2} \cong k + \frac{w}{2}$.
We expect matching substrings of length $k + \frac{w}{2}$ to be found both by a (w,k) minimizer and by a $(k + \frac{w}{2})$-mer. Indeed, we see similar recall values for all $(k + \frac{w}{2})$-mers and for (w,k) minimizers.
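For completeness, the step from the expected number of minimizers to $l_1$ can be written out. It only rearranges the expression on the slide, with $l_1$ denoting the length at which one minimizer is expected.

```latex
\begin{align*}
\frac{2\,(l_1 - k + 1)}{w + 1} &= 1 \\
l_1 - k + 1 &= \frac{w + 1}{2} \\
l_1 &= k + \frac{w + 1}{2} - 1 = k + \frac{w - 1}{2} \;\cong\; k + \frac{w}{2}
\end{align*}
```

For example, w=k=20 gives $l_1 = 29.5 \cong 30$, which pairs (20,20) minimizers with all 30-mers.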

Minimizers vs. All k-mers: Precision
We saw that, with regard to recall, using (w,k) minimizers and using all $(k + \frac{w}{2})$-mers give comparable results.
To analyze precision, we have to consider the following parameters:
- L: the total length of the database (in letters)
- b: the number of different letters (the “base” of a sequence): 4 in DNA (ACGT), 20 in proteins
A given sequence of length k is expected to appear in the database a total of $\frac{L}{b^k}$ times.
$\frac{L}{b^k}$ is an indicator of precision: if k is chosen such that $\frac{L}{b^k} \ll 1$, then a sequence of length k that appears in our database more than once is unlikely to have matched at random.

Looking for Long Matches? Minimizers!
Assume we are looking for long matches. We will probably choose a large k, so that we don’t waste our time on many small seeds. In this case we probably get $\frac{L}{b^k} \ll 1$. But to be sure, check your L!
We know that (w,k) minimizers and all $(k + \frac{w}{2})$-mers have similar recall, but the minimizers need only about $\frac{2}{w+1}$ of the storage.
Although the “all k-mers” approach has slightly better precision, the difference is negligible.

What About Shorter Matches?
Say that short matches are significant enough, so k is bounded by the size of a significant match. In this case we can get $\frac{L}{b^k} > 1$ (but again, check your L!).
If k=15, b=4, and $L = 10^{10}$, then the all-k-mers approach gives $\frac{L}{b^k} \approx 10$. (w=10, k=10) minimizers yield the same recall, but $\frac{L}{b^k}$ is about a thousand times larger (k went from 15 to 10). This is a big hit on precision.
Even regarding storage, since every k-mer is expected to appear in the database multiple times, we don’t need to save all the k-mers; we can just use a hash table.
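The numbers on this slide follow from the $L / b^k$ formula. A quick check, with `expected_occurrences` as an illustrative helper name:

```python
def expected_occurrences(L, b, k):
    """Expected number of times a fixed k-letter sequence appears in L random letters."""
    return L / b**k

L, b = 10**10, 4
all_kmers = expected_occurrences(L, b, 15)    # all 15-mers
minimizers = expected_occurrences(L, b, 10)   # (w=10, k=10) minimizers

print(round(all_kmers, 1))             # 9.3, i.e. about 10
print(round(minimizers / all_kmers))   # 1024, i.e. about a thousand times larger
```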

Summary
- (w,k) minimizers are guaranteed to find matches of length ≥ w+k-1
- Minimizers need only about $\frac{2}{w+1}$ of the storage on average
- Choose an ordering that favors rare k-mers
- w and k affect the recall-precision-runtime tradeoff
- Don’t always use minimizers: consider the size of the alphabet (b) and of the database (L)