Algorithms for Deep Sequencing Data

Similar presentations
Noise, Information Theory, and Entropy (cont.) CS414 – Spring 2007 By Karrie Karahalios, Roger Cheng, Brian Bailey.

CS590 Z Matching Program Versions Xiangyu Zhang. CS590Z Problem Statement  Suppose a program P’ is created by modifying P. Determine the difference between.
Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
Detecting Near Duplicates for Web Crawling Authors : Gurmeet Singh Mank Arvind Jain Anish Das Sarma Presented by Chintan Udeshi 6/28/ Udeshi-CS572.
Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.
This material in not in your text (except as exercises) Sequence Comparisons –Problems in molecular biology involve finding the minimum number of edit.
Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.
Lecture 10: Search Structures and Hashing
Finding Similar Items.
Data Structures Using C++ 2E Chapter 9 Searching and Hashing Algorithms.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
March 16 & 21, Csci 2111: Data and File Structures Week 9, Lectures 1 & 2 Indexed Sequential File Access and Prefix B+ Trees.
Basic Concepts in Number Theory Background for Random Number Generation 1.For any pair of integers n and m, m  0, there exists a unique pair of integers.
Hashing Chapter 20. Hash Table A hash table is a data structure that allows fast find, insert, and delete operations (most of the time). The simplest.
ISV Innovation Presented by ISV Innovation Presented by Business Intelligence Fundamentals: Data Cleansing Ola Ekdahl IT Mentors 9/12/08.
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
Analysis of Algorithms CS 477/677 Instructor: Monica Nicolescu Lecture 7.
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
Memory Management OS Fazal Rehman Shamil. swapping Swapping concept comes in terms of process scheduling. Swapping is basically implemented by Medium.
R ANDOM N UMBER G ENERATORS Modeling and Simulation CS
CSC 413/513: Intro to Algorithms Hash Tables. ● Hash table: ■ Given a table T and a record x, with key (= symbol) and satellite data, we need to support:
Theory of Computational Complexity M1 Takao Inoshita Iwama & Ito Lab Graduate School of Informatics, Kyoto University.
Chapter 14 Genetic Algorithms.
Subject Name: File Structures
Data Structures Using C++ 2E
Updating SF-Tree Speaker: Ho Wai Shing.
Hashing, Hash Function, Collision & Deletion
Hash table CSC317 We have elements with key and satellite data
The Variable-Increment Counting Bloom Filter
CS 332: Algorithms Hash Tables David Luebke /19/2018.
Hashing Alexandra Stefan.
Streaming & sampling.
Subject Name: File Structures
Hashing Alexandra Stefan.
Data Structures Using C++ 2E
CS 430: Information Discovery
Algorithm Analysis CSE 2011 Winter September 2018.
Random Number Generators
Chapter 5. Optimal Matchings
Tests for Gene Clustering
Rank Aggregation.
Objective of This Course
RS – Reed Solomon List Decoding.
Hash Tables – 2 Comp 122, Spring 2004.
Indexing and Hashing Basic Concepts Ordered Indices
Data Structures – Week #7
Advanced Algorithms Analysis and Design
Resolution Proofs for Combinational Equivalence
November 2018 Deep Sequencing Seminar Avia Efrat, Tomer Ronen
Chapter 3 DataStorage Foundations of Computer Science ã Cengage Learning.
CS202 - Fundamental Structures of Computer Science II
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Analysis of Algorithms
Data Structures – Week #7
Minwise Hashing and Efficient Search
Hashing.
CS 3343: Analysis of Algorithms
The Selection Problem.
Data Structures and Algorithm Analysis Hashing
DATA STRUCTURES-COLLISION TECHNIQUES
DATA STRUCTURES-COLLISION TECHNIQUES
Matching Program Versions
Hash Tables – 2 1.
Lecture-Hashing.
Presentation transcript:

Algorithms for Deep Sequencing Data. Eliran Shabat, Lior Wainstain. School of Computer Science, Tel Aviv University, Israel.

Plagiarism Detection via Winnowing, based on the paper by Schleimer, Wilkerson and Aiken. School of CS, Tel Aviv University, Israel, November 2018.

Outline: The Problem and Motivation; Basic Notions and pre-Winnowing Work; Requirements; Karp-Rabin Method for Hashing; Winnowing; Local Algorithms; Experiments.

The Problem and Motivation

Copying Documents Digital documents are easily copied, and people copy documents for a wide range of reasons: people quote from each other's email; collaborators create multiple versions of documents; web sites are mirrored; students plagiarize their homework from the Web or from one another.

Full Plagiarism Detection We want to develop a method that detects such copies, given two or more documents. If we want to reliably detect exact copies, we can compare whole-document checksums, where a checksum is a small digest of the full document. This method is simple and suffices for detecting exact copies, but it is global: it can only tell whether entire documents are identical. We want a method that is more local.

Partial Plagiarism Detection What we actually want is to detect partially copied content between documents. This task is much more subtle than the previous one. Methods for detecting partial copies have many potential applications, for example in genetics, where we don't want to detect full-genome equality, but rather to detect small portions (e.g. genes) that are shared between genomes.

Basic Notions and pre-Winnowing Work

k-grams A k-gram is a contiguous substring of length k. For a document of length n there are n − k + 1 possible k-grams, since a k-gram starts at every position of the text except the last k − 1 positions. As a first step, we divide the document into k-grams, where k is a parameter chosen by the user.

5-grams Example For example, take the text "A do run run run, a do run run". After removing irrelevant features (whitespace, punctuation, capitalization) we get "adorunrunrunadorunrun". The resulting 5-grams are: adoru dorun orunr runru unrun nrunr runru unrun nruna runad unado nador adoru dorun orunr runru unrun.
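
This step is simple enough to state as code. A minimal Python sketch (the function name kgrams is our own):

```python
def kgrams(text, k):
    """Return the n - k + 1 contiguous substrings of length k."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

print(kgrams("adorunrunrunadorunrun", 5))
# ['adoru', 'dorun', 'orunr', 'runru', 'unrun', 'nrunr', 'runru', ...]
```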

Fingerprints After computing all the k-grams, we hash them. Then we select some subset of these hashes to be the document's fingerprints. If we want to check whether documents share some portion, we can compare their fingerprints. One strategy for choosing the subset is to fix some number p and take all the hashes that are 0 mod p.
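
A sketch of this selection strategy in Python, assuming the hashes are already computed:

```python
def select_mod_p(hashes, p):
    """Keep the hashes divisible by p: roughly a 1/p fraction on average."""
    return [h for h in hashes if h % p == 0]
```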

Fingerprints Example The 5-grams from before, a hypothetical list of their hashes, and the hashes among them that are 0 mod 4 (the concrete values are shown on the slide).

Advantages of the 0 mod p Method Suppose that the hash function is chosen so that the probability of collisions is very small. Then, whenever two documents share one or more fingerprints, it is highly likely that they share a k-gram as well. Also, we store a relatively small number of hashes as the fingerprints: on average, only a 1/p fraction of the total number of k-grams.

Disadvantages of the 0 mod p Method The method gives no guarantee that matches between documents are detected: a k-gram shared between documents is detected only if its hash is 0 mod p. In fact, large gaps between selected fingerprints were observed in experiments with real data.

Requirements

Desirable Properties Not all copy-detection algorithms are based on selecting fingerprints of k-grams. Instead, the strings to fingerprint can be chosen by looking for sentences or paragraphs, or by choosing fixed-length strings that begin with "anchor" words. There are also methods that do not use fingerprints at all; they instead use a notion of distance between documents. To give some basis for discussing the different techniques, we list several criteria that a copy-detection algorithm should satisfy.

(Whitespace) Insensitivity In matching text files, matches should be unaffected by extra whitespace, capitalization, punctuation, and so on. In other domains the notion of which strings should be considered equal is different; for example, in matching software text it is desirable to make matching insensitive to variable names.

Simple Solution A first pass over the data transforms it to eliminate undesirable differences between documents: e.g. whitespace and punctuation are removed, all letters are converted to lower case, or all variable names are replaced by the identifier "$". The details vary from one type of document to another, but the general idea is the same: semantic information about the document type is used to eliminate unimportant differences between documents.

Noise Suppression Discovering short matches, such as the fact that the word "the" appears in two different documents, is uninteresting. Any match must be large enough to imply that the material has been copied and is not simply a common word or idiom of the language in which the documents are written.

Simple Solution Schemes based on fingerprinting k-grams satisfy this requirement quite easily: the user can choose k to be sufficiently large that common idioms of the language are shorter than k. Of course, this solution assumes that there is some threshold k (depending on the type of documents) that distinguishes uninteresting from interesting matches in terms of length.

Position Independence Permutation of the contents of a document should not affect the set of discovered matches. Adding to a document should not affect the set of matches in the original portion of the new document. Removing part of a document should not affect the set of matches in the portion that remains.

First Approach to Position Independence A simple but incorrect strategy is to select every i-th hash of a document, but this is not robust against reordering, insertions and deletions. In fact, prepending one character to a file shifts the positions of all k-grams by one, which means that the modified file shares none of its fingerprints with the original. Thus, any effective algorithm for choosing the fingerprints to represent a document cannot rely on the positions of the fingerprints within the document.

Karp and Rabin’s Method for Hashing

Historical Note Karp and Rabin's algorithm for substring matching, from the late 1980s, is the earliest version of fingerprinting based on k-grams. Their problem, motivated by string matching problems in genetics, is to find occurrences of a particular string s of length k within a much longer string. The idea is to compare hashes of all k-grams in the long string with a hash of s, as we have seen before.

Rolling Hash Function Hashing strings of length k is expensive for large k, so Karp and Rabin proposed a "rolling" hash function that allows the hash of the (i+1)-st k-gram to be computed cheaply from the hash of the i-th k-gram. The idea is to treat a k-gram $c_1 \ldots c_k$ as a k-digit number in some base b. Then $H(c_1 \ldots c_k) = c_1 \cdot b^{k-1} + c_2 \cdot b^{k-2} + \cdots + c_{k-1} \cdot b + c_k = \sum_{i=1}^{k} c_i \cdot b^{k-i}$.

Rolling Hash Function – Incremental Step To compute the hash of the next k-gram $c_2 \ldots c_{k+1}$: $H(c_2 \ldots c_{k+1}) = c_2 \cdot b^{k-1} + c_3 \cdot b^{k-2} + \cdots + c_k \cdot b + c_{k+1} = \left(H(c_1 \ldots c_k) - c_1 \cdot b^{k-1}\right) \cdot b + c_{k+1}$. Since $b^{k-1}$ is a constant, this allows each subsequent hash to be computed from the previous one with only two additions and two multiplications. Further, the identity still holds when addition and multiplication are done modulo some value.

Weakness of H The values of the $c_i$'s are relatively small integers, so doing the addition last means that the last character only affects a few of the low-order bits of the hash. A better hash function would have each character $c_i$ potentially affect all of the hash's bits. It is easy to fix this by multiplying the entire hash of the first k-gram by an additional factor of b and then switching the order of the multiply and the add in the incremental step: $H(c_2, \ldots, c_{k+1}) = \left(H(c_1, \ldots, c_k) - c_1 \cdot b^k + c_{k+1}\right) \cdot b$.
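
A Python sketch of this improved rolling hash. The base and modulus here are our own choices, not from the slides (the experiments described later use 64-bit hashes instead):

```python
MOD = (1 << 61) - 1   # a large prime modulus (our choice)
BASE = 257            # base b, chosen larger than the alphabet size

def rolling_hashes(text, k):
    """Hash every k-gram with the improved scheme
    H(c1..ck) = (c1*b^(k-1) + ... + ck) * b, updated incrementally."""
    bk = pow(BASE, k, MOD)             # b^k, used to drop the leading char
    h = 0
    for c in text[:k]:                 # first k-gram, Horner's rule with a trailing *b
        h = (h + ord(c)) * BASE % MOD
    hashes = [h]
    for i in range(k, len(text)):      # roll: drop c_1, append c_{k+1}, multiply by b
        h = (h - ord(text[i - k]) * bk + ord(text[i])) * BASE % MOD
        hashes.append(h)
    return hashes
```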

Winnowing

Parameters Given a set of documents, we want to find substring matches between them that satisfy two properties: (1) if there is a substring match at least as long as the guarantee threshold t, then the match is detected; (2) we do not detect any matches shorter than the noise threshold k. The parameters t and k are chosen by the user, and we must have k ≤ t.

The Algorithm in General The algorithm uses the previous notion of k-grams. This means that we (1) filter the document, (2) compute all the k-grams, (3) compute the hashes of all the k-grams, and (4) decide which hashes are taken as fingerprints. The only innovation of the algorithm is in the last step! The first three steps were used beforehand, as we have seen.

An Observation Suppose that we are given m consecutive hashes $h_1, \ldots, h_m$ such that m > t − k. Then at least one of the $h_i$'s must be chosen to guarantee detection of all matches of length at least t. Why? Since $h_1, \ldots, h_m$ cover m + k − 1 ≥ t consecutive characters of the document, a match of length at least t may lie entirely within this interval, so we must pick at least one of the $h_i$'s.

The Winnowing Algorithm Input: a sequence of hashes $h_1, \ldots, h_n$ that represents a document, and two parameters k ≤ t. Let the window size be w = t − k + 1. Each position 1 ≤ i ≤ n − w + 1 defines a window of hashes $h_i, \ldots, h_{i+w-1}$. In each window select the minimum hash value; if there is more than one hash with the minimum value, select the rightmost occurrence. The selected hashes are the fingerprints of the document.
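
A sketch of the algorithm in Python, following the description above (since overlapping windows usually agree on a selection, each (position, hash) pair is recorded once):

```python
def winnow(hashes, w):
    """Winnowing: in each window of w consecutive hashes select the
    minimum, breaking ties by taking the rightmost occurrence.
    Returns the selected (0-based position, hash) pairs."""
    fingerprints = []
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        j = w - 1 - window[::-1].index(min(window))   # rightmost minimum
        pick = (i + j, hashes[i + j])
        if not fingerprints or fingerprints[-1] != pick:
            fingerprints.append(pick)
    return fingerprints
```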

Example Some text, the same text with irrelevant features removed, and the resulting sequence of 5-grams (shown on the slide).

Example A hypothetical sequence of hashes of the 5-grams, with a window of hashes of size 4 slid across it (shown on the slide).

Example The fingerprints selected by winnowing are shown on the slide. In many applications we also want to remember positional information rather than saving only the fingerprints themselves, so we record each fingerprint together with its 0-based position.
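
Putting the earlier sketches together on the running example (the hash values depend on our choice of base and modulus, so the selected fingerprints will differ from the numbers on the original slide):

```python
text = "adorunrunrunadorunrun"          # the filtered example text
hashes = rolling_hashes(text, k=5)      # one hash per 5-gram
for pos, val in winnow(hashes, w=4):    # (0-based position, fingerprint)
    print(pos, val)
```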

Intuition Behind Winnowing The minimum hash in one window is very likely to remain the minimum hash in adjacent windows, since the odds are that the minimum of w random numbers is smaller than one additional random number. Thus, many overlapping windows select the same hash, and the number of fingerprints selected is far smaller than the number of windows, while still maintaining the guarantee.

Density and the Charge Function The density of a fingerprinting algorithm is the expected fraction of fingerprints selected from among all the hash values computed, given random input. Efficient algorithms attempt to minimize the density. The charge function C maps the position of each selected fingerprint to the position of the first (leftmost) window that selected it. Note that the charge function is monotonically increasing.

Density Analysis of Winnowing Assume that the sequence of hashes is random and that the space of hash values is very large, so that we can safely ignore the possibility of a tie for the minimum value. Consider an indicator random variable $X_i$ that is 1 iff the i-th window $W_i$ is charged. Note that $W_i$ and $W_{i-1}$ overlap everywhere except at their leftmost and rightmost positions, so together they cover w + 1 positions. Consider the position p containing the smallest hash in this union interval.

Density Analysis of Winnowing Any window that includes p selects $h_p$ as a fingerprint. There are three cases: (1) If p = i − 1, then since p ∉ $W_i$, $W_i$ must pick a new hash $h_q$. No earlier window is charged for $h_q$ because the charge function is monotonic, so $W_i$ is charged and $X_i = 1$. (2) If p = i + w − 1, then since p ∉ $W_{i-1}$, $W_i$ must be charged for $h_p$, so $X_i = 1$. (3) If p is in any other position, then both $W_i$ and $W_{i-1}$ select it, so $W_i$ is not charged and $X_i = 0$. Since p is uniform over the w + 1 positions of the union, the first two cases each happen with probability 1/(w+1). Thus $E[X_i] = 2/(w+1)$.

Density Analysis of Winnowing Let N be the number of windows in the document. Then the number of fingerprints selected is $\sum_{i=1}^{N} X_i$. Note that even though the $X_i$'s are clearly dependent, expectation is linear: $E\left[\sum_{i=1}^{N} X_i\right] = \sum_{i=1}^{N} E[X_i] = N \cdot \frac{2}{w+1}$. Thus, the expected density (the selected fraction) is $d = \frac{2}{w+1}$.
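
A quick empirical check of this analysis, reusing the winnow sketch from above on random hashes (our own test, not from the slides):

```python
import random

def measured_density(n=100_000, w=4):
    """Fraction of hashes selected by winnowing on random input."""
    hashes = [random.randrange(1 << 62) for _ in range(n)]
    return len(winnow(hashes, w)) / n

print(measured_density(), 2 / (4 + 1))  # ~0.40 vs. the predicted 0.4
```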

Comparison to 0 mod p at the Same Density Note that the density of 0 mod p is 1/p, so to match densities we need p = 1/d = (w+1)/2. Consider the probability that, for a string of length t = w + k − 1, the 0 mod p algorithm fails to select any fingerprint at all within it (the probability that such a string exists somewhere in a text is at least as large). This is the probability that no hash is selected in a given sequence of w hashes: $\left(1 - \frac{1}{p}\right)^w = \left(1 - \frac{2}{w+1}\right)^w \approx e^{-\frac{2w}{w+1}} \approx 13.5\%$.
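
The arithmetic for a concrete window size (w = 100 is our illustrative choice):

```python
import math

w = 100                     # illustrative window size
p = (w + 1) / 2             # 0 mod p tuned to winnowing's density 2/(w+1)
miss = (1 - 1 / p) ** w     # chance a stretch of w hashes selects nothing
print(miss, math.exp(-2))   # ~0.1353 vs. ~0.1353, i.e. about 13.5%
```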

Queries In a typical application, one first builds a database of fingerprints and then queries the fingerprints of individual documents against this database. Different window sizes can be used for the database and for the queries. Let $F_w$ be the set of fingerprints chosen for a document by winnowing with window size w. For w′ ≥ w we have $F_{w'} \subseteq F_w$. Picking a larger window for queries is useful when the system is heavily loaded, or when we are interested in a faster but coarser estimate of the matches in a document.

Local Algorithms

Local Algorithms Class – Motivation Winnowing selects the minimum value in a window of hashes; it is one of a family of algorithms that choose elements from a local window. However, not every method for selecting hashes from a local window maintains the guarantee. For example, given window size w, consider a method that selects every w-th hash as a fingerprint: this method fails in the presence of insertions or deletions. This motivates the definition of the class of local algorithms.

Local Algorithms Definition Let S be a selection function taking a w-tuple of hashes and returning an integer between zero and w − 1, inclusive. A fingerprinting algorithm is local with selection function S if, for every window $h_i, \ldots, h_{i+w-1}$, the hash at position $i + S(h_i, \ldots, h_{i+w-1})$ is selected as a fingerprint. It is sometimes beneficial to weaken locality slightly to provide flexibility in choosing among equal fingerprints.
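
The definition translates directly to code; a sketch with our own names (local_fingerprints, rightmost_min):

```python
def local_fingerprints(hashes, w, S):
    """Generic local algorithm: S sees only one window's contents and
    returns an offset in [0, w); the hash at that offset is selected."""
    picks = set()
    for i in range(len(hashes) - w + 1):
        j = S(tuple(hashes[i:i + w]))
        picks.add((i + j, hashes[i + j]))
    return sorted(picks)

# Winnowing is the local algorithm whose selection function returns
# the offset of the rightmost minimum in the window:
def rightmost_min(window):
    return max(range(len(window)), key=lambda j: (-window[j], j))
```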

Correctness of Local Algorithms Lemma: any matching pair of substrings of length at least t = w + k − 1 is found by any local algorithm. Proof: the sequence of hashes of k-grams representing each substring spans at least one full window W of length w. W appears in the sequence of hashes of both documents, and the selection function depends only on the contents of W. Thus, the same fingerprint is selected from W in both copies.

Lower Bound for Local Algorithms Theorem: any local algorithm with noise threshold k and guarantee t = w + k − 1 has density $d \geq \frac{1.5}{w+1}$. Note that this lower bound does not meet the upper bound given by winnowing: the winnowing algorithm is within 33% of the lower bound, and it might be possible to improve the lower bound.

Proof of the Lower Bound Assume the hashes are independent and uniformly distributed. Consider the behavior of the algorithm on every (w+1)-th window. Because these windows are disjoint, each of them selects a separate fingerprint. Consider all of the windows between the i-th and the (i+w+1)-th windows, and let Z be a random variable such that Z = 0 if no additional fingerprint is selected among these windows, and Z = 1 otherwise.

Proof of the Lower Bound Let X and Y denote the random variables $S(W_i)$ and $S(W_{i+w+1})$, respectively; note that they are independent. If Y ≥ X then Z = 1, because the algorithm is then required to select at least one additional fingerprint from a window strictly between $W_i$ and $W_{i+w+1}$. Denote $\Theta := \Pr[X > Y] = \Pr[Y > X]$ and $\Delta := \Pr[X = Y]$. Then $2\Theta + \Delta = 1$, so $\Theta + \Delta = \frac{1+\Delta}{2} > \frac{1}{2}$. Thus $E[Z] \geq \Pr[Y \geq X] = \Theta + \Delta > \frac{1}{2}$.

Proof of the Lower Bound 𝐸 𝑍 > 1 2 This means that in every sequence of 𝑤+1 windows, in addition to the fingerprint selected in the first window we expect to select an additional distinct fingerprint at least half the times. Which means that 𝑑≥ 1.5 𝑤+1

Experiments

Web Data Experiment Dataset: 500,000 pages downloaded from the Stanford WebBase. Rolling hash function: the "improved" Karp-Rabin scheme with base b, $H(c_2, \ldots, c_{k+1}) = \left(H(c_1, \ldots, c_k) - c_1 \cdot b^k + c_{k+1}\right) \cdot b$, using 64-bit hashes.

Web Data Experiment – Hash Verification 8MB of randomly generated text, hashing strings of length 50 (k = 50), with the winnowing window set to 100 (w = 100). Winnowing selected 0.019902 of the hashes; the expected density is 2/101 = 0.019802. Selecting hashes equal to 0 mod 50 results in a measured density of 0.020005; the expected density is 0.02. A uniform distribution of the hash values was observed. Taken together, the hash function implementation appears to be sufficient for the fingerprinting algorithm.

Web Data – Second Experiment 500,000 HTML documents, winnowing window size 100, noise threshold 50. Results: a table of measured densities for both algorithms (shown on the slide).

Web Data – Second Experiment Both algorithms come close to the expected density. However, there is a run of over 29,900 characters that contains no hash equal to 0 mod 50. The probability of this, assuming random data, is $\left(1 - \frac{1}{50}\right)^{29851} < 10^{-260}$. Clearly, the data is not uniformly random. Low-entropy strings (for example, a long string of zeros) produce many equal hash values, which cause many ties for the minimum hash; with the rightmost-tie rule, the selected value is always the newly entered one, so a new fingerprint is selected in almost every window. Robust winnowing addresses this by trying to break ties using the hash value selected in the previous window.
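
One possible implementation of that tie-breaking idea, as we understand it from the description above (a sketch, not necessarily the paper's exact rule):

```python
def robust_winnow(hashes, w):
    """Like winnow(), but on a tie for the window minimum, prefer to
    re-select the previously selected hash (same position) instead of
    always taking the rightmost occurrence, so low-entropy runs do not
    generate a new fingerprint in every window."""
    picks = []
    prev = None                                # (position, hash) last selected
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        m = min(window)
        if prev is not None and prev[1] == m and i <= prev[0] <= i + w - 1:
            pick = prev                        # keep the previous selection
        else:
            j = w - 1 - window[::-1].index(m)  # rightmost minimum
            pick = (i + j, hashes[i + j])
        if pick != prev:
            picks.append(pick)
        prev = pick
    return picks
```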

MOSS – Implementing Winnowing MOSS, which stands for Measure Of Software Similarity, accepts batches of documents and returns report pages showing where significant sections of a pair of documents are very similar. MOSS is primarily used for detecting plagiarism in programming assignments in computer science and other engineering courses. The service currently uses robust winnowing.

Thank You!