Algorithms for Deep Sequencing Data
Eliran Shabat and Lior Wainstain, School of Computer Science, Tel Aviv University, Israel
Plagiarism Detection via Winnowing
Based on the paper by Schleimer, Wilkerson and Aiken. Presented at the School of Computer Science, Tel Aviv University, Israel, November 2018.
Outline
- The Problem and Motivation
- Basic Notions and pre-Winnowing Work
- Requirements
- Karp-Rabin Method for Hashing
- Winnowing
- Local Algorithms
- Experiments
The Problem and Motivation
Copying Documents
Digital documents are easily copied, and people copy documents for a wide range of reasons:
- People quote from each other's work.
- Collaborators create multiple versions of documents.
- Web sites are mirrored.
- Students plagiarize their homework from the Web or from one another.
Full Plagiarism Detection
We want a method to detect such copies given two or more documents. To reliably detect exact copies, we can compare whole-document checksums; a checksum is a small digest of the full document. This method is simple and suffices for detecting exact copies, but it is global: we want a method that is more local.
Partial Plagiarism Detection
What we actually want is to detect partially copied content between documents. This task is much more subtle than the previous one. Methods for detecting partial copies have many potential applications, for example in genetics, where we do not want to detect full genome equality, but rather small portions (e.g. genes) that are shared between genomes.
Basic Notions and pre-Winnowing Work
k-grams
A k-gram is a contiguous substring of length k. For a document of length n there are $n - k + 1$ possible k-grams: there is a k-gram starting at every position of the text except the last $k - 1$ positions. As a first step, we divide the document into k-grams, where k is a parameter chosen by the user.
5-grams Example
For example, take the text
    "A do run run run, a do run run"
Then we remove irrelevant features:
    "adorunrunrunadorunrun"
The resulting 5-grams are:
    adoru dorun orunr runru unrun nrunr runru unrun nruna runad unado nador adoru dorun orunr runru unrun
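As a minimal sketch of these two steps (not code from the paper; the names filter_text and kgrams are ours), in Python:

```python
import re

def filter_text(text: str) -> str:
    """Remove irrelevant features: keep only letters, lowercased."""
    return re.sub(r"[^a-z]", "", text.lower())

def kgrams(text: str, k: int) -> list[str]:
    """All n - k + 1 contiguous substrings of length k."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

filtered = filter_text("A do run run run, a do run run")
assert filtered == "adorunrunrunadorunrun"
print(kgrams(filtered, 5))  # ['adoru', 'dorun', 'orunr', 'runru', ...]
```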
Fingerprints
After computing all the k-grams, we hash them. Then we select some subset of these hashes to be the document's fingerprints. To check whether two documents share some portion, we can compare their fingerprints. One strategy for choosing the subset is to fix some number p and take all the hashes that are 0 mod p.
Fingerprints Example
The 5-grams from before, with a hypothetical list of hashes:
    77 72 42 17 98 50 17 98 8 88 67 39 77 72 42 17 98
The hashes that are 0 mod 4:
    72 8 88 72
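A minimal sketch of the 0 mod p selection on the hypothetical hash values above (the hash function itself is left abstract):

```python
def select_mod_p(hashes: list[int], p: int) -> list[int]:
    """Fingerprints under the 0 mod p strategy: keep hashes divisible by p."""
    return [h for h in hashes if h % p == 0]

hashes = [77, 72, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 72, 42, 17, 98]
print(select_mod_p(hashes, 4))  # -> [72, 8, 88, 72]
```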
Advantages of the 0 mod p Method
Suppose that the hash function is chosen so that the probability of collisions is very small. Then, whenever two documents share one or more fingerprints, it is highly likely that they share a k-gram as well. Also, we store relatively few hashes as fingerprints: on average, only a $1/p$ fraction of the total number of k-grams.
Disadvantages of the 0 mod p Method
The method gives no guarantee that matches between documents are detected: a k-gram shared between documents is detected only if its hash is 0 mod p. In fact, large gaps between selected fingerprints were observed when experimenting with real data.
Requirements
Desirable Properties
Not all copy-detection algorithms are based on selecting fingerprints of k-grams. Instead, the strings to fingerprint can be chosen by looking for sentences or paragraphs, or by choosing fixed-length strings that begin with "anchor" words. There are also methods that do not use fingerprints at all; instead, they use a notion of distance between documents. To give some basis for discussing different techniques, we list several criteria that a copy-detection algorithm should satisfy.
(Whitespace) Insensitivity
In matching text files, matches should be unaffected by extra whitespace, capitalization, punctuation, and the like. In other domains the notion of which strings should be considered equal differs; for example, in matching software text it is desirable to make matching insensitive to variable names.
Simple Solution
A first pass over the data transforms it to eliminate undesirable differences between documents: e.g., whitespace and punctuation are removed, all letters are converted to lower case, or all variable names are replaced by the identifier "$". The details vary from one type of document to another, but the general idea is the same: semantic information about the document type is used to eliminate unimportant differences between documents.
Noise Suppression
Discovering short matches, such as the fact that the word "the" appears in two different documents, is uninteresting. Any match must be large enough to imply that the material has been copied and is not simply a common word or idiom of the language in which the documents are written.
Simple Solution
Schemes based on fingerprinting k-grams satisfy this requirement quite easily: the user can choose k large enough that common idioms of the language are shorter than k. Of course, this solution assumes there is some threshold k (depending on the type of documents) that separates uninteresting matches from interesting ones in terms of length.
Position Independence
- Permutation of the contents of a document should not affect the set of discovered matches.
- Adding to a document should not affect the set of matches in the original portion of the new document.
- Removing part of a document should not affect the set of matches in the portion that remains.
First Approach to Position Independence
A simple but incorrect strategy is to select every i-th hash of a document, but this is not robust against reordering, insertions and deletions. In fact, prepending one character to a file shifts the positions of all k-grams by one, so the modified file shares none of its fingerprints with the original. Thus, any effective algorithm for choosing the fingerprints to represent a document cannot rely on the positions of the fingerprints within the document.
Karp and Rabin’s Method for Hashing
Historical Note
Karp and Rabin's algorithm for substring matching, from the late 1980s, is the earliest version of fingerprinting based on k-grams. Their problem, motivated by string matching problems in genetics, is to find occurrences of a particular string s of length k within a much longer string. The idea is to compare hashes of all k-grams in the long string with the hash of s, as we have seen before.
Rolling Hash Function
Hashing strings of length k is expensive for large k, so Karp and Rabin proposed a "rolling" hash function that allows the hash of the (i+1)-st k-gram to be computed cheaply from the hash of the i-th k-gram. The idea is to treat a k-gram $c_1 \dots c_k$ as a k-digit number in some base b:
$$H(c_1 \dots c_k) = c_1 b^{k-1} + c_2 b^{k-2} + \cdots + c_{k-1} b + c_k = \sum_{i=1}^{k} c_i b^{k-i}$$
Rolling Hash Function – Incremental Step
To compute the hash of the next k-gram $c_2 \dots c_{k+1}$:
$$H(c_2 \dots c_{k+1}) = c_2 b^{k-1} + c_3 b^{k-2} + \cdots + c_k b + c_{k+1} = \left( H(c_1 \dots c_k) - c_1 b^{k-1} \right) \cdot b + c_{k+1}$$
Since $b^{k-1}$ is a constant, each subsequent hash can be computed from the previous one with only two additions and two multiplications. Further, the identity still holds when addition and multiplication are done modulo some value.
Weakness of H
The values of the $c_i$'s are relatively small integers, so doing the addition last means that the last character only affects a few of the low-order bits of the hash. A better hash function would have each character $c_i$ potentially affect all of the hash's bits. This is easy to fix: multiply the entire hash of the first k-gram by an additional b, and switch the order of the multiply and the add in the incremental step:
$$H(c_2, \dots, c_{k+1}) = \left( \left( H(c_1, \dots, c_k) - c_1 b^k \right) + c_{k+1} \right) \cdot b$$
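A sketch of this improved rolling hash in Python; the base and modulus here are illustrative choices, not values from the paper:

```python
class RollingHash:
    """Karp-Rabin rolling hash with the extra factor of b, so that each
    character can affect all bits of the hash. Arithmetic is done modulo
    a large prime."""

    def __init__(self, text: str, k: int, b: int = 256, mod: int = (1 << 61) - 1):
        self.text, self.k, self.b, self.mod = text, k, b, mod
        self.bk = pow(b, k, mod)  # b^k mod m, precomputed once
        self.i = 0                # start index of the current k-gram
        # H(c_1...c_k) = c_1*b^k + c_2*b^(k-1) + ... + c_k*b
        self.h = 0
        for c in text[:k]:
            self.h = (self.h + ord(c)) * b % mod

    def roll(self) -> int:
        """Advance to the next k-gram in O(1): remove the outgoing
        character's contribution, add the incoming one, multiply by b."""
        c_out, c_in = self.text[self.i], self.text[self.i + self.k]
        self.h = (self.h - ord(c_out) * self.bk + ord(c_in)) * self.b % self.mod
        self.i += 1
        return self.h

# Rolling agrees with hashing the next k-gram from scratch:
rh = RollingHash("adorunrunrunadorunrun", 5)
assert rh.roll() == RollingHash("dorun", 5).h
```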
Winnowing
Parameters
Given a set of documents, we want to find substring matches between them that satisfy two properties:
1. If there is a substring match at least as long as the guarantee threshold t, then the match is detected.
2. We do not detect any matches shorter than the noise threshold k.
The parameters t and k are chosen by the user, and we must have $k \le t$.
The Algorithm in General
The algorithm uses the notion of k-grams from before. This means that we:
1. Filter the document.
2. Compute all the k-grams.
3. Compute the hashes of all the k-grams.
4. Decide which hashes to take as fingerprints.
The only innovation of the algorithm is in the last step! The first three steps were already in use, as we have seen.
An Observation
Suppose that we are given m consecutive hashes $h_1, \dots, h_m$ with $m > t - k$. Then at least one of the $h_i$'s must be selected in order to guarantee detection of all matches of length at least t. Why? The hashes $h_1, \dots, h_m$ cover $m + k - 1 \ge t$ consecutive characters of the document, so there may be a match of length at least t inside this interval of characters; therefore we must pick at least one of the $h_i$'s.
The Winnowing Algorithm
Input: a sequence of hashes $h_1, \dots, h_n$ representing a document, and two parameters $k \le t$.
- Let the window size be $w = t - k + 1$.
- Each position $1 \le i \le n - w + 1$ defines a window of hashes $h_i, \dots, h_{i+w-1}$.
- In each window, select the minimum hash value. If there is more than one hash with the minimum value, select the rightmost occurrence.
- The selected hashes are the fingerprints of the document.
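A direct sketch of this selection step in Python, taking the already-computed hash sequence as input (positions are 0-based, matching the example that follows):

```python
def winnow(hashes: list[int], w: int) -> list[tuple[int, int]]:
    """Winnowing: in every window of w consecutive hashes select the
    minimum, breaking ties by the rightmost occurrence. Returns
    (hash, position) pairs without consecutive duplicates."""
    fingerprints: list[tuple[int, int]] = []
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        # offset of the rightmost minimum within the window
        j = w - 1 - window[::-1].index(min(window))
        fp = (window[j], i + j)
        if not fingerprints or fingerprints[-1] != fp:
            fingerprints.append(fp)
    return fingerprints
```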
Example
Some text: "A do run run run, a do run run"
Irrelevant features removed: "adorunrunrunadorunrun"
The sequence of 5-grams:
    adoru dorun orunr runru unrun nrunr runru unrun nruna runad unado nador adoru dorun orunr runru unrun
Example
Our hypothetical sequence of hashes of the 5-grams:
    77 72 42 17 98 50 17 98 8 88 67 39 77 72 42 17 98
Consider windows of hashes of size 4:
    (77, 72, 42, 17) (72, 42, 17, 98) (42, 17, 98, 50) (17, 98, 50, 17) (98, 50, 17, 98) (50, 17, 98, 8) (17, 98, 8, 88) (98, 8, 88, 67) (8, 88, 67, 39) (88, 67, 39, 77) (67, 39, 77, 72) (39, 77, 72, 42) (77, 72, 42, 17) (72, 42, 17, 98)
Example
The fingerprints selected by winnowing are:
    17 17 8 39 17
In many applications we also want to remember positional information rather than saving only the fingerprints themselves, so we add 0-based positions, i.e.
    [17, 3] [17, 6] [8, 8] [39, 11] [17, 15]
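Running the winnow sketch from before on the hypothetical hash sequence reproduces exactly this selection:

```python
hashes = [77, 72, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 72, 42, 17, 98]
print(winnow(hashes, 4))
# -> [(17, 3), (17, 6), (8, 8), (39, 11), (17, 15)]
```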
Intuition Behind Winnowing
The minimum hash in one window is very likely to remain the minimum hash in adjacent windows, since the minimum of w random numbers tends to be smaller than one additional random number. Thus, many overlapping windows select the same hash, and the number of fingerprints selected is far smaller than the number of windows, while the guarantee is still maintained.
Density and the Charge Function
The density of a fingerprinting algorithm is the expected fraction of fingerprints selected from among all the hash values computed, given random input. Efficient algorithms attempt to minimize the density. The charge function C maps the position of each selected fingerprint to the position of the first (leftmost) window that selected it. Note that the charge function is monotonically increasing.
Density Analysis of Winnowing
Assume that the sequence of hashes is random and that the space of hash values is very large, so that we can safely ignore the possibility of a tie for the minimum. Consider an indicator random variable $X_i$ that is 1 iff the i-th window $W_i$ is charged. Note that $W_i$ and $W_{i-1}$ overlap everywhere except at their leftmost and rightmost positions, so together they cover an interval of $w + 1$ positions. Consider the position p containing the smallest hash in this union interval.
Density Analysis of Winnowing
Any window that includes p selects $h_p$ as a fingerprint. There are three cases:
- If $p = i - 1$, then since $p \notin W_i$, $W_i$ must pick a new hash $h_q$. No earlier window is charged for $h_q$ because the charge function is monotonic. Thus $W_i$ is charged and $X_i = 1$.
- If $p = i + w - 1$, then since $p \notin W_{i-1}$, $W_i$ must be charged for $h_p$. Thus $X_i = 1$.
- If p is in any other position, then both $W_i$ and $W_{i-1}$ select it, so $W_i$ is not charged and $X_i = 0$.
Each of the first two cases happens with probability $1/(w+1)$. Thus $E[X_i] = 2/(w+1)$.
Density Analysis of Winnowing
Let N be the number of windows in the document. Then the number of fingerprints selected is $\sum_{i=1}^{N} X_i$, and the expected number of fingerprints is $E\!\left[\sum_{i=1}^{N} X_i\right]$. Even though the $X_i$'s are clearly dependent, expectation is linear:
$$E\!\left[\sum_{i=1}^{N} X_i\right] = \sum_{i=1}^{N} E[X_i] = \frac{2N}{w+1}$$
Thus, the expected density (the fraction) is $d = \frac{2}{w+1}$.
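A quick simulation (a sanity check of the analysis, not an experiment from the paper), reusing the winnow sketch from before, agrees with $d = 2/(w+1)$:

```python
import random

def measured_density(n: int = 200_000, w: int = 4) -> float:
    """Fraction of fingerprints selected from n random 64-bit hashes."""
    hashes = [random.getrandbits(64) for _ in range(n)]
    return len(winnow(hashes, w)) / n

random.seed(0)
print(measured_density())  # ~0.4, matching 2 / (w + 1) = 2 / 5
```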
Comparison to 0 mod p at the Same Density
The density of 0 mod p is $1/p$, so to match we take $p = \frac{1}{d} = \frac{w+1}{2}$. Consider the probability that, for a string of length $t = w + k - 1$, the 0 mod p algorithm fails to select any fingerprint at all within it (the probability that such a string exists somewhere in a text is at least as large). This is the probability that no hash is selected in a given sequence of w hashes:
$$\left(1 - \frac{1}{p}\right)^w = \left(1 - \frac{2}{w+1}\right)^w \approx e^{-\frac{2w}{w+1}} \approx e^{-2} \approx 13.5\%$$
So at the same density, 0 mod p misses a guarantee-length match about 13.5% of the time, while winnowing never does.
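The same calculation as a quick numeric check (w = 100 is used here only for illustration; it is also the window size in the experiments later):

```python
import math

w = 100
p = (w + 1) / 2                    # 0 mod p tuned to winnowing's density
miss = (1 - 1 / p) ** w            # no fingerprint in a window of w hashes
print(miss)                        # ~0.135
print(math.exp(-2 * w / (w + 1)))  # ~0.138, the e^(-2w/(w+1)) approximation
```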
Queries
In a typical application, one first builds a database of fingerprints and then queries the fingerprints of individual documents against this database. Different window sizes can be used for the database and for the queries. Let $F_w$ be the set of fingerprints chosen for a document by winnowing with window size w. For $w' \ge w$ we have $F_{w'} \subseteq F_w$. Picking a larger window for queries is useful when the system is heavily loaded, or when we want a faster but coarser estimate of the matching in a document.
Local Algorithms
Local Algorithms Class - Motivation
Winnowing selects the minimum value in a window of hashes. It is one of a family of algorithms that choose elements from a local window. However, not every method for selecting hashes from a local window maintains the guarantee. For example, given window size w, consider a method that selects every w-th hash as a fingerprint. This method fails in the presence of insertions or deletions. This motivates the definition of the class of local algorithms.
Local Algorithms Definition
Let S be a selection function taking a w-tuple of hashes and returning an integer between 0 and $w - 1$, inclusive. A fingerprinting algorithm is local with selection function S if, for every window $h_i, \dots, h_{i+w-1}$, the hash at position $i + S(h_i, \dots, h_{i+w-1})$ is selected as a fingerprint. It is sometimes beneficial to weaken locality slightly to provide flexibility in choosing among equal fingerprints.
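In code, a local algorithm is determined entirely by its selection function; winnowing is the instance whose selection function picks the rightmost minimum. A sketch (local_fingerprints and rightmost_min are our names):

```python
from typing import Callable

def local_fingerprints(hashes: list[int], w: int,
                       S: Callable[[list[int]], int]) -> list[tuple[int, int]]:
    """Generic local algorithm: window i selects the hash at i + S(window)."""
    selected: list[tuple[int, int]] = []
    for i in range(len(hashes) - w + 1):
        j = i + S(hashes[i:i + w])
        fp = (hashes[j], j)
        if not selected or selected[-1] != fp:
            selected.append(fp)
    return selected

def rightmost_min(window: list[int]) -> int:
    """Winnowing's selection function: offset of the rightmost minimum."""
    return len(window) - 1 - window[::-1].index(min(window))
```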
Correctness of Local Algorithms
Lemma: any matching pair of substrings of length at least $t = w + k - 1$ is found by any local algorithm.
Proof: the sequence of hashes of k-grams representing each substring spans at least one full window W of length w. W appears in the sequence of hashes of both documents. The selection function depends only on the contents of W, so the same fingerprint is selected from W in both copies.
Lower Bound for Local Algorithms
Theorem: any local algorithm with noise threshold k and guarantee $t = w + k - 1$ has density
$$d \ge \frac{1.5}{w+1}$$
Note that this lower bound does not meet the upper bound achieved by winnowing: winnowing's density of $\frac{2}{w+1}$ is within 33% of the bound. It might be possible to improve the lower bound.
Proof of the Lower Bound
Assume the hashes are independent and uniformly distributed. Consider the behavior of the algorithm on every (w+1)-st window. Because these windows are disjoint, each of them selects a separate fingerprint. Now consider all of the windows between the i-th and the (i+w+1)-st window. Let Z be a random variable such that Z = 0 if no additional fingerprint is selected among these windows, and Z = 1 otherwise.
Proof of the Lower Bound
Let X and Y denote the random variables $S(W_i)$ and $S(W_{i+w+1})$, respectively; note that they are independent, since they depend on disjoint sets of hashes. If $Y \ge X$ then $Z = 1$, because the algorithm is then required to select at least one additional fingerprint from a window strictly between $W_i$ and $W_{i+w+1}$. Denote
$$\Theta := \Pr[X > Y] = \Pr[Y > X], \qquad \Delta := \Pr[X = Y]$$
Then $2\Theta + \Delta = 1$, so $\Theta + \Delta = \frac{1+\Delta}{2} > \frac{1}{2}$, and therefore
$$E[Z] \ge \Pr[Y \ge X] = \Theta + \Delta > \frac{1}{2}$$
Proof of the Lower Bound
Since $E[Z] > \frac{1}{2}$, in every run of $w + 1$ windows, in addition to the fingerprint selected in the first window, we expect to select an additional distinct fingerprint at least half the time. Hence every $w + 1$ windows contribute at least 1.5 fingerprints in expectation, which means
$$d \ge \frac{1.5}{w+1}$$
Experiments
Web Data Experiment
- Dataset: 500,000 pages downloaded from the Stanford WebBase.
- Rolling hash function: the "improved" Karp-Rabin scheme with base b,
  $H(c_2, \dots, c_{k+1}) = \left( \left( H(c_1, \dots, c_k) - c_1 b^k \right) + c_{k+1} \right) \cdot b$
- 64-bit hashes.
Web Data Experiment – Hash Verification
- 8 MB of randomly generated text; strings of length 50 (k = 50).
- The winnowing window was set to 100 (w = 100).
- Winnowing selected close to the expected density of $2/(w+1) = 2/101 \approx 0.0198$ of the hashes.
- Selecting hashes equal to 0 mod 50 gave a measured density close to the expected density of $1/50 = 0.02$.
- A uniform distribution of the hash values was observed.
Taken all together, the hash function implementation appears to be sufficient for the fingerprinting algorithm.
Web Data – Second Experiment
- 500,000 HTML documents.
- Winnowing window size: 100.
- Noise threshold: 50.
The results are discussed on the next slide.
Web Data – Second Experiment
Both algorithms come close to the expected density on real data, but with notable exceptions. One document contains a run of over 29,900 characters with no hash equal to 0 mod 50; the probability of this under the assumption of random data is
$$\left(1 - \frac{1}{50}\right)^{29{,}900} < 10^{-260}$$
Clearly, the data is not uniformly random. Low-entropy strings (for example, a long run of zeros) produce many equal hash values, causing many ties for the minimum hash; since the rightmost minimum is always the newest value, ordinary winnowing then selects a new fingerprint in almost every window. Robust winnowing addresses this by breaking ties using the hash selected by the previous window.
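A sketch of robust winnowing's tie-breaking, assuming the rule from the paper: on a tie for the minimum, keep the previous window's selection if it is still a minimal hash in the current window; otherwise fall back to the rightmost minimum.

```python
def robust_winnow(hashes: list[int], w: int) -> list[tuple[int, int]]:
    """Robust winnowing: prefer the hash selected by the previous window
    when breaking ties, so low-entropy runs of equal hashes do not force
    a new fingerprint in every window."""
    fingerprints: list[tuple[int, int]] = []
    prev = -1  # absolute position selected by the previous window
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        m = min(window)
        if i <= prev and hashes[prev] == m:
            j = prev  # previous selection is still minimal: keep it
        else:
            j = i + w - 1 - window[::-1].index(m)  # rightmost minimum
        prev = j
        fp = (hashes[j], j)
        if not fingerprints or fingerprints[-1] != fp:
            fingerprints.append(fp)
    return fingerprints

# On a run of equal hashes, plain winnowing (winnow from the earlier
# sketch) selects one fingerprint per window; robust winnowing selects
# roughly one per w windows:
zeros = [0] * 20
print(len(winnow(zeros, 4)), len(robust_winnow(zeros, 4)))  # 17 5
```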
MOSS – Implementing Winnowing
MOSS, which stands for Measure Of Software Similarity, accepts batches of documents and returns report pages showing where significant sections of a pair of documents are very similar. MOSS is primarily used for detecting plagiarism in programming assignments in computer science and other engineering courses. The service currently uses robust winnowing.
Thank You!