1 bioRxiv preprint first posted online August 14, 2014; doi: http://dx.doi.org/10.1101/008003. The copyright holder for this preprint is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Journal Club, 04/06/2015, K. Higasa

2 A hash function is any function that can be used to map digital data of arbitrary size to digital data of fixed size (http://en.wikipedia.org/wiki/Hash_function). For example, suppose that the input data are file names such as FILE0000.txt, FILE0001.txt, FILE0002.txt, etc., with mostly sequential numbers. For such data, a function that extracts the numeric part k of the file name would be a hash function.
Cryptographic hash functions additionally require:
1. Pre-image resistance: Given a hash h, it should be difficult to find any message m such that h = hash(m). This concept is related to that of a one-way function.
2. Collision resistance: It should be difficult to find two different messages m1 and m2 such that hash(m1) = hash(m2). Such a pair is called a hash collision.
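The file-name example can be sketched in a few lines of R. This is only an illustration: the function name `file_hash` is made up here, and it assumes every file name contains a single run of digits.

```r
# Hypothetical helper: map a file name to the integer formed by its digits.
file_hash <- function(name) {
  as.integer(gsub("[^0-9]", "", name))  # drop every non-digit character
}

files <- c("FILE0000.txt", "FILE0001.txt", "FILE0002.txt")
hashes <- file_hash(files)
print(hashes)
```

Such a function is a fine hash for table lookup, but it is neither pre-image resistant nor collision resistant (for example, "FILE0001.txt" and "DATA0001.csv" would collide).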

3 Genome assembly refers to aligning and merging fragments of DNA sequence in order to reconstruct the original sequence. This is needed because DNA sequencing technology cannot read whole genomes (chromosomes) in one go, but rather reads small pieces of between 100 and 30000 bases, depending on the technology used.
30x human genome = ~1 billion reads of ~100 bp in length per person.
[Figure: original sequence (ACGT...) → reads → assemble]
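The read count quoted on this slide follows from simple arithmetic. The sketch below assumes a human genome size of about 3.1 billion bases, a value that is not stated on the slide.

```r
genome_size <- 3.1e9   # assumed haploid human genome size in bases
coverage    <- 30      # 30x sequencing depth
read_length <- 100     # bases per read

# Number of reads needed so that every base is covered ~30 times on average.
n_reads <- coverage * genome_size / read_length
print(n_reads)         # on the order of one billion reads
```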

4 Current surveys of genetic variation (mainly variation associated with diseases) depend largely on the human genome reference sequence constructed by the international project in 2003 (cost: over $3 billion). The reference is used for interpretation of GWAS results, design of PCR primers, and mapping of NGS reads to find variations.

5 While the quality of the current human reference genome sequence is high, more than 160 gaps remain and efforts to improve the reference genome continue. (Gap = missing sequence.) In practice, a few percent of reads cannot be mapped anywhere, probably because of these missing parts or differences among populations or individuals. Therefore, an effort to reconstruct a more complete and ethnically applicable version of the human genome reference sequence will be essential for future human genome studies.

6 Repetitive sequences make assembly difficult when the repeat length exceeds the read length. Longer reads are more likely to contain unique sequence. Unfortunately, most high-throughput sequencing methods generate reads of only a few hundred base pairs, which is far shorter than many common repeats.
[Figure: overlap finding, then merging]

7 Mardis, NHGRI Current Topics in Genome Analysis 2014

9
Sequencer           Output         Read length   Error rate
Illumina (HiSeq X)  1.5 ~ 1.8 Tb   ~ 150 b       0.001 ~ 0.0001
PacBio              0.5 ~ 1 Gb     ~ 30 kb       0.15

PacBio data will be produced to construct a Japanese reference genome. We need a method that efficiently finds overlaps among reads with a high error rate.

10
(A) The sequence is first decomposed into its constituent k-mers. In this example, k=3, resulting in 12 k-mers each for S1 and S2.
(B) All k-mers are then converted to integer fingerprints via multiple hash functions. The number of hash functions determines the resulting sketch size H. Here H=4 (Γ1, ..., ΓH). The k-mer generating the minimum value for each hash is referred to as the min-mer for that hash.
(C) The sketch of a sequence is composed of the ordered set of its H min-mer fingerprints. In this example, the sketches of S1 and S2 share the same minimum fingerprints for Γ1 and Γ2.
(D) The fraction of entries shared between the sketches of two sequences S1 and S2 is an estimate of their Jaccard similarity.
(E) The overlapping region is located from the positions of the shared min-mers (ACC and CCG in this case).
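Steps (A) to (D) can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the implementation from the preprint: the two sequences are made-up examples, the helper names (`kmers`, `encode`, `sketch`) are hypothetical, and the hash functions are simple affine maps modulo a prime chosen for convenience.

```r
# (A) Decompose a sequence into its overlapping k-mers.
kmers <- function(s, k) {
  starts <- seq_len(nchar(s) - k + 1)
  substring(s, starts, starts + k - 1)
}

# Encode a k-mer over {A, C, G, T} as a base-4 integer.
encode <- function(kmer) {
  digits <- match(strsplit(kmer, "")[[1]], c("A", "C", "G", "T")) - 1
  sum(digits * 4^(rev(seq_along(digits)) - 1))
}

# (B, C) For each of the H hash functions x -> (a * x + b) mod p,
# keep the minimum fingerprint over all k-mers of the sequence.
sketch <- function(s, k, a, b, p = 1000003) {
  x <- unique(vapply(kmers(s, k), encode, numeric(1)))
  vapply(seq_along(a), function(h) min((a[h] * x + b[h]) %% p), numeric(1))
}

set.seed(1)
H <- 200                          # sketch size (number of hash functions)
a <- sample(1:1000002, H)         # random multipliers, never 0 mod p
b <- sample(0:1000002, H)         # random offsets

s1 <- "CATGGACCGACCAG"            # made-up example sequences
s2 <- "GCAGTACCGATCGT"

# (D) Fraction of sketch entries that agree estimates the Jaccard similarity.
estimate <- mean(sketch(s1, 3, a, b) == sketch(s2, 3, a, b))

# Exact Jaccard similarity of the two k-mer sets, for comparison.
k1 <- unique(kmers(s1, 3))
k2 <- unique(kmers(s2, 3))
exact <- length(intersect(k1, k2)) / length(union(k1, k2))
print(c(estimate = estimate, exact = exact))
```

Affine hashes are only approximately min-wise independent, so the estimate may deviate somewhat from the exact value; a production tool would use stronger hash functions.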

11 For two sets A and B, the Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets: J(A, B) = |A ∩ B| / |A ∪ B|. This measure of similarity is suitable for many applications, including textual similarity of documents and similarity of buying habits of customers.
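As a small illustration of the definition, the coefficient can be computed directly with base R set operations. The function name `jaccard` and the example baskets are made up for this sketch.

```r
# Jaccard coefficient: size of the intersection over size of the union.
jaccard <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

basket1 <- c("milk", "bread", "eggs")
basket2 <- c("bread", "eggs", "tea")
j <- jaccard(basket1, basket2)   # 2 shared items out of 4 distinct items
print(j)
```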

12 MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets (reads) are. The algorithm is used for finding similar documents (such as web pages; Google uses the technique for ads).
Example:
1. Transform the two sets into binary vectors
   0123456789
A  1110011000
B  1011010101
2. Prepare a hash function (in this case a permutation is enough)

13 3. Apply permutation 1 to the binary vectors
Before:
   0123456789
A  1110011000
B  1011010101
After:
   0123456789
A  0111001100
B  0111100011
4. Apply a function mh that returns the smallest index whose element is non-zero

14 5. Apply permutation 2 to the binary vectors
Before:
   0123456789
A  1110011000
B  1011010101
After:
   0123456789
A  1100001101
B  0000111111
6. Apply a function mh that returns the smallest index whose element is non-zero

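15 7. Results of mh for the two permutations:
When applying permutation 1: mh(A) = 1 and mh(B) = 1, so mh(A) == mh(B).
When applying permutation 2: mh(A) = 0 and mh(B) = 4, so mh(A) != mh(B).
8. If you focus on the first element i that has a non-zero value in A or B after applying a hash function, there are three possibilities:
C1. A[i]=0 and B[i]=1
C2. A[i]=1 and B[i]=0
C3. A[i]=1 and B[i]=1
   0123456789
A  1110011000
B  1011010101
9. The probability that mh(A) == mh(B) is the probability of case C3: #C3 / (#C1 + #C2 + #C3) = 3 / (3 + 2 + 3) = 0.375, which is equal to the definition of Jaccard similarity. With the two hash functions used so far, #hash functions that return [mh(A)==mh(B)] / #hash functions in total = 1/2 = 0.5; as more hash functions are added, this fraction approaches 0.375.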
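The counts behind this argument can be checked directly in R. The sketch below uses the vectors shown on the slides; the helper names `bits` and `mh` are chosen here for illustration, and `mh` subtracts 1 because the slide labels positions from 0.

```r
# Turn a string of 0/1 characters into an integer vector.
bits <- function(s) as.integer(strsplit(s, "")[[1]])

A <- bits("1110011000")
B <- bits("1011010101")

# Count the three cases for positions where A or B is non-zero.
c1 <- sum(A == 0 & B == 1)
c2 <- sum(A == 1 & B == 0)
c3 <- sum(A == 1 & B == 1)
p_match <- c3 / (c1 + c2 + c3)   # probability that mh(A) == mh(B)

# mh: smallest 0-based index holding a non-zero element.
mh <- function(v) min(which(v != 0)) - 1

# Permuted vectors as shown for permutation 1 and permutation 2.
match1 <- mh(bits("0111001100")) == mh(bits("0111100011"))
match2 <- mh(bits("1100001101")) == mh(bits("0000111111"))
observed <- mean(c(match1, match2))  # fraction of matching hash functions

print(c(p_match = p_match, observed = observed))
```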

16 Number of elements: m. Number of reads: n. Number of hash functions: k.
Calculation cost for Jaccard similarity comparison = O(mn^2)
Calculation cost for MinHash = O(kmn)
Calculation cost for hash value comparison = O(kn^2)
In total, O(kmn) + O(kn^2) ~ O(kn^2) when n >> m.
If k < m, MinHash is faster than Jaccard similarity.
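To get a feel for these cost expressions, the operation counts can be compared for concrete sizes. The values of m, n and k below are arbitrary illustrations, not figures from the preprint, and constant factors are ignored.

```r
# Operation counts implied by the big-O expressions (constants ignored).
jaccard_cost <- function(m, n) m * n^2
minhash_cost <- function(m, n, k) k * m * n + k * n^2

m <- 1e4   # elements per read (illustrative)
n <- 1e6   # number of reads (illustrative)
k <- 500   # number of hash functions (illustrative)

speedup <- jaccard_cost(m, n) / minhash_cost(m, n, k)
print(speedup)   # approaches m / k as n grows relative to m
```

The ratio simplifies to m * n / (k * (m + n)), which tends to m / k when n >> m, matching the condition k < m stated on the slide.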

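17
m <- 200            # number of elements
ov <- 100           # size of the overlap
k <- (m + ov) / 2   # ones per vector; also reused as the number of hash functions
J <- ov / m

# Two binary vectors (a, b)
a <- b <- rep(0, m)
a[1:k] <- 1
b[(k - ov + 1):m] <- 1
Jaccard <- length(which(a == 1 & b == 1)) / m   # the union covers all m elements

R <- 100                  # number of repetitions
mh <- matrix(NA, R, k)    # running MinHash estimate per repetition
for (j in 1:R) {
  M <- 0
  for (i in 1:k) {
    hash <- sample(1:m, m)   # random permutation used as hash function
    if (min(which(a[hash] == 1)) == min(which(b[hash] == 1))) M <- M + 1
    mh[j, i] <- M / i
  }
}
boxplot(mh, outline = FALSE)
abline(h = Jaccard, col = "red")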

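18
m <- 200            # number of elements
ov <- 100           # size of the overlap
k <- (m + ov) / 2   # elements per set; also reused as the number of hash functions
J <- ov / m

### Min-mer
a <- 1:k              # element indices of set a
b <- (k - ov + 1):m   # element indices of set b

R <- 100                  # number of repetitions
mh <- matrix(NA, R, k)    # running MinHash estimate per repetition
for (j in 1:R) {
  M <- 0
  for (i in 1:k) {
    hash <- sample(1:m, m)   # random hash value for every element
    if (min(hash[a]) == min(hash[b])) M <- M + 1
    mh[j, i] <- M / i
  }
}
boxplot(mh, outline = FALSE)
abline(h = J, col = "red")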

19 Reads were randomly extracted from the human reference genome and errors were introduced to simulate a PacBio sequencing error model (11.88% insertion, 1.83% deletion, and 1.29% substitution). Match types are divided into: unrelated sequences (rand), overlapping reads (olap), and reads mapped to a perfect reference (map). The estimations are from 50,000 trials. Probability of detecting ≥1 or ≥3 matching minhash for k=10 (A) and k=16 (B) with various sketch sizes.

20
Object      Exact               LSH        Application
Group       Jaccard similarity  MinHash    Assembly, Image recognition
Distance    Euclidean                      FACS
            Cosine
            Hamming
            Edit
Clustering  Single linkage      LSH-link

21 Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items). This differs from conventional hash functions, such as those used in cryptography, because here the goal is to maximize the probability of a "collision" between similar items rather than to avoid collisions.
[Figure: A hash function that maps names to integers from 0 to 15. There is a collision between "John Smith" and "Lisa Smith".]
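The bucket idea can be illustrated with MinHash values used as bucket keys. This is a minimal sketch under assumptions: the universe size, the number of hash functions, the example sets and the helper name `bucket_key` are all made up for illustration.

```r
set.seed(7)
universe <- 1:50   # possible items (illustrative)
n_hash <- 4        # hash functions combined into one bucket key

# Each hash function is a random permutation of the universe.
perms <- replicate(n_hash, sample(universe), simplify = FALSE)

# Bucket key: the MinHash values of the set under every permutation.
bucket_key <- function(items) {
  paste(vapply(perms, function(p) min(p[items]), integer(1)), collapse = "-")
}

x <- 1:20
y <- c(1:19, 21)   # shares 19 of 21 distinct items with x
z <- 31:50         # shares nothing with x

keys <- c(x = bucket_key(x), y = bucket_key(y), z = bucket_key(z))
print(keys)
```

Sets with high Jaccard similarity, such as x and y, are likely (but not guaranteed) to receive the same key, whereas disjoint sets such as x and z can never share one, because a permutation assigns distinct values to distinct items.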
