1 bioRxiv preprint first posted online August 14, 2014; doi: The copyright holder for this preprint is the author/funder.

The preprint is made available under a CC-BY-NC-ND 4.0 International license. Journal Club 04/06/2015 K. Higasa

2 A hash function is any function that can be used to map digital data of arbitrary size to digital data of fixed size. For example, suppose that the input data are file names such as FILE0000.txt, FILE0001.txt, FILE0002.txt, etc., with mostly sequential numbers. For such data, a function that extracts the numeric part k of the file name would be a hash function.
Properties expected of a cryptographic hash function include:
1. Pre-image resistance: Given a hash h, it should be difficult to find any message m such that h = hash(m). This concept is related to that of a one-way function.
2. Collision resistance: It should be difficult to find two different messages m1 and m2 such that hash(m1) = hash(m2). Such a pair is called a hash collision.
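The file-name example can be sketched in a few lines of Python. This is only an illustration: the function names `numeric_part`, `bucket` and `sha256_hex` and the choice of 16 buckets are assumptions made here, not part of the slides. SHA-256 from the standard library is used only to show a fixed-size output for inputs of any length.

```python
import hashlib
import re


def numeric_part(file_name):
    """Toy hash: extract the number k embedded in a name such as FILE0001.txt."""
    match = re.search(r"\d+", file_name)
    if match is None:
        raise ValueError(f"no digits in {file_name!r}")
    return int(match.group())


def bucket(file_name, n_buckets=16):
    """Map the file name to one of n_buckets slots (16 is an arbitrary choice)."""
    return numeric_part(file_name) % n_buckets


def sha256_hex(message):
    """Cryptographic hash: the output is always 64 hex characters."""
    return hashlib.sha256(message.encode()).hexdigest()


if __name__ == "__main__":
    print(numeric_part("FILE0001.txt"))
    # Two different names that fall into the same bucket: a hash collision.
    print(bucket("FILE0000.txt"), bucket("FILE0016.txt"))
    print(len(sha256_hex("a")), len(sha256_hex("a" * 1000)))
```

The toy hash collides easily (any two numbers that differ by a multiple of 16 share a bucket), which is exactly what a cryptographic hash is designed to make hard to achieve on purpose.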

3 Genome assembly refers to aligning and merging fragments of DNA sequence in order to reconstruct the original sequence. This is needed because DNA sequencing technology cannot read whole genomes (chromosomes) in one go, but rather reads small pieces of 100 bases or more, depending on the technology used.
[Figure: the original sequence (ACGT...) is broken into reads, which are then assembled.]
A 30x human genome corresponds to ~1 billion reads of ~100 bp in length per person.
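The "~1 billion reads" figure follows from simple arithmetic: number of reads = coverage x genome size / read length. A minimal sketch, assuming a human genome size of about 3.1 billion bases (this value is an assumption made here; the slide does not state it):

```python
def reads_needed(coverage, genome_size_bp, read_length_bp):
    """Number of reads required to cover the genome `coverage` times on average."""
    return coverage * genome_size_bp / read_length_bp


# Assumed genome size (not given on the slide): about 3.1 Gb.
HUMAN_GENOME_BP = 3_100_000_000

if __name__ == "__main__":
    n_reads = reads_needed(coverage=30, genome_size_bp=HUMAN_GENOME_BP, read_length_bp=100)
    print(f"{n_reads:.2e} reads")
```

Under this assumption the result is on the order of one billion reads, consistent with the slide.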

4 Current surveys of genetic variation (mainly variation associated with diseases) depend largely on the reference sequence of the human genome constructed by the international project. Cost: over $3 billion.
Work that relies on the reference: interpretation of GWAS results; design of PCR primers; mapping of NGS reads to find variations.

5 While the quality of the current human reference genome sequence is high, more than 160 gaps remain, and the effort to improve the reference genome continues. (Gap = missing sequence.) In practice, a few percent of reads cannot be mapped anywhere, probably because of these missing parts or because of differences among populations or individuals. Therefore, an effort to reconstruct a more complete and ethnically applicable version of the human genome reference sequence will be essential to bring about a new era for future human genome studies.

6 Repetitive sequences make assembly difficult when the repeat length exceeds the read length. Longer reads are better because they are more likely to contain unique sequence. Unfortunately, most high-throughput sequencing methods generate reads of only a few hundred base pairs, which is well short of many common repeats.
[Figure: assembly proceeds by overlap finding, then merging.]

7 Mardis, NHGRI Current Topics in Genome Analysis 2014

9
Sequencer           Output         Read length   Error rate
Illumina (HiSeq X)  1.5 ~ 1.8 Tb   ~ 150 b       0.001~
PacBio              0.5 ~ 1 Gb     ~ 30 kb       0.15

PacBio data will be produced to construct a Japanese reference. We need a method that efficiently finds overlaps among reads with a high error rate.

10
(A) The sequence is first decomposed into its constituent k-mers. In this example, k=3, resulting in 12 k-mers each for S1 and S2.
(B) All k-mers are then converted to integer fingerprints via multiple hash functions. The number of hash functions determines the resulting sketch size H. Here H=4 (Γ1..H). The k-mer generating the minimum value for each hash is referred to as the min-mer for that hash.
(C) The sketch of a sequence is composed of the ordered set of its H min-mer fingerprints. In this example, the sketches of S1 and S2 share the same minimum fingerprints for Γ1 and Γ2.
(D) The fraction of entries shared between the sketches of two sequences S1 and S2 is an estimate of their Jaccard similarity.
(E) The overlapping region is found from the shared min-mers (ACC and CCG in this case).
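Steps (A) to (D) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the preprint: the hash family (BLAKE2b with the hash index prepended to the k-mer), the function names, and the two 14-base sequences are all chosen here for illustration.

```python
import hashlib


def kmers(seq, k):
    """(A) All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def fingerprint(kmer, seed):
    """(B) Integer fingerprint of a k-mer under hash function number `seed`.

    Assumption: BLAKE2b salted with the seed stands in for the hash family.
    """
    digest = hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, n_hashes):
    """(C) For each hash function keep the (fingerprint, k-mer) pair with the
    minimum fingerprint, i.e. the min-mer for that hash."""
    return [min((fingerprint(km, h), km) for km in kmers(seq, k))
            for h in range(n_hashes)]


def estimate_jaccard(sketch1, sketch2):
    """(D) Fraction of sketch positions where both sequences have the same min-mer."""
    shared = sum(1 for a, b in zip(sketch1, sketch2) if a == b)
    return shared / len(sketch1)


def shared_min_mers(sketch1, sketch2):
    """K-mers behind the shared sketch entries; these seed the overlap search in (E)."""
    return [a[1] for a, b in zip(sketch1, sketch2) if a == b]


if __name__ == "__main__":
    # Illustrative 14-base sequences: k=3 gives 12 k-mers each, as on the slide.
    s1, s2 = "CATGGACCGACCAG", "GCAGTACCGATCGT"
    sk1, sk2 = sketch(s1, k=3, n_hashes=4), sketch(s2, k=3, n_hashes=4)
    print(estimate_jaccard(sk1, sk2), shared_min_mers(sk1, sk2))
```

With only H=4 hash functions the estimate is coarse (it can only take the values 0, 0.25, 0.5, 0.75 or 1); larger sketches give finer estimates at higher cost.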

11 The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets. For two sets A and B, $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. This measure of similarity is suitable for many applications, including textual similarity of documents and similarity of buying habits of customers.
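The definition translates directly into code. A minimal sketch; the function name and the convention of returning 1.0 for two empty sets are assumptions made here:

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two small "documents" represented as sets of words.
    doc1 = {"fast", "genome", "assembly"}
    doc2 = {"genome", "assembly", "hashing"}
    print(jaccard(doc1, doc2))
```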

12 MinHash (or the min-wise independent permutations locality-sensitive hashing scheme) is a technique for quickly estimating how similar two sets (here, reads) are. The algorithm is used for finding similar documents (such as web pages; Google uses the technique for ads).
Example:
1. Transform the two sets into two binary (0/1) vectors A and B.
2. Prepare a hash function (in this case, a permutation is enough).

13
3. Apply hash function 1 to the binary vectors A and B.
4. Apply a function mh that returns the smallest index among the elements with a non-zero value.

14
5. Apply hash function 2 to the binary vectors A and B.
6. Apply the function mh, which returns the smallest index among the elements with a non-zero value.

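15
7. Compare mh(A) and mh(B) after applying hash function 1 and after applying hash function 2 to the binary vectors. Here, (number of hash functions for which mh(A) == mh(B)) / (total number of hash functions) = 1/2 = 0.5.
8. Consider the first element i that has a non-zero value in A or B after applying a hash function. There are three possibilities:
C1. A[i]=0 and B[i]=1
C2. A[i]=1 and B[i]=0
C3. A[i]=1 and B[i]=1
Only C3 gives mh(A) == mh(B), so the probability of obtaining mh(A) == mh(B) is $\frac{|C3|}{|C1| + |C2| + |C3|} = \frac{|A \cap B|}{|A \cup B|}$, where |C| counts the positions of each type. This is equal to the definition of Jaccard similarity, so the fraction of hash functions with mh(A) == mh(B) is an estimate of it.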
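The argument can be checked numerically: with random permutations as hash functions, the fraction of permutations for which mh(A) == mh(B) should approach the exact Jaccard similarity. A minimal sketch, with example vectors and function names chosen here for illustration:

```python
import random


def mh(vec, perm):
    """Smallest position, after permuting the vector, that holds a non-zero value."""
    return min(pos for pos, idx in enumerate(perm) if vec[idx] == 1)


def minhash_estimate(a, b, n_hashes, rng):
    """Fraction of random permutations for which mh(a) == mh(b)."""
    m = len(a)
    hits = 0
    for _ in range(n_hashes):
        perm = rng.sample(range(m), m)  # one random permutation = one hash function
        if mh(a, perm) == mh(b, perm):
            hits += 1
    return hits / n_hashes


def jaccard(a, b):
    """Exact Jaccard similarity of two binary vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union


if __name__ == "__main__":
    # Illustrative vectors (not the ones on the slide): 3 shared positions, 5 in the union.
    A = [1, 1, 1, 0, 0, 0, 1, 0]
    B = [0, 1, 1, 1, 0, 0, 1, 0]
    rng = random.Random(0)
    print("exact   :", jaccard(A, B))
    print("estimate:", minhash_estimate(A, B, n_hashes=5000, rng=rng))
```

With only two hash functions, as on the slide, the estimate can only be 0, 0.5 or 1; many more permutations are needed for a stable value.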

16
Number of elements: m
Number of reads: n
Number of hash functions: k
Calculation cost for Jaccard similarity comparison = $O(mn^2)$
Calculation cost for MinHash = $O(kmn)$
Calculation cost for hash value comparison = $O(kn^2)$
In total, $O(kmn) + O(kn^2) \approx O(kn^2)$ when n >> m. If k < m, MinHash is faster than the exact Jaccard similarity comparison.
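To make the comparison concrete, the two cost expressions can be evaluated for sample values. The numbers below (m = 100 elements, n = 1000 reads, k = 10 hash functions) are hypothetical, and the functions count abstract operations, ignoring constant factors:

```python
def exact_cost(m, n):
    """All-pairs exact Jaccard comparison: about m * n^2 operations."""
    return m * n * n


def minhash_cost(k, m, n):
    """Sketching (k * m * n) plus all-pairs sketch comparison (k * n^2)."""
    return k * m * n + k * n * n


if __name__ == "__main__":
    m, n, k = 100, 1000, 10  # hypothetical sizes, chosen so that k < m
    print("exact  :", exact_cost(m, n))
    print("MinHash:", minhash_cost(k, m, n))
```

For these values the MinHash count is roughly an order of magnitude smaller, in line with the k < m condition on the slide.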

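17
m <- 200              # total number of elements
ov <- 100             # overlap between the two reads
k <- (m + ov) / 2     # read length; also used as the number of hash functions
J <- ov / m           # expected Jaccard similarity
# Two binary (0/1) vectors (a, b)
a <- b <- rep(0, m)
a[1:k] <- 1
b[(k - ov + 1):m] <- 1
# Exact Jaccard similarity (the union of a and b covers all m elements)
Jaccard <- length(which(a == 1 & b == 1)) / m
R <- 100              # number of replicates
mh <- matrix(NA, R, k)
for (j in 1:R) {
  M <- 0
  for (i in 1:k) {
    hash <- sample(1:m, m)   # random permutation as a hash function
    if (min(which(a[hash] == 1)) == min(which(b[hash] == 1))) M <- M + 1
    mh[j, i] <- M / i        # running estimate after i hash functions
  }
}
boxplot(mh, outline = FALSE)
abline(h = Jaccard, col = "red")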

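18
m <- 200              # total number of elements
ov <- 100             # overlap between the two reads
k <- (m + ov) / 2     # read length; also used as the number of hash functions
J <- ov / m           # expected Jaccard similarity
### Min-mer: reads as sets of element indices
a <- 1:k
b <- (k - ov + 1):m
R <- 100              # number of replicates
mh <- matrix(NA, R, k)
for (j in 1:R) {
  M <- 0
  for (i in 1:k) {
    hash <- sample(1:m, m)   # random hash value for every element
    if (min(hash[a]) == min(hash[b])) M <- M + 1   # same minimum hash value
    mh[j, i] <- M / i        # running estimate after i hash functions
  }
}
boxplot(mh, outline = FALSE)
abline(h = J, col = "red")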

19 Probability of detecting ≥1 or ≥3 MinHash matches for k=10 (A) and k=16 (B) with various sketch sizes. Reads were randomly extracted from the human reference genome, and errors were introduced to simulate a PacBio sequencing error model (11.88% insertion, 1.83% deletion, and 1.29% substitution). Match types are divided into unrelated sequences (rand), overlapping reads (olap), and reads mapped to a perfect reference (map). The estimates are from 50,000 trials.

20
Object      Exact               LSH       Application
Group       Jaccard similarity  MinHash   Assembly, image recognition
Distance    Euclidean                     FACS
            Cosine
            Hamming
            Edit
Clustering  Single linkage      LSH-link

21 Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items). This differs from conventional hash functions, such as those used in cryptography, because the goal here is to maximize the probability of a "collision" between similar items rather than to avoid collisions.
[Figure: a hash function that maps names to integers from 0 to 15. There is a collision between "John Smith" and "Lisa Smith".]
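One common way to turn MinHash signatures into an LSH scheme is banding: each signature is cut into bands, each band is used as a bucket key, and any two items that share a bucket in at least one band become candidate pairs. The slides do not describe banding, so the sketch below is an illustrative technique chosen here; the function names and the BLAKE2b-based hash family are likewise assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(items, n_hashes):
    """One minimum hash value per hash function (BLAKE2b salted by the hash index)."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest(), "big"
            )
            for x in items
        )
        for seed in range(n_hashes)
    ]


def candidate_pairs(signatures, bands):
    """Pairs of item indices that share a bucket in at least one band."""
    rows = len(signatures[0]) // bands
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for idx, sig in enumerate(signatures):
            key = tuple(sig[b * rows:(b + 1) * rows])  # the band is the bucket key
            buckets[key].append(idx)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs


if __name__ == "__main__":
    sets = [
        {"ACC", "CCG", "CGA", "GAC"},
        {"ACC", "CCG", "CGA", "GAT"},  # similar to the first set
        {"TTT", "GGG", "AAA", "TTG"},  # unrelated
    ]
    sigs = [minhash_signature(s, n_hashes=8) for s in sets]
    print(candidate_pairs(sigs, bands=4))
```

Only candidate pairs need a full comparison afterwards, which avoids comparing every pair of items. Using more bands with fewer rows each makes collisions between moderately similar items more likely.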