Roye Rozov Shamir group meeting 3/7/13

Slides:



Advertisements
Similar presentations
CS 336 March 19, 2012 Tandy Warnow.
Advertisements

I/O and Space-Efficient Path Traversal in Planar Graphs Craig Dillabaugh, Carleton University Meng He, University of Waterloo Anil Maheshwari, Carleton.
Compiler Support for Superscalar Processors. Loop Unrolling Assumption: Standard five stage pipeline Empty cycles between instructions before the result.
Evaluating “find a path” reachability queries P. Bouros 1, T. Dalamagas 2, S.Skiadopoulos 3, T. Sellis 1,2 1 National Technical University of Athens 2.
Chapter 4: Trees Part II - AVL Tree
Searching on Multi-Dimensional Data
CS 267: Automated Verification Lecture 10: Nested Depth First Search, Counter- Example Generation Revisited, Bit-State Hashing, On-The-Fly Model Checking.
Data Structure and Algorithms (BCS 1223) GRAPH. Introduction of Graph A graph G consists of two things: 1.A set V of elements called nodes(or points or.
Pamela Ferretti Laboratory of Computational Metagenomics Centre for Integrative Biology University of Trento Italy Microbial Genome Assembly 1.
Next Generation Sequencing, Assembly, and Alignment Methods
Algorithm Strategies Nelson Padua-Perez Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
An Improved Construction for Counting Bloom Filters Flavio Bonomi Michael Mitzenmacher Rina Panigrahy Sushil Singh George Varghese Presented by: Sailesh.
SASH Spatial Approximation Sample Hierarchy
Modern Information Retrieval
CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly Spring 2015.
CS728 Lecture 16 Web indexes II. Last Time Indexes for answering text queries –given term produce all URLs containing –Compact representations for postings.
Graphs & Graph Algorithms 2 Nelson Padua-Perez Bill Pugh Department of Computer Science University of Maryland, College Park.
U NIVERSITY OF M ASSACHUSETTS, A MHERST Department of Computer Science Emery Berger University of Massachusetts, Amherst Advanced Compilers CMPSCI 710.
Improved results for a memory allocation problem Rob van Stee University of Karlsruhe Germany Leah Epstein University of Haifa Israel WADS 2007 WAOA 2007.
Mike 66 Sept Succinct Data Structures: Techniques and Lower Bounds Ian Munro University of Waterloo Joint work with/ work of Arash Farzan, Alex Golynski,
De-novo Assembly Day 4.
Bloom Filters An Introduction and Really Most Of It CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.
Efficient Minimal Perfect Hash Language Models David Guthrie, Mark Hepple, Wei Liu University of Sheffield.
1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.
Improving the Accuracy of Genome Assemblies July 17 th 2012 Roy Ronen *,1, Christina Boucher *,1, Hamidreza Chitsaz 2 and Pavel Pevzner 1 1. University.
Meraculous: De Novo Genome Assembly with Short Paired-End Reads
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Probe Design Using Exact Repeat Count August 8th, 2007 Aaron Arvey.
Metagenomics Assembly Hubert DENISE
The iPlant Collaborative
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
Graphs A graphs is an abstract representation of a set of objects, called vertices or nodes, where some pairs of the objects are connected by links, called.
Author: Haoyu Song, Murali Kodialam, Fang Hao and T.V. Lakshman Publisher/Conf. : IEEE International Conference on Network Protocols (ICNP), 2009 Speaker:
University of Illinois at Urbana-Champaign
Effective Parallel Multicore-optimized K-mers Counting Algorithm
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
MERmaid: Distributed de novo Assembler Richard Xia, Albert Kim, Jarrod Chapman, Dan Rokhsar.
Theory of Computational Complexity Yusuke FURUKAWA Iwama Ito lab M1.
Lesson: Sequence processing
Bloom Filters An Introduction and Really Most Of It CMSC 491
Quality Control & Preprocessing of Metagenomic Data
CPS216: Data-intensive Computing Systems
Indexing and hashing.
Updating SF-Tree Speaker: Ho Wai Shing.
Denovo genome assembly of Moniliophthora roreri
Research in Computational Molecular Biology , Vol (2008)
Parallel Density-based Hybrid Clustering
Hash-Based Indexes Chapter 11
Introduction to Genome Assembly
Removing Erroneous Connections
Distributed Memory Partitioning of High-Throughput Sequencing Datasets for Enabling Parallel Genomics Analyses Nagakishore Jammula, Sriram P. Chockalingam,
CS 598AGB Genome Assembly Tandy Warnow.
Graphs & Graph Algorithms 2
CS222: Principles of Data Management Notes #8 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Hash-Based Indexes Chapter 10
Lectures on Graph Algorithms: searching, testing and sorting
CS222P: Principles of Data Management Notes #8 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Hash-Based Indexes Chapter 11
Packet Classification Using Coarse-Grained Tuple Spaces
Database Design and Programming
ITEC 2620M Introduction to Data Structures
Dynamic Graph Algorithms
Hash-Based Indexes Chapter 11
Hash Functions for Network Applications (II)
Chapter 11 Instructor: Xin Zhang
TRC: Trace – Reference Compression
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #07 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
CS 201 Compiler Construction
Presentation transcript:

Roye Rozov Shamir group meeting 3/7/13 Space-efficient and exact de Bruijn graph representation based on a Bloom Filter Roye Rozov Shamir group meeting 3/7/13

Introduction De novo assembly of large genomes is a memory intensive process – often requires >= 100 GB using standard tools (e.g., Velvet, SOAPdenovo) Newer methods have been able to reduce this to about half (e.g. SGA requires ~50GB per human genome assembly), but this is still out of reach if a powerful server is not available Minia, a new Bloom filter based assembler introduced here, reduces memory use by one more order of magnitude

Biological events – viral/mitochondrial DNA, etc. R. Chikhi, 2013 “de novo assembly” presentation, http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2013/01/2013-evomics-assembly.pdf

De Bruijn graph assembly Nodes are kmer substrings, edges (k+1)mers formed by k-1 overlaps between nodes Assembly attempts to find traversal covering all edges – often not unique due to cycles and ambiguous branch order R. Chikhi, 2013 “de novo assembly” presentation, http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2013/01/2013-evomics-assembly.pdf

Memory efficient dBG methods – 1.Sparse Assembler draw overlap, write original cost, explain size calculation Remember it’s 2g because in most cases only one extension on each side C. Ye et al, BMC Bioinformatics 2012

2. Conway and Bromage 2011 R. Chikhi, WABI 2012 presentation ***If you list all combinations and then number the list to be able to identify contents, this is the cost of a number identifying a group*** Sparse bitmap representation designed to store minimal information about each node Designed to be close to self-information limit and efficiently queriable

3. Pell et al. Stored kmers in Bloom filter Showed that global structure is maintained up to FP limit (~18%) Used kmer traversal to partition kmers and then assemble components separately using conventional assembler (Abyss) with little RAM Did not use Bloom filter for assembly – used to partition reads into smaller sets, which are then assembled using classical assembler Wikimedia Commons: “Bloom filter”

Minia - Main ideas 1) for any node, enumerate its neighbors using Bloom filter 2) sequentially enumerate all nodes using list on disk, use to construct cFP 3) limit memory use in assembly by using Bloom filter + cFP set to allow traversal, 4) mark visited complex nodes during traversal

Store only nodes Hashed kmer nodes Reads Implied edges ACG CGT GTT ACA CAT ATA CCG GTA . ACGTT CCGTA . k=3 query hash ACATA A|CG? A C G T

Allow enumeration of real neighbors Bloom filter allows querying for neighbors Maintain list of critical false positives to avoid false branching List of all kmers stored on disk allows identifying cFP set

Traversal Marking visited nodes needed for DFS/BFS traveral Cannot mark anything on Bloom filter Use additional hash to mark these, but only mark complex nodes – branch-points or tips 2k+8 cost per node, luckily there are few – 0.06 of all kmers stored in human genome assembly performed C. Ye et al, BMC Bioinformatics 2012

cFP construction S = set of true positive nodes (stored on disk) E = set of all extensions of S P (C E) = the set of all yes answers to extension queries on Bloom filter Create P by querying all of S, writing to disk Generate cFP by loading partitions of S to hash, removing elements of partition from P iteratively

Bloom filter parameters h = number of hash functions applied F = false positive rate m = size of array n = number of elements inserted F ≈ (1 – e-h/r)h , r = m/n (number of bits per element) For the optimal number of hash functions, m ≈ 1.44log2 (1/F)n F from prob bit set to one is 1 – (prob not set to one)kn Prob not set to one = 1 – 1/m (since each position equally likely) F is prob of getting a “yes” by chance – k positions set to one =[ 1 – (prob not set to one)kn ]k K is same as h

cFP vs. Bloom filter size More false positives with a smaller Bloom filter (in terms of bits per element) E[|cFP|] = 8nF BF + cFP total: 1.44n(log2(1/F) +16Fnk) Minimum based on optimal number of hashes: n(1.44 log2 (16k/2.08) + 2.08)

cFP vs. BF size

Results – with vs. without cFP Small e.coli data set 20 M reads, assemblies compared with reference 13.8 with cFP vs 13.6 without Basically trade marking structure for cFP because many more nodes become complex without cFP, and each node is expensive Benefit is much better agreement with reference assembly (after mapping contigs with MUMmer)

Human de novo assembly Aligned contigs after assembly, found high accuracy Compared with sparseAssembler – much lower memory use, similar N50, time