Distributed Memory Partitioning of High-Throughput Sequencing Datasets for Enabling Parallel Genomics Analyses Nagakishore Jammula, Sriram P. Chockalingam,

Slides:

Advertisements

Similar presentations

Chapter 4 Partition I. Covering and Dominating.

Advertisements

CS 336 March 19, 2012 Tandy Warnow.

Multilevel Hypergraph Partitioning Daniel Salce Matthew Zobel.

Size-estimation framework with applications to transitive closure and reachability Presented by Maxim Kalaev Edith Cohen AT&T Bell Labs 1996.

* Bellman-Ford: single-source shortest distance * O(VE) for graphs with negative edges * Detects negative weight cycles * Floyd-Warshall: All pairs shortest.

SplitMEM: graphical pan-genome analysis with suffix skips Shoshana Marcus May 29, 2014.

Author: Jie chen and Yousef Saad IEEE transactions of knowledge and data engineering.

Distributed Message Passing for Large Scale Graphical Models Alexander Schwing Tamir Hazan Marc Pollefeys Raquel Urtasun CVPR2011.

CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly Spring 2015.

SST:an algorithm for finding near- exact sequence matches in time proportional to the logarithm of the database size Eldar Giladi Eldar Giladi Michael.

Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.

1 Brief Announcement: Distributed Broadcasting and Mapping Protocols in Directed Anonymous Networks Michael Langberg: Open University of Israel Moshe Schwartz:

Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.

A Framework For Community Identification in Dynamic Social Networks Chayant Tantipathananandh Tanya Berger-Wolf David Kempe Presented by Victor Lee.

The Shortest Path Problem

Domain decomposition in parallel computing Ashok Srinivasan Florida State University COT 5410 – Spring 2004.

Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.

Motif Discovery in Protein Sequences using Messy De Bruijn Graph Mehmet Dalkilic and Rupali Patwardhan.

De-novo Assembly Day 4.

Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.

A computational study of protein folding pathways Reducing the computational complexity of the folding process using the building block folding model.

Graph Partitioning and Clustering E={w ij } Set of weighted edges indicating pair-wise similarity between points Similarity Graph.

1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.

Scalable and Efficient Data Streaming Algorithms for Detecting Common Content in Internet Traffic Minho Sung Networking & Telecommunications Group College.

CSCE350 Algorithms and Data Structure Lecture 17 Jianjun Hu Department of Computer Science and Engineering University of South Carolina

Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )

1 A -Approximation Algorithm for Shortest Superstring Speaker: Chuang-Chieh Lin Advisor: R. C. T. Lee National Chi-Nan University Sweedyk, Z. SIAM Journal.

Presenter: Yang Ruan Indiana University Bloomington

A Clustering Algorithm based on Graph Connectivity Balakrishna Thiagarajan Computer Science and Engineering State University of New York at Buffalo.

StreamX10: A Stream Programming Framework on X10 Haitao Wei School of Computer Science at Huazhong University of Sci&Tech.

Graph Indexing: A Frequent Structure- based Approach Alicia Cosenza November 26 th, 2007.

1 NETTAB 2012 FILTERING WITH ALIGNMENT FREE DISTANCES FOR HIGH THROUGHPUT DNA READS ASSEMBLY Maria de Cola, Giovanni Felici, Daniele Santoni, Emanuel Weitschek.

Shortest Path Problems Dijkstra’s Algorithm. Introduction Many problems can be modeled using graphs with weights assigned to their edges: Airline flight.

Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International.

ITEC 2620A Introduction to Data Structures Instructor: Prof. Z. Yang Course Website: 2620a.htm Office: TEL 3049.

Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2.

Data Structures and Algorithms in Parallel Computing Lecture 2.

CS 484 Load Balancing. Goal: All processors working all the time Efficiency of 1 Distribute the load (work) to meet the goal Two types of load balancing.

Domain decomposition in parallel computing Ashok Srinivasan Florida State University.

Pipelined and Parallel Computing Partition for 1 Hongtao Du AICIP Research Nov 3, 2005.

Graph Theory. undirected graph node: a, b, c, d, e, f edge: (a, b), (a, c), (b, c), (b, e), (c, d), (c, f), (d, e), (d, f), (e, f) subgraph.

Effective Parallel Multicore-optimized K-mers Counting Algorithm

Example Apply hierarchical clustering with d min to below data where c=3. Nearest neighbor clustering d min d max will form elongated clusters!

Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.

MERmaid: Distributed de novo Assembler Richard Xia, Albert Kim, Jarrod Chapman, Dan Rokhsar.

Spanning Trees Dijkstra (Unit 10) SOL: DM.2 Classwork worksheet Homework (day 70) Worksheet Quiz next block.

Clustering [Idea only, Chapter 10.1, 10.2, 10.4].

CSCI2950-C Genomes, Networks, and Cancer

Cohesive Subgraph Computation over Large Graphs

Some slides adapted from those of Yuan Yu and Michael Isard

WABI: Workshop on Algorithms in Bioinformatics

Fast nearest neighbor searches in high dimensions Sami Sieranoja

Research in Computational Molecular Biology , Vol (2008)

Probabilistic Data Management

13 Text Processing Hongfei Yan June 1, 2016.

Graph Analysis by Persistent Homology

Spare Register Aware Prefetching for Graph Algorithms on GPUs

Towards Effective Partition Management for Large Graphs

Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights Feng Zhang †⋄, Jidong Zhai ⋄, Xipeng Shen #, Onur Mutlu ⋆, Wenguang.

Intro to Alignment Algorithms: Global and Local

ITEC 2620M Introduction to Data Structures

A Fundamental Bi-partition Algorithm of Kernighan-Lin

Anastasia Baryshnikova Cell Systems

(Top) Construction of synthetic long read clouds with 10× Genomics technology. (Top) Construction of synthetic long read clouds with 10× Genomics technology.

Major Design Strategies

Roye Rozov Shamir group meeting 3/7/13

Fragment Assembly 7/30/2019.

Presentation transcript:

Distributed Memory Partitioning of High-Throughput Sequencing Datasets for Enabling Parallel Genomics Analyses Nagakishore Jammula, Sriram P. Chockalingam, and Srinivas Aluru Georgia Institute of Technology ACM BCB – August 2017 Presented by: Evan Stene

Outline Introduction Background Methods Results

Introduction Large volume of short biological sequence data Construction of long sequences from short is time consuming Use existing sequence for reference Compare short sequences to each other Distributed computing is a promising direction for speed up Can intelligent partitioning benefit sequence construction?

Introduction – Case 1: Alignment Assuming a copy of a reference exists in each partition Partitioning dataset is simply balancing partition sizes (always true?) Reference acts as global coordinate system Position on reference will infer relations across partitions Two closely related sequences Reference

Introduction – Case 2: Assembly Ideally, group reads with greatest relation How can we calculate overlap quickly? Partition 1 Partition 2 No relation in this partition Would be related in this partition

Introduction – Motivation Intelligently partitioning entire dataset can be time consuming (depending on method) Partitioning a graph that represents the dataset is a good partition of the dataset as well Related works: Pairwise similarity – very time consuming Hash partition of graph – destroys data locality

Background – de Bruijn Graph Used to bring down number of comparisons in assembly Captures connection and frequency of common subsequences Tracing paths through graph recreate sequence dataset

Background – de Bruijn Graph Input Sequences, k = 3 AABCDD AABCED Graph AAB

Background – de Bruijn Graph Input Sequences, k = 3 AABCDD AABCED Graph AAB ABC

Background – de Bruijn Graph Input Sequences, k = 3 AABCDD AABCED Graph BCD AAB ABC BCE

Background – de Bruijn Graph Input Sequences, k = 3 AABCDD AABCED Graph BCD CDD AAB ABC BCE CED

Methods Construction Compaction Graph Partition Dataset Partition

Methods - Construction Each vertex has at most 8 neighbors Alphabet of size 4 for DNA 4 edges in, 4 out 2 Classes of vertex: Vertices that branch ( >1 in/out edges) Vertices in a chain

Methods - Construction Build hash table of all subsequence of size k in dataset For each subsequence check the 8 possible neighbors Add 1 to weight of edge for each occurrence Trim edges below some threshold

Methods - Compaction Essentially connected components Combine chain vertices into single node Concatenate labels (subsequences) and sum edge weights

Methods – Graph Partition Optimize two parameters – min cut and balance Cut defined as weight of edges between partitions Bound the balance of partitions by some threshold (1 + ε) Balance function C(Vi) = sum of weights of vertices in Vi m = total number of partitions

Methods – Graph Partition Recursively coarsen graph Partition coarsest graph Recursively un-coarsen graph, refining cut after each iteration From [1]

Methods – Dataset Partitioning Map partition id to each subsequence of length k Chains will contain multiple subsequences that will need to map Build distributed index from mapping Assign sequence r to partition id most frequently assigned Sequence r will contain |r| - k + 1 subsequences

Results – Test Environment 32 nodes 16 cores 128GB memory OpenMPI 1.8.6

Results - Compaction

Results – Graph Partitioning

Results – Graph Runtimes

Results – Dataset Quality

Results – Dataset Runtimes Total runtime for Bird dataset on 512 cores: 11 min

References [1] Henning Meyerhenke, Peter Sanders, and Christian Schulz. 2015. Parallel Graph Partitioning for Complex Networks. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium. 1055–1064.