CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly Spring 2015.

Slides:



Advertisements
Similar presentations
CS 336 March 19, 2012 Tandy Warnow.
Advertisements

Discrete Mathematics University of Jazeera College of Information Technology & Design Khulood Ghazal Connectivity Lecture _13.
Lecture 11: Code Optimization CS 540 George Mason University.
Review Binary Search Trees Operations on Binary Search Tree
Minimum Spanning Tree Sarah Brubaker Tuesday 4/22/8.
Analysis of Algorithms Depth First Search. Graph A representation of set of objects Pairs of objects are connected Interconnected objects are called “vertices.
Combinatorial Algorithms
CS 206 Introduction to Computer Science II 03 / 27 / 2009 Instructor: Michael Eckmann.
Data Structure and Algorithms (BCS 1223) GRAPH. Introduction of Graph A graph G consists of two things: 1.A set V of elements called nodes(or points or.
De Bruijn Sequences Define an alphabet A consisting of k elements E.g. A={0,1}, A={a,b,c}, A= { 图,论,很,好, 玩 } K=2 , K=3 , and K=5 in the preceding example.
CS 206 Introduction to Computer Science II 11 / 11 / Veterans Day Instructor: Michael Eckmann.
1 CS 201 Compiler Construction Lecture 12 Global Register Allocation.
Improving code generation. Better code generation requires greater context Over expressions: optimal ordering of subtrees Over basic blocks: Common subexpression.
Reduction Techniques Restriction Local Replacement Component Design Examples.
Connected Components, Directed Graphs, Topological Sort COMP171.
Chapter 9 Graph algorithms Lec 21 Dec 1, Sample Graph Problems Path problems. Connectedness problems. Spanning tree problems.
Connected Components, Directed Graphs, Topological Sort Lecture 25 COMP171 Fall 2006.
(Page 554 – 564) Ping Perez CS 147 Summer 2001 Alternative Parallel Architectures  Dataflow  Systolic arrays  Neural networks.
Connected Components, Directed graphs, Topological sort COMP171 Fall 2005.
The Euler-tour technique
Improving Code Generation Honors Compilers April 16 th 2002.
Improving code generation. Better code generation requires greater context Over expressions: optimal ordering of subtrees Over basic blocks: Common subexpression.
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Introduction to Routing. The Routing Problem Apply after placement Input: –Netlist –Timing budget for, typically, critical nets –Locations of blocks and.
Zvi Kohavi and Niraj K. Jha 1 Capabilities, Minimization, and Transformation of Sequential Machines.
Chapter 9 – Graphs A graph G=(V,E) – vertices and edges
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 8, 2005 ChengXiang Zhai Department of Computer Science University of Illinois,
Meraculous: De Novo Genome Assembly with Short Paired-End Reads
Graph. Terminologi Graph A graph consists of a set of vertices V, along with a set of edges E that connect pairs of vertices. – An edge e = (vi,vj) connects.
1 Applications of BFS and DFS CSE 2011 Winter May 2016.
Memory Allocation of Multi programming using Permutation Graph By Bhavani Duggineni.
Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities Zemin Ning The Wellcome Trust Sanger Institute.
ITEC 2620A Introduction to Data Structures Instructor: Prof. Z. Yang Course Website: 2620a.htm Office: TEL 3049.
RNA Sequence Assembly WEI Xueliang. Overview Sequence Assembly Current Method My Method RNA Assembly To Do.
Data Structures and Algorithms in Parallel Computing Lecture 2.
Lecture 17: Trees and Networks I Discrete Mathematical Structures: Theory and Applications.
A new Approach to Fragment Assembly in DNA Sequenceing Fei wu April,24,2006.
CSE 421 Algorithms Richard Anderson Winter 2009 Lecture 5.
SPARSE CERTIFICATES AND SCAN-FIRST SEARCH FOR K-VERTEX CONNECTIVITY
CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow.
Chapter 20: Graphs. Objectives In this chapter, you will: – Learn about graphs – Become familiar with the basic terminology of graph theory – Discover.
Graph Concepts Illustrated Using The Leda Library Amanuel Lemma CS252 Algorithms.
1 Trees : Part 1 Reading: Section 4.1 Theory and Terminology Preorder, Postorder and Levelorder Traversals.
Data Structures and Algorithm Analysis Graph Algorithms Lecturer: Jing Liu Homepage:
MERmaid: Distributed de novo Assembler Richard Xia, Albert Kim, Jarrod Chapman, Dan Rokhsar.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
Performance Profiling of NGS Genome Assembly Algorithms Alex Ropelewski Pittsburgh Supercomputing Center
Brute Force and Exhaustive Search Brute Force and Exhaustive Search Traveling Salesman Problem Knapsack Problem Assignment Problem Selection Sort and Bubble.
Hashing 1 Lec# 12 Presented by Halla Abdel Hameed.
Brute Force and Exhaustive Search Brute Force and Exhaustive Search Traveling Salesman Problem Knapsack Problem Assignment Problem Selection Sort and Bubble.
BCA-II Data Structure Using C Submitted By: Veenu Saini
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics)
ExaBiome: Exascale Solutions to Microbiome Analysis
Lecture 11 Graph Algorithms
Depth-First Search.
Distributed Memory Partitioning of High-Throughput Sequencing Datasets for Enabling Parallel Genomics Analyses Nagakishore Jammula, Sriram P. Chockalingam,
GRAPH SPANNERS.
Assignment 7 questions?.
Genome Assembly.
What is a Graph? a b c d e V= {a,b,c,d,e} E= {(a,b),(a,c),(a,d),
Connected Components, Directed Graphs, Topological Sort
Graphs Part 2 Adjacency Matrix
Depth-First Search Graph Traversals Depth-First Search DFS.
ITEC 2620M Introduction to Data Structures
Graphs G = (V, E) V are the vertices; E are the edges.
Applications of BFS CSE 2011 Winter /17/2019 7:03 AM.
Copyright ©2012 by Pearson Education, Inc. All rights reserved
Graphs: Definitions How would you represent the following?
Lecture 10 Graph Algorithms
Roye Rozov Shamir group meeting 3/7/13
Presentation transcript:

CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly Spring 2015

Problem statement Input: A set of unique k-mers and their corresponding extensions. k-mers are sequences of length k (alphabet is A/C/G/T). An extension is a simple symbol (A/C/G/T/F). The input k-mers form a de Bruijn graph, a special graph that is used to represent overlaps between sequences of symbols. Output: A set of contigs, i.e. connected components in the input de Bruijn graph. 2

Example Input: A set of unique k-mers and their corresponding extensions. Example for k = 3: Format: k-mer forward extension, backward extension 3 AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA

Example Input: A set of unique k-mers and their corresponding extensions. The input corresponds to a de Bruijn graph. Example for k = 3: 4 Sequence of k-mers : GATATC TCTCTGTGA AAC ACC CCG AATATG TGC Sequence of k-mers : Sequence of k-mers : AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA

Example Input: A set of unique k-mers and their corresponding extensions. The input corresponds to a de Bruijn graph. Example for k = 3: 5 Sequence of k-mers : GATATC TCTCTGTGA AAC ACC CCG AATATG TGC Sequence of k-mers : Sequence of k-mers : AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA k-mers with “F” as an extension are start vertices

Example Input: A set of unique k-mers and their corresponding extensions. The input corresponds to a de Bruijn graph. Example for k = 3: 6 Sequence of k-mers : GATATC TCTCTGTGA AAC ACC CCG AATATG TGC Sequence of k-mers : Sequence of k-mers : AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA Consider k-mer: TCT Concatenate last k-1 bases (CT) and forward extension (G) => CTG (following vertex) Concatenate backward extension (A) and first k-1bases (TC) =>ATC (preceding vertex) The graph is undirected, we can visit a vertex from both directions.

Example 7 Contig : GATCTGA Sequence of k-mers : Contig : AACCG Contig : AATGC GATATC TCTCTGTGA AAC ACC CCG AATATG TGC Sequence of k-mers : Sequence of k-mers : AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA Input: A set of unique k-mers and their corresponding extensions. The input corresponds to a de Bruijn graph. Example for k = 3: Output: A set of contigs or equivalently the connected components in the de Bruijn graph

Compact graph representation: hash table 8 The vertices are keys The edges (neighboring vertices) are represented with a two-letter value AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA bucketsentries key: ATC forw_ext: T back_ext: G              key: ACC forw_ext: G back_ext: A  key: AAC forw_ext: C back_ext: F  key: TGA forw_ext: F back_ext: C  key: GAT forw_ext: C back_ext: F  key: CTG forw_ext: A back_ext: T  key: TCT forw_ext: G back_ext: A  key: ATG forw_ext: C back_ext: A  key: AAT forw_ext: G back_ext: F  key: CCG forw_ext: F back_ext: A  key: TGC forw_ext: F back_ext: A 

Serial algorithm 9

Graph construction 10 The vertices are keys The edges (neighboring vertices) are represented with a two-letter value AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA bucketsentries key: ATC forw_ext: T back_ext: G              key: ACC forw_ext: G back_ext: A  key: AAC forw_ext: C back_ext: F  key: TGA forw_ext: F back_ext: C  key: GAT forw_ext: C back_ext: F  key: CTG forw_ext: A back_ext: T  key: TCT forw_ext: G back_ext: A  key: ATG forw_ext: C back_ext: A  key: AAT forw_ext: G back_ext: F  key: CCG forw_ext: F back_ext: A  key: TGC forw_ext: F back_ext: A 

Graph traversal 11 GATATC TCTCTGTGA AAC ACC CCG AATATG TGC AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA We pick a start vertex and we initiate a contig. Contig: A A T

Graph traversal 12 GATATC TCTCTGTGA AAC ACC CCG AATATG TGC AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA We add the forward extension to the contig. Contig: A A T G

Graph traversal 13 GATATC TCTCTGTGA AAC ACC CCG AATATG TGC AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA We take the last k bases of the contig and look them up in the hash table. Contig: A A T G

Graph traversal 14 GATATC TCTCTGTGA AAC ACC CCG AATATG TGC AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA We add the new forward extension to the contig. Contig: A A T G C

Graph traversal 15 GATATC TCTCTGTGA AAC ACC CCG AATATG TGC AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA We take the last k bases of the contig and look them up in the hash table. Contig: A A T G C

Graph traversal 16 GATATC TCTCTGTGA AAC ACC CCG AATATG TGC AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA We terminate the current contig since the forward extension is an “F”. Contig: A A T G C

17 Contig : GATCTGA Sequence of k-mers : Contig : AACCG Contig : AATGC GATATC TCTCTGTGA AAC ACC CCG AATATG TGC Sequence of k-mers : Sequence of k-mers : AACCF ATCTG ACCGA TGAFC GATCF AATGF ATGCA TCTGA CCGFA CTGAT TGCFA Graph traversal We iterate until we exhaust all start vertices: we have found all the contigs.

18 Parallelization hints 1.Distribute the hash table among the processors. UPC is convenient: Store the hash table in the shared address space. You may want to use upc_all_alloc(). 2.Each processor stores part of the input in the distributed hash table. What happens if two processors try to write the same bucket at the same time? We need to avoid race conditions ( UPC provides locks and global atomics). 3.We want to traverse the graph in parallel. Can we determine independent traversals by examining the input? How can we distribute the work among processors?