NETWORK-ON-CHIP HARDWARE ACCELERATORS FOR BIOLOGICAL SEQUENCE ALIGNMENT

NETWORK-ON-CHIP HARDWARE ACCELERATORS FOR BIOLOGICAL SEQUENCE ALIGNMENT
Authors: Souradip Sarkar, Gaurav Ramesh Kulkarni, Partha Pratim Pande, and Ananth Kalyanaraman
Publisher: IEEE Transactions on Computers, January 2010
Presenter: Chin-Chung Pan
Date: 2010/09/08

OUTLINE
- Introduction
- Algorithms for Sequence Alignment and Parallelization
  - The PP Approach
  - The AD Approach
- NoC Implementation
  - Mapping the Sequence Alignment Algorithms to NoC
  - NoC Switch Design
- Experimental Results

INTRODUCTION
- Propose optimized NoC architectures for different sequence alignment algorithms that were originally designed for distributed-memory parallel computers.
- Provide a thorough comparative evaluation of their respective performance and energy dissipation.
- The results show that our NoC-based implementations can provide above 10^2- to 10^3-fold speedup over other hardware accelerators and above 10^4-fold speedup over traditional CPU architectures.

ALGORITHMS FOR SEQUENCE ALIGNMENT AND PARALLELIZATION
We briefly outline the main ideas behind the two algorithms:
- The PP algorithm (parallel prefix)
- The AD algorithm (antidiagonal)
There are two main challenges in parallelizing dynamic programming (DP) algorithms for pairwise sequence alignment (PSA):
- Meeting the dependency constraint without affecting the parallel speedup during the forward pass (see the recurrence below).
- Computing the optimal retrace without storing the entire DP table during the forward pass.
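To make the dependency constraint concrete, the recurrence below is the standard global-alignment (Needleman-Wunsch-style) formulation with a linear gap penalty gamma; this scoring scheme is an illustrative assumption, not a reproduction of the paper's exact formulation. Every cell depends on its upper, left, and upper-left neighbors, which is what couples the computation across PEs.

```latex
T[i,j] \;=\; \max
\begin{cases}
T[i-1,\,j-1] + \sigma(s_1[i],\, s_2[j]) & \text{(match or mismatch)}\\
T[i-1,\,j] + \gamma & \text{(gap in } s_2\text{)}\\
T[i,\,j-1] + \gamma & \text{(gap in } s_1\text{)}
\end{cases}
```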

ALGORITHMS FOR SEQUENCE ALIGNMENT AND PARALLELIZATION
The PP approach
- The forward pass in table computation proceeds iteratively, row by row; all the PEs participate in computing row i in parallel.
- Each row requires an O(n/p) local computation followed by an O(log p) parallel-prefix communication (see the sketch below).
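The O(log p) term comes from a hypercube-style parallel prefix (scan) across the p PEs. The sketch below simulates such a scan over p PE-local values; it is a generic illustration of the communication pattern, not the paper's exact row recurrence, and the function name hypercube_prefix_scan is hypothetical.

```python
import math

def hypercube_prefix_scan(values, op=max):
    # Simulate an inclusive prefix scan (here a running max) across p PEs
    # arranged as a log2(p)-dimensional hypercube; p must be a power of two.
    p = len(values)
    assert p & (p - 1) == 0, "p must be a power of two"
    result = list(values)   # prefix result held by each PE
    total = list(values)    # combination of the subcube seen so far
    for d in range(int(math.log2(p))):
        # In round d, PE i exchanges its running total with PE i XOR 2^d.
        incoming = [total[i ^ (1 << d)] for i in range(p)]
        for i in range(p):
            if (i ^ (1 << d)) < i:           # partner lies "before" this PE
                result[i] = op(result[i], incoming[i])
            total[i] = op(total[i], incoming[i])
    return result

print(hypercube_prefix_scan([3, 1, 4, 1, 5, 9, 2, 6]))
# -> [3, 3, 4, 4, 5, 9, 9, 9] after log2(8) = 3 communication rounds
```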

ALGORITHMS FOR SEQUENCE ALIGNMENT AND PARALLELIZATION The PP approach

ALGORITHMS FOR SEQUENCE ALIGNMENT AND PARALLELIZATION
The AD algorithm
- This approach requires that both input sequences s1 and s2 are made available to all the PEs.
- For the forward pass, the algorithm proceeds iteratively, computing one antidiagonal of the DP table at each "time step" (illustrated below).
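All cells on the same antidiagonal (i + j = constant) depend only on the two previous antidiagonals, so they can be computed concurrently, one cell per PE. The sketch below is a minimal serial illustration of that iteration order, using the same assumed linear-gap scoring as above; it is not the paper's implementation.

```python
def antidiagonal_fill(s1, s2, match=1, mismatch=-1, gap=-1):
    # Fill the DP table one antidiagonal at a time (illustrative scoring).
    # Cells within the inner loop are mutually independent, which is what
    # the AD approach exploits by assigning them to different PEs.
    m, n = len(s1), len(s2)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        T[i][0] = i * gap
    for j in range(n + 1):
        T[0][j] = j * gap
    for d in range(2, m + n + 1):                      # antidiagonal i + j = d
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + s,
                          T[i - 1][j] + gap,
                          T[i][j - 1] + gap)
    return T

print(antidiagonal_fill("GATTACA", "GCATGCU")[-1][-1])  # optimal global score
```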

NOC IMPLEMENTATION MAPPING THE SEQUENCE ALIGNMENT ALGORITHMS TO NOC
The PP Implementation
- While there are multiple NoC architectures, the hypercubic topology is best suited for the parallel prefix operation.
- However, a physical realization of the NoC is limited by the layout dimensions of the chip, which are predominantly 2D in practice, so the hypercube has to be mapped onto a 2D mesh layout.

NOC IMPLEMENTATION MAPPING THE SEQUENCE ALIGNMENT ALGORITHMS TO NOC
The PP Implementation
- It is not practical to implement an arbitrarily high-dimensional hypercube directly in the 2D layout.
- A well-defined property of the communication pattern in the parallel prefix algorithm is that PEs do not communicate arbitrarily: each PE only ever talks to its log p hypercubic neighbors (see the sketch below).
- The forward pass is followed by a backward pass operation. This step is implemented using p-1 neighbor-PE communication exchanges, as the PEs regenerate the optimal path from cell [m, n] back to [0, 0].
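A quick way to see which links the layout must provide is to enumerate the hypercube neighbor pairs and measure their distance on a row-major 2D mesh. The sketch below does exactly that; the function name and the 4x4 example are illustrative assumptions, and the links with distance greater than one are the nonadjacent pairs that the paper's switch bridges with a direct bypass path.

```python
import math

def hypercube_links_on_mesh(p, cols):
    # Enumerate hypercube neighbor pairs (i, j = i XOR 2^d) for p PEs laid
    # out row-major on a 2D mesh with 'cols' columns, together with the
    # Manhattan distance each pair spans on that mesh.
    links = []
    for i in range(p):
        for d in range(int(math.log2(p))):
            j = i ^ (1 << d)
            if j > i:   # report each undirected link once
                dist = abs(i // cols - j // cols) + abs(i % cols - j % cols)
                links.append((i, j, dist))
    return links

# 16 PEs on a 4x4 mesh: some hypercube links fall on mesh neighbors
# (distance 1), others span two hops and need the single-hop bypass.
links = hypercube_links_on_mesh(16, 4)
long_links = [l for l in links if l[2] > 1]
print(f"{len(long_links)} of {len(links)} hypercube links are nonadjacent on the mesh")
```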

NOC IMPLEMENTATION MAPPING THE SEQUENCE ALIGNMENT ALGORITHMS TO NOC
The AD Implementation
- In our implementation, we achieve this by embedding a ring into the mesh, following the Moore space-filling curve (see the sketch below).
- Note that, unlike the PP approach, the communication volume during the forward pass is independent of the number of PEs.
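The point of the ring embedding is that logically consecutive PEs are also physically adjacent on the mesh, so each antidiagonal step only needs nearest-neighbor hops. The sketch below builds a simple serpentine Hamiltonian cycle with the same property; it is a simplified stand-in for illustration only, since the paper embeds the ring along a Moore space-filling curve rather than this construction.

```python
def serpentine_ring(rows, cols):
    # Build a Hamiltonian cycle over a rows x cols mesh using only mesh
    # edges, so PE k can pass data to PE (k + 1) mod p in a single hop.
    # Simplified serpentine construction (needs an even number of rows);
    # the paper itself follows a Moore space-filling curve instead.
    assert rows % 2 == 0 and cols >= 2
    cycle = [(0, c) for c in range(cols)]              # walk along the top row
    for r in range(1, rows):                           # snake through cols 1..cols-1
        cs = range(cols - 1, 0, -1) if r % 2 == 1 else range(1, cols)
        cycle += [(r, c) for c in cs]
    cycle += [(r, 0) for r in range(rows - 1, 0, -1)]  # return up column 0
    return cycle

ring = serpentine_ring(4, 4)
# every consecutive pair (and the wrap-around pair) is a mesh neighbor
assert all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
           for a, b in zip(ring, ring[1:] + ring[:1]))
print(ring)
```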

NOC IMPLEMENTATION MAPPING THE SEQUENCE ALIGNMENT ALGORITHMS TO NOC
The AD Implementation
- In our backward pass implementation, we partition the processors into two subgroups; the number of processors in each subgroup depends on the number of cells in the corresponding partition of the DP table.
- The processor grouping requires a broadcast operation to propagate the new partitioning cell coordinate to all the PEs, which takes O(log p) time, where p is the number of PEs (see the sketch below).
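The O(log p) bound follows from a recursive-doubling broadcast: the set of PEs holding the coordinate doubles at every step. The sketch below simulates that pattern; the function name and the simulation style are illustrative assumptions rather than the paper's hardware implementation.

```python
import math

def hypercube_broadcast(p, root, payload):
    # Simulate a recursive-doubling broadcast of 'payload' from PE 'root'
    # to all p PEs (p a power of two) in log2(p) communication steps.
    holders = {root: payload}
    for d in range(int(math.log2(p))):
        # every PE that already has the data forwards it across dimension d
        for i in list(holders):
            holders.setdefault(i ^ (1 << d), holders[i])
    return holders

# after log2(64) = 6 steps, all 64 PEs know the new partitioning coordinate
assert len(hypercube_broadcast(64, root=5, payload=(17, 42))) == 64
```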

NOC IMPLEMENTATION NOC SWITCH DESIGN
- To reduce the delay, instead of building a multihop, pipelined communication link between two nonadjacent PEs, the switch boxes are designed to establish a direct, single-hop (unpipelined) communication path between the PEs.

NOC IMPLEMENTATION NOC SWITCH DESIGN To exchange data between PE1 and PE2, the following pass transistors will be on: Miph1 and Mh1 of the switch connected to PE1, and Miph1 of the switch connected to PE2.

NOC IMPLEMENTATION NOC SWITCH DESIGN

EXPERIMENTAL RESULTS PIPELINED VERSUS UNPIPELINED IMPLEMENTATION
We considered two types of NoC implementations:
- A "pipelined" communication scheme, where all nonadjacent PEs communicate in multiple hops, step by step.
- An "unpipelined" communication scheme, where nonadjacent hypercubic neighboring PEs communicate in a single hop with the help of the bypass switch.
(Figure: pipelined multihop vs. unpipelined single-hop communication.)

EXPERIMENTAL RESULTS PIPELINED VERSUS UNPIPELINED IMPLEMENTATION

EXPERIMENTAL RESULTS ENERGY AND TIMING CHARACTERISTICS OF THE PP APPROACH

EXPERIMENTAL RESULTS ENERGY AND TIMING CHARACTERISTICS OF THE AD APPROACH

EXPERIMENTAL RESULTS COMPARATIVE EVALUATION OF PP AND AD SCHEMES

EXPERIMENTAL RESULTS IMPACT OF VARYING STRING SIZES
- To allow a fair comparison, we varied the string sizes while keeping the total work (i.e., the number of cells in the DP table) the same.
- As the number of rows is decreased and/or the number of columns is increased, the total time and energy both decrease. Intuitively, in the row-wise PP scheme the O(log p) parallel-prefix overhead is paid once per row, so with the m x n product fixed, fewer rows mean less accumulated overhead (see the model sketch below).
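The sketch below evaluates a first-order step count m * (n/p + log2 p) for the PP forward pass under a fixed m x n work budget. The unit costs and the example sizes are purely illustrative assumptions, not the paper's measured cycle or energy numbers, but the trend matches the observation above.

```python
import math

def pp_forward_steps(m, n, p):
    # First-order step count for the PP forward pass: m rows, each with an
    # O(n/p) local phase and an O(log p) parallel-prefix phase.
    # Unit costs are assumed purely for illustration.
    return m * (math.ceil(n / p) + math.ceil(math.log2(p)))

# fixed total work m * n = 2**20 cells on p = 64 PEs
for m, n in [(4096, 256), (1024, 1024), (256, 4096)]:
    print(f"{m:5d} x {n:5d}: {pp_forward_steps(m, n, p=64):6d} steps")
```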

EXPERIMENTAL RESULTS PARALLEL PREFIX IMPLEMENTATION ON PROTEIN SEQUENCE DATA

EXPERIMENTAL RESULTS DISCUSSION