Multithreaded FPGA Acceleration of DNA Sequence Mapping Edward Fernandez, Walid Najjar, Stefano Lonardi, Jason Villarreal UC Riverside, Department of Computer.

Slides:



Advertisements
Similar presentations
Multi-dimensional Packet Classification on FPGA: 100Gbps and Beyond
Advertisements

Instructor: Sazid Zaman Khan Lecturer, Department of Computer Science and Engineering, IIUC.
Authors: Raphael Polig, Kubilay Atasu, and Christoph Hagleitner Publisher: FPL, 2013 Presenter: Chia-Yi, Chu Date: 2013/10/30 1.
Sequence Alignment in DNA Under the Guidance of : Prof. Kolin Paul Presented By: Lalchand Gaurav Jain.
A Memory-Efficient Reconfigurable Aho-Corasick FSM Implementation for Intrusion Detection Systems Authors: Seongwook Youn and Dennis McLeod Presenter:
Using Cell Processors for Intrusion Detection through Regular Expression Matching with Speculation Author: C˘at˘alin Radu, C˘at˘alin Leordeanu, Valentin.
Computes the partial dot products for only the diagonal and upper triangle of the input matrix. The vector computed by this architecture is added to the.
Revisiting a slide from the syllabus: CS 525 will cover Parallel and distributed computing architectures – Shared memory processors – Distributed memory.
1 Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1)
1 ReCPU:a Parallel and Pipelined Architecture for Regular Expression Matching Department of Computer Science and Information Engineering National Cheng.
Efficient IP-Address Lookup with a Shared Forwarding Table for Multiple Virtual Routers Author: Jing Fu, Jennifer Rexford Publisher: ACM CoNEXT 2008 Presenter:
Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.
Whole Genome Alignment using Multithreaded Parallel Implementation Hyma S Murthy CMSC 838 Presentation.
Hardware Basics: Inside the Box 2  2001 Prentice Hall2.2 Chapter Outline “There is no invention – only discovery.” Thomas J. Watson, Sr. What Computers.
1 Multi-Core Architecture on FPGA for Large Dictionary String Matching Department of Computer Science and Information Engineering National Cheng Kung University,
1 Improving Hash Join Performance through Prefetching _________________________________________________By SHIMIN CHEN Intel Research Pittsburgh ANASTASSIA.
Chapter 1 and 2 Computer System and Operating System Overview
GCSE Computing - The CPU
Gnort: High Performance Intrusion Detection Using Graphics Processors Giorgos Vasiliadis, Spiros Antonatos, Michalis Polychronakis, Evangelos Markatos,
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg Center.
Field Programmable Gate Array (FPGA) Layout An FPGA consists of a large array of Configurable Logic Blocks (CLBs) - typically 1,000 to 8,000 CLBs per chip.
Presenter MaxAcademy Lecture Series – V1.0, September 2011 Introduction and Motivation.
Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen.
COLLABORATIVE EXECUTION ENVIRONMENT FOR HETEROGENEOUS PARALLEL SYSTEMS Aleksandar Ili´c, Leonel Sousa 2010 IEEE International Symposium on Parallel & Distributed.
ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.
Chapter Two Hardware Basics: Inside the Box. ©1999 Addison Wesley Longman2.2 Chapter Outline What Computers Do A Bit About Bits The Computer’s Core: CPU.
1 Hardware Support for Collective Memory Transfers in Stencil Computations George Michelogiannakis, John Shalf Computer Architecture Laboratory Lawrence.
1 Chapter 04 Authors: John Hennessy & David Patterson.
PAGE: A Framework for Easy Parallelization of Genomic Applications 1 Mucahid Kutlu Gagan Agrawal Department of Computer Science and Engineering The Ohio.
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
1 of 23 Fouts MAPLD 2005/C117 Synthesis of False Target Radar Images Using a Reconfigurable Computer Dr. Douglas J. Fouts LT Kendrick R. Macklin Daniel.
MIDeA :A Multi-Parallel Instrusion Detection Architecture Author: Giorgos Vasiliadis, Michalis Polychronakis,Sotiris Ioannidis Publisher: CCS’11, October.
PERFORMANCE ANALYSIS cont. End-to-End Speedup  Execution time includes communication costs between FPGA and host machine  FPGA consistently outperforms.
High Performance Computing Processors Felix Noble Mirayma V. Rodriguez Agnes Velez Electric and Computer Engineer Department August 25, 2004.
Parallelization and Characterization of Pattern Matching using GPUs Author: Giorgos Vasiliadis 、 Michalis Polychronakis 、 Sotiris Ioannidis Publisher:
Lecture 16: Reconfigurable Computing Applications November 3, 2004 ECE 697F Reconfigurable Computing Lecture 16 Reconfigurable Computing Applications.
Hyper Threading Technology. Introduction Hyper-threading is a technology developed by Intel Corporation for it’s Xeon processors with a 533 MHz system.
Task Graph Scheduling for RTR Paper Review By Gregor Scott.
Precomputation- based Prefetching By James Schatz and Bashar Gharaibeh.
1 Copyright © 2010, Elsevier Inc. All rights Reserved Chapter 2 Parallel Hardware and Parallel Software An Introduction to Parallel Programming Peter Pacheco.
Layali Rashid, Wessam M. Hassanein, and Moustafa A. Hammad*
Lecture on Central Process Unit (CPU)
Exploiting Multithreaded Architectures to Improve Data Management Operations Layali Rashid The Advanced Computer Architecture U of C (ACAG) Department.
Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.
Sunpyo Hong, Hyesoon Kim
Chapter 11 System Performance Enhancement. Basic Operation of a Computer l Program is loaded into memory l Instruction is fetched from memory l Operands.
A Scalable Pipelined Associative SIMD Array With Reconfigurable PE Interconnection Network For Embedded Applications Hong Wang & Robert A. Walker Computer.
Niagara: A 32-Way Multithreaded Sparc Processor Kongetira, Aingaran, Olukotun Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing.
String Matching in Hardware using the FM-Index Author: Edward Fernandez, Walid Najjar and Stefano Lonardi Publisher: FCCM,2011 Presenter: Jia-Wei,You Date:
1 Computer System Overview Chapter 1. 2 Operating System Exploits the hardware resources of one or more processors Provides a set of services to system.
CPU Central Processing Unit
GCSE Computing - The CPU
GCSE OCR Computing A451 The CPU Computing hardware 1.
NFV Compute Acceleration APIs and Evaluation
HISTORY OF MICROPROCESSORS
Ildar Absalyamov, PhD Candidate
Assembly Language for Intel-Based Computers, 5th Edition
Genomic Data Clustering on FPGAs for Compression
Core i7 micro-processor
Architecture Background
CPU Central Processing Unit
Regular Expression Acceleration at Multiple Tens of Gb/s
Operating Systems Chapter 5: Input/Output Management
Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform
CSC2431 February 3rd 2010 Alecia Fowler
Types of Computers Mainframe/Server
Scalable Memory-Less Architecture for String Matching With FPGAs
GCSE Computing - The CPU
Run time performance for all benchmarked software.
ARM920T Processor This training module provides an introduction to the ARM920T processor embedded in the AT91RM9200 microcontroller.We’ll identify the.
Presentation transcript:

Multithreaded FPGA Acceleration of DNA Sequence Mapping Edward Fernandez, Walid Najjar, Stefano Lonardi, Jason Villarreal UC Riverside, Department of Computer Science and Engineering Jacquard Computing

Introduction Multithreaded architectures masks long memory latencies by context switching threads. FPGA provides a platform for hardware acceleration of multithreaded architectures targeting a specific application Our target application for this research is DNA sequence matching

Introduction FHAST (FPGA Hardware Accelerated Sequencing Tool) implements a heuristic based on the FM-Index string matching algorithm FHAST is implemented on Convey Computer HC-1 which can be used as a drop-in replacement for the Bowtie sequencing tool Speed up of FHAST compared to Bowtie ranges from 7x to 70x dependent on allowed number of mismatches

Presentation Outline FM-Index String Matching Algorithm Hardware Architecture Exact String Matching Architecture Approximate String Matching Architecture Implementation Results Evaluation Conclusion

FM-Index String Matching Algorithm The FM-index operates on the Burrows- Wheeler Transform of the text The top and bottom pointers of the FM-index indicate a matching pattern on the text for every character of the pattern processed Top >= Bottom: pattern does not exist Top < Bottom: pattern exists

FM-Index String Matching Algorithm CTTTACAG$AGCGTACTTTACAG$AGCGTA Search pattern: T A G G Text: GCTAATTAGGTACC$ Search pattern: C C G A CTTTACAG$AGCGTACTTTACAG$AGCGTA Top Bottom Top Bottom PATTERN EXISTS PATTERN DOES NOT EXISTS

FM-Index String Matching Algorithm Limited block RAM available on the FPGA for storing the Burrows-Wheeler Transform (BWT) of an extremely long text Utilization of external memory to store BWT of the text Exploitation of multiple threads to masks long latencies because of memory access.

Multithreading Memory Threads wait in queues while waiting for memory to return required data Multiple threads are processed to achieve parallelism and faster execution Ready Threads Waiting Threads

Exact String Matching Architecture Receive SendLocate Update C-Table (External) Fetch Pattern source (External) Output file (External) A thread represents a pattern in the queue of the components. Multiple threads are processed to hide latencies due to memory access.

Approximate String Matching Architecture Pattern source (External) Engine 0 pattern Engine 1 ….…. Engine n C-Table (External) Locate Block A miss on Engine n creates three new threads and passed to Engine n+1 Each new thread on the succeeding thread replaces the failing base pair with the other three base pairs

Implementation Coprocessor call (Hardware) Setup registers Setup input table Allocate memory Setup output table Report output table C- Table/Suffix Array (External) Pattern source (External) (Software) Software: Performs memory allocation for reading the C tables, the suffix arrays, the reads and writing the results to external memory. Hardware: Executes the search algorithm Convey HC-1 hybrid core: Dual core Intel Xeon processor 2.13 GHz Four Xilinx Virtex 5 FPGAs as coprocessor with eight memory controller supporting peak bandwidth of 80 GB/s at 150 MHz.

Experimental Setup 18 million unique reads with 101 base pairs on Chromosome 14 of the human genome (length of 107 million base pairs) Bowtie is executed in the two following setup CPU1CPU2 Processor TypeXeon L540BXeon E5520 # of cores2 dual cores2 quad cores Memory size192 GB24 GB Cache size6 MB8 MB Frequency2.13 GHz2.27 GHz

Execution Time Longer execution time of Bowtie running in both CPUs compared to FHAST Simultaneous searching of the reads in three engines of FHAST results to no significant difference in execution time for difference mismatches MismatchFHASTBowtie CPU1CPU

Speed Up Highest speed up (70x) is achieved in detecting two mismatches where execution time of Bowtie is longest

Conclusion We demonstrated a multithreaded approach using FPGAs to accelerate execution of DNA sequence matching We compared FHAST execution time to Bowtie which is a widely used tool for sequencing reads and showed an actual performance improvement up to 70x

Thanks for Listening

Back up slides

Performance Improvement New read Update block read Old read RAM Old pointerNew pointer pointer The memory address are pre-calculated and stored on a RAM for all character combinations up to a specific length such that each combination of characters represent a range. Instead of initializing the address to the first and last rows of the C-table, we initialize the top and bottom pointers to the pre-calculated values

Replace Block Replace: The replace block creates three copies of the failing read and replaces the failing character with the other base pairs. The update block of Engine 1 accepts reads from replace block instead of the fetch block. Read in Engine N Read in Engine N-1 Update block Replace block readpointer Pointer of Read in Engine N-1 Pointer of Read in Engine N

Scaling Factor Scaling factor defined as execution time on a single device divided by the number of devices Results show that FHAST scales better if more FPGAs are used for searching