Allen Michalski
CSE Department – Reconfigurable Computing Lab, University of South Carolina
MAPLD 2005 / Paper 253

Microprocessors with FPGAs: Implementation and Workload Partitioning of the DARPA HPCS Integer Sort Benchmark within the SRC-6e Reconfigurable Computer

Page 2 – Outline
- Reconfigurable Computing – Introduction
  - SRC-6e architecture, programming model
- Sorting Algorithms
  - Design guidelines
- Testing Procedures, Results
- Conclusions, Future Work
  - Lessons learned

Page 3 – What is a Reconfigurable Computer?
Combination of:
- Microprocessor workstation for front-end processing
- FPGA back end for specialized coprocessing
- Typical PC bus for communications

Page 4 – What is a Reconfigurable Computer?
PC characteristics:
- High clock speed
- Superscalar, pipelined
- Out-of-order issue
- Speculative execution
- Programmed in a high-level language
FPGA characteristics:
- Low clock speed
- Large number of configurable elements: LUTs, Block RAMs, CPAs, multipliers
- Programmed in an HDL

Page 5 – What is the SRC-6e?
- SRC = Seymour R. Cray
- A reconfigurable computer with a high-throughput memory interface
  - 1,415 MB/s for SNAP writes, 1,280 MB/s for SNAP reads
  - For comparison, PCI-X (1.0) peaks at roughly 1.06 GB/s

Page 6 – SRC-6e Development
- Programming does not require knowledge of HW design
  - C code can compile to hardware (a minimal sketch follows)
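To make the programming model concrete, here is a minimal sketch of a MAP C routine in the style of the parsort_test excerpt on page 12. The OBM_BANK_A/B, DMA_CPU, MAP_OBM_stripe and wait_DMA calls mirror that excerpt; the function itself, the header name, and the reverse-direction constant OBM2CM (assumed by symmetry with CM2OBM) are illustrative assumptions, not taken from the deck.

#include <libmap.h>                     /* assumed SRC Carte header */

/* Hypothetical MAP routine: DMA an array into OnBoard Memory,
   run it through a pipelined loop, and DMA the results back. */
void scale_test(int n, uint64_t data_in[], uint64_t data_out[], int mapno)
{
    OBM_BANK_A (a, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_B (b, uint64_t, MAX_OBM_SIZE)
    int i;

    /* pull the input from common memory into OBM bank A */
    DMA_CPU(CM2OBM, a, MAP_OBM_stripe(1, "A"), data_in, 1, n*8, 0);
    wait_DMA(0);

    /* the compiler pipelines this loop into hardware */
    for (i = 0; i < n; i++)
        b[i] = a[i] << 1;               /* trivial stand-in computation */

    /* push the results from OBM bank B back to common memory */
    DMA_CPU(OBM2CM, b, MAP_OBM_stripe(1, "B"), data_out, 1, n*8, 0);
    wait_DMA(0);
}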

Page 7 – SRC Design Objectives
FPGA considerations:
- Superscalar design
- Parallel, pipelined execution
SRC considerations:
- High overall data throughput
  - Streaming versus non-streaming data transfer?
- Reduction of FPGA data-processing stalls due to data dependencies and data read/write delays
  - FPGA Block RAM versus SRC OnBoard Memory?
- Evaluate software/hardware partitioning
  - Algorithm partitioning
  - Data size partitioning

Page 8 – Sorting Algorithms
Traditional algorithms:
- Comparison sorts: Θ(n lg n) best case
  - Insertion sort, merge sort, heapsort, quicksort
- Counting sorts
  - Radix sort: Θ(d(n+k)) (a sketch of one radix pass follows)
HPCS FORTRAN code baseline:
- Radix sort in combination with heapsort
- This research focuses on 128-bit operands, which simplify SRC data transfer and management
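For reference, a plain-software C sketch of the counting-sort pass that LSD radix sort repeats once per digit, which is where the Θ(d(n+k)) bound comes from. It assumes the deck's "radix 8" means 8-bit digits (k = 256 buckets) and uses 64-bit keys rather than the benchmark's 128-bit operands; names are illustrative, not from the HPCS baseline.

#include <stdint.h>
#include <stdlib.h>

/* One stable counting-sort pass over the 8-bit digit at byte
   position 'byte' (0 = least significant). d = 8 passes for
   64-bit keys gives Theta(d(n + k)) with k = 256. */
static void radix_pass(const uint64_t *src, uint64_t *dst, size_t n, int byte)
{
    size_t count[256] = {0};
    size_t i, sum = 0;

    for (i = 0; i < n; i++)                         /* histogram */
        count[(src[i] >> (8 * byte)) & 0xFF]++;
    for (i = 0; i < 256; i++) {                     /* bucket offsets */
        size_t c = count[i];
        count[i] = sum;
        sum += c;
    }
    for (i = 0; i < n; i++)                         /* stable scatter */
        dst[count[(src[i] >> (8 * byte)) & 0xFF]++] = src[i];
}

void radix_sort64(uint64_t *keys, size_t n)
{
    uint64_t *tmp = malloc(n * sizeof *tmp);
    int byte;
    for (byte = 0; byte < 8; byte += 2) {           /* ping-pong buffers */
        radix_pass(keys, tmp, n, byte);
        radix_pass(tmp, keys, n, byte + 1);
    }
    free(tmp);
}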

Page 9 – Sorting: SRC FPGA Implementation
Memory constraints:
- SRC OnBoard Memory: 6 banks × 4 MB; pipelined read or write access; 5-clock latency
- FPGA BRAM: 144 blocks of 18 Kbit each; 1-clock read and write latency
Initial choices:
- Parallel insertion sort (bubblesort)
  - Produces sorted blocks
  - Uses OnBoard Memory's pipelined processing to minimize data-access stalls
- Parallel heapsort
  - Random-access merge of sorted lists
  - Uses BRAM for low-latency access; good for random data access

Page 10 – Parallel Insertion Sort (BubbleSort)
Systolic array of cells:
- Pipelined SRC processing from OnBoard Memory
- Each cell keeps the highest value and passes the other values on
- Latency is 2× the number of cells

Page 11 – Parallel Insertion Sort (BubbleSort)
Systolic array of cells:
- Results are passed out in reverse order of comparison
- N = number of comparator cells
- Sorts a list of size L completely in Θ(L²)
- Limiting the sort size to some number a < L creates multiple sorted lists, each sorted in Θ(a)
(A software model of one cell follows.)
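A software model of one comparator cell may clarify the systolic behavior the deck describes: each cell retains the largest key it has seen and forwards everything else downstream, so a chain of cells emits a sorted block. This is an illustrative C sketch, not the deck's hardware design; cell_t and cell_step are hypothetical names.

#include <stdint.h>

/* One systolic insertion-sort cell: holds the largest key seen so far,
   forwards the smaller of (held, incoming) on each step. */
typedef struct {
    uint64_t held;      /* largest key captured by this cell */
    int      occupied;  /* has the cell captured a key yet? */
} cell_t;

/* One clock step: returns the forwarded value; *valid is 0 while the
   cell is still empty and nothing is passed downstream. */
uint64_t cell_step(cell_t *c, uint64_t in, int *valid)
{
    if (!c->occupied) {
        c->held = in;               /* first arrival is simply captured */
        c->occupied = 1;
        *valid = 0;
        return 0;
    }
    *valid = 1;
    if (in > c->held) {             /* keep the larger, pass the smaller */
        uint64_t out = c->held;
        c->held = in;
        return out;
    }
    return in;
}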

Page 12 – Parallel Insertion Sort (BubbleSort): MAP C source (excerpt)

#include <libmap.h>   /* header name lost in the transcript; libmap.h assumed */

void parsort_test(int arraysize, int sortsize, int transfer,
                  uint64_t datahigh_in[], uint64_t datalow_in[],
                  uint64_t datahigh_out[], uint64_t datalow_out[],
                  int64_t *start_transferin, int64_t *start_loop,
                  int64_t *start_transferout, int64_t *end_transfer,
                  int mapno)
{
    OBM_BANK_A (a, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_B (b, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_C (c, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_D (d, uint64_t, MAX_OBM_SIZE)

    DMA_CPU(CM2OBM, a, MAP_OBM_stripe(1, "A"), datahigh_in, 1, arraysize*8, 0);
    wait_DMA(0);
    /* ... declarations and remaining DMA setup elided on the slide ... */

    while (arrayindex < arraysize) {
        endarrayindex = arrayindex + sortsize - 1;
        if (endarrayindex > arraysize - 1)
            endarrayindex = arraysize - 1;
        while (arrayindex < endarrayindex) {
            for (i = arrayindex; i <= endarrayindex; i++) {
                data_high_in = a[i];
                data_low_in  = b[i];
                parsort(i == endarrayindex, data_high_in, data_low_in,
                        &data_high_out, &data_low_out);
                c[i] = data_high_out;
                d[i] = data_low_out;
                /* ... remainder of the excerpt truncated on the slide ... */

Page 13 – Parallel Heapsort
Tree structure of cells:
- Asynchronous operation with acknowledged data transfer
- Merges sorted lists in Θ(n lg n)
- Designed for independent BRAM block accesses
(A software sketch of the merge follows.)
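The merge that the cell tree performs can be sketched in ordinary C as a heap-driven k-way merge of sorted lists; the hardware distributes this across asynchronous cells with acknowledged transfers, but the data movement is analogous. Names here (stream_t, kway_merge) are hypothetical, and this software analogue stands in for, rather than reproduces, the deck's design.

#include <stdint.h>
#include <stddef.h>

typedef struct {
    const uint64_t *list;   /* one sorted input list */
    size_t len, pos;        /* length and current read position */
} stream_t;

/* Restore the min-heap property at index i (keyed by each stream's head). */
static void sift_down(stream_t **h, size_t k, size_t i)
{
    for (;;) {
        size_t l = 2*i + 1, r = l + 1, m = i;
        if (l < k && h[l]->list[h[l]->pos] < h[m]->list[h[m]->pos]) m = l;
        if (r < k && h[r]->list[h[r]->pos] < h[m]->list[h[m]->pos]) m = r;
        if (m == i) return;
        stream_t *t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

/* Merge k non-empty sorted lists into out[]; out must hold all elements. */
void kway_merge(stream_t **heap, size_t k, uint64_t *out)
{
    size_t i, n = 0;
    for (i = k; i-- > 0; )
        sift_down(heap, k, i);              /* heapify on the head keys */
    while (k > 0) {
        stream_t *s = heap[0];
        out[n++] = s->list[s->pos++];       /* emit the global minimum */
        if (s->pos == s->len)
            heap[0] = heap[--k];            /* stream exhausted: shrink */
        sift_down(heap, k, 0);
    }
}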

Page 14 – Parallel Heapsort
BRAM limitations:
- 144 blocks × 18 Kbit ≈ 20K 128-bit values – not a whole lot
OnBoard Memory:
- SRC constraint: up to 64 reads and 8 writes in one MAP C file
- Cascading clock delays as the number of reads increases
- Explore MUXed access: search and update only 6 of the 48 leaf nodes at a time, in round-robin fashion (sketched below)
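The round-robin multiplexing idea can be sketched as follows: only a window of 6 of the 48 leaf nodes is examined per iteration, which bounds the number of concurrent OBM reads. This is an assumption-laden illustration; leaf_ready and service_leaf are hypothetical hooks standing in for the actual leaf logic.

#define NUM_LEAVES 48
#define MUX_WIDTH   6   /* leaves serviced per round-robin step */

extern int  leaf_ready(int leaf);    /* hypothetical: leaf has data? */
extern void service_leaf(int leaf);  /* hypothetical: search/update leaf */

/* Each call services the next window of MUX_WIDTH leaves, so at most
   MUX_WIDTH leaf-memory accesses are issued per step. */
void service_leaves_round_robin(void)
{
    static int window = 0;                  /* persists across calls */
    int i, base = window * MUX_WIDTH;

    for (i = 0; i < MUX_WIDTH; i++) {
        if (leaf_ready(base + i))
            service_leaf(base + i);
    }
    window = (window + 1) % (NUM_LEAVES / MUX_WIDTH);
}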

Page 15 – FPGA Initial Results
Baseline: one V26000 (Xilinx Virtex-II 6000)
- PAR options: -ol high -t 1
Bubblesort results – 100 cells:
- 29,354 slices (86%)
- 37,131 LUTs (54%)
- 13.7 ns = 73 MHz (verified operational at 100 MHz)
Heapsort results – 95 cells (48 leaves):
- 21,011 slices (62%)
- 24,467 LUTs (36%)
- 11.8 ns = 85 MHz (verified operational at 100 MHz)

Page 16 – Testing Procedures
- All tests utilize one chip for baseline results
- Evaluate the fastest software radix of operation
- Hardware/software partitioning
  - Five cases; Case 5 utilizes FPGA reconfiguration
- Data size partitioning – list size values of 100, 500, 1000, 5000, 10000
- 10 runs for each test case/data partitioning combination

Page 17 – Results
Fastest software operations (baseline):
- Comparison of radixsort and heapsort combinations
  - Radix 4, 8 and 16 evaluated
  - Minimum time: radix-8 radixsort + heapsort (size = 5000 or 10000)
- Radix-16 has too many buckets for the sort-size partitions evaluated
- Heapsort comparisons are faster than radixsort index updates

Page 18 – Results
- Fastest SW-only time = 3.41 sec; fastest time including HW = 3.89 sec
  - Bubblesort (HW) + heapsort (SW)
  - Partition list size of 1000
- Heapsort times are dominated by data access and significantly slower than software

Page 19 – Results: Bubblesort vs. Radixsort
- Some cases where HW is faster than SW
  - List sizes < 5000
  - SRC pipelined data access
- Fastest SW case was for list size =
- MAP data transfer time is less significant than data processing time
  - For size = 1000: input 11.3%, analyze 76.9%, output 11.5%

Page 20 – Results: Limitations
- Heapsort is limited by the overhead of input servicing
  - Random accesses of OBM are not ideal
  - Overhead of loop search and sequentially dependent processing
- Bubblesort is limited by the number of cells
  - Could be increased by approximately 13 cells
  - Two-chip streaming
- Reconfiguration time is assumed to be a one-time setup factor
  - Exception: the reconfiguration case – solved by having a core per V26000

Page 21 – Conclusions
- Pipelined, systolic designs are needed to overcome the speed advantage of the microprocessor
  - Bubblesort works well on small data sets
  - Heapsort's random data access cannot exploit SRC benefits
- SRC's high-throughput data transfer and high-level data abstraction provide a good framework for implementing systolic designs

Page 22 – Future Work
- Heapsort's random data access cannot exploit SRC benefits
  - Look for possible speedups using BRAM?
  - Unroll leaf memory access
  - Exploit the SRC "periodic macro" paradigm
- Currently evaluating radix sort in hardware
  - Works better than bubblesort for larger sort sizes
- Compare MAP C to VHDL when baseline VHDL is faster than SW