The DARPA Dynamic Programming Benchmark on a Reconfigurable Computer

Presentation transcript:

The DARPA Dynamic Programming Benchmark on a Reconfigurable Computer
Luis E. Cordova, Duncan A. Buell, and Sreesa Akella
Department of Computer Science and Engineering, University of South Carolina

Justification
High performance computing benchmarking: compare and improve the performance of reconfigurable computers against other supercomputers, distributed systems, and massively parallel processing machines.

Objectives
Design and benchmark a dynamic programming problem to:
- Understand the advantages of reconfigurable supercomputing machines
- Study and devise a methodology for optimal mapping of algorithms
- Study the scalability of algorithms with the size of the input and with the size of the system
- Explore the limitations of reconfigurable computers
- Justify and propose architectural performance improvements for important problems

Contributions of research
We have:
- Designed and implemented benchmark 2 of the DARPA HPCS Discrete Mathematics problems
- Developed a methodology to map similar algorithms to reconfigurable hardware platforms
- Explored architectures and their trade-offs in area, bandwidth, power consumption, and parallelism

[Figure: high-level view of the hardware platform and the dynamic programming architectures. The diagram shows the SRAM on-board-memory banks (OBM A through F), the two FPGA user logic chips (Chip 1 running the maximizing loop, Chip 2 the sequencing loop), and the multiplexer/adder/LUT/FIFO datapaths of the fully registered matrix architecture (transformation 1, column-wise reading) and the row-wise reading architecture (transformation 2), together with the maximizing loop and sequencing loop schemes.]

Reconfigurable Computing Methodology
The entire design is based on standard high-level programming languages, ANSI C or Fortran. There is a seamless path from the naive version of the algorithm, coded in C, to a version mapped to the specific SRC platform. The methodology is based on transformations of the initial architecture into architectures that better exploit the parallelism of the problem (a software illustration of the two-phase kernel and of the two reading orders is sketched below). The effective utilization of the hardware resources is assisted by the SRC high-level compiler, which aids in code debugging and in eliminating slowdowns in suboptimal architectures. Storing all the matrices on-chip yields the top performance; that architecture is detailed in the 3-D figure for the first matrix.

Architecture exploration
- Two sequencing architectures, optimized for area (a) and for memory bandwidth (b).
- Two maximizing architectures; the architecture reading in row-wise fashion (transformation 2) offers higher performance than transformation 1.
- The specification of the top-performing design is an ANSI C file.
- We are able to explore a large number of possible architectures providing different trade-offs between parallelism, economy of resources, and throughput.

Limitation
We find a need to automate higher-level compilation steps at the problem level; this step requires specialized or expert knowledge of the field of application being studied.
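The poster describes the benchmark as a two-phase dynamic programming kernel: a maximizing loop that fills a score matrix on one FPGA user chip and a sequencing (traceback) loop on the other. The exact DARPA HPCS recurrence is not reproduced in the transcript, so the following is only a minimal software sketch of such a two-phase kernel, assuming a generic max-based recurrence; the names (fill_scores, traceback, N, A, S) are illustrative and are not taken from the SRC sources.

```c
#include <stdio.h>

#define N 10   /* matrix dimension; the poster's diagram shows a 10x10 example */

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Phase 1 ("maximizing loop"): fill the score matrix S with a generic
   max-based recurrence over the cost matrix A.  The actual benchmark
   recurrence is not given in the transcript; this stand-in simply adds
   A[i][j] to the best of the three neighboring cells. */
static void fill_scores(const int A[N][N], int S[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int up   = (i > 0)          ? S[i-1][j]   : 0;
            int left = (j > 0)          ? S[i][j-1]   : 0;
            int diag = (i > 0 && j > 0) ? S[i-1][j-1] : 0;
            S[i][j] = A[i][j] + max3(up, left, diag);
        }
}

/* Phase 2 ("sequencing loop"): trace back from the bottom-right cell,
   always following the largest predecessor, and record the path. */
static int traceback(const int S[N][N], int path_i[2 * N], int path_j[2 * N])
{
    int i = N - 1, j = N - 1, len = 0;
    while (i > 0 || j > 0) {
        path_i[len] = i; path_j[len] = j; len++;
        int up   = (i > 0)          ? S[i-1][j]   : -1;
        int left = (j > 0)          ? S[i][j-1]   : -1;
        int diag = (i > 0 && j > 0) ? S[i-1][j-1] : -1;
        if (diag >= up && diag >= left) { i--; j--; }
        else if (up >= left)            { i--; }
        else                            { j--; }
    }
    path_i[len] = 0; path_j[len] = 0;
    return len + 1;
}

int main(void)
{
    int A[N][N], S[N][N], pi[2 * N], pj[2 * N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (i * j) % 7;          /* arbitrary sample costs */
    fill_scores(A, S);
    int len = traceback(S, pi, pj);
    printf("best score %d, path length %d\n", S[N-1][N-1], len);
    return 0;
}
```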
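The poster contrasts transformation 1 (column-wise reading, the fully registered matrix architecture) with transformation 2 (row-wise reading), and reports the row-wise version as the faster maximizing architecture. In software terms this difference is essentially a loop interchange plus a small row buffer; the sketch below (reusing N, max3, A, and S from the sketch above) only illustrates that reordering under those assumptions and is not the actual SRC Carte source, whose OBM bank declarations and pipelining directives are not shown in the transcript.

```c
/* Transformation 1 (column-wise reading): with the column index j
   outermost, consecutive reads of A walk down a column.  In hardware
   this corresponds to keeping the whole matrix in registers (the
   "fully registered matrix architecture") so any column is available
   every cycle. */
static void fill_column_wise(const int A[N][N], int S[N][N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            S[i][j] = A[i][j] + max3(i ? S[i-1][j] : 0,
                                     j ? S[i][j-1] : 0,
                                     (i && j) ? S[i-1][j-1] : 0);
}

/* Transformation 2 (row-wise reading): with the row index i outermost,
   each outer iteration streams one row of A, and only the previous row
   of S has to be kept next to the pipeline.  This is the reading order
   the poster reports as the higher-performance maximizing architecture. */
static void fill_row_wise(const int A[N][N], int S[N][N])
{
    int prev[N] = {0}, curr[N];
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int up   = (i > 0)          ? prev[j]   : 0;
            int left = (j > 0)          ? curr[j-1] : 0;
            int diag = (i > 0 && j > 0) ? prev[j-1] : 0;
            curr[j] = A[i][j] + max3(up, left, diag);
        }
        for (int j = 0; j < N; j++) {
            prev[j] = curr[j];              /* roll the row buffer */
            S[i][j] = curr[j];              /* keep the full matrix for the traceback phase */
        }
    }
}
```

Both functions produce the same score matrix; the point of the comparison is only that the row-wise order reduces how much state must stay close to the datapath, which matches the poster's claim that transformation 2 outperforms transformation 1.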