Exploring the Design Space of LUT-based Transparent Accelerators

Presentation transcript:

Slide 1: Exploring the Design Space of LUT-based Transparent Accelerators
University of Michigan, Electrical Engineering and Computer Science
Sami Yehia*, Nathan Clark▪, Scott Mahlke▪, and Krisztian Flautner*
* ARM Ltd.
▪ Advanced Computer Architecture Lab, University of Michigan
CASES 2005, September 24-27

Slide 2: Embedded Products Convergence
- Growing application demands call for more performance
- Embedded systems win through customization: more performance, low power, etc.
- Traditional ISA customization and hardware specialization cannot keep up with the growing number of functions
- One way forward: Transparent Instruction Set Customization
[Figure: "Concept smart phone of 2008", with features 3.5G (HSDPA), WiMax, NFC / RFID, stereo headset, Bluetooth/UWB, biometrics, GPS, TV out, PC / Mac, memory card, DMB (Digital Mobile Broadcast), 20 GB HD]

Slide 3: Transparent Instruction Set Customization
[Figure: an instruction sequence I1 to I5 is sped up either by a higher frequency or by collapsing instructions (customization)]
- An alternative route to performance
- No ISA change (or only a minor one)
- Baseline CPU unchanged
- Hardware generates control
- Eases software burden
- Forward compatible

Slide 4: Architecture Framework
[Figure labels: Application, Compiler, Instructions (subgraphs marked with BRL), Standard Pipeline, Subgraph Execution Unit (Inputs, Outputs), Control Generation, "Augments Instruction Stream", Subgraph]

Slide 5: Pipeline Interface

Slide 6: LUT-based accelerator
- LUT-based
  inst1: EOR r6,r1,r2
  inst2: AND r7,r4,r5
  inst3: ORR r12,r6,r7
  [Figure: the EOR/AND/ORR dataflow graph (inputs r1, r2, r4, r5; output r12) mapped onto a block labeled "32 LUT" that computes (a^b) | (c&d)]
- Addition/Subtraction
  inst1: ADD r6,r1,r2
  $r6_i = r1_i \oplus r2_i \oplus Cin_{i-1}$
  $Cin_i = r1_i \cdot r2_i \mid Cin_{i-1} \cdot (r1_i \oplus r2_i)$
  A carry generator that is also programmable
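
The two cases on this slide can be sketched in Python. This is an illustrative model only, not the authors' hardware: the helper names `build_lut`, `apply_lut`, and `ripple_add` are hypothetical, and the sketch assumes the "32 LUT" label means one identical 4-input LUT per bit position of a 32-bit word.

```python
MASK32 = 0xFFFFFFFF

def build_lut(fn, n_inputs):
    """Pack a truth table into an integer: bit k holds fn applied to the bits of k."""
    table = 0
    for k in range(1 << n_inputs):
        bits = [(k >> j) & 1 for j in range(n_inputs)]
        if fn(*bits) & 1:
            table |= 1 << k
    return table

def apply_lut(table, operands, width=32):
    """Evaluate the same LUT independently at every bit position (assumed layout)."""
    out = 0
    for i in range(width):
        index = 0
        for j, op in enumerate(operands):
            index |= ((op >> i) & 1) << j
        out |= ((table >> index) & 1) << i
    return out

def ripple_add(a, b, width=32):
    """Bit-serial form of the slide's sum and carry equations."""
    carry = 0  # plays the role of Cin_{i-1} for bit i
    out = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        out |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    return out

# EOR/AND/ORR collapsed into one function of (a, b, c, d) = bits of (r1, r2, r4, r5)
lut = build_lut(lambda a, b, c, d: (a ^ b) | (c & d), 4)
r1, r2, r4, r5 = 0x12345678, 0x0F0F0F0F, 0xFFFF0000, 0x00FFFF00
r12 = apply_lut(lut, [r1, r2, r4, r5])
```

The three logic instructions need no communication between bit positions, which is why a plain LUT suffices; the addition needs the carry chain, which is the part the slide proposes to make programmable.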

Slide 7: Programmable Carry Functional Unit (PCFU)
[Figure: carry network drawn as rows of "o" operator nodes, arranged in levels L1 to L4]
$L_i = (g_i, p_i)$
$(G, P) \circ (G', P') = (G \mid P \cdot G',\; P \cdot P')$
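
A minimal sketch of how an "o" operator on (generate, propagate) pairs can produce all carries in a logarithmic number of levels. It uses the usual form of the carry operator, (G, P) o (G', P') = (G | P·G', P·P'), and a doubling-distance arrangement chosen here for illustration; the slide does not specify the exact wiring of its network, and the names `combine` and `prefix_add` are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def combine(left, right):
    """(G, P) o (G', P') = (G | P & G', P & P'); `left` covers the higher bits."""
    g_hi, p_hi = left
    g_lo, p_lo = right
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def prefix_add(a, b, cin=0, width=32):
    # Per-bit generate/propagate pairs (g_i, p_i).
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    spans = list(zip(g, p))
    dist = 1
    while dist < width:  # one pass per level, distance doubles each time
        prev = spans
        spans = [combine(prev[i], prev[i - dist]) if i >= dist else prev[i]
                 for i in range(width)]
        dist *= 2
    # spans[i] now summarizes bits 0..i, so the carry out of bit i is:
    carry_out = [G | (P & cin) for G, P in spans]
    result = 0
    for i in range(width):
        carry_in = cin if i == 0 else carry_out[i - 1]
        result |= (p[i] ^ carry_in) << i
    return result
```

Because the operator is associative, the pairs can be combined in any grouping, which is what lets the carries be computed in levels rather than rippling bit by bit.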

Slide 8: Configuration generation
Subgraph:
  AND r3, r1, r2
  ADD r4, r1, r2
  XOR r5, r3, r
Functions derived per instruction:
- AND (r3): Out = A AND B
- ADD (r4): $Out = A \oplus B \oplus cin$, $g = A \cdot B$, $p = A \oplus B$
- XOR (r5): $Out = A \oplus B$
Meta register file: entries m-r3, m-r4, m-r5, with p and g fields
Meta function unit: LUT(r3) = LUT(r1) AND LUT(r2)
[Figure labels: in1, in2, g LUT, p LUT, Carry Generator (g1, p1, cin1), OutLUT, Out]
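
The rule LUT(r3) = LUT(r1) AND LUT(r2) can be illustrated by treating each register's LUT as a truth table over the subgraph inputs and applying each instruction to the tables instead of to data. The sketch below is an assumption-laden model, not the authors' hardware: it covers logic operations only, fixes four inputs, and the names `input_table`, `generate_config`, and `OPS` are hypothetical. Handling ADD would additionally require the g and p tables named on the slide, which are not modeled here.

```python
N_INPUTS = 4
TABLE_MASK = (1 << (1 << N_INPUTS)) - 1  # 16-bit truth tables for 4 inputs

def input_table(j):
    """Truth table of 'input j' itself: bit k of the table is bit j of k."""
    table = 0
    for k in range(1 << N_INPUTS):
        if (k >> j) & 1:
            table |= 1 << k
    return table

# The same bitwise operation is applied to truth tables instead of data.
OPS = {
    "AND": lambda x, y: x & y,
    "ORR": lambda x, y: x | y,
    "EOR": lambda x, y: x ^ y,
}

def generate_config(subgraph, live_ins):
    """Walk the subgraph in order, keeping one truth table per register."""
    meta = {reg: input_table(j) for j, reg in enumerate(live_ins)}
    for op, dst, src1, src2 in subgraph:
        meta[dst] = OPS[op](meta[src1], meta[src2]) & TABLE_MASK
    return meta

subgraph = [
    ("EOR", "r6", "r1", "r2"),
    ("AND", "r7", "r4", "r5"),
    ("ORR", "r12", "r6", "r7"),
]
meta = generate_config(subgraph, ["r1", "r2", "r4", "r5"])
```

In this model the table left in `meta["r12"]` is the LUT configuration for the whole collapsed subgraph, obtained with one table operation per instruction.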

Slide 9: Design Space
- Number of inputs
- Number of outputs
- Number of additions/subtractions
- Shift support
  - At inputs
  - At outputs
[Figures: PCFU variants built from g/p LUT pairs (g1/p1, g2/p2) feeding carry generators (cin1, cin2) and output LUTs (OutLUT, OutLUT2); variants with inputs in1 to in4 or in1 to in6, outputs Out and Out2, and shifters at the inputs or outputs]

Slide 10: Evaluation
- Ported Trimaran compiler to ARM ISA
- Subgraph identification engine
- Synthesized with Synopsys standard cell library at 0.13 µm
- SimpleScalar configured as ARM926EJ-S
  - 5-stage pipeline, 250 MHz
  - 1-cycle 16k I/D caches
  - Single issue
- Baseline: 1-cycle subgraph execution latency

Slide 11: Speedup – Baseline PCFU
- 4-input, 2-output PCFU design

Slide 12: Number of inputs/outputs
- Area is proportional

Slide 13: Number of additions/subtractions

Slide 14: Collapsing Emulation

Slide 15: Shift support

Slide 16: Design points (inputs, outputs, adders, shift support)
- 4I, 2O, 2A, None
- 4I, 3O, 2A, None
- 5I, 3O, 2A, None

Slide 17: Conclusions
- Transparent Instruction Set Customization needs
  - Extracting computations from the program
  - An efficient substrate to map subgraphs
- PCFU LUT-based accelerators
  - Flexible configurable accelerators
  - Efficient configuration
- You can get up to 66% with a 6-input / 3-output / 2-adder PCFU ...
  ... but you get 62% with an 8 times smaller, ~40% faster PCFU

Slide 18: Q & A

Slide 19: Backups

Slide 20: PCFU Design Space
Table columns: Latency (ns), Area (cells), Speedup
Table rows:
- CCA (Michigan)
- CCA (R&D)
- PCFU (2 AS/4IN/2OUT)
- PCFU Logic only
- PCFU 1 ADD
- PCFU 2 ADD
- PCFU 3 ADD
- PCFU 2 IN
- PCFU 3 IN
- PCFU 4 IN
- PCFU 5 IN
- PCFU 6 IN
- PCFU (1 OUT)
- PCFU (2 OUT)
- PCFU (3 OUT)
- PCFU (Shift at inputs)
- PCFU (Shift at outputs)

Slide 21: LUT-based accelerator
  ADD r4,r1,r2
  XOR r5,r3,r4
  $r5_i = r3_i \oplus (r1_i \oplus r2_i \oplus cin_{i-1})$
  $cin_i = (r1_i \cdot r2_i) \mid (r1_i \oplus r2_i) \cdot cin_{i-1}$
[Figure: a block labeled "32 LUT" producing $r5_i$ from $r1_i$, $r2_i$, $r3_i$, and $Cin_{i-1}$]
- Closer to FPGA
- Bit-level functions too complex
- Proposed ripple carry scheme too slow
- May involve a carry propagation network that is also very complex
- Hard to configure and to keep a reasonable latency in a GPP
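
The fused ADD/XOR equations on this backup slide can be checked with a short bit-serial sketch. The function name `add_then_xor` is hypothetical, and the loop only models the logic, not the hardware timing; its one-bit-per-step structure does show why a ripple scheme is serial in the word width.

```python
MASK32 = 0xFFFFFFFF

def add_then_xor(r1, r2, r3, width=32):
    """Compute r5 = r3 XOR (r1 + r2) one bit at a time, carrying between bits."""
    carry = 0  # carry into bit 0
    r5 = 0
    for i in range(width):
        a = (r1 >> i) & 1
        b = (r2 >> i) & 1
        c = (r3 >> i) & 1
        r5 |= (c ^ (a ^ b ^ carry)) << i            # r5_i
        carry = (a & b) | ((a ^ b) & carry)         # cin_i, used by bit i + 1
    return r5
```

Each bit's output depends on the carry from the bit below, so the 32 per-bit functions cannot be evaluated independently the way a pure logic subgraph can.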