Compiler-directed Synthesis of Programmable Loop Accelerators
Kevin Fan, Hyunchul Park, Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science
September 25, 2004, EDCEP Workshop

Loop Accelerators
Hardware implementation of a critical loop nest
– Hardwired state machine
– Digital camera application: 1000x vs. Pentium III
– Multiple accelerators hooked up in a pipeline
Loop accelerator vs. customized processor
– 1 block of code vs. multiple blocks
– Trivial control flow vs. handling generic branches
– Traditionally state machine vs. instruction driven

Programmable Loop Accelerators
Goals
– Multifunction accelerators: accelerator hardware can handle multiple loops (re-use)
– Post-programmable: to a degree, allow changes to the application
– Use compiler as architecture synthesis tool
But …
– Don't build a customized processor
– Maintain ASIC-level efficiency

NPA (Nonprogrammable Accelerator) Synthesis in PICO

PICO Frontend
Example loop nest:
  for i = 1 to ni
    for j = 1 to nj
      y[i] += w[j] * x[i+j]
After transformation:
  for jt = 1 to 100 step 10
    for t = 0 to 502
      for p = 0 to 1
        (i,j) = function of (t,p)
        W[t][p] = (i>1) ? W[t-5][p] : w[jt+j]
        X[t][p] = (i>1 && j<bj) ? X[t-4][p+1] : x[i+jt+j]
        Y[t][p] += W[t][p] * X[t][p]
Goals
– Exploit loop-level parallelism
– Map loop to abstract hardware
– Manage global memory BW
Steps
– Tiling
– Load/store elimination
– Iteration mapping
– Iteration scheduling
– Virtual processor clustering
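The tiling step listed above can be illustrated with the slide's own loop nest. A minimal sketch, assuming a simple tile size of 10 on the j loop (mirroring the "step 10" in the transformed code); PICO's full (t, p) iteration mapping and load/store elimination are not reproduced here:

```python
def fir_naive(w, x, ni, nj):
    # y[i] = sum over j of w[j] * x[i+j], 1-based as on the slide
    y = [0] * (ni + 1)
    for i in range(1, ni + 1):
        for j in range(1, nj + 1):
            y[i] += w[j] * x[i + j]
    return y

def fir_tiled(w, x, ni, nj, tile=10):
    # Same computation with the j loop tiled by `tile`; each tile's
    # iterations can then be mapped onto a virtual processor space.
    y = [0] * (ni + 1)
    for jt in range(1, nj + 1, tile):
        for i in range(1, ni + 1):
            for j in range(jt, min(jt + tile, nj + 1)):
                y[i] += w[j] * x[i + j]
    return y
```

Both versions compute identical results; the tiled form merely reorders iterations so that a bounded working set of w and x is live at a time.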

PICO Backend
– Resource allocation (II, operation graph)
– Synthesize machine description for "fake" fully connected processor with allocated resources
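The II chosen during resource allocation is bounded below by the standard modulo-scheduling limits. A hedged sketch of those textbook bounds (ResMII from resource contention, RecMII from dependence cycles); the slide does not detail PICO's exact allocation algorithm:

```python
import math

def min_ii(op_counts, fu_counts, recurrences):
    # ResMII: each FU class can initiate at most II ops of its kind
    # per iteration, so II >= ceil(ops_of_class / FUs_of_class).
    res_mii = max(math.ceil(op_counts[c] / fu_counts[c])
                  for c in op_counts)
    # RecMII: a dependence cycle with total latency L spanning D
    # loop iterations forces II >= ceil(L / D).
    rec_mii = max((math.ceil(lat / dist) for lat, dist in recurrences),
                  default=1)
    return max(res_mii, rec_mii)
```

For example, 8 ALU ops on 2 ALUs with a latency-6, distance-2 recurrence yields II = max(4, 3) = 4.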

Reduced VLIW Processor after Modulo Scheduling

Data/control-path Synthesis → NPA

PICO Methodology – Why it Works?
Systematic design methodology
1. Parameterized meta-architecture: all NPAs have the same general organization
2. Performance/throughput is an input
3. Abstract architecture: we know how to build compilers for this
4. Mapping mechanism: determine architecture specifics from the schedule for the abstract architecture

Direct Generalization of PICO?
Programmability would require full interconnect between elements
Back to the meta-architecture!
– Generalize connectivity to enable post-programmability
– But stylize it

Programmable Loop Accelerator – Design Strategy
Compile for partially defined architecture
– Build long-distance communication into schedule
– Limit global communication bandwidth
Proposed meta-architecture
– Multi-cluster VLIW
  – Explicit inter-cluster transfers (varying latency/BW)
  – Intra-cluster communication is complete
– Hardware partially defined – expensive units

Programmable Loop Accelerator Schema
[Figure: accelerator datapath of FUs and MEM units with shift registers and intra-cluster communication, an inter-cluster register file, a control unit, and a stream unit with stream buffers to SRAM/DRAM; multiple tiled or clustered accelerators form an accelerator pipeline]

Flow Diagram
Assembly code, II → FU Alloc → Partition → Modulo Schedule → Loop Accelerator
Determined along the way: # clusters, # expensive FUs, # cheap FUs; FUs assigned to clusters; shift register depth, width, porting; intercluster bandwidth

Sobel Kernel
  for (i = 0; i < N1; i++) {
    for (j = 0; j < N2; j++) {
      int t00, t01, t02, t10, t12, t20, t21, t22;
      int e1, e2, e12, e22, e, tmp;
      t00 = x[i  ][j  ]; t01 = x[i  ][j+1]; t02 = x[i  ][j+2];
      t10 = x[i+1][j  ];                    t12 = x[i+1][j+2];
      t20 = x[i+2][j  ]; t21 = x[i+2][j+1]; t22 = x[i+2][j+2];
      e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
      e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));
      e12 = e1*e1; e22 = e2*e2;
      e = e12 + e22;
      if (e > threshold) tmp = 1; else tmp = 0;
      edge[i][j] = tmp;
    }
  }

FU Allocation
Determine number of clusters: ⌈ops / (FUs per cluster × II)⌉
Determine number of expensive FUs
– MPY, DIV, memory
Sobel with II=4
– 41 ops → 3 clusters
– 2 MPY ops → 1 multiplier
– 9 memory ops → 3 memory units
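The allocation arithmetic above can be sketched directly. Assumptions: 4 FUs per cluster (as in the test-case machines) and the "max of 4 × II operations per cluster" bound from the partitioning slide; each expensive unit services one op per cycle across the II cycles:

```python
import math

def allocate(num_ops, num_mpy_ops, num_mem_ops, ii, fus_per_cluster=4):
    # A cluster with fus_per_cluster FUs executes at most
    # fus_per_cluster * II operations per iteration.
    clusters = math.ceil(num_ops / (fus_per_cluster * ii))
    # One expensive unit can service II operations per iteration.
    multipliers = math.ceil(num_mpy_ops / ii)
    mem_units = math.ceil(num_mem_ops / ii)
    return clusters, multipliers, mem_units
```

Plugging in the slide's Sobel numbers (41 ops, 2 MPYs, 9 memory ops, II=4) reproduces the 3 clusters, 1 multiplier, and 3 memory units stated above.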

Partitioning
Multi-level approach consists of two phases
– Coarsening
– Refinement
Minimize inter-cluster communication
Load balance
– Max of 4 × II operations per cluster
Take FU allocation into account
– Restricted # of expensive units
– # of cheap units (ADD, logic) determined from partition

Coarsening
Group highly related operations together
– Pair operations together at each step
– Forces partitioner to consider several operations as a single unit
[Figure: coarsening a Sobel subgraph of LOAD and ADD operations into 2 groups]
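The pairing step can be sketched as a greedy heavy-edge matching, a standard move in multilevel graph partitioners. The slide does not specify the exact pairing criterion, so treating dataflow-edge weight as the "relatedness" measure here is an assumption:

```python
def coarsen(nodes, edges):
    # edges: dict mapping (u, v) node pairs to a weight representing
    # how strongly related the two operations are (e.g. dataflow).
    # Greedily pair unmatched endpoints of the heaviest edges so the
    # partitioner later treats each pair as a single unit.
    matched = set()
    groups = []
    for (u, v), w in sorted(edges.items(), key=lambda kv: -kv[1]):
        if u not in matched and v not in matched:
            matched.update((u, v))
            groups.append((u, v))
    # Operations left unmatched stay as singleton groups.
    groups += [(n,) for n in nodes if n not in matched]
    return groups
```

Repeating this on the coarsened graph halves (roughly) the problem size at each level, which is what lets the later refinement phase work on a small graph first.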

Refinement
Move operations between clusters
Good moves:
– Reduce inter-cluster communication
– Improve load balance
– Reduce hardware cost
  – Reduce number of expensive units to meet limit
  – Collect similar-bitwidth operations together
[Figure: candidate move of a LOAD operation between clusters]

Partitioning Example
From sobel, II=4
– Place MPYs together
– Place each tree of ADD-LOAD-ADDs together
– Cuts 6 edges

Modulo Scheduling
Determines shift register width, depth, and number of read ports
Sobel, II=4
[Figure: modulo reservation table of LD and ADD operations on FU0–FU3 by cycle, and a per-FU table of max result lifetime, required depth, and required ports]
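How a schedule determines those shift register parameters can be sketched under the schema's model: each FU writes results into its own shift register, which shifts one slot per cycle, so a value defined at cycle d and last read at cycle u needs depth ≥ u − d, and a read port is needed for every consumer reading in the same cycle. This per-FU accounting is an assumption consistent with the "max result lifetime / required depth / required ports" table on the slide, not a transcription of the paper's exact rule:

```python
from collections import defaultdict

def shift_reg_requirements(defs, uses):
    # defs: op -> (fu, def_cycle); uses: op -> list of consumer cycles.
    depth = defaultdict(int)
    reads_per_cycle = defaultdict(int)
    for op, (fu, d) in defs.items():
        for u in uses.get(op, []):
            depth[fu] = max(depth[fu], u - d)   # longest live range
            reads_per_cycle[(fu, u)] += 1       # concurrent reads
    ports = defaultdict(int)
    for (fu, _), n in reads_per_cycle.items():
        ports[fu] = max(ports[fu], n)
    return dict(depth), dict(ports)
```

For instance, a load defined at cycle 0 with reads at cycles 2 and 5, plus a multiply on the same FU defined at cycle 2 and read at cycle 5, gives that FU a required depth of 5 and 2 read ports.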

Test Cases
Sobel and fsed kernels, II=4 designs
Each machine has 4 clusters with 4 FUs per cluster
[Figure: per-cluster FU mixes for the sobel and fsed machines, drawn from ADD/SUB, MPY, MEM, logic, shift, and branch units]

Cross Compile Results
Computation is localized
– sobel: 1.5 moves/cycle
– fsed: 1 move/cycle
Cross compile
– Can still achieve II=4
– More inter-cluster communication
– May require more units
– sobel on fsed machine: ~2 moves/cycle
– fsed on sobel machine: ~3 moves/cycle

Concluding Remarks
Programmable loop accelerator design strategy
– Meta-architecture with stylized interconnect
– Systematic compiler-directed design flow
Costs of programmability
– Interconnect, inter-cluster communication
– Control – "micro-instructions" are necessary
Just scratching the surface of this work
For more, see the CCCP group webpage