Integrating Post-programmability Into the High-level Synthesis Equation
Scott Mahlke, University of Michigan Electrical Engineering and Computer Science


Slide 1: Integrating Post-programmability Into the High-level Synthesis Equation*
Scott Mahlke, Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor, MI, USA
* This is work done by Kevin Fan and Manjunath Kudlur at UM

Slide 2: Application Engines Differentiate Consumer SoCs
Slide courtesy of Synfora

Slide 3: The HLS Equation
Area, Power, Performance
– What about programmability?
– How to deal with application changes?
– Time to market

Slide 4: Substrate Determines Programmability
[Figure: implementation substrates arranged by flexibility versus area or power: embedded processor (lpArm; blocks labeled MAC Unit, Addr Gen, µP, Prog Mem), DSP (e.g. TI 320CXX), reconfigurable processors (Maia), embedded FPGA, and direct mapped hardware. Efficiency is annotated in MOPS/mW; the only value that survives is .5-5 MOPS/mW.]

Slide 5: How Much Programmability? Just Enough!

New feature in faad2

Version 1.39:
for (k = 0; k < N4; k++) {
    ...
    real = Z1[k][0];
    img = Z1[k][1];
    Z1[k][0] = real * sincos[k][0] - img * sincos[k][1];
    Z1[k][0] = Z1[k][0] << 1;
}

Version 1.40:
for (k = 0; k < N4; k++) {
    ...
    real = Z1[k][0];
    img = Z1[k][1];
    Z1[k][0] = real * sincos[k][0] - img * sincos[k][1];
    Z1[k][0] = Z1[k][0] << 1;
    if (b_scale) {
        /* new in 1.40: optional scaling step */
        Z1[k][0] = Z1[k][0] * scale;
    }
}

Bug fix in faad2

Version 1.33:
for (k = 0; k < N4; k++) {
    ...
    uint16_t n = k << 1;
    ComplexMult(...);
    X_out[ n] = RE(x);
    X_out[N n] = -IM(x);
    X_out[N2 + n] = IM(x);
    X_out[N n] = -RE(x);
}

Version 1.34:
for (k = 0; k < N4; k++) {
    ...
    uint16_t n = k << 1;
    ComplexMult(...);
    /* 1.34 flips the sign of every output term */
    X_out[ n] = -RE(x);
    X_out[N n] = IM(x);
    X_out[N2 + n] = -IM(x);
    X_out[N n] = RE(x);
}

Slide 6: StreamRoller Approach
[Figure: an application (a "Frame Type?" branch, Loops 1 to 4, Block 5) alongside the loop accelerator template architecture: function units (+, &, MEM) joined by point-to-point connections, local memory, an FSM generating control signals, CRF, and BR.]

Slide 7: LA Programmability Shortcomings
[Figure: loop accelerator template: function units (+, &, MEM) joined by point-to-point connections, local memory, an FSM generating control signals, CRF, and BR.]
1. Point-to-point interconnect: only dataflow in the original application is supported
2. Fixed functionality: only operators in the original application are supported
3. Hardwired control and unaddressable register storage

Slide 8: Programmable Loop Accelerator
[Figure: programmable loop accelerator: generalized function units (+/-, &/|, MEM) joined by point-to-point connections and a bus, local memory, a control memory generating control signals, CRF, BR, RR, and literals.]
1. Low-cost functionality generalization
2. Addressable rotating registers
3. Low bandwidth full connectivity path
4. Enable input swapping
5. Programmable literals
6. Memory for decoded control
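The slide lists addressable rotating registers but does not say how they are addressed. The sketch below assumes the scheme commonly used with modulo-scheduled loops: a logical register number is offset by a base that steps down once per iteration, so a value written to logical register r is read as r+1 in the next iteration. The names `RotRegFile`, `rr_read`, `rr_write`, `rr_rotate` and the file size are hypothetical, not taken from the presentation.

```c
#include <assert.h>
#include <stdint.h>

#define RR_SIZE 8u  /* hypothetical number of rotating registers */

typedef struct {
    int32_t regs[RR_SIZE];
    unsigned base;  /* rotation offset, always in 0..RR_SIZE-1 */
} RotRegFile;

/* Map a logical register number to its current physical slot. */
unsigned rr_physical(const RotRegFile *rf, unsigned logical) {
    assert(logical < RR_SIZE);
    return (logical + rf->base) % RR_SIZE;
}

void rr_write(RotRegFile *rf, unsigned logical, int32_t value) {
    rf->regs[rr_physical(rf, logical)] = value;
}

int32_t rr_read(const RotRegFile *rf, unsigned logical) {
    return rf->regs[rr_physical(rf, logical)];
}

/* Called once per loop iteration. Decrementing the base (mod RR_SIZE)
 * makes every live value appear one logical register higher. */
void rr_rotate(RotRegFile *rf) {
    rf->base = (rf->base + RR_SIZE - 1u) % RR_SIZE;
}
```

With this scheme, overlapped iterations of a software-pipelined loop can each use the same register names without overwriting values that earlier iterations still need, as long as no value lives longer than RR_SIZE iterations.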

Slide 9: Mapping New Loops onto a PLA
[Figure: mapping flow. Inputs: loop, machine description. Stages: move insertion, SMT scheduling, register allocation. Output: control signals. Feedback edge: increment II.]
– Large search space, few solutions
– Op-centric approaches unable to find solutions
– Satisfiability Modulo Theories (SMT) formulation to solve linear and SAT constraints simultaneously
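The "increment II" feedback edge can be illustrated with a small search loop: start at the resource-bound initiation interval, attempt a schedule, and raise II by one on failure. This is only a sketch under stated assumptions. A greedy placement into a modulo reservation table stands in for the SMT scheduler named on the slide, move insertion and register allocation are omitted, and the types `Op` and `BackEdge`, the two function-unit kinds, and all latencies are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { FU_ADD = 0, FU_MEM = 1, NUM_FU_KINDS = 2 };  /* hypothetical FU kinds */
enum { MAX_II = 16 };

typedef struct {
    int fu_kind;    /* function-unit kind this op needs */
    int latency;    /* cycles until its result is available */
    int num_preds;  /* number of same-iteration predecessors */
    int preds[2];   /* indices of earlier ops it depends on */
} Op;

typedef struct { int from, to, distance; } BackEdge;  /* loop-carried dependence */

/* Resource-bound lower limit on II: ceil(uses / units) per FU kind. */
int res_mii(const Op *ops, int n, const int *fu_count) {
    int mii = 1;
    for (int k = 0; k < NUM_FU_KINDS; k++) {
        int uses = 0;
        for (int i = 0; i < n; i++)
            if (ops[i].fu_kind == k) uses++;
        int need = (uses + fu_count[k] - 1) / fu_count[k];
        if (need > mii) mii = need;
    }
    return mii;
}

/* Greedy stand-in for the scheduler: ops must be in topological order.
 * Returns 1 and fills start[] if every op fits at this II. */
int try_schedule(const Op *ops, int n, const int *fu_count,
                 const BackEdge *back, int num_back, int ii, int *start) {
    int mrt[MAX_II][NUM_FU_KINDS];  /* modulo reservation table */
    memset(mrt, 0, sizeof mrt);
    for (int i = 0; i < n; i++) {
        int earliest = 0, placed = 0;
        for (int p = 0; p < ops[i].num_preds; p++) {
            int j = ops[i].preds[p];
            if (start[j] + ops[j].latency > earliest)
                earliest = start[j] + ops[j].latency;
        }
        /* Only ii consecutive cycles are distinct modulo ii. */
        for (int t = earliest; t < earliest + ii && !placed; t++) {
            if (mrt[t % ii][ops[i].fu_kind] < fu_count[ops[i].fu_kind]) {
                mrt[t % ii][ops[i].fu_kind]++;
                start[i] = t;
                placed = 1;
            }
        }
        if (!placed) return 0;
    }
    /* A loop-carried value must be ready `distance` iterations later. */
    for (int b = 0; b < num_back; b++) {
        const BackEdge *e = &back[b];
        if (start[e->from] + ops[e->from].latency >
            start[e->to] + e->distance * ii)
            return 0;
    }
    return 1;
}

/* Returns the first II that schedules, or -1 if none up to MAX_II does. */
int map_loop(const Op *ops, int n, const int *fu_count,
             const BackEdge *back, int num_back, int *start) {
    for (int ii = res_mii(ops, n, fu_count); ii <= MAX_II; ii++)
        if (try_schedule(ops, n, fu_count, back, num_back, ii, start))
            return ii;
    return -1;
}
```

For a three-op loop (a 2-cycle load feeding two chained 1-cycle adds, with a distance-1 recurrence from the second add back to the first) on two adders and one memory port, the search rejects II = 1 because of the recurrence and settles on II = 2. A greedy placer like this one can miss feasible schedules on a tightly constrained datapath, which is the situation the slide describes as motivating the SMT formulation.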

Slide 10: Area Comparison – 130nm Library
[Chart] LA = single function accelerator, PLA = programmable accelerator, OR1K = OR-1200 processor

Slide 11: Power Comparison
[Chart] 1.0 = power for single function LA, OR1K-equiv = performance equivalent processor

Slide 12: Efficiency Comparison
[Chart: efficiency comparison with levels marked at 2 MIPS/mW, 20 MIPS/mW, and 200 MIPS/mW]

Slide 13: Programmability Assessment
[Chart] Number of algorithm perturbations tolerated while maintaining the same performance

Slide 14: Final Thoughts
Programmability not an all or nothing issue
– Application accelerators need to be able to evolve
– HLS + targeted design generalizations yield a highly customized, but semi-programmable ASIC
Bottom line tradeoffs
– PLA vs OR-1200: x more power efficient, 30x smaller
– PLA vs ASIC: 2 - 9x worse power, 2x larger
Cost breakdown
– Addressable register storage and generalized FUs most costly
– Interconnect extensions less costly

Slide 15: For More Information
“Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability,” K. Fan, H. Park, M. Kudlur, and S. Mahlke, Proc. International Symposium on Code Generation and Optimization, Apr. 2008, pp
“Orchestrating the Execution of Stream Programs on Multicore Platforms,” M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Languages Design and Implementation, Jun