Programming Model for Spatial Low-Power Architectures Phitchaya Mangpo Phothilimthana and Nishant Totla with Prof. Ras Bodik mentored by Dinakar Dhurjati.

Introduction

Heterogeneous CPUs are the future of mobile computing because they promise high energy efficiency without sacrificing performance. To achieve better energy efficiency, heterogeneous architectures will include minimalistic hardware: tiny cores, simple interconnects, and more efficient ISAs. The resulting spatial nature of the CPU and the lack of hardware support for programmability will complicate programming and will require new programming models and compiler tools.

We are working on a high-level programming model for heterogeneous architectures and a synthesis-based compiler toolchain. Our system helps programmers partition their code onto cores and is retargetable to a range of target architectures.

Case Study

As our case-study architecture, we have selected GreenArrays (GA) 144:
- 18-bit stack-based architecture
- 8 x 18 array of asynchronous cores
- no shared resources (e.g. clock, cache, memory bus)
- 144-byte RAM, 144-byte ROM, and two 8-word stacks per core
- each core can communicate only with its neighbors
- VDD = 1.8 V; power usage ranges from 14 µW to 650 mW
- fewer than 20k transistors per core

Finite Impulse Response Benchmark

GreenArrays 144 is 11x faster and simultaneously 9x more energy-efficient than MSP 430.

Performance comparison: MSP430 (65nm) vs. GA144 (180nm), measured in usec / FIR output and nJ / FIR output. Data from Rimas Avizienis.

Approach

Synthesis-based Code Generation

Current Synthesizer
- Spec: a GreenArrays program (sequence of instructions)
- Output: the fastest program (the synthesizer can be modified to output the most energy-efficient program instead)
- Sketch: optionally, we can provide a template of the desired GreenArrays program with holes

Our current prototype synthesizes straight-line programs with no branches or loops.

Code generation: Sketching-based Synthesis

Sketch: ?? * n >> ??
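To make the idea of filling holes in a sketch concrete, here is a minimal Python illustration. It is not the poster's synthesizer: it replaces whatever search the real tool uses with plain exhaustive enumeration, and it assumes the spec is unsigned division by a constant on 16-bit inputs. The names `spec` and `fill_sketch` and the parameters `input_bits` and `max_shift` are hypothetical.

```python
def spec(n, divisor):
    """Specification: ordinary unsigned integer division."""
    return n // divisor


def fill_sketch(divisor, input_bits=16, max_shift=20):
    """Fill the two holes of the sketch `?? * n >> ??`.

    Tries every shift s (smallest first) and every multiplier m below 2**s,
    and returns the first (m, s) for which (m * n) >> s matches the spec on
    every unsigned input of `input_bits` bits. Returns None if nothing fits.
    This is a toy exhaustive search, assumed for illustration only.
    """
    inputs = range(1 << input_bits)
    for s in range(max_shift + 1):
        for m in range(1, 1 << s):
            # all() stops at the first counterexample, so wrong candidates
            # are usually rejected after a handful of inputs.
            if all((m * n) >> s == spec(n, divisor) for n in inputs):
                return m, s
    return None


if __name__ == "__main__":
    print(fill_sketch(3))
```

For divisor 3 the search returns the multiplier 43691 and the shift 17. Enumerating shifts from smallest to largest is a design choice that prefers the cheapest constants; a real synthesizer would more likely use a constraint solver than enumeration.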
Naïve Implementation of Division

Subtract the divisor until remainder < divisor. The number of iterations equals the output value.

Better Implementation (for constant divisors)

quotient = (M * n) >> s

- n: input
- M: "magic" number
- s: shift amount

M and s depend on the number of bits and on the (constant) divisor.

Spec    Solution
x/3     (43691 * x) >> 17
x/5     ( * x) >> 20
x/6     (43691 * x) >> 18
x/7     ( * x) >> 20

Program                          Approx. speedup   Code length reduction   Original code length   Synthesis time
x - (x & y)                      5.2x              4x                      8                      2 s
(x + 7) & -8                     1.7x              1.8x                    9                      30 s
(x & m) | (y & ~m)               2x                                        22                     13 m
(y & m) | (x & ~m)               2.6x                                      21                     4 m
((x & y) | (~x & z)) & 0xffff    1.4x              1.5x                    15                     5h 15m
(y ^ (x | ~z)) & 0xffff          1.1x              1.4x                    14                     1h 46m

Goals

1) Design and implement an easy-to-use programming model for heterogeneous hardware, eliminating the need for the programmer to program at the machine level.
2) Develop algorithms for partitioning and placement of the high-level program to maximize parallelism while minimizing the communication cost.
3) Apply program synthesis to generate very efficient executable code. Synthesis is an alternative to building traditional compilers: it eliminates the need to implement a new compiler for each specific hardware target.

Current Status and Future Plans

Current Status
- Completely functioning prototype compiler
- Superoptimizer for straight-line code
- Data-flow language support for streaming applications
- Working MD5 program compiled by the prototype compiler

Figure: compiler pipeline. High-Level Program -> Partitioner -> Per-core High-Level Programs -> Code Generator -> Per-core Optimized Machine Code. Figure labels: New Programming Model; New Approach Using Synthesis.

Future Plan
- Develop a scalable superoptimizer for larger blocks of code
- Test retargetability of the synthesizer
- Design reusable spatial data structures
- Build low-power gadgets for audio, vision, health
- Evaluate ISA performance
  - when deciding to add new instructions
  - when choosing a set of instructions

Example: simplified MD5 (one iteration). Partitions are automatically generated.
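The superoptimizer for straight-line code listed under Current Status can be pictured with a toy example. The Python below is only an illustration under stated assumptions: the instruction names are a made-up stack-machine subset (not the GA144 instruction set), words are 4 bits wide, and equivalence is checked by testing every input pair rather than by a solver. It searches for the shortest program equivalent to the spec `x - (x & y)` from the table above.

```python
from itertools import product

BITS = 4                      # toy word size (assumption; the GA144 is 18-bit)
MASK = (1 << BITS) - 1

# Hypothetical instruction set for a toy stack machine.
BINARY = {
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "+": lambda a, b: (a + b) & MASK,
}
OPS = ("dup", "over", "swap", "drop", "not") + tuple(BINARY)


def run(program, x, y):
    """Run a program on the initial stack [x, y]; return the top of the
    stack, or None on stack underflow or an empty final stack."""
    stack = [x, y]
    for op in program:
        needed = 1 if op in ("dup", "drop", "not") else 2
        if len(stack) < needed:
            return None
        if op in BINARY:
            b, a = stack.pop(), stack.pop()
            stack.append(BINARY[op](a, b))
        elif op == "not":
            stack[-1] = ~stack[-1] & MASK
        elif op == "dup":
            stack.append(stack[-1])
        elif op == "over":
            stack.append(stack[-2])
        elif op == "swap":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == "drop":
            stack.pop()
    return stack[-1] if stack else None


def superoptimize(spec, max_len):
    """Enumerate all instruction sequences, shortest first, and return the
    first one that agrees with `spec` on every pair of BITS-bit inputs."""
    values = range(1 << BITS)
    for length in range(max_len + 1):
        for program in product(OPS, repeat=length):
            if all(run(program, x, y) == spec(x, y)
                   for x in values for y in values):
                return program
    return None


def spec(x, y):
    """Spec taken from the first table row: x - (x & y)."""
    return (x - (x & y)) & MASK


if __name__ == "__main__":
    print(superoptimize(spec, 3))
```

On this toy machine the search finds the two-instruction program `not and`, which computes `x & ~y`. Exhaustive enumeration grows exponentially with program length, which is consistent with the long synthesis times in the table and with the stated plan to develop a more scalable superoptimizer.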
Synthesis via Superoptimization (i.e., searching all instruction sequences)

The table shows the speedup and code length reduction of the synthesized code against a naïve implementation, except in the last two rows, which compare against expert hand-optimized code.

Demo: synthesized program running on GA144 with a lemon-bleach battery.

Figure from Per Ljung: computational rate vs. power consumption of different low-power devices (annotated ~100x).

Programming Model for Code Partitioning

The language lets users define the placement of data and code on cores.

Features

Users can specify:
- exact places, if known;
- only the partitioning; or
- no constraints.

Unknown places will be inferred by the synthesizer such that
- number of messages is minimized
- code fits in each core

Users do not need to code communication explicitly.

Code figures: Annotation at Variable Declaration; Various Place Annotations; Example Program.

Partitioning Synthesizer

Figure panels (data-flow graphs with nodes K, F, M, R and <<<):
- 256-byte mem per core, initial data placement specified
- 512-byte mem per core, different initial data placement
- 512-byte mem per core, same initial data placement

Example: simplified MD5 (one iteration)
- Input: initial data placement
- Output: optimal computation placement that minimizes the number of messages passed between cores

Acknowledgement: Rohin Shah, Tikhon Jelvis, and Andres RioFrio
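The partitioning problem described above (fixed initial data placement, code must fit in each core, number of messages minimized) can be sketched as a small brute-force search. This is an assumed illustration, not the poster's partitioning synthesizer: the function `partition`, the node names, the byte sizes, and the edge list are all hypothetical, and a message is counted once per data-flow edge that crosses cores.

```python
from itertools import product


def partition(sizes, edges, cores, capacity, fixed=None):
    """Place every node on a core.

    sizes:    node -> memory footprint in bytes (hypothetical numbers)
    edges:    (producer, consumer) data-flow pairs
    cores:    list of core identifiers
    capacity: bytes of memory available per core
    fixed:    node -> core for placements given by the user

    Returns (messages, placement) with the fewest cross-core edges among
    placements that fit in memory, or None if no placement fits.
    """
    fixed = fixed or {}
    free = [node for node in sizes if node not in fixed]
    best = None
    for choice in product(cores, repeat=len(free)):
        place = dict(fixed)
        place.update(zip(free, choice))
        used = {core: 0 for core in cores}
        for node, core in place.items():
            used[core] += sizes[node]
        if any(amount > capacity for amount in used.values()):
            continue  # code and data do not fit in some core
        messages = sum(1 for a, b in edges if place[a] != place[b])
        if best is None or messages < best[0]:
            best = (messages, place)
    return best


# Hypothetical data-flow graph loosely inspired by one MD5 iteration.
SIZES = {"K": 64, "M": 64, "R": 16, "f": 40, "add": 30, "rot": 30}
EDGES = [("K", "add"), ("M", "add"), ("f", "add"), ("add", "rot"), ("R", "rot")]
FIXED = {"K": 0, "M": 1, "R": 1}   # user-specified initial data placement

if __name__ == "__main__":
    for capacity in (256, 128):
        print(capacity, partition(SIZES, EDGES, [0, 1], capacity, FIXED))
```

With these made-up sizes, a 256-byte budget per core allows a placement with a single message, while a 128-byte budget forces the computation to spread out and needs three. This mirrors the way the figure panels vary memory per core and initial data placement. Enumerating all placements only scales to a handful of nodes.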