Program Synthesis for Low-Power Accelerators Ras Bodik Mangpo Phitchaya Phothilimthana Tikhon Jelvis Rohin Shah Nishant Totla Computer Science UC Berkeley.

Slides:

Advertisements

Similar presentations

Part IV: Memory Management

Advertisements

Systems and Technology Group © 2006 IBM Corporation Cell Programming Tutorial - JHD24 May 2006 Cell Programming Tutorial Jeff Derby, Senior Technical Staff.

MP3 Optimization Exploiting Processor Architecture and Using Better Algorithms Mancia Anguita Universidad de Granada J. Manuel Martinez – Lechado Vitelcom.

High Level Languages: A Comparison By Joel Best. 2 Sources The Challenges of Synthesizing Hardware from C-Like Languages  by Stephen A. Edwards High-Level.

Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.

Programming Model for Spatial Low-Power Architectures Phitchaya Mangpo Phothilimthana and Nishant Totla with Prof. Ras Bodik mentored by Dinakar Dhurjati.

Contiki A Lightweight and Flexible Operating System for Tiny Networked Sensors Presented by: Jeremy Schiff.

Multiprocessors ELEC 6200: Computer Architecture and Design Instructor : Agrawal Name: Nam.

Behavioral Design Outline –Design Specification –Behavioral Design –Behavioral Specification –Hardware Description Languages –Behavioral Simulation –Behavioral.

Mahapatra-Texas A&M-Fall'001 cosynthesis Introduction to cosynthesis Rabi Mahapatra CPSC498.

Data Partitioning for Reconfigurable Architectures with Distributed Block RAM Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer.

Trend towards Embedded Multiprocessors Popular Examples –Network processors (Intel, Motorola, etc.) –Graphics (NVIDIA) –Gaming (IBM, Sony, and Toshiba)

Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.

CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohomed Younis.

1 Instant replay  The semester was split into roughly four parts. —The 1st quarter covered instruction set architectures—the connection between software.

Maria-Cristina Marinescu Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology A Synthesis Algorithm for Modular Design of.

Development in hardware – Why? Option: array of custom processing nodes Step 1: analyze the application and extract the component tasks Step 2: design.

Computer Organization and Architecture Reduced Instruction Set Computers (RISC) Chapter 13.

Computer Architecture ECE 4801 Berk Sunar Erkay Savas.

ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.

ICOM 5995: Performance Instrumentation and Visualization for High Performance Computer Systems Lecture 7 October 16, 2002 Nayda G. Santiago.

Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.

1 Chapter 1 Parallel Machines and Computations (Fundamentals of Parallel Processing) Dr. Ranette Halverson.

IT253: Computer Organization Lecture 4: Instruction Set Architecture Tonga Institute of Higher Education.

Models of Computation: FSM Model Reading: L. Lavagno, A.S. Vincentelli and E. Sentovich, “Models of computation for Embedded System Design”

COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.

Institute of Information Sciences and Technology Towards a Visual Notation for Pipelining in a Visual Programming Language for Programming FPGAs Chris.

Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.

Automated Design of Custom Architecture Tulika Mitra

Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.

Computer architecture Lecture 11: Reduced Instruction Set Computers Piotr Bilski.

Programming Model and Synthesis for Low-power Spatial Architectures Phitchaya Mangpo Phothilimthana Nishant Totla University of California, Berkeley.

1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.

TEMPLATE DESIGN © Hardware Design, Synthesis, and Verification of a Multicore Communication API Ben Meakin, Ganesh Gopalakrishnan.

StreamX10: A Stream Programming Framework on X10 Haitao Wei School of Computer Science at Huazhong University of Sci&Tech.

Synthesis with the Sketch System D AY 1 Armando Solar-Lezama.

Embedding Constraint Satisfaction using Parallel Soft-Core Processors on FPGAs Prasad Subramanian, Brandon Eames, Department of Electrical Engineering,

Reconfigurable Computing Using Content Addressable Memory (CAM) for Improved Performance and Resource Usage Group Members: Anderson Raid Marie Beltrao.

Lecture 10: Logic Emulation October 8, 2013 ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation.

Lecture 13: Logic Emulation October 25, 2004 ECE 697F Reconfigurable Computing Lecture 13 Logic Emulation.

1. 2 Pipelining vs. Parallel processing  In both cases, multiple “things” processed by multiple “functional units” Pipelining: each thing is broken into.

Evaluating and Improving an OpenMP-based Circuit Design Tool Tim Beatty, Dr. Ken Kent, Dr. Eric Aubanel Faculty of Computer Science University of New Brunswick.

Ronny Krashinsky Erik Machnicki Software Cache Coherent Shared Memory under Split-C.

ECEG-3202 Computer Architecture and Organization Chapter 7 Reduced Instruction Set Computers.

Next Generation ISA Itanium / IA-64. Operating Environments IA-32 Protected Mode/Real Mode/Virtual Mode - if supported by the OS IA-64 Instruction Set.

Gedae, Inc. Gedae: Auto Coding to a Virtual Machine Authors: William I. Lundgren, Kerry B. Barnes, James W. Steed HPEC 2004.

Basic Memory Management 1. Readings r Silbershatz et al: chapters

CS- 492 : Distributed system & Parallel Processing Lecture 7: Sun: 15/5/1435 Foundations of designing parallel algorithms and shared memory models Lecturer/

3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.

Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology Presented by: Luis Ortiz Department of Computer.

3/12/2013Computer Engg, IIT(BHU)1 CONCEPTS-1. Pipelining Pipelining is used to increase the speed of processing It uses temporal parallelism In pipelining,

CML Path Selection based Branching for CGRAs ShriHari RajendranRadhika Thesis Committee : Prof. Aviral Shrivastava (Chair) Prof. Jennifer Blain Christen.

Operating Systems: Summary INF1060: Introduction to Operating Systems and Data Communication.

Parallel Computing Presented by Justin Reschke

CPS 258 Announcements –Lecture calendar with slides –Pointers to related material.

CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.

Symbolic Model Checking of Software Nishant Sinha with Edmund Clarke, Flavio Lerda, Michael Theobald Carnegie Mellon University.

IA-64 Architecture Muammer YÜZÜGÜLDÜ CMPE /12/2004.

Multi-cellular paradigm The molecular level can support self- replication (and self- repair). But we also need cells that can be designed to fit the specific.

PipeliningPipelining Computer Architecture (Fall 2006)

Tools and Libraries for Manycore Computing Kathy Yelick U.C. Berkeley and LBNL.

Multiprocessor System Distributed System

FPGAs in AWS and First Use Cases, Kees Vissers

Introduction to cosynthesis Rabi Mahapatra CSCE617

COMP60621 Fundamentals of Parallel and Distributed Systems

Templates of slides for P4 Experiments with your synthesizer

The University of Adelaide, School of Computer Science

COMP60611 Fundamentals of Parallel and Distributed Systems

Presentation transcript:

Program Synthesis for Low-Power Accelerators Ras Bodik Mangpo Phitchaya Phothilimthana Tikhon Jelvis Rohin Shah Nishant Totla Computer Science UC Berkeley

What we do and talk overview Our main expertise is in program synthesis a modern alternative/complement to compilation Our applications of synthesis: hard-to-write code parallel, concurrent, dynamic programming, end-user code In this project, we explore spatial accelerators an accelerator programming model aided by synthesis 2

Future Programmable Accelerators 3

no cache coherence limited interconnect spatial & temporal partitioning no clock! crazy ISA small memory back to 16-bit nums

no cache coherence limited interconnect spatial & temporal partitioning no clock! crazy ISA small memory back to 16-bit nums

What we desire from programmable accelerators We want the obvious conflicting properties: -high performance at low energy -easy to program, port, and autotune Can’t usually get both -transistors that aid programmability burn energy -ex: cache coherence, smart interconnects, … In principle, most decisions can be done in compilers -which would simplify the hardware -but the compiler power has proven limited 8

We ask How to use fewer “programmability transistors”? How should a programming framework support this? Our approach: synthesis-aided programming model 9

GA 144: our stress-test case study 10

Low-Power Architectures thanks: Per Ljung (Nokia) 11

Why is GA low power The GA architecture uses several mechanisms to achieve ultra high energy efficiency, including: –asynchronous operation (no clock tree), –ultra compact operator encoding (4 instructions per word) to minimize fetching, –minimal operand communication using stack architecture, –optimized bitwidths and small local resources, –automatic fine-grained power gating when waiting for inter-node communication, and –node implementation minimizes communication energy. 12

GreenArray GA144 slide from Rimas Avizienis

MSP430 vs GreenArray

Finite Impulse Response Benchmark PerformanceMSP430 (65nm)GA144 (180nm) F2274 swmult F2617 hwmult F2617 hwmult+dma usec / FIR output nJ / FIR output MSP430GreenArrays Data from Rimas Avizienis GreenArrays 144 is 11x faster and simultaneously 9x more energy efficient than MSP 430.

How to compile to spatial architectures 16

Programming/compiling challenges Partition the code and map it to cores also, design the algorithm for optimal partitioning Implement inter-core communication don’t introduce races and deadlocks Manage distributed data structures for user-facing accelerators (GUIs), it’s not just arrays Generate efficient code often on a limited ISA 17

Programming/Compiling Challenges Algorithm/ Pseudocode Partition & Distribute Data Structures Place Comp1 Comp2 Comp3 Send X Comp4 Recv Y Comp5 Comp1 Comp2 Comp3 Send X Comp4 Recv Y Comp5 Schedule & Implement code & Route Communication

Existing technologies Optimizing compilers hard to write optimizing code transformations FPGAs: synthesis from C partition C code, map it and route the communication Bluespec: synthesis from C or SystemVerilog synthesizes HW, SW, and HW/SW interfaces 19

What does our synthesis differ? FPGA, BlueSpec: partition and map to minimize cost We push synthesis further: –super-optimization: synthesize code that meets a spec –a different target: accelerator, not FPGA, not RTL Benefits: superoptimization: easy to port as there is no need to port code generator and the optimizer to a new ISA 20

Overview of program synthesis 21

Synthesis with “sketches” 22 spec: int foo (int x) { return x + x; } sketch: int bar (int x) implements foo { return x << ??; } result: int bar (int x) implements foo { return x << 1; } Extend your language with two constructs 22 instead of implements, assertions over safety properties can be used

Example: 4x4-matrix transpose with SIMD a functional (executable) specification: int[16] transpose(int[16] M) { int[16] T = 0; for (int i = 0; i < 4; i++) for (int j = 0; j < 4; j++) T[4 * i + j] = M[4 * j + i]; return T; } This example comes from a Sketch grad-student contest 23

Implementation idea: parallelize with SIMD Intel SHUFP (shuffle parallel scalars) SIMD instruction: return = shufps(x1, x2, imm8 :: bitvector8) 24 x1x2 return 24 imm8[0:1]

High-level insight of the algorithm designer 25

The SIMD matrix transpose, sketched int[16] trans_sse(int[16] M) implements trans { int[16] S = 0, T = 0; S[??::4] = shufps(M[??::4], M[??::4], ??); … T[??::4] = shufps(S[??::4], S[??::4], ??); … return T; } 26 Phase 1 Phase 2

The SIMD matrix transpose, sketched int[16] trans_sse(int[16] M) implements trans { int[16] S = 0, T = 0; repeat (??) S[??::4] = shufps(M[??::4], M[??::4], ??); repeat (??) T[??::4] = shufps(S[??::4], S[??::4], ??); return T; } int[16] trans_sse(int[16] M) implements trans { // synthesized code S[4::4] = shufps(M[6::4], M[2::4], b); S[0::4] = shufps(M[11::4], M[6::4], b); S[12::4] = shufps(M[0::4], M[2::4], b); S[8::4] = shufps(M[8::4], M[12::4], b); T[4::4] = shufps(S[11::4], S[1::4], b); T[12::4] = shufps(S[3::4], S[8::4], b); T[8::4] = shufps(S[4::4], S[9::4], b); T[0::4] = shufps(S[12::4], S[0::4], b); } 27 From the contestant Over the summer, I spent about 1/2 a day manually figuring it out. Synthesis time: <5 minutes.

Demo: transpose on Sketch Try Sketch online at 28

The mechanics of program synthesis (constraint solving) 29

What to do with a program as a formula? 30

With program as a formula, solver is versatile 31

Search of candidates as constraint solving 32

Inductive synthesis 33

We propose a programming model for low-power devices by exploiting program synthesis. 34

Programming/Compiling Challenges Algorithm/ Pseudocode Partition & Distribute Data Structures Place Comp1 Comp2 Comp3 Send X Comp4 Recv Y Comp5 Comp1 Comp2 Comp3 Send X Comp4 Recv Y Comp5 Schedule & Implement code & Route Communication

Language? Project Pipeline Comp1 Comp2 Comp3 Send X Comp4 Recv Y Comp5 Comp1 Comp2 Comp3 Send X Comp4 Recv Y Comp5 Language Design expressive flexible (easy to partition) Partitioning minimize # of msgs fit each block in a core Placement & Routing minimize comm cost reason about I/O pins Intracore scheduling & Optimization overlap comp and comm avoid deadlock find most energy- efficient code

Programming model abstractions

MD5 case study 38

MD5 Hash 39 i th round Buffer (before) Buffer (after) message constant from lookup table Figure taken from Wikipedia RiRi

MD5 from Wikipedia 40 Figure taken from Wikipedia

Actual MD5 Implementation on GA144 41

MD5 on GA High order Low order shift value R current hash message M rotate & add with carry rotate & add with carry constant K RiRi

This is how we express MD5 43

Project Pipeline

Annotation at Variable Declaration typedef pair myInt; message[16]; k[64]; =(104,4)} output[4]; =(104,4)} hash[4]; r[64]; indicates where data lives k[i] high order low order

Annotation in Program md5() { t = 0; t < 16; t++) { a = hash[0], b = hash[1], c = hash[2], d = hash[3]; i = 0; i < 64; i++) { temp = d; d = c; c = b; b = round(a, b, c, d, i); a = temp; } hash[0] += a; hash[1] += b; hash[2] += c; hash[3] += d; } output[0] = hash[0]; output[1] = hash[1]; output[2] = hash[2]; output[3] = hash[3]; } 46 (104,4) is home for md5() suggests that any core can have this refers to the function’s

MD5 in CPart typedef pair myInt; k[64]; buffer,...) { sum = buffer k[i] + message[g];... }

MD5 in CPart typedef pair myInt; k[64]; buffer,...) { sum = buffer k[i] + message[g];... } 48 buffer is at (104,4) buffer high order low order

MD5 in CPart typedef pair myInt; k[64]; buffer,...) { sum = buffer k[i] + message[g];... } 49 k[i] is at (106,6) k[i] high order low order

MD5 in CPart typedef pair myInt; k[64]; buffer,...) { sum = buffer k[i] + message[g];... } 50 + is at (105,5) high order low order

MD5 in CPart typedef pair myInt; k[64]; buffer,...) { sum = buffer k[i] + message[g];... } 51 buffer is at (104,4) + is at (105,5) k[i] is at (106,6) buffer is at (104,4) + is at (105,5) k[i] is at (106,6) bufferk[i] high order low order Implicit communication in source program. Communication inserted by synthesizer. Implicit communication in source program. Communication inserted by synthesizer.

MD5 in CPart typedef pair myInt; k[64]; buffer,...) { sum = buffer k[i] + message[g];... } 52 buffer is at (104,4) + is at (204,5) k[i] is at (205,105) buffer is at (104,4) + is at (204,5) k[i] is at (205,105) buffer k[i] low order high order

MD5 in CPart typedef vector myInt; k[64]; buffer,...) { sum = buffer k[i] + message[g];... } 53 buffer is at {304,204,104,4} + is at {305,205,105,5} k[i] is at {306,206,106,6} buffer is at {304,204,104,4} + is at {305,205,105,5} k[i] is at {306,206,106,6} buffer k[i]

MD5 in CPart typedef pair myInt; k[64]; buffer,...) { sum = buffer k[i] + message[g];... }

MD5 in CPart typedef pair myInt; k[64]; buffer,...) { sum = buffer k[i] message[g];... } + happens at here which is (105,5) + happens at where the synthesizer decides

Summary: Language & Compiler Language features: Specify code and data placement by annotation No explicit communication required Option to not specifying place Synthesizer fills in holes such that: number of messages is minimized code can fit in each core 56

Project Pipeline

Program Synthesis for Code Generation Synthesizing optimal code Input: unoptimized code (the spec) Search space of all programs Synthesizing optimal library code Input: sketch + spec Search completions of the sketch Synthesizing communication code Input: program with virtual channels Insert actual communication code 58

slower faster unoptimized code (spec) optimal code synthesizer 1) Synthesizing optimal code

Our Experiment Register-based processorStack-based processor slower faster naive optimized spec most optimal hand bit trick synthesizer

Our Experiment Register-based processorStack-based processor slower faster naive optimized spec most optimal hand bit trick synthesizer

Comparison Register-based processorStack-based processor slower faster naive optimized spec most optimal translation hand bit trick hand synthesizer

Preliminary Synthesis Times Synthesizing a program with 8 unknown instructions takes 5 second to 5 minutes Synthesizing a program up to ~25 unknown instructions within 50 minutes

Preliminary Results ProgramDescriptionApprox. Speedup Code length reduction x – (x & y)Exclude common bits5.2x4x ~(x – y)Negate difference2.3x2x x | yInclusive or1.8x (x + 7) & -8Round up to multiple of 81.7x1.8x (x & m) | (y & ~m)Replace x with y where bits of m are 1’s 2x (y & m) | (x & ~m)Replace y with x where bits of m are 1’s 2.6x x’ = (x & m) | (y & ~m) y’ = (y & m) | (x & ~m) Swap x and y where bits of m are 1’s 2x

Code Length ProgramOriginal Length Output Length x – (x & y)82 ~(x – y)84 x | y2715 (x + 7) & -895 (x & m) | (y & ~m)2211 (y & m) | (x & ~m)218 x’ = (x & m) | (y & ~m) y’ = (y & m) | (x & ~m) 4321

2) Synthesizing optimal library code Input: Sketch: program with holes to be filled Spec: program in any programing language Output: Complete program with filled holes

Example: Integer Division by Constant Naïve Implementation: Subtract divisor until reminder < divisor. # of iterations = output valueInefficient! Better Implementation: n - input M- “magic” number s- shifting value M and s depend on the number of bits and constant divisor. quotient = (M * n) >> s

Example: Integer Division by 3 Spec: n/3 Sketch: quotient = (?? * n) >> ??

Preliminary Results ProgramSolutionSynthesis Time (s) Verification Time (s) # of Pairs x/3(43691 * x) >> x/5( * x) >> x/6(43691 * x) >> x/7( * x) >> deBruijn: Log 2 x (x is power of 2) deBruijn = 46, Table = {7, 0, 1, 3, 6, 2, 5, 4} 3.8N/A8 Note: these programs work for 18-bit number except Log2x is for 8-bit number.

3) Communication Code for GreenArray Synthesize communication code between nodes Interleave communication code with computational code such that There is no deadlock. The runtime or energy comsumption of the synthesized program is minimized.

The key to the success of low-power computing lies in inventing better and more effective programming frameworks. 71