Programming Model and Synthesis for Low-power Spatial Architectures
Phitchaya Mangpo Phothilimthana and Nishant Totla, University of California, Berkeley

Heterogeneity is Inevitable

Why a heterogeneous system/architecture? Energy efficiency and runtime performance.

What will the future architecture be? The convergence point is unclear, but it will be some combination of:
1. Many small cores (less control overhead, smaller bitwidth)
2. Simple interconnect (reduced communication energy)
3. New ISAs (specialized, more compact encoding)

What we are working on: a programming model for future heterogeneous architectures, and a synthesis-aided "compiler". We want both!

Energy Efficiency vs. Programmability

"The biggest challenge with 1000 core chips will be programming them." - William Dally (NVIDIA, Stanford)

Future architectures bring many small cores, new ISAs, and a simple interconnect; the corresponding programming challenges are fine-grained partitioning, software-controlled messages, and new compiler optimizations.

On NVIDIA's 28nm chips, getting data from neighboring memory takes 26x the energy of an addition, and even a cache hit uses up to 7x the energy of an addition.

Compilers: State of the Art

If you limit yourself to a traditional compilation framework, you are restricted to architectures similar to existing ones, or you have to wait years to prove success: building a brand-new optimizing compiler takes about 10 years, while modifying an existing compiler leaves you stuck with a similar architecture.

Synthesis is an alternative to compilation. A compiler transforms the source code; synthesis searches for a correct, fast program. Synthesis (our approach) breaks free from the architecture.

Program Synthesis (Example)

Specification:

    int[16] transpose(int[16] M) {
      int[16] T = 0;
      for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
          T[4 * i + j] = M[4 * j + i];
      return T;
    }

Sketch (a template whose holes, written ??, the synthesizer must fill):

    int[16] trans_sse(int[16] M) implements transpose {
      int[16] S = 0, T = 0;
      repeat (??) S[??::4] = shufps(M[??::4], M[??::4], ??);
      repeat (??) T[??::4] = shufps(S[??::4], S[??::4], ??);
      return T;
    }

Synthesized program:

    int[16] trans_sse(int[16] M) implements transpose {
      S[4::4] = shufps(M[6::4], M[2::4], b);
      S[0::4] = shufps(M[11::4], M[6::4], b);
      S[12::4] = shufps(M[0::4], M[2::4], b);
      S[8::4] = shufps(M[8::4], M[12::4], b);
      T[4::4] = shufps(S[11::4], S[1::4], b);
      T[12::4] = shufps(S[3::4], S[8::4], b);
      T[8::4] = shufps(S[4::4], S[9::4], b);
      T[0::4] = shufps(S[12::4], S[0::4], b);
      return T;
    }

(The shufps mask constants, abbreviated b above, and the size of the search space were lost in transcription.) Synthesis time: under 10 seconds.
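To make "searches for a correct program" concrete, here is a minimal toy synthesizer in Python. It is our own illustration, not the actual tool from the talk: it fills the ?? holes of a tiny made-up template, f(x) = (x << ??) + ??, by enumeration, checking each candidate against a stand-in specification (4*x + 3) on a set of test inputs. Real synthesizers replace enumeration with solver-backed search.

    # Toy synthesis-by-search: try hole values until the template matches
    # the specification on every test input.
    from itertools import product

    def spec(x):
        return 4 * x + 3

    def template(x, shift, addend):           # f(x) = (x << ??) + ??
        return (x << shift) + addend

    def synthesize(tests=range(32)):
        for shift, addend in product(range(8), range(16)):
            if all(template(x, shift, addend) == spec(x) for x in tests):
                return shift, addend          # first filling that matches
        return None

    print(synthesize())   # (2, 3): (x << 2) + 3 == 4*x + 3

The transpose sketch above works the same way, except that the holes range over shuffle patterns and the final check is a verification step rather than plain testing.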

Our Plan

A new programming model, and a new approach using synthesis:

    High-Level Program
      -> Partitioner -> Per-core High-Level Programs
      -> Code Generator -> Per-core Optimized Machine Code

Case study: GreenArrays GA144 spatial processor

Specs:
1. Stack-based 18-bit architecture with 32 instructions
2. 8 x 18 array of asynchronous computers (cores)
3. No shared resources (no shared clock, cache, or memory), making the architecture very scalable
4. Limited communication: neighbors only
5. Less than 300 bytes of memory per core

GA144 is 11x faster and simultaneously 9x more energy efficient than the MSP430. [Figure: instructions per second vs. power on a finite impulse response benchmark, spanning roughly a 100x range; figure from Per Ljung, data from Rimas Avizienis.]

Example challenges of programming spatial architectures like the GA144:
1. Bitwidth slicing: represent 32-bit numbers by two 18-bit words (sketched below)
2. Function partitioning: break functions into a pipeline with just a few operations per core
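As an illustration of bitwidth slicing, here is a small Python sketch; it is our own example (the names and word layout are assumptions), showing a 32-bit value stored as a high/low pair of 18-bit words and an add that propagates the carry between the words, roughly the code a compiler must emit on an 18-bit machine:

    # Bitwidth slicing: model a 32-bit integer as (high, low) 18-bit words.
    WORD_BITS = 18
    LOW_MASK = (1 << WORD_BITS) - 1
    HIGH_MASK = (1 << (32 - WORD_BITS)) - 1    # high word holds the top 14 bits

    def slice32(x):
        """Split a 32-bit value into (high order, low order) words."""
        return ((x >> WORD_BITS) & HIGH_MASK, x & LOW_MASK)

    def add_sliced(a, b):
        """Add two sliced values, carrying from the low word into the high."""
        lo = a[1] + b[1]
        carry = lo >> WORD_BITS
        hi = (a[0] + b[0] + carry) & HIGH_MASK   # wrap modulo 2**32
        return (hi, lo & LOW_MASK)

    x, y = 0xDEADBEEF, 0x12345678
    assert add_sliced(slice32(x), slice32(y)) == slice32((x + y) & 0xFFFFFFFF)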

Spatial Programming Model

In our programming model, the source program carries partition annotations that pin data and operators to specific cores. The transcript garbled the annotation syntax, but the example and its placements survive: an MD5-like hash whose 32-bit values are pairs of 18-bit words (a high-order and a low-order word), with buffer placed at core (104,4), the + at core (105,5), and k[i] at core (106,6):

    typedef pair myInt;       // 32-bit value = (high order, low order) 18-bit words
    myInt k[64];              // k[i] placed at core (106,6)
    ... (myInt buffer, ...) { // buffer placed at core (104,4)
      sum = buffer + k[i] + message[g];   // the + placed at core (105,5)
      ...
    }

[Figure: one round of the hash, combining the current hash, message M, constant K, and shift value R with rotate-and-add-with-carry steps.]
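To suggest what the compiler generates from such placements, here is a hedged Python model of our own (threads and queues are stand-ins; real GA144 cores communicate through hardware ports to their neighbors), in which the core holding buffer and the core holding k[i] each send an operand to the neighboring core that performs the +:

    # Illustrative model of placed computation across three "cores".
    from queue import Queue
    from threading import Thread

    chan_104_to_105 = Queue()   # neighbor link (104,4) -> (105,5)
    chan_106_to_105 = Queue()   # neighbor link (106,6) -> (105,5)
    result = Queue()

    def core_104(buffer_val):
        chan_104_to_105.put(buffer_val)   # send buffer to the adder core

    def core_106(k_val):
        chan_106_to_105.put(k_val)        # send k[i] to the adder core

    def core_105():
        a = chan_104_to_105.get()         # receive buffer
        b = chan_106_to_105.get()         # receive k[i]
        result.put(a + b)                 # the + placed at (105,5)

    threads = [Thread(target=core_104, args=(7,)),
               Thread(target=core_106, args=(35,)),
               Thread(target=core_105)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(result.get())                   # 42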

Optimal Partitions from Our Synthesizer

Benchmark: simplified MD5 (one iteration). The synthesizer produces different optimal partitions under different constraints:
1. 256-byte memory per core, with the initial data placement specified
2. 512-byte memory per core, same initial data placement
3. 512-byte memory per core, different initial data placement

[Figures: the three resulting core layouts, showing how the constant K, round function F, message M, rotation (<<<), shift amounts R, and the high/low words of each value are mapped onto the core grid.]

The partitions are generated automatically.

Retargetable Code Generation

A traditional compiler needs many tasks: implementing optimizing transformations, including hardware-specific code generation (e.g., register allocation and instruction scheduling).

Synthesis-based code translation needs only these:
1. define the space of programs to search, via code templates
2. define the machine instructions, as if writing an interpreter

Example: defining exclusive-or for a stack architecture:

    xor = lambda: push(pop() ^ pop())

The synthesizer can then generate code either from a template with holes, as in the transpose example (sketching), or from an unconstrained template (superoptimization).
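Fleshing out the xor example, the sketch below is our own Python illustration of "defining machine instructions as if writing an interpreter" (the instruction set shown is hypothetical, not the GA144's actual 32 instructions): each instruction is an ordinary function over a stack, which is all the synthesizer needs in order to execute and compare candidate programs.

    # A tiny stack-machine "ISA" written as an interpreter.
    class StackMachine:
        def __init__(self):
            self.stack = []

        def push(self, v): self.stack.append(v)
        def pop(self):     return self.stack.pop()

        # instruction semantics
        def xor(self): self.push(self.pop() ^ self.pop())
        def add(self): self.push(self.pop() + self.pop())
        def dup(self): self.push(self.stack[-1])

        def run(self, program, inputs):
            """Run a program (a list of instruction names) on an initial stack."""
            self.stack = list(inputs)
            for instr in program:
                getattr(self, instr)()
            return self.stack[-1]

    m = StackMachine()
    print(m.run(["dup", "xor"], [5]))   # 0, since x ^ x == 0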

Code Generation via Superoptimization

Example: synthesize efficient division by a constant from the template

    quotient = (?? * n) >> ??

    Program    Solution
    x/3        (43691 * x) >> 17
    x/5        (52429 * x) >> 18
    x/6        (43691 * x) >> 18
    x/7        (? * x) >> 20        (constant lost in transcription)

(A toy version of this search is sketched below.)

Results: synthesized functions are 1.7x-5.2x faster and 1.8x-4x shorter than naïve implementations of simple GreenArrays functions, and 1.1x-1.4x faster and 1.5x shorter than GreenArrays functions hand-optimized by experts (MD5 App Note). The current prototype synthesizes a program with 8 unknown instructions in 2 to 30 seconds, and one with ~25 unknown instructions within 5 hours.
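The following Python sketch is our toy rendition of that superoptimization query (the real tool searches over GA144 instruction sequences; here we assume 16-bit unsigned inputs and consider only multiply-shift candidates):

    # Brute-force (c, s) such that (c * x) >> s == x // d for all 16-bit x.
    def synth_div(d, xbits=16, max_shift=24):
        for s in range(max_shift + 1):             # prefer smaller shifts
            base = (1 << s) // d
            for c in (base, base + 1):             # floor and ceil of 2**s / d
                if all((c * x) >> s == x // d for x in range(1 << xbits)):
                    return c, s
        return None

    print(synth_div(3))   # (43691, 17), matching the table above
    print(synth_div(5))   # (52429, 18)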

Current Status and Future Work

Current status:
1. Partitioner for straight-line code
2. Superoptimizer for small code

Future work:
1. Make the synthesizer retargetable, and release it!
2. Design spatial data structures
3. Build low-power gadgets for audio, vision, health, ...

We aim to answer: how minimal can hardware be, and how do we build tools for it? Demo: a blinking LED.