
Compiling Application-Specific Hardware Mihai Budiu Seth Copen Goldstein Carnegie Mellon University

Resources

Problems: complexity, power, global signals; a limited issue window => limited ILP. We propose a scalable architecture.

Outline: Introduction; ASH: Application-Specific Hardware; Compiling for ASH; Conclusions.

Application-Specific Hardware: C program -> Compiler -> Dataflow IR -> Reconfigurable hardware.

Our Solution. General: applicable to today's software (programming languages and applications). Automatic: compiler-driven. Scalable: at run-time (with clock speed and hardware) and at compile-time (with program size). Parallelism: exploits the application's parallelism.

Asynchronous Computation: each operation (e.g., +) communicates through a local handshake: data and a valid signal travel forward, an ack signal travels back.
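
To make the handshake concrete, here is a minimal C sketch of the protocol the slide depicts (my illustration: channel_t and the function names are invented, and it assumes producer and consumer run concurrently; a real asynchronous circuit implements this with wires, not shared memory):

    typedef struct {
        int data;    /* value traveling forward */
        int valid;   /* producer asserts: data is meaningful */
        int ack;     /* consumer asserts: data has been consumed */
    } channel_t;

    /* Producer: publish a value, then wait for the acknowledgment. */
    void send(channel_t *c, int v) {
        c->data  = v;
        c->valid = 1;
        while (!c->ack) ;    /* wait for the consumer */
        c->valid = 0;        /* complete the handshake */
    }

    /* Consumer: wait for valid data, use it, then acknowledge. */
    int receive(channel_t *c) {
        while (!c->valid) ;  /* wait for the producer */
        int v = c->data;
        c->ack = 1;          /* signal consumption */
        return v;
    }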

New: entire C applications compiled to dynamically scheduled circuits; custom dataflow machines that are application-specific, execute directly (no interpretation), and compute spatially.

Outline: Scalability; Application-Specific Hardware; CASH: Compiling for ASH; Conclusions.

CASH: Compiling for ASH. The compiler turns a C program into circuits, a memory partitioning, and an interconnection net, mapped onto reconfigurable hardware (RH).

Primitives: arithmetic/logic operators; multiplexors (data inputs selected by predicates); merge nodes; eta nodes (gateways that forward a data value only when their predicate is true); and memory operations (ld/st).
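
A hedged C rendering of the steering primitives' token semantics (my notation; the actual nodes operate on handshaked tokens, not function calls):

    /* Multiplexor: consumes all inputs; the predicate selects which
       data value is forwarded. */
    int mux(int p, int v_true, int v_false) {
        return p ? v_true : v_false;
    }

    /* Eta (gateway): forwards its data token only when the predicate
       is true; on false the token is consumed and nothing is emitted.
       *emitted models whether an output token exists. */
    int eta(int pred, int data, int *emitted) {
        *emitted = pred;
        return data;    /* meaningful only if *emitted is nonzero */
    }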

Forward Branches: if (x > 0) y = -x; else y = b*x; (circuit: both -x and b*x are computed, and a decoded mux driven by the comparison selects y). Conditionals => speculation.
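
In C terms, the circuit evaluates both arms speculatively and selects with the decoded mux; roughly (my sketch):

    int forward_branch(int x, int b) {
        int p  = (x > 0);    /* predicate; the ! node computes !p */
        int t0 = -x;         /* then-arm, computed speculatively */
        int t1 = b * x;      /* else-arm, computed speculatively */
        return p ? t0 : t1;  /* decoded mux selects y */
    }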

Critical Paths: in the same circuit the two arms are unbalanced; a strict mux must wait for both, so the longer arm (b*x) determines when y is ready even when -x is selected.

Lenient Operations: an operation may produce its output before all of its inputs have arrived; the mux emits y as soon as the predicate and the selected input are available. Lenient operations solve the problem of unbalanced paths.
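
A sketch of lenient mux behavior (illustrative; the _ready flags stand in for token arrival):

    /* Emits y as soon as the predicate and the *selected* input have
       arrived; the unselected (possibly slow) input is not waited for. */
    int lenient_mux(int p, int p_ready,
                    int v0, int v0_ready,
                    int v1, int v1_ready,
                    int *y_ready) {
        if (p_ready && p && v0_ready)  { *y_ready = 1; return v0; }
        if (p_ready && !p && v1_ready) { *y_ready = 1; return v1; }
        *y_ready = 0;   /* still waiting for the selected input */
        return 0;
    }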

Loops: int sum = 0, i; for (i = 0; i < 100; i++) sum += i*i; return sum; (circuit: the constants 0 and the nodes +1, <, *, +, !, ret form two cyclic dataflow loops, one for i and one for sum). Control flow => data flow.
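
The dataflow form can be read back as C (my rendering): initial tokens (the 0 constants) enter both loops, and the predicate i < 100 steers eta nodes that either recirculate values around the back edges or release the result:

    int sum_of_squares(void) {
        int i = 0, sum = 0;      /* initial tokens entering the loops */
        for (;;) {
            if (!(i < 100))      /* predicate from the < node */
                return sum;      /* eta releases sum to ret */
            sum = sum + i * i;   /* sum's loop: * then + */
            i   = i + 1;         /* i's loop: +1 */
        }
    }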

Compilation: translate C to dataflow machines; apply software-, hardware-, and dataflow-specific optimizations; expose parallelism through predication, speculation, localized synchronization, and pipelining.

Pipelining: the loop circuit contains i's loop (+, <=) and sum's loop (*, +); the multiplier is pipelined.

Pipelining: the multiplier is a long-latency pipe, so several loop iterations can be in flight inside it at once.

Pipelining: the predicate ack edge between the two loops is on the critical path and limits the iteration rate.

Pipelining: a decoupling FIFO on the predicate edge takes it off the critical path, letting i's loop run ahead of sum's loop.
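
A software analogue of the decoupling FIFO (my sketch; the depth and names are invented): the producer loop can run ahead as long as the FIFO has room, instead of waiting for an ack per value:

    #define DEPTH 4   /* assumed FIFO depth */
    typedef struct { int buf[DEPTH]; int head, tail, count; } fifo_t;

    int fifo_put(fifo_t *f, int v) {       /* i's loop side */
        if (f->count == DEPTH) return 0;   /* full: stall the producer */
        f->buf[f->tail] = v;
        f->tail = (f->tail + 1) % DEPTH;
        f->count++;
        return 1;
    }

    int fifo_get(fifo_t *f, int *v) {      /* sum's loop side */
        if (f->count == 0) return 0;       /* empty: stall the consumer */
        *v = f->buf[f->head];
        f->head = (f->head + 1) % DEPTH;
        f->count--;
        return 1;
    }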

ASH Features: What you code is what you get –no hidden control logic –lean hardware (no CAMs, multi-ported register files, etc.) –no global signals. The compiler has complete control. Dynamic scheduling => latency tolerant. Natural ILP and loop pipelining.

Conclusions ASH: compiler-synthesized hardware from HLL Exposes program parallelism Dataflow techniques applied to hardware ASH promises to scale with: – circuit speed – transistors – program size

Backup slides Hyperblocks Predication Speculation Memory access Procedure calls Recursive calls Resources Performance

Hyperblocks: each procedure is partitioned into hyperblocks (diagram).

Predication: within a hyperblock, control dependences become predicates p and !p; an operation q executes as "if (p) q" or "if (!p) q".
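
A small if-conversion sketch (my example): inside the hyperblock, branches are replaced by predicate operands, and a mux joins the guarded definitions:

    /* Original:  if (p) a = x + y; else a = x - y;
       Hyperblock form: straight-line code with predicated operations. */
    int hyperblock(int p, int x, int y) {
        int t1 = 0, t2 = 0;
        if (p)  t1 = x + y;   /* operation guarded by p  */
        if (!p) t2 = x - y;   /* operation guarded by !p */
        return p ? t1 : t2;   /* mux joins the two definitions */
    }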

Speculation: operations without side effects drop their predicates and execute unconditionally ("if (!p) q" becomes plain "q"); only operations with side effects remain predicated.

Memory Access: a load node takes an address, a predicate, and a token and produces data; a store node takes an address, data, a predicate, and a token. Memory operations reach memory through a load-store queue over the interconnection network; tokens order dependent accesses.
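
A token-ordering sketch (illustrative; token_t and the helper names are invented): a memory operation fires only after receiving the token of the operation it must follow, which is the localized synchronization mentioned earlier:

    typedef struct { int _; } token_t;   /* carries ordering, no data */

    /* Load: needs address, predicate, token; yields data and a token. */
    token_t ash_load(int *addr, int pred, token_t t, int *data) {
        if (pred) *data = *addr;         /* predicated load */
        return t;                        /* release token downstream */
    }

    /* Store: may fire only after the incoming token (e.g., from a
       potentially aliasing load) has arrived. */
    token_t ash_store(int *addr, int v, int pred, token_t t) {
        if (pred) *addr = v;             /* predicated store */
        return t;                        /* pass the token on */
    }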

Procedure calls: the caller sends "call P" with the args over the interconnection network; procedure P extracts the arguments and returns the result to the caller.

Recursion: at a recursive call, the hyperblock's live values are saved on a stack and restored after the call returns.

Resources: estimated for SpecINT95 and Mediabench; on average < 100 bit-operations per line of code; routing resources are harder to estimate; detailed data in the paper.

Performance: a preliminary comparison with a 4-wide out-of-order processor, assuming the same functional-unit latencies, shows speed-ups on kernels from Mediabench.