Peer-to-peer Hardware-Software Interfaces for Reconfigurable Fabrics. Mihai Budiu, Mahim Mishra, Ashwin Bharambe, Seth Copen Goldstein. Carnegie Mellon University.

Presentation transcript:


Peer-to-peer hw/sw interfaces
Resources Galore
[Figure labels: Reconfigurable Hardware; Cache; Logic]

Why RH: Computational Bandwidth
– CPU: fixed
– RH: “unbounded”

Using RH Today
[Figure labels: Application; Partition; C Program; HDL; Compiler; CAD; OS support; communication]

Computer System Tomorrow
– CPU: low-ILP computation + OS + VM
– RH: high-ILP computation
– Memory
– Tight coupling

This Work
– We suggest a high-level mechanism (not a policy).
[Figure labels: HLL Program; Partitioning; cc; CAD; CPU; RH; Memory]

Outline
– Motivation
– Interfacing RH & CPU
– Opportunities
– Conclusions

Premises
– RH is large
  – can implement large program fragments
– RH can access memory
  – does not require CPU support to access data
  – coherent memory view with CPU
– RH seen through clean abstraction
  – interface portability

Unit of Partitioning: Procedure
[Figure: program call-graph, with labels: library; leaves; recursive; hot spot; high ILP]

Production-Quality Software

int foo(….) {
    highly parallel computation;
    ….
    if (!r) {
        fprintf(stderr, "Unexpected input");
        return E_BADIN;
    }
    ….
}

Peering

Program:
a( ) { b( ); }
b( ) { c( ); }
c( ) { d( ); }
d( ) { }

[Figure: procedures a, b, c, d placed on the CPU and the RH]

Stubs
– marshalling, control transfer
– “RPC” between CPU and RH
[Figure labels: software procedure call; hardware dependent; procedures a, b, c, d; stubs b’, c’, d’]

Stubs

Program:
a( ) { r = b(b_args); }
b(b_args) { }

With b on the RH, the CPU side becomes:
a( ) { r = b’(b_args); }
b’(b_args) {
    send_rh(b_args);
    invoke_rh(b);
    r = receive_rh( );
    return r;
}

[Figure labels: CPU; RH; b]

Required Stubs
– 1 stub to call each RH procedure
– 1 stub for each procedure called by RH
[Figure labels: CPU; RH]

Compiling
[Figure: compilation flow, with labels: Program; Partitioning; policy; Procedures for CPU; Procedures for RH; Stubs; automatic; HLL to HDL; Synthesis; Configuration; Linker; Executable]

Outline
– Motivation
– Interfacing RH & CPU
– Opportunities
– Conclusions

Evaluation
– How much can be mapped to RH?
– SpecInt95 & Mediabench
– Partition strictly on procedure boundaries
– Limit RH to 10^6 bit-operations

Coverage

Program:
a( ) { b( ); }
b( ) { c( ); }
c( ) {}

Running time: 40%, 35%, 25% (total 100%)
On RH (Method 1, Method 2), in slide order: N N Y Y Y N
Totals: 40% (Method 1), 75% (Method 2)

Coverage

Program:
a( ) { b( ); }
b( ) { c( ); }
c( ) {}

Procedure   Running time   On RH (Method 1)   On RH (Method 2)
a           40%            N                  Y
b           35%            N                  N
c           25%            Y                  Y
Total       100%           25%                65%

Policies
– leaves on RH
– arbitrary
[Figure labels: RH; X; CPU]

RH Stack Models
– Locals in registers: f(x) { return x+1; }
– Locals statically allocated: f() { int local; g(&local); }
– Dynamic stack: f(x) { f(x+1); }

Potential RH Coverage: SpecINT95
[Chart: % running time; policies: leaves, CPU->RH, CPU->RH->CPU; stack models: no stack, static stack frames, dynamic stack]

Potential RH Coverage: Mediabench
[Chart: policies: leaves, CPU->RH, CPU->RH->CPU; stack models: no stack, static stack frames, dynamic stack]

Conclusions
– Stubs make RH/CPU interface transparent
– Stubs are automatically generated
– RH and CPU as peers
– RH/CPU interface: (remote) procedure call
– RPC used for control transfer (not data)
– Peering gives partitioner freedom

The End


Independent of b Dispatcher Stubs a( ) { r = b(b_args); } b(b_args) { if (x) c( ); return r; } c( ) { } Program b’(b_args) { send_rh(b_args); invoke_rh(b); while (1) { com = get_rh_command( ); if (! com) break; (*com)( ); } r = receive_rh( ); return r; } c’s stub

C’s Stub

Program:
a( ) { r = b(b_args); }
b(b_args) { if (x) c( ); return r; }
c( ) { }

c’( ) {
    receive_rh(c_args);
    r = c(c_args);
    send_rh(r);
    invoke_rh(return_to_rh);
}

Attempt 1
– Manual partitioning
– Interface: ad hoc
– Ex: OneChip, NAPA, PAM
– Advantage: huge speed-ups
– Problem: very hard work
[Figure labels: Program; RH]

Attempt 2
– Select small computations
– Interface: RH = functional unit
– Ex: PRISC, Chimaera
– Advantage: easy to automate
– Problem: low speed-up
[Figure labels: Program; operators +, >>, *]

Attempt 3
– Select loop body
– Deeply pipelined implementation
– No memory access
– Interface: I/O or Functional Unit or Coprocessor
– Ex: PipeRench
– Advantage: very high speed-up
– Problems:
  – cannot be automated
  – loop-carried dependences
  – few opportunities

Program:
while (b) { b[ j+5]; }

Attempt 4
– Select whole loop
– Pipelined implementation
– Autonomous memory access
– Interface: coprocessor
– Ex: GARP
– Advantage: many opportunities
– Problems:
  – complicated algorithm
  – requires exceptional loop exits

Program:
while (b) { if (error) printf("err"); a[x] = y; }