ASH: A Substrate for Scalable Architectures
  Mihai Budiu, Seth Copen Goldstein
  CALCM Seminar, March 19, 2002
Resources
CPU Problems
  Complexity
  Power
  Global signals
  Limited issue window => limited ILP
  We propose an architecture with none of these limits.
Outline
  Scalability
  Reconfigurable hardware advantages
  A hybrid RH + CPU architecture
  CPU and RH as peers
  Application Specific Hardware
Computational Bandwidth
  CPU: FU * clock freq
  RH: unbounded
  [Diagram: operators *, +, /; statements a=a+b, b=b+c]
Registers
  CPU: fixed (eax, ebx, ecx, edx); variables i, j, k, l, m; spill to sp[0]
  RH: unbounded; variables i, j, k, l, m
Register Bandwidth
  CPU: fixed (ports R1, R2, R3, W1, W2)
  RH: unbounded
Out-of-Order Execution
  CPU: Fetch, Decode, Dispatch, Execute, Commit; in-order; limited by window
  RH: compiler’s window is unbounded
Hybrid System: CPU + RH
  CPU: low ILP + OS + VM; generic
  RH: high ILP; application-specific
  Tight coupling
  [Diagram: CPU, RH, Memory]
Problem
  [Diagram: HLL program, Compiler, CPU, RH, Memory]
Our Solution
  General: applicable to today’s software
  Automatic: compiler-driven [RISC approach]
  Scalable: with clock, hardware and program size
  Parallelism: exploit application parallelism
    – bit-level
    – ILP
    – pipeline
    – loop-level
Peering
  Program:
    a( ) { b( ); }
    b( ) { c( ); }
    c( ) { d( ); }
    d( ) { }
  [Diagram: CPU, RH; procedures a, b, c, d]
RH “RPC”
  [Diagram: CPU; procedures a, b, c, d; stubs b’, c’, d’. Labels: software procedure call; marshalling, control transfer; hardware dependent]
  Stubs built automatically.
Stub Synthesis
  [Toolflow diagram labels: Procedures for RH; RH Compiler; Procedures for CPU; Program; Partitioning; Stubs; Configuration; Linker; Executable]
Application-Specific Hardware
  [Diagram labels: Reconfigurable hardware; HLL program; Compiler; Circuit]
  [Diagram labels: HLL program; CPU; RH; Memory; Compiler]
CASH: Compiling for ASH
  [Diagram: C program, RH. Labels: Memory partitioning; Interconnection net; Circuits]
Asynchronous Computation
  [Diagram: a + operator with data, ready and ack signals]
  Can extend to locally synchronous, globally asynchronous.
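The data/ready/ack signalling named on this slide can be illustrated in software. The following is a minimal sketch, not the ASH implementation: the `channel` type and the `chan_send`, `chan_take`, `adder_step` and `async_add` names are hypothetical, and a real circuit would use wires rather than struct fields.

```c
#include <stdbool.h>

/* Hypothetical model of one connection: a data value plus ready/ack flags. */
typedef struct {
    int  data;
    bool ready;  /* set by the producer when data is valid */
    bool ack;    /* set by the consumer when data has been taken */
} channel;

/* Producer side: publish a value if the channel is free. */
static bool chan_send(channel *c, int value) {
    if (c->ready) return false;   /* previous value not yet consumed */
    c->data = value;
    c->ready = true;
    c->ack = false;
    return true;
}

/* Consumer side: take the value and acknowledge it. */
static int chan_take(channel *c) {
    int v = c->data;
    c->ready = false;
    c->ack = true;
    return v;
}

/* The adder fires only when both inputs are ready and its output is free.
   No clock is involved; progress depends on the flags alone. */
static bool adder_step(channel *a, channel *b, channel *out) {
    if (!a->ready || !b->ready || out->ready) return false;
    int lhs = chan_take(a);
    int rhs = chan_take(b);
    return chan_send(out, lhs + rhs);
}

/* Drives x and y through the adder, with the second input arriving late. */
static int async_add(int x, int y) {
    channel a = {0, false, false};
    channel b = {0, false, false};
    channel out = {0, false, false};
    chan_send(&a, x);
    adder_step(&a, &b, &out);     /* does nothing: b is not ready yet */
    chan_send(&b, y);
    adder_step(&a, &b, &out);     /* fires now */
    return chan_take(&out);
}
```

Calling `async_add(2, 3)` yields 5 regardless of how many idle steps happen before the second input arrives, which is the latency tolerance the handshake is meant to provide.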
Dataflow Graphs
    int plus(int x, int y) { return x + y; }
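As a rough illustration of what a dataflow graph for `plus` could look like as a data structure, here is a small sketch. The `df_node` layout, the `df_eval` routine and the `plus_dataflow` wrapper are assumptions made for this example and are not taken from the CASH compiler.

```c
/* Hypothetical node kinds: a graph input or a two-input adder. */
typedef enum { DF_INPUT, DF_ADD } df_op;

typedef struct {
    df_op op;
    int lhs, rhs;   /* indices of the producer nodes (used by DF_ADD only) */
    int value;      /* input value, or the computed result */
} df_node;

/* Evaluates nodes in index order. This is valid here because every node
   refers only to nodes with a smaller index. Returns the last node's value. */
static int df_eval(df_node *g, int n) {
    for (int i = 0; i < n; i++) {
        if (g[i].op == DF_ADD)
            g[i].value = g[g[i].lhs].value + g[g[i].rhs].value;
    }
    return g[n - 1].value;
}

/* Graph for: int plus(int x, int y) { return x + y; } */
static int plus_dataflow(int x, int y) {
    df_node g[3] = {
        { DF_INPUT, -1, -1, x },   /* node 0: x */
        { DF_INPUT, -1, -1, y },   /* node 1: y */
        { DF_ADD,    0,  1, 0 },   /* node 2: x + y, also the return value */
    };
    return df_eval(g, 3);
}
```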
From Control Flow to Data Flow
Conditionals = Speculation
    int cond(int p, int x, int y) {
      int z;
      if (p) z = x;
      else z = y;
      return z;
    }
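One way to read "conditionals = speculation" is that both sides of the `if` are computed unconditionally and a multiplexor picks the result. The sketch below rewrites `cond` in that style; the `mux` helper and the `cond_speculative` name are hypothetical, and the actual circuit produced by the compiler may differ.

```c
/* Selects one of two already-computed values using the predicate. */
static int mux(int p, int if_true, int if_false) {
    return p ? if_true : if_false;
}

/* Same result as cond(), but with no control flow before the selection:
   both candidate values of z exist before the predicate is consulted. */
static int cond_speculative(int p, int x, int y) {
    int z_then = x;   /* value z would get on the "then" side */
    int z_else = y;   /* value z would get on the "else" side */
    return mux(p, z_then, z_else);
}
```

This is safe here because neither side has side effects; operations that can fault or write memory would need extra care.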
Critical Paths
    if (x > 0) y = -x;
    else y = b*x;
  [Dataflow graph: inputs x, b, 0; operators *, -, >, !; output y]
Executing Lenient Operators
    if (x > 0) y = -x;
    else y = b*x;
  [Dataflow graph: inputs x, b, 0; operators *, -, >, !; output y]
  Up to 40% performance improvement.
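The following sketch contrasts a strict multiplexor with a lenient one, assuming "lenient" means that an operator may produce its output as soon as the inputs that determine the result have arrived. The `tok` type and all function names are hypothetical; the example only models availability of values, not timing.

```c
#include <stdbool.h>

/* A value together with a flag saying whether it has arrived yet. */
typedef struct { int value; bool valid; } tok;

/* Strict: waits for the predicate and BOTH data inputs. */
static tok strict_mux(tok p, tok t, tok f) {
    tok out = {0, false};
    if (p.valid && t.valid && f.valid) {
        out.value = p.value ? t.value : f.value;
        out.valid = true;
    }
    return out;
}

/* Lenient: waits for the predicate and only the SELECTED data input. */
static tok lenient_mux(tok p, tok t, tok f) {
    tok out = {0, false};
    if (!p.valid) return out;
    tok selected = p.value ? t : f;
    if (selected.valid) out = selected;
    return out;
}

/* Scenario from the slide, y = (x > 0) ? -x : b*x, sampled at a moment
   when the comparison and negation are done but the multiply is not. */
static tok sample_lenient(int x) {
    tok p   = {x > 0, true};
    tok neg = {-x, true};
    tok mul = {0, false};        /* b*x still in flight */
    return lenient_mux(p, neg, mul);
}

static tok sample_strict(int x) {
    tok p   = {x > 0, true};
    tok neg = {-x, true};
    tok mul = {0, false};        /* b*x still in flight */
    return strict_mux(p, neg, mul);
}
```

For a positive `x` the lenient version already has `y` available while the strict version is still waiting on the multiplier, so the slow multiply drops off the critical path for that case.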
Pipelining
  Pipelined  Cycles
  N          903
  Y          653
Loop Pipelining
  Pipe  FIFO  Cycles
  N     0     903
  N     1
  Y     0     653
  Y     1     474
  Y     2     408
  Y     3
ASH Features
  What you code is what you get
    – no hidden control logic
    – really lean hardware (no CAM, decoders, multiported register files, etc.)
  Compiler has complete control
  Dynamic scheduling => latency tolerant
  Naturally exploits ILP, even across loop iterations
Conclusions
  ASH = compiler-synthesized hardware
  ASH matches program parallelism
  Dynamically scheduled RH
  ASH scales with
    – clock frequency
    – transistors
    – program size
Backup Slides
Reconfigurable Hardware
  Universal gates and/or storage elements
  Interconnection network
  Programmable switches
Main RH Ingredient: RAM Cell
  Universal gate = RAM
  Switch controlled by a 1-bit RAM cell
  [Diagram labels: a0, a1; data; a1 & a2; 0; data in; control]
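To make "universal gate = RAM" concrete, here is a minimal sketch of a 2-input gate modeled as a 4-entry, 1-bit-wide memory, plus a switch driven by one configuration bit. The function names, the bit ordering of the configuration word and the `undriven` parameter are assumptions chosen for this example.

```c
#include <stdint.h>

/* 2-input lookup table: the inputs form an address into a 4-bit RAM.
   Bit i of config holds the output for the input pair (a1, a0) == i. */
static unsigned lut2(uint8_t config, unsigned a1, unsigned a0) {
    unsigned addr = ((a1 & 1u) << 1) | (a0 & 1u);
    return (config >> addr) & 1u;
}

/* Programmable switch: a 1-bit RAM cell decides whether data_in is
   passed through. The undriven argument models the disconnected case. */
static int prog_switch(unsigned control_bit, int data_in, int undriven) {
    return control_bit ? data_in : undriven;
}

/* Example configurations under the bit ordering assumed above. */
enum {
    LUT_AND = 0x8,   /* output 1 only for address 3 (a1=1, a0=1) */
    LUT_OR  = 0xE,   /* output 1 for addresses 1, 2, 3 */
    LUT_XOR = 0x6    /* output 1 for addresses 1, 2 */
};
```

Changing the function of the gate means rewriting the four stored bits, for example from `LUT_AND` to `LUT_XOR`, without touching any wiring.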
Stubs
  Program:
    a( ) { r = b(b_args); }
    b(b_args) { }
  After moving b to the RH:
    a( ) { r = b’(b_args); }
    b’(b_args) {
      send_rh(b_args);
      invoke_rh(b);
      r = receive_rh( );
      return r;
    }
Dispatcher Stubs
  Program:
    a( ) { r = b(b_args); }
    b(b_args) { if (x) c( ); return r; }
    c( ) { }
  Stub for b:
    b’(b_args) {
      send_rh(b_args);
      invoke_rh(b);
      while (1) {
        com = get_rh_command( );
        if (! com) break;
        (*com)( );
      }
      r = receive_rh( );
      return r;
    }
  Annotations: “Independent of b”; “c’s stub”
C’s Stub
  Program:
    a( ) { r = b(b_args); }
    b(b_args) { if (x) c( ); return r; }
    c( ) { }
  Stub for c:
    c’( ) {
      receive_rh(c_args);
      r = c(c_args);
      send_rh(r);
      invoke_rh(return_to_rh);
    }
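The dispatcher stub for b and the stub for c can be exercised together in plain C by replacing the hardware with a small mock. In this sketch everything that stands in for the RH (the `phase` state machine, the one-word `mailbox`, and the behaviour of b as "return c(arg) + 1 if arg > 0, else 0") is invented for illustration; only the overall shape of `b_stub` and `c_stub` follows the slides.

```c
#include <stddef.h>

typedef void (*rh_command)(void);

/* Mock RH state: a hypothetical stand-in for the real hardware. */
enum rh_phase { RH_IDLE, RH_RUNNING_B, RH_WAITING_FOR_C, RH_C_RETURNED };
static enum rh_phase phase = RH_IDLE;
static int mailbox;   /* one-word channel shared by the CPU and the mock RH */
static int b_arg;     /* argument latched when b is invoked */

static void send_rh(int v)   { mailbox = v; }
static int  receive_rh(void) { return mailbox; }

/* c stays on the CPU; the RH reaches it through c_stub. */
static int c(int x) { return 2 * x; }

static void c_stub(void) {
    int arg = receive_rh();   /* unmarshal c's argument */
    int r = c(arg);           /* ordinary software procedure call */
    send_rh(r);               /* marshal the result */
    phase = RH_C_RETURNED;    /* plays the role of invoke_rh(return_to_rh) */
}

/* Plays the role of invoke_rh(b): start b on the mock RH. */
static void invoke_rh_b(void) {
    b_arg = receive_rh();
    phase = RH_RUNNING_B;
}

/* Advances the mock RH by one step. Returns the CPU stub the RH wants
   to run next, or NULL once b has finished and its result is in the mailbox. */
static rh_command get_rh_command(void) {
    if (phase == RH_RUNNING_B && b_arg > 0) {
        send_rh(b_arg);                 /* RH passes c's argument */
        phase = RH_WAITING_FOR_C;
        return c_stub;
    }
    if (phase == RH_C_RETURNED)
        send_rh(receive_rh() + 1);      /* b's result: c(b_arg) + 1 */
    else
        send_rh(0);                     /* b did not call c */
    phase = RH_IDLE;
    return NULL;
}

/* Dispatcher stub for b: the loop does not depend on what b calls. */
static int b_stub(int arg) {
    send_rh(arg);
    invoke_rh_b();
    for (;;) {
        rh_command com = get_rh_command();
        if (!com) break;
        com();
    }
    return receive_rh();
}
```

With this mock, `b_stub(5)` goes through one round trip to `c_stub` and returns 11, while `b_stub(0)` returns 0 without leaving the dispatcher loop more than once.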
Input to Output
    int io(int x) { return x; }
Loops
    int loop() {
      int w = 10;
      while (w > 0) w--;
      return w;
    }
Pointers and Arrays
    int a[10];
    void pointer(int *p) {
      a[2] += a[4] + *p;
    }
Pointers and Loops
    int sum() {
      int s = 0;
      int i;
      for (i=0; i < 10; i++) s += a[i];
      return s;
    }