Compiling Application-Specific Hardware
  Mihai Budiu, Seth Copen Goldstein
  Carnegie Mellon University
Resources
Problems
  Complexity
  Power
  Global Signals
  Limited issue window => limited ILP
  We propose a scalable architecture
Outline
  Introduction
  ASH: Application Specific Hardware
  Compiling for ASH
  Conclusions
Application-Specific Hardware
  [Figure: C program, Compiler, Dataflow IR, Reconfigurable hardware]
Our Solution
  General: applicable to today’s software
    - programming languages
    - applications
  Automatic: compiler-driven
  Scalable:
    - run-time: with clock, hardware
    - compile-time: with program size
  Parallelism: exploit application parallelism
Asynchronous Computation
  [Figure: a + node with data, valid, and ack signals]
New
  Entire C applications
  Dynamically scheduled circuits
  Custom dataflow machines
    - application-specific
    - direct execution (no interpretation)
    - spatial computation
Outline
  Scalability
  Application Specific Hardware
  CASH: Compiling for ASH
  Conclusions
CASH: Compiling for ASH
  [Figure labels: C Program, Circuits, Memory partitioning, Interconnection net, RH]
Primitives
  Arithmetic/logic (e.g. +)
  Multiplexors (data, predicates)
  Merge
  Eta (gateway) (data, predicate)
  Memory (ld, st)
Forward Branches
    if (x > 0)
        y = -x;
    else
        y = b*x;
  [Figure: dataflow graph with inputs x, b, 0 and output y; nodes *, !, -, >; a decoded mux selects the result]
  Conditionals => Speculation
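The forward-branch slide can be read as follows: both arms of the conditional are computed, and a decoded mux picks one using one predicate per data input. The C sketch below models that reading in software. The function names `decoded_mux2` and `forward_branch` are hypothetical and chosen only for illustration; this is not the CASH compiler's actual output.

```c
#include <assert.h>

/* Decoded mux (assumed semantics): one predicate per data input,
 * and exactly one predicate is true when the mux fires. */
static int decoded_mux2(int p0, int d0, int p1, int d1) {
    assert((p0 != 0) != (p1 != 0));  /* exactly one input is selected */
    return p0 ? d0 : d1;
}

/* Dataflow-style form of: if (x > 0) y = -x; else y = b*x; */
int forward_branch(int x, int b) {
    int p   = (x > 0);  /* branch predicate */
    int neg = -x;       /* "then" arm, computed speculatively */
    int mul = b * x;    /* "else" arm, computed speculatively */
    return decoded_mux2(p, neg, !p, mul);
}
```

For example, `forward_branch(3, 5)` selects the negation arm and `forward_branch(-2, 5)` selects the multiplication arm, even though both arms were evaluated.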
Critical Paths
    if (x > 0)
        y = -x;
    else
        y = b*x;
  [Figure: dataflow graph with inputs x, b, 0 and output y; nodes *, !, -, >]
Lenient Operations
    if (x > 0)
        y = -x;
    else
        y = b*x;
  [Figure: dataflow graph with inputs x, b, 0 and output y; nodes *, !, -, >]
  Solve the problem of unbalanced paths
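One way to see why lenient operations help with unbalanced paths is a small timing model: a strict mux waits for every input, while a lenient mux can fire once the predicate and the selected input are available. The sketch below is a software model under assumed latencies (1 time unit for the comparison and the negation, 3 for the multiply). The `Signal` type and all function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    int value;     /* data carried by the wire */
    int ready_at;  /* assumed time at which the value becomes valid */
} Signal;

static int max2(int a, int b) { return a > b ? a : b; }

/* Strict mux: fires only after the predicate and both data inputs arrive. */
Signal strict_mux(Signal p, Signal t, Signal f) {
    Signal out;
    assert(p.value == 0 || p.value == 1);
    out.value = p.value ? t.value : f.value;
    out.ready_at = max2(p.ready_at, max2(t.ready_at, f.ready_at));
    return out;
}

/* Lenient mux: fires once the predicate and the selected input arrive. */
Signal lenient_mux(Signal p, Signal t, Signal f) {
    Signal sel, out;
    assert(p.value == 0 || p.value == 1);
    sel = p.value ? t : f;
    out.value = sel.value;
    out.ready_at = max2(p.ready_at, sel.ready_at);
    return out;
}

/* if (x > 0) y = -x; else y = b*x; with made-up latencies. */
Signal branch(int x, int b, int lenient) {
    Signal p   = { x > 0, 1 };  /* comparison: 1 time unit (assumed) */
    Signal neg = { -x,    1 };  /* negation:   1 time unit (assumed) */
    Signal mul = { b * x, 3 };  /* multiply:   3 time units (assumed) */
    return lenient ? lenient_mux(p, neg, mul) : strict_mux(p, neg, mul);
}
```

In this model, when `x > 0` the lenient mux produces `y` at time 1 instead of waiting until time 3 for the unused multiplication; when the multiply arm is selected, both variants finish at time 3.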
Loops
    int sum=0, i;
    for (i=0; i < 100; i++)
        sum += i*i;
    return sum;
  [Figure: dataflow graph of the loop with i, +1, <, *, +, sum, 0, !, and ret nodes]
  Control flow => data flow
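The "control flow => data flow" idea for this loop can be sketched in software: the loop-carried values `i` and `sum` circulate, and eta (gateway) nodes steer them either back into the loop or out to the return, depending on the loop predicate. The helper `eta` and the function `sum_of_squares` are hypothetical names, and the trip count is a parameter here rather than the constant 100 from the slide.

```c
#include <assert.h>

/* Eta (gateway), assumed semantics: forwards the value only when the
 * predicate is true; reports whether it fired. */
static int eta(int pred, int value, int *out) {
    if (pred) *out = value;
    return pred;
}

/* Dataflow-style version of:
 *   int sum=0, i; for (i=0; i < n; i++) sum += i*i; return sum; */
int sum_of_squares(int n) {
    int i = 0, sum = 0;  /* initial values entering the loop */
    int result = 0;
    assert(n >= 0);
    for (;;) {
        int p        = (i < n);      /* loop predicate */
        int next_i   = i + 1;        /* i's loop */
        int next_sum = sum + i * i;  /* sum's loop */
        /* Exit eta: releases sum to the return when the loop ends. */
        if (eta(!p, sum, &result)) return result;
        /* Back-edge etas: feed the next iteration. */
        eta(p, next_i, &i);
        eta(p, next_sum, &sum);
    }
}
```

With `n = 100`, as on the slide, this returns 328350, the sum of the squares of 0 through 99.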
Compilation
  Translate C to dataflow machines
  Optimizations: software-, hardware-, dataflow-specific
  Expose parallelism
    – predication
    – speculation
    – localized synchronization
    – pipelining
Pipelining
  [Figure sequence (animation frames of one slide): loop dataflow graph with nodes i, +, <=, *, +, sum; the * node is a pipelined multiplier]
  Annotations: i’s loop, sum’s loop, long latency pipe, predicate
  Predicate ack edge is on the critical path.
  Later frames add a decoupling FIFO and mark the critical path again.
ASH Features
  What you code is what you get
    – no hidden control logic
    – lean hardware (no CAM, multi-ported files, etc.)
    – no global signals
  Compiler has complete control
  Dynamic scheduling => latency tolerant
  Natural ILP and loop pipelining
Conclusions
  ASH: compiler-synthesized hardware from HLL
  Exposes program parallelism
  Dataflow techniques applied to hardware
  ASH promises to scale with:
    – circuit speed
    – transistors
    – program size
Backup slides
  Hyperblocks
  Predication
  Speculation
  Memory access
  Procedure calls
  Recursive calls
  Resources
  Performance
Hyperblocks
  [Figure label: Procedure]
Predication
  [Figure labels: hyperblock; p; !p; q if (p); q if (!p)]
Speculation
  [Figure labels: q if (!p); q; ops w/ side-effects]
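One reading of the speculation slide is that an operation guarded by a predicate can run unconditionally when it has no side effects, while operations with side effects keep their predicate. The C sketch below illustrates that reading only; the function `guarded_store` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Source form (assumed): if (!p) { t = a * b; *dst = t; } */
void guarded_store(int p, int a, int b, int *dst) {
    int q, t;
    assert(dst != NULL);
    q = !p;           /* predicate guarding the block */
    t = a * b;        /* no side effect: executed speculatively */
    if (q) *dst = t;  /* side effect: the store stays predicated */
}
```

The multiplication always executes, so it is off the predicate's critical path, but memory is only written when the predicate holds.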
Memory Access
  [Figure labels: load (address, predicate, token, data); store (address, pred, token, data); Load-store queue; Interconnection network; Memory]
Procedure calls
  [Figure labels: caller; call P; args; Interconnection network; Procedure P; Extract args; ret; result]
Recursion
  [Figure labels: hyperblock; save live values; recursive call; restore live values; stack]
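The save/restore pattern named on the recursion slide can be illustrated with an explicit stack: values that are live across the recursive call are pushed before the call and popped after it returns. This is a software sketch only. The stack size, the helpers `save` and `restore`, and the factorial example are all assumptions for illustration.

```c
#include <assert.h>

#define STACK_MAX 64  /* arbitrary capacity for this sketch */

static int stack[STACK_MAX];
static int sp = 0;

/* Push a live value before a recursive call. */
static void save(int v) {
    assert(sp < STACK_MAX);
    stack[sp++] = v;
}

/* Pop a live value after the recursive call returns. */
static int restore(void) {
    assert(sp > 0);
    return stack[--sp];
}

/* fact(n) = n * fact(n-1): n is live across the recursive call. */
int fact(int n) {
    int r;
    if (n <= 1) return 1;
    save(n);          /* save live values */
    r = fact(n - 1);  /* recursive call */
    n = restore();    /* restore live values */
    return n * r;
}
```

In C the call stack would preserve `n` anyway; the explicit `save` and `restore` calls only make visible the step that the slide attributes to the hardware stack.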
Resources
  Estimated SpecINT95 and Mediabench
  Average < 100 bit-operations/line of code
  Routing resources harder to estimate
  Detailed data in paper
Performance
  Preliminary comparison with 4-wide OOO
  Assumed same FU latencies
  Speed-up on kernels from Mediabench