Compiling Application-Specific Hardware
Mihai Budiu and Seth Copen Goldstein
Carnegie Mellon University
Resources
Problems
- Complexity
- Power
- Global signals
- Limited issue window => limited ILP
We propose a scalable architecture.
Outline
- Introduction
- ASH: Application-Specific Hardware
- Compiling for ASH
- Conclusions
Application-Specific Hardware
C program -> Compiler -> Dataflow IR -> Reconfigurable hardware
Our Solution
- General: applicable to today's software
  – programming languages
  – applications
- Automatic: compiler-driven
- Scalable:
  – run-time: with clock, hardware
  – compile-time: with program size
- Parallelism: exploit application parallelism
Asynchronous Computation
Each operation (for example, an adder) communicates through a handshake: data and valid signals from the producer, an ack signal from the consumer.
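A minimal software sketch of this handshake (my illustration, not the paper's hardware): a producer asserts valid alongside its data, and a two-input adder fires only once both operands are valid, acknowledging its producers.

    #include <stdbool.h>

    /* One point-to-point link carrying data plus handshake bits. */
    typedef struct {
        int  data;
        bool valid;   /* producer: "data is ready"     */
        bool ack;     /* consumer: "data was consumed" */
    } Channel;

    /* Producer side: offer a value only when the channel is free. */
    static bool send(Channel *c, int v) {
        if (c->valid) return false;   /* previous value not yet acked */
        c->data  = v;
        c->valid = true;
        return true;
    }

    /* Consumer side: an adder fires only when both inputs are valid. */
    static bool add_fire(Channel *a, Channel *b, Channel *out) {
        if (!a->valid || !b->valid || out->valid) return false;
        out->data  = a->data + b->data;
        out->valid = true;
        a->ack   = b->ack   = true;    /* acknowledge both producers */
        a->valid = b->valid = false;
        return true;
    }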
New
- Entire C applications
- Dynamically scheduled circuits
- Custom dataflow machines
  – application-specific
  – direct execution (no interpretation)
  – spatial computation
Outline
- Scalability
- Application-Specific Hardware
- CASH: Compiling for ASH
- Conclusions
CASH: Compiling for ASH
The compiler maps a C program onto reconfigurable hardware (RH), producing circuits, a memory partitioning, and an interconnection net.
Primitives
- Arithmetic/logic
- Multiplexors (data inputs selected by predicates)
- Merge
- Eta (gateway): forwards data under a predicate
- Memory (load/store)
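As a rough illustration (not the actual CASH intermediate representation), these primitives could be represented in a compiler IR along these lines:

    /* Hypothetical node kinds for a dataflow IR; names are mine. */
    typedef enum {
        OP_ARITH,   /* arithmetic/logic: +, -, *, &, ...              */
        OP_MUX,     /* multiplexor: picks one data input by predicate */
        OP_MERGE,   /* merge: forwards whichever input arrives        */
        OP_ETA,     /* eta (gateway): passes data only if predicate   */
        OP_LOAD,    /* memory read                                    */
        OP_STORE    /* memory write                                   */
    } NodeKind;

    typedef struct Node {
        NodeKind     kind;
        struct Node *inputs[4];   /* data and predicate operands */
        int          num_inputs;
    } Node;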
Forward Branches
if (x > 0) y = -x; else y = b*x;
Both arms are evaluated and a decoded mux selects the result.
Conditionals => speculation
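A sketch of the straight-line, speculative form of this conditional (variable names t_then and t_else are mine):

    int forward_branch(int x, int b) {
        int t_then = -x;               /* speculative "then" arm          */
        int t_else = b * x;            /* speculative "else" arm          */
        int p = (x > 0);               /* branch condition as a predicate */
        int y = p ? t_then : t_else;   /* decoded mux selects the result  */
        return y;
    }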
Critical Paths
if (x > 0) y = -x; else y = b*x;
The two arms have unbalanced latencies: the multiply is the long path.
Lenient Operations
if (x > 0) y = -x; else y = b*x;
Lenient operations solve the problem of unbalanced paths: the mux produces its output as soon as the predicate and the selected input arrive.
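A software analogy of a lenient select (my sketch, assuming the mux only needs the chosen arm): the unselected operand is never waited for, modeled here by evaluating only the selected arm.

    typedef int (*arm_fn)(int x, int b);
    static int then_arm(int x, int b) { (void)b; return -x; }
    static int else_arm(int x, int b) { return b * x; }

    int lenient_select(int x, int b) {
        arm_fn arms[2] = { else_arm, then_arm };  /* [0] = else, [1] = then */
        int p = (x > 0);
        return arms[p](x, b);   /* result available without the unused arm */
    }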
Loops
int sum = 0, i;
for (i = 0; i < 100; i++) sum += i*i;
return sum;
Control flow => data flow
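A rough C rendering of how the loop looks once control flow becomes data flow (the merge/eta structure is paraphrased in comments; this is my sketch, not the compiler's output):

    int sum_of_squares(void) {
        int i = 0, sum = 0;        /* initial tokens entering the merge nodes */
        for (;;) {
            int p = (i < 100);     /* loop predicate                          */
            if (!p)
                return sum;        /* eta to the exit: emit the result        */
            sum = sum + i * i;     /* sum's loop                              */
            i   = i + 1;           /* i's loop                                */
            /* etas steer the new i and sum back to the merges */
        }
    }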
Compilation
- Translate C to dataflow machines
- Optimizations: software-, hardware-, and dataflow-specific
- Expose parallelism:
  – predication
  – speculation
  – localized synchronization
  – pipelining
Pipelining
The same sum-of-squares loop, with the multiplier implemented as a pipelined, long-latency unit. The dataflow graph contains two coupled loops: i's loop and sum's loop.
The predicate ack edge is on the critical path: i's loop must wait for sum's loop to acknowledge the loop predicate before starting the next iteration.
Inserting a decoupling FIFO on the predicate takes that edge off the critical path, so i's loop can run ahead and keep the long-latency multiplier pipe full.
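A software analogy of the decoupling FIFO (my sketch; FIFO_DEPTH and the polling loop are illustrative, not the hardware): the index-generating loop runs ahead through a bounded FIFO while the multiply/accumulate loop drains it.

    #define FIFO_DEPTH 8

    int pipelined_sum(void) {
        int fifo[FIFO_DEPTH];
        int head = 0, tail = 0, count = 0;
        int i = 0, sum = 0, consumed = 0;

        while (consumed < 100) {
            /* producer: i's loop runs ahead while the FIFO has room */
            if (i < 100 && count < FIFO_DEPTH) {
                fifo[tail] = i;
                tail = (tail + 1) % FIFO_DEPTH;
                count++;
                i++;
            }
            /* consumer: sum's loop drains the FIFO through the long pipe */
            if (count > 0) {
                int v = fifo[head];
                head = (head + 1) % FIFO_DEPTH;
                count--;
                sum += v * v;          /* long-latency multiply */
                consumed++;
            }
        }
        return sum;
    }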
ASH Features
- What you code is what you get
  – no hidden control logic
  – lean hardware (no CAMs, multi-ported register files, etc.)
  – no global signals
- Compiler has complete control
- Dynamic scheduling => latency tolerant
- Natural ILP and loop pipelining
Conclusions
- ASH: compiler-synthesized hardware from HLL
- Exposes program parallelism
- Dataflow techniques applied to hardware
- ASH promises to scale with:
  – circuit speed
  – transistors
  – program size
Backup slides
- Hyperblocks
- Predication
- Speculation
- Memory access
- Procedure calls
- Recursive calls
- Resources
- Performance
Hyperblocks
A procedure is partitioned into hyperblocks.
Predication
if (p) ... executes under predicate p; if (!p) ... executes under predicate !p. Both are merged into a single hyperblock.
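A small sketch of if-conversion (names are illustrative): the branch becomes a predicated select, so both statements can live in one hyperblock.

    int if_converted(int p, int a, int b) {
        /* original control flow:
         *   if (p)  q = a;
         *   if (!p) q = b;                       */
        int q = p ? a : b;   /* predicated select, no branches */
        return q;
    }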
Speculation
q is computed speculatively, before the guarding if (!p) ... test; operations with side-effects cannot be speculated.
Memory Access
Loads carry an address, a predicate, and a token and return data; stores carry an address, a predicate, a token, and data. Both reach memory through a load-store queue and the interconnection network.
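A sketch of the request format these ports suggest (field names are mine, not from the paper):

    /* One entry flowing into the load-store queue. */
    typedef struct {
        int       is_store;   /* load or store                        */
        unsigned  address;    /* target address                       */
        int       predicate;  /* operation only takes effect if true  */
        unsigned  token;      /* ordering token for the LSQ           */
        int       data;       /* store data (stores) / result (loads) */
    } MemRequest;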
Procedure calls
The caller sends call P with its args over the interconnection network; procedure P extracts the args and returns the result to the caller.
Recursion
At a recursive call, the hyperblock saves its live values on a stack and restores them when the call returns.
Resources
- Estimates for SpecINT95 and Mediabench
- Average < 100 bit-operations per line of code
- Routing resources are harder to estimate
- Detailed data in the paper
Performance
- Preliminary comparison with a 4-wide out-of-order (OOO) processor
- Assumed the same functional-unit latencies
- Speed-up on kernels from Mediabench