1
WaveScalar
Swanson et al.
Presented by Andrew Waterman, ECE259, Spring 2008
2
Why Dataflow?
“[...] as wire delays grow relative to gate delays, improvements in clock rate and IPC become directly antagonistic” [Agarwal00]
Large bypass networks, highly associative structures especially problematic
Can only ameliorate somewhat in superscalar designs (21264 clustering, WIB, etc.)
Shorter wires, smaller loads => higher fclk possible with point-to-point networks, decentralized structures
3
Dataflow Locality
Def: predictability of instruction dependencies
3/5 source operands come from the most recent producer
Completely ignored by most superscalars
Over-general design: large bypass networks, regular references to a huge PRF, ...
Partial exceptions: clustering, hierarchical RFs
P4: only 1 cycle of 31 stages devoted to execution
Can exploit dataflow locality to greatly cheapen communication
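To make "most recent producer" concrete, here is a tiny C fragment of our own (not from the paper or the slides):

#include <stdio.h>

/* Illustration of dataflow locality: each temporary is consumed by the
 * very next instruction, so in a dataflow machine its value could move
 * over a short point-to-point link rather than a broadcast bypass
 * network or a central register file. */
int f(int a, int b, int c) {
    int t1 = a + b;     /* t1 produced here...                      */
    int t2 = t1 * c;    /* ...and consumed by its nearest neighbor  */
    return t2 - a;      /* reuse of 'a' is the rarer, costlier case */
}

int main(void) {
    printf("%d\n", f(1, 2, 3));   /* (1 + 2) * 3 - 1 = 8 */
    return 0;
}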
4
The von Neumann abstraction
Elegant as it is, the von Neumann execution model is inherently sequential
Control dependencies limit exploitable ILP considerably
P4 again: 20-stage (!) branch misprediction loop
Store/load aliasing hurts, too
5
Why Not Dataflow?
Dataflow architectures may scale further, but...
Who the hell wants to write a program in Id?
For commercial adoption and future sanity, must support von Neumann memory semantics
...but ideally without fetch serialization
6
Enter WaveScalar
WaveScalar: dataflow's new groove
Enabled by process improvements: can integrate 2^N processing elements (PEs) plus nearby storage on-die
“Cache-only” architecture (not in the COMA sense)
Provides total load/store ordering
Can be programmed conventionally
...without a program counter
7
WaveScalar ISA
WaveScalar binary encodes the DFG
ISA is RISCy, plus a few new primitives
Control flow: the ɸ insn implements the C ternary operator (similar to predication)
The ɸ-1 insn conditionally sends data to one PE or another based upon a boolean selector value
Indirect-Send(arg, addr, offset) insn implements indirect jumps, calls, and returns
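A minimal C sketch of the idiom these primitives replace; the comments mapping it onto ɸ and ɸ-1 are illustrative, not the paper's exact encoding:

#include <stdio.h>

/* The ternary below is what a single phi instruction computes: it takes
 * both candidate values plus a predicate and selects one.  A phi-inverse
 * works the other way around: given a value and a predicate, it steers
 * the value to one of two consumers, so only the "taken" side of the
 * branch ever receives its inputs and fires. */
static int select_max(int a, int b) {
    return (a > b) ? a : b;   /* phi: predicate (a > b) picks a or b */
}

int main(void) {
    printf("%d\n", select_max(3, 7));   /* prints 7 */
    return 0;
}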
8
WaveScalar ISA: Waves
Wave === connected DAG, subset of the DFG
Can span multiple hyperblocks iff each insn is executed at most once (no loops)
Easily elongated with unrolling
To disambiguate which dynamic insn a PE is executing, data values carry a wave number
Wave numbers are incremented by the Wave-Advance insn
Wave number assignment is not centralized!
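A hedged sketch of how a loop maps onto waves; the comments marking where a Wave-Advance would go are our reading of the scheme, not the paper's compiler output:

#include <stdio.h>

/* Each trip through the loop body is one wave: an acyclic slice of the
 * dataflow graph in which every instruction fires at most once.  On the
 * back edge, Wave-Advance insns would bump the wave number carried by i
 * and sum, so tokens from iteration k never match tokens from k+1. */
int sum_to(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {   /* back edge == wave boundary  */
        sum += i;                   /* body executes once per wave */
    }
    return sum;
}

int main(void) {
    printf("%d\n", sum_to(10));   /* prints 45 */
    return 0;
}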
9
WaveScalar ISA: Memory Ordering
Wave-ordered memory
Where possible, each mem op is labeled with its location within its wave: <predecessor, this, successor>
Control flow may prohibit this; when unknown, '?' is used as the label
Rule: no op with '?' in its successor field may connect to an op with '?' in its predecessor field
Solution: memory-nops
Result: memory has enough information to establish a total load/store order
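An illustrative C fragment with the <predecessor, this, successor> labels written as comments; the sequence numbers, the '.' notation for "none", and the memory-nop placement are assumptions for illustration:

/* Wave-ordered memory sketch: every memory op in the wave carries
 * <pred, this, succ>.  Op 0 precedes a branch, so it cannot name a
 * single successor and uses '?'; the else path has no memory op, so a
 * memory-nop (op 2) keeps an op with a '?' successor from connecting
 * directly to an op with a '?' predecessor. */
void update(int *flag, int *a, int *b) {
    int f = *flag;        /* load   <., 0, ?>  successor depends on branch */
    if (f) {
        *a = 1;           /* store  <0, 1, 3>  taken path                  */
    } else {
        /* memory-nop     <0, 2, 3>  keeps the ordering chain connected    */
    }
    *b = 2;               /* store  <?, 3, .>  predecessor could be 1 or 2 */
}

int main(void) {
    int flag = 1, a = 0, b = 0;
    update(&flag, &a, &b);
    return 0;
}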
10
WaveCache: WaveScalar Implemented
Grid of 2^11 (2048) PEs in clusters of 16
On each PE: control logic, IQ/OQ, ALU, buffering for 8 static insns
Small L1 D$ per 4 clusters
Traditional unified L2$
1 StQ per 4 clusters; each wave bound to a StQ dynamically
Intra-cluster comm: shared buses
Inter-cluster: mesh?
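The organization above, captured as named constants for reference; the identifiers are invented, and the 2048 figure follows from reading the garbled PE count as 2^11:

/* WaveCache organization from the slide; names are ours, not the paper's. */
enum {
    TOTAL_PES           = 2048,   /* grid of 2^11 PEs                */
    PES_PER_CLUSTER     = 16,
    STATIC_INSNS_PER_PE = 8,      /* instruction buffering per PE    */
    CLUSTERS_PER_L1D    = 4,      /* one small L1 D$ per 4 clusters  */
    CLUSTERS_PER_STQ    = 4       /* one store queue per 4 clusters  */
};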
12
Compilation
Compilation is basically the same as for a traditional architecture
To the point that binary translation is possible
Additional steps: inserting memory-nops and wave-advances, converting branches to ɸ-1
Binaries are larger: extra insns, larger insns (each carries a list of target PEs)
...but this is OK (no repeated fetch)
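A purely hypothetical sketch of where those extra steps would sit in a backend or binary translator; every name below is invented for illustration, not an actual compiler API:

typedef struct DFG DFG;   /* opaque handle for the program's dataflow graph */

static void insert_wave_advances(DFG *g)        { (void)g; /* stub */ }
static void convert_branches_to_phi_inv(DFG *g) { (void)g; /* stub */ }
static void insert_memory_nops(DFG *g)          { (void)g; /* stub */ }

/* Ordinary compilation (or Alpha binary translation) builds the DFG;
 * these passes then add the WaveScalar-specific instructions. */
void lower_to_wavescalar(DFG *g) {
    insert_wave_advances(g);           /* wave numbers across wave edges  */
    convert_branches_to_phi_inv(g);    /* data steering, not control flow */
    insert_memory_nops(g);             /* keep wave-ordered memory intact */
}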
13
Program Load/Termination
Loading
As usual, the program is loaded by setting the PC and incurring an I$ miss
Insn targets are labeled "not-loaded" until they miss as well
In general, hopefully I$ misses are infrequent
Must back up an evicted insn's state (queues) and restore the new insn's state
Probably need to invoke the OS
Termination
The OS purges all insns from all PEs
14
Execution Example

void s(char in[10], char out[10]) {
    int i = 0, j = 0;
    do {
        int t = in[i];
        if (t) out[j++] = t;
    } while (i++ < 10);
}

And it's that simple!
15
Just Kidding...

void s(char in[10], char out[10]) {
    int i = 0, j = 0;
    do {
        int t = in[i];
        if (t) out[j++] = t;
    } while (i++ < 10);
}
16
Unmapped and Mapped (figure: the example's dataflow graph before and after being mapped onto WaveCache PEs)
17
How Well Does It Do? Methodology
Benchmarks: SPEC and a few others
Compiled for Alpha & binary-translated (fairness; better overall code generation), but no WaveCache-specific optimizations
Results reported in Alpha-equivalent IPC (fairness: WaveScalar has extra insns)
18
How Well Does It Do?
Favorable comparison to superscalar: 16-wide (!!), out-of-order, |PRF| = |IW| = 1024
Better IPC than TRIPS, but certainly lower fclk
TRIPS limited by smaller execution units (hyperblocks vs. waves)
19
Other performance results
Extra instruction overhead: 20%-140% in static code size, but only 10% in execution time
Parallelism and input queue size: 8 sets of input values are sufficient for most programs, except for victims of parallelism explosion
20
Performance improvements
Control speculation: the baseline WaveCache has no branch prediction; 47% perf. improvement with perfect prediction
Memory speculation: the baseline WaveCache has no memory disambiguation; 62% perf. improvement with perfect disambiguation
Upshot: unrealistic, but lots of headroom (340% improvement with both)
21
Analysis
WaveScalar makes dataflow much more general-purpose
Seems fast enough to be worth implementing: good IPC, plus more clock-period headroom
Why isn't this the gold standard? Why are Swanson and Oskin no longer into dataflow?
22
Questions? Swanson et al. Presented by Andrew Waterman ECE259 Spring 2008