WaveScalar: the Executive Summary


1 WaveScalar: the Executive Summary
A dataflow machine, good at exploiting ILP
dataflow parallelism + traditional coarser-grain parallelism
cheap thread management
low operand latency because of a hierarchical organization
memory ordering enforced through wave-ordered memory
no special languages: runs ordinary imperative languages
(Autumn 2006, CSE P548)

2 WaveScalar
Additional motivation (this slide presents the problem, not a solution): shrinking feature sizes bring
increasing disparity between computation (fast transistors) & communication (long wires): by 2010 a signal covers in one cycle only about 1/4 of the distance it covered in 2000
increasing circuit complexity: longer design times, more bugs & longer verification time
decreasing fabrication reliability
Why wires get slower: shrinking a wire increases its resistance R proportionally; wire proximity determines its capacitance C, and C determines cross-talk; wire delay = R * C (a back-of-the-envelope sketch follows).
Notes: begin with motivation, then the salient capabilities of WaveScalar.
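To make the R * C rule concrete, here is a rough scaling sketch in Python; the scaling factors are assumptions for illustration, not figures from the slides.

```python
# Illustrative only: back-of-the-envelope RC wire-delay scaling with assumed constants.
def relative_wire_delay(scale):
    """scale < 1.0 means a smaller feature size (e.g., 0.5 = half the wire width).

    Resistance per unit length grows as the wire cross-section shrinks (~1/scale^2
    if width and height shrink together); capacitance per unit length stays roughly
    constant because neighboring wires move closer at the same time.
    """
    r_per_mm = 1.0 / (scale * scale)    # resistance rises as wires shrink
    c_per_mm = 1.0                      # capacitance roughly constant (assumption)
    return r_per_mm * c_per_mm          # delay ~ R * C

for s in (1.0, 0.7, 0.5):
    print(f"feature scale {s}: relative wire delay per mm = {relative_wire_delay(s):.1f}x")
```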

3 Monolithic von Neumann Processors
A phenomenal success today. But in 2016?
Performance: centralized processing & control; long wires, e.g., operand broadcast networks
Complexity: 40-75% of "design" time is design verification
Defect tolerance: 1 flaw -> paperweight
Notes: what is "centralized processing", and relative to what? (Slide shows a 2004 die photo of the Pentium 4.)

4 WaveScalar's Microarchitecture
Good performance via a distributed microarchitecture:
hundreds of PEs; dataflow execution, so no centralized control
short point-to-point (producer to consumer) operand communication
organized hierarchically for fast communication between neighboring PEs; scalable
Low design complexity through simple, identical PEs: design one & stamp out thousands
Defect tolerance: route around a bad PE
Notes: no long wires, even if the WaveScalar design is scaled up.

5 Processing Element
Simple, small (~0.5M transistors)
5-stage pipeline: receive input operands, match tags, schedule instruction, execute, send output
Holds 64 (decoded) instructions
128-entry token store: the matching table ("tracking board"), implemented as a cache indexed by hashing on instruction # & tag
4-entry output buffer
Notes: begin with the bottom level of the hierarchy and see how operand latency grows with size. A behavioral sketch of the five stages follows.
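Below is a minimal behavioral sketch of those five stages in Python, using the capacities quoted on the slide (64 instructions, 128-entry token store, 4-entry output buffer); the class, field names, and dictionary-based matching table are assumptions for illustration, not WaveScalar's RTL.

```python
from collections import defaultdict, deque
import operator

class ProcessingElement:
    """Toy behavioral model of one PE's five stages (capacities from the slide)."""
    def __init__(self):
        self.instructions = {}                 # up to 64 decoded instructions, keyed by id
        self.token_store = defaultdict(dict)   # matching table: up to 128 waiting operands
        self.ready = deque()                   # instructions whose operands have all matched
        self.output_buffer = deque(maxlen=4)   # 4-entry output buffer

    def receive(self, inst_id, tag, port, value):
        """Stages 1-2: receive an operand and match tags against waiting operands."""
        waiting = self.token_store[(inst_id, tag)]
        waiting[port] = value
        if len(waiting) == self.instructions[inst_id]["arity"]:
            self.ready.append((inst_id, tag, self.token_store.pop((inst_id, tag))))

    def step(self):
        """Stages 3-5: schedule one ready instruction, execute it, send the output."""
        if self.ready:
            inst_id, tag, ops = self.ready.popleft()
            inst = self.instructions[inst_id]
            result = inst["op"](*(ops[p] for p in sorted(ops)))
            self.output_buffer.append((inst["dest"], tag, result))

pe = ProcessingElement()
pe.instructions["add1"] = {"op": operator.add, "arity": 2, "dest": "mul2"}
pe.receive("add1", tag=(0, 1), port=0, value=4)
pe.receive("add1", tag=(0, 1), port=1, value=5)
pe.step()
print(pe.output_buffer)        # deque([('mul2', (0, 1), 9)])
```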

6 PEs in a Pod
Share an operand bypass network
Back-to-back producer-consumer execution across PEs
Relieves congestion on the intra-domain bus

7 Domain
A bus carries operand traffic between the pods of a domain: 5 cycles

8 Cluster

9 WaveScalar Processor
Long-distance communication: dynamic routing over a grid-based network
2-cycle hop per cluster; latency depends on distance: 9 cycles + cluster distance
Operand traffic with a single thread: 61% stays within a pod, 32% within a domain, 5% within a cluster, 1% crosses the chip
Operand traffic with multiple threads: 40% within a pod, 12% within a domain, 46% within a cluster, 2% across the chip
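The per-level latencies quoted on these slides can be collected into one small helper; the same-pod and intra-cluster values below are assumptions (the deck gives no explicit figure for them), and the cross-chip formula assumes 2 cycles per cluster hop.

```python
# Rough operand-latency model built from the numbers on slides 6-9; values marked
# "assumed" are illustrative fill-ins where the deck quotes no explicit figure.
def operand_latency(level, cluster_hops=0):
    if level == "pod":       # shared bypass network: back-to-back execution (assumed 1 cycle)
        return 1
    if level == "domain":    # intra-domain operand bus (slide 7)
        return 5
    if level == "cluster":   # between domains within a cluster (assumed)
        return 7
    if level == "chip":      # grid network: 9 cycles plus 2-cycle hops between clusters
        return 9 + 2 * cluster_hops
    raise ValueError(f"unknown level: {level}")

print(operand_latency("pod"), operand_latency("chip", cluster_hops=3))   # 1 15
```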

10 Whole Chip
Can hold 32K instructions
Normal memory hierarchy: 4MB 16-way set-associative L2, 200-cycle memory, traditional directory-based cache coherence
~400 mm2 in 90 nm technology, 1 GHz, ~85 watts
Notes: 8K: 8 PEs/domain, 4 clusters. Revisit short wires, defect tolerance, and simple design here.

11 WaveScalar Instruction Placement
Operand latency vs. parallelism (resource conflicts): we only get low latency if producers & consumers are placed together; on the other hand, we want to exploit ILP and execute instructions in parallel. Sometimes these goals conflict.
(Slide shows two instructions mapped onto PE1 and PE2 of a pod.)

12 WaveScalar Instruction Placement
Place instructions in PEs to maximize data locality & instruction-level parallelism.
The placement algorithm is based on a performance model that captures the important performance factors: minimize operand latency (keep dependent instructions close together) and minimize resource conflicts (execute in different PEs to maximize parallelism).
Carve the dataflow graph into segments (a sketch of this carving follows the slide):
of a particular depth, to make chains of dependent instructions that will be placed in the same pod
of a particular width, to make multiple independent chains that will be placed in different, but nearby, pods
Snake the segments across the PEs of the chip on demand.
K-loop bounding prevents instruction "explosion": loop control can run ahead, creating tokens that won't be used for a while.
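A hedged sketch of the carving step in Python: split the dataflow graph into depth-bounded chains of dependent instructions (one chain per pod) and group a bounded number of chains side by side (for nearby pods). The graph representation and the greedy chain-following heuristic are assumptions; the real algorithm is driven by the performance model.

```python
def carve(dfg, depth, width):
    """dfg maps each instruction to the instructions that consume its result.
    Returns groups of chains: each chain is meant for one pod, each group of
    `width` chains for neighboring pods."""
    nodes = list(dfg) + [c for cs in dfg.values() for c in cs if c not in dfg]
    placed, chains = set(), []
    for root in nodes:
        if root in placed:
            continue
        chain, node = [], root
        while node is not None and node not in placed and len(chain) < depth:
            chain.append(node)
            placed.add(node)
            # follow one dependence edge so producer & consumer land in the same pod
            node = next((c for c in dfg.get(node, []) if c not in placed), None)
        chains.append(chain)
    # independent chains are grouped `width` at a time and snaked across nearby pods
    return [chains[i:i + width] for i in range(0, len(chains), width)]

example = {"a": ["b"], "b": ["c"], "c": [], "x": ["y"], "y": []}
print(carve(example, depth=2, width=2))   # [[['a', 'b'], ['c']], [['x', 'y']]]
```

Each group would then be handed to the next pods along the snake path.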

13 Example to Illustrate the Memory Ordering Problem
A[j + i*i] = i;  b = A[i*j];
(Dataflow graph: *, +, Load, and Store nodes operating on i, j, A, and b; nothing in the graph itself orders the Store relative to the Load.)
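The order matters whenever the two address expressions alias; the plain-Python snippet below is purely illustrative and just makes that concrete.

```python
def run(A, i, j, store_first):
    if store_first:                # program order from the slide
        A[j + i * i] = i           # Store
        b = A[i * j]               # Load
    else:                          # what an unordered dataflow execution might do
        b = A[i * j]               # Load issued before the Store
        A[j + i * i] = i           # Store
    return b

# With i = 2, j = 4 the two addresses alias (j + i*i == i*j == 8):
print(run([0] * 16, 2, 4, store_first=True))    # 2: the load sees the store
print(run([0] * 16, 2, 4, store_first=False))   # 0: wrong value if the load goes first
```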

16 Wave-ordered Memory
The compiler annotates each memory operation with a sequence #, a predecessor sequence #, and a successor sequence #; sequence numbers are assigned in breadth-first order, and '?' marks a link that is not known statically (e.g., across a branch).
Memory requests can then be sent in any order, and the hardware reconstructs the correct order from the annotations.
(Slide annotates the Loads and Stores of the previous example with their sequence/predecessor/successor numbers.)
Notes: compare to a superscalar store buffer. Is this part of the model or the architecture? Don't say "total order": this is a memory model.
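A minimal sketch of the annotation as a record, with None standing in for '?'; the concrete values below are illustrative, not the slide's exact example.

```python
from typing import NamedTuple, Optional

class MemOp(NamedTuple):
    kind: str             # "Load" or "Store"
    seq: int              # sequence number, assigned in breadth-first order
    pred: Optional[int]   # previous memory op on this path; None models '?'
    succ: Optional[int]   # next memory op on this path; None models '?'

# One illustrative path through a wave that contains a branch:
ops = [
    MemOp("Load",  2, pred=None, succ=3),
    MemOp("Store", 3, pred=2,    succ=None),   # successor depends on which way the branch goes
    MemOp("Load",  5, pred=None, succ=6),      # predecessor unknown: it is on the other path
    MemOp("Store", 6, pred=5,    succ=None),
]
```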

17 Wave-ordering Example
This is how the HW reconstructs the memory order.
(Slide shows annotated Loads and Stores from the wave arriving, out of order, at the store buffer.)

18 Wave-ordering Example (cont.)
(The next annotated request arrives; its sequence/predecessor/successor numbers are checked against the entries already in the store buffer.)

19 Wave-ordering Example (cont.)
Notes: the store arrives first; its '?' field prevents it from being sent to memory until the gap in the chain is resolved.

20 Wave-ordering Example (cont.)
Wave-ordered memory is part of the architecture: a program can use it or turn it off.
Ripples allow loads to execute in parallel or in any order.
A memory nop completes the chain when a path contains no memory operation (i.e., the path does not access memory).
A sketch of the store-buffer reconstruction follows.
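Below is a hedged behavioral sketch of that reconstruction, reusing the MemOp record from the sketch after slide 16: requests arrive in any order and are dispatched only once the chain from the last-dispatched operation reaches them. Ripples and memory nops are omitted, and the class is a toy, not the actual store-buffer design.

```python
class StoreBuffer:
    """Toy wave-ordering store buffer: requests arrive out of order, issue in order."""
    def __init__(self):
        self.waiting = {}       # seq -> MemOp, requests that cannot issue yet
        self.next_seq = None    # sequence number allowed to issue next, if known
        self.issued = []        # requests sent on to the memory system, in order

    def arrive(self, op, first=False):
        self.waiting[op.seq] = op
        if first:
            self.next_seq = op.seq       # the wave's first memory op may issue
        elif self.next_seq is None and self.issued and op.pred == self.issued[-1].seq:
            self.next_seq = op.seq       # this arrival resolves an earlier '?' successor
        self._drain()

    def _drain(self):
        while self.next_seq in self.waiting:
            op = self.waiting.pop(self.next_seq)
            self.issued.append(op)       # safe to send to memory now
            if op.succ is not None:
                self.next_seq = op.succ  # direct link to the next operation
            else:                        # '?' successor: look for a waiting op that names us
                self.next_seq = next((o.seq for o in self.waiting.values()
                                      if o.pred == op.seq), None)

sb = StoreBuffer()
sb.arrive(MemOp("Store", 4, pred=3, succ=5))             # arrives early, must wait
sb.arrive(MemOp("Load",  3, pred=None, succ=4), first=True)
sb.arrive(MemOp("Load",  5, pred=4, succ=None))
print([op.seq for op in sb.issued])                      # [3, 4, 5]: program order restored
```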

21 Wave-ordered Memory
Waves are loop-free sections of the dataflow graph: a sequence of instructions with no backedges. Wave-ordered memory as described so far orders memory within a wave.
Each dynamic wave has a wave number; the wave number is incremented between waves.
Memory is ordered by wave number, then by sequence number within the wave, so loop iterations & recursive functions execute their memory operations in the right order.
Notes: the green bars in the figure are wave-management instructions. Precise interrupts on WaveScalar: what does that mean on a dataflow machine? Instructions before the interrupted instruction have executed, instructions data-dependent on the interrupted instruction have not fired, and memory order is preserved.
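So the global ordering key for a request is just (wave number, sequence number); a tiny illustrative snippet (names assumed):

```python
def memory_order_key(wave_number, seq_number):
    # Order first by dynamic wave, then by the compiler-assigned sequence number.
    return (wave_number, seq_number)

requests = [(3, 5), (2, 9), (3, 2)]    # (wave#, seq#) pairs arriving out of order
print(sorted(requests, key=lambda r: memory_order_key(*r)))   # [(2, 9), (3, 2), (3, 5)]
```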

22 WaveScalar Tag-matching
A token is a tag & a value: <ThreadID:Wave#:InstructionID>.value, where the tag carries the thread identifier and the wave number (and identifies the destination instruction).
Example: tokens <2:5>.3 and <2:5>.6 match at a + instruction and produce <2:5>.9.
Notes: instruction ID?
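A small sketch of that matching using the slide's example; the dictionary-based matching table and function names are assumptions for illustration.

```python
import operator

matching_table = {}   # (inst_id, thread_id, wave#) -> operand values seen so far

def deliver(inst_id, op, tag, value):
    """Accumulate operands that share a tag; fire the instruction when both are present."""
    key = (inst_id, *tag)
    operands = matching_table.setdefault(key, [])
    operands.append(value)
    if len(operands) == 2:                    # binary instruction: both inputs have arrived
        del matching_table[key]
        return tag, op(*operands)             # the result keeps the <ThreadID:Wave#> tag
    return None

tag = (2, 5)                                  # <ThreadID:Wave#> = <2:5>
deliver("add7", operator.add, tag, 3)         # <2:5>.3 arrives and waits
print(deliver("add7", operator.add, tag, 6))  # <2:5>.6 arrives -> ((2, 5), 9), i.e., <2:5>.9
```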

23 Single-thread Performance
Benchmarks: SPECfp, SPECint, Mediabench, run through a binary translator; performance is reported as average AIPC.
Raw performance is comparable: loop-iteration parallelism works in WaveScalar's favor, while the overhead of calling short functions works against it.

24 Single-thread Performance per Area
WaveScalar can use chip area more effectively because its design is simpler & more parallel.
Notes: encouraging, but not the point: 1 cluster, 32 PEs, 0.1 to 1.4 AIPC.

25 Multithreading the WaveCache
Architectural support for WaveScalar threads (a hedged sketch of how these compose follows the slide):
instructions to start & stop memory orderings, i.e., threads
memory-free synchronization to allow exclusive access to data (the thread-communicate instruction)
a fence instruction to force all previous memory operations to fully execute, so other threads can see the results of this thread's memory ops
Combine these to build threads of multiple granularities:
coarse-grain threads: speedup over a single thread; 2-16X over a CMP, 5-11X over an SMT
fine-grain, dataflow-style threads: speedup over a single thread
combining the two in the same application: 1.6X or 7.9X -> 9X
Notes: thread-communicate passes a lock to the next thread before another thread can get the lock; cache coherence does this between different processors, this does it between threads on the same processor. More detail on thread creation/termination & performance follows.
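A heavily hedged sketch of how those three primitives might compose. The method names and the ToyMachine model are invented for illustration from the capabilities listed above; they are not the actual WaveScalar instructions.

```python
class ToyMachine:
    """Stand-in for the hardware; each method models one capability from the slide."""
    def memory_sequence_start(self, tid): print(f"start a memory ordering for thread {tid}")
    def memory_sequence_stop(self, tid):  print(f"stop the memory ordering of thread {tid}")
    def memory_fence(self, tid):          print(f"thread {tid}: all prior memory ops now visible")
    def thread_communicate(self, src, dst, value):
        print(f"thread {src} -> thread {dst}: {value!r} (memory-free synchronization)")

m = ToyMachine()
m.memory_sequence_start(2)           # create thread 2: give it its own wave-ordered memory
m.memory_fence(1)                    # thread 1 makes its memory results visible ...
m.thread_communicate(1, 2, "lock")   # ... then hands the lock off without touching memory
m.memory_sequence_stop(2)            # terminate thread 2
```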

26 Creating & Terminating a Thread
Thread t creates thread s using the data/thread/wave conversion instructions.
Notes: starting the new thread's memory ordering sets up hardware in the store buffer for wave-ordered memory; this must complete before the data-to-thread/wave conversion.

27 Thread Creation Overhead
Experiment: vary the loop size (affects the granularity of parallelism) & the dependence distance (affects the number of threads that can execute in parallel, i.e., a reflection of how many threads can run at once); compare to serial execution.
Notes: each contour line marks where a certain speedup is reached; the region above it does at least that well. Break-even: 24 instructions with 2 parallel copies, or 3 instructions with 10 copies.

28 Performance of Coarse-grain Parallelism
Near-linear speedup up to 32 threads; roughly 10X faster than CMPs.
Notes: 8x8 configuration, results in IPC. ocean & raytrace decrease with 128 threads: 18% more instruction misses, 23% more matching-table misses, 2X more data-cache misses.

29 CMP Comparison

30 Performance of Fine-grain Parallelism
Wave-ordered memory is part of the architecture: it can be turned off and on.
Fine-grain performance relies on: cheap synchronization; loading data once and passing it along (not load/compute/store).
Notes: benchmark shown is lcs.

31 Building the WaveCache
RTL-level implementation: some didn't believe it could be built in a normal-sized chip; some didn't believe it could achieve a decent cycle time and load-use latencies. Implemented in Verilog with Synopsys CAD tools.
Different WaveCaches for different applications:
1 cluster: low-cost, low-power, single-thread or embedded; 42 mm2 in a 90 nm process, 2.2 AIPC on Splash-2
16 clusters: multiple threads, higher performance; 378 mm2
Board-level FPGA implementation for OS & real-application simulations.
Notes: 20 FO4 cycle time (vs. a 1 GHz P4); 122 fully custom SRAM cells.

