WaveScalar and the WaveCache
Steven Swanson, Ken Michelson, Mark Oskin, Tom Anderson, Susan Eggers
University of Washington
Spring 2003, CSE P548

Worries to Keep You up at Night
  In ,000 RISC-1 processors will fit on a die.
  It will take 36 cycles to cross the die.
  Still a lack of ILP.
  Memory latency is still a problem.
  For reasonable yields, only 1 transistor in 24 billion may be broken (if one flaw breaks a chip).

WaveScalar’s Solution: Utilize Die Capability
  A sea of simple, RISC-like processors
    in-order, single-issue
    takes advantage of billions of transistors without exacerbating the other problems
      short design & implementation time
      operates at a short cycle
      does not need lots of ILP
      fewer defects

WaveScalar Processing Element
  [Figure: WaveScalar processing element]

WaveScalar’s Solution: Short Wires
  Dataflow execution model
    each processor executes when its operands have arrived
      same principle as out-of-order execution, but applies to the processor & includes fetching
    no single program counter
    short wires:
      no long control lines
      no centralized hardware data structures
      no need for sequential & individual instruction fetches
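
The firing rule on this slide (an instruction executes once all of its operands have arrived, with no program counter choosing what runs next) can be sketched in a few lines. This is only an illustration: the `Node` class, the `run` function and the token format are hypothetical names invented here, not part of WaveScalar.

```python
from collections import deque

class Node:
    """One instruction: fires once every input operand slot is filled."""
    def __init__(self, name, op, n_inputs, consumers=()):
        self.name = name
        self.op = op
        self.n_inputs = n_inputs
        self.consumers = list(consumers)  # (node, input_slot) pairs
        self.operands = {}

    def receive(self, slot, value):
        """Store an operand; return True when the node is ready to fire."""
        self.operands[slot] = value
        return len(self.operands) == self.n_inputs

def run(initial_tokens):
    """Deliver tokens until none remain. Execution order is driven only
    by operand arrival; there is no program counter."""
    queue = deque(initial_tokens)  # each token is (node, slot, value)
    fired, results = [], {}
    while queue:
        node, slot, value = queue.popleft()
        if node.receive(slot, value):
            args = [node.operands[i] for i in range(node.n_inputs)]
            out = node.op(*args)
            node.operands = {}
            fired.append(node.name)
            results[node.name] = out
            # Send the result directly to each consumer's input slot.
            for consumer, cslot in node.consumers:
                queue.append((consumer, cslot, out))
    return fired, results

# Dataflow graph for (a + b) * (c - d) with a=2, b=3, c=7, d=4.
mul = Node("mul", lambda x, y: x * y, 2)
add = Node("add", lambda x, y: x + y, 2, [(mul, 0)])
sub = Node("sub", lambda x, y: x - y, 2, [(mul, 1)])
tokens = [(sub, 0, 7), (add, 0, 2), (sub, 1, 4), (add, 1, 3)]
fired, results = run(tokens)
```

In this run `sub` fires before `add` simply because its second operand arrives first; `mul` fires last, once both of its producers have sent their results.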

WaveScalar’s Solution: Short Wires
  Dataflow execution model, cont’d.
    differs from original dataflow computers
      distributed tag management (matching between renamed producer-consumer registers)
        special WaveScalar instructions assign a number to all operands in a wave (think iteration or trace) & coordinate wave execution
        all instructions in a “wave” execute on data with the same wave number
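
A rough sketch of matching operands by wave number: an instruction only combines operands that carry the same wave number, so values from different iterations can be in flight at once without being mixed. The `MatchingTable` class and its `arrive` method are hypothetical names for illustration, and the table is kept local to one instruction to mirror the slide's distributed tag management.

```python
from collections import defaultdict

class MatchingTable:
    """Per-instruction operand store, keyed by wave number."""
    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.pending = defaultdict(dict)  # wave number -> {slot: value}

    def arrive(self, wave, slot, value):
        """Record an operand. Return the full operand tuple once every
        slot for this wave is filled, otherwise None."""
        slots = self.pending[wave]
        slots[slot] = value
        if len(slots) == self.n_inputs:
            del self.pending[wave]
            return tuple(slots[i] for i in range(self.n_inputs))
        return None

# Operands for waves 0 and 1 arrive interleaved at a two-input add.
table = MatchingTable(2)
arrivals = [(0, 0, 10), (1, 0, 20), (1, 1, 2), (0, 1, 1)]
fired = []
for wave, slot, value in arrivals:
    operands = table.arrive(wave, slot, value)
    if operands is not None:
        fired.append((wave, operands[0] + operands[1]))
```

Here wave 1 completes before wave 0, and each sum uses only operands from its own wave.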

WaveScalar’s Solution: Short Wires
  Dataflow execution model
    differs from original dataflow computers
      explicit wave-ordered memory
        compiler assigns a sequence number to each memory operation in a breadth-first manner
        sequence numbers for an operation, its predecessor & its successor are all sent with the produced data
        wave & sequence numbers provide a total order on memory operations through any traversal of a wave
        + normal memory semantics
        + no need for special dataflow languages; C & C++ programs execute just fine
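
One way to picture how predecessor and successor numbers recover program order within a single wave: the memory system issues an operation only when it can link it to the operation issued just before it. The sketch below assumes that a number which depends on branch direction is sent as unknown (`None` here) and that the first sequence number of the wave is known. Both are assumptions made for illustration, as are the function name and the tuple layout.

```python
def order_memory_ops(arrivals, first_seq):
    """Return operation labels in wave order.

    Each arrival is (pred, seq, succ, label); None marks a predecessor
    or successor the compiler could not fix statically (assumption).
    """
    waiting = list(arrivals)
    issued = []
    last = None  # the most recently issued operation
    while waiting:
        for op in waiting:
            pred, seq, succ, label = op
            if last is None:
                ready = seq == first_seq
            else:
                # Linked if this op names the last one as its predecessor,
                # or the last one names this op as its successor.
                ready = pred == last[1] or last[2] == seq
            if ready:
                break
        else:
            raise ValueError("gap in the chain: an operation has not arrived")
        waiting.remove(op)
        issued.append(label)
        last = op
    return issued

# One traversal of a wave; sequence number 3 is on the untaken path.
arrivals = [
    (4, 5, None, "store d"),
    (2, 4, 5, "load c"),
    (None, 1, 2, "load a"),
    (1, 2, None, "store b"),
]
order = order_memory_ops(arrivals, first_seq=1)
```

Although the operations arrive out of order and sequence number 3 never shows up, the links are enough to issue them as load a, store b, load c, store d.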

WaveScalar’s Solution: Short Wires
  Nearest-neighbor communication
    code placement to locate consumers near their producers
    short, fast node-to-node links rather than slow broadcast networks
    exploits dataflow locality: probability of producing a value for a particular consumer instruction & therefore register (register renaming can destroy this)
    instructions can dynamically migrate toward their neighbors during execution
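
To make the placement goal concrete, the sketch below scores a placement by summing the grid distance of every producer-to-consumer edge. Using Manhattan distance as a stand-in for communication latency is an assumption, and the instruction names and coordinates are made up for the example.

```python
def communication_cost(edges, placement):
    """Sum of Manhattan distances over all producer -> consumer edges."""
    total = 0
    for producer, consumer in edges:
        (r1, c1), (r2, c2) = placement[producer], placement[consumer]
        total += abs(r1 - r2) + abs(c1 - c2)
    return total

# A four-instruction chain: ld -> add -> mul -> st.
edges = [("ld", "add"), ("add", "mul"), ("mul", "st")]

# Same instructions, two placements on a grid of processing elements.
scattered = {"ld": (0, 0), "add": (3, 3), "mul": (0, 3), "st": (3, 0)}
clustered = {"ld": (0, 0), "add": (0, 1), "mul": (1, 1), "st": (1, 0)}

scattered_cost = communication_cost(edges, scattered)
clustered_cost = communication_cost(edges, clustered)
```

With consumers next to their producers, every edge is a single nearest-neighbor hop, which is the case the short node-to-node links are meant for.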

Dynamic Optimization
  The common case has higher costs, and the branch can detect this…
  [Figure: Branch, Common Case, Rare Case, Join]

Dynamic Optimization
  …and fix it, by moving. The join can do the same.
  [Figure: Branch, Common Case, Rare Case, Join]
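
The slides do not say how an instruction decides where to move, so the following is a hypothetical greedy policy, shown only to illustrate the idea: the branch counts how often it sends to each consumer and picks the free grid slot with the lowest traffic-weighted distance. The function name, coordinates and traffic counts are all invented for the example.

```python
def best_position(free_slots, consumers, traffic):
    """Pick the free slot with the lowest traffic-weighted Manhattan
    distance to the consumers (hypothetical migration policy)."""
    def cost(pos):
        return sum(
            traffic[name] * (abs(pos[0] - loc[0]) + abs(pos[1] - loc[1]))
            for name, loc in consumers.items()
        )
    return min(free_slots, key=cost)

# The branch sits at (0, 1), next to the rare case, but 90% of its
# results go to the common case three columns away.
consumers = {"common": (0, 3), "rare": (0, 0)}
traffic = {"common": 90, "rare": 10}
free_slots = [(0, 1), (0, 2), (1, 3)]  # includes the current position

new_position = best_position(free_slots, consumers, traffic)
```

Staying at (0, 1) costs 190 under this measure, while (0, 2) costs 110, so the branch moves one step toward the common case. A join could apply the same rule to its producers.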

WaveScalar’s Solution: Short Wires
  [Figure labels: PE, Domain]

WaveScalar’s Solution: Short Wires
  [Figure label: Cluster]

WaveScalar’s Solution: Creative Use of Untapped Parallelism
  Expand the window for exploiting ILP
    no in-order fetch using only one PC (sucking through a straw)
    place instructions with the processing elements
    out-of-order execution on a grand scale
  Allow multiple threads to execute concurrently
    OS & applications
    multiple applications, parallel threads

WaveScalar’s Solution: The I-Cache is the Processor
  Model is processor-in-memory (PIM)
    processing element associated with each instruction
  WaveScalar version
    processing elements placed in the I-cache to reduce latency

WaveScalar’s Solution: Design to Compensate for Circuit Unreliability
  Fewer design & implementation errors, thanks to the grid's simple, uniform design
  Route around processors with flaws
    decentralized control
    dynamic instruction migration

Research Agenda: Architecture
  WaveScalar ISA
  Microarchitecture design
    node design
    domain size
    cache-coherence across clusters
    cluster arrangement
  Control & memory speculation
  WaveScalar instruction management
    hardware for instruction placement & replacement
    hardware for dynamic, self-optimizing placement

Research Agenda: Architecture
  Multithreaded WaveScalar
  Design of the network & routing issues
  Power management
  Static & dynamic fault detection & recovery (rerouting instructions)
  System-level design
  Application to non-silicon designs

Research Agenda: Compilers
  Instruction placement
  Revisit classic optimizations
    code savings vs. communication costs
    cache pollution vs. loop parallelism
  New opportunities for optimization
    a match between compiler & execution models
    WaveScalar-specific instructions

Research Agenda: OS & Networking
  Tension between facilitating short routines & poor instruction locality
  The software side of thread management
  A bunch of stuff I don’t know about
    optimizing the OS interface
    new thread protection policies
    memory management issues
    security
    lazy context switching
    utilizing virtual machines

Putting It All Together
  Grid of hundreds (maybe thousands) of simple, dataflow processing nodes
    no centralized control; scalable
    few design errors; increase in yield
  Processing nodes embedded in the I-cache
    instructions execute in place
    send results directly to the consumers
      short, point-to-point links
    instructions can dynamically migrate
      reduce latency to hot consumers
      map around defects
  3X performance without any prediction mechanisms
    more with them