The Stanford Hydra Chip Multiprocessor
Kunle Olukotun and the Hydra Team
Computer Systems Laboratory, Stanford University
Technology and Architecture
- Transistors are cheap, plentiful, and fast
  - Moore's law: 100 million transistors by 2000
- Wires are cheap, plentiful, and slow
  - Wires get slower relative to transistors
  - Long cross-chip wires are especially slow
- Architectural implications
  - Plenty of room for innovation
  - Single-cycle communication requires localized blocks of logic
  - High communication bandwidth across the chip is easier to achieve than low latency
Exploiting Program Parallelism
[Figure: levels of parallelism (instruction, loop, thread, process) arranged by grain size, from a few instructions up to roughly 1M instructions]
Hydra Approach
- A single-chip multiprocessor architecture composed of simple, fast processors
- Multiple threads of control
  - Exploits parallelism at all levels
- Memory renaming and thread-level speculation
  - Makes it easy to develop parallel programs
- Keeps the design simple by taking advantage of a single-chip implementation
Outline
- Base Hydra architecture
- Performance of the base architecture
- Speculative thread support
- Speculative thread performance
- Improving speculative thread performance
- Hydra prototype design
- Conclusions
The Base Hydra Design
- Single-chip multiprocessor with four processors
- Separate primary caches
  - Write-through data caches to maintain coherence
- Shared second-level cache
- Separate read and write buses
- Low-latency interprocessor communication (10 cycles)
Hydra vs. Superscalar
- ILP only: a 6-way superscalar is 30-50% better than a single Hydra processor
- ILP & fine-grained threads: superscalar and Hydra are comparable
- ILP & coarse-grained threads: Hydra is 1.5-2x better
- See "The Case for a Single-Chip Multiprocessor," ASPLOS '96
[Figure: speedup of a 6-way-issue superscalar vs. Hydra (4 x 2-way issue) on compress, m88ksim, eqntott, MPEG2, applu, apsi, swim, tomcatv, pmake, and OLTP]
Problem: Parallel Software
- Parallel software is limited
  - Hand-parallelized applications
  - Auto-parallelized dense-matrix FORTRAN applications
- Traditional auto-parallelization of C programs is very difficult
  - Threads have data dependencies, requiring synchronization
  - Pointer disambiguation is difficult and expensive
  - Compile-time analysis is too conservative
- How can hardware help?
  - Remove the need for pointer disambiguation
  - Allow the compiler to be aggressive
Solution: Data Speculation
- Data speculation enables parallelization without regard for data dependencies (see the sketch below)
  - Loads and stores follow the original sequential semantics
  - Speculation hardware ensures correctness
  - Synchronization is added only for performance
  - Loop parallelization is now easily automated
- Other ways to parallelize code
  - Break code into arbitrary threads (e.g., speculative subroutines)
  - Parallel execution with sequential commits
- Data speculation support
  - Pioneered by the Wisconsin Multiscalar
  - Hydra provides low-overhead support for a CMP
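To make this concrete, here is a minimal C sketch of the kind of loop data speculation unlocks; the histogram loop and its names are invented for illustration, not taken from the Hydra compiler:

    #include <stddef.h>

    /* A conservative compiler must serialize this loop: it cannot prove
     * that two iterations never hit the same bucket. Under data
     * speculation each iteration runs as a speculative thread with
     * sequential load/store semantics; an actual collision raises a
     * violation and re-executes the later iteration instead of
     * producing a wrong answer. */
    void count_buckets(int *bucket, const int *keys, size_t n, size_t nbuckets)
    {
        for (size_t i = 0; i < n; i++) {
            size_t b = (size_t)keys[i] % nbuckets; /* may collide across iterations */
            bucket[b]++;                           /* rare conflict, checked by HW */
        }
    }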
Data Speculation Requirements I
- Forward data between parallel threads
- Detect violations when reads occur too early

Data Speculation Requirements II
- Safely discard bad state after a violation
- Correctly retire speculative state

Data Speculation Requirements III
- Maintain multiple "views" of memory
Hydra Speculation Support
- Write bus and L2 buffers provide forwarding
- "Read" L1 tag bits detect violations (sketched below)
- "Dirty" L1 tag bits and write buffers provide backup
- Write buffers reorder and retire speculative state
- Separate L1 caches with pre-invalidation and smart L2 forwarding maintain the memory "views"
- Speculation coprocessors control the threads
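A minimal sketch, in C, of the violation-detection idea behind the "read" tag bits; the per-word layout and the names are assumptions for illustration, not Hydra's actual design:

    #include <stdbool.h>
    #include <stdint.h>

    /* Each L1 line keeps one "read" bit per word, set when the local
     * thread speculatively reads that word; "dirty" bits mark
     * speculative writes so bad state can be discarded later. */
    typedef struct {
        uint32_t tag;
        uint8_t  read_bits;   /* bit w set => word w speculatively read    */
        uint8_t  dirty_bits;  /* bit w set => word w speculatively written */
    } L1Line;

    /* Called when a less-speculative ("earlier") CPU's store is snooped
     * on the write bus: if we already read that word, we consumed stale
     * data and must be squashed and restarted. */
    bool read_too_early(const L1Line *line, uint32_t tag, int word)
    {
        return line->tag == tag && (line->read_bits & (1u << word));
    }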
Speculative Reads
- L1 hit: the read bits are set
- L1 miss: the L2 and the write buffers are checked in parallel
  - The newest bytes written to the line are pulled in by priority encoders on each byte (priority A-D; sketched below)
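A hedged sketch of that byte-priority merge on an L1 miss: for every byte of the line, the newest earlier write wins, with the committed L2 copy as the fallback. The buffer layout and names are assumptions:

    #include <stdint.h>

    #define LINE_BYTES 32

    typedef struct {
        uint8_t  data[LINE_BYTES];
        uint32_t valid;           /* bit b set => byte b was written */
    } WriteBuffer;

    /* bufs[0] is the nearest earlier thread (highest priority, "A"),
     * bufs[n-1] the furthest ("D"); l2 holds the committed line. */
    void merge_line(uint8_t out[LINE_BYTES], const WriteBuffer *bufs,
                    int n, const uint8_t l2[LINE_BYTES])
    {
        for (int b = 0; b < LINE_BYTES; b++) {
            out[b] = l2[b];                    /* default: committed data */
            for (int i = n - 1; i >= 0; i--)   /* apply oldest first ...  */
                if (bufs[i].valid & (1u << b))
                    out[b] = bufs[i].data[b];  /* ... so newest write wins */
        }
    }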
Speculative Writes
- A CPU writes to its L1 cache and write buffer
- Writes from "earlier" CPUs invalidate our L1 and trigger RAW-hazard checks
- Writes from "later" CPUs just pre-invalidate our L1 (see the sketch below)
- The non-speculative write buffer drains into the L2
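Continuing the sketch, the snoop logic that distinguishes earlier from later writers might look like this; the rank encoding and helper names are invented, not Hydra's control logic:

    #include <stdint.h>

    typedef struct { uint32_t tag; uint8_t read_bits; } L1Line;

    /* Stubs standing in for cache-control actions (names invented). */
    void invalidate(L1Line *line);
    void pre_invalidate(L1Line *line);
    void squash_and_restart(void);

    /* Snoop a store seen on the write bus. Lower rank = earlier in the
     * original sequential order. */
    void snoop_write(int my_rank, int writer_rank, L1Line *line,
                     uint32_t tag, int word)
    {
        if (line->tag != tag)
            return;                          /* not our line */
        if (writer_rank < my_rank) {         /* "earlier" CPU wrote */
            if (line->read_bits & (1u << word))
                squash_and_restart();        /* RAW hazard: we read too early */
            invalidate(line);                /* our copy is stale now */
        } else if (writer_rank > my_rank) {  /* "later" CPU wrote */
            pre_invalidate(line);            /* still usable now; the line
                                                dies once we commit and move
                                                to a later thread slot */
        }
    }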
Speculation Runtime System
- Software handlers
  - Control speculative threads through the CP2 interface
  - Track the order of all speculative threads
  - Exception routines recover from data-dependency violations (loosely sketched below)
- Adds more overhead to speculation than a pure hardware approach, but is more flexible and simpler to implement
- Complete descriptions in "Data Speculation Support for a Chip Multiprocessor" (ASPLOS '98) and "Improving the Performance of Speculatively Parallel Applications on the Hydra CMP" (ICS '99)
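A loose sketch of what one of these exception routines might do; every name here is invented, and the real handlers operate through the CP2 coprocessor interface described in the papers above:

    #define NUM_CPUS 4

    /* Stubs for coprocessor-mediated thread control (names invented). */
    void discard_write_buffer(int cpu);
    void restore_start_state(int cpu);   /* saved registers + start PC */
    void resume_thread(int cpu);

    /* On a dependence violation, the violating thread and every more
     * speculative thread restart from their saved starting points,
     * since the later threads may have consumed the bad data. */
    void violation_handler(int victim_rank)
    {
        for (int r = victim_rank; r < NUM_CPUS; r++) {
            discard_write_buffer(r);
            restore_start_state(r);
        }
        for (int r = victim_rank; r < NUM_CPUS; r++)
            resume_thread(r);
    }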
Creating Speculative Threads
- Speculative loops
  - for and while loop iterations
  - Typically one speculative thread per iteration
- Speculative procedures
  - Execute the code after a procedure call speculatively
  - Procedure calls generate a speculative thread
- Compiler support
  - C source-to-source translator
  - pfor, pwhile constructs
  - Analyzes the loop body and globalizes any local variables that could cause loop-carried dependencies (see the sketch below)
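A hedged sketch of that globalization step; pfor is the deck's construct, the loop itself is invented, and the "after" form is shown in a comment because pfor is not plain C:

    /* Before: 'sum' is a register-allocated local carried across
     * iterations. The speculation hardware tracks memory, not
     * registers, so the translator must "globalize" such variables. */
    double loop_before(const double *a, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum = sum * 0.5 + a[i];   /* loop-carried through 'sum' */
        return sum;
    }

    /* After (sketch):
     *
     *     static double sum_glob;                 // promoted to memory
     *     sum_glob = 0.0;
     *     pfor (int i = 0; i < n; i++)
     *         sum_glob = sum_glob * 0.5 + a[i];   // dependence now visible
     *                                             // to the speculation HW,
     *                                             // checked at run time
     */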
Base Speculative Thread Performance
- Entire applications, compiled with GCC -O2
- Four single-issue processors
- Accurate modeling of all aspects of the Hydra architecture and the real runtime system
[Figure: base speedup on compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse]
Improving the Speculative Runtime System
- Procedure support adds overhead to loops
  - Threads are not created sequentially, so dynamic thread scheduling is necessary
  - Start and end of loop: 75 cycles
  - End of iteration: 80 cycles
- Performance
  - The best-performing speculative applications use loops
  - Procedure speculation often lowers performance
- Need to optimize the RTS for the common case
  - Lower speculative overheads
    - Start and end of loop: 25 cycles
    - End of iteration: 12 cycles (almost a factor of 7)
  - Limit procedure speculation to specific procedures
Improved Speculative Performance
- Improves the performance of all applications
- Most improvement for applications with fine-grained threads
- eqntott uses procedure speculation
[Figure: speedup with the base vs. optimized RTS on compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse]
Optimizing Parallel Performance
- Cache-coherent shared memory
  - No explicit data movement
  - 100+ cycle communication latency
  - Need to optimize for data locality
  - Look at cache misses (MemSpy, Flashpoint)
- Speculative threads
  - No explicit data independence
  - Frequent dependence violations limit performance
  - Need to optimize to reduce the frequency and impact of violations
  - Dependence prediction can help
  - Look at violation statistics (requires some hardware support)
Feedback and Code Transformations
- Feedback tool
  - Collects violation statistics (PCs, frequency, work lost)
  - Correlates read and write PC values with the source code
- Synchronization
  - Synchronize frequently occurring violations
  - Use non-violating loads
- Code motion
  - Find dependent load-store pairs
  - Move loads down in a thread; move stores up
Code Motion
- Rearrange reads and writes to increase parallelism
  - Delay reads and advance writes
  - Create local copies to allow earlier data forwarding (rendered in C below)
[Figure: in iterations i and i+1, the late "write x" races with the next iteration's early "read x"; after the transformation, a local copy x' lets the write happen early so the value forwards sooner]
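A hedged C rendering of the transformation in the figure; combine() and unrelated_work() are invented stand-ins for the dependent computation and the independent bulk of the loop body:

    extern int x;                 /* the value carried between iterations */
    int  combine(int v, int i);   /* invented: computes the new x */
    void unrelated_work(int i);   /* invented: never touches x    */

    /* Before: read x early, write x late. Iteration i+1's early read
     * races with iteration i's late write and frequently violates. */
    void loop_before(int n)
    {
        for (int i = 0; i < n; i++) {
            int v = x;                /* read x at the top */
            unrelated_work(i);
            x = combine(v, i);        /* write x at the bottom */
        }
    }

    /* After: advance the write through a local copy x' and delay the
     * independent work; the next iteration's read finds the forwarded
     * value much sooner. Legal only because unrelated_work() does not
     * touch x. */
    void loop_after(int n)
    {
        for (int i = 0; i < n; i++) {
            int xp = combine(x, i);   /* x' : local copy of the new value  */
            x = xp;                   /* advanced write: earlier forwarding */
            unrelated_work(i);
        }
    }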
Optimized Speculative Performance
- Base performance
- Optimized RTS with no manual intervention
- Violation statistics used to manually transform code
[Figure: speedup of the base, optimized-RTS, and manually transformed versions on compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse]
Size of Speculative Write State
- The maximum write-state size determines the write-buffer size needed for maximum performance
- A non-head processor stalls when its write buffer fills up
- Small write buffers (< 64 lines) achieve good performance; with 32-byte cache lines, 64 lines is only 2 KB of buffering per processor
[Figure: maximum number of lines of speculative write state per application, with 32-byte cache lines]
Hydra Prototype
- Design based on the Integrated Device Technology (IDT) RC32364 processor core
- 88 mm² in a 0.25 µm process, with 8 KB instruction and 8 KB data caches per processor and a 128 KB shared L2
Conclusions
- Hydra offers a new way to design microprocessors
  - A single-chip multiprocessor exploits parallelism at all levels
  - Low-overhead support for speculative parallelism
  - High performance on applications with medium- to large-grained parallelism
  - A performance-optimization migration path for hard-to-parallelize, fine-grained applications
- Prototype implementation
  - Works out the implementation details
  - Provides a platform for application and compiler development
  - Enables realistic performance evaluation
Hydra Team
- Monica Lam, Lance Hammond, Mike Chen, Ben Hubbert, Manohar Prabhu, Mike Siu, Melvyn Lim, and Maciek Kozyrczak (IDT)