Download presentation
Presentation is loading. Please wait.
1
Stanford University The Stanford Hydra Chip Multiprocessor Kunle Olukotun The Hydra Team Computer Systems Laboratory Stanford University
2
Technology Architecture Transistors are cheap, plentiful and fast Moore’s law 100 million transistors by 2000 Wires are cheap, plentiful and slow Wires get slower relative to transistors Long cross-chip wires are especially slow Architectural implications Plenty of room for innovation Single cycle communication requires localized blocks of logic High communication bandwidth across the chip easier to achieve than low latency
3
Stanford University Exploiting Program Parallelism Instruction Loop Thread Process Levels of Parallelism Grain Size (instructions) 1101001K10K100K1M
4
Stanford University Hydra Approach A single-chip multiprocessor architecture composed of simple fast processors Multiple threads of control Exploits parallelism at all levels Memory renaming and thread-level speculation Makes it easy to develop parallel programs Keep design simple by taking advantage of single chip implementation
5
Stanford University Outline Base Hydra Architecture Performance of base architecture Speculative thread support Speculative thread performance Improving speculative thread performance Hydra prototype design Conclusions
6
Stanford University The Base Hydra Design Shared 2nd-level cache Low latency interprocessor communication (10 cycles) Separate read and write buses Single-chip multiprocessor Four processors Separate primary caches Write-through data caches to maintain coherence
7
Stanford University Hydra vs. Superscalar ILP only SS 30-50% better than single Hydra processor ILP & fine thread SS and Hydra comparable ILP & coarse thread Hydra 1.5–2 better “The Case for a CMP” ASPLOS ‘96 compress m88ksim eqntott MPEG2 applu apsi swim tomcatv pmake 0 0.5 1 1.5 2 2.5 3 3.5 4 Speedup Superscalar 6-way issue Hydra 4 x 2-way issue OLTP
8
Stanford University Problem: Parallel Software Parallel software is limited Hand-parallelized applications Auto-parallelized dense matrix FORTRAN applications Traditional auto-parallelization of C-programs is very difficult Threads have data dependencies synchronization Pointer disambiguation is difficult and expensive Compile time analysis is too conservative How can hardware help? Remove need for pointer disambiguation Allow the compiler to be aggressive
9
Stanford University Solution: Data Speculation Data speculation enables parallelization without regard for data-dependencies Loads and stores follow original sequential semantics Speculation hardware ensures correctness Add synchronization only for performance Loop parallelization is now easily automated Other ways to parallelize code Break code into arbitrary threads (e.g. speculative subroutines ) Parallel execution with sequential commits Data speculation support Wisconsin multiscalar Hydra provides low-overhead support for CMP
10
Stanford University Data Speculation Requirements I Forward data between parallel threads Detect violations when reads occur too early
11
Stanford University Data Speculation Requirements II Safely discard bad state after violation Correctly retire speculative state
12
Stanford University Data Speculation Requirements III Maintain multiple “views” of memory
13
Stanford University Hydra Speculation Support Write bus and L2 buffers provide forwarding “Read” L1 tag bits detect violations “Dirty” L1 tag bits and write buffers provide backup Write buffers reorder and retire speculative state Separate L1 caches with pre-invalidation & smart L2 forwarding for “view” Speculation coprocessors to control threads
14
Stanford University Speculative Reads – L1 hit The read bits are set L1 miss L2 and write buffers are checked in parallel The newest bytes written to a line are pulled in by priority encoders on each byte (priority A-D)
15
Stanford University Speculative Writes A CPU writes to its L1 cache & write buffer “Earlier” CPUs invalidate our L1 & cause RAW hazard checks “Later” CPUs just pre-invalidate our L1 Non-speculative write buffer drains out into the L2
16
Stanford University Speculation Runtime System Software Handlers Control speculative threads through CP2 interface Track order of all speculative threads Exception routines recover from data dependency violations Adds more overhead to speculation than hardware but more flexible and simpler to implement Complete description in “Data Speculation Support for a Chip Multiprocessor” ASPLOS ‘98 and “Improving the Performance of Speculatively Parallel Applications on the Hydra CMP” ICS ‘99
17
Stanford University Creating Speculative Threads Speculative loops for and while loop iterations Typically one speculative thread per iteration Speculative procedures Execute code after procedure speculatively Procedure calls generate a speculative thread Compiler support C source to source translator Pfor, pwhile Analyze loop body and globalize any local variables that could cause loop-carried dependencies
18
Stanford University Base Speculative Thread Performance Entire applications GCC 2.7.2 -O2 4 single-issue processors Accurate modeling of all aspects of Hydra architecture and real runtime system compress eqntott grep m88ksim wc ijpeg mpeg2 alvin cholesky ear simplex sparse1.3 0 0.5 1 1.5 2 2.5 3 3.5 4 Speedup Base
19
Stanford University Improving Speculative Runtime System Procedure support adds overhead to loops Threads are not created sequentially Dynamic thread scheduling necessary Start and end of loop: 75 cycles End of iteration: 80 cycles Performance Best performing speculative applications use loops Procedure speculation often lowers performance Need to optimize RTS for common case Lower speculative overheads Start and end of loop: 25 cycles End of iteration: 12 cycles (almost a factor of 7) Limit procedure speculation to specific procedures
20
Stanford University Improved Speculative Performance Improves performance of all applications Most improvement for applications with fine- grained threads Eqntott uses procedure speculation compress eqntott grep m88ksim wc ijpeg mpeg2 alvin cholesky ear simplex sparse1.3 0 0.5 1 1.5 2 2.5 3 3.5 4 Speedup Base Optimized RTS
21
Stanford University Optimizing Parallel Performance Cache coherent shared memory No explicit data movement 100+ cycle communication latency Need to optimize for data locality Look at cache misses (MemSpy, Flashpoint) Speculative threads No explicit data independence Frequent dependence violations limit performance Need to optimize to reduce frequency and impact of data violations Dependence prediction can help Look at violation statistics (requires some hardware support)
22
Stanford University Feedback and Code Transformations Feedback tool Collects violation statistics (PCs, frequency, work lost) Correlates read and write PC values with source code Synchronization Synchronize frequently occurring violations Use non-violating loads Code Motion Find dependent load-stores Move loads down in thread Move stores up in thread
23
Stanford University Code Motion Rearrange reads and writes to increase parallelism Delay reads and advance writes Create local copies to allow earlier data forwarding read x write x read x write x iteration i iteration i+1 read x write x read x write x iteration i iteration i+1 read x read x’
24
Stanford University Optimized Speculative Performance Base performance Optimized RTS with no manual intervention Violation statistics used to manually transform code compress eqntott grep m88ksim wc ijpeg mpeg2 alvin cholesky ear simplex sparse1.3 0 0.5 1 1.5 2 2.5 3 3.5 4 Speedup
25
Stanford University Size of Speculative Write State Max size determines size of write buffer for max performance Non-head processor stalls when write buffer fills up Small write buffers (< 64 lines) will achieve good performance 32 byte cache lines Max no. lines of write state
26
Stanford University Hydra Prototype Design based on Integrated Device Technology (IDT) RC32364 88 mm 2 in 0.25 m with 8 KB I, D and 128 KB L2
27
Stanford University Conclusions Hydra offers a new way to design microprocessors Single-chip MP exploits parallelism at all levels Low overhead support for speculative parallelism Provides high performance on applications with medium to large-grain parallelism Allows performance optimization migration path for difficult to parallelize fine-grain applications Prototype Implementation Work out implementation details Provide platform for application and compiler development Realistic performance evaluation
28
Stanford University Hydra Team Team Monica Lam, Lance Hammond, Mike Chen, Ben Hubbert, Manohar Prahbu, Mike Siu, Melvyn Lim and Maciek Kozyrczak (IDT) URL http://www-hydra.stanford.edu
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.