Published by Laura Townsend. Modified over 9 years ago.
1
The Stanford Hydra CMP
Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael Chen, Kunle Olukotun
Presented by Jason Davis
2
Introduction
Hydra CMP with 4 MIPS processors
L1 cache for each CPU, and an L2 cache that holds the permanent state
Why?
–Moore’s law is reaching its end
–Finite amount of ILP
–TLP (thread-level parallelism) vs. ILP in pipelined architectures
–A CMP can exploit ILP as well (TLP and ILP are orthogonal)
–Wire delay
–Design time (the CPU core doesn’t need to be redesigned; just increase the number of cores)
Problems
–Integration densities are only now giving reasons to consider new models
–Difficult to convert uniprocessor code
–Multiprogramming is hard
3
Base Design
4 MIPS cores (250 MHz)
–Each core has an L1 data cache and an L1 primary instruction cache
–All cores share a single L2 cache
–Virtual buses (pipelined with repeaters)
Read bus (256 bits)
–Acts as a general-purpose system bus for moving data between the CPUs, L2, and external memory
–Wide enough to carry an entire cache line (an explicit gain of a CMP; multiprocessor systems would require too many pins)
Write bus (64 bits)
–Writes directly from the 4 CPUs to L2
–Pipelined to allow single-cycle occupancy (not a bottleneck)
–Uses simple invalidation for the caches (a broadcast invalidates all other L1s)
L2 cache
–Point of communication (10-20 cycles)
The buses are sufficient for 4-8 MIPS cores; more cores need larger system buses
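The write-bus behavior on this slide (stores go straight to the shared L2 and a broadcast invalidates every other L1) can be illustrated with a small software model. This is only a sketch under stated assumptions: the class names `L1Cache` and `HydraModel`, the dictionary-based caches, and the default value of 0 for untouched memory are all hypothetical and are not taken from the Hydra design.

```python
class L1Cache:
    """Toy model of one CPU's L1 data cache: address -> value."""

    def __init__(self):
        self.lines = {}

    def invalidate(self, addr):
        # Drop the line if present; a later read will miss and refill.
        self.lines.pop(addr, None)


class HydraModel:
    """Hypothetical model: private L1s, one shared L2, write-through stores."""

    def __init__(self, num_cpus=4):
        self.l1 = [L1Cache() for _ in range(num_cpus)]
        self.l2 = {}

    def write(self, cpu, addr, value):
        # Write bus: the store goes directly to the shared L2 ...
        self.l2[addr] = value
        self.l1[cpu].lines[addr] = value
        # ... and is broadcast so every other L1 drops its stale copy.
        for other, cache in enumerate(self.l1):
            if other != cpu:
                cache.invalidate(addr)

    def read(self, cpu, addr):
        cache = self.l1[cpu]
        if addr not in cache.lines:
            # L1 miss: refill from L2 (assumed to return 0 if never written).
            cache.lines[addr] = self.l2.get(addr, 0)
        return cache.lines[addr]
```

Because L2 always holds the latest value, a CPU whose line was invalidated simply refills from L2 on its next read; no cache-to-cache transfer is modeled here.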
4
Base Design
5
Parallel Software Performance
6
Thread Speculation
Takes the instruction sequence of a normal program and arbitrarily breaks it into a sequenced group of threads
–Hardware must track all interthread dependencies to ensure the program behaves the same way
–Must re-execute code that follows a data violation caused by a true dependency
Advantages:
–Does not require synchronization (unlike enforcing dependencies on multiprocessor systems)
–Dynamic (done at runtime), so the programmer only needs to consider it for maximum performance
–Conventional parallelizing compilers miss a lot of TLP because synchronization points must be inserted wherever dependencies can happen, not just where they do happen
5 issues to address:
7
Thread Speculation
1. Forward data between parallel threads
2. Detect when reads occur too early (RAW)
3. Safely discard speculative state after violations
8
Thread Speculation
4. Retire speculative writes in correct order (WAW hazards)
5. Provide memory renaming (WAR hazards)
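The three hazard names used in the issue list can be pinned down with a small helper. This is a minimal sketch: the function name `classify_hazard` and the "R"/"W" encoding are hypothetical, and it only considers two accesses to the same address taken in original program order.

```python
def classify_hazard(earlier, later):
    """Classify the dependency between two accesses to the same address.

    `earlier` and `later` are "R" (read) or "W" (write), given in the
    order they appear in the original sequential program.
    """
    if earlier == "W" and later == "R":
        return "RAW"   # true dependency: the read needs the written value
    if earlier == "R" and later == "W":
        return "WAR"   # name dependency: the write must not clobber the read
    if earlier == "W" and later == "W":
        return "WAW"   # output dependency: the later write must win
    return None        # two reads never conflict
```

Only RAW is a true data dependency that forces re-execution when violated; WAR and WAW come from reusing the same location, which is why they can be handled by renaming and by ordered retirement of writes.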
9
Hydra Speculation Implementation
Takes care of the 5 issues:
–Forward data between parallel threads:
When a thread writes to the bus, newer threads that need the data have their current cache lines for that data invalidated
On an L1 miss, the access goes to L2; data in the write buffers of the current or older threads replaces the data returned from L2, byte by byte
–Detect when a read occurs too early:
Primary cache bits are set to mark possible violations; if a write to that address by an earlier thread invalidates the line, a violation is detected and the thread is restarted
–Safely discard speculative state after a violation:
Permanent state is kept in L2; any L1 lines holding speculative data are invalidated and the thread’s L2 buffer is discarded (permanent state is not affected)
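The first three mechanisms (forwarding, violation detection, and discarding speculative state) can be sketched in software. This is an illustrative model only, not the hardware implementation: `SpecThread`, `spec_read`, `spec_write`, and `restart` are hypothetical names, a Python set stands in for the L1 read bits, per-address dictionaries stand in for the per-thread L2 write buffers, and unwritten memory is assumed to read as 0.

```python
class SpecThread:
    """One speculative thread; lower `seq` means earlier in program order."""

    def __init__(self, seq):
        self.seq = seq
        self.write_buffer = {}   # addr -> value, speculative (not yet in L2)
        self.read_set = set()    # stand-in for L1 "read" bits
        self.violated = False


def spec_read(threads, l2, reader, addr):
    """Read with forwarding: L2 data overlaid by earlier/own write buffers."""
    value = l2.get(addr, 0)
    for t in sorted(threads, key=lambda t: t.seq):
        if t.seq <= reader.seq and addr in t.write_buffer:
            value = t.write_buffer[addr]   # newest eligible writer wins
    # Only a read the thread did not satisfy from its own write is exposed
    # to a later write by an earlier thread.
    if addr not in reader.write_buffer:
        reader.read_set.add(addr)
    return value


def spec_write(threads, writer, addr, value):
    """Buffer the write and flag later threads that already read `addr`."""
    writer.write_buffer[addr] = value
    for t in threads:
        if t.seq > writer.seq and addr in t.read_set:
            t.violated = True              # RAW violation: read too early


def restart(thread):
    """Discard all speculative state; L2 (permanent state) is untouched."""
    thread.write_buffer.clear()
    thread.read_set.clear()
    thread.violated = False
```

In this sketch a restarted thread re-reads the address and now picks up the forwarded value from the earlier thread's buffer, while L2 still holds the old permanent value until that buffer is drained.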
10
Hydra Speculation Implementation
–Place speculative writes in memory in the correct order:
Separate speculative-data L2 buffers are kept for each thread
They must be drained into L2 in the original sequence
The thread sequencing system also sequences the buffer draining
–Memory renaming:
Each CPU can only read data written by itself or by earlier threads
Writes from later threads don’t cause immediate invalidations (since writes from these threads should not be visible yet)
Ignored invalidations are recorded with a pre-invalidate bit
If a thread accesses L2, it must only access data it should be able to see: data from its own or earlier threads’ L2 buffers
When the current thread completes, all currently pre-invalidated lines are checked against future threads for violations
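The ordering and renaming rules above can be sketched with two small functions. This is a simplified model under stated assumptions: `visible_value` and `drain_in_order` are hypothetical names, each per-thread L2 buffer is a dictionary keyed by the thread's sequence number, unwritten memory reads as 0, and the pre-invalidate bits are not modeled.

```python
def visible_value(l2, buffers, reader_seq, addr):
    """Value of `addr` as seen by the thread with sequence `reader_seq`.

    `buffers` maps thread sequence number -> {addr: value}. Buffers of
    later threads are skipped, which is what renames the location and
    removes WAR hazards.
    """
    value = l2.get(addr, 0)
    for seq in sorted(buffers):
        if seq <= reader_seq and addr in buffers[seq]:
            value = buffers[seq][addr]
    return value


def drain_in_order(l2, buffers):
    """Commit all buffers to L2 oldest-first, then empty them.

    Draining in original thread order means the latest thread's write to
    an address is the one left in L2, which resolves WAW hazards.
    """
    for seq in sorted(buffers):
        l2.update(buffers[seq])
    buffers.clear()
    return l2
```

For example, with `l2 = {0: 1}` and `buffers = {2: {0: 30}, 1: {0: 20}}`, thread 1 sees 20, thread 0 still sees 1, and after draining L2 holds 30.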
11
Hydra Speculation Implementation
13
Speculation Performance
14
Prototype
MIPS-based RC32364
SRAM macro cells
8-Kbyte L1 data and instruction caches
128-Kbyte L2
Die is 90 mm^2 in a 0.25-micron process
Have a Verilog model; moving to physical design using synthesis
Central arbitration for the buses will be the most difficult part: it is hard to pipeline, must accept many requests, and must reply with grant signals
15
Prototype
16
Prototype
17
Conclusion
Hydra CMP
–High performance
–Cost-effective alternative to large single-processor chips
–With similar die area, can achieve performance similar to a uniprocessor on integer programs using thread speculation
–Multiprogrammed or highly parallel workloads can do better than a single processor
–Hardware thread speculation is not cost-intensive and can give great performance gains
18
Questions