Stanford University The Stanford Hydra Chip Multiprocessor Kunle Olukotun The Hydra Team Computer Systems Laboratory Stanford University

Technology → Architecture
- Transistors are cheap, plentiful and fast
  - Moore's law
  - 100 million transistors by 2000
- Wires are cheap, plentiful and slow
  - Wires get slower relative to transistors
  - Long cross-chip wires are especially slow
- Architectural implications
  - Plenty of room for innovation
  - Single cycle communication requires localized blocks of logic
  - High communication bandwidth across the chip easier to achieve than low latency

Stanford University Exploiting Program Parallelism
[Figure: levels of parallelism (instruction, loop, thread, process) plotted against grain size in instructions, up to roughly 1M instructions.]

Stanford University Hydra Approach
- A single-chip multiprocessor architecture composed of simple fast processors
- Multiple threads of control → exploits parallelism at all levels
- Memory renaming and thread-level speculation → makes it easy to develop parallel programs
- Keep design simple by taking advantage of single chip implementation

Stanford University Outline
- Base Hydra architecture
- Performance of base architecture
- Speculative thread support
- Speculative thread performance
- Improving speculative thread performance
- Hydra prototype design
- Conclusions

Stanford University The Base Hydra Design
- Shared 2nd-level cache
- Low latency interprocessor communication (10 cycles)
- Separate read and write buses
- Single-chip multiprocessor
- Four processors
- Separate primary caches
- Write-through data caches to maintain coherence

Stanford University Hydra vs. Superscalar
- ILP only → SS 30-50% better than single Hydra processor
- ILP & fine thread → SS and Hydra comparable
- ILP & coarse thread → Hydra 1.5-2x better
- "The Case for a CMP," ASPLOS '96
[Chart: speedup of a 6-way issue superscalar vs. Hydra (4 x 2-way issue) on compress, m88ksim, eqntott, MPEG2, applu, apsi, swim, tomcatv, pmake, and OLTP]

Stanford University Problem: Parallel Software
- Parallel software is limited
  - Hand-parallelized applications
  - Auto-parallelized dense matrix FORTRAN applications
- Traditional auto-parallelization of C programs is very difficult
  - Threads have data dependencies → synchronization
  - Pointer disambiguation is difficult and expensive
  - Compile-time analysis is too conservative
- How can hardware help?
  - Remove need for pointer disambiguation
  - Allow the compiler to be aggressive
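As a hedged illustration (not taken from the Hydra materials), the fragment below shows why pointer disambiguation blocks parallelization: the compiler cannot prove the two pointers never overlap, so it must keep the loop serial.

```c
/* Illustrative only: without knowing whether a and b alias, the compiler must
 * assume each iteration may depend on the previous one.                       */
void scale(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];   /* if a == b + 1, iteration i writes the value
                                 that iteration i + 1 reads                    */
}
```

Thread-level speculation lets the hardware run such iterations in parallel anyway and recover only when an actual overlap occurs, as the next slide describes.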

Stanford University Solution: Data Speculation
- Data speculation enables parallelization without regard for data dependencies
  - Loads and stores follow original sequential semantics
  - Speculation hardware ensures correctness
  - Add synchronization only for performance
  - Loop parallelization is now easily automated
- Other ways to parallelize code
  - Break code into arbitrary threads (e.g. speculative subroutines)
  - Parallel execution with sequential commits
- Data speculation support
  - Wisconsin Multiscalar
  - Hydra provides low-overhead support for a CMP

Stanford University Data Speculation Requirements I
- Forward data between parallel threads
- Detect violations when reads occur too early

Stanford University Data Speculation Requirements II
- Safely discard bad state after violation
- Correctly retire speculative state

Stanford University Data Speculation Requirements III
- Maintain multiple “views” of memory
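A small worked example (illustrative, not from the Hydra materials) of how these three requirements come into play when loop iterations run as speculative threads:

```c
/* Illustrative loop: each iteration becomes one speculative thread; whether
 * the threads really conflict depends on run-time values in idx[], which no
 * compile-time analysis can know.                                            */
void accumulate(float *hist, const int *idx, const float *val, int n)
{
    for (int i = 0; i < n; i++)
        hist[idx[i]] += val[i];   /* iterations collide only when idx[] repeats */
}

/* How the requirements apply to the threads for iterations i and i+1:
 *  - idx[i] != idx[i+1]: no conflict; thread i+1 simply retires its speculative
 *    state after thread i commits (requirement II).
 *  - idx[i] == idx[i+1] and thread i+1 reads hist[] after thread i has written:
 *    the written value is forwarded between the threads, each keeping its own
 *    view of memory until it commits (requirements I and III).
 *  - idx[i] == idx[i+1] and thread i+1 read too early: a violation is detected,
 *    its bad state is discarded, and it re-executes (requirements I and II).   */
```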

Stanford University Hydra Speculation Support
- Write bus and L2 buffers provide forwarding
- “Read” L1 tag bits detect violations
- “Dirty” L1 tag bits and write buffers provide backup
- Write buffers reorder and retire speculative state
- Separate L1 caches with pre-invalidation & smart L2 forwarding for “view”
- Speculation coprocessors to control threads

Stanford University Speculative Reads
- L1 hit: the read bits are set
- L1 miss: L2 and write buffers are checked in parallel
  - The newest bytes written to a line are pulled in by priority encoders on each byte (priority A-D)
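A minimal sketch of the per-byte priority selection described above, using assumed structure and function names rather than Hydra's actual hardware description:

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES  32
#define NUM_EARLIER 3   /* up to three less-speculative threads may hold buffered writes */

/* One thread's buffered speculative writes to a single cache line (illustrative). */
typedef struct {
    uint8_t data[LINE_BYTES];
    bool    written[LINE_BYTES];   /* per-byte write mask */
} spec_line;

/* Compose the line returned to a speculative reader on an L1 miss: each byte
 * comes from the newest earlier thread that wrote it, else from the committed
 * data in the L2 cache.                                                       */
void forward_line(const spec_line earlier[NUM_EARLIER],  /* [0] = oldest ... [NUM_EARLIER-1] = newest earlier */
                  const uint8_t l2_line[LINE_BYTES],
                  uint8_t out[LINE_BYTES])
{
    for (int b = 0; b < LINE_BYTES; b++) {
        out[b] = l2_line[b];                   /* default: committed L2 data     */
        for (int t = 0; t < NUM_EARLIER; t++)  /* oldest first ...               */
            if (earlier[t].written[b])
                out[b] = earlier[t].data[b];   /* ... so the newest write wins   */
    }
}
```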

Stanford University Speculative Writes
- A CPU writes to its L1 cache & write buffer
- “Earlier” CPUs invalidate our L1 & cause RAW hazard checks
- “Later” CPUs just pre-invalidate our L1
- Non-speculative write buffer drains out into the L2
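A hedged sketch of how one CPU could react to writes it snoops from the write bus; all helper names below are assumptions for illustration, not Hydra's real interface:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed helpers (illustrative only). */
bool thread_is_earlier(int a, int b);           /* is CPU a's thread less speculative than CPU b's? */
void invalidate_l1_line(int cpu, uint32_t addr);
void pre_invalidate_l1_line(int cpu, uint32_t addr);
bool l1_read_bit(int cpu, uint32_t addr);       /* has this CPU speculatively read addr? */
void signal_violation(int cpu);

/* CPU `self` observes a speculative write by CPU `writer` on the write bus. */
void observe_remote_write(int self, int writer, uint32_t addr)
{
    if (thread_is_earlier(writer, self)) {
        /* A less-speculative thread wrote this address: drop our stale copy,
         * and if our "read" tag bit shows we already loaded it, we read too
         * early, so flag a RAW violation and let the runtime restart us.     */
        invalidate_l1_line(self, addr);
        if (l1_read_bit(self, addr))
            signal_violation(self);
    } else {
        /* A more-speculative thread wrote: its data must not be visible to us
         * yet, so just pre-invalidate the line for when we advance past it.  */
        pre_invalidate_l1_line(self, addr);
    }
}
```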

Stanford University Speculation Runtime System
- Software handlers
  - Control speculative threads through CP2 interface
  - Track order of all speculative threads
  - Exception routines recover from data dependency violations
- Adds more overhead to speculation than hardware, but more flexible and simpler to implement
- Complete description in “Data Speculation Support for a Chip Multiprocessor,” ASPLOS ’98, and “Improving the Performance of Speculatively Parallel Applications on the Hydra CMP,” ICS ’99
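As a rough illustration of what one of those exception routines might do (entirely hypothetical names; the real handlers are described in the ASPLOS '98 paper):

```c
/* Hypothetical shape of a dependence-violation handler in the speculation
 * runtime system; every call below is an assumed placeholder.               */
void discard_speculative_writes(int cpu);
void invalidate_speculative_lines(int cpu);
void restart_speculative_thread(int cpu);

void on_dependence_violation(int cpu)
{
    discard_speculative_writes(cpu);    /* throw away the buffered bad state     */
    invalidate_speculative_lines(cpu);  /* drop speculatively loaded L1 lines    */
    restart_speculative_thread(cpu);    /* re-execute the thread from its start  */
}
```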

Stanford University Creating Speculative Threads
- Speculative loops
  - for and while loop iterations
  - Typically one speculative thread per iteration
- Speculative procedures
  - Execute code after a procedure call speculatively
  - Procedure calls generate a speculative thread
- Compiler support
  - C source-to-source translator
  - pfor, pwhile
  - Analyze loop body and globalize any local variables that could cause loop-carried dependencies
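A hedged sketch of the kind of rewrite such a source-to-source translator might perform; the speculative_for runtime call and the surrounding names are assumptions for illustration, not the Hydra tool's actual output:

```c
extern int n;
extern int a[], b[];
int f(int x);
int g(int x);

/* Hypothetical runtime call: run body(i) for i in [lo, hi) as speculative
 * threads that commit in original loop order.                               */
void speculative_for(int lo, int hi, void (*body)(int));

/* Before:
 *   for (i = 0; i < n; i++) {
 *       int t = f(a[i]);       // loop-local temporary
 *       b[i] = t + g(t);
 *   }
 *
 * After (illustrative): each iteration body becomes one speculative thread.
 * The temporary t stays private to its iteration, so nothing needs to be
 * globalized here; a value carried across iterations (e.g. a running sum)
 * would instead be hoisted to a global the speculation hardware can track.  */
static void loop_body(int i)
{
    int t = f(a[i]);
    b[i] = t + g(t);
}

void run_loop(void)
{
    speculative_for(0, n, loop_body);
}
```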

Stanford University Base Speculative Thread Performance
- Entire applications, compiled with gcc -O2
- 4 single-issue processors
- Accurate modeling of all aspects of the Hydra architecture and the real runtime system
[Chart: speedup with the base runtime system for compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse]

Stanford University Improving Speculative Runtime System
- Procedure support adds overhead to loops
  - Threads are not created sequentially
  - Dynamic thread scheduling necessary
  - Start and end of loop: 75 cycles
  - End of iteration: 80 cycles
- Performance
  - Best performing speculative applications use loops
  - Procedure speculation often lowers performance
  - Need to optimize RTS for common case
- Lower speculative overheads
  - Start and end of loop: 25 cycles
  - End of iteration: 12 cycles (almost a factor of 7)
  - Limit procedure speculation to specific procedures

Stanford University Improved Speculative Performance
- Improves performance of all applications
- Most improvement for applications with fine-grained threads
- eqntott uses procedure speculation
[Chart: speedup of the base vs. optimized RTS for compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse]

Stanford University Optimizing Parallel Performance
- Cache-coherent shared memory
  - No explicit data movement
  - 100+ cycle communication latency
  - Need to optimize for data locality
  - Look at cache misses (MemSpy, Flashpoint)
- Speculative threads
  - No explicit data independence
  - Frequent dependence violations limit performance
  - Need to optimize to reduce frequency and impact of data violations
  - Dependence prediction can help
  - Look at violation statistics (requires some hardware support)

Stanford University Feedback and Code Transformations
- Feedback tool
  - Collects violation statistics (PCs, frequency, work lost)
  - Correlates read and write PC values with source code
- Synchronization
  - Synchronize frequently occurring violations
  - Use non-violating loads
- Code motion
  - Find dependent load-stores
  - Move loads down in thread
  - Move stores up in thread
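A hedged sketch of the synchronization idea: if profiling shows a particular load violates frequently, the code can wait for the producing store instead of speculating past it. The non_violating_load() intrinsic and surrounding names are illustrative assumptions, not Hydra's actual API:

```c
/* Hypothetical intrinsic: a load that does not set the L1 "read" bit, so
 * spinning on the flag cannot itself trigger a dependence violation.        */
int non_violating_load(volatile const int *p);

volatile int x_ready;   /* set by the earlier (less speculative) thread */
extern int x;

int consume_x(void)
{
    /* Wait until the earlier thread has produced x, then read it normally.
     * This trades a short stall for avoiding a costly violation + restart.  */
    while (!non_violating_load(&x_ready))
        ;   /* spin */
    return x;
}
```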

Stanford University Code Motion
- Rearrange reads and writes to increase parallelism
- Delay reads and advance writes
- Create local copies to allow earlier data forwarding
[Diagram: reads and writes of x in iterations i and i+1, before and after code motion; after the transformation the write of x happens earlier and later local uses read a copy x’, so the value can be forwarded sooner.]
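A minimal before/after sketch of this transformation (illustrative code, not from the Hydra papers): the store a later iteration depends on is moved up, and the remaining local uses read a private copy.

```c
extern int x;                 /* shared value carried across iterations */
int  expensive_work(void);
void use(int v);
void long_computation(void);  /* independent of x and t */

/* Before: the update of x sits late in the iteration, so a speculative later
 * iteration that reads x is likely to read too early and violate.            */
void body_before(void)
{
    int t = expensive_work();
    long_computation();
    x = x + t;                /* write to the shared value happens late       */
    use(x);
}

/* After: advance the write and keep a local copy for the remaining uses, so
 * the new value of x is forwarded to the next iteration much earlier.        */
void body_after(void)
{
    int t = expensive_work();
    int x_local = x + t;      /* compute the new value                        */
    x = x_local;              /* write moved up: forwarded early              */
    long_computation();
    use(x_local);             /* later uses read the local copy, not x        */
}
```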

Stanford University Optimized Speculative Performance
- Base performance
- Optimized RTS with no manual intervention
- Violation statistics used to manually transform code
[Chart: speedups for compress, eqntott, grep, m88ksim, wc, ijpeg, mpeg2, alvin, cholesky, ear, simplex, and sparse under the three configurations above]

Stanford University Size of Speculative Write State
- Max size determines size of write buffer for max performance
- Non-head processor stalls when write buffer fills up
- Small write buffers (< 64 lines) will achieve good performance
[Chart: maximum number of lines of write state per application, assuming 32-byte cache lines]

Stanford University Hydra Prototype
- Design based on the Integrated Device Technology (IDT) RC32364
- 88 mm² in a 0.25 µm process, with 8 KB instruction and data caches and a 128 KB L2 cache

Stanford University Conclusions
- Hydra offers a new way to design microprocessors
  - Single-chip MP exploits parallelism at all levels
  - Low overhead support for speculative parallelism
  - Provides high performance on applications with medium to large-grain parallelism
  - Allows a performance optimization migration path for difficult-to-parallelize fine-grain applications
- Prototype implementation
  - Work out implementation details
  - Provide platform for application and compiler development
  - Realistic performance evaluation

Stanford University Hydra Team
- Team: Monica Lam, Lance Hammond, Mike Chen, Ben Hubbert, Manohar Prabhu, Mike Siu, Melvyn Lim and Maciek Kozyrczak (IDT)
- URL