The Stanford Hydra CMP  Lance Hammond  Benedict A. Hubbert  Michael Siu  Manohar K. Prabhu  Michael Chen  Kunle Olukotun Presented by Jason Davis.


Introduction  Hydra CMP with 4 MIPS processors  L1 caches for each CPU and a shared L2 cache that holds the permanent state  Why? –Moore’s law is reaching its limits –Finite amount of ILP available –TLP (thread-level parallelism) vs. ILP in pipelined architectures –A CMP can exploit ILP as well (TLP and ILP are orthogonal) –Wire delay –Design time (the CPU core doesn’t need to be redesigned; just increase the number of cores)  Problems –Integration densities are only now giving reasons to consider new models –Difficult to convert uniprocessor code –Multiprogramming is hard

Base Design  4 MIPS cores (250 MHz) –Each core:  L1 data cache  L1 instruction cache –Share a single L2 cache –Virtual buses (pipelined with repeaters)  Read bus (256 bits) –Acts as a general-purpose system bus for moving data between the CPUs, L2, and external memory –Wide enough to handle an entire cache line (an explicit gain of the CMP; multiprocessor systems would require too many pins)  Write bus (64 bits) –Carries writes directly from the 4 CPUs to L2 –Pipelined to allow single-cycle occupancy (not a bottleneck) –Uses simple invalidation for cache coherence (each write broadcast invalidates all other L1s)  L2 cache –Point of communication (10–20 cycles)  Buses are sufficient for 4–8 MIPS cores; more would need larger system buses
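The write-bus invalidation scheme above can be illustrated with a minimal sketch (our toy model, not the actual hardware logic): a write by one CPU is written through to L2 and invalidates any other L1 copy of that line, so a later read by another CPU misses in its L1 and refetches the fresh line from L2.

```python
# Toy model of Hydra's write-through L1s with broadcast invalidation.
# Class and method names are illustrative, not from the paper.

class HydraCaches:
    def __init__(self, num_cpus=4):
        self.l1 = [dict() for _ in range(num_cpus)]  # per-CPU L1: addr -> value
        self.l2 = dict()                             # shared L2: addr -> value

    def write(self, cpu, addr, value):
        self.l1[cpu][addr] = value
        self.l2[addr] = value                        # write-through to shared L2
        for other in range(len(self.l1)):            # broadcast invalidation on the write bus
            if other != cpu:
                self.l1[other].pop(addr, None)

    def read(self, cpu, addr):
        if addr not in self.l1[cpu]:                 # L1 miss: refill from L2
            self.l1[cpu][addr] = self.l2[addr]
        return self.l1[cpu][addr]
```

After CPU 2 writes a line that CPU 1 had cached, CPU 1's next read misses and picks up the new value from L2, which is the coherence property the slide describes.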

Base Design

Parallel Software Performance

Thread Speculation  Takes the sequence of instructions in a normal program and arbitrarily breaks it into a sequenced group of threads –Hardware must track all interthread dependencies to ensure the program behaves the same way –Must re-execute code that follows a data violation caused by a true dependency  Advantages: –Does not require synchronization (unlike enforcing dependencies on multiprocessor systems) –Dynamic (done at runtime), so the programmer only needs to consider it for maximum performance –Conventional parallelizing compilers miss a lot of TLP because synchronization points must be inserted wherever dependencies can happen, not just where they do happen  5 issues to address:

Thread Speculation 1. Forward data between parallel threads 2. Detect when reads occur too early (RAW hazards) 3. Safely discard speculative state after violations

Thread Speculation 4. Retire speculative writes in correct order (WAW hazards) 5. Provide memory renaming (WAR hazards)
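Issue 2 above can be sketched in a few lines (a minimal illustration, not Hydra's hardware): each speculative thread records the addresses it has read; when a less speculative (earlier) thread writes one of those addresses, every later thread that read it too early saw stale data and must be squashed and restarted.

```python
# Illustrative RAW-violation check between speculative threads.
# Thread ids are ordered: smaller id = less speculative (earlier in program order).

def find_violations(writer_id, addr, read_sets):
    """Ids of threads later than `writer_id` that already speculatively
    read `addr`, and therefore must restart."""
    return sorted(tid for tid, reads in read_sets.items()
                  if tid > writer_id and addr in reads)
```

For example, if threads 1 and 2 have both read an address that thread 0 then writes, both must restart; a write by thread 1 never restarts thread 0, because earlier threads cannot depend on later ones.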

Hydra Speculation Implementation  Takes care of the 5 issues: –Forward data between parallel threads:  When a thread writes to the bus, newer threads that need the data have their current cache lines for that data invalidated  On a miss in L1, access L2; data in the write buffers of the current or older threads replaces the data returned from L2, byte by byte –Detect when a read occurs too early:  Primary cache bits are set to mark possible violations; if a write from an earlier thread invalidates that address, a violation is detected and the thread is restarted –Safely discard speculative state after a violation:  Permanent state is kept in L2; any L1 lines holding speculative data are invalidated, and the L2 buffer for the thread is discarded (permanent state is not affected)

Hydra Speculation Implementation –Retire speculative writes to memory in correct order:  Separate speculative-data L2 buffers are kept for each thread  They must be drained into L2 in the original sequence  The thread sequencing system also sequences the buffer draining –Memory renaming:  Each CPU can only read data written by itself or earlier threads  Writes from later threads don’t cause immediate invalidations (since writes from those threads should not be visible yet)  Ignored invalidations are recorded with a pre-invalidate bit  If a thread accesses L2, it must only access data it should be able to see, from itself or earlier threads’ L2 buffers  When the current thread completes, all currently pre-invalidated lines are checked against future threads for violations
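The per-thread speculative L2 buffers described above can be sketched as follows (an illustrative model under our own naming, not the real design): a read by thread i searches its own buffer, then earlier threads' buffers from most to least speculative, then the permanent L2 state; later threads' writes are never visible (memory renaming), and buffers drain into L2 in original thread order, preserving WAW order.

```python
# Toy model of Hydra's per-thread speculative L2 write buffers.

class SpecBuffers:
    def __init__(self, num_threads, l2):
        self.buffers = [dict() for _ in range(num_threads)]  # one buffer per thread
        self.l2 = l2                                         # permanent state: addr -> value

    def spec_write(self, tid, addr, value):
        self.buffers[tid][addr] = value                      # held until the thread commits

    def spec_read(self, tid, addr):
        # Own buffer first, then earlier threads' buffers newest-first, then L2;
        # later threads' buffers are never consulted (memory renaming).
        for t in range(tid, -1, -1):
            if addr in self.buffers[t]:
                return self.buffers[t][addr]
        return self.l2[addr]

    def commit(self, tid):
        # Drain this thread's buffer into permanent L2 state; the thread
        # sequencer calls this in original program order, preserving WAW order.
        self.l2.update(self.buffers[tid])
        self.buffers[tid].clear()
```

Note how thread 2 sees thread 1's buffered write but not thread 3's, while thread 0 sees only the permanent state, which is exactly the visibility rule on the slide.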

Hydra Speculation Implementation

Speculation Performance

Prototype  MIPS-based RC32364 core  SRAM macro cells  8-Kbyte L1 data and instruction caches  128-Kbyte L2  Die is 90 mm^2 in a 0.25-micron process  Have a Verilog model; moving to physical design using synthesis  Central arbitration for the buses will be the most difficult part: hard to pipeline, must accept many requests, and must reply with grant signals

Prototype

Prototype

Conclusion  Hydra CMP –High-performance, cost-effective alternative to large single-chip processors –In a similar die area, can achieve performance similar to a uniprocessor on integer programs using thread speculation –Multiprogrammed or highly parallel workloads can do better than a single processor –Hardware thread speculation is not cost-intensive and can give great performance gains

Questions