EECC722 - Shaaban, lec # 10, Fall 2003 (10-22-2003): A New Approach to Speculation in the Stanford Hydra Chip Multiprocessor (CMP)


EECC722 - Shaaban #1 lec # 10 Fall 2003
A New Approach to Speculation in the Stanford Hydra Chip Multiprocessor (CMP)
A chip-multiprocessor microarchitecture/compiler effort at Stanford that provides hardware/software support for Data/Thread Level Speculation (TLS), extracting parallel speculated threads from sequential code augmented with software thread-speculation handlers.

EECC722 - Shaaban #2 lec # 10 Fall 2003
Chip Multiprocessors (CMPs)
A CMP offers implementation benefits:
– High-speed signals are localized within individual CPUs
– A proven CPU design may be replicated across the die
Overcomes the diminishing performance/transistor return problem of uniprocessors:
– Transistors today are used mostly for ILP extraction
– MPs use transistors to run multiple threads, on parallelized programs and with multiprogrammed workloads
Fast inter-processor communication (shared L2 cache) eases parallelization of code.

EECC722 - Shaaban #3 lec # 10 Fall 2003
Stanford Hydra CMP Approach
– Exploit all levels of program parallelism.
– Develop a single-chip multiprocessor architecture that simplifies microprocessor design and achieves high performance.
– Make the multiprocessor transparent to the average user.
– Integrate parallelizing-compiler technology into the design of a microarchitecture that supports data/thread speculation.

EECC722 - Shaaban #4 lec # 10 Fall 2003
The Basic Hydra CMP
– 4 processors and a shared secondary cache on a single chip
– 2 buses connect the processors and memory
– Coherence: writes are broadcast on the write bus

EECC722 - Shaaban #5 lec # 10 Fall Hydra Memory Hierarchy Characteristics

EECC722 - Shaaban #6 lec # 10 Fall 2003
CMP Parallel Performance
Varying levels of performance:
– Multiprogrammed workloads work well
– Very parallel apps (matrix-based FP and multimedia) are excellent
– Merely acceptable on the few less-parallel (i.e., integer) applications

EECC722 - Shaaban #7 lec # 10 Fall 2003
Problem: Limited Parallel Software
Current parallel software is limited:
– Some programs just don’t have significant parallelism
– Parallelizing compilers generally require dense-matrix applications
Many applications are only hand-parallelizable:
– Parallelism may exist in the algorithm, but the code hides it
– Compilers must statically verify parallelism
– Data dependencies require synchronization
– Pointer disambiguation is a major problem for this!
Can hardware help the situation?
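The disambiguation problem above can be made concrete with a small sketch (ours, not from the slides): a loop whose cross-iteration dependencies depend on runtime index values, so no compiler can statically prove the iterations independent. The function name and the example index vectors are purely illustrative.

```python
# Hypothetical illustration (not Hydra code): a loop of the form
#     for i in range(n): a[idx[i]] = a[i] + 1
# writes a[idx[i]] and reads a[i]. Whether iteration j's write feeds a
# later iteration's read depends entirely on the runtime contents of idx.

def has_cross_iteration_dependency(idx, n):
    """True if some iteration j writes a location a[idx[j]] that a
    later iteration i > j reads, i.e. a runtime RAW dependency exists."""
    for j in range(n):
        for i in range(j + 1, n):
            if idx[j] == i:   # j writes a[idx[j]]; i reads a[i]
                return True
    return False

# Every write lands outside the read range: iterations are independent
# and could safely run in parallel.
safe_idx = [10, 11, 12, 13]
# Iteration 0 writes a[2], which iteration 2 reads: a true dependency
# visible only at run time.
unsafe_idx = [2, 11, 12, 13]

print(has_cross_iteration_dependency(safe_idx, 4))    # False
print(has_cross_iteration_dependency(unsafe_idx, 4))  # True
```

Since both index vectors give the loop the same static shape, a compiler must conservatively serialize it; speculation hardware can instead run it in parallel and recover only when the unsafe case actually occurs.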

EECC722 - Shaaban #8 lec # 10 Fall 2003
Possible Limited Parallel Software Solution: Data Speculation & Thread Level Speculation (TLS)
Data speculation and Thread Level Speculation (TLS) enable parallelization without regard for data dependencies:
– The normal sequential program is broken up into speculative threads
– The speculative threads are then run in parallel on the CPUs
– Speculation hardware ensures correctness
Parallel software implications:
– Loop parallelization is now easily automated
– More “arbitrary” threads are possible (e.g., subroutines)
– Synchronization is added only for performance
Speculation support mechanisms:
– Speculative thread control mechanism
– Five basic speculation hardware/memory system requirements for correct data/thread speculation

EECC722 - Shaaban #9 lec # 10 Fall Subroutine Speculation

EECC722 - Shaaban #10 lec # 10 Fall 2003
Loop Iteration Speculative Threads
A simple example of a speculatively executed loop using data/thread speculation

EECC722 - Shaaban #11 lec # 10 Fall 2003
Overview of Loop-Iteration Thread Speculation
Parallel regions (loop iterations) are annotated by the compiler.
The hardware uses these annotations to run loop iterations in parallel as speculated threads on a number of CPUs.
Each CPU knows which loop iteration it is running.
CPUs dynamically prevent data dependency violations:
– “Later” iterations can’t use data before it is written by “earlier” iterations (RAW)
– “Earlier” iterations never see writes by “later” iterations (WAW)
If a “later” iteration has used data that an “earlier” iteration writes (RAW hazard), it is restarted:
– All following iterations are halted and restarted as well
– All writes by the later iteration are discarded (speculated work is undone)
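The restart policy above can be sketched as a toy model (our simplification, not Hydra's actual hardware): remaining iterations run as one speculative "wave" against committed memory, each buffering its write privately; if an earlier iteration writes a location a later iteration already read, the later iteration and everything after it restart, while the earlier buffers retire in program order. The loop body modeled is `a[idx[i]] = a[i] + 1`.

```python
def sequential(a, idx):
    # Reference semantics: the original sequential loop.
    for i in range(len(idx)):
        a[idx[i]] = a[i] + 1
    return a

def speculative(a, idx):
    """Toy speculative execution of the same loop; returns the final
    memory and the number of violation restarts."""
    n = len(idx)
    start, restarts = 0, 0
    while start < n:
        # One speculative wave: every remaining iteration reads only
        # committed memory and buffers its write privately.
        reads = {k: {k} for k in range(start, n)}                # body reads a[k]
        bufs = {k: {idx[k]: a[k] + 1} for k in range(start, n)}  # body writes a[idx[k]]
        # RAW check: an earlier iteration j wrote a location that a
        # more-speculative iteration k had already read.
        violated = min((k for j in range(start, n)
                        for k in range(j + 1, n)
                        if idx[j] in reads[k]), default=None)
        stop = n if violated is None else violated
        for k in range(start, stop):      # retire write buffers in program order
            for addr, val in bufs[k].items():
                a[addr] = val
        if violated is not None:
            restarts += 1                 # iterations >= violated are rerun
        start = stop
    return a, restarts

# idx[0] = 2 makes iteration 2 read a value iteration 0 writes (RAW).
a_seq = sequential(list(range(8)), [2, 5, 6, 7])
a_spec, restarts = speculative(list(range(8)), [2, 5, 6, 7])
print(a_spec == a_seq, restarts)   # True 1
```

Despite one violation and restart, the speculative run produces exactly the sequential result, which is the correctness guarantee the slide describes.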

EECC722 - Shaaban #12 lec # 10 Fall Hydra’s Data & Thread Speculation Operations

EECC722 - Shaaban #13 lec # 10 Fall Hydra Loop Compiling for Speculation

EECC722 - Shaaban #14 lec # 10 Fall Loop Execution with Thread Speculation

EECC722 - Shaaban #15 lec # 10 Fall Speculative Thread Creation in Hydra Register Passing Buffer (RPB)

EECC722 - Shaaban #16 lec # 10 Fall 2003
Speculative Data Access in Speculated Threads
(Figure: thread i is the less speculative thread, thread i+1 the more speculative thread; the WAR, RAW, and WAW cases between them are shown.)

EECC722 - Shaaban #17 lec # 10 Fall 2003
Speculative Data Access in Speculated Threads
To provide the desired memory behavior, the data speculation hardware must provide:
1. A method for detecting true memory dependencies, in order to determine when a dependency has been violated.
2. A method for backing up and re-executing speculative loads, and any instructions that may be dependent upon them, when a load causes a violation.
3. A method for buffering any data written during a speculative region of a program, so that it may be discarded when a violation occurs or permanently committed at the right time.
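Requirement 3 above, buffering speculative writes for later discard or commit, can be sketched as follows. This is a minimal model with invented names (the slides do not give an interface); committed state stands in for the L2 cache.

```python
class SpecThreadBuffer:
    """Per-thread speculative write buffer (illustrative, not Hydra's design)."""

    def __init__(self, memory):
        self.memory = memory   # committed state (e.g. the L2 cache)
        self.writes = {}       # speculative stores: addr -> value

    def load(self, addr):
        # A speculative load sees this thread's own stores first,
        # then falls back to committed state.
        return self.writes.get(addr, self.memory.get(addr, 0))

    def store(self, addr, value):
        self.writes[addr] = value   # buffered, not yet globally visible

    def discard(self):
        # Violation: throw the speculative work away; committed state
        # is untouched.
        self.writes.clear()

    def commit(self):
        # Thread retired successfully: make the stores permanent.
        self.memory.update(self.writes)
        self.writes.clear()

mem = {0x10: 7}
t = SpecThreadBuffer(mem)
t.store(0x10, 8)
print(t.load(0x10), mem[0x10])   # 8 7  (store visible only to the thread)
t.discard()
print(t.load(0x10))              # 7    (violation undone)
t.store(0x10, 9)
t.commit()
print(mem[0x10])                 # 9    (retired into permanent state)
```

The same buffer object also gives the thread memory renaming for its own stores, since `load` checks the private buffer before committed state.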

EECC722 - Shaaban #18 lec # 10 Fall 2003
Five Basic Speculation Hardware Requirements For Correct Data/Thread Speculation
1. Forward data between parallel threads (RAW). A speculative system must be able to forward shared data quickly and efficiently from an earlier thread running on one processor to a later thread running on another.
2. Detect when reads occur too early (RAW hazards). If a data value is read by a later thread and subsequently written by an earlier thread, the hardware must notice that the read retrieved incorrect data, since a true dependence violation has occurred.
3. Safely discard speculative state after violations. All speculative changes to the machine state must be discarded after a violation, while no permanent machine state may be lost in the process.
4. Retire speculative writes in the correct order (WAW hazards). Once speculative threads have completed successfully, their state must be added to the permanent state of the machine in the correct program order, considering the original sequencing of the threads.
5. Provide memory renaming (WAR hazards). The speculative hardware must ensure that an older thread cannot “see” any changes made by later threads, as these would not yet have occurred in the original sequential program.

EECC722 - Shaaban #19 lec # 10 Fall Speculative Hardware/Memory Requirements 1-2 (RAW) (RAW hazard or violation)

EECC722 - Shaaban #20 lec # 10 Fall 2003
Speculative Hardware/Memory Requirements 3-4
(Figure: restart on a RAW hazard; retiring writes in order for WAW hazards.)

EECC722 - Shaaban #21 lec # 10 Fall Speculative Hardware/Memory Requirement 5 Memory Renaming to prevent WAR hazards.

EECC722 - Shaaban #22 lec # 10 Fall Hydra Speculation Hardware

EECC722 - Shaaban #23 lec # 10 Fall Hydra Speculation Support

EECC722 - Shaaban #24 lec # 10 Fall L1 Cache Tag Details

EECC722 - Shaaban #25 lec # 10 Fall L2 Buffer Details

EECC722 - Shaaban #26 lec # 10 Fall 2003 The Operation of Speculative Loads

EECC722 - Shaaban #27 lec # 10 Fall Reading L2 Cache Speculative Buffers

EECC722 - Shaaban #28 lec # 10 Fall The Operation of Speculative Stores

EECC722 - Shaaban #29 lec # 10 Fall 2003
Hydra’s Handling of the Five Basic Speculation Hardware Requirements For Correct Data/Thread Speculation
1. Forward data between parallel threads (RAW).
– When a speculative thread writes data over the write bus, all more-speculative threads that may need the data have their current copy of that cache line invalidated.
– This is similar to the way the system works during non-speculative operation (invalidating cache coherency protocol).
– If any of the threads subsequently need the new speculative data forwarded to them, they will miss in their primary cache and access the secondary cache. The speculative data contained in the write buffers of the current or older threads replaces data returned from the secondary cache on a byte-by-byte basis, just before the composite line is returned to the processor and primary cache.
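The byte-by-byte merge in the last bullet can be sketched as follows (our simplification: plain byte offsets and a list of per-thread buffers, no real cache-line format). Buffers are ordered oldest thread first, ending with the requesting thread, so later entries override earlier ones.

```python
def merge_line(l2_line, buffers):
    """Compose the line returned to the requesting processor.

    l2_line: list of byte values from the secondary cache.
    buffers: per-thread dicts of byte_offset -> value, ordered from the
    oldest thread up to and including the requesting thread.
    """
    line = list(l2_line)
    for buf in buffers:                # older threads first, so bytes from
        for off, val in buf.items():   # newer (less old) writers win
            line[off] = val
    return line

l2 = [0x00] * 8
older = {1: 0xAA, 2: 0xBB}   # write buffer of an earlier thread
mine  = {2: 0xCC}            # requesting thread's own buffer
print(merge_line(l2, [older, mine]))
# [0, 170, 204, 0, 0, 0, 0, 0]  (byte 2 comes from the newest writer)
```

Note that only buffers of the current or older threads participate; more-speculative threads' buffers are excluded, which is exactly the renaming behavior requirement 5 asks for.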

EECC722 - Shaaban #30 lec # 10 Fall 2003
2. Detect when reads occur too early (RAW hazards).
– Primary cache bits are set to mark any reads that may cause violations.
– Subsequently, if a write to that address from an earlier thread invalidates the address, a violation is detected and the thread is restarted.
3. Safely discard speculative state after violations.
– Since all permanent machine state in Hydra is always maintained within the secondary cache, anything in the primary caches and secondary cache buffers may be invalidated at any time without risking a loss of permanent state.
– As a result, any lines in the primary cache containing speculative data (marked with a special modified bit) may simply be invalidated all at once to clear any speculative state from a primary cache.
– In parallel with this operation, the secondary cache buffer for the thread may be emptied to discard any speculative data written by the thread.
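A toy model (our naming, not Hydra's actual tag logic) of points 2 and 3 above: a speculative load sets a per-line read bit; a write-bus invalidation from a less-speculative thread that hits a line with its read bit set signals a violation; and discarding speculative state is a gang invalidation of the speculatively modified lines, safe because permanent state lives in the L2.

```python
class L1Line:
    def __init__(self):
        self.valid = False
        self.read_bit = False   # set by speculative loads
        self.spec_mod = False   # line holds speculative store data

class SpecL1:
    """Primary cache with speculation tag bits (illustrative sketch)."""

    def __init__(self, nlines):
        self.lines = [L1Line() for _ in range(nlines)]
        self.violated = False

    def spec_load(self, i):
        self.lines[i].valid = True
        self.lines[i].read_bit = True   # remember: this thread used the value

    def bus_invalidate_from_earlier(self, i):
        # An earlier thread wrote this line over the write bus.
        if self.lines[i].read_bit:      # we already read data it overwrote:
            self.violated = True        # RAW violation, thread must restart
        self.lines[i].valid = False

    def discard_speculative_state(self):
        # Gang-invalidate every speculatively modified line at once and
        # clear the read bits; permanent state is safe in the L2.
        for line in self.lines:
            if line.spec_mod:
                line.valid = False
            line.read_bit = False
        self.violated = False

cache = SpecL1(4)
cache.spec_load(2)
cache.bus_invalidate_from_earlier(2)
print(cache.violated)   # True: the thread is restarted
```
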

EECC722 - Shaaban #31 lec # 10 Fall 2003
4. Retire speculative writes in the correct order (WAW hazards).
– Separate secondary cache buffers are maintained for each thread. As long as these are drained into the secondary cache in the original program sequence of the threads, they will reorder speculative memory references correctly.
5. Provide memory renaming (WAR hazards).
– Each processor can only read data written by itself or by earlier threads when reading its own primary cache or the secondary cache buffers.
– Writes from later threads don’t cause immediate invalidations in the primary cache, since these writes should not be visible to earlier threads.
– However, these “ignored” invalidations are recorded using an additional pre-invalidate primary cache bit associated with each line, because they must be processed before a different speculative or non-speculative thread executes on this processor.
– If future threads have written to a particular line in the primary cache, the pre-invalidate bit for that line is set. When the current thread completes, these bits allow the processor to quickly apply the effect of all stored invalidations caused by writes from later processors at once, before a new thread begins execution on this processor.
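The pre-invalidate mechanism for WAR renaming can be sketched like this (again a simplification with invented names): a write from a later thread is ignored for now, so the older thread keeps its own version of the line, but it is recorded and applied in one step when the thread completes.

```python
class RenamingL1:
    """Primary cache with pre-invalidate bits (illustrative sketch)."""

    def __init__(self, nlines):
        self.valid = [True] * nlines
        self.pre_invalidate = [False] * nlines

    def bus_write_from_later_thread(self, i):
        # Do NOT invalidate now: the older thread must not see writes
        # from more-speculative threads (WAR renaming). Just record it.
        self.pre_invalidate[i] = True

    def thread_complete(self):
        # Apply all recorded invalidations at once, before the next
        # thread starts on this processor.
        for i, pending in enumerate(self.pre_invalidate):
            if pending:
                self.valid[i] = False
                self.pre_invalidate[i] = False

c = RenamingL1(4)
c.bus_write_from_later_thread(1)
print(c.valid[1])   # True: the older thread still sees its own copy
c.thread_complete()
print(c.valid[1])   # False: stale line cleared for the next thread
```
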

EECC722 - Shaaban #32 lec # 10 Fall 2003
Thread Speculation Performance
– Results are representative of entire uniprocessor applications
– Simulated with accurate modeling of Hydra’s memory system and hardware speculation support

EECC722 - Shaaban #33 lec # 10 Fall 2003
Hydra Prototype Overview
CPU core and cache.
Speculative coprocessor:
– Speculative memory reference controller
– Speculative interrupt screening mechanism
– Statistics mechanisms for performance evaluation and to allow feedback for code tuning
Memory system:
– Read and write buses
– Controllers for all resources
– On-chip L2 cache
– Simple off-chip main memory controller
– I/O and debugging interface

EECC722 - Shaaban #34 lec # 10 Fall Hydra Prototype Layout 250 MHz clock rate target

EECC722 - Shaaban #35 lec # 10 Fall 2003
Hydra Conclusions
Hydra offers a number of advantages:
– Good performance on parallel applications
– Promising performance on difficult-to-parallelize uniprocessor applications, using its data/thread speculation mechanisms
– Scalable, modular design
– Low hardware overhead for supporting speculative thread parallelism, which greatly increases the number of parallelizable applications

EECC722 - Shaaban #36 lec # 10 Fall 2003
Other Thread Level Speculation (TLS) Efforts: Wisconsin Multiscalar
This CMP design proposed the first reasonable hardware to implement TLS.
Unlike Hydra, Multiscalar implements a ring-like network between all of the processors to allow direct register-to-register communication:
– Along with hardware-based thread sequencing, this type of communication allows much smaller threads to be exploited, at the expense of more complex processor cores.
The designers proposed two different speculative memory systems to support the Multiscalar core:
– The first was a unified primary cache, or address resolution buffer (ARB). Unfortunately, the ARB has most of the complexity of Hydra’s secondary cache buffers at the primary cache level, making it difficult to implement.
– Later, they proposed the speculative versioning cache (SVC). The SVC uses write-back primary caches to buffer speculative writes in the primary caches, using a sophisticated coherence scheme.

EECC722 - Shaaban #37 lec # 10 Fall 2003
Other Thread Level Speculation (TLS) Efforts: Carnegie-Mellon Stampede
This CMP-with-TLS proposal is very similar to Hydra, including the use of software speculation handlers. However, the hardware is simpler than Hydra’s.
The design uses write-back primary caches to buffer writes (similar to those in the SVC) and sophisticated compiler technology to explicitly mark all memory references that require forwarding to another speculative thread.
Their simplified SVC must drain its speculative contents as each thread completes, unfortunately resulting in heavy bursts of bus activity.

EECC722 - Shaaban #38 lec # 10 Fall 2003
Other Thread Level Speculation (TLS) Efforts: MIT M-machine
This CMP design has three processors that share a primary cache and can communicate register-to-register through a crossbar. Each processor can also switch dynamically among several threads (TLS & SMT??).
As a result, the hardware connecting the processors together is quite complex and slow.
However, programs executed on the M-machine can be parallelized using very fine-grain mechanisms that are impossible on an architecture that shares resources only outside of the processor cores, like Hydra.
Performance results show that on typical applications extremely fine-grained parallelization is often not as effective as parallelism at the levels that Hydra can exploit; the overhead incurred by frequent synchronizations reduces its effectiveness.