Is SC + ILP = RC? By Chris Gniady, Babak Falsafi, and T. N. Vijaykumar (Purdue University). Presented by Vamshi Kadaru. Spring 2005: CS 7968 Parallel Computer Architecture.

Presentation transcript:

Is SC + ILP = RC?
Chris Gniady, Babak Falsafi, and T. N. Vijaykumar (Purdue University)
Presented by Vamshi Kadaru
Spring 2005: CS 7968 Parallel Computer Architecture

Introduction
- Availability of multiprocessors (how to maximize performance?)
- Atomicity of operations (synchronization)
- Allow in-order processors to overlap store latency with other work (i.e., bypassing loads, overlapping with network latency, etc.)
- Allow processors to execute out of order (speculation)
- There is a trade-off between programmability and performance
- To simplify programming, implement a shared-memory abstraction

Memory Models
- Shared-memory systems implement memory consistency models
- Different models make different guarantees; the processor can reorder or overlap memory operations as long as the guarantees are upheld
- Sequential Consistency (SC) is the simplest model: memory operations must appear to execute in program order
- Relaxed memory models require only some memory operations to perform in program order
- Release Consistency (RC) is the most relaxed of these models and offers the best performance
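To make the difference between the guarantees concrete, here is a minimal sketch (not from the slides) of the classic store-buffering litmus test. It enumerates every global order of four memory operations, once keeping each processor's program order (as SC requires) and once allowing any order (a stand-in for a model in which loads may bypass earlier stores). The names `P0`, `P1`, `run`, and `outcomes` are illustrative assumptions.

```python
from itertools import permutations

# Store-buffering (Dekker-style) litmus test; x and y start at 0.
# P0: x = 1; r0 = y        P1: y = 1; r1 = x
P0 = [("store", "x", 1), ("load", "y", "r0")]
P1 = [("store", "y", 1), ("load", "x", "r1")]


def run(schedule):
    """Execute one global order of (proc, op) pairs against a single memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for _, (kind, addr, arg) in schedule:
        if kind == "store":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return regs["r0"], regs["r1"]


def outcomes(keep_program_order):
    """Collect every (r0, r1) result over all global orders of the four ops."""
    ops = [(0, op) for op in P0] + [(1, op) for op in P1]
    seen = set()
    for order in permutations(ops):
        if keep_program_order:
            # SC: each processor's operations appear in program order.
            in_order = all(
                [op for p, op in order if p == pid] == prog
                for pid, prog in ((0, P0), (1, P1))
            )
            if not in_order:
                continue
        seen.add(run(order))
    return seen


sc = outcomes(keep_program_order=True)
relaxed = outcomes(keep_program_order=False)
print("SC outcomes:     ", sorted(sc))
print("Relaxed outcomes:", sorted(relaxed))
```

Under SC the result r0 = r1 = 0 never appears, because it would require each load to precede the other processor's store while following its own. Once loads are allowed to bypass earlier stores, that result becomes possible, which is why programs written for relaxed models need explicit ordering instructions.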

Current Memory Consistency Models
- Sequential Consistency (SC): HP and MIPS processors
- Processor Consistency (PC): Intel processors
- Total Store Order (TSO): Sun SPARC
- Release Consistency (RC): Sun SPARC, DEC Alpha, IBM PowerPC

Current Optimizations
Techniques used to exploit ILP:
- Branch prediction
- Executing multiple instructions per cycle
- Non-blocking caches to overlap memory operations
- Out-of-order execution
- Precise exceptions and speculative execution, implemented with a reorder buffer

Comparing SC and RC
Sequential Consistency (SC)
- Guarantees memory order in hardware
- Easier to program
- Conservative ordering limits performance
Release Consistency (RC)
- Guarantees memory order in software
- Harder to program; more burden on the programmer
- Achieves the highest performance because ordering is enforced only where the programmer asks for it explicitly

SC Implementations
Current SC implementations use ILP optimizations:
- Hardware prefetching and non-blocking caches to overlap loads and stores using the reorder buffer
- Speculative load execution using the reorder buffer and a special history buffer to roll back in case of an invalidation
Limitations:
- Stores cannot bypass other memory operations
- Long-latency remote stores cause the relatively small reorder buffer and load/store queue to fill up, blocking the pipeline
- Capacity and conflict misses in small L2 caches cause frequent rollbacks
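The speculative load execution described above can be sketched as follows. This is a toy model under stated assumptions, not the hardware mechanism itself: `SpeculativeLoadWindow` and its methods are hypothetical names, memory is a plain dictionary, and a rollback simply re-executes the affected load and every younger load.

```python
class SpeculativeLoadWindow:
    """Toy model of in-flight loads that executed before they may retire."""

    def __init__(self, memory):
        self.memory = memory
        self.inflight = []            # [addr, value] pairs in program order

    def speculative_load(self, addr):
        value = self.memory.get(addr, 0)
        self.inflight.append([addr, value])
        return value

    def invalidate(self, addr, new_value):
        """Another processor wrote addr. Returns the number of loads squashed."""
        self.memory[addr] = new_value
        for i, (a, _) in enumerate(self.inflight):
            if a == addr:
                # Roll back from the first affected load onward, then replay.
                squashed = self.inflight[i:]
                self.inflight = self.inflight[:i]
                for sq_addr, _ in squashed:
                    self.speculative_load(sq_addr)
                return len(squashed)
        return 0

    def retire_all(self):
        values = [v for _, v in self.inflight]
        self.inflight = []
        return values


window = SpeculativeLoadWindow({"a": 1, "b": 2})
window.speculative_load("a")
window.speculative_load("b")
squashed = window.invalidate("b", 7)   # remote write hits a speculative load
retired = window.retire_all()
late = window.invalidate("a", 9)       # nothing in flight, so no rollback
print(squashed, retired, late)
```

The point of the sketch is that a load may run early as long as the hardware can detect that its value changed before retirement and redo the work; other processors never observe the early value.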

RC Implementations
- RC lets the programmer specify ordering constraints (fence instructions) among specific memory operations to enforce order
- RC implementations use store buffering to allow loads and stores to bypass pending stores
- Unlike SC, RC can use binding prefetches to perform loads in the reorder buffer
- RC can also relax ordering among fence instructions and use rollback mechanisms if there is a memory model violation
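Store buffering and fences can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not a description of any real RC processor: `StoreBufferCore` is a hypothetical class, the buffer is a FIFO, and `fence()` simply drains every pending store to shared memory in order.

```python
from collections import deque


class StoreBufferCore:
    """Toy core: stores wait in a FIFO buffer; loads may bypass them."""

    def __init__(self, memory):
        self.memory = memory          # shared dict: address -> value
        self.buffer = deque()         # pending (address, value) stores

    def store(self, addr, value):
        # The store retires immediately; memory is updated later.
        self.buffer.append((addr, value))

    def load(self, addr):
        # Forward from the youngest pending store to the same address, if any.
        for a, v in reversed(self.buffer):
            if a == addr:
                return v
        # Otherwise the load bypasses the pending stores and reads memory.
        return self.memory.get(addr, 0)

    def fence(self):
        # Ordering point: make every pending store globally visible, in order.
        while self.buffer:
            addr, value = self.buffer.popleft()
            self.memory[addr] = value


memory = {}
p0, p1 = StoreBufferCore(memory), StoreBufferCore(memory)

p0.store("data", 42)
p0.store("flag", 1)
own_view = p0.load("data")     # forwarded from p0's own buffer
before = p1.load("flag")       # p0's stores are still buffered
p0.fence()                     # publish data, then flag
after = (p1.load("flag"), p1.load("data"))
print(own_view, before, after)
```

Until the fence, the other core cannot see either store, while the storing core keeps running and reads its own pending values. That overlap of store latency with later work is what SC hardware cannot do without speculation.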

SC Programmability with RC Performance
SC can approach RC if hardware provides support for:
- Speculatively relaxing the order of loads and stores
- Loads and stores that appear to take place atomically and in program order
- Instructions that are allowed to execute out of program order
- Remembering processor state for rollbacks
Limitations (costs):
- Memory order is arbitrary; no guarantees
- Rollbacks must be infrequent (enough space is needed for speculative state)

SC++ Architecture
- Modelled after the R10k
- SHiQ: allows for prefetching and non-blocking caches; other processors see SC
- History buffer: allows speculative retirement, which unblocks the RoB
- Stores: the load/store queue takes stores from the RoB
- BLT: holds the block addresses for the SHiQ
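A rough software analogy for how a history queue and a block lookup table could cooperate is sketched below. Everything here is an illustrative assumption: the class `SpeculativeHistory`, the 64-byte block size, the dictionary standing in for local processor and cache state, and the simplification that a conflict undoes the whole queue rather than only the entries from the conflicting one onward.

```python
from collections import Counter, deque

BLOCK_BYTES = 64  # assumed cache-block size, for illustration only


class SpeculativeHistory:
    """Toy SHiQ/BLT pair: log old values, index them by block address."""

    def __init__(self, memory, capacity):
        self.memory = memory
        self.capacity = capacity
        self.shiq = deque()           # (addr, old_value), oldest first
        self.blt = Counter()          # block address -> live SHiQ entries

    def speculative_store(self, addr, value):
        if len(self.shiq) >= self.capacity:
            return False              # queue full: speculation must stall
        self.shiq.append((addr, self.memory.get(addr, 0)))
        self.blt[addr // BLOCK_BYTES] += 1
        self.memory[addr] = value
        return True

    def commit_oldest(self):
        """The oldest entry is no longer speculative; drop its history."""
        addr, _ = self.shiq.popleft()
        self._release(addr)

    def external_invalidate(self, addr):
        """Roll back if a speculatively accessed block is invalidated."""
        if self.blt[addr // BLOCK_BYTES] == 0:
            return 0                  # fast path: no SHiQ search needed
        undone = 0
        while self.shiq:              # undo youngest first
            a, old = self.shiq.pop()
            self.memory[a] = old
            self._release(a)
            undone += 1
        return undone

    def _release(self, addr):
        block = addr // BLOCK_BYTES
        self.blt[block] -= 1
        if self.blt[block] == 0:
            del self.blt[block]


state = {0: 5, 128: 9}
history = SpeculativeHistory(state, capacity=2)
ok_first = history.speculative_store(0, 6)
ok_second = history.speculative_store(128, 10)
ok_third = history.speculative_store(256, 1)   # rejected: queue is full
miss = history.external_invalidate(4096)       # block not tracked
undone = history.external_invalidate(130)      # same block as address 128
print(ok_third, miss, undone, state)
```

The lookup table answers the common case (an invalidation for a block with no speculative state) without scanning the queue, and the bounded capacity shows why a small queue limits how much latency speculation can absorb.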

Experimental Setup
- Simulator: RSIM, on an 8-node DSM
- Each DSM node is an R10k-like processor
- Memory model implementations use:
  - Non-blocking caches
  - Hardware prefetching for loads and stores
  - Speculative load execution
- No speculative retirement is done in either SC or RC

Base System Configuration
- Each R10k processor node uses the base configuration (given in a table on the original slide)
- Large L2 cache: eliminates capacity and conflict misses
- The base configuration is used unless otherwise specified

Some points to remember
SC and RC implementations:
- Use non-blocking caches
- Use hardware prefetching for loads and stores
- Perform speculative loads
SC++ uses:
- Speculative History Queue (SHiQ)
- Block Lookup Table (BLT)
Rollback costs:
- Rollbacks due to instructions in the reorder buffer take one cycle
- Rollbacks due to instructions in the SHiQ take 4 cycles

Results: Base System
- Speedup is normalized to that of the SC implementation
- RC is better than SC; the gain is largest for radix
- SC++ performs better than or equal to RC; for raytrace it performs much better

Results: Network Latency
- Network latency is increased by 4x
- RC hides the network latency by overlapping stores
- SC++inf keeps up with RC
- raytrace performs worse, since the longer network latency dominates its lock patterns

Results: Reorder Buffer Size
- A larger reorder buffer allows more prefetch time
- Speeds up both SC and RC: hides store latencies by allowing more time for prefetches
- In raytrace, there is no speedup for either SC or RC: memory operations don't overlap much
- In structured, the gap grows, due to an increase in the number of rollbacks in SC

Results: SHiQ Size and Speculative Stores
- The absence of speculative stores causes a significant performance loss in radix and raytrace
- Reducing the SHiQ size leads to performance degradation in em3d and radix

Results: L2 Cache Size
Two effects of a smaller L2 cache:
- Less room for speculative state, so the performance gap widens
- Many load misses for both SC and RC, which might narrow the performance gap
Observations:
- lu and radix: the high load miss rate degrades performance
- SC++ is also sensitive to rollbacks due to replacements

Conclusions
- SC can perform as well as RC if hardware provides enough support for speculation
- SC++ allows speculative bypassing for both loads and stores
- SC++ minimizes additional overhead on the processor pipeline's critical paths by using the following structures:
  - SHiQ: stores speculative state and absorbs remote latencies
  - BLT: allows fast lookups in the SHiQ