An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors
Vijay S. Pai, Parthasarathy Ranganathan, Sarita V. Adve, and Tracy Harton

Presentation transcript:

Motivation
The memory consistency model determines the extent to which memory operations may be overlapped or reordered (a small illustration follows below).
Previous SC vs. RC comparisons assumed:
– Single-issue, statically scheduled processors
– Blocking reads
– Straightforward implementations & trace-driven simulations
This work performs a quantitative comparison of several implementations of SC & RC with ILP processors:
– Hardware prefetching & speculative loads
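The SC vs. RC distinction can be made concrete with the classic two-flag example. The following is a minimal C++ sketch (an illustration, not from the paper): with sequentially consistent atomics the outcome r1 == 0 && r2 == 0 is impossible, whereas relaxed orderings, which permit the kind of overlap and reordering RC-style models allow, make it a legal outcome.

```cpp
// Minimal sketch (not from the paper): the two-flag example that separates
// SC from relaxed models, expressed with C++11 atomics.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1_sc() {
    x.store(1, std::memory_order_seq_cst);   // SC: program order preserved
    r1 = y.load(std::memory_order_seq_cst);
}
void thread2_sc() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

// With relaxed orders (standing in for an RC-style model), the store and the
// following load may be overlapped or reordered, so r1 == 0 && r2 == 0
// becomes a legal outcome.
void thread1_relaxed() {
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed);
}
void thread2_relaxed() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread t1(thread1_sc), t2(thread2_sc);
    t1.join(); t2.join();
    assert(!(r1 == 0 && r2 == 0));   // cannot fire under seq_cst
}
```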

Current Implementations
A simple implementation prohibits an operation from entering the memory system until all previous operations have completed.
Consistency optimizations:
– Hardware-controlled non-binding prefetch
  - SC: prefetch remote data for reads
  - RC: prefetch reads past an acquire
  - Store prefetch (prefetch ownership for writes)
– Speculative load execution (see the sketch below)
  - Speculative load buffer
  - Data remains visible to the coherence mechanism
  - Reissue & rollback when a buffered line is invalidated or updated
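The speculative load buffer is a hardware structure; as a rough software model of its logic (assumed names and structure, not the paper's hardware or RSIM code), it could be sketched as:

```cpp
// Rough software model (assumed structure) of speculative load execution
// with a speculative load buffer.
#include <algorithm>
#include <cstdint>
#include <vector>

struct SpecLoadEntry {
    std::uint64_t line_addr;   // cache line the speculative load read
    unsigned rob_index;        // where the load sits in the instruction window
};

struct SpecLoadBuffer {
    std::vector<SpecLoadEntry> entries;

    // A load issued earlier than the consistency model would normally allow
    // records the line it read.
    void record(std::uint64_t line_addr, unsigned rob_index) {
        entries.push_back({line_addr, rob_index});
    }

    // Buffered lines stay visible to the coherence protocol: an incoming
    // invalidation/update matching a buffered line means the speculative
    // value may violate the consistency model, so the load is reissued and
    // younger instructions are flushed.  Returns the ROB index to roll back
    // from, or -1 if nothing matches.
    int on_coherence_event(std::uint64_t line_addr) const {
        for (const SpecLoadEntry& e : entries)
            if (e.line_addr == line_addr)
                return static_cast<int>(e.rob_index);
        return -1;
    }

    // Once the load retires in order, its entry can be dropped.
    void retire(unsigned rob_index) {
        entries.erase(std::remove_if(entries.begin(), entries.end(),
                                     [rob_index](const SpecLoadEntry& e) {
                                         return e.rob_index == rob_index;
                                     }),
                      entries.end());
    }
};
```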

Evaluation Methodology
Hardware cache-coherent multiprocessor:
– 3-state directory protocol, 2-D mesh network
– Each node has an ILP processor, two levels of cache, and part of the main memory & directory
Implementations studied:
– Simple SC: issue a memory operation only after the previous one has completed
– Hardware prefetching: prefetch into the primary cache when it is write-back write-allocate; into the secondary cache when the primary is write-through no-write-allocate (rule sketched below)
– Speculative load execution: stop the violating load from retiring, reissue it, and flush the instruction window; used by SC to issue loads out of order and by RC for loads past an acquire
Simulated with RSIM, an execution-driven simulator.
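The prefetch-placement rule is simple enough to state as code; this is a tiny sketch with assumed type names, not simulator code:

```cpp
// Prefetched lines go into the L1 only when it is write-back write-allocate;
// with a write-through, no-write-allocate L1 they are placed in the L2.
enum class L1Policy { WriteBackWriteAllocate, WriteThroughNoWriteAllocate };
enum class CacheLevel { L1, L2 };

CacheLevel prefetch_destination(L1Policy l1) {
    return l1 == L1Policy::WriteBackWriteAllocate ? CacheLevel::L1
                                                  : CacheLevel::L2;
}
```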

Evaluation Methodology (cont.)
Metrics (accounting sketched below):
– Execution time divided into CPU (busy) time & stall components
– A cycle is counted as busy if the maximum possible number of instructions retires in it; otherwise it is charged to a stall-time component
Applications:
– Radix, FFT, LU, Water, MP3D, Erlebacher
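A minimal sketch of that busy/stall accounting, assuming per-cycle retirement counts are available (names are illustrative, not RSIM's):

```cpp
#include <cstddef>
#include <vector>

// A cycle is "busy" only if the full retire width is used; every other cycle
// is charged to the stall-time components.
struct TimeBreakdown {
    std::size_t busy_cycles = 0;
    std::size_t stall_cycles = 0;
};

TimeBreakdown classify_cycles(const std::vector<unsigned>& retired_per_cycle,
                              unsigned retire_width) {
    TimeBreakdown t;
    for (unsigned retired : retired_per_cycle) {
        if (retired == retire_width)
            ++t.busy_cycles;     // counted as CPU (busy) time
        else
            ++t.stall_cycles;    // attributed to a stall component
    }
    return t;
}
```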

Evaluation
SC system with a first-level write-through cache:
– Prefetching improves performance, but with only small improvements for some applications
– Speculative load execution leads to a factor-of-two speedup
– Neither technique reduces the large store latency
SC system with a first-level write-back cache:
– The contribution of write latency to execution time decreases
– Hardware prefetching and speculative loads give similar benefits in execution time & read latency
– LU gets a reduction in store stall time

Evaluation (cont.)
RC systems:
– The optimizations do not provide much improvement; the best improvement is 7.7%, for Water
– A write-through L1 cache performs similarly to a write-back L1
RC vs. SC:
– Even simple RC performs better than the most optimized SC, more so with a write-through L1; the gap can be even larger
Aggressive protocol:
– Delays the ownership request for writes to lines that have pending reads
– Tries to improve the overlap of ownership requests
– Approximated using software prefetch instructions (see the sketch below)
– With it, SC achieves reduced store latency with one exception; RC does not gain much improvement
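The aggressive protocol itself is a hardware change; a rough software analogue of the software-prefetch approximation is to issue a prefetch-for-write early so the ownership request overlaps with independent work (GCC/Clang's __builtin_prefetch is used here purely for illustration):

```cpp
// Rough software analogue (illustration only, not the paper's protocol
// change): request ownership of a line early with a prefetch-for-write so
// its latency overlaps with unrelated computation before the store.
// __builtin_prefetch is a GCC/Clang builtin; rw = 1 means "prefetch for write".
long independent_work() { return 42; }   // stands in for unrelated computation

void buffered_update(long* p) {
    __builtin_prefetch(p, /*rw=*/1);   // start fetching ownership of *p's line
    long v = independent_work();       // ownership latency overlaps with this
    *p += v;                           // the store ideally finds the line owned
}
```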

Techniques for Tolerating Acquire Latency
RC assumes all operations after an acquire depend on it.
Fuzzy acquire (sketched below):
– Acquire = non-blocking load + barrier
– Independent operations can be inserted between the read & the barrier
Selective acquire:
– Uses an arithmetic instruction to explicitly and selectively establish dependencies on the acquire
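A software rendering of the fuzzy-acquire idea using C++11 atomics (variable names are assumed, and a real ILP processor would also overlap the flag load itself with the independent work):

```cpp
// Sketch of a fuzzy acquire: the acquire is split into a non-blocking
// (relaxed) load of the synchronization flag plus a later acquire fence, and
// work that does not depend on the protected data sits between the two.
#include <atomic>

std::atomic<int> flag{0};               // releasing thread stores 1 here
int protected_data = 0;                 // written before the releasing store
int independent_work() { return 1; }    // stands in for unrelated computation

int consumer() {
    while (flag.load(std::memory_order_relaxed) == 0) { /* spin */ }
    int r = independent_work();         // overlapped with the acquire latency
    std::atomic_thread_fence(std::memory_order_acquire);   // the "barrier"
    return protected_data + r;          // dependent reads only after the fence
}
```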