Presented to CS258 on 3/12/08 by David McGrogan

Slides:



Advertisements
Similar presentations
1 Episode III in our multiprocessing miniseries. Relaxed memory models. What I really wanted here was an elephant with sunglasses relaxing On a beach,
Advertisements

1 Lecture 20: Synchronization & Consistency Topics: synchronization, consistency models (Sections )
Memory Consistency Models Kevin Boos. Two Papers Shared Memory Consistency Models: A Tutorial – Sarita V. Adve & Kourosh Gharachorloo – September 1995.
Mehmet Can Vuran, Instructor University of Nebraska-Lincoln Acknowledgement: Overheads adapted from those provided by the authors of the textbook.
Relaxed Consistency Models. Outline Lazy Release Consistency TreadMarks DSM system.
1 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers By Sreemukha Kandlakunta Phani Shashank.
1 Adapted from UCB CS252 S01, Revised by Zhao Zhang in IASTATE CPRE 585, 2004 Lecture 14: Hardware Approaches for Cache Optimizations Cache performance.
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
CS 162 Memory Consistency Models. Memory operations are reordered to improve performance Hardware (e.g., store buffer, reorder buffer) Compiler (e.g.,
Rung-Bin Lin Chapter 4: Exploiting Instruction-Level Parallelism with Software Approaches4-1 Chapter 4 Exploiting Instruction-Level Parallelism with Software.
Intro to Computer Org. Pipelining, Part 2 – Data hazards + Stalls.
1 Lecture 20: Speculation Papers: Is SC+ILP=RC?, Purdue, ISCA’99 Coherence Decoupling: Making Use of Incoherence, Wisconsin, ASPLOS’04 Selective, Accurate,
Is SC + ILP = RC? Presented by Vamshi Kadaru Chris Gniady, Babak Falsafi, and T. N. VijayKumar - Purdue University Spring 2005: CS 7968 Parallel Computer.
Scalable Load and Store Processing in Latency Tolerant Processors Amit Gandhi 1,2 Haitham Akkary 1 Ravi Rajwar 1 Srikanth T. Srinivasan 1 Konrad Lai 1.
Speculative Sequential Consistency with Little Custom Storage Impetus Group Computer Architecture Lab (CALCM) Carnegie Mellon University
1 Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory.
By Sarita Adve & Kourosh Gharachorloo Review by Jim Larson Shared Memory Consistency Models: A Tutorial.
1 Lecture 7: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models.
Lecture 13: Consistency Models
Computer Architecture II 1 Computer architecture II Lecture 9.
1 Lecture 15: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models.
1 Lecture 12: Relaxed Consistency Models Topics: sequential consistency recap, relaxing various SC constraints, performance comparison.
7/2/ _23 1 Pipelining ECE-445 Computer Organization Dr. Ron Hayne Electrical and Computer Engineering.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
Pipeline Hazard CT101 – Computing Systems. Content Introduction to pipeline hazard Structural Hazard Data Hazard Control Hazard.
Shared Memory Consistency Models: A Tutorial Sarita V. Adve Kouroush Ghrachorloo Western Research Laboratory September 1995.
Memory Consistency Models Alistair Rendell See “Shared Memory Consistency Models: A Tutorial”, S.V. Adve and K. Gharachorloo Chapter 8 pp of Wilkinson.
By Sarita Adve & Kourosh Gharachorloo Slides by Jim Larson Shared Memory Consistency Models: A Tutorial.
Memory Consistency Models. Outline Review of multi-threaded program execution on uniprocessor Need for memory consistency models Sequential consistency.
Release Consistency Yujia Jin 2/27/02. Motivations Place partial order on memory accesses for correct parallel program behavior Relax partial order for.
Thread Level Parallelism Since ILP has inherent limitations, can we exploit multithreading? –a thread is defined as a separate process with its own instructions.
CS533 Concepts of Operating Systems Jonathan Walpole.
Implementing Precise Interrupts in Pipelined Processors James E. Smith Andrew R.Pleszkun Presented By: Shrikant G.
1 Lecture 20: Speculation Papers: Is SC+ILP=RC?, Purdue, ISCA’99 Coherence Decoupling: Making Use of Incoherence, Wisconsin, ASPLOS’04.
CS 258 Spring The Expandable Split Window Paradigm for Exploiting Fine- Grain Parallelism Manoj Franklin and Gurindar S. Sohi Presented by Allen.
An Evaluation of Memory Consistency Models for Shared- Memory Systems with ILP processors Vijay S. Pai, Parthsarthy Ranganathan, Sarita Adve and Tracy.
1 Adapted from UC Berkeley CS252 S01 Lecture 17: Reducing Cache Miss Penalty and Reducing Cache Hit Time Hardware prefetching and stream buffer, software.
Chapter Six.
Lecture 20: Consistency Models, TM
Computer Architecture Chapter (14): Processor Structure and Function
Software Coherence Management on Non-Coherent-Cache Multicores
Multiscalar Processors
Memory Consistency Models
Lecture 11: Consistency Models
Memory Consistency Models
5.2 Eleven Advanced Optimizations of Cache Performance
Morgan Kaufmann Publishers The Processor
Lecture 6: Advanced Pipelines
Lecture 14: Reducing Cache Misses
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Shared Memory Consistency Models: A Tutorial
Adapted from the slides of Prof
15-740/ Computer Architecture Lecture 5: Precise Exceptions
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
Chapter Six.
Multiprocessor Highlights
Lecture 22: Consistency Models, TM
Shared Memory Consistency Models: A Tutorial
* From AMD 1996 Publication #18522 Revision E
Adapted from the slides of Prof
Lecture 10: Consistency Models
Latency Tolerance: what to do when it just won’t go away
Relaxed Consistency Part 2
Why we have Counterintuitive Memory Models
Update : about 8~16% are writes
Lecture: Consistency Models, TM
Lecture 11: Relaxed Consistency Models
Is SC + ILP = RC? C. Gniady, B. Falsafi, and T.N. Vijaykumar - Purdue
Is SC + ILP = RC? Chris Gniady, Babak Falsafr, and T.N. Vijaykumar
Lecture 11: Consistency Models
Presentation transcript:

Presented to CS258 on 3/12/08 by David McGrogan Is SC + ILP = RC? Presented to CS258 on 3/12/08 by David McGrogan

Review: Acronym Salad SC: Sequential Consistency Memory operations occur in program order ILP: Instruction Level Parallelism Operations moved around in hardware RC: Release Consistency Memory operations may be re-sequenced except across releases of synch. variables

Ease vs. Performance Sequential consistency Release consistency Intuitive for programmer Limits performance-enhancing hardware Release consistency Requires manual classification of mem. access Provides excellent performance

Improving SC, part 1 SC requires only that the result be equivalent to operations in program order Adding ILP to SC allows better performance

How it works Load/store queue maintains program order Exception/mispredict causes reorder buffer rollback Queued block prefetch Loads reordered

What about stores? In standard SC, reorder buffer blocks on loads if a store is pending Pipeline can stall on long stores as buffer/ queue fill

The RC ideal Most memory accesses can overlap in any given order Store buffering to bypass pending stores Binding prefetches – fetch before load reaches head of reorder buffer Speculative relaxation even across fences

Improving SC, part 2 SC requires only that the result be equivalent to operations in program order Anything allowed as long as this is upheld Let’s go nuts

Rampant Speculation SC++ relaxes all memory access orders State changes after pending stores kept in history, can be undone if necessary Undo happens on external coherence action Impossible without additional hardware

Additional Hardware Speculative History Queue (SHiQ) logs modifications to processor state and L1 cache, permits undo to oldest pending store Spec. stores done on L1 cache, not queued BLT speeds detection of rollback necessity

Avoiding Deadlock After rollback, all pending stores must complete before speculation can resume Coherence handling paused during rollback

Sources of High Rollback Data races in application Significant false sharing Inevitable cache conflicts A DSM system won’t help apps like this no matter what; communication dominates

Performance Better than SC, about as good as RC

Now with 4x net latency Large SHiQ allows SC to handle laggy networks

Is SC++ really necessary? A huge reorder buffer doesn’t save SC in every case

Speculative Stores Not always important, but sometimes very significant

The Importance of L2 Too small, and the miss rate goes up Some apps cause many conflict misses

Wrapup SC++ achieves its gains by mimicking RC Maintains convenience to programmer at expense of added hardware Later results provide further improvement