Evaluation of Memory Consistency Models in Titanium


Introduction
What is the correct ordering of memory operations?
—Uniprocessor: follow program order; reads and writes to different locations can be reordered safely
—Multiprocessor: the correct semantics are less clear; memory operations from different processors are not inherently ordered
Memory consistency models impose restrictions on the ordering of shared memory accesses
—Strict: easy to program, but prevents many optimizations and is generally thought to be slow
—Relaxed: better performance, but harder to program correctly

Sequential Consistency
Definition [Lamport 79]:
—A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program
Program order is enforced within each processor
Memory accesses appear to execute atomically
Few optimizations possible (no register allocation of shared values, no write pipelining, etc.)
All four program-order edges are preserved: Read→Read, Read→Write, Write→Read, Write→Write
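
As a concrete illustration (not from the original slides), here is a minimal C sketch of why SC rules out register allocation of shared variables; the variable and function names are invented for the example.

    /* Shared flag, initially 0; another processor eventually sets it to 1. */
    int flag = 0;

    void wait_for_flag(void) {
        /* Under SC every iteration must perform a fresh read of flag.
         * If the compiler register-allocates flag (one load, then spinning
         * on the register copy), the loop may never terminate -- a result
         * no sequentially consistent execution could produce. */
        while (flag == 0) {
            /* spin */
        }
    }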

Weak Ordering
—Relaxes all program order requirements
—Ordering is preserved only at synchronization points; reads and writes between synchronization points can be performed in any order
—Explicit sync instructions are used to enforce program order where needed
—Hides almost all memory latency and allows most reordering optimizations; roughly 2x over a naïve SC implementation in some studies
—Used by PowerPC (the sync instruction); between sync points, reads and writes may be freely reordered
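
A hedged C sketch (not from the slides) of how a program on a weakly ordered machine restores the ordering it needs at synchronization points. It assumes GCC's __sync_synchronize() builtin as the full fence; the variable and function names are invented.

    /* Shared variables, initially zero (hypothetical names).
     * volatile keeps the compiler from caching them in registers. */
    volatile int data  = 0;
    volatile int ready = 0;

    void producer(void) {
        data = 42;
        __sync_synchronize();   /* sync point: the write to data completes first */
        ready = 1;
    }

    void consumer(void) {
        while (ready == 0)
            ;                   /* spin until the flag is observed */
        __sync_synchronize();   /* sync point: the read of data cannot move earlier */
        int x = data;           /* now guaranteed to observe 42 */
        (void)x;
    }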

Effects of Consistency Models on Program Understanding

// initially A = B = 0
P1:                      P2:
A = 1;                   B = 1;
if (B == 0) foo();       if (A == 0) foo();

—SC: either P1 or P2 (or neither) will call foo(), but not both
—WC: foo() may be called by both P1 and P2, because each write may be reordered after the following read (W→R reordering)
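
For reference, a runnable C/pthreads version of this example (not part of the original slides). It is intentionally racy, so whether both threads end up calling foo() depends on the compiler and the hardware; treat it as an illustration, not a guaranteed demonstration.

    #include <pthread.h>
    #include <stdio.h>

    /* Deliberately plain ints: under a relaxed model each store below may be
     * delayed past the following load (store buffering), so both threads can
     * end up calling foo(). */
    int A = 0, B = 0;

    void foo(const char *who) { printf("foo() called by %s\n", who); }

    void *p1(void *arg) {
        (void)arg;
        A = 1;
        if (B == 0) foo("P1");
        return NULL;
    }

    void *p2(void *arg) {
        (void)arg;
        B = 1;
        if (A == 0) foo("P2");
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }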

Enforce SC Using the Compiler
—The compiler can hide the weak consistency model of the underlying machine by inserting barriers
—This gives programmers more flexibility: they can choose between simplicity and performance
SC can be violated at two levels:
—Software level: register allocation, code motion, common subexpression elimination, etc.
—Hardware level: memory access reordering, reads bypassing pending writes, non-blocking reads, etc.
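
A small C sketch (my own, not from the slides) of a software-level violation: common subexpression elimination merging two reads of a shared variable. The names X, Y, and reader are hypothetical.

    /* Shared variables, initially 0.  Assume another processor executes
     * the writes in this order:  X = 1;  then  Y = 1;                   */
    extern int X, Y;

    void reader(int *r) {
        r[0] = X;            /* first read of X  */
        r[1] = Y;            /* read of Y        */
        r[2] = X;            /* second read of X */
        /* CSE may rewrite the last line as r[2] = r[0].  Under SC,
         * observing Y == 1 implies the earlier write X = 1 is already
         * visible, so r[1] == 1 forces r[2] == 1.  With the two reads
         * merged, r[1] == 1 && r[2] == 0 becomes possible -- an outcome
         * no sequentially consistent execution allows.                 */
    }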

Enforcing SC in Titanium
Titanium compiler: disallow optimizations that could violate SC
—Only one such optimization: lifting loop-invariant expressions out of loops
GCC: one-at-a-time approach, adding fence instructions between accesses to non-stack variables
—asm volatile ($fence : : : "memory")
Hardware:
—Pentium III: $fence = a locked instruction
—Power3: $fence = the sync instruction
Could do better by applying delay set analysis
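
A sketch of what the $fence placeholder might expand to on the two machines mentioned. The slides only say "locked instruction" and "sync instruction", so the concrete choices below (a locked add on x86, sync on PowerPC) are my assumptions, not necessarily what the Titanium backend emits.

    /* Fence inserted by the compiler between accesses to non-stack data. */
    #if defined(__i386__)
      /* Pentium III: any locked instruction acts as a full memory fence. */
      #define FENCE() asm volatile ("lock; addl $0,0(%%esp)" : : : "memory")
    #elif defined(__powerpc__)
      /* Power3: sync orders all earlier memory operations before later ones. */
      #define FENCE() asm volatile ("sync" : : : "memory")
    #else
      #define FENCE() asm volatile ("" : : : "memory")  /* compiler barrier only */
    #endif

    extern int shared_a, shared_b;

    void update(void) {
        shared_a = 1;
        FENCE();        /* keeps the two shared stores in program order */
        shared_b = 2;
    }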

Data Sharing Analysis
—If data is private, there is no need to enforce sequential consistency on its memory operations
—Naïve approach: fence on every memory access (e.g., in C)
—Titanium SC: distinguishes between stack (private) and heap (shared) data
Titanium sharing inference:
—Late: no dereference/assignment of global pointers to private data
—Early: shared data cannot transitively reach private data
—Late qualifies more variables as private
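
A hedged C analogue (names invented) of why the stack/heap distinction matters: fences are only needed around accesses to potentially shared data.

    /* Heap-allocated array, potentially visible to other processors. */
    extern int *shared_counts;

    void tally(int n) {
        int local_sum = 0;          /* stack data: private to this thread      */
        for (int i = 0; i < n; i++) /* no fences needed anywhere in this loop  */
            local_sum += i;

        /* Only this access needs to be bracketed by fences under SC,
         * because shared_counts may be read or written by other processors. */
        shared_counts[0] = local_sum;
    }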

Experimental Setup
Machines:
—Millennium node: 4-way Pentium III (PC memory model)
—Seaborg: 16-way IBM Power3 chips (WO memory model)
Benchmarks (Titanium programs):
—Pi: Monte Carlo simulation
—Sample-sort: distributed sorting algorithm
—Cannon: matrix multiplication
—3d-fft: 3D fast Fourier transform
—Fish-grid: n-body simulation
—Pps: Poisson solver on a uniform grid
—Amr: Poisson solver on an adaptive mesh

Running Time on Millennium Node
—10% to 3x performance gap between SC and the relaxed model
—Sharing analysis does not help except for the small benchmarks

Performance Analysis
—Effect of Titanium compiler optimizations
—Effectiveness of sharing inference
—Effect of GCC optimizations
—Effect of architecture reordering
—Hardware performance data

Effect of Titanium Compiler Optimizations
—Counted how many lifting operations were prevented by SC
—The restriction makes virtually no difference; only in pps and amr are expressions left unlifted because of the SC restriction (5% and 6%, respectively)
—Running pps and amr with the loop-lifting constraint turned off yields a similar performance ratio
—This makes sense intuitively: programmers rarely leave a loop-invariant assignment involving shared variables inside a loop (see the sketch below)
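
A hypothetical C analogue (names invented) of the kind of hoisting the SC restriction disallows:

    /* Heap/shared variable that may be written by another processor. */
    extern int shared_scale;

    void scale_all(int *local, int n) {
        for (int i = 0; i < n; i++) {
            /* The read of shared_scale is loop-invariant from this
             * processor's point of view, so a relaxed compiler would hoist
             * it out of the loop.  An SC-preserving compiler that cannot
             * prove the access is private conservatively keeps the read
             * inside the loop. */
            local[i] *= shared_scale;
        }
    }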

Data Sharing Analysis
—2.5x to 14x performance gap between SC and weak ordering
—Identifying stack data as private is very important in lowering the performance penalty of SC

Effectiveness of Sharing Inference
—(Chart: first bar is early enforcement, second is late enforcement)
—Effective for the small benchmarks, but not for the large ones

Effects of GCC Optimization (on Millennium Node)
—Measured by disabling GCC optimizations
—No major decrease in the performance gap

Effects of Architecture Reordering (on Millennium Node)
—Here the compiler does not prevent the processor from reordering memory operations
—A decrease in the performance gap is observed
—This means the performance penalty for SC is higher at the hardware level

Number of Sync Instructions
—Table: sync instruction counts per benchmark (pi, 3d-fft, fish-grid, amr) under the weak, early, late, and sequential configurations

Breakdown of Execution Time: Pi (millions of cycles)
—SC's performance penalty is mainly caused by an increase in idle cycles (waiting for sync instructions to complete)

Conclusion
Performance gaps exist between SC and relaxed models
—The size of the gap is highly dependent on the benchmark
—In general, SC is not very expensive
Architecture reordering accounts for most of the performance difference; the effects of Titanium and GCC optimizations appear limited
Sharing inference is not effective in reducing the cost of SC
At the hardware level:
—SC causes a significantly higher number of idle cycles, because of sync instructions