Conditional Memory Ordering Christoph von Praun, Harold W. Cain, Jong-Deok Choi, Kyung Dong Ryu Presented by: Renwei Yu Published in Proceedings of the 33rd International Symposium on Computer Architecture, 2006.

Motivation  Modern multiprocessor systems require memory barrier instructions in programs to specify memory ordering.  Conventionally, memory ordering is guaranteed by using locks or barriers, but this leads to superfluous memory barriers in programs.  We need a mechanism to reduce unnecessary memory ordering.

Redundancies of memory ordering in conventional locking algorithms  Lock operation on lock variable l  Unlock operation on lock variable l
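To make the redundancy concrete, here is a minimal sketch of a conventional spinlock, using C11 atomics as a stand-in for the PowerPC larx/stcx and sync sequences the paper targets. The point is where the barriers sit: every lock pays an acquire barrier and every unlock a release barrier, whether or not any other thread ever observes the protected data.

```c
#include <stdatomic.h>

typedef struct { atomic_int held; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    int expected = 0;
    /* Spin until the compare-and-swap installs 1 into the lock word. */
    while (!atomic_compare_exchange_weak_explicit(
               &l->held, &expected, 1,
               memory_order_relaxed, memory_order_relaxed))
        expected = 0;
    /* Acquire barrier: orders the critical section after the lock
     * acquisition. CMO asks when this ordering is actually needed. */
    atomic_thread_fence(memory_order_acquire);
}

static void spin_unlock(spinlock_t *l) {
    /* Release barrier: publishes the critical section's stores before
     * the lock word is cleared -- even if the lock is thread-confined. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&l->held, 0, memory_order_relaxed);
}
```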

Sources of memory ordering redundancy  Thread confinement of lock variables.  Memory ordering operations for lock variables that are accessed solely by a single thread are redundant.  Thread locality of locking.  Locality of locking is the situation where consecutive acquires of a lock variable are made by the same thread.  Eager releases and repetitive acquires.

CMO-conditional memory ordering  CMO is demonstrated on a lock algorithm that identifies those dynamic lock/unlock operations for which memory ordering is unnecessary, and speculatively omits the associated memory ordering instructions.  When ordering is required, this algorithm relies on a hardware mechanism for initiating a memory ordering operation on another processor.

CMO-conditional memory ordering  Acquire of lock l with conditional memory ordering

CMO-conditional memory ordering  Release of lock l with conditional memory ordering

CMO-conditional memory ordering  The memory synchronization model differs from the conventional one: release synchronization is omitted at the unlock operation and “recovered” at the lock operation – only if necessary.  Necessity is determined by a release number that is communicated between the thread that unlocks l and the thread that subsequently locks l.

Release numbers  relnum ⇐ (id & release ctr. of current proc)  A release number is a value that combines a processor id with a counter of the release synchronization operations (the release counter) that the respective processor has performed at a given point in the execution of a program.
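One plausible encoding of such a release number, following the slide's "(id & release ctr. of current proc)", is to pack the processor id into the high bits of a word and the release counter into the low bits. The field widths below are assumptions for illustration, not the paper's exact layout.

```c
#include <stdint.h>

#define CTR_BITS 26u
#define CTR_MASK ((1u << CTR_BITS) - 1u)

/* Pack a processor id and its release counter into one word. */
static uint32_t make_relnum(uint32_t proc_id, uint32_t release_ctr) {
    return (proc_id << CTR_BITS) | (release_ctr & CTR_MASK);
}

/* Recover the two fields from a release number. */
static uint32_t relnum_proc(uint32_t relnum) { return relnum >> CTR_BITS; }
static uint32_t relnum_ctr(uint32_t relnum)  { return relnum & CTR_MASK; }
```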

Conditional memory ordering  Based on the release number, the system arranges for release synchronization to be recovered at the processor that previously released the lock, but only if necessary.  (sync conditional) implies (sync acquire) at the processor that issues the instruction.
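The decision logic can be sketched in a single-threaded model, under these assumptions: a software "release vector" of per-processor release counters, a 16-bit id / 16-bit counter release number stored in the lock word, and a local full fence standing in for the remote (sync conditional). None of these details are the paper's exact design; the sketch only shows when ordering is elided versus recovered.

```c
#include <stdatomic.h>

enum { NPROCS = 4 };
static atomic_uint release_ctr[NPROCS]; /* release counter per processor  */
static atomic_uint lock_relnum;         /* release number left by unlock  */
static atomic_int  lock_held;

static void cmo_unlock(unsigned my_proc) {
    /* Eager release: record who released and at which counter value,
     * but omit the release barrier itself. */
    unsigned ctr = atomic_load(&release_ctr[my_proc]) & 0xFFFFu;
    atomic_store(&lock_relnum, (my_proc << 16) | ctr);
    atomic_store(&lock_held, 0);
}

/* Returns 1 if release ordering had to be recovered, 0 if it was elided. */
static int cmo_lock(unsigned my_proc) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&lock_held, &expected, 1))
        expected = 0;
    unsigned relnum   = atomic_load(&lock_relnum);
    unsigned rel_proc = relnum >> 16, rel_ctr = relnum & 0xFFFFu;
    if (rel_proc == my_proc)
        return 0; /* thread-confined lock: same processor, no sync needed */
    if (atomic_load(&release_ctr[rel_proc]) > rel_ctr)
        return 0; /* releaser has ordered its stores since the unlock     */
    /* Recover release ordering; hardware would request this remotely. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_fetch_add(&release_ctr[rel_proc], 1);
    return 1;
}
```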

Hardware support for CMO  Logical operation  Release vector entry  Register operand  Comparison of release counters  Release vector support  To support low latency reads, a copy of the release vector is mirrored in local storage at each processor.  Broadcast operation  Release hints  Instruct a processor to increment its release counter as soon as the conditions are met
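The release-vector bullet can be illustrated with a small model: each processor keeps a local mirror of every processor's release counter so reads are low-latency, and updates are broadcast to all mirrors. The array layout and function names here are assumptions for illustration, not the hardware design.

```c
enum { NPROCS = 4 };

/* mirror[q][p] = processor q's local view of processor p's release counter */
static unsigned mirror[NPROCS][NPROCS];

/* Broadcast: when processor p performs a release sync, its counter is
 * bumped in every processor's mirror. In hardware this happens over the
 * interconnect; release hints let it happen eagerly. */
static void broadcast_release(int p) {
    for (int q = 0; q < NPROCS; q++)
        mirror[q][p] += 1;
}

/* Low-latency read: a processor consults only its own mirror copy. */
static unsigned read_release_ctr(int reader, int p) {
    return mirror[reader][p];
}
```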

Evaluation  S-CMO: A software CMO prototype  The results show that CMO avoids memory ordering operations for the vast majority of dynamic acquire and release operations across a set of multithreaded Java workloads, leading to significant speedups for many of them.  However, performance improvements in the software prototype are limited by the high cost of remote memory ordering.

Experimental Methodology  Used a set of single- and multi-threaded Java benchmarks from the Java Grande and SPEC benchmark suites.  Ran these applications on IBM’s J9 production virtual machine.  Experiments were performed on both Power4 and Power5 multiprocessor systems running AIX, with 4 and 6 processors respectively.

Software CMO prototype with hardware support  Hardware-based (sync conditional) and (sync remote) implementation

Software CMO prototype with hardware support  CMO performance while varying remote sync latency in a high-cost (Power4) memory ordering implementation.

Software CMO prototype with hardware support  CMO performance while varying remote sync latency in a high-cost (Power5) memory ordering implementation.

Future Proposal  Hardware Proposal  Software Proposal

Summary  The paper develops an algorithm called conditional memory ordering (CMO) that eliminates redundant memory ordering operations and effectively improves system performance.  It characterizes the synchronization and memory ordering operations in lock-intensive Java workloads and demonstrates that many memory ordering operations are superfluous.  It evaluates the performance improvement of CMO.  It gives a hardware proposal for CMO and evaluates the hardware implementation using a software prototype and an analytical model.

Conclusions  CMO can significantly improve the performance of multiprocessor systems.  With hardware support, CMO offers significant performance benefits across our set of Java benchmarks when assuming a reasonable remote synchronization latency.

Thank you Questions?