TM performance: seeing the whole picture
(or: Looking back over the first 500 papers)
Tim Harris (MSR Cambridge)

How might we compare TM systems? Where might TM be most useful?

Extending Dan's GC analogy:
A: "Here's a way to reduce the pause times..."
B: "Here's a way to support pinned objects..."
C: "Here's a way to improve the throughput (total app runtime)..." e.g. a concurrent GC algorithm (run GC in small steps in amongst the mutators)

Min mutator utilization (MMU): the minimum fraction of CPU time available to the application (the mutator) over any window of a given length; a GC metric that captures whole-execution behavior rather than a single average.

Five dimensions to TM behavior:
- Sequential overhead
- Scalability (to more cores)
- Semantics
- Scalability (to longer transactions)
- Tx-supported operations

Scaling to large transactions
1.0 = optimized sequential code (no tx, no locks)

Scaling: n*1-core copies
1.0 = optimized sequential code (no tx, no locks)

Scaling: 1*n-core copy
1.0 = optimized sequential code (no tx, no locks)

How might we compare TM systems? Where might TM be most useful?

Application model #1: the program splits into a sequential part and a parallelizable part.
f = fraction of original program that is parallelizable

Application model #1: the parallelizable part runs across n threads.
f = fraction of original program that is parallelizable
n = num parallel threads

Application model #1: the parallel work now runs transactionally.
f = fraction of original program that is parallelizable
n = num parallel threads
x = straight-line transactional slow-down
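In Amdahl's-law terms this model gives (a reconstruction from the parameter definitions above; the slides show only the resulting plots):

\[ \text{Speedup}(n) = \frac{1}{(1-f) + \frac{f\,x}{n}} \]

Setting x = 1 recovers the classic Amdahl formula.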

Conflict model: a fixed number of alternatives; different alternatives execute in parallel, but conflicting operations execute in series.
f = fraction of original program that is parallelizable
n = num parallel threads
x = straight-line transactional slow-down
c = mean number of attempts per transaction (1 => no conflicts)
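A minimal sketch of the combined model in Python (my reconstruction; the slides give the parameters but not an explicit formula). Each of the c attempts at a transaction pays the straight-line slow-down x, so the parallel fraction costs f*x*c/n:

def speedup(f, n, x=1.0, c=1.0):
    """Speedup over optimized sequential code (application model #1).

    f: fraction of the program that is parallelizable
    n: number of parallel threads
    x: straight-line transactional slow-down (1 = no overhead)
    c: mean attempts per transaction (1 = no conflicts)
    """
    # The sequential part runs at full speed; the parallel part is
    # slowed by x and by re-execution (c), but divided across n threads.
    return 1.0 / ((1.0 - f) + (f * x * c) / n)

For example, speedup(0.95, 16) is about 9.1, while speedup(0.95, 16, x=2) drops to about 5.9.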

n=16, c=1.0, vary f, vary x

n=16, c=1.0: 8x on 16 threads => 95% parallelizable
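A rough check with plain Amdahl (x = 1, c = 1), solving for the required parallel fraction:

\[ f = \frac{1 - 1/S}{1 - 1/n} = \frac{1 - 1/8}{1 - 1/16} = \frac{0.875}{0.9375} \approx 0.93 \]

so even before any transactional overhead, more than 90% of the program must be parallelizable to reach 8x on 16 threads; with x > 1 the required fraction only grows toward the slide's 95%.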

n=16, c=1.0: straight-line slow-down bites quickly

n=16, c=1.1 (1..1024)

n=16, c=1.4 (1..256)

n=16, c=2.0 (1..64)

If Amdahl and overheads don't get you then conflicts still can...
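Plugging the conflict rates swept on these slides into the speedup sketch above (hypothetical values f = 0.95, n = 16, x = 2):

>>> [round(speedup(0.95, 16, x=2, c=c), 1) for c in (1.0, 1.1, 1.4, 2.0)]
[5.9, 5.5, 4.6, 3.5]

Even a mean of two attempts per transaction (c = 2.0) costs almost half the remaining speedup.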

n=16, c=1.0, scaling of large tx

n=16, c=1.0, x*(f+(f^1.25)/4)

n=16, c=1.0, x*(f+(f^2)/4)

Application model #2: 100% parallel; each thread alternates transactional and non-transactional phases (Tx, Non-tx, Tx, Non-tx, ...).
t = fraction of original program that is transactional
n = num parallel threads
x = straight-line transactional slow-down
c = mean number of attempts per transaction (1 => no conflicts)
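Under this model (again a reconstruction from the parameter definitions), each thread's unit of work is (1 - t) non-transactional time plus t transactional time, slowed by x and attempted c times on average:

\[ \text{Speedup}(n) = \frac{n}{(1-t) + t\,x\,c} \]

With everything parallel, Amdahl's law drops out and transactional overhead alone caps the speedup.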

Workloads (ASPLOS '10): JBBAtomic, Labyrinth, Vacation, MaxFlow, Genome

n=16, c=1.0 (no conflicts)

n=16, c=1.0 (no conflicts): overheads rapidly reduce the amount that transactions can be used
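For instance, with the model-#2 formula above and x = 2, c = 1, a fully transactional program (t = 1) reaches only n/(x*c) = 8x on 16 threads: half the machine goes to transactional overhead.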

n=16, c=1.1 (1..1024)

n=16, c=1.4 (1..256)

n=16, c=2.0 (1..64)

Conclusions: bad things come in threes...
- Amdahl's law
- Sequential overhead
- Conflicts
When developing TM systems we need to be careful about the tradeoffs between these. There's a risk of "chasing around the TM design space": improving scaling without conflicts at the expense of scaling with conflicts, and vice versa.