UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science SHERIFF: Precise Detection & Automatic Mitigation of False Sharing Tongping Liu,

Presentation transcript:

UNIVERSITY OF MASSACHUSETTS, AMHERST Department of Computer Science SHERIFF: Precise Detection & Automatic Mitigation of False Sharing Tongping Liu, Emery Berger University of Massachusetts, Amherst

Multi-core: expectation is awesome
int count[8]; // global array
void thread_func(int id) { for (int i = 0; i < M; i++) count[id]++; }

Reality is awful
int count[8]; // global array
void thread_func(int id) { for (int i = 0; i < M; i++) count[id]++; }
count[id]++: 13X
False sharing kills scaling

False sharing = performance problem
[Figure: Thread 1 on Core 1 and Thread 2 on Core 2, each with a Cache, above Main Memory. Label: Invalidate]

False sharing = performance problem
Interleaved writes cause cache invalidations
20X slower
[Figure: Thread 1 on Core 1 and Thread 2 on Core 2, each with a Cache, above Main Memory. Label: Invalidate]
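To see why the `count` array from the earlier slides is exposed to this effect, the sketch below maps addresses to cache-line indices. It assumes 64-byte cache lines and 4-byte `int`; `CACHE_LINE_SIZE`, `cache_line_of`, and `same_cache_line` are illustrative names, not part of SHERIFF.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, common on current x86-64 processors. */
#define CACHE_LINE_SIZE 64

/* Same global array as on the slide, aligned so the layout is predictable. */
alignas(CACHE_LINE_SIZE) int count[8];

/* With 4-byte ints the whole array occupies half of one cache line. */
static_assert(sizeof(count) <= CACHE_LINE_SIZE,
              "all eight counters fit in a single cache line");

/* Index of the cache line that holds address p. */
static uintptr_t cache_line_of(const void *p) {
    return (uintptr_t)p / CACHE_LINE_SIZE;
}

/* Returns 1 when the two addresses fall in the same cache line. */
int same_cache_line(const void *a, const void *b) {
    return cache_line_of(a) == cache_line_of(b);
}
```

Under these assumptions `same_cache_line(&count[0], &count[7])` is true, so eight threads that each update only their own counter still write to the same line.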

False sharing is invisible
me = 1; you = 1; // globals
me = new Foo; you = new Bar; // heap
class X { int me; int you; }; // fields
arr[me] = 12; arr[you] = 13; // array indices

False sharing detector: instrument every memory access
Related work: S. M. Gunther et al. [WBIA 2009], C. Liu [Master's thesis, 2009], Q. Zhao et al. [VEE 2011]
Shortcomings:
1. Slow
2. No actionable output
3. False positives

False sharing detector: state of the art (PTU)
[Figure: PTU output excerpt, + 850 lines…]
Shortcomings:
1. Imprecise
2. Too many false positives

SHERIFF-DETECT
No false positives
Actionable output
Efficient (20% overhead)
Example output:
Object has interleaving writes. The object starts at 0xd5c8e160, length 32.
Allocation call stack:
0: word_count.c: 136
1: word_count.c: 444

t1 = spawn f(x); t2 = spawn g(y); sync;
if (!fork()) f(x); if (!fork()) g(y);
Related work: Grace [OOPSLA 2009], Dthreads [SOSP 2011]
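The `fork()`-based pattern on this slide can be sketched with standard POSIX calls. This is a minimal illustration rather than SHERIFF's implementation: the children write straight into a shared anonymous mapping, and the isolation and commit steps covered on later slides are left out. `add_to` and `run_two_workers` are hypothetical names.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std=c11 */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical worker standing in for f(x) and g(y) on the slide. */
static void add_to(long *slot, long value) { *slot += value; }

/* Runs two workers as child processes; returns the sum of their results,
   or -1 if the mapping or either fork failed. */
long run_two_workers(long x, long y) {
    long *shared = mmap(NULL, 2 * sizeof(long), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) return -1;
    shared[0] = 0;
    shared[1] = 0;

    pid_t p1 = fork();
    if (p1 == 0) { add_to(&shared[0], x); _exit(0); }
    pid_t p2 = fork();
    if (p2 == 0) { add_to(&shared[1], y); _exit(0); }

    /* The "sync" step: wait for every child that was started. */
    int ok = (p1 > 0 && p2 > 0);
    if (p1 > 0) waitpid(p1, NULL, 0);
    if (p2 > 0) waitpid(p2, NULL, 0);

    long result = ok ? shared[0] + shared[1] : -1;
    munmap(shared, 2 * sizeof(long));
    return result;
}
```

The shared mapping is what keeps the children's writes visible to the parent; with ordinary private memory each child would update its own copy.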

SHERIFF: isolated execution
[Figure: Process 1 on Core 1 and Process 2 on Core 2, with Cache, Global State, and Main Memory]

SHERIFF: isolated execution
Pthreads:
Lock();
XXX;
Unlock();
YYY;
Lock();
Sheriff:
Begin_isolated_execution
Commit_local_changes
XXX; // isolated execution
Begin_isolated_execution
Commit_local_changes
YYY; // isolated execution

Snapshot and diffing: local changes
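One common way to realize snapshot-and-diff is to keep a copy ("twin") of a page before local writes begin, then at commit time copy back only the bytes that differ from the twin. The sketch below works on plain byte buffers; `take_snapshot` and `commit_diff` are illustrative names, and the byte-granularity comparison is an assumption made for brevity.

```c
#include <stddef.h>
#include <string.h>

/* Copy the local page before it is modified so later changes can be found. */
void take_snapshot(unsigned char *twin, const unsigned char *local, size_t n) {
    memcpy(twin, local, n);
}

/* Write only the bytes that differ from the snapshot into the shared copy.
   Returns the number of bytes committed. Bytes this process did not change
   are left alone, so updates committed by other processes survive. */
size_t commit_diff(const unsigned char *twin, const unsigned char *local,
                   unsigned char *shared, size_t n) {
    size_t changed = 0;
    for (size_t i = 0; i < n; i++) {
        if (local[i] != twin[i]) {
            shared[i] = local[i];
            changed++;
        }
    }
    return changed;
}
```

Because only changed bytes are written back, two processes that modify different parts of the same page can both commit without overwriting each other.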

SHERIFF-DETECT: Find false sharing at commit points
Interleaved writes
[Figure: Process 1 on Core 1 and Process 2 on Core 2, with Cache, Global State, and Main Memory]
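A simplified way to count interleaved writes is to remember the last writer of each cache line and increment a counter whenever a different thread writes next. The sketch below is an assumption-laden illustration, not SHERIFF-DETECT's actual bookkeeping: `line_state` and its functions are hypothetical, and thread ids are assumed to be non-negative.

```c
#include <stddef.h>

#define NO_WRITER (-1)

/* Per-cache-line record; field names are illustrative. */
typedef struct {
    int last_writer;        /* id of the last thread that wrote, or NO_WRITER */
    unsigned interleavings; /* number of times the writer changed */
} line_state;

void line_init(line_state *s) {
    s->last_writer = NO_WRITER;
    s->interleavings = 0;
}

/* Record a write observed at a commit point. A change of writer counts as
   one interleaving, since it would invalidate the other core's cached copy. */
void line_record_write(line_state *s, int tid) {
    if (s->last_writer != NO_WRITER && s->last_writer != tid)
        s->interleavings++;
    s->last_writer = tid;
}

/* A line becomes a candidate once it has seen enough interleavings. */
int line_is_suspect(const line_state *s, unsigned threshold) {
    return s->interleavings >= threshold;
}
```

Repeated writes by a single thread never raise the counter, which matches the point that only interleaved writes cause invalidations.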

Output: PTU vs. SHERIFF-DETECT
Benchmark           PTU (# shared lines)   SHERIFF-DETECT (# shared objects)
canneal             1                      1
fluidanimate        3                      1
kmeans              1916                   2
linear_regression   5                      1
matrix_multiply     468                    0
pbzip2              14                     0
pca                 45                     0
pfscan              3                      0
reverse_index       N/A                    5
streamcluster       9                      1
word_count          4                      4
swaptions           196                    0
TOTAL               2,664                  15

Example case study: linear_regression
Allocation call stack: 0: linear_regression-pthread.c: line number: 136
Step 1: find allocation site
136: tid_args = (lreg_args *)calloc(sizeof(lreg_args), num_procs);
Step 2: find references
152: pthread_create(&tid_args[i].tid, &attr, linear_regression_pthread, (void*)&tid_args[i]) != 0);

Example case study: linear_regression
void *linear_regression_pthread(void *args_in) {
  lreg_args* args = (lreg_args*)args_in;
  ……
  for (i = 0; i < args->num_elems; i++) {
    args->SX += args->points[i].x;
    args->SXX += args->points[i].x*args->points[i].x;
    ……
“lreg_args” is not aligned

Example case study: linear_regression
Step 3: fix false sharing using padding
typedef struct {
  …..
  char padding[128]; // Padding to avoid false sharing
} lreg_args;
9.2X
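As an alternative to a manual `padding` array, C11 alignment specifiers can round each element up to a full cache line. The sketch below assumes 64-byte lines; `padded_args` is a hypothetical stand-in for `lreg_args` that keeps only the `SX` and `SXX` fields named on the slides, with assumed types.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_SIZE 64

/* Hypothetical per-thread argument block. Aligning the first member raises
   the alignment of the whole struct, and the compiler pads its size up to a
   multiple of that alignment. */
typedef struct {
    alignas(CACHE_LINE_SIZE) long long SX;
    long long SXX;
} padded_args;

static_assert(sizeof(padded_args) % CACHE_LINE_SIZE == 0,
              "each element occupies whole cache lines");

/* Returns 1 if elements i and j of the array start in the same cache line. */
int elements_share_line(const padded_args *a, size_t i, size_t j) {
    return (uintptr_t)&a[i] / CACHE_LINE_SIZE ==
           (uintptr_t)&a[j] / CACHE_LINE_SIZE;
}
```

For heap arrays, `calloc` does not promise cache-line alignment, so an allocation such as `aligned_alloc(CACHE_LINE_SIZE, n * sizeof(padded_args))` would be needed for the layout guarantee to hold.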

SHERIFF-DETECT performance
[Figure: performance chart; visible labels: 8.2, %]

Speedup due to isolation
[Figure: Process 1 on Core 1 and Process 2 on Core 2, with Cache, Global State, and Main Memory]

SHERIFF-PROTECT: Prevents ALL false sharing

Basis of SHERIFF-PROTECT
[Figure: SHERIFF-PROTECT, SHERIFF-DETECT]

SHERIFF libraries: easy to use
SHERIFF-DETECT: % g++ myprog.cpp -lsheriffdetect -o myprog
SHERIFF-PROTECT: % g++ myprog.cpp -lsheriffprotect -o myprog

Workflow: using SHERIFF
[Figure: flowchart. Nodes: original program + libpthread; original program + SHERIFF-DETECT; modified program + libpthread (padding, alignment, local variables); original program + SHERIFF-PROTECT. Labels: No false sharing; Degrade performance; Too much memory; No source code; No time]

Why no false positives?
(1) Actual interleaved writes (performance problem)
(2) Word status: not true sharing
(3) Avoid heap re-usage problems
(4) Our experimental results bear this out
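Point (2) can be illustrated by tracking, for each word in a cache line, which thread wrote it. If different threads wrote different words, the line is falsely shared; if any single word was written by more than one thread, that is true sharing. This is a simplified sketch under stated assumptions (64-byte lines, 4-byte words, non-negative thread ids), with hypothetical names throughout, and it reports true sharing whenever any word has multiple writers.

```c
#include <stddef.h>

#define WORDS_PER_LINE 16     /* assumption: 64-byte line, 4-byte words */
#define NO_WRITER (-1)
#define MULTIPLE_WRITERS (-2)

typedef enum { NOT_SHARED, FALSE_SHARING, TRUE_SHARING } sharing_kind;

/* Writer of each word in one cache line. */
typedef struct { int writer[WORDS_PER_LINE]; } word_status;

void word_status_init(word_status *s) {
    for (size_t i = 0; i < WORDS_PER_LINE; i++) s->writer[i] = NO_WRITER;
}

/* Record that thread tid wrote the given word of the line. */
void word_status_record(word_status *s, size_t word, int tid) {
    if (word >= WORDS_PER_LINE) return;
    if (s->writer[word] == NO_WRITER)
        s->writer[word] = tid;
    else if (s->writer[word] != tid)
        s->writer[word] = MULTIPLE_WRITERS;
}

/* Classify the line from its per-word writers. */
sharing_kind word_status_classify(const word_status *s) {
    int first = NO_WRITER;
    int several_threads = 0;
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        int w = s->writer[i];
        if (w == MULTIPLE_WRITERS) return TRUE_SHARING;
        if (w == NO_WRITER) continue;
        if (first == NO_WRITER) first = w;
        else if (w != first) several_threads = 1;
    }
    return several_threads ? FALSE_SHARING : NOT_SHARED;
}
```

Keeping this per-word record is what lets a detector avoid flagging a line whose interleaved writes all target the same variable.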

Key Optimizations
Isolate small heap objects and globals
Adaptive false sharing prevention
– Protect on long transactions only

Key Optimizations
Find sharing pages: false sharing objects → shared page
Reduce overhead
– Using sampling
– Sampling only for long transactions (> 5 ms)