University of Massachusetts, Amherst, School of Computer Science. PREDATOR: Predictive False Sharing Detection. Tongping Liu, Chen Tian, Ziang Hu, Emery Berger

Presentation transcript:

1 University of Massachusetts, Amherst, School of Computer Science. PREDATOR: Predictive False Sharing Detection. Tongping Liu*, Chen Tian, Ziang Hu, Emery Berger*. *University of Massachusetts Amherst; Huawei US Research Center

2 Parallelism: Expectation is Awesome
[Chart: runtime (s) of the parallel program; expectation]
Parallel program:
    int count[8];
    int W;
    void increment(int S) {
      for(in=S; in<S+W; in++)
        for(j=0; j<1M; j++)
          count[in]++;
    }
    int main(int THREADS) {
      W=8/THREADS;
      for(i=0; i<8; i+=W)
        spawn(increment,i);
    }

3 Parallelism: Reality is Awful
False sharing slows the program by 13X
[Chart: runtime (s) of the parallel program; expectation vs. reality]
Parallel program (annotated "False sharing"):
    int count[8];
    int W;
    void increment(int S) {
      for(in=S; in<S+W; in++)
        for(j=0; j<1M; j++)
          count[in]++;
    }
    int main(int THREADS) {
      W=8/THREADS;
      for(i=0; i<8; i+=W)
        spawn(increment,i);
    }
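The slide program is pseudocode (`spawn`, `1M`, undeclared loop variables). A minimal runnable sketch of the same pattern is given here, assuming C++11 threads and 64-byte cache lines. The names `PackedCounter`, `PaddedCounter`, `run_counters`, and `kAssumedLineSize` are hypothetical and not from the talk, and padding is only one possible fix, not necessarily the one the authors applied.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines; real hardware may differ.
constexpr std::size_t kAssumedLineSize = 64;

// Counters packed back to back, like count[8] on the slide:
// neighboring counters can share one cache line.
struct PackedCounter {
    volatile long value = 0;
};

// Hypothetical fix: pad each counter so consecutive counters are 64 bytes apart
// and therefore never fall into the same 64-byte line.
struct PaddedCounter {
    volatile long value = 0;
    char pad[kAssumedLineSize - sizeof(long)];
};

// Each thread increments only its own counter; returns the sum of all counters.
template <typename Counter>
long run_counters(int threads, long iterations) {
    std::vector<Counter> counters(threads);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counters, t, iterations] {
            for (long j = 0; j < iterations; ++j) {
                counters[t].value = counters[t].value + 1;
            }
        });
    }
    for (std::thread& th : pool) th.join();
    long total = 0;
    for (const Counter& c : counters) total += c.value;
    return total;
}
```

Both variants compute the same totals because no thread touches another thread's counter. Only the memory layout differs, so timing `run_counters<PackedCounter>` against `run_counters<PaddedCounter>` with a large iteration count is one way to look for the effect on a given machine. The size of any slowdown depends on the hardware and is not implied by this sketch.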

4 False Sharing in Real Applications False sharing slows MySQL by 50%

5 Cache Line False Sharing vs. True Sharing

6 False Sharing vs. True Sharing
[Figure: false sharing (Task 1, Task 2, Task 3, Task 4) vs. true sharing (Task 1, Task 2)]

7 Resource Contention at Cache Line Level

8 False Sharing Causes Performance Problems
Cache line: basic unit of data transfer
[Diagram: Thread 1 on Core 1, Thread 2 on Core 2, caches, main memory; cache invalidate]

9 False Sharing Causes Performance Problems
Interleaved accesses cause cache invalidations
[Diagram: Thread 1 on Core 1, Thread 2 on Core 2, caches, main memory; cache invalidate]

10 False Sharing is Everywhere
    me = 1; you = 1; // globals
    me = new Foo; you = new Bar; // heap
    class X { int me; int you; }; // fields
    array[me] = 12; array[you] = 13; // array indices

11 False Sharing is Hard to Diagnose Multiple experts worked together to diagnose MySQL scalability issue (1.5M LOC)

12 Problems of Existing Tools
No precise information/false positives
– WIBA’09, VEE’11, EuroSys’13, SC’13
Accurate & precise
– OOPSLA’11 (cannot detect read-write false sharing)
Shared problem: only detect observed false sharing

13 False Sharing Causes Performance Problems
Interleaved accesses → cache invalidations → performance problems
Detect false sharing causing performance problems: find cache lines with many cache invalidations
[Diagram: Task 1 on Core 1, Task 2 on Core 2, main memory; cache invalidate]

14 Find Lines with Many Invalidations
Track cache invalidations on each cache line
Memory: Global, Heap

15 Track Cache Invalidations
Hardware-based approach
– Needs hardware support
– No portability
Simulation-based approach
– Needs hardware info such as cache hierarchy, cache capacity
– Very slow
Conservative assumptions
– Each thread runs on a different core with its private cache.
– Infinite cache capacity.
PREDATOR: based on memory access history of each cache line

16 Track Cache Invalidations
Each entry: {Thread ID, Access Type}
[Animation: over time, a sequence of reads (r) and writes (w) by threads T1 and T2 updates the history entries of a cache line and its number of invalidations]
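One way to picture the per-line access history is a small record of {Thread ID, Access Type} entries plus a counter. The sketch below is a simplified model written for illustration, not PREDATOR's implementation. The two-entry limit, the update rules, and the names `LineHistory` and `count_invalidations` are assumptions. It follows the conservative assumptions of slide 15: every thread has its own private cache, and capacity is unlimited.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Access { Read, Write };

// One history entry, as on the slide: {Thread ID, Access Type}.
struct Entry {
    int tid;
    Access type;
};

// Hypothetical per-cache-line record: at most two entries plus an invalidation count.
class LineHistory {
public:
    void access(int tid, Access type) {
        if (type == Access::Read) {
            // Keep a read only if it adds a new thread and there is room for it.
            bool room = entries_.size() < 2;
            if (room && !seen(tid)) entries_.push_back(Entry{tid, type});
            return;
        }
        // A write invalidates the line if another thread's access is on record.
        if (has_other_thread(tid)) ++invalidations_;
        // After a write, only the writer holds a valid copy of the line.
        entries_.assign(1, Entry{tid, type});
    }

    unsigned long invalidations() const { return invalidations_; }

private:
    bool seen(int tid) const {
        for (const Entry& e : entries_) {
            if (e.tid == tid) return true;
        }
        return false;
    }
    bool has_other_thread(int tid) const {
        for (const Entry& e : entries_) {
            if (e.tid != tid) return true;
        }
        return false;
    }

    std::vector<Entry> entries_;
    unsigned long invalidations_ = 0;
};

// Replays a trace of accesses to one cache line and returns the invalidation count.
unsigned long count_invalidations(const std::vector<Entry>& trace) {
    LineHistory line;
    for (const Entry& e : trace) line.access(e.tid, e.type);
    return line.invalidations();
}
```

In this model, the trace T1 write, T2 write, T2 write, T1 read, T2 read, T1 write counts two invalidations: the first T2 write, and the final T1 write. Read-only traces and single-thread traces count none.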

17 PREDATOR Components
Compiler Instrumentation: instruments every memory read/write access
Runtime System: collects memory accesses and reports false sharing

18 Detect Problems Correctly & Precisely
Correctly:
– No false alarms
[Figure: false sharing (Task 1, Task 2, Task 3, Task 4) vs. true sharing (Task 1, Task 2)]
Track memory accesses on each word
Precisely:
– Global variables
– Heap objects: pinpoint the line of memory allocation
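Word-level tracking is what lets a tool tell the two cases on this slide apart. The sketch below is an illustrative classifier, not PREDATOR's code. It assumes that a word written by one thread and touched by another counts as true sharing, and that a line touched by several threads with at least one write, but with no such word, counts as false sharing. `WordAccess`, `Sharing`, and `classify_line` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

enum class Sharing { None, FalseSharing, TrueSharing };

// One recorded access to a word inside a single cache line.
struct WordAccess {
    std::size_t word;  // index of the word within the line
    int tid;           // accessing thread
    bool is_write;
};

// Classifies one cache line from its word-level accesses (simplified rule set).
Sharing classify_line(const std::vector<WordAccess>& trace, std::size_t words_per_line) {
    std::vector<std::set<int>> threads(words_per_line);
    std::vector<bool> written(words_per_line, false);
    for (const WordAccess& a : trace) {
        assert(a.word < words_per_line);
        threads[a.word].insert(a.tid);
        if (a.is_write) written[a.word] = true;
    }

    std::set<int> line_threads;
    bool any_write = false;
    for (std::size_t w = 0; w < words_per_line; ++w) {
        // Same word, several threads, at least one write: the data itself is shared.
        if (written[w] && threads[w].size() > 1) return Sharing::TrueSharing;
        line_threads.insert(threads[w].begin(), threads[w].end());
        any_write = any_write || written[w];
    }
    // Several threads on the line, a write somewhere, but no shared word.
    if (any_write && line_threads.size() > 1) return Sharing::FalseSharing;
    return Sharing::None;
}
```

This version returns `TrueSharing` as soon as any word is truly shared, even if other words on the same line are falsely shared. A real tool would likely report both.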

19 PREDATOR’s Report

20 Why do we need prediction?

21 Necessity of False Sharing Prediction
[Figure: accesses by Thread 1 and Thread 2 relative to cache line 1 and cache line 2 under different layouts; two of the layouts are marked as false sharing]

22 Properties Affecting False Sharing Occurrence
Change of memory layout
– 32-bit platform vs. 64-bit platform
– Different memory allocator
– Different compiler or optimization
– Different allocation order by changing the code, e.g., printf
Run on hardware with different cache line size

23 Example of False Sharing Sensitivity
[Figure: memory layout at Offset = 0, Offset = 8, ..., Offset = 56; colors represent threads]
Cache line size = 64 bytes

24 Example of False Sharing Sensitivity
PREDATOR predicts false sharing problems even when they do not occur in the observed execution

25 Prediction Based on Virtual Cache Lines
[Figure: accesses by Thread 1 and Thread 2. Real case: cache line 1 and cache line 2. Prediction 1: virtual cache line 1 and virtual cache line 2, false sharing. Prediction 2: virtual cache line 1, false sharing]

26 Track Invalidations on Virtual Cache Lines
– d < sz, the cache line size
– (X, Y) from different threads && one of them is a write
[Figure: accesses X and Y at distance d, with (sz-d)/2 marked; tracked virtual line and non-tracked virtual lines]
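The (sz-d)/2 label in the figure suggests a tracked virtual line of size sz that surrounds X and Y with equal slack on both sides. The sketch below computes such a line under that reading, which is an assumption drawn from the diagram. The names `VirtualLine`, `make_virtual_line`, and `in_line` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// A virtual cache line covering the byte range [begin, end).
struct VirtualLine {
    std::uintptr_t begin;
    std::uintptr_t end;
};

// Builds the virtual line around accesses at addresses x < y with d = y - x < sz,
// leaving (sz - d) / 2 bytes of slack before x (assumed reading of the figure).
VirtualLine make_virtual_line(std::uintptr_t x, std::uintptr_t y, std::uintptr_t sz) {
    assert(x < y);
    std::uintptr_t d = y - x;
    assert(d < sz);
    std::uintptr_t slack = (sz - d) / 2;
    assert(x >= slack);
    std::uintptr_t begin = x - slack;
    return VirtualLine{begin, begin + sz};
}

// True if addr falls inside the virtual line.
bool in_line(const VirtualLine& v, std::uintptr_t addr) {
    return addr >= v.begin && addr < v.end;
}
```

With sz = 64, the addresses 120 and 136 fall in different real lines (120 / 64 = 1, 136 / 64 = 2), yet d = 16 gives a slack of 24 and a virtual line [96, 160) that contains both. Accesses inside that range can then be fed to the same invalidation counting used for real cache lines.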

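27 Benchmark Results
Columns: Benchmarks; Unknown Problem; Without Prediction; With Prediction; Improvement
(Check marks belong to the three middle columns; their column positions are not preserved in this transcript.)
Histogram: ✔✔✔; 46%
Linear_regression: ✔; 1207%
Reverse_index: ✔✔; 0.09%
Word_count: ✔✔; 0.14%
Streamcluster-1: ✔✔✔; 4.77%
Streamcluster-2: ✔✔; 7.52%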

28 Real Applications Results
MySQL
– Problem: false sharing occurs when different threads update the shared bitmap simultaneously.
– Performance improves 180% after fixes.
Boost library
– Problem: “there will be 16 spinlocks per cache line”
– Performance improves about 100%.

29 Performance Overhead of PREDATOR: 5.6X

30 [Summary diagram: compiler instrumentation and runtime system; cache invalidation between Thread 1 on Core 1 and Thread 2 on Core 2; precise report; prediction on virtual cache lines (real case, Prediction 1, Prediction 2)]

31

32 False Sharing is Hard to Diagnose Multiple experts worked together to diagnose MySQL scalability issue (1.5M LOC)

33 Detailed Prediction Algorithm 1. Find suspected cache lines

34 Detailed Prediction Algorithm 1. Find suspected cache lines 2. Track detailed memory accesses

35 Detailed Prediction Algorithm
1. Find suspected cache lines
2. Track detailed memory accesses
3. Predict based on hot accesses: d < sz && (X, Y) from different threads, potential false sharing
[Figure: accesses X and Y at distance d]
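Step 3 can be sketched as a pairwise check over the hot accesses. The code below is a simplified illustration, not PREDATOR's algorithm. It combines the condition on this slide (d < sz, different threads) with the requirement from slide 26 that at least one access is a write. It also skips pairs at the same address (d = 0), since those would be true sharing. The names `HotAccess`, `potential_false_sharing`, and `predict_pairs` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// A frequently accessed address, with the thread that touches it.
struct HotAccess {
    std::uintptr_t addr;
    int tid;
    bool is_write;
};

// The pairwise condition: 0 < d < sz, different threads, at least one write.
bool potential_false_sharing(const HotAccess& x, const HotAccess& y, std::uintptr_t sz) {
    std::uintptr_t d = x.addr > y.addr ? x.addr - y.addr : y.addr - x.addr;
    bool distinct_data = d > 0;  // the same address would be true sharing
    return distinct_data && d < sz && x.tid != y.tid && (x.is_write || y.is_write);
}

// Returns index pairs (i, j), i < j, of hot accesses that satisfy the condition.
std::vector<std::pair<std::size_t, std::size_t>> predict_pairs(
        const std::vector<HotAccess>& hot, std::uintptr_t sz) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < hot.size(); ++i) {
        for (std::size_t j = i + 1; j < hot.size(); ++j) {
            if (potential_false_sharing(hot[i], hot[j], sz)) {
                pairs.emplace_back(i, j);
            }
        }
    }
    return pairs;
}
```

Each reported pair is only a candidate. Per step 4 of the talk, invalidations are then tracked on the virtual line around the pair before anything is reported. The quadratic scan is acceptable here only because the sketch assumes a short list of hot accesses per suspected region.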

36 4. Tracking Cache Invalidations on the Virtual Line
[Figure: accesses X and Y at distance d, with (sz-d)/2 marked; tracked virtual line and non-tracked virtual lines]