LASER: Light, Accurate Sharing dEtection and Repair Liang Luo, Akshitha Sriraman, Brooke Fugate, Shiliang Hu, Chris J Newburn, Gilles Pokam, Joseph Devietti.

Slides:

Advertisements

Similar presentations

Profiler In software engineering, profiling ("program profiling", "software profiling") is a form of dynamic program analysis that measures, for example,

Advertisements

A Structure Layout Optimization for Multithreaded Programs Easwaran Raman, Princeton Robert Hundt, Google Sandya S. Mannarswamy, HP.

Software & Services Group PinPlay: A Framework for Deterministic Replay and Reproducible Analysis of Parallel Programs Harish Patil, Cristiano Pereira,

An Case for an Interleaving Constrained Shared-Memory Multi-Processor Jie Yu and Satish Narayanasamy University of Michigan.

Exploring Memory Consistency for Massively Threaded Throughput- Oriented Processors Blake Hechtman Daniel J. Sorin 0.

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.

ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto

Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.

Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.

A Randomized Dynamic Program Analysis for Detecting Real Deadlocks Koushik Sen CS 265.

1 U NIVERSITY OF M ASSACHUSETTS, A MHERST School of Computer Science P REDATOR : Predictive False Sharing Detection Tongping Liu*, Chen Tian, Ziang Hu,

Calvin: Deterministic or Not? Free Will to Choose Derek R. Hower, Polina Dudnik, Mark D. Hill, David A. Wood.

The Path to Multi-core Tools Paul Petersen. Multi-coreToolsThePathTo 2 Outline Motivation Where are we now What is easy to do next What is missing.

Pipelined Profiling and Analysis on Multi-core Systems Qin Zhao Ioana Cutcutache Weng-Fai Wong PiPA.

An efficient data race detector for DIOTA Michiel Ronsse, Bastiaan Stougie, Jonas Maebe, Frank Cornelis, Koen De Bosschere Department of Electronics and.

Whose Cache Line Is It Anyway? Mihir Nanavati, Mark Spear, Nathan Taylor, Shriram Rajagopalan, Dutch T. Meyer, William Aiello, and Andrew Warfield University.

Continuously Recording Program Execution for Deterministic Replay Debugging.

CS 333 Introduction to Operating Systems Class 18 - File System Performance Jonathan Walpole Computer Science Portland State University.

Page-based Commands for DRAM Systems Aamer Jaleel Brinda Ganesh Lei Zong.

Automatic Compaction of OS Kernel Code via On-Demand Code Loading Haifeng He, Saumya Debray, Gregory Andrews The University of Arizona.

Page-based Commands for DRAM Systems Aamer Jaleel Brinda Ganesh Lei Zong.

PathExpander: Architectural Support for Increasing the Path Coverage of Dynamic Bug Detection S. Lu, P. Zhou, W. Liu, Y. Zhou, J. Torrellas University.

MemTracker Efficient and Programmable Support for Memory Access Monitoring and Debugging Guru Venkataramani, Brandyn Roemer, Yan Solihin, Milos Prvulovic.

Catching Accurate Profiles in Hardware Satish Narayanasamy, Timothy Sherwood, Suleyman Sair, Brad Calder, George Varghese Presented by Jelena Trajkovic.

Chapter 1: Introduction To Computer | SCP1103 Programming Technique C | Jumail, FSKSM, UTM, 2005 | Last Updated: July 2005 Slide 1 Introduction To Computers.

Replay Debugging for Distributed Systems Dennis Geels, Gautam Altekar, Ion Stoica, Scott Shenker.

RCDC SLIDES README Font Issues – To ensure that the RCDC logo appears correctly on all computers, it is represented with images in this presentation. This.

Fast Dynamic Binary Translation for the Kernel Piyus Kedia and Sorav Bansal IIT Delhi.

Rapid Identification of Architectural Bottlenecks via Precise Event Counting John Demme, Simha Sethumadhavan Columbia University

More on Locks: Case Studies

A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay Min Xu, Rastislav Bodik, Mark D. Hill

Introduction and Overview Questions answered in this lecture: What is an operating system? How have operating systems evolved? Why study operating systems?

LiNK: An Operating System Architecture for Network Processors Steve Muir, Jonathan Smith Princeton University, University of Pennsylvania

15-740/ Oct. 17, 2012 Stefan Muller.  Problem: Software is buggy!  More specific problem: Want to make sure software doesn’t have bad property.

Accelerating Precise Race Detection Using Commercially-Available Hardware Transactional Memory Support Serdar Tasiran Koc University, Istanbul, Turkey.

Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.

Performance Monitoring on the Intel ® Itanium ® 2 Processor CGO’04 Tutorial 3/21/04 CK. Luk Massachusetts Microprocessor Design.

NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL Presentation on ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS Publisher’s:

Thread-Level Speculation Karan Singh CS

A Case for Unlimited Watchpoints Joseph L. Greathouse †, Hongyi Xin*, Yixin Luo †‡, Todd Austin † † University of Michigan ‡ Shanghai Jiao Tong University.

Using Model Checking to Find Serious File System Errors StanFord Computer Systems Laboratory and Microsft Research. Published in 2004 Presented by Chervet.

University of Maryland Dynamic Floating-Point Error Detection Mike Lam, Jeff Hollingsworth and Pete Stewart.

On-Demand Dynamic Software Analysis Joseph L. Greathouse Ph.D. Candidate Advanced Computer Architecture Laboratory University of Michigan December 12,

Accelerating Dynamic Software Analyses Joseph L. Greathouse Ph.D. Candidate Advanced Computer Architecture Laboratory University of Michigan December 1,

CS333 Intro to Operating Systems Jonathan Walpole.

Virtual Application Profiler (VAPP) Problem – Increasing hardware complexity – Programmers need to understand interactions between architecture and their.

Guiding Ispike with Instrumentation and Hardware (PMU) Profiles CGO’04 Tutorial 3/21/04 CK. Luk Massachusetts Microprocessor Design.

Detecting Atomicity Violations via Access Interleaving Invariants

Embedded System Lab. 오명훈 Addressing Shared Resource Contention in Multicore Processors via Scheduling.

HARD: Hardware-Assisted lockset- based Race Detection P.Zhou, R.Teodorescu, Y.Zhou. HPCA’07 Shimin Chen LBA Reading Group Presentation.

Shouqing Hao Institute of Computing Technology, Chinese Academy of Sciences Processes Scheduling on Heterogeneous Multi-core Architecture.

Energy Efficient Prefetching and Caching Athanasios E. Papathanasiou and Michael L. Scott. University of Rochester Proceedings of 2004 USENIX Annual Technical.

On-Demand Dynamic Software Analysis Joseph L. Greathouse Ph.D. Candidate Advanced Computer Architecture Laboratory University of Michigan November 29,

PAPI on Blue Gene L Using network performance counters to layout tasks for improved performance.

Demand-Driven Software Race Detection using Hardware Performance Counters Joseph L. Greathouse †, Zhiqiang Ma ‡, Matthew I. Frank ‡ Ramesh Peri ‡, Todd.

E-MOS: Efficient Energy Management Policies in Operating Systems

*Pentium is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries Performance Monitoring.

FastTrack: Efficient and Precise Dynamic Race Detection [FlFr09] Cormac Flanagan and Stephen N. Freund GNU OS Lab. 23-Jun-16 Ok-kyoon Ha.

Remix Online Detection and Repair of Cache Contention for the JVM

On-Demand Dynamic Software Analysis

ProRace: Practical Data Race Detection for Production Use

Algorithmic Improvements for Fast Concurrent Cuckoo Hashing

Software Coherence Management on Non-Coherent-Cache Multicores

Tong Zhang, Dongyoon Lee, Changhee Jung

Effective Data-Race Detection for the Kernel

Address Translation for Manycore Systems

CMSC 611: Advanced Computer Architecture

On-Demand Dynamic Software Analysis

CMSC 611: Advanced Computer Architecture

Presentation transcript:

LASER: Light, Accurate Sharing dEtection and Repair Liang Luo, Akshitha Sriraman, Brooke Fugate, Shiliang Hu, Chris J Newburn, Gilles Pokam, Joseph Devietti

Multicore is Eating the World + Performance + Energy efficiency - Performance bugs 2

Cache Contention Bugs Contention for a single cash line Caused significant performance loss (Linux, MySQL, Boost) Architecture-specific Hard to find and debug 3

Background Cache coherence keeps private caches in sync All protocols share 3 key states: Modified, Shared, Invalid X:M L1 $ Core 0 Core 1 Write X Read X 4 X:S

Two Types of Contention Same Bytes == True Sharing X:M Core 0 Core 1 Write X Read X X 5 Cache Line

Two Types of Contention Different Bytes == False Sharing X:M Core 0 Core 1 Write X Read Y X Y 6

Related Work Detection & Repair for False Sharing Sheriff [Liu and Berger, OOPSLA 2011] Plastic [Nanavati et al., EuroSys 2013] Cheetah [Liu and Liu, CGO 2016] vTune Amplifier XE [Intel] Detection for generic events 7 Offline Analysis Predicts speedup from FS

MEM_LOAD_UOPS_LLC_XNSP_HITM PEBS Record struct { uint64_t ip; uint64_t addr; … char csrc; char cdst; } HitM Events A fundamental part of both types of contention A cache hit in a remote core’s cache in M state. 8 X:M L1 $ Core 0 Core 1 Write X Read X

HITM Record Accuracy 160 simple programs with read/write and write/write true/false/no sharing Intel Core i7-4770K 3.4GHz Haswell 4-core processor 9

HITM Record Accuracy(1/2) % correct data addresses simple benchmarks Counting only exact addresses as being correct

HITM Record Accuracy(2/2) 160 simple benchmarks 11 % correct PC addresses Counting only exact PC as being correct Counting adjacent PCs as being correct

LASER L ight: leverage Haswell h/w, no s/w or OS changes A ccurate: low false positives and false negatives S haring: detects both true and false sharing d E tection: works as a profiling tool R epair: automatically repairs false sharing at runtime 12

LASER System Overview Haswell or later Processors Linux Kernel Driver Detector Process LASER Repair App Process User Level App Operating System Hardware 13

Detection Algorithm W Store X+8, 4B HITM Load X, 2B Disjoint FALSE SHARING 14 Cache Line Model

Detection Algorithm Aggregate by Source Code Line Sort and Filter Foo.c:23 15

Evaluating LASER Detection Intel Core i7-4770K 3.4GHz Haswell 4-core processor 33 workloads from Phoenix 1.0, Parsec 3.0 and Splash2X 16

LASER Detection Accuracy Created a database of manually-validated cache contention bugs 9 bugs total across 33 workloads 4 new bugs discovered by LASER 17 False NegativeFalse Positive Runnable Benchmarks LASER02433/33 VTUNE16433/33 SHERIFF3412/33

Profiling Performance Comparison LASER (1.01x) vs VTune Amplifier XE 2015 (1.8x) 18

Repairing FS with SSB Needs Online False Sharing elimination! Legacy programs – no access to source code. Programs that need to be always running. Challenge is needing to rewrite the program without breaking it as it is running. Solve the problem of online FS repair with a Software Store Buffer (SSB). LASER repair tool is launched to attempt fix of FS. Implemented with Intel Pin 19

Cache Repairing FS with SSB 20 Core 0 Core 1 Write X,2 Read X Tail Cache X=2 Store Buffer 0 Store Buffer 1 Read X=0 …… X=1 …… … Read XRead X=2 Flush Shared Modified X=0 Invalid X=0 Flush at synchronization points for TSO compliance and basic block end for performance.

The conventional hardware store buffer is not good enough for speedup. Needs optimizations for better performance May cause subtle memory consistency issues. For good performance, requires coalescing store buffer. But coalescing violates TSO. E.g. Sheriff does not provide TSO compliance. Challenges with Optimizing SSB 21

LASER’s TSO-Compliant, Coalescing SSB Instrument regions from Laser’s input by walking through the CFG. Coalescing, TSO compliant SSB made possible with Intel TSX. 22 Read X: Result = SSB[X] If (Result == null) Result = *X; Return Result; Write X,Val: If (SSB[X] == null && SSB.full()) Flush(); SSB[X] = Val; Flush: TSX_Begin_Transaction Foreach pair in SSB *pair.memory = pair.value; TSX_End_Transaction Redo_TSX_Transaction_If_Fails

LASER FS Repair Performance Automatic speedups of up to 19% LASER profiling informs manual fixes of up to 17x 23

Conclusions Cache contention bugs undermine the promise of multicore LASER uses Intel’s Haswell platform for fast, precise contention detection and automatic false sharing repair Many opportunities to leverage Haswell’s sharing detection capabilities 24 REMIX: Online Detection and Repair of Cache Contention for the JVM [PLDI 2016] Add to read list Researchers Who Read This Paper Also Read 