Demand-Driven Software Race Detection using Hardware Performance Counters Joseph L. Greathouse †, Zhiqiang Ma ‡, Matthew I. Frank ‡ Ramesh Peri ‡, Todd.

Slides:



Advertisements
Similar presentations
On-the-fly Healing of Race Conditions in ARINC-653 Flight Software
Advertisements

Transactional Memory Parag Dixit Bruno Vavala Computer Architecture Course, 2012.
Michael Bond (Ohio State) Milind Kulkarni (Purdue)
Michael Bond Milind Kulkarni Man Cao Minjia Zhang Meisam Fathi Salmi Swarnendu Biswas Aritra Sengupta Jipeng Huang Ohio State Purdue.
UW-Madison Computer Sciences Multifacet Group© 2011 Karma: Scalable Deterministic Record-Replay Arkaprava Basu Jayaram Bobba Mark D. Hill Work done at.
Multiprocessor Architectures for Speculative Multithreading Josep Torrellas, University of Illinois The Bulk Multicore Architecture for Programmability.
A KTEC Center of Excellence 1 Cooperative Caching for Chip Multiprocessors Jichuan Chang and Gurindar S. Sohi University of Wisconsin-Madison.
Recording Inter-Thread Data Dependencies for Deterministic Replay Tarun GoyalKevin WaughArvind Gopalakrishnan.
Hybrid Transactional Memory Nir Shavit MIT and Tel-Aviv University Joint work with Alex Matveev (and describing the work of many in this summer school)
Calvin: Deterministic or Not? Free Will to Choose Derek R. Hower, Polina Dudnik, Mark D. Hill, David A. Wood.
Today’s Agenda  Midterm: Nov 3 or 10  Finish Message Passing  Race Analysis Advanced Topics in Software Engineering 1.
Concurrency.
[ 1 ] Agenda Overview of transactional memory (now) Two talks on challenges of transactional memory Rebuttals/panel discussion.
Chapter 6: Process Synchronization. 6.2 Silberschatz, Galvin and Gagne ©2005 Operating System Concepts – 7 th Edition, Feb 8, 2005 Objectives Understand.
Operating Systems CSE 411 CPU Management Oct Lecture 13 Instructor: Bhuvan Urgaonkar.
Rapid Identification of Architectural Bottlenecks via Precise Event Counting John Demme, Simha Sethumadhavan Columbia University
NVSleep: Using Non-Volatile Memory to Enable Fast Sleep/Wakeup of Idle Cores Xiang Pan and Radu Teodorescu Computer Architecture Research Lab
A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay Min Xu, Rastislav Bodik, Mark D. Hill
15-740/ Oct. 17, 2012 Stefan Muller.  Problem: Software is buggy!  More specific problem: Want to make sure software doesn’t have bad property.
Parallelizing Security Checks on Commodity Hardware E.B. Nightingale, D. Peek, P.M. Chen and J. Flinn U Michigan.
Accelerating Precise Race Detection Using Commercially-Available Hardware Transactional Memory Support Serdar Tasiran Koc University, Istanbul, Turkey.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
Cosc 4740 Chapter 6, Part 3 Process Synchronization.
Concurrency, Mutual Exclusion and Synchronization.
- 1 - Dongyoon Lee †, Mahmoud Said*, Satish Narayanasamy †, Zijiang James Yang*, and Cristiano L. Pereira ‡ University of Michigan, Ann Arbor † Western.
Eraser: A Dynamic Data Race Detector for Multithreaded Programs STEFAN SAVAGE, MICHAEL BURROWS, GREG NELSON, PATRICK SOBALVARRO, and THOMAS ANDERSON Ethan.
Chapter 1 Computer System Overview Sections 1.1 to 1.6 Instruction exe cution Interrupt Memory hierarchy Cache memory Locality: spatial and temporal Problem.
DoubleChecker: Efficient Sound and Precise Atomicity Checking Swarnendu Biswas, Jipeng Huang, Aritra Sengupta, and Michael D. Bond The Ohio State University.
A Case for Unlimited Watchpoints Joseph L. Greathouse †, Hongyi Xin*, Yixin Luo †‡, Todd Austin † † University of Michigan ‡ Shanghai Jiao Tong University.
Copyright, 1996 © Dale Carnegie & Associates, Inc. Life with Hardware Threads to Burn Todd C. Mowry Intel Research Pittsburgh & Carnegie Mellon University.
Rerun: Exploiting Episodes for Lightweight Memory Race Recording Derek R. Hower and Mark D. Hill Computer systems complex – more so with multicore What.
Shared Memory Consistency Models. SMP systems support shared memory abstraction: all processors see the whole memory and can perform memory operations.
On-Demand Dynamic Software Analysis Joseph L. Greathouse Ph.D. Candidate Advanced Computer Architecture Laboratory University of Michigan December 12,
Accelerating Dynamic Software Analyses Joseph L. Greathouse Ph.D. Candidate Advanced Computer Architecture Laboratory University of Michigan December 1,
A Regulated Transitive Reduction (RTR) for Longer Memory Race Recording (ASLPOS’06) Min Xu Rastislav BodikMark D. Hill Shimin Chen LBA Reading Group Presentation.
Hardware Mechanisms for Distributed Dynamic Software Analysis Joseph L. Greathouse Advisor: Prof. Todd Austin May 10, 2012.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Demand Paging.
Platform Abstraction Group 3. Question How to deal with different types hardware and software platforms? What detail to expose to the programmer? What.
Aditya Thakur Rathijit Sen Ben Liblit Shan Lu University of Wisconsin–Madison Workshop on Dynamic Analysis 2009 Cooperative Crug Isolation.
Hardware Support for On-Demand Software Analysis Joseph L. Greathouse Advanced Computer Architecture Laboratory University of Michigan December 8, 2011.
Sampling Dynamic Dataflow Analyses Joseph L. Greathouse Advanced Computer Architecture Laboratory University of Michigan University of British Columbia.
HARD: Hardware-Assisted lockset- based Race Detection P.Zhou, R.Teodorescu, Y.Zhou. HPCA’07 Shimin Chen LBA Reading Group Presentation.
On-Demand Dynamic Software Analysis Joseph L. Greathouse Ph.D. Candidate Advanced Computer Architecture Laboratory University of Michigan November 29,
The Potential of Sampling for Dynamic Analysis Joseph L. GreathouseTodd Austin Advanced Computer Architecture Laboratory University of Michigan PLAS, San.
Atom-Aid: Detecting and Surviving Atomicity Violations Brandon Lucia, Joseph Devietti, Karin Strauss and Luis Ceze LBA Reading Group 7/3/08 Slides by Michelle.
Architectural Features of Transactional Memory Designs for an Operating System Chris Rossbach, Hany Ramadan, Don Porter Advanced Computer Architecture.
*Pentium is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries Performance Monitoring.
LASER: Light, Accurate Sharing dEtection and Repair Liang Luo, Akshitha Sriraman, Brooke Fugate, Shiliang Hu, Chris J Newburn, Gilles Pokam, Joseph Devietti.
1© Copyright 2015 EMC Corporation. All rights reserved. NUMA(YEY) BY JACOB KUGLER.
Kendo: Efficient Deterministic Multithreading in Software M. Olszewski, J. Ansel, S. Amarasinghe MIT to be presented in ASPLOS 2009 slides by Evangelos.
December 1, 2006©2006 Craig Zilles1 Threads & Atomic Operations in Hardware  Previously, we introduced multi-core parallelism & cache coherence —Today.
A Case for Unlimited Watchpoints
On-Demand Dynamic Software Analysis
Rerun: Exploiting Episodes for Lightweight Memory Race Recording
Tong Zhang, Dongyoon Lee, Changhee Jung
Minh, Trautmann, Chung, McDonald, Bronson, Casper, Kozyrakis, Olukotun
Virtualizing Transactional Memory
PHyTM: Persistent Hybrid Transactional Memory
CSC 591/791 Reliable Software Systems
Atomic Operations in Hardware
Atomic Operations in Hardware
Effective Data-Race Detection for the Kernel
Hardware Mechanisms for Distributed Dynamic Software Analysis
Professor, No school name
Chapter 26 Concurrency and Thread
Lecture 22: Consistency Models, TM
On-Demand Dynamic Software Analysis
Eraser: A dynamic data race detector for multithreaded programs
Sampling Dynamic Dataflow Analyses
Presentation transcript:

Demand-Driven Software Race Detection using Hardware Performance Counters Joseph L. Greathouse †, Zhiqiang Ma ‡, Matthew I. Frank ‡ Ramesh Peri ‡, Todd Austin † † University of Michigan ‡ Intel Corporation ISCA-38 June 6, 2011

In spite of proposed hardware solutions Concurrency Bugs Still Matter 2 Hardware Data Race Recording Deterministic Execution/Replay Atomicity Violation Detectors Bug-Free Memory Models Bulk Memory Commits TRANSACTIONAL MEMORY AMD ASF ? AMD ASF ? Sun Rock ? Sun Rock ?

Concurrency Bugs Matter NOW 3 if(ptr==NULL) memcpy(ptr, data2, len2) ptr LEAKED TIME Thread 2 mylen=large Thread 1 mylen=small ∅ len2=thread_local->mylen; ptr=malloc(len2); len1=thread_local->mylen; ptr=malloc(len1); memcpy(ptr, data1, len1) Nov OpenSSL Security Flaw

This Talk in One Sentence Speed up software race detection with existing hardware support. 4

Software Data Race Detection Add checks around every memory access Find inter-thread sharing events Synchronization between write-shared accesses?  No? Data race. 5

Thread 2 mylen=large Thread 1 mylen=small if(ptr==NULL) len1=thread_local->mylen; ptr=malloc(len1); memcpy(ptr, data1, len1) len2=thread_local->mylen; ptr=malloc(len2); memcpy(ptr, data2, len2) Example of Data Race Detection 6 ptr write-shared? Interleaved Synchronization? TIME

SW Race Detection is Slow 7 PhoenixPARSEC

Goal of this Work Accelerate Software Data Race Detection Technique #1: Making it Fast Demand-Driven Data Race Detection Technique #2: Keeping it Real Find sharing events with existing HW 8

Inter-thread Sharing is What’s Important “ Data races... are failures in programs that access and update shared data in critical sections” – Netzer & Miller, if(ptr==NULL) len1=thread_local->mylen; ptr=malloc(len1); memcpy(ptr, data1, len1) len2=thread_local->mylen; ptr=malloc(len2); memcpy(ptr, data2, len2) Thread-local data NO SHARING Shared data NO INTER-THREAD SHARING EVENTS TIME

Very Little Inter-Thread Sharing 10 PhoenixPARSEC

Technique 1: Demand-Driven Analysis 11 Multi-threaded Application Multi-threaded Application Software Race Detector Local Access Inter-thread sharing Inter-thread Sharing Monitor

Check each memory op. for write-sharing Signal software race detector on sharing Possible to do in software +Can be built now with instrumentation –Slow. May take as long as race detection 12

Follow read/write sets of threads Fast user-level faults Sharing Monitor Thread 1 WRITE Y Ideal Hardware Sharing Detector 13 Thread 2 READ Y Thread 1 WRITE Y T1 R: ∅ W: ∅ T2 R: ∅ W: ∅ Thread 2 READ Y T1 R: ∅ W: {Y} W->R Sharing Multi-threaded Application Multi-threaded Application Software Race Detector Inter-thread Sharing Monitor

Limitations of Existing Hardware Fast faults  Solution: Enable detector for long periods of time Read/write sets  Solution: 14 Multi-threaded Application Multi-threaded Application Software Race Detector Inter-thread Sharing Monitor NO SHARING

Technique 2: Hardware Sharing Detector Hardware Performance Counters  Interrupt on cache coherency events  Intel’s HITM event: W→R Data Sharing Limitations of this method:  SMT sharing can’t be counted  Cache eviction  Others in paper 15 S S M M S S I I HITM

Demand-Driven Analysis on Real HW 16 Execute Instruction SW Race Detection Enable Analysis Disable Analysis HITM Interrupt? HITM Interrupt? Sharing Recently? Analysis Enabled? NO YES

Experimental Evaluation Modified Intel Inspector XE Race Detector Linux on 4-core Core i7, no Hyper-Threading Performance Tests:  Phoenix Suite  PARSEC Accuracy Tests:  Phoenix Suite  PARSEC  Pre-release version of RADBench 17 Simulation

Performance Difference 18 PhoenixPARSEC

Performance Increases 19 PhoenixPARSEC 51x

Demand-Driven Analysis Accuracy 20 1/1 2/4 3/3 4/4 3/3 4/4 2/4 4/4 2/4 Accuracy vs. Continuous Analysis: 97%

Future Directions Better Performance  Fast user-level faults  Application specific hardware More Accuracy  Better performance counters  Inform SW on cache evictions/misses Smooth transition to ideal hardware Combine sampling & demand-driven analysis 21

BACKUP SLIDES 22

Concurrency Bugs Matter NOW 23 if(ptr == NULL) { len=thread_local->mylen; ptr=malloc(len); memcpy(ptr, data, len); } Thread 2 mylen=large Nov OpenSSL Security Flaw Thread 1 mylen=small

Concurrency Bugs Matter NOW 24 Thread 2 if(ptr==NULL) ptr ∅ mylen=large len1=thread_local->mylen; ptr=malloc(len1); memcpy(ptr, data1, len1) len2=thread_local->mylen; ptr=malloc(len2); memcpy(ptr, data2, len2) TIME Thread 1 mylen=small

Concurrency Bugs Matter NOW 25 Thread 2 if(ptr==NULL) ptr mylen=large len2=thread_local->mylen; TIME ptr=malloc(len2); memcpy(ptr, data2, len2) len1=thread_local->mylen; ptr=malloc(len1); memcpy(ptr, data1,len1) Thread 1 mylen=small ∅

Demand-Driven Analysis Algorithm 26

Demand-Driven Analysis on Real HW 27

Accuracy on Real Hardware 28 kmeansfacesimferretfreqminevipsx264streamcluster W→W1/1 (100%) 0/1 (0%) --1/1 (100%) - R→W-0/1 (0%) 2/2 (100%) 1/1 (100%) 3/3 (100%) 1/1 (100%) W→R-2/2 (100%) 1/1 (100%) 2/2 (100%) 1/1 (100%) 3/3/ (100%) 1/1 (100%) Spider Monkey-0 Spider Monkey-1 Spider Monkey-2 NSPR-1Memcached-1Apache-1 W→W9/9 (100%) 1/1 (100%) 3/3 (100%) -1/1 (100%) R→W3/3 (100%) -1/1 (100%) 7/7 (100%) W→R8/8 (100%) 1/1 (100%) 2/2 (100%) 4/4 (100%) -2/2 (100%) kmeansfacesimferretfreqminevipsx264streamcluster W→W1/1 (100%) 0/1 (0%) --1/1 (100%) - R→W-0/1 (0%) 2/2 (100%) 1/1 (100%) 3/3 (100%) 1/1 (100%) W→R-2/2 (100%) 1/1 (100%) 2/2 (100%) 1/1 (100%) 3/3/ (100%) 1/1 (100%)