Computer Architecture Lab at Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi and Todd C. Mowry.

Presentation transcript:

ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications
Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi and Todd C. Mowry
Computer Architecture Lab

Software Errors & Analysis Tools
Errors abundant in parallel software
– Program crashes/vulnerabilities, limited performance
Three main categories of analysis tools
– Checking before, during or after program execution
Instruction-grain Lifeguards
– Online detailed analysis, but with high overhead
– Several tools available, but mostly support for single-threaded code
ParaLog: a framework for efficient analysis of parallel applications
© Evangelos Vlachos, ASPLOS '10 - ParaLog

Lifeguards and Parallel Applications
[Figure: application threads over time, under Timesliced Execution & Analysis and under Parallel Execution & Analysis]
Timesliced Execution & Analysis: DBI tools available today
- high overhead due to serialization
Butterfly Analysis (previous talk): windows of uncertainty
- some false positives
+ software-based
ParaLog (this talk): precise application order
- new hardware required
+ no false positives
+ even better performance

Low-Overhead Instruction-level Analysis
[Figure: online monitoring platform. On the application core, event capturing records the application thread's instructions into an event stream; on the lifeguard core, event delivery passes them to the lifeguard thread, which maintains the metadata. Accelerators: IT, IF, MTLB [Chen et al., ISCA'08]]
Example: the application instruction add r1 ← r2, r4 is logged as the event (add, r1, r2, r4) and delivered to:
add_handler() {
    i = load_state(r2);
    j = load_state(r4);
    if (check(i, j)) upd_state(r1);
    else error();
}
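The add_handler() pseudocode on this slide can be made concrete as follows. This is a minimal Python sketch, not the Log-Based Architectures interface: the `Lifeguard` class, the boolean per-register metadata and the "both sources must be initialized" policy inside `check` are assumptions made for illustration only.

```python
# Hypothetical sketch; class, method and policy names are illustrative assumptions.
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, str, str, str]  # (opcode, dest, src1, src2)


class Lifeguard:
    def __init__(self) -> None:
        self.state: Dict[str, bool] = {}      # per-register metadata
        self.errors: List[Event] = []         # events that failed the check
        self.handlers: Dict[str, Callable[[Event], None]] = {
            "add": self.add_handler,
        }

    def load_state(self, reg: str) -> bool:
        return self.state.get(reg, False)     # unknown registers count as uninitialized

    def upd_state(self, reg: str, value: bool = True) -> None:
        self.state[reg] = value

    @staticmethod
    def check(i: bool, j: bool) -> bool:
        # Assumed policy: both source operands must be initialized.
        return i and j

    def add_handler(self, event: Event) -> None:
        _, dest, src1, src2 = event
        i = self.load_state(src1)
        j = self.load_state(src2)
        if self.check(i, j):
            self.upd_state(dest)
        else:
            self.errors.append(event)         # stands in for error()

    def deliver(self, event: Event) -> None:
        # Event delivery: dispatch on the opcode of the logged event.
        self.handlers[event[0]](event)


lg = Lifeguard()
lg.upd_state("r2")
lg.upd_state("r4")
lg.deliver(("add", "r1", "r2", "r4"))   # both sources initialized: r1 is updated
lg.deliver(("add", "r5", "r1", "r9"))   # r9 was never initialized: recorded as an error
```

Keeping one handler per event type mirrors the slide's structure: the application core only logs events, and all analysis state lives on the lifeguard side.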

Challenges in Parallel Monitoring
[Figure: online parallel monitoring platform [ParaLog]. Each application thread 1..k has its own event capturing, event stream and event delivery feeding lifeguard thread 1..k; all lifeguard threads share the global metadata. Accelerators: IT, IF, MTLB]

Addressing the Challenges
1. Application event ordering
2. Ensuring metadata access atomicity efficiently
3. Parallelizing hardware accelerators
[Figure: online parallel monitoring platform [ParaLog]. For each application thread 1..k, application-only order capturing is added next to event capturing, and order enforcing is added next to event delivery in front of lifeguard thread 1..k; dependence arcs connect the per-thread event streams, and the lifeguard threads share the global metadata. Accelerators: IT, IF, MTLB]

Outline
Introduction
Addressing the Challenges of Parallel Monitoring
1. Capturing & enforcing application event ordering
2. Ensuring metadata access atomicity
3. Parallelizing hardware accelerators
Evaluation
Conclusions

Event Ordering: the Problem
Case Study: Information flow analysis (i.e., TaintCheck)
[Figure: in application time, application thread j executes store(A) and application thread k executes load(A); in lifeguard time, lifeguard thread j runs st_handler(A) and lifeguard thread k runs ld_handler(A)]
Expose happens-before information to lifeguards

Event Ordering: the Solution (1/2)
Coherence-based ordering of application events
– Similar to FDR, but online, focusing on application-only events
[Figure: application thread j executes store(A) at time t_j (timeline t_j - 1, t_j, t_j + 1); application thread k executes load(A) at time t_k (timeline t_k - 1, t_k, t_k + 1) and records the dependence arc {thread j, t_j}. Lifeguard thread j advances progress_j through t_j - 2, t_j - 1, t_j as it reaches st_handler(A); lifeguard thread k advances progress_k through t_k - 2, t_k - 1, t_k and, before ld_handler(A), waits while progress_j < t_j]
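The "wait while progress_j < t_j" rule can be sketched as a small deterministic simulation. The following Python is an illustration under stated assumptions, not ParaLog's hardware mechanism: the log layout `(timestamp, name, arc)`, the function name `run_lifeguards` and the round-robin scheduling are all hypothetical, and real lifeguard threads would run concurrently rather than in a loop.

```python
# Hypothetical sketch of order enforcing with per-thread progress counters.
from collections import deque


def run_lifeguards(logs):
    """logs maps thread id -> list of (timestamp, name, arc).

    arc is None or (src_thread, src_time): the event may only be handled
    once lifeguard src_thread has processed its event with time src_time.
    """
    queues = {t: deque(events) for t, events in logs.items()}
    progress = {t: 0 for t in logs}        # last timestamp each lifeguard finished
    order = []
    while any(queues.values()):
        advanced = False
        for t, q in queues.items():
            if not q:
                continue
            ts, name, arc = q[0]
            if arc is not None and progress[arc[0]] < arc[1]:
                continue                   # wait while progress_j < t_j
            q.popleft()
            order.append((t, name))
            progress[t] = ts               # publish progress for waiting lifeguards
            advanced = True
        if not advanced:
            raise RuntimeError("no lifeguard can advance: cyclic dependence arcs")
    return order


logs = {
    "k": [(1, "ld_handler(A)", ("j", 2))],             # load(A) depends on {thread j, t_j = 2}
    "j": [(1, "nop", None), (2, "st_handler(A)", None)],
}
order = run_lifeguards(logs)
```

Even though lifeguard k is scheduled first here, its ld_handler(A) is deferred until lifeguard j has published progress 2, so the lifeguards observe the same order as the application.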

Event Ordering: the Solution (2/2)
Is monitoring coherence enough?
Previous work has not solved the problem of Logical Races
Both logical races and system calls resolved with Conflict Alert messages
[Figure: in application time, application thread j calls free(A) while application thread k executes load(A): a logical race. In lifeguard time, lifeguard thread j processes free(A) start and free(A) end, updating Metadata(A); a Conflict Alert message creates a dependence for ld_handler(A) in lifeguard thread k]
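One way to picture a Conflict Alert is as a broadcast that adds a dependence arc to every other thread's log, even though no cache-coherence traffic links free(A) to the racing load. The Python below is only a sketch under that assumption: the function `conflict_alert`, the log tuples and the per-thread clocks are hypothetical, and the sketch ignores the slide's distinction between the start and the end of free(A).

```python
# Hypothetical sketch: a Conflict Alert modeled as a broadcast dependence arc.
def conflict_alert(logs, clocks, src, label):
    """Log `label` on thread `src` and make every other thread depend on it."""
    clocks[src] += 1
    ts = clocks[src]
    logs[src].append((ts, label, None))
    for t in logs:
        if t != src:
            clocks[t] += 1
            # Marker event carrying the arc {src, ts}; later events in this
            # log are ordered after it, and therefore after `label`.
            logs[t].append((clocks[t], "conflict_alert", (src, ts)))
    return ts


logs = {"j": [], "k": []}
clocks = {"j": 0, "k": 0}

ts_free = conflict_alert(logs, clocks, "j", "free(A)")

# Thread k's load(A) is logged after the alert, so its handler cannot run
# before lifeguard j has processed free(A).
clocks["k"] += 1
logs["k"].append((clocks["k"], "ld_handler(A)", None))
```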

Metadata Atomicity
Frequent use of locking too expensive
– # of instructions added & synchronization cost
Dependence arcs handle the majority of the cases
– Sufficient conditions:
  1. One-to-one data-to-metadata mapping
  2. Application reads don't become metadata writes
– Enforcing dependence arcs ⇒ race-free operation
Rest of the cases handled by acquiring a lock
– Lock used only in the load_handler(); other handlers safe
(more details in the paper)
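The split between the lock-free common case and the locked load_handler() can be sketched as below. This is a minimal Python illustration, assuming a single global lock and a caller-supplied flag saying whether the lifeguard turns an application read into a metadata write; `MetadataTable`, `writes_metadata` and `lock_acquisitions` are hypothetical names, not taken from the paper.

```python
# Hypothetical sketch of a synchronization-free fast path and a locked slow path.
import threading


class MetadataTable:
    def __init__(self):
        self.meta = {}                      # one metadata entry per data address
        self.lock = threading.Lock()
        self.lock_acquisitions = 0          # instrumentation for the example only

    def store_handler(self, addr, value):
        # Application writes are already ordered by dependence arcs,
        # so the metadata write needs no lock.
        self.meta[addr] = value

    def load_handler(self, addr, writes_metadata=False, update=None):
        if not writes_metadata:
            # Fast path: an application read that only reads metadata.
            return self.meta.get(addr)
        # Slow path: concurrent application reads are unordered, so a
        # handler that writes metadata must serialize with a lock.
        with self.lock:
            self.lock_acquisitions += 1
            new_value = update(self.meta.get(addr))
            self.meta[addr] = new_value
            return new_value


table = MetadataTable()
table.store_handler(0x10, 0)
fast = table.load_handler(0x10)                                   # no lock taken
slow = table.load_handler(0x10, writes_metadata=True,
                          update=lambda v: v + 1)                 # lock taken once
```

The design choice illustrated here is that the lock cost is paid only by lifeguards whose load handlers update metadata; all other handlers rely on the enforced dependence arcs.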

Parallel Hardware Accelerators
Speed-up frequent lifeguard actions
– Metadata-TLB: fast metadata address calculation
– Idempotent Filters: filter out redundant checking
– Inheritance Tracking: fast tracking of dataflow paths
Accelerators have only local view of the analysis
– Cache locally analysis information (e.g., frequent events)
– Important events have application-wide effects (e.g., free())
– Coherence-like issues with accelerators' local state
Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush accelerators' state

Outline
Introduction
Addressing the Challenges of Parallel Monitoring
– Capturing & enforcing application event ordering
– Ensuring metadata access atomicity
– Parallelizing hardware accelerators
Evaluation
Conclusions

Experimental Framework
Log-Based Architectures framework
– Simics full-system simulation
– CMP system with {2, 4, 8, 16} cores
– {1, 2, 4, 8} application threads and as many lifeguard threads
– Sequentially Consistent memory model
Benchmarks and multithreaded Lifeguards used
– SPLASH-2 and PARSEC
– TaintCheck: Information flow tracking; accelerated by M-TLB, IT
– AddrCheck: Memory access checking; accelerated by M-TLB, IF
Comparison with Timesliced Monitoring

Performance Results: AddrCheck
[Chart: 8 app/lifeguard threads, 16 cores total; normalized to sequential, unmonitored execution]
Timesliced Monitoring is not scalable
– On average 15x slowdown over No Monitoring (8 threads)
Highest overhead with 8 threads: SWAPTIONS, 6x
Lowest overhead with 8 threads: < 5%
Average overhead with 8 threads: 26%

Performance Results: TaintCheck
[Chart not reproduced]
Timesliced Monitoring is not scalable
– On average 23x slowdown over No Monitoring (8 threads)
Highest overhead with 8 threads: BARNES, 2.6x
Lowest overhead with 8 threads: LU, 5%
Average overhead with 8 threads: 48%

Other Results in the Paper
Order capturing and order enforcing under TSO
Performance Impact of Lifeguard Accelerators
– AddrCheck: [1.13x – 3.4x], TaintCheck: [2x – 9x]
A less expensive order capturing mechanism gets similar performance results
– 1 timestamp per core vs. 1 timestamp per cache block

Conclusions
ParaLog: Fast and precise parallel monitoring
Components of event ordering
– Normal memory accesses: monitor coherence activity
– Logical Races: use of Conflict Alert messages
Metadata Atomicity
– Enforcing dependence arcs ensures atomicity (most cases)
Parallel Hardware Accelerators
– Flush local state on remote events (Conflict Alert)
Average overhead is relatively low
– AddrCheck: 26% and TaintCheck: 48% (8 threads)

Questions?

Backup Slides

Metadata Atomicity
Synchronization-free fast path vs. slow path
– Concurrent application reads; no ordering available!
  Concurrent metadata reads: follow the fast path
  Concurrent metadata writes: follow the slow path, acquiring a lock
  Concurrent metadata read and write: read may get either value
– In any other case dependence arcs are available
[Table: Application Event (R, W) vs. Lifeguard Action (R, RW) for AddrCheck, TaintCheck, MemCheck and LockSet]

Parallel Hardware Accelerators
Accelerators have only local view of the analysis
– Important events have system-wide effects
– Case study: Idempotent Filters and AddrCheck
[Figure: lifeguards LG 0 and LG 1 each sit behind an Idempotent Filter (IF) and receive streams of reads such as R(A), R(B), R(A), R(C). Legend: ✔ delivered to lifeguard; ✖ redundant, discarded. A free(A) flushes the local and remote IF filters, after which R(A) is delivered again]
Details for parallel M-TLB and IT can be found in the paper
Builds on Remote Conflict Messages
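The Idempotent Filter case study can be sketched in a few lines. The Python below is an illustration only: the hardware filter is a small cache, whereas this sketch assumes an unbounded set, and the names `IdempotentFilter` and `broadcast_flush` are hypothetical.

```python
# Hypothetical sketch of per-lifeguard Idempotent Filters flushed by free().
class IdempotentFilter:
    def __init__(self):
        self.seen = set()        # events whose check is known to be redundant

    def deliver(self, event):
        """Return True if the event must reach the lifeguard, False if discarded."""
        if event in self.seen:
            return False         # redundant: the same check already passed
        self.seen.add(event)
        return True

    def flush(self):
        self.seen.clear()


def broadcast_flush(filters):
    # Stands in for the Conflict Alert that accompanies free(A):
    # both the local and the remote filters lose their cached state.
    for f in filters:
        f.flush()


if_lg0, if_lg1 = IdempotentFilter(), IdempotentFilter()

first = if_lg0.deliver(("R", "A"))        # delivered
repeat = if_lg0.deliver(("R", "A"))       # redundant, discarded
remote = if_lg1.deliver(("R", "A"))       # delivered: filters are per lifeguard

broadcast_flush([if_lg0, if_lg1])         # free(A) on either thread
after_free = if_lg0.deliver(("R", "A"))   # delivered again, so the stale access is checked
```

Without the remote flush, a filter that had cached R(A) before the free(A) would keep discarding accesses to freed memory, which is the coherence-like issue the slide points out.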

Performance Impact of Lifeguard Accelerators
[Charts not reproduced]
Accelerators provide a major speedup
– TaintCheck: [2x – 9x]
– AddrCheck: [1.13x – 3.4x]

Transitive Reduction Sensitivity Study
Limited transitive reduction
– No major performance impact; savings in chip area

Supporting Total Store Order (TSO)
Cycle of dependencies in relaxed memory models
– TSO relaxes the RAW ordering
– Previous work (RTR): maintain versions of data
– Identify SC offending instructions; save loaded value
This paper: maintain versions of metadata
[Figure: commit order is Wr(A), Rd(B) on Thread 0 and Wr(B), Rd(A) on Thread 1. Log 0: Wr(A) with P(v1, A), Rd(B, v0) with C(v0, B). Log 1: Wr(B) with P(v0, B), Rd(A, v1) with C(v1, A). Lifeguard 0 executes: produce_version(v1, A); store_handler(A); wait_until_available(v0, B); load_handler(B, v0)]
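The lifeguard sequence in the figure suggests the following reading, sketched in Python. It assumes that P and C stand for producing and consuming a metadata version, and that a produced version is a snapshot of the metadata taken just before the store handler overwrites it; the class `VersionedMetadata` and its methods are hypothetical, and `is_available` is a non-blocking stand-in for wait_until_available().

```python
# Hypothetical sketch of metadata versioning for loads that bypass a store under TSO.
class VersionedMetadata:
    def __init__(self, initial):
        self.current = dict(initial)   # live metadata, one entry per address
        self.versions = {}             # (version id, address) -> saved snapshot

    def produce_version(self, vid, addr):
        # Save the metadata a remote load saw before the store is analyzed.
        self.versions[(vid, addr)] = self.current.get(addr)

    def is_available(self, vid, addr):
        return (vid, addr) in self.versions

    def store_handler(self, addr, value):
        self.current[addr] = value

    def load_handler(self, addr, vid=None):
        if vid is None:
            return self.current.get(addr)
        return self.versions[(vid, addr)]   # consume the saved version


md = VersionedMetadata({"A": "clean", "B": "clean"})

# Lifeguard 0: produce_version(v1, A); store_handler(A)
md.produce_version("v1", "A")
md.store_handler("A", "tainted")

# Lifeguard 1: produce_version(v0, B); store_handler(B)
md.produce_version("v0", "B")
md.store_handler("B", "tainted")

# Each lifeguard waits for the version its load consumed, then handles the load.
ready = md.is_available("v0", "B") and md.is_available("v1", "A")
seen_by_lg0 = md.load_handler("B", "v0")
seen_by_lg1 = md.load_handler("A", "v1")
```

Because both loads read saved versions rather than live metadata, neither lifeguard has to be ordered after the other's store handler, which breaks the dependence cycle shown on the slide.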

Parallel Hardware Accelerators
Speed-up frequent lifeguard actions
– Fast metadata address calculation: Metadata-TLB
– Fast tracking of data-flow paths: Inheritance Tracking
– Filter out redundant checking: Idempotent Filters
  Per-instruction checking gives the same result; cache event
Accelerators have only local view of the analysis
– Important events have system-wide effects (e.g., free())
– Coherence-like issues with accelerators' local state
Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush state and deliver pending events

Experimental Framework

Benchmarks      Input
barnes          16K bodies
ocean           Grid: 258 x 258
lu              Matrix: 1024 x 1024
fmm             32768 particles
radiosity       Base problem
blackscholes    Simlarge
fluidanimate    Simlarge
swaptions       Simlarge

Simulation Parameters
Cores                  {2, 4, 8, 16}, 1 GHz, in-order scalar x86
L1I & L1D (private)    64KB, 64B line, 4-way assoc.
L2 (shared)            {1, 2, 4, 8}MB, 64B line, 8-way assoc., 6-cycle latency
Memory                 90-cycle latency
Log Buffer             64KB per thread

Multithreaded Lifeguards
TaintCheck: Information flow tracking; accelerated by M-TLB and IT
AddrCheck: Memory access checking; accelerated by M-TLB and IF

[Chart: Relative Slowdown - TaintCheck]
[Chart: Relative Slowdown - AddrCheck]
[Chart: Performance Results - AddrCheck]
[Chart: Performance Results - TaintCheck]

Parallel Hardware Accelerators
Speed-up frequent lifeguard actions
– Metadata-TLB & Inheritance Tracking (discussed in the paper)
– Idempotent Filters: identify and filter out redundant checking
  Per-instruction checking gives the same result
  Cache incoming event and local state to identify redundancy
Accelerators have only local view of the analysis
– Important events have application-wide effects (e.g., free())
– Coherence-like issues with accelerators' local state
Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush accelerators' state