Dynamic Performance Tuning for Speculative Threads. Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre, Ankit Tarkas, Wei-Chung Hsu, and Antonia Zhai. Dept. of Computer Science and Engineering, Dept. of Electrical and Computer Engineering, University of Minnesota – Twin Cities

Motivation: Multicore processors have brought forth a large potential of computing power; exploiting this potential demands thread-level parallelism. [Figure: Intel Core i7 die photo]

Exploiting Thread-Level Parallelism: speculation potentially exposes more parallelism. [Figure: timelines of sequential, traditionally parallelized, and TLS execution of Store *p followed by Load *q.] Traditional parallelization must prove p != q; when it cannot, the compiler gives up. Thread-Level Speculation (TLS) executes the load in parallel anyway; if p == q (e.g., both address 0x88), the speculative load is a speculation failure and is squashed and re-executed. But such failures are unpredictable.
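The squash-and-re-execute behavior described above can be sketched as a toy model. This is purely illustrative; the function, memory representation, and squash counter are assumptions, not the hardware these slides describe:

```python
def run_tls(p, q, memory, store_val):
    """Thread 0 executes 'store *p' while a speculative thread executes
    'load *q' in parallel. If p == q, the speculative load saw stale data,
    so it is squashed and re-executed after the store commits."""
    squashes = 0
    spec_load = memory[q]       # speculative load runs ahead of the store
    memory[p] = store_val       # non-speculative store commits
    if p == q:                  # the feared dependence actually existed
        squashes += 1           # speculation failure: squash ...
        spec_load = memory[q]   # ... and re-execute the load
    return spec_load, squashes
```

Where a traditional compiler must serialize unless it can prove p != q, this model always runs speculatively and only pays the squash cost when the addresses actually alias.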

Addressing Unpredictability. State-of-the-art approach: profiling-directed optimization. [Figure: a typical compilation framework — source code and profiling info (from a profiling run) feed compiler analysis (region selection, speculative code optimization, parallel code generation), which emits machine code, with feedback to the profiling run.]

Problem #1: TLS overhead is hard to model statically, even with profiling. [Figure: execution timeline showing thread spawning, inter-thread synchronization (signal / wait / allow commit), speculation failure, and load imbalance.]
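As a rough illustration of why these overheads matter, a simple analytical sketch (the additive breakdown and all numbers are hypothetical, not measurements from the slides):

```python
def parallel_time(work_cycles, n_threads, spawn=0, sync=0, failure=0, imbalance=0):
    """Ideal parallel time plus the overheads named on the slide: thread
    spawning, inter-thread synchronization (signal/wait/allow-commit),
    speculation failures, and load imbalance."""
    return work_cycles / n_threads + spawn + sync + failure + imbalance

def tls_speedup(work_cycles, n_threads, **overheads):
    """Speedup over sequential execution once overheads are charged."""
    return work_cycles / parallel_time(work_cycles, n_threads, **overheads)
```

With 4 threads and no overhead the model gives a 4x speedup; charging 1000 combined overhead cycles against 4000 cycles of work halves it — exactly the kind of sensitivity that is hard to capture with a static model.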

Problem #2: a profile-based decision may not be appropriate for the actual input. [Figure: profiling observes behavior on the training input, which can differ from the actual behavior on the actual input.]

Problem #3: behavior of the same code changes over time. [Figure: a metric plotted over time, showing phase 1, phase 2, and phase 3.]

Problem #4: how important are cache impacts? [Figure: sequential vs. parallel execution of loads through the L1 caches — a speculative thread's load of *p can prefetch data so that a later Load 0x22 hits instead of misses, but can also cause extra misses and cache pollution.]

Impact on Cache Performance. [Figure: a major loop from benchmark art (scanner.c:594); repeated loads of *p with p unchanged are prefetched (+), with measured effects of 60% and 2.5X.] Cache impact is difficult to model statically.

Our Proposal: a performance tuning framework spanning compiler analysis and runtime analysis. [Figure: the compiler performs region selection, speculative code optimization, and parallel code generation from source code and profiling info, emitting machine code with compiler annotations; at runtime, dynamic execution feeds runtime information into performance evaluation, a tuning policy, and a performance profile table.] This provides accurate timing and cache impacts, and adaptation to inputs and phase changes.

Parallelization Decisions at Runtime. [Figure: at each thread-spawning point the runtime decides to parallelize (spawn) or serialize; at the end of a parallel execution it evaluates performance and updates the performance profile table.] The new execution model facilitates dynamic tuning.
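One way to sketch this execution model in code. The table layout and the parallelize/serialize policy are illustrative assumptions, not the paper's exact mechanism:

```python
class PerformanceProfileTable:
    """Tracks, per region, the cycles saved by the last parallel run."""

    def __init__(self):
        self.saved = {}  # region id -> predicted sequential minus parallel cycles

    def decide(self, region):
        """At a thread-spawning point: parallelize unless a past evaluation
        showed the region losing cycles; unknown regions are tried in
        parallel first."""
        return self.saved.get(region, 0) >= 0

    def evaluate(self, region, predicted_seq_cycles, parallel_cycles):
        """At the end of a parallel execution: record the measured impact."""
        self.saved[region] = predicted_seq_cycles - parallel_cycles
```

A region that hurts performance is serialized on later visits; a region that helps keeps being spawned, which is the adaptation to inputs and phases that a static decision cannot provide.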

Evaluating Performance Impact (runtime analysis). A simple metric: the cost of speculation failures. A comprehensive evaluation: sequential performance prediction.

Sequential Performance Prediction. [Figure: a programmable cycle classifier, fed by information from the ROB and hardware counters, breaks the speculative parallel execution into Busy, iFetch, ExeStall, Cache, and TLS-overhead cycles; the predicted sequential execution keeps the first four categories, and the difference between the two is the performance impact.]
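A simplified sketch of the prediction arithmetic. The per-core breakdown dictionaries are an assumed representation; the real mechanism is a programmable cycle classifier in hardware:

```python
USEFUL = ('busy', 'ifetch', 'exestall', 'cache')

def predict_sequential(per_core):
    """Sequential execution would perform every core's useful cycles
    back to back, without the TLS-overhead cycles."""
    return sum(sum(b[k] for k in USEFUL) for b in per_core)

def predicted_speedup(per_core):
    """Wall-clock parallel time is set by the slowest core; dividing the
    predicted sequential time by it estimates the TLS benefit."""
    parallel = max(sum(b.values()) for b in per_core)
    return predict_sequential(per_core) / parallel
```

This is the comparison the tuning policy needs: it never has to actually run the sequential version to know whether speculation paid off.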

Cache Behavior: Scenario #1. [Figure: on thread T0, a load of p (unmodified) misses during a speculative run that is later squashed, then hits on re-execution; sequential execution would have missed.] Rule: count the miss taken in the squashed execution.

Cache Behavior: Scenario #2. [Figure: another thread's store to p invalidates T0's cached line, so the TLS execution misses twice on loads of p, while sequential execution would miss only once.] Rule: do not count a miss whose tag matched a line in invalid status.
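The counting rules from Scenarios #1 and #2 can be sketched together. The event representation is an assumption for illustration; the real hardware inspects cache tags and line states directly:

```python
def predicted_sequential_misses(miss_events):
    """miss_events: one dict per cache miss observed during speculative
    execution, with flags 'in_squashed_run' and 'tag_matched_invalid'.

    Scenario #1: a miss in a squashed run still counts, since sequential
    execution would also have missed.
    Scenario #2: a miss whose tag matched a line in invalid status is a
    coherence miss caused by another speculative thread's store; it would
    not occur sequentially, so it is skipped."""
    return sum(1 for e in miss_events if not e['tag_matched_invalid'])
```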

Tuning the Performance. The tuning policy searches the space of parallelization decisions, either exhaustively or with a prioritized search.

How to Prioritize the Search. Two design axes: search order (Inside-Out, Compiler suggestion, Outside-In) and performance summary metric (speedup-or-not vs. improvement in cycles). [Table: Inside-Out + speedup-or-not is simple but suboptimal; Inside-Out + improvement in cycles is quantitative; Compiler suggestion + improvement in cycles is quantitative with a static hint; Outside-In is more aggressive.]
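The two search orders can be sketched over a nested-loop region tree (the dictionary representation of regions is an illustrative assumption):

```python
def inside_out(region):
    """Visit loops innermost-first: cheap, but can settle for an inner
    loop when speculating on an enclosing loop would pay off more."""
    for child in region.get('children', []):
        yield from inside_out(child)
    yield region['name']

def outside_in(region):
    """Visit loops outermost-first: more aggressive, trying the largest
    speculative regions before falling back to inner loops."""
    yield region['name']
    for child in region.get('children', []):
        yield from outside_in(child)
```

A compiler suggestion would start the search at whichever loop static analysis ranked best, steering the runtime away from the suboptimal points either pure order can land on.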

Evaluation Infrastructure. Benchmarks: SPEC2000 programs written in C, compiled with -O3 optimization. Underlying architecture: a 4-core chip multiprocessor (CMP) with speculation supported by cache coherence. Simulator: superscalar cores with a detailed memory model; simulates communication latency and models bandwidth and contention; detailed, cycle-accurate simulation. [Figure: four processors, each with a private cache, connected by an interconnect.]

Comparing Tuning Policies. [Figure: speedups over sequential execution of 1.23x and 1.37x for different tuning policies, with parallel code overhead broken out.]

Profile-Based vs. Dynamic. Dynamic tuning outperformed the static, profile-based approach by 10% (1.37x vs. 1.25x), and can potentially go up to 15% (1.46x vs. 1.27x).

Related Works: Region Selection. Statically: compiler heuristics [Vijaykumar Micro98]; balanced min-cut [Johnson PLDI04]; simple profiling pass (POSH) [Liu PPoPP06]; extensive profiling [Wang LCPC05]; profile-based search [Johnson PPoPP07] — these lack adaptability. Dynamically: runtime loop detection [Tubella HPCA98] [Marcuello ICS98]; saving power from useless re-spawning [Renau ICS05] — these lack high-level information. This work: dynamic performance tuning [Luo ISCA09].

Conclusions. Dynamic performance tuning can improve TLS efficiency by deciding how to extract speculative threads at runtime: quantitatively summarize the performance profile; use compiler hints to avoid suboptimal decisions; let the compiler and hardware work together. Configurable hardware performance monitors (HPMs) enable efficient dynamic optimizations. Results: 37% speedup over sequential execution and 10% over profile-based decisions.