Paul D. Bryan, Jason A. Poovey, Jesse G. Beu, Thomas M. Conte Georgia Institute of Technology

• Introduction
• Multi-threaded Application Simulation Challenges
• Circular Dependence Dilemma
• Thread Skew
• Barrier Interval Simulation
• Results
• Conclusion

• Simulation is vital for computer architecture design and research
  ▪ reducing its cost is important:
    - decreases the iterative design cycle
    - more design alternatives can be considered
    - results in better architectural decisions
• Simulation is SLOW
  ▪ orders of magnitude slower than native execution
  ▪ seconds of native execution can take weeks or months to simulate
• Multi-core designs have exacerbated simulation intractability

CCycle accurate simulation run for all or a portion of a representative workload FFast-forward execution DDetailed execution SSingle-threaded acceleration techniques SSampled Simulation SSimPoints (Guided Simulation) RReduced Input Sets

• Progress of threads depends upon:
  ▪ implicit interactions: shared resources (e.g., a shared LLC)
  ▪ explicit interactions: synchronization and critical-section thread orderings, which in turn depend upon:
    - proximity to the home node
    - network contention
    - coherence state
• The result is a circular dependence (diagram: System Performance ⇄ Thread Performance)

• Thread skew measures a thread's divergence from its true performance:
  ▪ measured as the difference, in number of instructions, of an individual thread's progress, sampled at a fixed global instruction count
• Positive thread skew → the thread is leading the true execution
• Negative thread skew → the thread is lagging the true execution
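
A minimal sketch of how thread skew could be computed (the helper below and its sample data are hypothetical; the slides only define the metric). Per-thread fetch counts from the accelerated run are compared against the full run at a matched global instruction count:

```python
def thread_skew(full_counts, fast_counts):
    """Per-thread skew at one matched global instruction count.

    full_counts[i] / fast_counts[i]: instructions fetched by thread i in
    the full (reference) run and the accelerated run, sampled when the
    aggregate fetch count of the two runs is identical. Positive skew
    means the thread leads the true execution; negative means it lags.
    """
    assert sum(full_counts) == sum(fast_counts), \
        "samples must be taken at the same global fetch count"
    skews = [fast - full for full, fast in zip(full_counts, fast_counts)]
    assert sum(skews) == 0  # matched totals force the skews to sum to zero
    return skews

# Example: at a global count of 3000 instructions, thread 0 leads the
# true execution by 50 instructions while thread 2 lags by 50.
print(thread_skew([1000, 1000, 1000], [1050, 1000, 950]))  # [50, 0, -50]
```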

[Figure: Barriers]

• Introduction
• Multi-threaded Application Simulation Challenges
• Circular Dependence Dilemma
• Thread Skew
• Barrier Interval Simulation
• Results
• Conclusion

• Break the benchmark into "barrier intervals"
• Execute each interval as a separate simulation
• Execute all intervals in parallel

• Once per workload:
  ▪ functional fast-forward to find barriers
• BIS simulation (see the sketch below):
  ▪ each interval simulation skips to its barrier release event
  ▪ detailed execution of only that interval
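
A high-level sketch of this two-phase flow, assuming hypothetical driver functions (the commented calls stand in for the real cycle-accurate simulator, which the slides do not expose as an API):

```python
from concurrent.futures import ProcessPoolExecutor

def simulate_interval(job):
    """One independent simulation job: functionally skip to the barrier
    release event, warm up, then execute only the interval in detail."""
    start, end, warmup = job
    # functional_skip_to(start - warmup)       # no timing model while skipping
    # detailed_warmup(start - warmup, start)   # warm caches/coherence/network
    # cycles = detailed_run(start, end)        # cycle-accurate interval
    cycles = end - start                       # stand-in result for this sketch
    return cycles

def run_bis(barrier_points, warmup_insns=1_000_000):
    """barrier_points: global instruction counts of the barrier release
    events, found once per workload by functional fast-forwarding."""
    jobs = [(s, e, warmup_insns)
            for s, e in zip(barrier_points, barrier_points[1:])]
    with ProcessPoolExecutor() as pool:        # all intervals run in parallel
        per_interval_cycles = pool.map(simulate_interval, jobs)
    # Whole-program simulated time is the sum over the intervals.
    return sum(per_interval_cycles)

if __name__ == "__main__":
    print(run_bis([0, 40_000_000, 55_000_000, 120_000_000]))
```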

• Cold-start effects:
  ▪ warm up for 10k, 100k, 1M, or 10M instructions prior to the barrier release event
  ▪ warms up cache contents, coherence state, network state, etc.

• Introduction
• Multi-threaded Application Simulation Challenges
• Circular Dependence Dilemma
• Thread Skew
• Barrier Interval Simulation
• Results
• Conclusion

• Cycle-accurate manycore simulation (details in paper)

• Subset of SPLASH-2 evaluated
• Detailed warm-up lengths: none, 10k, 100k, 1M, 10M instructions
• Evaluated:
  ▪ simulated execution time error (percentage difference)
  ▪ wall-clock speedup
• 181,000 simulations were run to calculate simulated speedup (wall-clock speedup)

• Metric of interest is speedup
  ▪ measured via execution time
  ▪ since the whole program is executed, cycle count = execution time
• Evaluation:
  ▪ error rates
  ▪ simulation speedup/efficiency
  ▪ warm-up sizing
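
In symbols, my reading of the two metrics (error as a percentage difference of simulated execution time, speedup in wall-clock terms):

```latex
\text{Error} = \frac{\lvert C_{\text{BIS}} - C_{\text{full}} \rvert}{C_{\text{full}}} \times 100\%
\qquad
\text{Speedup} = \frac{T^{\text{wall}}_{\text{full}}}{T^{\text{wall}}_{\text{BIS}}}
```

where C is the simulated whole-program cycle count and T^wall is the simulator's wall-clock running time.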

• Max speedup depends upon two factors:
  ▪ homogeneity of barrier interval sizes
  ▪ the number of barrier intervals
• Interval heterogeneity is measured through the coefficient of variation (CV):
  ▪ lower CV → more homogeneous interval sizes

• Relative Efficiency = max speedup / # barriers
• Lower CV:
  ▪ → higher relative efficiency
  ▪ → higher speedup
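
Under the idealization the slides assume here (unlimited contexts, warm-up overhead ignored), the parallel run finishes when its longest interval finishes, which gives:

```latex
S_{\max} = \frac{\sum_{i=1}^{N} t_i}{\max_i t_i}
\qquad
\text{Relative Efficiency} = \frac{S_{\max}}{N}
\qquad
\text{CV} = \frac{\sigma}{\mu}
```

where the t_i are the N barrier-interval lengths and σ, μ are their standard deviation and mean. With perfectly homogeneous intervals (CV = 0), S_max = N and relative efficiency reaches its maximum of 1.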

• Increasing warm-up decreases wall-clock speedup:
  ▪ more duplicate work from overlapping interval streams
• Want "just enough" warm-up to provide a good trade-off between speed and accuracy
  ▪ recommendation: 1M instructions of pre-interval warm-up

• Previous experiments assumed infinite contexts to calculate speedup:
  ▪ fine for workloads with a small number of barriers
  ▪ unrealistic for workloads with high barrier counts
• What is the speedup if a limited number of machine contexts is assumed?
  ▪ a greedy algorithm was used to schedule intervals (see the sketch below)
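
The slides do not spell out the greedy policy, so as an illustrative assumption, here is one standard choice: longest interval first onto the currently least-loaded context (LPT scheduling). The interval lengths are hypothetical:

```python
import heapq

def greedy_makespan(interval_lengths, num_contexts):
    """Assign each interval to the least-loaded context, longest first.
    Returns the makespan: the wall-clock time of the parallel BIS run
    when only num_contexts simulations can execute at once."""
    loads = [0.0] * num_contexts
    heapq.heapify(loads)                     # min-heap of per-context load
    for length in sorted(interval_lengths, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + length)
    return max(loads)

lengths = [9, 7, 6, 5, 4, 2]                 # hypothetical interval sizes
serial = sum(lengths)                        # one-context (serial) time
for k in (2, 4, 16):
    print(f"{k} contexts -> {serial / greedy_makespan(lengths, k):.2f}x speedup")
```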

• Sampling barrier intervals
  ▪ useful for throughput metrics such as cache miss rates
• More workloads
  ▪ preliminary results are promising on big-data applications such as Graph500
• Convergence point detection for non-barrier applications

• Barrier Interval Simulation is effective at speeding up simulation for a class of multi-threaded applications:
  ▪ 0.09% average error and 8.32x speedup with 1M warm-up
• Certain applications (e.g., ocean) can benefit significantly:
  ▪ speedup of 596x
• Even assuming limited contexts, the attained speedups are significant:
  ▪ with 16 contexts → 3x speedup

• Thank You!
• Questions?

Figure - Thread skew is calculated using aggregate system and per-thread fetch counts. Simulations with functional fast-forwarding record the fetch counts of all threads at the beginning of a simulation; full simulations use these counts to determine when their own fetch counts are recorded. Since the total system fetch counts are identical in the fast-forwarded and full simulations, the thread skews at every measurement must sum to zero, although individual threads may lead or lag their counterparts in the full simulation.