MoBS-5 :: June 21, 2009 FIESTA: A Sample-Balanced Multi-Program Workload Methodology Andrew Hilton, Neeraj Eswaran, Amir Roth University of Pennsylvania.



[ 2 ] Overview
- Multi-program workloads
  - Samples from independent programs
  - Executed concurrently to evaluate SMT, CMP, scheduling, etc.
- How to choose samples?
  - Fixed-workload: choose samples first
    – Load imbalance problem
  - Variable-workload: multi-program execution defines samples
    – Other (more serious) problems
- Our work
  - Distinguish sample imbalance (bad) from schedule imbalance (ok)
  - Propose FIESTA: sample-balanced fixed-workload methodology

[ 3 ] Traditional Fixed-Workload
- Single-program workload x N: X insns (e.g., 5M/sample) from each program [ElMoursi03, Eyerman07]
- Workload composition is fixed across experiments
  + Direct comparisons between experiments
  – Load imbalance: time spent executing only the slowest programs
[Figure: Experiments 1 and 2, each executing A: 5M and B: 5M insns over time]

[ 4 ] Load Imbalance
- If significant:
  – Not representative of real, continuous multi-program execution
  – Deflates (multi-program / single-program) speedup
- SMT-speedup = (T_1 + T_2) / T_{1+2}
  - Example: T_1 = 1, T_2 = 2, T_{1+2} = 2 ⇒ SMT-speedup = 50%
- Not really used (anymore) because of this
[Figure: A: 5M and B: 5M insns over time]
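The slide's metric is easy to reproduce. The snippet below is my own sketch (the slide gives only the ratio, so reading the 50% example as ratio-minus-one is my interpretation); the second call shows how an imbalanced sample pair deflates the number:

```python
def smt_speedup(t_standalone, t_multi):
    """SMT-speedup = (T_1 + T_2) / T_{1+2}, reported here as a
    fractional speedup (the ratio minus one)."""
    return sum(t_standalone) / t_multi - 1.0

# Slide's example: T_1 = 1, T_2 = 2, T_{1+2} = 2 -> 50% speedup
print(smt_speedup([1.0, 2.0], 2.0))  # 0.5

# Time-balanced samples with perfect overlap and no contention:
# the full 100% speedup emerges, so the 50% above is deflation
# caused purely by the imbalanced samples
print(smt_speedup([2.0, 2.0], 2.0))  # 1.0
```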

[ 5 ] 2-way SMT
- Fixed: 250M insns from each program
  - 13% SMT speedup
  - 51% load imbalance

[ 6 ] Variable-Workload
- Multi-program execution defines the workload
  - Execute all programs until some condition (e.g., total insns = 10M)
  - Normalize to the single-program region defined by this execution
  - SMT-speedup metric used for this normalization
- Eliminates load imbalance (by construction)
[Figure: A: 3M and B: 7M insns over time]

[ 7 ] Variable-Workload Variations
- Many variations of "execute all programs until …":
  - X total instructions committed [Kumar03, Luo01, Tune04]
  - X instructions committed by one program [Cazorla04]
  - X instructions committed by every program [Raasch03, Yeh05]
  - X execution cycles have elapsed [Snavely00]
  - All programs "fairly represented" [Vera07, Ramirez07]
- All basically the same, with the same fundamental problems
- "Total of X instructions" used in this talk/paper

[ 8 ] 2-way SMT
- Fixed: 250M insns from each program
  - 13% SMT speedup, 51% load imbalance
- Variable: 500M insns total
  - 0% imbalance (by construction), 35% SMT speedup
- What is the "real" speedup? 13%? 35%? Something else?

[ 9 ] Variable-Workload: Danger!
- Results from different experiments are not directly comparable
  - Different workload in each
- Skews workload to over-estimate throughput
  - Over-samples fast programs
- Skews workload to over-estimate speedup
  - Over-samples programs that slow down less due to contention
- "Fairness" attempts to account for this [Gabr06]
  - How to synthesize SMT-speedup and fairness into a "real speedup"?
[Figure: Experiment 1 runs A: 3M, B: 7M; Experiment 2 runs A: 4M, B: 6M]
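The over-sampling effect falls out of a little arithmetic. This is my own toy model (the function name and IPC values are made up), assuming each program commits at a steady per-thread IPC during the concurrent run:

```python
def variable_workload_shares(ipcs, total_insns):
    """Under a 'total of X instructions' variable workload, each
    program's share of the sample is proportional to its IPC in
    the concurrent run, so fast programs are over-sampled."""
    total_ipc = sum(ipcs)
    return [total_insns * ipc / total_ipc for ipc in ipcs]

# Toy values: A runs at 1.5 IPC under SMT, memory-bound B at 0.5
print(variable_workload_shares([1.5, 0.5], 10_000_000))
# -> [7500000.0, 2500000.0]: A fills 75% of the sample while the
#    slow, contention-sensitive B is under-represented
```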

[ 10 ] Fixing Fixed Workload?
- Many problems with variable-workload methodologies
  - Incomparable experiments
  - Over-estimation of throughputs and speedups
  - "Tells you what you want to hear"
- Can we revive fixed-workload?
  - Load imbalance is the only significant problem
  - Very difficult to eliminate completely
  - But complete balance may not even be what we want…

[ 11 ] Deconstructing Load Imbalance
- Fixed-workload runs experience two forms of imbalance
- Sample imbalance: different standalone runtimes
  - Artifact of finite experiments
  - Should be eliminated
  - Easy: choose samples with the same standalone runtimes
- Schedule imbalance: asymmetric ("unfair") contention
  - Characteristic of concurrent execution
  - Should be preserved and measured

[ 12 ] FIESTA
- FIESTA: Fixed-Instruction with Equal STAndalone runtimes
- Run single programs standalone for C cycles, record insn counts
- Build fixed workloads from these time-balanced samples
  + Eliminates sample imbalance
  + Remaining imbalance is schedule imbalance
- Programs represented according to standalone performance
  - Corresponds to "fair" continuous multi-programming
[Figure: A: 5M and B: 7M insns over time; the residual runtime gap under concurrent execution is schedule imbalance]
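FIESTA's sample-selection step can be sketched in a few lines; `standalone_ipc` and the IPC values below are hypothetical stand-ins for real standalone simulation runs:

```python
C_CYCLES = 250_000_000  # standalone cycle budget per program

def fiesta_samples(progs, standalone_ipc, c_cycles=C_CYCLES):
    """Run each program alone for c_cycles and record how many
    instructions it commits; those per-program instruction counts
    define the fixed, time-balanced workload."""
    return {p: round(standalone_ipc(p) * c_cycles) for p in progs}

# Hypothetical standalone IPCs: mcf memory-bound, mesa ILP-rich
ipc = {"mcf": 0.25, "mesa": 1.75}.get
workload = fiesta_samples(["mcf", "mesa"], ipc)
# Both samples take the same 250M standalone cycles, so mcf
# contributes far fewer instructions than mesa -- by design
```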

[ 13 ] 2-way SMT Reprise
- Fixed: 250M insns from each program
  - 13% speedup, 51% imbalance
- Variable: 500M insns total
  - 35% speedup, 0% imbalance
- FIESTA: 250M cycles from each program
  - 28% speedup, 21% imbalance
⇒ Fixed has 30% sample imbalance

[ 14 ] The Rest of Our Methodology
- Processor configurations
  - 4-way superscalar, dynamically scheduled, 17-stage pipeline
  - 64KB 4-way I/D$, 2MB 8-way L2, eight 8-entry stream buffers
  - 400-cycle main memory, 16 outstanding misses
  - Up to 4 threads, ICOUNT, issue queue & stream buffers "capped"
- Eight SPEC2K benchmarks
  - ILP (mesa, vortex), branch (gcc, perl)
  - Memory latency (equake, mcf), memory bandwidth (art, swim)
- Workloads
  - 50 samples per benchmark, periodic starting points for samples
  - 28 2-thread workloads, 70 4-thread workloads

[ 15 ] Two Multi-Program Studies
- Same-architecture study: ICOUNT vs. round-robin
  - FIESTA is perfect for this!
  - All experiments share the single-program baseline
  - The FIESTA workload is sample-balanced (by construction) in all runs
- Cross-architecture study: SMT vs. RaT
  - Different experiments have different single-program baselines
  - No single FIESTA workload is sample-balanced in all runs
  - FIESTA not perfect… but much better than anything else

[ 16 ] ICOUNT vs. Round-Robin
- SMT-speedup: Variable uniformly higher
- ICOUNT advantage: Variable and FIESTA agree, ICOUNT by 7%
- Workload composition: the danger of Variable, workloads differ by 10%

[ 17 ] Cross-Architecture Studies
- Example: SMT vs. RaT (Runahead Threads) [Ramirez08]
  - SMT baseline is ROB, RaT baseline is Runahead (RA) [Mutlu03]
  - ROB workload is sample-unbalanced on RaT, and vice versa
- Well…
  - Cross-architecture sample imbalance is not as bad as you might think
  - FIESTA can be used to provide a "tight" bound in these cases

[ 18 ] Cross-Architecture Sample Imbalance
- Sample imbalance:
  - Fixed: 30%
  - FIESTA: 0%
  - FIESTA-RA: 2%
  - FIESTA-2K-D$: 1%
  - FIESTA-2wide: 9% (despite 30% lower IPC)
- Surprisingly small
  - A single change typically affects all programs in the same direction
  - Both programs accelerate by 2X? Imbalance is still 0%
  - Architecture changes are typically smaller in magnitude (1.1–3X)…
  - …than a priori program performance differences (2–15X)
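One plausible way to read "sample imbalance" is the relative gap between the samples' standalone runtimes (the paper's exact formula may differ); under that reading, the 2X observation falls out immediately:

```python
def sample_imbalance(standalone_runtimes):
    """Relative gap between the shortest and longest standalone
    runtimes in a workload: 1 - min/max (one plausible reading of
    the slide's metric, not necessarily the paper's formula)."""
    return 1.0 - min(standalone_runtimes) / max(standalone_runtimes)

balanced = [250e6, 250e6]           # FIESTA: equal standalone cycles
print(sample_imbalance(balanced))   # 0.0

# A new architecture accelerates both programs by 2X: the samples
# stay time-balanced, so the imbalance is still 0%
print(sample_imbalance([t / 2 for t in balanced]))  # 0.0
```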

[ 19 ] SMT vs. RaT
- First: RA is only 5% faster than ROB
  - No RA/SMT synergy, some overlap
- Variable: RaT by 11% (unlikely)
  - Over-samples RA-happy programs
- Fixed: RaT by 6% (maybe?)
  - RA fixes sample imbalance, exposing existing MT speedups
- FIESTA: 1–4% (confirms intuition)
- Upshot: any FIESTA workload is better than Fixed or Variable
  - Known direction of error from the architectural change
  - Use both FIESTA workloads: tight "range" of results

[ 20 ] Other Issues (Future Work)
- Representativeness of individual programs
  - Being time-based, FIESTA will over-sample fast regions
  - Potential solution: time-based SimPoint [Perelman03]
    - Find a representative sample that runs for C cycles
- Multi-threaded applications
  - Should work
  - FIESTA will ignore inter-thread imbalance and consider the entire program

[ 21 ] Conclusions
- Prevailing multi-program studies use variable workloads
  - Introduced to avoid the load-imbalance problems of fixed workloads
  – Have their own more subtle (and sinister) problems
    - Direct comparisons impossible (but made repeatedly anyway)
    - "Tells you what you want to hear"
  – "Fairness" can't account for this
- FIESTA: sample-balanced fixed multi-program workloads
  - Eliminates sample-imbalance artifacts (different standalone runtimes)
  - Preserves schedule-imbalance characteristics (unfair contention)
  + Direct comparisons using any metric, unskewed results
  + Time-based… but works for cross-architecture studies
- Spread the word!


[ 23 ] Contention: The Key Measure?
- Multi-program speedups: a proxy for contention
  - No contention? 100% speedup (2 programs)
- Fixed: sample imbalance reduces speedups without contention
- Variable: allows asymmetric contention to disappear without affecting speedup
- FIESTA:
  - Same architecture? Speedups correspond exactly to contention
  - Different architecture? Very small sample imbalance
  - Speedup/contention relation closer than anything else