Evaluating Non-deterministic Multi-threaded Commercial Workloads Computer Sciences Department University of Wisconsin—Madison

Slides:



Advertisements
Similar presentations
Full-System Timing-First Simulation Carl J. Mauer Mark D. Hill and David A. Wood Computer Sciences Department University of Wisconsin—Madison.
Advertisements

Managing Wire Delay in Large CMP Caches Bradford M. Beckmann David A. Wood Multifacet Project University of Wisconsin-Madison MICRO /8/04.
An Evaluation of OpenMP on Current and Emerging Multithreaded/Multicore Processors Matthew Curtis-Maury, Xiaoning Ding, Christos D. Antonopoulos, and Dimitrios.
DBMSs on a Modern Processor: Where Does Time Go? Anastassia Ailamaki Joint work with David DeWitt, Mark Hill, and David Wood at the University of Wisconsin-Madison.
PERFORMANCE ANALYSIS OF MULTIPLE THREADS/CORES USING THE ULTRASPARC T1 (NIAGARA) Unique Chips and Systems (UCAS-4) Dimitris Kaseridis & Lizy K. John The.
(C) 2001 Daniel Sorin Correctly Implementing Value Prediction in Microprocessors that Support Multithreading or Multiprocessing Milo M.K. Martin, Daniel.
Variability in Architectural Simulations of Multi-threaded Workloads Alaa R. Alameldeen and David A. Wood University of Wisconsin-Madison
Lincoln University Canterbury New Zealand Evaluating the Parallel Performance of a Heterogeneous System Elizabeth Post Hendrik Goosen formerly of Department.
(C) 2003 Mulitfacet ProjectUniversity of Wisconsin-Madison Simulating a $2M Commercial Server on a $2K PC Alaa Alameldeen, Milo Martin, Carl Mauer, Kevin.
(C) 2002 Milo MartinHPCA, Feb Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet.
(C) 2003 Milo Martin Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper,
Adaptive Cache Compression for High-Performance Processors Alaa R. Alameldeen and David A.Wood Computer Sciences Department, University of Wisconsin- Madison.
(C) 2004 Daniel SorinDuke Architecture Using Speculation to Simplify Multiprocessor Design Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David.
Skewed Compressed Cache
SECTION 1: INTRODUCTION TO SIMICS Scott Beamer CS152 - Spring 2009.
By- Jaideep Moses, Ravi Iyer , Ramesh Illikkal and
February 11, 2003Ninth International Symposium on High Performance Computer Architecture Memory System Behavior of Java-Based Middleware Martin Karlsson,
Adaptive Cache Compression for High-Performance Processors Alaa Alameldeen and David Wood University of Wisconsin-Madison Wisconsin Multifacet Project.
CPU Performance Assessment As-Bahiya Abu-Samra *Moore’s Law *Clock Speed *Instruction Execution Rate - MIPS - MFLOPS *SPEC Speed Metric *Amdahl’s.
Interactions Between Compression and Prefetching in Chip Multiprocessors Alaa R. Alameldeen* David A. Wood Intel CorporationUniversity of Wisconsin-Madison.
Presented by Deepak Srinivasan Alaa Aladmeldeen, Milo Martin, Carl Mauer, Kevin Moore, Min Xu, Daniel Sorin, Mark Hill and David Wood Computer Sciences.
DBMSs On A Modern Processor: Where Does Time Go? by A. Ailamaki, D.J. DeWitt, M.D. Hill, and D. Wood University of Wisconsin-Madison Computer Science Dept.
A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay Min Xu, Rastislav Bodik, Mark D. Hill
(C) 2003 Mulitfacet ProjectUniversity of Wisconsin-Madison Evaluating a $2M Commercial Server on a $2K PC and Related Challenges Mark D. Hill Multifacet.
Korea Univ B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors 컴퓨터 · 전파통신공학과 최병준 1 Computer Engineering and Systems Group.
University of Michigan Electrical Engineering and Computer Science 1 Dynamic Acceleration of Multithreaded Program Critical Paths in Near-Threshold Systems.
Computer Science Department University of Pittsburgh 1 Evaluating a DVS Scheme for Real-Time Embedded Systems Ruibin Xu, Daniel Mossé and Rami Melhem.
Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.
1 Wenguang WangRichard B. Bunt Department of Computer Science University of Saskatchewan November 14, 2000 Simulating DB2 Buffer Pool Management.
(1) Scheduling for Multithreaded Chip Multiprocessors (Multithreaded CMPs)
CDA 3101 Fall 2013 Introduction to Computer Organization Computer Performance 28 August 2013.
Revisiting Hardware-Assisted Page Walks for Virtualized Systems
SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,
SafetyNet Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,
Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.
Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.
Simulating a $2M Commercial Server on a $2K PC Alaa R. Alameldeen, Milo M.K. Martin, Carl J. Mauer, Kevin E. Moore, Min Xu, Daniel J. Sorin, Mark D. Hill.
Analytic Evaluation of Shared-Memory Systems with ILP Processors Daniel J. Sorin, Vijay S. Pai, Sarita V. Adve, Mary K. Vernon, and David A. Wood Presented.
ICOM 6115: Computer Systems Performance Measurement and Evaluation August 11, 2006.
Autonomic scheduling of tasks from data parallel patterns to CPU/GPU core mixes Published in: High Performance Computing and Simulation (HPCS), 2013 International.
(C) 2003 Daniel SorinDuke Architecture Dynamic Verification of End-to-End Multiprocessor Invariants Daniel J. Sorin 1, Mark D. Hill 2, David A. Wood 2.
1. 2 Table 4.1 Key characteristics of six passenger aircraft: all figures are approximate; some relate to a specific model/configuration of the aircraft.
From lecture slides for Computer Organization and Architecture: Designing for Performance, Eighth Edition, Prentice Hall, 2010 CS 211: Computer Architecture.
Virtual Hierarchies to Support Server Consolidation Mike Marty Mark Hill University of Wisconsin-Madison ISCA 2007.
CISC Machine Learning for Solving Systems Problems John Cavazos Dept of Computer & Information Sciences University of Delaware
MIAO ZHOU, YU DU, BRUCE CHILDERS, RAMI MELHEM, DANIEL MOSSÉ UNIVERSITY OF PITTSBURGH Writeback-Aware Bandwidth Partitioning for Multi-core Systems with.
CMP Design Choices Finding Parameters that Impact CMP Performance Sam Koblenski and Peter McClone.
CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara) Dimitris Kaseridis and Lizy K. John The University of Texas at Austin Laboratory for Computer.
1 November 11, 2015 A Massively Parallel, Hybrid Dataflow/von Neumann Architecture Yoav Etsion November 11, 2015.
Disco: Running Commodity Operating Systems on Scalable Multiprocessors Presented by: Pierre LaBorde, Jordan Deveroux, Imran Ali, Yazen Ghannam, Tzu-Wei.
An Architectural Evaluation of Java TPC-W Harold “Trey” Cain, Ravi Rajwar, Morris Marden, Mikko Lipasti University of Wisconsin-Madison
CS757 Lock Behaviour Characterization of Commercial Workloads Jichuan Chang Xidong Wang.
ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Evaluation – Metrics, Simulation, and Workloads Copyright 2004 Daniel.
Dynamic Verification of Sequential Consistency Albert Meixner Daniel J. Sorin Dept. of Computer Dept. of Electrical and Science Computer Engineering Duke.
Introduction Goal: connecting multiple computers to get higher performance – Multiprocessors – Scalability, availability, power efficiency Job-level (process-level)
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association SYSTEM ARCHITECTURE GROUP DEPARTMENT OF COMPUTER.
Computer Sciences Department University of Wisconsin-Madison
Lecture 2: Performance Evaluation
Analytic Evaluation of Shared-Memory Systems with ILP Processors
Timothy Zhu and Huapeng Zhou
Unit OS9: Real-Time and Embedded Systems
Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper, Dan Sorin§, Mark.
Section 1: Introduction to Simics
Predictive Performance
Presented by: Eric Carty-Fickes
Simulating a $2M Commercial Server on a $2K PC
Exploring Core Designs for Chip Multiprocessors
Dynamic Verification of Sequential Consistency
CMP Design Choices Finding Parameters that Impact CMP Performance
Performance Measurement and Analysis
Presentation transcript:

Evaluating Non-deterministic Multi-threaded Commercial Workloads Computer Sciences Department University of Wisconsin—Madison Alaa R. Alameldeen, Carl J. Mauer, Min Xu, Pacia J. Harper, Milo M.K. Martin, Daniel J. Sorin, Mark D. Hill, and David A. Wood

Evaluating Non-deterministic Multi-threaded Commercial Workloads Introduction Short measurements on real machines require multiple runs –Uncontrolled factors –Want to separate random from systematic effects Simulation measurements use a single run –Simulators are deterministic –No uncontrolled factors Wrong! –Multi-threaded workloads can be unstable –Small changes in timing cause large changes in results

Evaluating Non-deterministic Multi-threaded Commercial Workloads Introduction Instability may affect conclusions –Comparing Direct Mapped to Set-Associative Caches DM Mean SA Mean Direct Mapped 4-Way SA

Evaluating Non-deterministic Multi-threaded Commercial Workloads Overview Introduction Methods Workloads Result I: Process scheduling Result II: Workload Variability Conclusion Future Work

Evaluating Non-deterministic Multi-threaded Commercial Workloads Methods Real machine –Setup, tune, validate on a 16-processor Sun E6000 –8 – 16 X speed-up for each application Simulator –Simics, Full-system simulator running Solaris 8 –Ruby, Memory timing simulator Experiments –Start from a warm checkpoint –Measure throughput (transactions completed / time)

Evaluating Non-deterministic Multi-threaded Commercial Workloads Workloads OLTP –TPC-C-like benchmark using a 1 GB database SPECjbb –Server-side Java-based middleware workload Apache –Static web serving: Apache driven by SURGE Slashcode –Dynamic web serving message board, using code and data similar to slashdot.org

Evaluating Non-deterministic Multi-threaded Commercial Workloads Why unstable? Different paths are executed Hypotheses –Process scheduling –Order of lock acquisition

Evaluating Non-deterministic Multi-threaded Commercial Workloads Result I: Process scheduling Deterministic simulation of OLTP on uniprocessor Artificially injected misses to I-cache –Run1: 0, 100, 200 … –Run2: 50, 150, 250 … Measured equivalent to 3-5 seconds in real system Run time difference of 9% Is process scheduling a factor?

Evaluating Non-deterministic Multi-threaded Commercial Workloads Result I: Process scheduling Traced process groups scheduled on CPU

Evaluating Non-deterministic Multi-threaded Commercial Workloads Methods, part II Pseudo-random perturbations –Run multiple runs from same checkpoint –All runs have same average memory latency –Misses to main memory perturbed by 0-4% Calculate mean, standard deviation

Evaluating Non-deterministic Multi-threaded Commercial Workloads Result II: Variability Variability –16-processor system running 8,000 OLTP transactions –20 runs from same checkpoint –12 – 20 seconds in real system 1 / Throughput (cycles per transaction)

Evaluating Non-deterministic Multi-threaded Commercial Workloads Result II: Variability Miss rate (misses per transaction)

Evaluating Non-deterministic Multi-threaded Commercial Workloads Result II: Variability Instructions executed (per transaction) Hypothesis –“Spin-waiting” hypothesis –Lock-acquisition, idle loop, device activity

Evaluating Non-deterministic Multi-threaded Commercial Workloads Result II: Variability

Evaluating Non-deterministic Multi-threaded Commercial Workloads Conclusion Multi-threaded commercial workloads can be unstable even on uniprocessors Instability can affect conclusions in short runs Pseudo-random methodology can help Even within one workload variations exist

Evaluating Non-deterministic Multi-threaded Commercial Workloads Future Work Root cause(s)? Methodology improvements Quantify instability further

Evaluating Non-deterministic Multi-threaded Commercial Workloads Questions