
Symbiotic Space-Sharing: Mitigating Resource Contention on SMP Systems
Jonathan Weinberg and Allan Snavely
University of California, San Diego
San Diego Supercomputer Center

Resource Sharing on DataStar
[Diagram: an 8-processor node (P0–P7); each processor has a private L1 cache, while the L2 and L3 caches, main memory (MEM), I/O, and other resources (e.g. WAN bandwidth) are shared.]

Symbiotic Space-Sharing
• Symbiosis: from biology, meaning the graceful coexistence of organisms in close proximity
• Space-sharing: multiple jobs use a machine at the same time but do not share processors (vs. time-sharing)
• Symbiotic space-sharing: improving system throughput by executing applications in symbiotic combinations and configurations that alleviate pressure on shared resources

Can Symbiotic Space-Sharing Work?
• To what extent, and why, do jobs interfere with themselves and each other?
• If this interference exists, how effectively can it be reduced by alternative job mixes?
• How can parallel codes leverage this, and what is the net gain?
• How can a job scheduler create symbiotic schedules?

Resource Sharing: Effects
• GUPS (Giga-Updates-Per-Second): measures the time to perform a fixed number of updates to random locations in main memory (stresses main memory; see the C sketch below)
• STREAM: performs a long series of short, regularly-strided accesses through memory (stresses cache)
• I/O Bench: performs a series of sequential, backward, and random read and write tests (stresses I/O)
• EP (Embarrassingly Parallel): one of the NAS Parallel Benchmarks, a compute-bound code (stresses CPU)
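To make the GUPS description concrete, here is a minimal C sketch of a random-update kernel in that spirit. The table size, update count, and xorshift generator are our arbitrary choices for illustration, not the actual GUPS reference code:

```c
#include <stdint.h>
#include <stdlib.h>

/* Minimal sketch of a GUPS-style kernel: perform a fixed number of
 * read-modify-write updates at random locations in a table chosen
 * large enough that nearly every access misses in cache, so the loop
 * is bound by main-memory performance rather than the CPU. */
#define TABLE_ENTRIES (1UL << 26)          /* 64M x 8 B = 512 MB */
#define N_UPDATES     (TABLE_ENTRIES / 4)

int main(void) {
    uint64_t *table = calloc(TABLE_ENTRIES, sizeof *table);
    if (!table) return 1;

    uint64_t r = 0x2545F4914F6CDD1DULL;    /* xorshift seed */
    for (uint64_t i = 0; i < N_UPDATES; i++) {
        r ^= r << 13;                      /* cheap xorshift PRNG, so  */
        r ^= r >> 7;                       /* memory, not arithmetic,  */
        r ^= r << 17;                      /* dominates the loop       */
        table[r & (TABLE_ENTRIES - 1)] ^= r;   /* random update */
    }

    free(table);
    return 0;
}
```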

Resource Sharing: Effects
[Charts: measured slowdowns under space-sharing, one panel for memory and one for I/O.]

Resource Sharing: Conclusions
• To what extent and why do jobs interfere with themselves and each other?
  – 10–60% slowdown for memory
  – Super-linear slowdown for I/O

Can Symbiotic Space-Sharing Work?
• To what extent and why do jobs interfere with themselves and each other?
• If this interference exists, how effectively can it be reduced by alternative job mixes?
• Are these alternative job mixes feasible for parallel codes, and what is the net gain?
• How can a job scheduler create symbiotic schedules?

Mixing Jobs: Effects
[Chart: slowdowns observed when pairs of benchmarks are space-shared on a node.]

Mixing Jobs: Effects on NPB
• Using the NAS Parallel Benchmarks, we generalize the results
• EP and I/O Bench are symbiotic with all codes
• Some symbiosis exists even among the memory-intensive codes
  – CG with IS, BT with others
• Slowdown from running a code against itself is among the highest observed

Mixing Jobs: Conclusions
• Proper job mixes can mitigate slowdown from resource contention
• Applications tend to slow themselves more heavily than they slow others
• Some symbiosis may exist even within one application category (e.g. memory-intensive)

Can Symbiotic Space-Sharing Work?
• To what extent and why do jobs interfere with themselves and each other?
• If this interference exists, how effectively can it be reduced by alternative job mixes?
• How can parallel codes leverage this, and what is the net gain?
• How can a job scheduler create symbiotic schedules?

Parallel Jobs: Spreading Jobs
[Chart: speedup when 16-processor benchmarks are spread across 4 nodes instead of 2.]

Parallel Jobs: Mixing Spread Jobs
• Choose some seemingly symbiotic combinations
• Speedup is maintained even with no idle processors (see the layout sketch below)
• CG slows down when run with BTIO(S)…
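As a concrete picture of such a combination, consider two 16-processor jobs, each spread across four 8-way nodes: each job takes half of every node, so both jobs stay spread yet no processor idles. The small C illustration below is ours, not the paper's scheduler code:

```c
#include <stdio.h>

/* Illustrative processor layout for co-locating two spread
 * 16-processor jobs (A and B) on four 8-way nodes: each job gets
 * 4 processors per node, so both are spread across 4 nodes while
 * every processor stays busy. */
#define NODES          4
#define PROCS_PER_NODE 8

int main(void) {
    for (int n = 0; n < NODES; n++)
        for (int p = 0; p < PROCS_PER_NODE; p++)
            printf("node %d proc %d -> job %c\n",
                   n, p, p < PROCS_PER_NODE / 2 ? 'A' : 'B');
    return 0;
}
```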

Parallel Jobs: Conclusions
• Spreading applications across more nodes is beneficial (15% average speedup for the NAS benchmarks)
• Speedup can be maintained with symbiotic combinations while preserving full utilization

Can Symbiotic Space-Sharing Work?
• To what extent and why do jobs interfere with themselves and each other?
• If this interference exists, how effectively can it be reduced by alternative job mixes?
• How can parallel codes leverage this, and what is the net gain?
• How can a job scheduler create symbiotic schedules?

Symbiotic Scheduler: Prototype
• Symbiotic scheduler vs. DataStar's default scheduler
• 100 randomly selected 4p and 16p jobs from: {IOBench.4, EP.B.4, BT.B.4, MG.B.4, FT.B.4, DT.B.4, SP.B.4, LU.B.4, CG.B.4, IS.B.4, CG.C.16, IS.C.16, EP.C.16, BTIO FULL.C.16}
• Ratio of small jobs to large jobs: 4:3
• Ratio of memory-intensive to compute-bound to I/O-bound: 2:1:1
• Expected runtimes were supplied to allow backfilling
• The symbiotic scheduler used a simplistic heuristic: only schedule memory-intensive apps alongside compute-bound and I/O-bound apps (see the sketch below)
• DataStar = 5355 s, Symbiotic = 4451 s, Speedup = 1.2
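The heuristic reduces to a one-line pairing rule. The following is a hypothetical C rendering of it; the type and function names are ours, not the prototype's:

```c
#include <stdio.h>

/* Hypothetical sketch of the prototype's pairing rule: a
 * memory-intensive job may share a node with compute- or I/O-bound
 * jobs, but never with another memory-intensive job. */
typedef enum { JOB_MEMORY, JOB_COMPUTE, JOB_IO } job_class_t;

static int symbiotic(job_class_t a, job_class_t b) {
    /* two memory-bound jobs would contend for shared bandwidth */
    return !(a == JOB_MEMORY && b == JOB_MEMORY);
}

int main(void) {
    printf("mem+mem: %d\n", symbiotic(JOB_MEMORY, JOB_MEMORY));   /* 0 */
    printf("mem+cpu: %d\n", symbiotic(JOB_MEMORY, JOB_COMPUTE));  /* 1 */
    printf("mem+io:  %d\n", symbiotic(JOB_MEMORY, JOB_IO));       /* 1 */
    return 0;
}
```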

Symbiotic Scheduler: Prototype Results
• Per-processor speedups (based on average runtimes in the test)
• 16-processor apps: 10–25% speedup
• 4-processor apps: 4–20% slowdown (but double the utilization)

Identifying Symbiosis
• Ask the users
  – Coarse-grained
  – Fine-grained
• Online discovery
  – Sampling (e.g. Snavely with SMT)
  – Profiling (e.g. Antonopoulos and Koukis with hardware counters)
[Chart: memory operations per second vs. self-slowdown.]
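The chart relates an easily measured hardware-counter rate to the quantity a scheduler cares about. As a hedged sketch of the kind of predictor this enables (a linear model with placeholder coefficients, not fitted values from the paper):

```c
#include <stdio.h>

/* Hypothetical linear predictor of self-slowdown from a job's
 * measured memory-operation rate (e.g. read from hardware counters).
 * A and B are placeholders; a real deployment would calibrate them
 * per machine from runs like those behind the chart above. */
static double predict_self_slowdown(double mem_ops_per_sec) {
    const double A = 1.0;     /* no slowdown at a zero memory-op rate */
    const double B = 1.0e-9;  /* slope per (memory op/s): placeholder */
    return A + B * mem_ops_per_sec;
}

int main(void) {
    /* e.g. a job issuing 2e8 memory ops/s -> predicted 1.2x slowdown */
    printf("%.2f\n", predict_self_slowdown(2.0e8));
    return 0;
}
```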

User Guidance: Why Ask Users?
• Consent
• Financial
• Technical
• Transparency
• Familiarity
• Submission flags from users are standard

User Guidance: Coarse Grained
Can users identify the resource bottlenecks of their applications?

Application Workload
• WRF: Weather Research and Forecasting system, from the DoD's High Performance Computing Modernization Program (HPCMP)
• OOCORE: out-of-core solver, from the DoD's HPCMP
• MILC: MIMD Lattice Computation, from the DoE's National Energy Research Scientific Computing (NERSC) program
• PARATEC: Parallel Total Energy Code, from NERSC
• HOMME: High Order Methods Modeling Environment, from the National Center for Atmospheric Research
These applications were deemed "of strategic importance to the United States federal government" by a recent $30M NSF procurement*
* High Performance Computing Systems Acquisition: Towards a Petascale Computing Environment for Science and Engineering

Expert User Inputs
• User inputs were collected independently from five expert users
• Users reported having used MPI Trace, HPMCOUNT, etc.
• Are these inputs accurate enough to inform a scheduler?

User-Guided Symbiotic Schedules
• The table: 64p runs using 32-way p690 nodes; speedups are vs. 2 nodes; entries are marked as Predicted Slowdown, Predicted Speedup, or No Prediction
• All applications speed up when spread (even those with communication bottlenecks)
• Users identified the non-symbiotic pairs
• User speedup predictions were 94% accurate
• Average speedup is 15% (min = 7%, max = 22%)

User Guidance: Fine Grained
• Submit quantitative job characterizations
• The scheduler learns good combinations on the system
• Chameleon Framework:
  – Concise, quantitative description of application memory behavior (a signature; illustrated below)
  – Tools for fast signature extraction (~5x)
  – Synthetic address traces
  – Fully tunable, executable benchmark
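To make "concise, quantitative description of application memory behavior" concrete, one plausible shape for a signature is a vector of hit rates across a sweep of cache configurations, echoing the 68-cache comparison on the next slide. The struct below is our hypothetical illustration, not Chameleon's actual data structure:

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical memory signature: an application's expected hit rate
 * across a sweep of LRU cache configurations. */
#define N_CACHE_CONFIGS 68

typedef struct {
    double hit_rate[N_CACHE_CONFIGS];   /* one entry per cache config */
} mem_signature_t;

/* Euclidean distance between signatures: a small distance suggests
 * two applications stress the memory hierarchy in similar ways. */
static double signature_distance(const mem_signature_t *a,
                                 const mem_signature_t *b) {
    double sum = 0.0;
    for (int i = 0; i < N_CACHE_CONFIGS; i++) {
        double d = a->hit_rate[i] - b->hit_rate[i];
        sum += d * d;
    }
    return sqrt(sum);
}

int main(void) {
    mem_signature_t x = {{0}}, y = {{0}};
    y.hit_rate[0] = 0.9;    /* toy data: differs in one configuration */
    printf("distance = %.2f\n", signature_distance(&x, &y));
    return 0;
}
```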

Chameleon: Application Signatures
[Chart: similarity between the NAS Parallel Benchmarks on 68 LRU caches.]

Space-Sharing (Bus)
[Chart: space-sharing on the Pentium D.]

Comparative Performance of NPB
[Chart: performance, in units of 100M memory operations per second.]

Space-Sharing (Bus, L2)
[Chart: space-sharing on the Intel Centrino.]

Space-Sharing (Bus, L2, L3)
[Chart: space-sharing on the Power4.]

Conclusions
• To what extent and why do jobs interfere with themselves and each other? 10–60% for memory and 1000%+ for I/O (on DataStar)
• If this interference exists, how effectively can it be reduced by alternative job mixes? Almost completely, given the right job mix
• How can parallel codes leverage this, and what is the net gain? Spread jobs across more nodes; normally up to 40% with our test set
• How can a job scheduler create symbiotic schedules? Ask users, use hardware counters, and pursue the future work below…

Future Work
• Workload study: how much opportunity is there in production workloads?
• Runtime symbiosis detection
• Scheduler heuristics:
  – How should the scheduler actually operate?
  – Learning algorithms?
  – How will it affect fairness and other policy objectives?
• Other deployment contexts:
  – Desktop grids
  – Web servers
  – Desktops?

Thank You!