Presentation transcript:

Petascale Tools Workshop, August 2014
Judit Gimenez, BSC; Martin Schulz, LLNL
LLNL-PRES-xxxxxx
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

 Can we make a petascale-class machine behave like what we expect exascale machines to look like?
  - Limit resources (power, memory, network, I/O, …)
  - Increase compute/bandwidth ratios
  - Increase fault rates and lower MTBF
 In short: release GREMLINs into a petascale machine
 Goal: an emulation platform for the co-design process
  - Evaluate proxy apps and compare them to a baseline
  - Determine the bounds of behavior that proxy apps can tolerate
  - Drive changes in proxy apps to counteract exascale properties

 Techniques to force "bad" behavior on "good" systems
  - Target individual resources and artificially limit them, either through hardware techniques or by stealing resources to create contention
  - Directly inject bad behavior through external events
 Individual GREMLINs are implemented as modules (a hypothetical module sketch follows below)
  - One effect at a time
  - Orthogonal to each other
  - Each GREMLIN has "knobs" to control its behavior
 The GREMLIN framework
  - Dynamic configuration and loading of individual GREMLINs
  - Ability to couple a range of "bad behaviors"
  - Transparent to the system and (mostly) to applications
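The slides describe GREMLINs as orthogonal, dynamically loaded modules whose behavior is controlled through "knobs", but they do not show the actual interface. The following is only a minimal sketch of what such a module might look like; every name in it (gremlin_module, the init/start/stop hooks, the GREMLIN_CACHE_MB environment variable) is hypothetical, not taken from the framework.

/* Hypothetical sketch of a dynamically loadable GREMLIN module with one
 * "knob".  The slides do not show the framework's real interface; every
 * name here (gremlin_module, GREMLIN_CACHE_MB, init/start/stop hooks)
 * is illustrative only. */
#include <stdio.h>
#include <stdlib.h>

typedef struct gremlin_module {
    const char *name;
    int  (*init)(void);    /* read knobs, allocate what the gremlin needs */
    void (*start)(void);   /* begin interfering with the target resource  */
    void (*stop)(void);    /* release the stolen resource                 */
} gremlin_module;

static size_t cache_mb;    /* knob: how many MB of LLC this gremlin occupies */

static int cache_init(void) {
    const char *knob = getenv("GREMLIN_CACHE_MB");   /* hypothetical knob */
    cache_mb = knob ? strtoul(knob, NULL, 10) : 4;
    return 0;
}
static void cache_start(void) {
    fprintf(stderr, "[llc-size] occupying %zu MB of LLC\n", cache_mb);
    /* a real module would spawn a cache-polluting thread here */
}
static void cache_stop(void) {
    fprintf(stderr, "[llc-size] releasing cache\n");
}

static gremlin_module llc_size_gremlin = {
    "llc-size", cache_init, cache_start, cache_stop
};

int main(void) {                     /* stand-in for the framework loader */
    gremlin_module *g = &llc_size_gremlin;
    if (g->init() == 0) {
        g->start();
        /* ... run the application under the gremlin ... */
        g->stop();
    }
    return 0;
}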

[Architecture diagram: each rank (Rank 0 through Rank N) of a multi-node job (e.g., MPI) runs the application on the architecture with a stack of GREMLINs in between (the GREMLIN environment plus Measurement, Power, Memory, and Fault GREMLINs), coordinated by GREMLIN Control on a front-end node.]

 Power
  - Impact of changes in frequency/voltage
  - Impact of limits in available power per machine/rack/node/core
 Memory
  - Restrictions in bandwidth
  - Reduction of cache size
  - Limitations of memory size
 Resiliency
  - Injection of faults to understand their impact
  - Notification of "fake" faults to test recovery
 Noise
  - Injection of controlled or random noise events (a minimal noise-injection sketch follows below)
  - Crosscut summarizing the effects of the previous GREMLINs
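The slides only name the Noise GREMLIN's purpose. As one way such noise could be injected, the sketch below uses a POSIX interval timer whose handler busy-spins for a short time at a fixed period, emulating OS noise inside the application process; the 10 ms period and 200 us burst are arbitrary example values, not parameters from the talk.

/* Minimal sketch of a noise-injection gremlin: a SIGALRM handler that
 * busy-spins for ~200 us every 10 ms, stealing CPU time from the
 * application loop below.  Period and burst length are arbitrary. */
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>
#include <stdio.h>

static const long noise_us = 200;            /* burn 200 us per event */

static void noise_handler(int sig) {
    (void)sig;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {                                     /* busy-wait: steal cycles */
        clock_gettime(CLOCK_MONOTONIC, &t1);
    } while ((t1.tv_sec - t0.tv_sec) * 1000000L +
             (t1.tv_nsec - t0.tv_nsec) / 1000L < noise_us);
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = noise_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval period = {
        .it_interval = { .tv_sec = 0, .tv_usec = 10000 },   /* every 10 ms */
        .it_value    = { .tv_sec = 0, .tv_usec = 10000 },
    };
    setitimer(ITIMER_REAL, &period, NULL);

    volatile double x = 0.0;                 /* stand-in for the application */
    for (long i = 0; i < 200000000L; i++)
        x += 1e-9;
    printf("done, x=%f\n", x);
    return 0;
}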


 Using RAPL to install power caps (a rough sketch follows below)
  - Exposes chip variations
  - Turns homogeneous machines into inhomogeneous ones
 Optimal configuration under a power cap
  - Widely differing performance
  - Application-specific characteristics
  - Need for models
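The slides state that RAPL is used to install power caps (through libmsr and msr-safe, described on the next slide). As a rough, library-independent illustration of a RAPL package cap, the sketch below writes a limit through the Linux powercap sysfs interface exposed by the intel_rapl driver; the 50 W value and the socket-0 path are assumptions, and root privileges are required.

/* Rough sketch of installing a package power cap via the Linux powercap
 * (intel_rapl) sysfs interface -- not the libmsr/msr-safe path used by
 * the GREMLIN framework itself.  The 50 W cap and socket-0 path are
 * example values. */
#include <stdio.h>

int main(void) {
    const char *limit =
        "/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw";
    long cap_uw = 50L * 1000 * 1000;   /* 50 W, expressed in microwatts */

    FILE *f = fopen(limit, "w");
    if (!f) {
        perror("open power limit (needs root and the intel_rapl driver)");
        return 1;
    }
    fprintf(f, "%ld\n", cap_uw);
    fclose(f);

    printf("package 0 long-term power limit set to %ld uW\n", cap_uw);
    return 0;
}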

 Low-level infrastructure
  - libmsr: user-level API to enable access to MSRs (incl. RAPL capping)
  - msr-safe: kernel module to make MSR access "safe"
 Current status
  - Support for Intel Sandy Bridge; more CPUs (incl. AMD) in progress
  - Code released on GitHub; inclusion into RHEL pending
  - Deployed on the TLCC cluster cab
 Analysis update
  - Full system characterization (see Barry's talk)
  - Application analysis in progress

 Scheduling research
  - Find optimal configurations for each code
  - Balance processor inhomogeneity
  - Understand the relationship to load balancing
  - Integration into FLUX
 New Gremlins
  - Artificially introduce noise events
  - Network Gremlins: limit network bandwidth or increase latency, and inject controlled cross traffic (a latency-injection sketch follows below)
 Adaptation of the Gremlins to new programming models
  - Initially developed for MPI (using PnMPI as the base)
  - First new target: OmpSs
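The slides note that the Gremlins were initially built for MPI on top of PnMPI and that a network Gremlin should limit bandwidth or increase latency. A minimal stand-alone illustration of the latency half, without PnMPI, is a PMPI interposition wrapper that delays every MPI_Send before forwarding it; the 50 us delay is an arbitrary example, and the wrapper is simply compiled and linked in front of the MPI library.

/* Sketch of a network-latency gremlin as a plain PMPI wrapper (the real
 * Gremlins sit on top of PnMPI).  Link this translation unit with the
 * application; every MPI_Send is delayed by an arbitrary 50 us before the
 * real call runs. */
#include <mpi.h>
#include <time.h>

static const long extra_latency_ns = 50 * 1000;   /* 50 us, example value */

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    struct timespec delay = { 0, extra_latency_ns };
    nanosleep(&delay, NULL);                /* inject the extra latency */
    return PMPI_Send(buf, count, type, dest, tag, comm);
}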

 Gremlins integrated with the OmpSs runtime
  - Extrae instrumentation (online mode)
 Analysis of the Gremlins' impact
  - Living with Gremlins! Measure the applications' sensitivity
  - Growing our Gremlins! Uniform vs. non-uniform populations
  - Do Gremlins have side effects? They should not affect other resources; do they increase or otherwise affect variability?
 First results
  - Up to now, playing with the memory Gremlins

 Gremlins are launched at runtime initialization but remain transparent
 Each gremlin thread is exclusively pinned to one core
 These processors then become inaccessible to the runtime
 Runtime parameters can be used to enable or disable gremlins and to define the number of gremlin threads, the resource type, and how much of that resource a single gremlin thread should use (a sketch of such a thread follows below)
Marc Casas
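Based on the description above (a gremlin thread pinned exclusively to one core, consuming a configurable amount of a resource), the sketch below shows one plausible implementation of a cache- and bandwidth-stealing thread using Linux CPU affinity and a streaming loop over a large buffer. The core number and the 64 MB buffer size are arbitrary; the actual OmpSs gremlin threads are not shown in the slides.

/* Sketch of a resource-stealing gremlin thread: pinned to one core, it
 * streams over a large buffer forever, consuming LLC space and memory
 * bandwidth.  Core id and buffer size (64 MB) are arbitrary examples.
 * Build with: cc -O2 -pthread gremlin_thread.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_BYTES (64UL * 1024 * 1024)

static void *gremlin_thread(void *arg) {
    int core = *(int *)arg;

    /* pin this thread exclusively to one core (Linux-specific call) */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    volatile char *buf = calloc(BUF_BYTES, 1);
    if (!buf) return NULL;

    unsigned long sum = 0;
    for (;;) {                                     /* stream forever */
        for (size_t i = 0; i < BUF_BYTES; i += 64) /* touch one line each */
            sum += buf[i];
    }
    return (void *)sum;                            /* never reached */
}

int main(void) {
    int core = 3;                                  /* example core to steal */
    pthread_t t;
    pthread_create(&t, NULL, gremlin_thread, &core);
    printf("gremlin pinned to core %d; press Ctrl-C to stop\n", core);
    pthread_join(t, NULL);                         /* blocks forever */
    return 0;
}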

 Identify sensitive tasks
  - Classify how sensitive they are
  - Match them with tasks that can run concurrently and are less sensitive to the respective resource type (the pairing policy is sketched below)
 Implement a smart scheduler in OmpSs
  - Modify the OmpSs scheduler to identify resource-sensitive tasks with the help of gremlin threads
  - Implement a scheduler that takes this information into account when scheduling tasks for execution
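The OmpSs scheduler plugin interface is not shown in the slides, so the sketch below only illustrates the proposed pairing policy itself: repeatedly co-schedule the most resource-sensitive ready task with the least sensitive one. The task structure and the sensitivity scores (as might be measured by running tasks next to a cache gremlin) are made up for illustration.

/* Policy sketch only: pick the next pair of ready tasks so that a
 * cache-sensitive task runs alongside the least sensitive one available.
 * This is not the OmpSs scheduler plugin API; task/queue types are
 * invented for the example. */
#include <stdio.h>

typedef struct { const char *name; double sensitivity; int done; } task_t;

/* pick the most and least sensitive undone tasks as the next pair to run */
static int pick_pair(task_t *ready, int n, int *hi, int *lo) {
    *hi = *lo = -1;
    for (int i = 0; i < n; i++) {
        if (ready[i].done) continue;
        if (*hi < 0 || ready[i].sensitivity > ready[*hi].sensitivity) *hi = i;
        if (*lo < 0 || ready[i].sensitivity < ready[*lo].sensitivity) *lo = i;
    }
    return *hi >= 0 && *lo >= 0 && *hi != *lo;
}

int main(void) {
    /* sensitivities as might be measured next to a cache gremlin
     * (higher = slows down more when the LLC is reduced) */
    task_t ready[] = { {"pcg", 0.9, 0}, {"matvec", 0.7, 0},
                       {"copy", 0.1, 0}, {"reduce", 0.2, 0} };
    int n = 4, hi, lo;
    while (pick_pair(ready, n, &hi, &lo)) {
        printf("co-schedule %s (sensitive) with %s (insensitive)\n",
               ready[hi].name, ready[lo].name);
        ready[hi].done = ready[lo].done = 1;
    }
    return 0;
}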

[Figure: last-level cache size left to the application versus the number of cache-size gremlins. 0 gremlins: 20 MB; 1 gremlin: 15 MB; 2 gremlins: 12 MB; 3 gremlins: 7 MB; 4 gremlins: 4 MB.]

[Figure: memory bandwidth left to the application versus the number of bandwidth gremlins. 0 gremlins: 40 GB/s; the values for 1 to 4 gremlins did not survive the transcript.]

[Figure: execution timelines without gremlins, with the cache size limited to 4 MB, and with the memory bandwidth limited to 28.7 GB/s.]

[Figure: the matvec and pcg kernels under the 4 MB cache-size limit and the 28.7 GB/s bandwidth limit; the trace labels did not survive the transcript.]

[Figure: the pcg kernel under the 4 MB cache-size limit and the 28.7 GB/s bandwidth limit.]

[Figure: the matvec and pcg kernels while limiting bandwidth; the trace labels did not survive the transcript.]

 Extrae online analysis mode
  - Based on MRNet
 Gremlins API to activate them locally
  - All Gremlins launched at initialization time
 First experiments with LLC cache-size gremlins
  - Periodic increase in the number of Gremlins
  - Unbalanced stealing of resources

 Can we extract insight from chaos?
  - Unbalanced creation of Gremlins

[Figure: L3/instruction ratio.]

 Different regions show different sensitivity to resource reductions
 Asynchrony affects the actual sharing of resources
  - Today this happens without control, which leads to variability
 Detailed analysis detects increases in variability and a potentially non-uniform impact