Cross-stack Energy Optimization: Fact or Fiction? Kevin Skadron University of Virginia Dept. of Computer Science.

Presentation transcript:

Cross-stack Energy Optimization: Fact or Fiction? Kevin Skadron University of Virginia Dept. of Computer Science

Flavors of X-Stack
“Up” the stack
– Circuits → Microarchitecture
– HW → SW, e.g., sensors → throttling
  Ideally, the application itself can adapt (algorithm, precision, QoS, etc.)
– …
“Down” the stack
– Often overlooked, but OS and HW can benefit from application knowledge
– SW → HW, e.g., access patterns, thread priorities, private/shared, etc.
– GPU example: texture (API → driver → HW)
  e.g., reconfigurable hardware

Up: Dymaxion: Index Transformation
SIMD/SIMT: because SIMD requires contiguous access for efficiency, data layout/traversal needs to be transformed
User → middleware → (device driver) → (hardware)
feature[index] → feature’[transform(index)]

Code Example

Original Version

HOST
    cudaMemcpy(feature_d, feature, …);
    kmeans_kernel_orig<<<…>>>(feature_d, ...);

DEVICE
    __global__ void kmeans_kernel_orig(float *feature_d, ...) {
        int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
        /* ... */
        for (int l = 0; l < nclusters; l++) {
            index = point_id * nfeatures + l;
            ... feature_d[index] ...
        }
    }

Dymaxion Version

HOST
    map_row2col(feature_remap, feature, …);
    kmeans_kernel_map<<<…>>>(feature_remap, ...);

DEVICE
    __global__ void kmeans_kernel_map(float *feature_remap, ...) {
        int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
        /* ... */
        for (int l = 0; l < nclusters; l++) {
            index = point_id * nfeatures + l;
            /* same logical index, remapped to the transformed layout */
            ... feature_remap[transform_row2col(index, npoints, nfeatures)] ...
        }
    }
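The slide does not show the bodies of map_row2col or transform_row2col. As a minimal sketch of what such a pair could look like, assuming feature is a row-major npoints × nfeatures array and the remapped copy is column-major, the host-side C++ below gives one plausible version. The signatures and bodies here are assumptions for illustration, not Dymaxion's actual API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the slide's transform_row2col (body assumed).
// Maps the flat row-major index of element (point, feature) in an
// [npoints x nfeatures] array to its flat index in the column-major copy.
inline std::size_t transform_row2col(std::size_t index,
                                     std::size_t npoints,
                                     std::size_t nfeatures) {
    assert(nfeatures > 0 && index < npoints * nfeatures);
    const std::size_t point = index / nfeatures;    // row in the original layout
    const std::size_t feature = index % nfeatures;  // column in the original layout
    return feature * npoints + point;
}

// Hypothetical host-side remap in the role of the slide's map_row2col
// (parameter list assumed): copies src into dst using the transformed index.
inline void map_row2col(float* dst, const float* src,
                        std::size_t npoints, std::size_t nfeatures) {
    for (std::size_t i = 0; i < npoints * nfeatures; ++i) {
        dst[transform_row2col(i, npoints, nfeatures)] = src[i];
    }
}
```

With this layout, threads that handle consecutive points and read the same feature touch adjacent addresses, which is the contiguous access pattern the previous slide says SIMD/SIMT hardware needs. The kernel keeps computing the original logical index and only the final lookup changes.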

Down: Lack of Sensors and Actuators
Feedback control: sensors and actuators
Chicken-and-egg problem
Lack of sensors is a big problem now
– Can’t control what we can’t measure
– Performance monitors not designed for this: too coarse-grained, can’t monitor enough
– Moving in the right direction
Need more actuators, too
– Currently mainly have just DVFS and scheduling/placement
– Some HDDs offer DRPM
– Reconfiguration is a form of actuation, too
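The feedback-control idea on this slide, a sensor reading driving an actuator, can be sketched as a small hysteresis loop. The example below is purely illustrative: ThermalGovernor, its thresholds, and the notion of integer DVFS levels are assumptions, and a real system would read a hardware sensor and program a voltage/frequency setting rather than update a struct field.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical controller: one sensor (temperature in Celsius) and one
// actuator (a DVFS level, where 0 is the slowest setting).
struct ThermalGovernor {
    int level;      // current DVFS level
    int max_level;  // fastest available level
    double hot_c;   // throttle when the reading exceeds this
    double cool_c;  // speed up when the reading falls below this

    // One control step: take a sensor reading, return the new actuator setting.
    int step(double temp_c) {
        assert(cool_c < hot_c);
        if (temp_c > hot_c) {
            level = std::max(0, level - 1);          // too hot: step down
        } else if (temp_c < cool_c) {
            level = std::min(max_level, level + 1);  // headroom: step up
        }
        // Between the two thresholds the level is held, which avoids oscillation.
        return level;
    }
};
```

The loop is only as good as its inputs and outputs: if the temperature reading is too coarse or the only actuator is a slow DVFS transition, the controller cannot react at the granularity the slide asks for.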

Wish List
Sensors/constraint communication
– Up: structure occupancies, interval behavior, fine-grained/instruction-level responsiveness, physical location, etc.
  Expand perf-counter system, add informing loads (ISCA ~00), allow HW to query microarchitectural state, expose chip/rack/datacenter/geographic location, etc.
– Down: access patterns, private/shared, priority/performance expectations, etc.
  Requires new programming constructs and new (possibly privileged) instructions
Actuators
– Many system components hard to control, e.g., HDDs, DRAM, power supply
– Control memory behavior, light sleep modes
  Ordering/buffering/prefetching/contention
– More reconfigurability, coarse-grained architectures
  Why use cache when you can use scratchpad; registers, routed network when you can do direct producer-consumer, etc.?

Summary
Turn fiction into non-fiction!
Some good ideas already in papers
– Revisit: why weren’t they adopted?
New ideas:
– Imagine ideal sensing and actuation
– Show a promising control/adaptation/reconfiguration algorithm
– Propose plausible sensors/actuators

Backup

What is “Cross Stack”?
Layer X adapts based on information in Layer Y
– Example: OS uses hardware info, e.g., temp sensors, structure occupancies, # pending cache misses guide thread co-location
– Or hardware uses OS info, e.g., thread priorities, task deadlines guide hardware DVFS policy
– Important: leverage information across layers to make globally efficient decisions
– Ultimately: break down costly interfaces (unnecessary copies, extra state, redundant computation)
Different from energy optimization happening independently in multiple layers
– e.g., hardware DVFS (based on instruction flow) + OS DVFS (based on task deadlines)
– Risky: control loops can fight
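As one illustration of hardware DVFS guided by OS information, the sketch below picks the lowest frequency that still meets a task deadline. Everything in it is an assumption made for the example: pick_frequency is a hypothetical helper, the OS is assumed to supply an estimate of remaining cycles and time to deadline, and execution time is assumed to scale inversely with frequency, which ignores memory-bound behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cross-layer policy: the OS supplies remaining work (cycles)
// and time left before the deadline (seconds); the DVFS layer returns the
// index of the lowest frequency that still finishes in time.
// freqs_hz must be non-empty and sorted in ascending order.
inline std::size_t pick_frequency(const std::vector<double>& freqs_hz,
                                  double remaining_cycles,
                                  double time_to_deadline_s) {
    assert(!freqs_hz.empty());
    for (std::size_t i = 0; i < freqs_hz.size(); ++i) {
        // Assumed model: run time = cycles / frequency.
        if (remaining_cycles / freqs_hz[i] <= time_to_deadline_s) {
            return i;  // slowest setting that meets the deadline
        }
    }
    return freqs_hz.size() - 1;  // deadline cannot be met: run as fast as possible
}
```

Because a single decision point sees both the deadline and the frequency table, there is one control loop instead of two independent ones, which is the situation the slide flags as risky.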

Fact or Fiction
Should be fact! But mostly fiction
– Can’t measure power/energy effectively in many systems and components
– Control options are typically high-overhead: DVFS, task migration, etc.
– Most solutions are single-layer
Baby steps
– Cluster/datacenter front end monitors per-node activity and temperature, and schedules accordingly
– Autotuning
– Reducing copies