Embedded System Lab. 김해천 The TURBO Diaries: Application-controlled Frequency Scaling Explained.

Slides:



Advertisements
Similar presentations
4. Workload directed adaptive SMP multicores
Advertisements

Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures Pree Thiengburanathum Advanced computer architecture Oct 24,
SLA-Oriented Resource Provisioning for Cloud Computing
BY GAURAV GUPTA[13IS09F] PAWAN KUMAR THAKUR[13IS17F] Dynamic Management of Turbo Mode in Modern Multi-core Chips DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING.
Autonomic Systems Justin Moles, Winter 2006 Enabling autonomic behavior in systems software with hot swapping Paper by: J. Appavoo, et al. Presentation.
Extensibility, Safety and Performance in the SPIN Operating System Presented by Allen Kerr.
CS533 Concepts of Operating Systems Class 12 Extensibility via Software Based Protection.
- Sam Ganzfried - Ryan Sukauye - Aniket Ponkshe. Outline Effects of asymmetry and how to handle them Design Space Exploration for Core Architecture Accelerating.
CS533 Concepts of Operating Systems Class 13 Micro-kernels (cont.) Extensibility via Software Based Protection.
CS533 Concepts of Operating Systems Class 6 Micro-kernels Extensibility via Hardware or Software Based Protection.
Process Concept An operating system executes a variety of programs
CS 423 – Operating Systems Design Lecture 22 – Power Management Klara Nahrstedt and Raoul Rivas Spring 2013 CS Spring 2013.
Exploring the Tradeoffs of Configurability and Heterogeneity in Multicore Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable.
The Impact of Performance Asymmetry in Multicore Architectures Saisanthosh Ravi Michael Konrad Balakrishnan Rajwar Upton Lai UW-Madison and, Intel Corp.
Multi-core Programming Thread Profiler. 2 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Topics Look at Intel® Thread Profiler features.
Folklore Confirmed: Compiling for Speed = Compiling for Energy Tomofumi Yuki INRIA, Rennes Sanjay Rajopadhye Colorado State University 1.
COLLABORATIVE EXECUTION ENVIRONMENT FOR HETEROGENEOUS PARALLEL SYSTEMS Aleksandar Ili´c, Leonel Sousa 2010 IEEE International Symposium on Parallel & Distributed.
1 Overview 1.Motivation (Kevin) 1.5 hrs 2.Thermal issues (Kevin) 3.Power modeling (David) Thermal management (David) hrs 5.Optimal DTM (Lev).5 hrs.
Energy saving in multicore architectures Assoc. Prof. Adrian FLOREA, PhD Prof. Lucian VINTAN, PhD – Research.
Architectural Support for Fine-Grained Parallelism on Multi-core Architectures Sanjeev Kumar, Corporate Technology Group, Intel Corporation Christopher.
Sigrity, Inc © Efficient Signal and Power Integrity Analysis Using Parallel Techniques Tao Su, Xiaofeng Wang, Zhengang Bai, Venkata Vennam Sigrity,
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
1 Computer Architecture Research Overview Rajeev Balasubramonian School of Computing, University of Utah
Embedded System Lab. 김해천 Automatic NUMA Balancing Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist,
Stephen Berard Program Manager Windows Platform Architecture Team.
Process Scheduling III ( 5.4, 5.7) CPE Operating Systems
Computational Sprinting on a Real System: Preliminary Results Arun Raghavan *, Marios Papaefthymiou +, Kevin P. Pipe +#, Thomas F. Wenisch +, Milo M. K.
A dynamic optimization model for power and performance management of virtualized clusters Vinicius Petrucci, Orlando Loques Univ. Federal Fluminense Niteroi,
Copyright ©: University of Illinois CS 241 Staff1 Threads Systems Concepts.
Computer Science Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs Min Yeol Lim Computer Science Department Sep.
Towards Dynamic Green-Sizing for Database Servers Mustafa Korkmaz, Alexey Karyakin, Martin Karsten, Kenneth Salem University of Waterloo.
Adaptive Multi-Threading for Dynamic Workloads in Embedded Multiprocessors 林鼎原 Department of Electrical Engineering National Cheng Kung University Tainan,
Understanding Performance, Power and Energy Behavior in Asymmetric Processors Nagesh B Lakshminarayana Hyesoon Kim School of Computer Science Georgia Institute.
Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations Vignesh Ravi, Wenjing Ma, David Chiu.
Lev Finkelstein ISCA/Thermal Workshop 6/ Overview 1.Motivation (Kevin) 2.Thermal issues (Kevin) 3.Power modeling (David) 4.Thermal management (David)
CS399 New Beginnings Jonathan Walpole. 2 Concurrent Programming & Synchronization Primitives.
Design Issues of Prefetching Strategies for Heterogeneous Software DSM Author :Ssu-Hsuan Lu, Chien-Lung Chou, Kuang-Jui Wang, Hsiao-Hsi Wang, and Kuan-Ching.
1 November 11, 2015 GAME CHANGING DESIGN Shlomit Weiss – VP Platform Engineering Group, Intel corp. 1.
Full and Para Virtualization
Platform Abstraction Group 3. Question How to deal with different types hardware and software platforms? What detail to expose to the programmer? What.
Technical Seminar Presentation 2004 Presented by- Geetanjali Konhar EE O81 1 Dynamic power management for embedded system “ Dynamic power management.
Shouqing Hao Institute of Computing Technology, Chinese Academy of Sciences Processes Scheduling on Heterogeneous Multi-core Architecture.
Embedded System Lab. 김해천 Decoupling Cores, Kernels, and Operating Systems Gerd Zellweger, Simon Gerber, Kornilios Kourtis, and Timothy.
E-MOS: Efficient Energy Management Policies in Operating Systems
CS533 Concepts of Operating Systems Jonathan Walpole.
ECE692 Course Project Proposal Cache-aware power management for multi-core real-time systems Xing Fu Khairul Kabir 16 September 2009.
Equalizer: Dynamically Tuning GPU Resources for Efficient Execution Ankit Sethia* Scott Mahlke University of Michigan.
Silberschatz, Galvin and Gagne ©2009Operating System Concepts – 8 th Edition Chapter 4: Threads.
Tuning Threaded Code with Intel® Parallel Amplifier.
Advanced Operating Systems CS6025 Spring 2016 Processes and Threads (Chapter 2)
Matthew Locke November 2007 A Linux Power Management Architecture.
1 Chapter 5: Threads Overview Multithreading Models & Issues Read Chapter 5 pages
Evaluation of Advanced Power Management for ClassCloud based on DRBL Rider Grid Technology Division National Center for High-Performance Computing Research.
Overview Motivation (Kevin) Thermal issues (Kevin)
Current Generation Hypervisor Type 1 Type 2.
CS427 Multicore Architecture and Parallel Computing
The Multikernel: A New OS Architecture for Scalable Multicore Systems
Processes and Threads Processes and their scheduling
SECTIONS 1-7 By Astha Chawla
Process Management Presented By Aditya Gupta Assistant Professor
Nikos Anastopoulos Nectarios Koziris
Frequency Governors for Cloud Database OLTP Workloads
Improving java performance using Dynamic Method Migration on FPGAs
NT1110 Computer Structure and Logic
Adaptive Single-Chip Multiprocessing
CS533 Concepts of Operating Systems Class 14
CS533 Concepts of Operating Systems Class 6
CS533 Concepts of Operating Systems Class 13
CS533 Concepts of Operating Systems Class 13
Chapter 13: I/O Systems “The two main jobs of a computer are I/O and [CPU] processing. In many cases, the main job is I/O, and the [CPU] processing is.
Presentation transcript:

Embedded System Lab. 김해천 The TURBO Diaries: Application-controlled Frequency Scaling Explained

김 해 천김 해 천 Embedded System Lab. Abstract Most multi-core architectures nowadays support dynamic voltage and frequency scaling (DVFS) to adapt their speed to the system’s load and save energy. Some recent architectures additionally allow cores to operate at boosted speeds exceeding the nominal base frequency but within their thermal design power. In this paper, we propose a general-purpose library that allows selective control of DVFS from user space to accelerate multi-threaded applications and expose the potential of heterogeneous frequencies. We analyze the performance and energy trade-offs using different DVFS configuration strategies on several benchmarks and real-world workloads. With the focus on performance, we compare the latency of traditional strategies that halt or busy-wait on contended locks and show the power implications of boosting of the lock owner. We propose new strategies that assign heterogeneous and possibly boosted frequencies while all cores remain fully operational. This allows us to leverage performance gains at the application level while all threads continuously execute at different speeds. We also derive a model to help developers decide on the optimal DVFS configuration strategy, e.g, for lock implementations. Our in-depth analysis and experimental evaluation of current hardware provides insightful guidelines for the design of future hardware power management and its operating system interface.

김 해 천김 해 천 Embedded System Lab. concept Dynamic voltage and frequency sacling (DVFS)  절전기술 : 주파수 스케일링, 전압 스케일링 P state and C state  P-states: performance states Predefined frequency/voltage pairs Controlled through machine-specific registers ( MSRs, priviledged rdmsr/wrmsr)  C-state: power states To save energy when a core is idle Entered by hlt or monitor/mwait P turbo P base P slow frequency/voltage C0 halted C1-Cn

김 해 천김 해 천 Embedded System Lab. concept Voltage and frequency domain: module vs package P turbo is entered automatically if some cores enter a deep C-state The manual control of all p-states AMD only: asymmetric frequency with manual boost x86FPU AMD Turbo CORE Intel Turbo Boost HT & P base P turbo ≥ C 1 x86 P turbo P slow

김 해 천김 해 천 Embedded System Lab. Evaluation setup Critical sections (CS) protected by MCS queue lock  Decoration on acquire/release trigger DVFS  Variable size of CS amortize DVFS cost  Effective CS frequency : f CS = f base * ( t CS / t A+CS+R ) =  Energy for 1 hour at P base : E = E sample * ( t A+CS+R / t CS ) =

김 해 천김 해 천 Embedded System Lab. Automatic Frequency Scaling Decoration: sinning vs blocking P-state transition triggered by hardware deeper C-state boosted P-state

김 해 천김 해 천 Embedded System Lab. Blocking vs Spinning Locks 1.5M 4M 1M, t wait = 7M 10k t wait = 70k

김 해 천김 해 천 Embedded System Lab. Manual Frequency Scaling Decoration: spin and application-level DVFS control Keep all core active

김 해 천김 해 천 Embedded System Lab. Manual Lock Boosting 200k 400k 600k Futex:1.5M Same application but with a different set of decorations for the lock  spin, owner, delegate, migrate, wait

김 해 천김 해 천 Embedded System Lab. END 감사합니다.

김 해 천김 해 천 Embedded System Lab. The easiest way to benefit from DVFS is to replace the application’s locks with thread control wrappers that are decorated with implicit P-state transitions, e.g., boosting the lock owner at Pturbo, waiting at Pslow, and executing parallel code at Pbase.

김 해 천김 해 천 Embedded System Lab. System calls for device-specific input/output operations (ioctl) have a low overhead and are easily extensible using the request code parameter. The interface of the TURBO driver (trb) is based on ioctl, while the Linux MSR driver (msr) uses a file-based interface that can be accessed most efficiently using pread/pwrite. The difference in speed between msr and trb (both use rdmsr/wrmsr to access the MSRs) results mostly from additional security checks and indirections that we streamlined for the TURBO driver

김 해 천김 해 천 Embedded System Lab. Processor and Linux Kernel Setup All boosted P-states are controlled by the processor and the Linux governor will adapt the non-boosted P-states based on the current processor utilization (“ondemand”) or based on static settings We must disable the influence of the governors and the processor’ power saving features in order to gain explicit control of the P-states and boosting in user space using our library. we disable the CPU frequency driver (cpufreq) and turn off AMD’s Cool’n’Quiet speed throttling technology in the BIOS. To control all available P-states in user space, we can either disable automatic boosting altogether, which is the only solution for Intel, or for AMD set #Pboosted = 0 to enable manual boosting control we only change BIOS settings and kernel parameters