Presentation transcript:

Improving Performance, Power, and Thermal Efficiency in High-End Systems
Kirk W. Cameron, Scalable Performance Laboratory, Department of Computer Science and Engineering, Virginia Tech (cs.vt.edu)

Poster sections: Introduction | Performance Efficiency | Power Efficiency | Thermal Efficiency

Introduction

There are no comprehensive, holistic studies of performance, power, and thermals on distributed scientific systems and workloads. Without innovation, future high-end computing (HEC) systems will waste performance potential, waste energy, and require extravagant cooling.

Problem Statement

Left unchecked, the fundamental drive to increase peak performance using tens of thousands of components in close proximity to one another will result in: 1) an inability to sustain performance improvements, and 2) exorbitant infrastructure and operational costs for power and cooling.

Performance, Power, and Thermal Facts

The gap between peak and achieved performance is growing. A 5-megawatt supercomputer can consume $4M in energy annually. In just 2 hours, the Earth Simulator can produce enough heat to warm a home in the Midwest all winter long.

Projections

Commodity components fail at an annual rate of 2-3%. A petaflop system of ~12,000 nodes (CPU, NIC, DRAM, disk) will sustain a hardware failure once every 24 hours. The life expectancy of an electronic component decreases 50% for every 10°C (18°F) increase in temperature.

Our Approach

Observations: Predictive models and techniques are needed to maximize the performance of emergent systems. Below-peak performance may provide adequate "slack times" for improved power and thermal efficiency.
Constraint: Performance is the critical constraint. Reduce power and thermals ONLY if it does not reduce performance significantly.

Relevant Approaches to the Problem

Improving performance efficiency: A myriad of tools and modeling techniques exist to analyze and optimize the performance of parallel scientific applications. In our work we focus on fast analytical modeling techniques to optimize emergent architectures such as the IBM Cell Broadband Engine.
Improving power efficiency: Exploit application "slack times" to operate components in lower power modes (e.g., dynamic voltage and frequency scaling, or DVFS) to conserve power and energy. Prior to our work, there was no framework for profiling the performance and power of parallel systems and applications.
Improving thermal efficiency: Exploit application "slack times" to operate components in lower power (and thermal) modes to reduce the heat emitted by the system. Prior to our work, there was no framework for profiling the performance and thermals of parallel systems and applications.

Our Contributions

I. A portable framework to profile, analyze, and optimize distributed applications for performance, power, and thermals with minimal performance impact.
II. Performance-power-thermal tradeoff studies and optimizations of scientific workloads on various architectures.

Performance Analysis of NAS Parallel Benchmarks

Distributed thermal profiles: A thermal profile of FT reveals thermal patterns corresponding to code phases: floating-point-intensive phases run hot while memory-bound phases run cooler, and significant temperature drops occur in very short periods of time. The thermal behavior of BT (not pictured) shows temperatures synchronizing with workload behavior across nodes, and some nodes trend hotter than others. All of this data was obtained using Tempest.
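The profiles above were collected with Tempest. As a rough illustration of what phase-level thermal sampling involves (this is not Tempest's code), the sketch below reads a Linux sysfs temperature sensor at code-phase boundaries; the sensor path and phase names are assumptions about the measurement host.

```c
/* Minimal sketch of phase-level CPU thermal sampling, in the spirit of the
 * Tempest profiles described above (not the Tempest implementation).
 * The sysfs path is an assumption about the measurement host. */
#include <stdio.h>
#include <time.h>

static double read_cpu_temp_celsius(void)
{
    /* Linux exposes many thermal sensors in millidegrees Celsius */
    FILE *f = fopen("/sys/class/thermal/thermal_zone0/temp", "r");
    long millideg = 0;
    if (f) {
        if (fscanf(f, "%ld", &millideg) != 1)
            millideg = 0;
        fclose(f);
    }
    return millideg / 1000.0;
}

static void log_phase(const char *phase)
{
    printf("%ld  %-20s  %.1f C\n", (long)time(NULL), phase,
           read_cpu_temp_celsius());
}

int main(void)
{
    log_phase("start");
    /* ... floating-point-intensive phase (tends to run hot) ... */
    log_phase("after_fp_phase");
    /* ... memory-bound phase (tends to run cooler) ... */
    log_phase("after_mem_phase");
    return 0;
}
```

Logging one sample per phase boundary is the simplest way to line up temperature swings with the algorithm phases described above; a real profiler samples continuously and correlates with instrumentation events.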
Temperature-Performance Tradeoffs

Thermal-performance tradeoffs are studied using Tempest, with DVFS strategies applied to reduce temperature in parallel scientific applications.

Download Tempest: Tempest is available for download, and related papers are also available.

[Figures: Tempest software architecture; detailed thermal profile of FT (Class C, NP=4)]

Thermal optimizations are achieved with minimal performance impact.

Thermal regulation: The Tempest controller constrains temperature to within a threshold. Since the controller is heuristic, the temperature can exceed the threshold; however, temperature is typically controlled well using DVFS within a node. The weighted importance of thermals, performance, and energy can determine the "best" operating point over a number of nodes.

CPU impact on thermals: For floating-point-intensive codes (e.g., SP, FT, and EP from NAS), the CPU is a large consumer of power under load and dissipates significant heat. Energy optimizations that significantly reduce CPU heat should therefore have a significant impact on total system temperature.

[Figures: Thermal regulation of IS (Class C, NP=4); thermal regulation of FT (Class C, NP=4); average CPU temperature for various NAS PB codes]

Thermal-aware performance impact: The performance impact of our thermal-aware DVFS controller is less than 10% for all the NAS PB codes measured. Nonetheless, we commonly reduce operating temperature by nearly 10°C (18°F), which translates to a 50% reliability improvement in some cases. On average, we reduce operating temperature by 5-7°C.

Tempest profiling techniques are automatic, accurate, and portable.

PowerPack II Software (deployed on the 8-node Dori cluster)
- Power profiling API library: synchronized profiling of parallel applications.
- Power control API library: synchronized DVS control within a parallel application.
- Multimeter middleware: coordinates data from multiple meter sources.
- Power analyzer middleware: sorts, sifts, analyzes, and correlates profiling data.
- Performance profiler: uses common utilities to poll system performance status.

This work was sponsored in part by the Department of Energy Office of Science Early Career Principal Investigator (ECPI) Program under grant number DOE DE-FG02-04ER.

PowerPack measurement setup: each node under test is instrumented with sense resistors on its component power rails, read by multimeters that feed a data collection system over RS232/GBIC links and an Ethernet switch. For a supply voltage V_S and a voltage drop V_R across a sense resistor R, the component power is P = (V_S - V_R) · V_R / R.

Distributed power profiles: NAS codes exhibit regularity (e.g., FT on 4 nodes) that reflects algorithm behavior. Intensive use of memory corresponds to decreases in CPU power and increases in memory power use. Power consumption can vary across nodes for a single application, with the number of nodes under a fixed workload, and with varied workload under a fixed number of nodes. Results often correlate with the communication-to-computation ratio.

[Figure: Normalized energy and delay with CPU MISER for FT.C, comparing the default "auto" setting with CPU MISER]

Reducing energy consumption: CPU MISER uses dynamic voltage and frequency scaling (DVFS) to lower average processor power consumption. With the default cpuspeed daemon ("auto") or any fixed lower frequency, performance loss is common; CPU MISER reduces energy consumption without reducing performance significantly. Memory MISER uses power-scalable DRAM to lower average memory power consumption by turning off memory DIMMs based on memory use and allocation (in the corresponding plot, the top curve shows the amount of online memory and the bottom curve shows actual demand).
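CPU MISER's controller is not reproduced here; as a hedged sketch of the DVFS mechanism it builds on, the fragment below lowers the core clock during a slack (memory- or communication-bound) phase through the Linux cpufreq userspace governor and restores it for the compute phase. The sysfs paths, frequency values, and phase structure are assumptions about the target machine, and writing these files requires root privileges.

```c
/* Hedged sketch of per-phase DVFS via the Linux cpufreq userspace governor,
 * illustrating the slack-time scaling described above (not CPU MISER itself).
 * Frequencies are example values in kHz. */
#include <stdio.h>

static void write_sysfs(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");          /* requires root privileges */
    if (f) {
        fputs(value, f);
        fclose(f);
    }
}

static void set_cpu0_khz(const char *khz)
{
    write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                "userspace");
    write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", khz);
}

int main(void)
{
    set_cpu0_khz("1000000");  /* 1.0 GHz during a memory/communication phase */
    /* ... slack phase: MPI communication or memory-bound work ... */
    set_cpu0_khz("2400000");  /* 2.4 GHz for the compute-intensive phase */
    /* ... compute phase ... */
    return 0;
}
```

The key design point is the one made above: frequency is lowered only where slack exists, so average power drops while the critical compute path keeps running at full speed.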
CPU MISER and Memory MISER are both capable of 30% total system energy savings with less than 1% performance loss.

Optimizing Heterogeneous Multicore Systems

We use a variation of the log_nP performance model (referred to below as MMGP) to predict the cost of various process and data placement configurations at runtime. Using the performance model, we can schedule process and data placement optimally for a heterogeneous multicore architecture. Results on the IBM Cell Broadband Engine show that dynamic multicore scheduling using analytical modeling is a viable, accurate technique for improving performance efficiency. Portions of this work were accomplished in collaboration with the PEARL Laboratory led by Prof. D. Nikolopoulos.

MMGP model equations, where HPU denotes the host processing unit and APU an accelerator processing unit:
- Time for a single iteration: T_i = T_HPU + T_APU + O_offload
- Off-loaded time: O_offload = O_r + O_s
- Total time: T = Σ_i (T_HPU,i + T_APU,i + O_offload,i)
- Single APU: T_APU = T_APUp + C_APU, where T_APUp is the APU part that can be parallelized and C_APU is the sequential APU part
- Multiple APUs: T_APU(1,p) = T_APU(1,1)/p + C_APU, where p is the number of APUs, T_APU(1,1) is the offloaded time for 1 APU, and T_APU(1,p) is the offloaded time for p APUs
- T = T_HPU + T_APU(1,1)/p + C_APU + O_offload + p·g
- HPU time for one iteration: T_HPU(m,1) = a_m · T_HPU(1,1) + T_CSW + O_col
- T(m,p) = T_HPU(m,p) + T_APU(m,p) + O_offload + p·g

Application: Parallel Bayesian Phylogenetic Inference (PBPI)
Dataset: 107 sequences, each … nucleotides; 20,000 generations.
- MMGP mean error is 3.2%, standard deviation 2.6%, maximum error 10%.
- PBPI executes a sampling phase at the beginning of execution; MMGP parameters are determined during the sampling phase, and execution is restarted after the sampling phase using the MMGP-predicted configuration.
- PBPI with the sampling phase outperforms other configurations by 1% to 4x, and the sampling-phase overhead is 2.5%.
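To make the single-HPU form of the model above concrete, here is a small sketch (an illustration, not the MMGP implementation used in PBPI) that evaluates T(p) = T_HPU + T_APU(1,1)/p + C_APU + O_offload + p·g for candidate APU counts (e.g., Cell SPEs) and picks the count with the smallest predicted time. The numeric parameters are placeholders standing in for values that PBPI fits during its sampling phase.

```c
/* Sketch of choosing an APU count with the MMGP-style cost model above
 * (illustrative parameters, not measured PBPI values). */
#include <stdio.h>

struct mmgp_params {
    double t_hpu;       /* HPU time per iteration                    */
    double t_apu_1_1;   /* offloaded time measured with a single APU */
    double c_apu;       /* sequential (non-parallelizable) APU part  */
    double o_offload;   /* offload overhead: O_r + O_s               */
    double g;           /* per-APU overhead term                     */
};

/* T(p) = T_HPU + T_APU(1,1)/p + C_APU + O_offload + p*g */
static double predict_time(const struct mmgp_params *m, int p)
{
    return m->t_hpu + m->t_apu_1_1 / p + m->c_apu + m->o_offload + p * m->g;
}

int main(void)
{
    struct mmgp_params m = { 2.0, 16.0, 0.5, 0.3, 0.05 };  /* placeholders */
    int best_p = 1;
    double best_t = predict_time(&m, 1);

    for (int p = 1; p <= 8; p++) {        /* e.g., up to 8 Cell SPEs */
        double t = predict_time(&m, p);
        printf("p=%d  predicted T=%.3f\n", p, t);
        if (t < best_t) {
            best_t = t;
            best_p = p;
        }
    }
    printf("best APU count: %d (predicted T=%.3f)\n", best_p, best_t);
    return 0;
}
```

Because T(p) trades a shrinking T_APU(1,1)/p term against a growing p·g term, the minimum typically lies at an intermediate APU count rather than at the maximum, which is why a runtime prediction step pays off.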