Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.

Outline
• Introduction
• Design
  ◦ Isolation Mechanisms
  ◦ Controllers
• Evaluation
• Conclusion

Motivation
• Average server utilization in most datacenters is low, ranging from 10% to 50%.
  ◦ It is difficult to consolidate latency-critical services onto a subset of highly utilized servers.
• Server utilization can be increased by launching best-effort tasks on the same server as a latency-critical job.

Motivation (Cont.)
• Previous work tends to protect latency-critical (LC) workloads, but in doing so reduces the opportunities for higher utilization through co-location.

Goal
• Eliminate SLO violations at all levels of load for the LC job while maximizing the throughput of best-effort (BE) tasks.

Heracles
• A real-time, feedback-based controller
  ◦ Enables the safe co-location of best-effort (BE) tasks alongside a latency-critical (LC) service.
  ◦ Ensures that LC jobs meet their latency targets while maximizing the resources given to BE tasks.

Heracles (Cont.)
  ◦ Manages four hardware and software isolation mechanisms.
    ▪ Hardware: shared cache partitioning, fine-grained power/frequency settings.
    ▪ Software: core isolation, network traffic control.

Isolation Mechanisms (Software)
• Core isolation
  ◦ Pin each workload to a set of cores using cpuset cgroups.
  ◦ Speed of (re)allocation: tens of milliseconds.
• Network traffic
  ◦ Limit the outgoing bandwidth of BE tasks using Linux traffic control.
  ◦ No limit on the LC job.
  ◦ Takes effect within hundreds of milliseconds.
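The core-isolation step can be illustrated with a minimal Python sketch. It only computes the CPU list strings that one would write to each cpuset cgroup's `cpuset.cpus` file; the function names and the policy of giving the highest-numbered cores to BE tasks are assumptions made for illustration, not details taken from the slides.

```python
def format_cpuset(cores):
    """Render core IDs in the kernel's CPU list format, e.g. '0-3,8'."""
    ranges = []
    for cid in sorted(set(cores)):
        if ranges and cid == ranges[-1][1] + 1:
            ranges[-1][1] = cid          # extend the current contiguous run
        else:
            ranges.append([cid, cid])    # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def split_cores(total_cores, be_cores):
    """Hypothetical policy: BE gets the highest-numbered cores, LC the rest."""
    if not 0 <= be_cores < total_cores:
        raise ValueError("the LC job must keep at least one core")
    boundary = total_cores - be_cores
    lc = format_cpuset(range(0, boundary))
    be = format_cpuset(range(boundary, total_cores))
    return lc, be


if __name__ == "__main__":
    lc_cpus, be_cpus = split_cores(total_cores=16, be_cores=4)
    print("LC cpuset.cpus:", lc_cpus)
    print("BE cpuset.cpus:", be_cpus)
```

Keeping the two sets disjoint is what makes this isolation rather than sharing: moving the boundary reallocates cores between the LC job and the BE tasks.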

Isolation Mechanisms (Hardware)
• LLC isolation
  ◦ Cache Allocation Technology (CAT) in recent Intel chips.
    ▪ Uses way-partitioning to define non-overlapping partitions of the LLC.
    ▪ Takes effect within a few milliseconds.
  ◦ A software monitor tracks the memory bandwidth usage of LC and BE jobs.
    ▪ Scales down the number of cores for BE jobs if LC jobs do not receive sufficient bandwidth.
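Way-partitioning can be pictured as assigning each job class a bitmask over the cache ways. The sketch below, in Python, builds two non-overlapping contiguous masks; the function name, the 20-way cache, and the choice to place BE ways in the low bits are illustrative assumptions, and the code does not program any hardware.

```python
def way_masks(total_ways, lc_ways):
    """Split `total_ways` cache ways into disjoint LC and BE bitmasks.

    Each mask is a contiguous run of bits, one bit per way.
    Assumption for illustration: BE takes the low ways, LC the high ways.
    """
    if not 0 < lc_ways < total_ways:
        raise ValueError("both LC and BE need at least one way")
    be_ways = total_ways - lc_ways
    be_mask = (1 << be_ways) - 1
    lc_mask = ((1 << lc_ways) - 1) << be_ways
    return lc_mask, be_mask


if __name__ == "__main__":
    lc, be = way_masks(total_ways=20, lc_ways=16)
    assert lc & be == 0              # partitions do not overlap
    print(f"LC mask: {lc:#07x}, BE mask: {be:#07x}")
```

Growing or shrinking the BE partition then amounts to recomputing the two masks with a different `lc_ways`.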

Isolation Mechanisms (Hardware) (Cont.)
• Power isolation
  ◦ CPU frequency monitoring, Running Average Power Limit (RAPL), and per-core DVFS.
  ◦ Takes effect within a few milliseconds.

Design Approach
• An optimization problem
  ◦ Maximize utilization subject to the constraint that the SLO must be met.
• Heracles
  ◦ Decomposes the high-dimensional optimization problem into many smaller, independent problems.
    ▪ Decouples the interference sources.
  ◦ Monitors latency, latency slack, and load.
    ▪ Adjusts the BE job allocation accordingly.
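The feedback idea (monitor latency, slack, and load, then adjust the BE allocation) can be sketched as a single decision function in Python. This is a simplified, stateless illustration: the threshold values, the action names, and the definition of slack as the fraction of the latency target left unused are assumptions, not figures taken from these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All values are illustrative assumptions.
    high_load: float = 0.85     # above this load, BE tasks are not allowed to run
    shrink_slack: float = 0.05  # below this slack, take resources from BE
    grow_slack: float = 0.10    # below this slack, stop giving BE more resources


def latency_slack(target_ms, measured_ms):
    """Fraction of the latency target still unused; negative means an SLO violation."""
    return (target_ms - measured_ms) / target_ms


def decide(target_ms, measured_ms, load, t=Thresholds()):
    """Map one monitoring sample to a hypothetical action for the BE allocation."""
    slack = latency_slack(target_ms, measured_ms)
    if slack < 0 or load > t.high_load:
        return "disable_be"
    if slack < t.shrink_slack:
        return "shrink_be"
    if slack < t.grow_slack:
        return "hold_be"
    return "grow_be"


if __name__ == "__main__":
    for measured, load in [(120, 0.5), (97, 0.5), (93, 0.5), (50, 0.5), (50, 0.9)]:
        print(measured, load, decide(100, measured, load))
```

A real controller would call such a function periodically and keep state between samples, for example to avoid re-enabling BE tasks immediately after disabling them.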

System Diagram

High-level Controller

Core & Memory Sub-controller

Max Load under SLO

Power and Network Sub-controller

Evaluation
• Two sets of experiments
  ◦ Co-locating LC applications with BE tasks on a single server.
  ◦ Measuring the end-to-end latency of websearch on tens of servers.
    ▪ BE tasks are also running.
• Effective Machine Utilization (EMU)
  ◦ EMU = LC throughput + BE throughput
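A small worked example of the EMU metric, in Python. The slide gives only the sum; the sketch assumes that each throughput is first normalized to what the job achieves when it runs alone on the server, and the function and parameter names are hypothetical.

```python
def emu(lc_throughput, lc_alone, be_throughput, be_alone):
    """Effective Machine Utilization = normalized LC throughput + normalized BE throughput.

    Assumption: `*_alone` is the throughput of that job when it has the
    whole server to itself, so each term is a fraction of standalone capacity.
    """
    if lc_alone <= 0 or be_alone <= 0:
        raise ValueError("standalone throughputs must be positive")
    return lc_throughput / lc_alone + be_throughput / be_alone


if __name__ == "__main__":
    # LC job at 50% of its standalone throughput, BE tasks at 25% of theirs.
    print(emu(lc_throughput=500, lc_alone=1000, be_throughput=25, be_alone=100))
```

Under this normalization the sum can exceed 1.0 when co-located jobs stress different resources and together do more work than either could alone.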

Workloads
• Three Google production LC workloads:
  ◦ websearch
  ◦ ml_cluster
    ▪ Real-time text clustering using machine learning
  ◦ memkeyval
    ▪ In-memory key-value store
• Run the LC workloads alongside BE tasks: benchmarks that each stress a single shared resource, and production batch workloads.
  ◦ Stream-LLC, Stream-DRAM, cpu-pwr, iperf, brain, and streetview.

Latency of LC Applications

EMU

Shared Resource Utilization

Websearch in Cluster

Conclusion
• Heracles
  ◦ A heuristic, feedback-based system that manages four isolation mechanisms so that a latency-critical workload can be co-located with batch jobs without SLO violations.
  ◦ Evaluation on real hardware demonstrates an average utilization of 90% across all evaluated scenarios without any SLO violations for the latency-critical job.

Interference Analysis
• Three Google production LC workloads:
  ◦ websearch
  ◦ ml_cluster
    ▪ Real-time text clustering using machine learning
  ◦ memkeyval
    ▪ In-memory key-value store
• Run the LC workloads with synthetic benchmarks that stress each resource in isolation.