Tapestry: Reducing Interference on Manycore Processors for IaaS Clouds


Anshuman Gupta and Michael Bedford Taylor
CSE Department, University of California at San Diego

Overview

Manycore architectures have many potential benefits for Infrastructure-as-a-Service (IaaS) clouds:
Power – high performance per watt
Space – high performance per rack
Hardware – low cost per processing core

Basic problem: excessive resource interference hinders the adoption of manycores in IaaS clouds. Dynamic resource sharing leads to unpredictable slowdowns for applications, and interference worsens as the number of concurrent applications grows. Interference in turn leads to higher-order problems for IaaS clouds.

The Details

Dynamically Partitioned Manycore Architecture. Tapestry is a distributed manycore architecture whose last-level cache and memory bandwidth, shared between many applications, are dynamically partitioned.

CacheBlocs. Our cache uses Dynamic Set Partitioning (DSP) and Tag-Indirect Cache Addressing (TICA) for scalable cache partitioning.

Online Performance Estimation (pTables). Performance Tables (pTables) store performance estimates for all applications across a spectrum of last-level cache and memory bandwidth allocations. To calculate pTables, we use an online analytical performance model driven by cache and prefetcher statistics for all configurations.

Flattened Partial LRU Vector (FPLV). Shadow caching for DSP requires tracking the LRU orders for different cache sizes. We maintain all of these LRU orders efficiently with a single topologically sorted vector.

Higher-Order Decisions

Virtual Time Metering (VTM). We estimate an application's virtual execution time as the duration for which it would have had to run with all the resources on the chip in order to execute the same number of instructions as it did in the actual execution. We then charge the consumer for using the entire chip for this estimated virtual time.
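The shadow-cache statistics FPLV maintains amount to an LRU stack-distance profile: one pass over the access stream yields the hit count that every candidate cache size would have seen. Below is a minimal software sketch of that idea (the hardware FPLV structure is far more compact; function names and the toy trace are illustrative, not from the poster):

```python
from collections import Counter

def stack_distance_profile(accesses):
    """One pass over an access trace, recording each reuse's LRU stack
    distance; together these give shadow hit counts for every cache size."""
    stack = []             # most-recently-used block at the end
    distances = Counter()  # stack distance -> number of accesses
    for addr in accesses:
        if addr in stack:
            # Depth from the MRU end is the LRU stack distance.
            depth = len(stack) - stack.index(addr)
            distances[depth] += 1
            stack.remove(addr)
        else:
            distances[float("inf")] += 1  # cold miss at any size
        stack.append(addr)
    return distances

def hits_for_size(distances, size):
    """Hits a fully associative LRU cache of `size` blocks would see."""
    return sum(count for dist, count in distances.items()
               if dist != float("inf") and dist <= size)

# Toy trace: blocks re-referenced at different reuse distances.
prof = stack_distance_profile(["A", "B", "A", "C", "B", "A"])
# hits_for_size(prof, 2) and hits_for_size(prof, 3) differ, showing how
# one profile prices every candidate cache allocation at once.
```

This single-pass property is what makes shadow statistics cheap enough to feed a pTable for a whole spectrum of allocations.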
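The VTM billing rule reduces to simple arithmetic once the full-chip performance estimate is available from the pTables. A sketch under assumed values (the IPC estimate, clock rate, and billing rate are hypothetical, not from the poster):

```python
def virtual_time_s(instructions, full_chip_ipc, clock_hz):
    # Time the application would have needed with all chip resources
    # to retire the same number of instructions.
    return instructions / (full_chip_ipc * clock_hz)

def metered_charge(instructions, full_chip_ipc, clock_hz, rate_per_s):
    # Bill the full-chip rate for the virtual time, so the consumer
    # pays for work completed rather than interference-inflated
    # wall-clock time.
    return virtual_time_s(instructions, full_chip_ipc, clock_hz) * rate_per_s

# Example: 2e9 retired instructions at an estimated 2.0 full-chip IPC
# on a 1 GHz chip is one second of virtual time, whatever the actual
# (slowed-down) execution took.
charge = metered_charge(2e9, 2.0, 1e9, rate_per_s=0.01)
```

The point of the scheme is that the charge is invariant to how much interference the co-runners caused.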
The higher-order questions include: How much should we charge consumers (metering)? How can we minimize the slowdowns of concurrent applications? How can we maximize the throughput of co-located applications? Can we handle a higher application load on a single processor?

Our Approach: Tapestry

1. Dynamically partition the critical shared resources.
2. Estimate application performance under all resource allocations.
3. Use the performance estimates to make higher-order decisions.

Simultaneous Performance Optimization Table (SPOT). The Fair Slowdown Metric, an approximate geometric mean of application virtual times, increases throughput while maintaining fairness when maximized. We use a dynamic hardware algorithm to find the resource distribution that maximizes this metric.

Prefetcher Throttling. With dynamic bandwidth allocation, we adjust prefetcher aggression to maximize bandwidth utilization.

Shadow Prefetcher. To determine shadow prefetcher statistics, we run the prefetching algorithm without actually prefetching any data.

Novel Microarchitectural Components in Tapestry

1. Dynamically Partitioned Manycore Architecture
> CacheBlocs – scalable non-associative cache partitioning
> Prefetcher Throttling – efficient bandwidth utilization
2. Online Performance Estimation (pTables)
> Flattened Partial LRU Vector (FPLV) – shadow cache statistics
> Shadow Prefetcher – shadow prefetcher statistics
3. Higher-Order Decisions
> Virtual Time Metering (VTM) – accurately charge customers
> Simultaneous Performance Optimization Table (SPOT) – maximize throughput while maintaining fairness

Results

I. While existing techniques can overcharge consumers by as much as 12x, Tapestry meters fairly.
III. Tapestry improves overall throughput by as much as 1.8x.

Additional Results

We estimated the pTables with an error of only about 1%. Using CacheBlocs, we reduced power consumption in partitioned caches by 67%. Our throttling approximately tracks the Pareto-optimal curve for prefetcher performance.
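The SPOT optimization can be illustrated by brute force: given each application's pTable, score every feasible split of a shared resource by the geometric mean of the estimated performances and keep the best. This is only a sketch of the objective; the real SPOT uses a dynamic hardware algorithm rather than exhaustive search, and the pTable values below are hypothetical:

```python
from itertools import product
from math import prod

def best_allocation(ptables, total_ways):
    """Exhaustively pick the per-app cache-way split that maximizes the
    geometric mean of estimated performances (the fair-slowdown idea)."""
    n = len(ptables)
    best_split, best_score = None, -1.0
    for split in product(range(1, total_ways + 1), repeat=n):
        if sum(split) != total_ways:
            continue  # not a valid partition of the cache
        score = prod(pt[w] for pt, w in zip(ptables, split)) ** (1.0 / n)
        if score > best_score:
            best_split, best_score = split, score
    return best_split, best_score

# Hypothetical pTables: estimated IPC per number of allocated ways.
app_a = {1: 0.5, 2: 0.9, 3: 1.0, 4: 1.05}   # cache-sensitive
app_b = {1: 0.8, 2: 0.85, 3: 0.9, 4: 0.9}   # mostly streaming
split, score = best_allocation([app_a, app_b], total_ways=4)
# The geometric mean steers capacity toward the cache-sensitive app
# without starving the insensitive one.
```

The geometric mean is what encodes the fairness: a split that cripples any one application drags the whole score down.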
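Prefetcher throttling of the kind described above can be sketched as a small feedback controller: raise the prefetch degree while the partition's bandwidth allocation is under-used, back off when it saturates. The thresholds and step sizes here are illustrative assumptions, not Tapestry's actual policy:

```python
def throttle(degree, used_bw, allocated_bw, min_degree=0, max_degree=8):
    """Additive-step controller over prefetch degree, driven by the
    fraction of the partition's allocated bandwidth actually in use."""
    util = used_bw / allocated_bw
    if util < 0.7 and degree < max_degree:
        return degree + 1   # headroom left: prefetch more aggressively
    if util > 0.95 and degree > min_degree:
        return degree - 1   # near saturation: prefetches now hurt demand traffic
    return degree           # in band: hold steady

# Under-utilized partition -> become more aggressive.
new_degree = throttle(2, used_bw=3.0, allocated_bw=10.0)
# Saturated partition -> back off so demand misses are not delayed.
backed_off = throttle(2, used_bw=9.8, allocated_bw=10.0)
```

Re-evaluating this each time bandwidth is re-allocated is what lets the prefetcher track the Pareto curve between coverage and bandwidth waste.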
II. Slowdowns are imminent, but Tapestry improves worst-case performance by as much as 1.6x.
IV. With increasing application load, Tapestry provides progressively better overall throughput as well as worst-case performance.

Area and Energy Costs in Tapestry