Symbiotic Scheduling for Shared Caches in Multi-Core Systems Using Memory Footprint Signature  Mrinmoy Ghosh  Ripal Nathuji Min Lee Karsten Schwan Hsien-Hsin.

Symbiotic Scheduling for Shared Caches in Multi-Core Systems Using Memory Footprint Signature
Mrinmoy Ghosh, Ripal Nathuji, Min Lee, Karsten Schwan, Hsien-Hsin S. Lee
ARM, Microsoft Research, Georgia Tech

Cache Interference in “Concurrent Processes”
[Figure: Core A and Core B, each with an L1 cache, above a shared L2 cache holding cache lines of P1 and P2; annotations: “Line Hit” and “Conflict”.]

Cache Interference Effect (Concurrent Processes)
Maximum performance degradation is less than 10%.

Cache Interference in “Shared Cache Multi-Core”
[Figure: Core A and Core B, each with an L1 cache, above a shared L2 cache holding cache lines of P1 and P2; annotation: “Conflict”.]

Cache Interference Effect (Shared Cache Multi-Core)
Performance is degraded by as much as 65%.
Intelligent process management is needed.

Process (In-)Compatibility in Multi-Cores
Problem
– Processes on different cores can be incompatible
– Shared resource contention
Observation
– Less contention between incompatible processes when they run on the same core
Insight
– Process incompatibility severely affects performance
– Compatibility-based scheduling increases throughput

Ideas
– Use a Counting Bloom Filter to record the memory access signature
– Test compatibility using the signature

Insertion: Counting Bloom Filter
[Figure: the N-bit data address A is passed through the N-to-m hash functions X and Y, which index an array of counters, each paired with a presence bit.]

Insertion: Counting Bloom Filter
[Figure: the N-bit data address B is inserted through hash functions X and Y; counter values shown: 2, 2.]

Deletion: Counting Bloom Filter
[Figure: data address A was evicted; counter values shown: 1, 1, 2.]

Query: Counting Bloom Filter
[Figure: query for data address A through hash functions X and Y; a counter value of 1 is shown, and the result is “Data Not Present”.]
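
The insertion, deletion, and query slides above can be sketched in software as follows. This is only an illustrative model, not the hardware design from the talk: the class name `CountingBloomFilter`, the default sizes, and the use of SHA-256 as a stand-in for the two N-to-m hash functions X and Y are all assumptions made for the example.

```python
import hashlib


class CountingBloomFilter:
    """Software model of a counting Bloom filter (hypothetical names and sizes)."""

    def __init__(self, num_counters=64, num_hashes=2):
        self.m = num_counters          # number of counters (and presence bits)
        self.k = num_hashes            # two hashes, standing in for X and Y
        self.counters = [0] * num_counters

    def _indexes(self, address):
        # Assumption: SHA-256 with a per-hash prefix replaces the hardware hashes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{address}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, address):
        # A line brought into the cache increments every counter it hashes to.
        for idx in self._indexes(address):
            self.counters[idx] += 1

    def delete(self, address):
        # An evicted line decrements the same counters (never below zero).
        for idx in self._indexes(address):
            if self.counters[idx] > 0:
                self.counters[idx] -= 1

    def query(self, address):
        # "Possibly present" only if every hashed counter is non-zero.
        return all(self.counters[idx] > 0 for idx in self._indexes(address))

    def presence_bits(self):
        # The presence bit of a counter is 1 exactly when the counter is non-zero.
        return [1 if c > 0 else 0 for c in self.counters]
```

A query can return a false positive when other addresses happen to cover the same counters, but it never reports a resident address as absent, which is why the counters (rather than plain bits) are needed to support deletion on eviction.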

Bloom Filter Signatures vs. Cache Footprint
Strong correlation between the two.

Architectural Support

Bloom Filter Signature Multi-Core Architecture
[Figure: Core A and Core B, each with an L1 cache, a Last Filter, and a Core Filter, share an L2 cache that is tracked by the Bloom Filter Counters.]

Bloom Filter Signature Multi-Core Architecture
[Figure: the same architecture with processes P1, P2, and P3 shown.]

Metric for Execution State
[Figure: the Last Filter and the Core Filter are combined (operator drawn as “+”) into the RBV (Running Bit Vector).]
Occupancy Weight: the number of 1s in the RBV
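
A minimal sketch of this metric, under a loudly stated assumption: the transcript only shows a “+” between the Last Filter and the Core Filter, and the code below reads it as a bitwise XOR. The function names and the 8-bit example values are hypothetical.

```python
def running_bit_vector(last_filter: int, core_filter: int) -> int:
    """Combine the two filters into the RBV.

    Assumption: the combining operator is bitwise XOR; the slide only
    shows a '+' symbol between the Last Filter and the Core Filter.
    """
    return last_filter ^ core_filter


def occupancy_weight(rbv: int) -> int:
    """Occupancy weight is the number of 1 bits in the RBV."""
    return bin(rbv).count("1")


# Hypothetical 8-bit filter contents, for illustration only.
last_filter = 0b00110101
core_filter = 0b01110001
rbv = running_bit_vector(last_filter, core_filter)
weight = occupancy_weight(rbv)
print(f"RBV = {rbv:08b}, occupancy weight = {weight}")
```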

Interference Metric (Complement of Symbiosis)
[Figure: a process pool (processes waiting to be scheduled: Proc0, Proc1, Proc2, Proc*, Proc**); the RBV of Proc1 is combined with a Core Filter (operator drawn as “+”).]
Symbiosis = 5
Interference Metric = N - 5
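
The relationship between the two numbers on this slide can be sketched as below. The assumptions are that the “+” in the figure denotes a bitwise XOR of a process's RBV with a core's Core Filter, that symbiosis is the count of 1s in the result, and that N is the filter width in bits. The function names and the 8-bit values are hypothetical, chosen so that the symbiosis comes out as 5, matching the slide.

```python
def symbiosis(rbv: int, core_filter: int, n_bits: int) -> int:
    """Count the bit positions where the RBV and the Core Filter differ.

    Assumption: the slide's '+' is a bitwise XOR followed by a popcount.
    """
    mask = (1 << n_bits) - 1
    return bin((rbv ^ core_filter) & mask).count("1")


def interference(rbv: int, core_filter: int, n_bits: int) -> int:
    """Interference metric is the complement of symbiosis: N - symbiosis."""
    return n_bits - symbiosis(rbv, core_filter, n_bits)


# Hypothetical 8-bit vectors (N = 8).
N = 8
proc1_rbv = 0b10110010
core_filter = 0b00010101
print(symbiosis(proc1_rbv, core_filter, N))     # symbiosis
print(interference(proc1_rbv, core_filter, N))  # N - symbiosis
```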

Process-to-Core Mapping Algorithms
A1: Use Occupancy Weight
A2: Use Interference Graph
A3: Use Weighted Interference Graph

A1: Weight Sorted Algorithm
– Sort all processes according to occupancy weight
– Processes form groups using the sorted weights
– # of processes in a group = ⌈Processes/Cores⌉
– Map processes to cores based on the sorting results
Example occupancy weights, sorted: P0 = 100, P4 = 99, P2 = 70, P5 = 65, P6 = 43, P3 = 20, P1 = 15
[Figure: the sorted processes are mapped onto Core A, Core B, Core C, and Core D, each with an L1 cache.]
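
The steps of A1 can be sketched as follows, using the occupancy weights from the slide. Two points are assumptions rather than something the transcript states: the group size is rounded up, and each consecutive group of the sorted list is assigned to one core. The function name `weight_sorted_mapping` is hypothetical.

```python
import math


def weight_sorted_mapping(weights: dict, num_cores: int) -> list:
    """Group processes by descending occupancy weight, one group per core.

    Assumptions: group size is ceil(processes / cores), and consecutive
    entries of the sorted list end up on the same core.
    """
    ordered = sorted(weights, key=weights.get, reverse=True)
    group_size = math.ceil(len(ordered) / num_cores)
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]


# Occupancy weights taken from the slide; four cores (A to D).
weights = {"P0": 100, "P1": 15, "P2": 70, "P3": 20, "P4": 99, "P5": 65, "P6": 43}
groups = weight_sorted_mapping(weights, num_cores=4)
for core, group in zip("ABCD", groups):
    print(f"Core {core}: {group}")
```

With seven processes and four cores the group size is two, so the last core receives a single process.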

A2: Interference Graph Algorithm
– Form the interference graph using the interference metric
– Find the MAX-CUT of the graph
Per-process values against C_A and C_B:
– P0 (was in C_A): C_A = 20, C_B = 30
– P1 (was in C_A): C_A = 10, C_B = 45
– P2 (was in C_B): C_A = 40, C_B = 25
– P3 (was in C_B): C_A = 15, C_B = 50
[Figure: interference graph over P0 (A), P1 (A), P2 (B), and P3 (B); edge weights shown: 70, 85, 45.]
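
A small sketch of A2 on the four-process example. The edge-weight rule is inferred, not stated in the transcript: for two processes that last ran on different cores, the weight is taken as each process's value against the other's core, which reproduces the 70, 85, and 45 shown in the figure (the fourth edge, 60, follows from the same assumed rule). Edges between processes from the same core are left out because the slides do not show them. MAX-CUT is NP-hard in general, so the exhaustive search here is only suitable for tiny graphs; all names are hypothetical.

```python
from itertools import combinations

# Values from the slide: the core each process last ran on, and its
# interference values against Core A (C_A) and Core B (C_B).
LAST_CORE = {"P0": "A", "P1": "A", "P2": "B", "P3": "B"}
METRIC = {
    "P0": {"A": 20, "B": 30},
    "P1": {"A": 10, "B": 45},
    "P2": {"A": 40, "B": 25},
    "P3": {"A": 15, "B": 50},
}


def edge_weight(p: str, q: str) -> int:
    # Assumed rule: p's value against q's core plus q's value against p's core.
    return METRIC[p][LAST_CORE[q]] + METRIC[q][LAST_CORE[p]]


def build_edges() -> dict:
    # Only pairs that last ran on different cores get an edge (assumption).
    return {
        (p, q): edge_weight(p, q)
        for p, q in combinations(sorted(LAST_CORE), 2)
        if LAST_CORE[p] != LAST_CORE[q]
    }


def max_cut(nodes: list, edges: dict) -> tuple:
    """Exhaustive MAX-CUT: try every two-way split that keeps nodes[0] on one side."""
    first, rest = nodes[0], nodes[1:]
    best_weight, best_side = -1, set()
    for size in range(len(rest) + 1):
        for chosen in combinations(rest, size):
            side = {first, *chosen}
            weight = sum(w for (u, v), w in edges.items() if (u in side) != (v in side))
            if weight > best_weight:
                best_weight, best_side = weight, side
    return best_weight, best_side


edges = build_edges()
cut_weight, group = max_cut(sorted(LAST_CORE), edges)
print(edges)
print(cut_weight, sorted(group))
```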

A3: Weighted Interference Graph Algorithm
– Addresses high-interference cases
– Weight the edges of the interference graph
– The rest is the same as A2
Per-process occupancy weight (OW) and values against C_A and C_B:
– P0 (was in C_A): OW = 90, C_A = 20, C_B = 30
– P1 (was in C_A): OW = 85, C_A = 10, C_B = 45
– P2 (was in C_B): OW = 50, C_A = 40, C_B = 25
– P3 (was in C_B): OW = 100, C_A = 15, C_B = 50
[Figure: interference graph over P0 (A), P1 (A), P2 (B), and P3 (B); edge weight terms shown: 90*30 and 50*40.]
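
The weighting in A3 can be illustrated with the numbers from this slide. The figure shows the terms 90*30 and 50*40, that is, each process's occupancy weight multiplied by its value against the other process's core; adding the two terms to form a single edge weight is an assumption of this sketch, as are the names used.

```python
# Values from the slide: occupancy weight (OW), last core, and the values
# against Core A (C_A) and Core B (C_B).
PROCS = {
    "P0": {"ow": 90, "core": "A", "A": 20, "B": 30},
    "P1": {"ow": 85, "core": "A", "A": 10, "B": 45},
    "P2": {"ow": 50, "core": "B", "A": 40, "B": 25},
    "P3": {"ow": 100, "core": "B", "A": 15, "B": 50},
}


def weighted_edge(p: str, q: str) -> int:
    """Edge weight with each term scaled by the process's occupancy weight.

    Assumption: the two terms shown on the slide (e.g. 90*30 and 50*40)
    are summed to give the edge weight.
    """
    a, b = PROCS[p], PROCS[q]
    return a["ow"] * a[b["core"]] + b["ow"] * b[a["core"]]


for p, q in [("P0", "P2"), ("P0", "P3"), ("P1", "P2"), ("P1", "P3")]:
    print(p, q, weighted_edge(p, q))
```

The resulting weights would then feed the same MAX-CUT step as in A2.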

Performance Evaluation

Evaluation Methodology
– Gather footprint in emulator: processes P1 to PN on Fedora Linux in Simics (x86), using the “magic” interface
– Process-to-core mapping
– Native run: processes P1 to PN on Intel Core 2 (native x86)
– VM run: processes P1 to PN on Linux under the Xen hypervisor on Intel Core 2

Performance Results
– Maximum performance improvement of up to 54%
– Average performance improvement of up to 23%

Performance of Virtualized Systems
– Maximum performance improvement of up to 26%
– Average performance improvement of up to 9.5%

Performance Sensitivity of 3 Algorithms
The Weighted Interference Graph algorithm has the best performance.

Conclusion
– Shared resource (e.g., LLC) management is critical
– Capturing cache reference behavior for processes
– Symbiotic scheduling with Bloom filter signature
– Measured speedup of 22% (up to 54%) on Intel Core 2

That’s All, Folks!