Published by Brisa Lyle, modified over 9 years ago
1
Symbiotic Scheduling for Shared Caches in Multi-Core Systems Using Memory Footprint Signature
Mrinmoy Ghosh, Ripal Nathuji, Min Lee, Karsten Schwan, Hsien-Hsin S. Lee
ARM, Microsoft Research, Georgia Tech
2
Cache Interference in “Concurrent Processes”
[Figure: Cores A and B, each with a private L1 cache, above a shared L2 cache; cache lines of processes P1 and P2 are marked with a line hit and a conflict]
3
Cache Interference Effect (Concurrent Processes)
– Maximum performance degradation: less than 10%
4
Cache Interference in “Shared Cache Multi-Core”
[Figure: processes P1 and P2 on Cores A and B, each with a private L1 cache, above a shared L2 cache; their cache lines conflict]
5
Cache Interference Effect (Shared Cache Multi-Core)
– Performance degraded by as much as 65%
– Intelligent process management is needed!
6
Process (In-)Compatibility in Multi-Cores
Problem
– Processes on different cores can be incompatible
– Contention for shared resources
Observation
– Incompatible processes contend less when they run on the same core
Insight
– Process incompatibility severely affects performance
– Compatibility-based scheduling increases throughput
7
Ideas
– Use a counting Bloom filter to record each process's memory access signature
– Test compatibility using the signatures
8
Insertion: Counting Bloom Filter
[Figure: the N-bit data address A passes through two N-to-m hash functions, X and Y; the selected counters are incremented and their presence bits set to 1]
9
Insertion: Counting Bloom Filter
[Figure: data address B is inserted through hash functions X and Y; more presence bits are set, and some counters now read 2]
10
Deletion: Counting Bloom Filter
[Figure: data address A was evicted, so the counters selected by hash functions X and Y are decremented; a presence bit stays 1 while its counter is nonzero]
11
Query: Counting Bloom Filter
[Figure: querying data address A through hash functions X and Y finds a presence bit of 0: “Data not present!”]
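The insert, delete, and query steps on the last three slides can be sketched in Python as below. This is a minimal illustration, not the authors' hardware: the class name `CountingBloomFilter`, the filter size `m`, the 64-byte line size, and the two hash functions standing in for X and Y are all assumptions.

```python
class CountingBloomFilter:
    """Counting Bloom filter over cache-line addresses (illustrative sketch)."""

    def __init__(self, m=64, line_bits=6):
        self.m = m                  # number of counters (hypothetical size)
        self.line_bits = line_bits  # hypothetical 64-byte cache lines
        self.counters = [0] * m

    def _hashes(self, addr):
        # Two hypothetical N-to-m hash functions standing in for X and Y.
        line = addr >> self.line_bits
        hx = line % self.m
        hy = ((line >> 3) ^ (line * 5)) % self.m
        return (hx, hy)

    def insert(self, addr):
        # A line was brought into the cache: bump both selected counters.
        for h in self._hashes(addr):
            self.counters[h] += 1

    def delete(self, addr):
        # The line was evicted: undo its insertion (assumes it was inserted).
        for h in self._hashes(addr):
            if self.counters[h] > 0:
                self.counters[h] -= 1

    def presence_bits(self):
        # A presence bit is 1 exactly when its counter is nonzero.
        return [int(c > 0) for c in self.counters]

    def query(self, addr):
        # Any zero presence bit proves absence; all ones means "possibly present".
        bits = self.presence_bits()
        return all(bits[h] for h in self._hashes(addr))


f = CountingBloomFilter()
A, B = 0x1000, 0x2040
f.insert(A)
f.insert(B)
f.delete(A)        # A was evicted
print(f.query(B))  # True: B's counters are still nonzero
print(f.query(A))  # False: these two addresses share no counters
```

The filter can report false positives, because two addresses may share counters, but never false negatives, which is why a single zero presence bit is enough to declare the data absent.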
12
Bloom Filter Signatures vs. Cache Footprint
– Strong correlation!
13
Architectural Support
14
Bloom Filter Signature Multi-Core Architecture
[Figure: Cores A and B, each with a private L1 cache, a Core Filter, and a Last Filter, share an L2 cache with Bloom filter counters]
15
Bloom Filter Signature Multi-Core Architecture
[Figure: the same architecture with processes P1, P2, and P3]
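One way to picture the per-core filters is a toy model in which each core's counting Bloom filter tracks the L2 lines that core brought in, incrementing on a fill and decrementing when the line is evicted. Everything here is an assumption for illustration: the class `SharedL2Model`, a fully associative LRU L2 (real L2 caches are set-associative), the two hash functions, and the rule that a line stays charged to the core that filled it.

```python
from collections import OrderedDict


class SharedL2Model:
    """Toy shared L2 with one counting Bloom filter per core (illustrative only)."""

    def __init__(self, capacity_lines, cores, m=32, line_bits=6):
        self.capacity = capacity_lines
        self.m = m
        self.line_bits = line_bits
        self.lines = OrderedDict()                  # line -> owning core, LRU first
        self.filters = {c: [0] * m for c in cores}  # per-core counters

    def _hashes(self, line):
        # Two hypothetical hash functions standing in for X and Y.
        return (line % self.m, (line * 31 + 7) % self.m)

    def access(self, core, addr):
        """Return True on a hit; on a miss, fill the line and update filters."""
        line = addr >> self.line_bits
        if line in self.lines:
            self.lines.move_to_end(line)  # refresh LRU position; owner unchanged
            return True
        if len(self.lines) >= self.capacity:
            victim, owner = self.lines.popitem(last=False)  # evict the LRU line
            for h in self._hashes(victim):
                self.filters[owner][h] -= 1
        self.lines[line] = core
        for h in self._hashes(line):
            self.filters[core][h] += 1
        return False

    def core_filter(self, core):
        """Presence bits of a core's filter: 1 where the counter is nonzero."""
        return [int(c > 0) for c in self.filters[core]]


l2 = SharedL2Model(capacity_lines=2, cores=("A", "B"))
l2.access("A", 0x0)
l2.access("A", 0x40)
l2.access("B", 0x80)         # evicts core A's line 0x0
print(l2.access("A", 0x40))  # True: still cached
```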
16
Metric for Execution State
– Last Filter and Core Filter → RBV (Running Bit Vector)
– Occupancy Weight: the number of 1s in the RBV
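A sketch of the execution-state metric follows. The slides do not spell out how the Last Filter and Core Filter combine into the RBV, so this version simply assumes the RBV is a snapshot of the Core Filter's presence bits taken when a process is switched out; the helpers `running_bit_vector`, `occupancy_weight`, and `on_context_switch` are hypothetical.

```python
def running_bit_vector(core_filter_bits):
    """Pack a Core Filter's presence bits into an int (bit i <- entry i)."""
    rbv = 0
    for i, bit in enumerate(core_filter_bits):
        if bit:
            rbv |= 1 << i
    return rbv


def occupancy_weight(rbv):
    """Occupancy weight = number of 1s in the RBV."""
    return bin(rbv).count("1")


def on_context_switch(core_filter_bits, saved_rbvs, outgoing_pid):
    """Hypothetical hook: save the outgoing process's RBV, return its weight."""
    rbv = running_bit_vector(core_filter_bits)
    saved_rbvs[outgoing_pid] = rbv
    return occupancy_weight(rbv)


saved = {}
print(on_context_switch([0, 1, 1, 0, 1], saved, "P1"))  # 3
print(saved["P1"])                                      # 22 (0b10110)
```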
17
Interference Metric (Complement of Symbiosis)
[Figure: a process from the pool of processes waiting to be scheduled (Proc0, Proc1, Proc2, …) has its RBV (here Proc1's) compared against a Core Filter, giving Symbiosis = 5 and Interference Metric = N - 5]
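Read literally, the slide only gives the complement relation. The "+" marks in the extracted slide text are plausibly a garbled ⊕, so the sketch below assumes symbiosis is the number of filter positions where exactly one of the RBV and the Core Filter is set (the popcount of their XOR), and that N is the filter width in bits. Both readings are assumptions, as are the function names.

```python
def symbiosis(rbv, core_filter, n_bits):
    """Assumed: count positions where exactly one signature has a 1 (XOR popcount)."""
    mask = (1 << n_bits) - 1
    return bin((rbv ^ core_filter) & mask).count("1")


def interference_metric(rbv, core_filter, n_bits):
    """Complement of symbiosis over an N-bit filter."""
    return n_bits - symbiosis(rbv, core_filter, n_bits)


rbv, core = 0b11100000, 0b00001100
print(symbiosis(rbv, core, 8))            # 5
print(interference_metric(rbv, core, 8))  # 3
```

With N = 8, these two vectors differ in five positions, reproducing the slide's Symbiosis = 5 example and leaving an interference metric of 8 - 5 = 3.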
18
Process-to-Core Mapping Algorithms
– A1: Use occupancy weight
– A2: Use interference graph
– A3: Use weighted interference graph
19
A1: Weight Sorted Algorithm
– Sort all processes by occupancy weight
– Form groups from the sorted list; # of processes in a group = # of processes / # of cores
– Map processes to cores based on the sorting results
Example (process, occupancy weight): P0 100, P4 99, P2 70, P5 65, P6 43, P3 20, P1 15 → Cores A, B, C, D, each with a private L1 cache
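A1 can be sketched as sort-then-chunk. The slide fixes the group size but not which processes share a group; this sketch assumes contiguous slices of the sorted list, so the heaviest processes time-share one core, in line with the earlier observation that incompatible processes contend less on the same core. `weight_sorted_mapping` is a hypothetical name, and rounding the group size up when processes do not divide evenly across cores is also an assumption.

```python
import math


def weight_sorted_mapping(weights, cores):
    """Sort by occupancy weight (descending) and give each core a contiguous group."""
    order = sorted(weights, key=weights.get, reverse=True)
    group_size = math.ceil(len(order) / len(cores))  # processes per core
    return {core: order[k * group_size:(k + 1) * group_size]
            for k, core in enumerate(cores)}


weights = {"P0": 100, "P4": 99, "P2": 70, "P5": 65, "P6": 43, "P3": 20, "P1": 15}
print(weight_sorted_mapping(weights, ["A", "B", "C", "D"]))
```

On the slide's seven processes and four cores this yields A: P0, P4; B: P2, P5; C: P6, P3; D: P1.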
20
A2: Interference Graph Algorithm
– Form an interference graph using the interference metric
– Find the MAX-CUT of the graph
Interference metrics against cores A and B, and the core each process was on:
– P0: C_A = 20, C_B = 30 (was in C_A)
– P1: C_A = 10, C_B = 45 (was in C_A)
– P2: C_A = 40, C_B = 25 (was in C_B)
– P3: C_A = 15, C_B = 50 (was in C_B)
[Interference graph: nodes P0 (A), P1 (A), P2 (B), P3 (B); the edge between P0 and P2 is annotated with 30 and 40]
21
A2: Interference Graph Algorithm (build)
[Interference graph: the two values combine into the P0–P2 edge weight, 30 + 40 = 70]
22
A2: Interference Graph Algorithm (build)
[Interference graph: all six edges of P0 (A), P1 (A), P2 (B), P3 (B) weighted: 70, 60, 30, 75, 45, 85]
23
A2: Interference Graph Algorithm (result)
[Figure: after MAX-CUT, the graph is redrawn as P1 (A), P3 (B), P0 (A), P2 (B), with edge weights 85 and 45 shown]
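The edge weights on these slides are consistent with a simple rule: the edge between Pi and Pj is Pi's interference metric against the core Pj was on plus Pj's metric against the core Pi was on (for example, P0–P2 = C_B(P0) + C_A(P2) = 30 + 40 = 70, and P0–P1 = 20 + 10 = 30). The sketch below encodes that reading, which is ours rather than stated on the slides, and solves MAX-CUT by exhaustive search. Brute force is a stand-in chosen for clarity; MAX-CUT is NP-hard, and the slides do not say which solver the authors use. The function names are hypothetical.

```python
from itertools import product

# Slide values: interference metric of each process against core A's and
# core B's filters, plus the core each process was on before remapping.
C = {
    "P0": {"A": 20, "B": 30},
    "P1": {"A": 10, "B": 45},
    "P2": {"A": 40, "B": 25},
    "P3": {"A": 15, "B": 50},
}
PREV_CORE = {"P0": "A", "P1": "A", "P2": "B", "P3": "B"}


def interference_graph(metric, prev_core):
    """Edge (i, j) = i's metric vs. j's old core + j's metric vs. i's old core."""
    procs = sorted(metric)
    edges = {}
    for k, i in enumerate(procs):
        for j in procs[k + 1:]:
            edges[(i, j)] = metric[i][prev_core[j]] + metric[j][prev_core[i]]
    return edges


def brute_force_max_cut(nodes, edges):
    """Exhaustive MAX-CUT; exponential, so only viable for a handful of nodes."""
    first, rest = nodes[0], nodes[1:]
    best_value, best_side = -1, None
    for bits in product((0, 1), repeat=len(rest)):
        side = {first: 0, **dict(zip(rest, bits))}  # pin first node: no mirror cuts
        value = sum(w for (i, j), w in edges.items() if side[i] != side[j])
        if value > best_value:
            best_value, best_side = value, side
    return best_value, best_side


edges = interference_graph(C, PREV_CORE)
value, side = brute_force_max_cut(sorted(C), edges)
print(edges)
print(value, side)
```

On the slide's four processes the search separates {P0, P1} from {P2, P3}, with total cut weight 260. How the two sides of the cut are then turned into a core assignment is not spelled out on the slides.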
24
A3: Weighted Interference Graph Algorithm
– Addresses cases of high interference
– Weights the edges of the interference graph by occupancy weight (OW)
– Otherwise the same as A2
– P0: OW = 90, C_A = 20, C_B = 30 (was in C_A)
– P1: OW = 85, C_A = 10, C_B = 45 (was in C_A)
– P2: OW = 50, C_A = 40, C_B = 25 (was in C_B)
– P3: OW = 100, C_A = 15, C_B = 50 (was in C_B)
[Interference graph: nodes P0 (A), P1 (A), P2 (B), P3 (B); the P0–P2 edge is annotated with 90*30 and 50*40]
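Extending the A2 reading, the 90*30 and 50*40 annotations suggest each term of an edge is scaled by the occupancy weight of the process it comes from, giving P0–P2 = 90·30 + 50·40 = 4700. The self-contained sketch below assumes that rule and again substitutes a brute-force MAX-CUT for whatever solver the authors use; the function names are hypothetical.

```python
from itertools import product

OW = {"P0": 90, "P1": 85, "P2": 50, "P3": 100}  # occupancy weights (slide values)
C = {                                           # interference metrics (slide values)
    "P0": {"A": 20, "B": 30},
    "P1": {"A": 10, "B": 45},
    "P2": {"A": 40, "B": 25},
    "P3": {"A": 15, "B": 50},
}
PREV_CORE = {"P0": "A", "P1": "A", "P2": "B", "P3": "B"}


def weighted_interference_graph(metric, ow, prev_core):
    """Assumed rule: scale each endpoint's term by that process's occupancy weight."""
    procs = sorted(metric)
    edges = {}
    for k, i in enumerate(procs):
        for j in procs[k + 1:]:
            edges[(i, j)] = (ow[i] * metric[i][prev_core[j]]
                             + ow[j] * metric[j][prev_core[i]])
    return edges


def brute_force_max_cut(nodes, edges):
    """Exhaustive MAX-CUT over all two-way partitions (first node pinned to side 0)."""
    first, rest = nodes[0], nodes[1:]
    best_value, best_side = -1, None
    for bits in product((0, 1), repeat=len(rest)):
        side = {first: 0, **dict(zip(rest, bits))}
        value = sum(w for (i, j), w in edges.items() if side[i] != side[j])
        if value > best_value:
            best_value, best_side = value, side
    return best_value, best_side


edges = weighted_interference_graph(C, OW, PREV_CORE)
value, side = brute_force_max_cut(sorted(C), edges)
print(edges[("P0", "P2")])  # 4700
print(value, side)
```

With these weights the best cut is again {P0, P1} versus {P2, P3}, now with cut weight 20050.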
25
Performance Evaluation
26
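Evaluation Methodology
– Gather footprints in an emulator: P1, P2, P3, …, PN run on Fedora Linux in Simics (x86), with footprints collected through a “magic” interface
– Compute the process-to-core mapping
– Native x86 run: P1, P2, P3, …, PN on an Intel Core 2
– VM run: P1, P2, …, PN on Linux over the Xen hypervisor, on an Intel Core 2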
27
Performance Results
– Maximum performance improvement of up to 54%
– Average performance improvement of up to 23%
28
Performance of Virtualized Systems
– Maximum performance improvement of up to 26%
– Average performance improvement of up to 9.5%
29
Performance Sensitivity of the 3 Algorithms
– The weighted interference graph algorithm (A3) performs best
30
Conclusion
– Shared resource (e.g., LLC) management is critical
– Capture the cache reference behavior of each process
– Symbiotic scheduling with Bloom filter signatures
– Measured speedup of 22% (up to 54%) on Intel Core 2
31
That’s All, Folks!