Published by Cory Russell. Modified over 9 years ago.
1 DIEF: An Accurate Interference Feedback Mechanism for Chip Multiprocessor Memory Systems
Magnus Jahre†, Marius Grannaes†‡ and Lasse Natvig†
† Norwegian University of Science and Technology
‡ Energy Micro
2 Chip Multiprocessor Resources
Hardware-controlled, shared resources
–Interconnect bandwidth
–Shared cache capacity
–Memory bus bandwidth
–Memory capacity is allocated by the operating system
Interference can occur in all shared units
Current CMP implementations do not take interference into account
3 Why Control Resource Allocation?
Provide predictable performance
Support OS scheduler assumptions
Cloud: Fulfill Service Level Agreement
4 Resource Allocation Tasks
Focus of this work
5 Resource Allocation Baselines
Baseline = Interference-free configuration
Quantify performance impact from interference
Private Mode and Shared Mode
6 Multi-Programmed Baseline
All processes in a workload run concurrently
Static and equal partitioning of all shared resources
7 Single Program Baseline
The process runs alone on one core
All other cores are idle
Exclusive access to all shared resources
8 Baseline Weaknesses
Multiprogrammed Baseline
–Only accounts for interference in partitioned resources
–Static and equal division of DRAM bandwidth does not give equal latency
–Complex relationship between resource allocation and performance
Single Program Baseline
–Does not exist in shared mode
Dynamic Interference Estimation Framework (DIEF)
9 Outline
Introduction
Dynamic Interference Estimation Framework
–Shared Cache
–Memory Bus
–On-chip interconnect
Results
Summary
10 Interference Estimation
Full-System Interference Estimation
Aggregate interference from different units
Common unit of measure
Average Latency (Clock Cycles)
DIEF
General, component-based framework
11 Interference Definition
[Figure: diagram relating Shared Mode Latency, Private Mode Latency Measurement and Private Mode Latency Estimate; the labeled quantities are Interference and Private Mode Latency Estimate Error]
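One plausible reading of the labels on this slide, written out as a small Python sketch. It assumes that interference is the shared mode latency minus the private mode latency, and that the estimate error is the private mode estimate minus the private mode measurement. The function names and the sign conventions are assumptions made for illustration; the slide itself only names the quantities.

```python
def interference(shared_latency, private_latency):
    """Extra average latency (clock cycles) seen in shared mode.

    Assumed definition: shared mode latency minus private mode latency.
    """
    return shared_latency - private_latency


def private_estimate_error(private_estimate, private_measurement):
    """Error of the private mode latency estimate (clock cycles).

    Assumed sign convention: positive when the estimate is too high.
    """
    return private_estimate - private_measurement


# Hypothetical numbers, chosen only to show the arithmetic.
shared = 250        # measured in shared mode
private_est = 190   # estimated while running in shared mode
private_meas = 180  # measured in a separate private mode run

estimated_interference = interference(shared, private_est)
actual_interference = interference(shared, private_meas)
error = private_estimate_error(private_est, private_meas)
```

Under these assumptions, an error in the private mode estimate carries over directly, with opposite sign, to the interference estimate.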
12 Shared Cache Interference
[Figure: CPU 0 and CPU 1 issue cache accesses to the Shared Cache (blocks A, B, M, N); Auxiliary Tag Directories are shown alongside]
13 Shared Cache Interference
[Figure: same setup as the previous slide, with block C now inserted into the Shared Cache]
Eviction may not be interference
14 Shared Cache Interference
[Figure: same setup as the previous slides; an access to block B is labeled with Hit and Miss]
Interference cost = miss penalty
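The idea behind the three cache slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' hardware design: it assumes a fully associative LRU cache, one auxiliary tag directory per CPU with the same capacity as the shared cache, and a fixed miss penalty. An access counts as interference only when it misses in the shared cache but hits in the CPU's own tag directory, which is how "eviction may not be interference" falls out. The class and function names are hypothetical.

```python
from collections import OrderedDict


class LRUTags:
    """Fully associative LRU tag store (assumption: real caches are set-associative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tags = OrderedDict()

    def access(self, tag):
        """Return True on a hit; update LRU order and insert the tag on a miss."""
        hit = tag in self.tags
        if hit:
            self.tags.move_to_end(tag)
        else:
            if len(self.tags) >= self.capacity:
                self.tags.popitem(last=False)  # evict the least recently used tag
            self.tags[tag] = True
        return hit


def cache_interference_cost(trace, num_cpus, capacity, miss_penalty):
    """Return per-CPU interference cost in cycles for a trace of (cpu, block) accesses."""
    shared = LRUTags(capacity)
    # Each auxiliary tag directory emulates the cache as if its CPU ran alone.
    atds = [LRUTags(capacity) for _ in range(num_cpus)]
    cost = [0] * num_cpus
    for cpu, block in trace:
        shared_hit = shared.access((cpu, block))  # CPUs do not share data here
        private_hit = atds[cpu].access(block)
        if private_hit and not shared_hit:
            cost[cpu] += miss_penalty  # miss caused by another CPU's accesses
    return cost


# CPU 1 evicts A and B. Only B is reused, so only B's miss is interference.
trace = [(0, "A"), (0, "B"), (1, "M"), (1, "N"), (0, "B")]
costs = cache_interference_cost(trace, num_cpus=2, capacity=2, miss_penalty=200)
```

In this trace block A is evicted by CPU 1 but never accessed again, so its eviction costs CPU 0 nothing.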
15 Bus Interference Requirements
Out-of-order memory bus scheduling
Shared-mode-only cache misses and cache hits
Shared cache writebacks
Computing private latency based on shared mode queue contents is difficult
Emulate private scheduling in the shared mode
16 [Figure: Shared Bus Queue holding requests A to E, with Arrival Order, Execution Order and Head Pointer; Memory Latency Estimation Buffer; Open Page Emulation Registers (one per bank); Latency Lookup Table]
Bank/Page Mapping: A (0,15), B (0,19), C (0,15), D (1,32)
Estimated Queue Latency: shown as a sum of per-request latencies (values on the slide: 120, 40, 200)
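A rough sketch of how open page emulation registers and a latency lookup table could be combined to estimate private mode queue latency. Everything here is an assumption made for illustration: the three latency classes, their cycle values (chosen to agree with the numbers visible on the slide), and the rule that the caller supplies the requests in the order a private mode scheduler would execute them. The names are hypothetical and this is not the authors' implementation.

```python
# Assumed latency lookup table in clock cycles (illustrative values).
LATENCY = {"page_hit": 40, "bank_closed": 120, "page_conflict": 200}


def estimate_private_queue_latency(requests, open_pages=None):
    """Sum the emulated latency of (bank, page) requests in the given order.

    open_pages maps bank -> currently open page and plays the role of the
    open page emulation registers; banks missing from it are treated as closed.
    """
    registers = dict(open_pages or {})
    total = 0
    for bank, page in requests:
        current = registers.get(bank)
        if current is None:
            kind = "bank_closed"
        elif current == page:
            kind = "page_hit"
        else:
            kind = "page_conflict"
        total += LATENCY[kind]
        registers[bank] = page  # the accessed page is now open in that bank
    return total


# Requests A, C and B from the slide's bank/page mapping, in an assumed
# execution order that serves the page hit (C) before the conflict (B).
A, B, C = (0, 15), (0, 19), (0, 15)
estimated = estimate_private_queue_latency([A, C, B])
```

With these assumed values, A opens page 15 in bank 0 (120 cycles), C hits the open page (40 cycles), and B conflicts with it (200 cycles).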
17 Interconnect Interference
[Figure: CPU 0 and CPU 1 send requests (A, B, C, E, F) to L2 Bank 0 and L2 Bank 1; Interference Counters record the delay (values shown: 0, 0, 4, 8)]
CPU 1 delays CPU 0
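The slide shows the interconnect mechanism only as a figure, so the following is a simplified software model rather than the counter hardware itself. It assumes a single FIFO-served link and attributes to each CPU the extra waiting time it sees compared with a replay of only its own requests. The function names and the request format are hypothetical.

```python
def fifo_wait_cycles(requests):
    """Return the wait of each (cpu, arrival_cycle, service_cycles) request.

    Requests must be sorted by arrival and are served one at a time in list order.
    """
    free_at = 0
    waits = []
    for _cpu, arrival, service in requests:
        start = max(arrival, free_at)
        waits.append(start - arrival)
        free_at = start + service
    return waits


def interconnect_interference(requests, num_cpus):
    """Per-CPU interference: shared mode wait minus emulated private mode wait."""
    shared_waits = fifo_wait_cycles(requests)
    counters = [0] * num_cpus
    for cpu in range(num_cpus):
        own = [r for r in requests if r[0] == cpu]
        private_waits = fifo_wait_cycles(own)  # the same requests, run alone
        own_shared = [w for r, w in zip(requests, shared_waits) if r[0] == cpu]
        counters[cpu] = sum(s - p for s, p in zip(own_shared, private_waits))
    return counters


# Two requests from CPU 1 occupy the link for 4 cycles each; CPU 0 queues behind them.
requests = [(1, 0, 4), (1, 0, 4), (0, 0, 4)]
counters = interconnect_interference(requests, num_cpus=2)
```

CPU 1's second request also waits 4 cycles, but it would wait just as long in private mode, so that delay is not counted as interference.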
18 Outline
Introduction
Dynamic Interference Estimation Framework
–Shared Cache
–Memory Bus
–On-chip interconnect
Results
Summary
19 Relative Estimation Errors
20 RMS Error Breakdown
Remaining units contribute less than 2 clock cycles
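The result slides report relative errors and RMS errors without spelling out the formulas. The sketch below uses the standard definitions and assumes they are applied to pairs of estimated and measured private mode latencies; the slides do not confirm exactly how the authors computed these metrics, and the function names are hypothetical.

```python
import math


def relative_error(estimate, measurement):
    """Signed relative error of one estimate with respect to its measurement."""
    return (estimate - measurement) / measurement


def rms_error(estimates, measurements):
    """Root mean square of the absolute errors, in the unit of the inputs."""
    if len(estimates) != len(measurements) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - m) ** 2 for e, m in zip(estimates, measurements)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical latencies in clock cycles.
rms = rms_error([101, 197], [104, 194])
rel = relative_error(110, 100)
```

Because the errors are squared before averaging, positive and negative errors do not cancel in the RMS figure.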
21 Auxiliary Tag Directory Accuracy
22 Outline
Introduction
Dynamic Interference Estimation Framework
–Shared Cache
–Memory Bus
–On-chip interconnect
Results
Summary
23 Summary
Memory system interference causes unpredictable performance
DIEF provides
–Accurate private mode latency estimates
–Accurate shared mode latency measurements
Future opportunities
–Guiding dynamic optimizations
–Guiding OS scheduling decisions
–Debugging and optimization
24 Thank you!
Visit our website: http://research.idi.ntnu.no/multicore/
Questions?
25 Experiment Methodology
M5 simulator
–Extended with crossbar and ring on-chip interconnect models
–DDR2 memory bus model
Randomly generated workloads of SPEC2000 benchmarks
–40 4-core workloads
–20 8-core workloads
–10 16-core workloads
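The workload counts above can be reproduced with a short script. This is a sketch under stated assumptions: it draws from all 26 SPEC CPU2000 benchmarks, picks benchmarks without repetition within a workload, and uses a fixed seed. The slide does not say which benchmarks, sampling rule or seed the authors used, and the function name is hypothetical.

```python
import random

# The 26 SPEC CPU2000 benchmarks (12 integer, 14 floating point).
SPEC2000 = [
    "gzip", "vpr", "gcc", "mcf", "crafty", "parser", "eon", "perlbmk",
    "gap", "vortex", "bzip2", "twolf", "wupwise", "swim", "mgrid", "applu",
    "mesa", "galgel", "art", "equake", "facerec", "ammp", "lucas", "fma3d",
    "sixtrack", "apsi",
]


def generate_workloads(counts, benchmarks=SPEC2000, seed=0):
    """Return {cores: [workload, ...]} where each workload lists one benchmark per core.

    Assumption: no benchmark appears twice within a workload.
    """
    rng = random.Random(seed)  # fixed seed so the workloads are reproducible
    return {
        cores: [rng.sample(benchmarks, cores) for _ in range(how_many)]
        for cores, how_many in counts.items()
    }


# Counts taken from the slide: 40 4-core, 20 8-core and 10 16-core workloads.
workloads = generate_workloads({4: 40, 8: 20, 16: 10})
```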