MacSim Tutorial (In ICPADS 2013)

1 MacSim Tutorial (In ICPADS 2013)

2 |The Structural Simulation Toolkit (SST): A Parallel Architectural Simulator (for HPC)
    A parallel simulation environment based on MPI
    Fully modular design that enables extensive exploration of an individual system parameter without intrusive changes to the simulator
    Includes a parallel simulation core, configuration, power models, basic network and processor models, and an interface to a detailed memory model
    SST download link: http://sst.sandia.gov/

4 |Processor Components
    MacSim
    Gem5
  |Memory Components
    DRAMSim2
    VaultSim (3D memory model)
    MemHierarchy
  |Network Components
    Merlin
    Iris

5 |Multiple MacSim components can be instantiated
  |Each of them can act as
    An entire GPU node (composed of multiple SMs)
    A heterogeneous computing node (CPU + GPU)
    A GPU/CPU core
    Any combination of the above

6 |MacSim can talk to memHierarchy
  |MacSim can use memHierarchy’s cache hierarchy, so it can be configured with whatever memory system is connected to memHierarchy: DRAMSim2 or VaultSim.
  |Pipeline Stages with memHierarchy
    [Figure: pipeline stages Front-end, Decode, Rename, Schedule, Execution, Retire; I-Cache (MH) and D-Cache (MH); VaultSim; SST Link]

7 |MacSim can talk directly to
    DRAMSim2
    VaultSim
  |Through its memory controller interface, MacSim can talk directly to DRAMSim2 and VaultSim.
  |Pipeline Stages with external memory component
    [Figure: pipeline stages Front-end, Decode, Rename, Schedule, Execution, Retire; I-Cache (MS) and D-Cache (MS), each paired with VaultSim; SST Link]

8 |memHierarchy: an SST component that models a memory hierarchy, such as multiple cache levels
    Subcomponents: Cache, Bus, Memory Controller
  |Usage
    Processor Component(s) + memHierarchy(s) + Memory Component(s)
      MacSim + L1/L2 cache + DRAMSim2
      MacSim + L1/L2 cache + 3D memory model
      (MacSim + private L1 cache) + (Gem5 + private L1 cache) + shared L2 cache + (DRAMSim2 or 3D memory model)

9 |MacSim is encapsulated as an SST component; SST feeds clocks into MacSim and provides communication channels.
  |By talking to memHierarchy, MacSim can communicate indirectly with a variety of memory components without modifying its interface.
    [Figure: MacSim (SST::Component), L1 (memHierarchy), L2 (memHierarchy), and DRAMSim2 / VaultSim (SST::Component), connected by SST::Link]

10 [Figure: two example configurations built from SST::Component blocks connected by SST::Link. Labels in the first: Gem5, MacSim, L1 (memHierarchy), L2 (memHierarchy), DRAMSim2, VaultSim. Labels in the second: MacSim, L1, L2 (memHierarchy), LLC (VaultSim), DRAMSim2.]

11 Make sure macsimComponent does not contain a .ignore file; otherwise the SST build system will ignore the component
   How to build: see the instructions on the SST website
   How to execute: pay special attention to the following files
     SDL (or XML): SST component configuration
     trace_file_list: which trace to execute; it can also be specified in the SDL file
     params.in: MacSim configuration, in which you can specify
       Whether MacSim uses its internal cache or memHierarchy as its cache
       Which DRAM controller to use: its internal FCFS/FRFCFS-based controller, the DRAMSim2 controller, or the VaultSim controller
   Specific examples are given in the following slides
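The .ignore check from the build notes can be scripted before running the SST build. This is a minimal sketch in Python; the function name `blocks_sst_build` and the example directory path are hypothetical, and only the rule stated on the slide (a .ignore file in the component directory makes the build skip it) is assumed.

```python
from pathlib import Path


def blocks_sst_build(component_dir: str) -> bool:
    """Return True if the component directory contains a .ignore file."""
    return (Path(component_dir) / ".ignore").is_file()


if __name__ == "__main__":
    # Hypothetical location of the component inside an SST source tree.
    component = "sst/elements/macsimComponent"
    if blocks_sst_build(component):
        print(f"Remove {component}/.ignore before building")
    else:
        print("No .ignore file found in the given directory")
```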

12 |params.in
     use_memhierarchy = 0
     dram_scheduling_policy = FRFCFS or FCFS
   |SDL (or XML)
     Nothing except the macsimComponent configuration
     In this case, the link configuration is not used

13 |params.in
     use_memhierarchy = 1
     Note: when use_memhierarchy is set to 1, MacSim’s DRAM controller configuration has no effect
   |SDL (or XML)
     Specify memHierarchy’s cache configuration (the SDL example shown on the slide is not part of this transcript)
     The D-cache is configured similarly

14 |params.in
     use_memhierarchy = 1
     Note: when use_memhierarchy is set to 1, MacSim’s DRAM controller configuration has no effect
   |SDL (or XML)
     Specify the MemController configuration for DRAMSim2 (the SDL example shown on the slide is not part of this transcript)
     Note: the DRAMSim2 configurations should be appended

15 |params.in
     use_memhierarchy = 1
     Note: when use_memhierarchy is set to 1, MacSim’s DRAM controller configuration has no effect
   |SDL (or XML)
     Specify the MemController configuration for VaultSim (the SDL example shown on the slide is not part of this transcript)
     Note: the VaultSim configurations should be appended

16 |params.in
     use_memhierarchy = 0
     dram_scheduling_policy = DRAMSIM
   |SDL (or XML)
     Specify the configurations for DRAMSim2 (the SDL example shown on the slide is not part of this transcript)

17 |params.in
     use_memhierarchy = 0
     dram_scheduling_policy = VAULTSIM
   |SDL (or XML)
     Nothing special, except that macsimComponent’s mem_link must match VaultSim’s toCPU link
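The configuration cases on the preceding slides reduce to a decision on two params.in settings. The sketch below encodes that decision in Python as a reading aid only. The function name `memory_path`, its `sdl_backend` argument, and the returned strings are illustrative and are not part of MacSim or SST. In the memHierarchy case the memory component is chosen in the SDL file, not in params.in, which is why it is passed separately here.

```python
from typing import Optional

# Policy values as they appear on the slides.
INTERNAL_POLICIES = {"FCFS", "FRFCFS"}
EXTERNAL_POLICIES = {"DRAMSIM": "DRAMSim2", "VAULTSIM": "VaultSim"}


def memory_path(use_memhierarchy: int,
                dram_scheduling_policy: str,
                sdl_backend: Optional[str] = None) -> str:
    """Describe which cache and memory model a setting combination selects."""
    if use_memhierarchy == 1:
        # MacSim's own DRAM controller setting has no effect in this case;
        # the memory component comes from the SDL file.
        backend = sdl_backend or "memory component chosen in SDL"
        return f"memHierarchy cache -> {backend}"
    if use_memhierarchy != 0:
        raise ValueError("use_memhierarchy must be 0 or 1")

    policy = dram_scheduling_policy.upper()
    if policy in INTERNAL_POLICIES:
        return f"MacSim internal cache -> MacSim internal {policy} controller"
    if policy in EXTERNAL_POLICIES:
        return f"MacSim internal cache -> {EXTERNAL_POLICIES[policy]}"
    raise ValueError(f"unknown dram_scheduling_policy: {dram_scheduling_policy}")


if __name__ == "__main__":
    for args in [(0, "FRFCFS"), (1, "FCFS", "DRAMSim2"), (0, "VAULTSIM")]:
        print(args, "=>", memory_path(*args))
```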

19 Front-end
     Thread fetch policies
     Branch predictor
   Memory System
     Software and hardware prefetcher
     Cache studies (sharing, inclusion)
     DRAM scheduling
     Interconnection studies
   Misc.
     Power model

20 [Figure: Trace Generator (PIN, GPUOcelot), MacSim Frontend, Memory System, Hardware Prefetcher]
   Software prefetch instructions
     PTX → prefetch, prefetchu
     x86 → prefetcht0, prefetcht1, prefetchnta
   Hardware prefetch requests
     Stream, stride, GHB, …
   Many-thread Aware Prefetching Mechanism [Lee et al. MICRO-43, 2010]
   When prefetching works, when it doesn’t, and why [Lee et al. ACM TACO, 2012]
   Spare Register Aware Prefetching for Graph Algorithms on GPUs [Lakshminarayana, HPCA 2014]

21 |Cache studies: sharing, inclusion property
   |On-chip interconnection studies
   TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]
   [Figure: private caches ($) and a shared cache ($), each connected through an interconnection]

22 |Heterogeneous link configuration
   [Figure: ring network connecting CPU (C), GPU (G), L3, and memory controller (MC / M) nodes, shown with different topologies]
   On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al. JPDC, 2013]

23 [Figure: Trace Generator (GPUOcelot), Frontend, Execution, DRAM]
   Frontend (instruction fetch): RR, ICOUNT, FAIR, LRF, …
   DRAM (memory scheduling): FCFS, FRFCFS, FAIR, …
   Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]

24 [Figure: Core-0 and Core-1 send requests to a DRAM controller in front of a DRAM bank; each core has request queues W0 to W3 holding row-hit (RH) and row-miss (RM) requests]
   Potential of requests from Core-0 = $|W_0|^{\alpha} + |W_1|^{\alpha} + |W_2|^{\alpha} + |W_3|^{\alpha} = 4^{\alpha} + 3^{\alpha} + 5^{\alpha}$, with $\alpha < 1$
   Reduction in potential if
     a row hit from a queue of length $L$ is serviced next → $L^{\alpha} - (L - 1)^{\alpha}$
     a row miss from a queue of length $L$ is serviced next → $L^{\alpha} - (L - 1/m)^{\alpha}$
     where $m$ = cost of servicing a row miss / cost of servicing a row hit
   Tolerance(Core-0) < Tolerance(Core-1) → select Core-0
   Servicing a row hit from W1 (of Core-0) gives the greatest reduction in potential, so row hits from W1 are serviced next
   DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al. IEEE CAL, 2011]
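The potential-function arithmetic on this slide can be checked numerically. The following Python sketch is an illustration under stated assumptions, not the scheduler from the cited paper: the value alpha = 0.5, the ratio m = 2, and the placement of the empty queue at W2 are all chosen here for the example (the slide only gives the non-zero lengths 4, 3, and 5), and every non-empty queue is assumed to have a row hit available.

```python
from typing import List, Optional


def potential(queue_lengths: List[int], alpha: float) -> float:
    """Sum of |W_i|^alpha over one core's request queues."""
    return sum(length ** alpha for length in queue_lengths)


def reduction(length: int, alpha: float, row_hit: bool, m: float) -> float:
    """Drop in potential when one request from a queue of this length is serviced.

    A row hit counts as one full request; a row miss counts as 1/m of one,
    where m = cost of a row miss / cost of a row hit (assumed >= 1).
    """
    if length < 1 or m < 1:
        raise ValueError("need a non-empty queue and m >= 1")
    progress = 1.0 if row_hit else 1.0 / m
    return length ** alpha - (length - progress) ** alpha


def best_row_hit_queue(queue_lengths: List[int], alpha: float) -> Optional[int]:
    """Index of the non-empty queue whose row hit reduces the potential most."""
    candidates = [i for i, n in enumerate(queue_lengths) if n > 0]
    if not candidates:
        return None
    return max(candidates,
               key=lambda i: reduction(queue_lengths[i], alpha, True, 1.0))


if __name__ == "__main__":
    core0 = [4, 3, 0, 5]   # W0..W3; the position of the empty queue is assumed
    alpha = 0.5            # any 0 < alpha < 1 makes the potential concave
    print("potential:", potential(core0, alpha))
    print("best queue index:", best_row_hit_queue(core0, alpha))
```

Because the potential is concave for alpha < 1, the shortest non-empty queue yields the largest reduction, which matches the slide's choice of W1 (length 3).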

25 [Figure: Trace Generator (PIN, GPUOcelot) feeds CPU traces (x86) and GPU traces (CUDA) to the out-of-the-box MacSim Frontend; memory requests pass through the Cache Hierarchy to Off-Chip Memory in the Memory System; a 3D Stacked DRAM Model (new module) adds DRAM stacks]
   Configure the 3D stack as
     DRAM caches
     Part of main memory
   Resilient Die-stacked DRAM Caches [Sim et al., ISCA-40, 2013]
   A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch [Sim et al., MICRO, 2012]

26 |Verifying the simulator against the GTX580
   |Modeling x86 CPU power
   |Modeling GPU power
     Still ongoing research

27 2013 ~ 2014
     Power/Energy Model
     ARM Architecture
     Mobile Platform
