MacSim Tutorial (ICPADS 2013)


|The Structural Simulation Toolkit: A Parallel Architectural Simulator (for HPC)
  A parallel simulation environment based on MPI
  Fully modular design that enables extensive exploration of an individual system parameter without intrusive changes to the simulator
  Includes a parallel simulation core, configuration, power models, basic network and processor models, and an interface to a detailed memory model
  SST download link:


|Processor Components
  MacSim
  Gem5
|Memory Components
  DRAMSim2
  VaultSim (3D memory model)
  memHierarchy
|Network Components
  Merlin
  Iris

|Multiple MacSim components can be instantiated
|Each of them can act as
  An entire GPU node (composed of multiple SMs)
  A heterogeneous computing node (CPU + GPU)
  A single GPU/CPU core
  Any combination of the above

|MacSim can talk to memHierarchy
|MacSim can use memHierarchy's cache hierarchy; this means MacSim can be configured with whatever memory system memHierarchy is connected to, e.g., DRAMSim2 or VaultSim
|Pipeline stages with memHierarchy: Front-end → Decode → Rename → Schedule → Execution → Retire, with the I-Cache and D-Cache provided by memHierarchy (MH) and backed by VaultSim over an SST Link

|MacSim can directly talk to DRAMSim2 and VaultSim through its versatile memory controller interface
|Pipeline stages with an external memory component: Front-end → Decode → Rename → Schedule → Execution → Retire, with the I-Cache and D-Cache kept inside MacSim (MS) and backed by VaultSim over an SST Link

|An SST component that models a memory hierarchy, such as multiple cache levels
  Sub-components: Cache, Bus, Memory Controller
|Usage: Processor Component(s) + memHierarchy + Memory Component(s)
  MacSim + L1/L2 cache + DRAMSim2
  MacSim + L1/L2 cache + VaultSim (3D memory model)
  (MacSim + private L1 cache) + (Gem5 + private L1 cache) + shared L2 cache + (DRAMSim2 or 3D memory model)

|MacSim is encapsulated as an SST::Component; SST feeds clocks into MacSim and provides communication channels (SST::Links)
|By talking to memHierarchy, MacSim can indirectly communicate with many memory components without having to modify its own interface
[Diagram: MacSim (SST::Component) → SST::Link → L1/L2 (memHierarchy) → SST::Link → DRAMSim2 or VaultSim]

[Diagrams: further example wirings of Gem5 and MacSim SST::Components with L1/L2 caches (memHierarchy) connected through SST::Links to DRAMSim2 or to an LLC modeled by VaultSim]

|Make sure macsimComponent does not have a .ignore file; otherwise the SST build system will ignore the component
|How to build: see the instructions on the SST website
|How to execute: pay special attention to the following files
  SDL (or XML): SST component configuration
  trace_file_list: which trace to execute; can also be specified in the aforementioned SDL file
  params.in: MacSim configuration, in which you can specify
    Whether MacSim uses its internal cache or memHierarchy as the cache
    Which DRAM controller to use among the internal FCFS/FRFCFS-based controllers, the DRAMSim2 controller, and the VaultSim controller
Specific examples are elaborated in the following slides

|params.in
  use_memhierarchy = 0
  dram_scheduling_policy = FRFCFS or FCFS
|SDL (or XML)
  Nothing except the macsimComponent configuration; in this case the link configuration is not used
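Collected into a single fragment, the params.in knobs used across these configuration slides might look as follows. The key names and values come from the slides themselves; the file layout and comments are my own sketch and should be checked against the MacSim documentation:

```ini
# params.in -- memory-path knobs (illustrative sketch)
use_memhierarchy       = 0        # 1 = use memHierarchy as the cache hierarchy
dram_scheduling_policy = FRFCFS   # FCFS | FRFCFS | DRAMSIM | VAULTSIM
```

Note that, as the later slides point out, dram_scheduling_policy is ignored whenever use_memhierarchy is set to 1.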

|params.in
  use_memhierarchy = 1
  Note: when use_memhierarchy is set to 1, MacSim's DRAM controller configuration has no effect at all
|SDL (or XML)
  Specify memHierarchy's cache configuration (shown for the I-cache in the original slide); the D-cache is configured similarly
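The cache configuration itself appeared only as a figure in the original slide. As a purely illustrative sketch, an I-cache entry in the SDL might look like the following; the component type, parameter names, and port names are assumptions based on memHierarchy's Cache element and must be checked against your SST version:

```xml
<component name="l1icache" type="memHierarchy.Cache">
  <params>
    <!-- illustrative values only -->
    <cache_frequency> 2GHz </cache_frequency>
    <cache_size> 32KB </cache_size>
    <associativity> 8 </associativity>
    <access_latency_cycles> 2 </access_latency_cycles>
    <cache_line_size> 64 </cache_line_size>
    <L1> 1 </L1>
  </params>
  <!-- one link toward the core, one toward the next memory level -->
  <link name="core_icache_link" port="high_network_0" latency="10ps"/>
  <link name="icache_mem_link" port="low_network_0" latency="10ps"/>
</component>
```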

|params.in
  use_memhierarchy = 1
  Note: when use_memhierarchy is set to 1, MacSim's DRAM controller configuration has no effect at all
|SDL (or XML)
  Specify the MemController configuration for DRAMSim2
  Note: DRAMSim2's own configurations should be appended

|params.in
  use_memhierarchy = 1
  Note: when use_memhierarchy is set to 1, MacSim's DRAM controller configuration has no effect at all
|SDL (or XML)
  Specify the MemController configuration for VaultSim
  Note: VaultSim's own configurations should be appended

|params.in
  use_memhierarchy = 0
  dram_scheduling_policy = DRAMSIM
|SDL (or XML)
  Specify the configurations for DRAMSim2

|params.in
  use_memhierarchy = 0
  dram_scheduling_policy = VAULTSIM
|SDL (or XML)
  Nothing special, except that macsimComponent's mem_link must be matched to VaultSim's toCPU link
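The link-matching requirement above could be expressed in the SDL roughly as follows. Only the port names mem_link and toCPU come from the slide; the component type names and the exact schema are assumptions and should be verified against your SST installation:

```xml
<!-- a shared link name on both components connects the two ports -->
<component name="macsim0" type="macsimComponent.macsimComponent">
  <link name="cpu_mem_link" port="mem_link" latency="1ns"/>
</component>
<component name="vaultsim0" type="vaultsim.VaultSim">
  <link name="cpu_mem_link" port="toCPU" latency="1ns"/>
</component>
```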


|Front-end
  Thread fetch policies
  Branch predictor
|Memory System
  Software and hardware prefetchers
  Cache studies (sharing, inclusion)
  DRAM scheduling
  Interconnection studies
|Misc.
  Power model

|Trace generators (PIN, GPUOcelot) feed the front-end; prefetchers are modeled in the memory system
|Software prefetch instructions
  PTX: prefetch, prefetchu
  x86: prefetcht0, prefetcht1, prefetchnta
|Hardware prefetch requests
  Stream, stride, GHB, …
|Related publications
  Many-thread Aware Prefetching Mechanisms [Lee et al., MICRO-43, 2010]
  When Prefetching Works, When It Doesn't, and Why [Lee et al., ACM TACO, 2012]
  Spare Register Aware Prefetching for Graph Algorithms on GPUs [Lakshminarayana, HPCA 2014]
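As a concrete illustration of the stride prefetching listed above, here is a minimal PC-indexed stride prefetcher. This is a generic textbook design, not MacSim's actual implementation; the table structure and the confidence threshold are assumptions for the sketch:

```python
class StridePrefetcher:
    """PC-indexed stride prefetcher: learns a per-instruction stride and
    issues a prefetch once the same stride has been seen enough times."""

    def __init__(self, confidence_threshold=2):
        self.table = {}  # pc -> [last_addr, stride, confidence]
        self.threshold = confidence_threshold

    def access(self, pc, addr):
        """Record a demand access; return an address to prefetch, or None."""
        if pc not in self.table:
            self.table[pc] = [addr, 0, 0]
            return None
        entry = self.table[pc]
        stride = addr - entry[0]
        if stride == entry[1] and stride != 0:
            entry[2] += 1          # same stride again: gain confidence
        else:
            entry[1] = stride      # new stride: retrain, reset confidence
            entry[2] = 0
        entry[0] = addr
        if entry[2] >= self.threshold:
            return addr + entry[1]  # predict the next access
        return None
```

With the (made-up) threshold of 2, a PC streaming through 0x100, 0x140, 0x180, 0x1c0 trains on the 0x40 stride, and the fourth access triggers a prefetch of 0x200.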

|Cache studies: sharing, inclusion property
|On-chip interconnection studies
  TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]
[Diagrams: cores with private caches connected by an interconnection, versus cores sharing a cache through an interconnection]

|Heterogeneous link configuration
|Different topologies
  On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al., JPDC 2013]
[Diagrams: a ring network connecting CPU cores (C), GPU cores (G), L3 slices, and memory controllers (M), with different placements of the nodes]

|Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]
[Diagram: trace generator (GPUOcelot) → front-end with fetch policies (RR, ICOUNT, FAIR, LRF, …) → execution → DRAM with scheduling policies (FCFS, FRFCFS, FAIR, …)]

|DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al., IEEE CAL, 2011]
  The DRAM controller keeps per-core queues of pending requests (W0..W3), each holding row hits (RH) and row misses (RM)
  Potential of the requests from a core: |W0|^α + |W1|^α + |W2|^α + |W3|^α, with α < 1 (e.g., 4^α + 3^α + 5^α for Core-0 in the slide's example)
  Reduction in potential if a row hit from a queue of length L is serviced next: L^α - (L - 1)^α
  Reduction in potential if a row miss from a queue of length L is serviced next: L^α - (L - 1/m)^α, where m = cost of servicing a row miss / cost of servicing a row hit
  Since Tolerance(Core-0) < Tolerance(Core-1), Core-0 is selected; servicing a row hit from W1 (of Core-0) yields the greatest reduction in potential, so row hits from W1 are serviced next
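The arithmetic above can be sketched in a few lines of Python. The values α = 0.5 and m = 4 are illustrative (the slide only requires α < 1 and defines m as the miss/hit cost ratio), and three queues are used to match the slide's 4^α + 3^α + 5^α example:

```python
def potential(queue_lengths, alpha=0.5):
    """Potential of a core's requests: sum of |W_i|^alpha, with alpha < 1."""
    return sum(length ** alpha for length in queue_lengths)

def hit_reduction(length, alpha=0.5):
    """Drop in potential if a row hit is serviced from a queue of this length."""
    return length ** alpha - (length - 1) ** alpha

def miss_reduction(length, alpha=0.5, m=4.0):
    """Drop in potential if a row miss is serviced (m = miss cost / hit cost)."""
    return length ** alpha - (length - 1 / m) ** alpha

def best_hit_queue(queues, alpha=0.5):
    """Pick the queue whose row hit gives the greatest potential reduction."""
    return max(queues, key=lambda name: hit_reduction(queues[name], alpha))

# Example from the slide: Core-0's queues hold 4, 3, and 5 requests.
core0 = {"W0": 4, "W1": 3, "W2": 5}
```

Because x^α is concave for α < 1, the shortest queue (W1) yields the largest reduction per serviced row hit, reproducing the slide's conclusion; a row miss always reduces the potential less than a row hit from the same queue.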

|Out-of-the-box MacSim: trace generators (PIN, GPUOcelot) supply CPU traces (x86) and GPU traces (CUDA) to the front-end, cache hierarchy, and memory system
|3D Stacked DRAM Model (new module): memory requests are served by DRAM stacks as off-chip memory
|A 3D stack can be configured as
  A DRAM cache
  Part of main memory
|Related publications
  Resilient Die-stacked DRAM Caches [Sim et al., ISCA-40, 2013]
  A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch [Sim et al., MICRO, 2012]

|Verifying the simulator against a GTX 580
|Modeling x86 CPU power
|Modeling GPU power
  Still ongoing research

|2013-2014
  Power/energy model
  ARM architecture
  Mobile platform

