
HPArch Research Group

MacSim Tutorial (In ISCA-39, 2012)

|Part 2. Overview of MacSim
  Introduction
  For black box approach users
|Part 3. Details of MacSim
  For computer architecture researchers
|Part 4. MacSim-SST case studies
  Ocelot-MacSim case studies
  Research using Ocelot
  Research using MacSim

|Heterogeneous architecture simulator (x86 + PTX)
|Developed at Georgia Tech
|Trace-driven simulator
  Internal RISC-style micro-op generation module
  x86 traces are generated using Pin; PTX traces are generated using GPUOcelot
|Cycle-level simulator
  Cores, caches, and memory systems are modeled
|Supports various simulations: single/multi-threaded applications, multi-program, heterogeneous (CPU+GPU)

|Flexible design to support various platforms
|Integration with a parallel simulator (SST) to support high-performance computing systems
|From mobile to Exascale computing systems

[Figure: MacSim tool flow. Inputs: x86 binaries, CUDA code (.cu), OpenGL code. Tools: PIN (API generator), PIN trace generator, NVCC (compiler) producing PTX code, GPUOcelot trace generator (Prof. Yalamanchili, Georgia Tech), Attila (OpenGL emulator). The trace generators pass instruction and thread information to the heterogeneous architecture timing & power simulator. Part of the flow is marked as ongoing work.]

|Getting MacSim
  Stable version: Google Code project
  Latest code: SVN repository
|Directions are explained in
|How to build
  Chapter 2 of the manual provides build instructions
  README file in the simulator directory

|MacSim package
  IRIS (NoC simulator from Prof. Yalamanchili’s group) is included
  CPU trace generator: download PIN separately. The trace generator tool is in the MacSim package
  GPU trace generator: download Ocelot separately. The trace generator is in Ocelot’s package
|MacSim-SST
  SST needs to be downloaded separately
|Energy Introspector (from Prof. Yalamanchili’s group)
  EI is a power model based on McPAT and HotSpot
  Because of a McPAT license issue, EI cannot currently be distributed, but we will resolve this issue soon

|Once the build process succeeds, the binary is created at macsim-top/trunk/bin/macsim
|[Screenshot of a simulation]
|Now, how do we configure simulation models?

|Knob variables need to be set up (3 ways)
  Default value in the source code
  params.in
  Command line
[Figure: a simulated system with several cores of each type (core type 1, core type 2, core type 3) and memory]

|Configuration: 4 cores, 2-way SMT
  num_sim_cores 4 // 4 cores
  num_sim_small_cores 0
  num_sim_medium_cores 0
  num_sim_large_cores 4
  max_threads_per_large_core 2
  large_core_type x86
  repeat_trace 1
|Knob sources: param.def, params.in, command line
  ./macsim --num_sim_cores=4
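The three knob sources can be pictured as layers that override one another. The following is a minimal Python sketch, not MacSim's own code: it assumes that params.in overrides the source defaults and that the command line overrides both, and the `DEFAULTS` values and helper names are hypothetical.

```python
def parse_params(text):
    """Parse 'name value // comment' lines into a dict of strings."""
    knobs = {}
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing // comments
        if not line or line.startswith("#"):
            continue  # skip blank lines and '# ...' section labels
        parts = line.split(None, 1)
        if len(parts) == 2:
            knobs[parts[0]] = parts[1].strip()
    return knobs


def resolve_knobs(defaults, params_text, argv):
    """Layer the sources: defaults, then params.in, then --name=value args.

    The override order is an assumption made for this sketch.
    """
    knobs = dict(defaults)
    knobs.update(parse_params(params_text))
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            name, value = arg[2:].split("=", 1)
            knobs[name] = value
    return knobs


PARAMS_IN = """\
num_sim_cores 4 // 4 cores
num_sim_small_cores 0
num_sim_medium_cores 0
num_sim_large_cores 4
max_threads_per_large_core 2
large_core_type x86
repeat_trace 1
"""

DEFAULTS = {"num_sim_cores": "1", "repeat_trace": "0"}  # hypothetical defaults
knobs = resolve_knobs(DEFAULTS, PARAMS_IN, ["--num_sim_cores=2"])
print(knobs["num_sim_cores"], knobs["repeat_trace"])
```

Values are kept as strings here so that the sketch does not have to guess each knob's type.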

|To configure a CPU+GPU architecture, set up the number of cores and their types accordingly
  num_sim_cores 8 // 4 CPUs + 4 GPUs
  num_sim_small_cores 4 // 4 GPU
  num_sim_medium_cores 0
  num_sim_large_cores 4 // 4 CPUs
  core_type ptx // specify small cores
  large_core_type x86
  cpu_frequency 3
  gpu_frequency 1.5
  repeat_trace 1
|Usually, we use small cores for the GPU and large cores for the CPU
|A GPU core internally has multiple processing elements (N-wide SIMD)
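A quick consistency check on such a configuration can be sketched in Python. It assumes, based only on the example above (8 = 4 + 0 + 4), that num_sim_cores should equal the sum of the small, medium, and large core counts; the function name is hypothetical and this is not a check MacSim is known to perform.

```python
HETERO_KNOBS = {
    "num_sim_cores": 8,         # 4 CPUs + 4 GPUs
    "num_sim_small_cores": 4,   # GPU cores (core_type ptx)
    "num_sim_medium_cores": 0,
    "num_sim_large_cores": 4,   # CPU cores (large_core_type x86)
    "cpu_frequency": 3.0,
    "gpu_frequency": 1.5,
}


def core_counts_consistent(knobs):
    """True when the per-size core counts add up to num_sim_cores."""
    per_size = sum(
        knobs[f"num_sim_{size}_cores"] for size in ("small", "medium", "large")
    )
    return per_size == knobs["num_sim_cores"]


# Ratio of the two frequency knobs: CPU cycles per GPU cycle.
cpu_cycles_per_gpu_cycle = HETERO_KNOBS["cpu_frequency"] / HETERO_KNOBS["gpu_frequency"]
print(core_counts_consistent(HETERO_KNOBS), cpu_cycles_per_gpu_cycle)
```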

|Multiple applications
  Set up from trace_file_list
  4 <-- number of applications
  /sample/mcf/trace.txt <- appl 1
  /sample/gcc/trace.txt <- appl 2
  /sample/mm/trace.txt <- appl 3
  /sample/blackscholes/trace.txt <- appl 4
[Figure: applications MCF, GCC, MM (thread 1, thread 2), and Blackscholes]
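The layout of trace_file_list shown above (a count followed by one trace path per application) can be read with a few lines of Python. This is a sketch under assumptions: the `<-` annotations are treated as slide commentary and stripped, and the function name is hypothetical.

```python
def parse_trace_file_list(text):
    """Return the trace paths listed after the leading application count."""
    # Strip '<-' / '<--' annotations, which are assumed to be slide commentary.
    lines = [ln.split("<-", 1)[0].strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln]
    count = int(lines[0])
    paths = lines[1:1 + count]
    if len(paths) != count:
        raise ValueError(f"expected {count} trace paths, found {len(paths)}")
    return paths


TRACE_FILE_LIST = """\
4 <-- number of applications
/sample/mcf/trace.txt <- appl 1
/sample/gcc/trace.txt <- appl 2
/sample/mm/trace.txt <- appl 3
/sample/blackscholes/trace.txt <- appl 4
"""

paths = parse_trace_file_list(TRACE_FILE_LIST)
print(paths)
```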

|Execution time for each application is different.
|MacSim provides an option to repeat short traces until the longest trace ends
|Is this the right way to simulate?
[Figure: timelines for mcf, gcc, and bfs (Program 1, Program 2, Program 3)]
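To make the repeat option concrete, the Python sketch below counts how many times each trace would be started so that every application stays busy until the longest one ends. The run lengths are made-up numbers, and the model (each trace restarts immediately and always takes the same time) is a simplifying assumption.

```python
import math


def runs_until_longest_ends(run_cycles):
    """Map each application to the number of runs it starts, including a
    final partial run, before the longest application finishes."""
    longest = max(run_cycles.values())
    return {app: math.ceil(longest / cycles) for app, cycles in run_cycles.items()}


# Hypothetical standalone run lengths in cycles.
RUN_CYCLES = {"mcf": 1000, "gcc": 400, "bfs": 250}
runs = runs_until_longest_ends(RUN_CYCLES)
print(runs)
```

Here gcc is cut off midway through its third run, which is one reason the slide asks whether this is the right way to simulate.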

|Sample configuration files in macsim-top/trunk/params

  File name             Contents
  params_8800gt         GeForce 8800 GT (G80)
  params_gtx280         GeForce GTX 280 (GT200)
  params_gtx465         NVIDIA GeForce GTX 465 (Fermi)
  params_x86            Intel’s Sandy Bridge (CPU part only)
  params_hetero_4c_4g   Intel’s Sandy Bridge (CPU + GPU)

|Thread spawn is modeled.
|Locks are not modeled.
[Figure: a main thread spawning threads that meet at a barrier on the cores; a host thread and a GPU kernel invocation]

|It will be covered in Part-III
|The trace generator generates thread execution information automatically.
|Users do not need to worry about this.

|MacSim has 5 different clock domains
  CPU
  GPU
  Last-level cache
  Interconnection network
  DRAM
|Clock knobs
  # Clock
  clock_cpu 3
  clock_gpu 1.5
  clock_l3 1
  clock_noc 1
  clock_mc 1.6
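One way to picture five clock domains advancing together is to list every domain's tick times on a shared time axis. The Python sketch below does that with exact fractions. It assumes the knob values are frequencies in GHz (the slide does not state a unit) and is not a description of MacSim's internal clocking code.

```python
from collections import Counter
from fractions import Fraction

# Knob values from the slide, assumed here to be in GHz.
CLOCKS_GHZ = {
    "clock_cpu": Fraction("3"),
    "clock_gpu": Fraction("1.5"),
    "clock_l3": Fraction("1"),
    "clock_noc": Fraction("1"),
    "clock_mc": Fraction("1.6"),
}


def tick_events(clocks, until_ns):
    """Return (time_ns, domain) pairs for every tick up to until_ns, in time order."""
    events = []
    for name, ghz in clocks.items():
        period_ns = 1 / ghz  # a 3 GHz clock ticks every 1/3 ns
        t = period_ns
        while t <= until_ns:
            events.append((t, name))
            t += period_ns
    events.sort()
    return events


ticks = Counter(name for _, name in tick_events(CLOCKS_GHZ, 10))
print(dict(ticks))
```

Over a 10 ns window, each domain ticks ten times its assumed GHz value, so the CPU domain advances twice as often as the GPU domain.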

|x86 instructions are mapped to uops
|PTX instructions are mapped to uops (almost 1-1 mapping)
|Pipeline stages
  Front-end, Decode, Rename, Schedule, Execution, Retire (plus Memory)
[Figure: Pin XED provides macro instructions with decoded information; the MacSim trace decoder turns them into uops for the timing/power simulator]

|Front-end, DEC/Rename: just a simple FIFO queue.
  fetch_latency 5 // front-end depth
  alloc_latency 5 // decode/allocation depth
  width // pipeline width (same width for all the pipeline)
  bp_dir_mech gshare
  bp_hist_length 14 // branch history length
|Rename: create RAW dependency (map structure)
  rob_size 96 // ROB size
|Scheduler // in-order scheduler, ooo scheduler
  schedule io, ooo // instruction scheduling policy
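For readers unfamiliar with the gshare setting, here is a minimal textbook-style gshare predictor in Python with a 14-bit global history, matching bp_hist_length above. The table size, the 2-bit saturating counters, and their initial value are common conventions assumed for this sketch; MacSim's actual implementation may differ.

```python
class Gshare:
    """Textbook gshare: index a counter table with PC XOR global history."""

    def __init__(self, hist_length=14):
        self.mask = (1 << hist_length) - 1
        self.history = 0
        # 2-bit saturating counters, initialized to weakly not-taken (1).
        self.counters = [1] * (1 << hist_length)

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


bp = Gshare(hist_length=14)
for _ in range(20):          # train on an always-taken branch
    bp.update(0x400, True)
print(bp.predict(0x400))
```

The predictor needs a warm-up period here because each new history value selects a different counter until the history register fills with ones.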

|Execution latency
  Fixed uop latency (macsim-top/def/uop_latency_[x86,ptx].def)
  Variable latency: cache/memory latency
|Instruction scheduling rates
  isched_rate 4 // # of integer inst. that can be executed per cycle
  msched_rate 2 // # of memory inst. that can be executed per cycle
  fsched_rate 2 // # of FP inst. that can be executed per cycle
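The scheduling rates act as per-cycle budgets for each instruction class. The Python sketch below shows one cycle of such a scheduler. It ignores dependencies, pipeline width, and latencies, and the oldest-first selection order is an assumption made for illustration.

```python
# Per-cycle budgets taken from isched_rate, msched_rate, and fsched_rate.
RATES = {"int": 4, "mem": 2, "fp": 2}


def schedule_cycle(ready, rates=RATES):
    """Pick ready uops oldest-first without exceeding any per-class rate.

    `ready` is a list of class names ("int", "mem", "fp") ordered oldest first.
    Returns (issued, leftover).
    """
    budget = dict(rates)
    issued, leftover = [], []
    for uop_class in ready:
        if budget.get(uop_class, 0) > 0:
            budget[uop_class] -= 1
            issued.append(uop_class)
        else:
            leftover.append(uop_class)
    return issued, leftover


issued, leftover = schedule_cycle(["mem"] * 3 + ["int"] * 5 + ["fp"])
print(issued, leftover)
```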

|Cache configuration
  # of sets, associativity, line size, # of banks, etc. (see manual)
|Cache size = # of sets x assoc x line_size x # of tiles
|DRAM configuration
  Frequency, bus width, column/activate/precharge latency
  # of memory controllers, # of banks, # of channels, row buffer size, DRAM scheduling policy
  Simple but fast DRAM model that models key features
|MacSim is connected with DRAM-SIM2
  Users can use DRAM-SIM2 for a detailed DRAM timing simulation
[Slide annotation: L3 only]
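The cache size formula is easy to check with a worked example. The numbers below are hypothetical and chosen only to give a round result; the parameter names mirror the formula rather than any specific MacSim knob.

```python
def cache_size_bytes(num_sets, assoc, line_size, num_tiles=1):
    """Cache size = # of sets x assoc x line_size x # of tiles."""
    return num_sets * assoc * line_size * num_tiles


# Hypothetical cache: 512 sets, 8-way, 64-byte lines, 4 tiles.
size = cache_size_bytes(num_sets=512, assoc=8, line_size=64, num_tiles=4)
print(size, size // 1024, "KiB")
```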

|Statistics
  Simulation outputs: *.stat.out
  Stat definitions are in the macsim/trunk/def files (more details in Part-III)
|Important stats
  IPC = INST_COUNT_TOT / CYC_COUNT_TOT
  CPI = CYC_COUNT_TOT / INST_COUNT_TOT
|Per-core stats
  IPC for core 0: INST_COUNT_CORE_0 / CYC_COUNT_CORE_0
|Multiple-application stats
  *.stat.out. e.g., memory.stat.out.0, bp.stat.out.1
  Each stat file contains stats only for the first run (repeated simulations are ignored)
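The IPC and CPI definitions translate directly into code. The Python sketch below works on counters that have already been loaded into a dictionary; the counter values are invented, and the layout of the *.stat.out files is deliberately not assumed.

```python
def ipc(stats, core=None):
    """IPC from total counters, or from per-core counters when core is given."""
    suffix = "TOT" if core is None else f"CORE_{core}"
    return stats[f"INST_COUNT_{suffix}"] / stats[f"CYC_COUNT_{suffix}"]


def cpi(stats, core=None):
    """CPI is the reciprocal of IPC."""
    return 1.0 / ipc(stats, core)


# Hypothetical counter values for illustration only.
STATS = {
    "INST_COUNT_TOT": 2000,
    "CYC_COUNT_TOT": 1000,
    "INST_COUNT_CORE_0": 600,
    "CYC_COUNT_CORE_0": 1000,
}
print(ipc(STATS), cpi(STATS), ipc(STATS, core=0))
```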

|Memory systems
  L[1-3]_HIT_CPU / L[1-3]_HIT_GPU
  L[1-3]_MISS_CPU / L[1-3]_MISS_GPU
|Front-end
  BP_ON_PATH_[CORRECT/MISPREDICT/MISFETCH]
|Instruction profiling
  Based on instruction category: inst.stat.out
|More details regarding statistics are in the documentation
|We will provide a simple script file to fetch stat data
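The bracketed stat names expand to one counter per cache level and per side (CPU or GPU). As a small illustration, the Python sketch below builds those names and derives a miss rate from a hit/miss pair. The counter values are invented and the helper names are hypothetical.

```python
def cache_stat_names():
    """Expand L[1-3]_{HIT,MISS}_{CPU,GPU} into concrete stat names."""
    return [
        f"L{level}_{kind}_{side}"
        for level in (1, 2, 3)
        for kind in ("HIT", "MISS")
        for side in ("CPU", "GPU")
    ]


def miss_rate(stats, level, side):
    """Misses divided by total accesses for one cache level and side."""
    hits = stats[f"L{level}_HIT_{side}"]
    misses = stats[f"L{level}_MISS_{side}"]
    total = hits + misses
    return misses / total if total else 0.0


# Hypothetical counter values for illustration only.
STATS = {"L1_HIT_CPU": 900, "L1_MISS_CPU": 100}
print(len(cache_stat_names()), miss_rate(STATS, 1, "CPU"))
```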

|Multi-threading support is already there.
|Different ISAs: using micro-ops
|Warp?
  One warp is treated as one thread.
  Each thread generates its own trace file.
  Active bit information is included
  Trace format will be explained in Part-III
|Thread and block scheduling
  Block-level barrier, block-level scheduling/retirement
  More details will be explained in Part-III
|Different memory structures
  Memory systems

|Include the memory access by each thread of a warp as a separate instruction in the trace
|In the trace, mark these accesses as coming from the same warp
[Figure: a SIMD load instruction with addresses Addr 0 to Addr 7. Coalesced case: one memory instruction of 128B size. Uncoalesced case: a 64B request and 32B requests. Trace file layout: TraceInst, TraceInst_begin (start of memory instruction marker), TraceMem1, TraceMem2, TraceMem3, TraceInst_end (end of memory instruction marker).]
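The two ideas on this slide can be sketched in Python: wrapping a warp's per-thread accesses between begin and end markers, and grouping the addresses into aligned segments to see whether they coalesce. The record tuples are illustrative and not MacSim's real trace format, and the fixed 128-byte segment rule is a simplification of actual GPU coalescing rules.

```python
def warp_trace_records(addresses):
    """Wrap one warp's per-thread accesses between begin/end markers."""
    records = [("TraceInst_begin",)]
    records += [("TraceMem", addr) for addr in addresses]
    records.append(("TraceInst_end",))
    return records


def touched_segments(addresses, segment=128):
    """Sorted base addresses of the aligned segments the warp touches."""
    return sorted({addr - addr % segment for addr in addresses})


# Eight threads reading consecutive 4-byte words: one 128B segment.
coalesced = [0x1000 + 4 * i for i in range(8)]
# Eight threads striding by 128 bytes: one segment per thread.
strided = [0x1000 + 128 * i for i in range(8)]

print(len(warp_trace_records(coalesced)))
print(len(touched_segments(coalesced)), len(touched_segments(strided)))
```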

|During simulation, form a “parent” uop that holds all the individual memory accesses as its child uops
|The parent uop flows through the pipeline; only in the memory stage are the individual child uops issued to memory
  The parent uop is ready for retirement when all children have completed
[Figure: the trace file records (TraceInst_begin, TraceMem1, TraceMem2, TraceMem3, …, TraceMemN, TraceInst_end, with start and end of memory instruction markers) become one MacSim parent uop (Mem_type: ld, #children: 8) whose child uops hold addr0 through addrN]
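The parent/child relationship can be captured in a small data structure. The Python sketch below tracks outstanding child accesses and reports the parent as ready to retire once all of them have completed. Class and method names are hypothetical; this is not MacSim's uop class.

```python
class ParentUop:
    """A memory uop that retires only after all child accesses complete."""

    def __init__(self, mem_type, addresses):
        self.mem_type = mem_type
        self.children = list(addresses)           # one child access per address
        self.pending = set(range(len(self.children)))

    def complete_child(self, index):
        """Mark one child access as finished by the memory system."""
        self.pending.discard(index)

    @property
    def ready_to_retire(self):
        return not self.pending


parent = ParentUop("ld", [0x1000 + 128 * i for i in range(8)])
for i in range(7):
    parent.complete_child(i)
print(parent.ready_to_retire)   # one child is still outstanding
parent.complete_child(7)
print(parent.ready_to_retire)
```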

|IRIS (from Prof. Yalamanchili’s group)
  Flit-level interconnection network simulator
  Virtual channels, credit-based flow control, deadlock avoidance, …
  Part-IV will cover more.
|MacSim-SST
  Parallel simulation
[Figure: nodes and routers connected in a topology (ring, mesh, torus, ...)]
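Credit-based flow control, listed above as an IRIS feature, can be illustrated with a toy link in Python. The sender holds one credit per downstream buffer slot, spends a credit per flit, and gets it back when the receiver drains a flit. This is a generic sketch of the technique with hypothetical names; it omits virtual channels and credit-return latency and does not reflect IRIS's code.

```python
from collections import deque


class CreditLink:
    """Toy single-channel link with credit-based flow control."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # free slots in the downstream buffer
        self.downstream = deque()

    def try_send(self, flit):
        """Send a flit if a credit is available; return whether it was sent."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def drain_one(self):
        """Receiver consumes the oldest flit and returns a credit."""
        flit = self.downstream.popleft()
        self.credits += 1
        return flit


link = CreditLink(buffer_slots=2)
sent = [link.try_send(f) for f in ("a", "b", "c")]
print(sent)                      # the third flit is blocked
drained = link.drain_one()
resent = link.try_send("c")
print(drained, resent)
```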