GPU Functional Simulator — Yi Yang, CDA 6938 term project, Orlando, April 20, 2008

Presentation transcript:


University of Central Florida
Outline: Motivation and background, Software design, Implementation, Test cases, Future work

Motivation and background
Motivation: gain a better understanding of GPUs and explore improvements to GPU architecture.
Background: there are two major GPU manufacturers, NVIDIA and ATI, with similar programming models (NVIDIA's block corresponds to ATI's group; shared memory corresponds to LDS). ATI uses a VLIW design. We want the simulator to support both.

Software design
Programming Model Layer (PML): platform independent. It defines the abstract parts, most of the ISA (instructions, registers), and implements the resources that are similar across platforms (group, wavefront, …).
Hardware Implementation Layer (HIL): implements the abstract parts of the PML for each platform (ATI, NVIDIA).

Programming Model Layer
The code parser builds the instruction list. Resources are allocated from the configuration file: groups, threads, shared memory, memory, and the wavefront schedule. The input stream is loaded from a text file. The wavefront schedule then executes the instruction list on the wavefronts; when an instruction executes on a thread, it updates the resources: the thread's registers, the group's shared memory, and the GPUProgram's texture (global) memory. Finally, the output memory is saved to a text file.

Code Parser (HIL)
Reads the assembly and parses it into instructions. Each line has the form NO LABEL: INST operands, where NO is the unique instruction number, LABEL is the stream-core label (one of x, y, z, w, t), and INST is the opcode followed by its Operands.
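The line format above can be sketched with a small parser. This is a minimal illustration, not the project's actual parser; the function name and tuple layout are my own.

```python
import re

# One assembly line: instruction number, stream-core label (x/y/z/w/t),
# opcode, then comma-separated operands, e.g. "3 t: F_TO_I ____, R0.x".
LINE_RE = re.compile(r"^\s*(\d+)\s+([xyzwt]):\s+(\S+)\s+(.*)$")

def parse_line(line):
    """Parse one line into (number, lane label, opcode, operand list)."""
    m = LINE_RE.match(line)
    if not m:
        raise ValueError("unrecognized instruction line: %r" % line)
    no, lane, opcode, rest = m.groups()
    operands = [op.strip() for op in rest.split(",")]
    return int(no), lane, opcode, operands

# Examples taken from the slides:
print(parse_line("3 t: F_TO_I ____, R0.x"))
print(parse_line("4 t: MULLO_UINT R1.z, 1, PS3"))
```

Lines with no operands or with control keywords would need extra cases; the sketch covers only the common ALU form.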

Operand (HIL)
General Purpose Register (GPR): 0 y: ADD ____, R0.x, -0.5
Previous Vector (PV: x, y, z, w) and Previous Scalar (PS: t): 3 t: F_TO_I ____, R0.x / 4 t: MULLO_UINT R1.z, 1, PS3
Temporary Register (TR): 3 t: RCP_UINT T0.x, R1.x
Constant Register (CR): 1 z: AND_INT ____, R0.x, (0x F, e-44f).x

Instruction (HIL)
Form: Opcode dst, src1, src2, … (e.g. ADD_INT R0.x, R1.x, R2.x), where dst, src1, src2 are Operands. GPUProgram holds the instruction lists. Each Instruction implements its own execution: it receives a thread as a parameter and executes on that thread. For example, for ADD_INT R0.x, R1.x, R2.x, the instruction reads the values of R1.x and R2.x from the thread and writes R1.x + R2.x back to R0.x.
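The ADD_INT example above can be sketched as follows; the class and method names are hypothetical stand-ins for the simulator's own.

```python
# Minimal sketch: an instruction reads its source operands from the thread's
# registers and writes the result back, as the ADD_INT example describes.
class Thread:
    def __init__(self):
        self.regs = {}          # maps "R0.x"-style names to integer values

    def read(self, operand):
        # Integer literals pass through; register names are looked up.
        return int(operand) if operand.lstrip("-").isdigit() else self.regs[operand]

    def write(self, operand, value):
        self.regs[operand] = value

def exec_add_int(thread, dst, src1, src2):
    """ADD_INT dst, src1, src2 — integer add executed on one thread."""
    thread.write(dst, thread.read(src1) + thread.read(src2))

t = Thread()
t.regs["R1.x"] = 7
t.regs["R2.x"] = 5
exec_add_int(t, "R0.x", "R1.x", "R2.x")
print(t.regs["R0.x"])  # 12
```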

Memory Handling (HIL)
Texture memory: 0 SAMPLE R1, R0.xyxx, t0, s0 UNNORM(XYZW); EXP_DONE: PIX0, R0. Cache support is future work.
Global memory: 6 RD_SCATTER R3, DWORD_PTR[0+R2.x], ELEM_SIZE(3) UNCACHED BURST_CNT(0); 03 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x].x___, R0, ELEM_SIZE(3). Coalescing support (first thread handles the access) is future work.
Text files serve as input and output.

Thread (PML)
Belongs to a Group. Holds Data Units (HIL): 128 bits (x, y, z, w) plus 32 bits (t); most resources, such as registers, are 4-component. One thread processor is five-way and produces 5 outputs (x, y, z, w, t). Holds the mapping table from registers (GPR, CR, TR) to Data Units.
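The register-to-Data-Unit mapping described above can be sketched like this; all names here are illustrative, not the project's actual API.

```python
# A Data Unit holds the four 32-bit vector components (x, y, z, w) plus the
# 32-bit scalar slot (t). A thread maps register names (R0, T0, ...) to units.
class DataUnit:
    def __init__(self):
        self.comp = {"x": 0, "y": 0, "z": 0, "w": 0, "t": 0}

class ThreadRegs:
    def __init__(self):
        self.mapping = {}       # e.g. "R0" -> DataUnit, "T0" -> DataUnit

    def get(self, name):
        reg, comp = name.split(".")         # "R0.x" -> ("R0", "x")
        return self.mapping.setdefault(reg, DataUnit()).comp[comp]

    def set(self, name, value):
        reg, comp = name.split(".")
        self.mapping.setdefault(reg, DataUnit()).comp[comp] = value

r = ThreadRegs()
r.set("R0.x", 42)
print(r.get("R0.x"))   # 42
print(r.get("R0.y"))   # unwritten components default to 0
```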

Wavefront (PML)
Holds the program counter and the thread-id list; belongs to a Group.

Group (PML)
Holds the threads and wavefronts; belongs to GPUProgram. Holds the shared memory (PML): instructions access shared memory through the Group, e.g. (HIL) 12 LOCAL_DS_WRITE (8) R0, STRIDE(16) SIMD_REL and 17 LOCAL_DS_READ R2, R2.xy WATERFALL.
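The point that threads reach shared memory only through their Group can be sketched as below; the classes and byte-level layout are assumptions for illustration.

```python
# Threads do not own shared memory; an instruction reaches it via the
# thread's Group, so all threads in a group see the same storage.
class Group:
    def __init__(self, size_bytes=16384):
        self.share_mem = bytearray(size_bytes)

class Thread:
    def __init__(self, group):
        self.group = group

def lds_write(thread, offset, data):
    """LOCAL_DS_WRITE-style store into the group's shared memory."""
    thread.group.share_mem[offset:offset + len(data)] = data

def lds_read(thread, offset, nbytes):
    """LOCAL_DS_READ-style load from the group's shared memory."""
    return bytes(thread.group.share_mem[offset:offset + nbytes])

g = Group()
t0, t1 = Thread(g), Thread(g)
lds_write(t0, 0, b"\x01\x02\x03\x04")
print(lds_read(t1, 0, 4))  # t1 sees what t0 wrote
```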

Wavefront Schedule (PML)
Current version (functional simulator): pick one instruction and let all wavefronts execute it. For a timing simulator, scheduling would instead be decided by hardware capacity and software requests, by the static instruction list, and by execution results.
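The lockstep schedule of the functional simulator can be sketched as follows; the data layout and callback are hypothetical.

```python
# Functional-simulator schedule: every wavefront (and every thread in it)
# executes the current instruction before the schedule moves to the next one.
def run_lockstep(instructions, wavefronts, execute):
    """execute(instr, thread_id) applies one instruction to one thread."""
    trace = []
    for instr in instructions:
        for wf in wavefronts:
            for tid in wf["thread_ids"]:
                execute(instr, tid)
            wf["pc"] = wf.get("pc", 0) + 1   # each wavefront's program counter
        trace.append(instr)
    return trace

wfs = [{"thread_ids": [0, 1]}, {"thread_ids": [2, 3]}]
log = []
run_lockstep(["ADD_INT", "MULLO_UINT"], wfs, lambda i, t: log.append((i, t)))
print(log[:2])  # the first instruction runs on every thread before the next
```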

GPUProgram (PML)
The code parser produces the instruction list. The input stream is loaded from a text file into memory. Resources are allocated from the configuration file: groups, threads, shared memory, memory, and the wavefront schedule. The wavefront schedule executes the instruction list on the wavefronts: when an instruction executes on a thread, it updates the resources — the thread's registers, the group's shared memory, and the GPUProgram's texture (global) memory. The output memory is saved to a text file.
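The overall flow above (parse, load, execute, save) can be condensed into a toy driver. Everything here is a stand-in: the real simulator's parsing, resource allocation, and execution are far richer.

```python
# Toy end-to-end flow: parse the assembly, "load" the input stream, run each
# instruction over all threads (one thread per input element), return output.
def run_program(asm_lines, input_values, execute):
    """Return the simulated output memory for a tiny linear program."""
    instructions = [line.split() for line in asm_lines]   # stand-in parser
    memory = list(input_values)                           # loaded input stream
    for instr in instructions:
        for tid in range(len(memory)):
            memory[tid] = execute(instr, memory[tid])
    return memory

# A single made-up "ADD_ONE" instruction over a 4-element input stream.
out = run_program(["0 x: ADD_ONE"], [1, 2, 3, 4],
                  lambda instr, v: v + 1 if instr[2] == "ADD_ONE" else v)
print(out)  # [2, 3, 4, 5]
```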

Test cases
Sum, division, subtraction, multiplication: exercise texture memory, different data types (int, float, uint, int1, int4, …), and the fundamental ALU operations (+ - * /, shift, and, compare, cast).
domain_sum: exercises global memory reads and writes.
Sum_share_memory: exercises shared-memory reads and writes, groups, and wavefronts.
Branch and loop (to be done): constant buffers and loop operations.

Future work
Currently about 30 of 200 ATI instructions are supported. Next steps: support NVIDIA, refine the two-layer design, and build a timing simulator.