Status – Week 265 Victor Moya

Summary
- ShaderEmulator
- ShaderFetch
- ShaderDecodeExecute
- Communication storage classes
- GPU design
- PS2
- PS3
- Imagine

ShaderEmulator
- Decoded information for emulation moved from ShaderEmulator to ShaderInstruction (see the sketch below).
- Used macros for shader instructions.
- Functions renamed.
- Not tested yet.
- Shader assembler?
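A minimal sketch of the change, assuming a made-up 32-bit encoding and field names (opcode, dest, src0, src1) that are not the real ones: the instruction decodes itself once at construction, so the ShaderEmulator can read the stored fields instead of re-decoding on every emulation step.

#include <cstdint>

// Hypothetical opcode set and encoding, for illustration only.
enum ShaderOpcode { NOP, ADD, MUL, MAD, DP3, DP4, EXIT };

class ShaderInstruction
{
public:
    explicit ShaderInstruction(uint32_t raw)
    {
        // Decode once, at load time.
        opcode = static_cast<ShaderOpcode>(raw >> 24);
        dest   = (raw >> 16) & 0xFF;
        src0   = (raw >> 8)  & 0xFF;
        src1   =  raw        & 0xFF;
    }

    // The emulator only reads the already decoded fields.
    ShaderOpcode getOpcode() const { return opcode; }
    unsigned getDest() const { return dest; }
    unsigned getSrc0() const { return src0; }
    unsigned getSrc1() const { return src1; }

private:
    ShaderOpcode opcode;
    unsigned dest, src0, src1;
};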

Shader Box Diagram

ShaderFetch
- Parameters (see the constructor sketch below):
  - numThreads: number of threads and input buffers.
  - numActiveThreads: number of active threads.
  - issueRate: max. instructions fetched per cycle.
  - retireRate: max. newPC inputs per signal???
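For reference, a constructor sketch carrying the parameter list above; the class shape and member comments are assumptions, only the parameter names come from the slide.

class ShaderFetch
{
public:
    ShaderFetch(unsigned numThreads, unsigned numActiveThreads,
                unsigned issueRate, unsigned retireRate)
        : numThreads(numThreads), numActiveThreads(numActiveThreads),
          issueRate(issueRate), retireRate(retireRate) {}

private:
    unsigned numThreads;        // threads and input buffers
    unsigned numActiveThreads;  // threads actively fetching
    unsigned issueRate;         // max. instructions fetched per cycle
    unsigned retireRate;        // max. newPC inputs read per cycle
};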

ShaderFetch Signals
- Input:
  - ShaderCommand:
    - NEW_INPUT: inputs to the shader.
    - LOAD_PROGRAM: load a new shader program.
    - LOAD_PARAMETERS: load new parameter values.
  - ShaderNewPC:
    - Thread flow changes.
    - End of thread (EXIT).
  - ShaderDecodeState: decoder/execute ready/busy state.
  - ConsumerState: consumer state, ready/busy to receive shader outputs.

ShaderFetch Signals
- Output:
  - ShaderState:
    - Shader ready to receive new inputs.
    - Shader ready to receive a new program or new parameters (all previous inputs must have finished).
  - ShaderInstruction: shader instructions fetched.
  - ShaderOutput: shader output (an array of QuadFloats) sent to the consumer.

ShaderFetch Clock
- Read from ShaderCommand (see the sketch below):
  - Panic if the shader is not ready for a command.
  - NEW_INPUT:
    - Get a free thread or free input buffer.
    - Reset the thread state.
  - LOAD_PROGRAM: store the new program in the ShaderEmulator.
  - LOAD_PARAMETERS: store the new parameters in the ShaderEmulator.
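A sketch of this command-handling step, with hypothetical helper names (allocateFreeThread, loadProgramIntoEmulator, ...); only the command names and the panic-on-not-ready behaviour come from the slide.

#include <cstdio>
#include <cstdlib>

enum ShaderCommandType { NEW_INPUT, LOAD_PROGRAM, LOAD_PARAMETERS };

// Stubs standing in for the real box/emulator calls.
static bool shaderReadyForCommand()      { return true; }
static void allocateFreeThread()         { /* get a free thread or input buffer, reset its state */ }
static void loadProgramIntoEmulator()    { /* store the new program in the ShaderEmulator */ }
static void loadParametersIntoEmulator() { /* store the new parameter values in the ShaderEmulator */ }

static void processShaderCommand(ShaderCommandType command)
{
    if (!shaderReadyForCommand())
    {
        std::fprintf(stderr, "ShaderFetch: shader not ready for command\n");
        std::abort();                             // the "panic" case from the slide
    }

    switch (command)
    {
        case NEW_INPUT:        allocateFreeThread();          break;
        case LOAD_PROGRAM:     loadProgramIntoEmulator();     break;
        case LOAD_PARAMETERS:  loadParametersIntoEmulator();  break;
    }
}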

ShaderFetch Clock
- Read ShaderNewPC (N reads):
  - NEW_PC: update the thread PC.
  - END_THREAD: mark the thread as finished.
- Read ShaderDecodeState:
  - If the decoder is busy:
    - Update the shader state.
    - Write ShaderState.

ShaderFetch Clock
- Fetch new instructions (N per cycle; see the sketch below):
  - Check that the thread is ready (not finished, not free).
  - Fetch the instruction from the ShaderEmulator.
  - Write ShaderInstruction.
  - Update the thread PC (+1).
  - Update the thread instruction counter.
    - If it crosses the instruction execution limit: finish the thread.
  - Update the nextThread pointer (Round Robin).
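The fetch loop above as a sketch under assumed names (ThreadState, fetchFromEmulator, instrLimit): it fetches up to issueRate instructions per cycle and advances the round-robin pointer even when a thread has nothing to fetch (the pure Round Robin option discussed later).

#include <cstdint>
#include <vector>

struct ThreadState
{
    bool     free       = true;    // no input assigned yet
    bool     finished   = false;   // hit EXIT or the instruction limit
    uint32_t pc         = 0;
    uint32_t instrCount = 0;
};

class ShaderFetchSketch
{
public:
    ShaderFetchSketch(unsigned numThreads, unsigned issueRate, uint32_t instrLimit)
        : threads(numThreads), issueRate(issueRate), instrLimit(instrLimit) {}

    void fetchStage()
    {
        for (unsigned i = 0; i < issueRate; ++i)
        {
            ThreadState &t = threads[nextThread];
            if (!t.free && !t.finished)                 // thread ready to fetch
            {
                fetchFromEmulator(nextThread, t.pc);    // read the instruction, write ShaderInstruction
                ++t.pc;                                 // sequential PC update
                if (++t.instrCount >= instrLimit)       // instruction execution limit crossed
                    t.finished = true;
            }
            // Pure Round Robin: advance even if this slot fetched nothing (a NOP slot).
            nextThread = (nextThread + 1) % threads.size();
        }
    }

private:
    void fetchFromEmulator(unsigned /*thread*/, uint32_t /*pc*/) { /* ShaderEmulator access */ }

    std::vector<ThreadState> threads;
    unsigned nextThread = 0;
    unsigned issueRate;
    uint32_t instrLimit;
};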

ShaderFetch Clock
- Check the shader output state (see the sketch below):
  - Output transmission in progress: send more data (write to ShaderOutput) and update the state.
  - Output to send: check the consumer state, start sending data (write to ShaderOutput), set the state to send in progress.
  - NOT IMPLEMENTED YET.
- Update the shader state: write ShaderState.
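Since the output side is marked as not implemented yet, the following is only a sketch of the state machine the bullets suggest, with assumed state names and a simple word counter.

enum class OutputState { Idle, PendingOutput, SendInProgress };

struct OutputPath
{
    OutputState state     = OutputState::Idle;
    unsigned    wordsLeft = 0;                    // QuadFloats still to transmit

    void clock(bool consumerReady, unsigned outputSize)
    {
        switch (state)
        {
            case OutputState::SendInProgress:
                // Transmission in progress: push the next chunk onto ShaderOutput.
                if (wordsLeft > 0)
                    --wordsLeft;
                if (wordsLeft == 0)
                    state = OutputState::Idle;
                break;
            case OutputState::PendingOutput:
                // Output waiting: only start when the consumer is ready.
                if (consumerReady && outputSize > 0)
                {
                    wordsLeft = outputSize;       // chunks are written to ShaderOutput over the next cycles
                    state = OutputState::SendInProgress;
                }
                break;
            case OutputState::Idle:
                break;
        }
    }
};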

ShaderFetch Comments
- Finished threads act as buffers for output.
  - Add output buffers?
- Consumer/output protocol:
  - Output sizes.
  - Output latency.
  - Multicycle transmission.
- Order of operations in clock().

ShaderFetch Comments
- ‘Scheduling’ policy for threads?
  - Pure Round Robin: free/finished threads fetch NOPs.
  - Round Robin with priority: free/finished threads are skipped (see the sketch below).
  - Other?
- Memory:
  - THREAD_BLOCK/THREAD_RESUME from a Shader Memory box or from the ShaderDecodeExecute box.
  - New thread state: blocked.
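A sketch of the Round Robin with priority option: instead of letting empty threads fetch NOPs, the fetch stage skips them. The Thread fields and the blocked flag (for the memory case mentioned above) are assumptions.

#include <vector>

struct Thread { bool free = true; bool finished = false; bool blocked = false; };

// Returns the index of the next ready thread after 'current', or -1 if none is ready.
int nextReadyThread(const std::vector<Thread> &threads, int current)
{
    const int n = static_cast<int>(threads.size());
    for (int step = 1; step <= n; ++step)
    {
        const int candidate = (current + step) % n;
        const Thread &t = threads[candidate];
        if (!t.free && !t.finished && !t.blocked)   // skip empty or blocked threads
            return candidate;
    }
    return -1;                                      // bubble: nothing to fetch this cycle
}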

ShaderDecodeExecute
- Parameters:
  - numThreads: ShaderEmulator threads.
  - issueRate: max. number of instructions received per cycle.
  - retireRate: max. number of NewPC signals per cycle.

ShaderDecodeExecute
- Signals:
  - Input:
    - ShaderInstruction: new instructions from Fetch.
    - ShaderExec: instructions that end their execution.

ShaderDecodeExecute
- Output:
  - ShaderDecodeState: decoder ready/busy.
  - ShaderNewPC:
    - NEW_PC: PC changes from branch, call and ret instructions.
      - Instruction refetch? -> thread blocked?
    - END_THREAD: EXIT instruction.
  - ShaderExec: instructions that start their execution.

ShaderDecodeExecute Clock
- Read ShaderExec (N finished instructions; see the sketch below):
  - Update the dependence tables.
  - Free the resource tables (limited UFs?).
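A sketch of that retirement bookkeeping, assuming the dependence table is a per-thread pending-write bit per register and the resource table is a counter per functional-unit type.

#include <vector>

struct DecodeTables
{
    std::vector<std::vector<bool>> pendingWrite;   // [thread][register]: result still in flight
    std::vector<unsigned>          busyUnits;      // [unit type]: instructions occupying the UF

    // Called for each instruction read from ShaderExec this cycle.
    void retireInstruction(unsigned thread, unsigned destReg, unsigned unitType)
    {
        pendingWrite[thread][destReg] = false;     // later readers of destReg see no dependence
        if (busyUnits[unitType] > 0)
            --busyUnits[unitType];                 // free the (limited) UF slot
    }
};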

ShaderDecodeExecute Clock
- If decode is blocked:
  - Check resources for the blocked instruction.
  - If resources are available:
    - Execute the instruction in the ShaderEmulator.
    - Update the dependence table.
    - Read the execution latency table.
    - Write ShaderExec.
    - Unblock the decoder.
    - Write ShaderDecodeState.

ShaderDecodeExecute
- Read ShaderInstruction (N instructions; see the sketch below):
  - Check dependences.
  - If the instruction has dependences:
    - Throw away the instruction.
    - Send NewPC with the same PC.
  - Check the resource tables.
  - If there are no resources for the instruction:
    - Block the decoder.
    - Write ShaderDecodeState.
    - Exit.
  - Execute the instruction in the ShaderEmulator.
  - Read the instruction execution latency table.
  - Write ShaderInstruction.
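A condensed sketch of this decode path, reusing the table layout assumed above; the signal names (NewPC, ShaderDecodeState, ShaderExec) come from the slides, everything else is illustrative.

#include <vector>

struct DecodedInstr { unsigned thread, pc, destReg, src0, src1, unitType; };

class ShaderDecodeExecuteSketch
{
public:
    void decode(const DecodedInstr &i)
    {
        // Dependence check: a source register whose result is still in flight.
        if (pendingWrite[i.thread][i.src0] || pendingWrite[i.thread][i.src1])
        {
            sendNewPC(i.thread, i.pc);            // refetch: same PC, instruction thrown away
            return;
        }

        // Resource check: no free unit of the required type -> block the decoder.
        if (busyUnits[i.unitType] >= maxUnits[i.unitType])
        {
            decoderBlocked = true;                // reported through ShaderDecodeState
            return;
        }

        // Issue: run the instruction in the emulator, mark the destination as in
        // flight and account for the UF; completion arrives later on ShaderExec.
        executeInEmulator(i);
        pendingWrite[i.thread][i.destReg] = true;
        ++busyUnits[i.unitType];
    }

private:
    void sendNewPC(unsigned /*thread*/, unsigned /*pc*/) {}
    void executeInEmulator(const DecodedInstr &) {}

    std::vector<std::vector<bool>> pendingWrite;  // dependence table
    std::vector<unsigned> busyUnits, maxUnits;    // resource tables
    bool decoderBlocked = false;
};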

ShaderDecodeExecute Comments
- Additional features to implement:
  - Out-of-order support.
  - Variable instruction latency support.
    - Simulate UF latency (e.g. DIV/SQRT unit).
  - Load/store support.
  - Add a Memory box:
    - Variable latency.
    - Thread blocks on a first-level memory miss: signal THREAD_BLOCK (to fetch and decode).
    - Thread wakes up when the read ends: signal THREAD_RESUME (to fetch and decode).

ShaderDecodeExecute Comments
- Resource hazards:
  - None: all UFs are fully pipelined with 1-cycle input latency.
  - Supported:
    - Block the decoder.
      - Other threads could use the free UFs.
    - Block the thread.
      - Signal to fetch (ShaderNewPC): BLOCK_THREAD.
      - Needs a RESUME_THREAD signal?
      - Fetch hardware implementation? (not pure Round Robin?)

ShaderDecodeExecute Comments
- Dependence hazards:
  - None:
    - Same latency for all instructions.
    - Enough threads to fill the instruction latency.
    - ‘Empty’ (free or already finished) threads fetch NOPs (pure Round Robin).
    - Hardware is less complex.

ShaderDecodeExecute Comments
- Dependence hazards:
  - Supported:
    - Block the decoder: with a multithreaded shader it is a waste.
    - Block the thread: send a BLOCK_THREAD signal to fetch; needs a RESUME_THREAD signal.
    - Ignore: just ignore the current instruction and send REFETCH (ShaderNewPC) with the old PC.
      - The instruction limit counter must not be updated!

Communication storage
- Communication between boxes:
  - ShaderExecInstruction
  - ShaderCommand
  - ShaderDecodeCommand
- Dynamic: creation/destruction.
- Class model or struct model?
- Inherit from a ‘dynamic data’ class.
- Modified new/delete implementation (see the sketch below).
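A sketch of the ‘dynamic data’ base-class approach with a modified new/delete: signal payloads derive from it, and fixed-size chunks are recycled through a free list instead of going to the general-purpose allocator on every creation/destruction. The pool strategy and names below are assumptions, not the real implementation.

#include <cstddef>
#include <vector>

class DynamicData
{
public:
    virtual ~DynamicData() = default;

    static void *operator new(std::size_t size)
    {
        if (size <= CHUNK_SIZE)
        {
            if (!freeChunks.empty())
            {
                void *chunk = freeChunks.back();   // reuse a released chunk
                freeChunks.pop_back();
                return chunk;
            }
            return ::operator new(CHUNK_SIZE);     // always allocate full chunks so they are reusable
        }
        return ::operator new(size);               // oversized objects use the global allocator
    }

    static void operator delete(void *p, std::size_t size)
    {
        if (size <= CHUNK_SIZE)
            freeChunks.push_back(p);               // keep the chunk for later reuse
        else
            ::operator delete(p);
    }

private:
    static const std::size_t CHUNK_SIZE = 256;     // big enough for the largest signal payload (assumed)
    static std::vector<void *> freeChunks;         // not thread-safe; fine for a single-threaded simulator
};

std::vector<void *> DynamicData::freeChunks;

// Example: a signal payload that now uses the pooled allocation.
class ShaderDecodeCommand : public DynamicData { /* command fields */ };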

GPU design
- Target architecture?
  - NV30
  - DX9
  - DX10
  - OpenGL2
  - PS3
  - Imagine
  - Vector
  - Multithreaded
- Are we really going for it?
- Do we really know what we are doing?

PS2
- I got the EE, VU and GS programming manuals :).

PS3
- Sony patent.
- I haven't read it yet.

Imagine
- ‘Computer Graphics on a Stream Architecture’, John Douglas Owens, PhD dissertation.
- Not read yet either.