1
Status – Week 243 Victor Moya
2
Summary
- Current status.
- Tests.
- XBox documentation.
- Post Vertex Shader geometry.
- Rasterization.
3
Current Status
- Basic Command Processor.
  - Read/Write GPU registers.
  - Read/Write GPU memory.
  - GPU commands.
  - No DMA/AGP data access.
- Basic Memory Controller.
  - 1 transaction per cycle served.
  - Memory module access latency accounted.
  - Transmission latency accounted.
  - 3 buses (req/write + data): CP, StreamerFetch, StreamerLoader.
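The Memory Controller timing described above (one transaction served per cycle, plus module access latency, plus transmission latency) can be sketched roughly as follows. The class name, the 10-cycle access latency and the 8-byte bus width are illustrative assumptions, not values from the simulator.

```python
from dataclasses import dataclass


@dataclass
class MemoryControllerModel:
    """Toy timing model; all parameter values are assumed for illustration."""
    access_latency: int = 10   # assumed memory module access latency (cycles)
    bus_width_bytes: int = 8   # assumed bytes moved over the data bus per cycle
    next_free_cycle: int = 0   # first cycle at which a new transaction can start

    def serve(self, request_cycle: int, size_bytes: int) -> int:
        """Return the cycle at which the requested data has fully arrived."""
        # Only one transaction is served per cycle.
        start = max(request_cycle, self.next_free_cycle)
        self.next_free_cycle = start + 1
        # Transmission latency: ceil(size / bus width) cycles on the data bus.
        transmission = -(-size_bytes // self.bus_width_bytes)
        return start + self.access_latency + transmission
```

Two 32-byte requests arriving in the same cycle finish one cycle apart under this model, because the second one has to wait for the next service slot.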
4
Current Status
- Shader (Vertex Shader).
  - Multithreaded.
  - F/D/E/W pipeline.
  - Variable execution latency.
  - Dependency checking is full register right now, should be component based.
  - Problems with ‘ending’ instruction (requires something to fetch after it and takes many cycles).
  - No branches (support code but instructions not implemented).
  - No texture access (memory).
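The difference between full-register and component-based dependency checking can be illustrated with per-component write masks. This is a minimal sketch; the mask encoding, the `pending` table and the function names are assumptions made for the example.

```python
# One bit per register component (assumed encoding).
X, Y, Z, W = 1, 2, 4, 8


def full_register_hazard(pending, reg, read_mask):
    """Full-register check: any in-flight write to the register blocks the read.

    read_mask is ignored on purpose; that is what makes this check conservative.
    """
    return pending.get(reg, 0) != 0


def component_hazard(pending, reg, read_mask):
    """Component-based check: block only if a component being read is pending."""
    return (pending.get(reg, 0) & read_mask) != 0


# Hypothetical state: an in-flight instruction is still writing r0.xy.
pending = {"r0": X | Y}
```

With that state, an instruction reading only r0.zw stalls under the full-register check but can issue under the component-based one.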
5
Current Status
- Streamer.
  - Pipelined:
    - Hit: Fetch/OCache/Insert/Commit.
    - Miss: Fetch/OCache/IRQInsert/IRQRead/AttrLoad/Sh/Store/Commit.
  - Stream and index based modes implemented.
  - No pre T&L cache (should be added to Streamer Loader?).
  - Supports out of order vertexes (shader or memory).
  - Doesn’t support data from the AGP.
6
Current Status
- Streamer:
  - Streamer Loader pipeline should be (in hardware):
    - Insert in the IRQ.
    - Load from IRQ.
    - Setup Input: start address + address increment for each active attribute.
    - Attribute Load: request attribute to MC, increment address generators.
    - Issue to Shader.
  - IRQ should be implemented with a pre T&L cache.
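The Setup Input and Attribute Load steps above can be sketched as one address generator per active attribute. The `AttributeStream` type, the attribute names and the addresses are hypothetical, and the sketch covers only sequential (stream mode) access.

```python
from dataclasses import dataclass


@dataclass
class AttributeStream:
    start_address: int  # address of the attribute for the first vertex
    stride: int         # address increment between consecutive vertices


def setup_input(streams):
    """Setup Input: initialise one address generator per active attribute."""
    return {name: s.start_address for name, s in streams.items()}


def attribute_load(streams, generators):
    """Attribute Load: emit one memory request per attribute, then advance."""
    requests = []
    for name, stream in streams.items():
        requests.append((name, generators[name]))  # request sent to the MC
        generators[name] += stream.stride          # increment address generator
    return requests
```

In index mode the address would presumably be computed as start address plus index times increment instead of being advanced sequentially.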
7
Current Status
- Comments:
  - Currently the signal latency/bandwidth is specified with raw numbers. Alternatives:
    - Use constants. Store in a single ‘signal definition’ file for all units or in separate units (must be shared between the two boxes connected by the signal).
    - Use some kind of Architecture Description for signal delays, bandwidth, data bus width (to be used in memory transmission calculations and similar).
  - Currently most units only support single issue/fetch/process. Should be ‘generalized’ to multiissue/fetch/process and parametrized.
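One way to picture the single ‘signal definition’ file is a shared table that both boxes on a signal look up, so that neither side hard-codes the numbers. The signal names and values below are made up for illustration; they are not the simulator's actual signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalDef:
    bandwidth: int  # items that can be written to the signal per cycle
    latency: int    # cycles between write and read


# Hypothetical central table; both ends of a signal read the same entry.
SIGNALS = {
    "CommandProcessor->MemoryController": SignalDef(bandwidth=1, latency=1),
    "Streamer->Shader": SignalDef(bandwidth=1, latency=2),
}


def signal(src, dst):
    """Look up the shared definition for the signal from src to dst."""
    return SIGNALS[f"{src}->{dst}"]
```

Because writer and reader resolve the same entry, changing a latency in one place keeps the two boxes consistent.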
8
Tests
- Signal Tracer Analyzer -> Carlos.
- OpenGL test trace:
  - Sphere.
    - Using glutSolidSphere: 2 triangle fans, n quad strips.
    - Trying to implement a sphere using Icosahedron subdivision to create a triangle strip mesh to test the index mode. And later add lighting shader.
    - As many vertexes/polygons as we want (~10000 in current generated trace).
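A sphere built by icosahedron subdivision naturally produces shared vertices, which is what makes it useful for exercising index mode. The sketch below generates an indexed triangle list; converting it to triangle strips is not shown, and the function names are illustrative.

```python
import math


def normalize(v):
    """Project a point onto the unit sphere."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def icosahedron():
    """Return (vertices, faces) for a unit icosahedron with 12 vertices, 20 faces."""
    t = (1 + math.sqrt(5)) / 2
    verts = [(-1, t, 0), (1, t, 0), (-1, -t, 0), (1, -t, 0),
             (0, -1, t), (0, 1, t), (0, -1, -t), (0, 1, -t),
             (t, 0, -1), (t, 0, 1), (-t, 0, -1), (-t, 0, 1)]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    return [normalize(v) for v in verts], faces


def subdivide(verts, faces):
    """Split every triangle into four, sharing midpoint vertices between faces."""
    verts = list(verts)
    cache = {}  # edge (low index, high index) -> index of its midpoint vertex

    def midpoint(a, b):
        key = (min(a, b), max(a, b))
        if key not in cache:
            mid = tuple((verts[a][i] + verts[b][i]) / 2 for i in range(3))
            cache[key] = len(verts)
            verts.append(normalize(mid))
        return cache[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
    return verts, new_faces
```

Each subdivision level quadruples the face count, so the vertex and polygon count of the test can be scaled simply by choosing the number of levels.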
9
Tests
- Changes needed:
  - Add support for glNormal3f, GL_TRIANGLE_FAN and GL_QUAD_STRIP to the TraceReader/Library/Driver.
  - Add support for triangle fans and quad strips (?) to the CP and the fake rasterizer (Shader and Streamer don’t care about that).
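Supporting triangle fans and quad strips downstream of the shader mostly amounts to deciding which vertices form each triangle. A minimal sketch of that assembly, working on vertex indices, is shown here; the function names are illustrative and the diagonal used to split each quad is an arbitrary choice.

```python
def triangle_fan(indices):
    """GL_TRIANGLE_FAN: every triangle shares the first vertex."""
    return [(indices[0], indices[i], indices[i + 1])
            for i in range(1, len(indices) - 1)]


def quad_strip(indices):
    """GL_QUAD_STRIP: each new pair of vertices closes a quad with the previous pair.

    Each quad (v0, v1, v3, v2) is split into two triangles with the same winding.
    A trailing unpaired vertex is ignored.
    """
    triangles = []
    for i in range(0, len(indices) - 3, 2):
        v0, v1, v2, v3 = indices[i:i + 4]
        triangles.append((v0, v1, v2))
        triangles.append((v2, v1, v3))
    return triangles
```

For example, a fan over five vertices yields three triangles, and a quad strip over six vertices yields two quads, i.e. four triangles.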
10
XBox Documentation
- Interesting information about the Vertex Shader architecture and the T&L pipeline down to the Primitive Assembly Cache and the Triangle Setup.
- Includes estimated sizes and clock latencies for most of the operations.
11
[Diagram: XBox T&L pipeline]
Stages: Memory -> Pre T&L Cache -> Vertex Shader -> Post T&L Cache -> Primitive Assembly -> Triangle Setup -> Rasterization
Data labels: cache line (raw vertex data); raw vertex; transformed and lit vertex; 3 transformed and lit vertices; 3 vertices
Other labels: 4 KB, 4-way set associative, 128 32-B cache lines; 16 – 24 entry FIFO; 200 MHz
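The cache figures on the diagram (4 KB, 4-way set associative, 32-byte lines) are mutually consistent: they give 128 lines organised as 32 sets. The sketch below derives those numbers and splits an address into tag, set and offset. It assumes the figures describe the Pre T&L cache, which the extracted slide text does not state explicitly, and it assumes conventional indexing by the low-order line-address bits.

```python
CACHE_BYTES = 4 * 1024  # 4 KB
LINE_BYTES = 32         # 32-byte cache lines
WAYS = 4                # 4-way set associative

NUM_LINES = CACHE_BYTES // LINE_BYTES  # 128 lines, as stated on the slide
NUM_SETS = NUM_LINES // WAYS           # 32 sets


def decompose(address):
    """Split a byte address into (tag, set index, byte offset within the line)."""
    offset = address % LINE_BYTES
    set_index = (address // LINE_BYTES) % NUM_SETS
    tag = address // (LINE_BYTES * NUM_SETS)
    return tag, set_index, offset
```

With 32 sets of 32-byte lines, addresses 1 KB apart map to the same set, so up to four such lines can be resident at once.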
12
XBOX
- Differences:
  - No Pre T&L cache.
  - The Post T&L cache seems to be accessed by the Primitive Assembly Cache. However we push the vertex to the Rasterizer (or whatever lies after the shader).
  - Sending the shaded vertex to the primitive assembly takes multiple cycles (2+) depending on the number of attributes used by the vertex.
13
XBOX Vertex Shader
- Registers:
  - 16 input registers.
  - 12 temporary registers.
  - 192 constant registers.
  - 1 address register.
  - 11 output registers.
14
XBOX Vertex Shader
- Instructions:
  - Shader Operations:
    - 13 MAC opcodes.
    - 7 ILU (inverse logic unit) opcodes.
  - 136 microcode instructions. Each instruction can:
    - Read three registers with swizzle and negation.
    - Compute one MAC op and one ILU op.
    - Write up to one output register and two temporary registers with masking.
  - Shader types:
    - Normal vertex shaders.
    - Read/write vertex shaders.
    - Vertex state shaders.
15
XBOX Vertex Shaders
- Timing:
  - The cycle speed is 250 MHz.
  - For normal shaders, instructions take between one-half cycle and one cycle to complete.
  - For read/write and vertex state shaders, instructions take between one cycle and six cycles to complete.
16
XBOX Vertex Shaders
- Multithreaded:
  - Two copies of the vertex shader pipeline (2 VS).
  - Each copy can run up to three threads (3 active threads per shader).
  - Read/write vertex shaders and vertex state shaders run single threaded, on a single pipeline.
- Stalling:
  - Instructions take six cycles to compute their outputs.
  - Bypasses: ALU, ILU and MLU bypasses.
  - Three cycles latency with bypasses.
  - Bypass allows swizzling and negate of the result.
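To see how the six-cycle and three-cycle latencies interact with multithreading, the sketch below computes when a dependent instruction of the same thread could issue. It assumes strict round-robin issue among the active threads, which the slides do not state; the function name is illustrative.

```python
import math


def dependent_issue_cycle(producer_cycle, threads, bypass=True):
    """Earliest issue cycle for an instruction that depends on the producer.

    Assumes round-robin issue, so a thread gets one issue slot every
    `threads` cycles. Latencies come from the slide: 3 cycles with
    bypasses, 6 cycles without.
    """
    latency = 3 if bypass else 6
    slots_to_wait = math.ceil(latency / threads)
    return producer_cycle + slots_to_wait * threads
```

Under that assumption, three active threads with bypasses let a thread issue a dependent instruction in its very next slot, while without bypasses it skips one slot.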
17
Post Vertex Shader
- Divide by w.
  - Can be avoided/delayed if rasterization is performed in homogeneous coordinates (Olano & Greer).
- Viewport transformation.
  - Scale to screen/window coordinate system.
- Primitive Assembly:
  - Get the three vertexes of a triangle.
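The divide by w and the viewport transformation can be sketched as two small functions. The viewport mapping follows the usual OpenGL convention (normalized device coordinates in [-1, 1] mapped to a window rectangle and a depth range); parameter names are illustrative.

```python
def perspective_divide(clip):
    """Divide by w: clip coordinates (x, y, z, w) -> normalized device coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)


def viewport_transform(ndc, vx, vy, width, height, near=0.0, far=1.0):
    """Scale NDC in [-1, 1] to window coordinates and a [near, far] depth range."""
    x, y, z = ndc
    return (vx + (x + 1) * width / 2,
            vy + (y + 1) * height / 2,
            near + (z + 1) * (far - near) / 2)
```

For a 640x480 viewport at the origin, the NDC point (0, 0, 0) lands at the centre of the window with depth 0.5.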
18
Post Vertex Shader
- Back face culling:
  - Can be calculated using the area of the triangle (determinant of the three vertices in homogeneous coordinates).
  - Negative or positive area.
  - Can also be used to cull zero area triangles.
- Clipping:
  - Using rasterization in homogeneous coordinates: just add more clipping edges.
  - Triangle clipping: ?
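The determinant test for back-face culling can be written directly on the (x, y, w) coordinates of the three vertices. Which sign counts as front facing depends on the winding convention; the sketch assumes positive means front facing, and the function names are illustrative.

```python
def homogeneous_determinant(v0, v1, v2):
    """Determinant of the 3x3 matrix whose rows are (x, y, w) for each vertex.

    With w = 1 for all three vertices this equals twice the signed screen area.
    """
    (x0, y0, w0), (x1, y1, w1), (x2, y2, w2) = v0, v1, v2
    return (x0 * (y1 * w2 - y2 * w1)
            - y0 * (x1 * w2 - x2 * w1)
            + w0 * (x1 * y2 - x2 * y1))


def classify(v0, v1, v2):
    """Return 'front', 'back' or 'zero-area' (positive = front is an assumption)."""
    det = homogeneous_determinant(v0, v1, v2)
    if det == 0:
        return "zero-area"
    return "front" if det > 0 else "back"
```

Swapping any two vertices flips the sign, and collinear vertices give zero, so the same value serves both back-face and zero-area culling.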
19
Post Vertex Shader
- Discard degenerate triangles:
  - If two or more vertices are the same (could be index based or a full vertex comparison) the triangle can be discarded.
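The degenerate-triangle check is the same comparison whether it is applied to vertex indices or to full vertex data. A minimal sketch, with an illustrative function name:

```python
def is_degenerate(a, b, c):
    """True if at least two of the three vertices are the same.

    Works for index-based comparison (ints) and for full vertex
    comparison (e.g. tuples of attribute values).
    """
    return a == b or b == c or a == c
```

The index-based form is cheaper but misses triangles whose vertices have different indices and identical data; the full comparison catches those at the cost of comparing every attribute.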
20
Rasterization
- Alternatives:
  - Scanline incremental interpolation (DDA).
  - Rasterization in homogeneous coordinates.
- Two phases:
  - Triangle setup.
    - Set interpolation registers.
  - Fragment generation.
    - Incrementally update the interpolants.
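The two phases can be illustrated with a simple edge-function rasterizer: setup computes the coefficients of three linear edge equations, and fragment generation walks a bounding box while updating the edge values by addition only. This is a 2D sketch on integer pixel coordinates, not the scanline DDA or the full homogeneous method; it assumes a counter-clockwise triangle and treats samples on an edge as inside (no fill rule).

```python
def triangle_setup(v0, v1, v2):
    """Setup: return (a, b, c) per edge so that E(x, y) = a*x + b*y + c.

    For a counter-clockwise triangle, E >= 0 for all three edges inside it.
    """
    edges = []
    for (xa, ya), (xb, yb) in ((v0, v1), (v1, v2), (v2, v0)):
        edges.append((ya - yb, xb - xa, xa * yb - xb * ya))
    return edges


def generate_fragments(edges, x_min, y_min, x_max, y_max):
    """Fragment generation: step through the box, updating edges incrementally."""
    # Edge values at the first sample of the first row (the 'registers').
    row = [a * x_min + b * y_min + c for a, b, c in edges]
    fragments = []
    for y in range(y_min, y_max + 1):
        values = list(row)
        for x in range(x_min, x_max + 1):
            if all(v >= 0 for v in values):
                fragments.append((x, y))
            # One step in x adds the 'a' coefficient of each edge.
            values = [v + a for v, (a, _, _) in zip(values, edges)]
        # One step in y adds the 'b' coefficient of each edge.
        row = [v + b for v, (_, b, _) in zip(row, edges)]
    return fragments
```

Other per-vertex attributes can be interpolated in the same way, since any quantity that is linear in x and y can be set up once and then updated with one addition per step.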