3D Graphic Hardware Pipeline Victor Moya
Index 3D Graphic Pipeline Overview. Geometry. Rasterization. Fragment. 3D Graphic Hardware pipeline. Current GPUs. ATI R300. NVidia NV30. 3DLabs P10. Matrox Parhelia.
3D Graphics Pipeline
Application: simulation, input event handlers, modifying data structures, database traversal, primitive generation, utility functions. Command: command buffering, command interpretation, unpacking and format conversion, maintaining graphics state. Geometry: evaluation of polynomials for curved surfaces, transform and projection, clipping, culling and primitive assembly.
3D Graphics Pipeline Fixed vs Programmable.
Geometry Vertex operations: (1) Transform coordinates and normal: Model => World, World => Eye. (2) Normalize the length of the normal. (3) Compute vertex lighting. (4) Transform texture coordinates. (5) Transform coordinates to clip coordinates (projection). (8) Divide coordinates by w. (9) Apply affine viewport transform (x, y, z).
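The position part of steps (1), (5), (8) and (9) can be sketched as below. This is only an illustration: the helper names (`mat_vec`, `transform_vertex`, `IDENTITY`), the row-major 4x4 matrix layout and the mapping of depth to [0, 1] are assumptions, not something the slides specify.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def transform_vertex(position, model, view, projection, viewport):
    """Carry one vertex position (x, y, z) from model space to window space."""
    vx, vy, width, height = viewport
    world = mat_vec(model, list(position) + [1.0])  # (1) model => world
    eye = mat_vec(view, world)                      # (1) world => eye
    clip = mat_vec(projection, eye)                 # (5) eye => clip
    w = clip[3]
    ndc = [clip[0] / w, clip[1] / w, clip[2] / w]   # (8) divide by w
    # (9) affine viewport transform; depth is mapped to [0, 1] (assumption)
    return (vx + (ndc[0] + 1.0) * 0.5 * width,
            vy + (ndc[1] + 1.0) * 0.5 * height,
            (ndc[2] + 1.0) * 0.5)


IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
```

With identity matrices and a 640x480 viewport at the origin, the model-space origin lands in the middle of the window.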
Geometry Primitive operations: (6) Primitive assembly. (7) Clipping. (10) Backface cull: eliminate back-facing triangles. Primitive generation: new pipeline stage (ATI TruForm).
Lighting Diffuse Lighting. Light Sources. Specular Lighting. Emission. Gouraud Shading. Phong Shading. Bump Mapping. OpenGL Lighting.
Light Sources Ambient Light. Directional Light Sources: infinite light source (parallel rays), no attenuation. Point Light Sources: all directions, attenuation. Spot Light Sources: cone of light, attenuation. Kc, Kl and Kq are the constant, linear and quadratic attenuation values. U: Direction of the spot light. L: Unit direction vector from the surface point to the spot light.
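The attenuation formulas themselves did not survive in the slide text. The sketch below assumes the usual fixed-function form, 1 / (Kc + Kl d + Kq d^2) for distance attenuation and max(-U.L, 0)^p for the spot cone; the distance `d`, the spot exponent `p` and the function names are assumptions introduced for illustration.

```python
def attenuation(d, kc, kl, kq):
    """Distance attenuation for point and spot lights (assumed standard form)."""
    return 1.0 / (kc + kl * d + kq * d * d)


def spot_factor(u, l, exponent):
    """Spot cone falloff: u is the spot direction, l points from surface to light."""
    cos_angle = -(u[0] * l[0] + u[1] * l[1] + u[2] * l[2])
    return max(cos_angle, 0.0) ** exponent
```

A directional light would simply skip both factors, since it has no position and no attenuation.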
Diffuse Lighting A: Ambient light. T: Texture sample. D: Surface diffuse reflection color. Ci: Intensity of light i at the surface point. N: Normal vector of the surface. Li: Unit direction vector to light source i.
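The diffuse equation is missing from the extracted slide. A minimal sketch, assuming the common form Kdiffuse = D T (A + sum over i of Ci max(N.Li, 0)) with component-wise color products, could look like this; the function names and the list-of-pairs representation of the lights are hypothetical.

```python
def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def diffuse(d, t, a, lights, n):
    """Assumed form: D * T * (A + sum_i Ci * max(N . Li, 0)).

    lights is a list of (Ci, Li) pairs: RGB intensity and unit direction to light.
    """
    total = list(a)
    for c, l in lights:
        k = max(dot(n, l), 0.0)  # lights behind the surface contribute nothing
        total = [total[j] + c[j] * k for j in range(3)]
    return [d[j] * t[j] * total[j] for j in range(3)]
```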
Specular Lighting S: Surface specular color. Ci: Intensity of the incident light. m: Specular exponent (larger, sharper highlight). G: Gloss map sample. N: Normal vector at the surface. L: Unit direction vector to the light. Hi: Halfway vector (V + L), normalized. V: Unit direction vector to the viewer.
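The specular equation is also missing from the extracted text. The sketch below assumes the halfway-vector form Kspecular = S G sum over i of Ci max(N.Hi, 0)^m, counted only for lights in front of the surface (N.Li > 0); names and data layout are illustrative.

```python
import math


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def normalize(v):
    length = math.sqrt(dot(v, v))
    return list(v) if length == 0.0 else [x / length for x in v]


def specular(s, g, lights, n, v, m):
    """Assumed form: S * G * sum_i Ci * max(N . Hi, 0)^m, for lights with N . Li > 0."""
    total = [0.0, 0.0, 0.0]
    for c, l in lights:
        if dot(n, l) <= 0.0:
            continue  # light is behind the surface
        h = normalize([v[j] + l[j] for j in range(3)])  # halfway vector
        k = max(dot(n, h), 0.0) ** m
        total = [total[j] + c[j] * k for j in range(3)]
    return [s[j] * g[j] * total[j] for j in range(3)]
```

Raising m narrows the highlight: for a light away from the mirror direction, the result with m = 32 is smaller than with m = 2.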
Emission K emission = EM E: Surface emission color. M: Emission map sample.
OpenGL Lighting Calculated at each vertex, interpolated inside the triangle (Gouraud). Bump mapping supported by proprietary extensions. Pixel Shaders for programmable per-pixel lighting.
OpenGL Lighting
Clipping Clip geometry primitives against the view frustum (6 planes). Clip geometry primitives against the user clip planes. Techniques used: Guard-Band Clipping. Homogeneous rasterization avoids clipping in the geometry stage.
Guard-Band Clipping
Homogeneous coordinates “Triangle Scan Conversion using 2D Homogeneous Coordinates”, Olano and Greer.
Programmable Pipeline
Vertex Program
Vertex Shader VS 1.0, 1.1 and 1.2 (current technology) for Direct3D 8 and 8.1. OpenGL extensions: ARB_vertex_program (finally in OpenGL v1.4), NV_vertex_program1_1 (NVidia), EXT_vertex_shader (ATI). No branching. Single cycle execution latency (?). Single instruction issued each cycle. Simple in-order pipeline (?).
Vertex Shader 16 input registers (read only). 15 output registers (write only). 12 temporary registers (read/write). 96 constant registers (read only or read/write?). 128 instructions max.
Vertex Shader
Opcode  Inputs (scalar or vector)  Output (vector or replicated scalar)  Operation
ARL     s                          address register                      address register load
MOV     v                          v                                     move
MUL     v,v                        v                                     multiply
ADD     v,v                        v                                     add
MAD     v,v,v                      v                                     multiply and add
RCP     s                          ssss                                  reciprocal
RSQ     s                          ssss                                  reciprocal square root
DP3     v,v                        ssss                                  3-component dot product
DP4     v,v                        ssss                                  4-component dot product
DST     v,v                        v                                     distance vector
MIN     v,v                        v                                     minimum
MAX     v,v                        v                                     maximum
SLT     v,v                        v                                     set on less than
SGE     v,v                        v                                     set on greater than or equal
EXP     s                          v                                     exponential base 2
LOG     s                          v                                     logarithm base 2
LIT     v                          v                                     light coefficients
DPH     v,v                        ssss                                  homogeneous dot product
RCC     s                          ssss                                  reciprocal clamped
SUB     v,v                        v                                     subtract
ABS     v                          v                                     absolute value
NV_vertex_program2 ARL (new support for four-component A0 and A1 instead of just A0.x). ARR (similar to ARL, but rounds instead of truncating before storing the integer result in an address register). BRA, CAL, RET (branching instructions). COS, SIN (high-precision trigonometric functions). FLR, FRC (floor and fraction of floating-point values). EX2, LG2 (high-precision exponentiation and logarithm functions). ARA (adds pairs of components of an address register; useful for looping and other operations). SEQ, SFL, SGT, SLE, SNE, STR (“set on” instructions similar to SLT, SGE). SSG (“set sign” operation; generates a vector holding -1.0 for negative operand components, 0 for zero-value components, and +1.0 for positive components).
NV_vertex_program2 Overview 1. Condition codes 2. Branching & subroutines 3. Even faster performance 4. Nineteen new instructions 5. New source modifiers 6. Clip plane support 7. More registers & instructions
NV_vertex_program2 Resource Limits 256 vertex program parameters (up from 96). 16 temporary registers (up from 12). Two 4-component address registers (up from one single-component address register). 256 static instructions per program (up from 128). Given branching, the number of dynamic instructions that can execute before termination is limited, to avoid infinite loops.
NV_vertex_program2 Source Modifiers Source operand absolute value. Example: MOV R0, |R1|; In addition to source negation & swizzling. Example: MAD R0, -|R1|.yzwy, |R2|, -R3.w; Swizzle, negate, & absolute value operations are “free” source modifiers.
NV_vertex_program2 Condition Codes (1) Condition code state: a 4-component register stores condition code values. Four possible values: LT (less than zero), EQ (equal to zero), GT (greater than zero), UN (unordered, for comparisons involving NaN). Most instructions optionally update condition code state, indicated with a “C” suffix: DP4C, MOVC, etc. The “CC” pseudo-register is used to just update condition codes.
NV_vertex_program2 Condition Codes (2) Optional condition code based destination masking. Example: MOV R1.xy(NE.z), R0; Copy R0 components to R1’s X & Y components except when the condition code’s Z component is EQ. Condition code rules: EQ, equal; GE, greater or equal; GT, greater than; LE, less or equal; LT, less than; NE, not equal; FL, false; and TR, true. Note that the condition code masking rule can swizzle condition code components.
Rasterization Setup (per-triangle). Sampling (triangle = {fragments}). Interpolation (interpolate colors and coordinates).
Rasterization Converts primitives to fragments. Primitive: point, line, polygon, … Fragment: transient data structure (short x, y; long depth; short r, g, b, a;). Fragment selection. Parameter assignment (color, depth...).
Rasterization Setup triangles. Fill triangle: interpolate parameters. Parameters: R, G, B, z, r, s, t, q.
Pixel Planes Calculate 3 edge functions: if all the edge functions are positive at a point (x, y), the point is inside the triangle. E(x, y) = (x – X)dY – (y – Y)dX. E(x, y) > 0 if (x, y) is to the “right” side. E(x, y) = 0 if (x, y) is exactly on the line. E(x, y) < 0 if (x, y) is to the “left” side.
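A direct (non-incremental) version of the inside test can be sketched as follows. The function names are hypothetical, and the vertices are assumed to be ordered so that the triangle interior lies on the positive (“right”) side of every edge.

```python
def edge(x, y, X, Y, dX, dY):
    """E(x, y) = (x - X) dY - (y - Y) dX for an edge through (X, Y) with direction (dX, dY)."""
    return (x - X) * dY - (y - Y) * dX


def inside_triangle(x, y, verts):
    """True if all three edge functions are strictly positive at (x, y)."""
    for i in range(3):
        X, Y = verts[i]
        nx, ny = verts[(i + 1) % 3]
        if edge(x, y, X, Y, nx - X, ny - Y) <= 0:
            return False
    return True
```

For the triangle (0, 0), (0, 4), (4, 0) the point (1, 1) is inside, while (3, 3) and (-1, 1) are not.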
Edge Functions
Classification (1) A polygon defined by N vertices: (xi, yi) 0 < i <= N, (x0, y0) = (xN, yN). The incremental classification of the points around a polygon can be calculated as: Initial values: dXi = Xi – X(i-1), dYi = Yi – Y(i-1), Ei(Xs, Ys) = (Xs – Xi) dYi – (Ys – Yi) dXi for 0 < i <= N.
Classification (2) Incremental computation for a unit step along the X and Y axes: Ei(x + 1, y) = Ei(x, y) + dYi. Ei(x - 1, y) = Ei(x, y) - dYi. Ei(x, y + 1) = Ei(x, y) - dXi. Ei(x, y - 1) = Ei(x, y) + dXi. A fragment is inside the polygon if: Ei >= 0 for all i : 0 < i <= N.
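The incremental scheme can be sketched as a brute-force scan over a small pixel grid that evaluates each edge function once and then only adds or subtracts dYi and dXi. The function names and the full-grid traversal are illustrative assumptions; real hardware traverses the polygon more cleverly.

```python
def setup_edges(verts, xs, ys):
    """Initial values (Ei, dXi, dYi) for every edge, evaluated at (xs, ys)."""
    edges = []
    for i in range(len(verts)):
        xi, yi = verts[i]
        xp, yp = verts[i - 1]  # previous vertex; index -1 wraps around
        dx, dy = xi - xp, yi - yp
        edges.append(((xs - xi) * dy - (ys - yi) * dx, dx, dy))
    return edges


def rasterize(verts, width, height):
    """Return the covered pixels, using only additions per step."""
    covered = []
    row = setup_edges(verts, 0, 0)
    for y in range(height):
        cur = [e for e, _, _ in row]
        for x in range(width):
            if all(e >= 0 for e in cur):
                covered.append((x, y))
            cur = [e + dy for e, (_, _, dy) in zip(cur, row)]  # step +1 in X
        row = [(e - dx, dx, dy) for e, dx, dy in row]          # step +1 in Y
    return covered
```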
Classification
Traversing the Polygon
Clipping
Parallel Rasterization E(x + L, y) = E(x, y) + L dY. This allows a group of interpolators, each responsible for a pixel within a block of L contiguous pixels, to simultaneously compute the edge function of an adjacent block in a single cycle.
Olano and Greer Triangle Scan Conversion using 2D Homogeneous Coordinates. Based on the Pixel Planes and Pineda approach (edge functions) but using homogeneous coordinates. Avoids the need for clipping. Adds a hither edge function for user clipping. Perspective-correct interpolation.
Interpolation function A parameter varies linearly across a triangle in 3D: u = aX + bY + cZ. The 3D position (X, Y, Z) projects to 2D, using 2DH coords (x = X, y = Y, w = Z). The equation in 2DH space: u = ax + by + cw. 2D perspective-correct function (division by w): u/w = a x/w + b y/w + c = a X + b Y + c, where (X, Y) = (x/w, y/w) now denotes the screen-space position. u/w is a linear function in screen space (X, Y).
Interpolation function If each vertex has a value for u, we can solve for [a b c] using this equation: [u0 u1 u2] = [a b c] M, where the columns of M are the 2D homogeneous vertices (xi, yi, wi), so [a b c] = [u0 u1 u2] M^-1.
Scan conversion Edge function parameters: [1 0 0], [0 1 0], [0 0 1]. 1/w interpolation parameter: [1 1 1]. Zero-area and back-facing triangles: the inverse of the 3x3 matrix M only exists if the determinant of M is not 0. The determinant calculates a function of the area of the triangle.
Arbitrary clip planes To add arbitrary clip planes (user clip planes) we need to add new clip edge functions:
Algorithm To summarize the algorithm:
    setup:
        three edge functions = M^-1 = inverse of 2D homogeneous vertex matrix
        for each clip edge
            clip edge function = dot product test * M^-1
        interpolation function for 1/w = sum of rows of M^-1
        for each parameter
            interpolation function = parameter vector * M^-1
    pixel processing:
        interpolate linear edge and parameter functions
        where all edge functions are positive
            w = 1/(1/w)
            for each parameter
                perspective-correct parameter = parameter * w
Cost Setup: Calculate the interpolation coefficients and slopes. 1 matrix inversion (1 division, multiple multiplications/additions). 1 matrix-vector multiplication for each parameter; this includes the edge and clip edge functions, the 1/w value and the other parameters (r, g, b, z, s, t, r) (3x3 matrix/vector multiplication: 9 Mul + 6 Add). Calculate the X and Y slopes (derivatives) for each parameter and the initial value at the first pixel (2 Mul + 2 Add per parameter).
Cost (2) Per pixel: Interpolate parameters: 1 addition per parameter. Determine if the 3 edge functions are positive (3 sign tests). Determine if the clip edge functions are positive (n sign tests). Per pixel inside the triangle: w = 1/(1/w) (1 division?). For each parameter, perspective-correct parameter value: u = uw * w (1 multiplication for each parameter).
OpenGL Rasterization
Rasterization/Fragments Calculate the final color value of the fragment: Texture Read. Color sum. Fog.
Texture Texture transformation and projection. Texture address calculation. Texture filtering.
Gouraud Shading Lighting is calculated at each vertex and interpolated across the triangle. K = Kprimary * T1 * T2 * ... * Tk + Ksecondary. Ti: Color sample for one of k texture maps. *: One of several available texture combination operations.
Phong Shading Interpolates vertex normals and evaluates the lighting formula at each pixel. K = Kemission + Kdiffuse + Kspecular. Problem: interpolation of normals produces non-unit vectors. Use normalization cube maps.
Flat, Gouraud and Phong Shading
Bump Mapping A hardware implementation of Phong Shading. Uses a texture map to perturb the normal vector at each pixel (not interpolated). Bump Map: 2D array of 3D vectors. Direction of the normal vector relative to the interpolated normal vector at the pixel. Uses tangent space for storing the perturbations. Object to tangent space transformation (3x3 matrix multiplication).
Bump Mapping
Fragment Texture combiners and fog. Owner, scissor, depth, alpha and stencil tests. Blending or compositing. Dithering and logical operations.
Per fragment (tests) Determine the visibility of the fragment: Ownership test. Scissor test. Alpha test. Stencil test. Depth Buffer test. Final pixel color: Blending. Dithering. Logic Operation.
OpenGL per fragment
Textures Map from screen space coordinates to object space to texture space. Texture formats: 1D, 2D, 3D and cubemap. Texture read: take a number of texture samples (texels), filter them and combine the result with other texture results or the original pixel color. Size pixel > Size texel => minification. Size pixel = Size texel => copy. Size pixel < Size texel => magnification.
Level of Detail LOD is calculated to determine the mipmap level to use and whether minification or magnification applies.
Level of Detail Select sampling mode using parameter c (can be 0 or 0.5): If λ > c => minification. If λ <= c => magnification. Scale factor:
Minification Minification: Nearest: the texel whose center is nearest to the texture coordinates is read. Linear: interpolation (bilinear).
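The scale factor equation is missing from the extracted slide. A minimal sketch, assuming the usual definition (the larger of the texel-space footprint lengths along the screen x and y axes, with λ = log2 of that scale factor and no LOD bias), might look like this; the function and argument names are hypothetical.

```python
import math


def scale_factor(du_dx, dv_dx, du_dy, dv_dy):
    """Assumed scale factor: the larger texel-space step per pixel along x or y."""
    return max(math.hypot(du_dx, dv_dx), math.hypot(du_dy, dv_dy))


def level_of_detail(rho):
    """lambda = log2(rho); LOD bias and clamping are left out of this sketch."""
    return math.log2(rho)


def filter_mode(lod, c):
    """Minification if lambda > c, magnification otherwise."""
    return "minification" if lod > c else "magnification"
```

A pixel that covers 4 texels in each direction gives λ = 2 (minification, mipmap level 2); a pixel that covers half a texel gives λ = -1 (magnification).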
Minification(2)
Mipmapping A texture is formed by a pyramidal data structure of max(n, m) + 1 images, from 2^n x 2^m down to 1x1 pixels. The proper image is accessed using the LOD parameter.
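The size of each level in the pyramid can be enumerated with a short sketch. The function name is hypothetical; it assumes each level halves both dimensions, with a dimension that has reached 1 staying at 1.

```python
def mip_chain(n, m):
    """Sizes of all mipmap levels for a 2^n x 2^m base image."""
    w, h = 2 ** n, 2 ** m
    levels = [(w, h)]
    while w > 1 or h > 1:
        w, h = max(w // 2, 1), max(h // 2, 1)
        levels.append((w, h))
    return levels
```

An 8x4 texture (n = 3, m = 2) yields four levels: 8x4, 4x2, 2x1 and 1x1.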
Mipmapping Use the calculated LOD to decide which level to read from. Filtering: NEAREST_MIPMAP_NEAREST and LINEAR_MIPMAP_NEAREST. NEAREST_MIPMAP_LINEAR and LINEAR_MIPMAP_LINEAR (trilinear filtering).
Magnification LINEAR or NEAREST: similar to minification.
OpenGL Multitexture
Cubemap A cubemap texture is composed of six 2D textures/images, one for each of the 6 faces of a cube. The texture coordinates (s, t, r) are used as a direction vector from the center of the cube to one of the sides. The coordinate with the greatest absolute value is used to determine which face to access. The other two coordinates are recalculated to access the texture in that face as a normal 2D texture.
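The face selection and coordinate remapping can be sketched as below. The per-face sign conventions follow the OpenGL cube map table as far as I recall it and should be treated as an assumption; the function name and face labels are hypothetical, and a zero direction vector is not handled.

```python
def cubemap_lookup(s, t, r):
    """Pick the face from the major axis and remap the other two coordinates to [0, 1]."""
    ax, ay, az = abs(s), abs(t), abs(r)
    if ax >= ay and ax >= az:
        face, sc, tc, ma = ("+X", -r, -t, ax) if s > 0 else ("-X", r, -t, ax)
    elif ay >= az:
        face, sc, tc, ma = ("+Y", s, r, ay) if t > 0 else ("-Y", s, -r, ay)
    else:
        face, sc, tc, ma = ("+Z", s, -t, az) if r > 0 else ("-Z", -s, -t, az)
    # Divide by the major axis magnitude, then map [-1, 1] to [0, 1].
    return face, (sc / ma + 1.0) * 0.5, (tc / ma + 1.0) * 0.5
```

A direction straight along +X hits the center of the +X face, that is 2D coordinates (0.5, 0.5).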
Cubemap
Texture environment and texture functions OpenGL 1.4, basic support for register combiners (NV_texture_shader for GF3 and beyond, ATI_fragment_shader for R200). Defines source arguments and functions to combine textures and the original color. Functions: REPLACE, MODULATE, ADD, ADD_SIGNED, INTERPOLATE, SUBTRACT, DOT3_RGB, DOT3_RGBA. Color channels (RGB) and alpha channel (A) are calculated (and configured) separately in parallel.
Shadow map First pass: write the depth buffer to a texture from the point of view of a light. Second pass: compare the z value in the texture with the current z value (eye). Use the stencil buffer. In OpenGL 1.4 use texture internal format DEPTH_COMPONENT and texture comparison mode: TEXTURE_COMPARE_MODE = COMPARE_R_TO_TEXTURE. TEXTURE_COMPARE_FUNC = {LEQUAL, GEQUAL}.
Projected textures Divide by the fourth component of (s, t, r, q) and access the texture at (s/q, t/q, r/q).
Textures Original: additional color (material) information per pixel. It is used to compensate for the lack of geometry information. Current: color, normals or any kind of information. Different formats (access) supported by hardware (1D, 2D, 3D, cubemap). Dependent reads supported (use information from a texture as the address to access another texture). Minification, magnification. MIP mapping (Multum in Parvo): multiple levels of detail for a single texture. Filtering: bilinear (4 accesses to the same mipmap), trilinear (8 accesses to two mipmaps), anisotropic (up to 128 accesses, 16x trilinear).
Register combiners Multitexture: multiple textures can be read per cycle (multiple texture units per pipe, up to 4 in Matrox Parhelia). Also multiple textures per pass (loop mode, up to 16 in DX9 hardware). The output of those textures is combined (*, +,...) with the pixel interpolated color. First implementation of pixel shaders (not really instructions for a processor, but a configuration for the hardware).
GeForce256 Register Combiners [Figure: texture fetching feeds a register set (Texture 0, Texture 1, Fragment Color, Specular Color, Fog Color/Factor, Spare 0). General Combiner 0 and General Combiner 1 each take 4 RGB inputs and 4 alpha inputs and produce 3 RGB outputs and 3 alpha outputs. The Final Combiner takes 6 RGB inputs and 1 alpha input, plus the specular color.]
GeForce 3/4 Register Combiners
Texture Effects A large number of new graphics effects can be achieved with these extended texture functions: Cubemap (lighting, shadows). Bump Mapping (per-pixel lighting/shading). Others?
Color Sum C = Cpri + Csec. Combines diffuse and specular color.
Fog Calculate blending factor f (3 modes): c: FRAGMENT_DEPTH (eye to fragment distance), FOG_COORDINATE (interpolated). d: FOG_DENSITY. s: FOG_START. e: FOG_END. Final color:
Ownership Test Is the current pixel (x, y) owned by the current OGL context?
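The three fog equations and the final color equation are missing from the extracted slide. The sketch below assumes the standard OpenGL modes (EXP, EXP2, LINEAR) and the blend C = f Cr + (1 - f) Cf, with Cr the fragment color and Cf the fog color; the function names and the string mode labels are illustrative.

```python
import math


def fog_factor(mode, c, density=1.0, start=0.0, end=1.0):
    """Blending factor f, clamped to [0, 1]; c is the fog coordinate or fragment depth."""
    if mode == "EXP":
        f = math.exp(-density * c)
    elif mode == "EXP2":
        f = math.exp(-(density * c) ** 2)
    elif mode == "LINEAR":
        f = (end - c) / (end - start)
    else:
        raise ValueError("unknown fog mode: " + mode)
    return min(max(f, 0.0), 1.0)


def apply_fog(f, fragment_color, fog_color):
    """Assumed final color: C = f * Cr + (1 - f) * Cf."""
    return [f * cr + (1.0 - f) * cf for cr, cf in zip(fragment_color, fog_color)]
```

With linear fog between 0 and 10, a fragment at distance 5 gets f = 0.5, so a red fragment in blue fog comes out as an even mix of the two.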
Scissor Test void Scissor(int left, int bottom, sizei width, sizei height). If left <= x < left + width and bottom <= y < bottom + height the test passes. Otherwise it fails and the fragment is discarded.
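The scissor condition is a simple half-open rectangle test, sketched here with a hypothetical function name:

```python
def scissor_test(x, y, left, bottom, width, height):
    """True if the fragment at (x, y) lies inside the scissor rectangle."""
    return left <= x < left + width and bottom <= y < bottom + height
```

The upper bounds are exclusive, so a 100-pixel-wide box starting at x = 10 accepts x = 109 and rejects x = 110.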
Alpha Test void AlphaFunc(enum func, clampf ref). Compares the reference value with the current fragment alpha (A) component using a function (NEVER, ALWAYS, LESS, LEQUAL, EQUAL, GEQUAL, GREATER, NOTEQUAL). If the test fails the fragment is discarded.
Stencil Test void StencilFunc(enum func, int ref, uint mask). void StencilOp(enum sfail, enum dpfail, enum dppass). Stencil Buffer: an n-bit (usually 8-bit) buffer per pixel in the framebuffer. Tests the current stencil buffer value for the fragment against the reference value, applying a binary mask and using a test function. If the function fails the fragment is discarded and the sfail function is executed on the stencil entry. The stencil buffer is also updated after the depth test: the dpfail function is executed when the depth test fails and dppass when the depth test passes.
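The interaction between the stencil test, the depth test and the three update functions can be sketched as follows. The function names and the string labels for functions and operations are illustrative, and the comparison order (masked reference on the left, masked stencil value on the right) is the OpenGL convention as I understand it.

```python
import operator

COMPARE = {
    "NEVER": lambda a, b: False, "ALWAYS": lambda a, b: True,
    "LESS": operator.lt, "LEQUAL": operator.le, "EQUAL": operator.eq,
    "GEQUAL": operator.ge, "GREATER": operator.gt, "NOTEQUAL": operator.ne,
}


def stencil_update(op, value, ref, bits=8):
    """Apply one stencil update function to an n-bit stencil value."""
    top = (1 << bits) - 1
    if op == "KEEP":
        return value
    if op == "ZERO":
        return 0
    if op == "REPLACE":
        return ref & top
    if op == "INCR":
        return min(value + 1, top)   # saturating increment
    if op == "DECR":
        return max(value - 1, 0)     # saturating decrement
    if op == "INVERT":
        return ~value & top
    if op == "INCR_WRAP":
        return (value + 1) & top
    if op == "DECR_WRAP":
        return (value - 1) & top
    raise ValueError("unknown stencil op: " + op)


def stencil_and_depth(stencil, func, ref, mask, depth_passes, sfail, dpfail, dppass):
    """Return (fragment_survives, new_stencil_value)."""
    if not COMPARE[func](ref & mask, stencil & mask):
        return False, stencil_update(sfail, stencil, ref)
    if not depth_passes:
        return False, stencil_update(dpfail, stencil, ref)
    return True, stencil_update(dppass, stencil, ref)
```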
Stencil Test Test functions: NEVER, ALWAYS, LESS, LEQUAL, EQUAL, GEQUAL, GREATER, NOTEQUAL. Update functions: KEEP, ZERO, REPLACE, INCR, DECR, INVERT, INCR_WRAP, DECR_WRAP. Applications: Shadow volumes. Shadow maps. Others?
Depth Buffer Test void DepthFunc(enum func). Test functions (compare fragment z value with framebuffer z value): NEVER, ALWAYS, LESS, LEQUAL, EQUAL, GREATER, GEQUAL, NOTEQUAL. If the test fails the fragment is discarded. If enabled, stencil update functions are called.
Z-Buffer Visibility test. 1 read from the Z-buffer (24 bits). If the test fails the fragment is discarded. If not, 1 write to the Z-buffer (24 bits). Early Z test (avoid useless work). Hierarchical Z-Buffer: reduces bandwidth. Z-Buffer compression: reduces bandwidth and memory usage. Fast Z clear. Pixel shaders that change pixel depth (Z) disable the early Z test.
Hierarchical Z, Z Compression and Fast Z-Clear
Blending Combine fragment color with framebuffer color. Blend equations: FUNC_ADD: C = Cs*S + Cd*D. FUNC_SUBTRACT: C = Cs*S – Cd*D. FUNC_REVERSE_SUBTRACT: C = Cd*D – Cs*S. MIN: C = min(Cs, Cd). MAX: C = max(Cs, Cd). Blend functions: weight factors for the blend equation. Blend color: Cc constant color.
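The five blend equations can be sketched per color component as below. The function name, the string labels and the clamp of the result to [0, 1] are assumptions for illustration; S and D are passed as per-component weight lists.

```python
def blend(equation, cs, cd, s, d):
    """Blend source color cs with destination color cd using weights s and d."""
    if equation == "FUNC_ADD":
        out = [a * x + b * y for a, x, b, y in zip(cs, s, cd, d)]
    elif equation == "FUNC_SUBTRACT":
        out = [a * x - b * y for a, x, b, y in zip(cs, s, cd, d)]
    elif equation == "FUNC_REVERSE_SUBTRACT":
        out = [b * y - a * x for a, x, b, y in zip(cs, s, cd, d)]
    elif equation == "MIN":
        out = [min(a, b) for a, b in zip(cs, cd)]  # weights are ignored
    elif equation == "MAX":
        out = [max(a, b) for a, b in zip(cs, cd)]  # weights are ignored
    else:
        raise ValueError("unknown blend equation: " + equation)
    return [min(max(c, 0.0), 1.0) for c in out]    # clamp to [0, 1] (assumption)
```

Classic alpha blending with a source alpha of 0.5 corresponds to FUNC_ADD with S = 0.5 and D = 0.5 for every component.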
Dithering Approximate a higher-precision fragment color with a lower-precision framebuffer color. Used?
Logical Operation From an early OGL extension. Operations:
Fragment Program
Pixel Shaders Pixel Shader 1.0, 1.1, 1.2, 1.3: Programs the register combiners stage in NVidia GeForce3 (NV20) and GeForce4 (NV25). Supported in DX8 and NV_texture_shader/NV_texture_shader2. Pixel Shader 1.4: ATI R200 (Radeon 8500), extra features but also based on register combiner hardware. Supported in DX8.1 and ATI_fragment_shader.
Pixel Shaders Pixel Shader 2.0: Programmable shaders (like vertex shaders) but without branching. To be supported in DX9 and ARB_fragment_shader. Pixel Shader 3.0: Extended pixel shaders, unknown features (branching?, NV30 pixel shaders?). To be supported in DX9 or DX9.1.
Pixel Shader Pixel Shader 1.4: 8 constants. Two phases divided in 4 parts: Optional Sampling (Texture read): up to 6 textures. Address Shader: up to 8 instructions. Optional Sampling: up to 6 textures, can be dependent reads. Color Shader: up to 8 instructions.
Pixel Shaders PS2 pixel shaders are true processors (?). Based on Vertex Shaders but without branching. Replaces (or complements) the register combiner stage (NV30). Most instructions of the vertex shader are present in the pixel shader (except branches). Condition codes, swizzle, negate, absolute value, mask, conditional mask (NV30).
Pixel Shaders DX9 pixel shaders are true processors. Based on Vertex Shaders but without branching. Replaces (or complements) the register combiner stage. Most instructions of the vertex shader are present in the pixel shader (except branches). Condition codes, swizzle, negate, absolute value, mask, conditional mask (NV30). Additional instructions (NV30): Texture read: TEX, TEXP, TXD. Partial derivatives: DDX, DDY. Pack/Unpack: PK2H, PK2US, PK4B, PK4UB, PK4UBG, UP2H, UP2US, UP4B, UP4UB, UP4UBG. Fragment conditional kill: KIL. Extra math: LRP (linear interpolation), X2D (2D coordinate transform), RFL (reflection), POW (exponentiation).
R300 Pixel Shader
Pixel Shader Inputs: 1 position (x, y, z, 1/w). 2 colors (4 component vector RGBA). 8 texture coordinates. 1 fog coordinate. Outputs: fragment color (RGBA), optionally new fragment depth. In NV30/R300 also to 4 RGBA textures.
Pixel Shader Temporaries: NV30: bit registers (64 16-bit registers). R300: 12 temporary registers. Constants: NV30: unlimited? (maybe memory?). Accessed by ‘name’ (label). Also literal constants (embedded). R300: 32 constants. DX9 (PS 2.0): 16 samplers and 8 texture coordinates.
Pixel Shader R300: 64 ALU instructions, 32 texture instructions, 4 levels of dependent read. Up to 96 instructions (?). R300: ALU instructions: ADD, MOV, MUL, MAD, DP3, DP4, FRAC, RCP, RSQ, EXP, LOG, CMP. Texture: TEXLD, TEXLDP, TEXLDBIAS, TEXKILL. NV30: up to 1024 instructions.
Pixel Shader NV30: up to 1024 instructions. Additional instructions (NV30): Texture read: TEX, TEXP, TXD. Partial derivatives: DDX, DDY. Pack/Unpack: PK2H, PK2US, PK4B, PK4UB, PK4UBG, UP2H, UP2US, UP4B, UP4UB, UP4UBG. Fragment conditional kill: KIL. Extra math: LRP (linear interpolation), X2D (2D coordinate transform), RFL (reflection), POW (exponentiation).
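Two of the extra math instructions can be sketched on plain tuples. The operand order is an assumption based on the usual NV_fragment_program description (LRP f, a, b computes f*a + (1-f)*b; RFL reflects a direction about an axis using the first three components); the function names are illustrative only.

```python
def dot3(a, b):
    """Three-component dot product."""
    return sum(x * y for x, y in zip(a[:3], b[:3]))


def lrp(f, a, b):
    """Component-wise linear interpolation: f*a + (1 - f)*b."""
    return tuple(fi * ai + (1.0 - fi) * bi for fi, ai, bi in zip(f, a, b))


def rfl(axis, direction):
    """Reflect direction about axis: 2*(axis.dir)/(axis.axis)*axis - dir."""
    scale = 2.0 * dot3(axis, direction) / dot3(axis, axis)
    return tuple(scale * n - d for n, d in zip(axis[:3], direction[:3]))
```

Without such instructions each of these would take several MUL, MAD and DP3 operations, which matters under a fixed instruction limit.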
Others Antialiasing. Anisotropic Filtering (textures). Line Antialiasing. Edge Antialiasing. Full Screen Antialiasing (FSAA): Supersampling. MultiSampling.
Display Gamma correction. Digital to analog conversion.
3D Graphic Hardware Pipeline Command Processor. Vertex Shader. Rasterization. Pixel Shader. Fragment Operations and Tests.
Command Processor Receives commands from the CPU (driver, OpenGL/Direct3D). Fetches data from memory: vertex data (DMA). Updates and stores OpenGL/Direct3D render state.
Vertex Shader Transforms and lights vertex streams. Vertex shader program (from GPU memory?). Vertex shader constants (from GPU memory?). Inputs: vertex data 16x4D. Outputs: vertex data 14x4D.
Rasterization Includes: Clipping. Divide by w. Affine transform. Primitive assembly. Culling. Setup. Fragment generation. Receives vertices and produces fragments. Uses OpenGL/Direct3D render state. Input: vertex (15x4D). Output: fragments (10x4D).
Pixel Shader Shades fragments: calculate texture address, read texture, color operations. Pixel Shader program and constants (from GPU memory?). Texture read: TMU (texture sample, filter unit, texture cache, GPU memory). Optional: Modify depth coordinate (1 Z output). Render to texture (up to 4 color outputs). Input: fragment (12x4D). Output: color (2x4D).
Fragment Operations and Tests Includes (OpenGL): Fog. Color Sum. Ownership Test. Scissor Test. Alpha Test. Stencil Test. Depth Test. Blend. Logic Operation. Accesses framebuffer (GPU memory). Updates framebuffer. Framebuffer: color, Z and stencil. OpenGL/Direct3D render state defines operations. Input: color. Output: FB updated.
Others Antialiasing. Anisotropic Filtering (textures). Line Antialiasing. Edge Antialiasing. Full Screen Antialiasing (FSAA): Supersampling. MultiSampling. TBDR: Tile Based Deferred Rendering (STMicro PowerVR). HOS (High Order Surfaces): N-Patches, Bezier, Displacement Mapping, TruForm, Tessellation.
Vertex Shader The command processor sends a vertex stream to the vertex shaders. A vertex buffer stores data read from DMA. A vertex cache (~ 10 vertices) can be used to avoid executing the vertex shader twice for the same vertex. The vertex stream is grouped in primitives and sent to the rasterizer.
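The effect of the vertex cache can be illustrated with a small simulation. This is a sketch under stated assumptions: the cache is modelled as a FIFO keyed by vertex index (the slide does not state the replacement policy), and `count_shader_runs` is a hypothetical helper, not a hardware interface.

```python
from collections import deque


def count_shader_runs(indices, cache_size=10):
    """Count vertex shader executions for an indexed vertex stream."""
    cache = deque(maxlen=cache_size)   # FIFO of recently shaded vertex indices
    runs = 0
    for idx in indices:
        if idx not in cache:           # miss: the vertex must be shaded
            runs += 1
            cache.append(idx)          # the oldest entry is dropped when full
    return runs
```

Two triangles sharing an edge, `[0, 1, 2, 2, 1, 3]`, need four shader executions with the cache instead of six without it.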
Hardware Pipeline
Vertex Shader Architecture SIMD architecture. Registers are 128 bits wide, four 32-bit fields. Instruction set: typical arithmetic instructions (vector mul, add), some special instructions (ARL, DST), some complex math instructions (EXP, COS), support for branching, loops and procedures. 3 different sources of data: Input stream (~ 16 registers). Constants (~ 256 registers). Temporaries (~ 16 registers). 2 different destinations: Output stream (~ 15 registers). Temporaries (~ 16 registers). Condition registers (NV30) and boolean constants (R300, DX9) for conditional ‘execution’.
Vertex Shader Inputs and Outputs
Vertex Shader Architecture
Vertex Shader: NV20 Exposes programmability of a small part of the geometry pipeline. Vertex load & store, format conversion, primitive assembly, clipping and triangle setup occur completely in parallel, in pipeline fashion. 4-wide fine-grained SIMD floating point provides the necessary performance; multiple execution threads maintain efficiency and provide a very simple programming model.
NV20: Introduction Independent vertices. IEEE single precision FP. 4 component vectors (x, y, z, w). Input registers can have their components arbitrarily rearranged/replicated (swizzled). Any operation generating a scalar must generate that scalar replicated across all components, and output writes have a component write mask.
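Swizzling, scalar replication and write masks can be sketched on 4-tuples. The helper names (`swizzle`, `write_masked`, `rcp`) are illustrative assumptions that mimic the register semantics described on this slide; they are not the hardware's instruction encoding.

```python
COMPONENT = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(reg, pattern, negate=False):
    """Rearrange or replicate components, e.g. 'wzyx' or 'xxxx'."""
    out = tuple(reg[COMPONENT[c]] for c in pattern)
    return tuple(-v for v in out) if negate else out


def rcp(scalar):
    """A scalar operation replicates its result across all four components."""
    r = 1.0 / scalar
    return (r, r, r, r)


def write_masked(dest, value, mask="xyzw"):
    """Only the components named in the write mask are updated."""
    return tuple(value[i] if c in mask else dest[i]
                 for i, c in enumerate("xyzw"))
```

Because the scalar result is replicated, the write mask alone decides where it lands, so no extra move instruction is needed.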
NV20: Program Model
NV20: Input Attributes Input Attributes: 16 quad-float vertex source attribute registers. Position, normal, two colors, up to 8 texture coordinate sets, skin weights, fog and point size. Default 0.0 for second and third components, 1.0 for the fourth. Attributes are persistent. Only one vertex attribute may be read per program instruction. Constant memory: 96 quad floats. Can only be loaded before vertices are processed. Only one constant may be read by one program instruction. The program may not write to constants.
NV20: Input Attributes Integer address register: Loaded using ARL. Indexed constant reads with out-of-range reads returning (0,0,0,0). Read/Write register file: 12 quad floats. Three reads and one write per instruction. Initialized to (0,0,0,0) per vertex. Any vector read may be sourced as multiple operands and individually swizzled/negated each time.
NV20: Output attributes Standard mapping for the fixed function pipeline at the homogeneous clip space point. Position for clipping. Vertex color output clamped to the range 0.0 to 1.0. Fog distance, point size. 8 texture coordinates. All instruction writes have an optional 4-component write mask. Initialized to (0.0, 0.0, 0.0, 1.0).
NV20: Instruction Set. No branching. Constant Latency: issue any instruction per clock and execute all instructions with the same latency. All operands are immediately available, limiting the size of registers and memory banks.
NV20: Hardware Implementation Two blocks: vertex attribute buffer (VAB) and the floating point core.
NV20: VAB The VAB is responsible for vertex attribute persistence. 16 input attributes. When a write to an address is received, the entry is set to the defaults (0.0, 0.0, 0.0, 1.0) and the valid data overwrites the corresponding components. The VAB drains into a number of input buffers (IB) that are used to feed the FP core in a round robin fashion. Dirty bits are maintained in the VAB so only changed attributes are updated when the same buffer is again the drain target. The transfer of a vertex is triggered by a write to address 0 (vertex position). To prevent bubbles during simultaneous loading and draining of the VAB, incoming writes may push out the contents of the target address, superseding a default drain sequence.
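The default-fill, dirty-bit and trigger behaviour can be sketched as follows. This is a deliberately simplified model with a single input buffer; the real VAB tracks dirty state per drain target and overlaps loading with draining. The class and method names are hypothetical.

```python
DEFAULTS = (0.0, 0.0, 0.0, 1.0)


class VertexAttributeBuffer:
    """Simplified VAB model: 16 persistent attributes with dirty bits."""

    def __init__(self, count=16):
        self.regs = [DEFAULTS] * count
        self.dirty = [False] * count

    def write(self, address, *components):
        """Fill missing components with defaults; True means 'launch vertex'."""
        self.regs[address] = tuple(components) + DEFAULTS[len(components):]
        self.dirty[address] = True
        return address == 0            # a write to the position triggers a vertex

    def drain(self, input_buffer):
        """Copy only the attributes changed since the last drain."""
        for i, changed in enumerate(self.dirty):
            if changed:
                input_buffer[i] = self.regs[i]
        self.dirty = [False] * len(self.dirty)
```

Attributes that are not rewritten between vertices (a constant color, for example) stay in the input buffer and cost no transfer.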
NV20: VAB
NV20: Floating Point Core Processes the instruction set. Multithreaded vector processor operating on quad-float data. Vertex data read from input buffers and transformed into output buffers (OB). Same latency for vector and special function units. Multiple vertex threads are used to hide this latency. SIMD VU: MOV, MUL, ADD, MAD, DP3, DP4, DST, MIN, MAX, SLT, SGE. Special FU: RCP, RSQ, LOG, EXP, LIT. VU is approximately IEEE (no denormalized numbers or exceptions, rounding always toward negative infinity). 1 instruction per clock and all input/output options have no performance penalty. All input vectors are available with no latency.
NV20: Float Point Core
Vertex Shader: R300 4 vertex shader units. 1 scalar unit, 1 vector unit. Registers: ALU Registers: Constants: 256 read only vectors. Temporary: 12 read/write vectors. Input: 16 read only vectors. Output: 15 write only vectors. Flow Control Registers: Integer Constant: 16 read only vectors. Address: 1 read/write vector. Loop Counter: 1 scalar. Boolean Constant: 16 read only bits.
R300: Instructions Shaders up to 256 instructions long. Up to 64K executed instructions per vertex. ALU instructions: ADD, DP3, DP4, EXP, EXPP, EXPE, FRAC, LOG, LOGP, MAD, MADDX2, MAX, MIN, MOV, MUL, POW, RCP, RSQ, SGE, SLT. Control Flow instructions: CALL, LOOP, ENDLOOP, JUMP, JNZ, LABEL, REPEAT, ENDREPEAT, RETURN. Address Instructions: ARL, ARR. Graphic Instructions: DST, LIT. Instructions based on DX9 VS2.0.
NV30: Overview Supports all VS1 instructions and features. Beyond VS2? Condition codes. Branches and subroutines. Modifiers: absolute. User clip support (new output registers CLP0-CLP5). New instructions. More registers.
NV30: Overview Up to 256 instructions per program. Up to 64K executed instructions per vertex. 16 temporary registers. 2 vector address registers. 256 program parameters (constants).
NV30: Condition Codes 4 component register: LT: less than zero. EQ: equal to zero. GT: greater than zero. UN: unordered, for comparisons involving NaN. Instructions optionally update condition code state: “C” suffix: DP4C, MOVC. “CC” pseudo register for updating condition codes. Condition code used in: Branches and procedure call/return. Result masking.
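A sketch of how a per-component condition code could be derived and used for result masking. The rule table (LE as LT or EQ, NE as LT, GT or UN, and so on) is an assumption based on the usual NV_vertex_program2 description, and swizzling of the condition code is omitted; all names are illustrative.

```python
import math

# Which condition code states satisfy each rule (assumed mapping).
RULES = {
    "LT": {"LT"}, "EQ": {"EQ"}, "GT": {"GT"},
    "LE": {"LT", "EQ"}, "GE": {"GT", "EQ"}, "NE": {"LT", "GT", "UN"},
    "TR": {"LT", "EQ", "GT", "UN"}, "FL": set(),
}


def condition_code(value):
    """Classify one result component against zero."""
    if math.isnan(value):
        return "UN"
    if value < 0.0:
        return "LT"
    if value > 0.0:
        return "GT"
    return "EQ"


def update_cc(result):
    """What a 'C'-suffixed instruction would store in the CC register."""
    return tuple(condition_code(v) for v in result)


def masked_write(dest, value, cc, rule):
    """Conditional masking: write only components whose CC passes the rule."""
    return tuple(v if state in RULES[rule] else d
                 for d, v, state in zip(dest, value, cc))
```

The same rule evaluation would decide whether a conditional branch or call is taken.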
NV30: Modifiers Source: Swizzle. Negate. Absolute. Target: Masking. Conditional masking.
NV30: Branching and subroutines BRA: Unconditional. Conditional: BRA label (LE.xyww). Computed (indirect): BRA [A1.z] (GT.x). Call & return for subroutines: CAL & RET. Same options as with branches. Four levels of subroutine execution. No parameter stack.
NV30: Clipping New output registers: o[CLP0]..o[CLP5]. GL_CLIP_PLANEn enabled. Clip coordinate n interpolated across the primitive. Only the portion of the primitive where the clip coordinate is greater than zero is rasterized. Hardware performs fast trivial reject if all clip coordinates of a primitive are negative.
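The user clip behaviour can be sketched for a single clip plane n. The sketch assumes linear interpolation of the clip coordinate along an edge; both function names are hypothetical.

```python
def trivially_rejected(clip_coords):
    """True if the clip coordinate is negative at every vertex of the primitive."""
    return all(c < 0.0 for c in clip_coords)


def zero_crossing(c0, c1):
    """Edge parameter t in [0, 1] where the interpolated coordinate hits zero.

    Only meaningful when c0 and c1 have opposite signs.
    """
    return c0 / (c0 - c1)
```

A triangle with clip coordinates (-1.0, 2.0, -0.5) is not rejected; its edges are cut where the interpolated value crosses zero and only the positive part is rasterized.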
NV30: New Instructions ARL: now supports loading the 4-component A0 and A1 integer registers. ARR: like ARL except it rounds rather than truncates before storing the integer result in an address register. BRA, CAL, RET: branching instructions. COS, SIN: high-precision trigonometric functions. FLR, FRC: floor and fraction of floating point values. EX2, LG2: high-precision exponentiation and logarithm functions. ARA: adds pairs of components of an address register, useful for looping and other operations. SEQ, SFL, SGT, SLE, SNE, STR: six additional “set on” instructions similar to SLT and SGE. SSG: “set sign” operation generates a vector holding -1.0 for negative operand components, 0 for zero components, and +1.0 for positive components.
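A few of these instructions can be sketched per component. Two assumptions are made and should be checked against the instruction set documentation: ARL is modelled as a floor (the slide only says it truncates), and ARR as round-to-nearest-even, which is what Python's `round` does. The function names simply mirror the mnemonics.

```python
import math


def arl(value):
    """Address register load: assumed to take the floor of the value."""
    return math.floor(value)


def arr(value):
    """Address register load with rounding (nearest, ties to even)."""
    return round(value)


def flr(vec):
    """Component-wise floor."""
    return tuple(float(math.floor(v)) for v in vec)


def frc(vec):
    """Component-wise fraction: v - floor(v), always in [0, 1)."""
    return tuple(v - math.floor(v) for v in vec)


def ssg(vec):
    """Set sign: -1.0 for negative, 0.0 for zero, +1.0 for positive."""
    return tuple(-1.0 if v < 0.0 else 1.0 if v > 0.0 else 0.0 for v in vec)
```

ARL and ARR differ only for fractional inputs such as 1.7, which matters when the result indexes the constant array.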
NV30: Instruction List Add & multiply instructions: ADD, DP3, DP4, DPH, MAD, MOV, SUB. Math functions: ABS, COS, EX2, FLR, FRC, LG2, LOG, RCP, RSQ, SIN. Set on instructions: SEQ, SFL, SGE, SGT, SLE, SLT, SNE, STR. Branching instructions: BRA, CAL, RET. Address register instructions: ARL, ARA. Graphics-oriented instructions: DST, LIT, RCC, SSG. Minimum/maximum instructions: MAX, MIN.
Current GPUs ATI R300. 3DLabs P10. Matrox Parhelia.
ATI R300. Specs 0.15 micron technology. 110+ million transistors. 8 pixel rendering pipelines, 1 texture unit per pipeline, 16 textures per pass. 4 programmable vec4 vertex shader pipelines. 256-bit DDR memory bus. Up to 256 MB of memory on board, clocked at over 300 MHz (19.2 GB/s). AGP8X. Full DirectX 9 Pixel and Vertex Shader support.
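The quoted bandwidth figure follows from the bus width and memory clock. A small worked calculation, assuming DDR means two transfers per clock and 1 GB/s = 10^9 bytes per second (the helper name is illustrative):

```python
def peak_bandwidth_gb_s(bus_bits, clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (10**9 bytes per second)."""
    bytes_per_transfer = bus_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9
```

A 256-bit bus at 300 MHz DDR gives 32 bytes x 600 million transfers per second, about 19.2 GB/s; the same bus at 312.5 MHz gives about 20 GB/s.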
ATI R300. Specs.
ATI R300. GPU.
ATI R300. Memory Crossbar.
ATI R300. Vertex Shader.
ATI R300. Pixel Shader.
3D Labs P10. Specs 0.15-micron manufacturing process (same process as the GeForce4). 76M transistors. Fabbed at TSMC (NVIDIA's chips are made here as well). 860 ball HSBGA package (TSMC's latest packaging technology). 4 pixel rendering pipelines, can process two textures per pipeline. 256-bit DDR memory interface (up to 20GB/s of memory bandwidth w/ 312.5MHz DDR). Up to 256MB of memory on-board. AGP 4X support. Full DX8 pixel and vertex shader support.
3DLabs P10. Evolution.
3DLabs P10. Pipeline.
3DLabs. Command.
3DLabs. Vertex Units.
3DLabs P10. Raster Pipe.
3DLabs P10. Texture Pipe.
3DLabs P10. Pixel Pipe.
3DLabs P10. Virtual Memory.
Matrox Parhelia. Specs 0.15-micron GPU manufactured at UMC. 80 Million transistors. 4 pixel rendering pipelines, can process four textures per pipeline per clock. 4 programmable vec4 vertex shaders. 256-bit DDR memory bus (up to 20GB/s of memory bandwidth w/ 312.5MHz DDR). Up to 256MB of memory on board. AGP 4/8X support. Full DX8 pixel and vertex shader support.
Matrox Parhelia. Pipeline.
Bibliography “Real Time Graphic Architecture”, Kurt Akeley, Pat Hanrahan, fall. The OpenGL Graphics System: A Specification (version 1.4), Mark Segal, Kurt Akeley.
Bibliography Computer Graphics: Principles and Practice in C, James D. Foley, Andries van Dam, Steven K. Feiner, John F. Hughes.