An Architecture Extension for Efficient Geometry Processing Radhika Thekkath, Mike Uhler, Chandlee Harrell, Ying-wai Ho MIPS Technologies, Inc. 1225 Charleston.

Slides:



Advertisements
Similar presentations
Is There a Real Difference between DSPs and GPUs?
Advertisements

Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
ARM Cortex A8 Pipeline EE126 Wei Wang. Cortex A8 is a processor core designed by ARM Holdings. Application: Apple A4, Samsung Exynos What’s the.
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Computer Architecture A.
Spring 2013 Advising Starts this week! CS2710 Computer Organization1.
The University of Adelaide, School of Computer Science
Comp Sci Floating Point Arithmetic 1 Ch. 10 Floating Point Unit.
Computer Organization CS224 Fall 2012 Lesson 19. Floating-Point Example  What number is represented by the single-precision float …00 
05/03/2009CA&O Lecture 8,9,10 By Engr. Umbreen sabir1 Computer Arithmetic Computer Engineering Department.
TU/e Processor Design 5Z032 1 Processor Design 5Z032 The role of Performance Henk Corporaal Eindhoven University of Technology 2009.
Shadow Volumes Revisited Stefan Roettger Alexander Irion Thomas Ertl University of Stuttgart, VIS Group.
9/25/2001CS 638, Fall 2001 Today Shadow Volume Algorithms Vertex and Pixel Shaders.
CS5500 Computer Graphics © Chun-Fa Chang, Spring 2007 CS5500 Computer Graphics February 26, 2007.
1 Lecture 9: Floating Point Today’s topics:  Division  IEEE 754 representations  FP arithmetic Reminder: assignment 4 will be posted later today.
Computer Graphics Hardware Acceleration for Embedded Level Systems Brian Murray
Room: E-3-31 Phone: Dr Masri Ayob TK 2123 COMPUTER ORGANISATION & ARCHITECTURE Lecture 4: Computer Performance.
Instruction Level Parallelism (ILP) Colin Stevens.
1 Ó1998 Morgan Kaufmann Publishers Chapter 4 計算機算數.
Performance D. A. Patterson and J. L. Hennessey, Computer Organization & Design: The Hardware Software Interface, Morgan Kauffman, second edition 1998.
GPU Simulator Victor Moya. Summary Rendering pipeline for 3D graphics. Rendering pipeline for 3D graphics. Graphic Processors. Graphic Processors. GPU.
1 Lecture 8 Clipping Viewing transformations Projections.
Farhan Mohamed Ali (W2-1) Jigar Vora (W2-2) Sonali Kapoor (W2-3) Avni Jhunjhunwala (W2-4) Siven Seth (W2-5) Presentation 1 MAD MAC th January, 2006.
1 Lecture 10: FP, Performance Metrics Today’s topics:  IEEE 754 representations  FP arithmetic  Evaluating a system Reminder: assignment 4 due in a.
1 Chapter 4. 2 Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation.
PlayStation 2 Architecture Irin Jose Farid Momin Quy Ngo Olivia Wong.
CPS Computer Architecture Assignment 4: Multiplication, Division, Floating Point.
Emotion Engine A look at the microprocessor at the center of the PlayStation2 gaming console Charles Aldrich.
© 2007 by S - Squared, Inc. All Rights Reserved.
Computer performance.
Background image by chromosphere.deviantart.com Fella in following slides by devart.deviantart.com DM2336 Programming hardware shaders Dioselin Gonzalez.
Enhancing GPU for Scientific Computing Some thoughts.
David Carter SCEE Technology Group
Graphics Graphics Korea University cgvr.korea.ac.kr 1 Texture Mapping 고려대학교 컴퓨터 그래픽스 연구실.
Buffers Textures and more Rendering Paul Taylor & Barry La Trobe University 2009.
CPS3340 COMPUTER ARCHITECTURE Fall Semester, /14/2013 Lecture 16: Floating Point Instructor: Ashraf Yaseen DEPARTMENT OF MATH & COMPUTER SCIENCE.
High Performance Computing Processors Felix Noble Mirayma V. Rodriguez Agnes Velez Electric and Computer Engineer Department August 25, 2004.
Lecture 9: Floating Point
The Rendering Pipeline CS 445/645 Introduction to Computer Graphics David Luebke, Spring 2003.
Crosscutting Issues: The Rôle of Compilers Architects must be aware of current compiler technology Compiler Architecture.
Review on Graphics Basics. Outline Polygon rendering pipeline Affine transformations Projective transformations Lighting and shading From vertices to.
Chapter 5: Computer Systems Design and Organization Dr Mohamed Menacer Taibah University
ICC Module 3 Lesson 1 – Computer Architecture 1 / 9 © 2015 Ph. Janson Information, Computing & Communication Computer Architecture Clip 7 – Architectural.
Playstation2 Architecture Architecture Hardware Design.
September 10 Performance Read 3.1 through 3.4 for Wednesday Only 3 classes before 1 st Exam!
Performance – Last Lecture Bottom line performance measure is time Performance A = 1/Execution Time A Comparing Performance N = Performance A / Performance.
Shadows David Luebke University of Virginia. Shadows An important visual cue, traditionally hard to do in real-time rendering Outline: –Notation –Planar.
King Fahd University of Petroleum and Minerals King Fahd University of Petroleum and Minerals Computer Engineering Department Computer Engineering Department.
1 ECE 734 Final Project Presentation Fall 2000 By Manoj Geo Varghese MMX Technology: An Optimization Outlook.
Computer Architecture & Operations I
GPU Architecture and Its Application
- Introduction - Graphics Pipeline
Morgan Kaufmann Publishers Arithmetic for Computers
September 2 Performance Read 3.1 through 3.4 for Tuesday
Computer Architecture & Operations I
Computer Architecture & Operations I
Morgan Kaufmann Publishers
Modeling 101 For the moment assume that all geometry consists of points, lines and faces Line: A segment between two endpoints Face: A planar area bounded.
Graphics Processing Unit
Chapter 6 Floating Point
The Graphics Rendering Pipeline
CS451Real-time Rendering Pipeline
COMP4211 : Advance Computer Architecture
STUDY AND IMPLEMENTATION
Computer Arithmetic Multiplication, Floating Point
Alex Saify Chad Reynolds James Aldorisio Brian Bischoff
Lecture 13 Clipping & Scan Conversion
COMS 361 Computer Organization
Morgan Kaufmann Publishers Arithmetic for Computers
Chapter 4 計算機算數.
Presentation transcript:

An Architecture Extension for Efficient Geometry Processing Radhika Thekkath, Mike Uhler, Chandlee Harrell, Ying-wai Ho MIPS Technologies, Inc Charleston Road Mountain View, CA 94043

An Architecture Extension for Efficient Geometry Processing2 Talk Outline l Motivation---why enhance the MIPS ® architecture l Background on 3D graphics geometry operations and current MIPS ® architecture l What are the enhancements? l Performance and cost l Summary

An Architecture Extension for Efficient Geometry Processing3 Current 3D Rendering Limited by Geometry Processing l Front-end: Geometry and Lighting operations u General-purpose processors: M polygons/s. Eg. R5000 ® (1996,200MHz), PIII (1999,500MHz). l Back-end: Rendering u Graphics processors: M polygons/s. Eg. ATI Rage 128(1999), 3Dfx Voodoo3(1999). l Dedicated hardware, eg., Sony Emotion Engine---silicon-intensive, but feeds higher performance rendering engines.

An Architecture Extension for Efficient Geometry Processing4 Our Solution l Enhance the MIPS ® architecture to improve 3D geometry performance: MIPS-3D™ ASE (Application Specific Extension) includes 13 new instructions l Lower cost than dedicated geometry hardware l Main processor improvements are leveraged u technology/speed u parallelism/pipelining

An Architecture Extension for Efficient Geometry Processing5 Talk Outline l Motivation---why enhance the MIPS ® architecture l Background on 3D graphics geometry operations and current MIPS ® architecture l What are the enhancements? l Performance and cost l Summary

An Architecture Extension for Efficient Geometry Processing6 Geometry and Lighting Operations l Vertex transformation (matrix multiplication) l Clip-check (compare and branch) l Transform to screen coordinates (perspective division using reciprocal) l Lighting: infinite and local (normalization using reciprocal square root)

An Architecture Extension for Efficient Geometry Processing7 Already in the MIPS Architecture l Floating point operations u MUL (S, D, PS) u ADD (S, D, PS) u MADD (S, D, PS) (multiply-add) u RECIP (S, D) u RSQRT (S, D) PS- Paired-Single, two singles S S 64bits S - Single FP format (32 bits) D - Double FP format (64 bits)

An Architecture Extension for Efficient Geometry Processing8 Talk Outline l Motivation---why enhance the MIPS ® architecture l Background on 3D graphics geometry operations and current MIPS ® architecture l What are the enhancements? l Performance and cost l Summary

An Architecture Extension for Efficient Geometry Processing9 ADDR: for Vertex Transformation x y z w m0 m4 m8 m12 m1 m5 m9 m13 m2 m6 m10 m14 m3 m7 m11 m15 * = x t y t z t w t FP0 = [m1 | m0] FP1 = [m3 | m2] Eg. x t = m0x + m1y + m2z + m3w MUL.PS FP10, FP0, FP8 FP10 = [m1y | m0x] * FP8 = [ y | x ] FP9 = [ w | z ] MADD.PS FP11, FP10, FP1, FP9 * [m3w | m2z] + FP11 = [m1y+m3w | m0x+m2z] Reorganize register to enable add ADD.PS... ADDR.PS FP11, FP?, FP11 FP11 = [ y t | x t =m1y+m3w+m0x+m2z] ADDR

An Architecture Extension for Efficient Geometry Processing10 Clip Check (Compare) x >= -w, x <= w y >= -w, y <= w z >= -w, z <= w Set 6 Condition Code (CC) bits Is the vertex within the viewing pyramid? |x| <= |w| |y| <= |w| |z| <= |w| Set only 3 CC bits Observation : Can use magnitude compares.

An Architecture Extension for Efficient Geometry Processing11 CABS: for Clip Check Compare CABS.LE.PS |y|<=|w|?, |x|<=|w|? CABS.LE.PS |w|<=|w|?, |z|<=|w|? Transformed [w | z] [y | x] in FP registers PUU.PS to get [w | w] NEG.PS to get [-w | -w] C.NGE.PS !(y >= -w)? !(x >= -w)? C.NGE.S !(z >= -w)? C.LE.PS y<=w? x<=w? C.LE.S z<=w? Replace with absolute compares

An Architecture Extension for Efficient Geometry Processing12 BC1ANY4F: for Clip Check Branch l Without absolute compare, need 6 branch instructions to check the 6 CC bits. l With absolute compare, need 3 branch instructions to check the 3 CC bits. l New MIPS-3D™ ASE instruction --- BC1ANY4F, a single branch instruction that checks 4 CC bits.

An Architecture Extension for Efficient Geometry Processing13 Geometry and Lighting Operations l Vertex transformation (matrix multiplication) l Clip-check (compare and branch) l Transform to screen coordinates (perspective division using reciprocal) l Lighting: infinite and local (normalization using reciprocal square root)

An Architecture Extension for Efficient Geometry Processing14 Perspective Division and Normalization l In MIPS ® IV architecture u RECIP u RSQRT l Full precision l Long latency l Not fully pipeline-able l Only S and D formats l New MIPS-3D™ ASE instructions: u RECIP1 u RECIP2 u RSQRT1 u RSQRT2 l Reduced & full precision l Pipeline-able l S, D, and PS format

An Architecture Extension for Efficient Geometry Processing15 Other Instruction Sets l 3DNow!™ Technology -- enhance 3D graphics and multimedia u 2-packed FP SIMD (PS) u PFACC - accumulate u PFRCP, PFRCPIT1, PFRCPIT2 - reciprocal u PFRSQRT, PFRSQIT1 - reciprocal square root u PF2ID, PI2FD - convert l AltiVec™ Technology u 4 SIMD (32-bits) u vrefp, vnmsubfp, vmaddfp - reciprocal u vrsqrtefp, etc - reciprocal square root u vcmpbfp - bounds compare u vcfsx, vctsxs - convert

An Architecture Extension for Efficient Geometry Processing16 Talk Outline l Motivation---why enhance the MIPS ® architecture l Background on 3D graphics geometry operations and current MIPS ® architecture l What are the enhancements? l Performance and cost l Summary

An Architecture Extension for Efficient Geometry Processing17 Implementation Cost l Die Area (of the Ruby processor) u Implementation of PS adds 6-7% to FP die area. u MIPS-3D™ ASE adds 3% to the floating point die area. (FP is less than 15% of the total die area). l Logic/pipeline complexity u ADDR, CABS, BC1ANY4F, etc. - minimal impact on both die area and FP pipeline logic. u RECIP1, RSQRT1 - 2x64 word lookup tables contribute to most of the 3% die area increase.

An Architecture Extension for Efficient Geometry Processing18 Performance: Number of Instructions Note: Inner-loop instructions/vertex = cycles/vertex

An Architecture Extension for Efficient Geometry Processing19 Experiment/Coding Assumptions l FP pipeline has 4-cycle data dependency l Loop interleaves computations of 2 vertices l Transform constants locked in cache l Vertex co-ordinates are pre-fetched from memory to cache, every loop iteration l Code uses full precision reciprocal and reduced precision reciprocal square-root

An Architecture Extension for Efficient Geometry Processing20 Performance : M polygons/s 45% 83% M polygons/s Using today’s high-end desktop processor frequency---500MHz transform+complex lighttransform

An Architecture Extension for Efficient Geometry Processing21 Summary l MIPS-3D™ ASE adds thirteen instructions to the current MIPS64™ architecture l Low cost (3% of FP die area) l Increases polygons/sec count by 45% for the transform code to obtain 25 M polygons/s l Increases polygons/sec count by 83% for transform together with complex lighting to obtain 10 M polygons/s

An Architecture Extension for Efficient Geometry Processing22 Appendix:Vertex Transformation Code MUL.PS FP10,FP8,FP0 FP10 <-- m1*y | m0*x MUL.PS FP11,FP8,FP2 FP11 <-- m5*y | m4*x MUL.PS FP12,FP8,FP4 FP12 <-- m9*y | m8*x MUL.PS FP13,FP8,FP6 FP13 <-- m13*y | m12*x MADD.PS FP10,FP10,FP9,FP1 FP10 <-- m3*w+m1*y | m2*z+m0*x MADD.PS FP11,FP11,FP9,FP3 FP11 <-- m7*w+m5*y | m6*z+m4*x MADD.PS FP12,FP12,FP9,FP5 FP12 <-- m11*w+m9*y | m10*z+m8*x MADD.PS FP13,FP13,FP9,FP7 FP13 <-- m15*w+m13*y | m14*z+m12*x PLL.PS FP14,FP11,FP10 PUU.PS FP15,FP11,FP10 PLL.PS FP16,FP13,FP12 PUU.PS FP17,FP13,FP12 ADD.PS FP8, FP15,FP14 ADD.PS FP9,FP17,FP16 ADDR.PS FP8,FP11,FP10 FP8 <-- m4x+m5y+m6z+m7w | m0x+m1y+m2z+m3w ADDR.PS FP9,FP13,FP12 FP9 <-- m12x+m13y+m14z+m15w | m8x+m9y+m10z+m11w FP0--FP7 hold m0--m15 in pair-single FP8, FP9 hold x,y,z,w in pair-single Replace with

An Architecture Extension for Efficient Geometry Processing23 Appendix:The 13 MIPS-3D™ ASE Instructions