Optimizing Pixomatic For Modern Processors


Optimizing Pixomatic For Modern Processors
  Michael Abrash
  RAD Game Tools, Inc.

Assume Nothing

Pixomatic
  x86 software renderer
  Windows and Linux
  High-end DX7-class feature set
    Except cubemaps
  Low-end DX7-class performance
  Peak P4/3GHz performance, 1 texture + Gouraud
    110 megapixels/second
    4.86 million triangles/second

A DX7-Class Rasterizer Turned Out To Be Possible

Appropriate Technology In Appropriate Places
  Mostly C
  Inline ASM in key places
  Custom preprocessor
  Welding - code compiled on the fly

Pixel Pipeline Register Allocation
  EAX - scratch register
  EBX - z-buffer pixel address
  ECX - loop counter
  EDX - texture 0 pointer
  ESI - span-list pointer
  EDI - pixel-buffer pixel address
  EBP - texture 1 pointer
  ESP - 1/z
  MM0 - texture 0 coordinates (u0, v0)
  MM1 - texture 1 coordinates (u1, v1)
  MM2 - Gouraud color
  MM3 - specular color
  MM4-MM7 - scratch registers

Span Generation Register Allocation
  EAX - scratch register
  EBX - -scanline length
  ECX - 1/z
  EDX - scratch register
  ESI - pixel-buffer pixel address
  EBP - span list pointer
  EDI - z-buffer pixel address
  ESP - stack pointer
  MM0 - previous span (u0, v0)
  MM1 - previous span (u1, v1)
  MM2 - Gouraud GB components
  MM3 - Gouraud AR components
  MM4 - specular GB components
  MM3-MM7 - scratch registers
  XMM0 - 1/w
  XMM1 - u0, v0, u1, v1
  XMM2 - 1/w2
  XMM3 - left edge 1/w2
  XMM4 - left edge 1/w
  XMM5 - left edge u0, v0, u1, v1
  XMM6-XMM7 - scratch registers

MMX Pixel Format
  Each field has 8 integral bits; the number of fractional bits varies throughout the pipeline

Texture Mapping Code
    pand mm0,[WrapUV0Mask]
    pshufw mm5,mm0,0Dh
    psrld mm5,[WrapUV0RightShift]
    movd eax,mm5
    movd mm7,[edx+eax]
    paddd mm0,[UV0Step]

From U,V To A Texture Address
  Bit positions:  63-48 | 47-32 | 31-16 | 15-0
  Start:          00VV  | .vvvv | UU.uu | uuuu
  After PSHUFW:   (not used)    | 00VV  | UU.uu
  After PSRLD:    0     | 0     | 0     | 0VVUU
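The data movement in these three steps can be modelled in portable C. The sketch below is a scalar emulation, not Pixomatic's code: the helper names are hypothetical, and the 8.8 layout of the U word and the 256x256 texture implied by the example mask and shift are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Emulates PSHUFW on a 64-bit MMX value: each 2-bit field of imm
   selects which source word is copied to the matching destination word. */
static uint64_t pshufw(uint64_t src, unsigned imm) {
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm >> (2 * i)) & 3u;
        uint64_t word = (src >> (16 * sel)) & 0xFFFFu;
        dst |= word << (16 * i);
    }
    return dst;
}

/* Emulates PSRLD: logical right shift of each 32-bit half. */
static uint64_t psrld(uint64_t v, unsigned count) {
    if (count > 31) return 0;
    uint64_t lo = (v & 0xFFFFFFFFu) >> count;
    uint64_t hi = (v >> 32) >> count;
    return (hi << 32) | lo;
}

/* Assumed layout of uv, high word to low word:
   00VV | v fraction | UU.uu | u fraction. */
static uint32_t texel_index(uint64_t uv, uint64_t wrap_mask, unsigned shift) {
    uint64_t t = uv & wrap_mask;   /* pand:   wrap U and V              */
    t = pshufw(t, 0x0D);           /* pshufw: low dword becomes 00VV:UU.uu */
    t = psrld(t, shift);           /* psrld:  drop the U fraction bits  */
    return (uint32_t)t;            /* movd eax,mm5                      */
}
```

With U = 0x12 and V = 0x34 packed as 0x0034800012C00000, a wrap mask of 0x00FFFFFFFFFFFFFF, and a shift of 8, `texel_index` returns 0x3412, that is, row 0x34 and column 0x12 of the assumed 256-texel-wide texture.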

Welded Code Sample 1
    LoopTop:
    add esp,dword ptr [_RotatedFixed16ZXStep]       ; stepping
    adc esp,0
    paddsw mm2,mmword ptr [_argb7x_GouraudXStep]
    paddd mm0,mmword ptr _Spans+20h[esi]
    cmp sp,word ptr [ebx+ecx*2]                     ; z buffering
    ja LoopBottom
    mov word ptr [ebx+ecx*2],sp
    pand mm0,mmword ptr [_TexMap]                   ; texture mapping
    pshufw mm5,mm0,0Dh
    psrld mm5,mmword ptr [_TexMap+28h]
    movd eax,mm5
    movd mm7,dword ptr [edx+eax*4]
    movq mm6,mm2                                    ; Gouraud shading
    punpcklbw mm7,dword ptr [_MMX_0]
    psllw mm7,1
    pmulhw mm7,mm6
    packuswb mm7,mm7                                ; pixel pack/write
    movd dword ptr [edi+ecx*4],mm7
    LoopBottom:
    inc ecx                                         ; loop control
    jne LoopTop
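The `add esp` / `adc esp,0` pair and the 16-bit z test in Sample 1 are easier to follow with a scalar model. The sketch below assumes, based only on the name `_RotatedFixed16ZXStep` and on the `cmp sp,...` comparison, that the depth value is a 16.16 fixed-point number rotated by 16 bits so that its integer part sits in the low word. The function names are hypothetical and this is an interpretation, not the author's code.

```c
#include <stdint.h>

/* Rotate a 16.16 fixed-point value so the integer part lands in
   bits 15..0 and the fraction in bits 31..16 (assumed layout). */
static uint32_t rotate_fixed16(uint32_t fixed) {
    return (fixed << 16) | (fixed >> 16);
}

/* Models "add esp,step" followed by "adc esp,0": the carry out of the
   fraction (top half) wraps around into bit 0 of the integer part. */
static uint32_t rotated_add(uint32_t z, uint32_t step) {
    uint32_t sum = z + step;
    uint32_t carry = sum < z;      /* 1 if the 32-bit add overflowed */
    return sum + carry;
}

/* Models "cmp sp,[zbuf] / ja LoopBottom / mov [zbuf],sp": the pixel is
   skipped when the new 16-bit depth is above the stored one. */
static int z_test_and_write(uint16_t *zbuf, uint32_t z_rotated) {
    uint16_t z = (uint16_t)z_rotated;   /* low word, like SP */
    if (z > *zbuf) return 0;
    *zbuf = z;
    return 1;
}
```

Under this layout the integer depth is always available as a 16-bit register half, so the z compare needs no shift. One side effect of the model is that a carry out of the integer half leaks into the lowest fraction bit, an error of 1/65536 per occurrence.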

Welded Code Sample 2
    LoopTop:
    and eax,dword ptr [_TexMap+0F8h]
    add esp,dword ptr [_RotatedFixed16ZXStep]
    adc esp,0
    paddsw mm2,mmword ptr [_argb7x_GouraudXStep]
    paddd mm0,mmword ptr _Spans+20h[esi]
    cmp sp,word ptr [ebx+ecx*2]
    ja LoopBottom
    mov word ptr [ebx+ecx*2],sp
    pand mm0,mmword ptr [_TexMap]
    pshufw mm6,mm0,0Dh
    psrld mm6,mmword ptr [_TexMap+28h]
    movd eax,mm6
    movd mm7,dword ptr [edx+eax*4]
    pslld mm6,mmword ptr [_TexMap+28h]
    add eax,dword ptr [_TexMap+0F4h]
    and eax,dword ptr [_TexMap+0F8h]
    paddw mm6,mmword ptr [_TexMap+40h]
    movq mm4,mm0
    psrld mm4,mmword ptr [_TexMap+48h]
    pand mm4,mmword ptr [_MMX_0x003F003F003F003F]
    movd mm5,dword ptr [edx+eax*4]
    punpcklbw mm7,dword ptr [_MMX_0]
    movd mm6,dword ptr [edx+eax*4]
    punpcklbw mm5,dword ptr [_MMX_0]
    pshufw mm4,mm4,0
    and eax,dword ptr [_TexMap+0F8h]
    punpcklbw mm6,dword ptr [_MMX_0]
    movq mmword ptr [_MMX_UFrac],mm4
    movd mm4,dword ptr [edx+eax*4]
    punpcklbw mm4,dword ptr [_MMX_0]
    psubw mm6,mm7
    psubw mm4,mm5
    psubw mm5,mm7
    psubw mm4,mm6
    pmullw mm6,mmword ptr [_MMX_UFrac]
    psraw mm6,6
    pmullw mm4,mmword ptr [_MMX_UFrac]
    paddw mm6,mm7
    pshufw mm7,mm0,0AAh
    psrlw mm7,6
    psllw mm5,6
    pmulhw mm4,mm7
    pmulhw mm7,mm5
    paddw mm6,mm4
    paddw mm7,mm6
    packuswb mm7,mm7
    movq mm6,mm2
    punpcklbw mm7,dword ptr [_MMX_0]
    psllw mm7,1
    pmulhw mm7,mm6
    movd dword ptr [edi+ecx*4],mm7
    LoopBottom:
    inc ecx
    jne LoopTop
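Sample 2 fetches four texels and blends them using fractions masked with `_MMX_0x003F003F003F003F` and shifted by 6, which suggests bilinear filtering with 6-bit weights. The sketch below shows that technique in scalar C under that assumption. It is not a transcription of the assembly: the function names are hypothetical, and the rounding and order of operations are a plain textbook form rather than the author's arithmetic.

```c
#include <stdint.h>

/* Blend a toward b by f/64, where f is a 6-bit fraction (0..63).
   Uses only non-negative intermediates, so no signed shifts are needed. */
static unsigned lerp6(unsigned a, unsigned b, unsigned f) {
    return (a * (64u - f) + b * f) >> 6;
}

/* Bilinear blend of four ARGB8888 texels:
   t00 = top-left, t10 = top-right, t01 = bottom-left, t11 = bottom-right.
   uf and vf are the 6-bit fractional parts of u and v. */
static uint32_t bilerp_argb(uint32_t t00, uint32_t t10,
                            uint32_t t01, uint32_t t11,
                            unsigned uf, unsigned vf) {
    uint32_t out = 0;
    uf &= 63u;
    vf &= 63u;
    for (int shift = 0; shift < 32; shift += 8) {
        unsigned c00 = (t00 >> shift) & 0xFFu;
        unsigned c10 = (t10 >> shift) & 0xFFu;
        unsigned c01 = (t01 >> shift) & 0xFFu;
        unsigned c11 = (t11 >> shift) & 0xFFu;
        unsigned top = lerp6(c00, c10, uf);      /* blend along u */
        unsigned bottom = lerp6(c01, c11, uf);
        out |= (uint32_t)lerp6(top, bottom, vf) << shift;  /* then along v */
    }
    return out;
}
```

With both fractions at zero the function returns the top-left texel unchanged, which matches the point-sampled result of Sample 1 before shading.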

Out Of Order Processing is Cool
  No need to swizzle textures
  No need to overlap divides
  Extra moves are often free

Try Stuff And See What Sticks

Loop Unrolling Is Rarely A Win
  Unrolling once sometimes helped

Branch Prediction, And Unexpected Implications Thereof

Linear Search
    if (condition 1) { handler 1 }
    else if (condition 2) { handler 2 }
    else if (condition 3) { handler 3 }
    else { handler 4 }

Linear Branching Patterns
  fail condition 1, fail condition 2, pass condition 3
  pass condition 1
  fail condition 1, pass condition 2
  fail condition 1, fail condition 2, fail condition 3

Binary Search
    if (condition 2) {
        if (condition 1) handler 1
        else handler 2
    } else {
        if (condition 3) handler 3
        else handler 4
    }

Linear Versus Binary Search
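The two dispatch shapes from the preceding slides can be written side by side in C. This is a minimal sketch under an assumed set of conditions (condition k is `v < 10 * k`, which is hypothetical, as are the function names). Each function also reports how many conditional tests it evaluated, so the difference in branching pattern is visible. Nothing here measures branch prediction itself.

```c
/* Linear chain: the number of tests depends on which handler is hit. */
static int dispatch_linear(int v, int *tests) {
    *tests = 1;
    if (v < 10) return 1;          /* condition 1 */
    *tests = 2;
    if (v < 20) return 2;          /* condition 2 */
    *tests = 3;
    if (v < 30) return 3;          /* condition 3 */
    return 4;                      /* default handler */
}

/* Binary tree: every input takes exactly two tests. */
static int dispatch_binary(int v, int *tests) {
    *tests = 2;
    if (v < 20) {                  /* condition 2 splits the range */
        if (v < 10) return 1;      /* condition 1 */
        return 2;
    }
    if (v < 30) return 3;          /* condition 3 */
    return 4;
}
```

Both functions select the same handler for every input. The linear form takes one to three tests and produces a different pass/fail sequence per handler, while the binary form always takes two. Which one the branch predictor handles better depends on how the inputs are distributed.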

Help The Data Cache Work Efficiently
  Hundreds of cycles per miss to memory
  Not always hidden by caching and out-of-order processing
  Don’t chase sparse pointers
  Avoid sparse accesses to large data structures in general
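As an illustration of the "don't chase sparse pointers" advice, the sketch below contrasts the two access patterns in C. The types and function names are hypothetical. Both functions compute the same sum and differ only in memory layout; no timings are claimed.

```c
#include <stddef.h>

/* Pointer-chasing layout: each node may live anywhere in memory, so the
   address of the next load is unknown until the current load completes. */
struct node {
    int value;
    struct node *next;
};

static long sum_list(const struct node *head) {
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        total += n->value;
    }
    return total;
}

/* Contiguous layout: consecutive elements share cache lines and the
   next address is known in advance. */
static long sum_array(const int *values, size_t count) {
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        total += values[i];
    }
    return total;
}
```

When list nodes are scattered across a large heap, each step of `sum_list` can miss the cache and must wait for the previous load, which is the case the slide warns is not always hidden by out-of-order execution.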

SSE2 Didn’t Help Us Much
  For integer ops, half the speed of MMX
  Doubled parallelism didn’t help us
  Requires yet another code path
  For doubles, only 2-way SIMD

Small Changes -> Huge Effects
  Double alignment on stack
  64K aliasing

Hyperthreading Didn’t Help
  Not a good fit for a standard 3D pipeline
  Potentially helpful for deferred rendering

Questions?