Graphics Processing Units SUPERVISED BY: Dr. Hadi Adineh By: Azhar Albakry & Abbas Alkhafaji Department: Software
INTRODUCTION The graphics processing unit (GPU) has become an integral part of today’s mainstream computing systems. Over the past few years, the GPU has evolved into not only a powerful graphics engine but also a highly parallel programmable processor. The GPU’s rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU.
INTRODUCTION The GPU is designed for a particular class of applications with the following characteristics. Computational requirements are large: real-time rendering requires billions of pixels per second, and each pixel requires hundreds or more operations, so GPUs must deliver an enormous amount of compute performance to satisfy the demands of complex real-time applications. Parallelism is substantial: the graphics pipeline is well suited for parallelism, and the parallel hardware built for it is applicable to many other computational domains. Throughput is more important than latency: GPU implementations of the graphics pipeline prioritize throughput over latency.
GPU ARCHITECTURE: A. The Graphics Pipeline The input to the GPU is a list of geometric primitives, typically triangles, in a 3-D world coordinate system. Through many steps, those primitives are shaded and mapped onto the screen, where they are assembled to create a final picture. Specific steps in the canonical pipeline: Vertex Operations: The input primitives are formed from individual vertices. Each vertex must be transformed into screen space and shaded, typically by computing its interaction with the lights in the scene. Primitive Assembly: The vertices are assembled into triangles, the fundamental hardware-supported primitive in today’s GPUs. Rasterization: Rasterization is the process of determining which screen-space pixel locations are covered by each triangle. Each triangle generates a fragment at every pixel location it covers.
GPU ARCHITECTURE: A. The Graphics Pipeline Specific steps in the canonical pipeline (continued): Fragment Operations: Using color information from the vertices and possibly fetching additional data from global memory in the form of textures (images that are mapped onto surfaces), each fragment is shaded to determine its final color. Composition: Fragments are assembled into a final image with one color per pixel, usually by keeping the closest fragment to the camera for each pixel location. Historically, the operations available at the vertex and fragment stages were configurable but not programmable.
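The five pipeline stages above can be illustrated with a very small software renderer. This is a minimal sketch under simplifying assumptions, not how GPU hardware implements the pipeline: vertices are assumed to lie in a unit square and are simply scaled to the screen, each triangle has one flat color, and the function names (`vertex_stage`, `rasterize`, `render`) are hypothetical.

```python
def vertex_stage(vertices, width, height):
    """Vertex operations: map (x, y, z) in the unit square to screen space.
    A real pipeline would apply a projection matrix and lighting here."""
    return [(x * width, y * height, z) for x, y, z in vertices]


def edge(a, b, p):
    """Signed area test: tells on which side of edge a->b the point p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(tri, width, height):
    """Rasterization: yield one (x, y, depth) fragment per covered pixel center."""
    a, b, c = tri
    area = edge(a, b, c)
    if area == 0:  # degenerate triangle covers nothing
        return
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)
            # Barycentric weights of a, b, c; all non-negative means "inside".
            w0 = edge(b, c, p) / area
            w1 = edge(c, a, p) / area
            w2 = edge(a, b, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                yield x, y, w0 * a[2] + w1 * b[2] + w2 * c[2]


def render(vertices, triangles, colors, width, height):
    screen = vertex_stage(vertices, width, height)
    image = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (i, j, k), color in zip(triangles, colors):
        tri = (screen[i], screen[j], screen[k])  # primitive assembly
        for x, y, z in rasterize(tri, width, height):
            # Fragment operation: flat shading with the triangle's color.
            # Composition: keep only the fragment closest to the camera.
            if z < depth[y][x]:
                depth[y][x] = z
                image[y][x] = color
    return image


# A far blue triangle covering the whole screen and a nearer red one in a corner.
vertices = [(0, 0, 0.8), (2, 0, 0.8), (0, 2, 0.8),
            (0, 0, 0.2), (1, 0, 0.2), (0, 1, 0.2)]
image = render(vertices, [(0, 1, 2), (3, 4, 5)],
               [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)], width=4, height=4)
```

Every pixel and every fragment in the inner loops is independent of the others, which is the data parallelism the GPU exploits.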
GPU ARCHITECTURE: B. Evolution of GPU Architecture The fixed-function pipeline lacked the generality to efficiently express the more complicated shading and lighting operations that are essential for complex effects. The key step was replacing the fixed-function per-vertex and per-fragment operations with user-specified programs run on each vertex and fragment. Over the past years, these vertex programs and fragment programs have become increasingly capable, with larger limits on their size and resource consumption, more fully featured instruction sets, and more flexible control-flow operations.
GPU ARCHITECTURE: C. Architecture of a Modern GPU The GPU is built for different application demands than the CPU: large, parallel computation requirements with an emphasis on throughput rather than latency. Consequently, the architecture of the GPU has progressed in a different direction than that of the CPU. In a pipeline, the output of each task is fed into the input of the next task. The pipeline exposes the task parallelism of the application, as data in multiple pipeline stages can be computed at the same time; within each stage, computing more than one element at the same time is data parallelism. The GPU divides the resources of the processor among the different stages, such that the pipeline is divided in space, not time.
GPU ARCHITECTURE: C. Architecture of a Modern GPU This machine organization was highly successful in fixed-function GPUs for two reasons. First, the hardware in any given stage could exploit data parallelism within that stage, processing multiple elements at the same time. Second, each stage’s hardware could be customized with special-purpose hardware for its given task, allowing substantially greater compute and area efficiency than a general-purpose solution. For instance, the rasterization stage, which computes pixel coverage information for each input triangle, is more efficient when implemented in special-purpose hardware.
GPU ARCHITECTURE: C. Architecture of a Modern GPU In a CPU, any given operation may take on the order of 20 cycles between entering and leaving the CPU pipeline. On a GPU, a graphics operation may take thousands of cycles from start to finish. The latency of any given operation is long. However, the task and data parallelism across and within stages delivers high throughput. The major disadvantage of the GPU task-parallel pipeline is load balancing. Like any pipeline, the performance of the GPU pipeline is dependent on its slowest stage. If the vertex program is complex and the fragment program is simple, overall throughput is dependent on the performance of the vertex program.
GPU ARCHITECTURE: C. Architecture of a Modern GPU AMD introduced the first unified shader architecture for modern GPUs in its Xenos GPU in the Xbox 360 (2005). Today, both AMD’s and NVIDIA’s flagship GPUs feature unified shaders (Fig. 1). The benefit for GPU users is better load balancing, at the cost of more complex hardware. The benefit for GPGPU users is clear: with all the programmable power in a single hardware unit, GPGPU programmers can now target that programmable unit directly, rather than dividing work across multiple hardware units as before.
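The load-balancing argument can be made concrete with a toy throughput model. This is only a sketch under stated assumptions: the unit counts and per-item cycle costs are hypothetical, every unit completes one cycle of work per clock, and scheduling overhead is ignored. With separate units, the slowest stage sets the rate; with unified units, all units share the total work.

```python
def fixed_pipeline_throughput(vertex_units, fragment_units,
                              vertex_work, fragment_work):
    """Items per cycle when each stage owns a fixed set of units.

    vertex_work and fragment_work are the cycles of work one item needs
    in each stage. The slower stage limits the whole pipeline.
    """
    vertex_rate = vertex_units / vertex_work
    fragment_rate = fragment_units / fragment_work
    return min(vertex_rate, fragment_rate)


def unified_throughput(total_units, vertex_work, fragment_work):
    """Items per cycle when any unit can run either kind of program."""
    return total_units / (vertex_work + fragment_work)


# Hypothetical vertex-heavy workload on 8 vertex units + 24 fragment units.
fixed = fixed_pipeline_throughput(8, 24, vertex_work=30, fragment_work=2)
unified = unified_throughput(32, vertex_work=30, fragment_work=2)
print(f"fixed: {fixed:.3f} items/cycle, unified: {unified:.3f} items/cycle")
```

In this model the unified design is never slower than the fixed split, and the two match only when the split happens to mirror the workload exactly.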
CASE STUDY: GAME PHYSICS Physics simulation occupies an increasingly important role in modern video games. Game players and developers seek environments that move and react in a physically plausible fashion, which requires immense computational resources. This case study focuses on Havok FX (Fig. 2), a GPU-accelerated game physics package and one of the first successful consumer applications of GPU computing.
CASE STUDY: GAME PHYSICS Game physics takes many forms and increasingly includes articulated characters (“rag doll” physics), vehicle simulation, cloth, deformable bodies, and fluid simulation. We concentrate here on rigid body dynamics, which simulates solid objects moving under gravity and obeying Newton’s laws of motion and is probably the most important form of game physics today. Rigid body simulation typically incorporates three steps: integration, collision detection, and collision resolution.
CASE STUDY: GAME PHYSICS Integration: The integration step updates the objects’ velocities based on the applied forces (e.g., gravity, wind, player interactions) and updates the objects’ positions based on the velocities. Collision detection: This step determines which objects are colliding after integration, and their contact points. Collision detection must in principle compare each object with every other object, a very expensive (O(n^2)) proposition. In practice, most systems mitigate this cost by splitting collision detection into a broad phase and a narrow phase. The broad phase compares a simplified representation of the objects (typically their bounding boxes) to quickly determine potentially colliding pairs of objects. The narrow phase then accurately determines the pairs of objects that are actually colliding.
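The broad-phase and narrow-phase split can be sketched as follows. This is an illustrative example under stated assumptions, not the Havok FX implementation: every object is assumed to be a sphere given as `(center, radius)`, and the function names are hypothetical. Note that this simple broad phase still tests every pair of boxes; it only makes each test cheap, and a smarter ordering of the boxes is needed to avoid the O(n^2) pair loop itself.

```python
from itertools import combinations


def sphere_aabb(center, radius):
    """Axis-aligned bounding box of a sphere as (min_xyz, max_xyz)."""
    return (tuple(c - radius for c in center),
            tuple(c + radius for c in center))


def aabb_overlap(a, b):
    """True if two boxes overlap on all three axes."""
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[k] <= bmax[k] and bmin[k] <= amax[k] for k in range(3))


def broad_phase(spheres):
    """Cheap test: pairs whose bounding boxes overlap (potential collisions)."""
    boxes = [sphere_aabb(c, r) for c, r in spheres]
    return [(i, j) for i, j in combinations(range(len(spheres)), 2)
            if aabb_overlap(boxes[i], boxes[j])]


def narrow_phase(spheres, candidates):
    """Exact test: pairs whose spheres actually touch or intersect."""
    hits = []
    for i, j in candidates:
        (ci, ri), (cj, rj) = spheres[i], spheres[j]
        dist_sq = sum((a - b) ** 2 for a, b in zip(ci, cj))
        if dist_sq <= (ri + rj) ** 2:
            hits.append((i, j))
    return hits


spheres = [((0.0, 0.0, 0.0), 1.0),     # 0
           ((-1.5, 0.0, 0.0), 1.0),    # 1: intersects sphere 0
           ((1.8, 1.8, 0.0), 1.0),     # 2: box overlaps sphere 0's box only
           ((10.0, 10.0, 10.0), 1.0)]  # 3: far from everything
candidates = broad_phase(spheres)          # [(0, 1), (0, 2)]
colliding = narrow_phase(spheres, candidates)  # [(0, 1)]
```

Pair (0, 2) shows why the narrow phase is still needed: the bounding boxes overlap at a corner even though the spheres themselves do not touch.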
CASE STUDY: GAME PHYSICS Collision resolution: Once collisions are detected, collision resolution applies impulses (instantaneous transitory forces) to the colliding objects so that they move apart. In 2005, Havok, the leading game physics middleware supplier, began researching new algorithms targeted at simulating tens of thousands of rigid bodies on parallel processors. Havok FX was the result, and both NVIDIA and ATI have worked with Havok to implement and optimize the system on the GPU. Several reasons argue for moving some physics simulation to the GPU. For instance, many games today are CPU-limited, and physics can easily consume 10% or more of CPU time.
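As an illustration of impulse-based collision resolution, the sketch below applies the standard textbook impulse along the contact normal between two spheres. It is an assumption-laden example, not Havok's algorithm: rotation and friction are ignored, the two centers are assumed to be distinct, and `resolve_collision` and its `restitution` parameter (1.0 for a perfectly elastic bounce) are hypothetical names.

```python
import math


def resolve_collision(p1, v1, m1, p2, v2, m2, restitution=1.0):
    """Return the new velocities of two colliding spheres.

    p*, v* are 3-component positions and velocities, m* are masses.
    The impulse acts along the line between the two centers.
    """
    # Contact normal pointing from object 1 to object 2.
    normal = [b - a for a, b in zip(p1, p2)]
    length = math.sqrt(sum(c * c for c in normal))
    normal = [c / length for c in normal]

    # Relative velocity along the normal; negative means approaching.
    relative = [b - a for a, b in zip(v1, v2)]
    approach = sum(r * n for r, n in zip(relative, normal))
    if approach >= 0:
        return list(v1), list(v2)  # already separating, nothing to do

    # Impulse magnitude that reverses the approach speed.
    j = -(1.0 + restitution) * approach / (1.0 / m1 + 1.0 / m2)
    new_v1 = [v - (j / m1) * n for v, n in zip(v1, normal)]
    new_v2 = [v + (j / m2) * n for v, n in zip(v2, normal)]
    return new_v1, new_v2


# Two equal masses in a head-on elastic collision simply exchange velocities.
v1, v2 = resolve_collision((0, 0, 0), (1, 0, 0), 1.0,
                           (1.5, 0, 0), (-1, 0, 0), 1.0)
```

Because the same impulse is applied in opposite directions to the two bodies, total momentum is conserved.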
CASE STUDY: GAME PHYSICS Performing physics on the GPU also enables direct rendering of simulation results from GPU memory, avoiding the need to transfer the positions and orientations of thousands or millions of objects from CPU to GPU each frame. Havok FX is a hybrid system, leveraging the strengths of the CPU and GPU. It stores the complete object state (position, orientation, linear and angular velocities) on the GPU, as well as a proprietary texture-based representation for the shapes of the objects.
CASE STUDY: GAME PHYSICS The CPU performs broad phase collision detection using a highly optimized sort and sweep algorithm, after reading the axis-aligned bounding boxes of each object back from the GPU each frame in a compressed format. The list of potentially colliding pairs is then sent back to the GPU for the narrow phase. Both transfers consist of a relatively small amount of data that transfers quickly over the PCIe bus. The GPU performs all narrow phase collision detection and integration. Havok FX uses a simple Euler integrator with a fixed time step.
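The two algorithms named on this slide can be sketched generically. The code below is not Havok's optimized implementation; it is a minimal version of sort and sweep along the x axis plus one fixed-step Euler update, and it assumes boxes are given as `(min_xyz, max_xyz)` tuples. The function names are hypothetical, and the Euler step updates velocity first and then position, which is one common variant; the slides do not say which variant Havok FX uses.

```python
def sort_and_sweep(boxes):
    """Broad phase: return index pairs of overlapping axis-aligned boxes.

    Boxes are sorted by their minimum x. While sweeping, a box is compared
    only with "active" boxes whose x interval has not yet ended.
    """
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0][0])
    active, pairs = [], []
    for i in order:
        min_i, max_i = boxes[i]
        # Drop boxes that end (in x) before the current box starts.
        active = [j for j in active if boxes[j][1][0] >= min_i[0]]
        for j in active:
            min_j, max_j = boxes[j]
            # x already overlaps; confirm overlap on y and z.
            if all(min_i[k] <= max_j[k] and min_j[k] <= max_i[k]
                   for k in (1, 2)):
                pairs.append((min(i, j), max(i, j)))
        active.append(i)
    return sorted(pairs)


def euler_step(position, velocity, acceleration, dt):
    """One fixed-time-step Euler update of a single object."""
    velocity = [v + a * dt for v, a in zip(velocity, acceleration)]
    position = [p + v * dt for p, v in zip(position, velocity)]
    return position, velocity


boxes = [((-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)),
         ((-2.5, -1.0, -1.0), (-0.5, 1.0, 1.0)),
         ((0.8, 0.8, -1.0), (2.8, 2.8, 1.0)),
         ((9.0, 9.0, 9.0), (11.0, 11.0, 11.0))]
pairs = sort_and_sweep(boxes)  # [(0, 1), (0, 2)]

# A body dropped from height 10 under a gravity of -10 for one 0.1 s step.
pos, vel = euler_step([0.0, 10.0, 0.0], [0.0, 0.0, 0.0],
                      [0.0, -10.0, 0.0], dt=0.1)
```

When objects are spread out along the sweep axis, the active list stays short, so far fewer than n^2 box comparisons are made.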
CASE STUDY: GAME PHYSICS The end result is an order of magnitude performance boost over Havok’s reference single-core CPU implementation. Simulating a scene of 15,000 boulders rolling down a terrain, the CPU implementation (on a single core of an Intel 2.9 GHz Core 2 Duo) achieved 6.2 frames per second, whereas the initial GPU implementation on an NVIDIA GeForce 8800 GTX reached 64.5 frames per second. Havok FX demonstrates the feasibility of building a hybrid system in which the CPU executes serial portions of the algorithm and the GPU executes data-parallel portions. The overall performance of this hybrid system far exceeds that of a CPU-only system, despite the frequent transfers between CPU and GPU.
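The "order of magnitude" claim follows directly from the two frame rates quoted above. The short calculation below uses only those figures; the helper names are hypothetical.

```python
def speedup(baseline_fps, accelerated_fps):
    """How many times faster the accelerated version runs."""
    return accelerated_fps / baseline_fps


def frame_time_ms(fps):
    """Time spent on one frame, in milliseconds."""
    return 1000.0 / fps


cpu_fps, gpu_fps = 6.2, 64.5  # figures reported for the boulder scene
print(f"speedup: {speedup(cpu_fps, gpu_fps):.1f}x")
print(f"frame time: {frame_time_ms(cpu_fps):.1f} ms -> "
      f"{frame_time_ms(gpu_fps):.1f} ms")
```

The ratio is about 10.4, with the time per frame dropping from roughly 161 ms to roughly 15.5 ms.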
REFERENCES [1] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, “GPU computing,” Proc. IEEE, vol. 96, no. 5, pp. 879–899, May 2008. [2] M. Harris, “Mapping computational concepts to GPUs,” in GPU Gems 2, M. Pharr, Ed. Reading, MA: Addison-Wesley, Mar. 2005, pp. 493–508. [3] M. McCool, “Data-parallel programming on the Cell BE and the GPU using the RapidMind development platform,” in Proc. GSPx Multicore Applicat. Conf., Oct.–Nov. 2006. [4] P. M. Hubbard, “Collision detection for interactive graphics applications,” IEEE Trans. Vis. Comput. Graphics, vol. 1, no. 3, pp. 218–230, 1995. [5] B. Bustos, O. Deussen, S. Hiller, and D. Keim, “A graphics hardware accelerated algorithm for nearest neighbor search,” in Proc. 6th Int. Conf. Comput. Sci., May 2006, vol. 3994, pp. 196–199, Lecture Notes in Computer Science.