Department of Computer Science
Beyond CUDA/GPUs and Future Graphics Architectures
Karu Sankaralingam, University of Wisconsin-Madison
Adapted from “Toward a Multicore Architecture for Real-time Raytracing,” MICRO-41, 2008: Venkatraman Govindaraju, Peter Djeu, Karthikeyan Sankaralingam, Mary Vernon, William R. Mark.
Real-time Graphics Rendering: Today
Real-time Graphics Rendering: Today and Future
Real-time Graphics Rendering: What are the problems? How can we get there?
What is wrong with this picture?
GPU/CUDA: the Z-buffer
The “Ptolemaic” Graphics Universe: architecture and application are all optimized for the Z-buffer.
Difficult to render images with realistic effects:
–self-reflection, soft shadows, ambient occlusion
Problems:
–scene constraints; artist and programmer productivity
Current Graphics Architectures (Courtesy: ACM Queue)
How did we get here?
Hardware rasterizers and perspective-correct texture mapping (RIVA 128)
Single-pass multitexture (TNT / TNT2)
Register combiners: a generalization of multitexture (GeForce 256)
Per-pixel shading (GeForce 2 GTS)
Programmable hardware:
–pixel shading
–vertex shading
CUDA
The “Copernican” Graphics Universe: architecture and application revolve around the algorithm, ray tracing.
–A more general-purpose algorithm
–Easier to provide realistic effects
–The architecture can support other applications
Future Graphics Architectures (Courtesy: ACM Queue)
Executive Summary: the Copernicus System
–Co-designed application, architecture, and analysis framework
–A path from specialized graphics architectures to a more general-purpose architecture
–A detailed characterization and analysis framework
–Real-time frame rates are possible for high-quality dynamic scenes
Outline
Motivation
Copernicus system
–Graphics algorithm: Razor
–Architecture
–Evaluation and results
Summary
Ray Tracing
Simulates the behavior of light rays through a 3D scene.
–Rays from the eye into the scene (primary rays)
–Rays from the hit point to the lights (secondary rays)
–An acceleration structure (e.g., a BSP tree) for efficiency
[Figure: full scene containing a cube and a cylinder]
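The two ray types above can be made concrete in a few lines. Below is a minimal, illustrative Python sketch, not Razor code: a single hard-coded sphere stands in for the scene, a primary ray is traced from the eye, and on a hit a secondary (shadow) ray is cast toward a point light. The helper names `ray_sphere_t` and `shade` are invented for this example.

```python
import math

def ray_sphere_t(origin, direction, center, radius):
    """Nearest positive hit distance t along a normalized ray, or None on a miss."""
    oc = [o - c for o, c in zip(origin, center)]
    b = 2 * sum(d * x for d, x in zip(direction, oc))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - 4 * c          # direction assumed normalized, so a == 1
    if disc < 0:
        return None
    for t in ((-b - math.sqrt(disc)) / 2, (-b + math.sqrt(disc)) / 2):
        if t > 1e-6:
            return t
    return None

def shade(eye, pixel_dir, sphere, light):
    """Primary ray from the eye; on a hit, a secondary shadow ray toward the light."""
    center, radius = sphere
    t = ray_sphere_t(eye, pixel_dir, center, radius)
    if t is None:
        return "background"
    hit = [e + t * d for e, d in zip(eye, pixel_dir)]
    to_light = [l - h for l, h in zip(light, hit)]
    norm = math.sqrt(sum(x * x for x in to_light))
    to_light = [x / norm for x in to_light]
    # Offset the origin slightly to avoid re-hitting the surface we left.
    shadow_origin = [h + 1e-4 * d for h, d in zip(hit, to_light)]
    blocked = ray_sphere_t(shadow_origin, to_light, center, radius) is not None
    return "shadow" if blocked else "lit"
```

A real tracer would intersect each ray against the whole scene through the acceleration structure rather than against a single primitive.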
Disadvantages of Ray Tracing
For dynamic scenes, the acceleration structure must be rebuilt every frame.
Traversing the acceleration structure produces irregular data accesses.
Secondary-ray computation grows with resolution.
Razor: A Dynamic Multiresolution Ray Tracer
Packet ray tracer: traces a beam of rays instead of a single ray
–Opportunity for data-level parallelism
Each thread lazily builds its own acceleration structure (kd-tree)
–Builds only the portion of the structure it needs
[Figure: cube and cylinder partitioned between Thread 1 and Thread 2]
Razor: A Dynamic Multiresolution Ray Tracer (continued)
Multiresolution rendering to reduce secondary-ray computation
Replicates the kd-tree to reduce synchronization across threads
–Hypothesis: duplication across threads will be limited
Razor Implementation
Linux/x86
–Implemented Razor on an Intel Clovertown system
–Parallelized using pthreads; optimized with SSE instructions
Sustains 1 FPS on this prototype system
Helps develop the algorithms
Designed with future hardware in mind
Razor's Memory Usage
[Chart: memory footprint vs. number of threads]
Parallel Scalability
[Chart: speedup vs. number of threads]
Outline
Motivation
Copernicus system
–Graphics algorithm: Razor
–Architecture
–Evaluation and results
Summary
Architecture: Core
In-order core
Private L1 data and instruction caches
Supports SIMD instructions
SMT threads to hide memory latency
Architecture: Tile
Shared L2 cache
Shared accelerator for specialized instructions
Architecture: Chip
Architecture: Razor Mapping
[Diagram: Razor components assigned to tiles and to cores]
Outline
Motivation
Copernicus system
–Graphics algorithm: Razor
–Architecture
–Evaluation and results
Summary
Benchmark Scenes
Courtyard, Fairyforest, Forest, Juarez, Saloon
Evaluation Methodology
Simulation with Multifacet/GEMS
–Simulates SSE instructions
–Simulates a full tile
–Validated against prototype data
Pin-based and PAPI-based performance counters
–Randomly selected regions of scenes
Full chip
–Simulating the full chip is too slow
–Built a customized analytic model instead
Analytical Model
Core level: pipeline stalls, multiple threads
Tile level: L2 contention
Chip level: main-memory contention
Compared with our simulation results
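The paper's analytic model is detailed; as a flavor of the core-level idea only, here is a deliberately crude toy version under one idealized assumption: SMT threads overlap each other's stalls perfectly, so core utilization is the single-thread compute fraction times the thread count, capped at 1. Every parameter value in the example below is invented for illustration, and tile-level (L2) and chip-level (memory) contention are ignored.

```python
def core_utilization(compute_frac, threads):
    """Fraction of cycles the core issues work: each thread is busy
    compute_frac of the time, stalls overlap perfectly, capped at 1."""
    return min(1.0, threads * compute_frac)

def chip_rays_per_sec(tiles, cores_per_tile, threads, compute_frac,
                      clock_hz, cycles_per_ray):
    """Toy chip-level throughput, ignoring L2 and main-memory contention."""
    util = core_utilization(compute_frac, threads)
    return tiles * cores_per_tile * util * clock_hz / cycles_per_ray
```

For example, with 16 tiles of 8 cores, 4 SMT threads at an assumed 25% single-thread compute fraction, a 2 GHz clock, and an assumed 5120 cycles per ray, the toy model gives 50 million rays/second. The real model additionally charges for contention at the tile and chip levels, which is why it must be compared against simulation.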
Single-Core Performance (Single Issue)
[Chart: IPC]
Single-Core Performance (Dual Issue)
[Chart: IPC]
Single-Tile Performance
[Chart: IPC]
Full-Chip Performance
[Chart: million rays/second vs. number of tiles]
So, are we there yet?
Results
Goal: 100 million rays per second
Achieved: 50 million rays per second
–with 16 tiles and 4 DIMMs
Insights:
–4-thread SMT with single issue is ideal for this workload
–Good parallel scalability
–Razor's physically motivated optimizations work
Potential for further architectural optimizations:
–shared accelerator
–wide SIMD bundles
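To relate ray throughput to frame rate, divide by the rays needed per frame. The resolution and rays-per-pixel figures below are illustrative assumptions, not numbers from the talk.

```python
def frames_per_sec(rays_per_sec, width, height, rays_per_pixel):
    """Frame rate implied by a ray throughput, given average rays traced per pixel."""
    return rays_per_sec / (width * height * rays_per_pixel)
```

At the achieved 50 million rays/second, an assumed 1024x768 image with an average of 4 rays per pixel (primary plus secondary) works out to roughly 16 frames per second.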
Outline
Motivation
Copernicus system
–Graphics algorithm: Razor
–Architecture
–Evaluation and results
Summary
Summary
A transformation path to ray tracing
–From the Ptolemaic universe to the Copernican graphics universe
A unique architecture design point
–Trades data redundancy and recomputation for reduced synchronization
An evaluation methodology interesting in its own right
–Prototype, simulation, and an analytical framework to design and evaluate future systems
Future work
–Instruction specialization and shared-accelerator design
–Tradeoffs with SIMD width and area
–Memory system
Other Questions?
Ray Tracing