Dukki Hong¹, Youngduke Seo¹, Youngsik Kim², Kwon-Taek Kwon³, Sang-Oak Woo³, Seok-Yoon Jung³, Kyoungwoo Lee⁴, Woo-Chan Park¹
¹ Media Processor Lab., Sejong University  ² Korea Polytechnic University  ³ SAIT of Samsung Electronics Co., Ltd.  ⁴ Yonsei University
October 3, 2013
Introduction
Related Work
◦ Texture Mapping
◦ Non-Blocking Scheme
Proposed Non-Blocking Texture Cache
◦ The Proposed Architecture
◦ Buffers for the Non-Blocking Scheme
◦ Execution Flow of the NBTC
Experimental Results
Conclusion
Texture mapping
◦ Core technique for 3D graphics
◦ Maps texture images onto object surfaces
Problem: a huge amount of memory access is required
◦ Major bottleneck in graphics pipelines
◦ Modern GPUs generally use texture caches to mitigate this problem
Improving texture cache performance
◦ Improving cache hit rates
◦ Reducing miss penalty
◦ Reducing cache access time
The visual quality of mobile 3D games has evolved enough to compare with PC games.
◦ Detailed texture images (e.g., Infinity Blade: 2048 [GDC 2011])
◦ Demand for high texture mapping throughput
Improving texture cache performance
◦ Improving cache hit rates
◦ Reducing miss penalty
◦ Reducing cache access time
Our approach: in this presentation, we introduce a non-blocking texture cache (NBTC) architecture
◦ Out-of-order (OOO) execution
◦ Conditional in-order (IO) completion for texture requests with the same screen coordinate, to support the standard API effectively
Texture mapping
◦ Texture mapping glues n-D images onto geometric objects to increase realism
Texture filtering
◦ Texture filtering is an operation that reduces the aliasing artifacts caused by texture mapping
◦ Bi-linear filtering: four samples per texture access
◦ Tri-linear filtering: eight samples per texture access
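The four-sample blend that bi-linear filtering performs can be sketched as follows. This is a minimal illustrative implementation of the standard technique, not the presented hardware; the texture is assumed to be a row-major 2D list of scalar texel values.

```python
def bilinear_filter(texture, u, v):
    """Sample a texture at continuous coordinates (u, v): fetch the
    four nearest texels and blend them by their fractional distances.
    Illustrative sketch; real hardware also handles wrap modes, etc."""
    h, w = len(texture), len(texture[0])
    x0, y0 = int(u), int(v)                          # top-left texel
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # clamp at the edge
    fx, fy = u - x0, v - y0                          # fractional weights
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bot = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bot * fy
```

Tri-linear filtering would apply this blend on two adjacent mipmap levels (eight samples total) and interpolate between the two results.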
Cache performance study
◦ In [Hakura and Gupta 1997], the performance of a texture cache was measured across various benchmarks
◦ In [Igehy et al. 1999], the performance of a texture cache was studied with multiple pixel pipelines
Pre-fetching scheme
◦ In [Igehy et al. 1998], the latency caused by texture cache misses is hidden by applying an explicit pre-fetching scheme
Survey of texture caches
◦ The introduction of the texture cache and the integration of texture cache architectures into modern GPUs were reviewed in [Doggett 2012]
Non-blocking cache (NBC)
◦ Allows subsequent cache requests to proceed while a cache miss is being handled, reducing miss-induced processor stalls
◦ Kroft first published an NBC using miss information/status holding registers (MSHRs) that keep track of multiple outstanding misses [Kroft 1981]
Performance study of non-blocking caches
◦ Comparison of four different MSHR organizations [Farkas and Jouppi 1994]
Implicitly addressed MSHR: Kroft's MSHR
Explicitly addressed MSHR: a variant of the implicitly addressed MSHR that stores destination addresses explicitly
In-cache MSHR: each cache line serves as an MSHR
The first three MSHRs: only one entry per missed block address
Inverted MSHR: a single entry per possible destination; the number of entries equals the number of usable registers in the processor (the possible destinations)
◦ Study of a recent high-performance out-of-order (OOO) processor using the latest SPEC benchmark [Li et al. 2011]
A hit-under-two-misses non-blocking cache improved the OOO processor's performance by 17.76% over one using a blocking data cache
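The inverted MSHR organization above can be sketched as a table with one slot per possible destination, where each slot records the miss block address that destination is waiting on. This is a simplified illustration of the scheme compared in [Farkas and Jouppi 1994]; class and method names are ours.

```python
class InvertedMSHR:
    """One entry per possible destination (e.g., per register).
    Each entry records which missed block address that destination
    is waiting on; a fill releases every waiting destination."""
    def __init__(self, num_destinations):
        # None means the destination is not waiting on any miss
        self.waiting_on = [None] * num_destinations

    def record_miss(self, dest, block_addr):
        self.waiting_on[dest] = block_addr

    def fill(self, block_addr):
        """A missed block returned from memory: release every
        destination that was waiting on it."""
        released = [d for d, b in enumerate(self.waiting_on)
                    if b == block_addr]
        for d in released:
            self.waiting_on[d] = None
        return released
```

Note how several destinations may wait on the same block address, which the single-entry-per-block organizations handle differently.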
Proposed Non-Blocking Texture Cache
This architecture includes a typical blocking texture cache (BTC) as a level 1 (L1) cache, as well as three kinds of buffers for the non-blocking scheme:
◦ Retry buffer: guarantees IO completion
◦ Waiting list buffer: keeps track of miss information
◦ Block address buffer: removes duplicate block addresses
Feature
◦ The most important property of the retry buffer (RB) is its support of IO completion
The RB stores fragment information in input order
The RB is designed as a FIFO
Data format of each RB entry
◦ Valid bit: 0 = empty, 1 = occupied
◦ Screen coordinate: screen coordinate (x, y) for the output display unit
◦ Texture request
◦ Ready bit: 0 = invalid filtered texture data, 1 = valid filtered texture data
◦ Filtered texture data: texture data for the accomplished texture mapping
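The RB's in-order-completion property can be sketched as a FIFO whose entries may become ready out of order but are only forwarded from the head. This is an illustrative model, not the actual RTL; entry fields follow the slide.

```python
from collections import deque

class RetryBuffer:
    """FIFO retry buffer sketch: fragments enter in input order and
    are forwarded to the shading unit only from the head, enforcing
    IO completion even when texture mapping finishes out of order."""
    def __init__(self):
        self.fifo = deque()

    def allocate(self, screen_xy):
        entry = {"xy": screen_xy, "ready": False, "data": None}
        self.fifo.append(entry)
        return entry

    def complete(self, entry, filtered_texel):
        # texture mapping finished for this entry (possibly out of order)
        entry["data"] = filtered_texel
        entry["ready"] = True

    def retire(self):
        """Forward ready entries from the head only (IO completion)."""
        out = []
        while self.fifo and self.fifo[0]["ready"]:
            e = self.fifo.popleft()
            out.append((e["xy"], e["data"]))
        return out
```

A younger fragment that completes first simply waits in the buffer until every older fragment ahead of it has completed.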
Features
◦ The waiting list buffer (WLB) is similar to the inverted MSHR proposed in [Farkas and Jouppi 1994]
The WLB stores information for both missed and hit addresses
A texture address in the WLB plays a role similar to a register in the inverted MSHR
Data format of each WLB entry
◦ Valid bit: 0 = empty, 1 = occupied
◦ Texture ID: ID number of a texture request
◦ Filtering information: the information needed to accomplish the texture mapping
◦ Texel addr N: the texture address of the required texture data
◦ Texel data N: the texel data of texel addr N
◦ Ready bit N: 0 = invalid texel data N, 1 = valid texel data N
Feature
◦ The block address buffer (BAB) issues DRAM accesses sequentially for the texel requests that caused cache misses
The BAB removes duplicate DRAM requests
When the data are loaded, all the removed duplicate requests are satisfied as well
The BAB is designed as a FIFO
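The BAB's merging behavior can be sketched as a FIFO plus an in-flight set: a new miss to a block that is already outstanding is merged instead of issuing a second DRAM request. This is an illustrative sketch with hypothetical names, not the hardware design.

```python
from collections import deque

class BlockAddressBuffer:
    """FIFO of outstanding miss block addresses with duplicate
    removal: only the first miss to a block issues a DRAM request."""
    def __init__(self):
        self.fifo = deque()
        self.in_flight = set()

    def request(self, block_addr):
        """Return True if a new DRAM request must be issued."""
        if block_addr in self.in_flight:
            return False                 # duplicate removed (merged)
        self.fifo.append(block_addr)
        self.in_flight.add(block_addr)
        return True

    def next_dram_request(self):
        """DRAM accesses are issued sequentially, in FIFO order."""
        return self.fifo.popleft() if self.fifo else None

    def loaded(self, block_addr):
        # the block arrived; later misses to it may issue again
        self.in_flight.discard(block_addr)
```

Because nearby texel fetches often hit the same block, this merging is what keeps the extra memory bandwidth of the non-blocking scheme relatively low.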
Execution flow: start
◦ Look up the RB
◦ Generate texture addresses
◦ Execute tag compare with the texel requests
◦ All hits → hit handling case; a miss occurred → miss handling case
Hit handling case
◦ Read texel data from the L1 cache
◦ Input the texel data to the texture mapping unit via the MUX
◦ Execute texture mapping
◦ Update the RB
Miss handling case
◦ Read the hit texel data from the L1 cache
◦ Input the missed texture requests to the WLB
◦ Input the missed texel requests to the BAB, removing duplicate texel requests
◦ Process the next texture request ("concurrent execution")
Miss handling case (continued: memory request completes)
◦ Complete the memory request
◦ Forward the loaded data to the WLB and the cache
◦ Determine the ready entry in the WLB, then invalidate it
◦ Input the texel data to the texture mapping unit via the MUX
◦ Execute texture mapping
◦ Update the RB
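The WLB fill path above can be sketched as follows: each entry waits on N texel addresses (with hit data pre-filled from the L1 cache), a returning memory block fills every matching slot, and an entry is forwarded to texture mapping and invalidated once all of its texel data are valid. Field names follow the slides; the behavior is an illustrative simplification.

```python
class WaitingListBuffer:
    """Sketch of the WLB fill path: entries become ready when all
    their ready bits are set, then are invalidated and forwarded."""
    def __init__(self):
        self.entries = []

    def allocate(self, tex_id, texel_addrs, hit_data):
        # hit_data: {addr: texel} already read from the L1 cache
        self.entries.append({
            "id": tex_id,
            "addrs": list(texel_addrs),
            "data": dict(hit_data),
        })

    def fill(self, addr, texel):
        """Loaded data forwarded to the WLB (and to the cache)."""
        ready = []
        for e in self.entries:
            if addr in e["addrs"]:
                e["data"][addr] = texel
            if all(a in e["data"] for a in e["addrs"]):
                ready.append(e)
        for e in ready:                  # invalidate ready entries
            self.entries.remove(e)
        return [(e["id"], [e["data"][a] for a in e["addrs"]])
                for e in ready]
```

Each returned tuple would feed the texture mapping unit via the MUX, after which the RB entry for that texture ID is updated.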
Update RB
◦ Determine the ready entry in the RB
◦ Determine whether IO completion is satisfied
◦ Forward the ready entry to the shading unit
◦ Process the next fragment information
Experimental Results
Simulator configuration
◦ mRPsim: announced by SAIT [Yoo et al. 2010]
An execution-driven, cycle-accurate simulator for an SRP-based GPU
Modified texture mapping unit
Eight pixel processors
DRAM access latency: 50, 100, 200, and 300 cycles
◦ Benchmark
Taiji, which has nearest, bi-linear, and tri-linear filtering modes
Cache configuration
◦ Four-way set associative, eight-word block size, and 32 KByte cache size
◦ Number of entries in each buffer: 32
Pixel shader cycles/frame
◦ PS run cycles: running cycles
◦ PS stall cycles: stall cycles
◦ NBTC stall cycles: stall cycles due to the WLB being full
◦ The pixel shader's execution cycles decreased by 12.47% (latency 50) to 41.64% (latency 300)
Cache miss rates
◦ The NBTC's cache miss rate increased slightly over the BTC's cache miss rate
The NBTC can handle subsequent cache accesses while a cache update is not yet completed
Memory bandwidth requirement
◦ The memory bandwidth requirement of the NBTC increased by up to 11% over that of the BTC
Since the block address buffer removes duplicate DRAM requests, the increase in the memory bandwidth requirement was relatively low
A non-blocking texture cache to improve the performance of texture caches
◦ Basic OOO execution while maintaining IO completion for texture requests with the same screen coordinate
◦ Three buffers to support the non-blocking scheme:
The retry buffer: IO completion
The waiting list buffer: tracking miss information
The block address buffer: deleting duplicate block addresses
We also plan to implement the proposed NBTC architecture in hardware and then measure both its power consumption and hardware area
Thank you for your attention
Backup Slides