Dukki Hong¹, Youngduke Seo¹, Youngsik Kim², Kwon-Taek Kwon³, Sang-Oak Woo³, Seok-Yoon Jung³, Kyoungwoo Lee⁴, Woo-Chan Park¹
¹ Media Processor Lab., Sejong University
² Korea Polytechnic University
³ SAIT of Samsung Electronics Co., Ltd.
⁴ Yonsei University
dkhong@rayman.sejong.ac.kr
http://rayman.sejong.ac.kr
October 3, 2013

• Introduction
• Related Work
  ◦ Texture mapping
  ◦ Non-blocking scheme
• Proposed Non-Blocking Texture Cache
  ◦ The proposed architecture
  ◦ Buffers for the non-blocking scheme
  ◦ Execution flow of the NBTC
• Experimental Results
• Conclusion

• Texture mapping
  ◦ Core technique for 3D graphics
  ◦ Maps texture images onto surfaces
• Problem: a huge number of memory accesses is required
  ◦ Major bottleneck in graphics pipelines
  ◦ Modern GPUs generally use texture caches to address this problem
• Improving texture cache performance
  ◦ Improving cache hit rates
  ◦ Reducing miss penalty
  ◦ Reducing cache access time

• The visual quality of mobile 3D games has evolved enough to be comparable with that of PC games
  ◦ Detailed texture images
    ▪ e.g., Infinity Blade: 2048 [GDC 2011]
  ◦ This demands high texture mapping throughput

• Improving texture cache performance
  ◦ Improving cache hit rates
  ◦ Reducing miss penalty
  ◦ Reducing cache access time
• In this presentation, we introduce a non-blocking texture cache (NBTC) architecture
  ◦ Out-of-order (OOO) execution
  ◦ Conditional in-order (IO) completion
    ▪ Applied to texture requests with the same screen coordinate, to support the standard API effectively
“Our approach”

• Texture mapping glues n-D images onto geometric objects
  ◦ To increase realism
• Texture filtering is an operation that reduces the texture-aliasing artifacts caused by texture mapping
  ◦ Bi-linear filtering: four samples per texture access
  ◦ Tri-linear filtering: eight samples per texture access
[Figure panels: Texture mapping; Texture filtering]
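To make the "four samples per texture access" figure concrete, here is a minimal sketch of bi-linear filtering on a single-channel texture. The function name `bilinear_sample`, the texel-unit coordinates, and the clamp-to-edge addressing are assumptions made for illustration; they are not taken from the slides.

```python
import math

def bilinear_sample(texture, u, v):
    """Bi-linearly filter `texture` (a list of rows of floats) at (u, v).

    Assumptions for this sketch: u and v are given in texel units, texel
    values sit at integer coordinates, and out-of-range reads clamp to the edge.
    """
    height, width = len(texture), len(texture[0])
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0  # fractional position inside the 2x2 footprint

    def texel(x, y):
        # Clamp-to-edge addressing (assumed, not specified by the slides).
        return texture[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    # The four texel fetches that one bi-linear access needs.
    t00, t10 = texel(x0, y0), texel(x0 + 1, y0)
    t01, t11 = texel(x0, y0 + 1), texel(x0 + 1, y0 + 1)

    top = t00 * (1 - fx) + t10 * fx
    bottom = t01 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bottom * fy

tex = [[0.0, 1.0],
       [2.0, 3.0]]
center = bilinear_sample(tex, 0.5, 0.5)
```

Tri-linear filtering repeats this on two adjacent mip levels and blends the two results, which is where the eight samples per access come from.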

• Cache performance studies
  ◦ [Hakura and Gupta 1997] measured the performance of a texture cache across various benchmarks
  ◦ [Igehy et al. 1999] studied the performance of a texture cache with multiple pixel pipelines
• Pre-fetching scheme
  ◦ [Igehy et al. 1998] showed that the latency incurred by texture cache misses can be hidden by applying an explicit pre-fetching scheme
• Survey of texture caches
  ◦ [Doggett 2012] covered the introduction of the texture cache and the integration of texture cache architectures into modern GPUs

• Non-blocking cache (NBC)
  ◦ Allows subsequent cache requests to proceed while a cache miss is being handled
    ▪ Reduces miss-induced processor stalls
  ◦ Kroft first published an NBC that uses miss information/status holding registers (MSHRs) to keep track of the information for multiple outstanding misses [Kroft 1981]
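The idea can be sketched with a toy model in which each MSHR is keyed by the missing block address and collects the requests waiting on it. This is an illustrative simplification, not Kroft's design: the class name, the return strings, the 32-byte block size, and the unlimited cache capacity are all assumptions.

```python
class NonBlockingCache:
    """Toy non-blocking cache: misses are parked in MSHRs instead of stalling."""

    def __init__(self, num_mshrs, block_size=32):
        self.block_size = block_size  # bytes per block (assumed value)
        self.num_mshrs = num_mshrs
        self.lines = set()            # cached block addresses (capacity ignored)
        self.mshrs = {}               # block address -> ids of waiting requests

    def access(self, req_id, addr):
        block = addr // self.block_size
        if block in self.lines:
            return "hit"              # served even while other misses are pending
        if block in self.mshrs:
            self.mshrs[block].append(req_id)
            return "merged"           # secondary miss: no new memory request
        if len(self.mshrs) == self.num_mshrs:
            return "stall"            # no free MSHR: the cache must block
        self.mshrs[block] = [req_id]
        return "miss"                 # primary miss: one memory request issued

    def fill(self, block):
        """Memory returned `block`; release the requests that waited on it."""
        self.lines.add(block)
        return self.mshrs.pop(block)

cache = NonBlockingCache(num_mshrs=2)
trace = [cache.access(0, 0), cache.access(1, 4),
         cache.access(2, 64), cache.access(3, 128)]
released = cache.fill(0)
after_fill = cache.access(4, 8)
```

With two MSHRs, the fourth request stalls because two different blocks are already outstanding; once block 0 is filled, later accesses to it hit.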

• Performance studies of non-blocking caches
  ◦ Comparison of four different MSHRs [Farkas and Jouppi 1994]
    ▪ Implicitly addressed MSHR: Kroft's MSHR
    ▪ Explicitly addressed MSHR: complement version of the implicitly addressed MSHR
    ▪ In-cache MSHR: each cache line serves as an MSHR
    ▪ The first three MSHRs: only one entry per miss block address
    ▪ Inverted MSHR: a single entry per possible destination
      - The number of entries = the number of usable registers in a processor (possible destinations)
  ◦ A recent high-performance out-of-order (OOO) processor evaluated with the latest SPEC benchmark [Li et al. 2011]
    ▪ A hit-under-two-misses non-blocking cache improved the OOO processor's performance by 17.76% over the same processor with a blocking data cache

Proposed Non-Blocking Texture Cache

• The architecture includes a typical blocking texture cache (BTC) as the level 1 (L1) cache, plus three kinds of buffers for the non-blocking scheme:
  ◦ Retry buffer
    ▪ Guarantees IO completion
  ◦ Waiting list buffer
    ▪ Keeps track of miss information
  ◦ Block address buffer
    ▪ Removes duplicate block addresses

• Feature
  ◦ The most important property of the retry buffer (RB) is its support of IO completion
    ▪ The RB stores fragment information in input order
    ▪ The RB is designed as a FIFO
• Data format of each RB entry
  ◦ Valid bit: 0 = empty, 1 = occupied
  ◦ Screen coordinate: screen coordinate (x, y) for the output display unit
  ◦ Texture request
  ◦ Ready bit: 0 = invalid filtered texture data, 1 = valid filtered texture data
  ◦ Filtered texture data: texture data produced by the completed texture mapping
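The RB entry format and its FIFO behaviour can be sketched as follows. This is a simplified software model under stated assumptions: the class and method names are hypothetical, the valid bit is represented implicitly by an entry being present in the queue, and the drain step retires strictly in input order, which is stricter than the conditional IO completion the slides describe.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class RBEntry:
    screen_xy: Tuple[int, int]        # screen coordinate (x, y)
    texture_request: int              # opaque request handle (assumed type)
    ready: bool = False               # ready bit
    filtered: Optional[float] = None  # filtered texture data

class RetryBuffer:
    """FIFO of fragments; entries leave only from the head, in input order."""

    def __init__(self, capacity=32):  # 32 entries, as in the experiments
        self.capacity = capacity
        self.fifo = deque()

    def push(self, screen_xy, texture_request):
        if len(self.fifo) == self.capacity:
            return None               # buffer full: caller must retry later
        entry = RBEntry(screen_xy, texture_request)
        self.fifo.append(entry)
        return entry

    def complete(self, entry, filtered):
        # Texture mapping may finish out of order; just mark the entry ready.
        entry.ready = True
        entry.filtered = filtered

    def drain(self):
        # Retire ready entries from the head only, preserving input order.
        out = []
        while self.fifo and self.fifo[0].ready:
            out.append(self.fifo.popleft())
        return out

rb = RetryBuffer()
first = rb.push((0, 0), 1)
second = rb.push((1, 0), 2)
rb.complete(second, 0.5)              # the younger fragment finishes first
blocked = rb.drain()                  # nothing retires: the head is not ready
rb.complete(first, 0.25)
retired = [e.screen_xy for e in rb.drain()]
```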

• Features
  ◦ The waiting list buffer (WLB) is similar to the inverted MSHR proposed in [Farkas and Jouppi 1994]
    ▪ The WLB stores information on both missed and hit addresses
    ▪ A texture address in the WLB plays a role similar to that of a register in the inverted MSHR
• Data format of each WLB entry
  ◦ Valid bit: 0 = empty, 1 = occupied
  ◦ Texture ID: ID number of a texture request
  ◦ Filtering information: the information needed to complete the texture mapping
  ◦ Texel addr N: the texture address of the required texel data
  ◦ Texel data N: the texel data at Texel addr N
  ◦ Ready bit N: 0 = invalid texel data N, 1 = valid texel data N
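A WLB entry with N texel slots might be modelled as below. The helper names `make_entry` and `deliver`, the string used for the filtering information, and the per-texel delivery granularity are assumptions for illustration; the hardware presumably works on whole cache blocks.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class WLBEntry:
    texture_id: int                    # ID number of the texture request
    filtering_info: str                # e.g. "bilinear" (assumed encoding)
    texel_addrs: List[int]             # Texel addr 0..N-1
    texel_data: List[Optional[float]]  # Texel data 0..N-1
    ready: List[bool]                  # Ready bit 0..N-1

def make_entry(texture_id, filtering_info, addrs, hit_data: Dict[int, float]):
    """Build an entry; texels that hit in the cache are stored as ready."""
    data = [hit_data.get(a) for a in addrs]
    ready = [a in hit_data for a in addrs]
    return WLBEntry(texture_id, filtering_info, list(addrs), data, ready)

def deliver(entry, addr, value):
    """Fill every waiting slot for `addr`; return True once all slots are ready."""
    for i, a in enumerate(entry.texel_addrs):
        if a == addr and not entry.ready[i]:
            entry.texel_data[i] = value
            entry.ready[i] = True
    return all(entry.ready)

# Two texels hit (addresses 10 and 11); two missed and arrive later.
entry = make_entry(7, "bilinear", [10, 11, 42, 43], {10: 1.0, 11: 2.0})
initial_ready = list(entry.ready)
done_after_first = deliver(entry, 42, 3.0)
done_after_second = deliver(entry, 43, 4.0)
```

Once `deliver` reports that every ready bit is set, the entry holds everything the texture mapping unit needs and can be invalidated.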

• Feature
  ◦ The block address buffer (BAB) issues DRAM accesses sequentially for the texel requests that caused cache misses
    ▪ The BAB removes duplicate DRAM requests
    ▪ When data are loaded, all the requests that were removed as duplicates are found
    ▪ The BAB is designed as a FIFO
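A minimal sketch of the duplicate removal, assuming a 32-byte block and using an ordered dictionary to stand in for the FIFO. The class and method names are hypothetical.

```python
from collections import OrderedDict

class BlockAddressBuffer:
    """FIFO of missed block addresses with duplicate DRAM requests removed."""

    def __init__(self, block_bytes=32):  # block size is an assumed value
        self.block_bytes = block_bytes
        # block address -> texel addresses waiting on it, in arrival order
        self.pending = OrderedDict()

    def push(self, texel_addr):
        """Record a missed texel; return True only if a new DRAM request is needed."""
        block = texel_addr // self.block_bytes
        is_new = block not in self.pending
        self.pending.setdefault(block, []).append(texel_addr)
        return is_new

    def pop_loaded(self):
        """Oldest block finished loading: return it with all its waiting texels."""
        return self.pending.popitem(last=False)

bab = BlockAddressBuffer()
issued = [bab.push(0), bab.push(4), bab.push(64)]
first_done = bab.pop_loaded()
second_done = bab.pop_loaded()
```

Here the texels at addresses 0 and 4 fall in the same block, so only one DRAM request is issued for them, and both are found again when that block returns.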

[Flowchart: execution flow of the NBTC]
• Start
• Look up the RB
• Generate texture addresses
• Compare tags against the texel requests
  ◦ All hits: hit handling case
  ◦ Miss occurred: miss handling case

[Flowchart: hit handling case]
• Read texel data from the L1 cache
• Input texel data to the texture mapping unit via the MUX
• Execute texture mapping
• Update the RB

[Flowchart: miss handling case]
• Read hit texel data from the L1 cache
• Input missed texture requests to the WLB
• Input missed texel requests to the BAB
  ◦ “Concurrent execution”
• Remove duplicate texel requests
• Process the next texture request
• Complete the memory request
• Forward the loaded data to the WLB and the cache
• Determine the ready entry in the WLB
• Invalidate the entry
• Input texel data to the texture mapping unit via the MUX
• Execute texture mapping
• Update the RB

[Flowchart: RB update and completion]
• Update the RB
• Determine the ready entry in the RB
• Forward the ready entry to the shading unit
• Process the next fragment information
• Determine whether IO completion is required
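The combined effect of out-of-order execution and in-order completion can be illustrated with a toy timing model. It is a sketch under strong assumptions that are not from the slides: one fragment is issued per cycle, a hit finishes one cycle after issue, a miss adds a fixed latency, and retirement is strictly in input order.

```python
def simulate(fragments, miss_latency):
    """Return (completion_order, retire_schedule) for a list of fragments.

    `fragments` is a list of (name, is_miss) pairs in input order.
    """
    finish = []
    for cycle, (name, is_miss) in enumerate(fragments):
        done = cycle + 1 + (miss_latency if is_miss else 0)
        finish.append((done, cycle, name))

    # Texture mapping finishes in order of `done`: this is the OOO part.
    completion_order = [name for _, _, name in sorted(finish)]

    # A fragment retires only after every older fragment has retired.
    retire_schedule = []
    retire_cycle = 0
    for done, _, name in finish:
        retire_cycle = max(retire_cycle, done)
        retire_schedule.append((name, retire_cycle))
    return completion_order, retire_schedule

completion, retire = simulate(
    [("A", True), ("B", False), ("C", False)], miss_latency=10)
```

Fragments B and C finish their texture mapping while A's miss is still outstanding, but all three leave in the order A, B, C, which is the ordering the RB enforces.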

Experimental Results

• Simulator configuration
  ◦ mRPsim: announced by SAIT [Yoo et al. 2010]
    ▪ Execution-driven, cycle-accurate simulator for an SRP-based GPU
    ▪ Modification of the texture mapping unit
    ▪ Eight pixel processors
    ▪ DRAM access latency: 50, 100, 200, and 300 cycles
  ◦ Benchmark
    ▪ Taiji, which has nearest, bi-linear, and tri-linear filtering modes
• Cache configuration
  ◦ Four-way set associative, eight-word block size, and 32 KByte cache size
  ◦ Number of entries in each buffer: 32
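The cache configuration above implies a specific geometry, which the following sketch works out. The 4-byte word size is an assumption (the slides give the block size only in words), and the function names are hypothetical.

```python
def cache_geometry(size_bytes=32 * 1024, ways=4, words_per_block=8, word_bytes=4):
    """Return (block_bytes, num_blocks, num_sets) for a set-associative cache.

    word_bytes=4 is an assumed value; the other defaults come from the slide.
    """
    block_bytes = words_per_block * word_bytes
    num_blocks = size_bytes // block_bytes
    num_sets = num_blocks // ways
    return block_bytes, num_blocks, num_sets

def split_address(addr, block_bytes, num_sets):
    """Split a byte address into (tag, set index, block offset)."""
    offset = addr % block_bytes
    index = (addr // block_bytes) % num_sets
    tag = addr // (block_bytes * num_sets)
    return tag, index, offset

block_bytes, num_blocks, num_sets = cache_geometry()
fields = split_address(0x12345, block_bytes, num_sets)
```

Under the 4-byte-word assumption, each block is 32 bytes, the cache holds 1024 blocks, and they are organized as 256 sets of four ways, so an address uses 5 offset bits and 8 index bits.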

• Pixel shader cycle/frame
  ◦ PS run cycle: running cycles
  ◦ PS stall cycle: stall cycles
  ◦ NBTC stall cycle: stall cycles due to the WLB being full
  ◦ The pixel shader's execution cycles decreased by 12.47% (latency 50) to 41.64% (latency 300)

• Cache miss rates
  ◦ The NBTC's cache miss rate was slightly higher than the BTC's cache miss rate
    ▪ The NBTC can handle subsequent cache accesses before a cache update has completed

• Memory bandwidth requirement
  ◦ The memory bandwidth requirement of the NBTC was up to 11% higher than that of the BTC
    ▪ Because the block address buffer removes duplicate DRAM requests, the increase in the memory bandwidth requirement was relatively small

• A non-blocking texture cache to improve the performance of texture caches
  ◦ Basic OOO execution that maintains IO completion for texture requests with the same screen coordinate
  ◦ Three buffers to support the non-blocking scheme:
    ▪ The retry buffer: IO completion
    ▪ The waiting list buffer: tracking the miss information
    ▪ The block address buffer: deleting duplicate block addresses
• We plan to implement the proposed NBTC architecture in hardware and then measure both its power consumption and its hardware area

Thank you for your attention
http://rayman.sejong.ac.kr
