Dukki Hong¹, Youngduke Seo¹, Youngsik Kim², Kwon-Taek Kwon³, Sang-Oak Woo³, Seok-Yoon Jung³, Kyoungwoo Lee⁴, Woo-Chan Park¹
¹Media Processor Lab., Sejong University  ²Korea Polytechnic University  ³SAIT of Samsung Electronics Co., Ltd.  ⁴Yonsei University
October 3, 2013

 Introduction
 Related Work
◦ Texture Mapping
◦ Non-Blocking Scheme
 Proposed Non-Blocking Texture Cache
◦ The Proposed Architecture
◦ Buffers for the Non-Blocking Scheme
◦ Execution Flow of the NBTC
 Experimental Results
 Conclusion

 Texture mapping
◦ A core technique of 3D graphics
◦ Maps texture images onto object surfaces
 Problem: texture mapping requires a huge amount of memory access
◦ A major bottleneck in graphics pipelines
◦ Modern GPUs generally use texture caches to mitigate this problem
 Improving texture cache performance
◦ Improving cache hit rates
◦ Reducing miss penalty
◦ Reducing cache access time

 The visual quality of mobile 3D games has evolved enough to compare with PC games
◦ Detailed texture images, ex) Infinity Blade: 2048 [GDC 2011]
◦ These games demand high texture mapping throughput

 Improving texture cache performance
◦ Improving cache hit rates
◦ Reducing miss penalty ← "our approach"
◦ Reducing cache access time
 In this presentation, we introduce a non-blocking texture cache (NBTC) architecture
◦ Out-of-order (OOO) execution
◦ Conditional in-order (IO) completion: requests with the same screen coordinate complete in order, to support the standard API effectively

 Texture mapping glues n-D images onto geometric objects
◦ Goal: to increase realism
 Texture filtering is an operation that reduces the texture aliasing artifacts caused by texture mapping (a C sketch follows below)
◦ Bi-linear filtering: four samples per texture access
◦ Tri-linear filtering: eight samples per texture access
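To make the sampling cost concrete, here is a minimal bi-linear filtering sketch in C. It is an illustration only: the row-major texel layout, the clamp addressing, and every name in it are assumptions, not the filtering hardware discussed in this work.

```c
/* Minimal bi-linear filtering sketch (illustrative; all names are
 * assumptions, not the hardware modeled in this work). */
#include <math.h>

typedef struct { float c[4]; } Texel;   /* RGBA texel */

/* Clamp-addressed texel fetch from a row-major w x h texture. */
static Texel fetch(const Texel *tex, int w, int h, int x, int y)
{
    if (x < 0) x = 0; else if (x >= w) x = w - 1;
    if (y < 0) y = 0; else if (y >= h) y = h - 1;
    return tex[y * w + x];
}

/* One bi-linear texture access: four texel fetches per sample. */
Texel bilinear(const Texel *tex, int w, int h, float u, float v)
{
    float fx = u * w - 0.5f, fy = v * h - 0.5f;
    int   x0 = (int)floorf(fx), y0 = (int)floorf(fy);
    float ax = fx - (float)x0,  ay = fy - (float)y0;
    Texel t[4] = { fetch(tex, w, h, x0,     y0),
                   fetch(tex, w, h, x0 + 1, y0),
                   fetch(tex, w, h, x0,     y0 + 1),
                   fetch(tex, w, h, x0 + 1, y0 + 1) };
    float wgt[4] = { (1 - ax) * (1 - ay), ax * (1 - ay),
                     (1 - ax) * ay,       ax * ay };
    Texel out = {{0, 0, 0, 0}};
    for (int i = 0; i < 4; i++)
        for (int k = 0; k < 4; k++)
            out.c[k] += wgt[i] * t[i].c[k];
    return out;
}
```

Tri-linear filtering runs this computation on two adjacent mipmap levels and blends the results, which is why it costs eight samples per texture access.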

 Cache performance studies
◦ [Hakura and Gupta 1997] measured texture cache performance across various benchmarks
◦ [Igehy et al. 1999] studied texture cache performance with multiple pixel pipelines
 Pre-fetching scheme
◦ [Igehy et al. 1998] hides the latency of texture cache misses with an explicit pre-fetching scheme
 Texture cache survey
◦ [Doggett 2012] reviews the introduction of texture caches and their integration into modern GPUs

 Non-blocking cache (NBC)
◦ Allows subsequent cache requests to proceed while a cache miss is being handled, reducing miss-induced processor stalls
◦ Kroft first proposed an NBC that uses miss information/status holding registers (MSHRs) to keep track of multiple outstanding misses [Kroft 1981] (a sketch follows below)
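For intuition, here is a minimal sketch of a Kroft-style (implicitly addressed) MSHR table in C; the entry counts and field names are assumptions made for illustration, not Kroft's actual design parameters.

```c
/* Sketch of a Kroft-style MSHR table (illustrative sizes/names).
 * Each entry tracks one outstanding missed block; further requests
 * to the same block merge in as extra targets instead of stalling. */
#define MSHR_ENTRIES     4
#define TARGETS_PER_MISS 4

typedef struct {
    unsigned dest;      /* where the word must be delivered */
    unsigned offset;    /* word offset within the block     */
} MissTarget;

typedef struct {
    int        valid;
    unsigned   block_addr;               /* one entry per missed block */
    int        n_targets;
    MissTarget target[TARGETS_PER_MISS];
} MshrEntry;

/* Record a miss: merge into an existing entry for the same block,
 * or allocate a new one. Returns 0 if the MSHR is full (stall). */
int mshr_record(MshrEntry mshr[MSHR_ENTRIES],
                unsigned block_addr, MissTarget t)
{
    MshrEntry *free_e = 0;
    for (int i = 0; i < MSHR_ENTRIES; i++) {
        if (mshr[i].valid && mshr[i].block_addr == block_addr) {
            if (mshr[i].n_targets == TARGETS_PER_MISS) return 0;
            mshr[i].target[mshr[i].n_targets++] = t;   /* merge */
            return 1;
        }
        if (!mshr[i].valid && !free_e) free_e = &mshr[i];
    }
    if (!free_e) return 0;                             /* full: stall */
    free_e->valid = 1;
    free_e->block_addr = block_addr;
    free_e->n_targets = 1;
    free_e->target[0] = t;
    return 1;                                          /* new miss */
}
```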

 Performance studies of non-blocking caches
◦ [Farkas and Jouppi 1994] compared four different MSHR organizations:
 Implicitly addressed MSHR: Kroft's original MSHR
 Explicitly addressed MSHR: a complementary version of the implicitly addressed MSHR
 In-cache MSHR: each cache line acts as an MSHR
 Inverted MSHR: a single entry per possible destination; the number of entries equals the number of usable registers in the processor
 The first three organizations keep only one entry per missed block address
◦ [Li et al. 2011] evaluated a recent high-performance out-of-order (OOO) processor on the latest SPEC benchmarks: a hit-under-two-misses non-blocking cache improved the processor's performance by 17.76% over a blocking data cache

Proposed Non-Blocking Texture Cache

 The proposed architecture includes a typical blocking texture cache (BTC) as a level 1 (L1) cache, as well as three kinds of buffers for the non-blocking scheme:
◦ Retry buffer: guarantees IO completion
◦ Waiting list buffer: keeps track of miss information
◦ Block address buffer: removes duplicate block addresses

 Feature
◦ The most important property of the retry buffer (RB) is its support of IO completion
 The RB stores fragment information in input order
 The RB is designed as a FIFO
 Data format of each RB entry (sketched in C below)
◦ Valid bit: 0 = empty, 1 = occupied
◦ Screen coordinate: the (x, y) screen coordinate for the output display unit
◦ Texture request
◦ Ready bit: 0 = filtered texture data not yet valid, 1 = filtered texture data valid
◦ Filtered texture data: the texture data of the completed texture mapping
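A plausible C rendering of this entry format follows; the slide gives the fields but not their widths or types, so everything beyond the field list is an assumption.

```c
/* Sketch of one retry buffer (RB) entry; types, widths, and the
 * TexRequest placeholder are assumptions, not the real design. */
typedef struct {
    unsigned tex_id;        /* placeholder texture-request fields */
    float    u, v;
} TexRequest;

typedef struct {
    unsigned   valid : 1;   /* 0 = empty, 1 = occupied              */
    unsigned   x, y;        /* screen coordinate for the display unit */
    TexRequest req;         /* the pending texture request          */
    unsigned   ready : 1;   /* 1 = filtered texture data is valid   */
    float      filtered[4]; /* result of the completed texture mapping */
} RbEntry;

/* The RB itself is a FIFO, so completion order follows input order. */
```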

 Features
◦ The waiting list buffer (WLB) is similar to the inverted MSHR proposed in [Farkas and Jouppi 1994]
 The WLB stores information for both missed and hit addresses
 A texture address in the WLB plays a role similar to that of a register in the inverted MSHR
 Data format of each WLB entry (sketched in C below)
◦ Valid bit: 0 = empty, 1 = occupied
◦ Texture ID: the ID number of a texture request
◦ Filtering information: the information needed to complete the texture mapping
◦ Texel addr N: the texture address of a required texel
◦ Texel data N: the texel data at texel addr N
◦ Ready bit N: 0 = texel data N invalid, 1 = texel data N valid
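Again as a sketch, one WLB entry might look like this in C, with N = 8 chosen to cover tri-linear filtering's eight texels; all sizes and names are assumptions.

```c
#define N_TEXELS 8   /* assumption: up to tri-linear = eight texels */

/* Sketch of one waiting list buffer (WLB) entry. Each texel slot
 * pairs an address with its data and a per-slot ready bit, much as
 * each possible destination has a slot in an inverted MSHR. */
typedef struct {
    unsigned valid : 1;              /* 0 = empty, 1 = occupied        */
    unsigned tex_id;                 /* ID number of the texture request */
    unsigned filter_info;            /* filtering parameters            */
    unsigned texel_addr[N_TEXELS];   /* addresses of required texels    */
    unsigned texel_data[N_TEXELS];   /* texel data, once loaded         */
    unsigned ready[N_TEXELS];        /* ready bit N: texel data N valid */
} WlbEntry;
```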

 Feature
◦ The block address buffer (BAB) serializes the DRAM accesses for texel requests that caused cache misses
 The BAB removes duplicate DRAM requests (sketched below)
 When the data are loaded, all the merged (removed) requests are serviced
 The BAB is designed as a FIFO
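The duplicate-removal behavior can be sketched as a FIFO enqueue that first scans for an already-pending block address; the sizes and names below are assumptions (32 entries matches the buffer sizing used in the experiments).

```c
#define BAB_ENTRIES 32   /* assumption: matches the 32-entry buffers */

typedef struct {
    unsigned fifo[BAB_ENTRIES];
    int      head, count;
} Bab;

/* Enqueue a missed block address unless it is already pending, so
 * each unique block triggers exactly one DRAM request; the merged
 * requests are all satisfied when that block's data returns. */
int bab_enqueue(Bab *b, unsigned block_addr)
{
    for (int i = 0; i < b->count; i++)
        if (b->fifo[(b->head + i) % BAB_ENTRIES] == block_addr)
            return 0;                       /* duplicate: merged   */
    if (b->count == BAB_ENTRIES)
        return -1;                          /* full: must stall    */
    b->fifo[(b->head + b->count++) % BAB_ENTRIES] = block_addr;
    return 1;                               /* new DRAM request    */
}
```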

 Execution flow: overview
1. Start: execute an RB lookup
2. Generate texture addresses
3. Execute a tag compare with the texel requests
4. All hits → hit handling case; a miss occurred → miss handling case

 Hit handling case
1. Read the texel data from the L1 cache
2. Feed the texel data to the texture mapping unit via the MUX
3. Execute texture mapping
4. Update the RB

 Miss handling case ("concurrent execution")
1. Read the hit texel data from the L1 cache
2. Insert the missed texture requests into the WLB
3. Insert the missed texel requests into the BAB, removing duplicate texel requests
4. Process the next texture request while the miss is serviced

 Miss handling case (continued): when the memory request completes
1. Forward the loaded data to the WLB and the cache
2. Determine the ready entry in the WLB and invalidate it
3. Feed its texel data to the texture mapping unit via the MUX
4. Execute texture mapping
5. Update the RB

 Update RB
1. Determine the ready entry in the RB, checking whether IO completion is satisfied
2. Forward the ready entry to the shading unit
3. Process the next fragment information
(A combined sketch of the whole flow follows below.)
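Putting the four flow slides together, the per-request control flow might be summarized in C as below. Every function here is an illustrative stand-in for a hardware unit, not a real API, and in the actual design the miss path runs concurrently with the next request rather than sequentially.

```c
/* C sketch of the NBTC execution flow; all declarations are
 * hypothetical stand-ins for hardware units. */
typedef struct TexReq TexReq;            /* opaque request handle   */

int  rb_lookup_and_allocate(TexReq *r);  /* RB lookup               */
void generate_texture_addresses(TexReq *r);
int  tag_compare_all_hit(TexReq *r);     /* 1 = all texels hit      */
void read_hit_texels(TexReq *r);
void wlb_insert_miss_info(TexReq *r);
void bab_enqueue_missed_blocks(TexReq *r); /* duplicates removed    */
void texture_map_and_update_rb(TexReq *r); /* filter, set ready bit */
int  rb_head_ready(void);
void forward_rb_head_to_shading_unit(void);

void nbtc_process(TexReq *r)
{
    rb_lookup_and_allocate(r);
    generate_texture_addresses(r);

    if (tag_compare_all_hit(r)) {
        /* hit handling: read L1, filter via the MUX, update the RB */
        read_hit_texels(r);
        texture_map_and_update_rb(r);
    } else {
        /* miss handling, concurrent with the next request          */
        read_hit_texels(r);             /* read whatever did hit    */
        wlb_insert_miss_info(r);        /* track the miss           */
        bab_enqueue_missed_blocks(r);   /* one DRAM request/block   */
        /* when memory returns: loaded data is forwarded to the WLB
         * and the cache, the ready WLB entry is filtered and
         * invalidated, and the RB entry is updated.                */
    }

    /* completion: forward RB entries to the shading unit in FIFO
     * order, preserving IO completion per screen coordinate.       */
    while (rb_head_ready())
        forward_rb_head_to_shading_unit();
}
```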

Experimental Results

 Simulator configuration
◦ mRPsim: announced by SAIT [Yoo et al. 2010]
 An execution-driven, cycle-accurate simulator for an SRP-based GPU
 Texture mapping unit modified for the NBTC
 Eight pixel processors
 DRAM access latencies: 50, 100, 200, and 300 cycles
◦ Benchmark
 Taiji, which uses nearest, bi-linear, and tri-linear filtering modes
 Cache configuration (geometry worked out below)
◦ Four-way set associative, eight-word block size, 32 KByte cache size
◦ Number of entries in each buffer: 32
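As a sanity check on this configuration, and assuming 4-byte words (an assumption; the slide does not state the word size), the cache geometry works out as follows:

```c
/* Derived geometry of the simulated cache, assuming 4-byte words. */
#define WORD_BYTES   4
#define BLOCK_WORDS  8
#define CACHE_BYTES  (32 * 1024)
#define WAYS         4

#define BLOCK_BYTES  (BLOCK_WORDS * WORD_BYTES)   /* 32 B           */
#define NUM_BLOCKS   (CACHE_BYTES / BLOCK_BYTES)  /* 1024 blocks    */
#define NUM_SETS     (NUM_BLOCKS / WAYS)          /* 256 sets       */
/* Address split: 5 offset bits (32 B block), 8 index bits (256
 * sets), and the remaining texture-address bits form the tag.      */
```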

October 3,  Pixel shader cycle/frame ◦ PS run cycle : running cycles ◦ PS stall cycle : stall cycle ◦ NBTC stall cycle : stall cycles due to the WLB full ◦ The pixel shader’s execution cycle decreased from 12.47% (latency 50) to 41.64% (latency 300)

 Cache miss rates
◦ The NBTC's cache miss rate increased slightly over the BTC's
 The NBTC can handle subsequent cache accesses while a cache update is not yet complete

 Memory bandwidth requirement
◦ The memory bandwidth requirement of the NBTC increased by up to 11% over that of the BTC
 Because the block address buffer removes duplicate DRAM requests, the increase in memory bandwidth requirement remains relatively low

 A non-blocking texture cache that improves texture cache performance
◦ Basic OOO execution while maintaining IO completion for texture requests with the same screen coordinate
◦ Three buffers support the non-blocking scheme:
 The retry buffer: IO completion
 The waiting list buffer: tracking miss information
 The block address buffer: removing duplicate block addresses
 Future work: implement the proposed NBTC architecture in hardware and measure both its power consumption and hardware area

Thank you for your attention

Backup Slides
