Eager Writeback — A Technique for Improving Bandwidth Utilization

Presentation transcript:

Eager Writeback — A Technique for Improving Bandwidth Utilization. Hsien-Hsin Lee, Gary Tyson, Matt Farrens. Intel Corporation, Santa Clara; University of Michigan, Ann Arbor; University of California, Davis. I am here today to talk about my work on eager writeback, a technique for improving bandwidth utilization. In this work I will show you a fairly simple memory-type extension, called eager writeback, that can redistribute memory bandwidth by evicting dirty lines early to fill unused memory bandwidth, improving the performance of streaming and multimedia applications. This research was done with professor…

Agenda: Introduction; Memory Type and Bandwidth Issues; Memory Reference Characterization; Eager Writeback; Experimental Results and Analysis; Conclusions.

Modern Multimedia Computing System [System diagram: the host processor (core plus L2 cache on the back-side bus) connects through the front-side bus to the chipset, which links system memory (DRAM), I/O, and, via A.G.P., the graphics processing unit with its local frame buffer; command and texture traffic flows across these buses.] The CPU reads data from main memory, performs some manipulation, generates commands for the graphics processor, and stores those commands to the graphics memory space. Technologies such as Direct RDRAM from Rambus try to provide as much bandwidth as possible on the memory bus, in order to satisfy the system memory bandwidth consumed by the CPU and the graphics accelerator.

Memory Type Support Page-based programmable memory types: Uncacheable (e.g. memory-mapped I/O), Write-Combining (e.g. frame buffers), Write-Protected (e.g. copy-on-write pages after fork), Write-Through, and Write-Back (Copy-Back). The memory type associated with a particular memory region can be programmed in the memory type range registers. UC: the memory is not cached, and all reads and writes execute in program order without reordering; in other words, strongly ordered. WC: a weakly ordered mode; the memory locations are not cached and write ordering is unimportant. The processor issues a burst-write transaction (of cache-line size) to the uncacheable memory when the WC buffer is filled; otherwise partial write transactions are issued, which is inefficient. WP: writes are propagated to the system bus and cause the corresponding cache lines on all processors on the bus to be invalidated.
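As a rough sketch of how these page-based types could be represented in a simulator (a hypothetical C enum of my own; the values are illustrative, not the real MTRR/PAT encodings):

    /* Illustrative model of the five programmable memory types above.
       The enum values are arbitrary; real MTRR/PAT encodings differ. */
    typedef enum {
        MEM_UC, /* Uncacheable: strongly ordered, never cached (memory-mapped I/O) */
        MEM_WC, /* Write-Combining: uncached, weakly ordered; a full WC buffer
                   drains as one cache-line burst write, a partial one as
                   slower partial writes (frame buffers) */
        MEM_WP, /* Write-Protected: writes go to the bus and invalidate the
                   line in every processor's cache (copy-on-write pages) */
        MEM_WT, /* Write-Through: stores update the cache and memory together */
        MEM_WB  /* Write-Back (Copy-Back): stores stay in the cache until the
                   dirty line is evicted */
    } mem_type_t;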

Write-through vs. Writeback [Diagrams: with write-through, the CPU's stores allocate into the L1 cache and are also sent to main memory on every write; with writeback, a store only marks the L1 line dirty, and the data moves to main memory when the line is evicted.] With write-through you write through all levels of the memory hierarchy, which can throttle the bus every time you write something to memory. However, it is one way to reduce coherency misses in MP systems, because every write propagates the most up-to-date data to the outermost globally observable memory location. A minimal sketch of the two policies follows.
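Here is a minimal C sketch of the contrast (the line_t type and the bus_write_line interface are my own illustration, not the paper's simulator code): write-through spends bus bandwidth on every store, while writeback defers the traffic to eviction time.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 64

    typedef struct {
        uint64_t tag;
        bool     valid, dirty;
        uint8_t  data[LINE_SIZE];
    } line_t;

    /* Hypothetical bus interface to the next level of the hierarchy. */
    void bus_write_line(uint64_t addr, const uint8_t *data);

    void store_write_through(line_t *ln, uint64_t addr, const uint8_t *src) {
        memcpy(ln->data, src, LINE_SIZE);
        bus_write_line(addr, ln->data); /* bus traffic on every single store */
    }

    void store_write_back(line_t *ln, const uint8_t *src) {
        memcpy(ln->data, src, LINE_SIZE);
        ln->dirty = true;               /* bus write deferred until eviction,
                                           where it may collide with fetches */
    }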

Potential WB Bandwidth Issues Conflict on the bus while streaming data in: incoming demand fetches versus outgoing dirty data. The dirty data can steal cycles amid successive data streaming and delay data delivery on the critical path, and the writeback (castout) buffer can be ineffective at hiding this. How do we alleviate the conflicts? Find a balance between WT and WB; that is, find the right trigger for a cache line writeback.

Probability of Rewrites to Dirty Lines [Chart: Pr(R|D) from 0.1 to 1.0 for each LRU stack position (MRU, MRU-1, LRU+1, LRU), for the L1 data cache and the L2 cache, across Xlock-mount, POV-ray, xdoom, Xanim, and their average.] Pr(R|D) = (# re-dirtied lines) / (# dirty lines entering a particular LRU state); the denominator counts the cases where a cache line entering a given state, e.g. LRU, is dirty. Measured on 4-way caches using the x-benchmarks [Austin 98]. MRU lines are much more likely to be written again.
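To make the metric concrete, this is a sketch (all names are mine, not SimpleScalar's) of the two per-position counters a simulator would maintain to compute Pr(R|D):

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4   /* the measured caches are 4-way */

    /* Stack position 0 = MRU, ..., WAYS-1 = LRU. */
    static uint64_t entered_dirty[WAYS]; /* # dirty lines entering each position */
    static uint64_t redirtied[WAYS];     /* # of those written again while there */

    /* Called when the replacement logic moves a line into stack position pos. */
    void on_enter_position(int pos, bool dirty) {
        if (dirty) entered_dirty[pos]++;  /* denominator event */
    }

    /* Called when a store hits a line that is already dirty at position pos. */
    void on_store_to_dirty(int pos) {
        redirtied[pos]++;                 /* numerator event */
    }

    double pr_rewrite_given_dirty(int pos) { /* Pr(R|D) for one position */
        return entered_dirty[pos]
                 ? (double)redirtied[pos] / (double)entered_dirty[pos]
                 : 0.0;
    }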

Normalized L1 Dirty Line States Enter dirty: the first time a line is written. Re-dirty: a write to an already-dirty line. If an eagerly written-back line is touched again by reads, there is no problem: it is not dirty anymore. Even if it is touched again by writes, that accounts for only a very small overhead.

Eager Writeback Trigger Dirty lines entering the LRU state! A dirty line that enters the LRU state becomes a good candidate to trigger an eager writeback: it is rarely written again, and it is the next line in its set to be evicted anyway.

Eager Writeback Mechanism [Animated diagram, shown over several slides: a set-associative cache with per-set LRU bits, an MSHR with a forward path for the cache miss address, and a writeback buffer holding block address and data in front of the next-level cache/memory.] This layout is easier for illustration; in reality the cache lines do not move around, and the LRU bits simply point to the line that is least recently used. When a dirty line enters the LRU position, its address and data are copied into the writeback buffer and the writeback is issued toward the next level. If the writeback buffer has no free entry, the set ID is instead recorded in a small Eager Queue (EQ), and the eager writeback for that set is triggered later, when a writeback buffer entry is freed. A sketch of this trigger follows.
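Here is a hypothetical C sketch of the trigger logic (the set/way bookkeeping and the wb_buffer_* and eager_queue_push interfaces are my assumptions, not the paper's implementation): after each access, if the LRU way of the touched set holds a dirty line, the line is written back early and marked clean while remaining valid in the cache.

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS      4
    #define LINE_SIZE 64

    typedef struct {
        uint64_t addr;              /* line-aligned address (tag + set) */
        bool     valid, dirty;
        uint8_t  data[LINE_SIZE];
    } line_t;

    typedef struct { line_t way[WAYS]; int lru_way; } set_t;

    /* Hypothetical writeback-buffer and eager-queue interfaces. */
    bool wb_buffer_has_free_entry(void);
    void wb_buffer_push(uint64_t addr, const uint8_t *data);
    void eager_queue_push(int set_id);  /* retried when a WB entry is freed */

    /* Called after every cache access, once the LRU bits have been updated. */
    void eager_writeback_check(set_t *s, int set_id) {
        line_t *lru = &s->way[s->lru_way];
        if (!lru->valid || !lru->dirty)
            return;                       /* only dirty LRU lines qualify */
        if (wb_buffer_has_free_entry()) {
            /* Write the line back early and mark it clean. It stays valid,
               so a later read still hits; a later write merely re-dirties
               it, which the measurements show is rare for LRU lines. */
            wb_buffer_push(lru->addr, lru->data);
            lru->dirty = false;
        } else {
            eager_queue_push(set_id);     /* WB buffer full: defer via EQ */
        }
    }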

Simulation Framework SimpleScalar suite: an 8-wide OOO superscalar machine with enhanced memory subsystem modeling; non-blocking caches (32 KB L1 / 512 KB L2); MSHRs modeled at all cache levels; the WC memory type modeled; a 2-level gshare (10-bit) branch predictor; a single-channel RDRAM model; and limited bus bandwidth (peak front-side bus bandwidth = 1.6 GB/s). Well, we are from Michigan, so we use SimpleScalar; that way we get the best possible support from Todd's office, which is ten meters away from mine.

Case Studies 3D Geometry Engine and Streaming. The geometry engine is a triangle-based rendering algorithm, as used in Microsoft Direct3D and SGI OpenGL. [Pipeline diagram: a 3D model streams into the geometry engine (Xform and Light stages), whose output passes through the driver into a buffer and on to AGP memory.] A sketch of the resulting write pattern follows.
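The write pattern that makes this workload a good fit for eager writeback can be sketched as follows (a hypothetical transform stage of my own, not the Direct3D/OpenGL code; a column-major 4x4 matrix is assumed): each output vertex is written once and never read again by the CPU, so the dirty lines it leaves behind sit idle until eviction.

    /* Hypothetical streaming transform stage: reads each input vertex once,
       writes each output vertex once. The output lines become dirty but are
       not re-read by the CPU (the graphics accelerator consumes them from
       AGP memory), so deferring their writeback buys nothing. */
    typedef struct { float x, y, z, w; } vertex_t;

    void xform(const float m[16], const vertex_t *in, vertex_t *out, int n) {
        for (int i = 0; i < n; i++) {
            out[i].x = m[0]*in[i].x + m[4]*in[i].y + m[8]*in[i].z  + m[12]*in[i].w;
            out[i].y = m[1]*in[i].x + m[5]*in[i].y + m[9]*in[i].z  + m[13]*in[i].w;
            out[i].z = m[2]*in[i].x + m[6]*in[i].y + m[10]*in[i].z + m[14]*in[i].w;
            out[i].w = m[3]*in[i].x + m[7]*in[i].y + m[11]*in[i].z + m[15]*in[i].w;
        }
    }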

Bandwidth Shifting (Geometry Engine) [Charts: bus bandwidth over execution time, with a 1.6 GB/s peak. Baseline writeback: writeback traffic of about 0.6 GB/s coincides with the demand bursts. Eager writeback: the writeback traffic, about 0.4 GB/s, is shifted into otherwise idle bus slots.]

Load Response Time [Chart: load response time versus Vertex ID over execution time for the baseline writeback and eager writeback; the annotation marks, e.g., the 600,000th load.]

Performance of Geometry Engine "Free writeback" (writebacks that consume no bus bandwidth) represents the performance upper bound.

Bandwidth Filling (Streaming) [Charts: bus bandwidth over execution time, 1.6 GB/s peak, for the baseline writeback and eager writeback; the eager writebacks fill otherwise unused bandwidth.]

Performance of Streaming Benchmark

Conclusions Writebacks compete for bandwidth with demand misses, and the delivery of demand data can be deferred as a result. LRU dirty lines are rarely promoted again. Eager writeback: triggered by dirty lines entering the LRU state; an additional programmable memory type; shifts writeback traffic; effective for content-rich applications, e.g. 3D geometry. It can be extended to improve the context switch penalty and to reduce coherency miss latencies in MP systems (a similar technique: LTP [Lai & Falsafi 00]). Global data and stack data show good life spans, and their working set sizes are rather small compared to the dynamically allocated heap data.

Questions & Answers Bandwidth problem can be cured with money. Latency problems are harder because the speed of light is fixed  you cannot bribe God.  David Clark, MIT

That's all, folks! http://www.eecs.umich.edu/~linear

Backup Foils

Speedup with Traffic Injection Imitating bandwidth stealing by other bus agents; uniform memory traffic injection.

Injected Memory Traffic (0.8 GB/s) [Charts: bus bandwidth over execution time, 1.6 GB/s peak, with traffic injected at 320 bytes every 400 clocks and at 2560 bytes every 3200 clocks; both patterns average the same 0.8 GB/s rate but differ in burstiness.]