Data Prefetching Mechanism by Exploiting Global and Local Access Patterns
Ahmad Sharif (Qualcomm), Hsien-Hsin S. Lee (Georgia Tech)
The 1st JILP Data Prefetching Championship (DPC-1)

Slide 2: Can OOO Tolerate the Entire Memory Latency?
OOO execution can hide some latency, but not all of it; the memory latency gap has grown to 200 to 400 cycles.
Solutions:
– Larger and larger caches (or putting memory on die)
– A deeper ROB: reduces the probability of stalling on right-path instructions
– Multi-threading
– Timely data prefetching
[Figure: timeline of a load miss. After a D-cache miss, the ROB fills with independent instructions; once the ROB is full, the machine stalls with no productivity until the data returns and ROB entries are de-allocated. The remaining stall is the untolerated miss latency. Revised from "A 1st-order superscalar processor model," ISCA-31.]

Slide 3: Performance Limit: L1 vs. L2 Prefetching
Results from Config 1 (32 KB L1 / 2 MB L2 / ~unlimited bandwidth). L1 miss latencies seem to be tolerated by OOO execution, so we decided to perform only L2 prefetching.
– As it turned out (right after the submission deadline), not a bright decision.
[Chart: performance with a perfect L2 vs. a perfect memory hierarchy, skipping the first 40 billion instructions and simulating 100 million.]

Slide 4: Objective and Approach
Prefetch by analyzing cache address patterns at cache-line granularity (addr >> 6 for 64-byte lines). Identify commonly seen patterns in the address deltas:
– 462.libquantum: 1, 1, 1, 1, etc.
– 470.lbm: 2, 1, 2, 1, 2, 1, etc. (in both all accesses and L2 misses)
– 429.mcf: 6, 13, 26, 52, etc. (roughly exponential)
Patterns can be observed from:
– All accesses (regardless of hits or misses)
– L2 misses
Our data prefetcher exploits both, using global as well as local histories; a sketch of the delta extraction follows.
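As an illustration, here is a minimal sketch (ours, not the authors' code) of the delta extraction this slide describes: convert byte addresses to cache-line numbers and take successive differences.

```python
# Minimal sketch of cache-line delta extraction; names are ours.
LINE_SHIFT = 6  # 64-byte cache lines

def line_deltas(byte_addrs):
    """Convert byte addresses to cache-line numbers and return successive deltas."""
    lines = [a >> LINE_SHIFT for a in byte_addrs]
    return [b - a for a, b in zip(lines, lines[1:])]

# 462.libquantum-style streaming: one line at a time -> deltas of 1
print(line_deltas([0x1000, 0x1040, 0x1080, 0x10C0]))            # [1, 1, 1]
# 429.mcf-style roughly exponential stride
print(line_deltas([0x40 * 0, 0x40 * 6, 0x40 * 19, 0x40 * 45]))  # [6, 13, 26]
```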

Slide 5: Our Data Prefetcher Organization
– A g-sized GHB (g = 128) logging all unique accesses, age-ordered.
– m LHBs (m = 32), selected per PC with a 32-bit tag and LRU replacement; each l-sized LHB (l = 24) logs all unique accesses of its PC, age-ordered.
– State-free Pattern Detection Logic plus a k-sized fully associative Request Collapsing Buffer (k = 32).
– Input from the d-cache: virtual address, timestamp (not used), and hit/miss.
Total: ~26,000 bits (82% of the 32 Kbit budget); the rest is dedicated to "temporaries". A structural sketch follows.
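A minimal software sketch of this organization (class and method names are ours; the real design is hardware tables, not Python), using the sizes g = 128, l = 24, m = 32, k = 32 from the slide:

```python
from collections import OrderedDict, deque

G, L, M, K = 128, 24, 32, 32  # sizes from the slide: g, l, m, k

class Prefetcher:
    def __init__(self):
        self.ghb = deque(maxlen=G)       # global history: unique accesses, age-ordered
        self.lhbs = OrderedDict()        # per-PC local histories, LRU over M PCs
        self.collapse = deque(maxlen=K)  # request-collapsing buffer

    def observe(self, pc, vaddr, hit):
        """Log one d-cache access; the timestamp input is unused, per the slide."""
        line = vaddr >> 6
        if line not in self.ghb:           # GHB logs unique accesses only
            self.ghb.append(line)
        lhb = self.lhbs.setdefault(pc, deque(maxlen=L))
        self.lhbs.move_to_end(pc)          # mark this PC most-recently used
        while len(self.lhbs) > M:
            self.lhbs.popitem(last=False)  # evict the LRU PC's buffer
        if line not in lhb:                # LHBs log unique accesses only
            lhb.append(line)
```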

Slide 6: Prefetcher Table Bit Count
– GHB: 128 entries × (26-bit addr + 2-bit info) = 3,584 bits
– LHBs: 32 rows × 24 entries × (26-bit addr + 2-bit info) = 21,504 bits
– LHB tags: 32 × 32-bit PC = 1,024 bits
– Request collapsing buffer: 32 × 26-bit frame addresses = 832 bits
Total: 26,944 bits. The rest of the budget is left for temporary variables (e.g., the binned output pattern), which are not strictly needed.
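The tally checks out; reconstructing the total from the entry counts above:

```latex
\begin{aligned}
\text{GHB: }               & 128 \times (26+2)          &&= 3{,}584 \text{ bits}\\
\text{LHB entries: }       & 32 \times 24 \times (26+2) &&= 21{,}504 \text{ bits}\\
\text{LHB PC tags: }       & 32 \times 32               &&= 1{,}024 \text{ bits}\\
\text{Collapsing buffer: } & 32 \times 26               &&= 832 \text{ bits}\\
\text{Total: }             &                            &&= 26{,}944 \text{ bits} \approx 82\% \text{ of } 32{,}768
\end{aligned}
```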

Slide 7: Pattern Detection Logic
Whenever a unique access is added:
– Bin accesses according to their 64 KB region
– Detect a pattern from the address deltas (sorry, it is brute-force), by finding the "maximum reverse prefix match" (generic patterns) or an exponential rise in the deltas (exponential patterns)
– Check the request collapsing buffer
– Issue prefetches 4 deltas ahead for generic patterns, or 2 ahead for exponential patterns
We currently assume complex combinational logic, which may require binning, a sorting network, and match logic for the generic and exponential patterns. A software sketch of the detection step follows.
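A minimal software sketch of the detection step. The slide does not define "maximum reverse prefix match" formally, so this encodes one plausible reading (an assumption on our part): find the longest recent run of deltas that also occurs earlier in the history, and replay what followed it.

```python
DEGREE_GENERIC = 4      # issue prefetches 4 deltas ahead (generic)
DEGREE_EXPONENTIAL = 2  # issue prefetches 2 deltas ahead (exponential)

def is_exponential(deltas):
    """True if the last few deltas roughly double each time, as in 429.mcf."""
    recent = deltas[-3:]
    return len(recent) == 3 and all(
        b != 0 and 2 * abs(a) <= abs(b) <= 2 * abs(a) + 1
        for a, b in zip(recent, recent[1:]))

def predict_deltas(deltas):
    """Return the next deltas to prefetch, or [] if no pattern is found."""
    if is_exponential(deltas):
        return [2 * deltas[-1], 4 * deltas[-1]][:DEGREE_EXPONENTIAL]
    # Generic: longest suffix of the delta history that recurs earlier.
    for n in range(len(deltas) - 1, 0, -1):           # longest match first
        suffix = deltas[-n:]
        for i in range(len(deltas) - n - 1, -1, -1):  # most recent occurrence first
            if deltas[i:i + n] == suffix:
                follow = deltas[i + n:i + n + DEGREE_GENERIC]
                if follow:
                    return follow
    return []

print(predict_deltas([2, 1, 2, 1, 2]))  # 470.lbm-style -> [1, 2]
print(predict_deltas([6, 13, 26]))      # exponential   -> [52, 104]
```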

Slide 8: Example 1: Basic Stride
A common access pattern in streaming benchmarks; detected either PC-independently (GHB) or per PC (LHB).
[Figure: accesses laid out from low to high memory addresses; the history buffer feeds the pattern detection logic, which triggers prefetches. Accesses in the same bin are analyzed together; a different memory region falls into a different bin.]

Slide 9: Example 2: Exponential Stride
An exponentially increasing stride:
– Seen in 429.mcf
– Traversing a tree laid out as an array
[Figure: accesses at positions 1, 2, 4, 8 from low to high memory addresses; the history buffer feeds the pattern detection logic, which triggers prefetches.]
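A sketch of why a tree laid out as an array yields exponential strides (an illustration of the layout, not 429.mcf's actual code): descending a heap-style implicit tree touches indices 1, 2, 4, 8, ..., so the deltas double.

```python
def root_to_leaf_indices(depth, go_left=True):
    """Indices touched when descending a 1-based implicit binary tree."""
    idx, path = 1, []
    for _ in range(depth):
        path.append(idx)
        idx = 2 * idx if go_left else 2 * idx + 1
    return path

indices = root_to_leaf_indices(5)
print(indices)                                        # [1, 2, 4, 8, 16]
print([b - a for a, b in zip(indices, indices[1:])])  # deltas [1, 2, 4, 8]
```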

Slide 10: Example 3: Pattern in L2 Misses
A stride pattern in the L2 miss stream (a sketch follows):
– Deltas of (1, 2, 3, 4, 1, 2, 3, 4, ...)
– Issue prefetches for deltas 1, 2, 3, 4
– Observed in 403.gcc when accessing the members of an array of structures (AoS):
– Cold start
– The members are separated out across cache lines
– The footprint is too large for the AoS members to fit in the cache
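A hedged illustration (the struct size and member offsets below are hypothetical, not 403.gcc's actual layout) of how AoS member accesses produce a repeating delta pattern in the miss stream:

```python
STRUCT_LINES = 10              # hypothetical struct size: 10 cache lines
MEMBER_OFFSETS = [0, 1, 3, 6]  # hypothetical member positions (in lines)

def aos_member_lines(n_elems):
    """Cache-line numbers touched when visiting each member of each element."""
    return [e * STRUCT_LINES + off
            for e in range(n_elems) for off in MEMBER_OFFSETS]

lines = aos_member_lines(3)
print([b - a for a, b in zip(lines, lines[1:])])
# -> [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3]: the repeating pattern this slide describes
```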

Slide 11: Example 4: Out-of-Order Patterns
Accesses can appear out of order:
– (0, 1, 3, 2, 6, 5, 4), with deltas (1, 2, -1, 4, -1, -1)
– Once ordered as (0, 1, 2, 3, 4, 5, 6), issue prefetches for stride 1
– Happens because the processor issues memory instructions out of order
– Would not need handling if the prefetcher saw memory addresses resolved in program order
This can be found in any program, since it is an artifact of OOO execution; a sketch of the normalization follows.
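A minimal sketch of the normalization (the slide implies a hardware sorting network; this is the software analogue): sort a bin's accesses by address before taking deltas, so the underlying stride reappears.

```python
def ordered_deltas(addrs):
    """Sort a bin's accesses by address, then return successive deltas."""
    s = sorted(addrs)
    return [b - a for a, b in zip(s, s[1:])]

scrambled = [0, 1, 3, 2, 6, 5, 4]  # as observed by the prefetcher
print([b - a for a, b in zip(scrambled, scrambled[1:])])  # [1, 2, -1, 4, -1, -1]
print(ordered_deltas(scrambled))                          # [1, 1, 1, 1, 1, 1] -> stride 1
```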

Slide 12: Simulation Infrastructure
Provided by DPC-1:
– 15-stage, 4-issue OOO processor with no front-end hazards
– 128-entry ROB (can potentially fill up in 32 cycles at 4 instructions per cycle)
– L1 is 32:64:8 (KB : line bytes : ways) with the infrastructure's default 1-cycle hit latency
– L2 is 2048:64:16 with a 20-cycle latency
– DRAM latency = 200 cycles
Configurations 2 and 3 have fairly limited bandwidth.

Slide 13: Performance Improvement

          L1      L2      L2 BW      Mem BW
Config 1  32 KB   2 MB    1000 apc   (~unlimited; see Slide 3)
Config 2  32 KB   2 MB    1 apc      0.1 apc
Config 3  32 KB   512 KB  1 apc      0.1 apc

(apc = accesses per cycle)
Performance speedup (geomean) = 1.21x

Slide 14: LLC Miss Reduction
Average L2 miss reduction: 64.88%. The reduction does not directly correlate with performance improvement, though.
[Chart annotations: the L2 queue fills up for Configs 2 and 3; some benchmarks do not show many patterns; others are streaming with regular patterns.]

Slide 15: Wish List for a Journal Version
– Make it more hardware-friendly (a logic freak, or more tables needed?)
– Prefetch promotion into the L1 cache (our ouch)
– A better algorithm for higher LHB utilization
– An improved scoring system for accuracy
– Closed-loop feedback

Slide 16: Conclusion
The GHB combined with LHBs shows:
– A "big picture" of a program's memory access behavior
– Program history repeats itself
– The address sequence of data accesses is not random; delta patterns are often analyzable
We achieve a 1.21x geomean speedup. LLC miss reduction doesn't directly translate into performance:
– Need to prefetch far enough in advance

Slide 17: THAT'S ALL, FOLKS! ENJOY HPCA-15.
Georgia Tech ECE MARS Labs