Mini-Project Presentation: Prefetching TDT4260 Computer Architecture


Stefano Nichele, Angelo Spalluto
Department of Computer and Information Science
April 15th, 2011

Agenda
- Moore's law and the memory wall
- Related work
- Fixed Sequential Prefetching
- Sequential Aggressive Prefetching (M-Adaptive, DM-Adaptive)
- DCPT, DCPT-P
- WA-DCPT and SA-DCPT
- Results
- Conclusion
- References

Moore vs. the Memory Wall
(Figure slide: processor vs. memory performance trends; spatial locality and temporal locality.)

Prefetching
Predicting and fetching:
1. Which data will be needed by the next instructions?
2. Deliver it into the cache before it is referenced!
Prefetcher families: Sequential, RPT, PC/DC, DCPT, Adaptive.

Fixed Sequential Prefetching
Sequential algorithm:
- The prefetcher issues N requests after a miss occurs.
- The window size N is constant for the whole execution of the program.
Sequential benchmarks: wupwise, applu, galgel.
Non-sequential benchmarks: ammp, art110, art470.
(Chart: speedup per benchmark for each fixed window size.)
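The fixed scheme above can be condensed into a few lines. This is only a sketch: the block size and the degree N below are illustrative values, not parameters taken from the experiments.

```python
# Minimal sketch of a fixed sequential prefetcher: on every cache miss
# it issues the next N contiguous block addresses. BLOCK_SIZE and N are
# assumed values for illustration only.
BLOCK_SIZE = 64  # bytes (assumption)
N = 4            # fixed prefetch degree, constant for the whole run

def on_miss(miss_addr):
    """Return the block addresses to prefetch after a miss."""
    block = miss_addr // BLOCK_SIZE
    return [(block + i) * BLOCK_SIZE for i in range(1, N + 1)]
```

On a non-sequential access pattern most of these N requests are wasted, which matches the poor results on ammp, art110 and art470.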

Sequential Aggressive Adaptive Prefetcher
The adaptive prefetcher dynamically adjusts the degree of prefetching (N).
Adaptive window parameters:
- Window: number of contiguous blocks (N) issued by the prefetcher
- Accuracy: number of good prefetches within a window
- Threshold: number of good prefetches needed to increase the window (Accuracy >= Threshold)
- Lock window: number of times (L) the window size is held fixed
- Listening state: the prefetcher counts the number of good prefetches
Prefetcher algorithm:
1. The prefetcher initialises Window, Threshold and Lock Window.
2. Upon a request issued by the CPU, the prefetcher issues N prefetches.
3. It waits for N accesses (listening state).
4. At step N it checks whether Accuracy >= Threshold.
5. If the condition is satisfied, it keeps the same window for another L-1 rounds; otherwise it decreases the window, issues N requests, and goes back to step 3.
6. If step 4 succeeds L times, the prefetcher increases the window, issues another N requests, and goes back to step 3.
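The steps above can be sketched as a small state machine. This is a sketch under stated assumptions: the slides say only that the window grows or shrinks, so the +1/-1 update steps and the minimum window of 1 are choices made here for illustration.

```python
# Hedged sketch of the adaptive-window logic: after each window of N
# prefetched blocks has been checked (the "listening state"), the
# accuracy is compared against the threshold, and the window grows only
# after L consecutive successful windows.
class AdaptiveWindow:
    def __init__(self, window=4, threshold=2, lock=3):
        self.window = window        # current prefetch degree N
        self.threshold = threshold  # good prefetches needed to keep the window
        self.lock = lock            # successful windows needed to grow (L)
        self.successes = 0          # consecutive windows meeting the threshold

    def end_of_window(self, good_prefetches):
        """Update the window after N blocks were checked; return the
        window size to use for the next round of prefetches."""
        if good_prefetches >= self.threshold:
            self.successes += 1
            if self.successes >= self.lock:        # succeeded L times in a row
                self.window += 1                   # grow the window (assumed +1 step)
                self.successes = 0
        else:
            self.window = max(1, self.window - 1)  # shrink on low accuracy (assumed -1 step)
            self.successes = 0
        return self.window
```

With threshold 2 and lock 3, three accurate windows in a row grow the degree from 4 to 5, while a single inaccurate window shrinks it again.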

Example Seq. Aggressive Adaptive Prefetcher

Different listening states
Sequential Aggressive: prefetching occurs immediately after the last element of the window has been checked (whether it is a miss or a hit). Each window consists of P elements = #hits + #misses.
Miss-Adaptive (M-Adaptive): issues a prefetch (restarting a new window) only when the first miss occurs after the whole window has been checked; hits do not trigger prefetching.
Discard Miss-Adaptive (DM-Adaptive): issues a prefetch immediately after the first miss occurs inside the window. Each window consists of P elements = #hits.

DCPT and DCPT-P
DCPT-P refinements over DCPT:
- No repeat of the last prefetched blocks.
- Test if the block is in the cache before prefetching.
- The block may already be in the prefetch queue.
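The "test if in cache before prefetching" idea can be sketched as a simple filter over prefetch candidates. `cache` and `mshr_queue` here are hypothetical stand-ins for the simulator's real cache and in-flight-request lookups, not names from the original implementation.

```python
# Sketch of the DCPT-P filtering step: before issuing prefetches, drop
# candidate addresses that are already cached or already queued, so the
# prefetcher does not waste bandwidth on useless requests.
def filter_prefetches(candidates, cache, mshr_queue):
    """Keep only candidates not yet in the cache or the request queue."""
    return [a for a in candidates if a not in cache and a not in mshr_queue]
```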

Aggressive Adaptive - DCPT
"Stefano, Aggressive Adaptive works pretty well with sequential benchmarks. What about DCPT?"
"Great! DCPT works very well with non-sequential benchmarks. Let's try to combine them together!"
"Ja ja, we may achieve better results!"
Aggressive Adaptive + DCPT = Aggressive Adaptive-DCPT: SA-DCPT and WA-DCPT.

WA-DCPT and SA-DCPT
WA-DCPT:
- Adds the concept of a window to DCPT.
- When DCPT issues a prefetch for a specific PC, it also delivers all subsequent blocks according to its window size.
- More memory-demanding than DCPT: it uses a larger data structure.
SA-DCPT:
- At runtime it selects the better algorithm between DCPT and Aggressive Sequential.
- The switch threshold is the major tuning concern; the best switch threshold is 4.
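One way SA-DCPT's runtime switching could look, as a hedged sketch: a sequentiality score grows on consecutive misses to adjacent blocks and selects the aggressive sequential prefetcher once it reaches the switch threshold of 4 stated above. The scoring rule itself is an assumption made here; the slides do not describe how sequentiality is measured.

```python
# Illustrative SA-DCPT switch: count consecutive misses to adjacent
# blocks as evidence of sequential behaviour, and flip between the two
# component prefetchers around the switch threshold.
class SADCPTSwitch:
    def __init__(self, threshold=4):
        self.threshold = threshold
        self.score = 0
        self.last_block = None

    def observe_miss(self, block):
        # Adjacent-block miss -> more sequential; anything else resets.
        if self.last_block is not None and block == self.last_block + 1:
            self.score += 1
        else:
            self.score = 0
        self.last_block = block

    def active(self):
        """Which component prefetcher should handle the next miss."""
        return "sequential" if self.score >= self.threshold else "dcpt"
```

A run of adjacent misses drives the switch to the sequential side; a single non-adjacent miss drops it back to DCPT.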

Adaptive results
Aggressive Adaptive:
- In some benchmarks (galgel, applu, wupwise) the window even reaches sizes between 13 and 15.
- Using a window larger than 12 does not improve performance.
- Low sequentiality for ammp, art110 and art470.
M-Adaptive and DM-Adaptive:
- Their results are not better than Aggressive Adaptive.
- As expected, they produce fewer misses and fewer issued prefetches.

DCPT results
DCPT and DCPT-P:
- As expected, DCPT-P is slightly better than DCPT.
- For ammp, DCPT-P achieves almost twice the speedup of the adaptive prefetcher.
- A table of 16 deltas and 97 PCs is the best configuration (smaller than 8 KB).
- DCPT-P uses an 8-bit mask; in our tests a 12-bit mask brought no improvement.

Adaptive DCPT results
WA-DCPT:
- Uses a different data structure from DCPT (window data).
- Best results are achieved using 14 deltas.
SA-DCPT:
- Uses the same data structure as DCPT.
- Tuned on the switching threshold: the best switching factor is 4.
- SA-DCPT behaves like plain DCPT for switching factors greater than 4.

Developed and literature prefetchers
Developed prefetchers:
- DCPT obtains the best performance.
- SA-DCPT is a good compromise when the type of benchmark is unknown.
Literature vs. developed:
- Our DCPT-P implementation outperforms the reference DCPT-P, likely because the data structures differ.

Coverage Analysis
Coverage:
- Benchmarks with low sequentiality (ammp, art110 and art470) have higher coverage with DCPT-P.
- Benchmarks with high sequentiality (except applu) have better coverage with SA-DCPT.
Coverage vs. speedup:
- Coverage is not directly proportional to speedup: if the algorithm spends too much time discovering the next element to prefetch, its execution time may increase.

Conclusion
- Prefetching matters: it can really improve performance.
- Contribution, three new prefetcher variants:
  - adaptive window (aggressive technique);
  - DCPT-based with bit masking;
  - a combination of delta correlation with an adaptive window.
- Parameter tuning is important.
- DCPT-P has the best overall performance.
- It is difficult to combine two different (opposite) algorithms so as to exploit the best properties of each.

References
- G. E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, vol. 38, no. 8, April 19, 1965.
- W. A. Wulf and S. A. McKee, "Hitting the Memory Wall: Implications of the Obvious," Computer Architecture News, vol. 23, no. 1, Mar. 1995, pp. 20-24.
- A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik, and O. O. Storaasli, "State-of-the-Art in Heterogeneous Computing," Scientific Programming, vol. 18, Jan. 2010, pp. 1-33.
- M. Jahre, "Managing Shared Resources in Chip Multiprocessor Memory Systems," doctoral thesis, NTNU, 2010 (ISBN 978-82-471-2287-7), Doktoravhandlinger ved NTNU (159).
- M. Grannaes, "Reducing Memory Latency by Improving Resource Utilization," doctoral thesis, NTNU, 2010 (ISBN 978-82-471-2177-8), Doktoravhandlinger ved NTNU (106).
- A. J. Smith, "Cache Memories," ACM Computing Surveys, vol. 14, no. 3, 1982, pp. 473-530.
- F. Dahlgren, M. Dubois, and P. Stenstrom, "Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors," Proc. ICPP 1993, vol. 1, pp. 56-63, Aug. 1993.
- M. Grannaes, M. Jahre, and L. Natvig, "Multi-level Hardware Prefetching Using Low Complexity Delta Correlating Prediction Tables with Partial Matching," High Performance Embedded Architectures and Compilers, LNCS vol. 5952, 2010, pp. 247-261.
- M. Grannaes, M. Jahre, and L. Natvig, "Storage Efficient Hardware Prefetching Using Delta Correlating Prediction Tables," Data Prefetching Championship, 2009.

Questions?