Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching
By Josefin Hallberg, Tuva Palm and Mats Brorsson
Presented by: Lena Salman

Introduction
- Pointer-based data structures are usually allocated at effectively random locations in memory
- This usually yields poor locality and higher miss rates
- The paper compares a software approach against a hardware approach

Software approach
Two techniques:
- Cache-conscious allocation: by far the most efficient
- Software prefetch: better suited for automation and for implementation in compilers
Combining cache-conscious allocation with software prefetch does not add significantly to performance.

Hardware approach
- Calculating and prefetching pointers
- Calculating pointer dependencies
- The effect of effectively predicting what to evict from the cache
- General HW prefetch: more likely to pollute the cache
Problem: all the hardware strategies take advantage of the increased locality of cache-consciously allocated data.

Prefetching and cache-conscious allocation
- The two should complement each other's weaknesses
- Cache-conscious allocation reduces the prefetch overhead of fetching blocks with partially unwanted data
- Prefetching should reduce the cache misses and miss latencies between the nodes

Cache-conscious allocation
- Excellent improvement in execution time
- Can be adapted to specific needs by choosing the cache-conscious block size (cc-block size)
- Attempts to co-allocate data in the same cache line, so that nodes referenced one after another lie on the same line

Allocation to improve locality

Cache-conscious allocation
- Attempts to allocate related data in the same cache line
- Better locality can be achieved
- Cache performance improves through a reduction in misses

ccmalloc()
- Performs the cache-conscious allocation of memory
- Takes an extra argument: a pointer to a data structure that is likely to be referenced together with the new allocation

    #ifdef CCMALLOC
        child = ccmalloc(sizeof(struct node), parent);
    #else
        child = malloc(sizeof(struct node));
    #endif

ccmalloc()
- Takes a pointer to data that is likely to be referenced close (in time) to the newly allocated structure
- Invokes calls to the standard malloc() when allocating a new cc-block, or when the data is larger than a cc-block
- Otherwise: allocates in an empty slot of the cc-block (see the sketch below)
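
A minimal sketch of this allocation policy, for illustration only. It simplifies the real allocator in one loud way: it tracks only the most recently started cc-block instead of locating the cc-block that actually holds the parent pointer.

    #include <stdlib.h>

    #define CC_BLOCK_SIZE 256  /* cc-block size used in the study */

    /* One cc-block: a chunk obtained from the standard malloc(),
       filled slot by slot. */
    struct cc_block {
        char  *base;
        size_t used;
    };

    /* Simplification: only the most recent cc-block is tracked; the
       real allocator would try to co-allocate near the given parent. */
    static struct cc_block current = { NULL, 0 };

    void *ccmalloc(size_t size, void *parent)
    {
        (void)parent;                              /* see simplification above */
        if (size > CC_BLOCK_SIZE)                  /* too large to co-allocate */
            return malloc(size);
        if (current.base == NULL || current.used + size > CC_BLOCK_SIZE) {
            current.base = malloc(CC_BLOCK_SIZE);  /* new cc-block via malloc() */
            current.used = 0;
        }
        void *slot = current.base + current.used;  /* empty slot in the cc-block */
        current.used += size;
        return slot;
    }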

cc-blocks = cache-conscious blocks
- Require cache lines large enough to contain more than one pointer structure
- The bigger the blocks, the lower the miss rate, provided allocation is smart
- The cc-block size can be set dynamically in software, independently of the HW cache line size
- In this study: cc-block size 256 B; hardware cache line sizes 16 B to 256 B

Prefetch
- Prefetching reduces the cost of a cache miss
- Can be controlled by software and/or hardware
- Software prefetch costs extra instructions
- Hardware prefetch adds hardware complexity

Software-controlled prefetch
- Implemented by including a prefetch instruction in the instruction set
- Prefetches should be inserted well ahead of the reference, according to the prefetch algorithm
- This study uses the greedy algorithm by Mowry et al.

Software prefetch: the greedy algorithm
- When a node is referenced, all children of that node are prefetched
- Without extra calculation this can only be done for children, not grandchildren
- Easier to control and optimize
- The risk of polluting the cache decreases, since only needed lines are prefetched

Software greedy prefetch
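
As a stand-in for the slide's figure, a minimal sketch of greedy prefetching on a binary-tree walk in the spirit of the treeadd benchmark. The GCC/Clang intrinsic __builtin_prefetch models the simulator's prefetch instruction; prefetching a NULL child is harmless, since prefetch hints do not fault.

    struct node { int value; struct node *left, *right; };

    int tree_sum(struct node *n)
    {
        if (n == NULL)
            return 0;
        __builtin_prefetch(n->left);   /* children only: grandchildren would */
        __builtin_prefetch(n->right);  /* require extra pointer chasing      */
        return n->value + tree_sum(n->left) + tree_sum(n->right);
    }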

Hardware-controlled prefetch
Depending on the algorithm used, prefetching can occur:
- when a miss is caused,
- when a hint is given by the programmer through an instruction, or
- always, on certain types of data

Hardware prefetch
Techniques used:
- Prefetch-on-miss
- Tagged prefetch
Both attempt to utilize spatial locality and do NOT analyze data access patterns.

Prefetch-on-miss
- Prefetches the next sequential line i+1 when detecting a miss on line i (diagram: miss on line i, line i+1 will be prefetched; see the sketch below)
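
A toy model of the controller logic, assuming hypothetical cache-model hooks cache_contains() and fetch_line() that stand in for the real hardware:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 64  /* illustrative; the study varies 16 B to 256 B */

    /* Hypothetical cache-model hooks, for illustration only. */
    bool cache_contains(uintptr_t line_addr);
    void fetch_line(uintptr_t line_addr);

    void access_line(uintptr_t addr)
    {
        uintptr_t line = addr & ~(uintptr_t)(LINE_SIZE - 1);
        if (!cache_contains(line)) {       /* miss on line i       */
            fetch_line(line);              /* demand-fetch line i  */
            fetch_line(line + LINE_SIZE);  /* prefetch line i+1    */
        }
    }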

Tagged prefetch
- Each prefetched line is marked with a tag bit
- When a tagged (prefetched) line i is referenced, line i+1 is prefetched, even though no miss has occurred (see the sketch below)
- Has been shown efficient when memory access is fairly sequential
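
A sketch extending the previous model: each cached line carries a tag bit, set when the line arrived via prefetch. The lookup hooks are again hypothetical.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 64  /* illustrative */

    typedef struct { bool tag; } line_state;
    line_state *cache_lookup(uintptr_t line_addr);  /* NULL on miss        */
    void fetch_line(uintptr_t line_addr);
    void prefetch_line(uintptr_t line_addr);        /* fetches, sets tag   */

    void access_tagged(uintptr_t addr)
    {
        uintptr_t line = addr & ~(uintptr_t)(LINE_SIZE - 1);
        line_state *l = cache_lookup(line);
        if (l == NULL) {                     /* miss: as in prefetch-on-miss */
            fetch_line(line);
            prefetch_line(line + LINE_SIZE);
        } else if (l->tag) {                 /* first touch of a prefetched  */
            l->tag = false;                  /* line: prefetch i+1 although  */
            prefetch_line(line + LINE_SIZE); /* no miss has occurred         */
        }
    }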

Prefetch-on-miss for ccmalloc()
HW prefetch can be combined with ccmalloc() by introducing a hint carrying the address of the beginning of such a cc-block.

Prefetch-one-cc on miss
Prefetch the next line after detecting a cache miss on a cache-consciously allocated block.

Prefetch-all-cc on miss
- Decides dynamically how many lines to prefetch, depending on where in the cc-block the missing cache line is located
- Prefetches all the remaining cache lines of the cc-block, starting from the address causing the miss (see the sketch below)
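
A sketch of the dynamic line count, with the study's 256 B cc-blocks and 16 B lines: a miss on the first line of a cc-block prefetches the 15 remaining lines, a miss on the last line prefetches none. The prefetch hook and the alignment assumption are illustrative, not from the paper.

    #include <stdint.h>

    #define CC_BLOCK_SIZE 256  /* cc-block size from the study      */
    #define LINE_SIZE      16  /* smallest line size from the study */

    void prefetch_line(uintptr_t line_addr);  /* hypothetical hook */

    /* Assumes cc-blocks are aligned to their own size. */
    void prefetch_all_cc(uintptr_t miss_addr)
    {
        uintptr_t line   = miss_addr & ~(uintptr_t)(LINE_SIZE - 1);
        uintptr_t cc_end = (miss_addr & ~(uintptr_t)(CC_BLOCK_SIZE - 1))
                           + CC_BLOCK_SIZE;
        /* The missing line itself is the demand fetch; prefetch the rest. */
        for (uintptr_t a = line + LINE_SIZE; a < cc_end; a += LINE_SIZE)
            prefetch_line(a);
    }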

Experimental framework
- MIPS-like, out-of-order processor simulator
- Memory latency equal to a 50 ns random access time
Benchmarks:
- health: simulates the Colombian health-care system
- mst: creates a graph and calculates its minimum spanning tree
- perimeter: calculates the perimeter of an image
- treeadd: calculates a recursive sum of values

More about the benchmarks:
- health: elements are moved between lists during execution, and there is more computation between data accesses
- mst: originally used a locality-optimizing allocation procedure, which made the effect of ccmalloc() unnoticeable
- perimeter: data is allocated in an order similar to the access order, which in itself optimizes locality
- treeadd: computation between nodes of a balanced binary tree

Results: Execution time

Stall classification:
- Memory stall: a cycle in which the oldest instruction waiting to retire is a load/store instruction
- FU stall: the oldest instruction waiting to retire is not a load/store instruction
- Fetch stall: there is no instruction waiting to retire
Prefetching is likely to help when memory stalls are dominant.

Graphs:

Cache performance: SW
- Miss rates are improved by most strategies
- The increased spatial locality from ccmalloc() reduces cache misses (less pollution)
- Software prefetch shows some decrease in misses, but prefetches a lot of unused data
- The combination of software techniques achieves the lowest miss rates

Cache performance: cache lines
- The larger the cache lines, the more effective ccmalloc() becomes
- HW prefetch alone, however, tends to pollute the cache with unwanted data
- SW prefetch alone tends to fetch data already present in the cache

Cache performance:
- SW prefetch achieves higher precision
- HW prefetch alone performs poorly
- HW prefetch is more sensitive to the cache line size than SW prefetch

Cache performance: SW prefetch with ccmalloc()
- Increases the fraction of prefetched cache lines that are actually used
- This is caused by the increased spatial locality
- However, it also results in attempts to prefetch lines already in the cache

Cache performance: HW prefetch with ccmalloc()
- HW prefetch improves more with cache-conscious allocation than on its own
- Prefetch-on-miss and tagged prefetch both show the same results
- Still, a large number of prefetched lines go unused
- Unused lines decrease with larger cache lines, due to spatial locality and less need to prefetch

Conclusions:
- The best single technique remains cache-conscious allocation, ccmalloc()
- It is effective in overcoming the drawbacks of large cache lines
- It creates the locality necessary for prefetching
- The larger the cache line, the less prominent the prefetch strategy

Conclusions 2:
- Cache-conscious allocation with HW prefetch adds little; ccmalloc() alone seems to be enough
- However, ccmalloc() can be used to overcome the negative effects of next-line prefetch
- HW prefetch is better than SW prefetch

Conclusions 3:
- When a compiler can use profiling information and optimize memory allocation in a cache-conscious manner, that is preferable
- However, when profiling is too expensive, applications will likely benefit from general prefetch support

The endddd!!! You can tell me, I can take it... What's up doc???

Lena Salman