Software Data Prefetching
Mohammad Al-Shurman & Amit Seth
Instructor: Dr. Aleksandar Milenkovic
Advanced Computer Architecture, CPE631

Introduction
- Processor-Memory Gap: memory speed is the bottleneck in the computer system
- At least 20% of stalls are D-cache stalls (Alpha)
- A cache miss is expensive
- Reduce cache misses by ensuring the data is already in L1. How?!

Data Prefetching
- First appeared with multimedia applications, via the MMX technology and SSE processor extensions
- Cache memory is designed for data with high temporal & spatial locality
- Multimedia data has high spatial locality but low temporal locality

Data Prefetching (cont'd)
- Idea: bring data closer to the processor before it is actually needed
- Advantages:
  - No extra hardware is needed (implemented in software)
  - Mitigates the memory latency problem
- Disadvantages:
  - Increases code size

Example

//Before prefetching
for (i = 0; i < N; i++) {
    sum += A[i];
}

//After prefetching
for (i = 0; i < N; i++) {
    _mm_prefetch((const char *)&A[i + 1], _MM_HINT_NTA);
    sum += A[i];
}
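
One prefetch brings in a whole cache line, so stepping the prefetch address by a single element, as above, mostly touches lines that are already on their way in. Below is a minimal compilable sketch that steps by one full line instead; the function name, the 32-byte L1 line size, and the float element type are our assumptions, not from the slides:

#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */

#define LINE  32                      /* assumed L1 line size in bytes */
#define AHEAD (LINE / sizeof(float))  /* elements covered by one line  */

float sum_with_prefetch(const float *A, int N)
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < N; i++) {
        /* Prefetch one full cache line ahead of the current element. */
        _mm_prefetch((const char *)&A[i + AHEAD], _MM_HINT_NTA);
        sum += A[i];
    }
    return sum;
}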

Properties
- The prefetch instruction loads one cache line from main memory into the cache
- The processor must continue execution while the prefetch is in progress
- The cache must support hits while a prefetch is outstanding
- Decreases the miss ratio
- The prefetch is ignored if the data is already in the cache

Prefetching Instructions
- Temporal instructions:
  - prefetcht0: fetch data into all cache levels, that is, L1 and L2 on Pentium III processors
  - prefetcht1: fetch data into all cache levels except the 0th level, that is, L2 only on Pentium III processors
  - prefetcht2: fetch data into all cache levels except the 0th and 1st levels, that is, L2 only on Pentium III processors
- Non-temporal instruction:
  - prefetchnta: fetch data into the location closest to the processor while minimizing cache pollution; on the Pentium III processor, this is the L1 cache
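
From C, each of these instructions is reached through the _mm_prefetch intrinsic in xmmintrin.h, with one hint constant per instruction. A short illustration (the pointer p is just a hypothetical address to prefetch):

#include <xmmintrin.h>

void prefetch_variants(const char *p)
{
    _mm_prefetch(p, _MM_HINT_T0);   /* prefetcht0:  all cache levels          */
    _mm_prefetch(p, _MM_HINT_T1);   /* prefetcht1:  skip the 0th level        */
    _mm_prefetch(p, _MM_HINT_T2);   /* prefetcht2:  skip the 0th & 1st levels */
    _mm_prefetch(p, _MM_HINT_NTA);  /* prefetchnta: closest level, non-temporal */
}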

Prefetching Guidelines
- Prefetch scheduling distance: how far ahead is the next data to prefetch?
- Minimize the number of prefetches: optimize execution time!
- Mix prefetch with computation instructions: minimize code size and cache stalls
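
A common rule of thumb for the scheduling distance is to divide the miss latency by the work per iteration, so the line arrives just in time. A sketch of that arithmetic; the function name and the example cycle counts are ours, for illustration only:

/* Prefetch scheduling distance: how many iterations ahead to prefetch
   so the memory latency is fully hidden (rounded up). */
int psd(int mem_latency_cycles, int cycles_per_iter)
{
    return (mem_latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}

/* Example: a 100-cycle miss latency and 10 cycles of work per iteration
   give psd(100, 10) == 10, i.e. prefetch A[i + 10] while summing A[i]. */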

Important notice
- Prefetching can be harmful if the loop is small
- Combined with loop unrolling, it may improve the application's execution time
- It cannot cause an exception: if we prefetch beyond the array bounds, the call is simply ignored

Support
- Check whether the processor supports the SSE extension (using the CPUID instruction):

mov  eax, 1          ; request feature flags
cpuid                ; CPUID instruction
test edx, 02000000h  ; bit 25 in the feature flags = SSE
jnz  Found           ; bit set: SSE is supported

- We used the Intel compiler in our simulation:
  - Built-in macro for prefetching
  - Supports loop unrolling
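
The same test can be written in C; a sketch using the cpuid.h helper shipped with GCC and Clang (the wrapper name has_sse is ours; bit 25 of EDX is the architectural SSE flag):

#include <cpuid.h>   /* __get_cpuid */

/* Returns nonzero if CPUID leaf 1 reports SSE (EDX bit 25). */
int has_sse(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;              /* CPUID leaf 1 not supported */
    return (edx >> 25) & 1;    /* bit 25 = SSE feature flag  */
}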

Loop Unrolling
- Idea: test the performance of code that combines data prefetch with loop unrolling
- Advantages:
  - Unrolling reduces the branch overhead, since it eliminates branches
  - Unrolling allows you to aggressively schedule the loop to hide latencies
- Disadvantages:
  - Excessive unrolling, or unrolling of very large loops, can lead to increased code size

Implementation of Loop Unrolling

//Prefetch without unrolling
for (i = 0; i < N; i++) {
    _mm_prefetch((const char *)&A[i + 1], _MM_HINT_NTA);
    sum += A[i];
}

//Prefetch with unrolling
#pragma unroll (1)
for (i = 0; i < N; i++) {
    _mm_prefetch((const char *)&A[i + 1], _MM_HINT_NTA);
    sum += A[i];
}
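
Unrolling by hand shows why the combination pays off: one prefetch can cover a whole group of iterations instead of being issued per element. A sketch assuming N is a multiple of 4 (the function name and the unroll factor are ours):

#include <xmmintrin.h>

float sum_unrolled(const float *A, int N)   /* assumes N % 4 == 0 */
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < N; i += 4) {
        /* One prefetch per group of four elements, not one per element. */
        _mm_prefetch((const char *)&A[i + 8], _MM_HINT_NTA);
        sum += A[i];
        sum += A[i + 1];
        sum += A[i + 2];
        sum += A[i + 3];
    }
    return sum;
}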

Simulation
- We simulated a simple addition loop:

for (i = 0; i < size; i++) {
    prefetch(depth);   // prefetch the element `depth` iterations ahead
    sum += A[i];
}

- We studied the effects of:
  - Data size
  - Prefetch depth
  - The combination of loop unrolling and prefetching
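
Spelled out with the SSE intrinsic, the simulated loop with an explicit depth parameter might look as follows (the wrapper name simulate is ours):

#include <xmmintrin.h>

float simulate(const float *A, int size, int depth)
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < size; i++) {
        /* depth is the prefetch distance in elements */
        _mm_prefetch((const char *)&A[i + depth], _MM_HINT_NTA);
        sum += A[i];
    }
    return sum;
}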

Simulation (cont'd)
- Intel VTune performance analyzer
- Event-based sampling of:
  - CPI
  - L1 miss rate
  - Clock ticks

Size vs. CPI

Size vs. L1 miss ratio

Size vs. clock ticks

Depth vs. CPI for prefetching with unrolling

Depth vs. L1 miss ratio for prefetching with unrolling

Depth vs. clock ticks for prefetching with loop unrolling

Depth vs. CPI for prefetching without loop unrolling

Depth vs. L1 miss ratio for prefetching without unrolling

Depth vs. clock ticks for prefetching without loop unrolling

Questions!!