Basic Performance Parameters in Computer Architecture:

Good Old Moore's Law (technology vs. architects): every 18-24 months the number of transistors on the same chip area doubles. Historically this has meant that processor speed doubles every 18-24 months, energy per operation halves every 18-24 months, and memory capacity doubles every 18-24 months.

Instructions/sec → 2x every 2 years. Memory capacity → 2x every 2 years. Memory latency → only ~1.1x every 2 years, so the gap between processor speed and memory latency keeps widening.

Cache Magic

Parameters for Metrics and Evaluation: What does "better" mean in computer architecture? Is it clock speed (GHz) or memory size (GB)? Latency and throughput are the two key performance parameters. Latency: the time from the start to the end of a task. Throughput: the number of operations completed per second (#/second).

Comparing the performance of CPUs X and Y: "X is N times faster than Y" means Speedup = N, where N = Speed(X) / Speed(Y) = Latency(Y) / Latency(X) = Throughput(X) / Throughput(Y).
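
As a small sketch of these ratios (the function names are illustrative, not from the slides), speedup can be computed either from latencies or from throughputs:

    #include <stdio.h>

    /* Speedup of X over Y from latencies: N = latency(Y) / latency(X). */
    double speedup_from_latency(double latency_x, double latency_y) {
        return latency_y / latency_x;
    }

    /* Speedup of X over Y from throughputs: N = throughput(X) / throughput(Y). */
    double speedup_from_throughput(double tput_x, double tput_y) {
        return tput_x / tput_y;
    }

    int main(void) {
        /* Hypothetical example: X finishes a task in 2 s, Y in 6 s -> X is 3x faster. */
        printf("Speedup = %.1f\n", speedup_from_latency(2.0, 6.0));
        return 0;
    }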

Introduction to Caches:

Locality Principle: things that happened recently are likely to happen again soon, and things close to them are likely to be involved too. Which of these is not a good example of locality? (1) It rained 3 times today, so it is likely to rain again. (2) I ate dinner at 7 pm every day last week, so I will probably eat dinner around 7 pm this week. (3) It was New Year's Eve yesterday, so it will probably be New Year's Eve today. The third one does not follow: a one-off event does not recur just because it happened recently.

Memory Locality:

If we accessed address X recently, we are likely to access X again soon (temporal locality), and we are likely to access addresses close to X too (spatial locality).

Temporal & Spatial Locality Implementation: for (j = 0; j < 1000 ; j++) print arr[j]

Using Locality to Speed Up Data Access (library analogy): a library is a repository of data, large but slow to access, and visits to it show temporal and spatial locality. A student can: 1. Go to the library for each piece of information and then go home (does not benefit from locality). 2. Borrow the books currently needed — a small, fast local copy, like a cache. 3. Take all the books and build a library at home (expensive, and still high latency).

Cache Lookups: the cache is fast but small, so not everything will fit. On an access: Cache hit → the data is found in the cache (fast). Cache miss → the data is not in the cache, so we access slow main memory (RAM) and copy that location into the cache for next time.

Cache Performance: Hit time → should be low; favors a small, fast cache. Miss rate → should be low; favors a large and/or smart cache. Miss penalty → the main-memory access time, which is large (tens to hundreds of cycles). Miss time = hit time + miss penalty (we pay the hit time to discover the miss, then the RAM access). Average Memory Access Time (AMAT): AMAT = hit time + miss rate x miss penalty.
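
A minimal sketch of the AMAT formula in C (the numbers in main are hypothetical, just to show the arithmetic):

    #include <stdio.h>

    /* AMAT = hit time + miss rate * miss penalty, as defined above. */
    double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Hypothetical numbers: 1-cycle hits, 10% misses, 100-cycle RAM access. */
        printf("AMAT = %.1f cycles\n", amat(1.0, 0.10, 100.0));  /* 11.0 */
        return 0;
    }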

Cache Size in Real Processors: complication — a processor contains several caches. The L1 cache directly services all read/write requests from the processor. Its size is typically 16-64 KB: large enough to reach roughly a 90% hit rate, yet small enough to hit in 1-3 cycles.

Cache Organization: two questions — how do we determine hit or miss, and how do we decide what to kick out? The cache is a table indexed by (part of) the address; each entry holds a block of data plus the bookkeeping needed to signal a hit. The block size (line size) is typically 32 to 128 bytes: large enough to exploit spatial locality, but not as large as, say, 1 KB, because much of that precious cache capacity would sit unused.

Blocks in Cache and Main Memory: a line is a cache slot into which a memory block fits. [Figure: memory addresses 4, 8, 12, ... grouped into 16-byte blocks, each block filling one cache line.] In the example each memory word is 4 bytes and the block size is 16 bytes, so a block covers four consecutive words.

Block Offset and Block Number: the address the processor issues is split into a block number and a block offset. With a block size of 16 bytes (2^4), bits [3:0] are the offset and bits [31:4] are the block number. 1. The block number tells which block we try to find in the cache. 2. The block offset tells where within that block the requested data lies.

Cache Block Number Quiz: the block size is 32 bytes and the processor generates the 16-bit address 1111 0000 1010 0101. What is the block number for this address? What is the block offset?
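
A sketch of the arithmetic, using the quiz values from this slide: a 32-byte block means log2(32) = 5 offset bits, so the low 5 bits are the offset and the remaining high bits are the block number.

    #include <stdio.h>

    int main(void) {
        unsigned int addr = 0xF0A5;     /* 1111 0000 1010 0101 */
        unsigned int block_size = 32;   /* bytes -> 5 offset bits */

        unsigned int offset = addr % block_size;   /* low log2(block_size) bits */
        unsigned int block  = addr / block_size;   /* remaining high bits       */

        printf("block number = 0x%X (%u)\n", block, block);    /* 0x785 = 1925 */
        printf("block offset = 0x%X (%u)\n", offset, offset);  /* 0x5   = 5    */
        return 0;
    }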

Cache Tag (how we tell which block is in the cache): alongside the data array the cache keeps a tag array; each tag holds the block number of the block stored in the corresponding line. On an access, the block number from the processor-generated address is compared against each tag; if a comparator signals a match we have a cache hit, and the offset then selects the requested bytes within that line to supply to the processor. On a cache miss, the data is brought into a line and its block number is written into the corresponding tag.

Valid Bit: at boot-up the cache holds no real data, so whatever garbage happens to sit in a tag could accidentally match a block number and return garbage data that was never brought from RAM. No particular initial tag value fixes this — any value could collide with some address. The solution is a valid bit per line, cleared at reset, so that Hit ⇔ (tag == block #) AND (valid == 1).
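
A minimal sketch of that hit condition with a valid bit (the struct layout and field names are illustrative, not any particular processor's):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct cache_line {
        bool     valid;   /* 0 after reset: line holds no real block yet   */
        uint32_t tag;     /* block number of the block stored in this line */
        uint8_t  data[64];
    };

    /* Hit only if the line is valid AND its stored tag matches the block
       number of the requested address. The valid bit prevents a garbage
       tag left over from boot-up from matching by accident. */
    bool is_hit(const struct cache_line *line, uint32_t block_number) {
        return line->valid && line->tag == block_number;
    }

    int main(void) {
        struct cache_line line = { .valid = true, .tag = 0x785 };
        printf("hit? %d\n", is_hit(&line, 0x785));   /* 1 */
        printf("hit? %d\n", is_hit(&line, 0x123));   /* 0 */
        return 0;
    }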

Types of Caches: Fully associative → any block can go in any cache line, so all N lines must be compared (the fully flexible extreme of set-associative). Set associative → a block can go in any of N lines within one set (the middle ground). Direct mapped → a block can go in exactly one line (the fully rigid extreme of set-associative).

Direct Mapped Cache: each memory block maps to a single cache line; consecutive blocks map to consecutive lines, wrapping around (block number modulo the number of lines). The processor-generated address splits into tag, index, and block offset: the offset says where within the block we are, the index (2 bits in this example) says which line the block must be in, and the tag is the rest of the block number, stored to identify which block currently occupies the line.
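
A sketch of the direct-mapped placement rule (names and the example address are illustrative): the index is the block number modulo the number of lines, and the remaining upper bits become the tag.

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 16   /* bytes -> 4 offset bits */
    #define NUM_LINES  4    /* lines -> 2 index bits  */

    int main(void) {
        uint32_t addr   = 0x1A34;
        uint32_t block  = addr / BLOCK_SIZE;   /* block number               */
        uint32_t index  = block % NUM_LINES;   /* the one line it may use    */
        uint32_t tag    = block / NUM_LINES;   /* upper bits stored as tag   */
        uint32_t offset = addr % BLOCK_SIZE;   /* position within the block  */

        printf("tag=0x%X index=%u offset=%u\n", tag, index, offset);
        return 0;
    }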

Advantages/Disadvantages of a Direct Mapped Cache: it looks in only one place (1:1 mapping), so it is fast (only one location checked, giving a good hit time), cheap (a simple design with a single tag-and-valid comparator), and energy efficient (less power dissipation from the smaller design). But each block must go in one specific place: an access pattern A B A B where A and B map to the same line makes them keep evicting each other — a conflict over one spot — and therefore a high miss rate.

Set Associative Caches: in an N-way set-associative cache a block maps to exactly one set but can occupy any of the N lines in that set. Example: 8 lines organized as a 2-way set-associative cache (N = 2) gives 4 sets (SET 0 to SET 3) of 2 lines each; within its set, either of the 2 lines can hold the block. A few bits of the block number select which set the block goes to.

Offset, Index, and Tag for Set Associative Caches: the address splits into TAG | INDEX | OFFSET. The index determines which set to access (2 bits for the 4 sets in the example above), the offset says where within the block we are, and the tag is whatever remains. Quiz: would a direct-mapped cache of the same size have a smaller tag? (Yes — same total size means more sets, hence more index bits and fewer tag bits.)

Fully Associative Cache: there are no index bits, since a block can go in any line of the cache; the address splits into only TAG | OFFSET.

Cache Summary: direct mapped = 1-way set associative; fully associative = N-way set associative where N = the number of lines (a single set, so no index bits). Address split: TAG | INDEX | OFFSET, with index = log2(# of sets) and offset = log2(block size); the tag is whatever bits remain.
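
A sketch of those summary formulas in C (the cache parameters in main are hypothetical, and sizes are assumed to be powers of two):

    #include <stdio.h>

    /* log2 for powers of two */
    static unsigned ilog2(unsigned x) {
        unsigned n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    int main(void) {
        unsigned block_size = 32;          /* bytes            */
        unsigned cache_size = 32 * 1024;   /* bytes            */
        unsigned ways       = 4;           /* 4-way set assoc. */
        unsigned sets       = cache_size / (block_size * ways);   /* 256 */

        unsigned offset_bits = ilog2(block_size);  /* log2(32)  = 5 */
        unsigned index_bits  = ilog2(sets);        /* log2(256) = 8 */
        unsigned tag_bits    = 32 - index_bits - offset_bits;  /* 32-bit address */

        printf("offset=%u index=%u tag=%u bits\n", offset_bits, index_bits, tag_bits);
        return 0;
    }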

Cache Replacement: on a miss in a full set we need room for the new block, so which block do we kick out? Random; FIFO (kick out the block that has been in the cache the longest); or LRU (kick out the block that has not been used for the longest time).

Implementing LRU (true LRU exploits temporal locality): for an N-way set-associative cache, keep one log2(N)-bit LRU counter per line in the set; here, 4 two-bit counters counting 0 to 3. On each access the accessed line's counter is set to N-1 (most recent), every counter that was larger than its old value is decremented, and the line whose counter is 0 is the LRU victim on a miss. Maintaining these counts is complicated: the cost is N counters of log2(N) bits per set, and the energy cost is that up to N counters change on every access — even on cache hits. A sketch of the update rule follows below.
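
A sketch of the counter scheme for one 4-way set, mirroring the counting rule described above (this is illustrative, not any particular processor's implementation): the accessed line's counter becomes N-1, every counter that was above the accessed line's old value is decremented, and the victim is whichever line holds counter 0.

    #include <stdio.h>

    #define WAYS 4

    /* One LRU counter per line in the set; value WAYS-1 = most recently
       used, value 0 = least recently used (the replacement victim). */
    static unsigned lru[WAYS] = {3, 2, 1, 0};

    void touch(unsigned way) {           /* called on every access (hits too) */
        unsigned old = lru[way];
        for (unsigned i = 0; i < WAYS; i++)
            if (lru[i] > old)            /* everyone "newer" ages by one      */
                lru[i]--;
        lru[way] = WAYS - 1;             /* accessed line becomes newest      */
    }

    unsigned victim(void) {              /* line to evict on a miss           */
        for (unsigned i = 0; i < WAYS; i++)
            if (lru[i] == 0)
                return i;
        return 0;                        /* unreachable if counters are valid */
    }

    int main(void) {
        touch(2);                        /* a hit on way 2                    */
        printf("victim on next miss: way %u\n", victim());   /* way 3 */
        return 0;
    }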

Write Policy of Caches: two questions. (1) On a write miss, do we insert the block we write (the allocate policy)? Write-allocate: bring the block into the cache (helps if later reads/writes to it show locality). No-write-allocate: do not bring the block into the cache when it is written. (2) On a write hit, do we write just to the cache or also to memory? Write-through: update memory immediately. Write-back: write only to the cache, and write to RAM only when the cache block is replaced (with good write locality, most writes then touch only the cache).

Write-Back Caches: writing every replaced block back to RAM would be inefficient, so how do we write back only what actually changed? Add a dirty bit to each cache line. Dirty bit = 1 → the block is dirty (it was written, so it must be written back to memory on replacement). Dirty bit = 0 → the block is clean (not written since it was brought from RAM), so on replacement there is no need to write it back.
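
A sketch of the write-back bookkeeping (illustrative struct, not a full cache; writeback_to_ram is a hypothetical helper): writes only set the dirty bit, and the slow write to RAM happens once, at eviction, and only if the line was actually written.

    #include <stdbool.h>
    #include <stdint.h>

    struct line {
        bool     valid;
        bool     dirty;   /* set on any write; cleared when written back */
        uint32_t tag;
        uint8_t  data[64];
    };

    /* Write hit: update only the cache and mark the line dirty. */
    void write_hit(struct line *l, unsigned offset, uint8_t value) {
        l->data[offset] = value;
        l->dirty = true;             /* RAM is now stale for this block */
    }

    /* Eviction: clean lines are simply dropped; dirty lines must be
       written back to memory first. */
    void evict(struct line *l /*, ram, address ... */) {
        if (l->valid && l->dirty) {
            /* writeback_to_ram(l);   hypothetical write to main memory */
            l->dirty = false;
        }
        l->valid = false;
    }

    int main(void) {
        struct line l = { .valid = true };
        write_hit(&l, 0, 42);
        evict(&l);                   /* this block would be written back */
        return 0;
    }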

Multi-Level Caches (Cache Hierarchy):

Reducing the AMAT: we can reduce the hit time, reduce the miss rate, or reduce the miss penalty. A multi-level cache hierarchy attacks the miss penalty: a miss in the L1 cache goes to the next-level (L2) cache rather than straight to RAM, so the L1 miss penalty is no longer the full memory latency. L1 miss penalty = L2 hit time + L2 miss rate x L2 miss penalty, and we can add L3, L4, and so on.

AMAT with Cache Hierarchies: AMAT = L1 hit time + L1 miss rate x L1 miss penalty; L1 miss penalty = L2 hit time + L2 miss rate x L2 miss penalty; L2 miss penalty = L3 hit time + L3 miss rate x L3 miss penalty; and so on, until the last-level cache (LLC), whose miss penalty is the main-memory latency.
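
A sketch of that recursion in C: each level's miss penalty is the AMAT of the level below, bottoming out at main-memory latency. The numbers in main are the ones used on the next slide (2-cycle L1 hit, 10-cycle L2 hit, 100-cycle RAM, 10% L1 miss rate, 25% local L2 miss rate).

    #include <stdio.h>

    struct level { double hit_time, miss_rate; };  /* miss_rate is local */

    /* AMAT over n cache levels; the final miss penalty is RAM latency. */
    double amat(const struct level *lv, int n, double mem_latency) {
        if (n == 0)
            return mem_latency;
        return lv[0].hit_time + lv[0].miss_rate * amat(lv + 1, n - 1, mem_latency);
    }

    int main(void) {
        struct level hier[] = { {2.0, 0.10}, {10.0, 0.25} };
        printf("AMAT = %.1f cycles\n", amat(hier, 2, 100.0));  /* 5.5 */
        return 0;
    }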

Multi-Level Cache Performance (AMAT = hit time + miss rate x miss penalty; RAM access = 100 cycles):

              16 KB only            128 KB only               No cache (RAM only)   L1 = 16 KB + L2 = 128 KB
  Hit time    2 cycles              10 cycles                 100 cycles            2 cycles (L1 hit); 12 cycles total on an L2 hit
  Hit rate    90%                   97.5%                     100%                  90% for L1; 75% (local) for L2
  AMAT        2 + 0.1 x 100 = 12    10 + 0.025 x 100 = 12.5   100                   2 + 0.1 x (10 + 0.25 x 100) = 5.5

Hit Rate in the L1/L2 Hierarchy: in the table above, the 97.5% hit rate quoted for the 128 KB cache on its own is a global hit rate (measured against all memory accesses), whereas the 75% quoted for L2 in the two-level configuration is a local hit rate (measured only against the accesses that actually reach L2, i.e. the L1 misses).

Global vs Local Hit Rate: global hit rate = 1 - global miss rate, where global miss rate = number of misses in this cache / number of all memory accesses. Local hit rate = number of hits in this cache / number of accesses that reach this cache. A related metric is misses per 1000 instructions (MPKI).
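
A sketch of the bookkeeping behind these definitions; the counter values below are hypothetical, chosen only to show the arithmetic.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical counts for an L2 cache over one program run. */
        double all_mem_accesses = 1000000;  /* every load/store issued       */
        double l2_accesses      = 100000;   /* only the L1 misses reach L2   */
        double l2_misses        = 25000;
        double instructions     = 2000000;

        double local_hit_rate   = 1.0 - l2_misses / l2_accesses;      /* 0.75  */
        double global_miss_rate = l2_misses / all_mem_accesses;       /* 0.025 */
        double global_hit_rate  = 1.0 - global_miss_rate;             /* 0.975 */
        double mpki             = l2_misses / (instructions / 1000.0);/* 12.5  */

        printf("local hit %.2f, global hit %.3f, MPKI %.1f\n",
               local_hit_rate, global_hit_rate, mpki);
        return 0;
    }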

Inclusion Property in Caches: if a block is in the L1 cache, by default it may or may not also be in the L2 cache. If it must also be in L2, the hierarchy is inclusive; if it must not be in L2, the hierarchy is exclusive. Unless inclusion or exclusion is explicitly enforced, a block in L1 may or may not still be present in L2.

Inclusion Is Not Automatic (example): the processor accesses blocks A, B, C, ..., each miss is serviced from RAM (main memory) and the block is written into both L1 and L2, and both caches use LRU. Subsequent hits in L1 never reach L2, so L2's LRU state drifts away from L1's; when a new block E is later fetched from RAM, L2 may choose as its LRU victim a block (such as A) that is still live in L1. The cache hierarchy by itself therefore does not guarantee inclusion. Solution: add an inclusion bit to each L2 line, set to 1 while the block also exists in L1, and exclude such lines from LRU replacement.
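
A minimal sketch of that fix (illustrative structures, not a real coherence protocol): an extra bit per L2 line is set while the block also lives in L1, and the LRU victim search skips such lines.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4

    struct l2_line {
        bool     valid;
        bool     in_l1;     /* "inclusion bit": block currently also in L1 */
        uint32_t tag;
        unsigned lru;       /* 0 = least recently used                     */
    };

    /* Pick a victim in one L2 set, but never evict a line whose block is
       still held in L1 -- that would break inclusion. Returns -1 if every
       line is pinned (caller must fall back to another choice). */
    int pick_victim(struct l2_line set[WAYS]) {
        int best = -1;
        for (int i = 0; i < WAYS; i++) {
            if (set[i].valid && set[i].in_l1)
                continue;                         /* pinned by inclusion   */
            if (best < 0 || set[i].lru < set[best].lru)
                best = i;
        }
        return best;
    }

    int main(void) {
        struct l2_line set[WAYS] = {
            { .valid = true, .in_l1 = true,  .tag = 0xA, .lru = 0 },
            { .valid = true, .in_l1 = false, .tag = 0xB, .lru = 1 },
            { .valid = true, .in_l1 = false, .tag = 0xC, .lru = 2 },
            { .valid = true, .in_l1 = true,  .tag = 0xD, .lru = 3 },
        };
        printf("victim way = %d\n", pick_victim(set));  /* 1, not 0: way 0 is pinned */
        return 0;
    }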

Intel Montecito Chip:

Shared vs Private Caches in Multi-Core: should each core get its own 256 KB L2 cache bank, or should the cores share a single 1 MB monolithic L2 cache?

Why is the L1 cache always private? A shared L1 would receive simultaneous accesses from every core, creating far too much traffic; coping with it would require an expensive multi-ported cache. It would also break single-cycle access: a distant core (e.g. P4) reaching a shared L1 location would pay extra latency in clock cycles.

Shared L2 Cache Disadvantages: for contrast, with private L2s a miss does not go straight to RAM — the other cores' private L2s must be checked for the latest value of the location — whereas a shared L2 keeps a single copy of each location, so an L2 miss goes directly to RAM. The costs of sharing are congestion on the shared L2 bus and higher latency: more wire delay to reach the cache controller (around 20 cycles) plus waiting behind higher-priority requests from other cores (20 + 5 cycles, say), versus roughly 10 cycles for a private L2 cache.

Shared L2 Cache Merits: higher hit rates — private L2s end up holding duplicates (the same block A in multiple caches), so their effective combined capacity is less than 1 MB. Private L2s also allocate capacity statically at 256 KB per core: a core P1 with a 385 KB working set overflows its 256 KB and suffers a high miss rate, while a core P2 with a 150 KB working set leaves its resources under-utilized. A shared L2 allocates capacity dynamically, assigning space according to each core's demand. It also allows potentially faster cache coherence, since data is easier to locate on a miss.

Memory in Modern Processor (L1 Cache?):