b10100 Numerical Accuracy ENGR xD52 Eric VanWyk Fall 2012

Acknowledgements: Mark L. Chang's lecture notes for Computer Architecture (Olin ENGR3410)

Today
Fast = Expensive. Slow = Cheap. Computer Architecture = the best of both worlds. Because we are awesome.

Recap
So far, we have three types of memory:
– Registers (between pipeline stages)
– Register File
– Giant Magic Black Box for lw and sw
Inside the box is CompArch's favorite thing: More Black Boxes. WITH KNOBS ON THEM WE CAN ADJUST.

Cost vs Speed
Faster memories are more expensive per bit. Slow and expensive technologies are phased out. Expensive means chip area, which means $$$.

Technology    Access Time       $/GB in 2004      $/GB in 2013
SRAM          0.5-5 ns          $4,000-$10,000    $1,400
DRAM          50-70 ns          $100-$200         >$3 (Nov '12), >$6 (Nov '13)
Disk (HDD)    (5-20)x10^6 ns    $0.50-$2          >$0.05
Disk (SSD)    >100 ns                             $0.60

Static Random Access Memory
Like a register file, but:
– Only one port (unified read/write)
– Many more words (deeper)
– Several of you accidentally invented it
Different cell construction: smaller than a D flip-flop. Different word select construction: much fancier, to allow for poorer cells. Hooray for VLSI!

SRAM Cell
– M1-M4 store the bit weakly (cross-coupled inverters)
– The Word Line accesses the cell by opening M5 and M6
– Bit Lines: read the bit (weak signal) or write the bit (strong signal)
Squint to see the SR latch!

DRAM Cell
– A capacitor stores the bit (for a while; it must be periodically "refreshed")
– A FET controls access
Smaller than SRAM. Slower than SRAM.

Technology Trends
[Plot: the processor-DRAM memory gap (latency), performance vs. time since the early 1980s.] Processor performance improves ~60%/yr (2x every 1.5 years, "Moore's Law"); DRAM latency improves only ~9%/yr (2x every 10 years). The processor-memory performance gap grows about 50% per year.

The Problem
The von Neumann Bottleneck:
– Logic gets faster
– Memory capacity gets larger
– Memory speed is not keeping up with logic
How do we get the full Daft Punk experience? Faster, Bigger, Cheaper, Stronger?

Fast, Big, Cheap: Pick 3
Design philosophy: use a hybrid approach that combines aspects of both:
– Lots of Slow'n'Cheap
– A small amount of Fast'n'Costly: the "Cache"
Make the common case fast: keep frequently used things in a small amount of fast/expensive memory.

The Solution
Take advantage of the principle of locality:
– Provide as much memory as is available in the cheapest technology
– Provide access at the speed offered by the fastest technology
[Figure: the memory hierarchy, from the processor outward: registers, on-chip cache, second-level cache (SRAM), main memory (DRAM), secondary storage (disk).]

Name     Register    Cache    Main Memory    Disk Memory
Speed    <1 ns       <10 ns   60 ns          10 ms
Size     100s of B   KBs      MBs            GBs

Cache Terminology
Hit: the data appears in that level.
– Hit rate: the percentage of accesses that hit in that level
– Hit time: the time to access this level. Hit time = access time + time to determine hit/miss
Miss: the data does not appear in that level and must be fetched from a lower level.
– Miss rate: the percentage of accesses that miss at that level = (1 - hit rate)
– Miss penalty: the overhead of getting data from a lower level. Miss penalty = lower-level access time + replacement time + time to deliver to processor
The miss penalty is usually MUCH larger than the hit time.

Cache Access Time
Average access time = (hit time) + (miss rate) x (miss penalty). We want a high hit rate and a low hit time, since the miss penalty is large.
Average Memory Access Time (AMAT): apply the average access time recursively to the entire hierarchy.

Handling A Cache Miss
Data miss:
1. Stall the pipeline (freeze following instructions)
2. Instruct memory to perform a read and wait
3. Return the result from memory and allow the pipeline to continue
Instruction miss:
1. Send the original PC to the memory
2. Instruct memory to perform a read and wait (no write enables)
3. Write the result to the appropriate cache line
4. Restart the instruction

Cache Access Time Example
Note: the numbers are local hit rates, the ratio of accesses reaching that cache that hit there (remember, higher levels filter accesses to lower levels). Work bottom-up, using access time = hit time + miss rate x (lower level's access time):

Level          Hit Time        Hit Rate   Access Time
L1             1 cycle         95%        1 + 0.05 * 65 = 4.25
L2             10 cycles       90%        10 + 0.10 * 550 = 65
Main Memory    50 cycles       99%        50 + 0.01 * 50,000 = 550
Disk           50,000 cycles   100%       50,000
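To make the recursion concrete, here is a minimal C sketch (mine, not from the slides) that reproduces the table above, assuming, as the example does, that each level's miss penalty is simply the next level's average access time:

    #include <stdio.h>

    /* One level of the memory hierarchy. */
    struct level {
        const char *name;
        double hit_time;  /* cycles */
        double hit_rate;  /* local hit rate, 0.0-1.0 */
    };

    int main(void) {
        struct level h[] = {
            {"L1",          1.0,     0.95},
            {"L2",          10.0,    0.90},
            {"Main Memory", 50.0,    0.99},
            {"Disk",        50000.0, 1.00},
        };
        int n = sizeof h / sizeof h[0];
        /* Work bottom-up: a miss at level i costs the average
           access time of level i+1. */
        double amat = h[n - 1].hit_time;  /* disk always hits */
        for (int i = n - 2; i >= 0; i--)
            amat = h[i].hit_time + (1.0 - h[i].hit_rate) * amat;
        printf("AMAT seen by the processor: %.2f cycles\n", amat);  /* 4.25 */
        return 0;
    }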

How do we get those Hit Rates?
There ain't no such thing as a free lunch: if access were totally random, we'd be hosed. But we have a priori knowledge about programs: Locality.
– Temporal Locality: if an item has been accessed recently, it will tend to be accessed again soon
– Spatial Locality: if an item has been accessed recently, nearby items will tend to be accessed soon

Example
What does this code do? What type(s) of locality does it have?

    char *index = string;
    while (*index != 0) {   /* C strings end in 0 */
        if (*index >= 'a' && *index <= 'z')
            *index = *index + ('A' - 'a');
        index++;
    }

Exploiting Locality
Temporal locality:
– Keep more recently accessed items closer to the processor
– When we must evict items to make room for new ones, attempt to keep the more recently accessed items
Spatial locality:
– Move blocks consisting of multiple contiguous words to the upper level

Basic Cache Design
2^N blocks of data, each 2^M bytes wide. The tag indicates which block of memory is stored in each line. [Figure: a table of cache lines, each holding a Valid Bit, a Tag (address), and Data.]

Whose Line is it, Anyway?
How do we map system memory into the cache? Idea one: Direct Mapping.
– Each address maps to a single cache line
– The lower M bits are the address within a block
– The next N bits are the cache line address
– The remainder of the address is the Tag

Address layout:  [ Tag | N bits (line) | M bits (offset) ]
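As a concrete sketch (not from the slides), here is how that split looks in C, using the illustrative values N = 2 and M = 0 from the 4-byte example coming up:

    #include <stdio.h>

    #define M 0  /* log2(block size in bytes) */
    #define N 2  /* log2(number of cache lines) */

    int main(void) {
        unsigned addr = 0x1A;  /* binary 11010 */
        unsigned offset = addr & ((1u << M) - 1);         /* lower M bits */
        unsigned line   = (addr >> M) & ((1u << N) - 1);  /* next N bits */
        unsigned tag    = addr >> (M + N);                /* the rest */
        printf("addr=0x%X -> tag=%u line=%u offset=%u\n",
               addr, tag, line, offset);  /* tag=6 (110), line=2 (10) */
        return 0;
    }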

Direct Mapped Cache
[Figure: system memory blocks 0 through F mapping into a 4-line cache. For example, block C maps to cache line C mod 4 = 0, and the stored tag C / 4 = 3 records which block occupies the line, so the cache holds M[C].]

Cache Access Example
Assume a 4-byte cache (N = 2, M = 0). Access pattern: 00001, 00110, 11010.
– Access 00001: line 01 is invalid: miss. Load M[00001]; set the valid bit; tag = 000.
– Access 00110: line 10 is invalid: miss. Load M[00110]; tag = 001.
– Access 11010: line 10 is valid, but the stored tag 001 != 110: miss. Evict M[00110] and load M[11010].
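A minimal direct-mapped cache simulator in C (a sketch of mine, not from the slides) that reproduces this hit/miss sequence; the geometry macros match the example (N = 2, M = 0):

    #include <stdio.h>

    #define N 2               /* log2(number of lines) */
    #define M 0               /* log2(block size in bytes) */
    #define LINES (1u << N)

    struct line { int valid; unsigned tag; };

    static struct line cache[LINES];

    /* Returns 1 on hit, 0 on miss (filling the line on a miss). */
    static int cache_access(unsigned addr) {
        unsigned idx = (addr >> M) & (LINES - 1);
        unsigned tag = addr >> (M + N);
        if (cache[idx].valid && cache[idx].tag == tag)
            return 1;
        cache[idx].valid = 1;  /* miss: fetch block, replacing the old one */
        cache[idx].tag = tag;
        return 0;
    }

    int main(void) {
        unsigned pattern[] = { 0x01, 0x06, 0x1A };  /* 00001, 00110, 11010 */
        for (unsigned i = 0; i < sizeof pattern / sizeof pattern[0]; i++)
            printf("access 0x%02X: %s\n", pattern[i],
                   cache_access(pattern[i]) ? "hit" : "miss");
        return 0;
    }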

Direct Mapping
Simple control logic. Great spatial locality. Awful temporal locality. When does this fail in a direct mapped cache?

    char *image1, *image2;
    int stride;
    /* ... */
    for (int i = 0; i < size; i += stride) {
        diff += abs(image1[i] - image2[i]);
    }
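One illustration of the failure mode (my example, with an assumed 32 KB direct-mapped cache, not the slides'): if image1 and image2 happen to sit a multiple of the cache size apart, corresponding bytes map to the same cache line, and each access evicts the other array's block:

    #include <stdio.h>
    #include <stdint.h>

    #define CACHE_BYTES 32768   /* assumed direct-mapped cache size */
    #define LINE_BYTES  64
    #define NUM_LINES   (CACHE_BYTES / LINE_BYTES)

    static unsigned line_of(uintptr_t addr) {
        return (unsigned)((addr / LINE_BYTES) % NUM_LINES);
    }

    int main(void) {
        /* Pretend the arrays are exactly one cache-size apart. */
        uintptr_t image1 = 0x100000;
        uintptr_t image2 = image1 + CACHE_BYTES;
        for (int i = 0; i < 4; i++)
            printf("i=%d: image1 -> line %u, image2 -> line %u (collide!)\n",
                   i, line_of(image1 + i), line_of(image2 + i));
        return 0;
    }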

Block Size Tradeoff
With a fixed cache size, increasing the block size (M) gives:
– Worse miss penalty: it takes longer to fill a block
– Better spatial locality
– Worse temporal locality (fewer blocks)
[Plots vs. block size: miss penalty rises steadily; miss rate first falls (exploits spatial locality) and then rises (fewer blocks compromises temporal locality); average access time bottoms out and then climbs with the increased miss penalty and miss rate.]

N-Way Associative Mapping
Like N parallel direct-mapped caches. [Figure: each way has its own Valid/Tag/Block arrays; the address's tag is compared (=?) against the stored tag in every way in parallel. A match in any valid way signals a hit and selects that way's cache block.]

Associative Mapping
Can handle up to N conflicts without eviction. Better temporal locality (assuming a good eviction policy). More complicated control logic:
– Slower to confirm a hit or miss
– Slower to find the associated cache line

Fully Associative Cache
Fully associative = 2^N-way associative: everything goes anywhere! [Figure: every line's Valid Bit/Tag/Data; every tag is compared (=?) against the address in parallel.]

What gets evicted?
Approach #1: Random. Just arbitrarily pick from the possible locations.
Approach #2: Least Recently Used (LRU). Use temporal locality; must track usage somehow, with extra bits recording recent use.
In practice, Random is ~12% worse than LRU.
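For a 2-way set, the "extra bits" can be a single LRU bit per set. A minimal sketch (my illustration, assuming a 2-way cache; not from the slides):

    #include <stdio.h>

    struct set2 {
        int      valid[2];
        unsigned tag[2];
        int      lru;       /* index of the least-recently-used way */
    };

    /* Returns the way holding 'tag', loading it over the LRU way on a miss. */
    static int lookup(struct set2 *s, unsigned tag) {
        for (int w = 0; w < 2; w++) {
            if (s->valid[w] && s->tag[w] == tag) {
                s->lru = 1 - w;   /* the other way is now least recent */
                return w;
            }
        }
        int victim = s->lru;      /* miss: evict the LRU way */
        s->valid[victim] = 1;
        s->tag[victim] = tag;
        s->lru = 1 - victim;
        return victim;
    }

    int main(void) {
        struct set2 s = {0};
        /* tags: 7 evicts the LRU way (holding 1); 3 then still hits */
        unsigned refs[] = {1, 3, 7, 3};
        for (int i = 0; i < 4; i++)
            printf("tag %u -> way %d\n", refs[i], lookup(&s, refs[i]));
        return 0;
    }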

Cache Arrangement
– Direct Mapped: memory addresses map to a particular location in the cache (a 1-way associative set)
– Fully Associative: data can be placed anywhere in the cache (a 2^N-way associative set)
– N-way Set Associative: data can be placed in a limited number of places in the cache, depending upon the memory address

My L1 Data Cache M = ? N = ? Tag Size = ?

My L1 Data Cache
M = 6 bits: 64-byte line size (64-byte blocks; 2^6 = 64).
N = 6 bits: 32 kB total -> 15 bits; 8-way -> 3 fewer bits; M = 6 -> 6 fewer bits.
Tag size = 52 bits: 64-bit address space; the tag is whatever is left over.
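The same arithmetic as a short C sketch, with the slide's parameters (32 kB, 8-way, 64-byte lines, 64-bit addresses) as inputs:

    #include <stdio.h>

    /* Integer log2 for powers of two. */
    static int ilog2(unsigned long x) {
        int n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    int main(void) {
        unsigned long cache_bytes = 32 * 1024;  /* 32 kB */
        unsigned long ways = 8;
        unsigned long line_bytes = 64;
        int addr_bits = 64;

        int m = ilog2(line_bytes);                          /* offset bits */
        int n = ilog2(cache_bytes / (ways * line_bytes));   /* index bits */
        int tag = addr_bits - n - m;                        /* what's left */
        printf("M=%d N=%d Tag=%d bits\n", m, n, tag);       /* M=6 N=6 Tag=52 */
        return 0;
    }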

3 C's of Cache Misses
– Compulsory/Cold-start: the first access to a block; basically unavoidable. For long-running programs this is a small fraction of misses.
– Capacity: the block had been loaded, but was evicted by too many other accesses. Can only be mitigated by cache size.
– Conflict: the block had been loaded, but was evicted by a mapping conflict. Mitigated by associativity.

Cache Miss Example
8-word cache, 8-byte blocks. For the access pattern below, determine the type of each miss (CAP, COLD, CONF).

Byte Addr   Block Addr   Direct Mapped   2-Way Assoc   Fully Assoc
0
4
8
24
56
8
24
16
0
Total misses:

Cache Miss Example
8-word cache, 8-byte blocks. Direct-mapped answers:

Byte Addr   Block Addr   Direct Mapped
0           0            Cold
4           0            Hit (in 0's block)
8           1            Cold
24          3            Cold
56          7            Cold
8           1            Conf (w/56)
24          3            Hit
16          2            Cold
0           0            Hit
Total misses: 6

Split Caches
Often there are separate Instruction and Data caches:
– Harvard Architecture: split I & D
– von Neumann Architecture: unified
Higher bandwidth; each cache is optimized to its usage. Slightly higher miss rate, since each cache is smaller.

With Remaining Time
Finish the cache example (answers below). Find out what your computer's caches are. Write your own "pop quiz" on caching. We will exchange and take the quizzes (not graded), then discuss.

Quiz Ideas
Describe a specific cache: how big is the tag? Where does address X map? Describe an access pattern: hit rate? Average access time?

Cache Summary
Software developers must respect the cache! It is usually a good place to start optimizing, especially on laptops/desktops/servers. Cache line size? Page size? Cache design implies a usage pattern: very good for instructions & the stack, not as good for the heap. Modern languages are very heapy.

Matrix Multiplication
Which has better locality?

    for (k = 0; k < n; k++)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                c[k][i] = c[k][i] + a[k][j] * b[j][i];

vs.

    for (k = 0; k < n; k++)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
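A small timing harness (my sketch; the array size and timing method are assumptions) to test the question empirically. In row-major C, the second loop nest walks c and b sequentially in the inner loop, so it should run noticeably faster:

    #include <stdio.h>
    #include <time.h>

    #define SZ 512

    static double a[SZ][SZ], b[SZ][SZ], c[SZ][SZ];

    static double run(int variant) {
        clock_t t0 = clock();
        for (int k = 0; k < SZ; k++)
            for (int i = 0; i < SZ; i++)
                for (int j = 0; j < SZ; j++)
                    if (variant == 1)
                        c[k][i] += a[k][j] * b[j][i];   /* strided walk of b */
                    else
                        c[i][j] += a[i][k] * b[k][j];   /* sequential c and b */
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void) {
        printf("variant 1: %.3f s\n", run(1));
        printf("variant 2: %.3f s\n", run(2));
        return 0;
    }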

Cache Miss Comparison
Fill in the blanks: Zero, Low, Medium, High, Same for all.

                                    Direct Mapped   N-Way Set Associative   Fully Associative
Cache size (Small, Medium, Big?)
Compulsory miss:
Capacity miss:
Conflict miss:
Invalidation miss:

Cache Miss Comparison
Fill in the blanks: Zero, Low, Medium, High, Same for all.

                     Direct Mapped           N-Way Set Associative   Fully Associative
Cache size           Big (few comparators)   Medium                  Small (lots of comparators)
Compulsory miss      Same                    Same                    Same
Capacity miss        Low                     Medium                  High
Conflict miss        High                    Medium                  Zero
Invalidation miss    Same                    Same                    Same

Cache Miss Example
8-word cache, 8-byte blocks. Full answers (miss types: CAP, COLD, CONF):

Byte Addr   Block Addr   Direct Mapped        2-Way Assoc   Fully Assoc
0           0            Cold                 Cold          Cold
4           0            Hit (in 0's block)   Hit           Hit
8           1            Cold                 Cold          Cold
24          3            Cold                 Cold          Cold
56          7            Cold                 Cold          Cold
8           1            Conf (w/56)          Conf (w/56)   Hit
24          3            Hit                  Conf (w/8)    Hit
16          2            Cold                 Cold          Cold
0           0            Hit                  Hit           Cap
Total misses:            6                    7             6

Replacement Methods
If we need to load a new cache line, where does it go?
– Direct-mapped: only one possible location
– Set associative: N locations possible; optimize for temporal locality?
– Fully associative: all locations possible; optimize for temporal locality?

Cache Access Example
Assume a 4-byte cache (N = 2, M = 0). Access pattern: 00001, 00110, 11010, 00110.

Step 1: M[00001] and M[00110] are first accesses to empty lines: Compulsory/Cold Start misses.

Line   Valid   Tag   Data
00     0
01     1       000   M[00001]
10     1       001   M[00110]
11     0

Step 2: M[11010] maps to line 10, but it is the first access to this block: Compulsory/Cold Start miss. M[00110] is evicted.

Line   Valid   Tag   Data
00     0
01     1       000   M[00001]
10     1       110   M[11010]
11     0

Step 3: M[00110] maps to line 10 again, which now holds tag 110 != 001: Conflict Miss. M[11010] is evicted and M[00110] reloaded.

Line   Valid   Tag   Data
00     0
01     1       000   M[00001]
10     1       001   M[00110]
11     0