EE 382 Processor Design, Winter 98/99, Michael Flynn

Slide 1: Buffers
Buffers minimize memory delays caused by variation in throughput between the pipeline and memory. There are two types of buffer design criteria:
–Maximum-rate buffers, for units that have high request rates. The buffer is sized to mask the service latency. These are generally read buffers that you want to keep full.
–Mean-rate buffers, for units that have a lower expected request rate. The buffer is sized to minimize the probability of overflow. These are generally write buffers that you want to keep from filling.
Slide 2: Maximum-Rate Buffer Design
Buffer is sized to avoid "runout," in which the processor stalls while the buffer is empty awaiting service.
–Need buffer input rate > buffer output rate
–Then size to cover the latency at maximum demand
For an instruction buffer, the buffer size BF should be:
BF = 1 + [(instrs decoded per cycle) × (IF latency in cycles)] / (instrs per IF)
Slide 3: Maximum-Rate Buffer Example
Assumptions:
–Decode consumes at most 1 instruction/clock
–Icache supplies 2 instructions/clock of bandwidth at 6 clocks latency
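The sizing formula from the maximum-rate design slide, applied to these assumptions, gives a 4-entry buffer. A minimal sketch; the ceil() for non-integral ratios is my assumption:

```python
import math

def ibuffer_size(decode_rate, fetch_latency, instrs_per_fetch):
    """Buffer entries needed to mask fetch latency at maximum decode demand:
    BF = 1 + (decode rate x IF latency) / (instrs per IF)."""
    return 1 + math.ceil((decode_rate * fetch_latency) / instrs_per_fetch)

# Decode consumes at most 1 instr/clock; the Icache supplies 2 instrs/clock
# at 6 clocks latency, so BF = 1 + (1 * 6) / 2 = 4 entries.
bf = ibuffer_size(decode_rate=1, fetch_latency=6, instrs_per_fetch=2)
print(bf)  # 4
```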
Slide 4: Mean-Rate Buffer Design
Buffer is sized to avoid "overflow," in which the processor stalls while the buffer awaits service to free an entry.
–Need buffer input rate < buffer output rate
–Then size to reduce the frequency of overflow to an acceptably low level with respect to overall performance
Slide 5: Mean-Rate Buffer Design
Use inequalities from statistics to target the buffer size.
–For an infinite buffer, assume the distribution of buffer occupancy is q, with mean occupancy Q and variance σ².
–Using Markov's inequality for a buffer of size BF: Prob of overflow = p(q ≥ BF) ≤ Q/BF
–Using Chebyshev's inequality for a buffer of size BF: Prob of overflow = p(q ≥ BF) ≤ σ²/(BF − Q)²
–For a given probability of overflow p, conservatively select BF = min(Q/p, Q + σ/√p)
–Analyze the design carefully to pick the correct BF that causes overflow/stall
Slide 6: Mean-Rate Buffer Example
(Figure: memory references from the pipeline; reads go to the data cache, writes go through a store buffer.)
Assumptions:
–store rate = 0.15 per cycle
–store latency to the data cache = 2 clocks
–σ² = 0.3
Result: BF = 2
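The two inequalities can be checked numerically. A sketch, where the 10% overflow target p is an assumed illustration (the slide does not state it):

```python
import math

def mean_rate_bf(Q, var, p):
    """Conservative buffer size BF = min(Q/p, Q + sigma/sqrt(p)),
    from Markov's and Chebyshev's inequalities."""
    markov = Q / p                      # from p(q >= BF) <= Q/BF
    chebyshev = Q + math.sqrt(var / p)  # from p(q >= BF) <= var/(BF - Q)^2
    return min(markov, chebyshev)

Q = 0.15 * 2  # store rate 0.15/cycle x 2-cycle service time = 0.3 mean occupancy
bf = mean_rate_bf(Q=Q, var=0.3, p=0.1)
print(round(bf, 2))  # ~2.03, consistent with the slide's BF = 2
```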
Slide 7: Cache Design Target Miss Ratio (DTMR)
–The DTMR is the design target miss rate, assuming a fully associative, unified cache with LRU replacement
–Includes compulsory and capacity misses for user-only traces
Slide 8: Cache
Apply adjustments to the DTMR for:
–Associativity
–Split instruction/data caches
–Write policy: write-through (WT) or copy-back (CB)
–Write-allocate (WA) vs. no-write-allocate (NWA)
–Replacement policy
–OS, multiprogramming, and I/O effects
Slide 9: Target Miss Rate Trend
Target miss rates tend to increase over time because:
–programming environments become more complex
–application functionality increases
–problem size grows
–memory capacity increases
Slide 10: Target Miss Rate Trend (figure)
Slide 11: System Effects
–Operating system
–Multiprogramming: Q = no. of instructions between task switches
–Cold-start vs. warm-start:
Cold-start: short transactions are created frequently and run quickly to completion
Warm-start: long processes are executed in time slices
Slide 12: Multi-Level Caches
Very useful for matching the processor to memory.
–Generally 2-level: for microprocessors, L1 is on-chip at the frequency of the pipeline, and L2 is off-chip with slower latency and bandwidth
Slide 13: Multi-Level Caches
Analysis by (statistical) inclusion:
–If the L2 cache is more than four times the size of the L1 cache, then we assume statistically that the contents of L1 lie in L2
–Relevant L2 miss rates:
Local miss rate: no. of L2 misses / no. of L2 references
Global miss rate: no. of L2 misses / no. of processor references
Solo miss rate: no. of misses without L1 present / no. of processor references
Inclusion => solo miss rate = global miss rate
–Miss penalty calculation:
L1 miss rate × (miss-in-L1, hit-in-L2 penalty)
plus L2 miss rate × (miss-in-L1-and-L2 penalty − the L1-to-L2 penalty already charged)
Slide 14: Multi-Level Cache Example
            L1    L2
Miss rate   4%    1%
Delays: miss in L1, hit in L2 = 2 cycles; miss in L1, miss in L2 = 7 cycles
CPI loss due to L1 is 1.5 refs/instr × 0.04 × 2 cycles/miss = 0.12
CPI loss due to L2 is 1.5 refs/instr × 0.01 × (7 − 2) cycles/miss = 0.075
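The slide's CPI-loss arithmetic, verified directly:

```python
# CPI loss from each cache level, per the slide's decomposition:
# L1 misses pay the L2-hit delay; L2 misses pay the additional delay beyond it.
refs_per_instr = 1.5

l1_loss = refs_per_instr * 0.04 * 2        # miss in L1, hit in L2: 2 cycles
l2_loss = refs_per_instr * 0.01 * (7 - 2)  # miss in both: 5 extra cycles
total = l1_loss + l2_loss
print(round(l1_loss, 3), round(l2_loss, 3), round(total, 3))  # 0.12 0.075 0.195
```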
Slide 15: Logical Inclusion
In multiprocessors it is important to know for certain that the L1 cache does NOT contain a line by determining that the L2 cache does not contain it.
–Reduces the latency and bandwidth consumed by snooping
–For this purpose we need to ensure that the entire contents of L1 are always in L2
–We call this property logical inclusion
Slide 16: Logical Inclusion
Techniques:
–Control cache size, organization, and policies:
No. of L2 sets ≥ no. of L1 sets
L2 set size ≥ L1 set size
Compatible replacement algorithms
But this is highly restrictive and difficult to guarantee.
–Back-invalidation: whenever a line is replaced or invalidated in L2, ensure that it is not present in L1, or evict it from L1
Slide 17: Outline: Memory and Queuing Models
Physical memory
–Memory technology
Simple memory performance models
–Hellerman
–Strecker
Slide 18: Outline: Memory and Queuing Models
Basic queuing models
–Terminology
–Approximations
–Key results
Application of queuing models to memory performance
–Interleaved memory
–Cache
–Bus
Slide 19: Memory
Processors are increasingly limited by memory, not by processor organization or cycle time. Memory is characterized by three parameters:
–size
–access time (latency)
–cycle time (bandwidth)
Slide 20: Physical Memory System (figure)
Slide 21: Achieved vs. Offered Bandwidth
Offered request rate:
–the rate at which the processor(s) would make requests if memory had unlimited bandwidth and no contention
Slide 22: DRAM Technology (text section 6.2)
DRAM cell:
–A capacitor stores charge for the 0/1 state
–A transistor switches the capacitor onto the bit line
–Charge decays over time => refresh required
Slide 23: DRAM Technology (text section 6.2)
DRAM chip:
–Row and column addresses are multiplexed
–Row/column strobes (RAS/CAS) provide timing
DRAM array:
–Stores 2^n bits in a square array
–2^(n/2) row lines connect to transistor gates
–2^(n/2) column bit lines, each with a sense amp
Slide 24: Technology and Market Trends
Market trends:
–Demand for minimum/incremental memory is growing only ~30% per year for PCs
–Bandwidth requirements are growing rapidly, driven by multimedia (esp. video and graphics) => caches do not help
–Fill frequency captures this trend:
the rate at which it is possible to read/write all of memory
Fill frequency = bandwidth [MB/s] / memory size [MB]
Example PC: 500 MB/s for 32 MB => 15.6 Hz fill frequency
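The fill-frequency calculation as a one-liner, using the slide's example PC:

```python
def fill_frequency_hz(bandwidth_mb_per_s, size_mb):
    """Rate at which all of memory could be read/written, in Hz."""
    return bandwidth_mb_per_s / size_mb

# The slide's example PC: 500 MB/s of bandwidth over 32 MB of memory.
print(fill_frequency_hz(500, 32))  # 15.625
```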
Slide 25: Technology and Market Trends
Technology trends:
–Capacity increases 60% per year
–Access time decreases 7% per year
–Cost/bit decreases 26% per year
As DRAM density increases, we need more bandwidth from fewer memory chips with few pins.
Slide 26: Techniques to Increase DRAM Bandwidth
–Fast page mode
–Synchronous DRAM
–Rambus (RDRAM)
Slide 27: Fast Page Mode (FPM)
Page mode => save the most recently accessed row (the "page")
–Only need the column address and CAS to access within the page
Fast page mode:
–A counter built into the RAM supports sequential accesses
–Only need CAS for sequential accesses
–With extended data-out (EDO), can cycle at 33-50 MHz
Slide 28: Synchronous DRAM (SDRAM)
–A clocked (synchronous) interface enables pipelined access
–Internally implemented with dual banks for higher throughput
–Clock rates of 100-150 MHz are possible
Double data rate (DDR SDRAM):
–proposal to transfer 2 data bits per clock
–target 200 Mb/s at 100 MHz
–technically feasible, but demanding system requirements
Slide 29: Rambus DRAM (RDRAM)
A channel interface bus replaces the row/column address pins:
–9 data signals, 4 clock and control signals
–Protocol for transferring address and data bursts
–Clock frequency currently 266 MHz, with 533 MB/s
–Up to 32 RDRAMs per channel
–Die area overhead ~10% for a 16 Mb DRAM
SLDRAM is an open standard with similar technology.
(From IEEE Micro, 11/97)
Slide 32: Memory Module
A module consists of the DRAM chips that make up the physical memory word, plus memory controller, timing, and bus driver chips. If each DRAM is organized as 2^n words × b bits and the memory has p bits per physical word, then the module has p/b DRAM chips.
Slide 33: Memory Module
The module holds 2^n words × p bits.
Parity or an error-correcting code (ECC) is generally required for error detection and availability.
–Text section 6.2.2 describes a code for single-error correction, double-error detection (SECDED)
–Requires roughly log2(p) + 1 extra bits
Slide 34: Memory System
Consists of multiple modules interleaved by low-order or high-order address bits (or both).
–Low-order interleaving improves memory bandwidth
–High-order interleaving improves availability
Slide 35: Processor-Memory Model
Assume that n processors each make one request per Tc to one of m memory modules; B(n,m) is the number of successes. Tc is the memory cycle time, Ta is the memory access time.
To the memory, one processor making n requests per Tc behaves like n processors each making 1 request per Tc.
Slide 36: Basic Terms
–B = B(m,n) or B(m) is the number of requests that succeed each Tc; it is the bandwidth normalized to Tc
–Ts is a more general term for service time; here Tc = Ts. Used in I/O models
–BW is the achieved bandwidth in requests serviced per second: BW = B/Ts = B(m,n)/Tc
Slide 37: Modeling and Evaluation Methodology
Identify the relevant physical parameters
–for memory: word size, module size, no. of modules, Tc, Ta
Find the offered bandwidth
–n/Tc
Find the bottleneck
–performance is limited by the most restrictive service point
Slide 38: Modeling and Evaluation Methodology
Determine the type of reference pattern
–sequential, strided, random
Select an appropriate model
–use the simplest possible
–evaluate the achieved bandwidth B(m,n) for the memory
Slide 39: Models for Computing B(m,n) (text 6.3)
Hellerman's: B(m) = √m
–A limited, unrealistic model: a single processor generates random references in order until a bank conflict
–Only of historical interest: "the square-root rule"
Strecker's (the null binomial)
Queue models
–open
–closed
–mixed
Slide 40: Strecker's Model
Model description:
–Each processor generates 1 reference per cycle
–Requests are random and uniformly distributed across modules
–Any busy module serves 1 request
–All unserviced requests are dropped each cycle; there are no queues
B(m,n) = m[1 − (1 − 1/m)^n]
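Strecker's formula is easy to evaluate; the module and processor counts below are illustrative, not from the slides:

```python
def strecker(m, n):
    """Expected requests serviced per cycle: B(m,n) = m[1 - (1 - 1/m)^n]."""
    return m * (1 - (1 - 1 / m) ** n)

# Illustrative: 4 processors over 8 interleaved modules.
print(round(strecker(8, 4), 3))  # 3.311
```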
Slide 41: Queuing Models (text 6.4)
(Figure: arrival process feeding a server; occupancy/utilization, time in system t, service time ts, waiting time tw. From The Art of Computer Systems Performance Analysis, Raj Jain, Fig. 30.2.)
Slide 42: Key Model Characteristics
Queuing models are characterized by a 3-tuple:
–arrival distribution
–service time distribution
–no. of servers
–e.g., G/G/1 for general arrival and service distributions with one server
Arrival distributions we will use:
–M_B, the binomial distribution
–M, the Poisson distribution
inter-arrival times are exponentially distributed
the limiting case of the binomial
Slide 43: Key Model Characteristics
Service distributions we will use:
–M, exponential service time
–D, deterministic (constant) service time
We will generally be looking for mean values
–time in system, utilization
Slide 44: Binomial Arrivals (text 6.4.1)
Description:
–n items enter a system which has m modules
–requests are randomly distributed across modules with equal probability p = 1/m (Bernoulli trials)
Probability that k out of n requests go to a specific module:
P_n(k) = C(n,k) p^k (1 − p)^(n−k)
m[1 − P_n(0)] = B(m,n) from Strecker's model
Slide 45: Poisson Distribution (text 6.4.1)
Assume n and m are very large; then p is small, but np, the expected number of arrivals at the server during T, is finite.
λ = np/T and p = λT/n
Then P(k) = [(λT)^k / k!] e^(−λT)
Slide 46: Queuing Properties (text 6.4.5)
No. of items in system = items in queue + items in service
–Mean values: N = Q + ρ
–Little's result: N = λT and Q = λT_w
Service distributions:
–Coefficient of variation c = σ/Ts
–For the exponential distribution, c = 1
–For the deterministic (constant) distribution, c = 0
Slide 47: Pollaczek-Khinchin (P-K) Theorem
For M/G/1:
–Mean waiting time: T_w = (1/λ)[ρ²(1 + c²)/(2(1 − ρ))]
–Mean items in queue: Q = λT_w = ρ²(1 + c²)/(2(1 − ρ))
Cases of interest:
–For M/M/1, c² = 1: T_w = (1/λ)[ρ²/(1 − ρ)], Q = ρ²/(1 − ρ)
–For M/D/1, c² = 0: T_w = (1/λ)[ρ²/(2(1 − ρ))], Q = ρ²/(2(1 − ρ))
–For M_B/D/1, c² = 0: define p = δ/m, where δ is the probability that a source makes a request; then T_w = (1/λ)[(ρ² − pρ)/(2(1 − ρ))], Q = (ρ² − pρ)/(2(1 − ρ))
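The M/M/1 and M/D/1 special cases can be compared numerically. A sketch with an assumed utilization ρ = 0.5:

```python
def pk_queue(rho, c2):
    """P-K mean queue length Q = rho^2 (1 + c2) / (2 (1 - rho)) for M/G/1."""
    return rho ** 2 * (1 + c2) / (2 * (1 - rho))

def pk_wait(lam, rho, c2):
    """Mean waiting time T_w = Q / lambda (Little's result)."""
    return pk_queue(rho, c2) / lam

rho = 0.5
print(pk_queue(rho, c2=1))  # 0.5   (M/M/1)
print(pk_queue(rho, c2=0))  # 0.25  (M/D/1: half the M/M/1 queue)
```

Deterministic service halves the mean queue relative to exponential service at the same utilization, which is why the M/D/1 forms appear throughout the memory models.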
Slide 48: Open vs. Closed Queues (text 6.5)
Open queue models:
–Arrival process is independent of queue size
–Processor is not slowed (or otherwise affected) by contention
–Queue size is unbounded
Closed queue models:
–Arrival rate slows as the queue grows
–If n items are offered and the system initially accepts only B, then the queue size Q = n − B
Slide 49: Open vs. Closed Queues (text 6.5)
Mixed queue models:
–Buffers allow the arrival rate to continue without slowdown (like an open queue) until the queue size reaches a threshold; then the arrival rate starts slowing down (like a closed queue)
–We use this later for I/O
Slide 50: Closed Queues (text 6.5.2)
From the P-K theorem, the total queue size is
N = ρ_a + ρ_a²(1 + c²)/(2(1 − ρ_a))
–Set N = n/m, since n/m requests per Tc are offered to each module
–ρ_a is the achieved occupancy
–and T_w = (1/λ)(ρ − ρ_a)/ρ_a
B(m,n) = m·ρ_a; solving for B(m,n) in the asymptotic case:
–for M/D/1: B(m,n) = m + n − √(n² + m²)
–for M_B/D/1: B(m,n) = m + n − 1/2 − √((m + n − 1/2)² − 2mn)
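The two asymptotic bandwidth formulas, evaluated for illustrative m and n (the 8-module, 4-requester numbers are my assumption, not from the slides):

```python
import math

def b_md1(m, n):
    """Asymptotic closed-queue bandwidth, M/D/1 case."""
    return m + n - math.sqrt(n ** 2 + m ** 2)

def b_mbd1(m, n):
    """Asymptotic closed-queue bandwidth, M_B/D/1 (binomial arrivals) case."""
    return m + n - 0.5 - math.sqrt((m + n - 0.5) ** 2 - 2 * m * n)

# Illustrative: 8 modules, 4 requesters.
print(round(b_md1(8, 4), 3))   # 3.056
print(round(b_mbd1(8, 4), 3))  # 3.239
```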
Slide 51: δ-Binomial Model
Suppose each of the n sources makes a request each Tc with probability δ.
M_B/D/1:
–B(m,n,δ) = m + nδ − δ/2 − √((m + nδ − δ/2)² − 2δmn)
This model is useful in many processor designs where the source is buffered or makes requests on a statistical basis.
If n is the mean request rate and z is the no. of sources, then δ = n/z.
The δ-binomial is the key model when we know the processor organization and buffer hit rates.
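A sketch of the δ-binomial formula; note that δ = 1 recovers the M_B/D/1 closed-queue result. The 8-module, 4-source numbers are illustrative assumptions:

```python
import math

def b_delta_binomial(m, n, delta):
    """B(m,n,delta) = m + n*delta - delta/2
                      - sqrt((m + n*delta - delta/2)^2 - 2*delta*m*n)."""
    t = m + n * delta - delta / 2
    return t - math.sqrt(t ** 2 - 2 * delta * m * n)

print(round(b_delta_binomial(8, 4, 1.0), 3))  # 3.239 (matches M_B/D/1)
print(round(b_delta_binomial(8, 4, 0.5), 3))  # 1.809 of the 2.0 requests offered per Tc
```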
Slide 52: δ-Binomial Example
System description:
–4 processors sharing a common data cache
Processor characteristics:
–Freq = 200 MHz
–Superscalar, capable of 4 IPC; achieved IPC in the absence of memory delays is 2
–Capable of 2 loads/stores per clock
–Average refs/instr = 0.2 read + 0.1 write
Memory cache:
–Tc = 5 ns, Ta = 10 ns
–8-way interleaved
Slide 53: Buffers for M/M/1 (text 6.6.2)
The M/M/1 queue has a simple expression for the probability of overflowing a buffer of size BF:
–Prob(Q > BF) = ρ^(BF+2)
–Prob(Q ≥ BF) = ρ^(BF+1)
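A small search for the smallest buffer meeting a target overflow probability, using the Prob(Q ≥ BF) = ρ^(BF+1) form; the ρ = 0.5 utilization and 1% target are assumed for illustration:

```python
def mm1_buffer_size(rho, p_target):
    """Smallest BF whose overflow probability rho^(BF+1) <= p_target."""
    bf = 0
    while rho ** (bf + 1) > p_target:
        bf += 1
    return bf

# Illustrative: 50% utilization, 1% overflow target.
print(mm1_buffer_size(0.5, 0.01))  # 6  (0.5^7 ~ 0.008 <= 0.01)
```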
Slide 54: Caches and Memory
On a miss, a cache may:
–block... the processor stops until the line is accessed
–partially block... the processor continues after a delay
–not block... the processor continues if it doesn't immediately need the data
Slide 55: Cache Line Access Time
–T_line_access is the time to retrieve a line from memory
–T_busy is the time during which the memory is busy while the processor/cache can make requests
–L: no. of physical words in a line
–Ta: access time for the first word
–T_page: sequential access time in fast page mode
–and the no. of sequential accesses possible in fast page mode
Slide 56: Write-Back Cache Example
Processor characteristics:
–Freq = 200 MHz
–Superscalar, capable of 4 IPC; achieved IPC in the absence of memory delays is 2
–Capable of 2 loads/stores per clock
–Average refs/instr = 0.2 read + 0.1 write
Data cache:
–Tc = 5 ns, Ta = 10 ns
–dual-ported
–64 B lines
–miss rate = 1%
Slide 57: Write-Back Cache Example (cont.)
Memory:
–Ta = 100 ns, Tc = 50 ns
–4-way interleaved
Bus:
–8 B wide, 100 MHz clock
Slide 58: Shared Bus
Suppose n processors each offer a bus occupancy ρ:
–ρ = bus transaction time / (processor time + bus transaction time)
Bus types:
–simple
–arbitrated
–multiplexed (address and data)
–split transaction
Slide 59: Bus Model (text 6.8.9)
Assuming resubmission of denied requests:
–B(ρ,n) = n·ρ_a; solve for ρ_a iteratively
–n·ρ_a = 1 − (1 − a)^n, where a is the effective per-processor request rate including resubmissions; initially let a = ρ, then iterate
Example: ρ = 0.3, n = 2
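One plausible reading of the iteration, sketched below with the slide's ρ = 0.3, n = 2 example; the specific resubmission update rule is my assumption:

```python
def bus_effective_rate(rho, n, iters=200):
    """Fixed-point iteration: inflate the effective request rate a past the
    offered rho until the achieved per-processor rate matches the demand."""
    a = rho
    for _ in range(iters):
        rho_a = (1 - (1 - a) ** n) / n  # achieved occupancy per processor
        a += rho - rho_a                # unserved fraction is resubmitted
    return a

# The slide's example: rho = 0.3, n = 2; converges to 1 - sqrt(0.4).
print(round(bus_effective_rate(0.3, 2), 4))  # 0.3675
```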
Slide 60: Memory System Evaluation Summary
Identify the relevant physical parameters
–word size, module size, no. of modules, Tc, Ta
Find the offered bandwidth
–n/Tc
Find the bottleneck
–performance is limited by the most restrictive service point
Next:
–Vector/multimedia processing
–Multiprocessing
–I/O
Slide 61: Memory System Evaluation Summary (cont.)
Select an appropriate model
–use the simplest possible
–the δ-binomial is the key model when we know the processor organization and buffer hit rates
–evaluate the achieved bandwidth B(m,n)
Next:
–Vector/multimedia processing
–Multiprocessing
–I/O