Alpha 21172 Core Logic Chip Set (Jerry Huang, Friday, October 10, 1997): Alpha 21172 Inside Out

Presentation transcript:

Slide #1 (Friday, October 10, 1997): Alpha 21172 Core Logic Chip Set
Alpha 21172 Inside Out
Zhihui Huang (Jerry), University of Michigan

Slide #2: Components
• One CA chip
  – Control, I/O, address chip (CIA)
  – 388 pins, plastic ball grid array (PBGA)
• Four BA chips
  – Data switch chip (DSW)
  – 208 pins, plastic quad flat pack (PQFP)

Slide #3: Data Paths
• 64-bit data path between CIA and DSW
  – iod
• 128-bit data path between the CPU and DSW
  – cpu_dat
• 256-bit memory data path between DSW and memory
  – mem_dat
The slowest part has the widest bus.

Slide #4: 3-way Interface
[Diagram: DRAM 1 to DRAM 8 connected to DSW0 to DSW3; 64-bit PCI bus; 64-bit IOD bus; signals addr, RAS, CAS, memadr, control]

Slide #5: Memory
[Diagram: DRAM 1 to DRAM 8, populated as SIMM 1 to SIMM 4 and SIMM 5 to SIMM 8]
The DRAM is contained in one bank of SIMMs, whether there are 4 SIMMs or 8 SIMMs.
• SIMMs 1 to 4: 4 SIMMs fill a data bus of 128 bits
• SIMMs 5 to 8: 8 SIMMs fill a data bus of 256 bits
Needs a jumper.

Slide #6: Memory block
[Diagram: DRAM 1 to DRAM 8 feeding DSW0 to DSW3 over a 256-bit path. Bit-slice labels: 15:0, 31:16, 47:32, 63:48, 79:64, 95:80, and so on in 16-bit slices up to bit 127]
A 256-bit block is composed of bit slices across all 8 SIMMs. The slices are interleaved across the 4 DSWs.
As you just saw, the 4 DSWs together provide the lower 128 bits of the memory bus. In the 256-bit configuration, the DSWs also provide the upper part of the bus.
It is better to use the 256-bit configuration; otherwise you pay the full price for the DSWs and use only half of their resources.
It may now be clear why this is a one-bank scheme in which all the SIMMs have the same size.

Slide #7: Bcache and Memory
• 3rd-level cache for the CPU
• Attributes
  – optional, external, physical, synchronous SRAM
  – direct-mapped, write-back, write-allocate
• 256-bit or 512-bit block
• Cache size of 1, 2, 4, 8, 16, 32, or 64 Mbytes
• Supports up to 512 MB of memory
  – 1MBx36, 2MBx36, 4MBx36, 8MBx36, 16MBx36
Write-back: a cache architecture in which data is written to main memory only when it is forced out of the cache. The opposite of write-through.
Block size: the Scache and Bcache block size is either 64 bytes or 32 bytes. The Scache and Bcache always have identical block sizes. All Bcache and main memory FILLs and write transactions are of the selected block size.
Direct-mapped: a cache where the cache location for a given address is determined from the middle address bits. If the cache line size is 2^n, then the bottom n address bits are the offset within a cache entry. If the cache can hold 2^m entries, then the next m address bits give the cache location. The remaining top address bits are stored as a TAG along with the entry. In this scheme there is no choice of which block to flush on a cache miss, since there is only one place for any block to go. The disadvantage of this simple scheme is that a program that alternately accesses different addresses mapping to the same cache location suffers a cache miss on every access to those locations. This kind of cache conflict is quite likely on a multi-processor.
Write-allocate: a cache line is allocated when written data misses the cache.
In the PC164: ECC protected.
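The direct-mapped lookup described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `split_address` and the example geometry (64-byte lines, a 1-Mbyte cache, hence 2^14 entries) are assumptions chosen for the demonstration, not details of the chip set's logic.

```python
def split_address(addr, n, m):
    """Split an address into (tag, index, offset) for a direct-mapped cache
    with 2**n-byte lines and 2**m entries."""
    offset = addr & ((1 << n) - 1)         # bottom n bits: offset within the line
    index = (addr >> n) & ((1 << m) - 1)   # next m bits: which cache entry
    tag = addr >> (n + m)                  # remaining top bits: stored as the TAG
    return tag, index, offset


# Assumed geometry for illustration: 64-byte lines (n=6), 1-Mbyte cache (m=14).
N_BITS, M_BITS = 6, 14

a = 0x123440
b = a + (1 << (N_BITS + M_BITS))  # exactly one cache size further on

tag_a, index_a, offset_a = split_address(a, N_BITS, M_BITS)
tag_b, index_b, offset_b = split_address(b, N_BITS, M_BITS)

# Same index, different tag: a and b compete for the same cache entry.
print(index_a == index_b, tag_a != tag_b)
```

Two addresses that differ by a multiple of the cache size always share an index, which is the conflict case the slide warns about.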

Slide #8: PCI features
• Supports 64-bit PCI bus width
• Supports 64-bit PCI addressing (DAC cycles)
• Accepts PCI fast back-to-back cycles
  – addr, data0, data1, data2, ..., addr_again!
  – FRAME# is deasserted for only one cycle to allow the last transfer to finish
• Issues PCI fast back-to-back cycles in dense address space
[Timing diagram: clk, FRAME#, addr/data]

Slide #9: CIA Transactions
• Memory read miss
• Memory read miss with victim
• I/O read
• I/O write
• DMA read
• DMA read (prefetch)
• DMA write

Slide #10: DSW Data Paths
[Diagram: SYS, Memory, Bcache, DMA 0 and DMA 1 buffer sets (IOD, MEM, PCI, Flush), Instruction Queue, Victim Path, Read Miss Path. IO paths not shown]

Slide #11: DSW Buffers
• DMA Buffer Sets (0 and 1)
  – PCI buffer for PCI DMA write data
  – Memory buffer for memory data
  – Flush buffer for system bus data
[Diagram: DMA 0 and DMA 1, each with IOD, MEM, PCI, and Flush]

Slide #12: DMA Writes
• Data arrives in the PCI buffer
• Memory buffer is loaded at the same time
• Bcache line is flushed and the Flush buffer is loaded
• The 3 sources are merged and the data is written back to memory
[Diagram: DMA 0 buffer set (IOD, MEM, PCI, Flush), Memory, Bcache]
As you just saw, the DMA operation causes the PCI buffer to be loaded from the IOD bus, the memory buffer to be loaded from memory, and the flush buffer to be loaded from the system bus, all at the same time.
Then the 3 sources are merged and written back to main memory.
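One way to picture the three-way merge is the Python sketch below. The slide does not say how the merge is prioritised, so the ordering used here is an assumption: the flushed Bcache line, if there is one, replaces the stale memory copy, and the bytes arriving from PCI overwrite whatever they cover. The function name `merge_dma_write` and the dictionary of PCI bytes are hypothetical.

```python
def merge_dma_write(mem_buf, flush_buf, pci_bytes):
    """Merge the three DMA-write sources into one block to write back.

    mem_buf   -- block contents read from memory (bytes)
    flush_buf -- block contents flushed from the Bcache, or None if no line was flushed
    pci_bytes -- dict mapping byte offset -> new value written by the PCI device

    Assumed priority (not stated on the slide): PCI data > flushed line > memory.
    """
    block = bytearray(flush_buf if flush_buf is not None else mem_buf)
    for offset, value in pci_bytes.items():
        block[offset] = value
    return bytes(block)


# Example with a 64-byte block: memory holds zeros, the Bcache held 0xAA bytes,
# and the PCI device writes only the first two bytes.
merged = merge_dma_write(bytes(64), bytes([0xAA]) * 64, {0: 0x01, 1: 0x02})
print(merged[:4].hex())
```

The point of the sketch is only that a partial PCI write cannot go straight to memory: the rest of the block has to come from the most recent copy, wherever that lives.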

Slide #13: Read Transaction
• If the read hits in the Bcache, no memory access is required
[Diagram: Memory, Bcache, Read Miss Path, SYS, MEM. Hit: read data goes back to the CPU]

Slide #14: Read Miss
• If a read does not hit in the Bcache, a memory access is involved
[Diagram: Memory, Bcache, Read Miss Path, SYS, MEM, CIA command, BA. Miss: data is read from memory and goes back to the CPU]

Slide #15: Read Miss With Victim
• Two scenarios
  – Writing data with a different address tag into a valid cache line (write allocate!)
  – Reading data with a different address tag into a valid cache line (read allocate!)
[Diagram: Memory, Bcache, Read Miss Path, Victim Path, SYS, MEM, CIA command. Miss: write data, merge data]
Reading the missed block and writing the victim block are indivisible in the logic design.
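The difference between a plain read miss and a read miss with victim can be illustrated with a toy model of a direct-mapped, write-back, write-allocate cache. The class below is a sketch, not the CIA's logic. It assumes that only a modified (dirty) line produces a victim that must be written back; the slide itself says only "valid". The class name and the 64-byte line / 2^14 entry geometry are also assumptions.

```python
class DirectMappedCache:
    """Toy direct-mapped, write-back, write-allocate cache (tags only, no data)."""

    def __init__(self, n, m):
        self.n = n       # log2 of the line size in bytes
        self.m = m       # log2 of the number of entries
        self.lines = {}  # index -> (tag, dirty)

    def access(self, addr, write=False):
        index = (addr >> self.n) & ((1 << self.m) - 1)
        tag = addr >> (self.n + self.m)
        line = self.lines.get(index)
        if line is not None and line[0] == tag:
            # Hit: a write marks the line dirty, a read leaves the flag alone.
            self.lines[index] = (tag, line[1] or write)
            return "hit"
        # Miss: the new block is always allocated (read allocate and write allocate).
        # Assumption: a victim write-back is needed only if the old line is dirty.
        victim = line is not None and line[1]
        self.lines[index] = (tag, write)
        return "miss with victim" if victim else "miss"


cache = DirectMappedCache(n=6, m=14)
one_cache_size = 1 << 20
results = [
    cache.access(0x1000, write=True),          # empty entry: plain miss
    cache.access(0x1000),                      # same block: hit
    cache.access(0x1000 + one_cache_size),     # same index, dirty line evicted
    cache.access(0x1000 + 2 * one_cache_size), # same index, clean line evicted
]
print(results)
```

In the real design the read of the missed block and the write of the victim are a single indivisible transaction; the model only reports which kind of transaction would be issued.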

Slide #16: Traffic Jam on MEM bus
[Diagram: SYS, Memory, Bcache, DMA 0 and DMA 1 buffer sets (IOD, MEM, PCI, Flush), Instruction Queue, Victim Path, Read Miss Path. IO paths not shown]
Let's think about this scenario: during a PCI DMA transfer, memory READs and WRITEs happen at the same time.
All the circled parts compete for this resource. Some cause a read miss, others a read miss with victim.
Don't forget that instruction fetch uses memory too.

Slide #17: How Fast Can DMA Be?
[Diagram: SYS, DMA 0, DMA 1, IOD, MEM, PCI, Flush]
• 2 fetches and 2 writes to memory per DMA
  – DRAM: 64 bytes / 240 ns = 266 Mbytes/s (60 ns DRAM, 256-bit bus)
  – PCI: 8 bytes / 30 ns = 266 Mbytes/s (33 MHz PCI, 64-bit bus)
33 MHz PCI has the same speed as the DRAM! Can we really do this?
Overhead, retries, read lines, read lines with victim, and instruction fetches all share the same bandwidth!
It turns out that in the worst case only 17 MBytes/s is achieved, just above the bottom line.
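The peak figures on this slide are simple arithmetic, reproduced below as a short Python check. The helper name `mbytes_per_s` is made up for the example, and 1 Mbyte is taken as 10^6 bytes, which is the convention that matches the slide's 266 Mbytes/s. The 240 ns figure is read here as four 60 ns DRAM accesses (2 fetches plus 2 writes), which is an interpretation of the slide rather than something it states outright.

```python
def mbytes_per_s(nbytes, nanoseconds):
    """Bandwidth in Mbytes/s (1 Mbyte = 10**6 bytes) for nbytes moved in the given time."""
    return nbytes / nanoseconds * 1000.0


# DRAM side: 64 bytes per 4 accesses of 60 ns each (assumed reading of "240 ns").
dram_peak = mbytes_per_s(64, 4 * 60)

# PCI side: 8 bytes (64-bit bus) per 30 ns clock, the slide's figure for 33 MHz.
pci_peak = mbytes_per_s(8, 30)

print(f"DRAM peak: {dram_peak:.1f} Mbytes/s")
print(f"PCI peak:  {pci_peak:.1f} Mbytes/s")
```

Both come out at roughly 266.7 Mbytes/s, so the PCI bus alone could in principle consume all of the memory bandwidth. That is why the shared traffic on the previous slide matters so much.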

Slide #18: Performance of the MB2PCI
• Worst case
  – 29.9 MBytes/s: no interference
  – 25.5 MBytes/s: read line, instruction fetch
  – 17.5 MBytes/s: read line, read line with victim, instruction fetch
• Best case
  – 95 MBytes/s: no interference
  – 80 MBytes/s: read line, instruction fetch
  – 72 MBytes/s: read line, read line with victim, instruction fetch

Slide #19: Conclusion
• If we want to improve
  – Use a 256-bit cache block instead of 512-bit
  – Will there be a next-version chip that supports a 512-bit memory bus?
  – Are there DRAM chips faster than 60 ns?
  – Can we afford a 64M Bcache (SRAM)?
There is a trade-off here. With a smaller block, the CPU will generate more cache miss cycles and may slow down. On the other hand, for a DMA transfer in which only 128 bits of data are transferred, the overhead of a 512-bit memory read goes away; only a 256-bit read is needed. This improves the worst-case performance.