
Alpha 21172 Core Logic Chip Set (Jerry Huang, Friday, October 10, 1997)


Slide #1: Alpha 21172 Inside Out
Zhihui Huang (Jerry), University of Michigan

Slide #2: Components
• One 21172-CA chip
  – Control, I/O, address chip (CIA)
  – 388 pins, plastic ball grid array (PBGA)
• Four 21172-BA chips
  – Data switch chip (DSW)
  – 208 pins, plastic quad flat pack (PQFP)

Slide #3: Data Paths
• 64-bit data path between CIA and DSW
  – iod
• 128-bit data path between 21164 and DSW
  – cpu_dat
• 256-bit memory data path between DSW and memory
  – mem_dat
Note: The slowest part has the widest bus.

Slide #4: 3-way Interface
[Diagram labels: 21164, 21172, DRAM 1 to DRAM 8, DSW0 to DSW3, 64-bit PCI bus, 64-bit IOD bus, addr, RAS, CAS, memadr, control]

Slide #5: Memory
[Diagram labels: DRAM 1 to DRAM 8, SIMM 1 to SIMM 4, SIMM 5 to SIMM 8]
• The DRAM is contained in one bank of SIMMs, whether there are 4 SIMMs or 8 SIMMs.
• 4 SIMMs fill a data bus of 128 bits.
• 8 SIMMs fill a data bus of 256 bits.
• Needs a jumper.

Slide #6: Memory block
[Diagram labels: DRAM 1 to DRAM 8, DSW0 to DSW3, 256-bit; bit slices 15:0, 31:16, 47:32, 63:48, 79:64, 95:80, 111:96, 127:112]
• A 256-bit block is composed of bit slices across all 8 SIMMs. The slices are interleaved across the 4 DSWs.
• As you just saw, the 4 DSWs together provide the lower 128 bits of the memory bus. In the 256-bit configuration, the DSWs also provide the upper part of the bus.
• It is better to use the 256-bit configuration; otherwise you pay the full price for the DSWs and use only half of their resources.
• It may now be clear why this is a one-bank scheme in which all the SIMMs have the same size.

Slide #7: Bcache and Memory
• 3rd-level cache for the 21164
• Attributes
  – optional, external, physical, synchronous SRAM
  – direct-mapped, write-back, write-allocate
• 256-bit or 512-bit block
• Cache size of 1, 2, 4, 8, 16, 32, or 64 Mbytes
• Supports up to 512 MB of memory
  – 1MBx36, 2MBx36, 4MBx36, 8MBx36, 16MBx36
• Callouts: in the PC164; ECC protected

Write-back: A cache architecture in which data is only written to main memory when it is forced out of the cache. Opposite of write-through.

Block size: The Scache and Bcache block size is either 64 bytes or 32 bytes. The Scache and Bcache always have identical block sizes. All Bcache and main memory FILLs and write transactions are of the selected block size.

Direct-mapped: A cache where the cache location for a given address is determined from the middle address bits. If the cache line size is 2^n, then the bottom n address bits correspond to an offset within a cache entry. If the cache can hold 2^m entries, then the next m address bits give the cache location. The remaining top address bits are stored as a TAG along with the entry. In this scheme, there is no choice of which block to flush on a cache miss, since there is only one place for any block to go. This simple scheme has the disadvantage that if the program alternately accesses different addresses which map to the same cache location, it will suffer a cache miss on every access to these locations. This kind of cache conflict is quite likely on a multi-processor.

Write-allocate: A cache line is allocated when a memory write misses the cache.
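The direct-mapped address split described above can be sketched in a few lines. This is only an illustration: the function name `split_address` and the 2 MB / 64-byte geometry are assumptions chosen from the ranges listed on the slide, not values taken from a particular board.

```python
def split_address(addr: int, line_bytes: int, num_lines: int) -> tuple[int, int, int]:
    """Split an address into (tag, index, offset) for a direct-mapped cache.

    line_bytes must be 2^n and num_lines must be 2^m (both powers of two).
    """
    n = line_bytes.bit_length() - 1      # bottom n bits: offset within the line
    m = num_lines.bit_length() - 1       # next m bits: cache location (index)
    offset = addr & (line_bytes - 1)
    index = (addr >> n) & (num_lines - 1)
    tag = addr >> (n + m)                # remaining top bits are stored as the TAG
    return tag, index, offset


# Assumed example geometry: a 2 MB Bcache with 64-byte blocks.
LINE_BYTES = 64
NUM_LINES = (2 * 1024 * 1024) // LINE_BYTES

a = 0x123440
b = a + 2 * 1024 * 1024   # exactly one cache size away from a

print(split_address(a, LINE_BYTES, NUM_LINES))
print(split_address(b, LINE_BYTES, NUM_LINES))
```

The two addresses share an index but differ in tag, so alternating between them evicts the other block every time. That is the conflict case the definition warns about.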

Slide #8: PCI features
• Supports 64-bit PCI bus width
• Supports 64-bit PCI addressing (DAC cycles)
• Accepts PCI fast back-to-back cycles
  – addr, data0, data1, data2, ..., addr_again!
  – FRAME# is deasserted for only one cycle to allow the last transfer to finish
• Issues PCI fast back-to-back cycles in dense address space
[Timing diagram labels: clk, Frame#, addr, data]

Slide #9: CIA Transactions
• 21164 memory read miss
• 21164 memory read miss with victim
• 21164 I/O read
• 21164 I/O write
• DMA read
• DMA read (prefetch)
• DMA write

Slide #10: DSW Data Paths
[Diagram labels: 21164, BCache, Memory, SYS, MEM, IOD, PCI, DMA 0, DMA 1, Flush, Instruction Queue, Victim Path, Read Miss Path. IO paths not shown.]

Slide #11: DSW Buffers
• DMA Buffer Sets (0 and 1)
  – PCI buffer for PCI DMA write data
  – Memory buffer for memory data
  – Flush buffer for system bus data
[Diagram labels: IOD, MEM, PCI, Flush, DMA 0, DMA 1]

Slide #12: DMA Writes
• Data arrives in the PCI buffer
• Memory buffer loaded at the same time
• Bcache line flushed and Flush buffer loaded
• 3 sources merged and data written back to memory
[Diagram labels: DMA 0, IOD, MEM, PCI, Flush, Memory, 21164, BCache]

Notes: As you just saw, the DMA operation causes the PCI buffer to be loaded from the IOD bus, the memory buffer to be loaded from memory, and the flush buffer to be loaded from the system bus, all at the same time. Then the 3 sources are merged and written back to main memory.
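One way to picture the three-source merge is the sketch below. The slides do not say how the DSW prioritizes the sources, so the ordering here is an assumption: new PCI write data wins for the bytes it covers, flushed Bcache data (if the line was cached) wins over memory for the rest. The function name `merge_dma_write` and the per-byte mask are hypothetical.

```python
from typing import Optional, Sequence


def merge_dma_write(mem: bytes,
                    flush: Optional[bytes],
                    pci: bytes,
                    pci_mask: Sequence[bool]) -> bytes:
    """Merge the memory, flush, and PCI buffers into one block (assumed priority).

    mem      -- block contents read from main memory
    flush    -- block contents flushed from the Bcache, or None if not cached
    pci      -- DMA write data from the PCI buffer
    pci_mask -- True for each byte position actually written by the DMA
    """
    merged = bytearray(mem)
    if flush is not None:
        merged[:] = flush                 # assumed: cached copy is newer than memory
    for i, written in enumerate(pci_mask):
        if written:
            merged[i] = pci[i]            # assumed: DMA data overrides both
    return bytes(merged)


block = merge_dma_write(
    mem=bytes(8),
    flush=bytes([1] * 8),
    pci=bytes([9, 9, 0, 0, 0, 0, 0, 0]),
    pci_mask=[True, True] + [False] * 6,
)
print(list(block))
```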

Slide #13: 21164 Read Transaction
• On a Bcache hit, no memory access is required.
[Diagram labels: Memory, 21164, BCache, Read Miss Path, SYS, MEM, "HIT!!", Read data, Data back to CPU]

Slide #14: 21164 Read Miss
• If a read misses in the Bcache, a memory access is involved.
[Diagram labels: Memory, 21164, BCache, Read Miss Path, SYS, MEM, Read data, Data back to CPU, 21172 CIA, Command, 21172 BA, "Miss!!"]

Slide #15: Read Miss With Victim
• Two scenarios
  – write data with a different address tag into a valid cache line (write allocate!!)
  – read data with a different address tag into a valid cache line (read allocate!!)
• Reading the missed block and writing the victim block are indivisible in the logic design.
[Diagram labels: Memory, 21164, BCache, Read Miss Path, Victim Path, SYS, MEM, Write data, Merge data, 21172 CIA, Command, "Miss!!"]
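A toy model makes the victim case concrete. This is a sketch, not the 21164 or 21172 logic: the class `DirectMappedCache` is hypothetical, and it assumes that only a modified (dirty) line has to be written back as a victim, whereas the slide states the condition simply as a valid line with a different tag.

```python
from dataclasses import dataclass


@dataclass
class Line:
    valid: bool = False
    dirty: bool = False
    tag: int = 0


class DirectMappedCache:
    """Toy direct-mapped, write-back, write-allocate cache (illustrative only)."""

    def __init__(self, line_bytes: int, num_lines: int) -> None:
        self.line_bytes = line_bytes
        self.num_lines = num_lines
        self.lines = [Line() for _ in range(num_lines)]

    def access(self, addr: int, is_write: bool) -> str:
        block = addr // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        line = self.lines[index]
        if line.valid and line.tag == tag:
            line.dirty = line.dirty or is_write
            return "hit"
        # Miss: the old line must go. Assumed: only a dirty line is a victim.
        has_victim = line.valid and line.dirty
        # Both reads and writes allocate the line (read allocate / write allocate).
        self.lines[index] = Line(valid=True, dirty=is_write, tag=tag)
        return "miss with victim" if has_victim else "miss"


cache = DirectMappedCache(line_bytes=64, num_lines=4)
results = [
    cache.access(0, is_write=True),     # cold miss, line becomes dirty
    cache.access(0, is_write=False),    # same block again
    cache.access(256, is_write=False),  # same index, different tag
    cache.access(0, is_write=False),    # evicts a clean line
]
print(results)
```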

Slide #16: Traffic Jam on MEM bus
[Diagram labels: 21164, BCache, Memory, SYS, MEM, IOD, PCI, DMA 0, DMA 1, Flush, Instruction Queue, Victim Path, Read Miss Path. IO paths not shown.]
• Let's think about this scenario: during a PCI DMA transfer, memory READs and WRITEs are happening at the same time.
• All the circled parts compete for this resource.
• Callouts on the diagram: "Cause read miss with victim", "Cause read miss".
• Don't forget that instruction fetch uses memory too.

Slide #17: How Fast can DMA be?
[Diagram labels: SYS, DMA 0, DMA 1, IOD, MEM, PCI, Flush]
• 2 fetches and 2 writes to memory per DMA
  – Memory: 64 bytes / 240 ns = 266 Mbytes/s (60 ns DRAM, 256-bit bus)
  – PCI: 8 bytes / 30 ns = 266 Mbytes/s (33 MHz PCI, 64-bit bus)
• 33 MHz PCI has the same speed as the DRAM!! Can we really do this??
• Overhead, retries, read lines, read lines with victim, and instruction fetch all share the same bandwidth!!
• It turns out that in the worst case 17 MBytes/s is achieved, just above the bottom line.
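The peak-rate arithmetic on this slide can be checked with a short calculation. The constants come from the slide (60 ns DRAM, four memory accesses per 64-byte DMA block, 30 ns PCI clock, 8 bytes per clock); the helper name `mbytes_per_s` is made up for this sketch, and the last figure simply relates the quoted worst case to the peak.

```python
def mbytes_per_s(num_bytes: float, time_ns: float) -> float:
    """Bytes per nanosecond equals decimal GB/s; scale by 1000 to get MB/s."""
    return num_bytes / time_ns * 1000.0


DRAM_ACCESS_NS = 60
ACCESSES_PER_DMA_BLOCK = 4      # 2 fetches + 2 writes, as stated on the slide
DMA_BLOCK_BYTES = 64

PCI_CLOCK_NS = 30               # 33 MHz PCI
PCI_BYTES_PER_CLOCK = 8         # 64-bit bus

memory_rate = mbytes_per_s(DMA_BLOCK_BYTES, ACCESSES_PER_DMA_BLOCK * DRAM_ACCESS_NS)
pci_rate = mbytes_per_s(PCI_BYTES_PER_CLOCK, PCI_CLOCK_NS)

# Quoted worst case from the slide, as a fraction of the theoretical peak.
worst_case_fraction = 17.0 / memory_rate

print(f"memory side: {memory_rate:.1f} MB/s")
print(f"PCI side:    {pci_rate:.1f} MB/s")
print(f"worst case is {worst_case_fraction:.1%} of peak")
```

Both sides come out at roughly 266.7 MB/s, which is why the slide asks whether PCI can really be kept saturated once other memory traffic is added.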

Slide #18: Performance of the MB2PCI
• Worst case
  – 29.9 MBytes/s (no interference)
  – 25.5 MBytes/s (read line, instruction fetch)
  – 17.5 MBytes/s (read line, read line with victim, instruction fetch)
• Best case
  – 95 MBytes/s (no interference)
  – 80 MBytes/s (read line, instruction fetch)
  – 72 MBytes/s (read line, read line with victim, instruction fetch)

Slide #19: Conclusion
• If we want to improve
  – Use a 256-bit cache block instead of 512-bit
  – Is there a next version of the 21172 chip that supports a 512-bit memory bus?
  – Are there DRAM chips faster than 60 ns?
  – Can we afford a 64 MB Bcache (SRAM)?

Trade-off: With a smaller block, the 21164 will generate more cache miss cycles and may slow down. On the other hand, for a DMA transfer where only 128 bits of data are transferred, there is no longer the overhead of a 512-bit memory read; only a 256-bit read is needed. This improves the worst-case performance.
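The DMA side of this trade-off is simple arithmetic: how many bits must be read from memory for each bit the DMA actually transfers. The sketch below uses the slide's 128-bit transfer and the two block sizes; the function name `read_amplification` is hypothetical, and the CPU-side cost (extra cache misses with smaller blocks) is not modeled.

```python
def read_amplification(block_bits: int, transfer_bits: int) -> float:
    """Bits read from memory per bit of DMA data actually transferred."""
    return block_bits / transfer_bits


TRANSFER_BITS = 128

for block_bits in (512, 256):
    factor = read_amplification(block_bits, TRANSFER_BITS)
    print(f"{block_bits}-bit block: {factor:.0f}x memory read per DMA transfer")
```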

