Slide #1: Friday, October 10, 1997, Alpha Core Logic Chip Set, Jerry Huang
Alpha Inside Out
Zhihui Huang (Jerry), University of Michigan
Slide #2: Components
One CA chip
–Control, I/O, and address chip (CIA)
–388 pins, plastic ball grid array (PBGA)
Four BA chips
–Data switch chip (DSW)
–208 pins, plastic quad flat pack (PQFP)
Slide #3: Data Paths
64-bit data path between CIA and DSW
–iod
128-bit data path between the CPU and DSW
–cpu_dat
256-bit memory data path between DSW and memory
–mem_dat
The slowest part has the widest bus
Slide #4: 3-way Interface
[Diagram labels: DRAM 1 to DRAM 8; DSW0, DSW1, DSW2, DSW3; 64-bit PCI bus; 64-bit IOD bus; addr; RAS; CAS; memadr; control]
Slide #5: Memory
[Diagram labels: DRAM 1 to DRAM 8]
The DRAM is contained in one bank of SIMMs, whether there are 4 SIMMs or 8 SIMMs.
SIMM 1 to SIMM 4: 4 SIMMs fill the data bus
SIMM 5 to SIMM 8: 8 SIMMs fill the data bus
Needs a jumper
Slide #6: Memory block
[Diagram labels: DRAM 1 to DRAM 8; DSW0, DSW1, DSW2, DSW3; bit slices 15:0, 31:16, 47:32, 63:48, 79:64, 95:80, 111:96, 127:112]
A 256-bit block is composed of bit slices across all the 8 SIMMs. The arrangement of the slices is interleaved within the 4 DSWs.
As you can see, the 4 DSWs together provide the lower 128 bits of the memory bus. For the 256-bit configuration, the DSWs also provide the upper part of the bus.
It is better to use the 256-bit configuration; otherwise you pay the full price for the DSWs and use only half of the resources.
It may be clear now why it is a one-bank scheme in which all the SIMMs have the same size.
Slide #7: Bcache and Memory
3rd-level cache for the CPU
Attributes
–optional, external, physical, synchronous SRAM
–direct-mapped, write-back, write-allocate
256-bit or 512-bit block
Cache size of 1, 2, 4, 8, 16, 32, or 64 Mbytes
Supports up to 512 MB of memory
–1MBx36, 2MBx36, 4MBx36, 8MBx36, 16MBx36
In the PC164: ECC protected
Write-back: a cache architecture in which data is only written to main memory when it is forced out of the cache. Opposite of write-through.
Block size: the Scache and Bcache block size is either 64 bytes or 32 bytes. The Scache and Bcache always have identical block sizes. All Bcache and main memory FILLs or write transactions are of the selected block size.
Direct-mapped: a cache where the cache location for a given address is determined from the middle address bits. If the cache line size is 2^n, then the bottom n address bits correspond to an offset within a cache entry. If the cache can hold 2^m entries, then the next m address bits give the cache location. The remaining top address bits are stored as a TAG along with the entry. In this scheme, there is no choice of which block to flush on a cache miss, since there is only one place for any block to go. This simple scheme has the disadvantage that if the program alternately accesses different addresses that map to the same cache location, it will suffer a cache miss on every access to these locations. This kind of cache conflict is quite likely on a multi-processor.
Write-allocate: a cache line is allocated when a memory write misses the cache.
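The direct-mapped definition above can be sketched as a small address split. This is only an illustration: the function name, the 64-byte line (n = 6), and the 1-Mbyte cache (m = 14) are example values chosen from the sizes listed on the slide, not a description of the actual hardware logic.

```python
def split_address(addr: int, line_bits: int, index_bits: int):
    """Split an address into (tag, index, offset) for a direct-mapped cache.

    line_bits  = n, where the line size is 2^n bytes
    index_bits = m, where the cache holds 2^m entries
    """
    offset = addr & ((1 << line_bits) - 1)                  # bottom n bits
    index = (addr >> line_bits) & ((1 << index_bits) - 1)   # next m bits
    tag = addr >> (line_bits + index_bits)                  # remaining top bits
    return tag, index, offset


# Example values (assumptions): 64-byte lines, 1-Mbyte cache -> 2^14 entries.
LINE_BITS = 6
INDEX_BITS = 14

a = 0x123440
b = a + (1 << (LINE_BITS + INDEX_BITS))  # exactly one cache size further on

tag_a, index_a, _ = split_address(a, LINE_BITS, INDEX_BITS)
tag_b, index_b, _ = split_address(b, LINE_BITS, INDEX_BITS)

# Same index, different tag: a and b compete for the same cache location.
print(index_a == index_b, tag_a != tag_b)
```

Two addresses that differ by a multiple of the cache size land on the same entry, which is the conflict case the slide warns about.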
Slide #8: PCI features
Supports 64-bit PCI bus width
Supports 64-bit PCI addressing (DAC cycles)
Accepts PCI fast back-to-back cycles
–addr, data0, data1, data2, ..., addr_again!
–Frame# is deasserted for only one cycle to allow the last transfer to finish
Issues PCI fast back-to-back cycles in dense address space
[Diagram labels: addr, data, clk, Frame#]
Slide #9: CIA Transactions
Memory read miss
Memory read miss with victim
I/O read
I/O write
DMA read
DMA read (prefetch)
DMA write
Slide #10: DSW Data Paths
[Diagram labels: SYS, MEM, IOD, Memory, BCache, DMA 0, DMA 1, PCI, Flush, Victim Path, Read Miss Path, Instruction Queue. IO paths not shown.]
Slide #11: DSW Buffers
DMA Buffer Sets (0 and 1)
–PCI buffer for PCI DMA write data
–Memory buffer for memory data
–Flush buffer for system bus data
[Diagram labels: IOD, MEM, PCI, Flush, DMA 0, DMA 1]
Slide #12: DMA Writes
Data arrives in the PCI buffer
Memory buffer loaded at the same time
Bcache line flushed and Flush buffer loaded
3 sources merged and data written back to memory
[Diagram labels: DMA 0, IOD, MEM, PCI, Flush, Memory, BCache]
As you can see, the DMA operation causes the PCI buffer to be loaded from the IOD bus, the memory buffer to be loaded from memory, and the flush buffer to be loaded from the system bus, all at the same time.
Then the 3 sources are merged and written back to main memory.
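The three-source merge can be pictured with a small sketch. The slides do not state the merge rule, so the priority used here (new PCI bytes first, then flushed Bcache data, then memory data) is an assumption, as are the function name and the per-byte valid mask.

```python
from typing import Optional, Sequence


def merge_dma_write(pci: bytes,
                    pci_valid: Sequence[bool],
                    flush: Optional[bytes],
                    memory: bytes) -> bytes:
    """Merge one block from the PCI, Flush, and Memory buffers.

    Assumed priority (not stated on the slides):
      1. bytes actually written by the PCI device (pci_valid[i] is True)
      2. the flushed Bcache line, if the flush returned data
      3. the copy read from main memory
    """
    if len(pci) != len(memory) or len(pci_valid) != len(memory):
        raise ValueError("all buffers must cover the same block")
    if flush is not None and len(flush) != len(memory):
        raise ValueError("flush buffer must cover the same block")

    base = flush if flush is not None else memory
    return bytes(p if valid else b
                 for p, valid, b in zip(pci, pci_valid, base))


# Hypothetical 8-byte block: the device writes only the first 4 bytes.
memory_data = bytes([0] * 8)
flush_data = bytes([1] * 8)
pci_data = bytes([9] * 8)
mask = [True] * 4 + [False] * 4

merged = merge_dma_write(pci_data, mask, flush_data, memory_data)
```

Under these assumptions the untouched half of the block comes from the flushed cache line when there is one, and from memory otherwise.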
Slide #13: Read Transaction
If the read hits in the Bcache, no memory access is required.
[Diagram labels: Memory, BCache, Read Miss Path, SYS, MEM, HIT!!, Read data, Data back to CPU]
Slide #14: Read Miss
If a read does not hit in the Bcache, a memory access is involved.
[Diagram labels: Memory, BCache, Read Miss Path, SYS, MEM, Read data, Data back to CPU, CIA, Command, BA, Miss!!]
Slide #15: Read Miss With Victim
Two scenarios
–write data with a different address tag into a valid cache line (write allocate!!)
–read data with a different address tag into a valid cache line (read allocate!!)
[Diagram labels: Memory, BCache, Read Miss Path, Victim Path, SYS, MEM, Write data, Merge data, CIA, Command, Miss!!]
Reading the missed block and writing the victim block are indivisible in the logic design.
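A toy model can make the difference between a plain read miss and a read miss with victim concrete. This is a sketch, not the CIA logic: the class and field names are made up, and it assumes that only a dirty line has to be written back as a victim (the usual write-back behavior), whereas the slide only says "valid cache line".

```python
from dataclasses import dataclass


@dataclass
class Line:
    valid: bool = False
    dirty: bool = False
    tag: int = 0


class DirectMappedCache:
    """Direct-mapped, write-back, write-allocate cache (tags only, no data)."""

    def __init__(self, line_bits: int, index_bits: int):
        self.line_bits = line_bits
        self.index_bits = index_bits
        self.lines = [Line() for _ in range(1 << index_bits)]

    def access(self, addr: int, is_write: bool) -> str:
        index = (addr >> self.line_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.line_bits + self.index_bits)
        line = self.lines[index]

        if line.valid and line.tag == tag:
            line.dirty = line.dirty or is_write
            return "hit"

        # Miss: the line is (re)allocated for reads and for writes alike.
        # Assumption: a victim write-back is needed only if the old line is dirty.
        kind = "read miss with victim" if (line.valid and line.dirty) else "read miss"
        line.valid, line.tag, line.dirty = True, tag, is_write
        return kind


cache = DirectMappedCache(line_bits=6, index_bits=2)
a = 0x000            # index 0, tag 0
b = 1 << 8           # index 0, tag 1: conflicts with a

results = [
    cache.access(a, is_write=True),    # cold miss, line becomes dirty
    cache.access(a, is_write=False),   # same line again
    cache.access(b, is_write=False),   # evicts the dirty line
    cache.access(a, is_write=False),   # evicts a clean line
]
```

In the third access the old block has to be written back while the new one is fetched, which is the pair of operations the slide describes as indivisible.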
Slide #16: Traffic Jam on MEM bus
[Diagram labels: SYS, MEM, IOD, Memory, BCache, DMA 0, DMA 1, PCI, Flush, Victim Path, Read Miss Path, Instruction Queue. IO paths not shown.]
Let's think about this scenario: during a PCI DMA transfer, memory READs and WRITEs are happening at the same time.
All the circled parts compete for this resource.
–Cause read miss with victim
–Cause read miss
Don't forget that instruction fetch uses memory too.
Slide #17: How Fast can DMA be?
[Diagram labels: SYS, MEM, IOD, DMA 0, DMA 1, PCI, Flush]
2 fetches and 2 writes to memory per DMA
–DRAM, 60 ns: 64 bytes / 240 ns = 266 Mbytes/s
–PCI, 33 MHz: 8 bytes / 30 ns = 266 Mbytes/s
33 MHz PCI has the same speed as DRAM!! Can we really do this??
Overhead, retries, read lines, read line with victim, and instruction fetch all share the same bandwidth!!
It turns out that in the worst case only 17 MBytes/s is achieved, just above the bottom line.
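The two peak figures on this slide are plain arithmetic and can be checked in a few lines. The helper name is made up, and 1 Mbyte is taken as 10^6 bytes, which is what the slide's rounding to 266 implies.

```python
def mbytes_per_second(num_bytes: float, nanoseconds: float) -> float:
    """Bytes per nanosecond converted to Mbytes/s (1 Mbyte = 10**6 bytes)."""
    return num_bytes / nanoseconds * 1000.0


# Memory side: 2 fetches + 2 writes at 60 ns each move 64 bytes in 240 ns.
memory_time_ns = 4 * 60
dram_peak = mbytes_per_second(64, memory_time_ns)

# PCI side: 8 bytes per 30 ns clock (30 ns is the nominal 33 MHz period).
pci_peak = mbytes_per_second(8, 30)

print(f"DRAM: {dram_peak:.1f} Mbytes/s, PCI: {pci_peak:.1f} Mbytes/s")
```

Both sides come out equal, so any extra memory traffic (read misses, victims, instruction fetch) takes bandwidth directly away from DMA.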
Slide #18: Performance of the MB2PCI
Interference                                          Worst case       Best case
None                                                  29.9 MBytes/s    95 MBytes/s
Read line, instruction fetch                          25.5 MBytes/s    80 MBytes/s
Read line, read line with victim, instruction fetch   17.5 MBytes/s    72 MBytes/s
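To put these measurements in context, they can be compared with the 266 Mbytes/s peak quoted on the previous slide. The sketch below only divides the numbers from the two slides; the dictionary keys and function name are illustrative.

```python
PEAK_MBYTES_PER_S = 266.0  # peak figure from the "How Fast can DMA be?" slide

# (worst case, best case) in MBytes/s, as listed on this slide.
measured = {
    "no interference": (29.9, 95.0),
    "read line, instruction fetch": (25.5, 80.0),
    "read line, read line with victim, instruction fetch": (17.5, 72.0),
}


def fraction_of_peak(rate: float, peak: float = PEAK_MBYTES_PER_S) -> float:
    """Measured rate as a fraction of the theoretical peak."""
    return rate / peak


for label, (worst, best) in measured.items():
    print(f"{label}: worst {fraction_of_peak(worst):.1%}, "
          f"best {fraction_of_peak(best):.1%}")
```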
Slide #19: Conclusion
If we want to improve
–Use a 256-bit cache block instead of 512-bit
–Is there a next-version chip that supports a 512-bit memory bus?
–Are there DRAM chips faster than 60 ns?
–Can we afford a 64M Bcache (SRAM)?
There is a trade-off here: with a smaller block, the CPU will generate more cache miss cycles and may slow down. On the other hand, for a DMA transfer in which only 128 bits of data are transferred, there is no longer a 512-bit memory read overhead; there is only a 256-bit read. This improves the worst-case performance.
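The DMA side of this trade-off can be expressed as the share of each memory read that the transfer actually uses. This is a rough sketch that assumes memory is always read in whole cache blocks; the function name is hypothetical.

```python
def useful_fraction(transfer_bits: int, block_bits: int) -> float:
    """Fraction of the bits read from memory that the DMA transfer needs.

    Assumes memory is always read in whole blocks of block_bits.
    """
    blocks = -(-transfer_bits // block_bits)  # ceiling division
    return transfer_bits / (blocks * block_bits)


with_512_bit_block = useful_fraction(128, 512)
with_256_bit_block = useful_fraction(128, 256)

print(f"512-bit block: {with_512_bit_block:.0%} of the read is useful")
print(f"256-bit block: {with_256_bit_block:.0%} of the read is useful")
```

The sketch covers only the DMA overhead; the extra CPU cache misses caused by the smaller block are the other half of the trade-off and are not modeled here.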