1
EV7
Peter Bannon, Staff Fellow, HP
pbannon@hp.com
2
Alpha Microprocessor Roadmap
The 1 GHz version of the EV68C processor has been shipping in products since July of this year. A 1.25 GHz upgrade of that device will begin shipping early in 2002. These products are built in a 0.18 µm bulk copper process at IBM and Samsung Electronics. Toward the end of 2002 we expect to introduce products using EV7, which will be built using IBM's 0.18 µm bulk copper process. Note the significant increase in die size, pin count, and power for the EV7 device. EV79 will be a direct shrink of EV7 using IBM's 0.13 µm copper SOI process.
3
Alpha 21264 core with enhancements Integrated L2 Cache
EV7 Features
Alpha core with enhancements
Integrated L2 cache
Integrated memory controller
Integrated network interface
The chip will start with an enhanced version of the core. We will add an integrated L2 cache, a Direct Rambus memory controller, and a network interface. The chip will support lock-step operation to enable high-availability systems.
4
EV7 System Block Diagram
[Block diagram: twelve EV7 processor nodes (labeled 364) connected in a 2D torus, each with local memory (M) and an I/O connection (IO)]
Here's the block diagram of a 12-processor system using the 2D torus topology. Each processor may have its own local memory and may have its own local I/O connection. It is possible for a processor to operate in the system without memory or I/O if that is attractive. Using this topology, EV7 will support systems with up to 128 processors and a maximum memory of 4 TB.
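The wrap-around connectivity of a 2D torus is easy to express in software. The sketch below computes the four compass-point neighbours of each node, assuming a 3 x 4 arrangement for the 12 processors shown; the dimensions and the row-major node numbering are illustrative assumptions, not EV7's actual node naming.

/* Sketch of 2D-torus neighbour computation.  The 3 x 4 arrangement of
 * the 12 processors and the row-major node numbering are illustrative
 * assumptions, not EV7's actual node naming. */
#include <stdio.h>

#define ROWS 3
#define COLS 4

/* Node id = row * COLS + col; wrap-around in each dimension gives the
 * torus links to the N, S, E, and W neighbours. */
static int north(int id) { return ((id / COLS + ROWS - 1) % ROWS) * COLS + id % COLS; }
static int south(int id) { return ((id / COLS + 1) % ROWS) * COLS + id % COLS; }
static int west(int id)  { return (id / COLS) * COLS + (id % COLS + COLS - 1) % COLS; }
static int east(int id)  { return (id / COLS) * COLS + (id % COLS + 1) % COLS; }

int main(void) {
    for (int id = 0; id < ROWS * COLS; id++)
        printf("node %2d: N=%2d S=%2d E=%2d W=%2d\n",
               id, north(id), south(id), east(id), west(id));
    return 0;
}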
5
[Die photo labels: N, S, E, W, Router, Mem1, Mem0, L2 Data, L2 Tag, EV68 Core, I/O]
Next I'd like to give an overview of the EV7 device. The die is 4 cm² and is pictured here after processing through metal 1. In the lower center part of the die is the EV68C core. This is the same set of polygons used in the current EV68C products, with some changes to support up to 16 cache block requests and larger virtual pages, up to 512 MB in size. The load/store bus is located under the label on the core. It is the key bottleneck when moving data in and out of the memory system. It connects the four integer and two floating-point pipelines to the L1 data cache. It is also used to move fill data into the cache and/or registers. I'll speak more about this in the L2 section of the talk.
The data array for the L2 cache is split on either side of the core, 64 bits per side. The L2 tags are located directly above the core, close to the L2 cache control, which is located within the core (top center of the core). The two RDRAM memory controllers are at the top of the die on either side of the router. The data paths for the memory controllers are located on either side of the L2 tag array. Each controller sends data to/from the RDRAM using the interfaces located on either side of the core. Each controller uses four RDRAM channels in parallel, with an optional fifth channel for parity.
The router is located at the top center of the die. In the center of the router you can see the crossbar running vertically through it. The four compass points are at the top, two on each side. Within a compass point you can see the control structures in the half closest to the crossbar, with the buffer storage located in the outer half of the compass point. Below the compass points on the left are the local ports to the L2 cache, Mem0, and Mem1. The I/O port is located on the bottom right, with the control and status register master located in the far bottom right corner of the router. The router table is located in the bottom left corner of the router. It contains 128 entries with 1 write port and 5 read ports. The drivers and receivers for the compass points are located just above the RDRAM I/O. The interface to the I/O port is stretched across the top of the die.
6
1.75 MB, 7-way set associative, with ECC
Integrated L2 Cache
1.75 MB, 7-way set associative, with ECC
20 GB/s total read/write bandwidth
16 victim buffers for L1 -> L2
16 victim buffers for L2 -> memory
9.6 ns load-to-use latency
Tag access can start every cycle
Data access in 4-cycle blocks
Coupled tag/data access to minimize latency
Decoupled tag access to minimize resource use
The 1.75 MB, 7-way set associative L2 cache has a 12-cycle load-to-use latency. This latency is set by the existing control in the core and is used to significantly reduce the power consumption of the L2 array. The L2 cache can read or write 16 bytes/cycle at 1.25 GHz, resulting in 20 GB/s of read or write bandwidth. The array is protected by a single-error-correct, double-error-detect ECC code. A tag access can start every cycle, while data accesses are allocated in blocks of 4 to read a full 64-byte cache block. The controller uses the additional tag bandwidth to probe the cache while the data array is busy. This allows misses to be sent to memory early. Hits are rescheduled when the data array becomes free. L1 data misses are processed as coupled loads if the tag and data arrays are free. This provides the minimum hit latency at the expense of speculatively allocating the load/store bus. If the data array is busy, or the request is for Istream or a store miss, a decoupled reference is used. In this case the full tag access is completed before requesting the load/store bus and the data array. While this increases the hit latency, it reduces the wasted cycles on the load/store bus. Bandwidth on the load/store bus is also lost when switching between processing load/store requests for the core and doing fills. To reduce this cost, the L2 controller will never insert one or two cycles between fills, because these cycles cannot be used by the core to process additional loads or stores. Instead, if one or two cycles are needed, the controller will insert 4. This policy increased the performance of memory-intensive applications by 10%.
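As a quick sanity check on the latency and bandwidth bullets above, assuming the 1.25 GHz clock quoted in the notes:
\[
t_{\text{load-to-use}} = \frac{12\ \text{cycles}}{1.25\ \text{GHz}} = 9.6\ \text{ns},
\qquad
BW_{L2} = 16\ \tfrac{\text{B}}{\text{cycle}} \times 1.25\ \text{GHz} = 20\ \tfrac{\text{GB}}{\text{s}}.
\]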
7
Two Integrated Memory Controllers
RDRAM memory
Directly connected to the processor
High data capacity per pin
800 Mb/s operation
75 ns load-to-use latency
12.8 GB/s peak bandwidth
6 GB/s read or write bandwidth
2048 open pages
64-entry directory-based cache coherence engine
ECC SECDED
Optional 4+1 parity in memory
The chip contains two integrated RDRAM memory controllers. RDRAM provides high data storage capacity per pin along with outstanding bandwidth and latency. The load-to-use latency is 75 ns. The memory controller will provide 6 GB/s of read or write bandwidth to the core. With 2 GFLOPS, the chip provides 3 bytes/FLOP of usable memory bandwidth, a significant improvement over current systems. To reduce memory latency the memory controller will track 2K open pages in the RDRAM array. A directory-based cache coherence protocol is an integral part of the memory controller. The memory is protected by a single-error-correct, double-error-detect ECC code. Errors are corrected inline without any additional latency or reduction in bandwidth. RAID provides even more protection, allowing the machine to survive the failure of entire RIMM modules. After a small delay to find the error (<100 cycles), the chip resumes full bandwidth and latency while doing the correction. For comparison: EV56 provided ... MB/s / 1.2 GFLOPS = 0.4 B/FLOP; EV6 provided ... MB/s / 1.2 GFLOPS = 0.83 B/FLOP.
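The bandwidth bullets above are self-consistent; as a rough check, assuming the standard 16-bit-wide Direct Rambus data channel and the two-controllers-by-four-channels arrangement described on the previous slides:
\[
BW_{\text{peak}} = 2 \times 4 \times 16\ \text{bits} \times 800\ \tfrac{\text{Mb}}{\text{s}} = 12.8\ \tfrac{\text{GB}}{\text{s}},
\qquad
\frac{6\ \text{GB/s}}{2\ \text{GFLOPS}} = 3\ \tfrac{\text{B}}{\text{FLOP}}.
\]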
8
ZBox Block Diagram: Data Path, DRAM Scheduling, Cache Coherence Engine
[ZBox block diagram: Cache Coherence Engine (CC state machine, 32 memory references, MAP), DRAM Scheduling (PRQ, RSQ, WCAS, RCAS, slot queues, row/column address out), and Data Path (data in/out to the core, directory in/out, check/correct, remap)]
Each of the two controllers is split into three sections: the cache coherence engine, the DRAM scheduler, and the data path. Memory requests arrive at the cache coherence engine, which has storage for 32 transactions. This logic makes cache block read and write requests of the RDRAM array. The scheduling hardware converts the physical address into device, bank, row, and column addresses. It also checks the status of the requested device, bank, and row. This information is used to place the request into the precharge, RAS, or CAS queues. If the request conflicts with a request already in progress, it is rejected to be retried at a later time. This allows the controller to search through many of the current requests looking for work that can be scheduled with the existing load. The CAS queue is divided into read and write requests to allow batching of reads and writes. This reduces the amount of bandwidth wasted by bus turnarounds. The design tries to strike a balance between processing requests in the order received, to minimize latency and preserve the programmer's intent, and issuing requests out of order to maximize bandwidth. As requests arrive, they are placed in a FIFO which ensures that each request attempts to access the memory in order. After the initial access, the requests are processed in any order.
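A rough software analogue of the scheduling decision described above is sketched below. The address field positions, queue names, and the simple three-state bank model are illustrative assumptions, not the ZBox's actual encodings.

/* Software sketch of the scheduling decision described above: decode a
 * physical address into device/bank/row/column fields, then decide
 * whether the request goes to the precharge, RAS, read-CAS, or
 * write-CAS queue, or must be rejected and retried because the bank is
 * busy.  Field positions, widths, and the three-state bank model are
 * illustrative assumptions. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { unsigned device, bank, row, column; bool is_write; } mem_request_t;
typedef enum { BANK_IDLE, BANK_ROW_OPEN, BANK_BUSY } bank_state_t;
typedef enum { Q_PRECHARGE, Q_RAS, Q_READ_CAS, Q_WRITE_CAS, RETRY_LATER } queue_t;

static mem_request_t decode(uint64_t pa, bool is_write) {
    mem_request_t r;
    r.device   = (pa >> 33) & 0x1f;   /* assumed field positions */
    r.bank     = (pa >> 29) & 0xf;
    r.row      = (pa >> 17) & 0xfff;
    r.column   = (pa >> 6)  & 0x7f;
    r.is_write = is_write;
    return r;
}

static queue_t schedule(mem_request_t req, bank_state_t state, unsigned open_row) {
    if (state == BANK_BUSY)
        return RETRY_LATER;           /* conflict: reject, retry later          */
    if (state == BANK_IDLE)
        return Q_RAS;                 /* row must be activated before any CAS   */
    if (open_row != req.row)
        return Q_PRECHARGE;           /* close the open row before a new RAS    */
    /* Row hit: reads and writes use separate CAS queues so they can be
     * batched, cutting the bandwidth lost to bus turnarounds. */
    return req.is_write ? Q_WRITE_CAS : Q_READ_CAS;
}

int main(void) {
    mem_request_t r = decode(0x123456789ULL, false);
    printf("device %u bank %u row %u column %u -> queue %d\n",
           r.device, r.bank, r.row, r.column, (int)schedule(r, BANK_IDLE, 0));
    return 0;
}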
9
Integrated Network Interface
Direct processor-to-processor interconnect
4 links, 6.4 GB/s per link
32 bits + ECC at 800 Mb/s in each direction
18 ns processor-to-processor latency
ECC, single error correct, double error detect, per hop
Out-of-order network with adaptive routing
IO, Request, Forward, Special, and Response channels
V0, V1, and Adaptive virtual networks
Asynchronous clocking between processors
3 GB/s I/O interface per processor
The integrated network interface allows multiprocessor systems to be built using a 2D torus topology. No additional system logic is required. Each link is built from two unidirectional buses. Each bus contains 32 data wires, 7 bits of ECC, two differential clocks, and two reference voltage signals. The links are clocked at 400 MHz using both edges of the clock to send data. The links allow for significant module etch, connectors, and up to 4 meters of cable to form the connections between processors. Each hop in the network of a 64P machine will take an average of 18 ns. ECC is checked and corrected at each hop. Single-bit errors are corrected inline without bandwidth or latency penalties. The network moves data and control packets from the source to the destination. It does not guarantee ordering. Adaptive routing of packets allows the network to detect and avoid hot spots and busy links. Asynchronous clocking between processors removes the need to distribute a low-skew clock within a large system. A fifth port provides up to 3 GB/s of bandwidth to industry-standard buses: PCI, PCI-X, and AGP.
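Checking the per-link figure against the wire-level description above (32 data wires per direction, 400 MHz clock, both edges used):
\[
800\ \tfrac{\text{Mb}}{\text{s}} = 400\ \text{MHz} \times 2\ \text{edges},
\qquad
BW_{\text{link}} = 2\ \text{directions} \times 32\ \text{bits} \times 800\ \tfrac{\text{Mb}}{\text{s}} = 51.2\ \tfrac{\text{Gb}}{\text{s}} = 6.4\ \tfrac{\text{GB}}{\text{s}}.
\]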
10
Rbox Block Diagram
[Rbox block diagram: input-buffered crossbar with ports E, W, N, S, IO, C, Z0, Z1, L0, L1; roughly 1 KB of packet buffering per port, split across the Request, Forward, Block Response, Response, Write I/O, Read I/O, and Special classes and the V0, V1, and Adaptive pools]
The router uses input buffering. At each hop in the network, the source must have a buffer credit before it can send a packet. The packets are divided into V0, V1, and adaptive buffer pools. There are pools for 5 classes of cache coherence traffic: IO, Request, Forward, Broadcast, and Response. Each compass point contains storage for 53 packets. Each input queue has two read ports that allow it to transmit packets faster than they can be received. This is important to allow queue delay to be pressed out of the network when it builds up during blocking events. Each of the 8 inputs is connected to all of the outputs except itself. For example, it is not possible to loop back a packet from the N input port to the N output port. All of the ports except IO are required to run at the same speed. The speed of the I/O link can be reduced to accommodate standard ASIC designs. This requires input packets from the IO port to arrive completely before moving into the network, and an output buffer on the I/O port.
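A minimal sketch of the credit-based flow control described above follows; the only number taken from the talk is the 53-packet input buffer, and everything else (names, the toy traffic pattern) is an illustrative assumption.

/* Minimal sketch of credit-based flow control as described above: a
 * sender may transmit a packet only while it holds a buffer credit for
 * the downstream input queue; the credit is returned when the receiver
 * drains the packet. */
#include <stdbool.h>
#include <stdio.h>

#define INPUT_BUFFER_SLOTS 53   /* per-compass-point packet storage from the talk */

typedef struct { int credits; } link_t;

static bool try_send(link_t *l) {
    if (l->credits == 0)
        return false;            /* no credit: packet must wait at the source */
    l->credits--;                /* consume one downstream buffer slot */
    return true;
}

static void credit_return(link_t *l) {
    l->credits++;                /* receiver drained a packet: slot free again */
}

int main(void) {
    link_t north = { .credits = INPUT_BUFFER_SLOTS };
    int sent = 0, blocked = 0;
    for (int i = 0; i < 60; i++)         /* try to send 60 packets back to back */
        try_send(&north) ? sent++ : blocked++;
    credit_return(&north);               /* downstream frees one buffer...      */
    if (try_send(&north)) sent++;        /* ...so one more packet can go        */
    printf("sent=%d blocked=%d\n", sent, blocked);
    return 0;
}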
11
while (p) p=*p;
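The slide shows only this pointer-chasing loop, the classic inner loop of a load-to-use latency measurement: each load's address comes from the previous load, so iterations cannot overlap. A self-contained version of such a microbenchmark might look like the sketch below; the chain length, stride, and POSIX timing call are assumptions, not the setup used for the results in this talk.

/* Pointer-chasing latency microbenchmark built around the loop on the
 * slide.  Chain length, stride, and timing method are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NODES (1u << 20)             /* chain far larger than the caches */

int main(void) {
    void **chain = malloc(NODES * sizeof *chain);
    if (!chain) return 1;

    /* Link every slot into one long chain with a large, odd stride so
     * successive loads land far apart; the last slot holds NULL. */
    size_t idx = 0, stride = 4097;
    for (size_t n = 1; n < NODES; n++) {
        size_t next = (idx + stride) % NODES;
        chain[idx] = &chain[next];
        idx = next;
    }
    chain[idx] = NULL;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = (void **)chain[0];
    while (p) p = *p;                /* the loop from the slide: each load
                                        depends on the previous one, so the
                                        time per iteration is the load-to-use
                                        latency of wherever the data lives */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per load (final p=%p)\n", ns / (NODES - 1), (void *)p);
    free(chain);
    return 0;
}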
12
13
14
CPU INT 2000
15
CPU FP 2000
16
Database Performance
17
18
19
64P Running TTOY, 32 memory controllers
20
64P Running TTOY, 64 memory controllers
21