PIII Data Stream
- Power Saving Modes
- Buses
- Memory Order Buffer
- System Cache
- Memory Hierarchy
- L1 Cache
- L2 Cache
Power Saving Modes
- AutoHALT: uses 10-15% of available power
- Deep Sleep: uses less than 2% of available power
AutoHALT resumes much faster than Deep Sleep: resume time is measured in nanoseconds for AutoHALT and in microseconds for Deep Sleep. Deep Sleep is “low power/high latency”, while AutoHALT is “low latency/high power”. Thus, Deep Sleep is frequently reserved for the “power-on suspend” feature on notebook computers. The new Quickstart feature, available on the newer PIII mobile processors, uses 5% of available power, which is approximately 2 watts.
Power Saving Modes
BCLK is the system bus clock, which provides the clock for the processor and the on-die L2 cache.
Bus Interface “Backside” cache
PIII Buses At-a-Glance
Address Bus Width: 36 bit
Data Bus Width: 64 bit
Dual Independent Bus (DIB), dedicated for L2: 64+8 bit (0.25 µm), 256+32 bit (0.18 µm)
PIII System Bus
- 133 MHz
- ECC error checking
- Supports multiple processors
- 4 write back buffers
- 6 fill buffers
- 8 bus queue entries
- Synchronous latched bus
- BCLK signal clocks the bus and L2 cache
Some studies say that the system bus is only 25% utilized, which would suggest that 4 processors could share the same bus. Intel claims that only 2 processors can be used in a multiprocessing environment. Write back buffers are use
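As a rough check on the system bus figures, the arithmetic can be sketched in a few lines of Python. The bus width (64 bit) and clock (133 MHz) come from the slides; the function names and the assumption of one transfer per bus clock are illustrative only.

```python
# Figures taken from the slides: 64-bit data bus, 133 MHz bus clock.
DATA_BUS_BITS = 64
BUS_CLOCK_HZ = 133_000_000


def peak_bandwidth_bytes_per_s(bus_bits, clock_hz, transfers_per_clock=1):
    """Peak bytes per second, assuming one transfer per bus clock (assumption)."""
    return (bus_bits // 8) * clock_hz * transfers_per_clock


def processors_supported(utilization_per_cpu):
    """How many CPUs fit on the bus if each uses the given fraction of it."""
    return int(1 / utilization_per_cpu)


peak = peak_bandwidth_bytes_per_s(DATA_BUS_BITS, BUS_CLOCK_HZ)
print(peak / 1e6, "MB/s peak")
print(processors_supported(0.25), "processors at 25% utilization each")
```

This is only the theoretical peak; it ignores arbitration and contention between processors, which may be part of why Intel's stated limit is lower than the utilization figure suggests.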
PIII Bus Enhancements
- Pentium II write buffers
- Removed dead cycle
- Using all fill buffers as WC fill buffers
The PII write buffer was designed for instantaneous bandwidth, not the average case. The SSE instructions demand high average throughput because they require large amounts of streaming data. In the PII, a dead cycle existed between back-to-back write-combining writes. The Intel designers removed this dead cycle to increase the throughput for SSE instructions. The Pentium III processor's write combining has been implemented in such a way that its memory cluster allows all fill buffers to be utilized as write-combining fill buffers at the same time, as opposed to the Pentium II processor, which allows just one. To support this enhancement, the following WC eviction conditions, as well as all Pentium® Pro WC eviction policies, are implemented in the Pentium III processor:
- A buffer is evicted when all bytes are written (all dirty) to the fill buffer. Previously the buffer eviction policy was “resource demand” driven, i.e. a buffer was evicted when the DCU requested the allocation of a new buffer.
- When all fill buffers are busy, a DCU fill buffer allocation request (such as a regular load, store, or prefetch requiring a fill buffer) can evict a WC buffer even if it is not full yet.
These enhancements improved the bus write bandwidth by 20%.
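The two eviction conditions above can be illustrated with a toy model. This is a sketch, not Intel's implementation: the class and method names are hypothetical, the buffer count (6) and line size (32 bytes) are borrowed from the other slides, and picking the oldest buffer as the victim under resource demand is an assumption.

```python
LINE_BYTES = 32        # line size from the L1 slides (assumed to match the buffers)
NUM_FILL_BUFFERS = 6   # fill buffer count from the system bus slide


class FillBuffers:
    """Toy model in which every fill buffer may act as a write-combining buffer."""

    def __init__(self, count=NUM_FILL_BUFFERS):
        self.count = count
        self.wc = {}        # line address -> set of dirty byte offsets
        self.evicted = []   # (line address, reason) in eviction order

    def wc_write(self, addr, size=1):
        """Write-combining store; bytes past the end of the line are ignored."""
        line = addr - addr % LINE_BYTES
        if line not in self.wc:
            self._make_room()
            self.wc[line] = set()
        start = addr % LINE_BYTES
        self.wc[line].update(range(start, min(start + size, LINE_BYTES)))
        if len(self.wc[line]) == LINE_BYTES:   # all bytes dirty: evict at once
            self._evict(line, "full")

    def demand_allocate(self):
        """A regular load, store, or prefetch asks for a fill buffer.

        The model only tracks the eviction it causes, not the new occupant.
        """
        self._make_room()

    def _make_room(self):
        if len(self.wc) >= self.count:         # all buffers busy
            oldest = next(iter(self.wc))       # assumption: oldest goes first
            self._evict(oldest, "resource demand")

    def _evict(self, line, reason):
        del self.wc[line]
        self.evicted.append((line, reason))


fb = FillBuffers()
for offset in range(0, LINE_BYTES, 8):
    fb.wc_write(0x1000 + offset, size=8)           # fills one line completely
for i in range(NUM_FILL_BUFFERS):
    fb.wc_write(0x2000 + i * LINE_BYTES, size=8)   # six partially written lines
fb.demand_allocate()                               # forces out a partial buffer
print(fb.evicted)
```

The first line leaves as soon as its last byte is written, while the partially written line at 0x2000 only leaves when another request needs its buffer.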
Memory Order Buffer (MOB)
- Load Buffer (LB): 16 entries
- Store Buffer (SB): 12 entries
- Re-dispatches µops
- Cache bandwidth
The memory order buffer is the main interface between the processor’s out-of-order core and the memory hierarchy. It consists of two buffers, the Load Buffer and the Store Buffer. The main function of the MOB is to keep track of all current memory accesses and re-dispatch them when necessary. When a load or store µop is issued to the RS, it is also allocated an entry in the Load or Store Buffer. The Load Buffer contains 16 entries, and the Store Buffer contains 12. If there’s a miss in the cache, we do not have to stall because the caches are non-blocking, which I will discuss later. Thus, when there’s a cache miss, the cache needs to fill
Memory Order Buffer (MOB)
- Re-ordering
  - Stores can not pass other loads or stores
  - Loads can pass other loads, but can not pass stores
- Store coloring
- Multiprocessing dilemma
Stores must be constrained from passing other stores, for only a small loss in performance. Stores must be constrained from passing loads, for an inconsequential performance loss. Constraining loads from passing other loads or stores has a significant impact on performance. Store coloring: tag each load with the store previous to it. This is used to enforce the policy that loads can’t pass stores. So, the MOB can reorder loads. Remember that once a load or store has reached the MOB, the RS has already checked for data dependencies. The MOB just keeps sequential consistency. Explain sequential consistency.
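The re-ordering rules and store coloring can be sketched as follows. This follows the rules exactly as stated on this slide; the `MemOp` record and the function names are hypothetical, and the real MOB logic is certainly more involved.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    seq: int          # position in program order
    kind: str         # "load" or "store"
    color: int = -1   # loads only: seq of the nearest earlier store, -1 if none


def color_loads(ops):
    """Store coloring: tag each load with the store previous to it."""
    last_store = -1
    for op in ops:
        if op.kind == "store":
            last_store = op.seq
        else:
            op.color = last_store
    return ops


def may_dispatch(op, pending):
    """True if `op` may access memory while the ops in `pending` are outstanding."""
    if op.kind == "store":
        # Stores pass nothing: every older load and store must be done.
        return all(p.seq > op.seq for p in pending)
    # Loads may pass older loads, but must wait for their colored store.
    # Stores complete in order, so checking against the color is enough.
    return all(p.seq > op.color for p in pending if p.kind == "store")


st0, ld1, ld2, st3, ld4 = color_loads([
    MemOp(0, "store"), MemOp(1, "load"), MemOp(2, "load"),
    MemOp(3, "store"), MemOp(4, "load"),
])
print(may_dispatch(ld2, [ld1]))   # load passing an older load
print(may_dispatch(ld1, [st0]))   # load trying to pass an older store
print(may_dispatch(st3, [ld2]))   # store trying to pass an older load
```

Because each load carries its color, the check never needs to scan program order: one comparison per pending store decides whether the load has to wait.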
PIII Cache Design
- Harvard architecture for L1
- Unified for L2
- Inclusive
Harvard architecture: separation between instruction cache and data cache. Advantages: the out-of-order core can fetch instructions from the Icache while the execution units fetch data from the Dcache. Access time also decreases because each respective cache is smaller. Disadvantage: when fetching data and instructions, you must decide in which cache to place them. L2: no need for separation. You wouldn’t gain anything, only add complexity.
Inclusive vs. Exclusive
- Inclusive: reduces effective size of lower level caches
- Exclusive: data resides in one cache
L1 Instruction Cache
- Non-blocking
- 16 KB
- 4-way associativity
- 32 bytes/line
- SI
- Fetch port
- Internal and external snoop port
- Least Recently Used
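The L1 parameters (16 KB, 4-way, 32-byte lines) fix the number of sets and how an address splits into tag, index, and offset. A small sketch of that arithmetic, assuming power-of-two sizes; the function names are illustrative.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, offset bits, index bits) for a power-of-two cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


sets, off_bits, idx_bits = cache_geometry(16 * 1024, ways=4, line_bytes=32)
print(sets, off_bits, idx_bits)
print(split_address(0x12345, off_bits, idx_bits))
```

With the 36-bit address bus from the buses slide, 5 offset bits and 7 index bits would leave 24 bits of tag per line.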
L1 Data Cache
- Non-blocking
- 16 KB
- 4-way associativity
- 32 bytes/line
- MESI
- Dual-ported
- Snoop port
- Write allocate
- Least Recently Used
MESI = Modified, Exclusive, Shared, Invalid
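For reference, the textbook MESI transitions for a single cache line can be written as a small function. This is the generic protocol, not a description of the PIII's exact implementation; the event names and the `others_have_copy` flag are assumptions made for the sketch.

```python
def mesi_next(state, event, others_have_copy=False):
    """Next MESI state of one line in this cache.

    state: "M", "E", "S" or "I"
    event: "local_read", "local_write" (this CPU) or
           "snoop_read", "snoop_write" (another agent seen on the bus)
    """
    if event == "local_read":
        if state == "I":                      # miss: fill the line
            return "S" if others_have_copy else "E"
        return state                          # hit: state unchanged
    if event == "local_write":
        return "M"                            # write allocate; other copies invalidated
    if event == "snoop_read":
        return "I" if state == "I" else "S"   # a Modified line is written back first
    if event == "snoop_write":
        return "I"
    raise ValueError(f"unknown event: {event}")


state = "I"
for event in ["local_read", "local_write", "snoop_read", "snoop_write"]:
    state = mesi_next(state, event)
    print(event, "->", state)
```

The instruction cache's SI scheme is the same idea restricted to the Shared and Invalid states, since instruction lines are not written through that cache.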
L2 Cache
- Discrete Level 2 Cache
- Advanced Transfer Cache
Discrete: off-die; slower, but more updateable. ATC: faster, but less updateable; takes up room on the chip, therefore smaller.
Discrete L2 Cache
- 512 KB+
- Off-die
- 64-bit bus
- 4-way set associativity
- Slower, but bigger
Less associativity makes access time faster, but allows for more misses.
Advanced Transfer Cache
- 256 KB
- On-die
- 256-bit bus
- 8-way associativity
- Faster, but smaller
On the Xeon models of the Pentium III, they want a high hit ratio and a faster access time.
L2 Cache Effects on Power
- On-die L2 decreases power/area
On-die L2 also decreases the available size for the cache.
Software Controlled Caching
- Streaming data trashes cache
- Skip levels in memory hierarchy
- Senior load
Lots of multimedia applications need to process large amounts of data, but only process them once. There’s no need for this data to evict more useful data already in the cache. With previous cache models, the Pentium II would simply keep the most recently accessed data in all levels of the cache. Yan mentioned earlier that the new SSE instructions can bypass levels of the cache.
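The effect of streaming data on a cache can be shown with a small simulation. This is a sketch under simplifying assumptions: a single fully associative LRU cache of 8 lines rather than the PIII's real hierarchy, and a hypothetical `non_temporal` flag standing in for the cache-bypassing behavior of the SSE instructions.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache that counts hits and misses."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()   # least recently used line first
        self.hits = 0
        self.misses = 0

    def access(self, line, non_temporal=False):
        if line in self.lines:
            self.lines.move_to_end(line)     # mark as most recently used
            self.hits += 1
            return
        self.misses += 1
        if non_temporal:
            return                           # streamed data is not kept
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used
        self.lines[line] = True


def hot_hits_after_stream(bypass):
    """Hits on a small hot set after streaming 100 lines through the cache."""
    cache = LRUCache(capacity_lines=8)
    hot = range(8)
    for line in hot:
        cache.access(line)                   # warm the cache
    for line in range(100, 200):
        cache.access(line, non_temporal=bypass)
    before = cache.hits
    for line in hot:
        cache.access(line)                   # re-use the hot data
    return cache.hits - before


print("hot hits, stream cached:  ", hot_hits_after_stream(bypass=False))
print("hot hits, stream bypassed:", hot_hits_after_stream(bypass=True))
```

When the stream is cached, every hot line has been evicted by the time it is needed again; when the stream bypasses the cache, the hot set survives intact.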