PIII Data Stream: Power Saving Modes, Buses, Memory Order Buffer


PIII Data Stream. Outline: Power Saving Modes, Buses, Memory Order Buffer, System Cache, Memory Hierarchy, L1 Cache, L2 Cache

Power Saving Modes. AutoHALT uses 10-15% of available power; Deep Sleep uses less than 2%. Resume time for AutoHALT is much faster than for Deep Sleep: it is measured in nanoseconds for AutoHALT and in microseconds for Deep Sleep. Deep Sleep is therefore low power/high latency, while AutoHALT is low latency/high power, which is why Deep Sleep is typically reserved for the "power-on suspend" feature on notebook computers. The new Quickstart mode available on the newer PIII mobile processors uses about 5% of available power, approximately 2 watts.
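A quick back-of-envelope check using only the numbers on this slide: if Quickstart's 5% of available power is about 2 W, the full-power budget is roughly 2 W / 0.05 = 40 W, which would put AutoHALT (10-15%) around 4-6 W and Deep Sleep (under 2%) below about 0.8 W. These are rough inferences from the stated percentages, not published figures.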

Power Saving Modes. BCLK is the system bus clock, which provides the clock for the processor and the on-die L2 cache.

Power Saving Modes

Bus Interface “Backside” cache

PIII Buses At-a-Glance
Address bus width: 36 bits
Data bus width: 64 bits
Dual Independent Bus (DIB) dedicated to the L2 cache: 64+8 bits (0.25 µm parts), 256+32 bits (0.18 µm parts)

PIII System Bus. 133 MHz with ECC error checking; supports multiple processors; 4 write-back buffers, 6 fill buffers, 8 bus queue entries. It is a synchronous latched bus, and the BCLK signal clocks both the bus and the L2 cache. Some studies suggest the system bus is only about 25% utilized, which would imply that four processors could share the same bus, although Intel specifies that only two processors can be used in a multiprocessing configuration. The write-back buffers are used to hold modified lines being evicted from the cache until they can be written out on the bus.
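As a rough sanity check on the utilization claim (assuming one 64-bit transfer per bus clock, a simplification of the P6 bus protocol): 133 MHz x 8 bytes is roughly 1.06 GB/s of peak system-bus bandwidth, so 25% utilization corresponds to about 266 MB/s of sustained traffic per processor.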

PIII Bus Enhancements (relative to the Pentium II): improved write buffers, removal of a dead cycle, and the ability to use all fill buffers as write-combining (WC) fill buffers. The PII write buffer was designed for instantaneous bandwidth rather than the average case, but the SSE instructions demand high average throughput because they stream large amounts of data. In the PII, a dead cycle existed between back-to-back write-combining writes; the Intel designers removed this dead cycle to increase throughput for SSE instructions. The Pentium III's memory cluster also allows all fill buffers to be used as write-combining fill buffers at the same time, whereas the Pentium II allows just one. To support this, the following WC eviction condition is implemented in the Pentium III, in addition to all Pentium Pro WC eviction policies: a buffer is evicted when all of its bytes have been written (all dirty). Previously the eviction policy was purely "resource demand" driven, i.e. a buffer was evicted only when the DCU requested allocation of a new buffer. When all fill buffers are busy, a DCU fill-buffer allocation request (a regular load, store, or prefetch that needs a fill buffer) can still evict a WC buffer even if it is not yet full. Together these enhancements improved bus write bandwidth by about 20%.
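The kind of code these write-combining changes target can be sketched with SSE intrinsics. The following is a minimal illustration, not Intel's code; the function name and buffer handling are assumptions for the example, and it presumes a C compiler with SSE support (xmmintrin.h):

    #include <stddef.h>
    #include <xmmintrin.h>   /* SSE intrinsics: _mm_set_ps1, _mm_stream_ps, _mm_sfence */

    /* Fill a large buffer using non-temporal (write-combining) stores.
     * Each _mm_stream_ps writes 16 bytes without allocating the line in the
     * cache; back-to-back stores to the same 32-byte line are merged in a
     * write-combining fill buffer and written to memory as one burst.
     * 'dst' must be 16-byte aligned and 'count' a multiple of 4. */
    static void fill_streaming(float *dst, size_t count, float value)
    {
        __m128 v = _mm_set_ps1(value);
        for (size_t i = 0; i < count; i += 4)
            _mm_stream_ps(dst + i, v);
        _mm_sfence();   /* drain the WC buffers so the data is globally visible */
    }

Being able to use every fill buffer as a WC buffer at once is what lets a stream of stores like this sustain high average bandwidth rather than serializing on a single buffer.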

Memory Order Buffer (MOB). Load Buffer (LB): 16 entries. Store Buffer (SB): 12 entries. Re-dispatches micro-ops; governs cache bandwidth. The memory order buffer is the main interface between the processor's out-of-order core and the memory hierarchy. It consists of two buffers, the load buffer and the store buffer. The main function of the MOB is to keep track of all memory accesses currently in flight and re-dispatch them when necessary. When a load or store micro-op is issued to the reservation station (RS), it is also allocated an entry in the load or store buffer; the load buffer holds 16 entries and the store buffer holds 12. A cache miss does not stall the pipeline because the caches are non-blocking (discussed later): the line is filled from the next level while the MOB holds the access and re-dispatches it once the data arrives.

Memory Order Buffer (MOB) Re-Ordering. Stores cannot pass other loads or stores. Loads can pass other loads, but cannot pass stores. Store coloring; the multiprocessing dilemma. Constraining stores from passing other stores costs only a small amount of performance, and constraining stores from passing loads costs an inconsequential amount, but constraining loads from passing other loads or stores would have a significant performance impact. Store coloring: every load is tagged with the identity of the most recent store allocated before it, and this tag is used to enforce the rule that loads cannot pass stores. With that rule enforced, the MOB is free to reorder loads among themselves. Remember that by the time a load or store reaches the MOB, the RS has already checked its data dependencies; the MOB's job is simply to preserve the memory ordering that other processors expect to observe. That is the multiprocessing dilemma: without these constraints, other processors could see memory operations appear in an inconsistent order.
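Store coloring can be sketched in a few lines of C. This is only an illustrative model of the idea, not the PIII hardware; all names and the simplified dispatch rule are assumptions for the example:

    #include <stdbool.h>

    /* Illustrative model of store coloring, not the actual PIII hardware.
     * Each load records the ID ("color") of the youngest store allocated
     * before it; loads may reorder among themselves, but a load may not be
     * issued to memory until every store up to its color has committed. */

    static int youngest_store = -1;        /* ID of the last store allocated */
    static int last_committed_store = -1;  /* ID of the last store committed */

    typedef struct { int color; } load_entry;

    static load_entry allocate_load(void)
    {
        load_entry ld = { youngest_store };  /* tag the load with the prior store */
        return ld;
    }

    static void allocate_store(void) { youngest_store++; }
    static void commit_store(void)   { last_committed_store++; }

    /* Loads never pass stores: dispatch only after the colored store commits. */
    static bool load_may_dispatch(load_entry ld)
    {
        return ld.color <= last_committed_store;
    }

Real hardware tracks the colors per buffer entry and wraps the IDs, but the rule is the same: a load carries the color of the youngest older store and is not issued to memory ahead of it.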

PIII Cache Design. Harvard architecture for L1; unified L2; inclusive. Harvard architecture means separate instruction and data caches. Advantage: the front end can fetch instructions from the I-cache while the execution units access the D-cache, and because each cache is smaller, its access time is lower. Disadvantage: when fetching data and instructions, you must decide where to place each item. The L2 has no need for this separation; splitting it would gain nothing and only add complexity.

Inclusive vs. Exclusive. Inclusive: everything in the upper-level cache is duplicated in the lower level, which reduces the effective size of the lower-level cache. Exclusive: a given piece of data resides in only one cache at a time.
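To put a number on that, using the PIII's own sizes: with 16 KB + 16 KB of L1 and a 256 KB inclusive L2 (the Advanced Transfer Cache described below), the hierarchy holds at most 256 KB of distinct data, because everything in L1 is also kept in L2. An exclusive organization of the same arrays could hold 256 + 32 = 288 KB, and the relative loss grows as L1 gets larger compared to L2.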

L1 Instruction Cache. Non-blocking; 16 KB; 4-way set associative; 32 bytes/line; SI fetch port; internal and external snoop ports; least-recently-used (LRU) replacement.

L1 Data Cache. Non-blocking; 16 KB; 4-way set associative; 32 bytes/line; MESI coherence protocol (Modified, Exclusive, Shared, Invalid); dual-ported; snoop port; write-allocate; least-recently-used (LRU) replacement.
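As a worked example of what these parameters imply, the sketch below computes the geometry of the 16 KB, 4-way, 32-byte-line L1 caches. The tag width assumes the 36-bit physical address from the bus slide; the index/tag split is my reading of the parameters, not something stated on the slide:

    #include <stdio.h>

    int main(void)
    {
        const int size_bytes = 16 * 1024;  /* 16 KB cache                 */
        const int ways       = 4;          /* 4-way set associative       */
        const int line_bytes = 32;         /* 32-byte lines               */
        const int paddr_bits = 36;         /* PIII physical address width */

        int sets        = size_bytes / (ways * line_bytes);      /* 128       */
        int offset_bits = 5;                                     /* log2(32)  */
        int index_bits  = 7;                                     /* log2(128) */
        int tag_bits    = paddr_bits - index_bits - offset_bits; /* 24        */

        printf("sets=%d  offset bits=%d  index bits=%d  tag bits=%d\n",
               sets, offset_bits, index_bits, tag_bits);
        return 0;
    }

Note that index plus offset is 12 bits, the same as a 4 KB page offset, a common reason for choosing this organization in L1 caches.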

L2 Cache. Two designs: a discrete Level 2 cache and the Advanced Transfer Cache (ATC). The discrete L2 is off-die: slower, but easier to change or upgrade. The ATC is faster, but less flexible, and because it takes up room on the chip it must be smaller.

Discrete L2 Cache. 512 KB or more, off-die; 64-bit bus; 4-way set associative. Slower, but bigger. The lower associativity keeps access time down, but it allows more conflict misses.

Advanced Transfer Cache. 256 KB on-die; 256-bit bus; 8-way set associative. Faster, but smaller. On the Xeon models of the Pentium III, the goal is a high hit ratio together with a fast access time.
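A rough comparison of the two L2 interfaces: the ATC's 256-bit bus moves 32 bytes per transfer, so an entire 32-byte line arrives in a single transfer, whereas the discrete cache's 64-bit bus needs four 8-byte transfers per line. Combined with the on-die cache running at full core clock (the off-die L2 on the 0.25 µm parts ran slower than the core), this is where the "Advanced Transfer" bandwidth gain comes from; the exact speedup depends on clock ratios not given on these slides.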

L2 Cache Effects on Power. Moving the L2 on-die decreases power and area, but it also decreases the die area available for the cache, and therefore its maximum size.

Software Controlled Caching. Streaming data trashes the cache; SSE lets software skip levels in the memory hierarchy (the "senior load"). Many multimedia applications need to process large amounts of data but touch each item only once, so there is no need for this data to evict more useful data already in the cache. With the previous cache model, the Pentium II would simply keep the most recently accessed data at all levels of the cache. As Yan mentioned earlier, the new SSE instructions can bypass levels of the cache.
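A minimal sketch of how software can exercise this, assuming SSE is available; the function name, prefetch distance, and loop structure are illustrative choices, not Intel's example code:

    #include <stddef.h>
    #include <xmmintrin.h>   /* SSE: _mm_prefetch, _MM_HINT_NTA */

    /* Stream through a large array once, prefetching ahead with the
     * non-temporal hint so the data does not displace the existing
     * working set in the outer cache levels. */
    static float sum_streaming(const float *src, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++) {
            if ((i & 7) == 0 && i + 64 < n)   /* once per 32-byte line, ~8 lines ahead */
                _mm_prefetch((const char *)(src + i + 64), _MM_HINT_NTA);
            sum += src[i];
        }
        return sum;
    }

The non-temporal hint tells the hardware to keep the line out of (or quickly evict it from) the outer cache levels, so a single pass over a large array does not push out data that will be reused.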