1
IBM POWER7
UIUC - CS 433
Adam Kunk, Anil John, Pete Bohman
2
Quick Facts
Released by IBM in February 2010
Successor to the POWER6; shift from high frequency to multi-core
ISA: Power ISA v2.06 (RISC)
Clock rate: 2.4 GHz
Feature size: 45 nm
Cores: 4, 6, or 8
Cache: L1, L2, L3 on chip
References: [1], [5]
3
Why the POWER7? PERCS (Productive, Easy-to-use, Reliable Computing System) was the DARPA-funded contract that IBM won to develop the POWER7 ($244 million, awarded 2006). The contract called for a petascale supercomputer architecture before 2011 under the HPCS (High Productivity Computing Systems) project. IBM, Cray, and Sun Microsystems received HPCS grants for Phase II; IBM was chosen for Phase III in 2006. References: [1], [2]
4
Blue Waters Side note: the Blue Waters system was meant to be the first supercomputer built on PERCS technology, but the contract was cancelled due to cost and complexity.
5
History of Power
2001 – POWER4/4+ (180 nm): first dual core in industry; chip multiprocessing; distributed switch; shared L2; dynamic LPARs (32)
2004 – POWER5/5+ (130 nm, 90 nm): hardware virtualization for Unix & Linux; dual-core and quad-core modules; enhanced scaling; 2-thread SMT; distributed switch+; core parallelism+; FP performance+; memory bandwidth+
2007 – POWER6/6+ (65 nm): fastest processor in industry; dual core; high frequencies; virtualization+; memory subsystem+; Altivec; instruction retry; dynamic energy management; 2-thread SMT+; protection keys
2010 – POWER7/7+ (45 nm, 32 nm): most POWERful & scalable processor in industry; 4, 6, and 8 cores; 32 MB on-chip eDRAM; power-optimized cores; memory subsystem++; 4-thread SMT++; reliability+; VSM & VSX; protection keys+
Future – POWER8
References: [3]
6
POWER7 Layout
[Die photo: eight cores, each with a private L2, surrounding the 32 MB eDRAM L3; memory interfaces and GX/SMP fabric links at the chip edges.]
Cores:
8 intelligent cores per chip (socket); 4- and 6-core models available
12 execution units per core
Out-of-order execution
4-way SMT per core, 32 threads per chip
L1: 32 KB I-cache / 32 KB D-cache per core
L2: 256 KB per core
Chip:
32 MB intelligent L3 cache on chip (eDRAM)
3 levels of on-chip cache
Memory++
References: [3]
7
Scalability
POWER7 can handle up to 32 sockets, with up to 8 cores per socket
Each socket can execute 32 threads simultaneously, so a full system runs up to 32 × 32 = 1,024 simultaneous threads
360 GB/s peak SMP bandwidth per chip
590 GB/s peak I/O bandwidth per chip
Up to 20,000 coherent operations in flight (very aggressive out-of-order execution)
References: [3]
8
POWER7 Options (8, 6, 4 cores) References: [3]
9
POWER7 TurboCore
TurboCore mode: the chip runs 4 of its 8 cores, trading core count for per-core resources
7.25% higher core frequency
2× the L3 cache per core (fluid cache)
Tradeoffs:
Reduces per-core software license costs
Boosts per-core performance for transaction-based workloads
Gives up throughput on highly parallel workloads (half the cores)
10
POWER7 Core Each core implements “aggressive” out-of-order (OoO) instruction execution The processor has an Instruction Sequence Unit capable of dispatching up to six instructions per cycle to a set of queues Up to eight instructions per cycle can be issued to the Instruction Execution units References: [4]
11
Pipeline Fetch up to 8 instructions, decode and dispatch up to 6, issue/execute up to 8, and commit up to 6 instructions per cycle; register renaming; ~250-entry reorder buffer.
12
Instruction Fetch
Up to 8 instructions fetched per cycle from the L2 cache into the L1 I-cache or fetch buffer (on demand or via prefetch; 3-cycle latency)
Instruction fetch rates are balanced across active threads
Instruction grouping: instructions belonging to a group are dispatched together, and a group contains only independent instructions
A single bit signals the beginning of a new group
A group cannot use more resources than are available in the processor
A group cannot contain a WAW or RAW dependency (see the sketch below)
Modified branch instructions with a partially computed target address are stored in the L1 I-cache
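To make the grouping rule concrete, here is a minimal C sketch that closes a dispatch group at the first RAW or WAW dependency. The three-operand instruction format and the six-slot group size are simplifying assumptions, not POWER7's actual group-formation logic:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical simplified instruction: one destination, two sources. */
typedef struct { int dest, src1, src2; } Inst;

/* Close a dispatch group at the first RAW or WAW dependency, mirroring
 * the rule that a group may contain only independent instructions. */
int form_group(const Inst *window, int n, int max_group) {
    for (int len = 0; len < n && len < max_group; len++) {
        for (int j = 0; j < len; j++) {
            bool raw = window[len].src1 == window[j].dest ||
                       window[len].src2 == window[j].dest; /* read-after-write  */
            bool waw = window[len].dest == window[j].dest; /* write-after-write */
            if (raw || waw)
                return len;    /* dependency found: next group starts here */
        }
    }
    return n < max_group ? n : max_group;
}

int main(void) {
    Inst window[] = {
        { 1, 2, 3 },   /* r1 = r2 op r3                               */
        { 4, 5, 6 },   /* r4 = r5 op r6: independent, joins the group */
        { 7, 1, 2 },   /* r7 = r1 op r2: RAW on r1 closes the group   */
    };
    printf("group length: %d\n", form_group(window, 3, 6)); /* prints 2 */
    return 0;
}
```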
13
Branch Prediction
POWER7 uses different mechanisms to predict the branch direction (taken/not taken) and the branch target address.
The Instruction Fetch Unit (IFU) supports a 3-cycle branch scan loop that scans instructions for branches, computes target addresses, and determines whether a branch is unconditional or taken.
A branch target address cache (BTAC) reduces the loss of fetch cycles in single-threaded (ST) mode: on a taken branch, the three-cycle branch scan loop would otherwise cause two dead cycles in which no instruction fetch takes place. To mitigate this penalty, the BTAC tracks the targets of direct branches and uses the current fetch address to predict the fetch address two cycles in the future. When it predicts correctly, the pipelined BTAC loses no fetch cycles, no matter how many taken branches are encountered.
References: [5]
14
Branch Direction Prediction
Tournament predictor (selected by GSEL):
8K-entry local BHT (LBHT); BHT = branch history table
16K-entry global BHT (GBHT)
8K-entry global selection array (GSEL)
These arrays provide branch direction predictions for all instructions in a fetch group (up to 8 instructions); every instruction in a fetch group may be a branch
The arrays are shared by all threads
References: [5]
15
Branch Direction Prediction (cont.)
Indexing:
The 8K-entry LBHT is directly indexed by 10 bits of the instruction fetch address
The GBHT and GSEL arrays are indexed by the instruction fetch address hashed with a per-thread 21-bit global history vector (GHV) folded down to 11 bits (a rough sketch follows)
References: [5]
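A rough sketch of this indexing, assuming (since the slide doesn't specify) that each row of the arrays covers an 8-instruction fetch group — which reconciles the 8K-entry LBHT with its 10-bit index — and that the GHV fold is a simple XOR of its two halves:

```c
#include <stdint.h>
#include <stdio.h>

/* Fold the per-thread 21-bit global history vector down to 11 bits.
 * The hardware's exact fold is not given; XORing the two halves is
 * an assumed stand-in. */
static uint32_t fold_ghv(uint32_t ghv21) {
    return (ghv21 ^ (ghv21 >> 11)) & 0x7FF;
}

/* 8K-entry LBHT viewed as 1K rows of 8 entries (one per instruction
 * in the fetch group): directly indexed by 10 fetch-address bits. */
static unsigned lbht_row(uint64_t fetch_addr) {
    return (unsigned)(fetch_addr >> 5) & 0x3FF;   /* bit positions assumed */
}

/* GBHT/GSEL rows: fetch address hashed (XORed) with the folded GHV. */
static unsigned gbht_row(uint64_t fetch_addr, uint32_t ghv21) {
    return ((unsigned)(fetch_addr >> 5) ^ fold_ghv(ghv21)) & 0x7FF;
}

int main(void) {
    uint64_t pc  = 0x10002f40;   /* group-aligned fetch address        */
    uint32_t ghv = 0x1A2B3;      /* 21 bits of taken/not-taken history */
    printf("LBHT row %u, GBHT/GSEL row %u\n", lbht_row(pc), gbht_row(pc, ghv));
    return 0;
}
```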
16
Branch Direction Prediction (cont.)
The value in GSEL chooses between the LBHT and GBHT predictions for each individual branch — hence the tournament predictor!
Each BHT entry (LBHT and GBHT) contains 2 bits:
The higher-order bit determines the direction (taken/not taken)
The lower-order bit provides hysteresis (history of the branch)
Hysteresis notes: hysteresis takes the branch's history beyond its most recent execution into account, and schemes like this are often implemented with saturating counters. Although a single hysteresis bit can only remember one extra step, the direction/hysteresis pair together behaves like a 2-bit saturating counter: the predicted direction flips only after two consecutive mispredictions (see the sketch below).
References: [5]
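A minimal sketch of how a direction/hysteresis pair and the GSEL chooser could behave; the field layout and update policy are assumptions consistent with the description above, not IBM's documented logic:

```c
#include <stdio.h>

typedef struct {
    unsigned dir : 1;   /* higher-order bit: predicted direction (1 = taken) */
    unsigned hys : 1;   /* lower-order bit: hysteresis */
} Bht2;

/* Tournament selection: the GSEL entry's direction bit picks which
 * table's prediction to trust for this branch. */
int predict(Bht2 local, Bht2 global, Bht2 sel) {
    return sel.dir ? global.dir : local.dir;
}

/* Update one entry: with the hysteresis bit, the pair behaves like a
 * 2-bit saturating counter -- the predicted direction flips only after
 * two consecutive mispredictions. */
Bht2 update(Bht2 e, int taken) {
    if ((unsigned)taken == e.dir) {
        e.hys = 1;              /* correct: reinforce */
    } else if (e.hys) {
        e.hys = 0;              /* first miss: weaken, keep the direction */
    } else {
        e.dir = taken ? 1 : 0;  /* second miss: flip the direction */
    }
    return e;
}

int main(void) {
    Bht2 e = { 1, 1 };                     /* "strongly taken"   */
    e = update(e, 0);                      /* now weakly taken   */
    e = update(e, 0);                      /* flips to not taken */
    printf("dir=%u hys=%u\n", e.dir, e.hys);

    Bht2 local = { 1, 1 }, global = { 0, 1 }, sel = { 1, 0 };
    printf("prediction: %d\n", predict(local, global, sel)); /* 0: GBHT wins */
    return 0;
}
```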
17
Branch Target Address Prediction
Predicted in two ways:
Indirect branches that are not subroutine returns use a 128-entry count cache (shared by all active threads)
The count cache is indexed by XORing 7 bits from the instruction fetch address with 7 bits of the GHV (global history vector)
Each count-cache entry contains a 62-bit predicted address and 2 confidence bits
Notes: the confidence bits determine when an entry is replaced after an incorrect indirect branch prediction (sketched below)
References: [5]
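A sketch of a count-cache lookup and confidence-based replacement, assuming (the slide doesn't specify) a saturating confidence update and particular bit positions for the index:

```c
#include <stdint.h>
#include <stdio.h>

#define CC_ENTRIES 128

typedef struct {
    uint64_t target;   /* predicted target address (62 bits in hardware) */
    uint8_t  conf;     /* 2 confidence bits (0..3) */
} CountCacheEntry;

static CountCacheEntry cc[CC_ENTRIES];

/* Index: XOR of 7 bits from the fetch address with 7 bits of the GHV
 * (the exact bit positions are assumptions). */
static unsigned cc_index(uint64_t fetch_addr, uint32_t ghv) {
    return (unsigned)((fetch_addr >> 2) ^ ghv) & 0x7F;
}

/* When an indirect branch resolves: a correct prediction raises the
 * confidence; a wrong target only replaces the entry once confidence
 * has drained to zero (the drain/refill amounts are assumptions). */
static void cc_update(uint64_t fetch_addr, uint32_t ghv, uint64_t actual) {
    CountCacheEntry *e = &cc[cc_index(fetch_addr, ghv)];
    if (e->target == actual) {
        if (e->conf < 3) e->conf++;
    } else if (e->conf > 0) {
        e->conf--;                 /* mispredicted, but keep the old target */
    } else {
        e->target = actual;        /* confidence exhausted: install new target */
        e->conf = 1;
    }
}

int main(void) {
    cc_update(0x10002f40, 0x2A, 0x20000000);  /* empty entry: installs target */
    cc_update(0x10002f40, 0x2A, 0x20000000);  /* correct: confidence rises */
    CountCacheEntry e = cc[cc_index(0x10002f40, 0x2A)];
    printf("predicted 0x%llx, conf %u\n", (unsigned long long)e.target, e.conf);
    return 0;
}
```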
18
Branch Target Address Prediction (cont.)
Predicted in two ways:
Subroutine returns are predicted using a link stack, one per thread (like the "Return Address Stack" discussed in lecture)
Link stack depth by POWER7 mode: ST and SMT2, 16 entries per thread; SMT4, 8 entries per thread
The link stack works as follows: whenever a branch-and-link instruction (a call) is scanned, the address of the next instruction is pushed onto that thread's link stack; the stack is popped whenever a branch-to-link instruction (a return) is scanned (see the sketch below)
The POWER7 also keeps one speculative entry so the link stack can be recovered after a mispredicted branch that would flush it
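A minimal link-stack sketch under the rules above; the wrap-around overflow behavior and the omission of the speculative-recovery entry are simplifying assumptions:

```c
#include <stdint.h>
#include <stdio.h>

#define LINK_STACK_DEPTH 16   /* per thread in ST/SMT2; 8 in SMT4 */

typedef struct {
    uint64_t addr[LINK_STACK_DEPTH];
    int top;   /* number of live entries */
} LinkStack;

/* Branch-and-link (a call) scanned: push the return address, i.e. the
 * address of the next instruction. On overflow the oldest entry is
 * silently overwritten (an assumption about overflow behavior). */
static void ls_push(LinkStack *s, uint64_t next_inst) {
    s->addr[s->top % LINK_STACK_DEPTH] = next_inst;
    s->top++;
}

/* Branch-to-link (a return) scanned: pop the predicted return address. */
static uint64_t ls_pop(LinkStack *s) {
    if (s->top == 0)
        return 0;               /* empty: no prediction available */
    s->top--;
    return s->addr[s->top % LINK_STACK_DEPTH];
}

int main(void) {
    LinkStack s = { {0}, 0 };
    ls_push(&s, 0x10000008);    /* bl at 0x10000004: push 0x10000008 */
    printf("predicted return: 0x%llx\n", (unsigned long long)ls_pop(&s));
    return 0;
}
```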
19
Execution Units
Each POWER7 core has 12 execution units:
2 fixed-point units
2 load/store units
4 double-precision floating-point units (2× POWER6)
1 vector unit
1 branch unit
1 condition register unit
1 decimal floating-point unit
References: [4]
20
ILP
Advanced branch prediction
Large out-of-order execution windows
Large and fast caches
More than one execution thread per core: a single 8-core POWER7 processor can execute 32 threads in the same clock cycle
21
POWER7 Demo
IBM POWER7 demo: a visual representation of the SMT capabilities of the POWER7 and a brief introduction to the on-chip L3 cache
22
SMT (Simultaneous Multithreading)
Separate instruction streams run concurrently on the same physical processor
Per cycle, POWER7 supports:
2 pipes for storage instructions (loads/stores)
2 pipes for arithmetic instructions (add, subtract, etc.)
1 pipe for branch instructions (control flow)
Parallel support for floating-point and vector operations
References: [7], [8]
23
SMT (cont.) Simultaneous Multithreading modes:
SMT1: single instruction execution thread per core
SMT2: two instruction execution threads per core
SMT4: four instruction execution threads per core
POWER7 supports SMT1, SMT2, and SMT4, so an 8-core POWER7 can execute 32 threads simultaneously
References: [5], [8]
24
Multithreading History
[Figure: occupancy of the execution units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) across generations — single-thread out-of-order; S80 hardware multithreading; POWER5 2-way SMT; POWER7 4-way SMT — with a legend for no thread executing and threads 0–3 executing.]
References: [3]
25
Cache Overview
Parameter     | L1                           | L2         | L3 (Local)     | L3 (Global)
Size          | 64 KB (32 KB I, 32 KB D)     | 256 KB     | 4 MB           | 32 MB
Location      | Core                         | On-chip    | On-chip        | On-chip
Access time   | 0.5 ns                       | 2 ns       | 6 ns           | 30 ns
Associativity | 4-way I-cache, 8-way D-cache | 8-way      | 8-way          | 8-way
Write policy  | Write-through                | Write-back | Partial victim | Adaptive
Line size     | 128 B                        | 128 B      | 128 B          | 128 B
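As a sanity check on the table, the set count at each level follows from capacity = sets × associativity × line size (the 8-way L3 associativity is an assumption, since the slide leaves that cell blank):

```c
#include <stdio.h>

/* capacity = sets * associativity * line size  =>  solve for sets */
static unsigned sets(unsigned capacity_bytes, unsigned assoc, unsigned line) {
    return capacity_bytes / (assoc * line);
}

int main(void) {
    /* Figures from the table above; 128-byte lines throughout. */
    printf("L1D        : %4u sets\n", sets(32u * 1024, 8, 128));        /*   32 */
    printf("L2         : %4u sets\n", sets(256u * 1024, 8, 128));       /*  256 */
    printf("L3 (local) : %4u sets\n", sets(4u * 1024 * 1024, 8, 128));  /* 4096 */
    return 0;
}
```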
26
Cache Design Considerations
On-chip cache is required for sufficient bandwidth to 8 cores; the previous off-chip socket interface was unable to scale
Supports dynamic cores
Exploits ILP and increased SMT to overlap latency
eDRAM enables a complete on-chip cache
Moving the cache on chip freed the interface for an additional memory controller
27
L1 Cache
Split I- and D-caches reduce latency
Way-prediction bits reduce hit latency
Write-through: no L1 write-backs are required on line eviction; the high-speed L2 can handle the write bandwidth
Binary-tree LRU replacement (sketched below)
Prefetching: on each L1 I-cache miss, prefetch the next 2 blocks
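The "B-tree LRU" above is presumably a binary-tree pseudo-LRU; here is a textbook 8-way tree-PLRU sketch (an assumption about the exact scheme, not POWER7's verified implementation):

```c
#include <stdint.h>
#include <stdio.h>

/* Binary-tree pseudo-LRU for one 8-way set: 7 internal node bits.
 * A node bit of 1 means the left subtree was touched more recently,
 * so the victim search descends to the right. */
typedef struct { uint8_t node[7]; } TreeLru;

/* Walk three levels toward the less recently used leaf. */
static int plru_victim(const TreeLru *t) {
    int i = 0;
    for (int level = 0; level < 3; level++)
        i = 2 * i + 1 + t->node[i];     /* follow the "colder" side */
    return i - 7;                       /* leaves 7..14 map to ways 0..7 */
}

/* On an access, set the bits along the path to point away from this way. */
static void plru_touch(TreeLru *t, int way) {
    int i = way + 7;
    while (i > 0) {
        int parent = (i - 1) / 2;
        t->node[parent] = (uint8_t)(i == 2 * parent + 1); /* came from left? */
        i = parent;
    }
}

int main(void) {
    TreeLru t = {{0}};
    plru_touch(&t, 0);
    plru_touch(&t, 3);
    printf("victim way: %d\n", plru_victim(&t));  /* a recently untouched way */
    return 0;
}
```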
28
L2 Cache
Superset of L1 (inclusive)
Latency reduced by decreasing capacity; the L2 uses the larger local L3 region as a victim cache
Increased associativity
29
L3 Cache
32 MB fluid L3 cache: 4 MB of local L3 cache per core, eight regions per chip
Lateral cast-outs; provisioning of a disabled core's region
Local cache sits closer to its core, reducing latency; an L3 access is routed to the local L3 region first
Cache lines are cloned when used by multiple cores
L3 replacement priority (lines in class 2 are evicted before lines in class 1):
1. Lines installed as victims via a cast-out operation by the associated L2 cache
2. Lines installed as victims via a lateral cast-out by one of the other seven L3 regions, residual shared copies of lines that have been moved to the local L2 cache, and invalid lines
(A victim-selection sketch follows.)
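A schematic victim-selection sketch for the two priority classes above; the tie-breaking within a class is an assumption:

```c
#include <stdio.h>

/* Line classes from the slide: class 2 is evicted before class 1. */
enum L3Class {
    L3_INVALID = 0,          /* class 2 */
    L3_LATERAL_CASTOUT,      /* class 2: victim from another L3 region */
    L3_RESIDUAL_SHARED,      /* class 2: shared copy already moved to local L2 */
    L3_L2_CASTOUT            /* class 1: victim from the associated L2 */
};

/* Pick a victim way: prefer any class-2 line; fall back to class 1. */
static int pick_victim(const enum L3Class set[], int ways) {
    for (int w = 0; w < ways; w++)
        if (set[w] != L3_L2_CASTOUT)
            return w;        /* first class-2 line found (tie-break assumed) */
    return 0;                /* all class 1: evict a class-1 line */
}

int main(void) {
    enum L3Class set[8] = { L3_L2_CASTOUT, L3_L2_CASTOUT, L3_LATERAL_CASTOUT,
                            L3_L2_CASTOUT, L3_RESIDUAL_SHARED, L3_L2_CASTOUT,
                            L3_L2_CASTOUT, L3_L2_CASTOUT };
    printf("victim way: %d\n", pick_victim(set, 8)); /* prints 2 */
    return 0;
}
```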
30
eDRAM: Embedded Dynamic Random-Access Memory
Less area (1 transistor per cell vs. a 6-transistor SRAM cell)
Enables the on-chip L3 cache, which reduces L3 latency
Larger internal bus size, which increases bandwidth
Compared to off-chip SRAM cache: 1/6 the latency, 1/5 the standby power
Also used in game consoles (PS2, Wii, etc.)
References: [5], [6]
31
Memory
2 memory controllers per chip, 4 channels each
Exploits the elimination of the off-chip L3 cache interface
32 GB per core, 256 GB capacity per chip
180 GB/s peak memory bandwidth (POWER6: 75 GB/s)
16 KB scheduling buffer
32
Multicore (TODO) (Look at table on page 401 in book)
33
Cache Coherence
Multicore coherence protocol: extended MESI with behavioral and locality hints
MESI (the "Illinois protocol") is the most common protocol supporting write-back caches
Every cache line is marked with one of four states, coded in two additional bits:
Modified: the line is present only in the current cache and is dirty (it has been modified from the value in main memory); the cache must write the data back to main memory before any other read of the (no longer valid) main-memory copy, and the write-back changes the line to Exclusive
Exclusive: the line is present only in the current cache and is clean (it matches main memory); it may change to Shared at any time in response to a read request, or to Modified when written
Shared: the line may be stored in other caches of the machine and is clean (it matches main memory); it may be discarded (changed to Invalid) at any time
Invalid: the cache line is invalid
(A minimal state-transition sketch follows.)
References: [9], [10]
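A minimal sketch of plain MESI transitions for two common events; it omits the POWER7's behavioral and locality-hint extensions and the full event set:

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;

/* The local processor writes the line. */
static MesiState on_local_write(MesiState s) {
    switch (s) {
    case INVALID:   return MODIFIED;  /* after read-for-ownership, others invalidated */
    case SHARED:    return MODIFIED;  /* upgrade: invalidate other copies */
    case EXCLUSIVE: return MODIFIED;  /* silent upgrade: no coherence traffic needed */
    case MODIFIED:  return MODIFIED;
    }
    return INVALID;
}

/* Another cache reads a line we hold. */
static MesiState on_remote_read(MesiState s) {
    switch (s) {
    case MODIFIED:  return SHARED;    /* supply data / write back, then demote */
    case EXCLUSIVE: return SHARED;
    default:        return s;
    }
}

int main(void) {
    MesiState s = EXCLUSIVE;
    s = on_local_write(s);            /* E -> M, no coherence traffic */
    s = on_remote_read(s);            /* M -> S, data written back    */
    printf("final state: %d (SHARED)\n", s);
    return 0;
}
```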
34
Cache Coherence (cont.)
Multicore coherence implementation: a directory at the L3 cache (directory-based, as opposed to snooping-based)
Directory-based: the sharing status of each block of physical memory is kept in one location, called the directory; POWER7 keeps one centralized directory in the outermost cache (L3)
Snooping: every cache that holds a copy of a block of physical memory tracks the block's sharing status by monitoring (snooping) a broadcast medium
(A schematic directory entry is sketched below.)
References: [9], [11]
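A schematic directory entry with a presence bit per core, illustrating why a directory needs no broadcast; the entry format and the printed messages are illustrative assumptions, not POWER7's actual directory layout:

```c
#include <stdint.h>
#include <stdio.h>

/* One directory entry per memory block: its coherence state plus a
 * presence bit per core (8 cores on a POWER7 chip). */
enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

typedef struct {
    uint8_t state;      /* DIR_UNCACHED / DIR_SHARED / DIR_MODIFIED */
    uint8_t presence;   /* bit i set => core i's cache holds the block */
} DirEntry;

/* A write by `core` must invalidate every other sharer, which the
 * directory can name exactly -- no broadcast snooping required. */
static void dir_write(DirEntry *e, int core) {
    for (int i = 0; i < 8; i++)
        if (i != core && (e->presence & (1u << i)))
            printf("send invalidate to core %d\n", i);
    e->presence = (uint8_t)(1u << core);
    e->state = DIR_MODIFIED;
}

int main(void) {
    DirEntry e = { DIR_SHARED, 0x07 };   /* cores 0, 1, 2 share the block */
    dir_write(&e, 0);                    /* invalidates cores 1 and 2     */
    return 0;
}
```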
35
Maintaining The Balance
36
Energy Management
Three idle states optimize power vs. wake-up latency:
Nap
Sleep
"Heavy" Sleep
37
Energy Management: Nap
Optimized for wake-up time
Clocks to the execution units are turned off
Caches remain coherent
Core frequency is reduced
38
Energy Management: Sleep and "Heavy" Sleep
Sleep: purge and clock off the core plus its caches
"Heavy" Sleep: optimized for power reduction
All cores in sleep mode; the voltage of all cores is reduced
Voltage ramps automatically on wake-up; no hardware re-initialization is required
39
Energy Management: Per-Core Frequency Scaling
-50% to +10% frequency slew, independent per core (DVFS); see the sketch below
Supports energy optimization in partitioned system configurations:
Less-utilized partitions can run at lower frequencies while heavily utilized partitions maintain peak performance
Each partition can run under a different energy-saving policy
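A toy per-core DVFS governor that clamps to the -50%/+10% slew range quoted above; the utilization-proportional policy and the nominal frequency are illustrative assumptions, not IBM's algorithm:

```c
#include <stdio.h>

#define F_NOMINAL_MHZ 3860.0   /* assumed nominal core frequency */

/* Map utilization (0..1) onto the slide's -50%..+10% slew range,
 * clamping at both ends. */
static double pick_core_freq(double utilization) {
    double f  = F_NOMINAL_MHZ * (0.5 + 0.6 * utilization); /* 50%..110% */
    double lo = 0.5 * F_NOMINAL_MHZ;                       /* -50% floor */
    double hi = 1.1 * F_NOMINAL_MHZ;                       /* +10% ceiling */
    if (f < lo) f = lo;
    if (f > hi) f = hi;
    return f;
}

int main(void) {
    printf("idle partition: %.0f MHz\n", pick_core_freq(0.1)); /* near -50% */
    printf("busy partition: %.0f MHz\n", pick_core_freq(1.0)); /* up to +10% */
    return 0;
}
```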
40
Energy Management Impact
IBM research reports the following improvements in SPECpower_ssj2008 scores:
Dynamic fan-speed control: 14% improvement
Static power savings (low-power operation): 24% improvement
Dynamic power savings (DVFS with Turbo mode): 50% improvement
41
Performance
Technology | Chips | Cores | Threads | GHz  | rPerf  | CPW
POWER7     | 2     | 16    | 64      | 3.86 | 195.45 | 105,200
POWER7     | 2     | 16    | 64      | 3.92 | 197.6  | 106,000
POWER7     | 2     | 8     | 32      | 4.14 | 115.86 | 57,450
rPerf: a relative performance metric for Power Systems servers, derived from an IBM analytical model that uses characteristics from IBM internal workloads and TPC and SPEC benchmarks; the IBM eServer pSeries 640 is the baseline reference system with a value of 1.0.
CPW (Commercial Processing Workload): based on benchmarks owned and managed by the Transaction Processing Performance Council; indicates transaction-processing performance capacity when comparing members of the iSeries and AS/400 families.
42
Performance
SPEC CPU2006 performance (speed):
Technology | Chips | Cores | GHz  | SPECint | SPECfp
POWER7     | 2     | 16    | 3.86 | –       | 71.5
POWER7     | 2     | 16    | 4.14 | 44.0    | –
SPEC CPU2006 performance (throughput; SMT2 vs. SMT4 with SPEC):
Technology | Chips | Cores | Threads | GHz  | OS      | SPECint_rate | SPECfp_rate
POWER7     | 2     | 16    | 64      | 3.86 | AIX 6.1 | 652          | 586
43
References
1. http://en.wikipedia.org/wiki/POWER7
2.
3. Central PA PUG POWER7 review.ppt
44
References (cont.)
4.
5.
6.
7.
8.
9. Hennessy and Patterson. Computer Architecture: A Quantitative Approach, Fifth Edition. Morgan Kaufmann.
10. Wikipedia: MESI Protocol.
11. Wikipedia: Cache Coherence.