POWER8 Technology
Jeff Stuecheli, Hardware Architect
OpenPOWER Academic Discussion Group Workshop 2015

POWER8 Technology: Processor
IBM 22nm Technology:
- Silicon-on-Insulator
- 15 metal layers
- Deep trench eDRAM
Compute:
- 12 cores (thread strength optimized)
- SMT8, 16-wide execution
- 2X internal data flows
- Transactional Memory
Cache:
- 64 KB L1 + 512 KB L2 / core
- 96 MB L3 + up to 128 MB L4 / socket
- 2X bandwidths
System Interfaces:
- 230 GB/s memory bandwidth / socket
- Up to 48x Integrated PCI gen 3 / socket
- CAPI (over PCI gen 3)
- Robust, Large SMP Interconnect
- On chip Energy Mgmt, VRM / core

POWER8 Core
Execution Improvement vs. POWER7:
- SMT4 -> SMT8
- 8 dispatch
- 10 issue
- 16 execution pipes: 2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
- 16 SP / 8 DP FLOPS per cycle
- Larger Issue queues (4 x 16-entry)
- Larger global completion
- Larger Load/Store reorder
- Improved branch prediction
- Improved unaligned storage access
Larger Caching Structures vs. POWER7:
- 2x L1 data cache (64 KB)
- 2x outstanding data cache misses
- 4x translation Cache
Wider Load/Store:
- 32B -> 64B L2 to L1 data bus
- 2x data cache to execution dataflow
Enhanced Prefetch:
- Instruction speculation awareness
- Data prefetch depth awareness
- Adaptive bandwidth awareness
- Topology awareness
[Figure: core floorplan showing VSU, FXU, IFU, DFU, ISU, LSU]
Core Performance vs. POWER7: ~1.6x Thread, ~2x Max SMT
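
The pipe counts on this slide can be tallied to check the "16 execution pipes" and per-cycle FLOPS figures. This is a minimal sketch: counting one fused multiply-add (2 FLOPs) per FPU pipe per cycle in double precision, and twice that in single precision, is an assumption and is not stated on the slide.

```python
# Execution pipes per core, as listed on the slide.
pipes = {"FXU": 2, "LSU": 2, "LU": 2, "FPU": 4, "VMX": 2,
         "Crypto": 1, "DFU": 1, "CR": 1, "BR": 1}

total_pipes = sum(pipes.values())

# Assumption (not on the slide): each FPU pipe retires one double-precision
# fused multiply-add (2 FLOPs) per cycle, or two single-precision FMAs.
FLOPS_PER_FMA = 2
dp_flops_per_cycle = pipes["FPU"] * FLOPS_PER_FMA
sp_flops_per_cycle = dp_flops_per_cycle * 2

# Issue queue capacity: 4 queues of 16 entries each.
issue_queue_entries = 4 * 16

print(total_pipes, dp_flops_per_cycle, sp_flops_per_cycle, issue_queue_entries)
```

Under that assumption the totals agree with the slide: 16 pipes, 8 DP and 16 SP FLOPS per cycle.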

On-chip Caches
- L2: 512 KB 8 way per core
- L3: 96 MB (12 x 8 MB 8 way Bank)
- “NUCA” Cache policy (Non-Uniform Cache Architecture)
  - Scalable bandwidth and latency
  - Migrate “hot” lines to local L2, then local L3 (replicate L2 contained footprint)
- Chip Interconnect: 150 GB/sec x 12 segments per direction = 3.6 TB/sec
[Figure: chip layout showing Core, L2, L3 Bank, Chip Interconnect, Memory, SMP, Acc, and PCIe blocks]
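
The cache and interconnect totals on this slide follow from simple multiplication. The sketch below reproduces them; reading "per direction" as two directions and using decimal units (1 TB = 1000 GB) are assumptions chosen because they reproduce the quoted 3.6 TB/sec.

```python
CORES = 12
L2_KB_PER_CORE = 512
L3_MB_PER_BANK = 8            # one L3 bank per core
SEGMENTS = 12
GB_PER_SEC_PER_SEGMENT = 150
DIRECTIONS = 2                # assumption: "per direction" means two directions

total_l2_mb = CORES * L2_KB_PER_CORE / 1024
total_l3_mb = CORES * L3_MB_PER_BANK
interconnect_tb_per_sec = GB_PER_SEC_PER_SEGMENT * SEGMENTS * DIRECTIONS / 1000

print(f"L2 total: {total_l2_mb} MB")
print(f"L3 total: {total_l3_mb} MB")
print(f"Chip interconnect: {interconnect_tb_per_sec} TB/sec")
```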

POWER8 Technology: Memory Organization
[Figure: POWER8 Processor connected to Memory Buffers, each driving DRAM Chips]
- Up to 8 high speed channels, each 2B rd + 1B wr at 9.6 Gb/s, for up to 230 GB/s
- Up to 32 total DDR ports yielding 410 GB/s peak at the DRAM
- Up to 128 MB L4 cache and 1 TB memory capacity per processor socket
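
The two bandwidth figures on this slide can be reproduced from the channel parameters. This is a sketch under stated assumptions: the 8-byte DDR port width (a standard 64-bit DDR3 data bus) and the 1600 MT/s rate are not on this slide; the rate is taken from the memory card slides later in the deck.

```python
CHANNELS = 8
READ_BYTES, WRITE_BYTES = 2, 1      # "2B rd + 1B wr" per channel
LANE_RATE_MBPS = 9600               # 9.6 Gb/s per lane, kept as an integer

# Processor-side bandwidth: bytes per transfer times transfer rate, per channel.
link_gb_per_sec = CHANNELS * (READ_BYTES + WRITE_BYTES) * LANE_RATE_MBPS / 1000

DDR_PORTS = 32
DDR_MT_PER_SEC = 1600               # DDR3-1600, from the memory card slides
DDR_PORT_BYTES = 8                  # assumption: 64-bit data bus per DDR port

# DRAM-side peak bandwidth across all DDR ports.
dram_gb_per_sec = DDR_PORTS * DDR_MT_PER_SEC * DDR_PORT_BYTES / 1000

print(f"Processor links: {link_gb_per_sec} GB/s")
print(f"DRAM peak:       {dram_gb_per_sec} GB/s")
```

The results, 230.4 GB/s and 409.6 GB/s, round to the 230 GB/s and 410 GB/s quoted above.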

Centaur Memory Buffer Chip ... with 16MB Cache
[Figure: Memory Buffer between POWER8 Link and DRAM Chips, with DDR Interfaces, Scheduler & Management, and 16MB Memory Cache]
Intelligence Moved into Memory:
- Scheduling logic, caching structures
- Energy Mgmt, RAS decision point
- Moved from Processor to Memory Buffer
Processor Interface:
- 9.6 GB/s high speed interface
- More robust RAS
- “On-the-fly” lane isolation/repair
- Extensible for innovation build-out
Performance Value:
- End-to-end fastpath and data retry (latency)
- Cache -> latency/bandwidth, partial updates
- Cache -> write scheduling, prefetch, energy
- 22nm SOI for optimal performance / energy

POWER8 Technology: Integrated PCI Gen 3
Native PCIe Gen 3 Support:
- Direct processor integration
- Replaces proprietary GX/Bridge
- Low latency
- Gen3 bandwidth
Transport Layer for CAPI Protocol:
- Coherently Attach Devices
- Protocol encapsulated in PCIe
[Figure: GX Bus to I/O Bridge to PCIe G2 PCI Device, compared with direct PCIe G3 to PCI Device]

CAPI
[Figure: POWER8 Processor with CAPP and PCIe, attached to an FPGA containing the IBM Supplied POWER Service Layer and Function 0, Function 1, Function 2, ... Function n]
Typical I/O Model Flow:
DD Call -> Copy or Pin Source Data -> MMIO Notify Accelerator -> Acceleration -> Poll / Int Completion -> Copy or Unpin Result Data -> Ret. From DD
Flow with a Coherent Model:
Shared Mem. Notify Accelerator -> Acceleration -> Shared Memory Completion
Advantages of Coherent Attachment Over I/O Attachment:
- Virtual Addressing & Data Caching (significant latency reduction)
- Easier, Natural Programming Model (avoid application restructuring)
- Enables Apps Not Possible on I/O (Pointer chasing, shared mem semaphores, ...)
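
The two flows on this slide can be written down as step lists to make the difference in host-side work concrete. This is only an illustrative sketch: the step names are copied from the slide, while the "host" / "accelerator" labels and the helper `host_steps` are assumptions introduced here, not part of any CAPI API.

```python
# Steps copied from the slide; the side labels are an assumption.
TYPICAL_IO_FLOW = [
    ("DD Call", "host"),
    ("Copy or Pin Source Data", "host"),
    ("MMIO Notify Accelerator", "host"),
    ("Acceleration", "accelerator"),
    ("Poll / Int Completion", "host"),
    ("Copy or Unpin Result Data", "host"),
    ("Ret. From DD", "host"),
]
COHERENT_FLOW = [
    ("Shared Mem. Notify Accelerator", "host"),
    ("Acceleration", "accelerator"),
    ("Shared Memory Completion", "host"),
]


def host_steps(flow):
    """Return the names of the steps labeled as host-side work."""
    return [name for name, side in flow if side == "host"]


for label, flow in (("typical I/O", TYPICAL_IO_FLOW), ("coherent", COHERENT_FLOW)):
    print(f"{label}: {len(flow)} steps, {len(host_steps(flow))} on the host side")
```

With these labels, the coherent model drops the device driver call and return and the copy or pin steps, leaving only a shared-memory notify and completion around the acceleration itself.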

POWER8 Max Enterprise Interconnect
[Figure: 192-way SMP system built from 48-way Drawers; link bandwidths labeled 76.8 GB/s and 25.6 GB/s]
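
The "way" counts in this figure can be related to sockets and drawers. A minimal sketch, assuming that one "way" is one core and that every socket carries the 12 cores listed on the processor slide:

```python
CORES_PER_SOCKET = 12     # from the processor slide
THREADS_PER_CORE = 8      # SMT8
SYSTEM_WAYS = 192         # assumption: one "way" is one core
DRAWER_WAYS = 48

sockets_per_drawer = DRAWER_WAYS // CORES_PER_SOCKET
drawers = SYSTEM_WAYS // DRAWER_WAYS
sockets = sockets_per_drawer * drawers
hardware_threads = SYSTEM_WAYS * THREADS_PER_CORE

print(f"{drawers} drawers x {sockets_per_drawer} sockets = {sockets} sockets")
print(f"{hardware_threads} hardware threads at SMT8")
```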

POWER8 Scaling Enhancements
Coherence Protocol Innovations:
- Triple scope coherence protocol (& remote node I/O)
[Figure: Chip Scope, Book Scope, System Scope]

POWER8 Coherence Scaling Enhancements
- Increased Nest frequency
- Additional coherence “scope”
- Speculative Greedy Coherence Arbitration
[Chart: Peak Socket Coherence Capability, POWER7 vs. POWER8]

POWER8 Scaling Enhancements
Technology:
- Improved High speed chip interfaces improve multi-chip SMP bandwidth
- More compute capacity per chip and node reduces need to cross boundaries
Coherence Protocol Innovations:
- Parallel TLBIE broadcast and TLBIE filtering/priority capability
- Autonomic prefetch throttling
- Larx/Stcx scaling improvement reduces chance of contention driven queuing
Caching Capacity/Latency/Bandwidth Innovations:
- 8 MB L3 cache per core, Large L4 cache reduce miss traffic and improve latency
- Local memory latency improves by 20-25%
- On-chip intervention latency improves by 25%
- 2X intervention resources per core improves intervention bandwidth
Architectural Features/Assists:
- Hardware Transactional Memory support enables better SW scaling
- Hot/Cold page tracking enables SW CP/mem placement improvements

POWER 795/780+ 3-Hop 128-way Topology
[Figure: 128-way SMP system built from 32-way Drawers]

POWER8 Enterprise 2-hop 192-way Topology

POWER8 Enterprise 2-hop Multi-path 192-way Topology

POWER8 Low Profile Memory Cards
[Figure: Centaur Memory Buffer Chip with DDR Interfaces, Scheduler & Management, 16MB Memory Cache, POWER8 Link]
POWER8 Memory Card:
- Capacity: 16 GB / 32 GB / 64 GB
- 1600 MHz
- Memory Sparing
- Up to 8 Cards per socket
- Quad Interleave
- Systems: 2U to Enterprise
Intelligence Moved into Memory:
- Scheduling logic, 16MB Cache
- Energy Mgmt, RAS decision point
- Moved from Processor to Memory Buffer
Centaur Memory Buffer Chip:
- 9.6 GB/s high speed processor interface
- 4 DDR3 1600MHz DRAM interfaces
- Extensible for innovation build-out
- 22nm SOI for optimal performance / energy

POWER8 Full Height Memory Card
[Figure: Centaur Memory Buffer Chip with DDR Interfaces, Scheduler & Management, 16MB Memory Cache, POWER8 Link]
POWER8 Memory Card:
- Capacity: 16 GB / 32 GB / 64 GB / 128 GB
- 1600 MHz
- Memory Sparing
- Up to 8 Cards per socket
- Quad Interleave
- Systems: 4U and Enterprise
Intelligence Moved into Memory:
- Scheduling logic, 16MB Cache
- Energy Mgmt, RAS decision point
- Moved from Processor to Memory Buffer
Centaur Memory Buffer Chip:
- 9.6 GB/s high speed processor interface
- 4 DDR3 1600MHz DRAM interfaces
- Extensible for innovation build-out
- 22nm SOI for optimal performance / energy
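
The per-socket limits quoted earlier in the deck (128 MB L4, 32 DDR ports, 1 TB) follow from the per-card figures on the memory card slides. A minimal sketch, assuming all eight card positions hold identical cards and taking 1 TB as 1024 GB:

```python
CARDS_PER_SOCKET = 8
L4_MB_PER_CARD = 16           # one 16MB memory cache per Centaur buffer
DDR_PORTS_PER_CARD = 4        # "4 DDR3 1600MHz DRAM interfaces"
CARD_SIZES_GB = [16, 32, 64, 128]

l4_mb_per_socket = CARDS_PER_SOCKET * L4_MB_PER_CARD
ddr_ports_per_socket = CARDS_PER_SOCKET * DDR_PORTS_PER_CARD

# Socket capacity for each card size, assuming eight identical cards.
capacity_gb_per_socket = {size: size * CARDS_PER_SOCKET for size in CARD_SIZES_GB}

print(f"L4 per socket: {l4_mb_per_socket} MB")
print(f"DDR ports per socket: {ddr_ports_per_socket}")
for size, total in capacity_gb_per_socket.items():
    print(f"{size} GB cards -> {total} GB per socket")
```

Only the 128 GB full height card reaches the 1 TB per socket maximum under these assumptions.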

CEC I/O Drawer Schematic
[Schematic labels: CEC POWER8, Dual Paths (x8 per path), PCI x16, CXP, PCI CDR, FPGA, I/O Fan Out Module, PCI x16 / x8]

Scale-OUT Flexible Compute System
- Up to 4-socket (Heavy I/O Connectivity)
- Focus on Socket Throughput
- Strong in Compute and I/O
2 x Scale-OUT Chip:
- 12 Core
- 3 SMP Links
- 48x PCI (32x CAPI)
[Figure, one panel per chip: Core, L2, 8M L3 Region, L3 Cache and Chip Interconnect, MemCtrl, Tier1 local SMP link, 2 (remote) Tier2 SMP Links + 24x PCI, CAPI]
Achieve Socket objectives using two “60%” chips and extra SMP Tier

Two POWER8 Chips: Scale-Out and Enterprise Optimized
Resource        Scale-Out Chip -> DCM        Enterprise Chip (SCM)
Cores           Two x 6 -> 12                12
L3              Two x 48 MB -> 96 MB         96 MB
Mem Attach      Two x 4 -> 8                 8
PCI G3          Two x 24x -> 48x             32x
CAPI            Two x 16x -> 32x             16x
SMP 1st Tier    on-DCM                       Multi-chip Drawer
SMP 2nd Tier    Multiple DCM sockets         Multiple Drawers
Max SMP         Small                        Large
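
The comparison on this slide can be checked by doubling the Scale-Out chip's resources and comparing the result with the Enterprise chip. A minimal sketch; the dictionary keys are names chosen here for illustration, and the values are taken from the slide.

```python
# Per-chip resources from the slide.
scale_out_chip = {"cores": 6, "l3_mb": 48, "mem_attach": 4,
                  "pci_g3_lanes": 24, "capi_lanes": 16}
enterprise_scm = {"cores": 12, "l3_mb": 96, "mem_attach": 8,
                  "pci_g3_lanes": 32, "capi_lanes": 16}

# A DCM socket packages two Scale-Out chips.
dcm_socket = {name: 2 * value for name, value in scale_out_chip.items()}

# Positive numbers mean the DCM socket has more of that resource.
dcm_minus_scm = {name: dcm_socket[name] - enterprise_scm[name]
                 for name in dcm_socket}

print(dcm_socket)
print(dcm_minus_scm)
```

The two sockets match on cores, L3, and memory attach; the DCM offers 16 more PCI G3 lanes and 16 more CAPI lanes, while the SCM spends its interfaces on the larger drawer-level SMP.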

Performance optimization (partial list)
Core execution:
- Bigger caches
- SMT8
- Improved alignment capability
- Symmetric VSX pipeline
- 2x increase in load execution
- 4x translation cache
On chip caches:
- 2x Dataflow width
- Improved locking protocols
- Improved L3 replacement policy
SMP interconnect:
- Multipath topology
- Broadcast scope enhancements
- Dynamic coherence traffic balance
Acceleration:
- In-core Vector encryption
- Decrease NX latency
- CAPI
- NVLINK
Reduced memory latency:
- Fastpath/bypass
- Late Error detection (Flush pipeline on error)
- 2-hop topology
- L4 cache
- DRAM refresh avoidance
Increased memory bandwidth:
- 32 DDR3 channels
- 9.6 GHz interface
- Virtual Write Queue
- Extended prefetch into L4
Memory management:
- Hot/cold, affinity, and history tracking in hardware
- Translation shoot down

Reliability, Availability, Serviceability
Driving forces:
- Build upon the best
- Details become critical
Examples:
- Memory DIMM design
- New SMP cable design
[Table: features by area, compared across POWER7 Enterprise, POWER7+ Enterprise, and POWER8 Enterprise]
Added Error Detection/Fault Isolation/New Function:
- L2/L3 cache ECC
- Memory Chipkill and symbol error correction
- Error Handling for On Chip Accelerators
- Error handling for Off Chip accelerators etc. using CAPI interface
- Advanced error checking on fabric bus address generation
- Integrated On Chip Controller used for Power/Thermal Handling
- Host Boot
- Integrated PCIe function
Advanced technology for avoiding soft errors:
- SOI Processor Modules
- eDRAM L3 cache
- Stacked Latches
- More comprehensive use of stacked latches
- eDRAM for L4 Cache
Recovery/Retry:
- Processor Instruction Retry
- Memory buffer soft error retry
- Memory instruction replay
Self-healing/Repair/Fault Avoidance:
- L2/L3 cache line delete
- Dynamic Memory Bus data-lane repair
- Spare DRAM(s) in memory
- Dynamic inter-node fabric bus lane repair
- L3 cache column repair
- L2 Cache column repair
- L4 cache persistent fault handling
Other Error Mitigation:
- Alternate Processor Recovery
- CPU predictive deconfiguration
- Active Memory Mirroring of Hypervisor
- Use of Power On Reset Engine for dynamic processor re-initialization
- Dynamic substitution of unassigned memory for memory DIMMs called out for repair
