Adam Kunk, Anil John, Pete Bohman
Released by IBM in 2010 (around February)
Successor of the POWER6
Implements IBM PowerPC architecture v2.06
Clock rate: from 2.4 GHz
Feature size: 45 nm
ISA: Power ISA v2.06 (RISC)
Cores: 4, 6, or 8
Cache: L1, L2, L3 – all on chip
References: [1], [5]
PERCS – Productive, Easy-to-use, Reliable Computer System
DARPA-funded contract that IBM won in order to develop the POWER7 ($244 million contract, 2006)
▪ The contract was to develop a petascale supercomputer architecture before 2011 in the HPCS (High Productivity Computing Systems) project.
IBM, Cray, and Sun Microsystems received HPCS grants for Phase II.
IBM was chosen for Phase III.
References: [1], [2]
Side note: The Blue Waters system was meant to be the first supercomputer using PERCS technology, but the contract was cancelled because of cost and complexity.
POWER processor roadmap:
POWER4/4+ (180nm): Dual Core, Chip Multi Processing, Distributed Switch, Shared L2, Dynamic LPARs (32)
POWER5/5+ (130nm, 90nm): Dual Core & Quad Core Md, Enhanced Scaling, 2 Thread SMT, Distributed Switch +, Core Parallelism +, FP Performance +, Memory bandwidth +
POWER6/6+ (65nm): Dual Core, High Frequencies, Virtualization +, Memory Subsystem +, Altivec, Instruction Retry, Dyn Energy Mgmt, 2 Thread SMT +, Protection Keys
POWER7/7+ (45nm, 32nm): 4,6,8 Core, 32MB On-Chip eDRAM, Power Optimized Cores, Mem Subsystem ++, 4 Thread SMT++, Reliability +, VSM & VSX, Protection Keys+
POWER8: Future
Roadmap captions, in order: First Dual Core in Industry; Hardware Virtualization for Unix & Linux; Fastest Processor In Industry; Most POWERful & Scalable Processor in Industry
References: [3]
Cores:
8 Intelligent Cores / chip (socket)
4 and 6 Intelligent Cores available on some models
12 execution units per core
Out of order execution
4 Way SMT per core
32 threads per chip
L1 – 32 KB I Cache / 32 KB D Cache per core
L2 – 256 KB per core
Chip:
32MB Intelligent L3 Cache on chip
[Figure: POWER7 chip layout. Labels: eight Core/L2 pairs, Memory Interface, GX, SMP Fabric, POWER Bus, Memory++, L3 Cache, eDRAM]
References: [3]
Each core implements “aggressive” out-of-order (OoO) instruction execution
The processor has an Instruction Sequence Unit capable of dispatching up to six instructions per cycle to a set of queues
Up to eight instructions per cycle can be issued to the Instruction Execution units
References: [4]
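The gap between the two widths (six dispatched, eight issued per cycle) makes sense once the queues are seen as buffers: after a stall, the issue side can drain faster than the dispatch side refills. The following is only a toy sketch of that idea, not a model of the real POWER7 pipeline. The single shared queue, the `simulate` function, and the `stalls` parameter are assumptions made for illustration.

```python
from collections import deque

DISPATCH_WIDTH = 6  # up to six instructions dispatched per cycle (from the slide)
ISSUE_WIDTH = 8     # up to eight instructions issued per cycle (from the slide)


def simulate(stalls, total_instructions):
    """Toy model with one shared issue queue (an assumption).

    `stalls` is a hypothetical set of cycle numbers in which nothing can
    issue, for example because operands are not ready yet.
    Returns the number of instructions issued in each cycle.
    """
    queue = deque()
    remaining = total_instructions
    issued_per_cycle = []
    cycle = 0
    while remaining > 0 or queue:
        # Dispatch stage: move up to six instructions into the queue.
        dispatched = min(DISPATCH_WIDTH, remaining)
        queue.extend(range(dispatched))
        remaining -= dispatched
        # Issue stage: drain up to eight instructions unless stalled.
        issued = 0 if cycle in stalls else min(ISSUE_WIDTH, len(queue))
        for _ in range(issued):
            queue.popleft()
        issued_per_cycle.append(issued)
        cycle += 1
    return issued_per_cycle


print(simulate(stalls={0}, total_instructions=18))     # [0, 8, 8, 2]
print(simulate(stalls=set(), total_instructions=12))   # [6, 6]
```

Without stalls the issue rate is capped by dispatch at six per cycle. With a stall in cycle 0, the backlog lets the issue side reach its full width of eight.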
8 instructions fetched from L2 to L1 I-cache or fetch buffer
Balanced instruction rates across active threads
Instruction grouping:
▪ Instructions belonging to a group are issued together
▪ Groups contain independent instructions
Branch Prediction
Each POWER7 core has 12 execution units:
2 fixed point units
2 load store units
4 double precision floating point units (2x POWER6)
1 vector unit
1 branch unit
1 condition register unit
1 decimal floating point unit
References: [4]
Simultaneous Multithreading
SMT1: Single instruction execution thread per core
SMT2: Two instruction execution threads per core
SMT4: Four instruction execution threads per core
This means that an 8-core POWER7 can execute 32 threads simultaneously
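The 32-thread figure is simply cores multiplied by threads per core. A minimal sketch of that arithmetic for the core counts and SMT modes listed in this deck; the helper name `hardware_threads` is made up for illustration.

```python
# Threads per core for each SMT mode named on the slide.
SMT_MODES = {"SMT1": 1, "SMT2": 2, "SMT4": 4}


def hardware_threads(cores, smt_mode):
    """Hardware threads per chip = cores x threads per core."""
    return cores * SMT_MODES[smt_mode]


for cores in (4, 6, 8):  # core counts offered for the POWER7
    for mode in SMT_MODES:
        print(f"{cores} cores, {mode}: {hardware_threads(cores, mode)} threads")

print(hardware_threads(8, "SMT4"))  # 32
```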
[Figure: execution unit usage per cycle (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) for four designs: single thread out of order, S80 HW multi-thread, POWER5 2 Way SMT, POWER7 4 Way SMT. Legend: Thread 0 Executing, Thread 1 Executing, Thread 2 Executing, Thread 3 Executing, No Thread Executing]
References: [3]
(Look at section in fs/redp4639.pdf)
Parameter       L1                             L2           L3 (Local)       L3 (Global)
Size            64 KB (32 I, 32 D)             256 KB       4 MB             32 MB
Access Time     0.5 ns                         2 ns         6 ns             30 ns
Associativity   4-way I-cache, 8-way D-cache   8-way
Write Policy    Write Through                  Write Back   Partial Victim   Adaptive
Line size       128 B
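The size, associativity, and line size above determine how many sets each cache has and how an address splits into offset and index bits. A small sketch of that arithmetic follows. It assumes the 128 B line size applies to both L1 and L2 and that the 8-way entry refers to the L2; the L3 is left out because its associativity is not clearly recoverable from the table. The function names are illustrative.

```python
KB = 1024
LINE_BYTES = 128  # line size from the table (assumed to apply to L1 and L2)


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Sets = total lines / ways, where total lines = size / line size."""
    return (size_bytes // line_bytes) // ways


def index_bits(sets):
    """Address bits needed to select a set (sets must be a power of two)."""
    return sets.bit_length() - 1


caches = {
    "L1 I-cache": (32 * KB, 4),
    "L1 D-cache": (32 * KB, 8),
    "L2": (256 * KB, 8),
}

offset_bits = index_bits(LINE_BYTES)  # 7 bits select a byte within a line
for name, (size, ways) in caches.items():
    sets = num_sets(size, ways)
    print(f"{name}: {sets} sets, {index_bits(sets)} index bits, {offset_bits} offset bits")
# L1 I-cache: 64 sets, 6 index bits, 7 offset bits
# L1 D-cache: 32 sets, 5 index bits, 7 offset bits
# L2: 256 sets, 8 index bits, 7 offset bits
```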
2 read ports, 1 write port
Writes have priority over reads
Write-through, so no L1 cast-outs are required
B-Tree LRU replacement
Way prediction bits reduce hit latency
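The slide does not spell out the replacement scheme. One common reading of "B-Tree LRU" is binary-tree pseudo-LRU, where an 8-way set keeps 7 bits instead of a full ordering. The sketch below illustrates that general technique under this assumption; it is not taken from IBM documentation, and the class and method names are invented.

```python
class TreePLRU:
    """Binary-tree pseudo-LRU state for one cache set (illustrative sketch)."""

    def __init__(self, ways=8):
        if ways < 2 or ways & (ways - 1):
            raise ValueError("ways must be a power of two >= 2")
        self.ways = ways
        # One bit per internal tree node, stored in heap order.
        # 0 means the next victim is in the left half, 1 the right half.
        self.bits = [0] * (ways - 1)

    def touch(self, way):
        """Record an access: point every node on the path away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1          # accessed left, victim goes right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0          # accessed right, victim goes left
                node, lo = 2 * node + 2, mid

    def victim(self):
        """Follow the bits from the root to pick the way to replace."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo


plru = TreePLRU(ways=8)
for way in range(8):
    plru.touch(way)
print(plru.victim())  # 0: the least recently touched way
plru.touch(0)
print(plru.victim())  # 4: an approximation (true LRU would pick way 1)
```

The second result shows why this is called pseudo-LRU: the tree only remembers which half was touched last at each level, which is cheap to store and update but not always the exact least recently used way.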
Inclusive of L1
Partial victim relationship with L3
Details of the L3 Cache …. (leads up to eDRAM)
eDRAM – Embedded dynamic random-access memory
This means the L3 cache (shared 32 MB) is on-chip
Faster due to decreased distance
Less area, less power
On-chip interconnects provide each core with 32-byte buses to and from the L3 cache
Side note: eDRAM is also used in many game consoles (PS2, GameCube, Wii, etc.)
References: [5], [6]
eDRAM in the POWER7 provides 1/6 the latency and twice the bandwidth (compared with off-chip eDRAM), and 1/5 the standby power in 1/3 the required area (compared with SRAM)
References: [5]
3. Central PA PUG POWER7 review.ppt
4. dfs/redp4639.pdf
5. ower7.pdf
6.