Adam Kunk, Anil John, Pete Bohman. Released by IBM in 2010 (~ February). Successor of the POWER6. Shift from high frequency to multi-core. Implements.


1 Adam Kunk, Anil John, Pete Bohman

2
• Released by IBM in 2010 (~ February)
• Successor of the POWER6
• Shift from high frequency to multi-core
• Implements the IBM Power Architecture (Power ISA v2.06)
• Clock rate: 2.4 GHz to 4.25 GHz
• Feature size: 45 nm
• ISA: Power ISA v2.06 (RISC)
• Cores: 4, 6, or 8
• Cache: L1, L2, and L3, all on chip
References: [1], [5]

3
• PERCS: Productive, Easy-to-use, Reliable Computing System
• DARPA-funded contract that IBM won in order to develop the POWER7 ($244 million contract, 2006)
  ▪ The contract was to develop a petascale supercomputer architecture before 2011 as part of the HPCS (High Productivity Computing Systems) project.
• IBM, Cray, and Sun Microsystems received HPCS grants for Phase II.
• IBM was chosen for Phase III in 2006.
References: [1], [2]

4
• Side note:
• The Blue Waters system was meant to be the first supercomputer using PERCS technology.
• However, IBM withdrew from the contract in 2011, citing cost and complexity.

5
2001: POWER4/4+ (First Dual Core in Industry)
• Dual Core
• Chip Multi Processing
• Distributed Switch
• Shared L2
• Dynamic LPARs (32)
• 180nm
2004: POWER5/5+ (Hardware Virtualization for Unix & Linux)
• Dual Core & Quad Core Md
• Enhanced Scaling
• 2 Thread SMT
• Distributed Switch +
• Core Parallelism +
• FP Performance +
• Memory bandwidth +
• 130nm, 90nm
2007: POWER6/6+ (Fastest Processor In Industry)
• Dual Core
• High Frequencies
• Virtualization +
• Memory Subsystem +
• Altivec
• Instruction Retry
• Dyn Energy Mgmt
• 2 Thread SMT +
• Protection Keys
• 65nm
2010: POWER7/7+ (Most POWERful & Scalable Processor in Industry)
• 4,6,8 Core
• 32MB On-Chip eDRAM
• Power Optimized Cores
• Mem Subsystem ++
• 4 Thread SMT++
• Reliability +
• VSM & VSX
• Protection Keys+
• 45nm, 32nm
Future: POWER8
References: [3]

6
• IBM POWER7 Demo

7
Cores:
• 8 Intelligent Cores / chip (socket)
• 4 and 6 Intelligent Cores available on some models
• 12 execution units per core
• Out-of-order execution
• 4-way SMT per core
• 32 threads per chip
• L1: 32 KB I-cache / 32 KB D-cache per core
• L2: 256 KB per core
Chip:
• 32 MB Intelligent L3 cache on chip
[Figure: chip diagram showing eight Core/L2 pairs, the L3 cache (eDRAM), the memory interface (Memory++), GX, SMP fabric, and POWER bus]
References: [3]

8

9
• Each core implements "aggressive" out-of-order (OoO) instruction execution
• The processor has an Instruction Sequence Unit capable of dispatching up to six instructions per cycle to a set of queues
• Up to eight instructions per cycle can be issued to the Instruction Execution units
References: [4]

10

11
• 8 instructions fetched from L2 to L1 I-cache or fetch buffer
• Balanced instruction rates across active threads
• Instruction grouping
• Instructions belonging to a group are issued together
• Groups contain independent instructions

12
• Branch Prediction

13
• Each POWER7 core has 12 execution units:
• 2 fixed-point units
• 2 load/store units
• 4 double-precision floating-point units (2x POWER6)
• 1 vector unit
• 1 branch unit
• 1 condition register unit
• 1 decimal floating-point unit
References: [4]

14

15

16
• Simultaneous Multithreading (SMT)
• SMT1: Single instruction execution thread per core
• SMT2: Two instruction execution threads per core
• SMT4: Four instruction execution threads per core
• This means that an 8-core POWER7 can execute 32 threads simultaneously
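The thread count quoted above is just cores per chip multiplied by SMT threads per core. A minimal sketch of that arithmetic follows; the function name `threads_per_chip` and the validation checks are illustrative assumptions, while the core counts (4, 6, 8) and SMT modes (1, 2, 4) are the ones listed in the slides.

```python
# Hardware threads per chip = cores per chip x SMT threads per core.
CORE_COUNTS = (4, 6, 8)  # core options listed on the overview slide
SMT_MODES = (1, 2, 4)    # SMT1, SMT2, SMT4


def threads_per_chip(cores: int, smt: int) -> int:
    """Return how many hardware threads can execute simultaneously."""
    if cores not in CORE_COUNTS:
        raise ValueError(f"unsupported core count: {cores}")
    if smt not in SMT_MODES:
        raise ValueError(f"unsupported SMT mode: {smt}")
    return cores * smt


if __name__ == "__main__":
    for cores in CORE_COUNTS:
        for smt in SMT_MODES:
            print(f"{cores} cores, SMT{smt}: {threads_per_chip(cores, smt)} threads")
```

For the 8-core part in SMT4 mode this gives the 32 threads stated on the slide.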

17
[Figure: occupancy of the execution units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) under four schemes: single thread out of order, S80 HW multi-thread, POWER5 2-way SMT, and POWER7 4-way SMT. Legend: Thread 0 executing, Thread 1 executing, Thread 2 executing, Thread 3 executing, no thread executing]
References: [3]

18

19

20
• (Look at section 2.1.4 in http://www.redbooks.ibm.com/redpapers/pdfs/redp4639.pdf)

21
Parameter     | L1                           | L2         | L3 (Local)     | L3 (Global)
Size          | 64 KB (32K I, 32K D)         | 256 KB     | 4 MB           | 32 MB
Location      | Core                         | Core       | On-Chip        | On-Chip
Access Time   | 0.5 ns                       | 2 ns       | 6 ns           | 30 ns
Associativity | 4-way I-cache, 8-way D-cache | 8-way      | 8-way          | 8-way
Write Policy  | Write Through                | Write Back | Partial Victim | Adaptive
Line size     | 128 B                        | 128 B      | 128 B          | 128 B
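To relate the access times in the table to the clock rates quoted earlier (2.4 GHz to 4.25 GHz), the sketch below converts nanoseconds to core cycles. The 4.0 GHz clock used in the demo is an assumed value inside that range, not a figure from the slides, and `ns_to_cycles` is a hypothetical helper name.

```python
# Convert the table's access times into core clock cycles.
# 1 GHz is one cycle per nanosecond, so cycles = latency_ns * clock_ghz.
ACCESS_TIME_NS = {
    "L1": 0.5,
    "L2": 2.0,
    "L3 (Local)": 6.0,
    "L3 (Global)": 30.0,
}


def ns_to_cycles(latency_ns: float, clock_ghz: float) -> int:
    """Return the latency in clock cycles, rounded to the nearest cycle."""
    return round(latency_ns * clock_ghz)


if __name__ == "__main__":
    clock_ghz = 4.0  # assumed clock, within the 2.4 to 4.25 GHz range
    for level, latency in ACCESS_TIME_NS.items():
        print(f"{level}: {latency} ns is about {ns_to_cycles(latency, clock_ghz)} cycles")
```

At the assumed 4.0 GHz, a global L3 access costs roughly 120 cycles against 2 cycles for L1, which is the gap the local L3 region is meant to narrow.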

22
• On-chip cache required for sufficient bandwidth to 8 cores
• Previous off-chip socket interface unable to scale
• Support dynamic cores
• Utilize ILP and increased SMT latency overlap

23
• I and D cache split to reduce latency
• Way prediction bits reduce hit latency
• Write-through
• No L1 write-backs required on line eviction
• High-speed L2 able to handle bandwidth
• B-Tree LRU replacement
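The last bullet can be illustrated with a tree-based pseudo-LRU policy for one 8-way set. Reading "B-Tree LRU" as binary-tree pseudo-LRU is an assumption, and the class below (`TreePLRU`, a hypothetical name) is a generic textbook sketch, not a description of the actual POWER7 circuit.

```python
class TreePLRU:
    """Binary-tree pseudo-LRU for one cache set (ways must be a power of two)."""

    def __init__(self, ways: int = 8):
        if ways < 2 or ways & (ways - 1):
            raise ValueError("ways must be a power of two, at least 2")
        self.ways = ways
        # One bit per internal tree node: 0 = victim is in the left half,
        # 1 = victim is in the right half.
        self.bits = [0] * (ways - 1)

    def touch(self, way: int) -> None:
        """Record an access: point every node on the path away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1  # accessed left half, so evict from right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0  # accessed right half, so evict from left
                node, lo = 2 * node + 2, mid

    def victim(self) -> int:
        """Follow the node bits from the root to the way to replace."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo
```

The scheme needs only 7 bits for 8 ways instead of a full ordering, at the cost of being approximate: after touching ways 0 through 7 in order and then way 0 again, it selects way 4 rather than the true least recently used way 1.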

24
• Superset of L1 (inclusive)
• Reduced latency by decreasing capacity
• L2 utilizes the local L3 cache as a victim cache
• Increased associativity
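The victim-cache relationship in the bullets above can be sketched as a toy model: lines evicted from the L2 drop into a second structure standing in for the local L3 region, and an L2 miss checks that structure before counting as a full miss. The class name `L2WithVictim`, the LRU policy, and the tiny capacities are all illustrative assumptions, not details from the slides.

```python
from collections import OrderedDict


class L2WithVictim:
    """Toy L2 whose evicted lines are kept in a victim store (local L3 stand-in)."""

    def __init__(self, l2_lines: int, victim_lines: int):
        self.l2 = OrderedDict()      # address -> data, least recent first
        self.victim = OrderedDict()  # lines cast out of the L2
        self.l2_lines = l2_lines
        self.victim_lines = victim_lines

    def _install(self, addr: int, data: str) -> None:
        """Place a line in the L2, casting the oldest line out to the victim store."""
        self.l2[addr] = data
        self.l2.move_to_end(addr)
        if len(self.l2) > self.l2_lines:
            old_addr, old_data = self.l2.popitem(last=False)
            self.victim[old_addr] = old_data
            self.victim.move_to_end(old_addr)
            if len(self.victim) > self.victim_lines:
                self.victim.popitem(last=False)

    def access(self, addr: int) -> str:
        """Return where the line was found: 'L2 hit', 'L3-local hit', or 'miss'."""
        if addr in self.l2:
            self.l2.move_to_end(addr)
            return "L2 hit"
        if addr in self.victim:
            self._install(addr, self.victim.pop(addr))
            return "L3-local hit"
        self._install(addr, f"line@{addr:#x}")
        return "miss"
```

With a two-line L2, accessing 0x000, 0x080, and 0x100 pushes 0x000 into the victim store, so a later access to 0x000 is served from there instead of going further out.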

25
• 32 MB Fluid L3 cache
• 4 MB local L3 region for each of the 8 cores
  ▪ Local region is closer to its core, which reduces latency
• L3 cache accesses are routed to the local L3 region first
• Cache lines are cloned when used by multiple cores
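The lookup order described above (local region first, then the rest of the shared L3, cloning a line into the local region once another core uses it) can be sketched as a toy model. The 6 ns and 30 ns figures are the local and global L3 access times from the cache parameter table; the class `FluidL3` and its methods are hypothetical, and capacity limits and coherence are deliberately left out.

```python
LOCAL_NS = 6.0    # L3 (Local) access time from the cache parameter table
GLOBAL_NS = 30.0  # L3 (Global) access time from the same table
CORES = 8
REGION_MB = 4     # one local region per core; 8 x 4 MB = 32 MB in total


class FluidL3:
    """Toy model of a shared L3 split into one local region per core."""

    def __init__(self, cores: int = CORES):
        self.regions = [set() for _ in range(cores)]

    def fill(self, core: int, addr: int) -> None:
        """Place a line in the given core's local region."""
        self.regions[core].add(addr)

    def lookup(self, core: int, addr: int):
        """Return (where, latency_ns); the local region is checked first."""
        if addr in self.regions[core]:
            return ("local", LOCAL_NS)
        for other, region in enumerate(self.regions):
            if other != core and addr in region:
                # Clone the line so this core's next access is local.
                self.regions[core].add(addr)
                return ("remote", GLOBAL_NS)
        return ("miss", None)
```

In this model the first access by core 0 to a line held only in core 3's region pays the 30 ns global latency, and the repeat access pays the 6 ns local latency because the line was cloned.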

26

27
• Embedded Dynamic Random-Access Memory (eDRAM)
• Less area (1 transistor vs. 6-transistor SRAM)
• Enables on-chip L3 cache
  ▪ Reduces L3 latency
  ▪ Larger internal bus size, which increases bandwidth
• Compared to off-chip SRAM cache
  ▪ 1/6 latency
  ▪ 1/5 standby power
• Utilized in game consoles (PS2, Wii, etc.)
References: [5], [6]
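The area argument above (1 transistor per bit versus 6 for SRAM) can be made concrete for the 32 MB L3. The sketch below counts storage-cell transistors only; it ignores the eDRAM cell capacitor, tags, ECC, decoders, and sense amplifiers, so the numbers are a rough illustration rather than actual die figures. The function name `cell_transistors` is hypothetical.

```python
def cell_transistors(capacity_mb: int, transistors_per_bit: int) -> int:
    """Return the number of storage-cell transistors for a cache of this size."""
    bits = capacity_mb * 1024 * 1024 * 8  # MB -> bytes -> bits
    return bits * transistors_per_bit


if __name__ == "__main__":
    sram = cell_transistors(32, 6)   # 6-transistor SRAM cell
    edram = cell_transistors(32, 1)  # 1-transistor eDRAM cell
    print(f"32 MB as 6T SRAM:  {sram:,} transistors")
    print(f"32 MB as 1T eDRAM: {edram:,} transistors")
    print(f"ratio: {sram // edram}x")
```

Under these simplifications, 32 MB of 6T SRAM needs about 1.6 billion cell transistors, while 1T eDRAM needs about 268 million.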

28

29

30
1. http://en.wikipedia.org/wiki/POWER7
2. http://en.wikipedia.org/wiki/PERCS
3. Central PA PUG POWER7 review.ppt: http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCEQFjAA&url=http%3A%2F%2Fwww.ibm.com%2Fdeveloperworks%2Fwikis%2Fdownload%2Fattachments%2F135430247%2FCentral%2BPA%2BPUG%2BPOWER7%2Breview.ppt&ei=3El3T6ejOI-40QGil-GnDQ&usg=AFQjCNFESXDZMpcC2z8y8NkjE-v3S_5t3A

31
4. http://www.redbooks.ibm.com/redpapers/pdfs/redp4639.pdf
5. http://www.serc.iisc.ernet.in/~govind/243/Power7.pdf
6. http://en.wikipedia.org/wiki/EDRAM

