Adam Kunk, Anil John, Pete Bohman
Released by IBM in February 2010
Successor of the POWER6; shift from high frequency to multi-core
ISA: IBM Power ISA v2.06 (RISC)
Clock rate: 2.4 GHz
Feature size: 45 nm
Cores: 4, 6, or 8
Cache: L1, L2, L3 – all on chip
References: [1], [5]
PERCS – Productive, Easy-to-use, Reliable Computer System
DARPA-funded contract that IBM won in order to develop the POWER7 ($244 million contract, 2006)
▪ The contract was to develop a petascale supercomputer architecture before 2011 as part of the HPCS (High Productivity Computing Systems) project.
IBM, Cray, and Sun Microsystems received HPCS grants for Phase II; IBM was chosen for Phase III in 2006.
References: [1], [2]
Side note: The Blue Waters system was meant to be the first supercomputer built on PERCS technology, but the contract was cancelled due to cost and complexity.
POWER roadmap:
POWER4/4+ (180 nm, 130 nm): Dual core, chip multiprocessing, distributed switch, shared L2, dynamic LPARs (32). First dual core in the industry.
POWER5/5+ (130 nm, 90 nm): Dual core & quad core module, enhanced scaling, 2-thread SMT, distributed switch+, core parallelism+, FP performance+, memory bandwidth+. Hardware virtualization for UNIX & Linux.
POWER6/6+ (65 nm): Dual core, high frequencies, virtualization+, memory subsystem+, AltiVec, instruction retry, dynamic energy management, 2-thread SMT+, protection keys. Fastest processor in the industry.
POWER7/7+ (45 nm, 32 nm): 4, 6, 8 cores, 32 MB on-chip eDRAM, power-optimized cores, memory subsystem++, 4-thread SMT++, reliability+, VSM & VSX, protection keys+. Most POWERful & scalable processor in the industry.
POWER8: Future.
References: [3]
Cores:
8 Intelligent Cores per chip (socket); 4- and 6-core models also available
12 execution units per core
Out-of-order execution
4-way SMT per core → 32 threads per chip
L1 – 32 KB I-cache / 32 KB D-cache per core
L2 – 256 KB per core
Chip: 32 MB Intelligent L3 cache on chip (eDRAM)
[Die diagram: eight cores with private L2s around a shared eDRAM L3, plus memory interfaces, GX bus, SMP fabric, and POWER bus]
References: [3]
POWER7 scales up to 32 sockets, with up to 8 cores per socket
Each chip (socket) can execute 32 threads simultaneously (8 cores × 4-way SMT), so a 32-socket system supports up to 32 × 32 = 1,024 simultaneous threads
360 GB/s peak SMP bandwidth per chip
590 GB/s peak I/O bandwidth per chip
Up to 20,000 coherent operations in flight (very aggressive out-of-order execution)
References: [3]
TurboCore mode:
Runs the chip with 4 of the 8 cores active
7.25% higher core frequency
2X the L3 cache per core (fluid cache): 32 MB / 4 cores = 8 MB per core, vs. 32 MB / 8 cores = 4 MB
Tradeoffs:
Reduces per-core software licensing costs
Favors per-core performance for transactional workloads (e.g., databases)
Gives up total parallel throughput relative to running all 8 cores
Each core implements “aggressive” out-of-order (OoO) instruction execution
The Instruction Sequencing Unit can dispatch up to six instructions per cycle to a set of issue queues
Up to eight instructions per cycle can be issued to the instruction execution units
References: [4]
Up to 8 instructions fetched per cycle from the L2 into the L1 I-cache or fetch buffer
Fetch is balanced across active threads
Instruction grouping:
Instructions belonging to a group are issued together
Groups contain only independent instructions
POWER7 uses separate mechanisms to predict the branch direction (taken/not taken) and the branch target address.
The Instruction Fetch Unit (IFU) supports a 3-cycle branch scan loop: it scans fetched instructions for branches, computes target addresses, and determines whether each branch is unconditional or predicted taken.
References: [5]
Tournament predictor (the GSEL array selects between the two predictors):
8K-entry local BHT (LBHT)
▪ BHT – Branch History Table
16K-entry global BHT (GBHT)
8K-entry global selection array (GSEL)
These arrays provide branch direction predictions for all instructions in a fetch group (up to 8 instructions)
The arrays are shared by all threads
References: [5]
Indexing:
The 8K-entry LBHT is directly indexed by 10 bits of the instruction fetch address
The GBHT and GSEL arrays are indexed by the instruction fetch address hashed with a 21-bit global history vector (GHV, one per thread) folded down to 11 bits
References: [5]
The value in the GSEL array chooses between the LBHT and GBHT prediction for each individual branch – hence the tournament predictor (a simple sketch of this selection scheme appears below).
Each BHT entry (LBHT and GBHT) contains 2 bits:
The higher-order bit gives the direction (taken/not taken)
The lower-order bit provides hysteresis (resistance to flipping the prediction after a single mispredict)
References: [5]
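To make the LBHT/GBHT/GSEL interaction concrete, here is a minimal, illustrative C sketch of a tournament direction predictor. The table sizes match the slide, but the exact hash, folding function, update policy, and fetch-group handling in POWER7 are not given here, so those details (and all function names) are assumptions for illustration only.

```c
#include <stdint.h>
#include <stdbool.h>

#define LBHT_ENTRIES  (8 * 1024)   /* 8K-entry local BHT   */
#define GBHT_ENTRIES  (16 * 1024)  /* 16K-entry global BHT */
#define GSEL_ENTRIES  (8 * 1024)   /* 8K-entry selector    */

/* 2-bit entries: high bit = direction, low bit = hysteresis */
static uint8_t lbht[LBHT_ENTRIES];
static uint8_t gbht[GBHT_ENTRIES];
static uint8_t gsel[GSEL_ENTRIES];
static uint32_t ghv;               /* 21-bit global history vector (one per thread) */

/* Fold the 21-bit GHV down to 11 bits and hash it with the fetch address
 * (illustrative hash -- the real folding function is not specified here). */
static uint32_t global_index(uint64_t fetch_addr) {
    uint32_t folded = (ghv ^ (ghv >> 11)) & 0x7FF;           /* 11 bits */
    return (uint32_t)(fetch_addr >> 2) ^ folded;
}

bool predict_direction(uint64_t fetch_addr) {
    uint32_t li = (uint32_t)(fetch_addr >> 2) & (LBHT_ENTRIES - 1);  /* direct index */
    uint32_t gi = global_index(fetch_addr) & (GBHT_ENTRIES - 1);
    uint32_t si = global_index(fetch_addr) & (GSEL_ENTRIES - 1);

    bool local_taken  = (lbht[li] >> 1) != 0;   /* high bit = taken/not taken */
    bool global_taken = (gbht[gi] >> 1) != 0;

    /* GSEL chooses which predictor supplies the direction. */
    return (gsel[si] >> 1) ? global_taken : local_taken;
}

/* Saturating 2-bit update: the direction bit flips only after two wrong guesses (hysteresis). */
static void update_2bit(uint8_t *entry, bool taken) {
    if (taken)  { if (*entry < 3) (*entry)++; }
    else        { if (*entry > 0) (*entry)--; }
}

void train(uint64_t fetch_addr, bool taken) {
    uint32_t li = (uint32_t)(fetch_addr >> 2) & (LBHT_ENTRIES - 1);
    uint32_t gi = global_index(fetch_addr) & (GBHT_ENTRIES - 1);
    uint32_t si = global_index(fetch_addr) & (GSEL_ENTRIES - 1);

    bool local_correct  = ((lbht[li] >> 1) != 0) == taken;
    bool global_correct = ((gbht[gi] >> 1) != 0) == taken;

    update_2bit(&lbht[li], taken);
    update_2bit(&gbht[gi], taken);
    /* Train the selector toward whichever predictor was right. */
    if (local_correct != global_correct)
        update_2bit(&gsel[si], global_correct);

    ghv = ((ghv << 1) | (taken ? 1u : 0u)) & 0x1FFFFF;        /* 21-bit history */
}
```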
Branch targets are predicted in two ways:
1. Indirect branches that are not subroutine returns use a 128-entry count cache (shared by all active threads).
▪ The count cache is indexed by XORing 7 bits of the instruction fetch address with the GHV (global history vector)
▪ Each count cache entry holds a 62-bit predicted address plus 2 confidence bits
References: [5]
2. Subroutine returns are predicted using a link stack (one per thread).
▪ This is like the “Return Address Stack” discussed in lecture
▪ Link stack depth by POWER7 mode: ST and SMT2 – 16 entries per thread; SMT4 – 8 entries per thread
A minimal link stack sketch appears below.
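As a companion to the return-address prediction above, here is a minimal, illustrative link-stack sketch in C. The 16-entry depth matches the ST/SMT2 case on the slide; the wrap-around-on-overflow behavior and the function names are assumptions, not POWER7 internals.

```c
#include <stdint.h>

#define LINK_STACK_DEPTH 16            /* 16 entries in ST/SMT2 mode, 8 in SMT4 */

/* Per-thread link stack: pushed on branch-and-link, popped on branch-to-link. */
typedef struct {
    uint64_t entry[LINK_STACK_DEPTH];
    int top;                           /* logical index of the next free slot */
} link_stack_t;

/* A call (branch-and-link) pushes its return address. Older entries are
 * silently overwritten on overflow (assumed behavior for this sketch). */
void link_stack_push(link_stack_t *ls, uint64_t return_addr) {
    ls->entry[ls->top % LINK_STACK_DEPTH] = return_addr;
    ls->top++;
}

/* A return (branch-to-link-register) pops the predicted target address. */
uint64_t link_stack_pop(link_stack_t *ls) {
    if (ls->top == 0)
        return 0;                      /* empty: no prediction available */
    ls->top--;
    return ls->entry[ls->top % LINK_STACK_DEPTH];
}
```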
Each POWER7 core has 12 execution units:
2 fixed-point units
2 load/store units
4 double-precision floating-point units (2x POWER6)
1 vector unit
1 branch unit
1 condition register unit
1 decimal floating-point unit
References: [4]
Advanced branch prediction
Large out-of-order execution windows
Large and fast caches
More than one execution thread per core: a single 8-core POWER7 processor can execute 32 threads in the same clock cycle
IBM POWER7 Demo: visual representation of the SMT capabilities of the POWER7 and a brief introduction to the on-chip L3 cache
Simultaneous Multithreading (SMT): separate instruction streams running concurrently on the same physical processor
POWER7 provides:
2 pipes for storage instructions (loads/stores)
2 pipes for arithmetic instructions (add, subtract, etc.)
1 pipe for branch instructions (control flow)
Parallel support for floating-point and vector operations
References: [7], [8]
SMT modes:
SMT1 (ST): one instruction execution thread per core
SMT2: two instruction execution threads per core
SMT4: four instruction execution threads per core
POWER7 supports SMT1, SMT2, and SMT4; in SMT4 an 8-core POWER7 can execute 32 threads simultaneously (see the affinity sketch below)
References: [5], [8]
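To show what SMT4 looks like from software, here is a small Linux/glibc sketch that pins four worker threads onto the four hardware threads of one core. It assumes, as is typical for Linux on POWER, that the SMT threads of core 0 appear as logical CPUs 0-3; the worker function and CPU numbering are illustrative, not taken from the slides.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Trivial compute-bound worker so the example is self-contained. */
static void *worker(void *arg) {
    long id = (long)arg;
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)   /* keep the hardware thread busy */
        x += i * 0.5;
    printf("software thread %ld finished on CPU %d\n", id, sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t tid[4];

    /* In SMT4 mode, one POWER7 core exposes 4 logical CPUs to the OS.
     * Assumption: logical CPUs 0-3 are the four SMT threads of core 0. */
    for (long i = 0; i < 4; i++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);

        pthread_create(&tid[i], NULL, worker, (void *)i);
        pthread_setaffinity_np(tid[i], sizeof(set), &set);  /* pin to one SMT thread */
    }
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```

Compile with `-pthread`; all four software threads then share the execution units of a single core, which is exactly the sharing shown in the SMT pipeline diagram that follows.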
[Diagram: utilization of the FX0/FX1, FP0/FP1, LS0/LS1, BRX, and CRL issue pipes across four designs – single-thread out-of-order, S80 hardware multithreading, POWER5 2-way SMT, and POWER7 4-way SMT – with threads 0-3 occupying different pipes in the same cycle]
References: [3]
Parameter     | L1                        | L2         | L3 (Local)     | L3 (Global)
Size          | 64 KB (32 KB I, 32 KB D)  | 256 KB     | 4 MB           | 32 MB
Location      | Core                      | On-chip    | On-chip        | On-chip
Access time   | 0.5 ns                    | 2 ns       | 6 ns           | 30 ns
Associativity | 4-way I-cache, 8-way D-cache | 8-way   | –              | –
Write policy  | Write-through             | Write-back | Partial victim | Adaptive
Line size     | 128 B                     | 128 B      | 128 B          | 128 B
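As a quick sanity check on these parameters, the sketch below derives the number of sets and index bits for each cache level from its size, associativity, and 128-byte line size, using the standard relation sets = size / (line_size × ways). Treating the L3 regions as 8-way here is only an assumption for illustration, since the table above does not give L3 associativity.

```c
#include <stdio.h>

/* sets = capacity / (line_size * ways); index bits = log2(sets) */
static void describe(const char *name, long capacity_bytes, int ways, int line_bytes) {
    long sets = capacity_bytes / ((long)line_bytes * ways);
    int index_bits = 0;
    for (long s = sets; s > 1; s >>= 1)
        index_bits++;
    printf("%-10s %9ld B, %2d-way, %d B lines -> %5ld sets (%d index bits)\n",
           name, capacity_bytes, ways, line_bytes, sets, index_bits);
}

int main(void) {
    describe("L1 I-cache", 32 * 1024, 4, 128);         /* per core                 */
    describe("L1 D-cache", 32 * 1024, 8, 128);         /* per core                 */
    describe("L2",        256 * 1024, 8, 128);         /* per core                 */
    describe("L3 local",  4L * 1024 * 1024, 8, 128);   /* 8-way is an assumption   */
    describe("L3 global", 32L * 1024 * 1024, 8, 128);  /* 8-way is an assumption   */
    return 0;
}
```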
On-chip L3 cache is required to provide sufficient bandwidth to 8 cores; the previous off-chip socket interface was unable to scale.
Supports dynamic cores
Utilizes ILP and increased SMT to overlap latency
(Look at table on page 401 in book)
Multicore coherence protocol: extended MESI with behavioral and locality hints
NOTE: MESI is also known as the “Illinois protocol”
Multicore coherence implementation: directory at L3 (*from page 401 in book)
In directory-based coherence, a directory co-located with the L3 tracks which cores hold each cache line and in what state, so coherence requests are forwarded only to the relevant cores instead of being broadcast (see the sketch below).
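Here is a minimal, illustrative directory sketch in C: one directory entry per cache line records the sharing cores as a bitmask plus a state, and read/write requests update it accordingly. This is a generic textbook-style directory, not the actual POWER7 L3 directory format, and all names are assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CORES 8

typedef enum { INVALID, SHARED, MODIFIED } dir_state_t;   /* simplified MESI-style states */

/* One directory entry per cache line, kept alongside the L3. */
typedef struct {
    dir_state_t state;
    uint8_t     sharers;     /* bitmask: which cores hold the line */
} dir_entry_t;

/* A core reads the line: if another core holds it MODIFIED, that owner must
 * supply the dirty data (modeled as one coherence message); then the reader
 * is added as a sharer. */
int handle_read(dir_entry_t *e, int core) {
    int messages = 0;
    if (e->state == MODIFIED)
        messages++;                          /* fetch dirty data from the single owner */
    e->state = SHARED;
    e->sharers |= (uint8_t)(1u << core);
    return messages;
}

/* A core writes the line: every other sharer is invalidated point-to-point. */
int handle_write(dir_entry_t *e, int core) {
    int messages = 0;
    for (int c = 0; c < NUM_CORES; c++) {
        if (c != core && (e->sharers & (1u << c)))
            messages++;                      /* targeted invalidation, no broadcast */
    }
    e->state = MODIFIED;
    e->sharers = (uint8_t)(1u << core);      /* writer becomes the sole owner */
    return messages;
}

int main(void) {
    dir_entry_t line = { INVALID, 0 };
    printf("core 0 read : %d coherence messages\n", handle_read(&line, 0));
    printf("core 1 read : %d coherence messages\n", handle_read(&line, 1));
    printf("core 2 write: %d coherence messages\n", handle_write(&line, 2));
    return 0;
}
```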
L1 design:
Split I- and D-caches to reduce latency
Way-prediction bits reduce hit latency
Write-through policy: no L1 write-backs are required on line eviction; the high-speed L2 absorbs the write bandwidth
B-tree (pseudo-)LRU replacement
Prefetching: on each L1 I-cache miss, the next 2 blocks are prefetched (see the sketch below)
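The sequential prefetch rule is simple enough to show directly. This C sketch, a toy model rather than the POWER7 implementation, issues prefetches for the next two 128-byte lines whenever an I-cache lookup misses; the cache-lookup and prefetch functions are placeholders.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 128   /* POWER7 cache line size */

/* Placeholders standing in for the real cache lookup and prefetch machinery. */
static bool icache_lookup(uint64_t addr)  { (void)addr; return false; }  /* always miss in this toy */
static void issue_prefetch(uint64_t addr) {
    printf("prefetch line at 0x%llx\n", (unsigned long long)addr);
}

/* On an I-cache miss, demand-fetch the missing line and prefetch the next two blocks. */
void icache_access(uint64_t fetch_addr) {
    uint64_t line = fetch_addr & ~(uint64_t)(LINE_BYTES - 1);
    if (!icache_lookup(line)) {
        printf("miss on line 0x%llx -> demand fetch\n", (unsigned long long)line);
        issue_prefetch(line + 1 * LINE_BYTES);   /* next block        */
        issue_prefetch(line + 2 * LINE_BYTES);   /* block after that  */
    }
}

int main(void) {
    icache_access(0x10000084);   /* example fetch address */
    return 0;
}
```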
L2 design:
Superset of L1 (inclusive)
Latency reduced by keeping capacity small (256 KB)
The L2 uses the larger local L3 region as a victim cache
Increased associativity (8-way)
32 MB fluid L3 cache
Lateral cast-outs; provisioning of regions belonging to disabled cores
4 MB of local L3 region per core (8 × 4 MB = 32 MB total)
▪ The local region is physically closer to its core, reducing latency
L3 accesses are routed to the local L3 region first (a toy sketch of this lookup order follows)
Cache lines are cloned when used by multiple cores
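A toy illustration of the "local region first" lookup order: each core checks its own 4 MB local L3 region before probing the other cores' regions. This is only a conceptual sketch of the routing policy described above, not the actual fluid-cache mechanism; all names and the placement function are assumptions.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_CORES 8

/* Placeholder lookup into one core's 4 MB local L3 region (toy placement rule). */
static bool region_has_line(int region, unsigned long line_addr) {
    return (line_addr % NUM_CORES) == (unsigned long)region;
}

/* L3 access policy: probe the requesting core's local region first,
 * then fall back to the remaining regions that make up the 32 MB global L3. */
int l3_lookup(int core, unsigned long line_addr) {
    if (region_has_line(core, line_addr))
        return core;                               /* fast local-region hit   */
    for (int r = 0; r < NUM_CORES; r++) {
        if (r != core && region_has_line(r, line_addr))
            return r;                              /* slower remote-region hit */
    }
    return -1;                                     /* miss: go to memory       */
}

int main(void) {
    printf("core 2 finds line 0x100 in region %d\n", l3_lookup(2, 0x100));
    return 0;
}
```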
eDRAM – Embedded Dynamic Random-Access Memory
Less area (1-transistor cell vs. a 6-transistor SRAM cell)
Enables the on-chip L3 cache
▪ Reduces L3 latency
▪ Wider internal buses increase bandwidth
Compared to an off-chip SRAM cache:
▪ 1/6 the latency
▪ 1/5 the standby power
Also used in game consoles (PS2, Wii, etc.)
References: [5], [6]
Memory subsystem:
2 memory controllers with 4 channels each per chip
Exploits the elimination of the off-chip L3 cache interface
32 GB per core, 256 GB capacity per chip
180 GB/s bandwidth (POWER6: 75 GB/s)
16 KB scheduling buffer
Three idle states trade off power savings against wake-up latency:
Nap
Sleep
“Heavy” Sleep
Nap:
Optimized for wake-up time
Clocks to the execution units are turned off
Caches remain coherent
Core frequency is reduced
Sleep:
Caches are purged, and the core plus its caches are clocked off
“Heavy” Sleep:
Optimized for power reduction
Entered when all cores are in sleep mode; the voltage of all cores is reduced
Voltage ramps back automatically on wake-up; no hardware re-initialization is required
Per-core frequency scaling (DVFS): -50% through +10% frequency slew, independently per core
Supports energy optimization in partitioned system configurations
▪ Less-utilized partitions can run at lower frequencies
▪ Heavily utilized partitions maintain peak performance
Each partition can run under a different energy-saving policy (a toy governor sketch follows)
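To illustrate the idea (this is not IBM's EnergyScale firmware), here is a toy per-core governor in C that maps recent utilization onto a frequency multiplier clamped to the -50%..+10% range from the slide; the utilization source, linear policy, and nominal frequency are assumptions.

```c
#include <stdio.h>

#define F_MIN_SCALE 0.50   /* -50% of nominal frequency */
#define F_MAX_SCALE 1.10   /* +10% of nominal frequency */

/* Map a core's recent utilization (0.0 - 1.0) to a frequency scale factor,
 * clamped to the range the hardware supports. */
double pick_frequency_scale(double utilization) {
    /* Simple linear policy: idle cores drop toward -50%, busy cores push toward +10%. */
    double scale = F_MIN_SCALE + utilization * (F_MAX_SCALE - F_MIN_SCALE);
    if (scale < F_MIN_SCALE) scale = F_MIN_SCALE;
    if (scale > F_MAX_SCALE) scale = F_MAX_SCALE;
    return scale;
}

int main(void) {
    const double nominal_ghz = 3.55;                 /* example nominal frequency (assumed) */
    double utilization[] = { 0.05, 0.40, 0.95 };     /* recent utilization of three cores   */
    for (int core = 0; core < 3; core++) {
        double scale = pick_frequency_scale(utilization[core]);
        printf("core %d: %.0f%% busy -> %.2f GHz\n",
               core, utilization[core] * 100.0, nominal_ghz * scale);
    }
    return 0;
}
```

Lightly loaded partitions would map to the low end of this range while busy partitions stay near the top, which is the per-partition policy behavior described above.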
IBM reports the following SPECpower_ssj2008 score improvements:
Dynamic fan speed control: 14% improvement
Static power savings (low-power operation): 24% improvement
Dynamic power savings (DVFS with Turbo mode): 50% improvement
[Table comparing POWER generations by technology, chips, cores, threads, GHz, rPerf, and CPW; numeric values not recovered]
rPerf – Relative performance metric for Power Systems servers. Derived from an IBM analytical model which uses characteristics from IBM internal workloads and from TPC and SPEC benchmarks. The IBM eServer pSeries 640 is the baseline reference system with a value of 1.0.
CPW – Commercial Processing Workload. Based on benchmarks owned and managed by the Transaction Processing Performance Council. Provides an indicator of transaction-processing performance capacity when comparing members of the iSeries and AS/400 families.
SPEC CPU2006 performance (speed): [table of technology, chips, cores, threads, GHz, SPECint, and SPECfp; numeric values not recovered]
SPEC CPU2006 performance (throughput): [table of technology, chips, cores, threads, GHz, OS, SPECint_rate, and SPECfp_rate for POWER7 under AIX; numeric values not recovered]
3. Central PA PUG POWER7 review.ppt – www.ibm.com/developerworks/wikis/download/attachments/…/Central+PA+PUG+POWER7+review.ppt
4. …pdf
5. …
6. …
7. …/Power7_Performance_Overview.pdf
8. http://www-03.ibm.com/systems/resources/pwrsysperf_SMT4OnP7.pdf