Adam Kunk, Anil John, Pete Bohman

• Released by IBM in 2010 (around February)
• Successor of the POWER6
• Implements IBM's Power Architecture, v2.06
• Clock rate: 2.4 GHz to … GHz
• Feature size: 45 nm
• ISA: Power ISA v2.06 (RISC)
• Cores: 4, 6, or 8
• Cache: L1, L2, and L3, all on chip
References: [1], [5]

• PERCS: Productive, Easy-to-use, Reliable Computing System
• DARPA-funded contract that IBM won to develop the POWER7 ($244 million contract, 2006)
  ▪ The contract was to develop a petascale supercomputer architecture before 2011 under the HPCS (High Productivity Computing Systems) project.
• IBM, Cray, and Sun Microsystems received HPCS grants for Phase II.
• IBM was chosen for Phase III.
References: [1], [2]

• Side note:
  ▪ The Blue Waters system was meant to be the first supercomputer to use PERCS technology.
  ▪ However, the contract was cancelled because of cost and complexity.

POWER4/4+
• Dual Core
• Chip Multi Processing
• Distributed Switch
• Shared L2
• Dynamic LPARs (32)
• 180nm, …

POWER5/5+
• Dual Core & Quad Core Md
• Enhanced Scaling
• 2 Thread SMT
• Distributed Switch +
• Core Parallelism +
• FP Performance +
• Memory Bandwidth +
• 130nm, 90nm

POWER6/6+
• Dual Core
• High Frequencies
• Virtualization +
• Memory Subsystem +
• Altivec
• Instruction Retry
• Dyn Energy Mgmt
• 2 Thread SMT +
• Protection Keys
• 65nm

POWER7/7+
• 4, 6, 8 Core
• 32MB On-Chip eDRAM
• Power Optimized Cores
• Mem Subsystem ++
• 4 Thread SMT ++
• Reliability +
• VSM & VSX
• Protection Keys +
• 45nm, 32nm

POWER8
• Future

Roadmap taglines, in order: First Dual Core in Industry; Hardware Virtualization for Unix & Linux; Fastest Processor in Industry; Most POWERful & Scalable Processor in Industry
References: [3]

• IBM POWER7 Demo

Cores:
• 8 Intelligent Cores per chip (socket)
• 4 and 6 Intelligent Cores available on some models
• 12 execution units per core
• Out-of-order execution
• 4-way SMT per core
• 32 threads per chip
• L1: 32 KB I-cache and 32 KB D-cache per core
• L2: 256 KB per core
Chip:
• 32 MB Intelligent L3 cache on chip
[Chip diagram: eight cores, each with its own L2, around the shared eDRAM L3 cache, plus the memory interface (Memory++), GX, SMP fabric, and POWER bus]
References: [3]
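As a quick check on the figures above, the per-chip cache totals follow from the per-core numbers. This is a minimal sketch in Python; the constant and function names are illustrative only, and it assumes the full 8-core chip with the whole 32 MB L3 divided evenly among the cores.

```python
# Per-core and per-chip cache sizes as listed on the slide.
CORES_PER_CHIP = 8
L1_I_KB = 32    # L1 instruction cache per core
L1_D_KB = 32    # L1 data cache per core
L2_KB = 256     # L2 cache per core
L3_MB = 32      # shared on-chip L3 (eDRAM)


def chip_cache_totals(cores=CORES_PER_CHIP):
    """Aggregate on-chip cache capacity for a chip with `cores` cores.

    Assumption: the full 32 MB L3 is split evenly across the cores.
    """
    return {
        "l1_total_kb": cores * (L1_I_KB + L1_D_KB),
        "l2_total_kb": cores * L2_KB,
        "l3_share_per_core_mb": L3_MB / cores,
    }


totals = chip_cache_totals()
print(totals)
```

For the 8-core chip this gives 512 KB of L1, 2 MB of L2, and a 4 MB share of L3 per core.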

• Each core implements "aggressive" out-of-order (OoO) instruction execution
• The processor has an Instruction Sequence Unit that can dispatch up to six instructions per cycle to a set of queues
• Up to eight instructions per cycle can be issued to the instruction execution units
References: [4]
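The two widths above give simple lower bounds on cycle counts. The sketch below is a simplified model, not a description of the real pipeline: it assumes every instruction is dispatched once and issued once, and it ignores dependencies, cache misses, and queue capacity. The function name is hypothetical.

```python
DISPATCH_WIDTH = 6  # instructions per cycle, from the slide
ISSUE_WIDTH = 8     # instructions per cycle, from the slide


def min_cycles(instructions, width):
    """Fewest cycles needed to push `instructions` through a stage of the given width."""
    return (instructions + width - 1) // width  # integer ceiling division


n = 48
dispatch_bound = min_cycles(n, DISPATCH_WIDTH)
issue_bound = min_cycles(n, ISSUE_WIDTH)
print(dispatch_bound, issue_bound)  # 8 6
```

Under these assumptions dispatch is the narrower stage, so sustained throughput cannot exceed six instructions per cycle. The wider issue stage lets the queues drain bursts of instructions that have become ready.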

• 8 instructions fetched from L2 to the L1 I-cache or fetch buffer
• Balanced instruction rates across active threads
• Instruction grouping
  ▪ Instructions belonging to a group are issued together
  ▪ Groups contain independent instructions

• Branch Prediction

• Each POWER7 core has 12 execution units:
  ▪ 2 fixed-point units
  ▪ 2 load/store units
  ▪ 4 double-precision floating-point units (2x POWER6)
  ▪ 1 vector unit
  ▪ 1 branch unit
  ▪ 1 condition register unit
  ▪ 1 decimal floating-point unit
References: [4]
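The breakdown above can be tallied to confirm that it adds up to the stated 12 units per core. A minimal Python sketch; the dictionary keys are just labels taken from the slide.

```python
# Execution units per POWER7 core, as listed on the slide.
EXECUTION_UNITS = {
    "fixed point": 2,
    "load store": 2,
    "double precision floating point": 4,
    "vector": 1,
    "branch": 1,
    "condition register": 1,
    "decimal floating point": 1,
}

units_per_core = sum(EXECUTION_UNITS.values())
units_per_chip = units_per_core * 8  # assuming the 8-core chip
print(units_per_core, units_per_chip)  # 12 96
```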

• Simultaneous multithreading (SMT)
  ▪ SMT1: one instruction execution thread per core
  ▪ SMT2: two instruction execution threads per core
  ▪ SMT4: four instruction execution threads per core
• An 8-core POWER7 can therefore execute 32 threads simultaneously

[Figure: per-cycle occupancy of the execution units FX0, FX1, FP0, FP1, LS0, LS1, BRX, and CRL for four designs: single-thread out-of-order, S80 hardware multithreading, POWER5 2-way SMT, and POWER7 4-way SMT. Legend: thread 0 to thread 3 executing, no thread executing.]
References: [3]
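The thread counts for each SMT mode follow from multiplying the core count by the threads per core. The sketch below tabulates this for the 4, 6, and 8 core variants. The helper name is hypothetical, and the code only does the arithmetic; it does not query real hardware.

```python
SMT_MODES = {"SMT1": 1, "SMT2": 2, "SMT4": 4}  # threads per core in each mode
CORE_COUNTS = (4, 6, 8)                        # chip variants named in the slides


def hardware_threads(cores, mode):
    """Simultaneous hardware threads for a chip with `cores` cores in the given SMT mode."""
    return cores * SMT_MODES[mode]


for cores in CORE_COUNTS:
    counts = {mode: hardware_threads(cores, mode) for mode in SMT_MODES}
    print(cores, counts)
```

The 8-core chip in SMT4 mode yields the 32 threads quoted in the slides.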

• (Look at section in fs/redp4639.pdf)

Parameter     | L1                           | L2         | L3 (Local)     | L3 (Global)
Size          | 64 KB (32 I, 32 D)           | 256 KB     | 4 MB           | 32 MB
Access time   | 0.5 ns                       | 2 ns       | 6 ns           | 30 ns
Associativity | 4-way I-cache, 8-way D-cache | 8-way      |                |
Write policy  | Write-through                | Write-back | Partial victim | Adaptive
Line size     | 128 B
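To see how these access times combine, one can weight each level's latency by the fraction of accesses it serves. The sketch below does that in Python. The latencies come from the table, but the served fractions are invented for illustration and are not measured POWER7 data. The model also assumes every access is satisfied on chip, so main memory is ignored.

```python
# Access times from the table, in nanoseconds.
ACCESS_TIME_NS = {"L1": 0.5, "L2": 2.0, "L3 local": 6.0, "L3 global": 30.0}

# HYPOTHETICAL fraction of accesses served at each level (must sum to 1).
SERVED_FRACTION = {"L1": 0.90, "L2": 0.06, "L3 local": 0.03, "L3 global": 0.01}


def average_access_time(latency_ns, served_fraction):
    """Expected access time: sum of latency times the fraction served at each level."""
    if abs(sum(served_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("served fractions must sum to 1")
    return sum(latency_ns[level] * served_fraction[level] for level in latency_ns)


avg_ns = average_access_time(ACCESS_TIME_NS, SERVED_FRACTION)
print(round(avg_ns, 2))  # 1.05
```

Even with only 1% of accesses reaching the global L3, that level contributes 0.30 ns of the 1.05 ns average in this example. This shows why the faster local L3 region matters.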

• 2 read ports, 1 write port
• Writes have priority over reads
• Write-through
• No L1 cast-outs required
• Binary-tree LRU replacement
• Way-prediction bits reduce hit latency
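Tree-based LRU replacement is commonly implemented as a binary-tree pseudo-LRU, which keeps one bit per internal tree node instead of a full ordering of the ways. The sketch below shows that general technique for one 8-way set. The slides do not give POWER7's actual bit encoding, so the class name and the bit convention here are assumptions.

```python
class TreePLRU:
    """Binary-tree pseudo-LRU state for one cache set (illustrative model)."""

    def __init__(self, ways=8):
        if ways < 2 or ways & (ways - 1):
            raise ValueError("ways must be a power of two, at least 2")
        self.ways = ways
        # One bit per internal node: 0 = next victim is in the left half, 1 = right half.
        self.bits = [0] * (ways - 1)

    def touch(self, way):
        """Record an access: point every node on the path away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1          # accessed left half, so victim lies right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0          # accessed right half, so victim lies left
                node, lo = 2 * node + 2, mid

    def victim(self):
        """Follow the bits from the root to the way to replace next."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo


plru = TreePLRU(8)
for way in range(8):
    plru.touch(way)
first_victim = plru.victim()   # way 0, which is also the true LRU here
plru.touch(0)
second_victim = plru.victim()  # way 4, although the true LRU would be way 1
print(first_victim, second_victim)  # 0 4
```

The second result shows the approximation: seven bits per set cannot track the exact order, but the scheme never selects the most recently used way and is cheap to update on every hit.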

• Inclusive of L1
• Partial victim relationship with the L3

• Details of the L3 cache ... (leads up to eDRAM)

• eDRAM: embedded dynamic random-access memory
• This means the shared 32 MB L3 cache is on chip
  ▪ Faster access, because signals travel a shorter distance
  ▪ Less area and less power
  ▪ On-chip interconnects provide each core with 32-byte buses to and from the L3 cache
• Side note: eDRAM is also used in several game consoles (PS2, GameCube, Wii, etc.)
References: [5], [6]

• eDRAM in the POWER7 provides 1/6 the latency and twice the bandwidth of off-chip eDRAM, and uses 1/5 the standby power and 1/3 the area of SRAM
References: [5]

3. Central PA PUG POWER7 review.ppt, http://www.ibm.com/developerworks/wikis/download/attachments/…/Central+PA+PUG+POWER7+review.ppt
4. dfs/redp4639.pdf
5. ower7.pdf
6.