Office of Science U.S. Department of Energy Bassi/Power5 Architecture John Shalf NERSC Users Group Meeting Princeton Plasma Physics Laboratory June 2005

POWER5 IH Overview

POWER5 IH System
- 2U rack chassis
- Rack: 24" x 43" deep, full drawer
- 16 systems / rack, 128 processors / rack

POWER5 IH Node Architecture
- Processors: 4-way or 8-way POWER5
- L3 cache: 144 MB / 288 MB (total)
- Memory: 2 GB to 128/256 GB
- Packaging: 2U (24" rack), 16 nodes / rack
- DASD / bays: 2 DASD (hot plug)
- I/O expansion: 6 slots (blindswap)
- Integrated SCSI: Ultra 320
- Integrated Ethernet: 4 ports, 10/100/1000
- RIO drawers: yes (1/2 or 1)
- LPAR: yes
- Switch: HPS
- OS: AIX 5.2 & Linux

Image From IBM

POWER5 IH Physical Structure (component callouts from the IBM image)
- Processor board
- I/O board
- Blower (2x)
- SMI (32x)
- Pluggable DIMM (64x)
- DCM (8x)
- PCI risers (2x)
- DASD backplane
- DCA power connector (6x)
Image From IBM

IBM Power Series Processors

                 Power3+      Power4             Power5
MHz
FLOPS/clock      4            4                  4
Peak FLOPS
L1 (D-cache)     64 KB        32 KB              32 KB
L2 (unified)     8 MB         1.5 MB (0.75 MB)   1.9 MB
L3 (unified)     --           32 MB (16 MB)      36 MB
STREAM GB/s      0.4 (0.7)    1.4 (2.4)          5.4 (7.6)
Bytes/FLOP

Power3 vs. Power5 die. Images from the IBM Power3 Redbook and the IBM Power5 Redbook.

Memory Subsystem. Image from the IBM Power5 Redbook.

SMP Fabric. Diagram comparing the memory paths of Power3 (processor, L2, memory bus, memory controller, DRAM) with Power4 and Power5 (two cores sharing L2, fabric controller (FC), L3, memory controller, SMI, DRAM).

SMP Fabric, continued: the Power5 single-core (SC) module. The remaining core gets:
- all of the L2 + L3 cache
- all of the memory bandwidth
- an increased clock rate
(Same Power3/Power4/Power5 memory-path diagram as the previous slide, with one core per Power5 chip removed.)

System Packaging (MCM vs. DCM)
- Multi-Chip Module (MCM)
  - 4 x Power5 + L3 caches
  - Up to 8 Power5 cores
  - L2 shared among cores
- Dual-Chip Module (DCM)
  - 1 x Power5 + L3 cache
  - Up to 2 Power5 cores
  - Private L2 and L3
Image From IBM Power5 Redbook

System Packaging (MCM vs. DCM), continued. Diagram labels: P5, L3, 4-8 DIMM slots, 2x PCI Express x16, 2x 8 GB/s (4+4).

Power5 Cache Hierarchy. Diagram from the IBM Power5 Redbook: P5 Squadron (4 MCM, 64-processor SMP) versus P5 Squadron IH (4 DCM, 8 processors per node).

Power5 IH Node Architecture. Image From IBM.

Power5 IH Memory Affinity. Diagram: eight processors (Proc 0 through Proc 7), each with its own local memory (Mem0 through Mem7).

Power5 IH Memory Affinity (same diagram). Local access: Proc0 to Mem0 == 90 ns.

Power5 IH Memory Affinity (same diagram). Local access: Proc0 to Mem0 == 90 ns. Remote access: Proc0 to Mem7 == 200+ ns.

Page Mapping. A linear walk through memory addresses from Proc0 with the default affinity, round_robin (RR): pages are assigned round-robin to the memory controllers, so accesses mix the 90 ns local case with the 200+ ns remote cases and the average latency lands in between.

Page Mapping, continued. The same linear walk with MCM affinity (really DCM affinity in this case): pages are assigned to the memory controller where they are first touched, so every page Proc0 touches first lands on its local controller. Average latency: 90 ns, with no fabric contention.
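Because placement follows the first touch, whichever task or thread will later use a page should also be the one that initializes it. Below is a minimal, hypothetical Fortran sketch of first-touch initialization with OpenMP (the array name and size are illustrative); it assumes each thread works mostly on its own slice of the array, so processor-local affinity pays off. When threads share data unpredictably, the round_robin setting on the next slide may be the better choice.

  ! Hedged sketch of first-touch page placement (array name and size are
  ! illustrative).  With MEMORY_AFFINITY=MCM, a page is mapped to the memory
  ! behind the processor that touches it first, so the array is initialized
  ! with the same static decomposition used by the later compute loop.
  program first_touch
    implicit none
    integer, parameter :: n = 8*1024*1024
    real(8), allocatable :: a(:)
    integer :: i

    allocate(a(n))              ! allocation alone does not place the pages

  !$omp parallel do schedule(static)
    do i = 1, n                 ! first touch: each thread places its own pages
       a(i) = 0.0d0
    end do
  !$omp end parallel do

  !$omp parallel do schedule(static)
    do i = 1, n                 ! compute loop now hits locally placed pages
       a(i) = 2.0d0*a(i) + 1.0d0
    end do
  !$omp end parallel do

    print *, a(n)
    deallocate(a)
  end program first_touch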

Memory Affinity Directives
- For processor-local memory affinity, you should also set the environment variables:
  - MP_TASK_AFFINITY=MCM
  - MEMORY_AFFINITY=MCM
- For OpenMP, you generally need to eliminate task-level memory affinity:
  - unset MP_TASK_AFFINITY
  - MEMORY_AFFINITY=round_robin (depending on the OpenMP memory-usage pattern)

Large Pages
- Enable large pages with -blpdata (at link time), or with ldedit -blpdata on an existing executable.
- Effect on STREAM performance:
  - TRIAD without -blpdata: 5.3 GB/s per task
  - TRIAD with -blpdata: 7.2 GB/s per task (6.9 GB/s loaded)

A Short Commentary about Latency
- Little's Law: bandwidth * latency = concurrency
- For a hypothetical single-core "Power-X":
  - 150 ns * 20 GB/s (DDR2 memory) = 3000 bytes of data in flight
  - That is 23.4 cache lines (128-byte lines), very close to the 24-entry memory request queue depth
  - 375 operands must be prefetched to fully engage the memory subsystem
- THAT'S a LOT of prefetch! (especially with only 32 architected registers)
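Working the slide's numbers, assuming 128-byte cache lines and 8-byte operands:

  \[ 150\,\mathrm{ns} \times 20\,\mathrm{GB/s} = 3000\ \text{bytes in flight} \]
  \[ \frac{3000}{128\ \text{bytes/line}} \approx 23.4\ \text{cache lines}, \qquad \frac{3000}{8\ \text{bytes/operand}} = 375\ \text{operands} \]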

Deep Memory Request Pipelining Using Stream Prefetch. Image from the IBM Power4 Redbook.

Stanza Triad Results
- With perfect prefetching, performance would be independent of L, the stanza length: we would expect a flat line at the STREAM peak.
- Our results show that performance does depend on L.
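For reference, here is a minimal sketch of a Stanza Triad-style kernel, assuming the usual setup: a STREAM triad applied to contiguous stanzas of length L separated by a gap of skipped elements (the array names, gap, and calling parameters are illustrative, not the benchmark's actual code).

  ! Hedged sketch of a Stanza Triad: a triad over stanzas of length L separated
  ! by a gap of skipped elements, which exposes how well hardware prefetch
  ! restarts at each new stanza.
  subroutine stanza_triad(a, b, c, n, l, gap, alpha)
    implicit none
    integer, intent(in) :: n, l, gap
    real(8), intent(in) :: alpha
    real(8), intent(in)  :: b(n), c(n)
    real(8), intent(out) :: a(n)
    integer :: start, i

    start = 1
    do while (start + l - 1 <= n)
       do i = start, start + l - 1      ! contiguous stanza of length L
          a(i) = b(i) + alpha*c(i)
       end do
       start = start + l + gap          ! jump over the gap; prefetch must restart
    end do
  end subroutine stanza_triad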

XLF Prefetch Directives
- DCBT (Data Cache Block Touch): explicit prefetch
  - Pre-request some data
  - !IBM PREFETCH_BY_LOAD(arrayA(I))
  - !IBM PREFETCH_FOR_LOAD(variable list)
  - !IBM PREFETCH_FOR_STORE(variable list)
- Stream prefetch
  - Installs a stream on a hardware prefetch engine
  - Syntax: !IBM PREFETCH_BY_STREAM(arrayA(I))
  - PREFETCH_BY_STREAM_BACKWARD(variable list)
- DCBZ (Data Cache Block Zero)
  - Use for store streams
  - Syntax: !IBM CACHE_ZERO(StoreArrayA(I))
  - Included automatically with the -qnopteovrlp option (unreliable)
  - Improves performance by another ~10%
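A hedged sketch of where these directives might go in a triad-style loop. The directive spellings follow the slide (depending on the xlf version and -qdirective settings, the sentinel may need to be !IBM* rather than !IBM), and the array names and prefetch distance are illustrative. Since the directives live in Fortran comments, the loop compiles unchanged even if they are not recognized.

  ! Illustrative use of the XLF prefetch directives from the slide; array names
  ! and the prefetch distance are made up for the example.
  subroutine triad_prefetch(a, b, c, n, alpha)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: alpha
    real(8), intent(in)  :: b(n), c(n)
    real(8), intent(out) :: a(n)
    integer :: i

  !IBM PREFETCH_BY_STREAM(b(1))      ! install a hardware prefetch stream for b
  !IBM PREFETCH_BY_STREAM(c(1))      ! and one for c
  !IBM CACHE_ZERO(a(1))              ! DCBZ the store stream instead of reading it
    do i = 1, n
  !IBM PREFETCH_BY_LOAD(b(i+64))     ! explicit DCBT touch well ahead of use
       a(i) = b(i) + alpha*c(i)
    end do
  end subroutine triad_prefetch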

Office of Science U.S. Department of Energy Power5: Protected Streams  Protected Prefetch Streams  There are 12 filters and 8 prefetch engines  Need control of stream priority and prevent rolling of the filter list (takes a while to ramp up prefetch)  Helps give hardware hints about “stanza” streams  PROTECTED_UNLIMITED_STREAM_SET_GO_FORWARD(variable, streamID)  PROTECTED_UNLIMITED_STREAM_SET_GO_BACKWARD(var,streamID)  PROTECTED_STREAM_SET_GO_FORWARD/BACKWARD(var,streamID)  PROTECTED_STREAM_COUNT(ncachelines)  PROTECTED_STREAM_GO  PROTECTED_STREAM_STOP(stream_ID)  PROTECTED_STREAM_STOP_ALL  STREAM_UNROLL  Use more aggressive loop unrolling and SW pipelining in conjunction with prefetch  EIEIO (Enforce Inorder Execution of IO)