© Cray Inc. CSC, Finland September 21-24, 2009

                          XT3          XT4          XT5          XT6
Number of cores/socket    2            4            4-6          12
Number of cores/node      2            4            8-12         24
Clock cycle (CC), GHz     2.6          2.3          2.6          ??
64-bit results/CC         2            4            4            4
GFLOPS/node               ?            ?            ?            ~200
Interconnect              SeaStar 1    SeaStar 2+   SeaStar 2+   Gemini
Link bandwidth (GB/sec)   6 x 2.4      6 x 4.8      6 x 4.8      10 x 4.7
MPI latency (microsec)    ?            ?            ?            ?
Messages/sec              400K         400K         400K         10M
Global addressing         No           No           No           Yes
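As a sanity check on how the rows of this table fit together, peak performance per node is simply cores/node x 64-bit results per clock x clock rate in GHz. The short C sketch below is not part of the original slides; it just reproduces the per-node figures quoted later for the XT4 and XT5 nodes (>35 and >70 Gflops).

/* Peak floating-point rate per node, derived from the table above:
 * GFLOPS/node = cores/node x 64-bit results per CC x clock (GHz).
 */
#include <stdio.h>

static double peak_gflops(int cores_per_node, int results_per_cc, double ghz)
{
    return cores_per_node * results_per_cc * ghz;
}

int main(void)
{
    printf("XT4 peak: %.1f GFLOPS/node\n", peak_gflops(4, 4, 2.3)); /* ~36.8 */
    printf("XT5 peak: %.1f GFLOPS/node\n", peak_gflops(8, 4, 2.6)); /* ~83.2 */
    return 0;
}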

Cray XT4 compute node:
- 4-way SMP, >35 Gflops per node
- 2-8 GB of memory per node (up to 8 GB)
- 12.8 GB/sec direct-connect memory (DDR 800)
- 6.4 GB/sec direct-connect HyperTransport
- Cray SeaStar2+ interconnect, 9.6 GB/sec links
- OpenMP support within the socket

Cray XT5 compute node:
- 8-way SMP, >70 Gflops per node
- 2-32 GB of shared memory per node (up to 32 GB)
- 25.6 GB/sec direct-connect memory
- 6.4 GB/sec direct-connect HyperTransport
- Cray SeaStar2+ interconnect, 9.6 GB/sec links
- OpenMP support across the node (see the OpenMP sketch below)
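Because the node is an 8-way SMP with OpenMP support, work inside a node can be shared among its cores with threads rather than with one MPI rank per core. The following is a minimal, illustrative OpenMP sketch; the array names and sizes are made up, not taken from the slides.

/* Minimal OpenMP sketch: spreads a loop over the cores of one node,
 * e.g. with OMP_NUM_THREADS=8 on an XT5 node.  Compile with the
 * compiler's OpenMP flag (e.g. -fopenmp).
 */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double sum = 0.0;

    /* Each thread works on a contiguous chunk of the arrays. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        y[i] = 2.5 * x[i];
        sum += y[i];
    }

    printf("threads used: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}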

AMD Opteron generations used in these systems (Budapest, Barcelona, Shanghai, Istanbul): 4 floating-point results per clock cycle; each core has a 64 KB L1 cache and a 512 KB L2 cache; the shared Level 3 cache is 2048 KB or 6144 KB, depending on the generation.

Opteron memory hierarchy (per core, except where noted):
- Registers: 16 SSE2 128-bit registers, plus the 64-bit general-purpose registers
- L1 data cache: 2 x 16 bytes per clock for loads or 1 x 16 bytes per clock for stores (76.8 GB/s or 38.4 GB/s at 2.4 GHz); prefetching can go to the L1 or L2 cache
- L2 cache: 16 bytes per clock, 38.4 GB/s bandwidth; a write-back cache, so data evicted from the L1 data cache is stored here first until it is flushed out to main memory
- Shared L3 cache: 32 GB/s; items in L1 and L2 are exclusive, the L3 is "sharing aware"
- Main memory: 64-byte cache line; complete cache lines are loaded from main memory if the data is not already in the L2 or L3 cache; when the L1 data cache must be refilled, evicted lines are stored back to L2, and when L2 must be refilled, to L3; the memory interface is 16 bytes wide, i.e. 12.8 GB/s for DDR2-800, at 73 ns latency
- Remote memory: 8 GB/s over coherent HyperTransport, 115 ns latency

Because whole 64-byte lines move between these levels, stride-1 access that uses every element of a line is far cheaper than strided access; a short sketch follows.
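The sketch below contrasts the same reduction done with stride-1 and stride-8 access; function and array names are illustrative, not from the slides.

/* With 64-byte lines and 8-byte doubles, the stride-1 loop uses all 8
 * elements of every line it loads; the stride-8 version touches only
 * one element per line on each pass and makes 8 passes, so it moves
 * roughly 8x more data through the memory hierarchy when the array
 * is larger than the caches.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

double sum_stride1(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)      /* one new cache line per 8 elements */
        s += a[i];
    return s;
}

double sum_stride8(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t j = 0; j < 8; j++)      /* 8 passes, each touching every line */
        for (size_t i = j; i < n; i += 8)
            s += a[i];
    return s;
}

int main(void)
{
    size_t n = 1 << 24;                 /* 16M doubles = 128 MB, larger than L3 */
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < n; i++) a[i] = 1.0;
    printf("%f %f\n", sum_stride1(a, n), sum_stride8(a, n));
    free(a);
    return 0;
}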

Figure: a two-socket compute node. Each socket's cores have private L1 and L2 caches and share a Level 3 cache and directly attached memory; the sockets are joined by HyperTransport links (6.4 GB/sec). Local memory delivers 12.8 GB/sec at 73 ns; memory on the other socket is reached over coherent HyperTransport at 8 GB/sec and 115 ns.
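On such a node the local/remote gap (73 ns and 12.8 GB/sec versus 115 ns and 8 GB/sec) makes data placement matter. Under Linux's default first-touch policy a page is placed on the NUMA node of the thread that first writes it, so initializing data with the same OpenMP schedule as the compute loop keeps each thread's pages local. A minimal sketch under that assumption; the names and sizes are illustrative.

/* First-touch placement sketch: the parallel initialization loop uses
 * the same static schedule as the compute loop, so each thread first
 * touches, and therefore places, the pages it will later use in its
 * own socket's memory instead of the remote socket's memory.
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;

    /* Parallel initialization: each thread first-touches its own chunk. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
    }

    /* Same static schedule, so each thread reads/writes local pages. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] += 2.0 * b[i];

    printf("a[0] = %f\n", a[0]);
    free(a);
    free(b);
    return 0;
}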

Strengths:
- Upgradability
- Scalability of the interconnect
- Increased node performance: fewer nodes are needed to achieve the same performance
- Global addressing with Gemini: adds the ability to use PGAS (UPC and CAF) to program around some of the weaknesses

Bottlenecks:
- Memory bandwidth will never be enough to support all the cores
- Injection bandwidth will never be enough to support all the cores
- Global bandwidth will never be enough to support all-to-alls across all of the cores

In each case we need to think about programming around the limitation, with PGAS (UPC and CAF) and with OpenMP; a hybrid MPI + OpenMP skeleton is sketched below.
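One common way to ease per-core memory and injection bandwidth pressure, consistent with the OpenMP suggestion above, is hybrid MPI + OpenMP: run fewer MPI ranks per node (for example one per socket) and use threads inside each rank, so fewer and larger messages compete for the node's injection bandwidth. The skeleton below is a minimal sketch, not taken from the slides.

/* Hybrid MPI + OpenMP skeleton: e.g. one MPI rank per socket with
 * OMP_NUM_THREADS=4 threads per rank, instead of one rank per core.
 * Fewer ranks per node means fewer, larger messages leave the node.
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0, global = 0.0;

    /* Threads share the rank's memory; this loop needs no MPI messages. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0;

    /* One message per rank crosses the network, not one per core. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d global=%f\n",
               nranks, omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}

On a Cray XT system a job like this is typically launched with aprun, using its depth option to reserve cores for the threads of each rank; the exact flags depend on the site configuration.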

- Need to investigate hybrid programming
- Need to investigate the use of PGAS: Global Arrays, UPC, Co-Array Fortran, SHMEM (see the one-sided sketch below)
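UPC, Co-Array Fortran, SHMEM and Global Arrays all expose one-sided access to remote memory, which is what Gemini's global addressing accelerates in hardware. The slides contain no code, so as a stand-in the sketch below shows the same one-sided style using standard MPI-2 RMA (MPI_Put between a pair of ranks); it illustrates the programming model only, not the PGAS syntax itself.

/* One-sided "put" sketch in the PGAS style, written with MPI-2 RMA.
 * Each rank exposes a window of memory; rank 0 writes directly into
 * rank 1's window without rank 1 posting a matching receive.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    double local = 0.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Every rank exposes one double to the others. */
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0 && nranks > 1) {
        double value = 42.0;
        /* Write into rank 1's window: target rank 1, displacement 0. */
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);     /* completes the one-sided transfer */

    if (rank == 1)
        printf("rank 1 received %f without calling MPI_Recv\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}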