Jacquard: Architecture and Application Performance Overview. NERSC Users' Group, October 2005.


Outline: an engineering-level overview of the hardware and software that make up Jacquard: 1) CPUs, 2) memory, 3) OS, 4) interconnect. Seaborg is used as a point of reference throughout.

seaborg.nersc.gov (review)

    Resource         Latency    Size
    Registers        3 ns       256 B
    L1 Cache         5 ns       32 KB
    L2 Cache         45 ns      8 MB
    Main Memory      300 ns     16 GB
    Remote Memory    19 us      7 TB
    GPFS             10 ms      50 TB
    HPSS             5 s        9 PB

380 x 16-way SMP NHII nodes (crossbar to main memory), connected by the Colony switch (CSS0/CSS1); MPI runs over the switch, and GPFS provides the parallel file system. 6,080 (380 x 16) dedicated compute CPUs plus 96 shared login CPUs. Note the hierarchy of caching speeds: the bottleneck is determined by whichever resource is depleted first.

jacquard.nersc.gov basics

    Resource         Latency    Size
    Registers        0.5 ns     2 KB
    L1 Cache         1.5 ns     64 KB
    L2 Cache         45 ns      1 MB
    Main Memory      … ns       6 GB
    Remote Memory    5 us       2 TB
    GPFS             10 ms      15 TB
    HPSS             5 s        9 PB

320 x 2-way Opteron nodes connected by an InfiniBand (IB) switch; each CPU reaches memory and the fabric over HyperTransport (HT), MPI runs over IB, and GPFS provides the parallel file system. 640 dedicated compute CPUs plus 8 shared login CPUs. Compared with Seaborg: smaller caches, HT, really fast. SMP? NUMA? SUMO ("sufficiently uniform memory organization").
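
Latency tables like the two above can be sanity-checked with a pointer-chasing microbenchmark. Below is a minimal sketch in plain C (the file name, default footprint, and hop count are arbitrary choices, not anything from the talk): it builds one random cycle through an array and times a serial chase through it, so each load depends on the previous one and the time per hop approximates load-to-use latency at that footprint.

```c
/* latency.c: rough load-to-use latency via pointer chasing.
 * An illustrative sketch, not a calibrated benchmark.
 * gcc -O2 latency.c -o latency && ./latency 16777216
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    size_t n = (argc > 1) ? strtoul(argv[1], 0, 10) : (1 << 24); /* elements */
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: one random cycle visiting every element. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    srand(12345);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;              /* j in [0, i) */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    size_t hops = 10 * 1000 * 1000, p = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < hops; i++) p = next[p];  /* serial dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* print p so the chase is not optimized away */
    printf("footprint %zu MB: %.1f ns/load (p=%zu)\n",
           (n * sizeof *next) >> 20, ns / hops, p);
    return 0;
}
```

Sweeping the footprint from a few KB to hundreds of MB should reproduce the L1 / L2 / main-memory tiers shown in the tables.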

Opteron block diagram: not strictly SMP. Each CPU has its own TLB: 1K entries × 4 KB pages → 4 MB of TLB coverage. Each CPU also has its own path to SDRAM, with a switch handling I/O (per the block diagram).
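
The 4 MB figure is just entries × page size. A tiny sketch that recomputes it from the runtime page size (the 1K-entry count is taken from the slide and hard-coded, since TLB geometry isn't portably queryable):

```c
/* tlb_reach.c: compute TLB reach from the runtime page size. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);   /* 4096 on Jacquard-era Linux */
    long entries = 1024;                 /* assumption taken from the slide */
    printf("page size %ld B, TLB reach %ld MB\n",
           page, (page * entries) >> 20);
    return 0;
}
```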

HyperTransport: good stuff. Memory traffic and I/O travel over dedicated HT links rather than a shared front-side bus, so there is little conflict between data movement and computation.
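
One place this pays off is communication/computation overlap in MPI codes. A hedged sketch (buffer sizes and the dummy work loop are placeholders): post nonblocking transfers, compute on data the messages don't touch, then wait.

```c
/* overlap.c: overlap communication with computation via nonblocking MPI.
 * An illustrative sketch; sizes and the "work" loop are placeholders.
 * mpicc -O2 overlap.c -o overlap && mpirun -np 2 ./overlap
 */
#include <mpi.h>
#include <stdio.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    int rank, size;
    static double sendbuf[N], recvbuf[N], local[N];  /* zero-initialized */
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int peer = rank ^ 1;                 /* pair up ranks 0/1, 2/3, ... */
    if (peer < size) {
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Computation that avoids the message buffers can proceed
         * while HT/IB moves the data. */
        double sum = 0.0;
        for (int i = 0; i < N; i++) sum += local[i] * local[i];

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        if (rank == 0) printf("overlapped: sum=%g recv[0]=%g\n", sum, recvbuf[0]);
    }
    MPI_Finalize();
    return 0;
}
```

Whether true overlap occurs depends on the MPI implementation's progress engine, but the hardware path makes it cheap to try.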

SMP size and memory contention. Jacquard's numbers: 1 task per node sustains 100% of memory bandwidth; 2 tasks sustain 98% each. Why is Jacquard a 2-way SMP? With a memory controller on each Opteron, there is very little bandwidth left to fight over.
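
Numbers like 100%/98% can be approximated with a streaming-triad loop run once alone and then as two concurrent copies; the sketch below is illustrative only, not the measurement behind the slide's figures.

```c
/* stream_triad.c: per-process streaming memory bandwidth.
 * Run one copy, then two concurrent copies (./triad & ./triad)
 * and compare the reported MB/s to gauge contention. A sketch only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M doubles per array, ~384 MB across 3 arrays */

int main(void)
{
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b),
           *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int rep = 0; rep < 10; rep++)
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];      /* triad: 24 B moved per iter */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.0f MB/s\n", 10.0 * 24.0 * N / s / 1e6);
    return 0;
}
```

On a NUMA node, where each process's pages land matters; running the same experiment on a wider SMP shows how contention grows with task count.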

CPUs: 2.2 GHz Opteron.
Peak theoretical flops:
- double (64-bit) floats: 1 add + 1 mult = 2.2 GFlop/s
- single (32-bit) floats: 2 adds + 2 mults = 4.4 GFlop/s
Peak realized flops:
- double (64-bit): 1.9 GFlop/s
- single (32-bit): 3.4 GFlop/s
Your flops? Walltime is more important than flops; for a known algorithm, flops are a sanity check.
Memory BW: 4 GB/s per CPU.
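
In that sanity-check spirit, here is a sketch of a cache-resident daxpy-style loop with a known flop count (array sizes are chosen to fit a 1 MB L2; everything here is illustrative, not the benchmark behind the 1.9 GFlop/s figure):

```c
/* flops_check.c: sanity-check sustained flops against peak.
 * 2 flops per inner iteration (one add, one multiply). A sketch.
 */
#include <stdio.h>
#include <time.h>

#define N 32768       /* 2 x 256 KB arrays: resident in a 1 MB L2 */
#define REPS 100000

int main(void)
{
    static double x[N], y[N];
    double a = 3.14159;
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            y[i] = y[i] + a * x[i];        /* daxpy: 2 flops */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.2f GFlop/s (y[0]=%g)\n", 2.0 * N * REPS / s / 1e9, y[0]);
    return 0;
}
```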

MPI Bandwidth: Seaborg

MPI Bandwidth: Jacquard
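
Bandwidth curves like these are typically produced by a ping-pong test between two MPI ranks. A minimal sketch (the message-size sweep and rep count are arbitrary choices; this is not NERSC's exact benchmark):

```c
/* pingpong.c: point-to-point MPI bandwidth between ranks 0 and 1.
 * mpicc -O2 pingpong.c -o pingpong && mpirun -np 2 ./pingpong
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int bytes = 8; bytes <= (1 << 22); bytes *= 2) {
        char *buf = malloc(bytes);
        int reps = 100;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = MPI_Wtime() - t0;
        if (rank == 0)   /* 2 messages per rep: halve the round trip */
            printf("%8d B  %8.1f MB/s\n", bytes,
                   2.0 * bytes * reps / t / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```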

Linux for AIX Users. Linux and AIX are more similar than different. Linux is not as good as AIX at keeping processes scheduled on the same CPU, hence the processor-affinity work. Linux has easy interfaces to architectural and process performance information: /proc/cpuinfo, /proc/self, etc. AIX puts MPI in /usr/{bin,lib}; on Linux, MPI comes via modules. Linux doesn't need -bmaxdata! Watch little- vs. big-endian: Opteron is little-endian, POWER is big-endian, so unformatted binary data does not move between them unchanged.
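
On the affinity point, a Linux process can pin itself from user code. A Linux-specific sketch using glibc's CPU_* macros (glibc's sched_setaffinity signature changed during the early 2.6-kernel era; the three-argument form shown here is the one that stuck):

```c
/* pin.c: pin the calling process to one CPU (Linux-specific).
 * gcc -O2 pin.c -o pin
 * A sketch; batch systems and MPI launchers often do this for you.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                       /* CPU 0 of the node */

    /* pid 0 means "this process" */
    if (sched_setaffinity(0, sizeof mask, &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pid %d pinned to CPU 0\n", (int)getpid());
    /* With first-touch page placement, memory initialized after the
     * pin lands on this CPU's own controller: good NUMA locality. */
    return 0;
}
```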

Conclusions. The underlying hardware technologies (HT, IB, etc.) are quite promising. Opteron systems are delivering great price/performance. We are still working through some SDRAM, OS, and software issues. What's useful to you? Let us know.