Advanced Computer Architecture 5MD00 / 5Z033 TOP 500 supercomputers

Presentation transcript:

Advanced Computer Architecture 5MD00 / 5Z033
TOP 500 supercomputers
Henk Corporaal
www.ics.ele.tue.nl/~heco/courses/aca
h.corporaal@tue.nl
TU Eindhoven 2011

Topics
  How to cross the Petaflop boundary
  Ranking
  Examples Nov 2008
  Nov 2009 / Nov 2010: what has changed
  Examples
    Roadrunner (IBM)
    Jaguar Cray
    SGI Altix
    BlueGene
1/17/2019 ACA H.Corporaal

How to build a Petaflop supercomputer? Some examples from 2008:
  Opteron cluster (e.g. ~2X Ranger/TACC)
    32,000 quad-core Opterons (130K cores)
  Cray XT3/4 (e.g. Baker/ORNL sooner)
  IBM BlueGene/P (bigger sooner)
    80,000 BG/P PPC processors (320K cores)
  IBM Cell-accelerated Roadrunner cluster
    10,000 Cells (80K Cell SPUs)
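The three design points above imply very different per-core peak rates. A minimal sketch of that arithmetic, assuming exactly 1 PFlop/s as the target and using the processor counts from the slide (the dictionary keys and the helper name `gflops_per_core` are illustrative only):

```python
# Per-core peak rate needed to reach 1 PFlop/s for the 2008 design points.
# Core counts follow the slide: processors (or Cells) times cores (or SPUs) each.
PETAFLOP = 1e15  # floating-point operations per second

designs = {
    "Opteron cluster": 32_000 * 4,        # quad-core Opterons; the slide rounds to 130K
    "IBM BlueGene/P": 80_000 * 4,         # 4 PPC cores per processor
    "Roadrunner Cell SPUs": 10_000 * 8,   # 8 SPUs per Cell
}


def gflops_per_core(cores, target=PETAFLOP):
    """Return the peak GFlop/s each core must deliver to reach the target."""
    return target / cores / 1e9


for name, cores in designs.items():
    print(f"{name:22s} {cores:7d} cores -> {gflops_per_core(cores):6.3f} GFlop/s per core")
```

The BlueGene/P and Cell results (about 3.1 and 12.5 GFlop/s) are close to the 3.4 and 12.8 GFlop/s peak figures quoted on later slides.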

Supercomputer Ranking
  Started in 1993
    Jack Dongarra, University of Tennessee
  Based on the LINPACK benchmark
    linear algebra (LU factorization)
    LINPACK has been superseded by LAPACK
      based on BLAS (Basic Linear Algebra Subprograms)
      exploits caches
  Measures floating-point performance
  Fortran code
  See http://www.top500.org
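To make the benchmark idea concrete, here is a small pure-Python sketch: solve a dense system by LU-style elimination with partial pivoting, time it, and convert the time into a floating-point rate. It assumes the operation count 2/3 n^3 + 2 n^2 conventionally used for LINPACK results; the function names are illustrative, and this is not the actual benchmark code (which is Fortran and heavily tuned).

```python
import random
import time


def lu_solve(a, b):
    """Solve a x = b by Gaussian elimination (LU) with partial pivoting."""
    n = len(a)
    a = [row[:] for row in a]  # work on copies, leave the inputs untouched
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry of column k to row k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k below the diagonal.
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= f * a[k][j]
            b[i] -= f * b[k]
    # Back substitution on the upper triangle.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_flops(n):
    """Nominal operation count for solving an n x n system."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def measure_mflops(n=100, seed=0):
    """Time one solve of a random n x n system and return MFlop/s."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    lu_solve(a, b)
    elapsed = time.perf_counter() - start
    return linpack_flops(n) / elapsed / 1e6


if __name__ == "__main__":
    print(f"{measure_mflops():.1f} MFlop/s (pure Python, n = 100)")
```

The rate is always computed from the nominal count, not from the operations a particular implementation actually executes, which is what makes results comparable across machines.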

Single-Chip GPU vs. Fastest Supercomputers
  ref: http://www.llnl.gov/str/JanFeb05/Seager.html

Performance Ranking Nov. 2008
  #    Name                       N_PE    Rmax (Tflop/s)  Rpeak (Tflop/s)  P (kW)
  1    Roadrunner IBM             129600  1105            1456             2483
  2    Cray XT5                   150152  1059            1381             6950
  3    SGI Altix ICE              51200   487             608              2090
  4    BlueGene IBM               212992  478             596              2329
  100  Cluster Platform (Xeons)   5120    27              51               -
  52   Power 575 SARA (Amst)      3328    49              63               532
  75   BlueGene Astron            12288   35              42               95
  496  Cluster in Gent Univ.      1568    13              16

Performance Ranking 2008: we crossed the Petaflop boundary
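Two derived figures help when reading the November 2008 ranking: the fraction of peak that LINPACK actually achieves (Rmax / Rpeak) and the power efficiency in MFlops/Watt. The sketch below uses the top four rows of the table; the tuple layout and function names are illustrative only.

```python
# (name, N_PE, Rmax in Tflop/s, Rpeak in Tflop/s, power in kW), Nov. 2008 top four
NOV_2008 = [
    ("Roadrunner IBM", 129600, 1105, 1456, 2483),
    ("Cray XT5", 150152, 1059, 1381, 6950),
    ("SGI Altix ICE", 51200, 487, 608, 2090),
    ("BlueGene IBM", 212992, 478, 596, 2329),
]


def linpack_efficiency(rmax, rpeak):
    """Fraction of the theoretical peak reached on LINPACK."""
    return rmax / rpeak


def mflops_per_watt(rmax_tflops, power_kw):
    """Power efficiency: Tflop/s -> MFlop/s, kW -> W."""
    return (rmax_tflops * 1e6) / (power_kw * 1e3)


for name, n_pe, rmax, rpeak, power in NOV_2008:
    print(f"{name:15s} efficiency {linpack_efficiency(rmax, rpeak):5.1%}  "
          f"{mflops_per_watt(rmax, power):6.1f} MFlops/W")
```

With these numbers Roadrunner and the Cray XT5 have almost the same Rmax, but Roadrunner needs roughly a third of the power.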

Update November 2009
  #  Name                                N_PE    Rmax (Tflop/s)  Rpeak (Tflop/s)  P (kW)
  1  Jaguar-Cray XT5-HE, Oak Ridge, USA  224162  1759            2331             6951
  2  Roadrunner IBM, DOE, USA            122400  1042            1376             2346
  3  Kraken Cray XT5-HE, Tennessee, USA  98928   832             1029             -
  4  BlueGene IBM, Juelich, Germany      294912  826             1003             2268
  5  Tianhe Xeon / ATI cluster, China    71680   563             1206

Update November 2010
  #  Name                       Processors          N_PE    Rmax (Tflop/s)  Rpeak (Tflop/s)  P (kW)
  1  Tianhe-1A, China           Intel + NVIDIA GPU  186368  2566            4701             4040
  2  Jaguar-Cray XT5, DOE, USA  Opteron 6-cores     224162  1759            2331             6950
  3  Nebulae, China             Intel + NVIDIA GPU  120640  1271            2984             2580
  4  TSUBAME, NEC, Japan        Intel + NVIDIA GPU  73278   1192            2287             1399
  5  Hopper-Cray XE6                                138368  1050            1254             4590

Alternative ranking: Green500
  Most power-efficient supercomputers
  2008: best result = 536 MFlops/Watt => 1.87 nJ / floating-point operation
  2009: best result = 723 MFlops/Watt => 1.38 nJ / floating-point operation
    Cell cluster, ranking 110 in top500
  2010: best result = 1684 MFlops/Watt => 594 pJ / floating-point operation
    IBM BlueGene/Q
  See www.green500.org
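The conversion from MFlops/Watt to energy per operation is a reciprocal: one Watt is one Joule per second, so the energy per operation is 1 / (operations per second per Watt). A minimal sketch (the function name is illustrative):

```python
def joules_per_flop(mflops_per_watt):
    """Energy per floating-point operation for a given MFlops/Watt rating."""
    flops_per_joule = mflops_per_watt * 1e6  # 1 W = 1 J/s
    return 1.0 / flops_per_joule


for year, rating in [(2008, 536), (2009, 723), (2010, 1684)]:
    print(f"{year}: {rating} MFlops/W -> {joules_per_flop(rating) * 1e9:.3f} nJ per operation")
```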

Nr1 (2008): Roadrunner IBM cluster
  6480 nodes with
    Dual core Opteron 1.8 GHz
    2 * PowerXCell 8i 3.2 GHz (12.8 GFlops)
  Infiniband connection fabric (16 Gbit/s per link)
    Fat tree interconnect
  100 Tbyte DRAM memory
  216 I/O nodes
  MPI programming
  2.35 MW power !!
  Size: 296 racks, 5500 ft^2
  This is huge !!

Cell/B.E. – the architecture
  1 x PPE: 64-bit PowerPC
    L1: 32 KB I$ + 32 KB D$
    L2: 512 KB
  8 x SPE cores:
    Local store: 256 KB
    128 x 128 bit vector registers
  Hybrid memory model:
    PPE: Rd/Wr
    SPEs: Asynchronous DMA
  EIB: 205 GB/s sustained aggregate bandwidth
  Processor-to-memory bandwidth: 25.6 GB/s
  Processor-to-processor: 20 GB/s in each direction

Roadrunner: TriBlade = 2 nodes
  For more details: presentation slides of Ken Koch, March 2008

Nr2 (2008): Jaguar
  Cray XT5 QC: I guess 5 times
  In total 150152 cores
  7832 quad-core 2.1 GHz AMD Opteron
    62 TB memory (= 2 GB / core)
    600 TB file system
    250 TFlop
  SeaStar2+ interconnect (from Cray)
  Note 2009: quad-cores replaced by six-cores
    now nr 1
    224,256 cores
    peak 1.75 PetaFlop
  Paper: Bland A.S., Kendall R.A., Kothe D.B., Rogers J.H., Shipman G.M. Jaguar: The World’s Most Powerful Computer

Jaguar

Nr3 (2008): SGI Altix ICE8200
  92 racks of Altix ICE 8200EX with 3.0 GHz Intel Xeon quad-core processors, or 47,104 cores
  8 racks of Altix ICE 8200 with 2.66 GHz Intel quad-core processors, or 4096 cores
  51 TB main memory
  DDR InfiniBand

Nr4 (2008): BlueGene/L IBM
  Based on ASIC with PowerPC 440, 700 MHz, each 2.8 GFlops
  106,496 nodes
  3D torus interconnect for p2p communication + collective network
  Figures: 3D torus; complete system rack
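In a 3D torus every node has six direct neighbours, and the links wrap around at the edges, so the hop distance along each axis is never more than half the axis length. A minimal sketch of both ideas; the 8 x 8 x 8 dimensions and the function names are hypothetical and are not taken from the BlueGene/L documentation.

```python
def torus_neighbors(coord, dims):
    """Return the six direct neighbours of a node in a 3D torus."""
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


def torus_hops(a, b, dims):
    """Minimum hop count between two nodes, using wrap-around links."""
    return sum(min((p - q) % n, (q - p) % n) for p, q, n in zip(a, b, dims))


if __name__ == "__main__":
    dims = (8, 8, 8)  # hypothetical torus size, for illustration only
    print(torus_neighbors((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (7, 4, 1), dims))
```

The wrap-around is why node (0, 0, 0) and node (7, 0, 0) are one hop apart rather than seven.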

BlueGene/L ASIC node

BlueGene/L Node board
  16 cards with 2 ASICs each
  8 GB
  180 Gflop

2009: BlueGene/P
  System: 256 racks, 3.56 PFlops, up to 1 PB
  Rack: 32 node cards, 13.9 TF/s, 2-4 TB
  Node card: 32 processor cards, 435 GFlops, 64-128 GB
  Processor card: one 4-processor chip, 13.6 GFlops, 2-4 GB
  ASIC: 13.6 GFlops, 8 MB EDRAM

BlueGene/P ASIC

PPC450: Exploiting SIMD
  Two FPUs
  SIMD: 2 x 32 64-bit registers
  SIMD datapath width = 16 bytes
    Feeds two FPUs with 8 bytes each every cycle
  Two FP multiply-add operations per cycle
    3.4 GFLOP/s peak performance

BlueGene/P ASIC
  208M transistors
  850 MHz
  16 W
  90 nm
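The BlueGene/P peak figures on the preceding slides follow from the 850 MHz clock and the two multiply-add operations per cycle of the PPC450, multiplied up through the packaging hierarchy. A sketch of that chain, assuming each multiply-add counts as two floating-point operations (the constant names are illustrative):

```python
CLOCK_HZ = 850e6             # BlueGene/P ASIC clock
FMA_PER_CYCLE = 2            # two FP multiply-add operations per cycle (PPC450)
FLOPS_PER_FMA = 2            # assumption: one multiply plus one add

CORES_PER_CHIP = 4           # one 4-processor chip per processor card
CARDS_PER_NODE_CARD = 32
NODE_CARDS_PER_RACK = 32
RACKS_PER_SYSTEM = 256

core_peak = CLOCK_HZ * FMA_PER_CYCLE * FLOPS_PER_FMA
chip_peak = core_peak * CORES_PER_CHIP
node_card_peak = chip_peak * CARDS_PER_NODE_CARD
rack_peak = node_card_peak * NODE_CARDS_PER_RACK
system_peak = rack_peak * RACKS_PER_SYSTEM

print(f"core      {core_peak / 1e9:8.1f} GFlop/s")
print(f"chip      {chip_peak / 1e9:8.1f} GFlop/s")
print(f"node card {node_card_peak / 1e9:8.1f} GFlop/s")
print(f"rack      {rack_peak / 1e12:8.2f} TFlop/s")
print(f"system    {system_peak / 1e15:8.2f} PFlop/s")
```

The results reproduce the slide values of 3.4 GFlop/s per core, 13.6 GFlop/s per chip, 435 GFlop/s per node card, 13.9 TFlop/s per rack and 3.56 PFlop/s for 256 racks.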

BlueGene/P node card

Next: BlueGene/Q
  10 PFlops in 2011-2012
  see www.research.ibm.com/bluegene

Can we match the human brain?
  Performance = 100 billion (10^11) neurons * 1000 (10^3) connections/neuron * 200 (2 * 10^2) calculations per second per connection = 2 * 10^16 calculations per second
  Memory = 100 billion (10^11) neurons * 1000 (10^3) connections/neuron * 10 bytes (information about connection strength and address of output neuron, type of synapse) = 10^15 bytes = 1 PB = 1000 TB
  How far off are we?
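One rough way to answer "how far off are we" is to redo the estimate and compare it with the fastest machine in the November 2010 list (Tianhe-1A, Rmax 2566 Tflop/s). This sketch assumes, loosely, that one "calculation" can be equated with one floating-point operation, which the slide does not claim; the variable names are illustrative.

```python
NEURONS = 10**11
CONNECTIONS_PER_NEURON = 10**3
CALCS_PER_SEC_PER_CONNECTION = 200
BYTES_PER_CONNECTION = 10

brain_ops = NEURONS * CONNECTIONS_PER_NEURON * CALCS_PER_SEC_PER_CONNECTION
brain_bytes = NEURONS * CONNECTIONS_PER_NEURON * BYTES_PER_CONNECTION

# Rmax of the Nov. 2010 number 1 system, in operations per second.
TIANHE_1A_RMAX = 2566e12

# Assumption: one "calculation" ~ one floating-point operation.
ratio = brain_ops / TIANHE_1A_RMAX

print(f"brain performance estimate: {brain_ops:.1e} calculations/s")
print(f"brain memory estimate:      {brain_bytes / 1e15:.0f} PB")
print(f"brain / Tianhe-1A Rmax:     {ratio:.1f}x")
```

Under that assumption the 2010 number 1 system is within about one order of magnitude of the performance estimate.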

Blue brain research
  Software replica of one column of the neocortex
    cortex: 85% of the brain's total mass
    required for language, learning, memory and complex thought
    the essential first step to simulating the whole brain
  Next: include circuitry from other brain regions and eventually the whole brain.

Latest news: factorization of RSA768
  RSA is used to encipher text using both a public and a private key
  EPFL, CWI and others have broken RSA768
  This means: factorizing a 768-bit number into 2 primes
  Using 1700 AMD 2.2 GHz cores for 1 year => 15 Mh (single core) compute time
  Current RSA standard uses 1024 bits: still safe for some years
  News of 11 Jan 2010

RSA (Rivest, Shamir, Adleman)
  choose 2 (large) primes p and q
  n = p*q
  choose e such that e and (p-1)(q-1) are coprime (i.e. do not share prime factors)
  choose d such that d*e = 1 mod ((p-1)(q-1))
  public key = (n,e)
  private key = (n,d)
  Encryption of message m: c = m^e mod n
  Decryption of cypher c: m = c^d mod n
  see wikipedia for details and working example
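The steps above can be tried with deliberately tiny primes. The sketch below uses p = 61, q = 53, e = 17 and message m = 65, which are textbook-sized illustration values and far too small for real use; the function names are illustrative, and the modular inverse via `pow(e, -1, phi)` needs Python 3.8 or later.

```python
from math import gcd


def make_keys(p, q, e):
    """Build (public, private) RSA keys from two primes and a public exponent."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be coprime with (p-1)(q-1)")
    d = pow(e, -1, phi)  # d*e = 1 mod ((p-1)(q-1)); Python 3.8+
    return (n, e), (n, d)


def encrypt(m, public_key):
    """c = m^e mod n"""
    n, e = public_key
    return pow(m, e, n)


def decrypt(c, private_key):
    """m = c^d mod n"""
    n, d = private_key
    return pow(c, d, n)


if __name__ == "__main__":
    public, private = make_keys(61, 53, 17)  # toy primes, illustration only
    c = encrypt(65, public)
    print("public key:", public, "private key:", private)
    print("cipher:", c, "decrypted:", decrypt(c, private))
```

Security rests on the difficulty of recovering p and q from n: anyone who can factor n can recompute d in exactly the way `make_keys` does.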

RSA factorization result
  Factorization of RSA768, the following 768-bit, 232-digit number from RSA's challenge list:
  1230186684530117755130494958384962720772853569595334792197322452151726400507263657518745202199786469389956474942774063845925192557326303453731548268507917026122142913461670429214311602221240479274737794080665351419597459856902143413
  =
  33478071698956898786044169848212690817704794983713768568912431388982883793878002287614711652531743087737814467999489
  *
  36746043666799590428244633799627952632279158164343087642676032283815739666511279233373417143396810270092798736308917
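Checking a claimed factorization is trivial compared with finding it: multiply the two factors and inspect the product. The sketch below does this with Python's arbitrary-precision integers, with the factors split into 10-digit pieces for readability, and also recomputes the core-hour figure from the news slide (variable names are illustrative).

```python
# The two prime factors from the slide, written as concatenated 10-digit pieces.
p = int(
    "3347807169" "8956898786" "0441698482" "1269081770" "4794983713"
    "7685689124" "3138898288" "3793878002" "2876147116" "5253174308"
    "7737814467" "999489"
)
q = int(
    "3674604366" "6799590428" "2446337996" "2795263227" "9158164343"
    "0876426760" "3228381573" "9666511279" "2333734171" "4339681027"
    "0092798736" "308917"
)

n = p * q

print("bits in n:        ", n.bit_length())
print("decimal digits:   ", len(str(n)))
print("leading digits:   ", str(n)[:20])
print("trailing digits:  ", str(n)[-10:])

# 1700 cores for one year, expressed in single-core hours.
core_hours = 1700 * 365 * 24
print(f"compute time: {core_hours / 1e6:.1f} million core hours")
```

The multiplication takes microseconds, while the factorization took on the order of 15 million core hours.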