Advanced Computer Architecture 5MD00 / 5Z033 TOP 500 supercomputers Henk Corporaal www.ics.ele.tue.nl/~heco/courses/aca h.corporaal@tue.nl TUEindhoven 2011
Topics
- How to cross the Petaflop boundary
- Ranking
  - Examples Nov 2008
  - Nov 2009 / Nov 2010: what has changed
- Examples
  - Roadrunner (IBM)
  - Jaguar (Cray)
  - SGI Altix
  - BlueGene
1/17/2019 ACA H.Corporaal
How to build a Petaflop supercomputer?
Some examples from 2008:
- Opteron cluster (e.g. ~2X Ranger/TACC)
  - 32,000 quad-core Opterons (about 130K cores)
- Cray XT3/4 (e.g. Baker/ORNL, sooner)
- IBM BlueGene/P (bigger, sooner)
  - 80,000 BG/P PPC processors (320K cores)
- IBM Cell-accelerated Roadrunner cluster
  - 10,000 Cells (80K Cell SPUs)
Supercomputer Ranking
- Started in 1993 (Jack Dongarra, University of Tennessee)
- Based on the LINPACK benchmark
  - linear algebra (LU factorization)
  - LINPACK library superseded by LAPACK: based on BLAS (Basic Linear Algebra Subprograms), exploits caches
- Measures floating-point performance
- Fortran code
- See http://www.top500.org
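To make the benchmark idea concrete, here is a minimal sketch of what a LINPACK-style measurement does: solve a dense system Ax = b by Gaussian elimination with partial pivoting (the computation behind LU factorization), count the floating-point operations, and divide by the elapsed time. This is plain Python for illustration only, not the Fortran benchmark code; the function name `lu_solve` and the way operations are counted are assumptions of this sketch.

```python
import random
import time


def lu_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Returns the solution and the number of floating-point operations
    (add, subtract, multiply, divide) that this routine performed.
    """
    n = len(a)
    a = [row[:] for row in a]  # work on copies, leave the inputs untouched
    b = b[:]
    flops = 0
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]  # multiplier (an entry of L)
            flops += 1
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]  # update the remaining rows (U)
                flops += 2
            b[i] -= m * b[k]
            flops += 2
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i]
        for j in range(i + 1, n):
            s -= a[i][j] * x[j]
            flops += 2
        x[i] = s / a[i][i]
        flops += 1
    return x, flops


if __name__ == "__main__":
    random.seed(0)
    n = 60  # tiny problem size, chosen only to keep the demo fast
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [random.random() for _ in range(n)]
    start = time.perf_counter()
    x, flops = lu_solve(a, b)
    elapsed = time.perf_counter() - start
    print(f"{flops} flops in {elapsed:.4f} s = {flops / elapsed / 1e6:.1f} Mflop/s")
```

The operation count grows roughly as (2/3) n^3, so the achieved rate depends strongly on how well the inner loop uses the caches, which is why blocked BLAS-based libraries matter.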
Single-Chip GPU vs. Fastest Supercomputers
ref: http://www.llnl.gov/str/JanFeb05/Seager.html
Performance Ranking Nov. 2008

#    Name                      N_PE    Rmax (Tflop)  Rpeak (Tflop)  P (kW)
1    Roadrunner IBM            129600  1105          1456           2483
2    Cray XT5                  150152  1059          1381           6950
3    SGI Altix ICE             51200   487           608            2090
4    BlueGene IBM              212992  478           596            2329
100  Cluster Platform (Xeons)  5120    27            51             -
52   Power 575 SARA (Amst)     3328    49            63             532
75   BlueGene Astron           12288   35            42             95
496  Cluster in Gent Univ.     1568    13            16
Performance Ranking 2008: we crossed the Petaflop boundary
(Nr 1 Roadrunner IBM: Rmax 1105 Tflop; Nr 2 Cray XT5: Rmax 1059 Tflop)
Update November 2009

#  Name                        Site              N_PE    Rmax (Tflop)  Rpeak (Tflop)  P (kW)
1  Jaguar, Cray XT5-HE         Oak Ridge, USA    224162  1759          2331           6951
2  Roadrunner IBM              DOE, USA          122400  1042          1376           2346
3  Kraken, Cray XT5-HE         Tennessee, USA    98928   832           1029           -
4  BlueGene IBM                Juelich, Germany  294912  826           1003           2268
5  Tianhe, Xeon / ATI cluster  China             71680   563           1206
Update November 2010

#  Name              Site        Processors          N_PE    Rmax (Tflop)  Rpeak (Tflop)  P (kW)
1  Tianhe-1A         China       Intel + NVIDIA GPU  186368  2566          4701           4040
2  Jaguar, Cray XT5  DOE, USA    Opteron 6-cores     224162  1759          2331           6950
3  Nebulae           China       Intel + NVIDIA GPU  120640  1271          2984           2580
4  TSUBAME           NEC, Japan  Intel + NVIDIA GPU  73278   1192          2287           1399
5  Hopper, Cray XE6                                  138368  1050          1254           4590
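Two derived numbers help when reading these tables: the LINPACK efficiency (Rmax / Rpeak) and the power efficiency (Rmax per Watt). The sketch below computes both for the November 2010 rows; the figures are copied from the table above, and the function names are made up for this illustration.

```python
# (name, Rmax in Tflop, Rpeak in Tflop, power in kW), copied from the Nov 2010 table
NOV_2010 = [
    ("Tianhe-1A", 2566, 4701, 4040),
    ("Jaguar", 1759, 2331, 6950),
    ("Nebulae", 1271, 2984, 2580),
    ("TSUBAME", 1192, 2287, 1399),
    ("Hopper", 1050, 1254, 4590),
]


def linpack_efficiency(rmax_tflop, rpeak_tflop):
    """Fraction of the theoretical peak that is reached on LINPACK."""
    return rmax_tflop / rpeak_tflop


def mflops_per_watt(rmax_tflop, power_kw):
    """Power efficiency; 1 Tflop = 1e6 Mflop and 1 kW = 1e3 W."""
    return rmax_tflop * 1e6 / (power_kw * 1e3)


if __name__ == "__main__":
    for name, rmax, rpeak, power in NOV_2010:
        eff = linpack_efficiency(rmax, rpeak)
        mpw = mflops_per_watt(rmax, power)
        print(f"{name:10s} efficiency {eff:5.1%}  {mpw:6.0f} MFlops/Watt")
```

With these inputs the GPU-based systems reach a clearly lower fraction of their peak than the CPU-only Cray systems, but deliver more MFlops per Watt.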
Alternative ranking: Green500
Most power-efficient supercomputers
- 2008: best result = 536 MFlops/Watt => 1.87 nJ / floating-point operation
- 2009: best result = 723 MFlops/Watt => 1.38 nJ / floating-point operation
  - Cell cluster, ranking 110 in the TOP500
- 2010: best result = 1684 MFlops/Watt => 594 pJ / floating-point operation
  - IBM BlueGene/Q
See www.green500.org
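The conversion from MFlops/Watt to energy per operation is just an inversion: 1 Watt is 1 Joule per second, so flop/s per Watt equals flop per Joule. A small sketch that reproduces the three numbers above (the function name is an assumption of this example):

```python
def joules_per_flop(mflops_per_watt):
    """Energy per floating-point operation for a given power efficiency.

    (flop/s) / W = flop / J, so the energy per flop is the reciprocal.
    """
    return 1.0 / (mflops_per_watt * 1e6)


if __name__ == "__main__":
    for year, best in [(2008, 536), (2009, 723), (2010, 1684)]:
        print(f"{year}: {best} MFlops/Watt = {joules_per_flop(best) * 1e9:.2f} nJ/flop")
```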
Nr1 (2008): Roadrunner
- IBM cluster
- 6480 nodes with
  - Dual-core Opteron, 1.8 GHz
  - 2 * PowerXCell 8i, 3.2 GHz (12.8 GFlops)
- InfiniBand connection fabric (16 Gbit/s per link)
- Fat-tree interconnect
- 100 Tbyte DRAM memory
- 216 I/O nodes
- MPI programming
- 2.35 MW power !!
- Size: 296 racks, 5500 ft^2
This is huge !!
Cell/B.E. – the architecture
- 1 x PPE: 64-bit PowerPC
  - L1: 32 KB I$ + 32 KB D$
  - L2: 512 KB
- 8 x SPE cores
  - Local store: 256 KB
  - 128 x 128-bit vector registers
- Hybrid memory model
  - PPE: Rd/Wr
  - SPEs: asynchronous DMA
- EIB: 205 GB/s sustained aggregate bandwidth
- Processor-to-memory bandwidth: 25.6 GB/s
- Processor-to-processor: 20 GB/s in each direction
Roadrunner: TriBlade = 2 nodes
For more details: presentation slides of Ken Koch, March 2008
Nr2 (2008): Jaguar
- Cray XT5 QC, in total 150152 cores
- Figures for one part of the system (I guess 5 times this for the total):
  - 7832 quad-core 2.1 GHz AMD Opteron
  - 62 TB memory (= 2 GB / core)
  - 600 TB file system
  - 250 TFlop
- SeaStar2+ interconnect (from Cray)
- Note 2009: quad-cores replaced by six-cores; now nr 1
  - 224,256 cores
  - 1.75 PetaFlop on LINPACK (Rmax 1759 Tflop in the Nov 2009 ranking)
- Paper: Bland A.S., Kendall R.A., Kothe D.B., Rogers J.H., Shipman G.M., "Jaguar: The World's Most Powerful Computer"
Jaguar
Nr3 (2008): SGI Altix ICE 8200
- 92 racks of Altix ICE 8200EX with 3.0 GHz Intel Xeon quad-core processors: 47,104 cores
- 8 racks of Altix ICE 8200 with 2.66 GHz Intel quad-core processors: 4096 cores
- 51 TB main memory
- DDR InfiniBand
Nr4 (2008): BlueGene/L
- IBM
- Based on ASIC with PowerPC 440, 700 MHz, each 2.8 GFlops
- 106,496 nodes
- 3D torus interconnect for p2p communication + collective network
Figures: 3D torus; complete system rack
BlueGene/L ASIC node
BlueGene/L node board
- 16 cards with 2 ASICs each
- 8 GB
- 180 Gflop
2009: BlueGene/P
- System: 256 racks, up to 1 PB, 3.56 PFlops
- Rack: 32 node cards, 13.9 TF/s, 2-4 TB
- Node card: 32 processor cards, 64-128 GB, 435 GFlops
- Processor card: one 4-processor chip, 13.6 GFlops, 2-4 GB
- ASIC: 13.6 GFlops, 8 MB EDRAM
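The packaging hierarchy can be checked by multiplying upward from a single chip. The sketch below does that for both peak performance and memory, using only the per-level counts and the 13.6 GFlops and 2-4 GB per processor card stated on the slide; the variable names are chosen for this example. Note that the exact product comes out at about 3.565 PFlops, so the slide's 3.56 PFlops looks like 256 times the rounded rack figure of 13.9 TF/s.

```python
CHIP_GFLOPS = 13.6            # one 4-processor chip per processor card
CARD_MEMORY_GB = (2, 4)       # minimum and maximum memory per processor card
CARDS_PER_NODE_CARD = 32
NODE_CARDS_PER_RACK = 32
RACKS_PER_SYSTEM = 256

# Peak performance at each level, in GFlops.
node_card_gflops = CARDS_PER_NODE_CARD * CHIP_GFLOPS
rack_gflops = NODE_CARDS_PER_RACK * node_card_gflops
system_gflops = RACKS_PER_SYSTEM * rack_gflops

# Memory at each level, in GB, as (min, max) pairs.
node_card_gb = tuple(CARDS_PER_NODE_CARD * m for m in CARD_MEMORY_GB)
rack_gb = tuple(NODE_CARDS_PER_RACK * m for m in node_card_gb)
system_gb = tuple(RACKS_PER_SYSTEM * m for m in rack_gb)

if __name__ == "__main__":
    print(f"node card: {node_card_gflops:.1f} GFlops, {node_card_gb[0]}-{node_card_gb[1]} GB")
    print(f"rack:      {rack_gflops / 1e3:.1f} TFlops, {rack_gb[0] // 1024}-{rack_gb[1] // 1024} TB")
    print(f"system:    {system_gflops / 1e6:.3f} PFlops, up to {system_gb[1] // 1024**2} PB")
```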
BlueGene/P ASIC
PPC450: Exploiting SIMD
- Two FPUs
- SIMD: 2 x 32 64-bit registers
- SIMD datapath width = 16 bytes
  - feeds two FPUs with 8 bytes each every cycle
- Two FP multiply-add operations per cycle
  - 3.4 GFLOP/s peak performance
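The 3.4 GFLOP/s figure follows from the clock rate and the two multiply-adds per cycle. A short sketch of that arithmetic, assuming the 850 MHz clock given on the BlueGene/P ASIC slide and the usual convention that one multiply-add counts as two floating-point operations:

```python
CLOCK_HZ = 850e6        # BlueGene/P ASIC clock (from the ASIC slide)
FMA_PER_CYCLE = 2       # one multiply-add per FPU, two FPUs
FLOPS_PER_FMA = 2       # assumption: a multiply-add counts as multiply + add
CORES_PER_CHIP = 4      # one 4-processor chip

core_gflops = CLOCK_HZ * FMA_PER_CYCLE * FLOPS_PER_FMA / 1e9
chip_gflops = CORES_PER_CHIP * core_gflops

if __name__ == "__main__":
    print(f"per core: {core_gflops:.1f} GFLOP/s, per chip: {chip_gflops:.1f} GFLOP/s")
```

The per-chip result matches the 13.6 GFlops quoted for the BlueGene/P ASIC.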
BlueGene/P ASIC
- 208M transistors
- 850 MHz
- 16 W
- 90 nm
BlueGene/P node card
Next: BlueGene/Q
- 10 PFlops in 2011-2012
- See www.research.ibm.com/bluegene
Can we match the human brain ???
Performance
= 100 billion (10^11) neurons
  * 1000 (10^3) connections/neuron
  * 200 (2 * 10^2) calculations per second per connection
= 2 * 10^16 calculations per second
Memory
= 100 billion (10^11) neurons
  * 1000 (10^3) connections/neuron
  * 10 bytes (information about connection strength and address of output neuron, type of synapse)
= 10^15 bytes = 1 PB = 1000 TB
How far off are we?
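The estimate can be reproduced in a few lines, and compared with the Nr 1 system of November 2010 (Tianhe-1A, Rmax 2566 Tflop in the ranking table). Treating one "calculation" as one floating-point operation is an assumption made only for this rough comparison.

```python
NEURONS = 10**11
CONNECTIONS_PER_NEURON = 10**3
CALCS_PER_SECOND_PER_CONNECTION = 200
BYTES_PER_CONNECTION = 10

brain_calcs_per_second = NEURONS * CONNECTIONS_PER_NEURON * CALCS_PER_SECOND_PER_CONNECTION
brain_memory_bytes = NEURONS * CONNECTIONS_PER_NEURON * BYTES_PER_CONNECTION

# Nr 1 of November 2010: Rmax in flop/s (2566 Tflop).
TIANHE_1A_RMAX = 2566 * 10**12

# Assumption: one "calculation" is comparable to one floating-point operation.
gap = brain_calcs_per_second / TIANHE_1A_RMAX

if __name__ == "__main__":
    print(f"performance: {brain_calcs_per_second:.0e} calculations per second")
    print(f"memory:      {brain_memory_bytes:.0e} bytes")
    print(f"gap to the Nov 2010 nr 1: about {gap:.1f}x")
```

Under these assumptions the fastest machine of 2010 is less than one order of magnitude short in raw operation rate.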
Blue Brain research
- Software replica of one column of the neocortex
  - cortex: 85% of the brain's total mass
  - required for language, learning, memory and complex thought
  - the essential first step to simulating the whole brain
- Next: include circuitry from other brain regions and eventually the whole brain
Latest news: factorization of RSA768
News of 11 Jan 2010
- RSA is used to encipher text using both a public and a private key
- EPFL, CWI and others have broken RSA768
- This means: factorizing a 768-bit number into 2 primes
- Using 1700 AMD 2.2 GHz cores for 1 year => 15 million hours (single core) compute time
- Current RSA standard uses 1024 bits: still safe for some years
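The single-core compute time is simply cores times hours in a year; a quick check, assuming a 365-day year:

```python
CORES = 1700
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

core_hours = CORES * HOURS_PER_YEAR

if __name__ == "__main__":
    print(f"{core_hours} core-hours = about {core_hours / 1e6:.0f} million hours")
```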
RSA (Rivest, Shamir, Adleman)
- choose 2 (large) primes p and q
- n = p*q
- choose e such that e and (p-1)(q-1) are coprime (i.e. do not share prime factors)
- choose d such that d*e = 1 mod ((p-1)(q-1))
- public key = (n,e)
- private key = (n,d)
- Encryption of message m: c = m^e mod n
- Decryption of cipher c: m = c^d mod n
See Wikipedia for details and a working example
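The steps above can be sketched with a toy example. The primes p = 61 and q = 53 and the exponent e = 17 are tiny illustrative values chosen for this sketch (real keys use primes of hundreds of digits, and padding, which is omitted here). The function names are made up for the example; `pow(e, -1, phi)` for the modular inverse needs Python 3.8 or later.

```python
from math import gcd


def make_keys(p, q, e):
    """Build (public key, private key) from two primes and a public exponent."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be coprime with (p-1)(q-1)")
    d = pow(e, -1, phi)  # modular inverse: d*e = 1 mod (p-1)(q-1)
    return (n, e), (n, d)


def encrypt(m, public_key):
    n, e = public_key
    return pow(m, e, n)  # c = m^e mod n


def decrypt(c, private_key):
    n, d = private_key
    return pow(c, d, n)  # m = c^d mod n


public_key, private_key = make_keys(61, 53, 17)
message = 65  # must be smaller than n
cipher = encrypt(message, public_key)
recovered = decrypt(cipher, private_key)

if __name__ == "__main__":
    print("public key:", public_key, "private key:", private_key)
    print("cipher:", cipher, "recovered:", recovered)
```

Breaking the key means recovering p and q from n; for n = 3233 that is trivial, for a 768-bit n it took the effort described on the previous slide.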
RSA factorization result
Factorization of RSA768, the following 768-bit, 232-digit number from RSA's challenge list:
12301866845301177551304949583849627207728535695953347921973224215172640050726365751874520219978646938995647494277406384592519255732630345373154826850791702612214291346167042921431160222124047927473779408066535141959745985690214341
=
33478071698956898786044169848212690817704794983713768568912431388982883793878002287614711652531743087737814467999489
*
36746043666799590428244633799627952632279158164343087642676032283815739666511279233373417143396810270092798736308917