1 High Performance Computing: A Look Behind and Ahead
Jack Dongarra, Computer Science Department, University of Tennessee

2 High Performance Computers
- From the beginning of the digital age, supercomputers have been time machines that let researchers peer into the future, both intellectually and temporally.
  - Intellectually, they bring to life models of complex phenomena when economics and other constraints preclude experimentation.
  - Temporally, they reduce the time to solution by enabling us to evaluate larger and more complex models than would be possible on more conventional systems.

3 High Performance Computers
- ~25 years ago: 1x10^6 floating point ops/sec (Mflop/s)
  - Scalar based
- ~15 years ago: 1x10^9 floating point ops/sec (Gflop/s)
  - Vector & shared memory computing, bandwidth aware
  - Block partitioned, latency tolerant
- Today: 1x10^12 floating point ops/sec (Tflop/s)
  - Highly parallel, distributed processing, message passing, network based
  - Data decomposition, communication/computation
- ~5 years away: 1x10^15 floating point ops/sec (Pflop/s)
  - Many more levels of memory hierarchy, combination of grids & HPC
  - More adaptive, latency tolerant and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes

4 Architecture/Systems Continuum
- Custom processor with custom interconnect (tightly coupled)
  - Cray X1
  - NEC SX-7
  - IBM Regatta
  - IBM Blue Gene/L
- Commodity processor with custom interconnect (hybrid)
  - SGI Altix (Intel Itanium 2)
  - Cray Red Storm (AMD Opteron)
- Commodity processor with commodity interconnect (loosely coupled)
  - Clusters: Pentium, Itanium, Opteron, Alpha with GigE, Infiniband, Myrinet, Quadrics
  - NEC TX7
  - IBM eServer
  - Dawning
Trade-offs at the custom end:
- Best processor performance for codes that are not "cache friendly"
- Good communication performance
- Simplest programming model
- Most expensive
Trade-offs at the commodity end:
- Good communication performance
- Good scalability
- Best price/performance (for codes that work well with caches and are latency tolerant)
- More complex programming model

5 Top 500 Computers
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark (solving Ax=b for a dense problem)
- Updated twice a year:
  - SC'xy in the States in November
  - Meeting in Germany in June
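Rmax is simply the sustained rate achieved on that dense Ax=b solve. As a minimal sketch of the idea, not the real HPL benchmark, the following Python snippet times a dense solve and converts it to Gflop/s (the problem size n and the use of NumPy's LAPACK-backed solver are illustrative assumptions):

```python
# Minimal LINPACK-style flop-rate estimate (illustrative only; not the HPL benchmark).
import time
import numpy as np

n = 4000                                   # assumed problem size; real Rmax runs use far larger n
A = np.random.rand(n, n)
b = np.random.rand(n)

t0 = time.perf_counter()
x = np.linalg.solve(A, b)                  # LU factorization plus triangular solves
elapsed = time.perf_counter() - t0

flops = (2.0 / 3.0) * n**3 + 2.0 * n**2    # standard LINPACK operation count for Ax=b
print(f"residual: {np.linalg.norm(A @ x - b):.2e}")
print(f"rate:     {flops / elapsed / 1e9:.2f} Gflop/s")
```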

6 What is a Supercomputer? Why Do We Need Them?
- A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
- Over the last 10 years the range of the Top500 has grown faster than Moore's Law:
  - 1993: #1 = 59.7 GFlop/s, #500 = 422 MFlop/s
  - 2004: #1 = 70 TFlop/s, #500 = 850 GFlop/s
- Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security (in particular cryptanalysis and simulating nuclear weapons), to name a few.
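A quick, hedged check of that growth claim (the 18-month doubling period used for Moore's Law below is the usual rule of thumb, not a figure from the talk):

```python
# Doubling times implied by the Top500 figures quoted above (1993 vs. 2004).
import math

def doubling_time_months(start_flops, end_flops, years):
    """Doubling time implied by smooth exponential growth between two snapshots."""
    annual_growth = (end_flops / start_flops) ** (1.0 / years)
    return 12.0 * math.log(2.0) / math.log(annual_growth)

years = 2004 - 1993
print(f"#1   entry: doubles every {doubling_time_months(59.7e9, 70e12, years):.1f} months")
print(f"#500 entry: doubles every {doubling_time_months(422e6, 850e9, years):.1f} months")
print("Moore's Law rule of thumb: ~18 months")
```

Both entries double roughly every 12 to 13 months, comfortably ahead of an 18-month Moore's Law pace.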

7 24th List: The TOP10

Rank  Manufacturer  Computer                                 Rmax [TF/s]  Installation Site                       Country
1     IBM           BlueGene/L β-System                      70.72        DOE/IBM                                 USA
2     SGI           Columbia (Altix, Infiniband)             51.87        NASA Ames                               USA
3     NEC           Earth-Simulator                          35.86        Earth Simulator Center                  Japan
4     IBM           MareNostrum (BladeCenter JS20, Myrinet)  ...          Barcelona Supercomputer Center          Spain
5     CCD           Thunder (Itanium2, Quadrics)             ...          Lawrence Livermore National Laboratory  USA
6     HP            ASCI Q (AlphaServer SC, Quadrics)        ...          Los Alamos National Laboratory          USA
7     Self-made     System X (Apple XServe, Infiniband)      12.25        Virginia Tech                           USA
8     IBM/LLNL      BlueGene/L DD1 (500 MHz)                 ...          Lawrence Livermore National Laboratory  USA
9     IBM           pSeries                                  ...          Naval Oceanographic Office              USA
10    Dell          Tungsten (PowerEdge, Myrinet)            9.82         NCSA                                    USA

... systems > 1 TFlop/s; 294 machines are clusters; top10 average 8K processors.

8 Performance Development (chart; includes "My Laptop" for comparison)

9 Performance Projection

10 Performance Projection (chart; includes "My Laptop" for comparison)

11 IBM BlueGene/L: 131,072 Processors ("Fastest Computer")
Packaging hierarchy (BlueGene/L Compute ASIC):
- Chip: 2 processors; 2.8/5.6 GF/s; 4 MB (cache)
- Compute Card: 2 chips (2x1x1), 4 processors; 5.6/11.2 GF/s; 1 GB DDR
- Node Card: 32 chips (4x4x2), 16 compute cards, 64 processors; 90/180 GF/s; 16 GB DDR
- Rack: 32 node boards (8x8x16), 2048 processors; 2.9/5.7 TF/s; 0.5 TB DDR
- System: 64 racks (64x32x32), 131,072 processors; 180/360 TF/s; 32 TB DDR
As measured (BG/L, 700 MHz, 16 racks, 32K processors):
- Peak: 91.7 Tflop/s
- Linpack: 70.7 Tflop/s (77% of peak)
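The packaging counts and the measured peak compose by straightforward multiplication; a short sanity check (the 2.8 GF/s per-processor peak is read off the chip-level figure above and is consistent with the quoted 91.7 Tflop/s):

```python
# Sanity check of the BlueGene/L packaging arithmetic quoted above.
procs_per_chip = 2
chips_per_card = 2
cards_per_node_board = 16
node_boards_per_rack = 32

procs_per_rack = procs_per_chip * chips_per_card * cards_per_node_board * node_boards_per_rack
print(f"processors per rack : {procs_per_rack}")          # 2048
print(f"full system (64)    : {procs_per_rack * 64}")     # 131072

peak_per_proc_gf = 2.8             # assumed per-processor peak, from the 2.8/5.6 GF/s chip line
procs_measured = procs_per_rack * 16                       # the 16-rack, 32K-processor system
peak_tf = procs_measured * peak_per_proc_gf / 1000.0
print(f"installed peak      : {peak_tf:.1f} Tflop/s")      # ~91.8 Tflop/s (slide quotes 91.7)
print(f"Linpack / peak      : {70.7 / peak_tf:.0%}")       # ~77%
```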

12 Customer Segments / Systems

13 Manufacturers / Systems

14 Processor Types

15 Interconnects / Systems

16

17 Clusters (NOW) / Systems

18 Power: Watts/Gflop (smaller is better); Top 20 systems; based on processor power rating only

19 Top500 Computers in Australia

Rank  Site                                                          Manufacturer  Computer                                       Area
130   PGS                                                           IBM           xSeries Xeon 3.06 GHz, Gig-E                   Industry
...   Consumer Goods                                                IBM           eServer pSeries 690 (1.7 GHz Power4+, GigE)    Industry
...   Bureau of Meteorology / CSIRO HPCCC                           NEC           SX-6/144M18                                    Research
...   Australian Centre for Advanced Computing and Communications   Dell          PowerEdge 1750, Pentium4 Xeon 3.06 GHz, GigE   Academic
...   University of Queensland                                      SGI           SGI Altix 1.3 GHz                              Academic

20 Top500 Conclusions
- Microprocessor-based supercomputers have brought a major change in accessibility and affordability.
- Clusters continue to account for more than half of all installed high-performance computers worldwide.

21 Future: Petaflops (10^15 fl pt ops/s)
- A Pflop/s for 1 second ≈ a laptop computing for 1 year.
- From an algorithmic standpoint, the challenges include:
  - concurrency
  - data locality
  - latency & synchronization
  - floating point accuracy (illustrated in the sketch after this list)
  - dynamic redistribution of workload
  - new languages and constructs
  - role of numerical libraries
  - algorithm adaptation to hardware failure
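One of the listed challenges, floating point accuracy, is easy to demonstrate in miniature: at petascale, billions of additions accumulate rounding error unless the algorithm compensates for it. A minimal sketch in Python (the problem size and the choice of Kahan summation are illustrative assumptions, not from the talk):

```python
# Floating point accuracy at scale: naive accumulation vs. compensated (Kahan) summation.
import numpy as np

def kahan_sum32(values):
    """Compensated summation carried out entirely in float32."""
    total = np.float32(0.0)
    comp = np.float32(0.0)
    for v in values:
        y = np.float32(v) - comp
        t = total + y
        comp = (t - total) - y       # the rounding error of the last addition
        total = t
    return total

rng = np.random.default_rng(0)
x = rng.random(1_000_000).astype(np.float32)
reference = x.astype(np.float64).sum()      # higher-precision reference

naive = np.float32(0.0)
for v in x:
    naive += v                              # rounding error accumulates with every add

print(f"naive float32 error : {abs(float(naive) - reference):.3e}")
print(f"Kahan float32 error : {abs(float(kahan_sum32(x)) - reference):.3e}")
```

The compensated sum stays near machine precision even though it never leaves single precision, which is the kind of algorithmic care these bullets point at.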

22 Petaflop (10^15 flop/s) Computers Within the Next 5 Years
Five basic design points:
- Conventional technologies
  - e.g. 4.8 GHz processors, 8000 nodes, each with 16 processors
- Processing-in-memory (PIM) designs
  - Reduce the memory access bottleneck
- Superconducting processor technologies
  - Digital superconductor technology, Rapid Single-Flux-Quantum (RSFQ) logic, and hybrid technology multi-threaded (HTMT) designs
- Special-purpose hardware designs
  - For specific applications, e.g. the GRAPE project in Japan for gravitational force computations
- Schemes utilizing the aggregate computing power of processors distributed on the web

23 Petaflops (10^15 flop/s) Computer Today?
- 2 GHz processors (O(10^9) ops/s each)
- 0.5 million PCs
- $0.5B ($1K each)
- 100 MWatts
- 5 acres
- 0.5 million Windows licenses!!
- A PC failure every second
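The headline figures follow from simple multiplication; a hedged back-of-the-envelope check (the per-PC power draw is inferred from the 100 MW total, not stated on the slide):

```python
# Arithmetic behind the "petaflop from half a million PCs" slide above.
pcs = 500_000
ops_per_pc = 2.0e9          # 2 GHz processor, order 10^9 ops/s as assumed on the slide
cost_per_pc = 1_000         # dollars
total_power_w = 100e6       # 100 MWatts

print(f"aggregate rate : {pcs * ops_per_pc:.1e} ops/s")         # 1.0e15, i.e. a nominal Pflop/s
print(f"total cost     : ${pcs * cost_per_pc / 1e9:.1f}B")      # $0.5B
print(f"implied power  : {total_power_w / pcs:.0f} W per PC")   # 200 W each (inferred)
```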

24 Real Crisis With HPC Is With The Software
- Programming is stuck
  - Arguably it hasn't changed since the 60's
- It's time for a change
  - Complexity is rising dramatically
    - Highly parallel and distributed systems, growing from 10 to 100 to 1,000 to 10,000 to 100,000 processors!!
    - Multidisciplinary applications
- A supercomputer application and its software are usually much longer-lived than the hardware
  - Hardware life is typically five years at most
  - Fortran and C are the main programming models
- Software is a major cost component of modern technologies
  - The tradition in HPC system procurement is to assume that the software is free
- We don't have many great ideas about how to solve this problem

25 Some Current Unmet Needs
- Performance / portability
- Fault tolerance
- Better programming models
  - Global shared address space
  - Visible locality
- Maybe coming soon (incremental, yet offering real benefits):
  - Global Address Space (GAS) languages: UPC, Co-Array Fortran, Titanium
    - "Minor" extensions to existing languages
    - More convenient than MPI
    - Have performance transparency via explicit remote memory references (contrasted with message passing in the sketch below)
- The critical cycle of prototyping, assessment, and commercialization must be a long-term, sustaining investment, not a one-time crash program.
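To make "explicit remote memory references" concrete, here is a minimal sketch using Python and mpi4py rather than an actual GAS language (UPC, Co-Array Fortran, and Titanium are compiled languages; the one-sided MPI window below is only an analogy for a shared global array, and the buffer sizes are arbitrary assumptions):

```python
# Two-sided message passing vs. one-sided remote access.
# Run with two processes, e.g.: mpiexec -n 2 python demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Two-sided (classic MPI): sender and receiver must both make matching calls.
if rank == 0:
    comm.Send(np.full(4, 0.0), dest=1, tag=7)
elif rank == 1:
    inbox = np.empty(4, dtype='d')
    comm.Recv(inbox, source=0, tag=7)
    print("two-sided: rank 1 received", inbox)

# One-sided (GAS-like): rank 0 reads rank 1's exposed memory directly,
# much as a GAS language dereferences a remote element of a shared array.
local = np.full(4, float(rank))
win = MPI.Win.Create(local, comm=comm)
win.Fence()
if rank == 0:
    remote = np.empty(4, dtype='d')
    win.Get(remote, target_rank=1)   # explicit remote memory reference
win.Fence()
if rank == 0:
    print("one-sided: rank 0 fetched", remote, "from rank 1")
win.Free()
```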

26 Collaborators / Support
- Top500 Team
  - Erich Strohmaier, NERSC
  - Hans Meuer, Mannheim
  - Horst Simon, NERSC