Exascale? No problem! — Paul Henning, Los Alamos National Laboratory

Presentation transcript:

Slide 1: Exascale? No problem!
Paul Henning, Los Alamos National Laboratory. LAUR

Slide 2: Exascale? No problem! (But most apps are screwed)
Paul Henning, Los Alamos National Laboratory

Slide 3: According to Merriam-Webster…
"screw" 4.a (1): to mistreat or exploit through extortion, trickery, or unfair actions; especially: to deprive of or cheat out of something due or expected.
The majority of developers, those not yet working on petascale projects, are going to be blindsided by pervasive changes coming to computing environments. Can we (read: "you") help them?

Slide 4: Exascale applications have been hammered on in other settings…
- Combustion
- Nuclear energy
- Biology and biofuels
- Fusion
- Materials
- Climate modeling
- High-energy and nuclear physics
- National security

Slide 5: The "Path to Zero" provides a suite of exascale drivers
- Cradle-to-grave assessments to support complex-wide process and resource optimization
- System-scale simulation support for national policy decisions on stockpile changes or reductions
- System-scale physics understanding of nuclear weapons, including complex interacting microscale processes
Source: Scientific Grand Challenges for National Security: The Role of Computing at the Extreme Scales (workshop report)

Slide 6: Large-scale simulation capability is cheap compared to the alternatives
- "Icecap" underground test (UGT), suspended upon U.S. entry into the Underground Nuclear Testing Moratorium, 10/3/92
- 157-foot tower over a 94-inch × 1,675-foot hole
- 350k–500k lbs of gear, miles of cable
- This would have been the 929th test at NTS
Information from DOE/NV-1212, May 2007 (NTS "Icecap" Fact Sheet)

Slide 7: Exascale hardware will be achieved by depriving* software of expected execution environments
- Lots of threads and vector units
- Total memory doesn't scale with threads/flops: Jaguar ~0.3 PB to exascale ~64 PB is only ~213×, not the ~1000× increase in compute
- Less cache per thread and/or local store
- Coherency domains smaller than a node
- Will require many types of parallelism, simultaneously (see the sketch below)
- Radical changes in interconnect possible
* screwing
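A minimal sketch (not from the original slides) of what "many types of parallelism, simultaneously" can look like in today's terms: MPI ranks across nodes, OpenMP threads within a node, and SIMD lanes inside each thread. The kernel and sizes are illustrative only.

    #include <mpi.h>
    #include <vector>
    #include <cstddef>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);                 // inter-node parallelism: MPI ranks
      int rank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const std::size_t n = std::size_t(1) << 20;   // this rank's slice of a global array
      std::vector<double> x(n, 1.0), y(n, 2.0);
      const double a = 0.5;

      // Intra-node parallelism: OpenMP threads, each driving the SIMD/vector units.
      #pragma omp parallel for simd
      for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];

      double local = 0.0, global = 0.0;
      for (std::size_t i = 0; i < n; ++i) local += y[i];
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      // 'global' now holds the sum across all ranks.

      MPI_Finalize();
      return 0;
    }

Even this small example already mixes three kinds of parallelism; the slide's point is that exascale nodes will demand several more (accelerators, local stores, sub-node coherency domains) at the same time.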

Slide 8: "Blue collar" supercomputing may be more impacted than the high end
- The high end is already planning for it
- The exascale community will not, by definition, fail
- All hardware is going to be affected, thanks to power/mobility drivers
- Exascale needs processor specialization to meet power requirements
- Desktops are still selling well, but aren't coming near the growth of mobile computing
- Mobile processors look to specialization to reduce power and size
- When does the "traditional" CPU, and its programming model, die?
- Are scientific and commercial applications expecting traditional performance increases without modifying their codes?

Slide 9: Q: Numerical libraries?
- "Library" has two meanings:
  - A collection of related functionality (presumably) assembled by domain experts
  - A collection of reusable object code to link/load into applications
- Object libraries have two benefits:
  - Easy access to highly tuned, platform-specific variants
  - Convenience for application developers (faster compiles, pre-installed)
- Note: libraries are products, apps are not!
- Dynamically loaded libraries are mostly bunk for scientific apps:
  - No space savings for single executables, and unnecessary relocation costs
  - A band-aid for bad language support: dlopen, cross-language linking (see the sketch below)
- Object libraries are opaque to global parallelization, optimization, and resource allocation/scheduling.
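For context on the "band-aid" point, here is a hedged sketch (not from the slides) of the usual dlopen pattern; libsolver.so and solve_pressure are made-up names. Everything behind the dlsym lookup is pre-built object code, invisible to the compiler that builds the application, which is exactly why it cannot participate in global parallelization or resource scheduling.

    #include <dlfcn.h>
    #include <cstdio>

    using solve_fn = int (*)(double*, int);

    int main() {
      // Load pre-built object code at run time (hypothetical library name).
      void* handle = dlopen("libsolver.so", RTLD_NOW);
      if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

      // Cross-language linking by string lookup: no types, no inlining, and no
      // visibility into what resources the routine will consume.
      auto solve = reinterpret_cast<solve_fn>(dlsym(handle, "solve_pressure"));
      if (!solve) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

      double field[1024] = {0.0};
      int status = solve(field, 1024);   // opaque call into the loaded object code

      dlclose(handle);
      return status;
    }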

Slide 10: A: Not -lmega_parallel_foo, thanks. (-lm is fine)
- At O(1B) threads, concurrency is far more important than optimizing serial instruction streams.
- We can't assume that a single kernel/algorithm can fully utilize all of the hardware on a node:
  - On-node resource scheduling will migrate into the application
  - We need to consider MPMD
- Libraries don't need to be object files:
  - Example: C++ template libraries (see the sketch below)
  - The "compiler" gets to see everything for resource management
  - Can still do file-at-a-time conversion to IRs; just don't go to object code
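A minimal sketch of the "template library" idea: the implementation ships as source in a header, so the compiler sees the whole kernel at every call site and can inline, vectorize, and schedule it together with the application. tmpl_lib and axpy are illustrative names, not a particular library's API.

    // tmpl_lib.hpp — header-only "library": all implementation is visible source.
    #pragma once
    #include <cstddef>

    namespace tmpl_lib {   // hypothetical library namespace

    template <typename T>
    void axpy(T a, const T* x, T* y, std::size_t n) {
      // Visible to the optimizer at the call site: it can inline this loop,
      // fuse it with neighboring work, and pick vectorization for the target.
      for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }

    }  // namespace tmpl_lib

An application that includes this header compiles the kernel against its own data layout and parallel decomposition, which is exactly what a pre-built object library prevents.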

Slide 11: Just some closing notes…
- The TOP500 stagnates without investment and effort
- The first sustained exaflop/sec calculation will run in 2018 (a self-fulfilling prophecy)
  - Where? Somewhere east of here (on a sphere)
  - What? HPL, of course! And then a particle/MC method
- Autotuning could look at higher-level resource allocation issues (see the sketch below):
  - How much of each processor type should be allocated to kernels that can run simultaneously?
  - What are "optimal" memory layouts for multiple processor and communication types on a node? Vector vs. scalar; DMAs and local store vs. PGAS vs. MPI
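One hedged sketch, under the assumption that two kernel variants can share a node, of what higher-level autotuning might look like: sweep the fraction of work handed to each processor type, run the two kernels concurrently, and keep the fastest split. run_vector_part and run_scalar_part are hypothetical stand-ins, not a real autotuner's API.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Hypothetical stand-ins for two kernel variants that can share a node:
    // a wide, vector-friendly sweep and a scalar, latency-sensitive remainder.
    static void run_vector_part(std::size_t n) {
      std::vector<double> v(n, 1.0);
      for (std::size_t i = 0; i < n; ++i) v[i] *= 1.000001;
    }

    static void run_scalar_part(std::size_t n) {
      volatile double acc = 0.0;
      for (std::size_t i = 0; i < n; ++i) acc = acc + 1.0 / double(i + 1);
    }

    int main() {
      const std::size_t total = std::size_t(1) << 24;
      double best_ms = 1e300, best_frac = 0.0;

      // Exhaustive search over the work split; a real autotuner would also vary
      // memory layouts, thread counts, etc., and cache results between runs.
      for (int step = 0; step <= 10; ++step) {
        const double frac = step / 10.0;
        const std::size_t nv = static_cast<std::size_t>(frac * total);

        const auto t0 = std::chrono::steady_clock::now();
        std::thread vector_kernel(run_vector_part, nv);  // both kernels run simultaneously
        run_scalar_part(total - nv);
        vector_kernel.join();
        const double ms = std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - t0).count();

        if (ms < best_ms) { best_ms = ms; best_frac = frac; }
      }

      std::printf("best split: %.0f%% of work to the vector kernel (%.2f ms)\n",
                  100.0 * best_frac, best_ms);
      return 0;
    }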

Slide 12: DOE Scientific Grand Challenges Workshop Series