Ultra-Efficient Exascale Scientific Computing
Lenny Oliker, John Shalf, Michael Wehner, and other LBNL staff

Exascale is Critical to the DOE SC Mission
“…exascale computing (will) revolutionize our approaches to global challenges in energy, environmental sustainability, and security.” – E3 report

Green Flash: Ultra-Efficient Climate Modeling
We present an alternative route to exascale computing.
– DOE SC exascale science questions are already identified.
– Our idea is to target specific machine designs to each of these questions.
– This is possible because of new technologies driven by the consumer market.
We want to turn the process around:
– Ask “What machine do we need to answer a question?”
– Not “What can we answer with that machine?”
Caveat:
– We present here a feasibility design study.
– The goal is to influence the HPC industry by evaluating a prototype design.

Global Cloud System Resolving Climate Modeling
– Direct simulation of cloud systems, replacing statistical parameterization.
– This approach was recently called for by the 1st WMO Modeling Summit.
– Championed by Prof. Dave Randall, Colorado State University.
– Direct simulation of cloud systems in global models requires exascale!
– Individual cloud physics is fairly well understood; parameterization of mesoscale cloud statistics performs poorly.

Global Cloud System Resolving Models are a Transformational Change
[Figure: surface altitude (feet) at three model resolutions]
– 1km: cloud system resolving models
– 25km: upper limit of climate models with cloud parameterizations
– 200km: typical resolution of IPCC AR4 models

1km-Scale Global Climate Model Requirements
– Simulate climate 1000x faster than real time
– 10 Petaflops sustained per simulation (~200 Pflops peak); multiple simulations (~20 Exaflops peak)
– Truly exascale!
Some specs:
– Advanced dynamics algorithms: icosahedral, cubed sphere, reduced mesh, etc.
– ~20 billion cells, so massive parallelism
– 100 Terabytes of memory
– Can be decomposed into ~20 million total subdomains
[Figure: fvCAM and icosahedral grids]
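A back-of-the-envelope check of these requirements, as a minimal sketch: the flop rates, cell count, and subdomain count are taken from the slide, and the implied number of concurrent simulations and cells per subdomain are derived here.

```python
# Rough arithmetic check of the 1km-scale model requirements (slide values).
PFLOP = 1e15
EFLOP = 1e18

sustained_per_sim = 10 * PFLOP      # 10 PF sustained per simulation
peak_per_sim      = 200 * PFLOP     # ~200 PF peak per simulation
ensemble_peak     = 20 * EFLOP      # ~20 EF peak across all simulations

cells      = 20e9                   # ~20 billion cells at 1 km resolution
subdomains = 20e6                   # ~20 million subdomains

print("sustained/peak fraction:", sustained_per_sim / peak_per_sim)     # 0.05
print("implied concurrent simulations:", ensemble_peak / peak_per_sim)  # ~100
print("cells per subdomain:", cells / subdomains)                       # ~1000
```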

Proposed Ultra-Efficient Computing
Cooperative “science-driven system architecture” approach.
Radically change HPC system development via application-driven hardware/software co-design:
– Achieve 100x power efficiency over the mainstream HPC approach for targeted high-impact applications, at significantly lower cost
– Accelerate the development cycle for exascale HPC systems
– Approach is applicable to numerous scientific areas in the DOE Office of Science
Research activity to understand the feasibility of our approach.

Primary Design Constraint: POWER
– Transistors are still getting smaller (Moore’s Law: alive and well)
– Power efficiency and clock rates are no longer improving at historical rates
– Demand for supercomputing capability is accelerating; the E3 report considered an Exaflop system for 2016
– Power estimates for exascale systems based on extrapolation of current design trends range up to 179 MW
Sources: DOE E3 Report 2008; DARPA Exascale Report (in production); LBNL IJHPCA Climate Simulator Study 2008 (Wehner, Oliker, Shalf)
We need a fundamentally new approach to computing designs.

Our Approach
Identify high-impact exascale scientific applications important to the DOE Office of Science (E3 report).
Tailor the system to the requirements of the target scientific problem:
– Use design principles from embedded computing
– Leverage commodity components in novel ways, not full custom design
Tightly couple hardware/software/science development:
– Simulate hardware before you build it (RAMP)
– Use applications for validation, not kernels
– Automate the software tuning process (auto-tuning)

Path to Power Efficiency: Reducing Waste in Computing
Examine the methodology of the embedded computing market:
– Optimized for low power, low cost, and high computational efficiency
“Years of research in low-power embedded computing have shown only one design technique to reduce power: reduce waste.” – Mark Horowitz, Stanford University & Rambus Inc.
Sources of waste:
– Wasted transistors (surface area)
– Wasted computation (useless work/speculation/stalls)
– Wasted bandwidth (data movement)
– Designing for serial performance
Technology now favors parallel throughput over peak sequential performance.

Processor Technology Trend
1990s: R&D computing hardware was dominated by desktop/COTS
– Had to learn how to use COTS technology for HPC
R&D investments are now moving rapidly to consumer electronics/embedded processing
– Must learn how to leverage embedded processor technology for future HPC systems

Design for Low Power: More Concurrency
[Figure: per-core power] Power5: 120W; Intel Core2: 15W; PPC450: 3W; Tensilica DP: 0.09W
– Cubic power improvement with lower clock rate due to V²F
– Slower clock rates enable the use of simpler cores
– Simpler cores use less area (lower leakage) and reduce cost
– Tailor the design to the application to reduce waste
This is how iPhones and MP3 players are designed to maximize battery life and minimize cost.
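A small sketch of the V²F argument, with illustrative assumed numbers: dynamic power scales as P ∝ C·V²·f, and if supply voltage can be lowered roughly in proportion to clock frequency, power falls with the cube of the clock.

```python
# Illustrative only: dynamic power P ~ C * V^2 * f.  If voltage tracks frequency,
# halving the clock cuts dynamic power to roughly 1/8 (the "cubic" improvement).
def relative_dynamic_power(freq_scale, voltage_tracks_frequency=True):
    """Power relative to the full-speed design for a given frequency scaling."""
    v_scale = freq_scale if voltage_tracks_frequency else 1.0
    return v_scale ** 2 * freq_scale

for f in (1.0, 0.5, 0.25):
    print(f"clock x{f}: power x{relative_dynamic_power(f):.3f}")
# clock x1.0 -> power x1.000
# clock x0.5 -> power x0.125
# clock x0.25 -> power x0.016
```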

Low Power Design Principles
– IBM Power5 (server): baseline
– Intel Core2 sc (laptop): 4x more FLOPs/watt than baseline
– IBM PPC 450 (BG/P, low power): 90x more
– Tensilica XTensa (Moto Razor): 400x more
Even if each core operates at 1/3 to 1/10th the efficiency of the largest chip, you can pack 100s more cores onto a chip and consume 1/20th the power.
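A toy calculation behind that last claim, as a sketch: the per-core power figures come from the previous slide, while the relative per-core throughput of the simple core (1/10th of a Core2) is a deliberately pessimistic assumption.

```python
# Toy comparison: many slow, simple cores vs one fast core.
core2_power_w, core2_throughput = 15.0, 1.0          # per-core power from the deck
tensilica_power_w, tensilica_throughput = 0.09, 0.1  # assumed 1/10th per-core speed

n_simple_cores = 100
chip_power = n_simple_cores * tensilica_power_w          # 9 W, less than one Core2
chip_throughput = n_simple_cores * tensilica_throughput  # 10x one Core2

print(f"{n_simple_cores} simple cores: {chip_throughput:.0f}x throughput "
      f"at {chip_power:.0f} W, vs 1x at {core2_power_w:.0f} W for one fast core")
```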

Processor Generator (Tensilica)
Embedded design automation (example from the existing Tensilica design flow).
Processor configuration:
1. Select from menu
2. Automatic instruction discovery (XPRES Compiler)
3. Explicit instruction description (TIE)
Generated outputs:
– Application-optimized processor implementation (RTL/Verilog); build with any process in any fab
– Tailored SW tools: compiler, debugger, simulators, Linux and other OS ports (automatically generated together with the core)
[Diagram: base CPU plus configurable blocks - application datapaths, OCD, timer, FPU, extended registers, cache]

Advanced Hardware Simulation (RAMP)
Research Accelerator for Multi-Processors (RAMP):
– Utilize FPGA boards to emulate large-scale multicore systems
– Simulate hardware before it is built
– Break the slow feedback loop for system designs
– Allows fast performance validation
– Enables tightly coupled hardware/software/science co-design (not possible using the conventional approach)
Technology partners:
– UC Berkeley: John Wawrzynek, Jim Demmel, Krste Asanovic, Kurt Keutzer
– Stanford University / Rambus Inc.: Mark Horowitz
– Tensilica Inc.: Chris Rowen

Customization Continuum: Green Flash
[Spectrum: General Purpose (Cray XT3, BlueGene) - Application Driven (Green Flash) - Special Purpose (D.E. Shaw Anton) - Single Purpose (MD-Grape)]
Application-driven does NOT necessitate a special purpose machine.
MD-Grape: full custom ASIC design
– 1 Petaflop performance for one application using 260 kW, for $9M
D.E. Shaw Anton system: full and semi-custom design
– Simulates 100x–1000x longer timescales than any existing HPC system (~200 kW)
Application-Driven Architecture (Green Flash): semicustom design
– Highly programmable core architecture using C/C++/Fortran
– Goal of 100x power efficiency improvement vs the general HPC approach
– Better understand how to build/buy application-driven systems
– Potential: 1km-scale model (~200 Petaflops peak) running in O(5 years)

Green Flash Strawman System Design
We examined three different approaches (in 2008 technology).
Computation: 0.015° x 0.02° x 100L; 10 PFlops sustained, ~200 PFlops peak.
– AMD Opteron: commodity approach; lower efficiency for scientific applications offset by cost efficiencies of the mass market
– BlueGene: generic embedded processor core with a customized system-on-chip (SoC) to improve power efficiency for scientific applications
– Tensilica XTensa: customized embedded CPU with SoC provides further power efficiency benefits while maintaining programmability
Processor | Clock | Peak/Core (Gflops) | Cores/Socket | Sockets | Cores | Power | Cost 2008
AMD Opteron | 2.8 GHz | … | … | …K | 1.7M | 179 MW | $1B+
IBM BG/P | 850 MHz | … | … | …K | 3.0M | 20 MW | $1B+
Green Flash / Tensilica XTensa | 650 MHz | … | … | …K | 4.0M | 3 MW | $75M
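A quick consistency check of the comparison, as a sketch using only numbers that appear in the deck: the 2.7 GFLOPs/core figure for the Tensilica design comes from the following design-study slide, and the assumption that sustained performance sits close to peak for the application-tailored design is stated explicitly.

```python
# Rough consistency check of the strawman comparison (figures from the deck).
designs = {
    # name: (cores, system power in MW)
    "AMD Opteron":             (1.7e6, 179.0),
    "IBM BG/P":                (3.0e6, 20.0),
    "Green Flash / Tensilica": (4.0e6, 3.0),
}
for name, (cores, power_mw) in designs.items():
    watts_per_core = power_mw * 1e6 / cores
    print(f"{name:24s}: {watts_per_core:7.2f} W per core (incl. memory/network share)")

# Green Flash aggregate peak: 4.0M cores x 2.7 GFLOPs/core (from the design slide).
peak_pflops = 4.0e6 * 2.7e9 / 1e15
print(f"Green Flash aggregate peak ~{peak_pflops:.0f} PFLOPs; meets the 10 PF "
      "sustained target only if sustained efficiency is high (assumption)")
```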

Climate System Design Concept: Strawman Design Study
System: 10 PF sustained, ~120 m², <3 MWatts, <$75M
Rack: 32 boards per rack, ~25 KW per rack (power + comms)
Board: 32 chip + memory clusters per board
Processor chip: 32 processors per 65nm chip, 83 GFLOPs @ 7W; optional 8MB embedded DRAM; external DRAM interface; master processor; comm link control; 8 DRAM per processor chip: ~50 GB/s
Core (VLIW CPU):
– 128b load-store + 2 DP MUL/ADD + integer op/DMA per cycle; synthesizable at 650MHz in commodity 65nm
– 1mm² core (…mm² with instruction cache, data cache, data RAM, DMA interface); 0.25mW/MHz
– Double precision SIMD FP: 4 ops/cycle (2.7 GFLOPs)
– 32K D cache (2x128b), 32K I cache, 8-channel DMA for streaming from on/off-chip DRAM
– Nearest-neighbor 2D communications grid
– Vectorizing compiler, cycle-accurate simulator, debugger GUI (existing part of the Tensilica tool set)
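A rough roll-up of the packaging hierarchy described above, as a sketch: the per-core and per-chip power figures are taken from the slide, while the chip and rack counts are derived here from those figures rather than quoted from the deck.

```python
# Packaging/power roll-up for the strawman design (illustrative arithmetic only).
MHZ = 650
core_mw_per_mhz = 0.25                                  # core logic only, per slide
cores_per_chip, chips_per_board, boards_per_rack = 32, 32, 32
total_cores = 4.0e6                                     # from the comparison table
chip_w = 7.0                                            # per-chip power, per slide

core_w = core_mw_per_mhz * MHZ / 1000.0                 # ~0.16 W of core logic
chips = total_cores / cores_per_chip                    # ~125,000 chips (derived)
racks = chips / (chips_per_board * boards_per_rack)     # ~122 racks (derived)
chip_power_mw = chips * chip_w / 1e6                    # ~0.9 MW before DRAM/comms

print(f"core logic: {core_w:.2f} W, chip: {chip_w} W")
print(f"chips: {chips:,.0f}, racks: {racks:.0f}")
print(f"processor power: {chip_power_mw:.2f} MW of the <3 MW system budget")
```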

Portable Performance for Green Flash
Challenge: our approach would produce multiple architectures, each different in the details
– Labor-intensive user optimizations for each specific architecture
– Different architectural solutions require vastly different optimizations
– Non-obvious interactions between optimizations and hardware yield the best results
Our solution: auto-tuning (see the sketch below)
– Automate the search across a complex optimization space
– Achieve performance far beyond current compilers
– Attain performance portability across diverse architectures
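To make the idea concrete, here is a minimal auto-tuning sketch (illustrative only, not the Green Flash tuner): it searches one dimension of an optimization space, the tile size of a blocked stencil, by timing each variant and keeping the fastest. In pure Python the timing differences are small; the point is the automated search-and-select loop.

```python
# Minimal auto-tuning sketch: benchmark several code variants, keep the best.
import time

N = 512
a = [[float(i + j) for j in range(N)] for i in range(N)]

def stencil_blocked(src, block):
    """5-point stencil over the interior, traversed in block x block tiles."""
    out = [row[:] for row in src]
    for ii in range(1, N - 1, block):
        for jj in range(1, N - 1, block):
            for i in range(ii, min(ii + block, N - 1)):
                for j in range(jj, min(jj + block, N - 1)):
                    out[i][j] = 0.2 * (src[i][j] + src[i - 1][j] + src[i + 1][j]
                                       + src[i][j - 1] + src[i][j + 1])
    return out

best = None
for block in (16, 32, 64, 128, 256):      # the "optimization space" being searched
    t0 = time.perf_counter()
    stencil_blocked(a, block)
    elapsed = time.perf_counter() - t0
    print(f"block {block:3d}: {elapsed:.3f} s")
    if best is None or elapsed < best[1]:
        best = (block, elapsed)
print("selected block size:", best[0])
```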

Auto-Tuning for Multicore (finite-difference computation)
[Chart: per-platform auto-tuning gains in performance scaling and power efficiency, ranging from 1.4x to 23.3x]
– Take advantage of unique multicore features via auto-tuning
– Attains performance portability across different designs
– Only requires basic compiling technology
– Achieve high serial performance, scalability, and optimized power efficiency

Traditional New Architecture Hardware/Software Design
How long does it take for a full-scale application to influence architectures?
Cycle time: 4-6+ years
– Design new system (2-year concept phase)
– Port application
– Build hardware (2 years)
– Tune software (2 years)

Proposed New Architecture Hardware/Software Co-Design
How long does it take for a full-scale application to influence architectures?
Cycle time: 1-2 days
– Synthesize SoC (hours)
– Build application
– Emulate hardware (RAMP) (hours)
– Autotune software (hours)

Summary
Exascale computing is vital to the DOE SC mission.
We propose a new approach to high-end computing that enables transformational changes for science.
Research effort: study feasibility and share insight with the community.
This effort will augment high-end general purpose HPC systems:
– Choose the science target first (climate in this case)
– Design systems for applications (rather than the reverse)
– Leverage power-efficient embedded technology
– Design hardware, software, and scientific algorithms together using hardware emulation and auto-tuning
– Achieve exascale computing sooner and more efficiently
Applicable to a broad range of exascale-class DOE applications.